Ragnar Sundblad
2017-11-03 13:46:38 UTC
Hi all,
We have compute clusters where the nodes have almost all of their root file system in afs; most things in /, such as /etc and /usr, are soft links into a complete OS installation in afs. To have some writable files and directories, such as /etc/adjtime or /var/tmp, we bind mount files and directories onto the parts of the tree that are actually in afs (mainly using the rwtab functionality). We also have a lustre client that gets mounted in the afs tree.
After we upgraded from CentOS 7.3 to 7.4 (kernel 3.10.0-693.5.2.el7.x86_64), with OpenAFS client 1.6.21.1 or 1.6.20.1, mounts in the afs tree start to get randomly unmounted once users with home directories in afs log in and start accessing their data. In the lustre case, the lustre client reports that it unmounts, so the unmounts seem to be handled in an orderly manner.
We suspect this may be related to the problem reported in the thread “getcwd() error for RHEL 7.4 kernel”, and that the kernel for some reason decides that the path to the mount point is no longer valid and unmounts it.
In addition, once this has started to happen, we can no longer mount anything into afs; mount returns ENOENT.
This is pretty easy to reproduce.
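A minimal sketch for observing the unmounts while reproducing the problem: it polls `/proc/self/mountinfo` (the Linux per-process mount table, where the fifth field of each line is the mount point) and prints mount points under a given prefix that disappear or appear between polls. The `/afs` default prefix, the one-second interval and the function names are illustrative assumptions, not part of the setup described above.

```python
import sys
import time


def parse_mountinfo(text):
    """Return the set of mount points in /proc/<pid>/mountinfo text.

    The mount point is the fifth whitespace-separated field. The kernel
    escapes spaces etc. as octal sequences (e.g. \\040); they are kept in
    that escaped form, which is consistent between snapshots. Several
    mounts stacked on the same point collapse into one entry.
    """
    points = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            points.add(fields[4])
    return points


def diff_mounts(before, after):
    """Return (removed, added) mount points between two snapshots."""
    return before - after, after - before


def under_prefix(mount_point, prefix):
    """True if mount_point is prefix itself or lies below it."""
    base = prefix.rstrip("/")
    return mount_point == base or mount_point.startswith(base + "/")


def read_mounts(path="/proc/self/mountinfo"):
    with open(path) as f:
        return parse_mountinfo(f.read())


def watch(prefix="/afs", interval=1.0):
    """Poll forever, printing mount changes below prefix (hypothetical helper)."""
    previous = read_mounts()
    while True:
        time.sleep(interval)
        current = read_mounts()
        removed, added = diff_mounts(previous, current)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for mp in sorted(removed):
            if under_prefix(mp, prefix):
                print("%s unmounted: %s" % (stamp, mp))
        for mp in sorted(added):
            if under_prefix(mp, prefix):
                print("%s mounted: %s" % (stamp, mp))
        sys.stdout.flush()
        previous = current
```

Calling `watch("/afs")` in a session left running on an affected node would timestamp each unmount, which could then be lined up with user logins or with the kernel log.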
Our workaround for now is to use the tmpfs-based root all the way down to the mount points, with soft links into afs further down for the rest, which seems to work.
Please let us know if we can help with debugging this.
/ragge
PDC Center for High Performance Computing, KTH Royal Institute of Technology, Stockholm, Sweden