Discussion:
[OpenAFS-devel] Problem with mounts in AFS on CentOS 7.4 with openafs 1.6.2[01].1
Ragnar Sundblad
2017-11-03 13:46:38 UTC
Hi all,

We have compute clusters where the nodes keep almost all of their root file system in afs; most entries in /, such as /etc and /usr, are soft links into a complete OS installation in afs. To have some writable files and directories, such as /etc/adjtime or /var/tmp, we bind mount files and directories into the tree that actually lives in afs (mainly using the rwtab functionality); a lustre client also gets mounted in the afs tree.

After we upgraded from CentOS 7.3 to 7.4 (kernel 3.10.0-693.5.2.el7.x86_64), using OpenAFS client 1.6.21.1 or 1.6.20.1, mounts in the afs tree start to get randomly unmounted once users with home directories in afs log in and start accessing their data. In the lustre case, the lustre client reports that it unmounts, so the unmounts seem to be handled in an orderly manner.

We suspect this may be related to the problem reported in the thread “getcwd() error for RHEL 7.4 kernel”, and that the kernel for some reason decides that the path to the mount point is no good and unmounts it.
In addition, once this has started to happen, we are no longer able to mount anything into afs; mount returns ENOENT.

This is pretty easy to repeat.
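
One way to watch for the disappearing mounts described above is to compare the kernel's mount table with a list of mount points that are expected to be present. The following is only a minimal sketch: the function names are made up for illustration, it assumes a Linux /proc/self/mounts in the usual whitespace-separated format, and the mount points must be given as they appear in that file (that is, the resolved paths inside the afs tree, not the soft links).

```python
def mounted_points(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # The second field is the mount point; spaces are escaped as \040.
            points.add(fields[1].replace("\\040", " "))
    return points


def missing_mounts(expected, mounts_text):
    """Return the expected mount points that are not mounted, sorted."""
    return sorted(set(expected) - mounted_points(mounts_text))


def current_missing(expected, mounts_file="/proc/self/mounts"):
    """Check the live mount table (the default path assumes Linux)."""
    with open(mounts_file) as handle:
        return missing_mounts(expected, handle.read())
```

Calling current_missing() periodically with the site's own list of mount points, for example from a cron job, would show roughly when a mount goes away.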

Our workaround for now is to use the tmpfs-based root all the way down to the mount points, and have soft links into afs further down for the rest, which seems to work.

Please let us know if we can provide any help debugging this.


/ragge

PDC Center for High Performance Computing, KTH Royal Institute of Technology, Stockholm, Sweden
Mark Vitale
2017-11-03 14:51:17 UTC
Ragge,
Post by Ragnar Sundblad
We have compute clusters where the nodes keep almost all of their root file system in afs; most entries in /, such as /etc and /usr, are soft links into a complete OS installation in afs. To have some writable files and directories, such as /etc/adjtime or /var/tmp, we bind mount files and directories into the tree that actually lives in afs (mainly using the rwtab functionality); a lustre client also gets mounted in the afs tree.
After we upgraded from CentOS 7.3 to 7.4 (kernel 3.10.0-693.5.2.el7.x86_64), using OpenAFS client 1.6.21.1 or 1.6.20.1, mounts in the afs tree start to get randomly unmounted once users with home directories in afs log in and start accessing their data. In the lustre case, the lustre client reports that it unmounts, so the unmounts seem to be handled in an orderly manner.
We suspect this may be related to the problem reported in the thread “getcwd() error for RHEL 7.4 kernel”, and that the kernel for some reason decides that the path to the mount point is no good and unmounts it.
In addition, once this has started to happen, we are no longer able to mount anything into afs; mount returns ENOENT.
This is pretty easy to repeat.
Thank you for your detailed report.
I have an idea about what this may be, but I will try to duplicate it on my test system first.
Post by Ragnar Sundblad
Our workaround for now is to use the tmpfs-based root all the way down to the mount points, and have soft links into afs further down for the rest, which seems to work.
It’s good that you have a workaround; thank you for sharing that as well.
Post by Ragnar Sundblad
Please let us know if we can provide any help debugging this.
For now I would like to see your afsd options, and also the output from ‘cmdebug <client> -cache’ for an affected client.

Although you haven’t reported the getcwd() problem, could you please confirm if you’ve seen it or not?

And finally, just to confirm, you have seen bind mounts in /afs unmounted on CentOS 7.4 with both OpenAFS 1.6.21.1 and 1.6.20.1, but _not_ on CentOS 7.3 with those same OpenAFS client releases - correct?

Thanks,

Mark Vitale
OpenAFS release team

Ragnar Sundblad
2017-11-03 16:29:58 UTC
Hi Mark,
Post by Mark Vitale
Ragge,
Post by Ragnar Sundblad
We have compute clusters where the nodes keep almost all of their root file system in afs; most entries in /, such as /etc and /usr, are soft links into a complete OS installation in afs. To have some writable files and directories, such as /etc/adjtime or /var/tmp, we bind mount files and directories into the tree that actually lives in afs (mainly using the rwtab functionality); a lustre client also gets mounted in the afs tree.
After we upgraded from CentOS 7.3 to 7.4 (kernel 3.10.0-693.5.2.el7.x86_64), using OpenAFS client 1.6.21.1 or 1.6.20.1, mounts in the afs tree start to get randomly unmounted once users with home directories in afs log in and start accessing their data. In the lustre case, the lustre client reports that it unmounts, so the unmounts seem to be handled in an orderly manner.
We suspect this may be related to the problem reported in the thread “getcwd() error for RHEL 7.4 kernel”, and that the kernel for some reason decides that the path to the mount point is no good and unmounts it.
In addition, once this has started to happen, we are no longer able to mount anything into afs; mount returns ENOENT.
This is pretty easy to repeat.
Thank you for your detailed report.
I have an idea about what this may be, but I will try to duplicate it on my test system first.
Thanks for investigating! :-)
Post by Mark Vitale
Post by Ragnar Sundblad
Our workaround for now is to use the tmpfs-based root all the way down to the mount points, and have soft links into afs further down for the rest, which seems to work.
It’s good that you have a workaround; thank you for sharing that as well.
Post by Ragnar Sundblad
Please let us know if we can provide any help debugging this.
For now I would like to see your afsd options, and also the output from ‘cmdebug <client> -cache’ for an affected client.
We start it like so:
/bin/chroot /sysimage /usr/vice/etc/afsd -memcache -verbose -nosettime -dynroot -mountdir /afs
(Before systemd is started, we set up the runtime root in /sysimage, then chroot there, and start systemd to let it bring up the system.)

Here is a cmdebug:
# cmdebug tegner-login-2 -cache
Chunk files: 1562
Stat caches: 2343
Data caches: 1562
Volume caches: 200
Chunk size: 65536
Cache size: 100000 kB
Set time: no
Cache type: memory

I now see that I forgot to mention that we use memory cache (since the nodes are diskless).
Post by Mark Vitale
Although you haven’t reported the getcwd() problem, could you please confirm if you’ve seen it or not?
We have not seen it, but we haven’t really looked for it either. Is there some test we could try?
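
A simple check for the getcwd() symptom could look like the sketch below. It assumes, going only by the title of the other thread, that the symptom is getcwd() failing with ENOENT while the process's current directory is inside afs; the function name and any directory passed to it are placeholders, not something taken from that thread.

```python
import errno
import os


def getcwd_works(directory):
    """Return True if getcwd() succeeds inside `directory`, False on ENOENT."""
    # Keep a descriptor for the starting directory so we can return to it
    # even if its path can no longer be resolved by name.
    saved = os.open(".", os.O_RDONLY)
    try:
        os.chdir(directory)
        try:
            os.getcwd()
            return True
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                return False
            raise  # any other error is unexpected, so let it surface
    finally:
        os.fchdir(saved)
        os.close(saved)
```

Running getcwd_works() against a few directories in afs, ideally after users have logged in and the unmounts have started, should show whether the same client is affected.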
Post by Mark Vitale
And finally, just to confirm, you have seen bind mounts in /afs unmounted on CentOS 7.4 with both OpenAFS 1.6.21.1 and 1.6.20.1, but _not_ on CentOS 7.3 with those same OpenAFS client releases - correct?
With 7.3 (kernel 3.10.0-514.26.2.el7.x86_64) we actually used openafs client 1.6.20.2, but with that combination this mount-within-afs thing worked just fine.

Thanks!

/ragge
Ragnar Sundblad
2017-12-20 15:53:14 UTC
Hi Mark,


Just to report back:

We have tried your (no longer recommended) patch
https://gerrit.openafs.org/#/c/12796/
as you pointed out in the thread “getcwd() error for RHEL 7.4 kernel” in the openafs-info list.

As far as we have seen, this indeed solved our disappearing mount point problems.

We will of course switch to the new version of the patch (or maybe just 1.8.0) as soon as there is one.

Thanks for your work!


Best regards,

/ragge
Mark Vitale
2018-03-07 15:50:11 UTC
Ragge,
Post by Ragnar Sundblad
We have tried your (no longer recommended) patch
https://gerrit.openafs.org/#/c/12796/
as you pointed out in the thread “getcwd() error for RHEL 7.4 kernel” in the openafs-info list.
As far as we have seen, this indeed solved our disappearing mount point problems.
We will of course switch to the new version of the patch (or maybe just 1.8.0) as soon as there is one.
The following fixes have been merged for the getcwd() problem; they should also eliminate your disappearing mounts:

master: https://gerrit.openafs.org/#/c/12830/
1.8.x:  https://gerrit.openafs.org/#/c/12851/ (included in 1.8.0-pre4, 03 Jan 2018)
1.6.x:  https://gerrit.openafs.org/#/c/12860/ (included in 1.6.22.2, 01 Feb 2018)

Please let us know if you can confirm this at your site.
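
For sites that want to check many clients, the release numbers above can be turned into a small version test. This is a rough sketch rather than an official check: the function name is invented, it only understands plain strings such as "1.6.22.2" or "1.8.0-pre4", it assumes that later releases on the 1.6.x and 1.8.x branches keep the fix, and it treats every other branch as unfixed.

```python
import re


def has_getcwd_fix(version):
    """Return True if `version` is at or after 1.6.22.2 or 1.8.0-pre4."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:-?pre(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    numbers = tuple(int(part) for part in match.group(1).split("."))
    pre = int(match.group(2)) if match.group(2) else None

    if numbers[:2] == (1, 6):
        # 1.6.x branch: fixed from 1.6.22.2 onwards.
        return numbers >= (1, 6, 22, 2)
    if numbers == (1, 8, 0) and pre is not None:
        # 1.8.0 prereleases: fixed from pre4 onwards.
        return pre >= 4
    # Assumption: 1.8.0 final and anything later carries the fix.
    return numbers >= (1, 8, 0)
```

With this, has_getcwd_fix("1.6.21.1") is False and has_getcwd_fix("1.6.22.2") is True, matching the releases listed above.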

Regards,
--
Mark Vitale
Sine Nomine Associates
