Discussion:
[OpenAFS-devel] Re: [OpenAFS] getcwd() error for RHEL 7.4 kernel
Stephan Wiesand
2017-10-20 12:27:20 UTC
[taking this thread to -devel]
I ran configure against the EL7.3 and EL7.4 GA kernels (3.10.0-514.el7 and 3.10.0-696.el7) and compared the results.
                          7.3   7.4
locks_lock_file_wait      no    yes
inode_lock                no    yes
exported tasklist_lock    yes   no
It turns out the EL7.4 kernel turns tasklist_lock from an rwlock_t into a qrwlock_t and all read_{,un}lock() calls into qread_{,un}lock() ones. And no, it's not what mainline kernels do, including 4.14-rc5.
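
A comparison like the one above can be done mechanically. The following is a minimal sketch, assuming the configure output uses the usual autoconf "checking for X... result" wording; the sample lines are illustrative excerpts written for this example, not captured output, and the function names are hypothetical.

```python
import re

# Matches autoconf-style result lines such as
# "checking for exported tasklist_lock... yes" (assumed wording).
CHECK_RE = re.compile(r"^checking (?:for |whether )?(.+?)\.\.\. (.+)$")


def parse_checks(configure_output):
    """Map each check description to the result configure reported."""
    results = {}
    for line in configure_output.splitlines():
        match = CHECK_RE.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2).strip()
    return results


def diff_checks(old, new):
    """Return {check: (old_result, new_result)} for checks that differ."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "(absent)")
        after = new.get(name, "(absent)")
        if before != after:
            changed[name] = (before, after)
    return changed


# Illustrative excerpts only; real configure output has many more lines.
EL73 = """\
checking for gcc... gcc
checking for locks_lock_file_wait... no
checking for inode_lock... no
checking for exported tasklist_lock... yes
"""
EL74 = """\
checking for gcc... gcc
checking for locks_lock_file_wait... yes
checking for inode_lock... yes
checking for exported tasklist_lock... no
"""

if __name__ == "__main__":
    for name, (before, after) in diff_checks(
        parse_checks(EL73), parse_checks(EL74)
    ).items():
        print(f"{name:28s} {before:5s} {after}")
```

Checks with identical results on both kernels (the gcc line here) drop out, leaving only the three differences listed.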

We should probably adapt to this, and I guess it shouldn't be too hard, but is this change likely to be the reason for more frequent getcwd() problems?
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
Mark Vitale
2017-10-20 19:17:06 UTC
Post by Stephan Wiesand
[taking this thread to -devel]
I ran configure against the EL7.3 and EL7.4 GA kernels (3.10.0-514.el7 and 3.10.0-696.el7) and compared the results.
                          7.3   7.4
locks_lock_file_wait      no    yes
inode_lock                no    yes
exported tasklist_lock    yes   no
Thank you for this good information, Stephan. Were those 3 the only OpenAFS config differences you found?
Post by Stephan Wiesand
It turns out the EL7.4 kernel turns tasklist_lock from an rwlock_t into a qrwlock_t and all read_{,un}lock() calls into qread_{,un}lock() ones. And no, it's not what mainline kernels do, including 4.14-rc5.
We should probably adapt to this, and I guess it shouldn’t be too hard, but is this change likely to be the reason for more frequent getcwd() problems?
I took a look at all three differences with regard to the OpenAFS 1.6.20.2 code, and I don’t see a way that any of them could be causing the getcwd problems.

In particular, the tasklist_lock references in the OpenAFS 1.6.20.2 source will not actually result in any references from the OpenAFS kernel module, because of the results of other autoconf checks on RHEL 7.4. You can verify this for yourself by running: 'nm <openafs.ko> | grep tasklist_lock'
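
The nm check can also be scripted. This is only a sketch, assuming the default `nm` output layout (an optional address, a one-letter type, then the name), where an imported kernel symbol appears with type `U`. The sample output and the name `afs_example_function` are made up for illustration, and `nm_of()` needs binutils plus a real module path.

```python
import subprocess


def parse_nm(nm_output):
    """Return {symbol_name: type_letter} from default `nm` output."""
    symbols = {}
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols: "<address> <type> <name>"; undefined: "U <name>".
        if len(fields) == 3:
            _, sym_type, name = fields
        elif len(fields) == 2:
            sym_type, name = fields
        else:
            continue
        symbols[name] = sym_type
    return symbols


def references(nm_output, symbol):
    """True if the module lists `symbol` as undefined ('U'), i.e. imported."""
    return parse_nm(nm_output).get(symbol) == "U"


def nm_of(path):
    """Run nm on a module file and return its text output."""
    result = subprocess.run(
        ["nm", path], check=True, capture_output=True, text=True
    )
    return result.stdout


# Hypothetical nm output, for illustration only.
SAMPLE = """\
0000000000000000 T afs_example_function
                 U mutex_lock
                 U locks_lock_inode_wait
"""

if __name__ == "__main__":
    print(references(SAMPLE, "tasklist_lock"))
    print(references(SAMPLE, "mutex_lock"))
```

With a real module, `references(nm_of("/path/to/openafs.ko"), "tasklist_lock")` would replace the sample text; the path is a placeholder.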

However, don’t rely on the nm trick to look for the other symbols referenced above. inode_lock() is defined as static inline and is thus inlined as a mutex_lock(&inode->i_mutex), which is indistinguishable from other mutex_lock references. And locks_lock_file_wait() is also static inline - it shows up as locks_lock_inode_wait in the nm output.
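
To make the caveat concrete, here is a small sketch that classifies what nm output can and cannot tell you about an inline wrapper. The wrapper table encodes only the two cases described above; the sample nm text and the function names are assumptions made for this example.

```python
def undefined_symbols(nm_output):
    """Collect the names that nm reports as undefined ('U')."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "U":
            names.add(fields[1])
    return names


# Inline wrapper -> distinctive out-of-line symbol it calls, or None when the
# inlined body is only a generic call that many other code paths also make.
INLINE_WRAPPERS = {
    "locks_lock_file_wait": "locks_lock_inode_wait",
    "inode_lock": None,
}


def wrapper_evidence(nm_output, symbol):
    """Return 'direct', 'via <symbol>', 'undetectable', or 'absent'."""
    undefined = undefined_symbols(nm_output)
    if symbol in undefined:
        return "direct"
    if symbol in INLINE_WRAPPERS:
        target = INLINE_WRAPPERS[symbol]
        if target is None:
            return "undetectable"
        if target in undefined:
            return "via " + target
    return "absent"


# Hypothetical nm output, for illustration only.
SAMPLE = """\
                 U mutex_lock
                 U locks_lock_inode_wait
"""

if __name__ == "__main__":
    for name in ("locks_lock_file_wait", "inode_lock", "tasklist_lock"):
        print(name, "->", wrapper_evidence(SAMPLE, name))
```

The "undetectable" result is the point: a generic mutex symbol in the import list says nothing about whether it came from inode_lock() or from unrelated code.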

So in summary, thank you, but I don’t believe any of these explain the current getcwd symptoms.

Has anyone seen this with RHEL 7.4 and the previous OpenAFS releases - 1.6.20.1 or older?



Mark Vitale
OpenAFS release team

Stephan Wiesand
2017-10-20 19:27:14 UTC
Post by Mark Vitale
Post by Stephan Wiesand
[taking this thread to -devel]
I ran configure against the EL7.3 and EL7.4 GA kernels (3.10.0-514.el7 and 3.10.0-696.el7) and compared the results.
                          7.3   7.4
locks_lock_file_wait      no    yes
inode_lock                no    yes
exported tasklist_lock    yes   no
Thank you for this good information, Stephan. Were those 3 the only OpenAFS config differences you found?
Yes, of course.
Post by Mark Vitale
Post by Stephan Wiesand
It turns out the EL7.4 kernel turns tasklist_lock from an rwlock_t into a qrwlock_t and all read_{,un}lock() calls into qread_{,un}lock() ones. And no, it's not what mainline kernels do, including 4.14-rc5.
We should probably adapt to this, and I guess it shouldn’t be too hard, but is this change likely to be the reason for more frequent getcwd() problems?
I took a look at all three differences with regard to the OpenAFS 1.6.20.2 code, and I don’t see a way that any of them could be causing the getcwd problems.
In particular, the tasklist_lock references in the OpenAFS 1.6.20.2 source will not actually result in any references from the OpenAFS kernel module, because of the results of other autoconf checks on RHEL 7.4. You can verify this for yourself by running: 'nm <openafs.ko> | grep tasklist_lock'
However, don’t rely on the nm trick to look for the other symbols referenced above. inode_lock() is defined as static inline and is thus inlined as a mutex_lock(&inode->i_mutex), which is indistinguishable from other mutex_lock references. And locks_lock_file_wait() is also static inline - it shows up as locks_lock_inode_wait in the nm output.
So in summary, thank you, but I don’t believe any of these explain the current getcwd symptoms.
Has anyone seen this with RHEL 7.4 and the previous OpenAFS releases - 1.6.20.1 or older?
Not here. It was 1.6.21, and the statistics aren't exactly great.

You mean it could simply be "shake harder" unmasking the actual issue again?
--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany