Fixed
Created: Nov 24, 2016
Updated: Dec 3, 2018
Resolved Date: Nov 27, 2016
Found In Version: 6.0.0.31
Fix Version: 6.0.0.32
Severity: Severe
Applicable for: Wind River Linux 6
Component/s: Kernel
Customer reported a problem in OVP6-34. In that case, we provided 4 patches to customer. After applying the patches, the customer see the following problem.
=====================
PROBLEM DESCRIPTION
=====================
We are seeing strange behavior on our systems where pthread_mutex_lock() gets struck indefinitely at __pause_nocancel.
This behavior is consistently reproduced with the simple code attached to the case. Also, please be noted that the
issue is seen only when the protocol PTHREAD_PRIO_INHERIT is enabled on the mutex.
The fault sequence is (Please find attached sample program “pthd_issue.c” to recreate the issue):
1) Consider two pthreads p1 and p2.
2) p1 locks the mutex and calls pthread_cond_wait.
3) p2 signals p1 to come out of wait by calling pthread_cond_broadcast.
4) p1 comes out of pthread_cond_wait() and performs pthread_mutex_unlock() and pthread_mutex_lock().
5) p1 then strucks inside pthread_mutex_lock() at __pause_nocancel indefenetely.
On further debugging, we noticed that at step-3 above, i.e. when p1 comes out of pthread_cond_wait(), the FUTEX_WAITER bit
is set in the __lock. It is strange that this bit is set though there are no waiters.
Please find the detailed debug logs as below :
============
DEBUG LOGS
============
By connected GDB, I could see that thread stuck indefinitely at __pause_nocancel.
(gdb) bt
#0 0x0000003106e096a2 in pthread_join (threadid=140737354036992, thread_return=thread_return@entry=0x0) at pthread_join.c:92
#1 0x0000000000400949 in main (argc=<optimized out>, argv=<optimized out>) at vsmartd.c:119
(gdb) info threads
Id Target Id Frame
2 Thread 0x7ffff7fe7700 (LWP 6738) "vsmartd" 0x0000003106e0fae0 in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7ffff7fe9700 (LWP 6733) "vsmartd" 0x0000003106e096a2 in pthread_join (threadid=140737354036992,
thread_return=thread_return@entry=0x0) at pthread_join.c:92
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff7fe7700 (LWP 6738))]
#0 0x0000003106e0fae0 in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0 0x0000003106e0fae0 in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000003106e05d00 in __pthread_mutex_lock_full (mutex=0x6012e0 <mutex>) at ../nptl/pthread_mutex_lock.c:354
#2 0x0000000000400b0d in thread (arg=<optimized out>) at vsmartd.c:46
#3 0x0000003106e08201 in start_thread (arg=0x7ffff7fe7700) at pthread_create.c:312
#4 0x00000031066e68bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
I ran one more gdb session to inspect __lock value and the observation is as below:
1) thread is waiting at pthread_cond_wait(). The value of mutex object looks as below. __lock is TID ( thread_id )
(gdb) p *(pthread_mutex_t *) 0x6012e0
$1 = {__data = {__lock = 16182, __count = 1, __owner = 16182, __nusers = 1, __kind = 32, __spins = 0, __elision = 0, __list = {__prev = 0x0,
__next = 0x0}}, __size = "6?\000\000\001\000\000\000\066?\000\000\001\000\000\000 ", '\000' <repeats 22 times>, __align = 4294983478}
2) thread comes out of pthread_cond_wait(). The value of mutex object looks as below :
(gdb) p *(pthread_mutex_t *) 0x6012e0
$3 = {__data = {__lock = -2147467466, __count = 1, __owner = 16182, __nusers = 1, __kind = 32, __spins = 0, __elision = 0, __list = {
__prev = 0x0, __next = 0x0}}, __size = "6?\000\200\001\000\000\000\066?\000\000\001\000\000\000 ", '\000' <repeats 22 times>,
__align = 6442467126}
Please be noted that the pthread_cond_wait() inside will perform mutex unlock before it blocks and perform mutex lock after it wakes up. The value of __lock
now shows FUTEX_WAITER | TID i.e. FUTEX_WAITER bit is set. But it is strange that FUTEX_WAITER bit is set even though there are no waiters for the mutex.
3) Now thread peforms pthread_mutex_unlock. Ideally __lock should be 0 ( I assume ). But it does not happen.
(gdb) p *(pthread_mutex_t *) 0x6012e0
$4 = {__data = {__lock = -2147467466, __count = 1, __owner = 0, __nusers = 0, __kind = 32, __spins = 0, __elision = 0, __list = {
__prev = 0x0, __next = 0x0}}, __size = "6?\000\200\001", '\000' <repeats 11 times>, " ", '\000' <repeats 22 times>,
__align = 6442467126}
4) Now thread performs pthread_mutex_lock and it is taken up by __pause_nocancel indefenetely.
1. configure project with the following options:
/opt/WRLX6/wrlinux-6/wrlinux/configure --enable-board=intel-x86-64 --enable-kernel=preempt-rt --enable-rootfs=ovp-kvm+installer-support --enable-addons=wr-ovp --enable-sdkimage-staticlibs=yes --with-template=feature/kvm,feature/benchmark,feature/debug,feature/gdb,feature/kdump,feature/kexec,feature/kernel-debug,feature/kernel-tune,feature/build_libc,feature/dpdk-vswitch,feature/lttng2 --enable-parallel-pkgbuilds=8 --enable-jobs=8 --enable-checkout-all-layers=yes --with-package=wrs-kvm-helper,qemu,libvirt,libvirt-libvirtd,libvirt-virsh,kernel-module-kvm,kernel-module-kvm-intel,kernel-module-kvm-amd,bridge-utils,inetutils,inetutils-rshd,inetutils-telnetd,inetutils-ftpd,ltp --enable-rm-oldimgs=yes --enable-internet-download=yes --enable-reconfig --enable-prelink=no --with-rcpl-version=0020
2. apply the 4 patches in the OVP6-34 to kernel source, then rebuild the kernel.
2. compile the attaches test program, copy to the export/dist/root/
3. make start-target
run the test program.
It will hung the console all the time.
Without the patch, it will print "The problem is not recreated for you!", and then exit.