Wind River Support Network

HomeDefectsLIN6-11997
Fixed

LIN6-11997 : Issue seen with futex patch from OVP6-34

Created: Nov 24, 2016    Updated: Dec 3, 2018
Resolved Date: Nov 27, 2016
Found In Version: 6.0.0.31
Fix Version: 6.0.0.32
Severity: Severe
Applicable for: Wind River Linux 6
Component/s: Kernel

Description

Customer reported a problem in OVP6-34. In that case, we provided 4 patches to customer. After applying the patches, the customer see the following problem.

===================== 
PROBLEM DESCRIPTION 
===================== 

We are seeing strange behavior on our systems where pthread_mutex_lock() gets struck indefinitely at __pause_nocancel. 
This behavior is consistently reproduced with the simple code attached to the case. Also, please be noted that the 
issue is seen only when the protocol PTHREAD_PRIO_INHERIT is enabled on the mutex. 


The fault sequence is (Please find attached sample program “pthd_issue.c” to recreate the issue): 

1) Consider two pthreads p1 and p2. 

2) p1 locks the mutex and calls pthread_cond_wait. 

3) p2 signals p1 to come out of wait by calling pthread_cond_broadcast. 

4) p1 comes out of pthread_cond_wait() and performs pthread_mutex_unlock() and pthread_mutex_lock(). 

5) p1 then strucks inside pthread_mutex_lock() at __pause_nocancel indefenetely. 

On further debugging, we noticed that at step-3 above, i.e. when p1 comes out of pthread_cond_wait(), the FUTEX_WAITER bit 
is set in the __lock. It is strange that this bit is set though there are no waiters. 

Please find the detailed debug logs as below : 


============ 
DEBUG LOGS 
============ 

By connected GDB, I could see that thread stuck indefinitely at __pause_nocancel. 

(gdb) bt 
#0 0x0000003106e096a2 in pthread_join (threadid=140737354036992, thread_return=thread_return@entry=0x0) at pthread_join.c:92 
#1 0x0000000000400949 in main (argc=<optimized out>, argv=<optimized out>) at vsmartd.c:119 
(gdb) info threads 
 Id Target Id Frame 
 2 Thread 0x7ffff7fe7700 (LWP 6738) "vsmartd" 0x0000003106e0fae0 in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:81 
* 1 Thread 0x7ffff7fe9700 (LWP 6733) "vsmartd" 0x0000003106e096a2 in pthread_join (threadid=140737354036992, 
 thread_return=thread_return@entry=0x0) at pthread_join.c:92 
(gdb) thread 2 
[Switching to thread 2 (Thread 0x7ffff7fe7700 (LWP 6738))] 
#0 0x0000003106e0fae0 in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:81 
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) 
(gdb) bt 
#0 0x0000003106e0fae0 in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:81 
#1 0x0000003106e05d00 in __pthread_mutex_lock_full (mutex=0x6012e0 <mutex>) at ../nptl/pthread_mutex_lock.c:354 
#2 0x0000000000400b0d in thread (arg=<optimized out>) at vsmartd.c:46 
#3 0x0000003106e08201 in start_thread (arg=0x7ffff7fe7700) at pthread_create.c:312 
#4 0x00000031066e68bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 
(gdb) 


I ran one more gdb session to inspect __lock value and the observation is as below: 


1) thread is waiting at pthread_cond_wait(). The value of mutex object looks as below. __lock is TID ( thread_id ) 

(gdb) p *(pthread_mutex_t *) 0x6012e0 
$1 = {__data = {__lock = 16182, __count = 1, __owner = 16182, __nusers = 1, __kind = 32, __spins = 0, __elision = 0, __list = {__prev = 0x0, 
 __next = 0x0}}, __size = "6?\000\000\001\000\000\000\066?\000\000\001\000\000\000 ", '\000' <repeats 22 times>, __align = 4294983478} 


2) thread comes out of pthread_cond_wait(). The value of mutex object looks as below : 

(gdb) p *(pthread_mutex_t *) 0x6012e0 
$3 = {__data = {__lock = -2147467466, __count = 1, __owner = 16182, __nusers = 1, __kind = 32, __spins = 0, __elision = 0, __list = { 
 __prev = 0x0, __next = 0x0}}, __size = "6?\000\200\001\000\000\000\066?\000\000\001\000\000\000 ", '\000' <repeats 22 times>, 
 __align = 6442467126} 

Please be noted that the pthread_cond_wait() inside will perform mutex unlock before it blocks and perform mutex lock after it wakes up. The value of __lock 
now shows FUTEX_WAITER | TID i.e. FUTEX_WAITER bit is set. But it is strange that FUTEX_WAITER bit is set even though there are no waiters for the mutex. 

3) Now thread peforms pthread_mutex_unlock. Ideally __lock should be 0 ( I assume ). But it does not happen. 

(gdb) p *(pthread_mutex_t *) 0x6012e0 
$4 = {__data = {__lock = -2147467466, __count = 1, __owner = 0, __nusers = 0, __kind = 32, __spins = 0, __elision = 0, __list = { 
 __prev = 0x0, __next = 0x0}}, __size = "6?\000\200\001", '\000' <repeats 11 times>, " ", '\000' <repeats 22 times>, 
 __align = 6442467126} 


4) Now thread performs pthread_mutex_lock and it is taken up by __pause_nocancel indefenetely. 

Steps to Reproduce

1. configure project with the following options:
/opt/WRLX6/wrlinux-6/wrlinux/configure --enable-board=intel-x86-64 --enable-kernel=preempt-rt --enable-rootfs=ovp-kvm+installer-support --enable-addons=wr-ovp --enable-sdkimage-staticlibs=yes --with-template=feature/kvm,feature/benchmark,feature/debug,feature/gdb,feature/kdump,feature/kexec,feature/kernel-debug,feature/kernel-tune,feature/build_libc,feature/dpdk-vswitch,feature/lttng2 --enable-parallel-pkgbuilds=8 --enable-jobs=8 --enable-checkout-all-layers=yes --with-package=wrs-kvm-helper,qemu,libvirt,libvirt-libvirtd,libvirt-virsh,kernel-module-kvm,kernel-module-kvm-intel,kernel-module-kvm-amd,bridge-utils,inetutils,inetutils-rshd,inetutils-telnetd,inetutils-ftpd,ltp --enable-rm-oldimgs=yes --enable-internet-download=yes --enable-reconfig --enable-prelink=no --with-rcpl-version=0020

2. apply the 4 patches in the OVP6-34 to kernel source, then rebuild the kernel.

2. compile the attaches test program, copy to the export/dist/root/

3. make start-target
run the test program.
It will hung the console all the time.
Without the patch, it will print "The problem is not recreated for you!", and then exit.

Other Downloads


Live chat
Online