Acknowledged                
                
            
            
                
                    Created: May 24, 2021   
                                            Updated: May 25, 2021                                    
                
                    
                                    
             
         
        
            
            
                
                
                                    
                        Found In Version: 10.21.20.1                    
                
                
                                        
                            Severity: Standard                        
                    
                                        
                            Applicable for: Wind River Linux LTS 21                        
                    
                                    
                        Component/s: BSP                    
                
                
                             
         
                        
                You may see the following call trace.
[ 4796.204070] WARNING: possible circular locking dependency detected
[ 4796.204073] 5.10.37-yocto-standard #1 Tainted: G W O
[ 4796.204074] ------------------------------------------------------
[ 4796.204078] cat/30248 is trying to acquire lock:
[ 4796.215108] =============================
[ 4796.221128] ffffffff86268900 (cpu_hotplug_lock){++++}-
{0:0}
, at: get_energy_counter+0x2c/0x90 [intel_rapl_common]
[ 4796.221136]
but task is already holding lock:
[ 4796.221137] ff217b1f5e498a10 (kn->active#412){++++}-
{0:0}
, at: kernfs_seq_start+0x37/0xc0
[ 4796.221144]
which lock already depends on the new lock.
[ 4796.221145]
the existing dependency chain (in reverse order) is:
[ 4796.221146]
-> #2 (
[ 4796.250107] WARNING: suspicious RCU usage
[ 4796.253420] kn->active#412){++++}-
{0:0}
:
[ 4796.253428] __lock_acquire+0x3b9/0x6d0
[ 4796.253430] lock_acquire.part.0+0xa4/0x190
[ 4796.253432] lock_acquire+0x73/0x150
[ 4796.253435] kernfs_drain+0x163/0x1c0
[ 4796.253437] __kernfs_remove+0x18d/0x1e0
[ 4796.253439] kernfs_remove_by_name_ns+0x46/0x90
[ 4796.253443] remove_files+0x30/0x70
[ 4796.273127] 5.10.37-yocto-standard #1 Tainted: G W O
[ 4796.277197] sysfs_remove_group+0x3d/0x80
[ 4796.277199] sysfs_remove_groups+0x2e/0x50
[ 4796.277202] device_remove_attrs+0x50/0x80
[ 4796.277204] device_del+0x196/0x440
[ 4796.277206] device_unregister+0x42/0x60
[ 4796.277210] powercap_unregister_zone+0x49/0x60
[ 4796.277213] rapl_remove_package+0xd5/0x240 [intel_rapl_common]
[ 4796.287123] -----------------------------
[ 4796.290615] rapl_cpu_down_prep+0x5f/0x80 [intel_rapl_msr]
[ 4796.290617] cpuhp_invoke_callback+0x8f/0x280
[ 4796.290618] cpuhp_thread_fun+0x18f/0x230
[ 4796.290620] smpboot_thread_fn+0x1e7/0x2a0
[ 4796.290622] kthread+0x168/0x190
[ 4796.290625] ret_from_fork+0x1f/0x30
[ 4796.290626]
> #1 (cpuhp_state-down){..}
{0:0}
[ 4796.294467] kernel/sched/core.c:7263 Illegal context switch in RCU-sched read-side critical section!
[ 4796.297945] :
[ 4796.297947] __lock_acquire+0x3b9/0x6d0
[ 4796.297949] lock_acquire.part.0+0xa4/0x190
[ 4796.297950] lock_acquire+0x73/0x150
[ 4796.297952] cpuhp_kick_ap_work+0x47/0x1c0
[ 4796.297956] _cpu_down+0x139/0x220
[ 4796.297958] cpu_device_down+0x36/0x60
[ 4796.297962] cpu_subsys_offline+0xe/0x10
[ 4796.297965] device_offline+0xe8/0x120
[ 4796.302146]
other info that might help us debug this:
[ 4796.306576] online_store+0x53/0xc0
[ 4796.306578] dev_attr_store+0x17/0x30
[ 4796.306580] sysfs_kf_write+0x4b/0x60
[ 4796.306582] kernfs_fop_write_iter+0x13a/0x1d0
[ 4796.306585] new_sync_write+0x117/0x1b0
[ 4796.306587] vfs_write+0x2be/0x3b0
[ 4796.309813]
rcu_scheduler_active = 2, debug_locks = 0
[ 4796.314071] ksys_write+0x70/0xf0
[ 4796.314072] __x64_sys_write+0x1a/0x20
[ 4796.314074] do_syscall_64+0x38/0x50
[ 4796.314076] entry_SYSCALL_64_after_hwframe+0x44/0xa9
–
[ 4796.377012] Chain exists of:
cpu_hotplug_lock -> cpuhp_state-down --> kn>active
[ 4796.404113] CPU: 48 PID: 813 Comm: lock_torture_wr Tainted: G W O 5.10.37-yocto-standard #1
[ 4796.404980] #412
[ 4796.409071] Hardware name: Intel Corporation WHITLEY/WHITLEY, BIOS WLYDCRB1.SYS.0017.D75.2007020055 07/02/2020
[ 4796.413244] Possible unsafe locking scenario:
[ 4796.413247] CPU0 CPU1
[ 4796.417681] Call Trace:
[ 4796.422714] ---- ----
[ 4796.426723] show_stack+0x52/0x58
[ 4796.432795] lock(
[ 4796.437319] dump_stack+0x7d/0x9f
[ 4796.441926] kn->active
[ 4796.446534] lockdep_rcu_suspicious+0xdb/0xe5
[ 4796.450527] #412
[ 4796.454964] ___might_sleep+0x147/0x150
[ 4796.459995] );
[ 4796.466422] stutter_wait+0xa1/0xd0
[ 4796.470420] lock(
[ 4796.476414] lock_torture_writer+0xd7/0x160
[ 4796.481281] cpuhp_state-down
[ 4796.485804] ? lock_torture_reader+0x110/0x110
[ 4796.490408] );
[ 4796.494145] kthread+0x168/0x190
[ 4796.498225] lock(
[ 4796.504307] ? __kthread_bind_mask+0x70/0x70
[ 4796.513405] kn->active#412
[ 4796.514993] ret_from_fork+0x1f/0x30
[ 4796.519339] );
[ 4796.827192] lock(cpu_hotplug_lock);
[ 4796.830849]
DEADLOCK ***
[ 4796.836756] 3 locks held by cat/30248:
[ 4796.840498] #0: ff217b1f6702c768 (&p->lock){..}-
{3:3}
, at: seq_read_iter+0x55/0x530
[ 4796.848400] #1: ff217b201463b290 (&of->mutex){..}-
{3:3}
, at: kernfs_seq_start+0x2f/0xc0
[ 4796.856650] #2: ff217b1f5e498a10 (kn->active#412){++++}-
{0:0}
, at: kernfs_seq_start+0x37/0xc0
[ 4796.865242]
stack backtrace:
[ 4796.869584] CPU: 85 PID: 30248 Comm: cat Tainted: G W O 5.10.37-yocto-standard #1
[ 4796.878167] Hardware name: Intel Corporation WHITLEY/WHITLEY, BIOS WLYDCRB1.SYS.0017.D75.2007020055 07/02/2020
[ 4796.888128] Call Trace:
[ 4796.890578] show_stack+0x52/0x58
[ 4796.893889] dump_stack+0x7d/0x9f
[ 4796.897200] print_circular_bug.cold+0x34/0x39
[ 4796.901634] check_noncircular+0x116/0x130
[ 4796.905725] ? __this_cpu_preempt_check+0x13/0x20
[ 4796.910421] check_prev_add+0x95/0xc00
[ 4796.914170] ? add_chain_cache+0x210/0x4a0
[ 4796.918261] validate_chain+0x5d0/0x8d0
[ 4796.922093] __lock_acquire+0x3b9/0x6d0
[ 4796.925931] lock_acquire.part.0+0xa4/0x190
[ 4796.930113] ? get_energy_counter+0x2c/0x90 [intel_rapl_common]
[ 4796.936017] ? __this_cpu_preempt_check+0x13/0x20
[ 4796.940715] lock_acquire+0x73/0x150
[ 4796.944283] ? get_energy_counter+0x2c/0x90 [intel_rapl_common]
[ 4796.950191] cpus_read_lock+0x3d/0xd0
[ 4796.953855] ? get_energy_counter+0x2c/0x90 [intel_rapl_common]
[ 4796.959759] get_energy_counter+0x2c/0x90 [intel_rapl_common]
[ 4796.965492] energy_uj_show+0x3c/0x80
[ 4796.969148] dev_attr_show+0x1d/0x40
[ 4796.972720] sysfs_kf_seq_show+0xa1/0xe0
[ 4796.976635] kernfs_seq_show+0x2d/0x30
[ 4796.980379] seq_read_iter+0x1e4/0x530
[ 4796.984124] ? handle_pte_fault+0x1da/0x210
[ 4796.988299] kernfs_fop_read_iter+0x32/0x40
[ 4796.992475] new_sync_read+0x110/0x1a0
[ 4796.996219] vfs_read+0x183/0x1e0
[ 4796.999529] ksys_read+0x70/0xf0
[ 4797.002756] __x64_sys_read+0x1a/0x20
[ 4797.006411] do_syscall_64+0x38/0x50
[ 4797.009981] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4797.015023] RIP: 0033:0x7f1ebddeade2
[ 4797.018592] Code: c0 e9 d2 fe ff ff 50 48 8d 3d 6a f2 09 00 e8 55 d7 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 4797.037294] RSP: 002b:00007ffc913ec708 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 4797.044841] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f1ebddeade2
[ 4797.051957] RDX: 0000000000020000 RSI: 00007f1ebdcdb000 RDI: 0000000000000003
[ 4797.059070] RBP: 00007f1ebdcdb000 R08: 00007f1ebdcda010 R09: 0000000000000000
[ 4797.066177] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[ 4797.073293] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
This issue only happens on multi-socket system. When the last CPU of the socket is doing offline, the corresponding intel RAPL sysfs will be removed. This process introduces the lock chain "cpu_hotplug_lock --> cpuhp_state-down --> kn>active#294" (take #294 for example).
After that when you are trying to read intel RAPL attribute file through sysfs, the lock chain "kn->active#294 --> cpu_hotplug_lock" will be introduced.
So lockdep check will detect circular locking dependency between "cpu_hotplug_lock" and "kn->active#294".
This could lead to deadlock in theory. But probability is very very small.
Two workarounds:
1) Do not read intel RAPL sysfs and remove it at the same time. This can avoid dead lock, but cannot disappear lockdep check warning.
2) Do not unplug all CPU on one socket, leave one alive at least. This can avoid lockdep check warning and deadlock.
        
                        
                /lpg-build/cdc/fast_prod/WRL10_21/WRL10_21_GIT/wrlinux-10/setup.sh
--machines=intel-x86-64 --templates feature/intel-qat feature/xfce feature/dpdk feature/bsp-extras feature/docker feature/kexec feature/kdump feature/lxc feature/tpm2 --layers meta-dpdk --distro wrlinux --accept-eula=yes --dl-layers
source ./environment-setup-x86_64-wrlinuxsdk-linux
source ./oe-init-build-env build
bitbake-layers add-layer /folk/kliang/lxgit/wrlinux/wr-testing/bts-dev/
echo require templates/feature/bts/template.conf >> conf/local.conf
echo IMAGE_FSTYPES += \" tar.bz2\" >> conf/local.conf
echo BB_NO_NETWORK_pn-lmbench = \"0\" >> conf/local.conf
echo PNWHITELIST_dpdk += \'dpdk\' >> conf/local.conf
echo BB_NO_NETWORK = \'0\' >> conf/local.conf
echo IMAGE_FSTYPES_remove = \" live hddimg iso\" >>conf/local.conf
bitbake wrlinux-image-std