Wind River Support Network

Fixed

LIN6-13569 : kernel crash in cpu accounting (3.10.62-ltsi-WR6.0.0.29)

Created: Sep 12, 2017    Updated: Dec 3, 2018
Resolved Date: Sep 24, 2017
Found In Version: 6.0.0.34
Fix Version: 6.0.0.35
Severity: Severe
Applicable for: Wind River Linux 6
Component/s: Kernel

Description

Information provided by the customer:

The issue can be reproduced under heavy load. 

This is the feedback from the engineer that is hitting this issue:
Regardless of any amount of memory corruption (which I have yet to find evidence of) or bad arguments in/from userspace, the kernel should not OOPS in simple conditional mutex and socket library calls (eventually syscalls). I would be interested if they have some examples of calls I could make and the arguments I could use that would cause a similar issue to see if I'm doing anything like it.

I reproduced it three more times today, so it's easy to reproduce in our own setup.
It will probably be harder for WR, since they won't have any of our devices, equipment, software, etc.
I was looking for any unrecoverable ECC errors in the blade SEL since the log was full and could not record them previously, but I saw nothing of the sort - just some warnings about the temperature going high. I'm at a loss at this point and have other things that I should be doing.

Important points about the program that might help Wind River in their debugging:

* Highly multi-threaded message broker using a UNIX domain socket for communications between 48 other processes.
* Simple pattern of accepting each connection on the socket's filesystem path, spinning it off into an RX thread, and that RX thread spinning off a TX thread to operate on the resulting file descriptor. Thus, every file descriptor has an RX and a TX thread operating on it in parallel (UNIX sockets are full duplex, so this should be no problem).
* The TX thread is always the one that causes the OOPS and disappears.
* The RX thread reports a read error on the shared file descriptor a couple of seconds before the problem manifests, claiming an invalid file descriptor. Dissection of the core and enhanced error messages show that the file descriptor number passed to the read() call appears correct, and lsof shows the matching file descriptor is open.
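
For anyone trying to build a reproducer without the customer's equipment, the threading pattern described above can be sketched roughly as follows. This is only an illustration in Python, not the customer's code, which is not included in this report. All names (`start_broker`, `rx_loop`, `tx_loop`, `demo`) are hypothetical, and a plain echo stands in for the real message routing.

```python
import os
import queue
import socket
import tempfile
import threading


def tx_loop(conn, outbox):
    """TX thread: write queued messages to the shared socket."""
    while True:
        msg = outbox.get()
        if msg is None:  # sentinel from the RX thread: stop sending
            break
        conn.sendall(msg)


def rx_loop(conn):
    """RX thread: start the TX thread, then read from the same socket."""
    outbox = queue.Queue()
    tx = threading.Thread(target=tx_loop, args=(conn, outbox), daemon=True)
    tx.start()
    try:
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed its end of the connection
                break
            outbox.put(data)  # assumption: echo replaces real broker routing
    finally:
        outbox.put(None)
        tx.join()
        conn.close()  # closed only after both threads are done with it


def serve(listener):
    """Accept loop: one RX thread (and hence one TX thread) per connection."""
    while True:
        conn, _ = listener.accept()
        threading.Thread(target=rx_loop, args=(conn,), daemon=True).start()


def start_broker(path):
    """Bind a UNIX domain stream socket at `path` and start accepting."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(48)
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    return listener


def demo():
    """Connect one client, send a message, and return the echoed reply."""
    path = os.path.join(tempfile.mkdtemp(), "broker.sock")
    start_broker(path)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)
    client.sendall(b"ping")
    reply = client.recv(4096)
    client.close()
    return reply
```

In this sketch the socket is closed only after the TX thread has been joined, so neither thread can touch a descriptor the other has already released. Whether the customer's program orders its shutdown the same way is not stated in the report.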

Let me know if there is any more information that I can provide to help the investigation.
