The customer is running the SCTP stack of PNE-LE1.3 in their product (a custom board based on Intel_mpcbl0040).
The SCTP stack they are using already has the following 5 patches applied (these patches were provided by WindRiver):
sctp-2.6.15-upgrade.patch
sctp-2.6.15-security-fix.patch
sctp-2.6.15-security-fix2.patch
sctp-2.6.19.patch
sctp-2.6.19-backport.patch
Apart from the patches above, the customer says they have not modified the SCTP stack at all.
During system operation with this SCTP stack, the following kernel panic occurred.
It does not happen regularly.
May 9 19:30:12 201 kernel: f8df3236
May 9 19:30:12 201 kernel: PREEMPT SMP
May 9 19:30:12 201 kernel: LTT NESTING LEVEL : 0
May 9 19:30:12 201 kernel: Modules linked in: gps50hz acs85xx_smbus fawkes ohci_hcd i2c_i801 i2c_core ehci_hcd uhci_hcd usbcore bonding ipmi_watchdog ipmi_si ipmi_devintf ipmi_msghandler sctp ipv6
May 9 19:30:12 201 kernel: CPU: 1
May 9 19:30:12 201 kernel: EIP: 0060:[
May 9 19:30:12 201 kernel: EFLAGS: 00010202 (2.6.14.7-selinux1)
May 9 19:30:12 201 kernel: EIP is at sctp_sendmsg+0x2fc/0x721 [sctp]
May 9 19:30:12 201 kernel: eax: ffffffff ebx: d6845080 ecx: d6305320 edx: d62dc994
May 9 19:30:12 201 kernel: esi: d667a000 edi: d62dc980 ebp: dfd71c10 esp: dfd71b78
May 9 19:30:12 201 kernel: ds: 007b es: 007b ss: 0068
May 9 19:30:12 201 kernel: Process wag_oal (pid: 3964, threadinfo=dfd71000 task=dfc5f550)
May 9 19:30:12 201 kernel: Stack: 00000036 d7729550 00000000 d62dc994 d6305320 00001ce4 00000000 00000000
May 9 19:30:12 201 kernel: 00000000 00000000 d667a000 00000000 d693da80 dfd71dd0 d6845080 000000fa
May 9 19:30:12 201 kernel: 00000000 00000000 d66a11cc 00000000 00000000 00000000 00000000 00000000
May 9 19:30:12 201 kernel: Call Trace:
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: [
May 9 19:30:12 201 kernel: Code: c2 0f 84 70 03 00 00 8b bd 74 ff ff ff 83 ef 14 89 f8 e8 ba 84 ff ff 8b 77 54 89 f0 8b 5e 18 e8 ce 73 ff ff 8b 47 1c f0 ff 43 18 <89> 58 08 c7 40 70 54 94 35 c0 8b 80 88 00 00 00 f0 01 43 64 8b
May 9 19:30:21 201 kernel: <3>BUG: soft lockup detected on CPU#0!
May 9 19:30:21 201 kernel:
May 9 19:30:21 201 kernel: Pid: 29705, comm: wag_oal
May 9 19:30:21 201 kernel: EIP: 0060:[
May 9 19:30:21 201 kernel: EIP is at freeary+0x16/0x81
May 9 19:30:21 201 kernel: EFLAGS: 00000282 Not tainted (2.6.14.7-selinux1)
May 9 19:30:21 201 kernel: EAX: d6305320 EBX: d77cbb48 ECX: c0463e20 EDX: 38608061
May 9 19:30:21 201 kernel: ESI: 00000000 EDI: 38608061 EBP: d6335e48 DS: 007b ES: 007b
May 9 19:30:21 201 kernel: CR0: 8005003b CR2: b3eea000 CR3: 16a2c000 CR4: 000006d0
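The "Code:" line of the first oops marks the byte at the faulting EIP with angle brackets (<89>). The following is a minimal userspace sketch, not kernel code, for counting how many opcode bytes precede that marker, which helps when lining the dump up against a disassembly. The function name bytes_before_fault is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count the two-digit hex bytes that precede the "<xx>" marker in the
 * text following "Code:" in an oops. Returns -1 if there is no marker. */
int bytes_before_fault(const char *code)
{
    const char *p = code;
    int count = 0;

    assert(code != NULL);
    while (*p) {
        while (*p == ' ')               /* skip separators */
            p++;
        if (*p == '\0')
            break;
        if (*p == '<')                  /* marker: byte at the faulting EIP */
            return count;
        if (isxdigit((unsigned char)p[0]) && isxdigit((unsigned char)p[1]))
            count++;
        while (*p && *p != ' ')         /* advance past this token */
            p++;
    }
    return -1;
}
```

Passing the byte string from the "Code:" line above to this function gives the distance in bytes from the start of the dump to the faulting instruction.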
The code executes in the following sequence and finally reaches the panic state:
[user application] sends an SCTP message with a socket call.
[system call] sys_socketcall -> sys_sendmsg -> sock_sendmsg -> inet_sendmsg -> sctp_sendmsg
[part of sctp_sendmsg in net/sctp/socket.c]
	/* Now send the (possibly) fragmented message. */
	list_for_each(pos, &datamsg->chunks) {
		chunk = list_entry(pos, struct sctp_chunk, frag_list);
		sctp_datamsg_track(chunk);

		/* Do accounting for the write space. */
		sctp_set_owner_w(chunk);    <==== (1)

		chunk->transport = chunk_tp;

		/* Send it to the lower layers. Note: all chunks
		 * must either fail or succeed. The lower layer
		 * works that way today. Keep it that way or this
		 * breaks.
		 */
		err = sctp_primitive_SEND(asoc, chunk);
		/* Did the lower layer accept the chunk? */
		if (err)
			sctp_chunk_free(chunk);
		SCTP_DEBUG_PRINTK("We sent primitively.\n");
	}
====> from (1)
static inline void sctp_set_owner_w(struct sctp_chunk *chunk)
{
	struct sctp_association *asoc = chunk->asoc;
	struct sock *sk = asoc->base.sk;

	/* The sndbuf space is tracked per association. */
	sctp_association_hold(asoc);

	skb_set_owner_w(chunk->skb, sk);    <===== (2)

	chunk->skb->destructor = sctp_wfree;
	:
	:
}
=======> from (2)
[include/net/sock.h]
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
	sock_hold(sk);
	skb->sk = sk;    <== (3) panic point
	skb->destructor = sock_wfree;
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
To locate the exact C source line of the panic, refer to the following excerpt
from the disassembly of the sctp kernel module:
00010f3a
sctp_sendmsg():
net/sctp/socket.c:1370
10f3a: 55 push %ebp
net/sctp/socket.c:1378
10f3b: 31 c0 xor %eax,%eax
:
:
:
net/sctp/socket.c:1722
11212: 8b bd 74 ff ff ff mov 0xffffff74(%ebp),%edi
11218: 83 ef 14 sub $0x14,%edi
net/sctp/socket.c:1723
1121b: 89 f8 mov %edi,%eax
1121d: e8 fc ff ff ff call 1121e
net/sctp/socket.c:143
11222: 8b 77 54 mov 0x54(%edi),%esi
net/sctp/socket.c:147
11225: 89 f0 mov %esi,%eax
net/sctp/socket.c:144
11227: 8b 5e 18 mov 0x18(%esi),%ebx
net/sctp/socket.c:147
1122a: e8 fc ff ff ff call 1122b
include/net/sock.h:1092
1122f: 8b 47 1c mov 0x1c(%edi),%eax
include/asm/atomic.h:103
11232: f0 ff 43 18 lock incl 0x18(%ebx)
include/net/sock.h:1094
11236: 89 58 08 mov %ebx,0x8(%eax) <==== panic point (4)
include/net/sock.h:1095
11239: c7 40 70 00 00 00 00 movl $0x0,0x70(%eax)
As the kernel panic message shows (EIP is at sctp_sendmsg+0x2fc/0x721 [sctp]),
the panic point is 0x11236 (0x10f3a, the start address of sctp_sendmsg, + 0x2fc = 0x11236).
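The address arithmetic above can be written as a small userspace sketch. The function names are hypothetical helpers for illustration only. The second function rests on an assumption: that the value f8df3236 on the first log line is the run-time EIP of the same instruction, which the truncated log does not confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Address in the objdump listing: symbol start (from the listing)
 * plus the offset printed in the oops ("symbol+0xOFF/0xLEN"). */
uint32_t listing_address(uint32_t sym_start, uint32_t off, uint32_t sym_len)
{
    assert(off < sym_len);      /* offset must lie inside the function */
    return sym_start + off;
}

/* Load base implied by a run-time EIP and its listing address.
 * ASSUMPTION: the run-time EIP belongs to the same instruction. */
uint32_t implied_load_base(uint32_t runtime_eip, uint32_t listing_addr)
{
    return runtime_eip - listing_addr;
}
```

With the values from this report, listing_address(0x10f3a, 0x2fc, 0x721) yields 0x11236. If the assumption about f8df3236 holds, the implied load base would be 0xf8de2000.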
Here you can also see that the eax register holds the skb pointer and that its value is 0xffffffff (see the C source line marked (3) and the disassembled code marked (4)).
The kernel panics while trying to store a value (here sk, see the C source line marked (3)) into a member of skb.
I have found the panic point, but I need to know what sets the skb pointer (the eax register here) to 0xffffffff.
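As a rough userspace illustration of why this store faults, the sketch below computes the address touched by the instruction at (4) and checks whether the eax value falls in the range normally used for error-encoded pointers. The function names are hypothetical. The 4095 threshold is an assumption taken from later kernels' IS_ERR() convention, and the exact bound in 2.6.14 may differ. This only classifies the value. It does not show where the value came from, and -1 could equally be uninitialised or overwritten memory.

```c
#include <assert.h>
#include <stdint.h>

#define ERRNO_RANGE 4095u   /* ASSUMPTION: bound used by later kernels */

/* Address touched by "mov %ebx,disp(%eax)" on a 32-bit CPU.
 * The sum wraps modulo 2^32. */
uint32_t effective_address(uint32_t reg, uint32_t disp)
{
    return (uint32_t)(reg + disp);
}

/* Nonzero if a 32-bit pointer value is NULL or lies in the range
 * occupied by error-encoded pointers (-1 .. -4095). */
int is_null_or_error_value(uint32_t ptr)
{
    return ptr == 0 || ptr >= (uint32_t)(0u - ERRNO_RANGE);
}

/* Check the model against the register dump in this report. */
void check_against_oops(void)
{
    assert(effective_address(0xffffffffu, 0x08) == 0x00000007u);
    assert(is_null_or_error_value(0xffffffffu));    /* eax: skb pointer */
    assert(!is_null_or_error_value(0xd62dc980u));   /* edi: chunk pointer */
}
```

Under these assumptions the store at 0x8(%eax) would target address 0x00000007, and eax itself looks like -1 rather than a valid kernel pointer, while the chunk pointer in edi looks like a normal address.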
What I suspect is one of the following:
heavy traffic prevents the OS from allocating memory for the skb, or the SCTP chunk handling in this older kernel version is unstable.
Is there another patch that makes SCTP chunk handling more stable,
or any workaround for this kind of problem?
1) configure --enable-board=intel_mpcbl0040 --enable-kernel=standard --enable-rootfs=glibc_std
2) make fs
3) run new kernel and rootfs on target