Wind River Support Network

Fixed

LIN7-4274 : WRL5: ixgbe driver was improved in case of ECC errors on NIC

Created: Jul 12, 2015
Updated: Sep 8, 2018
Resolved Date: Jul 16, 2015
Found In Version: 7.0.0.5
Fix Version: 7.0.0.8
Severity: Standard
Applicable for: Wind River Linux 7
Component/s: Kernel

Description

One of our nodes experienced many ECC errors on its Intel 10G NIC. As a consequence, the driver also reported Tx Unit Hang events (see the kernel message log below).

We found that the ixgbe driver's behavior in the case of ECC errors on the NIC has been improved upstream by this patch: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/intel?id=d773ce2de1c670e0d259870a2ea8fd9f60ab98cd

We need a backported version of this patch.

Log files (excerpt from the Linux kernel message log):
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115928] ixgbe 0000:05:00.0: eth5: Detected Tx Unit Hang
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115932] Tx Queue <10>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115934] TDH, TDT <32>, <36>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115936] next_to_use <36>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115938] next_to_clean <32>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115940] tx_buffer_info[next_to_clean]
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115942] time_stamp <14f7e1263>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115944] jiffies <14f7e20a0>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115951] ixgbe 0000:05:00.0: eth5: Detected Tx Unit Hang
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115955] Tx Queue <0>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115958] TDH, TDT <34>, <37>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115962] next_to_use <37>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115965] next_to_clean <34>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115968] tx_buffer_info[next_to_clean]
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115971] time_stamp <14f7e1263>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115974] jiffies <14f7e20a0>
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115979] ixgbe 0000:05:00.0: eth5: tx hang 256 detected on queue 9, resetting adapter
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115986] ixgbe 0000:05:00.0: eth5: tx hang 256 detected on queue 11, resetting adapter
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115992] ixgbe 0000:05:00.0: eth5: tx hang 256 detected on queue 10, resetting adapter
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.115998] ixgbe 0000:05:00.0: eth5: tx hang 256 detected on queue 0, resetting adapter
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.116003] ixgbe 0000:05:00.0: eth5: Reset adapter
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.116007] ixgbe 0000:05:00.0: eth5: tx hang 257 detected on queue 5, resetting adapter
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.219611] [PF_RING] NOTICE: netdev event: eth5 state 0004
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.224519] bonding: remux0: link status definitely down for interface eth5, disabling it
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.472239] ixgbe 0000:05:00.0: eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.474495] [PF_RING] NOTICE: netdev event: eth5 state 0004
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.503083] ixgbe 0000:05:00.0: eth5: Received unrecoverable ECC Err, please reboot
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.503465] ixgbe 0000:05:00.0: eth5: Received unrecoverable ECC Err, please reboot
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.509224] ixgbe 0000:05:00.0: eth5: Received unrecoverable ECC Err, please reboot
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.513951] ixgbe 0000:05:00.0: eth5: Received unrecoverable ECC Err, please reboot
[14:35:07 12/06/2015.000 eqm01s07p2 kernel @1.7 ] : : [1331424.520271] ixgbe 0000:05:00.0: eth5: Received unrecoverable ECC Err, please reboot
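
For triage, the excerpt above can be summarized with a short script. The following is a minimal sketch, not part of the original report: it counts the ixgbe events per interface and computes the gap between each time_stamp value and the jiffies value that follows it. The function name summarize and the regular expressions are illustrative assumptions based only on the message formats shown in this log. The gap is in jiffies; converting it to seconds requires the kernel's HZ setting, which the report does not state.

```python
import re
from collections import Counter

# Patterns derived from the message formats in the log excerpt (assumption:
# other kernel/driver versions may word these messages differently).
EVENT_PATTERNS = {
    "tx_unit_hang": re.compile(r"ixgbe \S+: (\w+): Detected Tx Unit Hang"),
    "adapter_reset": re.compile(r"ixgbe \S+: (\w+): Reset adapter"),
    "ecc_error": re.compile(r"ixgbe \S+: (\w+): Received unrecoverable ECC Err"),
}
# Matches lines such as "time_stamp <14f7e1263>" and "jiffies <14f7e20a0>".
JIFFIES_FIELD = re.compile(r"\b(time_stamp|jiffies)\s+<([0-9a-f]+)>")


def summarize(lines):
    """Return (event counts keyed by (interface, event), list of jiffies gaps)."""
    counts = Counter()
    gaps = []
    last_time_stamp = None
    for line in lines:
        for name, pattern in EVENT_PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), name)] += 1
        field = JIFFIES_FIELD.search(line)
        if field:
            value = int(field.group(2), 16)  # values are printed in hex
            if field.group(1) == "time_stamp":
                last_time_stamp = value
            elif last_time_stamp is not None:
                # jiffies line that follows a time_stamp line in the same dump
                gaps.append(value - last_time_stamp)
                last_time_stamp = None
    return counts, gaps


if __name__ == "__main__":
    sample = [
        "[1331424.115928] ixgbe 0000:05:00.0: eth5: Detected Tx Unit Hang",
        "[1331424.115942] time_stamp <14f7e1263>",
        "[1331424.115944] jiffies <14f7e20a0>",
        "[1331424.503083] ixgbe 0000:05:00.0: eth5: Received unrecoverable ECC Err, please reboot",
    ]
    counts, gaps = summarize(sample)
    for (iface, event), n in sorted(counts.items()):
        print(iface, event, n)
    print("jiffies gaps:", gaps)
```

Applied to the full excerpt, such a summary would make it easy to see how many ECC error messages followed the adapter reset on eth5.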

Impact of the problem: reboot of the whole node

Steps to Reproduce

This issue is very hard to reproduce; doing so requires emulating ECC errors on the Intel NIC.
Frequency of reproduction: seen only once
