Linux: e1000e kernel panic with Intel 82574L kernel panic and ASPM
So I bought myself an intel GigE card with the 82574L chip to deal with a certain problem I've been having with the RealTek one (cough). So I wanted to try another chip other then RealTek ones to see if that would make a difference.
I suppose one could argue I should have done more research because I install the card in the system and what do I run into after trying to transfer some files?
Sep 9 17:40:57 mbpc kernel: ————[ cut here ]————
Sep 9 17:40:57 mbpc kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Sep 9 17:40:57 mbpc kernel: Hardware name: GA-890XA-UD3
Sep 9 17:40:57 mbpc kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Sep 9 17:40:57 mbpc kernel: Modules linked in: cryptd aes_x86_64 aes_generic arc4 ar9170usb mac80211 ath cfg80211 rfkill fuse bonding nfsd lockd nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf 8021q garp stp llc ipt_REJECT ipt_LOG xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xfs exportfs dm_mirror dm_region_hash dm_log uinput raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_util_mem microcode sg serio_raw k10temp edac_core edac_mce_amd shpchp i2c_piix4 r8168(U) e1000e snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sr_mod cdrom firewire_ohci firewire_core crc_itu_t sd_mod crc_t10dif ahci pata_acpi ata_ge
Sep 9 17:40:57 mbpc kernel: neric pata_jmicron radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
Sep 9 17:40:57 mbpc kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.5.1.el6.x86_64 #1
Sep 9 17:40:57 mbpc kernel: Call Trace:
Sep 9 17:40:57 mbpc kernel: <IRQ> [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: <EOI> [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: [
Sep 9 17:40:57 mbpc kernel: —[ end trace aede656af05e9e90 ]—
Sep 9 17:40:57 mbpc kernel: e1000e 0000:03:00.0: eth1: Reset adapter
Sep 9 17:40:57 mbpc kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Sep 9 17:41:01 mbpc kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 9 17:41:01 mbpc kernel: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
Sep 9 17:41:50 mbpc kernel: e1000e 0000:03:00.0: eth1: Reset adapter
Sep 9 17:41:50 mbpc kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Sep 9 17:41:53 mbpc kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 9 17:41:54 mbpc kernel: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
Sep 9 17:42:14 mbpc smbd[9540]: [2012/09/09 17:42:14.893011, 0] lib/util_sock.c:474(read_fd_with_timeout)
Sep 9 17:42:14 mbpc smbd[9540]: [2012/09/09 17:42:14.893583, 0] lib/util_sock.c:1441(get_peer_addr_internal)
Sep 9 17:42:14 mbpc smbd[9540]: getpeername failed. Error was Transport endpoint is not connected
Sep 9 17:42:14 mbpc smbd[9540]: read_fd_with_timeout: client 0.0.0.0 read error = Connection reset by peer.
Sep 9 17:42:28 mbpc kernel: e1000e 0000:03:00.0: eth1: Reset adapter
Sep 9 17:42:28 mbpc kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Sep 9 17:42:29 mbpc smbd[2287]: [2012/09/09 17:42:29.932801, 0] smbd/server.c:281(remove_child_pid)
Sep 9 17:42:29 mbpc smbd[2287]: Could not find child 15205 — ignoring
Sep 9 17:42:32 mbpc kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 9 17:42:32 mbpc kernel: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
Sep 9 17:43:11 mbpc kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Sep 9 17:43:14 mbpc kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Sep 9 17:43:14 mbpc kernel: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
Sep 9 17:43:23 mbpc kernel: usb 3-3: USB setup failed (-19).
Sep 9 17:43:23 mbpc kernel: bonding: bond0: Adding slave wlan0.
Sep 9 17:43:23 mbpc kernel: usb 3-3: USB setup failed (-19).
Here's how we went about looking into this one:
One of the first things we did is to check online if this is reported and if newer software is available. To see how stable this can really be, we check out latest drivers from intel, compile and install them. Before that we'll also try the latest intel drivers from sourceforge.net One key thing with debugging these is to work with the latest software available then see if the issue reappears and finally apply any other solutions recommended. The last thing we want is to try to solve an issue only to see that the latest drivers resolve it already. So here we go.
In our case, again, we'll download the latest driver from Intel for the 82574L chip, see above for link, or directly using the below one:
wget http://downloadmirror.intel.com/15817/eng/e1000e-2.0.0.1.tar.gz
tar zxvf e100e*
navigate to the src folder and compile (note I elected not to install as I want to backup the old copy):
# make
make -C /lib/modules/2.6.32-279.5.1.el6.x86_64/build SUBDIRS=/root/e1000eIntel/e1000e-2.0.0.1/src modules
make[1]: Entering directory `/usr/src/kernels/2.6.32-279.5.1.el6.x86_64'
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/netdev.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/ethtool.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/param.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/82571.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/ich8lan.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/80003es2lan.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/mac.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/nvm.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/phy.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/manage.o
CC [M] /root/e1000eIntel/e1000e-2.0.0.1/src/kcompat.o
LD [M] /root/e1000eIntel/e1000e-2.0.0.1/src/e1000e.o
Building modules, stage 2.
MODPOST 1 modules
CC /root/e1000eIntel/e1000e-2.0.0.1/src/e1000e.mod.o
LD [M] /root/e1000eIntel/e1000e-2.0.0.1/src/e1000e.ko.unsigned
NO SIGN [M] /root/e1000eIntel/e1000e-2.0.0.1/src/e1000e.ko
make[1]: Leaving directory `/usr/src/kernels/2.6.32-279.5.1.el6.x86_64'
#
# ls -al /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko e1000e.ko
-rw-r–r–. 1 root root 5630679 Sep 10 01:42 e1000e.ko
-rwxr–r–. 1 root root 365368 Aug 14 17:48 /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
#
backup the old one.
cp -ip /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko ./e1000e.ko.2.6.32-279.5.1.el6.x86_64.ORIG
remove the old driver:
[root@mbpc src]# lsmod|grep -i e1000
e1000e 222073 0
[root@mbpc src]# rmmod e1000e
[root@mbpc src]# lsmod|grep -i e1000
[root@mbpc src]#
with expected results in /var/log/messages:
bonding: bond0: releasing active interface eth1
e1000e 0000:03:00.0: PCI INT A disabled
And copy over the old one then reinsert.
[root@mbpc src]# cp -ip e1000e.ko /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/
cp: overwrite `/lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko'? y
[root@mbpc src]#
[root@mbpc src]# chmod 755 /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
[root@mbpc src]# ls -al /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
-rwxr-xr-x. 1 root root 5630679 Sep 10 01:42 /lib/modules/2.6.32-279.5.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
[root@mbpc src]#
[root@mbpc src]# lsmod|grep -i e1000e
[root@mbpc src]# modprobe e1000e
[root@mbpc src]# lsmod|grep -i e1000e
e1000e 260558 0
[root@mbpc src]#
THE NEW ONE IS
e1000e: Intel(R) PRO/1000 Network Driver – 2.0.0.1-NAPI
e1000e: Copyright(c) 1999 – 2012 Intel Corporation.
e1000e 0000:03:00.0: Disabling ASPM L0s
e1000e 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
e1000e 0000:03:00.0: setting latency timer to 64
e1000e 0000:03:00.0: irq 29 for MSI/MSI-X
e1000e 0000:03:00.0: irq 30 for MSI/MSI-X
e1000e 0000:03:00.0: irq 31 for MSI/MSI-X
e1000e 0000:03:00.0: eth1: (PCI Express:2.5GT/s:Width x1) 68:05:ca:0a:80:d6
e1000e 0000:03:00.0: eth1: Intel(R) PRO/1000 Network Connection
e1000e 0000:03:00.0: eth1: MAC: 4, PHY: 8, PBA No: E46981-007
bonding: bond0: Adding slave eth1.
8021q: adding VLAN 0 to HW filter on device eth1
bonding: bond0: enslaving eth1 as an active interface with a down link.
e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
THE OLD ONE WAS
e1000e: Intel(R) PRO/1000 Network Driver – 1.9.5-k
e1000e: Copyright(c) 1999 – 2012 Intel Corporation.
e1000e 0000:03:00.0: Disabling ASPM L0s
e1000e 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
e1000e 0000:03:00.0: setting latency timer to 64
e1000e 0000:03:00.0: irq 29 for MSI/MSI-X
e1000e 0000:03:00.0: irq 30 for MSI/MSI-X
e1000e 0000:03:00.0: irq 31 for MSI/MSI-X
e1000e 0000:03:00.0: eth1: (PCI Express:2.5GT/s:Width x1) 68:05:ca:0a:80:d6
e1000e 0000:03:00.0: eth1: Intel(R) PRO/1000 Network Connection
e1000e 0000:03:00.0: eth1: MAC: 3, PHY: 8, PBA No: E46981-007
bonding: bond0: Adding slave eth1.
8021q: adding VLAN 0 to HW filter on device eth1
bonding: bond0: enslaving eth1 as an active interface with a down link.
e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
Now let's try with this new one. However noticed that we now get these, though the results are slightly better and no kernel panic (yet):
Sep 10 02:01:43 mbpc kernel: e1000e 0000:03:00.0: eth1: Reset adapter
Sep 10 02:01:43 mbpc kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
And unfortunately after some time we are back to this error when transfers are nearing 120MB/s:
Sep 11 01:07:06 mbpc kernel: ------------[ cut here ]------------ Sep 11 01:07:06 mbpc kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted) Sep 11 01:07:06 mbpc kernel: Hardware name: GA-890XA-UD3 Sep 11 01:07:06 mbpc kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out Sep 11 01:07:06 mbpc kernel: Modules linked in: bluetooth rfkill fuse bonding nfsd lockd nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf 8021q garp stp llc ipt_REJECT ipt_LOG xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xfs exportfs dm_mirror dm_region_hash dm_log uinput raid456 async_raid6_recov async_pq snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul raid6_pq snd_emu10k1 async_xor xor async_memcpy async_tx snd_rawmidi snd_ac97_codec ac97_bus snd_util_mem microcode sg xhci_hcd i2c_piix4 serio_raw edac_core edac_mce_amd k10temp shpchp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc r8168(U) e1000e(U) ext4 mbcache jbd2 firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_jmicron ahci radeon ttm drm_kms_helper Sep 11 01:07:06 mbpc kernel: drm i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Sep 11 01:07:06 mbpc kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.5.1.el6.x86_64 #1 Sep 11 01:07:06 mbpc kernel: Call Trace: Sep 11 01:07:06 mbpc kernel: <IRQ> [] ? warn_slowpath_common+0x87/0xc0 Sep 11 01:07:06 mbpc kernel: [ ] ? warn_slowpath_fmt+0x46/0x50 Sep 11 01:07:06 mbpc kernel: [ ] ? dev_watchdog+0x26d/0x280 Sep 11 01:07:06 mbpc kernel: [ ] ? insert_work+0x6d/0xb0 Sep 11 01:07:06 mbpc kernel: [ ] ? dev_watchdog+0x0/0x280 Sep 11 01:07:06 mbpc kernel: [ ] ? run_timer_softirq+0x197/0x340 Sep 11 01:07:06 mbpc kernel: [ ] ? tick_sched_timer+0x0/0xc0 Sep 11 01:07:06 mbpc kernel: [ ] ? lapic_next_event+0x1d/0x30 Sep 11 01:07:06 mbpc kernel: [ ] ? __do_softirq+0xc1/0x1e0 Sep 11 01:07:06 mbpc kernel: [ ] ? hrtimer_interrupt+0x140/0x250 Sep 11 01:07:06 mbpc kernel: [ ] ? call_softirq+0x1c/0x30 Sep 11 01:07:06 mbpc kernel: [ ] ? do_softirq+0x65/0xa0 Sep 11 01:07:06 mbpc kernel: [ ] ? irq_exit+0x85/0x90 Sep 11 01:07:06 mbpc kernel: [ ] ? smp_apic_timer_interrupt+0x70/0x9b Sep 11 01:07:06 mbpc kernel: [ ] ? apic_timer_interrupt+0x13/0x20 Sep 11 01:07:06 mbpc kernel: <EOI> [ ] ? native_safe_halt+0xb/0x10 Sep 11 01:07:06 mbpc kernel: [ ] ? default_idle+0x4d/0xb0 Sep 11 01:07:06 mbpc kernel: [ ] ? c1e_idle+0x9d/0x120 Sep 11 01:07:06 mbpc kernel: [ ] ? cpu_idle+0xb6/0x110 Sep 11 01:07:06 mbpc kernel: [ ] ? rest_init+0x7a/0x80 Sep 11 01:07:06 mbpc kernel: [ ] ? start_kernel+0x424/0x430 Sep 11 01:07:06 mbpc kernel: [ ] ? x86_64_start_reservations+0x125/0x129 Sep 11 01:07:06 mbpc kernel: [ ] ? x86_64_start_kernel+0xfa/0x109 Sep 11 01:07:06 mbpc kernel: ---[ end trace 762ff9093df5c720 ]---
Now we will try with the kmod-e1000e driver from elrepo.com. We start with installing the repor rpm. The repo will allow results using yum search e1000e:
# rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
Retrieving http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
warning: /var/tmp/rpm-tmp.fyaK1l: Header V4 DSA/SHA1 Signature, key ID baadae52: NOKEY
Preparing… ########################################### [100%]
1:elrepo-release ########################################### [100%]
#
Following that, we'll look for the kmod e1000e module to use:
kmod-e1000e.x86_64 : e1000e kernel module(s)
and unfortunately, this is a real cockroach and simply doesn't want to go away because even with that repo we still get:
Sep 11 08:16:16 mbpc kernel: ————[ cut here ]————
Sep 11 08:16:16 mbpc kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Sep 11 08:16:16 mbpc kernel: Hardware name: GA-890XA-UD3
Sep 11 08:16:16 mbpc kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Sep 11 08:16:16 mbpc kernel: Modules linked in: fuse bonding nfsd lockd nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf 8021q garp stp llc ipt_REJECT ipt_LOG xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xfs exportfs dm_mirror dm_region_hash dm_log uinput raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_util_mem microcode sg serio_raw k10temp edac_core edac_mce_amd shpchp i2c_piix4 r8168(U) e1000e(U) xhci_hcd snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 firewire_ohci firewire_core crc_itu_t sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_jmicron ahci radeon ttm drm_kms_helper drm i2c_algo_bit
Sep 11 08:16:16 mbpc kernel: i2c_core dm_mod [last unloaded: scsi_wait_scan]
Sep 11 08:16:16 mbpc kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.5.1.el6.x86_64 #1
Sep 11 08:16:16 mbpc kernel: Call Trace:
Sep 11 08:16:16 mbpc kernel: <IRQ> [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: <EOI> [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: [
Sep 11 08:16:16 mbpc kernel: —[ end trace 8c5c01167afc32df ]—
So we will try to reach out here to see what's going on after we revert back to the intel drivers (We'll work directly with the drivers from the manufacturer of the 82574L chip):
On our DD-WRT router we also run a watcher to keep track of the dmesg log so we can capture what occurred at a particular time when the driver crashed and WireShark on our Windows 7 client from which we are transferring files from:
root@DD-WRT-INTERNET:~# for i in $(awk 'BEGIN { for ( i=0; i<10000; i++ ) { print i; } }'); do sleep 5; date; dmesg|g
rep -v PROTO; done
to monitor messages there. This time also we insert the module with debug enabled to 16:
[root@mbpc ~]# modprobe e1000e debug=16
[root@mbpc ~]# modinfo -p e1000e
debug:Debug level (0=none,...,16=all)
Node:[ROUTING] Node to allocate memory on, default -1
EEE:Enable/disable on parts that support the feature
CrcStripping:Enable CRC Stripping, disable if your BMC needs the CRC
KumeranLockLoss:Enable Kumeran lock loss workaround
SmartPowerDownEnable:Enable PHY smart power down
IntMode:Interrupt Mode
InterruptThrottleRate:Interrupt Throttling Rate
RxAbsIntDelay:Receive Absolute Interrupt Delay
RxIntDelay:Receive Interrupt Delay
TxAbsIntDelay:Transmit Absolute Interrupt Delay
TxIntDelay:Transmit Interrupt Delay
copybreak:Maximum size of packet that is copied to a new buffer on receive
[root@mbpc ~]#
and get a slew of messages all containing:
Sep 12 01:34:28 mbpc kernel: ————[ cut here ]————
Sep 12 01:34:28 mbpc kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Sep 12 01:34:28 mbpc kernel: Hardware name: GA-890XA-UD3
Sep 12 01:34:28 mbpc kernel: NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
Sep 12 01:34:28 mbpc kernel: [
.
.
Sep 12 01:34:28 mbpc kernel: [
Sep 12 01:34:28 mbpc kernel: —[ end trace 84035930063641a1 ]—
Sep 12 01:34:28 mbpc kernel: e1000e 0000:03:00.0: Net device Info
Sep 12 01:34:28 mbpc kernel: e1000e: Device Name state trans_start last_rx
Sep 12 01:34:28 mbpc kernel: e1000e: eth1 0000000000000003 0000000100F39584 0000000000000000
Sep 12 01:34:28 mbpc kernel: e1000e 0000:03:00.0: Register Dump
Sep 12 01:34:28 mbpc kernel: e1000e: Register Name Value
Sep 12 01:34:28 mbpc kernel: e1000e: CTRL 18100248
Sep 12 01:34:28 mbpc kernel: e1000e: STATUS 00080783
Sep 12 01:34:28 mbpc kernel: e1000e: CTRL_EXT 99580000
Sep 12 01:34:28 mbpc kernel: e1000e: ICR 00000083
Sep 12 01:34:28 mbpc kernel: e1000e: RCTL 04048002
Sep 12 01:34:28 mbpc kernel: e1000e: RDLEN 00001000
Sep 12 01:34:28 mbpc kernel: e1000e: RDH 000000b7
Sep 12 01:34:28 mbpc kernel: e1000e: RDT 000000b0
Sep 12 01:34:28 mbpc kernel: e1000e: RDTR 00000020
Sep 12 01:34:28 mbpc kernel: e1000e: RXDCTL[0-1] 01040420 01040420
Sep 12 01:34:28 mbpc kernel: e1000e: ERT 00000000
Sep 12 01:34:28 mbpc kernel: e1000e: RDBAL 296e0000
Sep 12 01:34:28 mbpc kernel: e1000e: RDBAH 00000001
Sep 12 01:34:28 mbpc kernel: e1000e: RDFH 00000af8
Sep 12 01:34:28 mbpc kernel: e1000e: RDFT 00000af8
Sep 12 01:34:28 mbpc kernel: e1000e: RDFHS 00000af8
Sep 12 01:34:28 mbpc kernel: e1000e: RDFTS 00000af8
Sep 12 01:34:28 mbpc kernel: e1000e: RDFPC 00000000
Sep 12 01:34:28 mbpc kernel: e1000e: TCTL 3003f00a
Sep 12 01:34:28 mbpc kernel: e1000e: TDBAL 00495000
Sep 12 01:34:28 mbpc kernel: e1000e: TDBAH 00000001
Sep 12 01:34:28 mbpc kernel: e1000e: TDLEN 00001000
Sep 12 01:34:28 mbpc kernel: e1000e: TDH 0000004d
Sep 12 01:34:28 mbpc kernel: e1000e: TDT 0000004d
Sep 12 01:34:28 mbpc kernel: e1000e: TIDV 00000008
Sep 12 01:34:28 mbpc kernel: e1000e: TXDCTL[0-1] 0145011f 0145011f
Sep 12 01:34:28 mbpc kernel: e1000e: TADV 00000020
Sep 12 01:34:28 mbpc kernel: e1000e: TARC[0-1] 04000403 00000403
Sep 12 01:34:28 mbpc kernel: e1000e: TDFH 00001136
Sep 12 01:34:28 mbpc kernel: e1000e: TDFT 00001136
Sep 12 01:34:28 mbpc kernel: e1000e: TDFHS 00001136
Sep 12 01:34:28 mbpc kernel: e1000e: TDFTS 00001136
Sep 12 01:34:28 mbpc kernel: e1000e: TDFPC 00000000
Sep 12 01:34:28 mbpc kernel: e1000e 0000:03:00.0: Tx Ring Summary
Sep 12 01:34:28 mbpc kernel: e1000e: Queue [NTU] [NTC] [bi(ntc)->dma ] leng ntw timestamp
Sep 12 01:34:28 mbpc kernel: e1000e: 0 4D 61 00000000BCE064DA 0036 61 0000000100F39573
Sep 12 01:34:28 mbpc kernel: e1000e 0000:03:00.0: Tx Ring Dump
Sep 12 01:34:28 mbpc kernel: e1000e: Tl[desc] [address 63:0 ] [SpeCssSCmCsLen] [bi->dma ] leng ntw timestamp bi->skb <– Legacy format
Sep 12 01:34:28 mbpc kernel: e1000e: Tc[desc] [Ce CoCsIpceCoS] [MssHlRSCm0Plen] [bi->dma ] leng ntw timestamp bi->skb <– Ext Context format
Sep 12 01:34:28 mbpc kernel: e1000e: Td[desc] [address 63:0 ] [VlaPoRSCm1Dlen] [bi->dma ] leng ntw timestamp bi->skb <– Ext Data format
Sep 12 01:34:28 mbpc kernel: e1000e: Td[0x000] 00000000C266D7EA 00000201AB100033 00000000C266D7EA 0033 0 0000000100F3957F ffff88008bd63468
Sep 12 01:34:28 mbpc kernel: ffff8800c266d7ea: 00 00 00 2f ff 53 4d 42 2f 00 00 00 00 88 03 c8 …/.SMB/…….
Sep 12 01:34:28 mbpc kernel: ffff8800c266d7fa: 00 00 00 00 00 00 00 00 00 00 00 00 01 00 ff fe …………….
Sep 12 01:34:28 mbpc kernel: ffff8800c266d80a: 64 00 c0 86 06 ff 00 00 00 00 80 00 00 00 00 00 d……………
Sep 12 01:34:28 mbpc kernel: ffff8800c266d81a: 00 00 00 …
Sep 12 01:34:28 mbpc kernel: e1000e: Tl[0x001] 00000000CC3A6CDA 000000018B000036 00000000CC3A6CDA 0036 1 0000000100F3957F ffff8801282bccc0
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6cda: 14 da e9 19 95 b5 1c 6f 65 3f fc 14 08 00 45 00 …….oe?….E.
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6cea: 00 28 4d 0f 40 00 40 06 6c 53 c0 a8 00 0e c0 a8 .(M.@.@.lS……
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6cfa: 00 0f 01 bd f0 51 2d 56 18 7e a2 72 5d 5e 50 10 …..Q-V.~.r]^P.
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6d0a: 56 2e a0 84 00 00 V…..
Sep 12 01:34:28 mbpc kernel: e1000e: Tl[0x002] 00000000B5674CDA 000000018B000036 00000000B5674CDA 0036 2 0000000100F3957F ffff8800cc37ae80
Sep 12 01:34:28 mbpc kernel: ffff8800b5674cda: 14 da e9 19 95 b5 1c 6f 65 3f fc 14 08 00 45 00 …….oe?….E.
Sep 12 01:34:28 mbpc kernel: ffff8800b5674cea: 00 28 4d 10 40 00 40 06 6c 52 c0 a8 00 0e c0 a8 .(M.@.@.lR……
Sep 12 01:34:28 mbpc kernel: ffff8800b5674cfa: 00 0f 01 bd f0 51 2d 56 18 7e a2 72 6e 7a 50 10 …..Q-V.~.rnzP.
Sep 12 01:34:28 mbpc kernel: ffff8800b5674d0a: 56 2e 8f 68 00 00 V..h..
Sep 12 01:34:28 mbpc kernel: e1000e: Tc[0x003] 0000322200000000 0000000021000000 0000000000000000 0036 5 0000000100F3957F (null)
Sep 12 01:34:28 mbpc kernel: e1000e: Td[0x004] 00000000CC3A6A4A 0000020022100036 00000000CC3A6A4A 0036 4 0000000100F3957F (null)
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6a4a: 14 da e9 19 95 b5 1c 6f 65 3f fc 14 08 00 45 00 …….oe?….E.
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6a5a: 00 5b 4d 11 40 00 40 06 6c 1e c0 a8 00 0e c0 a8 .[M.@.@.l…….
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6a6a: 00 0f 01 bd f0 51 2d 56 18 7e a2 72 6e 7a 50 18 …..Q-V.~.rnzP.
Sep 12 01:34:28 mbpc kernel: ffff8800cc3a6a7a: 56 2e 81 bb 00 00 V…..
Sep 12 01:34:28 mbpc kernel: e1000e: Td[0x005] 00000000C266D81D 00000201AB100033 00000000C266D81D 0033 5 0000000100F3957F ffff88008bd63268
Sep 12 01:34:28 mbpc kernel: ffff8800c266d81d: 00 00 00 2f ff 53 4d 42 2f 00 00 00 00 88 03 c8 …/.SMB/…….
Sep 12 01:34:28 mbpc kernel: ffff8800c266d82d: 00 00 00 00 00 00 00 00 00 00 00 00 01 00 ff fe …………….
Sep 12 01:34:28 mbpc kernel: ffff8800c266d83d: 64 00 01 87 06 ff 00 00 00 00 80 00 00 00 00 00 d……………
Sep 12 01:34:28 mbpc kernel: ffff8800c266d84d: 00 00 00 …
#
However, upon further reading of the intel post that deals with this issue we notice that this may be a hardware issue that might not be solvable by software:
The ASPM problem with 82574L is hardware based and is not solvable in software other than to disable it.
Since the platforms vary in their reliability in disabling the feature from the driver, your best option
is to always boot with pcie_aspm=off with that part in the system.
So, at least we are clear in your situation, the ASPM needs to be disabled. Please let me know if there are
continued problems after booting with pcie_aspm=off.
Thanks,
Carolyn
Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation
If I understand correctly, e1000e attempts to disable ASPM to work around an 82574L hardware erratum, but the PCI core either doesn't disable ASPM or it gets re-enabled somehow. [ From: Bjorn Helgaasgoogle.com> ]
but does suggest a few things to try:
pcie_aspm=off
setpci -s 02:00.0 CAP_EXP+10.b=40
setpci -s 03:00.0 CAP_EXP+10.b=40
but no luck and the same issue repeats again. Now at this point we have a rather interesting decision. This is a newly baught card so returning it to the store is still an option. The main driver of this decision is that if this is a hardware based issue, no amount of tinkering and duck-tape-solutions are going to give us the stability we're looking for. Do we really want to commit so much time to debugging this if the driver / hardware is supposed to handle these exceptions and only end up with a very delicate configuration that might not survive across reboots? If a future update occurs somewhere in the software rendering one of the options entirely useless we could be stuck with an unusable NIC. Unfortunately on this MB I could not disable L0's and L1's. No such option.
But let's give it a couple more chances. After a quick lookup of the messages from the extra debug, looks like we might be hitting this bug.
At this point I was nearing the return policy time limit of the card to the store where I got it from. So after giving it some more tries in different combinations to see if the card can sustain speeds near 100MB/s (Roughly 90% of it's theoretical max) without much luck, just disconnects and similar messages in the logs, it was time to look for and try my luck with another model of the intel variants: Intel 82541PI.
Fast forward to March 24 2013:
After purchasing the Intel card with the 82541PI NIC the results were nothing short of on target. Not only did the card work right out of the box it sustained high transfer rates of around 110MB/s over my network. Nothing wrong with that and nothing more can be said when something works as expected. The card is typically labeled Intel Corporation PRO/1000 GT Desktop Adapter:
Enjoy! Hopefully your adventure was shortened significantly by this, unlike mine. 🙂
Cheers,
TK
Hello,
Looks like i have a similar issues,where you able to resolve this ?
May be you try the 2.14,which released recently.From the change log,look like they have addressed ASPM issue with 82574 chips.
Changelog:
* Upstream commit d4a4206ebbaf48b55803a7eb34e330530d83a889 – e1000e: Disable ASPM L1 on 82574 * Upstream commit 727c356f4d799b53f94cf8fe43e19d64482348c7 – e1000e: Fix default interrupt throttle rate not set in NIC HW * Upstream commit 569a3aff70e880588fe4b3f1622ac60abbeb4a28 – e1000e: MSI
Hey Ryan,
Thanks very much for the reply. Appreciated.
Yup. If I recall correctly (been a couple of months) I believe I already tried those 2.14 ones about 2 – 3 months ago when I was testing that Intel NIC. Part of my requirements here was to see sustained speeds of about 90% of the theoretical max. The card could not. With time ticking away and with this being labeled as a hardware issue (ie not solvable in software) it would be a very delicate duck-tape solution at best which could resurface as soon as I upgrade the system at a later date. The 82541PI upon further checks doesn't have the same hardware issue but I still need to do more testing on it. There were also some limitations on my M/B with regards to setting L0 / L1.
So I was down 2 (One Realtek RTL8111/8168B and this one 82574L with only one, the 82541PI to go. 🙂 With my HTPC storage running at 300MB/s Reads and Writes (Have a search on the site for it), the card simply has to perform.
Definitely reply back if you do get it working on your system and if you managed to get it to sustain speeds of about 90% of the theoretical max for more then 5 minutes.
Cheers,
TK
Tom,
I have 82574L runnung on a production server,which has multiple Virtual Machines running on it.I noticed this happening twice a month,where the NIC lost the connection for 4 seconds,no kernel panic noticed though..but the VM did lose the network for 2-3 minutes.Although,there is lot of data transfer that happens (approx 50MB/sec),the card doesnt seem to fail always,but only under certain condtions ?
Does the card fail only when the data transfers reaches close to 100 MB/sec?Should there be a sustained transfer rate of 100MB/s ?I'd like to know under what circumstances does the card fail for you.Tomorrow i will run in a test where i'll try to saturate the gig link (100MB/s) and see if the card misbehaves.I have lot of servers at my work place,running on 82754L,but i havent noticed anything wrong with them yet,except for this server which is on Gig link hosting multiple VMs.
Currently the machine in inquestion is running Centos 6.2 with 3.6.6-1.el6xen.x86_64 kernel and e1000e driver 2.0.0-k. More updates to follow.
http://bugs.centos.org/view.php?id=6018
Hey Ryan,
Latest Intel drivers, in case you want to fall back on it after tests. Looks like they posted that within hours later of me getting the old one I got above (Unbeleaveable!):
# ls -al /root/Downloads/e1000e-2.0.0.1.tar.gz
-rw-r–r–. 1 root root 258613 Sep 10 01:39 /root/Downloads/e1000e-2.0.0.1.tar.gz
#
The NIC for me tanked every single time when I tested after applying a whole set of combinations and various drivers + trying to set parameters here and there for L1. I expected they disabled the L1 / L0 in the driver I used as well but it looks like that may have only been conditional when I got the drivers (see below). Not sure if the 2.0.0.1-NAPI is newer then the 2.0.0-k driver you have.
The card failed for me shortly after starting large test file (Some linux ISO's will do.) transfers and speeds roughly around 100MB's or approaching 100MB's. Sometimes the connection lasted up to a minute and other times it tanked after a few seconds. Card's theoretical max is 125MB's. My TP-LINK switches were all 1Gb's enabled. I've did a diff of the two versions, the one now and the one I got before:
Looks like the drivers I got, which were hours before Intel updated their driver, might not have had the L1 disabled so for me, after digging into the differences briefly, so the L1 flag might have gotten reenabled by the driver even after I set the kernel parameters. So looks like I was out of luck when I tried. This is good for you. You have the latest kernel and latest Intel drivers to work with but my 82754L has already been returned to the store quite some time ago. :). The good thing is is that I don't see anything in the changelog listed for the 82541PI but that might be in another file. Including the changelog here for future comparison in case the issue still shows up:
One thing I do recall is that I did notice the L1 at some point in the logs and I vaguey recall seeing it reenabled, which would make sense considering the driver differences I'm seeing now. One other thing you have going for you is also the release of the latest kernel since my testing above. Looking forward to hearing from you about your tests. You may want to load up the module in debug mode (see post above) to get more messages out of the tests.
Cheers,
TK
Hey Ryan,
Should add or elaborate that for my tests above, I expected the card to fully complete the transfer at speeds close to it's theoretical max irrespective of how long it took. (If you can see speeds close to theoretical MAX for 5 or more minutes, it would have then been for far longer then I managed to get the card to work under my tests.) At much lower speeds, such as 20MB and below the issue was less common if at all present. For example, if I did just small file transfers to and from the storage, the card probably would never fail. As I was doing full blown tests to stress test it virtually all my tests went up to the high transfer rates hence likely why the issue was realized every single time in my tests. The card should have been able to handle the speed and simply not allow any further increase in speed.
Cheers,
TK