STC840.20:tuleta:tul516p01 panic after injecting Leaf EEH

Bug #1581034 reported by bugproxy
This bug affects 1 person
Affects          Status         Importance  Assigned to  Milestone
linux (Ubuntu)   Fix Released   High        Unassigned
Xenial           Fix Released   High        Tim Gardner

Bug Description

Dear Canonical,

There is a bug in the nvme device driver that causes EEH handling to break during an EEH event. This causes an Oops in the nvme driver and makes the entire machine unavailable. This is the trace that we see when the problem occurs:

        [ 121.614394] Unable to handle kernel paging request for data at address 0x00000020
        [ 121.614524] Faulting instruction address: 0xd00000000dfb5530
        [ 121.614602] Oops: Kernel access of bad area, sig: 11 [#1]
        [ 121.614654] SMP NR_CPUS=2048 NUMA pSeries
        [ 121.614713] Modules linked in: rpadlpar_io rpaphp nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag pseries_rng rtc_generic binfmt_misc sunrpc autofs4 mlx4_en vxlan ip6_udp_tunnel udp_tunnel dm_round_robin ses enclosure lpfc mlx4_core scsi_transport_fc nvme ipr scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
        [ 121.615390] CPU: 18 PID: 19973 Comm: hxecpu Not tainted 4.4.0-21-generic #37-Ubuntu
        [ 121.615450] task: c000001fbb589370 ti: c000001fc7148000 task.ti: c000001fc7148000
        [ 121.615478] NIP: d00000000dfb5530 LR: d00000000dfb5650 CTR: d00000000dfb5550
        [ 121.615497] REGS: c000001fc714b700 TRAP: 0300 Not tainted (4.4.0-21-generic)
        [ 121.615512] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 39090553 XER: a0000000
        [ 121.615686] CFAR: c000000000008468 DAR: 0000000000000020 DSISR: 40000000 SOFTE: 1
        GPR00: d00000000dfb5650 c000001fc714b980 d00000000dfc9178 c000001fdcd0e000
        GPR04: c000001fc1599200 0000000000000000 0000000000000000 0000000000000001
        GPR08: 0000000000000000 0000000000000000 0000000000000020 0000000000000005
        GPR12: d00000000dfb5550 c00000000e7eab00 c000001ff348a938 0000000000000100
        GPR16: c000001ff348a738 c000001ff348a538 0000001ff2500000 0000000000000000
        GPR20: c000001fc714bc40 c000000000f89d00 c000001fc714bb70 0000000000000000
        GPR24: 0000000000000001 c000000000548ae0 0000000000000020 c000001fdcd0e000
        GPR28: 00000000000001ff c000001fc1599200 0000000000000000 c000001fc1599200
        [ 121.616623] NIP [d00000000dfb5530] nvme_free_iod+0x100/0x120 [nvme]
        [ 121.616701] LR [d00000000dfb5650] nvme_complete_rq+0x100/0x240 [nvme]
        [ 121.616743] Call Trace:
        [ 121.616782] [c000001fc714b980] [0000000000000908] 0x908 (unreliable)
        [ 121.616851] [c000001fc714b9d0] [d00000000dfb5650] nvme_complete_rq+0x100/0x240 [nvme]
        [ 121.616925] [c000001fc714ba50] [c00000000054860c] __blk_mq_complete_request+0xbc/0x1b0
        [ 121.616990] [c000001fc714ba90] [c00000000054c540] bt_for_each+0x160/0x170
        [ 121.617074] [c000001fc714bb00] [c00000000054d4e8] blk_mq_queue_tag_busy_iter+0x78/0x110
        [ 121.617156] [c000001fc714bb50] [c000000000547358] blk_mq_rq_timer+0x48/0x140
        [ 121.617226] [c000001fc714bb90] [c00000000014a13c] call_timer_fn+0x5c/0x1c0
        [ 121.617296] [c000001fc714bc20] [c00000000014a5fc] run_timer_softirq+0x31c/0x3f0
        [ 121.617370] [c000001fc714bcf0] [c0000000000beb78] __do_softirq+0x188/0x3e0
        [ 121.617442] [c000001fc714bde0] [c0000000000bf048] irq_exit+0xc8/0x100
        [ 121.617507] [c000001fc714be00] [c00000000001f954] timer_interrupt+0xa4/0xe0
        [ 121.617562] [c000001fc714be30] [c000000000002714] decrementer_common+0x114/0x180
        [ 121.617619] Instruction dump:
        [ 121.617663] e8010010 eb41ffd0 eb61ffd8 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8
        [ 121.617829] 4e800020 60000000 60000000 60420000 <7c88502a> e87b0110 7fc5f378 48008d95
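
The faulting address can be cross-checked against the register dump. The following is a minimal Python sketch, assuming that the bracketed word `<7c88502a>` in the instruction dump is the instruction at NIP and that it follows the standard Power ISA X-form layout. The helper name `decode_xform` is illustrative and does not come from any kernel tool.

```python
# Decode the bracketed word from the "Instruction dump" as a Power ISA
# X-form instruction and recompute the effective address it accessed.
# Assumption: the bracketed word is the instruction at NIP.

def decode_xform(word):
    """Split a 32-bit X-form instruction into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # primary opcode
        "rt": (word >> 21) & 0x1F,       # target register
        "ra": (word >> 16) & 0x1F,       # first address operand
        "rb": (word >> 11) & 0x1F,       # second address operand
        "xo": (word >> 1) & 0x3FF,       # extended opcode
    }

# Register values copied from the GPR dump in the trace above.
gpr = {8: 0x0000000000000000, 10: 0x0000000000000020}

fields = decode_xform(0x7C88502A)

# Primary opcode 31 with extended opcode 21 is ldx (load doubleword indexed).
is_ldx = fields["opcode"] == 31 and fields["xo"] == 21

# For ldx, EA = (RA|0) + RB, where a register number of 0 for RA means
# a literal zero rather than the contents of r0.
base = 0 if fields["ra"] == 0 else gpr[fields["ra"]]
effective_address = (base + gpr[fields["rb"]]) & 0xFFFFFFFFFFFFFFFF

print(fields, is_ldx, hex(effective_address))
```

Under these assumptions the instruction is a load that adds r8 (zero) and r10 (0x20), which matches the DAR value in the trace and is consistent with reading through a NULL pointer at offset 0x20 inside nvme_free_iod.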

This bug was already fixed upstream (version 4.5), and these are the commit IDs that contain the fix:

 * 646017a612e7 ("NVMe: Fix namespace removal deadlock")
 * 69d9a99c258e ("NVMe: Move error handling to failed reset handler")
 * a59e0f5795fe5 ("blk-mq: End unstarted requests on dying queue")
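
Because backported commits receive new hashes, checking whether a given tree already carries these fixes is likely more reliable by subject line than by commit ID. The following is a minimal Python sketch, assuming the input text is the output of something like `git log --format=%s` on the tree being checked, and assuming the backports keep the upstream subject unchanged. The function name `missing_fixes` is hypothetical.

```python
# Subjects of the three upstream fixes listed above.
REQUIRED_SUBJECTS = [
    "NVMe: Fix namespace removal deadlock",
    "NVMe: Move error handling to failed reset handler",
    "blk-mq: End unstarted requests on dying queue",
]

def missing_fixes(log_text):
    """Return the required subjects that do not appear in log_text."""
    subjects = {line.strip() for line in log_text.splitlines()}
    return [s for s in REQUIRED_SUBJECTS if s not in subjects]

# Made-up log excerpt that contains only two of the three fixes.
sample_log = """\
NVMe: Fix namespace removal deadlock
blk-mq: End unstarted requests on dying queue
"""
print(missing_fixes(sample_log))
```

An empty result would suggest all three fixes are present in the log that was scanned.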

Backports for each of these patches are attached.

Please apply them to the 16.04 kernel.

bugproxy (bugproxy) wrote : dmesg from kdump

Default Comment by Bridge

tags: added: architecture-ppc64 bugnameltc-140746 severity-critical targetmilestone-inin1604
bugproxy (bugproxy) wrote : 0001-NVMe-Fix-namespace-removal-deadlock.patch

Default Comment by Bridge

bugproxy (bugproxy) wrote : 0002-NVMe-Move-error-handling-to-failed-reset-handler.patch

Default Comment by Bridge

bugproxy (bugproxy) wrote : 0003-blk-mq-End-unstarted-requests-on-dying-queue.patch

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1581034/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical (canonical)
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu Xenial):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
assignee: Canonical (canonical) → nobody
status: Triaged → Fix Released
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → In Progress
Tim Gardner (timg-tpi) wrote :
bugproxy (bugproxy) wrote : 0001-NVMe-Fix-namespace-removal-deadlock.patch

Default Comment by Bridge

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-06-06 10:32 EDT-------
> This bug was already fixed upstream (version 4.5), and these are the commit
> IDs that contain the fix:
>
>
> * 646017a612e7 ("NVMe: Fix namespace removal deadlock")
> * 69d9a99c258e ("NVMe: Move error handling to failed reset handler")
> * a59e0f5795fe5 ("blk-mq: End unstarted requests on dying queue")
>

Canonical, any updates? We need this applied to the ubuntu kernel.

Tim Gardner (timg-tpi) wrote :

These patches were applied May 26, 2016. They should be released with Ubuntu-4.4.0-24.42. You can check for yourself at 'git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial master-next'
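
To tell whether a running kernel is at or past that release, the ABI number in the `uname -r` string can be compared. The following is a minimal Python sketch, assuming release strings of the form `4.4.0-21-generic` as seen in the trace, and taking 4.4.0-24 as the first fixed ABI based on the release named above. The helper names are hypothetical, and the comparison is only meaningful for the Xenial 4.4 series.

```python
import re

# First ABI expected to carry the fix, per the release named above
# (Ubuntu-4.4.0-24.42). This threshold is an assumption drawn from that comment.
FIRST_FIXED = (4, 4, 0, 24)

def parse_release(release):
    """Turn '4.4.0-21-generic' into (4, 4, 0, 21)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())

def has_fix(release):
    """True if the release is at or after the first fixed ABI."""
    return parse_release(release) >= FIRST_FIXED

print(has_fix("4.4.0-21-generic"))  # kernel from the oops trace
print(has_fix("4.4.0-25-generic"))  # a later kernel
```

On a live system the release string could be taken from `platform.release()` in the Python standard library.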

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-09 13:53 EDT-------
(In reply to comment #39)
> These patches were applied May 26, 2016. They should be released with
> Ubuntu-4.4.0-24.42. You can check for yourself at
> 'git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial
> master-next'

Thanks!

We'll get it tested internally.

Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-17 08:40 EDT-------
(In reply to comment #41)
> This bug is awaiting verification that the kernel in -proposed solves the
> problem. Please test the kernel and update this bug with the results. If the
> problem is solved, change the tag 'verification-needed-xenial' to
> 'verification-done-xenial'.
>
> If verification is not done by 5 working days from today, this fix will be
> dropped from the source code, and this bug will be closed.
>
> See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to
> enable and use -proposed. Thank you!

I used kernel 4.4.0-25-generic from xenial-proposed, and was able to perform 129 iterations of the leaf-eeh tool without crashing. Issue looks fixed with that kernel, thanks!
Moving to verified...
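
A verification run like this can be paired with a check of the kernel log for the original failure signature. The following is a minimal Python sketch, assuming the log text comes from `dmesg` output and using two markers taken from the trace in the bug description. The function name `has_nvme_oops` is hypothetical, and this is not part of the leaf-eeh tool.

```python
# Markers taken from the oops in the bug description.
OOPS_MARKER = "Oops: Kernel access of bad area"
NIP_MARKER = "nvme_free_iod"

def has_nvme_oops(log_text):
    """Return True if the log shows an oops whose NIP line names nvme_free_iod."""
    seen_oops = False
    for line in log_text.splitlines():
        if OOPS_MARKER in line:
            seen_oops = True
        elif seen_oops and "NIP [" in line:
            if NIP_MARKER in line:
                return True
            seen_oops = False  # this oops was in some other function
    return False

# Two lines copied from the trace in the bug description.
sample = (
    "[ 121.614602] Oops: Kernel access of bad area, sig: 11 [#1]\n"
    "[ 121.616623] NIP [d00000000dfb5530] nvme_free_iod+0x100/0x120 [nvme]\n"
)
print(has_nvme_oops(sample))
```

Running such a check after each injection would flag a regression even if the machine stays up long enough to keep logging.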

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-28.47

---------------
linux (4.4.0-28.47) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1595874

  * Linux netfilter local privilege escalation issues (LP: #1595350)
    - netfilter: x_tables: don't move to non-existent next rule
    - netfilter: x_tables: validate targets of jumps
    - netfilter: x_tables: add and use xt_check_entry_offsets
    - netfilter: x_tables: kill check_entry helper
    - netfilter: x_tables: assert minimum target size
    - netfilter: x_tables: add compat version of xt_check_entry_offsets
    - netfilter: x_tables: check standard target size too
    - netfilter: x_tables: check for bogus target offset
    - netfilter: x_tables: validate all offsets and sizes in a rule
    - netfilter: x_tables: don't reject valid target size on some architectures
    - netfilter: arp_tables: simplify translate_compat_table args
    - netfilter: ip_tables: simplify translate_compat_table args
    - netfilter: ip6_tables: simplify translate_compat_table args
    - netfilter: x_tables: xt_compat_match_from_user doesn't need a retval
    - netfilter: x_tables: do compat validation via translate_table
    - netfilter: x_tables: introduce and use xt_copy_counters_from_user

  * Linux netfilter IPT_SO_SET_REPLACE memory corruption (LP: #1555338)
    - netfilter: x_tables: validate e->target_offset early
    - netfilter: x_tables: make sure e->next_offset covers remaining blob size
    - netfilter: x_tables: fix unconditional helper

linux (4.4.0-27.46) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1594906

  * Support Edge Gateway's Bluetooth LED (LP: #1512999)
    - Revert "UBUNTU: SAUCE: Bluetooth: Support for LED on Marvell modules"

linux (4.4.0-26.45) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1594442

  * linux: Implement secure boot state variables (LP: #1593075)
    - SAUCE: UEFI: Add secure boot and MOK SB State disabled sysctl

  * failures building userspace packages that include ethtool.h (LP: #1592930)
    - ethtool.h: define INT_MAX for userland

linux (4.4.0-25.44) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1591289

  * Xenial update to v4.4.13 stable release (LP: #1590455)
    - MIPS64: R6: R2 emulation bugfix
    - MIPS: math-emu: Fix jalr emulation when rd == $0
    - MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC
    - MIPS: Don't unwind to user mode with EVA
    - MIPS: Avoid using unwind_stack() with usermode
    - MIPS: Fix siginfo.h to use strict posix types
    - MIPS: Fix uapi include in exported asm/siginfo.h
    - MIPS: Fix watchpoint restoration
    - MIPS: Flush highmem pages in __flush_dcache_page
    - MIPS: Handle highmem pages in __update_cache
    - MIPS: Sync icache & dcache in set_pte_at
    - MIPS: ath79: make bootconsole wait for both THRE and TEMT
    - MIPS: Reserve nosave data for hibernation
    - MIPS: Loongson-3: Reserve 32MB for RS780E integrated GPU
    - MIPS: Use copy_s.fmt rather than copy_u.fmt
    - MIPS: Fix MSA ld_*/st_* asm macros to use PTR_ADDU
    - MIPS: Prevent "restoration" of MSA c...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released