STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times during boot then disabled SRC BA188002:b0314a_1612.840

Bug #1587316 reported by bugproxy
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   High         Unassigned
  Xenial         Fix Released   Undecided    Tim Gardner
  Yakkety        Fix Released   High         Unassigned

Bug Description

== Comment: #0 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:09 ==

== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:11 ==
==== State: Open by: mlfield on 21 March 2016 14:45:01 ====

==========================Automatic entries==========================
Contact: LittleField, Michael *CONTRACTOR*
Backup: Thirukumaran V T (<email address hidden>), Deepti Umarani (<email address hidden>), Brian M. Carpenter(<email address hidden>)

===== sys_capture v5.24 === 2016-03-21_14-25-41 ===========

|
| |
| System Hardware Information:
| NODE /Sys-0/Node-0, U78C7.001.1AQH383-P2
| FSP /Sys-0/Node-0/FSP-0, FSP-2 DD 1.0, U78C7.001.1AQH383-P1-C5
| PSI /Sys-0/Node-0/FSP-0/PSI-0
| PSI /Sys-0/Node-0/FSP-0/PSI-1
| MEMBUF /Sys-0/Node-0/Membuf-12, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C11
| MEMBUF /Sys-0/Node-0/Membuf-13, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C10
| MEMBUF /Sys-0/Node-0/Membuf-14, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C12
| MEMBUF /Sys-0/Node-0/Membuf-15, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C13
| MEMBUF /Sys-0/Node-0/Membuf-20, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C23
| MEMBUF /Sys-0/Node-0/Membuf-21, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C22
| MEMBUF /Sys-0/Node-0/Membuf-22, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C24
| MEMBUF /Sys-0/Node-0/Membuf-23, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C25
| MEMBUF /Sys-0/Node-0/Membuf-28, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C19
| MEMBUF /Sys-0/Node-0/Membuf-29, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C18
| MEMBUF /Sys-0/Node-0/Membuf-30, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C20
| MEMBUF /Sys-0/Node-0/Membuf-31, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C21
| MEMBUF /Sys-0/Node-0/Membuf-36, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C31
| MEMBUF /Sys-0/Node-0/Membuf-37, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C30
| MEMBUF /Sys-0/Node-0/Membuf-38, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C32
| MEMBUF /Sys-0/Node-0/Membuf-39, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C33
| MEMBUF /Sys-0/Node-0/Membuf-4, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C15
| MEMBUF /Sys-0/Node-0/Membuf-44, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C27
| MEMBUF /Sys-0/Node-0/Membuf-45, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C26
| MEMBUF /Sys-0/Node-0/Membuf-46, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C28
| MEMBUF /Sys-0/Node-0/Membuf-47, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C29
| MEMBUF /Sys-0/Node-0/Membuf-5, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C14
| MEMBUF /Sys-0/Node-0/Membuf-52, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C39
| MEMBUF /Sys-0/Node-0/Membuf-53, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C38
| MEMBUF /Sys-0/Node-0/Membuf-54, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C40
| MEMBUF /Sys-0/Node-0/Membuf-55, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C41
| MEMBUF /Sys-0/Node-0/Membuf-6, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C16
| MEMBUF /Sys-0/Node-0/Membuf-60, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C35
| MEMBUF /Sys-0/Node-0/Membuf-61, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C34
| MEMBUF /Sys-0/Node-0/Membuf-62, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C36
| MEMBUF /Sys-0/Node-0/Membuf-63, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C37
| MEMBUF /Sys-0/Node-0/Membuf-7, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C17
| PROC /Sys-0/Node-0/Proc-0, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
| CORE /Sys-0/Node-0/Proc-0/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-4/Core-0
| PCI /Sys-0/Node-0/Proc-0/PCI-0
| PCI /Sys-0/Node-0/Proc-0/PCI-1
| PCI /Sys-0/Node-0/Proc-0/PCI-2
| PSI /Sys-0/Node-0/Proc-0/PSI-0
| PROC /Sys-0/Node-0/Proc-1, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
| CORE /Sys-0/Node-0/Proc-1/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-1/PCI-0
| PCI /Sys-0/Node-0/Proc-1/PCI-1
| PCI /Sys-0/Node-0/Proc-1/PCI-2
| PROC /Sys-0/Node-0/Proc-2, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
| CORE /Sys-0/Node-0/Proc-2/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-2/PCI-0
| PCI /Sys-0/Node-0/Proc-2/PCI-1
| PCI /Sys-0/Node-0/Proc-2/PCI-2
| PSI /Sys-0/Node-0/Proc-2/PSI-0
| PROC /Sys-0/Node-0/Proc-3, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
| CORE /Sys-0/Node-0/Proc-3/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-3/PCI-0
| PCI /Sys-0/Node-0/Proc-3/PCI-1
| PCI /Sys-0/Node-0/Proc-3/PCI-2
| PROC /Sys-0/Node-0/Proc-4, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
| CORE /Sys-0/Node-0/Proc-4/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-4/PCI-0
| PCI /Sys-0/Node-0/Proc-4/PCI-1
| PCI /Sys-0/Node-0/Proc-4/PCI-2
| PROC /Sys-0/Node-0/Proc-5, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
| CORE /Sys-0/Node-0/Proc-5/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-4/Core-0
| PCI /Sys-0/Node-0/Proc-5/PCI-0
| PCI /Sys-0/Node-0/Proc-5/PCI-1
| PCI /Sys-0/Node-0/Proc-5/PCI-2
| PROC /Sys-0/Node-0/Proc-6, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
| CORE /Sys-0/Node-0/Proc-6/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-6/PCI-0
| PCI /Sys-0/Node-0/Proc-6/PCI-1
| PCI /Sys-0/Node-0/Proc-6/PCI-2
| PROC /Sys-0/Node-0/Proc-7, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
| CORE /Sys-0/Node-0/Proc-7/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-7/PCI-0
| PCI /Sys-0/Node-0/Proc-7/PCI-1
| PCI /Sys-0/Node-0/Proc-7/PCI-2
|
| System Hardware Summary:
| Configured Proc Cores: 32
| Configured IO UNITs: 24
| Configured PCIe PHB: 24
| Installed Nodes: 1
|
| Hardware InitFile Information:
| No tool support for FIRENZE
|
| Hardware (CINI) Frequency Information:
| No tool support for FIRENZE
|
| VPD Information:
| Backplane VPD:
| None found or VPD info is not available.
| VPD LID Information:
| VPD LID File [/opt/extucode/80e00040.lid]:
| VPD Keyword: [LX], Data: [3100050100300040]
| VPD LID File [/opt/extucode/80e00041.lid]:
| VPD Keyword: [LX], Data: [3100040100300041]
| VPD LID File [/opt/extucode/80e00042.lid]:
| VPD Keyword: [LX], Data: [3100040100300042]
| VPD LID File [/opt/extucode/80e00043.lid]:
| VPD Keyword: [LX], Data: [3100040100300043]
| VPD LID File [/opt/extucode/80e00044.lid]:
| VPD Keyword: [LX], Data: [3100040100300044]
| VPD LID File [/opt/extucode/80e00047.lid]:
| VPD Keyword: [LX], Data: [3100040100300047]
| Format: 0x31 (1)
| Enclosure ID: 0x0004 (P8 HV (Tuleta))
| Server Type: 0x01 (i/pSeries)
| FRU Type: 0x00 (Backplane)
| VPD Pass: 0x30 (0)
| LID Name: 0x0047 (P8 Alpine xS4U)
| VPD LID File [/opt/extucode/80e00050.lid]:
| VPD Keyword: [LX], Data: [3100060100300050]
| VPD LID File [/opt/extucode/80e00051.lid]:
| VPD Keyword: [LX], Data: [3100060100300051]
| VPD LID File [/opt/extucode/80e00942.lid]:
| VPD Keyword: [LX], Data: [3100040100300942]
| VPD LID File [/opt/extucode/80e00944.lid]:
| VPD Keyword: [LX], Data: [3100040100300944]
| VPD LID File [/opt/extucode/80e00947.lid]:
| VPD Keyword: [LX], Data: [3100040100300947]
| Format: 0x31 (1)
| Enclosure ID: 0x0004 (P8 HV (Tuleta))
| Server Type: 0x01 (i/pSeries)
| FRU Type: 0x00 (Backplane)
| VPD Pass: 0x30 (0)
| LID Name: 0x0947 (P8 Alpine Storage/Shark)
| VPD LID File [/opt/extucode/80e00ff0.lid]:
| VPD Keyword: [LX], Data: [3100040100300FF0]
|
| WARNINGS:
| * Informational: This machine has signed firmware (ship image)
|
| ERRL: Attempting to dump error logs using errl...
| Dumping all error logs on FSP to file...
| ERRL: The FSP stopped responding... skipping
|
| FFDC:
| FNM: Attempting connection for basic health check...
| TimeSincePhypStarted=82:13:57.539
| No failed tasks found.
|
| FNM: Attempting connection for PHYP FFDC...
| FNM PHYP FFDC data stored in /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
|
| FipS MyFFDC: Was not attempted. Reason:[Not requested]
|
| Cronus: Data collection not attempted. (Unable to use Cronus via SSH Tunnel)
|
|----- File(s) Created During Capture ------
| SysCapture Primary LogFile: /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537
| FNM PHYP FFDC stored in: /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
|
============== end of capture ==============

============================Manual entries===========================
Title: STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times during boot then disabled SRC BA188002:b0314a_1612.840

Problem Description :
Booting Ubuntu 16.04 with a Bluefin (SAN) and several other adapters, the Bluefin hit EEH 6 times and was then disabled; SRC BA188002 was reported. None of the other adapters had any issues.

===================================END===============================
==== State: Open by: mlfield on 21 March 2016 14:47:26 ====

Attached Dmesg Log: dmesg1.txt

mlfield (<email address hidden>) added native attachment /opt/IBM/WebSphere/AppServer/profiles/cqweb/temp/ausratsrv5Node01/server1/TeamEAR/cqweb.war/dmesg1.txt on 2016-03-21 14:47:26

== Comment: #2 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:16 ==

== Comment: #12 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-04 14:09:48 ==
Info from Mike on ST.
Assigned the adapter in the drawer to the LPAR, it hit the problem just like the adapter in the CEC.
This points to a kernel/driver problem, since 14.04 didn't hit the problem.

<email address hidden> - Michael Littlefield/Austin/Contr/IBM: just added both bluefins, its happen with both so MEX and CEC.
# Slot                   Description                                               Device(s)
U78C7.001.1AQH383-P1-C4  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  fibre-channel
                                                                                   fibre-channel
U78C7.001.1AQH383-P1-C6  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0000:60:00.1
                                                                                   0000:60:00.0
U78CD.001.FZH0132-P1-C1  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  fibre-channel
                                                                                   fibre-channel
U78CD.001.FZH0132-P2-C1  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  0002:50:00.0
U78CD.001.FZH0132-P2-C3  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0003:70:00.0
U78CD.001.FZH0132-P2-C6  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0004:a0:00.5
                                                                                   0004:a0:00.4
                                                                                   0004:a0:00.3
                                                                                   0004:a0:00.2
                                                                                   0004:a0:00.1
                                                                                   0004:a0:00.0

== Comment: #16 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-12 18:00:26 ==
Mike provided the LPAR for debugging earlier today.

Observations.
1) The NUMA node configuration is unusual, likely an effect of DLPAR of memory/CPU.
- node 0: has CPUs but no memory
- node 2: has CPUs and memory
- node 6: has memory but no CPUs

(0) root @ alp7p04: /root
# numactl -H
available: 3 nodes (0,2,6)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 size: 34216 MB
node 2 free: 33248 MB
node 6 cpus:
node 6 size: 6644 MB
node 6 free: 6568 MB
node distances:
node 0 2 6
  0: 10 40 40
  2: 40 10 40
  6: 40 40 10
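
A minimal sketch of how the observation above could be checked automatically, assuming the `numactl -H` text has been captured to a string. The helper names `parse_numactl` and `unbalanced_nodes` are hypothetical and only for illustration; the sample is the output pasted above.

```python
import re

# Output copied from the `numactl -H` run on alp7p04 shown above.
SAMPLE = """\
available: 3 nodes (0,2,6)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 size: 34216 MB
node 2 free: 33248 MB
node 6 cpus:
node 6 size: 6644 MB
node 6 free: 6568 MB
"""

CPUS = re.compile(r"node (\d+) cpus:(.*)")
SIZE = re.compile(r"node (\d+) size: (\d+) MB")


def parse_numactl(text):
    """Map node id -> {"cpus": [...], "size_mb": int} from `numactl -H` text."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        m = CPUS.match(line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {"cpus": [], "size_mb": 0})
            node["cpus"] = [int(c) for c in m.group(2).split()]
            continue
        m = SIZE.match(line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {"cpus": [], "size_mb": 0})
            node["size_mb"] = int(m.group(2))
    return nodes


def unbalanced_nodes(nodes):
    """Return (nodes with CPUs but no memory, nodes with memory but no CPUs)."""
    memoryless = sorted(n for n, d in nodes.items() if d["cpus"] and d["size_mb"] == 0)
    cpuless = sorted(n for n, d in nodes.items() if not d["cpus"] and d["size_mb"] > 0)
    return memoryless, cpuless


if __name__ == "__main__":
    print(unbalanced_nodes(parse_numactl(SAMPLE)))
```

On this LPAR the memoryless list should contain node 0 and the CPU-less list node 6, matching the notes above.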

2) The problem does not reproduce with the 14.04 kernel (4.2 from wily).

Comparing the dmesg logs up to the point of failure, there are differences in the NUMA setup code.
2a) A small offset difference in the NODE_DATA starting address. For example:

16.04: [ 0.000000] numa: NODE_DATA [mem 0x9ffe46100-0x9ffe4ffff]

14.04: [ 0.000000] numa: NODE_DATA [mem 0x9ffe45000-0x9ffe4ffff]

2b) A *totally* different end address in the "Initmem setup node 0"

16.04: [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]

14.04: [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0xffffffffffffffff]

In progress.
I'll go through the NUMA setup code.
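
For comparing the "Initmem setup" lines between two dmesg captures, something like the sketch below could help. It assumes both logs are available as text; `initmem_ranges` and `diff_ranges` are hypothetical helper names, and the two sample lines are the ones quoted in 2b.

```python
import re

INITMEM = re.compile(
    r"Initmem setup node (\d+) \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]"
)

# The two lines quoted in 2b above.
DMESG_1604 = "[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]"
DMESG_1404 = "[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0xffffffffffffffff]"


def initmem_ranges(dmesg_text):
    """Map node id -> (start, end) for every 'Initmem setup node' line."""
    return {
        int(m.group(1)): (int(m.group(2), 16), int(m.group(3), 16))
        for m in INITMEM.finditer(dmesg_text)
    }


def diff_ranges(a, b):
    """Return the nodes whose (start, end) range differs between two logs."""
    return {
        node: (a.get(node), b.get(node))
        for node in sorted(set(a) | set(b))
        if a.get(node) != b.get(node)
    }


if __name__ == "__main__":
    for node, (r1604, r1404) in diff_ranges(
        initmem_ranges(DMESG_1604), initmem_ranges(DMESG_1404)
    ).items():
        print(node, r1604, r1404)
```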

== Comment: #20 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-12 18:18:52 ==
Booted the 16.04 kernel with the numa=off boot option.
The EEH errors still happen, but much later (e.g., the 6th error/permanent failure happens only after the login prompt).

== Comment: #22 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-13 10:23:33 ==
(In reply to comment #16)
> 2b) A *totally* different end address in the "Initmem setup node 0"
>
> 16:04: [ 0.000000] Initmem setup node 0 [mem
> 0x0000000000000000-0x0000000000000000]
>
> 14.04: [ 0.000000] Initmem setup node 0 [mem
> 0x0000000000000000-0xffffffffffffffff]

And this is the value on the original/reported dmesg attachment (on different NUMA node configuration, before some memory and CPUs were moved from this LPAR to another one):

[Mon Mar 21 09:07:45 2016] Initmem setup node 0 [mem 0x0000000000000000-0x00000078cfffffff]

Notice it's non-zero, as in 14.04, so I'm not sure the NUMA differences are directly related to this bug.

== Comment: #27 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-18 19:47:05 ==
Assigning this bug to Guilherme per EEH debugging experience and contacts.

From what we've discussed, this problem doesn't seem to be specific to the lpfc device driver.
This same adapter/driver works fine on other systems (it has passed our FVT Regression testing w/out this problem).
So, we suspect that some change in either the EEH code or the machine/platform-dependent code is causing this, given that the 14.04 HWE kernel doesn't show this issue on this same LPAR.

== Comment: #30 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-05-25 16:35:50 ==
Quick update on this one: I've been investigating since Monday, and what I found is that in those cases of spontaneous EEH, the PCI BARs of the device are filled with 0xFF, indicating some kind of corruption in the adapter's memory.

To dump the PCI BARs I first booted without EEH (by using eeh=off). The problem reproduces on upstream kernel v4.5, but not on v4.4, so it seems to be a regression.
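
A rough sketch of an all-0xFF check, not the method actually used in this investigation. It assumes the data of interest is available as a byte buffer, for example the device's sysfs `config` file (where the six BAR registers of a type 0 header sit at offsets 0x10 to 0x27) or a dump of the memory behind a BAR. `bar_registers` and `all_ones` are hypothetical names.

```python
import struct

BAR_OFFSET = 0x10  # first BAR register in a type 0 PCI config header
BAR_COUNT = 6      # a type 0 header has six 32-bit BAR registers


def bar_registers(config):
    """Return the six BAR registers from raw PCI config space bytes."""
    if len(config) < BAR_OFFSET + 4 * BAR_COUNT:
        raise ValueError("config space buffer too short")
    # PCI config space is little-endian.
    return list(struct.unpack_from("<6I", config, BAR_OFFSET))


def all_ones(data):
    """True if a non-empty buffer contains only 0xFF bytes."""
    return len(data) > 0 and all(byte == 0xFF for byte in data)


if __name__ == "__main__":
    # Synthetic 64-byte header whose BAR area reads back as all ones.
    fake = bytes(BAR_OFFSET) + b"\xff" * 24 + bytes(24)
    print([hex(bar) for bar in bar_registers(fake)])
    print(all_ones(fake[BAR_OFFSET:BAR_OFFSET + 24]))
```

On a live system the buffer could come from `open("/sys/bus/pci/devices/<BDF>/config", "rb").read()`, with `<BDF>` replaced by the adapter's address.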

I'm studying the commits between those revisions and bisecting, so we can find which commit introduced this behavior.

Thanks,

Guilherme

== Comment: #31 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-05-27 18:59:09 ==
Offending commit was found after doing some bisect and analysis on upstream kernel:

d6de08cc462 ("lpfc: Fix the FLOGI discovery logic to comply with T11 standards")

When this commit was reverted in kernel 4.6, the problem disappeared.
I do see some FLOGI failure on dmesg, but I guess this is somewhat normal (reference: https://access.redhat.com/solutions/400483);

Now, the next step is to investigate what's going on with this commit; it should have been tested before it was merged, so this could be an unexpected corner case we're hitting. I guess Maurício's opinion would be really useful here, since he has a lot of expertise in Fibre Channel devices (he should be back at the beginning of next week).

One more thought: it's important to determine the real priority of this bug. If this is a stop-ship, or the impact on some release would be critical, we could ask Canonical to revert the commit until a proper fix is implemented. I guess Brian's, Mauricio's and Breno's opinions on this are valuable.

Thanks,

Guilherme

== Comment: #32 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 10:13:57 ==
Guilherme,

Thank you very much for the precise handling on this one. Reassigning it back to myself.

I wouldn't have imagined this was a driver-specific problem, but given your pointer to this commit, it's indeed something in that direction: the dmesg log confirms there's some involvement of the FLOGI (fabric login) steps (related to the mentioned commit).

The devices have 2 ports (eg, PCI functions 0 and 1).
- Function 0 is processed first -- probe finishes OK, and it starts FLOGI steps.
- Function 1 starts probe during Function 0's FLOGI steps -- and Function 1's probe fails with the EEH.

So, the change in the FLOGI logic seems to be quite involved in the problems sensed by the mailbox commands that result in the EEH.

More on this later.

[ 1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
...
[ 2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
...
[ 2.636592] lpfc 0001:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[ 2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[ 2.638464] lpfc 0001:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
[ 2.639019] EEH: Frozen PHB#1-PE#10000 detected
...
[ 2.639049] [c00000084f612ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
[ 2.639061] [c00000084f612f20] [d000000008ed3cc4] lpfc_sli4_wait_bmbx_ready+0x114/0x150 [lpfc]
...
[ 2.639086] [c00000084f6131c0] [d000000008ee7780] lpfc_cq_create+0x210/0x370 [lpfc].
...
[ 2.639113] [c00000084f613550] [d000000008f23a28] lpfc_pci_probe_one+0x1248/0x13d0 [lpfc]
[ 2.639117] [c00000084f6135f0] [c0000000005daefc] local_pci_probe+0x6c/0x140
...
[ 2.639158] lpfc 0001:01:00.1: 1:(0):2544 Mailbox command x9b (x1/xc) cannot issue Data: x200 x1
...
[ 2.639166] lpfc 0001:01:00.1: 1:2501 CQ_CREATE mailbox failed with status x0 add_status x0, mbx status xff
...
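
To make the ordering between the two PCI functions easier to see, the log excerpt can be reduced to a per-source timeline. This is only a sketch under the assumption that the dmesg lines keep the `[ seconds] lpfc <BDF>: ...` and `[ seconds] EEH: ...` shapes shown above; `timeline` is a hypothetical helper, and stack trace lines are skipped on purpose.

```python
import re

STAMPED = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")
LPFC = re.compile(r"lpfc ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (.*)")

# Lines taken from the excerpt above.
SAMPLE = """\
[ 1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
[ 2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
[ 2.636592] lpfc 0001:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[ 2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[ 2.639019] EEH: Frozen PHB#1-PE#10000 detected
[ 2.639049] [c00000084f612ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
[ 2.639158] lpfc 0001:01:00.1: 1:(0):2544 Mailbox command x9b (x1/xc) cannot issue Data: x200 x1
"""


def timeline(dmesg_text):
    """Return sorted (timestamp, source, message) tuples for lpfc and EEH lines."""
    events = []
    for raw in dmesg_text.splitlines():
        m = STAMPED.match(raw.strip())
        if not m:
            continue
        stamp, rest = float(m.group(1)), m.group(2)
        dev = LPFC.match(rest)
        if dev:
            events.append((stamp, dev.group(1), dev.group(2)))
        elif rest.startswith("EEH:"):
            events.append((stamp, "EEH", rest[len("EEH:"):].strip()))
    return sorted(events)


if __name__ == "__main__":
    for stamp, source, message in timeline(SAMPLE):
        print(f"{stamp:10.6f}  {source:<13} {message}")
```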

== Comment: #33 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-05-30 12:56:21 ==
Thanks Maurício!

I noticed, compiling the kernel both with the commit and without it (by reverting it), that the following branch is taken in lpfc_mbox_dev_check() in both builds:

if (phba->link_state == LPFC_HBA_ERROR)

So, in both cases link_state is LPFC_HBA_ERROR, but the commit perhaps rearranged the ordering in a way that this failure can no longer be handled, maybe because of a race condition between threads.
This conclusion came from the following snippet of commit message:

"Required reworking the call sequence in the discovery threads."

Thanks for taking it from here.
Cheers,

Guilherme

== Comment: #34 - Breno Henrique Leitao <email address hidden> - 2016-05-30 13:25:00 ==
> we could ask Canonical to revert it until a proper fix be
> implemented. Guess Brian, Mauricio and Breno's opinion on this are valuable.

Well, it will not be simple to ask them to revert it. Although we requested the lpfc package upgrade [via bug #132388], there was another request to do so (LP: #1541592), so I would suggest proposing a fix rather than asking to revert this commit.

Does it make sense?

== Comment: #35 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 14:17:25 ==
It seems this commit might fix the problem. I'm working on a build with it.

ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector

Driver didn't program the REG_VFI mailbox correctly, giving the adapter
bad addresses.

== Comment: #36 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 17:35:30 ==
Hi Canonical,

Can you please apply this fix for the lpfc driver?

This upstream commit fixes the problem:

 ae09c765109293b600ba9169aa3d632e1ac1a843
 lpfc: Fix DMA faults observed upon plugging loopback connector

Original kernel (4.4.0-22.40)

 root@alp7p04:~# uname -a
 Linux alp7p04 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:35 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

 root@alp7p04:~# dmesg | grep -i eeh
 [ 0.051252] EEH: pSeries platform initialized
 [ 0.137050] EEH: devices created
 [ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
 [ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
 [ 3.039211] EEH: PE location: N/A, PHB location: N/A
 [ 3.039234] [c00000062fa16e40] [c0000000000379b4] eeh_dev_check_failure+0x534/0x580
 [ 3.039237] [c00000062fa16ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
 [ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
 <...>

Patched kernel (4.4.0-22.40 + patch)

 root@alp7p04:~# uname -a
 Linux alp7p04 4.4.0-22-generic #40+bz139414c35 SMP Mon May 30 10:54:04 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

 root@alp7p04:~# dmesg | grep -i eeh
 [ 0.051222] EEH: pSeries platform initialized
 [ 0.137348] EEH: devices created
 [ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
 root@alp7p04:~#
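
The before/after comparison above can also be done mechanically. A minimal sketch, assuming the `dmesg | grep -i eeh` output of each kernel is saved as text; `eeh_freezes` is a hypothetical helper and the samples are trimmed from the two runs above.

```python
import re

FROZEN = re.compile(r"EEH: Frozen PHB#([0-9a-fA-F]+)-PE#([0-9a-fA-F]+) detected")

# Trimmed from the original kernel (4.4.0-22.40) output above.
ORIGINAL = """\
[ 0.051252] EEH: pSeries platform initialized
[ 0.137050] EEH: devices created
[ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
[ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
[ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
"""

# From the patched kernel (4.4.0-22.40 + patch) output above.
PATCHED = """\
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
"""


def eeh_freezes(dmesg_text):
    """Return a list of (phb, pe) strings for every frozen-PE report."""
    return FROZEN.findall(dmesg_text)


if __name__ == "__main__":
    print("original:", eeh_freezes(ORIGINAL))
    print("patched: ", eeh_freezes(PATCHED))
```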

== Comment: #38 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 17:42:13 ==

bugproxy (bugproxy) wrote : dmesg1.txt

Default Comment by Bridge

tags: added: architecture-ppc64 bugnameltc-139414 severity-critical targetmilestone-inin1604
bugproxy (bugproxy) wrote : dmesg 16.04

Default Comment by Bridge

bugproxy (bugproxy) wrote : dmesg 14.04

Default Comment by Bridge

bugproxy (bugproxy) wrote : dmesg 16.04 with numa=off

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Mauricio Faria de Oliveira (mfo) wrote :

Hi Canonical,

Sorry about the incredibly long bug description.
I've selected which comments should have been mirrored, but this wasn't honored.

This bug is a problem in the linux source package, lpfc driver.
Summary of error messages and upstream fix below.

== Comment: #36 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 17:35:30 ==
Hi Canonical,

Can you please apply this fix for the lpfc driver?

This upstream commit fixes the problem:

ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector

Original kernel (4.4.0-22.40)

root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:35 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

root@alp7p04:~# dmesg | grep -i eeh
[ 0.051252] EEH: pSeries platform initialized
[ 0.137050] EEH: devices created
[ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
[ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
[ 3.039211] EEH: PE location: N/A, PHB location: N/A
[ 3.039234] [c00000062fa16e40] [c0000000000379b4] eeh_dev_check_failure+0x534/0x580
[ 3.039237] [c00000062fa16ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
[ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
<...>

Patched kernel (4.4.0-22.40 + patch)

root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40+bz139414c35 SMP Mon May 30 10:54:04 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

root@alp7p04:~# dmesg | grep -i eeh
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
root@alp7p04:~#

affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Yakkety):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
status: Triaged → Fix Released
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-06-07 08:23 EDT-------
===================================END===================================
==== State: Verify by: cde00 on 31 May 2016 03:43:19 ====

== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:11 ==
==== State: Verify by: cde00 on 31 May 2016 04:07:26 ====

==== State: Verify by: byrneadw on 01 June 2016 11:03:58 ====

I loaded the test packages and can now successfully run HTX, I am not seeing EEH errors anymore but I do still see these "FLOGI failure Status:x3/x103 TMO:x14" errors.

==== State: Verify by: byrneadw on 01 June 2016 11:07:41 ====

I loaded the test packages and can now successfully run HTX, I am not seeing EEH errors anymore but I do still see these "FLOGI failure Status:x3/x103 TMO:x14" errors.
I see a comment earlier that suggests it is normal ( update #31 from Guilherme ).

I'm wondering if this is another event to add to our ignore list. In addition to the comment from Guilherme, I can see a very similar event already in our ignore list due to feedback we received on SW315535; the event in that case was "FLOGI failure Status:x3/x103 TMO:x4". I'm not sure what the difference between TMO:x4 and TMO:x14 is.

Is it ok to add "FLOGI failure Status:x3/x103 TMO:x14" events to our ignore list also or is more debug required ?

root@rcx2c360:/tmp# dmesg -T --level=alert,crit,err
[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
[Wed Jun 1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[Wed Jun 1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
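
If these events do end up on the ignore list, one pattern could cover both TMO:x4 and TMO:x14 instead of listing each value. The sketch below is hypothetical: the actual format of the ignore list is not shown in this bug, and `FLOGI_WRAP` and `filter_ignored` are names invented here.

```python
import re

# Matches the FLOGI failure status seen here, with any hex TMO value.
FLOGI_WRAP = re.compile(r"FLOGI failure Status:x3/x103 TMO:x[0-9a-f]+")

# Lines copied from the dmesg output above.
LINES = [
    "[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0",
    "[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0",
    "[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14",
    "[Wed Jun 1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0",
    "[Wed Jun 1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):0100 FLOGI failure Status:x3/x103 TMO:x14",
]


def filter_ignored(lines, patterns=(FLOGI_WRAP,)):
    """Drop every line that matches one of the ignore patterns."""
    return [line for line in lines if not any(p.search(line) for p in patterns)]


if __name__ == "__main__":
    for line in filter_ignored(LINES):
        print(line)
```

Only the Link Up Event line survives the filter for this sample, so a separate decision would still be needed for that message.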

===>> If required, access to system:
1) Telnet rchd08e0.rchland.ibm.com ( login with userid=dlth1025, password=tim2fish )
2) from #1 execute ssh root@rcx2c360 (password is PASSW0RD)

==== State: Verify by: byrneadw on 02 June 2016 17:11:41 ====

Considering that TMO:x4 and TMO:x14 are timeout values, it suggests to me this is the same error we hit before with SW315535. The root cause of SW315535 was the mfg usage of wrap plugs on the Fibre ports for the purpose of running HTX. It resulted in the FLOGI message because a port cannot log in to itself.

The TMO values must have changed with Ubuntu 16.04 or new drivers as you mentioned above. This is the first system with a Bluefin running with 16.04 we've had. In the past all our systems with Bluefin were running in Habanero boxes with Ubuntu 14.04.03

In SW315535 Dan Eisenhauer commented :
"That "error" message means the link came up, so I am conjecturing that there is a wrap plug installed, The FLOGI failed messages would be expected in that case since a port cannot login to itself. So, all those messages are expected and indicate that a wrap plug is installed and the adapters are functioning. Those can all be ignored."

I removed the wrap plugs on our Garrison system and was a...


Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Mauricio Faria de Oliveira (mfo) wrote :

Marking as verification done.

The commit is present on the git tag for the -proposed kernel [1].

It was not possible to re-verify this kernel package on the original equipment that reproduced the problem,
but the commit itself was verified earlier on it to resolve it (documented in comment #5).

[1] http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/drivers/scsi/lpfc?h=Ubuntu-4.4.0-30.49&id=978ae5801bab1b9dc41c5b57937a317ec7ac1ece

tags: added: verification-done-xenial
removed: verification-needed-xenial
bugproxy (bugproxy)
tags: added: severity-high targetmilestone-inin16041
removed: severity-critical targetmilestone-inin1604
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-31.50

---------------
linux (4.4.0-31.50) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1602449

  * nouveau: boot hangs at blank screen with unsupported graphics cards
    (LP: #1602340)
    - SAUCE: drm: check for supported chipset before booting fbdev off the hw

linux (4.4.0-30.49) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597897

  * FCP devices are not detected correctly nor deterministically (LP: #1567602)
    - scsi_dh_alua: Disable ALUA handling for non-disk devices
    - scsi_dh_alua: Use vpd_pg83 information
    - scsi_dh_alua: improved logging
    - scsi_dh_alua: sanitze sense code handling
    - scsi_dh_alua: use standard logging functions
    - scsi_dh_alua: return standard SCSI return codes in submit_rtpg
    - scsi_dh_alua: fixup description of stpg_endio()
    - scsi_dh_alua: use flag for RTPG extended header
    - scsi_dh_alua: use unaligned access macros
    - scsi_dh_alua: rework alua_check_tpgs() to return the tpgs mode
    - scsi_dh_alua: simplify sense code handling
    - scsi: Add scsi_vpd_lun_id()
    - scsi: Add scsi_vpd_tpg_id()
    - scsi_dh_alua: use scsi_vpd_tpg_id()
    - scsi_dh_alua: Remove stale variables
    - scsi_dh_alua: Pass buffer as function argument
    - scsi_dh_alua: separate out alua_stpg()
    - scsi_dh_alua: Make stpg synchronous
    - scsi_dh_alua: call alua_rtpg() if stpg fails
    - scsi_dh_alua: switch to scsi_execute_req_flags()
    - scsi_dh_alua: allocate RTPG buffer separately
    - scsi_dh_alua: Use separate alua_port_group structure
    - scsi_dh_alua: use unique device id
    - scsi_dh_alua: simplify alua_initialize()
    - revert commit a8e5a2d593cb ("[SCSI] scsi_dh_alua: ALUA handler attach should
      succeed while TPG is transitioning")
    - scsi_dh_alua: move optimize_stpg evaluation
    - scsi_dh_alua: remove 'rel_port' from alua_dh_data structure
    - scsi_dh_alua: Use workqueue for RTPG
    - scsi_dh_alua: Allow workqueue to run synchronously
    - scsi_dh_alua: Add new blacklist flag 'BLIST_SYNC_ALUA'
    - scsi_dh_alua: Recheck state on unit attention
    - scsi_dh_alua: update all port states
    - scsi_dh_alua: Send TEST UNIT READY to poll for transitioning
    - scsi_dh_alua: do not fail for unknown VPD identification

linux (4.4.0-29.48) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597015

  * Wireless hotkey fails on Dell XPS 15 9550 (LP: #1589886)
    - intel-hid: new hid event driver for hotkeys
    - intel-hid: fix incorrect entries in intel_hid_keymap
    - intel-hid: allocate correct amount of memory for private struct
    - intel-hid: add a workaround to ignore an event after waking up from S4.
    - [Config] CONFIG_INTEL_HID_EVENT=m

  * cgroupfs mounts can hang (LP: #1588056)
    - Revert "UBUNTU: SAUCE: (namespace) mqueue: Super blocks must be owned by the
      user ns which owns the ipc ns"
    - Revert "UBUNTU: SAUCE: kernfs: Do not match superblock in another user
      namespace when mounting"
    - Revert "UBUNTU: SAUCE: cgroup: Use a new super block when mounting in a
      cgroup namespace"
    - (name...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released