Ubuntu
linux package

STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times during boot then disabled SRC BA188002:b0314a_1612.840

Bug #1587316 reported by bugproxy on 2016-05-31

This bug affects 1 person

	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	High	Unassigned
Xenial	Fix Released	Undecided	Tim Gardner
Yakkety	Fix Released	High	Unassigned

Bug Description

== Comment: #0 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:09 ==

== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:11 ==
==== State: Open by: mlfield on 21 March 2016 14:45:01 ====

==========================Automatic entries==========================
Contact: LittleField, Michael *CONTRACTOR*
Backup: Thirukumaran V T (<email address hidden>), Deepti Umarani (<email address hidden>), Brian M. Carpenter(<email address hidden>)

===== sys_capture v5.24 === 2016-03-21_14-25-41 ===========

|
| |
| System Hardware Information:
| NODE /Sys-0/Node-0, U78C7.001.1AQH383-P2
| FSP /Sys-0/Node-0/FSP-0, FSP-2 DD 1.0, U78C7.001.1AQH383-P1-C5
| PSI /Sys-0/Node-0/FSP-0/PSI-0
| PSI /Sys-0/Node-0/FSP-0/PSI-1
| MEMBUF /Sys-0/Node-0/Membuf-12, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C11
| MEMBUF /Sys-0/Node-0/Membuf-13, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C10
| MEMBUF /Sys-0/Node-0/Membuf-14, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C12
| MEMBUF /Sys-0/Node-0/Membuf-15, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C13
| MEMBUF /Sys-0/Node-0/Membuf-20, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C23
| MEMBUF /Sys-0/Node-0/Membuf-21, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C22
| MEMBUF /Sys-0/Node-0/Membuf-22, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C24
| MEMBUF /Sys-0/Node-0/Membuf-23, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C25
| MEMBUF /Sys-0/Node-0/Membuf-28, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C19
| MEMBUF /Sys-0/Node-0/Membuf-29, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C18
| MEMBUF /Sys-0/Node-0/Membuf-30, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C20
| MEMBUF /Sys-0/Node-0/Membuf-31, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C21
| MEMBUF /Sys-0/Node-0/Membuf-36, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C31
| MEMBUF /Sys-0/Node-0/Membuf-37, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C30
| MEMBUF /Sys-0/Node-0/Membuf-38, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C32
| MEMBUF /Sys-0/Node-0/Membuf-39, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C33
| MEMBUF /Sys-0/Node-0/Membuf-4, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C15
| MEMBUF /Sys-0/Node-0/Membuf-44, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C27
| MEMBUF /Sys-0/Node-0/Membuf-45, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C26
| MEMBUF /Sys-0/Node-0/Membuf-46, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C28
| MEMBUF /Sys-0/Node-0/Membuf-47, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C29
| MEMBUF /Sys-0/Node-0/Membuf-5, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C14
| MEMBUF /Sys-0/Node-0/Membuf-52, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C39
| MEMBUF /Sys-0/Node-0/Membuf-53, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C38
| MEMBUF /Sys-0/Node-0/Membuf-54, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C40
| MEMBUF /Sys-0/Node-0/Membuf-55, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C41
| MEMBUF /Sys-0/Node-0/Membuf-6, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C16
| MEMBUF /Sys-0/Node-0/Membuf-60, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C35
| MEMBUF /Sys-0/Node-0/Membuf-61, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C34
| MEMBUF /Sys-0/Node-0/Membuf-62, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C36
| MEMBUF /Sys-0/Node-0/Membuf-63, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C37
| MEMBUF /Sys-0/Node-0/Membuf-7, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C17
| PROC /Sys-0/Node-0/Proc-0, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
| CORE /Sys-0/Node-0/Proc-0/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-4/Core-0
| PCI /Sys-0/Node-0/Proc-0/PCI-0
| PCI /Sys-0/Node-0/Proc-0/PCI-1
| PCI /Sys-0/Node-0/Proc-0/PCI-2
| PSI /Sys-0/Node-0/Proc-0/PSI-0
| PROC /Sys-0/Node-0/Proc-1, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
| CORE /Sys-0/Node-0/Proc-1/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-1/PCI-0
| PCI /Sys-0/Node-0/Proc-1/PCI-1
| PCI /Sys-0/Node-0/Proc-1/PCI-2
| PROC /Sys-0/Node-0/Proc-2, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
| CORE /Sys-0/Node-0/Proc-2/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-2/PCI-0
| PCI /Sys-0/Node-0/Proc-2/PCI-1
| PCI /Sys-0/Node-0/Proc-2/PCI-2
| PSI /Sys-0/Node-0/Proc-2/PSI-0
| PROC /Sys-0/Node-0/Proc-3, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
| CORE /Sys-0/Node-0/Proc-3/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-3/PCI-0
| PCI /Sys-0/Node-0/Proc-3/PCI-1
| PCI /Sys-0/Node-0/Proc-3/PCI-2
| PROC /Sys-0/Node-0/Proc-4, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
| CORE /Sys-0/Node-0/Proc-4/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-4/PCI-0
| PCI /Sys-0/Node-0/Proc-4/PCI-1
| PCI /Sys-0/Node-0/Proc-4/PCI-2
| PROC /Sys-0/Node-0/Proc-5, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
| CORE /Sys-0/Node-0/Proc-5/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-4/Core-0
| PCI /Sys-0/Node-0/Proc-5/PCI-0
| PCI /Sys-0/Node-0/Proc-5/PCI-1
| PCI /Sys-0/Node-0/Proc-5/PCI-2
| PROC /Sys-0/Node-0/Proc-6, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
| CORE /Sys-0/Node-0/Proc-6/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-6/PCI-0
| PCI /Sys-0/Node-0/Proc-6/PCI-1
| PCI /Sys-0/Node-0/Proc-6/PCI-2
| PROC /Sys-0/Node-0/Proc-7, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
| CORE /Sys-0/Node-0/Proc-7/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-7/PCI-0
| PCI /Sys-0/Node-0/Proc-7/PCI-1
| PCI /Sys-0/Node-0/Proc-7/PCI-2
|
| System Hardware Summary:
| Configured Proc Cores: 32
| Configured IO UNITs: 24
| Configured PCIe PHB: 24
| Installed Nodes: 1
|
| Hardware InitFile Information:
| No tool support for FIRENZE
|
| Hardware (CINI) Frequency Information:
| No tool support for FIRENZE
|
| VPD Information:
| Backplane VPD:
| None found or VPD info is not available.
| VPD LID Information:
| VPD LID File [/opt/extucode/80e00040.lid]:
| VPD Keyword: [LX], Data: [3100050100300040]
| VPD LID File [/opt/extucode/80e00041.lid]:
| VPD Keyword: [LX], Data: [3100040100300041]
| VPD LID File [/opt/extucode/80e00042.lid]:
| VPD Keyword: [LX], Data: [3100040100300042]
| VPD LID File [/opt/extucode/80e00043.lid]:
| VPD Keyword: [LX], Data: [3100040100300043]
| VPD LID File [/opt/extucode/80e00044.lid]:
| VPD Keyword: [LX], Data: [3100040100300044]
| VPD LID File [/opt/extucode/80e00047.lid]:
| VPD Keyword: [LX], Data: [3100040100300047]
| Format: 0x31 (1)
| Enclosure ID: 0x0004 (P8 HV (Tuleta))
| Server Type: 0x01 (i/pSeries)
| FRU Type: 0x00 (Backplane)
| VPD Pass: 0x30 (0)
| LID Name: 0x0047 (P8 Alpine xS4U)
| VPD LID File [/opt/extucode/80e00050.lid]:
| VPD Keyword: [LX], Data: [3100060100300050]
| VPD LID File [/opt/extucode/80e00051.lid]:
| VPD Keyword: [LX], Data: [3100060100300051]
| VPD LID File [/opt/extucode/80e00942.lid]:
| VPD Keyword: [LX], Data: [3100040100300942]
| VPD LID File [/opt/extucode/80e00944.lid]:
| VPD Keyword: [LX], Data: [3100040100300944]
| VPD LID File [/opt/extucode/80e00947.lid]:
| VPD Keyword: [LX], Data: [3100040100300947]
| Format: 0x31 (1)
| Enclosure ID: 0x0004 (P8 HV (Tuleta))
| Server Type: 0x01 (i/pSeries)
| FRU Type: 0x00 (Backplane)
| VPD Pass: 0x30 (0)
| LID Name: 0x0947 (P8 Alpine Storage/Shark)
| VPD LID File [/opt/extucode/80e00ff0.lid]:
| VPD Keyword: [LX], Data: [3100040100300FF0]
|
| WARNINGS:
| * Informational: This machine has signed firmware (ship image)
|
| ERRL: Attempting to dump error logs using errl...
| Dumping all error logs on FSP to file...
| ERRL: The FSP stopped responding... skipping
|
| FFDC:
| FNM: Attempting connection for basic health check...
| TimeSincePhypStarted=82:13:57.539
| No failed tasks found.
|
| FNM: Attempting connection for PHYP FFDC...
| FNM PHYP FFDC data stored in /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
|
| FipS MyFFDC: Was not attempted. Reason:[Not requested]
|
| Cronus: Data collection not attempted. (Unable to use Cronus via SSH Tunnel)
|
|----- File(s) Created During Capture ------
| SysCapture Primary LogFile: /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537
| FNM PHYP FFDC stored in: /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
|
============== end of capture ==============

============================Manual entries===========================
Title: STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times during boot then disabled SRC BA188002:b0314a_1612.840

Problem Description :
When booting Ubuntu 16.04 with a Bluefin (SAN) adapter and several other adapters, the Bluefin adapter hit EEH 6 times and was then disabled, and SRC BA188002 was reported. None of the other adapters had any issues.

===================================END===============================
==== State: Open by: mlfield on 21 March 2016 14:47:26 ====

Attached Dmesg Log: dmesg1.txt

mlfield (<email address hidden>) added native attachment /opt/IBM/WebSphere/AppServer/profiles/cqweb/temp/ausratsrv5Node01/server1/TeamEAR/cqweb.war/dmesg1.txt on 2016-03-21 14:47:26

== Comment: #2 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:16 ==

== Comment: #12 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-04 14:09:48 ==
Info from Mike on ST.
Assigned the adapter in the drawer to the LPAR; it hit the problem just like the adapter in the CEC.
This points to a kernel/driver problem, since 14.04 didn't hit the problem.

<email address hidden> - Michael Littlefield/Austin/Contr/IBM: just added both Bluefins; it happens with both, so MEX and CEC.
# Slot Description Device(s)
U78C7.001.1AQH383-P1-C4 PCI-E capable, Rev 3, 16x lanes with 16x lanes connected fibre-channel
                                                                                   fibre-channel
U78C7.001.1AQH383-P1-C6 PCI-E capable, Rev 3, 8x lanes with 8x lanes connected 0000:60:00.1
                                                                                   0000:60:00.0
U78CD.001.FZH0132-P1-C1 PCI-E capable, Rev 3, 16x lanes with 16x lanes connected fibre-channel
                                                                                   fibre-channel
U78CD.001.FZH0132-P2-C1 PCI-E capable, Rev 3, 16x lanes with 16x lanes connected 0002:50:00.0
U78CD.001.FZH0132-P2-C3 PCI-E capable, Rev 3, 8x lanes with 8x lanes connected 0003:70:00.0
U78CD.001.FZH0132-P2-C6 PCI-E capable, Rev 3, 8x lanes with 8x lanes connected 0004:a0:00.5
                                                                                   0004:a0:00.4
                                                                                   0004:a0:00.3
                                                                                   0004:a0:00.2
                                                                                   0004:a0:00.1
                                                                                   0004:a0:00.0

== Comment: #16 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-12 18:00:26 ==
Mike provided the LPAR for debugging earlier today.

Observations.
1) The NUMA nodes configuration is weird -- likely an effect of DLPAR of Memory/CPU.
- node 0: has CPUs but has no memory
- node 2: has CPUs and memory
- node 6: has no CPUs but has memory

(0) root @ alp7p04: /root
# numactl -H
available: 3 nodes (0,2,6)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 size: 34216 MB
node 2 free: 33248 MB
node 6 cpus:
node 6 size: 6644 MB
node 6 free: 6568 MB
node distances:
node 0 2 6
  0: 10 40 40
  2: 40 10 40
  6: 40 40 10

2) The problem does not reproduce with 14.04 kernel (4.2 from wily).

Comparing the dmesg logs up to the point of failure, there are differences in the NUMA setup code.
2a) A small offset difference in the NODE_DATA starting address. For example:

16.04: [ 0.000000] numa: NODE_DATA [mem 0x9ffe46100-0x9ffe4ffff]

14.04: [ 0.000000] numa: NODE_DATA [mem 0x9ffe45000-0x9ffe4ffff]

2b) A *totally* different end address in the "Initmem setup node 0"

16.04: [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]

14.04: [ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0xffffffffffffffff]

In progress.
I'll go through the NUMA setup code.

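The node layout from observation 1 can be checked mechanically. The following is a minimal sketch, assuming the text format of `numactl -H` shown above; the helper names `parse_numactl` and `classify` are hypothetical, and the CPU lists in `SAMPLE` are shortened for brevity.

```python
import re


def parse_numactl(text):
    """Return {node_id: {"cpus": [...], "size_mb": int}} from `numactl -H` text."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"node (\d+) cpus:(.*)", line)
        if m:
            cpus = [int(c) for c in m.group(2).split()]
            nodes.setdefault(int(m.group(1)), {})["cpus"] = cpus
            continue
        m = re.match(r"node (\d+) size: (\d+) MB", line)
        if m:
            nodes.setdefault(int(m.group(1)), {})["size_mb"] = int(m.group(2))
    return nodes


def classify(nodes):
    """Return (memoryless node ids, CPU-less node ids), each sorted."""
    memoryless = sorted(n for n, d in nodes.items() if d.get("size_mb", 0) == 0)
    cpuless = sorted(n for n, d in nodes.items() if not d.get("cpus"))
    return memoryless, cpuless


# Shortened copy of the output above (CPU lists truncated for the example).
SAMPLE = """\
available: 3 nodes (0,2,6)
node 0 cpus: 0 1 2 3
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19
node 2 size: 34216 MB
node 2 free: 33248 MB
node 6 cpus:
node 6 size: 6644 MB
node 6 free: 6568 MB
"""

if __name__ == "__main__":
    memoryless, cpuless = classify(parse_numactl(SAMPLE))
    print("memoryless nodes:", memoryless)
    print("CPU-less nodes:", cpuless)
```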
== Comment: #20 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-12 18:18:52 ==
Booting the 16.04 kernel with the numa=off boot option.
The EEH errors still happen, but much later (e.g., the 6th error/permanent failure happens only after the login prompt).

== Comment: #22 - Mauricio Faria De Oliveira <email address hidden> - 2016-04-13 10:23:33 ==
(In reply to comment #16)
> 2b) A *totally* different end address in the "Initmem setup node 0"
>
> 16.04: [ 0.000000] Initmem setup node 0 [mem
> 0x0000000000000000-0x0000000000000000]
>
> 14.04: [ 0.000000] Initmem setup node 0 [mem
> 0x0000000000000000-0xffffffffffffffff]

And this is the value on the original/reported dmesg attachment (on different NUMA node configuration, before some memory and CPUs were moved from this LPAR to another one):

[Mon Mar 21 09:07:45 2016] Initmem setup node 0 [mem 0x0000000000000000-0x00000078cfffffff]

Notice that it is non-zero, as on 14.04, so I'm not sure the NUMA differences are directly related to this bug.

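The three "Initmem setup node 0" lines quoted so far can be compared side by side. This is a minimal sketch, assuming the dmesg line format shown above; `initmem_range` and the labels in `LINES` are hypothetical names chosen for the example.

```python
import re

# Matches e.g. "Initmem setup node 0 [mem 0x...-0x...]"
INITMEM = re.compile(
    r"Initmem setup node (\d+) \[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]"
)


def initmem_range(line):
    """Return (node, start, end) parsed from an 'Initmem setup node' line, or None."""
    m = INITMEM.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16)


# Lines copied from the comments above; the dictionary keys are only labels.
LINES = {
    "16.04, current config": "[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]",
    "14.04, current config": "[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0xffffffffffffffff]",
    "original report": "[Mon Mar 21 09:07:45 2016] Initmem setup node 0 [mem 0x0000000000000000-0x00000078cfffffff]",
}

if __name__ == "__main__":
    for label, line in LINES.items():
        node, start, end = initmem_range(line)
        print(f"{label}: node {node} end={end:#x} non-zero end: {end != 0}")
```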
== Comment: #27 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-18 19:47:05 ==
Assigning this bug to Guilherme, given his EEH debugging experience and contacts.

From what we've discussed, this problem doesn't seem to be specific to the lpfc device driver.
This same adapter/driver works fine on other systems (it has passed our FVT Regression testing w/out this problem).
So, we suspect that some change in either the EEH or the machine/platform-dependent code is causing this, given that the 14.04 HWE kernel doesn't show this issue on this same LPAR.

== Comment: #30 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-05-25 16:35:50 ==
Quick update on this one: I have been investigating since Monday, and what I found is that in these cases of spontaneous EEH, the PCI BARs of the device are filled with 0xFF, indicating some kind of corruption in the adapter's memory.

To dump the PCI BARs, I first booted without EEH (by using eeh=off). The problem reproduces on upstream kernel v4.5, but not on v4.4, so it seems to be a regression.

I'm studying the commits between those revisions, bisecting, etc., so we can find which commit introduced this behavior.

Thanks,

Guilherme

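The "filled with 0xFF" observation amounts to an all-ones check on whatever bytes were dumped from a BAR. A minimal sketch follows; how the bytes are obtained is outside its scope, and both example buffers are made-up values, not data from this system.

```python
def is_all_ones(data: bytes) -> bool:
    """True when every byte of a non-empty buffer is 0xFF."""
    return len(data) > 0 and all(b == 0xFF for b in data)


if __name__ == "__main__":
    # Hypothetical buffers for illustration only.
    plausible = bytes.fromhex("0c00109600000000")
    suspicious = b"\xff" * 8
    print("plausible buffer all ones:", is_all_ones(plausible))
    print("suspicious buffer all ones:", is_all_ones(suspicious))
```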
== Comment: #31 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-05-27 18:59:09 ==
The offending commit was found after some bisecting and analysis on the upstream kernel:

d6de08cc462 ("lpfc: Fix the FLOGI discovery logic to comply with T11 standards")

When this commit was reverted in kernel 4.6, the problem disappeared.
I do see some FLOGI failures in dmesg, but I guess this is somewhat normal (reference: https://access.redhat.com/solutions/400483).

Now, the next step is to investigate what's going on with this commit; it should have been tested before it was merged, so this could be an unexpected corner case we're experiencing. I guess Maurício's opinion would be really useful here, since he has much expertise in Fibre Channel devices (he should be back at the beginning of next week).

One more thought: it's important to determine the real priority of this bug. If this is a stop-ship, or the impact on some release would be critical, we could ask Canonical to revert it until a proper fix is implemented. I guess Brian's, Mauricio's and Breno's opinions on this are valuable.

Thanks,

Guilherme

== Comment: #32 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 10:13:57 ==
Guilherme,

Thank you very much for the precise handling on this one. Reassigning it back to myself.

I wouldn't have imagined this was a driver-specific problem, but given your pointer to this commit, it's indeed something in that direction -- the dmesg log confirms there's some involvement of the FLOGI (fabric login) steps (related to the mentioned commit).

The devices have 2 ports (eg, PCI functions 0 and 1).
- Function 0 is processed first -- probe finishes OK, and it starts FLOGI steps.
- Function 1 starts probe during Function 0's FLOGI steps -- and Function 1 probe fails with the EEH.

So, the change in the FLOGI logic seems to be quite involved in the problems sensed by the mailbox commands that result in the EEH.

More on this later.

[ 1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
...
[ 2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
...
[ 2.636592] lpfc 0001:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[ 2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[ 2.638464] lpfc 0001:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
[ 2.639019] EEH: Frozen PHB#1-PE#10000 detected
...
[ 2.639049] [c00000084f612ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
[ 2.639061] [c00000084f612f20] [d000000008ed3cc4] lpfc_sli4_wait_bmbx_ready+0x114/0x150 [lpfc]
...
[ 2.639086] [c00000084f6131c0] [d000000008ee7780] lpfc_cq_create+0x210/0x370 [lpfc].
...
[ 2.639113] [c00000084f613550] [d000000008f23a28] lpfc_pci_probe_one+0x1248/0x13d0 [lpfc]
[ 2.639117] [c00000084f6135f0] [c0000000005daefc] local_pci_probe+0x6c/0x140
...
[ 2.639158] lpfc 0001:01:00.1: 1:(0):2544 Mailbox command x9b (x1/xc) cannot issue Data: x200 x1
...
[ 2.639166] lpfc 0001:01:00.1: 1:2501 CQ_CREATE mailbox failed with status x0 add_status x0, mbx status xff
...

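The ordering described above (Function 0 in FLOGI while Function 1 is still probing) can be read off the timestamps. This is a minimal sketch, assuming the `[ seconds] message` dmesg format shown in the excerpt; `timeline` is a hypothetical helper and `SAMPLE` holds a subset of the lines quoted above.

```python
import re

STAMPED = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")   # "[ 2.639019] message"
LPFC = re.compile(r"^lpfc (\S+?): (.*)$")              # "lpfc 0001:01:00.0: message"


def timeline(dmesg_text):
    """Return sorted (timestamp, source, message) tuples for lpfc and EEH lines."""
    events = []
    for raw in dmesg_text.splitlines():
        m = STAMPED.match(raw.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        dev = LPFC.match(msg)
        if dev:
            events.append((ts, dev.group(1), dev.group(2)))
        elif msg.startswith("EEH:"):
            events.append((ts, "EEH", msg))
    return sorted(events)


# Subset of the log lines quoted above.
SAMPLE = """\
[ 1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
[ 2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
[ 2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[ 2.639019] EEH: Frozen PHB#1-PE#10000 detected
[ 2.639166] lpfc 0001:01:00.1: 1:2501 CQ_CREATE mailbox failed with status x0 add_status x0, mbx status xff
"""

if __name__ == "__main__":
    for ts, source, message in timeline(SAMPLE):
        print(f"{ts:10.6f}  {source:<14} {message}")
```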
== Comment: #33 - Guilherme Guaglianoni Piccoli <email address hidden> - 2016-05-30 12:56:21 ==
Thanks Maurício!

I noticed, compiling the kernel both with the commit and without it (by reverting it), that the following if is taken in lpfc_mbox_dev_check():

if (phba->link_state == LPFC_HBA_ERROR)

So, in both cases the link_state indicates an error, but the commit perhaps introduced some reordering such that the driver can no longer handle this failure, maybe because of a race condition between threads.
This conclusion came from the following snippet of commit message:

"Required reworking the call sequence in the discovery threads."

Thanks for taking it from here.
Cheers,

Guilherme

== Comment: #34 - Breno Henrique Leitao <email address hidden> - 2016-05-30 13:25:00 ==
> we could ask Canonical to revert it until a proper fix is
> implemented. I guess Brian's, Mauricio's and Breno's opinions on this are valuable.

Well, it will not be simple to ask them to revert it. Although we requested the lpfc package upgrade [via bug #132388], there was another request to do so (LP: #1541592), so I would suggest trying to propose a fix rather than asking to revert this commit.

Does it make sense?

== Comment: #35 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 14:17:25 ==
It seems this commit might fix the problem. I'm working on a build with it.

ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector

Driver didn't program the REG_VFI mailbox correctly, giving the adapter
bad addresses.

== Comment: #36 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 17:35:30 ==
Hi Canonical,

Can you please apply this fix for the lpfc driver?

This upstream commit fixes the problem:

ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector

Original kernel (4.4.0-22.40)

root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:35 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

root@alp7p04:~# dmesg | grep -i eeh
[ 0.051252] EEH: pSeries platform initialized
[ 0.137050] EEH: devices created
[ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
[ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
[ 3.039211] EEH: PE location: N/A, PHB location: N/A
[ 3.039234] [c00000062fa16e40] [c0000000000379b4] eeh_dev_check_failure+0x534/0x580
[ 3.039237] [c00000062fa16ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
[ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
<...>

Patched kernel (4.4.0-22.40 + patch)

root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40+bz139414c35 SMP Mon May 30 10:54:04 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

root@alp7p04:~# dmesg | grep -i eeh
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
root@alp7p04:~#

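The before/after comparison above boils down to whether any EEH error lines remain once the platform initialization messages are set aside. A minimal sketch follows, assuming that lines containing "EEH: Frozen" or "EEH: Detected PCI bus error" are the ones that indicate an error (an assumption based only on the excerpts in this report); `eeh_error_lines` is a hypothetical helper.

```python
import re

# Assumption: these two message prefixes mark EEH errors, as in the excerpts above.
EEH_ERROR = re.compile(r"EEH: (Frozen|Detected PCI bus error)")


def eeh_error_lines(dmesg_text):
    """Return the dmesg lines that look like EEH error reports."""
    return [line for line in dmesg_text.splitlines() if EEH_ERROR.search(line)]


# Excerpts of `dmesg | grep -i eeh` copied from the comment above.
ORIGINAL = """\
[ 0.051252] EEH: pSeries platform initialized
[ 0.137050] EEH: devices created
[ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
[ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
[ 3.039211] EEH: PE location: N/A, PHB location: N/A
[ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
"""

PATCHED = """\
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
"""

if __name__ == "__main__":
    print("original kernel EEH error lines:", len(eeh_error_lines(ORIGINAL)))
    print("patched kernel EEH error lines:", len(eeh_error_lines(PATCHED)))
```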
== Comment: #38 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 17:42:13 ==

bugproxy (bugproxy) wrote on 2016-05-31: dmesg1.txt

dmesg1.txt (139.9 KiB, text/plain)

Default Comment by Bridge

tags:

added: architecture-ppc64 bugnameltc-139414 severity-critical targetmilestone-inin1604

bugproxy (bugproxy) wrote on 2016-05-31: dmesg 16.04

dmesg 16.04 (124.5 KiB, text/plain)

Default Comment by Bridge

bugproxy (bugproxy) wrote on 2016-05-31: dmesg 14.04

dmesg 14.04 (33.7 KiB, text/plain)

Default Comment by Bridge

bugproxy (bugproxy) wrote on 2016-05-31: dmesg 16.04 with numa=off

dmesg 16.04 with numa=off (111.8 KiB, text/plain)

Default Comment by Bridge

Changed in ubuntu:
assignee:	nobody → Taco Screen team (taco-screen-team)

Mauricio Faria de Oliveira (mfo) wrote on 2016-05-31:

Hi Canonical,

Sorry about the incredibly long bug description.
I've selected which comments should have been mirrored, but this wasn't honored.

This bug is a problem in the linux source package, lpfc driver.
Summary of error messages and upstream fix below.

== Comment: #36 - Mauricio Faria De Oliveira <email address hidden> - 2016-05-30 17:35:30 ==
Hi Canonical,

Can you please apply this fix for the lpfc driver?

This upstream commit fixes the problem:

ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector

Original kernel (4.4.0-22.40)

root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:35 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Patched kernel (4.4.0-22.40 + patch)

root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40+bz139414c35 SMP Mon May 30 10:54:04 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

root@alp7p04:~# dmesg | grep -i eeh
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
root@alp7p04:~#

affects:

ubuntu → linux (Ubuntu)

Leann Ogasawara (leannogasawara) on 2016-05-31

Changed in linux (Ubuntu):
assignee:	Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance:	Undecided → High
status:	New → Triaged

Tim Gardner (timg-tpi) wrote on 2016-06-03:

https://lists.ubuntu.com/archives/kernel-team/2016-June/077898.html

Changed in linux (Ubuntu Yakkety):
assignee:	Canonical Kernel Team (canonical-kernel-team) → nobody
status:	Triaged → Fix Released
Changed in linux (Ubuntu Xenial):
assignee:	nobody → Tim Gardner (timg-tpi)
status:	New → In Progress

bugproxy (bugproxy) wrote on 2016-06-07: Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-06-07 08:23 EDT-------
===================================END===============================
==== State: Verify by: cde00 on 31 May 2016 03:43:19 ====

== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-03-21 15:55:11 ==
==== State: Verify by: cde00 on 31 May 2016 04:07:26 ====

==== State: Verify by: byrneadw on 01 June 2016 11:03:58 ====

==== State: Verify by: byrneadw on 01 June 2016 11:07:41 ====

I loaded the test packages and can now successfully run HTX, I am not seeing EEH errors anymore but I do still see these "FLOGI failure Status:x3/x103 TMO:x14" errors.
I see a comment earlier that suggests it is normal ( update #31 from Guilherme ).

I'm wondering if this is another event to add to our ignore list. In addition to the comment from Guilherme, I can see a very similar event already in our ignore list due to feedback we received on SW315535; the event in that case was "FLOGI failure Status:x3/x103 TMO:x4". I'm not sure what the difference between TMO:x4 and TMO:x14 is.

Is it ok to add "FLOGI failure Status:x3/x103 TMO:x14" events to our ignore list also or is more debug required ?

root@rcx2c360:/tmp# dmesg -T --level=alert,crit,err
[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[Wed Jun 1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
[Wed Jun 1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[Wed Jun 1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):0100 FLOGI failure Status:x3/x103 TMO:x14

===>> If required, access to system:
1) Telnet rchd08e0.rchland.ibm.com ( login with userid=dlth1025, password=tim2fish )
2) from #1 execute ssh root@rcx2c360 (password is PASSW0RD)

==== State: Verify by: byrneadw on 02 June 2016 17:11:41 ====

Considering that TMO:x4 and TMO:x14 are timeout values, it suggests to me this is the same error we hit before with SW315535. The root cause of SW315535 was the mfg usage of wrap plugs on the Fibre ports for the purpose of running HTX. It resulted in the FLOGI message because a port cannot log in to itself.

The TMO values must have changed with Ubuntu 16.04 or new drivers as you mentioned above. This is the first system with a Bluefin running with 16.04 we've had. In the past all our systems with Bluefin were running in Habanero boxes with Ubuntu 14.04.03

In SW315535 Dan Eisenhauer commented :
"That "error" message means the link came up, so I am conjecturing that there is a wrap plug installed, The FLOGI failed messages would be expected in that case since a port cannot login to itself. So, all those messages are expected and indicate that a wrap plug is installed and the adapters are functioning. Those can all be ignored."

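If these messages are treated as expected with wrap plugs installed, an ignore-list rule could match the FLOGI failure regardless of its TMO value. The following is a minimal sketch, not the actual ignore-list mechanism used by the test team; the names `flogi_timeout` and `is_ignorable` are hypothetical, and reading the value after `TMO:x` as hexadecimal is an assumption.

```python
import re

# Matches the status seen in this report, with any TMO value.
FLOGI_FAILURE = re.compile(r"FLOGI failure Status:x3/x103 TMO:x([0-9a-fA-F]+)")


def flogi_timeout(line):
    """Return the TMO value (assumed to be hexadecimal) or None if the line does not match."""
    m = FLOGI_FAILURE.search(line)
    return int(m.group(1), 16) if m else None


def is_ignorable(line):
    """True for FLOGI failure lines with status x3/x103, whatever the TMO value."""
    return flogi_timeout(line) is not None


if __name__ == "__main__":
    samples = [
        "lpfc 0000:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0",
        "FLOGI failure Status:x3/x103 TMO:x4",
        "EEH: Frozen PHB#1-PE#10000 detected",
    ]
    for line in samples:
        print(is_ignorable(line), flogi_timeout(line), line)
```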
I removed the wrap plugs on our Garrison system and was a...


Changed in linux (Ubuntu Xenial):
status:	In Progress → Fix Committed

Changed in linux (Ubuntu Xenial):
status:	Fix Committed → Fix Released
