sata_nv regression, reboots system

Bug #210637 reported by Jeremy Jackson
Affects          Status         Importance   Assigned to      Milestone
linux (Ubuntu)   Fix Released   High         Colin Ian King
Hardy            Fix Released   High         Colin Ian King
Intrepid         Fix Released   High         Colin Ian King

Bug Description

Trying 2.6.24-13-server doesn't help. Somewhere since 2.6.23, sata_nv gained improved exception handling, but it came with a serious bug that is fixed in 2.6.25-pre6. On a server, to guarantee data integrity, I turn off the drive write cache: # hdparm -W 0 /dev/sda ; hdparm -W 0 /dev/sdb. On Gutsy, this gives periodic kernel messages, but on Hardy it fails terribly: the XFS filesystem shuts down due to an error, and most times the system reboots without any messages to the serial console.

The fix is in the following commit in git linux-2.6-stable tree:

author Robert Hancock <email address hidden>
 Wed, 30 Jan 2008 01:53:19 +0000 (19:53 -0600)
committer Jeff Garzik <email address hidden>
 Fri, 1 Feb 2008 17:26:38 +0000 (12:26 -0500)
commit a1fe782414b7122d4c0501d3a0988b7302fa586f
.....
This patch is based on an original patch from Kuan Luo of NVIDIA,
posted under subject "fixed a bug of adma in rhel4u5 with HDS7250SASUN500G".
His description follows. I've reworked it a bit to avoid some unnecessary
repeated checks but it should be functionally identical.

"The patch is to solve the error message "ata1: CPB flags CMD err,
flags=0x11" when testing HDS7250SASUN500G in rhel4u5.
I tested this hd in 2.6.24-rc7 which needed to remove the mask in
blacklist to run the ncq and the same error also showed up...
.....

Revision history for this message
Jeremy Jackson (jerj) wrote :

I tried 2.6.25-pre6 and it seems to work. I tried the Hardy kernel + the patch, it also works.

I'm attaching the required logs.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Jeremy,

This may or may not make it into Hardy, as we're currently in Beta freeze and the kernel team might not be making any more uploads. If anything, this will automatically be available in the Intrepid Ibex 8.10 kernel, and we can also try getting this in for the Hardy 8.04.1 point release. In the meantime I'll reassign to the kernel team for their consideration, in case they can squeeze this in before Hardy final comes out. Thanks.

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
status: New → Triaged
Revision history for this message
David Goldsmith (dg) wrote :

I believe I've hit the same bug running Hardy beta on a Sun Ultra 40 with HDS7225S.

I've had related kernel problems since about 2.6.20 and am eager to get off of SuSE 10.2 (running 2.6.19), which is the last kernel I've been able to boot reliably. Far fewer errors in the kernel supplied with Hardy, but I still hit this bug about every other boot.

Any word on the prospects of a fix for this in Hardy, and if not, how might I go about getting a patch? Or, is there a boot param I might use as a workaround?

Thanks,

David

Revision history for this message
Jeremy Jackson (jerj) wrote : Re: [Bug 210637] Re: sata_nv regression, reboots system

Hi David,

I don't know if you follow the linux-ide mailing list, but there's been
a ton of work done on SATA error handling, the sata_nv driver, and NCQ
support between 2.6.16 and 2.6.25.

I have been paying attention, looking to find SATA Port Multiplier
support the moment it's ready. I don't think it's in Hardy kernel BTW.

I'm curious to know more about your issue. Are you setting Write Cache
off? That triggers it for me since it doesn't take much write load to
cause multiple outstanding commands that way.

What filesystem are you using, do you use device-mapper, anything else
about your setup, like smartd, mdadm daemon, that might be accessing the
drives?

My impression of Hardy (which is supposed to be a long-term release -
LTS) is that it's going out on schedule, no matter how many unresolved
issues... sadly it feels like windows a bit. I'm a bit upset they
deferred this patch, yet managed to find the time to introduce other
kernel changes that disabled a number of sound cards.

so a word to the wise, test like crazy, 2.6 series is undergoing a ton
of changes upstream as well. buyer beware.

As for kernel command line, I haven't tried these, but you can turn off
ADMA mode, as well as NCQ. But the real fix is the patch I mentioned in
the Ubuntu ticket. It gives the exact GIT commit ID, as well as a
reference to the mailing list post about it. You can rebuild your kernel
from source + that patch for a fix.
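For reference, a minimal sketch of those two workarounds, as untested as the suggestion itself. It assumes the sata_nv module parameters are named adma and swncq (check "modinfo sata_nv" on your kernel first, since an unknown parameter can stop the module from loading), that sata_nv.conf is just an illustrative file name, and that limiting a disk's queue_depth to 1 through sysfs is an acceptable way to stop NCQ on that disk. Both functions take an optional path so they can be tried against a scratch directory before touching the real system.

```shell
#!/bin/sh
# Sketch: fall back to plain (non-ADMA, non-NCQ) operation for sata_nv.
# The default paths are assumptions; pass your own to rehearse safely.

# write_sata_nv_options [FILE]
# Write a modprobe options line that turns off ADMA and software NCQ.
write_sata_nv_options() {
    conf=${1:-/etc/modprobe.d/sata_nv.conf}
    printf 'options sata_nv adma=0 swncq=0\n' > "$conf"
}

# limit_queue_depth DISK [SYSFS_ROOT]
# A queue depth of 1 allows one command in flight at a time, which
# effectively disables NCQ for that disk until the next boot.
limit_queue_depth() {
    disk=$1
    sysfs=${2:-/sys}
    echo 1 > "$sysfs/block/$disk/device/queue_depth"
}
```

As root, "write_sata_nv_options" followed by a reboot, or "limit_queue_depth sda" on a running system, would be the typical use. If sata_nv is loaded from the initramfs, the options file probably only takes effect after the initramfs is regenerated; on the kernel command line the same settings would likely be spelled sata_nv.adma=0 sata_nv.swncq=0.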

I may have to use the Hardy kernel source + that patch until they
release the fix, or go back to Gutsy kernel and patch in the dme1737
hardware monitor driver.

On Tue, 2008-04-15 at 14:34 +0000, David Goldsmith wrote:
> I believe I've hit the same bug running Hardy beta on a Sun Ultra 40
> with HDS7225S.
>
> I've had related kernel problems since about 2.6.20 and am eager to get
> off of SuSE 10.2 (running 2.6.19), which is the last kernel I've been
> able to boot reliably. Far less errors in the kernel supplied with
> Hardy, but I still hit this bug about every other boot.
>
> Any word on the prospects of a fix for this in Hardy, and if not, how
> might I go about getting a patch? Or, is there a boot param I might use
> as a workaround?
>
> Thanks,
>
> David
>
--
Jeremy Jackson
Coplanar Networks
(519)489-4903
http://www.coplanar.net
<email address hidden>

Revision history for this message
Jeremy Jackson (jerj) wrote :

I'm puzzled that, although Hardy has LTS status, this fix has been deferred, yet there was time to push changes that broke multiple sound cards.... /me continues to grumble...

Revision history for this message
Jeremy Jackson (jerj) wrote :

I should add that I'm running a Hardy kernel from 13 days ago + the above patch, on 1 machine here, with write cache disabled, no problems so far. Machine uptime 13 days and counting.

Revision history for this message
David Goldsmith (dg) wrote :

Hi, Jeremy,

Thanks for the information about this issue. I'm in the power user
category, never compiled the kernel before (well, maybe a couple of
times), pretty knowledgeable about stuff but not enough to read the alias.
I write training material for servers for a living so I'm good technically
but not too with it when it comes to Linux troubleshooting.

So I'm looking for a pretty simple workaround, if one exists.

Responses inline

> Hi David,
>
> I don't know if you follow the linux-ide mailing list, but there's been
> a ton of work done on SATA error handling, the sata_nv driver, and NCQ
> support between 2.6.16 and 2.6.25.
>
> I have been paying attention, looking to find SATA Port Multiplier
> support the moment it's ready. I don't think it's in Hardy kernel BTW.
>
> I'm curious to know more about your issue. Are you setting Write Cache
> off? That triggers it for me since it doesn't take much write load to
> cause multiple outstanding commands that way.

Nope. Just a plain vanilla setup on a Sun Ultra 40 (not the M2). I can
send you dmesg output from Hardy and/or SuSE w/2.6.19 kernel if you like.
The nVidia chipset on this machine is suspected of being a problem,
according to some other threads I've read.

>
> What filesystem are you using, do you use device-mapper, anything else
> about your setup, like smartd, mdadm daemon, that might be accessing the
> drives?

Just plain old ext3 (I think - I don't have the system in front of me
right now).

>
> My impression of Hardy (which is supposed to be a long-term release -
> LTS) is that it's going out on schedule, no matter how many unresolved
> issues... sadly it feels like windows a bit. I'm a bit upset they
> deferred this patch, yet managed to find the time to introduce other
> kernel changes that disabled a number of sound cards.
>
> so a word to the wise, test like crazy, 2.6 series is undergoing a ton
> of changes upstream as well. buyer beware.

Understood. But I'm basically just a hobbyist / power user type so no big
deal.

>
> As for kernel command line, I haven't tried these, but you can turn off
> ADMA mode, as well as NCQ. But the real fix is the patch I mentioned in
> the Ubuntu ticket. It gives the exact GIT commit ID, as well as a
> reference to the mailing list post about it. You can rebuild your kernel
> from source + that patch for a fix.

If you know the command-line options to do that, that would be great. I'd
give it a try tonight.

If you could point me at a tutorial on obtaining and applying a patch,
that would be nice, too. I could probably figure it out.
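A rough sketch of the "apply a patch" step, not a full kernel-build tutorial. It assumes the fix has already been exported to a file, for example by running "git format-patch -1 a1fe782414b7122d4c0501d3a0988b7302fa586f" inside a clone of the upstream tree, and that the Hardy kernel source is unpacked somewhere. The function name and both paths are hypothetical.

```shell
#!/bin/sh
# Sketch: apply one exported commit to an unpacked kernel source tree.

# apply_fix PATCHFILE TREE
# PATCHFILE must be an absolute path, because patch changes into TREE
# (-d) before it opens the file given with -i.
apply_fix() {
    patchfile=$1
    tree=$2
    # First pass: --dry-run reports rejects without modifying the tree.
    patch -p1 -d "$tree" --dry-run -i "$patchfile" > /dev/null || return 1
    # Second pass: actually apply the change.
    patch -p1 -d "$tree" -i "$patchfile"
}
```

A call might look like "apply_fix /tmp/0001-sata_nv-fix.patch /usr/src/linux-source-2.6.24", where both paths are placeholders. If the dry run fails, the tree is left untouched and the patch needs manual adjustment for that kernel version. Rebuilding the kernel packages afterwards is a separate step not covered here.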

>
> I may have to use the Hardy kernel source + that patch until they
> release the fix, or go back to Gutsy kernel and patch in the dme1737
> hardware monitor driver.

I can certainly wait for 8.04.1, or switch off to another distro with a
more up to date kernel if need be. I have no good reason to get off of
SuSE other than I want my home machine to have the same configuration as
my work machine.

Thanks,

David

>
>
> On Tue, 2008-04-15 at 14:34 +0000, David Goldsmith wrote:
>> I believe I've hit the same bug running Hardy beta on a Sun Ultra 40
>> with HDS7225S.
>>
>> I've had related kernel problem...


Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

I'm going to milestone this for 8.04.1 for the kernel team to consider.

Please also note that the patch does appear to be in the Intrepid Ibex 8.10 kernel, which is currently being pulled together. If you would care to test the Intrepid Ibex 8.10 kernel, it is currently available in the following PPA:

https://edge.launchpad.net/~kernel-ppa/+archive

If you are not familiar with how to install packages from a PPA, basically do the following:

Create the file /etc/apt/sources.list.d/kernel-ppa.list to include the following two lines:

deb http://ppa.launchpad.net/kernel-ppa/ubuntu hardy main
deb-src http://ppa.launchpad.net/kernel-ppa/ubuntu hardy main

Then run the command:

sudo apt-get update

You should then be able to install the linux-image-2.6.25 kernel package. Please let us know your results. Thanks.
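The file-creation step above can be sketched as a small shell function. The function name and its parameters are made up for illustration; the two lines it writes follow the format given in the instructions, and the output path defaults to the file named there.

```shell
#!/bin/sh
# Sketch: write the two apt source lines for a Launchpad PPA.

# write_ppa_list OWNER [SERIES] [FILE]
# OWNER is the PPA owner (for example kernel-ppa), SERIES the Ubuntu
# release name, FILE the list file to create or overwrite.
write_ppa_list() {
    owner=$1
    series=${2:-hardy}
    list=${3:-/etc/apt/sources.list.d/kernel-ppa.list}
    {
        echo "deb http://ppa.launchpad.net/$owner/ubuntu $series main"
        echo "deb-src http://ppa.launchpad.net/$owner/ubuntu $series main"
    } > "$list"
}
```

Run as root, "write_ppa_list kernel-ppa" followed by "apt-get update" should leave the system in the state the steps describe. Passing a third argument writes the file somewhere harmless so the result can be inspected first.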

Changed in linux:
milestone: none → ubuntu-8.04.1
Changed in linux:
assignee: ubuntu-kernel-team → colin-king
Changed in linux:
status: Triaged → In Progress
Revision history for this message
Jeremy Jackson (jerj) wrote :

FYI I've rebuilt with the linux-source pkg from Hardy + above patch, running for 10 days so far with write cache off, some heavy io loads (rsync and xfs_repair on 400GB raid1 + device mapper) no issues with SATA.

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi,

I've put up some kernel packages (linux-2.6.24-17.32ckingppa4 with a linux-image that contains the relevant fixes) in my PPA at https://launchpad.net/~colin-king/+archive

Please try this kernel and report any success/regressions. If this fixes the problem I can add the patch to Hardy 8.04.1. Thanks.

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Jeremy,

Is there any chance that you can try out the kernel packages in my PPA to check if this patched kernel resolves the issue? Once it has been verified I can then add the patch to 8.04.1 so that everyone can benefit from the fix.

Thank you.

Colin

Changed in linux:
assignee: nobody → colin-king
importance: Undecided → High
status: New → In Progress
assignee: colin-king → nobody
milestone: ubuntu-8.04.1 → none
status: In Progress → Fix Released
Changed in linux:
status: Fix Released → Fix Committed
Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Jeremy,

Just so you have directions on how to install from Colin's PPA, it's pretty much the same steps as in comment https://bugs.edge.launchpad.net/ubuntu/+bug/210637/comments/11 . The only change will be that when you create /etc/apt/sources.list.d/kernel-ppa.list use the following two lines instead:

deb http://ppa.launchpad.net/colin-king/ubuntu hardy main
deb-src http://ppa.launchpad.net/colin-king/ubuntu hardy main

Then as in the previous steps run "sudo apt-get update". I believe the package name that you will want to then install is called "linux-image-2.6.24-17-generic"

ogasawara@yoji:~$ apt-cache show linux-image-2.6.24-17-generic
Package: linux-image-2.6.24-17-generic
Source: linux
Priority: optional
Section: base
Installed-Size: 60360
Maintainer: Ubuntu Kernel Team <email address hidden>
Architecture: i386
Version: 2.6.24-17.32ckingppa4

Hope that helps. Thanks.

Revision history for this message
David Goldsmith (dg) wrote :

This morning, I verified that the fix from Colin's PPA works on my system. No more reboots! Hope you'll be able to get this into 8.04.1

Thanks and regards,

David

Revision history for this message
Colin Ian King (colin-king) wrote :

SRU justification:

Impact: sata_nv regression, reboots system. Since 2.6.23 somewhere,
sata_nv has gained improved exception handling, but there is a serious
regression that causes a system reboot.

Testcase: run # hdparm -W 0 /dev/sda ; hdparm -W 0 /dev/sdb. On Gutsy,
this gives periodic kernel messages, but on Hardy it fails terribly. XFS
filesystem shutdown due to error, and most times system reboots without
any messages to serial console.
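A hedged sketch of how the test case might be scripted. It only prints the hdparm commands unless DRY_RUN=0 is set, because disabling the write cache on a live system is not something to do by accident. The error text it counts is the one quoted in the upstream commit message ("CPB flags CMD err"); other failure modes, such as a silent reboot, will not show up in any log scan. The function names and the DRY_RUN variable are illustrative.

```shell
#!/bin/sh
# Sketch: reproduce the trigger condition and look for the ADMA error.

# write_cache_off DEVICE...
# Print the hdparm command for each device; run it only when DRY_RUN=0.
write_cache_off() {
    for dev in "$@"; do
        echo "hdparm -W 0 $dev"
        if [ "${DRY_RUN:-1}" = "0" ]; then
            hdparm -W 0 "$dev"
        fi
    done
}

# count_adma_errors LOGFILE
# Count log lines carrying the ADMA error text. grep -c exits non-zero
# on zero matches, so "|| true" keeps the function's status clean.
count_adma_errors() {
    grep -c 'CPB flags CMD err' "$1" || true
}
```

For example, "DRY_RUN=0 write_cache_off /dev/sda /dev/sdb" as root, then some write load, then "count_adma_errors" on a saved copy of the dmesg output.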

Patch in my PPA tested and verified
https://bugs.launchpad.net/ubuntu/+bug/210637/comments/16

Patch from upstream cherry pick a1fe782414b7122d4c0501d3a0988b7302fa586f

Changed in linux:
status: In Progress → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

Accepted into -proposed, please test and give feedback here

Revision history for this message
David Goldsmith (dg) wrote :

Verified the kernel in hardy-proposed this morning. This kernel loads cleanly with no rebooting.

But, as noted in my earlier e-mail to Colin, my sound card does not work with this kernel. Maybe support for sound is not compiled in, or is specific to the developer's machine.

No other problems.

Thanks,

David

Revision history for this message
Martin Pitt (pitti) wrote :

Copied to hardy-updates. The new kernel was tested extensively by many people, who reported back in other bug reports. Due to lack of feedback, this particular bug was not confirmed to be tested, though. Please report back here if the bug still occurs for you with the new kernel packages, then we will reopen this bug.

Revision history for this message
Martin Pitt (pitti) wrote :

Please apply to Intrepid as well.

Changed in linux:
assignee: nobody → colin-king
Changed in linux:
status: Fix Committed → Fix Released
Revision history for this message
Slawomir Gajowniczek (imkebe) wrote :

After upgrading to 2.6.24-18.32, swncq seems to be enabled by default. I've noticed a few errors like:

[...]
[69919.423025] ata3: EH in SWNCQ mode,QC:qc_active 0x3FF sactive 0x3FF
[69919.423060] ata3: SWNCQ:qc_active 0x3 defer_bits 0x3FC last_issue_tag 0x1
[69919.423061] dhfis 0x3 dmafis 0x1 sdbfis 0x0
[69919.423082] ata3: ATA_REG 0x40 ERR_REG 0x0
[69919.423105] ata3: tag : dhfis dmafis sdbfis sacitve
[69919.423126] ata3: tag 0x0: 1 1 0 1
[69919.423142] ata3: tag 0x1: 1 0 0 1
[69919.423165] ata3.00: exception Emask 0x0 SAct 0x3ff SErr 0x0 action 0x6 frozen
[69919.423200] ata3.00: cmd 60/78:00:bc:cd:c6/00:00:19:00:00/40 tag 0 ncq 61440 in
[69919.423201] res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[69919.423231] ata3.00: status: { DRDY }
[...]
[69919.425237] ata3: hard resetting link
[69919.898648] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[69919.914821] ata3.00: configured for UDMA/133
[69919.914845] ata3: EH complete
[...]

I recently upgraded to 2.6.24-19.33 and the errors remain, only now there are many more of them. I'm attaching some debug information. I don't really know whether this problem is related to swncq. Please let me know if it isn't, so I can submit it as a new bug.
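To put a number on "how many", a small sketch that counts error-handler entries and link resets per port in a saved kernel log. It matches only the two message strings seen in the excerpt above ("EH in SWNCQ mode" and "hard resetting link"), so it says nothing about other error types, and ports that were reset without an SWNCQ error-handler entry are not listed. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: per-port summary of SWNCQ error handling in a dmesg dump.

# eh_summary LOGFILE
# Prints one line per port: "<port> eh=<count> resets=<count>".
eh_summary() {
    awk '
        # Drop a leading "[seconds.micros] " timestamp if there is one;
        # changing $0 makes awk re-split the fields.
        { sub(/^\[ *[0-9.]+\] */, "") }
        /^ata[0-9]+: EH in SWNCQ mode/    { split($1, p, ":"); eh[p[1]]++ }
        /^ata[0-9]+: hard resetting link/ { split($1, p, ":"); rst[p[1]]++ }
        END {
            for (port in eh)
                printf "%s eh=%d resets=%d\n", port, eh[port], rst[port]
        }
    ' "$1" | sort
}
```

Typical use would be "dmesg > /tmp/dmesg.txt" followed by "eh_summary /tmp/dmesg.txt", run once per kernel version to compare the two.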

Revision history for this message
Colin Ian King (colin-king) wrote :

Martin wrote:
> Please apply to Intrepid as well.

It's been automatically included from the 2.6.26 tree.

Colin

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Marking "Fix Released" against Intrepid since the intrepid kernel is now available in the archives. Thanks.

Changed in linux:
status: Fix Committed → Fix Released