null pointer dereference in hsa_init

Bug #2007993 reported by Cory Bloor
This bug affects 1 person
Affects                 Status         Importance   Assigned to   Milestone
rocr-runtime (Debian)   Fix Released   Unknown
rocr-runtime (Ubuntu)   Fix Released   Undecided    Unassigned
  Jammy                 Fix Released   Undecided    Unassigned
  Kinetic               Fix Released   Undecided    Unassigned

Bug Description

[ Impact ]

The rocr-runtime provides the basic interface between compute code written to run on AMD GPUs and the AMDGPU/AMDKFD driver within the kernel. On Ubuntu 22.04, the library crashes with a segfault during initialization. This bug makes the library unusable.

On Ubuntu 22.04, the main use for this library is in rocminfo, which provides AMD GPU users with a description of the compute capabilities of their hardware. For example, rocminfo provides the name of the ISA for the hardware, which is useful for choosing compiler flags when building GPU libraries from source. Invoking rocminfo is also an easy way for novice users to find information about their hardware (e.g., for inclusion in bug reports filed against GPU libraries). It would therefore be useful if this fix could be backported to Ubuntu 22.04.

The fix changes the order of initialization of a pair of static variables in the rocr-runtime by moving them into the same translation unit, thereby ensuring the order is both deterministic and correct.
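
As a rough illustration of why moving both variables into one translation unit helps, here is a minimal sketch. It is not the actual patch: ApiTable, make_table, g_table and g_copy are hypothetical names invented for this example. It relies only on the C++ rule that ordinary namespace-scope variables defined in the same translation unit are dynamically initialized in the order of their definitions.

```cpp
#include <cassert>

// Hypothetical stand-in for a table of API entry points.
struct ApiTable {
    int (*get_version)();
};

int real_get_version() { return 5; }

// Builds the table through a function call, so g_table below is not a
// constant expression and is set up during program startup.
const ApiTable* make_table() {
    static const ApiTable table{&real_get_version};
    return &table;
}

// Both variables are defined in the same translation unit, source first.
// Within one translation unit, initialization follows definition order,
// so g_table is always ready before g_copy reads it.
const ApiTable* g_table = make_table();
const ApiTable* g_copy = g_table;

// Uses the copy; the assertion documents the guarantee described above.
int version_via_copy() {
    assert(g_copy != nullptr);
    return g_copy->get_version();
}
```

With this layout the result no longer depends on how the linker happens to order object files, which is the property the fix is after.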

[ Test Plan ]

To reproduce this bug, you will need an AMD GPU installed on the machine. Then the following terminal commands should be sufficient to cause a segfault originating in the rocr-runtime:

    apt install rocminfo kmod
    rocminfo

Once the bug is fixed, you should see detailed information about your installed GPU hardware printed to standard output. This bug is deterministic at runtime, so it is relatively easy to verify if you have the necessary hardware.

On Ubuntu 22.04, the rocminfo utility is the only package that depends on rocr-runtime, so this simple test is fairly comprehensive.

[ Where problems could occur ]

The rocr-runtime package is already badly broken, so the risk associated with backporting a fix is low. If a mistake were made in fixing this bug, the most likely outcome would be that the package remains broken.

[ Other info ]

The same fix is in use on Debian Unstable, Ubuntu 23.04 and upstream, so it is already being used in other environments (albeit with different versions of rocr-runtime).

[ Original bug report ]

# System Information:
Description: Ubuntu 22.04.2 LTS
Release: 22.04

# Package Version:
libhsa-runtime64-1:
  Installed: 5.0.0-1
  Source: rocr-runtime

# What was done:

    # on Ubuntu 22.04 or 22.10 with an AMD GPU installed
    apt install rocminfo kmod
    rocminfo

# What was seen:

    ROCk module is loaded
    Segmentation fault (core dumped)

Note that the rocminfo utility will not try to initialize libhsa-runtime64 unless you have an AMD GPU installed, which is required to reproduce this problem.

After some debugging, I came to the conclusion that this is a null pointer dereference in libhsa-runtime64. The order of static initialization is different when building the rocr-runtime package on Ubuntu as compared to on Debian, and this results in the package working on Debian but crashing when it's rebuilt for Ubuntu. A couple of static variables are being copied before they are initialized, leading to a null pointer dereference later on in the program.
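
The failure mode can be sketched in a single file. This is only an illustration under stated assumptions: ApiTable, make_table, g_table and g_early_copy are hypothetical names, not identifiers from rocr-runtime, and the "wrong" order is simulated by declaring the source variable before defining it, whereas in the real bug the two variables lived in different translation units. With typical GCC and Clang builds the early copy ends up null, because at that point g_table only holds its zero-initialized value.

```cpp
#include <cassert>

// Hypothetical stand-in for a table of API entry points.
struct ApiTable {
    int (*get_version)();
};

int real_get_version() { return 5; }

// Builds the table through a function call, so g_table needs a dynamic
// initializer that runs during program startup.
const ApiTable* make_table() {
    static const ApiTable table{&real_get_version};
    return &table;
}

// Declared here but defined further down. This mimics a variable whose
// initializer happens to run later than the code that reads it.
extern const ApiTable* g_table;

// Copies g_table before g_table's own initializer has run, so the copy
// typically holds a null pointer for the rest of the program.
const ApiTable* g_early_copy = g_table;

// The source variable is only initialized now.
const ApiTable* g_table = make_table();

// Going through the source variable is fine once startup has finished.
int version_via_table() {
    assert(g_table != nullptr);
    return g_table->get_version();
}

// Going through the early copy: returns -1 where unguarded code would
// dereference a null pointer and crash.
int version_via_early_copy() {
    if (g_early_copy == nullptr) {
        return -1;
    }
    return g_early_copy->get_version();
}
```

Code that skips the null check in version_via_early_copy would segfault at the first call through the copied pointer, which matches the kind of crash seen here.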

# What was expected:
rocminfo should not crash

# Debian Bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1031089

# Debian Patch:
https://salsa.debian.org/rocm-team/rocr-runtime/-/blob/debian/5.2.3-3/debian/patches/0003-fix-static-initialization-order.patch

The patch applied to the Debian package has fixed this bug in Ubuntu 23.04. It would be great if the fix could also be applied to Ubuntu 22.04 LTS. There's not a lot of ROCm functionality in Jammy, but fixing this bug would at least get the basics like rocminfo working.

Cory Bloor (slavik81) wrote :
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "debian patch that fixes this bug in ubuntu 23.04" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Changed in rocr-runtime (Ubuntu):
status: New → Fix Released
Stefano Rivera (stefanor) wrote :

I've prepared an upload, but am unable to test it locally. Cory can test it.

Stefano Rivera (stefanor) wrote :
Cory Bloor (slavik81)
description: updated
Cory Bloor (slavik81) wrote :

I applied the debdiff that Stefano supplied and built the new package on my test machine. I can confirm that the crashing behaviour has been resolved with his change. If you'd like to review the results of my testing, please see the attached logfile for the behaviour with rocr-runtime 5.0.0-1 and rocr-runtime 5.0.0-1ubuntu0.1.

Changed in rocr-runtime (Debian):
status: Unknown → Fix Released
Robie Basak (racb) wrote : Please test proposed package

Hello Cory, or anyone else affected,

Accepted rocr-runtime into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rocr-runtime/5.0.0-1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rocr-runtime (Ubuntu Jammy):
status: New → Fix Committed
tags: added: verification-needed verification-needed-jammy
Cory Bloor (slavik81) wrote :

Thanks Robie,

I have verified that the updated package fixes this bug.

I reproduced this issue in a fresh Ubuntu 22.04 container by following the test plan from the original bug description with jammy-proposed enabled (but with no proposed packages installed). The call to rocminfo crashed as expected. I then installed libhsa-runtime64-1/jammy-proposed and verified that the rocminfo output was correct.

The details of this testing (including details about the hardware used) can be found in the attached log: libhsa-runtime64-1_5.0.0-1ubuntu0.1_verification.txt.

tags: added: verification-done-jammy
removed: verification-needed-jammy
Andreas Hasenack (ahasenack) wrote :

Cory, https://launchpad.net/ubuntu/+source/rocr-runtime/5.0.0-1ubuntu0.1 is an FTBFS in jammy-proposed. I'm guessing you verified it on amd64 (the only green build), but the FTBFS on the other architectures has to be resolved.

At first glance it doesn't look like the type of error that a rebuild would fix:

<<PKGBUILDDIR>>/src/core/util/utils.h:63:10: fatal error: x86intrin.h: No such file or directory

I'm flipping the verification tag back to needed because of that, to avoid mistakes in releasing this.

tags: added: verification-needed-jammy
removed: verification-done-jammy
Cory Bloor (slavik81) wrote :

Hi Andreas, this package was only ever released on amd64 on jammy [1]. You can see that when jammy was built a year ago, the currently released version of this package also FTBFS on all platforms except amd64 [2].

The proposed patch didn't fix the build on non-amd64 platforms, but it didn't break it either. Those platforms never built successfully to begin with.

[1]: https://packages.ubuntu.com/jammy/libs/libhsa-runtime64-1
[2]: https://launchpad.net/ubuntu/+source/rocr-runtime/5.0.0-1

Andreas Hasenack (ahasenack) wrote :

Ah, indeed, thanks for the explanation Cory.

tags: added: verification-done-jammy
removed: verification-needed-jammy
Andreas Hasenack (ahasenack) wrote :

Judging by the code, kinetic looks to be affected as well. Or was kinetic as lucky as Debian and didn't experience the crash? Are you able to verify?

Andreas Hasenack (ahasenack) wrote :

The [original bug report] section in the bug description hints that it also happens in 22.10 (kinetic):

    # What was done:

        # on Ubuntu 22.04 or 22.10 with an AMD GPU installed
        apt install rocminfo kmod
        rocminfo

I'm therefore adding a task for kinetic as well. If kinetic is not affected in the end, we can easily mark it as invalid.

Cory Bloor (slavik81) wrote :

Yes, kinetic is affected. I can confirm that the 0003-fix-static-initialization-order.patch fixes the bug there too. I've attached a debdiff that applies the fix.

Robie Basak (racb) wrote :

Thanks. This still needs review and sponsorship. Subscribing ~ubuntu-sponsors. Who sponsored your upload to Jammy - maybe they could review and sponsor this for Kinetic as well?

Andreas Hasenack (ahasenack) wrote :

The jammy update is still blocked on a kinetic upload.

Changed in rocr-runtime (Ubuntu Kinetic):
status: New → In Progress
assignee: nobody → Erich Eickmeyer (eeickmeyer)
Steve Langasek (vorlon) wrote :

Hello Cory, or anyone else affected,

Accepted rocr-runtime into kinetic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rocr-runtime/5.1.0-2ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-kinetic to verification-done-kinetic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-kinetic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rocr-runtime (Ubuntu Kinetic):
status: In Progress → Fix Committed
tags: added: verification-needed-kinetic
Cory Bloor (slavik81) wrote :

I have verified that the updated package fixes this bug.

I reproduced this issue in a fresh Ubuntu 22.10 container by following the test plan from the original bug description with kinetic-proposed enabled (but with no proposed packages installed). The call to rocminfo crashed as expected. I then installed libhsa-runtime64-1/kinetic-proposed and verified that the rocminfo output was correct.

The details of this testing (including details about the hardware used) can be found in the attached log.

tags: added: verification-done-kinetic
removed: verification-needed-kinetic
Changed in rocr-runtime (Ubuntu Kinetic):
assignee: Erich Eickmeyer (eeickmeyer) → nobody
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rocr-runtime - 5.1.0-2ubuntu0.1

---------------
rocr-runtime (5.1.0-2ubuntu0.1) kinetic; urgency=medium

  * Patch: Fix Segfault in hsa_init on systems with an AMD GPU
    (LP: #2007993).

 -- Cordell Bloor <email address hidden> Fri, 21 Apr 2023 14:55:45 -0600

Changed in rocr-runtime (Ubuntu Kinetic):
status: Fix Committed → Fix Released
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for rocr-runtime has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rocr-runtime - 5.0.0-1ubuntu0.1

---------------
rocr-runtime (5.0.0-1ubuntu0.1) jammy; urgency=medium

  * Patch: Fix Segfault in hsa_init on systems with an AMD GPU. Thanks Cory
    Bloor. (LP: #2007993).

 -- Stefano Rivera <email address hidden> Tue, 28 Feb 2023 15:19:35 -0400

Changed in rocr-runtime (Ubuntu Jammy):
status: Fix Committed → Fix Released