null pointer dereference in hsa_init
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
rocr-runtime (Debian) | Fix Released | Unknown | | |
rocr-runtime (Ubuntu) | Fix Released | Undecided | Unassigned | |
Jammy | Fix Released | Undecided | Unassigned | |
Kinetic | Fix Released | Undecided | Unassigned | |
Bug Description
[ Impact ]
The rocr-runtime provides the basic interface between compute code written to run on AMD GPUs and the AMDGPU/AMDKFD driver within the kernel. On Ubuntu 22.04, the library crashes with a segfault during initialization. This bug makes the library unusable.
On Ubuntu 22.04, the main use for this library is in rocminfo, which provides AMD GPU users with a description of the compute capabilities of their hardware. For example, rocminfo provides the name of the ISA for the hardware, which is useful for choosing compiler flags when building GPU libraries from source. Invoking rocminfo is also an easy way for novice users to find information about their hardware (e.g., for inclusion in bug reports filed against GPU libraries). It would therefore be useful if this fix could be backported to Ubuntu 22.04.
The fix changes the order of initialization of a pair of static variables in the rocr-runtime by moving them into the same translation unit, thereby ensuring the order is both deterministic and correct.
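As a rough illustration of why placing both variables in one translation unit helps, here is a minimal sketch. The names (ApiTable, BuildTable, g_table, g_table_copy) are hypothetical and are not taken from the rocr-runtime sources; the sketch only shows the C++ rule the fix relies on, namely that ordered dynamic initialization within a single translation unit follows definition order.

```cpp
#include <cassert>

// Hypothetical stand-in for runtime-wide state; not the real rocr-runtime type.
struct ApiTable {
    int version;
};

// The initializer below calls this function, so g_table is set up at
// program startup rather than at compile time.
static ApiTable* BuildTable() {
    static ApiTable table{1};
    return &table;
}

// Both objects are defined in this one translation unit. Within a single
// translation unit, initialization follows definition order, so g_table
// always holds its final value before g_table_copy reads it.
ApiTable* g_table = BuildTable();
ApiTable* g_table_copy = g_table;

// Reads through the copied pointer. The assert documents the invariant
// that the same-translation-unit ordering provides.
int TableVersion() {
    assert(g_table_copy != nullptr);
    return g_table_copy->version;
}
```

If the two definitions were split across separate source files, the language would no longer specify which initializer runs first, and the result could vary with compiler and link order.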
[ Test Plan ]
To reproduce this bug, you will need an AMD GPU installed on the machine. Then the following terminal commands should be sufficient to cause a segfault originating in the rocr-runtime:
apt install rocminfo kmod
rocminfo
Once the bug is fixed, you should see detailed information about your installed GPU hardware printed to standard output. This bug is deterministic at runtime, so the fix is relatively easy to verify, provided you have the necessary hardware.
On Ubuntu 22.04, the rocminfo utility is the only package that depends on rocr-runtime, so this simple test is fairly comprehensive.
[ Where problems could occur ]
The rocr-runtime package is already badly broken, so the risk associated with backporting a fix is low. If a mistake were made in fixing this bug, the most likely outcome would be that the package remains broken.
[ Other info ]
The same fix is in use on Debian Unstable, Ubuntu 23.04 and upstream, so it is already being used in other environments (albeit with different versions of rocr-runtime).
[ Original bug report ]
# System Information:
Description: Ubuntu 22.04.2 LTS
Release: 22.04
# Package Version:
libhsa-runtime64-1:
Installed: 5.0.0-1
Source: rocr-runtime
# What was done:
# on Ubuntu 22.04 or 22.10 with an AMD GPU installed
apt install rocminfo kmod
rocminfo
# What was seen:
ROCk module is loaded
Segmentation fault (core dumped)
Note that the rocminfo utility will not try to initialize libhsa-runtime64 unless you have an AMD GPU installed, so an AMD GPU is required to reproduce this problem.
After some debugging, I came to the conclusion that this is a null pointer dereference in libhsa-runtime64. The order of static initialization is different when building the rocr-runtime package on Ubuntu as compared to on Debian, and this results in the package working on Debian but crashing when it's rebuilt for Ubuntu. A couple of static variables are being copied before they are initialized, leading to a null pointer dereference later on in the program.
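The failure mode described above can be imitated in a single file. This is only a sketch under stated assumptions: the names (ApiTable, BuildTable, g_table, g_early_copy) are hypothetical, and the "wrong order" is forced here by defining the reader before the variable it copies, whereas in the real package the order depended on how separate translation units happened to be initialized.

```cpp
#include <cassert>

// Hypothetical stand-in for runtime-wide state; not the real rocr-runtime type.
struct ApiTable {
    int version;
};

// Allocates at startup so the initializer cannot be evaluated at compile
// time. The allocation is intentionally never freed in this sketch.
ApiTable* BuildTable() {
    return new ApiTable{1};
}

// Declared here, defined further down. This mimics a variable whose
// dynamic initializer runs later than the code that reads it.
extern ApiTable* g_table;

// This initializer runs first. At that point g_table has only been
// zero-initialized, so the copy captures a null pointer.
ApiTable* g_early_copy = g_table;

// The real value arrives too late for g_early_copy to see it.
ApiTable* g_table = BuildTable();

// Reports whether the early copy missed the initialized value.
// Dereferencing g_early_copy instead would be a null pointer dereference.
bool EarlyCopyIsNull() {
    return g_early_copy == nullptr;
}
```

By the time main runs, g_table is valid but g_early_copy is still null, which matches the shape of the crash: the stale copy is only dereferenced later in the program.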
# What was expected:
rocminfo should not crash
# Debian Bug:
https:/
# Debian Patch:
https:/
The patch applied to the Debian package has fixed this bug in Ubuntu 23.04. It would be great if the fix could also be applied to Ubuntu 22.04 LTS. There's not a lot of ROCm functionality in Jammy, but fixing this bug would at least get the basics like rocminfo working.
Changed in rocr-runtime (Ubuntu):
status: New → Fix Released
description: updated
Changed in rocr-runtime (Debian):
status: Unknown → Fix Released
Changed in rocr-runtime (Ubuntu Kinetic):
status: New → In Progress
assignee: nobody → Erich Eickmeyer (eeickmeyer)
Changed in rocr-runtime (Ubuntu Kinetic):
assignee: Erich Eickmeyer (eeickmeyer) → nobody