AMDGPU support #135
Some of them are very useless. If we really want we can fix them later, but -Wall already contains most of the useful warnings.
There's no reason to need a cast from void * to another pointer in C. We do that all the time without a cast with malloc.
Using typedef struct aliases makes it more difficult to understand what exactly is being abstracted away [1]; I need this change to understand the code. We continue to use nvmlDevice_t to stay consistent with the nvml headers. [1] https://www.kernel.org/doc/html/latest/process/coding-style.html#typedefs
Since we are adding more vendors, having some sort of generic simplifies the code a lot, instead of using unions.
So we don't need to hardcode supported vendors in generic extract_gpuinfo code.
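One way to read the two commits above is a per-vendor table of function pointers that each backend registers, so the generic code only walks a list. The sketch below is a minimal illustration under that assumption; `struct gpu_vendor`, `register_gpu_vendor`, `count_devices_all_vendors` and the two demo backends are hypothetical names, not the ones used in this PR.

```c
#include <stddef.h>

/* Hypothetical per-vendor interface: each backend fills in one of these. */
struct gpu_vendor {
  struct gpu_vendor *next;        /* intrusive singly linked list */
  const char *name;               /* e.g. "nvidia", "amdgpu" */
  int (*init)(void);              /* returns 0 on success */
  unsigned (*device_count)(void); /* number of devices found by this backend */
};

/* Head of the list of registered vendors; generic code never names a vendor. */
static struct gpu_vendor *vendors = NULL;

void register_gpu_vendor(struct gpu_vendor *vendor) {
  vendor->next = vendors;
  vendors = vendor;
}

/* Generic code: initialize every registered backend and sum its devices.
 * Backends whose init fails (library missing, no hardware) are skipped. */
unsigned count_devices_all_vendors(void) {
  unsigned total = 0;
  for (struct gpu_vendor *v = vendors; v != NULL; v = v->next) {
    if (v->init() == 0)
      total += v->device_count();
  }
  return total;
}

/* Two demo backends standing in for real vendor files. */
static int demo_init_ok(void) { return 0; }
static unsigned demo_one_device(void) { return 1; }
static unsigned demo_two_devices(void) { return 2; }

struct gpu_vendor demo_vendor_a = {NULL, "demo_a", demo_init_ok, demo_one_device};
struct gpu_vendor demo_vendor_b = {NULL, "demo_b", demo_init_ok, demo_two_devices};
```

With this shape, adding a vendor means adding one file that calls `register_gpu_vendor` on its own table; the generic extraction code stays untouched.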
* Use shifts for mask and make it into ssize_t (so it gets arithmetic shift instead of logical)
* Remove hardcoded names of nvidia in variable names outside of nvidia file.
Mostly this is just interacting with DRM kernel code and libdrm.
This is via the fdinfo interface. The enumeration of all the visible fds on a system is slightly expensive, but I'm not sure how to avoid it. The logic is partly adapted from unmerged intel_gpu_top code [1], adapted for AMDGPU. The i915 kernel changes have not been merged into mainline yet, AFAICT. [1] https://patchwork.freedesktop.org/series/100571/
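For readers unfamiliar with fdinfo: each `/proc/<pid>/fdinfo/<fd>` file is plain text with one `key:<whitespace>value` pair per line. The sketch below only shows the lookup step on a buffer that has already been read; the walk over `/proc` is omitted. The helper name `fdinfo_get` is hypothetical, and the `drm-driver` / `drm-client-id` keys are the ones from the kernel's generic DRM usage-stats format, which is an assumption here since the amdgpu keys at the time of this PR may differ.

```c
#include <stdbool.h>
#include <string.h>

/* Find "key:" at the start of a line in fdinfo-style text and copy the
 * value (leading blanks skipped, newline excluded) into out.
 * Returns false when the key is absent or out has no room at all. */
bool fdinfo_get(const char *text, const char *key, char *out, size_t out_size) {
  if (out_size == 0)
    return false;
  size_t key_len = strlen(key);
  const char *line = text;
  while (line != NULL && *line != '\0') {
    const char *eol = strchr(line, '\n');
    size_t line_len = eol ? (size_t)(eol - line) : strlen(line);
    if (line_len > key_len && strncmp(line, key, key_len) == 0 &&
        line[key_len] == ':') {
      const char *val = line + key_len + 1;
      const char *end = line + line_len;
      while (val < end && (*val == ' ' || *val == '\t'))
        val++;
      size_t val_len = (size_t)(end - val);
      if (val_len >= out_size)
        val_len = out_size - 1; /* truncate rather than overflow */
      memcpy(out, val, val_len);
      out[val_len] = '\0';
      return true;
    }
    line = eol ? eol + 1 : NULL;
  }
  return false;
}
```

A caller would typically check the driver name first and skip any fd that does not belong to the GPU driver, which keeps the expensive part limited to reading small text files.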
Amazing!
I successfully tested on a system I have access to with an AMD Radeon RX 6800 XT. For the fan speed, I saw that you added a comment. I did not find a way to get it through AMDGPU. lm_sensors seems to get the correct info though, so I will investigate where to retrieve it. Edit: Never mind, I found that for the fan: https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
Awesome! Yeah I didn't bother with checking for The reason I didn't do fan speed was that it seemed like it was slightly convoluted to retrieve. Rather than being a libdrm API call, I have to go through sysfs; doable but code can get messy and I can't test it.
src/extract_gpuinfo_amdgpu.c
close(allocated->hwmonFD);
if (allocated->sysfsFD >= 0)
  close(allocated->sysfsFD);
fclose(allocated->fanSpeedFILE);
Hmm, I'm actually relying on program exit to release all the resources; the code doesn't go through the device list to free each device's handles. This allocate_list is more about freeing the array of devices itself.
How I made it work is by pre-allocating an array of devices, and when we successfully get the handles for one device, adding that device to the exported linked list. The whole array is added to the "allocations" linked list just once.
Should we add this freeing of each device's handles?
I see. Why are you using a list if there is only one element in the "allocations" list, which is static to this file?
I guess that relying on the program termination to close the file descriptors is fine in that case.
Yeah, at first I wasn't sure how many times get_device_handles gets called :)
src/extract_gpuinfo_amdgpu.c
// There should be one directory inside hwmon, with a name having the following pattern hwmon[0-9]+
if (dirEntry->d_type == DT_DIR) {
  size_t matchLen = 0;
  for (matchLen = 0; matchLen < sizeof(hwmon) - 1; ++matchLen) {
Could be strncmp(dirEntry->d_name, "hwmon", sizeof("hwmon") - 1)?
Correct, I'll fix that.
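The suggestion in this thread could look roughly like the following. It is a sketch, not the code that ended up in the PR: `is_hwmon_entry` is a hypothetical helper, and it additionally checks the `[0-9]+` suffix mentioned in the code comment, which the `strncmp` call alone does not do.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Returns true when name matches the pattern hwmon[0-9]+ . */
bool is_hwmon_entry(const char *name) {
  static const char prefix[] = "hwmon";
  const size_t prefix_len = sizeof(prefix) - 1; /* exclude the trailing NUL */

  /* Prefix test replaces the hand-written character-by-character loop. */
  if (strncmp(name, prefix, prefix_len) != 0)
    return false;

  /* Require at least one digit and nothing but digits after the prefix. */
  const char *p = name + prefix_len;
  if (*p == '\0')
    return false;
  for (; *p != '\0'; ++p) {
    if (!isdigit((unsigned char)*p))
      return false;
  }
  return true;
}
```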
src/extract_gpuinfo_amdgpu.c
if (dirEntry) {
  gpu_info->hwmonFD = openat(dirfd(hwmonDir), dirEntry->d_name, O_RDONLY);
} else
  gpu_info->hwmonFD = -1;
How about instead of adding these else-s, pre-initialize gpu_info->hwmonFD = -1;
and only assign it if things go well? Could also avoid nested ifs by returning early.
You are right.
I do the convoluted if/else spaghetti as a first pass when the code is still changing many times.
I will clean this up.
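As an illustration of the cleanup discussed in this thread, here is a minimal sketch that starts from `-1` and only overwrites it on success, returning early instead of nesting. The function name `open_first_hwmon` and its path parameter are hypothetical; the PR stores the result in `gpu_info->hwmonFD` instead of returning it.

```c
#define _POSIX_C_SOURCE 200809L /* for openat, dirfd, O_DIRECTORY */
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns a directory fd for the first "hwmon*" entry under hwmon_path,
 * or -1 when the directory cannot be read or has no such entry.
 * The caller owns the returned fd and should close() it. */
int open_first_hwmon(const char *hwmon_path) {
  int fd = -1; /* pre-initialized: every failure path just returns it */

  DIR *dir = opendir(hwmon_path);
  if (dir == NULL)
    return fd; /* early return instead of an else branch */

  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    if (strncmp(entry->d_name, "hwmon", sizeof("hwmon") - 1) != 0)
      continue;
    /* Only assigned when things go well; stays -1 if openat fails. */
    fd = openat(dirfd(dir), entry->d_name, O_RDONLY | O_DIRECTORY);
    break;
  }

  closedir(dir);
  return fd;
}
```

The single `-1` default keeps the error handling in one place, so later code can test `fd >= 0` without caring which step failed.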
Other than that, does the patch work on your APU?
The APU does not expose fan information on the GPU's hwmon, AFAICT.
I think that we are almost good.
Nice! I'm including amdgpu_drm.h directly because it's a kernel UAPI header, and I think any sane compiler toolchain on Linux would include the kernel UAPI headers. Though, up to you, I could copy the declarations too, since the kernel-userspace ABI has guaranteed stability.
The problem with NVIDIA was that the nvml header was not packaged on all the distributions, and downloading a recent header could break the build. Also I was not certain I could redistribute a version of the nvidia header. That is why I included the definitions directly. Things are different with libdrm. The headers are consistently available on any distribution and updated with the kernel. Much cleaner than NVIDIA. Hence, relying on libdrm for AMDGPU support is fine in my opinion! If you are fine with the latest changes I will merge it into master.
I see.
Nice naming lol
No objections. Go ahead. Thanks!
A picture is worth a thousand words:
This PR isn't fully ready for merge yet. I think I'd like some feedback and do some polishing first (hence the RFC). My AMDGPU is also an iGPU / APU which limits how much I can test myself.
Most of the access is using libdrm and kernel APIs. <drm/amdgpu_drm.h> is a kernel UAPI header so I directly included it. For libdrm I'm using dlopen and re-declaring the constants to avoid a compile-time dependency, just like what the original code does with NVML.
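The dlopen approach described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `struct libdrm_amdgpu` and `load_libdrm_amdgpu` are hypothetical names, only one symbol is resolved, and the soname is passed in by the caller (on most distributions it would be `libdrm_amdgpu.so.1`).

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque handle re-declared locally instead of including <amdgpu.h>. */
typedef struct amdgpu_device *amdgpu_device_handle;

/* Function pointer type matching libdrm's amdgpu_device_initialize. */
typedef int (*amdgpu_device_initialize_fn)(int fd, uint32_t *major_version,
                                           uint32_t *minor_version,
                                           amdgpu_device_handle *device_handle);

struct libdrm_amdgpu {
  void *lib; /* handle returned by dlopen, NULL when not loaded */
  amdgpu_device_initialize_fn device_initialize;
};

/* Loads the library at runtime and resolves the symbols we need.
 * Returns 0 on success, -1 when the library or a symbol is missing,
 * in which case AMDGPU support can simply be disabled. */
int load_libdrm_amdgpu(const char *soname, struct libdrm_amdgpu *out) {
  out->device_initialize = NULL;
  out->lib = dlopen(soname, RTLD_LAZY);
  if (out->lib == NULL)
    return -1;

  out->device_initialize =
      (amdgpu_device_initialize_fn)dlsym(out->lib, "amdgpu_device_initialize");
  if (out->device_initialize == NULL) {
    dlclose(out->lib);
    out->lib = NULL;
    return -1;
  }
  return 0;
}
```

Because nothing links against libdrm at build time, the same binary runs on machines without the library; older glibc versions may need `-ldl` when linking.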
I also did some code refactoring to make my life easier, hope that is okay.
I really cannot deal with typedef structs.
CC #106, the i915 kernel changes don't seem to have been merged into mainline yet.