current develop does not compile with ROCm 5.2.3 (LUMI-G default) #1432
Comments
I ran into the same problem on Crusher. I hope this will be fixed soon.
Steve
On Jan 21, 2024, at 4:51 AM, Bartosz Kostrzewa wrote:
It seems that the preparations for ROCm 6 have broken compilation with our current production stack based on ROCm 5.2.3 on LUMI-G (at least for me). Note that 5.2.3 is the default on the machine and the only "officially supported" version as far as I can tell.
https://github.com/lattice/quda/blob/273d4fe8dca06fbc52b209ab7ee27bdf83d6c4bd/lib/targets/hip/malloc.cpp#L531
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:531:18: error: no member named 'type' in 'hipPointerAttribute_t'
switch (attr.type) {
~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:539:57: error: no member named 'type' in 'hipPointerAttribute_t'
default: errorQuda("Unknown memory type %d\n", attr.type); return QUDA_INVALID_FIELD_LOCATION;
~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/../include/util_quda.h:76:30: note: expanded from macro 'errorQuda'
fprintf(getOutputFile(), __VA_ARGS__); \
^~~~~~~~~~~
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:539:57: error: no member named 'type' in 'hipPointerAttribute_t'
default: errorQuda("Unknown memory type %d\n", attr.type); return QUDA_INVALID_FIELD_LOCATION;
~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/../include/util_quda.h:77:74: note: expanded from macro 'errorQuda'
errorQuda_(__PRETTY_FUNCTION__, quda::file_name(__FILE__), __LINE__, __VA_ARGS__); \
^~~~~~~~~~~
3 errors generated when compiling for gfx90a.
make[2]: *** [lib/CMakeFiles/quda_cpp.dir/build.make:1070: lib/CMakeFiles/quda_cpp.dir/targets/hip/malloc.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:1039: lib/CMakeFiles/quda_cpp.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
While other ROCm versions are available on the machine, they are all marked "experimental".
It is quite unfortunate that there has been a rename between 5.2.x and 5.y.x of the hipMemoryType member of hipPointerAttribute_t from memoryType to type. There seem to be a couple of intermediate versions which have both in the form of a union.
@dmcdougall do you think it might be possible to support both 5.2.x and later versions at least for a while?
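One way to tolerate both spellings of the member, without hard-coding the ROCm version at which the rename happened, is to detect the member name at compile time. The following is only a minimal C++17 sketch under stated assumptions: `OldAttr`, `NewAttr`, and `BothAttr` are hypothetical stand-ins for the layouts of hipPointerAttribute_t described above (so the example compiles without HIP), and `has_type_member` and `memory_type_of` are illustrative names, not part of QUDA or HIP. It is not necessarily how QUDA handles this.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the layouts of hipPointerAttribute_t
// discussed above; the real struct has more fields and uses hipMemoryType.
struct OldAttr { int memoryType; };                    // older layout
struct NewAttr { int type; };                          // newer layout
struct BothAttr { union { int memoryType; int type; }; }; // intermediate layout

// Detection idiom: true if T has a data member named `type`.
template <typename T, typename = void>
struct has_type_member : std::false_type {};

template <typename T>
struct has_type_member<T, std::void_t<decltype(std::declval<T &>().type)>>
  : std::true_type {};

// Return the memory-type field regardless of which name the struct uses.
// Only the selected branch is instantiated, so the other member need not exist.
template <typename Attr>
auto memory_type_of(const Attr &attr)
{
  if constexpr (has_type_member<Attr>::value) {
    return attr.type;
  } else {
    return attr.memoryType;
  }
}
```

With a helper like this, a call site such as `switch (attr.type)` would become `switch (memory_type_of(attr))`, and the same source would compile against either layout. The trade-off is a C++17 requirement; a preprocessor guard on the HIP version would be an alternative if the exact version of the rename is known.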
@stevengottlieb I've found that, at least on LUMI-G, the ROCm-5.6.1-based stack (which the LUMI admins provide in addition to the official HPE software stack) seems to work as long as one disables P2P. Luckily, GPU-aware MPI at least works, which is what matters most for QUDA-HIP. I don't see anything in the Crusher docs to indicate that HPE/ORNL provide anything beyond ROCm 5.4.0 on Crusher, but I don't have access to the machine, so I can't check whether there is an unadvertised module somewhere.
Addendum to the comment above: this is with cray-mpich/8.1.27.
@dmcdougall Any news on getting rocm 5.2.3 support back into QUDA? Between the changes here and the delays on the HPE side (I guess) in making a newer official software stack available we are stuck without an offloaded fermion force on LUMI-G.
Lots of issues with this unofficial rocm 5.6.1 unfortunately to the point that it's unusable.
I'm so sorry for the delayed response here. I didn't see this until the most recent ping. My sincere apologies.
ROCm 5.2.3 is extremely old. QUDA is an extremely difficult application for compilers to handle and AMD have addressed several internal compiler errors, codegen bugs, and double-
If you're having issues with 5.6, I wonder if you're mixing a ROCm 5.6 userland with the ROCm 5.2 driver. This is not supported at all, and not guaranteed to work. You can typically be pretty successful with a userland version that is at most two versions away (in either direction) from a given driver version.
Thanks a lot for getting back to me on this, it is much appreciated.
We've been trying this for at least half a year now. This is also not just a problem on LUMI-G but also on Crusher AFAIK, see the list of rocm versions available there: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#determining-the-compatibility-of-cray-mpich-and-rocm
Of course that's exactly what's happening on LUMI-G: the only version of rocm officially supported by HPE on the machine is 5.2.3 and that's also the driver version. The LUMI admins have provided a frankenversion of 5.6.1 in the hope of helping users who have encountered issues with older versions but they do state explicitly that this is not officially supported. So for us rocm 5.2.3 is the only version which actually works on LUMI-G and it's very unfortunate that the current QUDA develop head commit does not contain a workaround to still work with that.
I think we're talking about two different things. The situation on Crusher is different. Crusher is running the latest GPU driver:
This driver version is from ROCm 6.0. Crusher also has deployments of ROCms up to 6.0.0:
There are older versions there, too. But the latest stack is available, and it is compatible with the CPE that is deployed on Crusher.
The page you're referring to hasn't been updated in over a year. The information on that page was (and still is) correct, but there are newer versions available with associated CPE compatibility requirements that are not listed on that page. I can let the OLCF folks know about this page and help them update it. Thanks for bringing that page to my attention.
It's been a while since I've built QUDA on Crusher and Frontier, but I've worked with Balint Joo to address both correctness and performance-related issues with QUDA in 5.5 and 5.6, so if my memory serves me correctly, QUDA did successfully build with 5.6.
There was also work I did to prepare QUDA (#1415 and #1418) for the ROCm 6 release which contained some breaking changes (namely a header-file re-org and the removal of the
Of course, users that were on versions older than ROCm 5.5 never saw the header file warnings, and never had the opportunity to gracefully move to the new
The situation on LUMI-G is different because both the driver and the userland are almost two years old. These need to be updated. ROCms newer than 5.4 aren't guaranteed to work with a driver from 5.2. Updating the software stack is critical.
With all of this said, I have tried to address your concern in #1445. I wish you success with ROCm 5.2, but I will re-iterate that you are working with a compiler that is almost two years old, and there are bugs that QUDA triggered in the ROCm compiler that have been addressed since ROCm 5.2. I'm sorry that I can't be more helpful.
I'm happy to hear that. I was just referring to the page because @stevengottlieb mentioned above that he was having similar problems. I guess these are then really different issues.
I hope that the LUMI admins / HPE will get around to it
Thanks a lot, I will test this out as soon as possible. I really hope the LUMI admins / HPE will upgrade the driver and software stack this year but in the meantime your workaround in #1445 should make us able to compile current QUDA versions. It appears that we have not yet hit any of the issues with ROCm 5.2.3 that you describe, but I will keep your warning in mind and stress again with the admins how crucial an upgrade of the software stack would be.
This is very valuable information, thanks. I'll try to discuss again with the LUMI admins.
Thanks for #1445!
You're welcome.