current develop does not compile with ROCm 5.2.3 (LUMI-G default) #1432

Closed
kostrzewa opened this issue Jan 21, 2024 · 12 comments

Comments

@kostrzewa
Member

It seems that the preparations for ROCm 6 have broken compilation with our current production stack based on ROCm 5.2.3 on LUMI-G (at least for me). Note that 5.2.3 is the default on the machine and the only "officially supported" version as far as I can tell.

switch (attr.type) {

/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:531:18: error: no member named 'type' in 'hipPointerAttribute_t'
    switch (attr.type) {
            ~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:539:57: error: no member named 'type' in 'hipPointerAttribute_t'
    default: errorQuda("Unknown memory type %d\n", attr.type); return QUDA_INVALID_FIELD_LOCATION;
                                                   ~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/../include/util_quda.h:76:30: note: expanded from macro 'errorQuda'
    fprintf(getOutputFile(), __VA_ARGS__);                                                                             \
                             ^~~~~~~~~~~
/users/bakostrz/code/quda-develop-273d4fe/lib/targets/hip/malloc.cpp:539:57: error: no member named 'type' in 'hipPointerAttribute_t'
    default: errorQuda("Unknown memory type %d\n", attr.type); return QUDA_INVALID_FIELD_LOCATION;
                                                   ~~~~ ^
/users/bakostrz/code/quda-develop-273d4fe/lib/../include/util_quda.h:77:74: note: expanded from macro 'errorQuda'
    errorQuda_(__PRETTY_FUNCTION__, quda::file_name(__FILE__), __LINE__, __VA_ARGS__);                                 \
                                                                         ^~~~~~~~~~~
3 errors generated when compiling for gfx90a.
make[2]: *** [lib/CMakeFiles/quda_cpp.dir/build.make:1070: lib/CMakeFiles/quda_cpp.dir/targets/hip/malloc.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:1039: lib/CMakeFiles/quda_cpp.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

While other ROCm versions are available on the machine, they are all marked "experimental".

It is quite unfortunate that, somewhere between ROCm 5.2.x and later 5.y.x releases, the hipMemoryType member of hipPointerAttribute_t was renamed from memoryType to type. There seem to be a couple of intermediate versions which provide both names in the form of a union.

@dmcdougall do you think it might be possible to support both 5.2.x and later versions at least for a while?
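One way to tolerate both member names is to let the compiler detect which one is present. The following is only a minimal sketch under stated assumptions, not the fix that was eventually merged: the three structs are hypothetical stand-ins for the different layouts of hipPointerAttribute_t (the real member has type hipMemoryType, not int), the names has_type_member and memory_type_of are made up for illustration, and C++17 is assumed.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the three layouts discussed in this thread.
// These are NOT the real HIP definitions; the real member is a hipMemoryType.
struct AttrOld   { int memoryType; };                      // ROCm 5.2.x style
struct AttrUnion { union { int memoryType; int type; }; }; // interim style with both names
struct AttrNew   { int type; };                            // style with only the new name

// Detection idiom: has_type_member<T>::value is true when T has a member named `type`.
template <typename T, typename = void>
struct has_type_member : std::false_type {};

template <typename T>
struct has_type_member<T, std::void_t<decltype(std::declval<T &>().type)>> : std::true_type {};

// Read the memory type through whichever member the struct provides,
// preferring the newer name `type` when both exist.
template <typename Attr>
constexpr int memory_type_of(const Attr &attr)
{
  if constexpr (has_type_member<Attr>::value) {
    return attr.type;
  } else {
    return attr.memoryType;
  }
}
```

With something along these lines, a call such as memory_type_of(attr) would compile against any of the three layouts without a version check. Whether this is preferable to a plain preprocessor guard on the HIP version is a design choice: detection does not depend on knowing exactly which release changed the struct, while a version guard is easier to read.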

@stevengottlieb
Member

stevengottlieb commented Jan 21, 2024 via email

@kostrzewa
Member Author

@stevengottlieb I've found that at least on LUMI-G, the ROCm-5.6.1-based stack (that is provided by the LUMI admins in addition to the official software stack by HPE) seems to work as long as one disables P2P. Luckily GPU-aware MPI works at least, which is what matters most for QUDA-HIP.

I don't see anything in the Crusher docs to indicate that HPE/ORNL provide anything beyond ROCm 5.4.0 on Crusher, but I don't have access to the machine, so I can't check whether there is perhaps an unadvertised module somewhere.

@kostrzewa
Member Author

Addendum to the comment above: this is with cray-mpich/8.1.27.

@kostrzewa
Member Author

@dmcdougall Any news on getting ROCm 5.2.3 support back into QUDA? Between the changes here and the delays (on the HPE side, I guess) in making a newer official software stack available, we are stuck without an offloaded fermion force on LUMI-G.

@kostrzewa
Member Author

> I've found that at least on LUMI-G, the ROCm-5.6.1-based stack (that is provided by the LUMI admins in addition to the official software stack by HPE) seems to work as long as one disables P2P. Luckily GPU-aware MPI works at least, which is what matters most for QUDA-HIP.

Unfortunately, this unofficial ROCm 5.6.1 stack has so many issues that it is unusable.

@dmcdougall
Contributor

I'm so sorry for the delayed response here. I didn't see this until the most recent ping. My sincere apologies.

ROCm 5.2.3 is extremely old. QUDA is an extremely difficult application for compilers to handle and AMD have addressed several internal compiler errors, codegen bugs, and double-free bugs in the KFD since 5.2.3. My suggestion here would be to raise a polite request to the LUMI-G system administrators to update to the latest ROCm stack. There are legitimate and very important software bug fixes that have happened since ROCm 5.2. This approach also helps all the other LUMI-G users, and not just the ones running into problems with QUDA.

If you're having issues with 5.6, I wonder if you're mixing a ROCm 5.6 userland with the ROCm 5.2 driver. This is not supported at all, and not guaranteed to work. You can typically be pretty successful with a userland version that is at most two versions (in either direction) against a given driver version.

@kostrzewa
Member Author

Thanks a lot for getting back to me on this, it is much appreciated.

> My suggestion here would be to raise a polite request to the LUMI-G system administrators to update to the latest ROCm stack. There are legitimate and very important software bug fixes that have happened since ROCm 5.2. This approach also helps all the other LUMI-G users, and not just the ones running into problems with QUDA.

We've been trying this for at least half a year now. This is also not just a problem on LUMI-G but also on Crusher AFAIK, see the list of rocm versions available there: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#determining-the-compatibility-of-cray-mpich-and-rocm

> If you're having issues with 5.6, I wonder if you're mixing a ROCm 5.6 userland with the ROCm 5.2 driver. This is not supported at all

Of course that's exactly what's happening on LUMI-G: the only version of rocm officially supported by HPE on the machine is 5.2.3 and that's also the driver version. The LUMI admins have provided a frankenversion of 5.6.1 in the hope of helping users who have encountered issues with older versions but they do state explicitly that this is not officially supported.

So for us rocm 5.2.3 is the only version which actually works on LUMI-G and it's very unfortunate that the current QUDA develop head commit does not contain a workaround to still work with that.

@dmcdougall
Contributor

I think we're talking about two different things.

The situation on Crusher is different. Crusher is running the latest GPU driver:

$ rocm-smi --showdriverversion

============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.3.6
==========================================================================================
================================== End of ROCm SMI Log ===================================

This driver version is from ROCm 6.0.

Crusher also has deployments of ROCms up to 6.0.0:

$ ml avail rocm

----------------------------------------------------------------------------------------------------------------------------- /sw/crusher/modulefiles -----------------------------------------------------------------------------------------------------------------------------
   papi/7.0.1.0_rocm5.3    rocm/4.2.0    rocm/4.3.0    rocm/4.5.0    rocm/4.5.2    rocm/5.0.0    rocm/5.0.2    rocm/5.1.0    rocm/5.2.0    rocm/5.3.0 (D)    rocm/5.4.0    rocm/5.4.3    rocm/5.5.1    rocm/5.6.0    rocm/5.7.0    rocm/5.7.1    rocm/6.0.0

There are older versions there, too. But the latest stack is available, and it is compatible with the CPE that is deployed on Crusher.

The page you're referring to hasn't been updated in over a year. The information on that page was (and still is) correct, but there are newer versions available with associated CPE compatibility requirements that are not listed on that page. I can let the OLCF folks know about this page and help them update it. Thanks for bringing that page to my attention.

It's been a while since I've built QUDA on Crusher and Frontier, but I've worked with Balint Joo to address both correctness and performance-related issues with QUDA in 5.5 and 5.6, so if my memory serves me correctly, QUDA did successfully build with 5.6. There was also work I did to prepare QUDA (#1415 and #1418) for the ROCm 6 release which contained some breaking changes (namely a header-file re-org and the removal of the memoryType member from the hipPointerAttribute_t type). The header file re-org breaking changes were documented in ROCm 5.5, and users were warned at compile time whenever they pointed to the old header file locations until ROCm 6.0 when the old header locations were removed. Additionally, there was a period of three ROCm releases where users were given the union to prepare for the upcoming breaking change. The union existed in ROCm 5.5, 5.6, and 5.7. This was documented in ROCm 5.6. The union was removed in 6.0, breaking compatibility in a major release.

Of course, users that were on versions older than ROCm 5.5 never saw the header file warnings, and never had the opportunity to gracefully move to the new type field name in those interim ROCm releases.
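The timeline above can be written down as a small lookup, which is roughly the decision a version guard has to encode. This is only an illustrative sketch: MemTypeField and mem_type_field are hypothetical names, the cut-offs are taken from the description in this comment rather than checked against the HIP headers, and in real code the arguments would presumably come from HIP's version macros (HIP_VERSION_MAJOR and HIP_VERSION_MINOR), assuming the HIP version tracks the ROCm release number.

```cpp
// Which spelling(s) of the memory-type member a given ROCm release offers,
// following the timeline described in this thread (not verified against headers).
enum class MemTypeField { MemoryTypeOnly, Both, TypeOnly };

constexpr MemTypeField mem_type_field(int major, int minor)
{
  if (major >= 6) return MemTypeField::TypeOnly;           // union removed in 6.0
  if (major == 5 && minor >= 5) return MemTypeField::Both; // union present in 5.5, 5.6, 5.7
  return MemTypeField::MemoryTypeOnly;                     // older releases, e.g. 5.2.x
}

// Compile-time checks for the releases mentioned in the thread.
static_assert(mem_type_field(5, 2) == MemTypeField::MemoryTypeOnly, "LUMI-G default stack");
static_assert(mem_type_field(5, 6) == MemTypeField::Both, "interim release with the union");
static_assert(mem_type_field(6, 0) == MemTypeField::TypeOnly, "ROCm 6.0");
```

Code that has to build on both ends of this range cannot simply switch to the new name; it needs to use memoryType when the result is MemoryTypeOnly and type otherwise.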

The situation on LUMI-G is different because both the driver and the userland are almost two years old. These need to be updated. ROCms newer than 5.4 aren't guaranteed to work with a driver from 5.2. Updating the software stack is critical.

With all of this said, I have tried to address your concern in #1445. I wish you success with ROCm 5.2, but I will re-iterate that you are working with a compiler that is almost two years old, and there are bugs that QUDA triggered in the ROCm compiler that have been addressed since ROCm 5.2.

I'm sorry that I can't be more helpful.

@kostrzewa
Member Author

kostrzewa commented Mar 16, 2024

> The situation on Crusher is different. Crusher is running the latest GPU driver:

I'm happy to hear that. I was just referring to the page because @stevengottlieb mentioned above that he was having similar problems. I guess these are then really different issues.

> The situation on LUMI-G is different because both the driver and the userland are almost two years old. These need to be updated. ROCms newer than 5.4 aren't guaranteed to work with a driver from 5.2. Updating the software stack is critical.

I hope that the LUMI admins / HPE will get around to it.

> With all of this said, I have tried to address your concern in #1445. I wish you success with ROCm 5.2, but I will re-iterate that you are working with a compiler that is almost two years old, and there are bugs that QUDA triggered in the ROCm compiler that have been addressed since ROCm 5.2.

Thanks a lot, I will test this out as soon as possible. I really hope the LUMI admins / HPE will upgrade the driver and software stack this year, but in the meantime your workaround in #1445 should allow us to compile current QUDA versions.

It appears that we have not yet hit any of the issues with ROCm 5.2.3 that you describe, but I will keep your warning in mind and stress again with the admins how crucial an upgrade of the software stack would be.

@kostrzewa
Member Author

> There are older versions there, too. But the latest stack is available, and it is compatible with the CPE that is deployed on Crusher.

This is very valuable information, thanks. I'll try to discuss again with the LUMI admins.

@kostrzewa
Member Author

Thanks for #1445!

@dmcdougall
Contributor

You're welcome.
