Add an example using an asynchronous shared AdePT #270
Closed
Can one of the admins verify this patch?
hageboeck force-pushed the async branch 3 times, most recently from 414d3f3 to ef65218 (April 19, 2024 11:13)
- Refactor the processing of hits. Instead of processing hits by passing a pointer/reference to a HostScoring instance, a loop over iterators to hits is used. In this way, hit scoring is decoupled from the specific implementation of HostScoring, and all classes with the same interface as the original GPUHit can be used for scoring. This facilitates hit scoring for the AsyncTransport implementation.
- Move several Geant4 objects into the .cpp to make the integration headers simpler.
- Place temporary scoring objects into a struct to go around G4's pool allocators. This prevents a destruction order fiasco (where the pool is gone but the object isn't), and keeps the scoring objects closer in memory. A few objects need to leak, unfortunately, since they are allocated in G4 pools, and the handles don't support them being on a stack.
- Improve const correctness in a few places.
- Add information about threadID and eventID to the scoring interface. This information is required for AsyncAdePT to score correctly, but is unused in the thread-local transport for now.
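The iterator-based hit loop described in the first item can be illustrated with a minimal sketch. `ExampleHit`, its fields, and `SumEnergyDeposit` are hypothetical names chosen for illustration; they are not the actual GPUHit or HostScoring interfaces. The point is only that the scoring code sees a pair of iterators, not the container or class that owns the hits.

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal hit type; the real GPUHit carries more information.
struct ExampleHit {
  int threadId;         // G4 worker that owns the hit (assumed field name)
  int eventId;          // event the hit belongs to (assumed field name)
  double energyDeposit; // deposited energy (assumed field name)
};

// Sums the energy of all hits in [begin, end). Any hit type that exposes an
// `energyDeposit` member can be scored, whatever container stores it.
template <typename Iterator>
double SumEnergyDeposit(Iterator begin, Iterator end)
{
  double total = 0.;
  for (auto it = begin; it != end; ++it) {
    total += it->energyDeposit;
  }
  return total;
}

// Usage: the same function works for a std::vector and for a raw array.
inline void ExampleHitLoop()
{
  std::vector<ExampleHit> hits{{0, 1, 1.5}, {0, 1, 2.5}};
  assert(SumEnergyDeposit(hits.begin(), hits.end()) == 4.0);

  ExampleHit raw[1] = {{1, 2, 0.25}};
  assert(SumEnergyDeposit(raw, raw + 1) == 0.25);
}
```

Because the function is a template over the iterator type, a different transport could hand over hits from its own buffer without the scoring code changing.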
- Split the transport from the integration-related parts introducing a new source file. The integration-related parts can be reused in a different transport implementation.
- Create a transport abstraction, so AdePTTrackingManager is independent of the transport implementation.
- Add thread and event IDs to the transport interface. These are necessary for the async transport implementation.
- Start to enumerate tracks in the tracking manager. This can be used to reproducibly seed the AdePT random sequences.
- Add some const declarations for the default AdePT implementation.
- Use a factory function to instantiate AdePT. Like this, different AdePT implementations can be used without changing code in the tracking manager or in AdePTPhysics.
- Replace a few includes with forward declarations.
- Fix device link errors that can show when using a symbol in multiple cuda translation units.
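A rough sketch of the transport abstraction and factory function mentioned above follows. All names (`TransportInterface`, `ThreadLocalTransport`, `AsyncTransport`, `MakeTransport`) and the runtime `bool` switch are assumptions made for illustration; the real AdePTTransportInterface has a different set of methods, and the PR selects the implementation by linking a different library, not by a flag.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical interface: callers only see this class, never a concrete transport.
class TransportInterface {
public:
  virtual ~TransportInterface() = default;
  // Thread and event IDs are part of the interface, as the commit describes.
  virtual void AddTrack(int threadId, int eventId, int trackIndex) = 0;
  virtual std::string Name() const = 0;
};

class ThreadLocalTransport final : public TransportInterface {
public:
  void AddTrack(int, int, int) override {} // IDs unused in this variant
  std::string Name() const override { return "thread-local"; }
};

class AsyncTransport final : public TransportInterface {
public:
  void AddTrack(int, int, int) override {} // a real version would enqueue the track
  std::string Name() const override { return "async"; }
};

// Factory function: the only place that names a concrete implementation.
// The bool parameter is an illustrative stand-in for the link-time choice.
inline std::unique_ptr<TransportInterface> MakeTransport(bool async)
{
  if (async) return std::make_unique<AsyncTransport>();
  return std::make_unique<ThreadLocalTransport>();
}

inline void ExampleFactory()
{
  auto transport = MakeTransport(true);
  transport->AddTrack(/*threadId=*/0, /*eventId=*/1, /*trackIndex=*/42);
  assert(transport->Name() == "async");
}
```

With this shape, a tracking manager that holds a `std::unique_ptr<TransportInterface>` does not need to change when another implementation is added.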
…ePT. When AdePT tries to purge a particle from the GPU, Geant4 must not send it back. To achieve this, a custom track ID was created, such that the AdePTTrackingManager doesn't attempt to send it back to the GPU.
The setSeed command was defined, but didn't have any effect. To avoid confusion on the user side, it is removed here.
- Create a folder to convert AdePT into a shared asynchronous workflow.
- Create a base macro to run the example.
Clean up AdeptIntegration from example14 before transforming it into the async example.
Add unique_ptr with cuda deleters. This way, cuda memory doesn't have to be freed manually. So far, this is only used for a few cuda objects, but could be extended in the future.
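The idea of a `unique_ptr` with a custom deleter can be sketched as below. To keep the example compilable without a CUDA toolkit, `FakeDeviceMalloc` and `FakeDeviceFree` are hypothetical stand-ins; in CUDA code the deleter would call `cudaFree` and the allocation would use `cudaMalloc`. The alias name `unique_device_ptr` and the helper `MakeDeviceArray` are likewise assumptions, not the names used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counts live allocations so the example can show that memory is released.
int gLiveAllocations = 0;

// Stand-in for cudaMalloc (assumption: host memory is used for illustration).
inline void *FakeDeviceMalloc(std::size_t bytes)
{
  void *ptr = std::malloc(bytes);
  if (ptr != nullptr) ++gLiveAllocations;
  return ptr;
}

// Stand-in for cudaFree.
inline void FakeDeviceFree(void *ptr)
{
  --gLiveAllocations;
  std::free(ptr);
}

// Deleter invoked by unique_ptr when the owning pointer goes out of scope.
struct DeviceDeleter {
  void operator()(void *ptr) const { FakeDeviceFree(ptr); }
};

template <typename T>
using unique_device_ptr = std::unique_ptr<T, DeviceDeleter>;

// Allocates room for `count` objects of type T; no constructors are run,
// which mirrors how raw device buffers are usually handled.
template <typename T>
unique_device_ptr<T> MakeDeviceArray(std::size_t count)
{
  return unique_device_ptr<T>{static_cast<T *>(FakeDeviceMalloc(count * sizeof(T)))};
}

inline void ExampleDeviceMemory()
{
  {
    auto buffer = MakeDeviceArray<int>(16);
    assert(buffer != nullptr);
    assert(gLiveAllocations == 1);
  } // buffer is freed here without any manual call
  assert(gLiveAllocations == 0);
}
```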
- Remove a variable without any effect.
- Remove dead variables and includes in example21.
- Fix compiler warnings.
- Minor constness fixes.
- Refactor AdeptIntegration to use a GPU management thread
- G4 workers deposit / retrieve tracks through shared memory
- WIP towards using a single shared AdePT instance per G4 application
- Use more unique_ptr and RAII idiom
- Use fewer getters/setters in favour of constructor arguments

Several pieces are missing for a truly shared AdePT:
- Scoring per G4 worker thread
- Slot recycling in AdePT track buffers
In addition to marking slots as free, now actually start periodic kernels that return free slots to the pool of available slots.
- Allow for flushing of single events on device.
Instead of one single scoring per AdePT, one scoring per G4 worker is created. On the host, the scoring instances are stored in a vector, whereas they are a malloc-ed array on the device. In this way, the correct scoring instance can be accessed using the threadId of each particle. In order to have the scoring in a vector, the host side had to be made copyable/moveable. The device side remains uninitialised when a host instance is copied, though.
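The host-side half of this design, one scoring instance per G4 worker selected by the thread ID carried with each particle, could look roughly like the following. `WorkerScoring`, `WorkerHit`, and `ScoringPerWorker` are hypothetical types for illustration only; the device-side array is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-worker accumulator; copyable so it can live in a vector.
struct WorkerScoring {
  double totalEnergy = 0.;
  unsigned hitCount  = 0;
};

// Hypothetical hit that remembers which G4 worker its track came from.
struct WorkerHit {
  int threadId;
  double energyDeposit;
};

class ScoringPerWorker {
public:
  explicit ScoringPerWorker(std::size_t nWorkers) : fScoring(nWorkers) {}

  // Routes the hit to the scoring instance of the worker that owns it.
  void Score(const WorkerHit &hit)
  {
    WorkerScoring &scoring = fScoring.at(static_cast<std::size_t>(hit.threadId));
    scoring.totalEnergy += hit.energyDeposit;
    ++scoring.hitCount;
  }

  const WorkerScoring &Get(int threadId) const
  {
    return fScoring.at(static_cast<std::size_t>(threadId));
  }

private:
  std::vector<WorkerScoring> fScoring; // one entry per G4 worker
};

inline void ExampleScoring()
{
  ScoringPerWorker scoring(2);
  scoring.Score({1, 2.0});
  scoring.Score({1, 1.0});
  scoring.Score({0, 0.5});
  assert(scoring.Get(1).totalEnergy == 3.0);
  assert(scoring.Get(0).hitCount == 1);
}
```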
Using cuda callbacks and state machines for the injection and particle extraction workflows, the entire transport loop was redesigned. Events now go through a more fine-grained state machine to reduce latency while ensuring that we wait for all particles to go through every step.
- Create an object library in the AsyncExample that uses examples/common for the ParticleGun and HepMC3 reader
- Delete the corresponding files from AsyncExample
- Make PrimaryGeneratorAction trivial, because all the logic is now in the gun
- Port AsyncExample to use AdePTTransportInterface.
- Integrate AdePTGeant4Integration into AsyncAdePT
- Start streaming hits out of GPU instead of accumulating them on device.
- Try to synchronise code between example1 and AsyncExample to make them do the same as much as possible.
- Remove unused functionality from SlotManager.
- Use EventAction and RunAction to write histograms for each thread.
- Use the EndOfRunAction to write these histograms to a ROOT file.
- Create a dedicated library for AsyncTransport that can be linked instead of the standard transport library.
- Add a parentID to track structs.
- Replace track init functions by constructors, so members don't get missed inadvertently when the track storage is refactored.
- Instead of allocating three queues for the three particle types, all particle types can use the same queue. This better uses the available memory, since an overfull queue of one type might be compensated by a sparsely populated queue of another type.
- Put a lot more of the device memory under RAII management.
- This allows recovering from allocation failures, trying again with a smaller size.
- Allocate the largest chunk of memory (the track storage) last, and give warnings how much memory is missing to successfully allocate the array. In this way, the maximum size of the track storage can be computed by the user.
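Recovering from an allocation failure by retrying with a smaller size can be sketched as follows. The function name `AllocateWithFallback`, the callback-based interface, and the choice of halving the request on each failure are all assumptions for illustration; the source only states that a smaller size is tried.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>

// Tries `requested` bytes first and halves the request after every failure
// until it would drop below `minimum`. `tryAllocate` is a hypothetical
// callback that returns true when an allocation of that size succeeds.
// Returns the size that was finally allocated, or nothing on failure.
inline std::optional<std::size_t> AllocateWithFallback(
    std::size_t requested, std::size_t minimum,
    const std::function<bool(std::size_t)> &tryAllocate)
{
  for (std::size_t size = requested; size >= minimum && size > 0; size /= 2) {
    if (tryAllocate(size)) return size;
  }
  return std::nullopt;
}

inline void ExampleFallback()
{
  // Pretend the device only has room for 1000 bytes.
  auto fits = [](std::size_t bytes) { return bytes <= 1000; };

  // 4096, 2048 and 1024 fail; 512 succeeds.
  auto result = AllocateWithFallback(4096, 256, fits);
  assert(result.has_value() && *result == 512);

  // With a minimum of 600 bytes, no acceptable size fits.
  assert(!AllocateWithFallback(4096, 600, fits).has_value());
}
```

When the buffers are owned by RAII handles, a failed attempt leaves nothing to clean up by hand, which is what makes such a retry loop practical.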
agheata pushed a commit that referenced this pull request Nov 26, 2024
This PR is based on #270, with the addition of making ROOT not required to build AdePT

----

Original description:

Example demonstrating how to use one single asynchronous AdePT from Geant4 using the fast simulation hooks.

This example is based on Example14 with the following modifications:
- A slot manager that can recycle slots is employed. This allows transporting considerably larger numbers of particles without running out of memory. The slot manager works as follows:
  - If a new slot is needed, a slot number is fetched atomically from a list of free slots.
  - Once a track dies, its slot is marked as to be freed. It cannot be freed while other tracks are transported because this might race with allocating new slots.
  - Periodically, the slots to be freed are copied over into the list of free slots.
- Only one instance of AdePT runs asynchronously in a separate thread.
- Each G4 worker can enqueue tracks, which are all transported in parallel.
- A transport loop is running continuously, which transports particles, injects new particles, and retrieves leaked tracks and hits.
- The G4 workers communicate with the AdePT thread via state machines, so it is clear when events finish or need to start transporting.
- As long as the G4 workers have CPU work to do, they don't block while the GPU transport is running.
- Each track knows which G4 worker it came from, and the scoring structures are replicated for each G4 worker that is active.
- AdePT not only runs in the region that is named as GPU region in the config, it also transports particles in all daughter regions of the "AdePT region". This required refactoring the geometry visitor that sets up the GPU region.

----

---------

Co-authored-by: Stephan Hageboeck <[email protected]>
Co-authored-by: SeverinDiederichs <[email protected]>
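The three slot manager steps in the description can be sketched on the host with `std::atomic`. This is a simplified illustration under stated assumptions, not the AdePT SlotManager: the class name `SlotManagerSketch` and its method names are hypothetical, the real implementation runs on the GPU, and each slot is assumed to be marked for freeing at most once.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <vector>

class SlotManagerSketch {
public:
  explicit SlotManagerSketch(std::size_t capacity)
      : fFreeSlots(capacity), fToBeFreed(capacity), fFreeCount(capacity)
  {
    std::iota(fFreeSlots.begin(), fFreeSlots.end(), 0u); // slots 0..capacity-1
  }

  // Step 1: fetch a slot number atomically from the list of free slots.
  std::optional<unsigned> NextSlot()
  {
    const std::size_t index = fNextFree.fetch_add(1);
    if (index >= fFreeCount) return std::nullopt; // out of slots
    return fFreeSlots[index];
  }

  // Step 2: a dead track's slot is only recorded, not yet reusable, so this
  // cannot race with NextSlot() reading the free list.
  void MarkSlotToBeFreed(unsigned slot)
  {
    const std::size_t index = fNumToBeFreed.fetch_add(1);
    fToBeFreed[index] = slot;
  }

  // Step 3: called periodically while no other thread allocates or marks.
  // Keeps the unused free slots and appends the slots marked for freeing.
  void FreeMarkedSlots()
  {
    const std::size_t used   = std::min(fNextFree.load(), fFreeCount);
    const std::size_t marked = fNumToBeFreed.load();
    std::vector<unsigned> available(fFreeSlots.begin() + static_cast<std::ptrdiff_t>(used),
                                    fFreeSlots.begin() + static_cast<std::ptrdiff_t>(fFreeCount));
    available.insert(available.end(), fToBeFreed.begin(),
                     fToBeFreed.begin() + static_cast<std::ptrdiff_t>(marked));
    std::copy(available.begin(), available.end(), fFreeSlots.begin());
    fFreeCount = available.size();
    fNextFree.store(0);
    fNumToBeFreed.store(0);
  }

private:
  std::vector<unsigned> fFreeSlots; // first fFreeCount entries are valid
  std::vector<unsigned> fToBeFreed; // slots waiting to be recycled
  std::size_t fFreeCount;
  std::atomic<std::size_t> fNextFree{0};
  std::atomic<std::size_t> fNumToBeFreed{0};
};

inline void ExampleSlots()
{
  SlotManagerSketch slots(2);
  auto a = slots.NextSlot();
  auto b = slots.NextSlot();
  assert(a && b && *a == 0 && *b == 1);
  slots.MarkSlotToBeFreed(*a);
  assert(!slots.NextSlot()); // marked slot is not available yet
  slots.FreeMarkedSlots();
  auto c = slots.NextSlot();
  assert(c && *c == 0); // slot 0 was recycled
}
```

Deferring the recycling to a separate step is what avoids the race mentioned in the description: allocation only ever reads the free list, and the list is rewritten only when no allocation is in flight.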
Superseded by #319