Add an example using an asynchronous shared AdePT #270

Closed
wants to merge 74 commits

Conversation

hageboeck
Contributor

Example demonstrating how to use a single, asynchronous AdePT instance from Geant4 via the fast simulation hooks.

This example is based on Example14 with the following modifications:

  • A slot manager that can recycle slots is employed. This allows transporting considerably larger
    numbers of particles without running out of memory. The slot manager works as follows (see the sketch after this list):
    • If a new slot is needed, a slot number is fetched atomically from a list of free slots.
    • Once a track dies, its slot is marked to be freed. It cannot be freed while other tracks are being transported,
      because this might race with the allocation of new slots.
    • Periodically, the slots marked to be freed are copied into the list of free slots.
  • Only one instance of AdePT runs asynchronously in a separate thread.
    • Each G4 worker can enqueue tracks, which are all transported in parallel.
    • A transport loop runs continuously; it transports particles, injects new particles, and retrieves leaked tracks and hits.
    • The G4 workers communicate with the AdePT thread via state machines, so it is clear when events finish or need to start transporting.
    • As long as the G4 workers have CPU work to do, they don't block while the GPU transport is running.
    • Each track knows which G4 worker it came from, and the scoring structures are replicated for each active G4 worker.
  • AdePT runs not only in the region named as the GPU region in the config, but also in all daughter regions of that "AdePT region". This required refactoring the geometry visitor that sets up the GPU region.
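
A minimal sketch of the slot-recycling scheme described above. All names (`SlotManager`, `NextSlot`, `MarkSlotForFreeing`, `FreeMarkedSlots`) are invented for illustration and are not AdePT's actual API:

```cuda
#include <cassert>
#include <cuda_runtime.h>

struct SlotManager {
  unsigned int *fFreeList;     // slot numbers that may be handed out
  unsigned int *fToBeFreed;    // slots marked by dying tracks
  unsigned int *fNextFree;     // index of the next fFreeList entry to hand out
  unsigned int *fNumToBeFreed; // number of valid entries in fToBeFreed
  unsigned int fCapacity;

  // A new slot is needed: fetch one atomically from the free list.
  __device__ unsigned int NextSlot()
  {
    const unsigned int i = atomicAdd(fNextFree, 1u);
    assert(i < fCapacity && "Out of slots");
    return fFreeList[i];
  }

  // A track died: only *mark* its slot. Freeing it immediately could race
  // with other in-flight tracks that are allocating new slots.
  __device__ void MarkSlotForFreeing(unsigned int slot)
  {
    const unsigned int i = atomicAdd(fNumToBeFreed, 1u);
    fToBeFreed[i] = slot;
  }
};

// Run periodically, while no kernel is allocating slots: copy the marked
// slots back into the free list. Launched as a single block so the counter
// can be reset safely after a block-wide barrier.
__global__ void FreeMarkedSlots(SlotManager mgr)
{
  const unsigned int n = *mgr.fNumToBeFreed;
  for (unsigned int i = threadIdx.x; i < n; i += blockDim.x) {
    const unsigned int dst = atomicSub(mgr.fNextFree, 1u) - 1u;
    mgr.fFreeList[dst] = mgr.fToBeFreed[i];
  }
  __syncthreads();
  if (threadIdx.x == 0) *mgr.fNumToBeFreed = 0;
}
```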

@hageboeck hageboeck self-assigned this Jan 11, 2024
@phsft-bot

Can one of the admins verify this patch?

- Refactor the processing of hits.
  Instead of processing hits by passing a pointer/reference to a
  HostScoring instance, a loop over iterators to hits is used.
  In this way, hit scoring is decoupled from the specific implementation
  of HostScoring, and any class with the same interface as the original
  GPUHit can be used for scoring (see the first sketch after this list).
  This facilitates hit scoring for the AsyncTransport implementation.
- Move several Geant4 objects into the .cpp file to make the integration
  headers simpler.
- Place temporary scoring objects into a struct to bypass G4's pool
  allocators. This prevents a destruction-order fiasco (where the pool
  is gone but the object isn't), and keeps the scoring objects closer
  together in memory.
  A few objects unfortunately need to leak, since they are allocated in
  G4 pools, and the handles don't support placing them on the stack.
- Improve const correctness in a few places.
- Add information about threadID and eventID to the scoring interface.
  This information is required for AsyncAdePT to score correctly, but is
  unused in the thread-local transport for now.
- Split the transport from the integration-related parts, introducing a
  new source file. The integration-related parts can be reused in a
  different transport implementation.
- Create a transport abstraction, so AdePTTrackingManager is independent
  of the transport implementation.
- Add thread and event IDs to the transport interface. These are
  necessary for the async transport implementation.
- Start to enumerate tracks in the tracking manager. This can be used to
  reproducibly seed the AdePT random sequences.
- Add some const declarations for the default AdePT implementation.
- Use a factory function to instantiate AdePT (see the second sketch after
  this list). This way, different AdePT implementations can be used without
  changing code in the tracking manager or in AdePTPhysics.
- Replace a few includes with forward declarations.
- Fix device link errors that can show up when using a symbol in multiple
  CUDA translation units.
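
Two of the points above lend themselves to short sketches. First, processing hits through an iterator range instead of a concrete HostScoring instance. The `Hit` struct and `ScoreHits` function here are illustrative stand-ins; the idea is that only a GPUHit-like interface is required:

```cpp
#include <iostream>
#include <vector>

// A GPUHit-like object: only its interface matters to the scoring loop.
struct Hit {
  double fEnergyDeposit;
};

// Scoring over an iterator range instead of over a concrete HostScoring:
template <class HitIterator>
double ScoreHits(HitIterator begin, HitIterator end)
{
  double total = 0.;
  for (auto it = begin; it != end; ++it)
    total += it->fEnergyDeposit; // only the GPUHit-like interface is used
  return total;
}

int main()
{
  std::vector<Hit> hits{{1.0}, {2.5}};
  std::cout << ScoreHits(hits.begin(), hits.end()) << '\n'; // prints 3.5
}
```

Second, the factory function: the tracking manager and AdePTPhysics only see the abstract interface, and the concrete transport is chosen in one place. Class and function names below are hypothetical stand-ins for the interface introduced in this PR:

```cpp
#include <memory>

class AdePTTransportInterface {
public:
  virtual ~AdePTTransportInterface() = default;
  virtual void Initialize()          = 0;
};

class ThreadLocalTransport : public AdePTTransportInterface {
  void Initialize() override { /* one AdePT instance per worker */ }
};

class AsyncTransport : public AdePTTransportInterface {
  void Initialize() override { /* one shared, asynchronous AdePT */ }
};

// The only place that knows about the concrete implementations:
std::unique_ptr<AdePTTransportInterface> MakeAdePTInstance(bool async)
{
  if (async) return std::make_unique<AsyncTransport>();
  return std::make_unique<ThreadLocalTransport>();
}
```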
…ePT.

When AdePT tries to purge a particle from the GPU, Geant4 must not send
it back. To achieve this, a custom track ID was introduced, so that the
AdePTTrackingManager doesn't attempt to send such a particle back to the GPU.
The setSeed command was defined, but didn't have any effect. To avoid
confusion on the user side, it is removed here.
- Create a folder to convert AdePT into a shared asynchronous workflow.
- Create a base macro to run the example.
Clean up AdeptIntegration from example14 before transforming it into the
async example.
Add unique_ptr with CUDA deleters (see the sketch below). This way, CUDA
memory doesn't have to be freed manually. So far, this is only used for a
few CUDA objects, but it could be extended in the future.
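
A minimal sketch of such a CUDA deleter; the helper names (`CudaDeleter`, `unique_ptr_cuda`, `MakeDeviceArray`) are illustrative:

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <memory>

// Deleter that releases device memory, usable with std::unique_ptr.
struct CudaDeleter {
  void operator()(void *ptr) const { cudaFree(ptr); }
};

template <typename T>
using unique_ptr_cuda = std::unique_ptr<T, CudaDeleter>;

// Allocate a device array whose memory is freed automatically.
template <typename T>
unique_ptr_cuda<T> MakeDeviceArray(std::size_t n)
{
  T *ptr = nullptr;
  cudaMalloc(&ptr, n * sizeof(T));
  return unique_ptr_cuda<T>{ptr};
}

// Usage: no manual cudaFree, even on early returns or exceptions.
// auto buffer = MakeDeviceArray<float>(1 << 20);
```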
- Remove a variable without any effect.
- Remove dead variables and includes in example21.
- Fix compiler warnings.
- Minor constness fixes.
- Refactor AdeptIntegration to use a GPU management thread
- G4 workers deposit / retrieve tracks through shared memory
- WIP towards using a single shared AdePT instance per G4 application
- Use more unique_ptr and the RAII idiom
- Use fewer getters/setters in favour of constructor arguments

Several pieces are missing for a truly shared AdePT:
- Scoring per G4 worker thread
- Slot recycling in AdePT track buffers
In addition to marking slots as free, now actually start periodic kernels
that return free slots to the pool of available slots.
- Allow for flushing of single events on device.
Instead of one single scoring instance per AdePT, one scoring instance
per G4 worker is created. On the host, the scoring instances are stored
in a vector, whereas on the device they form a malloc-ed array. In this
way, the correct scoring instance can be accessed using the threadId of
each particle (see the sketch below).

In order to have the scoring in a vector, the host side had to be made
copyable/moveable. The device side remains uninitialised when a host
instance is copied, though.
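
Roughly, and with invented types (`Scoring`, `Track`, and `InitScoring` are not AdePT's real names), the layout looks like this:

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Scoring { double fEnergyDeposit; }; // placeholder per-worker scoring
struct Track   { int fThreadId; double fEdep; };

// Each track scores into the instance of the G4 worker it came from.
// (atomicAdd on double requires compute capability 6.0+.)
__global__ void Score(Scoring *scoring, const Track *tracks, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&scoring[tracks[i].fThreadId].fEnergyDeposit, tracks[i].fEdep);
}

// Host side: copyable instances in a vector; device side: a plain array.
Scoring *InitScoring(int numWorkers, std::vector<Scoring> &host)
{
  host.assign(numWorkers, Scoring{0.});
  Scoring *device = nullptr;
  cudaMalloc(&device, numWorkers * sizeof(Scoring));
  cudaMemcpy(device, host.data(), numWorkers * sizeof(Scoring),
             cudaMemcpyHostToDevice);
  return device;
}
```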
Using CUDA callbacks and state machines for the injection and particle
extraction workflows, the entire transport loop was redesigned.
Events now go through a more fine-grained state machine (see the sketch
below) to reduce latency while ensuring that we wait for all particles
to go through every step.
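
The per-event states might look roughly like this; the state names are invented for illustration and are not the ones used in the PR:

```cpp
#include <atomic>

// Fine-grained per-event states, advanced by the G4 worker, the AdePT
// thread, and CUDA host callbacks:
enum class EventState : unsigned char {
  InjectionRunning, // the G4 worker is still enqueueing tracks
  Transporting,     // tracks of this event are in flight on the GPU
  FlushRequested,   // the worker asked for all remaining tracks and hits
  HitsFlushed,      // all hits have been handed back to the worker
  Done              // the event is complete; the worker may finish it
};

// One lock-free atomic per event keeps transitions cheap:
std::atomic<EventState> state{EventState::InjectionRunning};

// e.g. a CUDA host callback enqueued after the last hit copy could do:
// state.store(EventState::HitsFlushed, std::memory_order_release);
```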
- Create an object library in the AsyncExample that uses examples/common
  for the ParticleGun and HepMC3 reader
- Delete the corresponding files from AsyncExample
- Make PrimaryGeneratorAction trivial, because all the logic is now in
  the gun
- Port AsyncExample to use AdePTTransportInterface.
- Integrate AdePTGeant4Integration into AsyncAdePT
- Start streaming hits out of GPU instead of accumulating them on
  device.
- Try to synchronise code between example1 and AsyncExample to make
  them behave the same as much as possible.
- Remove unused functionality from SlotManager.
- Use EventAction and RunAction to write histograms for each thread.
- Use the EndOfRunAction to write these histograms to a ROOT file.
- Create a dedicated library for AsyncTransport that can be linked
  instead of the standard transport library.
- Add a parentID to track structs.
- Replace track init functions by constructors, so members don't get
  missed inadvertently when the track storage is refactored.
- Instead of allocating three queues for the three particle types, all
  particle types can use the same queue. This makes better use of the
  available memory, since an overfull queue of one type might be
  compensated by a sparsely populated queue of another type.
- Put a lot more of the device memory under RAII management. This allows
  recovering from allocation failures by trying again with a smaller size.
- Allocate the largest chunk of memory (the track storage) last, and
  warn how much memory is missing to successfully allocate the array
  (see the sketch below). In this way, the maximum size of the track
  storage can be determined by the user.
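
A sketch of this allocation strategy, with illustrative names (the real code is more involved): try the track storage last, report how much device memory is missing, and retry with a smaller capacity.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

struct Track { char payload[256]; }; // stand-in for the real track struct

Track *AllocateTrackStorage(std::size_t &capacity)
{
  constexpr std::size_t MB = 1024 * 1024;
  while (capacity > 0) {
    Track *storage = nullptr;
    if (cudaMalloc(&storage, capacity * sizeof(Track)) == cudaSuccess)
      return storage;
    cudaGetLastError(); // clear the allocation error before retrying

    std::size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    const std::size_t wanted = capacity * sizeof(Track);
    std::fprintf(stderr,
                 "Track storage of %zu MB failed; %zu MB free, %zu MB missing. Retrying.\n",
                 wanted / MB, freeMem / MB,
                 wanted > freeMem ? (wanted - freeMem) / MB : std::size_t(0));
    capacity /= 2; // retry with a smaller track storage
  }
  return nullptr;
}
```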
agheata pushed a commit that referenced this pull request Nov 26, 2024
This PR is based on #270, with the
addition of making ROOT not required to build AdePT

---------

Co-authored-by: Stephan Hageboeck <[email protected]>
Co-authored-by: SeverinDiederichs <[email protected]>
@JuanGonzalezCaminero
Contributor

Superseded by #319
