Add an example using an asynchronous shared AdePT #270

Closed
wants to merge 74 commits

Conversation

hageboeck
Contributor

Example demonstrating how to use a single, asynchronous AdePT instance from Geant4 via the fast simulation hooks.

This example is based on Example14 with the following modifications:

  • A slot manager that can recycle slots is employed. This allows transporting considerably larger
    numbers of particles without running out of memory. The slot manager works as follows (see the sketch after this list):
    • If a new slot is needed, a slot number is fetched atomically from a list of free slots.
    • Once a track dies, its slot is marked to be freed. It cannot be freed while other tracks are being transported,
      because this might race with the allocation of new slots.
    • Periodically, the slots marked to be freed are copied into the list of free slots.
  • Only one instance of AdePT runs asynchronously in a separate thread.
    • Each G4 worker can enqueue tracks, which are all transported in parallel.
    • A transport loop runs continuously; it transports particles, injects new particles, and retrieves leaked tracks and hits.
    • The G4 workers communicate with the AdePT thread via state machines, so it is clear when events finish or need to start transporting.
    • As long as the G4 workers have CPU work to do, they don't block while the GPU transport is running.
    • Each track knows which G4 worker it came from, and the scoring structures are replicated for each active G4 worker.
  • AdePT runs not only in the region named as the GPU region in the config, but also in all daughter regions of that "AdePT region". This required refactoring the geometry visitor that sets up the GPU region.
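
A minimal sketch of the slot-recycling scheme described above. All names (`SlotManager`, `NextSlot`, `MarkSlotForFreeing`, `FreeMarkedSlots`) are invented for illustration and are not AdePT's actual API:

```cuda
#include <cassert>
#include <cuda_runtime.h>

struct SlotManager {
  unsigned int *fFreeList;     // slot numbers that may be handed out
  unsigned int *fToBeFreed;    // slots marked by dying tracks
  unsigned int *fNextFree;     // index of the next fFreeList entry to hand out
  unsigned int *fNumToBeFreed; // number of valid entries in fToBeFreed
  unsigned int fCapacity;

  // A new slot is needed: fetch one atomically from the free list.
  __device__ unsigned int NextSlot()
  {
    const unsigned int i = atomicAdd(fNextFree, 1u);
    assert(i < fCapacity && "Out of slots");
    return fFreeList[i];
  }

  // A track died: only *mark* its slot. Freeing it immediately could race
  // with other in-flight tracks that are allocating new slots.
  __device__ void MarkSlotForFreeing(unsigned int slot)
  {
    const unsigned int i = atomicAdd(fNumToBeFreed, 1u);
    fToBeFreed[i] = slot;
  }
};

// Run periodically, while no kernel is allocating slots: copy the marked
// slots back into the free list. Launched as a single block so the counter
// can be reset safely after a block-wide barrier.
__global__ void FreeMarkedSlots(SlotManager mgr)
{
  const unsigned int n = *mgr.fNumToBeFreed;
  for (unsigned int i = threadIdx.x; i < n; i += blockDim.x) {
    const unsigned int dst = atomicSub(mgr.fNextFree, 1u) - 1u;
    mgr.fFreeList[dst] = mgr.fToBeFreed[i];
  }
  __syncthreads();
  if (threadIdx.x == 0) *mgr.fNumToBeFreed = 0;
}
```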

@hageboeck hageboeck self-assigned this Jan 11, 2024
@phsft-bot

Can one of the admins verify this patch?

- Refactor the processing of hits.
  Instead of processing hits by passing a pointer/reference to a
  HostScoring instance, a loop over iterators to hits is used.
  In this way, hit scoring is decoupled from the specific implementation
  of HostScoring, and any class with the same interface as the original
  GPUHit can be used for scoring (see the first sketch after this list).
  This facilitates hit scoring for the AsyncTransport implementation.
- Move several Geant4 objects into the .cpp file to make the integration
  headers simpler.
- Place temporary scoring objects into a struct to bypass G4's pool
  allocators. This prevents a destruction-order fiasco (where the pool
  is gone but the object isn't), and keeps the scoring objects closer
  together in memory.
  A few objects unfortunately need to leak, since they are allocated in
  G4 pools, and the handles don't support placing them on the stack.
- Improve const correctness in a few places.
- Add information about threadID and eventID to the scoring interface.
  This information is required for AsyncAdePT to score correctly, but is
  unused in the thread-local transport for now.
- Split the transport from the integration-related parts, introducing a
  new source file. The integration-related parts can be reused in a
  different transport implementation.
- Create a transport abstraction, so AdePTTrackingManager is independent
  of the transport implementation.
- Add thread and event IDs to the transport interface. These are
  necessary for the async transport implementation.
- Start to enumerate tracks in the tracking manager. This can be used to
  reproducibly seed the AdePT random sequences.
- Add some const declarations for the default AdePT implementation.
- Use a factory function to instantiate AdePT (see the second sketch after
  this list). This way, different AdePT implementations can be used without
  changing code in the tracking manager or in AdePTPhysics.
- Replace a few includes with forward declarations.
- Fix device link errors that can show up when using a symbol in multiple
  CUDA translation units.
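
Two of the points above lend themselves to short sketches. First, processing hits through an iterator range instead of a concrete HostScoring instance. The `Hit` struct and `ScoreHits` function here are illustrative stand-ins; the idea is that only a GPUHit-like interface is required:

```cpp
#include <iostream>
#include <vector>

// A GPUHit-like object: only its interface matters to the scoring loop.
struct Hit {
  double fEnergyDeposit;
};

// Scoring over an iterator range instead of over a concrete HostScoring:
template <class HitIterator>
double ScoreHits(HitIterator begin, HitIterator end)
{
  double total = 0.;
  for (auto it = begin; it != end; ++it)
    total += it->fEnergyDeposit; // only the GPUHit-like interface is used
  return total;
}

int main()
{
  std::vector<Hit> hits{{1.0}, {2.5}};
  std::cout << ScoreHits(hits.begin(), hits.end()) << '\n'; // prints 3.5
}
```

Second, the factory function: the tracking manager and AdePTPhysics only see the abstract interface, and the concrete transport is chosen in one place. Class and function names below are hypothetical stand-ins for the interface introduced in this PR:

```cpp
#include <memory>

class AdePTTransportInterface {
public:
  virtual ~AdePTTransportInterface() = default;
  virtual void Initialize()          = 0;
};

class ThreadLocalTransport : public AdePTTransportInterface {
  void Initialize() override { /* one AdePT instance per worker */ }
};

class AsyncTransport : public AdePTTransportInterface {
  void Initialize() override { /* one shared, asynchronous AdePT */ }
};

// The only place that knows about the concrete implementations:
std::unique_ptr<AdePTTransportInterface> MakeAdePTInstance(bool async)
{
  if (async) return std::make_unique<AsyncTransport>();
  return std::make_unique<ThreadLocalTransport>();
}
```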
…ePT.

When AdePT tries to purge a particle from the GPU, Geant4 must not send
it back. To achieve this, a custom track ID was introduced, so that the
AdePTTrackingManager doesn't attempt to send such a particle back to the GPU.
The setSeed command was defined, but didn't have any effect. To avoid
confusion on the user side, it is removed here.
- Create a folder to convert AdePT into a shared asynchronous workflow.
- Create a base macro to run the example.
Clean up AdeptIntegration from example14 before transforming it into the
async example.
Add unique_ptr with CUDA deleters (see the sketch below). This way, CUDA
memory doesn't have to be freed manually. So far, this is only used for a
few CUDA objects, but it could be extended in the future.
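
A minimal sketch of such a CUDA deleter; the helper names (`CudaDeleter`, `unique_ptr_cuda`, `MakeDeviceArray`) are illustrative:

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <memory>

// Deleter that releases device memory, usable with std::unique_ptr.
struct CudaDeleter {
  void operator()(void *ptr) const { cudaFree(ptr); }
};

template <typename T>
using unique_ptr_cuda = std::unique_ptr<T, CudaDeleter>;

// Allocate a device array whose memory is freed automatically.
template <typename T>
unique_ptr_cuda<T> MakeDeviceArray(std::size_t n)
{
  T *ptr = nullptr;
  cudaMalloc(&ptr, n * sizeof(T));
  return unique_ptr_cuda<T>{ptr};
}

// Usage: no manual cudaFree, even on early returns or exceptions.
// auto buffer = MakeDeviceArray<float>(1 << 20);
```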
- Remove a variable without any effect.
- Remove dead variables and includes in example21.
- Fix compiler warnings.
- Minor constness fixes.
- Refactor AdeptIntegration to use a GPU management thread
- G4 workers deposit / retrieve tracks through shared memory
- WIP towards using a single shared AdePT instance per G4 application
- Use more unique_ptr and the RAII idiom
- Use fewer getters/setters in favour of constructor arguments

Several pieces are missing for a truly shared AdePT:
- Scoring per G4 worker thread
- Slot recycling in AdePT track buffers
In addition to marking slots as free, now actually start periodic kernels
that return free slots to the pool of available slots.
- Allow for flushing of single events on device.
Instead of one single scoring instance per AdePT, one scoring instance
per G4 worker is created. On the host, the scoring instances are stored
in a vector, whereas on the device they form a malloc-ed array. In this
way, the correct scoring instance can be accessed using the threadId of
each particle (see the sketch below).

In order to have the scoring in a vector, the host side had to be made
copyable/moveable. The device side remains uninitialised when a host
instance is copied, though.
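
Roughly, and with invented types (`Scoring`, `Track`, and `InitScoring` are not AdePT's real names), the layout looks like this:

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Scoring { double fEnergyDeposit; }; // placeholder per-worker scoring
struct Track   { int fThreadId; double fEdep; };

// Each track scores into the instance of the G4 worker it came from.
// (atomicAdd on double requires compute capability 6.0+.)
__global__ void Score(Scoring *scoring, const Track *tracks, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&scoring[tracks[i].fThreadId].fEnergyDeposit, tracks[i].fEdep);
}

// Host side: copyable instances in a vector; device side: a plain array.
Scoring *InitScoring(int numWorkers, std::vector<Scoring> &host)
{
  host.assign(numWorkers, Scoring{0.});
  Scoring *device = nullptr;
  cudaMalloc(&device, numWorkers * sizeof(Scoring));
  cudaMemcpy(device, host.data(), numWorkers * sizeof(Scoring),
             cudaMemcpyHostToDevice);
  return device;
}
```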
Using CUDA callbacks and state machines for the injection and particle
extraction workflows, the entire transport loop was redesigned.
Events now go through a more fine-grained state machine (see the sketch
below) to reduce latency while ensuring that we wait for all particles
to go through every step.
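
The per-event states might look roughly like this; the state names are invented for illustration and are not the ones used in the PR:

```cpp
#include <atomic>

// Fine-grained per-event states, advanced by the G4 worker, the AdePT
// thread, and CUDA host callbacks:
enum class EventState : unsigned char {
  InjectionRunning, // the G4 worker is still enqueueing tracks
  Transporting,     // tracks of this event are in flight on the GPU
  FlushRequested,   // the worker asked for all remaining tracks and hits
  HitsFlushed,      // all hits have been handed back to the worker
  Done              // the event is complete; the worker may finish it
};

// One lock-free atomic per event keeps transitions cheap:
std::atomic<EventState> state{EventState::InjectionRunning};

// e.g. a CUDA host callback enqueued after the last hit copy could do:
// state.store(EventState::HitsFlushed, std::memory_order_release);
```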
- Create an object library in the AsyncExample that uses examples/common
  for the ParticleGun and HepMC3 reader
- Delete the corresponding files from AsyncExample
- Make PrimaryGeneratorAction trivial, because all the logic is now in
  the gun
- Port AsyncExample to use AdePTTransportInterface.
- Integrate AdePTGeant4Integration into AsyncAdePT
- Start streaming hits out of GPU instead of accumulating them on
  device.
- Try to synchronise code between example1 and AsyncExample to make
  them behave the same as much as possible.
- Remove unused functionality from SlotManager.
- Use EventAction and RunAction to write histograms for each thread.
- Use the EndOfRunAction to write these histograms to a ROOT file.
- Create a dedicated library for AsyncTransport that can be linked
  instead of the standard transport library.
- Add a parentID to track structs.
- Replace track init functions by constructors, so members don't get
  missed inadvertently when the track storage is refactored.
- Instead of allocating three queues for the three particle types, all
  particle types can use the same queue. This makes better use of the
  available memory, since an overfull queue of one type might be
  compensated by a sparsely populated queue of another type.
- Put a lot more of the device memory under RAII management. This allows
  recovering from allocation failures by trying again with a smaller size.
- Allocate the largest chunk of memory (the track storage) last, and
  warn how much memory is missing to successfully allocate the array
  (see the sketch below). In this way, the maximum size of the track
  storage can be determined by the user.
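
A sketch of this allocation strategy, with illustrative names (the real code is more involved): try the track storage last, report how much device memory is missing, and retry with a smaller capacity.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

struct Track { char payload[256]; }; // stand-in for the real track struct

Track *AllocateTrackStorage(std::size_t &capacity)
{
  constexpr std::size_t MB = 1024 * 1024;
  while (capacity > 0) {
    Track *storage = nullptr;
    if (cudaMalloc(&storage, capacity * sizeof(Track)) == cudaSuccess)
      return storage;
    cudaGetLastError(); // clear the allocation error before retrying

    std::size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    const std::size_t wanted = capacity * sizeof(Track);
    std::fprintf(stderr,
                 "Track storage of %zu MB failed; %zu MB free, %zu MB missing. Retrying.\n",
                 wanted / MB, freeMem / MB,
                 wanted > freeMem ? (wanted - freeMem) / MB : std::size_t(0));
    capacity /= 2; // retry with a smaller track storage
  }
  return nullptr;
}
```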
agheata pushed a commit that referenced this pull request Nov 26, 2024
This PR is based on #270, with the
addition of making ROOT not required to build AdePT

---------

Co-authored-by: Stephan Hageboeck <[email protected]>
Co-authored-by: SeverinDiederichs <[email protected]>
@JuanGonzalezCaminero
Contributor

Superseded by #319
