Fix repeatability, deadlocks, and reduced mixing for > 1 chain #30
Conversation
Looks like the existing tests are failing in some cases before they get to the changes in this PR. I wasn't able to run the pre-commit hooks since they're out of date, but I'm happy to re-run them. Maybe a good idea to call out pre-commit in the README too -- I use it in other projects, but initially missed it here.
Updating copyright notice as required by the U.S. Department of Energy, which funded recent work.
Add elapsed runtime output to help with performance tuning, and debug-level logging to demonstrate the parallel tempering deadlock problem.
Code is from `simple.ipynb`, but repackaged for efficient bulk testing of seeded MPI runs from the CLI
1. Moved the rank comparison (the most important part) to the beginning of conditional statements to improve readability.
2. Nested conditionals together that share the same rank comparison.
3. Reordered some conditionals to prevent confusion over indent levels for large indented blocks.
4. Used local variables to avoid repeated rank-related math (see the sketch below).
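As a rough illustration of that kind of restructuring (not the actual `PTMCMCSampler` code; the function and variable names here are hypothetical), the rank test leads each branch, related logic is nested under it, and the rank arithmetic is computed once:

```python
def handle_swap_with(neighbor_rank):
    """Hypothetical stand-in for the real swap communication."""
    print(f"would negotiate a swap with chain {neighbor_rank}")


def process_step(rank, n_chains, swap_proposed):
    # Local variables avoid repeating the rank-related math in every branch.
    colder_rank = rank - 1
    hotter_rank = rank + 1

    # The rank comparison (the most important condition) comes first, and
    # branches that share the same rank test are nested together under it.
    if rank == 0:
        if swap_proposed:
            handle_swap_with(hotter_rank)
    elif rank == n_chains - 1:
        if swap_proposed:
            handle_swap_with(colder_rank)
    else:
        if swap_proposed:
            handle_swap_with(colder_rank)
            handle_swap_with(hotter_rank)


if __name__ == "__main__":
    for r in range(3):
        process_step(rank=r, n_chains=3, swap_proposed=True)
```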
100+ seeded runs ranging from 2-6 chains each are reliably terminating and producing identical results. Initial tests also show that `gaussian_mpi_example.py` is running twice as fast following this update, though larger-scale tests with different data take 1/3 longer (20 mins instead of 15 with 6+ chains).

1. Removed sequences of `comm.Iprobe()` + `time.sleep()` + `comm.recv()`. Use of `time.sleep()` appears from comments to have been a workaround to improve timing for the non-blocking `Iprobe()` call. This repeated sequence of calls seems to have caused non-deterministic parallel tempering runs, since they depend on unreliable timing of multiple MPI processes executing in parallel. Timing of processes executing in parallel cannot be relied on without explicit synchronization (see next item).
2. Send messages for negatives as well as positives. Instead of sending only positive messages on the sending side and then using `Iprobe()` [non-blocking] + `sleep()` + `recv()` [blocking] on the receiving side, send messages for both positive and negative cases, always blocking on the receiver side until the determination is made on whether or not to proceed with the next step. There may be some performance impacts here that need to be tested / optimized. (A minimal sketch of this pattern follows this commit message.)
3. Fix a conditional error in `PTswap()`. The second large conditional block only executed if the first didn't run, which meant that a lower-temperature chain in the middle of a group would only communicate with higher-temperature chains, but not respond to requests from lower-temperature chains. This created deadlocks during intermediate tests with > 2 chains.
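A minimal `mpi4py` sketch of the messaging change in items 1-2, under the assumption of two ranks and a made-up tag; this is not the actual `PTMCMCSampler` implementation:

```python
# Sketch only: the sender always sends its decision (True or False) and the
# receiver always blocks on recv(), instead of polling with Iprobe() + sleep().
# Run with, e.g.: mpirun -np 2 python swap_decision_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
SWAP_TAG = 11  # hypothetical message tag

if rank == 0:
    swap_accepted = False  # stand-in for the real acceptance test
    # Old pattern (removed): send only when the swap is accepted, and let the
    # receiver poll with comm.Iprobe(...) + time.sleep(...) before recv().
    # New pattern: always send the decision, positive or negative.
    comm.send(swap_accepted, dest=1, tag=SWAP_TAG)
elif rank == 1:
    # Blocking recv(): rank 1 cannot race ahead or miss a message, so the
    # outcome no longer depends on OS-level scheduling of the two processes.
    swap_accepted = comm.recv(source=0, tag=SWAP_TAG)
    print(f"rank 1 received swap decision: {swap_accepted}")
```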
Instead of using looped send() to communicate from the T=0 chain to all other chains, use bcast(), which should improve efficiency by reducing communication overhead at scale.
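For illustration, a hedged `mpi4py` sketch of the difference (the payload and tag are made up; the real code broadcasts its own data):

```python
# Sketch only: one collective bcast() from the T=0 chain (rank 0) replaces a
# loop of point-to-point send() calls to every other rank.
# Run with, e.g.: mpirun -np 4 python bcast_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Old pattern (removed):
#   if rank == 0:
#       for dest in range(1, comm.Get_size()):
#           comm.send(payload, dest=dest, tag=SOME_TAG)
#   else:
#       payload = comm.recv(source=0, tag=SOME_TAG)

# New pattern: a single collective broadcast from rank 0.
payload = {"step": 100, "swap": True} if rank == 0 else None  # hypothetical payload
payload = comm.bcast(payload, root=0)
print(f"rank {rank} received {payload}")
```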
Prior commits guarantee that chains can only be executing a single step in parallel. That being the case, only exchange swap messages when the current step is a swap opportunity, rather than on each step. This reduced communication overhead should be a performance boost at scale.
1. Correct for index errors between `PTMCMCOneStep()` and `PTswap()`: `iter` in `PTMCMCOneStep()` is used as 1-indexed, but `iter` in `PTswap()` is used as 0-indexed (see the sketch below).
2. Avoid a swap on the first step.
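A small sketch of the indexing alignment, with a hypothetical swap interval and function name (the real values live in `PTMCMCSampler`):

```python
# Sketch only: convert the 1-indexed step counter used by PTMCMCOneStep()-style
# callers to the 0-indexed counter expected by the swap logic, and skip the
# swap opportunity on the very first step.
def is_swap_step(iter_one_indexed, swap_interval=10):
    iter_zero_indexed = iter_one_indexed - 1   # 1-indexed -> 0-indexed
    if iter_zero_indexed == 0:                 # no swap on the first step
        return False
    return iter_zero_indexed % swap_interval == 0


if __name__ == "__main__":
    swap_steps = [i for i in range(1, 51) if is_swap_step(i)]
    print(swap_steps)  # [11, 21, 31, 41] with the hypothetical interval of 10
```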
Prior commits guarantee that chains can only be executing a single step in parallel. That being the case, only exchange messages for this data on the prescribed step frequency, instead of on each step. This reduced communication overhead should be a performance boost at scale.
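As with the earlier swap-opportunity change, here is a minimal sketch of gating the exchange on the step frequency (the interval, tag, and payload are hypothetical, not the actual `PTMCMCSampler` values):

```python
# Sketch only: ranks skip MPI communication entirely on steps that are not an
# exchange opportunity, relying on the lockstep execution guaranteed earlier.
# Run with, e.g.: mpirun -np 2 python exchange_interval_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
EXCHANGE_TAG = 22   # hypothetical message tag
INTERVAL = 10       # hypothetical exchange frequency

for step in range(1, 101):
    # ... one MCMC step per rank would happen here ...
    if step % INTERVAL != 0:
        continue  # no communication on non-exchange steps
    if rank == 0:
        comm.send({"step": step}, dest=1, tag=EXCHANGE_TAG)
    elif rank == 1:
        data = comm.recv(source=0, tag=EXCHANGE_TAG)
        print(f"rank 1 received exchange data for step {data['step']}")
```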
Force pushing to rebase onto master following #32. No manual code edits were made during the rebase, though I did have to do a subsequent
Force-pushed from 4eac542 to 40dc984.
Things like improved logging, simplified conditionals, and performance enhancements are welcome additions! So if you can reconcile things when #29 merges, I will take a closer look at these changes, and hopefully merge.
Thanks for your reply! Yes, I started making a few comments there to note differences. It looks like we did this work roughly in parallel, though my submission was greatly delayed waiting on approval to publicly release the code.
Looking back over this code, most of the changes implemented and tested here are unfortunately superseded by the ones made roughly in parallel in #29. I do see a few useful changes left in this branch that are still worth extracting, but the vast majority of the work here has been superseded.
As noted in #25, seeded `PTMCMCSampler` runs are not currently repeatable using > 1 chain. Additionally, I discovered during testing that there are occasional deadlocks that seem to stem in part from the same repeatability issue (non-deterministic execution when multiple chains are executed in parallel). I've also discovered / fixed what I believe are isolated conditional logic errors that cause chains to not exchange state as expected, resulting in reduced mixing.
Suggestions for review
I've tried to structure the commits in this PR as a linear narrative that should be much easier to follow than the overall diff for the PR as a whole. The core logic for parallel tempering should be essentially unchanged, but I believe I've fixed several implementation errors in conditionals, communication / sequencing, and indexing. These changes and the resulting data have been through a great deal of internal review on our side prior to PR submission, and are being used to design synthetic biology experiments. Of course you, the original authors, may also have new / valuable feedback to add.
Whitespace changes are likely to dominate the diffs, so a good graphical diff that can highlight code changes and ignore whitespace may save a lot of time in review. PyCharm is my personal favorite.
Proposed copyright updates
This work was funded by the United States Department of Energy, which requires that I include the proposed additions to your `LICENSE` file to acknowledge their recent contribution.

Demonstrate the problem
`bulk_gaussian_test.sh`
Combined with the `gaussian_mpi_example.py` script included in this PR (which repackages code from PTMCMCSampler's "simple" notebook), I've been using the `bulk_gaussian_test.sh` Bash script to demonstrate the repeatability issue and to test fixes in this PR. The script performs 100 consecutive, seeded runs of `gaussian_mpi_example.py` and, as the final output line of each run, prints a hash of the resulting low-temperature `chain_1.0.txt` file.
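The actual Bash script lives in the PR; purely as an illustration of the loop it performs, a rough Python equivalent might look like the sketch below. The `mpirun` arguments, the output path, and how the seed reaches `gaussian_mpi_example.py` are assumptions, not taken from the real script.

```python
# Sketch only: run gaussian_mpi_example.py repeatedly and print a hash of the
# low-temperature chain file after each run, mirroring bulk_gaussian_test.sh.
import hashlib
import subprocess
from pathlib import Path

N_RUNS = 100
CHAIN_FILE = Path("chains/chain_1.0.txt")  # hypothetical output location

for run in range(1, N_RUNS + 1):
    # Seeding is assumed to be handled by gaussian_mpi_example.py itself; the
    # real script's command line and process count may differ.
    subprocess.run(
        ["mpirun", "-np", "4", "python", "gaussian_mpi_example.py"],
        check=True,
    )
    digest = hashlib.sha256(CHAIN_FILE.read_bytes()).hexdigest()
    print(f"run {run:3d}: {digest}")
```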
Repeatability problem (~5 min)

1. Check out the commit that adds `gaussian_mpi_example.py` and DEBUG logging to facilitate testing, but doesn't yet change any of the core `PTMCMCSampler` code.
2. Run `bulk_gaussian_test.sh` to show that result file hashes are not the same for a few runs. The final output line from each run contains the result hash, and `CTRL-C` will exit early (full execution takes a long time & performs 100 runs).

Deadlock problem (~50 mins)
1. Check out the commit that adds `gaussian_mpi_example.py` and DEBUG logging to facilitate testing, but doesn't yet change any of the core `PTMCMCSampler` code.
2. In `bulk_gaussian_test.sh`, uncomment the `DEBUG=1` line to switch on DEBUG-level logging in the Python code.
3. Run the `bulk_gaussian_test.sh` script until it hangs (up to ~50 mins on my 2019 MacBook Pro).

There are timestamps included in the output, so it should be safe to let this run unattended after the first run or so, then come back to check on it later. Of course there is some associated randomness (e.g. in OS-level scheduling of multiple MPI processes running in parallel), but I have never seen it run past ~60 of the 100 intended runs. When a run eventually hangs, DEBUG output will show that some of the chains ran to completion and others didn't. You can compare timestamps in the run output to the current time to verify how long it's been hanging.
Demonstrate the solution (~50 mins)
1. In `bulk_gaussian_test.sh`, comment out the `DEBUG=1` line to hide DEBUG-level logging in the Python code. No hangs are anticipated, and result hashes & execution times are much easier to compare across runs with minimal output.
2. Run `bulk_gaussian_test.sh` to completion as many times as desired.

All runs should complete and produce identical checksums, except after the "Fix indexing..." commit, which purposefully changes program behavior. Other commits should be strictly performance improvements. Of course it's impossible to prove that a deadlock never occurs, but I haven't yet seen any in this PR code after many hundreds of runs.
Performance
It may be worth noting the average runtime for the gaussian tests before and after the changes in this PR. It's been a while since I worked on this (I've been waiting for permission to publicly release), but I believe that low-level results from this test did not correlate with our higher-level stress tests using proprietary code based on this branch. This image demonstrates performance of our proprietary code, where v1 corresponds to the initial core change to `PTMCMCSampler`, and v3 corresponds to the final commit (v2 was an error).

Most of the later commits were designed to improve performance for many cores / chains after the initial repeatability issue was resolved. That initial fix, with its heavy-handed communication requirements between chains, increased computational cost proportionally to the number of cores used.