Fix repeatability, deadlocks, and reduced mixing for > 1 chain #30
Conversation
Looks like the existing tests are failing in some cases before they get to the changes in this PR. I wasn't able to run the pre-commit hooks since they're out of date, but I'm happy to re-run them. Maybe a good idea to call out pre-commit in the README too -- I use it in other projects, but initially missed it here.
Updating copyright notice as required by the U.S. Department of Energy, which funded recent work.
Add elapsed runtime output to help with performance tuning, and debug-level logging to demonstrate the parallel tempering deadlock problem.
Code is from `simple.ipynb`, but repackaged for efficient bulk testing of seeded MPI runs from the CLI
1. Moved the rank comparison (the most important part) to the beginning of conditional statements to improve readability.
2. Nested conditionals together that share the same rank comparison.
3. Reordered some conditionals to prevent confusion over indent levels for large indented blocks.
4. Used local variables to avoid repeated rank-related math (see the sketch below).
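As a rough illustration of that kind of restructuring (not the actual `PTMCMCSampler` code; the function and variable names here are hypothetical), the rank test leads each branch, related logic is nested under it, and the rank arithmetic is computed once:

```python
def handle_swap_with(neighbor_rank):
    """Hypothetical stand-in for the real swap communication."""
    print(f"would negotiate a swap with chain {neighbor_rank}")


def process_step(rank, n_chains, swap_proposed):
    # Local variables avoid repeating the rank-related math in every branch.
    colder_rank = rank - 1
    hotter_rank = rank + 1

    # The rank comparison (the most important condition) comes first, and
    # branches that share the same rank test are nested together under it.
    if rank == 0:
        if swap_proposed:
            handle_swap_with(hotter_rank)
    elif rank == n_chains - 1:
        if swap_proposed:
            handle_swap_with(colder_rank)
    else:
        if swap_proposed:
            handle_swap_with(colder_rank)
            handle_swap_with(hotter_rank)


if __name__ == "__main__":
    for r in range(3):
        process_step(rank=r, n_chains=3, swap_proposed=True)
```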
100+ seeded runs ranging from 2-6 chains each are reliably terminating and producing identical results. Initial tests also show that `gaussian_mpi_example.py` is running twice as fast following this update, though larger-scale tests with different data take 1/3 longer (20 mins instead of 15 with 6+ chains).

1. Removed sequences of `comm.Iprobe()` + `time.sleep()` + `comm.recv()`. Use of `time.sleep()` appears from comments to have been a workaround to improve timing for the non-blocking `Iprobe()` call. This repeated sequence of calls seems to have caused non-deterministic parallel tempering runs, since they depend on unreliable timing of multiple MPI processes executing in parallel. Timing of processes executing in parallel cannot be relied on without explicit synchronization (see next item).
2. Send messages for negatives as well as positives. Instead of sending only positive messages on the sending side and then using `Iprobe()` [non-blocking] + `sleep()` + `recv()` [blocking] on the receiving side, send messages for both positive and negative cases, always blocking on the receiver side until the determination is made on whether or not to proceed with the next step. There may be some performance impacts here that need to be tested / optimized. (A minimal sketch of this pattern follows this commit message.)
3. Fix a conditional error in `PTswap()`. The second large conditional block only executed if the first didn't run, which meant that a lower-temperature chain in the middle of a group would only communicate with higher-temperature chains, but not respond to requests from lower-temperature chains. This created deadlocks during intermediate tests with > 2 chains.
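A minimal `mpi4py` sketch of the messaging change in items 1-2, under the assumption of two ranks and a made-up tag; this is not the actual `PTMCMCSampler` implementation:

```python
# Sketch only: the sender always sends its decision (True or False) and the
# receiver always blocks on recv(), instead of polling with Iprobe() + sleep().
# Run with, e.g.: mpirun -np 2 python swap_decision_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
SWAP_TAG = 11  # hypothetical message tag

if rank == 0:
    swap_accepted = False  # stand-in for the real acceptance test
    # Old pattern (removed): send only when the swap is accepted, and let the
    # receiver poll with comm.Iprobe(...) + time.sleep(...) before recv().
    # New pattern: always send the decision, positive or negative.
    comm.send(swap_accepted, dest=1, tag=SWAP_TAG)
elif rank == 1:
    # Blocking recv(): rank 1 cannot race ahead or miss a message, so the
    # outcome no longer depends on OS-level scheduling of the two processes.
    swap_accepted = comm.recv(source=0, tag=SWAP_TAG)
    print(f"rank 1 received swap decision: {swap_accepted}")
```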
Instead of using looped send() to communicate from the T=0 chain to all other chains, use bcast(), which should improve efficiency by reducing communication overhead at scale.
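For illustration, a hedged `mpi4py` sketch of the difference (the payload and tag are made up; the real code broadcasts its own data):

```python
# Sketch only: one collective bcast() from the T=0 chain (rank 0) replaces a
# loop of point-to-point send() calls to every other rank.
# Run with, e.g.: mpirun -np 4 python bcast_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Old pattern (removed):
#   if rank == 0:
#       for dest in range(1, comm.Get_size()):
#           comm.send(payload, dest=dest, tag=SOME_TAG)
#   else:
#       payload = comm.recv(source=0, tag=SOME_TAG)

# New pattern: a single collective broadcast from rank 0.
payload = {"step": 100, "swap": True} if rank == 0 else None  # hypothetical payload
payload = comm.bcast(payload, root=0)
print(f"rank {rank} received {payload}")
```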
Prior commits guarantee that chains can only be executing a single step in parallel. That being the case, only exchange swap messages when the current step is a swap opportunity, rather than on each step. This reduced communication overhead should be a performance boost at scale.
1. Correct for index errors between `PTMCMCOneStep()` and `PTswap()`: `iter` in `PTMCMCOneStep()` is used as 1-indexed, but `iter` in `PTswap()` is used as 0-indexed (see the sketch below).
2. Avoid a swap on the first step.
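A small sketch of the indexing alignment, with a hypothetical swap interval and function name (the real values live in `PTMCMCSampler`):

```python
# Sketch only: convert the 1-indexed step counter used by PTMCMCOneStep()-style
# callers to the 0-indexed counter expected by the swap logic, and skip the
# swap opportunity on the very first step.
def is_swap_step(iter_one_indexed, swap_interval=10):
    iter_zero_indexed = iter_one_indexed - 1   # 1-indexed -> 0-indexed
    if iter_zero_indexed == 0:                 # no swap on the first step
        return False
    return iter_zero_indexed % swap_interval == 0


if __name__ == "__main__":
    swap_steps = [i for i in range(1, 51) if is_swap_step(i)]
    print(swap_steps)  # [11, 21, 31, 41] with the hypothetical interval of 10
```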
Prior commits guarantee that chains can only be executing a single step in parallel. That being the case, only exchange messages for this data on the prescribed step frequency, instead of on each step. This reduced communication overhead should be a performance boost at scale.
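As with the earlier swap-opportunity change, here is a minimal sketch of gating the exchange on the step frequency (the interval, tag, and payload are hypothetical, not the actual `PTMCMCSampler` values):

```python
# Sketch only: ranks skip MPI communication entirely on steps that are not an
# exchange opportunity, relying on the lockstep execution guaranteed earlier.
# Run with, e.g.: mpirun -np 2 python exchange_interval_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
EXCHANGE_TAG = 22   # hypothetical message tag
INTERVAL = 10       # hypothetical exchange frequency

for step in range(1, 101):
    # ... one MCMC step per rank would happen here ...
    if step % INTERVAL != 0:
        continue  # no communication on non-exchange steps
    if rank == 0:
        comm.send({"step": step}, dest=1, tag=EXCHANGE_TAG)
    elif rank == 1:
        data = comm.recv(source=0, tag=EXCHANGE_TAG)
        print(f"rank 1 received exchange data for step {data['step']}")
```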
Force pushing to rebase onto master following #32. No manual code edits were made during the rebase, though I did have to do a subsequent
Force-pushed from 4eac542 to 40dc984.
Things like improved logging, simplified conditionals, and performance enhancements are welcome additions! So if you can reconcile things when #29 merges, I will take a closer look at these changes, and hopefully merge.
Thanks for your reply! Yes, I started making a few comments there to note differences. It looks like we did this work roughly in parallel, though my submission was greatly delayed waiting on approval to publicly release the code.
Looking back over this code, most of the changes implemented and tested here are unfortunately superseded by the ones made roughly in parallel in #29. I do see a few useful changes left in this branch that are still worth extracting, but the vast majority of the work here has been superseded.
As noted in #25, seeded `PTMCMCSampler` runs are not currently repeatable using > 1 chain. Additionally, I discovered during testing that there are occasional deadlocks that seem to stem in part from the same repeatability issue (non-deterministic execution when multiple chains are executed in parallel). I've also discovered / fixed what I believe are isolated conditional logic errors that cause chains to not exchange state as expected, resulting in reduced mixing.
Suggestions for review
I've tried to structure the commits in this PR as a linear narrative that should be much easier to follow than the overall diff for the PR as a whole. The core logic for parallel tempering should be essentially unchanged, but I believe I've fixed several implementation errors in conditionals, communication / sequencing, and indexing. These changes and the resulting data have been through a great deal of internal review on our side prior to PR submission, and are being used to design synthetic biology experiments. Of course you, the original authors, may also have new / valuable feedback to add.
Whitespace changes are likely to dominate the diffs, so a good graphical diff that can highlight code changes and ignore whitespace may save a lot of time in review. PyCharm is my personal favorite.
Proposed copyright updates
This work was funded by the United States Department of Energy, which requires that I include the proposed additions to your `LICENSE` file to acknowledge their recent contribution.

Demonstrate the problem
`bulk_gaussian_test.sh`
Combined with the `gaussian_mpi_example.py` script included in this PR (which repackages code from PTMCMCSampler's "simple" notebook), I've been using the `bulk_gaussian_test.sh` Bash script to demonstrate the repeatability issue and to test fixes in this PR. The script performs 100 consecutive, seeded runs of `gaussian_mpi_example.py` and, as the final output line of each run, prints a hash of the resulting low-temperature `chain_1.0.txt` file.
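The actual Bash script lives in the PR; purely as an illustration of the loop it performs, a rough Python equivalent might look like the sketch below. The `mpirun` arguments, the output path, and how the seed reaches `gaussian_mpi_example.py` are assumptions, not taken from the real script.

```python
# Sketch only: run gaussian_mpi_example.py repeatedly and print a hash of the
# low-temperature chain file after each run, mirroring bulk_gaussian_test.sh.
import hashlib
import subprocess
from pathlib import Path

N_RUNS = 100
CHAIN_FILE = Path("chains/chain_1.0.txt")  # hypothetical output location

for run in range(1, N_RUNS + 1):
    # Seeding is assumed to be handled by gaussian_mpi_example.py itself; the
    # real script's command line and process count may differ.
    subprocess.run(
        ["mpirun", "-np", "4", "python", "gaussian_mpi_example.py"],
        check=True,
    )
    digest = hashlib.sha256(CHAIN_FILE.read_bytes()).hexdigest()
    print(f"run {run:3d}: {digest}")
```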
Repeatability problem (~5 min)

1. Check out the commit that adds `gaussian_mpi_example.py` and DEBUG logging to facilitate testing, but doesn't yet change any of the core `PTMCMCSampler` code.
2. Run `bulk_gaussian_test.sh` to show that result file hashes are not the same for a few runs. The final output line from each run contains the result hash, and `CTRL-C` will exit early (full execution takes a long time & performs 100 runs).

Deadlock problem (~50 mins)
1. Check out the commit that adds `gaussian_mpi_example.py` and DEBUG logging to facilitate testing, but doesn't yet change any of the core `PTMCMCSampler` code.
2. In `bulk_gaussian_test.sh`, uncomment the `DEBUG=1` line to switch on DEBUG-level logging in the Python code.
3. Run the `bulk_gaussian_test.sh` script until it hangs (up to ~50 mins on my 2019 MacBook Pro).

There are timestamps included in the output, so it should be safe to let this run unattended after the first run or so, then come back to check on it later. Of course there is some associated randomness (e.g. in OS-level scheduling of multiple MPI processes running in parallel), but I have never seen it run past ~60 of the 100 intended runs. When a run eventually hangs, DEBUG output will show that some of the chains ran to completion and others didn't. You can compare timestamps in the run output to the current time to verify how long it's been hanging.
Demonstrate the solution (~50 mins)
1. In `bulk_gaussian_test.sh`, comment out the `DEBUG=1` line to hide DEBUG-level logging in the Python code. No hangs are anticipated, and result hashes & execution times are much easier to compare across runs with minimal output.
2. Run `bulk_gaussian_test.sh` to completion as many times as desired.

All runs should complete and produce identical checksums, except after the "Fix indexing..." commit, which purposefully changes program behavior. Other commits should be strictly performance improvements. Of course it's impossible to prove that a deadlock never occurs, but I haven't yet seen any in this PR code after many hundreds of runs.
Performance
It may be worth noting the average runtime for the gaussian tests before and after the changes in this PR. It's been a while since I worked on this (I've been waiting for permission to publicly release), but I believe that low-level results from this test did not correlate with our higher-level stress tests using proprietary code based on this branch. This image demonstrates performance of our proprietary code, where v1 corresponds to the initial core change to `PTMCMCSampler`, and v3 corresponds to the final commit (v2 was an error).

Most of the later commits were designed to improve performance for many cores / chains after the initial repeatability issue was resolved. That initial fix, with its heavy-handed communication requirements between chains, increased computational cost proportionally to the number of cores used.