Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resuming a calculation with hot chains (-nc) #54

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

bvgoncharov
Copy link

@bvgoncharov bvgoncharov commented Apr 24, 2024

Because

  1. Hot chains must always be loaded for resuming to resume correctly,
  2. writeHotChains=False by default, and it is not set as a keyword argument in setup_sampler() in enterprise extensions,

I suggest resetting the default argument for now, to avoid problems with the results.

Tagging @vhaasteren , @astrolamb

@vhaasteren
Copy link
Member

How about leaving the default to False (because they take up a lot of disk space), and instead throwing an error if resume=True and we are sampling with resume=False?

The only issue with that I see is that it could be that the runs were initially done with resume=False (with writeHotChains=False), and then you cannot abort/resume them.

But I think it's probably not fun for people who now suddenly have jobs that use up a lot of memory. Perhaps we can also set a more useful error message than the one that sparked this adjustment?

@kdolum
Copy link
Collaborator

kdolum commented Apr 24, 2024

I agree with @vhaasteren that just changing the default writeHotChains is not the right answer. You could not care about resuming and not want to use up lots of disk space. On the other hand, I don't think it makes sense to resume a parallel-tempering run with no data about where to start the hot chains. William Lamb's suggestion to save just enough data to resume is a good one, but would take some implementation effort. Meanwhile I suggest the following minimal changes:

  1. Apparently it now writes empty hot chain files when writeHotChains=False. Change it to not write anything.
  2. Refuse to resume a parallel-tempering run unless we have all the old chains. Give a sensible error message.
  3. Fix enterprise_extensions to pass on writeHotChains to PTMCMCSampler.

I'm not sure about interpreting resume as "intent to resume later" and so signaling an error when resume=True but writeHotChains=False. You might want this. There might be a default resume=True when you're not intending to resume, or you might be resuming for the last time and not need to write the hot chains further. I would be more inclined to just print a warning in this case, as we do now when Niter is not a multiple of thin so that samples will be lost. We could also print a warning when Niter is not a multiple of isave, which leads to a partial block in the chain file and then you cannot resume.

@vhaasteren
Copy link
Member

Excellent suggestions @kdolum, I agree with all of it. I like the warnings you suggest

@bvgoncharov
Copy link
Author

Thanks, @vhaasteren and @kdolum , this sounds good. I was initially thinking in terms of 4 temperatures, but if people use >10 saving extra chains by default might indeed cause problems. In case we are avoiding drastic changes, I have one more suggestion. IIn particular, to show a warning instead of an error when attempting to resume the run without the old chains. Perhaps it can also avoid practical issues for users. I incorporated it in my latest commit. What do you think?

@kdolum
Copy link
Collaborator

kdolum commented Apr 24, 2024

I think it should be an error if you try to resume a parallel-tempering run with no hot chain information. What are we supposed to use as the initial samples for the hot chains? I guess it currently uses the initial sample given to the sampler (which I imagine is otherwise ignored on resuming). That seems unfortunate. What is the purpose of resuming in this case, instead of just doing a new run to gather more samples?
But if this is useful, we could put in some with additional variable to suppress the error. I just don't think people should get this behavior without an explicit request.
I suppose maybe the last sample from the cold chain is a better starting place for the hot chains than a random sample if one does this at all.

@bvgoncharov
Copy link
Author

Adding a "special request" variable, at least for my case, will not make a difference for me because it's not in enterprise_extensions anyway. So, perhaps I can just raise an exception instead...

@vhaasteren
Copy link
Member

I like the extra variable that suppresses the error that @kdolum suggests. And, we can totally add it to enterprise_extensions. I'd approve such a PR

It would have been neater to not write the hot chain files at all, but the sampler already has this behavior. Let's not change it unless we have to. Perhaps add one more check for self.resume in the current PR changes to differentiate between error and warning? It makes no sense to resume unless suppressed by this type of variable. Just make the error message usefule

@bvgoncharov
Copy link
Author

Hi @vhaasteren , @kdolum , I got distracted to other work, but now I introduced this new variable that we discussed. Shall we close the PR now?

@kdolum
Copy link
Collaborator

kdolum commented Jun 25, 2024

I'm confused by the logic here. It looks like what you did is that if the old chain file is the wrong length but ignoreHotChainsWhenResuming is set, and this is not the cold chain, then we ignore the error and start the chain file anew. But it looks like it will still print something about how it is resuming.

  1. Do we want the flag to always ignore old hot chains, or only ignore them if they are invalid? If the latter, it should have a different name like "allowMissingHotChainsWhenResuming". In any case, I think the function argument should have the full name, not an abbreviation. I would also advocate not ignoring the chain file if it is bogus, but only if the previous run didn't set writeHotChains. I guess that leads to a zero length chain, so only in that specific case do I think we should go on. Otherwise something fishy could be happening and we don't get notified of it.
  2. I think when it doesn't use the old chain it shouldn't say it is resuming.
  3. There may be issues with sample counting. If some chains resumed but others didn't, the numbering of samples will be different in different processes. I think if you don't set writeHotChains, then you resume and this time you do set writeHotChains, your hot chains will be shorter than your cold chain. Maybe it doesn't matter because all counting and reporting is done by rank 0, but it might be good to think about whether there will be problems.
  4. Perhaps it would be best to start the hot chains with the last sample from the cold chain if you don't have a hot chain to resume from. Otherwise there is some time of burning in the hot chains during which time parallel tempering won't help you. I wouldn't use this whole feature myself anyway, so I don't have strong feelings.

@bvgoncharov
Copy link
Author

@kdolum , I can rename the variable, but the discussion of (1, 3) resuming with zero/non-zero length cold chains, (4) selecting a sample, -- still all boils down to the same case of starting the burn in in hot chains again. So, this seems as micro-managing something that is not important, and/or discriminating between incorrect use cases. One either cares about hot chains and resumes correctly, or not, so it should not make any difference to users.

@vhaasteren , what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants