Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AlchemicalNonequilibriumLangevinIntegrator acceptance #201

Open
sgill2 opened this issue May 18, 2017 · 51 comments
Open

AlchemicalNonequilibriumLangevinIntegrator acceptance #201

sgill2 opened this issue May 18, 2017 · 51 comments

Comments

@sgill2
Copy link

sgill2 commented May 18, 2017

Copying from the alchemistry Slack channel from @davidlmobley:
We're having some issues with our work on binding mode sampling with NCMC (in BLUES), especially as we try to switch to the AlchemicalNonequilibriumLangevinIntegrator, where we are having large energies/forces that lead to crashes as we are turning back on alchemical interactions between the ligand and the protein. Part of this seems to potentially be due to problems with the alchemical correction factor in certain cases, but we think that actually forces are getting large enough to cause the integrator to lose stability first. Do you have suggestions? We were wondering whether we ought to be checking for forces that exceed some sort of cutoff and automatically rejecting cases where that happens (naively it seems to me there ought to be some sort of force to timestep ratio which determines the largest force that can be tolerated without risking crashes, and that anything exceeding that value by many orders of magnitude should certainly be rejected).

Some notes
•For the AlchemicalNonequilibriumLangevinIntegrator we were using R V O H O V R (g-baoab) splitting. •The "alchemical correction factor" is the correction for transform the real system to/from the alchemical system (it was in perses.annihilation.ncmc_switching at one point, https://github.com/choderalab/perses/blob/3c0e672728c0acb87a0efc9fa76bc893f1b5dfed/perses/annihilation/ncmc_switching.py#L162-L182 but glancing over the master branch I don't see it there now)

@sgill2
Copy link
Author

sgill2 commented May 18, 2017

Here's an example that highlights the issue we're seeing with the correction factor. In the zip file correction.zip correction.py script takes a system with a ligand and turns on the electrostatics/sterics over the course of 50 steps. At each step of the ncmc integrator this script outputs what the potentials would be for those positions in the normal system and if the alchemical system was fully interacting (lambda steric/statics = 1) as well as the difference between the potentials to emulate what's done for the alchemical correction factor.

(Just to note this starting position was taken the middle of a symmetric protocol where we turned off the sterics/electrostatics and performed a random rotation of the ligand. This 'correction.py' script is basically emulating the other half of turning on those interactions after the rotation.)

@sgill2
Copy link
Author

sgill2 commented May 18, 2017

Also tagging @pgrinaway since we were planning on discussing these topics with him.

@jchodera
Copy link
Member

So for the correction.py script (where I had to change your .inpcrd file to match the .prmtop filename before the prefix), you are turning lambda back on and expect to see the difference between real and alchemical energies go to zero, right? Instead, the problem is that you see a large difference, like

step 48
energy norm_sim -2074635.8145003263 kJ/mol energy alch_sim -2074676.2595828865 kJ/mol
partial alchemical correction 40.44508256018162 kJ/mol
protocol_work -491.23986235819757
step 49
energy norm_sim -2061930.9198158663 kJ/mol energy alch_sim -2061970.6320748944 kJ/mol
partial alchemical correction 39.71225902810693 kJ/mol
protocol_work -512.2767891641706

and the concern is that difference does not cancel with the lambda=0 difference?

@jchodera
Copy link
Member

And this issue is separate from seeing NaNs appear periodically, yes?

@jchodera
Copy link
Member

Tagging in @maxentile on the g-BAOAB stability issue, since he's looking into this right now.

My suggestion is to modify your splitting string to use k=4 or k=5 inner steps for g-BAOAB, like

R V V V V O H O V V V V R

for k=4. This should improve stability. @maxentile is currently benchmarking DHFR to see how the choice of k impacts the maximum stable timestep for g-BAOAB.

@sgill2
Copy link
Author

sgill2 commented May 18, 2017

Instead, the problem is that you see a large difference, and the concern is that difference does not cancel with the lambda=0 difference?

Yes. I guess when you just ran the system didn't blow up, but when the system does blow up we can see numbers like

('energy norm_sim', Quantity(value=8.789168765780722e+19, unit=kilojoule/mole), 'alch energy', Quantity(value=8.809416936406537e+19, unit=kilojoule/mole))
('partial alchemical correction', Quantity(value=-2.0248170625815347e+17, unit=kilojoule/mole))

which can really skew the acceptance.

@sgill2
Copy link
Author

sgill2 commented May 18, 2017

And this issue is separate from seeing NaNs appear periodically, yes?

We think it's be a separate issue, since we do see some NaNs following acceptance when we turn off the alchemical correction factor.

@sgill2
Copy link
Author

sgill2 commented May 18, 2017

I'll give the increased inner steps a try in the meantime, and I'll give an update after I test it out. Thanks!

@jchodera
Copy link
Member

How precisely are you computing the alchemical correction factor?

The alchemically-modified PME system omits the reciprocal-space interactions, so it will always differ from the real system. The hope is that this difference is rather insensitive to exactly which orientation the ligand is in. So you'd compute something like

alchemical_correction = (initial_alchemical - initial_system) + (final_system - final_alchemical)

Is that what you're doing? If so, what does that look like, and how does the quantity computed in your example bear on this since it's only half the quantity of interest?

Is it just that you sometimes see very large discrepancies? This is probably only occurring when the integration is unstable and energies are blowing up anyway---the two systems will have very different numerical behaviors when the energies are insanely high.

This is probably just an integrator stability issue, where the geodesic inner steps will help a lot.

@maxentile
Copy link
Member

Hi @sgill2 ,

A couple quick notes:

R V O H O V R (g-baoab)

This is actually not "BAOAB" -- BAOAB would be V R O R V (note that we're using different notation than Leimkuhler and Matthews, V=B, R=A, O=O).

Regarding the number of geodesic drift steps k, I'm currently working on benchmarking this systematically and hope to be able to provide a recommendation soon. In my initial tests, I haven't yet been able to measure a difference between k=1,2,3,4... or provide a recommendation here, but Leimkuhler and Matthews report that increasing k can increase the stability threshold (http://rspa.royalsocietypublishing.org/content/472/2189/20160138).

Regarding the placement of the hamiltonian update (should you use V R O R V H vs V R O H O R V or something), that likely has an effect on stability, but I haven't yet measured the effects of this choice for realistic systems.

@davidlmobley
Copy link

@sgill2 - looks like the order of updates MAY be incorrect, then. Probably that's the first thing to check... (But, @maxentile - @jchodera had specifically recommended this g-BAOAB scheme to us here: MobleyLab/blues#50 ; did he mistype?)

There's also earlier background -- April 7th @jchodera and I were in the same room together and @sgill2 had me ask him this: "The normal order for the BAOAB integrator is VRORV. If I was to use the AlchemicalLangevinSplittingIntegrator would the order of moves then be VROHORV, where H is the alchemical perturbation step?" I asked John this in person and he told me yes -- my summary was "John says yes, or put the H's at the beginning and the end. It's half as expensive if they are at the beginning and the end, he thinks! So HVRORVH."

@maxentile - have you guys looked at this any more since then?

Please do keep us posted. We had a protocol that was basically working for us with our original integrator but we're trying to switch to the "better" integrator you guys have been recommending using AlchemicalNonequilibriumLangevinIntegrator with recommended splitting, and when we do so we immediately start having terrible stability problems.

@jchodera
Copy link
Member

This is actually not "BAOAB" -- BAOAB would be V R O R V

Whoops, my fault here!

@jchodera
Copy link
Member

@maxentile - have you guys looked at this any more since then?

We're working on quantifying the effect of H placement on shadow work accumulation, but my original estimates that one scheme is much faster than another scheme may be incorrect. For now, we think that (as long as you are using the latest openmmtools 0.9.4) you can either use V R O H O R V or H V R O R V H. (There had been a bug in earlier versions where adding two H characters did not do proper bookkeeping.)

@jchodera
Copy link
Member

Please do keep us posted. We had a protocol that was basically working for us with our original integrator but we're trying to switch to the "better" integrator you guys have been recommending using AlchemicalNonequilibriumLangevinIntegrator with recommended splitting, and when we do so we immediately start having terrible stability problems.

What was your original protocol? We should be able to give you a splitting that you can compare against that should replicate your original behavior for reference.

@sgill2
Copy link
Author

sgill2 commented May 19, 2017

What was your original protocol?

We were using the velocity verlet (from the integrator in perses).

@sgill2
Copy link
Author

sgill2 commented May 19, 2017

This is actually not "BAOAB" -- BAOAB would be V R O R V (note that we're using different notation than Leimkuhler and Matthews, V=B, R=A, O=O).

Thanks! I'll test out using the actual BAOBAB splitting, along with varying geodesic drift steps–to see how that affects the integration.

@jchodera
Copy link
Member

We were using the velocity verlet (from the integrator in perses).

If it was just velocity Verlet, then you should be able to use

H V R V H

I think the one difference is that we used to do steps of Velocity Verlet on either end of the protocol with H updates in the middle, whereas this one is still symmetric but bookended by H updates.

Can you compare that with k=4 BAOAB (H V V V V R O R V V V V H or V V V V R O H O R V V V V )?

@maxentile
Copy link
Member

k=4 BAOAB (H V V V V R O R V V V V H or V V V V R O H O R V V V V)?

Geodesic BAOAB uses consecutive constrained drift steps, so those should have been H V R R R R O R R R R V H and V R R R R O H O R R R R V, respectively (implementation).

Note that multiple consecutive constrained V steps are equivalent to a single longer V step, but multiple consecutive constrained R steps are not necessarily equivalent to a single longer R step.

Also note that V R O H O R V, V R H O R V, and V R O H R V should all be equivalent, just the first one requires drawing 2x as many Gaussian random numbers per timestep...

@jchodera
Copy link
Member

@maxentile : Thanks for the clarifications, and apologies for my errors above!

@sgill2
Copy link
Author

sgill2 commented Jun 1, 2017

I’ve tested the integrators with actual BAOAB splitting, and both ran without accepting positions that resulted in NaNs–at least without the alchemical correction factor. When including the correction factor, I still get into the case where moves are accepted which seemingly shouldn’t be.

Just to be sure I haven’t messed up the alchemical correction–which is a correction for transforming the system to/from an alchemical system–I have it defined as:
correction = [Alch_energy(i) - Norm_energy(i)] + [Norm_energy(f) - Alch_energy(f)],
where Alch_energy is the energy of the alchemically transformed system, and Norm_energy is the energy of the normal system. i and f correspond to the initial and final positions respectively.

To illustrate an example case, this alch_corr.zip attached .zip file contains the initial and final positions of such a case after some ncmc integration steps in which the move was accepted. If you run the enclosed sys_compare.py, you should see that in the initial position the energies between the alchemical and normal system are roughly the same. The final positions, however, are in a configuration that are nowhere near favorable, and the difference between the two systems with those positions is vastly dissimilar, which leads to an extremely large acceptance correction factor overall.

('final energies:', 'normal', Quantity(value=1324996177395145.2, unit=kilojoule/mole), 
'alchemical', Quantity(value=1324996178206507.0, unit=kilojoule/mole))
('init energies:', 'normal', Quantity(value=-2050645.7073889282, unit=kilojoule/mole), 
'alchemical', Quantity(value=-2050677.8384977598, unit=kilojoule/mole))
('initial pos correction', Quantity(value=-32.131108831614256, unit=kilojoule/mole))
('final pos correction', Quantity(value=-811361.75, unit=kilojoule/mole))
('total correction', Quantity(value=-811393.8811088316, unit=kilojoule/mole))

@jchodera
Copy link
Member

jchodera commented Jun 1, 2017

('final energies:', 'normal', Quantity(value=1324996177395145.2, unit=kilojoule/mole),
'alchemical', Quantity(value=1324996178206507.0, unit=kilojoule/mole))

The final alchemical energies are really large. Can you store the potential energy after each timestep during your BLUES protocol and plot that? I'm wondering where the energy accumulation occurs.

Where are you placing the Hamiltonian update in your scheme?

Our more recent tests have not noted any major difference in the stability with different numbers of geodesic steps, so it may be simplest to stick with standard BAOAB (V R O H O R V or H V R O R V H).

@sgill2
Copy link
Author

sgill2 commented Jun 1, 2017

The final alchemical energies are really large. Can you store the potential energy after each timestep during your BLUES protocol and plot that? I'm wondering where the energy accumulation occurs.

I'll get on that now!

Where are you placing the Hamiltonian update in your scheme?

Could you clarify for me what you mean by scheme here? I'm not exactly sure what you're referring to, but if you were referring to the splitting I used V R O H O R V .

@jchodera
Copy link
Member

jchodera commented Jun 1, 2017

It might be useful to also try H V R O R V H

@sgill2
Copy link
Author

sgill2 commented Jun 1, 2017

It might be useful to also try H V R O R V H

Okay, I'll go ahead and test out that splitting while I'm rerunning things too.

@sgill2
Copy link
Author

sgill2 commented Jun 6, 2017

Here's a plot of the potential energy as a function of the NCMC steps for an accepted move that results in a NaN. The graph is on a log scale and offset by a constant value (to keep the energies positive so that they can be log scaled). For reference below the potential energy graph is a plot of the lambda sterics/electrostatics at the given steps.
screen shot 2017-06-06 at 2 39 05 pm

And this next plot is a closer look at the energies around the point where they blow up:
screen shot 2017-06-06 at 2 18 27 pm

We still see this happen for the H V R O R V H case with the alchemical correction factor, which we have defined as

correction = [Alch_energy(i) - Norm_energy(i)] + [Norm_energy(f) - Alch_energy(f)],
where Alch_energy is the energy of the alchemically transformed system, and Norm_energy is the energy of the normal system. i and f correspond to the initial and final positions respectively.

@jchodera
Copy link
Member

jchodera commented Jun 7, 2017

I'm still not sure what the alchemical correction factor has anything to do with this. If the energy is exploding, why would the constant added from your correction have anything to do with that?

Aren't 100 steps a very, very small number for an NCMC switching protocol? Our protocols for just changing a single proton are typically ~40 ps in length.

@davidlmobley
Copy link

@jchodera - Sam can comment further (I should be sleeping) but I can make two comments:

  1. Now that we've got the ordering of steps right in the integrator, we no longer see the system explode EXCEPT for cases where an extremely large alchemical correction factor results in accepting moves which should not be accepted. In other words, if we ran without the alchemical correction factor (not that this makes any sense to do) the simulations would not explode.

  2. Regarding 100 steps -- this is actually working quite well for us with the older integrator, yielding reasonable acceptance and without crashes. It's when we switch to the nonequilibrium langevin integrator we start running into issues. Maybe we need more steps there, but with the older integrator the shadow work contribution meant acceptance would fall off as we increased the number of steps. @sgill2 can comment further.

@jchodera
Copy link
Member

jchodera commented Jun 7, 2017

Now that we've got the ordering of steps right in the integrator, we no longer see the system explode EXCEPT for cases where an extremely large alchemical correction factor results in accepting moves which should not be accepted. In other words, if we ran without the alchemical correction factor (not that this makes any sense to do) the simulations would not explode.

That's odd. You're saying that the alchemical correction factor can be highly favorable in some cases that you wouldn't expect?

Could there be a sign error in your use of it? If you compute

correction = [Alch_energy(i) - Norm_energy(i)] + [Norm_energy(f) - Alch_energy(f)]

then your log_P_accept should include -correct, presuming your energies are expressed in units of kT. Note that we recently changed the units with which work is accumulated: You should use get_protocol_work() to retrieve the unit-bearing work.

Regarding 100 steps -- this is actually working quite well for us with the older integrator, yielding reasonable acceptance and without crashes.

There's no way that could have been correct. 100 steps is far, far too short for an alchemical insertion of even a proton.

@jchodera
Copy link
Member

jchodera commented Jun 7, 2017

with the older integrator the shadow work contribution meant acceptance would fall off as we increased the number of steps.

Do you have a plot of acceptance rate vs number of NCMC steps?

@davidlmobley
Copy link

davidlmobley commented Jun 7, 2017

@jchodera

There's no way that could have been correct. 100 steps is far, far too short for an alchemical insertion of even a proton.

Actually, it seems to me that this could be significantly easier than a proton. Specifically, what we're looking at here is flipping toluene between orientations that (at least for successful moves) could have been docked into successfully. It's nonpolar, so we're dealing with relatively weak interactions. Even STANDARD MC without any relaxation at all can sometimes yield moves which are accepted for these types of moves (though much more rarely). It seems highly plausible to me that even adding only a very small amount of relaxation -- such as 100 steps -- could dramatically increase acceptance, which is exactly what we've been seeing.

To me, insertion of a proton sound like a much more difficult transformation as there, you're dealing with changing the formal charge, so you have very strong long-range electrostatic interactions, and relaxation will need to be much more significant, whereas in this particular case, protein relaxation is almost minimal.

@sgill2 will have to respond on the other issues.

That's odd. You're saying that the alchemical correction factor can be highly favorable in some cases that you wouldn't expect?

Yes.

@jchodera
Copy link
Member

jchodera commented Jun 7, 2017

I'm still dubious that 100 steps would help much. I'd love to see the average acceptance vs number of steps for even just the alchemically-modified system, omitting the correction factors that are causing trouble.

There must be something wrong with the correction factor if it is leading toward higher acceptance. Since there's a slight difference between fully-interacting and alchemically-modified systems, I would presume it would, on average, bias you toward rejection, not acceptance. The big differences in energies, as we noted earlier, appear only in cases where the energies are hugely unfavorable, but that indicates something else is going on when the insertion is nearly complete.

Did you recently switch over to openmmtools.alchemy too? Could something odd be going on there?

@sgill2
Copy link
Author

sgill2 commented Jun 7, 2017

Just to clarify a few things and give updates on what I'm running right now:

then your log_P_accept should include -correct, presuming your energies are expressed in units of kT. Note that we recently changed the units with which work is accumulated: You should use get_protocol_work() to retrieve the unit-bearing work.

In the actual calculation I have the energies in terms of kJ/mol, so I multiply by -1/kT to get the correction in appropriate of so in the code it looks like
correction = -([Alch_energy(i) - Norm_energy(i)] + [Norm_energy(f) - Alch_energy(f)]) / kT

Regarding 100 steps -- this is actually working quite well for us with the older integrator, yielding reasonable acceptance and without crashes. It's when we switch to the nonequilibrium langevin integrator we start running into issues. Maybe we need more steps there, but with the older integrator the shadow work contribution meant acceptance would fall off as we increased the number of steps. @sgill2 can comment further.

Using the NCMCVVAlchemicalIntegrator from perses the acceptance ratio was pretty reasonable (~20%) at about the 200 NCMC step range. I don't think it's entirely unreasonable to expect a low number of NCMC steps in this case since there's space in the binding pocket for the ligand already; NCMC just helps resolve minor clashes brought by the random rotation.

The number of steps/acceptance ratio is definitely an important point to consider, and I don't disagree that the 100 steps in this case may not be ideal, but I think it's drawing us away from the issue at hand, which at this point seems to be figuring out why the correction factor is leading to these obviously unfavorable NCMC moves being accepted and how to address that. In general though, the number of ncmc steps should vary greatly depending on the protocol and system.

In the NCMCVVAlchemicalIntegrator case when the system is exploding the shadow work also increases at a similar magnitude–and a greater magnitude then the correction factor–which prevents those cases from being accepted.

Did you recently switch over to openmmtools.alchemy too? Could something odd be going on there?

We have actually recently switched to using alchemy from openmmtools. Currently I'm running the different integrators with the two different alchemy modules to see if I can spot any differences in their behavior.

@sgill2
Copy link
Author

sgill2 commented Jun 7, 2017

Tangentially, how much wallclock time does it take for to run those 20ps of NCMC in the proton insertion case? I'd love to explore longer relaxation timescales but running the alchemical integrators is quite slow so running anything past 1000 steps repeatedly doesn't seem very feasible to me currently.

@sgill2
Copy link
Author

sgill2 commented Jun 7, 2017

Also somewhat tangentially I'll be doing some benchmarking runs on T4/lysozyme toluene to see the acceptance as a function of NCMC steps so we can use those as a reference (and for paper-related things).

@jchodera
Copy link
Member

jchodera commented Jun 7, 2017

In the actual calculation I have the energies in terms of kJ/mol, so I multiply by -1/kT to get the correction in appropriate of so in the code it looks like
correction = -([Alch_energy(i) - Norm_energy(i)] + [Norm_energy(f) - Alch_energy(f)]) / kT

That looks correct to me, provided you're careful about how you get all those quantities with the correct units. Can you point me to where you compute Alch_energy and Norm_energy in your code?

Using theNCMCVVAlchemicalIntegrator from perses the acceptance ratio was pretty reasonable (~20%) at about the 200 NCMC step range. I don't think it's entirely unreasonable to expect a low number of NCMC steps in this case since there's space in the binding pocket for the ligand already; NCMC just helps resolve minor clashes brought by the random rotation.

I'm still dubious here, but I've been surprised before.

The number of steps/acceptance ratio is definitely an important point to consider, and I don't disagree that the 100 steps in this case may not be ideal, but I think it's drawing us away from the issue at hand, which at this point seems to be figuring out why the correction factor is leading to these obviously unfavorable NCMC moves being accepted and how to address that. In general though, the number of ncmc steps should vary greatly depending on the protocol and system.

I think it's premature to even worry about the correction factor since your plot is showing that the work accumulated for the purely alchemical system is blowing up. Do you have a bunch of work trace you could provide data for or plot? I think we need to figure out what the work distribution after 100 steps looks like and make sure that there is a good fraction of these work values less than a few kT before we tackle what's wrong with the correction factor. The correction factor should be very small.

Tangentially, how much wallclock time does it take for to run those 20ps of NCMC in the proton insertion case? I'd love to explore longer relaxation timescales but running the alchemical integrators is quite slow so running anything past 1000 steps repeatedly doesn't seem very feasible to me currently.

The alchemical integrator shouldn't be much slower than standard LangevinIntegrator dynamics. On very small systems and very fast GPUs, the integrator steps become a bit of a bottleneck, but I'm not sure if you're in this regime.

running the alchemical integrators is quite slow so running anything past 1000 steps repeatedly doesn't seem very feasible to me currently.

What do you mean by "quite slow" here? I think our 20 ps switches took a few minutes.

@sgill2
Copy link
Author

sgill2 commented Jun 8, 2017

That looks correct to me, provided you're careful about how you get all those quantities with the correct units. Can you point me to where you compute Alch_energy and Norm_energy in your code?

You'll have to jump around a bit in the code, so let me contextualize the code references:
We get the context and state for the normal system and alchemical system at the start of the nc move, which are the states with the initial positions
https://github.com/MobleyLab/blues/blob/master/blues/simulation.py#L125-L126
Then we run the ncmc portion, get the positions at the end and find the energy of the normal system using these positions.
https://github.com/MobleyLab/blues/blob/master/blues/simulation.py#L268-L269
https://github.com/MobleyLab/blues/blob/master/blues/simulation.py#L206
(the alch_sim is actually a normal system, while the nc_sim system is alchemically transformed)
We then get the energies by calling getPotentialEnergy on each of those states
https://github.com/MobleyLab/blues/blob/master/blues/simulation.py#L209

I think it's premature to even worry about the correction factor since your plot is showing that the work accumulated for the purely alchemical system is blowing up. Do you have a bunch of work trace you could provide data for or plot? I think we need to figure out what the work distribution after 100 steps looks like and make sure that there is a good fraction of these work values less than a few kT before we tackle what's wrong with the correction factor. The correction factor should be very small.

I don't have those handy, but I'll try to construct them when I'm running my benchmarking runs.

The alchemical integrator shouldn't be much slower than standard LangevinIntegrator dynamics. On very small systems and very fast GPUs, the integrator steps become a bit of a bottleneck, but I'm not sure if you're in this regime.

I did a simple benchmarking on my lysozyme/toluene system with different integrators, running on a GPU. The OpenMM LangevinIntegrator took about 10 seconds to run 10000 steps, and the openmmtools LangevinIntegrator took about 20 seconds to run the same amount of steps. However the AlchemicalNonequilibriumLangevinIntegrator took about 9 minutes to run the same number of steps with the toluene ligand atoms being alchemically perturbed (with sterics/electrostatics).

@jchodera
Copy link
Member

jchodera commented Jun 8, 2017

This is getting a little complicated since I'm not sure at which time you pulled code from perses into blues. The perses code was certainly not guaranteed to be bug-free, and you don't seem to have pulled in tests for the components you pulled in from perses alongside complicated classes like NCMCVVAlchemicalIntegrator, so it's not easy for me to tell if your versions of these components are working correctly.

I'd really recommend trying to use as much of the openmmtools infrastructure as possible, which at least has an extensive set of component tests. If you need to pull in classes like NCMCVVAlchemicalIntegrator instead of using them from perses, could you also pull the testing infrastructure as well so we can tell if your diverged version still works? @pgrinaway may be able to help debug your version in detail, but it's up to him where this falls on his priority list.

I did a simple benchmarking on my lysozyme/toluene system with different integrators, running on a GPU. The OpenMM LangevinIntegrator took about 10 seconds to run 10000 steps, and the openmmtools LangevinIntegrator took about 20 seconds to run the same amount of steps. However the AlchemicalNonequilibriumLangevinIntegrator took about 9 minutes to run the same number of steps with the toluene ligand atoms being alchemically perturbed (with sterics/electrostatics).

That's very surprising. Do you have a minimal test script you could post? If there's only a few alchemically-modified atoms, it really shouldn't be so slow---we can certainly address that.

@jchodera
Copy link
Member

jchodera commented Jun 8, 2017

I'll definitely read through the sections of the code above that you highlight on my flight today, @sgill2. I just wanted to set expectations because there's a lot of code there and I might not be able to spot the issues easily because of the subtle divergence with perses and lack of component tests.

@sgill2
Copy link
Author

sgill2 commented Jun 8, 2017

I'd really recommend trying to use as much of the openmmtools infrastructure as possible, which at least has an extensive set of component tests. If you need to pull in classes like NCMCVVAlchemicalIntegrator instead of using them from perses, could you also pull the testing infrastructure as well so we can tell if your diverged version still works?

Sounds good. I'll see about getting the tests moved over, or just use the integrator from perses directly. On that note I'll add testing VV with the AlchemicalNonequilibriumLangevinIntegrator on the system on my todo list as well.

I'll definitely read through the sections of the code above that you highlight on my flight today, @sgill2. I just wanted to set expectations because there's a lot of code there and I might not be able to spot the issues easily because of the subtle divergence with perses and lack of component tests.

I definitely appreciate any of your time spent on this and duly noted regarding the expectations bit.

@sgill2
Copy link
Author

sgill2 commented Jun 8, 2017

That's very surprising. Do you have a minimal test script you could post? If there's only a few alchemically-modified atoms, it really shouldn't be so slow---we can certainly address that.

Here's a minimal example comparing the integration schemes on a T4-lysozyme/toluene explicit solvent system over 10,000 steps.
benchmark.tar.gz
openmm_int uses the OpenMM LangevinIntegrator.
tools_int uses the openmmtools LangevinIntegrator.
alch_int tests the AlchemicalNonequilibriumLangevinIntegrator with toluene alchemically modified.

And I just wanted to say, thanks for all the help everyone!

@jchodera
Copy link
Member

jchodera commented Jun 8, 2017

Something funny is going on in alch_int. Here's the per-step timing:

      88 : -319404.426 kJ/mol :   59.158 ms
      89 : -319200.621 kJ/mol :   47.861 ms
      90 : -318938.130 kJ/mol :   47.387 ms
      91 : -318754.851 kJ/mol :   52.418 ms
      92 : -318686.440 kJ/mol :   60.418 ms
      93 : -318666.708 kJ/mol :   58.989 ms
      94 : -318669.038 kJ/mol :   57.757 ms
      95 : -318706.835 kJ/mol :   60.661 ms
      96 : -318721.925 kJ/mol :   55.355 ms
      97 : -318627.344 kJ/mol :   63.451 ms
      98 : -318444.256 kJ/mol :   55.624 ms
      99 : -318338.366 kJ/mol :   42.139 ms
     100 : -318415.482 kJ/mol : 1073.012 ms
     101 : -318587.143 kJ/mol : 1136.083 ms
     102 : -318645.321 kJ/mol : 1063.601 ms
     103 : -318514.293 kJ/mol : 1038.306 ms
     104 : -318317.484 kJ/mol : 1134.827 ms
     105 : -318206.944 kJ/mol :  689.704 ms
     106 : -318181.438 kJ/mol : 1065.977 ms
     107 : -318124.754 kJ/mol : 1104.001 ms
     108 : -317989.689 kJ/mol : 1140.472 ms
     109 : -317840.785 kJ/mol : 1038.809 ms
     110 : -317745.965 kJ/mol :  933.706 ms
     111 : -317694.162 kJ/mol : 1028.113 ms
     112 : -317669.189 kJ/mol :  693.051 ms

For some reason, the step time grows by 20x around 100 steps in (where I've reduced the protocol length to 500 steps).

Thanks for this! We'll take a look at what's going on here.

@jchodera
Copy link
Member

jchodera commented Jun 9, 2017

The slowdown seems to occur when lambda_sterics is altered from 1:

      97 : -318756.389 kJ/mol :    5.841 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.02000000
      98 : -318606.840 kJ/mol :    6.094 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.01000000
      99 : -318472.686 kJ/mol :    6.364 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.00000000
     100 : -318510.135 kJ/mol :  618.244 ms : lambda_sterics   0.99333333 lambda_electrostatics   0.00000000
     101 : -318685.078 kJ/mol :  617.969 ms : lambda_sterics   0.98666667 lambda_electrostatics   0.00000000
     102 : -318760.011 kJ/mol :  621.370 ms : lambda_sterics   0.98000000 lambda_electrostatics   0.00000000
     103 : -318610.319 kJ/mol :  616.498 ms : lambda_sterics   0.97333333 lambda_electrostatics   0.00000000

@jchodera
Copy link
Member

jchodera commented Jun 9, 2017

The issue was the long-range correction. Every time lambda_sterics is changed, it has to be recomputed. Since this happens every timestep once lambda_sterics starts being modified, we see something like a 100x slowdown.

Tagging @pgrinaway @bas-rustenburg @Lnaden @gregoryross, since this issue may be relevant to many of our other projects.

To avoid this, we should add some options to openmmtools.alchemy.AbsoluteAlchemicalFactory that disable the long-range dispersion correction from CustomNonbondedForce for cases where nonequilibrium switching is to be used.

Thanks again for bringing the slowdown to our attention and providing a simple script to reproduce this, @sgill2!

@davidlmobley
Copy link

@jchodera - thanks so much for digging into that. Glad to hear there will be a simple fix! I'd been worried about the slowdown long-term, though the improvement in efficiency we get incorporating MC moves made it still worthwhile. But having it NOT slow down will obviously be way better. :)

Let us know if you have any thoughts on our bigger problem (blowing up/alchemical corrections) when you get a chance to look at the code, otherwise @sgill2 will get back to you once he's run those additional tests.

@jchodera
Copy link
Member

jchodera commented Jun 9, 2017

The bigger problem is harder to debug because it relies on a very complex codebase that was branched from our but no longer contains the requisite component unit tests. This is probably going to be a hard-fought lesson in software engineering best (and worst) practices, unfortunately.

I would suggest we try to migrate useful common components between perses and blues into openmmtools along with their unit tests. Ideally, blues would eventually be encapsulated as an openmmtools.mcmc.MCMCMove subclass so users can mix-and-match move sets with other kinds of ergodically sampling, but for now, at least making sure we can rule out where the issues might be by reusing common tested components will help narrow in on the real problem.

@jchodera
Copy link
Member

jchodera commented Jun 9, 2017

We've got a fix for the speed issue in the works:
#218

When I construct AbsoluteAlchemicalFactory with disable_alchemical_dispersion_correction=True, the slowdown is eliminated:

      95 : -319001.077 kJ/mol :    9.914 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.04000000
      96 : -319005.766 kJ/mol :    9.883 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.03000000
      97 : -319047.705 kJ/mol :    9.726 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.02000000
      98 : -319026.067 kJ/mol :   11.732 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.01000000
      99 : -318967.641 kJ/mol :    9.736 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.00000000
     100 : -318978.847 kJ/mol :    9.766 ms : lambda_sterics   0.99333333 lambda_electrostatics   0.00000000
     101 : -319043.797 kJ/mol :    9.910 ms : lambda_sterics   0.98666667 lambda_electrostatics   0.00000000
     102 : -319017.829 kJ/mol :   10.063 ms : lambda_sterics   0.98000000 lambda_electrostatics   0.00000000
     103 : -318845.324 kJ/mol :    9.762 ms : lambda_sterics   0.97333333 lambda_electrostatics   0.00000000
     104 : -318664.833 kJ/mol :   11.491 ms : lambda_sterics   0.96666667 lambda_electrostatics   0.00000000
     105 : -318611.969 kJ/mol :    9.534 ms : lambda_sterics   0.96000000 lambda_electrostatics   0.00000000
     106 : -318625.666 kJ/mol :    9.543 ms : lambda_sterics   0.95333333 lambda_electrostatics   0.00000000

@sgill2
Copy link
Author

sgill2 commented Jun 20, 2017

I've been looking to use Velocity Verlet (VV) with the AlchemicalNonequilibriumLangevinIntegrator to replicate the old Perses integrator functionality, but it seems like I have the incorrect splitting. I was trying with H V R V H, (which was previously recommended) but it seems to be missing some necessary O steps; when I run that splitting I get the following error:

  File "/Users/samgill/anaconda/lib/python2.7/site-packages/openmmtools/integrators.py", line 1655, in __init__
    super(AlchemicalNonequilibriumLangevinIntegrator, self).__init__(*args, **kwargs)
  File "/Users/samgill/anaconda/lib/python2.7/site-packages/openmmtools/integrators.py", line 1474, in __init__
    super(NonequilibriumLangevinIntegrator, self).__init__(*args, **kwargs)
  File "/Users/samgill/anaconda/lib/python2.7/site-packages/openmmtools/integrators.py", line 998, in __init__
    ORV_counts, mts, force_group_nV = self._parse_splitting_string(splitting)
  File "/Users/samgill/anaconda/lib/python2.7/site-packages/openmmtools/integrators.py", line 1386, in _parse_splitting_string
    self._sanity_check(splitting_string)
  File "/Users/samgill/anaconda/lib/python2.7/site-packages/openmmtools/integrators.py", line 1254, in _sanity_check
    assert ("O" in splitting_no_space)
AssertionError

Could someone point me to the proper splitting for VV with the alchemical integrator?

On that note, it might be helpful to include more alchemical splitting schemes in the AlchemicalNonequilibriumLangevinIntegrator docstring examples so people can easily refer to that for their splitting questions :)

@jchodera
Copy link
Member

@pgrinaway and @maxentile : Can you handle this? Should we allow non-Langevin dynamics to be covered by LangevinIntegrator, or do we want to add a separate AlchemicalVelocityVerletIntegrator?

@sgill2 : Are you still getting very different behavior than before? Is this primarily for debugging purposes? Last I heard, we had solved the slowdown problem (use the latest openmmtools and use disable_alchemical_dispersion_correction=True with AbsoluteAlchemicalFactory) but were still waiting to hear from you about energy traces during the NCMC switching. We suspect that your simulation is actually blowing up during NCMC.

@sgill2
Copy link
Author

sgill2 commented Jun 20, 2017

Are you still getting very different behavior than before? Is this primarily for debugging purposes? Last I heard, we had solved the slowdown problem (use the latest openmmtools and use disable_alchemical_dispersion_correction=True with AbsoluteAlchemicalFactory) but were still waiting to hear from you about energy traces during the NCMC switching. We suspect that your simulation is actually blowing up during NCMC.

@jchodera I've mostly been working paper writing, so I haven't had a chance to run those energy trace tests yet. I was looking to move towards replacing the perses integrator with the one from openmmtools integrators, as per your suggestion, but it's probably better for me to run those energy traces now instead. I'll let you folks know when they're done.

@maxentile
Copy link
Member

Should we allow non-Langevin dynamics to be covered by LangevinIntegrator, or do we want to add a separate AlchemicalVelocityVerletIntegrator?

I think it makes sense for this to be separately named (like AlchemicalVelocityVerletIntegrator).

Another possibility is to add an optional argument override_sanity_checks=False, allowing users to specify non-Langevin dynamics like H V R V H...

@jchodera
Copy link
Member

What about LangevinIntegrator instead inheriting from a base TrotterSplittingIntegrator that would allow velocity Verlet-style integrators?

Alternatively, I like the idea of an override_sanity_checks=False (or maybe sanity_checks=True)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants