-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathequilibration-burn-in.tex
43 lines (33 loc) · 6 KB
/
equilibration-burn-in.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
% !TEX root = ./main.tex
\section{Determining and removing an equilibration or `burn-in' portion of a trajectory}
\label{sec:equil}
\begin{figure}
\centering
\includegraphics[width=0.9\linewidth]{figures/tequil-time-trace}
\caption{
\label{fig:tequil}
The equilibration and production segments of a trajectory.
``Equilibration'' over the time $t_{\mathrm{equil}}$ represents transient behavior while the initial configuration relaxes toward configurations more representative of the equilibrium ensemble.
Readers are encouraged to select $t_{\mathrm{equil}}$ in a systematic way based on published literature.
If you find strong sensitivity of ``production'' data to the choice of $t_{\mathrm{equil}}$, this suggests additional sampling is required.
}
\end{figure}
The ``equilibration'' or ``burn-in'' time $t_{\mathrm{equil}}$ represents the initial part of a single continuous trajectory (whether from MD or MC) that is \emph{discarded} for purposes of data analysis of \emph{equilibrium or steady-state properties};
the remaining trajectory data are often called ``production'' data.
See Fig.\ \ref{fig:tequil}.
Discarding data may seem counterproductive, but there is no reason to expect that the initial configurations of a trajectory will be important in the ensemble ultimately obtained.
Including early-time data, therefore, can systematically \emph{bias} results.
To illustrate these points, consider the process of relaxing an initial, crystalline configuration of a protein to its amorphous counterpart in an aqueous environment.
While the initial structure might seem to be intrinsically valuable, remember that configurations representative of the crystal structure may never appear in an aqueous system.
As a result, the initial structure may be subject to unphysical forces and/or transitions that provide useless, if not misleading information about the system behavior.\footnote{In MD modeling of structural polymers (e.g., thermoset polymers), the problem of unphysical forces can be so severe that simulations become numerically unstable and crash. This frequently manifests as systems that explode and/or tear themselves apart. As a result, relaxation is often performed using Monte Carlo moves that minimize energy without reference to velocities and forces.}
Note that relaxation/equilibration should be viewed as a means to an end: for equilibrium sampling, we only care that the relaxed state is representative of {\it any} local energy minimum that the system might sample, not how we arrived at that state, which is ultimately why data generated during equilibration can be discarded.
The RMSD trace in Fig. \ref{f:rmsd} illustrates typical behavior of a system undergoing relaxation. Note the very rapid RMSD increase in the first $\approx$ 200 ns. Part of this increase is simply entropic: the volume of phase space within 1 {\AA} of a protein structure is extremely small, so that the process of thermalizing rapidly increases the RMSD from the starting structure, \emph{regardless of how favorable or representative that structure is}. Thus, examining that initial rapid increase is not helpful in determining an equilibration time. However, in this case, the RMSD continues to increase past 3 {\AA}, which is larger than the amplitude of simple thermal fluctuations (shown by Fig.\ \ref{f:rmsd}B), indicating an initial drift to a new structure, followed by sampling.
Accepting that some data should be discarded, it is not hard to see that we want to avoid discarding too much data, given that many systems of interest are extremely expensive to simulate. In statistical terms, we want to remove bias but also minimize uncertainty (variance) through adequate sampling. Before addressing this problem, however, we emphasize that the very notion of separating a trajectory into equilibration and production segments only makes sense if the system has indeed reached configurations important in the equilibrium ensemble. While it is generally impossible to guarantee this has occurred, some easy checks for determining that this has not occurred are described in Sec.\ \ref{sec:quick}. \emph{It is essential to perform those basic checks before analyzing data with a more sophisticated approach} that may assume a trajectory has a substantial amount of true equilibrium sampling.
A robust approach to determining the equilibration time is discussed in \cite{Chodera-2016}, which generalizes the notion of reverse cumulative averaging~\cite{Yang2004} to observables that do not necessarily have Gaussian distributions.
The key idea is to analyze time-series data considering the effect of discarding various trial values of the initial equilibration interval, $t_{\mathrm{equil}}$ (Fig.\ \ref{fig:tequil}), and selecting the value that maximizes the effective number of uncorrelated samples of the remaining production region.
This effective sample size is estimated from the number of samples in the production region divided by the number of temporally correlated samples required to produce one effectively uncorrelated sample, based on an auto-correlation analysis.
At sufficiently large $t_{\mathrm{equil}}$, the majority of the initial relaxation transient is excluded, and the method selects the largest production region for which correlation times remain short to maximize the number of uncorrelated samples.
Care must be taken in the case that the simulation is insufficiently long to sample many transitions among kinetically metastable states, however, or else this approach can simply result in restricting the production region to the last sampled metastable basin.
A simpler qualitative analysis based on comparing forward and reverse estimates of observables \cite{Klimovich2015} may also be helpful.
Readers may want to compare auto-correlation times for individual observables to the global ``decorrelation time'' \cite{Lyman2007a} described in Sec.\ \ref{sec:global}.
As another general check, if values of observables estimated from the production phase depend sensitively on the choice of $t_{\mathrm{equil}}$, it is likely that further sampling is required.