Computing Molecular Properties from MD Simulations

5.3.3. Computing Molecular Properties from MD Simulations#

These notes introduce a unifying way to think about molecular properties computed from molecular dynamics (MD) simulations:

Start from the ensemble average of an observable.
Rewrite that average as a weighted integral over configurations.
Approximate the integral using Monte Carlo / sampling ideas.
Apply the same framework to specific properties such as:
- radial distribution function \(g(r)\),
- radius of gyration \(R_g\),
- root-mean-square deviation (RMSD),
- mean-squared displacement (MSD) and diffusion.

The key message is that most properties computed from MD are just expectation values of suitable functions of the molecular coordinates.

5.3.3.1. Learning goals#

By the end of these notes, you should be able to:

write a molecular property as an expectation value,
explain how an MD trajectory approximates an ensemble average,
derive the histogram estimator behind \(g(r)\),
compute \(R_g\), RMSD, and MSD from Cartesian coordinates,
recognize practical issues such as finite sampling, equilibration, and correlated data.

5.3.3.2. General theory: expectation values#

Let \(x\) denote a molecular configuration. In classical statistical mechanics, \(x\) typically contains all particle positions and momenta.
An observable \(A(x)\) is any quantity that can be evaluated for a given configuration.

The equilibrium expectation value is

(5.125)#\[\begin{equation} \langle A \rangle = \int A(x)\, p(x)\, dx, \end{equation}\]

where \(p(x)\) is the normalized probability density over phase space.

For many structural observables, the momenta can be integrated out, so we work only with coordinates \(\mathbf{R}\):

(5.126)#\[\begin{equation} \langle A \rangle = \int A(\mathbf{R})\, p(\mathbf{R})\, d\mathbf{R}. \end{equation}\]

5.3.3.2.1. Canonical ensemble#

In the \(NVT\) ensemble,

(5.127)#\[\begin{equation} p(\mathbf{R}) = \frac{e^{-\beta U(\mathbf{R})}}{Z_\mathrm{conf}}, \qquad \beta = \frac{1}{k_B T}, \end{equation}\]

with configurational partition function

(5.128)#\[\begin{equation} Z_\mathrm{conf} = \int e^{-\beta U(\mathbf{R})}\, d\mathbf{R}. \end{equation}\]

Therefore,

(5.129)#\[\begin{equation} \langle A \rangle = \frac{\int A(\mathbf{R}) e^{-\beta U(\mathbf{R})}\, d\mathbf{R}} {\int e^{-\beta U(\mathbf{R})}\, d\mathbf{R}}. \end{equation}\]

This is a weighted average, where lower-energy configurations receive larger Boltzmann weights.

5.3.3.3. From weighted integrals to Monte Carlo sampling#

Suppose we draw configurations \(\mathbf{R}^{(1)}, \mathbf{R}^{(2)}, \dots, \mathbf{R}^{(N)}\) from the target distribution \(p(\mathbf{R})\). Then the expectation value can be approximated by the sample mean

(5.130)#\[\begin{equation} \langle A \rangle \approx \frac{1}{N}\sum_{n=1}^{N} A\!\left(\mathbf{R}^{(n)}\right). \end{equation}\]

This is the basic Monte Carlo estimator.

5.3.3.3.1. Why this applies to MD#

An MD trajectory is not generated by independent random draws, but under equilibrium conditions and adequate sampling, the trajectory visits configurations with the correct ensemble probabilities.
Then the time average along the trajectory approximates the ensemble average:

(5.131)#\[\begin{equation} \langle A \rangle \approx \frac{1}{N}\sum_{n=1}^{N} A_n, \end{equation}\]

where \(A_n = A(\mathbf{R}^{(n)})\) is the observable evaluated on frame \(n\).

5.3.3.3.2. Important caveat: correlated samples#

Consecutive MD frames are typically correlated.
That means the estimator is still reasonable, but the effective number of independent samples is smaller than the total number of saved frames.

In practice, this means:

discard unequilibrated initial frames,
save frames far enough apart if storage is a concern,
estimate uncertainties with block averaging or autocorrelation analysis.

5.3.3.4. A discrete weighted-average viewpoint#

Sometimes it is useful to imagine configuration space partitioned into bins or microstates labeled by \(i\) with probabilities \(p_i\). Then

(5.132)#\[\begin{equation} \langle A \rangle = \sum_i A_i p_i. \end{equation}\]

If we sample from the distribution, and state \(i\) appears \(n_i\) times in \(N\) samples, then

(5.133)#\[\begin{equation} p_i \approx \frac{n_i}{N}, \end{equation}\]

so

(5.134)#\[\begin{equation} \langle A \rangle \approx \sum_i A_i \frac{n_i}{N} = \frac{1}{N}\sum_{n=1}^{N} A_n. \end{equation}\]

This is often the most intuitive bridge between statistical mechanics and trajectory analysis.

5.3.3.5. Generic workflow for computing a property from MD#

For almost any structural property:

Choose an observable \(A(\mathbf{R})\).
Evaluate it on each trajectory frame.
Average over frames.
If needed, normalize appropriately.
Estimate uncertainty.

Pseudocode:

values = []
for frame in trajectory:
    values.append(A(frame.coordinates))
estimate = np.mean(values)

The rest of these notes shows how this general idea becomes concrete for several important observables.

5.3.3.6. Example 1: radial distribution function \(g(r)\) for a pure liquid#

The radial distribution function describes how particle density varies as a function of distance from a reference particle.

For a homogeneous isotropic liquid,

(5.135)#\[\begin{equation} g(r) = \frac{1}{4\pi r^2 \rho} \left\langle \frac{1}{N} \sum_{i=1}^{N} \sum_{j\ne i} \delta\!\left(r-r_{ij}\right) \right\rangle, \end{equation}\]

where

\(r_{ij}\) is the distance between particles \(i\) and \(j\),
\(\rho = N/V\) is the number density.

5.3.3.6.1. Interpretation#

\(g(r) = 1\) means particles are distributed as in an ideal gas at the same density.
\(g(r) > 1\) means enhanced probability of finding neighbors at distance \(r\).
\(g(r) < 1\) means depleted probability.

5.3.3.6.2. Histogram form#

In practice, the delta function is replaced by a histogram bin of width \(\Delta r\):

(5.136)#\[\begin{equation} g(r_k) \approx \frac{\langle n_k \rangle} {N \, 4\pi r_k^2 \Delta r \, \rho}, \end{equation}\]

where \(\langle n_k \rangle\) is the average number of neighbors in shell \(k\).

So \(g(r)\) is an expectation value of a shell-counting observable, followed by normalization by the shell volume.

import numpy as np
import matplotlib.pyplot as plt

def rdf_from_coordinates(frames, box_length, r_max=None, n_bins=100):
    '''
    Compute a simple radial distribution function g(r) for a pure liquid.

    Parameters
    ----------
    frames : ndarray, shape (n_frames, n_atoms, 3)
        Cartesian coordinates.
    box_length : float
        Cubic box length.
    r_max : float or None
        Maximum radius. If None, uses box_length / 2.
    n_bins : int
        Number of histogram bins.

    Returns
    -------
    r : ndarray
        Bin centers.
    g_r : ndarray
        Radial distribution function.
    '''
    frames = np.asarray(frames)
    n_frames, n_atoms, _ = frames.shape
    volume = box_length**3
    rho = n_atoms / volume

    if r_max is None:
        r_max = box_length / 2.0

    edges = np.linspace(0.0, r_max, n_bins + 1)
    counts = np.zeros(n_bins, dtype=float)

    for frame in frames:
        for i in range(n_atoms - 1):
            rij = frame[i+1:] - frame[i]
            # minimum image convention for cubic box
            rij -= box_length * np.round(rij / box_length)
            dists = np.linalg.norm(rij, axis=1)
            hist, _ = np.histogram(dists, bins=edges)
            counts += 2.0 * hist  # count i-j and j-i

    dr = edges[1] - edges[0]
    r = 0.5 * (edges[:-1] + edges[1:])
    shell_volumes = 4.0 * np.pi * r**2 * dr

    # Average counts per frame and per particle
    avg_counts = counts / n_frames
    g_r = avg_counts / (n_atoms * rho * shell_volumes)

    return r, g_r

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import numpy as np
      2 import matplotlib.pyplot as plt
      4 def rdf_from_coordinates(frames, box_length, r_max=None, n_bins=100):

ModuleNotFoundError: No module named 'numpy'

5.3.3.6.3. Notes on \(g(r)\)#

For liquids, the first peak often corresponds to the first solvation shell.
Integrating \(4\pi r^2 \rho g(r)\) over a shell gives the coordination number.
In mixtures, one often computes partial RDFs such as \(g_{AB}(r)\).
Proper treatment of periodic boundary conditions is essential.

5.3.3.7. Example 2: radius of gyration \(R_g\)#

The radius of gyration measures how spread out a set of atoms is around its center of mass.

For atoms with masses \(m_i\) at positions \(\mathbf{r}_i\),

(5.137)#\[\begin{equation} R_g^2 = \frac{\sum_i m_i \left| \mathbf{r}_i - \mathbf{r}_{\mathrm{COM}} \right|^2} {\sum_i m_i}, \end{equation}\]

where

(5.138)#\[\begin{equation} \mathbf{r}_{\mathrm{COM}} = \frac{\sum_i m_i \mathbf{r}_i}{\sum_i m_i}. \end{equation}\]

Thus,

(5.139)#\[\begin{equation} R_g = \sqrt{R_g^2}. \end{equation}\]

5.3.3.7.1. Why it matters#

\(R_g\) is widely used for:

proteins and intrinsically disordered proteins,
polymers,
compactness changes during folding or binding,
comparing expanded and collapsed ensembles.

To compute the average compactness of a molecule from MD, evaluate \(R_g\) on each frame and average:

(5.140)#\[\begin{equation} \langle R_g \rangle \approx \frac{1}{N}\sum_{n=1}^{N} R_g^{(n)}. \end{equation}\]

def radius_of_gyration(coords, masses=None):
    '''
    Compute radius of gyration for a single frame.

    Parameters
    ----------
    coords : ndarray, shape (n_atoms, 3)
    masses : ndarray, shape (n_atoms,), optional
        If None, all atoms are weighted equally.
    '''
    coords = np.asarray(coords, dtype=float)
    n_atoms = coords.shape[0]

    if masses is None:
        masses = np.ones(n_atoms)
    masses = np.asarray(masses, dtype=float)

    total_mass = masses.sum()
    com = np.sum(coords * masses[:, None], axis=0) / total_mass
    rg2 = np.sum(masses * np.sum((coords - com)**2, axis=1)) / total_mass
    return np.sqrt(rg2)

def trajectory_rg(frames, masses=None):
    return np.array([radius_of_gyration(frame, masses=masses) for frame in frames])

5.3.3.8. Example 3: root-mean-square deviation (RMSD)#

RMSD measures similarity to a reference structure.

For a frame with coordinates \(\mathbf{r}_i\) and reference coordinates \(\mathbf{r}_i^\mathrm{ref}\),

(5.141)#\[\begin{equation} \mathrm{RMSD} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left| \mathbf{r}_i - \mathbf{r}_i^\mathrm{ref} \right|^2 }. \end{equation}\]

Usually, one first aligns the current structure to the reference to remove rigid-body translation and rotation.
Otherwise the RMSD would reflect overall motion rather than structural change.

5.3.3.8.1. What RMSD is useful for#

monitoring equilibration,
detecting conformational transitions,
comparing folded and unfolded states,
quantifying deviation from an experimental structure.

def kabsch_align(P, Q):
    '''
    Align P onto Q using the Kabsch algorithm.
    Returns aligned copy of P.
    '''
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)

    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    C = Pc.T @ Qc
    V, S, Wt = np.linalg.svd(C)
    d = np.sign(np.linalg.det(V @ Wt))
    D = np.diag([1.0, 1.0, d])
    U = V @ D @ Wt
    P_aligned = Pc @ U
    return P_aligned, Qc

def rmsd(P, Q, align=True):
    '''
    RMSD between coordinate sets P and Q.
    '''
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)

    if align:
        P_use, Q_use = kabsch_align(P, Q)
    else:
        P_use, Q_use = P, Q

    return np.sqrt(np.mean(np.sum((P_use - Q_use)**2, axis=1)))

def trajectory_rmsd(frames, reference, align=True):
    return np.array([rmsd(frame, reference, align=align) for frame in frames])

5.3.3.8.2. RMSD versus fluctuations#

A related quantity is the root-mean-square fluctuation (RMSF), which is computed per atom or per residue:

(5.142)#\[\begin{equation} \mathrm{RMSF}_i = \sqrt{ \left\langle \left| \mathbf{r}_i - \langle \mathbf{r}_i \rangle \right|^2 \right\rangle }. \end{equation}\]

RMSF tells us which parts of a molecule are flexible; RMSD tells us how far an entire structure is from a reference.

5.3.3.9. Example 4: mean-squared displacement (MSD) and diffusion#

A very useful dynamical observable is the mean-squared displacement:

(5.143)#\[\begin{equation} \mathrm{MSD}(t) = \left\langle \left| \mathbf{r}_i(t_0+t) - \mathbf{r}_i(t_0) \right|^2 \right\rangle_{i,t_0}. \end{equation}\]

The average is over particles and time origins.

For diffusive motion in three dimensions,

(5.144)#\[\begin{equation} \mathrm{MSD}(t) \approx 6Dt \end{equation}\]

at long times, where \(D\) is the self-diffusion coefficient.

5.3.3.9.1. Why include MSD?#

Unlike \(R_g\) and RMSD, MSD is explicitly time-dependent.
It is a good example of how the expectation-value idea extends naturally to time-correlation-type observables.

def mean_squared_displacement(frames):
    '''
    Compute a simple MSD(t) averaged over atoms and time origins.

    Parameters
    ----------
    frames : ndarray, shape (n_frames, n_atoms, 3)

    Returns
    -------
    lag : ndarray
        Lag times in frame units.
    msd : ndarray
        Mean-squared displacement.
    '''
    frames = np.asarray(frames, dtype=float)
    n_frames = frames.shape[0]
    msd = np.zeros(n_frames)

    for tau in range(n_frames):
        displacements = frames[tau:] - frames[:n_frames - tau]
        sq = np.sum(displacements**2, axis=2)  # over xyz
        msd[tau] = np.mean(sq)

    return np.arange(n_frames), msd

5.3.3.10. Other useful observables#

Here are several additional properties that fit the same framework.

5.3.3.10.1. End-to-end distance#

For a polymer or peptide,

(5.145)#\[\begin{equation} R_{ee} = \left| \mathbf{r}_{\mathrm{end}} - \mathbf{r}_{\mathrm{start}} \right|. \end{equation}\]

Useful for flexible chains and folding.

5.3.3.10.2. Potential energy#

If the trajectory stores \(U(\mathbf{R})\), then

(5.146)#\[\begin{equation} \langle U \rangle \approx \frac{1}{N}\sum_n U_n. \end{equation}\]

5.3.3.10.3. Angle or dihedral distributions#

Choose an observable such as \(\phi(\mathbf{R})\) and build a histogram over frames.
This gives a probability density for torsional motion.

5.3.3.10.4. Hydrogen-bond occupancy#

Define an indicator function

(5.147)#\[\begin{equation} h(\mathbf{R}) = \begin{cases} 1 & \text{if a hydrogen bond is present} \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

Then the average occupancy is simply

(5.148)#\[\begin{equation} \langle h \rangle \approx \frac{1}{N}\sum_n h_n. \end{equation}\]

This is a nice example where the property is literally an average of zeros and ones.

5.3.3.11. Uncertainty, equilibration, and convergence#

Even if the formula is simple, trustworthy estimates require care.

5.3.3.11.1. Equilibration#

Do not average over initial frames that are still relaxing toward equilibrium.

5.3.3.11.2. Correlation#

Saved frames are often correlated, especially for slow collective motions.

5.3.3.11.3. Convergence checks#

Useful diagnostics include:

running averages,
block averages,
comparing first and second halves of a trajectory,
repeating simulations from different initial conditions.

A property estimate is only as good as the region of configuration space the trajectory has actually sampled.

def block_average(data, block_size):
    '''
    Simple block averaging utility for correlated data.
    '''
    data = np.asarray(data, dtype=float)
    n_blocks = len(data) // block_size
    if n_blocks == 0:
        raise ValueError("block_size is too large for the dataset.")
    trimmed = data[:n_blocks * block_size]
    blocks = trimmed.reshape(n_blocks, block_size)
    block_means = blocks.mean(axis=1)
    mean = block_means.mean()
    stderr = block_means.std(ddof=1) / np.sqrt(n_blocks)
    return mean, stderr, block_means

5.3.3.12. A tiny synthetic example#

The next few cells build a toy trajectory and compute several observables.
This is not meant to represent a realistic molecular system; it is just a compact way to show the workflow.

# Create a toy trajectory: a noisy 10-atom chain over 500 frames
rng = np.random.default_rng(4)

n_frames = 500
n_atoms = 10

reference = np.zeros((n_atoms, 3))
reference[:, 0] = np.arange(n_atoms) * 1.5

frames = np.empty((n_frames, n_atoms, 3))
for t in range(n_frames):
    noise = 0.15 * rng.normal(size=(n_atoms, 3))
    collective = np.array([0.2 * np.sin(2 * np.pi * t / 100), 0.0, 0.0])
    frames[t] = reference + noise + collective

rg_values = trajectory_rg(frames)
rmsd_values = trajectory_rmsd(frames, reference, align=True)
lag, msd_values = mean_squared_displacement(frames)

print(f"Average Rg   = {rg_values.mean():.3f}")
print(f"Average RMSD = {rmsd_values.mean():.3f}")

plt.figure(figsize=(6, 4))
plt.plot(rg_values)
plt.xlabel("Frame")
plt.ylabel(r"$R_g$")
plt.title("Radius of gyration along the trajectory")
plt.tight_layout()
plt.show()

plt.figure(figsize=(6, 4))
plt.plot(rmsd_values)
plt.xlabel("Frame")
plt.ylabel("RMSD")
plt.title("RMSD to reference structure")
plt.tight_layout()
plt.show()

plt.figure(figsize=(6, 4))
plt.plot(lag, msd_values)
plt.xlabel("Lag time (frames)")
plt.ylabel("MSD")
plt.title("Mean-squared displacement")
plt.tight_layout()
plt.show()

5.3.3.13. Example using a real MD trajectory with MDAnalysis#

If you have a topology file and trajectory, a common workflow is to use MDAnalysis.

The cell below is written as a template.
It may require you to install MDAnalysis and adjust atom selections for your system.

# Uncomment if needed:
# !pip install MDAnalysis

# import MDAnalysis as mda
# from MDAnalysis.analysis import rms
# from MDAnalysis.analysis.rdf import InterRDF

# u = mda.Universe("topology.psf", "trajectory.dcd")

# Example: radius of gyration of a protein
# protein = u.select_atoms("protein")
# rg_values = []
# for ts in u.trajectory:
#     rg_values.append(protein.radius_of_gyration())

# Example: RMSD to the first frame
# R = rms.RMSD(u, u, select="protein and name CA", ref_frame=0)
# R.run()
# rmsd_values = R.results.rmsd[:, 2]

# Example: RDF between oxygen atoms in a pure liquid
# oxy = u.select_atoms("name O")
# rdf = InterRDF(oxy, oxy, nbins=100, range=(0.0, 10.0))
# rdf.run()
# r = rdf.results.bins
# g_r = rdf.results.rdf

5.3.3.14. Summary#

The central idea is simple:

(5.149)#\[\begin{equation} \langle A \rangle = \int A(\mathbf{R}) p(\mathbf{R})\, d\mathbf{R} \approx \frac{1}{N}\sum_{n=1}^{N} A(\mathbf{R}^{(n)}). \end{equation}\]

That is, a molecular property is an expectation value, and an MD trajectory provides the sampled configurations needed to approximate it.

5.3.3.14.1. Main examples in these notes#

\(g(r)\): histogram-based estimate of pair-distance probabilities, normalized by shell volume and density.
\(R_g\): average molecular compactness.
RMSD: deviation from a reference structure after alignment.
MSD: time-dependent displacement used to estimate diffusion.

5.3.3.14.2. Recommended conceptual takeaway#

Different MD observables may look unrelated, but computationally they all follow the same pattern:

define a frame-level function,
evaluate it across sampled configurations,
average or histogram the results,
interpret the result in physical terms.