Diagnostics

Once an inference object is fitted, the next question is should we trust it? The SFI.diagnostics submodule answers that by recomputing standardised residuals from the fitted force and the inferred constant diffusion, then testing whether they look like an independent \(\mathcal N(0, 1)\) sample — the signature of a correctly specified model.

Quick start

import SFI
from SFI.diagnostics import assess

inf = SFI.OverdampedLangevinInference(collection)
inf.compute_diffusion_constant()
inf.infer_force_linear(basis)
inf.compute_force_error()

report = assess(inf, level="standard")   # or inf.diagnose()
report.print_summary()

The same call works for UnderdampedLangevinInference, multi-particle and multi-dataset collections, and for the linear, parametric, and nonlinear inference paths. Data access is routed through TrajectoryDataset.make_batch_producer (the same streaming layer used internally by SFI.integrate), so masked frames and a variable per-step dt are handled uniformly.

What is computed

For an overdamped fit, the residual is the Euler–Maruyama innovation

\[r_t = \Delta x_t - F(x_t)\,\Delta t, \qquad \mathrm{Cov}(r_t) = 2\,\bar D\,\Delta t,\]

whitened by \((2\bar D\,\Delta t)^{-1/2}\) so that, under a correct model, the standardised residual \(z_t\) is \(\mathcal N(0, I)\).

For an underdamped fit (positions only), the residual is the symmetric finite-difference acceleration minus the fitted force — the same quantity the underdamped estimator itself fits:

\[r_t = \hat a_t - F(\hat x_t, \hat v_t), \qquad \hat a_t = \frac{x_{t+1} - 2x_t + x_{t-1}}{\Delta t^2},\]

with \(\hat x_t = \tfrac13(x_{t-1} + x_t + x_{t+1})\) and \(\hat v_t = (x_{t+1} - x_{t-1}) / (2\Delta t)\). The noise of this second difference is the second difference of integrated white noise, with variance \(\tfrac23\,(2\bar D)/\Delta t\); this sets the whitening. Because adjacent accelerations share two of three positions, the residual is a moving-average process in time, so the builder keeps every second valid index — leaving the pooled innovations serially independent.

The standard suite (level="standard", the default) runs:

  • Residual moments — pooled mean, standard deviation, skewness, and excess kurtosis, with a per-component breakdown.

  • Autocorrelation — Ljung–Box on \(z_t\) and on \(z_t^2\) (the latter detects volatility clustering / a mis-estimated diffusion).

  • Normality — Kolmogorov–Smirnov against \(\mathcal N(0, 1)\), plus the raw Q–Q data.

  • MSE consistency — compares the realised mean-square residual to the predicted (sampling-noise) NMSE via a sampling-noise-aware chi-square z-score; \(|z| > 5\) flags bias or a mis-specified diffusion.

The "minimal" level computes the residual moments only.

Output

assess() returns a DiagnosticsReport carrying a residuals section and a meta dict (backend, regime, sample sizes). The report exposes:

  • print_summary() — a concise human-readable table with ✓/✗ marks at a chosen significance level, ending in a -- Flags -- block;

  • flag_issues() — the list of warnings (one string per failing check) that print_summary uses;

  • to_dict() / to_json() — JSON-clean serialisation.

Plotting helpers

from SFI.diagnostics import (
    plot_qq, plot_residual_histogram, plot_residual_acf, plot_summary,
)

fig = plot_summary(report)        # 1×3 figure, all panels at once

All panels accept either a fitted inferer or a precomputed DiagnosticsReport.

Interpreting flags

  • [autocorr/ljung_box] — the model is missing a time-correlated feature (for example a constant-only fit on a mean-reverting process).

  • [autocorr/ljung_box_squared] — the diffusion is mis-estimated or state-dependent.

  • [normality/ks] — the residuals are non-Gaussian: rare events not captured by the basis, or a non-Gaussian noise structure.

  • [moments/std] — a whitened standard deviation \(\not\approx 1\) usually means \(\bar D\) is wrong (try a different compute_diffusion_constant() method).

  • [moments/mean] — a non-zero residual mean points to a systematic drift bias.

  • [mse_consistency] — realised NMSE far above predicted: the model is biased; widen the basis or run sparsify_force() and refit.

Each printed flag carries the corresponding action hint inline ("[mse_consistency] consider the parametric estimator (infer_force)"); pass hints=False to flag_issues() or print_summary() for bare statistics (machine parsing).

How visible a flag is depends on the sampling interval. At very fine \(\Delta t\) the diffusion estimate can absorb a weak force-misspecification, leaving the marginal tests looking clean; coarser sampling makes the leftover structure show up in the autocorrelation and MSE-consistency checks.

When a flag points beyond the linear estimators

On experimental data, [mse_consistency], [moments/std], and [autocorr/ljung_box_squared] very often trace back to measurement noise or coarse sampling rather than a wrong basis. The cure is then not a bigger basis but the noise-aware parametric estimators (infer_force() / infer_diffusion()), which model the measurement-noise covariance \(\Lambda\) explicitly and replace the Euler secant with an RK4 flow step:

  • [mse_consistency] with localization noise suspected — refit with the parametric estimator on a fresh inference object and compare; Measurement noise and coarse sampling walks through it.

  • [moments/std] far from 1 after trying both compute_diffusion_constant() methods — the noise level is biasing the diffusion estimate; the parametric estimators profile \((\mathbf{D}, \Lambda)\) jointly.

  • [autocorr/ljung_box] that persists after widening the basis — suspect coarse sampling; the parametric RK4 flow step extends the usable \(\Delta t\) range.

If the parametric and linear fits agree, the flags point back at the model itself: widen or restructure the basis (Building bases) and re-run sparsify_force().

When data is plentiful, a suspected bias floor can also be confirmed with an explicit train/test split — coll.split_time(0.8) + inf.holdout_score(test), and assess(inf, data=test) for held-out diagnostics. This is a side feature: the predicted error and this suite cost no data, which is SFI’s preferred, data-efficient route.

Classifying overdamped vs. underdamped dynamics

Note

This classifier is experimental — the discriminator and verdict thresholds may still change, and the "inconclusive" band is deliberately conservative.

Before choosing an inference engine you can ask the data which regime it is in. classify_dynamics() (experimental) reads raw positions and returns an "OD" / "UD" / "inconclusive" verdict, robust to high localization noise and coarse sampling:

from SFI import classify_dynamics
from SFI.diagnostics import plot_dynamics_order

report = classify_dynamics(collection)
report.print_summary()        # verdict + tau_v, sigma, D, AICc, scaling slope
plot_dynamics_order(report)   # rho2(dt) and apparent-KE scaling panels

The discriminator is the lag-resolved displacement covariance \(C_k=\langle\Delta x_t\cdot\Delta x_{t+k}\rangle\). White localization noise touches only \(C_0\) and \(C_1\), so lag-2 statistics are measurement-noise-immune; scanning \(\Delta t\) separates the overdamped force confound (which vanishes as \(\Delta t\to0\)) from genuine momentum persistence (which saturates). The verdict combines a model-free scaling test, a parametric fit of the diffusion + inertia + localization covariance model (recovering the momentum relaxation time \(\tau_v\)), and an overdamped-fit residual-autocorrelation cross-check. At coarse sampling (\(\gamma\,\Delta t\gtrsim1\)) momentum is unresolved and the verdict is "inconclusive" by design. The method assumes white localization noise; strong memory (a generalized-Langevin / viscoelastic bath) can also produce velocity persistence and is out of scope. A worked example ships in the gallery (Overdamped or underdamped? Classifying dynamics from data).

References

Ljung & Box (1978); Diebold, Gunther & Tay (1998).