Sparse model selection — Theory & design¶

This page describes the mathematical foundations and software architecture of the SFI.inference.sparse sub-package. It complements the user guide with deeper technical details.

Mathematical framework¶

Normal equations and information gain¶

After projecting the force onto a basis library \(\{\phi_a\}_{a=1}^p\), the inference problem reduces to a linear system in normal-equation form:

\[G\,C = M\]

where

\[G_{ab} = \bigl\langle \phi_a \cdot \bar D^{-1} \cdot \phi_b \bigr\rangle, \qquad M_a = \bigl\langle v_t \cdot \bar D^{-1} \cdot \phi_a \bigr\rangle.\]

Given a support \(B \subseteq \{1, \dots, p\}\), the restricted solution \(C_B = G_{BB}^{-1}\,M_B\) yields an information gain

\[\mathcal{I}(B) = \tfrac{1}{2}\,C_B^\top M_B = \tfrac{1}{2}\,M_B^\top G_{BB}^{-1}\,M_B.\]

This quantity measures how much the data support the selected terms. The full problem (\(B = \{1,\dots,p\}\)) gives \(\mathcal{I}_{\mathrm{total}}\).

Note

All sparsity algorithms in SFI operate only on \((M, G)\). The raw trajectory data never enters the sparse search, ensuring a clean separation between data processing and model selection.

Sparse model selection as combinatorial optimisation¶

We seek the support \(B^*(k)\) that maximises \(\mathcal{I}(B)\) for each cardinality \(|B| = k\):

\[B^*(k) = \arg\max_{|B|=k} \mathcal{I}(B).\]

The set of all \((k, \mathcal{I}(B^*(k)))\) pairs defines the Pareto front. This is a combinatorial optimisation problem — exhaustive search is intractable for \(p > 20\) — so approximate algorithms are used.

Information criteria¶

The optimal model size is selected by maximising an information criterion that trades fit against complexity:

\[\begin{split}\text{AIC}(k) &= \mathcal{I}(k) - k, \\ \text{BIC}(k) &= \mathcal{I}(k) - \tfrac{1}{2}\,k\,\ln\tau, \\ \text{EBIC}(k) &= \text{BIC}(k) - 2\gamma\,\ln\binom{n_0}{k}, \\ \text{PASTIS}(k) &= \mathcal{I}(k) - k\,\ln(n_0 / p_0).\end{split}\]

Here \(\tau\) is the total trajectory time, \(n_0\) the library size, \(p_0\) the PASTIS significance level, and \(\gamma \in [0,1]\) the EBIC tuning parameter.

AIC (Akaike, 1974) penalises by the number of parameters.

BIC (Schwarz, 1978) arises from a Laplace approximation of the marginal likelihood. The continuous-time formulation uses \(\ln\tau\) rather than \(\ln n\) (number of data points), which is the correct scaling when \(\Delta t \to 0\) (Gerardos & Ronceray, 2025).

EBIC (Chen & Chen, 2008) augments BIC with a combinatorial correction \(2\gamma\ln\binom{n_0}{k}\) that accounts for the number of candidate subsets. Setting \(\gamma=0\) recovers BIC; \(\gamma=1\) gives the strongest penalty.

PASTIS (Gerardos & Ronceray, 2025) combines likelihood statistics with extreme-value theory. The penalty \(k\ln(n_0/p_0)\) controls the probability of retaining at least one superfluous term.

Choosing the criterion

PASTIS is recommended for typical experimental data where \(n_0\) is moderate (10–100). For very small bases, AIC or BIC may be better calibrated. For large libraries, EBIC with \(\gamma \approx 0.5\) improves selection consistency.

Note

BIC and EBIC require the trajectory time \(\tau\) to be passed explicitly via the tau keyword argument.

Algorithms¶

Beam search (PASTIS original)¶

The bidirectional beam search explores the support lattice by expanding (add one index) and contracting (drop one index) every support in the current frontier, then retaining only the beam_width best supports per cardinality.

Algorithm outline:

Initialise with the empty support \(B = \emptyset\).
For each support in the frontier at cardinality \(k\):
1. Generate all add-one children (cardinality \(k+1\)).
2. Generate all drop-one children (cardinality \(k-1\)).
Score all children via \(\mathcal{I}(B)\) (batched using vmap).
For each target cardinality, retain only the beam_width highest- scoring supports (min-heap).
Repeat until no new supports are generated or the AIC early-stop criterion triggers (AIC has declined for aic_patience consecutive closed levels).

The bidirectionality allows the search to correct early mistakes by removing terms that were added sub-optimally.

Complexity: Each generation scores \(O(p \times \text{beam\_width})\) candidates at each of \(O(p)\) cardinality levels. In practice, the AIC early stop limits the number of generations well below \(p\).

Second-best tracking: The beam naturally retains the top-2 supports per cardinality, providing a robustness diagnostic.

Greedy stepwise selection¶

Stepwise selection adds (forward) or removes (backward) one feature at a time, always choosing the action that maximises information gain.

Forward (starting from \(B = \emptyset\)):

Step \(k\): evaluate all \(p - k + 1\) candidates obtained by adding one index to the current support; keep the best.

Backward (starting from \(B = \{1,\dots,p\}\)):

Step \(k\): evaluate all \(k\) candidates obtained by removing one index; keep the best.

Bidirectional: run both paths and merge their Pareto fronts.

Complexity: \(O(p^2)\) evaluations total (each done in a batched vmap call). This is the cheapest algorithm and scales well to large \(p\).

Note

Greedy selection produces exactly one candidate per cardinality, so the Pareto front is always monotone. It does not track second-best alternatives.

Sequential Thresholded Least Squares (STLSQ)¶

STLSQ (Brunton et al., 2016) is an iterative hard-thresholding algorithm:

Solve the full system \(G\,C = M\) for all \(p\) coefficients.
Zero out coefficients whose magnitude falls below a threshold \(\tau\).
Re-solve on the surviving support.
Repeat until convergence.

The threshold can be:

Relative (default): \(\tau_{\mathrm{eff}} = \tau \cdot \max|C|\).
Absolute: \(\tau_{\mathrm{eff}} = \tau\).

Running the algorithm over a log-spaced sweep of thresholds produces supports at different sparsity levels, from which a Pareto front is assembled.

Complexity: Each threshold requires \(O(p)\) least-squares solves (one per iteration, each of size at most \(p\)). With \(n_\tau\) thresholds, the total cost is \(O(n_\tau \cdot \text{max\_iter} \cdot p^3)\), but each solve is small.

Stochastic hill-climbing¶

The stochastic hill-climbing strategy implements the search algorithm of Gerardos & Ronceray (2025): starting from an initial model, it accepts random single-parameter additions or removals if they increase a chosen information criterion, and stops after a configurable number of consecutive failures (patience).

Multiple independent chains run in parallel, one per cardinality \(k \in \{0, \ldots, \text{max\_k}\}\). Each chain starts from a uniformly random support of size \(k\) (this includes the null and full models as special cases).

Algorithm outline:

For each \(k = 0, \ldots, \text{max\_k}\), draw a random support \(B_0\) of size \(k\).
At each step, propose a random add or remove move:
- Add: pick a random index \(j \notin B\) and form \(B' = B \cup \{j\}\).
- Remove: pick a random index \(j \in B\) and form \(B' = B \setminus \{j\}\).
If \(\text{IC}(B') > \text{IC}(B)\), accept the move.
Stop after patience consecutive rejections.
Record the best support found by each chain into the Pareto front.

Complexity: Each chain performs at most \(O(\text{patience} \times \text{max\_k})\) single-support evaluations. Since each evaluation is JIT-compiled, the overhead is dominated by the number of evaluations.

Note

Unlike beam search and greedy selection, hill-climbing is stochastic — the result depends on the random seed. Use the seed parameter for reproducibility.

LASSO (ℓ₁-penalised regression)¶

The LASSO minimises the penalised normal-equations objective:

\[\hat C = \arg\min_C \; \tfrac{1}{2}\,C^\top G\,C \;-\; M^\top C \;+\; \alpha\,\|C\|_1\]

using proximal coordinate descent. No access to the raw design matrix \(\Phi\) is needed — only \((M, G)\).

Coordinate descent update: For each \(j = 1, \dots, p\):

\[C_j \leftarrow \frac{S_\alpha\!\bigl( M_j - \sum_{i \neq j} G_{ji} C_i \bigr)}{G_{jj}}\]

where \(S_\alpha(x) = \text{sign}(x)\max(|x| - \alpha, 0)\) is the soft-thresholding operator.

Regularisation path: sweeping \(\alpha\) from \(\alpha_{\max} = \max|M|\) (all-zero solution) down to \(10^{-4}\alpha_{\max}\) with warm-starting produces a set of supports at varying sparsity.

De-biased re-solve: for each support found by the LASSO, we re-solve the unrestricted problem on that support to remove the \(\ell_1\) shrinkage bias. The de-biased coefficients and information gain are stored in the Pareto front.

Software architecture¶

The SFI.inference.sparse sub-package separates three concerns:

Component	Class	Responsibility
Scoring	`SparseScorer`	Owns `(M, G)`. Scores any support \(B\) by solving \(G_{BB} C_B = M_B\) and returning \((\mathcal{I}(B), C_B)\). JIT-compiled and `vmap`-batched for efficiency.
Search	`SparsityStrategy` (ABC)	Abstract base class. Subclasses implement `run(scorer, max_k) → SparsityResult`. Concrete strategies: `BeamSearchStrategy`, `GreedyStepwiseStrategy`, `HillClimbStrategy`, `STLSQStrategy`, `LassoStrategy`.
Result	`SparsityResult`	Immutable container for the Pareto front. Provides `select_by_ic(name)` and `all_ic()` for criterion-based model selection.

This factoring means:

New algorithms only need to implement the SparsityStrategy interface — they receive a scorer and return a result.
Scoring is decoupled from the search. The same scorer can be reused by multiple strategies without re-computation.
Selection is decoupled from both scoring and searching. A SparsityResult from any algorithm can be queried for any IC.

Module map¶

SFI/inference/sparse/
├── __init__.py      # re-exports public API
├── scorer.py        # SparseScorer — (M, G) scoring
├── result.py        # SparsityResult — Pareto front + IC
├── base.py          # SparsityStrategy ABC
├── beam.py          # BeamSearchStrategy
├── greedy.py        # GreedyStepwiseStrategy
├── hillclimb.py     # HillClimbStrategy
├── stlsq.py         # STLSQStrategy
├── lasso.py         # LassoStrategy
└── metrics.py       # overlap_metrics, predictive_nmse

Integration with the inference engines¶

The inference base class (BaseLangevinInference) creates a SparseScorer during infer_force_linear() and stores it as self.force_scorer. Calling sparsify_force(method=...) dispatches to the chosen strategy and stores the result as self.force_sparsity_result.

infer_force_linear(basis)
    └── builds SparseScorer(M=..., G=...)
            stored as self.force_scorer

sparsify_force(method="beam", criterion="PASTIS")
    ├── strategy = BeamSearchStrategy(...)
    ├── result = strategy.run(self.force_scorer, max_k=...)
    ├── k*, support, coeffs = result.select_by_ic("PASTIS")
    ├── updates force coefficients via _update_force_coefficients
    └── stores result as self.force_sparsity_result

Extending with a custom strategy¶

To add a new search algorithm:

from SFI.inference.sparse import SparsityStrategy, SparsityResult

class MyStrategy(SparsityStrategy):
    name = "my_algo"

    def run(self, scorer, *, max_k, **kwargs):
        # ... implement search logic using scorer.info_and_coeffs
        #     and/or scorer.vmap_info ...
        return SparsityResult(
            p=scorer.p,
            total_info=float(scorer.total_info),
            method=self.name,
            best_info_by_k=...,
            best_support_by_k=...,
            best_coeffs_by_k=...,
        )

Then use it directly:

scorer = inf.force_scorer
result = MyStrategy().run(scorer, max_k=20)
k, support, score, coeffs = result.select_by_ic("PASTIS")

References¶

AIC: Akaike, H. (1974). “A new look at the statistical model identification.” IEEE Trans. Automat. Control, 19(6), 716–723.
BIC: Schwarz, G. (1978). “Estimating the dimension of a model.” Ann. Statist., 6(2), 461–464.
EBIC: Chen, J. & Chen, Z. (2008). “Extended Bayesian information criteria for model selection with large model spaces.” Biometrika, 95(3), 759–771.
PASTIS (Parsimonious Stochastic Inference): Gerardos, A. & Ronceray, P. (2025). “Principled model selection for stochastic dynamics.” Phys. Rev. Lett. 135, 167401. DOI: 10.1103/ltdt-hvh7
SINDy / STLSQ: Brunton, S. L., Proctor, J. L. & Kutz, J. N. (2016). “Discovering governing equations from data by sparse identification of nonlinear dynamical systems.” Proc. Natl. Acad. Sci. 113(15), 3932–3937.
LASSO: Tibshirani, R. (1996). “Regression shrinkage and selection via the lasso.” J. R. Statist. Soc. B 58(1), 267–288.