.. _sparsity-theory: Sparse model selection — Theory & design ========================================= This page describes the mathematical foundations and software architecture of the :mod:`SFI.inference.sparse` sub-package. It complements the :ref:`user guide ` with deeper technical details. Mathematical framework ---------------------- Normal equations and information gain ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ After projecting the force onto a basis library :math:`\{\phi_a\}_{a=1}^p`, the inference problem reduces to a linear system in normal-equation form: .. math:: G\,C = M where .. math:: G_{ab} = \bigl\langle \phi_a \cdot \bar D^{-1} \cdot \phi_b \bigr\rangle, \qquad M_a = \bigl\langle v_t \cdot \bar D^{-1} \cdot \phi_a \bigr\rangle. Given a *support* :math:`B \subseteq \{1, \dots, p\}`, the restricted solution :math:`C_B = G_{BB}^{-1}\,M_B` yields an **information gain** .. math:: \mathcal{I}(B) = \tfrac{1}{2}\,C_B^\top M_B = \tfrac{1}{2}\,M_B^\top G_{BB}^{-1}\,M_B. This quantity measures how much the data support the selected terms. The full problem (:math:`B = \{1,\dots,p\}`) gives :math:`\mathcal{I}_{\mathrm{total}}`. .. note:: All sparsity algorithms in SFI operate **only on** :math:`(M, G)`. The raw trajectory data never enters the sparse search, ensuring a clean separation between data processing and model selection. Sparse model selection as combinatorial optimisation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We seek the support :math:`B^*(k)` that maximises :math:`\mathcal{I}(B)` for each cardinality :math:`|B| = k`: .. math:: B^*(k) = \arg\max_{|B|=k} \mathcal{I}(B). The set of all :math:`(k, \mathcal{I}(B^*(k)))` pairs defines the **Pareto front**. This is a combinatorial optimisation problem — exhaustive search is intractable for :math:`p > 20` — so approximate algorithms are used. Information criteria ~~~~~~~~~~~~~~~~~~~~ The optimal model size is selected by maximising an information criterion that trades fit against complexity: .. math:: \text{AIC}(k) &= \mathcal{I}(k) - k, \\ \text{BIC}(k) &= \mathcal{I}(k) - \tfrac{1}{2}\,k\,\ln\tau, \\ \text{EBIC}(k) &= \text{BIC}(k) - 2\gamma\,\ln\binom{n_0}{k}, \\ \text{PASTIS}(k) &= \mathcal{I}(k) - k\,\ln(n_0 / p_0). Here :math:`\tau` is the total trajectory time, :math:`n_0` the library size, :math:`p_0` the PASTIS significance level, and :math:`\gamma \in [0,1]` the EBIC tuning parameter. **AIC** (Akaike, 1974) penalises by the number of parameters. **BIC** (Schwarz, 1978) arises from a Laplace approximation of the marginal likelihood. The continuous-time formulation uses :math:`\ln\tau` rather than :math:`\ln n` (number of data points), which is the correct scaling when :math:`\Delta t \to 0` (Gerardos & Ronceray, 2025). **EBIC** (Chen & Chen, 2008) augments BIC with a combinatorial correction :math:`2\gamma\ln\binom{n_0}{k}` that accounts for the number of candidate subsets. Setting :math:`\gamma=0` recovers BIC; :math:`\gamma=1` gives the strongest penalty. **PASTIS** (Gerardos & Ronceray, 2025) combines likelihood statistics with extreme-value theory. The penalty :math:`k\ln(n_0/p_0)` controls the probability of retaining at least one superfluous term. .. admonition:: Choosing the criterion :class: tip PASTIS is recommended for typical experimental data where :math:`n_0` is moderate (10–100). For very small bases, AIC or BIC may be better calibrated. For large libraries, EBIC with :math:`\gamma \approx 0.5` improves selection consistency. .. note:: BIC and EBIC require the trajectory time :math:`\tau` to be passed explicitly via the ``tau`` keyword argument. Algorithms ---------- Beam search (PASTIS original) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The bidirectional beam search explores the support lattice by *expanding* (add one index) and *contracting* (drop one index) every support in the current frontier, then retaining only the ``beam_width`` best supports per cardinality. **Algorithm outline:** 1. Initialise with the empty support :math:`B = \emptyset`. 2. For each support in the frontier at cardinality :math:`k`: a. Generate all *add-one* children (cardinality :math:`k+1`). b. Generate all *drop-one* children (cardinality :math:`k-1`). 3. Score all children via :math:`\mathcal{I}(B)` (batched using ``vmap``). 4. For each target cardinality, retain only the ``beam_width`` highest- scoring supports (min-heap). 5. Repeat until no new supports are generated or the AIC early-stop criterion triggers (AIC has declined for ``aic_patience`` consecutive closed levels). The bidirectionality allows the search to correct early mistakes by removing terms that were added sub-optimally. **Complexity:** Each generation scores :math:`O(p \times \text{beam\_width})` candidates at each of :math:`O(p)` cardinality levels. In practice, the AIC early stop limits the number of generations well below :math:`p`. **Second-best tracking:** The beam naturally retains the top-2 supports per cardinality, providing a robustness diagnostic. Greedy stepwise selection ~~~~~~~~~~~~~~~~~~~~~~~~~ Stepwise selection adds (forward) or removes (backward) one feature at a time, always choosing the action that maximises information gain. **Forward** (starting from :math:`B = \emptyset`): Step :math:`k`: evaluate all :math:`p - k + 1` candidates obtained by adding one index to the current support; keep the best. **Backward** (starting from :math:`B = \{1,\dots,p\}`): Step :math:`k`: evaluate all :math:`k` candidates obtained by removing one index; keep the best. **Bidirectional**: run both paths and merge their Pareto fronts. **Complexity:** :math:`O(p^2)` evaluations total (each done in a batched ``vmap`` call). This is the cheapest algorithm and scales well to large :math:`p`. .. note:: Greedy selection produces exactly one candidate per cardinality, so the Pareto front is always monotone. It does not track second-best alternatives. Sequential Thresholded Least Squares (STLSQ) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ STLSQ (Brunton *et al.*, 2016) is an iterative hard-thresholding algorithm: 1. Solve the full system :math:`G\,C = M` for all :math:`p` coefficients. 2. Zero out coefficients whose magnitude falls below a threshold :math:`\tau`. 3. Re-solve on the surviving support. 4. Repeat until convergence. The threshold can be: * **Relative** (default): :math:`\tau_{\mathrm{eff}} = \tau \cdot \max|C|`. * **Absolute**: :math:`\tau_{\mathrm{eff}} = \tau`. Running the algorithm over a log-spaced sweep of thresholds produces supports at different sparsity levels, from which a Pareto front is assembled. **Complexity:** Each threshold requires :math:`O(p)` least-squares solves (one per iteration, each of size at most :math:`p`). With :math:`n_\tau` thresholds, the total cost is :math:`O(n_\tau \cdot \text{max\_iter} \cdot p^3)`, but each solve is small. Stochastic hill-climbing ~~~~~~~~~~~~~~~~~~~~~~~~ The stochastic hill-climbing strategy implements the search algorithm of Gerardos & Ronceray (2025): starting from an initial model, it accepts random single-parameter additions or removals if they increase a chosen information criterion, and stops after a configurable number of consecutive failures (*patience*). Multiple independent chains run in parallel, one per cardinality :math:`k \in \{0, \ldots, \text{max\_k}\}`. Each chain starts from a uniformly random support of size :math:`k` (this includes the null and full models as special cases). **Algorithm outline:** 1. For each :math:`k = 0, \ldots, \text{max\_k}`, draw a random support :math:`B_0` of size :math:`k`. 2. At each step, propose a random add or remove move: * **Add:** pick a random index :math:`j \notin B` and form :math:`B' = B \cup \{j\}`. * **Remove:** pick a random index :math:`j \in B` and form :math:`B' = B \setminus \{j\}`. 3. If :math:`\text{IC}(B') > \text{IC}(B)`, accept the move. 4. Stop after *patience* consecutive rejections. 5. Record the best support found by each chain into the Pareto front. **Complexity:** Each chain performs at most :math:`O(\text{patience} \times \text{max\_k})` single-support evaluations. Since each evaluation is JIT-compiled, the overhead is dominated by the number of evaluations. .. note:: Unlike beam search and greedy selection, hill-climbing is *stochastic* — the result depends on the random seed. Use the ``seed`` parameter for reproducibility. LASSO (ℓ₁-penalised regression) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The LASSO minimises the penalised normal-equations objective: .. math:: \hat C = \arg\min_C \; \tfrac{1}{2}\,C^\top G\,C \;-\; M^\top C \;+\; \alpha\,\|C\|_1 using **proximal coordinate descent**. No access to the raw design matrix :math:`\Phi` is needed — only :math:`(M, G)`. **Coordinate descent update:** For each :math:`j = 1, \dots, p`: .. math:: C_j \leftarrow \frac{S_\alpha\!\bigl( M_j - \sum_{i \neq j} G_{ji} C_i \bigr)}{G_{jj}} where :math:`S_\alpha(x) = \text{sign}(x)\max(|x| - \alpha, 0)` is the soft-thresholding operator. **Regularisation path:** sweeping :math:`\alpha` from :math:`\alpha_{\max} = \max|M|` (all-zero solution) down to :math:`10^{-4}\alpha_{\max}` with warm-starting produces a set of supports at varying sparsity. **De-biased re-solve:** for each support found by the LASSO, we re-solve the unrestricted problem on that support to remove the :math:`\ell_1` shrinkage bias. The de-biased coefficients and information gain are stored in the Pareto front. Software architecture --------------------- The :mod:`SFI.inference.sparse` sub-package separates three concerns: .. list-table:: :widths: 15 25 60 :header-rows: 1 * - Component - Class - Responsibility * - **Scoring** - :class:`~SFI.inference.sparse.SparseScorer` - Owns ``(M, G)``. Scores any support :math:`B` by solving :math:`G_{BB} C_B = M_B` and returning :math:`(\mathcal{I}(B), C_B)`. JIT-compiled and ``vmap``-batched for efficiency. * - **Search** - :class:`~SFI.inference.sparse.SparsityStrategy` (ABC) - Abstract base class. Subclasses implement ``run(scorer, max_k) → SparsityResult``. Concrete strategies: :class:`~SFI.inference.sparse.BeamSearchStrategy`, :class:`~SFI.inference.sparse.GreedyStepwiseStrategy`, :class:`~SFI.inference.sparse.HillClimbStrategy`, :class:`~SFI.inference.sparse.STLSQStrategy`, :class:`~SFI.inference.sparse.LassoStrategy`. * - **Result** - :class:`~SFI.inference.sparse.SparsityResult` - Immutable container for the Pareto front. Provides ``select_by_ic(name)`` and ``all_ic()`` for criterion-based model selection. This factoring means: * **New algorithms** only need to implement the ``SparsityStrategy`` interface — they receive a scorer and return a result. * **Scoring** is decoupled from the search. The same scorer can be reused by multiple strategies without re-computation. * **Selection** is decoupled from both scoring and searching. A ``SparsityResult`` from any algorithm can be queried for any IC. Module map ~~~~~~~~~~ .. code-block:: text SFI/inference/sparse/ ├── __init__.py # re-exports public API ├── scorer.py # SparseScorer — (M, G) scoring ├── result.py # SparsityResult — Pareto front + IC ├── base.py # SparsityStrategy ABC ├── beam.py # BeamSearchStrategy ├── greedy.py # GreedyStepwiseStrategy ├── hillclimb.py # HillClimbStrategy ├── stlsq.py # STLSQStrategy ├── lasso.py # LassoStrategy └── metrics.py # overlap_metrics, predictive_nmse Integration with the inference engines ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The inference base class (``BaseLangevinInference``) creates a ``SparseScorer`` during ``infer_force_linear()`` and stores it as ``self.force_scorer``. Calling ``sparsify_force(method=...)`` dispatches to the chosen strategy and stores the result as ``self.force_sparsity_result``. .. code-block:: text infer_force_linear(basis) └── builds SparseScorer(M=..., G=...) stored as self.force_scorer sparsify_force(method="beam", criterion="PASTIS") ├── strategy = BeamSearchStrategy(...) ├── result = strategy.run(self.force_scorer, max_k=...) ├── k*, support, coeffs = result.select_by_ic("PASTIS") ├── updates force coefficients via _update_force_coefficients └── stores result as self.force_sparsity_result Extending with a custom strategy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To add a new search algorithm: .. code-block:: python from SFI.inference.sparse import SparsityStrategy, SparsityResult class MyStrategy(SparsityStrategy): name = "my_algo" def run(self, scorer, *, max_k, **kwargs): # ... implement search logic using scorer.info_and_coeffs # and/or scorer.vmap_info ... return SparsityResult( p=scorer.p, total_info=float(scorer.total_info), method=self.name, best_info_by_k=..., best_support_by_k=..., best_coeffs_by_k=..., ) Then use it directly: .. code-block:: python scorer = inf.force_scorer result = MyStrategy().run(scorer, max_k=20) k, support, score, coeffs = result.select_by_ic("PASTIS") References ---------- * **AIC:** Akaike, H. (1974). "A new look at the statistical model identification." *IEEE Trans. Automat. Control*, 19(6), 716–723. * **BIC:** Schwarz, G. (1978). "Estimating the dimension of a model." *Ann. Statist.*, 6(2), 461–464. * **EBIC:** Chen, J. & Chen, Z. (2008). "Extended Bayesian information criteria for model selection with large model spaces." *Biometrika*, 95(3), 759–771. * **PASTIS (Parsimonious Stochastic Inference):** Gerardos, A. & Ronceray, P. (2025). "Principled model selection for stochastic dynamics." *Phys. Rev. Lett.* 135, 167401. `DOI: 10.1103/ltdt-hvh7 `_ * **SINDy / STLSQ:** Brunton, S. L., Proctor, J. L. & Kutz, J. N. (2016). "Discovering governing equations from data by sparse identification of nonlinear dynamical systems." *Proc. Natl. Acad. Sci.* 113(15), 3932–3937. * **LASSO:** Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." *J. R. Statist. Soc. B* 58(1), 267–288.