.. _trajectory-user-guide:

Trajectory data
===============

.. currentmodule:: SFI.trajectory

The :mod:`SFI.trajectory` submodule is the **data handling layer** of SFI.  It
provides a single user-facing container,
:class:`SFI.trajectory.TrajectoryCollection`, which represents trajectories
together with masks and extras in a JAX-friendly form.

This guide shows how to construct, save, load and combine
:class:`TrajectoryCollection` objects, and how they integrate into the
simulation and inference workflow.

Why a dedicated container, not a DataFrame?
-------------------------------------------

A tracking table (one row per detection) is the natural *interchange*
format, and SFI ingests it directly (:meth:`from_dataframe`, below).
But inference does not run on rows — it runs on **streams** derived from
the trajectory: positions ``X``, increments ``dX``, finite-difference
velocities, and basis functions evaluated along the path.  The
collection exists to serve those streams, which a row-oriented DataFrame
cannot:

- **Increment / stream views.**  ``dX``, secant velocities, and the
  skipped errors-in-variables instrument are built lazily from a sliding
  time window with the correct per-stream alignment — not stored as
  columns.
- **Masking.**  Missing frames and particles entering or leaving are
  tracked per ``(time, particle)`` so each stream only ever yields rows
  where its whole finite-difference window is valid (see *Working with
  masking*).
- **Normalisation and weighting.**  Pooled experiments of different
  length and quality are reweighted into one estimator (per-dataset
  weights, effective observation time).
- **JAX, above all.**  The data are held as dense, rectangular
  ``(T, N, d)`` arrays so the whole pipeline is JIT-compiled, ``vmap``-ed
  and GPU-ready; a dynamically-shaped table of Python objects would
  defeat that.

So: bring your data as a DataFrame or a tracking file, and let the
collection turn it into the JAX-native, mask-aware streams the
estimators consume.

Basic construction from arrays
------------------------------

The main constructor when you already have tensors in memory is
:meth:`TrajectoryCollection.from_arrays`:

.. code-block:: python

   import jax.numpy as jnp
   from SFI.trajectory import TrajectoryCollection

   # Toy data: T time steps, N particles, d dimensions
   T, N, d = 1000, 5, 2
   X = jnp.zeros((T, N, d))   # your tracked positions

   # Uniform timestep
   dt = 0.1

   # Optional boolean mask (T, N): False = missing sample
   mask = jnp.ones((T, N), dtype=bool)

   coll = TrajectoryCollection.from_arrays(
       X=X,
       dt=dt,
       mask=mask,
       extras_global={"temperature": 300.0},
       extras_local={"radius": jnp.ones((N,))},
   )

Key points:

- ``X`` has shape ``(T, N, d)`` or ``(T, d)``; in the latter case,
  a single particle is assumed.
- You may specify either:

  - a scalar or per-step ``dt`` (shape ``(T,)``), or
  - an absolute time vector ``t`` (shape ``(T,)``), or
  - both ``t`` and ``dt=None``; the dataset will derive step sizes
    from ``t`` when needed.

- If ``mask`` is omitted, all samples are treated as valid.

The resulting ``coll`` is a single-dataset collection; you can pass it
directly to inference routines.

.. note::

   The middle axis ``N`` is the **particle** axis (see
   :term:`particles`).  At ``pdepth=0`` a single law is applied
   independently to each particle; interacting models (``pdepth=1``) see
   all particles at once.  The particle count may change over time — the
   mask records who is present at each frame — so ``N`` is the maximum
   concurrent count.  See :doc:`/particles/user_guide`.

Constructing from tracking tables
---------------------------------

Many experimental pipelines produce per-detection tables with columns
such as:

- particle identifier,
- frame or time index,
- coordinates (x, y, z, …),
- optional per-row metadata.

There are two ways to turn such tables into a collection.

From a saved tracking file
~~~~~~~~~~~~~~~~~~~~~~~~~~

If your data are already in a CSV / parquet / HDF5 file that follows the
SFI trajectory format — specified in :doc:`/trajectory/data_formats` —
you can load it directly:

.. code-block:: python

   from SFI.trajectory import TrajectoryCollection

   coll = TrajectoryCollection.load("tracks.parquet")

This will:

- parse the state vectors and (particle, time) indices,
- reconstruct masked ``(T, N, d)`` arrays,
- read extras and metadata from the file header and columns,
- return a single-dataset collection.

If the file contains raw tracking output in a different layout, you can:

1. use pandas (or your preferred tool) to reshape it into the standard
   column layout, then
2. feed the columns to :meth:`TrajectoryCollection.from_columns`.

From a pandas DataFrame
~~~~~~~~~~~~~~~~~~~~~~~

For raw tracking tables (trackpy, TrackMate, custom pipelines), the
most direct route is :meth:`TrajectoryCollection.from_dataframe` —
columns are addressed by name, in any order, and junk columns are
dropped:

.. code-block:: python

   import pandas as pd
   from SFI.trajectory import TrajectoryCollection

   tracks = pd.read_csv("raw_tracks.csv")   # e.g. track_id, frame, x, y, quality

   coll = TrajectoryCollection.from_dataframe(
       tracks,
       particle="track_id",
       time="frame",
       coords=("x", "y"),
       dt=0.05,
   )

Common particle/time column names (``particle_id``, ``particle``,
``track_id``, ``frame``, ``time``, …) are auto-detected when the
keywords are omitted; an ambiguous table raises with the candidates
listed.  A float-valued time column is factorized into frame indices
and its unique values become the absolute time axis.  Columns with the
extras prefixes (``G_``/``TG_``/``P_``/``TP_``, see
:doc:`/trajectory/data_formats`) are parsed into extras exactly as
when loading files.

From in-memory columns
~~~~~~~~~~~~~~~~~~~~~~

For custom pipelines that already expose columns, use
:meth:`TrajectoryCollection.from_columns` (the lower-level form that
:meth:`from_dataframe` delegates to):

.. code-block:: python

   import numpy as np
   from SFI.trajectory import TrajectoryCollection

   # Example columns
   particle_idx = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # (L,)
   time_idx     = np.array([0, 1, 0, 1, 2, 3, 2, 3])   # (L,)
   state_vecs   = np.random.randn(8, 2)                # (L, d)

   coll = TrajectoryCollection.from_columns(
       particle_idx=particle_idx,
       time_idx=time_idx,
       state_vectors=state_vecs,
       dt=0.1,                # or t=..., or use extras_global["t"]
   )

The constructor:

- reconstructs the dense ``(T, N, d)`` state array,
- infers the number of particles and time steps,
- builds the internal mask (dropping any missing rows),
- attaches any provided extras or metadata.

Working with masking
--------------------

The mask is a boolean array of shape ``(T, N)`` attached to each
dataset. It encodes which **positions** are available at each time and
for each particle.

**How missing data is handled.**  Dropped detections, blinking labels,
tracking gaps, and particles that enter or leave the field of view are
all represented the same way — the corresponding mask entries are set to
``False``.  Nothing is removed from the array and nothing is
interpolated: a masked frame simply does not contribute the increments
that would have used it.  You can therefore keep one clean rectangular
``(T, N, d)`` array even when each particle is observed over a different,
gappy subset of frames.  When ``mask`` is ``None`` the dataset is treated
as fully observed.

Internally, when an inference program requests streams such as ``"X"``
and ``"dX"``, the dataset:

1. computes the smallest time window needed to provide all requested
   streams (e.g. ``t`` and ``t+1`` for ``"dX"``),
2. defines the set of valid indices where this window is in-bounds and
   all required positions are unmasked,
3. only ever produces rows for those valid indices.

This means:

- you can freely mask out arbitrary times and particles without worrying
  about the exact finite-difference scheme used later,
- the same collection can be reused for different inference programs
  that request different combinations of streams.

You typically do not interact with valid indices directly; the
integration runtime requests them via internal streaming methods on the
collection.

Using extras
------------

Extras are user-defined fields attached to a dataset, accessible during
integration through the ``"extras"`` stream. They are divided into:

- ``extras_global``: shared by all particles in the dataset,
- ``extras_local``: per-particle quantities.

Example: per-experiment and per-particle parameters:

.. code-block:: python

   import jax.numpy as jnp
   from SFI.trajectory import TrajectoryCollection

   X = jnp.zeros((T, N, d))

   coll = TrajectoryCollection.from_arrays(
       X=X,
       dt=0.1,
       extras_global={
           "temperature": 300.0,
           "salt_concentration": 0.2,
       },
       extras_local={
           "radius": jnp.linspace(0.5, 1.0, N),
       },
   )

Semantics:

- Global extras are visible to all particles; local extras are aligned
  with the particle index axis.
- If the same key appears in both, the local version wins.

**Time-dependent extras** are first-class: wrap an array with a leading
time axis as a :class:`~SFI.trajectory.TimeSeriesExtra`
(``time_series_extra(values)``, shape ``(T, ...)`` for globals or
``(T, N, ...)`` per particle), and the integration runtime slices it to
the value at each frame.  ``TG_`` / ``TP_`` file columns load this way
automatically (:doc:`/trajectory/data_formats`).

During integration, when a program asks for the ``"extras"`` stream, it
receives a dictionary, for each time index, built by:

- slicing any recognised time-series extras at that index,
- evaluating any callable extras at that index, passing along an
  optional context string supplied by the integrator,
- merging global and local dictionaries.

This makes it natural to fit force fields that depend not only on the
state, but also on experiment conditions, drive protocols, or particle
labels.  The compositional entry point is
:func:`~SFI.bases.extra_scalar`, which turns an extras key into a basis
symbol:

.. code-block:: python

   from SFI.bases import X, extra_scalar

   # trap + protocol-modulated trap: F = θ₁·x + θ₂·k(t)·x
   B = X(dim=1) & (extra_scalar("k_drive", dim=1) * X(dim=1))
   inf.infer_force_linear(B)

See :doc:`/gallery/time_dependent_forcing_demo` for the full round
trip (simulate with a protocol, infer it back), and
:doc:`/langevin/user_guide` for the simulation-side conventions.  For
multi-experiment collections (see below), you can attach different
extras to each dataset and use them as input features in inference.

Combining multiple experiments
------------------------------

A :class:`TrajectoryCollection` can hold several independent datasets.
They may differ in:

- number of particles and duration,
- masks and time axes,
- extras and metadata.

You can build them separately and then concatenate:

.. code-block:: python

   coll1 = TrajectoryCollection.from_arrays(X=X1, dt=0.1,
                                            extras_global={"temperature": 280.0})
   coll2 = TrajectoryCollection.from_arrays(X=X2, dt=0.1,
                                            extras_global={"temperature": 300.0})

   coll = coll1 & coll2          # quick merge; default "pool" policy

   # …or make each experiment count equally regardless of length:
   coll = coll1.concat([coll2], weights="per_dataset")

The resulting collection:

- has two datasets, each with its own extras and mask,
- carries per-dataset weights — an **unnormalised multiplier** that the
  runtime applies to every estimator (force, diffusion, parametric).

The ``weights`` policy sets how much each dataset contributes to the
pooled inference:

- ``"pool"`` (default): multiplier ``1`` for every dataset — pool all the
  increments on equal footing.  Combined with each estimator's intrinsic
  within-dataset weighting (the force is per-:math:`\Delta t`, the
  diffusion per-point), each dataset then contributes in proportion to its
  effective observation time (force) or its number of points (diffusion).
- ``"per_dataset"``: each experiment contributes equally regardless of
  length (multiplier :math:`\overline{T_\mathrm{eff}}/T_{\mathrm{eff},d}`);
  exact for the force, and for the diffusion when ``dt`` is uniform.
- an explicit array: manual per-dataset multipliers.

A single inference on the concatenated collection then fits one shared
model to all experiments at once — see
:doc:`/gallery/advanced/multi_experiment_demo` for a worked example.

**Shared vs. experiment-specific parameters.**  Every produced row
carries the reserved extra ``dataset_index`` (the dataset's position in
the collection — injected at integration time, never stored on disk),
which lets part of the model be experiment-specific while the rest is
shared:

.. code-block:: python

   from SFI.bases import X, named_scalar, per_dataset_scalar, unit_vector_basis

   k = named_scalar("k", default=1.0)            # shared stiffness
   a = per_dataset_scalar("a", n_datasets=2)     # one drift per experiment
   F = a * unit_vector_basis(1) - k * X(dim=1)
   inf.infer_force(F)                            # parametric (L-BFGS path)

For the **linear estimators**, use one-hot indicator features instead:
``dataset_indicator(n) * X(dim)`` gives an independent linear
coefficient per experiment (block-diagonal Gram, PASTIS-prunable).
Per-particle inferred parameters in interacting models use the same
idea with the reserved ``particle_index`` extra inside
:func:`~SFI.statefunc.make_interactor` kernels (declare it via
``particle_extras=...``).

**Reproducing one experiment.**  ``dataset_index`` is purely an
inference-time coordinate — it has no meaning in a simulation.  To
reproduce a single experiment from a pooled fit, collapse the model to
that condition with
:meth:`~SFI.statefunc.StateExpr.specialize`: per-dataset parameters are
folded at the chosen index and ``dataset_index`` drops out, leaving a
standalone single-condition model.  :meth:`~SFI.inference.OverdampedLangevinInference.simulate_bootstrapped_trajectory`
does this for you when given ``dataset=k`` (``k`` is the experiment's
position in the pooled collection):

.. code-block:: python

   coll_k, proc_k = inf.simulate_bootstrapped_trajectory(key, dataset=k)
   # proc_k is experiment k's own model; coll_k is a plain single
   # trajectory (no dataset_index), so re-inference uses a plain basis.

Do **not** set ``dataset_index`` yourself as a user extra — it is
framework-reserved and will be rejected.

This structure is useful when:

- pooling multiple repeats of the same experiment,
- combining trajectories recorded at different conditions and letting
  the state expression depend symbolically on those conditions (e.g.
  learning a force field ``F(x, T)`` from datasets at different
  temperatures).

Saving and loading collections
------------------------------

Use :meth:`TrajectoryCollection.save` and
:meth:`TrajectoryCollection.load` for persistence and interchange.
The on-disk formats (CSV / Parquet / HDF5 table layout, extras-column
prefixes, metadata header) are specified in
:doc:`/trajectory/data_formats`.

Single-dataset case
~~~~~~~~~~~~~~~~~~~

For a single dataset, you can save and reload from a single file:

.. code-block:: python

   coll = TrajectoryCollection.from_arrays(X=X, dt=0.1)

   # Save to a parquet file (recommended)
   coll.save("traj.parquet")

   # Later…
   coll2 = TrajectoryCollection.load("traj.parquet")

Rules:

- The collection must contain exactly one dataset when saving to a
  single file.
- Masked samples are dropped; the file only contains valid rows.
- At load time, particle IDs are optionally relabelled to ``0..N-1`` and
  time indices shifted to start at 0 (configurable via ``relabel``).

Multi-dataset case
~~~~~~~~~~~~~~~~~~

For multiple datasets, save to a directory:

.. code-block:: python

   # coll contains several datasets
   coll.save("my_experiments/")

   # Directory "my_experiments" will contain:
   #   ds_000.parquet, ds_001.parquet, ...
   #   manifest.yaml

   coll2 = TrajectoryCollection.load("my_experiments/")

Each dataset is stored in its own file, and a ``manifest.yaml`` records
their names and filenames. Loading the directory reconstructs the full
collection with one dataset per file.

Round-trip guarantees
~~~~~~~~~~~~~~~~~~~~~

A collection that is saved and then reloaded:

- preserves the state arrays, masks, time axis and extras, up to the
  loss of masked samples (which are not written to disk),
- restores per-dataset metadata and I/O-level time information,
- recomputes reasonable default weights (by default equal weights for
  multi-dataset collections, or the chosen policy).

You can rely on this for reproducible pipelines where simulation,
storage, and inference may happen at different times or on different
machines.

Train/test splitting
--------------------

For held-out validation on data-abundant problems,
:meth:`TrajectoryCollection.split_time` cuts every dataset along time:

.. code-block:: python

   train, test = coll.split_time(0.8)        # optional: gap=, reweight=

Masks, time axes (kept absolute), and extras follow the split
(time-series extras are sliced; static extras are shared).  See the
held-out note in :doc:`/inference/user_guide` for when this is — and
is not — the right tool: SFI's default validation route
(``force_predicted_MSE`` + diagnostics) costs no data.

Optional: degrading synthetic data
----------------------------------

The :meth:`TrajectoryCollection.degrade` method is a convenience wrapper
for generating synthetic “experimental-like” trajectories from clean
simulations:

.. code-block:: python

   degraded = coll.degrade(
       downsample=4,            # keep every 4th frame (effective dt ×4)
       motion_blur=3,           # average over motion_blur+1 frames before
                                #   downsampling (exposure smear);
                                #   0 <= motion_blur < downsample
       data_loss_fraction=0.1,  # randomly mask out 10% of surviving points
       noise=0.01,              # add Gaussian localisation noise, std 0.01
       reweight="pool",         # recompute pooled weights after degrading
   )

Two further options not shown above: ``ROI=`` drops points outside a
region of interest (a radial cutoff, an axis-aligned box, or a
predicate), and ``seed=`` fixes the RNG for the added noise and the
drop-out so a degradation is reproducible.

Typical use cases:

- testing robustness of inference algorithms to motion blur and missing
  data,
- benchmarking against known synthetic ground truth.

Real experimental data are usually ingested directly from tracking
tables via :meth:`TrajectoryCollection.load` or
:meth:`TrajectoryCollection.from_columns`; degradation is not required
for those workflows.

Quick plotting
--------------

The :mod:`SFI.utils.plotting` module provides simple helpers that work
directly with collections. For example:

.. code-block:: python

   import matplotlib.pyplot as plt
   from SFI.utils import plotting

   # Time series of all coordinates for the first dataset
   plotting.timeseries(coll, dataset=0)
   plt.show()

   # 2D phase plot for coordinates 0 and 1
   plotting.phase2d(coll, dataset=0, dims=(0, 1))
   plt.gca().set_aspect("equal")
   plt.show()

These utilities internally extract ``(t, X, mask)`` from the chosen
dataset, respect the mask when drawing trajectories, and can be used as
building blocks for more elaborate visualisation routines.

Sanity-checking your data
-------------------------

A quick look before inference rarely hurts: plot the trajectory with
``plotting.timeseries`` / ``plotting.phase2d`` (above) and watch for
tracking artefacts (jumps, stuck particles, coordinate resets) or a
mismatched ``dt`` (displacements spanning a large fraction of the
explored range); ``ds.mask`` — ``None`` when fully observed — reports the
coverage.  None of this is essential, though: the rigorous, fit-aware
checks happen *after* inference, through the diagnostics suite
(:doc:`/diagnostics`).

Further reading
---------------

- :doc:`reference` for a curated reference of the public methods on
  :class:`TrajectoryCollection`.
- The autogenerated :mod:`SFI.trajectory` API page for full signatures
  and parameter documentation.