Trajectory file formats

This page specifies the on-disk formats read by TrajectoryCollection.load and written by TrajectoryCollection.save. Three formats are supported — CSV, Parquet, and HDF5 — all sharing the same tabular layout: one row per observation.

```python
from SFI.trajectory import TrajectoryCollection

coll = TrajectoryCollection.load("tracks.csv")      # or .parquet / .h5
coll.save("tracks.parquet")                          # format from suffix
```

Table layout

Each row is one observation of one particle at one time step:

| Column | Required | Content |
|---|---|---|
| `particle_id` | optional | Integer track identifier. If absent, the file is a single-trajectory file (pass `particle_column=None` to the low-level loader). |
| `time_step` | yes | Integer time index (0-based after relabelling). |
| `x0, x1, …` | yes | State-vector components (positions, angles, concentrations, …). |

CSV files identify columns by position, not by name: by default column 0 is the particle identifier, column 1 the time index, and every remaining column without an extras prefix (below) is a state component. Your columns can therefore be named particle_id, frame, x, y or anything else. Parquet and HDF5 files identify columns by name and must use the canonical names particle_id and time_step.

Rows containing NaNs are dropped on load; masked samples are dropped on save (only valid rows are written).
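
The positional rule for CSV files can be illustrated with a small standalone sketch. It does not call the library: `split_columns` and `read_rows` are hypothetical helpers written for this page, the prefix list is taken from the extras table in the next section, and the sketch assumes missing values appear as the literal string `nan`.

```python
import csv
import io
import math

# Extras prefixes as listed in the "Extras columns" section.
EXTRAS_PREFIXES = ("G_", "TG_", "P_", "TP_")

def split_columns(header):
    """Classify CSV columns by position, following the default layout."""
    particle, time = header[0], header[1]
    rest = header[2:]
    state = [name for name in rest if not name.startswith(EXTRAS_PREFIXES)]
    extras = [name for name in rest if name.startswith(EXTRAS_PREFIXES)]
    return particle, time, state, extras

def read_rows(text):
    """Parse CSV text and drop every row that contains a NaN."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for raw in reader:
        values = [float(cell) for cell in raw]
        if not any(math.isnan(v) for v in values):
            rows.append(values)
    return header, rows

sample = """particle_id,frame,x,y,TP_intensity
0,0,-0.017995,-0.025163,1.5
0,1,nan,-0.100932,1.7
0,2,0.037124,-0.100932,1.6
"""

header, rows = read_rows(sample)
particle, time, state, extras = split_columns(header)
print(particle, time, state, extras)
print(len(rows))  # the row with a NaN is not kept
```

Here the second column is called `frame`, yet it is still used as the time index because only its position matters.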

Extras columns

Per-observation metadata is carried in extra numeric columns, classified by a name prefix:

| Prefix | Kind | Example |
|---|---|---|
| `G_` | Global scalar (constant for the whole dataset) | `G_temperature` (stored in the header on save) |
| `TG_` | Time-dependent global (depends on \(t\) only) | `TG_field` (an external drive protocol) |
| `P_` | Per-particle constant (depends on particle only) | `P_radius` (individual particle sizes) |
| `TP_` | Time- and particle-dependent | `TP_intensity` (per-detection fluorescence) |

On load these populate extras_global / extras_local of the dataset and become available to state functions through the extras mechanism (see Trajectory data).
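
Each prefix implies a constraint on the values in a long-format table: a `TG_` column must take a single value per time step, a `P_` column a single value per particle, and so on. The sketch below checks that constraint on plain tuples. It is an illustration only: `grouping_key` and `is_consistent` are hypothetical names, and whether the loader performs such a check is not stated here.

```python
def grouping_key(column, particle_id, time_step):
    """Return the key within which an extras column must be constant."""
    if column.startswith("TG_"):
        return ("time", time_step)
    if column.startswith("TP_"):
        return ("time-particle", particle_id, time_step)
    if column.startswith("G_"):
        return ("global",)
    if column.startswith("P_"):
        return ("particle", particle_id)
    raise ValueError(f"{column!r} has no extras prefix")

def is_consistent(column, rows):
    """Check that the column takes one value per group implied by its prefix.

    rows is an iterable of (particle_id, time_step, value) tuples.
    """
    seen = {}
    for particle_id, time_step, value in rows:
        key = grouping_key(column, particle_id, time_step)
        # setdefault stores the first value seen for this group.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Two particles, two time steps; the value changes with time only.
rows = [(0, 0, 1.0), (1, 0, 1.0), (0, 1, 2.0), (1, 1, 2.0)]
tg_ok = is_consistent("TG_field", rows)   # one value per time step
p_ok = is_consistent("P_radius", rows)    # particle 0 has two values
print(tg_ok, p_ok)
```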

Metadata header

A file can carry a YAML metadata mapping — most importantly the time step dt:

CSV — leading comment lines: a # --- opener followed by # key: value lines.
```
# ---
# dt: 0.01
# description: 2D optical tweezer
particle_id,frame,x,y
0,0,-0.017995,-0.025163
0,1,0.037124,-0.100932
```
(Plain # key: value lines without the # --- opener are also accepted, as in examples/experimental_data/optical_tweezer.csv.)
Parquet — the same YAML string stored in the table schema metadata under the key sfi_yaml_header.
HDF5 — one dataset per column inside a table group; the YAML string stored as the root attribute sfi_yaml_header.
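
To inspect a CSV header without loading the data, the leading comment lines can be read by hand. The following is a simplified sketch, not the library's parser: `read_csv_header` is a hypothetical helper that handles flat `key: value` entries only (no nested YAML) and returns every value as a string.

```python
def read_csv_header(text):
    """Collect leading '# key: value' lines into a dict of strings."""
    meta = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break                      # first non-comment line ends the header
        body = line[1:].strip()
        if body == "---" or not body:
            continue                   # optional opener or empty comment
        key, sep, value = body.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

sample = """# ---
# dt: 0.01
# description: 2D optical tweezer
particle_id,frame,x,y
0,0,-0.017995,-0.025163
"""

meta = read_csv_header(sample)
dt = float(meta["dt"])
print(dt, meta["description"])
```

Because the `# ---` opener is skipped rather than required, the same function also reads files that contain only plain `# key: value` lines.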

Recognised keys:

dt — scalar sampling interval (seconds, or your time unit). Accepted both at the top level and inside extras_global (files written by TrajectoryCollection.save() use the latter);
extras_global — a mapping of arbitrary scalars or arrays. The special key t (a length-T vector) defines a non-uniform time axis and overrides dt;
anything else is kept as free-form dataset metadata (coll.datasets[0].meta).
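
The way these keys combine into a time axis can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `resolve_time_axis` is a hypothetical helper, the uniform axis is assumed to start at 0, and the sketch assumes that a `dt` inside `extras_global` is preferred over a top-level one when both are present.

```python
def resolve_time_axis(meta, n_steps):
    """Return the time axis implied by a metadata mapping."""
    extras = meta.get("extras_global", {})
    if "t" in extras:
        # An explicit time vector overrides dt.
        t = list(extras["t"])
        if len(t) != n_steps:
            raise ValueError("t must have one entry per time step")
        return t
    # Assumed precedence: extras_global["dt"], then top-level dt.
    dt = extras.get("dt", meta.get("dt"))
    if dt is None:
        raise KeyError("no dt or t found in metadata")
    return [i * float(dt) for i in range(n_steps)]

uniform = resolve_time_axis({"dt": 0.5}, 3)
nested = resolve_time_axis({"extras_global": {"dt": 0.5}}, 3)
explicit = resolve_time_axis(
    {"dt": 0.5, "extras_global": {"t": [0.0, 0.1, 0.5]}}, 3
)
print(uniform, nested, explicit)
```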

Named columns

When a file does not follow the positional/canonical layout above, select the columns explicitly — by name for any format, or by index for CSV:

```python
coll = TrajectoryCollection.load(
    "raw_tracks.csv",
    particle_column="particle",      # or an int index (CSV only)
    time_column="t",
    state_columns=("x", "y"),        # drops every other non-extras column
)
```

For in-memory tables, TrajectoryCollection.from_dataframe() is the more convenient entry point (auto-detection of common column names) — see Trajectory data.

Loading behaviour

TrajectoryCollection.load() accepts a single file or a directory and takes two options:

relabel=True (default) — particle IDs are compressed to 0..N-1 and time indices shifted to start at 0. The original IDs are recorded in extras_local["original_particle_id"].
compress_particles=False — when True, particles whose time supports do not overlap (with a 2-frame safety buffer) are packed into the same column slot. Useful for open-boundary data where particles enter and leave the field of view, which otherwise makes the array width grow with the total number of unique tracks rather than the concurrent count. The mapping is stored in dataset.meta["particle_column_map"].
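
Both options can be pictured with a short standalone sketch. It is not the library's implementation: `relabel` and `pack_slots` are hypothetical helpers, the new IDs are assumed to follow the sorted order of the original IDs, and the safety buffer is interpreted here as "a track may reuse a slot only if it starts more than `buffer` frames after the previous occupant ended".

```python
def relabel(ids):
    """Map arbitrary particle IDs to 0..N-1 (sorted order of the originals)."""
    originals = sorted(set(ids))
    lookup = {pid: i for i, pid in enumerate(originals)}
    return [lookup[pid] for pid in ids], originals

def pack_slots(supports, buffer=2):
    """Greedy packing of tracks with non-overlapping time supports.

    supports maps particle ID -> (first_frame, last_frame), inclusive.
    Returns a mapping particle ID -> column slot.
    """
    slot_end = []          # last frame of the current occupant of each slot
    assignment = {}
    by_start = sorted(supports.items(), key=lambda item: item[1][0])
    for pid, (start, end) in by_start:
        for slot, last in enumerate(slot_end):
            if start > last + buffer:
                slot_end[slot] = end       # reuse a free slot
                assignment[pid] = slot
                break
        else:
            slot_end.append(end)           # no free slot: open a new one
            assignment[pid] = len(slot_end) - 1
    return assignment

new_ids, originals = relabel([17, 42, 17, 99])
print(new_ids, originals)

supports = {0: (0, 10), 1: (5, 20), 2: (13, 30), 3: (23, 40)}
slots = pack_slots(supports)
print(slots)  # four tracks share two column slots
```

In this example the array width is set by the two concurrent tracks rather than by the four unique ones, which is the situation the option targets.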

Weights: every load initialises dataset weights with the default "pool" policy; call coll.with_weights(...) after loading if you need a different policy.

Multi-dataset directories

A collection with several datasets saves to a directory:

```
my_experiments/
├── ds_000.parquet
├── ds_001.parquet
└── manifest.yaml        # records dataset names and filenames
```

TrajectoryCollection.load("my_experiments/") reconstructs the full collection, one dataset per file.

Round trip

```python
import jax.numpy as jnp
from SFI.trajectory import TrajectoryCollection

coll = TrajectoryCollection.from_arrays(X=jnp.zeros((100, 3, 2)), dt=0.05)
coll.save("run.parquet")
coll2 = TrajectoryCollection.load("run.parquet")
```

State arrays, masks, time axis, extras, and metadata survive the round trip, up to the loss of masked samples (which are never written).