Specform

Dataset snapshot ledger for reproducible clinical analysis.

You work with dataset aliases via a notebook-native object: DatasetRef.

Core idea: immutable dataset snapshots (DS), mutable alias pointers (your "current dataset"), and a data access object (DAO), DatasetRef, that makes working with them feel natural.


Install

pip install specform

Initialize a workspace. This step is optional, but it is the fastest way to understand Specform:

specform init

This creates:

  • .specform/ workspace
  • demo_brca_03-22-2006.csv
  • specform_template.ipynb

Open the notebook and run it.

It demonstrates:

  • Adding a CSV snapshot to alias brca
  • Editing it as a DataFrame
  • Checkpointing a new immutable DS
  • Viewing alias history
  • Exporting a portable data bundle
  • Rolling back to version 1

That's the entire mental model in 30 seconds.


Minimal Manual Workflow (No Template)

1) Create a session

from specform import Specform
 
sf = Specform(home=".", author="krish")

2) Add a dataset snapshot (DS)

brca = sf.dataset("brca")
brca.add("data/brca.csv", note="raw export")

You just created an immutable Dataset Snapshot (DS) and pointed alias brca at it.

3) Work with it

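df = brca.df()    # load the alias's current snapshot as a DataFrame
df.head()
 
# Save the edited frame as a new immutable DS and point the alias at it
brca.checkpoint(df.dropna(), note="drop NA rows")
brca.history()    # view the alias history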
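
To make the checkpoint, history, and rollback behavior concrete, here is a minimal toy model in plain Python. It is not Specform's implementation: the Snapshot and ToyAlias names, the use of SHA-256 over raw bytes, and the 1-based version numbers are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable record: the payload plus a fingerprint of its bytes."""
    fingerprint: str
    payload: bytes


@dataclass
class ToyAlias:
    """Hypothetical stand-in for an alias: a movable pointer into an append-only log."""
    name: str
    _log: list = field(default_factory=list)  # (Snapshot, note) entries, never removed
    _current: int = -1                        # index into _log; -1 means "empty"

    def checkpoint(self, payload: bytes, note: str = "") -> Snapshot:
        snap = Snapshot(hashlib.sha256(payload).hexdigest(), payload)
        self._log.append((snap, note))        # history only ever grows
        self._current = len(self._log) - 1    # alias now points at the newest entry
        return snap

    def rollback(self, version: int) -> Snapshot:
        # Versions are 1-based here. Rolling back moves the pointer and deletes nothing.
        if not 1 <= version <= len(self._log):
            raise ValueError(f"no version {version} for alias {self.name!r}")
        self._current = version - 1
        return self._log[self._current][0]

    def current(self) -> Snapshot:
        if self._current < 0:
            raise LookupError(f"alias {self.name!r} has no snapshots yet")
        return self._log[self._current][0]

    def history(self) -> list:
        # (version, short fingerprint, note) for every snapshot ever checkpointed
        return [(i + 1, snap.fingerprint[:12], note)
                for i, (snap, note) in enumerate(self._log)]


brca = ToyAlias("brca")
v1 = brca.checkpoint(b"id,age\n1,50\n2,\n", note="raw export")
v2 = brca.checkpoint(b"id,age\n1,50\n", note="drop NA rows")
brca.rollback(1)

assert brca.current() == v1          # the pointer moved back to version 1
assert len(brca.history()) == 2      # version 2 is still in the history
```

In this sketch a rollback only moves the pointer. Whether Specform also records the rollback as its own history entry is not stated here, so the toy leaves that out.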

Mental Model

  • A DS is immutable; its identity is a fingerprint of its canonical bytes
  • Aliases are mutable pointers (your "current dataset")
  • DatasetRef is the DAO (data access object) for an alias
  • Notes never affect identity; they are metadata only
  • History is append-only

Nothing silently mutates.
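
As a rough illustration of "identity = canonical bytes fingerprint", the sketch below hashes CSV content after normalizing it, and keeps the note outside the hashed bytes. The canonical_fingerprint function, the choice of SHA-256, and the normalization rule (parse the CSV, rewrite it with "\n" line endings) are assumptions for this example. Specform's actual canonicalization is not described in this document.

```python
import csv
import hashlib
import io


def canonical_fingerprint(csv_text: str) -> str:
    """Hash CSV content after rewriting it in one fixed form (hypothetical rule)."""
    # newline="" keeps the original line endings so the csv module can parse them itself
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return hashlib.sha256(out.getvalue().encode("utf-8")).hexdigest()


unix_export = "id,age\n1,50\n"
windows_export = "id,age\r\n1,50\r\n"   # same data, different line endings
edited = "id,age\n1,51\n"               # one value changed

# The note lives next to the fingerprint, never inside the hashed bytes.
entry_a = {"fingerprint": canonical_fingerprint(unix_export), "note": "raw export"}
entry_b = {"fingerprint": canonical_fingerprint(windows_export), "note": "re-exported"}

assert entry_a["fingerprint"] == entry_b["fingerprint"]        # same data, same identity
assert canonical_fingerprint(edited) != entry_a["fingerprint"]  # new data, new identity
```

Because the note is stored alongside the fingerprint rather than hashed with it, rewording a note cannot change which snapshot an entry refers to.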


Next