How to use bindsight¶
Practical end-to-end walkthrough. The discovery half (RNA-seq → ranked surface-antigen targets) runs on CPU; the design half (RFdiffusion → ProteinMPNN → Boltz-2) runs on a GPU backend you choose.
Install¶
bindsight is installed from source (not yet on PyPI):
git clone https://github.com/mikhaeelatefrizk/bindsight
cd bindsight
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -e ".[discover,report]"
Then:
bindsight --version # 0.2.0
bindsight doctor # check the install + cache state
bindsight verify-licenses # see the per-component license inventory
bindsight doctor is the first thing to run if anything looks off later — it
tells you what's installed, what's cached, and what's missing.
Optional extras: .[runners] (Modal/Kaggle clients), .[workflow] (the
Snakemake front-end). .[all] installs everything.
Quick start: the demo¶
bindsight demo
This runs the full discovery half on a real TCGA-BRCA tumor-vs-adjacent-normal cohort: on first run it auto-downloads the cohort (STAR-Counts) from the NIH/GDC open-access API, populates the full SURFY surfaceome, runs real DESeq2 + Open Targets enrichment, and writes a ranked candidate table + a self-contained HTML report. First run needs internet (cohort + SURFY cached afterwards) and takes a few minutes; CPU-only, no GPU.
Concept: a "run"¶
A run is one invocation of the pipeline against one config, producing one self-contained output directory:
runs/my_run/
├── deg/results.parquet # one row per gene (DESeq2)
├── targets/candidates.parquet # one row per (gene, UniProt) candidate
├── epitopes/epitopes.parquet # one row per top-N target
├── design/ # binder designs (bindsight design)
├── validate/validated.parquet # structure + affinity metrics (bindsight validate)
├── rank/ranking.parquet # composite-ranked binders (bindsight rank)
├── report.html # self-contained HTML report (bindsight report)
└── run_manifest.jsonld # PROV-O audit trail of every stage
run_manifest.jsonld is what makes the run reproducible. Treat it like a lab
notebook entry: never edit, always keep.
Two front-ends, one pipeline¶
You can drive the exact same pipeline two ways:
- CLI (recommended):
bindsight discover|design|validate|rank|report|export, orbindsight run <config>for the whole chain. - Snakemake (optional,
pip install -e ".[workflow]"):snakemake --configfile <config> --cores 4. Each rule calls the samebindsight.*functions, so artifacts are identical.
Step 1 — Author a config¶
Configs are YAML validated by bindsight.config.RunConfig (validation runs at
load time, so a typo fails loudly before any compute). Start from an example:
cp examples/tcga_luad.yaml my_config.yaml
A minimal config:
name: my_first_run
out_dir: runs/my_first_run
inputs:
counts: data/my_counts.tsv.gz # gene × sample, integer counts
design: data/my_design.tsv # sample, condition, ...
# Optional: auto-download a real TCGA cohort if the files above are absent:
# download: { project: TCGA-BRCA, n_tumor: 20, n_normal: 20 }
params:
deg:
design_formula: "~ condition"
contrast: ["condition", "tumor", "normal"]
fdr_threshold: 0.05
log2fc_threshold: 1.0
target_discovery:
require_surfy: true
use_open_targets: true
require_tractable_modality: ["Antibody"]
max_safety_events: 5
top_n: 10
backend: modal # GPU backend for the design half: modal | local_docker | kaggle | colab | mock
Step 2 — Provide the data¶
Two TSVs (the counts may be gzipped):
counts — gene IDs in column 1, samples in the rest, integer counts:
gene_id sample_001 sample_002 ...
ENSG00000141736 1245 1389 ...
design — sample IDs in column 1, factors after; the condition column (or
whatever your design_formula references) must contain the contrast levels:
sample condition
sample_001 tumor
normal_001 normal
Where to get real data: set inputs.download to auto-fetch a TCGA cohort
from NIH/GDC (see bindsight/io/gdc.py), or bring your own aligner output
(STAR/Salmon/kallisto counts), or pull pre-aligned counts from
recount3.
Step 3 — Reference data (auto on first use)¶
bindsight caches external data under your OS cache dir; bindsight doctor
shows the state. On first real run these populate automatically:
- SURFY surfaceome (full ~2,886-protein list) — downloaded from wlab.ethz.ch/surfaceome (CC-BY).
- AlphaFoldDB structures + Open Targets evidence — fetched per target.
- SURFACE-Bind targetable-site lookup is implemented: when a vendored
SURFACE-Bind site tree is present, design focuses on those sites; otherwise it
falls back to whole-surface design (
require_surface_bind_site: false, the default).
No manual setup is required for a standard run.
Step 4 — Discover (CPU)¶
bindsight discover my_config.yaml --out runs/my_first_run
DESeq2 → surfaceome filter → Open Targets enrichment → AlphaFoldDB structures →
candidates.parquet + epitopes.parquet + run_manifest.jsonld. Inspect:
python -c "import pandas as pd; print(pd.read_parquet('runs/my_first_run/targets/candidates.parquet').head())"
Step 5 — Design + validate (GPU)¶
# Estimate cost first (no GPU needed):
bindsight design runs/my_first_run --backend modal --dry-run
# Run it: RFdiffusion → ProteinMPNN → Boltz-2 on the chosen backend
bindsight design runs/my_first_run --backend modal --designer rfdiff_mpnn --trajectories 50
bindsight validate runs/my_first_run
The GPU work runs in bindsight.runners.job_exec on the backend you pick:
| Backend | Cost | When to use |
|---|---|---|
colab |
Free (T4) / Pro (A100) | Writes a ready-to-run notebook you execute in Colab |
kaggle |
Free (T4×2, quota) | Headless via the Kaggle API |
modal |
~$0.6–4/GPU-hr | Headless cloud GPUs, no queue |
local_docker |
Your hardware | A local NVIDIA GPU (native or Docker) |
mock |
Free, instant | CI / testing (mock results only) |
Designers: rfdiff_mpnn (default), bindcraft, boltzgen. Validators:
boltz2 (default), chai1r, af2_ig (non-commercial AF2 weights — a banner is
shown). --dry-run always works without a GPU.
Step 6 — Rank, report, export¶
bindsight rank runs/my_first_run
bindsight report runs/my_first_run --format html
bindsight export runs/my_first_run --format ro-crate --out runs/my_first_run.crate.zip
The HTML report is a single self-contained file (embedded volcano plot, ranked
tables, and the full PROV-O manifest). The RO-Crate zip is ready for Zenodo /
Figshare deposit. bindsight report --format streamlit launches an interactive
dashboard instead.
Run the whole chain at once with bindsight run my_config.yaml --out runs/x
(CPU stages always run; GPU stages run on the configured headless backend).
Step 7 — Benchmark against the held-out set¶
bindsight benchmark runs/my_first_run --known-antigens benchmarks/known.tsv --out bench.html
Scores how well the run rediscovers the literature-validated known antigens in
benchmarks/ (recall@k, per-antigen ranks). See benchmarks/PROVENANCE.md.
Reproducibility¶
- Commit the config YAML and the
run_manifest.jsonld. - A collaborator runs the same config (pin the Docker image for byte-identical environments).
- Compare manifests — SHA-256s of every artifact match.
Troubleshooting¶
| Symptom | Fix |
|---|---|
dep: pydeseq2 = not installed in doctor |
pip install -e ".[discover]" |
config validation error |
Read it — Pydantic names the exact field |
samples in counts but not design |
Sample IDs must match across both files |
AlphaFoldDB has no model for X |
Expected for some accessions; the row is tagged and the run continues |
design nothing to do |
Run discover first so epitopes.parquet (with structures) exists |
| Modal/Kaggle "install the runners extra" | pip install -e ".[runners]" + provide credentials |