Skip to content

Usage

This page describes how to run CETSAx–NADPH end-to-end, what each stage does, and how to work with the outputs. The goal is to let you move from raw CETSA data to biologically interpretable results with minimal friction.


Overview

CETSAx is designed as a pipeline-first system. You typically do not run individual scripts manually. Instead, you define your dataset and configuration, and let Snakemake orchestrate all steps.

The workflow follows a strict progression:

  • Data → curve fitting → scoring → systems analysis → sequence modeling → interpretation

Each stage produces outputs that are consumed by the next stage. You can stop early if you only need part of the analysis.


Installation

Clone the repository

git clone https://github.com/bibymaths/CETSAx.git
cd CETSAx

Environment setup

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

This is the fastest and cleanest way to set up dependencies.

conda env create -f environment.yml
conda activate cetsax
pip install -r requirements.txt
pip install -e .

Use this if you rely on Conda-based workflows or shared environments.

docker compose -f docker-compose.dev.yml up

Use this for a fully isolated, reproducible environment.


Hardware notes

GPU strongly recommended for sequence modeling

  • CPU-only execution is sufficient for curve fitting and systems analysis.
  • A CUDA-capable GPU is strongly recommended for sequence modeling (ESM-2).
  • If no GPU is available, set device: "cpu" in config.yaml and disable or skip sequence modeling steps.

Input Data

The pipeline expects CETSA data in tabular form.

Minimal structure

  • id: protein identifier
  • condition: experimental condition or replicate
  • Dose columns: numeric values corresponding to concentrations

Example:

id,condition,0.01,0.1,1,10,100
P12345,rep1,0.98,1.02,1.10,1.25,1.30

The exact column names for doses are defined in config.yaml.


Configuration

All runtime behavior is controlled through config.yaml.

input_csv: "data/nadph.csv"

python_bin: ".venv/bin/python"

epochs: 10
batch_size: 8
task: "classification"
device: "cuda"

Key parameters

  • input_csv: path to CETSA dataset
  • epochs, batch_size: sequence model training
  • task: "classification" or "regression"
  • device: "cuda" or "cpu"

You rarely need to modify anything else for standard runs.


Running the Pipeline

Execute the full workflow:

snakemake --profile workflow/profiles/local
snakemake --profile workflow/profiles/slurm
docker compose -f docker-compose.dev.yml run --rm cetsax \
    snakemake --profile workflow/profiles/local

What happens internally

Snakemake executes the pipeline in stages:

  1. Curve fitting
  2. Hit calling and QC
  3. System-level analysis
  4. Sequence dataset construction
  5. Model training
  6. Prediction and evaluation
  7. Network analysis
  8. Visualization

Each step is cached. If a step completes successfully, it will not be recomputed unless inputs change.


Pipeline Stages (What Each Step Does)

1. Curve Fitting

  • Fits a logistic model to each protein’s dose–response curve
  • Outputs:

    • EC50
    • Hill coefficient
    • Δmax

This is the quantitative backbone of the pipeline.


2. Hit Calling

  • Filters unreliable fits
  • Classifies proteins into:

    • strong responders
    • weak responders
    • non-responders

This step converts raw fits into interpretable biological categories.


3. System-Level Analysis

  • PCA and clustering of response profiles
  • Pathway enrichment (if annotations are available)

This step reveals coordinated behavior across proteins.


4. Sequence Dataset Construction

  • Maps protein IDs to sequences (UniProt rescue if needed)
  • Builds labeled dataset for ML

This is where experimental data becomes training data.


5. Sequence Model Training

  • Uses ESM-2 embeddings
  • Learns sequence determinants of NADPH response

Training modes include:

  • pooled embeddings (fastest)
  • residue representations (interpretability)
  • token-level training (most flexible, slowest)

6. Prediction and Evaluation

  • Generates predictions for all proteins
  • Produces:

    • confusion matrix
    • ROC curves
    • probability distributions

7. Explainability

  • Computes saliency maps
  • Computes Integrated Gradients

Outputs are residue-level importance scores.


8. Network Analysis

  • Builds co-stabilization network
  • Detects modules of coordinated response

This provides system-level structure.


Outputs

All results are stored under results/.

report.html — Integrated Results Overview

The report.html file is a self-contained HTML report generated by Snakemake that weaves all pipeline results — figures, tables, and metrics — into a single interactive overview. It is the recommended first point of inspection after a successful pipeline run.

Key directories

  • results/hit_calling/ → fit results and classifications
  • results/system_analysis/ → PCA, clustering
  • results/network/ → co-stabilization graphs
  • results/plots/ → model evaluation and interpretability
  • results/detailed_plots/ → per-protein curves

How to Work With Results

Curve fits

  • Use EC50 and Δmax to assess biochemical relevance
  • Check R² before trusting any result

Hit tables

  • Focus on strong responders
  • Look for pathway-level enrichment

PCA and clustering

  • Identify response modes
  • Detect groups of functionally related proteins

Sequence model outputs

  • Evaluate model performance (ROC, confusion matrix)
  • Inspect misclassifications for interesting biology

Interpretability outputs

  • Map important residues to:

    • domains
    • binding sites
    • motifs

This is where the model becomes biologically useful.


Network outputs

  • Identify modules
  • Connect proteins through shared response patterns

Running Individual Steps (Optional)

You can run specific parts of the pipeline:

snakemake results/hit_calling/
snakemake results/network/

This is useful for debugging or iterative development.


Common Issues

Pipeline stops early

Check missing dependencies and verify config.yaml paths are correct and the input CSV exists.

Poor curve fits

Data may be noisy or non-sigmoidal. Increase QC thresholds or inspect raw curves before proceeding.

Sequence model fails

Likely a GPU memory issue. Reduce esm_batch_size or switch to mode: "pooled" in config.yaml.

Empty outputs

Usually caused by overly strict filtering. Relax QC thresholds in config.yaml.

For first-time users:

  • Run full pipeline
  • Inspect results/hit_calling/
  • Move to system_analysis/
  • Then evaluate sequence model
  • Finally, interpret saliency and network results

Avoid jumping directly to deep learning outputs without validating upstream steps.


Summary

The intended usage pattern is:

  • Run once → inspect outputs → refine parameters → rerun selectively

CETSAx is designed to be iterative, not one-shot. The more you align the pipeline with your biological question, the more meaningful the results become.