Usage¶
This page describes how to run CETSAx–NADPH end-to-end, what each stage does, and how to work with the outputs. The goal is to let you move from raw CETSA data to biologically interpretable results with minimal friction.
Overview¶
CETSAx is designed as a pipeline-first system. You typically do not run individual scripts manually. Instead, you define your dataset and configuration, and let Snakemake orchestrate all steps.
The workflow follows a strict progression:
- Data → curve fitting → scoring → systems analysis → sequence modeling → interpretation
Each stage produces outputs that are consumed by the next stage. You can stop early if you only need part of the analysis.
Installation¶
Clone the repository¶
```bash
git clone https://github.com/yourusername/cetsax-nadph.git
cd cetsax-nadph
```
Environment setup (uv, recommended)¶
```bash
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
```
This is the fastest and cleanest way to set up dependencies.
Alternative: Conda¶
```bash
conda env create -f environment.yml
conda activate cetsax
pip install -r requirements.txt
pip install -e .
```
Use this if you rely on Conda-based workflows or shared environments.
Hardware notes¶
- CPU-only execution is sufficient for curve fitting and systems analysis.
- A CUDA-capable GPU is strongly recommended for sequence modeling (ESM-2).
- If no GPU is available, disable or skip sequence modeling steps.
Input Data¶
The pipeline expects CETSA data in tabular form.
Minimal structure¶
- id: protein identifier
- condition: experimental condition or replicate
- Dose columns: numeric values corresponding to concentrations
Example:
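For illustration, a minimal input table might look like the following. The dose column names shown here are hypothetical placeholders — the real names come from config.yaml:

```csv
id,condition,dose_0.1,dose_1,dose_10,dose_100
P00338,treated,1.00,0.92,0.61,0.24
Q9Y617,treated,0.99,0.97,0.95,0.93
```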
The exact column names for doses are defined in config.yaml.
Configuration¶
All runtime behavior is controlled through config.yaml.
```yaml
input_csv: "data/nadph.csv"
python_bin: ".venv/bin/python"
epochs: 10
batch_size: 8
task: "classification"
device: "cuda"
```
Key parameters¶
- input_csv: path to the CETSA dataset
- epochs, batch_size: sequence model training settings
- task: "classification" or "regression"
- device: "cuda" or "cpu"
You rarely need to modify anything else for standard runs.
Running the Pipeline¶
Execute the full workflow:
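The exact invocation depends on your Snakefile, but a typical full run looks like the following (core count and flags are up to you):

```bash
# Dry-run first to preview what will execute
snakemake -n
# Full run using 8 cores
snakemake --cores 8
```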
What happens internally¶
Snakemake executes the pipeline in stages:
- Curve fitting
- Hit calling and QC
- System-level analysis
- Sequence dataset construction
- Model training
- Prediction and evaluation
- Network analysis
- Visualization
Each step is cached. If a step completes successfully, it will not be recomputed unless inputs change.
Pipeline Stages (What Each Step Does)¶
1. Curve Fitting¶
- Fits a logistic model to each protein’s dose–response curve
- Outputs:
  - EC50
  - Hill coefficient
  - R²
  - Δmax
This is the quantitative backbone of the pipeline.
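To make the outputs concrete, here is a sketch of fitting a four-parameter logistic to one protein's dose–response data with SciPy. The parameterization, doses, and values below are illustrative assumptions, not the pipeline's exact implementation:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(dose, bottom, top, ec50, hill):
    """Four-parameter logistic dose-response model."""
    return bottom + (top - bottom) / (1.0 + (ec50 / dose) ** hill)

# Synthetic single-protein data (doses and true parameters are made up)
doses = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
rng = np.random.default_rng(0)
resp = logistic(doses, 0.1, 1.0, 1.5, 1.2) + rng.normal(0.0, 0.01, doses.size)

# Bounded fit keeps ec50 and hill positive during optimization
popt, _ = curve_fit(logistic, doses, resp, p0=[0.0, 1.0, 1.0, 1.0],
                    bounds=([-1.0, 0.0, 1e-3, 0.1], [1.0, 2.0, 1e3, 5.0]))
bottom, top, ec50, hill = popt
residuals = resp - logistic(doses, *popt)
r2 = 1.0 - np.sum(residuals**2) / np.sum((resp - resp.mean())**2)
dmax = top - bottom  # Δmax: total amplitude of the response
```

A fit like this yields exactly the quantities listed above: EC50, the Hill coefficient, R², and Δmax.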
2. Hit Calling¶
- Filters unreliable fits
- Classifies proteins into:
  - strong responders
  - weak responders
  - non-responders
This step converts raw fits into interpretable biological categories.
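The classification logic can be sketched as a simple thresholding rule. The thresholds and category names below are illustrative, not the pipeline's defaults:

```python
def call_hit(r2, dmax, r2_min=0.8, strong_cut=0.3, weak_cut=0.1):
    """Toy hit-calling rule: filter bad fits, then bin by response amplitude."""
    if r2 < r2_min:
        return "unreliable fit"
    if abs(dmax) >= strong_cut:
        return "strong responder"
    if abs(dmax) >= weak_cut:
        return "weak responder"
    return "non-responder"
```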
3. System-Level Analysis¶
- PCA and clustering of response profiles
- Pathway enrichment (if annotations are available)
This step reveals coordinated behavior across proteins.
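As a sketch of the PCA step (assuming a proteins × doses response matrix; this is not the pipeline's implementation), the projection onto the top principal components reduces to an SVD:

```python
import numpy as np

def pca_scores(profiles, n_components=2):
    """Project mean-centered response profiles (proteins x doses) onto top PCs via SVD."""
    X = profiles - profiles.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(1)
profiles = rng.normal(size=(50, 6))  # 50 proteins x 6 doses (synthetic)
scores = pca_scores(profiles)        # one 2-D coordinate per protein
```

Proteins that land close together in score space have similar dose–response behavior, which is what clustering then picks up.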
4. Sequence Dataset Construction¶
- Maps protein IDs to sequences (UniProt rescue if needed)
- Builds labeled dataset for ML
This is where experimental data becomes training data.
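Conceptually this stage is a join between hit labels and sequences; proteins missing either side are dropped. A minimal sketch (IDs, sequences, and labels below are made up):

```python
# Toy join of hit classifications with sequences.
sequences = {"P1": "MKVLAT", "P2": "MLAGGE", "P3": "MGGKRV"}
labels = {"P1": "strong responder", "P2": "non-responder"}

# P3 has no label, so it is excluded from the training set
dataset = [(pid, seq, labels[pid]) for pid, seq in sequences.items() if pid in labels]
```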
5. Sequence Model Training¶
- Uses ESM-2 embeddings
- Learns sequence determinants of NADPH response
Training modes include:
- pooled embeddings (fastest)
- residue representations (interpretability)
- token-level training (most flexible, slowest)
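The difference between pooled and residue-level modes can be sketched with arrays standing in for ESM-2 output (320 is the embedding width of the smallest ESM-2 checkpoint; the numbers here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 120, 320                        # residues, embedding dimension
residue_emb = rng.normal(size=(L, D))  # stand-in for per-residue ESM-2 output

pooled = residue_emb.mean(axis=0)      # pooled mode: one vector per protein (fastest)
per_residue = residue_emb              # residue mode: positions kept for interpretability
```

Pooling collapses the length dimension, which is why it is fast but cannot localize the signal to specific residues.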
6. Prediction and Evaluation¶
- Generates predictions for all proteins
- Produces:
  - confusion matrix
  - ROC curves
  - probability distributions
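For reference, a confusion matrix is just a count table of true versus predicted classes; a minimal sketch (not the pipeline's plotting code):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```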
7. Explainability¶
- Computes saliency maps
- Computes Integrated Gradients
Outputs are residue-level importance scores.
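Integrated Gradients itself is model-agnostic; on a toy differentiable function it reduces to a short Riemann sum over interpolated inputs. This is a sketch of the method, not the pipeline's implementation:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Midpoint-rule approximation of Integrated Gradients attributions."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

# Toy model f(x) = sum(x_i^2): each attribution should equal x_i^2
grad_f = lambda x: 2.0 * x
x = np.array([1.0, -2.0, 0.5])
attr = integrated_gradients(grad_f, x, np.zeros_like(x))
total = attr.sum()  # completeness: equals f(x) - f(baseline) = 5.25
```

The completeness property (attributions sum to the model-output difference) is what makes these scores interpretable as per-residue contributions.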
8. Network Analysis¶
- Builds co-stabilization network
- Detects modules of coordinated response
This provides system-level structure.
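One common way to build such a network (an assumption here, not necessarily the pipeline's exact method) is to link proteins whose response profiles are strongly correlated:

```python
import numpy as np

def costab_edges(profiles, threshold=0.9):
    """Adjacency matrix linking proteins whose profiles correlate above a cutoff."""
    corr = np.corrcoef(profiles)
    adj = corr >= threshold
    np.fill_diagonal(adj, False)  # no self-edges
    return adj

profiles = np.array([
    [1.0, 0.80, 0.50, 0.2],  # protein A
    [1.0, 0.81, 0.49, 0.2],  # protein B: nearly identical to A
    [0.2, 0.50, 0.80, 1.0],  # protein C: anti-correlated with A
])
adj = costab_edges(profiles)
```

Module detection then amounts to finding densely connected groups in this adjacency structure.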
Outputs¶
All results are stored under results/.
Key directories¶
- results/hit_calling/ → fit results and classifications
- results/system_analysis/ → PCA, clustering
- results/network/ → co-stabilization graphs
- results/plots/ → model evaluation and interpretability
- results/detailed_plots/ → per-protein curves
How to Work With Results¶
Curve fits¶
- Use EC50 and Δmax to assess biochemical relevance
- Check R² before trusting any result
Hit tables¶
- Focus on strong responders
- Look for pathway-level enrichment
PCA and clustering¶
- Identify response modes
- Detect groups of functionally related proteins
Sequence model outputs¶
- Evaluate model performance (ROC, confusion matrix)
- Inspect misclassifications for interesting biology
Interpretability outputs¶
- Map important residues to:
  - domains
  - binding sites
  - motifs
This is where the model becomes biologically useful.
Network outputs¶
- Identify modules
- Connect proteins through shared response patterns
Running Individual Steps (Optional)¶
You can run specific parts of the pipeline:
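Snakemake builds whatever rules are needed to produce a requested target, so you can ask for a specific output. The target paths below follow the results layout described above, but check your Snakefile for the actual targets:

```bash
# Rebuild just the hit-calling outputs, plus anything they depend on
snakemake results/hit_calling/ --cores 4
# Force one stage to rerun even if its outputs already exist
snakemake results/hit_calling/ --cores 4 --force
```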
This is useful for debugging or iterative development.
Common Issues¶
Pipeline stops early¶
- Check for missing dependencies
- Verify the paths in config.yaml
Poor curve fits¶
- Data may be noisy or non-sigmoidal
- Increase QC thresholds or inspect raw curves
Sequence model fails¶
- Likely GPU memory issue
- Reduce batch size or switch to pooled mode
Empty outputs¶
- Usually due to strict filtering
- Relax thresholds in configuration
Recommended Workflow¶
For first-time users:
- Run the full pipeline
- Inspect results/hit_calling/
- Move to system_analysis/
- Then evaluate the sequence model
- Finally, interpret saliency and network results
Avoid jumping directly to deep learning outputs without validating upstream steps.
Summary¶
The intended usage pattern is:
- Run once → inspect outputs → refine parameters → rerun selectively
CETSAx is designed to be iterative, not one-shot. The more you align the pipeline with your biological question, the more meaningful the results become.