Data¶

Overview¶

The dataset used in CETSAx–NADPH originates from a Cellular Thermal Shift Assay (CETSA) study measuring the thermal stability of human proteome proteins in response to graded concentrations of NADPH. Each row represents a single protein under a specific experimental condition (replicate). The dose–response profile across ten NADPH concentrations is the primary input to the curve-fitting stage.

The example data (data/nadph.csv) is derived from the publicly available dataset of Dziekan et al. (Nature Protocols, 2020, DOI: 10.1038/s41596-020-0310-z). The original .Rdata file was converted to .csv format for use in this pipeline.

Using Your Own Data

Any dataset that matches the schema described below will work with CETSAx. You do not need to use the provided example file. Simply point input_csv in config.yaml to your own CSV file, and update dose_columns under experiment: to reflect your actual NADPH concentrations.

File Format¶

The required input is a comma-separated values (.csv) file with the following structure:

An optional index column (any column whose name starts with Unnamed: is automatically dropped).
One protein identifier column.
One condition column (typically distinguishes experimental replicates).
Exactly ten numeric dose columns corresponding to the NADPH concentrations used.
Three quality-control columns.

Minimal Example¶

"","id","description","condition","3.81e-06","1.526e-05","6.104e-05","0.00024414",
"0.00097656","0.00390625","0.015625","0.0625","0.25","1","sumUniPeps","sumPSMs","countNum"
"1","P04075","Fructose-bisphosphate aldolase A","NADPH.r1",1,0.97,0.99,0.97,...,3,1019,72

Data Dictionary¶

Column	Type	Description	Example
(index)	integer	Optional row index — dropped automatically on load	`1`
`id`	string	UniProt accession or protein identifier	`P04075`
`description`	string	Human-readable protein description	`Fructose-bisphosphate aldolase A`
`condition`	string	Experimental condition or replicate label	`NADPH.r1`, `NADPH.r2`
`3.81e-06`	float	Normalised abundance at 3.81 µM NADPH	`1.000`
`1.526e-05`	float	Normalised abundance at 15.26 µM NADPH	`0.974`
`6.104e-05`	float	Normalised abundance at 61.04 µM NADPH	`0.991`
`0.00024414`	float	Normalised abundance at 244.14 µM NADPH	`0.966`
`0.00097656`	float	Normalised abundance at 976.56 µM NADPH	`0.967`
`0.00390625`	float	Normalised abundance at 3.906 mM NADPH	`0.978`
`0.015625`	float	Normalised abundance at 15.625 mM NADPH	`1.002`
`0.0625`	float	Normalised abundance at 62.5 mM NADPH	`0.984`
`0.25`	float	Normalised abundance at 250 mM NADPH	`1.046`
`1`	float	Normalised abundance at 1 M NADPH	`0.987`
`sumUniPeps`	integer	Number of unique peptides identified	`3`
`sumPSMs`	integer	Total peptide-spectrum matches	`1019`
`countNum`	integer	Number of quantified data points	`72`

Abundance normalisation

Dose columns contain fold-change values relative to the vehicle/lowest-dose condition. Values around 1.0 indicate no change; values > 1 indicate stabilisation and values < 1 indicate destabilisation.

Quality Control Thresholds¶

Proteins are filtered before curve fitting using three thresholds defined in config.yaml:

Parameter	Default	Meaning
`min_unique_peptides`	3	Minimum number of unique peptides required
`min_psms`	15	Minimum total PSMs required
`min_countnum`	8	Minimum quantified data points required

Rows that fall below any threshold are excluded before downstream analysis.

Notes & Warnings¶

Delimiter

The pipeline expects a comma-separated (.csv) file. Tab-separated or semicolon-separated files must be converted before use.

Missing dose columns

If any of the dose columns listed under experiment.dose_columns in config.yaml are absent from your CSV, the pipeline will raise a KeyError at the curve-fitting stage. Always verify that your column names match the config exactly.

Non-numeric dose values

Dose columns must contain numeric values. String values (e.g. "N/A", empty cells) are coerced to NaN and may cause individual proteins to be excluded.

Replicates

CETSAx is designed to work with at least two replicates per condition. The hit-calling stage uses replicate agreement to increase confidence. Single-replicate data can be analysed but will produce less reliable hit calls.

Large datasets

The provided example file contains ~12,000 rows. For larger proteomes, increase the parallelism in Snakemake (--cores) and consider enabling the embedding cache (pooled_cache: true in config.yaml) to avoid redundant ESM-2 forward passes.