Data

Overview

The dataset used in CETSAx–NADPH originates from a Cellular Thermal Shift Assay (CETSA) study that measured the thermal stability of proteins across the human proteome in response to graded concentrations of NADPH. Each row represents a single protein under a specific experimental condition (typically one replicate). The dose–response profile across the ten NADPH concentrations is the primary input to the curve-fitting stage.

The example data (data/nadph.csv) is derived from the publicly available dataset of Dziekan et al. (Nature Protocols, 2020, DOI: 10.1038/s41596-020-0310-z). The original .Rdata file was converted to .csv format for use in this pipeline.

Using Your Own Data

Any dataset that matches the schema described below will work with CETSAx. You do not need to use the provided example file. Simply point input_csv in config.yaml to your own CSV file, and update dose_columns under experiment: to reflect your actual NADPH concentrations.


File Format

The required input is a comma-separated values (.csv) file with the following structure:

  • An optional index column (any column whose name starts with Unnamed: is automatically dropped).
  • One protein identifier column.
  • One condition column (typically distinguishes experimental replicates).
  • Exactly ten numeric dose columns corresponding to the NADPH concentrations used.
  • Three quality-control columns.
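The structure listed above can be checked with a short loader. This is a minimal sketch using only Python's standard library, not the pipeline's own loading code; `load_rows` is a hypothetical helper, and the sample keeps just two of the ten dose columns for brevity. pandas labels a blank header as `Unnamed: 0`, while the `csv` module leaves it as an empty string, so the sketch drops both.

```python
import csv
import io

def load_rows(handle):
    """Read CSV rows as dicts, dropping any index column.

    A column counts as an index column if its header is empty or
    starts with "Unnamed:" (the name pandas gives a blank header).
    """
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            name: value
            for name, value in row.items()
            if name and not name.startswith("Unnamed:")
        })
    return rows

# Shortened sample: only the two highest dose columns are kept here.
sample = io.StringIO(
    '"","id","description","condition","0.25","1",'
    '"sumUniPeps","sumPSMs","countNum"\n'
    '"1","P04075","Fructose-bisphosphate aldolase A","NADPH.r1",'
    '1.046,0.987,3,1019,72\n'
)
rows = load_rows(sample)
print(list(rows[0]))
```

Note that the `csv` module returns every value as a string, so dose and quality-control values still need converting to numbers before use.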

Minimal Example

"","id","description","condition","3.81e-06","1.526e-05","6.104e-05","0.00024414","0.00097656","0.00390625","0.015625","0.0625","0.25","1","sumUniPeps","sumPSMs","countNum"
"1","P04075","Fructose-bisphosphate aldolase A","NADPH.r1",1,0.97,0.99,0.97,...,3,1019,72

Data Dictionary

Column | Type | Description | Example
(index) | integer | Optional row index; dropped automatically on load | 1
id | string | UniProt accession or protein identifier | P04075
description | string | Human-readable protein description | Fructose-bisphosphate aldolase A
condition | string | Experimental condition or replicate label | NADPH.r1, NADPH.r2
3.81e-06 | float | Normalised abundance at 3.81 µM NADPH | 1.000
1.526e-05 | float | Normalised abundance at 15.26 µM NADPH | 0.974
6.104e-05 | float | Normalised abundance at 61.04 µM NADPH | 0.991
0.00024414 | float | Normalised abundance at 244.14 µM NADPH | 0.966
0.00097656 | float | Normalised abundance at 976.56 µM NADPH | 0.967
0.00390625 | float | Normalised abundance at 3.906 mM NADPH | 0.978
0.015625 | float | Normalised abundance at 15.625 mM NADPH | 1.002
0.0625 | float | Normalised abundance at 62.5 mM NADPH | 0.984
0.25 | float | Normalised abundance at 250 mM NADPH | 1.046
1 | float | Normalised abundance at 1 M NADPH | 0.987
sumUniPeps | integer | Number of unique peptides identified | 3
sumPSMs | integer | Total peptide-spectrum matches | 1019
countNum | integer | Number of quantified data points | 72
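Because the dose column headers are themselves numbers, they can be parsed into concentrations for sorting or plotting. The sketch below assumes the headers are molar concentrations, as the descriptions in the data dictionary imply; `format_molar` is a hypothetical helper and not part of CETSAx.

```python
def format_molar(concentration_m):
    """Render a molar concentration with a µM, mM or M unit."""
    if concentration_m < 1e-3:
        return f"{concentration_m * 1e6:g} µM"
    if concentration_m < 1:
        return f"{concentration_m * 1e3:g} mM"
    return f"{concentration_m:g} M"

# Dose column headers exactly as they appear in the example file.
dose_columns = [
    "3.81e-06", "1.526e-05", "6.104e-05", "0.00024414", "0.00097656",
    "0.00390625", "0.015625", "0.0625", "0.25", "1",
]

# Parse each header to a float so doses can be sorted and used as x-values.
doses = sorted(float(name) for name in dose_columns)

for name in dose_columns:
    print(name, "->", format_molar(float(name)))
```

Keep the original header strings for column lookups and use the parsed floats only for numeric work, since a re-formatted float may not match the header text.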

Abundance normalisation

Dose columns contain fold-change values relative to the vehicle/lowest-dose condition. Values around 1.0 indicate no change; values > 1 indicate stabilisation and values < 1 indicate destabilisation.
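If your own data holds raw intensities rather than fold changes, the normalisation described above can be sketched as follows. The intensities are made up for illustration, `to_fold_change` and `direction` are hypothetical helpers, and the `tolerance` cut-off is an assumption chosen for the example rather than a CETSAx setting.

```python
def to_fold_change(intensities):
    """Divide every value by the first (lowest-dose) value."""
    reference = intensities[0]
    if reference == 0:
        raise ValueError("reference intensity is zero; fold change is undefined")
    return [value / reference for value in intensities]

def direction(fold_change, tolerance=0.05):
    """Label a fold change; `tolerance` is illustrative, not a pipeline default."""
    if fold_change > 1 + tolerance:
        return "stabilised"
    if fold_change < 1 - tolerance:
        return "destabilised"
    return "no change"

# Made-up raw intensities for one protein, lowest dose first.
raw = [200.0, 194.0, 198.0, 230.0]
fold_changes = to_fold_change(raw)
labels = [direction(value) for value in fold_changes]
print(list(zip(fold_changes, labels)))
```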


Quality Control Thresholds

Proteins are filtered before curve fitting using three thresholds defined in config.yaml:

Parameter | Default | Meaning
min_unique_peptides | 3 | Minimum number of unique peptides required
min_psms | 15 | Minimum total PSMs required
min_countnum | 8 | Minimum quantified data points required

Rows that fall below any threshold are excluded before downstream analysis.
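The filter above amounts to three comparisons per row. The sketch below is an illustration rather than the pipeline's implementation: it assumes each threshold applies to the column its name suggests (`min_unique_peptides` to `sumUniPeps`, `min_psms` to `sumPSMs`, `min_countnum` to `countNum`), and the second row is invented to show a failing case.

```python
# Default thresholds as listed in the table above.
QC_DEFAULTS = {"min_unique_peptides": 3, "min_psms": 15, "min_countnum": 8}

def passes_qc(row, thresholds=QC_DEFAULTS):
    """Return True only if the row meets all three thresholds."""
    return (
        int(row["sumUniPeps"]) >= thresholds["min_unique_peptides"]
        and int(row["sumPSMs"]) >= thresholds["min_psms"]
        and int(row["countNum"]) >= thresholds["min_countnum"]
    )

rows = [
    # QC values from the minimal example row.
    {"id": "P04075", "sumUniPeps": "3", "sumPSMs": "1019", "countNum": "72"},
    # Invented row: too few unique peptides, so it is excluded.
    {"id": "EXAMPLE", "sumUniPeps": "2", "sumPSMs": "40", "countNum": "10"},
]
kept = [row["id"] for row in rows if passes_qc(row)]
print(kept)
```

A value equal to a threshold is kept, since only rows that fall below a threshold are excluded.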


Notes & Warnings

Delimiter

The pipeline expects a comma-separated (.csv) file. Tab-separated or semicolon-separated files must be converted before use.
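One way to do the conversion is with Python's `csv` module. This is a minimal sketch; `convert_to_csv` is a hypothetical helper and the sample content is shortened for illustration. It changes only the delimiter, so files that also use decimal commas would need their numbers rewritten separately.

```python
import csv
import io

def convert_to_csv(source, target, delimiter):
    """Copy delimited rows from `source` to `target` as comma-separated rows."""
    reader = csv.reader(source, delimiter=delimiter)
    writer = csv.writer(target)
    writer.writerows(reader)

# In-memory demonstration with a tab-separated sample.
tsv = io.StringIO("id\tcondition\t0.25\nP04075\tNADPH.r1\t1.046\n")
converted = io.StringIO()
convert_to_csv(tsv, converted, delimiter="\t")
print(converted.getvalue())

# For real files, open both with newline="" as the csv module recommends:
# with open("input.tsv", newline="") as src, open("output.csv", "w", newline="") as dst:
#     convert_to_csv(src, dst, delimiter="\t")
```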

Missing dose columns

If any of the dose columns listed under experiment.dose_columns in config.yaml are absent from your CSV, the pipeline will raise a KeyError at the curve-fitting stage. Always verify that your column names match the config exactly.
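A pre-flight check can surface this problem before the pipeline runs. The sketch below is a standalone illustration, not part of CETSAx; `missing_dose_columns` is a hypothetical helper, and the configured list is assumed to have already been read from config.yaml. Column names are compared as text, so `0.250` in the CSV does not match `0.25` in the config.

```python
def missing_dose_columns(header, dose_columns):
    """Return configured dose columns that are absent from the CSV header."""
    present = set(header)
    return [name for name in dose_columns if name not in present]

# Header as read from a CSV; note "0.250" instead of "0.25".
header = ["id", "description", "condition", "0.0625", "0.250", "1"]
# Dose columns as they would be listed under experiment.dose_columns.
configured = ["0.0625", "0.25", "1"]

missing = missing_dose_columns(header, configured)
if missing:
    print("Missing dose columns:", ", ".join(missing))
```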

Non-numeric dose values

Dose columns must contain numeric values. Non-numeric entries (e.g. "N/A" or empty cells) are coerced to NaN and may cause individual proteins to be excluded.
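To see how many values in your file would be affected, the coercion can be mimicked with the standard library. This sketch mirrors the behaviour described above but is not the pipeline's own code; `to_float_or_nan` and `count_missing` are hypothetical helpers.

```python
import math

def to_float_or_nan(value):
    """Convert a cell to float, returning NaN when it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

def count_missing(row, dose_columns):
    """Count dose cells in a row that would become NaN."""
    return sum(
        math.isnan(to_float_or_nan(row.get(name))) for name in dose_columns
    )

# One numeric cell, one "N/A" string and one empty cell.
row = {"id": "P04075", "0.0625": "0.984", "0.25": "N/A", "1": ""}
n_missing = count_missing(row, ["0.0625", "0.25", "1"])
print(n_missing)
```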

Replicates

CETSAx is designed to work with at least two replicates per condition. The hit-calling stage uses replicate agreement to increase confidence. Single-replicate data can be analysed but will produce less reliable hit calls.
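Before running the pipeline, it can help to list proteins that appear under only one replicate. The sketch below assumes that distinct `condition` labels such as NADPH.r1 and NADPH.r2 mark distinct replicates; `replicates_per_protein` is a hypothetical helper and the second identifier is a placeholder.

```python
from collections import defaultdict

def replicates_per_protein(rows):
    """Count distinct condition labels seen for each protein id."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["id"]].add(row["condition"])
    return {protein: len(conditions) for protein, conditions in seen.items()}

rows = [
    {"id": "P04075", "condition": "NADPH.r1"},
    {"id": "P04075", "condition": "NADPH.r2"},
    {"id": "EXAMPLE", "condition": "NADPH.r1"},  # placeholder id, one replicate
]
counts = replicates_per_protein(rows)
single = sorted(protein for protein, n in counts.items() if n < 2)
print(single)
```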

Large datasets

The provided example file contains ~12,000 rows. For larger proteomes, increase the parallelism in Snakemake (--cores) and consider enabling the embedding cache (pooled_cache: true in config.yaml) to avoid redundant ESM-2 forward passes.