Limitations¶
This section outlines the conceptual, statistical, and computational limitations of CETSAx–NADPH. These are not implementation flaws; they reflect the assumptions and constraints of the underlying methodology.
Understanding these limitations is necessary for correct interpretation of results.
1. Model Assumptions¶
Logistic response constraint¶
The dose–response model assumes a sigmoidal relationship between concentration and stability.
- Proteins with non-monotonic or multi-phase behavior are not well captured.
- Complex binding mechanisms (e.g. multi-site, allosteric effects) are reduced to a single effective EC50.
As a result, some biologically valid responses may be filtered out or mischaracterized.
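The effect of the sigmoidal constraint can be illustrated with a minimal sketch. It is not the pipeline's fitting routine: the function names, the parameter grid, and the synthetic curves are assumptions made for illustration, and a coarse grid search stands in for a proper nonlinear least-squares fit. A logistic curve is fitted to one sigmoidal and one bell-shaped (non-monotonic) response, and the resulting R² values are compared.

```python
import math

def logistic(x, ec50, hill):
    """Unit-amplitude logistic term; monotonic in x for any hill > 0."""
    return 1.0 / (1.0 + (ec50 / x) ** hill)

def best_logistic_r2(conc, resp):
    """Best R^2 over a coarse (EC50, Hill) grid.

    Bottom and amplitude enter linearly, so for each grid point they are
    solved in closed form by simple linear regression of resp on the
    logistic term. The grid is a hypothetical stand-in for a real optimizer.
    """
    n = len(resp)
    mean_y = sum(resp) / n
    ss_tot = sum((y - mean_y) ** 2 for y in resp)
    best = float("-inf")
    for i in range(41):
        ec50 = 10 ** (-2 + 0.1 * i)
        for hill in (0.5, 1.0, 2.0, 4.0):
            g = [logistic(x, ec50, hill) for x in conc]
            mean_g = sum(g) / n
            s_gg = sum((gi - mean_g) ** 2 for gi in g)
            if s_gg == 0.0:
                continue
            s_gy = sum((gi - mean_g) * (yi - mean_y) for gi, yi in zip(g, resp))
            slope = s_gy / s_gg
            intercept = mean_y - slope * mean_g
            ss_res = sum((yi - (intercept + slope * gi)) ** 2
                         for gi, yi in zip(g, resp))
            best = max(best, 1.0 - ss_res / ss_tot)
    return best

# Nine concentrations spaced half a log unit apart (arbitrary units).
conc = [10 ** (-2 + 0.5 * i) for i in range(9)]

# Synthetic sigmoidal response and synthetic bell-shaped response.
sigmoidal = [logistic(x, 1.0, 1.0) for x in conc]
bell = [math.exp(-2.0 * math.log10(x) ** 2) for x in conc]

r2_sigmoidal = best_logistic_r2(conc, sigmoidal)
r2_bell = best_logistic_r2(conc, bell)
print(f"sigmoidal R^2: {r2_sigmoidal:.3f}")
print(f"bell-shaped R^2: {r2_bell:.3f}")
```

Because the logistic term is monotonic, no choice of parameters can follow both the rise and the fall of the bell-shaped curve. Under an R²-based quality filter, such a protein would likely be discarded even though its response is real.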
Monotonic smoothing¶
Isotonic regression enforces monotonicity.
- This removes noise but also removes genuine non-monotonic structure.
- Subtle transitions or intermediate states may be flattened.
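The flattening can be seen in a small sketch of the pool-adjacent-violators algorithm, a standard way to compute an isotonic (non-decreasing) fit. The implementation and the example values below are illustrative assumptions, not the code or data used by the pipeline.

```python
def isotonic_increasing(values):
    """Non-decreasing least-squares fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum_of_values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical response with a transient peak at the third concentration.
response = [0.0, 0.2, 0.9, 0.4, 0.5, 1.0]
smoothed = isotonic_increasing(response)
print(smoothed)
```

The peak at 0.9 and the two points after it are pooled into a single plateau at their mean, so the transient maximum is no longer visible in the smoothed curve.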
2. Sensitivity Scoring¶
Heuristic weighting¶
The NADPH Sensitivity Score (NSS) is a weighted combination of features.
- Weights are biologically motivated but not learned from data.
- Different weighting schemes may change protein rankings.
The score should therefore be treated as a relative ranking, not an absolute quantity.
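How strongly rankings depend on the weights can be shown with a toy example. The feature names (`amplitude`, `potency`), the weights, and the protein values are hypothetical and chosen only for illustration; they are not the actual NSS definition.

```python
def weighted_score(features, weights):
    """Weighted sum of named features (illustrative, not the real NSS)."""
    return sum(weights[name] * features[name] for name in weights)

def rank_proteins(proteins, weights):
    """Protein names sorted from highest to lowest score."""
    return sorted(proteins,
                  key=lambda name: weighted_score(proteins[name], weights),
                  reverse=True)

# Hypothetical, already-normalized feature values in [0, 1].
proteins = {
    "protein_A": {"amplitude": 0.9, "potency": 0.2},
    "protein_B": {"amplitude": 0.4, "potency": 0.8},
}

amplitude_heavy = {"amplitude": 0.7, "potency": 0.3}
potency_heavy = {"amplitude": 0.3, "potency": 0.7}

ranking_amplitude_heavy = rank_proteins(proteins, amplitude_heavy)
ranking_potency_heavy = rank_proteins(proteins, potency_heavy)
print(ranking_amplitude_heavy)
print(ranking_potency_heavy)
```

The two weightings order the same pair of proteins differently, which is why a ranking is only meaningful relative to a stated weighting scheme.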
Dependence on fit quality¶
The NSS is computed from fitted curve parameters, so it is only as reliable as the underlying fits.
- Poor fits propagate into NSS.
- Filtering reduces noise but may remove borderline cases.
3. Data Limitations¶
CETSA-specific constraints¶
CETSA measures thermal stability, not binding directly.
- Stabilization may reflect:
  - ligand binding
  - complex formation
  - indirect metabolic effects
- Destabilization may arise from:
  - competition
  - conformational shifts
  - degradation or aggregation
Therefore, CETSA signals are context-dependent proxies, not direct mechanistic measurements.
Replicate variability¶
- Limited replicates reduce reliability of EC50 estimation.
- High noise at low concentrations can bias fits.
Coverage bias¶
- Not all proteins are equally detected.
- Abundant or stable proteins are overrepresented.
This introduces systematic bias in downstream analyses.
4. Pathway and Network Analysis¶
Correlation-based networks¶
Co-stabilization networks are based on correlation.
- Correlation does not imply causation.
- Shared patterns may arise from indirect effects or experimental artifacts.
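A small synthetic example shows how a shared driver alone can produce a strong correlation. All values are made up for illustration: two profiles follow a common trend (for instance, a global dose effect) and have no direct relationship with each other. The partial correlation, which controls for the shared trend, is computed alongside the plain Pearson correlation.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def partial_correlation(a, b, control):
    """Correlation of a and b after removing the linear effect of control."""
    r_ab = pearson(a, b)
    r_ac = pearson(a, control)
    r_bc = pearson(b, control)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Shared driver and two independent, deterministic noise patterns
# (constructed to be uncorrelated with the driver and with each other).
driver = [0, 1, 2, 3, 4, 5, 6, 7]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, -1, -1, 1, -1, 1, 1, -1]

profile_x = [d + 0.5 * e for d, e in zip(driver, noise_x)]
profile_y = [2 * d + 0.5 * e for d, e in zip(driver, noise_y)]

r_raw = pearson(profile_x, profile_y)
r_partial = partial_correlation(profile_x, profile_y, driver)
print(f"raw correlation: {r_raw:.3f}")
print(f"partial correlation given driver: {r_partial:.3f}")
```

The raw correlation is high, yet it vanishes once the shared driver is accounted for. An edge in a co-stabilization network may therefore reflect a common upstream influence and not an interaction between the two proteins.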
Annotation dependence¶
Pathway enrichment depends on annotation quality.
- Missing or incorrect annotations affect results.
- Broad pathway definitions dilute specificity.
5. Latent Representation¶
Linear assumptions (PCA)¶
- PCA captures only linear structure.
- Nonlinear relationships may be missed.
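The following sketch illustrates the linearity limitation on two-dimensional toy data, using the closed-form eigenvalues of a 2×2 covariance matrix in place of a PCA library. The data and function name are assumptions for illustration. Both datasets are driven by a single underlying variable, but only the linear one is summarized by a single principal component.

```python
import math

def first_pc_variance_ratio(points):
    """Fraction of total variance explained by the first principal component
    of 2-D data, from the eigenvalues of the 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    s_xx = sum((p[0] - mean_x) ** 2 for p in points) / n
    s_yy = sum((p[1] - mean_y) ** 2 for p in points) / n
    s_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    half_trace = (s_xx + s_yy) / 2
    spread = math.sqrt(((s_xx - s_yy) / 2) ** 2 + s_xy ** 2)
    largest_eigenvalue = half_trace + spread
    return largest_eigenvalue / (s_xx + s_yy)

# One hidden variable t mapped linearly: points on a straight line.
line = [(float(t), 2.0 * t) for t in range(10)]

# One hidden variable (the angle) mapped nonlinearly: points on a circle.
circle = [(math.cos(2 * math.pi * k / 100), math.sin(2 * math.pi * k / 100))
          for k in range(100)]

line_ratio = first_pc_variance_ratio(line)
circle_ratio = first_pc_variance_ratio(circle)
print(f"line:   first PC explains {line_ratio:.2f} of the variance")
print(f"circle: first PC explains {circle_ratio:.2f} of the variance")
```

For the line, the first component captures all of the variance. For the circle, variance is split evenly between the two components, so the one-dimensional structure (the angle) is not recovered by any single linear axis.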
Factor interpretation¶
- Latent factors are not uniquely identifiable.
- Biological interpretation depends on feature loadings and context.
6. Sequence Modeling¶
Data size constraints¶
- Performance depends on the number and quality of labeled proteins.
- Small datasets limit generalization.
Label noise¶
Labels are derived from experimental fits.
- Errors in EC50 or classification propagate into training.
- The model may learn dataset-specific artifacts.
Model bias¶
- ESM embeddings encode evolutionary information, not experimental context.
- Predictions reflect both sequence signal and pretrained biases.
Limited structural awareness¶
- The model operates on sequence, not explicit 3D structure.
- Structural context is inferred implicitly, not modeled directly.
7. Explainability¶
Gradient-based attribution limits¶
- Saliency and Integrated Gradients depend on model gradients.
- These methods highlight sensitivity, not causality.
High importance does not guarantee biological relevance.
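A toy version of Integrated Gradients makes the dependence on the model explicit. The three-feature logistic model, its weights, and the input below are hypothetical; gradients are approximated by finite differences and the path integral by a midpoint Riemann sum, which is a simplification of what an attribution library would do.

```python
import math

# Hypothetical toy model: the second feature has zero weight, so the model
# ignores it regardless of its value.
WEIGHTS = [2.0, 0.0, -1.0]

def model(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x, eps=1e-5):
    """Central finite-difference gradient of the model output."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((model(up) - model(down)) / (2 * eps))
    return grads

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann approximation of Integrated Gradients along the
    straight path from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [1.0, 5.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
output_change = model(x) - model(baseline)
print("attributions:", [round(a, 4) for a in attributions])
print("sum of attributions:", round(sum(attributions), 4))
print("output change:", round(output_change, 4))
```

The attributions sum to the change in model output, and the feature with the largest input value receives zero attribution because the model does not use it. Attributions therefore describe what the trained model responds to, which coincides with biological relevance only to the extent that the model has learned the right signal.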
Resolution vs interpretation¶
- Residue-level signals can be noisy.
- Aggregation across proteins is required for robust conclusions.
8. Computational Constraints¶
GPU requirements¶
- Sequence modeling is resource-intensive.
- Limited GPU memory restricts batch size and model complexity.
Runtime scaling¶
- Curve-fitting cost scales with the number of proteins × conditions.
- Large datasets increase runtime significantly.
9. Interpretation Risks¶
Overinterpretation of single proteins¶
- Individual hits can be misleading.
- Reliable conclusions require consistent patterns across:
  - replicates
  - pathways
  - multiple analysis layers
Black-box misuse¶
- The sequence model can produce confident predictions even when the underlying data are weak.
- Interpretability tools mitigate this but do not eliminate the risk.
10. Summary¶
CETSAx–NADPH provides a structured and interpretable framework, but it operates under:
- simplified biophysical assumptions
- noisy experimental inputs
- statistical and computational constraints
Results should be interpreted as:
- hypothesis-generating, not definitive proof
- strongest when supported across multiple layers of analysis
Careful validation with orthogonal experiments is recommended.