Data Pre‑processing
Cleaning/normalization, deduplication, imputation, encoding, and sanitisation of personal data (generalization, suppression, and perturbation of identifiers and quasi‑identifiers) to reduce re‑identification risk before downstream use.
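As a concrete illustration of the sanitisation step, here is a minimal sketch that suppresses direct identifiers and generalizes quasi‑identifiers. The pandas DataFrame, column names, and binning choices are assumptions for illustration, not taken from any specific pipeline.

```python
import pandas as pd

def sanitise(df: pd.DataFrame) -> pd.DataFrame:
    """Reduce re-identification risk by suppressing direct identifiers
    and generalizing quasi-identifiers. Column names are illustrative."""
    out = df.copy()
    # Suppression: drop direct identifiers entirely (ignore if absent).
    out = out.drop(columns=["name", "ssn"], errors="ignore")
    # Generalization: coarsen exact age into 10-year bands.
    decade = out["age"] // 10 * 10
    out["age"] = decade.astype(str) + "-" + (decade + 9).astype(str)
    # Generalization: truncate ZIP codes to the first three digits.
    out["zip_code"] = out["zip_code"].astype(str).str[:3] + "**"
    return out

example = pd.DataFrame({
    "name": ["A. Patient"], "ssn": ["000-00-0000"],
    "age": [42], "zip_code": ["94301"], "diagnosis": ["benign tumor"],
})
print(sanitise(example))  # age -> "40-49", zip_code -> "943**"
```

Perturbation of rare values (e.g., adding bounded noise) would follow the same per-column pattern and is omitted here for brevity.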
Attack surfaces
- Label‑Flipping Attacks: Adversaries hide targeted re-labeling within cleaning scripts or pull requests, altering the meaning of training examples.
Ex: In a medical dataset, attackers flip “benign tumor” labels to “malignant.” A cancer-detection model trained on this poisoned dataset produces dangerously skewed diagnoses (a label-distribution audit is sketched after this list).
- Adversarial Noise Injection: Malicious actors conceal adversarial perturbations under the guise of normalization or denoising. These perturbations later trigger model misclassifications.
Ex: A poisoned speech dataset includes tiny perturbations masked as background noise. The resulting voice-recognition model consistently misidentifies attacker-chosen phrases (a perturbation-magnitude check is sketched after this list).
- Pipeline Code Exploitation: Insecure ETL scripts, notebooks, or deserialization routines can be hijacked to execute arbitrary code or exfiltrate data.
Ex: A malicious update to a Python preprocessing script uses unsafe pickle.load calls, leading to remote code execution and leakage of sensitive training data (a safe-loading sketch follows this list).
- Privacy Regression During Sanitisation: Poorly designed anonymization fails to suppress quasi-identifiers, allowing re-identification.
Ex: A healthcare dataset anonymizes names but leaves unique patient admission dates and rare combinations of treatments, enabling re-identification of individuals (a k-anonymity check is sketched after this list).
- Feature‑Scale Tampering: Attackers manipulate summary statistics used for scaling or normalization, biasing downstream model behavior.
Ex: By injecting outliers into financial data, attackers distort normalization so that fraudulent transactions appear statistically “normal” to the fraud detection model (a robust-scaling sketch follows this list).
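Label-flipping during cleaning can often be surfaced by comparing labels before and after the cleaning step and failing the pipeline when too many change. A minimal sketch, assuming labels arrive as aligned pandas Series; the threshold and names are illustrative assumptions.

```python
import pandas as pd

def audit_label_changes(before: pd.Series, after: pd.Series,
                        max_changed_frac: float = 0.001) -> float:
    """Fail the pipeline if a cleaning step silently re-labels examples.
    `before` and `after` are index-aligned label Series; threshold is illustrative."""
    changed = before != after
    frac = changed.mean()
    if frac > max_changed_frac:
        raise ValueError(
            f"{int(changed.sum())} labels ({frac:.2%}) changed during cleaning; "
            f"first affected indices: {list(before.index[changed])[:20]}"
        )
    return frac

before = pd.Series(["benign", "benign", "malignant"])
after = pd.Series(["benign", "malignant", "malignant"])  # one silent flip
audit_label_changes(before, after)  # raises: 1 label (33.33%) changed
```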
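Perturbations smuggled in as "denoising" can be flagged by measuring how much each sample changes across the step. A minimal NumPy sketch; the L2 threshold and the injected perturbation are placeholders for illustration.

```python
import numpy as np

def audit_denoising(raw: np.ndarray, denoised: np.ndarray, max_l2: float = 0.05):
    """Return indices of samples that a 'denoising' step altered more than expected.
    raw/denoised have shape (n_samples, n_features); max_l2 is illustrative."""
    deltas = np.linalg.norm(denoised - raw, axis=1)
    suspicious = np.where(deltas > max_l2)[0]
    return suspicious, deltas

rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 64))
denoised = raw.copy()
denoised[17] += 0.02 * np.sign(rng.normal(size=64))  # hidden targeted perturbation
idx, _ = audit_denoising(raw, denoised)
print(idx)  # -> [17]
```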
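The pickle.load pattern mentioned above executes arbitrary code embedded in the file, so untrusted artifacts should be kept in data-only formats. A minimal sketch contrasting the two approaches; file paths and function names are illustrative.

```python
import json
import pickle

# Dangerous: unpickling runs attacker-controlled code (e.g., via __reduce__).
def load_features_unsafe(path: str):
    with open(path, "rb") as f:
        return pickle.load(f)  # arbitrary code execution if `path` is untrusted

# Safer: a data-only format cannot smuggle an executable payload.
def load_features_safe(path: str):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# For NumPy arrays, np.load(path, allow_pickle=False) offers a similar guarantee.
```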
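A regression test for the sanitisation step can check k-anonymity over the declared quasi-identifiers, so rare combinations such as admission date plus treatment are caught before release. A minimal pandas sketch; column names and the value of k are illustrative assumptions.

```python
import pandas as pd

def k_anonymity_violations(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> pd.DataFrame:
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = df.groupby(quasi_identifiers, dropna=False).size().reset_index(name="count")
    return counts[counts["count"] < k]

released = pd.DataFrame({
    "admission_date": ["2021-03-02", "2021-03-02", "2021-07-19"],
    "treatment": ["chemo", "chemo", "rare-combination-x"],
})
print(k_anonymity_violations(released, ["admission_date", "treatment"], k=2))
# -> the single 2021-07-19 / rare-combination-x row, i.e. a re-identification risk
```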
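Because mean/standard-deviation scaling is easily skewed by injected outliers, a robust alternative based on the median and interquartile range bounds their influence. A minimal NumPy sketch; the transaction amounts are synthetic and no specific fraud-detection pipeline is implied.

```python
import numpy as np

def standard_scale(x: np.ndarray) -> np.ndarray:
    # Sensitive: a handful of injected outliers shifts the mean and std for every row.
    return (x - x.mean()) / x.std()

def robust_scale(x: np.ndarray) -> np.ndarray:
    # Median and interquartile range barely move when a few extreme values are injected.
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

rng = np.random.default_rng(1)
amounts = rng.normal(100.0, 10.0, size=1000)           # legitimate transactions
poisoned = np.append(amounts, np.full(10, 1_000_000))  # attacker-injected outliers
fraud = np.array([5_000.0])                            # a transaction that should stand out

print(standard_scale(np.append(poisoned, fraud))[-1])  # small z-score: fraud looks "normal"
print(robust_scale(np.append(poisoned, fraud))[-1])    # large value: fraud still stands out
```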