Data Preparation

Final packaging of engineered data into training, validation, and test sets, using stratified or time-aware splitting to preserve representativeness; this is the last stop before training.
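
As a rough illustration of the splitting step, the sketch below shows a stratified split with scikit-learn and a simple timestamp-cutoff split with pandas. The column names ("label", "event_time"), the 70/15/15 ratios, and the seed are illustrative assumptions, not a prescribed recipe.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    def stratified_split(df: pd.DataFrame, label_col: str = "label", seed: int = 42):
        """70/15/15 split that preserves the label distribution in every set."""
        train, rest = train_test_split(df, test_size=0.30, stratify=df[label_col], random_state=seed)
        val, test = train_test_split(rest, test_size=0.50, stratify=rest[label_col], random_state=seed)
        return train, val, test

    def time_aware_split(df: pd.DataFrame, time_col: str = "event_time"):
        """Train on the oldest 70%, validate on the next 15%, test on the newest 15%."""
        df = df.sort_values(time_col)
        n = len(df)
        return df.iloc[:int(n * 0.70)], df.iloc[int(n * 0.70):int(n * 0.85)], df.iloc[int(n * 0.85):]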

Attack surfaces

  • Validation/Test-Set Poisoning: Attackers insert crafted examples into evaluation datasets to distort metrics or guide hyperparameter selection.

    Ex: A malicious sample is inserted into the validation set, making a backdoored model appear more accurate than a clean alternative (the hash-manifest sketch after this list shows one way to detect such swaps).

  • Train–Test Leakage or Contamination: Mixing data between sets inflates performance metrics and masks overfitting.

    Ex: Attackers duplicate specific training records into the test set. The model memorizes these records, artificially boosting accuracy scores and passing QA checks (the cross-split duplicate check sketched after this list can flag this).

  • Class-Balance Skewing: Adversaries manipulate dataset splits to bias decision boundaries.

    Ex: In credit scoring data, attackers reduce the representation of “risky” loan applicants in the training set. The resulting model over-approves fraudulent loans (the class-balance comparison sketched after this list is one guardrail).

  • Packaging and Registry Attacks: Final data shards (e.g., TFRecords, Parquet files) may be tampered with in object stores or CI/CD systems.

    Ex: An attacker replaces one shard in cloud storage with a poisoned version, embedding triggers just before model training (the hash-manifest sketch after this list would surface the mismatch).

  • Embedded Exfiltration Risks: If identifiable traces remain in the prepared sets, they create future privacy risks.

    Ex: A prepared dataset inadvertently retains unique patient identifiers. Later, adversaries run membership inference on the deployed model, confirming that specific patients were part of training (the identifier scan sketched after this list is a last-mile safeguard).
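
For the evaluation-set poisoning and packaging/registry scenarios above, a common control is to record a content hash for every prepared artifact, including the validation and test sets, at packaging time and to verify those hashes immediately before training or evaluation. The sketch below is a minimal illustration under an assumed file layout and manifest format; it is not a complete integrity solution (it does not, for example, protect the manifest itself, which should be signed or stored separately).

    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
        """Stream a file and return its SHA-256 hex digest."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def write_manifest(data_dir: Path, manifest: Path) -> None:
        """At packaging time: record a digest for every shard (Parquet, TFRecord, ...)."""
        digests = {p.name: sha256_of(p) for p in sorted(data_dir.glob("*")) if p.is_file()}
        manifest.write_text(json.dumps(digests, indent=2))

    def verify_manifest(data_dir: Path, manifest: Path) -> None:
        """Before training or evaluation: fail loudly if any shard was swapped or modified."""
        expected = json.loads(manifest.read_text())
        for name, digest in expected.items():
            if sha256_of(data_dir / name) != digest:
                raise RuntimeError(f"Shard {name} does not match its recorded digest")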
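
Train–test leakage of the kind described above can often be caught with a cheap cross-split duplicate check: hash every row in each split and require zero overlap before metrics are trusted. The pandas-based sketch below hashes all columns, which is an assumption; pipelines with legitimate near-duplicates may instead hash only stable key fields.

    import pandas as pd
    from pandas.util import hash_pandas_object

    def split_overlap(train: pd.DataFrame, test: pd.DataFrame) -> int:
        """Count rows that appear verbatim in both splits; anything above zero needs investigation."""
        train_hashes = set(hash_pandas_object(train, index=False))
        test_hashes = set(hash_pandas_object(test, index=False))
        return len(train_hashes & test_hashes)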
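
Class-balance skewing can be flagged by comparing per-class proportions in each split against the full dataset and alerting when they drift beyond a tolerance. In the sketch below, the 2% tolerance and the "label" column name are arbitrary placeholders.

    import pandas as pd

    def check_class_balance(full: pd.DataFrame, split: pd.DataFrame,
                            label_col: str = "label", tol: float = 0.02) -> None:
        """Raise if any class's share in the split drifts from the full dataset by more than tol."""
        full_ratio = full[label_col].value_counts(normalize=True)
        split_ratio = split[label_col].value_counts(normalize=True)
        drift = (full_ratio - split_ratio.reindex(full_ratio.index, fill_value=0.0)).abs()
        suspicious = drift[drift > tol]
        if not suspicious.empty:
            raise RuntimeError(f"Class-balance drift above {tol:.0%}: {suspicious.to_dict()}")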
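
Finally, a last-mile scan for identifier-like columns before the prepared sets are published can reduce the exfiltration and membership-inference exposure noted above. The keyword list and near-uniqueness threshold below are illustrative heuristics, not an exhaustive PII detector.

    import pandas as pd

    def find_identifier_columns(df: pd.DataFrame,
                                keywords=("id", "ssn", "mrn", "email", "phone"),
                                uniqueness: float = 0.95) -> list:
        """Flag columns whose names look like identifiers or whose values are nearly all unique."""
        flagged = []
        for col in df.columns:
            name_hit = any(k in col.lower() for k in keywords)
            unique_hit = df[col].nunique(dropna=True) >= uniqueness * len(df)
            if name_hit or unique_hit:
                flagged.append(col)
        return flagged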