AI System (Serving/Inference)
The live AI system that receives input (from users or upstream systems) and produces outputs (predictions, decisions, content). It could be an online API endpoint for a machine learning model, a real-time streaming inference service, an edge device running a model, or an interactive AI assistant (like an LLM-based chatbot). The serving system includes the model loaded in memory, the runtime (e.g., a Flask app, FastAPI, gRPC server, or specialized inference server), and often integration with other services (like databases or tool APIs if it’s an agent).
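To make the moving parts concrete, here is a minimal sketch of such a serving endpoint, assuming a FastAPI runtime and a scikit-learn-style classifier serialized to a hypothetical `model.joblib`; the path, feature schema, and response shape are illustrative only.

```python
# Minimal sketch of an online inference endpoint. The model artifact name and
# feature schema are placeholders, not taken from any specific system.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.joblib")  # model held in memory for the process lifetime

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Most of the attack surface below flows through this single call:
    # the caller controls the input and observes the scored output.
    proba = model.predict_proba([req.features])[0]
    return {"label": int(proba.argmax()), "probabilities": proba.tolist()}
```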
Attack surfaces
- Model Theft and Extraction Attacks: These aim to duplicate a proprietary model by exploiting its public interface.
- API-Based Extraction: Attackers can train a substitute model by repeatedly querying a target model and recording its outputs. Even without knowledge of the target’s parameters, the substitute can achieve high functional fidelity to the original.
- Knowledge Distillation Attacks: A more efficient variant in which attackers collect soft-label outputs (e.g., class probabilities) and use them to train a high-fidelity student model. By leveraging confidence scores or model explanations, attackers can reach comparable fidelity with significantly fewer queries.
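A minimal sketch of this distillation-style extraction loop, using PyTorch; the victim is simulated locally by a frozen network standing in for the remote API, and the architectures, query budget, and probe distribution are all illustrative assumptions.

```python
# Toy sketch of distillation-style extraction: a substitute ("student") model is
# trained on soft labels harvested from a victim that only exposes probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F

victim = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

def query_victim(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the remote API: returns only probability vectors,
    # never parameters or gradients.
    with torch.no_grad():
        return F.softmax(victim(x), dim=1)

student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(1000):                       # each step spends attacker query budget
    x = torch.randn(64, 32)                 # probe inputs (random or crafted)
    soft_labels = query_victim(x)           # leaked signal: the victim's probabilities
    log_p = F.log_softmax(student(x), dim=1)
    loss = F.kl_div(log_p, soft_labels, reduction="batchmean")  # match soft labels
    opt.zero_grad()
    loss.backward()
    opt.step()
```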
- Inference-Time Privacy Attacks: Models may inadvertently memorize and leak training data.
- Model Inversion Attacks: Attackers reconstruct sensitive training data from model outputs. For example, facial recognition models have been shown to yield recognizable reconstructions of training-set faces from the confidence scores they return.
- Membership Inference Attacks: Attackers determine whether a specific data record was included in training. This has severe privacy implications in domains like healthcare (revealing a patient’s presence in a dataset).
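A toy sketch of the classic confidence-threshold membership test, assuming the attacker only sees the probability vector returned for a record whose true label they know; the 0.9 threshold is a placeholder that a real attacker would calibrate, e.g., against shadow models.

```python
# Toy sketch of confidence-threshold membership inference: overfit models tend
# to assign higher confidence to records they were trained on than to unseen ones.
import numpy as np

def membership_score(prob_vector: np.ndarray, true_label: int) -> float:
    # Confidence the model assigns to the record's true label.
    return float(prob_vector[true_label])

def is_likely_member(prob_vector: np.ndarray, true_label: int,
                     threshold: float = 0.9) -> bool:
    # Threshold is illustrative; a real attacker calibrates it per target model.
    return membership_score(prob_vector, true_label) >= threshold

# Example: a record scored at 0.97 confidence is flagged as a probable training member.
print(is_likely_member(np.array([0.01, 0.97, 0.02]), true_label=1))
```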
- Adversarial Evasion Attacks: Adversaries craft inputs specifically designed to induce misclassification or other unintended behavior at inference time.
Ex: Adding imperceptible pixel noise (FGSM or PGD perturbations) to an image of a “stop” sign can cause an autonomous vehicle’s vision model to classify it as a “speed-limit” sign (a minimal FGSM sketch follows below).
Ex: LLM jailbreaks, where prompts like “ignore previous instructions” cause the model to produce restricted outputs.
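A minimal FGSM sketch in PyTorch illustrating the image-perturbation example above; the model, input image, and epsilon are placeholders rather than a real traffic-sign classifier.

```python
# Minimal FGSM sketch: perturb the input in the direction that increases the
# loss, bounded by eps per pixel so the change stays visually imperceptible.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
loss_fn = nn.CrossEntropyLoss()

def fgsm(image: torch.Tensor, label: torch.Tensor, eps: float = 0.03) -> torch.Tensor:
    image = image.clone().requires_grad_(True)
    loss = loss_fn(model(image), label)
    loss.backward()
    # One signed-gradient step, clipped back to the valid pixel range.
    adversarial = image + eps * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

x = torch.rand(1, 3, 32, 32)   # stand-in "stop sign" image
y = torch.tensor([0])          # its true class
x_adv = fgsm(x, y)             # looks the same, but may now be misclassified
```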
- Denial of Service (DoS) and Resource Exhaustion: Attackers exploit the high computational cost of inference.
Ex: An adversary submits oversized prompts or adversarial token storms to an LLM API, consuming GPU capacity and driving up costs, effectively denying access to legitimate users.
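As a sketch of where this surface sits, a serving endpoint can at least bound request size before any GPU work is scheduled; the framework and limits below are illustrative, and production systems would add per-client rate limiting and output-token caps.

```python
# Sketch of a coarse guard against resource-exhaustion inputs at the API layer.
# The size limit is illustrative; real deployments also rate-limit per API key
# and cap max_new_tokens and concurrent GPU work.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

MAX_PROMPT_CHARS = 8_000  # illustrative ceiling on request size

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: PromptRequest):
    if len(req.prompt) > MAX_PROMPT_CHARS:
        # Reject oversized prompts before they ever reach the model.
        raise HTTPException(status_code=413, detail="Prompt too large")
    # ... hand off to the model with a bounded generation budget ...
    return {"output": "..."}
```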
- RAG & Vector Store Poisoning: In Retrieval-Augmented Generation (RAG) setups, attackers tamper with the external knowledge base or embeddings.
Ex: Poisoning a corporate vector database so that every query about “compliance” retrieves manipulated documents suggesting unsafe practices.
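A toy illustration of why a single poisoned entry can dominate retrieval: whichever stored vector lies closest to the query embedding becomes the “trusted” context handed to the LLM. The documents and embeddings below are made-up three-dimensional vectors, not outputs of a real embedding model.

```python
# Toy nearest-neighbor retrieval over a tiny "vector store". The poisoned entry
# is crafted to sit closer to the query embedding than the legitimate document.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus = {
    "legitimate compliance policy": np.array([0.9, 0.1, 0.0]),
    "poisoned 'compliance' doc":    np.array([0.99, 0.14, 0.0]),  # crafted embedding
}
query_embedding = np.array([1.0, 0.15, 0.0])  # "what is our compliance policy?"

best = max(corpus, key=lambda doc: cosine(corpus[doc], query_embedding))
print(best)  # the poisoned document wins retrieval and shapes the LLM's answer
```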
- Output Injection Attacks: Maliciously crafted outputs exploit downstream systems.
Ex: An LLM integrated into a workflow produces SQL statements that, if executed without validation, run destructive queries, a form of AI-enabled SQL injection.
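A sketch of a minimal guard against this pattern, assuming the model output is only ever meant to be a single read-only query; the allowlist check is illustrative and not a complete SQL sanitizer, and real systems should prefer parameterized, pre-approved queries.

```python
# The dangerous pattern is executing model output as-is. This sketch refuses
# anything that is not a single SELECT statement before it reaches the database.
import sqlite3

def run_llm_sql(llm_output: str, conn: sqlite3.Connection):
    statement = llm_output.strip().rstrip(";")
    if ";" in statement or not statement.lower().startswith("select"):
        raise ValueError("Refusing to execute non-SELECT or multi-statement SQL")
    return conn.execute(statement).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
print(run_llm_sql("SELECT id, name FROM users", conn))   # allowed
# run_llm_sql("DROP TABLE users", conn)                  # raises ValueError
```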
- Side-Channel Exploits in Shared Environments: On multi-tenant GPUs or TPUs, attackers may extract information about co-located models.
Ex: By measuring memory timing on shared hardware, adversaries can infer details of another tenant’s neural network architecture.
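As a rough CPU-side analogy of a contention-based timing probe (real GPU/TPU side channels exploit shared caches, memory buses, and scheduler behavior), the sketch below demonstrates only the measurement pattern: a fixed workload timed repeatedly, whose latency spikes hint at co-tenant activity.

```python
# Toy contention probe: time a fixed workload over and over; deviations from the
# baseline distribution suggest bursts of activity from a co-located workload.
import time
import numpy as np

def probe_latency(runs: int = 50, size: int = 256) -> list[float]:
    a = np.random.rand(size, size)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        _ = a @ a  # fixed workload; its duration varies with resource contention
        samples.append(time.perf_counter() - start)
    return samples

baseline = probe_latency()
# Latency spikes relative to this baseline can leak coarse information about
# the timing and structure of another tenant's workload.
print(f"median={np.median(baseline):.6f}s  p95={np.percentile(baseline, 95):.6f}s")
```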