AI System (Serving/Inference)

The live AI system that receives input (from users or upstream systems) and produces outputs (predictions, decisions, content). It could be an online API endpoint for a machine learning model, a real-time streaming inference service, an edge device running a model, or an interactive AI assistant (like an LLM-based chatbot). The serving system includes the model loaded in memory, the runtime (e.g., a Flask or FastAPI app, a gRPC server, or a specialized inference server), and often integrations with other services (databases, or tool APIs if it’s an agent).
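
For concreteness, here is a minimal sketch of such an endpoint, assuming FastAPI and a toy in-memory classifier standing in for a real trained model; the route name and feature layout are illustrative only.

```python
# Sketch of a serving endpoint: one model held in memory behind an HTTP API.
# The model here is a toy stand-in; a real deployment would load trained
# weights (e.g., with joblib or TorchScript) at startup instead.
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ToyModel:
    """Stand-in for a trained classifier kept resident in memory."""

    weights = np.array([[0.7, -0.4], [-0.2, 0.9]])  # (classes x features)

    def predict_proba(self, x: np.ndarray) -> np.ndarray:
        logits = x @ self.weights.T
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()


model = ToyModel()  # loaded once, reused for every request


class PredictRequest(BaseModel):
    features: list[float]  # two features in this toy setup


@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # Returns class probabilities; several of the attacks below abuse exactly
    # this public interface: queries in, scores out.
    probs = model.predict_proba(np.asarray(req.features))
    return {"probabilities": probs.tolist()}
```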

Attack surfaces

  • Model Theft and Extraction Attacks: These aim to duplicate a proprietary model by exploiting its public interface.
    • API-Based Extraction: Attackers train a substitute model by repeatedly querying the target and recording its outputs. Even without access to the target’s parameters or architecture, the substitute can closely replicate its decision behavior (a minimal code sketch appears below).
    • Knowledge Distillation Attacks: A more query-efficient variant in which attackers collect soft-label outputs (e.g., class probabilities) and train a high-fidelity student model. Because confidence scores and explanations carry more information per response than hard labels, the substitute can be trained with significantly fewer queries.
  • Inference-Time Privacy Attacks: Models may inadvertently memorize and leak training data.
    • Model Inversion Attacks: Attackers reconstruct sensitive training data from model outputs. For example, facial recognition models have been shown to yield recognizable reconstructions of training-set faces from the confidence vectors they return.
    • Membership Inference Attacks: Attackers determine whether a specific data record was included in the training set. This has severe privacy implications in domains like healthcare (e.g., revealing that a patient’s record appears in a disease-specific dataset); a confidence-thresholding sketch appears below.
  • Adversarial Evasion Attacks: Adversaries craft subtly perturbed inputs that cause the model to misclassify them at inference time.

    Ex: Adding imperceptible pixel noise, such as Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD) perturbations, to an image of a “stop” sign can cause an autonomous vehicle’s model to classify it as a “speed-limit” sign.
    Ex: LLM jailbreaks, where prompts like “ignore previous instructions” coax the model into producing restricted outputs.
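
To make the evasion mechanics concrete, here is a compact FGSM sketch, assuming PyTorch is available; the model is an untrained toy linear classifier rather than a traffic-sign network, so it only illustrates the core step of nudging the input along the sign of the loss gradient.

```python
# FGSM sketch: perturb the input along the sign of the loss gradient so the
# model's confidence in its current prediction drops.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(28 * 28, 10)      # toy classifier standing in for a real network
x = torch.rand(1, 28 * 28)                # toy "image" with pixel values in [0, 1]

with torch.no_grad():
    clean_label = model(x).argmax(dim=1)  # treat the current prediction as ground truth

x.requires_grad_(True)
loss = F.cross_entropy(model(x), clean_label)
loss.backward()

epsilon = 0.05                            # per-pixel perturbation budget
x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

print("clean prediction:      ", clean_label.item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```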
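
The API-based extraction described above can be illustrated end to end with a local stand-in for the victim API. Everything here is synthetic (the data, the target model, and the probe distribution), so it shows the shape of the attack rather than a realistic query budget.

```python
# Query-based model extraction sketch: the attacker only calls a black-box
# predict interface, records the answers, and fits a local substitute on them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the victim's remote API (the attacker never sees its weights).
X_private, y_private = make_classification(n_samples=2000, n_features=10, random_state=0)
target = LogisticRegression(max_iter=1000).fit(X_private, y_private)

def query_target(x: np.ndarray) -> np.ndarray:
    """Black-box oracle: inputs in, predicted labels out."""
    return target.predict(x)

# Attacker generates probe inputs and harvests the target's answers...
rng = np.random.default_rng(1)
X_probe = rng.normal(size=(5000, 10))
y_probe = query_target(X_probe)

# ...then trains a substitute that mimics the target's decision boundary.
substitute = LogisticRegression(max_iter=1000).fit(X_probe, y_probe)

X_test = rng.normal(size=(1000, 10))
agreement = (substitute.predict(X_test) == target.predict(X_test)).mean()
print(f"substitute agrees with target on {agreement:.1%} of test queries")
```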
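
For membership inference, the simplest variant thresholds the model’s confidence, exploiting the tendency of overfit models to be more confident on records they were trained on. The dataset, model, and threshold below are placeholders; practical attacks usually calibrate the threshold with shadow models.

```python
# Confidence-thresholding membership inference sketch: records the model was
# trained on tend to receive higher confidence than unseen records.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, class_sep=0.5, random_state=0)
X_member, X_nonmember, y_member, y_nonmember = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fully grown trees memorize the training set, which makes the leakage visible.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_member, y_member)

def top_confidence(x: np.ndarray) -> np.ndarray:
    return model.predict_proba(x).max(axis=1)

threshold = 0.9  # attacker-chosen; in practice calibrated with shadow models
guess_member = top_confidence(X_member) > threshold        # true members
guess_nonmember = top_confidence(X_nonmember) > threshold  # non-members

print(f"flags {guess_member.mean():.1%} of members vs "
      f"{guess_nonmember.mean():.1%} of non-members as 'in training set'")
```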

  • Denial of Service (DoS) and Resource Exhaustion: Attackers exploit the computational cost of inference.

    Ex: An adversary submits oversized prompts or bursts of adversarially long requests to an LLM API, tying up GPU time and spiking costs, effectively denying access to legitimate users (a simple request-budget guard is sketched below).
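
On the serving side, the cheapest mitigation is to bound request size and per-client volume before any inference work is scheduled. The sketch below assumes a FastAPI service like the one above; the specific limits are illustrative, not recommended values.

```python
# Sketch of a server-side guard against oversized prompts: reject requests
# that exceed a length or quota budget before any GPU work is scheduled.
from collections import defaultdict

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

MAX_PROMPT_CHARS = 8_000        # illustrative budget, not a recommended value
MAX_REQUESTS_PER_CLIENT = 100   # naive in-memory quota; real systems use a rate limiter
request_counts: dict[str, int] = defaultdict(int)


class GenerateRequest(BaseModel):
    client_id: str
    prompt: str


@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    if len(req.prompt) > MAX_PROMPT_CHARS:
        raise HTTPException(status_code=413, detail="prompt too large")
    request_counts[req.client_id] += 1
    if request_counts[req.client_id] > MAX_REQUESTS_PER_CLIENT:
        raise HTTPException(status_code=429, detail="quota exceeded")
    # ...the expensive model call would happen here, only after the cheap checks...
    return {"status": "accepted", "prompt_chars": len(req.prompt)}
```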

  • RAG & Vector Store Poisoning: In Retrieval-Augmented Generation (RAG) setups, attackers tamper with the external knowledge base or embeddings.

    Ex: Poisoning a corporate vector database so that every query about “compliance” retrieves manipulated documents suggesting unsafe practices.
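
A toy retriever makes the mechanism visible: with documents reduced to bag-of-words vectors (standing in for a real embedding model), an injected entry stuffed with the target term outranks legitimate documents for queries on that topic. The corpus, query, and scoring below are all illustrative.

```python
# Toy retrieval sketch: a poisoned document stuffed with the target keyword
# outranks legitimate documents for queries on that topic.
import numpy as np

corpus = {
    "policy_a": "compliance requires audit logs and change approval",
    "policy_b": "incident response and compliance review procedures",
    # Attacker-injected entry, written to score highly for "compliance" queries:
    "poisoned": "compliance compliance compliance disabling audit logs is acceptable",
}

vocab = sorted({w for text in corpus.values() for w in text.split()})


def embed(text: str) -> np.ndarray:
    """Crude bag-of-words embedding standing in for a real embedding model."""
    counts = np.array([text.split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts


doc_vectors = {name: embed(text) for name, text in corpus.items()}
query_vec = embed("what does compliance require")

# Cosine-similarity ranking: the poisoned entry wins the "compliance" query.
ranking = sorted(doc_vectors, key=lambda n: float(doc_vectors[n] @ query_vec), reverse=True)
print("retrieval order:", ranking)
```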

  • Output Injection Attacks: Attackers steer the model into producing outputs that exploit downstream systems which consume them without validation.

    Ex: An LLM integrated into a workflow produces SQL statements that, if executed without validation, run destructive queries, a form of AI-enabled SQL injection (a validation sketch follows).
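
One way to blunt this is to treat model-generated SQL as untrusted input. The sketch below accepts only a single read-only SELECT and executes it on a read-only SQLite connection; the keyword screen is deliberately crude and not a complete defense on its own.

```python
# Sketch of treating model-generated SQL as untrusted input: allow only a
# single read-only SELECT and run it on a read-only connection.
import sqlite3

FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "attach", "pragma")


def run_generated_sql(sql: str, db_path: str) -> list[tuple]:
    statement = sql.strip().rstrip(";")
    lowered = statement.lower()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not lowered.startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    # Crude substring screen; a real system should parse the SQL properly.
    if any(keyword in lowered for keyword in FORBIDDEN):
        raise ValueError("statement contains a forbidden keyword")
    # Read-only connection: even if a check is bypassed, writes will fail.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(statement).fetchall()
    finally:
        conn.close()


# Example: a destructive statement produced by the model is rejected outright.
try:
    run_generated_sql("DROP TABLE users;", "app.db")
except ValueError as err:
    print("rejected:", err)
```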

  • Side-Channel Exploits in Shared Environments: On multi-tenant GPUs or TPUs, attackers may extract information about co-located models.

    Ex: By measuring cache and memory-access timing on shared hardware, adversaries can infer details of a co-located tenant’s neural network architecture.