Prompt Injection — Technology Wiki

Overview

Direct Answer

Prompt injection is a security vulnerability in which an attacker embeds malicious instructions within user input to manipulate a language model into bypassing its original directives or system constraints. This technique exploits the model's inability to distinguish between legitimate user queries and adversarial instructions designed to override its intended behaviour.

How It Works

An attacker crafts input that includes hidden instructions, often using techniques such as context switching, role-playing prompts, or explicit directives prefixed with phrases like 'ignore previous instructions.' The language model processes this concatenated input sequentially and treats the injected content as legitimate guidance, causing it to prioritise the new instructions over its system-level constraints and training.

Why It Matters

Organisations deploying language models in customer-facing applications, content generation, and data analysis face significant risks including unauthorised data disclosure, brand reputation damage, and regulatory compliance violations. Teams must address this vulnerability to ensure model outputs remain trustworthy and aligned with business objectives and legal requirements.

Common Applications

Prompt injection affects chatbot applications, automated customer support systems, content management platforms, and AI-driven code generation tools. Attackers have demonstrated exploitation through email inputs to email-filtering systems and user prompts to customer service bots, revealing widespread exposure across enterprise deployments.

Key Considerations

Mitigation requires multi-layered defences including input sanitisation, model fine-tuning, and architectural separation of user input from system instructions. No single technical solution eliminates the risk entirely, necessitating ongoing monitoring and adversarial testing as attack methods evolve.

Cross-References(1)

Natural Language Processing

Language Model

Referenced By1 term mentions Prompt Injection

Other entries in the wiki whose definition references Prompt Injection — useful for understanding how this concept connects across Natural Language Processing and adjacent domains.

AI Security·Cybersecurity

Related in Semantics & Representation

Large Language Model

A neural network trained on massive text corpora that can generate, understand, and reason about natural language.

GPT

Generative Pre-trained Transformer — a family of autoregressive language models that generate text by predicting the next token.

BERT

Bidirectional Encoder Representations from Transformers — a language model that understands context by reading text in both directions.

Tokenisation

The process of breaking text into smaller units (tokens) such as words, subwords, or characters for processing by language models.

Language Model

A probabilistic model that assigns probabilities to sequences of words, enabling prediction of the next word in a sequence.

Contextual Embedding

Word representations that change based on surrounding context, capturing polysemy and contextual meaning.

Word2Vec

A neural network model that learns distributed word representations by predicting surrounding context words.

GloVe

Global Vectors for Word Representation — an unsupervised learning algorithm for obtaining word vector representations from aggregated word co-occurrence statistics.

Instruction Tuning

Training a language model to follow natural language instructions by fine-tuning on instruction-response pairs.

RLHF

Reinforcement Learning from Human Feedback — a technique for aligning language models with human preferences through reward modelling.

Grounding

Connecting language model outputs to real-world knowledge, facts, or data sources to improve factual accuracy.

Hallucination Detection

Techniques for identifying when AI language models generate plausible but factually incorrect or unsupported content.

More in Natural Language Processing

Coreference Resolution

Parsing & Structure

The task of identifying all expressions in text that refer to the same real-world entity.

Dependency Parsing

Parsing & Structure

The syntactic analysis of a sentence to establish relationships between head words and words that modify them.

Abstractive Summarisation

Text Analysis

A text summarisation approach that generates novel sentences to capture the essential meaning of a document, rather than simply extracting and rearranging existing sentences.

Dialogue System

Generation & Translation

A computer system designed to converse with humans, encompassing task-oriented and open-domain conversation.

Question Answering

Generation & Translation

An NLP task where a system automatically answers questions posed in natural language based on given context.

Text Generation

Generation & Translation

The process of producing coherent and contextually relevant text using AI language models.

Text Classification

Text Analysis

The task of assigning predefined categories or labels to text documents based on their content.

Document Understanding

Core NLP

AI systems that extract structured information from unstructured documents by combining optical character recognition, layout analysis, and natural language comprehension.