Overview
Direct Answer
BLEU (Bilingual Evaluation Understudy) is a quantitative metric that measures the correspondence between machine-generated text and one or more reference translations by comparing n-gram overlap. It produces a score between 0 and 1, where higher scores indicate closer alignment with reference text.
How It Works
The metric calculates modified (clipped) precision: the proportion of n-grams (typically sequences of 1 to 4 words) in the generated output that also appear in the reference text(s), with each n-gram's count capped at its maximum count in any single reference. A brevity penalty then discounts outputs shorter than the reference, preventing artificially inflated scores from terse translations. The per-length precisions are combined via their geometric mean to produce a single composite score.
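The steps above can be sketched in a minimal, self-contained implementation. This is an illustrative simplification (assuming whitespace tokenisation and no smoothing), not a reference implementation; production systems typically use a standardised tool instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Minimal sentence-level BLEU: clipped n-gram precision for n = 1..4,
    combined by geometric mean, with a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count at its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precision = clipped / total
        if precision == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precision_sum += math.log(precision) / max_n
    # Brevity penalty: compare the candidate length with the closest reference length.
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_precision_sum)
```

A candidate identical to its reference scores 1.0; a candidate sharing no 4-grams with any reference scores 0.0, which is why smoothing is added for short sentences in practice.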
Why It Matters
BLEU enables rapid, reproducible evaluation of machine translation and text generation systems without requiring manual human assessment, significantly reducing evaluation costs and enabling continuous quality monitoring across translation pipelines and model iterations.
Common Applications
The metric is widely deployed in machine translation evaluation, multilingual natural language processing research, and quality assurance workflows for automated subtitle generation and cross-language content adaptation systems.
Key Considerations
BLEU scores correlate imperfectly with human judgement of translation quality and cannot detect semantic correctness or fluency; a single reference translation may penalise valid alternative phrasings, necessitating supplementary evaluation methods for comprehensive quality assessment.
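The paraphrase problem can be seen concretely with unigram precision, the building block of BLEU-1. The sentences below are hypothetical examples chosen for illustration, assuming whitespace tokenisation:

```python
from collections import Counter

reference = "the weather is very cold today".split()
paraphrase = "it is freezing outside right now".split()  # similar meaning, different words

# Clipped unigram overlap between candidate and reference.
overlap = sum((Counter(paraphrase) & Counter(reference)).values())
precision = overlap / len(paraphrase)
print(precision)  # only "is" overlaps, so precision is 1/6 despite similar meaning
```

Because BLEU sees tokens rather than meaning, a fluent paraphrase can score near zero against a single reference, which is why multiple references or complementary metrics are recommended.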