Overview
Direct Answer
Tokenisation is the foundational preprocessing step that converts raw text into discrete units (tokens) that language models can process numerically. These units may represent individual words, subword fragments, or characters, depending on the tokenisation strategy employed.
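The word-level case can be sketched in a few lines: split on whitespace, assign each distinct word a numeric id, and encode new text against that vocabulary. The function names and toy corpus below are illustrative assumptions, not any particular library's API.

```python
# Minimal sketch of word-level tokenisation with a learned vocabulary.

def build_vocab(corpus):
    """Assign each unique whitespace-separated word a numeric id."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_id=-1):
    """Map each word to its id; unseen words fall back to unk_id."""
    return [vocab.get(word, unk_id) for word in text.split()]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
print(vocab)                         # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog ran", vocab))  # [0, 3, -1] -- 'ran' is out of vocabulary
```

The out-of-vocabulary fallback here is exactly the weakness that subword methods address: a word never seen in training still decomposes into known fragments.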
How It Works
The process segments input text according to defined rules—either at whitespace boundaries for word-level tokenisation, or through vocabulary-based algorithms such as Byte Pair Encoding or WordPiece for subword splitting. Each token is then mapped to a numerical identifier via a learned vocabulary, enabling downstream models to perform mathematical operations on textual data.
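The subword side of this process can be illustrated with a toy version of the Byte Pair Encoding merge loop: start from characters, repeatedly find the most frequent adjacent pair, and merge it into a new vocabulary symbol. The corpus and frequencies below are illustrative assumptions, not a real trained vocabulary.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, mapped to its frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)  # learns ('w', 'e'), then ('we', 'r'), then ('l', 'o')
```

After these merges, "lower" is represented as the two subwords `lo` + `wer`, so the shared suffix `wer` is reused by "newer" as well; a real BPE vocabulary is built the same way, just over a far larger corpus and many thousands of merge steps.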
Why It Matters
Effective tokenisation directly impacts model efficiency, accuracy, and cost. Poor tokenisation strategies increase sequence length, consuming more computational resources and memory during training and inference. Language coverage and handling of out-of-vocabulary terms critically influence model robustness across multilingual and domain-specific applications.
Common Applications
Tokenisation is essential across machine translation systems, sentiment analysis pipelines, document classification, and conversational AI platforms. It enables named entity recognition systems to identify boundaries of entities and supports question-answering models in retrieving and ranking relevant text spans.
Key Considerations
Trade-offs exist between vocabulary size, sequence length, and computational overhead. Language-specific requirements, handling of punctuation and special characters, and preserving semantic boundaries present ongoing challenges, particularly for morphologically rich languages and code-based applications.
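The first of these trade-offs is easy to see directly: the same sentence costs far more tokens at the character level than at the word level, in exchange for a vocabulary no larger than the alphabet. The sample sentence is an arbitrary assumption.

```python
# Minimal illustration of the vocabulary-size vs sequence-length trade-off.
text = "tokenisation strategies differ"

word_tokens = text.split()  # short sequence, but every distinct word needs a vocab entry
char_tokens = list(text)    # alphabet-sized vocabulary, but a much longer sequence

print(len(word_tokens), len(set(word_tokens)))  # 3 tokens, 3 vocabulary entries
print(len(char_tokens), len(set(char_tokens)))  # 30 tokens, 13 vocabulary entries
```

Subword schemes such as BPE sit between these extremes, which is why vocabulary size (typically tens of thousands of merges) is a tunable hyperparameter rather than a fixed property of the text.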
Cited Across coldai.org: 4 pages mention Tokenisation
Industry pages, services, technologies, capabilities, case studies and insights on coldai.org that reference Tokenisation — providing applied context for how the concept is used in client engagements.
Referenced By: 1 term mentions Tokenisation
Other entries in the wiki whose definition references Tokenisation — useful for understanding how this concept connects across Natural Language Processing and adjacent domains.
More in Natural Language Processing
Reranking
Core NLP: A two-stage retrieval process where an initial set of candidate documents is rescored by a more powerful model to improve the relevance ordering of search results.
Seq2Seq Model
Core NLP: A neural network architecture that maps an input sequence to an output sequence, used in translation and summarisation.
Multilingual Model
Semantics & Representation: A language model trained on text from dozens or hundreds of languages simultaneously, enabling cross-lingual understanding and generation without language-specific fine-tuning.
Machine Translation
Generation & Translation: The use of AI to automatically translate text or speech from one natural language to another.
Instruction Following
Semantics & Representation: The capability of language models to accurately interpret and execute natural language instructions, a core skill developed through instruction tuning and alignment training.
Topic Modelling
Text Analysis: An unsupervised technique for discovering abstract topics that occur in a collection of documents.
Natural Language Understanding
Core NLP: The subfield of NLP focused on machine reading comprehension and extracting meaning from text.
Cross-Lingual Transfer
Core NLP: The application of models trained in one language to perform tasks in another language, leveraging shared multilingual representations learned during pre-training.