Tokenisation

Overview

Direct Answer

Tokenisation is the foundational preprocessing step that converts raw text into discrete units (tokens) that language models can process numerically. These units may represent individual words, subword fragments, or characters, depending on the tokenisation strategy employed.

How It Works

The process segments input text according to defined rules: at whitespace and punctuation boundaries for word-level tokenisation, or through vocabulary-based algorithms such as Byte Pair Encoding (BPE) or WordPiece for subword splitting. Each token is then mapped to an integer identifier via a vocabulary built from training data, which lets downstream models look up embeddings and perform mathematical operations on textual data.
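As a rough illustration of the two steps described above (segmentation, then mapping tokens to identifiers), here is a minimal word-level sketch. The function names `word_tokenise`, `build_vocab`, and `encode`, and the `<unk>` placeholder token, are illustrative assumptions and do not come from any particular library.

```python
import re


def word_tokenise(text):
    """Split text into word tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens, specials=("<unk>",)):
    """Assign each distinct token an integer id; special tokens come first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, falling back to the <unk> id for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


tokens = word_tokenise("The cat sat. The cat ran.")
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Repeated tokens receive the same identifier, and any token absent from the vocabulary collapses to the single `<unk>` id, which is the out-of-vocabulary problem that subword methods are designed to reduce.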

Why It Matters

Effective tokenisation directly impacts model efficiency, accuracy, and cost. A tokeniser that splits text into more tokens than necessary increases sequence length, consuming more computation and memory during training and inference. Language coverage and handling of out-of-vocabulary terms critically influence model robustness across multilingual and domain-specific applications.
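The tension between sequence length and out-of-vocabulary handling can be seen in a small comparison of word-level and character-level tokenisation. This is a toy sketch: the two sentences and all variable names are made up for illustration.

```python
def word_tokenise(text):
    """Word-level: split on whitespace."""
    return text.split()


def char_tokenise(text):
    """Character-level: every character (including spaces) is a token."""
    return list(text)


train = "the model reads the text"
test = "the reader stores the model"

# Vocabularies are whatever was seen in the training text.
word_vocab = set(word_tokenise(train))
char_vocab = set(char_tokenise(train))

word_tokens = word_tokenise(test)
char_tokens = char_tokenise(test)

# Tokens in the test text that the vocabulary has never seen.
word_oov = [t for t in word_tokens if t not in word_vocab]
char_oov = [c for c in char_tokens if c not in char_vocab]

print(len(word_tokens), word_oov)
print(len(char_tokens), char_oov)
```

In this example the word-level tokeniser yields a short sequence but cannot represent two of the test words, while the character-level tokeniser covers every symbol at the price of a sequence several times longer. Subword schemes aim for a point between these extremes.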

Common Applications

Tokenisation is essential across machine translation systems, sentiment analysis pipelines, document classification, and conversational AI platforms. It enables named entity recognition systems to identify boundaries of entities and supports question-answering models in retrieving and ranking relevant text spans.

Key Considerations

Trade-offs exist between vocabulary size, sequence length, and computational overhead. Language-specific requirements, handling of punctuation and special characters, and preserving semantic boundaries present ongoing challenges, particularly for morphologically rich languages and code-based applications.
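The trade-off between vocabulary size and sequence length can be sketched with a toy Byte Pair Encoding trainer. This is a simplified illustration under stated assumptions (no word-boundary markers, no byte-level fallback, ties broken by first occurrence), not the procedure used by any specific production tokeniser; `train_bpe`, `merge_pair`, `apply_bpe`, and `stats` are hypothetical helper names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair with one concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules, most frequent adjacent pair first."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def stats(words, num_merges):
    """Return (vocabulary size, total tokens) for a given merge budget."""
    merges = train_bpe(words, num_merges)
    base_chars = {c for w in words for c in w}
    total_tokens = sum(len(apply_bpe(w, merges)) for w in words)
    return len(base_chars) + len(merges), total_tokens


words = "low low low lower lowest newer newer".split()
for n in (0, 2, 6):
    print(n, stats(words, n))
```

Each additional merge adds one entry to the vocabulary and can only shorten, never lengthen, the tokenised corpus, so the merge budget acts as a direct dial between vocabulary size and sequence length.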
