Overview
Direct Answer
Tokenisation is the foundational preprocessing step that converts raw text into discrete units (tokens) that language models can process numerically. These units may represent individual words, subword fragments, or characters, depending on the tokenisation strategy employed.
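The word-level case can be sketched in a few lines: split on whitespace, assign each distinct word a numeric id, and encode new text against that vocabulary. The function names and toy corpus below are illustrative assumptions, not any particular library's API.

```python
# Minimal sketch of word-level tokenisation with a learned vocabulary.

def build_vocab(corpus):
    """Assign each unique whitespace-separated word a numeric id."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_id=-1):
    """Map each word to its id; unseen words fall back to unk_id."""
    return [vocab.get(word, unk_id) for word in text.split()]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
print(vocab)                         # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog ran", vocab))  # [0, 3, -1] -- 'ran' is out of vocabulary
```

The out-of-vocabulary fallback here is exactly the weakness that subword methods address: a word never seen in training still decomposes into known fragments.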
How It Works
The process segments input text according to defined rules—either at whitespace boundaries for word-level tokenisation, or through vocabulary-based algorithms such as Byte Pair Encoding or WordPiece for subword splitting. Each token is then mapped to a numerical identifier via a learned vocabulary, enabling downstream models to perform mathematical operations on textual data.
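The subword side of this process can be illustrated with a toy version of the Byte Pair Encoding merge loop: start from characters, repeatedly find the most frequent adjacent pair, and merge it into a new vocabulary symbol. The corpus and frequencies below are illustrative assumptions, not a real trained vocabulary.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, mapped to its frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)  # learns ('w', 'e'), then ('we', 'r'), then ('l', 'o')
```

After these merges, "lower" is represented as the two subwords `lo` + `wer`, so the shared suffix `wer` is reused by "newer" as well; a real BPE vocabulary is built the same way, just over a far larger corpus and many thousands of merge steps.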
Why It Matters
Effective tokenisation directly impacts model efficiency, accuracy, and cost. Poor tokenisation strategies increase sequence length, consuming more computational resources and memory during training and inference. Language coverage and handling of out-of-vocabulary terms critically influence model robustness across multilingual and domain-specific applications.
Common Applications
Tokenisation is essential across machine translation systems, sentiment analysis pipelines, document classification, and conversational AI platforms. It enables named entity recognition systems to identify boundaries of entities and supports question-answering models in retrieving and ranking relevant text spans.
Key Considerations
Trade-offs exist between vocabulary size, sequence length, and computational overhead. Language-specific requirements, handling of punctuation and special characters, and preserving semantic boundaries present ongoing challenges, particularly for morphologically rich languages and code-based applications.
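The first of these trade-offs is easy to see directly: the same sentence costs far more tokens at the character level than at the word level, in exchange for a vocabulary no larger than the alphabet. The sample sentence is an arbitrary assumption.

```python
# Minimal illustration of the vocabulary-size vs sequence-length trade-off.
text = "tokenisation strategies differ"

word_tokens = text.split()  # short sequence, but every distinct word needs a vocab entry
char_tokens = list(text)    # alphabet-sized vocabulary, but a much longer sequence

print(len(word_tokens), len(set(word_tokens)))  # 3 tokens, 3 vocabulary entries
print(len(char_tokens), len(set(char_tokens)))  # 30 tokens, 13 vocabulary entries
```

Subword schemes such as BPE sit between these extremes, which is why vocabulary size (typically tens of thousands of merges) is a tunable hyperparameter rather than a fixed property of the text.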
Cited Across coldai.org: 4 pages mention Tokenisation
Industry pages, services, technologies, capabilities, case studies and insights on coldai.org that reference Tokenisation — providing applied context for how the concept is used in client engagements.
Referenced By: 1 term mentions Tokenisation
Other entries in the wiki whose definition references Tokenisation — useful for understanding how this concept connects across Natural Language Processing and adjacent domains.
More in Natural Language Processing
Reranking
Core NLP: A two-stage retrieval process where an initial set of candidate documents is rescored by a more powerful model to improve the relevance ordering of search results.
Seq2Seq Model
Core NLP: A neural network architecture that maps an input sequence to an output sequence, used in translation and summarisation.
Multilingual Model
Semantics & Representation: A language model trained on text from dozens or hundreds of languages simultaneously, enabling cross-lingual understanding and generation without language-specific fine-tuning.
Machine Translation
Generation & Translation: The use of AI to automatically translate text or speech from one natural language to another.
Instruction Following
Semantics & Representation: The capability of language models to accurately interpret and execute natural language instructions, a core skill developed through instruction tuning and alignment training.
Topic Modelling
Text Analysis: An unsupervised technique for discovering abstract topics that occur in a collection of documents.
Natural Language Understanding
Core NLP: The subfield of NLP focused on machine reading comprehension and extracting meaning from text.
Cross-Lingual Transfer
Core NLP: The application of models trained in one language to perform tasks in another language, leveraging shared multilingual representations learned during pre-training.