Aleph Alpha wants to free language models from their dependence on tokenizers

The startup Aleph Alpha, regarded as one of Europe’s leading artificial intelligence companies, recently unveiled an advance in the field of large language models (LLMs). At the World Economic Forum in Davos, the company presented an architecture designed to work without a tokenizer. The ambition is clear: to reduce the computing resources required for both training and inference. Removing tokenizers could well represent a watershed moment for generative AI.

It is essential to understand how tokenizers work. These tools convert strings into lists of tokens that natural language processing (NLP) models can interpret. Although they have been crucial to the emergence of current LLMs, Aleph Alpha draws attention to the inefficiency they can introduce, particularly during fine-tuning and supervised training. Language models learn from the patterns present in tokenized text, which makes adapting them to previously unseen data more difficult.

The challenges of tokenization

Tokenization is not a trivial process and raises several challenges. Segmenting sentences into individual characters has been gradually abandoned because it consumes too much compute and memory. The current method, which splits words into subword units (sequences of adjacent characters), handles unknown words efficiently but “burdens” the models and makes them less effective on novel texts. Indeed, the biases introduced by the static vocabulary used to train the models make it impossible to allocate resources according to the complexity of the first tokens of a sentence.
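The trade-off between the two segmentation methods can be illustrated with a minimal sketch. The vocabulary below is hypothetical and tiny, and greedy longest-match is used as a simple stand-in for production subword algorithms, so this is an illustration rather than any vendor's actual tokenizer.

```python
# Toy static vocabulary (hypothetical; real vocabularies hold tens of thousands of entries).
VOCAB = {"token", "izer", "izers", "free", "un", "seen", "the", " "}


def char_tokenize(text):
    """Character-level segmentation: one token per character."""
    return list(text)


def subword_tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation over a static vocabulary.

    Spans that match no vocabulary entry fall back to single characters.
    """
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until an entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: emit one character
            i += 1
    return tokens


known = "the tokenizers"  # covered by the toy vocabulary
novel = "the zyxqv"       # contains a word the vocabulary has never seen

for text in (known, novel):
    print(len(char_tokenize(text)), len(subword_tokenize(text)), subword_tokenize(text))
```

On text the vocabulary covers, the subword sequence is much shorter than the character sequence. On the unseen word, the static vocabulary degrades to character-by-character output, which is the kind of inefficiency on novel text described above.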

Aleph Alpha proposes a radical change with the Hierarchical Autoregressive Transformer (HAT). This architecture combines character-level and word-level processing. It starts with a simple division of the text into words, using rules that follow the Unicode definition of word boundaries. Each word is then encoded into an embedding vector, which feeds a much more powerful main model.
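The two-level idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the regular expression is a rough simplification of Unicode word-boundary rules, the random table stands in for learned embeddings, and mean pooling is a placeholder for the learned word encoder. It is not Aleph Alpha's implementation.

```python
import random
import re

DIM = 4  # embedding width, chosen only for illustration
_rng = random.Random(0)
# One vector per possible byte value; stands in for learned byte embeddings.
BYTE_TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(256)]


def split_words(text):
    """Rough word split in which each word keeps its trailing whitespace.

    A simplification of Unicode word-boundary rules, not a full implementation.
    """
    return re.findall(r"\S+\s*|\s+", text)


def encode_word(word):
    """Mean-pool the byte embeddings of one word into a single vector.

    Mean pooling is a placeholder for a learned word-level encoder.
    """
    ids = list(word.encode("utf-8"))
    return [sum(BYTE_TABLE[b][d] for b in ids) / len(ids) for d in range(DIM)]


def hat_inputs(text):
    """Return the words and one embedding per word for the main model."""
    words = split_words(text)
    return words, [encode_word(w) for w in words]


words, vectors = hat_inputs("Tokenizers are optional")
print(words)         # the word-level units
print(len(vectors))  # the main model sees one position per word
```

The point of the design is that the expensive main model processes one vector per word, while the cheap character-level work happens inside each word.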

Issues related to tokenizers

The limitations of tokenizers are particularly significant in industrial settings, where users want models that can answer questions specific to their domain. Tokenizer-based models are also often poorly suited to languages other than English. Removing the tokenizer therefore appears to be a promising way to support the sovereignty of models and reduce the carbon footprint of their training.

As Aleph Alpha builds more efficient models, there is a growing need for models that adapt not only to industry-specific requirements but also to diverse languages. The current preference for multilingual language models calls for adjustments to tokenization, which at present remains too rigid and static.

Aleph Alpha’s tokenizer-free architecture

Aleph Alpha’s HAT redefines how text data is processed. By reducing the vocabulary to only 256 entries, the possible values of a single byte, with UTF-8 as the alphabet, the architecture stands out for its simplicity and efficiency. The system enables end-to-end training without relying on a fixed, pre-trained tokenizer, a significant departure from traditional architectures.
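A 256-entry vocabulary over UTF-8 means that any text, in any language, maps to integer IDs between 0 and 255 with no vocabulary file to train or ship. The short sketch below shows this with Python's built-in codec; the helper names and the sample sentence are ours, chosen for illustration.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids: rebuild the original string from byte IDs."""
    return bytes(ids).decode("utf-8")


sample = "Hyvää päivää, Aleph Alpha!"  # Finnish greeting with non-ASCII letters
ids = to_byte_ids(sample)

print(len(sample), len(ids))         # characters vs. bytes
print(max(ids) < 256)                # every ID fits the 256-entry vocabulary
print(from_byte_ids(ids) == sample)  # the mapping is lossless
```

Non-ASCII letters take more than one byte, so the byte sequence is longer than the character sequence. This is one reason a byte-level model benefits from grouping bytes into words before the main model sees them.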

To test the concept, Aleph Alpha built a model with 7 billion parameters, trained on a dataset of 2.3 trillion tokens in English and Finnish. The reported results are impressive, both in inference cost and in performance compared with tokenizer-based models.

Advantages of the HAT model

Early feedback on Aleph Alpha’s “Tokenizer Free” architecture highlights several notable advantages. Beyond a clear reduction in inference costs, the models are reported to be more efficient than many other models under development. They are also less sensitive to common mistakes such as typographical errors or incomplete words, which makes them more robust. These characteristics make HAT particularly promising for advanced applications where precision is crucial. As AI is increasingly integrated into industrial solutions, this could also mean a significant reduction in operating costs.

The limits and prospects of Aleph Alpha

However, removing the tokenizer does not overcome every challenge. Aleph Alpha’s architecture, although effective, has yet to demonstrate its viability for logographic languages such as Chinese or Japanese, where a single character can carry an entire meaning. Similar obstacles arise when applying the models to programming or complex mathematics. Aleph Alpha continues to explore other methods for separating input words and to adapt its approach accordingly.

Faced with competitors such as Meta that are also pursuing tokenizer-free solutions, continued innovation will be crucial for Aleph Alpha. The company must adapt its datasets and support multi-sector models while maintaining high quality standards.

The competitive landscape of tokenizer-free AI

As Aleph Alpha develops its HAT architecture, other research labs such as Meta are working in the same direction. Meta’s recent proposal, the Byte Latent Transformer, shares similar goals but takes a more complex approach, replacing the tokenizer with dynamic character representations. These developments highlight a growing interest in decentralized models that can meet varied needs while reducing costs. The debate on the future of tokenization is more relevant than ever and involves stakeholders across the AI sector.

The future of LLMs with Aleph Alpha

With its new architecture, Aleph Alpha aspires to position itself as a key player in the language model landscape. The transition to more autonomous generative AI systems could disrupt current development processes, providing businesses with a viable alternative to pre-existing models.

Aleph Alpha’s backing of this approach could drive significant change, allowing businesses to take full advantage of AI without the limitations imposed by tokenizers. The potential for improved productivity and reduced training costs could open the door to even wider adoption of artificial intelligence across industries. Ultimately, Aleph Alpha’s commitment to innovation in the field of LLMs could mark the dawn of a new era for AI.