Check Your LLM's Secret Dictionary!

New Study: LLMs Have an Interpretable Internal Vocabulary Structure

What we found

We are pleased to announce the release of our new paper:

“Check Your LLM’s Secret Dictionary!
Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn’t Have)”

Our study introduces a lightweight method for examining the internal structure of large language models (LLMs) by applying singular value decomposition (SVD) directly to the output projection matrix (lm_head). The method requires only five lines of code and does not rely on model inference or external datasets.
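To make the idea concrete, here is a minimal sketch of an SVD applied to an output projection matrix. It is not the paper's code: the function name top_tokens_per_direction, the six-token vocabulary, and the hand-built weight matrix are all hypothetical stand-ins chosen so the example runs with NumPy alone. It assumes a matrix with one row per vocabulary token, and lists the tokens that load most heavily on each singular direction.

```python
import numpy as np

def top_tokens_per_direction(weight, vocab, n_dirs=2, k=3):
    """Return singular values and the k highest-loading tokens per direction.

    weight: (vocab_size, hidden_dim) array, one row per output token.
    vocab:  list of token strings aligned with the rows of `weight`.
    """
    # Thin SVD: weight = U @ diag(S) @ Vt. Row j of U holds token j's
    # coordinates along each singular direction.
    U, S, _ = np.linalg.svd(weight, full_matrices=False)
    tokens = []
    for i in range(n_dirs):
        # Rank by absolute loading: the sign of a singular vector is arbitrary.
        top = np.argsort(-np.abs(U[:, i]))[:k]
        tokens.append([vocab[j] for j in top])
    return S, tokens

# Hypothetical toy matrix standing in for real lm_head weights (assumption).
vocab = ["the", "of", "and", "Paris", "Tokyo", "Cairo"]
weight = np.zeros((6, 4))
weight[:3, 0] = [5.0, 4.0, 3.0]  # function-word tokens share hidden direction 0
weight[3:, 1] = [2.0, 1.5, 1.0]  # place-name tokens share hidden direction 1

S, tokens = top_tokens_per_direction(weight, vocab)
for i, group in enumerate(tokens):
    print(f"direction {i}: sigma={S[i]:.2f}, top tokens={group}")
```

With a real model, the matrix would come from the loaded weights instead of a toy array. In Hugging Face Transformers, for instance, model.get_output_embeddings().weight is one way to reach it, though that loading path is an assumption here and not something the announcement specifies.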

We find that the lm_head matrix of transformer-based language models contains structured, interpretable subspaces corresponding to distinct categories of vocabulary: syntactic function words, named entities, semantic domains, and formatting tokens. These structures can be extracted directly from model weights, revealing systematic differences across model families without running the model at all.

Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that each model exhibits distinct spectral and clustering patterns reflecting differences in training data composition and post-training alignment. Notably, certain vocabulary-level structures, including ethically concerning content, remain stable across base and instruction-tuned variants, indicating that post-training alignment does not fully overwrite properties learned during pretraining.

The study introduces two new metrics:

Vocabulary Cluster Score (VCS): measures the geometric coherence of vocabulary subspaces in the output projection layer.
Weighted Projection Score (WPS): a static detector for anomalous tokens, including glitch tokens, requiring no model inference. Applying WPS to GPT-OSS-120B recovers a well-known problematic token without running a single forward pass.

Why this matters for MBSE and Model-based Human–Machine Collaboration

As we discussed in our previous post on AOP, we have been working to bridge the gap between the stochastic, empirical world of LLMs and the rigorous, formal models of Model-Based Systems Engineering (MBSE). The internal vocabulary structure revealed by this analysis directly informs the technologies we are developing to build reliable, accountable AI for engineering applications.

Understanding how an LLM organises its output vocabulary, and what that organisation reveals about its training data, is a prerequisite for auditable, trustworthy AI systems. This work provides a concrete, reproducible method for that analysis.

We published this paper on arXiv on 22 May 2026: arXiv:2605.22005

We continue to contribute to reliable, dependable AI and MBSE technologies.