Why Language: A Human Brain Perspective on VLA

Starting from the human brain.

Human conscious thinking primarily consists of five basic functions: understanding, decision-making, recollection, memory, and inhibition. These functions work together to enable planning, problem-solving, communication, and task completion.

Many people believe that conscious activity occupies the vast majority of our brain capacity—but this is not the case. Unconscious thinking is actually the protagonist of our mental life. These automatic unconscious behaviors operate outside our conscious control: maintaining heartbeat and respiration, instinctual approach-avoidance, the fight-or-flight response when facing danger, and much more.


Prefrontal Cortex vs. Basal Ganglia

Conscious thinking is hosted primarily by the prefrontal cortex (PFC). The PFC is responsible for decision-making and problem-solving, and it is among the most recently evolved regions of the human brain. As such, the prefrontal cortex serves as the core of deliberate thinking, carrying the content of our thoughts at any given moment.

Unconscious automatic thinking primarily engages another brain region called the basal ganglia, an older part of the brain that's extremely energy-efficient. Once an activity has been repeated a few times, the basal ganglia takes over, and the activity no longer consumes excessive energy. Consider driving: when you first learn to drive, you need full concentration. But once you've mastered it, you can drive in simple scenarios using your subconscious—many people listen to podcasts or music while driving effortlessly. Because basal ganglia processing is fast, it's also called System 1 (the fast system).

In contrast, the prefrontal cortex, having evolved relatively recently, occupies only 4-5% of brain volume. Though small in proportion, it is extraordinarily energy-intensive: the prefrontal cortex consumes glucose and oxygen at an astonishing rate. Moreover, the brain allocates a limited energy budget for decision-making and impulse control that gradually depletes with use, which explains why your brain "slows down" when you're tired or hungry. In emergencies, the brain also conserves energy for System 1. Because prefrontal processing is slower, it's also called System 2 (the slow system).


Sequence → Pattern → Language

The basal ganglia excels at building patterns. It picks up patterns unconsciously, as demonstrated by the Serial Reaction Time (SRT) paradigm proposed by Nissen and Bullemer in 1987.

In the SRT experiment, participants pressed one of four keys based on where a light flashed on a screen. Participants were divided into two groups:

- One group saw lights in completely random positions
- The other group saw lights that followed a complex, repeating pattern too subtle for participants to consciously identify

The result? The basal ganglia still picked up on the pattern—the second group responded an average of 10% faster than the random group. Even more interesting: when some participants did consciously notice the pattern and could describe it verbally, their reaction speed was 30-50% faster than the random group.
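To make this concrete, here is a minimal, hypothetical simulation, not the original Nissen & Bullemer protocol: a learner that keeps simple transition counts "responds faster" whenever its implicit prediction about the next light is correct, so patterned sequences yield lower mean reaction times than random ones. The pattern, base reaction time, and speedup are all illustrative numbers.

```python
import random

# Toy simulation of implicit sequence learning in the SRT task
# (an illustrative sketch, not the original 1987 protocol).
# The learner keeps transition counts between consecutive light
# positions and "responds faster" when its prediction is correct.

def make_sequence(patterned, length=1000, pattern=(0, 2, 1, 3, 2, 0)):
    if patterned:
        return [pattern[i % len(pattern)] for i in range(length)]
    return [random.randrange(4) for _ in range(length)]

def mean_reaction_time(seq, base_rt=500, speedup=150):
    counts = [[1] * 4 for _ in range(4)]  # smoothed transition counts
    total, prev = 0.0, seq[0]
    for cur in seq[1:]:
        predicted = max(range(4), key=lambda k: counts[prev][k])
        # A correct implicit prediction shaves time off the response.
        total += base_rt - (speedup if predicted == cur else 0)
        counts[prev][cur] += 1
        prev = cur
    return total / (len(seq) - 1)

random.seed(0)
print(f"random:    {mean_reaction_time(make_sequence(False)):.0f} ms")
print(f"patterned: {mean_reaction_time(make_sequence(True)):.0f} ms")
```

Even this crude counter ends up faster on the patterned sequence, without ever representing the rule explicitly, which is the essence of what the basal ganglia is doing.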

The SRT experiment reveals that we learn complex rules unconsciously, and what's most striking is the correlation between SRT task performance and language ability. In fact, SRT simulates the core process of language acquisition:

Sequential Dependency

- SRT: Light 1 is always followed by Light 3.
- Language: In English, "The" is overwhelmingly likely to be followed by a noun (e.g., "cat") rather than a verb (e.g., "went").
- Commonality: Both involve implicit extraction of rules from linear sequences.
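The same transition-counting idea from the SRT sketch above applies directly to word sequences. A toy example over an illustrative mini-corpus:

```python
from collections import Counter

# Counting what follows "the" in a tiny illustrative corpus:
# nouns dominate, just as the bigram statistics of English predict.
corpus = "the cat sat on the mat and the dog went to the cat".split()
after_the = Counter(b for a, b in zip(corpus, corpus[1:]) if a == "the")
print(after_the.most_common())  # [('cat', 2), ('mat', 1), ('dog', 1)]
```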

Non-adjacent Dependency

More advanced SRT experiments involve patterns like A-X-B, where X is an intervening distractor and A predicts the later appearance of B. The linguistic equivalent is subject-verb agreement or other long-distance dependencies.

Example: "The boy [who is wearing a red hat] is running."

The brain must remember "boy" (A) from the beginning, skip past the modifying clause (X), and predict that the verb must be "is" (B) rather than "are".

Finding: People who learn these complex patterns well in SRT tasks typically also have stronger grammatical abilities.
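As a toy illustration of the A-X-B mechanism (the bracket markers and two-word lexicon are assumptions for the sketch, not a real parser), the following checks agreement between a head noun and its verb across an intervening clause:

```python
# Toy A-X-B check: remember the head noun (A), skip an intervening
# relative clause (X, marked here with brackets), then verify the
# verb (B). The lexicon and notation are illustrative only.

SINGULAR = {"boy", "cat"}
PLURAL = {"boys", "cats"}

def agreement_ok(sentence: str) -> bool:
    depth, head, verb = 0, None, None
    for w in sentence.split():
        if w == "[":
            depth += 1          # enter the interfering clause (X)
        elif w == "]":
            depth -= 1          # leave it; the head noun is still active
        elif depth == 0:
            if head is None and (w in SINGULAR or w in PLURAL):
                head = w        # A: the head noun
            elif head is not None and w in ("is", "are"):
                verb = w        # B: the verb that must agree with A
                break
    if head is None or verb is None:
        return False
    return (head in SINGULAR) == (verb == "is")

print(agreement_ok("The boy [ who is wearing a red hat ] is running"))   # True
print(agreement_ok("The boy [ who is wearing a red hat ] are running"))  # False
```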


The Nature of Language: Structured Pattern System

Language itself is a multi-layered, complex system of patterns:

1. Phonological Patterns

Each language has its own specific set of pronunciation rules and combinations. For example, in English, /st/ can appear at the beginning of a word (as in "stop"), but /ts/ cannot (except in loanwords). Human infants learn and internalize these permitted phonological patterns by listening to speech in their environment.
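A minimal sketch of such a phonotactic constraint, with a deliberately tiny onset inventory and loanword list chosen for illustration:

```python
import re

# Toy phonotactic check (illustrative, not a full phonology): English
# allows the onset "st" ("stop") but not "ts", except in loanwords.
LEGAL_ONSETS = {"st", "str", "sp", "tr", "pl"}   # tiny illustrative set
LOANWORDS = {"tsunami"}                          # memorized exceptions

def onset_is_legal(word: str) -> bool:
    if word in LOANWORDS:
        return True
    m = re.match(r"[^aeiou]+", word)  # consonant cluster before first vowel
    onset = m.group(0) if m else ""
    return len(onset) <= 1 or onset in LEGAL_ONSETS

for w in ["stop", "tsop", "tsunami", "cat"]:
    print(w, onset_is_legal(w))  # stop/tsunami/cat pass, tsop fails
```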

2. Morphological Patterns

Words are built from smaller meaningful units (morphemes). Pattern recognition allows us to understand how meaning changes with word endings. For example: we recognize that the -ed pattern in English indicates past tense ("walked"), while the -s pattern indicates plurality ("cats").
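The two suffix patterns above can be captured in a few lines. This is only a sketch of the pattern idea; real morphology is far messier (ran, children, sheep):

```python
# Toy morpheme splitter: recognize the "-ed" past-tense and
# "-s" plural patterns. Irregular forms are out of scope.

def analyze(word: str):
    if word.endswith("ed") and len(word) > 3:
        return (word[:-2], "-ed", "past tense")
    if word.endswith("s") and len(word) > 2:
        return (word[:-1], "-s", "plural")
    return (word, None, "bare stem")

for w in ["walked", "cats", "run"]:
    print(w, "->", analyze(w))
```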

3. Syntactic Patterns (Grammar)

This is the most obvious pattern level. The grammar of a language is fundamentally a rule system for combining words into meaningful sentences. For instance, many languages follow a "Subject-Verb-Object" (SVO) pattern. Learning a language means learning this word order pattern and being able to identify invalid patterns (ungrammatical sentences).
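A word-order check can be sketched the same way. The part-of-speech lexicon here is an illustrative assumption:

```python
# Toy SVO check: accept only the Subject-Verb-Object template
# and reject other orders as ungrammatical.

LEXICON = {"dog": "N", "cat": "N", "chases": "V", "sees": "V"}

def is_svo(sentence: str) -> bool:
    tags = [LEXICON.get(w) for w in sentence.lower().split()]
    return tags == ["N", "V", "N"]   # the S-V-O pattern

print(is_svo("dog chases cat"))   # True  (valid SVO)
print(is_svo("chases dog cat"))   # False (invalid order)
```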

4. Semantic Patterns

Word meanings and usage are not random—they exist in a network. For example, "cat" and "dog" both fall into the pattern categories of "pets" or "animals."
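One minimal way to picture this network is a graph of "is-a" links, shown here with an illustrative toy vocabulary:

```python
# Toy semantic network: word meanings sit in a graph of "is-a"
# links, so "cat" and "dog" share the categories named above.

IS_A = {"cat": "pet", "dog": "pet", "pet": "animal", "sunflower": "plant"}

def categories(word):
    chain = []
    while word in IS_A:
        word = IS_A[word]
        chain.append(word)
    return chain

print(categories("cat"))   # ['pet', 'animal']
print(categories("dog"))   # ['pet', 'animal']
```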


Hierarchy of Patterns

In 2005, Scientific American Mind published an article about how chess masters use chunking to remember board configurations. Masters don't plan hundreds of moves ahead—instead, they remember chunks and abstract typical patterns, creating what we call a hierarchy of patterns.

When we describe patterns using language, we inherit this chunking for free: language itself is multi-layered and recursive, which makes it uniquely suited for expressing higher-level patterns. This is another key advantage of using language.
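One way to see this recursion is that a sentence is a chunk of chunks, much like a chess master's "position" decomposes into familiar sub-patterns. The nested-tuple representation and phrase labels below are illustrative:

```python
# A sentence as a hierarchy of chunks: words nest inside phrases,
# phrases nest inside the sentence. Labels are illustrative.

sentence = ("S",
            ("NP", "the", ("N", "boy")),
            ("VP", ("V", "is"), ("V", "running")))

def depth(chunk):
    # Height of the pattern hierarchy: bare words have depth 0.
    if isinstance(chunk, str):
        return 0
    return 1 + max(depth(c) for c in chunk[1:])

print(depth(sentence))  # 3: word -> phrase -> nested phrase -> sentence
```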

Thinking Machines Example

A recent example from Thinking Machines' Tinker project suggests that Vision-Language Models (VLMs) generalize and learn from few examples better than purely visual models like DINOv2. Comparing Qwen3-VL and DINOv2 with very little additional data, one example per class, the VLM performed significantly better.

In the limited-data regime, Qwen3-VL-235B-A22B outperforms DINOv2. Not only is it a bigger model, but as a VLM it also comes with language knowledge out of the box (i.e., it already knows what a "golden retriever" or a "sunflower" is). This general language-and-vision capability makes Qwen3-VL readily applicable to vision tasks beyond classification.
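As a rough sketch of the one-example-per-class setup (not the actual Tinker code), one can classify a query image by cosine similarity to a single stored embedding per class. The `embed` function below is a placeholder for whichever encoder is under test, a DINOv2 backbone or a VLM's vision tower:

```python
import numpy as np

# Hypothetical one-shot classifier: nearest neighbor over one labeled
# embedding per class. `embed` is a stand-in for a real encoder.

def embed(image: np.ndarray) -> np.ndarray:
    # Placeholder encoder: swap in a real model's image features here.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((image.size, 64))
    return image.flatten() @ proj

def one_shot_classify(query, support):
    """support maps class label -> one example image."""
    def unit(v):
        return v / np.linalg.norm(v)
    q = unit(embed(query))
    return max(support, key=lambda label: float(q @ unit(embed(support[label]))))

rng = np.random.default_rng(1)
support = {"cat": rng.random((8, 8)), "dog": rng.random((8, 8))}
print(one_shot_classify(support["cat"] + 0.01, support))  # expected: 'cat'
```

The point of the comparison is that this simple lookup works much better when the encoder already carries strong priors, which is exactly what language pretraining gives the VLM.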


Summary

Through language learning, humans acquire a general approach to pattern learning. When facing many types of regularities, we describe them in language and commit them to memory. This allows the basal ganglia to respond quickly and correctly when similar problems arise. Correspondingly, in VLA (Vision-Language-Action) training, using a language-based system to model patterns enables faster mastery of regularities and produces better responses during inference.


Visual and Language

The human brain constructs visual mental imagery, an extremely efficient representation that packs in vast amounts of information: complex relationships between objects, their sizes, relative positions, and more. The visual pathway evolved millions of years before the language circuit and operates with higher efficiency. For example, if you want to solve a logic problem, working through it with visual imagery rather than purely abstract concepts dramatically improves efficiency.

Concepts from language can also be used to construct visual imagery through imagination:

  • Visual cortex (occipital lobe): Even without external stimuli, the primary visual cortex (V1, V2) and higher visual areas can be activated to create a "visual buffer." This acts as the brain's "screen" for presenting imagined scenes, though activation intensity is weaker than in real perception.

  • Prefrontal cortex: Responsible for setting imaginative goals, guiding the entire process, and executing high-level planning and decision-making. It acts like a "conductor," actively extracting information from memory and assembling it into visual scenes.

  • Hippocampus: Retrieves visual memory fragments related to concepts from long-term memory. For example, when you think of "apple," the hippocampus pulls out memory traces of the shape and color of apples you've seen before.

  • Parietal and temporal association cortex: Integrates multi-sensory information to form a comprehensive representation of an object. The parietal lobe handles spatial layout ("where"), while the temporal lobe handles semantic classification ("what").

This process can be understood as top-down generative simulation:

  1. Trigger Stage: Internal or external cues (language, questions, emotions) activate the default mode network and send instructions to the memory system.

  2. Memory Deconstruction & Extraction: The hippocampus deconstructs memories associated with the concept and extracts relevant visual fragments (colors, shapes, spatial relationships).

  3. Joint Integration: The parietal and temporal lobes integrate the fragments into a coherent representation, giving it spatial structure and semantic coherence.

  4. Visual Rendering: The integrated signal is projected back to the visual cortex via feedback connections. Neuroscientist Stephen Grossberg called this mechanism "folded feedback"—predictive signals from the prefrontal cortex and high-level visual areas back-activate lower visual areas, causing the visual cortex to "activate as if it were really seeing." This activation pattern resembles real perception but is weaker and less stable.

  5. Attention Regulation: The prefrontal cortex and parietal lobe allocate attention resources, adjusting the clarity and detail of the imagined content. The more vivid the imagination, the stronger the visual cortex activation.


Visual Imagery and Language Circuits

Visual imagery and language circuits have a complementary relationship:

- Structured visual input provides anchors for linguistic symbols
- Linguistic symbols, through recursive composition, give visual imagery causal chains and narrative structure

If you deprive the system of the visuospatial sketchpad, language can still function—but it loses "scenario imagination" and degenerates into list-like statements. If you freeze the phonological loop, visual imagery is still perceived—but cannot generate temporal sequences and reasoning chains.

Therefore, the foundation of structured patterns partially depends on the "synchronous parallel semantic field" provided by the visual circuit. Without multimodal scaffolding, the structural skeleton of language remains, but its flesh withers.


Final Summary

Language and vision together form the foundational system of the human brain. This brief overview of these two fundamental systems illustrates why both Vision and Language are indispensable in VLA. We hope this perspective provides some useful guidance for your work.


Inspired by the original Zhihu answer by bobin
