EsoLang-Bench

Evaluating Genuine Reasoning in Large Language Models
via Esoteric Programming Languages

Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

Abstract

Current benchmarks for large language model (LLM) code generation primarily evaluate mainstream languages like Python, where models benefit from massive pretraining corpora. This leads to inflated accuracy scores that may reflect data memorization rather than genuine reasoning ability. We introduce EsoLang-Bench, a benchmark of 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) where training data is 5,000 to 100,000x scarcer than Python.

We evaluate five frontier models using five prompting strategies and two agentic coding systems. The best-performing model achieves only 3.8% overall accuracy, compared to ~90% on equivalent Python tasks. All models score 0% on problems above the Easy tier, Whitespace remains completely unsolved (0% across all configurations), and self-reflection provides essentially zero benefit. These results reveal a dramatic gap between benchmark performance on mainstream languages and genuine programming ability, suggesting that current LLM code generation capabilities are far narrower than headline metrics imply.

Explainer Video

Leaderboard

Best result across all prompting strategies per language. 80 problems per language, 6 test cases each.

Key Findings

1. 85-Point Performance Gap

Frontier models achieving 85 to 95% on standard benchmarks score only 0 to 11% on equivalent esoteric tasks, revealing that high scores on mainstream languages do not reflect general programming ability.

2. 0% Beyond Easy Tier

All models score 0% on Medium, Hard, and Extra-Hard problems across all languages and strategies, indicating a hard ceiling on current reasoning capabilities beyond the simplest tasks.

3. Whitespace Completely Unsolved

No model produces valid Whitespace code under any configuration. The invisible syntax (spaces, tabs, and newlines only) cannot be learned from training data: a paradigm this rare is economically irrational to include in pre-training corpora.
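To see why the syntax is "invisible", consider rendering a Whitespace-style source fragment with visible placeholders. The fragment below is hypothetical (not a valid program), and the snippet only illustrates the encoding problem, not Whitespace semantics:

```python
# Whitespace source uses only three characters. Substituting visible letters
# (S = space, T = tab, L = linefeed) shows what a tokenizer actually sees.
TOKEN_NAMES = {" ": "S", "\t": "T", "\n": "L"}

def make_visible(src: str) -> str:
    """Replace each whitespace token with a visible letter; reject anything else."""
    out = []
    for ch in src:
        if ch not in TOKEN_NAMES:
            raise ValueError(f"invalid character {ch!r}: not Whitespace syntax")
        out.append(TOKEN_NAMES[ch])
    return "".join(out)

# A hypothetical fragment: to a human reader it looks like blank space.
fragment = "   \t\n\t\n  "
print(repr(fragment))
print(make_visible(fragment))  # -> "SSSTLTLSS"
```

Every distinct program renders as near-identical blankness in ordinary text, so pretraining corpora carry almost no usable signal about it.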

4. In-Context Learning Fails

Few-shot prompting yields no significant improvement over zero-shot (Wilcoxon p = 0.505), suggesting ICL success on standard benchmarks reflects activation of training priors rather than genuine in-context learning.
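The zero-shot vs. few-shot comparison rests on a paired significance test over per-problem scores. As a self-contained illustration of the shape of such a test (a simple exact sign test, not the paper's Wilcoxon signed-rank procedure, and with made-up scores):

```python
from math import comb

def sign_test_p(zero_shot, few_shot):
    """Two-sided exact sign test on paired per-problem scores.

    Counts problems where few-shot beat zero-shot vs. the reverse; ties are
    dropped. Under H0 (no difference), wins follow Binomial(n, 0.5).
    """
    diffs = [f - z for z, f in zip(zero_shot, few_shot) if f != z]
    n, wins = len(diffs), sum(d > 0 for d in diffs)
    k = max(wins, n - wins)
    # Two-sided tail probability of a result at least this lopsided.
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical paired accuracies with small, mixed differences:
# the large p-value means no evidence that few-shot helps.
zero = [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0]
few  = [0.0, 0.1, 0.2, 0.2, 0.0, 0.2, 0.1, 0.1]
print(sign_test_p(zero, few))  # -> 1.0
```

When wins and losses are balanced, as in this toy data, the p-value approaches 1, mirroring the paper's p = 0.505 finding that few-shot gains are indistinguishable from noise.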

5. Self-Scaffolding Dominates

Direct interpreter feedback (1 LLM call/iteration) consistently outperforms multi-agent approaches. Adding a critic or planner introduces noise rather than useful signal when all components lack domain knowledge.

6. 2× Agentic Advantage

Tool-augmented agents (Codex, Claude Code) achieve ~2× the accuracy of prompting-only approaches via execution feedback loops that partially compensate for the lack of training data.

Results & Analysis

The Performance Cliff

When tested on esoteric languages where training data is 5,000 to 100,000x scarcer, frontier models collapse from ~90% accuracy to single digits. Befunge-98 fares best at 11.2% (its 2D grid paradigm is partly shared with stack-based languages), while Whitespace, with its invisible syntax of spaces, tabs, and newlines, remains at 0% across every model and strategy.

Strategy Comparison

Self-Scaffolding, which feeds interpreter error messages directly back to the model for iterative refinement, consistently outperforms all other strategies. Notably, adding a critic (Textual Self-Scaffolding) or a planner (ReAct) provides no measurable benefit. The additional LLM calls introduce noise rather than useful signal, suggesting that self-reflection on esoteric code is beyond current model capabilities.
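The strategy's control flow can be sketched in a few lines. `generate` and `run_interpreter` below are hypothetical stubs standing in for an LLM API and an esolang interpreter; only the loop structure reflects the strategy described above:

```python
def self_scaffold(problem, generate, run_interpreter, max_iters=5):
    """Iterative refinement: one model call per iteration, with the raw
    interpreter error fed straight back into the next prompt.

    `run_interpreter(code)` is assumed to return (passed, error_message).
    """
    prompt = problem
    code = None
    for _ in range(max_iters):
        code = generate(prompt)
        passed, error = run_interpreter(code)
        if passed:
            return code, True
        # No critic, no planner: the interpreter's message *is* the feedback.
        prompt = f"{problem}\n\nYour last attempt failed:\n{error}\nFix the code."
    return code, False

# Toy stubs: the "model" succeeds only after it has seen an error message.
def toy_generate(prompt):
    return "fixed" if "failed" in prompt else "buggy"

def toy_interpreter(code):
    return (code == "fixed", "unexpected token '+' at cell 3")

print(self_scaffold("print 'A' in Brainfuck", toy_generate, toy_interpreter))
# -> ('fixed', True)
```

The point of the single-call loop is that every added component (critic, planner) is another model that also lacks esolang knowledge, so it can only add noise to this feedback channel.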

Error Analysis

Each language exhibits a distinct failure profile. Brainfuck errors are 83.9% logic errors (syntactically valid but wrong output): models understand the 8-command syntax but fail at algorithmic reasoning. Unlambda errors are 74.6% compile errors: models cannot produce valid combinator expressions. Befunge-98 errors are 93.4% runtime errors: the 2D grid execution model leads to infinite loops. Shakespeare errors are 59.2% runtime errors: the theatrical syntax is recognized but the dialogue semantics are wrong.
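For readers unfamiliar with Brainfuck's 8-command syntax, a minimal interpreter makes the logic-vs-syntax distinction concrete: the grammar fits in an `if`/`elif` chain, so almost any character string "parses", and correctness lives entirely in the algorithmic use of the tape. This is an illustrative sketch, not the benchmark's harness:

```python
def run_bf(program: str, tape_len: int = 30000) -> str:
    """Minimal Brainfuck interpreter for the 8-command syntax
    (input ',' is treated as a no-op here). Cells are 8-bit with wraparound."""
    # Pre-match brackets so loops jump in O(1).
    jumps, stack = {}, []
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * tape_len
    ptr = pc = 0
    out = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">": ptr += 1
        elif ch == "<": ptr -= 1
        elif ch == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".": out.append(chr(tape[ptr]))
        elif ch == "[" and tape[ptr] == 0: pc = jumps[pc]
        elif ch == "]" and tape[ptr] != 0: pc = jumps[pc]
        pc += 1
    return "".join(out)

# 8 x 8 increments via a loop, plus one, then output: prints "A" (ASCII 65).
print(run_bf("++++++++[>++++++++<-]>+."))  # -> A
```

A "logic error" in this setting is a program this interpreter happily runs to completion with the wrong characters on `out`, which is exactly the dominant Brainfuck failure mode reported above.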

Agentic Systems

When given access to actual interpreters as tools, agentic coding systems like Codex and Claude Code achieve ~2× the accuracy of prompting-only approaches. Codex reaches 13.8% on Brainfuck, the highest single-language score in our benchmark. This demonstrates that execution feedback loops partially compensate for the lack of training data, but even with tool access, performance remains far below mainstream language levels.

Dataset

EsoLang-Bench contains 80 programming problems across four difficulty tiers, each with 6 test cases. Every problem is implemented in all 5 esoteric languages.

80 Problems · 5 Languages · 4 Difficulty Tiers · 6 Test Cases Each


Supported Languages

Five esoteric languages spanning diverse paradigms, from tape-based to functional to natural-language-like.

BibTeX

@article{sharma2026esolangbench,
  title        = {{EsoLang-Bench}: Evaluating Genuine Reasoning in Large Language
                  Models via Esoteric Programming Languages},
  author       = {Sharma, Aman and Chopra, Paras},
  journal      = {arXiv preprint arXiv:2603.09678},
  year         = {2026},
  eprint       = {2603.09678},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  url          = {https://arxiv.org/abs/2603.09678}
}