Project 42

The Geometry of Machine Intelligence

You can't fix what you can't measure. We built the instrument that measures what no benchmark can: the gap between what your model knows and what it can discover.

The Problem

You spend billions training models. You benchmark them on knowledge retrieval, code generation, instruction following, math reasoning. Your models score 90th percentile on every test you throw at them.

Then someone asks your model a genuinely hard question — one that requires sustained exploration, cross-domain bridging, the capacity to push past the obvious answer into territory the training data doesn't map — and your model says the same thing four times in a row with different words.

You know this. Your researchers know this. The gap between what your model knows and what it can discover is the most important unsolved problem in AI capability research.

And you can't fix what you can't measure.

You can't measure it. We can.
The Engine

We put your model in a room with top competitors

We give them all the same hard question. We track every thought in 768-dimensional embedding space. We build an exclusion zone — a growing map of everywhere the models have already been — and we force them to go somewhere new.

The core insight is geometric. When a model generates a thought, that thought occupies a position in embedding space. If you track where the model has already been, you can prevent it from going back there. Every stored thought expands the zone of explored territory, and every new thought has to land outside that zone to be kept.

The model is forced — geometrically, mathematically — to go somewhere it hasn't been before. That's the whole idea. Force the model into unexplored territory by making explored territory off-limits.
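
To make the mechanic concrete, here is a minimal sketch. It assumes cosine distance as the novelty metric and distance to the nearest stored thought as "distance from explored territory"; the class and method names are illustrative, not CHRONOS's actual interface.

```python
# Minimal sketch of the exclusion-zone mechanic. Assumptions (not from
# the source): cosine distance as the metric, nearest stored thought as
# the reference point. Names are illustrative.
import numpy as np

class ExclusionZone:
    def __init__(self, dim: int = 768):
        self.stored = np.empty((0, dim))  # embeddings of kept thoughts

    def distance(self, emb: np.ndarray) -> float:
        """Cosine distance from the nearest already-stored thought."""
        if len(self.stored) == 0:
            return 1.0  # nothing explored yet: everything is novel
        sims = self.stored @ emb / (
            np.linalg.norm(self.stored, axis=1) * np.linalg.norm(emb)
        )
        return float(1.0 - sims.max())

    def try_store(self, emb: np.ndarray, threshold: float) -> bool:
        """Keep a thought only if it lands outside explored territory."""
        if self.distance(emb) >= threshold:
            self.stored = np.vstack([self.stored, emb])
            return True
        return False
```

Every accepted embedding grows the stored set, so the region a new thought must clear only ever expands. That is the exclusion zone in code.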

Multiple frontier models. One shared exclusion zone. The most honest benchmark of creative reasoning capacity that exists.

Exclusion Zone
A growing manifold of explored territory in 768-dimensional embedding space. Every stored thought expands it. Every new thought must land outside it.
Novelty Gate
Measures each new thought's distance from explored territory. The threshold rises by 0.001 per thought, capped at 0.61, demanding increasingly novel contributions as easy territory fills up (see the sketch after this list).
Soliton Bounce
Catches failing thoughts and redirects them toward frontier gaps. Not "try again" — "try again from over there." Geometric rescue with anchor rotation.
Anchor Decomposition
Breaks every question into sub-territories. Multiple models exploring different anchors instead of fighting over one patch of ground.
The Atlas
Persistent memory across sessions. Every session makes the next one smarter. The system never retreads old territory — and knows when a question is exhausted.
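
The Novelty Gate's schedule, at least, is simple enough to state directly. The 0.001-per-thought rise and the 0.61 ceiling come from the description above; the starting threshold of 0.30 is an assumed value for illustration only.

```python
# Novelty Gate schedule. Step (0.001) and ceiling (0.61) are from the
# text; the base threshold (0.30) is an assumption for illustration.
def novelty_threshold(thoughts_stored: int,
                      base: float = 0.30,
                      step: float = 0.001,
                      ceiling: float = 0.61) -> float:
    """Threshold a new thought must clear, rising as territory fills."""
    return min(base + step * thoughts_stored, ceiling)
```

With the assumed base, the gate hits its ceiling after 310 stored thoughts; from then on, only the hard 0.61 bar remains.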
The Data

Every frontier model has a cognitive fingerprint.

Under geometric pressure, models stop performing and start revealing. The exclusion zone strips away surface fluency and exposes the architecture underneath — where each model gravitates, where it breaks, and what it does when there's nowhere familiar left to go.

Claude Opus
The architect
Opens strong, then builds. Designs falsification experiments for its own claims — the only model that routinely attacks its own best thought. Recovers from exhaustion by switching cognitive registers. Under maximal exclusion pressure, produces the session's organizing framework.
GPT-5
The formalizer
Reaches for mathematical structure where other models narrate. Information bottleneck theory, Galois connections, RSB formalization. Needs room to think: early sessions misdiagnosed it as weak until the token budget was tripled, revealing the real capability underneath the scaffolding.
DeepSeek V3
The scalpel
Highest novelty-per-token efficiency by a wide margin. Finds the gap and names it in a sentence. Shows inverted fatigue — stumbles early, then recovers by exploiting territory larger models left behind. Can't solo-carry a session, but the ecosystem needs it.
Gemini Pro
The synthesizer
Thinks internally before speaking — burns reasoning tokens behind the scenes. When it lands, it lands with cross-domain connections the other models miss. Requires the most headroom of any model. The investment pays off in synthesis quality.
Grok 4
The contrarian
Arrives at the question from an angle nobody else tried. Highest variance model — capable of both the best and worst thought in a session. Under exclusion pressure, that volatility becomes an asset. The model most likely to break through when everyone else is circling.

These aren't benchmark scores. These are behavioral signatures — visible only when models are forced past the territory their training data mapped. CHRONOS doesn't test what models know. It tests what they can discover.

What We're Selling

Not a benchmark. An instrument.

Every AI company is converging on the same benchmark scores. MMLU is saturated. HumanEval is saturated. The models are differentiated on vibes, pricing, and speed — not on cognitive capability. Because the test that measures real cognitive capability didn't exist. Now it does.

01
The Diagnostic
Run your model through CHRONOS against three competitors. Get back a complete geometric profile: attractor basins, blind spots, recovery capacity, compression efficiency, frontier reach, novelty-per-token efficiency. The MRI for your model's creative reasoning. Nobody else can produce it.
02
The Training Signal
Every session generates DPO-ready preference pairs. Your model's low-novelty thought paired with a competitor's high-novelty thought — same anchor, same exclusion zone, different output quality. Not "be more creative." Coordinates. Trajectories. Paired examples targeting the exact deficit identified (see the record sketch after this list).
03
The Instrument
CHRONOS as a live evaluation environment. Run sessions after every training run. Track whether attractor basins are shrinking, novelty ceilings are rising, recovery is improving. Know whether your latest intervention made your model think better — not just score higher on static benchmarks.
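
For a rough picture of what item 02's output might look like: the pairing logic (same anchor, same exclusion-zone state, chosen versus rejected by measured novelty) follows the description above, but every field name below is an illustrative assumption, not a documented schema.

```python
# Illustrative shape of a CHRONOS-derived preference pair. Field names
# are assumptions; only the pairing logic comes from the text.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str             # the anchor (sub-territory) both models saw
    chosen: str             # the competitor's high-novelty thought
    rejected: str           # your model's low-novelty thought
    chosen_novelty: float   # distance from the shared exclusion zone
    rejected_novelty: float
    session_id: str         # which session produced the pair
```

The prompt/chosen/rejected triple is the format standard DPO training code consumes, which is what "DPO-ready" implies.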
The Ask

We're not asking you to believe a pitch deck.

We're asking you to send us your model.

We'll run it. Hard questions across multiple domains. You'll get back the full geometric profile — attractor basins, novelty trajectories, failure modes, competitor comparison — plus the first batch of CHRONOS-derived training pairs targeting your model's specific deficits.

If the diagnostic tells you something your internal evals didn't, we talk about what comes next.

If it doesn't, you've lost nothing but ten minutes of API calls.

Send Your Model

We already know what the diagnostic will show. We have the data. The question isn't whether your model has blind spots. The question is whether you want to see them.

Because knowing the answer isn't intelligence.
Finding the answer nobody else found is.