Aesthetics vs. Accuracy
The hidden hallucinations in AI image restoration, and what a 1927 photo of Einstein reveals about how models fail.
The Solvay Test: the ultimate stress test for AI image editing
The 1927 Solvay Conference photograph is one of the most famous images in scientific history. Taken in Brussels, it captured 29 of the greatest scientific minds of the 20th century in a single frame: Albert Einstein, Marie Curie, Niels Bohr, Max Planck, Werner Heisenberg, and others whose faces are among the most documented in the world.
This makes it the perfect stress test for AI colorization and restoration tools. Every face is universally recognizable. Any model that "hallucinates" (generates plausible-but-wrong pixels) will replace the actual historical subjects with convincing-looking strangers. And because the real faces are so well-known, the failure is immediately visible.
The test: give multiple AI models the original black-and-white photograph and ask them to colorize it while preserving the identity and structural integrity of every face. The results revealed a sharp divide between models that prioritize aesthetic quality and models that prioritize factual preservation.
The pitfall of "pretty pixels"
Several mainstream models, including variants of ChatGPT's image editing and Grok, produced visually stunning colorizations. The lighting was cinematic. The skin tones were rich and natural. The fabric textures were high-definition. At a quick glance, the results looked impressive.
But look at the faces.
These models didn't preserve the historical subjects; they replaced them with generically attractive, plausible-looking faces. Einstein's distinctive features were softened into a generic elderly European man. Curie's recognizable bone structure was replaced. The physics were right. The people were wrong.
This is the core tension in generative AI editing: the training objective that produces beautiful, high-fidelity images is not the same as the objective that preserves source integrity. A model optimized to make things look good will, by default, make them look better than reality, which means replacing reality with something prettier.
Why this happens technically
- Upsampling fills gaps with probability, not fact. When a model increases resolution or adds color, it generates the most statistically likely pixels, which means average-looking faces, not specific ones.
- Identity is not explicitly preserved unless the model is designed for it. Standard image editing models treat faces as visual patterns, not as specific individuals. There's no mechanism to "lock" an identity.
- Aesthetic training rewards pleasing output. Models trained on human preference ratings learn to produce what people rate as beautiful, and people rate idealized faces as more attractive than accurate ones.
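The first point can be made concrete with a toy calculation. The sketch below assumes a deliberately simplified setup: a degraded input is consistent with three different faces, each reduced to a three-number feature vector (all values invented for illustration), and the model is scored by squared error. Under that assumption the lowest-error guess is the per-feature mean, which matches none of the real candidates. Real generative models use more complex objectives and often sample rather than average, so treat this as intuition only.

```python
# Toy illustration: each "face" is a short vector of feature values.
# All numbers are invented for this example.
plausible_faces = [
    [0.9, 0.1, 0.4],  # candidate A
    [0.1, 0.8, 0.5],  # candidate B
    [0.5, 0.3, 0.9],  # candidate C
]


def mean_face(faces):
    """Per-feature average of all candidate faces."""
    n = len(faces)
    return [sum(face[i] for face in faces) / n for i in range(len(faces[0]))]


def expected_sq_error(guess, faces):
    """Average squared error of one guess against every candidate."""
    total = 0.0
    for face in faces:
        total += sum((g - x) ** 2 for g, x in zip(guess, face))
    return total / len(faces)


average = mean_face(plausible_faces)
avg_error = expected_sq_error(average, plausible_faces)
candidate_errors = [expected_sq_error(f, plausible_faces) for f in plausible_faces]

print("average face:", average)
print("error when guessing the average:", avg_error)
print("error when guessing each real candidate:", candidate_errors)
```

Guessing the average scores better than committing to any one real face, even though the average is not a face that was ever in the candidate set. That is the "generic elderly European man" effect in miniature.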
Preservation over perfection
Two models passed the Solvay test: Gemini 3 Pro and Seedream 4.5.
Their colorizations were not the most stunning. The lighting was less dramatic. The skin tones were more muted. But the faces were right: structurally faithful to the source, with the actual subjects' features preserved rather than enhanced into generic attractiveness.
Gemini 3 Pro: Maintained structural fidelity to source faces. Color palette was historically plausible, with muted, era-appropriate tones rather than modern cinematic saturation. Einstein, Curie, and Bohr remained recognizable.
Seedream 4.5: Strong source adherence throughout the restoration. Where other models added idealized features, Seedream 4.5 held closer to the original pixel structure, producing a less "polished" but historically accurate result.
The difference isn't just a preference; it's architectural. Models that prioritize source fidelity implement explicit constraints during inference that prevent the network from "improving" source material beyond a threshold. This trades aesthetic sharpness for structural truth.
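How any particular model enforces such a constraint is not described here, and the following is not a description of Gemini 3 Pro or Seedream 4.5. As a purely illustrative sketch of what a "threshold" on improvement could mean, the hypothetical function below clamps each edited luminance value so it cannot drift more than `max_delta` from the source. The function name, the threshold, and the pixel values are all assumptions made for the example.

```python
def clamp_to_source(source, edited, max_delta):
    """Limit how far each edited luminance value may drift from the source.

    source, edited: 2D lists of luminance values in [0, 255].
    max_delta: hypothetical fidelity threshold, in the same units.
    """
    constrained = []
    for src_row, edit_row in zip(source, edited):
        row = []
        for s, e in zip(src_row, edit_row):
            low, high = s - max_delta, s + max_delta
            # Keep the edit if it is within the allowed band, else pull it back.
            row.append(min(max(e, low), high))
        constrained.append(row)
    return constrained


source = [[100, 120], [130, 90]]
edited = [[104, 180], [60, 91]]  # two pixels were "improved" heavily
result = clamp_to_source(source, edited, max_delta=10)
print(result)  # [[104, 130], [120, 91]]
```

Small edits pass through untouched, while the two large changes are pulled back to the edge of the allowed band. The trade-off named above is visible in the output: the result stays close to the source at the cost of whatever the larger edit was trying to achieve.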
What "quality" actually means in AI editing
The Solvay test exposes a fundamental confusion in how we evaluate AI image output. Quality has two completely different definitions, and most users don't know which one they're getting.
Aesthetic quality
High sharpness, rich saturation, idealized lighting. Looks impressive in a portfolio or social post. Optimizes for human preference ratings.
Risk: the model is a creative collaborator, not a faithful reproducer. It will improve on reality.
Fidelity quality
Structural preservation of source data. Less visually dramatic, but factually accurate. Optimizes for data integrity over aesthetics.
Use when: accuracy matters. Historical restoration, forensic work, medical imaging, identity verification.
For developers and designers
- Specify fidelity requirements explicitly. Most models default to aesthetic enhancement. If you need source preservation, say so, and test it on known faces before trusting it on unknown ones.
- Test with ground truth. The Solvay approach works for any editing task: test on inputs where you know what the correct output looks like. If it fails on known faces, don't trust it on unknown ones.
- Aesthetic quality ≠ accuracy. A visually stunning output from an AI editing tool is not evidence that it preserved the source faithfully. These are orthogonal properties.
- For archival or identity-sensitive work, prefer models with explicit source-fidelity design (Gemini 3 Pro, Seedream 4.5) over general-purpose image editors.
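For colorization specifically, one cheap ground-truth check is available without any labeled data: a faithful colorization should add color while leaving brightness roughly where the black-and-white source had it. The sketch below assumes images are small nested lists, uses the standard Rec. 601 luma weights, and picks an arbitrary `max_drift` threshold; the function names and pixel values are hypothetical. It is a crude proxy, and a real evaluation would likely add face-similarity measures or manual review.

```python
def luminance(rgb):
    """Rec. 601 luma approximation for an (r, g, b) pixel."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def luminance_drift(gray_source, colorized):
    """Mean absolute difference between source gray values and output luma."""
    total, count = 0.0, 0
    for gray_row, color_row in zip(gray_source, colorized):
        for gray, rgb in zip(gray_row, color_row):
            total += abs(gray - luminance(rgb))
            count += 1
    return total / count


def passes_fidelity(gray_source, colorized, max_drift=5.0):
    """True if the output stays within the (assumed) drift threshold."""
    return luminance_drift(gray_source, colorized) <= max_drift


# Invented 2x2 example: one output tints the source, the other relights it.
gray_source = [[100, 150], [200, 50]]
faithful = [[(110, 98, 90), (160, 147, 140)], [(210, 197, 190), (60, 47, 40)]]
stylized = [[(140, 120, 110), (200, 180, 170)], [(230, 225, 220), (90, 70, 60)]]

for name, output in [("faithful", faithful), ("stylized", stylized)]:
    drift = luminance_drift(gray_source, output)
    print(name, round(drift, 2), "PASS" if passes_fidelity(gray_source, output) else "FAIL")
```

The stylized output may well look better, yet it fails the check because it changed brightness the source never had. Running the same check across several models on a known image gives a fidelity ranking that is independent of how appealing each result looks.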
The rule: If accuracy matters, don't judge AI output by how good it looks. Judge it by how close it is to what was actually there.