The ARTEKNE Creative Benchmark (ACB) is an 8-axis evaluation framework designed to measure whether AI-generated creative output meets the standard of luxury brand campaigns. We're open-sourcing the methodology.
## The Problem with Existing Benchmarks
Current AI image quality benchmarks were designed for research, not commerce. FID (Fréchet Inception Distance) measures statistical similarity to a reference dataset. CLIP Score measures text-image alignment. ImageReward measures human aesthetic preference.
None of these capture what a creative director actually evaluates when reviewing campaign imagery: Does this image build the brand? Does it maintain visual consistency with the rest of the campaign? Does it tell a story that commands premium pricing?
## The 8-Axis Framework
ACB evaluates AI creative output across eight dimensions, each scored on a 1–10 scale:
| AXIS | MEASURES | WHY IT MATTERS |
|---|---|---|
| 1. Color Science | Palette sophistication, harmony, tonal range | First thing the eye registers |
| 2. Composition | Visual weight, negative space, focal point | Separates "editorial" from "snapshot" |
| 3. Lighting | Directionality, mood, shadow quality | $500 shot vs. $50K campaign |
| 4. Model Realism | Anatomy, skin texture, expression | Uncanny valley destroys credibility |
| 5. Styling | Garment fit, accessory coordination | Signals "catalog" vs. "campaign" |
| 6. Environmental Integration | Model-environment relationship, scale | Compositing artifacts are instant tells |
| 7. Campaign Consistency | Visual language across a series | One image ≠ a campaign |
| 8. Brand Narrative | Does it tell a story? Build value? | The highest-order evaluation |
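The eight axes and the 1–10 scale can be captured in a small scorecard structure. The sketch below is illustrative only: the names `AXES` and `validate_scorecard` are our own, not part of any published ACB reference implementation.

```python
# Hypothetical ACB scorecard: the eight axis names and a 1-10 range check.
AXES = [
    "color_science",
    "composition",
    "lighting",
    "model_realism",
    "styling",
    "environmental_integration",
    "campaign_consistency",
    "brand_narrative",
]

def validate_scorecard(scores: dict[str, float]) -> float:
    """Check that all eight axes are present and in [1, 10]; return the mean."""
    missing = set(AXES) - scores.keys()
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for axis in AXES:
        if not 1.0 <= scores[axis] <= 10.0:
            raise ValueError(f"{axis} score out of range: {scores[axis]}")
    return sum(scores[axis] for axis in AXES) / len(AXES)
```

A submission missing an axis, or with a score outside 1–10, is rejected before averaging.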
## Baseline Results (v1.0)
We scored outputs from Midjourney v7, DALL-E 4, Stable Diffusion 3, Runway Gen-4, and ARTEKNE/Hephaestus across all 8 axes:
| SYSTEM | CS | CO | LI | MR | ST | EI | CC | BN | AVG |
|---|---|---|---|---|---|---|---|---|---|
| Midjourney v7 | 8.0 | 7.5 | 7.5 | 6.5 | 5.0 | 7.0 | 3.0 | 2.0 | 5.8 |
| DALL-E 4 | 7.0 | 7.0 | 6.5 | 6.0 | 4.5 | 6.5 | 2.5 | 2.0 | 5.3 |
| Stable Diffusion 3 | 6.5 | 6.5 | 6.0 | 5.5 | 4.0 | 6.0 | 2.0 | 1.5 | 4.8 |
| Runway Gen-4 | 7.5 | 7.0 | 7.0 | 6.0 | 4.5 | 7.0 | 3.0 | 2.5 | 5.6 |
| ARTEKNE | 8.2 | 7.5 | 7.8 | 7.0 | 7.5 | 8.0 | 8.5 | 8.0 | 7.8 |
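The AVG column is the unweighted mean of the eight axis scores, rounded to one decimal place. A minimal sketch that recomputes it from the table above (scores transcribed in axis order CS, CO, LI, MR, ST, EI, CC, BN):

```python
# Per-axis scores from the v1.0 baseline table; AVG is the plain mean.
BASELINE = {
    "Midjourney v7":      [8.0, 7.5, 7.5, 6.5, 5.0, 7.0, 3.0, 2.0],
    "DALL-E 4":           [7.0, 7.0, 6.5, 6.0, 4.5, 6.5, 2.5, 2.0],
    "Stable Diffusion 3": [6.5, 6.5, 6.0, 5.5, 4.0, 6.0, 2.0, 1.5],
    "Runway Gen-4":       [7.5, 7.0, 7.0, 6.0, 4.5, 7.0, 3.0, 2.5],
    "ARTEKNE":            [8.2, 7.5, 7.8, 7.0, 7.5, 8.0, 8.5, 8.0],
}

def acb_average(scores: list[float]) -> float:
    """Unweighted mean across the eight axes."""
    return sum(scores) / len(scores)
```

All five published AVG values match this recomputation to within rounding.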
### Key Finding
Single-image generators (Midjourney, DALL-E, SD) score competitively on individual image quality axes (CS, CO, LI) but collapse on system-level axes (CC, BN). This is because they have no concept of "campaign" or "brand" — each image is generated independently.
ARTEKNE's multi-agent architecture specifically addresses this gap. The 209-agent system maintains brand DNA, visual language, and narrative consistency across every generated image — which is why it scores 8.0+ on Campaign Consistency and Brand Narrative where competitors score 2.0–3.0.
The gap between AI creative tools is not in individual image quality. It's in system-level creative thinking. That's the gap Autonomous Creative is designed to close.
## How to Contribute
ACB is open-source. We welcome contributions in three forms:
- New system evaluations — Run the benchmark on additional AI systems and submit results
- Methodology improvements — Propose new axes, refined scoring criteria, or alternative protocols
- Evaluator participation — Join the evaluator pool (especially creative directors with luxury brand experience)
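For new system evaluations, results need to travel in a machine-readable form. The shape below is purely hypothetical; the actual submission schema, if one is defined, lives in the repository, and `ExampleGen v1` is an invented system name for illustration.

```python
# Hypothetical shape for a new-system evaluation submission.
import json

submission = {
    "system": "ExampleGen v1",   # illustrative system name, not a real product
    "acb_version": "1.0",
    "scores": {                  # one 1-10 score per axis
        "color_science": 6.5,
        "composition": 6.0,
        "lighting": 6.0,
        "model_realism": 5.5,
        "styling": 4.5,
        "environmental_integration": 6.0,
        "campaign_consistency": 2.5,
        "brand_narrative": 2.0,
    },
    "evaluators": 3,             # number of raters whose scores were averaged
}

print(json.dumps(submission, indent=2))
```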
Repository: github.com/artekne/creative-benchmark
The ACB methodology is released under the MIT License; results data are released under CC BY 4.0.