The ARTEKNE Creative Benchmark (ACB) is an 8-axis evaluation framework designed to measure whether AI-generated creative output meets the standard of luxury brand campaigns. We're open-sourcing the methodology.

The Problem with Existing Benchmarks

Current AI image quality benchmarks were designed for research, not commerce. FID (Fréchet Inception Distance) measures statistical similarity to a reference dataset. CLIP Score measures text-image alignment. ImageReward approximates human aesthetic preference via a learned reward model.

None of these capture what a creative director actually evaluates when reviewing campaign imagery: Does this image build the brand? Does it maintain visual consistency with the rest of the campaign? Does it tell a story that commands premium pricing?

The 8-Axis Framework

ACB evaluates AI creative output across eight dimensions, each scored on a 1–10 scale:

AXIS                         | MEASURES                                       | WHY IT MATTERS
1. Color Science             | Palette sophistication, harmony, tonal range   | First thing the eye registers
2. Composition               | Visual weight, negative space, focal point     | Separates "editorial" from "snapshot"
3. Lighting                  | Directionality, mood, shadow quality           | $500 shot vs. $50K campaign
4. Model Realism             | Anatomy, skin texture, expression              | Uncanny valley destroys credibility
5. Styling                   | Garment fit, accessory coordination            | Signals "catalog" vs. "campaign"
6. Environmental Integration | Model-environment relationship, scale          | Compositing artifacts are instant tells
7. Campaign Consistency      | Visual language across a series                | One image ≠ a campaign
8. Brand Narrative           | Does it tell a story? Build value?             | The highest-order evaluation
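To make the framework concrete, here is a minimal sketch of how an ACB scorecard could be represented in code. The class and field names are illustrative, not part of the released methodology; the only facts taken from the spec are the eight axes, the 1–10 scale per axis, and an unweighted overall average.

```python
from dataclasses import dataclass, fields

@dataclass
class ACBScore:
    """One system's ACB scorecard: eight axes, each on a 1-10 scale.
    Field names are illustrative shorthand for the published axes."""
    color_science: float
    composition: float
    lighting: float
    model_realism: float
    styling: float
    environmental_integration: float
    campaign_consistency: float
    brand_narrative: float

    def average(self) -> float:
        """Unweighted mean across all eight axes, rounded to one decimal."""
        vals = [getattr(self, f.name) for f in fields(self)]
        return round(sum(vals) / len(vals), 1)

# Example: Midjourney v7's baseline row reproduces its published 5.8 average.
mj = ACBScore(8.0, 7.5, 7.5, 6.5, 5.0, 7.0, 3.0, 2.0)
print(mj.average())
```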

Baseline Results (v1.0)

We scored outputs from Midjourney v7, DALL-E 4, Stable Diffusion 3, Runway Gen-4, and ARTEKNE/Hephaestus across all 8 axes:

Column abbreviations follow the axis order above: CS = Color Science, CO = Composition, LI = Lighting, MR = Model Realism, ST = Styling, EI = Environmental Integration, CC = Campaign Consistency, BN = Brand Narrative.

SYSTEM             | CS  | CO  | LI  | MR  | ST  | EI  | CC  | BN  | AVG
Midjourney v7      | 8.0 | 7.5 | 7.5 | 6.5 | 5.0 | 7.0 | 3.0 | 2.0 | 5.8
DALL-E 4           | 7.0 | 7.0 | 6.5 | 6.0 | 4.5 | 6.5 | 2.5 | 2.0 | 5.3
Stable Diffusion 3 | 6.5 | 6.5 | 6.0 | 5.5 | 4.0 | 6.0 | 2.0 | 1.5 | 4.8
Runway Gen-4       | 7.5 | 7.0 | 7.0 | 6.0 | 4.5 | 7.0 | 3.0 | 2.5 | 5.6
ARTEKNE            | 8.2 | 7.5 | 7.8 | 7.0 | 7.5 | 8.0 | 8.5 | 8.0 | 7.8

Key Finding

Single-image generators (Midjourney, DALL-E, SD) score competitively on individual image quality axes (CS, CO, LI) but collapse on system-level axes (CC, BN). This is because they have no concept of "campaign" or "brand" — each image is generated independently.
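The image-level vs. system-level split can be computed directly from the baseline table. The sketch below groups the axes the way the finding describes (CS/CO/LI as single-image quality, CC/BN as campaign-level quality); the variable names and grouping helper are illustrative, not part of the ACB spec.

```python
# Baseline v1.0 scores per system, in axis order CS, CO, LI, MR, ST, EI, CC, BN.
SCORES = {
    "Midjourney v7":      [8.0, 7.5, 7.5, 6.5, 5.0, 7.0, 3.0, 2.0],
    "DALL-E 4":           [7.0, 7.0, 6.5, 6.0, 4.5, 6.5, 2.5, 2.0],
    "Stable Diffusion 3": [6.5, 6.5, 6.0, 5.5, 4.0, 6.0, 2.0, 1.5],
    "Runway Gen-4":       [7.5, 7.0, 7.0, 6.0, 4.5, 7.0, 3.0, 2.5],
    "ARTEKNE":            [8.2, 7.5, 7.8, 7.0, 7.5, 8.0, 8.5, 8.0],
}

IMAGE_AXES = [0, 1, 2]   # CS, CO, LI: single-image quality
SYSTEM_AXES = [6, 7]     # CC, BN: campaign-level quality

def group_mean(scores: list[float], idxs: list[int]) -> float:
    """Unweighted mean over a subset of axis indices."""
    return sum(scores[i] for i in idxs) / len(idxs)

for name, s in SCORES.items():
    img = group_mean(s, IMAGE_AXES)
    sys_level = group_mean(s, SYSTEM_AXES)
    print(f"{name:18s}  image-level {img:.2f}  system-level {sys_level:.2f}  "
          f"gap {img - sys_level:+.2f}")
```

Run over the baseline data, the single-image generators show a drop of roughly four to five points from image-level to system-level axes, while ARTEKNE's system-level scores actually exceed its image-level ones.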

ARTEKNE's multi-agent architecture specifically addresses this gap. The 209-agent system maintains brand DNA, visual language, and narrative consistency across every generated image — which is why it scores 8.0+ on Campaign Consistency and Brand Narrative where competitors score 2.0–3.0.

The gap between AI creative tools is not in individual image quality. It's in system-level creative thinking. That's the gap Autonomous Creative is designed to close.

How to Contribute

ACB is open-source. We welcome contributions in three forms:

  1. New system evaluations — Run the benchmark on additional AI systems and submit results
  2. Methodology improvements — Propose new axes, refined scoring criteria, or alternative protocols
  3. Evaluator participation — Join the evaluator pool (especially creative directors with luxury brand experience)

Repository: github.com/artekne/creative-benchmark

The ACB methodology is released under the MIT License. Results data are released under CC BY 4.0.