Running the MOOD Challenge: Lessons from 80+ Teams

When I founded the MOOD Challenge in 2020, the goal was simple: create the first rigorous, international benchmark for out-of-distribution detection and localization on medical images. Five years and 80+ teams later, the lessons go far beyond what I expected.

Why benchmarking matters

In medical AI, a model that works on one dataset often fails on another. Before MOOD, anomaly detection papers each used different data, different metrics, and different train/test splits — making comparison nearly impossible.

If you can't compare methods fairly, you can't tell whether the field is actually progressing.

The Docker-based evaluation framework

To ensure reproducibility, we built a live Docker-based submission system. Teams submit a Docker image, and we run it on our on-premises cluster against hidden test data. This eliminates "evaluation hacking": no one can tune to a test set they never see.

┌──────────┐     ┌──────────────┐     ┌────────────────┐
│  Submit  │ ──▶ │  Pull image  │ ──▶ │  Run on hidden │
│  Docker  │     │  to cluster  │     │  test set      │
└──────────┘     └──────────────┘     └────────────────┘
                                            │
                                            ▼
                                     ┌────────────────┐
                                     │  Score & rank  │
                                     │  automatically │
                                     └────────────────┘
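
As a rough illustration of the "run on hidden test set" step, here is a minimal sketch of how an evaluation host might launch a submitted image. The mount points (/mnt/data, /mnt/pred), the one-hour timeout, and the function names are assumptions made for this example, not the actual MOOD configuration. The idea it shows is that the container gets no network access and only a read-only view of the data.

```python
import subprocess


def build_run_command(image: str, input_dir: str, output_dir: str) -> list:
    """Build a `docker run` command for one submission (hypothetical layout)."""
    return [
        "docker", "run", "--rm",             # remove the container when it exits
        "--network", "none",                 # no network access during evaluation
        "-v", f"{input_dir}:/mnt/data:ro",   # hidden test data, mounted read-only
        "-v", f"{output_dir}:/mnt/pred",     # the container writes predictions here
        image,
    ]


def evaluate_submission(image: str, input_dir: str, output_dir: str,
                        timeout_s: int = 3600) -> bool:
    """Run the submission; return True only if it exits cleanly within the time limit."""
    cmd = build_run_command(image, input_dir, output_dir)
    try:
        result = subprocess.run(cmd, timeout=timeout_s, check=False)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Timed out, or the docker CLI is not installed on this host.
        return False
    return result.returncode == 0
```

Keeping the command construction separate from the execution makes the policy (no network, read-only data) easy to inspect and test without actually starting a container.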

Key findings

  1. Reconstruction error alone is fragile. Methods that rely purely on pixel-wise reconstruction tend to fail on structured anomalies.
  2. Latent-space methods generalize better. Approaches that model the distribution of latent codes are more robust to distribution shifts.
  3. Localization is harder than detection. Telling whether something is wrong is easier, by a significant margin, than telling where it is.
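
To make the difference between detection and localization concrete, here is a small sketch of how a reconstruction-based method yields both kinds of output, and how each could be scored. The mean aggregation and the use of average precision are assumptions chosen for illustration, not a description of the challenge's official scoring. All function names are hypothetical.

```python
import numpy as np


def anomaly_map(image: np.ndarray, reconstruction: np.ndarray) -> np.ndarray:
    """Pixel-wise anomaly score: absolute reconstruction error."""
    return np.abs(image - reconstruction)


def sample_score(amap: np.ndarray) -> float:
    """Collapse a pixel-wise map into one score per image (mean is one simple choice)."""
    return float(amap.mean())


def average_precision(labels, scores) -> float:
    """Average precision over a ranked list; ties are broken by input order."""
    labels = np.asarray(labels).ravel().astype(bool)
    scores = np.asarray(scores).ravel()
    order = np.argsort(-scores, kind="stable")    # highest score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at each rank where a true positive appears.
    return float(precision_at_k[hits].sum() / max(hits.sum(), 1))
```

Detection would call average_precision once per dataset with one label and one score per image. Localization would call it with one label and one score per pixel, which is a far larger and more imbalanced ranking problem. That imbalance is one plausible reason the localization task is harder.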

What I'd do differently

If I were starting MOOD today, I would:

  • Require code submission from day one (we phased this in)
  • Include more diverse modalities (we started with MRI; CT adds complexity)
  • Publish baseline submissions so newcomers have a reference

Organizing an international benchmark is a huge amount of work, but seeing the community engage and the field advance makes it worth every hour.