Running the MOOD Challenge: Lessons from 80+ Teams
When I founded the MOOD Challenge in 2020, the goal was simple: create the first rigorous, international benchmark for out-of-distribution detection and localization on medical images. Five years and 80+ teams later, the lessons go far beyond what I expected.
Why benchmarking matters
In medical AI, a model that works on one dataset often fails on another. Before MOOD, anomaly detection papers each used different data, different metrics, and different train/test splits — making comparison nearly impossible.
If you can't compare methods fairly, you can't tell whether the field is actually progressing.
The Docker-based evaluation framework
To ensure reproducibility, we built a live Docker-based submission system. Teams submit a Docker image, and we run it on our on-premise cluster against hidden test data. This prevents "evaluation hacking": no one can tune to a test set they cannot see.
┌──────────┐     ┌──────────────┐     ┌────────────────┐
│  Submit  │ ──▶ │  Pull image  │ ──▶ │ Run on hidden  │
│  Docker  │     │  to cluster  │     │    test set    │
└──────────┘     └──────────────┘     └────────────────┘
                                              │
                                              ▼
                                      ┌────────────────┐
                                      │  Score & rank  │
                                      │ automatically  │
                                      └────────────────┘
Key findings
- Reconstruction error alone is fragile. Methods that rely purely on pixel-wise reconstruction tend to fail on structured anomalies.
- Latent-space methods generalize better. Approaches that model the distribution of latent codes are more robust to distribution shifts.
- Localization is harder than detection. Telling if something is wrong is easier than telling where — by a significant margin.
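The two families of methods contrasted above can be sketched in a few lines. This is a toy illustration, not any team's submission: plain Python lists stand in for image arrays (a real pipeline would use NumPy or PyTorch), the autoencoder that produces reconstructions and latent codes is assumed to exist elsewhere, and the diagonal Gaussian is just one simple choice of latent density model. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def reconstruction_error_map(image, reconstruction):
    """Pixel-wise absolute error: a localization map with one score per pixel."""
    return [
        [abs(a - b) for a, b in zip(row_img, row_rec)]
        for row_img, row_rec in zip(image, reconstruction)
    ]


def sample_score(error_map):
    """Collapse the pixel-level map into a single detection score (the mean)."""
    return mean(v for row in error_map for v in row)


def fit_diagonal_gaussian(latents):
    """Per-dimension mean and standard deviation of in-distribution latent codes."""
    dims = list(zip(*latents))
    means = [mean(d) for d in dims]
    stds = [pstdev(d) or 1e-8 for d in dims]  # guard against zero variance
    return means, stds


def latent_score(z, means, stds):
    """Standardized distance of a latent code from the fitted Gaussian.

    Larger values mean the code is less typical of the training data.
    """
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(z, means, stds)))


# Toy usage: one bright pixel that the model fails to reconstruct.
image = [[0.0, 0.0], [0.0, 1.0]]
reconstruction = [[0.0, 0.0], [0.0, 0.2]]
error_map = reconstruction_error_map(image, reconstruction)

# Latent codes of two "healthy" training samples, then a far-away test code.
means, stds = fit_diagonal_gaussian([[0.0, 1.0], [2.0, 3.0]])
outlier_distance = latent_score([4.0, 6.0], means, stds)
```

The sketch also hints at why localization is the harder task: a detection score only needs the map's aggregate to be high, whereas a localization score requires every pixel of the map to be right.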
What I'd do differently
If I were starting MOOD today, I would:
- Require code submission from day one (we phased this in)
- Include more diverse modalities (we started with MRI; CT adds complexity)
- Publish baseline submissions so newcomers have a reference
Organizing an international benchmark is a huge amount of work, but seeing the community engage and the field advance makes it worth every hour.