The Benchmark Book
We grade the tests, too.
Without honest benchmarks, model scores are noise. Every benchmark below has been evaluated for what it actually measures, what it misses, and how much you should trust it. A trust grade of A means the benchmark is well-maintained, widely accepted, and measures something real. A grade of D means the benchmark is misleading, poorly maintained, or routinely gamed. Grades B and C fall between those extremes.
| Benchmark | Trust Grade |
|---|---|
| CASP15 | A |
| ProteinGym | A |
| PoseBusters | B |
| DockQ | B |
| CAMEO | B |
| PDBBind | C |
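
As a rough illustration of how these grades might be used in practice, the sketch below stores the table as a lookup and filters benchmarks by a minimum trust grade. The names `TrustGrade`, `GRADES`, and `benchmarks_at_least` are hypothetical, and the numeric ordering (A highest, D lowest) is an assumption made only so that grades can be compared; it is not part of the grading scheme itself.

```python
from enum import IntEnum


class TrustGrade(IntEnum):
    """Letter grades mapped to integers so they can be compared (assumed ordering)."""
    A = 4
    B = 3
    C = 2
    D = 1


# Grades copied from the table above; the dict itself is illustrative.
GRADES = {
    "CASP15": TrustGrade.A,
    "ProteinGym": TrustGrade.A,
    "PoseBusters": TrustGrade.B,
    "DockQ": TrustGrade.B,
    "CAMEO": TrustGrade.B,
    "PDBBind": TrustGrade.C,
}


def benchmarks_at_least(minimum):
    """Return benchmark names whose grade is at or above `minimum`, in table order."""
    return [name for name, grade in GRADES.items() if grade >= minimum]


if __name__ == "__main__":
    # Keep only benchmarks graded B or better.
    print(benchmarks_at_least(TrustGrade.B))
```

A filter like this is one simple option for deciding which reported scores to take seriously; where to set the cutoff is a judgment call that depends on the use case.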