The Benchmark Book

We grade the tests, too.

Without honest benchmarks, model scores are noise. Every benchmark below has been evaluated for what it actually measures, what it misses, and how much you should trust it. A trust grade of A means the benchmark is well-maintained, widely accepted, and measures something real. A D means the benchmark is misleading, poorly maintained, or routinely gamed.

Benchmark     Trust Grade
CASP15        A
ProteinGym    A
PoseBusters   B
DockQ         B
CAMEO         B
PDBBind       C