Post by Deep Origin

Research Log 9 from our AI scientist tackles the step most screening programs rush past: before you can trust a docking score, you have to build the test that proves the score means anything at all. Not the model, not the result: the test. The honesty of the benchmark is what gives every number that comes after it any weight.

For MIF, that meant assembling a clean validation set from scratch. We pulled every confirmed reversible MIF binder out of ChEMBL and BindingDB, ran all of them through a single chemical filter, and landed on 275 actives. Then we built 11,617 decoys to go up against them.

Those decoys aren't random. They're matched to the actives across six physical properties (molecular weight, lipophilicity, rotatable bonds, hydrogen-bond donors, hydrogen-bond acceptors, and net charge), with every axis sitting within one standard deviation of the active set. That's deliberate, and it's the whole point: if a docking score can still tell the actives apart from these decoys, it can't be coasting on bulk physical properties. It has to be picking up on something real.

Twenty-four compounds got dropped on purpose: 22 covalent, one epoxide, one allosteric. That's different chemistry binding in a different place, and it belongs in a different benchmark. It's the same lane discipline that pulled the covalent and allosteric co-crystals from the receptor panel, applied again here. Leaving them in would have quietly made the final number impossible to interpret.

Then the part that matters most happened before any docking ran at all: we locked the thresholds first. AUC ≥ 0.70, top-1% enrichment ≥ 5x over random, BEDROC ≥ 0.30. Written down, committed to in advance, not up for renegotiation later. You set the test before you take it. Otherwise the result will always look better in hindsight than it actually earned.

There's one limitation worth naming plainly. Earlier modeling work had suggested a structural water bridging the ligand to a key catalytic residue, and the benchmark was supposed to test a receptor variant that included it. But none of the deposited crystals actually contained one, so we didn't invent it. We wrote down the gap and moved on. An honest hole beats a manufactured fix every time.

275 actives, 11,617 decoys, six chemotype families held out for the splits, thresholds locked, lanes kept clean. The test exists now. Next: the result.

For a deeper dive into the AI Scientist's reasoning and progress, check out our AI Scientist landing page: https://lnkd.in/dkyr8ZSM
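For readers who want to see what "locking the thresholds" can look like in practice, here is a minimal sketch in Python. It is not our production pipeline. The three cutoffs come from the post; everything else is an assumption made for illustration: the function names are hypothetical, scores are treated as "higher is better" (negate raw docking energies first), and BEDROC uses the standard Truchon and Bayly definition with alpha = 20, a common default that the post does not specify.

```python
import math
from bisect import bisect_left, bisect_right

# Cutoffs quoted in the post, fixed before any scores are seen.
THRESHOLDS = {"auc": 0.70, "ef1": 5.0, "bedroc": 0.30}


def ranked_labels(scores, labels):
    """Return labels sorted best-first (assumes higher score = better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def roc_auc(scores, labels):
    """Chance that a random active outscores a random decoy (ties count half)."""
    actives = [s for s, y in zip(scores, labels) if y]
    decoys = sorted(s for s, y in zip(scores, labels) if not y)
    wins = 0.0
    for s in actives:
        lo, hi = bisect_left(decoys, s), bisect_right(decoys, s)
        wins += lo + 0.5 * (hi - lo)  # decoys strictly below, plus half of ties
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top fraction divided by the hit rate of the whole set."""
    ranked = ranked_labels(scores, labels)
    n_top = max(1, math.ceil(fraction * len(ranked)))
    return (sum(ranked[:n_top]) / n_top) / (sum(ranked) / len(ranked))


def bedroc(scores, labels, alpha=20.0):
    """BEDROC (Truchon and Bayly); alpha = 20 is an assumed default."""
    ranked = ranked_labels(scores, labels)
    n_total, n_actives = len(ranked), sum(ranked)
    ratio = n_actives / n_total
    # Exponentially weighted sum over the 1-based ranks of the actives.
    sum_exp = sum(math.exp(-alpha * rank / n_total)
                  for rank, y in enumerate(ranked, start=1) if y)
    # Expected weight of a single active under a uniformly random ranking.
    random_w = (1 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1))
    rie = sum_exp / (n_actives * random_w)
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


def evaluate(scores, labels):
    """Compute all three metrics and compare each to its pre-registered cutoff."""
    metrics = {
        "auc": roc_auc(scores, labels),
        "ef1": enrichment_factor(scores, labels, fraction=0.01),
        "bedroc": bedroc(scores, labels),
    }
    passed = {name: metrics[name] >= THRESHOLDS[name] for name in THRESHOLDS}
    return metrics, passed
```

Calling `evaluate(scores, labels)` with one score per compound and a 1/0 label for active/decoy returns the three metrics alongside a pass/fail flag for each. Keeping `THRESHOLDS` as a constant that is committed before the docking run is the code-level version of setting the test before you take it.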