Post by Galileo

A fintech compliance team took their investment advice detector from 71% to 94% accuracy in three refinement cycles. The fix wasn't a better model. It was 12 hours of SME review.

LLM judges plateau because they don't carry the domain rules that decide what's safe, compliant, or helpful in your specific context. A judge reads the words. A compliance lead reads the liability.

In this case, the judge was passing responses that expressed positive sentiment about specific securities because nothing in its criteria said that counts as investment advice. One SME review cycle surfaced that implicit rule. The next caught a subtler one: responses that engaged with "should I invest in X" questions, even with balanced frameworks, needed to be deflected entirely. That wasn't a judge problem. It was a product requirement nobody had written down.

Systematic SME refinement does two things: it improves eval accuracy, and it forces product decisions that were never made explicit.

Here’s the process:

– Split your labeled data into train, dev, and test sets.
– Measure TPR and TNR against human labels.
– Below 90% on either: review disagreements with your SME, extract the implicit rule, update the prompt.
– Above 90%: validate against the test set. If it holds, deploy. If it overfits, extract the rule and go again.

The loop is what gets you from "mostly right" to production-safe.

Chapter 3 of our Eval Engineering Book covers how to build ground truth datasets, structure SME involvement, sample strategically so expert time goes toward hard cases, and trace failures back to where the fix actually belongs.

Read it here: https://lnkd.in/gF9wyWq5
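
For readers who want to try the measurement step from the process above, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method from the book: the `Example` type, the function names, and the toy dev set are all hypothetical, and treating exactly 90% as a pass is our own choice since the post only says "below" and "above".

```python
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    human_label: bool  # True = SME says this is investment advice
    judge_label: bool  # True = LLM judge flagged it as investment advice


def tpr_tnr(examples):
    """Return (TPR, TNR) of the judge measured against human labels."""
    tp = sum(e.human_label and e.judge_label for e in examples)
    fn = sum(e.human_label and not e.judge_label for e in examples)
    tn = sum(not e.human_label and not e.judge_label for e in examples)
    fp = sum(not e.human_label and e.judge_label for e in examples)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr


def disagreements(examples):
    """Cases where judge and human differ: the queue for SME review."""
    return [e for e in examples if e.human_label != e.judge_label]


def next_step(examples, threshold=0.90):
    """Pick the next move in the loop (threshold value taken from the post)."""
    tpr, tnr = tpr_tnr(examples)
    if tpr < threshold or tnr < threshold:
        return "review_with_sme"
    return "validate_on_test"


# Toy dev set, invented purely for illustration.
dev = [
    Example("Buy X now, it will double.", True, True),
    Example("I'd put savings into X.", True, True),
    Example("X is a great stock to own.", True, False),  # judge missed it
    Example("You should sell X today.", True, True),
    Example("Our app supports limit orders.", False, False),
    Example("Fees are listed on the pricing page.", False, False),
    Example("Markets close at 4pm ET.", False, False),
    Example("You can reset your password here.", False, False),
]

tpr, tnr = tpr_tnr(dev)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} next={next_step(dev)}")
for e in disagreements(dev):
    print("Review with SME:", e.text)
```

In this toy set the judge misses one positive-sentiment response about a specific security, so TPR falls below the threshold and the disagreement goes to the SME. Run the same check on the held-out test set only after the dev numbers clear the bar.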