Most AI assistants are graded on what they answer. We grade Brolly AI on what it refuses to answer. When the underlying retrieval is weak, the only correct response is "I do not know, here is why I cannot tell you, here is what I would need." Every other answer is, at best, a guess. In an academic or institutional context a guess is a fabricated citation, and a fabricated citation is the failure mode that ends the product.
The metric, defined
We log every assistant turn with three labels: an "answered" boolean, a "retrieval-grounded" boolean, and, on declined turns, a structured refusal reason. The "I don't know" rate is the share of all turns where the model declined to answer because retrieval scores fell below a calibrated confidence floor. Our internal target is between eight and fifteen percent. Below eight percent, the model is answering too readily. Above fifteen percent, the retrieval pipeline needs work.
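As a rough illustration, the rate could be computed from logged turns along these lines. This is a minimal sketch, not the production schema: the `Turn` fields mirror the three labels described above, while the `LOW_CONFIDENCE` reason string, the function names, and the band labels are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical refusal-reason value for "retrieval score below the floor".
LOW_CONFIDENCE = "retrieval_below_floor"


@dataclass
class Turn:
    answered: bool
    retrieval_grounded: bool
    refusal_reason: Optional[str] = None  # set only on declined turns


def idk_rate(turns: Sequence[Turn]) -> float:
    """Share of all turns declined because retrieval fell below the floor."""
    if not turns:
        return 0.0
    declined = sum(
        1 for t in turns
        if not t.answered and t.refusal_reason == LOW_CONFIDENCE
    )
    return declined / len(turns)


def assess(rate: float, low: float = 0.08, high: float = 0.15) -> str:
    """Compare a rate against the eight-to-fifteen percent target band."""
    if rate < low:
        return "too_generous"
    if rate > high:
        return "retrieval_needs_work"
    return "in_target"
```

Declines logged with any other reason are deliberately excluded from the numerator here, since the metric as defined counts only low-confidence refusals.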
"The model that admits it does not know is the model a researcher will use without a chaperone. Every other model needs one."
How we calibrate the floor
Every two weeks we sample two hundred turns where the model declined and two hundred where it answered. Domain experts grade them. The grading produces two error rates: false refusals (declined when it should have answered) and false answers (answered when it should have declined). We tune the confidence floor to keep false answers under one percent. False refusals are an annoyance. A single false answer with a real-looking citation is the failure mode we cannot tolerate.
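One way to picture the tuning step is a sweep over candidate floors against the expert-graded sample. The sketch below rests on several assumptions that the text above does not specify: each graded turn carries a single retrieval score, the false-answer rate is measured as a share of the turns that would be answered at a given floor, and the lowest floor that meets the one percent limit is preferred because it produces the fewest refusals. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass
class GradedTurn:
    score: float         # retrieval confidence score (assumed scale)
    should_answer: bool  # expert judgement for this turn


def error_rates(turns: Sequence[GradedTurn], floor: float) -> Tuple[float, float]:
    """Return (false_answer_rate, false_refusal_rate) at a given floor."""
    answered = [t for t in turns if t.score >= floor]
    declined = [t for t in turns if t.score < floor]
    false_answers = sum(1 for t in answered if not t.should_answer)
    false_refusals = sum(1 for t in declined if t.should_answer)
    fa_rate = false_answers / len(answered) if answered else 0.0
    fr_rate = false_refusals / len(declined) if declined else 0.0
    return fa_rate, fr_rate


def pick_floor(
    turns: Sequence[GradedTurn],
    candidates: Iterable[float],
    max_false_answer_rate: float = 0.01,
) -> Optional[Tuple[float, float, float]]:
    """Lowest candidate floor whose false-answer rate is under the limit.

    Returns (floor, false_answer_rate, false_refusal_rate), or None if
    no candidate satisfies the limit.
    """
    for floor in sorted(candidates):
        fa_rate, fr_rate = error_rates(turns, floor)
        if fa_rate < max_false_answer_rate:
            return floor, fa_rate, fr_rate
    return None
```

The asymmetry in the sweep reflects the priority stated above: the false-answer limit is a hard constraint, and false refusals are only reported, not bounded.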
Three things we did not do
- We did not train the model to apologise. A confident "I cannot answer that with the sources I have" is more useful than a hedged paragraph.
- We did not let the model invent its own citations. Every citation must resolve, in real time, to a paper, page, and DOI on the retrieval index.
- We did not optimise for "helpfulness" as defined by user thumbs-up. Researchers will thumbs-up an answer that sounds right. We optimise for an answer that is right.
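The citation rule in the second point can be sketched as a gate that runs before an answer is shown. This is an illustrative sketch only: the in-memory `index` dictionary stands in for the real retrieval index, and the `Citation` fields, the function names, and the rule that an answer with no citations is also blocked are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Citation:
    doi: str
    title: str
    page: int


# Hypothetical index shape: DOI -> (paper title, number of pages).
Index = Dict[str, Tuple[str, int]]


def resolves(citation: Citation, index: Index) -> bool:
    """True if the DOI is indexed and the title and page match the entry."""
    entry = index.get(citation.doi)
    if entry is None:
        return False
    title, page_count = entry
    return title == citation.title and 1 <= citation.page <= page_count


def gate_answer(citations: Sequence[Citation], index: Index) -> bool:
    """Allow an answer only if it has citations and every one resolves."""
    return bool(citations) and all(resolves(c, index) for c in citations)
```

A single unresolved citation blocks the whole answer in this sketch, which matches the stance that one fabricated citation is worse than a refusal.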
What this looks like in the product
Every answered turn shows the retrieval evidence inline, with a confidence indicator. Every declined turn shows the user what would have changed the answer: a more specific query, a different corpus, or a higher institutional subscription tier. The decline is never a dead end; it is a path forward the user can act on.
Why we will keep publishing the number
We publish the I-don't-know rate on the product's public dashboard, alongside retrieval coverage and citation-resolution rate. It is a visible, quarterly-reported metric. It will feel high, by the standards of a chatbot. It will look correct, by the standards of a research tool.