AI

Brolly AI's "I don't know" rate is the most important metric we ship.

Useful is the easy half. Refusing to answer when the retrieval is weak is the half nobody else builds for, and the only one a researcher or registrar will trust.

Mahougnon Bernard De Montfort Assogba · CTO & Co-Founder, LCFD · 02.02.2026 · 9 min read

Most AI assistants are graded on what they answer. We grade Brolly AI on what it refuses to answer. When the underlying retrieval is weak, the only correct response is "I do not know, here is why I cannot tell you, here is what I would need." Every other answer is, at best, a guess. In an academic or institutional context a guess is a fabricated citation, and a fabricated citation is the failure mode that ends the product.

The metric, defined

We log every assistant turn with three labels: an "answered" boolean, a "retrieval-grounded" boolean, and a structured refusal reason on every declined turn. The "I don't know" rate is the share of turns where the model declined to answer because retrieval scores fell below a calibrated confidence floor. Our internal target is between eight and fifteen percent. Below eight, the model is being too generous: it is answering on retrieval that should have triggered a refusal. Above fifteen, the retrieval pipeline needs work.
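The definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production logging code: the Turn class, its field names, and the "retrieval_below_floor" reason code are hypothetical, and the rate is taken over all logged turns.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical reason code for "retrieval scores fell below the floor".
LOW_RETRIEVAL = "retrieval_below_floor"


@dataclass
class Turn:
    answered: bool
    retrieval_grounded: bool
    refusal_reason: Optional[str] = None  # set only on declined turns


def idk_rate(turns: Sequence[Turn]) -> float:
    """Share of all turns declined because retrieval was below the floor."""
    if not turns:
        return 0.0
    declined = sum(
        1 for t in turns
        if not t.answered and t.refusal_reason == LOW_RETRIEVAL
    )
    return declined / len(turns)


def band(rate: float, low: float = 0.08, high: float = 0.15) -> str:
    """Place a rate against the eight-to-fifteen percent target band."""
    if rate < low:
        return "too generous"
    if rate > high:
        return "retrieval needs work"
    return "on target"
```

Turns declined for any other reason are deliberately left out of the numerator, since the metric only counts refusals caused by weak retrieval.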

"The model that admits it does not know is the model a researcher will use without a chaperone. Every other model needs one."

How we calibrate the floor

Every two weeks we sample two hundred turns where the model declined and two hundred where it answered. Domain experts grade them. The grading produces two error rates: false refusals (declined when it should have answered) and false answers (answered when it should have declined). We tune the confidence floor to keep false answers under one percent. False refusals are an annoyance. A single false answer with a real-looking citation is the failure mode we cannot tolerate.
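A minimal sketch of that tuning step follows. It assumes each graded turn carries a single retrieval score and an expert verdict, and it measures the false-answer rate as a share of the turns that would be answered at a given floor; the GradedTurn class and both function names are hypothetical. It picks the lowest floor that meets the constraint, so that false refusals are not pushed higher than necessary.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class GradedTurn:
    retrieval_score: float  # hypothetical single score per turn
    should_answer: bool     # verdict from the domain expert


def error_rates(sample: Sequence[GradedTurn], floor: float) -> Tuple[float, float]:
    """Return (false_answer_rate, false_refusal_rate) at a given floor."""
    answered = [t for t in sample if t.retrieval_score >= floor]
    declined = [t for t in sample if t.retrieval_score < floor]
    false_answers = sum(1 for t in answered if not t.should_answer)
    false_refusals = sum(1 for t in declined if t.should_answer)
    fa_rate = false_answers / len(answered) if answered else 0.0
    fr_rate = false_refusals / len(declined) if declined else 0.0
    return fa_rate, fr_rate


def pick_floor(sample: Sequence[GradedTurn],
               max_false_answer_rate: float = 0.01) -> Optional[float]:
    """Lowest observed score that keeps false answers under the limit."""
    for floor in sorted({t.retrieval_score for t in sample}):
        fa_rate, _ = error_rates(sample, floor)
        if fa_rate < max_false_answer_rate:
            return floor
    return None  # no candidate floor satisfies the constraint
```

The asymmetry in the text shows up in the code: only the false-answer rate acts as a hard constraint, while the false-refusal rate is reported but never blocks a candidate floor.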

Three things we did not do

  • We did not train the model to apologise. A confident "I cannot answer that with the sources I have" is more useful than a hedged paragraph.
  • We did not let the model invent its own citations. Every citation must resolve, in real time, to a paper, page, and DOI on the retrieval index.
  • We did not optimise for "helpfulness" as defined by user thumbs-up. Researchers will thumbs-up an answer that sounds right. We optimise for an answer that is right.
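The citation rule in the list above amounts to a gate that runs before an answer is released. The sketch below is illustrative only: it stands in for the retrieval index with a plain dictionary keyed by DOI, and the Citation class, the index layout, and the function names are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Hypothetical index layout: DOI -> (paper title, number of pages).
Index = Dict[str, Tuple[str, int]]


@dataclass(frozen=True)
class Citation:
    doi: str
    page: int


def resolves(citation: Citation, index: Index) -> bool:
    """True only if the DOI is indexed and the page exists in that paper."""
    entry = index.get(citation.doi)
    if entry is None:
        return False
    _title, page_count = entry
    return 1 <= citation.page <= page_count


def unresolved(citations: Sequence[Citation], index: Index) -> List[Citation]:
    """Citations that fail to resolve; an empty list means the answer may ship."""
    return [c for c in citations if not resolves(c, index)]
```

Under this sketch, a single unresolved citation is enough to hold back the whole answer, which matches the rule that every citation must resolve.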

What this looks like in the product

Every answered turn shows the retrieval evidence inline, with a confidence indicator. Every declined turn shows the user what would have changed the answer: a more specific query, a different corpus, or a higher institutional subscription tier. The decline is never a dead end; it is a path forward the user can act on.

Why we will keep publishing the number

We publish the I-don't-know rate on the product's public dashboard, alongside retrieval coverage and citation-resolution rate. It is a visible, quarterly-reported metric. It will feel high, by the standards of a chatbot. It will look correct, by the standards of a research tool.

Colophon

Set in Bricolage Grotesque, IBM Plex Sans, IBM Plex Mono, and Instrument Serif. Edited by the LCFD studio. Published from London. Republished with attribution.

DOI · 10.0/lcfd.journal.brolly-ai-i-dont-know-rate
