Grounding fidelity — by model

Retrieval-grounded LLM outputs scored against the retrieved context. Higher overall means the model stays in-context and cites its work. A high tbd rate is good — it means the model correctly emits {{TBD: fact}} when sources are silent instead of inventing.

loading latest run…

Progress — overall grounding by run

Metric legend

overall: composite: 0.4·citation + 0.4·lexical + 0.2·(1 − unsupported_entity_rate). Higher = better grounded.
cite: fraction of generated sentences containing a [N] or (N) citation marker.
lex: fraction of generation's content words (≥4 chars, non-stop) also appearing in the retrieved context.
tbd: positive signal. TBD-marker count per 100 generated words. Model honestly refuses to invent when context is silent.
unsup: fraction of generated capitalised entities NOT present in context. Lower = better.