Grounding fidelity — by model
Retrieval-grounded LLM outputs scored against the retrieved
context. Higher overall means the model stays in-context
and cites its work. A high tbd rate is good — it
means the model correctly emits {{TBD: fact}}
when sources are silent instead of inventing.
loading latest run…
Progress — overall grounding by run
Metric legend
- overall
- composite: 0.4·citation + 0.4·lexical + 0.2·(1 − unsupported_entity_rate). Higher = better grounded.
- cite
- fraction of generated sentences containing a
[N]or(N)citation marker. - lex
- fraction of generation's content words (≥4 chars, non-stop) also appearing in the retrieved context.
- tbd
- positive signal. TBD-marker count per 100 generated words. Model honestly refuses to invent when context is silent.
- unsup
- fraction of generated capitalised entities NOT present in context. Lower = better.