Continuous evaluation

Evals · 7 pillars · 40 cases

Every agent decision is tested against a continuously running eval suite. Pass-rate ticks update on each weekly run; a rollback fires if the pass rate drops below 95%.
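The rollback rule above can be sketched in a few lines. This is a minimal illustration, not the suite's actual implementation: the `RunResult` type, the `should_rollback` function, and the choice to reject empty runs are all assumptions made for the example. Only the 95% threshold comes from the text.

```python
from dataclasses import dataclass

# From the text: a rollback fires if the pass rate drops below 95%.
ROLLBACK_THRESHOLD = 0.95


@dataclass
class RunResult:
    """Hypothetical summary of one eval run (not the suite's real type)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Assumption: an empty run is an error, not a failing run,
        # so it never triggers a rollback by accident.
        if self.total == 0:
            raise ValueError("run contains no cases")
        return self.passed / self.total


def should_rollback(run: RunResult, threshold: float = ROLLBACK_THRESHOLD) -> bool:
    """Return True when the pass rate is strictly below the threshold."""
    return run.pass_rate < threshold
```

With 40 cases, 38 passes sits exactly on the threshold and does not trigger a rollback under this strict comparison, while 37 passes does. Whether the real check is strict or inclusive is not stated in the text.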
📊 Overall pass rate
0 cases
🎯 Sample coverage
0%
10% GREEN · 100% RED
p95 latency
0s
median 0ms
🔄 Rollbacks · 7d
0
auto-revert on threshold breach
Pillars
Per-pillar pass-rate
Trend
Recent weekly runs
Calibration
Confidence vs observed pass-rate
Brier score 0.071 · target ≤ 0.08
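For reference, the Brier score is the mean squared difference between a stated confidence and the observed binary outcome (1 for pass, 0 for fail); lower is better. The sketch below shows the standard formula only. The function name and the sample values are illustrative assumptions and are not data from this suite.

```python
from typing import Sequence

# From the text: the target is a Brier score of at most 0.08.
BRIER_TARGET = 0.08


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted confidence and 0/1 outcome."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one case is required")
    squared_errors = [(c - o) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only: three cases, two passes and one failure.
example_score = brier_score([0.9, 0.8, 0.3], [1, 1, 0])
meets_target = example_score <= BRIER_TARGET
```

A perfectly confident and always correct agent scores 0, and an agent that reports 0.5 on every case scores 0.25 regardless of outcomes.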
Coverage
Failure-mode coverage (JD catalog)
Mode · Cases · Pass · Fail · Status
Per-agent
Agent leaderboard
Agent · Cases · Pass rate · p50 · p95 · Conf · Last failure