Continuous evaluation
Evals · 7 pillars · 40 cases
Every agent decision is tested against a continuously running eval suite. Pass-rate ticks update on each weekly run; rollback fires if the pass rate drops below 95%.
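The rollback rule above can be sketched roughly as follows. This is a minimal illustration, not the suite's actual implementation: the names `pass_rate`, `should_rollback`, and `ROLLBACK_THRESHOLD` are hypothetical, and it assumes each case resolves to a simple pass/fail boolean. Only the 95% threshold comes from the description.

```python
# Hypothetical sketch of the threshold check; names are illustrative only.
ROLLBACK_THRESHOLD = 0.95  # from the description: rollback fires below 95%


def pass_rate(results):
    """Return the fraction of passing cases, or None if no cases ran."""
    if not results:
        return None
    return sum(1 for passed in results if passed) / len(results)


def should_rollback(results, threshold=ROLLBACK_THRESHOLD):
    """True when the run's pass rate is strictly below the threshold.

    Assumption: an empty run (no cases) does not trigger a rollback.
    """
    rate = pass_rate(results)
    return rate is not None and rate < threshold


# Example: a 40-case run with 37 passes (92.5%) would trigger a rollback,
# while 38 passes (exactly 95%) would not.
run_37 = [True] * 37 + [False] * 3
run_38 = [True] * 38 + [False] * 2
print(should_rollback(run_37), should_rollback(run_38))
```

Whether a run at exactly 95% should pass is a design choice; the sketch follows the wording "drops below 95%" and treats it as a strict comparison.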
📊 Overall pass rate
—
0 cases
🎯 Sample coverage
0%
10% GREEN · 100% RED
⚡ p95 latency
0s
median 0ms
🔄 Rollbacks · 7d
0
auto-revert on threshold breach
Pillars
Per-pillar pass-rate
Trend
Recent weekly runs
Calibration
Confidence vs observed pass-rate
Brier score 0.071 · target ≤ 0.08
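For reference, the Brier score is the mean squared difference between predicted confidence and the observed outcome (1 for pass, 0 for fail), so lower is better. The sketch below shows the standard binary formula; the function name `brier_score` and the input shape are assumptions, and how this dashboard actually aggregates confidences is not stated here.

```python
# Hypothetical helper; standard binary Brier score, not the dashboard's own code.
def brier_score(confidences, outcomes):
    """Mean of (confidence - outcome)^2 over all cases.

    confidences: predicted pass probabilities in [0, 1]
    outcomes:    observed results, 1 for pass and 0 for fail
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one case is required")
    squared_errors = [(c - o) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


TARGET = 0.08  # from the dashboard: target is a Brier score of at most 0.08

# Example with made-up confidences, for illustration only.
score = brier_score([0.9, 0.8, 0.7, 0.95], [1, 1, 0, 1])
print(score, score <= TARGET)
```

A perfectly confident and correct predictor scores 0, while always predicting 0.5 scores 0.25 regardless of outcomes.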
Coverage
Failure-mode coverage (JD catalog)
| Mode | Cases | Pass | Fail | Status |
|---|---|---|---|---|
Per-agent
Agent leaderboard
| Agent | Cases | Pass rate | p50 | p95 | Conf | Last failure |
|---|---|---|---|---|---|---|