AI agent regressions ship to production because no one is testing them at the PR level.
SteelSpine adds compare --strict to your CI. New runs fail the build if they
regress versus baseline. Block bad agent behavior before it merges.
GitHub Actions, GitLab CI, Jenkins ready · Free trial · No credit card
Code changes get unit tests. Infrastructure changes get plan/apply review. AI agent changes get pushed, and the team hopes for the best. Most teams have no production-grade gate for agent regressions. SteelSpine ships one.
The pattern repeats across every team shipping agents.
The fix is a CI gate that compares the new agent's behavior against a pinned baseline. Same
pattern as terraform plan for infrastructure. New runs that regress fail the build.
Engineer sees the failure at PR time, fixes before merge.
steelspine compare --strict exits non-zero
when a new run is a regression versus the pinned baseline. Drop into any CI system. Block PR
merge on regression. Production stays clean.
A failing PR that ships an agent regression. SteelSpine catches it before merge.
No agent regression ships to production unnoticed. The engineer sees the failure at PR
time, fixes the prompt, re-runs the gate. Same pattern as terraform plan for
infrastructure: changes that regress are caught at review, not in prod.
Pin a known-good run as baseline. Every PR run executes steelspine compare --strict
<baseline> <new-run>. Exits non-zero on regression. Drop into GitHub Actions,
GitLab CI, or Jenkins. Block the merge.
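As a rough sketch, a CI step could wrap the gate in Python like this. It assumes only what is stated above: the command takes a baseline and a new run, and a non-zero exit code means the gate failed. The function names and the injectable `runner` hook are hypothetical and exist only for illustration.

```python
import subprocess
from typing import Callable, List, Optional


def build_compare_command(baseline: str, new_run: str) -> List[str]:
    """Build the argv for the strict comparison described above."""
    return ["steelspine", "compare", "--strict", baseline, new_run]


def _run_for_exit_code(argv: List[str]) -> int:
    """Run the command and return its exit code."""
    return subprocess.run(argv).returncode


def gate_passes(
    baseline: str,
    new_run: str,
    runner: Optional[Callable[[List[str]], int]] = None,
) -> bool:
    """Return True only when the compare command exits 0.

    `runner` is a hypothetical hook so the logic can be exercised
    without the steelspine binary installed.
    """
    run = runner if runner is not None else _run_for_exit_code
    return run(build_compare_command(baseline, new_run)) == 0
```

A CI script could then end with `sys.exit(0 if gate_passes(baseline, new_run) else 1)` so the job status follows the gate.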
steelspine eval --min-pass-rate 0.95 --max-failures 2 --forbid "PATTERN".
Gate CI on numerical thresholds. Forbid specific failure patterns. Surface clear pass/fail.
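To make the three rules concrete, here is a minimal sketch of what such a threshold check could compute. It is not SteelSpine's implementation: the `ScenarioResult` shape and the `evaluate` function are assumptions made up for this example, and the real semantics of each flag may differ.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScenarioResult:
    """Hypothetical record of one scenario outcome."""
    name: str
    passed: bool
    output: str


def evaluate(
    results: List[ScenarioResult],
    min_pass_rate: float,
    max_failures: int,
    forbid: Optional[str] = None,
) -> List[str]:
    """Return the list of violated rules; an empty list means the gate passes."""
    violations = []
    total = len(results)
    failures = sum(1 for r in results if not r.passed)
    pass_rate = (total - failures) / total if total else 0.0

    if pass_rate < min_pass_rate:
        violations.append(f"pass rate {pass_rate:.2f} < {min_pass_rate:.2f}")
    if failures > max_failures:
        violations.append(f"{failures} failures > {max_failures}")
    if forbid is not None:
        pattern = re.compile(forbid)
        hits = [r.name for r in results if pattern.search(r.output)]
        if hits:
            violations.append("forbidden pattern in: " + ", ".join(hits))
    return violations
```

Returning every violated rule, rather than stopping at the first, keeps the pass/fail report readable when several thresholds trip at once.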
steelspine monitor watches captured runs in the background. Surfaces failures the
moment they occur. Webhook + email alerts. PagerDuty-compatible.
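The format of SteelSpine's own webhook body is not specified here. As an illustration of what "PagerDuty-compatible" can mean on the receiving side, the sketch below builds a trigger event in the shape of the PagerDuty Events API v2. The function name, the `dedup_key` scheme, and the `source` value are assumptions.

```python
import json
from typing import Any, Dict

# Severity values accepted by the PagerDuty Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def pagerduty_trigger(
    routing_key: str, run_id: str, summary: str, severity: str = "error"
) -> Dict[str, Any]:
    """Build an Events API v2 'trigger' body for a failed run."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical key scheme: repeated alerts for one run collapse together.
        "dedup_key": f"steelspine-{run_id}",
        "payload": {
            "summary": summary,
            "source": "steelspine-monitor",  # assumed source label
            "severity": severity,
            "custom_details": {"run_id": run_id},
        },
    }


body = json.dumps(pagerduty_trigger("ROUTING_KEY", "run-123", "Agent run failed"))
```

The resulting JSON would be sent with an HTTP POST to the Events API endpoint; the network call is left out so the sketch stays self-contained.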
steelspine stats --since 7d shows pass rate trends, failure density, and streak
analysis. The baseline command pins a reference run. Drift alerts fire when current
runs deviate from it beyond a set threshold.
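The metrics named above can be sketched in a few lines. This is an illustrative reading, not the tool's actual definitions: it treats each run as a pass/fail boolean and treats drift as the absolute difference in pass rate from the baseline. All three function names are hypothetical.

```python
from typing import List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of runs that passed; 0.0 for an empty window."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def longest_failure_streak(outcomes: List[bool]) -> int:
    """Length of the longest run of consecutive failures."""
    longest = current = 0
    for ok in outcomes:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def has_drifted(baseline_rate: float, current_rate: float, threshold: float) -> bool:
    """True when the current pass rate is further from baseline than the threshold."""
    return abs(current_rate - baseline_rate) > threshold
```

For example, a window of `[True, False, False, True, False]` has a pass rate of 0.4 and a longest failure streak of 2, which counts as drift against a 0.9 baseline with a 0.1 threshold.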
steelspine diagnose <run_id> walks the captured event chain. Surfaces the
exact step where divergence started, plus cascade analysis showing downstream effects.
Pinpoints the actual broken behavior, not the symptom.
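The core idea of finding where divergence started can be sketched as a walk over two event sequences. This is a simplified illustration that assumes events are comparable labels and that a baseline sequence is available; how diagnose actually models events is not described here, and both function names are hypothetical.

```python
from typing import List, Optional, Sequence


def first_divergence(baseline: Sequence[str], candidate: Sequence[str]) -> Optional[int]:
    """Index of the first step where the two chains differ, or None if identical."""
    for index, (expected, actual) in enumerate(zip(baseline, candidate)):
        if expected != actual:
            return index
    if len(baseline) != len(candidate):
        # One chain is a strict prefix of the other: divergence starts where it ends.
        return min(len(baseline), len(candidate))
    return None


def downstream_of(candidate: Sequence[str], index: int) -> List[str]:
    """Steps after the divergence point, i.e. the candidates for cascade effects."""
    return list(candidate[index + 1:])
```

Reporting the first differing step, instead of the last failing one, is what separates the broken behavior from its downstream symptoms.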
steelspine replay-run <run_id> reconstructs state from captured events.
Offline, no live API calls. Debug production failures locally with bit-exact replay.
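Reconstructing state from captured events is a form of event replay. The sketch below shows the general technique under invented assumptions: the event types (`message`, `tool_result`) and their fields are hypothetical, not SteelSpine's capture format.

```python
from typing import Any, Dict, Iterable


def replay(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold captured events into a state dict without calling any live API."""
    state: Dict[str, Any] = {"messages": [], "tool_results": {}}
    for event in events:
        kind = event["type"]
        if kind == "message":
            state["messages"].append(event["text"])
        elif kind == "tool_result":
            # The recorded result stands in for the live tool call.
            state["tool_results"][event["tool"]] = event["result"]
        else:
            raise ValueError(f"unknown event type: {kind}")
    return state
```

Because the fold reads only recorded data, replaying the same events twice yields the same state, which is the property that makes offline debugging reliable.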
steelspine patterns surfaces recurring failures across all captured runs. When the
same bug comes back, it becomes visible. Catch chronic issues that point-in-time alerts
miss.
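One simple way to spot recurring failures is to normalize each failure message into a signature and count the distinct runs that share it. The sketch below illustrates that idea only; the normalization rule and the function names are assumptions, not how steelspine patterns groups failures.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def signature(message: str) -> str:
    """Normalize a failure message so variants of one bug group together."""
    return re.sub(r"\d+", "<n>", message.strip().lower())


def recurring_failures(
    failures: Iterable[Tuple[str, str]], min_runs: int = 2
) -> Dict[str, int]:
    """Map each signature to the number of distinct runs it appeared in.

    `failures` yields (run_id, message) pairs. Signatures seen in fewer
    than `min_runs` runs are dropped as one-off failures.
    """
    runs_by_signature: Dict[str, Set[str]] = {}
    for run_id, message in failures:
        runs_by_signature.setdefault(signature(message), set()).add(run_id)
    return {
        sig: len(runs)
        for sig, runs in runs_by_signature.items()
        if len(runs) >= min_runs
    }
```

Counting distinct runs rather than raw occurrences keeps one noisy run from looking like a chronic issue.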
Aggregate events from agent runs, Docker containers, log files, OpenTelemetry traces, and
custom HTTP emitters into a single audit chain. Use cross-adapter-query to run
unified queries across sources.
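Conceptually, a unified chain is a time-ordered merge of per-source event streams, with each event tagged by its origin. The sketch below shows that merge and a simple field filter. The event fields (`ts`, `run_id`) and both public function names are hypothetical, and each source is assumed to be sorted by timestamp already.

```python
import heapq
from typing import Any, Dict, Iterable, Iterator, List


def _tag(name: str, events: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield copies of the events labeled with the source they came from."""
    for event in events:
        yield {**event, "source": name}


def merge_sources(sources: Dict[str, Iterable[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merge per-source streams (each sorted by 'ts') into one ordered chain."""
    streams = [_tag(name, events) for name, events in sources.items()]
    return list(heapq.merge(*streams, key=lambda event: event["ts"]))


def query(chain: List[Dict[str, Any]], **filters: Any) -> List[Dict[str, Any]]:
    """Return events whose fields match every given filter, whatever their source."""
    return [e for e in chain if all(e.get(k) == v for k, v in filters.items())]
```

With the chain built once, a call such as `query(chain, run_id="r1")` returns matching events from every source in timestamp order.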
Drop this in .github/workflows/agent-eval.yml:
name: Agent eval gate
on:
  pull_request:
    paths:
      - 'agents/**'
      - 'prompts/**'
jobs:
  agent-regression-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install SteelSpine
        run: |
          curl -fsSL https://steelspine.ai/install.sh | bash
          steelspine license activate ${{ secrets.STEELSPINE_LICENSE }}
      - name: Run agent against test scenarios
        run: |
          steelspine run --name pr-${{ github.sha }} \
            python3 tests/run_agent_scenarios.py
      - name: Gate on regression vs baseline
        run: |
          steelspine compare --strict --vs-baseline pr-${{ github.sha }}
          # exits 2 if regression vs baseline, failing this job
The full recipe, with multi-baseline support, eval scoring, monitor-daemon webhooks, and per-environment configuration, is at /docs/ci-cd.html.
Self-serve for small teams. Team tier for production deployments. Custom for enterprise platform teams.
Enterprise platform teams with regulatory or air-gap requirements: see compliance tier for custom deployment.