For DevOps, SRE, & AI Platform Teams

Gate your PRs on
agent reliability.

AI agent regressions ship to production because no one is testing them at the PR level. SteelSpine adds compare --strict to your CI. New runs fail the build if they regress versus baseline. Block bad agent behavior before it merges.

GitHub Actions, GitLab CI, Jenkins ready · Free trial · No credit card

The DevOps reality of shipping AI

Code changes get unit tests. Infrastructure changes get plan/apply review. AI agent changes get pushed and hoped for. There is no production-grade gate for agent regression. SteelSpine ships one.

The pattern repeats across every team shipping agents:

The fix is a CI gate that compares the new agent's behavior against a pinned baseline. Same pattern as terraform plan for infrastructure. New runs that regress fail the build. Engineer sees the failure at PR time, fixes before merge.

SteelSpine's CI gate: steelspine compare --strict exits non-zero when a new run is a regression versus the pinned baseline. Drop into any CI system. Block PR merge on regression. Production stays clean.

What this looks like in your CI

A failing PR that ships an agent regression. SteelSpine catches it before merge.

GitHub Actions — agent regression caught

# PR opened: change to agents/refund_handler.py

# CI workflow triggers: agent-eval gate

# Step 1: capture new agent run against test scenarios

$ steelspine run --name pr-a8f4c2 python3 tests/run_agent_scenarios.py

✓ Run captured: pr-a8f4c2 · 312 events · 4.2s

  Pass rate: 87% (was 95% on baseline)

# Step 2: compare strict vs pinned baseline

$ steelspine compare --strict --vs-baseline pr-a8f4c2

✗ REGRESSION at event 187: param "query" changed

✗ 3 downstream decisions invalidated

✗ Pass rate dropped 8 points vs baseline

  exit code: 2

# GitHub Actions response:

::error::Agent regression detected. Merge blocked. View details: pr-a8f4c2

No agent regression ships to production unnoticed. The engineer sees the failure at PR time, fixes the prompt, re-runs the gate. Same pattern as terraform plan for infrastructure: changes that regress are caught at review, not in prod.

The DevOps feature set

CI Gate

compare --strict for PR regression

Pin a known-good run as baseline. Every PR run executes steelspine compare --strict <baseline> <new-run>. Exits non-zero on regression. Drop into GitHub Actions, GitLab CI, or Jenkins. Block the merge.

Eval Scoring

Score runs against criteria

steelspine eval --min-pass-rate 0.95 --max-failures 2 --forbid "PATTERN". Gate CI on numerical thresholds. Forbid specific failure patterns. Surface clear pass/fail.

Monitor Daemon

Real-time failure alerts

steelspine monitor watches captured runs in background. Surfaces failures the moment they occur. Webhook + email alerts. PagerDuty-compatible.

Drift Detection

Catch slow degradation

steelspine stats --since 7d shows pass rate trends, failure density, streak analysis. baseline command pins a reference. Drift alerts fire when current deviates beyond threshold.

Diagnose

Step-level root cause analysis

steelspine diagnose <run_id> walks the captured event chain. Surfaces the exact step where divergence started, plus cascade analysis showing downstream effects. Pinpoints the actual broken behavior, not the symptom.

Replay for Debug

Reproduce failures deterministically

steelspine replay-run <run_id> reconstructs state from captured events. Offline, no live API calls. Debug production failures locally with bit-exact replay.

Patterns

Find recurring failure modes

steelspine patterns surfaces recurring failures across all captured runs. The same bug surfacing again becomes visible. Surface chronic issues that point-in-time alerts miss.

Cross-Adapter

Multi-source event correlation

Aggregate events from agent runs, Docker containers, log files, OpenTelemetry traces, and custom HTTP emitters into a single audit chain. cross-adapter-query for unified queries across sources.

GitHub Actions example

Drop this in .github/workflows/agent-eval.yml:

name: Agent eval gate

on:
  pull_request:
    paths:
      - 'agents/**'
      - 'prompts/**'

jobs:
  agent-regression-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install SteelSpine
        run: |
          curl -fsSL https://steelspine.ai/install.sh | bash
          steelspine license activate ${{ secrets.STEELSPINE_LICENSE }}

      - name: Run agent against test scenarios
        run: |
          steelspine run --name pr-${{ github.sha }} \
            python3 tests/run_agent_scenarios.py

      - name: Gate on regression vs baseline
        run: |
          steelspine compare --strict --vs-baseline pr-${{ github.sha }}
          # exits 2 if regression vs baseline, failing this job

Full recipe with multi-baseline support, eval scoring, monitor-daemon webhooks, and per-environment configuration at /docs/ci-cd.html.

Pricing

Self-serve for small teams. Team tier for production deployments. Custom for enterprise platform teams.

Indie / Solo Engineer
CA$29.99/mo
For individual DevOps engineers running personal agent workflows.
  • Unlimited captures
  • CI gate via compare --strict
  • Eval scoring + diagnose
  • Monitor daemon (1 environment)
  • 1 user, 1 machine
Start free trial

Enterprise platform teams with regulatory or air-gap requirements: see compliance tier for custom deployment.