SteelSpine for DevOps Teams — Gate PRs on agent reliability.

The DevOps reality of shipping AI

Code changes get unit tests. Infrastructure changes get plan/apply review. AI agent changes get pushed and hoped for. There is no production-grade gate for agent regression. SteelSpine ships one.

The pattern repeats across every team shipping agents:

Engineer changes a prompt. Local test passes. PR merges.
Production agent now fails 8% more often. Nobody notices for 3 days.
Customer support ticket spike triggers investigation.
Rollback to last good version. Lose 3 days of progress.

The fix is a CI gate that compares the new agent's behavior against a pinned baseline. Same pattern as terraform plan for infrastructure. New runs that regress fail the build. Engineer sees the failure at PR time, fixes before merge.

SteelSpine's CI gate: steelspine compare --strict exits non-zero when a new run is a regression versus the pinned baseline. Drop into any CI system. Block PR merge on regression. Production stays clean.

What this looks like in your CI

A failing PR that ships an agent regression. SteelSpine catches it before merge.

        
        GitHub Actions — agent regression caught

# PR opened: change to agents/refund_handler.py
# CI workflow triggers: agent-eval gate
# Step 1: capture new agent run against test scenarios
$ steelspine run --name pr-a8f4c2 python3 tests/run_agent_scenarios.py
✓ Run captured: pr-a8f4c2 · 312 events · 4.2s
  Pass rate: 87% (was 95% on baseline)
# Step 2: compare strict vs pinned baseline
$ steelspine compare --strict --vs-baseline pr-a8f4c2
✗ REGRESSION at event 187: param "query" changed
✗ 3 downstream decisions invalidated
✗ Pass rate dropped 8 points vs baseline
  exit code: 2
# GitHub Actions response:
::error::Agent regression detected. Merge blocked. View details: pr-a8f4c2

No agent regression ships to production unnoticed. The engineer sees the failure at PR time, fixes the prompt, re-runs the gate. Same pattern as terraform plan for infrastructure: changes that regress are caught at review, not in prod.

The DevOps feature set

CI Gate

compare --strict for PR regression

Pin a known-good run as baseline. Every PR run executes steelspine compare --strict <baseline> <new-run>. Exits non-zero on regression. Drop into GitHub Actions, GitLab CI, or Jenkins. Block the merge.

Eval Scoring

Score runs against criteria

steelspine eval --min-pass-rate 0.95 --max-failures 2 --forbid "PATTERN". Gate CI on numerical thresholds. Forbid specific failure patterns. Surface clear pass/fail.

Monitor Daemon

Real-time failure alerts

steelspine monitor watches captured runs in background. Surfaces failures the moment they occur. Webhook + email alerts. PagerDuty-compatible.

Drift Detection

Catch slow degradation

steelspine stats --since 7d shows pass rate trends, failure density, streak analysis. baseline command pins a reference. Drift alerts fire when current deviates beyond threshold.

Diagnose

Step-level root cause analysis

steelspine diagnose <run_id> walks the captured event chain. Surfaces the exact step where divergence started, plus cascade analysis showing downstream effects. Pinpoints the actual broken behavior, not the symptom.

Replay for Debug

Reproduce failures deterministically

steelspine replay-run <run_id> reconstructs state from captured events. Offline, no live API calls. Debug production failures locally with bit-exact replay.

Patterns

Find recurring failure modes

steelspine patterns surfaces recurring failures across all captured runs. The same bug surfacing again becomes visible. Surface chronic issues that point-in-time alerts miss.

Cross-Adapter

Multi-source event correlation

Aggregate events from agent runs, Docker containers, log files, OpenTelemetry traces, and custom HTTP emitters into a single audit chain. cross-adapter-query for unified queries across sources.

GitHub Actions example

Drop this in .github/workflows/agent-eval.yml:

name: Agent eval gate

on:
  pull_request:
    paths:
      - 'agents/**'
      - 'prompts/**'

jobs:
  agent-regression-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install SteelSpine
        run: |
          curl -fsSL https://steelspine.ai/install.sh | bash
          steelspine license activate ${{ secrets.STEELSPINE_LICENSE }}

      - name: Run agent against test scenarios
        run: |
          steelspine run --name pr-${{ github.sha }} \
            python3 tests/run_agent_scenarios.py

      - name: Gate on regression vs baseline
        run: |
          steelspine compare --strict --vs-baseline pr-${{ github.sha }}
          # exits 2 if regression vs baseline, failing this job

Full recipe with multi-baseline support, eval scoring, monitor-daemon webhooks, and per-environment configuration at /docs/ci-cd.html.

Pricing

Self-serve for small teams. Team tier for production deployments. Custom for enterprise platform teams.

Indie / Solo Engineer

Free

For individual DevOps engineers running personal agent workflows.

Unlimited captures
CI gate via compare --strict
Eval scoring + diagnose
Monitor daemon (1 environment)
1 user, 1 machine

Get it free

Team (coming soon)

Free

For DevOps and platform teams gating PRs on agent regression in CI.

Everything in Indie tier
Multi-seat licensing
Shared baseline management
Monitor daemon (multi-environment)
Slack / PagerDuty webhooks
GitHub Marketplace integration (planned)
Team dashboard at .ui

Contact for team tier

Enterprise platform teams with regulatory or air-gap requirements: see compliance tier for custom deployment.

Gate your PRs onagent reliability.

The DevOps reality of shipping AI

What this looks like in your CI

The DevOps feature set

compare --strict for PR regression

Score runs against criteria

Real-time failure alerts

Catch slow degradation

Step-level root cause analysis

Reproduce failures deterministically

Find recurring failure modes

Multi-source event correlation

GitHub Actions example

Pricing

Gate your PRs on
agent reliability.