Regression testing
AgentDbg ships three CLI commands that turn traced runs into lightweight regression tests: baseline, assert, and diff. Together they let you capture a known-good run, check future runs against it, and drill into what changed when something breaks.
Why
Agent behavior is non-deterministic. A prompt tweak, model upgrade, or tool change can silently increase token usage, add unexpected tool calls, or introduce loops. agentdbg assert gives you a one-line check — locally or in CI — that catches these regressions before they reach production.
Workflow overview
1. Run your agent agentdbg view
2. Capture a baseline agentdbg baseline <run_id>
3. Run the agent again python your_agent.py
4. Assert against baseline agentdbg assert <new_run_id> --baseline .agentdbg/baselines/my_agent.json
5. If it fails, diff agentdbg diff <new_run_id> --baseline .agentdbg/baselines/my_agent.json
Step 1: Capture a baseline
After a successful run that represents the expected behavior:
agentdbg baseline <RUN_ID>
This creates a JSON snapshot at .agentdbg/baselines/<run_name>.json (or the run ID if no name was set). Use --out to control the path:
agentdbg baseline a1b2c3d4 --out baselines/support_agent_v1.json
What gets captured:
| Field | Description |
|---|---|
schema_version |
Baseline format version ("0.1") |
source_run_id |
The run this baseline was created from |
source_run_name |
Run name (if set) |
summary |
Aggregate metrics: total events, LLM calls, tool calls, errors, loop warnings, duration, tokens |
tool_path |
Sorted list of unique tool names used |
tool_call_counts |
Per-tool invocation counts |
llm_models_used |
Models seen in LLM_CALL events |
event_type_sequence |
Ordered list of event types |
guardrail_events |
Any guardrail-triggered events |
final_status |
Run status ("ok" or "error") |
Check the baseline file into version control so the team shares the same reference point.
Step 2: Assert against a baseline
agentdbg assert <RUN_ID> --baseline .agentdbg/baselines/my_agent.json
Exit codes: 0 = all checks pass, 1 = one or more checks failed, 2 = run or baseline not found, 10 = internal error.
What gets checked
Checks are controlled by the assertion policy — a combination of a policy YAML file and CLI flags. By default, if a baseline is provided, every numeric metric is compared with a 50% tolerance. You can tighten or customize this with a policy file or CLI flags.
Standalone thresholds (no baseline needed)
You can assert without a baseline by setting hard caps:
agentdbg assert <RUN_ID> --max-steps 80 --max-tool-calls 30 --no-loops
Combining baseline and thresholds
When both a baseline and a max_* threshold are set, the effective limit is the lesser of the two:
limit = min(baseline_value * (1 + tolerance), max_value)
See the Policy YAML reference for the full decision table.
Step 3: Use a policy file
Instead of passing many CLI flags, commit a .agentdbg/policy.yaml file:
assert:
max_steps: 80
step_tolerance: 0.2
max_tool_calls: 30
no_loops: true
no_new_tools: true
expect_status: ok
agentdbg assert auto-detects .agentdbg/policy.yaml in the current directory. To use a different path:
agentdbg assert <RUN_ID> --baseline baseline.json --policy ci-policy.yaml
Precedence: CLI flags > policy file > defaults. See the full policy reference for all fields, threshold semantics, and override rules.
Output formats
Use --format (-f) to choose the output format.
Text (default)
agentdbg assert <RUN_ID> --baseline baseline.json
✓ step_count: 42 steps (baseline: 38, tolerance: 50%)
✓ tool_calls: 12 tool calls (baseline: 10, tolerance: 50%)
✗ no_loops: 2 loop warning(s) detected
✓ expect_status: status is 'ok'
RESULT: FAILED (1 of 4 checks failed)
JSON
agentdbg assert <RUN_ID> --baseline baseline.json --format json
{
"run_id": "a1b2c3d4-...",
"baseline_run_id": "e5f6a7b8-...",
"passed": false,
"results": [
{
"check_name": "step_count",
"passed": true,
"message": "42 steps (baseline: 38, tolerance: 50%)",
"expected": "57",
"actual": "42"
},
{
"check_name": "no_loops",
"passed": false,
"message": "2 loop warning(s) detected",
"expected": null,
"actual": "2"
}
]
}
Markdown
Designed for GitHub PR comments and step summaries:
agentdbg assert <RUN_ID> --baseline baseline.json --format markdown
## AgentDbg Regression Report
| Check | Status | Details |
|-------|--------|---------|
| step_count | ✅ Pass | 42 steps (baseline: 38, tolerance: 50%) |
| no_loops | ❌ Fail | 2 loop warning(s) detected |
Result: **FAILED**
Step 4: Drill into failures with diff
When agentdbg assert fails, use agentdbg diff to see exactly what changed.
Diff against a baseline
agentdbg diff <RUN_ID> --baseline .agentdbg/baselines/my_agent.json
Diff two runs directly
agentdbg diff <RUN_A> <RUN_B>
Sample output
Run comparison: a1b2c3d4 vs e5f6a7b8
Summary:
total_events: 38 -> 42 (+11%)
tool_calls: 10 -> 14 (+40%)
loop_warnings: 0 -> 2 (NEW)
Tool path changes:
+ web_search (new)
Event type distribution:
LLM_CALL: 8 -> 8
TOOL_CALL: 10 -> 14 (+40%)
LOOP_WARNING: 0 -> 2 (NEW)
The diff shows summary-level metric changes, new or removed tools, and shifts in the event type distribution.
GitHub Actions example
Run your agent in CI, then assert against the checked-in baseline:
- name: Run agent
run: python my_agent.py
- name: Assert agent behavior
run: |
RUN_ID=$(agentdbg list --json | python -c "import sys,json; print(json.load(sys.stdin)['runs'][0]['run_id'])")
agentdbg assert "$RUN_ID" \
--baseline .agentdbg/baselines/my_agent.json \
--format markdown >> "$GITHUB_STEP_SUMMARY"
If the assertion fails, the step exits with code 1 and the markdown report appears in the GitHub Actions step summary.
Related docs
- Policy YAML reference — all assertion fields, threshold semantics, CLI-to-YAML mapping
- CLI reference —
baseline,assert,diffcommand details - Guardrails — runtime limits (separate from post-hoc assertions)