Maida gates your PRs with behavioral checks for AI agents. It compares current runs against checked-in baselines and fails CI when structural behavior regresses: more steps, unexpected tool calls, loops, latency spikes, or cost blowups.
```
$ maida assert <RUN_ID> --baseline baselines/main.json --policy .maida/policy.yaml --format markdown

[FAIL] step_count     6 -> 22   (+267%)
[FAIL] tool_calls     4 -> 19   (+375%)
[FAIL] new_tool_path  salesforce_api
[FAIL] cost_estimate  $0.012 -> $0.041  (+242%)
[PASS] loop_risk      none detected

4 checks failed. PR blocked.
```
A prompt tweak can preserve the final answer while changing the execution path completely. The agent may take more steps, call a new tool, retry more often, or enter a loop. Manual smoke tests and output-only checks often miss that kind of regression.
Output evals ask whether the answer was good. Maida asks whether behavior changed.
Output evals are useful. Production observability tools are useful. Maida sits earlier in the workflow: before merge, inside CI, where a PR can still be blocked.
Maida is the pre-merge behavioral regression gate for AI agents.
Install the `maida-ai` package from PyPI. It provides the `maida` CLI and the `maida` Python module.
```
pip install maida-ai
```
Wrap one agent entrypoint so Maida can record runs.
```python
from maida import trace

@trace
def run_agent(input: str):
    # Your existing agent code
    ...
```
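If decorating the entrypoint is awkward, Maida also supports `traced_run()` (mentioned again in the CI setup below). Its exact signature is an assumption here; this sketch assumes it wraps a callable and its arguments:

```python
from maida import traced_run

# Assumed signature: traced_run(fn, *args, **kwargs) returns fn's result
# while recording the run. Check the Maida docs for the real API.
result = traced_run(run_agent, "summarize yesterday's support tickets")
```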
Run your agent once on main, find the run ID, and check in the baseline JSON.
```
python my_agent.py
maida list --json
maida baseline <RUN_ID> --out baselines/my_agent.json
```
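The baseline's exact schema is whatever `maida baseline` writes. As a rough mental model, it records the structural metrics the assert output reports; the field names below are illustrative assumptions, not the real format:

```json
{
  "run_id": "<RUN_ID>",
  "step_count": 6,
  "tool_calls": 4,
  "cost_estimate": 0.012,
  "status": "ok"
}
```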
Compare each new run against the baseline and fail when policy says behavior regressed.
```
python my_agent.py
maida assert <RUN_ID> --baseline baselines/my_agent.json --policy .maida/policy.yaml --format markdown
maida diff <RUN_ID> --baseline baselines/my_agent.json
maida view <RUN_ID>
```
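If you want to script the whole loop, something like this works, assuming `maida list --json` prints runs newest-first with an `id` field (both assumptions; check the actual output):

```sh
python my_agent.py
RUN_ID=$(maida list --json | jq -r '.[0].id')   # assumed JSON shape
maida assert "$RUN_ID" --baseline baselines/my_agent.json --policy .maida/policy.yaml
```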
Add `maida-ai/maida-assert@v2` to your pull request workflow. The script you point `agent-script` at must use `@trace` or `traced_run()` so Maida can record a run.
See the action repo for the current inputs: `agent-script`, `baseline`, `policy`, `maida-version`, `python-version`, `extra-args`, and `post-comment`.
```yaml
name: Agent Regression Check

on: [pull_request]

jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: maida-ai/maida-assert@v2
        with:
          agent-script: my_agent.py
          baseline: baselines/my_agent.json
          policy: .maida/policy.yaml
          python-version: '3.12'
```
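The other inputs from the list above are optional. A fuller step might look like this; the values are illustrative guesses, not documented defaults:

```yaml
      - uses: maida-ai/maida-assert@v2
        with:
          agent-script: my_agent.py
          baseline: baselines/my_agent.json
          policy: .maida/policy.yaml
          python-version: '3.12'
          maida-version: 'latest'          # illustrative; pin a release in real use
          extra-args: '--format markdown'  # assumed to be forwarded to `maida assert`
          post-comment: 'true'             # assumed to post results as a PR comment
```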
These are behavioral regression signals. They do not prove output quality. They tell you when the agent's execution behavior changed relative to a baseline.
```yaml
assert:
  no_loops: true
  no_guardrails: true
  step_tolerance: 0.5
  expect_status: ok
```
Commit the policy file beside your baseline. The policy defines which behavior changes the CI check should tolerate and which changes should block the pull request.
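As a mental model, assume `step_tolerance: 0.5` means the current run may take up to 50% more steps than the baseline (an assumption about the semantics, not documented behavior). The step check would then reduce to something like this sketch:

```python
def within_step_tolerance(baseline_steps: int, current_steps: int,
                          tolerance: float) -> bool:
    """Hypothetical sketch of the step-count check, not Maida's actual code."""
    return current_steps <= baseline_steps * (1 + tolerance)

# The failing run from the example output: 6 baseline steps, 22 current.
# With tolerance 0.5 the ceiling is 9 steps, so the check fails.
assert not within_step_tolerance(6, 22, 0.5)
```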
These are product examples, not customer testimonials.
The final answer still looks fine, but the agent now takes 22 steps instead of 6. Maida flags the step-count regression before the PR merges.
A refactor introduces a tool call that never appeared in the baseline. Maida fails the check until the behavior is fixed or explicitly accepted in policy.
The run produces partial output, but no longer reaches the expected terminal condition. Maida catches the missing stop condition in CI.
A model or prompt change causes repeated calls with similar inputs. Maida records the pattern and fails the PR when policy says loops are not allowed.
Runs on your machine or GitHub Actions runner. No cloud account required. No telemetry by default. Runs stay in your environment unless you explicitly export them.
Maida writes local JSONL files under `~/.maida/runs/`. You can inspect them, archive them, or delete them.
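Because the files are plain JSONL, a few lines of Python are enough to inspect them. The event field names are up to Maida's format, and this sketch assumes run files sit directly under that directory with a `.jsonl` extension, so it only counts records:

```python
import json
from pathlib import Path

runs_dir = Path.home() / ".maida" / "runs"
for path in sorted(runs_dir.glob("*.jsonl")):
    # Each line is one JSON event recorded during a traced run.
    events = [json.loads(line) for line in path.read_text().splitlines() if line]
    print(f"{path.name}: {len(events)} events")
```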
Install Maida, capture a known-good baseline, and run a behavioral check against your next prompt, model, tool, or framework change.