---
title: "Run evaluations"
description: "Learn how to run evaluations using Arcade"
---
# Run evaluations
The `arcade evals` command discovers and executes evaluation suites with support for multiple providers, models, and output formats.
import { Callout } from "nextra/components";

<Callout type="info">
  **Backward compatibility**: All new features (multi-provider support, capture
  mode, output formats) work with existing evaluation suites. No code changes
  are required.
</Callout>
## Basic usage
Run all evaluations in the current directory:
```bash
arcade evals .
```
The command searches for files starting with `eval_` and ending with `.py`.
Show detailed results with critic feedback:
```bash
arcade evals . --details
```
Filter to show only failures:
```bash
arcade evals . --only-failed
```
## Multi-provider support
### Single provider with default model
Use OpenAI with its default model (`gpt-4o`):
```bash
export OPENAI_API_KEY=sk-...
arcade evals .
```
Use Anthropic with its default model (`claude-sonnet-4-5-20250929`):
```bash
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic
```
### Specific models
Specify one or more models for a provider:
```bash
arcade evals . --use-provider "openai:gpt-4o,gpt-4o-mini"
```
### Multiple providers
Compare performance across providers (space-separated):
```bash
arcade evals . \
--use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
--api-key openai:sk-... \
--api-key anthropic:sk-ant-...
```
When you specify multiple models, results show side-by-side comparisons.
## API keys
API keys are resolved in the following order:
| Priority | Source | Format |
|----------|--------|--------|
| 1 | `--api-key` flag | `--api-key provider:key` (repeatable) |
| 2 | Environment variables | `OPENAI_API_KEY`, `ANTHROPIC_API_KEY` |
| 3 | `.env` file | `OPENAI_API_KEY=...`, `ANTHROPIC_API_KEY=...` |
Create a `.env` file in your project directory to avoid setting keys in every terminal session.
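For example, a `.env` file in your project directory might look like this (placeholder values shown):
```bash
# .env -- picked up when you run `arcade evals` from this directory
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```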
**Examples:**
```bash
# Single provider
arcade evals . --api-key openai:sk-...
# Multiple providers
arcade evals . \
--api-key openai:sk-... \
--api-key anthropic:sk-ant-...
```
## Capture mode
Record tool calls without scoring to bootstrap test expectations:
```bash
arcade evals . --capture --output captures/baseline.json
```
Include conversation context in captured output:
```bash
arcade evals . --capture --include-context --output captures/detailed.json
```
Capture mode is useful for:
- Creating initial test expectations
- Debugging model behavior
- Understanding tool call patterns
See [Capture mode](/guides/create-tools/evaluate-tools/capture-mode) for details.
## Output formats
### Save results to files
Specify output files with extensions; the format is auto-detected from the extension:
```bash
# Single format
arcade evals . --output results.md
# Multiple formats
arcade evals . --output results.md --output results.html --output results.json
# All formats (no extension)
arcade evals . --output results
```
### Available formats
| Extension | Format | Description |
|-----------|--------|-------------|
| `.txt` | Plain text | Pytest-style output |
| `.md` | Markdown | Tables and collapsible sections |
| `.html` | HTML | Interactive report |
| `.json` | JSON | Structured data for programmatic use (see the example below) |
| (none) | All formats | Generates all four formats |
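The `.json` report is intended for downstream tooling. As a minimal, structure-agnostic sanity check, you can pretty-print it before wiring it into your own scripts or CI:
```bash
# Pretty-print the JSON report; pipe it into your own tooling from here
python -m json.tool results.json
```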
## Command options
### Quick reference
| Flag | Short | Purpose | Example |
|------|-------|---------|---------|
| `--use-provider` | `-p` | Select provider/model | `-p "openai:gpt-4o"` |
| `--api-key` | `-k` | Provider API key | `-k openai:sk-...` |
| `--capture` | - | Record without scoring | `--capture` |
| `--details` | `-d` | Show critic feedback | `--details` |
| `--only-failed` | `-f` | Filter failures | `--only-failed` |
| `--output` | `-o` | Output file(s) | `-o results.md` |
| `--include-context` | - | Add messages to output | `--include-context` |
| `--max-concurrent` | `-c` | Parallel limit | `-c 10` |
| `--debug` | - | Debug info | `--debug` |
### `--use-provider`, `-p`
Specify one or more providers, each optionally followed by a comma-separated list of models; separate multiple providers with spaces:
```bash
--use-provider "<provider>[:<model>[,<model>...]] [<provider>[:<model>...]]"
```
**Supported providers:**
- `openai` (default: `gpt-4o`)
- `anthropic` (default: `claude-sonnet-4-5-20250929`)
<Callout type="info">
  Anthropic model names include date stamps. Check [Anthropic's model
  documentation](https://docs.anthropic.com/en/docs/about-claude/models) for the
  latest model versions.
</Callout>
**Examples:**
```bash
# Default model for provider
arcade evals . -p anthropic
# Specific model
arcade evals . -p "openai:gpt-4o-mini"
# Multiple models from same provider
arcade evals . -p "openai:gpt-4o,gpt-4o-mini"
# Multiple providers (space-separated)
arcade evals . -p "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929"
```
### `--api-key`, `-k`
Provide API keys explicitly (repeatable):
```bash
arcade evals . -k openai:sk-... -k anthropic:sk-ant-...
```
### `--capture`
Enable capture mode to record tool calls without scoring:
```bash
arcade evals . --capture
```
### `--include-context`
Include system messages and conversation history in output:
```bash
arcade evals . --include-context --output results.md
```
### `--output`, `-o`
Specify output file(s). Format is auto-detected from extension:
```bash
# Single format
arcade evals . -o results.md
# Multiple formats (repeat flag)
arcade evals . -o results.md -o results.html
# All formats (no extension)
arcade evals . -o results
```
### `--details`, `-d`
Show detailed results including critic feedback:
```bash
arcade evals . --details
```
### `--only-failed`, `-f`
Show only failed test cases:
```bash
arcade evals . --only-failed
```
### `--max-concurrent`, `-c`
Set maximum concurrent evaluations:
```bash
arcade evals . --max-concurrent 10
```
Default is 1 concurrent evaluation.
### `--debug`
Show debug information for troubleshooting:
```bash
arcade evals . --debug
```
Displays detailed error traces and connection information.
## Understanding results
Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.
### Summary format
Results show overall performance:
```
Summary -- Total: 5 -- Passed: 4 -- Failed: 1
```
**How flags affect output** (flags can be combined, as shown below):
- `--details`: Adds per-critic breakdown for each case
- `--only-failed`: Filters to show only failed cases (summary shows original totals)
- `--include-context`: Includes system messages and conversation history
- Multiple models: Switches to comparison table format
- Comparative tracks: Shows side-by-side track comparison
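For example, combining `--details` with `--only-failed` prints critic feedback for just the failing cases:
```bash
arcade evals . --details --only-failed
```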
### Case results
Each case displays status and score:
```
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65
```
### Detailed feedback
Use `--details` to see critic-level analysis:
```
Details:
  location:
    Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units:
    Match: True, Score: 0.30/0.30
```
### Multi-model results
When you evaluate multiple models, results show a side-by-side comparison for each case:
```
Case: Get weather for city
Model: gpt-4o -- Score: 1.00 -- PASSED
Model: gpt-4o-mini -- Score: 0.95 -- WARNED
```
## Advanced usage
### High concurrency for fast execution
Increase concurrent evaluations:
```bash
arcade evals . --max-concurrent 20
```
<Callout type="warning">
  High concurrency may hit API rate limits. Start with the default (1) and
  increase gradually.
</Callout>
### Save comprehensive results
Generate all formats with full details:
```bash
arcade evals . --details --include-context --output results
```
This creates:
- `results.txt`
- `results.md`
- `results.html`
- `results.json`
## Troubleshooting
### Missing dependencies
If you see `ImportError: MCP SDK is required`, install the full package:
```bash
pip install 'arcade-mcp[evals]'
```
For Anthropic support:
```bash
pip install anthropic
```
### Tool name mismatches
Tool names are normalized when evaluations run: dots become underscores, so a tool defined as `Weather.GetWeather`, for example, is reported as `Weather_GetWeather`. Check your tool definitions if you see unexpected names.
### API rate limits
Reduce the `--max-concurrent` value:
```bash
arcade evals . --max-concurrent 2
```
### No evaluation files found
Ensure your evaluation files meet all of the following (a quick check is shown after the list):
- Start with `eval_`
- End with `.py`
- Contain functions decorated with `@tool_eval()`
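For example, from your project directory you can list the files that match the naming pattern the command discovers:
```bash
# Only files matching eval_*.py are picked up by `arcade evals .`
ls eval_*.py
```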
## Next steps
- Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) for recording tool calls
- Learn about [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations) for comparing tool sources