---
title: "Run evaluations"
description: "Learn how to run evaluations using Arcade"
---

# Run evaluations

The `arcade evals` command discovers and executes evaluation suites with support for multiple providers, models, and output formats.

import { Callout } from "nextra/components";

<Callout type="info">
**Backward compatibility**: All new features (multi-provider support, capture mode, output formats) work with existing evaluation suites. No code changes required.
</Callout>

## Basic usage

Run all evaluations in the current directory:

```bash
arcade evals .
```

The command searches for files starting with `eval_` and ending with `.py`.

Show detailed results with critic feedback:

```bash
arcade evals . --details
```

Filter to show only failures:

```bash
arcade evals . --only-failed
```

## Multi-provider support

### Single provider with default model

Use OpenAI with the default model (`gpt-4o`):

```bash
export OPENAI_API_KEY=sk-...
arcade evals .
```

Use Anthropic with the default model (`claude-sonnet-4-5-20250929`):

```bash
export ANTHROPIC_API_KEY=sk-ant-...
arcade evals . --use-provider anthropic
```

### Specific models

Specify one or more models for a provider:

```bash
arcade evals . --use-provider "openai:gpt-4o,gpt-4o-mini"
```

### Multiple providers

Compare performance across providers (space-separated):

```bash
arcade evals . \
  --use-provider "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...
```

When you specify multiple models, results show side-by-side comparisons.

## API keys

API keys are resolved in the following order:

| Priority | Format |
|----------|--------|
| 1. Explicit flag | `--api-key provider:key` (can repeat) |
| 2. Environment | `OPENAI_API_KEY`, `ANTHROPIC_API_KEY` |
| 3. `.env` file | `OPENAI_API_KEY=...`, `ANTHROPIC_API_KEY=...` |

Create a `.env` file in your project directory to avoid setting keys in every terminal session.

**Examples:**

```bash
# Single provider
arcade evals . --api-key openai:sk-...

# Multiple providers
arcade evals . \
  --api-key openai:sk-... \
  --api-key anthropic:sk-ant-...
```
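The `.env` file is plain `KEY=value` lines, one per key, matching the environment variable names above. A minimal example (keys shown as placeholders):

```bash
# .env in the project directory
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```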
## Capture mode

Record tool calls without scoring to bootstrap test expectations:

```bash
arcade evals . --capture --output captures/baseline.json
```

Include conversation context in captured output:

```bash
arcade evals . --capture --include-context --output captures/detailed.json
```

Capture mode is useful for:

- Creating initial test expectations
- Debugging model behavior
- Understanding tool call patterns

See [Capture mode](/guides/create-tools/evaluate-tools/capture-mode) for details.

## Output formats

### Save results to files

Specify output files with extensions; the format is auto-detected:

```bash
# Single format
arcade evals . --output results.md

# Multiple formats
arcade evals . --output results.md --output results.html --output results.json

# All formats (no extension)
arcade evals . --output results
```

### Available formats

| Extension | Format | Description |
|-----------|--------|-------------|
| `.txt` | Plain text | Pytest-style output |
| `.md` | Markdown | Tables and collapsible sections |
| `.html` | HTML | Interactive report |
| `.json` | JSON | Structured data for programmatic use |
| (none) | All formats | Generates all four formats |

## Command options

### Quick reference

| Flag | Short | Purpose | Example |
|------|-------|---------|---------|
| `--use-provider` | `-p` | Select provider/model | `-p "openai:gpt-4o"` |
| `--api-key` | `-k` | Provider API key | `-k openai:sk-...` |
| `--capture` | - | Record without scoring | `--capture` |
| `--details` | `-d` | Show critic feedback | `--details` |
| `--only-failed` | `-f` | Filter failures | `--only-failed` |
| `--output` | `-o` | Output file(s) | `-o results.md` |
| `--include-context` | - | Add messages to output | `--include-context` |
| `--max-concurrent` | `-c` | Parallel limit | `-c 10` |
| `--debug` | - | Debug info | `--debug` |

### `--use-provider`, `-p`

Specify provider(s) and model(s) (space-separated):

```bash
--use-provider "<provider>[:<model>,<model>,...] [<provider>[:<model>,...]]"
```

**Supported providers:**

- `openai` (default: `gpt-4o`)
- `anthropic` (default: `claude-sonnet-4-5-20250929`)

Anthropic model names include date stamps. Check [Anthropic's model documentation](https://docs.anthropic.com/en/docs/about-claude/models) for the latest model versions.

**Examples:**

```bash
# Default model for provider
arcade evals . -p anthropic

# Specific model
arcade evals . -p "openai:gpt-4o-mini"

# Multiple models from the same provider
arcade evals . -p "openai:gpt-4o,gpt-4o-mini"

# Multiple providers (space-separated)
arcade evals . -p "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929"
```

### `--api-key`, `-k`

Provide API keys explicitly (repeatable):

```bash
arcade evals . -k openai:sk-... -k anthropic:sk-ant-...
```

### `--capture`

Enable capture mode to record tool calls without scoring:

```bash
arcade evals . --capture
```

### `--include-context`

Include system messages and conversation history in output:

```bash
arcade evals . --include-context --output results.md
```

### `--output`, `-o`

Specify output file(s). The format is auto-detected from the extension:

```bash
# Single format
arcade evals . -o results.md

# Multiple formats (repeat flag)
arcade evals . -o results.md -o results.html

# All formats (no extension)
arcade evals . -o results
```

### `--details`, `-d`

Show detailed results including critic feedback:

```bash
arcade evals . --details
```

### `--only-failed`, `-f`

Show only failed test cases:

```bash
arcade evals . --only-failed
```

### `--max-concurrent`, `-c`

Set the maximum number of concurrent evaluations:

```bash
arcade evals . --max-concurrent 10
```

The default is 1 concurrent evaluation.

### `--debug`

Show debug information for troubleshooting:

```bash
arcade evals . --debug
```

Displays detailed error traces and connection information.
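These options can be combined in a single run. For example, the following compares two providers, shows critic feedback, and writes a Markdown report (the API keys and output filename are placeholders):

```bash
# Compare two providers with detailed critic feedback and a Markdown report
arcade evals . \
  -p "openai:gpt-4o anthropic:claude-sonnet-4-5-20250929" \
  -k openai:sk-... \
  -k anthropic:sk-ant-... \
  --details \
  -o comparison.md
```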
## Understanding results

Results are formatted based on the evaluation type (regular, multi-model, or comparative) and the selected flags.

### Summary format

Results show overall performance:

```
Summary -- Total: 5 -- Passed: 4 -- Failed: 1
```

**How flags affect output:**

- `--details`: Adds a per-critic breakdown for each case
- `--only-failed`: Filters to show only failed cases (the summary still shows the original totals)
- `--include-context`: Includes system messages and conversation history
- Multiple models: Switches to the comparison table format
- Comparative tracks: Shows a side-by-side track comparison

### Case results

Each case displays its status and score:

```
PASSED Get weather for city -- Score: 1.00
FAILED Weather with invalid city -- Score: 0.65
```

### Detailed feedback

Use `--details` to see critic-level analysis:

```
Details:
  location: Match: False, Score: 0.00/0.70
    Expected: Seattle
    Actual: Seatle
  units: Match: True, Score: 0.30/0.30
```

### Multi-model results

When using multiple models, results show comparison tables:

```
Case: Get weather for city
  Model: gpt-4o -- Score: 1.00 -- PASSED
  Model: gpt-4o-mini -- Score: 0.95 -- WARNED
```

## Advanced usage

### High concurrency for fast execution

Increase concurrent evaluations:

```bash
arcade evals . --max-concurrent 20
```

High concurrency may hit API rate limits. Start with the default (1) and increase gradually.

### Save comprehensive results

Generate all formats with full details:

```bash
arcade evals . --details --include-context --output results
```

This creates:

- `results.txt`
- `results.md`
- `results.html`
- `results.json`

## Troubleshooting

### Missing dependencies

If you see `ImportError: MCP SDK is required`, install the full package:

```bash
pip install 'arcade-mcp[evals]'
```

For Anthropic support:

```bash
pip install anthropic
```

### Tool name mismatches

Tool names are normalized (dots become underscores). Check your tool definitions if you see unexpected names.

### API rate limits

Reduce the `--max-concurrent` value:

```bash
arcade evals . --max-concurrent 2
```

### No evaluation files found

Ensure your evaluation files:

- Start with `eval_`
- End with `.py`
- Contain functions decorated with `@tool_eval()`

## Next steps

- Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) for recording tool calls
- Learn about [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations) for comparing tool sources
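As a closing example, the commands covered on this page compose into a simple two-step workflow: capture a baseline first, then run scored evaluations and save reports (paths and filenames are placeholders):

```bash
# 1. Record tool calls without scoring to bootstrap expectations
arcade evals . --capture --output captures/baseline.json

# 2. Once expectations are in place, run scored evaluations and save all report formats
arcade evals . --details --output results
```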