--- title: "Capture mode" description: "Record tool calls without scoring to bootstrap test expectations" --- import { Callout, Steps } from "nextra/components"; # Capture mode Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior. **Backward compatibility**: Capture mode works with existing evaluation suites. Simply add the `--capture` flag to any `arcade evals` command. No code changes needed. ## When to use capture mode **Bootstrapping test expectations**: When you don't know what tool calls to expect, run capture mode to see what the model actually calls. **Debugging model behavior**: When evaluations fail unexpectedly, capture mode shows exactly what the model is doing. **Exploring new tools**: When adding new tools, capture mode helps you understand how models interpret them. **Documenting tool usage**: Create examples of how models use your tools in different scenarios. ### Typical workflow ``` 1. Create suite with empty expected_tool_calls ↓ 2. Run: arcade evals . --capture --format json ↓ 3. Review captured tool calls in output file ↓ 4. Copy tool calls into expected_tool_calls ↓ 5. Add critics for validation ↓ 6. Run: arcade evals . --details ``` ## Basic usage ### Create an evaluation suite without expectations Create a suite with test cases but empty `expected_tool_calls`: ```python from arcade_evals import EvalSuite, tool_eval @tool_eval() async def capture_weather_suite(): suite = EvalSuite( name="Weather Capture", system_message="You are a weather assistant.", ) await suite.add_mcp_stdio_server(["python", "weather_server.py"]) # Add cases without expected tool calls suite.add_case( name="Simple weather query", user_message="What's the weather in Seattle?", expected_tool_calls=[], # Empty for capture ) suite.add_case( name="Multi-city comparison", user_message="Compare the weather in Seattle and Portland", expected_tool_calls=[], ) return suite ``` ### Run in capture mode Run evaluations with the `--capture` flag: ```bash arcade evals . --capture --file captures/weather --format json ``` This creates `captures/weather.json` with all tool calls. ### Review captured output Open the JSON file to see what the model called: ```json { "suite_name": "Weather Capture", "model": "gpt-4o", "provider": "openai", "captured_cases": [ { "case_name": "Simple weather query", "user_message": "What's the weather in Seattle?", "tool_calls": [ { "name": "Weather_GetCurrent", "args": { "location": "Seattle", "units": "fahrenheit" } } ] } ] } ``` ### Convert to test expectations Copy the captured calls into your evaluation suite: ```python from arcade_evals import ExpectedMCPToolCall, BinaryCritic suite.add_case( name="Simple weather query", user_message="What's the weather in Seattle?", expected_tool_calls=[ ExpectedMCPToolCall( "Weather_GetCurrent", {"location": "Seattle", "units": "fahrenheit"} ) ], critics=[ BinaryCritic(critic_field="location", weight=0.7), BinaryCritic(critic_field="units", weight=0.3), ], ) ``` ## CLI options ### Basic capture Record tool calls to JSON: ```bash arcade evals . --capture --file captures/baseline --format json ``` ### Include conversation context Capture system messages and conversation history: ```bash arcade evals . 
### Convert to test expectations

Copy the captured calls into your evaluation suite:

```python
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
    name="Simple weather query",
    user_message="What's the weather in Seattle?",
    expected_tool_calls=[
        ExpectedMCPToolCall(
            "Weather_GetCurrent",
            {"location": "Seattle", "units": "fahrenheit"},
        )
    ],
    critics=[
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)
```

## CLI options

### Basic capture

Record tool calls to JSON:

```bash
arcade evals . --capture --file captures/baseline --format json
```

### Include conversation context

Capture system messages and conversation history:

```bash
arcade evals . \
  --capture --include-context --file captures/detailed --format json
```

Output includes:

```json
{
  "case_name": "Weather with context",
  "user_message": "What about the weather there?",
  "system_message": "You are a weather assistant.",
  "additional_messages": [
    {"role": "user", "content": "I'm traveling to Tokyo"},
    {"role": "assistant", "content": "Tokyo is a great city!"}
  ],
  "tool_calls": [...]
}
```

### Multiple formats

Save captures in multiple formats:

```bash
arcade evals . --capture --file captures/out --format json,md
```

Markdown format is more readable for quick review:

```markdown
## Weather Capture

### Model: gpt-4o

#### Case: Simple weather query

**Input:** What's the weather in Seattle?

**Tool Calls:**
- `Weather_GetCurrent`
  - location: Seattle
  - units: fahrenheit
```

### Multiple providers

Capture from multiple providers to compare behavior:

```bash
arcade evals . --capture \
  --use-provider openai:gpt-4o \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/comparison --format json
```

## Capture with comparative tracks

Capture from multiple tool sources to see how different implementations behave:

```python
@tool_eval()
async def capture_comparative():
    suite = EvalSuite(
        name="Weather Comparison",
        system_message="You are a weather assistant.",
    )

    # Register different tool sources
    await suite.add_mcp_server(
        "http://weather-api-1.example/mcp",
        track="Weather API v1",
    )
    await suite.add_mcp_server(
        "http://weather-api-2.example/mcp",
        track="Weather API v2",
    )

    # Capture will run against each track
    suite.add_case(
        name="get_weather",
        user_message="What's the weather in Seattle?",
        expected_tool_calls=[],
    )

    return suite
```

Run capture:

```bash
arcade evals . --capture --file captures/apis --format json
```

Output shows captures per track:

```json
{
  "captured_cases": [
    {
      "case_name": "get_weather",
      "track_name": "Weather API v1",
      "tool_calls": [
        {"name": "GetCurrentWeather", "args": {...}}
      ]
    },
    {
      "case_name": "get_weather",
      "track_name": "Weather API v2",
      "tool_calls": [
        {"name": "Weather_Current", "args": {...}}
      ]
    }
  ]
}
```

## Best practices

### Start with broad queries

Begin with open-ended prompts to see natural model behavior:

```python
suite.add_case(
    name="explore_weather_tools",
    user_message="Show me everything you can do with weather",
    expected_tool_calls=[],
)
```

### Capture edge cases

Record model behavior on unusual inputs:

```python
suite.add_case(
    name="ambiguous_location",
    user_message="What's the weather in Portland?",  # OR or ME?
    expected_tool_calls=[],
)
```

### Include context variations

Capture with different conversation contexts:

```python
suite.add_case(
    name="weather_from_context",
    user_message="How about the weather there?",
    additional_messages=[
        {"role": "user", "content": "I'm going to Seattle"},
    ],
    expected_tool_calls=[],
)
```

### Capture multiple providers

Compare how different models interpret your tools:

```bash
arcade evals . \
  --capture \
  --use-provider openai:gpt-4o,gpt-4o-mini \
  --use-provider anthropic:claude-sonnet-4-5-20250929 \
  --file captures/models --format json,md
```

## Converting captures to tests

### Step 1: Identify patterns

Review captured tool calls to find patterns:

```json
// Most queries use "fahrenheit"
{"location": "Seattle", "units": "fahrenheit"}
{"location": "Portland", "units": "fahrenheit"}

// Some use "celsius"
{"location": "Tokyo", "units": "celsius"}
```

### Step 2: Create base expectations

Create expected tool calls based on patterns:

```python
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})

# Use celsius for international cities
ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})
```

### Step 3: Add critics

Add critics to validate parameters, as in the sketch below. See [Critics](/guides/create-tools/evaluate-tools/create-evaluation-suite#critics) for the available options.
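As an illustration, the Tokyo expectation from Step 2 could become a full case with one critic per captured parameter. This is a sketch only: the case name, user message, and weights are assumptions, and `BinaryCritic` (shown earlier) performs an exact match on each field.

```python
from arcade_evals import BinaryCritic, ExpectedMCPToolCall

suite.add_case(
    name="Tokyo weather",  # illustrative case name
    user_message="What's the weather in Tokyo?",
    expected_tool_calls=[
        ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})
    ],
    critics=[
        # One exact-match critic per captured parameter; weights are illustrative.
        BinaryCritic(critic_field="location", weight=0.7),
        BinaryCritic(critic_field="units", weight=0.3),
    ],
)
```

Weight the parameters that matter most for correctness higher, mirroring the location/units split used earlier in this page.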
### Step 4: Run evaluations

Test with real evaluations:

```bash
arcade evals . --details
```

### Step 5: Iterate

Use failures to refine:

- Adjust expected values
- Change critic weights
- Modify tool descriptions
- Add more test cases

## Troubleshooting

### No tool calls captured

**Symptom:** Empty `tool_calls` arrays

**Possible causes:**

1. Model didn't call any tools
2. Tools not properly registered
3. System message doesn't encourage tool use

**Solution:**

```python
suite = EvalSuite(
    name="Weather",
    system_message="You are a weather assistant. Use the available weather tools to answer questions.",
)
```

### Missing parameters

**Symptom:** Some parameters are missing from captured calls

**Explanation:** Models may omit optional parameters.

**Solution:** Check if parameters have defaults in your schema. The evaluation framework applies defaults automatically.

### Different results per provider

**Symptom:** OpenAI and Anthropic capture different tool calls

**Explanation:** Providers interpret tool descriptions differently.

**Solution:** This is expected. Use captures to understand provider-specific behavior, then create provider-agnostic tests.

## Next steps

- Learn about [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations) to compare tool sources
- [Create evaluation suites](/guides/create-tools/evaluate-tools/create-evaluation-suite) with expectations