---
title: "Comparative evaluations"
description: "Compare different tool implementations with the same test cases"
---
# Comparative evaluations
Comparative evaluations let you test how well AI models select and use tools from different, isolated tool sources. Each "track" represents a separate tool registry, allowing you to compare implementations side by side.
import { Callout, Steps } from "nextra/components";
## What are tracks?
**Tracks are isolated tool registries** within a single evaluation suite. Each track has its own set of tools that are **not shared** with other tracks. This isolation lets you test how models perform when given different tool options for the same task.
**Key concept**: Comparative evaluations test tool **selection** across different tool sets. Each track provides a different context (set of tools) to the model.
**Common use cases:**
- **Compare tool providers**: Test Google Weather vs OpenWeather API
- **Implementation comparison**: Test different MCP servers offering similar functionality
- **A/B testing**: Evaluate alternative tool designs
### When to use comparative evaluations
Use **comparative evaluations** when:
- ✅ Testing multiple implementations of the same functionality
- ✅ Comparing different tool providers
- ✅ Evaluating how models choose between different tool sets
Use **regular evaluations** when:
- ✅ Testing a single tool implementation
- ✅ Testing mixed tools from multiple sources in the same context
- ✅ Regression testing
### Testing mixed tool sources
To test how multiple MCP servers work **together** in the same context (not isolated), use a regular evaluation and load multiple sources:
```python
@tool_eval()
async def mixed_tools_eval():
suite = EvalSuite(name="Mixed Tools", system_message="You are helpful.")
# All tools available to the model in the same context
await suite.add_mcp_server("http://server1.example")
await suite.add_mcp_server("http://server2.example")
suite.add_tool_definitions([{"name": "CustomTool", ...}])
# Model can use any tool from any source
suite.add_case(...)
return suite
```
Alternatively, use an Arcade Gateway, which aggregates tools from multiple sources.
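A minimal sketch of the gateway approach, assuming `add_arcade_gateway` (shown later in this guide) can also be called without a `track` argument so that its tools land in the default context; the gateway slug is a placeholder:
```python
# Inside an @tool_eval() suite function, as in the example above
await suite.add_arcade_gateway(gateway_slug="aggregated-tools")  # placeholder slug
# The gateway's aggregated tools now share the default context with any
# other tools registered without a track.
suite.add_case(
    name="uses_aggregated_tools",
    user_message="Do something",
    expected_tool_calls=[],  # fill in once you know the gateway's tool names
)
```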
## Basic comparative evaluation
### Register tools per track
Create a suite and register tools for each track:
```python
from arcade_evals import EvalSuite, tool_eval, ExpectedMCPToolCall, BinaryCritic
@tool_eval()
async def weather_comparison():
suite = EvalSuite(
name="Weather API Comparison",
system_message="You are a weather assistant.",
)
# Track A: Weather API v1
await suite.add_mcp_server(
"http://weather-v1.example/mcp",
track="Weather v1"
)
# Track B: Weather API v2
await suite.add_mcp_server(
"http://weather-v2.example/mcp",
track="Weather v2"
)
return suite
```
### Create comparative test case
Add a test case with track-specific expectations:
```python
suite.add_comparative_case(
name="get_current_weather",
user_message="What's the weather in Seattle?",
).for_track(
"Weather v1",
expected_tool_calls=[
ExpectedMCPToolCall(
"GetWeather",
{"city": "Seattle", "type": "current"}
)
],
critics=[
BinaryCritic(critic_field="city", weight=0.7),
BinaryCritic(critic_field="type", weight=0.3),
],
).for_track(
"Weather v2",
expected_tool_calls=[
ExpectedMCPToolCall(
"Weather_GetCurrent",
{"location": "Seattle"}
)
],
critics=[
BinaryCritic(critic_field="location", weight=1.0),
],
)
```
### Run comparative evaluation
```bash
arcade evals .
```
Results show per-track scores:
```
Suite: Weather API Comparison
Case: get_current_weather
Track: Weather v1 -- Score: 1.00 -- PASSED
Track: Weather v2 -- Score: 1.00 -- PASSED
```
## Track registration
### From MCP HTTP server
```python
await suite.add_mcp_server(
url="http://localhost:8000",
headers={"Authorization": "Bearer token"},
track="Production API",
)
```
### From MCP stdio server
```python
await suite.add_mcp_stdio_server(
command=["python", "server_v2.py"],
env={"API_KEY": "secret"},
track="Version 2",
)
```
### From Arcade Gateway
```python
await suite.add_arcade_gateway(
gateway_slug="weather-gateway",
track="Arcade Gateway",
)
```
### Manual tool definitions
```python
suite.add_tool_definitions(
tools=[
{
"name": "GetWeather",
"description": "Get weather for a location",
"inputSchema": {...},
}
],
track="Custom Tools",
)
```
<Callout type="warning">
  Tools must be registered before creating comparative cases that reference their tracks.
</Callout>
## Comparative case builder
The `add_comparative_case()` method returns a builder for defining track-specific expectations.
### Basic structure
```python
suite.add_comparative_case(
name="test_case",
user_message="Do something",
).for_track(
"Track A",
expected_tool_calls=[...],
critics=[...],
).for_track(
"Track B",
expected_tool_calls=[...],
critics=[...],
)
```
### Optional parameters
Add conversation context to comparative cases:
```python
suite.add_comparative_case(
name="weather_with_context",
user_message="What about the weather there?",
system_message="You are helpful.", # Optional override
additional_messages=[
{"role": "user", "content": "I'm going to Seattle"},
],
).for_track("Weather v1", ...).for_track("Weather v2", ...)
```
**Bias-aware message design:**
Design `additional_messages` to avoid leading the model. Keep them neutral so you measure tool behavior, not prompt hints:
```python
# ✅ Good - Neutral
additional_messages=[
{"role": "user", "content": "I need weather information"},
{"role": "assistant", "content": "I can help with that. Which location?"},
]
# ❌ Avoid - Tells the model which tool to call
additional_messages=[
{"role": "user", "content": "Use the GetWeather tool for Seattle"},
]
```
Keep messages generic so the model chooses tools naturally based on what is available in the track.
### Different expectations per track
Tracks can expose different tools and schemas, so you may need different critics per track:
```python
suite.add_comparative_case(
name="search_query",
user_message="Search for Python tutorials",
).for_track(
"Google Search",
expected_tool_calls=[
ExpectedMCPToolCall("Google_Search", {"query": "Python tutorials"})
],
critics=[BinaryCritic(critic_field="query", weight=1.0)],
).for_track(
"Bing Search",
expected_tool_calls=[
ExpectedMCPToolCall("Bing_WebSearch", {"q": "Python tutorials"})
],
# Different schema, so validate the matching field for this track
critics=[BinaryCritic(critic_field="q", weight=1.0)],
)
```
## Complete example
Here's a full comparative evaluation:
```python
from arcade_evals import (
EvalSuite,
tool_eval,
ExpectedMCPToolCall,
BinaryCritic,
SimilarityCritic,
)
@tool_eval()
async def search_comparison():
"""Compare different search APIs."""
suite = EvalSuite(
name="Search API Comparison",
system_message="You are a search assistant. Use the available tools to search for information.",
)
# Register search providers (MCP servers)
await suite.add_mcp_server(
"http://google-search.example/mcp",
track="Google",
)
await suite.add_mcp_server(
"http://bing-search.example/mcp",
track="Bing",
)
# Mix with manual tool definitions
suite.add_tool_definitions(
tools=[{
"name": "DDG_Search",
"description": "Search using DuckDuckGo",
"inputSchema": {
"type": "object",
"properties": {
"query": {"type": "string"}
},
"required": ["query"]
}
}],
track="DuckDuckGo",
)
# Simple query
suite.add_comparative_case(
name="basic_search",
user_message="Search for Python tutorials",
).for_track(
"Google",
expected_tool_calls=[
ExpectedMCPToolCall("Search", {"query": "Python tutorials"})
],
critics=[BinaryCritic(critic_field="query", weight=1.0)],
).for_track(
"Bing",
expected_tool_calls=[
ExpectedMCPToolCall("WebSearch", {"q": "Python tutorials"})
],
critics=[BinaryCritic(critic_field="q", weight=1.0)],
).for_track(
"DuckDuckGo",
expected_tool_calls=[
ExpectedMCPToolCall("DDG_Search", {"query": "Python tutorials"})
],
critics=[BinaryCritic(critic_field="query", weight=1.0)],
)
# Query with filters
suite.add_comparative_case(
name="search_with_filters",
user_message="Search for Python tutorials from the last month",
).for_track(
"Google",
expected_tool_calls=[
ExpectedMCPToolCall(
"Search",
{"query": "Python tutorials", "time_range": "month"}
)
],
critics=[
SimilarityCritic(critic_field="query", weight=0.7),
BinaryCritic(critic_field="time_range", weight=0.3),
],
).for_track(
"Bing",
expected_tool_calls=[
ExpectedMCPToolCall(
"WebSearch",
{"q": "Python tutorials", "freshness": "Month"}
)
],
critics=[
SimilarityCritic(critic_field="q", weight=0.7),
BinaryCritic(critic_field="freshness", weight=0.3),
],
).for_track(
"DuckDuckGo",
expected_tool_calls=[
ExpectedMCPToolCall(
"DDG_Search",
{"query": "Python tutorials"}
)
],
critics=[
SimilarityCritic(critic_field="query", weight=1.0),
],
)
return suite
```
Run the comparison:
```bash
arcade evals . --details
```
Output shows side-by-side results:
```
Suite: Search API Comparison
Case: basic_search
Track: Google -- Score: 1.00 -- PASSED
Track: Bing -- Score: 1.00 -- PASSED
Track: DuckDuckGo -- Score: 1.00 -- PASSED
Case: search_with_filters
Track: Google -- Score: 1.00 -- PASSED
Track: Bing -- Score: 0.85 -- WARNED
Track: DuckDuckGo -- Score: 0.90 -- WARNED
```
## Result structure
Comparative results are organized by track:
```python
{
"Google": {
"model": "gpt-4o",
"suite_name": "Search API Comparison",
"track_name": "Google",
"rubric": {...},
"cases": [
{
"name": "basic_search",
"track": "Google",
"input": "Search for Python tutorials",
"expected_tool_calls": [...],
"predicted_tool_calls": [...],
"evaluation": {
"score": 1.0,
"result": "passed",
...
}
}
]
},
"Bing": {...},
"DuckDuckGo": {...}
}
```
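If you post-process results programmatically, a small helper can summarize each track. A minimal sketch, assuming `results` is a dict shaped like the structure above (how you obtain it from your eval runner is up to you):
```python
def summarize_by_track(results: dict) -> None:
    """Print pass counts and mean score per track (structure assumed as shown above)."""
    for track_name, track_result in results.items():
        cases = track_result["cases"]
        scores = [case["evaluation"]["score"] for case in cases]
        passed = sum(1 for case in cases if case["evaluation"]["result"] == "passed")
        mean = sum(scores) / len(scores) if scores else 0.0
        print(f"{track_name}: {passed}/{len(cases)} passed, mean score {mean:.2f}")
```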
## Mixing regular and comparative cases
A suite can have both regular and comparative cases:
```python
@tool_eval()
async def mixed_suite():
suite = EvalSuite(
name="Mixed Evaluation",
system_message="You are helpful.",
)
# Register default tools
await suite.add_mcp_stdio_server(["python", "server.py"])
# Regular case (uses default tools)
suite.add_case(
name="regular_test",
user_message="Do something",
expected_tool_calls=[...],
)
# Register track-specific tools
await suite.add_mcp_server("http://api-v2.example", track="v2")
# Comparative case
suite.add_comparative_case(
name="compare_versions",
user_message="Do something else",
).for_track(
"default", # Uses default tools
expected_tool_calls=[...],
).for_track(
"v2", # Uses v2 tools
expected_tool_calls=[...],
)
return suite
```
Use track name `"default"` to reference tools registered without a track.
## Capture mode with tracks
Capture tool calls from each track separately:
```bash
arcade evals . --capture --file captures/comparison --format json
```
Output includes track names:
```json
{
"captured_cases": [
{
"case_name": "get_weather",
"track_name": "Weather v1",
"tool_calls": [
{"name": "GetWeather", "args": {...}}
]
},
{
"case_name": "get_weather",
"track_name": "Weather v2",
"tool_calls": [
{"name": "Weather_GetCurrent", "args": {...}}
]
}
]
}
```
## Multi-model comparative evaluations
Combine comparative tracks with multiple models:
```bash
arcade evals . \
--use-provider openai:gpt-4o,gpt-4o-mini \
--use-provider anthropic:claude-sonnet-4-5-20250929
```
Results show:
- Per-track scores for each model
- Cross-track comparisons for each model
- Cross-model comparisons for each track
Example output:
```
Suite: Weather API Comparison
Model: gpt-4o
Case: get_weather
Track: Weather v1 -- Score: 1.00 -- PASSED
Track: Weather v2 -- Score: 1.00 -- PASSED
Model: gpt-4o-mini
Case: get_weather
Track: Weather v1 -- Score: 0.90 -- WARNED
Track: Weather v2 -- Score: 0.95 -- PASSED
Model: claude-sonnet-4-5-20250929
Case: get_weather
Track: Weather v1 -- Score: 1.00 -- PASSED
Track: Weather v2 -- Score: 0.85 -- WARNED
```
## Best practices
### Use descriptive track names
Choose clear names that indicate what's being compared:
```python
# ✅ Good
track="Weather API v1"
track="OpenWeather Production"
track="Google Weather (Staging)"
# ❌ Avoid
track="A"
track="Test1"
track="Track2"
```
### Keep test cases consistent
Use the same user message and context across tracks:
```python
suite.add_comparative_case(
name="get_weather",
user_message="What's the weather in Seattle?", # Same for all tracks
).for_track("v1", ...).for_track("v2", ...)
```
### Adjust critics for track differences
Different tools may have different parameter names or types:
```python
.for_track(
"Weather v1",
expected_tool_calls=[
ExpectedMCPToolCall("GetWeather", {"city": "Seattle"})
],
critics=[
BinaryCritic(critic_field="city", weight=1.0), # v1 uses "city"
],
).for_track(
"Weather v2",
expected_tool_calls=[
ExpectedMCPToolCall("GetWeather", {"location": "Seattle"})
],
critics=[
BinaryCritic(critic_field="location", weight=1.0), # v2 uses "location"
],
)
```
### Start with capture mode
Use capture mode to discover track-specific tool signatures:
```bash
arcade evals . --capture
```
Then create expectations based on captured calls.
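For example, a captured JSON file (in the format shown under "Capture mode with tracks") can be turned into draft expectations. A sketch under that assumption; tune the critics and weights before using them:
```python
import json
from arcade_evals import ExpectedMCPToolCall, BinaryCritic
def expectations_from_capture(path: str) -> dict:
    """Build draft per-track expectations from a capture file.
    Assumes the capture layout shown earlier: captured_cases -> track_name, tool_calls.
    """
    with open(path) as f:
        capture = json.load(f)
    drafts: dict = {}
    for case in capture["captured_cases"]:
        track = case["track_name"]
        for call in case["tool_calls"]:
            expected = ExpectedMCPToolCall(call["name"], call["args"])
            # Start with one exact-match critic per argument; adjust weights later.
            critics = [
                BinaryCritic(critic_field=field, weight=1.0 / max(len(call["args"]), 1))
                for field in call["args"]
            ]
            drafts.setdefault(track, []).append((expected, critics))
    return drafts
```
Feed each `(expected, critics)` pair into the matching `.for_track()` call for that track.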
### Test edge cases per track
Different implementations may handle edge cases differently:
```python
suite.add_comparative_case(
name="ambiguous_location",
user_message="What's the weather in Portland?", # OR or ME?
).for_track(
"Weather v1",
# v1 defaults to most populous
expected_tool_calls=[
ExpectedMCPToolCall("GetWeather", {"city": "Portland", "state": "OR"})
],
).for_track(
"Weather v2",
# v2 requires disambiguation
expected_tool_calls=[
ExpectedMCPToolCall("DisambiguateLocation", {"city": "Portland"}),
ExpectedMCPToolCall("GetWeather", {"city": "Portland", "state": "OR"}),
],
)
```
## Troubleshooting
### Track not found
**Symptom:** `ValueError: Track 'TrackName' not registered`
**Solution:** Register the track before adding comparative cases:
```python
# ✅ Correct order
await suite.add_mcp_server(url, track="TrackName")
suite.add_comparative_case(...).for_track("TrackName", ...)
# ❌ Wrong order - will fail
suite.add_comparative_case(...).for_track("TrackName", ...)
await suite.add_mcp_server(url, track="TrackName")
```
### Missing track expectations
**Symptom:** Case runs against some tracks but not others
**Explanation:** A comparative case runs only against the tracks for which you defined expectations with `.for_track()`.
**Solution:** Add expectations for all registered tracks:
```python
suite.add_comparative_case(
name="test",
user_message="...",
).for_track("Track A", ...).for_track("Track B", ...)
```
### Tool name mismatches
**Symptom:** "Tool not found" errors in specific tracks
**Solution:** Check tool names in each track:
```python
# List tools per track
print(suite.list_tool_names(track="Track A"))
print(suite.list_tool_names(track="Track B"))
```
Use the exact tool names from the output.
### Inconsistent results across tracks
**Symptom:** Same user message produces different scores across tracks
**Explanation:** This is expected: tracks expose different tools and schemas, so the model may select and call tools differently in each.
**Solution:** Adjust expectations and critics per track to account for implementation differences.
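For example, if one track's tool takes free-form text while another takes structured parameters, score the free-form field with a similarity critic and the structured fields with exact-match critics. A sketch that continues a suite built as in the earlier examples (tool and field names are illustrative):
```python
suite.add_comparative_case(
    name="fair_comparison",
    user_message="Find flights to Tokyo next week",
).for_track(
    "Structured API",
    expected_tool_calls=[
        ExpectedMCPToolCall("SearchFlights", {"destination": "Tokyo", "when": "next week"})
    ],
    critics=[
        BinaryCritic(critic_field="destination", weight=0.6),
        SimilarityCritic(critic_field="when", weight=0.4),
    ],
).for_track(
    "Free-form API",
    expected_tool_calls=[
        ExpectedMCPToolCall("Search", {"query": "flights to Tokyo next week"})
    ],
    # Free-form text rarely matches exactly, so use a similarity critic here.
    critics=[SimilarityCritic(critic_field="query", weight=1.0)],
)
```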
## Advanced patterns
### Baseline comparison
Compare new implementations against a baseline:
```python
await suite.add_mcp_server(
"http://production.example/mcp",
track="Production (Baseline)"
)
await suite.add_mcp_server(
"http://staging.example/mcp",
track="Staging (New)"
)
```
Compare each case's score on the new track against the baseline track to spot regressions before promoting changes.
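A small sketch for surfacing those deviations, assuming per-track results shaped like the "Result structure" section above and the track names registered in this example:
```python
def score_deltas(
    results: dict,
    baseline: str = "Production (Baseline)",
    candidate: str = "Staging (New)",
) -> dict:
    """Return per-case score differences (candidate minus baseline).
    Negative values mean the candidate scored worse than the baseline.
    """
    baseline_scores = {
        case["name"]: case["evaluation"]["score"]
        for case in results[baseline]["cases"]
    }
    deltas = {}
    for case in results[candidate]["cases"]:
        name = case["name"]
        if name in baseline_scores:
            deltas[name] = case["evaluation"]["score"] - baseline_scores[name]
    return deltas
```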
### Progressive feature testing
Test feature support across versions:
```python
suite.add_comparative_case(
name="advanced_filters",
user_message="Search with advanced filters",
).for_track(
"v1",
expected_tool_calls=[], # Not supported
).for_track(
"v2",
expected_tool_calls=[
ExpectedMCPToolCall("SearchWithFilters", {...})
],
)
```
### Tool catalog comparison
Compare Arcade tool catalogs:
```python
from arcade_core import ToolCatalog
from my_tools import weather_v1, weather_v2
catalog_v1 = ToolCatalog()
catalog_v1.add_tool(weather_v1, "Weather")
catalog_v2 = ToolCatalog()
catalog_v2.add_tool(weather_v2, "Weather")
suite.add_tool_catalog(catalog_v1, track="Python v1")
suite.add_tool_catalog(catalog_v2, track="Python v2")
```
## Next steps
- [Create an evaluation suite](/guides/create-tools/evaluate-tools/create-evaluation-suite) with tracks
- Use [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to discover track-specific tool calls
- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple models and tracks