---
title: "Why evaluate tools?"
description: "Learn why evaluating your tools is important"
---
import { Callout } from "nextra/components";
# Why evaluate tools?
Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
1. **Tool selection**: Does the model choose the right tools for the task?
2. **Parameter accuracy**: Does the model provide correct arguments?
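For instance, a single test case pairs a user prompt with the tool call you expect the model to make; the tool name and arguments below are hypothetical:
```python
# Hypothetical test case: the prompt, the tool the model should select,
# and the arguments it should supply.
expected_case = {
    "prompt": "What's the weather in Paris tomorrow?",
    "expected_tool": "Weather.GetForecast",  # tool selection
    "expected_args": {                       # parameter accuracy
        "city": "Paris",
        "days": 1,
    },
}
```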
Arcade's evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications. You can evaluate tools from MCP servers, Arcade Gateways, or custom implementations.
## What can go wrong?
Without proper evaluation, AI models might:
- **Misinterpret user intent**, selecting the wrong tools
- **Provide incorrect arguments**, causing failures or unexpected behavior
- **Skip necessary tool calls**, missing steps in multi-step tasks
- **Make incorrect assumptions** about parameter defaults or formats
## How evaluation works
Evaluations compare the model's actual tool calls with expected tool calls for each test case.
### Scoring components
1. **Tool selection**: Did the model choose the correct tool?
2. **Parameter evaluation**: Are the arguments correct? (evaluated by critics)
3. **Weighted scoring**: Each aspect has a weight that affects the final score
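As a rough illustration of this model, the sketch below scores a single tool call in plain Python rather than with Arcade's API; the fields, critic functions, and weights are made up for the example:
```python
# Illustrative scoring (not Arcade's API): one check for tool selection
# plus a weighted critic per argument.
expected = {"tool": "Weather.GetForecast", "args": {"city": "Paris", "days": 1}}
actual = {"tool": "Weather.GetForecast", "args": {"city": "paris", "days": 1}}

# (field, weight, critic) -- each critic returns a score in [0.0, 1.0]
critics = [
    ("city", 0.5, lambda e, a: 1.0 if e.lower() == a.lower() else 0.0),
    ("days", 0.5, lambda e, a: 1.0 if e == a else 0.0),
]

tool_weight = 1.0
tool_score = tool_weight if expected["tool"] == actual["tool"] else 0.0

arg_score = sum(
    weight * critic(expected["args"][field], actual["args"][field])
    for field, weight, critic in critics
)

# Normalize by the total weight so the final score lands in [0.0, 1.0].
total_weight = tool_weight + sum(weight for _, weight, _ in critics)
score = (tool_score + arg_score) / total_weight
print(f"Score: {score:.2f}")  # 1.00 -- the case-insensitive critic forgives "paris"
```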
### Evaluation results
Each test case receives:
- **Score**: Calculated from weighted critic scores and normalized by the total weight, so it always falls between 0.0 and 1.0 (individual weights can be any positive value)
- **Status**:
  - **Passed**: Score meets or exceeds the warn threshold (default: 0.9)
  - **Warned**: Score is at or above the fail threshold (default: 0.8) but below the warn threshold
  - **Failed**: Score falls below the fail threshold
Example output:
```
PASSED Get weather for city -- Score: 1.00
WARNED Send message with typo -- Score: 0.85
FAILED Wrong tool selected -- Score: 0.50
```
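A minimal sketch of how a normalized score maps to a status, assuming the default thresholds above:
```python
# Map a normalized score to a status using the default thresholds.
def status(score: float, fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    if score >= warn_threshold:
        return "PASSED"
    if score >= fail_threshold:
        return "WARNED"
    return "FAILED"

for s in (1.00, 0.85, 0.50):
    print(status(s), f"-- Score: {s:.2f}")
# PASSED -- Score: 1.00
# WARNED -- Score: 0.85
# FAILED -- Score: 0.50
```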
## Next steps
- [Create an evaluation suite](/guides/create-tools/evaluate-tools/create-evaluation-suite) to start testing your tools
- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple providers
- Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to bootstrap test expectations
- Compare tool sources with [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations)
## Advanced features
Once you're comfortable with basic evaluations, explore these advanced capabilities:
### Capture mode
Record tool calls without scoring to discover what models actually call. Useful for bootstrapping test expectations and debugging. [Learn more →](/guides/create-tools/evaluate-tools/capture-mode)
### Comparative evaluations
Test the same cases against different tool sources (tracks) with isolated registries. Compare how models perform with different tool implementations. [Learn more →](/guides/create-tools/evaluate-tools/comparative-evaluations)
### Output formats
Save results in multiple formats (txt, md, html, json) for reporting and analysis. Give the output file an extension to write that single format, or omit the extension to write all formats. [Learn more →](/guides/create-tools/evaluate-tools/run-evaluations#output-formats)