---
title: "Why evaluate tools?"
description: "Learn why evaluating your tools is important"
---

import { Callout } from "nextra/components";

# Why evaluate tools?
Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:

1. **Tool selection**: Does the model choose the right tools for the task?
2. **Parameter accuracy**: Does the model provide correct arguments?

Arcade's evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications. You can evaluate tools from MCP servers, Arcade Gateways, or custom implementations.
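To make those two aspects concrete, here is a minimal sketch in plain Python of what a single test case compares: the tool call the model actually made against the one you expected. This is not Arcade's evaluation API; the prompt, tool name, and arguments are made up for illustration.

```python
# Illustrative only -- plain Python, not Arcade's evaluation API.
# A test case pairs a user prompt with the tool call you expect the
# model to make, including the exact arguments.
expected = {
    "prompt": "What's the weather in Paris tomorrow?",
    "tool": "get_weather",                 # tool selection
    "args": {"city": "Paris", "days": 1},  # parameter accuracy
}

# What the model actually called during the evaluation run.
actual = {
    "tool": "get_weather",
    "args": {"city": "Paris", "days": 3},
}

tool_ok = actual["tool"] == expected["tool"]  # right tool?
args_ok = actual["args"] == expected["args"]  # right arguments?
print(tool_ok, args_ok)  # True False -> correct tool, wrong "days" value
```

In a real suite, this comparison is done per argument by critics, which can award partial credit, as described below.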
## What can go wrong?

Without proper evaluation, AI models might:

- **Misinterpret user intent**, selecting the wrong tools
- **Provide incorrect arguments**, causing failures or unexpected behavior
- **Skip necessary tool calls**, missing steps in multi-step tasks
- **Make incorrect assumptions** about parameter defaults or formats

## How evaluation works

Evaluations compare the model's actual tool calls with the expected tool calls for each test case.

### Scoring components

1. **Tool selection**: Did the model choose the correct tool?
2. **Parameter evaluation**: Are the arguments correct? (evaluated by critics)
3. **Weighted scoring**: Each aspect carries a weight that affects the final score

### Evaluation results

Each test case receives:

- **Score**: Calculated from weighted critic scores, normalized proportionally (weights can be any positive value)
- **Status**:
  - **Passed**: Score meets or exceeds the warn threshold (default: 0.9)
  - **Warned**: Score meets the fail threshold (default: 0.8) but falls below the warn threshold
  - **Failed**: Score falls below the fail threshold

For a concrete illustration of how weights and thresholds combine, see the scoring sketch at the end of this page.

Example output:

```
PASSED Get weather for city -- Score: 1.00
WARNED Send message with typo -- Score: 0.85
FAILED Wrong tool selected -- Score: 0.50
```

## Next steps

- [Create an evaluation suite](/guides/create-tools/evaluate-tools/create-evaluation-suite) to start testing your tools
- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple providers
- Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to bootstrap test expectations
- Compare tool sources with [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations)

## Advanced features

Once you're comfortable with basic evaluations, explore these advanced capabilities:

### Capture mode

Record tool calls without scoring to discover what models actually call. Useful for bootstrapping test expectations and debugging.

[Learn more →](/guides/create-tools/evaluate-tools/capture-mode)

### Comparative evaluations

Test the same cases against different tool sources (tracks) with isolated registries. Compare how models perform with different tool implementations.

[Learn more →](/guides/create-tools/evaluate-tools/comparative-evaluations)

### Output formats

Save results in multiple formats (txt, md, html, json) for reporting and analysis. Specify output files with extensions or use no extension for all formats.

[Learn more →](/guides/create-tools/evaluate-tools/run-evaluations#output-formats)
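## Scoring sketch

To illustrate how weighted critic scores and the fail/warn thresholds combine into a final status, here is a minimal, framework-agnostic sketch. The class and function names are assumptions made for this example, not Arcade's API; the thresholds mirror the defaults described above (0.8 and 0.9).

```python
from dataclasses import dataclass

# Hypothetical illustration of weighted scoring and thresholds.
# These names are NOT Arcade's API; they only mirror the concepts above.

FAIL_THRESHOLD = 0.8  # default fail threshold
WARN_THRESHOLD = 0.9  # default warn threshold


@dataclass
class CriticResult:
    weight: float  # any positive value
    score: float   # 0.0 to 1.0, e.g. 1.0 if an argument matched exactly


def final_score(results: list[CriticResult]) -> float:
    """Normalize weighted critic scores so the weights sum to 1."""
    total_weight = sum(r.weight for r in results)
    return sum(r.weight * r.score for r in results) / total_weight


def status(score: float) -> str:
    """Map a normalized score onto the pass / warn / fail statuses."""
    if score < FAIL_THRESHOLD:
        return "FAILED"
    if score < WARN_THRESHOLD:
        return "WARNED"
    return "PASSED"


# Example: tool selection matched (weight 1.0), one argument matched
# exactly (weight 0.5), one argument earned only partial credit (weight 0.5).
results = [
    CriticResult(weight=1.0, score=1.0),
    CriticResult(weight=0.5, score=1.0),
    CriticResult(weight=0.5, score=0.4),
]
score = final_score(results)
print(f"{status(score)} -- Score: {score:.2f}")
```

Running this prints `WARNED -- Score: 0.85`, matching the middle line of the example output above: the model chose the right tool and got one argument exactly right, but a weakly matching argument pulled the weighted score below the warn threshold.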