Run evaluations with the Arcade CLI
The Arcade Evaluation Framework lets you run evaluations of your tool-enabled language models conveniently from the command-line interface (CLI). You can execute your evaluation suites, gather results, and analyze the performance of your models in an efficient and streamlined way.
Using the arcade evals Command
To run evaluations, use the `arcade evals` command provided by the Arcade CLI. This command searches for evaluation files in the specified directory, executes any functions decorated with `@tool_eval`, and displays the results.
Basic Usage
```bash
arcade evals <directory>
```

- `<directory>`: The directory containing your evaluation files. By default, it searches the current directory (`.`).

For example, to run evaluations in the current directory:

```bash
arcade evals
```
Evaluation File Naming Convention
The `arcade evals` command looks for Python files that start with `eval_` and end with `.py` (e.g., `eval_math_tools.py`, `eval_slack_messaging.py`). These files should contain your evaluation suites.
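A minimal evaluation file might look like the sketch below. Treat it as a sketch only: the import paths and class names (for example, `EvalSuite`, `EvalRubric`, `ExpectedToolCall`, `BinaryCritic`, `ToolCatalog`, and the `catalog.add_tool` call) are assumptions that depend on your Arcade SDK version, and the `arcade_math` toolkit with its `add` tool is purely hypothetical.

```python
# eval_math_tools.py -- illustrative sketch; verify names against your SDK version.
from arcade_evals import (  # assumed import path
    BinaryCritic,
    EvalRubric,
    EvalSuite,
    ExpectedToolCall,
    tool_eval,
)
from arcade_tdk import ToolCatalog  # assumed import path

from arcade_math.tools.arithmetic import add  # hypothetical toolkit and tool

# Register the tools the model may call during evaluation.
catalog = ToolCatalog()
catalog.add_tool(add, "Math")  # assumed signature

# Thresholds control when a case passes, warns, or fails.
rubric = EvalRubric(fail_threshold=0.8, warn_threshold=0.9)


@tool_eval()
def math_eval_suite() -> EvalSuite:
    suite = EvalSuite(
        name="Math tools evaluation",
        system_message="You are a helpful assistant with access to math tools.",
        catalog=catalog,
        rubric=rubric,
    )
    suite.add_case(
        name="Add two large numbers",
        user_message="Add 12345 and 987654321",
        expected_tool_calls=[
            ExpectedToolCall(func=add, args={"a": 12345, "b": 987654321}),
        ],
        critics=[
            BinaryCritic(critic_field="a", weight=0.5),
            BinaryCritic(critic_field="b", weight=0.5),
        ],
    )
    return suite
```

Running `arcade evals .` from the directory containing a file like this would pick it up automatically.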
Command Options
The `arcade evals` command supports several options to customize the evaluation process:
- `--details`, `-d`: Show detailed results for each evaluation case, including critic feedback.
  Example: `arcade evals --details .`
- `--models`, `-m`: Specify the models to use for evaluation. Provide a comma-separated list of model names.
  Example: `arcade evals --models gpt-4o,gpt-3.5 .`
- `--max-concurrent`, `-c`: Set the maximum number of concurrent evaluations to run in parallel.
  Example: `arcade evals --max-concurrent 4 .`
- `--provider`, `-p`: The provider of the models to use for evaluation. Uses OpenAI by default.
  Example: `arcade evals --provider anthropic .`
- `--provider-api-key`, `-k`: The model provider API key. If not provided, the CLI looks for the appropriate environment variable based on the provider (e.g., `OPENAI_API_KEY` for the `openai` provider), first in the current environment, then in the current working directory’s `.env` file.
  Example: `arcade evals --provider-api-key my-api-key .`
- `--debug`: Show debug information in the CLI.
  Example: `arcade evals --debug .`
- `--help`: Show help information and exit.
  Example: `arcade evals --help`
Example Command
Running evaluations in the `arcade_my_tools/evals` directory, showing detailed results, using the `gpt-4o` model:

```bash
arcade evals arcade_my_tools/evals --details --models gpt-4o
```
Execution Process
When you run the `arcade evals` command, the following steps occur:
- Preparation: The CLI loads the evaluation suites from the specified directory, looking for files that match the naming convention.
- Execution: The evaluation suites are executed asynchronously. Each suite’s evaluation function, decorated with `@tool_eval`, is called with the appropriate configuration, including the model and concurrency settings.
- Concurrency: Evaluations can run concurrently based on the `--max-concurrent` setting, improving efficiency (see the sketch after this list).
- Result Aggregation: Results from all evaluation cases and models are collected and aggregated.
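For intuition, the concurrency limit behaves like the following standard-library pattern. This is only an illustrative sketch with hypothetical function names, not the CLI's actual implementation:

```python
import asyncio


async def run_case(case_name: str) -> str:
    """Stand-in for evaluating a single case against a model."""
    await asyncio.sleep(0.1)  # placeholder for the real model call
    return f"{case_name}: done"


async def run_all(cases: list[str], max_concurrent: int) -> list[str]:
    # A semaphore bounds how many cases are evaluated at the same time,
    # mirroring what --max-concurrent controls.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(case: str) -> str:
        async with semaphore:
            return await run_case(case)

    return await asyncio.gather(*(bounded(c) for c in cases))


if __name__ == "__main__":
    print(asyncio.run(run_all(["case-1", "case-2", "case-3"], max_concurrent=2)))
```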
Displaying Results
After the evaluations are complete, the results are displayed in a concise and informative format, similar to testing frameworks like `pytest`. The output includes:
- Summary: Shows the total number of cases, how many passed, failed, or issued warnings.
  Example:
  ```text
  Summary -- Total: 5 -- Passed: 4 -- Failed: 1
  ```
- Detailed Case Results: For each evaluation case, the status (PASSED, FAILED, WARNED), the case name, and the score are displayed.
  Example:
  ```text
  PASSED Add two large numbers -- Score: 1.00
  FAILED Send DM with ambiguous username -- Score: 0.75
  ```
- Critic Feedback: If the `--details` flag is used, detailed feedback from each critic is provided, highlighting matches, mismatches, and scores for each evaluated field.
  Example:
  ```text
  Details:
  user_name: Match: False, Score: 0.00/0.50
    Expected: johndoe
    Actual: john_doe
  message: Match: True, Score: 0.50/0.50
  ```
Interpreting the Results
- Passed: The evaluation case met or exceeded the fail threshold specified in the rubric.
- Failed: The evaluation case did not meet the fail threshold.
- Warnings: If the score is between the fail threshold and the warn threshold, a warning is issued.
Use the detailed feedback to understand where the model’s performance can be improved, particularly focusing on mismatches identified by critics.
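As a concrete, hypothetical illustration of how a score maps to a status (the threshold values below are examples, and this mirrors the behavior described above rather than the CLI's exact internals):

```python
# Hypothetical rubric thresholds, chosen only for illustration.
FAIL_THRESHOLD = 0.8
WARN_THRESHOLD = 0.9


def classify(score: float) -> str:
    """Map a case score to the status reported in the results."""
    if score < FAIL_THRESHOLD:
        return "FAILED"
    if score < WARN_THRESHOLD:
        return "WARNED"  # passed, but sits between the fail and warn thresholds
    return "PASSED"


for score in (0.95, 0.85, 0.75):
    print(f"{score:.2f} -> {classify(score)}")  # 0.95 PASSED, 0.85 WARNED, 0.75 FAILED
```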
Customizing Evaluations
You can customize the evaluation process by adjusting:
- Rubrics: Modify fail and warn thresholds, and adjust weights to emphasize different aspects of evaluation.
- Critics: Add or modify critics in your evaluation cases to target specific arguments or behaviors (see the sketch after this list).
- Concurrency: Adjust the `--max-concurrent` option to optimize performance based on your environment.
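A hedged sketch of what adjusting a rubric and its critics might look like. The class and parameter names (`EvalRubric`, `BinaryCritic`, `SimilarityCritic`, `critic_field`, `weight`) and the import path are assumptions that may vary by SDK version, so confirm them against the evaluation framework documentation:

```python
from arcade_evals import BinaryCritic, EvalRubric, SimilarityCritic  # assumed import path

# Move the thresholds to tighten or loosen pass/warn behavior.
rubric = EvalRubric(fail_threshold=0.8, warn_threshold=0.9)

# Weight the arguments you care about most; the weights should sum to 1.0.
critics = [
    BinaryCritic(critic_field="user_name", weight=0.5),    # exact match on the username
    SimilarityCritic(critic_field="message", weight=0.5),  # fuzzy comparison of message text
]
```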
Handling Multiple Models
You can evaluate multiple models in a single run by specifying them in the `--models` option as a comma-separated list. This allows you to compare the performance of different models across the same evaluation suites.
Example:

```bash
arcade evals . --models gpt-4o,gpt-3.5
```
Considerations
- Evaluation Files: Ensure your evaluation files are correctly named and contain the evaluation suites decorated with `@tool_eval`.
- Provider API key: If you are using a different provider, you will need to set the appropriate API key in an environment variable, or use the `--provider-api-key` option.
- Catalog: Ensure your tool catalog is correctly defined and includes all the tools you want to evaluate.
- Weight distribution: Ensure your weight distribution reflects the importance of each critic and that the sum of the weights is `1.0`.
Conclusion
Running evaluations using the Arcade CLI provides a powerful and convenient way to assess the tool-calling capabilities of your language models. By leveraging the `arcade evals` command, you can efficiently execute your evaluation suites, analyze results, and iterate on your models and tools.
Integrating this evaluation process into your development workflow helps ensure that your models interact with tools as expected, enhances reliability, and builds confidence in deploying actionable language models in production environments.