Cline Evaluation System

This directory contains the evaluation system for benchmarking Cline against various coding evaluation frameworks.

Overview

The Cline Evaluation System allows you to:

  1. Run Cline against standardized coding benchmarks
  2. Collect comprehensive metrics on performance
  3. Generate detailed reports on evaluation results
  4. Compare performance across different models and benchmarks

Architecture

The evaluation system consists of two main components:

  1. CLI Tool: Command-line interface in evals/cli/ for orchestrating evaluations
  2. Diff Edit Benchmark: A separate command in the CLI tool that runs a comprehensive diff editing benchmark suite on real-world cases, along with a Streamlit dashboard that displays the results. For more details, see the Diff Edit Benchmark README. Before running it, create an evals/diff-edits/cases folder containing all the conversation JSON files.

Directory Structure

evals/                            # Main directory for evaluation system
├── cli/                          # CLI tool for orchestrating evaluations
│   └── src/
│       ├── index.ts              # CLI entry point
│       ├── commands/             # CLI commands (setup, run, report)
│       ├── adapters/             # Benchmark adapters
│       ├── db/                   # Database management
│       └── utils/                # Utility functions
├── diff-edits/                   # Diff editing evaluation suite
│   ├── cases/                    # Test case JSON files
│   ├── results/                  # Evaluation results
│   ├── diff-apply/               # Diff application logic
│   ├── parsing/                  # Assistant message parsing
│   └── prompts/                  # System prompts
├── repositories/                 # Cloned benchmark repositories
│   └── exercism/                 # Exercism (Aider Polyglot)
├── results/                      # Evaluation results storage
│   ├── runs/                     # Individual run results
│   └── reports/                  # Generated reports
└── README.md                     # This file

Getting Started

Prerequisites

Installation

  1. Build the CLI tool:
cd evals
npm install
npm run build:cli

Usage

Setting Up Benchmarks

cd evals/cli
node dist/index.js setup

This clones and sets up all benchmark repositories. To set up only selected benchmarks, pass the --benchmarks flag:

node dist/index.js setup --benchmarks exercism

Running Evaluations

node dist/index.js run --benchmark exercism --count 10

Options:

Note: Model selection is currently configured through the Cline CLI itself, not through evaluation flags.

Generating Reports

node dist/index.js report

Options:

Benchmarks

Exercism

Modified Exercism exercises from the polyglot-benchmark repository. These are small, focused programming exercises in various languages.

SWE-Bench (Coming Soon)

Real-world software engineering tasks from the SWE-bench repository.

SWELancer (Coming Soon)

Freelance-style programming tasks from the SWELancer benchmark.

Multi-SWE-Bench (Coming Soon)

Multi-file software engineering tasks from the Multi-SWE-Bench repository.

Diff Edit Evaluations

The Cline Evaluation System includes a specialized suite for evaluating how well models can make precise edits to files using the replace_in_file tool.

Overview

Diff edit evaluations test a model’s ability to:

  1. Understand file content and identify specific sections to modify
  2. Generate correct SEARCH/REPLACE blocks for targeted edits
  3. Successfully apply changes without introducing errors
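To make the third point concrete, here is a minimal sketch of applying one already-parsed SEARCH/REPLACE pair to a file. It is not the logic in diff-apply/; the function name applySearchReplace, the EditResult type, and the rule that the SEARCH text must match exactly once are all assumptions made for illustration.

```typescript
// Hypothetical helper, not the real diff-apply implementation.
// Result of applying a single SEARCH/REPLACE pair.
type EditResult =
  | { ok: true; content: string }
  | { ok: false; error: string };

function applySearchReplace(content: string, search: string, replace: string): EditResult {
  if (search.length === 0) {
    return { ok: false, error: "SEARCH text is empty" };
  }
  const first = content.indexOf(search);
  if (first === -1) {
    return { ok: false, error: "SEARCH text not found in file" };
  }
  // Assumption for this sketch: an ambiguous match is treated as a failure.
  if (content.indexOf(search, first + 1) !== -1) {
    return { ok: false, error: "SEARCH text matches more than once" };
  }
  const edited = content.slice(0, first) + replace + content.slice(first + search.length);
  return { ok: true, content: edited };
}

const demo = applySearchReplace(
  "function example() {\n  return 1\n}\n",
  "  return 1\n",
  "  return 2\n",
);
console.log(demo.ok ? demo.content : demo.error);
```

Returning a result object instead of throwing lets a runner count a failed edit as a data point rather than a crash.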

Directory Structure

diff-edits/
├── cases/                  # Test case JSON files
├── results/                # Evaluation results
├── ClineWrapper.ts         # Wrapper for model interaction
├── TestRunner.ts           # Main test execution logic
├── types.ts                # Type definitions
├── diff-apply/             # Diff application logic
├── parsing/                # Assistant message parsing
└── prompts/                # System prompts

Creating Test Cases

Test cases are defined as JSON files in the diff-edits/cases/ directory. Each test case should include:

{
  "test_id": "example_test_1",
  "messages": [
    {
      "role": "user",
      "text": "Please fix the bug in this code...",
      "images": []
    },
    {
      "role": "assistant",
      "text": "I'll help you fix that bug..."
    }
  ],
  "file_contents": "// Original file content here\nfunction example() {\n  // Code with bug\n}",
  "file_path": "src/example.js",
  "system_prompt_details": {
    "mcp_string": "",
    "cwd_value": "/path/to/working/directory",
    "browser_use": false,
    "width": 900,
    "height": 600,
    "os_value": "macOS",
    "shell_value": "/bin/zsh",
    "home_value": "/Users/username",
    "user_custom_instructions": ""
  },
  "original_diff_edit_tool_call_message": ""
}
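As a rough aid when authoring cases, the fields in the example above can be sketched as TypeScript types with a small structural check. The names DiffEditTestCase, TestCaseMessage, and validateTestCase are hypothetical and inferred only from the JSON shown here; the definitions in types.ts may differ, and treating every field as required is an assumption.

```typescript
// Hypothetical types inferred from the example JSON; not copied from types.ts.
interface TestCaseMessage {
  role: "user" | "assistant";
  text: string;
  images?: string[];
}

interface DiffEditTestCase {
  test_id: string;
  messages: TestCaseMessage[];
  file_contents: string;
  file_path: string;
  system_prompt_details: Record<string, unknown>;
  original_diff_edit_tool_call_message: string;
}

// Returns a list of human-readable problems; an empty list means the shape looks right.
function validateTestCase(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["test case must be a JSON object"];
  }
  const obj = raw as Record<string, unknown>;
  const problems: string[] = [];
  const stringKeys = ["test_id", "file_contents", "file_path", "original_diff_edit_tool_call_message"];
  for (const key of stringKeys) {
    if (typeof obj[key] !== "string") problems.push(`${key} must be a string`);
  }
  const messages = obj["messages"];
  if (!Array.isArray(messages) || messages.length === 0) {
    problems.push("messages must be a non-empty array");
  }
  const details = obj["system_prompt_details"];
  if (typeof details !== "object" || details === null) {
    problems.push("system_prompt_details must be an object");
  }
  return problems;
}

// Type guard built on the check above.
function isDiffEditTestCase(raw: unknown): raw is DiffEditTestCase {
  return validateTestCase(raw).length === 0;
}

console.log(validateTestCase(JSON.parse('{"test_id": "example_test_1"}')));
```

Running a check like this over every file in diff-edits/cases/ before an evaluation would surface malformed cases early.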

Running Diff Edit Evaluations

Single Model Evaluation

cd evals/cli
node dist/index.js run-diff-eval --model-ids "anthropic/claude-3-5-sonnet-20241022"

Multi-Model Evaluation

Compare multiple models in a single evaluation run:

# Compare Claude and Grok models
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
  --max-cases 10 \
  --valid-attempts-per-case 3 \
  --verbose

# Compare multiple Claude variants
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,anthropic/claude-3-5-haiku-20241022,anthropic/claude-3-opus-20240229" \
  --max-cases 5 \
  --valid-attempts-per-case 2 \
  --parallel
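To estimate how large a multi-model run is before launching it, the flags above can be combined with simple arithmetic. The sketch below is not part of the CLI; planDiffEval and EvalPlan are hypothetical names, and it assumes at least --max-cases cases exist and that each case needs exactly --valid-attempts-per-case valid attempts per model. How the runner treats invalid attempts is not described here, so the real number of model calls may be higher.

```typescript
// Hypothetical planning helper; it only mirrors the flag names used above.
interface EvalPlan {
  models: string[];
  targetValidAttempts: number;
}

function planDiffEval(modelIds: string, maxCases: number, validAttemptsPerCase: number): EvalPlan {
  // --model-ids is a comma-separated list; ignore stray whitespace and empty entries.
  const models = modelIds
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  return {
    models,
    targetValidAttempts: models.length * maxCases * validAttemptsPerCase,
  };
}

// First example above: 2 models x 10 cases x 3 attempts = 60 valid attempts.
const plan = planDiffEval("anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta", 10, 3);
console.log(plan.models.length, plan.targetValidAttempts);
```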

Options

Examples

# Quick test with 2 models, 4 cases, 2 attempts each
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
  --max-cases 4 \
  --valid-attempts-per-case 2 \
  --verbose

# Comprehensive evaluation with parallel execution
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,anthropic/claude-3-5-haiku-20241022" \
  --system-prompt-name claude4SystemPrompt \
  --valid-attempts-per-case 5 \
  --max-cases 20 \
  --parallel \
  --verbose

Database Storage & Analytics

All evaluation results are automatically stored in a SQLite database (diff-edits/evals.db) for advanced analytics and comparison. The database includes:

Interactive Dashboard

Launch the Streamlit dashboard to visualize and analyze evaluation results:

cd diff-edits/dashboard
streamlit run app.py

The dashboard provides:

Dashboard Features

  1. Hero Section: Overview of current run with key metrics
  2. Model Cards: Performance cards with grades and detailed metrics
  3. Comparison Charts: Interactive Plotly charts for visual analysis
  4. Result Explorer: Detailed view of individual test results including:
    • Original and edited file content
    • Raw model output
    • Parsed tool calls
    • Timing and cost metrics
    • Error analysis

Quick Start Dashboard

# Run a quick evaluation (from the evals/ directory)
node cli/dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
  --max-cases 4 \
  --valid-attempts-per-case 2 \
  --verbose

# Launch dashboard to view results
cd diff-edits/dashboard && streamlit run app.py

Legacy Results

For backward compatibility, results are also saved as JSON files in the diff-edits/results/ directory. The JSON results include:

Metrics

The evaluation system collects the following metrics:

Reports

Reports are generated in Markdown or JSON format and include:

Development

Adding a New Benchmark

  1. Create a new adapter in evals/cli/src/adapters/
  2. Implement the BenchmarkAdapter interface
  3. Register the adapter in evals/cli/src/adapters/index.ts
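The three steps above follow a common adapter-plus-registry pattern, sketched below. The actual BenchmarkAdapter interface is defined in evals/cli/src/adapters/ and is not reproduced in this README, so everything here (BenchmarkAdapterSketch, its methods, registerAdapter, getAdapter, and the toy adapter) is a hypothetical stand-in rather than the real API.

```typescript
// Hypothetical shapes for illustration; check the real interface before implementing.
interface TaskSketch {
  id: string;
  description: string;
}

interface BenchmarkAdapterSketch {
  name: string;
  setup(): Promise<void>;
  listTasks(): Promise<TaskSketch[]>;
}

// Registry keyed by benchmark name, so a CLI flag value can be looked up directly.
const adapters = new Map<string, BenchmarkAdapterSketch>();

function registerAdapter(adapter: BenchmarkAdapterSketch): void {
  if (adapters.has(adapter.name)) {
    throw new Error(`Adapter already registered: ${adapter.name}`);
  }
  adapters.set(adapter.name, adapter);
}

function getAdapter(name: string): BenchmarkAdapterSketch {
  const adapter = adapters.get(name);
  if (!adapter) {
    throw new Error(`Unknown benchmark: ${name}`);
  }
  return adapter;
}

// A toy adapter with a single in-memory task.
const toyAdapter: BenchmarkAdapterSketch = {
  name: "toy",
  async setup() {
    // Nothing to clone or install for the toy benchmark.
  },
  async listTasks() {
    return [{ id: "toy-1", description: "Reverse a string" }];
  },
};

registerAdapter(toyAdapter);
getAdapter("toy")
  .listTasks()
  .then((tasks) => console.log(tasks.map((t) => t.id)));
```

Failing loudly on duplicate or unknown names keeps a typo in a benchmark name from silently running nothing.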

Extending Metrics

To add new metrics:

  1. Update the database schema in evals/cli/src/db/schema.ts
  2. Add collection logic in evals/cli/src/utils/results.ts
  3. Update report generation in evals/cli/src/commands/report.ts