Cline Evaluation System

This directory contains the evaluation system for benchmarking Cline against various coding evaluation frameworks.

Overview

The Cline Evaluation System allows you to:

Run Cline against standardized coding benchmarks
Collect comprehensive metrics on performance
Generate detailed reports on evaluation results
Compare performance across different models and benchmarks

Architecture

The evaluation system consists of two main components:

CLI Tool: Command-line interface in evals/cli/ for orchestrating evaluations
Diff Edit Benchmark: A separate command, run through the CLI tool, that executes a comprehensive diff-editing benchmark suite on real-world cases and comes with a Streamlit dashboard for viewing the results. For more details, see the Diff Edit Benchmark README. Before running it, add an evals/diff-edits/cases folder containing all the conversation JSON files.

Directory Structure

evals/                            # Main directory for evaluation system
├── cli/                          # CLI tool for orchestrating evaluations
│   └── src/
│       ├── index.ts              # CLI entry point
│       ├── commands/             # CLI commands (setup, run, report)
│       ├── adapters/             # Benchmark adapters
│       ├── db/                   # Database management
│       └── utils/                # Utility functions
├── diff-edits/                   # Diff editing evaluation suite
│   ├── cases/                    # Test case JSON files
│   ├── results/                  # Evaluation results
│   ├── diff-apply/               # Diff application logic
│   ├── parsing/                  # Assistant message parsing
│   └── prompts/                  # System prompts
├── repositories/                 # Cloned benchmark repositories
│   └── exercism/                 # Exercism (Aider Polyglot)
├── results/                      # Evaluation results storage
│   ├── runs/                     # Individual run results
│   └── reports/                  # Generated reports
└── README.md                     # This file

Getting Started

Prerequisites

Node.js 16+
VSCode with Cline extension installed
Git

Installation

Build the CLI tool:

cd evals
npm install
npm run build:cli

Usage

Setting Up Benchmarks

cd evals/cli
node dist/index.js setup

This clones and sets up all benchmark repositories. To set up only selected benchmarks, pass --benchmarks:

node dist/index.js setup --benchmarks exercism

Running Evaluations

node dist/index.js run --benchmark exercism --count 10

Options:

--benchmark: Specific benchmark to run (default: exercism)
--count: Number of tasks to run (default: all available tasks)

Note: Model selection is currently configured through the Cline CLI itself, not through evaluation flags.

Generating Reports

node dist/index.js report

Options:

--format: Report format, either json or markdown (default: markdown)
--output: Output path for the report

Benchmarks

Exercism

Modified Exercism exercises from the polyglot-benchmark repository. These are small, focused programming exercises in various languages.

SWE-Bench (Coming Soon)

Real-world software engineering tasks from the SWE-bench repository.

SWELancer (Coming Soon)

Freelance-style programming tasks from the SWELancer benchmark.

Multi-SWE-Bench (Coming Soon)

Multilingual software engineering tasks from the Multi-SWE-Bench repository.

Diff Edit Evaluations

The Cline Evaluation System includes a specialized suite for evaluating how well models can make precise edits to files using the replace_in_file tool.

Overview

Diff edit evaluations test a model’s ability to:

Understand file content and identify specific sections to modify
Generate correct SEARCH/REPLACE blocks for targeted edits
Successfully apply changes without introducing errors
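
To make the three abilities above concrete, here is a minimal sketch of applying SEARCH/REPLACE edits to file content. It is not the project's constructNewFileContentV2: the SearchReplaceEdit type and the applyEdits helper are hypothetical names, and the rule that a SEARCH text must match exactly once is an assumption chosen to keep the example strict.

```typescript
// Hypothetical helper for illustration; not the project's diff-apply logic.
interface SearchReplaceEdit {
  search: string;  // text that must already exist in the file
  replace: string; // text that takes its place
}

// Applies each edit in order. Throws when a SEARCH text is missing or
// matches more than once (the uniqueness rule is an assumption).
function applyEdits(original: string, edits: SearchReplaceEdit[]): string {
  let content = original;
  for (const { search, replace } of edits) {
    const first = content.indexOf(search);
    if (first === -1) {
      throw new Error(`SEARCH text not found: ${JSON.stringify(search)}`);
    }
    if (content.indexOf(search, first + 1) !== -1) {
      throw new Error(`SEARCH text is ambiguous: ${JSON.stringify(search)}`);
    }
    content = content.slice(0, first) + replace + content.slice(first + search.length);
  }
  return content;
}

const buggySource = "function add(a, b) {\n  return a - b;\n}\n";
const fixedSource = applyEdits(buggySource, [
  { search: "return a - b;", replace: "return a + b;" },
]);
console.log(fixedSource);
```

A failed or ambiguous match surfaces as an error instead of a silent wrong edit, which is the kind of failure a diff edit evaluation would count against the model.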

Directory Structure

diff-edits/
├── cases/                  # Test case JSON files
├── results/                # Evaluation results
├── ClineWrapper.ts         # Wrapper for model interaction
├── TestRunner.ts           # Main test execution logic
├── types.ts                # Type definitions
├── diff-apply/             # Diff application logic
├── parsing/                # Assistant message parsing
└── prompts/                # System prompts

Creating Test Cases

Test cases are defined as JSON files in the diff-edits/cases/ directory. Each test case should include:

{
  "test_id": "example_test_1",
  "messages": [
    {
      "role": "user",
      "text": "Please fix the bug in this code...",
      "images": []
    },
    {
      "role": "assistant",
      "text": "I'll help you fix that bug..."
    }
  ],
  "file_contents": "// Original file content here\nfunction example() {\n  // Code with bug\n}",
  "file_path": "src/example.js",
  "system_prompt_details": {
    "mcp_string": "",
    "cwd_value": "/path/to/working/directory",
    "browser_use": false,
    "width": 900,
    "height": 600,
    "os_value": "macOS",
    "shell_value": "/bin/zsh",
    "home_value": "/Users/username",
    "user_custom_instructions": ""
  },
  "original_diff_edit_tool_call_message": ""
}
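
The JSON shape above can be checked before a run. The sketch below mirrors the field names from the example, but the type names (CaseMessage, DiffEditCase) and the validateCase helper are hypothetical and may differ from what diff-edits/types.ts declares. Treating every top-level field as required is also an assumption.

```typescript
// Field names come from the JSON example; type names are hypothetical.
interface CaseMessage {
  role: "user" | "assistant";
  text: string;
  images?: unknown[]; // element type is not specified in the example
}

interface DiffEditCase {
  test_id: string;
  messages: CaseMessage[];
  file_contents: string;
  file_path: string;
  system_prompt_details: Record<string, unknown>;
  original_diff_edit_tool_call_message: string;
}

// Returns a list of problems; an empty list means the shape looks right.
function validateCase(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["case is not an object"];
  }
  const fields = raw as Record<string, unknown>;
  const problems: string[] = [];

  const stringKeys = ["test_id", "file_contents", "file_path", "original_diff_edit_tool_call_message"];
  for (const key of stringKeys) {
    if (typeof fields[key] !== "string") problems.push(`${key} must be a string`);
  }

  const messages = fields["messages"];
  if (!Array.isArray(messages) || messages.length === 0) {
    problems.push("messages must be a non-empty array");
  } else {
    messages.forEach((message: unknown, index: number) => {
      const m = (typeof message === "object" && message !== null ? message : {}) as {
        role?: unknown;
        text?: unknown;
      };
      if (m.role !== "user" && m.role !== "assistant") {
        problems.push(`messages[${index}].role must be "user" or "assistant"`);
      }
      if (typeof m.text !== "string") problems.push(`messages[${index}].text must be a string`);
    });
  }

  const details = fields["system_prompt_details"];
  if (typeof details !== "object" || details === null) {
    problems.push("system_prompt_details must be an object");
  }
  return problems;
}

const exampleCase: DiffEditCase = {
  test_id: "example_test_1",
  messages: [{ role: "user", text: "Please fix the bug in this code...", images: [] }],
  file_contents: "function example() {}\n",
  file_path: "src/example.js",
  system_prompt_details: { cwd_value: "/path/to/working/directory", browser_use: false },
  original_diff_edit_tool_call_message: "",
};
console.log(validateCase(exampleCase));     // prints []
console.log(validateCase({ test_id: 42 })); // prints one problem per missing or mistyped field
```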

Running Diff Edit Evaluations

Single Model Evaluation

cd evals/cli
node dist/index.js run-diff-eval --model-ids "anthropic/claude-3-5-sonnet-20241022"

Multi-Model Evaluation

Compare multiple models in a single evaluation run:

# Compare Claude and Grok models
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
  --max-cases 10 \
  --valid-attempts-per-case 3 \
  --verbose

# Compare multiple Claude variants
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,anthropic/claude-3-5-haiku-20241022,anthropic/claude-3-opus-20240229" \
  --max-cases 5 \
  --valid-attempts-per-case 2 \
  --parallel

Options

--model-ids: Comma-separated list of model IDs to evaluate (required)
--system-prompt-name: System prompt to use (default: basicSystemPrompt)
--valid-attempts-per-case: Number of attempts per test case per model (default: 1)
--max-cases: Maximum number of test cases to run (default: all available)
--parsing-function: Function to parse assistant messages (default: parseAssistantMessageV2)
--diff-edit-function: Function to apply diffs (default: constructNewFileContentV2)
--test-path: Path to test cases (default: diff-edits/cases)
--thinking-budget: Tokens allocated for thinking (default: 0)
--parallel: Run tests in parallel (flag)
--replay: Use pre-recorded LLM output (flag)
--verbose: Enable detailed logging (flag)
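
The options above multiply: each model is run against each selected case for the requested number of attempts. The sketch below estimates the size of a run before launching it. The DiffEvalPlan type and both helpers are hypothetical and not part of the CLI. The result is best read as a lower bound, on the assumption that "valid attempts" means invalid attempts are retried.

```typescript
// Hypothetical planning helper; not part of the CLI.
interface DiffEvalPlan {
  modelIds: string[];           // parsed from --model-ids
  availableCases: number;       // number of case files under the test path
  maxCases?: number;            // --max-cases (undefined means all available)
  validAttemptsPerCase: number; // --valid-attempts-per-case
}

// Splits a comma-separated flag value and drops empty entries.
function parseModelIds(flagValue: string): string[] {
  return flagValue
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

// models x cases x attempts, with cases capped by what is available.
function minimumAttempts(plan: DiffEvalPlan): number {
  const cases = Math.min(plan.maxCases ?? plan.availableCases, plan.availableCases);
  return plan.modelIds.length * cases * plan.validAttemptsPerCase;
}

const quickPlan: DiffEvalPlan = {
  modelIds: parseModelIds("anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta"),
  availableCases: 50, // illustrative count
  maxCases: 4,
  validAttemptsPerCase: 2,
};
console.log(minimumAttempts(quickPlan)); // 2 models x 4 cases x 2 attempts = 16
```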

Examples

# Quick test with 2 models, 4 cases, 2 attempts each
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
  --max-cases 4 \
  --valid-attempts-per-case 2 \
  --verbose

# Comprehensive evaluation with parallel execution
node dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,anthropic/claude-3-5-haiku-20241022" \
  --system-prompt-name claude4SystemPrompt \
  --valid-attempts-per-case 5 \
  --max-cases 20 \
  --parallel \
  --verbose

Database Storage & Analytics

All evaluation results are automatically stored in a SQLite database (diff-edits/evals.db) for advanced analytics and comparison. The database includes:

System Prompts: Versioned system prompt content with hashing for deduplication
Processing Functions: Versioned parsing and diff-edit function configurations
Files: Original and edited file content with content-based hashing
Runs: Evaluation run metadata and configuration
Cases: Individual test case information with context tokens
Results: Detailed results with timing, cost, and success metrics
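
Content-based hashing is what lets identical prompts and files share a single row. The sketch below illustrates the idea with an in-memory map standing in for a database table. SHA-256 and the ContentStore class are assumptions for illustration; the actual schema and hash function may differ.

```typescript
import { createHash } from "node:crypto";

// SHA-256 is an assumption; the README only says content is hashed.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// In-memory stand-in for a table keyed by content hash.
class ContentStore {
  private readonly rows = new Map<string, string>();

  // Stores content once and returns its hash; repeated content reuses the row.
  put(content: string): string {
    const hash = contentHash(content);
    if (!this.rows.has(hash)) {
      this.rows.set(hash, content);
    }
    return hash;
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new ContentStore();
const firstHash = store.put("prompt text, version 1");
const repeatHash = store.put("prompt text, version 1"); // same content, same row
const otherHash = store.put("prompt text, version 2");  // new content, new row

console.log(firstHash === repeatHash); // true
console.log(store.size);               // 2
```

Because a run references content by hash, two runs that share a system prompt point at the same stored version, which keeps comparisons across runs unambiguous.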

Interactive Dashboard

From the evals directory, launch the Streamlit dashboard to visualize and analyze evaluation results:

cd diff-edits/dashboard
streamlit run app.py

The dashboard provides:

Model Performance Comparison: Side-by-side comparison of success rates, latency, and costs
Interactive Charts: Success rate trends, latency vs cost analysis, and performance metrics
Detailed Drill-Down: Individual result analysis with file content viewing
Run Selection: Browse and compare different evaluation runs
Real-time Updates: Automatically refreshes with new evaluation data

Dashboard Features

Hero Section: Overview of current run with key metrics
Model Cards: Performance cards with grades and detailed metrics
Comparison Charts: Interactive Plotly charts for visual analysis
Result Explorer: Detailed view of individual test results including:
- Original and edited file content
- Raw model output
- Parsed tool calls
- Timing and cost metrics
- Error analysis

Dashboard Quick Start

# Run a quick evaluation (from the evals directory)
node cli/dist/index.js run-diff-eval \
  --model-ids "anthropic/claude-3-5-sonnet-20241022,x-ai/grok-beta" \
  --max-cases 4 \
  --valid-attempts-per-case 2 \
  --verbose

# Launch dashboard to view results
cd diff-edits/dashboard && streamlit run app.py

Legacy Results

For backward compatibility, results are also saved as JSON files in the diff-edits/results/ directory. The JSON results include:

Success/failure status
Extracted tool calls
Diff edit content
Token usage and cost metrics

Metrics

The evaluation system collects the following metrics:

Token Usage: Input and output tokens
Cost: Estimated cost of API calls
Duration: Time taken to complete tasks
Tool Usage: Number of tool calls and failures
Success Rate: Percentage of tasks completed successfully
Test Success Rate: Percentage of tests passed
Functional Correctness: Ratio of tests passed to total tests
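
Two of these metrics are easy to confuse, so the following sketch computes them side by side. The TaskResult shape is hypothetical, and pooling tests across all tasks for Functional Correctness (rather than averaging per-task ratios) is an assumption.

```typescript
// Hypothetical per-task record; the stored schema may differ.
interface TaskResult {
  completed: boolean; // did the task finish successfully?
  testsPassed: number;
  testsTotal: number;
}

// Success Rate: percentage of tasks completed successfully (0 to 100).
function successRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  const completed = results.filter((r) => r.completed).length;
  return (100 * completed) / results.length;
}

// Functional Correctness: tests passed divided by total tests (0 to 1),
// pooled over all tasks (the pooling is an assumption).
function functionalCorrectness(results: TaskResult[]): number {
  const passed = results.reduce((sum, r) => sum + r.testsPassed, 0);
  const total = results.reduce((sum, r) => sum + r.testsTotal, 0);
  return total === 0 ? 0 : passed / total;
}

// Made-up sample data for illustration only.
const sampleResults: TaskResult[] = [
  { completed: true, testsPassed: 5, testsTotal: 5 },
  { completed: false, testsPassed: 2, testsTotal: 4 },
  { completed: true, testsPassed: 3, testsTotal: 3 },
  { completed: false, testsPassed: 0, testsTotal: 4 },
];
console.log(successRate(sampleResults));           // 50
console.log(functionalCorrectness(sampleResults)); // 0.625
```

In this sample, half the tasks succeed, yet 10 of 16 tests pass: partial progress on failed tasks raises Functional Correctness without affecting Success Rate.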

Reports

Reports are generated in Markdown or JSON format and include:

Overall summary
Benchmark-specific results
Model-specific results
Tool usage statistics
Charts and visualizations
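
As a rough illustration of the two output formats, the sketch below renders the same per-model summary as either Markdown or JSON. The ModelSummary shape, the renderReport function, and the table columns are hypothetical; the real report command may lay out its output differently.

```typescript
// Hypothetical summary row; not the report command's actual data model.
interface ModelSummary {
  model: string;
  tasks: number;
  succeeded: number;
  costUsd: number;
}

function renderMarkdown(rows: ModelSummary[]): string {
  const lines = [
    "# Evaluation Report",
    "",
    "| Model | Tasks | Success rate | Cost (USD) |",
    "| --- | ---: | ---: | ---: |",
  ];
  for (const r of rows) {
    const rate = r.tasks === 0 ? "n/a" : `${((100 * r.succeeded) / r.tasks).toFixed(1)}%`;
    lines.push(`| ${r.model} | ${r.tasks} | ${rate} | ${r.costUsd.toFixed(2)} |`);
  }
  return lines.join("\n") + "\n";
}

// Mirrors the two values accepted by --format.
function renderReport(rows: ModelSummary[], format: "markdown" | "json"): string {
  return format === "json" ? JSON.stringify(rows, null, 2) : renderMarkdown(rows);
}

// Made-up numbers and placeholder model names, for illustration only.
const summaryRows: ModelSummary[] = [
  { model: "model-a", tasks: 10, succeeded: 7, costUsd: 1.5 },
  { model: "model-b", tasks: 10, succeeded: 4, costUsd: 0.25 },
];
console.log(renderReport(summaryRows, "markdown"));
```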

Development

Adding a New Benchmark

Create a new adapter in evals/cli/src/adapters/
Implement the BenchmarkAdapter interface
Register the adapter in evals/cli/src/adapters/index.ts
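
The three steps above can be sketched as follows. The BenchmarkAdapter members shown here (name, setup, listTasks) and the registry functions are hypothetical placeholders: the real interface in evals/cli/src/adapters/ may declare different members, so check it before writing an adapter.

```typescript
// Hypothetical shapes for illustration; the real interface may differ.
interface BenchmarkTask {
  id: string;
  description: string;
}

interface BenchmarkAdapter {
  readonly name: string;
  setup(): Promise<void>;
  listTasks(): Promise<BenchmarkTask[]>;
}

// Steps 1 and 2: create an adapter that implements the interface.
class ToyAdapter implements BenchmarkAdapter {
  readonly name = "toy";
  private ready = false;

  async setup(): Promise<void> {
    this.ready = true; // a real adapter might clone a repository here
  }

  async listTasks(): Promise<BenchmarkTask[]> {
    if (!this.ready) throw new Error("call setup() first");
    return [{ id: "toy-1", description: "Reverse a string" }];
  }
}

// Step 3: a registry keyed by benchmark name, standing in for adapters/index.ts.
const adapters = new Map<string, BenchmarkAdapter>();

function registerAdapter(adapter: BenchmarkAdapter): void {
  if (adapters.has(adapter.name)) {
    throw new Error(`duplicate adapter: ${adapter.name}`);
  }
  adapters.set(adapter.name, adapter);
}

function getAdapter(benchmark: string): BenchmarkAdapter {
  const adapter = adapters.get(benchmark);
  if (!adapter) throw new Error(`unknown benchmark: ${benchmark}`);
  return adapter;
}

registerAdapter(new ToyAdapter());
const toy = getAdapter("toy");
toy
  .setup()
  .then(() => toy.listTasks())
  .then((tasks) => console.log(tasks.map((t) => t.id)));
```

Looking adapters up by name is what would let a flag such as --benchmark select an implementation without the run command knowing about each benchmark.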

Extending Metrics

To add new metrics:

Update the database schema in evals/cli/src/db/schema.ts
Add collection logic in evals/cli/src/utils/results.ts
Update report generation in evals/cli/src/commands/report.ts