Most voice agent benchmarks evaluate either what the agent does or how it sounds — EVA evaluates both.
EVA is an open-source evaluation framework for conversational voice agents that scores complete, multi-turn spoken conversations across two fundamental dimensions:
- 🎯 EVA-A (Accuracy) — Did the agent complete the task correctly and faithfully?
- ✨ EVA-X (Experience) — Was the interaction natural, concise, and appropriate for spoken dialogue?
Using a realistic bot-to-bot architecture, EVA runs fully automated evaluations without human listeners — end to end, from speech in to judgment out.
- Metrics for both EVA-A and EVA-X, fully documented and validated, including judge prompts and code
- 50 airline scenarios spanning flight rebooking, cancellations, vouchers, and more
- Results for 20 cascade and audio-native systems (speech-to-speech models, large audio language models) — see Experiment Setup for model configurations.
Agents that score well on task completion tend to score worse on conversational experience — and vice versa. The accuracy–experience tradeoff is real, consistent, and previously unmeasured.
If you're only interested in running the latest stable version of EVA, you can clone with `--branch latest`, and optionally speed things up with `--depth 1 --no-tags --single-branch`.

```bash
git clone https://github.com/ServiceNow/eva.git --branch latest --depth 1 --no-tags --single-branch
```

Otherwise, for development, clone the default branch, `main`:

```bash
git clone https://github.com/ServiceNow/eva.git
```

We recommend using uv for fast, reliable dependency management. If you don't have uv installed, see the uv installation guide.
This project requires Python 3.11–3.13 (set via requires-python in pyproject.toml). uv will automatically select a compatible version. If you're using pip, make sure you're running a supported Python version.
```bash
cd eva

# Install all dependencies (uv automatically creates a virtual environment)
uv sync --all-extras

# Copy environment template
cp .env.example .env

# Edit .env with your API keys (ELEVENLABS_API_KEY, OPENAI_API_KEY required)
```

After installation, you can run EVA using either:

- `eva` — CLI entry point (e.g., `eva --help`)
- `python main.py` — script at the repo root (e.g., `python main.py --help`)
If using an IDE, point your Python interpreter to .venv/bin/python so commands run in the virtual environment automatically. Otherwise, prefix commands with uv run or activate the environment with source .venv/bin/activate.
Alternative: using pip
This project requires Python 3.11–3.13. If you need to manage multiple Python versions, consider using pyenv.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -e ".[dev]"
```

Required:
- `OPENAI_API_KEY` (or another LLM provider): Powers the assistant LLM and text judge metrics
- `EVA_MODEL_LIST`: Model deployments that reference your API key (see `.env.example`). Also configurable via the `--model-list` CLI flag. Only used for regular LLMs.
- `ELEVENLABS_API_KEY` + agent IDs: For user simulation
- STT/TTS API key and model: Passed via `EVA_MODEL__STT_PARAMS` / `EVA_MODEL__TTS_PARAMS` (default provider is Cartesia)
For all metrics:
- `OPENAI_API_KEY`: GPT-5.2 for text judge metrics (task completion, conciseness, turn taking, etc.)
- `GOOGLE_APPLICATION_CREDENTIALS`: Gemini via Vertex AI (audio judge metrics)
- `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`: Claude via Bedrock (faithfulness metric)
Key Environment Variables:
```bash
# Framework Configuration
EVA_DOMAIN=airline                    # Domain-based path conventions
EVA_MAX_CONCURRENT_CONVERSATIONS=5    # Max parallel conversations
EVA_DEBUG=false                       # Run only 1 record for testing when enabled
EVA_RECORD_IDS=1.2.1,1.2.2            # Run specific records only (remove to run all records)

# Pipeline Model Configuration (nested under EVA_MODEL__)
EVA_MODEL__LLM=gpt-5-mini             # LLM model name (must match EVA_MODEL_LIST)
EVA_MODEL__STT=deepgram               # deepgram | openai_whisper
EVA_MODEL__TTS=cartesia               # cartesia | elevenlabs
EVA_MODEL__STT_PARAMS={"api_key":"", "alias": "deepgram-nova-3", "model": "nova-3"}
EVA_MODEL__TTS_PARAMS={"api_key":"", "alias": "cartesia-sonic-3", "model": "sonic-3"}

# Or speech-to-speech model (mutually exclusive with LLM)
# EVA_MODEL__S2S=gpt-realtime-mini    # Audio-native model name (S2S, S2T+TTS)

# Logging
EVA_LOG_LEVEL=INFO                    # DEBUG | INFO | WARNING | ERROR
```

See .env.example for the complete list of configuration options.
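The `EVA_MODEL__` convention nests settings one level deep by splitting variable names on a double underscore. As a rough, stdlib-only illustration of that convention (not EVA's actual settings loader), a parser might look like:

```python
import os

def load_nested_env(prefix: str = "EVA_", delimiter: str = "__") -> dict:
    """Collect prefixed environment variables into a nested dict,
    splitting key segments on the delimiter (EVA_MODEL__LLM -> model.llm)."""
    config: dict = {}
    for key, value in os.environ.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].lower().split(delimiter)
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config

os.environ["EVA_DOMAIN"] = "airline"
os.environ["EVA_MODEL__LLM"] = "gpt-5-mini"
cfg = load_nested_env()
print(cfg["domain"], cfg["model"]["llm"])  # airline gpt-5-mini
```

In practice a settings library would also coerce types and validate values; this sketch only shows the key-splitting behavior.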
```bash
# Run with domain-based conventions (easiest):
EVA_DOMAIN=airline python main.py
# Automatically uses:
#   data/airline_dataset.jsonl
#   configs/agents/airline_agent.yaml
#   data/airline_scenarios/

# Run with CLI overrides
python main.py --model.llm gpt-5-mini --max-concurrent-conversations 10

# Re-run specific metrics on an existing run
python main.py \
  --run-id <existing_run_id> \
  --metrics task_completion,faithfulness,conciseness
```

EVA includes a Streamlit analysis app for visualizing and comparing results:

```bash
streamlit run apps/analysis.py
```

The app reads from the output/ directory by default and provides three views: cross-run comparison, run overview, and per-record detail (transcripts, audio, metrics, conversation traces). See apps/README.md for full documentation.
```bash
# Build the image
docker compose build

# Run a benchmark
docker compose run --rm benchmark
```

Install pre-commit hooks to lint and format code:

```bash
pre-commit install
```

Install the [dev] extra dependencies as shown in the Installation section.
```bash
# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_postprocessor_transcript.py -v

# Run with coverage
pytest tests/ --cov=eva

# Run metrics tests
pytest tests/integration/test_metrics.py -v
```

Existing benchmarks evaluate voice agent components in isolation — speech understanding, TTS quality, or conversational dynamics — but none assess the full pipeline end to end. In real deployed systems, errors compound across modules, and failure modes interact in ways that component-level evaluation cannot capture. EVA addresses this by treating voice agent quality as an integrated whole, evaluating accuracy and experience jointly across complete multi-turn spoken conversations.
| Framework | Interaction Mode | Multi-turn | Tool Calling | Goal Completion | Experience Metrics | Pass@k / Pass^k | Supported Systems |
|---|---|---|---|---|---|---|---|
| EVA | Live bot-to-bot | ✅ | ✅ | ✅ Task Completion, Speech Fidelity, Faithfulness | ✅ Conciseness, Turn-taking, Latency, Progression | ✅ | Audio-native, Cascade |
| VoiceAgentBench | Static, TTS-synthesized | ✅ | ✅ | ❌ | ❌ | | Audio-native, Cascade |
| CAVA | Partial simulation | ✅ | ✅ | | Latency, Tone-awareness | ❌ | Audio-native, Cascade |
| FDB-v2 | Live, automated examiner | ✅ | ❌ | ❌ | ✅ Turn-taking fluency, Correction handling, Safety | ❌ | Audio-native |
| FDB-v1 | Static, pre-recorded | ❌ | ❌ | ❌ | ✅ Turn-taking, Backchanneling, Interruption | ❌ | Audio-native |
| FD-Bench | Live, simulated | ❌ | ❌ | ❌ | ✅ Interruption, Delay, Robustness | ❌ | Audio-native |
| Talking Turns | Static, curated | ❌ | ❌ | ❌ | ✅ Turn change, Backchannel, Interruption | ❌ | Audio-native, Cascade |
EVA evaluates agents using a bot-to-bot audio architecture — no human listeners, no text replays. Two conversational AIs speak to each other over a live WebSocket connection, producing realistic speech-to-speech interactions that capture real STT behavior and turn-taking dynamics.
| Component | Role |
|---|---|
| 🎭 User Simulator (ElevenAgent) | Plays the role of a caller with a defined goal and persona |
| 🤖 Voice Agent (Pipecat) | The system under evaluation — supports cascade (STT→LLM→TTS) and speech-to-speech models |
| 🔧 Tool Executor | The engine that provides deterministic, reproducible tool responses via custom Python functions. It dynamically queries and modifies a predefined per-scenario database. |
| ✅ Validators | Automated checks that verify conversations are complete and that the user simulator faithfully reproduced its intended goal — no human annotation required. Conversations that fail validation are automatically regenerated, ensuring only clean, correctly executed runs enter evaluation. |
| 📊 Metrics Engine | Scores each conversation using the audio recording, transcripts, and tool call logs. |
```
output/<run_id>/
├── config.json                 # Run configuration snapshot
├── results.csv                 # Quick results table
├── metrics_summary.json        # Aggregate metrics (after metrics run)
├── metrics_summary.csv         # Per-category metrics breakdown
└── records/<record_id>/
    ├── result.json             # Conversation result
    ├── audio_assistant.wav     # Assistant audio channel
    ├── audio_user.wav          # User audio channel
    ├── audio_mixed.wav         # Mixed stereo audio
    ├── transcript.jsonl        # Turn-by-turn transcript
    ├── audit_log.json          # Complete interaction log
    ├── pipecat_logs.jsonl      # Pipecat framework events
    ├── elevenlabs_events.jsonl # ElevenLabs events
    └── metrics.json            # Per-record metric scores and details
```
| 🎯 EVA-A · Accuracy | ✨ EVA-X · Experience |
|---|---|
| Did the agent complete the task correctly? | Was the conversational experience high quality? |
| Task Completion · Deterministic | Turn Taking · LLM Judge BETA |
| Agent Speech Fidelity · Audio LLM Judge BETA | Conciseness · LLM Judge |
| Faithfulness · LLM Judge | Conversation Progression · LLM Judge |
See the Metrics documentation for detailed scoring rubrics and judge prompts. For the data structures that metrics operate on, see MetricContext documentation.
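The metric layer (metrics/base.py, metrics/registry.py) suggests a registry-of-plugins design. Purely as an illustration of that pattern, not the real API (the actual base classes and MetricContext fields will differ):

```python
from dataclasses import dataclass, field

# Illustrative registry: maps a metric name to its implementing class.
METRIC_REGISTRY: dict = {}

def register_metric(name: str):
    """Decorator that records a metric class under a lookup name."""
    def decorator(cls):
        METRIC_REGISTRY[name] = cls
        return cls
    return decorator

@dataclass
class MetricContext:
    # Hypothetical stand-in for EVA's MetricContext
    transcript: list = field(default_factory=list)   # turn-by-turn transcript
    tool_calls: list = field(default_factory=list)   # logged tool invocations

@register_metric("tool_call_count")
class ToolCallCount:
    """Trivial deterministic metric: number of tool invocations."""
    def score(self, ctx: MetricContext) -> float:
        return float(len(ctx.tool_calls))

ctx = MetricContext(tool_calls=[{"name": "cancel_booking"}])
print(METRIC_REGISTRY["tool_call_count"]().score(ctx))  # 1.0
```

A registry like this lets the metrics runner select metrics by name (as the `--metrics` CLI flag does) without hard-coding the set of implementations.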
We created three datasets on different enterprise domains, each selected to target a distinct axis of difficulty for voice agents. All three require accurate transcription of structured named entities over voice (e.g., confirmation codes and employee identifiers), but differ in their primary challenge. Airline Customer Service Management (CSM) tests temporal reasoning and complex policy adherence in high-stakes flight rebooking scenarios. Healthcare Human Resources Service Delivery (HRSD) stresses entity density, requiring callers to communicate multiple registration and license numbers across clinical and administrative HR workflows. Enterprise Information Technology Service Management (ITSM) introduces branching conversational flows (e.g., incident resolution attempts must fail before ticket escalation is permitted) and tiered authentication reflecting the access sensitivity of different workflows.
Within each domain, scenarios span three dimensions: Single-Intent (one workflow per call), Multi-Intent (one to four concurrent workflows, testing compositional task completion without context loss), and Adversarial (hard policy constraints under social pressure, e.g., refusing compensation to an ineligible caller).
See the Data documentation for a detailed breakdown of the data structure and scenario design, and the Database & Tool Schema for the airline scenario database format.
```
eva/
├── main.py                          # Main entry point
├── pyproject.toml                   # Python project configuration
├── apps/                            # Streamlit apps
├── Dockerfile                       # Docker configuration
├── compose.yaml                     # Docker Compose configuration
├── src/eva/
│   ├── cli.py                       # CLI interface
│   ├── run_benchmark.py             # Benchmark runner
│   ├── models/                      # Pydantic data models
│   ├── orchestrator/                # Framework execution
│   │   ├── runner.py                # Main orchestrator
│   │   ├── worker.py                # Per-conversation worker
│   │   ├── validation_runner.py     # Validation runner
│   │   └── port_pool.py             # Port management
│   ├── assistant/                   # Pipecat-based assistant
│   │   ├── agentic/                 # Agent orchestration
│   │   ├── tools/                   # Python-based tool implementations
│   │   ├── pipeline/                # Audio/LLM processing pipeline
│   │   └── services/                # STT/TTS/LLM factories
│   ├── user_simulator/              # ElevenLabs user simulator
│   ├── metrics/                     # Evaluation metrics
│   │   ├── base.py                  # Base metric classes
│   │   ├── processor.py             # Metrics context processor
│   │   ├── runner.py                # Metrics execution
│   │   ├── registry.py              # Metric registry
│   │   ├── aggregation.py           # Metric aggregation
│   │   ├── accuracy/                # Task completion metrics
│   │   ├── experience/              # Responsiveness, progression, turn-taking
│   │   ├── diagnostic/              # Diagnostic metrics (not in final scores)
│   │   └── validation/              # Quality control metrics
│   └── utils/                       # Utilities (LLM client, log processing)
├── scripts/                         # Utility scripts
│   ├── run_text_only.py             # Text-only evaluation runner
│   ├── docker_entrypoint.py         # Docker entry point
│   └── check_version_bump.py        # Version checking
├── configs/                         # Configuration files
│   ├── prompts/                     # Judge and simulation prompts
│   │   ├── judge.yaml               # Judge metric prompts
│   │   └── simulation.yaml          # User simulator prompts
│   └── agents/                      # Agent configurations
│       └── airline_agent.yaml
├── docs/                            # Documentation
│   ├── metrics/                     # Per-metric documentation
│   ├── data.md                      # Data documentation
│   ├── experiment_setup.md          # Experiment setup guide
│   ├── llm_configuration.md         # LLM provider setup guide
│   ├── metric_context.md            # Metric context documentation
│   ├── limitations.md               # Known limitations
│   └── demo/                        # Demo audio files
├── data/                            # Data files
│   ├── airline_dataset.jsonl        # Evaluation dataset
│   └── airline_scenarios/           # Per-record scenario databases
├── tests/                           # Test suite
│   ├── unit/                        # Unit tests
│   ├── integration/                 # Integration tests
│   ├── artifacts/                   # Test artifacts and fixtures
│   └── fixtures/                    # Shared test fixtures
└── website/                         # Project website (React/TypeScript)
```
We welcome contributions! Please read our Contributing Guidelines before submitting a pull request. For larger features, we recommend reaching out first to ensure alignment with our roadmap.