Overview
Layer AI’s experiment framework lets you split production traffic between two variants and measure which performs better. Instead of guessing whether switching models or changing prompts will help, you get statistically rigorous answers. Common use cases:
- Will switching from GPT-4o to Claude Sonnet reduce costs without hurting quality?
- Does a refined system prompt improve response accuracy?
- Is lowering temperature from 0.7 to 0.5 reducing hallucinations?
Test Types
Parameter Test
Compare different configurations on the same gate. Layer internally routes traffic to one of two variant configs. What you can vary:
- Model (e.g., GPT-4o vs Claude Sonnet)
- System prompt
- Temperature
- Max tokens
- Top P
Gate Comparison
Compare two completely different gates. Useful for testing fundamentally different setups (e.g., different routing strategies, different fallback chains).
How It Works
- Create an experiment — Define what you’re testing, choose a goal metric, set guardrails
- Split traffic — Layer routes a configurable percentage to each variant (e.g., 50/50)
- Collect data — Metrics are tracked automatically on every request
- Determine winner — When minimum sample size is reached and results are statistically significant (p < 0.05), Layer declares a winner
- Apply winner — One-click to update your gate with the winning configuration
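A deterministic, hash-based split is one common way to implement step 2 so that assignment is sticky per session. This is a sketch with hypothetical names, not Layer's internal routing:

```python
import hashlib

def assign_variant(session_id: str, experiment_id: str, split_b: float = 0.5) -> str:
    """Deterministically assign a session to variant A or B.

    Hashing (experiment_id, session_id) makes the assignment sticky:
    the same session always lands in the same bucket.
    """
    key = f"{experiment_id}:{session_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "B" if bucket < split_b * 10_000 else "A"

# The same session always gets the same variant on every request.
v1 = assign_variant("user-123", "exp-42", split_b=0.5)
v2 = assign_variant("user-123", "exp-42", split_b=0.5)
assert v1 == v2
```

Because the bucket depends only on the hash, changing `split_b` from 50/50 to 90/10 only moves the boundary; sessions already in the smaller bucket keep their variant.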
Metrics
Goal Metrics
The primary metric you’re optimizing. Choose one and set a direction:

| Metric | Description | Typical Target |
|---|---|---|
| avg_cost | Average cost per request | Minimize |
| avg_latency | Average response time | Minimize |
| error_rate | Percentage of failed requests | Minimize |
| avg_tokens | Average token usage | Minimize |
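The four goal metrics are simple aggregates over a variant's request logs. A minimal sketch, with the `RequestLog` fields assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    cost_usd: float
    latency_ms: float
    tokens: int
    failed: bool

def goal_metrics(logs: list[RequestLog]) -> dict[str, float]:
    """Aggregate the four goal metrics over a variant's request logs."""
    n = len(logs)
    return {
        "avg_cost": sum(r.cost_usd for r in logs) / n,
        "avg_latency": sum(r.latency_ms for r in logs) / n,
        "error_rate": sum(r.failed for r in logs) / n,  # bools sum as 0/1
        "avg_tokens": sum(r.tokens for r in logs) / n,
    }
```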
Guardrail Metrics
Boundaries that must not be violated. If a critical guardrail is breached, the experiment auto-stops and the control variant (A) wins. Example guardrails:
- Error rate must stay below 2% (critical — auto-stop)
- Average latency must stay below 500ms (warning — alert only)
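The two severity levels above imply a simple evaluation rule: any critical breach stops the experiment, while warning breaches only alert. A sketch under assumed names:

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    max_value: float
    severity: str  # "critical" auto-stops; "warning" alerts only

def evaluate_guardrails(metrics: dict[str, float],
                        guardrails: list[Guardrail]) -> str:
    """Return the implied action: 'stop', 'warn', or 'ok'."""
    action = "ok"
    for g in guardrails:
        if metrics.get(g.metric, 0.0) > g.max_value:
            if g.severity == "critical":
                return "stop"   # critical breach: auto-stop, control (A) wins
            action = "warn"     # warning breach: alert only, keep running
    return action

# The example guardrails from the text: error rate < 2% (critical),
# average latency < 500 ms (warning).
rails = [Guardrail("error_rate", 0.02, "critical"),
         Guardrail("avg_latency", 500.0, "warning")]
print(evaluate_guardrails({"error_rate": 0.01, "avg_latency": 620.0}, rails))  # warn
```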
Diagnostic Metrics
Additional metrics tracked for context. They don’t affect winner determination but help you understand the results. Examples: total tokens, request volume trends, conversation length.
Creating an Experiment
From the Dashboard
- Go to Dashboard → Experiments → Create New
- Test Details — Name your experiment and add a description. Optionally state a hypothesis.
- Test Type — Choose parameter test or gate comparison
- Configure Variants
- For parameter tests: select the base gate and which parameters differ between Variant A (control) and Variant B (treatment)
- For gate comparison: select two gates
- Define Success — Choose your goal metric, optimization direction, and minimum detectable effect (MDE)
- Set Guardrails — Add guardrail metrics with thresholds and severity levels
- Duration — Set max duration (days) and/or max requests. Set minimum sample size per variant.
- Launch
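The dashboard steps above map to a single experiment definition. The payload below is purely illustrative; the field names are assumptions, not Layer's documented schema:

```python
# Hypothetical experiment definition mirroring the dashboard steps;
# field names are illustrative, not Layer's actual API schema.
experiment = {
    "name": "gpt4o-vs-claude-cost",
    "hypothesis": "Claude Sonnet cuts cost without hurting quality",
    "type": "parameter_test",
    "variants": {
        "A": {"gate": "prod-chat", "model": "gpt-4o"},         # control
        "B": {"gate": "prod-chat", "model": "claude-sonnet"},  # treatment
    },
    "goal": {"metric": "avg_cost", "direction": "minimize", "mde": 0.10},
    "guardrails": [
        {"metric": "error_rate", "max": 0.02, "severity": "critical"},
        {"metric": "avg_latency", "max": 500.0, "severity": "warning"},
    ],
    "duration": {"max_days": 14, "max_requests": 100_000},
    "min_sample_size": 100,
}
```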
Traffic Split
Configure what percentage of requests go to each variant. Default is 50/50, but you can adjust (e.g., 90/10 if you want to limit exposure to the experimental variant). Sticky assignment ensures the same user/session always gets the same variant.
Statistical Framework
Minimum Detectable Effect (MDE)
The smallest improvement that matters to you. If you set MDE to 10%, Layer won’t declare a winner for a 3% improvement — it’s too small to justify the operational overhead of switching.
Confidence Level
Layer uses 95% confidence (p < 0.05). This means less than a 5% chance the observed difference is due to random variation.
Minimum Sample Size
Experiments require a configurable minimum number of requests per variant (default: 100) before winner determination. This prevents premature conclusions from small samples.
Experiment Lifecycle
| Status | Description |
|---|---|
| Draft | Created but not started |
| Running | Actively splitting traffic and collecting data |
| Paused | Temporarily halted (can resume) |
| Stopped | Manually terminated |
| Completed | Winner determined or duration/request limit reached |
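The lifecycle table implies a small state machine. A sketch of the transitions it suggests (the exact rules are an assumption inferred from the table):

```python
# Allowed status transitions implied by the lifecycle table.
TRANSITIONS = {
    "draft": {"running"},
    "running": {"paused", "stopped", "completed"},
    "paused": {"running", "stopped"},
    "stopped": set(),      # manual termination is terminal
    "completed": set(),    # winner determined or limits reached
}

def can_transition(current: str, target: str) -> bool:
    """Check whether an experiment may move from one status to another."""
    return target in TRANSITIONS.get(current, set())
```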
Reading Results
The experiment results page shows:
- Goal metric comparison with statistical significance indicator
- Guardrail status — all passing, warnings, or violations
- Diagnostic metrics for additional context
- Winner recommendation with confidence level and trade-off summary
Metric comparisons are color-coded:
- Green — Improvement (meets goal target and MDE threshold)
- Red — Regression
- Gray — Not statistically significant
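The color coding follows from the statistical framework: significance first (p < 0.05), then the MDE threshold. A sketch of that decision rule, with the sign convention assumed (positive improvement = better in the goal's direction):

```python
def classify_result(improvement: float, p_value: float, mde: float,
                    alpha: float = 0.05) -> str:
    """Map a variant comparison to the color coding described above.

    improvement: relative change in the goal metric's preferred direction
    (positive = better). Thresholds mirror the text: p < 0.05 for
    significance, MDE as the smallest improvement worth acting on.
    """
    if p_value >= alpha:
        return "gray"    # not statistically significant
    if improvement >= mde:
        return "green"   # significant improvement that clears the MDE
    if improvement < 0:
        return "red"     # significant regression
    return "gray"        # significant but below the MDE; too small to switch

print(classify_result(improvement=0.12, p_value=0.01, mde=0.10))  # green
```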
Applying a Winner
When an experiment completes with a clear winner:
- Review the results and trade-offs
- Click Apply Winner to update your gate with the winning configuration
- The gate is updated with the winning model, prompt, and parameter settings