Overview
Layer AI’s experiment framework lets you split production traffic between two variants and measure which performs better. Instead of guessing whether switching models or changing prompts will help, you get statistically rigorous answers. Common use cases:
- Will switching from GPT-4o to Claude Sonnet reduce costs without hurting quality?
- Does a refined system prompt improve response accuracy?
- Is lowering temperature from 0.7 to 0.5 reducing hallucinations?
Test Types
Parameter Test
Compare different configurations on the same gate. Layer internally routes traffic to one of two variant configs. What you can vary:
- Model (e.g., GPT-4o vs Claude Sonnet)
- System prompt
- Temperature
- Max tokens
- Top P
Gate Comparison
Compare two completely different gates. Useful for testing fundamentally different setups (e.g., different routing strategies, different fallback chains).
How It Works
- Create an experiment — Define what you’re testing, choose a goal metric, set guardrails
- Split traffic — Layer routes a configurable percentage to each variant (e.g., 50/50)
- Collect data — Metrics are tracked automatically on every request
- Determine winner — When minimum sample size is reached and results are statistically significant (p < 0.05), Layer declares a winner
- Apply winner — One-click to update your gate with the winning configuration
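A deterministic, hash-based split is one common way to implement step 2 so that assignment is sticky per session. This is a sketch with hypothetical names, not Layer's internal routing:

```python
import hashlib

def assign_variant(session_id: str, experiment_id: str, split_b: float = 0.5) -> str:
    """Deterministically assign a session to variant A or B.

    Hashing (experiment_id, session_id) makes the assignment sticky:
    the same session always lands in the same bucket.
    """
    key = f"{experiment_id}:{session_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "B" if bucket < split_b * 10_000 else "A"

# The same session always gets the same variant on every request.
v1 = assign_variant("user-123", "exp-42", split_b=0.5)
v2 = assign_variant("user-123", "exp-42", split_b=0.5)
assert v1 == v2
```

Because the bucket depends only on the hash, changing `split_b` from 50/50 to 90/10 only moves the boundary; sessions already in the smaller bucket keep their variant.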
Metrics
Goal Metrics
The primary metric you’re optimizing. Choose one and set a direction:

| Metric | Description | Typical Target |
|---|---|---|
| avg_cost | Average cost per request | Minimize |
| avg_latency | Average response time | Minimize |
| error_rate | Percentage of failed requests | Minimize |
| avg_tokens | Average token usage | Minimize |
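The four goal metrics are simple aggregates over a variant's request logs. A minimal sketch, with the `RequestLog` fields assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    cost_usd: float
    latency_ms: float
    tokens: int
    failed: bool

def goal_metrics(logs: list[RequestLog]) -> dict[str, float]:
    """Aggregate the four goal metrics over a variant's request logs."""
    n = len(logs)
    return {
        "avg_cost": sum(r.cost_usd for r in logs) / n,
        "avg_latency": sum(r.latency_ms for r in logs) / n,
        "error_rate": sum(r.failed for r in logs) / n,  # bools sum as 0/1
        "avg_tokens": sum(r.tokens for r in logs) / n,
    }
```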
Guardrail Metrics
Boundaries that must not be violated. If a critical guardrail is breached, the experiment auto-stops and the control variant (A) wins. Example guardrails:
- Error rate must stay below 2% (critical — auto-stop)
- Average latency must stay below 500ms (warning — alert only)
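The two severity levels above imply a simple evaluation rule: any critical breach stops the experiment, while warning breaches only alert. A sketch under assumed names:

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    max_value: float
    severity: str  # "critical" auto-stops; "warning" alerts only

def evaluate_guardrails(metrics: dict[str, float],
                        guardrails: list[Guardrail]) -> str:
    """Return the implied action: 'stop', 'warn', or 'ok'."""
    action = "ok"
    for g in guardrails:
        if metrics.get(g.metric, 0.0) > g.max_value:
            if g.severity == "critical":
                return "stop"   # critical breach: auto-stop, control (A) wins
            action = "warn"     # warning breach: alert only, keep running
    return action

# The example guardrails from the text: error rate < 2% (critical),
# average latency < 500 ms (warning).
rails = [Guardrail("error_rate", 0.02, "critical"),
         Guardrail("avg_latency", 500.0, "warning")]
print(evaluate_guardrails({"error_rate": 0.01, "avg_latency": 620.0}, rails))  # warn
```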
Diagnostic Metrics
Additional metrics tracked for context. They don’t affect winner determination but help you understand the results. Examples: total tokens, request volume trends, conversation length.
Creating an Experiment
From the Dashboard
- Go to Dashboard → Experiments → Create New
- Test Details — Name your experiment and add a description. Optionally state a hypothesis.
- Test Type — Choose parameter test or gate comparison
- Configure Variants
- For parameter tests: select the base gate and which parameters differ between Variant A (control) and Variant B (treatment)
- For gate comparison: select two gates
- Define Success — Choose your goal metric, optimization direction, and minimum detectable effect (MDE)
- Set Guardrails — Add guardrail metrics with thresholds and severity levels
- Duration — Set max duration (days) and/or max requests. Set minimum sample size per variant.
- Launch
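The dashboard steps above map to a single experiment definition. The payload below is purely illustrative; the field names are assumptions, not Layer's documented schema:

```python
# Hypothetical experiment definition mirroring the dashboard steps;
# field names are illustrative, not Layer's actual API schema.
experiment = {
    "name": "gpt4o-vs-claude-cost",
    "hypothesis": "Claude Sonnet cuts cost without hurting quality",
    "type": "parameter_test",
    "variants": {
        "A": {"gate": "prod-chat", "model": "gpt-4o"},         # control
        "B": {"gate": "prod-chat", "model": "claude-sonnet"},  # treatment
    },
    "goal": {"metric": "avg_cost", "direction": "minimize", "mde": 0.10},
    "guardrails": [
        {"metric": "error_rate", "max": 0.02, "severity": "critical"},
        {"metric": "avg_latency", "max": 500.0, "severity": "warning"},
    ],
    "duration": {"max_days": 14, "max_requests": 100_000},
    "min_sample_size": 100,
}
```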
Traffic Split
Configure what percentage of requests go to each variant. Default is 50/50, but you can adjust (e.g., 90/10 if you want to limit exposure to the experimental variant). Sticky assignment ensures the same user/session always gets the same variant.
Statistical Framework
Minimum Detectable Effect (MDE)
The smallest improvement that matters to you. If you set MDE to 10%, Layer won’t declare a winner for a 3% improvement — it’s too small to justify the operational overhead of switching.
Confidence Level
Layer uses 95% confidence (p < 0.05). This means less than a 5% chance the observed difference is due to random variation.
Minimum Sample Size
Experiments require a configurable minimum number of requests per variant (default: 100) before winner determination. This prevents premature conclusions from small samples.
Experiment Lifecycle
| Status | Description |
|---|---|
| Draft | Created but not started |
| Running | Actively splitting traffic and collecting data |
| Paused | Temporarily halted (can resume) |
| Stopped | Manually terminated |
| Completed | Winner determined or duration/request limit reached |
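The lifecycle table implies a small state machine. A sketch of the transitions it suggests (the exact rules are an assumption inferred from the table):

```python
# Allowed status transitions implied by the lifecycle table.
TRANSITIONS = {
    "draft": {"running"},
    "running": {"paused", "stopped", "completed"},
    "paused": {"running", "stopped"},
    "stopped": set(),      # manual termination is terminal
    "completed": set(),    # winner determined or limits reached
}

def can_transition(current: str, target: str) -> bool:
    """Check whether an experiment may move from one status to another."""
    return target in TRANSITIONS.get(current, set())
```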
Reading Results
The experiment results page shows:
- Goal metric comparison with statistical significance indicator
- Guardrail status — all passing, warnings, or violations
- Diagnostic metrics for additional context
- Winner recommendation with confidence level and trade-off summary
Metric comparisons are color-coded:
- Green — Improvement (meets goal target and MDE threshold)
- Red — Regression
- Gray — Not statistically significant
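The color coding follows from the statistical framework: significance first (p < 0.05), then the MDE threshold. A sketch of that decision rule, with the sign convention assumed (positive improvement = better in the goal's direction):

```python
def classify_result(improvement: float, p_value: float, mde: float,
                    alpha: float = 0.05) -> str:
    """Map a variant comparison to the color coding described above.

    improvement: relative change in the goal metric's preferred direction
    (positive = better). Thresholds mirror the text: p < 0.05 for
    significance, MDE as the smallest improvement worth acting on.
    """
    if p_value >= alpha:
        return "gray"    # not statistically significant
    if improvement >= mde:
        return "green"   # significant improvement that clears the MDE
    if improvement < 0:
        return "red"     # significant regression
    return "gray"        # significant but below the MDE; too small to switch

print(classify_result(improvement=0.12, p_value=0.01, mde=0.10))  # green
```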
Applying a Winner
When an experiment completes with a clear winner:
- Review the results and trade-offs
- Click Apply Winner to update your gate with the winning configuration
- The gate is updated with the winning model, prompt, and parameter settings