
Overview

Layer AI’s experiment framework lets you split production traffic between two variants and measure which performs better. Instead of guessing whether switching models or changing prompts will help, you get statistically rigorous answers. Common use cases:
  • Will switching from GPT-4o to Claude Sonnet reduce costs without hurting quality?
  • Does a refined system prompt improve response accuracy?
  • Does lowering temperature from 0.7 to 0.5 reduce hallucinations?

Test Types

Parameter Test

Compare different configurations on the same gate. Layer internally routes traffic to one of two variant configs. What you can vary:
  • Model (e.g., GPT-4o vs Claude Sonnet)
  • System prompt
  • Temperature
  • Max tokens
  • Top P
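Concretely, a parameter test's two variants might look like the following sketch. The field names here are illustrative only, not Layer's actual API schema:

```python
# Hypothetical variant configs for a parameter test (field names are
# illustrative; they do not reflect Layer's actual schema).
variant_a = {  # control: the current production configuration
    "model": "gpt-4o",
    "system_prompt": "You are a helpful support assistant.",
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 1.0,
}
variant_b = {  # treatment: the candidate configuration
    **variant_a,
    "model": "claude-sonnet",
    "temperature": 0.5,
}

# Only the parameters under test should differ between variants.
changed = sorted(k for k in variant_a if variant_a[k] != variant_b[k])
print(changed)  # the parameters this experiment is varying
```

Keeping all other parameters identical is what lets the experiment attribute any metric difference to the parameters you changed.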

Gate Comparison

Compare two completely different gates. Useful for testing fundamentally different setups (e.g., different routing strategies, different fallback chains).

How It Works

  1. Create an experiment — Define what you’re testing, choose a goal metric, set guardrails
  2. Split traffic — Layer routes a configurable percentage to each variant (e.g., 50/50)
  3. Collect data — Metrics are tracked automatically on every request
  4. Determine winner — When minimum sample size is reached and results are statistically significant (p < 0.05), Layer declares a winner
  5. Apply winner — One-click to update your gate with the winning configuration
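Step 4 can be sketched as follows. This is not Layer's actual implementation; a large-sample two-sided z-test on means stands in for whatever test Layer uses internally, and it assumes the goal metric is being minimized:

```python
from statistics import NormalDist, mean, stdev

def determine_winner(a, b, min_samples=100, alpha=0.05):
    """Declare a winner only when both variants have reached the minimum
    sample size AND the difference in means is statistically significant.
    Illustrative sketch only, assuming a metric you want to minimize."""
    if len(a) < min_samples or len(b) < min_samples:
        return None  # keep collecting data
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    if p >= alpha:
        return None  # difference could be random variation
    return "A" if mean(a) < mean(b) else "B"  # lower mean wins

# e.g. per-request cost samples (120 requests per variant):
a = [0.010, 0.011, 0.009] * 40
b = [0.008, 0.009, 0.007] * 40
print(determine_winner(a, b))  # variant B wins on cost
```

Note that no winner is declared below the minimum sample size even if the early difference looks large.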

Metrics

Goal Metrics

The primary metric you’re optimizing. Choose one and set a direction:
Metric        Description                     Typical Target
avg_cost      Average cost per request        Minimize
avg_latency   Average response time           Minimize
error_rate    Percentage of failed requests   Minimize
avg_tokens    Average token usage             Minimize

Guardrail Metrics

Boundaries that must not be violated. If a critical guardrail is breached, the experiment auto-stops and the control variant (A) wins. Example guardrails:
  • Error rate must stay below 2% (critical — auto-stop)
  • Average latency must stay below 500ms (warning — alert only)
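The severity distinction can be sketched like this (illustrative only, not Layer's code): critical breaches stop the experiment, warnings only raise an alert.

```python
def check_guardrails(metrics, guardrails):
    """Evaluate observed metrics against guardrail thresholds.
    Returns a list of (action, metric) pairs: "stop" for critical
    breaches, "alert" for warning-level ones. Sketch only."""
    actions = []
    for g in guardrails:
        if metrics[g["metric"]] > g["max"]:
            action = "stop" if g["severity"] == "critical" else "alert"
            actions.append((action, g["metric"]))
    return actions

guardrails = [
    {"metric": "error_rate", "max": 0.02, "severity": "critical"},
    {"metric": "avg_latency_ms", "max": 500, "severity": "warning"},
]
# Error rate breached its critical threshold, latency is fine:
print(check_guardrails({"error_rate": 0.031, "avg_latency_ms": 420}, guardrails))
```

A "stop" result corresponds to the auto-stop behavior above, with control variant A declared the winner.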

Diagnostic Metrics

Additional metrics tracked for context. They don’t affect winner determination but help you understand the results. Examples: total tokens, request volume trends, conversation length.

Creating an Experiment

From the Dashboard

  1. Go to Dashboard → Experiments → Create New
  2. Test Details — Name your experiment and add a description. Optionally state a hypothesis.
  3. Test Type — Choose parameter test or gate comparison
  4. Configure Variants
    • For parameter tests: select the base gate and which parameters differ between Variant A (control) and Variant B (treatment)
    • For gate comparison: select two gates
  5. Define Success — Choose your goal metric, optimization direction, and minimum detectable effect (MDE)
  6. Set Guardrails — Add guardrail metrics with thresholds and severity levels
  7. Duration — Set max duration (days) and/or max requests. Set minimum sample size per variant.
  8. Launch
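Taken together, the dashboard steps above define an experiment roughly like the following. This is a sketch to show how the pieces fit; the field names are hypothetical and do not reflect Layer's actual schema:

```python
# Hypothetical experiment definition mirroring dashboard steps 2-7.
experiment = {
    "name": "gpt4o-vs-sonnet-cost",                       # step 2
    "hypothesis": "Sonnet cuts cost without hurting quality",
    "type": "parameter_test",                             # step 3
    "base_gate": "support-chat",                          # step 4
    "variants": {
        "A": {"model": "gpt-4o"},        # control
        "B": {"model": "claude-sonnet"}, # treatment
    },
    "goal": {                                             # step 5
        "metric": "avg_cost",
        "direction": "minimize",
        "mde": 0.10,  # 10% minimum detectable effect
    },
    "guardrails": [                                       # step 6
        {"metric": "error_rate", "max": 0.02, "severity": "critical"},
    ],
    "duration": {                                         # step 7
        "max_days": 14,
        "max_requests": 20000,
        "min_samples_per_variant": 100,
    },
}
```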

Traffic Split

Configure what percentage of requests go to each variant. Default is 50/50, but you can adjust (e.g., 90/10 if you want to limit exposure to the experimental variant). Sticky assignment ensures the same user/session always gets the same variant.
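Sticky assignment is commonly implemented by hashing the session identifier, which gives every request from the same user the same variant while the overall population still splits at the configured ratio. A minimal sketch (illustrative only; Layer's actual assignment mechanism isn't documented here):

```python
import hashlib

def assign_variant(session_id, experiment_id, split_b=0.5):
    """Deterministically bucket a session into variant A or B.
    The same (experiment, session) pair always hashes to the same
    bucket, so assignment is sticky across requests."""
    key = f"{experiment_id}:{session_id}".encode()
    h = hashlib.sha256(key).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "B" if bucket < split_b else "A"

# The same session always lands in the same variant:
print(assign_variant("user-42", "exp-1") == assign_variant("user-42", "exp-1"))
```

Lowering `split_b` (e.g. to 0.1 for a 90/10 split) limits how much traffic is exposed to the experimental variant.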

Statistical Framework

Minimum Detectable Effect (MDE)

The smallest improvement that matters to you. If you set MDE to 10%, Layer won’t declare a winner for a 3% improvement — it’s too small to justify the operational overhead of switching.
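The MDE also drives how long an experiment must run: smaller effects need more samples to detect. A rough per-variant sample-size estimate for a mean metric, using the standard two-sample formula at 95% confidence and 80% power (Layer may size experiments differently):

```python
import math
from statistics import NormalDist

def samples_per_variant(baseline_mean, baseline_sd, mde=0.10,
                        alpha=0.05, power=0.8):
    """Per-variant sample size needed to detect a relative MDE on a
    mean metric. Standard two-sample formula; a sketch, not Layer's
    actual sizing logic."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = z.inv_cdf(power)           # power threshold
    delta = mde * baseline_mean      # absolute effect to detect
    return math.ceil(2 * ((z_a + z_b) * baseline_sd / delta) ** 2)

# e.g. avg_cost of $0.010/request with sd $0.004, 10% MDE:
print(samples_per_variant(0.010, 0.004))  # → 252
```

Doubling the MDE to 20% cuts the required sample size roughly fourfold, which is why choosing the smallest effect that actually matters to you is the key sizing decision.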

Confidence Level

Layer uses 95% confidence (p < 0.05): a winner is declared only when a difference as large as the one observed would occur less than 5% of the time by chance alone if the two variants actually performed identically.

Minimum Sample Size

Experiments require a configurable minimum number of requests per variant (default: 100) before winner determination. This prevents premature conclusions from small samples.

Experiment Lifecycle

Status      Description
Draft       Created but not started
Running     Actively splitting traffic and collecting data
Paused      Temporarily halted (can resume)
Stopped     Manually terminated
Completed   Winner determined or duration/request limit reached

Reading Results

The experiment results page shows:
  • Goal metric comparison with statistical significance indicator
  • Guardrail status — all passing, warnings, or violations
  • Diagnostic metrics for additional context
  • Winner recommendation with confidence level and trade-off summary
Results are color-coded:
  • Green — Improvement (meets goal target and MDE threshold)
  • Red — Regression
  • Gray — Not statistically significant

Applying a Winner

When an experiment completes with a clear winner:
  1. Review the results and trade-offs
  2. Click Apply Winner to update your gate with the winning configuration
  3. The gate is updated with the winning model, prompt, and parameter settings
For gate comparison tests, the winning gate’s configuration is applied to the base gate.