
🎬 Trajectory Testing with VCR

This document explains the VCR-style trajectory testing system used to ensure consistency during refactoring of the CollectiveCrossing environment.

📋 Overview

The trajectory testing system records environment interactions (actions → observations, rewards, terminations) and replays them to verify that refactored code produces identical behavior. This prevents regressions during code changes.

⚙️ How It Works

VCR (Video Cassette Recorder) Concept

The system works like a VCR for environment interactions (a code sketch follows the list below):

  1. Record Mode: Capture complete environment state at each step
  2. Replay Mode: Feed the same actions and verify identical outputs
  3. Comparison: Detect any behavioral changes during refactoring
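
In code, the record/replay cycle looks roughly like the sketch below. This is a minimal sketch assuming the multi-agent step API shown later in this document; the actual TrajectoryVCR class adds JSON persistence, config checks, and far more detailed diffing:

def record(env, actions_sequence):
    """Record what the environment produces for a fixed sequence of actions."""
    steps = []
    for actions in actions_sequence:
        observations, rewards, terminated, truncated, infos = env.step(actions)
        steps.append({"actions": actions, "observations": observations, "rewards": rewards})
    return steps

def replay_and_compare(env, recorded_steps):
    """Feed the recorded actions back in and verify the outputs are identical."""
    # Assumes env was reset with the same seed used during recording
    for i, step in enumerate(recorded_steps):
        observations, rewards, terminated, truncated, infos = env.step(step["actions"])
        assert rewards == step["rewards"], f"Reward mismatch at step {i}"
        for agent_id, obs in observations.items():
            assert list(obs) == list(step["observations"][agent_id]), (
                f"Observation mismatch for {agent_id} at step {i}"
            )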

Key Components

  • TrajectoryVCR Class: Main recorder/replayer
  • Golden Baselines: Known good trajectories from working code
  • Version-Specific Trajectories: Track changes across versions
  • JSON Storage: Trajectories stored as structured data files

Directory Structure

tests/fixtures/trajectories/
├── golden/                    # Golden baselines (known good)
│   ├── golden_basic_trajectory.json
│   └── ...
├── current/                   # Current version trajectories
│   ├── test_basic_trajectory.json
│   └── ...
├── v1.0/                      # Version-specific trajectories
│   └── ...
└── v2.0/
    └── ...

Version Control

What to Commit

  • golden/ directory: Golden baselines should be committed to version control
  • 📋 Test files: All test files should be committed

What NOT to Commit

  • current/ directory: Current trajectories are temporary test artifacts
  • Version-specific directories: These are generated during testing

The current/ directory is automatically ignored by .gitignore:

# VCR trajectory test artifacts
tests/fixtures/trajectories/current/

Golden Baseline Lifecycle

  1. Create: Golden baselines are created from known-good code
  2. Commit: Golden baselines are committed to version control
  3. Test: Tests compare current behavior against golden baselines
  4. Update: Golden baselines are updated when behavior intentionally changes, as in the example below
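
Because tests preserve existing golden baselines (see Usage below), one way to perform an intentional update is to delete the stale file and record a fresh one:

# Refresh a golden baseline after an intentional behavior change
rm tests/fixtures/trajectories/golden/golden_basic_trajectory.json
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py::test_create_golden_baseline -v
git add tests/fixtures/trajectories/golden/
git commit -m "Update golden baseline for intentional behavior change"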

Usage

1. Creating Golden Baselines

Golden baselines are trajectories from known good code that serve as reference points.

# Create golden baseline from working code
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py::test_create_golden_baseline -v

When to create golden baselines:

  • Before starting major refactoring
  • After fixing bugs in working code
  • When you have a stable, tested version

Important: Tests now preserve existing golden baselines. They will only create new ones if they don't exist, preventing accidental overwrites.
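
Conceptually, that guard behaves like the sketch below (illustrative only: create_golden_baseline_if_missing is not the actual API, and the exact file naming may differ):

from pathlib import Path

def create_golden_baseline_if_missing(vcr, env, actions_sequence, name):
    """Create a golden baseline only if no file for it exists yet."""
    path = Path("tests/fixtures/trajectories/golden") / f"{name}.json"
    if path.exists():
        # Preserve the committed baseline; never overwrite it from a test run
        return None
    return vcr.create_golden_baseline(env, actions_sequence, name)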

2. Comparing Against Golden Baselines

Compare current code behavior against golden baselines to detect regressions.

# Compare current trajectory with golden baseline
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py::test_golden_baseline_comparison -v

What this catches:

  • Changes in agent behavior
  • Reward calculation changes
  • Termination condition changes
  • Observation space changes

Test Behavior: This test requires the golden baseline to exist and will fail with a clear error message if it's missing.

3. Version-Specific Testing

Track changes across different versions of your code.

# Test version-specific trajectories
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py::test_version_specific_trajectories -v

4. Running All Tests

# Run all trajectory VCR tests
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py -v

Creating New Versions

Step 1: Create Version-Specific VCR

from tests.collectivecrossing.envs.test_trajectory_vcr import TrajectoryVCR

# Create VCR for new version
vcr_new = TrajectoryVCR(version="v2.1")
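
Trajectories recorded through this instance are stored under the matching version directory, tests/fixtures/trajectories/v2.1/ (see Directory Structure above).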

Step 2: Record Trajectories

# Record trajectory for new version
trajectory = vcr_new.record_trajectory(env, actions_sequence, "new_feature_test")

Step 3: Compare with Previous Version

# Compare with the previous version's recording
vcr_old = TrajectoryVCR(version="v2.0")
# old_trajectory is the v2.0 recording of the same scenario; trajectory is the
# v2.1 recording from Step 2
vcr_old._compare_trajectories(old_trajectory, trajectory, "v2.0", "v2.1")
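
Note that _compare_trajectories is a private helper (leading underscore), so this direct usage may break if the test module's internals change.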

Creating Golden Baselines

Method 1: Using Test Functions

# Run the golden baseline creation test
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py::test_create_golden_baseline -v

Method 2: Manual Creation

from tests.collectivecrossing.envs.test_trajectory_vcr import TrajectoryVCR, create_test_environment, generate_deterministic_actions

# Create VCR
vcr = TrajectoryVCR()

# Create environment
env = create_test_environment()
observations, _ = env.reset(seed=42)

# Generate actions
actions_sequence = generate_deterministic_actions(observations, num_steps=20)

# Create golden baseline
trajectory = vcr.create_golden_baseline(env, actions_sequence, "my_golden_baseline")

Method 3: Command Line Script

# Run the manual script
uv run python tests/collectivecrossing/envs/test_trajectory_vcr.py

Trajectory Data Structure

Each trajectory is stored as a JSON file with the following structure:

{
  "config": {
    "width": 10,
    "height": 6,
    "division_y": 3,
    "tram_door_left": 4,
    "tram_door_right": 5,
    "tram_length": 8,
    "num_boarding_agents": 2,
    "num_exiting_agents": 1,
    "exiting_destination_area_y": 0,
    "boarding_destination_area_y": 5
  },
  "initial_observations": {
    "boarding_0": [2, 1, 5, 3, 4, 5, ...],
    "boarding_1": [7, 2, 5, 3, 4, 5, ...],
    "exiting_0": [4, 4, 5, 3, 4, 5, ...]
  },
  "initial_infos": {
    "boarding_0": {"agent_type": "boarding"},
    "boarding_1": {"agent_type": "boarding"},
    "exiting_0": {"agent_type": "exiting"}
  },
  "steps": [
    {
      "step": 0,
      "actions": {
        "boarding_0": 0,
        "boarding_1": 2,
        "exiting_0": 3
      },
      "observations": {
        "boarding_0": [2, 1, 5, 3, 4, 5, ...],
        "boarding_1": [7, 2, 5, 3, 4, 5, ...],
        "exiting_0": [4, 4, 5, 3, 4, 5, ...]
      },
      "next_observations": {
        "boarding_0": [3, 1, 5, 3, 4, 5, ...],
        "boarding_1": [6, 2, 5, 3, 4, 5, ...],
        "exiting_0": [4, 3, 5, 3, 4, 5, ...]
      },
      "next_rewards": {
        "boarding_0": -0.3,
        "boarding_1": -0.4,
        "exiting_0": 0.1
      },
      "next_terminated": {
        "boarding_0": false,
        "boarding_1": false,
        "exiting_0": false,
        "__all__": false
      },
      "next_truncated": {
        "boarding_0": false,
        "boarding_1": false,
        "exiting_0": false,
        "__all__": false
      },
      "next_infos": {
        "boarding_0": {"agent_type": "boarding", "in_tram_area": false, "at_door": false},
        "boarding_1": {"agent_type": "boarding", "in_tram_area": false, "at_door": false},
        "exiting_0": {"agent_type": "exiting", "in_tram_area": true, "at_door": false}
      }
    }
  ]
}
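
Because trajectories are plain JSON, they are easy to load and replay by hand. A minimal sketch, assuming create_test_environment from the test module and a reset with the same seed used during recording:

import json

from tests.collectivecrossing.envs.test_trajectory_vcr import create_test_environment

with open("tests/fixtures/trajectories/golden/golden_basic_trajectory.json") as f:
    trajectory = json.load(f)

env = create_test_environment()
env.reset(seed=42)

for step_data in trajectory["steps"]:
    observations, rewards, terminated, truncated, infos = env.step(step_data["actions"])
    for agent_id, reward in rewards.items():
        # Recorded rewards are plain floats; compare within a small tolerance
        assert abs(reward - step_data["next_rewards"][agent_id]) < 1e-9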

Best Practices

1. 🏆 When to Create Golden Baselines

  • Before major refactoring: Create baselines from stable code
  • After bug fixes: Update baselines to reflect correct behavior
  • Before releases: Ensure baselines represent intended behavior

2. 🎯 Test Coverage

  • Multiple scenarios: Create baselines for different environment configurations
  • Edge cases: Include trajectories that test boundary conditions
  • Common paths: Focus on typical agent behaviors

3. 🔧 Maintenance

  • Regular updates: Update golden baselines when behavior intentionally changes
  • Version control: Commit trajectory files to track changes over time
  • Documentation: Document why baselines were updated

4. 🐛 Debugging

When tests fail, the system provides detailed information:

  • Step-by-step comparison: Shows exactly where trajectories diverge
  • Agent-specific details: Identifies which agents behave differently
  • State differences: Shows observation, reward, and termination differences

🔧 Troubleshooting

Understanding Test Skipping

The VCR testing system is designed to skip tests when required golden baseline files are missing. This is intentional behavior to prevent false failures when baseline data isn't available.

Why tests are skipped:

  • Golden baselines missing: Tests require specific golden baseline files to compare against
  • No comparison data: Without baselines, tests can't verify consistency
  • Prevents false failures: Skipping is better than failing due to missing data

Common skipped tests:

  • test_replay_trajectory: requires test_basic_trajectory.json in the golden directory
  • test_trajectory_consistency: requires consistency_test.json in the golden directory
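
These skips come from a standard pytest guard, presumably something like the following sketch (the real tests may differ in detail):

import pytest
from pathlib import Path

golden = Path("tests/fixtures/trajectories/golden/test_basic_trajectory.json")
if not golden.exists():
    pytest.skip("Golden cassette test_basic_trajectory not found. Create golden baseline first.")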

How to identify what's missing:

# Run tests with verbose output to see skip reasons
uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py -v -rs

# Check what golden baselines exist
ls tests/fixtures/trajectories/golden/

# Check what current trajectories exist
ls tests/fixtures/trajectories/current/

🚨 Common Issues

  1. Missing Golden Baseline

     pytest.skip: Golden baseline test_name not found. Create golden baseline first.

     Solution: Run the golden baseline creation test first.

  2. Tests Being Skipped

     pytest.skip: Golden cassette test_basic_trajectory not found. Create golden baseline first.
     pytest.skip: Golden cassette consistency_test not found. Create golden baseline first.

     Solution: These tests require specific golden baseline files. You can resolve this in one of three ways.

     Option A: Create golden baselines automatically

     # Create all required golden baselines
     uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py::test_create_golden_baseline -v

     Option B: Copy existing current trajectories to golden baselines

     # Copy specific missing files
     cp tests/fixtures/trajectories/current/test_basic_trajectory.json tests/fixtures/trajectories/golden/
     cp tests/fixtures/trajectories/current/consistency_test.json tests/fixtures/trajectories/golden/

     Option C: Check what golden baselines exist

     # List existing golden baselines
     ls tests/fixtures/trajectories/golden/

     # List current trajectories that can be copied
     ls tests/fixtures/trajectories/current/

  3. ⚙️ Config Mismatch

     pytest.fail: Config mismatch between golden and current

     Solution: Ensure the environment configuration matches between recording and replay.

  4. 👁️ Observation Mismatch

     pytest.fail: Observation mismatch for agent_id at step N

     Solution: Check for changes in environment logic that affect agent behavior.

  5. 🔄 Golden Baseline Modified

     git status shows modified golden baseline files.

     💡 Solution: Tests now preserve golden baselines, so local modifications mean either that a test detected a regression (intentional behavior) or that you need to update the baselines for an intentional change. Restore unintended modifications with git restore tests/fixtures/trajectories/golden/.

🛠️ Debugging Commands

# 📋 List available golden baselines
uv run python -c "from tests.collectivecrossing.envs.test_trajectory_vcr import TrajectoryVCR; vcr = TrajectoryVCR(); print('Golden:', vcr.list_golden_baselines())"

# 📋 List current version trajectories
uv run python -c "from tests.collectivecrossing.envs.test_trajectory_vcr import TrajectoryVCR; vcr = TrajectoryVCR(); print('Current:', vcr.list_version_trajectories())"

# 🔍 Inspect trajectory file
cat tests/fixtures/trajectories/golden/golden_basic_trajectory.json | jq '.steps[0]'

🔄 Integration with CI/CD

The trajectory testing system integrates with the GitHub Actions workflow:

# .github/workflows/test.yml
- name: Run trajectory tests
  run: |
    uv run pytest tests/collectivecrossing/envs/test_trajectory_vcr.py -v

This ensures that:

  • ✅ Trajectory consistency is checked on every commit
  • 🚨 Regressions are caught before merging
  • 📝 Behavioral changes are documented and reviewed

🚀 Advanced Usage

🎯 Custom Action Sequences

def custom_policy(observation):
    """Placeholder deterministic policy; replace with your own logic."""
    return 0  # any fixed, deterministic choice keeps replays reproducible

def custom_action_sequence(observations, num_steps):
    """Generate custom deterministic actions for every agent at every step."""
    actions_sequence = []
    for step in range(num_steps):
        actions = {}
        for agent_id in observations.keys():
            # Custom logic here
            actions[agent_id] = custom_policy(observations[agent_id])
        actions_sequence.append(actions)
    return actions_sequence

# Use custom actions
trajectory = vcr.record_trajectory(env, custom_action_sequence(observations, 20), "custom_test")

📋 Multiple Environment Configurations

def test_multiple_configs():
    """Record golden baselines for several environment configurations."""
    configs = [
        {"width": 10, "height": 6, "num_boarding_agents": 2},
        {"width": 15, "height": 8, "num_boarding_agents": 4},
        {"width": 8, "height": 4, "num_boarding_agents": 1},
    ]

    vcr = TrajectoryVCR()
    for i, config in enumerate(configs):
        # create_test_environment_with_config is a hypothetical helper that
        # builds an environment from a config dict
        env = create_test_environment_with_config(config)
        observations, _ = env.reset(seed=42)
        actions = generate_deterministic_actions(observations, num_steps=20)
        trajectory = vcr.create_golden_baseline(env, actions, f"config_{i}")

This trajectory testing system provides robust regression testing for the CollectiveCrossing environment, ensuring that refactoring doesn't introduce behavioral changes. 🎉