5. Synthetic Data Generation for Staging Environment¶
Date: 2026-01-05
Status¶
Accepted
Context¶
Open CIS needs realistic synthetic clinical data for the Railway staging environment to enable:
1. Demonstration and testing without real patient data
2. Frontend development with realistic datasets
3. API testing with varied clinical scenarios
4. Onboarding new contributors with working examples
Problem¶
The current seed script (scripts/seed.py) only creates basic patient demographics (MRN, name, birth date). It lacks:
- Clinical observations (vital signs, lab results)
- Encounter histories
- Medications and diagnoses
- Longitudinal patient journeys
- Realistic value distributions
For a staging environment, we need synthetic data that:
- Matches our openEHR templates
- Generates clinically plausible values
- Can be easily recreated for each deployment
- Requires minimal external dependencies
- Works with our EHRbase + FastAPI stack
Research: Available Solutions¶
1. MapEHR¶
Website: https://mapehr.com/
Type: Commercial/Proprietary
Features:
- Purpose-built for openEHR synthetic data generation
- YAML-based rules with LOINC/SNOMED codes
- Faker library integration for demographics
- Statistical distributions for clinical values (e.g., randomNormalDistribution())
- Supports complex calculations (BMI from height/weight)
- Works with OPT2 templates
Limitations:
- ❌ Not publicly available (no GitHub, npm, or PyPI package)
- ❌ Website blocks automated access (403 errors)
- ❌ No pricing information publicly available
- ❌ Requires vendor contact for access
- ❌ Proprietary product with potential licensing costs
- ⚠️ Template compatibility unclear (we use OPT 1.4, MapEHR uses OPT2)
Status: Unavailable for immediate use
2. openFHIR¶
Website: https://open-fhir.com/
Type: Commercial (trial available)
Features:
- Docker-based FHIR ↔ openEHR mapping engine
- YAML mapping rules (nearly identical to MapEHR)
- Bidirectional conversion
- Sandbox available (sandbox.open-fhir.com)
Limitations:
- ❌ Requires trial license request
- ❌ Commercial product (pricing unknown)
- ❌ Focused on FHIR mapping, not synthetic data generation
- ⚠️ Would need separate data source (like Synthea)
Status: Could explore for future FHIR integration
3. Synthea¶
Repository: https://github.com/synthetichealth/synthea
Type: Open Source (Apache 2.0)
Features:
- ✅ Industry-standard synthetic patient generator
- ✅ Generates realistic longitudinal patient histories
- ✅ Exports FHIR R4, STU3, C-CDA, CSV
- ✅ 1M+ free synthetic records available
- ✅ Actively maintained (3.5k+ commits)
- ✅ Docker images available
- ✅ Clinically validated scenarios
Limitations:
- ❌ No direct openEHR export (FHIR only)
- ⚠️ Requires FHIR → openEHR conversion layer
- ⚠️ Additional dependency (Java-based)
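Integration Path: Synthea → FHIR R4 export → FHIR-to-openEHR conversion layer (for example ehrbase/fhir-bridge, option 4 below) → EHRbase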
Status: Viable option but adds complexity
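To give a feel for what the conversion layer has to do, the sketch below pulls heart-rate readings out of a FHIR R4 bundle of the kind Synthea exports. It is only an illustration under stated assumptions: the function name `heart_rates_from_bundle` is hypothetical, the output keys reuse the illustrative flat keys from this ADR rather than verified template paths, and a real pipeline would more likely delegate this mapping to fhir-bridge.

```python
LOINC_SYSTEM = "http://loinc.org"
HEART_RATE_LOINC = "8867-4"  # LOINC code for heart rate


def heart_rates_from_bundle(bundle: dict) -> list[dict]:
    """Extract heart-rate Observations from a FHIR R4 Bundle (parsed JSON)."""
    results = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue
        codings = resource.get("code", {}).get("coding", [])
        is_heart_rate = any(
            c.get("system") == LOINC_SYSTEM and c.get("code") == HEART_RATE_LOINC
            for c in codings
        )
        quantity = resource.get("valueQuantity", {})
        if not is_heart_rate or "value" not in quantity:
            continue
        results.append({
            # Assumption: illustrative flat keys, not verified template paths.
            "vital_signs/pulse_rate": quantity["value"],
            "vital_signs/time": resource.get("effectiveDateTime"),
        })
    return results


example_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {
            "resourceType": "Observation",
            "code": {"coding": [{"system": LOINC_SYSTEM, "code": HEART_RATE_LOINC}]},
            "valueQuantity": {"value": 72, "unit": "/min"},
            "effectiveDateTime": "2025-12-20T09:30:00+00:00",
        }},
    ],
}
readings = heart_rates_from_bundle(example_bundle)
```

Even this single observation type needs code filtering and field mapping, which is the complexity the status line above refers to.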
4. ehrbase/fhir-bridge¶
Repository: https://github.com/ehrbase/fhir-bridge
Type: Open Source
Features:
- ✅ Official EHRbase component
- ✅ Converts FHIR → openEHR compositions
- ✅ Actively maintained
Limitations:
- ❌ Only handles conversion, not data generation
- ⚠️ Must be paired with Synthea or similar
Status: Complementary tool, not standalone solution
5. Custom Python Script¶
Implementation: Enhanced scripts/seed.py
Features:
- ✅ Full control over data generation
- ✅ Uses Faker for realistic demographics
- ✅ Direct integration with existing ehrbase_client
- ✅ No external service dependencies
- ✅ Can be customized for specific test scenarios
- ✅ Railway-ready (no additional infrastructure)
- ✅ Works with existing templates (OPT 1.4)
- ✅ Simple to maintain and extend
Limitations:
- ⚠️ Manual work to create realistic clinical scenarios
- ⚠️ Need to define value ranges ourselves
- ⚠️ Less sophisticated than specialized tools
- ⚠️ No longitudinal patient journeys (initially)
Status: Immediately implementable
Decision¶
We will implement Option 5: Custom Python Seed Script using Faker and manual composition building for synthetic data generation in the staging environment.
Implementation Approach¶
# scripts/seed.py (enhanced)
import asyncio
from datetime import datetime, timedelta, timezone
from random import randint, uniform

from faker import Faker
import httpx

fake = Faker()


async def create_synthetic_patient_with_vitals():
    # 1. Create patient with Faker demographics
    patient = {
        "mrn": fake.unique.bothify(text="MRN-####"),
        "given_name": fake.first_name(),
        "family_name": fake.last_name(),
        # date_of_birth() returns a date object; serialize it for JSON
        "birth_date": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
    }

    # 2. Create vital signs composition (illustrative flat keys)
    vital_signs = {
        "ctx/language": "en",
        "ctx/territory": "US",
        "vital_signs/blood_pressure/systolic": randint(90, 140),
        "vital_signs/blood_pressure/diastolic": randint(60, 90),
        "vital_signs/pulse_rate": randint(60, 100),
        # Round to one decimal place, as a thermometer would report
        "vital_signs/body_temperature": round(uniform(36.1, 37.5), 1),
        # Timezone-aware timestamp so the offset is explicit
        "vital_signs/time": datetime.now(timezone.utc).isoformat(),
    }

    # 3. Post to API
    # ... (existing patient creation logic)
    # ... (new composition creation via ehrbase_client)
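The ranges above draw systolic and diastolic pressure independently, so an implausible reading such as 90/90 is possible. The following is a minimal sketch of one way to keep pairs plausible and make runs reproducible. It uses only the standard library; the name `make_vitals` and the 25 mmHg minimum gap are illustrative assumptions, not part of the existing script.

```python
import random
from datetime import datetime, timezone


def make_vitals(rng: random.Random) -> dict:
    """Build one vital-signs payload using the flat keys from the example above."""
    systolic = rng.randint(90, 140)
    # Assumption: keep diastolic at least 25 mmHg below systolic so the pair
    # stays plausible; the cap of 90 matches the range used above.
    diastolic = rng.randint(60, min(90, systolic - 25))
    return {
        "ctx/language": "en",
        "ctx/territory": "US",
        "vital_signs/blood_pressure/systolic": systolic,
        "vital_signs/blood_pressure/diastolic": diastolic,
        "vital_signs/pulse_rate": rng.randint(60, 100),
        "vital_signs/body_temperature": round(rng.uniform(36.1, 37.5), 1),
        "vital_signs/time": datetime.now(timezone.utc).isoformat(),
    }


# A fixed seed makes every run generate the same clinical values.
rng = random.Random(42)
vitals = make_vitals(rng)
```

Passing the `random.Random` instance in, rather than using the module-level functions, keeps the generator deterministic even if other code also draws random numbers.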
Railway Deployment Integration¶
Railway provides several approaches for running seed scripts during deployment:
Option 1: Dockerfile CMD with Chained Commands (Current Approach)¶
We already use this pattern for migrations in api/Dockerfile, where the CMD chains prisma migrate deploy before starting uvicorn.
For seeding, we can extend this to:
CMD sh -c "prisma migrate deploy && python scripts/seed_staging.py && uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000}"
Pros:
- ✅ Runs automatically on every deployment
- ✅ Consistent with existing migration pattern
- ✅ No Railway configuration changes needed
- ✅ Works for all Railway environments
Cons:
- ⚠️ Runs on every container start (including restarts)
- ⚠️ Requires idempotent seed script
- ⚠️ Can't easily disable for production
Option 2: railway.toml startCommand¶
Configure per-environment start commands in api/railway.toml:
[deploy]
startCommand = "prisma migrate deploy && python scripts/seed_staging.py && uvicorn src.main:app --host 0.0.0.0 --port $PORT"
healthcheckPath = "/health"
Pros:
- ✅ Overrides Dockerfile CMD
- ✅ Can be environment-specific (different Railway projects for staging/prod)
- ✅ No Dockerfile changes needed
Cons:
- ⚠️ Configuration split between Dockerfile and railway.toml
- ⚠️ Must remember to set for staging environment only
Option 3: Conditional Seeding Based on Environment Variable¶
Add environment variable check in Dockerfile:
CMD sh -c "prisma migrate deploy && \
if [ \"$RAILWAY_ENVIRONMENT\" = \"staging\" ]; then python scripts/seed_staging.py; fi && \
uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000}"
Pros:
- ✅ Single Dockerfile works for all environments
- ✅ Automatic based on Railway environment
- ✅ No accidental production seeding
Cons:
- ⚠️ More complex shell script in CMD
- ⚠️ Requires setting RAILWAY_ENVIRONMENT variable
Recommended Approach: Option 3 (Conditional + Idempotent)¶
We'll use conditional seeding based on environment variable with an idempotent seed script that:
- Checks if data exists: Only seed if patient count < threshold
- Uses unique identifiers: MRNs that won't conflict with real data
- Handles existing records gracefully: Skip or update, don't fail
- Runs quickly: Complete in <10 seconds to avoid deployment timeout
# scripts/seed_staging.py
import asyncio
import os


async def should_seed() -> bool:
    """Only seed if staging environment and data doesn't exist."""
    if os.getenv("RAILWAY_ENVIRONMENT") != "staging":
        return False
    # get_patient_count() is assumed to be defined elsewhere in this script
    patient_count = await get_patient_count()
    return patient_count < 5  # Threshold for re-seeding


async def main():
    if not await should_seed():
        print("Skipping seed (not staging or data exists)")
        return
    print("Seeding staging data...")
    # ... seed logic


if __name__ == "__main__":
    asyncio.run(main())
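One way to meet the "skip or update, don't fail" requirement is to derive the synthetic MRNs deterministically and create only the ones that are missing. The sketch below is a pure function under assumed names: `plan_seed` and the `STG-` prefix are hypothetical, and how the existing MRNs are fetched from the API is left out because this ADR does not specify it.

```python
def plan_seed(existing_mrns: set[str], target_count: int, prefix: str = "STG-") -> list[str]:
    """Return the synthetic MRNs that still need to be created."""
    wanted = [f"{prefix}{n:04d}" for n in range(1, target_count + 1)]
    return [mrn for mrn in wanted if mrn not in existing_mrns]


# First run: nothing exists yet, so every MRN is planned.
print(plan_seed(set(), 3))  # ['STG-0001', 'STG-0002', 'STG-0003']

# Re-run after a partial seed: only the missing MRN is planned.
print(plan_seed({"STG-0001", "STG-0003"}, 3))  # ['STG-0002']
```

Because the plan depends only on what already exists, a container restart in the middle of seeding resumes where it left off instead of failing on duplicates.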
Scope¶
Initial implementation (for staging deployment):
- Patient demographics (10-20 synthetic patients)
- Vital signs observations (2-5 per patient)
- Realistic value ranges based on clinical norms
- Timestamps spread over recent weeks
- Idempotent execution (safe to run multiple times)
- Environment-aware (staging only)
Future enhancements (as needed):
- Diagnoses and problem lists
- Medication orders
- Lab results
- Encounter histories
- Longitudinal data (multiple observations over time)
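The "2-5 observations per patient, spread over recent weeks" item in the initial scope could be sketched as follows. The function name `observation_times` and the four-week window are assumptions chosen for illustration; the ADR only says "recent weeks".

```python
import random
from datetime import datetime, timedelta, timezone


def observation_times(rng, now, weeks=4, min_count=2, max_count=5):
    """Return between min_count and max_count sorted timestamps from the last `weeks` weeks."""
    count = rng.randint(min_count, max_count)
    window_seconds = int(timedelta(weeks=weeks).total_seconds())
    offsets = [rng.randint(0, window_seconds) for _ in range(count)]
    return sorted(now - timedelta(seconds=s) for s in offsets)


# Passing `now` explicitly keeps the output reproducible for a given seed.
now = datetime(2026, 1, 5, 12, 0, tzinfo=timezone.utc)
times = observation_times(random.Random(1), now)
```

Each returned timestamp can then be used as the observation time of one vital-signs composition for the patient.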
Rationale¶
Why Custom Python Script?¶
- Immediate Availability: No vendor contact, licensing, or trial requests needed
- Zero Additional Dependencies: Uses existing stack (Python, httpx, Faker)
- Railway Compatibility: Simple script, no additional services/containers
- Full Control: Customize data to match our specific templates and scenarios
- Maintainability: ~200 lines of Python vs integrating external systems
- Educational Value: For a learning project, understanding data structure is valuable
- Sufficient for Staging: We don't need complex patient journeys yet
- Incremental Enhancement: Can add complexity as needs grow
Why Not MapEHR (Now)?¶
- Unavailable: Not publicly accessible, no clear path to obtain
- Unknown Cost: Could require commercial license
- Overkill: We need 10-20 patients with basic vitals, not thousands with complex histories
- Template Compatibility: Unclear if our OPT 1.4 templates work with OPT2-focused tool
Note: We will explore MapEHR/openFHIR for plausibility research once we need:
- More sophisticated clinical scenarios
- Standardized data generation patterns
- Complex multi-system patient histories
- FHIR integration capabilities
Why Not Synthea + fhir-bridge?¶
- Complexity: Adds Java dependency (Synthea) + conversion layer (fhir-bridge)
- Deployment Overhead: Two additional services on Railway
- Learning Curve: Need to understand FHIR → openEHR mapping
- Overkill for V1: Synthea generates years of patient history; we need basic vitals
Note: Synthea remains a strong option if we need realistic longitudinal data later.
Consequences¶
Positive¶
- ✅ Fast to implement: Can be done in ~2 hours
- ✅ No blockers: No vendor contacts, licenses, or external approvals
- ✅ Simple deployment: Runs as part of the container start command on Railway, or manually as a script
- ✅ Transparent: Full visibility into what data is generated
- ✅ Customizable: Easy to adjust for specific test scenarios
- ✅ No ongoing costs: No licensing fees or API usage charges
- ✅ Git-friendly: Seed script logic versioned in repository
Negative¶
- ⚠️ Manual value ranges: Must research clinical norms ourselves
- ⚠️ Limited sophistication: No statistical distributions or complex calculations initially
- ⚠️ Maintenance burden: Must update script as templates evolve
- ⚠️ No FHIR integration: Can't easily test FHIR workflows
Neutral¶
- 🔄 Incremental approach: Can migrate to specialized tools later
- 🔄 Educational trade-off: More hands-on work, more learning
Mitigation Strategies¶
To address the negative consequences:
- Clinical Value Research: Reference medical guidelines for realistic ranges
- Template Helpers: Create reusable composition builders
- Seed Data Versioning: Store generated datasets as JSON for reproducibility
- Railway Integration: Use conditional environment-based seeding. Set RAILWAY_ENVIRONMENT=staging in the Railway staging project's environment variables.
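If uniform draws turn out to look too flat, the researched ranges can be kept in one table and sampled from a clamped normal distribution instead. This is a minimal sketch, not a clinical model: the ranges simply mirror those in the Implementation Approach example, and the names `REFERENCE_RANGES` and `sample_clamped_normal` as well as the "range is about mean ± 2 standard deviations" rule are assumptions.

```python
import random

# Illustrative adult resting ranges (low, high). These mirror the ranges used
# in the Implementation Approach example and are not taken from a guideline.
REFERENCE_RANGES = {
    "pulse_rate": (60, 100),
    "body_temperature": (36.1, 37.5),
}


def sample_clamped_normal(rng: random.Random, low: float, high: float) -> float:
    """Draw from a normal distribution centred on the range, clamped to it."""
    mean = (low + high) / 2
    # Assumption: treat the range as roughly mean +/- 2 standard deviations.
    sigma = (high - low) / 4
    return min(high, max(low, rng.gauss(mean, sigma)))


rng = random.Random(0)
pulse = round(sample_clamped_normal(rng, *REFERENCE_RANGES["pulse_rate"]))
temperature = round(sample_clamped_normal(rng, *REFERENCE_RANGES["body_temperature"]), 1)
```

Keeping the ranges in a single dictionary also gives the clinical value research one obvious place to land as it is refined.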
Alternatives Considered¶
1. Wait for MapEHR Access¶
Rejected: No timeline for when/if we could obtain access. Blocks staging deployment.
2. Manual Data Entry via UI¶
Rejected: Not reproducible, time-consuming, doesn't scale.
3. Commit Pre-generated JSON Compositions¶
Rejected: Less flexible than generation script, harder to customize.
4. Use Synthea Now¶
Rejected: Over-engineering for current needs. Can revisit when we need complex scenarios.
Future Exploration: MapEHR/openFHIR¶
While we're implementing the custom script now, we will explore MapEHR and openFHIR for plausibility research to:
- Understand YAML mapping patterns: Learn industry-standard approaches
- Evaluate OPT2 compatibility: Assess if our templates need updates
- Compare data quality: See how specialized tools generate distributions
- Assess FHIR integration: Understand conversion patterns for future needs
Action items:
- [ ] Contact MapEHR vendor for trial access information
- [ ] Request openFHIR trial license
- [ ] Document findings in separate research document
- [ ] Evaluate migration path if tools prove valuable
This exploration is non-blocking and runs in parallel with the custom script implementation.
Migration Path¶
If we adopt MapEHR/openFHIR or Synthea in the future:
- Script Remains Useful: Custom script can generate quick test data during development
- Incremental Adoption: Can use both approaches (script for quick tests, MapEHR for staging)
- Template Evolution: Learning from YAML patterns can improve our manual builders
- FHIR Bridge: If we add FHIR support, Synthea + fhir-bridge becomes attractive
The custom script is not wasted effort: it is a pragmatic V1 that unblocks progress.
Related¶
- ADR-0001: Use openEHR for Clinical Data
- ADR-0003: openEHR Template Management
- ADR-0004: Direct httpx openEHR Integration
- Current seed script: scripts/seed.py
- Vital signs template: api/templates/IDCR - Vital Signs Encounter.v1.opt
References¶
Synthetic Data Tools¶
- MapEHR Documentation: https://mapehr.com/docs/synthetic-data/
- openFHIR: https://open-fhir.com/
- Synthea: https://github.com/synthetichealth/synthea
- ehrbase/fhir-bridge: https://github.com/ehrbase/fhir-bridge
- Faker (Python): https://faker.readthedocs.io/
- WHO Vital Signs Guidelines: https://www.who.int/data/gho/indicator-metadata-registry/imr-details/3155