Key Impact
- 4.1M+ rows consolidated from 4 data sources
- 244 tables processed with full ETL pipeline
- Reporting reduced from ~1 month to on-demand
- Business users self-serve real-time analytics
The Problem
The county public health organization was drowning in data but starving for insights. They collected health data from multiple sources:
- Disease surveillance systems
- Immunization registries
- Vital statistics databases
- Partner organization feeds
A monthly report took approximately one month to produce. The irony wasn't lost on anyone.
The Solution
I built a modern data lake with automated ETL pipelines, giving the organization instant access to unified, clean data.
Architecture: The Medallion Approach
I implemented a bronze/silver/gold medallion architecture:
Bronze Layer (Raw)
Raw data lands exactly as received from source systems. No transformations, no cleaning. Just a faithful record of what came in. This provides a full audit trail and enables reprocessing if transformation logic changes.
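A minimal sketch of what that landing step can look like. The storage root, partitioning scheme, and sidecar-metadata format here are illustrative assumptions, not the production layout:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

BRONZE_ROOT = Path("/datalake/bronze")  # hypothetical storage root

def land_raw(source_id: str, payload: bytes) -> Path:
    """Persist a payload byte-for-byte, with enough metadata to audit and replay."""
    now = datetime.now(timezone.utc)
    partition = BRONZE_ROOT / source_id / now.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(payload).hexdigest()
    raw_path = partition / f"{digest}.raw"
    raw_path.write_bytes(payload)  # no transformation, no cleaning

    # Sidecar metadata supports the audit trail and later reprocessing.
    meta = {"source": source_id, "received_at": now.isoformat(), "sha256": digest}
    (partition / f"{digest}.meta.json").write_text(json.dumps(meta))
    return raw_path
```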
Silver Layer (Cleaned)
Data undergoes standardization and quality checks (a sketch follows the list):
- Schema normalization across sources
- Deduplication with configurable matching rules
- Null resolution and default handling
- Data type enforcement and validation
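Here is a minimal sketch of those checks using pandas. The rename map, key columns, and defaults are illustrative stand-ins for the real per-source mappings:

```python
import pandas as pd

# Per-source rename map and defaults; the real mappings are source-specific.
COLUMN_MAP = {"DOB": "birth_date", "pat_name": "patient_name"}
DEFAULTS = {"county": "UNKNOWN"}

def to_silver(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.rename(columns=COLUMN_MAP)  # schema normalization
    # Data type enforcement and validation
    df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")
    df["patient_name"] = df["patient_name"].astype("string").str.strip().str.upper()
    # Exact-key deduplication (fuzzy matching happens downstream)
    df = df.drop_duplicates(subset=["patient_name", "birth_date"], keep="last")
    return df.fillna(DEFAULTS)  # null resolution and default handling
```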
Gold Layer (Curated)
Aggregated, joined, and optimized for specific use cases (sketched after the list):
- Pre-computed metrics and KPIs
- Dimensional models for reporting
- Cross-source entity resolution
- Time-series aggregations
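As one concrete example, a gold table can be a pre-computed weekly case-count KPI. The table and column names below are assumptions for illustration:

```python
import pandas as pd

def build_weekly_cases(silver: pd.DataFrame) -> pd.DataFrame:
    """Roll silver-layer surveillance rows up into a weekly case-count KPI table."""
    return (
        silver
        .assign(week=silver["report_date"].dt.to_period("W").dt.start_time)
        .groupby(["week", "disease", "zip_code"], as_index=False)
        .agg(case_count=("case_id", "nunique"))
        .sort_values(["disease", "week"])
    )
```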
ETL Pipeline
The pipeline handles the gnarly work that used to consume weeks of staff time:
Deduplication
Matching records across sources with fuzzy logic, handling name variations, address formats, and missing identifiers. Configurable confidence thresholds let the organization tune precision vs. recall.
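A sketch of how threshold-tuned matching can work, with the standard library's difflib standing in for the pipeline's actual matcher. Field names and weights are illustrative:

```python
from difflib import SequenceMatcher

def match_score(a: dict, b: dict) -> float:
    """Blend name and address similarity into a single confidence score."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    addr = SequenceMatcher(None, a["address"].lower(), b["address"].lower()).ratio()
    return 0.6 * name + 0.4 * addr  # weights are tunable assumptions

def is_same_entity(a: dict, b: dict, threshold: float = 0.85) -> bool:
    # Raising the threshold favors precision; lowering it favors recall.
    return match_score(a, b) >= threshold
```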
Null Resolution
Smart defaults and cross-source filling. If one source has demographic data another lacks, the pipeline merges intelligently while tracking data provenance.
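A sketch of that merge logic; the record shape and source labels are assumptions:

```python
def merge_with_provenance(primary: dict, secondary: dict,
                          primary_src: str, secondary_src: str) -> dict:
    """Fill gaps in the primary record from the secondary,
    recording where each value came from."""
    merged, provenance = {}, {}
    for field in primary.keys() | secondary.keys():
        if primary.get(field) is not None:
            merged[field], provenance[field] = primary[field], primary_src
        else:
            merged[field], provenance[field] = secondary.get(field), secondary_src
    merged["_provenance"] = provenance
    return merged
```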
Cross-Source Aggregation
Unified views across all 244 tables, with proper handling of different reporting periods, geographic granularities, and categorical definitions.
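For the reporting-period mismatch specifically, one approach is to resample finer-grained sources up to a common grain. A sketch, reusing the weekly KPI columns from the gold-layer example above:

```python
import pandas as pd

def align_to_month(weekly: pd.DataFrame) -> pd.DataFrame:
    """Resample weekly case counts into month-start buckets
    so they join cleanly with a source that reports monthly."""
    monthly = weekly.set_index("week")["case_count"].resample("MS").sum()
    monthly.index.name = "month"
    return monthly.reset_index()
```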
Self-Service Analytics
The data lake powers a web application where business users run their own queries. No more waiting for IT. No more month-long report cycles.
Features include:
- Pre-built report templates for common queries
- Custom filtering and drill-down
- Export to Excel/CSV for further analysis
- Scheduled report delivery via email
Technical Implementation
Serverless ETL: Azure Functions handle pipeline orchestration. Pay-per-execution keeps costs low while scaling automatically during batch processing windows.
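A minimal sketch of what that orchestration entry point can look like in the Azure Functions v2 Python model; the schedule and the run_pipeline stand-in are assumptions:

```python
import logging
import azure.functions as func

app = func.FunctionApp()

def run_pipeline() -> None:
    """Stand-in for the real bronze -> silver -> gold orchestration."""
    logging.info("pipeline steps would run here")

@app.schedule(schedule="0 0 2 * * *", arg_name="timer")  # 02:00 UTC nightly
def nightly_etl(timer: func.TimerRequest) -> None:
    if timer.past_due:
        logging.warning("Timer fired late; previous run may have overrun.")
    run_pipeline()
```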
Incremental Processing: Change data capture identifies new/modified records, so we're not reprocessing millions of rows daily.
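A sketch of the watermark pattern behind that; the state-file location, table, and modified_at column are assumptions:

```python
import json
from pathlib import Path

STATE = Path("/datalake/_state/immunizations.watermark.json")  # assumed location

def load_watermark() -> str:
    if STATE.exists():
        return json.loads(STATE.read_text())["last_modified"]
    return "1970-01-01T00:00:00"  # first run falls back to a full load

def save_watermark(ts: str) -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"last_modified": ts}))

def incremental_query() -> str:
    # Assumes the source table carries a trustworthy modified_at column.
    return f"SELECT * FROM immunizations WHERE modified_at > '{load_watermark()}'"
```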
Data Quality Monitoring: Automated checks flag anomalies such as sudden drops in record counts, unexpected nulls, and schema drift from source systems.
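A sketch of checks in that spirit; the expected schema and the 50% drop threshold are illustrative assumptions:

```python
import pandas as pd

EXPECTED_COLUMNS = {"case_id", "disease", "report_date", "zip_code"}

def quality_flags(batch: pd.DataFrame, prior_count: int) -> list[str]:
    flags = []
    # Sudden drop in record counts
    if prior_count and len(batch) < 0.5 * prior_count:
        flags.append(f"row count fell from {prior_count} to {len(batch)}")
    # Unexpected nulls in required fields
    for col in ("case_id", "report_date"):
        if batch[col].isna().any():
            flags.append(f"nulls in required column {col!r}")
    # Schema drift from the source system
    drift = EXPECTED_COLUMNS.symmetric_difference(batch.columns)
    if drift:
        flags.append(f"schema drift: {sorted(drift)}")
    return flags
```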
Security: Role-based access controls ensure staff only see data appropriate to their function. Full audit logging for compliance.
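A sketch of that access pattern, with an audit record written on every check; the role-to-table map is an illustrative assumption:

```python
import logging
from datetime import datetime, timezone

ROLE_TABLES = {
    "epidemiologist": {"disease_surveillance", "vital_statistics"},
    "immunization_staff": {"immunization_registry"},
}

audit = logging.getLogger("audit")

def check_access(user: str, role: str, table: str) -> bool:
    """Allow or deny a table read, writing an audit record either way."""
    allowed = table in ROLE_TABLES.get(role, set())
    audit.info("%s user=%s role=%s table=%s allowed=%s",
               datetime.now(timezone.utc).isoformat(), user, role, table, allowed)
    return allowed
```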
Results
The transformation was dramatic:
- 4.1M+ rows consolidated into a unified, queryable data lake
- 244 tables processed through automated ETL pipelines
- Monthly reporting cycles reduced to on-demand queries
- Staff time redirected from data wrangling to actual public health work