Key Impact
- 4.1M+ rows consolidated from 4 data sources
- 244 tables processed with full ETL pipeline
- Reporting reduced from ~1 month to on-demand
- Business users self-serve real-time analytics
The Problem
The county public health organization was drowning in data but starving for insights. They collected health data from multiple sources:
- Disease surveillance systems
- Immunization registries
- Vital statistics databases
- Partner organization feeds
A monthly report took approximately one month to produce. The irony wasn't lost on anyone.
The Solution
I built a modern data lake with automated ETL pipelines, giving the organization instant access to unified, clean data.
Architecture: The Medallion Approach
I implemented a bronze/silver/gold medallion architecture:
Bronze Layer (Raw)
Raw data lands exactly as received from source systems. No transformations, no cleaning. Just a faithful record of what came in. This provides a full audit trail and enables reprocessing if transformation logic changes.
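A minimal sketch of what that landing step can look like. The storage root, partitioning scheme, and sidecar-metadata format here are illustrative assumptions, not the production layout:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

BRONZE_ROOT = Path("/datalake/bronze")  # hypothetical storage root

def land_raw(source_id: str, payload: bytes) -> Path:
    """Persist a payload byte-for-byte, with enough metadata to audit and replay."""
    now = datetime.now(timezone.utc)
    partition = BRONZE_ROOT / source_id / now.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(payload).hexdigest()
    raw_path = partition / f"{digest}.raw"
    raw_path.write_bytes(payload)  # no transformation, no cleaning

    # Sidecar metadata supports the audit trail and later reprocessing.
    meta = {"source": source_id, "received_at": now.isoformat(), "sha256": digest}
    (partition / f"{digest}.meta.json").write_text(json.dumps(meta))
    return raw_path
```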
Silver Layer (Cleaned)
Data undergoes standardization and quality checks (a sketch follows the list):
- Schema normalization across sources
- Deduplication with configurable matching rules
- Null resolution and default handling
- Data type enforcement and validation
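Here is a minimal sketch of those checks using pandas. The rename map, key columns, and defaults are illustrative stand-ins for the real per-source mappings:

```python
import pandas as pd

# Per-source rename map and defaults; the real mappings are source-specific.
COLUMN_MAP = {"DOB": "birth_date", "pat_name": "patient_name"}
DEFAULTS = {"county": "UNKNOWN"}

def to_silver(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.rename(columns=COLUMN_MAP)  # schema normalization
    # Data type enforcement and validation
    df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")
    df["patient_name"] = df["patient_name"].astype("string").str.strip().str.upper()
    # Exact-key deduplication (fuzzy matching happens downstream)
    df = df.drop_duplicates(subset=["patient_name", "birth_date"], keep="last")
    return df.fillna(DEFAULTS)  # null resolution and default handling
```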
Gold Layer (Curated)
Aggregated, joined, and optimized for specific use cases (sketched after the list):
- Pre-computed metrics and KPIs
- Dimensional models for reporting
- Cross-source entity resolution
- Time-series aggregations
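As one concrete example, a gold table can be a pre-computed weekly case-count KPI. The table and column names below are assumptions for illustration:

```python
import pandas as pd

def build_weekly_cases(silver: pd.DataFrame) -> pd.DataFrame:
    """Roll silver-layer surveillance rows up into a weekly case-count KPI table."""
    return (
        silver
        .assign(week=silver["report_date"].dt.to_period("W").dt.start_time)
        .groupby(["week", "disease", "zip_code"], as_index=False)
        .agg(case_count=("case_id", "nunique"))
        .sort_values(["disease", "week"])
    )
```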
ETL Pipeline
The pipeline handles the gnarly work that used to consume weeks of staff time:
Deduplication
Matching records across sources with fuzzy logic, handling name variations, address formats, and missing identifiers. Configurable confidence thresholds let the organization tune precision vs. recall.
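A sketch of how threshold-tuned matching can work, with the standard library's difflib standing in for the pipeline's actual matcher. Field names and weights are illustrative:

```python
from difflib import SequenceMatcher

def match_score(a: dict, b: dict) -> float:
    """Blend name and address similarity into a single confidence score."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    addr = SequenceMatcher(None, a["address"].lower(), b["address"].lower()).ratio()
    return 0.6 * name + 0.4 * addr  # weights are tunable assumptions

def is_same_entity(a: dict, b: dict, threshold: float = 0.85) -> bool:
    # Raising the threshold favors precision; lowering it favors recall.
    return match_score(a, b) >= threshold
```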
Null Resolution
Smart defaults and cross-source filling. If one source has demographic data another lacks, the pipeline merges intelligently while tracking data provenance.
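A sketch of that merge logic; the record shape and source labels are assumptions:

```python
def merge_with_provenance(primary: dict, secondary: dict,
                          primary_src: str, secondary_src: str) -> dict:
    """Fill gaps in the primary record from the secondary,
    recording where each value came from."""
    merged, provenance = {}, {}
    for field in primary.keys() | secondary.keys():
        if primary.get(field) is not None:
            merged[field], provenance[field] = primary[field], primary_src
        else:
            merged[field], provenance[field] = secondary.get(field), secondary_src
    merged["_provenance"] = provenance
    return merged
```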
Cross-Source Aggregation
Unified views across all 244 tables, with proper handling of different reporting periods, geographic granularities, and categorical definitions.
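For the reporting-period mismatch specifically, one approach is to resample finer-grained sources up to a common grain. A sketch, reusing the weekly KPI columns from the gold-layer example above:

```python
import pandas as pd

def align_to_month(weekly: pd.DataFrame) -> pd.DataFrame:
    """Resample weekly case counts into month-start buckets
    so they join cleanly with a source that reports monthly."""
    monthly = weekly.set_index("week")["case_count"].resample("MS").sum()
    monthly.index.name = "month"
    return monthly.reset_index()
```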
Self-Service Analytics
The data lake powers a web application where business users run their own queries. No more waiting for IT. No more month-long report cycles.
Features include:
- Pre-built report templates for common queries
- Custom filtering and drill-down
- Export to Excel/CSV for further analysis
- Scheduled report delivery via email
Technical Implementation
Serverless ETL: Azure Functions handle pipeline orchestration. Pay-per-execution keeps costs low while scaling automatically during batch processing windows.
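A minimal sketch of what that orchestration entry point can look like in the Azure Functions v2 Python model; the schedule and the run_pipeline stand-in are assumptions:

```python
import logging
import azure.functions as func

app = func.FunctionApp()

def run_pipeline() -> None:
    """Stand-in for the real bronze -> silver -> gold orchestration."""
    logging.info("pipeline steps would run here")

@app.schedule(schedule="0 0 2 * * *", arg_name="timer")  # 02:00 UTC nightly
def nightly_etl(timer: func.TimerRequest) -> None:
    if timer.past_due:
        logging.warning("Timer fired late; previous run may have overrun.")
    run_pipeline()
```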
Incremental Processing: Change data capture identifies new/modified records, so we're not reprocessing millions of rows daily.
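A sketch of the watermark pattern behind that; the state-file location, table, and modified_at column are assumptions:

```python
import json
from pathlib import Path

STATE = Path("/datalake/_state/immunizations.watermark.json")  # assumed location

def load_watermark() -> str:
    if STATE.exists():
        return json.loads(STATE.read_text())["last_modified"]
    return "1970-01-01T00:00:00"  # first run falls back to a full load

def save_watermark(ts: str) -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"last_modified": ts}))

def incremental_query() -> str:
    # Assumes the source table carries a trustworthy modified_at column.
    return f"SELECT * FROM immunizations WHERE modified_at > '{load_watermark()}'"
```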
Data Quality Monitoring: Automated checks flag anomalies such as sudden drops in record counts, unexpected nulls, and schema drift from source systems.
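A sketch of checks in that spirit; the expected schema and the 50% drop threshold are illustrative assumptions:

```python
import pandas as pd

EXPECTED_COLUMNS = {"case_id", "disease", "report_date", "zip_code"}

def quality_flags(batch: pd.DataFrame, prior_count: int) -> list[str]:
    flags = []
    # Sudden drop in record counts
    if prior_count and len(batch) < 0.5 * prior_count:
        flags.append(f"row count fell from {prior_count} to {len(batch)}")
    # Unexpected nulls in required fields
    for col in ("case_id", "report_date"):
        if batch[col].isna().any():
            flags.append(f"nulls in required column {col!r}")
    # Schema drift from the source system
    drift = EXPECTED_COLUMNS.symmetric_difference(batch.columns)
    if drift:
        flags.append(f"schema drift: {sorted(drift)}")
    return flags
```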
Security: Role-based access controls ensure staff only see data appropriate to their function. Full audit logging for compliance.
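A sketch of that access pattern, with an audit record written on every check; the role-to-table map is an illustrative assumption:

```python
import logging
from datetime import datetime, timezone

ROLE_TABLES = {
    "epidemiologist": {"disease_surveillance", "vital_statistics"},
    "immunization_staff": {"immunization_registry"},
}

audit = logging.getLogger("audit")

def check_access(user: str, role: str, table: str) -> bool:
    """Allow or deny a table read, writing an audit record either way."""
    allowed = table in ROLE_TABLES.get(role, set())
    audit.info("%s user=%s role=%s table=%s allowed=%s",
               datetime.now(timezone.utc).isoformat(), user, role, table, allowed)
    return allowed
```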
Results
The transformation was dramatic:
- 4.1M+ rows consolidated into a unified, queryable data lake
- 244 tables processed through automated ETL pipelines
- Monthly reporting cycles reduced to on-demand queries
- Staff time redirected from data wrangling to actual public health work