Healthcare
Technical Co-Founder, Augmented AI Labs

Data Lake & Analytics Platform

Built a data lake consolidating 4.1M+ rows from 4 data sources for a public health organization. Reduced reporting from ~1 month of manual work to on-demand queries.

Key Impact

  • 4.1M+ rows consolidated from 4 data sources
  • 244 tables processed with full ETL pipeline
  • Reporting reduced from ~1 month to on-demand
  • Business users self-serve real-time analytics

The Problem

The county public health organization was drowning in data, but starving for insights. They collected health data from multiple sources:

  • Disease surveillance systems
  • Immunization registries
  • Vital statistics databases
  • Partner organization feeds

Each source had its own format, its own identifiers, its own quirks. When leadership needed a report (say, vaccination rates by demographic across the county), staff spent weeks manually pulling data, deduplicating records, resolving conflicts, and building spreadsheets.

A monthly report took approximately one month to produce. The irony wasn't lost on anyone.

The Solution

I built a modern data lake with automated ETL pipelines, giving the organization instant access to unified, clean data.

Architecture: The Medallion Approach

I implemented a bronze/silver/gold medallion architecture:

Bronze Layer (Raw)

Raw data lands exactly as received from source systems. No transformations, no cleaning. Just a faithful record of what came in. This provides a full audit trail and enables reprocessing if the transformation logic changes.
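A minimal sketch of the bronze landing step, to make "exactly as received" concrete. It is an illustration under assumptions, not the production code: a local directory stands in for Azure Blob Storage, and the function name `land_raw`, the folder layout, and the sidecar fields are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw(payload: bytes, source: str, filename: str, root: Path) -> Path:
    """Store a source file byte-for-byte, plus a sidecar describing its arrival."""
    received_at = datetime.now(timezone.utc)
    # Partition by source and arrival date so any batch can be found and replayed.
    target_dir = root / "bronze" / source / received_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / filename
    target.write_bytes(payload)  # written exactly as received: no parsing, no cleaning

    # Hypothetical sidecar: enough metadata to audit what arrived and when.
    sidecar = {
        "source": source,
        "received_at": received_at.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }
    (target_dir / f"{filename}.meta.json").write_text(json.dumps(sidecar, indent=2))
    return target
```

Keeping the checksum next to the untouched file is one way to show later that a reprocessed batch started from the same bytes that originally came in.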

Silver Layer (Cleaned)

Data undergoes standardization and quality checks:

  • Schema normalization across sources
  • Deduplication with configurable matching rules
  • Null resolution and default handling
  • Data type enforcement and validation

Gold Layer (Business-Ready)

Aggregated, joined, and optimized for specific use cases:

  • Pre-computed metrics and KPIs
  • Dimensional models for reporting
  • Cross-source entity resolution
  • Time-series aggregations
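As an illustration of a pre-computed gold metric, the sketch below rebuilds a small aggregate table from silver data. It uses Python's built-in `sqlite3` as a stand-in for Azure SQL Database, and the table and column names (`silver_person`, `age_group`, `is_vaccinated`, `gold_vaccination_rate_by_age_group`) are hypothetical.

```python
import sqlite3


def build_vaccination_rate_table(conn: sqlite3.Connection) -> None:
    """Rebuild a gold table of vaccination rates by age group from silver data."""
    conn.execute("DROP TABLE IF EXISTS gold_vaccination_rate_by_age_group")
    conn.execute(
        """
        CREATE TABLE gold_vaccination_rate_by_age_group AS
        SELECT age_group,
               COUNT(*)           AS residents,
               SUM(is_vaccinated) AS vaccinated,
               ROUND(100.0 * SUM(is_vaccinated) / COUNT(*), 1) AS rate_pct
        FROM silver_person
        GROUP BY age_group
        """
    )
    conn.commit()
```

Because the aggregate is materialized ahead of time, a report query only has to read a handful of rows instead of scanning person-level records.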

ETL Pipeline

The pipeline handles the gnarly work that used to consume weeks of staff time:

Deduplication

Fuzzy matching of records across sources, handling name variations, address formats, and missing identifiers. Configurable confidence thresholds let the organization tune precision vs. recall.
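A hedged sketch of threshold-based fuzzy matching, using only the standard library's `difflib.SequenceMatcher`. The record fields (`name`, `dob`), the scoring rule, and the default threshold are assumptions for illustration; a real matcher would weigh more fields, such as addresses.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so 'Garcia, Maria' == 'maria garcia'."""
    tokens = name.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(sorted(tokens))


def match_score(a: dict, b: dict) -> float:
    """Similarity in [0, 1]. A missing date of birth does not block a match."""
    if a.get("dob") and b.get("dob") and a["dob"] != b["dob"]:
        return 0.0  # both sources report a date of birth and they disagree
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()


def find_duplicates(records: list, threshold: float = 0.9) -> list:
    """Return index pairs whose score meets the configurable threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match_score(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Raising the threshold favors precision (fewer false merges); lowering it favors recall (fewer missed duplicates). The pairwise loop is quadratic, so a sketch like this would need blocking or indexing before being run over millions of rows.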

Null Resolution

Smart defaults and cross-source filling. If one source has demographic data another lacks, the pipeline merges intelligently while tracking data provenance.
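One way to picture cross-source filling with provenance is a priority-ordered merge that records which source supplied each field. This is a sketch under assumptions: the source names, the field names, and the "first non-empty value wins" rule are hypothetical.

```python
def merge_with_provenance(records_by_source: dict, priority: list) -> tuple:
    """Fill each field from the highest-priority source that has a value.

    Returns (merged_record, provenance), where provenance maps each field
    to the name of the source it was taken from.
    """
    merged, provenance = {}, {}
    for source in priority:
        record = records_by_source.get(source, {})
        for field, value in record.items():
            # Treat None and empty strings as missing; keep the first real value.
            if field not in merged and value not in (None, ""):
                merged[field] = value
                provenance[field] = source
    return merged, provenance
```

For example, if the immunization registry lacks a demographic field that the vital statistics source has, the merged record carries that value and the provenance map says where it came from.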

Cross-Source Aggregation

Unified views across all 244 tables, with proper handling of different reporting periods, geographic granularities, and categorical definitions.
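The sketch below shows the general shape of aligning sources before aggregating: counts reported on different schedules are rolled up to a common month, and source-specific labels are mapped to one shared category. The mapping `CATEGORY_MAP` and the row layout are hypothetical, and assigning a period to the month of its start date is a simplification (a week that spans two months is not split).

```python
from collections import defaultdict
from datetime import date

# Hypothetical mapping from source-specific labels to a shared definition.
CATEGORY_MAP = {
    "flu": "influenza",
    "Influenza A": "influenza",
    "pertussis": "pertussis",
}


def to_month(period_start: date) -> str:
    """Assign a daily or weekly reporting period to the month it starts in."""
    return f"{period_start.year:04d}-{period_start.month:02d}"


def monthly_counts(rows) -> dict:
    """rows: iterable of (period_start, source_category, count) tuples."""
    totals = defaultdict(int)
    for period_start, category, count in rows:
        unified = CATEGORY_MAP.get(category, "unknown")  # surface unmapped labels
        totals[(to_month(period_start), unified)] += count
    return dict(totals)
```

Sending unmapped labels to an explicit "unknown" bucket keeps them visible in the output rather than silently dropping them.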

Self-Service Analytics

The data lake powers a web application where business users run their own queries. No more waiting for IT. No more month-long report cycles.

Features include:

  • Pre-built report templates for common queries
  • Custom filtering and drill-down
  • Export to Excel/CSV for further analysis
  • Scheduled report delivery via email

Technical Implementation

Serverless ETL: Azure Functions handle pipeline orchestration. Pay-per-execution keeps costs low while scaling automatically during batch processing windows.

Incremental Processing: Change data capture identifies new/modified records, so we're not reprocessing millions of rows daily.
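To illustrate the idea of processing only what changed, here is a simple hash-comparison sketch. It is a stand-in for change data capture rather than a description of the mechanism actually used: each row's content hash is compared with the hash stored from the previous run. The key column `id` and the in-memory state dictionary are assumptions.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable fingerprint of a row's content, independent of key order."""
    encoded = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def detect_changes(rows: list, known_hashes: dict, key: str = "id") -> tuple:
    """Return (new_or_modified_rows, updated_hash_state)."""
    changed = []
    state = dict(known_hashes)  # copy so the caller's state is not mutated
    for row in rows:
        fingerprint = row_hash(row)
        if known_hashes.get(row[key]) != fingerprint:
            changed.append(row)
            state[row[key]] = fingerprint
    return changed, state
```

On the first run every row counts as new; on later runs only rows whose content differs from the stored fingerprint flow into the rest of the pipeline.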

Data Quality Monitoring: Automated checks flag anomalies such as sudden drops in record counts, unexpected nulls, and schema drift from source systems.
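A minimal sketch of the three kinds of checks named above, run against one incoming batch. The function name, the thresholds (`max_drop`, `max_null_rate`), and the message formats are hypothetical defaults, not the values used in production.

```python
def check_batch(rows, expected_columns, previous_count, max_drop=0.5, max_null_rate=0.1):
    """Return a list of human-readable issues; an empty list means the batch passed."""
    issues = []

    # Sudden drop in record count relative to the previous load.
    if previous_count and len(rows) < previous_count * (1 - max_drop):
        issues.append(f"record count dropped from {previous_count} to {len(rows)}")

    # Schema drift: columns that disappeared or appeared unexpectedly.
    expected = set(expected_columns)
    seen = set()
    for row in rows:
        seen.update(row)
    if rows and expected - seen:
        issues.append(f"missing columns: {sorted(expected - seen)}")
    if seen - expected:
        issues.append(f"unexpected columns: {sorted(seen - expected)}")

    # Unexpected nulls in columns that are present.
    for column in sorted(expected & seen):
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls / len(rows) > max_null_rate:
            issues.append(f"{column}: {nulls}/{len(rows)} values are null")
    return issues
```

A scheduler could run a check like this after each load and alert when the returned list is non-empty.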

Security: Role-based access controls ensure staff only see data appropriate to their function. Full audit logging for compliance.
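The combination of role-based access and audit logging can be sketched as a single authorization check that records every decision. The roles, dataset names, and log fields below are hypothetical, and an in-memory list stands in for a durable audit store.

```python
from datetime import datetime, timezone

# Hypothetical role-to-dataset grants.
ROLE_DATASETS = {
    "epidemiologist": {"disease_surveillance", "immunization"},
    "vital_records_clerk": {"vital_statistics"},
}


def authorize(user: str, role: str, dataset: str, audit_log: list) -> bool:
    """Decide whether a role may read a dataset, and log the decision either way."""
    allowed = dataset in ROLE_DATASETS.get(role, set())
    audit_log.append(
        {
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "dataset": dataset,
            "allowed": allowed,
        }
    )
    return allowed
```

Logging denied requests as well as granted ones means the audit trail shows attempted access, not just successful reads.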

Results

The transformation was dramatic:

  • 4.1M+ rows consolidated into a unified, queryable data lake
  • 244 tables processed through automated ETL pipelines
  • Monthly reporting cycles reduced to on-demand queries
  • Staff time redirected from data wrangling to actual public health work

Delivered in 12 weeks from kickoff to production, including data migration and user training.

Technologies Used

  • React
  • Flask
  • Azure Functions
  • Azure SQL Database
  • Azure Blob Storage
