2025

GCP Healthcare Revenue Cycle Management

Role

Data Engineer & Analyst

Year

2025

Tech & Tools

Google Cloud Platform, Databricks, SQL, Python, PowerBI

Overview

This project involved building a data lake in Google Cloud Platform (GCP) for Revenue Cycle Management (RCM) in the healthcare domain. The goal was to centralize, clean, and transform data from multiple sources, enabling healthcare providers and insurance companies to streamline billing, claims processing, and revenue tracking.

Key Outcomes

Built end-to-end data lake pipeline on GCP
Centralized billing and claims data from multiple healthcare sources
Enabled real-time revenue tracking dashboards in PowerBI

GCP Services Used

This project leverages multiple GCP services to build an efficient and scalable RCM Data Lake:

Google Cloud Storage (GCS): Stores raw and processed data files.
BigQuery: Serves as the analytical engine for storing and querying structured data.
Dataproc: Used for large-scale data processing with Apache Spark.
Cloud Composer (Apache Airflow): Automates ETL pipelines and workflow orchestration.
Cloud SQL (MySQL): Stores transactional Electronic Medical Records (EMR) data.
GitHub & Cloud Build: Provide version control and CI/CD, automating deployment of the data processing and ETL workflows.

Key Techniques Involved

This project follows best practices to ensure scalability, reliability, and efficiency:

Metadata-Driven Approach

Drives ETL pipelines from configuration files instead of hardcoded logic, so adding a table or source means adding a config entry rather than changing code.
Reduces manual intervention and increases automation.
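
As a minimal sketch of the metadata-driven idea, the snippet below reads a small config and builds one extraction query per active entry. The config columns (load_type, watermark_column, is_active), the table rows, and both function names are assumptions made for illustration; the project's actual config layout is not shown here.

```python
import csv
import io

# Hypothetical pipeline metadata: one row per table to ingest.
# Column names and rows are illustrative, not taken from the project.
CONFIG_CSV = """database,table,load_type,watermark_column,is_active
hospital_a_db,patients,full,,1
hospital_a_db,encounters,incremental,modified_date,1
hospital_b_db,transactions,incremental,modified_date,0
"""


def load_config(text):
    """Parse the metadata and keep only the active entries."""
    rows = csv.DictReader(io.StringIO(text))
    return [row for row in rows if row["is_active"] == "1"]


def build_extract_query(entry, last_watermark=None):
    """Build the extraction query for one config entry."""
    query = f"SELECT * FROM {entry['database']}.{entry['table']}"
    # Incremental loads only pull rows newer than the last stored watermark.
    if entry["load_type"] == "incremental" and last_watermark is not None:
        query += f" WHERE {entry['watermark_column']} > '{last_watermark}'"
    return query


for entry in load_config(CONFIG_CSV):
    print(build_extract_query(entry, last_watermark="2025-01-01"))
```

Switching a table off or changing it from full to incremental load is then a one-line config change, with no code edit.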

Slowly Changing Dimensions (SCD) Type 2

Tracks historical changes in dimension tables (e.g., patient details, transaction updates).
Maintains data history by adding new records instead of overwriting existing data.
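
The SCD Type 2 behaviour described above can be sketched in plain Python. This is a simplified in-memory illustration, not the project's implementation (which would more likely run as a Spark or BigQuery merge); the column names start_date, end_date, and is_current and the function apply_scd2 are assumptions.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, today):
    """Return a new dimension list with SCD Type 2 applied.

    Changed rows are closed out (end_date set, is_current False) and a
    new current version is appended; unchanged rows are left alone.
    """
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[col] == new[col] for col in tracked):
            continue  # nothing changed, keep the existing current version
        if old is not None:
            old["end_date"] = today
            old["is_current"] = False
        version = {key: new[key]}
        version.update({col: new[col] for col in tracked})
        version.update({"start_date": today, "end_date": None, "is_current": True})
        result.append(version)
        current[new[key]] = version
    return result


dimension = [
    {"patient_id": "P1", "address": "12 Oak St",
     "start_date": date(2024, 1, 1), "end_date": None, "is_current": True},
]
incoming = [
    {"patient_id": "P1", "address": "9 Elm Ave"},   # changed address
    {"patient_id": "P2", "address": "4 Pine Rd"},   # new patient
]
updated = apply_scd2(dimension, incoming, "patient_id", ["address"], date(2025, 3, 1))
for row in updated:
    print(row)
```

After the call, patient P1 has two rows: the old address with an end date and the new address marked current, so reports can be run as of any point in time.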

Common Data Model (CDM)

Standardizes data across multiple hospitals to maintain consistency.
Maps raw data into a unified schema for easier analytics and interoperability.
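
A small sketch of what mapping into a common model can look like: each hospital gets its own column map, and records from either source come out in one schema. The source column names, the unified field names, and the composite patient_key are all hypothetical, chosen only to show the pattern.

```python
# Hypothetical per-hospital column maps: source column -> unified column.
COLUMN_MAPS = {
    "hospital_a": {"PatientID": "patient_id", "F_Name": "first_name", "DOB": "date_of_birth"},
    "hospital_b": {"ID": "patient_id", "FirstName": "first_name", "BirthDate": "date_of_birth"},
}


def to_common_model(record, source):
    """Rename a raw record's columns into the unified schema."""
    mapping = COLUMN_MAPS[source]
    unified = {target: record[src] for src, target in mapping.items()}
    # IDs can collide across hospitals, so build a key that includes the source.
    unified["patient_key"] = f"{unified['patient_id']}-{source}"
    unified["source"] = source
    return unified


row_a = to_common_model({"PatientID": "77", "F_Name": "Ana", "DOB": "1990-05-01"}, "hospital_a")
row_b = to_common_model({"ID": "77", "FirstName": "Raj", "BirthDate": "1984-11-23"}, "hospital_b")
print(row_a)
print(row_b)
```

Both records share the same column names afterwards, and the two patients with ID 77 stay distinguishable through patient_key.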

Medallion Architecture (Bronze → Silver → Gold)

Bronze Layer: Stores raw, unprocessed data.
Silver Layer: Cleanses, transforms, and enriches data.
Gold Layer: Stores aggregated and business-ready data for reporting.
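
The three layers can be illustrated with a toy, in-memory example. Real layers here would be files in GCS and tables in BigQuery; the sample rows, field names, and the functions to_silver and to_gold are assumptions for the sketch.

```python
from collections import defaultdict

# Bronze: raw rows exactly as they arrived (strings, duplicates, gaps).
bronze = [
    {"transaction_id": "T1", "hospital": "hospital_a", "amount": " 120.50 "},
    {"transaction_id": "T1", "hospital": "hospital_a", "amount": " 120.50 "},  # duplicate
    {"transaction_id": "T2", "hospital": "hospital_b", "amount": "80"},
    {"transaction_id": "", "hospital": "hospital_b", "amount": "15"},          # missing key
]


def to_silver(rows):
    """Silver: drop rows without a key, remove duplicates, cast types."""
    seen = set()
    silver = []
    for row in rows:
        transaction_id = row["transaction_id"].strip()
        if not transaction_id or transaction_id in seen:
            continue
        seen.add(transaction_id)
        silver.append({
            "transaction_id": transaction_id,
            "hospital": row["hospital"].strip(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(rows):
    """Gold: aggregate to a reporting-ready total per hospital."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["hospital"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping bronze untouched means a bad cleaning rule can be fixed and the silver and gold layers rebuilt without re-extracting from the sources.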

Other Best Practices

Logging & Monitoring: Tracks pipeline execution, errors, and performance metrics.
Error Handling: Robust mechanisms to capture, log, and resolve errors during ingestion.
CI/CD: Automates deployment using GitHub + Cloud Build.
Data Validation: Checks for missing values, incorrect formats, and duplicates.
Access Controls: IAM roles and permissions to secure sensitive data.
Compliance: Adheres to HIPAA and other healthcare regulations.
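
The data validation item above names three checks: missing values, incorrect formats, and duplicates. A minimal sketch of those checks follows; the claim fields, the ISO date rule, and the validate function are assumptions for illustration rather than the project's actual rules.

```python
import re
from collections import Counter

# Assumed format rule for this sketch: dates must look like YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate(rows, key, required, date_fields):
    """Return a list of (row_index, field, problem) tuples."""
    issues = []
    key_counts = Counter(row.get(key) for row in rows)
    for index, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                issues.append((index, field, "missing"))
        for field in date_fields:
            value = row.get(field)
            if value and not DATE_PATTERN.fullmatch(value):
                issues.append((index, field, "bad_format"))
        if row.get(key) and key_counts[row[key]] > 1:
            issues.append((index, key, "duplicate"))
    return issues


claims = [
    {"claim_id": "C1", "service_date": "2025-01-15", "amount": "200"},
    {"claim_id": "C2", "service_date": "15/01/2025", "amount": "50"},
    {"claim_id": "C1", "service_date": "2025-01-16", "amount": ""},
]
issues = validate(claims, key="claim_id", required=["claim_id", "amount"],
                  date_fields=["service_date"])
for issue in issues:
    print(issue)
```

Returning the issues instead of raising on the first one lets a pipeline log every problem in a file and decide afterwards whether to quarantine rows or fail the run.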

What is RCM?

Revenue Cycle Management (RCM) is the financial process that healthcare providers use to track patient care episodes from registration and appointment scheduling to the final payment of a balance. This ensures smooth operations by managing patient details, billing, insurance claims, and payments efficiently.

RCM Process Breakdown

Patient Visit Initiation: Patient details and insurance information are collected to identify who is responsible for payment.
Service Provision: Includes daily checkups, treatments, surgeries, and other medical services.
Billing Generation: The hospital generates an itemized bill for the services provided.
Claims Review: Insurance companies review the bill and may accept it in full, pay part of it, or reject it.
Payments & Follow-ups: If the insurer covers only part of the bill, the patient may pay the remainder. Hospitals follow up on outstanding payments.
Tracking & Improvement: Monitors the cycle end to end to keep the provider financially sustainable while maintaining quality care.
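
The claims review and payment steps reduce to simple arithmetic: whatever the insurer does not pay remains outstanding. The sketch below works through that under simplifying assumptions (a single insurer, no adjustments or write-offs); the function settle_claim and the status labels are hypothetical.

```python
from decimal import Decimal


def settle_claim(billed, insurer_paid):
    """Classify the insurer's decision and compute the remaining balance."""
    billed = Decimal(billed)
    insurer_paid = Decimal(insurer_paid)
    if insurer_paid < 0 or insurer_paid > billed:
        raise ValueError("insurer payment must be between 0 and the billed amount")
    if insurer_paid == billed:
        status = "accepted"
    elif insurer_paid == 0:
        status = "rejected"
    else:
        status = "partial"
    # The remainder is what the hospital still has to collect or follow up on.
    return status, billed - insurer_paid


status, balance = settle_claim("1000.00", "750.00")
print(status, balance)
```

Decimal is used instead of float so that currency amounts do not pick up binary rounding errors.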

Data Sources

EMR (Electronic Medical Records) — Cloud SQL

Two hospitals with separate databases (hospital_a_db, hospital_b_db), each storing data on patients, providers, departments, transactions, and encounters.

Claims Data

Arrives from insurance companies as flat files, delivered monthly to the data lake landing zone.

Public APIs

CPT Codes: Standardized system to describe medical, surgical, and diagnostic procedures.
NPI Data: Unique identifiers for healthcare providers.
ICD Codes: Standardized disease codes and descriptions.

Data Engineering Workflow

Extract: PySpark jobs for EMR data, flat file ingestion for claims, API calls for CPT/NPI/ICD.
Transform: Clean and standardize data, convert into structured fact and dimension tables.
Load: Store fact and dimension tables in BigQuery for analytics and reporting.
Orchestrate: Manage all job dependencies and execution order.
Report: Provide data for dashboards and KPIs to optimize revenue performance.
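
The orchestration step amounts to running jobs in an order that respects their dependencies. In this project that is Cloud Composer's role; the sketch below only illustrates the idea with Python's standard-library graphlib (Python 3.9+) and is not an Airflow DAG. The job names and the dependency graph are assumptions derived from the workflow steps listed above.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs that must finish first.
DEPENDENCIES = {
    "extract_emr": set(),
    "ingest_claims": set(),
    "fetch_codes": set(),
    "transform": {"extract_emr", "ingest_claims", "fetch_codes"},
    "load_bigquery": {"transform"},
    "refresh_report": {"load_bigquery"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

The three extract jobs have no dependencies on each other, so a scheduler is free to run them in parallel before the transform step starts.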

Expected Outcomes

Efficient Data Pipeline: Automated ingestion and transformation of RCM data.
Structured Data Warehouse: Gold tables in BigQuery for analytical queries.
KPI Dashboards: Insights into revenue collection, claims processing efficiency, and financial trends.