Development

šŸ”„ Merging Data from Multiple Sources into One Clean Pipeline Using Google Cloud Dataflow

Complete guide to building unified data pipelines with Google Cloud Dataflow and BigQuery

data-pipelinegoogle-clouddataflowbigqueryetldata-integrationpub-subbusiness-intelligencedata-mergestreaming
Dezoko Team
•
February 9, 2025
•
6 min read

Table of Contents

šŸ”„ Merging Data from Multiple Sources into One Clean Pipeline Using Google Cloud Dataflow

šŸ”„ Merging Data from Multiple Sources into One Clean Pipeline Using Google Cloud Dataflow


In today's fragmented digital ecosystem, data lives everywhere:


  • Meta Ads Manager
  • Google Analytics 4
  • Jira & Trello
  • Figma, Adobe Analytics
  • Stripe, HubSpot, LinkedIn Ads

The challenge?


> "We can't make decisions because our data is scattered across 10 different platforms."


At Dezoko, we specialize in building unified data pipelines -- and for one of our clients, we used Google Cloud Dataflow to merge 10+ external data sources into a single, clean dataset in BigQuery.


šŸ’” Use Case: Unified Business Intelligence


Our client -- a multi-brand marketing agency -- wanted:

  • āœ…Centralized reporting across all client accounts
  • āœ…Clean, deduplicated, and normalized data
  • āœ…Near real-time updates
  • āœ…Unified dashboards (across ad platforms, analytics, and CRMs)

šŸ“ Architecture Overview


+------------------+     +------------------+  
|  Meta API        |     |  Google Ads API  |  
+------------------+     +------------------+  
        ↓                       ↓  
+------------------+     +------------------+  
|  LinkedIn API    |     |  Jira Webhooks   |  
+------------------+     +------------------+  
        ↓                       ↓  
   [Cloud Functions / Cloud Run]  
               ↓  
        [Pub/Sub Topic]  
               ↓  
        [Cloud Dataflow Job]  ← handles parsing, transformation, joins
               ↓  
          [BigQuery Table] ← Final unified schema  
               ↓  
       [Looker Studio Dashboards]

āš™ļø Tools We Used


Tool
Purpose
Cloud Dataflow
Streaming ETL pipeline for transformation & joins
Pub/Sub
Scalable event-driven ingestion layer
Cloud Functions
Lightweight API fetchers from sources
BigQuery
Central storage, analytics, and dashboard source
Cloud Scheduler
Triggers daily fetches or webhook verifications
Secret Manager
API tokens, auth credentials

šŸ”„ Step-by-Step: Our Data Merge Strategy


1ļøāƒ£ Data Collection via API/Webhooks


We set up:


  • Pull-based fetchers (e.g., Meta, LinkedIn, Google Ads)
  • Webhooks for push-based sources (e.g., Jira, Trello)
  • Cron jobs for periodic data where APIs lacked webhooks
  • All payloads pushed to Pub/Sub

2ļøāƒ£ Streaming with Cloud Dataflow


We built a streaming pipeline with:


  • Source: Pub/Sub subscription
  • Transforms:

  • Schema mapping to internal standard
  • Deduplication using unique keys
  • Data enrichment (adding campaign names, client IDs)
  • Error handling (dead-letter queues + logs)
  • Sink: Merged data into BigQuery unified table

Sample Join Example:


  • Match Meta Ad data with Google Analytics session
  • Combine Jira ticket tags with Adobe user behavior
  • Join Stripe transactions with campaign IDs

3ļøāƒ£ Normalization & Schema Enforcement


Raw Field (Meta)
Normalized Field
`campaign_name`
`campaign`
`impressions`
`ad_impressions`
`date_start`
`event_date`
`client_id`
`customer_id`

We created a single standard schema to support cross-source queries.


4ļøāƒ£ Real-Time Monitoring + Observability


  • Cloud Logging: for fetch failures and API rate issues
  • BigQuery Audit Logs: track each load
  • Pub/Sub dead-letter: route errors to Firestore for retry
  • Slack alerts for pipeline failure or delays

šŸ“ˆ Outcome for the Client


Metric
Before
After
Data Sources Connected
2 (manually)
10+ (automated)
Update Frequency
Weekly manual export
Realtime / hourly
Reporting Time
~6 hrs/week
0 hrs
Errors / Gaps
High
Monitored & alertable
Decision Confidence
Low
High -- unified view

šŸ” Enterprise-Ready Features We Implemented


  • āœ…Token rotation with Secret Manager
  • āœ…Per-client data partitioning in BigQuery
  • āœ…IAM roles for audit-level access
  • āœ…Multi-region support for latency reduction
  • āœ…End-to-end encryption + logging

šŸ’¬ What the Client Said


> "We now see everything -- ads, product behavior, and campaigns -- in one place. Your team made our data useful again."

> -- VP of Marketing Analytics


> "Before, we had tools. Now, we have insights."

> -- CEO, Martech SaaS


šŸ“ž Want to Merge Your Disconnected Data into Google Cloud?


We help you:

  • āœ…Ingest from any API, webhook, or file
  • āœ…Clean + map data with Dataflow
  • āœ…Store in BigQuery, Firestore, or Cloud Storage
  • āœ…Monitor + visualize with Looker or custom dashboards
  • āœ…Automate 100% of the pipeline


Get a free consultation