
Serverless API Data Ingestion in Google BigQuery: Part 1 (Introduction)

 

Ingesting API Data in Google BigQuery the Serverless way!


API To Google BigQuery


In the era of cloud computing, serverless has become a buzzword we keep hearing about, and eventually we become convinced that serverless is the way to go for companies of all sizes because of its advantages. The basic advantages of the serverless approach are:

  • No Server Management
  • Scalability
  • Pay as you go

In this article, we will explore how we can use the serverless approach to build a data ingestion pipeline in Google Cloud.


Serverless Offerings In GCP

GCP offers plenty of serverless services across various areas, such as the ones mentioned below.

  • Computing: Cloud Run, Cloud Functions, App Engine
  • Data warehouse: Google BigQuery
  • Object storage: Google Cloud Storage
  • Workflow management: Cloud Workflows
  • Scheduler: Cloud Scheduler

Technically, the combination of the above tools is enough to build an API data ingestion pipeline in GCP.

We can use two patterns to ingest API data into Google BigQuery.


Business Requirement

Let's say you have a business requirement to ingest/scrape data from various online sources and collect the top headlines for each day. These headlines will be ingested into Google BigQuery, and once a day a scheduled query will run on BigQuery to calculate which media outlet published the most headlines for that day. A sketch of that daily aggregation follows below.
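To make the requirement concrete, here is a minimal sketch of the kind of daily aggregation that scheduled query might perform. The project, dataset, table, and column names (my-project.news.headlines, source, published_at) are placeholder assumptions for illustration, not taken from this series.

```python
# Hypothetical daily aggregation: which outlet published the most headlines today?
# Table and column names are placeholders; adjust them to your own schema.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  source,
  COUNT(*) AS headline_count
FROM `my-project.news.headlines`
WHERE DATE(published_at) = CURRENT_DATE()
GROUP BY source
ORDER BY headline_count DESC
LIMIT 1
"""

for row in client.query(QUERY).result():
    print(f"{row.source} published {row.headline_count} headlines today")
```

In practice this query would be registered as a BigQuery scheduled query rather than run from a script, but the SQL stays the same.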

Pattern 1

Technical Design
We can build this data ingestion pipeline in many ways, but if you have decided to go serverless, the pipeline below might be a good approach.

  • A Cloud Function is a good place to write the API request and response handling code, and we can use the Google BigQuery streaming insert API to write each news headline into BigQuery (see the sketch below).
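As a rough illustration of Pattern 1, here is a minimal Cloud Function sketch. The headlines API endpoint, destination table, and field names are placeholder assumptions; the real code and configuration will follow in Part 2.

```python
# Pattern 1 sketch: HTTP-triggered Cloud Function that fetches headlines and
# streams them into BigQuery. API URL, table ID, and field names are placeholders.
import requests
from google.cloud import bigquery

bq_client = bigquery.Client()
TABLE_ID = "my-project.news.headlines"  # hypothetical destination table


def ingest_headlines(request):
    response = requests.get("https://example.com/api/top-headlines")  # hypothetical API
    response.raise_for_status()
    articles = response.json().get("articles", [])

    rows = [
        {
            "source": article.get("source"),
            "title": article.get("title"),
            "published_at": article.get("publishedAt"),
        }
        for article in articles
    ]

    # Streaming insert: rows are available for querying almost immediately,
    # but this API is billed for the data inserted.
    errors = bq_client.insert_rows_json(TABLE_ID, rows)
    if errors:
        return f"Streaming insert errors: {errors}", 500
    return f"Inserted {len(rows)} headlines", 200
```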

Pattern 2

Technical Design
In this approach, we break the ingestion and insert tasks into separate steps. The API request and response handling code is still written in a Cloud Function, but all the news headline records are written to a file in GCS instead of being streamed into BigQuery via the insert API. A BigQuery batch load job then loads all the data from the GCS file into BigQuery, as sketched below.
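Here is a similarly hedged sketch of Pattern 2, split into two functions: one that writes the headlines to a GCS file as newline-delimited JSON, and one that batch loads that file into BigQuery. The bucket, table, and API names are placeholder assumptions.

```python
# Pattern 2 sketch: ingest to GCS first, then batch load into BigQuery.
# Bucket name, table ID, and API URL are placeholder assumptions.
import json

import requests
from google.cloud import bigquery, storage

BUCKET = "my-headlines-bucket"           # hypothetical GCS bucket
TABLE_ID = "my-project.news.headlines"   # hypothetical destination table
OBJECT_PATH = "headlines/latest.json"


def ingest_to_gcs(request):
    """Step 1: fetch headlines and write them to GCS as newline-delimited JSON."""
    response = requests.get("https://example.com/api/top-headlines")  # hypothetical API
    response.raise_for_status()
    articles = response.json().get("articles", [])

    ndjson = "\n".join(json.dumps(article) for article in articles)
    blob = storage.Client().bucket(BUCKET).blob(OBJECT_PATH)
    blob.upload_from_string(ndjson, content_type="application/json")
    return f"Wrote {len(articles)} headlines to gs://{BUCKET}/{OBJECT_PATH}", 200


def load_to_bigquery(request):
    """Step 2: batch load the GCS file into BigQuery (no streaming insert charges)."""
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = bigquery.Client().load_table_from_uri(
        f"gs://{BUCKET}/{OBJECT_PATH}", TABLE_ID, job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
    return "Load job completed", 200
```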

We will orchestrate the data ingestion and data load steps using Cloud Workflows, another serverless offering in GCP. A minimal workflow definition is sketched below.
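A Cloud Workflows definition for this orchestration might look roughly like the YAML below, which simply calls the two hypothetical Cloud Functions from the Pattern 2 sketch in sequence. The function URLs are placeholders.

```yaml
# Hypothetical Cloud Workflows definition: run ingestion, then the BigQuery load.
main:
  steps:
    - ingest_headlines:
        call: http.get
        args:
          url: https://REGION-PROJECT.cloudfunctions.net/ingest_to_gcs
          auth:
            type: OIDC
        result: ingest_result
    - load_into_bigquery:
        call: http.get
        args:
          url: https://REGION-PROJECT.cloudfunctions.net/load_to_bigquery
          auth:
            type: OIDC
        result: load_result
    - finish:
        return: ${load_result.body}
```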

Note: The streaming inserts used in Pattern 1 incur an additional charge, while batch loading data into BigQuery is free of ingestion charges.

In Part 2 of this blog, I will share the code and configurations for the above-mentioned patterns.

Happy Data Pipeline Building! ✌️
