Skip to main content

Using NPM Library in Google BigQuery UDF

 

Javascript UDF’s are cool and using with NPM library is a whole new world to explore!

Background

One of the main reason to build ETL pipeline was to do data transformation on data before loading into the data warehouse. The only reason we were doing that because data warehouses were not capable to handle these data transformations due to several reasons such as performance and flexibility.

In the era of modern data warehouses like Google BigQuery or SnowFlake, things have changed. These data warehouses can process terabyte and petabyte data within seconds and minutes. Considering this much improvement, now performing data transformation within a data warehouse make more sense. Hence to create common transformation logic via UDF (user-defined functions). In this blog, we will see how can we utilize the power of javascript UDF and NPM library to generate data in BigQuery.

What is UDF?

From Google Cloud Documentations:

A user-defined function (UDF) lets you create a function by using a SQL expression or JavaScript code. A UDF accepts columns of input, performs actions on the input, and returns the result of those actions as a value.

You can define a UDFs as either persistent or temporary. You can reuse persistent UDFs across multiple queries, while temporary UDFs only exist in the scope of a single query

The below function is a simple example of creating a temporary UDF.

CREATE TEMP FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r"""
return x*y;
""";

WITH numbers AS
(SELECT 1 AS x, 5 as y
UNION ALL
SELECT 2 AS x, 10 as y
UNION ALL
SELECT 3 as x, 15 as y)
SELECT x, y, multiplyInputs(x, y) as product
FROM numbers;

Package Existing NPM library

  • We will package credit card generator npm library using webpack.
  • Create a card-number-generator folder and install the creditcard-generator library. Create webpack config file to create webpack file.
mkdir card-number-generator
cd card-number-generator/
npm install --save creditcard-generator
touch webpack.config.js
view raw gistfile1.txt hosted with ❤ by GitHub
  • We need to edit webpack.config.js and mention about entry point for the library and output path.
const path = require('path');
module.exports = {
entry: './node_modules/creditcard-generator/index.js',
output: {
path: path.resolve(__dirname, 'dist'),
filename: 'creditcard-generator.js',
library: "webpackNumbers",
},
};
  • Run below command will generate webpack file at card-number-generator/dist/main.js
    npx webpack --config webpack.config.js
    view raw webpack.sh hosted with ❤ by GitHub
  • After packaging, we will upload the packaged file to the google cloud storage bucket.

Using Packaged NPM library In BigQuery UDF

  • We will use packaged npm credit card generator library in our Javascript UDF by passing it as OPTIONS parameter.
  • We will use one of the methods called GenCC(“Amex”)
/*
Temp FN: 2 -> get_rand_card_number
*/
CREATE TEMP FUNCTION get_rand_card_number()
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://bq_udf_library/main.js"]
)
AS r"""
return creditcard_generator.GenCC("Amex");
""";
view raw udf_2.sql hosted with ❤ by GitHub
  • We will also create another UDF just to generate a random customer id.
/*
Temp FN: 1 -> get_rand_id
*/
CREATE TEMP FUNCTION get_rand_id()
RETURNS STRING
LANGUAGE js
AS r"""
return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
var r = Math.random() * 4 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8);
return v.toString(4);
});
""";
view raw udf_1.sql hosted with ❤ by GitHub
  • Now we can use the above two udfs to generate our table with customer_id and credit card number. Additionally, in order to generate N numbers of rows, we can use any public or authorized table with LIMIT.
  • In the below query, we are using bigquery public dataset to generate 100 records.
CREATE OR REPLACE TABLE `avid-subject-309421.bq_test.sample_data_1`
AS
SELECT
get_rand_id() as unique_id,
get_rand_card_number() as credit_card_number
FROM
`bigquery-public-data.google_trends.top_terms` LIMIT 100
  • Let’s summarize all the information and our entire code looks like below
/*
Temp FN: 1 -> get_rand_id
*/
CREATE TEMP FUNCTION get_rand_id()
RETURNS STRING
LANGUAGE js
AS r"""
return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
var r = Math.random() * 4 | 0, v = c == 'x' ? r : (r & 0x3 | 0x8);
return v.toString(4);
});
""";
/*
Temp FN: 2 -> get_rand_card_number
*/
CREATE TEMP FUNCTION get_rand_card_number()
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://bq_udf_library/main.js"]
)
AS r"""
return creditcard_generator.GenCC("Amex");
""";
CREATE OR REPLACE TABLE `avid-subject-309421.bq_test.sample_data_1`
AS
SELECT
get_rand_id() as unique_id,
get_rand_card_number() as credit_card_number
FROM
`bigquery-public-data.google_trends.top_terms` LIMIT 100

if you execute the above query it will generate 100 records as shown below.

Conclusion

This blog only covers the surface area and shows what’s possible when you use javascript UDF with the existing NPM library. It's a whole new world for the ELT paradigm. Please do share your experience about using UDF and NPM libraries in the comment section.

Comments

Popular posts from this blog

What is advertised.listeners in Kafka?

This is the title of the webpage! Hi guys,  Today we gonna talk about Kafka Broker Properties. More Specifically, advertised.listeners property. If you have seen the server.properties file in Kafka there are two properties with listener settings. #listeners=PLAINTEXT://:9092 #advertised.listeners=PLAINTEXT://your.host.name:9092 why the hell we need two listeners for our broker? usually, Kafka brokers talk to each other and register themselves in zookeeper using listeners property. So for all internal cluster communication happens over what you set in listeners property. But if you have a complex network, for example, consider if your cluster is on the cloud which has an internal network and also external IP on which rest of the work can connect to your cluster, in that case, you have to set advertised.listeners property with {EXTERNAL_IP}://{EXTERNAL_PORT}. For Example, If Internal IP is 10.168.4.9 and port is 9092 and External IP is...

Beginners Guide to Machine Learning on GCP

This is the title of the webpage! This blog covers basic knowledge needed to get started ML journey on GCP. It provides foundational knowledge which will help readers to gain some level of confidence understanding ML ecosystem on GCP from where they can master each component. Introduction to Machine Learning Machine Learning is a way to use some set of algorithms to derive predictive analytics from data . It is different than Business Intelligence and Data Analytics in a sense that In BI and Data analytics Businesses make decision based on historical data, but In case of Machine Learning , Businesses  predict  the future based on the historical data. Example, It’s a difference between  what happened to the business  vs  what will happen to the business .Its like making BI much smarter and scalable so that it can predict future rather than just showing the state of the business. ML is based on Standard algorithms which are used to create ...