Apache Hudi and Kubernetes: The Fastest Way to Try Apache Hudi!

Harsha Teja Kanna
2 min read · Feb 28, 2022


A follow-up post is here: https://www.ekalavya.dev/how-to-run-apache-hudi-deltastreamer-kubevela-addon/

More related content at https://www.denote.dev/

As I previously stated, I am developing a set of scenarios to try out Apache Hudi features at https://github.com/replication-rs/apache-hudi-scenarios

Here is how you can try it out quickly if you have Docker running on your computer. Docker needs at least 4 CPUs and 8 GB of memory allocated to it.
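If you want to check the Docker allocation from a terminal before starting, a minimal sketch could look like the one below. It assumes the `docker` CLI is on your PATH and reads the `NCPU` and `MemTotal` fields from `docker info`. The helper name `has_enough_resources` and the 7 GiB floor are my own assumptions: Docker tends to report a little less memory than the configured allocation, so a strict 8 GiB comparison would likely fail on a correctly configured machine.

```shell
# Minimums taken from the requirement above (4 CPUs, 8 GB).
MIN_CPUS=4
# Assumption: accept 7 GiB, because Docker usually reports slightly less
# memory than the amount configured in its settings.
MIN_MEM_BYTES=$((7 * 1024 * 1024 * 1024))

# Hypothetical helper: succeeds only if both values meet the minimums.
# $1 = CPU count, $2 = memory in bytes
has_enough_resources() {
  [ "$1" -ge "$MIN_CPUS" ] && [ "$2" -ge "$MIN_MEM_BYTES" ]
}

# Only query Docker when the CLI is installed and the daemon answers.
if command -v docker >/dev/null 2>&1 &&
   info=$(docker info --format '{{.NCPU}} {{.MemTotal}}' 2>/dev/null); then
  cpus=${info% *}       # text before the space: CPU count
  mem_bytes=${info#* }  # text after the space: memory in bytes
  if has_enough_resources "$cpus" "$mem_bytes"; then
    echo "Docker resources look sufficient: ${cpus} CPUs, ${mem_bytes} bytes"
  else
    echo "Docker reports ${cpus} CPUs and ${mem_bytes} bytes; consider raising the limits" >&2
  fi
fi
```

On Docker Desktop the limits can be raised under the resource settings. On Linux, Docker normally sees the full resources of the host.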

git clone https://github.com/h7kanna/apache-hudi-scenarios
cd apache-hudi-scenarios

Quick start on macOS (Intel)

./quickstarts/kind-macos.sh

Quick start on macOS (M1)

./quickstarts/kind-macos-m1.sh

Quick start on Linux (Intel)

./quickstarts/kind-linux.sh

Wait for the services to initialize

./bin/kubectl get pods --all-namespaces -w
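If you prefer a command that returns once everything is up instead of watching the list, the polling loop below is one way to sketch it. The function names `count_pending_pods` and `wait_for_pods` are hypothetical, and the parsing assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). A pod counts as settled when it is Completed, or Running with all of its containers ready.

```shell
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin and
# prints the number of pods that are neither Completed nor fully Running.
count_pending_pods() {
  awk '
    $4 == "Completed" { next }
    $4 == "Running"   { split($3, r, "/"); if (r[1] == r[2]) next }
    { n++ }
    END { print n + 0 }
  '
}

# Hypothetical wrapper: poll every 5 seconds until at least one pod exists
# and none are pending. $1 = path to kubectl (defaults to ./bin/kubectl).
wait_for_pods() {
  kubectl_bin="${1:-./bin/kubectl}"
  while :; do
    # An empty result means no pods yet (or the API is not reachable yet).
    out=$("$kubectl_bin" get pods --all-namespaces --no-headers 2>/dev/null) || out=""
    if [ -n "$out" ]; then
      pending=$(printf '%s\n' "$out" | count_pending_pods)
      if [ "$pending" -eq 0 ]; then
        echo "All pods are running or completed"
        return 0
      fi
      echo "Waiting for ${pending} pod(s)..."
    fi
    sleep 5
  done
}
```

Run `wait_for_pods ./bin/kubectl` from the repository root. The loop has no timeout, so interrupt it with Ctrl-C if a pod is stuck in an error state.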

Hello Hudi

cd hello-hudi
../bin/kubectl apply -f huditable.yaml

After about two minutes, go to http://localhost:30001/ and log in with admin/password.

If the execution was successful, you will find the Apache Hudi data lake bucket containing the table we just created.

You can query the table using:

../bin/kubectl apply -f hudisparkquery.yaml

Watch progress and output using logs.

export POD_NAME=$(../bin/kubectl get pods -l spark-role=driver -n spark-system | grep hudisparkquery | awk '{print $1;}')

Use the hudisparkquery-sample pod name captured above to get the output:

../bin/kubectl logs $POD_NAME -n spark-system
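If the query has been applied more than once, the grep above can match several driver pods and `POD_NAME` ends up holding multiple names. One possible variation, sketched below, asks kubectl for names only with `-o name` and keeps the first match. The helper name `first_matching_pod` is hypothetical and not part of the repository.

```shell
# Reads `kubectl get pods -o name` output (lines like "pod/<name>") on stdin
# and prints the first pod name that contains the given substring.
# $1 = substring to look for, e.g. hudisparkquery
first_matching_pod() {
  sed 's|^pod/||' | grep -- "$1" | head -n 1
}

# Usage, from the hello-hudi directory:
#   POD_NAME=$(../bin/kubectl get pods -l spark-role=driver -n spark-system -o name \
#     | first_matching_pod hudisparkquery)
#   ../bin/kubectl logs -f "$POD_NAME" -n spark-system
```

The `-f` flag on `kubectl logs` follows the log while the query is still running, which saves re-running the command to see new output.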

I will keep adding more complex scenarios, including clustering and the sync service with a Kubernetes-based lock, and I will improve the operator as I get time. Keep watching if you are interested in learning to build an Apache Hudi data lake on Kubernetes.

My end goal is to build a production-ready hudi-operator based on the knowledge I have currently before it vanishes.

Written by Harsha Teja Kanna

Builder, Tech enthusiast, and Opinions are my own! https://denote.dev/
