Apache Hudi and Kubernetes: The Fastest Way to Try Apache Hudi!
Follow up is here: https://www.ekalavya.dev/how-to-run-apache-hudi-deltastreamer-kubevela-addon/
More related content at https://www.denote.dev/
As I previously stated, I am developing a set of scenarios to try out Apache Hudi features at https://github.com/replication-rs/apache-hudi-scenarios
Here is how you can try it out quickly if you have Docker running on your machine. You need at least 4 CPUs and 8 GB of memory allocated to Docker.
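Before starting, you can sanity-check the resources Docker reports. The helper below is a small sketch (the function name is mine, not part of the repo); it compares a CPU count and a memory figure in bytes, such as the ones `docker info --format '{{.NCPU}} {{.MemTotal}}'` prints, against the minimums above.

```shell
# Hypothetical helper: check that Docker has at least 4 CPUs and 8 GiB
# of memory. Pass CPU count and total memory in bytes as arguments.
check_docker_resources() {
  cpus="$1"
  mem_bytes="$2"
  min_mem=$((8 * 1024 * 1024 * 1024))   # 8 GiB in bytes
  if [ "$cpus" -lt 4 ] || [ "$mem_bytes" -lt "$min_mem" ]; then
    echo "insufficient"
  else
    echo "ok"
  fi
}

# Typical usage with a running Docker daemon:
# check_docker_resources $(docker info --format '{{.NCPU}} {{.MemTotal}}')
```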
git clone https://github.com/h7kanna/apache-hudi-scenarios
cd apache-hudi-scenarios
Quick start on macOS (Intel)
./quickstarts/kind-macos.sh
Quick start on macOS (M1)
./quickstarts/kind-macos-m1.sh
Quick start on Linux (Intel)
./quickstarts/kind-linux.sh
Wait for the services to initialize:
./bin/kubectl get pods --all-namespaces -w
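If you prefer to script the wait instead of watching the output by eye, here is one way to check the status column. This is a sketch of my own (the function name and sample layout are assumptions): it reads `kubectl get pods --all-namespaces` style output on stdin, where STATUS is the fourth column, and reports whether every pod is Running.

```shell
# Sketch: report "ready" if every pod in `kubectl get pods --all-namespaces`
# output is Running, otherwise "waiting". Assumes STATUS is column 4.
all_pods_running() {
  awk 'NR > 1 && $4 != "Running" { bad = 1 }
       END { print (bad ? "waiting" : "ready") }'
}

# Usage against a live cluster:
# ./bin/kubectl get pods --all-namespaces | all_pods_running
```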
Hello Hudi
cd hello-hudi
../bin/kubectl apply -f huditable.yaml
After about two minutes, go to http://localhost:30001/ and log in with admin/password.
If the execution succeeded, you will find the Apache Hudi data lake bucket containing the table we just created.
You can query the table using:
../bin/kubectl apply -f hudisparkquery.yaml
Watch the progress and output using the driver pod logs. First capture the pod name:
export POD_NAME=$(../bin/kubectl get pods -l spark-role=driver -n spark-system | grep hudisparkquery | awk '{print $1;}')
Then use the captured pod name to get the query output:
../bin/kubectl logs $POD_NAME -n spark-system
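The pod-name extraction above can be packaged as a small reusable filter. This is my own sketch (the function name and the sample listing are assumptions): it reads `kubectl get pods` output on stdin and prints the name of the first pod matching a pattern.

```shell
# Sketch: print the first pod name matching the given pattern from
# `kubectl get pods` output read on stdin.
pod_name_for() {
  grep "$1" | awk '{ print $1; exit }'
}

# Usage against a live cluster:
# POD_NAME=$(../bin/kubectl get pods -l spark-role=driver -n spark-system | pod_name_for hudisparkquery)
```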
I will keep adding more complex scenarios, such as clustering and the sync service with a Kubernetes-based lock provider, and will improve the operator as time permits. Keep watching if you are interested in learning how to build an Apache Hudi data lake on Kubernetes.
My end goal is to build a production-ready hudi-operator from the knowledge I currently have, before it fades.