Apache Parquet and Go

Harsha Teja Kanna
2 min read · Jan 31, 2022

Follow-up: more updates at https://www.denote.dev/

The Go ecosystem lacks an official, performant Parquet library. Parquet is a standard storage format in data lakes, so this gap has been the weak link for big data processing and data engineering in Go.

For about two years now I have been watching and waiting as Matt Poole developed the Apache Arrow Go Parquet library.

That timeline shows how much rigorous work goes into developing such a low-level library. The work is tracked in a JIRA issue.

Last week the last major chunk of code was merged. I think it is scheduled for the next Arrow release (7.0.0).

Also, Go 1.18, with its big-bang generics, is due next month. I thought it was a good time to give the library a try and to consider Go for some data engineering work going forward.

A quick look at the dependencies shows that the parquet package relies on native, high-performance compression and JSON codec libraries.

A parquet_reader program was added just last week. Running it as-is hit an issue, so I modified it a little for the experiment by commenting out one part. I have a sample 64 MB (15x compressed) Parquet file, and I simply tried converting it into a CSV file. It is basically a simple decompression speed test.

git clone https://github.com/apache/arrow.git

cd arrow/go

go build ./parquet/cmd/parquet_reader

./parquet_reader demo.parquet

% time ./parquet_reader demo.parquet > demo.csv

./parquet_reader demo.parquet > demo.csv 19.24s user 2.12s system 103% cpu 20.670 total

-rw-r--r--@ 1 h7kanna staff 785M Jan 30 23:55 demo.csv
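
To put those numbers in perspective, here is a small Go sketch that turns the time output and file sizes above into throughput figures. The function names are my own, and it treats the 785M reported by ls as plain megabytes, so read the results as rough estimates rather than a benchmark.

```go
package main

import "fmt"

// throughputMBps returns megabytes processed per second of wall-clock time.
func throughputMBps(sizeMB, wallSeconds float64) float64 {
	return sizeMB / wallSeconds
}

// cpuUtilisation returns user plus system CPU time as a percentage of
// wall-clock time, which is the "cpu" figure printed by the shell's time.
func cpuUtilisation(userSeconds, systemSeconds, wallSeconds float64) float64 {
	return (userSeconds + systemSeconds) / wallSeconds * 100
}

func main() {
	// Values copied from the run above.
	const (
		user, system, wall = 19.24, 2.12, 20.670 // seconds
		parquetMB          = 64.0                // input file size
		csvMB              = 785.0               // output file size, approximate
	)

	fmt.Printf("Parquet read: %.1f MB/s\n", throughputMBps(parquetMB, wall))
	fmt.Printf("CSV written:  %.1f MB/s\n", throughputMBps(csvMB, wall))
	fmt.Printf("CPU usage:    %.0f%%\n", cpuUtilisation(user, system, wall))
}
```

That works out to roughly 3 MB/s of compressed Parquet read and about 38 MB/s of CSV written. CPU usage just above 100% suggests the run was essentially single-threaded.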

TODO: Compare against a pyarrow Parquet-to-CSV converter program. I will post a comparison when I have more time and after verifying the result.

Thanks to Matt Poole for this work.
