Apache Parquet and Go
Follow up: More updates here https://www.denote.dev/
The Go ecosystem lacks an official, performant Parquet library. Parquet is a standard storage format in data lakes, and for big data processing and engineering in Go this has been the weak link.
I have been watching and waiting for the Apache Arrow Go Parquet library, developed by Matt Topol, for about 2 years now.
It shows how much rigorous work goes into developing such a low-level library. The JIRA issue is here.
Last week the last major code chunk got merged. It is scheduled for the next Arrow release (7.0.0), I think.
Go 1.18, with its big-bang introduction of generics, is also getting ready for release next month. I thought it was a good time to give the library a try and to consider Go for some data engineering work going forward.
A quick overview of the dependencies shows that the parquet package relies on native, high-performance compression and JSON codec libraries.

There is a parquet_reader program that was just added last week. Running it has an issue, so I modified it a little for the experiment (commented out this part here). I have a sample 64 MB Parquet file (about 15x compressed) and tried converting it into a CSV file, basically a simple decompression speed test.
git clone https://github.com/apache/arrow.git
cd arrow/go
go build ./parquet/cmd/parquet_reader
./parquet_reader demo.parquet
% time ./parquet_reader demo.parquet > demo.csv
./parquet_reader demo.parquet > demo.csv 19.24s user 2.12s system 103% cpu 20.670 total
-rw-r--r--@ 1 h7kanna staff 785M Jan 30 23:55 demo.csv
TODO: Compare this to a pyarrow Parquet-to-CSV converter program. I will post a comparison when I have more time and after verifying the result.
Thanks to Matt Topol for this work.