Apache Parquet and Go
Follow up: More updates here https://www.denote.dev/
The Go ecosystem lacks an official, performant Parquet library. Parquet is a standard storage format in data lakes, and for big data processing and engineering in Go this has been the weak link.
I have been watching and waiting for the Apache Arrow Go Parquet library, developed by Matt Topol, for about 2 years now.
It shows how much rigorous work goes into developing such a low-level library. The JIRA issue is here.
Last week the last major code chunk got merged. It is scheduled for the next Arrow release (7.0.0), I think.
Go 1.18, with its big-bang introduction of generics, is also getting ready for release next month. I thought it was a good time to give the library a try and to consider Go for some data engineering work going forward.
A quick overview of the dependencies shows that the parquet package relies on native, high-performance compression and JSON codec libraries.

There is a parquet_reader program that was just added last week. Running it has an issue, so I modified it a little for the experiment (commented out this part here). I have a sample 64 MB Parquet file (about 15x compressed) and tried converting it into a CSV file, basically a simple decompression speed test.
git clone https://github.com/apache/arrow.git
cd arrow/go
go build ./parquet/cmd/parquet_reader
./parquet_reader demo.parquet
% time ./parquet_reader demo.parquet > demo.csv
./parquet_reader demo.parquet > demo.csv 19.24s user 2.12s system 103% cpu 20.670 total
-rw-r--r--@ 1 h7kanna staff 785M Jan 30 23:55 demo.csv
TODO: Compare this to a pyarrow Parquet-to-CSV converter program. I will post a comparison when I have more time and after verifying the result.
Thanks to Matt Topol for this work.