What happens when you ask a Product person and a Software Engineer to build a Datalake?

Harsha Teja Kanna
6 min read · Jan 26, 2022

Follow up: https://www.ekalavya.dev/how-i-integrated-airbyte-and-apache-hudi-again/

This post contains ideas, technology references, and a solution. As a novice writer, I am practicing making my writing easy and short for readers in general, and I will try to keep it as concise as I can.

The main reason I started writing is to convey complex ideas and validate them with a wider audience. I also want to express my own opinions about product design and implementation. This is a biased, opinion-oriented article, and feedback is appreciated. For now, I hope to make this a valuable use of your time.

So, both the Product person and the Software Engineer are me, of course, or at least that is what I think of myself as. The problem statement is to come up with a pipeline to build a data lake. We have data sources that are not under our control and target consumption patterns we don’t understand yet. I have a rough idea of what a simple solution should look like and what its requirements are, and I am working towards making it practical if possible.

The data lake here is a grouping of tables to be queried by data consumers. Each grouping of tables belongs to a business domain/service context and is owned and produced by the service owners themselves. As I have working knowledge of a few technologies, I used them to create the solution; similar implementations with other technologies are certainly possible.

Solution

Below, I present an implementation that I think any engineering team should be able to build with just their programming language and a self-service portal.

Build pipelines using a polyglot framework

Teams responsible for each service or domain of services should build the pipeline with the tools and languages they are most productive with. Building a data pipeline should not be a dedicated person’s job, so a single framework/platform should not be enforced. It can be Airflow, Dagster, or Prefect; it’s the team’s choice. Operating these dedicated frameworks can be a problem, and that’s where service providers can help out. The provider can be your internal infra/platform team or a cloud marketplace.

Also, taking myself as an example, I am most productive in Go and Java, and I should still be able to build the pipeline for my service using them. That’s where declarative, container-based tools like Argo and Serverless Workflow come into play. I am not listing the big cloud vendor services here, as those choices will be obvious at cloud-native companies.

But learning these tools or frameworks, and asking infra/platform teams to manage a variety of them, is not practical. That is where I suggest my go-to platform, Temporal. We can build data pipelines in the language we are experts in and create Activity tasks to call out to other data frameworks as containers or external calls if needed. There is no need to learn a dedicated domain-specific language, and we can run the pipeline as part of the microservices that are themselves built on Temporal. Check out all the SDKs here; the TypeScript SDK is the most exciting one.
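To make that concrete, here is a minimal sketch of what such a pipeline could look like with the Temporal Java SDK. The interface names and the two activities (an Airbyte sync and a Hudi write) are hypothetical placeholders I made up for illustration, not a prescribed design.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

// Hypothetical activities: one triggers an Airbyte sync, the other runs a Hudi write.
@ActivityInterface
interface DataLakeActivities {
  void triggerAirbyteSync(String connectionId);
  void upsertIntoHudiTable(String tableName);
}

// The pipeline itself is just a workflow definition in plain Java.
@WorkflowInterface
interface DataLakePipeline {
  @WorkflowMethod
  void run(String connectionId, String tableName);
}

class DataLakePipelineImpl implements DataLakePipeline {

  // Activities are invoked through a stub; Temporal handles retries and timeouts.
  private final DataLakeActivities activities =
      Workflow.newActivityStub(
          DataLakeActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofHours(1))
              .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(3).build())
              .build());

  @Override
  public void run(String connectionId, String tableName) {
    // Step 1: land raw data via Airbyte.
    activities.triggerAirbyteSync(connectionId);
    // Step 2: write/upsert it into a Hudi table.
    activities.upsertIntoHudiTable(tableName);
  }
}
```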

Use a self-service data integration platform

When independent, every team creating a pipeline will try to solve the same integration problems in its own way. We should avoid this. Moving data from different sources to destinations is very involved work and not fun; automation is the way to go.

There are many things to consider: the data schemas of the sources and destinations, credentials for these integrations, a catalog of the tables involved, and so on. Airbyte solves this in an open, modern, extensible, and beautiful way. Just adopt it, use it, and contribute connectors to it.

Since I suggested Temporal above, microservices can call out to Airbyte from long-running, distributed activities within the service.
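As a rough sketch of that idea, the activity from the earlier snippet can simply call Airbyte’s connection-sync API over HTTP and let Temporal handle retries. The endpoint URL, port, and connection ID below are assumptions based on Airbyte’s public API; check your own deployment for the exact contract.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical implementation of the activities from the sketch above.
class DataLakeActivitiesImpl implements DataLakeActivities {

  // Assumed Airbyte API endpoint; adjust host/port to your deployment.
  private static final String AIRBYTE_SYNC_URL =
      "http://airbyte-server:8001/api/v1/connections/sync";

  private final HttpClient httpClient = HttpClient.newHttpClient();

  @Override
  public void triggerAirbyteSync(String connectionId) {
    try {
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(AIRBYTE_SYNC_URL))
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(
              "{\"connectionId\":\"" + connectionId + "\"}"))
          .build();

      HttpResponse<String> response =
          httpClient.send(request, HttpResponse.BodyHandlers.ofString());

      // Throwing lets Temporal retry the activity according to its RetryOptions.
      if (response.statusCode() >= 300) {
        throw new RuntimeException("Airbyte sync failed: " + response.body());
      }
    } catch (Exception e) {
      throw new RuntimeException("Failed to trigger Airbyte sync", e);
    }
  }

  @Override
  public void upsertIntoHudiTable(String tableName) {
    // Sketched in the data lake section below.
  }
}
```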

Build data lake on open standards

Data lake technology has advanced so much that the open standards Apache Hudi, Apache Iceberg, and Delta Lake are becoming cost-effective and performant alternatives. Cloud vendors are supporting these formats in addition to their own dedicated offerings.

I suggest Apache Hudi (I am biased) because it is not only a table format but, in my view, currently the most feature-rich open data lake platform of all, and I have worked with it (blogs about my experience will follow).

Again, when using Temporal, we can orchestrate Apache Hudi from within our microservices or, if it becomes available, use an Airbyte Hudi sync from the service activities.
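For context, the Hudi write itself can be as small as a Spark DataSource call like the sketch below. The table path, record key, precombine, and partition fields are illustrative assumptions, and in practice the activity might submit this as a Spark job rather than embed a Spark session in the service.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

class HudiUpsertExample {

  // Illustrative upsert of freshly landed records into a Hudi table.
  static void upsert(SparkSession spark, String inputPath, String tablePath, String tableName) {
    Dataset<Row> records = spark.read().format("parquet").load(inputPath);

    records.write()
        .format("hudi")
        // Assumed record key / precombine / partition fields for this example.
        .option("hoodie.table.name", tableName)
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .option("hoodie.datasource.write.partitionpath.field", "event_date")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(SaveMode.Append)
        .save(tablePath);
  }
}
```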

So your end pipeline can look something like the sketch below.

The implementation can be in a single file (for a short pipeline) within your microservice: an independently deployable and scalable data pipeline inside the service or, if preferred, a complementary worker service. No queues are involved, and no event consumers or producers are involved in orchestration; it is just plain code. Of course, learn the programming model here. So the service that generates the data is also responsible for building a data lake out of it.
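To make the “single file” idea concrete, here is a hedged sketch of the worker bootstrap that would register and run the workflow from the earlier snippets inside a microservice. The task queue name, connection ID, table name, and local Temporal connection are all assumptions for illustration.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class PipelineMain {

  public static void main(String[] args) {
    // Connect to a Temporal service (local dev setup assumed here).
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Register the workflow and activities on a task queue owned by this service.
    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("data-lake-pipeline");
    worker.registerWorkflowImplementationTypes(DataLakePipelineImpl.class);
    worker.registerActivitiesImplementations(new DataLakeActivitiesImpl());
    factory.start();

    // Kick off one pipeline run; in practice this could be on a schedule or behind an API.
    DataLakePipeline pipeline = client.newWorkflowStub(
        DataLakePipeline.class,
        WorkflowOptions.newBuilder().setTaskQueue("data-lake-pipeline").build());
    pipeline.run("my-airbyte-connection-id", "orders");
  }
}
```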

Product requirements that will lead to the above solution

This is for a company that strives to be data-driven, which is every company.

Service teams should own their data and make it available as a product

The team owning a service should think of not only the service but also its data as a product. They should think about its interface and consumption patterns, and they should have plans for versioning, schema evolution, and support. Here, product means producing the data, streaming it, and building and owning the data lake tables for the domain along with the service. That is why my solution makes building a data lake possible in any programming language. Teams should not think of the data pipeline as somebody else’s job; they should enjoy the development experience of implementing it and make it part of their service.

Service teams should own the quality and correctness of the data

For service owners, it is imperative to produce high-quality data in the lake. The reason I consider Temporal a good solution for implementing services is that we can ensure the quality and consistency of the data flowing out of the service by using dedicated workflow state logic. In fact, Temporal can scale to a large number of these short workflow instances. I will describe this with a dedicated example in one of my next blogs.
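As a hedged preview of that idea (all names below are placeholders I made up for illustration): a small workflow can hold the state of a validation step and only publish batches that pass it, letting Temporal retry either step independently.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

// Hypothetical activities: validate a batch, then publish it to the lake or quarantine it.
@ActivityInterface
interface QualityActivities {
  boolean validateBatch(String batchId);
  void publishBatch(String batchId);
  void quarantineBatch(String batchId);
}

@WorkflowInterface
interface QualityGateWorkflow {
  @WorkflowMethod
  void process(String batchId);
}

class QualityGateWorkflowImpl implements QualityGateWorkflow {

  private final QualityActivities activities =
      Workflow.newActivityStub(
          QualityActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofMinutes(15))
              .build());

  @Override
  public void process(String batchId) {
    // The workflow's own state decides whether a batch ever reaches the lake.
    if (activities.validateBatch(batchId)) {
      activities.publishBatch(batchId);
    } else {
      activities.quarantineBatch(batchId);
    }
  }
}
```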

Data platform teams should provide self-service portals as a service

Data platform teams should own the base infrastructure and components for the self-service portals with the help of infrastructure teams. Indeed, their services should themselves be available in a self-service portal. Again, the same product thinking comes into play: their product should make it as simple as possible for the domain/service teams to provision a Temporal/Airflow/Argo or Airbyte workspace or namespace.

No centralized data platform team should be solely responsible for the data lake and the service pipelines

Data engineers who try to work with data sources they don’t control or own will have a tough time communicating with different service teams and setting expectations, and they will not have the complete context of the consumption patterns of the data lake to be built. Instead, their time is better spent making self-service portals available as services and establishing data governance, security, and management standards, along with best practices for the pipelines that service teams implement.

Data product as culture and service requirement

Product management should make data requirements and availability in the data lake part of feature requests rather than afterthoughts.

Conclusion

That is all I have thought about on my own. It stemmed mainly from my experience of building a data lake and from some of the challenges I faced and foresee. I had an incredible time, so I put some effort into thinking of a simpler and more open solution than the one I was able to implement before.

There are many excellent expert articles that already cover this in depth, but my solution and requirements came from my own question: can I build a data lake with just a single code file and a config file? A bit too much, isn’t it? It’s certainly possible if we rely on good platforms and have a product mindset and culture.

Note:

This is only for a simple workflow to build a data lake. Building Python/Java plugins and extensions for the self-service portals, building models in dbt, creating pipeline DAGs, and writing Apache Spark applications are the main data-engineering work and are very involved. I don’t want to make this sound simple in any way. The next blogs I am thinking of are in fact about my work with Apache Hudi on Spark and related gotchas and tips, and about my previous experience of building real-time event processing using Apache Flink.
