3 benefits of applying Continuous Deployment to Data Products

Mar 1, 2023

As organizations attempt to leverage insights from data to grow their businesses, the data departments grow as well. Keeping this growth sustainable can be challenging. Data Infrastructure can be expensive and if not dealt with properly, costs can easily go through the roof. Teams grow and collaboration needs to be efficient to keep up with the development pace. Finally, to make sure our data products have the right quality, we need testing. However, the more data we manage and the more complex it gets, the more difficult it is to generate and maintain synthetic data for our tests.

In this article I explain how Continuous Deployment helps reduce infrastructure costs, deliver data products faster, and reduce the effort of maintaining synthetic data.

What is Continuous Deployment?

Continuous Deployment is a software release strategy in which code changes are deployed to production frequently and automatically. Once your changes are deployed, the decision to release them is completely independent of the deployment process and can be made at any point in time.

Imagine you have a Data Product team whose goal is to expose Finance metrics such as Operating Cash Flow (OCF) or Working Capital. With a traditional CI/CD pipeline, we would work on the first metric for a week in our local environment. Once our peers had reviewed the code, we would merge it into the main branch and the pipeline would deploy it to the dev/staging environment. There, some manual or automated e2e testing would happen and, once we were 99.999% sure the code worked, we would deploy it to production. Unexpectedly, a bug shows up and, because everything was developed and deployed together, it takes hours to find the cause.

*Everything deployed at once to production.*

Now, with Continuous Deployment, we deploy to production from day one. The metric probably won’t be fully ready, but some connectors to the input ports might be. With this code already in production, the connectors can consume data while we develop the rest of the codebase. The moment we receive unexpected data in production, an alert is triggered. Thus, on day 2 we already know that our code has a bug. Not only do we discover the bug sooner, but the code is also much easier to debug, since it was deployed to production in small chunks rather than developed and pushed all at once as in the previous scenario.

*Only input port connectors are deployed to production.*
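To make the day-2 alert concrete, here is a minimal sketch of an input port connector that validates each record and raises an alert when the data is not what we expected. The `InputPortConnector` class, the `EXPECTED_FIELDS` schema, and the in-memory alert list are all assumptions made for illustration; a real connector would read from your actual input port and send alerts to your own alerting channel.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema for an input port feeding the Operating Cash Flow metric.
EXPECTED_FIELDS = {"invoice_id": str, "amount": float, "currency": str}


@dataclass
class InputPortConnector:
    """Consumes records from an input port and flags anything unexpected."""

    alerts: list = field(default_factory=list)
    accepted: list = field(default_factory=list)

    def alert(self, message: str) -> None:
        # Stand-in for a real alerting channel (pager, chat, email).
        self.alerts.append(message)

    def consume(self, record: dict) -> bool:
        """Return True if the record matches the expected schema."""
        for name, expected_type in EXPECTED_FIELDS.items():
            if name not in record:
                self.alert(f"missing field '{name}' in record {record}")
                return False
            value: Any = record[name]
            if not isinstance(value, expected_type):
                self.alert(
                    f"field '{name}' has unexpected type {type(value).__name__}"
                )
                return False
        self.accepted.append(record)
        return True


connector = InputPortConnector()
connector.consume({"invoice_id": "A-1", "amount": 120.5, "currency": "EUR"})
# This record carries a string where a number is expected, so it triggers an alert.
connector.consume({"invoice_id": "A-2", "amount": "n/a", "currency": "EUR"})
```

Because only the connector is live, an alert like this points at a small, recent change instead of a week of accumulated code.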

Benefit #1: Save money

No company wants to have bugs in production. That’s the reason why pre-production or staging environments are used. However, maintaining these environments can be hard and, most importantly, they cost money.

In Continuous Deployment, CI/CD pipelines are automated all the way to production, meaning that there’s no room for manual testing in any intermediate environment. That’s why one common practice is to test directly in production behind feature toggles (aka feature flags). If manual testing is done in production, there’s no need for pre-production/staging environments, so you save the infrastructure costs associated with them.

*Reducing costs by removing intermediate environments*
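As a rough illustration of a feature toggle, the sketch below deploys the code for an unreleased metric but only runs it when a flag is switched on. The `is_enabled` helper, the `FEATURE_...` environment variable naming, and the `compute_metrics` function are hypothetical; many teams use a dedicated feature flag service instead of environment variables.

```python
import os


def is_enabled(flag: str) -> bool:
    """Read a toggle from an environment variable, e.g. FEATURE_WORKING_CAPITAL=1."""
    return os.environ.get(f"FEATURE_{flag.upper()}", "0") == "1"


def compute_metrics(
    current_assets: float, current_liabilities: float, ocf: float
) -> dict:
    # Released metric: always computed.
    metrics = {"operating_cash_flow": ocf}
    # Unreleased metric: the code is deployed, but it only runs when the toggle is on.
    if is_enabled("working_capital"):
        metrics["working_capital"] = current_assets - current_liabilities
    return metrics
```

With the toggle off, production behaves exactly as before the deployment. Turning it on for a test run exercises the new code path without a separate environment.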

Benefit #2: Speed up feature development

Long Pull Requests slow down the deployment process. The more code you have in your branch, the longer it will take the reviewers to provide feedback.

*Long Pull Requests can be painful and slow to review*

Continuous Deployment encourages deploying code often and in small increments, so Pull Requests need to be small enough to be merged the same day. You can also practice Trunk-based Development with Pair Programming, where only one branch is needed and code review happens while the code is being written. On top of smaller deployments and smaller code reviews, bugs become easier to detect whenever they show up.

*Pair programming provides a fast feedback loop for code reviews*

Benefit #3: Less production-like synthetic data

Many teams have tried to replicate production-like data, but it only works for a while. At some point, your input ports change so often that your team starts struggling to apply those changes to the synthetic data. As a result, the tests that use that data become meaningless.

Why don’t we remove the effort of maintaining synthetic data, at least for e2e tests, by testing our data products in production? A data product team could trigger a data product run with feature toggles (aka feature flags) enabled and execute the set of transformations pending release. These transformations generate data that needs to be published to the output ports. Because we don’t want test data to be exposed to consumers, it should be published to a separate output port. Only the team or some specific consumers are allowed to consume that data to perform the validation.

*Exposing internal output ports to test in production*
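One way to picture the internal output port is the sketch below, which routes rows produced by unreleased transformations to a restricted port. The `OutputPort` class, the port names, and the `finance-data-team` consumer are invented for this example; in practice the access control would live in your data platform rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OutputPort:
    name: str
    # None means the port is open to every consumer.
    allowed_consumers: Optional[frozenset] = None
    rows: list = field(default_factory=list)

    def read(self, consumer: str) -> list:
        if (
            self.allowed_consumers is not None
            and consumer not in self.allowed_consumers
        ):
            raise PermissionError(f"{consumer} may not read from {self.name}")
        return list(self.rows)


public_port = OutputPort("finance_metrics")
internal_port = OutputPort(
    "finance_metrics_internal",
    allowed_consumers=frozenset({"finance-data-team"}),
)


def publish(rows: list, pending_release: bool) -> OutputPort:
    """Route rows from unreleased transformations to the internal port."""
    port = internal_port if pending_release else public_port
    port.rows.extend(rows)
    return port


# Released metric goes to the public port, unreleased metric to the internal one.
publish([{"metric": "operating_cash_flow", "value": 80.0}], pending_release=False)
publish([{"metric": "working_capital", "value": 300.0}], pending_release=True)
```

The team validates the real production output on the internal port, and regular consumers never see data from transformations that have not been released.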

Conclusion

Nowadays, data teams still promote their data products through multiple environments (dev, test, staging, pre-production, …) that need to be validated manually. This approach makes maintaining production-like infrastructure costly and slows down the team’s delivery pace. On top of that, Data Engineers need to put a lot of effort into building and maintaining production-like synthetic data for their automated or manual e2e tests.

With Continuous Deployment and its underlying good practices (feature flags, automated tests, trunk-based development, and pair programming), your data team can deploy their data products to production a few times a day and test their transformations in production with real data, reducing the amount of complex production-like synthetic data they need to maintain. Finally, since the e2e tests happen in production, you have the option of decommissioning your staging/pre-production environment and saving infrastructure costs.