Standardizing AdTech data ingestion with Lineate’s Data Octopus architecture

2024-05-13

Learn more about Lineate’s custom solutions for AdTech

AdTech companies constantly make business decisions based on data integrated from many different sources. Some of this data is fairly typical relational data pulled from databases, APIs, or reports: for example, campaign configuration data that needs to be presented in the reporting system, or pre-aggregated metrics obtained from third-party vendors of audience information. But AdTech is unusual in that it also processes massive streams of transactional data, such as bid stream data, that can arrive at a rate of millions of transactions per second and stress even the most robust data ingestion systems.

We have taken many approaches over the years to solve problems in data integration and reporting systems for AdTech companies, ranging from a ClickHouse deployment designed to provide near real-time updates, to aggregations of vast amounts of transactional data, to very complex Spark and Hadoop infrastructures that ingest millions of click events per second. The challenge was that while these architectures were highly optimized for their use cases, they were not very repeatable, required a great deal of expertise to maintain and extend, and could be time-consuming for our clients to learn. By extending Data Octopus, the data aggregation system we had been using for our own operational needs, we believe we have addressed the problem of repeatable implementation while allowing reporting to take place at AdTech scale on standard AWS services.

Over the last decade or so, we have built numerous data transform and reporting solutions for our AdTech clients, building on what we learned to improve those solutions over time and to fit each one to the specific needs of the client. For one of our long-term AdTech clients, we initially built a data transform and reporting solution that ingested event streams (requests, impressions, and clicks) from Kafka and wrote the data to Amazon S3, where it was read and aggregated by daily Spark jobs into a relational database used for reporting. To optimize costs, we later upgraded that solution to process raw log data with custom real-time components that write directly into a ClickHouse data store, where scheduled scripts aggregate data for reporting on a daily basis. For another client, we deployed a data transform and reporting solution into AWS using custom real-time components writing to a Kafka stream, which fed Amazon EMR jobs that create daily aggregate rollups. For a third, we built one of our largest systems to date using Kafka streaming into Spark and writing to Vertica for data warehousing and to ClickHouse for analytical reporting.

The specific technologies differ based on client-specific needs, and the systems continue to operate efficiently in each case. But having built these three solutions, as well as many others, we saw that a lot of effort was needed to orchestrate, instrument, measure, optimize, and maintain these systems, and that there was significant overlap in these efforts from one project to the next.
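To make the "daily aggregate rollup" step common to these systems concrete, here is a minimal sketch in plain Python. It is not the code used in any of the client systems: the event fields (`ts`, `campaign_id`, `type`) and the function name `daily_rollup` are assumptions made for illustration, and a production job would run the same grouping in Spark or ClickHouse rather than in memory.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_rollup(events):
    """Count events per (UTC day, campaign_id, event type).

    `events` is an iterable of dicts with hypothetical keys:
    "ts" (ISO 8601 timestamp with offset), "campaign_id", and "type".
    """
    counts = Counter()
    for event in events:
        # Normalize every timestamp to UTC so the day boundary is consistent.
        ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc)
        key = (ts.date().isoformat(), event["campaign_id"], event["type"])
        counts[key] += 1
    return counts


# Illustrative sample data, not real traffic.
sample_events = [
    {"ts": "2024-05-13T10:00:00+00:00", "campaign_id": "c1", "type": "impression"},
    {"ts": "2024-05-13T10:00:05+00:00", "campaign_id": "c1", "type": "impression"},
    {"ts": "2024-05-13T10:00:07+00:00", "campaign_id": "c1", "type": "click"},
    # 23:30 at UTC-2 is 01:30 UTC on the following day.
    {"ts": "2024-05-13T23:30:00-02:00", "campaign_id": "c1", "type": "impression"},
]

rollup = daily_rollup(sample_events)
for (day, campaign, event_type), n in sorted(rollup.items()):
    print(day, campaign, event_type, n)
```

The point of the sketch is that the rollup logic itself is small; most of the effort described above goes into the orchestration, instrumentation, and maintenance around it.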

In parallel, for years, we had been building our own data reporting and analysis system—called Data Octopus—to better operate our own business, pulling data from sources such as time-tracking, accounting, recruiting, capacity-planning, and other systems and cross-referencing data to predict delivery quality and profitability for each of our projects. For example, we can determine, hour by hour, exactly why each dollar of project margin didn’t match expectations. After several iterations, we eventually standardized on a cloud-based serverless architecture that was both data source– and data repository–agnostic and that provided an easily deployable and maintainable pipeline. This proved so successful that we ended up deploying the same architecture to Lineate customers who had similar needs.

As Data Octopus became more mature, we realized that the architecture could easily be extended to handle the kinds of massive data streams common in AdTech while still efficiently incorporating data from more modestly sized data sources, all within a single deployment. For example, we use pure Python Glue jobs for low-volume data processing that reads from relational schemas or SaaS APIs, but we can switch to Spark Glue jobs that read from Kafka streams to handle very high-volume data processing. Finally, within the same deployment, we can add monitoring, alerting, and validation of the reporting structure to ensure a robust system and lower ongoing maintenance needs.
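One way to picture the choice between job types is as a small planning function that maps each data source to an AWS Glue job command. The sketch below is an assumption about how such a rule could look, not Data Octopus itself: the `Source` class, the `kind` labels, and the volume threshold are hypothetical. The returned strings (`pythonshell`, `glueetl`, `gluestreaming`) are the command names AWS Glue uses for Python shell, Spark ETL, and Spark streaming jobs.

```python
from dataclasses import dataclass

# Hypothetical cut-off, not taken from the article; tune per deployment.
HIGH_VOLUME_EVENTS_PER_SEC = 10_000


@dataclass(frozen=True)
class Source:
    name: str
    kind: str            # assumed labels: "relational", "api", or "kafka"
    events_per_sec: int  # rough expected peak rate


def glue_command_name(source: Source) -> str:
    """Pick an AWS Glue job command name for one data source."""
    if source.kind == "kafka":
        # Unbounded streams go to a Spark streaming job.
        return "gluestreaming"
    if source.events_per_sec >= HIGH_VOLUME_EVENTS_PER_SEC:
        # Large batch sources go to a Spark ETL job.
        return "glueetl"
    # Small relational or API pulls fit in a plain Python shell job.
    return "pythonshell"


def plan(sources):
    """Map each source name to the Glue command chosen for it."""
    return {s.name: glue_command_name(s) for s in sources}


# Illustrative sources only.
job_plan = plan([
    Source("campaign_config", "relational", 5),
    Source("vendor_metrics", "api", 50),
    Source("bid_stream", "kafka", 2_000_000),
])
print(job_plan)
```

Keeping the decision in one place like this means a source can move from a Python job to a Spark job by changing its description rather than by rebuilding the pipeline.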

Data Octopus is not a standalone product but a unified, cloud-based architecture that we now use to solve most data integration needs, whether they involve traditional table-based data or large-scale streaming data. With Data Octopus, we now have a system that we believe will allow us to quickly bootstrap greenfield AdTech reporting projects so that value can be delivered very early in the project, and it can be easily configured to include component services that are optimal for handling the very large dataset sizes inherent in the AdTech business domain.
