Sovrn provides advertising tools, technologies, and services to tens of thousands of content creators, helping them make money, grow their businesses, and access a massive data commons that provides extraordinary insights.
Pulling data from over 40 different systems isn’t a problem in itself. Linking records by user and provider identifiers across 10 × 10⁹ records per day isn’t impossible. Designing an architecture that processes and stores the data in a fast, reliable, and cost-effective way? That’s where the fun begins.
The Data-as-a-Service platform needs to ingest 115,000 records per second, link them according to identifier mappings, and store them for 90 days. The MVP architecture was expensive, with costs driven up by ingesting and processing a large number of duplicate records.
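As a quick back-of-the-envelope check (the arithmetic is ours, not from the original figures), 115,000 records per second does indeed work out to roughly 10 × 10⁹ records per day:

```python
# Sanity check: does 115,000 records/s match ~10 * 10^9 records/day?
RECORDS_PER_SECOND = 115_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

records_per_day = RECORDS_PER_SECOND * SECONDS_PER_DAY
print(f"{records_per_day:,} records/day")  # 9,936,000,000 -- close to 10 * 10^9
```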
We began by prototyping and testing solutions on a variety of storage types: graph databases (AWS Neptune, TigerGraph, Nebula) and wide-column databases (Cassandra, ScyllaDB).
Our test results showed that AWS Neptune couldn’t handle the load effectively, while TigerGraph was too expensive.
The DaaS storage now processes 10 × 10⁹ (!) new linked data records from 40 different systems daily, with a TTL of 90 days.
Previously, the MVP had to operate on heavily duplicated data (a 4x duplication factor); now the data is deduplicated automatically during insertion into the storage.
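A minimal sketch of how insert-time deduplication can fall out of a key-value store like HBase: writes with the same row key overwrite the previous version instead of accumulating duplicates. The row-key scheme and field names below are hypothetical, for illustration only:

```python
# Simulate HBase-style dedup-on-write: a record's identity is its row key,
# so re-inserting the same key overwrites rather than duplicates.
from typing import Dict

def row_key(provider_id: str, user_id: str, record_type: str) -> str:
    # Hypothetical key scheme: identity fields concatenated into one key.
    return f"{provider_id}#{user_id}#{record_type}"

storage: Dict[str, dict] = {}  # stand-in for an HBase table

def put(record: dict) -> None:
    key = row_key(record["provider_id"], record["user_id"], record["type"])
    storage[key] = record  # same key => overwrite, i.e. automatic dedup

# Four copies of the same logical record (duplication factor 4x)...
for _ in range(4):
    put({"provider_id": "p1", "user_id": "u42", "type": "click", "ts": 1})

print(len(storage))  # 1 -- duplicates collapsed at insertion time
```

With this design, no separate deduplication job is needed: the write path itself enforces uniqueness.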
Thanks to deduplication, the linked-data storage needed to keep 90 days of data will be smaller than the raw-data storage for just 7 days.
The HBase-based DaaS storage is expected to cost less than $10K per month, which makes it profitable for the customer.
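For reference, the 90-day retention can be enforced natively by HBase through a column-family TTL, set via the HBase shell; the table and column-family names below are illustrative, not from the actual deployment:

```
# 90 days = 90 * 24 * 3600 = 7,776,000 seconds
alter 'linked_data', {NAME => 'd', TTL => 7776000}
```

Expired cells are then dropped automatically during compactions, with no application-side cleanup job.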