
SOVRN: TECH STORY

DATA-AS-A-SERVICE, WHEN DATA = 115,000 RECORDS PER SECOND

Sovrn provides advertising tools, technologies, and services to tens of thousands of content creators, helping them make money, grow their businesses, and access a massive data commons that provides extraordinary insights.

Pulling data from over 40 different systems isn’t a problem in itself. Linking records by user and provider identifiers across 10 × 10^9 (10 billion) records per day isn’t impossible. Designing an architecture that processes and stores the data in a fast, reliable, and cost-effective way? That’s where the fun begins.

THE PROBLEM

The Data-as-a-Service (DaaS) platform needs to ingest 115,000 records per second, link them according to an identifier mapping, and store them for 90 days. The MVP architecture was expensive, with the cost driven up by ingesting and processing a large number of duplicate records.
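To put these requirements in perspective, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate that assumes the 115,000 records per second is a sustained average over the whole day; the class and method names are illustrative and not part of the platform.

```java
// Back-of-the-envelope check of the volumes quoted in the text.
// Assumption: 115,000 records/s is a sustained average, not a peak.
class VolumeEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Records per day at a sustained per-second rate.
    static long recordsPerDay(long recordsPerSecond) {
        return recordsPerSecond * SECONDS_PER_DAY;
    }

    // Records accumulated over a retention window, before any deduplication.
    static long recordsRetained(long recordsPerSecond, int retentionDays) {
        return recordsPerDay(recordsPerSecond) * retentionDays;
    }

    public static void main(String[] args) {
        System.out.println(recordsPerDay(115_000));       // 9936000000
        System.out.println(recordsRetained(115_000, 90)); // 894240000000
    }
}
```

At that rate the platform sees roughly 9.9 × 10^9 records per day, which matches the 10 × 10^9 daily figure quoted in this story, and close to 9 × 10^11 records over the 90-day window before duplicates are removed.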

 

THE APPROACH 

We began by prototyping and testing solutions on a variety of storage types: graph databases (AWS Neptune, TigerGraph, Nebula) and columnar databases (Cassandra, ScyllaDB).

Our test results showed that AWS Neptune couldn’t handle the load effectively and that TigerGraph was too expensive.

During our prototyping we tested the following technologies:

  • HBase — the eventual storage solution
  • Graph databases (AWS Neptune, TigerGraph, Nebula), columnar databases (Keyspaces/Cassandra, ScyllaDB)
  • Java, Scala
  • Apache Spark
  • AWS Lambda
  • SQS, SNS, AWS EventBridge
  • Terraform
  • AWS EMR, S3
  • DataDog
The final architecture is depicted below:

Fig. 1. Process flow diagram

Fig. 2. Scala job sequence diagram

THE RESULTS 

The DaaS storage processes 10 × 10^9 (!) new linked data records from 40 different systems every day, with a TTL of 90 days.
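As a small illustration of the retention window: HBase expresses a column family's TTL in seconds, so 90 days corresponds to 7,776,000 seconds. The sketch below only models that arithmetic and an expiry check in plain Java. It assumes the TTL is applied at the column-family level, which the story does not state, and the class name is hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Plain-Java model of a 90-day retention window (no HBase client involved).
class RetentionWindow {
    // The 90-day TTL from the text, in seconds.
    static final long TTL_SECONDS = TimeUnit.DAYS.toSeconds(90); // 7,776,000

    // True when a cell written at writeMillis is past its TTL at nowMillis.
    static boolean isExpired(long writeMillis, long nowMillis) {
        return nowMillis - writeMillis > TimeUnit.SECONDS.toMillis(TTL_SECONDS);
    }
}
```

With a store-level TTL, the application does not need a separate cleanup job: records older than the window simply stop being returned.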

Previously, the MVP had to operate on heavily duplicated data (duplication factor of 4x). Now the data is deduplicated automatically during insertion into the storage.
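The story does not describe how deduplication on insert is implemented. One common way to get it from a key-ordered store such as HBase is to derive the row key deterministically from the linked identifiers, so that a repeated record lands on the same row instead of creating a new one. The sketch below mimics that behavior with an in-memory map. The row key layout, class, and method names are assumptions made for illustration only.

```java
import java.util.TreeMap;

// Toy stand-in for a key-ordered store: writing the same row key twice keeps one row.
class LinkedRecordStore {
    private final TreeMap<String, String> rows = new TreeMap<String, String>();

    // Hypothetical row key layout: userId, provider and providerId joined by '|'.
    static String rowKey(String userId, String provider, String providerId) {
        return userId + "|" + provider + "|" + providerId;
    }

    // Upsert: a repeated key overwrites the existing row instead of adding one.
    void put(String userId, String provider, String providerId, String payload) {
        rows.put(rowKey(userId, provider, providerId), payload);
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        LinkedRecordStore store = new LinkedRecordStore();
        for (int i = 0; i < 4; i++) { // the same link arriving four times
            store.put("user-42", "providerA", "a-1", "payload");
        }
        store.put("user-42", "providerB", "b-9", "payload");
        System.out.println(store.size()); // 2
    }
}
```

Here four copies of the same link, matching the 4x duplication factor mentioned above, collapse into a single stored row, and only a genuinely different link adds a second one.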

With deduplication, the linked data storage needed to keep 90 days of data is expected to be smaller than the raw data storage needed for 7 days.

The HBase-based DaaS storage is expected to cost less than $10K per month, which will prove profitable for the customer.