Can Aerospike’s graph database solution solve identity resolution more effectively?

2024-02-15

At AWS re:Invent in December 2023, our engineering team had the opportunity to discuss Aerospike Graph, the graph database solution Aerospike released in June 2023, with Aerospike engineers. We were impressed by the capabilities they demonstrated and by the apparent ease of use of the system and its syntax, both built on top of Aerospike’s core platform. After re:Invent, during a regular Friday tech brainstorming session with our Solution Architecture team, we realized that these new capabilities made it worth revisiting a graph-based identity resolution solution we had built for one of our large ad tech clients.

In 2022, one of our largest long-term ad tech clients came to us looking for a solution to the problem of cookieless identity resolution. After working with the client to understand their requirements, in particular the need to associate data points with unique users across multiple site visits without cookies, our engineers concluded that a graph-based abstraction over the large volumes of captured data would support the queries the client’s use case required. Given a graph-based approach, a graph database seemed like an obvious fit for storing incoming event data and querying it. However, whether the available graph databases could handle the huge read-write data volume inherent in ad tech remained an open question. To validate our approach, our engineering teams tested several graph databases against high-volume simulated data sets in a highly concurrent read-write environment: Neo4j, Amazon Neptune, NebulaGraph, and TigerGraph.

During our testing, we found that these graph databases could store large amounts of data and run reasonably fast multi-hop ID queries across it. However, ad tech often requires full table scans, for example when a partner needs to join a company’s user or device data with its own and output the intersection of users between the two systems. These wide queries proved problematic with all the graph databases we tried. We concluded that the graph databases we tested were better suited to a smaller volume of data mutations and to queries other than full table scans. We also evaluated several non-graph NoSQL databases, such as ScyllaDB and Apache HBase. In the end, we implemented a graph-oriented solution on top of Apache HBase that was able to scale to our required data volume.
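To make the difference between these two query shapes concrete, here is a minimal in-memory sketch in Python. The identifiers, the edge list, and the helper functions `neighbors_within` and `user_intersection` are all hypothetical illustrations; they are not taken from the client system or from any of the databases we tested. The point is only that a multi-hop ID lookup touches a small neighborhood of the graph, while a partner intersection has to visit every vertex.

```python
from collections import deque

# Hypothetical identity edges: each pair links two identifiers seen together.
EDGES = [
    ("ip:10.0.0.1", "device:A"),
    ("device:A", "user:1"),
    ("device:B", "user:1"),
    ("ip:10.0.0.2", "device:C"),
    ("device:C", "user:2"),
]


def build_adjacency(edges):
    """Store every edge in both directions so traversal is undirected."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def neighbors_within(adj, start, max_hops):
    """Narrow query: identifiers reachable from `start` in at most `max_hops`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def user_intersection(adj, partner_user_ids):
    """Wide query: scans every vertex to find users known to both sides."""
    ours = {v for v in adj if v.startswith("user:")}
    return ours & set(partner_user_ids)


ADJ = build_adjacency(EDGES)
```

In this toy data set, `neighbors_within(ADJ, "ip:10.0.0.1", 3)` only ever visits the four vertices around `user:1`, whereas `user_intersection(ADJ, [...])` iterates over the whole vertex set regardless of how small the partner list is. At ad tech volumes, that second access pattern is the one that behaves like a full table scan.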

Our solution built on top of Apache HBase was successful, but it came with inherent limitations. In particular, although we were able to maintain the very convenient graph abstraction over our datasets, we ultimately needed to hard-code precomputed query paths into Apache HBase indexes, thereby giving up the ability to do multi-hop lookups in arbitrary directions. Given these experiences, which you can read about in more detail here, we are particularly excited by the opportunities that Aerospike Graph appears to offer.
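The trade-off of precomputed query paths can be illustrated with a small Python sketch. This is not our actual HBase schema: a plain dictionary stands in for an index table, and the edge data, the `kind` helper, and `precompute_path_index` are hypothetical names introduced only for this example. Each index materializes one fixed path through the graph, so a lookup that follows a different path or direction has nothing to read until another index is built for it.

```python
# Hypothetical identity edges; a plain dict stands in for an index table.
EDGES = [
    ("ip:10.0.0.1", "device:A"),
    ("device:A", "user:1"),
    ("device:B", "user:1"),
]


def kind(vertex):
    """Return the identifier type, e.g. 'ip' for 'ip:10.0.0.1'."""
    return vertex.split(":", 1)[0]


def build_adjacency(edges):
    """Store every edge in both directions."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def precompute_path_index(edges, path):
    """Materialize one fixed path of vertex kinds as start -> set of end vertices.

    `path` is a tuple such as ("ip", "device", "user"). Only starts of kind
    path[0] get an entry, and only ends of kind path[-1] are recorded.
    """
    adj = build_adjacency(edges)
    index = {}
    for start in adj:
        if kind(start) != path[0]:
            continue
        frontier = {start}
        for wanted in path[1:]:
            # Follow edges one hop, keeping only vertices of the next kind.
            frontier = {n for v in frontier for n in adj[v] if kind(n) == wanted}
        if frontier:
            index[start] = frontier
    return index


# One hard-coded path: ip -> device -> user.
IP_TO_USER = precompute_path_index(EDGES, ("ip", "device", "user"))
```

Here `IP_TO_USER.get("ip:10.0.0.1")` resolves in a single lookup, but `IP_TO_USER.get("user:1")` returns `None`: answering "which devices belong to this user?" would require a second index built with `precompute_path_index(EDGES, ("user", "device"))`. Every new question means another index to build and maintain, which is the flexibility a general multi-hop traversal would avoid.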

Aerospike Graph provides a convenient and familiar Gremlin Query Language interface on top of Aerospike core components and is designed to tackle data at ad tech scale. Given our extensive experience implementing ad tech solutions on top of Aerospike and our past examination of Aerospike’s capabilities (we were known as Thumbtack Technology at the time), we believe that Aerospike Graph should be an obvious fit for implementing the sort of large-scale identity graph solutions required by our client.

In summary, our hypothesis is that Aerospike Graph can meet the known performance requirements of our client’s ad tech use case while still allowing for performant, arbitrarily oriented, multi-hop queries across the domain graph. The ability to query in arbitrary directions is critical for ad hoc data exploration and reporting. We are currently testing this hypothesis, and our Solution Architecture team expects to deliver experimental results at scale in the coming weeks.
