NOTICE Unless I explicitly say otherwise, all pages and posts on this site are largely speculation based on observations. Where experiments have been done to provide conclusive evidence, it is noted. These posts are a brain dump of my thoughts (to capture those elusive moments) at the time of writing, and are therefore likely to change as the results of further research prove otherwise. Obviously this does not apply to posts where I describe a technique that has been implemented and tested.

Abstract - Introduction

Here I propose "Vertex Drift", a dynamic partitioning algorithm suited to the query model used in the Tesseract. Most current dynamic graph partitioning algorithms focus on inter-processor communication and data migration, i.e. on a single node; (Walshaw, Cross & Everett, 1997), for example, introduces the relative gain optimization technique. These and other algorithms require some amount of data migration or are iterative techniques. Vertex drift requires neither iteration nor data migration, and instead of working with multiple processors it is focused on working with multiple machines. It borrows ideas from Dynamo's virtual nodes (DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall & Vogels, 2007) to create a technique whereby a vertex, or some of its data, is split and the pieces "drift" around the cluster. The Tesseract query model, in which "markers" and path prediction are used, means that when a vertex is drifted around the cluster a lot is known about it in advance. Using the vertex drift algorithm, a traversal involving such a vertex requires only n messages in the worst case, where n is the number of nodes in the cluster, and fewer still in the best case.
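To make the idea concrete, below is a minimal, hypothetical sketch of the drift-and-traverse flow in Python. The `Cluster`, `drift` and `traverse` names and the hash-based placement are illustrative assumptions on my part, not Tesseract internals; the point is only that a traversal touching a drifted vertex sends at most one message per machine.

```python
from collections import defaultdict

class Cluster:
    """A toy cluster of n machines, each holding fragments of vertex edge lists."""

    def __init__(self, n):
        self.n = n
        # one dict per machine: vertex id -> list of locally held edges
        self.machines = [defaultdict(list) for _ in range(n)]

    def drift(self, vertex_id, edges):
        """Split a vertex's edges and let the pieces drift across machines."""
        for i, edge in enumerate(edges):
            target = hash((vertex_id, i)) % self.n  # placement rule is an assumption
            self.machines[target][vertex_id].append(edge)

    def traverse(self, vertex_id):
        """Gather a drifted vertex's edges: one message per machine holding a
        fragment, so at most n messages for the whole traversal."""
        neighbours, messages = [], 0
        for machine in self.machines:
            if vertex_id in machine:
                messages += 1
                neighbours.extend(machine[vertex_id])
        return neighbours, messages


cluster = Cluster(n=4)
cluster.drift("justinbieber", ["follower_%d" % i for i in range(10)])
edges, messages = cluster.traverse("justinbieber")
print(len(edges), "edges gathered with", messages, "messages")
```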

Background - Why is this needed?

Everyday graphs, as they occur in the real world, can be very disproportionate. This makes it tricky to create an algorithm that automatically partitions data in a reasonable way and ensures minimum network communication during traversals.

As an example, consider the Twitter social graph: some users have 40+ million followers, while the average Twitter user has only about 200. The following images show the number of followers, the number of accounts followed and the number of Tweets for Justin Bieber, Katy Perry, Lady Gaga and myself.

[Images: follower counts for Justin Bieber, Katy Perry, Lady Gaga and Courtney Robinson]

These images are a clear demonstration of how ill-proportioned a real-world graph can be. The question now becomes: how can the relationships of each of these disproportionate vertices be stored such that traversal is evenly spread across the cluster? One answer is vertex drift: by splitting the in and out edges across multiple machines, a traversal involving such a vertex can run against its in- or outbound edges in parallel across the entire cluster, as sketched below.
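As a rough illustration of that parallelism (and only an illustration — the partitioning and query functions here are stand-ins I have invented, not Tesseract code), splitting a huge edge list into per-machine slices lets every slice be scanned concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def partition_edges(edges, machine_count):
    """Round-robin a large edge list into one slice per machine."""
    slices = [[] for _ in range(machine_count)]
    for i, edge in enumerate(edges):
        slices[i % machine_count].append(edge)
    return slices

def query_slice(edge_slice, predicate):
    """Stand-in for one machine filtering its local slice of the vertex's edges."""
    return [e for e in edge_slice if predicate(e)]

def parallel_traverse(edges, machine_count, predicate):
    """Fan the traversal out over every slice at once and merge the results."""
    slices = partition_edges(edges, machine_count)
    with ThreadPoolExecutor(max_workers=machine_count) as pool:
        futures = [pool.submit(query_slice, s, predicate) for s in slices]
        results = []
        for future in futures:
            results.extend(future.result())
    return results

# e.g. find the followers whose id ends in "7", spread over 4 machines
followers = ["user_%d" % i for i in range(1000)]
print(parallel_traverse(followers, 4, lambda e: e.endswith("7")))
```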

Edge partitioning

Edge partitioning is based on the virtual node concept from Dynamo, also implemented in Cassandra. While the application here is different, it is a similar concept with optimizations for graphs.
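For what it's worth, here is a rough sketch of how Dynamo/Cassandra-style virtual nodes could place the individual edges of a high-degree vertex on a hash ring. The token counts, hashing scheme and names are my assumptions and not necessarily how Tesseract will do it:

```python
import hashlib
from bisect import bisect

def token(value):
    """Hash a string onto the ring as a large integer."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(machines, vnodes_per_machine=8):
    """Each physical machine owns several virtual nodes (tokens) on the ring."""
    return sorted((token("%s:%d" % (m, v)), m)
                  for m in machines
                  for v in range(vnodes_per_machine))

def machine_for_edge(ring, vertex_id, edge_id):
    """An edge is placed on the machine owning the next token clockwise."""
    tokens = [t for t, _ in ring]
    idx = bisect(tokens, token("%s->%s" % (vertex_id, edge_id))) % len(ring)
    return ring[idx][1]

ring = build_ring(["node-a", "node-b", "node-c"])
for follower in ["alice", "bob", "carol", "dave"]:
    print(follower, "->", machine_for_edge(ring, "ladygaga", follower))
```

Because each machine owns many small token ranges, the edges of a single enormous vertex spread roughly evenly across the cluster instead of landing on whichever machine stores the vertex record itself.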

References

  1. Walshaw, C., Cross, M., & Everett, M. G. (1997). Parallel dynamic graph partitioning for adaptive unstructured meshes. Journal of Parallel and Distributed Computing, 47(2), 102–108.
  2. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., & Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. In SOSP (Vol. 7, pp. 205–220).