NOTICE Unless I explicitly say otherwise, all pages and posts on this site are largely speculation based on observations. Where experiments have been done to provide conclusive evidence it is noted. These posts are a brain dump of my thoughts (to capture those elusive moments) at the time of writing, and is therefore likely to change as the results of research proves otherwise. Obviously this does not apply to posts where I describe a technique that has been implemented and tested.


Graph/Network theory in general is a very broad subject. In this post I'll try to narrow the scope of the project without closing possible routes of exploration later as it develops.

This post is largely a brain dump of my current line of thinking, and is therefore likely to change as the results of research proves otherwise. It encompasses many observations from real world projects but also experiments with available systems on a small scale (few GB of data).

Concurrency vs Parallelism

A side note before we go any further. The terms concurrency and parallelism are often used interchangeably. In our case interchanging the two would be incorrect and as such the two terms, when used are used to be interpretted by the following definitions:

  • Concurrency the ability of one or more tasks to be scheduled such that each task appears to progress at an almost equal pace as other tasks.

  • Parallelism is the ability of a system to take a number of tasks and execute them all at the same time so that progress is not just apparent but actually occurs.

In essence the difference is that concurrency means tasks are interleaved, so given tasks 1 to 5, task 1 runs for 1 second, followed by task 2 then task 3 and so on until task 1 runs again, this continues until all tasks are complete. Parallelism means all 5 tasks start and run along side each other.

Rejecting Bulk synchronous approaches

Modern graph abstractions such as Pregel (Malewicz, Austern, Bik, Dehnert, Horn, Leiser & Czajkowski, 2010) use the Bulk Synchronous Parallel (BSP) (Bisseling & McColl, 1993) approach to graph processing. This is a credible and proven approach but BSP incurs an expensive synchronization step. It uses what are typically called super steps which is effectively where a group of computations occur and all of them converge on a 'border' before moving to the next super step, this convergence means that all BSP based approaches utilizing the sync. step can only ever be as fast as the slowest operation within a superstep. i.e. a super step is made up of n smaller steps that run, usually in parallel, their results accumulate to form the results of their containing superstep and the next super step cannot continue until all the smaller steps have completed.

Parellelizing graph operations

A typical data store allows you to ask the question "What is x?" and as far as a client is concerned the answer given was correct at the time of the query. Even if it changed moments after the query was finished. Under these usual circumstances a data store only ever really operates on a snapshot of the data, this observation is largely implicit.

Given our current models of querying databases, how do you effectively parallelize a query guaranteeing no conflicts or corruption? One answer is to ditch most of the current models. Stop being implicit and make snapshot processing an explicit, first class operation. The question then becomes "What is x as of t1 and up to t2 ?".

Aim is, given any query, execute it in parallel across the entire cluster... The key to making this happen is immutability. Every snapshot is immutable i.e. functional purity or isolated side effects (Okasaki, 1999). Given a function that operates on a snapshot, that function can emit a new version of a value but this results in the existing value being superseeded without modification. Iterative algorithms either operate on the current version of data or are run in parralel on newer versions as soon as the new version is created. This assumes the current iteration doesn't finish immediately after generating a new version.

Snap+ => Matryoshka

You parralelize a query by creating time slices. Each time slice is a snapshot. Subsequent snapshots are time ordered such that the associativity laws of addition apply, i.e. if all the results of the snapshots are to be added together, the result does not depend on the order in which the snapshots are added.

A merge function is used to combine the result of two snapshot computations. This allows a series of snapshots to have a result computed and each result can become the input to the next application of the merge function.

This results in a model where we begin with lots of small computations that are, in parallel, merged to eventually produce a single result there by resulting in an effect similar to the properties of a matryoshka (Russian nesting doll).


This approach forms the basis for support of high throughput streams where the answer to a question is only ever correct at a given time and each iteration could yield a vastly different answer depending on how much time has passed and how many new versions of data has been created since the last query.


This model is useful for computing dynamic global properties. For example, identifying trends in data streams i.e. trending music, people, topics


  1. Malewicz, G., Austern, M. H., Bik, A. Jc., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (135–146).
  2. Bisseling, R. H., & McColl, W. F. (1993). Scientific computing on bulk synchronous parallel architectures.
  3. Okasaki, C. (1999). Purely functional data structures. Cambridge University Press.

blog comments powered by Disqus


18 January 2014