The purpose of this web page is to provide links to the project publications and software and to provide a guide for experimenters to repeat our experimental evaluation.


This project concerns high-performance parallel graph traversal (or, graph search) over directed, in-memory graphs using shared-memory, multiprocessor machines (aka multicore). The main research question addressed by this project is the following:

Can we design asymptotically and practically algorithms for the efficient traversal of in-memory directed graphs? Furthermore, can we design such algorithms so that, for any input graph, the cost of load balancing is negligible, yet the amount of parallelism being utilized is close to the hard limit imposed by the input graph and the machine?

Such algorithms are crucial because multicore hardware demands computations that make the most economical use of their time. In contrast, algorithms that spend many cycles performing load-balancing work are often slower and may consume excess power. There are two competing forces that make this problem challenging. First, the space of possible input graphs is huge, and the graphs vary substantially in their structure. Particularly challenging inputs include long chains and, more generally, high diameter graphs. To add to the challenge, graph traversal algorithms typically discover new regions of graph on the fly. So, the second competing force is the fact that new parallelism is discovered online, thus challenging the load-balancing algorithm to adapt quickly.

To summarize, a good graph-traversal algorithm must cope with the large input space, rapid-fire parallelism, yet keep load-balancing overheads low. There are several algorithms proposed to this end. All of these algorithms and related issues are covered in detail in our paper. Overall, we found that, although the previous state of the art performs well for certain classes of graphs, none meet the main challenge for all graphs.

In our Supercomputing’15 paper1, we answer the main question from above in the affirmative: we present a new algorithm that performs a DFS-like traversal over a given graph. We prove that the algorithm is strongly work efficient, meaning that, in addition to having the linear upper bound on the total work performed, the algorithm effectively amortizes overheads, such as the cycles spent balancing traversal workload between cores. Put a different way, we prove that, for any input, the algorithm confines the amount of load-balancing work to a small fraction of the total (linear) running time. Then, we prove that the algorithm achieves a high degree of parallelism, that is, near optimal under the constraints of the graph. Finally, we present an experimental evaluation showing that, thanks to reducing overheads and increasing parallelism, our implementation of the algorithm outperforms two other state-of-the-art algorithms in all but a few cases, and in some cases, far outperforms the competition.

We made available a long version of our Supercomputing’15 article which includes appendices. In the appendix, there is a longer version of one of the proofs

Run our experimental evaluation

The source code we used for our experimental evaluation is hosted by a Github repository. The source code for our PDFS and our implementation of the DFS of Cong et al is stored in the file named dfs.hpp.

Note. The PDFS source code is located in the new-sc15-graph branch of the pasl repository. Be careful to use only the correct branch if you wish to get the PDFS code directly from github.

1. Prerequisites

To have enough room to run the experiments, your filesystem should have about 300GB of free hard-drive space and your machine at least 128GB or RAM. These space requirements are so large because some of the input graphs we use are huge.

The following packages should be installed on your test machine.

Software dependencies for our pdfs benchmarks.
Package Version Details
gcc >= 4.9.0 Recent gcc is required because pdfs makes heavy use of features of C++1x, such as lambda expressions and higher-order templates. (Home page)
php >= 5.3.10 PHP is currently used by the build system of pdfs. (Home page)
ocaml >= 4.00 Ocaml is required to build the benchmarking script. (Home page)
R >= 2.4.1 The R tools is used by our scripts to generate plots. (Home page)
hwloc recent Optional dependency (See instructions below). This package is used by pdfs to force interleaved NUMA allocation; as such this package is optional and only really relevant for NUMA machines. (Home page)
ipfs recent We are going to use this software to download data sets for our experiments. (Home page)

2. Getting experimental data

IPFS is a tool that is useful for transfering large amounts of data over the internet. We need this tool because our experiments use large input graphs. After installing the package, we need to initialize the local IPFS configuration.

$ ipfs init

Warning: disk space. The default behavior of IPFS is to keep a cache of all downloaded files in the folder ~/.ipfs/. Because the graph data is several gigabytes, the cache folder should have at least twice this much free space. To select a different cache folder for IPFS, before issuing the command ipfs init, set the environment variable $IPFS_PATH to point to the desired path.

Next, we need to run the IPFS daemon. This process needs to be running until after all input graphs have been successfully downloaded to your machine.

$ ipfs daemon &

3. Getting non-synthetic graphs and source code

Now, create a new directory in which to store all of our the code and data.

$ mkdir sc15
$ cd sc15

Let us create a folder in which to store the graphs.

$ mkdir sc15-graphs

The table below summarizes our collection non-synthetic graphs. Each graph can be downloaded individually, or as a whole package. To download just cage15, for instance, we can issue the following command.

$ ipfs get QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF

Note that each graph file must have the extension .adj_bin. To download all of the graphs, we use the following command.

$ ipfs get QmeQPA4mRAhkNta8Pofa8DGfokfr6dT5pGAb1C4A4MmZ3u -o=sc15-graphs

Because the total size is about 30GB, the download may take a long time.

Non-synthetic graphs.
Graph ipfs address
Freescale1 QmUe8sV6hBfS45HtogKC3cxwenR45UHMBcbGtni5WoXqiE
cage14 QmSo1PdRffxbZFGyz89ZkegZM99UTREPbtEJhjhzu9bjPF
cage15 QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF
delaunay QmSXK5B3zPh3WAtuQmtkj6XfkcSK4jLT2WJ67Fzxge7Q6r
europe QmSA2tgcdKoyFWfktdQVb6AuJfsg6Se5cbBfAf878BAiEc
friendster QmWLc6h7MqwbpzFVvrgWrhzpEuhnxdHfEfcW8sc92ddePv
kkt_power QmfBrYSuGrjSGvdP1feFGMVoSa4TfuBsVWhoHUvQtShT9B
livejournal1 QmZJk9P5qodnwfKKHUa4BnCR4NyNg2NkyLbJoJJjFhSm6G
orkut QmYRQUzDmaJ4vHNTJSk8VWs9JTD6kZPyWcKfQabbWxgQvc
rgg QmU5NuE1KDnPo9jEXfr44tr9hu6o1NHtiaq1Vr9B7qvSuK
twitter QmW7y6EL3FtakEw9UxtqhDjkhcZVAGtqJZi5dXeKtcKCjt
usa QmcGm3pWSqksakmggKgXK7wuGhtkYx15y5YHqDobErFvW5
wikipedia-20070206 QmYi8Ga5j4zb1XWBGjR4f9WRWbQYs5BC4yARuJKtAdvG6w
all graphs QmeQPA4mRAhkNta8Pofa8DGfokfr6dT5pGAb1C4A4MmZ3u

Warning. In the command-line samples, we assume that . is in the user’s $PATH, which can be achieved by setting PATH=.:$PATH.

To obtain the source code, first get the downloader script, then perform the following steps.

$ wget
$ chmod u+x

Building with hwloc. If your system has a non-uniform memory model (aka NUMA), then using hwloc may prove crucial to obtain correct experimental results. To link correctly with hwloc, all you need to do is set an environment variable with the path to the hwloc package-configuration folder. On our system, this folder is located at /usr/lib64/pkgconfig/, but that location may differ for your installation.

$ export hwloc_path=/usr/lib64/pkgconfig/

You need to ensure that whatever path you substitute for this one on your machine contains a file named hwloc.pc.

4. Generating synthetic graphs

Before building any packages, we need configure some paths. Let us change to the directory where the configuration file is going to be stored.

$ cd sc15-graph/graph/bench/

The next step is to generate the graph data via our benchmarking script.

$ make graph.pbench

Generation of the synthetic graphs may take a long time. To start running our graph generator, specify the number of processors to be used by the experiment by passing the argument on the command line. For instance, our system has 40 cores.

$ export P=40
$ graph.pbench generate -size large -proc $P

5. Running the experiment

The first series of benchmark runs gather data to serve as baseline measurements. The baseline in this case is a fast sequential graph-traversal algorithm, hence the argument -proc 1.

$ graph.pbench baselines -proc 1 -size large

The next command that needs to be run collects data for each graph to determine the number of vertices reachable from the source vertex.

$ graph.pbench accessible -proc 1 -size large -skip plot

After the command completes, the results of the experiment are going to be stored in the _results folder.

We are now ready to run the main body of experiments. The following command starts the experiments running.

$ graph.pbench overview -proc $P -size large -skip plot -runs 30 

To achieve clean results, we recommend performing thirty runs. However, it may be faster to perform just a few at first, and then collect more data later, next time passing to graph.pbench the flag -mode append.

6. Analyzing the results

After running the experiments, the raw result data should be stored in new files named results_accessible_large.txt, results_baselines_large.txt, and results_overview_large.txt. If the experiments completed successfully, then the kind of plots that appear in our SC’15 paper can be generated by the following command.

$ graph.pbench overview -proc $P -size large -runs 30 -sub graphs,main,locality

After completion, the file named table_graphs.pdf should contain a table that bears resemblance to Table 1 in our SC’15 paper. The file named _results/table_graphs.tex contains the source code. A speedup plot, like the one in Figure 6, should appear in a new file named plot_main.pdf (with souce code in _results/plot_main.pdf-all.r). The plot corresponding to Figure 8 in the paper, namely, the locality study, should appear in the file named plots_overview_locality_large.pdf (with souce code in plots_overview_locality_large.pdf-all.r).



Get the bibtex file used to generate these references.

Acar, Umut A, Arthur Charguéraud, and Mike Rainey. 2015. “A Work-Efficient Algorithm for Parallel Unordered Depth-First Search.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 67:1–67:12. ACM.

  1. (Acar, Charguéraud, and Rainey 2015)