The purpose of this web page is to provide links to the project publications and software and to provide a guide for experimenters to repeat our experimental evaluation.
This project concerns high-performance parallel graph traversal (or graph search) over directed, in-memory graphs on shared-memory multiprocessor machines (i.e., multicore). The main research question addressed by this project is the following:
Can we design asymptotically and practically efficient algorithms for the traversal of in-memory directed graphs? Furthermore, can we design such algorithms so that, for any input graph, the cost of load balancing is negligible, yet the amount of parallelism utilized is close to the hard limit imposed by the input graph and the machine?
Such algorithms are crucial because multicore hardware demands computations that make the most economical use of processor time. In contrast, algorithms that spend many cycles performing load-balancing work are often slower and may consume excess power. Two competing forces make this problem challenging. First, the space of possible input graphs is huge, and the graphs vary substantially in their structure; particularly challenging inputs include long chains and, more generally, high-diameter graphs. Second, graph-traversal algorithms typically discover new regions of the graph on the fly, so new parallelism appears online, challenging the load-balancing algorithm to adapt quickly.
To summarize, a good graph-traversal algorithm must cope with the large input space and with parallelism that materializes rapidly and unpredictably, yet keep load-balancing overheads low. Several algorithms have been proposed to this end; all of them, and the related issues, are covered in detail in our paper. Overall, we found that, although the previous state of the art performs well for certain classes of graphs, none meets the main challenge for all graphs.
In our Supercomputing’15 paper^{1}, we answer the main question from above in the affirmative: we present a new algorithm that performs a DFS-like traversal over a given graph. We prove that the algorithm is strongly work efficient, meaning that, in addition to performing a linearly bounded amount of total work, the algorithm effectively amortizes overheads, such as the cycles spent balancing the traversal workload between cores. Put differently, we prove that, for any input, the algorithm confines load-balancing work to a small fraction of the total (linear) running time. We then prove that the algorithm achieves a high degree of parallelism, that is, near optimal under the constraints of the graph. Finally, we present an experimental evaluation showing that, thanks to reduced overheads and increased parallelism, our implementation outperforms two other state-of-the-art algorithms in all but a few cases, and in some cases far outperforms the competition.
We made available a long version of our Supercomputing’15 article, which includes appendices; the appendix contains a longer version of one of the proofs.
The source code we used for our experimental evaluation is hosted in a GitHub repository. The source code for our PDFS and for our implementation of the DFS of Cong et al. is stored in the file named dfs.hpp.
Note. The PDFS source code is located in the newsc15graph branch of the pasl repository. Be careful to use the correct branch if you wish to get the PDFS code directly from GitHub.
To have enough room to run the experiments, your filesystem should have about 300GB of free hard-drive space and your machine at least 128GB of RAM. These requirements are large because some of the input graphs we use are huge.
The following packages should be installed on your test machine.
| Package | Version | Details |
|---------|---------|---------|
| gcc | >= 4.9.0 | A recent gcc is required because pdfs makes heavy use of C++1x features, such as lambda expressions and higher-order templates. (Home page) |
| php | >= 5.3.10 | PHP is currently used by the build system of pdfs. (Home page) |
| ocaml | >= 4.00 | OCaml is required to build the benchmarking script. (Home page) |
| R | >= 2.4.1 | R is used by our scripts to generate plots. (Home page) |
| hwloc | recent | Optional dependency (see instructions below). This package is used by pdfs to force interleaved NUMA allocation; as such, it is only really relevant for NUMA machines. (Home page) |
| ipfs | recent | We use this software to download the data sets for our experiments. (Home page) |
IPFS is a tool for transferring large amounts of data over the Internet. We need it because our experiments use large input graphs. After installing the package, we need to initialize the local IPFS configuration.
$ ipfs init
Warning: disk space. By default, IPFS keeps a cache of all downloaded files in the folder ~/.ipfs/. Because the graph data occupies several gigabytes, the cache folder should have at least twice that much free space. To select a different cache folder for IPFS, set the environment variable $IPFS_PATH to the desired path before issuing the command ipfs init.
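For example, assuming a large scratch disk mounted at /scratch (a hypothetical path; substitute any folder with enough free space), the cache can be redirected like so:

```shell
# Hypothetical mount point with plenty of free space.
export IPFS_PATH=/scratch/ipfs-cache
ipfs init
```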
Next, we need to run the IPFS daemon. This process needs to keep running until all input graphs have been successfully downloaded to your machine.
$ ipfs daemon &
Now, create a new directory in which to store all of the code and data.
$ mkdir sc15
$ cd sc15
Let us create a folder in which to store the graphs.
$ mkdir sc15graphs
The table below summarizes our collection of non-synthetic graphs. Each graph can be downloaded individually or as a whole package. To download just cage15, for instance, issue the following command.
$ ipfs get QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF -o=sc15graphs/cage15.adj_bin
Note that each graph file must have the extension .adj_bin. To download all of the graphs, we use the following command.
$ ipfs get QmeQPA4mRAhkNta8Pofa8DGfokfr6dT5pGAb1C4A4MmZ3u -o=sc15graphs
Because the total size is about 30GB, the download may take a long time.
| Graph | IPFS address |
|-------|--------------|
| Freescale1 | QmUe8sV6hBfS45HtogKC3cxwenR45UHMBcbGtni5WoXqiE |
| cage14 | QmSo1PdRffxbZFGyz89ZkegZM99UTREPbtEJhjhzu9bjPF |
| cage15 | QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF |
| delaunay | QmSXK5B3zPh3WAtuQmtkj6XfkcSK4jLT2WJ67Fzxge7Q6r |
| europe | QmSA2tgcdKoyFWfktdQVb6AuJfsg6Se5cbBfAf878BAiEc |
| friendster | QmWLc6h7MqwbpzFVvrgWrhzpEuhnxdHfEfcW8sc92ddePv |
| kkt_power | QmfBrYSuGrjSGvdP1feFGMVoSa4TfuBsVWhoHUvQtShT9B |
| livejournal1 | QmZJk9P5qodnwfKKHUa4BnCR4NyNg2NkyLbJoJJjFhSm6G |
| orkut | QmYRQUzDmaJ4vHNTJSk8VWs9JTD6kZPyWcKfQabbWxgQvc |
| rgg | QmU5NuE1KDnPo9jEXfr44tr9hu6o1NHtiaq1Vr9B7qvSuK |
| | QmW7y6EL3FtakEw9UxtqhDjkhcZVAGtqJZi5dXeKtcKCjt |
| usa | QmcGm3pWSqksakmggKgXK7wuGhtkYx15y5YHqDobErFvW5 |
| wikipedia20070206 | QmYi8Ga5j4zb1XWBGjR4f9WRWbQYs5BC4yARuJKtAdvG6w |
| all graphs | QmeQPA4mRAhkNta8Pofa8DGfokfr6dT5pGAb1C4A4MmZ3u |
Warning. In the command-line samples, we assume that . is in the user’s $PATH, which can be achieved by setting PATH=.:$PATH.
To obtain the source code, first get the downloader script, then perform the following steps.
$ wget http://deepsea.inria.fr/graph/get.sh
$ chmod u+x get.sh
$ get.sh
Building with hwloc. If your system has a non-uniform memory architecture (NUMA), then using hwloc may prove crucial to obtaining correct experimental results. To link correctly with hwloc, all you need to do is set an environment variable with the path to the hwloc package-configuration folder. On our system, this folder is located at /usr/lib64/pkgconfig/, but the location may differ on your installation.
$ export hwloc_path=/usr/lib64/pkgconfig/
You need to ensure that whatever path you substitute for this one on your machine contains a file named hwloc.pc.
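To check that the folder is usable, you can query pkg-config directly; this check is our suggestion, not part of the original instructions, and the path shown is the one from our system.

```shell
# Point pkg-config at the folder containing hwloc.pc and query it.
export PKG_CONFIG_PATH=/usr/lib64/pkgconfig/
pkg-config --exists hwloc && pkg-config --modversion hwloc
```

If the second command prints a version number, the path is correct.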
Before building any packages, we need to configure some paths. Let us change to the directory where the configuration file is going to be stored.
$ cd sc15graph/graph/bench/
The next step is to build our benchmarking script, which we then use to generate the graph data.
$ make graph.pbench
Generation of the synthetic graphs may take a long time. Before starting the graph generator, specify the number of processors to be used by the experiments by passing the -proc argument on the command line. For instance, our system has 40 cores.
$ export P=40
$ graph.pbench generate -size large -proc $P
The first series of benchmark runs gathers data to serve as baseline measurements. The baseline in this case is a fast sequential graph-traversal algorithm, hence the argument -proc 1.
$ graph.pbench baselines -proc 1 -size large
The next command collects, for each graph, the number of vertices reachable from the source vertex.
$ graph.pbench accessible -proc 1 -size large -skip plot
After the command completes, the results of the experiment are stored in the _results folder.
We are now ready to run the main body of experiments. The following command starts the experiments running.
$ graph.pbench overview -proc $P -size large -skip plot -runs 30
To obtain clean results, we recommend performing thirty runs. However, it may be faster to perform just a few at first and to collect more data later, passing graph.pbench the flag -mode append on subsequent invocations.
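For example, a split data collection might look like the following; the exact run counts are our own choice, and the flag spellings are assumptions based on the commands above.

```shell
# First pass: a quick 5 runs to sanity-check the setup.
graph.pbench overview -proc $P -size large -skip plot -runs 5
# Later: 25 more runs appended to the same results file.
graph.pbench overview -proc $P -size large -skip plot -runs 25 -mode append
```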
After running the experiments, the raw result data should be stored in new files named results_accessible_large.txt, results_baselines_large.txt, and results_overview_large.txt. If the experiments completed successfully, then plots of the kind that appear in our SC’15 paper can be generated by the following command.
$ graph.pbench overview -proc $P -size large -runs 30 -sub graphs,main,locality
After completion, the file named table_graphs.pdf should contain a table resembling Table 1 in our SC’15 paper; the file named _results/table_graphs.tex contains its source. A speedup plot like the one in Figure 6 should appear in a new file named plot_main.pdf (with source code in _results/plot_main.pdfall.r). The plot corresponding to Figure 8 in the paper, namely the locality study, should appear in the file named plots_overview_locality_large.pdf (with source code in plots_overview_locality_large.pdfall.r).
Get the BibTeX file used to generate these references.
Acar, Umut A., Arthur Charguéraud, and Mike Rainey. 2015. “A Work-Efficient Algorithm for Parallel Unordered Depth-First Search.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 67:1–67:12. ACM. http://chargueraud.org/research/2015/pdfs/pdfs_sc15.pdf.
(Acar, Charguéraud, and Rainey 2015)↩