The purpose of this web page is to provide links to the project publications and software and to provide a guide for experimenters to repeat our experimental evaluation.
This project concerns high-performance parallel graph traversal (or, graph search) over directed, in-memory graphs using shared-memory, multiprocessor machines (aka multicore). The main research question addressed by this project is the following:
Can we design asymptotically and practically efficient algorithms for traversing in-memory directed graphs? Furthermore, can we design such algorithms so that, for any input graph, the cost of load balancing is negligible, yet the amount of parallelism utilized is close to the hard limit imposed by the input graph and the machine?
Such algorithms are crucial because multicore hardware demands computations that make the most economical use of processor cycles. In contrast, algorithms that spend many cycles performing load-balancing work are often slower and may consume excess power. Two competing forces make this problem challenging. First, the space of possible input graphs is huge, and the graphs vary substantially in their structure. Particularly challenging inputs include long chains and, more generally, high-diameter graphs. To add to the challenge, graph-traversal algorithms typically discover new regions of the graph on the fly. The second competing force, then, is the fact that new parallelism is discovered online, which challenges the load-balancing algorithm to adapt quickly.
To summarize, a good graph-traversal algorithm must cope with the large input space and with parallelism that unfolds rapidly, all while keeping load-balancing overheads low. Several algorithms have been proposed to this end. All of these algorithms and the related issues are covered in detail in our paper. Overall, we found that, although the previous state of the art performs well for certain classes of graphs, none of the existing algorithms meets the main challenge for all graphs.
In our Supercomputing’15 paper, we answer the main question from above in the affirmative: we present a new algorithm that performs a DFS-like traversal over a given graph. We prove that the algorithm is strongly work efficient, meaning that, in addition to having a linear upper bound on the total work performed, the algorithm effectively amortizes overheads, such as the cycles spent balancing traversal workload between cores. Put differently, we prove that, for any input, the algorithm confines the amount of load-balancing work to a small fraction of the total (linear) running time. Then, we prove that the algorithm achieves a high degree of parallelism, that is, near optimal under the constraints of the graph. Finally, we present an experimental evaluation showing that, thanks to reducing overheads and increasing parallelism, our implementation of the algorithm outperforms two other state-of-the-art algorithms in all but a few cases, and in some cases, far outperforms the competition.
We have made available a long version of our Supercomputing’15 article, which includes appendices; the appendix contains a longer version of one of the proofs.
The source code we used for our experimental evaluation is hosted in a GitHub repository. The source code for our PDFS and for our implementation of the DFS of Cong et al. is stored in the file named dfs.hpp.
Note. The PDFS source code is located in the new-sc15-graph branch of the pasl repository. Be careful to use the correct branch if you wish to get the PDFS code directly from GitHub.
To have enough room to run the experiments, your filesystem should have about 300GB of free hard-drive space, and your machine should have at least 128GB of RAM. These space requirements are so large because some of the input graphs we use are huge.
The following packages should be installed on your test machine.
Package | Version | Details |
---|---|---|
gcc | >= 4.9.0 | Recent gcc is required because pdfs makes heavy use of features of C++1x, such as lambda expressions and higher-order templates. (Home page) |
php | >= 5.3.10 | PHP is currently used by the build system of pdfs. (Home page) |
ocaml | >= 4.00 | Ocaml is required to build the benchmarking script. (Home page) |
R | >= 2.4.1 | R is used by our scripts to generate plots. (Home page) |
hwloc | recent | Optional dependency (See instructions below). This package is used by pdfs to force interleaved NUMA allocation; as such this package is optional and only really relevant for NUMA machines. (Home page) |
ipfs | recent | We are going to use this software to download data sets for our experiments. (Home page) |
IPFS is a tool that is useful for transferring large amounts of data over the internet. We need this tool because our experiments use large input graphs. After installing the package, we need to initialize the local IPFS configuration.
$ ipfs init
Warning: disk space. The default behavior of IPFS is to keep a cache of all downloaded files in the folder ~/.ipfs/. Because the graph data is several gigabytes, the cache folder should have at least twice this much free space. To select a different cache folder for IPFS, set the environment variable $IPFS_PATH to point to the desired path before issuing the command ipfs init.
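For instance, the cache can be redirected to a directory with more free space before initialization; the path below is only a placeholder and should be replaced with a suitable location on your machine.

```shell
# Redirect the IPFS cache to a placeholder scratch location
# (adjust the path to a volume with enough free space).
export IPFS_PATH="${TMPDIR:-/tmp}/ipfs-cache"
mkdir -p "$IPFS_PATH"
# Then initialize as usual:
#   ipfs init
```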
Next, we need to run the IPFS daemon. This process needs to be running until after all input graphs have been successfully downloaded to your machine.
$ ipfs daemon &
Now, create a new directory in which to store all of the code and data.
$ mkdir sc15
$ cd sc15
Let us create a folder in which to store the graphs.
$ mkdir sc15-graphs
The table below summarizes our collection of non-synthetic graphs. Each graph can be downloaded individually or as a whole package. To download just cage15, for instance, we can issue the following command.
$ ipfs get QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF -o=sc15-graphs/cage15.adj_bin
Note that each graph file must have the extension .adj_bin. To download all of the graphs, we use the following command.
$ ipfs get QmeQPA4mRAhkNta8Pofa8DGfokfr6dT5pGAb1C4A4MmZ3u -o=sc15-graphs
Because the total size is about 30GB, the download may take a long time.
Graph | IPFS address |
---|---|
Freescale1 | QmUe8sV6hBfS45HtogKC3cxwenR45UHMBcbGtni5WoXqiE |
cage14 | QmSo1PdRffxbZFGyz89ZkegZM99UTREPbtEJhjhzu9bjPF |
cage15 | QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF |
delaunay | QmSXK5B3zPh3WAtuQmtkj6XfkcSK4jLT2WJ67Fzxge7Q6r |
europe | QmSA2tgcdKoyFWfktdQVb6AuJfsg6Se5cbBfAf878BAiEc |
friendster | QmWLc6h7MqwbpzFVvrgWrhzpEuhnxdHfEfcW8sc92ddePv |
kkt_power | QmfBrYSuGrjSGvdP1feFGMVoSa4TfuBsVWhoHUvQtShT9B |
livejournal1 | QmZJk9P5qodnwfKKHUa4BnCR4NyNg2NkyLbJoJJjFhSm6G |
orkut | QmYRQUzDmaJ4vHNTJSk8VWs9JTD6kZPyWcKfQabbWxgQvc |
rgg | QmU5NuE1KDnPo9jEXfr44tr9hu6o1NHtiaq1Vr9B7qvSuK |
| QmW7y6EL3FtakEw9UxtqhDjkhcZVAGtqJZi5dXeKtcKCjt |
usa | QmcGm3pWSqksakmggKgXK7wuGhtkYx15y5YHqDobErFvW5 |
wikipedia-20070206 | QmYi8Ga5j4zb1XWBGjR4f9WRWbQYs5BC4yARuJKtAdvG6w |
all graphs | QmeQPA4mRAhkNta8Pofa8DGfokfr6dT5pGAb1C4A4MmZ3u |
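If only a subset of the graphs is needed, the table rows can be scripted over. The bash sketch below pairs two of the graph names with their addresses from the table and, as a dry run, echoes the download command for each graph that is not already present; remove the `echo` to actually perform the downloads.

```shell
# Map graph names to their IPFS addresses (copied from the table above).
declare -A graphs=(
  [cage15]=QmYrupxDXnYfbw8Vj8tDN7PYrMxWiCmNLS4xLHQ7QF64CF
  [kkt_power]=QmfBrYSuGrjSGvdP1feFGMVoSa4TfuBsVWhoHUvQtShT9B
)
mkdir -p sc15-graphs
for name in "${!graphs[@]}"; do
  out="sc15-graphs/${name}.adj_bin"
  # Skip files already downloaded; drop `echo` to run the real download.
  [ -f "$out" ] || echo ipfs get "${graphs[$name]}" -o="$out"
done
```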
Warning. In the command-line samples, we assume that . is in the user’s $PATH, which can be achieved by setting PATH=.:$PATH.
To obtain the source code, first get the downloader script, then perform the following steps.
$ wget http://deepsea.inria.fr/graph/get.sh
$ chmod u+x get.sh
$ get.sh
Building with hwloc. If your system has a non-uniform memory architecture (NUMA), then using hwloc may prove crucial for obtaining correct experimental results. To link correctly with hwloc, all you need to do is set an environment variable with the path to the hwloc package-configuration folder. On our system, this folder is located at /usr/lib64/pkgconfig/, but that location may differ for your installation.
$ export hwloc_path=/usr/lib64/pkgconfig/
You need to ensure that whatever path you substitute for this one on your machine contains a file named hwloc.pc.
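A quick way to sanity-check the setting is to test for that file before building; the path below matches our system and will likely need adjusting for yours.

```shell
# Check that the configured folder actually provides hwloc's package
# configuration file (adjust the path for your installation).
export hwloc_path=/usr/lib64/pkgconfig/
if [ -f "$hwloc_path/hwloc.pc" ]; then
  echo "hwloc.pc found in $hwloc_path"
else
  echo "hwloc.pc not found; adjust hwloc_path before building"
fi
```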
Before building any packages, we need to configure some paths. Let us change to the directory where the configuration file is going to be stored.
$ cd sc15-graph/graph/bench/
The next step is to build our benchmarking script, which we then use to generate the graph data.
$ make graph.pbench
Generation of the synthetic graphs may take a long time. To start running our graph generator, specify the number of processors to be used by the experiment by passing the -proc argument on the command line. For instance, our system has 40 cores.
$ export P=40
$ graph.pbench generate -size large -proc $P
The first series of benchmark runs gathers data to serve as baseline measurements. The baseline in this case is a fast sequential graph-traversal algorithm, hence the argument -proc 1.
$ graph.pbench baselines -proc 1 -size large
The next command collects, for each graph, the data needed to determine the number of vertices reachable from the source vertex.
$ graph.pbench accessible -proc 1 -size large -skip plot
After the command completes, the results of the experiment are stored in the _results folder.
We are now ready to run the main body of experiments. The following command starts the experiments running.
$ graph.pbench overview -proc $P -size large -skip plot -runs 30
To achieve clean results, we recommend performing thirty runs. However, it may be faster to perform just a few runs at first and then collect more data later, next time passing graph.pbench the flag -mode append.
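For example, assuming $P is set as above, one might first gather a few runs and later append the rest to the same result files; the particular split into 3 and 27 runs below is only an illustration.

```
$ graph.pbench overview -proc $P -size large -skip plot -runs 3
$ graph.pbench overview -proc $P -size large -skip plot -runs 27 -mode append
```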
After running the experiments, the raw result data should be stored in new files named results_accessible_large.txt, results_baselines_large.txt, and results_overview_large.txt. If the experiments completed successfully, then the kind of plots that appear in our SC’15 paper can be generated by the following command.
$ graph.pbench overview -proc $P -size large -runs 30 -sub graphs,main,locality
After completion, the file named table_graphs.pdf should contain a table resembling Table 1 in our SC’15 paper; the file named _results/table_graphs.tex contains its source code. A speedup plot, like the one in Figure 6, should appear in a new file named plot_main.pdf (with source code in _results/plot_main.pdf-all.r). The plot corresponding to Figure 8 in the paper, namely the locality study, should appear in the file named plots_overview_locality_large.pdf (with source code in plots_overview_locality_large.pdf-all.r).
Get the bibtex file used to generate these references.
Acar, Umut A, Arthur Charguéraud, and Mike Rainey. 2015. “A Work-Efficient Algorithm for Parallel Unordered Depth-First Search.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 67:1–67:12. ACM. http://chargueraud.org/research/2015/pdfs/pdfs_sc15.pdf.