Publications

Batch-Parallel Compressed Sparse Row: A Locality-Optimized Dynamic-Graph Representation

Brian Wheatman, Randal Burns, and Helen Xu

Abstract

The default data structure for storing sparse graphs is Compressed Sparse Row (CSR), which enables efficient algorithms but is not designed to accommodate changes to the graph. Since many real-world graphs are dynamic (i.e., they change over time), there has been significant work towards developing dynamic-graph data structures that can support fast algorithms as well as updates to the graph.

This paper introduces Batch-Parallel Compressed Sparse Row (BP-CSR), a batch-parallel data structure optimized for storing and processing dynamic graphs based on the Packed Memory Array (PMA). At a high level, Batch-Parallel Compressed Sparse Row extends Packed Compressed Sparse Row (PCSR, HPEC ’18), a serial dynamic-graph data structure built on a PMA. However, since the original PCSR runs only on one thread, it cannot take advantage of the parallelism available in multithreaded machines. In contrast, Batch-Parallel Compressed Sparse Row is built on the batch-parallel Packed Memory Array data structure (PPoPP ’24) and can support fast parallel algorithms and updates.

The empirical evaluation demonstrates that Batch-Parallel Compressed Sparse Row supports fast parallel updates with minimal cost to algorithm performance. Specifically, Batch- Parallel Compressed Sparse Row performs up to 420 million inserts per second. Across a suite of 10 graph algorithms and 10 input graphs, Batch-Parallel Compressed Sparse Row incurs 1.05× slowdown on average and about 1.5× slowdown at most compared to Compressed Sparse Row (CSR), a classical static graph representation. Furthermore, the empirical results show that Batch-Parallel Compressed Sparse Row outperforms existing tree-based and PMA-based dynamic-graph data structures on both algorithms and updates.

Presented at HPEC 2024, where it won an Outstanding Student Paper Award
The code can be found on github as part of the PMA library https://github.com/wheatman/Packed-Memory-Array

BYO: A Unified Framework for Benchmarking Large-Scale Graph Containers

Brian Wheatman, Xiaojun Dong, Zheqi Shen, Laxman Dhulipala, Jakub Łącki, Prashant Pandey, and Helen Xu

Abstract

A fundamental building block in any graph algorithm is a graph container – a data structure used to represent the graph. Ideally, a graph container enables efficient access to the underlying graph and supports updating the graph efficiently. In this paper, we conduct an extensive empirical evaluation of graph containers designed to support running algorithms on large graphs. To our knowledge, this is the first apples-to-apples comparison of graph containers rather than overall systems, which include confounding factors such as differences in algorithm implementations and infrastructure.

We measure the running time of 10 highly-optimized algorithms across over 20+ different containers and 10 graphs. Somewhat surprisingly, we find that the average algorithm running time does not differ much across containers, especially those that support dynamic updates. Specifically, a simple container based on an off-the-shelf B-tree is only 1.22× slower on average than a highly optimized static one. Moreover, we observe that simplifying a graph-container Application Programming Interface (API) to only a few simple functions incurs a mere 1.16× slowdown compared to a complete API. Finally, we also measure batch-insert throughput in dynamic-graph containers for a full picture of their performance.

To perform the benchmarks, we introduce BYO, a unified framework that standardizes evaluations of graph-algorithm performance across different graph containers. BYO extends the Graph Based Benchmark Suite (Dhulipala et al. 18), a state-of-the-art graph algorithm benchmark, to easily plug into different dynamic graph containers and enable fair comparisons between them on a large suite of graph algorithms. While several graph algorithm benchmarks have been developed to date, to the best of our knowledge, BYO is the first system designed to benchmark graph containers.

Presented at VLDB 2024
The code can be found on github https://github.com/wheatman/BYO

Speedcode: Software Performance Engineering Education via the Coding of Didactic Exercises

Tim Kaler, Xuhao Chen, Brian Wheatman, Dorothy Curtis, Bruce Hoppe, Tao B. Schardl, and Charles E. Leiserson

Abstract

This paper introduces Speedcode, an online programming platform that aims to improve the accessibility of software performance-engineering education. At its core, Speedcode provides a platform that lets users gain hands-on experience in software performance engineering and parallel programming by completing short programming exercises.

Speedcode challenges users to develop fast multicore solutions for short programming problems and evaluates their code’s performance and scalability in a quiesced cloud environment. Speedcode supports parallel programming using OpenCilk, a task-parallel computing platform that is open-source and easy to program, teach and use for research.

Speedcode aims to reduce barriers to learning and teaching software performance engineering. It allows users to run and evaluate their code on modern multicore machines from their own computer without installing any software. This provides users an easy introduction to the topic, and enables teachers to more easily incorporate lessons on software performance engineering into their courses without incurring the onerous overhead of needing to setup computing environments for their students.

Presented at EduPar-24, in conjuction with IPDPS 2024

CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers

Brian Wheatman, Randal Burns, Aydın Buluç, and Helen Xu

Abstract

This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed dynamic ordered batch-parallel set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared to cache-optimized trees, PMAs were slower to update but faster to scan.

The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3× faster batch-insert throughput and 4× faster range-query throughput compared to compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees.

We further evaluate the CPMA compared to compressed PaC-trees and Aspen, a state-of-the-art system, on a realworld application of dynamic-graph processing. The CPMA is on average 1.2× faster on a suite of graph algorithms and 2× faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3× faster on graph algorithms and 2× faster on batch inserts compared to Aspen.

Presented at PPoPP 2024, where it won Best Artifact Award
The code can be found on github https://github.com/wheatman/Packed-Memory-Array

BP-tree: Overcoming the Point-Range Operation Tradeoff for In-Memory B-trees

Helen Xu, Amanda Li, Brian Wheatman, Manoj Marneni, and Prashant Pandey

Abstract

B-trees are the go-to data structure for in-memory indexes in databases and storage systems. B-trees support both point operations (i.e., inserts and finds) and range operations (i.e., iterators and maps). However, there is an inherent tradeoff between point and range operations since the optimal node size for point operations is much smaller than the optimal node size for range operations. Existing implementations use a relatively small node size to achieve fast point operations at the cost of range operation throughput.

We present the BP-tree, a variant of the B-tree, that overcomes the decades-old point-range operation tradeoff in traditional B-trees. In the BP-tree, the leaf nodes are much larger in size than the internal nodes to support faster range scans. To avoid any slowdown in point operations due to large leaf nodes, we introduce a new insertoptimized array called the buffered partitioned array (BPA) to efficiently organize data in leaf nodes. The BPA supports fast insertions by delaying ordering the keys in the array. This results in much faster range operations and faster point operations at the same time in the BP-tree.

Our experiments show that on 48 hyperthreads, on workloads generated from the Yahoo! Cloud Serving Benchmark (YCSB), the BP-tree supports similar or faster point operation throughput (between .94×−1.2× faster) compared to Masstree and OpenBw-tree, two state-of-the-art in-memory key-value (KV) stores. On a YCSB workload with short scans, the BP-tree is about 7.4× faster than Masstree and 1.6× faster than OpenBw-tree. Furthermore, we extend the YCSB to add large range workloads, commonly found in database applications, and show that the BP-tree is 30× faster than Masstree and 2.5× faster than OpenBw-tree.

We also provide a reference implementation for a concurrent B+ - tree and find that the BP-tree supports faster (between 1.03×−1.2× faster) point operations when compared to the best-case configuration for B+ -trees for point operations while supporting similar performance (about .95×as fast) on short range operations and faster (about 1.3× faster) long range operations

Presented at VLDB 2023
Presented as a Poster at Highlights of Parallel Computing 2024 at SPAA 2024
For a video of the talk, see here
The code can be found on github https://github.com/wheatman/BP-Tree

Optimizing Search Layouts in Packed Memory Arrays

Brian Wheatman, Randal Burns, Aydın Buluç, and Helen Xu

Abstract

This paper introduces Search-optimized Packed Memory Arrays (SPMAs), a collection of data structures based on Packed Memory Arrays (PMAs) that address suboptimal search via cache-optimized search layouts. Traditionally, PMAs and B-trees have tradeoffs between searches/inserts and scans: B-trees were faster for searches and inserts, while PMAs were faster for scans.

Our empirical evaluation shows that SPMAs overcome this tradeoff for unsorted input distributions: on average, SPMAs are faster than B+-trees (a variant of B-trees optimized for scans) on all major operations. We generated datasets and search/insert workloads from the Yahoo! Cloud Serving Benchmark (YCSB) and found that SPMAs are about 2× faster than B+-trees regardless of the ratio of searches to inserts. On uni form random inputs, SPMAs are on average between 1.3 × −2.3× faster than B+-trees on all operations. Finally, we vary the amount of sortedness in the inputs to stress the worst-case insert distribution in the PMA. We find that the worst-case B+-tree insertion throughput is about 1.5× faster than the worst-case PMA insertion throughput. However, the worst-case input for the PMA is sorted and highly unlikely to appear naturally in practice. The SPMAs maintain higher insertion throughput than the B+-tree when the input is up to 25% sorted.

Presented at SIAM ALENEX 2023
The code can be found on github https://github.com/wheatman/Packed-Memory-Array

Streaming Sparse Graphs using Efficient Dynamic Sets

Brian Wheatman and Randal Burns

Abstract

We present the SSTGraph framework for the storage and analysis of dynamic graphs. Its performance matches or exceeds state-of-the-art static graph engines and supports streaming updates. SSTGraph builds on top of the tinyset parallel, dynamic set data structure. Tinyset implements set membership in a shallow hierarchy of sorted packed memory arrays to achieve logarithmic time access and updates, and it scans in optimal linear time. Tinyset uses space comparable to that of systems that use data compression while avoiding compression's computation and serialization overhead.

SSTGraph outperforms other streaming, dynamic graph engines on a suite of four graph algorithms. Our evaluation includes a comparison with the Aspen streaming graph system. SSTGraph reduces runtime by 40% on average, updates are 2x-5x faster on batch sizes up to 10 million, and graphs are smaller. The partitioned data structure scales well and runs on billion edge graphs in just 15 GB of memory.

Presented at IEEE BigData 2021
Presented as a Poster at Highlights of Parallel Computing 2023 at SPAA 2023
For a video of the talk, see here
The code can be found on github https://github.com/wheatman/SSTGraph

Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs

Prashant Pandey, Brian Wheatman, Helen Xu, and Aydın Buluç

Abstract

Various applications model problems as streaming graphs, which need to quickly apply a stream of updates and run algorithms on the updated graph. Furthermore, many dynamic real-world graphs, such as social networks, follow a skewed distribution of vertex degrees, where there are a few high-degree vertices and many low-degree vertices.

Existing static graph-processing systems optimized for graph skewness achieve high performance and low space usage by preprocessing a cache-efficient graph partitioning based on vertex degree. In the streaming setting, the whole graph is not available upfront, however, so finding an optimal partitioning is not feasible in the presence of updates. As a result, existing streaming graph-processing systems take a "one-size-fits-all" approach, leaving performance on the table.

We present Terrace, a system for streaming graphs that uses a hierarchical data structure design to store a vertex's neighbors in different data structures depending on the degree of the vertex. This multi-level structure enables Terrace to dynamically partition vertices based on their degrees and adapt to skewness in the underlying graph.

Our experiments show that Terrace supports faster batch insertions for batch sizes up to 1M when compared to Aspen, a state-of-the-art graph streaming system. On graph query algorithms, Terrace is between 1.7x--2.6x faster than Aspen and between 0.5x--1.3x as fast as Ligra, a state-of-the-art static graph-processing system.

Presented at SIGMOD 2021
For a video of the talk, see here
The code can be found on github https://github.com/PASSIONLab/terrace

RADICS: Runtime Assurance of Distributed Intelligent Control Systems

Brian Wheatman, Jerry Chen, Tamim Sookoor, and Yair Amir

Abstract

We describe RADICS: Runtime Assurance of Distributed Intelligent Control Systems, which combines a Simplex-based, black-box monitor with a white-box monitor to ensure correct behavior and good performance of AI systems. The black-box monitor allows the system to detect when the AI controller is on a failing trajectory and use a provably safe, but less performant algorithm, to right the system. The white-box monitor predicts when the AI controller will be put on such a trajectory before it happens and helps maximize the performance of the overall system. We describe the overall approach in detail and implement a simple version of it on a case study into controlling the lights in a small traffic grid.

Presented at DSML 2021
For a video of the talk, see here
For more information see http://www.dsn.jhu.edu/radics/

A Parallel Packed Memory Array to Store Dynamic Graphs

Brian Wheatman and Helen Xu

Abstract

The ideal data structure for storing dynamic graphs would support fast updates as well as fast range queries which underlie graph traversals such as breadth-first search. The Packed Memory Array (PMA) seems like a good candidate for this setting because it supports fast updates as well as cache-efficient range queries. Concurrently updating a PMA raises challenges, however, because an update may require rewriting the entire structure.

This paper introduces a parallel PMA with intra- and inter-operation parallelism and deadlock-free polylogarithmicspan operations. Our main observation is that the PMA is well-suited to concurrent updates despite occasionally requiring a rewrite of the entire structure because 1) most of the updates only write to a small part of the structure and 2) the worst case is highly parallel and cache-efficient.

To evaluate our data structure, we implemented Parallel Packed Compressed Sparse Row (PPCSR), a dynamic-graph processing framework that extends the Ligra interface with graph updates. We show that PPCSR is on average about 1.6x faster on graph kernels than Aspen, a state-of-the-art graph-streaming system. PPCSR achieves up to 80 million updates per second and is 2 – 5x faster than Aspen on most batch sizes. Finally, PPCSR is competitive with Ligra and Ligra+, two state-of-the-art static graph-processing frameworks.

Presented at ALENEX 2021
For a video of the talk, see here

High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments

Tim Kaler, Brian Wheatman, and Sarah Wooders

Abstract

The accuracy and computational efficiency of image alignment directly affects the advancement of connectomics, a field which seeks to understand the structure of the brain through electron microscopy.

We introduce the algorithms Quilter and Stacker that are designed to perform 2D and 3D alignment respectively on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and simple to deploy on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr - an over 10x improvement from previous systems. This efficiency comes from both traditional optimizations and from the use of ``Frugal Snap Judgments'' to judiciously exploit performance--accuracy trade offs.

A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550GB to 38TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr.

Presented as a Poster at PPOPP 2019.
Presented as a paper at HPEC 2020, where it won the best student paper award.

Packed Compressed Sparse Row: A Dynamic Graph Representation

Brian Wheatman and Helen Xu

Abstract

Perhaps the most popular sparse graph storage format is Compressed Sparse Row (CSR). CSR excels at storing graphs compactly with minimal overhead, allowing for fast traversals, lookups, and basic graph computations such as PageRank. Since elements in CSR format are packed together, additions and deletions often require time linear in the size of the graph.

We introduce a new dynamic sparse graph representation called Packed Compressed Sparse Row (PCSR), based on an array-based dynamic data structure called the Packed Memory Array. PCSR is similar to CSR, but leaves spaces between elements, allowing for asymptotically faster insertions and deletions in exchange for a constant factor slowdown in traversals and a constant factor increase in space overhead.

Our contributions are twofold. We describe PCSR and review the theoretical guarantees for update, insert, and search for PCSR. We also implemented PCSR as well as other basic graph storage formats and report our findings on a variety of benchmarks. PCSR supports inserts orders of magnitude faster than CSR and is only a factor of two slower on graph traversals. Our results suggest that PCSR is a lightweight dynamic graph representation that supports fast inserts and competitive searches.

Presented at HPEC 2018

Electricity Demand and Population Dynamics Prediction from Mobile Phone Metadata

Brian Wheatman, Alejandro Noriega and Alex Pentland

Abstract

Energy efficiency is a key challenge for building modern sustainable societies. World’s energy consumption is expected to grow annually by 1.6 %, increasing pressure for utilities and governments to fulfill demand and raising significant challenges in generation, distribution, and storage of electricity. In this context, accurate predictions and understanding of population dynamics and their relation to electricity demand dynamics is of high relevance.

We introduce a simple machine learning (ML) method for day-ahead predictions of hourly energy consumption, based on population and electricity demand dynamics. We use anonymized mobile phone records (CDRs) and historical energy records from a small European country. CDRs are large-scale data that is collected passively and on a regular basis by mobile phone carriers, including time and location of calls and text messages, as well as phones’ countries of origin. We show that simple support vector machine (SVM) autoregressive models are capable of baseline energy demand predictions with accuracies below 3 % percentage error and active population predictions below 10 % percentage error. Moreover, we show that population dynamics from mobile phone records contain information additional to that of electricity demand records, which can be exploited to improve prediction performance. Finally, we illustrate how the joint analysis of population and electricity dynamics elicits insights into the relation between population and electricity demand segments, allowing for potential demand management interventions and policies beyond reactive supply-side operations.

Presented at SBP-BRiMS 2016

Brian Wheatman

About Brian

Publications

Batch-Parallel Compressed Sparse Row: A Locality-Optimized Dynamic-Graph Representation

BYO: A Unified Framework for Benchmarking Large-Scale Graph Containers

Speedcode: Software Performance Engineering Education via the Coding of Didactic Exercises

CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers

BP-tree: Overcoming the Point-Range Operation Tradeoff for In-Memory B-trees

Optimizing Search Layouts in Packed Memory Arrays

Streaming Sparse Graphs using Efficient Dynamic Sets

Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs

RADICS: Runtime Assurance of Distributed Intelligent Control Systems

A Parallel Packed Memory Array to Store Dynamic Graphs

High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments

Packed Compressed Sparse Row: A Dynamic Graph Representation

Electricity Demand and Population Dynamics Prediction from Mobile Phone Metadata

Invited Talks

Recent Improvements to Packed Memory Arrays

Ordered Sets: An Evolution of Memory Optimized Data Structures

So You Want to Make a Dynamic Graph Data Structure

Guest Lectures

Current Work

Masked Matrix Multiplication for Emergent Sparsity

General Purpose, Memory-Efficient Data Structures

Education

PhD's Dissertation:

Cache and Memory Optimized Data Structures for High Performance Applications

Master's Thesis:

Image alignment and dynamic graph analytics : two case studies of how managing data movement can make (parallel) code run fast

Anti-magic and Edge Graceful Graphs

Tourist Path Optimization Problem

Teaching

Service

Professional Experience

View My CV