Brian Wheatman

A PhD student in Computer Science.

About Brian


BYO: A Unified Framework for Benchmarking Large-Scale Graph Containers

Brian Wheatman, Xiaojun Dong, Zheqi Shen, Laxman Dhulipala, Jakub Łącki, Prashant Pandey, and Helen Xu


    A fundamental building block in any graph algorithm is a graph container – a data structure used to represent the graph. Ideally, a graph container enables efficient access to the underlying graph and supports updating the graph efficiently. In this paper, we conduct an extensive empirical evaluation of graph containers designed to support running algorithms on large graphs. To our knowledge, this is the first apples-to-apples comparison of graph containers rather than overall systems, which include confounding factors such as differences in algorithm implementations and infrastructure.

    We measure the running time of 10 highly-optimized algorithms across over 20+ different containers and 10 graphs. Somewhat surprisingly, we find that the average algorithm running time does not differ much across containers, especially those that support dynamic updates. Specifically, a simple container based on an off-the-shelf B-tree is only 1.22× slower on average than a highly optimized static one. Moreover, we observe that simplifying a graph-container Application Programming Interface (API) to only a few simple functions incurs a mere 1.16× slowdown compared to a complete API. Finally, we also measure batch-insert throughput in dynamic-graph containers for a full picture of their performance.

    To perform the benchmarks, we introduce BYO, a unified framework that standardizes evaluations of graph-algorithm performance across different graph containers. BYO extends the Graph Based Benchmark Suite (Dhulipala et al. 18), a state-of-the-art graph algorithm benchmark, to easily plug into different dynamic graph containers and enable fair comparisons between them on a large suite of graph algorithms. While several graph algorithm benchmarks have been developed to date, to the best of our knowledge, BYO is the first system designed to benchmark graph containers.

Speedcode: Software Performance Engineering Education via the Coding of Didactic Exercises

Tim Kaler, Xuhao Chen, Brian Wheatman, Dorothy Curtis, Bruce Hoppe, Tao B. Schardl, and Charles E. Leiserson


    This paper introduces Speedcode, an online programming platform that aims to improve the accessibility of software performance-engineering education. At its core, Speedcode provides a platform that lets users gain hands-on experience in software performance engineering and parallel programming by completing short programming exercises.

    Speedcode challenges users to develop fast multicore solutions for short programming problems and evaluates their code’s performance and scalability in a quiesced cloud environment. Speedcode supports parallel programming using OpenCilk, a task-parallel computing platform that is open-source and easy to program, teach and use for research.

    Speedcode aims to reduce barriers to learning and teaching software performance engineering. It allows users to run and evaluate their code on modern multicore machines from their own computer without installing any software. This provides users an easy introduction to the topic, and enables teachers to more easily incorporate lessons on software performance engineering into their courses without incurring the onerous overhead of needing to setup computing environments for their students.

CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers

Brian Wheatman, Randal Burns, Aydın Buluç, and Helen Xu


    This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed dynamic ordered batch-parallel set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared to cache-optimized trees, PMAs were slower to update but faster to scan.

    The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3× faster batch-insert throughput and 4× faster range-query throughput compared to compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees.

    We further evaluate the CPMA compared to compressed PaC-trees and Aspen, a state-of-the-art system, on a realworld application of dynamic-graph processing. The CPMA is on average 1.2× faster on a suite of graph algorithms and 2× faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3× faster on graph algorithms and 2× faster on batch inserts compared to Aspen.

BP-tree: Overcoming the Point-Range Operation Tradeoff for In-Memory B-trees

Helen Xu, Amanda Li, Brian Wheatman, Manoj Marneni, and Prashant Pandey


    B-trees are the go-to data structure for in-memory indexes in databases and storage systems. B-trees support both point operations (i.e., inserts and finds) and range operations (i.e., iterators and maps). However, there is an inherent tradeoff between point and range operations since the optimal node size for point operations is much smaller than the optimal node size for range operations. Existing implementations use a relatively small node size to achieve fast point operations at the cost of range operation throughput.

    We present the BP-tree, a variant of the B-tree, that overcomes the decades-old point-range operation tradeoff in traditional B-trees. In the BP-tree, the leaf nodes are much larger in size than the internal nodes to support faster range scans. To avoid any slowdown in point operations due to large leaf nodes, we introduce a new insertoptimized array called the buffered partitioned array (BPA) to efficiently organize data in leaf nodes. The BPA supports fast insertions by delaying ordering the keys in the array. This results in much faster range operations and faster point operations at the same time in the BP-tree.

    Our experiments show that on 48 hyperthreads, on workloads generated from the Yahoo! Cloud Serving Benchmark (YCSB), the BP-tree supports similar or faster point operation throughput (between .94×−1.2× faster) compared to Masstree and OpenBw-tree, two state-of-the-art in-memory key-value (KV) stores. On a YCSB workload with short scans, the BP-tree is about 7.4× faster than Masstree and 1.6× faster than OpenBw-tree. Furthermore, we extend the YCSB to add large range workloads, commonly found in database applications, and show that the BP-tree is 30× faster than Masstree and 2.5× faster than OpenBw-tree.

    We also provide a reference implementation for a concurrent B+ - tree and find that the BP-tree supports faster (between 1.03×−1.2× faster) point operations when compared to the best-case configuration for B+ -trees for point operations while supporting similar performance (about .95×as fast) on short range operations and faster (about 1.3× faster) long range operations

Optimizing Search Layouts in Packed Memory Arrays

Brian Wheatman, Randal Burns, Aydın Buluç, and Helen Xu


    This paper introduces Search-optimized Packed Memory Arrays (SPMAs), a collection of data structures based on Packed Memory Arrays (PMAs) that address suboptimal search via cache-optimized search layouts. Traditionally, PMAs and B-trees have tradeoffs between searches/inserts and scans: B-trees were faster for searches and inserts, while PMAs were faster for scans.

    Our empirical evaluation shows that SPMAs overcome this tradeoff for unsorted input distributions: on average, SPMAs are faster than B+-trees (a variant of B-trees optimized for scans) on all major operations. We generated datasets and search/insert workloads from the Yahoo! Cloud Serving Benchmark (YCSB) and found that SPMAs are about 2× faster than B+-trees regardless of the ratio of searches to inserts. On uni form random inputs, SPMAs are on average between 1.3 × −2.3× faster than B+-trees on all operations. Finally, we vary the amount of sortedness in the inputs to stress the worst-case insert distribution in the PMA. We find that the worst-case B+-tree insertion throughput is about 1.5× faster than the worst-case PMA insertion throughput. However, the worst-case input for the PMA is sorted and highly unlikely to appear naturally in practice. The SPMAs maintain higher insertion throughput than the B+-tree when the input is up to 25% sorted.

Streaming Sparse Graphs using Efficient Dynamic Sets

Brian Wheatman and Randal Burns


    We present the SSTGraph framework for the storage and analysis of dynamic graphs. Its performance matches or exceeds state-of-the-art static graph engines and supports streaming updates. SSTGraph builds on top of the tinyset parallel, dynamic set data structure. Tinyset implements set membership in a shallow hierarchy of sorted packed memory arrays to achieve logarithmic time access and updates, and it scans in optimal linear time. Tinyset uses space comparable to that of systems that use data compression while avoiding compression's computation and serialization overhead.

    SSTGraph outperforms other streaming, dynamic graph engines on a suite of four graph algorithms. Our evaluation includes a comparison with the Aspen streaming graph system. SSTGraph reduces runtime by 40% on average, updates are 2x-5x faster on batch sizes up to 10 million, and graphs are smaller. The partitioned data structure scales well and runs on billion edge graphs in just 15 GB of memory.

Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs

Prashant Pandey, Brian Wheatman, Helen Xu, and Aydın Buluç


    Various applications model problems as streaming graphs, which need to quickly apply a stream of updates and run algorithms on the updated graph. Furthermore, many dynamic real-world graphs, such as social networks, follow a skewed distribution of vertex degrees, where there are a few high-degree vertices and many low-degree vertices.

    Existing static graph-processing systems optimized for graph skewness achieve high performance and low space usage by preprocessing a cache-efficient graph partitioning based on vertex degree. In the streaming setting, the whole graph is not available upfront, however, so finding an optimal partitioning is not feasible in the presence of updates. As a result, existing streaming graph-processing systems take a "one-size-fits-all" approach, leaving performance on the table.

    We present Terrace, a system for streaming graphs that uses a hierarchical data structure design to store a vertex's neighbors in different data structures depending on the degree of the vertex. This multi-level structure enables Terrace to dynamically partition vertices based on their degrees and adapt to skewness in the underlying graph.

    Our experiments show that Terrace supports faster batch insertions for batch sizes up to 1M when compared to Aspen, a state-of-the-art graph streaming system. On graph query algorithms, Terrace is between 1.7x--2.6x faster than Aspen and between 0.5x--1.3x as fast as Ligra, a state-of-the-art static graph-processing system.

RADICS: Runtime Assurance of Distributed Intelligent Control Systems

Brian Wheatman, Jerry Chen, Tamim Sookoor, and Yair Amir


    We describe RADICS: Runtime Assurance of Distributed Intelligent Control Systems, which combines a Simplex-based, black-box monitor with a white-box monitor to ensure correct behavior and good performance of AI systems. The black-box monitor allows the system to detect when the AI controller is on a failing trajectory and use a provably safe, but less performant algorithm, to right the system. The white-box monitor predicts when the AI controller will be put on such a trajectory before it happens and helps maximize the performance of the overall system. We describe the overall approach in detail and implement a simple version of it on a case study into controlling the lights in a small traffic grid.

A Parallel Packed Memory Array to Store Dynamic Graphs

Brian Wheatman and Helen Xu


    The ideal data structure for storing dynamic graphs would support fast updates as well as fast range queries which underlie graph traversals such as breadth-first search. The Packed Memory Array (PMA) seems like a good candidate for this setting because it supports fast updates as well as cache-efficient range queries. Concurrently updating a PMA raises challenges, however, because an update may require rewriting the entire structure.

    This paper introduces a parallel PMA with intra- and inter-operation parallelism and deadlock-free polylogarithmicspan operations. Our main observation is that the PMA is well-suited to concurrent updates despite occasionally requiring a rewrite of the entire structure because 1) most of the updates only write to a small part of the structure and 2) the worst case is highly parallel and cache-efficient.

    To evaluate our data structure, we implemented Parallel Packed Compressed Sparse Row (PPCSR), a dynamic-graph processing framework that extends the Ligra interface with graph updates. We show that PPCSR is on average about 1.6x faster on graph kernels than Aspen, a state-of-the-art graph-streaming system. PPCSR achieves up to 80 million updates per second and is 2 – 5x faster than Aspen on most batch sizes. Finally, PPCSR is competitive with Ligra and Ligra+, two state-of-the-art static graph-processing frameworks.

High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments

Tim Kaler, Brian Wheatman, and Sarah Wooders


    The accuracy and computational efficiency of image alignment directly affects the advancement of connectomics, a field which seeks to understand the structure of the brain through electron microscopy.

    We introduce the algorithms Quilter and Stacker that are designed to perform 2D and 3D alignment respectively on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and simple to deploy on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr - an over 10x improvement from previous systems. This efficiency comes from both traditional optimizations and from the use of ``Frugal Snap Judgments'' to judiciously exploit performance--accuracy trade offs.

    A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550GB to 38TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr.

    • Presented as a Poster at PPOPP 2019.
    • Presented as a paper at HPEC 2020, where it won the best student paper award.

Packed Compressed Sparse Row: A Dynamic Graph Representation

Brian Wheatman and Helen Xu


    Perhaps the most popular sparse graph storage format is Compressed Sparse Row (CSR). CSR excels at storing graphs compactly with minimal overhead, allowing for fast traversals, lookups, and basic graph computations such as PageRank. Since elements in CSR format are packed together, additions and deletions often require time linear in the size of the graph.

    We introduce a new dynamic sparse graph representation called Packed Compressed Sparse Row (PCSR), based on an array-based dynamic data structure called the Packed Memory Array. PCSR is similar to CSR, but leaves spaces between elements, allowing for asymptotically faster insertions and deletions in exchange for a constant factor slowdown in traversals and a constant factor increase in space overhead.

    Our contributions are twofold. We describe PCSR and review the theoretical guarantees for update, insert, and search for PCSR. We also implemented PCSR as well as other basic graph storage formats and report our findings on a variety of benchmarks. PCSR supports inserts orders of magnitude faster than CSR and is only a factor of two slower on graph traversals. Our results suggest that PCSR is a lightweight dynamic graph representation that supports fast inserts and competitive searches.

Electricity Demand and Population Dynamics Prediction from Mobile Phone Metadata

Brian Wheatman, Alejandro Noriega and Alex Pentland


    Energy efficiency is a key challenge for building modern sustainable societies. World’s energy consumption is expected to grow annually by 1.6 %, increasing pressure for utilities and governments to fulfill demand and raising significant challenges in generation, distribution, and storage of electricity. In this context, accurate predictions and understanding of population dynamics and their relation to electricity demand dynamics is of high relevance.

    We introduce a simple machine learning (ML) method for day-ahead predictions of hourly energy consumption, based on population and electricity demand dynamics. We use anonymized mobile phone records (CDRs) and historical energy records from a small European country. CDRs are large-scale data that is collected passively and on a regular basis by mobile phone carriers, including time and location of calls and text messages, as well as phones’ countries of origin. We show that simple support vector machine (SVM) autoregressive models are capable of baseline energy demand predictions with accuracies below 3 % percentage error and active population predictions below 10 % percentage error. Moreover, we show that population dynamics from mobile phone records contain information additional to that of electricity demand records, which can be exploited to improve prediction performance. Finally, we illustrate how the joint analysis of population and electricity dynamics elicits insights into the relation between population and electricity demand segments, allowing for potential demand management interventions and policies beyond reactive supply-side operations.

Invited Talks

Recent Improvements to Packed Memory Arrays

Brian Wheatman


    A Packed Memory Array (PMA) is a dynamic data structure that stores N elements in O(N) space. It supports reasonably efficient updates (inserts and deletes) and extremely efficient scans. However, traditional PMAs are theoretically (and often empirically) dominated by B-Trees in many situations. This talk will present recent advancements that enable PMAs to surpass B-Trees in various scenarios, positioning them as a superior choice for ordered sets.

    First, we will discuss optimizations to searching in a PMA, showing how PMAs can asymptotically match and empirically outperform B-Trees. Then we will review improvements to updating a PMA with improvements in serial insert as well as a new parallel batch insert algorithm for PMAs. Lastly, we will explore techniques that reduce the space usage for PMAs, which both decrease the memory usage and increase the throughput of many operations due to the decreased memory bandwidth.

Ordered Sets: An Evolution of Memory Optimized Data Structures

Brian Wheatman


    Ordered sets are a fundamental building block used all over computer science. We will review the different approaches used to implement ordered sets over the years with a focus on practical performance. Then we will turn to Packed Memory Arrays, evaluate their strengths and weaknesses. My research overcomes these limitations in search and insert performance and allows the packed memory array to outperform other approaches on modern highly parallel architectures.

    • Invited talk at University of Chicago

So You Want to Make a Dynamic Graph Data Structure

Brian Wheatman


    This talk discusses dynamic graph data structures and what to consider when designing new ones. It breaks the task of designing graph data structures into creating structures that can support a few key operations efficiently. The focus is on how different use cases, which can cause different data patterns and access patterns, can influence the design of these structures and how to exploit these differences to maximize performance.

Current Work

Masked Matrix Multiplication for Emergent Sparsity


    Artificial intelligence workloads, especially transformer models, exhibit emergent sparsity in which computations perform selective sparse access to dense data. The workloads are inefficient on hardware designed for dense computations and do not map well onto sparse data representations. We build a vectorized and parallel matrix-multiplication system A × B = C that eliminates unnecessary computations and avoids branches based on a runtime evaluation of sparsity. We use a combination of dynamic code lookup to adapt to the specific sparsity encoded in the B matrix and preprocessing of sparsity maps of the A and B matrices to compute conditional branches once for the whole computation. For a wide range of sparsity, from 60% to 95% zeros, our implementation performs fewer instructions and increases performance when compared with Intel MKL's dense or sparse matrix multiply routines. Benefits can be as large as 2 times speedup and 4 times fewer instructions.

General Purpose, Memory-Efficient Data Structures


    My vision is to create a suite of data structures, which implement memory-efficient versions of common operations, so users can reap these benefits across many problems and scales without needing the technical expertise to understand and tune for the memory hierarchy. The next steps are to continue to develop techniques for adapting structures dynamically to best suit the data pattern and problem structure, as well as move from dynamic graphs to more general structures, such as sets and matrices. Some specific examples are as follows:

    A graph library for storing the time history of a graph, where instead of being able to update a graph to store the current version in respect to changes over time, would instead store the complete history of the graph, so that any point in time can be efficiently queried. Furthermore, multiple different points in time will be able to be analyzed at the same time, sharing work and I/O between the different queries.

    A Dynamic Sparse Matrix Library called Concord, which will build off of the SSTGraph and adapt to efficiently store matrices of many different forms and enable both standard computation with separate output and inplace updates. Concord will take advantage of its dynamic structure, without having to perform rewrites or pre-calculate the matrix’s pattern.




Professional Experience

View My CV

    You can see my academic and professional experiences via the linked CV.