Brian Wheatman

A PhD student in Computer Science.

About Brian

Publications and Presentations



CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers

Brian Wheatman, Randal Burns, Aydın Buluç, and Helen Xu

    Abstract

    This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed dynamic ordered batch-parallel set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared to cache-optimized trees, PMAs were slower to update but faster to scan.

    The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3× faster batch-insert throughput and 4× faster range-query throughput compared to compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees.

    We further evaluate the CPMA compared to compressed PaC-trees and Aspen, a state-of-the-art system, on a realworld application of dynamic-graph processing. The CPMA is on average 1.2× faster on a suite of graph algorithms and 2× faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3× faster on graph algorithms and 2× faster on batch inserts compared to Aspen.



BP-tree: Overcoming the Point-Range Operation Tradeoff for In-Memory B-trees

Helen Xu, Amanda Li, Brian Wheatman, Manoj Marneni, and Prashant Pandey

    Abstract

    B-trees are the go-to data structure for in-memory indexes in databases and storage systems. B-trees support both point operations (i.e., inserts and finds) and range operations (i.e., iterators and maps). However, there is an inherent tradeoff between point and range operations since the optimal node size for point operations is much smaller than the optimal node size for range operations. Existing implementations use a relatively small node size to achieve fast point operations at the cost of range operation throughput.

    We present the BP-tree, a variant of the B-tree, that overcomes the decades-old point-range operation tradeoff in traditional B-trees. In the BP-tree, the leaf nodes are much larger in size than the internal nodes to support faster range scans. To avoid any slowdown in point operations due to large leaf nodes, we introduce a new insertoptimized array called the buffered partitioned array (BPA) to efficiently organize data in leaf nodes. The BPA supports fast insertions by delaying ordering the keys in the array. This results in much faster range operations and faster point operations at the same time in the BP-tree.

    Our experiments show that on 48 hyperthreads, on workloads generated from the Yahoo! Cloud Serving Benchmark (YCSB), the BP-tree supports similar or faster point operation throughput (between .94×−1.2× faster) compared to Masstree and OpenBw-tree, two state-of-the-art in-memory key-value (KV) stores. On a YCSB workload with short scans, the BP-tree is about 7.4× faster than Masstree and 1.6× faster than OpenBw-tree. Furthermore, we extend the YCSB to add large range workloads, commonly found in database applications, and show that the BP-tree is 30× faster than Masstree and 2.5× faster than OpenBw-tree.

    We also provide a reference implementation for a concurrent B+ - tree and find that the BP-tree supports faster (between 1.03×−1.2× faster) point operations when compared to the best-case configuration for B+ -trees for point operations while supporting similar performance (about .95×as fast) on short range operations and faster (about 1.3× faster) long range operations



Optimizing Search Layouts in Packed Memory Arrays

Brian Wheatman, Randal Burns, Aydın Buluç, and Helen Xu

    Abstract

    This paper introduces Search-optimized Packed Memory Arrays (SPMAs), a collection of data structures based on Packed Memory Arrays (PMAs) that address suboptimal search via cache-optimized search layouts. Traditionally, PMAs and B-trees have tradeoffs between searches/inserts and scans: B-trees were faster for searches and inserts, while PMAs were faster for scans.

    Our empirical evaluation shows that SPMAs overcome this tradeoff for unsorted input distributions: on average, SPMAs are faster than B+-trees (a variant of B-trees optimized for scans) on all major operations. We generated datasets and search/insert workloads from the Yahoo! Cloud Serving Benchmark (YCSB) and found that SPMAs are about 2× faster than B+-trees regardless of the ratio of searches to inserts. On uni form random inputs, SPMAs are on average between 1.3 × −2.3× faster than B+-trees on all operations. Finally, we vary the amount of sortedness in the inputs to stress the worst-case insert distribution in the PMA. We find that the worst-case B+-tree insertion throughput is about 1.5× faster than the worst-case PMA insertion throughput. However, the worst-case input for the PMA is sorted and highly unlikely to appear naturally in practice. The SPMAs maintain higher insertion throughput than the B+-tree when the input is up to 25% sorted.



So You Want to Make a Dynamic Graph Data Structure

Brian Wheatman

    Abstract

    This talk discusses dynamic graph data structures and what to consider when designing new ones. It breaks the task of designing graph data structures into creating structures that can support a few key operations efficiently. The focus is on how different use cases, which can cause different data patterns and access patterns, can influence the design of these structures and how to exploit these differences to maximize performance.



Streaming Sparse Graphs using Efficient Dynamic Sets

Brian Wheatman and Randal Burns

    Abstract

    We present the SSTGraph framework for the storage and analysis of dynamic graphs. Its performance matches or exceeds state-of-the-art static graph engines and supports streaming updates. SSTGraph builds on top of the tinyset parallel, dynamic set data structure. Tinyset implements set membership in a shallow hierarchy of sorted packed memory arrays to achieve logarithmic time access and updates, and it scans in optimal linear time. Tinyset uses space comparable to that of systems that use data compression while avoiding compression's computation and serialization overhead.

    SSTGraph outperforms other streaming, dynamic graph engines on a suite of four graph algorithms. Our evaluation includes a comparison with the Aspen streaming graph system. SSTGraph reduces runtime by 40% on average, updates are 2x-5x faster on batch sizes up to 10 million, and graphs are smaller. The partitioned data structure scales well and runs on billion edge graphs in just 15 GB of memory.



Terrace: A Hierarchical Graph Container for Skewed Dynamic Graphs

Prashant Pandey, Brian Wheatman, Helen Xu, and Aydın Buluç

    Abstract

    Various applications model problems as streaming graphs, which need to quickly apply a stream of updates and run algorithms on the updated graph. Furthermore, many dynamic real-world graphs, such as social networks, follow a skewed distribution of vertex degrees, where there are a few high-degree vertices and many low-degree vertices.

    Existing static graph-processing systems optimized for graph skewness achieve high performance and low space usage by preprocessing a cache-efficient graph partitioning based on vertex degree. In the streaming setting, the whole graph is not available upfront, however, so finding an optimal partitioning is not feasible in the presence of updates. As a result, existing streaming graph-processing systems take a "one-size-fits-all" approach, leaving performance on the table.

    We present Terrace, a system for streaming graphs that uses a hierarchical data structure design to store a vertex's neighbors in different data structures depending on the degree of the vertex. This multi-level structure enables Terrace to dynamically partition vertices based on their degrees and adapt to skewness in the underlying graph.

    Our experiments show that Terrace supports faster batch insertions for batch sizes up to 1M when compared to Aspen, a state-of-the-art graph streaming system. On graph query algorithms, Terrace is between 1.7x--2.6x faster than Aspen and between 0.5x--1.3x as fast as Ligra, a state-of-the-art static graph-processing system.



RADICS: Runtime Assurance of Distributed Intelligent Control Systems

Brian Wheatman, Jerry Chen, Tamim Sookoor, and Yair Amir

    Abstract

    We describe RADICS: Runtime Assurance of Distributed Intelligent Control Systems, which combines a Simplex-based, black-box monitor with a white-box monitor to ensure correct behavior and good performance of AI systems. The black-box monitor allows the system to detect when the AI controller is on a failing trajectory and use a provably safe, but less performant algorithm, to right the system. The white-box monitor predicts when the AI controller will be put on such a trajectory before it happens and helps maximize the performance of the overall system. We describe the overall approach in detail and implement a simple version of it on a case study into controlling the lights in a small traffic grid.



A Parallel Packed Memory Array to Store Dynamic Graphs

Brian Wheatman and Helen Xu

    Abstract

    The ideal data structure for storing dynamic graphs would support fast updates as well as fast range queries which underlie graph traversals such as breadth-first search. The Packed Memory Array (PMA) seems like a good candidate for this setting because it supports fast updates as well as cache-efficient range queries. Concurrently updating a PMA raises challenges, however, because an update may require rewriting the entire structure.

    This paper introduces a parallel PMA with intra- and inter-operation parallelism and deadlock-free polylogarithmicspan operations. Our main observation is that the PMA is well-suited to concurrent updates despite occasionally requiring a rewrite of the entire structure because 1) most of the updates only write to a small part of the structure and 2) the worst case is highly parallel and cache-efficient.

    To evaluate our data structure, we implemented Parallel Packed Compressed Sparse Row (PPCSR), a dynamic-graph processing framework that extends the Ligra interface with graph updates. We show that PPCSR is on average about 1.6x faster on graph kernels than Aspen, a state-of-the-art graph-streaming system. PPCSR achieves up to 80 million updates per second and is 2 – 5x faster than Aspen on most batch sizes. Finally, PPCSR is competitive with Ligra and Ligra+, two state-of-the-art static graph-processing frameworks.



High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments

Tim Kaler, Brian Wheatman, and Sarah Wooders

    Abstract

    The accuracy and computational efficiency of image alignment directly affects the advancement of connectomics, a field which seeks to understand the structure of the brain through electron microscopy.

    We introduce the algorithms Quilter and Stacker that are designed to perform 2D and 3D alignment respectively on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and simple to deploy on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr - an over 10x improvement from previous systems. This efficiency comes from both traditional optimizations and from the use of ``Frugal Snap Judgments'' to judiciously exploit performance--accuracy trade offs.

    A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550GB to 38TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr.


    • Presented as a Poster at PPOPP 2019.
    • Presented as a paper at HPEC 2020, where it won the best student paper award.

Packed Compressed Sparse Row: A Dynamic Graph Representation

Brian Wheatman and Helen Xu

    Abstract

    Perhaps the most popular sparse graph storage format is Compressed Sparse Row (CSR). CSR excels at storing graphs compactly with minimal overhead, allowing for fast traversals, lookups, and basic graph computations such as PageRank. Since elements in CSR format are packed together, additions and deletions often require time linear in the size of the graph.

    We introduce a new dynamic sparse graph representation called Packed Compressed Sparse Row (PCSR), based on an array-based dynamic data structure called the Packed Memory Array. PCSR is similar to CSR, but leaves spaces between elements, allowing for asymptotically faster insertions and deletions in exchange for a constant factor slowdown in traversals and a constant factor increase in space overhead.

    Our contributions are twofold. We describe PCSR and review the theoretical guarantees for update, insert, and search for PCSR. We also implemented PCSR as well as other basic graph storage formats and report our findings on a variety of benchmarks. PCSR supports inserts orders of magnitude faster than CSR and is only a factor of two slower on graph traversals. Our results suggest that PCSR is a lightweight dynamic graph representation that supports fast inserts and competitive searches.



Electricity Demand and Population Dynamics Prediction from Mobile Phone Metadata

Brian Wheatman, Alejandro Noriega and Alex Pentland

    Abstract

    Energy efficiency is a key challenge for building modern sustainable societies. World’s energy consumption is expected to grow annually by 1.6 %, increasing pressure for utilities and governments to fulfill demand and raising significant challenges in generation, distribution, and storage of electricity. In this context, accurate predictions and understanding of population dynamics and their relation to electricity demand dynamics is of high relevance.

    We introduce a simple machine learning (ML) method for day-ahead predictions of hourly energy consumption, based on population and electricity demand dynamics. We use anonymized mobile phone records (CDRs) and historical energy records from a small European country. CDRs are large-scale data that is collected passively and on a regular basis by mobile phone carriers, including time and location of calls and text messages, as well as phones’ countries of origin. We show that simple support vector machine (SVM) autoregressive models are capable of baseline energy demand predictions with accuracies below 3 % percentage error and active population predictions below 10 % percentage error. Moreover, we show that population dynamics from mobile phone records contain information additional to that of electricity demand records, which can be exploited to improve prediction performance. Finally, we illustrate how the joint analysis of population and electricity dynamics elicits insights into the relation between population and electricity demand segments, allowing for potential demand management interventions and policies beyond reactive supply-side operations.



Current Work

Masked Matrix Multiplication for Emergent Sparsity

    Abstract

    Artificial intelligence workloads, especially transformer models, exhibit emergent sparsity in which computations perform selective sparse access to dense data. The workloads are inefficient on hardware designed for dense computations and do not map well onto sparse data representations. We build a vectorized and parallel matrix-multiplication system A × B = C that eliminates unnecessary computations and avoids branches based on a runtime evaluation of sparsity. We use a combination of dynamic code lookup to adapt to the specific sparsity encoded in the B matrix and preprocessing of sparsity maps of the A and B matrices to compute conditional branches once for the whole computation. For a wide range of sparsity, from 60% to 95% zeros, our implementation performs fewer instructions and increases performance when compared with Intel MKL's dense or sparse matrix multiply routines. Benefits can be as large as 2 times speedup and 4 times fewer instructions.


BYO: A Unified Framework for Benchmarking Large-Scale Graph Containers

    Abstract

    A fundamental building block in any graph algorithm is a graph container – a data structure used to represent the graph. Ideally, a graph container enables efficient access to the underlying graph and supports updating the graph efficiently. In this paper, we conduct an extensive empirical evaluation of graph containers designed to support running algorithms on large graphs. To our knowledge, this is the first apples-to-apples comparison of graph containers rather than overall systems, which include confounding factors such as differences in algorithm implementations and infrastructure.

    We measure the running time of 10 highly-optimized algorithms across over 20+ different containers and 10 graphs. Somewhat surprisingly, we find that the average algorithm running time does not differ much across containers, especially those that support dynamic updates. Specifically, a simple container based on an off-the-shelf B-tree is only 1.22× slower on average than a highly optimized static one. Moreover, we observe that simplifying a graph-container Application Programming Interface (API) to only a few simple functions incurs a mere 1.16× slowdown compared to a complete API. Finally, we also measure batch-insert throughput in dynamic-graph containers for a full picture of their performance.

    To perform the benchmarks, we introduce BYO, a unified framework that standardizes evaluations of graph-algorithm performance across different graph containers. BYO extends the Graph Based Benchmark Suite (Dhulipala et al. 18), a state-of-the-art graph algorithm benchmark, to easily plug into different dynamic graph containers and enable fair comparisons between them on a large suite of graph algorithms. While several graph algorithm benchmarks have been developed to date, to the best of our knowledge, BYO is the first system designed to benchmark graph containers.


General Purpose, Memory-Efficient Data Structures

    Abstract

    My vision is to create a suite of data structures, which implement memory-efficient versions of common operations, so users can reap these benefits across many problems and scales without needing the technical expertise to understand and tune for the memory hierarchy. The next steps are to continue to develop techniques for adapting structures dynamically to best suit the data pattern and problem structure, as well as move from dynamic graphs to more general structures, such as sets and matrices. Some specific examples are as follows:

    A graph library for storing the time history of a graph, where instead of being able to update a graph to store the current version in respect to changes over time, would instead store the complete history of the graph, so that any point in time can be efficiently queried. Furthermore, multiple different points in time will be able to be analyzed at the same time, sharing work and I/O between the different queries.

    A Dynamic Sparse Matrix Library called Concord, which will build off of the SSTGraph and adapt to efficiently store matrices of many different forms and enable both standard computation with separate output and inplace updates. Concord will take advantage of its dynamic structure, without having to perform rewrites or pre-calculate the matrix’s pattern.


Education

Teaching

Service

Professional Experience

View My CV

    You can see my academic and professional experiences via the linked CV.


CV