Applied Distributed Systems Lab (ADSLab)

Department of Computer Science and Engineering, The Chinese University of Hong Kong
 

Network Measurement and Workload Characterization


For any large-scale deployed system, maintaining dependability guarantees is a critical requirement. We address this challenge from a data-driven perspective through network measurement and workload characterization, so as to understand the behavior of the underlying deployed system and diagnose its anomalies.

Sketch-based Measurement

We focus on sketches, which are summary data structures that support memory-efficient frequency counting with bounded errors. Sketches have been extensively studied in the literature for decades, mainly for single-point measurement in routers. However, advances in programmable networks motivate us to revisit the design of sketches. In particular, we look for new sketch-based measurement solutions that are (1) deployable in programmable switches and (2) applicable to network-wide measurement.
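To make the idea of memory-efficient frequency counting concrete, the following is a minimal Python sketch of a Count-Min sketch, a classic sketch from the literature. It is given for illustration only and is not one of our proposed designs. The class name, the parameters `width` and `depth`, and the use of salted BLAKE2b hashes are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        # Add `count` to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Hash collisions can only inflate a counter (for non-negative
        # updates), so the minimum over rows never underestimates.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )

    def merge(self, other: "CountMinSketch") -> None:
        # Sketches with identical dimensions and hash functions can be
        # combined by adding their counters element-wise.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Two hypothetical measurement points observing the same flow key.
switch_a = CountMinSketch(width=64, depth=4)
switch_b = CountMinSketch(width=64, depth=4)
switch_a.update("10.0.0.1", 5)
switch_b.update("10.0.0.1", 7)
switch_a.merge(switch_b)
print(switch_a.estimate("10.0.0.1"))
```

The memory footprint is fixed at `width * depth` counters regardless of the number of distinct keys, and the estimate for a key is never below its true count when all updates are non-negative. The `merge` method hints at why such sketches suit network-wide measurement: counters collected at different points can be summed centrally.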

We have proposed new sketch-based measurement frameworks for network-wide measurement. SketchVisor [1] augments general sketch-based measurement solutions with fast-path processing. The fast path complements sketches with fast but slightly less accurate measurement, and is activated on demand when sketches cannot promptly handle all packets under high traffic load. SketchVisor later recovers the missing information in the fast path via compressive sensing in a network-wide manner. SketchLearn [2] characterizes the statistical properties of sketches, so as to automatically extract the recorded flow information from sketches without burdening users with the accurate parameter configuration of sketches.

Traditional sketches are non-invertible: since a sketch maps items into a fixed-size memory space (i.e., many-to-one mappings), finding all frequent items requires enumerating each possible item in the item space and checking whether it is frequent. To address this challenge, we also study the invertibility of sketches, meaning that a sketch can return all frequent items directly from the recorded information in the data structure itself without exhaustively querying all possible items. MV-Sketch [3] is an invertible sketch that accurately recovers heavy hitters and heavy changers using the notion of majority voting. SpreadSketch [4] is another invertible sketch that accurately recovers superspreaders. We demonstrate how MV-Sketch and SpreadSketch can be feasibly implemented in P4-based programmable hardware switches. They are also applicable to network-wide measurement, in which a centralized control plane combines the sketches at multiple measurement points to perform collective measurement.
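As a rough illustration of how majority voting can make a sketch invertible, the Python sketch below keeps, in each bucket, the total count, a candidate key, and a vote counter that is maintained in the style of the Boyer-Moore majority vote algorithm. It is a simplified example under our own assumptions (the class name, method names, and hash construction are hypothetical) and should not be read as the published MV-Sketch implementation.

```python
import hashlib


class MajorityVoteSketch:
    """Simplified invertible sketch: each bucket tracks a candidate key."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        # Each bucket holds [total_count, candidate_key, vote_counter].
        self.buckets = [
            [[0, None, 0] for _ in range(width)] for _ in range(depth)
        ]

    def _index(self, row: int, key: str) -> int:
        # One salted hash per row (an assumption made for this example).
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            bucket = self.buckets[row][self._index(row, key)]
            bucket[0] += count
            if bucket[1] == key:
                bucket[2] += count      # vote for the current candidate
            else:
                bucket[2] -= count      # vote against the current candidate
                if bucket[2] < 0:       # candidate outvoted: replace it
                    bucket[1] = key
                    bucket[2] = -bucket[2]

    def estimate(self, key: str) -> int:
        # Per-row upper bound derived from the total and the vote counter;
        # the minimum over rows is the tightest of these bounds.
        bounds = []
        for row in range(self.depth):
            total, candidate, votes = self.buckets[row][self._index(row, key)]
            if candidate == key:
                bounds.append((total + votes) // 2)
            else:
                bounds.append((total - votes) // 2)
        return min(bounds)

    def heavy_hitters(self, threshold: int) -> dict:
        # Read candidates straight out of the buckets: no enumeration of
        # the item space is needed.
        found = {}
        for row in self.buckets:
            for total, candidate, _ in row:
                if candidate is not None and total >= threshold:
                    est = self.estimate(candidate)
                    if est >= threshold:
                        found[candidate] = est
        return found


sketch = MajorityVoteSketch(width=16, depth=3)
for i in range(20):
    sketch.update(f"small-flow-{i}", 10)
    sketch.update("heavy-flow", 50)
print(sketch.heavy_hitters(threshold=500))
```

The key point is in `heavy_hitters`: a flow that dominates its bucket remains the stored candidate, so frequent items can be recovered by scanning the buckets rather than by querying every possible key.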

  1. Qun Huang, Xin Jin, Patrick P. C. Lee, Runhui Li, Lu Tang, Yi-Chao Chen, and Gong Zhang.
    "SketchVisor: Robust Network Measurement for Software Packet Processing."
    Proceedings of the ACM SIGCOMM 2017 conference (SIGCOMM 2017), Los Angeles, CA, USA, August 2017.
    (AR: 36/250 = 14.4%)
    [pdf] [pptx]

  2. Qun Huang, Patrick P. C. Lee, and Yungang Bao.
    "SketchLearn: Relieving User Burdens in Approximate Measurement with Automated Statistical Inference."
    Proceedings of the ACM SIGCOMM 2018 conference (SIGCOMM 2018), Budapest, Hungary, August 2018.
    (AR: 40/222 = 18.0%)
    [pdf] [pptx] [tech report] [software] (Awarded with ACM badges)

  3. Lu Tang, Qun Huang, and Patrick P. C. Lee.
    "A Fast and Compact Invertible Sketch for Network-Wide Heavy Flow Detection."
    IEEE/ACM Transactions on Networking (TON), 28(5), pp. 2350-2363, October 2020.
    (An earlier version appeared in INFOCOM 2019)
    [pdf] [software] [doi]

  4. Lu Tang, Qun Huang, and Patrick P. C. Lee.
    "SpreadSketch: Toward Invertible and Network-Wide Detection of Superspreaders."
    Proceedings of IEEE International Conference on Computer Communications (INFOCOM 2020), Toronto, Canada, July 2020.
    (AR: 268/1354 = 19.8%)
    [pdf] [pptx] [software]

Network Telemetry Architecture

Sketches perform approximate measurement and inevitably trade accuracy for resource efficiency. OmniMon [1] addresses this trade-off and simultaneously achieves both resource efficiency and full accuracy in network-wide measurement via a new architectural design, by carefully coordinating the network measurement operations of end-hosts, switches, and the controller.

  1. Qun Huang, Haifeng Sun, Patrick P. C. Lee, Wei Bai, Feng Zhu, and Yungang Bao.
    "OmniMon: Re-architecting Network Telemetry with Resource Efficiency and Full Accuracy."
    Proceedings of the ACM SIGCOMM 2020 conference (SIGCOMM 2020), New York, NY, USA, August 2020.
    (AR: 54/250 = 21.6%)
    [pdf] [software] (Awarded with ACM artifact badges)

Workload Characterization

In collaboration with Alibaba, we have conducted field studies to characterize the workload patterns in large-scale production storage systems. We design StreamDFP [1], a general stream-based data mining framework that supports failure prediction of hard disk drives in the face of concept drift (i.e., statistical variations of data streams). In addition, we study the I/O workloads in Alibaba's cloud block storage system [2] and characterize the correlated failures in Alibaba's production SSD-based data centers [3]; in both studies, we release the datasets for public validation.

  1. Shujie Han, Patrick P. C. Lee, Zhirong Shen, Cheng He, Yi Liu, and Tao Huang.
    "Toward Adaptive Disk Failure Prediction via Stream Mining."
    Proceedings of the 40th IEEE International Conference on Distributed Computing Systems (ICDCS 2020), Singapore, November 2020.
    (AR: 105/584 = 18.0%)
    [pdf] [pptx] [software]

  2. Jinhong Li, Qiuping Wang, Patrick P. C. Lee, and Chao Shi.
    "An In-Depth Analysis of Cloud Block Storage Workloads in Large-Scale Production."
    Proceedings of 2020 IEEE International Symposium on Workload Characterization (IISWC 2020), Beijing, China, October 2020.
    (AR: 26/70 = 37.1%)
    [pdf] [pptx] [dataset]

  3. Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu.
    "An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers."
    Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 2021), February 2021.
    (AR: 28/130 = 21.5%)
    [pdf] [pptx] [software] [dataset]


Last updated in June 2021.