USENIX FAST 2023, a top conference in computer systems and architecture, was recently held in Santa Clara, USA, showcasing cutting-edge international research on storage systems. Ruiming Lu, a doctoral student at the Network Computing Center of the Department of Computer Science advised by Professor Guangtao Xue and Professor Minglu Li, won the Best Paper Award at this year's conference, the first time researchers in China have received this honor.
The award-winning paper is titled "Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems". Shanghai Jiao Tong University and Alibaba Cloud jointly proposed and implemented Perseus, a fail-slow detection framework for cloud storage systems. Perseus uses performance monitoring metrics to detect fail-slow drives in a non-intrusive, fine-grained way. The framework has been deployed on nearly 20,000 machines in Alibaba Cloud and has accurately detected hundreds of fail-slow drives. Perseus can reduce the average node-level p9999 (99.99th-percentile) tail latency by 33-64%, greatly reducing performance instability and giving customers predictable, smooth service quality. In recognition of its innovative research and substantial practical value in fail-slow detection, the paper was also recommended by the FAST Program Committee for publication in the USENIX magazine ";login:".
Background
Fail-slow failures, a previously overlooked failure mode, have gradually caught the attention of researchers in recent years. Unlike traditional fail-stop failures, in which a device stops working altogether, a fail-slow failure leaves the device in an intermediate state between full speed and complete failure. In other words, a fail-slow component is still functioning, but with lower-than-expected performance.
Accurately detecting fail-slow failures is challenging. First, there are no clear criteria for deciding when a device is fail-slow. In practice, detection usually relies on the empirical knowledge of on-site engineers and is therefore inherently inaccurate. Second, fail-slow failures are often transient, which makes them difficult to detect in a timely manner, let alone to reproduce or to trace to a root cause. Finally, performance variations caused by internal factors (e.g., SSD garbage collection) or external factors (e.g., workload bursts) can produce symptoms similar to those of fail-slow failures. As a result, healthy drives with normal performance variations can be misidentified as fail-slow.
Existing work on fail-slow failure detection is mostly coarse-grained and intrusive. Coarse-grained approaches can only detect fail-slow failures at the node level, so locating the culprit device still requires nontrivial manual effort. Intrusive approaches require source code access or software modification, yet large service providers such as cloud vendors do not touch tenants' code. Even for in-house infrastructure, inserting code segments is time-consuming, because the systems can run dozens of internal services with different software stacks. There is therefore an urgent need for a fine-grained, non-intrusive, accurate, and general fail-slow detection framework that works across a variety of cloud service products.
To tackle these challenges, this research builds on classic machine learning techniques and proposes a fail-slow detection framework for storage devices that scales to large cloud storage systems. As shown in the figure below, the overall framework consists of four steps: outlier detection, building regression models, formulating fail-slow events, and evaluating risk. In the end, the slowness of each drive is quantified by a scoring mechanism, making it easier for on-site engineers to prioritize the slowest devices for offline maintenance and manual inspection. The framework can be applied to a wide range of Alibaba Cloud services without any parameter or design adjustments. It has been deployed in Alibaba Cloud's production environment and has detected over 300 fail-slow devices in more than a year of operation, significantly reducing node-level long-tail latency while keeping cloud services running smoothly.
High-level workflow of the fail-slow detection framework
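To make the four steps more concrete, the following Python sketch walks one node's drives through a heavily simplified version of the workflow. It is only an illustration under stated assumptions, not the Perseus implementation: the article does not describe the algorithms used in each step, so the sketch swaps in a median-based peer comparison for outlier detection, a straight-line least-squares fit of latency against throughput for the regression model, a fixed safety margin for the expected-latency upper bound, and a simple sum of excess slowdown as the risk score. All function names, thresholds, and the sample data are hypothetical.

```python
from statistics import median

# A node is a dict: drive name -> list of (throughput, latency) samples,
# one sample per time step. Assumption: drives in a node see similar load.

def drop_outliers(node, k=2.0):
    """Step 1 (stand-in): per time step, drop samples whose latency exceeds
    k times the median latency of all drives in the node."""
    kept = []
    steps = len(next(iter(node.values())))
    for t in range(steps):
        med = median(samples[t][1] for samples in node.values())
        kept += [s[t] for s in node.values() if s[t][1] <= k * med]
    return kept

def fit_line(samples):
    """Step 2 (stand-in): least-squares fit latency = a * throughput + b."""
    n = len(samples)
    mx = sum(tp for tp, _ in samples) / n
    my = sum(lat for _, lat in samples) / n
    sxx = sum((tp - mx) ** 2 for tp, _ in samples)
    sxy = sum((tp - mx) * (lat - my) for tp, lat in samples)
    a = sxy / sxx
    return a, my - a * mx

def fail_slow_events(ratios, threshold=1.0, min_len=3):
    """Step 3 (stand-in): runs of at least min_len consecutive samples whose
    slowdown ratio exceeds threshold, as (start, end) with end exclusive."""
    events, start = [], None
    # The -inf sentinel closes a run that lasts until the final sample.
    for i, r in enumerate(ratios + [float("-inf")]):
        if r > threshold and start is None:
            start = i
        elif r <= threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i))
            start = None
    return events

def risk_score(ratios, events):
    """Step 4 (stand-in): total excess slowdown accumulated during events."""
    return sum(ratios[i] - 1.0 for s, e in events for i in range(s, e))

def rank_drives(node, margin=1.2):
    """Run all four steps and return drives sorted by descending risk."""
    a, b = fit_line(drop_outliers(node))
    scores = {}
    for drive, samples in node.items():
        # Slowdown ratio: observed latency over the expected upper bound.
        ratios = [lat / (margin * (a * tp + b)) for tp, lat in samples]
        scores[drive] = risk_score(ratios, fail_slow_events(ratios))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: three healthy drives and one (d3) that runs at
# three times the normal latency during time steps 2 through 6.
throughput = [100, 200, 300, 400, 500, 600, 700, 800]
base = [0.01 * tp + 1.0 for tp in throughput]
node = {
    "d0": list(zip(throughput, base)),
    "d1": [(tp, lat + 0.1) for tp, lat in zip(throughput, base)],
    "d2": [(tp, lat - 0.1) for tp, lat in zip(throughput, base)],
    "d3": [(tp, lat * 3 if 2 <= t <= 6 else lat)
           for t, (tp, lat) in enumerate(zip(throughput, base))],
}
ranking = rank_drives(node)
print(ranking)
```

In this toy data the slow drive is ranked first and the healthy drives score zero. The sketch also shows why latency is judged against throughput rather than against a fixed threshold: the healthy drives' latency rises with load, yet they stay below the fitted upper bound, so a workload burst alone does not trigger an event.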
USENIX Conference on File and Storage Technologies (FAST)
The FAST conference was founded in 2002 and has ever since been a top international conference in the storage field, organized jointly by the USENIX Association and ACM SIGOPS. It represents the highest level of international achievement in computer storage. This year's conference included 28 papers, from which two Best Paper Awards were selected. Over the past twenty years, FAST has driven the development of many storage-related technologies, such as RAID, flash file systems, non-volatile memory technologies, and distributed storage.
Paper link: https://www.usenix.org/conference/fast23/presentation/lu