Enabling Low-Redundancy Proactive Fault Tolerance for Stream Machine Learning via Erasure Coding

Introduction

Machine learning for continuous data streams, or stream machine learning for short, is increasingly adopted in real-time big data applications. Fault tolerance is a critical requirement for stream machine learning applications in large-scale distributed deployments. However, existing reactive fault tolerance mechanisms, which trigger failure recovery only upon the detection of failures, inevitably incur high recovery overhead and compromise the low-latency requirement of stream machine learning. We design StreamLEC, a stream machine learning system that leverages erasure coding to provide low-redundancy proactive fault tolerance for immediate failure recovery. StreamLEC supports general stream machine learning applications and incorporates several techniques to mitigate erasure coding overhead. Evaluation on a local cluster and Amazon EC2 shows that StreamLEC achieves much higher throughput than both reactive fault tolerance and replication-based proactive fault tolerance, with negligible failure recovery overhead.
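To illustrate why erasure coding can offer lower redundancy than replication, the following is a minimal sketch using a single XOR parity chunk over k data chunks. It is not StreamLEC's actual coding scheme or API; the function names (`encode`, `recover`), the chunk contents, and the choice of k = 4 are all assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(chunks: list[bytes]) -> bytes:
    """Return one parity chunk: the XOR of all k data chunks (hypothetical helper)."""
    return reduce(xor_bytes, chunks)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing chunk from the k-1 survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Four equal-length data chunks, standing in for stream items (illustrative values).
data = [b"item-000", b"item-001", b"item-002", b"item-003"]
parity = encode(data)

# Simulate losing one chunk, e.g. because the worker holding it failed.
lost = 2
survivors = [c for i, c in enumerate(data) if i != lost]
rebuilt = recover(survivors, parity)
assert rebuilt == data[lost]

# Extra data sent per data chunk: 1/k for single parity, 1.0 for 2-way replication.
redundancy_ec = 1 / len(data)
redundancy_rep = 1.0
```

In this sketch, tolerating one failure costs one extra chunk per four data chunks, whereas duplicating every chunk would cost four extra chunks. Because the parity is computed ahead of any failure, the lost chunk can be rebuilt immediately from what is already available, which is the sense in which the fault tolerance is proactive.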

Publication

Download

People

The project is developed by the Applied Distributed Systems Lab in the Department of Computer Science and Engineering at the Chinese University of Hong Kong (CUHK).

Please contact Zhinan Cheng if you have any questions.

License

The source code of StreamLEC is released under the GNU General Public License (GPL).