Improving Failure Tolerance in Large-Scale Cloud Computing Systems
Large-scale cloud computing systems have served as the fundamental supporting platform for big data, Internet of Things, and artificial intelligence applications for the past decade. As the scale and complexity of these systems increase dramatically, hardware and software failures inevitably occur and may not be detected and repaired in a timely manner. In addition, sophisticated architectural features of cloud computing may adversely affect system reliability. In response to these challenges, this paper proposes a simulation-driven framework, based on operation logs from real cloud computing systems, for improving failure tolerance in large-scale cloud computing systems. For a given cloud computing system, we first conduct a systematic analysis of its structure and operating characteristics. A Markov-based model is then used to examine the system's potential failures, assess their severity, and suggest quick recoveries. During this process, the proposed reliability-aware resource scheduling algorithm is adopted to optimize resource allocation so that the system's reliability can be improved cost-effectively. We also report a case study that demonstrates the application of our algorithm to improving the failure tolerance of a large-scale cloud computing system.
Cloud computing, failure tolerance, large-scale systems, Markov model.