Evaluating Distributed Computing Infrastructures: An Empirical Study Comparing Hadoop Deployments on Cloud and Local Systems
The popularity of distributed computing platforms (e.g., Hadoop) is largely to their ability to address scalability issues that arise due to data storage and processing limitations of standard computing systems. However, the decision to dedicate organizational resources and capital for such systems needs a careful consideration of several factors including evaluation of cloud-based distributed computing options. We propose a framework of metrics which we used to conduct an in-depth performance and cost benefit analysis of two standard Hadoop infrastructural choices, i.e., a Platform as a Service (PaaS) on-demand cloud setup and a local organizational setup. We evaluated the framework with an exploratory data analysis use case for a large-scale graph processing research problem. Our analysis considered highly granular aspects of distributed computing performance and studied how utilization rates and infrastructure amortization times affect break-even times. We identified that virtual memory management adversely affects the performance of a cloud cluster during the reduce phase with the magnitude of degradation dependent on the type of MapReduce operation. Our study is intended not only as an evaluation of infrastructural choices but also a development of a metric framework that can serve as a baseline for researchers examining distributed infrastructures.
Cloud computing, Computers and information processing, Cost benefit analysis, Distributed computing, Data processing, Performance evaluation, Platform-as-a-Service