Push-based Network-efficient Hadoop YARN Scheduling Mechanism for In-memory Computing
In the big data era, data-intensive cluster computing systems such as Hadoop have gained much popularity, and YARN, the second generation of Hadoop, has become the general resource manager of the Hadoop ecosystem. In distributed computing scenarios, data locality (scheduling tasks on the nodes where their data resides) is essential to performance, since higher data locality brings lower network transmission cost and higher throughput. However, we find that the native YARN scheduling mechanism achieves little data locality, and that the delay scheduling strategy, while it does achieve data locality, leads to a long-tail effect in in-memory computing scenarios. Therefore, in this paper we propose a push-based YARN scheduling mechanism for the in-memory computing environment. First, we classify the ResourceRequests into categories. Then, we prune the non-local ResourceRequests to achieve data locality quickly for in-memory computation. Finally, we push the remaining long-tail ResourceRequests to the data-locality nodes to avoid the long-tail effect. The experimental results demonstrate that the proposed scheduling mechanism achieves a data-locality percentage of nearly 100%, compared with only 10% to 20% for the native YARN scheduling mechanism. At an identical data-locality percentage, the proposed push-based scheduling mechanism improves throughput by nearly 20% and reduces application running time by nearly 10% compared with the existing delay scheduling mechanism used in YARN.
Scheduling, Data locality, Hadoop YARN, In-memory computing
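
The three steps named in the abstract (classify, prune, push) can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the scheduler implemented in this paper or in YARN: the Request class, the category names, the free_slots map, and the functions classify and schedule are all hypothetical, and the categories simply mirror the node, rack, and any-node levels at which YARN requests can be expressed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a YARN ResourceRequest."""
    task_id: str
    location: str  # a host name, a rack name, or "*" for any node


def classify(requests, hosts, racks):
    """Step 1: group requests by an assumed locality category."""
    groups = defaultdict(list)
    for r in requests:
        if r.location in hosts:
            groups["node_local"].append(r)
        elif r.location in racks:
            groups["rack_local"].append(r)
        else:
            groups["any"].append(r)
    return groups


def schedule(requests, hosts, racks, free_slots):
    """Steps 2 and 3: prune non-local requests, then push the leftovers."""
    groups = classify(requests, hosts, racks)
    # Step 2 (prune): keep only node-local requests, so a task can
    # never be placed away from its data.
    local = groups["node_local"]
    slots = dict(free_slots)  # copy so the caller's map is not mutated
    assigned, pushed = {}, {}
    for r in local:
        if slots.get(r.location, 0) > 0:
            slots[r.location] -= 1
            assigned[r.task_id] = r.location
        else:
            # Step 3 (push): instead of waiting for the node to offer
            # capacity, record the request against its data node.
            pushed[r.task_id] = r.location
    return assigned, pushed
```

In this sketch a task whose data node has a free slot is assigned immediately, and every other task is queued at its data node rather than being relaxed to a rack-local or arbitrary node, which is the intuition behind avoiding both non-local placement and long waits.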