An Intermediate Data Partition Algorithm for Skew Mitigation in Spark Computing Environment
In the Hadoop/Spark parallel computing frameworks, data skew is a common problem that degrades performance, prolonging overall execution time and leaving resources idle. The root cause is partition imbalance, which creates large differences in the amount of data processed by each reduce task. This paper proposes a key reassigning and splitting partition algorithm (SKRSP), implemented on the source code of the Spark-core 2.11 project, which considers both the partition balance of the intermediate data and the partition balance after shuffle operators. First, we propose a step-based algorithm that samples the input data to estimate the overall key distribution of the intermediate data. Then, according to the application type, we design two algorithms: a hash-based key reassigning algorithm (KRHP) and a range-based key splitting algorithm (KSRP), which generate an appropriate partition strategy and mitigate skew in the shuffle phase. KSRP generates weighted bounds to split the intermediate data for sort-based applications, while KRHP records the reassigned keys and the reducers they are reassigned to for other applications. Finally, we implement SKRSP in Spark 2.2.0 and evaluate its performance on four benchmarks exhibiting significant data skew: WordCount, Sort, Join, and PageRank. The experimental results verify that our algorithm not only achieves better partition balance but also effectively reduces the execution time of reduce tasks.
data sampling, data skew, MapReduce, partition, Spark
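To make the range-based splitting idea (KSRP) concrete, the following is a minimal, self-contained sketch of computing weighted partition bounds from a sampled list of (key, weight) pairs, so that each reducer receives roughly equal total weight. The function names and the sample data are illustrative assumptions, not the paper's or Spark's actual API, and the sketch omits the paper's further step of splitting records of a single heavy key across multiple reducers.

```python
# Hypothetical sketch: choose range-partition bounds weighted by sampled
# key frequencies instead of by key count alone. Names are illustrative.

def weighted_bounds(sample, num_partitions):
    """Return upper-bound keys for the first num_partitions - 1 reducers,
    chosen so cumulative sampled weight crosses equal-sized steps."""
    ordered = sorted(sample)                      # sort (key, weight) by key
    total = sum(w for _, w in ordered)
    step = total / num_partitions                 # ideal weight per reducer
    bounds, cum = [], 0.0
    for key, weight in ordered:
        if len(bounds) == num_partitions - 1:
            break
        cum += weight
        if cum >= step * (len(bounds) + 1):       # crossed the next quantile
            bounds.append(key)
    return bounds

def get_partition(key, bounds):
    """Route a key to the first reducer whose upper bound covers it."""
    for i, b in enumerate(bounds):
        if key <= b:
            return i
    return len(bounds)

# Skewed sample: keys "a" and "b" carry most of the weight.
sample = [("a", 5), ("b", 5), ("c", 1), ("d", 1)]
bounds = weighted_bounds(sample, 3)   # -> ["a", "b"]
# Resulting reducer loads are 5, 5, and 2, close to the ideal 4 per reducer.
```

A real implementation would plug such bounds into a custom partitioner applied during the shuffle; the sketch only shows how sampled frequencies reshape the bounds.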