BE/BTech & ME/MTech Final Year Projects for Computer Science | Information Technology | ECE Engineer | IEEE Projects Topics, PHD Projects Reports, Ideas and Download | Sai Info Solution | Nashik |Pune |Mumbai
director@saiinfo settings_phone02536644344 settings_phone02048626262 settings_phone+919270574718 +919096813348 settings_phone+919028924212
logo


SAI INFO SOLUTION

Diploma | BE |B.Tech |ME | M.Tech |PHD

Project Development and Training

Search Project by Domain wise


An Intermediate Data Partition Algorithm for Skew Mitigation in Spark Computing Environment


Using Big Data Analytics to Cr

Class Agnostic Image Common Ob
Abstract


In the parallel computing framework of Hadoop/Spark, data skew is a common problem resulting in performance degradation, such as prolonging of the entire execution time and idle resources. What lies behind this issue is the partition imbalance, that causes significant differences in the amount of data processed by each reduce task. This paper proposes a key reassigning and splitting partition algorithm (SKRSP) to solve the partition skew from the source codes of Spark-core 2.11 project, which considers both the partition balance of the intermediate data and the partition balance after shuffle operators. First, we propose a step-based algorithm for sampling the input data to estimate the general key distribution of entire intermediate data. According to the types of the specific applications, we design two algorithms: hash based key reassigning algorithm (KRHP) and rang based key splitting algorithm (KSRP), which can generate appropriate strategy and implement the skew mitigation in shuffle phase. KKSRP generates the weighted bounds to split intermediate data for the type of sort-based applications while KRHP records these reassigned keys and the new reducers these keys belong to for other applications. Finally, we implement SKRSP in Spark 2.2.0 and evaluate its performance through four benchmarks exhibiting significant data skew: WordCount, Sort, Join, and PageRank. The experimental results verify that our algorithm not only can achieve a better partition balance but also reduce the execution time of reduce tasks effectively.

KeyWords
data sampling, data skew, MapReduce, partition, Spark



Share
Share via WhatsApp
BE/BTech & ME/MTech Final Year Projects for Computer Science | Information Technology | ECE Engineer | IEEE Projects Topics, PHD Projects Reports, Ideas and Download | Sai Info Solution | Nashik |Pune |Mumbai
Call us : 09096813348 / 02536644344
Mail ID : developer.saiinfo@gmail.com
Skype ID : saiinfosolutionnashik