An Efficient Data Duplication System Based on the Hadoop Distributed File System
HDFS (Hadoop Distributed File System) is the storage layer of Apache Hadoop, designed to store large datasets reliably. HDFS is used to process massive-scale data in parallel, and it ensures data availability by replicating data across different nodes. However, the replication policy of HDFS does not take the popularity of data into account. The popularity of files tends to change over time, so maintaining a fixed replication factor can degrade the storage efficiency of HDFS. An Efficient Data Duplication System based on HDFS is proposed, which considers the popularity of the files stored in HDFS before replication. The proposed technique reduces storage consumption by up to 45% without affecting the availability and fault tolerance of HDFS.
Data Locality, Data Duplication, Hadoop, Access Prediction.
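A minimal sketch of how popularity-aware replication could be applied through the public Hadoop FileSystem API. The replication levels, access-count threshold, and helper names here are illustrative assumptions, not the paper's actual implementation; only FileSystem.setReplication is a confirmed Hadoop API call.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PopularityAwareReplicator {
    // Illustrative policy (assumed values): frequently accessed files keep the
    // default three replicas, while cold files are trimmed to one to save storage.
    private static final short HOT_REPLICATION  = 3;
    private static final short COLD_REPLICATION = 1;
    private static final long  HOT_ACCESS_THRESHOLD = 100;

    public static void adjustReplication(FileSystem fs, Path file, long accessCount)
            throws java.io.IOException {
        short target = (accessCount >= HOT_ACCESS_THRESHOLD)
                ? HOT_REPLICATION
                : COLD_REPLICATION;
        // setReplication is part of the public Hadoop FileSystem API; the
        // NameNode adds or removes block replicas asynchronously to match it.
        fs.setReplication(file, target);
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // In the proposed system the access count would come from an
        // access-prediction component; here it is a hard-coded placeholder.
        adjustReplication(fs, new Path("/data/sample.txt"), 150L);
    }
}
```

Driving the replication factor from observed or predicted access frequency, rather than a fixed cluster-wide setting, is the core idea the abstract describes: hot files stay highly available while cold files stop consuming triple their size in storage.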