Employing optimal-period approximations in the Apache Hadoop checkpoint technique for performance improvement
The Checkpoint and Recovery (CR) technique is widely used due to its fault tolerance efficiency. The Apache Hadoop framework uses this technique to recover from failures in its distributed file system. However, determining the optimal interval between successive checkpoints is a challenge, especially in Hadoop, which does not allow this parameter to be modified at runtime. The Dynamic Configuration Architecture (DCA) was created to solve this issue by enabling changes to the checkpoint period without interrupting Hadoop services. This paper presents improvements to the DCA that configure the Hadoop checkpoint period at runtime based on optimal-period approximations already endorsed by the literature. The proposed improvement relies on tracking system resources. The data collected from these resources are stored in a history of attributes: a tree of monitored elements whose data are updated as new observations occur in the system. This feature enables the estimation of system factors, from which our solution computes the checkpoint costs and the mean time between failures (MTBF). For validation, experiments with transient failures in the NameNode were created, and the usage of the history of attributes was tested in different scenarios. The evaluation results show that an adaptive configuration of checkpoint periods reduces the time wasted by failures in the NameNode and improves Hadoop performance. In addition, the history of attributes demonstrated its value by providing an efficient way to estimate the system factors.
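The abstract does not state which optimal-period approximations are used, but the classic results in the checkpoint literature are Young's first-order formula and Daly's refinement, both of which derive the interval from the checkpoint cost C and the MTBF M. A minimal sketch of these well-known approximations, assuming only that the system factors (C and M) have been estimated, e.g. from a history of attributes:

```python
import math

def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's (1974) first-order approximation of the optimal
    checkpoint interval: T_opt = sqrt(2 * C * M), where C is the
    cost of taking one checkpoint and M is the MTBF (same units)."""
    return math.sqrt(2 * checkpoint_cost * mtbf)

def daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """Daly's (2006) first-order refinement of Young's formula:
    T_opt = sqrt(2 * C * M) - C, valid when C is much smaller
    than M."""
    return math.sqrt(2 * checkpoint_cost * mtbf) - checkpoint_cost

# Illustrative values (hypothetical, not from the paper):
# a 60-second checkpoint cost and a 24-hour MTBF.
C, M = 60.0, 24 * 3600.0
print(round(young_period(C, M)))  # interval in seconds
print(round(daly_period(C, M)))
```

A dynamic scheme such as the DCA would re-evaluate these formulas whenever the estimated C or M changes, and push the new period to the running system without a restart.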
checkpoint and recovery, dynamic configuration, distributed file system, optimal periods, fault tolerance