A Model for Distributed Processing and Analyses of NGS Data under Map-Reduce Paradigm
Massively parallel sequencing, introduced by next-generation sequencing (NGS) technology, has led to exponential growth in sequencing data, at greatly reduced cost and increased throughput. This explosion of data has introduced new challenges in storage, integration, processing and analysis. In this paper, we propose a novel distributed model under the Map-Reduce paradigm to address the NGS big data problem. The architecture follows a modular, Map-Reduce based approach with three phases that support various analytical pipelines. The first phase generates detailed base-level information for individual genomes by granulating the alignment data. The other two phases process this base-level information independently and in parallel: one provides an integrated DNA profile of multiple individuals, whereas the other generates contigs with similar features within an individual. Each of these two phases produces a repository of genomic information that facilitates other analytical pipelines. Simulated and real experimental prototypes are presented as results to show the effectiveness of the model and its superiority over a few popular existing models and tools. A detailed description of the model's scope of application is also included in this article.
CNV, Hadoop, Personalised Medicine, Cancer, NGS, fault tolerant model, Bioinformatics analytical workflow.
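The three-phase flow outlined in the abstract can be illustrated with a small in-memory sketch. This is not the implementation described in the article: the `map_reduce` helper stands in for a real Hadoop job, and the alignment record layout, the use of read depth as the "similar feature" for contigs, and every function name are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def map_reduce(records, mapper, reducer):
    """In-memory stand-in for a Map-Reduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(shuffled.items())}


# Phase 1 (assumed record layout): granulate alignments into base-level data.
# Each alignment is (individual, chromosome, 1-based start, aligned bases).
def phase1_map(alignment):
    individual, chrom, start, bases = alignment
    for offset, base in enumerate(bases):
        yield (individual, chrom, start + offset), base


def phase1_reduce(key, bases):
    # Depth is the number of reads covering the position; the reported
    # base is the most frequently observed one.
    return {"depth": len(bases), "base": Counter(bases).most_common(1)[0][0]}


# Phase 2: integrate base-level data across individuals per position.
def phase2_map(record):
    (individual, chrom, pos), info = record
    yield (chrom, pos), (individual, info["base"])


def phase2_reduce(key, values):
    return dict(values)


# Phase 3: within one individual, merge adjacent positions that share a
# feature (here, hypothetically, equal depth) into contigs.
def phase3_map(record):
    (individual, chrom, pos), info = record
    yield (individual, chrom), (pos, info["depth"])


def phase3_reduce(key, values):
    contigs = []  # each contig is (start, end, depth)
    for pos, depth in sorted(values):
        if contigs and contigs[-1][1] == pos - 1 and contigs[-1][2] == depth:
            contigs[-1] = (contigs[-1][0], pos, depth)
        else:
            contigs.append((pos, pos, depth))
    return contigs


# Toy input, invented for this sketch.
alignments = [
    ("ind1", "chr1", 1, "ACG"),
    ("ind1", "chr1", 2, "CGT"),
    ("ind2", "chr1", 1, "ACG"),
    ("ind2", "chr1", 2, "CG"),
]

base_level = map_reduce(alignments, phase1_map, phase1_reduce)
# Phases 2 and 3 read the same phase 1 output and do not depend on each other.
profile = map_reduce(base_level.items(), phase2_map, phase2_reduce)
contigs = map_reduce(base_level.items(), phase3_map, phase3_reduce)
```

Because the second and third jobs consume only the phase 1 output, they could be scheduled concurrently on a cluster, which is the independence the abstract refers to.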