Architecture Design of Distributed Medical Big Data Platform Based on Spark
This article presents a distributed computing and storage system, built on Apache Spark, for large-scale health data. The system provides a solution for the digitalization and analysis of health data, and supports high-throughput data processing, real-time processing, and messaging. Its design can provide medical professionals with a range of health-related services, such as data retrieval and processing, real-time alerts, and data mining. The article describes the key considerations in the design process, including the comparison of alternative components and the choice of the most suitable data flow mechanism for each specific task. The design and implementation involve the following core technologies: (1) Java and Scala as programming languages, with IntelliJ IDEA as the IDE; (2) Apache Spark, a general-purpose distributed data processing engine; (3) Apache Hadoop (HDFS and YARN), the distributed storage system and computational resource manager, respectively; (4) MySQL, a relational database engine; (5) Apache Hive, a distributed SQL data warehouse engine; (6) Apache Kafka, a distributed messaging system.
MapReduce, Spark Streaming, Hadoop, big health data