The Automation of the Data Lake Ingestion Process from Various Sources
In a big data environment, it is often necessary to ingest data from different sources into a single store. Because of its low storage cost, distributed architecture and fault tolerance, that store is typically HDFS. It enables users to manipulate data with various tools from the Hadoop ecosystem. The process of data ingestion seems simple. However, because the sources can be different database systems holding structured, semi-structured and unstructured data, the ingestion procedure becomes complicated. It is usually not enough to just store everything: data needs to be stored in a way that lets users access and manipulate it quickly. There are many ingestion-specific solutions in the big data ecosystem. This paper describes an implemented system for data ingestion from MSSQL, MySQL and PostgreSQL into a Hive database. The process starts with creating tables and their corresponding metadata, continues with the ingestion itself and ends with a description of how the process is automated. The use of Sqoop, an open-source ingestion tool, and Hue, a web user interface from Cloudera, is also described.
hadoop, sqoop, hive, data ingestion, automation, big data
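The abstract mentions automating Sqoop imports from several relational sources into Hive. As a rough sketch only (not the authors' actual implementation), the per-source command assembly might look like the following; all hostnames, databases, table names and file paths are hypothetical placeholders:

```python
# Sketch: assemble a `sqoop import` command per source database so the
# ingestion into Hive can be driven by a scheduler (cron, Oozie, Airflow).
# Connection strings, credentials and table names below are hypothetical.

def build_sqoop_import(jdbc_url, username, table, hive_db):
    """Assemble the argument list for a `sqoop import` into a Hive database."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", "/user/etl/.db_password",  # keep the password off the CLI
        "--table", table,
        "--hive-import",                 # create/load the matching Hive table
        "--hive-database", hive_db,
        "--num-mappers", "4",            # degree of parallelism for the import
    ]

# One entry per source system described in the paper (MSSQL, MySQL, PostgreSQL).
sources = [
    ("jdbc:sqlserver://mssql-host:1433;databaseName=sales", "etl", "orders"),
    ("jdbc:mysql://mysql-host:3306/crm", "etl", "customers"),
    ("jdbc:postgresql://pg-host:5432/logs", "etl", "events"),
]

commands = [build_sqoop_import(url, user, tbl, "staging")
            for url, user, tbl in sources]
# Each list can then be executed, e.g. subprocess.run(cmd, check=True),
# from whatever scheduler automates the pipeline.
```

Hue could then be used to inspect the resulting Hive tables, while the scheduler re-runs the generated commands on the desired cadence.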