Apache Hadoop has emerged as the de facto operating system for Big Data analytics, helping organizations handle the flood of data coming from logs, email, sensors, mobile devices, social media and more. While business intelligence systems are typically the last stop in deriving value from Big Data, the first stop is commonly manipulation of the data in the process of ETL: “Extract, Transform, Load”. ETL is the process by which data are moved from source systems, manipulated into a consumable format and loaded into a target system for further advanced analytics, analysis and reporting. ETL is emerging as one of the key use cases for Hadoop implementations. Gartner points out that “most organizations will adapt their data integration strategy using Hadoop as a preprocessor for Big Data integration in the data stores.” Our partner, Hortonworks, helps organizations ease into Hadoop ETL projects with an integration tool that minimizes hurdles and paves the way for wider adoption across the Hadoop ecosystem. The tool integrates with Hadoop and HDP directly through YARN, making it easier for users to write and maintain MapReduce or Tez jobs, so those jobs make better use of cluster resources and execute more efficiently.
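To make the idea concrete, here is a minimal sketch of the kind of ETL transform that typically runs as a YARN-scheduled MapReduce job: a map-only cleansing step over raw log records. The class name and the tab-separated record layout are illustrative assumptions, not the output of any particular vendor tool.

```java
// Hypothetical map-only ETL cleansing step, packaged as a standard
// MapReduce job that YARN schedules across the cluster.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CleanseLogMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed tab-separated raw log record: timestamp, userId, action.
        String[] fields = line.toString().split("\t");

        // Drop malformed records and normalize the rest; the sort of
        // low-value transform that is cheap on Hadoop but costly in the EDW.
        if (fields.length == 3 && !fields[1].trim().isEmpty()) {
            String cleaned = fields[0].trim() + "\t"
                    + fields[1].trim() + "\t"
                    + fields[2].trim().toLowerCase();
            context.write(NullWritable.get(), new Text(cleaned));
        }
    }
}
```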
Today’s data architectures are strained under the loads placed upon them. Data volumes continue to grow considerably, low-value workloads like ETL consume ever more processing resources, and new types of data can’t easily be captured and put to use. Organizations struggle with escalating costs, increasing complexity, and the challenge of scaling.
Data architects use Hadoop to address these challenges: moving high volumes of data into Hadoop, offloading ETL processing, and enriching existing data architectures with new types of data for increased value.
The scope of ETL, analytics and operations tasks executed by the Enterprise Data Warehouse (EDW) has grown considerably. The ETL function is a relatively low-value computing workload that can be performed at lower cost when offloaded to Hadoop. Data are extracted and transformed on the Hadoop cluster, and only the results are loaded into the data warehouse. The result: critical CPU cycles and storage space are freed for the truly high-value functions, analytics and operations, that best leverage the EDW’s advanced capabilities.
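As an illustration of this offload pattern, the sketch below (with hypothetical HDFS paths and class names) shows a standard MapReduce driver that runs the cleansing mapper from the earlier example on the cluster: raw extracts are read from HDFS, transformed, and written to a staging directory from which a separate step loads the results into the warehouse.

```java
// Hypothetical driver that submits the cleansing job to YARN. The job reads
// raw extracts from HDFS, applies the transform, and writes warehouse-ready
// output to an HDFS staging directory for a subsequent load step.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanseLogDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "etl-cleanse-logs");
        job.setJarByClass(CleanseLogDriver.class);
        job.setMapperClass(CleanseLogMapper.class);
        job.setNumReduceTasks(0);                 // map-only transform, no shuffle
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Extract: raw files landed in HDFS (paths are assumptions).
        // Load: the staging directory is picked up by the warehouse loader.
        FileInputFormat.addInputPath(job, new Path("/data/raw/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/staging/logs_clean"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Keeping the transform map-only avoids a shuffle phase entirely, which is precisely why this class of low-value work is cheap on Hadoop and well worth moving off the EDW.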