Introducing Apache Arrow, a new standard for in-memory processing
The Apache Software Foundation last week announced a new top-level project, Apache Arrow, which promises a high-performance, cross-system data layer for columnar in-memory analytics. Apache Arrow benefits big data workloads in several key areas, including:
Arrow enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorised optimisation of analytical data processing.
Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages: Java, C, C++ and Python implementations are underway, with more expected soon.
Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Parquet, Ibis, Impala, Kudu, Pandas, Phoenix, Spark and Storm, making it the de facto standard for columnar in-memory analytics.
"The Open Source community has joined forces on Apache Arrow," said Jacques Nadeau, Vice President of Apache Arrow and Vice President of Apache Drill. "Developers from 13 major Open Source Big Data projects are already on board - by introducing a new era of columnar in-memory analytics, we anticipate the majority of the world's data will be processed through Arrow within the next few years."
"A columnar in-memory data layer enables systems and applications to process data at full hardware speeds," said Todd Lipcon, original Apache Kudu creator and member of the Apache Arrow Project Management Committee. "Modern CPUs are designed to exploit data-level parallelism via vectorised operations and SIMD instructions. Arrow facilitates such processing."
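To make the idea of a columnar in-memory layout concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API, and the record fields and the `to_columns` helper are hypothetical names chosen for illustration. It shows the same records held row by row and then pivoted into one contiguous, fixed-width buffer per field, which is the kind of layout that vectorised and SIMD-capable engines can scan efficiently. Plain Python does not itself issue SIMD instructions, so this illustrates the layout only.

```python
from array import array

# Hypothetical sample records in a row-oriented layout: one dict per row.
rows = [
    {"sensor_id": 1, "temperature": 20.5},
    {"sensor_id": 2, "temperature": 21.0},
    {"sensor_id": 3, "temperature": 19.5},
    {"sensor_id": 4, "temperature": 23.0},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous typed buffer per field."""
    return {
        # "q" = 64-bit signed integers, "d" = 64-bit floats
        "sensor_id": array("q", (r["sensor_id"] for r in records)),
        "temperature": array("d", (r["temperature"] for r in records)),
    }


columns = to_columns(rows)

# An aggregate over one field now reads a single contiguous buffer
# instead of touching every field of every row.
mean_temperature = sum(columns["temperature"]) / len(columns["temperature"])
```

Because every value in a column has the same type and width, an engine can process the buffer in fixed-size batches, which is what makes the layout a good fit for data-level parallelism.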
In many workloads, 70-80% of CPU cycles are spent serialising and deserialising data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialisation, deserialisation or memory copies.
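The difference between sharing memory and copying it can be sketched with the Python standard library. This is an analogy under stated assumptions rather than Arrow's own mechanism: it stays inside a single process, and the `prices` column is a made-up example. A `memoryview` gives a second consumer access to the same buffer with no copy, whereas `tobytes()` produces a separate serialised copy.

```python
from array import array

# A hypothetical column of 64-bit floats held in one contiguous buffer.
prices = array("d", [10.0, 12.5, 11.0, 13.5])

# memoryview exposes the same memory without copying or serialising it.
view = memoryview(prices)

# Slicing a memoryview is also zero-copy: a window onto the last two values.
tail = view[2:]

# A write through the view is visible in the original column,
# which shows that both names refer to the same memory.
tail[0] = 99.0
shared = prices[2] == 99.0

# By contrast, tobytes() makes an independent serialised copy,
# so later writes to the column do not affect it.
snapshot = prices.tobytes()
prices[0] = 0.0
copy_is_independent = array("d", snapshot)[0] == 10.0
```

The copy path costs time and memory in proportion to the size of the data, while the shared view costs the same however large the column is. That is the saving the paragraph above describes, applied there across systems and processes rather than within one program.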
Scalable, high-performance analytics are front of mind for customers and businesses seeking to fully realise the value of Big Data and Hadoop. A new standard for in-memory columnar processing looks promising, and within a couple of years it is likely to be a common part of the big data tool belt.
The flexibility of its application to real-world use cases should drive a rapid rise in popularity. It has already been tested on complex combinations of data structures and proved efficient for in-memory processing. For example, Arrow can handle JSON data, which is commonly used in IoT workloads. Implementations that are available (or underway) for a number of programming languages, including Java, C++ and Python, allow it to be used across a wide range of Big Data solutions and by a wide range of practitioners.
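As a rough illustration of how JSON records might map onto columns, the sketch below pivots a few JSON lines into per-field value lists, each paired with a list of flags marking which entries were actually present. It uses only the Python standard library and is not Arrow's API. The sample readings and the `json_lines_to_columns` helper are hypothetical, and the per-value flags are a simplified stand-in for the way a columnar format can track missing values.

```python
import json

# Hypothetical IoT readings as JSON lines; the second record omits "humidity".
raw = """\
{"device": "a1", "temp": 21.5, "humidity": 40}
{"device": "b2", "temp": 19.0}
{"device": "c3", "temp": 22.5, "humidity": 55}
"""


def json_lines_to_columns(text, fields):
    """Pivot JSON records into per-field value lists plus presence flags."""
    columns = {name: {"values": [], "valid": []} for name in fields}
    for line in text.splitlines():
        record = json.loads(line)
        for name in fields:
            # Missing fields are stored as None and flagged as not valid.
            columns[name]["values"].append(record.get(name))
            columns[name]["valid"].append(name in record)
    return columns


columns = json_lines_to_columns(raw, ["device", "temp", "humidity"])
```

Keeping the presence flags separate from the values means every column has the same length even when records are irregular, which is common with semi-structured sensor data.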
In the past couple of years there has been great innovation within the open source community, with Spark, Impala and Storm all becoming powerful technologies that help businesses understand their industry in varied data environments. Apache Arrow represents a new type of memory-based format for use across multiple systems, applications and programming languages.