Manado, Indonesia. 95252
(+62) 823-9602-9583

Apache Spark Parallel Processing Introduction

Software Engineer | DevOps Engineer


Apache Spark Parallel Processing Introduction

Introduction to Apache Spark

Apache Spark is usually defined as a fast, general-purpose, distributed computing platform. Yes, it sounds a bit like marketing speak at first glance, but we could hardly come up with a more appropriate label to put on the Spark box. Apache spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Apache Spark really did bring a revolution to the big data space. Spark makes efficient use of memory and can execute equivalent jobs 10 to 100 times faster than Hadoop’s MapReduce. On top of that, Spark’s creators managed to abstract away the fact that you’re dealing with a cluster of machines, and instead present you with a set of collections-based APIs. Working with Spark’s collections feels like working with local Scala, Java, or Python collections, but Spark’s collections reference data distributed on many nodes. Operations on these collections get translated to complicated parallel programs without the user being necessarily aware of the fact, which is a truly powerful concept.

What is Apache Spark?

Apache Spark is an exciting new technology that is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open source, distributed, Java computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop in that it’s a distributed, general-purpose computing platform. But Spark’s unique design, which allows for keeping large amounts of data in memory, offers tremendous performance improvements. Spark programs can be 100 times faster than their MapReduce counterparts.

Spark combines MapReduce-like capabilities for batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning, all in a single framework. This makes it a one-stop shop for most of your big data-crunching needs. It’s no wonder, then, that Spark is one of the busiest and fastest-growing Apache Software Foundation projects today.

But some applications aren’t appropriate for Spark. Because of its distributed architecture, Spark necessarily brings some overhead to the processing time. This overhead is negligible when handling large amounts of data; but if you have a dataset that can be handled by a single machine (which is becoming ever more likely these days), it may be more efficient to use some other framework optimized for that kind of computation. Also, Spark wasn’t made with online transaction processing (OLTP) applications in mind (fast, numerous, atomic transactions). It’s better suited for online analytical processing (OLAP): batch jobs and data mining.

Apache Spark Architecture

The Spark revolution

Although the last decade saw Hadoop’s wide adoption, Hadoop is not without its shortcomings. It’s powerful, but it can be slow. This has opened the way for newer technologies, such as Spark, to solve the same challenges Hadoop solves, but more efficiently. In the next few pages, we’ll discuss Hadoop’s shortcomings and how Spark answers those issues.

The Hadoop framework, with its HDFS and MapReduce data-processing engine, was the first that brought distributed computing to the masses. Hadoop solved the three main problems facing any distributed data-processing endeavor :

  • Parallelization— How to perform subsets of the computation simultaneously.
  • Distribution— How to distribute the data.
  • Fault tolerance— How to handle component failure

On top of that, Hadoop clusters are often made of commodity hardware, which makes Hadoop easy to set up. That’s why the last decade saw its wide adoption.

MapReduce’s shortcomings

Although Hadoop is the foundation of today’s big data revolution and is actively used and maintained, it still has its shortcomings, and they mostly pertain to its Map-Reduce component. MapReduce job results need to be stored in HDFS before they can be used by another job. For this reason, MapReduce is inherently bad with iterative algorithms.

Map Reduces
Map Reduces . MapR, [Carol McDonald], 2018

Furthermore, many kinds of problems don’t easily fit MapReduce’s two-step paradigm, and decomposing every problem into a series of these two operations can be difficult. The API can be cumbersome at times.

Hadoop is a rather low-level framework, so myriad tools have sprung up around it: tools for importing and exporting data, higher-level languages and frameworks for manipulating data, tools for real-time processing, and so on. They all bring additional complexity and requirements with them, which complicates any environment. Spark solves many of these issues.

What Spark brings to the table

Spark’s core concept is an in-memory execution model that enables caching job data in memory instead of fetching it from disk every time, as MapReduce does. This can speed the execution of jobs up to 100 times, compared to the same jobs in Map-Reduce; it has the biggest effect on iterative algorithms such as machine learning, graph algorithms, and other types of workloads that need to reuse data.

In the case of MapReduce, you’d need to store the results of each of these three phases on disk (HDFS). Each subsequent phase would read the results of the previous one from disk. But with Spark, you can find the shortest path between all vertices and cache that data in memory. The next phase can use that data from memory, find the farthest point distance for each vertex, and cache its results. The last phase can go through this final cached data and find the vertex with the minimum farthest point distance. You can imagine the performance gains compared to reading and writing to disk every time.

Spark performance is so good that in October 2014 it won the Daytona Gray Sort contest and set a world record (jointly with TritonSort, to be fair) by sorting 100 TB in 1,406 seconds (see

Spark’s ease of use

Spark supports the Scala, Java, Python, and R programming languages, so it’s accessible to a much wider audience. Although Java is supported, Spark can take advantage of Scala’s versatility, flexibility, and functional programming concepts, which are a much better fit for data analysis. Python and R are widespread among data scientists and in the scientific community, which brings those users on par with Java and Scala developers.

Furthermore, the Spark shell (read-eval-print loop [REPL]) offers an interactive console that can be used for experimentation and idea testing. There’s no need for compilation and deployment just to find out something isn’t working (again). REPL can even be used for launching jobs on the full set of data.

Finally, Spark can run on several types of clusters: Spark standalone cluster, Hadoop’s YARN (yet another resource negotiator), and Mesos. This gives it additional flexibility and makes it accessible to a larger community of users.

Spark as a unifying platform

An important aspect of Spark is its combination of the many functionalities of the tools in the Hadoop ecosystem into a single unifying platform. The execution model is general enough that the single framework can be used for stream data processing, machine learning, SQL-like operations, and graph and batch processing. Many roles can work together on the same platform, which helps bridge the gap between programmers, data engineers, and data scientists. And the list of functions that Spark provides is continuing to grow.

Spark anti-patterns

Spark isn’t suitable, though, for asynchronous updates to shared data such as online transaction processing, because it has been created with batch analytics in mind.

Also, if you don’t have a large amount of data, Spark may not be required, because it needs to spend some time setting up jobs, tasks, and so on. Sometimes a simple relational database or a set of clever scripts can be used to process data more quickly than a distributed system such as Spark. But data has a tendency to grow, and it may outgrow your relational database management system (RDBMS) or your clever scripts rather quickly.

That’s all from introduction for apache spark, you may see spark compenent in next posts. Feel free to leave a comment through this post or email me directly for futher discussion


No Comments

Leave a Reply

%d bloggers like this: