Apache Spark Parallel Program Flows
Apache Spark consists of several purpose-built components, as we discussed in the introduction to Apache Spark. Let’s see what a typical Spark program looks like. Imagine that a 300 MB log file is stored in a three-node HDFS cluster. The Hadoop Distributed File System (HDFS) automatically splits the file into blocks of at most 128 MB (here, two 128 MB blocks and one 44 MB block) and places each block on a separate node of the cluster.
Let’s assume Spark is running on YARN, inside the same Hadoop cluster. Although it’s not relevant to the example, we should probably mention that HDFS replicates each block to two additional nodes.
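The block arithmetic above can be checked with a few lines of plain Scala. This is only an illustrative sketch: the object name `HdfsBlockMath` and the helper `blockSizes` are made up for this example and are not part of HDFS or Spark.

```scala
object HdfsBlockMath {
  // Hypothetical helper: sizes (in MB) of the blocks that a file of
  // `fileMb` megabytes is split into, given a block size of `blockMb`.
  def blockSizes(fileMb: Int, blockMb: Int): Seq[Int] = {
    val fullBlocks = Seq.fill(fileMb / blockMb)(blockMb)
    val remainder = fileMb % blockMb
    if (remainder > 0) fullBlocks :+ remainder else fullBlocks
  }

  def main(args: Array[String]): Unit = {
    val blocks = blockSizes(fileMb = 300, blockMb = 128)
    println(blocks.mkString(", ")) // 128, 128, 44

    // Each block is stored once plus two replicas, so three copies in total.
    val copiesPerBlock = 3
    println(blocks.size * copiesPerBlock) // 9 stored block copies
  }
}
```

With three blocks and three copies of each, every node of the three-node cluster ends up holding a copy of every block.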
A Spark data engineer is given the task of analyzing how many errors of type “OutOfMemoryError” have happened during the last two weeks.
Mary, the engineer, knows that the log file contains the last two weeks of logs of the company’s application server cluster.
She sits at her laptop and starts to work. She first starts her Spark shell and establishes a connection to the Spark cluster. Next, she loads the log file from HDFS.
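To achieve maximum data locality, the loading operation asks Hadoop for the locations of each block of the log file and then loads each block into the RAM of the cluster’s nodes. (Strictly speaking, Spark evaluates lazily, so the blocks are actually read only when the first action, such as count, runs.)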
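In the Spark shell, the loading step might look like the sketch below. It assumes the shell's predefined SparkContext `sc`. The HDFS path is a hypothetical placeholder, since the article does not say where the log file is stored.

```scala
// Meant to be typed into the Spark shell, where `sc` (the SparkContext)
// is already defined. The path below is a made-up placeholder.
val lines = sc.textFile("hdfs:///logs/app-server.log")

// textFile is lazy: at this point Spark has only recorded where the data
// lives. For a file on HDFS there is typically one partition per block.
println(lines.getNumPartitions)
```

The name `lines` is chosen to match the RDD that the filter call later in this article operates on.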
Now Spark has a reference to each of those blocks (partitions, in Spark terminology) in RAM. The sum of those partitions is a distributed collection of lines from the log file referenced by an RDD. Simplifying, we can say that RDDs allow you to work with a distributed collection the same way you would work with any local, nondistributed one.
You don’t have to worry about the fact that the collection is distributed, nor do you have to handle node failures yourself. Data locality is honored if each block gets loaded in the RAM of the same node where it resides in HDFS. The whole point is to try to avoid having to transfer large amounts of data over the wire.
In addition to automatic fault tolerance and distribution, the RDD provides an elaborate API, which allows you to work with a collection in a functional style.
You can filter the collection; map over it with a function; reduce it to a cumulative value; subtract, intersect, or create a union with another RDD, and so on.
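To get a feel for this functional style, here is a small sketch that uses ordinary local Scala collections rather than RDDs. The sample data and the object name `LocalCollectionDemo` are invented for illustration. The comments name the RDD methods that play the corresponding role (`filter`, `map`, `reduce`, `subtract`, `intersection`, `union`).

```scala
object LocalCollectionDemo {
  // Invented sample data standing in for lines of a log file.
  val logLines: Seq[String] = Seq(
    "INFO start",
    "ERROR OutOfMemoryError",
    "WARN slow",
    "ERROR OutOfMemoryError")

  // filter: keep only the elements that satisfy a predicate (RDD: filter).
  val errors: Seq[String] = logLines.filter(_.contains("ERROR"))

  // map: transform every element with a function (RDD: map).
  val lengths: Seq[Int] = logLines.map(_.length)

  // reduce: fold all elements into one cumulative value (RDD: reduce).
  val totalChars: Int = lengths.reduce(_ + _)

  // Set-like operations between two collections.
  val a: Set[Int] = Set(1, 2, 3)
  val b: Set[Int] = Set(3, 4)
  val onlyInA: Set[Int] = a.diff(b)      // RDD: subtract
  val inBoth: Set[Int] = a.intersect(b)  // RDD: intersection
  val inEither: Set[Int] = a.union(b)    // RDD: union (an RDD keeps duplicates)

  def main(args: Array[String]): Unit = {
    println(errors.size) // 2
    println(totalChars)  // 63
    println(onlyInA)     // Set(1, 2)
  }
}
```

The same chain of calls reads almost identically on an RDD. The difference is that Spark runs each step on the partitions in parallel across the cluster.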
Mary now has a reference to the RDD, so in order to find the error count, she first wants to remove all the lines that don’t have an OutOfMemoryError substring.
This is a job for the filter function, which she calls like this:
val Lines = lines.filter(l => l.contains("OutOfMemoryError")).cache()
After filtering the collection so it contains only the subset of data that she needs to analyze, Mary calls cache on it, which tells Spark to keep that RDD in memory across jobs. Caching is a basic building block of the Spark performance improvements we mentioned before. The benefits of caching the RDD will become apparent later.
Now she is left with only those lines that contain the error substring. For this simple example, we’ll ignore the possibility that the OutOfMemoryError string might occur in multiple lines of a single error.
Our data engineer counts the remaining lines and reports the result as the number of out-of-memory errors that occurred in the last two weeks:
val result = Lines.count()
Spark enabled her to perform distributed filtering and counting of the data with only three lines of code: one to load the file, one to filter it, and one to count the result. Her little program was executed on all three nodes in parallel.
If she now wants to further analyze the lines that contain “OutOfMemoryError”, and perhaps call filter again on the Lines object that was previously cached in memory, Spark won’t load the file from HDFS again, as it would normally do. Instead, Spark will read the data from the cache.
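The whole flow, including a second job that reuses the cached RDD, can be sketched as a small standalone program. Several things here are assumptions made so the example can run on a single machine with Spark on the classpath: local mode stands in for the YARN cluster, a temporary file with four invented log lines stands in for the 300 MB file on HDFS, and the names `OomErrorCount`, `analyze`, and `oomLines` (which plays the role of `Lines`) are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object OomErrorCount {
  // Returns (all OutOfMemoryError lines, those mentioning "heap space").
  def analyze(sc: SparkContext, path: String): (Long, Long) = {
    val lines = sc.textFile(path)
    val oomLines = lines.filter(l => l.contains("OutOfMemoryError")).cache()

    // First action: reads the file, applies the filter, and fills the cache.
    val total = oomLines.count()

    // Second action: filters the cached RDD instead of rereading the file.
    val heapSpace = oomLines.filter(_.contains("heap space")).count()
    (total, heapSpace)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode replaces the three-node cluster of the example.
    val conf = new SparkConf().setAppName("OomErrorCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Invented sample log, written to a temporary file.
    val sample = java.util.Arrays.asList(
      "INFO  server started",
      "ERROR java.lang.OutOfMemoryError: Java heap space",
      "WARN  slow response",
      "ERROR java.lang.OutOfMemoryError: GC overhead limit exceeded")
    val path = Files.write(Files.createTempFile("app-server", ".log"), sample)

    try {
      val (total, heapSpace) = analyze(sc, path.toString)
      println(s"total=$total heapSpace=$heapSpace")
    } finally {
      sc.stop()
    }
  }
}
```

On a tiny file the cache makes no measurable difference. The benefit appears when the source data is large and several jobs run against the same filtered RDD.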