Apache Spark Component Parallel Processing
Apache Spark consists of several purpose-built components as we have discuss at the introduction of apache spark. Apache spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools. These are Apache Spark Component : Spark Core, Spark SQL, Spark Streaming, Spark GraphX, Spark MLlib.
These components make Spark a feature-packed unifying platform: it can be used for many tasks that previously had to be accomplished with several different frameworks. A brief description of each Apache Spark component follows.
Spark allows the developers to write code quickly with the help of rich set of operators. While it takes a lot of lines of code, it takes fewer lines to write the same code in Spark Scala. Spark Core contains basic Spark functionalities required for running jobs and needed by other components. The most important of these is the resilient distributed dataset (RDD), which is the main element of the Spark API.
It’s an abstraction of a distributed collection of items with operations and transformations applicable to the dataset. It’s resilient because it’s capable of rebuilding datasets in case of node failures. This component contains logic for accessing various filesystems, such as HDFS, GlusterFS, Amazon S3, and so on. It also provides a means of information sharing between computing nodes with broadcast variables and accumulators. Other fundamental functions, such as networking, security, scheduling, and data shuffling.
Spark SQL is a component on top of Spark Core that introduces a new set of data abstraction called Schema RDD, which provides support for both the structured and semi-structured data.
It’s provides functions for manipulating large sets of distributed, structured data using an SQL subset supported by Spark and Hive SQL (HiveQL).
With DataFrames introduced in Spark 1.3, and DataSets introduced in Spark 1.6, which simplified handling of structured data and enabled radical performance optimizations, Spark SQL became one of the most important Spark components.
Operations on DataFrames and DataSets at some point translate to operations on RDDs and execute as ordinary Spark jobs. This component provides a query optimization framework called Catalyst that can be extended by custom optimization rules. Spark SQL also includes a Thrift server, which can be used by external systems, such as business intelligence tools, to query data through SQL using classic JDBC and ODBC protocols.
This spark component allows Spark to process real-time streaming data. It provides an API to manipulate data streams that matches with RDD API. It allows the programmers to understand the project and switch through the applications that manipulate the data and giving outcome in real-time.
Spark Streaming is a framework for ingesting real-time streaming data from various sources strives to make the system fault-tolerant and scalable.
The supported streaming sources include HDFS, Kafka, Flume, Twitter, ZeroMQ, and custom ones. This component operations recover from failure automatically, which is important for online data processing. Spark Stream represents streaming data using discretized streams (DStreams), which periodically create RDDs containing the data that came in during the last time window.
It also can be combined with other Spark components in a single program, unifying real-time processing with machine learning, SQL, and graph operations. This is something unique in the Hadoop ecosystem. And since Spark 2.0, the new Structured Streaming API makes Spark streaming programs more similar to Spark batch programs.
Apache Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms, classification, clustering and collaboration filters, etc. It also includes few lower-level primitives. All these functionalities help Spark scale out across a cluster.
Spark MLlib is a library of machine-learning algorithms grown from the MLbase project at UC Berkeley. Supported algorithms include logistic regression, naïve Bayes classification, support vector machines (SVMs), decision trees, random forests, linear regression, and k-means clustering.
Apache Mahout is an existing open source project offering implementations of distributed machine-learning algorithms running on Hadoop. Although Apache Mahout is more mature, both Spark MLlib and Mahout include a similar set of machine-learning algorithms.
This platform also comes with a library to manipulate the graphs and performing computations, called as GraphX. GraphX also extends Spark RDD API which creates a directed graph. It also contains numerous operators in order to manipulate the graphs along with graph algorithms.
Graphs are data structures comprising vertices and the edges connecting them. GraphX provides functions for building graphs, represented as graph RDDs: EdgeRDD and VertexRDD.
GraphX contains implementations of the most important algorithms of graph theory, such as page rank, connected components, shortest paths, SVD++, and others. It also provides the Pregel message-passing API, the same API for large-scale graph processing implemented by Apache Giraph, a project with implementations of graph algorithms and running on Hadoop.
Each of apache spark component that we have discuss above according to apache spark main website that you could found online at https://spark.apache.org/