
Apache Spark Basics FAQ


Big data has come a long way, and Apache Spark is one of the fastest big data computation engines available. In this article we answer frequently asked questions about the basics of Apache Spark.

What problems does Apache Spark solve, and how does it solve them?

Big data computation problem

When data grows to terabytes in size, it is time-consuming and inefficient to load it into a single machine's memory and process it there. Running such computations on high-end machines (large memory, multiple cores and processors) is also very expensive.

Apache Spark is a cluster-based parallel processing engine that runs efficiently on clusters of low-end machines. It can process data in memory as well as on disk.

Limitations of MapReduce processing

MapReduce is a parallel, distributed programming model for processing and generating large data sets on a cluster. It is the programming model used by Apache Hadoop for big data computation.

MapReduce processes everything on disk (a cluster of disks) in the following sequential steps:

  1. Read data from disk
  2. Map data
  3. Reduce data
  4. Write result on disk

Disk I/O (input and output) takes up most of the time of a MapReduce operation. This becomes especially inefficient when a problem requires multiple iterations over the same data set, as graph manipulation, machine learning algorithms, and similar workloads do.

Apache Spark overcomes the MapReduce processing bottlenecks with its in-memory resilient distributed dataset (RDD), a read-only multiset of data items distributed over the cluster. In memory it can be up to 100x faster than Apache Hadoop MapReduce, and on disk up to 10x faster.

With RDDs, iterative algorithms that reuse the same dataset in a loop, as well as repeated database-style querying, become practical.
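
As a rough illustration of that reuse, the sketch below caches an RDD in memory and runs several passes over it without re-reading it from disk. It assumes a local Spark setup and a hypothetical numeric input file called data.txt.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IterativeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load the data once and keep it in memory for reuse across iterations.
    val numbers = sc.textFile("data.txt") // hypothetical file, one number per line
      .map(_.toDouble)
      .cache()

    // Each pass reuses the cached RDD instead of re-reading it from disk,
    // which is where Spark gains over MapReduce-style processing.
    var threshold = 0.0
    for (_ <- 1 to 5) {
      val t = threshold
      threshold = numbers.filter(_ > t).mean()
    }
    println(s"Threshold after 5 iterations: $threshold")

    sc.stop()
  }
}
```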

The complex big data ecosystem

Apache Hadoop, another big data computation platform, has grown into a complex ecosystem of tools and libraries for real-time streaming, structured data analysis, machine learning, and so on. Development teams usually have to adopt additional frameworks for each need, which increases the cost of maintenance.

Apache Spark provides a unified ecosystem: a low-level core API along with high-level APIs and tools for real-time streaming, machine learning, and more.

What are the various components of Apache Spark?


Figure: The various components of Apache Spark

Spark Core - Apache Spark Core includes the RDD (Resilient Distributed Dataset) API, cluster management, scheduling, data source handling, memory management, fault tolerance, and other core functionality. This general-purpose, fast computational core provides the foundation for building the higher-level APIs. The benefit of this tightly coupled architecture is that whenever the Core improves, the higher-level APIs benefit as well.
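
As a minimal sketch of the core RDD API, the classic word count below builds an RDD from a text file, transforms it, and reduces it in parallel; the input and output paths are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read a text file into an RDD, split it into words, and count each word.
    val counts = sc.textFile("input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("word-counts") // hypothetical output directory
    sc.stop()
  }
}
```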

Spark SQL, DataFrames, and Datasets - These provide APIs for processing structured data (JSON, relational databases, and others). You can query the data using SQL-like syntax.
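
A minimal sketch of querying structured data with Spark SQL, assuming a hypothetical people.json file with name and age fields:

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlExample")
      .master("local[*]")
      .getOrCreate()

    // Load structured data (JSON here) into a DataFrame.
    val people = spark.read.json("people.json") // hypothetical input file

    // Register it as a temporary view and query it with SQL syntax.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```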

Spark Streaming - It provides an API for processing real-time streams of data coming from sources such as Kafka, Flume, Kinesis, or TCP sockets. These streams can then be processed further with machine learning, graph, and other algorithms.
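
A minimal streaming sketch that counts words arriving on a TCP socket in 10-second micro-batches; the host and port are illustrative (you could feed it locally with a tool like netcat).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Read lines from a TCP socket (host and port are illustrative).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```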

Spark MLlib (Machine Learning) - It provides an API for running machine learning algorithms such as classification, regression, collaborative filtering, and others.
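
A minimal sketch of training a classifier with MLlib; the LibSVM-formatted training file is hypothetical.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object LogisticRegressionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LogisticRegressionExample")
      .master("local[*]")
      .getOrCreate()

    // Load a labeled training set in LibSVM format (path is illustrative).
    val training = spark.read.format("libsvm").load("training_data.txt")

    // Fit a logistic regression classifier and inspect the fitted parameters.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}
```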

Spark GraphX - It provides an API for parallel computation on graph data (e.g., a Facebook-style friend graph). It also ships with built-in graph algorithms such as PageRank and triangle counting.
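
A minimal GraphX sketch that builds a tiny friend graph and runs the built-in PageRank algorithm; the vertices and edges are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PageRankExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageRankExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A tiny "friend" graph: vertices are users, edges are friendships.
    val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val friendships = sc.parallelize(Seq(
      Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend"), Edge(3L, 1L, "friend")))
    val graph = Graph(users, friendships)

    // Run the built-in PageRank until it converges within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices
    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s rank = $rank%.4f")
    }

    sc.stop()
  }
}
```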

In which programming languages can you write Spark applications?

Figure: Programming languages supported by Apache Spark

Apache Spark provides libraries and tools for writing applications in the Scala, Java, Python, and R programming languages.

You can also interact with Apache Spark through its Scala, Python, and R shells (command-line interfaces) to run exploratory queries, which is particularly helpful for data scientists.
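
For instance, an exploratory session in the Scala shell (spark-shell) might look like the snippet below; the log file name is illustrative, and the spark and sc variables are pre-created by the shell.

```scala
// Inside spark-shell, a SparkSession (`spark`) and SparkContext (`sc`)
// are already available, so you can query data interactively.
val logs = spark.read.textFile("server.log")   // hypothetical log file
val errors = logs.filter(_.contains("ERROR"))
errors.count()                                 // how many error lines?
errors.show(5, truncate = false)               // peek at a few of them
```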

How does an Apache Spark application execute?

Figure: Lifecycle of a Spark program executing on a cluster

The lifecycle for Spark program execution on the cluster:

  1. You write a Spark application, package it, and submit it to the cluster (not directly to a worker node). Your application's main program (the driver program) creates a SparkContext, and the application runs on the cluster as an independent set of processes coordinated by that SparkContext (a minimal driver sketch follows this list).
  2. The SparkContext can connect to the worker nodes through one of several cluster managers.
  3. The SparkContext acquires executor processes on the worker nodes. These processes run computations and store data for the application.
  4. The SparkContext sends the application code (JARs or Python files) to the executors.
  5. The SparkContext sends tasks to the executors for processing.
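
A minimal sketch of such a driver program, assuming a standalone cluster whose master URL is illustrative; the SparkContext created here is what coordinates the executors described above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverProgram {
  def main(args: Array[String]): Unit = {
    // The driver creates the SparkContext, which talks to the cluster manager
    // (a standalone master here; the URL is illustrative).
    val conf = new SparkConf()
      .setAppName("DriverProgram")
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)

    // Work expressed here is split into tasks that the SparkContext
    // ships to the executors acquired on the worker nodes.
    val data = sc.parallelize(1 to 1000)
    println(s"Sum = ${data.reduce(_ + _)}")

    // Stopping the context releases the executors held by this application.
    sc.stop()
  }
}
```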

A few notable characteristics of the Spark application architecture:

  1. Each application gets its own executor processes, which stay up for the duration of the application and run tasks in multiple threads. This isolates Spark applications from one another; however, it also means data cannot be shared between applications without writing it to external storage.
  2. Spark can work with any cluster manager, as long as it can acquire executor processes on the nodes.
  3. Network connectivity between the worker nodes and the driver program is required.
  4. Keeping the driver program and the worker nodes close to each other (preferably on the same LAN) reduces the latency of cluster task scheduling.

What data storage does Apache Spark support?

Apache Spark does not have its own data storage layer. However, it supports several storage systems, including the Hadoop Distributed File System (HDFS), HBase, Cassandra, Apache Hive, Amazon S3, and others, and it can be extended with custom data sources.
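
As a sketch, external storage is typically addressed through URI schemes; the paths below are illustrative and assume the matching connectors and credentials are configured.

```scala
import org.apache.spark.sql.SparkSession

object StorageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StorageExample")
      .master("local[*]")
      .getOrCreate()

    // Each path is illustrative and needs the corresponding connector/credentials.
    val fromHdfs  = spark.read.text("hdfs://namenode:9000/data/events.log")
    val fromS3    = spark.read.parquet("s3a://my-bucket/warehouse/sales.parquet")
    val fromLocal = spark.read.csv("file:///tmp/sample.csv")

    println(s"Rows read: ${fromHdfs.count() + fromS3.count() + fromLocal.count()}")
    spark.stop()
  }
}
```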

Which cluster managers does Apache Spark support?

Apache Spark comes with its own standalone cluster manager, which is well suited to small deployments. It can also use Hadoop YARN or Apache Mesos as its cluster manager.
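
The cluster manager is selected through the master URL passed to the application; a sketch of the common options (host names and ports are illustrative):

```scala
import org.apache.spark.SparkConf

object ClusterManagerConfigs {
  // The master URL determines which cluster manager the application uses.
  val local      = new SparkConf().setAppName("app").setMaster("local[*]")             // single machine, all cores
  val standalone = new SparkConf().setAppName("app").setMaster("spark://master:7077")  // Spark's standalone manager
  val mesos      = new SparkConf().setAppName("app").setMaster("mesos://mesos:5050")   // Apache Mesos
  val yarn       = new SparkConf().setAppName("app").setMaster("yarn")                 // Hadoop YARN
}
```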

Why is Apache Hadoop bundled with the Apache Spark distribution?

Apache Spark supports Hadoop YARN and Apache Mesos as cluster managers, and it depends on Hadoop client libraries (primarily for YARN and HDFS support). You can also download a Spark build without the bundled Hadoop libraries and point it to an existing Hadoop installation on the same machine as Apache Spark.

