RDDs
A Resilient Distributed Dataset, also known as an RDD, is Spark’s primary data abstraction.
An RDD is a fault-tolerant collection of elements partitioned across the cluster’s nodes that can be operated on in parallel. Additionally, RDDs are immutable, meaning they cannot be changed once created.
- You can create an RDD from an external or local Hadoop-supported file, from a collection, or from another RDD (see the sketch after this list). RDDs are immutable and always recoverable, providing resilience in Apache Spark. RDDs can persist or cache datasets in memory across operations, which speeds up iterative operations in Spark.
- Every Spark application consists of a driver program that runs the user’s main function and executes multiple parallel operations on a cluster.
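As a minimal sketch of these creation paths, assuming a local PySpark installation (the app name and file path below are illustrative placeholders, not part of the original material):

```python
from pyspark import SparkContext

# Assumes a local Spark install; "RDDBasics" is just an illustrative app name.
sc = SparkContext("local[*]", "RDDBasics")

# Create an RDD from an in-memory collection, partitioned across the cluster.
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Create an RDD from a Hadoop-supported file (hypothetical path).
# lines = sc.textFile("hdfs:///data/sample.txt")

# Create a new RDD from an existing one; the original RDD stays unchanged (immutability).
doubled = numbers.map(lambda x: x * 2)

print(doubled.collect())  # [2, 4, 6, 8, 10]
sc.stop()
```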
Lazy Transformation
The first RDD operation to explore is a Transformation. An RDD transformation creates a new RDD from an existing RDD.
Transformations in Spark are considered lazy because Spark does not compute transformation results immediately. Instead, the results are only computed when evaluated by “actions.”
- To evaluate a transformation in Spark, you use an action. The action returns a value to the driver program after running a computation.
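A short PySpark sketch of lazy evaluation; the data and names are illustrative, not from the original material:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "LazyEval")
rdd = sc.parallelize(range(10))

# Transformation: nothing is computed yet; Spark only records the operation.
squared = rdd.map(lambda x: x * x)

# Action: this triggers the computation and returns a value to the driver program.
total = squared.reduce(lambda a, b: a + b)
print(total)  # 285

sc.stop()
```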
DAG
But how do transformations and actions happen? Spark uses a data structure called a Directed Acyclic Graph, known as a DAG, and an associated DAG Scheduler to perform RDD operations.
- Think of a DAG as a graphical structure composed of edges and vertices.
- The term “acyclic” means the graph contains no cycles; new edges only originate from existing vertices and never loop back.
- In general, the vertices and edges are traversed in a sequential order that reflects the order of the operations.
- The vertices represent RDDs, and the edges represent the transformations or actions.
The DAG Scheduler uses this graph structure to schedule the tasks that perform the RDD transformations.
So, why does Spark use DAGs? DAGs help enable fault tolerance.
When a node goes down, Spark uses the lineage recorded in the DAG to recompute the lost partitions and restore the data.
Let’s look at the process in more detail.
- First, Spark creates a DAG when an RDD is created.
- Next, Spark enables the DAG Scheduler to perform a transformation and updates the DAG.
- The DAG now points to the new RDD.
- A pointer to the transformed RDD is returned to the Spark driver program.
- And, if there is an action, the driver program evaluates the DAG only when that action is called, which is when Spark runs the computation.
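One way to peek at the lineage Spark records in the DAG is the RDD’s toDebugString() method. The sketch below assumes PySpark and uses illustrative data:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "DAGLineage")

words = sc.parallelize(["spark", "rdd", "dag", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() shows the lineage (the chain of parent RDDs in the DAG)
# that Spark would replay to recompute lost partitions.
print(counts.toDebugString().decode("utf-8"))

sc.stop()
```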
Transformations
Here are some examples of RDD transformations:
| Transformation | Description |
|---|---|
| map(func) | Returns a new distributed dataset formed by passing each element of the source through the function func |
| filter(func) | Returns a new dataset formed by selecting those elements of the source on which func returns true |
| distinct([numTasks]) | Returns a new dataset that contains the distinct elements of the source dataset |
| flatMap(func) | Similar to map(func), but each input item can be mapped to zero or more output items, so func should return a Seq rather than a single item |
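A brief PySpark sketch of these four transformations; the sample data is illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "TransformationExamples")

rdd = sc.parallelize([1, 2, 2, 3, 4, 5])

mapped = rdd.map(lambda x: x * 10)           # [10, 20, 20, 30, 40, 50]
filtered = rdd.filter(lambda x: x % 2 == 0)  # [2, 2, 4]
unique = rdd.distinct()                      # [1, 2, 3, 4, 5] (order may vary)

lines = sc.parallelize(["hello spark", "hello rdd"])
words = lines.flatMap(lambda line: line.split(" "))  # ["hello", "spark", "hello", "rdd"]

# collect() is an action; it forces evaluation of the lazy transformations above.
print(mapped.collect(), filtered.collect(), unique.collect(), words.collect())
sc.stop()
```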
Actions
Here are some examples of RDD actions:
| Action | Description |
|---|---|
| reduce(func) | Aggregates the elements of the dataset using the function func, which takes two arguments and returns one; func must be commutative and associative so it can be computed correctly in parallel |
| take(n) | Returns an array with the first n elements of the dataset |
| collect() | Returns all the elements of the dataset as an array; make sure the data fits in the driver program’s memory, and avoid this on large datasets because it pulls all the data to the driver |
| takeOrdered(n, key=func) | Returns the first n elements, in ascending order or as specified by the optional key function |
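A brief PySpark sketch of these actions; the sample data is illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "ActionExamples")

rdd = sc.parallelize([5, 3, 1, 4, 2])

print(rdd.reduce(lambda a, b: a + b))        # 15 (addition is commutative and associative)
print(rdd.take(3))                           # [5, 3, 1] (first 3 elements, no ordering applied)
print(rdd.collect())                         # [5, 3, 1, 4, 2] (pulls everything to the driver)
print(rdd.takeOrdered(3))                    # [1, 2, 3]
print(rdd.takeOrdered(3, key=lambda x: -x))  # [5, 4, 3]

sc.stop()
```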