Advanced Spark for Developers

This advanced course provides a comprehensive understanding of Apache Spark’s internal structure, focusing on Spark Core (RDD), Spark SQL, and Spark Streaming. Students will learn to optimize Spark jobs, manage resource allocation, and work with connectors for enhanced data processing.

28 hours
English
Online

Description

The Advanced Spark for Developers course equips participants with an in-depth understanding of Apache Spark’s architecture, including Spark Core (RDD), Spark SQL, Spark Streaming, and Spark Structured Streaming. Trainees will explore Spark’s deployment and execution on clusters, resource management, and the Catalyst optimizer and Tungsten format. Through practical modules, students will learn optimization techniques for RDD, SQL, and streaming jobs, explore integration with external systems like Cassandra and Kafka, and implement best practices for testing and debugging. A preliminary module on Scala syntax prepares participants for advanced Spark development, making this course ideal for developers seeking to build efficient, high-performance applications on Spark clusters.

The course starts with a quick overview of essential Scala features to ensure all participants are prepared for Spark development. Next, students dive into Spark’s foundational components—RDDs, DataFrames, and DataSets—learning to create and manipulate distributed data efficiently. Modules cover core concepts like transformations, actions, and the Catalyst optimizer, enabling participants to develop optimized, scalable Spark applications.

Optimization is a key focus, with detailed discussions on optimizing RDDs, Spark SQL queries, and streaming jobs. Additionally, students will learn how to integrate Spark with external data sources and work with connectors for formats like Parquet, ORC, and Delta, as well as MPP databases and message brokers. Modules on Spark’s cluster and resource management delve into dynamic resource allocation and executor management. The course concludes with an introduction to Spark Streaming, where students learn to process streaming data using stateful transformations and Kafka integration.

Upon completion of the course, participants will be able to:

Understand Spark’s internal architecture and deployment on various cluster managers (Standalone, YARN, Mesos).
Develop optimized Spark jobs using RDDs, DataFrames, DataSets, and Structured Streaming.
Integrate Spark with external data sources and connectors, such as Cassandra and Kafka.
Implement testing strategies for Spark jobs and apply best practices in CI/CD workflows.
Manage and optimize resource allocation on Spark clusters for high performance.

The course balances theory (50%) and hands-on practice (50%), allowing participants to apply Spark optimizations and integration techniques in real-world scenarios. Practical labs cover the full spectrum of Spark’s capabilities, from job optimization to stream processing.

Roadmap

Module 0 - Scala in one day (Theory 2 h, practice 1.5 h)

1. Examine Scala features used in the Spark framework

2. Theory:

var and val, val (x, x), lazy val, transient lazy val
type and Type: Nil, None, Null => null, Nothing, Unit => (), Any, AnyRef, AnyVal, String, interpolation
class, object (case), abstract class, trait
Scala function, methods, lambda
Generic, ClassTag, covariant, contravariant, invariant position, F[_], *
Pattern matching and if then else construction
Mutable and immutable collections, Iterator, collection operations
Monads (Option, Either, Try, Future, ...), Try().recover
map, flatMap, foreach, for comprehension
Implicits, private[sql], package
Scala sbt, assembly
Encoder, Product
Scala libs for Spark: scopt, chimney, jsoniter
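
Several of the Module 0 topics can be seen together in a few lines. The following is a minimal sketch, not course material: the `Order` case class and the function names are hypothetical and chosen only for illustration.

```scala
import scala.util.Try

object ScalaWarmup {
  // Hypothetical record type. Case classes extend Product, which Spark's
  // built-in Encoders rely on when deriving a Dataset schema.
  final case class Order(id: Int, amount: Double)

  // A lazy val is initialised on first access, not at construction time.
  lazy val defaultOrder: Order = Order(0, 0.0)

  // Try captures the exception; recover maps a matching failure to a fallback value.
  def parseAmount(raw: String): Double =
    Try(raw.trim.toDouble).recover { case _: NumberFormatException => 0.0 }.get

  // A for comprehension over Option desugars to flatMap and map.
  def total(a: Option[Order], b: Option[Order]): Option[Double] =
    for {
      x <- a
      y <- b
    } yield x.amount + y.amount

  // Pattern matching on a case class, with a guard on the first case.
  def describe(order: Order): String = order match {
    case Order(_, amount) if amount > 100 => "large"
    case Order(id, _)                     => s"regular order $id"
  }
}
```

For example, `ScalaWarmup.total(Some(ScalaWarmup.Order(1, 1.0)), None)` yields `None`, because the for comprehension short-circuits on the first empty Option.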

Module 1 – RDD (Theory 2 h, practice 1.5 h)

1. Theory RDD API:

RDD creation API: from array, from file, from DS
RDD base operations: map, flatMap, filter, reduceByKey, sort
Time parse libs

2. Theory RDD under the hood:

Iterator + mapPartitions()
RDD creating path: compute() and getPartitions()
Partitions
Partitioner: Hash and Range
Dependencies: wide and narrow
Joins: inner, cogroup, join without shuffle
Query Plan
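
The Module 1 topics can be illustrated with a short sketch. It assumes Spark is on the classpath and that the caller supplies the RDDs; the object and function names are hypothetical, and the partition count of 4 is arbitrary.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object RddSketch {
  // flatMap, filter and map are narrow transformations;
  // reduceByKey adds a shuffle (a wide dependency).
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  // mapPartitions hands the function one Iterator per partition,
  // so any setup inside it runs once per partition rather than once per row.
  def partitionSizes(lines: RDD[String]): Array[Int] =
    lines.mapPartitions(it => Iterator(it.size)).collect()

  // Each partitionBy shuffles its input once. Because both sides then share
  // the same partitioner, the join itself does not need another shuffle.
  def coPartitionedJoin(
      left: RDD[(String, Int)],
      right: RDD[(String, Int)]): RDD[(String, (Int, Int))] = {
    val partitioner = new HashPartitioner(4)
    left.partitionBy(partitioner).join(right.partitionBy(partitioner))
  }
}
```

Calling `toDebugString` on any of the returned RDDs prints the lineage, which is one way to see where the stage boundaries fall.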

Module 2 - DataFrame & DataSet, Spark DSL & Spark SQL (Theory 2 h, practice 1.5 h)

1. Theory DataFrame, DataSet api:

Creating DataFrame: from memory, from file (HDFS, S3, FS) (Avro, ORC, Parquet)
Spark DSL: Join broadcast, grouped operations
Spark SQL: Window functions, single partitions
Scala UDF problem-solving
Spark catalog

2. Recreate code using plans

Catalyst Optimizer: Logical & Physical plans
Codegen
Persist vs Cache vs Checkpoint
Creating DataFrame Path
Row vs InternalRow
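
A broadcast join combined with a window function is a typical case where reading the plan pays off. The sketch below is illustrative only: the column names `borough_id` and `amount`, and the function name, are assumptions and do not come from the course datasets.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{broadcast, col, row_number}

object DataFrameSketch {
  // Returns the highest-amount order per borough.
  // Assumes both inputs have a borough_id column and orders has an amount column.
  def topOrderPerBorough(orders: DataFrame, boroughs: DataFrame): DataFrame = {
    // broadcast() hints that the small side should be shipped to every executor,
    // which avoids shuffling the large side for the join.
    val joined = orders.join(broadcast(boroughs), Seq("borough_id"))

    // Rank orders within each borough by descending amount.
    val byBorough = Window.partitionBy("borough_id").orderBy(col("amount").desc)

    joined
      .withColumn("rn", row_number().over(byBorough))
      .where(col("rn") === 1)
      .drop("rn")
  }
}
```

Calling `.explain(true)` on the result prints the parsed, analyzed, and optimized logical plans followed by the physical plan, so the effect of the broadcast hint can be checked directly.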

Module 3 - Spark optimization (Theory 2 h, practice 1.5 h)

Compare the speed and size of RDD, DataFrame, and DataSet
Compare crime counting: sort-merge join, broadcast join, Bloom filter
Resolve problems with a skewed join
Build UDF for Python and Scala
UDF Problems
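
One common way to handle a skewed join is key salting. The sketch below is one possible approach under stated assumptions, not necessarily the technique taught in the module; the function name and the `salt` column are hypothetical, and the number of buckets would have to be tuned per dataset.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit, rand}

object SkewSketch {
  // Joins a skewed large side with a small side on `key`,
  // spreading each hot key over `buckets` sub-keys.
  def saltedJoin(large: DataFrame, small: DataFrame, key: String, buckets: Int): DataFrame = {
    // Large side: assign each row a random salt in [0, buckets).
    val saltedLarge = large.withColumn("salt", (rand() * buckets).cast("int"))

    // Small side: replicate every row once per salt value so each salted key finds its match.
    val allSalts = array((0 until buckets).map(i => lit(i)): _*)
    val saltedSmall = small.withColumn("salt", explode(allSalts))

    saltedLarge.join(saltedSmall, Seq(key, "salt")).drop("salt")
  }
}
```

The trade-off is that the small side grows by a factor of `buckets`. On Spark 3.x, adaptive query execution can also split skewed partitions automatically (`spark.sql.adaptive.skewJoin.enabled`), which may make manual salting unnecessary.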

Module 4 - External and Connectors (Theory 2 h, practice 1.5 h)

How to read/write data from file storages (HDFS, S3, FTP, FS)
What data format to choose (Json, CSV, Avro, Orc, Parquet, Delta, ... )
How to parallelize reading/writing to JDBC
How to create a DataFrame from MPP (Cassandra, Vertica, Greenplum)
How to work with Kafka
How to write your own connectors
Write a UDF for joining with Cassandra
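
Parallel JDBC reads are driven by four options of Spark's built-in `jdbc` source. The sketch below only assembles those options; the object and function names are hypothetical, and a matching JDBC driver is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcSketch {
  // Builds the options for a partitioned read. Spark splits the range
  // [lower, upper] of `column` into `partitions` strides and issues one query per stride.
  // The bounds only shape the strides; rows outside them are still read.
  def partitionedReadOptions(
      url: String,
      table: String,
      column: String,
      lower: Long,
      upper: Long,
      partitions: Int): Map[String, String] =
    Map(
      "url" -> url,
      "dbtable" -> table,
      "partitionColumn" -> column,
      "lowerBound" -> lower.toString,
      "upperBound" -> upper.toString,
      "numPartitions" -> partitions.toString
    )

  // Performs the read; requires a reachable database.
  def read(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.read.format("jdbc").options(options).load()
}
```

The partition column should be numeric, date, or timestamp, and ideally evenly distributed; otherwise a few partitions end up with most of the rows.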

Module 5 – Testing (Theory 2 h, practice 1.5 h)

1. Write a test for data marts written in module (Exercise: find popular time for orders, find the most popular boroughs for orders, find distance distribution for orders grouped by boroughs)

2. Theory:

Unit testing
Code review
QA
CI/CD
Problems
Libs which solve these problems
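
Keeping each data mart as a pure function from DataFrame to DataFrame makes it testable against a local SparkSession. The sketch below assumes a hypothetical `pickup_ts` timestamp column for the "popular time for orders" exercise; the real dataset schema may differ.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hour}

object OrdersMart {
  // Counts orders per hour of day, most popular hour first.
  // Assumes `orders` has a timestamp column named pickup_ts (hypothetical).
  def popularHours(orders: DataFrame): DataFrame =
    orders
      .groupBy(hour(col("pickup_ts")).as("hour"))
      .count()
      .orderBy(col("count").desc)
}
```

A test can then build a handful of rows in memory with `master("local[1]")`, call `OrdersMart.popularHours`, and assert on the first row, with no cluster or external storage involved.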

Module 6 - Spark Cluster (Theory 2 h, practice 1.5 h)

Build config with allocation
Compare several workers
Dynamic Resource Allocation
Manually managing executors at runtime
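
Dynamic resource allocation is switched on through configuration. The sketch below uses standard Spark property names; the function name and the concrete core and memory values are placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf

object AllocationSketch {
  // Builds a configuration that lets Spark scale executors between the given bounds.
  def dynamicAllocationConf(minExecutors: Int, maxExecutors: Int): SparkConf =
    new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", minExecutors.toString)
      .set("spark.dynamicAllocation.maxExecutors", maxExecutors.toString)
      // Shuffle tracking (Spark 3.0+) allows executors to be released
      // without running an external shuffle service.
      .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      // Placeholder sizing; tune per cluster.
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "4g")
}
```

For manual control at runtime, `SparkContext` exposes the developer-API methods `requestExecutors` and `killExecutors`; whether they take effect depends on the cluster manager in use.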

Module 7 - Spark streaming (Theory 2 h, practice 1.5 h)

[Solve problem with Cassandra writing](src/main/scala/mod4connectors/DataSetsWithCassandra.scala)
Build a Spark Structured Streaming job reading from Kafka
Build a Spark Structured Streaming job using state
Build a Spark Structured Streaming job writing to Cassandra
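
Reading from Kafka and aggregating with state might look like the sketch below. It assumes the `spark-sql-kafka-0-10` package is on the classpath; the object name, function names, window size, and watermark delay are illustrative choices rather than course specifics.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object KafkaStreamSketch {
  // Subscribes to a topic and exposes key, value and the Kafka record timestamp.
  def readTopic(spark: SparkSession, bootstrapServers: String, topic: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

  // Stateful aggregation: counts events per key in 5-minute tumbling windows.
  // The watermark bounds how long state for late data is retained.
  def countsPerWindow(events: DataFrame): DataFrame =
    events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"), col("key"))
      .count()
}
```

`countsPerWindow` also runs on a batch DataFrame, where the watermark is ignored, which makes the aggregation logic testable without a broker. Writing the result to Cassandra would typically go through `writeStream.foreachBatch` with a Cassandra connector, which is not shown here.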

Related courses

The "Apache Spark Fundamentals" training course delivers key concepts and methods for developing data processing applications with Apache Spark.

Explore modern data management techniques with our "Modern Data Management Approaches in Real World Case" course. Learn through real-world examples, including handling 24M gaming cards, and gain hands-on experience with cutting-edge technologies like MongoDB, Spark Streaming, Cassandra, and distributed file systems. Perfect for data professionals looking to solve complex data challenges.

Unlock the power of big data analytics with "BigData SQL Hive." This course dives deep into Apache Hive, covering everything from architecture and data types to complex queries, transactions, and performance tuning. Perfect for data professionals looking to enhance their SQL skills in a big data environment.
