Advanced Spark for Developers

This advanced course provides a comprehensive understanding of Apache Spark’s internal structure, focusing on Spark Core (RDD), Spark SQL, and Spark Streaming. Students will learn to optimize Spark jobs, manage resource allocation, and work with connectors for enhanced data processing.

  • Duration: 28 hours
  • Location: Online
  • Language: English
  • Code: EAS-024
  • Price: € 700 *

Available sessions

To be determined




Description

The Advanced Spark for Developers course equips participants with an in-depth understanding of Apache Spark’s architecture, including Spark Core (RDD), Spark SQL, Spark Streaming, and Spark Structured Streaming. Trainees will explore Spark’s deployment and execution on clusters, resource management, and the Catalyst optimizer and Tungsten format. Through practical modules, students will learn optimization techniques for RDD, SQL, and streaming jobs, explore integration with external systems like Cassandra and Kafka, and implement best practices for testing and debugging. A preliminary module on Scala syntax prepares participants for advanced Spark development, making this course ideal for developers seeking to build efficient, high-performance applications on Spark clusters.

 

The course starts with a quick overview of essential Scala features to ensure all participants are prepared for Spark development. Next, students dive into Spark’s foundational components—RDDs, DataFrames, and DataSets—learning to create and manipulate distributed data efficiently. Modules cover core concepts like transformations, actions, and the Catalyst optimizer, enabling participants to develop optimized, scalable Spark applications.

 

Optimization is a key focus, with detailed discussions on optimizing RDDs, Spark SQL queries, and streaming jobs. Additionally, students will learn how to integrate Spark with external data sources and work with connectors for formats like Parquet, ORC, and Delta, as well as MPP databases and message brokers. Modules on Spark’s cluster and resource management delve into dynamic resource allocation and executor management. The course concludes with an introduction to Spark Streaming, where students learn to process streaming data using stateful transformations and Kafka integration.

 

Upon completion of the course, participants will be able to:

  • Understand Spark’s internal architecture and deployment on various cluster managers (Standalone, YARN, Mesos).
  • Develop optimized Spark jobs using RDDs, DataFrames, DataSets, and Structured Streaming.
  • Integrate Spark with external data sources and connectors, such as Cassandra and Kafka.
  • Implement testing strategies for Spark jobs and apply best practices in CI/CD workflows.
  • Manage and optimize resource allocation on Spark clusters for high performance.

 

The course balances theory (50%) and hands-on practice (50%), allowing participants to apply Spark optimizations and integration techniques in real-world scenarios. Practical labs cover the full spectrum of Spark’s capabilities, from job optimization to stream processing.

After completing the course, participants receive a certificate issued by Luxoft Training.

Objectives

  • Understand Spark’s internal structure;
  • Understand the deployment, configuration, and execution of Spark components on various clusters (Standalone, YARN, Mesos);
  • Optimize RDD-based Spark jobs;
  • Optimize Spark SQL jobs;
  • Optimize Microbatch and Structured Streaming jobs.

Target Audience

  • Developers
  • Architects

Prerequisites

At least three months of development experience with Apache Spark in Java or Scala.


Roadmap

Module 0 - Scala in one day (Theory 2 h, practice 1.5 h)

1. Examine the Scala features used in the Spark framework (a short sketch follows the topic list below)

2. Theory:

  1. var and val, tuple destructuring (val (x, y)), lazy val, @transient lazy val
  2. type and Type; Nil, None, Null vs null, Nothing, Unit vs (), Any, AnyRef, AnyVal; String interpolation
  3. class, object, case class/object, abstract class, trait
  4. Scala functions, methods, lambdas
  5. Generics, ClassTag; covariant, contravariant, and invariant positions; F[_], *
  6. Pattern matching and if/else constructions
  7. Mutable and immutable collections, Iterator, collection operations
  8. Monads (Option, Either, Try, Future, ...), Try().recover
  9. map, flatMap, foreach, for comprehensions
  10. Implicits, private[sql], packages
  11. Scala sbt, sbt-assembly
  12. Encoder, Product
  13. Scala libraries for Spark: scopt, chimney, jsoniter
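
A minimal, self-contained sketch tying several of these features together (all names are illustrative, not part of the course materials):

```scala
object ScalaFeaturesDemo extends App {
  // val vs lazy val: the lazy body runs only on first access
  val immutable = 42
  lazy val expensive = { println("computed once"); math.Pi }
  println(expensive + expensive) // prints "computed once" a single time

  // case classes, sealed traits, and pattern matching
  sealed trait Shape
  case class Circle(radius: Double) extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }

  // Option as a monad: map/flatMap via a for comprehension
  val maybeRadius: Option[Double] = Some(2.0)
  val maybeArea: Option[Double] = for (r <- maybeRadius) yield area(Circle(r))
  println(maybeArea.getOrElse(0.0))
}
```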

Module 1 – RDD (Theory 2 h, practice 1.5 h)

1. Theory – RDD API:

  1. RDD creation API: from an array, from a file, from a Dataset
  2. RDD base operations: map, flatMap, filter, reduceByKey, sort
  3. Time-parsing libraries

2. Theory – RDD under the hood (illustrated in the sketch after this list):

  1. Iterator + mapPartitions()
  2. RDD creating path: compute() and getPartitions()
  3. Partitions
  4. Partitioner: Hash and Range
  5. Dependencies: wide and narrow
  6. Joins: inner, cogroup, join without shuffle
  7. Query Plan
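
A hypothetical word-count job illustrating the API above, including mapPartitions access to the per-partition Iterator (paths and names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object RddDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from a collection; sc.textFile(...) would read from a file instead
    val lines = sc.parallelize(Seq("a b a", "b c"), numSlices = 2)

    // map/flatMap/filter are narrow (no shuffle); reduceByKey is wide (shuffles by key)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // mapPartitions exposes the underlying Iterator of each partition
    val partSizes = counts.mapPartitions(it => Iterator(it.size))

    counts.sortByKey().collect().foreach(println)
    println("records per partition: " + partSizes.collect().mkString(","))
    spark.stop()
  }
}
```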

Module 2 - DataFrame & DataSet, Spark DSL & Spark SQL (Theory 2 h, practice 1.5 h)

1. Theory – DataFrame and DataSet API (a short sketch follows this list):

  1. Creating a DataFrame: from memory or from files (HDFS, S3, local FS) in formats such as Avro, ORC, Parquet
  2. Spark DSL: broadcast joins, grouped operations
  3. Spark SQL: window functions, single partitions
  4. Scala UDF problem solving
  5. Spark catalog
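
An illustrative sketch of the DSL features above; data and column names are made up:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().appName("dsl-demo").master("local[*]").getOrCreate()
import spark.implicits._

val orders = Seq((1, "NYC", 10.0), (2, "NYC", 20.0), (3, "LA", 5.0))
  .toDF("id", "city", "amount")
val cities = Seq(("NYC", "New York"), ("LA", "Los Angeles")).toDF("city", "full_name")

// Broadcast join: hint the small side so Spark ships it to every executor
// instead of shuffling both sides
val joined = orders.join(F.broadcast(cities), Seq("city"))

// Window function: rank orders by amount within each city
val w = Window.partitionBy("city").orderBy(F.col("amount").desc)
joined.withColumn("rank", F.row_number().over(w)).show()
```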

2. Recreating code using plans (see the sketch after this list):

  1. Catalyst Optimizer: logical & physical plans
  2. Codegen
  3. Persist vs Cache vs Checkpoint
  4. Creating DataFrame path
  5. Row vs InternalRow
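
A minimal sketch of inspecting plans and the persist/cache/checkpoint trade-offs (data and paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("plans-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "tag").filter($"id" > 1)

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan;
// whole-stage codegen operators are marked with '*'
df.explain(true)

// For Datasets, cache() is shorthand for persist(StorageLevel.MEMORY_AND_DISK)
val cached = df.persist(StorageLevel.MEMORY_ONLY)

// checkpoint() materializes the data and truncates the lineage
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // placeholder path
val checkpointed = cached.checkpoint()
```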

Module 3 - Spark optimization (Theory 2 h, practice 1.5 h)

  1. Compare speed and size across RDD, DataFrame, and DataSet
  2. Compare crime-counting approaches: SortMergeJoin, broadcast join, Bloom filter
  3. Resolve problems with a skewed join (a salting sketch follows this list)
  4. Build UDFs for Python and Scala
  5. UDF problems
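
One common way to handle a skewed join is key salting: spread the hot keys of the large side across N buckets and replicate the small side N times. A sketch under assumed column names:

```scala
import org.apache.spark.sql.{DataFrame, functions => F}

// Joins `large` and `small` on `key`, spreading each key over `saltBuckets`
// sub-keys so no single task receives the whole hot key.
def saltedJoin(large: DataFrame, small: DataFrame,
               key: String, saltBuckets: Int): DataFrame = {
  // Large side: assign each row a random bucket
  val saltedLarge = large.withColumn("salt", (F.rand() * saltBuckets).cast("int"))
  // Small side: replicate every row into all buckets
  val saltedSmall = small.withColumn(
    "salt", F.explode(F.array((0 until saltBuckets).map(F.lit): _*)))
  saltedLarge.join(saltedSmall, Seq(key, "salt")).drop("salt")
}
```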

Module 4 - External and Connectors (Theory 2 h, practice 1.5 h)

  1. How to read/write data from file storages (HDFS, S3, FTP, local FS)
  2. Which data format to choose (JSON, CSV, Avro, ORC, Parquet, Delta, ...)
  3. How to parallelize reading from/writing to JDBC (see the sketch after this list)
  4. How to create a DataFrame from MPP databases (Cassandra, Vertica, Greenplum)
  5. How to work with Kafka
  6. How to write your own connectors
  7. Write a UDF for joining with Cassandra
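
A sketch of two common external-source patterns; URLs, credentials, table and topic names are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("connectors-demo").master("local[*]").getOrCreate()

// Parallelized JDBC read: partitionColumn/lowerBound/upperBound/numPartitions
// split the scan into numPartitions range queries executed in parallel
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop") // hypothetical database
  .option("dbtable", "public.orders")
  .option("user", "reader")
  .option("password", "secret")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

// Batch read from Kafka (needs the spark-sql-kafka-0-10 connector on the classpath)
val kafkaDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "orders")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```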

Module 5 – Testing (Theory 2 h, practice 1.5 h)

1. Write tests for the data marts built earlier in the course (exercise: find the most popular order times, find the most popular boroughs for orders, find the distance distribution of orders grouped by borough); a minimal test sketch follows the theory list below

2. Theory:

  1. Unit testing
  2. Code review
  3. QA
  4. CI/CD
  5. Problems
  6. Libraries that solve these problems
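
A minimal unit-test sketch using ScalaTest with a local SparkSession (one common setup; the course may use a different library, and the transformation under test is hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class DataMartSpec extends AnyFunSuite {
  // local[2] keeps the test self-contained on the build machine
  private lazy val spark = SparkSession.builder()
    .appName("data-mart-test").master("local[2]").getOrCreate()

  test("orders are counted per borough") {
    import spark.implicits._
    val input = Seq(("Bronx", 1), ("Bronx", 2), ("Queens", 3))
      .toDF("borough", "order_id")

    val result = input.groupBy("borough").count()
      .collect().map(r => r.getString(0) -> r.getLong(1)).toMap

    assert(result == Map("Bronx" -> 2L, "Queens" -> 1L))
  }
}
```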

Module 6 - Spark Cluster (Theory 2 h, practice 1.5 h)

  1. Build a config with resource allocation (see the sketch after this list)
  2. Compare several workers
  3. Dynamic resource allocation
  4. Manually manage executors at runtime
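
A sketch of enabling dynamic allocation from code; the values are illustrative, not recommendations, and the manual calls below only take effect on a real cluster manager:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dyn-alloc-demo")
  // Let Spark grow and shrink the executor pool with the workload
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  // Track shuffle files on executors when no external shuffle service is available
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .getOrCreate()

// Manual executor management via the SparkContext developer API
spark.sparkContext.requestExecutors(4)
spark.sparkContext.killExecutors(Seq("1", "2")) // executor IDs are placeholders
```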

Module 7 - Spark streaming (Theory 2 h, practice 1.5 h)

  1. Solve problems with writing to Cassandra
  2. Build a Structured Streaming job reading from Kafka (see the sketch after this list)
  3. Build a Structured Streaming job using state
  4. Build a Structured Streaming job writing to Cassandra
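
A sketch of a stateful Structured Streaming job reading from Kafka; broker, topic, and the running-count logic are assumptions for illustration (a Cassandra sink would typically use foreachBatch instead of the console):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder().appName("stream-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "orders")                    // hypothetical topic
  .load()
  .selectExpr("CAST(key AS STRING) AS k")
  .as[String]

// Keep a running count per key across micro-batches
def updateCount(key: String, values: Iterator[String],
                state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + values.size
  state.update(newCount)
  (key, newCount)
}

val counts = events.groupByKey(identity)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(updateCount)

counts.writeStream
  .outputMode(OutputMode.Update()) // mapGroupsWithState requires Update mode
  .format("console")
  .start()
  .awaitTermination()
```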

