Advanced Spark for Developers
This advanced course provides a comprehensive understanding of Apache Spark’s internal structure, focusing on Spark Core (RDD), Spark SQL, and Spark Streaming. Students will learn to optimize Spark jobs, manage resource allocation, and work with connectors for enhanced data processing.
The Advanced Spark for Developers course equips participants with an in-depth understanding of Apache Spark’s architecture, including Spark Core (RDD), Spark SQL, Spark Streaming, and Spark Structured Streaming. Trainees will explore Spark’s deployment and execution on clusters, resource management, and the Catalyst optimizer and Tungsten format. Through practical modules, students will learn optimization techniques for RDD, SQL, and streaming jobs, explore integration with external systems like Cassandra and Kafka, and implement best practices for testing and debugging. A preliminary module on Scala syntax prepares participants for advanced Spark development, making this course ideal for developers seeking to build efficient, high-performance applications on Spark clusters.
The course starts with a quick overview of essential Scala features to ensure all participants are prepared for Spark development. Next, students dive into Spark’s foundational components—RDDs, DataFrames, and DataSets—learning to create and manipulate distributed data efficiently. Modules cover core concepts like transformations, actions, and the Catalyst optimizer, enabling participants to develop optimized, scalable Spark applications.
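As a taste of those foundational APIs, here is a minimal sketch of the transformation/action distinction on both an RDD and a typed DataSet; the Order record and all values are illustrative, not taken from the course materials:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record type; not taken from the course materials.
case class Order(id: Long, borough: String, distanceKm: Double)

object BasicsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("basics-sketch")
      .master("local[*]") // local mode, enough for experimenting
      .getOrCreate()
    import spark.implicits._

    // RDD API: transformations (filter, map) are lazy; actions (count) run the job.
    val rdd = spark.sparkContext.parallelize(1 to 100)
    val evens = rdd.filter(_ % 2 == 0).map(_ * 10) // nothing executes yet
    println(evens.count())                         // action: triggers execution

    // DataSet API: typed rows over the same engine, planned by Catalyst.
    val orders = Seq(Order(1, "Manhattan", 3.2), Order(2, "Brooklyn", 7.5)).toDS()
    orders.filter($"distanceKm" > 5.0).show()      // show() is also an action

    spark.stop()
  }
}
```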
Optimization is a key focus, with detailed discussions on optimizing RDDs, Spark SQL queries, and streaming jobs. Additionally, students will learn how to integrate Spark with external data sources and work with connectors for formats like Parquet, ORC, and Delta, as well as MPP databases and message brokers. Modules on Spark’s cluster and resource management delve into dynamic resource allocation and executor management. The course concludes with an introduction to Spark Streaming, where students learn to process streaming data using stateful transformations and Kafka integration.
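For example, two of the optimizations discussed, caching a reused DataFrame and hinting a broadcast join, might look like the following minimal sketch; table paths and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object OptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val orders = spark.read.parquet("/data/orders")     // large fact table (placeholder path)
    val boroughs = spark.read.parquet("/data/boroughs") // small dimension table

    val cached = orders.cache() // reused below, so keep it in memory after the first action

    // broadcast() hints Catalyst to ship the small table to every executor,
    // turning a shuffle join into a map-side hash join.
    val joined = cached.join(broadcast(boroughs), Seq("borough_id"))

    joined.groupBy("borough_name").count().show()
    cached.unpersist()
  }
}
```

Broadcast joins pay off only when one side is small enough to fit on each executor; for two large tables, a shuffle join remains the right choice.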
Upon completion of the course, participants will be able to:
- optimize RDD, Spark SQL, and streaming jobs;
- integrate Spark with external systems such as Cassandra and Kafka, and work with connectors for Parquet, ORC, and Delta;
- manage cluster resources, including dynamic resource allocation and executor management;
- test and debug Spark applications;
- process streaming data using stateful transformations.
The course balances theory (50%) and hands-on practice (50%), allowing participants to apply Spark optimizations and integration techniques in real-world scenarios. Practical labs cover the full spectrum of Spark’s capabilities, from job optimization to stream processing.
Prerequisites: at least 3 months of development experience with Apache Spark in Java or Scala.
Module 0 - Scala in one day (Theory 2 h, practice 1.5 h)
1. Examine Scala features used in the Spark framework
2. Theory:
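The sketch below illustrates the kind of Scala constructs Spark development leans on: case classes (which also back DataSet encoders), higher-order functions, and pattern matching. All names and values are illustrative:

```scala
// Illustrative Scala-only sketch; no Spark dependency needed here.
case class Trip(borough: String, distanceKm: Double)

object ScalaFeaturesSketch {
  def main(args: Array[String]): Unit = {
    val trips = List(Trip("Queens", 2.1), Trip("Bronx", 12.4))

    // Higher-order functions: the map/filter style that Spark's API mirrors.
    val longTrips = trips.filter(_.distanceKm > 10.0).map(_.borough)

    // Pattern matching: used constantly when deconstructing rows and options.
    trips.headOption match {
      case Some(Trip(b, d)) => println(s"first trip: $b, $d km")
      case None             => println("no trips")
    }

    println(longTrips) // List(Bronx)
  }
}
```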
Module 1 - RDD (Theory 2 h, practice 1.5 h)
2. Theory: RDD under the hood
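A minimal sketch of these internals: partitioning, narrow versus wide transformations, and lineage inspection, using synthetic data:

```scala
import org.apache.spark.sql.SparkSession

object RddInternalsSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[4]").getOrCreate().sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)
    println(words.getNumPartitions) // 4: one task per partition

    val pairs = words.map(w => (w, 1))     // narrow transformation: no data movement
    val counts = pairs.reduceByKey(_ + _)  // wide: shuffles, but pre-aggregates per
                                           // partition, unlike groupByKey
    println(counts.toDebugString)          // lineage, with the shuffle boundary visible
    counts.collect().foreach(println)      // action: runs the whole lineage
  }
}
```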
Module 2 - DataFrame & DataSet, Spark DSL & Spark SQL (Theory 2 h, practice 1.5 h)
1. Theory: DataFrame and DataSet APIs
2. Recreate code from its query plans
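A sketch of the plan-reading workflow behind this exercise: the DSL and SQL front ends compile to the same Catalyst plan, which explain(true) prints in full. The record type and names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative record type.
case class Ride(borough: String, distanceKm: Double)

object PlansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val rides = Seq(Ride("Manhattan", 3.2), Ride("Brooklyn", 7.5)).toDS()
    rides.createOrReplaceTempView("rides")

    val viaDsl = rides.groupBy($"borough").count()
    val viaSql = spark.sql("SELECT borough, COUNT(*) AS count FROM rides GROUP BY borough")

    // Both front ends compile to the same optimized plan; explain(true)
    // prints the parsed, analyzed, optimized, and physical plans.
    viaDsl.explain(true)
    viaSql.explain(true)
  }
}
```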
Module 3 - Spark optimization (Theory 2 h, practice 1.5 h)
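As one illustration of the techniques covered here, alongside the RDD, SQL, and streaming optimizations described above, this is a minimal sketch of partition-level tuning; the config values are illustrative starting points, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

object PartitionTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.shuffle.partitions", "64") // default 200; tune to data volume
      .config("spark.sql.adaptive.enabled", "true") // let AQE coalesce small partitions
      .getOrCreate()

    val df = spark.range(1000000L).toDF("id")

    val wide = df.repartition(64, df("id")) // full shuffle: redistributes rows by key
    val fewer = wide.coalesce(8)            // no shuffle: merges partitions before writing
    fewer.write.mode("overwrite").parquet("/tmp/tuned") // placeholder output path
  }
}
```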
Module 4 - External and Connectors (Theory 2 h, practice 1.5 h)
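A minimal sketch of the connector APIs this module names. Paths, URLs, and credentials are placeholders; Delta requires the delta-spark package and its session extensions, and JDBC requires the vendor's driver on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object ConnectorsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Columnar file formats share the same read/write API.
    val fromParquet = spark.read.parquet("/data/orders.parquet") // placeholder paths
    val fromOrc = spark.read.orc("/data/orders.orc")

    // Delta: the same DataFrame API with a different format string.
    fromParquet.write.format("delta").mode("overwrite").save("/data/orders_delta")

    // JDBC: reading from a relational or MPP database.
    val fromDb = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics") // placeholder URL
      .option("dbtable", "public.orders")
      .option("user", "reader")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    fromDb.printSchema()
    fromOrc.printSchema()
  }
}
```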
Module 5 - Testing (Theory 2 h, practice 1.5 h)
1. Write a test for the data marts built in an earlier module (Exercise: find the most popular times for orders, find the most popular boroughs for orders, and find the distance distribution of orders grouped by borough)
2. Theory:
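A sketch of the testing pattern this module teaches, applied to a simplified version of the popular-boroughs mart from the exercise: run a local SparkSession, feed a tiny in-memory DataFrame through the logic under test, and assert on the result. ScalaTest is assumed here; the course may use a different framework:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, desc}
import org.scalatest.funsuite.AnyFunSuite

object Marts {
  // Simplified version of the "most popular boroughs" mart from the exercise.
  def popularBoroughs(orders: DataFrame): DataFrame =
    orders.groupBy("borough").agg(count("*").as("orders")).orderBy(desc("orders"))
}

class MartsSpec extends AnyFunSuite {
  private val spark = SparkSession.builder().master("local[2]").getOrCreate()
  import spark.implicits._

  test("popularBoroughs ranks boroughs by order count") {
    val input = Seq("Manhattan", "Manhattan", "Brooklyn").toDF("borough")
    val result = Marts.popularBoroughs(input).collect()
    assert(result.head.getString(0) == "Manhattan") // top borough
    assert(result.head.getLong(1) == 2L)            // its order count
  }
}
```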
Module 6 - Spark Cluster (Theory 2 h, practice 1.5 h)
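A minimal sketch of the dynamic allocation and executor settings this module covers; the keys are standard Spark configuration, the values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ClusterConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-sketch")
      // On a real cluster the master comes from spark-submit; local[*] just
      // lets this sketch run standalone.
      .master("local[*]")
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      // On Spark 3+ shuffle tracking can replace the external shuffle service:
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .getOrCreate()

    // The same keys can be passed to spark-submit with --conf instead.
    spark.range(1000L).count()
    spark.stop()
  }
}
```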
Module 7 - Spark Streaming (Theory 2 h, practice 1.5 h)
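A minimal sketch of a stateful Structured Streaming job over Kafka, the combination this module covers. The broker, topic, and window sizes are placeholders, and the spark-sql-kafka connector package must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "orders")                    // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS borough", "timestamp")

    // Stateful aggregation: counts per borough in 10-minute event-time windows;
    // the watermark bounds how long state is kept around for late data.
    val counts = events
      .withWatermark("timestamp", "15 minutes")
      .groupBy(window(col("timestamp"), "10 minutes"), col("borough"))
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```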