Databricks Fundamentals
This Databricks Fundamentals course will give participants a solid understanding of the internal structure and workings of Databricks, a leading big data processing platform.
Databricks is an increasingly popular platform for big data processing and analysis. Our Databricks Fundamentals course is a great way to start if you want to improve your skills in this area. You will acquire practical experience with important Databricks tools and ideas over the course of several modules, including writing queries in Scala, Python, and SQL, using Delta Lake / Parquet, and working with Notebooks.
One of the primary goals of the course is to make you more comfortable using Notebooks, Databricks’ web-based interface for data analysis and collaboration. With guidance from our trainer, you’ll learn how to efficiently build, manage, and share notebooks, allowing you to tackle complex data challenges.
Another important topic we will cover is Spark, the open-source engine that powers Databricks’ data processing capabilities. You will gain a deep understanding of Spark’s internal architecture, starting with the RDD (Resilient Distributed Dataset), which databricks.com describes as “an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated in parallel with high level API that offers transformations and actions.”
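The transformation/action distinction mentioned in that quote is easiest to see through laziness: transformations only describe a computation, and an action is what forces it to run. As a rough plain-Python analogy (this is not the Spark API; generators simply behave the same lazy way):

```python
# Plain-Python analogy for Spark's lazy RDD evaluation (not actual Spark code).
# "Transformations" (map/filter) only build a description of the work;
# nothing is computed yet.
numbers = range(1, 6)                           # source data: 1..5
squared = map(lambda x: x * x, numbers)         # transformation: lazy
evens = filter(lambda x: x % 2 == 0, squared)   # transformation: still lazy

# "Action": materializing the results finally triggers the whole pipeline.
result = list(evens)
print(result)  # [4, 16]
```

In Spark the same pattern appears as `rdd.map(...).filter(...)` (transformations) followed by `collect()` or `count()` (actions), with the work distributed across cluster nodes instead of run in one process.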
In order to make the right decisions on a project and avoid architectural errors, you’ll discover the differences between Delta Lake and Parquet, two storage formats used by Databricks. Understanding the particularities of these formats will help you select the best one for your project, leading to more efficient results. We will also cover one of the key topics for any big data environment: query writing. You'll learn how to write queries in Scala, Python, and SQL, giving you the flexibility to work with different languages and tools as needed.
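One key difference the course explores: plain Parquet files are immutable, while Delta Lake layers a transaction log over Parquet files so the table can be updated and its history replayed. The toy sketch below is plain Python, not the real Delta Lake implementation, and all file names are illustrative; it only mimics the core idea that a log of add/remove actions determines which files make up the table at any version:

```python
# Toy sketch of a Delta-style transaction log (illustrative only).
# Each committed entry records which data files were added or removed;
# replaying the log yields the table state at a given version.
log = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    # An "update" rewrites a file: remove the old one, add the new one.
    {"version": 2, "add": ["part-000-v2.parquet"], "remove": ["part-000.parquet"]},
]

def active_files(log, as_of=None):
    """Replay the log up to `as_of` (inclusive) to get the live file set.
    Replaying an older version is the idea behind Delta's time travel."""
    files = set()
    for entry in log:
        if as_of is not None and entry["version"] > as_of:
            break
        files |= set(entry["add"])
        files -= set(entry["remove"])
    return files

print(sorted(active_files(log)))           # current table state
print(sorted(active_files(log, as_of=1)))  # "time travel" to version 1
```

This is why the syllabus pairs topics like Updates and Deletes, Time Travel, and the Transaction Log under Delta Lake: they all fall out of the same log-replay mechanism, which bare Parquet files do not have.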
You will learn how to optimize your Databricks workflows for maximum performance, and how to use powerful visualization tools to gain insights that drive better project decisions. Overall, the Databricks Fundamentals course is a detailed, practical introduction to this big data platform. With guidance from our trainer, an experienced Data Engineer, you’ll develop the skills and confidence to successfully handle complex data tasks.
Developers, Architects
At least 3 months of development experience in Scala, Java, Python, and SQL.
Introduction to Databricks – Theory 60% / Practice 40% - 4h
Creating Databricks Service
Databricks UI Overview
Databricks Architecture Overview
Databricks Notebooks
Databricks Cluster and Jobs - Theory 60% / Practice 40% - 4h
Cluster types and configuration
Databricks cluster pool
Databricks Job
Notebook workflows
DBFS - Theory 60% / Practice 40% - 4h
Databricks and Spark - Theory 60% / Practice 40% - 4h
Data Formats
Transformations
Joins, Aggregation
SQL
Delta Lake - Theory 60% / Practice 40% - 4h
Pitfalls of Data Lakes
Data Lakehouse Architecture
Read & Write to Delta Lake
Updates and Deletes on Delta Lake
Merge/Upsert to Delta Lake
History, Time Travel, Vacuum
Delta Lake Transaction Log
Convert from Parquet to Delta
Data Ingestion
Data Transformation - PySpark and Notebooks
Visualizations in Databricks - Theory 60% / Practice 40% - 2h
Collaboration in Databricks - Theory 60% / Practice 40% - 2h
Deploying Databricks on Azure - Theory 60% / Practice 40% - 2h
Deploying Databricks on the AWS Marketplace - Theory 60% / Practice 40% - 2h
Data Protection Use Cases - 4h
Oleksandr Holota
Big Data and ML Trainer