Databricks fundamentals

Unlock the potential of Databricks with our "Databricks Fundamentals" course. Learn to create and manage Databricks services, explore clusters, Databricks File System (DBFS), Delta Lake, and more. Perfect for data engineers and analysts looking to optimize big data processing and collaboration using Databricks on Azure or AWS.

24 hours
English
Online

Description

Databricks Fundamentals is a comprehensive course designed to introduce you to the powerful Databricks platform, a leading cloud-based solution for big data processing and machine learning. This course is ideal for data engineers, data scientists, and analysts who want to leverage Databricks for their data-driven projects.

The course begins with an introduction to Databricks, where you’ll learn how to create a Databricks service and explore its architecture and key components, including Databricks Notebooks. You’ll gain hands-on experience in setting up and configuring Databricks clusters and jobs, understanding the various cluster types, and managing workflows using Databricks Notebooks.

You'll then dive into the Databricks File System (DBFS), learning how to store and manage data efficiently within the Databricks environment. The course also covers the integration of Databricks with Apache Spark, where you’ll learn about data formats, transformations, joins, aggregations, and SQL queries within Databricks.

A significant portion of the course is dedicated to Delta Lake, a powerful storage layer that brings reliability and performance to data lakes. You’ll explore the pitfalls of traditional data lakes and learn how Delta Lake overcomes these challenges with features like time travel, updates, deletes, and data ingestion. The course also covers advanced topics such as Delta Lake’s transaction log and converting Parquet files to Delta format.

In addition, you’ll explore visualization techniques in Databricks, collaboration tools that enhance teamwork within the platform, and the deployment of Databricks on both Azure and AWS. This comprehensive course ensures you can manage data securely and optimize workflows in a collaborative environment.

By the end of this course, participants will:

Understand the architecture and key components of Databricks and its integration with Apache Spark.
Create and configure Databricks clusters and jobs and manage workflows using Databricks Notebooks.
Efficiently manage data using the Databricks File System (DBFS).
Implement and optimize data lakes using Delta Lake, including advanced features like time travel, transactions, and data ingestion.
Leverage Databricks for data visualization, collaboration, and secure deployment on Azure and AWS.

This course offers a balanced mix of theory and practice across a total of 28 hours. Each module is designed to provide you with both the conceptual understanding and hands-on experience needed to effectively use Databricks in your data projects.

Objectives

Set up and manage Databricks environments, including clusters, jobs, and notebooks.
Use Databricks File System (DBFS) for efficient data management.
Integrate Databricks with Apache Spark for data processing and SQL querying.
Implement Delta Lake to enhance data lake reliability and performance, including advanced operations like time travel and data transformation.
Deploy and manage Databricks on Azure and AWS, ensuring data security and optimizing workflows.

Roadmap

1. Introduction to Databricks - Theory 60% / Practice 40% - 4h

Creating Databricks Service
Databricks UI Overview
Databricks Architecture Overview
Databricks Notebooks

2. Databricks Cluster and Jobs - Theory 60% / Practice 40% - 4h

Cluster types and configuration
Databricks cluster pool
Databricks Job
Notebook workflows

3. DBFS - Theory 60% / Practice 40% - 4h

4. Databricks and Spark - Theory 60% / Practice 40% - 4h

Data Formats
Transformation
Joins, Aggregation
SQL

5. Delta Lake - Theory 60% / Practice 40% - 4h

Pitfalls of Data Lakes
Data Lakehouse Architecture
Read & Write to Delta Lake
Updates and Deletes on Delta Lake
Merge/Upsert to Delta Lake
History, Time Travel, Vacuum
Delta Lake Transaction Log
Convert from Parquet to Delta
Data Ingestion
Data Transformation - PySpark and Notebooks

6. Visualizations in Databricks - Theory 60% / Practice 40% - 2h

7. Collaboration in Databricks - Theory 60% / Practice 40% - 2h

8. Deploying Databricks on Azure - Theory 60% / Practice 40% - 2h

9. Deploying Databricks on the AWS Marketplace - Theory 60% / Practice 40% - 2h

10. Data Protection Use Cases - 4h

Related courses

Apache Spark Fundamentals

This training course delivers key concepts and methods for data processing applications development using Apache Spark.

Modern Data Management Approaches in Real World Case

Explore modern data management techniques with our "Modern Data Management Approaches in Real World Case" course. Learn through real-world examples, including handling 24M gaming cards, and gain hands-on experience with cutting-edge technologies like MongoDB, Spark Streaming, Cassandra, and distributed file systems. Perfect for data professionals looking to solve complex data challenges.

BigData SQL Hive

Unlock the power of big data analytics with "BigData SQL Hive." This course dives deep into Apache Hive, covering everything from architecture and data types to complex queries, transactions, and performance tuning. Perfect for data professionals looking to enhance their SQL skills in a big data environment.
