Hadoop Fundamentals

Master the essentials of Hadoop with our "Hadoop Fundamentals" course. Learn how to navigate the Hadoop ecosystem, from HDFS to MapReduce, YARN, Hive, and Spark. Gain hands-on experience in managing large-scale data processing and storage, making this course ideal for aspiring data engineers and developers.

  • Duration: 24 hours
  • Language: English
  • Format: Online
  • Code: EAS-015
  • Price: € 650 *

Available sessions

To be determined




Description

Hadoop Fundamentals is a comprehensive course designed to introduce you to the core components of the Hadoop ecosystem, providing the foundational knowledge and practical skills necessary to work with big data technologies. Whether you’re a beginner or have some experience, this course will equip you with the expertise needed to effectively manage and process large-scale data using Hadoop.

 

Course Overview:

  1. Basic Concepts of Modern Data Architecture
    • Begin with an introduction to modern data architecture, focusing on the role Hadoop plays in managing and processing big data. Understand the evolution of data management technologies and how Hadoop fits into the larger ecosystem.
  2. HDFS: Hadoop Distributed File System
    • Delve into the architecture of HDFS, exploring how it manages distributed storage, replication, and data accessibility. Learn key commands for working with HDFS and get hands-on experience connecting to a Hadoop cluster and managing files using both the shell and the Hue interface.
  3. The MapReduce Paradigm and Its Implementation in Java and Hadoop Streaming
    • Explore the MapReduce programming model, a core component of Hadoop for processing large datasets. Learn how to implement MapReduce in Java and through Hadoop Streaming. Practice by launching applications and observing how data is processed in a distributed environment.
  4. YARN: Distributed Application Execution Management
    • Understand the role of YARN in managing distributed applications within Hadoop. Learn about YARN’s architecture, how to launch applications in YARN, and monitor them through the user interface.
  5. Introduction to Hive
    • Discover Hive, a data warehouse infrastructure built on top of Hadoop. Learn about its architecture, table metadata, file formats, and the HiveQL query language. Practice creating tables, working with different file formats (CSV, Parquet, ORC), and executing SQL queries with aggregation and joins.
  6. Introduction to Spark
    • Get introduced to Apache Spark, focusing on its DataFrame/SQL API, metadata management, file formats, and data sources. Practice by reading and writing data over JDBC and in CSV and Parquet formats, and explore partitioning, query execution plans, and monitoring tasks through the Spark UI.
  7. Introduction to Streaming Data Processing
    • Learn about real-time data processing using Spark Streaming, Spark Structured Streaming, and Flink. Practice reading, processing, and writing data streams between Kafka, relational databases, and file systems.
  8. Introduction to HBase
    • Conclude with an introduction to HBase, a NoSQL database for Hadoop. Learn its architecture and query language, then practice writing and reading data through the HBase shell.
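
To make the MapReduce model from item 3 more concrete, here is a minimal word-count sketch in Python. It is not part of the course materials: the function names (mapper, reducer, run_local) are hypothetical, and the shuffle/sort step is simulated in a single process. In a real Hadoop Streaming job the mapper and reducer would be separate executables that read lines from standard input and write tab-separated key/value lines to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs sorted by key, which is how the framework
    delivers them to a reducer after the shuffle.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in one process (illustration only)."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Hadoop stores big files"]
    print(run_local(sample))
```

The sort between the two phases stands in for the shuffle that Hadoop performs across the cluster; the reducer relies on it, because groupby only merges adjacent pairs that share a key.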

 

By the end of this course, participants will:

  • Understand the core components of the Hadoop ecosystem and how they interact to manage big data.
  • Gain practical experience with HDFS, MapReduce, YARN, Hive, Spark, and HBase.
  • Develop the skills necessary to manage and process large-scale datasets using Hadoop and its associated tools.
  • Apply concepts learned in real-world scenarios, including data storage, processing, and analysis.

 

This course offers a balanced mix of theory and practice, with 24 hours of content. You’ll engage in hands-on exercises that complement the theoretical knowledge, ensuring you’re ready to apply Hadoop technologies in practical settings.

Upon completing the course, participants receive a certificate issued on the Luxoft Training form.

Objectives

Upon completion of the "Hadoop Fundamentals" course, trainees will be able to:

  • Effectively navigate and manage Hadoop’s core components, including HDFS, MapReduce, YARN, Hive, and Spark.
  • Implement data processing pipelines using MapReduce, HiveQL, and Spark SQL.
  • Utilize HDFS and HBase for efficient data storage and retrieval.
  • Process real-time data streams with Spark Streaming and Flink.
  • Monitor and optimize Hadoop applications through various user interfaces.

Target Audience

Developers, architects, database designers, database administrators

Prerequisites

  • Basic Java programming skills
  • Familiarity with the Unix/Linux shell
  • Experience with databases is optional
  • Desirable:
    - NoSQL/RDBMS experience
    - Understanding of big data concepts


Roadmap

1. Basic concepts of modern data architecture (1h theory)

2. HDFS: Hadoop Distributed File System (2h theory, 1h practice)

     - Architecture, replication, data in/out, HDFS commands

     Practice (shell, Hue): connecting to a cluster, working with the file system

3. The MapReduce paradigm and its implementation in Java and Hadoop Streaming (2h theory, 1h practice)

     Practice: launching applications

4. YARN: Distributed application execution management (1h theory, 1h practice)

     - YARN architecture, application launch in YARN

     Practice: launching applications and monitoring the cluster through the UI

5. Introduction to Hive (2h theory, 3h practice)

     - Architecture, table metadata, file formats, HiveQL query language

     Practice (Hue, hive, beeline, Tez UI): creating tables, reading & writing CSV, Parquet, ORC, partitioning, SQL queries with aggregation and joins

6. Introduction to Spark (2h theory, 3h practice)

     - DataFrame/SQL, metadata, file formats, data sources, RDD

     Practice (Zeppelin, Spark UI): reading & writing from the database (JDBC), CSV, Parquet, partitioning, SQL queries with aggregation and joins, query execution plans, monitoring

7. Introduction to streaming data processing (2h theory, 1h practice)

     - Spark Streaming, Spark Structured Streaming, Flink

     Practice: reading/processing/writing streams between Kafka, relational database and file system

8. Introduction to HBase (1h theory, 1h practice)

     - Architecture, query language

     Practice (HBase shell): writing and reading data

Total: theory 13h (54%), practice 11h (46%)
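
As a small illustration of the replication topic in item 2 of the roadmap, the sketch below estimates how a single file is laid out in HDFS. It is not taken from the course: the function name hdfs_footprint is hypothetical, and the 128 MB block size and replication factor of 3 are assumed values matching commonly used defaults, which a real cluster may override.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: a commonly used default block size
REPLICATION = 3      # assumption: a commonly used default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate the block count, replica count, and raw storage for one file.

    The last block of a file occupies only its actual size on disk,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1000)
    print(f"{blocks} blocks, {replicas} block replicas, {raw_mb} MB of raw storage")
```

Under these assumptions a 1000 MB file is split into 8 blocks, stored as 24 block replicas, and occupies 3000 MB of raw disk space across the cluster.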


