Hadoop Fundamentals

This training course covers the key concepts and methods for developing data-processing applications with Apache Hadoop. We’ll look at HDFS, the de facto standard for large-scale, long-term, robust data storage; the MapReduce framework for automated distributed code execution; and companion projects from the Hadoop ecosystem.

Duration
24 hours
Course type
Online
Language
English
Code
EAS-015
Training for 7-8 or more people? Customize trainings for your specific needs
€ 650 *

Description


This training provides a foundation in Apache Hadoop concepts and in the methods for developing data-processing applications with it. Participants will get acquainted with HDFS, the de facto standard for long-term, reliable big data storage; the YARN framework, which manages parallelized execution of applications on a cluster; and the Hadoop ecosystem projects Hive, Spark, and HBase.

After completing the course, a certificate is issued on the Luxoft Training form.

Objectives

  • Understand the key concepts and architecture of Hadoop
  • Get an idea of the ecosystem that has developed around Hadoop and its key components
  • Know how to read data from and write data to HDFS
  • Comprehend the MapReduce programming paradigm
  • Be able to access tabular data using Hive
  • Learn to access tabular data using Spark SQL/DataFrame in batch mode
  • Process data streams using Spark Structured Streaming
  • Learn to use HBase for low-latency data storage and reading

Target Audience

  • Software developers
  • Software architects
  • Database designers
  • Database administrators

Prerequisites

  • Basic Java programming skills
  • Unix/Linux shell familiarity
  • Experience with databases is optional

Roadmap

1. Basic concepts of modern data architecture (1h theory)

2. HDFS: Hadoop Distributed File System (2h theory, 1h practice)

- Architecture, replication, data in/out, HDFS commands

Practice (shell, Hue): connecting to a cluster, working with the file system

3. The MapReduce paradigm and its implementation in Java and Hadoop Streaming (2h theory, 1h practice)

Practice: launching applications

4. YARN: Distributed application execution management (1h theory, 1h practice)

- YARN architecture, application launch in YARN

Practice: launching applications and monitoring the cluster through the UI

5. Introduction to Hive (2h theory, 3h practice)

- Architecture, table metadata, file formats, HiveQL query language

Practice (Hue, hive, beeline, Tez UI): creating tables, reading & writing CSV, Parquet, ORC, partitioning, SQL queries with aggregation and joins

6. Introduction to Spark (2h theory, 3h practice)

- DataFrame/SQL, metadata, file formats, data sources, RDD

Practice (Zeppelin, Spark UI): reading & writing from the database (JDBC), CSV, Parquet, partitioning, SQL queries with aggregation and joins, query execution plans, monitoring

7. Introduction to streaming data processing (2h theory, 1h practice)

- Spark Streaming, Spark Structured Streaming, Flink

Practice: reading, processing, and writing streams between Kafka, a relational database, and the file system

8. Introduction to HBase (1h theory, 1h practice)

- Architecture, query language

Practice (HBase shell): writing and reading data

Total: theory 13h (54%), practice 11h (46%)
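
As a preview of the MapReduce paradigm covered in module 3, the three phases (map, shuffle, reduce) can be sketched in plain Python with a word count. This is only an illustration that runs on a single machine and does not use Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this sketch. They are not part of any Hadoop API or of the course materials.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key, which a real framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the framework performs the shuffle over the network. The sketch keeps everything in memory to show only the data flow.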
