Hadoop Fundamentals
Description
This training provides a foundation in Apache Hadoop concepts and in methods for developing data-processing applications on top of it. Participants will get acquainted with HDFS, the de facto standard for long-term, reliable big data storage; YARN, the framework that manages parallelized execution of applications on a cluster; and key Hadoop ecosystem projects: Hive, Spark, and HBase.
A certificate is issued on the Luxoft Training form
Objectives
- Understand the key concepts and architecture of Hadoop
- Get an idea of the ecosystem that has developed around Hadoop and its key components
- Know how to read and write data to and from HDFS
- Comprehend the MapReduce programming paradigm
- Be able to access tabular data using Hive
- Learn to access tabular data using Spark SQL/DataFrame in batch mode
- Learn to process data streams using Spark Structured Streaming
- Learn to use HBase for low-latency data storage and retrieval
Target Audience
- Software developers
- Software architects
- Database designers
- Database administrators
Prerequisites
- Basic Java programming skills
- Unix/Linux shell familiarity
- Experience with databases is helpful but not required
Roadmap
1. Basic concepts of modern data architecture (1h theory)
2. HDFS: Hadoop Distributed File System (2h theory, 1h practice)
- Architecture, replication, data in/out, HDFS commands
Practice (shell, Hue): connecting to a cluster, working with the file system
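For a flavor of this practice, a few of the core HDFS shell commands (the /user/alice paths are illustrative):

    hdfs dfs -mkdir -p /user/alice/input        # create a directory in HDFS
    hdfs dfs -put data.csv /user/alice/input/   # upload a local file to HDFS
    hdfs dfs -ls /user/alice/input              # list directory contents
    hdfs dfs -cat /user/alice/input/data.csv    # print a file to stdout
    hdfs dfs -get /user/alice/input/data.csv .  # copy the file back to local disk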
3. The MapReduce paradigm and its implementation in Java and Hadoop Streaming (2h theory, 1h practice)
Practice: launching applications
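As an illustrative sketch of such a launch, a minimal Hadoop Streaming job whose mapper and reducer are ordinary Unix commands (paths are placeholders; the single reducer makes the count global):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -D mapreduce.job.reduces=1 \
        -input /user/alice/input \
        -output /user/alice/out \
        -mapper /bin/cat \
        -reducer '/usr/bin/wc -l'    # counts input lines across the cluster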
4. YARN: Distributed application execution management (1h theory, 1h practice)
- YARN architecture, application launch in YARN
Practice: launching applications and monitoring the cluster through the UI
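The same monitoring can be mirrored from the command line (the application id below is a placeholder):

    yarn node -list                 # nodes in the cluster and their state
    yarn application -list          # applications currently running
    yarn logs -applicationId application_1700000000000_0001   # aggregated logs of a finished app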
5. Introduction to Hive (2h theory, 3h practice)
- Architecture, table metadata, file formats, the HiveQL query language
Practice (Hue, hive, beeline, Tez UI): creating tables, reading & writing CSV, Parquet, ORC, partitioning, SQL queries with aggregation and joins
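A sketch of a beeline session in the spirit of this practice; the connection string, table, and columns are illustrative:

    beeline -u jdbc:hive2://hiveserver:10000/default
    -- inside beeline: a partitioned Parquet table and an aggregating query
    CREATE TABLE sales (item STRING, amount DOUBLE)
        PARTITIONED BY (sale_date STRING)
        STORED AS PARQUET;
    SELECT sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date;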
6. Introduction to Spark (2h theory, 3h practice)
- DataFrame/SQL, metadata, file formats, data sources, RDD
Practice (Zeppelin, Spark UI): reading & writing from the database (JDBC), CSV, Parquet, partitioning, SQL queries with aggregation and joins, query execution plans, monitoring
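An illustrative Zeppelin/spark-shell fragment in the spirit of this practice (file paths and column names are assumptions, not course materials):

    import org.apache.spark.sql.functions.sum

    val df = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/user/alice/input/sales.csv")              // read CSV into a DataFrame
    df.groupBy("item").agg(sum("amount")).show()         // aggregation
    df.write.partitionBy("sale_date").parquet("/user/alice/sales_parquet")  // partitioned Parquet
    df.groupBy("item").count().explain()                 // inspect the query execution plan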
7. Introduction to streaming data processing (2h theory, 1h practice)
- Spark Streaming, Spark Structured Streaming, Flink
Practice: reading, processing, and writing streams between Kafka, a relational database, and the file system
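A minimal Structured Streaming sketch, assuming a Kafka broker reachable at kafka:9092 and a topic named events (all names are placeholders):

    val stream = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "events")
        .load()                                           // one row per Kafka record
    val query = stream
        .selectExpr("CAST(value AS STRING) AS message")   // Kafka values arrive as bytes
        .writeStream
        .format("parquet")
        .option("path", "/user/alice/stream_out")
        .option("checkpointLocation", "/user/alice/checkpoints")  // required for fault tolerance
        .start()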
8. Introduction to HBase (1h theory, 1h practice)
- Architecture, query language
Practice (HBase shell): writing and reading data
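Typical HBase shell commands of the kind this practice covers (table and column family names are illustrative):

    create 'users', 'info'                      # table with one column family
    put 'users', 'row1', 'info:name', 'Alice'   # write a single cell
    get 'users', 'row1'                         # read one row by key
    scan 'users', {LIMIT => 10}                 # scan the first ten rows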
Total: 13h theory (54%), 11h practice (46%)