1. Basic concepts of modern data architecture (1h theory)
2. HDFS: Hadoop Distributed File System (2h theory, 1h practice)
- Architecture, replication, reading and writing data, HDFS commands
Practice (shell, Hue): connecting to a cluster, working with the file system
3. The MapReduce paradigm and its implementation in Java and Hadoop Streaming (2h theory, 1h practice)
Practice: launching applications
4. YARN: managing distributed application execution (1h theory, 1h practice)
- YARN architecture, application launch in YARN
Practice: launching applications and monitoring the cluster through the UI
5. Introduction to Hive (2h theory, 3h practice)
- Architecture, table metadata, file formats, HiveQL query language
Practice (Hue, hive, beeline, Tez UI): creating tables; reading and writing CSV, Parquet, and ORC; partitioning; SQL queries with aggregation and joins
6. Introduction to Spark (2h theory, 3h practice)
- DataFrame/SQL, metadata, file formats, data sources, RDD
Practice (Zeppelin, Spark UI): reading from and writing to a database (JDBC), CSV, and Parquet; partitioning; SQL queries with aggregation and joins; query execution plans; monitoring
7. Introduction to streaming data processing (2h theory, 1h practice)
- Spark Streaming, Spark Structured Streaming, Flink
Practice: reading, processing, and writing streams between Kafka, a relational database, and the file system
8. Introduction to HBase (1h theory, 1h practice)
- Architecture, query language
Practice (HBase shell): writing and reading data
Total: 13h theory (54%), 11h practice (46%)