Spark Scala Kubernetes (Piloting)

Participants will embark on an enriching voyage through the Spark universe powered by Scala, beginning with a foundational understanding of Spark's architecture and seeing its edge over Hadoop's MapReduce.

Duration
0 hours
Course type
Online
Location
Online
Language
English
Code
EAS-030
€ 650 *
Training for 7-8 or more people? Customize trainings for your specific needs

Description

Embark on an enriching voyage through the Spark universe powered by Scala. Beginning with a foundational understanding of Spark's architecture, you'll see its edge over Hadoop's MapReduce. Delve into the realm of containerization, with a special focus on deploying Spark apps on Kubernetes. Discover the intricacies of high-level Data APIs and master data interactions with external storages. Journey further into Spark's DSL, SQL, and optimization avenues. Equip yourself with testing strategies for Spark applications and conclude with a deep dive into Spark's structured streaming. This comprehensive course is a blend of theory and hands-on practice.

Certificate

After completing the course, a certificate is issued on the Luxoft Training form.

Objectives

  1. Foundational Spark principles: dive into Spark's foundational concepts and architecture, compare its efficiency with Hadoop's MapReduce, and explore its diverse resource managers.
  2. Spark & Kubernetes synergy: learn to containerize Spark applications, understand Kubernetes dynamics, and apply efficient deployment techniques.
  3. Data API proficiency: delve deep into Spark's high-level Data APIs, DataFrame and DataSet, highlighting their differences, parallelization, and optimal storage methods.
  4. External data management mastery: focus on robust techniques for interacting with diverse external storages, choosing optimal data formats, and transferring data efficiently.
  5. Spark optimization & streaming: address the core challenges in Spark, understand optimization strategies, and dive into structured streaming techniques and applications.

Target Audience

Developers, architects

Prerequisites

Basic Java and Scala programming skills; familiarity with the Unix/Linux shell; experience with databases. Kafka experience is helpful but optional.

Roadmap

  • Module 1: Spark concepts and architecture (theory 2h 30m, practice 1h 30m)
  • Module 2: Containerization and deploying Spark applications to Kubernetes (theory 1h, practice 1h)
  • Module 3: High Level Data API: DataFrame, DataSet (theory 2h, practice 2h)
  • Module 4: Loading data from/to external storages (theory 1h, practice 3h)
  • Module 5: Spark DSL and SQL (theory 2h, practice 1h)
  • Module 6: Spark optimization cases (theory 2h, practice 1h)
  • Module 7: Testing Spark applications (theory 2h, practice 1h)
  • Module 8: Spark Structured Streaming (theory 2h, practice 1h)

Spark concepts and architecture

Explore Spark's superiority over Hadoop's MapReduce with hands-on examples. Dive into Lambda architecture, understand batch vs. streaming. Master Spark's resource managers: Kubernetes, YARN, Standalone. Learn to initiate Spark applications. Comprehensive definitions included.
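As a first taste of initiating a Spark application, the following sketch starts a local SparkSession and runs a two-stage computation; it assumes Spark 3.x on the classpath, and the app name and numbers are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object SparkInit {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession; in practice the master comes from spark-submit
    // (k8s://..., yarn, or spark://... for Standalone) rather than hard-coding.
    val spark = SparkSession.builder()
      .appName("architecture-demo")
      .master("local[*]") // local mode for experimentation
      .getOrCreate()

    // A trivial distributed computation: unlike MapReduce, the intermediate
    // map results stay in memory instead of being written to disk.
    val rdd = spark.sparkContext.parallelize(1 to 1000)
    val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    spark.stop()
  }
}
```

Swapping `local[*]` for a YARN or Kubernetes master is the only change needed to move the same code onto a cluster.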

Containerization and deploying Spark applications to Kubernetes

Master containerization: delve into Kubernetes terminology. Compare Kubernetes with YARN. Grasp dynamic resource allocation. Learn to containerize and deploy Spark on Kubernetes. Kickstart Spark applications seamlessly.
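The Kubernetes-related settings covered here can be sketched as SparkSession configuration; in practice they are usually passed as `--conf` flags to spark-submit. The image name, namespace, and API-server endpoint below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object K8sApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("k8s-demo")
      // k8s:// master scheme targets the Kubernetes API server
      .master("k8s://https://kubernetes.default.svc:443")
      .config("spark.kubernetes.container.image", "my-registry/spark-app:latest")
      .config("spark.kubernetes.namespace", "spark-jobs")
      .config("spark.executor.instances", "2")
      // Dynamic resource allocation on Kubernetes relies on shuffle tracking,
      // since there is no external shuffle service as on YARN.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .getOrCreate()

    spark.range(100).count()
    spark.stop()
  }
}
```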

High Level Data API: DataFrame, DataSet

Explore high-level Data APIs: DataFrame & DataSet. Unravel differences between RDD, DataFrame, and DataSet. Learn creation, parallelization techniques. Dive into DataFrame & DataSet analysis, control via plans and DAGs. Master saving methods to HDFS, FTP, S3.
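A minimal sketch of the DataFrame/DataSet distinction, assuming a local Spark 3.x setup; the `Employee` case class and sample rows are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DataApiDemo {
  // A DataSet needs a case class (its Encoder is derived from it);
  // a DataFrame is just Dataset[Row] with no compile-time schema.
  case class Employee(name: String, salary: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-api-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 100.0), ("Bob", 90.0)).toDF("name", "salary")
    val ds = df.as[Employee]              // typed view of the same data

    ds.filter(_.salary > 95.0).show()     // compile-checked lambda
    df.filter($"salary" > 95.0).explain() // inspect the physical plan / DAG

    // Saving: the same writer API targets HDFS, S3 (s3a://...), FTP, etc.
    // ds.write.parquet("hdfs:///data/employees")

    spark.stop()
  }
}
```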

Loading data from/to external storages

Master data loading techniques from external storages: Dive into reading/writing from HDFS, S3, FTP, FS. Choose optimal data formats. Learn parallelized JDBC interactions. Create DataFrames & DataSets from Kafka topics. Efficiently load data into Cassandra.
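The parallelized JDBC reading mentioned above can be sketched as follows; the JDBC URL, table name, and bounds are placeholders for a real database, and the example assumes a matching JDBC driver on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object JdbcLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-demo").master("local[*]").getOrCreate()

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/shop")
      .option("dbtable", "orders")
      // Partitioned read: Spark issues numPartitions parallel queries,
      // splitting [lowerBound, upperBound] on the partition column.
      .option("partitionColumn", "order_id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()

    // Columnar formats such as Parquet are usually the optimal storage choice.
    orders.write.mode("overwrite").parquet("s3a://bucket/orders")
    spark.stop()
  }
}
```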

Spark optimization cases

Delve into Spark optimization scenarios: Address 'out of memory' issues, manage small files in HDFS, correct skewed data, enhance join speeds, optimize large table broadcasts, resource sharing strategies, and leverage AQE & DPP for performance tuning.
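Two of these techniques, broadcasting the small side of a join and enabling AQE's skew handling, can be sketched as below; the generated sample data stands in for real skewed tables.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-tuning").master("local[*]")
      // AQE re-optimizes the plan at runtime; its skew-join handling
      // splits oversized partitions produced by skewed keys.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    val facts = spark.range(1000000).toDF("id")
    val dims  = Seq((0L, "a"), (1L, "b")).toDF("id", "label")

    // Broadcasting the small side avoids shuffling the large table.
    val joined = facts.join(broadcast(dims), "id")
    joined.explain() // look for BroadcastHashJoin in the physical plan

    spark.stop()
  }
}
```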

Testing Spark Applications

  • 4 levels of quality for a Spark application
  • Unit testing for Spark applications
  • Problems with unit testing Spark applications
  • Libraries and solutions
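A common unit-testing pattern is sketched below with ScalaTest and a shared local SparkSession; libraries such as spark-testing-base or spark-fast-tests layer DataFrame comparison helpers on top of this. The `dedup` function and sample rows are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

class DedupSpec extends AnyFunSuite {
  // Shared local session: starting a fresh session per test is one of the
  // classic slowness problems with unit testing Spark applications.
  private lazy val spark = SparkSession.builder()
    .appName("dedup-test").master("local[2]").getOrCreate()

  // The logic under test is a pure DataFrame -> DataFrame function,
  // which keeps it testable independently of any real data source.
  def dedup(df: DataFrame): DataFrame = df.dropDuplicates("id")

  test("dedup removes repeated ids") {
    import spark.implicits._
    val in = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "v")
    assert(dedup(in).count() === 2)
  }
}
```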

Spark Structured Streaming

  • Streaming DataFrame & DataSet
  • DataFrames and DataSets based on a Kafka topic
  • Loading data to Cassandra
  • Working with Spark and Cassandra state
  • Optimization features
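A streaming DataFrame built on a Kafka topic can be sketched as follows; it assumes the spark-sql-kafka connector on the classpath, and the broker address, topic name, and checkpoint path are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object KafkaStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-demo").master("local[*]").getOrCreate()

    // Streaming DataFrame backed by a Kafka topic; key/value arrive as
    // binary and are cast to strings here.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .select(col("key").cast("string"), col("value").cast("string"))

    // Console sink for demonstration; with the Spark Cassandra Connector,
    // .format("org.apache.spark.sql.cassandra") targets Cassandra instead.
    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/events")
      .start()

    query.awaitTermination()
  }
}
```

The checkpoint location is what lets the query recover its offsets and state after a restart.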

Trainers
Konstantin Okhrimenko