EInnovator - BigData Training

BigData Processing & Analytics with Hadoop & Spark Course

Course Overview

In this course, you will learn how to use Hadoop and Spark to process BigData workloads, and how to perform advanced analytics with Spark MLlib.

You will learn to identify the characteristics of BigData and FastData problems, and become familiar with the components and projects of the Hadoop ecosystem and with the HDFS architecture, commands, and API. You will write map-reduce jobs in Java, script map-reduce jobs in Pig, and write Hive SQL queries. You will also gain in-depth knowledge of Spark and its ecosystem, and learn enough about advanced analytics models to apply them in real-world BigData projects using MLlib, the Spark machine-learning library.

Course Format and Modes of Delivery

  • Four days of instructor-led training
  • 50% lecture, 50% hands-on lab
  • Corporate On-Site
  • Public

Target Audience

  • Data Engineers and Data Scientists
  • BigData and IoT Software Developers and Architects


Prerequisites

  • Familiarity with SQL and database concepts is useful.
  • [Day 4] Basic concepts of algebra and mathematical analysis (high-school or early undergraduate level) are useful but not required.


Course Objectives

  • Become familiar with the Hadoop and Spark Ecosystems.
  • Understand Map-Reduce, and write Map-Reduce jobs in Java.
  • Write scripts in Pig Latin.
  • Learn to write SQL queries in Hive to run in Hadoop.
  • Learn the architecture of Spark and how to write Spark functional-programming jobs in Scala for BigData processing.
  • Learn how to process event streams with Spark Streaming.
  • Learn how to use the Spark machine-learning library MLlib to perform analytics tasks.
  • Learn the theory and practice of the most widely used predictive analytics models with Spark MLlib.

Course Modules

  • BigData & IoT Problems
  • BigData & IoT Processing Architectures
  • Hadoop Ecosystem Overview
  • Getting Started with Hadoop
  • HDFS Motivation & Overview
  • HDFS Architecture — NameNode & DataNodes
  • HDFS Commands
  • HDFS Java API
  • HDFS Properties
  • Check-pointing in HDFS
  • High-availability in HDFS
  • HDFS NameNode federation
  • Map-Reduce Motivation & Overview
  • Map-Reduce Architecture
  • Writing a Mapper in Java
  • Writing a Reducer in Java
  • Writing a Job in Java
  • The Writable hierarchy
  • Partitioners, Combiners, Shuffle
  • Map-Reduce Joins
  • Limitations of Map-Reduce
  • Map-Reduce Design-Patterns
  • Hadoop streaming
  • Pig Scripting
  • SQL in Hadoop
  • Hive overview
  • Hive tables and DDL
  • Partitions and external tables
  • Selecting data
  • Joins
  • Transforms & User Defined Functions (UDFs)
  • Spark Overview
  • RDD Abstraction
  • Functional-Programming & BigData Processing with Spark
  • Spark SQL
  • DStream Abstraction
  • Real-time Processing with Spark
  • Integration & Receiver Types
  • Data Analytics Concepts
  • Analytics Models Overview
  • Getting Started with Spark MLlib
  • Instance-Based Analytics Models
  • PMML – Generation & Evaluation
  • Decision-Trees Theory
  • Decision-Trees in Spark MLlib
  • K-Means Clustering Theory
  • K-Means Clustering in Spark MLlib
  • Naive-Bayes Theory
  • Naive-Bayes in Spark MLlib
  • Linear Regression Theory
  • Linear Regression in Spark MLlib