Course Curriculum
Module 1 : Introduction to Big Data
• Introduction to Big Data
• Why Big Data?
• Characteristics of Big Data – 4 Vs
• Applications of Big Data
Module 2 : Hadoop Storage – HDFS
• Introduction to Hadoop
• HDFS – Hadoop Distributed File System
• Components of HDFS
• HDFS terminology
• HDFS Federation
• HDFS high availability
• Role of ZooKeeper
• Replica pipeline and network distance algorithm
• HDFS Read and Write
• Installing Hadoop on Windows/Mac using the Cloudera QuickStart VM
Module 3 : Introduction to MapReduce – Hadoop’s processing framework
• Introduction to the MapReduce Framework
• Mapper and Reducer APIs
• First MapReduce program – Word Count
• MapReduce examples – Inverted Index and Titanic Data Analysis
• Modes of execution
• Job execution in MRv1 vs YARN
• Serialization and Deserialization
• Writable Classes
• Distributed Cache
Module 4 : Hive – Data warehousing infrastructure built on top of Hadoop
• Introduction to Hive
• RDBMS vs Hive
• Hive DDL : Managed Table vs External Table
• Issues with delimiters
• Hive Architecture
• Partitioning – Static and Dynamic
• Bucketing
• Dealing with JSON data – using JSON SerDe
• Hive UDF
• Creating Views
• File Formats – Avro, Parquet, ORC
• Optimization Techniques
Module 5 : NoSQL Database – HBase – Hadoop’s database
• What is a NoSQL Database?
• Why HBase?
• Introduction to HBase
• HBase high-level architecture
• HBase commands
• In-depth architectural view of HBase
• Java APIs for HBase operations
• Bulk Load using TableMapper and TableReducer API
• Bulk Load from a file using the ImportTsv tool
Module 6 : Sqoop, Flume, Kafka – Data Ingestion tools and Oozie – Hadoop workflow scheduler
• Introduction to Sqoop
• Sqoop Architecture
• Sqoop Import and Export with Examples
• Introduction to Oozie
• Oozie workflow
• Oozie Action Tags
• Oozie Parametrization
• Flume – Spooling Directory
• Kafka
Module 7 : Python
• Core programming concepts of Python
Module 8 : Spark Core
• Introduction to Spark
• Why Spark?
• Applications of Spark
• Spark Terminology
• RDD
• Architecture of Spark
• Transformations and Actions
• RDD Hierarchy
• Lazy Execution
• Shared Variables
• RDD persistence
Module 9 : Spark SQL
• Spark SQL – DataFrames, Datasets and SQL
• Spark JDBC
Contact : 9160040789/9182622217 mail: info@careerit.co.in
• Creating DataFrames from different sources
• Saving a DataFrame to a file or table, including partitioned tables
• Introduction to Delta Lake
• CRUD operations using the Delta format
Module 10 : Kafka
• Introduction to Kafka
• Kafka Architecture
• Kafka Use cases
• Creating Producer and Consumer
Module 11 : Spark Streaming
• Spark Streaming with Kafka
Course Duration : 45 Hours
Tools Used :
• VirtualBox
• Cloudera QuickStart VM
• Google Cloud Platform/Microsoft Azure for Demo
• Databricks Community Edition
Programming Languages Used :
• J2SE for MapReduce
• Python for Spark
• SQL for Hive and Spark SQL
Projects :
• Batch processing of e-commerce data using the Hadoop stack
• Real-time data ingestion and processing of social media data using the Hadoop stack
• Telematics Data Analysis
Workshops and Hackathons :
• Real-time ETL process simulation using Hadoop/Spark
• Hackathons on real-time data engineering problem statements