Introduction to Big Data & Apache Spark
- Introduction to Big Data
- Introduction & a brief history of Apache Spark
- Components of Apache Spark unified stack
- Who uses Apache Spark?

Getting started with Apache Spark
- Downloading & installing Apache Spark
- Running the examples & shell (Python & Scala)
- Introduction to core Apache Spark concepts
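The download-and-run steps above can be sketched as shell commands; the release version and file names below are illustrative (check spark.apache.org/downloads for the current release):

```shell
# Unpack a downloaded release (version is an assumption)
tar -xzf spark-3.5.1-bin-hadoop3.tgz
cd spark-3.5.1-bin-hadoop3

# Run one of the bundled examples
./bin/run-example SparkPi 10

# Start the interactive shells
./bin/pyspark        # Python shell
./bin/spark-shell    # Scala shell
```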

Introduction to Online Lab
- What is the Online Lab?
- Components of Online Lab
- Logging into Online Lab
- First Hands-On using Online Lab

Understanding resilient distributed datasets (RDDs)
- RDD basics
- Creating RDDs
- Working with RDD operations
- Passing functions to Apache Spark
- Common transformations and actions
- Persistence (caching)

Working with key/value pairs
- Creating pair RDDs
- Transformations on pair RDDs
- Actions available on pair RDDs
- Data partitioning (advanced)

Loading and saving your data
- Various file formats & file systems
- Structured data with Apache Spark SQL
- Databases

Advanced Apache Spark programming
- Introduction
- Accumulators
- Broadcast variables
- Working on a per-partition basis
- Piping to external programs
- Numeric RDD operations

Running Apache Spark on a cluster
- Introduction
- Apache Spark runtime architecture
- Submitting applications with spark-submit
- Packaging your code and dependencies
- Scheduling within and between Apache Spark applications
- Cluster managers
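A typical `spark-submit` invocation tying these topics together; the host, resource sizes, and file names are illustrative:

```shell
# Submit a Python application to a standalone cluster.
# --master can also be yarn, k8s://..., or local[*].
./bin/spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --executor-memory 2G \
  --py-files deps.zip \
  my_app.py arg1 arg2
```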

Introduction to Apache Spark libraries
- Understanding Apache Spark SQL
- Using Apache Spark SQL in applications
- Machine learning basics
- Machine learning with MLlib; graph processing with GraphX

Apache Spark Streaming
- A simple example
- Architecture and abstraction
- Transformations & output operations
- Input sources
- Streaming UI
- Performance considerations
- Kafka basics