Foreword
Preface

1. Introduction to Data Analysis with Spark
  What Is Apache Spark?
  A Unified Stack
    Spark Core
    Spark SQL
    Spark Streaming
    MLlib
    GraphX
    Cluster Managers
  Who Uses Spark, and for What?
    Data Science Tasks
    Data Processing Applications
  A Brief History of Spark
  Spark Versions and Releases
  Storage Layers for Spark

2. Downloading Spark and Getting Started
  Downloading Spark
  Introduction to Spark's Python and Scala Shells
  Introduction to Core Spark Concepts
  Standalone Applications
    Initializing a SparkContext
    Building Standalone Applications
  Conclusion

3. Programming with RDDs
  RDD Basics
  Creating RDDs
  RDD Operations
    Transformations
    Actions
    Lazy Evaluation
  Passing Functions to Spark
    Python
    Scala
    Java
  Common Transformations and Actions
    Basic RDDs
    Converting Between RDD Types
  Persistence (Caching)
  Conclusion

4. Working with Key/Value Pairs
  Motivation
  Creating Pair RDDs
  Transformations on Pair RDDs
    Aggregations
    Grouping Data
    Joins
    Sorting Data
  Actions Available on Pair RDDs
  Data Partitioning (Advanced)
    Determining an RDD's Partitioner
    Operations That Benefit from Partitioning
    Operations That Affect Partitioning
    Example: PageRank
    Custom Partitioners
  Conclusion

5. Loading and Saving Your Data
  Motivation
  File Formats
    Text Files
    JSON
    Comma-Separated Values and Tab-Separated Values
    SequenceFiles
    Object Files
    Hadoop Input and Output Formats
    File Compression
  Filesystems
    Local/“Regular” FS
    Amazon S3
    HDFS
  Structured Data with Spark SQL
    Apache Hive
    JSON
  Databases
    Java Database Connectivity
    Cassandra
    HBase
    Elasticsearch
  Conclusion

6. Advanced Spark Programming
  Introduction
  Accumulators
    Accumulators and Fault Tolerance
    Custom Accumulators
  Broadcast Variables
    Optimizing Broadcasts
  Working on a Per-Partition Basis
  Piping to External Programs
  Numeric RDD Operations
  Conclusion

7. Running on a Cluster
  Introduction
  Spark Runtime Architecture
    The Driver
    Executors
    Cluster Manager
    Launching a Program
    Summary
  Deploying Applications with spark-submit
  Packaging Your Code and Dependencies
    A Java Spark Application Built with Maven
    A Scala Spark Application Built with sbt
    Dependency Conflicts
  Scheduling Within and Between Spark Applications
  Cluster Managers
    Standalone Cluster Manager
    Hadoop YARN
    Apache Mesos
    Amazon EC2
    Which Cluster Manager to Use?
  Conclusion

8. Tuning and Debugging Spark
  Configuring Spark with SparkConf
  Components of Execution: Jobs, Tasks, and Stages
  Finding Information
    Spark Web UI
    Driver and Executor Logs
  Key Performance Considerations
    Level of Parallelism
    Serialization Format
    Memory Management
    Hardware Provisioning
  Conclusion

9. Spark SQL
  Linking with Spark SQL
  Using Spark SQL in Applications
    Initializing Spark SQL
    Basic Query Example
    SchemaRDDs
    Caching
  Loading and Saving Data
    Apache Hive
    Parquet
    JSON
    From RDDs
  JDBC/ODBC Server
    Working with Beeline
    Long-Lived Tables and Queries
  User-Defined Functions
    Spark SQL UDFs
    Hive UDFs
  Spark SQL Performance
    Performance Tuning Options
  Conclusion

10. Spark Streaming
  A Simple Example
  Architecture and Abstraction
  Transformations
    Stateless Transformations
    Stateful Transformations
  Output Operations
  Input Sources
    Core Sources
    Additional Sources
    Multiple Sources and Cluster Sizing
  24/7 Operation
    Checkpointing
    Driver Fault Tolerance
    Worker Fault Tolerance
    Receiver Fault Tolerance
    Processing Guarantees
  Streaming UI
  Performance Considerations
    Batch and Window Sizes
    Level of Parallelism
    Garbage Collection and Memory Usage
  Conclusion

11. Machine Learning with MLlib
  Overview
  System Requirements
  Machine Learning Basics
    Example: Spam Classification
  Data Types
    Working with Vectors
  Algorithms
    Feature Extraction
    Statistics
    Classification and Regression
    Clustering
    Collaborative Filtering and Recommendation
    Dimensionality Reduction
    Model Evaluation
  Tips and Performance Considerations
    Preparing Features
    Configuring Algorithms
    Caching RDDs to Reuse
    Recognizing Sparsity
    Level of Parallelism
  Pipeline API
  Conclusion

Index