Working with big data can be complex and challenging, in part
because of the multiple analysis frameworks and tools required.
Apache Spark is a big data processing framework perfect for analyzing
near-real-time streams and discovering historical patterns in batched
data sets. But Spark goes much further than other frameworks. By
including machine learning and graph processing capabilities, it makes
many specialized data processing platforms obsolete. Spark's unified
framework and programming model significantly lower the initial
infrastructure investment, and Spark's core abstractions are intuitive for
most Scala, Java, and Python developers.
Spark in Action teaches readers to use Spark for stream and batch data
processing. It starts with an introduction to the Spark architecture and
ecosystem, followed by a taste of Spark's command-line interface.
Readers then discover the most fundamental concepts and abstractions
of Spark, particularly Resilient Distributed Datasets (RDDs) and the
basic data transformations that RDDs provide. The first part of the
book covers writing Spark applications using the core APIs.
Readers also learn how to work with structured data using Spark SQL,
how to process near-real-time data with Spark Streaming, how to apply
machine learning algorithms with Spark MLlib, how to apply graph
algorithms to graph-shaped data using Spark GraphX, and they get an
introduction to Spark clustering.
KEY FEATURES
* Clear introduction to Spark
* Teaches how to ingest near-real-time data
* Shows how to gain value from big data
* Includes real-life case studies
AUDIENCE
Readers should be familiar with Java, Scala, or Python. No knowledge of
Spark or streaming operations is assumed, but some acquaintance with
machine learning is helpful.
ABOUT THE TECHNOLOGY
Apache Spark is a big data processing framework perfect for analyzing
near-real-time streams and discovering historical patterns in batched data
sets. Spark also offers machine learning and graph processing capabilities.