Best Hadoop and Spark Books
In God we trust. All others must bring data.
~ W. Edwards Deming (statistician, author, and lecturer)
The goal is to turn data into information, and information into insight.
~ Carly Fiorina (former president, and chair of Hewlett-Packard)
All in all it's just another brick in the wall
All in all you're just another brick in the wall.
~ Pink Floyd (lyrics from Another Brick in the Wall, Part 2)
My prior post was on Scala which—along with Java and Clojure—is a language that I find highly expressive and helpful for my programming needs. This weekend, let's move on to another topic and see what can be done to help you in your journey to grokking the Big Data solution space :)
- How best to handle and work with data at super-mega scale?
- How can one best decipher and understand that high-volume data and, in turn, convert it into a competitive advantage?
That leads me to share some thoughts on the finest books on the subject—primarily on Spark and Hadoop, plus a smattering of others—that have proved especially helpful to me as I drank from the Kool Aid of Big Data knowledge ;)
But first, I invite your comments—once you've read my brief take on each of the books below...
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O'Reilly), by Josh Wills, Sandy Ryza, et al.
- Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) by Holden Karau, et al.
- Hadoop: The Definitive Guide, 4th Edition (O'Reilly) by Tom White.
- Hadoop in Practice, 2nd Edition, (Manning), by Alex Holmes.
- Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al.
- Data Scientists at Work (Apress) by Sebastian Gutierrez.
- MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O'Reilly), by Donald Miner and Adam Shook.
- Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis.
- Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning), by Nathan Marz.
- Was your experience of reading any of these books different from mine?
- Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned Big Data and its ecosystem.
- Did I leave out any of your favorite Big Data book(s)?
- I've covered only a partial list of the Big Data books that I've read, necessarily limited by the time available...
If you're looking for the best-written and most exciting Big Data book of the year, look no further than this one: Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O'Reilly), by Josh Wills, Sandy Ryza, et al. This book provides sparkling clear insights into the value proposition that Apache Spark brings to the Big Data (metaphorical) table ;)
You get to understand how this open source project makes distributed programming eminently accessible to data scientists. It goes on to show how Spark—while maintaining MapReduce’s linear scalability and fault tolerance—extends it in three important ways:
- Its engine can execute a more general directed acyclic graph (DAG) of operators.
- It complements this capability with a rich set of transformations.
- It extends its predecessors with in-memory processing. Its Resilient Distributed Dataset (RDD) abstraction enables developers to materialize any point in a processing pipeline into memory across the cluster.
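The three extensions above can be tasted in miniature. The ToyRDD class below is purely hypothetical (it is emphatically not Spark's actual API or implementation); it mimics, in plain Python, the idea of lazily chained transformations plus an explicit cache() that materializes one point of the pipeline in memory for reuse:

```python
# A toy, pure-Python sketch of the RDD idea: lazy chained transformations
# plus an explicit cache() to materialize an intermediate result in memory.
# This only mimics the concept; it is NOT Spark's actual implementation.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute      # zero-arg function producing the data
        self._cached = None          # materialized result, once cached

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def _values(self):
        return self._cached if self._cached is not None else self._compute()

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._values()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._values() if pred(x)])

    def cache(self):
        # Materialize this point in the pipeline, akin to rdd.cache() in Spark.
        self._cached = self._compute()
        return self

    def collect(self):
        return self._values()

nums = ToyRDD.from_list(range(10))
evens = nums.filter(lambda x: x % 2 == 0).cache()  # reusable without recompute
result = evens.map(lambda x: x * x).collect()
print(result)  # [0, 4, 16, 36, 64]
```

Nothing runs until collect() forces the chain, which is the essence of how Spark can plan a whole DAG of operators before executing it.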
If the preceding themes strike a chord with you—and if you're looking for deep dives to get a sense for the feel of using Spark to do complex analytics on massive data sets—look no further than this book. It covers the entire pipeline in an exceptionally clear and engaging style. A bunch of diverse domains are engagingly covered in no less than nine case studies, to which a chapter each is devoted. These chapters make up the bulk of this stellar book.
IMHO, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is an ideal second book on Spark—for your initial forays into this subject, the next book on this list would be an excellent first book on Spark. But if you're determined to drink from the proverbial firehose, you really can't go wrong reading them side-by-side :)
Oh, and the most fun and standout chapters in this altogether stellar book are those on
- Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
- Understanding Wikipedia with Latent Semantic Analysis
- Analyzing Co-occurrence Networks with GraphX
Finally, I mention here the table of contents to give you a fuller flavor of the topics covered
- Chapter 1. Analyzing Big Data
- Chapter 2. Introduction to Data Analysis with Scala and Spark
- Chapter 3. Recommending Music and the Audioscrobbler Data Set
- Chapter 4. Predicting Forest Cover with Decision Trees
- Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
- Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
- Chapter 7. Analyzing Co-occurrence Networks with GraphX
- Chapter 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
- Chapter 9. Estimating Financial Risk through Monte Carlo Simulation
- Chapter 10. Analyzing Genomics Data and the BDG Project
- Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
All in all, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is a book that's got me really excited about the possibilities of this remarkable platform!
- Spark brings value by its ease-of-use (fire up Spark on your laptop, and start using its high-level API, which enables you to focus on your domain-specific computations).
- Spark enables interactive use for tackling complex algorithms.
- And you get in Spark a general-purpose computation engine (thinking here of combining multiple types of computations, such as ML, text processing, SQL querying, etc.) that would previously have necessitated a bunch of different engines.
Moving on now to the next book on my list, Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) by Holden Karau, et al.: in the Preface, the authors identify their target audience
This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark’s rich collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond problems that fit on a single machine while using their statistical background. Engineers, meanwhile, will learn how to write general-purpose distributed programs in Spark and operate production applications. Engineers and data scientists will both learn different details from this book, but will both be able to apply Spark to solve large distributed problems in their respective fields.
The second group this book targets is software engineers who have some experience with Java, Python, or another programming language. If you are an engineer, we hope that this book will show you how to set up a Spark cluster, use the Spark shell, and write Spark applications to solve parallel processing problems (italicized by me for emphasis). If you are familiar with Hadoop, you have a bit of a head start on figuring out how to interact with HDFS and how to manage a cluster, but either way, we will cover basic distributed execution concepts.

The full chapter devoted to Spark’s core abstraction for doing data-intensive computations—the resilient distributed dataset (aka RDD)—is a standout. The other standout chapter is the one that gets into the nitty-gritty of configuring a Spark application, and which also provides an overview of tuning and debugging Spark workloads in production.
Learning Spark: Lightning-Fast Big Data Analysis is richly illustrated with diagrams and tables, and there's no shortage of helpful code snippets to get you going with Spark :)
When reading books, we've all gotten used to doing the inevitable Google searches periodically—to compensate for the equally inevitable gaps in the narratives of any given technology book—but this book is mercifully free of the aforesaid read-some, search-online-some, resume-reading syndrome, yay!
So if you're ready to drink deep at the Hadoop pool, you simply can't go wrong with Hadoop: The Definitive Guide, 4th Edition (O'Reilly) by Tom White. Allow me to elaborate: in the Preface, the author elegantly traces the genesis of this very point—sparkling clear prose and unambiguous readability—to the works of the renowned mathematics writer, Martin Gardner, and adds
Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.
But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

You immediately get the sense that this book is a no-nonsense, friendly, and engaging guide to Hadoop and its ecosystem; rest assured that you'll finish this book without the author letting you down one bit. In fact, elaborating on this very theme, the first chapter gives a pleasant tour (a lay of the land, if you will) of the entirety of Hadoop: The Definitive Guide, 4th Edition, which runs to no fewer than 756 pages. Be sure to use that indispensable first chapter to make the most of absorbing the contents of this remarkable book. As the author explains,
The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through the book that allow you to skip chapters that aren’t needed to read later ones.

Further along, a bird's-eye view is provided for each of the chapters in the five main parts that make up this book. This summary is accompanied by a lovely flowchart of the paths that can be taken through the contents—thoughtful design, with the reader in mind, is the hallmark of the entire book. As a reader, I felt secure in the knowledge that I was learning Hadoop from a master of the art. In this regard, the following remarks (in the Foreword) by Doug Cutting—who, along with Mike Cafarella, created Hadoop—are quite telling, and reflect just how friendly and engaging a guide this book is to all things Hadoop
Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.
Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.
Don't miss this work (Hadoop: The Definitive Guide, 4th Edition) by the leading popularizer of Hadoop, who is doing for Hadoop what Martin Gardner has done for mathematics!
In the About this Book section, after mentioning how, with its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets, the author of Hadoop in Practice, 2nd Edition (Manning), Alex Holmes, goes on to identify the target audience of his book:
This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.
Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition by Joshua Bloch (Addison-Wesley).

One thing I really, really like about this book is the abundance of useful diagrams and code snippets, all of which are profusely annotated with thoughtful comments! I would say that the barrier-to-entry to this book is not all that high—hastening to add that this is most emphatically not the same as saying that the contents are trifling—so if you're determined, don't shy away from tackling this book (along with, importantly, keeping an introductory book by your side, such as Tom White's Hadoop: The Definitive Guide, which is reviewed above).
Very briefly, here is a rundown of the topics covered in this book:
1. Background and fundamentals: Chapter 1. Hadoop in a heartbeat, Chapter 2. Introduction to YARN.
2. Data logistics: Chapter 3. Data serialization—working with text and beyond, Chapter 4. Organizing and optimizing data in HDFS, Chapter 5. Moving data into and out of Hadoop.
3. Big data patterns: Chapter 6. Applying MapReduce patterns to big data, Chapter 7. Utilizing data structures and algorithms at scale, Chapter 8. Tuning, debugging, and testing.
4. Beyond MapReduce: Chapter 9. SQL on Hadoop, Chapter 10. Writing a YARN application.
This book (Hadoop in Practice, 2nd Edition) is packed with helpful material which—far from being cluttered in any way—is pleasingly organized and makes for smooth reading and a rewarding learning experience.
In my mind, the key to understanding the value in this next book, Professional Hadoop Solutions (Wrox) by Boris Lublinsky et al., lies in appreciating the following observation, which the authors make in the introductory chapter
Although many publications emphasize the fact that Hadoop hides infrastructure complexity from business developers, you should understand that Hadoop extensibility is not publicized enough... Hadoop’s implementation was designed in a way that enables developers to easily and seamlessly incorporate new functionality into Hadoop’s execution.
A significant portion of this book is dedicated to describing approaches to such customizations, as well as practical implementations. These are all based on the results of work performed by the authors.

They go on to explain cogently the reasons why great emphasis is placed on MapReduce code throughout the book. So if you approach this book with the mindset that the narratives will directly revolve around MapReduce, you'll glean quite a bit of value out of it. Their explanation of the MapReduce paradigm, as well as its nuts-and-bolts mechanisms, really is top notch.
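To give a feel for the paradigm those explanations revolve around, here is a minimal, framework-free word count expressed as MapReduce's three phases. This is a conceptual illustration in plain Python only; real Hadoop distributes each phase across a cluster:

```python
# A minimal, framework-free sketch of the MapReduce paradigm: word count
# expressed as explicit map, shuffle, and reduce phases. Hadoop runs these
# same phases distributed over many machines; here they run in one process.
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data at scale"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'at': 1, 'scale': 1}
```

The point of the sketch is the shape of the computation: everything the mapper and reducer do is local to one key or record, which is what makes the paradigm scale.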
The standout chapters are the following:
- Processing Your Data with MapReduce
- Customizing MapReduce Execution
- Hadoop Security
- Building Enterprise Security Solutions for Hadoop Implementations
But first, a fair warning is in order about this next book: once you start reading it, you're going to have a terribly hard time putting it down or, for that matter, doing anything else before you've read it all! Such was my experience of reading (and re-reading) this page-turner of a book: Data Scientists at Work (Apress) by Sebastian Gutierrez.
Consider this... We have these marvelous frameworks—in Spark, Hadoop, Storm and others—but surely they were not created in some ethereal vacuum. Right, these frameworks were of course created in the service of genuine business needs, and to solve pressing problems that folks were facing. So if you're looking for the scoop on this nexus (i.e. the potent symbiosis between the aims of Data Science and what Big Data has to offer), this is the book for you.
The corpus of this book is made up of in-depth interviews of 16 gifted data scientists. What makes these interviews incredibly engaging is the spectacularly good job done by the interviewer (the author of this book), Sebastian Gutierrez. His academic training is from MIT—where he earned a BS in Mathematics—and he is a data entrepreneur who has founded three data-related companies.
The pointed and evocative questions asked throughout the book could only have come from someone who knows the pragmatics of the Data Science field inside-out! And therein lies the immense value of this book: Detailed answers by 16 top data scientists as they shed light on the human side of data science, their thoughts on how this field is evolving, where it's headed, plus plenty of straight-from-the-trenches stories about their work.
While the quality of the interviews is uniformly excellent, a few of them stood out for me in particular.
To give you a flavor of the interviews—each of which is given its own chapter—ever so briefly, here is something from Claudia Perlich, who is the Chief Scientist at Dstillery. She teaches a high-level overview course on data mining for the NYU Stern MBA program to, in her own words, "...give people a good understanding of what the opportunities are and how to manage them instead of really teaching them how to do it". She has taught at NYU, MIT, Wharton, and Columbia. In response to the interview question in the book ("What about this work is interesting and exciting for you?"), Claudia noted
I have always been fascinated by math puzzles and puzzles in general. The work that I do is a real-world version of puzzles that life just presents. Data is the footprint of real life in some form, and so it is always interesting. It is like a detective game to figure out what is really going on. Most of my time I am debugging data with a sense of finding out what is wrong with it or where it disagrees with my assumption of what it was supposed to have meant. So these are games that I am just inherently getting really excited about.
In the end, here is the book's author (Sebastian Gutierrez) himself, describing in the Introduction the essence of his approach in putting together the interviews for this book
My interviewing method was designed to ask open-ended questions so that the personalities and spontaneous thought processes of each interviewee would shine through clearly and accurately. My aim was to get at the heart of how they came to be data scientists, what they love about the field, what their daily work lives entail, how they built their careers, how they developed their skills, what advice they have for people looking to become data scientists, and what they think the future of the field holds.

Some 20 years ago, when I was finishing grad school—at that time, I earned an MS degree in electrical engineering from Texas A&M University—we didn't call work such as my thesis (Noise-tolerant Software Method for Traffic Sign Recognition) Data Science. But in several ways, while I was reading the fine interviews in this book, I sure was reminded of the algorithms I worked out back then: various AI programming techniques (neural networks primarily, such as the Back-propagation Neural Network and the Adaptive Resonance Theory model, aka ART2). Good stuff, and enough reminiscing for that matter :)
So Data Scientists at Work is a fantastic book overall, if this sort of thing piques your interest.
The authors of this next book, MapReduce Design Patterns (O'Reilly) by Donald Miner and Adam Shook, are clearly experts in the Hadoop ecosystem, and what they've put together is more than what you'll find in the endearing O'Reilly “cookbook” series: rather than calling out specific problems and accompanying solutions, they share the lessons that they have learned along the way to becoming those experts. Note, too, that this book is mostly about the analytics side of Hadoop and MapReduce.
And they assume that you're already familiar with how Hadoop and MapReduce work, so they don't dive into the details of the APIs they use in this book—those topics have already been covered thoroughly in other books—and instead focus on analytics. In their own words
The motivation for us to write this book was to fill a missing gap we saw in a lot of new MapReduce developers. They had learned how to use the system, got comfortable with writing MapReduce, but were lacking the experience to understand how to do things right or well. The intent of this book is to prevent you from having to make some of your own mistakes by educating you on how experts have figured out how to solve problems with MapReduce. So, in some ways, this book can be viewed as an intermediate or advanced MapReduce developer resource, but we think early beginners and gurus will find use out of it.

One thing I appreciated a lot was the way the authors answer the question, "So why should we use Java MapReduce in Hadoop at all when we have options like Pig and Hive?" They point out two core reasons for spending time explaining how to implement something in hundreds of lines of code when the same can be accomplished in a couple of lines with, say, Pig and Hive. In their own words
First, there is conceptual value in understanding the lower-level workings of a system like MapReduce. The developer that understands how Pig actually performs a reduce-side join will make smarter decisions. Using Pig or Hive without understanding MapReduce can lead to some dangerous situations....
Second, Pig and Hive aren’t there yet in terms of full functionality and maturity (as of 2012). It is obvious that they haven’t reached their full potential yet. Right now, they simply can’t tackle all of the problems in the ways that Java MapReduce can.

Remaining mindful of the fact that the title of this book is admittedly open-ended, I mention here the table of contents to give you a flavor of the topics covered
- Chapter 1. Design Patterns and MapReduce
- Chapter 2. Summarization Patterns
- Chapter 3. Filtering Patterns
- Chapter 4. Data Organization Patterns
- Chapter 5. Join Patterns
- Chapter 6. Metapatterns
- Chapter 7. Input and Output Patterns
- Chapter 8. Final Thoughts and the Future of Design Patterns
With the caveats noted above, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems is a book absolutely worth exploring!
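Incidentally, the reduce-side join that the authors mention (when discussing what Pig does under the hood) is worth seeing in miniature. The sketch below is a single-process, plain-Python illustration of the pattern, not actual Hadoop code: mappers tag each record with its source table, the shuffle groups records by the join key, and the reducer pairs records across sources:

```python
# A single-process sketch of the reduce-side join pattern: tag records by
# source in the map phase, group them by join key in the shuffle, and pair
# them across sources in the reduce phase. Illustrative only, not Hadoop code.
from collections import defaultdict

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "lamp"), (2, "mug")]

# Map: emit (join_key, (source_tag, payload)) for every record.
tagged = [(uid, ("user", name)) for uid, name in users] + \
         [(uid, ("order", item)) for uid, item in orders]

# Shuffle: group the tagged records by the join key.
by_key = defaultdict(list)
for key, record in tagged:
    by_key[key].append(record)

# Reduce: for each key, cross the user records with the order records.
joined = []
for key, records in sorted(by_key.items()):
    names = [payload for tag, payload in records if tag == "user"]
    items = [payload for tag, payload in records if tag == "order"]
    joined.extend((key, n, i) for n in names for i in items)

print(joined)  # [(1, 'alice', 'book'), (1, 'alice', 'lamp'), (2, 'bob', 'mug')]
```

Seeing the pattern spelled out this way makes the authors' first reason concrete: once you know the reducer must buffer one side of the join per key, the memory implications of joining two large tables stop being a surprise.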
This next book is impeccably written in an eminently thoughtful style—Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis. The author is the CTO of Spongecell, and he has a Ph.D. in Statistics from Harvard University.
No doubt, with enough determination and time, one can do online searches and cobble together a solution to handle real-time, high-volume mega data. But that raises the question (and I'm not questioning anyone's tenacity here): is that really the ideal strategy? And that's where the book shines—what makes it stand out is the care and thought that have clearly been poured into making this book a one-stop resource for crafting end-to-end solutions for effectively grappling with real-time, high-volume data.
Much as I alluded to above, this book is impeccably written. The author has clearly honed his writing skills—quite likely while preparing his dissertation for the Ph.D. that he earned from Harvard University :)
Clearly written books are a godsend, and this superb book is one. In that vein, the author states with razor-sharp precision the aim of this book
The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. “Real time” applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project have either moved on to other things or simply forgotten why they wanted the data in the first place. By making the projects agile and incremental, this can be avoided as much as possible.

The author weaves into the narratives a lot of pragmatic advice; he has clearly been in the development trenches and done it all. As with the prior book, I mention here the table of contents to give you a flavor of the topics covered in Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
Introduction (Overview and Organization of This Book; Who Should Read This Book; Tools You Will Need; What's on the Website; Time to Dive In)
- Chapter 1: Introduction to Streaming Data (Sources of Streaming Data; Why Streaming Data Is Different; Infrastructures and Algorithms)
Part I: Streaming Analytics Architecture
- Chapter 2: Designing Real-Time Streaming Architectures (Real-Time Architecture Components; Features of a Real-Time Architecture; Languages for Real-Time Programming; A Real-Time Architecture Checklist)
- Chapter 3: Service Configuration and Coordination (Motivation for Configuration and Coordination Systems; Maintaining Distributed State; Apache ZooKeeper)
- Chapter 4: Data-Flow Management in Streaming Analysis (Distributed Data Flows; Apache Kafka: High-Throughput Distributed Messaging; Apache Flume: Distributed Log Collection)
- Chapter 5: Processing Streaming Data (Distributed Streaming Data Processing; Processing Data with Storm; Processing Data with Samza)
- Chapter 6: Storing Streaming Data (Consistent Hashing; “NoSQL” Storage Systems; Other Storage Technologies; Choosing a Technology; Warehousing)
Part II: Analysis and Visualization
- Chapter 7: Delivering Streaming Metrics (Streaming Web Applications; Visualizing Data; Mobile Streaming Applications)
- Chapter 8: Exact Aggregation and Delivery (Timed Counting and Summation; Multi-Resolution Time-Series Aggregation; Stochastic Optimization; Delivering Time-Series Data)
- Chapter 9: Statistical Approximation of Streaming Data (Numerical Libraries; Probabilities and Distributions; Working with Distributions; Random Number Generation; Sampling Procedures)
- Chapter 10: Approximating Streaming Data with Sketching (Registers and Hash Functions; Working with Sets; The Bloom Filter; Distinct Value Sketches; The Count-Min Sketch; Other Applications)
- Chapter 11: Beyond Aggregation (Models for Real-Time Data; Forecasting with Models; Monitoring; Real-Time Optimization)
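As a taste of the sketching techniques covered in Chapter 10, here is a deliberately simplified Bloom filter in plain Python (an illustrative sketch only, not production code): a fixed-size bit array plus k salted hashes that answers "definitely not present" or "probably present" without storing the items themselves:

```python
# A deliberately simplified Bloom filter: k salted SHA-1 hashes set bits in
# a fixed-size bit array. Membership tests can yield false positives, but
# never false negatives. Illustrative sketch, not production code.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # All k bits set => "probably present"; any bit clear => "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["storm", "kafka", "samza"]:
    bf.add(word)
print(bf.might_contain("kafka"))   # True
print(bf.might_contain("hadoop"))  # almost certainly False (not guaranteed)
```

The space savings are the whole point: a thousand bits stand in for the set itself, at the cost of a small, tunable false-positive rate.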
In the end, do make a note of the author's point when he reiterates that
The hope is that the reader of this book would feel confident taking a proof-of-concept streaming data project in their organization from start to finish with the intent to release it into a production environment.

All this makes Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data a book that shouldn't be missed ;)
This next book, Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning) by Nathan Marz, dives deep into the concepts underlying the Lambda Architecture—which is what the author dubbed the approach that he formalized during his years working at the startup BackType—along with, importantly, many illustrative examples which are nicely supplemented by code snippets. The author puts it succinctly when he notes that
This book is the result of my desire to spread the knowledge of the Lambda Architecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. I hope you treat this book as a journey—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun.

As an aside—confessing here my fondness for Clojure, the Lisp that runs on the JVM—I couldn't help but resonate with the following sentiments echoed by Nathan Marz in the Acknowledgments section of Big Data: Principles and Best Practices of Scalable Realtime Data Systems, where he notes that
Rich Hickey has been one of my biggest inspirations during my programming career. Clojure is the best language I have ever used, and I’ve become a better programmer having learned it. I appreciate its practicality and focus on simplicity. Rich’s philosophy on state and complexity in programming has influenced me deeply.

In sum, this is a worthwhile book, nicely structured into theory and illustration chapters.
In the end, and as I mentioned at the outset, I invite your comments—having now read my brief take on each of the books above...
- Do you find that your experience of reading any of these books was different?
- Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned Big Data and its ecosystem.
- Did I omit any of your favorite Big Data book(s)?
- I've covered only a partial list of the Big Data books that I've read, limited as you can imagine I am by the time available...
Bon voyage, and I leave you with an obligatory photo of a section of one of my bookshelves—one that's, um, rather biased toward Big Data material in a statistically significant way, eh ;)