Saturday, August 1, 2015

Best Hadoop and Spark Books

In God we trust. All others must bring data.
~ W. Edwards Deming (statistician, author, and lecturer)
The goal is to turn data into information, and information into insight.
~ Carly Fiorina (former president, and chair of Hewlett-Packard)
All in all it's just another brick in the wall
All in all you're just another brick in the wall
~ Pink Floyd (lyrics from Another Brick in the Wall, Part 2)

My prior post was on Scala which—along with Java and Clojure—is a language that I find highly expressive and helpful for my programming needs. This weekend, let's move on to another topic and see what can be done to help you in your journey to grokking the Big Data solution space :)

I do believe that the two key questions fueling the torrent that this age of Big Data has evolved into are these:
  1. How best to handle and work with data at super-mega scale?
  2. How can one best decipher and understand that high-volume data and, in turn, convert it into a competitive advantage?
Living as we do today, well into the age of Big Data, it sure helps to have some guidance from those who are at the frontline of these endeavors which revolve around these two questions—Online resources are indispensable and fantastic in their own right, especially for cutting edge updates. But what about times when you simply want to sit down and really absorb the wisdom of our Big Data sages—the underlying conceptual infrastructure that powers the Big Data machinery—in a more sustained and methodical way?

That leads me to share some thoughts on the finest books on the subject—primarily on Spark and Hadoop, plus a smattering of others—that have proved especially helpful to me as I drank from the Kool Aid of Big Data knowledge ;)
  1. Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O'Reilly), by Josh Wills, Sandy Ryza, et al.
  2. Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) by Holden Karau, et al.
  3. Hadoop: The Definitive Guide, 4th Edition (O'Reilly) by Tom White.
  4. Hadoop in Practice, 2nd Edition, (Manning), by Alex Holmes.
  5. Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al.
  6. Data Scientists at Work (Apress) by Sebastian Gutierrez.
  7. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O'Reilly), by Donald Miner and Adam Shook.
  8. Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis.
  9. Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning), by Nathan Marz.
But first, I invite your comments. Once you've read my brief take on each of the books below...
  • Did you sense that your experience of reading any of these books was different? 
  • Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned Big Data and its ecosystem. 
  • Did I leave out any of your favorite Big Data book(s)? 
  • I've covered only a partial list of the Big Data books that I've read, necessarily limited by the time available...


If you're looking for the best-written and most exciting Big Data book of the year, look no further than this one: Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O'Reilly), by Josh Wills, Sandy Ryza, et al. This book provides sparkling clear insights into the value proposition that Apache Spark brings to the Big Data (metaphorical) table ;)

You get to understand how this open source project makes distributed programming eminently accessible to data scientists. It goes on to show how Spark—while maintaining MapReduce’s linear scalability and fault tolerance—extends it in three important ways:
  1. Its engine can execute a more general directed acyclic graph (DAG) of operators.
  2. It complements this capability with a rich set of transformations.
  3. It extends its predecessors with in-memory processing. Its Resilient Distributed Dataset (RDD) abstraction enables developers to materialize any point in a processing pipeline into memory across the cluster.
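To make those three points a bit more concrete, here is a minimal single-process sketch in plain Python. It is emphatically not Spark's API: the `ToyDataset` class and its `evaluations` counter are hypothetical names I made up for illustration. The sketch only tries to show the flavor of lazily chained transformations plus the option to materialize an intermediate point of the pipeline in memory so that later computations can reuse it.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyDataset:
    """Single-process stand-in for the idea behind an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterator], name: str):
        self._compute = compute            # recipe for rebuilding this dataset from its parent
        self._cache_requested = False
        self._cached: Optional[List] = None
        self.name = name
        self.evaluations = 0               # how many times the recipe actually ran

    @classmethod
    def from_list(cls, items: Iterable) -> "ToyDataset":
        data = list(items)
        return cls(lambda: iter(data), "source")

    def _iterate(self) -> Iterator:
        if self._cached is not None:
            return iter(self._cached)      # reuse the materialized result
        self.evaluations += 1
        result = self._compute()
        if self._cache_requested:
            self._cached = list(result)    # materialize this point of the pipeline in memory
            return iter(self._cached)
        return result

    # Transformations: lazy, they only record how to derive a new dataset.
    def map(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (fn(x) for x in self._iterate()), f"map({self.name})")

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._iterate() if pred(x)), f"filter({self.name})")

    def cache(self) -> "ToyDataset":
        self._cache_requested = True
        return self

    # Actions: these trigger the actual computation.
    def collect(self) -> List:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


squares = ToyDataset.from_list(range(10)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)   # two branches share the cached parent
large = squares.filter(lambda x: x > 20)

print(evens.collect())
print(large.count())
print(squares.evaluations)   # the squares were computed once, then served from memory
```

Nothing runs until an action (`collect` or `count`) is called, and because `squares` was marked with `cache()`, the second branch reads the materialized list instead of recomputing it. A real engine does this across a cluster and with fault tolerance, which this toy ignores entirely.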
One particularly telling remark that the authors make has to do with how, "...With respect to the pertinence of munging and ETL, Spark strives to be something closer to the Python of big data than the Matlab of big data". Spark’s in-memory caching makes it equally ideal for programming in the large and small. And what's possibly most exciting is how Spark bridges the gap between the avenues of exploratory analytics and production (i.e. operational) analytics! On top of that, Spark's tight integration with the Hadoop ecosystem makes it an eminently accessible and attractive framework.

If the preceding themes strike a chord with you, and if you're looking for deep dives to get a feel for using Spark to do complex analytics on massive data sets, look no further than this book. It covers the entire pipeline in an exceptionally clear and engaging style. Diverse domains are covered in no fewer than nine case studies, each of which gets its own chapter. These chapters make up the bulk of this stellar book.

IMHO, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is an ideal second book on Spark—for your initial forays into this subject, the next book on this list would be an excellent first book on Spark. But if you're determined to drink from the proverbial firehose, you really can't go wrong reading them side-by-side :)

Oh, and the most fun and standout chapters in this altogether stellar book are those on
  • Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  • Understanding Wikipedia with Latent Semantic Analysis
  • Analyzing Co-occurrence Networks with GraphX
Finally, I mention here the table of contents to give you a fuller flavor of the topics covered
  • Chapter 1. Analyzing Big Data
  • Chapter 2. Introduction to Data Analysis with Scala and Spark
  • Chapter 3. Recommending Music and the Audioscrobbler Data Set
  • Chapter 4. Predicting Forest Cover with Decision Trees
  • Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
  • Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
  • Chapter 7. Analyzing Co-occurrence Networks with GraphX
  • Chapter 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  • Chapter 9. Estimating Financial Risk through Monte Carlo Simulation
  • Chapter 10. Analyzing Genomics Data and the BDG Project
  • Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
All in all, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is a book that's got me really excited about the possibilities of this remarkable platform!


The book Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) by Holden Karau, et al. is a welcome addition to the library of those starting out in their quest to grok the amazing framework that Apache Spark is. What I appreciated the most about this book is the thorough and pragmatic coverage of Apache Spark, beginning with an invitation to understand the value that Spark offers by extending MapReduce:
  1. Spark brings value by its ease-of-use (fire up Spark on your laptop, and start using its high-level API, which enables you to focus on your domain-specific computations).
  2. Spark enables interactive use for tackling complex algorithms.
  3. And you get in Spark a general-purpose computation engine (think here of combining multiple types of computations, such as ML, text processing, SQL querying, etc.) for work that would previously have necessitated a bunch of different engines.
For us software types, the following observations by the authors are worth bringing out so you can best decide whether the targeted value that this book offers is for you
This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark’s rich collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond problems that fit on a single machine while using their statistical background. Engineers, meanwhile, will learn how to write general-purpose distributed programs in Spark and operate production applications. Engineers and data scientists will both learn different details from this book, but will both be able to apply Spark to solve large distributed problems in their respective fields. 
The second group this book targets is software engineers who have some experience with Java, Python, or another programming language. If you are an engineer, we hope that this book will show you how to set up a Spark cluster, use the Spark shell, and write Spark applications to solve parallel processing problems (italicized by me for emphasis). If you are familiar with Hadoop, you have a bit of a head start on figuring out how to interact with HDFS and how to manage a cluster, but either way, we will cover basic distributed execution concepts.
The full chapter devoted to Spark’s core abstraction for doing data-intensive computations—the resilient distributed dataset (aka RDD)—is a standout. The other standout chapter is the one that gets into the nitty gritty of configuring a Spark application, and which also provides an overview of tuning and debugging Spark workloads in production.

Learning Spark: Lightning-Fast Big Data Analysis is richly illustrated with diagrams and tables, and there's no shortage of helpful code snippets to get you going with Spark :)


Let's segue from Spark to Hadoop land now, beginning with a remarkable book: Hadoop: The Definitive Guide, 4th Edition (O'Reilly) by Tom White—This crystal clear and eminently readable book is perhaps the grand-daddy of all Big Data books out there! Now in its fourth edition, this book is the paragon of sparkling clear prose and unambiguous explanations of all things Hadoop, which of course we tech types crave :)

When reading books, we've all gotten used to doing the inevitable Google searches periodically (to compensate for the equally inevitable gaps in the narratives of any given technology book), but this book is mercifully free of the aforesaid read-some, search-online-some, resume-reading syndrome, yay!

So if you're ready to drink deep at the Hadoop pool, you simply can't go wrong with this book. Allow me to elaborate: In the Preface, the author elegantly traces the genesis of this very point—sparkling clear prose and unambiguous readability—to the works of the renowned mathematics writer, Martin Gardner, and adds
Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.  
But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.
You immediately get the sense that this book is a no-nonsense, friendly, and engaging guide to Hadoop and its ecosystem; rest assured that you'll finish this book without the author letting you down one bit. In fact, elaborating on this very theme—that this is a no-nonsense, friendly, and engaging guide to Hadoop—the first chapter gives a pleasant tour (a lay of the land, if you will) to the entirety of Hadoop: The Definitive Guide, 4th Edition, which is made up of no less than 756 pages. Be sure to use the book's indispensable first chapter to make the most of absorbing the contents of this remarkable book. As the author explains,
The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through the book that allow you to skip chapters that aren’t needed to read later ones.
Further along, a bird's eye view is provided for each of the chapters in the five main parts that make up this book. This summary is accompanied by a lovely flowchart of the paths that can be taken through the contents—Thoughtful design, with the reader in mind, is the hallmark of the entire book. As a reader, I felt secure in the knowledge of learning Hadoop from a master of the art. In this regard, the following remarks (in the Foreword) by Doug Cutting—who, along with Mike Cafarella, created Hadoop in 2005—are quite telling, and reflect just how friendly and engaging a guide this book is to all things Hadoop
Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. 
Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.
Don't miss this work (Hadoop: The Definitive Guide, 4th Edition) by the leading popularizer of Hadoop, who is doing for Hadoop what Martin Gardner has done for mathematics!


This next title is an excellent second book on Hadoop: Hadoop in Practice, 2nd Edition (Manning), by Alex Holmes. Now in its second edition, this book got a thorough update to cover changes and new features in Hadoop, including MapReduce 2. New chapters have been added to cover YARN, Kafka, Impala, and Spark SQL as they each relate to Hadoop. The new edition keeps the strengths of the first edition (approximately 100 intermediate-to-advanced Hadoop examples in a superb problem-and-solution format) and builds on them, while maintaining the high quality of the code examples.

In the About this Book section, after mentioning how, with its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets, the author goes on to identify the target audience of this book:
This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.  
Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition by Joshua Bloch (Addison-Wesley).
One thing I really, really like about this book is the abundance of useful diagrams and code snippets, all of which are profusely annotated with thoughtful comments! I would say that the barrier-to-entry to this book is not all that high—hastening to add that this is most emphatically not the same as saying that the contents are trifling—so if you're determined, don't shy away from tackling this book (along with, importantly, having an introductory book by your side, such as the fine book entitled Hadoop: The Definitive Guide, by Tom White, and which is also reviewed above).

Very briefly, here is a rundown of the topics covered in this book:
1. Background and fundamentals: Chapter 1. Hadoop in a heartbeat, Chapter 2. Introduction to YARN.  
2. Data logistics: Chapter 3. Data serialization—working with text and beyond, Chapter 4. Organizing and optimizing data in HDFS, Chapter 5. Moving data into and out of Hadoop.  
3. Big data patterns: Chapter 6. Applying MapReduce patterns to big data, Chapter 7. Utilizing data structures and algorithms at scale, Chapter 8. Tuning, debugging, and testing.  
4. Beyond MapReduce: Chapter 9. SQL on Hadoop, Chapter 10. Writing a YARN application.
This book (Hadoop in Practice, 2nd Edition) is packed with helpful material which, far from being cluttered in any way, is pleasingly organized and makes for smooth reading and a rewarding learning experience.


Once comfortable with the Hadoop paradigm, you'll be able to appreciate the gem of a book we've got in this next title: Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al. The authors have assembled a first-class collection of design expertise narratives.

In my mind, the key to understanding the value in this book lies in appreciating the following observation, which the authors make in the introductory chapter
Although many publications emphasize the fact that Hadoop hides infrastructure complexity from business developers, you should understand that Hadoop extensibility is not publicized enough... Hadoop’s implementation was designed in a way that enables developers to easily and seamlessly incorporate new functionality into Hadoop’s execution. 
A significant portion of this book is dedicated to describing approaches to such customizations, as well as practical implementations. These are all based on the results of work performed by the authors.
They go on to explain cogently the reasons why great emphasis is placed on MapReduce code throughout the book. So if you approach this book with the mindset that the narratives will directly revolve around MapReduce, you'll glean quite a bit of value out of this book. Their explanation of the MapReduce paradigm, as well as its nuts-and-bolts mechanisms, really are top notch.

The standout chapters are the following:
  • Processing Your Data with MapReduce
  • Customizing MapReduce Execution
  • Hadoop Security
  • Building Enterprise Security Solutions for Hadoop Implementations
The Appendix toward the end of Professional Hadoop Solutions is especially rich and useful. Overall, I'm glad to have found this book!


And now let's segue from Hadoop to a foray into Data Science kingdom proper :)

But first a fair warning is in order about this next book: Once you start reading it, you're going to have a terribly hard time putting it down or, for that matter, doing anything else before you've read it all! Such was my experience of reading (and re-reading) this page-turner of a book: Data Scientists at Work (Apress) by Sebastian Gutierrez.

Consider this... We have these marvelous frameworks—in Spark, Hadoop, Storm and others—but surely they were not created in some ethereal vacuum. Right, these frameworks were of course created in the service of genuine business needs, and to solve pressing problems that folks were facing. So if you're looking for the scoop on this nexus (i.e. the potent symbiosis between the aims of Data Science and what Big Data has to offer), this is the book for you.

The corpus of this book is made up of in-depth interviews of 16 gifted data scientists. What makes these interviews incredibly engaging is the spectacularly good job done by the interviewer (the author of this book), Sebastian Gutierrez. His academic training is from MIT—where he earned a BS in Mathematics—and he is a data entrepreneur who has founded three data-related companies.

The pointed and evocative questions asked throughout the book could only have come from someone who knows the pragmatics of the Data Science field inside-out! And therein lies the immense value of this book: Detailed answers by 16 top data scientists as they shed light on the human side of data science, their thoughts on how this field is evolving, where it's headed, plus plenty of straight-from-the-trenches stories about their work.

While the quality of the interviews is uniformly excellent, the standout interviews in my mind are the ones with these data scientists who are doing stellar work
To give you a flavor of the interviews (each of which is given its own chapter), ever so briefly, here is something from Claudia, who is the Chief Scientist at Dstillery. She teaches a high-level overview course on data mining for the NYU Stern MBA program to, in her own words, "...give people a good understanding of what the opportunities are and how to manage them instead of really teaching them how to do it". She has taught at NYU, MIT, Wharton, and Columbia. In response to the interview question in the book ("What about this work is interesting and exciting for you?"), Claudia noted
I have always been fascinated by math puzzles and puzzles in general. The work that I do is a real-world version of puzzles that life just presents. Data is the footprint of real life in some form, and so it is always interesting. It is like a detective game to figure out what is really going on. Most of my time I am debugging data with a sense of finding out what is wrong with it or where it disagrees with my assumption of what it was supposed to have meant. So these are games that I am just inherently getting really excited about.
In the end, here is the book's author (Sebastian Gutierrez) himself, describing in the Introduction the essence of his approach in putting together the interviews for this book
My interviewing method was designed to ask open-ended questions so that the personalities and spontaneous thought processes of each interviewee would shine through clearly and accurately. My aim was to get at the heart of how they came to be data scientists, what they love about the field, what their daily work lives entail, how they built their careers, how they developed their skills, what advice they have for people looking to become data scientists, and what they think the future of the field holds.
Some 20 years ago, when I was finishing grad school—at that time, I earned an MS degree in electrical engineering from Texas A&M University—we didn't call work such as my dissertation (Noise-tolerant Software Method for Traffic Sign Recognition) Data Science. But in several ways, while I was reading the fine interviews in this book, I sure was reminded of the algorithms I worked out back then: Various AI programming techniques (neural networks primarily, such as the Back-propagation Neural Network and the Adaptive Resonance Theory model, aka ART2). Good stuff, and enough reminiscing, for that matter :)

So Data Scientists at Work is a fantastic book overall, if this sort of thing piques your interest.


Segueing right back to Hadoop now, the next book has a decidedly open-ended title: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O'Reilly), by Donald Miner and Adam Shook. Given the open-ended title, allow me to elaborate on the gist of this fine book...

The authors are clearly experts in the Hadoop ecosystem, and what they've put together is more than what you'll find in the endearing O'Reilly “cookbook” series. Thus, they don’t call out specific problems and accompanying solutions. Instead, they share the lessons that they have learned along the way to becoming experts in the Hadoop ecosystem. Note, too, that this book is mostly about the analytics side of Hadoop and MapReduce.

And they assume that you're already familiar with how Hadoop and MapReduce work, so they don't dive into the details of the APIs which they use in this book—Those topics have already been covered thoroughly in other books, and they focus on analytics. In their own words
The motivation for us to write this book was to fill a missing gap we saw in a lot of new MapReduce developers. They had learned how to use the system, got comfortable with writing MapReduce, but were lacking the experience to understand how to do things right or well. The intent of this book is to prevent you from having to make some of your own mistakes by educating you on how experts have figured out how to solve problems with MapReduce. So, in some ways, this book can be viewed as an intermediate or advanced MapReduce developer resource, but we think early beginners and gurus will find use out of it.
One thing I appreciated a lot was the way the authors answer the question, "So why should we use Java MapReduce in Hadoop at all when we have options like Pig and Hive?". They point out two core reasons for spending time explaining how to implement something in hundreds of lines of code when the same can be accomplished in a couple lines with, say, Pig and Hive. In their own words
First, there is conceptual value in understanding the lower-level workings of a system like MapReduce. The developer that understands how Pig actually performs a reduce-side join will make smarter decisions. Using Pig or Hive without understanding MapReduce can lead to some dangerous situations.... 
Second, Pig and Hive aren’t there yet in terms of full functionality and maturity (as of 2012). It is obvious that they haven’t reached their full potential yet. Right now, they simply can’t tackle all of the problems in the ways that Java MapReduce can.
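Since the authors single out the reduce-side join as something worth understanding at the lower level, here is a rough sketch of the idea in plain Python, with the map, shuffle, and reduce phases simulated in a single process. This is my own illustration and not code from the book: the sample records and the function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`) are all hypothetical, and a real Hadoop job would spread these phases across many machines.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records, both keyed by a user id.
users = [(1, "ada"), (2, "grace"), (3, "linus")]
comments = [(1, "nice post"), (1, "thanks"), (3, "+1")]


def map_phase(users, comments):
    """Tag every record with its source so the reducer can tell the two inputs apart."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, text in comments:
        yield user_id, ("C", text)


def shuffle(pairs):
    """Group values by key, which is what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Inner join: pair every user record with every comment that shares this key."""
    names = [value for tag, value in values if tag == "U"]
    texts = [value for tag, value in values if tag == "C"]
    for name, text in product(names, texts):
        yield key, name, text


def reduce_side_join(users, comments):
    grouped = shuffle(map_phase(users, comments))
    return [row for key in sorted(grouped) for row in reduce_phase(key, grouped[key])]


for row in reduce_side_join(users, comments):
    print(row)
```

Even in this toy form you can see why the pattern deserves some care: every record from both inputs passes through the shuffle, and the reducer holds all values for a key before it can pair them up. That is presumably the sort of thing the authors have in mind when they say a developer who understands the join will make smarter decisions.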
Remaining mindful of the fact that the title of this book is admittedly open-ended, I mention here the table of contents to give you a flavor of the topics covered
  • Chapter 1. Design Patterns and MapReduce
  • Chapter 2. Summarization Patterns
  • Chapter 3. Filtering Patterns
  • Chapter 4. Data Organization Patterns
  • Chapter 5. Join Patterns
  • Chapter 6. Metapatterns
  • Chapter 7. Input and Output Patterns
  • Chapter 8. Final Thoughts and the Future of Design Patterns
With the caveats noted above, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems is a book absolutely worth exploring!


Finally, let's segue to the land of real-time, streaming data :)

This next book is impeccably written in an eminently thoughtful style—Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis. The author is the CTO of Spongecell, and he has a Ph.D. in Statistics from Harvard University.

No doubt, with enough determination and time, one can do online searches and cobble together a solution to handle real-time, high-volume mega data. But that raises the question, and I'm not questioning anyone's tenacity here: Is that really the ideal strategy? And that's where the book shines. What makes it stand out is the care and thought that have clearly been poured into making this book a one-stop resource for crafting end-to-end solutions for effectively grappling with real-time, high-volume mega data.

Much as I alluded to above, this book is impeccably written. The author has clearly honed his writing skills—quite likely while preparing his dissertation for the Ph.D. that he earned from Harvard University :)

Clearly written books are a godsend, and this superb book is one. In that vein, the author notes with razor-sharp precision the aim of this book
The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. “Real time” applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project have either moved on to other things or simply forgotten why they wanted the data in the first place. By making the projects agile and incremental, this can be avoided as much as possible.
The author weaves into the narratives a lot of pragmatic advice; he has clearly been in the development trenches and done it all. As with the prior book, I mention here the table of contents to give you a flavor of the topics covered in Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
  • Chapter 1: Introduction to Streaming Data (Sources of Streaming Data; Why Streaming Data Is Different; Infrastructures and Algorithms; Conclusion)
Part I: Streaming Analytics Architecture
  • Chapter 2: Designing Real-Time Streaming Architectures (Real-Time Architecture Components; Features of a Real-Time Architecture; Languages for Real-Time Programming; A Real-Time Architecture Checklist; Conclusion)
  • Chapter 3: Service Configuration and Coordination (Motivation for Configuration and Coordination Systems; Maintaining Distributed State; Apache ZooKeeper; Conclusion)
  • Chapter 4: Data-Flow Management in Streaming Analysis (Distributed Data Flows; Apache Kafka: High-Throughput Distributed Messaging; Apache Flume: Distributed Log Collection; Conclusion)
  • Chapter 5: Processing Streaming Data (Distributed Streaming Data Processing; Processing Data with Storm; Processing Data with Samza; Conclusion)
  • Chapter 6: Storing Streaming Data (Consistent Hashing; “NoSQL” Storage Systems; Other Storage Technologies; Choosing a Technology; Warehousing; Conclusion)
Part II: Analysis and Visualization
  • Chapter 7: Delivering Streaming Metrics (Streaming Web Applications; Visualizing Data; Mobile Streaming Applications; Conclusion)
  • Chapter 8: Exact Aggregation and Delivery (Timed Counting and Summation; Multi-Resolution Time-Series Aggregation; Stochastic Optimization; Delivering Time-Series Data; Conclusion)
  • Chapter 9: Statistical Approximation of Streaming Data (Numerical Libraries; Probabilities and Distributions; Working with Distributions; Random Number Generation; Sampling Procedures; Conclusion)
  • Chapter 10: Approximating Streaming Data with Sketching (Registers and Hash Functions; Working with Sets; The Bloom Filter; Distinct Value Sketches; The Count-Min Sketch; Other Applications; Conclusion)
  • Chapter 11: Beyond Aggregation (Models for Real-Time Data; Forecasting with Models; Monitoring; Real-Time Optimization; Conclusion)
  • Introduction (Overview and Organization of This Book; Who Should Read This Book; Tools You Will Need; What's on the Website; Time to Dive In)
In the end, do make a note of the author's point when he reiterates that
The hope is that the reader of this book would feel confident taking a proof-of-concept streaming data project in their organization from start to finish with the intent to release it into a production environment.
All this makes Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data a book that shouldn't be missed ;)


Last, but certainly not least, and continuing in the spirit of frameworks that enable us developers to tackle real-time, streaming data, is a book by Nathan Marz: Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning). The author happens to be the originator of the Lambda Architecture approach to programming in the world of Big Data, and he deploys his considerable knowledge of this approach in explaining the details.

This book dives deep into the concepts underlying the Lambda Architecture (which is what the author dubbed the approach that he formalized during his years working at the startup BackType) along with, importantly, many illustrative examples which are nicely supplemented by code snippets. The author puts it succinctly when he notes that
This book is the result of my desire to spread the knowledge of the Lambda Architecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. I hope you treat this book as a journey—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun.
As an aside—confessing here my fondness for Clojure, the Lisp that runs on the JVM—I couldn't help but resonate with the following sentiments echoed by Nathan Marz in the Acknowledgments section of Big Data: Principles and Best Practices of Scalable Realtime Data Systems, where he notes that
Rich Hickey has been one of my biggest inspirations during my programming career. Clojure is the best language I have ever used, and I’ve become a better programmer having learned it. I appreciate its practicality and focus on simplicity. Rich’s philosophy on state and complexity in programming has influenced me deeply.
In sum, this is a worthwhile book, nicely structured into theory and illustration chapters.

In the end, and as I mentioned at the outset, I invite your comments. Having now read my brief take on each of the books above...
  • Do you find that your experience of reading any of these books was different? 
  • Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned your way around the Big Data ecosystem. 
  • Did I omit any of your favorite Big Data book(s)? 
  • I've covered only a partial list of the Big Data books that I've read, limited as you can imagine I am by the time available...
As with my prior post, which contains a set of book vignettes—those pertaining to the finest and most useful books on Scala in print—my aim here, too, in sharing these brief reviews remains the same, albeit on a different subject (Big Data) this time: I hope these vignettes will help you in selecting your resources well, and help you in your journey to grokking the Big Data solution space!

Bon voyage, and I leave you with an obligatory photo of a section of one of my bookshelves—one that's, um, rather biased toward Big Data material in a statistically significant way, eh ;)


  1. Yesterday evening, as I was thumbing through the pages of Peter Seibel's excellent compilation of interviews with top programmers, entitled Coders at Work—specifically the insights shared by the well-known Java expert, Joshua Bloch—the following thoughts shared by Bloch leapt from the pages as I related them back to the ever-evolving, protean Big Data landscape:

    "There are multiple communities associated with Java and with other programming languages, too. When there aren't, it's usually a sign that the language is either a niche language or an immature language. As a language grows and prospers, it naturally appeals to a more diverse community. And furthermore, as the amount of investment in a language grows, the value of it grows".

    "It's like Metcalfe's law: the value of a network is proportional to the square of the number of users. The same is true of languages... Even if Java isn't the perfect language for you, there are all these incidental benefits to using it, so you form your own community that figures out how to do numeric programming in Java, or whatever kind of programming you want to do".

    I found myself wondering, Is the reverse effect equally true? That is, can a stellar computing system such as Apache Spark infuse new life into a programming language? More specifically, can Spark, which is written in the functional programming language Scala, spark (pun intended) the widespread adoption of Scala? And I'm thinking here of the following, eminently plausible idea that Josh Wills (along with his co-authors) delineates nicely in the superb book Advanced Analytics with Spark (O'Reilly):

    "...we think that learning how to work with Spark in the same language in which the underlying framework is written has a number of advantages...".

    Does that ring true with you, too? Is the adoption of Spark, in turn, rejuvenating the adoption of Scala? Is Spark going to be the killer framework for Scala? Will it do for Scala what Spring did for (enterprise) Java and what Rails did for Ruby?

    Speaking for myself, my two cents is that I'm plenty grateful for having hacked Scala code (in my personal time) for several years now ;)

  2. Awesome blog! An out of the box roadmap for anyone wishing to dive into a learning journey of Big Data.

    1. Thanks for your comment, Nadia - I'm gratified to hear that my goal of providing a roadmap for the journey of learning all things related to Big Data has been fulfilled and that it is appreciated!
