Saturday, August 1, 2015

Best Hadoop and Spark Books


In God we trust. All others must bring data.
~ W. Edwards Deming (statistician, author, and lecturer)
The goal is to turn data into information, and information into insight.
~ Carly Fiorina (former president, and chair of Hewlett-Packard)
All in all it's just another brick in the wall
All in all you're just another brick in the wall.
~ Pink Floyd (lyrics from Another Brick in the Wall, Part 2)

My prior post was on Scala which—along with Java and Clojure—is a language that I find highly expressive and helpful for my programming needs. This weekend, let's move on to another topic and see what can be done to help you in your journey to grokking the Big Data solution space :)

I do believe that the two key questions which are fueling the torrent that this age of Big Data has evolved into are these:
  1. How best to handle and work with data at super-mega scale?
  2. How can one best decipher and understand that high-volume data and, in turn, convert it into a competitive advantage?
Living as we do today, well into the age of Big Data, it sure helps to have some guidance from those who are at the frontline of the endeavors that revolve around these two questions. Online resources are indispensable and fantastic in their own right, especially for cutting-edge updates. But what about times when you simply want to sit down and really absorb the wisdom of our Big Data sages (the underlying conceptual infrastructure that powers the Big Data machinery) in a more sustained and methodical way?

That leads me to share some thoughts on the finest books on the subject—primarily on Spark and Hadoop, plus a smattering of others—that have proved especially helpful to me as I drank from the Kool Aid of Big Data knowledge ;)
  1. Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O'Reilly), by Josh Wills, Sandy Ryza, et al.
  2. Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) by Holden Karau, et al.
  3. Hadoop: The Definitive Guide, 4th Edition (O'Reilly) by Tom White.
  4. Hadoop in Practice, 2nd Edition (Manning), by Alex Holmes.
  5. Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al.
  6. Data Scientists at Work (Apress) by Sebastian Gutierrez.
  7. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O'Reilly), by Donald Miner and Adam Shook.
  8. Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis.
  9. Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning), by Nathan Marz.
But first, I invite your comments. Once you've read my brief take on each of the books below...
  • Did you sense that your experience of reading any of these books was different? 
  • Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned Big Data and its ecosystem. 
  • Did I leave out any of your favorite Big Data book(s)? 
  • I've covered only a partial list of the Big Data books that I've read, necessarily limited by the time available...

#1

If you're looking for the best-written and most exciting Big Data book of the year, look no further than this one: Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O'Reilly), by Josh Wills, Sandy Ryza, et al. This book provides sparkling clear insights into the value proposition that Apache Spark brings to the Big Data (metaphorical) table ;)

You get to understand how this open source project makes distributed programming eminently accessible to data scientists. It goes on to show how Spark—while maintaining MapReduce’s linear scalability and fault tolerance—extends it in three important ways:
  1. Its engine can execute a more general directed acyclic graph (DAG) of operators.
  2. It complements this capability with a rich set of transformations.
  3. It extends its predecessors with in-memory processing. Its Resilient Distributed Dataset (RDD) abstraction enables developers to materialize any point in a processing pipeline into memory across the cluster.
One particularly telling remark that the authors make has to do with how, "...With respect to the pertinence of munging and ETL, Spark strives to be something closer to the Python of big data than the Matlab of big data". Spark’s in-memory caching makes it equally well suited to programming in the large and in the small. And what's possibly most exciting is how Spark bridges the gap between the avenues of exploratory analytics and production (i.e. operational) analytics! On top of that, Spark's tight integration with the Hadoop ecosystem makes it an eminently accessible and attractive framework.
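
To make the idea of materializing an intermediate point of a pipeline in memory a little more concrete, here is a minimal, pure-Python sketch. It is not Spark's API: the `LazyDataset` class and its `map`, `filter`, `cache`, and `collect` methods are hypothetical names chosen for illustration, and everything runs in a single process rather than across a cluster. It only shows the general pattern of lazily chained transformations whose result can be kept in memory and reused.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD API)."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute          # how to (re)build this dataset from its parent
        self._cache_enabled = False
        self._cached: Optional[List] = None

    @classmethod
    def from_list(cls, items: List) -> "LazyDataset":
        return cls(lambda: iter(items))

    def map(self, fn: Callable) -> "LazyDataset":
        # Nothing runs here; we only record a new node that depends on self.
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self) -> "LazyDataset":
        # Mark this point in the pipeline to be kept in memory once computed.
        self._cache_enabled = True
        return self

    def _iterate(self) -> Iterable:
        if self._cache_enabled:
            if self._cached is None:
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def collect(self) -> List:
        # An "action": this is what actually triggers the computation.
        return list(self._iterate())


calls = {"parse": 0}   # counts how often the expensive step runs


def parse(line: str) -> int:
    calls["parse"] += 1
    return int(line)


# Build the pipeline once, mark the parsed data for caching, then reuse it twice.
parsed = LazyDataset.from_list(["1", "2", "3", "4"]).map(parse).cache()
evens = parsed.filter(lambda n: n % 2 == 0).collect()
squares = parsed.map(lambda n: n * n).collect()
```

In this toy version, `parse` runs four times in total even though two results are derived from `parsed`; dropping the `.cache()` call would make it run eight times, once per element for each `collect()`.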

If the preceding themes strike a chord with you, and if you're looking for deep dives to get a feel for using Spark to do complex analytics on massive data sets, look no further than this book. It covers the entire pipeline in an exceptionally clear and engaging style. A diverse set of domains is covered in no fewer than nine case studies, each of which gets a chapter of its own. These chapters make up the bulk of this stellar book.

IMHO, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is an ideal second book on Spark—for your initial forays into this subject, the next book on this list would be an excellent first book on Spark. But if you're determined to drink from the proverbial firehose, you really can't go wrong reading them side-by-side :)

Oh, and the most fun and standout chapters in this altogether stellar book are those on
  • Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  • Understanding Wikipedia with Latent Semantic Analysis
  • Analyzing Co-occurrence Networks with GraphX
Finally, I mention here the table of contents to give you a fuller flavor of the topics covered
  • Chapter 1. Analyzing Big Data
  • Chapter 2. Introduction to Data Analysis with Scala and Spark
  • Chapter 3. Recommending Music and the Audioscrobbler Data Set
  • Chapter 4. Predicting Forest Cover with Decision Trees
  • Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
  • Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
  • Chapter 7. Analyzing Co-occurrence Networks with GraphX
  • Chapter 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  • Chapter 9. Estimating Financial Risk through Monte Carlo Simulation
  • Chapter 10. Analyzing Genomics Data and the BDG Project
  • Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
All in all, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is a book that's got me really excited about the possibilities of this remarkable platform!

#2

The book Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) by Holden Karau, et al. is a welcome addition to the library of those starting out in their quest to grok the amazing framework that Apache Spark is. What I appreciated the most about this book is the thorough and pragmatic coverage of Apache Spark, beginning with an invitation to understand the value that Spark offers by extending MapReduce:
  1. Spark brings value by its ease-of-use (fire up Spark on your laptop, and start using its high-level API, which enables you to focus on your domain-specific computations).
  2. Spark enables interactive use for tackling complex algorithms.
  3. And in Spark you get a general-purpose computation engine: think here of combining multiple types of computations (such as ML, text processing, and SQL querying) that would previously have necessitated a bunch of different engines.
For us software types, the following observations by the authors are worth bringing out so you can best decide whether the targeted value that this book offers is for you:
This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark’s rich collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond problems that fit on a single machine while using their statistical background. Engineers, meanwhile, will learn how to write general-purpose distributed programs in Spark and operate production applications. Engineers and data scientists will both learn different details from this book, but will both be able to apply Spark to solve large distributed problems in their respective fields. 
The second group this book targets is software engineers who have some experience with Java, Python, or another programming language. If you are an engineer, we hope that this book will show you how to set up a Spark cluster, use the Spark shell, and write Spark applications to solve parallel processing problems (italicized by me for emphasis). If you are familiar with Hadoop, you have a bit of a head start on figuring out how to interact with HDFS and how to manage a cluster, but either way, we will cover basic distributed execution concepts.
The full chapter devoted to Spark’s core abstraction for doing data-intensive computations—the resilient distributed dataset (aka RDD)—is a standout. The other standout chapter is the one that gets into the nitty gritty of configuring a Spark application, and which also provides an overview of tuning and debugging Spark workloads in production.

Learning Spark: Lightning-Fast Big Data Analysis is richly illustrated with diagrams and tables, and there's no shortage of helpful code snippets to get you going with Spark :)


#3

Let's segue from Spark to Hadoop land now, beginning with a remarkable book: Hadoop: The Definitive Guide, 4th Edition (O'Reilly) by Tom White—This crystal clear and eminently readable book is perhaps the grand-daddy of all Big Data books out there! Now in its fourth edition, this book is the paragon of sparkling clear prose and unambiguous explanations of all things Hadoop, which of course we tech types crave :)

When reading books, we've all gotten used to doing the inevitable Google searches periodically (to compensate for the equally inevitable gaps in the narrative of any given technology book), but this book is mercifully free of the aforesaid read-some, search-online-some, resume-reading syndrome, yay!

So if you're ready to drink deep at the Hadoop pool, you simply can't go wrong with this book. Allow me to elaborate: In the Preface, the author elegantly traces the genesis of this very point—sparkling clear prose and unambiguous readability—to the works of the renowned mathematics writer, Martin Gardner, and adds
Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.  
But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.
You immediately get the sense that this book is a no-nonsense, friendly, and engaging guide to Hadoop and its ecosystem; rest assured that you'll finish this book without the author letting you down one bit. In fact, in keeping with this very theme, the first chapter gives a pleasant tour (a lay of the land, if you will) of the entirety of Hadoop: The Definitive Guide, 4th Edition, which runs to no fewer than 756 pages. Be sure to use that indispensable first chapter to make the most of this remarkable book. As the author explains,
The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through the book that allow you to skip chapters that aren’t needed to read later ones.
Further along, a bird's eye view is provided for each of the chapters in the five main parts that make up this book. This summary is accompanied by a lovely flowchart of the paths that can be taken through the contents—Thoughtful design, with the reader in mind, is the hallmark of the entire book. As a reader, I felt secure in the knowledge of learning Hadoop from a master of the art. In this regard, the following remarks (in the Foreword) by Doug Cutting—who, along with Mike Cafarella, created Hadoop in 2005—are quite telling, and reflect just how friendly and engaging a guide this book is to all things Hadoop
Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. 
Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.
Don't miss this work (Hadoop: The Definitive Guide, 4th Edition) by the leading popularizer of Hadoop, who is doing for Hadoop what Martin Gardner has done for mathematics!


#4

This next title is an excellent second book on Hadoop: Hadoop in Practice, 2nd Edition (Manning), by Alex Holmes. Now in its second edition, this book got a thorough update to cover changes and new features in Hadoop, including MapReduce 2. New chapters have been added to cover YARN, Kafka, Impala, and Spark SQL as they each relate to Hadoop. The new edition builds on the strengths of the first (approximately 100 intermediate-to-advanced Hadoop examples in a superb problem-and-solution format) while maintaining the high quality of the code examples.

In the About this Book section, after mentioning how, with its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets, the author goes on to identify the target audience of this book:
This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.  
Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition by Joshua Bloch (Addison-Wesley).
One thing I really, really like about this book is the abundance of useful diagrams and code snippets, all of which are profusely annotated with thoughtful comments! I would say that the barrier-to-entry to this book is not all that high—hastening to add that this is most emphatically not the same as saying that the contents are trifling—so if you're determined, don't shy away from tackling this book (along with, importantly, having an introductory book by your side, such as the fine book entitled Hadoop: The Definitive Guide, by Tom White, and which is also reviewed above).

Very briefly, here is a rundown of the topics covered in this book:
1. Background and fundamentals: Chapter 1. Hadoop in a heartbeat, Chapter 2. Introduction to YARN.  
2. Data logistics: Chapter 3. Data serialization—working with text and beyond, Chapter 4. Organizing and optimizing data in HDFS, Chapter 5. Moving data into and out of Hadoop.  
3. Big data patterns: Chapter 6. Applying MapReduce patterns to big data, Chapter 7. Utilizing data structures and algorithms at scale, Chapter 8. Tuning, debugging, and testing.  
4. Beyond MapReduce: Chapter 9. SQL on Hadoop, Chapter 10. Writing a YARN application.
This book (Hadoop in Practice, 2nd Edition) is packed with helpful material which, far from being cluttered in any way, is pleasingly organized and makes for smooth reading and a rewarding learning experience.


#5

Once comfortable with the Hadoop paradigm, you'll be able to appreciate the gem of a book we've got in this next title: Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al. The authors have assembled a first-class collection of design expertise narratives.

In my mind, the key to understanding the value in this book lies in appreciating the following observation, which the authors make in the introductory chapter
Although many publications emphasize the fact that Hadoop hides infrastructure complexity from business developers, you should understand that Hadoop extensibility is not publicized enough... Hadoop’s implementation was designed in a way that enables developers to easily and seamlessly incorporate new functionality into Hadoop’s execution. 
A significant portion of this book is dedicated to describing approaches to such customizations, as well as practical implementations. These are all based on the results of work performed by the authors.
They go on to explain cogently the reasons why great emphasis is placed on MapReduce code throughout the book. So if you approach this book with the mindset that the narratives will directly revolve around MapReduce, you'll glean quite a bit of value out of this book. Their explanation of the MapReduce paradigm, as well as of its nuts-and-bolts mechanisms, really is top notch.

The standout chapters are the following:
  • Processing Your Data with MapReduce
  • Customizing MapReduce Execution
  • Hadoop Security
  • Building Enterprise Security Solutions for Hadoop Implementations
The Appendix toward the end of Professional Hadoop Solutions is especially rich and useful. Overall, I'm glad to have found this book!


#6

And now let's segue from Hadoop to a foray into Data Science kingdom proper :)

But first a fair warning is in order about this next book: Once you start reading it, you're going to have a terribly hard time putting it down or, for that matter, doing anything else before you've read it all! Such was my experience of reading (and re-reading) this page-turner of a book: Data Scientists at Work (Apress) by Sebastian Gutierrez.

Consider this... We have these marvelous frameworks—in Spark, Hadoop, Storm and others—but surely they were not created in some ethereal vacuum. Right, these frameworks were of course created in the service of genuine business needs, and to solve pressing problems that folks were facing. So if you're looking for the scoop on this nexus (i.e. the potent symbiosis between the aims of Data Science and what Big Data has to offer), this is the book for you.

The corpus of this book is made up of in-depth interviews of 16 gifted data scientists. What makes these interviews incredibly engaging is the spectacularly good job done by the interviewer (the author of this book), Sebastian Gutierrez. His academic training is from MIT—where he earned a BS in Mathematics—and he is a data entrepreneur who has founded three data-related companies.

The pointed and evocative questions asked throughout the book could only have come from someone who knows the pragmatics of the Data Science field inside-out! And therein lies the immense value of this book: Detailed answers by 16 top data scientists as they shed light on the human side of data science, their thoughts on how this field is evolving, where it's headed, plus plenty of straight-from-the-trenches stories about their work.

While the quality of the interviews is uniformly excellent, the standout interviews in my mind are the ones with these data scientists who are doing stellar work
To give you a flavor of the interviews (each of which is given its own chapter), here, ever so briefly, is something from Claudia, who is the Chief Scientist at Dstillery. She teaches a high-level overview course on data mining for the NYU Stern MBA program to, in her own words, "...give people a good understanding of what the opportunities are and how to manage them instead of really teaching them how to do it". She has taught at NYU, MIT, Wharton, and Columbia. In response to the interview question in the book ("What about this work is interesting and exciting for you?"), Claudia noted
I have always been fascinated by math puzzles and puzzles in general. The work that I do is a real-world version of puzzles that life just presents. Data is the footprint of real life in some form, and so it is always interesting. It is like a detective game to figure out what is really going on. Most of my time I am debugging data with a sense of finding out what is wrong with it or where it disagrees with my assumption of what it was supposed to have meant. So these are games that I am just inherently getting really excited about.
In the end, here is the book's author (Sebastian Gutierrez) himself, describing in the Introduction the essence of his approach in putting together the interviews for this book
My interviewing method was designed to ask open-ended questions so that the personalities and spontaneous thought processes of each interviewee would shine through clearly and accurately. My aim was to get at the heart of how they came to be data scientists, what they love about the field, what their daily work lives entail, how they built their careers, how they developed their skills, what advice they have for people looking to become data scientists, and what they think the future of the field holds.
Some 20 years ago, when I was finishing grad school—at that time, I earned an MS degree in electrical engineering from Texas A&M University—we didn't call work such as my dissertation (Noise-tolerant Software Method for Traffic Sign Recognition) Data Science. But in several ways, while I was reading the fine interviews in this book, I sure was reminded of the algorithms I worked out back then: Various AI programming techniques (neural networks primarily, such as the Back-propagation Neural Network and the Adaptive Resonance Theory model, aka ART2). Good stuff, and enough reminiscing, for that matter :)

So Data Scientists at Work is a fantastic book overall, if this sort of thing piques your interest.


#7

Segueing right back to Hadoop now, the title of the next book is decidedly open-ended: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O'Reilly), by Donald Miner and Adam Shook. Given the open-ended title, allow me to elaborate on the gist of this fine book...

The authors are clearly experts in the Hadoop ecosystem, and what they've put together is more than what you'll find in the endearing O'Reilly “cookbook” series. Thus, they don’t call out specific problems and accompanying solutions. Instead, they share the lessons that they have learned along the way to becoming experts in the Hadoop ecosystem. Note, too, that this book is mostly about the analytics side of Hadoop and MapReduce.

And they assume that you're already familiar with how Hadoop and MapReduce work, so they don't dive into the details of the APIs which they use in this book—Those topics have already been covered thoroughly in other books, and they focus on analytics. In their own words
The motivation for us to write this book was to fill a missing gap we saw in a lot of new MapReduce developers. They had learned how to use the system, got comfortable with writing MapReduce, but were lacking the experience to understand how to do things right or well. The intent of this book is to prevent you from having to make some of your own mistakes by educating you on how experts have figured out how to solve problems with MapReduce. So, in some ways, this book can be viewed as an intermediate or advanced MapReduce developer resource, but we think early beginners and gurus will find use out of it.
One thing I appreciated a lot was the way the authors answer the question, "So why should we use Java MapReduce in Hadoop at all when we have options like Pig and Hive?". They point out two core reasons for spending time explaining how to implement something in hundreds of lines of code when the same can be accomplished in a couple lines with, say, Pig and Hive. In their own words
First, there is conceptual value in understanding the lower-level workings of a system like MapReduce. The developer that understands how Pig actually performs a reduce-side join will make smarter decisions. Using Pig or Hive without understanding MapReduce can lead to some dangerous situations.... 
Second, Pig and Hive aren’t there yet in terms of full functionality and maturity (as of 2012). It is obvious that they haven’t reached their full potential yet. Right now, they simply can’t tackle all of the problems in the ways that Java MapReduce can.
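
The reduce-side join that the authors mention can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's Java API and not a description of how Pig implements the join internally. The function names (`map_users`, `map_orders`, `shuffle`, `reduce_join`) and the sample records are made up for the example, and it assumes an inner join on a user id.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample inputs: (user_id, name) and (user_id, item).
users = [(1, "ada"), (2, "grace"), (3, "linus")]
orders = [(1, "book"), (1, "pen"), (2, "laptop"), (4, "phone")]

Tagged = Tuple[str, str]  # (source tag, payload)


def map_users(records: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, Tagged]]:
    # Map phase: emit the join key, tagging each value with its source.
    for user_id, name in records:
        yield user_id, ("U", name)


def map_orders(records: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, Tagged]]:
    for user_id, item in records:
        yield user_id, ("O", item)


def shuffle(pairs: Iterable[Tuple[int, Tagged]]) -> Dict[int, List[Tagged]]:
    # Shuffle phase: group every value that shares a key, as the framework would.
    groups: Dict[int, List[Tagged]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key: int, values: List[Tagged]) -> Iterator[Tuple[int, str, str]]:
    # Reduce phase: split values by tag, then pair them up (inner join).
    names = [payload for tag, payload in values if tag == "U"]
    items = [payload for tag, payload in values if tag == "O"]
    for name in names:
        for item in items:
            yield key, name, item


mapped = list(map_users(users)) + list(map_orders(orders))
grouped = shuffle(mapped)
joined = sorted(row for key in grouped for row in reduce_join(key, grouped[key]))
```

Every record from both inputs travels through the shuffle here, which is what makes this style of join general but potentially expensive on large inputs, and is the kind of detail that is easy to miss when the join is a single line of Pig or Hive.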
Remaining mindful of the fact that the title of this book is admittedly open-ended, I mention here the table of contents to give you a flavor of the topics covered
  • Chapter 1. Design Patterns and MapReduce
  • Chapter 2. Summarization Patterns
  • Chapter 3. Filtering Patterns
  • Chapter 4. Data Organization Patterns
  • Chapter 5. Join Patterns
  • Chapter 6. Metapatterns
  • Chapter 7. Input and Output Patterns
  • Chapter 8. Final Thoughts and the Future of Design Patterns
With the caveats noted above, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems is a book absolutely worth exploring!


#8

Finally, let's segue to the land of real-time, streaming data :)

This next book is impeccably written in an eminently thoughtful style—Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis. The author is the CTO of Spongecell, and he has a Ph.D. in Statistics from Harvard University.

No doubt, with enough determination and time, one can do online searches and cobble together a solution to handle real-time, high-volume mega data. But that raises the question, and I'm not questioning anyone's tenacity here: Is that really the ideal strategy? And that's where the book shines. What makes it stand out is the care and thought that have clearly been poured into making this book a one-stop resource for crafting end-to-end solutions for effectively grappling with real-time, high-volume mega data.

Much as I alluded to above, this book is impeccably written. The author has clearly honed his writing skills—quite likely while preparing his dissertation for the Ph.D. that he earned from Harvard University :)

Clearly written books are a godsend, and this superb book is one of them. In that vein, the author notes with razor-sharp precision the aim of this book
The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. “Real time” applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project have either moved on to other things or simply forgotten why they wanted the data in the first place. By making the projects agile and incremental, this can be avoided as much as possible.
The author weaves into the narratives a lot of pragmatic advice; he has clearly been in the development trenches and done it all. As with the prior book, I mention here the table of contents to give you a flavor of the topics covered in Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
  • Introduction (Overview and Organization of This Book; Who Should Read This Book; Tools You Will Need; What's on the Website; Time to Dive In)
  • Chapter 1: Introduction to Streaming Data (Sources of Streaming Data; Why Streaming Data Is Different; Infrastructures and Algorithms; Conclusion)
Part I: Streaming Analytics Architecture
  • Chapter 2: Designing Real-Time Streaming Architectures (Real-Time Architecture Components; Features of a Real-Time Architecture; Languages for Real-Time Programming; A Real-Time Architecture Checklist; Conclusion)
  • Chapter 3: Service Configuration and Coordination (Motivation for Configuration and Coordination Systems; Maintaining Distributed State; Apache ZooKeeper; Conclusion)
  • Chapter 4: Data-Flow Management in Streaming Analysis (Distributed Data Flows; Apache Kafka: High-Throughput Distributed Messaging; Apache Flume: Distributed Log Collection; Conclusion)
  • Chapter 5: Processing Streaming Data (Distributed Streaming Data Processing; Processing Data with Storm; Processing Data with Samza; Conclusion)
  • Chapter 6: Storing Streaming Data (Consistent Hashing; “NoSQL” Storage Systems; Other Storage Technologies; Choosing a Technology; Warehousing; Conclusion)
Part II: Analysis and Visualization
  • Chapter 7: Delivering Streaming Metrics (Streaming Web Applications; Visualizing Data; Mobile Streaming Applications; Conclusion)
  • Chapter 8: Exact Aggregation and Delivery (Timed Counting and Summation; Multi-Resolution Time-Series Aggregation; Stochastic Optimization; Delivering Time-Series Data; Conclusion)
  • Chapter 9: Statistical Approximation of Streaming Data (Numerical Libraries; Probabilities and Distributions; Working with Distributions; Random Number Generation; Sampling Procedures; Conclusion)
  • Chapter 10: Approximating Streaming Data with Sketching (Registers and Hash Functions; Working with Sets; The Bloom Filter; Distinct Value Sketches; The Count-Min Sketch; Other Applications; Conclusion)
  • Chapter 11: Beyond Aggregation (Models for Real-Time Data; Forecasting with Models; Monitoring; Real-Time Optimization; Conclusion)
In the end, do make a note of the author's point when he reiterates that
The hope is that the reader of this book would feel confident taking a proof-of-concept streaming data project in their organization from start to finish with the intent to release it into a production environment.
All this makes Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data a book that shouldn't be missed ;)


#9

Last, but certainly not least, and continuing in the spirit of frameworks that enable us developers to tackle real-time, streaming data, is a book by Nathan Marz: Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning). The author happens to be the originator of the Lambda Architecture approach to programming in the world of Big Data, and he deploys his considerable knowledge of this approach in explaining the details.

This book dives deep into the concepts underlying the Lambda Architecture (which is what the author dubbed the approach that he formalized during his years working at the startup BackType) along with, importantly, many illustrative examples which are nicely supplemented by code snippets. The author puts it succinctly when he notes that
This book is the result of my desire to spread the knowledge of the Lambda Architecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. I hope you treat this book as a journey—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun.
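
For readers new to the term, the core idea usually associated with the Lambda Architecture is that a query merges a precomputed batch view with a small realtime view covering data that arrived after the last batch run. The plain-Python sketch below illustrates only that merge step. The event format, the `batch_cutoff` timestamp, and all function names are assumptions made for this example and are not taken from the book.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (timestamp, page_url); hypothetical event shape

# The append-only master dataset of raw events (made-up sample data).
master_dataset: List[Event] = [
    (100, "/home"), (105, "/docs"), (110, "/home"),
    (205, "/home"), (210, "/blog"),
]


def count_pageviews(events: Iterable[Event]) -> Counter:
    # The same pure function is applied by both layers in this sketch.
    return Counter(url for _, url in events)


def build_batch_view(events: Iterable[Event], batch_cutoff: int) -> Counter:
    # Batch layer: recompute the view from all data up to the last batch run.
    return count_pageviews(e for e in events if e[0] <= batch_cutoff)


def build_realtime_view(events: Iterable[Event], batch_cutoff: int) -> Counter:
    # Speed layer: cover only the events the batch view has not absorbed yet.
    # (A real speed layer would update incrementally; this toy recomputes.)
    return count_pageviews(e for e in events if e[0] > batch_cutoff)


def query(url: str, batch_view: Counter, realtime_view: Counter) -> int:
    # Query time: merge the two views.
    return batch_view[url] + realtime_view[url]


batch_cutoff = 200  # assumed timestamp of the most recent batch run
batch_view = build_batch_view(master_dataset, batch_cutoff)
realtime_view = build_realtime_view(master_dataset, batch_cutoff)
home_views = query("/home", batch_view, realtime_view)
```

Because the batch view is always rebuilt from the raw, immutable events, a bug in the view logic can be fixed by correcting the function and recomputing, which is one way to read the author's claim about avoiding the complexities of traditional architectures.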
As an aside—confessing here my fondness for Clojure, the Lisp that runs on the JVM—I couldn't help but resonate with the following sentiments echoed by Nathan Marz in the Acknowledgments section of Big Data: Principles and Best Practices of Scalable Realtime Data Systems, where he notes that
Rich Hickey has been one of my biggest inspirations during my programming career. Clojure is the best language I have ever used, and I’ve become a better programmer having learned it. I appreciate its practicality and focus on simplicity. Rich’s philosophy on state and complexity in programming has influenced me deeply.
In sum, this is a worthwhile book, nicely structured into theory and illustration chapters.
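
To make the Lambda Architecture a bit more concrete, here is a minimal sketch in Python of its central idea as I understand it: a batch layer periodically recomputes a view from an immutable, append-only master dataset, a speed layer keeps a small realtime view of the events that the batch view has not absorbed yet, and queries merge the two. This is my own toy illustration, not code from the book. All the names here (master_dataset, compute_batch_view, SpeedLayer, query) are hypothetical, and a real system would expire only the portion of the realtime view that the latest batch run covers, rather than clearing it wholesale.

```python
from collections import Counter

# Hypothetical event shape, for illustration only: (timestamp, url) pairs.
# The master dataset is treated as an immutable, append-only log of raw events.
master_dataset = [(1, "/home"), (2, "/about"), (3, "/home")]


def compute_batch_view(events):
    """Batch layer: recompute pageview counts from scratch over all events."""
    return Counter(url for _, url in events)


class SpeedLayer:
    """Speed layer: incrementally counts events not yet in the batch view."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        _, url = event
        self.realtime_view[url] += 1

    def reset(self):
        # Simplification: drop everything once a batch run has caught up.
        self.realtime_view.clear()


def query(batch_view, realtime_view, url):
    """Query time: merge the batch view and the realtime view."""
    return batch_view[url] + realtime_view[url]


# The batch view covers the first three events.
batch_view = compute_batch_view(master_dataset)

# New events arrive: they are appended to the master dataset and
# also fed to the speed layer so that queries see them right away.
speed = SpeedLayer()
for event in [(4, "/home"), (5, "/pricing")]:
    master_dataset.append(event)
    speed.ingest(event)

before_rerun = query(batch_view, speed.realtime_view, "/home")

# The next batch run absorbs the new events, so the realtime view is dropped.
batch_view = compute_batch_view(master_dataset)
speed.reset()

after_rerun = query(batch_view, speed.realtime_view, "/home")
print(before_rerun, after_rerun)
```

The point of the design is that the query gives the same answer before and after the batch run: the speed layer only has to cover the gap since the last recomputation, so any mistakes it makes are eventually overwritten by the batch layer.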

In the end, and as I mentioned at the outset, I invite your comments. Having now read my brief take on each of the books above...
  • Do you find that your experience of reading any of these books was different? 
  • Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned Hadoop, Spark, and the Big Data ecosystem. 
  • Did I omit any of your favorite Big Data book(s)? 
  • I've covered only a partial list of the Big Data books that I've read, limited, as you can imagine, by the time available...
As with my prior post, which contains a set of book vignettes—those pertaining to the finest and most useful books on Scala in print—my aim here, too, in sharing these brief reviews remains the same, albeit on a different subject (Big Data) this time: I hope these vignettes will help you in selecting your resources well, and help you in your journey to grokking the Big Data solution space!

Bon voyage, and I leave you with an obligatory photo of a section of one of my bookshelves—one that's, um, rather biased toward Big Data material in a statistically significant way, eh ;)









67 comments:

  1. Yesterday evening, as I was thumbing through the pages of Peter Seibel's excellent compilation of interviews with top programmers, entitled Coders at Work—specifically the insights shared by the well-known Java expert, Joshua Bloch—the following thoughts shared by Bloch leapt from the pages as I related them back to the ever-evolving, protean Big Data landscape:

    "There are multiple communities associated with Java and with other programming languages, too. When there aren't, it's usually a sign that the language is either a niche language or an immature language. As a language grows and prospers, it naturally appeals to a more diverse community. And furthermore, as the amount of investment in a language grows, the value of it grows".

    "It's like Metcalfe's law: the value of a network is proportional to the square of the number of users. The same is true of languages... Even if Java isn't the perfect language for you, there are all these incidental benefits to using it, so you form your own community that figures out how to do numeric programming in Java, or whatever kind of programming you want to do".

    I found myself wondering: is the reverse effect equally true? That is, can a stellar computing system such as Apache Spark infuse new life into a programming language? More specifically, can Spark, which is written in the functional programming language Scala, spark (pun intended) the widespread adoption of Scala? I'm thinking here of the following, eminently plausible idea that Josh Wills (along with his co-authors) delineates nicely in the superb book Advanced Analytics with Spark (O'Reilly):

    "...we think that learning how to work with Spark in the same language in which the underlying framework is written has a number of advantages...".

    Does that ring true with you, too? Is the adoption of Spark, in turn, rejuvenating the adoption of Scala? Is Spark going to be the killer framework for Scala? Will it do for Scala what Spring did for (enterprise) Java and what Rails did for Ruby?

    Speaking for myself, my two cents is that I'm plenty grateful for having hacked Scala code (in my personal time) for several years now ;)

  2. Awesome blog! An out of the box roadmap for anyone wishing to dive into a learning journey of Big Data.

    Replies
    1. Thanks for your comment, Nadia - I'm gratified to hear that my goal of providing a roadmap for the journey of learning all things related to Big Data has been fulfilled and that it is appreciated!

  28. - As the author of this blog, I have to sheepishly confess that I'm far behind on responding to the scores of kind, thoughtful, and encouraging comments from all my readers who continue to make time to stop by and read the essays that I post on my blog.
    - While I may not be able to reply to each reader comment, much as I would like to, rest assured that I read every single reader's comment!
    - Please know that encouraging comments such as yours make my day, every day, so thank you :)

  29. The book Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) seems to be outdated as it is based on Spark 1.3. Please let me know if otherwise.
    I found Spark in Action to be up to date with Spark 2.0. Could you please provide your view on that book?

    Replies
    1. - You have asked an excellent question, Saurabh. I'll try my best to answer. Yes, while its clarity remains pristine, the book Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly) is beginning to show its age, and could definitely use a revision (i.e. a second edition).

      - You mentioned another, newer book: Spark in Action (Manning), which is up to date for Spark 2.0. Let me first say this: almost all books from Manning (and I've bought and read literally dozens of them) are uniformly great. It is no coincidence that four of the five books that I recently reviewed in depth, in my post entitled Best Reactive Programming Books, happen to be published by Manning :)

      - Alas, the book you mention, Spark in Action, also published by Manning, is an exception: it left me utterly disappointed because its attempts at explanation were convoluted. It's a mediocre book, and far better alternatives are available, which I'll point out here:

      Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis by Mohammed Guller is new and quite excellent, with lots of clearly annotated code.

      Good luck!

    2. Thanks Akram, it really helps.

    3. Delighted to hear that, Saurabh.

    4. - I can most highly recommend to you, Saurabh, and in fact to all readers, a brand new book entitled High Performance Spark (O'Reilly) by Holden Karau and Rachel Warren.
      - In particular, I encourage you to look up the section in that book with the heading To Be a Spark Expert You Have to Learn a Little Scala Anyway where the authors point out that:

      Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark development. Spark’s documentation can be uneven. However, the readability of the codebase is world-class. Perhaps more than with other frameworks, the advantages of cultivating a sophisticated understanding of the Spark code base is integral to the advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Furthermore, the methods in the RDD class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, reduce, and fold, have nearly identical specifications to their Scala equivalents. Fundamentally Spark is a functional framework, relying heavily on concepts like immutability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming.

  30. Interesting blog post.This blog shows that you have a great future as a content writer.waiting for more updates...
    Replies
    1. - Thanks for the thoughtful comment, Karthi. I am grateful.
      - I invite you to also check out another one of my more recent posts: Best Reactive Programming Books.
      - I think you'll like that post as well :)

  31. It is amazing and wonderful to visit your site.Thanks for sharing this information,this is useful to me...
    Replies
    1. - Please spread the word, Karthi, about this site. My aim is to share what I've learned (from both experience and research) with fellow programmers, software developers, and in fact all readers such as yourself!
      - Delighted that you're finding the material on this site helpful :)

  32. - Folks, your tremendous and vibrant participation (by way of comments) motivates me to mention yet another awesome book, which you will also not want to miss!
    - This one is entitled Hadoop Application Architectures (O'Reilly) by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira. The original designer and implementor of Hadoop (Doug Cutting) writes in his Foreword to Hadoop Application Architectures how

    "...Hadoop’s become the kernel of a complex ecosystem... A wide variety of tools have now been built around this kernel. Some, like HBase and Accumulo, provide online keystores that can back interactive applications. Others, like Flume, Sqoop, and Apache Kafka, help route data in and out of Hadoop’s storage. Improved processing APIs are available through Pig, Crunch, and Cascading. SQL queries can be processed with Apache Hive and Cloudera Impala. Apache Spark is a superstar, providing an improved and optimized batch API while also incorporating real-time stream processing, graph processing, and machine learning. Apache Oozie and Azkaban orchestrate and schedule many of the above."

    - And here is the punch line, more like a pair of punch paragraphs ;-)

    - That's where, toward the end of his Foreword, the original designer and implementor of Hadoop (Doug Cutting) asks the reader:

    "Confused yet? This menagerie of tools can be overwhelming. Yet, to make effective use of this new platform, you need to understand how these tools all fit together and which can help you. The authors of this book have years of experience building Hadoop-based systems and can now share with you the wisdom they’ve gained."

    "In theory there are billions of ways to connect and configure these tools for your use. But in practice, successful patterns emerge. This book describes best practices, where each tool shines, and how best to use it for a particular task. It also presents common-use cases. At first users improvised, trying many combinations of tools, but this book describes the patterns that have proven successful again and again, sparing you much of the exploration."

    - Don't miss this highly readable volume!

  33. Hats off to your presence of mind..I really enjoyed reading your blog. I really appreciate your information which you shared with us.


    Replies
    1. - Thank you, Rose, for those very kind words.
      - I truly enjoy writing and, in turn, am delighted that readers like you enjoy reading my stuff, and benefit from the material that I share in my essays such as this one :-)

  34. Well Said, you have furnished the right information that will be useful to anyone at all time. Thanks for sharing your Ideas.

    Replies
    1. - In turn, Rose, please know that I appreciated your making the time to share your comments, thank you.
      - Do please spread the word about this blog. As I mentioned previously, my aim is to share what I've learned, and continue to learn, in fact, with fellow technologists and readers like you!

  35. Great article about big data. The way of explaining is good, and it is easily understood by all users. Keep sharing more articles about this topic. Software Testing Training in Chennai | Selenium Training in Chennai

    Replies
    1. - Thanks for the thoughtful comment, Melba. I appreciated that.
      - We are all one global family. I noticed in your comment that you do software training in Chennai.
      - I have many good friends from India, and it just so happens that I wrote about the amazing work of Mother Teresa very recently :)
      - You can read more in this more recent post, which I invite you to check out: Best Book on Technical Blogging.

  36. It's interesting that many of the bloggers to helped clarify a few things for me as well as giving.Most of ideas can be nice content.The people to give them a good shake to get your point and across the command.



    Replies
    1. - I appreciated your sharing those thoughts, Ranasing, with the community of readers, including myself. It's great to hear that the results of my research—which I share here of course—are of service and value to readers like yourself.
      - Do please spread the word about this blog so more can benefit as well.

  37. The great service in this blog and the nice technology is visible in this blog. I am really very happy for the nice approach is visible in this blog and thank you very much for using the nice technology in this blog

    Replies
    1. - Thank you, Rose, for the kind words. I am humbled by the kindness shown to me by my readers. Please spread the word about this blog, so more readers can benefit as well!

  38. Wonderful blog!!! I liked the complete article…. great written,Thanks for all the information you have provided…
    Replies
    1. - Thank you, Dubai Raju, for the heartfelt endorsement of my essays!

      - Comments from readers like you make my day, every day :)

  39. Replies
    1. - Thank you, Nutana, for making the time to read this essay, plus sharing (your training offerings) via the comment...
      - While I don't mind at all when readers share links (to their training offerings, etc.), I would ask that you also please contribute to the discussion by sharing your thoughts, observations, and how we can make this a better blog for the reading community, thanks :)
