Geeks With Blogs

Arthur Zubarev Compudicted

Web Crawling and Data Mining with Apache Nutch

In our age of data explosion it is increasingly appealing, if not necessary, to scout the myriad pages of the seemingly shrinking World Wide Web. Even if you are not tasked with crawling a subset of web pages today, you may want to grab a copy of the book Web Crawling and Data Mining with Apache Nutch to be well prepared in advance.

Advantageously, the book is not excessively long, so even if you are in a hurry it will let you cover the desired scope in a short time. Be aware that the book concentrates heavily on making the related software components communicate with each other and devotes a significant portion to general setup, so you may need to check for changes in how the parts are installed or integrated if you work with newer releases of the software involved.

I must give credit to the authors: they have made every effort to showcase Nutch's capabilities while keeping your solution ready to scale. Better yet, Zakir and Abdulbasit empower you by sharing the intricacies of setting up Hadoop and integrating Nutch with several popular key-value and RDBMS data persistence solutions, such as Accumulo and MySQL. However, Nutch crawl optimization is, for some reason, missing.

Thankfully, the book covers index processing, which is essential, but unfortunately, in my opinion, it does not expand enough on a necessary part: Apache Solr.

The book also covers Apache Gora, but leaves out the option to integrate with Cassandra.

In my impression, this book is an ideal fit for IT practitioners and field or infrastructure people who need to quickly deliver a working Nutch prototype environment with various options.

On a less happy note, the book concentrates heavily on infrastructure, so while reading it I wished the authors had better explained where the covered technologies fit. At the least, a description of what Nutch is comprised of, supplemented with real-life usage examples and perhaps a case study or two, would not hurt. At the beginning it also felt as though the book lacked some background preparation for the reader, so at times I had to pause and look up additional information. A list of references and a glossary of terms would be nice to have.

Nevertheless, overall, it is a good read: 4 out of 5 is my verdict.

Posted on Friday, January 10, 2014 4:51 PM

Comments on this post: Web Crawling and Data Mining with Apache Nutch by Dr. Zakir Laliwala and Abdulbasit Shaikh, Packt Publishing Book Review

# re: Web Crawling and Data Mining with Apache Nutch by Dr. Zakir Laliwala and Abdulbasit Shaikh, Packt Publishing Book Review
Hi Arthur,
I could not disagree more.
In late 2013 I was approached by Packt to review this book for them. After reading the first two proposed chapters I was disgusted by the obvious lack of knowledge regarding Nutch in general. I was also utterly appalled by the obvious lack of attention to detail regarding the fact that the overwhelming majority of the content from within these chapters had been plagiarized directly from various publicly available wiki pages... the official Nutch wiki page being one of them.

I would strongly advise anyone thinking about purchasing this book not to do so. Stay as far away from it as you can. If you wish to learn how to use Nutch, go to the Nutch wiki and subscribe to the user@ or dev@ list(s).

If anything this book will do you more harm than good. It is poorly constructed, tells you very little (in a comprehensive manner) about Nutch as a framework and concentrates VERY little on data mining within this context.

I am disappointed to see that you seem to be, even to a degree, positive about this title.


Lewis John McGibbney
Apache Member
Apache Gora V.P
Apache Nutch PMC
Apache Any23 PMC
Left by Lewis John McGibbney on Jan 29, 2014 6:35 AM

# re: Web Crawling and Data Mining with Apache Nutch by Dr. Zakir Laliwala and Abdulbasit Shaikh, Packt Publishing Book Review
Thank you, Lewis, for taking the time to post the comment.

I can understand your frustration with the quality of the book's content, especially as an Apache Nutch PMC member.

Due to time constraints, I was not able to compare the content of the book with the online wiki. Yet, based on my research, there are no other books on the subject from publishers other than Packt.

Thus I believe the book remains useful to those who are short on time or cannot rely on mailing lists.
Left by Arthur Zubarev on Jan 29, 2014 10:11 AM

