Saturday, December 15, 2012

QIZMT

I have been working mostly on Windows for the past 2.5 years, and I badly wanted to get hands-on with some MapReduce framework implementation like Hadoop. For quite some time, I had been thinking of installing and configuring Hadoop on my Windows machine, but the introductory tutorials were daunting: every one of them started with a caution that Hadoop is not officially supported on Windows. Still, I wanted to play with some parallel computing framework. Though I knew about a few of them, I was uncertain which one would be handy for trying out a few freaky things. Then I came across MySpace's Qizmt (pronounced Kiz-Mit). Qizmt is MySpace's implementation of the MapReduce framework. It is free and open source, licensed under the GNU GPL v3. I tried installing Qizmt and had everything ready in a matter of minutes; I ran my first example in about 10 minutes.

Overall, the getting-started experience was easy and smooth. I simply love toolsets and frameworks like this that make your life easy rather than leaving you chasing installation or configuration issues. I ran a simple word count program and everything went fine. Qizmt has a very nice feature (I am not sure if other frameworks readily support it): you can create sample data on the fly and check the correctness of your map and reduce code instantly. All the job definitions are serialized as XML data, even the map and reduce module code; as of now, I am not sure when that code gets compiled. The toolset also comes with a decent debugger that supports viewing the call stack, local variables and debug output. The immediate window and thread window have been really good. All I tried was a single-machine setup; I am yet to do the cluster setup.

Tutorials related to Qizmt can be found here: http://code.google.com/p/qizmt/




Monday, July 30, 2012

The BD


I have always wondered how internet companies like Facebook, Google and Yahoo manage the enormous amount of data that they collect from various parts of their services. Consider the amount of data that Google processes every day: allegedly, the internet giant processes a few petabytes of newly collected data. Collected data may include request logs, link logs, crawled web documents and so on. It is imperative to understand that the valuable information hidden in this data is huge and can be used to take influential decisions. For example, people may need to calculate PageRank or inverted indices for documents and websites from the mined data. A lot of test automation relies on the mined data as well.

Most of these operations seem straightforward. But they are not so facile when the data is really huge (and when I say 'really huge', I mean really, really huge). For example, consider the task of identifying spam links that have a page rank greater than 0.45 from a data source that is around 5 petabytes in size. It is very appealing to think in terms of distributed primitives to solve this problem. But I swear, when you start writing a distributed system to accomplish this tedious task, you will find yourself in a very bad soup debugging multiple processes. Some processes may hang because of other processes in the cluster that they do not even know about. Developers and analysts generally lose track of their original problem and start running behind crazy problems like the ones mentioned above. So is there a way to get out of these messy things and deal with clean interfaces and simple primitives?

Some months ago, I started hearing more about the MapReduce framework from Google. I started reading about it back then but had no time to write down my thoughts. The MapReduce paradigm can potentially bring developers and analysts out of this soup of tedious implementation challenges. Hadoop is an implementation of the MapReduce framework that can run on commodity hardware and solve problems using the MapReduce paradigm. All that the MapReduce framework understands is the Map primitive and the Reduce primitive. You can think of map and reduce as functions that do some computing to solve the problem.

Let us take a simple problem and try to address it using the MapReduce framework. Assume that your document crawler has crawled over all shared documents in your network share and created a single file that is about 1 TB in size. Now you have to give the count of each word inside this big file. For example, if the contents of the file are something like "This is an example example statement", the output should be something like: This - 1, is - 1, an - 1, example - 2, statement - 1. Please note the simplicity of the problem when we are dealing with small data: you just need to write a simple function, not more than 20 lines in any language, to accomplish this. But that is not what we are interested in. We want to solve the same problem when the file is really huge.

Let us logically break the problem down and get into the MapReduce paradigm. I will state things only at a high level so that people can get the big picture. Hadoop comes with a file system called HDFS (the Hadoop Distributed File System). You put your large file into HDFS and Hadoop splits it into chunks. HDFS takes care of replicating the chunks for fault tolerance.
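As a small illustration of that last step, here is a hedged sketch of copying the crawler's output into HDFS using Hadoop's Java FileSystem API. The paths and class name are made up for the example; in practice you can just as well use the hadoop fs -put command from the shell.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        // Connects to the default file system configured in core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: the local crawler output and its destination in HDFS.
        // HDFS transparently splits the file into blocks and replicates each block.
        fs.copyFromLocalFile(new Path("/tmp/crawl-output.txt"),
                             new Path("/data/crawl-output.txt"));
        fs.close();
    }
}
```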
Once the file is in HDFS, we just don't need to worry about things like "What will happen if the file is deleted by some random adversary process?" or "Will my file be available for computing all the time?". Now we have logically split the big file into small ones. Here is how it works: if you can compute the count of each word at each node of the cluster that holds a chunk of the data, then you can combine those partial counts to get the overall count for every word. There are three main phases in the MapReduce framework: Map, Combine and Reduce. This is what the Map and Reduce primitives look like:

    map(String key, String value):
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        int result = 0;
        for each v in values:
            result += v.ToInteger();
        Emit(AsString(result));

As you can see, the map function emits a count of 1 for each word that it sees, and the reduce function sums those counts up and 'Emits' the answer. It can be a bit difficult to understand how to write the Map and Reduce, how to configure the Hadoop cluster and things of that nature; in my next post we will discuss those things in detail (a complete Hadoop version of this word count is sketched below for reference). I believe this post gave a good idea of how we can use the MapReduce framework to process BigData. Stay tuned! Lots more to come this month ;)
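For the curious, here is roughly what the same word count looks like as a complete Hadoop job, written against the org.apache.hadoop.mapreduce API. Treat it as a sketch rather than a tested program; the class names and the input/output paths are placeholders I chose for illustration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every word in the input line, like EmitIntermediate above.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum up the counts emitted for this word by all the mappers.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the Combine phase mentioned above
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, you would run it with something like: hadoop jar wordcount.jar WordCount /data/crawl-output.txt /data/wordcount-output (again, the paths here are only illustrative).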