A3: Data Processing with Map-Reduce

In this assignment, you are going to learn how to write the sort of Map-Reduce programs that would be used at a large Internet search company. For a general strategies for writing Map-Reduce programs, we recommend that you consult the draft textbook Data Intensive Text Processing with Map-Reduce.

Getting Started

You will be using a local installation of Hadoop, so first read and follow the directions for using the CCL Hadoop Cluster, and try out the sample programs. For these problems you are encouraged to use the streaming version with a scripting language such as Python or Perl. You will have to do some fiddly string processing, so pick a language that you are comfortable using.

(We set up CRC accounts for you at the beginning of the semester. If you have any difficulty logging into disc01.crc.nd.edu, then please see Prof. Thain.)

Your primary dataset will be a set of web pages downloaded from the web recently, using a simple web spider, starting at a few university websites. You can find the data in the HDFS directory /public/www. Each file name indicates the host it was downloaded from, and contains the raw HML data. In addition, the data has the following restrictions:

Only host names ending in edu, org, and gov were visited, to avoid anything accidentally unpleasant.

Only the top-level directory of a host is available. For example, if we found a link to www.cse.nd.edu/academics, only www.cse.nd.edu would be visited.

Only a couple of thousand pages are available, since we can't actually store the whole web here.

Working with raw data from the web can be a little messy, because the files are all written by humans: code can be incorrect, links can be broken, words can be misspelled, and so forth. So, the first step is to write some code that can transform raw HTML into data that is useful for more abstract processing. Many other people have done this part, so make good use of manuals and Google to figure it out. Start by writing two programs as follows:

htmltowords: Read HTML on the standard input, remove extraneous items such as tags and punctuation, and emit only simple lowercase words of three or more characters, one per line.

htmltohosts: Read HTML on the standard input, find the A HREF tags, and emit only the hostnames present in those tags, one per line.

Problems

Now you are ready to to process the 'cooked' data in interesting ways. Solve each of the following problems using the Hadoop Map-Reduce framework, by writing simple map and reduce programs. For some problems, you may need to run multiple map or reduce rounds.

Word Count: Produce a ~~single file~~ listing of all words that appear in all documents, each with a count of frequency, sorted by frequency.
Bi-grams: Produce a listing of the top ten bi-grams (pair of adjacent words) in the dataset.
Inverted Index: For each word encountered, produce a list of all hosts in which the word occurs.
Out-Links: For each host, produce a unique list of hosts that it links to.
In-Links: For each host, produce a unique list of hosts that link TO it.
N-Degrees: Produce a listing of all hosts 1 hop from www.nd.edu. Then, produce a listing for 2 hops, 3 hops, and so forth, until the result converges.

Hints

Your mapper can find out the name of the file currently being processed by examining the environment variable mapreduce_map_input_file

What to Turn In

Your dropbox directory is:

/afs/nd.edu/courses/cse/cse40822.01/dropbox/YOURNAME/a3

Turn in all of your programs along with a short writeup named answers.html. The latter should contain, for each problem, links to the map and reduce programs you wrote for that problem and the hadoop command line used to run it.

Do not include the complete output files, because they will be way too big. Instead, include a sample of the output data for verification. For problems 1-2, include the top ten items. For problems 2-5, pick one interesting output file and include the first ten items in that file. For problem 6, explain how many rounds were needed to converge, and how many items were in the output in each round.

This assignment is due on Friday, October 12th at 11:59PM.