Assignment A4: Map-Reduce

In the final assignment, you are going to learn how to write the sort of Map-Reduce programs that would be used at a large Internet search company. For general strategies for writing Map-Reduce programs, I recommend that you consult the draft textbook Data-Intensive Text Processing with MapReduce.

Getting Started

You will be using a local installation of Hadoop, so first read and follow the directions for using the CCL Hadoop Cluster, and try out the sample programs. For these problems you may use either the native Java version, or the streaming version with a scripting language such as Perl or Python. You will have to do some fiddly string processing, so pick a language that you are comfortable using.

Your primary dataset will be a set of web pages downloaded from the web last week by a simple web spider starting at www.nd.edu. You can find the data in the HDFS directory /dthain/edu. The first line of each file gives the hostname from which it was downloaded, followed by the raw HTML data. The data was collected with the following restrictions:

  • Only host names ending in .edu were visited, to avoid anything accidentally unpleasant.
  • Only the top-level directory of a host is available. For example, if we found a link to www.cse.nd.edu/academics, only www.cse.nd.edu would be visited.
  • Only a few thousand pages are available, since we can't actually store the whole web here.

Working with raw data from the web can be a little messy, because the files are all written by humans: code can be incorrect, links can be broken, words can be misspelled, and so forth. So, the first step is to write some code that transforms raw HTML into data that is useful for more abstract processing. Many other people have done this before, so make good use of manuals and Google to figure it out. I suggest that you start by writing two programs as follows (minimal sketches of both appear after this list):
  • htmltowords: Read HTML on the standard input, remove extraneous items such as tags and punctuation, and emit only simple lowercase words of three or more characters.
  • htmltohosts: Read HTML on the standard input, find the A HREF tags, and emit only the hostnames present in those tags.
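
Here is a minimal sketch of htmltowords in Python, using only the standard library. The regular expressions are assumptions about what counts as a tag and a word; raw pages are messy, so expect to refine them.

    #!/usr/bin/env python
    # htmltowords: read raw HTML on stdin and emit simple
    # lowercase words of three or more characters, one per line.

    import re
    import sys

    html = sys.stdin.read()

    # Drop script and style blocks entirely, then strip remaining tags.
    html = re.sub(r'(?is)<(script|style).*?</\1>', ' ', html)
    text = re.sub(r'<[^>]*>', ' ', html)

    # Keep only alphabetic runs of three or more characters.
    for word in re.findall(r'[A-Za-z]{3,}', text):
        print(word.lower())

And a companion sketch of htmltohosts. The href pattern is again an assumption: it only catches absolute http:// links, while real pages also contain relative links, odd quoting, and uppercase tags.

    #!/usr/bin/env python
    # htmltohosts: read raw HTML on stdin and emit the hostname
    # from each absolute link found in an href attribute.

    import re
    import sys

    html = sys.stdin.read()

    for url in re.findall(r'(?i)href\s*=\s*["\']?(http://[^"\'>\s]+)', html):
        # http://www.nd.edu/page splits on '/' into
        # ['http:', '', 'www.nd.edu', 'page']; field 2 is the host.
        print(url.split('/')[2].lower())
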
Problems

Now you are ready to process the 'cooked' data in interesting ways. Solve each of the following problems using the Hadoop Map-Reduce framework, by writing simple map and reduce programs. For some problems, you may need to run multiple map or reduce rounds.

1. Word Count: Produce a single file listing all words that appear in all documents, each with a count of its frequency, sorted by frequency. (A streaming sketch for this problem follows the list.)
2. Inverted Index: For each word encountered, produce a file that lists all hosts in which the word occurs.
3. Bigrams: Produce a listing of the top ten bigrams (pairs of adjacent words) in the dataset.
4. Out-Links: For each host, produce a unique list of hosts that it links to.
5. In-Links: For each host, produce a unique list of hosts that link TO it.
6. N-Degrees: Produce a listing of all hosts 1 hop from www.nd.edu. Then produce a listing for 2 hops, 3 hops, and so forth, until the result converges. (A sketch of one round appears after the list.)
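
For problem 1, a minimal Hadoop streaming sketch might look like the pair below. The word extraction reuses the assumed regexes from htmltowords; the reducer relies on the framework delivering its input sorted by key, so all counts for a word arrive on adjacent lines.

    #!/usr/bin/env python
    # wordcount-map: emit "word<TAB>1" for every qualifying
    # word found on each input line.

    import re
    import sys

    for line in sys.stdin:
        text = re.sub(r'<[^>]*>', ' ', line)
        for word in re.findall(r'[A-Za-z]{3,}', text):
            print('%s\t1' % word.lower())

And the matching reducer:

    #!/usr/bin/env python
    # wordcount-reduce: counts for the same word arrive on
    # adjacent lines, so sum them and emit "word<TAB>count".

    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print('%s\t%d' % (current, total))

Note that this pair sorts by word, not by frequency; one more round whose mapper swaps key and value will give frequency order. To launch a streaming job, pass these scripts to the hadoop-streaming jar with the -input, -output, -mapper, -reducer, and -file options; the exact path to the jar depends on the installation, so check the CCL cluster directions.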

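For problem 6, the usual map-reduce approach is an iterative breadth-first search, which the draft textbook above describes in detail. The record format here, "host<TAB>distance<TAB>comma-separated out-links", is an assumption for illustration: you would build the initial records from your problem 4 output, with www.nd.edu at distance 0 and every other host at a large "infinite" distance, then rerun the round until no distance changes.

    #!/usr/bin/env python
    # ndegrees-map: one BFS round. Pass each record through, and
    # propose distance+1 for every neighbor of a reached host.
    # Records: "host<TAB>distance<TAB>comma-separated-outlinks".

    import sys

    INF = 999999

    for line in sys.stdin:
        host, dist, links = line.rstrip('\n').split('\t')
        dist = int(dist)
        print('%s\t%d\t%s' % (host, dist, links))
        if dist < INF and links:
            for n in links.split(','):
                print('%s\t%d\t' % (n, dist + 1))

The reducer merges the proposals for each host:

    #!/usr/bin/env python
    # ndegrees-reduce: for each host, keep the minimum distance
    # seen and preserve the adjacency list for the next round.

    import sys

    INF = 999999
    current, best, links = None, INF, ''

    for line in sys.stdin:
        host, dist, l = line.rstrip('\n').split('\t')
        if host != current:
            if current is not None:
                print('%s\t%d\t%s' % (current, best, links))
            current, best, links = host, INF, ''
        best = min(best, int(dist))
        if l:
            links = l
    if current is not None:
        print('%s\t%d\t%s' % (current, best, links))
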
What to Turn In

Your dropbox directory is:
/afs/nd.edu/courses/cse/cse40771.01/dropbox/YOURNAME/a4

Turn in all of your programs, along with a short writeup that explains how to invoke them. You should NOT turn in the output data (which is way too big), but briefly explain how many rounds were needed to solve problem 6.

This assignment is due on Thursday, April 15th.