Assignment A4: Map-Reduce

In the final assignment, you are going to learn how to write the sort of Map-Reduce programs that would be used at a large Internet search company. For general strategies for writing Map-Reduce programs, I recommend that you consult the draft textbook Data-Intensive Text Processing with MapReduce.

Getting Started

You will be using a local installation of Hadoop, so first read and follow the directions for using the CCL Hadoop Cluster, and try out the sample programs. For these problems you may use either the native Java version, or the streaming version with a scripting language such as Perl or Python. You will have to do some fiddly string processing, so pick a language that you are comfortable using.

Your primary dataset will be a set of web pages downloaded from the web last week by a simple web spider starting at www.nd.edu. You can find the data in the HDFS directory /dthain/edu. The first line of each file gives the hostname from which it was downloaded, followed by the raw HTML data. The data was collected with the following restrictions:

  • Only host names ending in .edu were visited, to avoid anything accidentally unpleasant.
  • Only the top-level directory of a host is available. For example, if we found a link to www.cse.nd.edu/academics, only www.cse.nd.edu would be visited.
  • Only a few thousand pages are available, since we can't actually store the whole web here.

Working with raw data from the web can be a little messy, because the files are all written by humans: code can be incorrect, links can be broken, words can be misspelled, and so forth. So, the first step is to write some code that transforms raw HTML into data that is useful for more abstract processing. Many other people have done this before, so make good use of manuals and Google to figure it out. I suggest that you start by writing two programs as follows (minimal sketches of both appear after this list):
  • htmltowords: Read HTML on the standard input, remove extraneous items such as tags and punctuation, and emit only simple lowercase words of three or more characters.
  • htmltohosts: Read HTML on the standard input, find the A HREF tags, and emit only the hostnames present in those tags.
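
Here is a minimal sketch of htmltowords in Python, using only the standard library. The regular expressions are assumptions about what counts as a tag and a word; raw pages are messy, so expect to refine them.

    #!/usr/bin/env python
    # htmltowords: read raw HTML on stdin and emit simple
    # lowercase words of three or more characters, one per line.

    import re
    import sys

    html = sys.stdin.read()

    # Drop script and style blocks entirely, then strip remaining tags.
    html = re.sub(r'(?is)<(script|style).*?</\1>', ' ', html)
    text = re.sub(r'<[^>]*>', ' ', html)

    # Keep only alphabetic runs of three or more characters.
    for word in re.findall(r'[A-Za-z]{3,}', text):
        print(word.lower())

And a companion sketch of htmltohosts. The href pattern is again an assumption: it only catches absolute http:// links, while real pages also contain relative links, odd quoting, and uppercase tags.

    #!/usr/bin/env python
    # htmltohosts: read raw HTML on stdin and emit the hostname
    # from each absolute link found in an href attribute.

    import re
    import sys

    html = sys.stdin.read()

    for url in re.findall(r'(?i)href\s*=\s*["\']?(http://[^"\'>\s]+)', html):
        # http://www.nd.edu/page splits on '/' into
        # ['http:', '', 'www.nd.edu', 'page']; field 2 is the host.
        print(url.split('/')[2].lower())
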
Problems

Now you are ready to process the 'cooked' data in interesting ways. Solve each of the following problems using the Hadoop Map-Reduce framework, by writing simple map and reduce programs. For some problems, you may need to run multiple map or reduce rounds.

1. Word Count: Produce a single file listing all words that appear in all documents, each with a count of its frequency, sorted by frequency. (A streaming sketch for this problem follows the list.)
2. Inverted Index: For each word encountered, produce a file that lists all hosts in which the word occurs.
3. Bigrams: Produce a listing of the top ten bigrams (pairs of adjacent words) in the dataset.
4. Out-Links: For each host, produce a unique list of hosts that it links to.
5. In-Links: For each host, produce a unique list of hosts that link TO it.
6. N-Degrees: Produce a listing of all hosts 1 hop from www.nd.edu. Then produce a listing for 2 hops, 3 hops, and so forth, until the result converges. (A sketch of one round appears after the list.)
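
For problem 1, a minimal Hadoop streaming sketch might look like the pair below. The word extraction reuses the assumed regexes from htmltowords; the reducer relies on the framework delivering its input sorted by key, so all counts for a word arrive on adjacent lines.

    #!/usr/bin/env python
    # wordcount-map: emit "word<TAB>1" for every qualifying
    # word found on each input line.

    import re
    import sys

    for line in sys.stdin:
        text = re.sub(r'<[^>]*>', ' ', line)
        for word in re.findall(r'[A-Za-z]{3,}', text):
            print('%s\t1' % word.lower())

And the matching reducer:

    #!/usr/bin/env python
    # wordcount-reduce: counts for the same word arrive on
    # adjacent lines, so sum them and emit "word<TAB>count".

    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print('%s\t%d' % (current, total))

Note that this pair sorts by word, not by frequency; one more round whose mapper swaps key and value will give frequency order. To launch a streaming job, pass these scripts to the hadoop-streaming jar with the -input, -output, -mapper, -reducer, and -file options; the exact path to the jar depends on the installation, so check the CCL cluster directions.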

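For problem 6, the usual map-reduce approach is an iterative breadth-first search, which the draft textbook above describes in detail. The record format here, "host<TAB>distance<TAB>comma-separated out-links", is an assumption for illustration: you would build the initial records from your problem 4 output, with www.nd.edu at distance 0 and every other host at a large "infinite" distance, then rerun the round until no distance changes.

    #!/usr/bin/env python
    # ndegrees-map: one BFS round. Pass each record through, and
    # propose distance+1 for every neighbor of a reached host.
    # Records: "host<TAB>distance<TAB>comma-separated-outlinks".

    import sys

    INF = 999999

    for line in sys.stdin:
        host, dist, links = line.rstrip('\n').split('\t')
        dist = int(dist)
        print('%s\t%d\t%s' % (host, dist, links))
        if dist < INF and links:
            for n in links.split(','):
                print('%s\t%d\t' % (n, dist + 1))

The reducer merges the proposals for each host:

    #!/usr/bin/env python
    # ndegrees-reduce: for each host, keep the minimum distance
    # seen and preserve the adjacency list for the next round.

    import sys

    INF = 999999
    current, best, links = None, INF, ''

    for line in sys.stdin:
        host, dist, l = line.rstrip('\n').split('\t')
        if host != current:
            if current is not None:
                print('%s\t%d\t%s' % (current, best, links))
            current, best, links = host, INF, ''
        best = min(best, int(dist))
        if l:
            links = l
    if current is not None:
        print('%s\t%d\t%s' % (current, best, links))
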
What to Turn In

Your dropbox directory is:
/afs/nd.edu/courses/cse/cse40771.01/dropbox/YOURNAME/a4

Turn in all of your programs, along with a short writeup that explains how to invoke them. You should NOT turn in the output data (which is way too big), but briefly explain how many rounds were needed to solve problem 6.

This assignment is due on Thursday, April 15th.