Assignment A5: Map-Reduce

First read and follow the directions for using the CCL Hadoop Cluster. For these problems you should use the streaming version of Map-Reduce. I recommend that you use Perl to write the mapper and reducer, but you can use whatever language you are most comfortable with. Note that the mapper and the reducer can be simple Unix programs like cat, sort, and wc.
  • Problem 1: Modify the streaming word count example from the directions page. Modify the mapper to clean up the words by removing punctuation, converting to lower case, and only displaying words that consist of the characters a-z. Experiment with this on a small input file (such as /public/warandpeace.txt) to make sure that you get it right. Then run a complete word count on the entire contents of /public/gutenberg/*. This might take a while. Save the top 100 lines of this output, which will not be sorted.

  • Problem 2: Write a second mapper and reducer that read the output of Problem 1 and produce a result sorted by count from high to low. (Be careful: by default, Hadoop sorts in dictionary order, not numeric order.) Save the top 100 lines of this output.

  • Problem 3: Compute the top ten bigrams (pairs of words that occur in sequence) in the Gutenberg data.

  • Problem 4: Write a mapper that will search for instances of a given word and then display the line number and the complete text line from every document in which that word appears. Run it on the Gutenberg data, and give the results for the following words: catafalques, escheats, whitlock.
  • You may also try the problems on the Google/IBM Academic Cluster.
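
The building blocks behind Problems 1 through 3 can be sketched in a few lines. The following is a minimal Python sketch, not a reference solution. All function names are hypothetical, and it assumes that "removing punctuation" means deleting every ASCII punctuation character (so "it's" becomes "its") and that a bigram never spans two input lines. The zero-padded complement key is only one possible way around dictionary-order sorting.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)
_LETTERS = set(string.ascii_lowercase)


def clean_words(line):
    """Lower-case a line, drop punctuation, keep only words made of a-z."""
    words = []
    for token in line.lower().split():
        word = token.translate(_STRIP_PUNCT)
        # Discard empty strings and anything with digits or non-ASCII letters.
        if word and set(word) <= _LETTERS:
            words.append(word)
    return words


def bigrams(words):
    """Return the pairs of adjacent words (assumption: within one line only)."""
    return list(zip(words, words[1:]))


def descending_key(count, width=10):
    """Build a fixed-width key whose dictionary order is high-to-low by count.

    Assumes 0 <= count < 10**width; the width of 10 is an arbitrary choice.
    """
    return f"{10**width - 1 - count:0{width}d}"


def run_mapper(lines, out):
    """Emit one tab-separated 'word<TAB>1' record per cleaned word."""
    for line in lines:
        for word in clean_words(line):
            out.write(f"{word}\t1\n")
```

Used as a streaming mapper, the script would end with a call such as run_mapper(sys.stdin, sys.stdout) after importing sys. The key function matters because plain string keys compare character by character, so "9" sorts after "100" unless the keys are padded to a fixed width.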

    Turning In

    Your dropbox directory is:
    /afs/nd.edu/courses/cse/cse40771.01/dropbox/YOURNAME/a5
    
    In your dropbox directory, make four sub-directories: problem1, problem2, etc. In each directory, save the code for your mapper and reducer, the exact command line used to invoke them, and the final output. Do not turn in more than 100 lines of output per problem.