Hadoop and Map-Reduce in the CCL

Hadoop is an open-source implementation of the Map-Reduce model of data processing. Hadoop currently runs on the DISC cluster at Notre Dame. It is well suited to a restricted class of data-intensive workloads; for more computation-intensive workloads, consider using our campus Condor Pool.
  • Hadoop Filesystem Status
  • Hadoop Map-Reduce Job Status

This document gives a very brief introduction. For more information, see:
  • Hadoop File System Guide
  • Hadoop Map-Reduce Tutorial

    Getting Started

    To get started with Hadoop, ssh to any of the DISC cluster machines:
    ssh disc01.crc.nd.edu
    
    Then set the following environment variables:
    setenv JAVA_HOME /afs/nd.edu/user37/ccl/software/external/java
    setenv HADOOP_HOME /afs/nd.edu/user37/ccl/software/external/hadoop
    setenv PATH ${HADOOP_HOME}/bin:$PATH
    setenv HADOOP_CLASSPATH ${HADOOP_HOME}/hadoop-0.21.0.jar
    setenv HADOOP_CLASSPATH ${HADOOP_CLASSPATH}:${HADOOP_HOME}/hadoop-common-0.21.0.jar
    setenv HADOOP_CLASSPATH ${HADOOP_CLASSPATH}:${HADOOP_HOME}/hadoop-hdfs-0.21.0.jar
    setenv HADOOP_CLASSPATH ${HADOOP_CLASSPATH}:${HADOOP_HOME}/hadoop-mapred-0.21.0.jar
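
    The setenv commands above are for csh-style shells such as tcsh. If your login shell is bash, the equivalent commands are:
    export JAVA_HOME=/afs/nd.edu/user37/ccl/software/external/java
    export HADOOP_HOME=/afs/nd.edu/user37/ccl/software/external/hadoop
    export PATH=${HADOOP_HOME}/bin:$PATH
    export HADOOP_CLASSPATH=${HADOOP_HOME}/hadoop-0.21.0.jar
    export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${HADOOP_HOME}/hadoop-common-0.21.0.jar
    export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${HADOOP_HOME}/hadoop-hdfs-0.21.0.jar
    export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${HADOOP_HOME}/hadoop-mapred-0.21.0.jar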
    
    
    Now, try the following commands, which list the root of the Hadoop filesystem, make a private directory for yourself, upload a file, and read it back:
    hadoop fs -ls /
    hadoop fs -mkdir /users/YOURNAME
    hadoop fs -put /usr/share/dict/linux.words /users/YOURNAME/words
    hadoop fs -ls /users/YOURNAME
    hadoop fs -cat /users/YOURNAME/words | less
    
    Note that Hadoop has no meaningful access controls: any data you put into the system is readable and writable by anyone in the CSE department. Your private directory is simply there as a convenience. Be a good citizen, and do not mess around with other people's data.


    Example of Map-Reduce Using Java

    Here is a very brief introduction to Map-Reduce using Java. If you are not a Java programmer, see the next section on Streaming.

    I have already uploaded the complete text of Tolstoy's "War and Peace" to the system under /public/warandpeace.txt. You are going to use WordCount.java to compute the frequency of words in the novel. Begin by downloading the source of WordCount.java to your machine. Now, compile it into wordcount.jar as follows:

    mkdir wordcount_classes 
    javac -classpath ${HADOOP_CLASSPATH} -d wordcount_classes WordCount.java 
    jar -cvf wordcount.jar -C wordcount_classes . 
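
    If you are curious what the program itself looks like, here is a minimal sketch along the lines of the classic Hadoop WordCount tutorial, written against the older org.apache.hadoop.mapred API (which names its output files part-00000, as shown below). The WordCount.java you downloaded may differ in its details:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        // Mapper: for each word in the input line, emit the pair (word, 1).
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Reducer: sum up all the counts emitted for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            // args[0] is the input file, args[1] the new output directory.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }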
    
    To perform a Map-Reduce job, run hadoop with the jar option and specify the input file and a new directory for output files:
    hadoop jar wordcount.jar WordCount /public/warandpeace.txt /users/YOURNAME/outputs
    
    Now, your outputs are stored under /users/YOURNAME/outputs in Hadoop:
    hadoop fs -ls /users/YOURNAME/outputs
    hadoop fs -cat /users/YOURNAME/outputs/part-00000
    


    Example of Map-Reduce Using Streaming

    In streaming mode, you can run Map-Reduce jobs in which the mapper and reducer are ordinary programs written in whatever language you like. Data is passed between programs as plain ASCII text, where each line consists of a key string, a tab character, a value string, and a newline.

    For example, if you want to compute the frequency of words, you could write a mapper and reducer in Perl like this:

  • WordCountMap.pl
  • WordCountReduce.pl
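
    In outline, the two scripts might look something like the following sketches. (These are illustrative; the versions linked above may differ in their details.)

    #!/usr/bin/perl
    # WordCountMap.pl (sketch): for each word read on stdin,
    # emit one line of the form "word<TAB>1".
    use strict;
    use warnings;

    while (my $line = <STDIN>) {
        chomp $line;
        foreach my $word (split /\s+/, $line) {
            print "$word\t1\n" if $word ne "";
        }
    }

    #!/usr/bin/perl
    # WordCountReduce.pl (sketch): input arrives sorted by key, so all the
    # counts for a given word are adjacent; sum them and emit "word<TAB>total".
    use strict;
    use warnings;

    my $current;
    my $sum = 0;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($word, $count) = split /\t/, $line;
        if (defined $current and $word ne $current) {
            print "$current\t$sum\n";
            $sum = 0;
        }
        $current = $word;
        $sum += $count;
    }
    print "$current\t$sum\n" if defined $current;
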
    You can easily test your mapper and reducer locally on a small amount of data like this:
    cat words.txt | ./WordCountMap.pl | sort | ./WordCountReduce.pl > output.txt
    
    Then, to run the whole job in Hadoop, use a command like this (note the backslashes, which continue the command across lines):
    hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar \
        -input /public/warandpeace.txt \
        -output /users/YOURNAME/output2 \
        -mapper  WordCountMap.pl \
        -file    WordCountMap.pl \
        -reducer WordCountReduce.pl \
        -file    WordCountReduce.pl
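
    The -file options ship your mapper and reducer scripts out to the cluster along with the job. When the job completes, the results appear in Hadoop just as in the Java example:
    hadoop fs -ls /users/YOURNAME/output2
    hadoop fs -cat /users/YOURNAME/output2/part-*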