A4: Data Processing with AWS and Pig

In this homework, you will gain experience in standing up services using Amazon EC2, S3, and EMR, and use them to perform data analysis using the Pig language.

By now, you should have created a personal Amazon account, applied for the education credit, and added the additional credit provided in class. This total amount will be more than enough to complete the assignment. If you accidentally go over, you will be responsible for the overage, so be careful to track your usage through the AWS console and to shut down services when they are not in use.

Warning: You will need to spend some time reading the Amazon documentation, experimenting with Pig, and learning how things work. This will take time, so starting the night before the assignment is due is not a good idea.

Getting Started

  • Create an Amazon account at aws.amazon.com.
  • Skim the documentation for EC2, S3, and EMR at aws.amazon.com/documentation.
  • Using the EC2 console, go to the Network & Security -> Key Pairs page, then create and download a new Key Pair (a .pem file) that you will use for ssh access.
  • Using the AWS console, create a new EMR cluster. Several things to note about the Create Cluster form:
  • Disable logging to S3, since we won't be using it.
  • Select an EC2 Key Pair to use, so that you can ssh to the cluster directly.
  • Under EMR role and EC2 instance profile, click the "Create Default Role" link to set up the roles automatically.
  • Accept the default configuration of one master node and two core nodes.
  • Once the EMR cluster is running, ssh to the head node as the hadoop user using your key pair (e.g. ssh -i keypair.pem hadoop@ip-address). This machine will be your "home base" for future work. Try out some basic AWS, Hadoop and Pig commands, like this:
    aws ec2 describe-instances
    aws s3 ls s3://dthain-cloud/employee/
    hadoop fs -ls s3://dthain-cloud/employee
    pig -e 'ls /'
    
  • The cluster is able to access S3 in addition to HDFS, so your jobs can simply refer to S3 buckets within Pig and Hadoop. You will be using the following datasets that we have set up:
  • s3://dthain-cloud/employee
  • s3://dthain-cloud/bigram
  • s3://dthain-cloud/wikilinks
  • Try out the aws s3 command. Create your own bucket with a unique name, which you will use for submitting the homework results. Try moving data between the class bucket, your bucket, and the local filesystem; a few example commands are sketched below.
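    For example, commands along these lines will create a bucket and copy data around (the bucket name here is just a placeholder; choose your own unique name):
      aws s3 mb s3://my-unique-bucket-name
      aws s3 cp s3://dthain-cloud/employee/departments.csv .
      aws s3 cp departments.csv s3://my-unique-bucket-name/departments.csv
      aws s3 ls s3://my-unique-bucket-name/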

    Problem 1: Employee Database

    There are six csv files in the "s3://dthain-cloud/employee" bucket, each of which contains line-oriented records with the following structure:
    departments.csv:
    dept_no, dept_name
    
    dept_emp.csv:
    emp_no, dept_no, from_date, to_date
    
    dept_manager.csv:
    dept_no, emp_no, from_date, to_date
    
    employees.csv:
    emp_no, birth_date, first_name, last_name, gender, hire_date
    
    You are going to write Pig queries to answer questions about this collection of data. To get you started, each of your Pig programs will begin by loading and formatting the data, performing some operations, then storing the data. For example, create a file called test.pig containing this:
    dept = LOAD 's3://dthain-cloud/employee/departments.csv' USING PigStorage(',') as (dept_no: chararray, dept_name: chararray);
    result = FOREACH dept GENERATE dept_name;
    DUMP result;
    
    Then run it with pig test.pig. (Pig generates a lot of warning messages, and then will show you the complete output.) DUMP will send the output to the console, which is useful for testing. To store the data in S3 instead, do this:
    STORE result INTO 's3://YOUR_BUCKET_NAME/result';
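
    Note that STORE writes a directory of part files rather than a single output file. Something like the following (the bucket name is a placeholder) will let you inspect and download the results:
      aws s3 ls s3://YOUR_BUCKET_NAME/result/
      aws s3 cp --recursive s3://YOUR_BUCKET_NAME/result/ ./result/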
    
    You may find it useful to read the Pig Getting Started page.

    Write Pig programs to answer the following queries about the Employee database. To keep things organized, name each Pig program query.1.1 and each result output.1.1, and so forth. The outputs will be large, so store them in your S3 bucket. (A generic sketch of the Pig operators you will need follows the list.)

  • query.1.1 / output.1.1 - List the emp_no, from_date and to_date of each manager in the research department.
  • query.1.2 / output.1.2 - List the first_name and last_name of each manager in the research department.
  • query.1.3 / output.1.3 - List the first_name, birth_date and hire_date of all pairs of employees who share the same birth_date and hire_date, with no duplicates. For example, "John 1970-07-20 1999-11-04 Mike 1970-07-20 1999-11-04".
  • query.1.4 / output.1.4 - List the dept_name and number of employees for all departments that have more than 50000 employees.
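
    As a starting point, here is a minimal sketch of the kinds of Pig operators these queries call for (JOIN, FILTER, GROUP, and COUNT); it uses the departments and dept_manager files and is not a complete solution to any particular query:
      dept = LOAD 's3://dthain-cloud/employee/departments.csv' USING PigStorage(',')
             AS (dept_no: chararray, dept_name: chararray);
      mgr = LOAD 's3://dthain-cloud/employee/dept_manager.csv' USING PigStorage(',')
            AS (dept_no: chararray, emp_no: chararray, from_date: chararray, to_date: chararray);
      -- keep only one department by name (check the exact spelling in departments.csv)
      research = FILTER dept BY dept_name == 'Research';
      -- join two relations on a shared column, then project a few fields
      joined = JOIN mgr BY dept_no, dept BY dept_no;
      picked = FOREACH joined GENERATE dept::dept_name AS dept_name, mgr::emp_no AS emp_no;
      -- group and count rows per department
      grouped = GROUP picked BY dept_name;
      counts = FOREACH grouped GENERATE group AS dept_name, COUNT(picked) AS n;
      DUMP counts;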

    Problem 2: Bigrams

    There are two csv files stored in s3://dthain-cloud/bigram/
  • googlebooks-2gram.csv, part of the Google 2-gram dataset
  • googlebooks-2gram-small.csv, a small sample file for debugging

    Each line in each file gives the number of times a given bigram appears in all books (known to Google) in a given historical year. The format of each file is:
    ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
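
    A load statement along these lines reads the data with a schema (a sketch; the field names follow the format above, and PigStorage with no argument splits on tabs by default):
      bigrams = LOAD 's3://dthain-cloud/bigram/googlebooks-2gram-small.csv' USING PigStorage()
                AS (ngram: chararray, year: int, match_count: long, page_count: long, volume_count: long);
      DESCRIBE bigrams;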
    
    Write a Pig program to answer the following question about the bigram database. Test your work on the small dataset first, and then send the final result to your S3 bucket:
  • query.2 / output.2 - Compute for each bigram, the average number of appearances of that bigram across all years.
    For example, if the data consisted of this:
    I love  1990   19   18   9
    I love  2000   200  199  88
    cloud computing  1922  378 375  10
    cloud computing  1988  299 289  50
    cloud computing  2014  10  10    27
    
    We can tell from this snapshot that, in 1990, the bigram "I love" appeared 19 times, on 18 different pages, in 9 different books. The frequency of each bigram would then be computed like this:
    I love (19 + 200) / (9 + 88) = 2.257
    cloud computing (378 + 299 + 10) / (10 + 50 + 27) = 7.896
    
    And the output would simply be:
    I love 2.257
    cloud computing 7.896
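
    Following the arithmetic in the worked example, one way to sketch the computation (building on the load statement above; try it on the small file first) is to group by ngram and divide the summed counts:
      grouped = GROUP bigrams BY ngram;
      -- the worked example divides total match_count by total volume_count
      averages = FOREACH grouped GENERATE group AS ngram,
                 ((double) SUM(bigrams.match_count)) / ((double) SUM(bigrams.volume_count)) AS avg_count;
      DUMP averages;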
    

    Problem 3: Wikilinks

    This dataset describes the structure of proper Wikipedia pages, that is, pages in "namespace 0", including disambiguation and redirect pages. There are four files stored in s3://dthain-cloud/wikilinks/
  • links-simple-sorted.txt, the Wikilinks map
  • links-simple-sorted-small.txt, a small version of the Wikilinks map, for debugging
  • links-titles.txt, the list of corresponding page titles
  • links-titles-small.txt, a small version of links-titles.txt, for debugging

    links-titles.txt gives a unique id and a title for each page in the dataset:
    725634: Boardwalk_fries
    725635: Boardwatch
    725636: Boarfish
    725637: Boarhouse
    725638: Boarhunt
    725639: Boario_Terme
    
    links-simple-sorted.txt gives, for each page ID, the IDs of the pages to which it links:
    213: 217
    214: 2979792 4936083
    215: 164
    216: 164
    217: 505650 1420568 3080731 3607953
    218: 505650
    
    Write a Pig program to answer the following question about the Wikilinks data. Test your work on the small dataset first, and then send the final result to your S3 bucket.
  • query.3 / output.3 - Determine the ten pages with the greatest number of outlinks. Output the titles of those ten pages and the number of outlinks, sorted from highest to lowest.
  • Hint: Read in all outlinks per line as one field, then use the Pig string functions to make a list and count the fields. (See the sketch below.)
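
    The hint can be sketched roughly like this (a sketch only; TOKENIZE splits a string on whitespace into a bag, and SIZE counts its elements):
      -- PigStorage(':') leaves the whole outlink list as a single chararray field
      links = LOAD 's3://dthain-cloud/wikilinks/links-simple-sorted-small.txt' USING PigStorage(':')
              AS (page_id: int, outlinks: chararray);
      counts = FOREACH links GENERATE page_id, SIZE(TOKENIZE(outlinks)) AS n_out;
      DUMP counts;
    From there, ordering by the count, limiting to ten rows, and joining against links-titles.txt should get you to the final answer.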

    What to Turn In

    This assignment is due on Monday, April 4th before class begins.

    For each problem, send the final output directly to your S3 bucket, and copy the Pig query files into your bucket as well. Please use the indicated naming scheme so that we can keep everything straight.

    In your AFS dropbox directory, simply submit a README with the name of your S3 bucket, so that the grader can access it. If you have any other clarifications about your work, add them to the README file.

    Important: Select an S3 bucket name that is not easily guessable, then mark each file in the S3 bucket so that it is publicly readable. (To do this, go to the S3 console, select the item, select Properties in the upper right hand corner, open the Permissions field, and add a permission for Everyone to Open/Download the file.) Then, test that it works by visiting the S3 URL from a web browser.
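
    If you prefer the command line, the same permission can usually be set with the AWS CLI; a rough sketch (bucket and key names are placeholders):
      # change the ACL of an object that is already in the bucket
      aws s3api put-object-acl --bucket YOUR_BUCKET_NAME --key query.1.1 --acl public-read
      # or set the ACL at upload time
      aws s3 cp query.1.1 s3://YOUR_BUCKET_NAME/query.1.1 --acl public-read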

    Good luck, and get started early!