To get started, you must create a personal account with Amazon AWS and then apply the $100 credit provided by the TA. This amount should be more than enough to complete the assignment. If you accidentally go over, you will be responsible for the overage, so track your usage through the AWS console and shut down services when they are not in use.

Warning: You will need to spend some time reading the Amazon documentation, experimenting with Pig, and learning how things work. This takes time, so starting the night before the assignment is due is not a good idea.
aws ec2 describe-instances
aws s3 ls s3://nd-cse40822/employee
hadoop fs -ls s3://nd-cse40822/employee
pig
ls /
department.csv: dept_no, dept_name
dept_emp.csv: emp_no, dept_no, from_date, to_date
dept_manager.csv: dept_no, emp_no, from_date, to_date
employees.csv: emp_no, birth_date, first_name, last_name, gender, hire_date

You are going to write Pig queries to answer questions about this collection of data. To get you started, each of your Pig programs will begin by loading and formatting the data, then perform some operations, and finally store the result. For example, create a file called test.pig containing this:
dept = LOAD 's3://nd-cse40822/employee/departments.csv' USING PigStorage(',')
    AS (dept_no: chararray, dept_name: chararray);
result = FOREACH dept GENERATE dept_name;
DUMP result;

Then run it with pig test.pig. (Pig generates a lot of warning messages, and then will show you the complete output.) DUMP will send the output to the console, which is useful for testing. To store the data in S3 instead, do this:
STORE result INTO 's3://YOUR_BUCKET_NAME/result';

You may find it useful to read the Pig Getting Started Page.
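For reference, the stored version of test.pig would then read roughly as follows (YOUR_BUCKET_NAME is a placeholder for your own bucket):

dept = LOAD 's3://nd-cse40822/employee/departments.csv' USING PigStorage(',')
    AS (dept_no: chararray, dept_name: chararray);
result = FOREACH dept GENERATE dept_name;
-- writes a directory of part-* files under the given path
STORE result INTO 's3://YOUR_BUCKET_NAME/result';

Note that Pig will refuse to write into an output directory that already exists, so choose a fresh path (or delete the old one) for each run.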
Write Pig programs to answer the following queries about the Employee database. To keep things organized, name each Pig program query.1.1 and each result output.1.1, and so forth. The outputs will be large, so store them in your S3 bucket.
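The exact queries are listed below; as a rough sketch of the shape each program will take, a query that joins employees to the departments they have worked in might look like this (the employees.csv and dept_emp.csv paths and the field types are assumptions based on the schema listed above, and the output name just follows the indicated scheme):

emp = LOAD 's3://nd-cse40822/employee/employees.csv' USING PigStorage(',')
    AS (emp_no: int, birth_date: chararray, first_name: chararray,
        last_name: chararray, gender: chararray, hire_date: chararray);
de = LOAD 's3://nd-cse40822/employee/dept_emp.csv' USING PigStorage(',')
    AS (emp_no: int, dept_no: chararray, from_date: chararray, to_date: chararray);
-- pair each employee with the departments they have been assigned to
joined = JOIN emp BY emp_no, de BY emp_no;
result = FOREACH joined GENERATE emp::last_name, emp::first_name, de::dept_no;
STORE result INTO 's3://YOUR_BUCKET_NAME/output.1.1';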
ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE

Write a Pig program to answer the following question about the bigram database. Test your work on the small dataset first, and then send the final result to your S3 bucket:
I love           1990  19   18   9
I love           2000  200  199  88
cloud computing  1922  378  375  10
cloud computing  1988  299  289  50
cloud computing  2014  10   10   27

We can tell from this snapshot that, in 1990, the bigram "I love" appeared 19 times on 18 different pages in 9 different books. The frequency of each bigram is then computed like this:
I love           (19 + 200) / (9 + 88)             = 2.257
cloud computing  (378 + 299 + 10) / (10 + 50 + 27) = 7.896

And the output would simply be:
I love           2.257
cloud computing  7.896
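One way this aggregation might be structured in Pig (a sketch only: the input path and output name are placeholders, and the field names and types simply follow the record format described above):

bigrams = LOAD 's3://YOUR_BUCKET_NAME/bigrams-small' USING PigStorage('\t')
    AS (ngram: chararray, year: int, match_count: long,
        page_count: long, volume_count: long);
-- collect every year of the same bigram into one group
by_ngram = GROUP bigrams BY ngram;
-- frequency = total match_count divided by total volume_count across all years
freq = FOREACH by_ngram GENERATE group AS ngram,
    (double) SUM(bigrams.match_count) / (double) SUM(bigrams.volume_count) AS frequency;
STORE freq INTO 's3://YOUR_BUCKET_NAME/bigram-frequency';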
725634: Boardwalk_fries
725635: Boardwatch
725636: Boarfish
725637: Boarhouse
725638: Boarhunt
725639: Boario_Terme

links-simple-sorted.txt gives, for each page ID, the page IDs of the pages to which it links:
213: 217
214: 2979792 4936083
215: 164
216: 164
217: 505650 1420568 3080731 3607953
218: 505650

Write a Pig program to solve the following question about the Wikilinks data. Test your work on the small dataset first, and then send the final result to your S3 bucket.
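Parsing this format takes a little care because each line contains a variable number of targets. One possible approach in Pig is to load each line whole and split it yourself, roughly like this (the input path is a placeholder; REGEX_EXTRACT and TOKENIZE are just one way to pull the fields apart):

-- read each line of links-simple-sorted.txt as a single chararray
lines = LOAD 's3://YOUR_BUCKET_NAME/links-simple-sorted.txt' USING TextLoader()
    AS (line: chararray);
-- split off the source page ID and the remainder of the line
parsed = FOREACH lines GENERATE
    (long) REGEX_EXTRACT(line, '^(\\d+):', 1) AS src,
    REGEX_EXTRACT(line, '^\\d+:\\s*(.*)$', 1) AS targets;
-- produce one (src, dst) pair per outgoing link; dst is still a chararray here
edges = FOREACH parsed GENERATE src, FLATTEN(TOKENIZE(targets)) AS dst;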
For each problem, send the final output directly to your S3 bucket, and copy your Pig query files into the bucket as well. Please use the indicated naming scheme so that we can keep everything straight.

In your AFS dropbox directory, simply submit a README with the name of your S3 bucket so that the grader can access it. If you have any other clarifications about your work, add them to the README file.
Important: Select an S3 bucket name that is not easily guessable, then mark each file in the S3 bucket so that it is publicly readable. (To do this, go to the S3 console, select the item, select Properties in the upper right-hand corner, open the Permissions field, and add a permission for Everyone to Open/Download the file.)
Good luck, and get started early!