The goal of this assignment is to allow you to practice manipulating data in CSV and JSON format using the Python programming language. Additionally, you will have an opportunity for open-ended exploration.
To record your solutions and answers, create a new Jupyter Notebook
titled Notebook08.ipynb
and use this notebook to complete the following
activities and answer the corresponding questions.
Make sure you label your activities appropriately. That is for Activity 1, have a header cell that is titled Activity 1. Likewise, use the Markdown cells to answer the questions (include the questions above the answer).
This Notebook assignment is due Midnight Friday, November 13, 2015 and is to be done individually or in pairs.
To help you complete this Notebook, here are the solutions to Activity 0 of Notebook07:
# Imports
import os
import requests
import zipfile
import StringIO
# Functions
def ls(path=os.curdir, sort=True):
''' Lists the contents of directory specified by path '''
filelist = os.listdir(path)
if sort:
filelist.sort()
for filename in filelist:
print filename
def head(path, n=10):
''' Lists the first n lines of file specified by path '''
for count, line in enumerate(open(path, 'r')):
if count >= n:
break
line = line.strip()
print line
def tail(path, n=10):
''' Lists the last n lines of file specified by path '''
for line in open(path, 'r').readlines()[-n:]:
line = line.strip()
print line
def download_and_extract_zipfile(url, destdir):
''' Downloads zip file from url and extracts the contents to destdir '''
response = requests.get(url)
zipdata = zipfile.ZipFile(StringIO.StringIO(response.content))
zipdata.extractall(destdir)
Feel free to use these functions to help you retrieve and explore the datasets in the following activities.
The first activity is to explore data from the MovieLens project:
MovieLens is a research site run by GroupLens Research at the University of Minnesota. MovieLens uses "collaborative filtering" technology to make recommendations of movies that you might enjoy, and to help you avoid the ones that you won't. Based on your movie ratings, MovieLens generates personalized predictions for movies you haven't seen yet. MovieLens is a unique research vehicle for dozens of undergraduates and gradute students researching various aspects of personalization and filtering technologies.
This is basically a collection of movie ratings that contains information such as movie titles, ratings, genres, years, etc. similar to what you would find on Netflix.
Your goal for this activity is to explore the data in the MovieLens by asking two questions and then writing code to help you answer them.
For instance, you may wish to explore the following questions:
Which movie has the least/most genres?
Which movie has the lowest/highest rating?
Which animated movie has the lowest/highest rating?
Which sci-fi movie from the 1990's is rated the highest?
How many fantasy movies from the year 1998 were rated?
What movies do programmers enjoy the most?
The possibilities are endless!
Here is an example of how to retrieve the Movie Lens using
the download_and_extract_zipfile
function from Notebook07:
# Constants
MOVIE_LENS_URL = 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
MOVIE_LENS_DIR = 'movielens'
# Download Movie Lens 1-M dataset
download_and_extract_zipfile(MOVIE_LENS_URL, MOVIE_LENS_DIR)
Here is an example of how to answer the question: "How many movie genres are in the Movie Lens dataset?"
# Movie record constants
MOVIE_ID = 0
MOVIE_TITLE = 1
MOVIE_GENRES = 2
# Count Genres Functions
def count_genres(path):
''' Returns a dictionary containing the number of movies per genre in file specified by path '''
genres = {}
for line in open(path):
movie = line.strip().split('::')
for genre in movie[MOVIE_GENRES].split('|'):
genres[genre] = genres.get(genre, 0) + 1
return genres
# Perform count on MovieLens 1M dataset
count_genres(os.path.join(MOVIE_LENS_DIR, 'ml-1m', 'movies.dat'))
{'Action': 503,
'Adventure': 283,
'Animation': 105,
"Children's": 251,
'Comedy': 1200,
'Crime': 211,
'Documentary': 127,
'Drama': 1603,
'Fantasy': 68,
'Film-Noir': 44,
'Horror': 343,
'Musical': 114,
'Mystery': 106,
'Romance': 471,
'Sci-Fi': 276,
'Thriller': 492,
'War': 143,
'Western': 68}
The following are hints and suggestions that will help you complete this activity:
Download the dataset and explore the files in the ml-1m
directory. In
particular, take a look at the
README.txt
file which describes the dataset.
Use the Notebook to explore the files by writing small functions or code snippets to manipulate the data.
Some questions will require to look at data from multiple files (ie.
movies.dat
and ratings.dat
). You will need to think about what you
want to store and keep track of and how you wish to accomplish that. You
may end up with multiple datasets that link to each other!
Unfortunately, because the 1M Dataset uses ::
as a delimiter, you
will not be able to use the csv module to parse the files. Instead, use
str.split.
After completing the activity above, answer the following questions:
Describe the two questions you asked and discuss the answers you found.
For each question, explain how you used code to manipulate the data in order to explore the dataset and answer your questions.
What challenges did you face and how did you overcome these obstables? What sort of questions were difficult to ask?
The second activity is to create a currency conversion application using real-time currency data from the Yahoo Finance, which looks something like this:
{
"list" : {
"meta" : {
"type" : "resource-list",
"start" : 0,
"count" : 173
},
"resources" : [
{
"resource" : {
"classname" : "Quote",
"fields" : {
"name" : "USD/KRW",
"price" : "1136.045044",
"symbol" : "KRW=X",
"ts" : "1446671220",
"type" : "currency",
"utctime" : "2015-11-04T21:07:00+0000",
"volume" : "0"
}
}
}
,
... SNIP ...
Your goal for this activity is to explore the Currency JSON Feed and then design an interactive currency converter that looks like this:
To complete this activity, we can break the activity down into the following functions:
def extract_currencies(url):
''' Return a list of currency objects from the JSON feed specified by URL '''
Given a
url
, this performs a requests.get on theurl
and then processes the JSON feed to extract the currency objects in the feed and returns the currencies in a list.
def build_conversion_table(currencies):
''' Return a data structure that maps currency symbols to currency prices '''
Given a list of
currencies
, this build a data structure that allows users to lookup conversion prices based on currency symbols.
def currency_converter(dollars, symbol):
''' Prints the amount in dollars to the amount specified by the foreign currency symbol '''
Given an amount in
dollars
, this converts the dollars into the foreign currency specified bysymbol
and prints the result.
The following are hints and suggestions that will help you complete this activity:
Remember that you can convert the JSON response you get from requests.get into a Python object by doing something like this:
data = requests.get(url).json()
Once you have converted the JSON response into a Python object, you can manipulate the data using normal Python commands.
JSON uses the same syntax as Python to distinguish between lists and dictionaries (ie. '[]', '{}').
The Yahoo Finance's Currency JSON Feed is a deeply nested data structure that will require some exploration to understand.
Remember that if you want a string to be treated as a float
then you
must explicitly cast it.
Because there are many currencies, you will want to use a Select
widget rather than DropDown
box for interact. To do this, you can use
the following code:
from IPython.display import HTML, display
from IPython.html.widgets import Select
from IPython.html.widgets.interaction import interact
interact(currency_converter, dollars='', symbol=Select(options=symbols))
After completing the activity above, answer the following questions:
Describe the contents of the Yahoo Finance's Currency JSON Feed. How did you extract each individual currency from the JSON feed? What does each currency object contain?
Describe how you built the conversion table. What data structure did you use?
Discuss how your currency converter uses the conversion table to translate dollars into foreign currency. What was the most difficult part of this activity? How did you overcome it?
To submit your notebook, follow the same directions for Notebook00, except store this notebook in the notebook08 folder.