Homework 04: Scraping the Web

The goal of this homework assignment is to allow you to practice using Python to interact with the web using the Requests package. In this assignment, you will write scripts that pull data from the Internet and manipulate it in some way.

For this assignment, record your scripts and any responses to the following activities in the homework04 folder of your assignments GitLab repository and push your work by 11:59 PM Friday, February 24, 2017.

Activity 1: Blend Profiles (10 Points)

Trevor¹ loves having fun at the faculty's expense. When he is not using the write command to spam the instructor's terminal during class, he is making meme pics of the teacher and posting them on Facebook²:

ImageMagick

Having learned about ImageMagick, Trevor decides to take his trolling game up a notch by creating a script to generate some amusing and possibly terrifying GIF animations³ such as the ones below:

To create these animations, Trevor uses the ImageMagick composite tool to generate blended image frames:

$ composite -blend stepsize source1 source2 target

This command takes the source1 and source2 images and blends them based on the percentage specified in stepsize: each pixel of the result is stepsize% of the pixel value from source1 plus (100 - stepsize)% of the pixel value from source2.

For instance if stepsize is 20 then the composite tool will take 20% of the pixel value from source1 and 80% from source2 to produce the blended image:

# Blend: 20% Ramzi, 80% Tijana
$ composite -blend 20 ramzi.jpg tijana.jpg 020-ramzi_tijana.gif

Conversely, if stepsize is 80 then the composite tool will take 80% of the pixel value from source1 and 20% from source2 to produce the blended image:

# Blend: 80% Ramzi, 20% Tijana
$ composite -blend 80 ramzi.jpg tijana.jpg 080-ramzi_tijana.gif
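
The blending rule described above amounts to a weighted average. As a rough illustration only, the arithmetic for a single channel value could be sketched as follows. The `blend_value` helper is hypothetical and is not part of ImageMagick, which operates on whole images and may round differently:

```python
def blend_value(value1, value2, stepsize):
    """Return the weighted average of two channel values.

    stepsize is the percentage taken from value1; the remainder
    (100 - stepsize) is taken from value2.
    """
    return (stepsize * value1 + (100 - stepsize) * value2) / 100.0


# 20% of a channel value of 200 plus 80% of a channel value of 100
print(blend_value(200, 100, 20))   # 120.0
# 80% of a channel value of 200 plus 20% of a channel value of 100
print(blend_value(200, 100, 80))   # 180.0
```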

To create the animation, Trevor generates a series of these blended composite images at regular intervals from 000 to 100:

# List all the generated blended composite images
$ ls
000-ramzi_tijana.gif
010-ramzi_tijana.gif
020-ramzi_tijana.gif
030-ramzi_tijana.gif
040-ramzi_tijana.gif
050-ramzi_tijana.gif
060-ramzi_tijana.gif
070-ramzi_tijana.gif
080-ramzi_tijana.gif
090-ramzi_tijana.gif
100-ramzi_tijana.gif

Once he has all the individual composite images, he can stitch them together to create a GIF animation by using the convert command:

# Stitch blended composite images into an animation
$ convert -loop 0 -delay 5 \
      000-ramzi_tijana.gif \
      010-ramzi_tijana.gif \
      020-ramzi_tijana.gif \
      030-ramzi_tijana.gif \
      040-ramzi_tijana.gif \
      050-ramzi_tijana.gif \
      060-ramzi_tijana.gif \
      070-ramzi_tijana.gif \
      080-ramzi_tijana.gif \
      090-ramzi_tijana.gif \
      100-ramzi_tijana.gif \
      ramzi_tijana.gif

The -loop 0 means to have the GIF loop forever, while the -delay 5 means to wait 5 hundredths of a second before transitioning to the next image frame.

As can be seen, the convert tool is given a list of all the composite images followed by the final target file (i.e., ramzi_tijana.gif).

To have the animation blend forward and backwards, Trevor simply appends the list of composite images but in reverse:

# Stitch blended composite images into an animation that runs forwards and backwards
$ convert -loop 0 -delay 5 \
      000-ramzi_tijana.gif \
      010-ramzi_tijana.gif \
      020-ramzi_tijana.gif \
      030-ramzi_tijana.gif \
      040-ramzi_tijana.gif \
      050-ramzi_tijana.gif \
      060-ramzi_tijana.gif \
      070-ramzi_tijana.gif \
      080-ramzi_tijana.gif \
      090-ramzi_tijana.gif \
      100-ramzi_tijana.gif \
      100-ramzi_tijana.gif \
      090-ramzi_tijana.gif \
      080-ramzi_tijana.gif \
      070-ramzi_tijana.gif \
      060-ramzi_tijana.gif \
      050-ramzi_tijana.gif \
      040-ramzi_tijana.gif \
      030-ramzi_tijana.gif \
      020-ramzi_tijana.gif \
      010-ramzi_tijana.gif \
      000-ramzi_tijana.gif \
      ramzi_tijana.gif
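
Typing these frame lists by hand gets tedious, so a script would normally build them. The following is a minimal sketch of one way to do that. The helper names `frame_names` and `convert_command` are made up for illustration, and the file naming simply mirrors the listing above:

```python
def frame_names(name, stepsize):
    """Return frame file names for blend percentages 0, stepsize, ..., up to 100."""
    return ['{:03d}-{}.gif'.format(percent, name)
            for percent in range(0, 101, stepsize)]


def convert_command(frames, target, delay=5, reverse=False):
    """Build the convert command line that stitches frames into target."""
    if reverse:
        # Append the same frames in reverse order to blend back again
        frames = frames + frames[::-1]
    return 'convert -loop 0 -delay {} {} {}'.format(delay, ' '.join(frames), target)


frames = frame_names('ramzi_tijana', 10)
print(convert_command(frames, 'ramzi_tijana.gif', reverse=True))
```

Note that `range` requires a stepsize of at least 1, and a stepsize that does not divide 100 evenly will stop short of the 100% frame in this sketch.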

ImageMagick

The default version of ImageMagick on the student machines is pretty old. To use a more recent version of ImageMagick, add the following directory to your PATH environment variable:

~ccl/software/external/imagemagick/bin

In csh, you would do:

$ setenv PATH ~ccl/software/external/imagemagick/bin:$PATH

In bash, you would do:

$ export PATH=~ccl/software/external/imagemagick/bin:$PATH

In Python, you would do:

import os
# Python does not expand ~ on its own, so expand it explicitly
os.environ['PATH'] = os.path.expanduser('~ccl/software/external/imagemagick/bin') + ':' + os.environ['PATH']

Once the PATH is updated, you should be able to run composite -version and see ImageMagick 6.6.4-2.

Profiles

To get the original source images, Trevor has to download them from the faculty profiles found on the Computer Science and Engineering Directory. Given a netid, you can access the person's profile by going to:

https://engineering.nd.edu/profiles/$NETID

For instance, Ramzi's profile is located at:

https://engineering.nd.edu/profiles/rbualuan

Each person's profile contains information about the person such as name, phone number, office location, etc. and includes a portrait image. The location of each portrait looks something like this:

https://engineering.nd.edu/profiles/rbualuan/@@images/42093b02-060d-4436-91b6-2cf068d4f8b8.jpeg

Going to each faculty member's profile and manually extracting this portrait image is a bit tedious, so Trevor decides to write his script in Python and use the Requests package to help him extract these image portraits in an automated fashion. Once the images are downloaded, the script can call the ImageMagick commands previously described to generate the delightful GIF animations.

Unfortunately, although Trevor is a mastermind at soliciting lulz, he is not as effective at executing his brilliant pranks. He needs your help in completing the Python script: blend.py.
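
To make the extraction step concrete, here is a small sketch of pulling a portrait URL out of profile HTML with `re.findall`. It assumes the page text has already been fetched (for example with `requests.get(...).text`) and that the portrait URL appears in the page in the form shown above. The real markup may differ, and `find_portrait_url` is a hypothetical helper name:

```python
import re


def find_portrait_url(netid, html):
    """Return the first portrait URL for netid found in html, or None."""
    # Assumed URL shape, based on the example portrait location above
    pattern = r'https://engineering\.nd\.edu/profiles/{}/@@images/[^"\'\s]+'.format(
        re.escape(netid))
    matches = re.findall(pattern, html)
    return matches[0] if matches else None


# Made-up HTML fragment standing in for a downloaded profile page
sample = ('<img src="https://engineering.nd.edu/profiles/rbualuan/'
          '@@images/42093b02-060d-4436-91b6-2cf068d4f8b8.jpeg" />')

print(find_portrait_url('rbualuan', sample))
print(find_portrait_url('batman', sample))   # None
```

Returning None when nothing matches lets the caller decide how to report the failure.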

Task 1: `blend.py`

The blend.py script has the following usage message:

$ ./blend.py -h
Usage: blend.py [ -r -d DELAY -s STEPSIZE ] netid1 netid2 target
    -r          Blend forward and backward
    -d DELAY    GIF delay between frames (default: 20)
    -s STEPSIZE Blending percentage increment (default: 5)

As can be seen, the script takes three arguments: netid1 and netid2 correspond to the netids of the two people in the Computer Science and Engineering Directory while target is the name of the GIF animation file to create.

In addition to these arguments, the script has three possible flags:

The -r flag means that the animation should run both forward (blend from netid1 to netid2) and then backward (blend from netid2 to netid1).
The -d flag allows the user to specify the animation DELAY in hundredths of a second (i.e., how long to wait before shifting to the next image in the animation).
The -s flag allows the user to specify the STEPSIZE, which determines how many frames there are in the animation (this number must be between 0 and 100).

To help you get started, Trevor has provided you with the following starter code:

# Download and display starter code
$ curl -sL https://www3.nd.edu/~pbui/teaching/cse.20289.sp17/static/py/blend.py

The starter code contains the following:

#!/usr/bin/env python2.7

import atexit
import os
import re
import shutil
import sys
import tempfile

import requests

# Global variables

REVERSE     = False
DELAY       = 20
STEPSIZE    = 5

# Functions

def usage(status=0):
    print '''Usage: {} [ -r -d DELAY -s STEPSIZE ] netid1 netid2 target
    -r          Blend forward and backward
    -d DELAY    GIF delay between frames (default: {})
    -s STEPSIZE Blending percentage increment (default: {})'''.format(
        os.path.basename(sys.argv[0]), DELAY, STEPSIZE
    )
    sys.exit(status)

# Parse command line options

args = sys.argv[1:]

while len(args) and args[0].startswith('-') and len(args[0]) > 1:
    arg = args.pop(0)
    # TODO: Parse command line arguments

if len(args) != 3:
    usage(1)

netid1 = args[0]
netid2 = args[1]
target = args[2]

# Main execution

# TODO: Create workspace

# TODO: Register cleanup

# TODO: Extract portrait URLs

# TODO: Download portraits

# TODO: Generate blended composite images

# TODO: Generate final animation

You are to complete the sections marked TODO in order to complete the blend.py script.

Hints

To parse the command line arguments, you will need to check arg against any possible flags. Any parameters to the flag can be accessed by popping from the front of the args list:
```
parameter = args.pop(0)   # Remove the first item in the list
```
This is analogous to using shift in a shell script to remove the first item in the command line arguments.

Remember that arguments are strings by default. If you need a parameter to be another type, then you will have to explicitly cast it.
```
number = int(args.pop(0)) # Remove the first item in the list and convert to int
```
To create a workspace, you should use the tempfile.mkdtemp function which will create a temporary directory for you and return its location.
To register a cleanup function, you should use the atexit.register function to assign a function to run when the program exits. This cleanup function should remove the temporary directory you created by using the shutil.rmtree function.

This is somewhat analogous to creating a trap in a shell script.
To extract portrait URLs, you will need to use the requests.get function to fetch the contents of a profile from the Computer Science and Engineering Directory and then the re.findall function to search and extract the URLs from the retrieved contents.

If retrieving the contents fails or the search for a portrait URL fails, then the program should exit with an error code: sys.exit(1).
To download portraits, you will need to use the requests.get method again.
To generate blended composite images, you will need to use the os.system function to execute the ImageMagick composite tool.
To generate the final animation, you will need to use the os.system function to execute the ImageMagick convert tool.
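
As an illustration of the first two hints, the flag-parsing loop could look roughly like the sketch below. It is written as a standalone function (`parse_flags`, a hypothetical name) so it can be tried in isolation. The starter code instead works on module-level variables and would call usage on errors rather than raise an exception:

```python
def parse_flags(args, reverse=False, delay=20, stepsize=5):
    """Consume leading flags from args and return (reverse, delay, stepsize, rest)."""
    args = list(args)   # Work on a copy so the caller's list is untouched
    while args and args[0].startswith('-') and len(args[0]) > 1:
        arg = args.pop(0)
        if arg == '-r':
            reverse = True
        elif arg == '-d':
            delay = int(args.pop(0))      # Parameters arrive as strings
        elif arg == '-s':
            stepsize = int(args.pop(0))
        else:
            raise ValueError('unknown flag: {}'.format(arg))
    return reverse, delay, stepsize, args


print(parse_flags(['-r', '-s', '10', 'rbualuan', 'tmilenkovic', 'ramzi_tijana.gif']))
```

Whatever remains in the returned list after the flags are consumed should be the three positional arguments.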

Functions

Since many of the operations above happen multiple times, you should consider organizing the common code into functions:

search_portrait(netid): Given a netid, this function returns the corresponding image portrait URL.
download_file(url, path): Given a url and path, this function downloads the data specified by url and stores it in the file specified by path.
run_command(command): Given a command, this function executes the command and checks its return status.

Organizing your code into smaller functions will not only make your program shorter and more concise, but also make it easier to debug and maintain.
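
For example, a minimal version of run_command might wrap os.system and turn its return status into a boolean. This is only a sketch that assumes a POSIX shell; how you report and react to failure is up to you:

```python
import os


def run_command(command):
    """Run command through the shell and report whether it succeeded."""
    # On a POSIX system os.system returns 0 only when the command exits successfully
    return os.system(command) == 0


print(run_command('true'))    # True
print(run_command('false'))   # False
```

A caller could then exit early with sys.exit(1) whenever run_command returns False.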

Task 2: Testing

To verify the correctness of your blend.py script, you should try to reproduce the images above:

# Animate Ramzi and Tijana
$ ./blend.py -r -s 10 rbualuan tmilenkovic ramzi_tijana.gif
Using workspace: /tmp/blendiGT7lt
Searching portrait for rbualuan... https://engineering.nd.edu/profiles/rbualuan/@@images/42093b02-060d-4436-91b6-2cf068d4f8b8.jpeg
Searching portrait for tmilenkovic... https://engineering.nd.edu/profiles/tmilenkovic/@@images/2e3cada8-ee15-4dba-88c8-21c89b05466b.jpeg
Downloading https://engineering.nd.edu/profiles/rbualuan/@@images/42093b02-060d-4436-91b6-2cf068d4f8b8.jpeg to /tmp/blendiGT7lt/42093b02-060d-4436-91b6-2cf068d4f8b8.jpeg... Success!
Downloading https://engineering.nd.edu/profiles/tmilenkovic/@@images/2e3cada8-ee15-4dba-88c8-21c89b05466b.jpeg to /tmp/blendiGT7lt/2e3cada8-ee15-4dba-88c8-21c89b05466b.jpeg... Success!
Generating /tmp/blendiGT7lt/000-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/010-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/020-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/030-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/040-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/050-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/060-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/070-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/080-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/090-ramzi_tijana.gif ... Success!
Generating /tmp/blendiGT7lt/100-ramzi_tijana.gif ... Success!
Generating ramzi_tijana.gif ... Success!
Cleaning up workspace: /tmp/blendiGT7lt

# Animate Peter and David
$ ./blend.py -r pbui dchiang peter_david.gif
...

# Animate Shreya and Scott
$ ./blend.py -r skumar semrich shreya_scott.gif
...

# Animate Bowyer and Flynn
$ ./blend.py -r kbowyer pflynn bowyer_flynn.gif
...

Although it is not required, you should consider emitting diagnostic messages as shown above to inform the user of the progress of your script.

As noted above, if the script encounters an error while searching for a portrait, downloading a file, or executing a command, the script should exit early and clean up the temporary workspace.

# Exit early on failure
$ ./blend.py batman superman fail.gif
Using workspace: /tmp/blendFuiGiK
Searching portrait for batman... Not Found!
Cleaning up workspace: /tmp/blendFuiGiK
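
The workspace lifecycle seen in the transcript above can be sketched with tempfile.mkdtemp, atexit.register, and shutil.rmtree. The `blend` prefix is a guess based on the directory names in the sample output, and `cleanup` is an illustrative function name:

```python
import atexit
import os
import shutil
import tempfile


def cleanup(workspace):
    """Remove the temporary workspace and everything inside it."""
    print('Cleaning up workspace: {}'.format(workspace))
    shutil.rmtree(workspace, ignore_errors=True)


workspace = tempfile.mkdtemp(prefix='blend')
print('Using workspace: {}'.format(workspace))

# Run cleanup(workspace) automatically when the interpreter exits,
# including exits triggered by sys.exit(1)
atexit.register(cleanup, workspace)
```

Because the cleanup is registered right after the directory is created, any later sys.exit(1) still removes the workspace.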

Task 3: `README.md`

In your README.md, describe how you implemented the blend.py script. In particular, briefly discuss:

How you parsed command line arguments.
How you managed the temporary workspace.
How you extracted the portrait URLs.
How you downloaded the portrait images.
How you generated the blended composite images.
How you generated the final animation.
How you checked for failure of different operations and exited early.

Activity 2: Grep Reddit (5 Points)

Katie likes to sit in the back of the class. It has its perks:

She can beat the rush out the door when class ends.
She can see everyone browsing Facebook, playing video games, watching YouTube, or doing homework.
She feels safe from being called upon by the instructor... except when he does that strange thing where he goes around the class and tries to talk to people. Totally weird ⁴.

That said, sitting in the back has its downsides:

She can never see what the instructor is writing because he has terrible handwriting and always writes too small.
She is prone to falling asleep because the instructor is really boring and the class is not as interesting as Discrete Math was last semester.

To combat her boredom, Katie typically just browses Reddit. Her favorite subreddits are AdviceAnimals, aww, todayilearned, and of course UnixPorn. Katie is tired of having to go to each subreddit and browsing for cool links, however, and decides she wants to create a script, reddit.py, which will allow her to quickly filter or grep a subreddit.

Fortunately for Katie, Reddit provides a JSON feed for every subreddit. You simply need to append .json to the end of each subreddit. For instance, the JSON feed for todayilearned can be found here:

https://www.reddit.com/r/todayilearned/.json

To fetch that data, Katie uses the Requests package in Python to access the JSON data:

import requests

r = requests.get('https://www.reddit.com/r/todayilearned/.json')
print r.json()

429 Too Many Requests

Reddit tries to prevent bots from accessing its website too often. To work around any 429: Too Many Requests errors, we can trick Reddit by specifying our own user agent:

import os

headers  = {'user-agent': 'reddit-{}'.format(os.environ['USER'])}
response = requests.get('https://www.reddit.com/r/linux/.json', headers=headers)

This should allow you to make requests without getting the dreaded 429 error.

This script would output something like the following:

{"kind": "Listing", "data": {"modhash": "g8n3uwtdj363d5abd2cbdf61ed1aef6e2825c29dae8c9fa113", "children": [{"kind": "t3", "data": ...

Looking through that stream of text, Katie sees that the JSON data is a collection of structured or hierarchical dictionaries and lists. This looks a bit complex to her, so she wants you to help her complete the reddit.py script which fetches the JSON data for a subreddit and allows the user to filter articles by specified fields using a regular expression.
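
To make the nesting easier to see, here is a sketch that walks a trimmed-down stand-in for the dictionary returned by r.json(). The outer keys follow the output shown above. The sample articles carry only title and author, and the assumption is that each child's inner data dictionary holds the article's fields:

```python
# Trimmed-down stand-in for the dictionary returned by r.json()
listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {'kind': 't3', 'data': {'title': 'The decline of GPL?',
                                    'author': 'speckz'}},
            {'kind': 't3', 'data': {'title': 'Rusty Builder rustup support in gnome builder',
                                    'author': 'abdulkareemsn'}},
        ],
    },
}

# Each child wraps one article; its fields live under the inner 'data' key
articles = [child['data'] for child in listing['data']['children']]

for index, article in enumerate(articles, 1):
    print('{:2}. {} ({})'.format(index, article['title'], article['author']))
```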

Task 1: `reddit.py`

The reddit.py script has the following usage message:

$ ./reddit.py -h
Usage: reddit.py [ -f FIELD -s SUBREDDIT ] regex
    -f FIELD        Which field to search (default: title)
    -n LIMIT        Limit number of articles to report (default: 10)
    -s SUBREDDIT    Which subreddit to search (default: linux)

As can be seen, the reddit.py script takes three possible flags followed by the regular expression to use in filtering the articles:

The -f flag allows the user to specify which FIELD to search when filtering. By default this is the title field, but it could be any field in the JSON data corresponding to the article.
The -n flag limits the number of articles to report or display. The default is 10 articles.
The -s flag allows the user to specify which SUBREDDIT to search. By default this is the linux subreddit.

Here are some examples of reddit.py in action:

# By default list 10 articles from the r/linux subreddit
$ ./reddit.py

 1. Title:   Delta uses Linux for their in flight entertainment
    Author:  SaberHamLincoln
    Link:    https://www.reddit.com/r/linux/comments/5uv5uh/delta_uses_linux_for_their_in_flight_entertainment/
    Short:   https://is.gd/EhxxUr

 2. Title:   This year's Linux Sucks talk will be the last one ever, apparently.
    Author:  deusmetallum
    Link:    https://www.reddit.com/r/linux/comments/5uyiv9/this_years_linux_sucks_talk_will_be_the_last_one/
    Short:   https://is.gd/k8VAT1

 3. Title:   cron.weekly issue #68: Virtual Memory, Jenkins, Etckeeper, Tensorflow, PGP, Let's Encrypt &amp; more
    Author:  ilconcierge
    Link:    https://www.reddit.com/r/linux/comments/5uyizs/cronweekly_issue_68_virtual_memory_jenkins/
    Short:   https://is.gd/o5Jkk2

 4. Title:   Rusty Builder rustup support in gnome builder
    Author:  abdulkareemsn
    Link:    https://www.reddit.com/r/linux/comments/5uybe6/rusty_builder_rustup_support_in_gnome_builder/
    Short:   https://is.gd/nxGy7Q

 5. Title:   The decline of GPL?
    Author:  speckz
    Link:    https://www.reddit.com/r/linux/comments/5uz3ut/the_decline_of_gpl/
    Short:   https://is.gd/R6hZit

 6. Title:   Linux Action Show's Review of the newest XPS 13 Developer Edition laptop
    Author:  Khaotic_Kernel
    Link:    https://www.reddit.com/r/linux/comments/5uxacn/linux_action_shows_review_of_the_newest_xps_13/
    Short:   https://is.gd/LqSyAm

 7. Title:   GPD Pocket Crowdfunder Passes $1 Million Mark
    Author:  raymii
    Link:    https://www.reddit.com/r/linux/comments/5uyp51/gpd_pocket_crowdfunder_passes_1_million_mark/
    Short:   https://is.gd/jwKa1S

 8. Title:   Linode offering $5 1 GB VPS now. Also upped storage for 2 GB and net out min for all plans to 1000 Mbits
    Author:  upvotes2doge
    Link:    https://www.reddit.com/r/linux/comments/5uxuun/linode_offering_5_1_gb_vps_now_also_upped_storage/
    Short:   https://is.gd/6y0r4M

 9. Title:   Self X-post from r/linuxmint: Workaround for backlight control issues in Linux Mint 18.1 Cinnamon with nVidia drivers
    Author:  ProlificAlias
    Link:    https://www.reddit.com/r/linux/comments/5uy86f/self_xpost_from_rlinuxmint_workaround_for/
    Short:   https://is.gd/cTHvjQ

10. Title:   Krita Update: Support for svg loading and improved vector tools is on it's way.
    Author:  raghukamath
    Link:    https://www.reddit.com/r/linux/comments/5usd62/krita_update_support_for_svg_loading_and_improved/
    Short:   https://is.gd/iXF9eS

# List top article in r/technology
$ ./reddit.py -n 1 -s technology

 1. Title:   Got a tech question or want to discuss tech? Weekly /r/Technology Tech Support / General Discussion Thread
    Author:  AutoModerator
    Link:    https://www.reddit.com/r/technology/comments/5upmjy/got_a_tech_question_or_want_to_discuss_tech/
    Short:   https://is.gd/QjYHpY

# List top article in r/linux that contains the word linux in title
$ ./reddit.py -n 1 'Linux'

 1. Title:   Delta uses Linux for their in flight entertainment
    Author:  SaberHamLincoln
    Link:    https://www.reddit.com/r/linux/comments/5uv5uh/delta_uses_linux_for_their_in_flight_entertainment/
    Short:   https://is.gd/EhxxUr

# List top article in r/linux whose author has a number in it
$ ./reddit.py -n 1 -f author '[0-9]'

 1. Title:   Linode offering $5 1 GB VPS now. Also upped storage for 2 GB and net out min for all plans to 1000 Mbits
    Author:  upvotes2doge
    Link:    https://www.reddit.com/r/linux/comments/5uxuun/linode_offering_5_1_gb_vps_now_also_upped_storage/
    Short:   https://is.gd/6y0r4M

# Error out on invalid field
$ ./reddit.py -f fake
Invalid field: fake

Notice that in addition to displaying the Title, Author, and Link for each article, the script also lists a shortened form of the longer Link. To do this, the reddit.py script creates a URL redirect via the is.gd web service.

Hints

To fetch the JSON data, you should use the requests.get function on the appropriate SUBREDDIT URL.
To filter the articles, you will need to iterate through the appropriate JSON data structure corresponding to the articles.
For each article, you should check if the specified FIELD is valid. If not, you should report an error and exit the program with an error code. Otherwise, you should use the re.search function to check if the REGEX matches the FIELD for the current article.
To generate a shortened URL, you will need to use the requests.get function on 'http://is.gd/create.php' with parameters format set to json and url set to the URL you wish to compress. This request will return a JSON object that contains the shorturl which you should display.
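
The filtering hints above could be sketched as a small function over a list of article dictionaries. `filter_articles` is a hypothetical name, the sample data is hand-copied from the example output, and the sketch assumes the chosen field holds text. It raises KeyError for an invalid field, which a script would catch to print the error and exit:

```python
import re


def filter_articles(articles, field='title', regex='', limit=10):
    """Return up to limit articles whose field matches regex."""
    matches = []
    for article in articles:
        if len(matches) >= limit:
            break
        if field not in article:
            raise KeyError('Invalid field: {}'.format(field))
        if re.search(regex, article[field]):
            matches.append(article)
    return matches


articles = [
    {'title': 'Delta uses Linux for their in flight entertainment',
     'author': 'SaberHamLincoln'},
    {'title': 'The decline of GPL?', 'author': 'speckz'},
    {'title': 'Linode offering $5 1 GB VPS now. Also upped storage for 2 GB '
              'and net out min for all plans to 1000 Mbits',
     'author': 'upvotes2doge'},
]

for article in filter_articles(articles, field='author', regex='[0-9]', limit=1):
    print(article['author'])   # upvotes2doge
```

An empty regex matches every article, which gives the default behavior of listing the first LIMIT articles.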

Task 2: Testing

To verify the correctness of your reddit.py script, you should try the examples above. Because Reddit is an active and live website, we cannot provide an automated way to test the output of your script.

Task 3: `README.md`

In your README.md, describe how you implemented the reddit.py script. In particular, briefly discuss:

How you parsed command line arguments.
How you fetched the JSON data and iterated over the articles.
How you filtered each article based on the FIELD and REGEX.
How you generated the shortened URL.

Guru Point (1 Point)

For extra credit, you are to use NetFile to publish a website. Rather than manually writing HTML, you can use a static website generator such as:

Alternatively, you can cobble together your own website generator using scripts and something like Python-Markdown. For instance, the course website and the instructor's homepage are created using a Python script called yasb.py.

Once you have decided on the tool you wish to use and created some static HTML files, you will need to upload the files to NetFile using your favorite file transfer tool (e.g., sftp, scp, or rsync).

The actual content of the website is up to you, but I recommend that you take this opportunity to perhaps create an online portfolio that you can share with family, friends, and possible employers or schools.

Here are some examples:

Feedback

If you have any questions, comments, or concerns regarding the course, please provide your feedback at the end of your README.md.

Submission

To submit your assignment, please commit your work to the homework04 folder in your assignments GitLab repository. Your homework04 folder should only contain the following files:

README.md
blend.py
reddit.py

¹ Missed you Travis.
² Or Snapchat. Or GroupMe. Or whatever you kids use these days...
³ Admit it. You love GIFs. It's OK. We all do.
⁴ Caring is Creepy
