Reading 06: Data Processing

This Is Not The Course Website You Are Looking For

This course website is from a previous semester. If you are currently in the class, please make sure you are viewing the latest course website instead of this old one.

Everyone:

Next week, we will continue to explore using the Python programming language for data processing. In particular, we will review regular expressions and learn how to utilize them in Python and then examine ways of processing structured information such as CSV and JSON files.

TL;DR

The focus of this reading is to introduce data processing in Python using regular expressions, HTTP requests, and CSV and JSON data.

Readings

The readings for this week are:

Automate The Boring Stuff
- Chapter 7. Pattern Matching with Regular Expressions
  
  Focus on the use of re.findall.
- Chapter 12. Web Scraping
  
  Focus on the usage of requests.
- Chapter 16. Working with CSV Files and JSON Data
  
  Focus on csv.reader and csv.DictReader and using requests with JSON.

Optional Resources

Here are some additional resources:

Scripts

This week, there is no reading quiz. Instead, you are to complete three Python scripts: faculty.py, users.py, and machines.py.

To test these scripts, you will need to download the Makefile and test scripts:

$ git checkout master                 # Make sure we are in master branch
$ git pull --rebase                   # Make sure we are up-to-date with GitLab

$ git checkout -b reading06           # Create reading06 branch and check it out

$ cd reading06                        # Go into reading06 folder

# Download Reading 06 Makefile
$ curl -LO https://gitlab.com/nd-cse-20289-sp20/cse-20289-sp20-assignments/raw/master/reading06/Makefile

# Execute tests (and download them)
$ make

Script: `faculty.py`

For the first script, faculty.py, you are to use regular expressions to extract all the graduation years of the CSE faculty from the following webpage: https://cse.nd.edu/people/faculty. After counting up how many people graduated each year, you are to display the totals in descending sorted order as shown below:

$ ./faculty.py
      4 2004
      4 2002
      3 2012
      2 2017
      2 2013
      2 2009
      2 1992
      2 1984
      2 1960
      1 2018
      1 2016
      1 2015
      1 2014
      1 2010
      1 2006
      1 2005
      1 2000
      1 1997
      1 1994
      1 1990
      1 1989
      1 1985
      1 1980
      1 1977
      1 1973
      1 1967

It should basically be the Python equivalent of the following pipeline:

$ curl -sL https://cse.nd.edu/people/faculty | \
  sed -En 's|.*<p>.*([PhD\.MS]{3,5}).*([0-9]{4}).*</p>.*|\2|p' | \
  sort | uniq -c | sort -rn

Irregular Counts

Remember that regular expressions are only effective if the source data being filtered is regular or matches a specific pattern. In the example above, 1960 shows up twice even though it should only appear once (one faculty member) due to the irregular text.

For this assignment, you just need to match the output of the test script, even if it is not totally accurate.
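As a rough illustration of how irregular text can inflate a count, the sketch below runs re.findall over a made-up snippet. The SAMPLE text and the naive pattern are assumptions for demonstration only; they are not the real faculty page or the regular expression the test script expects.

```python
import re

# Hypothetical sample text; the real faculty page is formatted differently.
SAMPLE = '''
<p>Ph.D., Example University, 1960</p>
<p>M.S., Example University, 1958; joined the faculty in 1960</p>
<p>Ph.D., Another University, 2004</p>
'''

# Naive pattern: treat every standalone four-digit number as a graduation year.
years = re.findall(r'\b(\d{4})\b', SAMPLE)

# The second paragraph mentions 1960 in a different context, so that year is
# counted twice even though only one degree was awarded in 1960.
print(years)
print(years.count('1960'))
```

A stricter pattern would reduce false matches like this, but only as far as the source text follows a predictable format.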

Skeleton

To help you get started, we have provided you with the following faculty.py skeleton code:

import collections
import re
import requests

# Constants

URL = 'https://cse.nd.edu/people/faculty'

# Initialize a dictionary with integer values
counts = collections.defaultdict(int)

# TODO: Make an HTTP request to URL
response = None

# TODO: Access text from response object
data = None

# TODO: Compile regular expression to extract degrees and years of each faculty
# member
regex = None

# TODO: Search through data using compiled regular expression and count up all
# the faculty members per year.
pass

# TODO: Sort items in counts by key in reverse order
items = {}

# Sort items by value in reverse order and display counts and years
for year, count in sorted(items, key=lambda p: p[1], reverse=True):
    print(f'{count:>7} {year}')

Implement the TODO sections in the code in order to complete the faculty.py script.

Hints

A collections.defaultdict object is just a dict that has a default value according to the specified type. For instance, because counts is a collections.defaultdict(int), accessing counts[key] will automatically insert a 0 (the default int value) for that key if it does not exist in the dict yet.

You can use requests.get to make an HTTP request. The response object you get back has a text attribute you can use to access the content or data of the response.

You can use the re.compile function to compile regular expressions. Likewise, you can use the re.findall function to search for all matches in a string. For instance:

# Match phone numbers in the format XXX-XXX-XXXX, while grouping just
# the area codes.
regex = re.compile(r'(\d{3})-\d{3}-\d{4}')
for areacode in re.findall(regex, data):
    print(areacode)

You can set the reverse keyword argument of the sorted function to sort in descending order (the default is ascending).
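To make the counting and sorting hints concrete, here is a minimal sketch that tallies a made-up list of years (the years list is hypothetical, standing in for whatever your regular expression extracts). It relies on the fact that sorted is stable, so sorting by key first and by count second leaves ties ordered by year.

```python
import collections

# Hypothetical extracted years (strings, as re.findall would return them).
years = ['2004', '2012', '2004', '1992', '2012', '2004', '2017']

counts = collections.defaultdict(int)
for year in years:
    counts[year] += 1           # Missing keys start at 0

# First sort by year (the key) in descending order...
items = sorted(counts.items(), reverse=True)

# ...then sort by count in descending order. Because sorted is stable, years
# with the same count stay in descending year order.
ranked = sorted(items, key=lambda p: p[1], reverse=True)
for year, count in ranked:
    print(f'{count:>7} {year}')
```

The {count:>7} format right-aligns the count in a field seven characters wide, which mimics the layout produced by uniq -c.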

Script: `users.py`

For the second script, users.py, you are to use csv.reader to loop through the records in your local /etc/passwd file and extract all the user descriptions (i.e., the fifth field).

It should basically be the Python equivalent of the following pipeline:

$ cat /etc/passwd | cut -d : -f 5 | sed '/^\s*$/d' | env LC_ALL=C sort

For instance, here is the output of users.py on student05.cse.nd.edu:

$ ./users.py
Account used by the trousers package to sandbox the tcsd daemon
Anonymous NFS User
Apache
Condor Batch System
FTP User
GlusterFS daemons
Guest
LDAP Client User
MariaDB Server
NSCD Daemon
...
mail
operator
qemu user
root
shutdown
sync
systemd Bus Proxy
systemd Network Management
tog-pegasus OpenPegasus WBEM/CIM services
usbmuxd user

Note: The output does not contain any lines that are empty.
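The sample output lists capitalized descriptions before lowercase ones because the pipeline sorts with LC_ALL=C. As a small sketch (using a made-up list of descriptions, not a real /etc/passwd), Python's sorted behaves similarly for ASCII text because it compares strings by code point:

```python
# Hypothetical descriptions, not taken from a real /etc/passwd.
descriptions = ['root', 'Apache', 'mail', 'FTP User', 'sync']

# sorted compares strings by Unicode code point, so 'A'-'Z' come before
# 'a'-'z', much like sort under LC_ALL=C for ASCII text.
ordered = sorted(descriptions)
print(ordered)
```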

Skeleton Code

To help you get started, we have provided you with the following users.py skeleton code:

import csv

# Constants

PATH = '/etc/passwd'

# TODO: Loop through ':' delimited data in PATH and extract the fifth field
# (user description)
pass

# TODO: Print user descriptions in sorted order
pass

Implement the TODO sections in the code in order to complete the users.py script.

Hints

The csv.reader function takes a file stream (i.e., one returned by open) and can optionally take a delimiter keyword argument to specify what separates the fields in each row.

You can check if a string is non-empty by simply doing the following:

if string:      # Non-empty strings evaluate to True
    do_the_thing()
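As a minimal sketch of both hints together, the example below parses colon-delimited text with csv.reader and skips empty fields. The SAMPLE records are hypothetical, and io.StringIO is used here only as a stand-in for a file opened with open.

```python
import csv
import io

# Hypothetical colon-delimited records in the style of /etc/passwd.
SAMPLE = (
    'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash\n'
    'svc:x:999:999::/var/empty:/sbin/nologin\n'
)

# io.StringIO stands in for a real file stream.
rows = list(csv.reader(io.StringIO(SAMPLE), delimiter=':'))

# Keep the fifth field (index 4) of each row, skipping empty strings.
fifth = [row[4] for row in rows if row[4]]
print(fifth)
```

Each row comes back as a list of strings, so the fifth field is at index 4.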

Script: `machines.py`

For the third script, machines.py, you are to parse the JSON data from http://catalog.cse.nd.edu:9097/query.json, which contains a listing of all the machines and services registered with the Cooperative Computing Lab, and display the names of the machines with the type wq_factory.

It should basically be the Python equivalent of the following pipeline:

$ curl -sL http://catalog.cse.nd.edu:9097/query.json | \
  sed -En 's/\{"name":"([^"]+)".*"type":"wq_factory".*/\1/p'

For instance, here is the output of machines.py:

$ ./machines.py
103-165-135-150.dynamic.arizona.edu
126-165-135-150.dynamic.arizona.edu
barricade.cri.uchicago.edu
condorfe.crc.nd.edu
crcfe01.crc.nd.edu
crcfe02.crc.nd.edu
earth.crc.nd.edu
vm142-121.cyverse.org
Yeti.lifemapper.org

Skeleton

To help you get started, we have provided you with the following machines.py skeleton code:

import requests

# Constants

URL = 'http://catalog.cse.nd.edu:9097/query.json'

# TODO: Make an HTTP request to URL
response = None

# TODO: Access json representation from response object
data = None

# TODO: Display all machine names with type "wq_factory"
pass

Implement the TODO sections in the code in order to complete the machines.py script.

Hints

The response object returned by the requests.get function has a json() method that parses the response text as JSON data and returns an appropriate Python data structure.

You may wish to use the pprint.pprint function to examine and explore the JSON data to determine how to access the 'name' and 'type' information of each machine.
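To show what exploring parsed JSON can look like, the sketch below uses json.loads on a made-up string in place of a live response. The TEXT payload and its layout (a list of objects with 'name' and 'type' keys) are assumptions; inspect the real catalog data before relying on any particular structure.

```python
import json
import pprint

# Hypothetical payload; the real catalog may contain other fields or layouts.
TEXT = '''
[
    {"name": "alpha.example.org", "type": "wq_factory"},
    {"name": "beta.example.org",  "type": "wq_master"}
]
'''

# json.loads turns JSON text into Python lists and dicts, which is roughly
# what the json() method of a response object does for you.
data = json.loads(TEXT)
pprint.pprint(data)

# Using get avoids a KeyError if an entry has no 'type' field.
names = [entry['name'] for entry in data if entry.get('type') == 'wq_factory']
print(names)
```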

Testing

To test all of these scripts, you can use the provided test_scripts.sh, which should have been downloaded by the Makefile:

$ ./test_scripts.sh
Testing scripts...
 faculty.py                               ... Success
 users.py                                 ... Success
 machines.py                              ... Success

   Score 4.00

We'll Do It Live

Because each of these scripts pulls its data from a live source, the outputs may change between runs.

Submission

To submit your work, follow the same process outlined in Reading 01:

#--------------------------------------------------
# BE SURE TO DO THE PREPARATION STEPS ABOVE
#--------------------------------------------------

$ cd reading06                        # Go into reading06 folder
$ $EDITOR answers.json                # Edit your answers.json file

$ $EDITOR faculty.py                  # Edit your faculty.py file

$ $EDITOR users.py                    # Edit your users.py file

$ $EDITOR machines.py                 # Edit your machines.py file

$ ./test_scripts.sh                   # Test your reading 06 scripts

$ git add Makefile                    # Add Makefile to staging area
$ git add faculty.py                  # Add faculty.py to staging area
$ git add users.py                    # Add users.py to staging area
$ git add machines.py                 # Add machines.py to staging area

$ git commit -m "Reading 06: Scripts" # Commit work

$ git push -u origin reading06        # Push branch to GitLab

Merge Request

Remember to create a merge request and assign the appropriate TA from the Reading 06 TA List.

DO NOT MERGE your own merge request. The TAs use open merge requests to keep track of which assignments to grade. Closing them yourself will cause a delay in grading and confuse the TAs.
