The goal of this homework assignment is to allow you to practice using Python to create scripts that require sophisticated parsing of data and manipulation of data structures. In this assignment, you will process data in both CSV and JSON format and then present the information to the user in the terminal.
For this assignment, record your scripts and any responses to the following
activities in the homework05
folder of your assignments GitHub
repository and push your work by noon Saturday, March 2.
Before starting this homework assignment, you should first perform a git
pull
to retrieve any changes in your remote GitHub repository:
$ cd path/to/repository # Go to assignments repository
$ git switch master # Make sure we are in master branch
$ git pull --rebase # Get any remote changes not present locally
Next, create a new branch for this assignment:
$ git checkout -b homework05 # Create homework05 branch and check it out
To help you get started, the instructor has provided you with the following skeleton code:
# Go to homework05 folder
$ cd homework05
# Download Makefile
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp24/static/txt/homework05/Makefile
# Download Python skeleton code
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp24/static/txt/homework05/namez.py
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp24/static/txt/homework05/searx.py
Once downloaded, you should see the following files in your homework05
directory:
homework05
\_ Makefile # This is the Makefile building all the assignment artifacts
\_ namez.py # This is the Python script for the namez script
\_ searx.py # This is the Python script for the searx script
You may notice that in addition to the usual comments and TODOs
, the
docstrings of each method also contains a few doctests.
You are not to modify these doctests and must keep them in-place. They are used to verify the correctness of your code.
Your code goes below the docstrings, where the TODO
and pass
commands are (you may remove the pass once you complete the method).
Now that the files are downloaded into the homework05
folder, you can
commit them to your git repository:
$ git add Makefile # Mark changes for commit
$ git add *.py
$ git commit -m "Homework 05: Initial Import" # Record changes
After downloading these files, you can run the make
command to run the
tests.
# Run all tests (will trigger automatic download)
$ make
You will notice that the Makefile downloads these additional test data and scripts:
homework05
\_ namez.input.gz # This is the input file for the namez script
\_ namez.input.test.gz # This is the test input file for the namez script
\_ namez.test # This is the Python test for the namez script
\_ searx.test # This is the Python test for the searx script
In addition to the embedded doctests in the skeleton code, you will be using these unit tests to verify the correctness and behavior of your code.
The test
scripts are automatically downloaded by the Makefile
, so any
modifications you do to them will be lost when you run make
again. Likewise,
because they are automatically downloaded, you do not need to add
or commit
them to your git repository.
The details on what you need to implement for this assignment are described in the following sections.
For the first activity, you are to write a script called namez.py
, which
reads CSV data from the Social Security Agency containing all the names
of babies born from 1880
to 2022
.
# Display usage
$ ./namez.py -h
Usage: namez.py [-p -y YEARS] NAMES
Print the counts for each NAME in the corresponding YEARS.
-p Show percentages instead of counts
-y YEARS Only display these YEARS (default: all)
YEARS may be specified multiple times and can be a single year, or a list
separated by commas, or a range separated by -:
-y 2002 -y 2003 -y 2004
-y 2002,2003,2004
-y 2002-2004
The YEARS are always displayed in ascending order, while the NAMES are
displayed in the order provided.
# Print counts for Peter and James in 1983
$ ./namez.py -y 1983 Peter James
Year Peter James
1983 5525 36624
# Print percentages for Peter and James from 1980 to 1985
$ ./namez.py -y 1980-1985 -p Peter James
Year Peter James
1980 0.1746% 1.1494%
1981 0.1744% 1.1150%
1982 0.1660% 1.1172%
1983 0.1595% 1.0574%
1984 0.1628% 1.0349%
1985 0.1597% 1.0129%
The namez.py
script takes the following possible flags:
-p
: This has the script print out the percentage within the current year
instead of the count for the baby name.
-y
: This allows the user to specify which YEARS
to print out. There
are three possible formats for this flag as noted in the usage above.
-h
: This prints the usage message and exits with success.
After the flags, the remaining arguments are the NAMES
to include in the
table.
If an unsupported flag is specified, or if no NAMES
are given, then
namez.py
should print out the usage message and exit with failure.
The baby names are stored in a CSV file, namez.input.gz
, which is in the
following format:
year,name,gender,count
1880,Mary,F,7065
1880,Anna,F,2604
1880,Emma,F,2003
1880,Elizabeth,F,1939
1880,Minnie,F,1746
1880,Margaret,F,1578
1880,Ida,F,1472
1880,Alice,F,1414
1880,Bertha,F,1320
...
Because the CSV file is so large, it has been compressed with gzip, which
means you will need to use gzip.open to read the file in Python.
Additionally, because the complete database is so large, it will take a few
seconds to load. For testing purposes, the doctests and some of the unit
tests will verify the functionality of your program using
names.input.test.gz
instead, which is a reduced version of the complete
CSV data.
namez.py
¶To implement the namez.py
Python script, you will need to complete the
following functions:
load_database(path: str=PATH) -> Database
This function reads the CSV data from the specified
path
and returns a dict that maps years to a dict with names mapped to counts (ie.{YEAR: {NAME: COUNT}}
).
Hints: You can use the csv.DictReader function with gzip.open to read the CSV data.
for record in csv.DictReader(gzip.open(path, 'rt')):
...
The dict you return has the type alias, Database
, which can be used
to annotate your dict variable:
database: Database = {}
print_table(database: Database, years: list[int]=[], names: list[str]=[], percentage=False) -> None
This function prints out a table of
years
andnames
(along with their corresponding counts or percentages). The first column contains the year, while the second column contains the counts or percentages for the first name in the corresponding year, etc. The first row is a header containing the wordyear
and then each of thenames
.
Hints: The names can be aligned into columns by using str.rjust with
a width of 12
. Likewise, the counts should be aligned as follows:
f'{count:>12}'
While percentages should be aligned as follows:
f'{percentage:>11.4f}%'
You should considering collecting all the columns for each row into a list and then using str.join to combine them all into a single string for printing.
main(arguments=sys.argv[1:]) -> None
This function processes the command line
arguments
, loads the database, and then prints a table with specifiedyears
andnames
.
Hints: To convert a sequence of str
s to int
s, you can use either
a list comprehension:
[int(s) for s in strs]
Or the map function:
list(map(int, strs))
To add all the elements of a list to an existing list, you can use the list.extend method:
list1.extend(list2)
As you implement namez.py
, you can use the provided doctests to verify the
correctness of your code:
# Run doctests
$ python3 -m doctest namez.py -v
2 items had no tests:
namez
namez.usage
3 items passed all tests:
4 tests in namez.load_database
1 tests in namez.main
2 tests in namez.print_table
7 tests in 5 items.
7 passed and 0 failed.
Test passed.
You can also use make
to run both the doctests and the unit tests:
# Run unit tests (and doctests)
$ make test-namez
Testing NameZ ...
test_00_doctest (__main__.NamezTest) ... ok
test_01_mypy (__main__.NamezTest) ... ok
test_02_load_database (__main__.NamezTest) ... ok
test_03_print_table (__main__.NamezTest) ... ok
test_04_main_usage (__main__.NamezTest) ... ok
test_05_main_year_multiple (__main__.NamezTest) ... ok
test_06_main_year_list (__main__.NamezTest) ... ok
test_07_main_year_range (__main__.NamezTest) ... ok
test_08_main_year_range_percentage (__main__.NamezTest) ... ok
Score 5.00 / 5.00
Status Success
----------------------------------------------------------------------
Ran 9 tests in 21.071s
OK
To just run the unit tests, you can do the following:
# Run unit tests
$ ./namez.test -v
...
To run a specific unit test, you can specify the method name:
# Run only mypy unit test
$ ./namez.test -v NamezTest.test_01_mypy
...
To manually check your types, you can use mypy:
# Run mypy to check types
$ mypy namez.py
For the second activity, you are to write a script called searx.py
, which
fetches search results from searx.ndlug.org, a
website running SearXNG:
SearXNG is a metasearch engine, aggregating the results of other search engines while not storing information about its users.
With this script, you can search the Internet right from your terminal!
# Display usage
$ ./searx.py
Usage: searx.py [-u URL -n LIMIT -o ORDERBY] QUERY
Fetch SearX results for QUERY and print them out.
-u URL Use URL as the SearX instance (default is: https://searx.ndlug.org/search)
-n LIMIT Only display up to LIMIT results (default is: 5)
-o ORDERBY Sort the search results by ORDERBY (default is: score)
If ORDERBY is score, the results are shown in descending order. Otherwise,
results are shown in ascending order.
# Search for notre dame
$ ./searx.py notre dame
1. University of Notre Dame [4.00]
https://www.nd.edu/
2. Notre-Dame de Paris [1.40]
https://en.wikipedia.org/wiki/Notre-Dame_de_Paris
3. Notre Dame Athletics | The Fighting Irish [1.11]
https://fightingirish.com/
4. University of Notre Dame [1.00]
https://en.wikipedia.org/wiki/University_of_Notre_Dame
5. Notre Dame [1.00]
https://en.wikipedia.org/wiki/Notre_Dame
# Search for notre dame, limit to one result, order by title
$ ./searx.py -n 1 -o title notre dame
1. 2023 Notre Dame football schedule: Dates, times, TV channels, scores [0.05]
https://www.ncaa.com/news/football/article/2023-12-03/2023-notre-dame-football-schedule-dates-times-tv-channels-scores
The searx.py
script takes the following possible arguments:
-u
: This allows the user to set the URL
of the SearXNG instance to
search. By default, this is
https://searx.ndlug.org/search.
-n
: This allows the user to limit the number of search results to print
out. By default, this is 5
.
-o
: This allows the user to specify how to sort the results. By
default, it will sort by score
in descending order. Otherwise, it will
sort either the title
or url
in ascending order.
-h
: This prints the usage message and exits with success.
After the flags, the remaining arguments is the QUERY
to search.
If an unsupported flag is specified, or if no QUERY
is given, then
searx.py
should print out the usage message and exit with failure.
searx.py
¶To implement the searx.py
Python script, you will need to complete the
following functions:
searx_query(query: str, url: str=URL) -> list[dict]
This function makes a HTTP request to the SearXNG instance specified by
url
and returns the search results corresponding toquery
.
Hints: To fetch the results from SearXNG, you will need to specify the following parameters when making a HTTP request:
{'q': query, 'format': 'json'}
To do so, you can use requests.get:
requests.get(url, params=...)
Remember that you want to return just the list of results inside the JSON data, and not the whole response.
print_results(results: list[dict], limit: int=LIMIT, orderby: str=ORDERBY) -> None
This function prints up to
limit
searchresults
sorted byorderby
.
Hints: Each search result should be printed out in the following format:
'{index:>4}.\t{title} [{score:0.2f}]'
'\t{url}'
Note: There should be a blank line between each search result, but not before the first result or after the last result.
You should sort the results using sorted before slicing it up
to the limit
. Remember that you can control how each element is compared
in sorted by specifying a key
function:
sorted(results, key=lambda r: r['score'], reverse=True)
The code above would return a sequence of results ordered by score
in descending order.
main(arguments=sys.argv[1:]) -> None
This function processes the command line
arguments
, fetches the search results forQUERY
, and then prints out the results.
Hints: You may wish to use str.join to combine the remaining
arguments into a single query
.
As you implement searx.py
, you can use the provided doctests to verify the
correctness of your code:
# Run doctests
$ python3 -m doctest searx.py -v
...
2 items had no tests:
searx
searx.usage
3 items passed all tests:
1 tests in searx.main
1 tests in searx.print_results
1 tests in searx.searx_query
3 tests in 5 items.
3 passed and 0 failed.
Test passed.
You can also use make
to run both the doctests and the unit tests:
# Run unit tests (and doctests)
$ make test-searx
Testing SearX ...
test_00_doctest (__main__.SearxTest) ... ok
test_01_mypy (__main__.SearxTest) ... ok
test_02_searx_query_python (__main__.SearxTest) ... ok
test_03_searx_query_peter_bui (__main__.SearxTest) ... ok
test_04_searx_query_cse_20289_sp24 (__main__.SearxTest) ... ok
test_05_searx_query_url (__main__.SearxTest) ... ok
test_06_print_results (__main__.SearxTest) ... ok
test_07_print_results_limit_1 (__main__.SearxTest) ... ok
test_08_print_results_orderby_title (__main__.SearxTest) ... ok
test_09_print_results_limit_3_orderby_url (__main__.SearxTest) ... ok
test_10_main_usage (__main__.SearxTest) ... ok
test_11_main_python (__main__.SearxTest) ... ok
test_12_main_peter_bui (__main__.SearxTest) ... ok
test_13_main_cse_20289_sp24 (__main__.SearxTest) ... ok
test_14_main_url_python (__main__.SearxTest) ... ok
test_15_main_limit_1_url_python (__main__.SearxTest) ... ok
test_16_main_url_orderby_title_python (__main__.SearxTest) ... ok
test_17_main_url_orderby_url_limit_3_python (__main__.SearxTest) ... ok
Score 5.00 / 5.00
Status Success
----------------------------------------------------------------------
Ran 18 tests in 8.387s
OK
To just run the unit tests, you can do the following:
# Run unit tests
$ ./searx.test -v
...
To run a specific unit test, you can specify the method name:
# Run only mypy unit test
$ ./searx.test -v SearxTest.test_01_mypy
...
To manually check your types, you can use mypy:
# Run mypy to check types
$ mypy searx.py
Once you have completed all the activities above, you are to complete the following reflection quiz:
As with Reading 01, you will need to store your answers in a
homework05/answers.json
file. You can use the form above to generate the
contents of this file, or you can write the JSON by hand.
To test your quiz, you can use the check.py
script:
$ ../.scripts/check.py
Checking homework05 quiz ...
Q01 0.50
Q02 0.50
Score 1.00 / 1.00
Status Success
For extra credit, you are to plot the counts of multiple names from every
year using the baby names database in namez.py
. To do so, you should use
matplotlib inside a Jupyter notebook to plot multiple lines on a single
graph as shown below:
To run Jupyter on the student machines, you can use the following command:
$ jupyter-notebook-3.11 --ip studentXX.cse.nd.edu --port 9000 --no-browser
Replace studentXX.cse.nd.edu
with the name of the student machine you are
on. Also, you may need to change the port to something else between 9000
-
9999
.
Alternatively, you can use Google Colaboratory to run a Jupyter notebook from your Google Drive. This is a hassle-free way to explore developing with Python and matplotlib, while using Google Cloud resources (thus no need for the VPN or the student machines).
To get credit for this Guru Point, show your Jupyter notebook with the graphical plot to a TA to verify (or attached a video / screenshot to your Pull Request). You have up until a week after this assignment is due to verify your Guru Point.
Remember that you can always forgo this Guru Point for two extra days to do the homework. That is, if you need an extension, you can simply skip the Guru Point and you will automatically have until Monday to complete the assignment for full credit.
Just leave a note on your Pull Request of your intentions.
To submit your assignment, please commit your work to the homework05
folder
of your homework05
branch in your assignments GitHub repository.
Your homework05 folder should only contain the following files:
Note: You do not need to commit the test scripts because the Makefile
automatically downloads them.
#-----------------------------------------------------------------------
# Make sure you have already completed Activity 0: Preparation
#-----------------------------------------------------------------------
...
$ git add namez.py # Mark changes for commit
$ git commit -m "Homework 05: Activity 1 completed" # Record changes
...
$ git add searx.py # Mark changes for commit
$ git commit -m "Homework 05: Activity 2 completed" # Record changes
...
$ git add answers.json # Mark changes for commit
$ git commit -m "Homework 05: Activity 3 completed" # Record changes
...
$ git push -u origin homework05 # Push branch to GitHub
Remember to create a Pull Request and assign the appropriate TA from the Reading 06 TA List.
DO NOT MERGE your own Pull Request. The TAs use open Pull Requests to keep track of which assignments to grade. Closing them yourself will cause a delay in grading and confuse the TAs.