Homework 04: Cut, WC

The goal of this homework assignment is to allow you to practice writing Python scripts to create simple Unix filters or utilities that utilize data structures such as lists and dicts.

For this assignment, record your scripts and any responses to the following activities in the homework04 folder of your assignments GitHub repository and push your work by noon Saturday, February 24.

Activity 0: Preparation

Before starting this homework assignment, you should first perform a git pull to retrieve any changes in your remote GitHub repository:

$ cd path/to/repository                   # Go to assignments repository
$ git switch master                       # Make sure we are in master branch
$ git pull --rebase                       # Get any remote changes not present locally

Next, create a new branch for this assignment:

$ git checkout -b homework04              # Create homework04 branch and check it out

Task 1: Skeleton Code

To help you get started, the instructor has provided you with the following skeleton code:

# Go to homework04 folder
$ cd homework04

# Download Makefile
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp24/static/txt/homework04/Makefile

# Download Python skeleton code
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp24/static/txt/homework04/cut.py
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp24/static/txt/homework04/wc.py

Once downloaded, you should see the following files in your homework04 directory:

homework04
    \_ Makefile     # This is the Makefile building all the assignment artifacts
    \_ cut.py       # This is the Python script for the cut utility
    \_ wc.py        # This is the Python script for the wc utility

DocTests

You may notice that, in addition to the usual comments and TODOs, the docstring of each function also contains a few doctests.

You are not to modify these doctests and must keep them in-place. They are used to verify the correctness of your code.

Your code goes below the docstrings, where the TODO comments and pass statements are (you may remove the pass once you complete the function).
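
As a minimal illustration of how a doctest sits inside a docstring, consider the sketch below. The function double is hypothetical and is not part of the skeleton code; it only shows where the doctest lives and where the implementation goes.

```python
def double(number: int) -> int:
    ''' Return twice the given number (hypothetical example function).

    >>> double(2)
    4
    >>> double(-3)
    -6
    '''
    # The implementation goes below the docstring, where a skeleton
    # would otherwise have a TODO comment and a pass statement.
    return number * 2
```

Saving this in a file and running python3 -m doctest on it executes each line that starts with >>> and compares the result against the line that follows.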

Task 2: Initial Import

Now that the files are downloaded into the homework04 folder, you can commit them to your git repository:

$ git add Makefile                            # Mark changes for commit
$ git add *.py
$ git commit -m "Homework 04: Initial Import" # Record changes

Task 3: Unit Tests

After downloading these files, you can run the make command to run the tests.

# Run all tests (will trigger automatic download)
$ make

You will notice that the Makefile downloads these additional test data and scripts:

homework04
    \_ cut.input    # This is the input file for the cut test script
    \_ cut.test     # This is the Python test for the cut utility
    \_ wc.input     # This is the input file for the wc test script
    \_ wc.test      # This is the Python test for the wc utility

In addition to the embedded doctests in the skeleton code, you will be using these unit tests to verify the correctness and behavior of your code.

Automatic Downloads

The test scripts are automatically downloaded by the Makefile, so any modifications you make to them will be lost when you run make again. Likewise, because they are automatically downloaded, you do not need to add or commit them to your git repository.

The details on what you need to implement for this assignment are described in the following sections.

Activity 1: Cut (4.5 Points)

For the first activity, you are to re-create a simplified version of the cut filter:

# Display usage message
$ ./cut.py -h
Usage: cut -d DELIMITER -f FIELDS

Print selected parts of lines from stream to standard output.

    -d DELIMITER    Use DELIM instead of TAB for field delimiter
    -f FIELDS       Select only these fields
$ echo $?
0

$ ./cut.py
Usage: cut -d DELIMITER -f FIELDS

Print selected parts of lines from stream to standard output.

    -d DELIMITER    Use DELIM instead of TAB for field delimiter
    -f FIELDS       Select only these fields
$ echo $?
1

# Extract the first field
$ echo 'Harder, Better, Faster, Stronger' | ./cut.py -d , -f 1
Harder

# Extract the second and fourth fields
$ echo 'Harder, Better, Faster, Stronger' | ./cut.py -d , -f 2,4
 Better, Stronger

# Extract the first field with test input (first four lines)
$ ./cut.py -d , -f 1 < cut.input | head -n 4
Work it
Do it
Harder
Faster

The cut.py script reads input from standard input and takes the following possible flags:

-d: This allows the user to specify a DELIMITER. By default, it is the TAB character (i.e. \t).
-f: This allows the user to specify which fields to select. The user must specify this flag; otherwise, the script should print the usage message and exit with a failure status, as shown above.
-h: This prints the usage message and exits with success.

If the user specifies an unknown flag, then the program should print the usage message and exit with a failure status.

As with cut, cut.py processes each line from standard input by splitting the line by the DELIMITER, joining the selected FIELDS with the DELIMITER, and then printing out the resulting selection.
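
The split, select, and join steps described above can be sketched on a single line of input as follows. This is only an illustration under stated assumptions, not the required implementation: the variable names are hypothetical, the field numbers are treated as 1-based (as on the command line), and field 9 is included to show what happens when a line has fewer fields than requested.

```python
line      = 'Harder, Better, Faster, Stronger'
delimiter = ','
fields    = [2, 4, 9]       # 1-based field numbers; field 9 does not exist

pieces    = line.split(delimiter)       # Split the line by the delimiter
selection = []
for field in fields:
    try:
        selection.append(pieces[field - 1])     # Convert to a 0-based index
    except IndexError:
        pass                                    # Skip fields the line lacks

print(delimiter.join(selection))        # Prints " Better, Stronger"
```

Note that the leading space in the output is part of the second field, since the line is split on the comma alone.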

Task 1: `cut.py`

To build this filter, you will need to complete the cut.py Python script by implementing the following functions:

strs_to_ints(strings: list[str]) -> list[int]

This function converts a list of strings into a list of ints.

cut_line(line: str, delimiter: str='\t', fields: list[int]=[]) -> list[str]

This function splits the given line by the delimiter and returns a list containing only the specified fields.

Hint: Consider what you want to loop over and how you can use a try/except to handle IndexErrors.

cut_stream(stream=sys.stdin, delimiter: str='\t', fields: list[int]=[]) -> None

This function calls cut_line on each line in stream with the specified delimiter and fields. It joins each result from cut_line with the delimiter and then prints out the selection.

Hint: Consider using str.rstrip to remove trailing whitespace from each line while using str.join to combine the selections extracted from each line.

main(arguments=sys.argv[1:], stream=sys.stdin) -> None

This function processes the command-line arguments to determine the user specified DELIMITER and FIELDS (as described above) and then calls cut_stream with the given stream and computed DELIMITER and FIELDS.

Hint: Consider when to use strs_to_ints and how to check if the user did not specify any FIELDS.
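
One possible way to walk through the arguments is sketched below. The helper parse_arguments is hypothetical and not part of the skeleton code. It covers only the -d and -f flags, and it raises ValueError where the real script would instead print the usage message and exit with the appropriate status.

```python
def parse_arguments(arguments: list[str]) -> tuple[str, list[int]]:
    ''' Hypothetical helper: return (delimiter, fields) from the arguments. '''
    delimiter = '\t'                    # Default delimiter is TAB
    fields: list[int] = []
    arguments = list(arguments)         # Copy so the caller's list is untouched

    while arguments:
        argument = arguments.pop(0)
        if argument == '-d':
            delimiter = arguments.pop(0)
        elif argument == '-f':
            # Convert a string such as '2,4' into the list [2, 4]
            fields = [int(field) for field in arguments.pop(0).split(',')]
        else:
            raise ValueError(f'unknown flag: {argument}')

    if not fields:                      # The -f flag is required
        raise ValueError('no fields specified')

    return delimiter, fields

print(parse_arguments(['-d', ',', '-f', '2,4']))    # Prints (',', [2, 4])
```

Consuming the list with pop(0) keeps each flag next to the value that follows it; an index-based loop would work just as well.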

Task 2: Testing

As you implement cut.py, you can use the provided doctests to verify the correctness of your code:

# Run doctests
$ python3 -m doctest cut.py -v
...
2 items had no tests:
    cut
    cut.usage
4 items passed all tests:
   1 tests in cut.cut_line
   1 tests in cut.cut_stream
   1 tests in cut.main
   1 tests in cut.strs_to_ints
4 tests in 6 items.
4 passed and 0 failed.
Test passed.

You can also use make to run both the doctests and the unit tests:

# Run unit tests (and doctests)
$ make test-cut
Testing Cut ...
test_00_doctest (__main__.CutTest) ... ok
test_01_mypy (__main__.CutTest) ... ok
test_02_strs_to_ints (__main__.CutTest) ... ok
test_03_cut_line (__main__.CutTest) ... ok
test_04_cut_stream (__main__.CutTest) ... ok
test_05_main_usage (__main__.CutTest) ... ok
test_06_main_1 (__main__.CutTest) ... ok
test_07_main_1_2 (__main__.CutTest) ... ok
test_08_main_1_2_4 (__main__.CutTest) ... ok

   Score 4.50 / 4.50
  Status Success

----------------------------------------------------------------------
Ran 9 tests in 0.052s

OK

To just run the unit tests, you can do the following:

# Run unit tests
$ ./cut.test -v
...

To run a specific unit test, you can specify the method name:

# Run only mypy unit test
$ ./cut.test -v CutTest.test_01_mypy
...

To manually check your types, you can use mypy:

# Run mypy to check types
$ mypy cut.py

To use the test input with the script, you can utilize I/O redirection:

# Run script with input
$ ./cut.py -d , -f 1 < cut.input
...

Activity 2: WC (4.5 Points)

For the second activity, you are to re-create a simplified version of the wc utility:

# Display usage message
$ ./wc.py -h
Usage: wc.py [-l | -w | -c]

Print newline, word, and byte counts from standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, byte.

    -c      Print byte counts
    -l      Print newline counts
    -w      Print word counts

# Count newlines, words, and bytes:
$ echo Despite all my rage, I am still just a rat in a cage | ./wc.py
1 13 53

# Count only newlines
$ ./wc.py -l < wc.input
18

The wc.py script reads input from standard input and takes the following possible flags:

-c: This allows the user to specify that the program should print byte counts.
-l: This allows the user to specify that the program should print newline counts.
-w: This allows the user to specify that the program should print word counts.
-h: This prints the usage message and exits with success.

If the user specifies an unknown flag, then the program should print the usage message and exit with a failure status.

As with wc, wc.py computes the newline, word, and byte counts of the data from standard input and prints out the selected counts. If no options are specified, then all three counts are reported.
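
The three counts can be accumulated line by line, as in the sketch below. It is an illustration under stated assumptions rather than the required implementation: io.StringIO stands in for standard input, dict.get is just one way to avoid a KeyError on the first update, and the byte count assumes the text is encoded as UTF-8.

```python
import io

# Stand-in for standard input
stream = io.StringIO('Despite all my rage, I am still just a rat in a cage\n')
counts: dict[str, int] = {}

for line in stream:
    # dict.get supplies 0 the first time a key is seen
    counts['newlines'] = counts.get('newlines', 0) + line.count('\n')
    counts['words']    = counts.get('words', 0)    + len(line.split())
    counts['bytes']    = counts.get('bytes', 0)    + len(line.encode())

print(counts)   # Prints {'newlines': 1, 'words': 13, 'bytes': 53}
```

With this approach an empty stream leaves the dict empty, so a complete implementation may prefer to decide up front how the keys are initialized.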

Task 1: `wc.py`

To build this utility, you will need to complete the wc.py Python script by implementing the following functions:

count_stream(stream=sys.stdin) -> dict[str, int]

This function counts the number of newlines, words, and bytes and returns a dict in the form: {'newlines': ..., 'words': ..., 'bytes': ...}.

Hint: Consider looping over the stream line-by-line, and consider how to handle possible KeyErrors when updating the count for one of the options.

print_counts(counts: dict[str, int], options: list[str]) -> None

This function prints the counts of the specified options. If no options are specified (e.g. options is an empty list), then all options are reported (i.e. newlines, words, bytes).

Hint: Consider what you should loop over and what you should search to determine if an option is selected or not. Print the selected options by combining them with a space (i.e. ' ') delimiter.

main(arguments=sys.argv[1:], stream=sys.stdin) -> None

This function processes the command-line arguments to determine the user specified options, computes the counts using count_stream on stream, and then reports the counts using print_counts.

Hint: Consider how to keep track of the options while looping through the arguments.
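
The selection logic for reporting can be sketched as follows. This is a hedged illustration only: it assumes the options list holds the same names as the keys of the counts dict, which may not match how the skeleton code represents options, and the particular selection shown is hypothetical.

```python
counts  = {'newlines': 18, 'words': 97, 'bytes': 439}
options = ['bytes', 'newlines']     # Hypothetical selection, e.g. from -c -l

selected = []
for name in ('newlines', 'words', 'bytes'):     # Fixed reporting order
    # An empty options list means every count is reported
    if not options or name in options:
        selected.append(str(counts[name]))

print(' '.join(selected))           # Prints "18 439"
```

Looping over the fixed order, rather than over the options themselves, keeps the output in newline, word, byte order regardless of the order in which the flags were given.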

Task 2: Testing

As you implement wc.py, you can use the provided doctests to verify the correctness of your code:

# Run doctests
$ python3 -m doctest wc.py -v
...
2 items had no tests:
    wc
    wc.usage
3 items passed all tests:
   1 tests in wc.count_stream
   1 tests in wc.main
   1 tests in wc.print_counts
3 tests in 5 items.
3 passed and 0 failed.
Test passed.

You can also use make to run both the doctests and the unit tests:

# Run unit tests (and doctests)
$ make test-wc
Testing WC ...
test_00_doctest (__main__.WCTest) ... ok
test_01_mypy (__main__.WCTest) ... ok
test_02_count_stream (__main__.WCTest) ... ok
test_03_print_counts_all (__main__.WCTest) ... ok
test_04_print_counts_newlines (__main__.WCTest) ... ok
test_05_print_counts_words (__main__.WCTest) ... ok
test_06_print_counts_bytes (__main__.WCTest) ... ok
test_07_print_counts_newlines_words (__main__.WCTest) ... ok
test_08_print_counts_newlines_bytes (__main__.WCTest) ... ok
test_09_print_counts_words_bytes (__main__.WCTest) ... ok
test_10_main_usage (__main__.WCTest) ... ok
test_11_main_counts_all (__main__.WCTest) ... ok
test_12_main_counts_newlines (__main__.WCTest) ... ok
test_13_main_counts_words (__main__.WCTest) ... ok
test_14_main_counts_bytes (__main__.WCTest) ... ok
test_15_main_counts_newlines_words (__main__.WCTest) ... ok
test_16_main_counts_newlines_bytes (__main__.WCTest) ... ok
test_17_main_counts_words_bytes (__main__.WCTest) ... ok

   Score 4.50 / 4.50
  Status Success

----------------------------------------------------------------------
Ran 18 tests in 0.108s

OK

To just run the unit tests, you can do the following:

# Run unit tests
$ ./wc.test -v
...

To run a specific unit test, you can specify the method name:

# Run only mypy unit test
$ ./wc.test -v WCTest.test_01_mypy
...

To manually check your types, you can use mypy:

# Run mypy to check types
$ mypy wc.py

To use the test input with the script, you can utilize I/O redirection:

# Run script with input
$ ./wc.py < wc.input
...

Activity 3: Quiz (2 Points)

Once you have completed all the activities above, you are to complete the following reflection quiz:

As with Reading 01, you will need to store your answers in a homework04/answers.json file. You can use the form above to generate the contents of this file, or you can write the JSON by hand.

To test your quiz, you can use the check.py script:

$ ../.scripts/check.py
Checking homework04 quiz ...
     Q01 0.30
     Q02 0.70
     Q03 0.20
     Q04 0.60
     Q05 0.20
   Score 2.00 / 2.00
  Status Success

Guru Point (1 Extra Credit Point)

For extra credit, you are to modify wc.py such that its output matches the exact formatting of wc. In our wc.py, we simply join multiple counts with a single space. The traditional wc, however, computes the maximum width (i.e. number of digits) of the counts and uses it to arrange the counts into evenly spaced columns.

For example:

# wc.py (columns not aligned)
$ ./wc.py < wc.input
18 97 439

# wc (columns are aligned)
$ wc < wc.input
 18  97 439
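
The alignment shown above can be sketched as follows. The sketch follows only the description given in this section (the widest count sets the column width), and the variable names are hypothetical; the real wc handles additional cases that are not attempted here.

```python
counts = [18, 97, 439]

# Width of the widest count, measured in digits
count_width = max(len(str(count)) for count in counts)

# With a single count, no padding is applied
if len(counts) == 1:
    count_width = 0

aligned = ' '.join(str(count).rjust(count_width) for count in counts)
print(aligned)      # Prints " 18  97 439"
```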

To accomplish this task, you must do the following:

wc2.py: Copy wc.py to wc2.py and modify the print_counts function to mimic the reporting behavior of wc.
Hint: Once you know the maximum number of digits from the counts, you can use str.rjust to format the count to a specific width.
```
str(count).rjust(count_width)
```
If only one option is specified (e.g. wc -l), then no adjustment needs to be made (i.e. count_width should be 0).
For reference, you can look at the wc.c source code from GNU coreutils.
wc2.test: Copy wc.test to wc2.test and modify the appropriate functions to compare the output of main directly against the output of wc.
Hint: You should reference how cut.test checks the output of cut in its unit tests (i.e. it uses subprocess.run to execute cut and capture its output for comparison).

To avoid having to rename all references to wc to wc2, you can change the import at the top of the new wc2.test to:
```
import wc2 as wc
```
You will need to adjust all the print_counts_* test cases in wc2.test to reflect the new formatting. Likewise, you will need to adjust the doctests in wc2.py to reflect the alignment as well.

To compare the output of main directly against the output of wc, you should modify each of the main_* test cases to replace lines such as:
```
self.assertEqual(output.getvalue(), '18 97 439\n')
```
With the following:
```
with open(self.Path) as input_stream:
    process = subprocess.run('wc'.split(), stdin=input_stream, capture_output=True)
self.assertEqual(output.getvalue(), process.stdout.decode())
```
This will use subprocess.run to execute the wc command, capture its output, and then compare it to the output of your main function.
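
To see the capture mechanism in isolation, here is a small self-contained sketch. It does not run wc; instead it uses a hypothetical stand-in command (a child Python process that echoes its standard input in upper case) and a temporary file in place of the test input.

```python
import subprocess
import sys
import tempfile

# Hypothetical stand-in for an external command such as wc
command = [sys.executable, '-c',
           'import sys; print(sys.stdin.read().upper(), end="")']

with tempfile.TemporaryFile() as input_stream:
    input_stream.write(b'harder better\n')
    input_stream.seek(0)        # Rewind so the child reads from the start
    process = subprocess.run(command, stdin=input_stream, capture_output=True)

# The captured output is bytes, so decode it before comparing with a str
print(process.stdout.decode())
```

Because capture_output=True stores the output as bytes, the decode call is what makes it comparable with the string returned by output.getvalue().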

Verification

To get credit for this Guru Point, show your new wc2.py and wc2.test to a TA to verify (or attach a video / screenshot to your Pull Request). You have up until a week after this assignment is due to verify your Guru Point.

Self-Service Extension

Remember that you can always forgo this Guru Point for two extra days to do the homework. That is, if you need an extension, you can simply skip the Guru Point and you will automatically have until Monday to complete the assignment for full credit.

Just leave a note on your Pull Request of your intentions.

Submission (11 Points)

To submit your assignment, please commit your work to the homework04 folder of your homework04 branch in your assignments GitHub repository. Your homework04 folder should only contain the following files:

Makefile
answers.json
cut.py
wc.py

Note: You do not need to commit the test scripts because the Makefile automatically downloads them.

#-----------------------------------------------------------------------
# Make sure you have already completed Activity 0: Preparation
#-----------------------------------------------------------------------
...
$ git add cut.py                                      # Mark changes for commit
$ git commit -m "Homework 04: Activity 1 completed"   # Record changes
...
$ git add wc.py                                       # Mark changes for commit
$ git commit -m "Homework 04: Activity 2 completed"   # Record changes
...
$ git add answers.json                                # Mark changes for commit
$ git commit -m "Homework 04: Activity 3 completed"   # Record changes
...
$ git push -u origin homework04                       # Push branch to GitHub

Pull Request

Remember to create a Pull Request and assign the appropriate TA from the Reading 05 TA List.

DO NOT MERGE your own Pull Request. The TAs use open Pull Requests to keep track of which assignments to grade. Closing them yourself will cause a delay in grading and confuse the TAs.
