The first goal of this homework assignment is to allow you to practice using Python to create scripts that require sophisticated parsing of data and manipulation of data structures. In the first program, you will process data in JSON format and then present the information to the user in the terminal.
The second goal of this homework assignment is to allow you to practice using functional programming to process data in Python. In this assignment, you will write a script that brute-force attacks a large collection of passwords using multiple processes. That is, we will use functional programming to construct a concurrent application, and then exploit this concurrency by using multiple CPU cores to execute the program in parallel.
For this assignment, record your scripts and any responses to the following
activities in the homework05
folder of your assignments GitHub
repository and push your work by noon Saturday, March 1.
Before starting this homework assignment, you should first perform a git
pull
to retrieve any changes in your remote GitHub repository:
$ cd path/to/repository # Go to assignments repository
$ git switch master # Make sure we are in master branch
$ git pull --rebase # Get any remote changes not present locally
Next, create a new branch for this assignment:
$ git checkout -b homework05 # Create homework05 branch and check it out
To help you get started, the instructor has provided you with the following skeleton code:
# Go to homework05 folder
$ cd homework05
# Download Makefile
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp25/static/txt/homework05/Makefile
# Download Python skeleton code
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp25/static/txt/homework05/searx.py
$ curl -LO https://www3.nd.edu/~pbui/teaching/cse.20289.sp25/static/txt/homework05/hulk.py
Once downloaded, you should see the following files in your homework05
directory:
homework05
\_ Makefile # This is the Makefile building all the assignment artifacts
\_ hulk.py # This is the Python script for the hulk script
\_ searx.py # This is the Python script for the searx script
You may notice that in addition to the usual comments and TODOs
, the
docstrings of each method also contains a few doctests.
You are not to modify these doctests and must keep them in-place. They are used to verify the correctness of your code.
Your code goes below the docstrings, where the TODO
and pass
commands are (you may remove the pass once you complete the method).
Now that the files are downloaded into the homework05
folder, you can
commit them to your git repository:
$ git add Makefile # Mark changes for commit
$ git add *.py
$ git commit -m "Homework 05: Initial Import" # Record changes
After downloading these files, you can run the make
command to run the
tests.
# Run all tests (will trigger automatic download)
$ make
You will notice that the Makefile downloads these additional test data and scripts:
homework05
\_ hulk.hashes # This is the list of SHA1 hashes for the hulk script
\_ hulk.test # This is the Python test for the hulk script
\_ searx.test # This is the Python test for the searx script
In addition to the embedded doctests in the skeleton code, you will be using these unit tests to verify the correctness and behavior of your code.
The test
scripts are automatically downloaded by the Makefile
, so any
modifications you do to them will be lost when you run make
again. Likewise,
because they are automatically downloaded, you do not need to add
or commit
them to your git repository.
The details on what you need to implement for this assignment are described in the following sections.
For the second activity, you are to write a script called searx.py
, which
fetches search results from searx.ndlug.org, a
website running SearXNG:
SearXNG is a metasearch engine, aggregating the results of other search engines while not storing information about its users.
With this script, you can search the Internet right from your terminal!
# Display usage
$ ./searx.py
Usage: searx.py [-u URL -n LIMIT -o ORDERBY] QUERY
Fetch SearX results for QUERY and print them out.
-u URL Use URL as the SearX instance (default is: https://searx.ndlug.org/search)
-n LIMIT Only display up to LIMIT results (default is: 5)
-o ORDERBY Sort the search results by ORDERBY (default is: score)
If ORDERBY is score, the results are shown in descending order. Otherwise,
results are shown in ascending order.
# Search for notre dame
$ ./searx.py notre dame
1. Notre Dame [1.00]
https://en.wikipedia.org/wiki/Notre_Dame
2. University of Notre Dame [1.00]
https://www.nd.edu/
3. Notre Dame Athletics | The Fighting Irish [0.50]
https://fightingirish.com/
4. Notre-Dame de Paris - official website [0.33]
https://www.notredamedeparis.fr/en/
5. University of Notre Dame [0.25]
https://en.wikipedia.org/wiki/University_of_Notre_Dame
# Search for notre dame, limit to one result, order by title
$ ./searx.py -n 1 -o title notre dame
1. Football – Notre Dame Fighting Irish – Official Athletics ... [0.20]
https://fightingirish.com/sports/football/
The searx.py
script takes the following possible arguments:
-u
: This allows the user to set the URL
of the SearXNG instance to
search. By default, this is
https://searx.ndlug.org/search.
-n
: This allows the user to limit the number of search results to print
out. By default, this is 5
.
-o
: This allows the user to specify how to sort the results. By
default, it will sort by score
in descending order. Otherwise, it will
sort either the title
or url
in ascending order.
-h
: This prints the usage message and exits with success.
After the flags, the remaining arguments is the QUERY
to search.
If an unsupported flag is specified, or if no QUERY
is given, then
searx.py
should print out the usage message and exit with failure.
searx.py
¶To implement the searx.py
Python script, you will need to complete the
following functions:
searx_query(query: str, url: str=URL) -> list[dict]
This function makes a HTTP request to the SearXNG instance specified by
url
and returns the search results corresponding toquery
.
Hints: To fetch the results from SearXNG, you will need to specify the following parameters when making a HTTP request:
{'q': query, 'format': 'json'}
To do so, you can use requests.get:
requests.get(url, params=...)
Remember that you want to return just the list of results inside the JSON data, and not the whole response.
print_results(results: list[dict], limit: int=LIMIT, orderby: str=ORDERBY) -> None
This function prints up to
limit
searchresults
sorted byorderby
.
Hints: Each search result should be printed out in the following format:
'{index:>4}.\t{title} [{score:0.2f}]'
'\t{url}'
Note: There should be a blank line between each search result, but not before the first result or after the last result.
You should sort the results using sorted before slicing it up
to the limit
. Remember that you can control how each element is compared
in sorted by specifying a key
function:
sorted(results, key=lambda r: r['score'], reverse=True)
The code above would return a sequence of results ordered by score
in descending order.
main(arguments=sys.argv[1:]) -> None
This function processes the command line
arguments
, fetches the search results forQUERY
, and then prints out the results.
Hints: You may wish to use [str.join] to combine the remaining
arguments into a single query
.
As you implement searx.py
, you can use the provided doctests to verify the
correctness of your code:
# Run doctests
$ python3 -m doctest searx.py -v
...
2 items had no tests:
searx
searx.usage
3 items passed all tests:
1 tests in searx.main
1 tests in searx.print_results
1 tests in searx.searx_query
3 tests in 5 items.
3 passed and 0 failed.
Test passed.
You can also use make
to run both the doctests and the unit tests:
# Run unit tests (and doctests)
$ make test-searx
Testing SearX ...
test_00_doctest (__main__.SearxTest.test_00_doctest) ... ok
test_01_mypy (__main__.SearxTest.test_01_mypy) ... ok
test_02_searx_query_python (__main__.SearxTest.test_02_searx_query_python) ... ok
test_03_searx_query_peter_bui (__main__.SearxTest.test_03_searx_query_peter_bui) ... ok
test_04_searx_query_cse_20289_sp25 (__main__.SearxTest.test_04_searx_query_cse_20289_sp25) ... ok
test_05_searx_query_url (__main__.SearxTest.test_05_searx_query_url) ... ok
test_06_print_results (__main__.SearxTest.test_06_print_results) ... ok
test_07_print_results_limit_1 (__main__.SearxTest.test_07_print_results_limit_1) ... ok
test_08_print_results_orderby_title (__main__.SearxTest.test_08_print_results_orderby_title) ... ok
test_09_print_results_limit_3_orderby_url (__main__.SearxTest.test_09_print_results_limit_3_orderby_url) ... ok
test_10_main_usage (__main__.SearxTest.test_10_main_usage) ... ok
test_11_main_python (__main__.SearxTest.test_11_main_python) ... ok
test_12_main_peter_bui (__main__.SearxTest.test_12_main_peter_bui) ... ok
test_13_main_cse_20289_sp25 (__main__.SearxTest.test_13_main_cse_20289_sp25) ... ok
test_14_main_url_python (__main__.SearxTest.test_14_main_url_python) ... ok
test_15_main_limit_1_url_python (__main__.SearxTest.test_15_main_limit_1_url_python) ... ok
test_16_main_url_orderby_title_python (__main__.SearxTest.test_16_main_url_orderby_title_python) ... ok
test_17_main_url_orderby_url_limit_3_python (__main__.SearxTest.test_17_main_url_orderby_url_limit_3_python) ... ok
Score 3.00 / 3.00
Status Success
----------------------------------------------------------------------
Ran 18 tests in 6.049s
OK
To just run the unit tests, you can do the following:
# Run unit tests
$ ./searx.test -v
...
To run a specific unit test, you can specify the method name:
# Run only mypy unit test
$ ./searx.test -v SearxTest.test_01_mypy
...
To manually check your types, you can use mypy:
# Run mypy to check types
$ mypy searx.py
With various password leaks being announced on a weekly basis there has been a lot of discussion on password hygiene and password strength. There is even a website to check have i been pwned.
In all decent password accounting systems, the raw password is rarely stored. Instead a cryptographic hash such as MD5 or SHA1 is used to record a user's password digest or checksum and it is these hashes that are leaked to outsiders in these all too frequent breaches. Because these cryptographic hash functions are generally considered one-way functions, attackers cannot directly obtain the original input password even though they know the output of the cryptographic hash.
For instance, the SHA1 digest of goirish
is:
$ printf goirish | sha1sum
d546785e887d9aaae85102c28769becfb86dde7d -
As you can see, the string d546785e887d9aaae85102c28769becfb86dde7d
does
not provide an attacker any clues about the original password goirish
. To
obtain the original text, attackers often employ various
techniques
such as [brute-force cracking].
For this activity, you are to create hulk.py
, which is a script that uses
brute-force to smash a set of SHA1 hashes:
$ ./hulk.py -h
Usage: hulk.py [-a alphabet -c CORES -l LENGTH -p PREFIX -s HASHES]
-a ALPHABET Alphabet to use in permutations
-c CORES CPU Cores to use
-l LENGTH Length of permutations
-p PREFIX Prefix for all permutations
-s HASHES Path of hashes file
The hulk.py
script takes the following possible arguments:
-a
: This allows the user to set the ALPHABET
to use when generating
all possible permutations. By default, this is
abcdefghijklmnopqrstuvwxyz0123456789
.
-c
: This allows the user to set the number of CPU CORES
(ie.
[processes]) to use concurrently. By default, this is 1
.
-l
: This allows the user to set the LENGTH
of the permutations to
consider. By default, this is 1
.
-p
: This allows the user to set the PREFIX
to be added in front of
each permutation to consider. By default, this is an empty string.
-s
: This allows the user to set the path to the file containing SHA1
HASHES
to checked against. By default, this is hulk.hashes
.
-h
: This prints the usage message and exits with success.
Given an ALPHABET
(default is abcdefghijklmnopqrstuvwxyz0123456789
),
hulk.py
will compute the SHA1 hash of every permutation of the
ALPHABET
for the specified LENGTH
and check if it is in the set of
HASHES
(default is hulk.hashes
). If a PREFIX
is specified, then this
should be inserted before each candidate permutation.
Many of the passwords in the hulk.hashes
password collection were pulled
from a Chegg password leak.
For instance, suppose we had a sample.hashes
file that contained the
following SHA1 checksums:
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8
e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98
84a516841ba77a5b4648de2cd0dfcb30ea46dbb4
da23614e02469a0d7c7bd1bdab5c9c474b1904dc
a9993e364706816aba3e25717850c26c9cd0d89d
If we executed hulk.py
with a LENGTH
of 1
, we should get the following
result:
$ ./hulk.py -l 1 -s sample.hashes
a
b
c
That is, hulk.py
determined that three of the hashes correspond to the
passwords a
, b
, and c
.
If we executed hulk.py
with a LENGTH
of 2
and a PREFIX
of a
, we
should get the following result:
$ ./hulk.py -l 2 -s sample.hashes -p a
abc
That is, hulk.py
determined that 1 of the hashes corresponds to the
password abc
. Note, we could have achieved the same results by executing:
$ ./hulk.py -l 3 -s sample.hashes
abc
The difference is the former searches each permutation in the form aXX
where X
is a letter from the ALPHABET
due to the PREFIX
of a
, while
the latter searches each permutation in the form XXX
.
Finally, if CORES
is greater than 1
and the LENGTH
is greater than 1
,
the script will use the concurrent.futures module to brute-force passwords
in parallel with the specified number of CORES
:
$ ./hulk.py -l 1 -c 2 -s sample.hashes
a
b
c
hulk.py
¶To implement the hulk.py
Python script, you will need to complete the
following functions:
sha1sum(s: str) -> str
This function computes the SHA1 checksum of the given string
s
and returns the corresponding hex digest.
Hints: Consider using the hashlib library and the str.encode method of the string object to convert unicode into bytes.
permutations(length: int, alphabet: str=ALPHABET) -> Iterator[str]
This function uses a recursive algorithm to implement a generator using the yield statement to produce all the permutations of the given
length
with the correspondingalphabet
.
Hints: Consider the following approach.
For the base case, you will want to consider what permutations (if
any) should be yielded
with a length
of 0
.
For the recursive case, how you can combine the alphabet
with the
results of permutations
of length - 1
to yield
each
permutation of the specified length
.
The structure of the permutations
function should look something like this:
if length == 0: # Base case
yield base case
else: # Recursive case
yield the combination of each letter with each permutation of length - 1
Note, you may not use any of the utilities in the itertools module for this function.
flatten(sequence: Iterable[Iterable[str]]) -> Iterator[str]
This function flattens a sequence of iterables into a single sequence of values from the nested data.
The idea with this function is that given a nested sequence of iterables
such as [[A, B, C], [D, E, F]]
, you yield
each item from the
subsequences: A, B, C, D, E, F
.
Hints: Consider using yield from
on the subsequences.
crack(hashes: set[str], length: int, alphabet: str=ALPHABET, prefix: str='') -> list[str]
This function returns all password permutations of specified
length
,alphabet
, andprefix
that are inhashes
set by trying all possible permutations sequentially.
Hints: Consider using list comprehensions or generator expressions to implement this function.
Pay careful attention to not use too much memory in this
function, particularly when producing all possible permutations
.
whack(arguments: tuple[set[str], int, str, str]) -> list[str]
This function calls
crack
with the given tuple ofarguments
.
This function exists because map
can only pass a single argument to the
function it applies. To workaround this limitation, we create the
whack
helper function which takes a tuple of arguments which can then
unpack to call crack
. That is, whack
takes a tuple of four
items which correspond to the parameters to crack
and we simply need to
call crack
with these items:
# Calling whack should result in calling crack with the arguments
# tuple unpacked as parameters.
whack((hashes, length, alphabet, prefix)) -> crack(hashes, length, alphabet, prefix)
Hints: Consider using *
to unpack an argument list.
smash(hashes: set[str], length: int, alphabet: str=ALPHABET, prefix: str='', cores: int=1) -> Iterator[str]
This function returns all password permutations of specified
length
,alphabet
, andprefix
that are inhashes
set by trying all possible permutations concurrently with the specified number ofcores
.
To allow for parallel execution, you will need to do divide up the
work normally done by one process. The easiest way to accomplish this
is by taking advantage of the prefix
feature of crack
.
For instance, if the specified LENGTH
is 4
, you can crack
passwords
of length 3
with prefixes of [a, b, ..., 9]
to achieve the same result:
Passwords of length 4 Prefixes + Passwords of length 3
aaaa a + aaa -> aaaa
aaab a + aab -> aaab
.... ....
baaa b + aaa -> baaa
baab b + aab -> baab
.... ....
Hints: Consider using the map
method of the ProcessPoolExecutor
from the concurrent.futures module in order to take advantage of
multiple CPU cores
.
In smash
, you will need to create an arguments
sequence that consists
of the argument tuples to pass to whack
. As explained above, each tuple
consists of the arguments to crack
: hashes
, length
, alphabet
,
prefix
. Based on the previous description, only the length
and
prefix
arguments need to be modified in the arguments sequence:
(original hashes, new length, original alphabet, new prefix)
For simplicity, we suggest that you use each letter in the alphabet
as
part of the prefix
in the arguments tuple, while keeping in mind that
you still need to account for the prefix
passed to smash
itself.
Because we are now using the prefix
feature of crack
, you will also
need to update the length
argument in the tuple
.
Your code should look something like this:
# Generator expression representing a sequence of arguments to pass
# to `whack` (hashes, length, alphabet remain constant, but prefix
# should change for each argument tuple).
arguments = ((...) for ... in ...)
# Create a ProcessPoolExecutor and then apply whack to the
# arguments
with concurrent.futures.ProcessPoolExecutor(cores) as executor:
... executor.map(...)
Because the executor.map
will return a sequence of sequences, you will
want to use the flatten
function you wrote to convert the result into a
single sequence.
main(arguments: list[str]=sys.argv[1:]) -> None
This function smashes the given hashes to determine the original passwords with specified alphabet, length, and prefix using multiple cores (ie. processes).
As you implement hulk.py
, you can use the provided doctests to verify the
correctness of your code:
# Run doctests
$ python3 -m doctest hulk.py -v
...
2 items had no tests:
hulk
hulk.usage
7 items passed all tests:
1 tests in hulk.crack
1 tests in hulk.flatten
1 tests in hulk.main
1 tests in hulk.permutations
1 tests in hulk.sha1sum
1 tests in hulk.smash
1 tests in hulk.whack
7 tests in 9 items.
7 passed and 0 failed.
Test passed.
You can also use make
to run both the doctests and the unit tests:
# Run unit tests (and doctests)
$ make test-hulk
Testing Hulk ...
test_00_doctest (__main__.HulkTest.test_00_doctest) ... ok
test_01_mypy (__main__.HulkTest.test_01_mypy) ... ok
test_02_sha1sum (__main__.HulkTest.test_02_sha1sum) ... ok
test_03_permutations (__main__.HulkTest.test_03_permutations) ... ok
test_04_flatten (__main__.HulkTest.test_04_flatten) ... ok
test_05_crack (__main__.HulkTest.test_05_crack) ... ok
test_06_whack (__main__.HulkTest.test_06_whack) ... ok
test_07_smash (__main__.HulkTest.test_07_smash) ... ok
test_08_main_usage (__main__.HulkTest.test_08_main_usage) ... ok
test_09_main_no_arguments (__main__.HulkTest.test_09_main_no_arguments) ... ok
test_10_main_length_1 (__main__.HulkTest.test_10_main_length_1) ... ok
test_11_main_length_1_alphabet_abc (__main__.HulkTest.test_11_main_length_1_alphabet_abc) ... ok
test_12_main_length_1_prefix_a (__main__.HulkTest.test_12_main_length_1_prefix_a) ... ok
test_13_main_length_1_cores_1 (__main__.HulkTest.test_13_main_length_1_cores_1) ... ok
test_14_main_length_1_prefix_a_cores_1 (__main__.HulkTest.test_14_main_length_1_prefix_a_cores_1) ... ok
test_15_main_length_2 (__main__.HulkTest.test_15_main_length_2) ... ok
test_16_main_length_2_alphabet_abc (__main__.HulkTest.test_16_main_length_2_alphabet_abc) ... ok
test_17_main_length_2_prefix_a (__main__.HulkTest.test_17_main_length_2_prefix_a) ... ok
test_18_main_length_2_cores_2 (__main__.HulkTest.test_18_main_length_2_cores_2) ... ok
test_19_main_length_2_prefix_a_cores_2 (__main__.HulkTest.test_19_main_length_2_prefix_a_cores_2) ... ok
test_20_main_length_3 (__main__.HulkTest.test_20_main_length_3) ... ok
test_21_main_length_3_alphabet_abc (__main__.HulkTest.test_21_main_length_3_alphabet_abc) ... ok
test_22_main_length_3_prefix_a (__main__.HulkTest.test_22_main_length_3_prefix_a) ... ok
test_23_main_length_3_cores_3 (__main__.HulkTest.test_23_main_length_3_cores_3) ... ok
test_24_main_length_3_prefix_a_cores_3 (__main__.HulkTest.test_24_main_length_3_prefix_a_cores_3) ... ok
test_25_script_length_4_cores_4 (__main__.HulkTest.test_25_script_length_4_cores_4) ... ok
Score 5.00 / 5.00
Status Success
----------------------------------------------------------------------
Ran 26 tests in 2.163s
OK
To just run the unit tests, you can do the following:
# Run unit tests
$ ./hulk.test -v
...
To run a specific unit test, you can specify the method name:
# Run only mypy unit test
$ ./hulk.test -v HulkTest.test_01_mypy
...
To manually check your types, you can use mypy:
# Run mypy to check types
$ mypy hulk.py
Just for fun, once you are confident that your program is correct, you can
use hulk.py
to brute-force crack set of 12093
password hashes found in
hulk.hashes
.
This is an optional activity and you can skip it if you wish.
To keep track of who has cracked the most passwords, you can submit your set of discovered passwords to the [deadpool]:
$ cat PASSWORDS | curl --data-binary @- http://xavier.h4x0r.space:9111/NETID
Replace PASSWORDS
with the name of the file with your passwords and NETID
with your Notre Dame NetID:
# Generate passwords of length 1 and store them in passwords.txt
$ ./hulk.py -l 1 | tee -a passwords.txt
...
# Upload passwords.txt
$ cat passwords.txt | curl --data-binary @- http://xavier.h4x0r.space:9111/pbui
{"timestamp": 1488333215.967784, "passwords": 36}
Each password is no more than 8
characters long and only consists of
lowercase letters and numeric digits.
To speed up your computations, consider using different values for
CORES
:
$ ./hulk.py -l 6 -c 8 # Use 8 processes
To find out how many cores a Linux machine has, you can do:
$ lscpu
Remember that you need to share the student machines with all the other students in class.
Therefore, do not run hulk.py
on lengths larger than 6
... otherwise
you will take up significant CPU resources away from other students for long
periods of time.
If you want to crack passwords of length greater than 6
, you will need to
run hashcat on your own machines.
Once you have completed all the activities above, you are to complete the following reflection quiz:
As with Reading 01, you will need to store your answers in a
homework05/answers.json
file. You can use the form above to generate the
contents of this file, or you can write the JSON by hand.
To test your quiz, you can use the check.py
script:
$ ../.scripts/check.py
Checking homework05 quiz ...
Q01 0.50
Q02 1.00
Q03 0.30
Q04 0.20
Score 2.00 / 2.00
Status Success
To submit your assignment, please commit your work to the homework05
folder
of your homework05
branch in your assignments GitHub repository.
Your homework05 folder should only contain the following files:
Note: You do not need to commit the test scripts because the Makefile
automatically downloads them.
#-----------------------------------------------------------------------
# Make sure you have already completed Activity 0: Preparation
#-----------------------------------------------------------------------
...
$ git add searx.py # Mark changes for commit
$ git commit -m "Homework 05: searx.py" # Record changes
...
$ git add hulk.py # Mark changes for commit
$ git commit -m "Homework 05: hulk.py" # Record changes
...
$ git add answers.json # Mark changes for commit
$ git commit -m "Homework 05: Quiz" # Record changes
...
$ git push -u origin homework05 # Push branch to GitHub
If you collaborated with any other students, or received help from TAs or AI
tools on this assignment, please record this support in the README.md
in
the homework05
folder and include it with your Pull Request.
Remember to create a Pull Request and assign the appropriate TA from the Reading 06 TA List.
DO NOT MERGE your own Pull Request. The TAs use open Pull Requests to keep track of which assignments to grade. Closing them yourself will cause a delay in grading and confuse the TAs.