This Is Not The Course Website You Are Looking For

This course website is from a previous semester. If you are currently in the class, please make sure you are viewing the latest course website instead of this old one.

The goal of this homework assignment is to allow you to practice using functional programming to process data in Python. In this assignment, you will write a script that brute-force attacks a large collection of passwords using multiple processes. That is, we will use functional programming to construct a concurrent application, and then exploit this concurrency by using multiple CPU cores to execute the program in parallel.

For this assignment, record your scripts and any responses to the following activities in the in the homework06 folder of your assignments GitHub repository and push your work by noon Saturday, March 27.

Activity 0: Preparation

Before starting this homework assignment, you should first perform a git pull to retrieve any changes in your remote GitHub repository:

$ cd path/to/repository                   # Go to assignments repository

$ git checkout master                     # Make sure we are in master branch

$ git pull --rebase                       # Get any remote changes not present locally

Next, create a new branch for this assignment:

$ git checkout -b homework06              # Create homework06 branch and check it out

Once this is done, download the Makefile and test scripts:

# Go to homework06 folder
$ cd homework06

# Download the Makefile
$ curl -LO https://raw.githubusercontent.com/nd-cse-20289-sp21/cse-20289-sp21-assignments/master/homework06/Makefile

# Add and commit Makefile
$ git add Makefile
$ git commit -m "homework06: Add Makefile"

# Download the test scripts
$ make test-scripts

Note, you do not need to add and commit the test scripts since the Makefile will automatically download them again whenever you run make.

You are now ready to work on the activities below.

Activity 1: Hulk (10 Points)

With various password leaks being announced on a weekly basis there has been a lot of discussion on password hygiene and password strength. There is even a website to check have i been pwned.

In all decent password accounting systems, the raw password is rarely stored. Instead a cryptographic hash such as MD5 or SHA1 is used to record a user's password digest or checksum and it is these hashes that are leaked to outsiders in these all too frequent breaches. Because these cryptographic hash functions are generally considered one-way functions, attackers cannot directly obtain the original input password even though they know the output of the cryptographic hash.

For instance, the MD5 digest of goirish is:

$ printf goirish | md5sum
c761cc3341c91238ba735b26874572a0 -

As you can see, the string c761cc3341c91238ba735b26874572a0 does not provide an attacker any clues about the original password goirish. To obtain the original text, attackers often employ various techniques such as brute-force cracking.

Task 1: hulk.py

For the first task, you are to create hulk.py, which is a script that uses brute-force to smash a set of MD5 hashes:

$ ./hulk.py -h
Usage: hulk.py [-a alphabet -c CORES -l LENGTH -p PREFIX -s HASHES]
    -a ALPHABET Alphabet to use in permutations
    -c CORES    CPU Cores to use
    -l LENGTH   Length of permutations
    -p PREFIX   Prefix for all permutations
    -s HASHES   Path of hashes file

Given an ALPHABET (default is abcdefghijklmnopqrstuvwxyz0123456789), hulk.py will compute the MD5 hash of every permutation of the ALPHABET for the specified LENGTH and check if it is in the set of HASHES. If a PREFIX is specified, then this should be inserted before each candidate permutation.

Examples

For instance, suppose we had a sample.hashes file that contained the following MD5 checksums:

0cc175b9c0f1b6a831c399e269772661
92eb5ffee6ae2fec3ad71c777531578f
4a8a08f09d37b73795649038408b5f33
900150983cd24fb0d6963f7d28e17f72

If we executed hulk.py with a LENGTH of 1, we should get the following result:

$ ./hulk.py -l 1 -s sample.hashes
a
b
c

That is, hulk.py determined that three of the hashes correspond to the passwords a, b, and c.

If we executed hulk.py with a LENGTH of 2 and a PREFIX of a, we should get the following result:

$ ./hulk.py -l 2 -s sample.hashes -p a
abc

That is, hulk.py determined that 1 of the hashes corresponds to the password abc. Note, we could have achieved the same results by executing:

$ ./hulk.py -l 3 -s sample.hashes
abc

The difference is the former searches each permutation in the form aXX where X is a letter from the ALPHABET due to the PREFIX of a, while the latter searches each permutation in the form XXX.

Finally, if CORES is greater than 1 and the LENGTH is greater than 1, the script will use the concurrent.futures module to brute-force passwords in parallel with the specified number of CORES:

$ ./hulk.py -l 1 -c 2 -s sample.hashes
a
b
c

Skeleton Code

To help you get started, the instructor has provided you with the following skeleton code:

# Download skeleton code
$ curl -LO https://raw.githubusercontent.com/nd-cse-20289-sp21/cse-20289-sp21-assignments/master/homework06/hulk.py

The skeleton code contains the following:

import concurrent.futures
import hashlib
import os
import string
import sys

# Constants

ALPHABET = string.ascii_lowercase + string.digits

# Functions

def usage(exit_code=0):
    progname = os.path.basename(sys.argv[0])
    print(f'''Usage: {progname} [-a ALPHABET -c CORES -l LENGTH -p PATH -s HASHES]
    -a ALPHABET Alphabet to use in permutations
    -c CORES    CPU Cores to use
    -l LENGTH   Length of permutations
    -p PREFIX   Prefix for all permutations
    -s HASHES   Path of hashes file''')
    sys.exit(exit_code)

def md5sum(s):
    ''' Compute md5 digest for given string. '''
    return ''

def permutations(length, alphabet=ALPHABET):
    ''' Recursively yield all permutations of the given length using the
    provided alphabet. '''
    yield None

def flatten(sequence):
    ''' Flatten sequence of iterators. '''
    yield None

def crack(hashes, length, alphabet=ALPHABET, prefix=''):
    ''' Return all password permutations of specified length that are in hashes
    by sequentially trying all permutations. '''
    return []

def whack(arguments):
    ''' Call the crack function with the specified list of arguments '''
    return []

def smash(hashes, length, alphabet=ALPHABET, prefix='', cores=1):
    ''' Return all password permutations of specified length that are in hashes
    by concurrently subsets of permutations concurrently.
    '''
    return []

def main():
    arguments   = sys.argv[1:]
    alphabet    = ALPHABET
    cores       = 1
    hashes_path = 'hashes.txt'
    length      = 1
    prefix      = ''

# Main Execution

if __name__ == '__main__':
    main()

You are to complete the sections marked TODO in order to complete the hulk.py script.

Hints

  1. For md5sum, you must compute the MD5 hash of a string by using the hashlib library. To convert a unicode string into bytes, you can call the str.encode method of the string object.

  2. For permutations, you must use a recursive algorithm to implement a generator using the yield statement. You may not use any of the utilities in the itertools module for this function.

    For the base case, you will want to consider what happens with a length of 1 and for the recursive case, how you can combine the alphabet with the results of permutations of length - 1 to yield each permutation of the specified length.

    if length == 1:   # Base case
        yield each letter in alphabet
    else:             # Recursive case
        yield the combination of each letter with each permutation of length - 1
    
  3. For flatten, you must use the yield statement to implement a generator that takes a sequence of iterables and converts it into a single sequence of values from the nested data.

    The idea with this function is that given a nested sequence of iterators such as [[A, B, C], [D, E, F]], you yield each item from the subsequences: A, B, C, D, E, F.

  4. For crack, you must implement the function using only list comprehensions or generator expressions (you can use multiple ones).

    Pay careful attention to not use too much memory in this function, particularly when calling permutations.

  5. For whack, you must implement the function by calling the crack function with the provided list of arguments.

    This function exists because map can only pass a single argument to the function it applies. To workaround this limitation, we create the whack helper function which takes a tuple of arguments which can then unpack to call crack. That is, whack takes a tuple of four items which correspond to the parameters to crack and we simply need to call crack with these items:

    # Calling whack should result in calling crack with the arguments
    # tuple unpacked as parameters.
    whack((hashes, length, alphabet, prefix)) -> crack(hashes, length, alphabet, prefix)
    
  6. For smash, you must implement the function by using the map method of the ProcessPoolExecutor from the concurrent.futures module in order to take advantage of multiple CPU cores.

    To allow for parallel execution, you will need to do divide up the work normally done by one process. The easiest way to accomplish this is by taking advantage of the prefix feature of crack.

    For instance, if the specified LENGTH is 4, you can crack passwords of length 3 with prefixes of [a, b, ..., 9] to achieve the same result:

    Passwords of length 4     Prefixes + Passwords of length 3
    
    aaaa                      a + aaa -> aaaa
    aaab                      a + aab -> aaab
    ....                      ....
    baaa                      b + aaa -> baaa
    baab                      b + aab -> baab
    ....                      ....
    

    In smash, you will need to create an arguments sequence that consists of the argument tuples to pass to whack. As explained above, each tuple consists of the arguments to crack: hashes, length, alphabet, prefix. Based on the previous description, only the length and prefix arguments need to be modified in the arguments sequence:

    (original hashes, new length, original alphabet, new prefix)
    

    For simplicity, we suggest that you use each letter in the alphabet as part of the prefix in the arguments tuple, while keeping in mind that you still need to account for the prefix passed to smash itself. Because we are now using the prefix feature of crack, you will also need to update the length argument in the tuple.

    Your code should look something like this:

    # Generator expression representing a sequence of arguments to pass
    # to `whack` (hashes, length, alphabet remain constant, but prefix
    # should change for each argument tuple).
    arguments = ((...) for ... in ...)
    
    # Create a ProcessPoolExecutor and then apply whack to the
    # arguments
    with concurrent.futures.ProcessPoolExecutor(cores) as executor:
        ... executor.map(...)
    

    Because the executor.map will return a sequence of sequences, you will want to use the flatten function you wrote to convert the result into a single sequence.

  7. You can parse the command line options as we have done in the past, or you can investigate using argparse or getopt.

  8. You must use a set to store the collection of MD5 hashes.

Task 2: Testing

To aid you in testing the hulk.py script, we are providing you with test_hulk.sh, which you can use as follows:

# Download Makefile
$ curl -LO https://raw.githubusercontent.com/nd-cse-20289-sp21/cse-20289-sp21-assignments/master/homework06/Makefile

# Download test script, hashes.txt, and run test
$ make

# Run test script manually
$ ./test_hulk.sh
Testing hulk.py ...
   Unit Tests                               ... 2.00 / 2.00
   Usage                                    ... Success
   Hulk LENGTH 1                            ... Success
   Hulk LENGTH 1 (ALPHABET: abc)            ... Success
   Hulk LENGTH 2                            ... Success
   Hulk LENGTH 2 (ALPHABET: uty)            ... Success
   Hulk LENGTH 3                            ... Success
   Hulk LENGTH 3 (ALPHABET: abc)            ... Success
   Hulk LENGTH 4                            ... Success
   Hulk LENGTH 4 (ALPHABET: abcd)           ... Success
   Hulk LENGTH 2 (CORES: 2)                 ... Success
   Hulk LENGTH 3 (CORES: 2)                 ... Success
   Hulk LENGTH 4 (CORES: 2)                 ... Success
   Hulk LENGTH 1 (PREFIX: a)                ... Success
   Hulk LENGTH 1 (PREFIX: 1, CORES: 2)      ... Success
   Hulk LENGTH 2 (PREFIX: a)                ... Success
   Hulk LENGTH 2 (PREFIX: a, CORES: 2)      ... Success
   Hulk LENGTH 3 (PREFIX: a)                ... Success
   Hulk LENGTH 3 (PREFIX: a, CORES: 2)      ... Success

   Score 10.00

Timeout Failures

If you fail a function test but the output appears correct, it is mostly likely that your code is too slow. In addition to checking the output of your commands, the test_hulk.sh uses the timeout command to ensure your program does not run for too long.

Carefully reconsider the data structures and functional programming methods you employed if your code is timing out.

In addition to the test script, there is also hulk_test.py, which contains unit tests for most of the functions you need to implement. You can execute these tests manually as follows:

$ ./hulk_test.py -v
test_00_md5sum (__main__.HulkTestCase) ... ok
test_01_permutations (__main__.HulkTestCase) ... ok
test_02_flatten (__main__.HulkTestCase) ... ok
test_03_crack (__main__.HulkTestCase) ... ok
test_04_whack (__main__.HulkTestCase) ... ok
test_05_smash (__main__.HulkTestCase) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.168s

OK

This will test each of the five functions individually and thus allow you to build and test your program one function at a time.

Incremental Development

Do not try to write the whole program at once. Implement one feature at a time and verify that it works. The unit tests are provided to allow you to test individual components, while the test_hulk.sh is given to enable you to test the completed solution.

Activity 2: Deadpool (1 Point)

Once you are confident that your program is correct, you are to use hulk.py to brute-force crack set of 12038 password hashes found at:

https://raw.githubusercontent.com/nd-cse-20289-sp21/cse-20289-sp21-assignments/master/homework06/hashes.txt

To keep track of who has cracked the most passwords, you can submit your set of discovered passwords to the deadpool:

$ cat PASSWORDS | curl --data-binary @- http://xavier.h4x0r.space:9111/NETID

Replace PASSWORDS with the name of the file with your passwords and NETID with your Notre Dame NetID:

# Generate passwords of length 1 and store them in passwords.txt
$ ./hulk.py -l 1 | tee -a passwords.txt
...

# Upload passwords.txt
$ cat passwords.txt | curl --data-binary @- http://xavier.h4x0r.space:9111/pbui
{"timestamp": 1488333215.967784, "passwords": 36}

For full credit, you must crack at least 8000 passwords. There are 12038 total hashes. Each password is no more than 8 characters long and only consists of lowercase letters and numeric digits.

Chegg Passwords

Many of the passwords in the password hashes collection were pulled from the recent Chegg password leak as recorded on Hashes.org.

Parallel Universe

To speed up your computations, consider using different values for CORES:

$ ./hulk.py -l 6 -c 2  # Use 2 processes

To find out how many cores a Linux machine has, you can do:

$ lscpu

Shared Resources

Remember that you need to share the student machines with all the other students in class.

Therefore, do not run hulk.py on lengths larger than 6... otherwise you will take up significant CPU resources away from other students for long periods of time.

If you want to crack passwords of length greater than 6, you will need to run hashcat on your own machines as explained in the Guru Point.

Activity 3: Quiz (1 Point)

Once you have completed all the activities above, you are to complete the following reflection quiz:

As with Reading 01, you will need to store your answers in a homework06/answers.json file. You can use the form above to generate the contents of this file, or you can write the JSON by hand.

To test your quiz, you can use the check.py script:

$ ../.scripts/check.py
Checking homework06 quiz ...
     Q01 0.50
     Q02 0.30
     Q03 0.20
   Score 1.00 / 1.00
  Status Success

Guru Point (2 Points)

This week there are two different opportunities for extra credit. You may choose to do one, both, or none of them.

Self-Service Extension

Remember that you can always forgo this Guru Point for two extra days to do the homework. That is, if you need an extension, you can simply skip the Guru Point and you will automatically have until Monday to complete the assignment for full credit.

Just leave a note on your Pull Request of your intensions.

Hash Cat

As can be seen, cracking passwords using brute-force is extremely slow, especially on CPUs. For extra credit, you are to download and use the hashcat program on your own computer:

hashcat is the world's fastest and most advanced password recovery utility, supporting five unique modes of attack for over 200 highly-optimized hashing algorithms. hashcat currently supports CPUs, GPUs, and other hardware accelerators on Linux, Windows, and macOS, and has facilities to help enable distributed password cracking.

The secret to hashcat's speed is that it can take advantage of your GPU (aka. your video card), which turns out to be a massively parallel computing system (basically a commodity supercomputer). You will learn more about the GPU's power in your Computer Architecture class, but for this extra credit, all you need to do is take advantage of it by using hashcat.

Discrete Video Card

For optimal performance, your computer should use a discrete video card from Nvidia or AMD. Moreover, you will need to have the proper drivers and OpenCL software installed for hashcat to take advantage of your GPU.

Once you have downloaded hashcat to your own computer and the password hashes file for this assignment, you should use hashcat to crack passwords of length 7 and 8:

# Brute-force MD5 hashed passwords of length 7 with lowercase letters and digits
$ hashcat -a 3 -m 0 -1 '?l?d' -o passwords.7 --potfile-disable hashes.txt '?1?1?1?1?1?1?1'

# Brute-force MD5 hashed passwords of length 8 with lowercase letters and digits
$ hashcat -a 3 -m 0 -1 '?l?d' -o passwords.8 --potfile-disable hashes.txt '?1?1?1?1?1?1?1?1'

Note: for macOS users, you may wish to install hashcat via Homebrew.

This Might Take A While

Even though a GPU is basically a commodity supercomputer, cracking passwords using brute-force still takes time. With a discrete Nvidia GTX 980 in a desktop machine, cracking all passwords of length 8 took about 20 minutes. On the other hand, with an integrated Intel 620 in a laptop, cracking all passwords of length 8 had a time estimate of about 2 days.

This is still dramatically better than using say hundreds of machines for hours to crack passwords (which is what we have done in the past).

After executing these commands, the cracked passwords should be in the passwords.7 and passwords.8 files. You will need to use either Python or shell to extract the passwords and append them to the list of passwords of lengths 1 - 6 you found using hulk.py.

Verification

To get credit for this Guru Point, you are to upload at least 10,000 passwords to deadpool. You have up until a week after this assignment is due to verify your Guru Point.

SSH Keys

Passwords are great, but what would even be better is not having to enter your password everytime you do a git pull or git push!

For extra credit, you are to setup SSH Keys and configure your GitHub account to utilize them. The following are resources on how to set this up:

Basically, with SSH Keys properly configured, you can do things such as push and pull to GitHub without having to enter your password everytime and login to remote Unix machines without a password. This is handy for automating tasks where no human interaction is desirable.

Once you setup your SSH Keys, you will need to change your remote URL

Verification

To get credit for this Guru Point, you must show either the instructor or TA a demonstration of either pushing to your repository or pulling from it using SSH (or attached a video / screenshot to your Pull Request). You have up until a week after this assignment is due to verify your Guru Point.

Submission

To submit your assignment, please commit your work to the homework06 folder of your homework06 branch in your assignments GitHub repository. Your homework06 folder should only contain the following files:

The passwords.txt should contain a list of all all the passwords you found from the password hashes collection.

#--------------------------------------------------
# BE SURE TO DO THE PREPARATION STEPS IN ACTIVITY 0
#--------------------------------------------------

$ cd homework06                           # Go to Homework 06 directory
...
$ git add Makefile                        # Mark changes for commit
$ git add hulk.py                         # Mark changes for commit
$ git commit -m "homework06: Activity 1"  # Record changes
...
$ git add passwords.txt                   # Mark changes for commit
$ git commit -m "homework06: Activity 2"  # Record changes
...
$ git add answers.json                    # Mark changes for commit
$ git commit -m "homework06: Activity 3"  # Record changes
...
$ git push -u origin homework06           # Push branch to GitHub

Pull Request

Remember to create a Pull Request and assign the appropriate TA from the Reading 07 TA List.

DO NOT MERGE your own Pull Request. The TAs use open Pull Requests to keep track of which assignments to grade. Closing them yourself will cause a delay in grading and confuse the TAs.