Homework 05: Processing Data

The goal of this homework assignment is to allow you to practice using functional programming to process data in Python. In this assignment, you will write a script that brute-force attacks a large collection of passwords using multiple processes. That is, we will use functional programming to construct a concurrent application, and then exploit this concurrency by using multiple CPU cores to execute the program in parallel.

For this assignment, record your scripts and any responses to the following activities in the in the homework05 folder of your assignments GitLab repository and push your work by 11:59 AM Monday, March 4.

Activity 0: Preparation

Before starting this homework assignment, you should first perform a git pull to retrieve any changes in your remote GitLab repository:

$ cd path/to/repository                   # Go to assignments repository

$ git checkout master                     # Make sure we are in master branch

$ git pull --rebase                       # Get any remote changes not present locally

Next, create a new branch for this assignment:

$ git checkout -b homework05              # Create homework05 branch and check it out

You are now ready to work on the activities below.

Activity 1: `hulk.py` (10 Points)

With various password leaks being announced on a weekly basis there has been a lot of discussion on password hygiene and password strength. There is even a website to check have i been pwned.

In all decent password accounting systems, the raw password is rarely stored. Instead a cryptographic hash such as MD5 or SHA1 is used to record a user's password digest or checksum and it is these hashes that are leaked to outsiders in these all too frequent breaches. Because these cryptographic hash functions are generally considered one-way functions, attackers cannot directly obtain the original input password even though they know the output of the cryptographic hash.

For instance, the SHA1 digest of goirish is:

$ printf goirish | sha1sum
d546785e887d9aaae85102c28769becfb86dde7d -

As you can see, the string d546785e887d9aaae85102c28769becfb86dde7d does not provide an attacker any clues about the original password goirish. To obtain the original text, attackers often employ various techniques such as brute-force cracking.

Task 1: `hulk.py`

For the first task, you are to create hulk.py, which is a script that uses brute-force to smash a set of SHA1 hashes:

$ ./hulk.py -h
Usage: hulk.py [-a alphabet -c CORES -l LENGTH -p PREFIX -s HASHES]
    -a ALPHABET Alphabet to use in permutations
    -c CORES    CPU Cores to use
    -l LENGTH   Length of permutations
    -p PREFIX   Prefix for all permutations
    -s HASHES   Path of hashes file

Given an ALPHABET (default is abcdefghijklmnopqrstuvwxyz0123456789), hulk.py will compute the SHA1 hash of every permutation of the ALPHABET for the specified LENGTH and check if it is in the set of HASHES. If a PREFIX is specified, then this should be inserted before each candidate permutation.

For instance, suppose we had a sample.hashes file that contained the following SHA1 checksums:

86f7e437faa5a7fce15d1ddcb9eaeaea377667b8
e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98
84a516841ba77a5b4648de2cd0dfcb30ea46dbb4
a9993e364706816aba3e25717850c26c9cd0d89d

If we executed hulk.py with a LENGTH of 1, we should get the following result:

$ ./hulk.py -l 1 -s sample.hashes
a
b
c

That is, hulk.py determined that three of the hashes correspond to the passwords a, b, and c.

If we executed hulk.py with a LENGTH of 2 and a PREFIX of a, we should get the following result:

$ ./hulk.py -l 2 -s sample.hashes -p a
abc

That is, hulk.py determined that 1 of the hashes corresponds to the password abc. Note, we could have achieved the same results by executing:

$ ./hulk.py -l 3 -s sample.hashes
abc

The difference is the former searches each permutation in the form aXX where X is a letter from the ALPHABET due to the PREFIX of a, while the latter searches each permutation in the form XXX.

Finally, if CORES is greater than 1 and the LENGTH is greater than 1, the script will use the multiprocessing module to brute-force passwords in parallel with the specified number of CORES:

$ ./hulk.py -l 1 -c 2 -s sample.hashes
a
b
c

Starter Code

To help you get started, the instructor has provided you with the following starter code:

# Download starter code
$ curl -sLOk https://gitlab.com/nd-cse-20289-sp19/cse-20289-sp19-assignments/raw/master/homework05/hulk.py

The starter code contains the following:

#!/usr/bin/env python3

import functools
import hashlib
import itertools
import multiprocessing
import os
import string
import sys

# Constants

ALPHABET    = string.ascii_lowercase + string.digits
ARGUMENTS   = sys.argv[1:]
CORES       = 1
HASHES      = 'hashes.txt'
LENGTH      = 1
PREFIX      = ''

# Functions

def usage(exit_code=0):
    print('''Usage: {} [-a alphabet -c CORES -l LENGTH -p PREFIX -s HASHES]
    -a ALPHABET Alphabet to use in permutations
    -c CORES    CPU Cores to use
    -l LENGTH   Length of permutations
    -p PREFIX   Prefix for all permutations
    -s HASHES   Path of hashes file'''.format(os.path.basename(sys.argv[0])))
    sys.exit(exit_code)

def sha1sum(s):
    ''' Generate sha1 digest for given string.

    >>> sha1sum('abc')
    'a9993e364706816aba3e25717850c26c9cd0d89d'

    >>> sha1sum('wake me up inside')
    '5bfb1100e6ef294554c1e99ff35ad11db6d7b67b'

    >>> sha1sum('baby now we got bad blood')
    '9c6d9c069682759c941a6206f08fb013c55a0a6e'
    '''
    # TODO: Implement
    return ''

def permutations(length, alphabet=ALPHABET):
    ''' Recursively yield all permutations of alphabet up to provided length.

    >>> list(permutations(1, 'ab'))
    ['a', 'b']

    >>> list(permutations(2, 'ab'))
    ['aa', 'ab', 'ba', 'bb']

    >>> list(permutations(1))       # doctest: +ELLIPSIS
    ['a', 'b', ..., '9']

    >>> list(permutations(2))       # doctest: +ELLIPSIS
    ['aa', 'ab', ..., '99']

    >>> import inspect; inspect.isgeneratorfunction(permutations)
    True
    '''
    # TODO: Implement as generator
    yield ''

def smash(hashes, length, alphabet=ALPHABET, prefix=''):
    ''' Return all password permutations of specified length that are in hashes

    >>> smash([sha1sum('ab')], 2)
    ['ab']

    >>> smash([sha1sum('abc')], 2, prefix='a')
    ['abc']

    >>> smash(map(sha1sum, 'abc'), 1, 'abc')
    ['a', 'b', 'c']
    '''
    # TODO: Implement with list or generator comprehensions
    return []

# Main Execution

if __name__ == '__main__':
    # Parse command line arguments

    # Load hashes set

    # Execute smash function

    # Print passwords

# vim: set sts=4 sw=4 ts=8 expandtab ft=python:

You are to complete the sections marked TODO in order to complete the hulk.py script.

Doctests

Note, the hulk.py script contains doctests for each function (the long strings under each function). Do not remove these tests, otherwise you will not be given full credit.

Notes

For sha1sum, you must compute the SHA1 hash of a string by using the hashlib library. To convert a unicode string into bytes, you can call the str.encode method of the string object.
For permutations, you must use a recursive algorithm to implement a generator using the yield statement. You may not use any of the utilities in the itertools module for this function.
For smash, you must implement the function using only list comprehensions or generator expressions (you can use multiple ones).
You can parse the command line options as we have done in the past, or you can investigate using argparse or getopt.
You must use either a dict or set to store the collection of SHA1 hashes.
You should execute the smash function to get a list of passwords for the specified command line options.

As noted above, if CORES is greater than 1 and the LENGTH is greater than 1, you must use the multiprocessing module to brute-force the passwords in parallel.

To allow for parallel execution, you will need to do divide up the work normally done by one process. The easiest way to accomplish this is by taking advantage of the prefix feature of smash.

For instance, if the specified LENGTH is 4, you can smash passwords of length 3 with prefixes of [a, b, ..., 9]. Likewise if the specified LENGTH is 5, you can smash passwords of length 3 with prefixes of [aa, ab, ..., zz]:
```
Passwords of length 4     Prefixes + Passwords of length 3

aaaa                      a + aaa -> aaaa
aaab                      a + aab -> aaab
....                      ....
baaa                      b + aaa -> baaa
baab                      b + aab -> baab
....                      ....
```
Because the imap function of the multiprocessing.Pool object only allows passing one argument to the function to apply, you will need to use the functools.partial function to curry the arguments:
```
subsmash = functools.partial(smash, hashes, ..., ALPHABET)
```
The functools.partial basically creates a new function that already has the initial hashes, length, and alphabet parameters set. This means that subsmash only takes the prefix argument.

You can then use this new subsmash function along with itertools.chain.from_iterable to produce a list of passwords:
```
passwords = itertools.chain.from_iterable(pool.imap(subsmash, prefixes))
```
The prefixes argument is an iterable of all the prefixes you wish to prepend to the permutations you are smashing as described above.

Task 2: Testing

To aid you in testing the hulk.py script, we are providing you with test_hulk.sh, which you can use as follows:

# Download Makefile
$ curl -sLOk https://gitlab.com/nd-cse-20289-sp19/cse-20289-sp19-assignments/raw/master/homework05/Makefile

# Download test script, hashes.txt, and run test
$ make

# Run test script manually
$ ./test_hulk.sh
Testing hulk.py ...
 hulk.py is executable                    ... Success
 hulk.py has usage                        ... Success
 hulk.py doctest                          ... Success
 hulk.py length 1                         ... Success
 hulk.py length 2                         ... Success
 hulk.py length 3                         ... Success
 hulk.py length 2 (CORES: 2)              ... Success
 hulk.py length 3 (CORES: 2)              ... Success
 hulk.py length 2 (PREFIX: a)             ... Success
 hulk.py length 2 (PREFIX: a, CORES: 2)   ... Success
 hulk.py length 3 (PREFIX: a, CORES: 2)   ... Success
   Score 7.00

In addition to the test script, the starter code contains embedded doctests, which you can verify by doing the following:

$ python3 -m doctest -v hulk.py
...
3 items passed all tests:
   5 tests in hulk.permutations
   3 tests in hulk.sha1sum
   3 tests in hulk.smash
11 tests in 5 items.
11 passed and 0 failed.
Test passed.

This will test each of the three functions individually and thus allow you to build and test your program one function at a time.

Incremental Development

Do not try to write the whole program at once. Implement one feature at a time and verify that it works. The doctests are unit tests that allow you to test individual components, while the test_hulk.sh allows you to test the completed solution.

Task 3: Deadpool

Once you are confident that your program is correct, you are to use hulk.py to brute-force crack set of 10427 password hashes found at:

https://gitlab.com/nd-cse-20289-sp19/cse-20289-sp19-assignments/raw/master/homework05/hashes.txt

To keep track of who has cracked the most passwords, you can submit your set of discovered passwords to the deadpool:

$ cat PASSWORDS | curl --data-binary @- http://xavier.h4x0r.space:9111/NETID

Replace PASSWORDS with the name of the file with your passwords and NETID with your Notre Dame NetID:

# Generate passwords of length 1 and store them in passwords.txt
$ ./hulk.py -l 1 | tee -a passwords.txt
...

# Upload passwords.txt
$ cat passwords.txt | curl --data-binary @- http://xavier.h4x0r.space:9111/pbui
{"timestamp": 1488333215.967784, "passwords": 36}

For full credit, you must crack at least 6000 passwords. There are 10427 total hashes. Each password is no more than 8 characters long and only consists of lowercase letters and numeric digits.

NSFW Passwords

Many of the passwords in the password hashes collection were pulled from a set of the top 10,000 most commonly used passwords on the Internet, so the set may contain some vulgar and offensive words.

Parallel Universe

To speed up your computations, consider using different values of for CORES:

$ ./hulk.py -l 6 -c 2  # Use 2 processes

To find out how many cores a Linux machine has, you can do:

$ cat /proc/cpuinfo | grep processor | wc -l

Shared Resources

Remember that you need to share the student machines with all the other students in class.

Therefore, do not run hulk.py on lengths larger than 6... otherwise you will take up significant CPU resources away from other students for long periods of time.

If you want to crack passwords of length greater than 6, you will need to run hashcat on your own machines as explained in the Guru Point.

Task 4: `README.md`

In your README.md, respond to the following prompts:

Describe how you implemented the hulk.py script. In particular, briefly discuss:
- How you generated all the candidate password combinations.
- How you filtered the candidates to only contain valid passwords.
- How you handled processing on multiple cores.
- How you verified that your code works properly.
Complete the following table for passwords of length 5:

Processes Time

1

2

4

8

How does the number of processes utilized affect the amount of time required to crack passwords?
From your experience in this project and recalling your Discrete Math knowledge, what would make a password more difficult to brute-force: more complex alphabet or longer password? Explain.

Processes	Time
1
2
4
8

Guru Point (1 Point)

As can be seen, cracking passwords using brute-force is extremely slow, especially on CPUs. For extra credit, you are to download and use the hashcat program:

hashcat is the world's fastest and most advanced password recovery utility, supporting five unique modes of attack for over 200 highly-optimized hashing algorithms. hashcat currently supports CPUs, GPUs, and other hardware accelerators on Linux, Windows, and macOS, and has facilities to help enable distributed password cracking.

The secret to hashcat's speed is that it can take advantage of your GPU (aka. your video card), which turns out to be a massively parallel computing system (basically a commodity supercomputer). You will learn more about the GPU's power in your Computer Architecture class, but for this extra credit, all you need to do is take advantage of it by using hashcat.

Discrete Video Card

For optimal performance, your computer should use a discrete video card from Nvidia or AMD. Moreover, you will need to have the proper drivers and OpenCL software installed for hashcat to take advantage of your GPU.

Once you have downloaded hashcat and the password hashes file for this assignment, you should use hashcat to crack passwords of length 7 and 8:

# Brute-force SHA1 hashed passwords of length 8 with lowercase letters and digits
$ hashcat -a 3 -m 100 -1 ?l?d -o passwords.7 hashes.txt ?1?1?1?1?1?1?1

# Brute-force SHA1 hashed passwords of length 8 with lowercase letters and digits
$ hashcat -a 3 -m 100 -1 ?l?d -o passwords.8 hashes.txt ?1?1?1?1?1?1?1?1

This Might Take A While

Even though a GPU is basically a commodity supercomputer, cracking passwords using brute-force still takes time. With a discrete Nvidia GTX 980 in a desktop machine, cracking all passwords of length 8 took about 20 minutes. On the other hand, with an integrated Intel 620 in a laptop, cracking all passwords of length 8 had a time estimate of about 2 days.

This is still dramatically better than using say hundreds of machines for hours to crack passwords (which is what we have done in the past).

After executing these commands, the cracked passwords should be in the passwords.7 and passwords.8 files. You will need to use either Python or shell to extract the passwords and append them to the list of passwords of lengths 1 - 6 you found using hulk.py.

To receive credit for this Guru Point, you are to upload at least 10,000 passwords to deadpool.

Feedback

If you have any questions, comments, or concerns regarding the course, please provide your feedback at the end of your README.md.

Submission

To submit your assignment, please commit your work to the homework05 folder of your homework05 branch in your assignments GitLab repository. Your homework05 folder should only contain the following files:

Makefile
README.md
hulk.py
passwords.txt

The passwords.txt should contain a list of all all the passwords you found from the password hashes collection.

#--------------------------------------------------
# BE SURE TO DO THE PREPARATION STEPS IN ACTIVITY 0
#--------------------------------------------------

$ cd homework05                           # Go to Homework 05 directory
...
$ git add Makefile                        # Mark changes for commit
$ git add hulk.py                         # Mark changes for commit
$ git add passwords.txt                   # Mark changes for commit
$ git add README.md                       # Mark changes for commit
$ git commit -m "homework05: activity 1"  # Record changes
...
$ git push -u origin homework05           # Push branch to GitLab

Remember to create a merge request and assign the appropriate TA from the Reading 06 TA List.