The goal of this homework assignment is to allow you to practice using functional programming techniques to process data in Python. In this assignment, you will write a script that brute-force attacks a large collection of passwords using multiple processes. That is, we will use functional programming to construct a concurrent application, and then exploit this concurrency by using multiple CPU cores to execute the program in parallel.
For this assignment, record your scripts and any responses to the following
activities in the in the homework05
folder of your assignments GitLab
repository and push your work by 11:59 AM Saturday, March 3.
Before starting this homework assignment, you should first perform a git
pull
to retrieve any changes in your remote GitLab repository:
$ cd path/to/repository # Go to assignments repository $ git checkout master # Make sure we are in master branch $ git pull --rebase # Get any remote changes not present locally
Next, create a new branch for this assignment:
$ git checkout -b homework05 # Create homework05 branch and check it out
You are now ready to work on the activities below.
hulk.py
(10 Points)With various password leaks being announced on a weekly basis there has been a lot of discussion on password hygiene and password strength. There is even a website to check have i been pwned.
In all decent password accounting systems, the raw password is rarely stored. Instead a cryptographic hash such as MD5 or SHA1 is used to record a user's password digest or checksum and it is these hashes that are leaked to outsiders in these all too frequent breaches. Because these cryptographic hash functions are generally considered one-way functions, attackers cannot directly obtain the original input password even though they know the output of the cryptographic hash.
For instance, the SHA1 digest of goirish
is:
$ printf goirish | sha1sum
d546785e887d9aaae85102c28769becfb86dde7d -
As you can see, the string d546785e887d9aaae85102c28769becfb86dde7d
does
not provide an attacker any clues about the original password goirish
. To
obtain the original text, attackers often employ various
techniques
such as brute-force cracking.
hulk.py
For the first activity, you are to create hulk.py
, which is a script that
uses brute-force to smash a set of SHA1 hashes:
$ ./hulk.py -h Usage: hulk.py [-a alphabet -c CORES -l LENGTH -p PATH -s HASHES] -a ALPHABET Alphabet to use in permutations -c CORES CPU Cores to use -l LENGTH Length of permutations -p PREFIX Prefix for all permutations -s HASHES Path of hashes file
Given, an ALPHABET
(default is abcdefghijklmnopqrstuvwxyz0123456789
),
hulk.py
will compute the SHA1 hash of every permutation of the
ALPHABET
for the specified LENGTH
and check if it is in the set of
HASHES
. If a PREFIX
is specified, then this should be inserted before
each candidate permutation.
For instance, suppose we had a sample.hashes
file that contained the
following SHA1 checksums:
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8 e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98 84a516841ba77a5b4648de2cd0dfcb30ea46dbb4 a9993e364706816aba3e25717850c26c9cd0d89d
If we executed hulk.py
with a LENGTH
of 1
, we should get the following
result:
$ ./hulk.py -l 1 -s sample.hashes a b c
That is, hulk.py
determined that three of the hashes correspond to the
passwords a
, b
, and c
.
If we executed hulk.py
with a LENGTH
of 2
and a PREFIX
of a
, we
should get the following result:
$ ./hulk.py -l 2 -s sample.hashes -p a abc
That is, hulk.py
determined that 1 of the hashes corresponds to the
password abc
. Note, we could have achieved the same results by executing:
$ ./hulk.py -l 3 -s sample.hashes abc
The difference is the former searches each permutation in the form aXX
where X
is a letter from the ALPHABET
due to the PREFIX
of a
, while
the latter searches each permutation in the form XXX
.
Finally, if CORES
is greater than 1
and the LENGTH
is greater than 1
,
the script will use the multiprocessing module to brute-force passwords
in parallel with the specified number of CORES
:
$ ./hulk.py -l 1 -c 2 -s sample.hashes a b c
To help you get started, the instructor has provided you with the following starter code:
# Download starter code
$ curl -LO https://gitlab.com/nd-cse-20289-sp18/cse-20289-sp18-assignments/raw/master/homework05/hulk.py
The starter code contains the following:
#!/usr/bin/env python3 import functools import hashlib import itertools import multiprocessing import os import string import sys # Constants ALPHABET = string.ascii_lowercase + string.digits ARGUMENTS = sys.argv[1:] CORES = 1 HASHES = 'hashes.txt' LENGTH = 1 PREFIX = '' # Functions def usage(exit_code=0): print('''Usage: {} [-a alphabet -c CORES -l LENGTH -p PATH -s HASHES] -a ALPHABET Alphabet to use in permutations -c CORES CPU Cores to use -l LENGTH Length of permutations -p PREFIX Prefix for all permutations -s HASHES Path of hashes file'''.format(os.path.basename(sys.argv[0]))) sys.exit(exit_code) def sha1sum(s): ''' Generate sha1 digest for given string. >>> sha1sum('abc') 'a9993e364706816aba3e25717850c26c9cd0d89d' >>> sha1sum('wake me up inside') '5bfb1100e6ef294554c1e99ff35ad11db6d7b67b' >>> sha1sum('baby now we got bad blood') '9c6d9c069682759c941a6206f08fb013c55a0a6e' ''' # TODO: Implement return '' def permutations(length, alphabet=ALPHABET): ''' Recursively yield all permutations of alphabet up to provided length. >>> list(permutations(1, 'ab')) ['a', 'b'] >>> list(permutations(2, 'ab')) ['aa', 'ab', 'ba', 'bb'] >>> list(permutations(1)) # doctest: +ELLIPSIS ['a', 'b', ..., '9'] >>> list(permutations(2)) # doctest: +ELLIPSIS ['aa', 'ab', ..., '99'] >>> import inspect; inspect.isgeneratorfunction(permutations) True ''' # TODO: Implement as generator yield '' def smash(hashes, length, alphabet=ALPHABET, prefix=''): ''' Return all password permutations of specified length that are in hashes >>> smash([sha1sum('ab')], 2) ['ab'] >>> smash([sha1sum('abc')], 2, prefix='a') ['abc'] >>> smash(map(sha1sum, 'abc'), 1, 'abc') ['a', 'b', 'c'] ''' # TODO: Implement with list or generator comprehensions return [] # Main Execution if __name__ == '__main__': # Parse command line arguments # Load hashes set # Execute smash function # Print passwords # vim: set sts=4 sw=4 ts=8 expandtab ft=python:
You are to complete the sections marked TODO
in order to complete the
hulk.py
script.
For sha1sum
, you must compute the SHA1 hash of a string by using
the hashlib library. To convert a unicode string into bytes, you can
call the str.encode method of the string object.
For permutations
, you must use a recursive algorithm to implement
a generator using the yield statement. You may not use any of the
utilities in the itertools module for this function.
For smash
, you must implement the function using only list
comprehensions or generator expressions (you can use multiple ones).
You can parse the command line options as we have done in the past, or you can investigate using argparse or getopt.
You must use either a dict or set to store the collection of SHA1 hashes.
You should execute the smash
function to get a list of passwords for the
specified command line options.
As noted above, if CORES
is greater than 1
and the LENGTH
is
greater than 1
, you must use the multiprocessing module to brute-force
the passwords in parallel.
To allow for parallel execution, you will need to do divide up the
work normally done by one process. The easiest way to accomplish this
is by taking advantage of the prefix
feature of smash
.
For instance, if the specified LENGTH
is 4
, you can smash
passwords
of length 3
with prefixes
of [a
, b
, ..., 9
]. Likewise if the
specified LENGTH
is 5
, you can smash
passwords of length 3
with
prefixes
of [aa
, ab
, ..., zz
]:
Passwords of length 4 Prefixes + Passwords of length 3 aaaa a + aaa -> aaaa aaab a + aab -> aaab .... .... baaa b + aaa -> baaa baab b + aab -> baab .... ....
Because the imap
function of the multiprocessing.Pool object only
allows passing one argument to the function to apply, you will need to
use the functools.partial function to curry the arguments:
subsmash = functools.partial(smash, hashes, ..., ALPHABET)
The functools.partial basically creates a new function that already has
the initial hashes
, length
, and alphabet
parameters set. This
means that subsmash
only takes the prefix
argument.
You can then use this new subsmash
function along with
itertools.chain.from_iterable to produce a list of passwords:
passwords = itertools.chain.from_iterable(pool.imap(subsmash, prefixes))
The prefixes
argument is an iterable of all the prefixes you wish to
prepend to the permutations you are smashing as described above.
To aid you in testing the hulk.py
script, we are providing you with
test_hulk.sh
, which you can use as follows:
# Download Makefile $ curl -LO https://gitlab.com/nd-cse-20289-sp18/cse-20289-sp18-assignments/raw/master/homework05/Makefile # Download test script, hashes.txt, and run test $ make # Run test script manually $ ./test_hulk.sh Testing hulk.py ... hulk.py is executable ... Success hulk.py has usage ... Success hulk.py doctest ... Success hulk.py length 1 ... Success hulk.py length 2 ... Success hulk.py length 3 ... Success hulk.py length 2 (CORES: 2) ... Success hulk.py length 3 (CORES: 2) ... Success hulk.py length 2 (PREFIX: a) ... Success hulk.py length 2 (PREFIX: a, CORES: 2) ... Success hulk.py length 3 (PREFIX: a, CORES: 2) ... Success Score 7.00
In addition to the test script, the starter code contains embedded doctests, which you can verify by doing the following:
$ python3 -m doctest -v hulk.py ... 3 items passed all tests: 5 tests in hulk.permutations 3 tests in hulk.sha1sum 3 tests in hulk.smash 11 tests in 5 items. 11 passed and 0 failed. Test passed.
This will test each of the three functions individually and thus allow you to build and test your program one function at a time.
Do not try to write the whole program at once. Implement one feature at a
time and verify that it works. The doctests are unit tests that
allow you to test individual components, while the test_hulk.sh
allows you
to test the completed solution.
Once you are confident that your program is correct, you are to use hulk.py
to brute-force crack set of 10419
password hashes found at:
https://gitlab.com/nd-cse-20289-sp18/cse-20289-sp18-assignments/raw/master/homework05/hashes.txt
To keep track of who has cracked the most passwords, you can submit your set of discovered passwords to the deadpool:
$ cat PASSWORDS | curl --data-binary @- http://xavier.h4x0r.space:9111/NETID
Replace PASSWORDS
with the name of the file with your passwords and NETID
with your Notre Dame NetID:
# Generate passwords of length 1 and store them in passwords.txt $ ./hulk.py -l 1 | tee -a passwords.txt ... # Upload passwords.txt $ cat passwords.txt | curl --data-binary @- http://xavier.h4x0r.space:9111/pbui {"timestamp": 1488333215.967784, "passwords": 36}
For full credit, you must crack at least 6000
passwords. There are
10419
total hashes. Each password is no more than 8
characters long and
only consists of lowercase letters and numeric digits.
Many of the passwords in the password hashes collection were pulled from a set of the top 10,000 most commonly used passwords on the Internet, so the set may contain some vulgar and offensive words.
To speed up your computations, consider using different values of for
CORES
:
$ ./hulk.py -l 6 -c 2 # Use 2 processes
To find out how many cores a Linux machine has, you can do:
$ cat /proc/cpuinfo | grep processor | wc -l
Remember that you need to share the student machines with all the other students in class.
Therefore, do not run hulk.py
on lengths larger than 6
... otherwise
you will take up significant CPU resources away from other students for long
periods of time.
If you want to crack passwords of length greater than 6
, you will need to
use your own machines or do the Guru Point in Reading 07.
README.md
In your README.md
, respond to the following prompts:
Describe how you implemented the hulk.py
script. In particular, briefly
discuss:
How you generated all the candidate password combinations.
How you filtered the candidates to only contain valid passwords.
How you handled processing on multiple cores.
How you verified that your code works properly.
Complete the following table for passwords of length 5
:
Processes | Time |
---|---|
1 | |
2 | |
4 | |
6 | |
8 |
How does the number of processes utilized affect the amount of time required to crack passwords?
From your experience in this project and recalling your Discrete Math knowledge, what would make a password more difficult to brute-force: more complex alphabet or longer password? Explain.
For extra credit, you are to practice writing a Map-Reduce application. In this case, you are to write both the map and reduce components for generating an Inverted Index:
The map function parses each line in an input file, and emits a sequence of
<word, line number>
pairs. The reduce function accepts all pairs for a given word, sorts the corresponding line numbers, and emits a<word, list(line numbers)>
pair. The set of all the output pairs forms a simple inverted index.
Note, this problem is inspired by the description of an Inverted Index in the MapReduce: Simplified Data Processing on Large Clusters paper.
As discussed in class, the canonical Map-Reduce example is performing a word count on a large collection of documents as shown below:
In the Map-Reduce programming model, you must define two functions:
A map function that transforms, filters, or selects input data
A reduce function that aggregates, combines, or collections results.
While Map-Reduce is normally performed on massive datasets using many networked machines, we can use the Hadoop Streaming interface to write Map-Reduce programs that run on small datasets on our local machines. The Hadoop Streaming interface simply specifies that both the map and reduce functions must emit key value pairs in the following format:
key\tvalue
That is, each line of output consists of the key
followed by a tab
character (i.e. \t
) and then the value
.
With this in mind, we can implement the word count application using the Hadoop Streaming interface with the following Python scripts:
# map.py import sys for line in sys.stdin: for word in line.strip().split(): print('{}\t{}'.format(word, 1)) # reduce.py import sys counts = {} for line in sys.stdin: k, v = line.split('\t', 1) counts[k] = counts.get(k, 0) + int(v) for k, v in sorted(counts.items()): print('{}\t{}'.format(k, v))
Given these two Python scripts, we can simulate Map-Reduce on a local machine by doing the following:
$ cat inputs.txt | ./map.py | sort | ./reduce.py
The end result will be a stream of key value pairs in the Hadoop Streaming format.
iv_map.py
and iv_reduce.py
For this extra credit, you will have to write two separate scripts
iv_map.py
and iv_reduce.py
that uses the Hadoop Streaming interface to
generate an Inverted Index.
The iv_map.py
script will receive the text of a document via standard
input, while the iv_reduce.py
script will receive the results of the
iv_map.py
program via standard input in Hadoop Streaming format. This
means that both the iv_map.py
and the iv_reduce.py
programs should emit
key value pairs in the Hadoop Streaming format.
For example, given the following input file:
'Beware the Jabberwock, my son! The jaws that bite, the claws that catch! Beware the Jubjub bird, and shun The frumious Bandersnatch!"
Running ./iv_map.py < input.sample
should yield the following intermediate
outputs:
beware 1 the 1 jabberwock 1 my 1 son 1 the 2 jaws 2 that 2 bite 2 the 2 claws 2 that 2 catch 2 beware 3 the 3 jubjub 3 bird 3 and 3 shun 3 the 4 frumious 4 bandersnatch 4
That is, each line contains the word
and the line number
on which it
occurs.
Note, we start counting the line numbers at 1
, we ignore
punctuation other than -
, and we ignore case by making everything
lower-case.
Running ./iv_map.py < input.sample | sort | ./iv_reduce.py
should yield the
following results:
and 3 bandersnatch 4 beware 1 3 bird 3 bite 2 catch 2 claws 2 frumious 4 jabberwock 1 jaws 2 jubjub 3 my 1 shun 3 son 1 that 2 the 1 2 3 4
That is, each line contains the word
and the line numbers
on which it
appears.
Note, the words are in lexicographical order and the line numbers are sorted from lowest to highest and are unique.
For iv_map.py
, you can use list comprehensions to filter out
undesirable characters.
For iv_reduce.py
, you can use a set to remove duplicates.
To aid you in testing the iv_map.py
and iv_reduce.py
scripts, we are
providing you with test_iv.sh
, which you can use as follows:
# Download script $ curl -LO https://gitlab.com/nd-cse-20289-sp18/cse-20289-sp18-assignments/raw/master/homework05/test_iv.sh # Make script executable $ chmod +x test_iv.sh # Run test script $ ./test_iv.sh Testing iv... iv_map.py is executable ... Success iv_reduce.py is executable ... Success iv_map.py < iv_input.txt ... Success iv_reduce.py < iv_output.map ... Success iv_map.py | sort | iv_reduce.py ... Success Score 1.00
After you download and execute the test_iv.sh
script, you will have the
iv_input.txt
, iv_output.map
, and iv_output.reduce
files which you can
use to manually test the scripts:
# Test iv_map.py $ ./iv_map.py < iv_input.txt | diff - iv_output.map # Test iv_reduce.py $ ./iv_reduce.py < iv_output.map | diff - iv_output.reduce # Test whole pipeline $ ./iv_map.py < iv_input.txt | sort | ./iv_reduce.py | diff - iv_output.reduce
You can use grep
to check the results of your Inverted Index by doing the
following:
$ grep -n -i beware iv_input.txt 1:"Beware the Jabberwock, my son! 3:Beware the Jubjub bird, and shun
As can be seen, the word beware
does appear on lines 1
and 3
as the
Map-Reduce application claims.
To get credit, you must show either a TA or the instructor a demonstration of your Inverted Index scripts (both the source code and passing the test).
If you have any questions, comments, or concerns regarding the course, please
provide your feedback at the end of your README.md
.
To submit your assignment, please commit your work to the homework05
folder
of your homework05
branch in your assignments GitLab repository.
Your homework05
folder should only contain the following files:
Makefile
README.md
hulk.py
passwords.txt
The passwords.txt
should contain a list of all all the passwords you found
from the password hashes collection.
#-------------------------------------------------- # BE SURE TO DO THE PREPARATION STEPS IN ACTIVITY 0 #-------------------------------------------------- $ cd homework05 # Go to Homework 05 directory ... $ git add Makefile # Mark changes for commit $ git add hulk.py # Mark changes for commit $ git add passwords.txt # Mark changes for commit $ git add README.md # Mark changes for commit $ git commit -m "homework05: activity 1" # Record changes ... $ git push -u origin homework05 # Push branch to GitLab
Remember to create a merge request and assign the appropriate TA from the Reading 06 TA List.