Lab 05: Spell Check

The goal of this lab assignment to allow you to practice reading and writing files and exploring the properties of different data structures by writing a Python program that performs spell checking using either a list or dict.

For this assignment, record your work in a Jupyter Notebook and upload it to the form below by 11:59 PM Friday, March 1.

Activity 0: Starter Notebook

To help you get started, we have provided you with a starter notebook, which you can download and then use to record your answers. To utilize the starter notebook, download it to wherever you are running Jupyter Notebook and then use the Jupyter Notebook interface to open the downloaded file.

Download Starter Notebook

The starter notebook is already formatted for all the various sections you need to fill out. Just look for the red instructions that indicates were you need to write text or the cyan comments that indicate where you need to write code.

Activity 1: Download

In order for us to implement a spell checking program, we must have a list of words that serves as our dictionary (ie. the set of all valid words). Rather than entering in data manually, we will simply fetch a file from the Internet to serve as the basis for our spelling dictionary. To do so, however, we will need to first implement a function that will allow us to retrieve data from the Internet and store it to a local file that we can open and process in Python.

Function

To implement a function that downloads data from the Internet, we will use the requests library to implement a wget function that will fetch data from a specified url and store it to a local file:

def wget(url):
    ''' Download file from specified url and store in file '''
    response = requests.get(url)
    path     = os.path.basename(url)

    # TODO: store contents of response.content to file located at path

As can be seen, the starter notebook already contains two lines inside the function:

response = requests.get(url)

This fetches data from the remote machine at the specified url and stores the data in the response object. To access the data, we can use the response.content attribute to view the string representation of the data we fetched.

path     = os.path.basename(url)

This sets path to be the last portion of the url. For instance if the url was https://github.com/dwyl/english-words/raw/master/words.txt, then path would be set to words.txt since that is the last component of the url.

To complete the function, you will need to write a few lines of code that do the following:

open the file in write and binary mode to get a file handle.
Use this file handle to write the response.content data to the opened file.
close the file handle.

Note, you should consider using the with statement so that you automatically close the file when you are done writing.

Testing

Once you have implemented the wget function, you can test it by downloading the set of words at:

https://github.com/dwyl/english-words/raw/master/words.txt

To do this, you can run the following:

>>> wget('https://github.com/dwyl/english-words/raw/master/words.txt')

This should create a local words.txt file. We can check its size to ensure it was successfully downloaded:

>>> os.path.getsize('words.txt')
4862966

This words.txt file contains all the words that we will use as the basis of our spelling dictionary.

Activity 2: Words

Before we can perform spell checking, we must load our set of words into a data structure that we can search. Once we have a data structure with all the known words, we can perform spell checking by simply searching the data structure to see if any words in a phrase are in the data structure.

Functions

For this activity, you are to implement two functions that load the contents of the words.txt file you downloaded in the previous activity into either a list or dict:

def load_words_list(path):
    ''' Load words in file into a list. '''
    return None

def load_words_dict(path):
    ''' Loads words in file into a dict. '''
    return None

Both of these functions simply open the specified file path, loop over the contents of the file line-by-line and add each word to the corresponding data structure:

To make searching easier, each word should be converted into lowercase and stripped of whitespace.
Both functions should be nearly identical and relatively short (about 4
5 lines long).

Testing

To verify that these functions are correct, you can check the output of the following commands:

# Check Words List
>>> WordsList = load_words_list('words.txt')

>>> len(WordsList)
466544

>>> 'wow' in WordsList
True

>>> 'asdf' in WordsList
False

# Check Words Dict
>>> WordsDict = load_words_dict('words.txt')

>>> len(WordsDict)
466544

>>> 'wow' in WordsDict
True

>>> 'asdf' in WordsDict
False

Activity 3: Spell Check

Now that we can load our set of words into either a list or dict, we can now implement a spell checking function that searches either the WordsList or WordsDict for unknown words.

Function

For this activity, you will need to implement the spell_check_line function, which splits the line of text into individual words and then checks if each word is in the provided words data structure:

PUNCTUATION = str.maketrans({key: None for key in string.punctuation})

def spell_check_line(line, words):
    ''' Check if each word in line is misspelled by searching 
    words data structure. '''
    pass

To make searching more accurate, we will remove punctation from each word by doing the following:
```
word = word.translate(PUNCTUATION)
```
Note, the PUNCTUATION constant is provided to you.
Make sure you convert each line into lowercase.
When a word is not found in the words data structure, you should print Unknown word: {word}.

Testing

Once you have implemented the spell_check_line function, you can test it by giving it different lines of text:

>>> spell_check_line('Yayaya, understanding the digital world!', WordsList)
Unknown word: yayaya

>>> spell_check_line('Yayaya, understanding the digital world!', WordsDict)
Unknown word: yayaya

Activity 4: Benchmark

Now that we have a working spell checking function, we can write compare the performance of a list versus a dict by spell checking a whole file.

Function

To perform the benchmark, we will first need a spell_check_file function that simply opens a file and calls spell_check_line on each line of the file:

def spell_check_file(path, words):
    ''' Check if each word in file is misspelled by searching
    words data structure. '''
    pass

Because some files may contain unicode characters, you should use the following to open the file:
```
open(path, 'r', encoding='utf-8')
```
You should simply call spell_check_line on each line in the file.

Benchmarking

Once you have implemented the spell_check_file function, you will need to download a book from Project Gutenberg, which is an online repository of public domain literary works.

For instance, to benchmark A Modest Proposal by Jonathan Swift, we can do the following:

# Download A Modest Proposal from Project Gutenberg
>>> wget('http://www.gutenberg.org/cache/epub/1080/pg1080.txt')

# Benchmark spell checking file with words list
>>> %time spell_check_file('pg1080.txt', WordsList)
...
Unknown word: gutenbergtm
Unknown word: ebooks
Unknown word: ebooks
CPU times: user 18.1 s, sys: 25.4 ms, total: 18.1 s
Wall time: 18.1 s

# Benchmark spell checking file with words dict
>>> %time spell_check_file('pg1080.txt', WordsDict)
Unknown word: gutenbergtm
Unknown word: ebooks
Unknown word: ebooks
CPU times: user 34.5 ms, sys: 4.01 ms, total: 38.5 ms
Wall time: 32.6 ms

For this activity, you are to download at least two different works of literature from Project Gutenberg and benchmark your spell_check_file function using both the WordsList and the WordsDict.

This May Take A While

Note, the longer the books, the longer the benchmarks will take. You should notice that one data structure is much faster than the other, so keep that in mind if the timings are taking a while.

Activity 5: Reflection

After implementing your spell checker and running the benchmarks, answer the following questions:

Based on your benchmarking results, what can you say about the importance of choosing the correct data structure when implementing a program? Which data structure was faster for implementing a spell checker?
In which situations would you want to use a list instead of a dict? Conversely, when would you want to use a dict instead of a list?

Submission

Once you have completed your lab, submit your Jupyter Notebook using the form:

Submit Lab Notebook