The goal of this lab assignment to allow you to practice reading and writing files and exploring the properties of different data structures by writing a Python program that performs spell checking using either a list or dict.
For this assignment, record your work in a Jupyter Notebook and upload it to the form below by 11:59 PM Friday, February 28.
To help you get started, we have provided you with a starter notebook, which you can download and then use to record your answers. To utilize the starter notebook, download it to wherever you are running Jupyter Notebook and then use the Jupyter Notebook interface to open the downloaded file.
The starter notebook is already formatted for all the various sections you need to fill out. Just look for the red instructions that indicates were you need to write text or the cyan comments that indicate where you need to write code.
In order for us to implement a spell checking program, we must have a list of words that serves as our dictionary (ie. the set of all valid words). Rather than entering in data manually, we will simply fetch a file from the Internet to serve as the basis for our spelling dictionary. To do so, however, we will need to first implement a function that will allow us to retrieve data from the Internet and store it to a local file that we can open and process in Python.
To implement a function that downloads data from the Internet, we will use
the requests library to implement a wget
function that will fetch data
from a specified url
and store it to a local file:
def wget(url):
''' Download file from specified url and store in file '''
response = requests.get(url)
path = os.path.basename(url)
# TODO: store contents of response.content to file located at path
As can be seen, the starter notebook already contains two lines inside the function:
response = requests.get(url)
This fetches data from the remote machine at the specified url
and stores
the data in the response
object. To access the data, we can use the
response.content
attribute to view the string representation of the data
we fetched.
path = os.path.basename(url)
This sets path
to be the last portion of the url
. For instance if the
url
was https://github.com/dwyl/english-words/raw/master/words.txt
,
then path
would be set to words.txt
since that is the last component of
the url
.
To complete the function, you will need to write a few lines of code that do the following:
open the file in write and binary mode to get a file handle.
Use this file handle to write the response.content
data to the
opened file.
close the file handle.
Note, you should consider using the with statement so that you automatically close the file when you are done writing.
Once you have implemented the wget
function, you can test it by
downloading the set of words at:
https://github.com/dwyl/english-words/raw/master/words.txt
To do this, you can run the following:
>>> wget('https://github.com/dwyl/english-words/raw/master/words.txt')
This should create a local words.txt
file. We can check its size to
ensure it was successfully downloaded:
>>> os.path.getsize('words.txt')
4862966
This words.txt
file contains all the words that we will use as the basis
of our spelling dictionary.
Before we can perform spell checking, we must load our set of words into a data structure that we can search. Once we have a data structure with all the known words, we can perform spell checking by simply searching the data structure to see if any words in a phrase are in the data structure.
For this activity, you are to implement two functions that load the
contents of the words.txt
file you downloaded in the previous activity
into either a list or dict:
def load_words_list(path):
''' Load words in file into a list. '''
return None
def load_words_dict(path):
''' Loads words in file into a dict. '''
return None
Both of these functions simply open the specified file path
, loop over
the contents of the file line-by-line and add each word to the
corresponding data structure:
To make searching easier, each word should be converted into lowercase and stripped of whitespace.
Both functions should be nearly identical and relatively short (about 4
5
lines long).To verify that these functions are correct, you can check the output of the following commands:
# Check Words List
>>> WordsList = load_words_list('words.txt')
>>> len(WordsList)
466544
>>> 'wow' in WordsList
True
>>> 'asdf' in WordsList
False
# Check Words Dict
>>> WordsDict = load_words_dict('words.txt')
>>> len(WordsDict)
466544
>>> 'wow' in WordsDict
True
>>> 'asdf' in WordsDict
False
Now that we can load our set of words into either a list or dict, we can now implement a spell checking function that searches
either the WordsList
or WordsDict
for unknown words.
For this activity, you will need to implement the spell_check_line
function, which splits the line
of text into individual words and then
checks if each word is in the provided words
data structure:
PUNCTUATION = str.maketrans({key: None for key in string.punctuation})
def spell_check_line(line, words):
''' Check if each word in line is misspelled by searching
words data structure. '''
pass
To make searching more accurate, we will remove punctation from each word by doing the following:
word = word.translate(PUNCTUATION)
Note, the PUNCTUATION
constant is provided to you.
Make sure you convert each line into lowercase.
When a word is not found in the words
data structure, you should
print Unknown word: {word}
.
Once you have implemented the spell_check_line
function, you can test it
by giving it different lines of text:
>>> spell_check_line('Yayaya, understanding the digital world!', WordsList)
Unknown word: yayaya
>>> spell_check_line('Yayaya, understanding the digital world!', WordsDict)
Unknown word: yayaya
Now that we have a working spell checking function, we can write compare the performance of a list versus a dict by spell checking a whole file.
To perform the benchmark, we will first need a spell_check_file
function
that simply opens a file and calls spell_check_line
on each line of
the file:
def spell_check_file(path, words):
''' Check if each word in file is misspelled by searching
words data structure. '''
pass
Because some files may contain unicode characters, you should use the following to open the file:
open(path, 'r', encoding='utf-8')
You should simply call spell_check_line
on each line in the file.
Once you have implemented the spell_check_file
function, you will need to
download a book from Project Gutenberg, which is an online repository of
public domain literary works.
For instance, to benchmark A Modest Proposal by Jonathan Swift, we can do the following:
# Download A Modest Proposal from Project Gutenberg
>>> wget('http://www.gutenberg.org/cache/epub/1080/pg1080.txt')
# Benchmark spell checking file with words list
>>> %time spell_check_file('pg1080.txt', WordsList)
...
Unknown word: gutenbergtm
Unknown word: ebooks
Unknown word: ebooks
CPU times: user 18.1 s, sys: 25.4 ms, total: 18.1 s
Wall time: 18.1 s
# Benchmark spell checking file with words dict
>>> %time spell_check_file('pg1080.txt', WordsDict)
Unknown word: gutenbergtm
Unknown word: ebooks
Unknown word: ebooks
CPU times: user 34.5 ms, sys: 4.01 ms, total: 38.5 ms
Wall time: 32.6 ms
For this activity, you are to download at least two different works of
literature from Project Gutenberg and benchmark your spell_check_file
function using both the WordsList
and the WordsDict
.
Note, the longer the books, the longer the benchmarks will take. You should notice that one data structure is much faster than the other, so keep that in mind if the timings are taking a while.
After implementing your spell checker and running the benchmarks, answer the following questions:
Based on your benchmarking results, what can you say about the importance of choosing the correct data structure when implementing a program? Which data structure was faster for implementing a spell checker?
In which situations would you want to use a list instead of a dict? Conversely, when would you want to use a dict instead of a list?
Once you have completed your lab, submit your Jupyter Notebook using the form: