The fourth project is a continuation of Project 03. You are to build a few more backends for a key-value store in C++. In particular, you must implement the following additional map backends, which support insertion and searching operations:
Unordered: This stores the key-value pairs using a std::unordered_map.
Chained: This stores the key-value pairs in a hash table using separate chaining.
Open: This stores the key-value pairs in a hash table using open addressing and linear probing.
Once you have finished implementing these new map backends in the
libmap.a library, you will have to update the associated map_bench and
frequencies applications to support these additional map backends.
Of course, map_test is once again provided so you can check the correctness
of all the map backends.
This project is to be done in groups of 1 - 3 and is due midnight Friday, November 11, 2016.
Updated reference implementations of the map_bench benchmark utility
and the frequencies application can be found at
/afs/nd.edu/user15/pbui/pub/bin/map_bench and
/afs/nd.edu/user15/pbui/pub/bin/frequencies respectively.
Here is the usage message for map_bench:
```
# Display usage message
$ map_bench -h
    -b BACKEND    Which Map backend (unsorted, sorted, bst, rbtree, treap, unordered, chained, open)
    -n NITEMS     Number of items to benchmark
    -p PADLENGTH  Amount to pad the keys with leading 0's

# Run with unordered backend on 1000000 items
$ map_bench -b unordered -n 1000000
Insert: 1.4775 s
Search: 0.99349 s

# Run with chained backend on 1000000 items and default load factor (0.9)
$ map_bench -b chained -n 1000000
Insert: 4.3771 s
Search: 1.0269 s

# Run with chained backend on 1000000 items and load factor of 0.8
$ map_bench -b chained-0.8 -n 1000000
Insert: 4.1273 s
Search: 0.9992 s

# Run with open backend on 1000000 items and load factor of 0.8
$ map_bench -b open-0.8 -n 1000000
Insert: 2.1962 s
Search: 0.79791 s
```
As can be seen, map_bench takes three optional arguments. The -b argument
allows the user to select which map backend to utilize. The -n argument
specifies the number of items to test. The -p argument specifies how many
0's should be used to pad the keys and values. By default -b is
unsorted, -n is 1 and -p is 1.
Note: For the chained and open backends, you can append the load
factor to use by adding -X.X to the end of the BACKEND parameter. If
none is specified, then the default load factor is used (ie. 0.9).
Note: You can mostly ignore PADLENGTH for this project and just assume the
default value of 1.
Here is the usage message for frequencies:
```
$ frequencies -h
    -b BACKEND   Which Map backend (unsorted, sorted, bst, rbtree, treap, unordered, chained, open)
    -d DUMPFLAG  Which dump flag (key, value, key_value, value_key)
```
As can be seen, frequencies takes two optional arguments. Once again, the
-b argument allows the user to select which map backend to utilize. The
-d argument specifies how the output should be formatted (ie. output key,
value, both key then value, or both value then key). By default -b is
unsorted and -d is value_key.
As with most Unix filters, frequencies reads from standard input and
then outputs to standard output:
```
# Retrieve Beowulf from Project Gutenberg
$ curl -L http://www.gutenberg.org/ebooks/16328.txt.utf-8 > pg16328.txt

# Perform word count on Beowulf using chained map
$ frequencies -b chained < pg16328.txt | sort -rn | head
2558 the
1536 of
913 to
758 and
616 in
590 The
467 his
435 a
373 he
372 with
```
To start this project, you must form a group of 1 - 3 people. One group member should fork the Project 04 repository on GitLab:
https://gitlab.com/nd-cse-30331-fa16/project04
Once this repository has been forked, follow the instructions from Reading 00 to:
Make the repository private
Configure access to the repository
Make sure you add all the members of the team in addition to the instructional staff.
As with the programming challenges in this course, the project repository
contains a Makefile for automated building and testing of your program.
In addition to these files, the repository should also contain the following:
map.h: This is the map library header file that contains the declarations
for the Map classes, a few type definitions, and some constants. You may
need to edit this file as you implement the different backends.
constants.cpp: This contains the definition of the NONE constant which
is a sentinel to indicate that a search failed (ie. the key is not in the
map). Likewise, it also contains DEFAULT_LOAD_FACTOR and
DEFAULT_TABLE_SIZE, which specify the default load factor and table sizes
for the ChainedMap and OpenMap.
unsorted.cpp: This contains the unsorted container implementation of a map.
sorted.cpp: This contains the sorted container implementation of a map.
bst.cpp: This contains the binary search tree implementation of a map.
rbtree.cpp: This contains the red-black tree implementation of a map.
unordered.cpp: This contains the std::unordered_map implementation of a
map.
chained.cpp: This contains the separate chaining implementation of a
map.
open.cpp: This contains the open addressing implementation of a
map.
map_test.cpp: This is the testing utility provided to you to check the
correctness of your implementations.
map_bench.cpp: This is the map benchmarking application.
frequencies.cpp: This is the word counting application.
Note: The solutions for the map backends from Project 03 are provided
to you, along with the solutions for map_bench and frequencies.
Further descriptions of these files and what you need to implement are provided below.
For an overview of the map.h header file, please review the information in
the Project 03 description. The following is a list of the major additions
to the header file:
Both the chained and open map implementations require explicit use
of a hash function. As discussed in class, we can instantiate a hash
function for a supported type by doing:
```
std::hash<Type> hasher;
```
To simplify your code, we have the following type definition for a string hash function:
```
typedef std::hash<std::string> StringHasher;
```
This means you can create a class variable of type StringHasher in the
ChainedMap and OpenMap classes:
```
private:
    StringHasher hfunc;
```
In addition to the NONE constant used in Project 03, we also now have
the DEFAULT_LOAD_FACTOR and DEFAULT_TABLE_SIZE constants which
define the default load factor and table sizes for the ChainedMap
and OpenMap classes:
```
extern const double DEFAULT_LOAD_FACTOR;
extern const size_t DEFAULT_TABLE_SIZE;
```
The constants.cpp file defines these values to be 0.9 for
DEFAULT_LOAD_FACTOR and 1<<10 for the DEFAULT_TABLE_SIZE.
The header file also contains three additional Map classes:
UnorderedMap, ChainedMap, and OpenMap. Each of these classes supports
the insert, search and dump methods of the base Map class.
You will need to modify the UnorderedMap, ChainedMap, and OpenMap classes by
adding any necessary private members. You may also need to define your own
constructors and destructors for some of the map backends.
The first additional backend is to implement a map using the
std::unordered_map in unordered.cpp. For this backend, you will need to add an
entries private member to the UnorderedMap class.

The second additional backend is to implement a map using a hash table
with separate chaining in chained.cpp.
In addition to the three base methods (insert, search, and dump), you
need to implement a constructor, destructor, and a resize method:
```
void resize(const size_t new_size);
```
Given a new_size, this method allocates a new table of the given new_size, copies all the entries from the old table to this new table, and then deallocates the old table.
When you grow the table in the resize method, you cannot simply copy the
entries from the old table to the new table. Instead, you must rehash each
entry and insert it into the new table.
For instance, suppose the old table was size 4 and looks like this:
```
[ 0 | 4    ]
[ 1 | 1, 5 ]
[ 2 |      ]
[ 3 | 3, 7 ]
```
If we resize the table by doubling the number of buckets to 8 and then hash
and insert each entry into the new table, we get something like this:
```
[ 0 |   ]
[ 1 | 1 ]
[ 2 |   ]
[ 3 | 3 ]
[ 4 | 4 ]
[ 5 | 5 ]
[ 6 |   ]
[ 7 | 7 ]
```
Notice that because of the change in the table size, the keys 4, 5, and
7 moved to different buckets. Moreover, after the resize, the keys are
more evenly distributed, which is one of the goals of resizing!
With some careful coding, you should be able to simply call your insert
method to perform the hashing and inserting into the new table (rather than
duplicating the code).
Although we used a std::vector<std::set<std::string>> in our class
example of a hash table using separate chaining, it is recommended that
you use an array of std::map<std::string, std::string> as your table.
Using a raw array instead of a std::vector makes resizing simpler.
The ChainedMap must be able to set its load factor and table
size during object initialization.
With careful implementation of resize, you should be able to initialize
the table by simply calling this method with the appropriate size.
You should perform a resize whenever the current load factor is
above the limit set during initialization.
To accurately compute the load factor, you will need to track how many
items are in the ChainedMap.
The third additional backend is to implement a map using a hash table
with open addressing with linear probing in open.cpp.
You will need to implement the three base methods (insert, search, and
dump), along with a constructor, destructor, resize method, and
an internal locate method:
```
size_t locate(const std::string &key);
```
Given a key, this function returns the bucket that either contains this key or is the next empty bucket.
Note: You may use the NONE constant as the sentinel value to denote
an empty bucket.
You should use an array of Entry's as the table.
The OpenMap must be able to set its load factor and table size
during object initialization.
With careful implementation of resize, you should be able to initialize
the table by simply calling this method with the appropriate size.
You should perform a resize whenever the current load factor is
above the limit set during initialization or when locate fails to find an
empty bucket.
To accurately compute the load factor, you will need to track how many
items are in the OpenMap.
As described in Project 03, the map_bench application in map_bench.cpp
is a utility for measuring how long inserting and searching take with a
particular map backend.
The solution for Project 03 is provided to you. You must update the code
to support the additional backends: unordered, chained, and open.
Note: The chained and open backends must support the ability to
specify the load factor by appending the value to the end of the backend
name:
```
# Use chained map with load factor of 0.75
$ map_bench -b chained-0.75 -n 1000

# Use open map with load factor of 0.5
$ map_bench -b open-0.5 -n 1000
```
If no load factor is specified, you should simply use
DEFAULT_LOAD_FACTOR.
As described in Project 03, the frequencies application in
frequencies.cpp is a utility for counting how often every word in standard
input appears.
The solution for Project 03 is provided to you. You must update the code
to support the additional backends: unordered, chained, and open.
Note: You do not need to support specifying the load factor in this application.
Note: This program will be used to ensure the correctness of your hash
table implementations by comparing the output of each backend with that of
the RBTreeMap.
When you are finished implementing your map_bench benchmarking utility and
frequencies application, answer the following questions in the
project04/README.md:
What is the complexity of each of the map backends (ie. Unordered,
Chained, Open)? For each implementation, briefly explain the average,
and worst case complexities of inserting and searching using the
particular method in terms of Big-O notation.
Using map_bench, benchmark all eight map backends for when NITEMS
is:
10, 100, 1000, 10000, 100000, 1000000, 10000000
For the chained backend, you should test the following load factors:
0.5, 0.75, 1.0, 5.0, 10.0, 100.0
For the open backend, you should test the following load factors:
0.5, 0.6, 0.7, 0.8, 0.9, 1.0
Record the elapsed time for the insert and search benchmarks of each
experiment in a Markdown table as demonstrated below:
| BACKEND       | NITEMS   | INSERT  | SEARCH  |
|---------------|----------|---------|---------|
| unsorted      | 10       | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 100      | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 1000     | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 10000    | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 100000   | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 1000000  | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 10000000 | inf     | inf     |
| sorted        | 10000000 | inf     | inf     |
| bst           | 10000000 | 9.3875  | 9.2954  |
| rbtree        | 10000000 | 14.6810 | 9.7553  |
| treap         | 10000000 | 10.0640 | 10.2470 |
| unordered     | 10000000 | 9.1063  | 5.3702  |
| chained-0.500 | 10000000 | 23.1980 | 5.3952  |
| chained-0.750 | 10000000 | 18.8790 | 5.2926  |
| chained-1.000 | 10000000 | 23.0340 | 5.5527  |
| chained-5.000 | 10000000 | 20.9280 | 7.5628  |
| chained-10.00 | 10000000 | 23.1560 | 8.5264  |
| chained-100.0 | 10000000 | 38.1510 | 13.4530 |
| open-0.5      | 10000000 | 10.0810 | 3.3071  |
| open-0.6      | 10000000 | 8.0291  | 4.0538  |
| open-0.7      | 10000000 | 9.4615  | 4.1696  |
| open-0.8      | 10000000 | 11.2160 | 4.2335  |
| open-0.9      | 10000000 | 16.1610 | 4.9198  |
| open-1.0      | 10000000 | inf     | inf     |
Note: A full table of example results can be seen here.
For Unsorted and Sorted, as you increase NITEMS, the benchmark may
slow down significantly. Once you have reached a set of parameters for
these methods that take more than 60 seconds, you can simply record
that these settings took a long time and specify INFINITY for the
elapsed time.
Using frequencies, compute the word count on at least three texts
from Project Gutenberg using all eight different backends.
Record the elapsed time and memory usage of each experiment in a Markdown table:
| Backend  | Text        | File Size | Elapsed Time | Memory |
|----------|-------------|-----------|--------------|--------|
| Unsorted | Beowulf.txt | 0         | 0            |        |
| ...      | ...         | ...       | ...          | ...    |
Note: Refer to Project 03 for a list of texts you can use.
Hint: Use the the measure program from previous assignments to
measure the elapsed time and memory usage.
After you have performed your benchmarks:
Discuss the relative performance of each of the new map implementations and try to explain the differences.
What effect did NITEMS, load factor, and File Size have on the
performance of each hash table backend in your experiments in terms
of execution time and memory usage?
In your opinion, which of the eight map backends is the best? Justify your conclusion by examining the trade-offs for the chosen backend and based on your experimental results.
In addition to the questions, please provide a brief summary of each individual group member's contributions to the project.
Your project will be scored based on the following metrics:
| Metric | Points |
|---|---|
| map_test -b unordered passes tests | 2 |
| map_test -b chained passes tests | 4 |
| map_test -b open passes tests | 4 |
| map_bench works properly for all backends and without memory errors | 2 |
| frequencies works properly for all backends and without memory errors | 1 |
| README.md responses | 4 |
| Coding Style | 1 |
| Individual contributions | 2 |
| Total | 20 |
To test your programs, simply run make test. Your programs must behave
correctly and have no memory errors.
To submit your project, simply push your changes to your Project 04 repository. There is no need to create a Merge Request. Make sure you have granted access to your repository to all of the instructional staff.