The fourth project is a continuation of Project 03. You are to build a few more backends for a key-value store in C++. In particular, you must implement the following additional map backends which support insertion and searching operations:
Unordered: This stores the key-value pairs using a std::unordered_map.
Chained: This stores the key-value pairs in a hash table using separate chaining.
Open: This stores the key-value pairs in a hash table using open addressing and linear probing.
Once you have finished implementing these new map backends in the libmap.a library, you will have to update the associated map_bench and frequencies applications to support these additional map backends. Of course, map_test is once again provided so you can check the correctness of all the map backends.
This project is to be done in groups of 1 - 3 and is due at midnight on Friday, November 11, 2016.
Updated reference implementations of the map_bench benchmark utility and the frequencies application can be found at /afs/nd.edu/user15/pbui/pub/bin/map_bench and /afs/nd.edu/user15/pbui/pub/bin/frequencies respectively.
Here is the usage message for map_bench:

```
# Display usage message
$ map_bench -h
-b BACKEND      Which Map backend (unsorted, sorted, bst, rbtree, treap, unordered, chained, open)
-n NITEMS       Number of items to benchmark
-p PADLENGTH    Amount to pad the keys with leading 0's

# Run with unordered backend on 1000000 items
$ map_bench -b unordered -n 1000000
Insert: 1.4775 s
Search: 0.99349 s

# Run with chained backend on 1000000 items and default load factor (0.9)
$ map_bench -b chained -n 1000000
Insert: 4.3771 s
Search: 1.0269 s

# Run with chained backend on 1000000 items and load factor of 0.8
$ map_bench -b chained-0.8 -n 1000000
Insert: 4.1273 s
Search: 0.9992 s

# Run with open backend on 1000000 items and load factor of 0.8
$ map_bench -b open-0.8 -n 1000000
Insert: 2.1962 s
Search: 0.79791 s
```
As can be seen, map_bench takes three optional arguments. The -b argument allows the user to select which map backend to utilize. The -n argument specifies the number of items to test. The -p argument specifies how many 0's should be used to pad the keys and values. By default, -b is unsorted, -n is 1, and -p is 1.
Note: For the chained and open backends, you can append the load factor to use by adding -X.X to the end of the BACKEND parameter. If none is specified, then the default load factor is used (i.e. 0.9).
Note: You can mostly ignore PADLENGTH for this project and just assume the default value of 1.
Here is the usage message for frequencies:

```
$ frequencies -h
-b BACKEND      Which Map backend (unsorted, sorted, bst, rbtree, treap, unordered, chained, open)
-d DUMPFLAG     Which dump flag (key, value, key_value, value_key)
```
As can be seen, frequencies takes two optional arguments. Once again, the -b argument allows the user to select which map backend to utilize. The -d argument specifies how the output should be formatted (i.e. output key, value, both key then value, or both value then key). By default, -b is unsorted and -d is value_key.
As with most Unix filters, frequencies reads from standard input and then outputs to standard output:

```
# Retrieve Beowulf from Project Gutenberg
$ curl -L http://www.gutenberg.org/ebooks/16328.txt.utf-8 > pg16328.txt

# Perform word count on Beowulf using chained map
$ frequencies -b chained < pg16328.txt | sort -rn | head
2558 the
1536 of
913 to
758 and
616 in
590 The
467 his
435 a
373 he
372 with
```
To start this project, you must form a group of 1 - 3 people. One group member should fork the Project 04 repository on GitLab:
https://gitlab.com/nd-cse-30331-fa16/project04
Once this repository has been forked, follow the instructions from Reading 00 to:
Make the repository private
Configure access to the repository
Make sure you add all the members of the team in addition to the instructional staff.
As with the programming challenges in this course, the project repository contains a Makefile for automated building and testing of your program.
In addition to these files, the repository should also contain the following:
map.h: This is the map library header file that contains the declarations for the Map classes, a few type definitions, and some constants. You may need to edit this file as you implement the different backends.
constants.cpp: This contains the definition of the NONE constant, which is a sentinel to indicate that a search failed (i.e. the key is not in the map). Likewise, it also contains DEFAULT_LOAD_FACTOR and DEFAULT_TABLE_SIZE, which specify the default load factor and table size for the ChainedMap and OpenMap.
unsorted.cpp: This contains the unsorted container implementation of a map.
sorted.cpp: This contains the sorted container implementation of a map.
bst.cpp: This contains the binary search tree implementation of a map.
rbtree.cpp: This contains the red-black tree implementation of a map.
unordered.cpp: This contains the std::unordered_map implementation of a map.
chained.cpp: This contains the separate chaining implementation of a map.
open.cpp: This contains the open addressing implementation of a map.
map_test.cpp: This is the testing utility provided to you to check the correctness of your implementations.
map_bench.cpp: This is the map benchmarking application.
frequencies.cpp: This is the word counting application.
Note: The solutions for the map backends from Project 03 are provided to you, along with the solutions for map_bench and frequencies.
Further descriptions of these files and what you need to implement are provided below.
For an overview of the map.h header file, please review the information in the Project 03 description. The following is a list of the major additions to the header file:

Both the chained and open map implementations require explicit use of a hash function. As discussed in class, we can instantiate a hash function for a supported type by doing:

```cpp
std::hash<Type> hasher;
```
To simplify your code, we have the following type definition for a string hash function:

```cpp
typedef std::hash<std::string> StringHasher;
```

This means you can create a class variable of type StringHasher in the ChainedMap and OpenMap classes:

```cpp
private:
    StringHasher hfunc;
```
In addition to the NONE constant used in Project 03, we also now have the DEFAULT_LOAD_FACTOR and DEFAULT_TABLE_SIZE constants, which define the default load factor and table sizes for the ChainedMap and OpenMap classes:

```cpp
extern const double DEFAULT_LOAD_FACTOR;
extern const size_t DEFAULT_TABLE_SIZE;
```
The constants.cpp file defines these values to be 0.9 for DEFAULT_LOAD_FACTOR and 1<<10 for DEFAULT_TABLE_SIZE.
The header file also contains three additional Map classes: UnorderedMap, ChainedMap, and OpenMap. Each of these classes supports the insert, search, and dump methods of the base Map class.
You will need to modify the UnorderedMap, ChainedMap, and OpenMap classes by adding any necessary private members. You may also need to define your own constructors and destructors for some of the map backends.
The first additional backend is to implement a map using the std::unordered_map in unordered.cpp. For this backend, you simply need to add an entries private member to the UnorderedMap class.

The second additional backend is to implement a map using a hash table with separate chaining in chained.cpp.
In addition to the three base methods (insert, search, and dump), you need to implement a constructor, destructor, and a resize method:

```cpp
void resize(const size_t new_size);
```
Given a new_size, this method allocates a new table of the given new_size, copies all the entries from the old table to this new table, and then deallocates the old table.
When you grow the table in the resize method, you cannot simply copy the entries from the old table to the new table. Instead, you must rehash each entry and insert it into the new table.
For instance, suppose the old table was size 4 and looks like this:

```
[ 0 | 4 ]
[ 1 | 1, 5 ]
[ 2 | ]
[ 3 | 3, 7 ]
```

If we resize the table by doubling the number of buckets to 8 and then hash and insert each entry into the new table, we get something like this:

```
[ 0 | ]
[ 1 | 1 ]
[ 2 | ]
[ 3 | 3 ]
[ 4 | 4 ]
[ 5 | 5 ]
[ 6 | ]
[ 7 | 7 ]
```
Notice that because of the change in the table size, the keys 4, 5, and 7 moved to different buckets. Moreover, after the resize, the keys are more evenly distributed, which is one of the goals of resizing!
With some careful coding, you should be able to simply call your insert method to perform the hashing and inserting into the new table (rather than duplicating the code).
Although we used a std::vector<std::set<std::string>> in our class example of a hash table using separate chaining, it is recommended that you use an array of std::map<std::string, std::string> as your table. Using a raw array instead of a std::vector makes resizing simpler.
The ChainedMap must be able to set its load factor and table size during object initialization.

With a careful implementation of resize, you should be able to initialize the table by simply calling this method with the appropriate size.

You should perform a resize whenever the current load factor is above the limit set during initialization.

To accurately compute the load factor, you will need to track how many items are in the ChainedMap.
The third additional backend is to implement a map using a hash table with open addressing and linear probing in open.cpp.

You will need to implement the three base methods (insert, search, and dump), along with a constructor, destructor, resize method, and an internal locate method:

```cpp
size_t locate(const std::string &key);
```
Given a key, this function returns the bucket that either contains this key or is the next empty bucket.
Note: You may use the NONE constant as the sentinel value to denote an empty bucket.
You should use an array of Entry's as the table.
The OpenMap must be able to set its load factor and table size during object initialization.

With a careful implementation of resize, you should be able to initialize the table by simply calling this method with the appropriate size.

You should perform a resize whenever the current load factor is above the limit set during initialization or when locate fails to find an empty bucket.
To accurately compute the load factor, you will need to track how many items are in the OpenMap.
As described in Project 03, the map_bench application in map_bench.cpp is a utility for measuring the amount of time inserting and searching with a particular map backend takes.

The solution for Project 03 is provided to you. You must update the code to support the additional backends: unordered, chained, and open.
Note: The chained and open backends must support the ability to specify the load factor by appending the value to the end of the backend name:

```
# Use chained map with load factor of 0.75
$ map_bench -b chained-0.75 -n 1000

# Use open map with load factor of 0.5
$ map_bench -b open-0.5 -n 1000
```
If no load factor is specified, you should simply use DEFAULT_LOAD_FACTOR.
As described in Project 03, the frequencies application in frequencies.cpp is a utility for counting how often every word in standard input appears.

The solution for Project 03 is provided to you. You must update the code to support the additional backends: unordered, chained, and open.
Note: You do not need to support specifying the load factor in this application.
Note: This program will be used to ensure the correctness of your hash table implementations by comparing the output of each backend with that of the RBTreeMap.
When you are finished implementing your map_bench benchmarking utility and frequencies application, answer the following questions in the project04/README.md:
What is the complexity of each of the map backends (i.e. Unordered, Chained, Open)? For each implementation, briefly explain the average and worst case complexities of inserting and searching using the particular method in terms of Big-O notation.
Using map_bench, benchmark all eight map backends for when NITEMS is: 10, 100, 1000, 10000, 100000, 1000000, 10000000.
For the chained backend, you should test the following load factors: 0.5, 0.75, 1.0, 5.0, 10.0, 100.0.

For the open backend, you should test the following load factors: 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.
Record the elapsed time for the insert and search benchmarks of each experiment in a Markdown table as demonstrated below:
| BACKEND       | NITEMS   | INSERT  | SEARCH  |
|---------------|----------|---------|---------|
| unsorted      | 10       | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 100      | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 1000     | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 10000    | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 100000   | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 1000000  | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 10000000 | inf     | inf     |
| sorted        | 10000000 | inf     | inf     |
| bst           | 10000000 | 9.3875  | 9.2954  |
| rbtree        | 10000000 | 14.6810 | 9.7553  |
| treap         | 10000000 | 10.0640 | 10.2470 |
| unordered     | 10000000 | 9.1063  | 5.3702  |
| chained-0.500 | 10000000 | 23.1980 | 5.3952  |
| chained-0.750 | 10000000 | 18.8790 | 5.2926  |
| chained-1.000 | 10000000 | 23.0340 | 5.5527  |
| chained-5.000 | 10000000 | 20.9280 | 7.5628  |
| chained-10.00 | 10000000 | 23.1560 | 8.5264  |
| chained-100.0 | 10000000 | 38.1510 | 13.4530 |
| open-0.5      | 10000000 | 10.0810 | 3.3071  |
| open-0.6      | 10000000 | 8.0291  | 4.0538  |
| open-0.7      | 10000000 | 9.4615  | 4.1696  |
| open-0.8      | 10000000 | 11.2160 | 4.2335  |
| open-0.9      | 10000000 | 16.1610 | 4.9198  |
| open-1.0      | 10000000 | inf     | inf     |
Note: A full table of example results can be seen here.
For Unsorted and Sorted, as you increase NITEMS, the benchmark may slow down significantly. Once you have reached a set of parameters for these methods that takes more than 60 seconds, you can simply record that these settings took a long time and specify INFINITY for the elapsed time.
Using frequencies, compute the word count on at least three texts from Project Gutenberg using all eight different backends.
Record the elapsed time and memory usage of each experiment in a Markdown table:

| Backend  | Text        | File Size | Elapsed Time | Memory |
|----------|-------------|-----------|--------------|--------|
| Unsorted | Beowulf.txt | 0         | 0            |        |
| ...      | ...         | ...       | ...          | ...    |
Note: Refer to Project 03 for a list of texts you can use.
Hint: Use the measure program from previous assignments to measure the elapsed time and memory usage.
After you have performed your benchmarks:
Discuss the relative performance of each of the new map implementations and try to explain the differences.
What effect did NITEMS, load factor, and File Size have on the performance of each hash table backend in your experiments in terms of execution time and memory usage?
In your opinion, which of the eight map backends is the best? Justify your conclusion by examining the trade-offs for the chosen backend and based on your experimental results.
In addition to the questions, please provide a brief summary of each individual group member's contributions to the project.
Your project will be scored based on the following metrics:
| Metric | Points |
|--------|--------|
| map_test -b unordered passes tests | 2 |
| map_test -b chained passes tests | 4 |
| map_test -b open passes tests | 4 |
| map_bench works properly for all backends and without memory errors | 2 |
| frequencies works properly for all backends and without memory errors | 1 |
| README.md responses | 4 |
| Coding Style | 1 |
| Individual contributions | 2 |
| Total | 20 |
To test your programs, simply run make test. Your programs must behave correctly and have no memory errors.
To submit your project, simply push your changes to your Project 04 repository. There is no need to create a Merge Request. Make sure you have enabled access to your repository for all of the instructional staff.