The fourth project is a continuation of Project 03. You are to build a few more backends for a key-value store in C++. In particular, you must implement the following additional map backends, which support insertion and searching operations:
Unordered: This stores the key-value pairs using a std::unordered_map.
Chained: This stores the key-value pairs in a hash table using separate chaining.
Open: This stores the key-value pairs in a hash table using open addressing and linear probing.
Once you have finished implementing these new map backends in the
libmap.a library, you will have to update the associated map_bench and
frequencies applications to support these additional map backends.
Of course, map_test is once again provided so you can check the correctness
of all the map backends.
This project is to be done in groups of 1 - 3 and is due midnight Friday, November 11, 2016.
Updated reference implementations of the map_bench benchmark utility
and the frequencies application can be found at
/afs/nd.edu/user15/pbui/pub/bin/map_bench and
/afs/nd.edu/user15/pbui/pub/bin/frequencies respectively.
Here is the usage message for map_bench:
```
# Display usage message
$ map_bench -h
    -b BACKEND    Which Map backend (unsorted, sorted, bst, rbtree, treap, unordered, chained, open)
    -n NITEMS     Number of items to benchmark
    -p PADLENGTH  Amount to pad the keys with leading 0's

# Run with unordered backend on 1000000 items
$ map_bench -b unordered -n 1000000
Insert: 1.4775 s
Search: 0.99349 s

# Run with chained backend on 1000000 items and default load factor (0.9)
$ map_bench -b chained -n 1000000
Insert: 4.3771 s
Search: 1.0269 s

# Run with chained backend on 1000000 items and load factor of 0.8
$ map_bench -b chained-0.8 -n 1000000
Insert: 4.1273 s
Search: 0.9992 s

# Run with open backend on 1000000 items and load factor of 0.8
$ map_bench -b open-0.8 -n 1000000
Insert: 2.1962 s
Search: 0.79791 s
```
As can be seen, map_bench takes three optional arguments. The -b argument
allows the user to select which map backend to utilize. The -n argument
specifies the number of items to test. The -p argument specifies how many
0's should be used to pad the keys and values. By default -b is
unsorted, -n is 1 and -p is 1.
Note: For the chained and open backends, you can append the load
factor to use by adding -X.X to the end of the BACKEND parameter. If
none is specified, then the default load factor is used (ie. 0.9).
Note: You can mostly ignore PADLENGTH for this project and just assume the
default value of 1.
Here is the usage message for frequencies:
```
$ frequencies -h
    -b BACKEND   Which Map backend (unsorted, sorted, bst, rbtree, treap, unordered, chained, open)
    -d DUMPFLAG  Which dump flag (key, value, key_value, value_key)
```
As can be seen, frequencies takes two optional arguments. Once again, the
-b argument allows the user to select which map backend to utilize. The
-d argument specifies how the output should be formatted (ie. output key,
value, both key then value, or both value then key). By default -b is
unsorted and -d is value_key.
As with most Unix filters, frequencies reads from standard input and
then outputs to standard output:
```
# Retrieve Beowulf from Project Gutenberg
$ curl -L http://www.gutenberg.org/ebooks/16328.txt.utf-8 > pg16328.txt

# Perform word count on Beowulf using chained map
$ frequencies -b chained < pg16328.txt | sort -rn | head
2558 the
1536 of
913 to
758 and
616 in
590 The
467 his
435 a
373 he
372 with
```
To start this project, you must form a group of 1 - 3 people. One group member should fork the Project 04 repository on GitLab:
https://gitlab.com/nd-cse-30331-fa16/project04
Once this repository has been forked, follow the instructions from Reading 00 to:
Make the repository private
Configure access to the repository
Make sure you add all the members of the team in addition to the instructional staff.
As with the programming challenges in this course, the project repository
contains a Makefile for automated building and testing of your program.
In addition to these files, the repository should also contain the following:
map.h: This is the map library header file that contains the declarations
for the Map classes, a few type definitions, and some constants. You may
need to edit this file as you implement the different backends.
constants.cpp: This contains the definition of the NONE constant which
is a sentinel to indicate that a search failed (ie. the key is not in the
map). Likewise, it also contains DEFAULT_LOAD_FACTOR and
DEFAULT_TABLE_SIZE, which specify the default load factor and table sizes
for the ChainedMap and OpenMap.
unsorted.cpp: This contains the unsorted container implementation of a map.
sorted.cpp: This contains the sorted container implementation of a map.
bst.cpp: This contains the binary search tree implementation of a map.
rbtree.cpp: This contains the red-black tree implementation of a map.
unordered.cpp: This contains the std::unordered_map implementation of a
map.
chained.cpp: This contains the separate chaining implementation of a
map.
open.cpp: This contains the open addressing implementation of a
map.
map_test.cpp: This is the testing utility provided to you to check the
correctness of your implementations.
map_bench.cpp: This is the map benchmarking application.
frequencies.cpp: This is the word counting application.
Note: The solutions for the map backends from Project 03 are provided
to you, along with the solutions for map_bench and frequencies.
Further descriptions of these files and what you need to implement are provided below.
For an overview of the map.h header file, please review the information in
the Project 03 description. The following is a list of the major additions
to the header file:
Both the chained and open map implementations require explicit use
of a hash function. As discussed in class, we can instantiate a hash
function for a supported type by doing:
```
std::hash<Type> hasher;
```
To simplify your code, we have the following type definition for a string hash function:
```
typedef std::hash<std::string> StringHasher;
```
This means you can create a class variable of type StringHasher in the
ChainedMap and OpenMap classes:
```
private:
    StringHasher hfunc;
```
In addition to the NONE constant used in Project 03, we also now have
the DEFAULT_LOAD_FACTOR and DEFAULT_TABLE_SIZE constants which
define the default load factor and table sizes for the ChainedMap
and OpenMap classes:
```
extern const double DEFAULT_LOAD_FACTOR;
extern const size_t DEFAULT_TABLE_SIZE;
```
The constants.cpp file defines these values to be 0.9 for
DEFAULT_LOAD_FACTOR and 1<<10 for the DEFAULT_TABLE_SIZE.
The header file also contains three additional Map classes:
UnorderedMap, ChainedMap, and OpenMap. Each of these classes supports
the insert, search and dump methods of the base Map class.
You will need to modify the UnorderedMap, ChainedMap, and OpenMap classes by
adding any necessary private members. You may also need to define your own
constructors and destructors for some of the map backends.
The first additional backend is to implement a map using the
std::unordered_map in unordered.cpp. For this backend, you will need to add an
entries private member to the UnorderedMap class.

The second additional backend is to implement a map using a hash table
with separate chaining in chained.cpp.
In addition to the three base methods (insert, search, and dump), you
need to implement a constructor, destructor, and a resize method:
```
void resize(const size_t new_size);
```
Given a new_size, this method allocates a new table of the given new_size, copies all the entries from the old table to this new table, and then deallocates the old table.
When you grow the table in the resize method, you cannot simply copy the
entries from the old table to the new table. Instead, you must rehash each
entry and insert it into the new table.
For instance, suppose the old table was size 4 and looks like this:
```
[ 0 | 4    ]
[ 1 | 1, 5 ]
[ 2 |      ]
[ 3 | 3, 7 ]
```
If we resize the table by doubling the number of buckets to 8 and then hash
and insert each entry into the new table, we get something like this:
```
[ 0 |   ]
[ 1 | 1 ]
[ 2 |   ]
[ 3 | 3 ]
[ 4 | 4 ]
[ 5 | 5 ]
[ 6 |   ]
[ 7 | 7 ]
```
Notice that because of the change in the table size, the keys 4, 5, and
7 moved to different buckets. Moreover, after the resize, the keys are
more evenly distributed, which is one of the goals of resizing!
With some careful coding, you should be able to simply call your insert
method to perform the hashing and inserting into the new table (rather than
duplicating the code).
Although we used a std::vector<std::set<std::string>> in our class
example of a hash table using separate chaining, it is recommended that
you use an array of std::map<std::string, std::string> as your table.
Using a raw array instead of a std::vector makes resizing simpler.
The ChainedMap must be able to set its load factor and table
size during object initialization.
With careful implementation of resize, you should be able to initialize
the table by simply calling this method with the appropriate size.
You should perform a resize whenever the current load factor is
above the limit set during initialization.
To accurately compute the load factor, you will need to track how many
items are in the ChainedMap.
The third additional backend is to implement a map using a hash table
with open addressing with linear probing in open.cpp.
You will need to implement the three base methods (insert, search, and
dump), along with a constructor, destructor, resize method, and
an internal locate method:
```
size_t locate(const std::string &key);
```
Given a key, this function returns the bucket that either contains this key or is the next empty bucket.
Note: You may use the NONE constant as the sentinel value to denote
an empty bucket.
You should use an array of Entry's as the table.
The OpenMap must be able to set its load factor and table size
during object initialization.
With careful implementation of resize, you should be able to initialize
the table by simply calling this method with the appropriate size.
You should perform a resize whenever the current load factor is
above the limit set during initialization or when locate fails to find an
empty bucket.
To accurately compute the load factor, you will need to track how many
items are in the OpenMap.
As described in Project 03, the map_bench application in map_bench.cpp
is a utility for measuring how long inserting and searching take with a
particular map backend.
The solution for Project 03 is provided to you. You must update the code
to support the additional backends: unordered, chained, and open.
Note: The chained and open backends must support the ability to
specify the load factor by appending the value to the end of the backend
name:
```
# Use chained map with load factor of 0.75
$ map_bench -b chained-0.75 -n 1000

# Use open map with load factor of 0.5
$ map_bench -b open-0.5 -n 1000
```
If no load factor is specified, you should simply use
DEFAULT_LOAD_FACTOR.
As described in Project 03, the frequencies application in
frequencies.cpp is a utility for counting how often every word in standard
input appears.
The solution for Project 03 is provided to you. You must update the code
to support the additional backends: unordered, chained, and open.
Note: You do not need to support specifying the load factor in this application.
Note: This program will be used to ensure the correctness of your hash
table implementations by comparing the output of each backend with that of
the RBTreeMap.
When you are finished implementing your map_bench benchmarking utility and
frequencies application, answer the following questions in the
project04/README.md:
What is the complexity of each of the map backends (ie. Unordered,
Chained, Open)? For each implementation, briefly explain the average,
and worst case complexities of inserting and searching using the
particular method in terms of Big-O notation.
Using map_bench, benchmark all eight map backends for when NITEMS
is:
10, 100, 1000, 10000, 100000, 1000000, 10000000
For the chained backend, you should test the following load factors:
0.5, 0.75, 1.0, 5.0, 10.0, 100.0
For the open backend, you should test the following load factors:
0.5, 0.6, 0.7, 0.8, 0.9, 1.0
Record the elapsed time for the insert and search benchmarks of each
experiment in a Markdown table as demonstrated below:
| BACKEND       | NITEMS   | INSERT  | SEARCH  |
|---------------|----------|---------|---------|
| unsorted      | 10       | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 100      | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 1000     | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 10000    | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 100000   | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 1000000  | 0.0     | 0.0     |
| ...           | ...      | ...     | ...     |
| unsorted      | 10000000 | inf     | inf     |
| sorted        | 10000000 | inf     | inf     |
| bst           | 10000000 | 9.3875  | 9.2954  |
| rbtree        | 10000000 | 14.6810 | 9.7553  |
| treap         | 10000000 | 10.0640 | 10.2470 |
| unordered     | 10000000 | 9.1063  | 5.3702  |
| chained-0.500 | 10000000 | 23.1980 | 5.3952  |
| chained-0.750 | 10000000 | 18.8790 | 5.2926  |
| chained-1.000 | 10000000 | 23.0340 | 5.5527  |
| chained-5.000 | 10000000 | 20.9280 | 7.5628  |
| chained-10.00 | 10000000 | 23.1560 | 8.5264  |
| chained-100.0 | 10000000 | 38.1510 | 13.4530 |
| open-0.5      | 10000000 | 10.0810 | 3.3071  |
| open-0.6      | 10000000 | 8.0291  | 4.0538  |
| open-0.7      | 10000000 | 9.4615  | 4.1696  |
| open-0.8      | 10000000 | 11.2160 | 4.2335  |
| open-0.9      | 10000000 | 16.1610 | 4.9198  |
| open-1.0      | 10000000 | inf     | inf     |
Note: A full table of example results can be seen here.
For Unsorted and Sorted, as you increase NITEMS, the benchmark may
slow down significantly. Once you have reached a set of parameters for
these methods that take more than 60 seconds, you can simply record
that these settings took a long time and specify INFINITY for the
elapsed time.
Using frequencies, compute the word count on at least three texts
from Project Gutenberg using all eight different backends.
Record the elapsed time and memory usage of each experiment in a Markdown table:
| Backend  | Text        | File Size | Elapsed Time | Memory |
|----------|-------------|-----------|--------------|--------|
| Unsorted | Beowulf.txt | 0         | 0            |        |
| ...      | ...         | ...       | ...          | ...    |
Note: Refer to Project 03 for a list of texts you can use.
Hint: Use the the measure program from previous assignments to
measure the elapsed time and memory usage.
After you have performed your benchmarks:
Discuss the relative performance of each of the new map implementations and try to explain the differences.
What effect did NITEMS, load factor, and File Size have on the
performance of each hash table backend in your experiments in terms
of execution time and memory usage?
In your opinion, which of the eight map backends is the best? Justify your conclusion by examining the trade-offs for the chosen backend and based on your experimental results.
In addition to the questions, please provide a brief summary of each individual group member's contributions to the project.
Your project will be scored based on the following metrics:
| Metric | Points |
|---|---|
| map_test -b unordered passes tests | 2 |
| map_test -b chained passes tests | 4 |
| map_test -b open passes tests | 4 |
| map_bench works properly for all backends and without memory errors | 2 |
| frequencies works properly for all backends and without memory errors | 1 |
| README.md responses | 4 |
| Coding Style | 1 |
| Individual contributions | 2 |
| Total | 20 |
To test your programs, simply run make test. Your programs must behave
correctly and have no memory errors.
To submit your project, simply push your changes to your Project 04 repository. There is no need to create a Merge Request. Make sure you have granted access to your repository to all of the instructional staff.