Project 03: Key-Value Store I

Overview

The third project is to build a few backends for a key-value store in C++. This library must implement the following map backends which support insertion and searching operations:

Unsorted: This stores the key-value pairs in an unsorted collection.
Sorted: This stores the key-value pairs in a sorted collection.
Binary Search Tree: This stores the key-value pairs in a plain binary search tree.
Red-Black Tree: This stores the key-value pairs in a red-black tree (ie. std::map).
Treap: This stores the key-value pairs in a treap.

Once you have implemented these map backends into a library (ie. libmap.a), you will use this library to construct the following applications:

map_bench: This is a benchmarking utility that measure the time it takes for a particular map implementation to insert and search a specific number of items.
frequencies: This is a word counting application that uses the map library to count the number of times every word from standard input appears and outputs the resulting tally to standard output.

A third application map_test is provided to you as a basis for these two applications and so you can check the correctness of your backends.

This project is to be done in groups of 1 - 3 and is due midnight Friday October 28, 2016.

Reference

Reference implementations of the map_bench benchmark utility and the frequecies application can be found at /afs/nd.edu/user15/pbui/pub/bin/map_bench and /afs/nd.edu/user15/pbui/pub/bin/frequencies respectively.

Here is the usage message for map_bench:

# Display usage message
$ map_bench -h
    -b BACKEND    Which Map backend (unsorted, sorted, bst, rbtree, treap)
    -n NITEMS     Number of items to benchmark
    -p PADLENGTH  Amount to pad the keys with leading 0's

# Run with defaults (ie. -b unsorted -n 1 -p 1)
$ map_bench
Insert: 1.64e-05 s
Search: 6.083e-06 s

# Run with treap on 10000 items and pad length 8
$ map_bench -b treap -n 10000 -p 8
Insert: 0.01066 s
Search: 0.0062078 s

As can be seen, map_bench takes three optional arguments. The -b argument allows the user to select which map backend to utilize. The -n argument specifies the number of items to test. The -p argument specifies how many 0's should be used to pad the keys and values. By default -b is unsorted, -n is 1 and -p is 1.

Here is the usage message for frequencies:

$ frequencies -h
    -b BACKEND    Which Map backend (unsorted, sorted, bst, rbtree, treap)
    -d DUMPFLAG   Which dump flag (key, value, key_value, value_key)

As can be seen, frequencies takes two optional arguments. Once again, the -b argument allows the user to select which map backend to utilize. The -d argument specifies how the output should be formatted (ie. output key, value, both key then value, or both value then key). By default -b is unsorted and -d is value_key.

As with most Unix filters, frequencies reads from [standand input] and then outputs to standard output:

# Retrieve Beowulf from Project Gutenberg
$ curl -L http://www.gutenberg.org/ebooks/16328.txt.utf-8 > pg16328.txt

# Perform word count on Beowulf
$ frequencies < pg16328.txt | sort -rn | head
2558    the
1536    of
913     to
758     and
616     in
590     The
467     his
435     a
373     he
372     with

GitLab

To start this project, you must form a group of 1 - 3 people. One group member should fork the Project 03 repository on GitLab:

https://gitlab.com/nd-cse-30331-fa16/project03

Once this repository has been forked, follow the instructions from Reading 00 to:

Make the repository private
Configure access to the repository

Make sure you add all the members of the team in addition to the instructional staff.

As with the programming challenges in this course, the project repository contains a Makefile for automated building and testing of your program.

In addition to these files, the repository should also contain the following:

map.h: This the map library header file that contains the declarations for the Map classes, a few type definitions, and some constants. You may need to edit this file as you implement the different backends.
constants.cpp: This contains the definition of the NONE constant which is a sentinel to indicate that a search failed (ie. the key is not in the map).
unsorted.cpp: This contains the unsorted container implement of a map.
sorted.cpp: This contains the sorted container implementation of a map.
bst.cpp: This contains the binary search tree implementation of a map.
rbtree.cpp: This contains the red-black tree implementation of a map.
treap.cpp: This contains the treap implementation of a map.
map_test.cpp: This is the testing utility provided to you to check the correctness of your implementations.
map_bench.cpp: This is the map benchmarking application.
frequencies.cpp: This is the word counting application.

Further descriptions of these files and what you need to implement is described below.

Map Library

The map.h library header file contains the definition of a Entry:

typedef std::pair<std::string, std::string> Entry;

Each Entry in our collection is a std::pair of std::strings. As noted above, there is a special sentinel Entry called NONE which is used to indicate that a search failed:

extern const Entry NONE;

The map.h library header file also contains the following enumeration:

typedef enum {
    DUMP_KEY,
    DUMP_VALUE,
    DUMP_KEY_VALUE,
    DUMP_VALUE_KEY,
} DumpFlag;

These constants are used by the dump methods for specifying exactly what to output when dumping a map.

Additionally, the map.h library header file contains the following definition of struct Node:

struct Node {
    Entry     entry;
    int       priority;
    Node     *left;
    Node     *right;
};

As can be seen, each Node contains an Entry key-value pair, an integer priority, and references to the node's children.

One Node To Rule Them All

Although nodes in a plain binary search tree do not have a priority, we will use a the same struct Node for a binary search tree as for a treap. When creating a Node for a binary search tree, you can simply set the priority to 0 or any other value you wish.

Finally, the map.h library header file contains the following definition of the Map base class:

class Map {
public:
    virtual void            insert(const std::string &key, const std::string &value) {}
    virtual const Entry     search(const std::string &key) { return NONE; }
    virtual void            dump(std::ostream &os, DumpFlag flag) {}
    virtual                ~Map() {}
};

This is followed by the declaration of the UnsortedMap, SortedMap, BSTMap, RBTreeMap, and TreapMap which derive from Map and all implement the following three operations:

void            insert(const std::string &key, const std::string &value);

Given a key and value, this adds this key-value pair if it is not in the map already. If the key already exists, then the value for the existing std::pair is updated to the new value.

const Entry     search(const std::string &key);

Given a key, this returns the Entry containing the key and the associated value. If the key is not in the map, then the sentinel NONE is returned to indicate failure.

void            dump(std::ostream &os, DumpFlag flag);

Given an output stream os, this writes all the key-value pairs in the map to os. The output is formatted based on the flag argument which indicates whether to print out the key, value, key and value, or value and key. For the latter two options, the two components should be separated by the tab character.

Modify Map Class Definition

You will need to modify the UnsortedMap, SortedMap, BSTMap, RBTreeMap, and TreapMap classes by adding any necessary private members. You may also need to define your own destructors for some of the map backends.

Unsorted

The first backend is to implement a map using an unsorted a sequence container such as std::vector or std::deque in unsorted.cpp.

Hints

You will want to add an entries private member to the UnsortedMap class.

Sorted

The second backend is to implement a map using an sorted a sequence container such as std::vector or std::deque in sorted.cpp.

To speed up the search operation, you will need to implement and utilize a custom binary search function:

const Entry   binary_search(const IT &start, const IT &end, const std::string &target);

The reason why we need a custom binary search implementation is because std::binary_search only returns whether or not the item is in in the container, but does not return the actual item itself.

Hints

You will want to add an entries private member to the SortedMap class.
Remember that iterators are like pointers.

Binary Search Tree

The third backend is to implement a map using a binary search tree in bst.cpp. You will need to implement and utilize the following recursive functions:

Node *insert_r(Node *node, const std::string &key, const std::string &value);

Given a node, key, and value, this inserts the key-value pair into the binary search tree.

Node *search_r(Node *node, const std::string &key);

Given a node and key, this returns the key-value pair in the binary search tree that contains the key, otherwise NONE.

void    dump_r(Node *node, std::ostream &os, DumpFlag flag);

Given an output stream os, this writes all the key-value pairs in the binary search tree to os. The output is formatted based on the flag argument.

Hints

You will want to add a root private member to the BSTMap class.
You will want to add a destructor to your BSTMap class.

Red-Black Tree

The fourth backend is to implement a map using a red-black tree in rbtree.cpp. In this case, you can simply use std::map.

Hints

You will want to add an entries private member to the RBTreeMap class.

Treap

The fifth backend is to implement a map using a treap in treap.cpp. You will need to implement and utilize the following functions:

Node *insert_r(Node *node, const std::string &key, const std::string &value);

Given a node, key, and value, this inserts the key-value pair into the treap.

Node *rotate_right(Node *p);

Given p, rotate the tree around p to the right.

Node *rotate_left(Node *p);

Given p, rotate the tree around p to the left.

int get_random();

Return a random number between 0 and INT_MAX.

A Search For All Trees

Because a treap is also a binary search tree, we do not need to re-implement the search_r function. Instead, we just declare the function as extern, which lets the compiler know we will use the definition from another file (in this case bst.cpp).

Hints

You will want to add a root private member to the TreapMap class.
You will want to add a destructor to your TreapMap class.

For get_random you can use the C++'s std::uniform_int_distribution function to produce random numbers:

std::random_device               rd;
std::default_random_engine       g(rd());
std::uniform_int_distribution<>  d(1, INT_MAX);
return d(g);

Map Bench

As described above, the map_bench application in map_bench.cpp is a utility for measuring the amount of time inserting and searching using a particular map backend takes.

The general program flow of map_bench should be as follows:

Parse command line arguments to determine BACKEND, NITEMS, and PADLENGTH where:
- BACKEND is which map implementation to utilize.
- NITEMS is the number of items to insert and search.
- PADLENGTH is the amount of padding we should apply to each integer key and value.
  
  For instance, if the key is 2 and the PADLENGTH is 4, then the key should be transformed to 0002.
Perform insertions of the specified number of items (ie. NITEMS). Each item can simply be an integer that is padded by the specified to the specified PADLENGTH for both the key and value.

Measure how long this takes by saving a timestamp before the insertions are performed and after the insertions are completed.

Report the difference in these timestamps to output how long the insertions took (in seconds).
Perform searches for the specified number of items (ie. NITEMS).

Measure how long this takes by saving a timestamp before the searches are performed and after the searches are completed.

Report the difference in these timestamps to output how long the searches took (in seconds).

Hints

Use map_test as a basis on how to setup your map_bench and how to process comand line arguments.
To generate timestamps, you can use std::chrono::high_resolution_clock::now.
You may wish to implement and utilize the following function for your keys and values:
```
std::string int_to_key(int i, size_t padlength)
```
Given an integer i, this converts i into a string and pads the result to be at least padlength size.

Frequencies

As described above, the frequencies application in frequencies.cpp is a utility for counting how often every word in standard input appears.

The general program flow of frequencies should be as follows:

Parse command line arguments to determine BACKEND, and DUMPFLAG where:
- BACKEND is which map implementation to utilize.
- DUMPFLAG is the flag to send to the dump method to control what is outputted.
Read input from standard input and for each word insert the word into the map. If the word is already there, then you should increment the count.
Once all the words have been processed, simply dump the map to standard output with the specified DUMPFLAG.

Hints

Use map_test as a basis on how to setup your frequencies and how to process comand line arguments.
You may wish to implement and utilize the following function:
```
std::string increment(const std::string &s);
```
Given a string s, this converts s to an integer, increments it, and then returns the string of the result.

Questions

When you are finished implementing your map_bench benchmarking utility and frequencies application, answer the following questions in the project03/README.md:

What is complexity of each of the map backends (ie. Unsorted, Sorted, BST, RBTree, Treap)? For each implementation, briefly explain the best, average, and worst case complexities of inserting and searching using the particular method in terms of Big-O notation.
Using map_bench, benchmark all five backends for when NITEMS is:
```
10, 100, 1000, 10000, 100000, 1000000, 10000000
```
And PADLENGTH is
```
1, 2, 4, 8
```
Record the elapsed time of each experiment in a Markdown table:
```
| Backend     | NITEMS    | PADLENGTH | Insert Time   | Search Time   |
|-------------|-----------|-----------|---------------|---------------|
| UNSORTED    | 10        | 1         | 0             | 0             |
| ...         | ...       | ...       | ...           | ...           |
```
This Might Take A While

For Unsorted, Sorted, and BST, as you increase NITEMS or PADLENGTH, the benchmark may slow down significantly. Once you have reached a set of parameters for these methods that take more than 10 minutes, you can simply record that these settings took a long time and specify INFINITY for the elapsed time.
Using frequencies, compute the word count on at least three texts from Project Gutenberg using all five different backends.

Here are some good candidates from the Top 100:
Record the elapsed time of each experiment in a Markdown table:
```
| Backend     | Text                  | File Size | Elasped Time  |
|-------------|-----------------------|-----------|---------------|
| UNSORTED    | Beowulf.txt           | 0         | 0             |
| ...         | ...                   | ...       | ...           |
```
Hint: You can use the time command or the measure program from previous assignments to measure the elapsed time.
After you have performed your benchmarks:
- Discuss the relative performance of each map implementation and try to explain the differences.
- What effect did NITEMS, PADLENGTH, and File Size have on the performance of each backend in your experiments?
- In your opinion, which map backend is the best? Justify your conclusion by examining the trade-offs for the chosen backend and based on your experimental results.

In addition to the questions, please provide a brief summary of each individual group members contributions to the project.

Rubric

Your project will be scored based on the following metrics:

Metric	Points
`map_test -b unsorted` passes tests	2
`map_test -b sorted` passes tests	2
`map_test -b bst` passes tests	2
`map_test -b rbtree` passes tests	1
`map_test -b treap` passes tests	2
`map_bench` works properly for all backends and without memory errors	2
`frequencies` works properly for all backends and without memory errors	2
`README.md` responses	4
Coding Style	1
Individual contributions	2
Total	20

To test your programs, simply run make test. Your programs must correctly behave correctly and have no memory errors.

Submission

To submit your project, simply push your changes to your Project 03 repository. There is no need to create a Merge Request. Make sure you have enable access to your repository to all of the instructional staff.

Project 03: Key-Value Store I

Overview

Reference

GitLab

Map Library

One Node To Rule Them All

Modify Map Class Definition

Unsorted

Hints

Sorted

Hints

Binary Search Tree

Hints

Red-Black Tree

Hints

Treap

A Search For All Trees

Hints

Map Bench

Hints

Frequencies

Hints

Questions

This Might Take A While

Rubric

Submission