The third project is to build a few backends for a key-value store in C++. This library must implement the following map backends which support insertion and searching operations:
Unsorted: This stores the key-value pairs in an unsorted collection.
Sorted: This stores the key-value pairs in a sorted collection.
Binary Search Tree: This stores the key-value pairs in a plain binary search tree.
Red-Black Tree: This stores the key-value pairs in a red-black tree (ie. std::map).
Treap: This stores the key-value pairs in a treap.
Once you have implemented these map backends into a library (ie.
libmap.a
), you will use this library to construct the following
applications:
map_bench
: This is a benchmarking utility that measure the time it takes
for a particular map implementation to insert and search a specific number of
items.
frequencies
: This is a word counting application that uses the map
library to count the number of times every word from standard input appears
and outputs the resulting tally to standard output.
A third application map_test
is provided to you as a basis for these two
applications and so you can check the correctness of your backends.
This project is to be done in groups of 1 - 3 and is due midnight Friday October 28, 2016.
Reference implementations of the map_bench
benchmark utility and the
frequecies
application can be found at
/afs/nd.edu/user15/pbui/pub/bin/map_bench
and
/afs/nd.edu/user15/pbui/pub/bin/frequencies
respectively.
Here is the usage message for map_bench
:
# Display usage message $ map_bench -h -b BACKEND Which Map backend (unsorted, sorted, bst, rbtree, treap) -n NITEMS Number of items to benchmark -p PADLENGTH Amount to pad the keys with leading 0's # Run with defaults (ie. -b unsorted -n 1 -p 1) $ map_bench Insert: 1.64e-05 s Search: 6.083e-06 s # Run with treap on 10000 items and pad length 8 $ map_bench -b treap -n 10000 -p 8 Insert: 0.01066 s Search: 0.0062078 s
As can be seen, map_bench
takes three optional arguments. The -b
argument
allows the user to select which map backend to utilize. The -n
argument
specifies the number of items to test. The -p
argument specifies how many
0
's should be used to pad the keys and values. By default -b
is
unsorted
, -n
is 1
and -p
is 1
.
Here is the usage message for frequencies
:
$ frequencies -h -b BACKEND Which Map backend (unsorted, sorted, bst, rbtree, treap) -d DUMPFLAG Which dump flag (key, value, key_value, value_key)
As can be seen, frequencies
takes two optional arguments. Once again, the
-b
argument allows the user to select which map backend to utilize. The
-d
argument specifies how the output should be formatted (ie. output key,
value, both key then value, or both value then key). By default -b
is
unsorted
and -d
is value_key
.
As with most Unix filters, frequencies
reads from [standand input] and
then outputs to standard output:
# Retrieve Beowulf from Project Gutenberg $ curl -L http://www.gutenberg.org/ebooks/16328.txt.utf-8 > pg16328.txt # Perform word count on Beowulf $ frequencies < pg16328.txt | sort -rn | head 2558 the 1536 of 913 to 758 and 616 in 590 The 467 his 435 a 373 he 372 with
To start this project, you must form a group of 1 - 3 people. One group member should fork the Project 03 repository on GitLab:
https://gitlab.com/nd-cse-30331-fa16/project03
Once this repository has been forked, follow the instructions from Reading 00 to:
Make the repository private
Configure access to the repository
Make sure you add all the members of the team in addition to the instructional staff.
As with the programming challenges in this course, the project repository
contains a Makefile
for automated building and testing of your program.
In addition to these files, the repository should also contain the following:
map.h
: This the map library header file that contains the declarations
for the Map
classes, a few type definitions, and some constants. You may
need to edit this file as you implement the different backends.
constants.cpp
: This contains the definition of the NONE
constant which
is a sentinel to indicate that a search failed (ie. the key
is not in the
map).
unsorted.cpp
: This contains the unsorted container implement of a map.
sorted.cpp
: This contains the sorted container implementation of a map.
bst.cpp
: This contains the binary search tree implementation of a map.
rbtree.cpp
: This contains the red-black tree implementation of a map.
map_test.cpp
: This is the testing utility provided to you to check the
correctness of your implementations.
map_bench.cpp
: This is the map benchmarking application.
frequencies.cpp
: This is the word counting application.
Further descriptions of these files and what you need to implement is described below.
The map.h
library header file contains the definition of a Entry
:
typedef std::pair<std::string, std::string> Entry;
Each Entry
in our collection is a std::pair of std::strings. As noted
above, there is a special sentinel Entry
called NONE
which is used to
indicate that a search failed:
extern const Entry NONE;
The map.h
library header file also contains the following enumeration:
typedef enum { DUMP_KEY, DUMP_VALUE, DUMP_KEY_VALUE, DUMP_VALUE_KEY, } DumpFlag;
These constants are used by the dump
methods for specifying exactly what to
output when dumping a map.
Additionally, the map.h
library header file contains the following
definition of struct Node
:
struct Node { Entry entry; int priority; Node *left; Node *right; };
As can be seen, each Node
contains an Entry
key-value pair, an integer
priority, and references to the node's children.
Although nodes in a plain binary search tree do not have a priority
, we
will use a the same struct Node
for a binary search tree as for a
treap. When creating a Node
for a binary search tree, you can simply
set the priority
to 0
or any other value you wish.
Finally, the map.h
library header file contains the following definition of
the Map
base class:
class Map { public: virtual void insert(const std::string &key, const std::string &value) {} virtual const Entry search(const std::string &key) { return NONE; } virtual void dump(std::ostream &os, DumpFlag flag) {} virtual ~Map() {} };
This is followed by the declaration of the UnsortedMap
, SortedMap
,
BSTMap
, RBTreeMap
, and TreapMap
which derive from Map
and all
implement the following three operations:
void insert(const std::string &key, const std::string &value);
Given a
key
andvalue
, this adds this key-value pair if it is not in the map already. If thekey
already exists, then thevalue
for the existing std::pair is updated to the newvalue
.
const Entry search(const std::string &key);
Given a
key
, this returns theEntry
containing thekey
and the associatedvalue
. If thekey
is not in the map, then the sentinelNONE
is returned to indicate failure.
void dump(std::ostream &os, DumpFlag flag);
Given an output stream
os
, this writes all the key-value pairs in the map toos
. The output is formatted based on theflag
argument which indicates whether to print out thekey
,value
,key and value
, orvalue and key
. For the latter two options, the two components should be separated by the tab character.
You will need to modify the UnsortedMap
, SortedMap
, BSTMap
,
RBTreeMap
, and TreapMap
classes by adding any necessary private members.
You may also need to define your own destructors for some of the map
backends.
The first backend is to implement a map using an unsorted a sequence
container such as std::vector or std::deque in unsorted.cpp
.
entries
private member to the UnsortedMap
class.The second backend is to implement a map using an sorted a sequence
container such as std::vector or std::deque in sorted.cpp
.
To speed up the search
operation, you will need to implement and utilize a
custom binary search function:
const Entry binary_search(const IT &start, const IT &end, const std::string &target);
The reason why we need a custom binary search implementation is because std::binary_search only returns whether or not the item is in in the container, but does not return the actual item itself.
You will want to add an entries
private member to the SortedMap
class.
Remember that iterators are like pointers.
The third backend is to implement a map using a binary search tree in
bst.cpp
. You will need to implement and utilize the following recursive
functions:
Node *insert_r(Node *node, const std::string &key, const std::string &value);
Given a
node
,key
, andvalue
, this inserts the key-value pair into the binary search tree.
Node *search_r(Node *node, const std::string &key);
Given a
node
andkey
, this returns the key-value pair in the binary search tree that contains thekey
, otherwiseNONE
.
void dump_r(Node *node, std::ostream &os, DumpFlag flag);
Given an output stream
os
, this writes all the key-value pairs in the binary search tree toos
. The output is formatted based on theflag
argument.
You will want to add a root
private member to the BSTMap
class.
You will want to add a destructor to your BSTMap
class.
The fourth backend is to implement a map using a red-black tree in
rbtree.cpp
. In this case, you can simply use std::map.
entries
private member to the RBTreeMap
class.The fifth backend is to implement a map using a treap in treap.cpp
.
You will need to implement and utilize the following functions:
Node *insert_r(Node *node, const std::string &key, const std::string &value);
Given a
node
,key
, andvalue
, this inserts the key-value pair into the treap.
Node *rotate_right(Node *p);
Given
p
, rotate the tree aroundp
to the right.
Node *rotate_left(Node *p);
Given
p
, rotate the tree aroundp
to the left.
int get_random();
Return a random number between
0
andINT_MAX
.
Because a treap is also a binary search tree, we do not need to
re-implement the search_r
function. Instead, we just declare the function
as extern
, which lets the compiler know we will use the definition from
another file (in this case bst.cpp
).
You will want to add a root
private member to the TreapMap
class.
You will want to add a destructor to your TreapMap
class.
For get_random
you can use the C++'s std::uniform_int_distribution
function to produce random numbers:
std::random_device rd; std::default_random_engine g(rd()); std::uniform_int_distribution<> d(1, INT_MAX); return d(g);
As described above, the map_bench
application in map_bench.cpp
is a
utility for measuring the amount of time inserting and searching using a
particular map backend takes.
The general program flow of map_bench
should be as follows:
Parse command line arguments to determine BACKEND
, NITEMS
, and
PADLENGTH
where:
BACKEND
is which map implementation to utilize.
NITEMS
is the number of items to insert and search.
PADLENGTH
is the amount of padding we should apply to each integer
key and value.
For instance, if the key is 2
and the PADLENGTH
is 4
, then the
key should be transformed to 0002
.
Perform insertions of the specified number of items (ie. NITEMS
). Each
item can simply be an integer that is padded by the specified to the
specified PADLENGTH
for both the key and value.
Measure how long this takes by saving a timestamp before the insertions are performed and after the insertions are completed.
Report the difference in these timestamps to output how long the insertions took (in seconds).
Perform searches for the specified number of items (ie. NITEMS
).
Measure how long this takes by saving a timestamp before the searches are performed and after the searches are completed.
Report the difference in these timestamps to output how long the searches took (in seconds).
Use map_test
as a basis on how to setup your map_bench
and how to
process comand line arguments.
To generate timestamps, you can use std::chrono::high_resolution_clock::now.
You may wish to implement and utilize the following function for your keys and values:
std::string int_to_key(int i, size_t padlength)
Given an integer
i
, this convertsi
into a string and pads the result to be at leastpadlength
size.
As described above, the frequencies
application in frequencies.cpp
is a
utility for counting how often every word in standard input appears.
The general program flow of frequencies
should be as follows:
Parse command line arguments to determine BACKEND
, and DUMPFLAG
where:
BACKEND
is which map implementation to utilize.
DUMPFLAG
is the flag to send to the dump
method to control what is
outputted.
Read input from standard input and for each word insert the word into the map. If the word is already there, then you should increment the count.
Once all the words have been processed, simply dump the map to standard
output with the specified DUMPFLAG
.
Use map_test
as a basis on how to setup your frequencies
and how to
process comand line arguments.
You may wish to implement and utilize the following function:
std::string increment(const std::string &s);
Given a string
s
, this convertss
to an integer, increments it, and then returns the string of the result.
When you are finished implementing your map_bench
benchmarking utility and
frequencies
application, answer the following questions in the
project03/README.md
:
What is complexity of each of the map backends (ie. Unsorted
,
Sorted
, BST
, RBTree
, Treap
)? For each implementation, briefly
explain the best, average, and worst case complexities of inserting
and searching using the particular method in terms of Big-O notation.
Using map_bench
, benchmark all five backends for when NITEMS
is:
10, 100, 1000, 10000, 100000, 1000000, 10000000
And PADLENGTH
is
1, 2, 4, 8
Record the elapsed time of each experiment in a Markdown table:
| Backend | NITEMS | PADLENGTH | Insert Time | Search Time | |-------------|-----------|-----------|---------------|---------------| | UNSORTED | 10 | 1 | 0 | 0 | | ... | ... | ... | ... | ... |
For Unsorted
, Sorted
, and BST
, as you increase NITEMS
or
PADLENGTH
, the benchmark may slow down significantly. Once you have
reached a set of parameters for these methods that take more than 10
minutes
, you can simply record that these settings took a long time and
specify INFINITY
for the elapsed time.
Using frequencies
, compute the word count on at least three texts
from Project Gutenberg using all five different backends.
Here are some good candidates from the Top 100:
Record the elapsed time of each experiment in a Markdown table:
| Backend | Text | File Size | Elasped Time | |-------------|-----------------------|-----------|---------------| | UNSORTED | Beowulf.txt | 0 | 0 | | ... | ... | ... | ... |
Hint: You can use the time
command or the measure
program from
previous assignments to measure the elapsed time.
After you have performed your benchmarks:
Discuss the relative performance of each map implementation and try to explain the differences.
What effect did NITEMS
, PADLENGTH
, and File Size
have on the
performance of each backend in your experiments?
In your opinion, which map backend is the best? Justify your conclusion by examining the trade-offs for the chosen backend and based on your experimental results.
In addition to the questions, please provide a brief summary of each individual group members contributions to the project.
Your project will be scored based on the following metrics:
Metric | Points |
---|---|
map_test -b unsorted passes tests |
2 |
map_test -b sorted passes tests |
2 |
map_test -b bst passes tests |
2 |
map_test -b rbtree passes tests |
1 |
map_test -b treap passes tests |
2 |
map_bench works properly for all backends and without memory errors |
2 |
frequencies works properly for all backends and without memory errors |
2 |
README.md responses |
4 |
Coding Style | 1 |
Individual contributions | 2 |
Total | 20 |
To test your programs, simply run make test
. Your programs must correctly
behave correctly and have no memory errors.
To submit your project, simply push your changes to your Project 03 repository. There is no need to create a Merge Request. Make sure you have enable access to your repository to all of the instructional staff.