The goal of this homework assignment is to allow you to practice manipulating
and building data structures in C. The first activity involves
implementing a map library, while the second activity requires you to use
this library to build and benchmark a freq
utility similar to uniq.
For this assignment, record your source code and any responses to the
following activities in the homework08
folder of your assignments
GitLab repository and push your work by 11:59 PM Friday, April 7, 2017.
As discussed in your data structures class, a map is a data structure that allows the user to associate keys to values. This enables users to easily lookup values by a non-numeric key such as a string in an efficient manner. Maps (aka dict in Python) are used in many applications and scripts. For instance, whenever we used JSON, we were accessing a map.
Of course, being a bare bones language, C does not have a built-in map in its standard library. Because of this, we will build our own map using a hash table. Specially, we will implement a hash table using a form of separate chaining to manage collisions.
Fundamentally, a hash table is just an array. Normal arrays in C, unfortunately, can only be indexed by an integer. As we discussed a couple of weeks ago, this is due to the fact that the name of an array is really just the base address of a contiguous chunk of memory. Therefore, an index operation is equivalent to the following:
A[i] = *(A + i) // Index i of a is the same as access the location in memory at A + i
This is fine if we only need to search our elements by integer values, since we can use the integer value as an offset into our chunk of memory. However, if we wanted to quickly lookup values by strings or doubles, we would instead have to scan linearly through our array to find the appropriate element. This, of course, is inefficient, so we need another way of storing non-integer values in a way that allows us to quickly find them again.
To work around this limitation in arrays, we can use a hash function such as FNV-1a to deterministicly compute an integer value for any sequence of binary data:
FNV-1A-Hash(data) -> Integer
Therefore, using a hash function allows us to utilize non-integer values such as strings or doubles to lookup elements in a normal array. The use of a hash function in this manner is the core concept behind the data structure know as a hash table, which can be used to implement either the set or map abstract data types.
For this activity, you will be implementing a map based on a hash table that uses separate chaining to handle collisions as shown below:
In this design, the map starts with a structure that contains the following fields:
Buckets: This is the array of Entries, each of which consists of a key, a value, and a reference to the next Entry in the chain.
Capacity: This is the size of the array of Entries.
Size: This is the number of valid Entries we are storing in our map.
Load Factor: This is our maximum ratio of Size to Capacity. That is, if Size / Capacity is ever greater than Load Factor, then we will need to resize our Buckets.
In the example in the diagram above, we have a map with an array of 8
Buckets. Each Bucket is actually a singly-linked list where we
store the Entries in our map. Because the Bucket is an actual
Entry and not just a reference to one, this means that our singly-linked
list uses a dummy or sentinel node.
For instance, the first Entry at the top has a key of 0
and a
value of "Dog". Because 0
maps to Bucket 0
we simply have the
Entry structure at Buckets[0] reference this Entry. Because
there are no other Entries that map to 0
, this top Entry's next
reference is simply NULL
.
At the bottom, we have two Entries that map to Bucket 5
. When
multiple elements map to the same Bucket, we call this a collision. In
separate chaining, we handle this situation by simply adding the additional
Entry to our singly-linked list as shown in the diagram. Here, we have
one Entry with the key 5
and the value "Log" followed by a
second Entry with the key 13
and the value "Rock". The former
has a reference to the latter to form a singly-linked list.
The remainder of this document details how you are to build a map that follows this overall design.
To help you get started, the instructor has provided you with the following starter code:
# Go to assignments repository $ cd path/to/assignments/repository # Download starter code tarball $ curl -O http://www3.nd.edu/~pbui/teaching/cse.20289.sp17/static/tar/homework08.tar.gz # Extract starter code tarball $ tar xzvf homework08.tar.gz
Once downloaded and extracted, you should see the following files in your
homework08
directory:
homework08 \_ Makefile # This is the Makefile for building all the project artifacts \_ README.md # This is the README file for recording your responses \_ entry.c # This is the C99 implementation file for the entry structure \_ fnv.c # This is the C99 implementation file for the FNV-1 hash function \_ freq.c # This is the C99 implementation file for the word frequencies tool \_ map.c # This is the C99 implementation file for the map data structure \_ map.h # This is the C99 header file for map data structure \_ measure.c # This is the C99 implementation file for the measure utility \_ test_entry.c # This is the C99 implementation file for the entry test \_ test_fnv.c # This is the C99 implementation file for the FNV-1 test \_ test_freq.sh # This is the shell script for testing the word frequencies tool \_ test_map.c # This is the C99 implementation file for the map test
The details on what you need to implement are described in the following sections.
map.h
The map.h
file is the header file for whole library. In addition to the
function prototypes, this header defines the following types:
typedef union { int64_t number; char *string; } Value;
This defines a Value
union type consisting of either a 64
-bit integer
or a string.
typedef enum { NUMBER, STRING, } Type;
This defines an enumeration that defines the constants NUMBER
and STRING
.
typedef struct Entry Entry; struct Entry { char *key; Value value; Type type; Entry *next; };
This defines a Entry
struct, which follows from our design described
above. In addition to having a key
, value
, and next
reference, the
Entry
also records the type
of its value
(ie. NUMBER
or STRING
).
This will be useful in choosing between the number
and string
fields of
the value
union.
typedef enum { KEY, VALUE, KEY_VALUE, VALUE_KEY, } DumpMode;
This defines an enumeration that allows the user to select what items in
the Entry
to dump.
typedef struct { Entry *buckets; size_t capacity; size_t size; double load_factor; } Map;
This defines the Map
struct, which follows from our design described
above.
The use of typedef
in this header file allow us to simply say Map *m
rather than struct Map *m
. While there is some
controversary
over whether or not this is good practice, we will use these type
definitions to make our code clearer and more compact.
Finally, the header file also provides a debug
macro that you can use
to insert debugging statements into your code (rather than [printf]):
debug("foo = d", foo);
The advantage of using this macro is that it displays the file name, line number, and function name where the command is called in its output, along with the formatted message you specified.
For this task, you do not need to modify this header file. Instead, you should review them and ensure you understand the provided code.
Makefile
Once again, the Makefile
contains all the rules or recipes for building the
project artifacts (e.g. libmap.a
, freq
, etc.). Although the provided
Makefile
contains most of the variable definitions and test
recipes, you
must add the appropriate rules for libmap.a
, test_entry
, test_fnv
,
test_map
, freq
, measure
, and any intermediate objects. The
dependencies for these targets are shown in the DAG below:
Note: Although we are producing a static library in the form of
libmap.a
, we will not be statically linking our executables. Instead,
we will use the libmap.a
as another object file when we link our
executables.
You must use the CC
, CFLAGS
, LD
, LDFLAGS
, AR
, and ARFLAGS
variables when appropriate in your rules.
Once you have a working Makefile
, you should be able to run the following
commands:
# Build all TARGETS $ make Compiling freq.o... Compiling entry.o... Compiling fnv.o... Compiling map.o... Linking libmap.a... Linking freq... # Run all tests $ make test Compiling test_entry.o... Linking test_entry... Testing entry... Segmentation fault ==23062== Memcheck, a memory error detector ==23062== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==23062== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info ==23062== Command: ./test_entry ... ==23062== For counts of detected and suppressed errors, rerun with: -v ==23062== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0) Compiling test_fnv.o... Linking test_fnv... Testing fnv... Aborted ==23078== Memcheck, a memory error detector ==23078== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==23078== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info ==23078== Command: ./test_fnv ... ==23078== For counts of detected and suppressed errors, rerun with: -v ==23078== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) Compiling test_map.o... Linking test_map... Testing map... Aborted ==23095== Memcheck, a memory error detector ==23095== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==23095== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info ==23095== Command: ./test_map ... ==23095== For counts of detected and suppressed errors, rerun with: -v ==23095== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) Testing freq... --- - 2017-04-02 18:59:27.526589003 -0400 +++ test_freq.output 2017-04-02 18:59:27.524651038 -0400 @@ -0,0 +1,134 @@ +54 I +52 shake ... FAILURE: Probably stayed out too late and got nothing in your brain. Shake it off! REASON: Output doesn't match # Simulate build with tracing output $ make -n echo Compiling freq.o... gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o freq.o freq.c echo Compiling entry.o... gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o entry.o entry.c echo Compiling fnv.o... gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o fnv.o fnv.c echo Compiling map.o... gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o map.o map.c echo Linking libmap.a... ar rcs libmap.a entry.o fnv.o map.o echo Linking freq... gcc -L. -o freq freq.o libmap.a
Depending on your compiler, you may see some warnings with the initial starter code. Likewise, the test programs will all fail in some fashion.
entry.c
The entry.c
file contains the C99 implementation for the Entry
structure; it rather closely resembles the node structure from
Homework 07. For this task, you will need to implement the following functions:
Entry * entry_create(const char *key, const Value value, Entry *next, Type type)
This function allocates a new
Entry
, sets itskey
,value
,next
andtype
fields, and returns the allocated structure.
Notes:
You must check if allocation fails.
You must allocate and copy the key
(consider strdup).
You must allocate and copy the value
if the type
is set to STRING
.
You may wish to use the entry_update
function to implement part of
this function (to avoid code duplication).
Entry * entry_delete(Entry *e, bool recursive)
This function deallocates the given
Entry
and any previously allocated fields such as thekey
andvalue
. Ifrecursive
istrue
, then this function will also deallocate thenext
Entry
and any of its successors.
Notes: You can use recursion for this if you wish since the singly-linked list should never be too long.
void entry_update(Entry *e, const Value value, Type type)
This function updates the
value
field of the specifiedEntry
.
Notes: Special care must be given to both the old value
type
and the
new value
type
.
If the old value
is of type
STRING
, then you must deallocate it.
If the new value
is of type
STRING
, then you must allocate and copy it.
Failure to do so will lead to memory errors.
void entry_dump(Entry *e, FILE *stream, const DumpMode mode)
This function prints the
Entry
'skey
andvalue
fields to the specifiedstream
based on the givenmode
.
Notes: The modes
are:
KEY
: Displays only the key
.KEY_VALUE
: Displays both the key
and value
separated by the tab
character.VALUE
: Displays only the value
.VALUE_KEY
: Displays both the value
and key
separated by the tab
character.As you implement entry.c
, you can test it by running the test-entry
target:
# Run test-entry target $ make test-entry Compiling test_entry.o... Compiling entry.o... Compiling fnv.o... Compiling map.o... Linking libmap.a... Linking test_entry... Testing entry... # View log $ cat test_entry.log # Run test_entry manually $ ./test_entry Test entry creation... Test entry update... Test entry deletion...
fnv.c
The fnv.c
file contains the C99 implementation for the FNV-1a hash
function. For this task, you will need to implement the following function:
uint64_t fnv_hash(const void *d, size_t n)
This function uses the FNV-1a algorithm to compute a hash on the given data
d
which consists ofn
bytes.
Notes: Remember that you can use type casting to interpret a chunk of bytes in any format you wish.
As you implement fnv.c
, you can test it by running the test-fnv
target:
# Run test-fnv target $ make test-fnv Compiling test_fnv.o... Compiling entry.o... Compiling fnv.o... Compiling map.o... Linking libmap.a... Linking test_fnv... Testing fnv... # View log $ cat test_fnv.log # Run test_fnv manually $ ./test_fnv wake me up inside = 0x7d9454fa584a0a56 123 = 0x456fc2181822c4db deadool = 0x7d517630945dadbb wolverine = 0x874d0d8868f684c8
map.c
The map.c
file contains the C99 implementation for the core map data
structure. For this task, you will need to implement the following functions:
Map * map_create(size_t capacity, double load_factor)
This function allocates a new
Map
, sets itscapacity
,size
, andload_factor
fields, allocatesbuckets
ofEntry
structures, and returns the allocated structure.
Notes:
If capacity
is 0
, then it should use the DEFAULT_CAPACITY
instead.
If load_factor
less than or equal to 0
, then it should use the
DEFAULT_LOAD_FACTOR
instead.
The buckets
field is an array of Entry
structures.
Map * map_delete(Map *m)
This function deallocates the given
Map
along with itsbuckets
ofEntry
structures.
void map_insert(Map *m, const char *key, const Value value, Type type)
This function inserts a new
Entry
into theMap
if thekey
does not exist yet. If it does, then pre-existingEntry
'svalue
is updated to the givenvalue
.
Notes:
You should check the load_factor
to determine if you need to resize
before inserting a value.
It doesn't matter where you insert the Entry
in the corresponding
singly-linked list. That is, you can either add to the front or
the back, it is up to you.
Entry * map_search(Map *m, const char *key)
This function searches the
Map
for anEntry
with the givenkey
. If found then a reference to thisEntry
is returned, otherwiseNULL.
bool map_remove(Map *m, const char *key)
This function removes the
Entry
with the givenkey
from theMap
and returns whether or not the removal was successful (ie. anEntry
with thekey
was in fact in theMap
).
Notes: Special care must be given to bridge the gap between the
previous Entry
and the next Entry
during removal.
void map_dump(Map *m, FILE *stream, const DumpMode mode)
This function dumps all the
Entries
in theMap
to the specifiedstream
using the givenmode
.
void map_resize(Map *m, size_t new_capacity)
This function resizes the
Map
to thenew_capacity
.
Notes:
You will need to allocate a new array of Entry
structures.
You will not need to copy or allocate new Entry
structures; rather,
you should simply re-link the existing Entry
structures into the
appropriate Bucket.
As you implement map.c
, you can test it by running the test-map
target:
# Run test-map target $ make test-map Compiling test_map.o... Compiling entry.o... Compiling fnv.o... Compiling map.o... Linking libmap.a... Linking test_map... Testing map... # View log $ cat test_map.log # Run test_map manually $ ./test_map Testing map creation (defaults)... Testing map insertion... ... Testing map searching... Testing map creation (custom)... Testing map insertion... ... Testing map searching... Testing map removal...
README.md
In your README.md
, respond to the following prompts:
Briefly describe what memory you allocated in entry_create
and what
memory you deallocated in entry_delete
. How did you handle the recursive
flag?
Briefly describe what memory you allocated in map_create
and what memory
you deallocated in map_delete
. How did you handle internal entries?
Briefly describe what memory you allocated, copied, and deallocated in
map_resize
. When is this function called and what exactly does it do to
grow the map?
Briefly describe how your map_insert
function works and analyze its
time complexity and space complexity in terms of Big-O notation (both
average and worst case).
Briefly describe how your map_search
function works and analyze its time
complexity and space complexity in terms of Big-O notation (both average
and worst case).
Briefly describe how your map_remove
function works and analyze its time
complexity and space complexity in terms of Big-O notation (both average
and worst case).
Once you have a working map library, you are to complete the freq.c
program:
# Display usage $ ./freq -h Usage: ./freq -f FORMAT Output format (KEY, KEY_VALUE, VALUE, VALUE_KEY) -l LOAD_FACTOR Load factor for hash table # Compute frequency of some random numbers $ shuf -i1-10 -n 10 | ./freq 1 2 1 5 1 9 1 10 1 6 1 4 1 8 1 3 1 1 1 7 # Compute frequency of some random numbers (display only keys) $ shuf -i1-10 -n 10 | ./freq -f KEY 2 5 9 10 6 4 8 3 1 7 # Measure time and memory usage for load factor of 2.0 $ shuf -i1-1000000 -n 1000000 | ./measure ./freq -f VALUE -l 2.0 > /dev/null 0.853000 seconds 93.382812 Mbytes
The freq
program reads in data from standard input one word at a time,
inserts the word into the map to keep track of its count, and then outputs
the final map using the specified FORMAT
. The LOAD_FACTOR
parameter is
used to set the load_factor
parameter of the map_create
function and
adjust how often the internal buckets
array should resize.
freq.c
The freq.c
file contains the C99 implementation of the word frequencies
tool described above.
In addition to implementing command line parsing in the main
function and you
will need to implement the freq_stream
function:
void freq_stream(FILE *stream, double load_factor, DumpMode mode)
This function reads one word at a time from the
stream
and uses a map with the specifiedload_factor
to keep count of how often each word appears in thestream
. Once thestream
is complete read, the contents of the map are dumped tostdout
using the specifiedmode
.
test_freq.sh
As you implement freq.c
, you can test it by running the test-freq
target:
# Run test-freq target $ make test-freq Compiling freq.o... Compiling entry.o... Compiling fnv.o... Compiling map.o... Linking libmap.a... Linking freq... Testing freq... # View log $ cat test_freq.log # Run test_freq.sh manually $ ./test_freq.sh
benchmark.{sh,py}
Once you have a working word frequencies utility, you are to write a benchmark script in either shell or Python that measures the amount of time and space it takes to count the following number of items:
1, 10, 100, 1000, 10000, 100000, 1000000, 10000000
Additionally, you will need to experiment with the following load factors
or ALPHA
values:
0.5, 0.75, 0.9, 1.0, 2.0, 4.0, 8.0, 16.0
To help you measure the time and space a program uses, you can utilize the measure.c program written by David Chiang (it is already included in the starter code). Once you build the program, you can use it in the following manner:
# Build measure tool $ make measure Compiling measure.o... Linking measure... # Benchmark freq with 10,000,000 random numbers and load factor of 0.9 $ shuf -i1-100000000 -n 10000000 | ./measure ./freq -l 0.9 > /dev/null 10.055000 seconds 1345.105469 Mbytes
This means that the freq
utility counted the frequencies of 10,000,000
random numbers in 10.05
seconds and used 1345
megabytes of memory.
Your script, either benchmark.sh
or benchmark.py
, should automate the
timing and measurement of these experiments to produce a markdown table looks
like the following:
| NITEMS | ALPHA | TIME | SPACE | |------------|------------|------------|------------| | 1 | 0.50 | 0.00 | 1.37 | | 1 | ... | ... | ... | | 10 | 0.50 | 0.00 | 1.33 | | 10 | ... | ... | ... | | 100 | 0.50 | 0.00 | 1.46 | | 100 | ... | ... | ... | | 1000 | 0.50 | 0.00 | 1.56 | | 1000 | ... | ... | ... | | 10000 | 0.50 | 0.01 | 3.47 | | 10000 | ... | ... | ... | | 100000 | 0.50 | 0.06 | 18.24 | | 100000 | ... | ... | ... | | 1000000 | 0.50 | 0.78 | 141.08 | | 1000000 | ... | ... | ... | | 10000000 | 0.50 | 10.01 | 2177.27 | | 10000000 | ... | ... | ... |
Note: If you decided to use shell, please update the Makefile
to use
benchmark.sh
instead of benchmark.py
for the benchmark
target.
README.md
In your README.md
, include markdown table your benchmark script
generated and then answer the following questions:
Based on your experiments, what is the overall trend in terms of time and space as the number of items increases for the different load factors?
Are the results surprising to you? Explain why or why not.
There is no such thing as a perfect data structure. Compare this hash table to the treap you are creating in your data structures class. What are the advantages and disadvantages of both? If you had to use one as the underlying data structure for a map, which one would you choose and why?
Note: If you are not currently in the data structures class, you can simply compare your hash table to any other binary search tree you are familiar with such as a red-black tree.
For extra credit, you are to install a [Linux] application to the Public
folder of your $HOME
directory. For most [open source] software, this
usually involves:
Downloading a tarball to your $HOME
directory.
Extracting the tarball.
Running the configure
script in the extracted directory. To tell the
install where to put the resulting program, you can specify the prefix
command line argument:
$ ./configure --prefix ~/Public
This will tell the installer to put the resulting files in your Public
folder. The install will follow the typical Linux filesystem hierarchy
and put executables in ~/Public/bin
, libraries in ~/Public/lib
, etc.
Building the software using make
.
Installing the software using make install
.
This standard process can be explained here.
Here are some possible packages for you to try this out on (you are free to try out something different if you wish):
To get credit, you must update your $PATH
environment variable to include
the location of your installed binaries and you must demonstrate the
installed application executing either to the instructor or one of the TAs.
To submit your assignment, please commit your work to the homework08
folder
in your assignments GitLab repository. Your homework08
folder should
only contain at the following files:
Makefile
README.md
BENCHMARK.md
benchmark.sh
or benchmark.py
entry.c
fnv.c
freq.c
map.c
map.h
measure.c
test_entry.c
test_fnv.c
test_freq.sh
test_map.c