Homework 08: Data Representation

The goal of this homework assignment is to allow you to practice manipulating and building data structures in C. The first activity involves implementing a map library, while the second activity requires you to use this library to build and benchmark a freq utility similar to uniq.

For this assignment, record your source code and any responses to the following activities in the homework08 folder of your assignments GitLab repository and push your work by 11:59 PM Friday, April 7, 2017.

Activity 1: Map Library (12 Points)

As discussed in your data structures class, a map is a data structure that allows the user to associate keys to values. This enables users to easily lookup values by a non-numeric key such as a string in an efficient manner. Maps (aka dict in Python) are used in many applications and scripts. For instance, whenever we used JSON, we were accessing a map.

Of course, being a bare bones language, C does not have a built-in map in its standard library. Because of this, we will build our own map using a hash table. Specially, we will implement a hash table using a form of separate chaining to manage collisions.

Separate Chaining Hash Table

Fundamentally, a hash table is just an array. Normal arrays in C, unfortunately, can only be indexed by an integer. As we discussed a couple of weeks ago, this is due to the fact that the name of an array is really just the base address of a contiguous chunk of memory. Therefore, an index operation is equivalent to the following:

A[i] = *(A + i) // Index i of a is the same as access the location in memory at A + i

This is fine if we only need to search our elements by integer values, since we can use the integer value as an offset into our chunk of memory. However, if we wanted to quickly lookup values by strings or doubles, we would instead have to scan linearly through our array to find the appropriate element. This, of course, is inefficient, so we need another way of storing non-integer values in a way that allows us to quickly find them again.

To work around this limitation in arrays, we can use a hash function such as FNV-1a to deterministicly compute an integer value for any sequence of binary data:

FNV-1A-Hash(data) -> Integer

Therefore, using a hash function allows us to utilize non-integer values such as strings or doubles to lookup elements in a normal array. The use of a hash function in this manner is the core concept behind the data structure know as a hash table, which can be used to implement either the set or map abstract data types.

For this activity, you will be implementing a map based on a hash table that uses separate chaining to handle collisions as shown below:

In this design, the map starts with a structure that contains the following fields:

Buckets: This is the array of Entries, each of which consists of a key, a value, and a reference to the next Entry in the chain.
Capacity: This is the size of the array of Entries.
Size: This is the number of valid Entries we are storing in our map.
Load Factor: This is our maximum ratio of Size to Capacity. That is, if Size / Capacity is ever greater than Load Factor, then we will need to resize our Buckets.

In the example in the diagram above, we have a map with an array of 8 Buckets. Each Bucket is actually a singly-linked list where we store the Entries in our map. Because the Bucket is an actual Entry and not just a reference to one, this means that our singly-linked list uses a dummy or sentinel node.

For instance, the first Entry at the top has a key of 0 and a value of "Dog". Because 0 maps to Bucket 0 we simply have the Entry structure at Buckets[0] reference this Entry. Because there are no other Entries that map to 0, this top Entry's next reference is simply NULL.

At the bottom, we have two Entries that map to Bucket 5. When multiple elements map to the same Bucket, we call this a collision. In separate chaining, we handle this situation by simply adding the additional Entry to our singly-linked list as shown in the diagram. Here, we have one Entry with the key 5 and the value "Log" followed by a second Entry with the key 13 and the value "Rock". The former has a reference to the latter to form a singly-linked list.

The remainder of this document details how you are to build a map that follows this overall design.

Starter Code

To help you get started, the instructor has provided you with the following starter code:

# Go to assignments repository
$ cd path/to/assignments/repository

# Download starter code tarball
$ curl -O http://www3.nd.edu/~pbui/teaching/cse.20289.sp17/static/tar/homework08.tar.gz

# Extract starter code tarball
$ tar xzvf homework08.tar.gz

Once downloaded and extracted, you should see the following files in your homework08 directory:

homework08
    \_ Makefile       # This is the Makefile for building all the project artifacts
    \_ README.md      # This is the README file for recording your responses
    \_ entry.c        # This is the C99 implementation file for the entry structure
    \_ fnv.c          # This is the C99 implementation file for the FNV-1 hash function
    \_ freq.c         # This is the C99 implementation file for the word frequencies tool
    \_ map.c          # This is the C99 implementation file for the map data structure
    \_ map.h          # This is the C99 header file for map data structure
    \_ measure.c      # This is the C99 implementation file for the measure utility
    \_ test_entry.c   # This is the C99 implementation file for the entry test
    \_ test_fnv.c     # This is the C99 implementation file for the FNV-1 test
    \_ test_freq.sh   # This is the shell script for testing the word frequencies tool
    \_ test_map.c     # This is the C99 implementation file for the map test

The details on what you need to implement are described in the following sections.

Task 0: `map.h`

The map.h file is the header file for whole library. In addition to the function prototypes, this header defines the following types:

typedef union {
    int64_t  number;
    char    *string;
} Value;

This defines a Value union type consisting of either a 64-bit integer or a string.

typedef enum {
    NUMBER,
    STRING,
} Type;

This defines an enumeration that defines the constants NUMBER and STRING.

typedef struct Entry Entry;
struct Entry {
    char    *key;
    Value    value;
    Type     type;
    Entry   *next;
};

This defines a Entry struct, which follows from our design described above. In addition to having a key, value, and next reference, the Entry also records the type of its value (ie. NUMBER or STRING). This will be useful in choosing between the number and string fields of the value union.

typedef enum {
    KEY,
    VALUE,
    KEY_VALUE,
    VALUE_KEY,
} DumpMode;

This defines an enumeration that allows the user to select what items in the Entry to dump.

typedef struct {
    Entry   *buckets;
    size_t   capacity;
    size_t   size;
    double   load_factor;
} Map;

This defines the Map struct, which follows from our design described above.

Type Definitions

The use of typedef in this header file allow us to simply say Map *m rather than struct Map *m. While there is some controversary over whether or not this is good practice, we will use these type definitions to make our code clearer and more compact.

Finally, the header file also provides a debug macro that you can use to insert debugging statements into your code (rather than [printf]):

debug("foo = d", foo);

The advantage of using this macro is that it displays the file name, line number, and function name where the command is called in its output, along with the formatted message you specified.

For this task, you do not need to modify this header file. Instead, you should review them and ensure you understand the provided code.

Task 1: `Makefile`

Once again, the Makefile contains all the rules or recipes for building the project artifacts (e.g. libmap.a, freq, etc.). Although the provided Makefile contains most of the variable definitions and test recipes, you must add the appropriate rules for libmap.a, test_entry, test_fnv, test_map, freq, measure, and any intermediate objects. The dependencies for these targets are shown in the DAG below:

Note: Although we are producing a static library in the form of libmap.a, we will not be statically linking our executables. Instead, we will use the libmap.a as another object file when we link our executables.

Makefile Variables

You must use the CC, CFLAGS, LD, LDFLAGS, AR, and ARFLAGS variables when appropriate in your rules.

Once you have a working Makefile, you should be able to run the following commands:

# Build all TARGETS
$ make
Compiling freq.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Linking libmap.a...
Linking freq...

# Run all tests
$ make test
Compiling test_entry.o...
Linking test_entry...
Testing entry...
Segmentation fault
==23062== Memcheck, a memory error detector
==23062== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23062== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==23062== Command: ./test_entry
...
==23062== For counts of detected and suppressed errors, rerun with: -v
==23062== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Compiling test_fnv.o...
Linking test_fnv...
Testing fnv...
Aborted
==23078== Memcheck, a memory error detector
==23078== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23078== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==23078== Command: ./test_fnv
...
==23078== For counts of detected and suppressed errors, rerun with: -v
==23078== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Compiling test_map.o...
Linking test_map...
Testing map...
Aborted
==23095== Memcheck, a memory error detector
==23095== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23095== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==23095== Command: ./test_map
...
==23095== For counts of detected and suppressed errors, rerun with: -v
==23095== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Testing freq...
--- -   2017-04-02 18:59:27.526589003 -0400
+++ test_freq.output    2017-04-02 18:59:27.524651038 -0400
@@ -0,0 +1,134 @@
+54     I
+52     shake
...
FAILURE: Probably stayed out too late and got nothing in your brain. Shake it off!
REASON:  Output doesn't match

# Simulate build with tracing output
$ make -n
echo Compiling freq.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o freq.o freq.c
echo Compiling entry.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o entry.o entry.c
echo Compiling fnv.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o fnv.o fnv.c
echo Compiling map.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o map.o map.c
echo Linking libmap.a...
ar rcs libmap.a entry.o fnv.o map.o
echo Linking freq...
gcc -L. -o freq freq.o libmap.a

Depending on your compiler, you may see some warnings with the initial starter code. Likewise, the test programs will all fail in some fashion.

Task 2: `entry.c`

The entry.c file contains the C99 implementation for the Entry structure; it rather closely resembles the node structure from Homework 07. For this task, you will need to implement the following functions:

Entry * entry_create(const char *key, const Value value, Entry *next, Type type)

This function allocates a new Entry, sets its key, value, next and type fields, and returns the allocated structure.

Notes:
- You must check if allocation fails.
- You must allocate and copy the key (consider strdup).
- You must allocate and copy the value if the type is set to STRING.
- You may wish to use the entry_update function to implement part of this function (to avoid code duplication).
Entry * entry_delete(Entry *e, bool recursive)

This function deallocates the given Entry and any previously allocated fields such as the key and value. If recursive is true, then this function will also deallocate the next Entry and any of its successors.

Notes: You can use recursion for this if you wish since the singly-linked list should never be too long.
void entry_update(Entry *e, const Value value, Type type)

This function updates the value field of the specified Entry.

Notes: Special care must be given to both the old value type and the new value type.
- If the old value is of type STRING, then you must deallocate it.
- If the new value is of type STRING, then you must allocate and copy it.
Failure to do so will lead to memory errors.
void entry_dump(Entry *e, FILE *stream, const DumpMode mode)

This function prints the Entry's key and value fields to the specified stream based on the given mode.

Notes: The modes are:
- KEY: Displays only the key.
- KEY_VALUE: Displays both the key and value separated by the tab character.
- VALUE: Displays only the value.
- VALUE_KEY: Displays both the value and key separated by the tab character.

As you implement entry.c, you can test it by running the test-entry target:

# Run test-entry target
$ make test-entry
Compiling test_entry.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Linking libmap.a...
Linking test_entry...
Testing entry...

# View log
$ cat test_entry.log

# Run test_entry manually
$ ./test_entry
Test entry creation...
Test entry update...
Test entry deletion...

Task 3: `fnv.c`

The fnv.c file contains the C99 implementation for the FNV-1a hash function. For this task, you will need to implement the following function:

uint64_t fnv_hash(const void *d, size_t n)

This function uses the FNV-1a algorithm to compute a hash on the given data d which consists of n bytes.

Notes: Remember that you can use type casting to interpret a chunk of bytes in any format you wish.

As you implement fnv.c, you can test it by running the test-fnv target:

# Run test-fnv target
$ make test-fnv
Compiling test_fnv.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Linking libmap.a...
Linking test_fnv...
Testing fnv...

# View log
$ cat test_fnv.log

# Run test_fnv manually
$ ./test_fnv
    wake me up inside = 0x7d9454fa584a0a56
                  123 = 0x456fc2181822c4db
              deadool = 0x7d517630945dadbb
            wolverine = 0x874d0d8868f684c8

Task 4: `map.c`

The map.c file contains the C99 implementation for the core map data structure. For this task, you will need to implement the following functions:

Map * map_create(size_t capacity, double load_factor)

This function allocates a new Map, sets its capacity, size, and load_factor fields, allocates buckets of Entry structures, and returns the allocated structure.

Notes:
- If capacity is 0, then it should use the DEFAULT_CAPACITY instead.
- If load_factor less than or equal to 0, then it should use the DEFAULT_LOAD_FACTOR instead.
- The buckets field is an array of Entry structures.
Map * map_delete(Map *m)

This function deallocates the given Map along with its buckets of Entry structures.
void map_insert(Map *m, const char *key, const Value value, Type type)

This function inserts a new Entry into the Map if the key does not exist yet. If it does, then pre-existing Entry's value is updated to the given value.

Notes:
- You should check the load_factor to determine if you need to resize before inserting a value.
- It doesn't matter where you insert the Entry in the corresponding singly-linked list. That is, you can either add to the front or the back, it is up to you.
Entry * map_search(Map *m, const char *key)

This function searches the Map for an Entry with the given key. If found then a reference to this Entry is returned, otherwise NULL.
bool map_remove(Map *m, const char *key)

This function removes the Entry with the given key from the Map and returns whether or not the removal was successful (ie. an Entry with the key was in fact in the Map).

Notes: Special care must be given to bridge the gap between the previous Entry and the next Entry during removal.
void map_dump(Map *m, FILE *stream, const DumpMode mode)

This function dumps all the Entries in the Map to the specified stream using the given mode.
void map_resize(Map *m, size_t new_capacity)

This function resizes the Map to the new_capacity.

Notes:
- You will need to allocate a new array of Entry structures.
- You will not need to copy or allocate new Entry structures; rather, you should simply re-link the existing Entry structures into the appropriate Bucket.

As you implement map.c, you can test it by running the test-map target:

# Run test-map target
$ make test-map
Compiling test_map.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Linking libmap.a...
Linking test_map...
Testing map...

# View log
$ cat test_map.log

# Run test_map manually
$ ./test_map
Testing map creation (defaults)...
Testing map insertion...
...
Testing map searching...
Testing map creation (custom)...
Testing map insertion...
...
Testing map searching...
Testing map removal...

Task 5: `README.md`

In your README.md, respond to the following prompts:

Briefly describe what memory you allocated in entry_create and what memory you deallocated in entry_delete. How did you handle the recursive flag?
Briefly describe what memory you allocated in map_create and what memory you deallocated in map_delete. How did you handle internal entries?
Briefly describe what memory you allocated, copied, and deallocated in map_resize. When is this function called and what exactly does it do to grow the map?
Briefly describe how your map_insert function works and analyze its time complexity and space complexity in terms of Big-O notation (both average and worst case).
Briefly describe how your map_search function works and analyze its time complexity and space complexity in terms of Big-O notation (both average and worst case).
Briefly describe how your map_remove function works and analyze its time complexity and space complexity in terms of Big-O notation (both average and worst case).

Activity 2: Word Frequencies (3 Points)

Once you have a working map library, you are to complete the freq.c program:

# Display usage
$ ./freq -h
Usage: ./freq
    -f FORMAT        Output format (KEY, KEY_VALUE, VALUE, VALUE_KEY)
    -l LOAD_FACTOR   Load factor for hash table

# Compute frequency of some random numbers
$ shuf -i1-10 -n 10 | ./freq
1       2
1       5
1       9
1       10
1       6
1       4
1       8
1       3
1       1
1       7

# Compute frequency of some random numbers (display only keys)
$ shuf -i1-10 -n 10 | ./freq -f KEY
2
5
9
10
6
4
8
3
1
7

# Measure time and memory usage for load factor of 2.0
$ shuf -i1-1000000 -n 1000000 | ./measure ./freq -f VALUE -l 2.0 > /dev/null
0.853000 seconds        93.382812 Mbytes

The freq program reads in data from standard input one word at a time, inserts the word into the map to keep track of its count, and then outputs the final map using the specified FORMAT. The LOAD_FACTOR parameter is used to set the load_factor parameter of the map_create function and adjust how often the internal buckets array should resize.

Task 1: `freq.c`

The freq.c file contains the C99 implementation of the word frequencies tool described above.

In addition to implementing command line parsing in the main function and you will need to implement the freq_stream function:

void freq_stream(FILE *stream, double load_factor, DumpMode mode)

This function reads one word at a time from the stream and uses a map with the specified load_factor to keep count of how often each word appears in the stream. Once the stream is complete read, the contents of the map are dumped to stdout using the specified mode.

Task 2: `test_freq.sh`

As you implement freq.c, you can test it by running the test-freq target:

# Run test-freq target
$ make test-freq
Compiling freq.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Linking libmap.a...
Linking freq...
Testing freq...

# View log
$ cat test_freq.log

# Run test_freq.sh manually
$ ./test_freq.sh

Task 3: `benchmark.{sh,py}`

Once you have a working word frequencies utility, you are to write a benchmark script in either shell or Python that measures the amount of time and space it takes to count the following number of items:

1, 10, 100, 1000, 10000, 100000, 1000000, 10000000

Additionally, you will need to experiment with the following load factors or ALPHA values:

0.5, 0.75, 0.9, 1.0, 2.0, 4.0, 8.0, 16.0

To help you measure the time and space a program uses, you can utilize the measure.c program written by David Chiang (it is already included in the starter code). Once you build the program, you can use it in the following manner:

# Build measure tool
$ make measure
Compiling measure.o...
Linking measure...

# Benchmark freq with 10,000,000 random numbers and load factor of 0.9
$ shuf -i1-100000000 -n 10000000 | ./measure ./freq -l 0.9 > /dev/null
10.055000 seconds       1345.105469 Mbytes

This means that the freq utility counted the frequencies of 10,000,000 random numbers in 10.05 seconds and used 1345 megabytes of memory.

Your script, either benchmark.sh or benchmark.py, should automate the timing and measurement of these experiments to produce a markdown table looks like the following:

|   NITEMS   |   ALPHA    |    TIME    |   SPACE    |
|------------|------------|------------|------------|
|          1 |       0.50 |       0.00 |       1.37 |
|          1 |        ... |        ... |        ... |
|         10 |       0.50 |       0.00 |       1.33 |
|         10 |        ... |        ... |        ... |
|        100 |       0.50 |       0.00 |       1.46 |
|        100 |        ... |        ... |        ... |
|       1000 |       0.50 |       0.00 |       1.56 |
|       1000 |        ... |        ... |        ... |
|      10000 |       0.50 |       0.01 |       3.47 |
|      10000 |        ... |        ... |        ... |
|     100000 |       0.50 |       0.06 |      18.24 |
|     100000 |        ... |        ... |        ... |
|    1000000 |       0.50 |       0.78 |     141.08 |
|    1000000 |        ... |        ... |        ... |
|   10000000 |       0.50 |      10.01 |    2177.27 |
|   10000000 |        ... |        ... |        ... |

Note: If you decided to use shell, please update the Makefile to use benchmark.sh instead of benchmark.py for the benchmark target.

Task 4: `README.md`

In your README.md, include markdown table your benchmark script generated and then answer the following questions:

Based on your experiments, what is the overall trend in terms of time and space as the number of items increases for the different load factors?

Are the results surprising to you? Explain why or why not.
There is no such thing as a perfect data structure. Compare this hash table to the treap you are creating in your data structures class. What are the advantages and disadvantages of both? If you had to use one as the underlying data structure for a map, which one would you choose and why?

Note: If you are not currently in the data structures class, you can simply compare your hash table to any other binary search tree you are familiar with such as a red-black tree.

Guru Point: Software Installation (1 Point)

For extra credit, you are to install a [Linux] application to the Public folder of your $HOME directory. For most [open source] software, this usually involves:

Downloading a tarball to your $HOME directory.
Extracting the tarball.
Running the configure script in the extracted directory. To tell the install where to put the resulting program, you can specify the prefix command line argument:
```
$ ./configure --prefix ~/Public
```
This will tell the installer to put the resulting files in your Public folder. The install will follow the typical Linux filesystem hierarchy and put executables in ~/Public/bin, libraries in ~/Public/lib, etc.
Building the software using make.
Installing the software using make install.

This standard process can be explained here.

Here are some possible packages for you to try this out on (you are free to try out something different if you wish):

To get credit, you must update your $PATH environment variable to include the location of your installed binaries and you must demonstrate the installed application executing either to the instructor or one of the TAs.

Submission

To submit your assignment, please commit your work to the homework08 folder in your assignments GitLab repository. Your homework08 folder should only contain at the following files:

Makefile
README.md
BENCHMARK.md
benchmark.sh or benchmark.py
entry.c
fnv.c
freq.c
map.c
map.h
measure.c
test_entry.c
test_fnv.c
test_freq.sh
test_map.c

Homework 08: Data Representation

Activity 1: Map Library (12 Points)

Separate Chaining Hash Table

Starter Code

Task 0: map.h

Type Definitions

Task 1: Makefile

Makefile Variables

Task 2: entry.c

Task 3: fnv.c

Task 4: map.c

Task 5: README.md