The goal of this homework assignment is to allow you to practice manipulating and building data structures in C. The first activity involves implementing a map library, while the second activity requires you to use this library to build and benchmark a `freq` utility similar to uniq.

For this assignment, record your source code and any responses to the following activities in the `homework08` folder of your assignments GitLab repository and push your work by 11:59 PM Friday, April 7, 2017.

## Activity 1: Map Library (12 Points)

As discussed in your data structures class, a map is a data structure that allows the user to associate keys to values. This enables users to easily lookup values by a non-numeric key such as a string in an efficient manner. Maps (aka dict in Python) are used in many applications and scripts. For instance, whenever we used JSON, we were accessing a map.

Of course, being a bare bones language, C does not have a built-in map in its standard library. Because of this, we will build our own map using a hash table. Specially, we will implement a hash table using a form of separate chaining to manage collisions.

#### Separate Chaining Hash Table

Fundamentally, a hash table is just an array. Normal arrays in C, unfortunately, can only be indexed by an integer. As we discussed a couple of weeks ago, this is due to the fact that the name of an array is really just the base address of a contiguous chunk of memory. Therefore, an index operation is equivalent to the following:

```A[i] = *(A + i) // Index i of a is the same as access the location in memory at A + i
```

This is fine if we only need to search our elements by integer values, since we can use the integer value as an offset into our chunk of memory. However, if we wanted to quickly lookup values by strings or doubles, we would instead have to scan linearly through our array to find the appropriate element. This, of course, is inefficient, so we need another way of storing non-integer values in a way that allows us to quickly find them again.

To work around this limitation in arrays, we can use a hash function such as FNV-1a to deterministicly compute an integer value for any sequence of binary data:

```FNV-1A-Hash(data) -> Integer
```

Therefore, using a hash function allows us to utilize non-integer values such as strings or doubles to lookup elements in a normal array. The use of a hash function in this manner is the core concept behind the data structure know as a hash table, which can be used to implement either the set or map abstract data types.

For this activity, you will be implementing a map based on a hash table that uses separate chaining to handle collisions as shown below:

In this design, the map starts with a structure that contains the following fields:

1. Buckets: This is the array of Entries, each of which consists of a key, a value, and a reference to the next Entry in the chain.

2. Capacity: This is the size of the array of Entries.

3. Size: This is the number of valid Entries we are storing in our map.

4. Load Factor: This is our maximum ratio of Size to Capacity. That is, if Size / Capacity is ever greater than Load Factor, then we will need to resize our Buckets.

In the example in the diagram above, we have a map with an array of `8` Buckets. Each Bucket is actually a singly-linked list where we store the Entries in our map. Because the Bucket is an actual Entry and not just a reference to one, this means that our singly-linked list uses a dummy or sentinel node.

For instance, the first Entry at the top has a key of `0` and a value of "Dog". Because `0` maps to Bucket `0` we simply have the Entry structure at Buckets[0] reference this Entry. Because there are no other Entries that map to `0`, this top Entry's next reference is simply `NULL`.

At the bottom, we have two Entries that map to Bucket `5`. When multiple elements map to the same Bucket, we call this a collision. In separate chaining, we handle this situation by simply adding the additional Entry to our singly-linked list as shown in the diagram. Here, we have one Entry with the key `5` and the value "Log" followed by a second Entry with the key `13` and the value "Rock". The former has a reference to the latter to form a singly-linked list.

The remainder of this document details how you are to build a map that follows this overall design.

#### Starter Code

To help you get started, the instructor has provided you with the following starter code:

```# Go to assignments repository
\$ cd path/to/assignments/repository

\$ curl -O http://www3.nd.edu/~pbui/teaching/cse.20289.sp17/static/tar/homework08.tar.gz

# Extract starter code tarball
\$ tar xzvf homework08.tar.gz
```

Once downloaded and extracted, you should see the following files in your `homework08` directory:

```homework08
\_ Makefile       # This is the Makefile for building all the project artifacts
\_ entry.c        # This is the C99 implementation file for the entry structure
\_ fnv.c          # This is the C99 implementation file for the FNV-1 hash function
\_ freq.c         # This is the C99 implementation file for the word frequencies tool
\_ map.c          # This is the C99 implementation file for the map data structure
\_ map.h          # This is the C99 header file for map data structure
\_ measure.c      # This is the C99 implementation file for the measure utility
\_ test_entry.c   # This is the C99 implementation file for the entry test
\_ test_fnv.c     # This is the C99 implementation file for the FNV-1 test
\_ test_freq.sh   # This is the shell script for testing the word frequencies tool
\_ test_map.c     # This is the C99 implementation file for the map test
```

The details on what you need to implement are described in the following sections.

### Task 0: `map.h`

The `map.h` file is the header file for whole library. In addition to the function prototypes, this header defines the following types:

```typedef union {
int64_t  number;
char    *string;
} Value;
```

This defines a `Value` union type consisting of either a `64`-bit integer or a string.

```typedef enum {
NUMBER,
STRING,
} Type;
```

This defines an enumeration that defines the constants `NUMBER` and `STRING`.

```typedef struct Entry Entry;
struct Entry {
char    *key;
Value    value;
Type     type;
Entry   *next;
};
```

This defines a `Entry` struct, which follows from our design described above. In addition to having a `key`, `value`, and `next` reference, the `Entry` also records the `type` of its `value` (ie. `NUMBER` or `STRING`). This will be useful in choosing between the `number` and `string` fields of the `value` union.

```typedef enum {
KEY,
VALUE,
KEY_VALUE,
VALUE_KEY,
} DumpMode;
```

This defines an enumeration that allows the user to select what items in the `Entry` to dump.

```typedef struct {
Entry   *buckets;
size_t   capacity;
size_t   size;
} Map;
```

This defines the `Map` struct, which follows from our design described above.

#### Type Definitions

The use of `typedef` in this header file allow us to simply say `Map *m` rather than `struct Map *m`. While there is some controversary over whether or not this is good practice, we will use these type definitions to make our code clearer and more compact.

Finally, the header file also provides a `debug` macro that you can use to insert debugging statements into your code (rather than [printf]):

```debug("foo = d", foo);
```

The advantage of using this macro is that it displays the file name, line number, and function name where the command is called in its output, along with the formatted message you specified.

For this task, you do not need to modify this header file. Instead, you should review them and ensure you understand the provided code.

### Task 1: `Makefile`

Once again, the `Makefile` contains all the rules or recipes for building the project artifacts (e.g. `libmap.a`, `freq`, etc.). Although the provided `Makefile` contains most of the variable definitions and `test` recipes, you must add the appropriate rules for `libmap.a`, `test_entry`, `test_fnv`, `test_map`, `freq`, `measure`, and any intermediate objects. The dependencies for these targets are shown in the DAG below:

Note: Although we are producing a static library in the form of `libmap.a`, we will not be statically linking our executables. Instead, we will use the `libmap.a` as another object file when we link our executables.

#### Makefile Variables

You must use the `CC`, `CFLAGS`, `LD`, `LDFLAGS`, `AR`, and `ARFLAGS` variables when appropriate in your rules.

Once you have a working `Makefile`, you should be able to run the following commands:

```# Build all TARGETS
\$ make
Compiling freq.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...

# Run all tests
\$ make test
Compiling test_entry.o...
Testing entry...
Segmentation fault
==23062== Memcheck, a memory error detector
==23062== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23062== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==23062== Command: ./test_entry
...
==23062== For counts of detected and suppressed errors, rerun with: -v
==23062== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Compiling test_fnv.o...
Testing fnv...
Aborted
==23078== Memcheck, a memory error detector
==23078== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23078== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==23078== Command: ./test_fnv
...
==23078== For counts of detected and suppressed errors, rerun with: -v
==23078== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Compiling test_map.o...
Testing map...
Aborted
==23095== Memcheck, a memory error detector
==23095== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23095== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==23095== Command: ./test_map
...
==23095== For counts of detected and suppressed errors, rerun with: -v
==23095== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Testing freq...
--- -   2017-04-02 18:59:27.526589003 -0400
+++ test_freq.output    2017-04-02 18:59:27.524651038 -0400
@@ -0,0 +1,134 @@
+54     I
+52     shake
...
FAILURE: Probably stayed out too late and got nothing in your brain. Shake it off!
REASON:  Output doesn't match

# Simulate build with tracing output
\$ make -n
echo Compiling freq.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o freq.o freq.c
echo Compiling entry.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o entry.o entry.c
echo Compiling fnv.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o fnv.o fnv.c
echo Compiling map.o...
gcc -g -gdwarf-2 -Wall -std=gnu99 -c -o map.o map.c
ar rcs libmap.a entry.o fnv.o map.o
gcc -L. -o freq freq.o libmap.a
```

Depending on your compiler, you may see some warnings with the initial starter code. Likewise, the test programs will all fail in some fashion.

### Task 2: `entry.c`

The `entry.c` file contains the C99 implementation for the Entry structure; it rather closely resembles the node structure from Homework 07. For this task, you will need to implement the following functions:

1. `Entry * entry_create(const char *key, const Value value, Entry *next, Type type)`

This function allocates a new `Entry`, sets its `key`, `value`, `next` and `type` fields, and returns the allocated structure.

Notes:

• You must check if allocation fails.

• You must allocate and copy the `key` (consider strdup).

• You must allocate and copy the `value` if the `type` is set to `STRING`.

• You may wish to use the `entry_update` function to implement part of this function (to avoid code duplication).

2. `Entry * entry_delete(Entry *e, bool recursive)`

This function deallocates the given `Entry` and any previously allocated fields such as the `key` and `value`. If `recursive` is `true`, then this function will also deallocate the `next` `Entry` and any of its successors.

Notes: You can use recursion for this if you wish since the singly-linked list should never be too long.

3. `void entry_update(Entry *e, const Value value, Type type)`

This function updates the `value` field of the specified `Entry`.

Notes: Special care must be given to both the old `value` `type` and the new `value` `type`.

• If the old `value` is of `type` `STRING`, then you must deallocate it.

• If the new `value` is of `type` `STRING`, then you must allocate and copy it.

Failure to do so will lead to memory errors.

4. `void entry_dump(Entry *e, FILE *stream, const DumpMode mode)`

This function prints the `Entry`'s `key` and `value` fields to the specified `stream` based on the given `mode`.

Notes: The `modes` are:

• `KEY`: Displays only the `key`.
• `KEY_VALUE`: Displays both the `key` and `value` separated by the `tab` character.
• `VALUE`: Displays only the `value`.
• `VALUE_KEY`: Displays both the `value` and `key` separated by the `tab` character.

As you implement `entry.c`, you can test it by running the `test-entry` target:

```# Run test-entry target
\$ make test-entry
Compiling test_entry.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Testing entry...

# View log
\$ cat test_entry.log

# Run test_entry manually
\$ ./test_entry
Test entry creation...
Test entry update...
Test entry deletion...
```

### Task 3: `fnv.c`

The `fnv.c` file contains the C99 implementation for the FNV-1a hash function. For this task, you will need to implement the following function:

1. `uint64_t fnv_hash(const void *d, size_t n)`

This function uses the FNV-1a algorithm to compute a hash on the given data `d` which consists of `n` bytes.

Notes: Remember that you can use type casting to interpret a chunk of bytes in any format you wish.

As you implement `fnv.c`, you can test it by running the `test-fnv` target:

```# Run test-fnv target
\$ make test-fnv
Compiling test_fnv.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Testing fnv...

# View log
\$ cat test_fnv.log

# Run test_fnv manually
\$ ./test_fnv
wake me up inside = 0x7d9454fa584a0a56
123 = 0x456fc2181822c4db
wolverine = 0x874d0d8868f684c8
```

### Task 4: `map.c`

The `map.c` file contains the C99 implementation for the core map data structure. For this task, you will need to implement the following functions:

1. `Map * map_create(size_t capacity, double load_factor)`

This function allocates a new `Map`, sets its `capacity`, `size`, and `load_factor` fields, allocates `buckets` of `Entry` structures, and returns the allocated structure.

Notes:

• If `capacity` is `0`, then it should use the `DEFAULT_CAPACITY` instead.

• If `load_factor` less than or equal to `0`, then it should use the `DEFAULT_LOAD_FACTOR` instead.

• The `buckets` field is an array of `Entry` structures.

2. `Map * map_delete(Map *m)`

This function deallocates the given `Map` along with its `buckets` of `Entry` structures.

3. `void map_insert(Map *m, const char *key, const Value value, Type type)`

This function inserts a new `Entry` into the `Map` if the `key` does not exist yet. If it does, then pre-existing `Entry`'s `value` is updated to the given `value`.

Notes:

• You should check the `load_factor` to determine if you need to resize before inserting a value.

• It doesn't matter where you insert the `Entry` in the corresponding singly-linked list. That is, you can either add to the front or the back, it is up to you.

4. `Entry * map_search(Map *m, const char *key)`

This function searches the `Map` for an `Entry` with the given `key`. If found then a reference to this `Entry` is returned, otherwise `NULL.`

5. `bool map_remove(Map *m, const char *key)`

This function removes the `Entry` with the given `key` from the `Map` and returns whether or not the removal was successful (ie. an `Entry` with the `key` was in fact in the `Map`).

Notes: Special care must be given to bridge the gap between the previous `Entry` and the next `Entry` during removal.

6. `void map_dump(Map *m, FILE *stream, const DumpMode mode)`

This function dumps all the `Entries` in the `Map` to the specified `stream` using the given `mode`.

7. `void map_resize(Map *m, size_t new_capacity)`

This function resizes the `Map` to the `new_capacity`.

Notes:

• You will need to allocate a new array of `Entry` structures.

• You will not need to copy or allocate new `Entry` structures; rather, you should simply re-link the existing `Entry` structures into the appropriate Bucket.

As you implement `map.c`, you can test it by running the `test-map` target:

```# Run test-map target
\$ make test-map
Compiling test_map.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Testing map...

# View log
\$ cat test_map.log

# Run test_map manually
\$ ./test_map
Testing map creation (defaults)...
Testing map insertion...
...
Testing map searching...
Testing map creation (custom)...
Testing map insertion...
...
Testing map searching...
Testing map removal...
```

### Task 5: `README.md`

In your `README.md`, respond to the following prompts:

1. Briefly describe what memory you allocated in `entry_create` and what memory you deallocated in `entry_delete`. How did you handle the `recursive` flag?

2. Briefly describe what memory you allocated in `map_create` and what memory you deallocated in `map_delete`. How did you handle internal entries?

3. Briefly describe what memory you allocated, copied, and deallocated in `map_resize`. When is this function called and what exactly does it do to grow the map?

4. Briefly describe how your `map_insert` function works and analyze its time complexity and space complexity in terms of Big-O notation (both average and worst case).

5. Briefly describe how your `map_search` function works and analyze its time complexity and space complexity in terms of Big-O notation (both average and worst case).

6. Briefly describe how your `map_remove` function works and analyze its time complexity and space complexity in terms of Big-O notation (both average and worst case).

## Activity 2: Word Frequencies (3 Points)

Once you have a working map library, you are to complete the `freq.c` program:

```# Display usage
\$ ./freq -h
Usage: ./freq
-f FORMAT        Output format (KEY, KEY_VALUE, VALUE, VALUE_KEY)

# Compute frequency of some random numbers
\$ shuf -i1-10 -n 10 | ./freq
1       2
1       5
1       9
1       10
1       6
1       4
1       8
1       3
1       1
1       7

# Compute frequency of some random numbers (display only keys)
\$ shuf -i1-10 -n 10 | ./freq -f KEY
2
5
9
10
6
4
8
3
1
7

# Measure time and memory usage for load factor of 2.0
\$ shuf -i1-1000000 -n 1000000 | ./measure ./freq -f VALUE -l 2.0 > /dev/null
0.853000 seconds        93.382812 Mbytes
```

The `freq` program reads in data from standard input one word at a time, inserts the word into the map to keep track of its count, and then outputs the final map using the specified `FORMAT`. The `LOAD_FACTOR` parameter is used to set the `load_factor` parameter of the `map_create` function and adjust how often the internal `buckets` array should resize.

### Task 1: `freq.c`

The `freq.c` file contains the C99 implementation of the word frequencies tool described above.

In addition to implementing command line parsing in the `main` function and you will need to implement the `freq_stream` function:

1. `void freq_stream(FILE *stream, double load_factor, DumpMode mode)`

This function reads one word at a time from the `stream` and uses a map with the specified `load_factor` to keep count of how often each word appears in the `stream`. Once the `stream` is complete read, the contents of the map are dumped to `stdout` using the specified `mode`.

### Task 2: `test_freq.sh`

As you implement `freq.c`, you can test it by running the `test-freq` target:

```# Run test-freq target
\$ make test-freq
Compiling freq.o...
Compiling entry.o...
Compiling fnv.o...
Compiling map.o...
Testing freq...

# View log
\$ cat test_freq.log

# Run test_freq.sh manually
\$ ./test_freq.sh
```

### Task 3: `benchmark.{sh,py}`

Once you have a working word frequencies utility, you are to write a benchmark script in either shell or Python that measures the amount of time and space it takes to count the following number of items:

```1, 10, 100, 1000, 10000, 100000, 1000000, 10000000
```

Additionally, you will need to experiment with the following load factors or `ALPHA` values:

```0.5, 0.75, 0.9, 1.0, 2.0, 4.0, 8.0, 16.0
```

To help you measure the time and space a program uses, you can utilize the measure.c program written by David Chiang (it is already included in the starter code). Once you build the program, you can use it in the following manner:

```# Build measure tool
\$ make measure
Compiling measure.o...

# Benchmark freq with 10,000,000 random numbers and load factor of 0.9
\$ shuf -i1-100000000 -n 10000000 | ./measure ./freq -l 0.9 > /dev/null
10.055000 seconds       1345.105469 Mbytes
```

This means that the `freq` utility counted the frequencies of `10,000,000` random numbers in `10.05` seconds and used `1345` megabytes of memory.

Your script, either `benchmark.sh` or `benchmark.py`, should automate the timing and measurement of these experiments to produce a markdown table looks like the following:

```|   NITEMS   |   ALPHA    |    TIME    |   SPACE    |
|------------|------------|------------|------------|
|          1 |       0.50 |       0.00 |       1.37 |
|          1 |        ... |        ... |        ... |
|         10 |       0.50 |       0.00 |       1.33 |
|         10 |        ... |        ... |        ... |
|        100 |       0.50 |       0.00 |       1.46 |
|        100 |        ... |        ... |        ... |
|       1000 |       0.50 |       0.00 |       1.56 |
|       1000 |        ... |        ... |        ... |
|      10000 |       0.50 |       0.01 |       3.47 |
|      10000 |        ... |        ... |        ... |
|     100000 |       0.50 |       0.06 |      18.24 |
|     100000 |        ... |        ... |        ... |
|    1000000 |       0.50 |       0.78 |     141.08 |
|    1000000 |        ... |        ... |        ... |
|   10000000 |       0.50 |      10.01 |    2177.27 |
|   10000000 |        ... |        ... |        ... |
```

Note: If you decided to use shell, please update the `Makefile` to use `benchmark.sh` instead of `benchmark.py` for the `benchmark` target.

### Task 4: `README.md`

In your `README.md`, include markdown table your benchmark script generated and then answer the following questions:

1. Based on your experiments, what is the overall trend in terms of time and space as the number of items increases for the different load factors?

Are the results surprising to you? Explain why or why not.

2. There is no such thing as a perfect data structure. Compare this hash table to the treap you are creating in your data structures class. What are the advantages and disadvantages of both? If you had to use one as the underlying data structure for a map, which one would you choose and why?

Note: If you are not currently in the data structures class, you can simply compare your hash table to any other binary search tree you are familiar with such as a red-black tree.

## Guru Point: Software Installation (1 Point)

For extra credit, you are to install a [Linux] application to the `Public` folder of your `\$HOME` directory. For most [open source] software, this usually involves:

1. Downloading a tarball to your `\$HOME` directory.

2. Extracting the tarball.

3. Running the `configure` script in the extracted directory. To tell the install where to put the resulting program, you can specify the `prefix` command line argument:

```\$ ./configure --prefix ~/Public
```

This will tell the installer to put the resulting files in your `Public` folder. The install will follow the typical Linux filesystem hierarchy and put executables in `~/Public/bin`, libraries in `~/Public/lib`, etc.

4. Building the software using `make`.

5. Installing the software using `make install`.

This standard process can be explained here.

Here are some possible packages for you to try this out on (you are free to try out something different if you wish):

To get credit, you must update your `\$PATH` environment variable to include the location of your installed binaries and you must demonstrate the installed application executing either to the instructor or one of the TAs.

## Submission

To submit your assignment, please commit your work to the `homework08` folder in your assignments GitLab repository. Your `homework08` folder should only contain at the following files:

• `Makefile`
• `README.md`
• `BENCHMARK.md`
• `benchmark.sh` or `benchmark.py`
• `entry.c`
• `fnv.c`
• `freq.c`
• `map.c`
• `map.h`
• `measure.c`
• `test_entry.c`
• `test_fnv.c`
• `test_freq.sh`
• `test_map.c`