The fifth project is to build a simplified implementation of memcached, which is an in-memory key-value store that is used for quick access to frequently used computations (ie. a cache). To support concurrent access to the cache, your library, MemHashed, must use the POSIX Threads to ensure thread-safe behavior.
To use the memcached programming paradigm, we first declare a memcached structure:
// Create Cache Structure
Cache *cache = cache_create(addrlen, page_size, policy, handler);
This object takes the size of the address space (in terms of the number
of bits), the size of each page (in terms of the number of
entries), the eviction policy (either Random or Clock, and the
handler (the function to call when a requested item is missing from the
cache). It maps an uint64_t
key to a int64_t
value.
Once we have this Cache
structure, we can retrieve a value from the
cache with following command:
// Retrieve value associated with key
int64_t value = cache_get(cache, key);
If the value associated with the key is not in the Cache
then the
provided handler will be called to generate or retrieve the missing
value, which is then stored in the cache and returned to the caller.
Likewise, we can also store values into the cache by doing the following:
// Store value associated with key
cache_put(cache, key, value);
If the Cache
is full, that is it has no free space for another value,
then we must perform an eviction. To do so, we will use the specified
eviction policy to select a currently existing value for replacement
and then store the new value in the place of the evicted object.
In this way, the use of a fault handler and an eviction policy means
that Cache
is similar to the paging and virtual memory systems we
have discussed in class for the last few weeks.
Internally, the Cache
structure in the memhashed library uses a
modified form of a hashed array tree, which combines facets of hash
tables with binary trees as shown below:
Unlike in typical hash tables, our hashed array tree splits up the
array of buckets into a tree of Pages
. Rather than using the result
of the hash function to map to a specific bucket, we will simply perform
a modulus operation on the key to convert the key into a virtual
address:
VirtualAddress = key % NumberOfAddreses
Once we have this virtual address, we can then use it to navigate our hashed array tree in a manner similar to how we translate virtual address to physical addresses with page tables.
This means that inside the Cache
structure, we have a pages
array which
contains pointers to individual Page
structures. Likewise, each Page
structure itself contains an array of Entry
structures which at the very
keep track of the key and value pair associated with the Entry
,
whether or not the Entry
is valid, how recently the Entry
was
used, and the Entry
's insertion order.
To locate an item in the Cache
, we must translate the key into an
address by using the modulus operator with the number of addresses in
the Cache
Structure, and then use that to find the appropriate key-value
pair within our hashed array tree.
For example, the following code corresponds to the diagram above:
int64_t is_odd(const uint64_t key) {
return key % 2;
}
Cache *cache = cache_create(4, 4, RANDOM, is_odd);
printf("IsOdd(3) = %ld\n", cache_get(cache, 3));
printf("IsOdd(8) = %ld\n", cache_get(cache, 8));
The example code creates a Cache
structure called cache
with an
address length of 4
bits as specified by the first argument to the
Cache
constructor. This means that our total address space is 2^4 =
16
addresses where each address corresponds to a Entry
structure.
The second argument, 4
specifies the number of entries per Page
.
Because each Page
has 4
Entry
structures, we need 2^4 Entries / 2^2
Entries Per Page = 2^2 = 4
Pages
in our pages
array.
To perform cache_get(cache, 3)
(that is retrieve the value of 3
from
our Cache
), we must do the following:
First, convert the key into an integer virtual address by using the modulus operator with the number of addresses.
In this example, this is 3 % 16 = 3
.
Next, extract the virtual page number or VPN from the integer address.
Since the virtual address is 3
in this case, the bit representation
is 0011
. Therefore the VPN is 00
or the first two bits of the
virtual address.
Afterwards, extract the offset into the Page
by taking the
remaining bits of the virtual address.
In this case, the final bits are 11
and therefore the offset is 11
.
In summary, we modulus the key to generate a virtual address and then extract the VPN and the offset from this virtual address:
3 % 16 -> 0 0 1 1
___ ___
VPN Offset
Once we have the VPN, we can then use this to select the appropriate
Page
from the pages
array. Likewise, we can then access the Entry
inside the selected Page
by using the computed offset.
To perform cache_get(cache, 8)
(that is retrieve the value of 8
from
our Cache
), we follow the same process and get a VPN of 10
and an
offset of 00
. This means we will need to lookup the first Entry
of
the third Page
of our pages
array to find the value for the key
8
.
In this way, the Cache
's internal structure resembles a single-level
page table and uses a similar mechanism for translating virtual addresses
to physical address as we've discussed in class.
Working individually, you are to create a library that implements the MemHashed library described above and the collatz and coins functional tests that utilize your library by 11:59PM on Monday, November 25, 2019.
More details about this project and your deliverables are described below.
Many websites such as Wikipedia and Wordpress utilize memcached to speed up access to their websites. In fact the notion of a fast distributed key-value store is a core component of the NoSQL movement currently sweeping the data storage and processing industry.
For our project, we will create a simplified version of memcached that
acts uses an embedded library model rather than a client/server framework.
In this way, our Cache
structure is akin to sqlite, which is an
embeddable database engine that many applications such as Firefox and
Chrome utilize internally.
As noted above, you are to work individually to implement your memhashed library. You must use C (not C++) as the implementation language. Any test scripts or auxillary tools can be written in any reasonable scripting language.
Here is a timeline of events related to this project:
Date | Event |
---|---|
Saturday, November 09 | Project description and repository are available. |
Saturday, November 16 | Brainstorming should be completed. |
Monday, November 18 | Repository must be forked and set to private. |
Monday, November 25 | Library and functional tests due (pushed to GitLab memhashed branch). |
If you do not fork your repository AND mark it as private by the deadline above, you will lose 3 points from your final grade.
For the final submission, please open a Merge Request on your
repository (for your memhashed
branch to the master
branch) and assign
it to your grader from the [Reading 10 TA List].
To start this project, you must fork the Project 05 repository on GitLab:
https://gitlab.com/nd-cse-30341-fa19/cse-30341-fa19-project05
Once this repository has been forked, follow the instructions from Reading 00 to:
Make the repository private.
Configure access to the repository.
Make sure you add all the members of the team in addition to the instructional staff.
Note: Do your work in a separate git branch that can be later merged
into the master
branch (ie. make a memhashed
branch, DO NOT COMMIT OR
MERGE TO MASTER).
The Project 05 repository includes a README.md
file with the following
sections:
Student: This should be your name and email address.
Brainstorming: This should contain your responses to the provided brainstorming questions (optional for this project).
Errata: This should contain a description of any known errors or deficiencies in your implementation.
Extra Credit: This should contain a description of any extra credit you attempted or completed.
You must complete this document report as part of your project.
While it is tempting to write a long response to the brainstorming and reflection questions, please limit yourself to a few sentences at the most for each response.
This will facilitate more efficient grading and it will help you practice distilling ideas into their absolute essentials (which is useful for real world communication and for answering questions on exams).
As you can see, the base Project 05 repository contains a README.md
file and the following folder hierarchy:
project05
\_ Makefile # This is the project Makefile
\_ bin # This contains the test executables and scripts
\_ include
\_ memhashed # This contains the MemHashed library header files
\_ src
\_ memhashed # This contains the MemHashed library source code
\_ tests # This contains the MemHashed test source code
You must maintain this folder structure for your project and place files in their appropriate place.
To help you get started, we have provided a Makefile
with all the
necessary targets:
$ make # Build libraries and tests
Compiling src/memhashed/queue.o
Compiling src/memhashed/cache.o
Compiling src/memhashed/page.o
Linking lib/libmemhashed.a
Compiling src/tests/unit_cache.o
Linking bin/unit_cache
Compiling src/tests/unit_page.o
Linking bin/unit_page
Compiling src/tests/func_coins.o
Linking bin/func_coins
Compiling src/tests/func_collatz.o
Linking bin/func_collatz
Compiling src/tests/func_scan.o
Linking bin/func_scan
Compiling src/tests/func_prime.o
Linking bin/func_prime
Compiling src/tests/func_fibonacci.o
Linking bin/func_fibonacci
$ make clean # Remove all targets
Removing objects
Removing libraries
Removing test programs
While the exact organization of the project code is up to you, keep in mind that you will be graded in part on coding style, cleaniness, and organization. This means your code should be consistently formatted, not contain any dead code, have reasonable comments, and appropriate naming among other things:
Break long functions into smaller functions.
Make sure each function does one thing and does it well.
Abstract, but don't over do it.
Please refer to these Coding Style slides for some tips and guidelines on coding style expectations.
Once you have the MemHashed library, you can use the provided tests to check your implementation:
$ make test # Run all tests
Running unit_cache 0
Running unit_cache 1
Running unit_cache 2
Running unit_page 0
Running unit_page 1
Running unit_page 2
Running unit_page 3
Testing coins with Random ... Success
Testing coins with Clock ... Success
Testing coins (mt) with RANDOM ... Success
Testing coins (mt) with CLOCK ... Success
Testing collatz with Random ... Success
Testing collatz with Clock ... Success
Testing collatz (mt) with Random ... Success
Testing collatz (mt) with Clock ... Success
Testing fibonacci with Random ... Success
Testing fibonacci with Clock ... Success
Testing fibonacci (mt) with RANDOM ... Success
Testing fibonacci (mt) with CLOCK ... Success
Testing prime with Random ... Success
Testing prime with Clock ... Success
Testing prime (mt) with RANDOM ... Success
Testing prime (mt) with CLOCK ... Success
Testing scan (10) with Random ... Success
Testing scan (10) with Clock ... Success
Testing scan (8) with Random ... Success
Testing scan (8) with Clock ... Success
Testing scan (mt) with RANDOM ... Success
Testing scan (mt) with CLOCK ... Success
Remember, you can run each unit test separately:
$ ./bin/unit_cache 0 # Run test 0 of cache unit test
In addition to unit tests, there are a few functional tests:
src/tests/func_scan.c
: This simply reads data from an array.
src/tests/func_prime.c
: This computes prime numbers.
src/tests/func_fibonacci.c
: This recursively computes fibonacci
numbers.
To utilize these functional tests, you can do the following:
# Build applications
$ make
# Scan Usage
$ ./bin/func_scan
Usage: ./bin/func_scan AddressLength PageSize EvictionPolicy Threads
# Run scan example
$ ./bin/func_scan 8 64 0 1
Hits = 184
Misses = 10056
Each functional test takes four arguments:
AddressLength
: Number of bits in address space.
PageSize
: Number of entries per page.
EvictionPolicy
: Which eviction policy to use (Random =
0
, Clock = 1
).
Threads
: Number of threads to utilize.
To run the functional test without any caching, you can simply set
AddressLength
to 0
:
# Run scan example without cache
$ ./bin/func_scan 0 0 0 1
Hits = 0
Misses = 0
To check the result of your functional tests, you can run each functional test script individually:
$ ./bin/test_scan.sh
Testing scan (10) with Random ... Success
Testing scan (10) with Clock ... Success
Testing scan (8) with Random ... Success
Testing scan (8) with Clock ... Success
Note there is a multi-threaded version of each functional test to ensure you do not have any deadlocks or race conditions:
$ ./bin/test_scan_mt.sh
Testing scan (mt) with RANDOM ... Success
Testing scan (mt) with CLOCK ... Success
All of the C header files are in the include/memhashed
folder, while the
C source code for the MemHashed library is in the src
folder. To
help you get started, parts of the project are already implemented:
[x] include/memhashed/cache.h # Cache structure header (implemented)
[x] include/memhashed/logging.h # Logging utilities header (implemented)
[x] include/memhashed/page.h # Page structure header (implemented)
[x] include/memhashed/queue.h # Queue structure header (implemented)
[x] include/memhashed/thread.h # Thread utilites header (implemented)
[x] include/memhashed/types.h # Tpyes header (implemented)
[ ] src/memhashed/cache.c # Cache structure definitions (not implemented)
[ ] src/memhashed/page.c # Page structure definitions (not implemented)
[x] src/memhashed/queue.c # Queue structure definitions (implemented)
Basically, all of the header files are implemented. You only need to focus
on implementing the Cache
and Page
structures which are at the core of
the MemHashed library. Each of the functions in the incomplete files
above have comments that describe what needs to be done.
You will need to examine these source files and complete the implementation
of the MemHashed library that utilizes a hashed array tree (as
described above) to implement an in-memory cache. To do so, you should
start with the Page
structure and then move to the Cache
structure.
Once you pass the provided unit tests, you should then use the functional
tests to ensure that your MemHashed library behaves properly:
src/memhashed/page.c
: This file contains the functions for the Page
structure, which contains a number of attributes for proper management of
its internal entries
:
The page_create
constructor must determine the following attributes
and then initialize the entries
array based on these values.
nentries
: This is the number of entries per page (specified by user).policy
: This is the policy to use when an entry must be evicted (specified by user).entries
: This is the array of Entry
structures.The page_get
function must use [linear probing] to locate the
appropriate Entry
. If it is not found, then it must return an
Entry
with a key
that is different from the user specified key
.
The page_put
function must also use [linear probing] to locate the
appropriate Entry
. If no Entry
is found, then it should overwrite
an invalid Entry
. If the Page
is full, an existing Entry
must
be selected based on the eviction policy
and then used for
replacement.
You will have to implement each of these eviction policies
individually in separate helper functions that you can call in
your page_put
function.
Since multiple threads can access each Page
concurrently, you must
use the lock
attribute to implement monitor-style protection.
Note: For clock, you will need to keep track of the used
attribute of each Entry
in the Page
. To set the used
attribute, you should first incremente the Page
's used
counter (aka
logical clock) and then assign the new value to the Entry
's used
attribute as follows:
// Set used attribute of Entry in Page with logical clock
page->entries[offset].used = ++page->used
src/memhashed/cache.c
: This file contains the functions for the
Cache
structure, which contains a number of attributes necessary for
proper management of its internal pages
:
The cache_create
constructor must determine the following
attributes and then initialize the pages
array based on these
values.
addrlen
: This is the number of bits in the virtual address (specified by user).page_size
: This is the number of entries per page (specified by user).policy
: This is the policy to use when an entry must be evicted (specified by user).handler
: This is the handler function to user when an entry is not found (specified by user).
naddresses
: This is the maximum number of virtual addresses (based on addrlen
).
npages
: This is the maximum number pages in the cache (based on naddresses
and page_size
).vpn_shift
: This is the number of bits required to shift the VPN (based on addrlen
).vpn_mask
: This is the mask required to extract the VPN from a virtual address (based on npages
and vpn_shift
).offset_mask
: This is the mask required to extract the offset from a virtual address (based on vpn_shift
).
pages
: This is the array of Page
pointers.
lock
: This is the mutex used to synchronize access to the internals of the cache.The cache_get
function must determine the virtual address, VPN,
and offset based on the key
, and then use this information to
retrieve the value from the appropriate Page
.
It must take care of the case where the data is missing (ie. the
returned Entry
has a mismatching key
). In this situation, it
must call the handler
function to compute the appropriate value
and then store that into the cache before returning this new
value.
The cache_put
method must also determine the virtual address,
VPN, and offset based on the key
, and then use this
information to store the value into the appropriate Page
.
To keep track of the cache hits and misses, you must update the hits
and misses
attributes appropriately. All of the information that you
determine and track will be dumped to a stream by the provided
cache_stats
function.
Because multiple threads can access the Cache
concurrently, you are
to use the provided lock
attribute judiciously to protect values that
need to be updated atomically.
Note: There are also some TODOs
to help identify which parts you need
to implement, along with how you will want to modify the counters.
In addition to building the MemHashed library, you are also given a
src/tests/func_collatz.c
skeleton. You are to complete this program by
using your MemHashed library to memoize the computation of Collatz
sequence lengths:
The Collatz conjecture is a conjecture in mathematics that concerns a sequence defined as follows: start with any positive integer n. Then each term is obtained from the previous term as follows: if the previous term is even, the next term is one half the previous term. If the previous term is odd, the next term is 3 times the previous term plus 1. The conjecture is that no matter what value of n, the sequence will always reach 1.
For example, given the input 22
, the following sequence of numbers would
be generated by the Collatz conjecture:
22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
This sequence contains 16
numbers, so we would say that the Collatz
sequence length for 22
is 16
.
Given any integer n
, we could compute the length of the Collatz
sequence using the following recursive algorithm:
def collatz_length(n):
if n == 1:
return 1
if is_odd(n):
return collatz_length(3*n + 1) + 1
else:
return collatz_length(n / 2) + 1
You are to complete src/tests/func_collatz.c
such that it reads in a
stream of integers from standard input and computes the Collatz sequence
length of each integer. To speed up computation, we will use our
MemHashed
library to memoize overlapping computations.
Additionally, we will support utilizing multiple threads to handle
different input integers concurrently. To coordinate the worker threads,
you are to use the provided Queue
structure, which internally uses
semaphores for synchronization (we implemented in class in Lecture 12).
You will need to think carefully about how and when to use the Cache
and
Queue
to implement this application.
In addition to src/tests/func_collatz.c
, you are also given a
src/tests/func_coins.c
skeleton. You are to complete this program by
using your MemHashed library to implement a dynamic programming
solution to the classic coins problem:
What is the minimum number of US coins (
25
,10
,5
,1
) necessary to make a certain amount of change:
For instance, to make change for 17
cents, we can use 10
+ 5
+ 1
+
1
for a minimum number of 4
coins.
To perform a dynamic programming solution, we can build a table using the following recurrence relationship:
Coins(Amount) = Minimum(
Coins(Amount - BaseCoin) for BaseCoin in BaseCoins
)
You are to complete src/tests/func_coins.c
such that it computes the
number of coins necessary to produce change between between 0
and 99
cents using the US coins (25
, 10
, 5
, 1
).
You will need to think carefully about how and when to use the Cache
and
to implement this application.
Once you have completed your project, you may do one of the following for 2.0 points of extra credit:
Implement a multi-threaded MemHashed server that responds to HTTP requests (
GET
andPUT
) and provide a basic demonstration of it in action.There is a race condition in the current design of the
Cache
. Identify the bug and then redesign theCache
andPage
structures.
Your project will be graded on the following metrics:
Metric | Points |
---|---|
Source Code
|
23.0
|
Documentation
|
1.0
|
To encourage you to work on the project regularly (rather than leaving it until the last second) and to practice performing incremental development, part of your grade will be based on whether or not you have regular and reasonably sized commits in your git log.
That is, you will lose a point if you commit everything at once near the project deadline.
In addition to meeting the functional requirements of the project (as described above), your program must not exhibit any memory leaks or invalid memory accesses as would be detected by Valgrind.
Additionally, because [system calls] can fail, your program must implement robust error handling and ensure your code checks the status of [system calls] when reasonable.
Moreover, you code must compile without any warnings (and -Wall
must
be one of the CFLAGS
).