A4: From Parallel to Serverless

In this assignment you will convert some of your previous work (parallel DNA analysis) into a serverless model, running on AWS Lambda. This will give you the opportunity to explore the performance tradeoffs of working at cloud scale.

Caution: This assignment requires some fiddling with the AWS setup just to get to the starting point. You will need to read the relevant instructions and manuals from Amazon to figure out the details and make it work. Don't leave this until the last minute.

Setup: Virtual Machine

Create an Amazon Web Services (AWS) account and apply the credit code provided earlier.
Go to the AWS console and select the EC2 service. Create a micro-sized (free-tier) virtual machine running Amazon Linux in the us-east-1 region to be your point of interaction with the system. (See list of regions and names.) When prompted, create a new key pair, download it into a file mykeypair.pem, and protect it with chmod 600 mykeypair.pem.
Connect to your virtual machine using ssh -i mykeypair.pem ec2-user@IPADDRESS. Once logged in, you are known by the username ec2-user. If necessary, you can become root by using sudo without a password. For example, run sudo yum update.

Setup: AWS Command Line

Go to the AWS console, select your profile in the upper right, then "My Security Credentials", "Continue to Security Credentials", then "Access Keys". Select "Create New Access Keys", then download the keyfile and put it somewhere safe. (It's plain text; just open it to see your Access Key ID and Secret Key.)

In your VM, run aws configure and then enter your details:

AWS Access Key ID [None]: XXXX
AWS Secret Access Key [None]: XXXX
Default region name [None]: us-east-1
Default output format [None]: json

Double check that everything works by running aws ec2 describe-instances. If you get a JSON description of your virtual machine, then everything's fine. If not, go back and check previous steps.

Setup: Lambda

Go to the AWS console, and select the Lambda service. Create a new function using a Python runtime, and create a new role using the policy template "Test Harness".
You can now test your function in the console by modifying the code directly in the editor. Note that the function accepts an arbitrary event (payload) data structure as input, and is expected to produce JSON as output. The default function just returns "Hello from Lambda". Modify it to return the payload in its JSON output instead.
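As a rough illustration of the modification described above, here is a minimal sketch of a handler that echoes its input. It assumes the console's default handler name, lambda_handler, and keeps the statusCode/body shape of the default template; the exact shape of the returned structure is up to you.

```python
import json

def lambda_handler(event, context):
    # "event" is the payload sent by the caller, already decoded from JSON.
    # Return it inside the response so the caller can see what was received.
    return {
        "statusCode": 200,
        "body": json.dumps(event),
    }
```

With the test payload { "hello" : "world" }, the body of the response should contain that same object.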

Go back to your VM and invoke the Lambda function from the command line:

aws lambda invoke --function-name YourFunctionName --payload '{ "hello" : "world" }' outfile

If successful, the result code from invoking the function is displayed on the standard output, and the actual JSON output of the function goes to outfile.
Next, make sure that the boto3 Python library is installed.
```
sudo yum install python2-boto3
```

Finally, create a Python program on your VM that invokes the function from Python, like this:

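#!/usr/bin/python

import boto3
import json

# Create a Lambda client using the credentials stored by "aws configure".
client = boto3.client('lambda')

print("Sending Request...")
# RequestResponse invokes the function synchronously and waits for its result.
response = client.invoke(
        FunctionName="TestFunction",  # replace with the name of your function
        InvocationType='RequestResponse',
        Payload=json.dumps({"hello": "world"})
    )

# The Payload field is a stream: read it, then decode the JSON result.
print("Response: ")
print(json.loads(response['Payload'].read()))
print("\n")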

Make sure that you are able to run the program, invoke the Lambda function, and get a result back. Working? Ok, now you are ready to put Lambda to use.

The Assignment

Re-create your solution to assignment two: write a program that accepts a file containing DNA sequences on the command line, and then compares each sequence to all the others (excluding self matches) and then produces a list of the top ten matches, like this: (not real results)

./compareit.py aedes-aegypti-small.fasta
Sending functions to Lambda...
Top Ten Matches:
1: sequence DV400178.1 matches DV387043.1 with a score of 807
2: sequence JN573266.1 matches DV279822.1 with a score of 643
...
10: sequence DW199870.1 matches DW192223.1 with a score of 35

In this case, the main program should run on your virtual machine and do the comparisons by invoking a Lambda function of your creation. The Lambda function should use the provided swalign tool to do the desired comparison and return the result. For this assignment, you should use the aedes-aegypti-small.fasta dataset, which consists of 1000 sequences, also drawn from VectorBase.
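The all-against-all comparison and the top-ten selection can be sketched as follows. This is only an outline under stated assumptions: top_matches and score_pair are hypothetical names, score_pair stands in for whatever ends up calling your Lambda function, and the sketch assumes the score is symmetric so that each unordered pair is compared once. If order matters for your scoring, itertools.permutations would be used in place of combinations.

```python
import heapq
from itertools import combinations

def top_matches(names, score_pair, n=10):
    # combinations(names, 2) yields each unordered pair exactly once and
    # never pairs a sequence with itself, so self matches are excluded.
    scored = ((score_pair(a, b), a, b) for a, b in combinations(names, 2))
    # Keep only the n highest-scoring (score, a, b) tuples, best first.
    return heapq.nlargest(n, scored)
```

For 1000 sequences this produces 1000 * 999 / 2 = 499500 pairs, which is the amount of work the Lambda functions have to cover.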

Some things you will have to figure out:

Use the Lambda documentation as needed.

If your Lambda function needs access to anything not already installed on the machine, you will need to send it along with the function. Put everything you need together into a deployment package (a .zip file) and upload that as a complete function. This will be unpacked to create your execution environment.

If you just invoke each function in the direct way, they will run sequentially and not achieve any speedup. You will need to run multiple functions concurrently by using threads, multiple processes, or some other concurrent Python technique.
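One possible way to overlap invocations is a thread pool, sketched below. It assumes a Python 3 interpreter with boto3 available; run_concurrently and make_lambda_invoker are hypothetical helper names, and the worker count of 16 is an arbitrary starting point rather than a recommended value. Threads are a reasonable fit here because each invocation spends most of its time waiting on the network.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def run_concurrently(invoke, payloads, workers=16):
    # Each worker thread blocks on one invocation at a time, so up to
    # "workers" requests are in flight at once. Results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(invoke, payloads))

def make_lambda_invoker(function_name):
    # boto3 is imported here so the helper above stays usable without it.
    import boto3
    client = boto3.client("lambda")

    def invoke(payload):
        # Synchronous invocation: wait for the result and decode its JSON.
        response = client.invoke(
            FunctionName=function_name,
            InvocationType="RequestResponse",
            Payload=json.dumps(payload).encode("utf-8"),
        )
        return json.loads(response["Payload"].read())

    return invoke
```

A call such as run_concurrently(make_lambda_invoker("TestFunction"), payloads) would then return one decoded result per payload. Varying workers is one way to look for the concurrency limit asked about in the questions.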

Questions to Answer

What is the minimum time necessary to invoke a Lambda function? That is, how long does it take to execute a function that does nothing?
Create subsets of the data consisting of 10, 50, 100, 500, and 1000 items, and run compareit on each subset. What is the total execution time and speedup for each quantity of data? Did you observe any limit in the amount of concurrency that Lambda offers?
Compare all 1000 items to each other and report the top ten matches.
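For the timing questions, a small helper along these lines may be useful. It is a sketch assuming Python 3; average_latency and speedup are hypothetical names, and call stands for any zero-argument function, such as one that invokes a do-nothing Lambda function.

```python
import time

def average_latency(call, repeats=10):
    # Time several back-to-back calls and report the mean, in seconds.
    start = time.perf_counter()
    for _ in range(repeats):
        call()
    return (time.perf_counter() - start) / repeats

def speedup(sequential_seconds, concurrent_seconds):
    # Speedup is the sequential time divided by the concurrent time.
    return sequential_seconds / concurrent_seconds
```

Averaging over several calls smooths out variation between individual invocations; the first call to a newly deployed function may take noticeably longer than later ones.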

What to Turn In

Your dropbox directory is:

/afs/nd.edu/courses/cse/cse40822.01/dropbox/YOURNAME/a4

Turn in the following files:

A file answers.html containing answers to the questions and links to your programs in the dropbox directory:
The master program compareit.py which invokes all of the functions.
The code for the function function.py invoked via Lambda. (It's ok if you constructed this via the web interface, just copy-paste it into a file for submission.)

This assignment is due on Friday, November 9th at 11:59PM.