In this assignment, you will learn how to collect different types of data from Twitter by using an open source library called Tweepy to build your own Twitter data crawler. Since Twitter has an IP-based rate limit policy, please use your own computer to finish this assignment. If you have trouble finding a machine to finish the assignment, please contact the instructor.
Here is the reference for more details about OAuth: Twitter OAuth
The authentication of API requests on Twitter is done through OAuth. Note that Twitter APIs can only be accessed by registered applications (e.g., the crawlers you will develop in this assignment). In order to register your application, you first need to have a Twitter account. If you already have one, you can just use it. If not, you can go ahead and create an account at Twitter. After that, you need to bind your Twitter account with the application you registered (i.e., your crawlers). Once you finish the binding process, you will get the keys and tokens for your application: a consumer key and consumer secret, and an access token and access token secret.
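Once you have the four credentials, wiring them into Tweepy is short. The following is a minimal sketch, assuming the classic Tweepy 3.x interface (`OAuthHandler`, `set_access_token`); the four key strings are placeholders that you must replace with the values from your registered application.

```python
# Placeholder credentials -- replace with the keys and tokens of your registered app.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"


def make_api():
    """Build an authenticated Tweepy API handle (requires tweepy and network access)."""
    import tweepy  # imported here so the sketch can be read without tweepy installed
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    return tweepy.API(auth)
```

Calling `make_api()` with real credentials returns the handle your crawlers will use for every request in the tasks below.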
Here are the main steps for the above registration and binding process:
This programming assignment requires Python 2.7.9* and one additional Python package, Tweepy. You can download Python 2.7.9 from Python Download and find more information about Python from www.python.org. If you are new to Python, here is a tutorial to get started: docs.python.org/2/tutorial/
* If you use other versions of Python, please make sure your code runs correctly and specify the version number of your Python in the README file you turn in.
The main steps of installing Tweepy are as follows:
Tweepy can be cloned from the Github repository using the commands below:
For MacOS/Linux User:
$ git clone http://github.com/tweepy/tweepy.git
Inside the home directory of Tweepy, execute:
$ python setup.py install
Or using easy install:
$ pip install tweepy
Either way gives you the latest version of Tweepy.
For Windows User:
You can download the source code over HTTP from github.com/tweepy/tweepy
Inside the home directory of Tweepy, execute:
$ python setup.py install
Note : Just for your information, Twitter enforces rate limits on its public API. You can find more information here: Twitter API Rate Limits
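If your crawler hits these limits, Tweepy can pause automatically until the rate-limit window resets instead of raising an error. This is a sketch assuming the Tweepy 3.x constructor flags `wait_on_rate_limit` and `wait_on_rate_limit_notify`:

```python
def make_patient_api(auth):
    """Build an API handle that sleeps through rate-limit windows instead of failing."""
    import tweepy  # imported here so the sketch can be read without tweepy installed
    return tweepy.API(auth,
                      wait_on_rate_limit=True,         # sleep until the limit resets
                      wait_on_rate_limit_notify=True)  # print a note while waiting
```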
Task 1 :
Given a list of user IDs, please write a data crawler to collect the users' profile information.
What to Turn In :
(1) A result file that contains the profile information of the Twitter users with the following IDs: 34373370, 26257166, 12579252.
(2) The source code of your crawler to finish this task.
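One possible shape for this crawler, assuming Tweepy 3.x and an authenticated `api` handle: `api.get_user` fetches one profile per request, and a separate pure helper formats the selected fields so the output logic can be checked without network access. The chosen fields and the tab-separated layout are just one option, not a required format.

```python
def format_profile(screen_name, name, followers, friends):
    """Pure formatting helper, kept separate so it can be tested offline."""
    return "%s\t%s\tfollowers=%d\tfriends=%d\n" % (screen_name, name, followers, friends)


def save_profiles(api, user_ids, out_path):
    """Fetch each user's profile via the REST API and write selected fields to a file."""
    with open(out_path, "w") as out:
        for uid in user_ids:
            user = api.get_user(user_id=uid)  # one profile per request
            out.write(format_profile(user.screen_name, user.name,
                                     user.followers_count, user.friends_count))
```

For this task you would call `save_profiles(api, [34373370, 26257166, 12579252], "task1.txt")`.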
The following is a snapshot of a user's social network information on Twitter (the user ID is 13334762).
Task 2 :
Given a list of user IDs, please write a data crawler to collect each user's social network information (i.e., the lists of screen names of the user's friends and followers).
What to Turn In :
(1) A result file that contains the social network information of the Twitter users with the following IDs: 34373370, 26257166, 12579252. Note: you only need to collect the first 20 followers and friends of a specified user in the result file.
(2) The source code of your crawler to finish this task.
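A sketch of one way to structure this crawler, assuming the Tweepy 3.x methods `api.followers` and `api.friends` (both return lists of user objects, 20 per page by default). The screen-name extraction is a pure helper so it can be tested without credentials.

```python
def first_n_screen_names(users, n=20):
    """Pure helper: return the screen names of the first n user objects."""
    return [u.screen_name for u in users[:n]]


def save_network(api, user_ids, out_path, n=20):
    """For each user ID, record the first n followers and friends by screen name."""
    with open(out_path, "w") as out:
        for uid in user_ids:
            followers = first_n_screen_names(api.followers(user_id=uid, count=n), n)
            friends = first_n_screen_names(api.friends(user_id=uid, count=n), n)
            out.write("%s\nfollowers: %s\nfriends: %s\n"
                      % (uid, ", ".join(followers), ", ".join(friends)))
```

For this task you would call `save_network(api, [34373370, 26257166, 12579252], "task2.txt")`.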
Hint: Twitter has two different APIs for applications to collect tweets: search API and streaming API. For this assignment, you might want to look at both and decide which one to use.
Task 3 :
Please write a data crawler to:
(1) collect tweets that contain one of the following two keywords: [Indiana, weather]
(2) collect tweets that originate from the geographic region around South Bend: [-86.33,41.63,-86.20,41.74]. Note that the coordinates correspond to the two diagonal corners of a rectangle, in the order: longitude of the southwest corner, latitude of the southwest corner, longitude of the northeast corner, latitude of the northeast corner. (Note that Google Maps normally takes a slightly different format, [latitude, longitude].)
What to Turn In :
(1) A result file of tweets that contain one of the above two keywords (50 tweets will be enough, and the result file size should be less than 1 MB).
(2) A result file of tweets that originate from the specified geographic region (50 tweets will be enough, and the result file size should be less than 1 MB).
(3) The source code of your crawler to finish this task.
Note: In the result files, you only need to record the text part of the tweet instead of the entire JSON response you get from your query.
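A sketch of the streaming approach, assuming the Tweepy 3.x interface (`StreamListener`, `Stream.filter` with `track=` or `locations=`). The `in_box` helper only documents how the bounding box is interpreted; the actual geographic filtering is done server-side when you pass `locations=`. The file name and tweet limit are placeholders.

```python
KEYWORDS = ["Indiana", "weather"]
SOUTH_BEND_BOX = [-86.33, 41.63, -86.20, 41.74]  # [west lon, south lat, east lon, north lat]


def in_box(lon, lat, box):
    """Pure helper: is the point (lon, lat) inside [west, south, east, north]?"""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def run_stream(auth, out_path, limit=50):
    """Collect tweet texts from the streaming API (requires tweepy and credentials)."""
    import tweepy  # imported here so the helpers above can be tested offline

    class TextListener(tweepy.StreamListener):
        def __init__(self):
            super(TextListener, self).__init__()
            self.count = 0
            self.out = open(out_path, "w")

        def on_status(self, status):
            self.out.write(status.text + "\n")  # record the text only, not the full JSON
            self.count += 1
            return self.count < limit  # returning False stops the stream

    stream = tweepy.Stream(auth, TextListener())
    # Keyword query for part (1); for part (2), call
    # stream.filter(locations=SOUTH_BEND_BOX) instead.
    stream.filter(track=KEYWORDS)
```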
/afs/nd.edu/coursesp.16/cse/cse40437.01/dropbox/YOURNAME/A1