Archiving Tweets: A Simple Way to Use PHP, MySQL, and the Twitter API to Save Stuff
NOTE: As of July 1st, TwapperKeeper now does nearly the same thing I am going to describe in this post. That’s ok—now you’ll understand the process and if you want to do it yourself instead of through a third-party app, you can. For mission-critical data (like dissertation research) you might want to do it yourself, not because there’s anything wrong with the TwapperKeeper folks (I don’t know them personally) but because, well, it’s your data that you need to be sure you can access.
Background
After the Twitter backchannel conversations during the Computers & Writing 2009 conference, Karl Stolley wrote a post asking everyone to save their tweets tagged with #cw09. His post provided instructions for the non-scripting way for archiving tweets: basically, to manually invoke and save the search results page, incrementing the page number until you hit the limit or no more results from your search. The outcome of this process would be a set of text files—still data, still stored, but not really in a format that can be quickly used. But some archiving is better than no archiving, of course.
As the Digital Humanities 2009 conference started and the conversations started flying fast and furious, I thought to myself (as did many others) “whoa! we have to save this stuff” knowing that Twitter only allows access to the last 1500 search results, and only for a 7 to 10 day period of time. That is to say, if the number of #dh09 tagged tweets hit 1500 or several days passed, we wouldn’t be able to get those tweets back from the depths of the Twitter databases. Someone set up a TwapperKeeper archive for #dh09, but at that time there wasn’t a way to export from TwapperKeeper. Following on the heels of #dh09 was THATCamp, and the Twitter backchannel nearly doubled that of #dh09. In other words, there was good stuff that simply had to be saved.
Mikhail Gershovich (@mikhailg) tweeted about TweetRipper, a quick little script that invokes PHP from the command line and yanks the date, user, and text for specifically-tagged tweets out to a text file. I grabbed the script, set up a database table, modified the script a bit (I kept their outside loop but just did different things in the middle), and started archiving tweets for both #cw09 and #dh09 (retroactively) and #thatcamp (going forward).
Below I discuss this script and how you can use it yourself to archive your own tweets. Or, you could just set up and use a TwapperKeeper now that they have export capabilities, but remember that if you control your data, then, well, you control your data.
These instructions require a server that has access to PHP and MySQL. Note that if you understand what’s going on conceptually, you can easily do this with any server-side scripting language and database.
Setting Up the Database Table
You can name your table and your fields anything you want, and you can save as much of the Twitter data from the API search method as you want. I saved everything except the profile_image_url. Call it an editorial decision.
“Everything” is the following: the tweet text, the Twitter ID and username of the user to whom the tweet was address (if an @reply), the Twitter ID and username of the person who sent the tweet, the Twitter ID of the tweet itself, the language encoding of the tweet, the source (Twitter client) used to send the tweet, and the time the tweet was created.
I created the following database table (note that the data type lengths indicated are not fine-tuned at all; in the interest of time I erred on the side of “leave a lot of room”):
CREATE TABLE tw_hashtag_search (
id bigint(20) NOT NULL auto_increment,
tw_hashtag varchar(32) default NULL,
tw_id bigint(20) default NULL,
tw_lang char(2) default NULL,
tw_source varchar(255) default NULL,
tw_text varchar(160) default NULL,
tw_created_at varchar(64) default NULL,
tw_to_user_id bigint(20) default NULL,
tw_to_user varchar(32) default NULL,
tw_from_user_id bigint(20) default NULL,
tw_from_user varchar(32) default NULL,
record_add_date datetime default NULL,
PRIMARY KEY (id),
UNIQUE KEY tw_id (tw_id)
);
The id field and the record_add_date field are for the ID of the record and the date the record was created in this table. In other words, independent of anything from Twitter. Also technically independent of anything from Twitter is the tw_hashtag field; that’s the field in which I store the hashtag used in the search—this table holds tweets tagged with #cw09, #dh09, and #thatcamp, and can hold tweets for any other hashtag search you want to run.
The other tw_* fields match up with data made available through the Twitter API (see above for list of “everything”). The tw_created_at field stores the date the tweet was created, and Twitter stores this as a string such as “Fri, 26 Jun 2009 12:56:06 +0000″. I decided to keep the raw data in this format. Note that you could use the PHP strtotime() function to change that data to a UNIX timestamp before you store it, or you could convert to a UNIX timestamp when you manipulate the data in however you eventually plan to do so. Personally, I like to archive data in its native format and then manipulate it when needed, but your mileage may vary.
Obtaining/Modifying the Script
This is a PHP script; like I said before, once you understand the logic involved you can easily replicate this in a scripting language of your choice if PHP isn’t your thing.
» Download the archive_tweets.zip archive
This zip archive contains the PHP script and a very basic README file. The PHP script is commented pretty clearly as to what is going on in it, but I’ll explain some things below.
Most importantly, the only change you have to make is to replace the database connection information with your own (hostname, username, password, database name). However, if you change use a different database table structure to store the tweets obviously you’ll need to modify the SQL query for record insertion so that it matches the table you’ve created.
Understanding Key Parts of the Script
The code itself is pretty clearly commented—there’s not a lot of magic to it—but below I will note some of the more interesting things involved.
This script is invoked via http; as I thought about the ways in which people would use it, and the access they might have to the server, this seemed the easiest thing to do. Note though that a simple change would allow it to be run on a command line (leave a comment if you want to know how). Since it’s set up to use http, we pass two arguments via a query string; these lines in the script give those $_GET variables some better names ($hashTag and $limit).
//sets up variable names from query string for internal use
$hashTag = $_GET['ht'];
$limit = $_GET['n'];
After these two lines there are two if clauses that check for the validity of the values. If there is no value for $hashTag then the script prints a message saying so and exits. If there is a value for $hashTag it continues, then checking for the number of tweets to return from the search. If there is no value for $limit or if the value given is not numeric, the script assumes 1500 (the Twitter limit), prints a message saying so, and continuess.
//checks for required values
if (empty($hashTag)) {
echo "No hashtag used.<br/>";
exit;
}
if ((empty($limit)) || (!is_numeric($limit))) {
echo "No numeric limit provided; assuming 1500.<br/>";
$limit = 1500;
}
The next six lines simply set up some variables to control the looping. Since this script uses the Twitter API to yank search results, and there are 100 results provided on each search “page,” this set of variables defines things like results limit, results per page, which page number to start with, and so on. Basically, what will happen is that the loop you will soon see will iterate for as many times as necessary to reach the number of tweets you will want to grab
After those variables comes the database connection information. In the second line below, replace HOSTNAME, USERNAME, PASSWORD, and DATABASE with your own connection information. Also note that this line (and the rest of the MySQL-related lines in the script) uses the mysqli_* family of functions. If you have to use the mysql_* functions just adjust accordingly (leave a comment if you are unsure what that means).
//connect to database; if you are using mysql_* family of functions, adjust accordingly throughout
$mysqli = @mysqli_connect("HOSTNAME", "USERNAME", "PASSWORD", "DATABASE");
After the database connection comes a SQL query to retrieve the maximum ID (from Twitter) of the tweets stored in the table for that hashtag. This will speed things up when we get to the loop-and-save process.
The next part of the script sets up the loop. As you can see by the comments in the script, the script will:
//Continue looping IF:
// - we have fewer results than requested
// - have not reached the end
// - have executed fewer than so many queries
Within the loop, the script gets the results of a query, and if there are results decodes the JSON and does something with the data. Namely, the script formats the data (where necessary) for inclusion in the database. In the example below you can see that we assign a variable name ($tw_text) to the element extracted from the JSON ($item->text) and we also use the addslashes() function to be sure that the data can be inserted cleanly into the database table.
//these 9 lines get 9 elements out of the JSON result and
//format them for inclusion in the database (where necessary)
$tw_id = $item->id;
$tw_lang = $item->iso_language_code;
$tw_source = addslashes($item->source);
$tw_text = addslashes($item->text);
$tw_created_at = $item->created_at;
$tw_to_user_id = $item->to_user_id;
$tw_to_user = $item->to_user;
$tw_from_user_id = $item->from_user_id;
$tw_from_user = $item->from_user;
But before issuing the SQL query to insert the record into the table, there’s a little if…else construct that checks the tweet ID against the maximum tweet ID stored in the table, for that hashtag. The idea here is that we don’t issue a query if we don’t have to (although since tw_id is set up as a unique field in the table, the query would simply fail and move on to the next one, but better to avoid that in the first place).
At both stages of the if…else test, the script prints a message that tells you what’s going on—either that a tweet is being added or that it is not.
if ($tw_id > $max_tw_id) {
//if the tweet id is greater than the max id of the stored tweets for that hashtag...
//print message that we are adding the tweet
echo "adding tweet /".$tw_from_user."/status/".$tw_id."<br/>";
//add tweet to database table
@mysqli_query($mysqli, "insert into tw_hashtag_search values
('', '".$hashTag."', '".$tw_id."', '".$tw_lang."', '".$tw_source."',
'".$tw_text."', '".$tw_created_at."', '".$tw_to_user_id."', '".$tw_to_user."',
'".$tw_from_user_id."', '".$tw_from_user."', now())");
} else {
//if the tweet id is less than the max id of the stored tweets...
//print message that it has already been stored
echo "tweet already stored: /".$tw_from_user."/status/".$tw_id."<br/>";
}
The script finishes up with closing the loop and, finally, printing a “done” message to the screen.
Usage
As I mentioned, this script is meant to be run via the web browser. The idea here was that people would not have access to the command line (as typical in virtual hosting/shared hosting environments) but they could put a script on their web server and bookmark it to run on-demand.
The URL you would use is as follows:
http://domain/path_to/archive_tweets.php?n=MAXNUMBER&ht=HASHTAG
Where domain and path_to should obviously be replaced with your own information. MAXNUMBER is some number of tweets to retrieve, less than 1500 (note that if you put a larger number in, the script will iterate but you won’t get anything from Twitter. HASHTAG is the search string (minus the hash, so cw09 instead of #cw09, for example).
*** Note! You could also add a cron job that invokes the script via cURL, or change the script slightly and add a cron job that invokes the script via the commandline. Leave a comment if you need to know how to do that. ***
The bookmarked-URL method is what I used, not only because it was just easier for me to say “oh hey, time to archive” and hit the script, but because I actually read the messages printed to the browser to see what was being added or how many tweets had already been added, and so on. That worked, because we’re talking about 2700 tweets over a two-day period (in the case of #thatcamp tweets), and not 5000 tweets per hour for the wildly popular trending topics (like #iranelection).
If you want to archive tweets for a trending topic, that would be the time to use a cron job to run the script automatically every 10-15 minutes or so (or more) unless you want to be beholden to your bookmarks all day long. But for the average person wanting to grab a snapshot of a search at a given moment or before the tweets expire, this method works just fine.
For more information on options, modifications, different formats (this script retrieves JSON results, could also grab atom results), or the Twitter API, leave a comment. I tried to make this as general and broadly useful as possible.
July 7, 2009
Tags: coding, tech geeky Posted in: Projects, Techie
Additional comments powered by BackType
![[Bloglines]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/bloglines.png)
![[del.icio.us]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/delicious.png)
![[Digg]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/digg.png)
![[Facebook]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/facebook.png)
![[Furl]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/furl.png)
![[Google]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/google.png)
![[Newsvine]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/newsvine.png)
![[Propeller]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/propeller.png)
![[Reddit]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/reddit.png)
![[Simpy]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/simpy.png)
![[Slashdot]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/slashdot.png)
![[Spurl]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/spurl.png)
![[Squidoo]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/squidoo.png)
![[StumbleUpon]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/stumbleupon.png)
![[Technorati]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/technorati.png)
![[Twitter]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/twitter.png)
![[Yahoo!]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/yahoo.png)
![[Email]](http://www.academicsandbox.com/blog/wp-content/plugins/bookmarkify/email.png)
11 Responses
Academic Sandbox (the blog) » Grab ‘em: Conference Tweet Archives - July 8, 2009
[...] tweets were stored using the process in this blog post. [view post citation] [hide post citation] MLA citation: Meloni, Julie. "Grab ‘em: [...]
Thanks so much for posting this (and for your other API posts)! Questions for us newbies: how do I create the database on my server? Do I save it as a particular file type and then upload it to a place on my server? If so, does it matter where I upload it? The only time I’ve created databases on my server is with a wizard which does it all for me. . . .
Thanks!
Since this is just one table you don’t need a whole separate database for it. If you have a database on your host that you use for general or miscellaneous tables then you can just add this one table to that database. Then again if the only databases you have on your virtual host or server are for WordPress installs or other systems that create their own databases, you’ll have to create a general purpose database to put this table in it. If your virtual host includes a standard control panel like cPanel or its ilk, there’s probably a button that says “MySQL” that gives you the option to create a new database. Once created, that database would then appear in your administrative interface to MySQL, if you have one, like phpMyAdmin.
Once you have a database (and username and password for it) set up, you can issue the SQL query (shown in the post, the text that begins “CREATE TABLE tw_hashtag_search”) to create the table. If you are using an administrative interface to your database (such as phpMyAdmin) you just go to the “SQL” tab and paste the content of the query there and hit “go” (or “enter” or whatever the button says – I can’t remember). Then you’ll have a table, in a database, for the script to dump its stuff into.
Bookmarks for 08/10/2009 — MK Anderson - August 10, 2009
[...] Academic Sandbox (the blog) » Archiving Tweets: A Simple Way to Use PHP, MySQL, and the Twitte… – [...]
Twitter Trackbacks for Academic Sandbox (the blog) » Archiving Tweets: A Simple Way to Use PHP, MySQL, and the Twitter API to Save [academicsandbox.com] on Topsy.com - August 25, 2009
[...] Academic Sandbox (the blog) » Archiving Tweets: A Simple Way to Use PHP, MySQL, and the Twitter API… http://www.academicsandbox.com/blog/?p=204 – view page – cached #Academic Sandbox (the blog) RSS 2.0 Feed Academic Sandbox (the blog) RSS Feed Academic Sandbox (the blog) Atom 0.3 Academic Sandbox (the blog) » Archiving Tweets: A Simple Way to Use PHP, MySQL, and the Twitter API to Save Stuff Comments Feed Academic Sandbox (the blog) Moving Day Twitter Visualization: THATcamp tweets Grab ‘em: Conference Tweet Archives — From the page [...]
Working with APIs (part 1) - August 31, 2009
[...] server-side scripting language at least to the extent that I describe in this how-to post regarding archiving tweets. But the Google Developer Resources provide many ways of accessing the various Google APIs without [...]
Thanks for posting this code snippet. It was a great starting point for a great idea and got me off to a good motivated start. Twitter here I come!
Thanks for this code! I was looking for somehting exactly like this. It works amazingly well! However, I do want automate it cos I want to track a few words over a period of several months. How can I add a cron job script to do so?
Much appreciated!
The process of adding a crontab entry depends on your hosting provider. In general, the crontab format is:
minute hour day month weekday command
The timed execution of your script would be up to you, and would depend on the frequency of the term(s) you’re tracking (keeping in mind the search limits based on number of results and length of time; IOW, once a day would be adequate as long as there aren’t thousands of results in that one day).
So, once per day at say 6pm would be:
* 18 * * * command
“command” is the next part: if you have command-line cURL available on your system, the command would be:
curl http://yourdomain/pathtoscript
Using cURL to invoke http allows you to keep the script as-is with regards to the GET query string, although it might simply be easier (if this is the only term you’d be tracking) just to hardcode the values. Similarly, If you do not have access to commandline cURL but you have access to a PHP binary you would likely just hardcode the values and the command would be:
/path/to/php /path/to/script
(bearing in mind you can pass arguments on the command line but then would have to change the script anyway)
Thank you so much! Will try it out!
Many thanks for this useful script,
For some reason the condition for the `for each` loop is not met with:
($stdClassResults->results as $item)
I`m getting “done”
but nothing is inserted into the DB
Any idea ?
Thanks
Leave a Reply