FAQ
This FAQ will benefit from your questions! If you have a question which is not answered here, do not hesitate to open an issue and request more information.
DMI-TCAT allows for the study of the full array of objects and subjects when doing Twitter research: disasters, events, political campaigns (and elections), social issues, conversations, revolutions, the 1% random sample, geotagged tweets, stock prices, celebrity awards, cities, and most recently the national. The DMI-TCAT tool, which can be downloaded and installed on a server (with GitHub instructions), allows for a battery of analyses of a tweet collection (also made via the Twitter APIs sourced by DMI-TCAT), from simple and experimental activity measures to those concerning mention, reply, hashtag, URL reference, retweet and other analytical opportunities. DMI-TCAT has been designed for data collection and analytical prep and may be used in conjunction with tabular data analysis tools and network analysis tools such as Gephi.
We attempt to lower the barrier of entry to Twitter research by providing a freely available platform built on publicly available data which requires little or no custom programming and scales to datasets of hundreds of millions of tweets using consumer hardware. DMI-TCAT provides robust and reproducible data capture and analysis, allows easy import and export of data, interlinks with existing analytics software, and guarantees methodological transparency by publishing the source code.
Our tool connects to Twitter using the tmhOAuth library to retrieve tweets via both the streaming API and the REST API.
- You can capture a one percent random sample of all tweets passing through Twitter.
- You can track tweets containing specific keywords in real time.
- You can track tweets from specific locations in real time.
- You can follow tweets from a specified set of up to 5000 users.
- You can search for tweets up to about a week old. (Keep in mind that the REST API explicitly omits an unknown percentage of tweets.)
- You can get the last 3200 tweets for each user in a set.
- You can reconstruct a dataset from a list of ids and can export such a list as well. Along the same lines, we provide import scripts for yourTwapperKeeper databases or a set of Twitter JSON files captured by other means (e.g. streamR).
A collection is defined in a so-called “query bin”, i.e. all tweets belonging to a set of queries, a set of users, or a "one percent" sample. Depending on the CAPTURE_ROLES defined in your config.php you will be able to 'track' tweets based on a set of keywords, 'follow' users based on a set of user ids, or retrieve a 'one percent' sample from the Twitter API. For example, a 'track' bin like [globalwarming, 'global warming', #IPCC] would retrieve all tweets containing one of these three query elements and combine them into a single dataset, stored as a group of related tables in the database.
You can create and modify query bins at BASE_URL/capture/index.php. Following is an explanation of the types of queries one can make.
DMI-TCAT allows for three types of 'track' queries:
- a single word/hashtag. Note that Twitter does not do partial matching on words, i.e. [twitter] will get tweets with [twitter] and [#twitter] but not [twitteraddiction].
- two or more words: works like an AND operator, i.e. [global warming] will find tweets that have both [global] and [warming] in any position in the tweet, e.g. "life is global but not warming".
- exact phrases: ['global warming'] will get only tweets with the exact phrase. Beware, however, that due to how the streaming API works, tweets are captured in the same way as for two or more words (the previous item), and tweets that do not match the exact phrase are then thrown away. This means that you will request many more tweets from the Twitter API than you will see in your query bin, thus increasing the possibility that you will hit a rate limit. E.g. if you specify a query like ['are we'], all tweets matching both [are] and [we] are retrieved, while DMI-TCAT only retains those with the exact phrase ['are we'].
See the track parameter documentation for how the Twitter streaming API matches keywords.
You can track a maximum of 400 queries at the same time (for all query bins combined) and the total volume should never exceed 1% of global Twitter volume, at any specific moment in time.
Example bin: [globalwarming,global warming,'climate change']
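The three query types above can be illustrated with a minimal Python sketch. It is a simplified model for reasoning about what ends up in a bin, not DMI-TCAT's or Twitter's actual implementation: the function names are hypothetical, and the tokenisation rule (split on non-word characters, with hashtags also counting as plain words) is an assumption.

```python
import re

def tokens(text):
    """Lowercased tokens of a tweet; '#tag' and '@name' also count as the bare word."""
    found = re.findall(r"[#@]?\w+", text.lower())
    return set(found) | {t.lstrip("#@") for t in found}

def matches_track(query, text):
    """Return True if a tweet would be retained for one track query (simplified)."""
    query = query.strip().lower()
    if len(query) >= 2 and query[0] == "'" and query[-1] == "'":
        phrase = query[1:-1]
        # Step 1: the stream delivers tweets containing ALL words of the phrase.
        if not set(phrase.split()) <= tokens(text):
            return False
        # Step 2: only tweets containing the exact phrase are kept.
        return phrase in text.lower()
    # One word, or several words combined with AND; no partial word matching.
    return set(query.split()) <= tokens(text)

print(matches_track("twitter", "I love #Twitter"))
print(matches_track("twitter", "my twitteraddiction"))
print(matches_track("global warming", "life is global but not warming"))
print(matches_track("'global warming'", "life is global but not warming"))
```

The last two calls show why exact phrases cost more than they return: the tweet is delivered for the AND match but discarded by the phrase filter.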
For a follow bin you can only specify a comma-separated list of user IDs, indicating the users whose Tweets should be captured. See the follow parameter documentation for what tweets will be collected using this method. Note that you can only follow a maximum of 5000 user ids at the same time (for all query bins combined).
Example bin: [1304933132,1286333395,856010760,381660841,381453862,224572743]
The one percent sample returns a small random sample of all public statuses. The Tweets returned by the default access level are the same, so if two different clients connect to this endpoint, they will see the same Tweets.
DMI-TCAT allows you to track Tweets from one or multiple areas in the world. As specified by the documentation, the query has a strict format:
- Each bounding box should be specified as two longitude,latitude pairs, starting with the southwest corner of the bounding box, followed by the northeast corner. This makes four coordinates per area (sw lng, sw lat, ne lng, ne lat). Every coordinate is separated by a comma.
- Track multiple areas by adding another set of four coordinates, again separated by commas. No other delimiters are required.
- A tweet will be stored in a particular bin if its coordinates are contained in at least one of the areas you specified for that bin.
- Twitter does not just return the Tweets with explicit GEO coordinates. If a user has set 'use my location' in his or her preferences, Twitter may also decide to use IP addresses or other measures to determine the location. The third option is for a user to attach a Tweet to a specific place. If such a place is within the bounding boxes which you defined, it will be stored.
You can track a maximum of 25 geoboxes at the same time (for all query bins combined) and the total volume of all your queries should never exceed 1% of global Twitter volume (at any specific moment in time).
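To check a geo query before entering it, the format rules above can be sketched in Python. This is an illustrative helper, not part of DMI-TCAT: the function names and the example coordinates are assumptions, and the simple ordering check does not handle boxes that cross the 180° meridian.

```python
MAX_GEOBOXES = 25  # limit stated above, for all query bins combined

def parse_geoboxes(query):
    """Split 'sw_lng,sw_lat,ne_lng,ne_lat,...' into a list of 4-tuples."""
    values = [float(v) for v in query.split(",")]
    if len(values) % 4 != 0:
        raise ValueError("each area needs exactly four coordinates")
    boxes = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    if len(boxes) > MAX_GEOBOXES:
        raise ValueError("too many geoboxes")
    for sw_lng, sw_lat, ne_lng, ne_lat in boxes:
        # The southwest corner must come first.
        if not (sw_lng < ne_lng and sw_lat < ne_lat):
            raise ValueError("southwest corner must precede northeast corner")
    return boxes

def in_any_box(lng, lat, boxes):
    """True if the point lies inside at least one bounding box."""
    return any(sw_lng <= lng <= ne_lng and sw_lat <= lat <= ne_lat
               for sw_lng, sw_lat, ne_lng, ne_lat in boxes)

# Two illustrative areas: roughly Amsterdam and roughly San Francisco.
boxes = parse_geoboxes("4.73,52.29,5.07,52.43,-122.75,36.8,-121.75,37.8")
print(len(boxes), in_any_box(4.9, 52.37, boxes), in_any_box(0.0, 0.0, boxes))
```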
You can retrieve tweets up to 7 days ago via Twitter's REST API. Note that not all tweets are made available via the REST API and that Twitter does not specify which tweets are dropped or why.
To search tweets via DMI-TCAT, edit capture/search/search.php and fill in the $bin_name to reflect the query bin where you want to store your tweets. Next fill in the $keywords which you want to track (note that instead of specifying a comma-separated list, in this script the keywords should be separated by OR). Now, make sure that your config.php defines some extra Twitter API keys. You can only do a limited number of requests with one key. If you specify multiple keys, the script will take the next key when the previous one has reached its limit (4 additional keys seem to work fine).
When all is set up, you can run the script as follows: cd capture/search; php search.php. If the query bin already exists, it will ask you whether you are sure you want to capture tweets into the same query bin. If so, enter 'yes'.
You can search a maximum number of 10 keywords at a time. If you have more than 10 keywords, just run the script multiple times with new keywords.
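If you have a long keyword list, a small helper can prepare the OR-separated strings, ten keywords at a time, for successive runs of the search script. The sketch below is a hypothetical convenience written in Python; it is not shipped with DMI-TCAT, and you would still paste each resulting string into $keywords yourself.

```python
def keyword_batches(keywords, batch_size=10):
    """Group keywords into OR-separated strings of at most batch_size terms."""
    cleaned = [k.strip() for k in keywords if k.strip()]
    return [" OR ".join(cleaned[i:i + batch_size])
            for i in range(0, len(cleaned), batch_size)]

# Twelve illustrative keywords yield two runs of the search script.
terms = ["kw%d" % n for n in range(1, 13)]
for batch in keyword_batches(terms):
    print(batch)
```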
You can retrieve the latest 3200 tweets for a (set of) user(s) via Twitter's REST API.
- Open an ssh connection to the server and open a console in screen.
- Make sure that your config.php defines some extra Twitter API keys. You can only do a limited number of requests with one key. If you specify multiple keys, the script will take the next key when the previous one has reached its limit (4 additional keys seem to work fine).
- Then, edit /var/www/dmi-tcat/capture/user/timeline.php and add the full list of user ids or user names to the array $user_ids (e.g. $user_ids = array(517355521,587438083,332263886);). Make sure to also specify a bin name (e.g. $bin_name = "buethastum";). Then just save and close that file.
- Run the script: cd /var/www/dmi-tcat/capture/user/; php timeline.php and wait until it finishes.
We currently provide various ways to import externally retrieved data into DMI-TCAT.
The yourTwapperKeeper import (ytk) script will parse the tweets captured by ytk and extract hashtags and mentions. As ytk did not capture all available fields from Twitter's API, you will not be able to do such analyses as 'Social graph by in_reply_to_status_id' and 'geo location exports'.
To import data from ytk, go to dmi-tcat/import and edit the file import-yourTwapperKeeper.php. On lines 11 to 22 you will have to specify database credentials for both databases, as well as the ytk table from which you would like to import the data (e.g. z_24) and the DMI-TCAT query bin in which you would like to import the data (e.g. ytk_z_24). When you have finished editing, save the file and run it as follows: php import-yourTwapperKeeper.php. When the script has completed, the analysis and capture web interfaces will show the imported data set. Repeat the previous steps for each ytk table you would like to import.
The import-jsondump.php script imports a series of JSON files (retrieved via e.g. StreamR) into DMI-TCAT.
To import the data, go to dmi-tcat/import and edit the file import-jsondump.php. On line 10 you will have to provide the name of the query bin in which you would like to store the data. On line 12 you will have to specify the directory containing the JSON files you would like to import. On line 14 you can specify the way in which the data was retrieved (via the track or follow API) and on line 16 you can specify the queries with which the data was retrieved. When you have finished editing, save the file and run it as follows: php import-jsondump.php. When the script has completed, the analysis and capture web interfaces will show the imported data set.
We have an experimental script to import data from Gnip. If you would like to use it, please open a new issue and we will send it to you.
Some users have CSV files containing references to tweets (by ID or URL) and want to use TCAT to analyse their dataset. TCAT can refetch those tweets from the Twitter API, to make the full set of analysis features available for them.
The lookup.php script in the capture/ids directory executes these lookups. It requires as input a flat file with only numeric tweet IDs on single rows. Please edit the required parameters (such as bin name) in the header of this file before running it.
The parse-csv.php script in the capture/ids directory can help you pre-parse your CSV files to be used by the lookup.php script above. It attempts to auto-recognize the CSV file structure and find the column containing Twitter URLs. The following format is expected:
https://twitter.com/micania/status/616148971988348928
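For readers who want to see what turning such URLs into the flat list of numeric IDs expected by lookup.php involves, here is a minimal Python sketch. It does not reproduce the logic of parse-csv.php; the regular expression and the function name are assumptions based only on the URL format shown above.

```python
import re

# Matches status URLs of the form shown above and captures the numeric tweet ID.
STATUS_URL = re.compile(r"https?://(?:www\.|mobile\.)?twitter\.com/\w+/status(?:es)?/(\d+)")

def tweet_ids(lines):
    """Return the tweet IDs found in the given lines, one per matching line."""
    ids = []
    for line in lines:
        match = STATUS_URL.search(line)
        if match:
            ids.append(match.group(1))
    return ids

rows = [
    "2015-07-01,https://twitter.com/micania/status/616148971988348928,some note",
    "a row without a tweet URL",
]
# One numeric ID per row, as lookup.php expects.
print("\n".join(tweet_ids(rows)))
```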
Often, a subset of your tweets can no longer be retrieved from Twitter (the output of the lookup.php script will indicate which tweets could not be looked up). There may be several reasons for this:
- The user has deleted her account or this particular tweet
- Twitter has deleted the account or tweet
- The user has started protecting her account, making all her tweets inaccessible
- The tweet is censored in the country you are accessing Twitter from
See the import and export scripts in dmi-tcat/helpers.
Export queries and data for all query bins in TCAT.
The export script can either export just the queries or both the queries and the data from the bins.
-a will export query phrases AND data; -s will export only the queries (e.g. track keywords or user ids). The default is to export both query phrases AND data.
- To export queries and data from all bins, run e.g. php export.php -a
- To export the structure (i.e. queries only, no data) for all query bins in TCAT, use php export.php -s
You can also list the bins you want to export; for example, use php export.php foobar to export queries and data for the query bin "foobar". To export multiple specified bins, list them with spaces between them, like this: php export.php foo bar baz
By default the export is saved in the "analysis/cache" directory under the DMI-TCAT install directory (usually: /var/www/dmi-tcat), but the output file can be specified by using -o like this: php export.php -o myexportfile.sql.gz
The import script requires just one argument: the file location of your export dump, ending with .gz. To import data, run e.g. php import.php flowers.gz
DMI-TCAT enables the flexible constitution of a subsample. After selecting a dataset, as defined by a query bin, different techniques to filter the dataset are available. By sub-selecting from the dataset, you can zoom in on a specific time period or on tweets matching certain criteria. You can choose to:
- include only those tweets matching a particular phrase such as a word, hashtag or mention
  - By default DMI-TCAT searches for words as parts of a string (i.e. the search 'global' will match 'globally' too).
  - DMI-TCAT has a bracket notation to specify exact searches. E.g. if you are looking for the term 'global' and you don't want TCAT to also return hits for the word 'globally', you need to enter your query as follows: [ global ]. This will return tweets which first have a space, then global, then another space. If you want all tweets which contain a word starting with 'global', you can use [ global]. If you want all tweets with words ending in 'ly', you can use [ly ].
- exclude tweets matching a specific phrase
  - E.g. to exclude retweets you can use the bracket notation in the exclude field: [RT @] OR [RT: @]
- focus on tweets by particular users or tweets mentioning a specific (part of a) URL
- focus on tweets sent by a particular user client (e.g. 'tweetdeck')
- get tweets which were sent from a specific geographic area
  - You can use polygon drawing on Google Maps to get the coordinates: http://www.birdtheme.org/useful/v3tool.html > polygon > copy the part in 'live code presentation' between the tags > put in the right format (remove ',0.0', replace commas by spaces, replace line breaks by commas).

All input fields accept multiple phrases or keywords to specify (AND) or expand (OR) the selection via Boolean queries. After updating the overview, a summary of the selection is generated.
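The difference between a plain search and the bracket notation described above can be illustrated with a short Python sketch. It models the behaviour as described in this FAQ (a bracketed query is matched literally, including its spaces); the function name is hypothetical, case-insensitive matching is an assumption, and the real filtering happens in DMI-TCAT's database queries rather than in code like this.

```python
def matches_filter(query, text):
    """Substring match; square brackets keep leading and trailing spaces literal."""
    if query.startswith("[") and query.endswith("]"):
        needle = query[1:-1]      # e.g. "[ global ]" becomes " global "
    else:
        needle = query.strip()    # plain search: match as part of a string
    return needle.lower() in text.lower()

tweet = "acting globally now"
print(matches_filter("global", tweet))      # part-of-string match
print(matches_filter("[ global ]", tweet))  # requires spaces on both sides
print(matches_filter("[ global]", tweet))   # word starting with 'global'
print(matches_filter("[ly ]", tweet))       # word ending in 'ly'
```

One consequence of the literal reading is that [ global ] would not match a tweet that begins or ends with the word, since there is no surrounding space at that position.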
- Tweet statistics and activity metrics include overall activity metrics, frequency lists of hashtags, mentions, URLs, hosts, etcetera. All these statistics can be calculated per hour, day, week, month, year or for a custom interval.
- Tweet exports allow you to retrieve random samples of your selection, lists of retweets, etcetera.
- Network exports include mention and reply networks, co-hashtag graphs, and various bipartite graphs.
- Experimental modules include the cascade, sentiment analysis, and associational profiler.
- Exports of potential gaps in your data show you when TCAT was not running and help you verify the completeness of your dataset.
- Exports of rate limit estimates give you an educated guess(*) as to how many tweets were lost in your dataset due to TCAT reaching Twitter's API rate limit of 1% of all traffic.
(*) This estimate equals the number of tweets rate-limited in the queried interval multiplied by the portion your queried phrases represented within all queried phrases during that interval.
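The footnote above can be read as a simple proportion. The following Python sketch works through it with made-up numbers; the function and parameter names are hypothetical, and interpreting "portion" as the count of your bin's phrases divided by the count of all tracked phrases is an assumption.

```python
def estimate_lost_tweets(ratelimited, bin_phrases, all_phrases):
    """Estimated tweets lost for one bin: rate-limited count times the bin's share of phrases."""
    if all_phrases <= 0:
        raise ValueError("all_phrases must be positive")
    return ratelimited * bin_phrases / all_phrases

# Illustrative numbers: 1000 tweets were rate-limited in the interval,
# and the bin accounts for 3 of the 12 phrases tracked at that time.
print(estimate_lost_tweets(1000, 3, 12))  # 250.0
```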
We also have a video tutorial that takes you through the various analytical modules.
When an analysis has finished, a link to the resulting CSV or TSV file will be displayed. Such files are text files in which the data columns are separated by a tab (sometimes referred to as '\t') or by a comma. You will have to save the text file to your computer: right-click the link and then click save as. Now you can open the file in a spreadsheet program or OpenRefine. To import the text file, have a look at the following guides. Remember that the data is tab ('\t') or comma (,) separated.
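If you prefer to load an export in a script rather than a spreadsheet, the tab-or-comma distinction can be handled as in the Python sketch below. It is a minimal example using the standard csv module; the function name and the column names in the sample data are illustrative and do not reflect the actual export headers.

```python
import csv
import io

def read_export(text):
    """Parse exported text into a list of dicts, detecting tab vs. comma from the header."""
    header = text.splitlines()[0] if text else ""
    delimiter = "\t" if "\t" in header else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Illustrative tab-separated sample; real exports have their own column names.
sample = "id\ttext\n1\thello, world\n"
rows = read_export(sample)
print(rows[0]["text"])
```

For a downloaded file, read it with open(path, newline="", encoding="utf-8") and pass the contents to the same function.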
Both the GDF and GEXF exports can be opened with Gephi. We made a Gephi tutorial for working with DMI-TCAT's mention networks: http://www.youtube.com/watch?v=snPR8CwPld0
DMI-TCAT is written in PHP and organized around a MySQL database positioned between the capture and analysis parts of the system. Data are retrieved by different modules controlled in regular intervals by a supervisor script (using the cron scheduler present in all Unix-like operating systems), which checks whether the capturing processes are running and, if necessary, restarts them. A separate script translates shortened URLs. Database contents are analyzed in a two-stage process: the selection of a subsample precedes the application of various analytical techniques.
We are currently running DMI-TCAT on a cheap Linux machine with four processor cores, a 512GB SSD, and 32GB of RAM, using the default LAMP stack. At the time of writing, we have captured over 700 million tweets and basic analyses for even the largest query bins – over fifty million tweets in a single dataset – generally complete in under a minute, allowing for iterative approaches to analysis. More complex forms of analysis, such as the creation of mention networks, can take several minutes to complete, however.
See our installation guide.
Yes, of course you can. Just choose either a Debian or an Ubuntu installation image and run the appropriate auto-installation script as described in the installation guide. Pay special attention to Security Groups, however. Amazon firewalls the HTTP (web traffic) port by default. You have to explicitly open the HTTP port in the Amazon interface.
See our upgrade guide.
There could be many different reasons for your TCAT installation to stop functioning. Many of those may be related to server operation. Please read our guidelines for investigating and reporting on TCAT issues before contacting us.
We have written an academic paper about DMI-TCAT. It explains the rationale behind DMI-TCAT and introduces the various components of our toolset.
When you use DMI-TCAT in an academic research project, please reference our paper about DMI-TCAT: Erik Borra, Bernhard Rieder, (2014) "Programmed method: developing a toolset for capturing and analyzing tweets", Aslib Journal of Information Management, Vol. 66 Iss: 3, pp.262 - 278.
DMI-TCAT was originally developed for humanities and social science scholars studying Twitter data at the University of Amsterdam (the Netherlands). It has also been used in research projects at Aalborg Universitet (Copenhagen, Denmark), Autonomous University of Barcelona (Spain), Boston University (Boston, MA, USA), ELISAVA (Barcelona School of Design and Engineering, Spain), Eurecat (Barcelona, Spain), Goldsmiths University (London, UK), Hebrew University of Jerusalem (Jerusalem, Israel), Hogeschool van Amsterdam (the Netherlands), IT University of Copenhagen (Denmark), King's College London (UK), Médialab Sciences Po (Paris, France), Saint Joseph's University (Philadelphia, PA, USA), Université de Picardie Jules Verne (France), University of Calgary (Canada), University of Texas at Austin, University of Twente (Enschede, the Netherlands), University of Warwick (Coventry, UK), Yale University (New Haven, USA), and the ARC Centre of Excellence for Creative Industries and Innovation (Queensland University of Technology, Australia). DMI-TCAT is also available on the Australian National eResearch Collaboration Tools and Resources project (Nectar), where Australian researchers can have it installed on a virtual machine with just one click. We'd love to hear about others using DMI-TCAT!