SocMap

Quickstart

Launch SocMap as follows:

$ ./socmap.py --authfile auth.txt --userlist userlist.txt --layers 2

Here, auth.txt contains your Twitter API credentials (see the Twitter Authentication section below), userlist.txt contains a list of seed usernames (one per line), and --layers 2 tells SocMap to expand the search two layers out from the seeds.

When SocMap finishes, it writes output files to the map folder, with names like layer2.gml. These text files contain the graph information and can be visualized in Gephi, Cytoscape, and other network analysis tools.

Usage

usage: socmap.py [-h] [-c] [-l LAYERS] [-n NUMTWEETS] [-M MAXREFERENCES]
                 [--ignorementions | --ignoreretweets] [-w WORKDIR]
                 [-t TWEETDIR] [-m MAPDIR] -a <file> -u <file> [-L <file>]
                 [-d]

A Framework for Social-Network Mapping

optional arguments:
  -h, --help            show this help message and exit
  -c, --compress        Compress downloaded tweets with GZIP
  -l LAYERS, --layers LAYERS
                        How many layers out to download
  -n NUMTWEETS, --numtweets NUMTWEETS
                        How many tweets to download from each user
  -M MAXREFERENCES, --maxreferences MAXREFERENCES
                        Maximum number of retweeted and mentioned users to
                        track per user
  --ignorementions      Do not follow mentions during mapping
  --ignoreretweets      Do not follow retweets during mapping
  -w WORKDIR, --workdir WORKDIR
                        Where to store temporary files
  -t TWEETDIR, --tweetdir TWEETDIR
                        Where to store downloaded tweets
  -m MAPDIR, --mapdir MAPDIR
                        Where to store map data
  -a <file>, --authfile <file>
                        File containing consumer keys and access tokens
  -u <file>, --userlist <file>
                        File containing list of starting usernames
  -L <file>, --logfile <file>
                        Where to store log data relative to workdir (default
                        stdout)
  -d, --debug           Enable debug-level logging

Twitter Authentication

Apps cannot connect to Twitter using a username and password - they must connect using an authentication token.

Follow Twitter’s full guide, or use our abbreviated steps:

  1. Log in to your Twitter account at apps.twitter.com
  2. Create a new app
  3. Select the new app and navigate to the “Keys and Access Tokens” panel
  4. Copy the “consumer key”, “consumer secret”, “access token”, and “access token secret”
  5. Put them in a text file (we used auth.txt in the example above), one per line, in the order listed in step 4
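
The file layout from the last step can be read back with a few lines of Python. This is a minimal sketch, not SocMap's actual loader: the names Credentials and read_auth_file are hypothetical, and it assumes the file holds exactly the four values in the order given above, with blank lines ignored.

```python
from collections import namedtuple

# Hypothetical container; field order matches the order the values are copied in.
Credentials = namedtuple(
    "Credentials",
    ["consumer_key", "consumer_secret", "access_token", "access_token_secret"],
)


def read_auth_file(path):
    """Read four credential values, one per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        values = [line.strip() for line in handle if line.strip()]
    if len(values) != 4:
        raise ValueError("expected 4 credential lines, found %d" % len(values))
    return Credentials(*values)
```

Failing early on a wrong line count makes a misordered or truncated file easier to spot than an authentication error later on.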

Directories

SocMap keeps data in three directories. By default, these are:

  • tweets - stores all collected tweets
  • map - stores completed maps for each layer of data collection
  • work - stores temporary data for tracking information between layers of data collection

All directories can be changed with command line options:

  -w WORKDIR, --workdir WORKDIR
                        Where to store temporary files
  -t TWEETDIR, --tweetdir TWEETDIR
                        Where to store downloaded tweets
  -m MAPDIR, --mapdir MAPDIR
                        Where to store map data

Directories will be created automatically if they do not already exist.
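
As an illustration of this behavior, the sketch below creates the three directories if they are missing. It is not taken from SocMap's source: the function name ensure_directories and the dictionary keys are hypothetical, and only the default directory names come from the list above.

```python
import os

# Default directory names from the list above; the keys are illustrative.
DEFAULT_DIRS = {"tweetdir": "tweets", "mapdir": "map", "workdir": "work"}


def ensure_directories(overrides=None):
    """Create each data directory if missing and return the resolved paths."""
    paths = dict(DEFAULT_DIRS)
    paths.update(overrides or {})
    for path in paths.values():
        os.makedirs(path, exist_ok=True)  # no error if it already exists
    return paths
```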

How is Data Stored?

Tweets collected from each user are stored as JSON, and may optionally be compressed with GZIP using -c or --compress.
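
Because compression is optional, a script that reads the tweet files back may need to handle both forms. The following sketch shows one way to do that. It assumes each file holds a single JSON document, which this page does not confirm, and load_tweets is a hypothetical helper rather than part of SocMap.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_tweets(path):
    """Parse a tweet file as JSON, decompressing first if it is gzipped."""
    with open(path, "rb") as raw:
        compressed = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Checking the magic bytes instead of the file extension avoids guessing how SocMap names compressed files.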

Maps of Twitter communities are stored as GML files, a format that can be read by Gephi, Cytoscape, and NetworkX.

We store the following information about a user in their node on the graph:

  • name - Their Twitter username
  • retweeted - Whether the user is included because they were retweeted by another user
  • mentioned - Whether the user is included because they were mentioned by another user
  • layer - How many hops away from the original seed user this user is

We store the following information about a connection in each edge on the graph:

  • retweeted - How many times the source retweeted the destination
  • mentioned - How many times the source mentioned the destination
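
To show how these node and edge attributes relate to layered collection, here is a small breadth-first sketch using plain dictionaries. It is an illustration under assumptions, not SocMap's crawler: get_references is a hypothetical callback standing in for the Twitter download step, seeds are treated as layer 0, users in the outermost layer are not expanded, and the retweeted/mentioned flags are set for any user who is referenced, including seeds.

```python
from collections import deque


def build_map(seeds, get_references, layers):
    """Expand `layers` hops out from the seeds and return (nodes, edges).

    get_references(user) is a hypothetical callback returning two dicts:
    {username: retweet_count} and {username: mention_count}.
    """
    nodes = {s: {"name": s, "retweeted": False, "mentioned": False, "layer": 0}
             for s in seeds}
    edges = {}  # (source, destination) -> {"retweeted": n, "mentioned": n}
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        layer = nodes[user]["layer"]
        if layer >= layers:
            continue  # assumption: the outermost layer is not expanded
        retweets, mentions = get_references(user)
        for kind, counts in (("retweeted", retweets), ("mentioned", mentions)):
            for other, count in counts.items():
                if other not in nodes:
                    nodes[other] = {"name": other, "retweeted": False,
                                    "mentioned": False, "layer": layer + 1}
                    queue.append(other)
                nodes[other][kind] = True
                edge = edges.setdefault((user, other),
                                        {"retweeted": 0, "mentioned": 0})
                edge[kind] += count
    return nodes, edges
```

Because the queue is processed in first-in, first-out order, each user's layer is the smallest number of hops from any seed.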

Limitations

Twitter places significant limitations on how much information we can access. In general, SocMap can only see about the last 2000 tweets from any user. This means we will only see recent mentions and retweets between accounts, and can only say how many times a user mentioned or retweeted another within our limited data set.

Twitter also enforces strict rate limits on API usage. When SocMap is rate limited, it blocks until the rate limit period is over, then resumes collection. As a result, for a moderate dataset (~10,000 users) it is not unusual for SocMap to take several days to download data.
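
The blocking behavior can be pictured as a simple wait until the limit resets. The sketch below is hypothetical: block_until_reset is not a SocMap function, and it assumes the API client reports the reset time as a Unix timestamp (reset_epoch), which this page does not specify.

```python
import time


def block_until_reset(reset_epoch, now=time.time, sleep=time.sleep):
    """Sleep until `reset_epoch` (Unix seconds) and return the seconds waited."""
    delay = max(0.0, reset_epoch - now())
    if delay > 0:
        sleep(delay)  # collection resumes once this returns
    return delay
```

Passing now and sleep as parameters lets the wait be exercised in a test without actually sleeping.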

You can reduce the download time by ignoring either mentions or retweets during data collection, or by placing a limit on how many references per user to track:

  -M MAXREFERENCES, --maxreferences MAXREFERENCES
                        Maximum number of retweeted and mentioned users to
                        track per user
  --ignorementions      Do not follow mentions during mapping
  --ignoreretweets      Do not follow retweets during mapping

For example, -M 100 --ignorementions tracks at most 100 retweeted accounts per user and ignores mentions entirely.
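
The usage line lists the two ignore flags as [--ignorementions | --ignoreretweets], which is how argparse displays mutually exclusive options, so only one of them can be given per run. The sketch below rebuilds just these three options to illustrate that. It is a reconstruction from the help text, not SocMap's source, and parsing -M as an integer is an assumption.

```python
import argparse


def build_parser():
    """Rebuild only the three options discussed above (not the full CLI)."""
    parser = argparse.ArgumentParser(prog="socmap.py")
    parser.add_argument("-M", "--maxreferences", type=int,
                        help="Maximum number of retweeted and mentioned "
                             "users to track per user")
    # The usage line shows these two as alternatives: [a | b].
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--ignorementions", action="store_true",
                       help="Do not follow mentions during mapping")
    group.add_argument("--ignoreretweets", action="store_true",
                       help="Do not follow retweets during mapping")
    return parser


args = build_parser().parse_args(["-M", "100", "--ignorementions"])
print(args.maxreferences, args.ignorementions, args.ignoreretweets)
# prints: 100 True False
```

With this parser, passing both --ignorementions and --ignoreretweets exits with a usage error.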

Logging

By default, SocMap logs to stdout. You can change this behavior by specifying the path for a logfile with -L or --logfile. The logfile is created relative to the workdir. For example:

$ ./socmap.py -a auth.txt -u userlist.txt -L log.txt

This creates a logfile at ./work/log.txt and writes log messages there instead of to stdout.

You can increase the level of logging with -d or --debug to enable debug-level log messages, which are suppressed by default.
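
The options above map naturally onto Python's logging module. The following is a sketch of one possible setup, not SocMap's code: configure_logging and the logger name are hypothetical, and INFO as the default level is an assumption (this page only says debug messages are suppressed by default).

```python
import logging
import os
import sys


def configure_logging(workdir="work", logfile=None, debug=False):
    """Return a logger that writes to stdout, or to workdir/logfile if given."""
    logger = logging.getLogger("socmap_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if logfile:
        os.makedirs(workdir, exist_ok=True)
        # The logfile path is resolved relative to the work directory.
        handler = logging.FileHandler(os.path.join(workdir, logfile))
    else:
        handler = logging.StreamHandler(sys.stdout)
    logger.handlers = [handler]  # replace any handler from an earlier call
    return logger
```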

Analysis Tools

SocMap comes with a number of ancillary tools for analyzing downloaded Twitter data. These include:

  • splitRetweetsAndMentions.py - Splits the graph into graphs of retweet relationships and mention relationships, to be analyzed separately
  • sortNodeDegrees.py - Displays a list of users in a network sorted by degree
  • removeLowDegreeNodes.py - Prunes nodes with a degree below a specified threshold, to shrink network maps until visualization tools can process them
  • getInsularity.py - For a list of users, returns what percentage of retweets by users on the list are retweets of other users on the list
  • mergeMaps.py - Combines two map files, creating a union of users and social relationships
  • pruneUsers.py - Removes a list of users from a map, and leaves only users reachable from seeds
  • pruneTweets.py - Removes users with a low level of activity from a map
  • pruneRetweets.py - Removes links between users unless there are a sufficient number of retweets
  • pruneMentions.py - Same as pruneRetweets.py, but for mentions
  • searchTweets.py - Searches through downloaded tweets for a regular expression
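
As a worked example of one of these measures, the insularity described for getInsularity.py can be computed from retweet counts alone. The sketch below follows that one-line description and is not the tool's implementation: the insularity function, the edge dictionary layout, and the choice to return 0.0 when the listed users have no retweets are all assumptions.

```python
def insularity(edges, users):
    """Percentage of retweets by `users` that target other listed users.

    `edges` maps (source, destination) to a retweet count.
    """
    members = set(users)
    total = inside = 0
    for (source, destination), count in edges.items():
        if source not in members:
            continue  # only retweets made by listed users are counted
        total += count
        if destination in members:
            inside += count
    return 100.0 * inside / total if total else 0.0


retweets = {("a", "b"): 3, ("a", "x"): 1, ("b", "a"): 2, ("x", "a"): 5}
print(round(insularity(retweets, ["a", "b"]), 1))
# prints: 83.3
```

Here users a and b made 6 retweets in total, 5 of which were of each other, giving 5/6, or about 83.3%.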