During July and August 2009, we gathered data for a random selection of 83,628 Twitter users. The following is an analysis of the data.
Note that some graphs present data for “real” users as well as “all” users in the survey. A crude algorithm was used to separate potential “real” users from spam accounts (“real” users were defined as those that had changed their avatar picture from the default and had 5+ tweets, 5+ friends and 5+ followers). Using this liberal metric, 34,334 users were classed as real – about 40% of our total survey, though this undoubtedly still includes many spam and dead accounts.
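As a minimal sketch, the “real” user filter described above could look like the following. The `User` class and its field names are illustrative assumptions, not the code used for the survey or anything taken from the Twitter API.

```python
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical field names, chosen for this example only.
    has_custom_avatar: bool  # avatar changed from the default
    tweets: int
    friends: int             # accounts this user follows
    followers: int


def is_real(user: User, minimum: int = 5) -> bool:
    """Crude filter: custom avatar plus 5+ tweets, friends and followers."""
    return (
        user.has_custom_avatar
        and user.tweets >= minimum
        and user.friends >= minimum
        and user.followers >= minimum
    )


sample = [
    User(True, 120, 45, 38),   # passes every check
    User(False, 300, 20, 2),   # default avatar, too few followers
    User(True, 4, 10, 10),     # too few tweets
]
real_users = [u for u in sample if is_real(u)]
```

Only the first of the three sample accounts passes, which shows how liberal the metric is: an account needs very little activity to be counted as real.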
Twitter users do not specify their sex on registration. To deduce the sex of each user, we compared their full name (if provided) against US Census data on first names (which was manually updated to include more recent names for obvious omissions). If a name could belong to either sex, we chose the sex for which that name was more popular. This resulted in 66% (55,504) of the users in our survey being assigned a sex.
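The name lookup can be sketched roughly as below. The `NAME_COUNTS` table holds made-up placeholder numbers rather than real census figures, and returning no answer when the counts are equal is an assumption, since that case is not covered above.

```python
from typing import Optional

# Hypothetical table: first name -> (female_count, male_count).
# Placeholder values for illustration, not real census data.
NAME_COUNTS = {
    "mary": (1000, 5),
    "james": (8, 900),
    "jordan": (120, 300),
}


def guess_sex(full_name: str) -> Optional[str]:
    """Return "female", "male", or None when the name gives no answer."""
    parts = full_name.strip().split()
    if not parts:
        return None  # no name provided
    counts = NAME_COUNTS.get(parts[0].lower())
    if counts is None:
        return None  # first name not in the table
    female, male = counts
    if female == male:
        return None  # assumption: leave exact ties unassigned
    return "female" if female > male else "male"
```

With this table, `guess_sex("Jordan Lee")` gives `"male"` because the name is more common for men in the placeholder data, while an unknown or missing name leaves the user unassigned.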
The distribution was the same for “all” accounts and “real” accounts, with 59% female users compared with 41% male users.
The age of a Twitter user is difficult to ascertain without a direct survey. To estimate the distribution of ages, we searched Twitter for phrases such as “I am 23”, “Im 23” or “I’m 23” that didn’t contain mentions of “today”, “tomorrow” or “birthday” (to reduce the skew of people announcing birthdays). The rate of tweets mentioning each age can then be used to plot the distribution.
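A rough sketch of this kind of filter, applied to tweet text that has already been collected, is shown below. The regular expression and the sample tweets are assumptions made for illustration; the original search was run against Twitter itself and may have matched differently.

```python
import re
from collections import Counter
from typing import Optional

# Matches "I am 23", "I'm 23" or "Im 23" with a one- or two-digit age.
AGE_PATTERN = re.compile(r"\b(?:i am|i'm|im)\s+(\d{1,2})\b")
EXCLUDED = ("today", "tomorrow", "birthday")


def extract_age(tweet: str) -> Optional[int]:
    """Return the stated age, or None if the tweet should be skipped."""
    text = tweet.replace("\u2019", "'").lower()  # normalise curly apostrophes
    if any(word in text for word in EXCLUDED):
        return None  # likely a birthday announcement
    match = AGE_PATTERN.search(text)
    return int(match.group(1)) if match else None


tweets = [
    "I am 23 and still living at home",
    "Im 19, what should I study?",
    "I’m 21 today!",        # excluded: mentions "today"
    "I'm 23 minutes late",  # false positive: counted although it is not an age
]

ages = Counter()
for tweet in tweets:
    age = extract_age(tweet)
    if age is not None:
        ages[age] += 1
```

The last sample tweet shows a weakness of phrase matching: a number after “I’m” is not always an age, so the resulting counts are only an approximation.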
There is clearly still a skew from birthday tweets (at every fifth or tenth year, plus the legally significant ages of 18 and 21), and we might imagine that younger users are more inclined to announce their age.
However, given how much more often these ages are mentioned than any others, we can assume that the typical Twitter user age (the mode, rather than the median) is somewhere between 18 and 21.
Not much new to report here: a large number of Twitter users have never tweeted or have only tweeted a few times (in our full sample, 22% had never tweeted; 58% had tweeted ten times or fewer).
For the full sample, the number of ‘followers’ (for each user) peaks at around 2 to 4, then quickly drops off: 53% have 10 or fewer followers.
The number of people that users follow is more interesting. For “real” users, the distribution is fairly flat, with roughly as many users following 10 people as following 50 or 100 (with a slight peak at around 30 friends). Remember that we excluded users with fewer than 5 friends from the “real” group, so the graph doesn’t start until this point.
However, when we look at the “full” sample (all users), a massive spike occurs at 20 friends, suggesting that users who follow exactly 20 people are much more likely to be spam accounts.
Similarly, we can look at the ratio of following to follower numbers. Again, the “full” sample exhibits spikes that “real” users do not, with spikes at ratios of 10 and 20 (i.e. these users follow 10 or 20 times the number of people that follow them back).
Segregating by sex, we can also see that female users tend to have a very slightly higher average ratio than male users.
The graph below shows on which day of the week Twitter users created their accounts. Mid-week (Wednesday) tends to be busiest, with the weekend being the least popular time to create an account. Note that the data suggests that “all” users (i.e. including spam) have slightly higher account creation activity at the weekend than just “real” users.
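For anyone wanting to reproduce this kind of breakdown, a small sketch of the tally follows. The timestamps are invented sample values in ISO format, which is an assumption about how the creation dates might be stored.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Invented account-creation timestamps, for illustration only.
created = [
    "2009-07-15T10:30:00",  # a Wednesday
    "2009-07-15T18:02:11",  # a Wednesday
    "2009-08-01T09:00:00",  # a Saturday
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
counts = Counter(DAYS[datetime.fromisoformat(ts).weekday()] for ts in created)

# Share of accounts created on each day, ready for plotting.
shares = {day: counts[day] / len(created) for day in DAYS}
```

Running the same tally separately for “all” and “real” users would give the two series being compared.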
Finally, we looked at the description/bio length of each user. The majority (about 65% of all users, 35% of real users) have no description.
The following graph shows the description/bio length for users that have created one. This peaks at around 20 to 40 characters, with a large spike at 160 (maximum) character length – possibly due to users not understanding the limit and typing/pasting a large portion of text into the field.
According to our analysis, the ‘average’ Twitter user is a girl in her late teens, who is following 20 to 50 people, and has roughly the same number of people following her back. Her bio/description is quite short, at about 30 characters.
Studying Twitter usage and demographics is important for anyone looking to exploit the ever-growing service, whether for business or personal/social means.
Unfortunately, some organisations and people take advantage of the openness and simplicity of Twitter by trying to cheat the system and find ‘quick wins’. Rather than participating in the spirit of the platform, they resort to spamming, automating and deceiving.
Thankfully, as shown above, we can use this same Twitter analysis to help identify the patterns of likely spammers. We now need the organisation behind Twitter to start integrating tools or algorithms to make better use of this type of pattern detection and prevention, stopping spammers before they can aggravate a significant number of real users.
A simple suggestion would be to adopt a similar ‘scoring’ system to that used by email spam detectors. For example, if a user is following exactly 20 people, 10 points; if a user hasn’t changed their avatar, 2 points; no description, 2 points; account created at the weekend, 1 point; a following/follower ratio of more than 10, 5 points.
Each user could then specify in their profile settings a maximum ‘score’ that a potential follower can have (this would default to a large number, such as 100), allowing individuals to set their preference about potential spam followers (and false positives).
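To make the suggestion concrete, here is a minimal sketch of such a scoring system using the example point values above. The `Account` class, its field names and the handling of accounts with zero followers are assumptions for illustration; this is not an existing Twitter feature.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Account:
    # Hypothetical field names, chosen for this example only.
    friends: int             # accounts this user follows
    followers: int
    has_custom_avatar: bool
    description: str
    created: date


def spam_score(account: Account) -> int:
    """Add up points for each spam-like trait; higher means more suspicious."""
    score = 0
    if account.friends == 20:
        score += 10
    if not account.has_custom_avatar:
        score += 2
    if not account.description.strip():
        score += 2
    if account.created.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        score += 1
    # Following/follower ratio above 10. Assumption: with zero followers,
    # following anyone at all counts as exceeding the ratio.
    if account.friends > 10 * account.followers:
        score += 5
    return score


def may_follow(account: Account, max_score: int = 100) -> bool:
    """Check a would-be follower against a user's chosen maximum score."""
    return spam_score(account) <= max_score


suspect = Account(20, 1, False, "", date(2009, 8, 1))
regular = Account(45, 38, True, "Coffee lover", date(2009, 7, 15))
```

With these weights the highest possible score is 20, so the default maximum of 100 lets everyone through, and a user who wants stricter filtering might lower it with something like `may_follow(suspect, max_score=10)`.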
How else might we detect spam accounts? Does this average Twitter user feel right to you? Let us know in the comments below.