I have recently been taking a closer look at twitter. I know a lot of people use it but what i have come to see is that a lot of companies use it as well. This made me start to wonder how if it would be possible to do some analysis on twitter posts that my company makes. I find it a lot easier to wright down my thoughts so that’s really all this post is going to be, my working though how to turn Twitter Feed data into a SSAS cube that can be used to analyse twitter tweet exposure and with luck predict how best to tweet. I have created a SSIS Connection manager and Dataflow Task to query data out of Twitter. So if you are wondering how i’m doing this well there you have it.
What we want to find out
What are the kinds of things we would want to analysis? The possibility must be pretty limited don’t you think? We are really only talking about tweets here. Then i started to really think about it.
Is there an optimal time for the company to tweet? When we tweet when do we actually get re-tweets? Are most of our followers in one timezone or another? If we tweet at a specific time are we more apt to get more followers and re-tweets.
The reach of a tweet will be the potential number of twitter users that see the company’s tweet. This will have to be calculated based upon how many people are following the company, as well as any mentions and hash-tags used in the tweet. Re-tweets will also effect the potential exposure of a tweet.
Listeners vs Broadcasters
One of things we want to do is make sure that our company’s tweets reach the widest possible number of twitter users. One way of doing this will be to determine who are our broadcasters. The company may have 100 twitter users following them but most of them are probably just listeners people that may or may not read your tweets and do nothing more. Broadcasters are the ones we are really after. A broadcaster will re-tweet or comment on the company’s tweets. This will in turn keep the tweet active and ensure that we reach more users. If we can identify who our broadcasters are we can be sure to @mention them in tweets to insure that our tweets reach a bigger audience.
If a broadcaster has a large number of followers and/or has had a twitter account for a long time then we know that they are more likely to help the company get there message out there. If the broadcaster has just joined twitter and has a limited number of followers, they are probably less likely to improve our reach.
Rating a broadcasters tweets
If we scan the broadcasters tweets we can tell how many times they have tweeted or re-tweeted using the same hash tag we used in our tweet, or maybe any of our previous tweets. This should tell us if this broadcaster is apt to have followers that will be part of our target audience. A broadcaster who often tweets in our same area of interest should be more valuable then one that tweets outside our target area.
Once we have this information can we use it to make predictions before we tweet. So that we are sure to add a mention of our broadcasters when valid and add Hash-tags when needed. As well as something simple as knowing what time of day is best to tweet.
What would be interesting to Company X to know about its tweets? The first thing i could think of was how many twitter users are potentially seeing a tweet. Lets call that Exposure. But how would i calculate exposure its not really as simple as how many twitter users are following Company X.
First Level of exposure:
- Company X Tweets a link for a new product. The number of twitter users to potentially see this link on there news feed will be equal to the number of twitter users that are following Company X.
- Company X mentions another twitter User in the tweet. EX: @Twitter The number of twitter users to potentially see the link will be increased by 1 for each twitter users Mentioned. (Question:I’m not sure about this. Does everyone that has this mentioned user see it on there feed or is it just on there own feed. Need to test this.)
- Company X adds a HashTag to the tweet. Anyone that searches on that HashTag will also see the link. I find it hard to decided how to value HashTags. The only way i can come up with a value for a HashTag would be to query the number of posts for the last 2 weeks using that HashTag then add a weight for the value of the HashTag . If there is only 5 posts in the last two weeks with that HashTag then the number of users potentially seeing that tag is probably pretty low. But if there are 100 tweets with this HashTag then people are using this tag they must also be viewing it.
What is the ramifications of this?
Exposure count = TweeterFriendCount + MentonedUserFriendCount + (HashTag weight magic)
Update: When the data is reloaded the amount allocated to twitterfriendcount will be the same. If someone friends CompanyX after the tweet is made if its over two weeks its not going to be worth counting as an exposure, the chances they will see it are low because there feed will be displaying newer posts.
Second Level of Exposure: (re-tweet)
- User Y re-tweets Company X’s tweet. The number of people to potentially see this link is now increased by the number of people following User Y.
- Question: If the original post had a mention to a user. EX @Twitter. Will that person then get the tweeter also on there twitter feed?
- Re-tweet Where the original post had a HashTag should just extend the life of the tweet. It will show up as newer then the original tweet. There for will display higher in the search.
Exposure count = (First Level of Exposure count) + ReTweetUserFriendCount + (Mentioned User friend count??) + (HashTag weight magic)
Update: The re-tweet has a first level exposure being that of the RetweetUsersFriendCount should be added to the first level of exposure of the original parent tweet? I think so.
Timing is everything
Would it be possible to determine what time of day / day of the week is the best time to Tweet in order to get re-tweets.
Lets look at the Twitter API GET statuses/user_timeline method can only return up to 3,200 of a user’s most recent Tweets. So if i query CompanyX Screen Name
When i run that it will return up to the last 3,200 tweets, but i also think its tweets in the two weeks older tweets don’t appear to be returned. What i found very nice here is that when the tweet returns it gives me a retweet_count and a Favorite_count. These counts appear to be the counts at the time of the query. So if i add a new row to my tweet fact table with the datetime that i ran the query i should be able to see when the tweet was re-tweeted. I will be able to track re-tweets and favorites for each of my Company’s tweets. There is a limit to how often you can hit the Twitter API but IMO it shouldn’t be an issue i can run this every 5 minutes and be OK.
SSIS Task Issues
The Twitter API returns so much information i’m finding it hard to decide what information to extract.
The task is very basic so far. I have query type set up so that i can select if its a tweet, re-tweet, mention, search or User that I want to query. But there is so much information that comes back from each of these queries. A tweet returns the same user information as for example a user. Do i want to extract that as well probably as i have the data why not save it to the pipeline. I had the idea of maybe adding some way of selecting which columns i want to extract in the Source but then i will have to add another variable to the task and test for it while i build output columns and when i write the data. I think that would really be to much work. I’m leaning towards just outputting everything i think i might use, then i can just ignore it in the destination task. The only draw back to ignoring the extra output columns in the destination task will be that SSIS will give a warning that some of the output columns aren’t being used. The output columns not being used warning from SSIS isn’t an issue in my opinion I think i can live with it.
Another idea that could be fun is to write the JSon to an output column to. But then if i want to read that column from the database i’m going to have to go back to my SSIS JSon Source task that I gave up on…..
Multi Dimensional Data Model
Now i have a clear idea of what data i can get out of the twitter api, and i have a general idea of what i can analyse with that data. I think its time to start to design my data model.
CREATE TABLE [dbo].[dUser]( [UserId] [bigint] NULL, [UserName] [nvarchar](50) NULL, [UserScreenName] [nvarchar](50) NULL ) ON [PRIMARY]
CREATE TABLE [dbo].[fUser]( [UserId] [bigint] NULL, [TweetDate] [datetime] NULL, [RequestedDate] [datetime] NULL, [UserFollowersCount] [int] NULL, [UserFriendsCount] [int] NULL, [UserStatusesCount] [int] NULL, [wTweetDate] [datetime] NULL ) ON [PRIMARY] On my dUser dimension table all i am storing right now is the username and screen name. I will probably be adding other things later like when they joined twitter. My fUser fact table will be where i will be storing all the information on the user. I plan on loading the feed every 5 minutes or so i will be able to load any changes in the users followerscount, friendscount, and statusescount. This way i can analyse if the twitter user or Company is increasing or decreasing in the number of people following them. Im not sure if there is a point in loging how many friends the twitter user has but its there now and i might find some use for it later. And the saga continues.....