In my quest to dig into how people use social media to communicate about wine, I ran into a snag. One of the things I really wanted to do was collect geographic data about each post. When you do a Twitter search, a post can contain latitude and longitude data if the source supports GPS, but so far almost none of the posts about wine actually have this data.

So the next best thing is to figure out where the user lives. A Twitter user profile has a location field. It's free text, unfortunately, but it's better than nothing. So for every post returned by a search query, I would check MongoDB to see whether I had already stored the user's profile. If not, I would call the Twitter API to fetch the profile, pull out the user's location and time zone, and store them back in Mongo. This worked great about 150 times. Then I ran into a fun little thing called rate limiting. Apparently Twitter limits any application to 150 API calls per hour (the search API has a different threshold, which I haven't hit yet). Once you go over this threshold, you get blacklisted for the next 8 or so hours.
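The check-the-store-first flow described above, with a call budget added so the limit is never exceeded, might look roughly like the sketch below. It is only an illustration under assumptions: a plain dict stands in for the MongoDB collection, `fetch_profile` is a hypothetical callable standing in for the real Twitter API call, and the names `RateLimiter` and `get_location` are made up for this example. The 150-per-hour default comes from the limit mentioned above.

```python
import time


class RateLimiter:
    """Allow at most `limit` calls per `window` seconds (simple fixed window)."""

    def __init__(self, limit=150, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def try_acquire(self):
        """Return True and spend one call if budget remains, else False."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window has started, so the budget resets.
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False


def get_location(user_id, cache, fetch_profile, limiter):
    """Return stored location info, calling the API only when needed and allowed.

    `cache` is any dict-like store (a stand-in for MongoDB here) and
    `fetch_profile` is a hypothetical function returning a profile dict.
    Returns None when the hourly budget is used up, so the caller can retry later.
    """
    if user_id in cache:
        return cache[user_id]
    if not limiter.try_acquire():
        return None
    profile = fetch_profile(user_id)
    record = {
        "location": profile.get("location"),
        "time_zone": profile.get("time_zone"),
    }
    cache[user_id] = record
    return record
```

Returning `None` instead of sleeping keeps the search loop moving; users who were skipped can be queued and looked up once the next window opens.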

So, considering I have about 50,000 users, at 150 lookups per hour it'll only take me around 14 days to catch up and have location data for all my users. Of course, people don't stop posting on Twitter, so in 14 days I'll probably have 50,000 additional users. At some point this curve will flatten out, but not for a while.
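The arithmetic behind that estimate is simple enough to check in a few lines. This is a rough sketch, not a forecast: `hours_to_clear` is a made-up helper, and the optional `new_users_per_hour` argument is an assumption added to show what happens when new users keep arriving while the backlog is being worked off.

```python
def hours_to_clear(backlog, calls_per_hour, new_users_per_hour=0.0):
    """Hours needed to look up every user in the backlog.

    Returns None when new users arrive at least as fast as lookups
    are allowed, because the backlog then never shrinks.
    """
    net_per_hour = calls_per_hour - new_users_per_hour
    if net_per_hour <= 0:
        return None
    return backlog / net_per_hour


# 50,000 users at 150 lookups per hour, with no new users arriving.
days = hours_to_clear(50_000, 150) / 24
print(f"about {days:.1f} days")  # about 13.9 days
```

If another 50,000 users really did show up over the same 14 days, that would be roughly 149 new users per hour, which is almost exactly the lookup budget, so the backlog would barely move.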

From an analysis point of view, this location information would be most interesting for people who post to Twitter with geographic data. If I could find users who post often about wine, and whose posts contain lat/lng information, I could see how the way people communicate about wine relates to where they are. For example, someone who lives in San Francisco might never talk about wine except on trips out to Napa Valley, at which point they post a lot about the wineries they visit.

