Learn Why to Code

A brief introduction to practical programming


Switching Data Locations

The problem: The data source just up and changed its location!
The solution: Thanks to how we abstracted and organized our code, we need to modify only a few details to adapt.

Lesson content is currently in draft form.

As I said at the outset, we’re working with a very small sample of Twitter’s data, and we’re not using its official API.

At some point, you’re going to want to remove the training wheels and get data straight from the source.

We won’t get to that in this tutorial, but we’ll go through the motions. I’m going to change up the data source and show you how our my_code.rb can adapt to it.

A note: What follows is a bit of tedium, namely reorganizing old code so that we can better reuse it. It's good practice and grounds some design concepts, but don't get worked up about it if time is short. After downloading the data file as listed in the lesson, you can just download the my_code.rb library here and include it going forward.

More data, more problems

Good news. Instead of just looking at the same 10 congressmembers over and over, I have a data file of roughly 500 U.S. politicians.

But rather than have you hit up my server repeatedly, I’m just going to give you the data in bulk. This will make things much faster for you, as you won’t have the latency of an Internet connection: you’ll be combing through the data as it sits on your hard drive.

The bad news is that you’re going to have to adapt.

Download the data

The full dataset (which is not even quite complete, but more on that later) is more than 600MB unzipped. If you have a fast computer, sure, go for it. However, because we’re still just practicing, there’s not really a need to process tens of thousands of tweets. So there is a ‘lite’ version with fewer tweets included:

The best place to store this data is as a sub-directory in your working directory.

I’m going to assume you’re naming it data-hold.

So unzip the data file into data-hold, and it should create some familiar sub-directories, including users and statuses.
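A quick way to confirm the unzip worked is to check for those sub-directories from irb. This is just a sanity-check sketch; it assumes data-hold sits in your current working directory:

```ruby
# Check that the unzipped data-hold directory has the expected layout
%w(users statuses).each do |sub|
  path = File.join('data-hold', sub)
  puts "#{path} exists? #{File.directory?(path)}"
end
```

If either line prints false, double-check where you unzipped the file relative to where you launched irb.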

Making adjustments

Let’s look at the current state of our my_code.rb file:

Our common code (my_code.rb) download
require 'rubygems'
require 'json'
require 'httparty'


###############
# Constants
TWITTER_DATA_HOST = "http://nottwitter.danwin.com"
TWITTER_USER_DATA_PATH = File.join(TWITTER_DATA_HOST, "users")
TWITTER_TWEETS_DATA_PATH = File.join(TWITTER_DATA_HOST, "statuses")

def url_for_twitter_user_info(screen_name)
  # pre: screen_name a Twitter account name as a string
  # returns: url (string) to get user data

  File.join(TWITTER_USER_DATA_PATH, screen_name, 'show.json')
end

def url_for_tweets_page(screen_name, pg_num)
  # pre: screen_name is a Twitter account name, as a string; pg_num
  #      the page number, as tweets are separated into numbered pages


  # returns: url (string) to get tweets

  File.join(TWITTER_TWEETS_DATA_PATH, screen_name, pg_num.to_s, "user_timeline.json")
end

def get_data_file(u)
  # pre: u is a URL string
  # post: downloads from u; has no protection against bad URLs
  # i.e. this will have to be modified later

  HTTParty.get(u)

end

def get_twitter_user(screen_name)
  # pre: screen_name is a Twitter account name, as a string
  # returns: user info as a Hash object

  d = get_data_file( url_for_twitter_user_info(screen_name))
  JSON.parse(d)
end

def get_tweets_page(screen_name, pg_num)
  # pre: screen_name is a Twitter account name, as a string; pg_num
  #      the page number, as tweets are separated into numbered pages

  # returns: an array of tweet Hash objects

  d = get_data_file( url_for_tweets_page(screen_name, pg_num))
  JSON.parse(d)
end



puts "Done loading my code!"

The most obvious adjustment is to change the value of the TWITTER_DATA_HOST to the data-hold sub-directory:

TWITTER_DATA_HOST = File.expand_path('data-hold')

(try out the File.expand_path method in irb to see what it does)
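Here's what that experiment looks like; the exact output depends on your current working directory, so the paths below are only illustrative:

```ruby
# File.expand_path converts a relative path into an absolute one,
# resolved against the current working directory
puts File.expand_path('data-hold')
# prints something like /home/you/tutorial/data-hold

puts File.expand_path('..')   # the parent of the current working directory
```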

Well-laid plans

With that simple change, we are almost done with the needed adjustments.

This is because we wisely created the url_for methods to rely on the TWITTER_USER_DATA_PATH and TWITTER_TWEETS_DATA_PATH constants.

And what do those constants depend on? The value of TWITTER_DATA_HOST, since the user and tweets data paths are merely sub-directories of the base location.

Pretty nifty.
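To see that chain of dependencies for yourself, here's a minimal sketch using the same constant and method names as my_code.rb; note how the returned path now points into the local data store simply because TWITTER_DATA_HOST changed:

```ruby
# These mirror the constant and method names in my_code.rb
TWITTER_DATA_HOST = File.expand_path('data-hold')
TWITTER_USER_DATA_PATH = File.join(TWITTER_DATA_HOST, 'users')

def url_for_twitter_user_info(screen_name)
  File.join(TWITTER_USER_DATA_PATH, screen_name, 'show.json')
end

# A local filesystem path, built from the same method as before:
puts url_for_twitter_user_info('NancyPelosi')
```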

A different get method

Unfortunately, there is one major change to make.

Retrieving a file from the Internet is not exactly the same as getting it from your own hard drive. For starters, HTTParty won’t work.

Instead, we use File.open (part of Ruby's core library), which works a bit differently:

some_filename = url_for_twitter_user_info('NancyPelosi')
fh = File.open(some_filename, 'r')
fbody = fh.read
fh.close

# ...parse fbody
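As an aside, Ruby also offers shortcuts that handle the closing for you, equivalent to the open/read/close sequence above. This sketch fabricates a throwaway file with Tempfile just so it has something to read:

```ruby
require 'tempfile'  # only used here to fabricate a sample file

tf = Tempfile.new('demo')
tf.write('{"screen_name": "NancyPelosi"}')
tf.close

# Block form: the file handle is closed automatically when the block ends
fbody = File.open(tf.path, 'r') { |fh| fh.read }

# Shortest of all: File.read opens, reads, and closes in one call
fbody2 = File.read(tf.path)
```

Either form saves you from forgetting the close call; we'll stick with the explicit sequence in my_code.rb to match what we've written so far.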

Exercise

Make the necessary changes to my_code.rb to adapt to the change in data source.

Our common code for local data (my_code_alpha.rb) download
require 'rubygems'
require 'json'
# note: httparty is no longer needed, since we now read from the local disk


###############
# Constants
TWITTER_DATA_HOST = File.expand_path("data-hold")
TWITTER_USER_DATA_PATH = File.join(TWITTER_DATA_HOST, "users")
TWITTER_TWEETS_DATA_PATH = File.join(TWITTER_DATA_HOST, "statuses")

def url_for_twitter_user_info(screen_name)
  # pre: screen_name a Twitter account name as a string
  # returns: url (string) to get user data

  File.join(TWITTER_USER_DATA_PATH, screen_name, 'show.json')
end

def url_for_tweets_page(screen_name, pg_num)
  # pre: screen_name is a Twitter account name, as a string; pg_num
  #      the page number, as tweets are separated into numbered pages


  # returns: url (string) to get tweets

  File.join(TWITTER_TWEETS_DATA_PATH, screen_name, pg_num.to_s, "user_timeline.json")
end

def get_data_file(fname)
  # note: modified for local data access

  # pre: fname is a filename string (not a URL anymore)
  # returns: opens the file at the given fname and returns the data read.
  #  it also closes the file after reading it

  # warning: this will crash if file does not exist

  fstream = File.open(fname, 'r')
  fbody = fstream.read
  fstream.close  # close the file handle now that we're done reading

  return fbody
end

def get_twitter_user(screen_name)
  # pre: screen_name is a Twitter account name, as a string
  # returns: user info as a Hash object

  d = get_data_file( url_for_twitter_user_info(screen_name))
  JSON.parse(d)
end

def get_tweets_page(screen_name, pg_num)
  # pre: screen_name is a Twitter account name, as a string; pg_num
  #      the page number, as tweets are separated into numbered pages

  # returns: an array of tweet Hash objects

  d = get_data_file( url_for_tweets_page(screen_name, pg_num))
  JSON.parse(d)
end



puts "Done loading my code!"

With this modified my_code.rb, test it out on your local data store. Things should seem pretty much the same.

load './my_code.rb'

t_user = get_twitter_user('DarellIssaTK')
puts t_user['followers_count']
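Remember the warning inside get_data_file: it will crash if the file doesn't exist. Here's one hedged sketch of a guard using File.exist?; returning nil is just one possible design choice, not the only way to handle a missing file:

```ruby
def get_data_file(fname)
  # returns the file's contents, or nil if no file exists at fname
  return nil unless File.exist?(fname)

  fstream = File.open(fname, 'r')
  fbody = fstream.read
  fstream.close

  fbody
end

puts get_data_file('no-such-file-xyz.json').inspect   # prints nil
```

The tradeoff: callers such as get_twitter_user must now be prepared for nil, since JSON.parse(nil) will raise an error of its own.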

I suppose we could rename the url_for methods, since they no longer refer to web addresses. But that's the beauty of abstraction: we don't need to worry about such details as long as things work as expected.
