Joseph Kain

Professional Software Engineer learning Elixir.

I have started a new project to play around with Elixir and OTP. Here’s the idea: a tool that scrapes various tech news sites and notifies me, in near real time, when a link to Learning Elixir has been created. The goal is to help me better engage with people linking to my blog. Examples of sites I could monitor:

  • Hacker News
  • Twitter
  • reddit
  • StackOverflow

Kicking off the project

Here’s my approach to starting the project. I follow something like this for most new projects.

  1. I created a file with some preliminary notes including the vision above
  2. I wrote up what I think the basic requirements are
  3. I put together a plan for an initial prototype
  4. I implemented the first revision of the prototype
  5. I analyzed how well the prototype fit my needs
  6. I decided on next steps based on the analysis of the prototype

The rest of this post will describe my experience carrying out these steps.

Basic Requirements

Here is my first cut at requirements:

  1. Continuously fetch from the services to get new content
  2. Parse out the content from fetched data
  3. Unshorten shortened URLs
  4. Present results

There are probably some steps I’m missing between 3 and 4, but 1-3 are a good place to start.

The above are activities. The activities require state to be maintained:

  1. Store state of the fetchers (i.e. where to fetch from next, or what “new” means)
  2. Store pending content that has been fetched but still needs to be parsed
  3. Store parsed content that contains domains
  4. Store relationship between collected domains and referring content
  5. Store rankings for domains
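
As a rough illustration of the first kind of state, here is a minimal sketch of how a fetcher could remember where to resume. It assumes that “new” means “a tweet id higher than the last one seen”. The module name FetcherState and its functions are hypothetical and are not part of the prototype, which deliberately stores no state.

```elixir
defmodule FetcherState do
  # Hypothetical helper: an Agent holding the highest tweet id seen so far.
  # Assumption: "new" content is anything with an id above this value.

  def start_link(initial \\ nil) do
    Agent.start_link(fn -> initial end)
  end

  # Returns the highest id recorded so far, or nil if nothing was recorded.
  def last_seen(pid) do
    Agent.get(pid, fn last -> last end)
  end

  # Records an id, keeping only the highest value.
  def record(pid, id) do
    Agent.update(pid, fn
      nil -> id
      last -> max(last, id)
    end)
  end
end

{:ok, pid} = FetcherState.start_link()
FetcherState.record(pid, 628707248206925824)
IO.inspect(FetcherState.last_seen(pid))
```

An Agent is about the simplest OTP building block for this. Whether it is the right choice depends on how the later prototypes handle persistence and restarts.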

Prototype P1

I planned out the first prototype, impressively named P1, as an experiment to work out the basics of the required activities. This is a little like a development spike though I’ve written tests and kept the code.

  1. Build a fetcher for a single tweet that has what I want in it
  2. Build a parser: for twitter
    • Extract http links
  3. Build an unshortener
  4. Organize the 3 into a pipeline with potential for parallelization
  5. Dump domains to stdout or something
  6. Don’t save any of the state described above
  7. Build a fetcher for Twitter live stream
  8. Run it end to end

Building the Prototype

First I created a new mix project with mix new p1. I’ve pushed this to GitHub as ds_prototype if you want to follow along.

ExTwitter

In order to interact with Twitter I’ll use the ExTwitter module. I followed the directions in the README to update mix.exs.

I want to write a function to fetch a specific Tweet, one of my own old tweets:

This has a shortened link in it so it should be a great test.

FetcherSingle

FetcherSingle should implement this part of the prototype:

Build a fetcher for a single tweet that has what I want in it

In order to write this I started with a test:

defmodule FetcherSingleTest do
  use ExUnit.Case

  test "it can fetch the specific tweet" do
    assert FetcherSingle.fetch |> Map.get(:text)  
      == "Read about how I learned #elixirlang - http://t.co/kfLrRZJ1cI"
  end
end

Actually, before this I screwed around with ExTwitter and pulled a bunch of tweets from my timeline. I found the tweet I wanted. This is how ExTwitter represents it:

What is a Tweet?

%ExTwitter.Model.Tweet {
  contributors: nil,
  coordinates: nil,
  created_at: "Tue Aug 04 23:21:03 +0000 2015",
  entities: %{
    hashtags: [%{indices: [25, 36], text: "elixirlang"}],
    symbols: [],
    urls: [
      %{
        display_url: "buff.ly/1LYD0tp",
        expanded_url: "http://buff.ly/1LYD0tp",
        indices: [39, 61], # IEx prints this list as the charlist '\'='
        url: "http://t.co/kfLrRZJ1cI"
      }
    ],
    user_mentions: []
  },
  favorite_count: 1,
  favorited: false,
  geo: nil,
  id: 628707248206925824,
  id_str: "628707248206925824",
  in_reply_to_screen_name: nil,
  in_reply_to_status_id: nil,
  in_reply_to_status_id_str: nil,
  in_reply_to_user_id: nil,
  in_reply_to_user_id_str: nil,
  lang: "en",
  place: nil,
  retweet_count: 0,
  retweeted: false,
  source: "<a href=\"http://bufferapp.com\" rel=\"nofollow\">Buffer</a>",
  text: "Read about how I learned #elixirlang - http://t.co/kfLrRZJ1cI",
  truncated: false,
  ...
},

The fields I’m going to need to use are

  • id - at least for this hack to look up a specific Tweet
  • entities.urls - Twitter already expands each t.co link into an expanded_url
  • text - Which is the text of the Tweet

Given all this I was able to write this simple fetcher:

defmodule FetcherSingle do
  @spec fetch :: ExTwitter.Model.Tweet.t
  def fetch do
    ExTwitter.show(628707248206925824)
  end
end

The function just calls ExTwitter to fetch my Tweet which has id 628707248206925824. This passes the test.

Parser

The next step was to build a parser. Parser implements this part of the prototype:

Build a parser: for twitter - Extract http links

To get my bearings I started with parsing the text from the Tweet. Here’s the test:

test "It can extract the text" do
  expected_text = "Expected"
  test_data = %ExTwitter.Model.Tweet{ text: expected_text }
  assert Parser.text(test_data) == expected_text
end

Here I’m using a fake %ExTwitter.Model.Tweet but the parser handles it just fine:

defmodule Parser do
  @spec text(ExTwitter.Model.Tweet.t) :: String.t
  def text(tweet) do
    tweet |> Map.get(:text)
  end
end

Next I need to parse the URL. I plugged in the data for my Tweet which I extracted above:

def mock_tweet do
  %ExTwitter.Model.Tweet{
    contributors: nil,
    coordinates: nil,
    # See copy of the data above ...
  }
end

test "It can extract URLs" do
  assert Parser.urls(mock_tweet) == ["http://buff.ly/1LYD0tp"]
end

This function passes the test:

@spec urls(ExTwitter.Model.Tweet.t) :: [ String.t ]
def urls(tweet) do
  tweet
  |> Map.get(:entities)
  |> Map.get(:urls)
  |> Enum.map(&Map.get(&1, :expanded_url))
end

Here we take the entities.urls field, which is a list of maps. Each map holds the 3 variations of one URL (display, expanded, original). We map that list to a new list containing just the expanded_url of each entry. This list is returned.

As you can see, the test expects a list with a single URL.
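
To make the shape of the data concrete, here is a small sketch of the same mapping applied to a tweet with two links. It uses a plain map as a stand-in for %ExTwitter.Model.Tweet{} so it runs without the library. The module name ParserSketch, the second URL entry, and the handling of a missing entities field are assumptions added for illustration.

```elixir
defmodule ParserSketch do
  # Works on any map shaped like a tweet. A missing or nil :entities
  # field is treated as "no URLs" (an assumption, not prototype behavior).
  def urls(tweet) do
    entities = Map.get(tweet, :entities) || %{}

    entities
    |> Map.get(:urls, [])
    |> Enum.map(&Map.get(&1, :expanded_url))
  end
end

tweet = %{
  text: "a tweet with two links",
  entities: %{
    urls: [
      %{
        display_url: "buff.ly/1LYD0tp",
        expanded_url: "http://buff.ly/1LYD0tp",
        url: "http://t.co/kfLrRZJ1cI"
      },
      # Hypothetical second entry, made up for this example
      %{
        display_url: "example.com",
        expanded_url: "http://example.com/",
        url: "http://t.co/hypothetical"
      }
    ]
  }
}

IO.inspect(ParserSketch.urls(tweet))
```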

Unshortener

How do you unshorten a URL? I know there are some sites that provide unshortening as a service but if unshortening isn’t a difficult process I wouldn’t want to deal with a third party API and rate limits. StackOverflow had some helpful advice. I really just need to request the URL and look for a redirect. I can use a HEAD request for efficiency.

Anyway, I started with a test. I iterated a bit to make sure different cases worked correctly. Here are all the tests I ended up with:

defmodule UnshortenerTest do
  use ExUnit.Case

  test "it can unshorten a test URL" do
    assert Unshortener.expand("http://buff.ly/1LYD0tp") == "http://learningelixir.joekain.com/how-I-learned-elixir/?utm_content=buffer9a56c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
  end

  test "it passes through normal URLs" do
    assert Unshortener.expand("http://www.google.com") == "http://www.google.com"
  end

  test "it reports error for bad URLs" do
    assert Unshortener.expand("http://garbage.example.com") == :error
  end

  test "it can unshorten double shortened URLs" do
    assert Unshortener.expand("http://t.co/kfLrRZJ1cI") == "http://learningelixir.joekain.com/how-I-learned-elixir/?utm_content=buffer9a56c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
  end
end

The test “it can unshorten a test URL” just checks the URL from our test Tweet.

The test “it passes through normal URLs” checks a non-shortened URL to make sure it comes back as is.

The test “it reports error for bad URLs” tests a URL that doesn’t resolve and expects :error.

The last test, “it can unshorten double shortened URLs”, tries the “t.co” URL from the test Tweet. This URL is a shortened URL for a shortened URL. Unshortener.expand needs to recursively unshorten in order to expand correctly.

To implement this I need to make web requests. I used HTTPotion to do this. It is simple enough to set up and configure by following the project’s README.

Here’s the implementation:

defmodule Unshortener do

  # HTTP header names are case insensitive, so compare them downcased
  defp canonical_field_name(key) do
    key
    |> Atom.to_string
    |> String.downcase
  end

  # Returns the value of the Location header. When the header is missing,
  # the match below raises and expand/1 rescues it into :error.
  defp location(headers) do
    {_key, value} = Enum.find(headers, nil, fn {key, _value} ->
      canonical_field_name(key) == "location"
    end)

    value
  end

  @spec expand(String.t) :: String.t | :error
  def expand(short) do
    try do
      case HTTPotion.head(short) do
        # 301 redirect: follow the Location header recursively
        %HTTPotion.Response{headers: headers, status_code: 301} -> location(headers) |> expand
        # 2xx success: the URL is already fully expanded
        %HTTPotion.Response{status_code: success} when 200 <= success and success < 300 -> short
      end
    rescue
      # Failed requests, unhandled status codes, and missing headers end up here
      _ -> :error
    end
  end
end

The interesting parts here are checking for the location and the recursive expansion.

The location was a little tricky because HTTPotion returns the headers as a Keyword list where the key names match what’s in the real HTTP headers. But HTTP header names are supposed to be case insensitive.

I found that Bit.ly returned the location field as “Location” while t.co returned it as “location”. To get this right I use Enum.find on the headers Keyword list and downcase all keys before checking against “location”.
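
The case-insensitive lookup can be tried without making any HTTP requests. This is a minimal sketch: the module name HeaderLookup is hypothetical, and unlike the prototype it returns nil for a missing header instead of raising, and it accepts string keys as well as atoms.

```elixir
defmodule HeaderLookup do
  # Returns the value of the first header whose name matches `name`
  # case-insensitively, or nil when there is no such header.
  def get(headers, name) do
    wanted = String.downcase(name)

    Enum.find_value(headers, fn {key, value} ->
      if String.downcase(to_string(key)) == wanted, do: value
    end)
  end
end

# "Location" and "location" are both found
IO.inspect(HeaderLookup.get([{:Location, "http://example.com/a"}], "location"))
IO.inspect(HeaderLookup.get([{:location, "http://example.com/b"}], "location"))

# A response without the header gives nil
IO.inspect(HeaderLookup.get([{:Server, "nginx"}], "location"))
```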

In Unshortener.expand/1 I use a case statement to pattern match against the HTTP status code. For code 301, a redirect, I recursively expand whatever the location field gives as the target of the redirect. For successful codes I just return the original URL (stored in short). If anything else happens I return :error.

Organize FetcherSingle, Parser, and Unshortener into a pipeline with potential for parallelization

I needed to put everything together. Here’s the test:

test "it should fetch a tweet and extract the URL" do
  assert P1.run == ["http://learningelixir.joekain.com/how-I-learned-elixir/?utm_content=buffer9a56c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"]
end

And here’s the implementation:

defmodule P1 do
  def run do
    # Parser.urls returns a list, so expand each URL in it
    FetcherSingle.fetch
    |> Parser.urls
    |> Enum.map(&Unshortener.expand/1)
  end
end

This just runs each of the 3 activities in a pipeline and returns the results as a list.

Twitter as a Stream

ExTwitter has a stream version. This provides live tweets as an Elixir Stream! What could be more exciting?

I can build a streaming Fetcher quite simply like this:

defmodule Fetcher do
  @spec fetch :: Enumerable.t
  def fetch do
    ExTwitter.stream_filter(track: "apple")
  end
end

Note that ExTwitter.stream_filter requires a filter. The ExTwitter README’s example uses “apple” which is a frequently occurring keyword. I’ve used the same in the prototype and it seems to provide plenty of Tweets.

Drinking from the firehose

Here’s a test which isn’t actually verifying its result. It can’t really since the data is live. Instead it just prints out a bunch of URLs.

test "it should just stream out URLs from the firehose" do
  # Enum.take/2 already returns a list, which forces evaluation of the stream
  P1.stream
  |> Stream.each(fn x -> IO.puts x end)
  |> Enum.take(10)
end

Here’s the implementation:

def stream do
  Fetcher.fetch
  |> Stream.flat_map(fn x -> Parser.urls(x) end)
  |> Stream.map(fn x -> Unshortener.expand(x) end)
end

This takes the Stream of Tweets and composes that with the Parser. Recall that Parser.urls returns a list so we use Stream.flat_map to flatten into a single Stream of URLs. Finally, we compose in Unshortener.expand to create a Stream of expanded URLs.

The result is a Stream of lazily evaluated, composed computation. No tweets are fetched yet, no parsing or unshortening has been done.
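
The laziness can be observed with a small offline sketch. Here an infinite stream of fake tweets stands in for the Twitter stream, and each fake tweet sends a message to the caller at the moment it is produced. The module name LazySketch, the fake URLs, and the "expanded:" prefix that stands in for unshortening are all made up for this example.

```elixir
defmodule LazySketch do
  # An infinite stream of fake tweets. Each one notifies `caller`
  # when it is actually produced, so laziness can be observed.
  def tweets(caller) do
    Stream.iterate(1, &(&1 + 1))
    |> Stream.map(fn n ->
      send(caller, {:fetched, n})
      %{urls: ["http://example.com/#{n}/a", "http://example.com/#{n}/b"]}
    end)
  end

  # Same shape as P1.stream: flatten the URLs, then transform each one.
  def pipeline(caller) do
    caller
    |> tweets()
    |> Stream.flat_map(fn tweet -> tweet.urls end)
    |> Stream.map(fn url -> "expanded:" <> url end)
  end
end

stream = LazySketch.pipeline(self())

# Building the pipeline produced no tweets, so the mailbox is empty here.
receive do
  {:fetched, n} -> IO.puts("unexpected: tweet #{n} was already fetched")
after
  0 -> IO.puts("nothing fetched yet")
end

# Taking results forces just enough of the stream to be evaluated.
IO.inspect(Enum.take(stream, 3))
```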

The test itself composes in IO.puts to print each URL and then takes 10 results as a list, forcing evaluation. Here’s the result for a single run:

http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337506718&customid=&lgeo=1&vectorid=229466&item=331653890787&pub=5575041009
http://gekoo.co/buy/01/?query=271991150643
https://itunes.apple.com/us/app/atn-wigan-edition/id1028785917?ls=1&mt=8
http://www.noktadan.com/nexus-10dan-yeni-sizintilar.html
error
http://www.youbidder.com/Bid/Free-Last-Second-Bid-Snipe-Apple-iPhone-6-Latest-Model-16GB-Silver-Verizon-Smartphone-.html/?item=121757452567
http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337506718&customid=&lgeo=1&vectorid=229466&item=361387296439&pub=5575041009
error
https://www.google.com/url?rct=j&sa=t&url=http://www.fool.com/investing/general/2015/09/19/apple-is-preparing-to-launch-apple-pay-in-china.aspx&ct=ga&cd=CAIyGjA0OWJlZTQ2ZTU1MjZiOWU6Y29tOmVuOlVT&usg=AFQjCNF0L0-ImRrlB5oKUQrUck-kiBoHUw
https://itunes.apple.com/ru/app/id668427550?mt=8

At this point I think the prototype is done and is a success. I still have some lessons to take away.

Lessons learned and next steps

First, I didn’t unshorten all the URLs. These didn’t make it into the example above, but I’ve seen these URLs come through:

  • http://smarturl.it/dlcgpa1
  • http://ow.ly/ScJ9I
  • http://fb.me/6RK0qYjJI

These are shortened but perhaps use something other than code 301? Also, there are errors listed in the URL list. What causes them? I will explore this in a future prototype / post.
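
One way to explore the first question offline is sketched below. HTTP defines several redirect status codes besides 301 (302, 303, 307 and 308), so this version follows any of them. Whether the shorteners listed above actually use those codes is untested. The HTTP request is replaced by a hypothetical fetch function that returns a {status_code, location} tuple, so the redirect table and all URLs in it are made up. The hop limit is also an addition, to keep redirect loops from recursing forever.

```elixir
defmodule RedirectSketch do
  @redirect_codes [301, 302, 303, 307, 308]
  @max_hops 5

  # `fetch` is a hypothetical stand-in for the HEAD request:
  # it takes a URL and returns {status_code, location_or_nil}.
  def expand(url, fetch, hops \\ @max_hops)

  # Too many redirects: give up rather than loop forever
  def expand(_url, _fetch, 0), do: :error

  def expand(url, fetch, hops) do
    case fetch.(url) do
      {code, location} when code in @redirect_codes and is_binary(location) ->
        expand(location, fetch, hops - 1)

      {code, _location} when code in 200..299 ->
        url

      _other ->
        :error
    end
  end
end

# A fake redirect table standing in for the network
table = %{
  "http://short.example/a" => {302, "http://short.example/b"},
  "http://short.example/b" => {301, "http://long.example/post"},
  "http://long.example/post" => {200, nil},
  "http://short.example/loop" => {302, "http://short.example/loop"}
}

fetch = fn url -> Map.get(table, url, {404, nil}) end

IO.inspect(RedirectSketch.expand("http://short.example/a", fetch))
IO.inspect(RedirectSketch.expand("http://short.example/loop", fetch))
```

This sketch does not handle relative Location values or shorteners that redirect with JavaScript or meta refresh tags, which a HEAD request would not reveal.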

Second, I didn’t complete this objective for the prototype:

Organize the 3 into a pipeline with potential for parallelization

because I didn’t consider parallelization. I will need to explore this further. I should be able to have one process reading the Twitter Stream and other processes doing the parsing and unshortening. Doing this within the framework of Stream composition may be interesting. I’ll explore this in a future prototype / post.
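
As a starting point for that exploration, here is a minimal sketch of running the unshortening step concurrently while keeping the Stream composition. It relies on Task.async_stream, which requires Elixir 1.4 or later, so it is newer than the code in this post. The module name ParallelSketch, the concurrency of 4, and the 10 second timeout are arbitrary assumptions. The expand argument is any one-argument function, such as &Unshortener.expand/1.

```elixir
defmodule ParallelSketch do
  # Runs `expand` over `urls` in up to `concurrency` tasks at a time.
  # Returns a lazy stream, so it composes with the rest of the pipeline.
  def expand_all(urls, expand, concurrency \\ 4) do
    urls
    |> Task.async_stream(expand,
      max_concurrency: concurrency,
      timeout: 10_000,
      on_timeout: :kill_task
    )
    |> Stream.map(fn
      {:ok, result} -> result
      # A task that timed out is reported the same way as a bad URL
      {:exit, _reason} -> :error
    end)
  end
end

# A slow fake expander stands in for real HTTP requests
fake_expand = fn url ->
  Process.sleep(50)
  "expanded:" <> url
end

["a", "b", "c", "d"]
|> ParallelSketch.expand_all(fake_expand)
|> Enum.each(&IO.puts/1)
```

Results come back in input order by default. The tasks are linked to the calling process, so an expander that raises would take the caller down with it. Unshortener.expand rescues its own errors, which sidesteps that here.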

I expect to be working on and writing about this new project for some time and I hope you join me here on Learning Elixir to see how this project develops.