I have started a new project in order to play around with Elixir and OTP. Here’s the idea: a tool to scrape various tech news sites and notify me, in near realtime, that a link to Learning Elixir has been created. The goal is to help me better engage with people linking to my blog. Examples of sites I could monitor:
- Hacker News
- StackOverflow
Kicking off the project
Here’s my approach to starting the project. I follow something like this for most new projects.
- I created a file with some preliminary notes including the vision above
- I wrote up what I think the basic requirements are
- I put together a plan for an initial prototype
- I implemented the first revision of the prototype
- I analyzed how well the prototype fit my needs
- I decided on next steps based on the analysis of the prototype
The rest of this post will describe my experience carrying out these steps.
Basic Requirements
Here is my first cut at requirements:
- Continuously fetch from the services to get new content
- Parse out the content from fetched data
- Unshorten shortened URLs
- Present results
There are probably some steps I’m missing between 3 and 4, but 1-3 are a good place to start.
The above are activities. The activities require state to be maintained:
- Store state of the fetchers (i.e. where to fetch from next, which defines what “new” means)
- Store pending content that has been fetched but still needs to be parsed
- Store parsed content that contains domains
- Store relationship between collected domains and referring content
- Store rankings for domains
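To make these concrete, here’s a rough sketch (my own assumption, not part of the prototype) of how some of this state might be modeled as plain structs before picking any real storage:

```elixir
defmodule FetcherState do
  # Where to fetch from next; the id of the last tweet seen defines what "new" means
  defstruct last_seen_id: nil
end

defmodule DomainRecord do
  # A collected domain, the content that referred to it, and a ranking
  defstruct domain: nil, referring_urls: [], rank: 0
end

state = %FetcherState{last_seen_id: 628_707_248_206_925_824}

record = %DomainRecord{domain: "learningelixir.joekain.com",
                       referring_urls: ["http://t.co/kfLrRZJ1cI"]}
```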
Prototype P1
I planned out the first prototype, impressively named P1, as an experiment to work out the basics of the required activities. This is a little like a development spike, though I’ve written tests and kept the code.
- Build a fetcher for a single tweet that has what I want in it
- Build a parser: for twitter
- Extract http links
- Build an unshortener
- Organize the 3 into a pipeline with potential for parallelization
- Dump domains to stdout or something
- Don’t save any of the state described above
- Build a fetcher for Twitter live stream
- Run it end to end
Building the Prototype
First I created a new mix project with mix new p1. I’ve pushed this to GitHub as ds_prototype if you want to follow along.
ExTwitter
In order to interact with Twitter I’ll use the ExTwitter library. I followed the directions in its README to update mix.exs.
I want to write a function to fetch a specific Tweet, one of my own old tweets:
Read about how I learned #elixirlang - http://t.co/kfLrRZJ1cI
— Joseph Kain (@Joseph_Kain) August 4, 2015
This has a shortened link in it so it should be a great test.
FetcherSingle
FetcherSingle should implement this part of the prototype:
Build a fetcher for a single tweet that has what I want in it
In order to write this I started with a test:
defmodule FetcherSingleTest do
  use ExUnit.Case

  test "it can fetch the specific tweet" do
    assert FetcherSingle.fetch |> Map.get(:text) ==
      "Read about how I learned #elixirlang - http://t.co/kfLrRZJ1cI"
  end
end
Actually, before this I screwed around with ExTwitter and pulled a bunch of tweets from my timeline. I found the tweet I wanted. This is how ExTwitter represents it:
What is a Tweet?
%ExTwitter.Model.Tweet{
  contributors: nil,
  coordinates: nil,
  created_at: "Tue Aug 04 23:21:03 +0000 2015",
  entities: %{
    hashtags: [%{indices: [25, 36], text: "elixirlang"}],
    symbols: [],
    urls: [
      %{
        display_url: "buff.ly/1LYD0tp",
        expanded_url: "http://buff.ly/1LYD0tp",
        indices: '\'=',  # the list [39, 61] rendered by inspect as a charlist
        url: "http://t.co/kfLrRZJ1cI"
      }
    ],
    user_mentions: []
  },
  favorite_count: 1,
  favorited: false,
  geo: nil,
  id: 628707248206925824,
  id_str: "628707248206925824",
  in_reply_to_screen_name: nil,
  in_reply_to_status_id: nil,
  in_reply_to_status_id_str: nil,
  in_reply_to_user_id: nil,
  in_reply_to_user_id_str: nil,
  lang: "en",
  place: nil,
  retweet_count: 0,
  retweeted: false,
  source: "<a href=\"http://bufferapp.com\" rel=\"nofollow\">Buffer</a>",
  text: "Read about how I learned #elixirlang - http://t.co/kfLrRZJ1cI",
  truncated: false,
  ...
}
The fields I’m going to need to use are:
- id - at least for this hack, to look up a specific Tweet
- entities.urls - Twitter already expands from t.co to expanded_url
- text - the text of the Tweet
Given all this I was able to write this simple fetcher:
defmodule FetcherSingle do
  @spec fetch :: ExTwitter.Model.Tweet.t
  def fetch do
    ExTwitter.show(628707248206925824)
  end
end
The function just calls ExTwitter to fetch my Tweet which has id 628707248206925824. This passes the test.
Parser
The next step was to build a parser. Parser implements this part of the prototype:
Build a parser: for twitter - Extract http links
To get my bearings I started with parsing the text from the Tweet. Here’s the test:
test "It can extract the text" do
  expected_text = "Expected"
  test_data = %ExTwitter.Model.Tweet{text: expected_text}
  assert Parser.text(test_data) == expected_text
end
Here I’m using a fake %ExTwitter.Model.Tweet but the parser handles it just fine:
defmodule Parser do
  @spec text(ExTwitter.Model.Tweet.t) :: String.t
  def text(tweet) do
    tweet |> Map.get(:text)
  end
end
Next I need to parse the URL. I plugged in the data for my Tweet which I extracted above:
def mock_tweet do
  %ExTwitter.Model.Tweet{
    contributors: nil,
    coordinates: nil
    # See copy of the data above ...
  }
end

test "It can extract URLs" do
  assert Parser.urls(mock_tweet) == ["http://buff.ly/1LYD0tp"]
end
This function passes the test:
@spec urls(ExTwitter.Model.Tweet.t) :: [String.t]
def urls(tweet) do
  tweet
  |> Map.get(:entities)
  |> Map.get(:urls)
  |> Enum.map(&Map.get(&1, :expanded_url))
end
Here we take the entities.urls field, which is a list containing the 3 different URL variations (display, expanded, original). We map that list to a new list containing just the expanded_url and return it.

As you can see, the test expects a list with a single URL.
Unshortener
How do you unshorten a URL? I know there are some sites that provide unshortening as a service but if unshortening isn’t a difficult process I wouldn’t want to deal with a third party API and rate limits. StackOverflow had some helpful advice. I really just need to request the URL and look for a redirect. I can use a HEAD request for efficiency.
Anyway, I started with a test. I iterated a bit to make sure different cases worked correctly. Here are all the tests I ended up with:
defmodule UnshortenerTest do
  use ExUnit.Case

  test "it can unshorten a test URL" do
    assert Unshortener.expand("http://buff.ly/1LYD0tp") ==
      "http://learningelixir.joekain.com/how-I-learned-elixir/?utm_content=buffer9a56c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
  end

  test "it passes through normal URLs" do
    assert Unshortener.expand("http://www.google.com") == "http://www.google.com"
  end

  test "it reports error for bad URLs" do
    assert Unshortener.expand("http://garbage.example.com") == :error
  end

  test "it can unshorten double shortened URLs" do
    assert Unshortener.expand("http://t.co/kfLrRZJ1cI") ==
      "http://learningelixir.joekain.com/how-I-learned-elixir/?utm_content=buffer9a56c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
  end
end
The test “it can unshorten a test URL” just checks the URL from our test Tweet.
The test “it passes through normal URLs” checks a non-shortened URL to make sure it comes back as is.
The test “it reports error for bad URLs” tests a URL that doesn’t resolve and expects :error.

The last test, “it can unshorten double shortened URLs”, tries the “t.co” URL from the test Tweet. This URL is a shortened URL for a shortened URL. Unshortener.expand needs to unshorten recursively in order to expand correctly.
To implement this I need to make web requests. I used HTTPotion for this. It is simple enough to set up and configure by following the project’s README.
Here’s the implementation:
defmodule Unshortener do
  # HTTP field names are case insensitive, so compare downcased names
  defp canonical_field_name(key) do
    key
    |> Atom.to_string
    |> String.downcase
  end

  defp location(headers) do
    {_key, value} = Enum.find(headers, fn {key, _value} ->
      canonical_field_name(key) == "location"
    end)
    value
  end

  @spec expand(String.t) :: String.t | :error
  def expand(short) do
    try do
      case HTTPotion.head(short) do
        %HTTPotion.Response{headers: headers, status_code: 301} ->
          location(headers) |> expand
        %HTTPotion.Response{status_code: success} when 200 <= success and success < 300 ->
          short
      end
    rescue
      _ -> :error
    end
  end
end
The interesting parts here are checking for the location and the recursive expansion.
The location was a little tricky because HTTPotion returns the headers as a Keyword list where the key names match what’s in the real HTTP headers. But HTTP header field names are supposed to be case insensitive.

I found that Bit.ly returned the location field as “Location” while t.co returned it as “location”. To get this right I use Enum.find on the headers Keyword list and downcase each key before checking it against “location”.
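Here’s a minimal, self-contained demonstration of that lookup, using fake headers shaped like HTTPotion’s keyword list (the header values are made up):

```elixir
# Fake headers mixing the casings seen from Bit.ly and t.co
headers = [{:"Content-Type", "text/html"}, {:Location, "http://example.com/full"}]

downcase_key = fn key -> key |> Atom.to_string() |> String.downcase() end

{_key, target} =
  Enum.find(headers, fn {key, _value} -> downcase_key.(key) == "location" end)

target
# => "http://example.com/full"
```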
In Unshortener.expand/1 I use a case statement to pattern match against the HTTP result code. For code 301, a redirect, I recursively expand whatever the location field gives as the target of the redirect. For successful codes I just return the original URL (stored in short). If anything else happens I return :error.
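One thing the recursion doesn’t guard against is a redirect cycle. Here’s a sketch of a possible refinement (my assumption, not part of the prototype) that bounds the recursion depth; the HTTP lookup is injected as a function so the logic can be exercised without the network:

```elixir
defmodule BoundedExpand do
  @max_depth 5

  # lookup.(url) stands in for the HTTPotion.head call and returns
  # {:redirect, target}, :ok, or :error
  def expand(url, lookup, depth \\ @max_depth)
  def expand(_url, _lookup, 0), do: :error
  def expand(url, lookup, depth) do
    case lookup.(url) do
      {:redirect, target} -> expand(target, lookup, depth - 1)
      :ok -> url
      :error -> :error
    end
  end
end

# A fake lookup standing in for the real HEAD requests
lookup = fn
  "http://t.co/kfLrRZJ1cI" -> {:redirect, "http://buff.ly/1LYD0tp"}
  "http://buff.ly/1LYD0tp" -> {:redirect, "http://example.com/post"}
  "http://example.com/post" -> :ok
  _ -> :error
end

BoundedExpand.expand("http://t.co/kfLrRZJ1cI", lookup)
# => "http://example.com/post"
```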
Organize FetcherSingle, Parser, and Unshortener into a pipeline with potential for parallelization
I needed to put everything together. Here’s the test:
test "it should fetch a tweet and extract the URL" do
  assert P1.run ==
    ["http://learningelixir.joekain.com/how-I-learned-elixir/?utm_content=buffer9a56c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"]
end
And here’s the implementation:
defmodule P1 do
  def run do
    FetcherSingle.fetch
    |> Parser.urls
    |> Enum.map(&Unshortener.expand/1)
  end
end
This just runs each of the 3 activities in a pipeline and returns a list of results.
Twitter as a Stream
ExTwitter has a stream version, which provides live tweets as an Elixir Stream! What could be more exciting?
I can build a streaming Fetcher quite simply like this:
defmodule Fetcher do
  @spec fetch :: Enumerable.t
  def fetch do
    ExTwitter.stream_filter(track: "apple")
  end
end
Note that ExTwitter.stream_filter requires a filter. The ExTwitter README’s example uses “apple”, which is a frequently occurring keyword. I’ve used the same in the prototype and it seems to provide plenty of Tweets.
Drinking from the firehose
Here’s a test which doesn’t actually verify its result. It can’t, really, since the data is live. Instead it just prints out a bunch of URLs.
test "it should just stream out URLs from the firehose" do
  P1.stream
  |> Stream.map(fn x -> IO.puts x end)
  |> Enum.take(10)
end
Here’s the implementation:
def stream do
  Fetcher.fetch
  |> Stream.flat_map(fn x -> Parser.urls(x) end)
  |> Stream.map(fn x -> Unshortener.expand(x) end)
end
This takes the Stream of Tweets and composes it with the Parser. Recall that Parser.urls returns a list, so we use Stream.flat_map to flatten the results into a single Stream of URLs. Finally, we compose in Unshortener.expand to create a Stream of expanded URLs.

The result is a Stream of lazily evaluated, composed computation. No tweets have been fetched yet, and no parsing or unshortening has been done.
The test itself composes in IO.puts to print each URL and then takes 10 results as a list, forcing evaluation. Here’s the result for a single run:
http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337506718&customid=&lgeo=1&vectorid=229466&item=331653890787&pub=5575041009
http://gekoo.co/buy/01/?query=271991150643
https://itunes.apple.com/us/app/atn-wigan-edition/id1028785917?ls=1&mt=8
http://www.noktadan.com/nexus-10dan-yeni-sizintilar.html
error
http://www.youbidder.com/Bid/Free-Last-Second-Bid-Snipe-Apple-iPhone-6-Latest-Model-16GB-Silver-Verizon-Smartphone-.html/?item=121757452567
http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337506718&customid=&lgeo=1&vectorid=229466&item=361387296439&pub=5575041009
error
https://www.google.com/url?rct=j&sa=t&url=http://www.fool.com/investing/general/2015/09/19/apple-is-preparing-to-launch-apple-pay-in-china.aspx&ct=ga&cd=CAIyGjA0OWJlZTQ2ZTU1MjZiOWU6Y29tOmVuOlVT&usg=AFQjCNF0L0-ImRrlB5oKUQrUck-kiBoHUw
https://itunes.apple.com/ru/app/id668427550?mt=8
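The same pipeline shape can also be exercised without the live API. Here’s a sketch with a list of fake tweet-shaped maps standing in for Fetcher.fetch (the URLs are made up):

```elixir
fake_tweets = [
  %{entities: %{urls: [%{expanded_url: "http://buff.ly/a"},
                       %{expanded_url: "http://example.com/b"}]}},
  %{entities: %{urls: [%{expanded_url: "http://t.co/c"}]}}
]

# flat_map turns a stream of tweets (each with a *list* of URLs)
# into a single flat stream of URLs
urls =
  fake_tweets
  |> Stream.flat_map(fn tweet -> Enum.map(tweet.entities.urls, & &1.expanded_url) end)
  |> Enum.to_list()
# => ["http://buff.ly/a", "http://example.com/b", "http://t.co/c"]
```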
At this point I think the prototype is done and is a success. I still have some lessons to take away.
Lessons learned and next steps
First, I didn’t unshorten all the URLs. These didn’t make it into the example above, but I’ve seen these URLs come through:
- http://smarturl.it/dlcgpa1
- http://ow.ly/ScJ9I
- http://fb.me/6RK0qYjJI
These are shortened URLs, but perhaps they use something other than code 301? Also, there are errors in the URL list. What causes them? I will explore this in a future prototype / post.
Second, I didn’t complete this objective for the prototype:
Organize the 3 into a pipeline with potential for parallelization
because I didn’t consider parallelization. I will need to explore this further. I should be able to have one process reading the Twitter Stream and other processes doing the parsing and unshortening. Doing this within the framework of Stream composition may be interesting. I’ll explore this in a future prototype / post.
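One possible direction (my assumption; the prototype doesn’t do this, and the API arrived in later Elixir versions) is Task.async_stream/3, which runs a slow step in concurrent processes while keeping Stream semantics and, by default, result order:

```elixir
# Fake URLs and a slow expansion standing in for the HTTP round trips
fake_urls = ["http://a.example", "http://b.example", "http://c.example"]

slow_expand = fn url ->
  Process.sleep(10)        # simulate network latency
  url <> "/expanded"
end

# Each expansion runs in its own process, up to max_concurrency at a time;
# results come back in order as {:ok, value} tuples
expanded =
  fake_urls
  |> Task.async_stream(slow_expand, max_concurrency: 3)
  |> Enum.map(fn {:ok, url} -> url end)
# => ["http://a.example/expanded", "http://b.example/expanded", "http://c.example/expanded"]
```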
I expect to be working on and writing about this new project for some time and I hope you join me here on Learning Elixir to see how this project develops.