I have started a new project in order to play around with Elixir and OTP. Here’s an idea: a tool to scrape various tech news sites and notify me, in near realtime, that a link to Learning Elixir has been created. The goal being that it can help me better engage with people linking to my blog. Examples of sites I could monitor:
- Hacker News
Kicking off the project
Here’s my approach to starting the project. I follow something like this for most new projects.
- I created a file with some preliminary notes including the vision above
- I wrote up what I think the basic requirements are
- I put together a plan for an initial prototype
- I implemented the first revision of the prototype
- I analyzed how well the prototype fit my needs
- I decided on next steps based on the analysis of the prototype
The rest of this post will describe my experience carrying out these steps.
Here is my first cut at requirements:
1. Continuously fetch from the services to get new content
2. Parse out the content from fetched data
3. Unshorten shortened URLs
4. Present results
There are probably some steps I’m missing between 3 and 4, but 1-3 are a good place to start.
The above are activities. The activities require state to be maintained:
- Store state of the fetchers (i.e. where to fetch from next; in other words, what “new” means)
- Store pending content that has been fetched but still needs to be parsed
- Store parsed content that contains domains
- Store relationship between collected domains and referring content
- Store rankings for domains
I planned out the first prototype, impressively named P1, as an experiment to work out the basics of the required activities. This is a little like a development spike, though I’ve written tests and kept the code.
- Build a fetcher for a single tweet that has what I want in it
- Build a parser for Twitter:
  - Extract http links
- Build an unshortener
- Organize the 3 into a pipeline with potential for parallelization
- Dump domains to stdout or something
- Don’t save any of the state described above
- Build a fetcher for Twitter live stream
- Run it end to end
Building the Prototype
First I created a new mix project with mix new p1. I’ve pushed this to GitHub as ds_prototype if you want to follow along.
In order to interact with Twitter I’ll use the ExTwitter module. I followed the directions in the README to update mix.exs.
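The changes look something like this (the version constraint and credential values here are illustrative placeholders; follow the ExTwitter README for the current details):

```elixir
# mix.exs - add ExTwitter to the dependency list
# (the version constraint is illustrative)
defp deps do
  [{:extwitter, "~> 0.5"}]
end
```

```elixir
# config/config.exs - OAuth credentials for the Twitter API
# (placeholder values; use your own application's keys)
config :extwitter, :oauth, [
  consumer_key: "...",
  consumer_secret: "...",
  access_token: "...",
  access_token_secret: "..."
]
```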
I want to write a function to fetch a specific Tweet, one of my own old Tweets. It has a shortened link in it, so it should be a great test.
FetcherSingle should implement this part of the prototype:
Build a fetcher for a single tweet that has what I want in it
In order to write this I started with a test:
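A sketch of what that test plausibly looked like (the module name comes from the surrounding text; the asserted id is the Tweet id used in the fetcher):

```elixir
defmodule FetcherSingleTest do
  use ExUnit.Case

  test "it fetches the test tweet" do
    # Fetch the one hard-coded Tweet and check we got the right one back.
    tweet = FetcherSingle.fetch()
    assert tweet.id == 628707248206925824
  end
end
```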
Actually, before this I screwed around with ExTwitter and pulled a bunch of Tweets from my timeline. I found the Tweet I wanted. This is how ExTwitter represents it:
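Abbreviated, the struct looks something like this (most fields and all of the concrete values are elided; only the parts used here are shown):

```elixir
%ExTwitter.Model.Tweet{
  id: 628707248206925824,
  text: "...",                # the Tweet's text, containing a t.co link
  entities: %{
    urls: [
      %{
        url: "...",           # the t.co link as it appears in the text
        expanded_url: "...",  # the URL the t.co link points at
        display_url: "..."    # the truncated form shown in clients
      }
    ]
  }
}
```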
What is a Tweet?
The fields I’m going to need to use are:
- id - at least for this hack, to look up a specific Tweet
- entities.urls - Twitter already expands from t.co to the expanded URL here
- text - the text of the Tweet
Given all this I was able to write this simple fetcher:
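A sketch of that fetcher, assuming ExTwitter.show/1 (which wraps Twitter’s statuses/show endpoint):

```elixir
defmodule FetcherSingle do
  # Fetch one specific, known Tweet by id.
  def fetch do
    ExTwitter.show(628707248206925824)
  end
end
```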
The function just calls ExTwitter to fetch my Tweet which has id 628707248206925824. This passes the test.
The next step was to build a parser. Parser implements this part of the prototype:
Build a parser for Twitter: Extract http links
To get my bearings I started with parsing the text from the Tweet. Here’s the test:
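It would have looked roughly like this (the Tweet text is a stand-in, not the real one):

```elixir
defmodule ParserTest do
  use ExUnit.Case

  test "it parses the text from a tweet" do
    # A fake Tweet with only the :text field populated.
    tweet = %ExTwitter.Model.Tweet{text: "some tweet text"}
    assert Parser.text(tweet) == "some tweet text"
  end
end
```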
Here I’m using a fake %ExTwitter.Model.Tweet but the parser handles it just fine:
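A minimal parser that satisfies this is a single pattern-matching clause; because it matches on the :text key rather than the struct type, the fake Tweet works too:

```elixir
defmodule Parser do
  # Pattern match the :text field out of anything Tweet-shaped,
  # real struct or fake map alike.
  def text(%{text: text}), do: text
end
```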
Next I need to parse the URL. I plugged in the data for my Tweet which I extracted above:
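The test looked something like this (the URL here is a placeholder for the real expanded_url from my Tweet):

```elixir
test "it parses the urls from a tweet" do
  # A fake Tweet carrying a single URL entity.
  tweet = %ExTwitter.Model.Tweet{
    entities: %{urls: [%{expanded_url: "http://example.com/some-link"}]}
  }
  assert Parser.urls(tweet) == ["http://example.com/some-link"]
end
```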
This function passes the test:
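A sketch of the parser at this point:

```elixir
defmodule Parser do
  def text(%{text: text}), do: text

  # entities.urls holds one map per link, each with display/expanded/original
  # variants; keep just the expanded_url of each.
  def urls(%{entities: %{urls: urls}}) do
    Enum.map(urls, fn url -> url.expanded_url end)
  end
end
```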
Here we take the entities.urls field, which is a list containing the 3 different URL variations (display, expanded, original). We map that list to a new list containing just the expanded_url, and that list is returned. As you can see, the test expects a list with a single URL.
How do you unshorten a URL? I know there are some sites that provide unshortening as a service, but if unshortening isn’t a difficult process I’d rather not deal with a third-party API and its rate limits. StackOverflow had some helpful advice. I really just need to request the URL and look for a redirect. I can use a HEAD request for efficiency.
Anyway, I started with a test. I iterated a bit to make sure different cases worked correctly. Here are all the tests I ended up with:
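In sketch form they looked something like this (the URLs and the :error return value are placeholders and reconstructions; the originals used the real shortened links from the test Tweet):

```elixir
defmodule UnshortenerTest do
  use ExUnit.Case

  test "it can unshorten a test URL" do
    # Placeholder for the real shortened link and its destination.
    assert Unshortener.expand("http://bit.ly/SOME_SHORT_LINK") ==
             "http://example.com/full-article"
  end

  test "it passes through normal URLs" do
    assert Unshortener.expand("http://example.com/") == "http://example.com/"
  end

  test "it reports error for bad URLs" do
    assert Unshortener.expand("http://no-such-host.invalid/") == :error
  end

  test "it can unshorten double shortened URLs" do
    # The t.co link redirects to another shortener, which redirects again.
    assert Unshortener.expand("http://t.co/SOME_SHORT_LINK") ==
             "http://example.com/full-article"
  end
end
```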
The test “it can unshorten a test URL” just checks the URL from our test Tweet.
The test “it passes through normal URLs” checks a non-shortened URL to make sure it comes back as is.
The test “it reports error for bad URLs” tests a URL that doesn’t resolve and expects an error result.
The last test, “it can unshorten double shortened URLs”, tries the “t.co” URL from the test Tweet. This URL is a shortened URL for a shortened URL.
Unshortener.expand needs to recursively unshorten in order to expand correctly.
To implement this I need to make web requests. I used HTTPotion to do this. It is simple enough to set up and configure by following the project’s README.
Here’s the implementation:
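A sketch of the module (the :error return value and the helper name are reconstructions, and error handling for requests that fail outright is simplified):

```elixir
defmodule Unshortener do
  # Expand a possibly-shortened URL: issue a HEAD request and
  # recursively follow 301 redirects until a final URL is reached.
  def expand(short) do
    response = HTTPotion.head(short)

    case response.status_code do
      301 ->
        # A redirect: recursively expand whatever it points at.
        expand(location(response.headers))

      code when code in 200..299 ->
        # Success: the URL wasn't shortened (any further).
        short

      _ ->
        :error
    end
  end

  # HTTP header names are case insensitive, but HTTPotion returns them
  # verbatim ("Location" from Bit.ly, "location" from t.co), so downcase
  # each key before comparing.
  defp location(headers) do
    {_key, value} =
      Enum.find(headers, fn {key, _value} ->
        String.downcase(to_string(key)) == "location"
      end)

    value
  end
end
```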
The interesting parts here are checking for the location and the recursive expansion.
The location was a little tricky because HTTPotion returns the headers as a Keyword list where the key names match what’s in the real HTTP headers. But HTTP headers are supposed to be case insensitive. I found that Bit.ly returned the location field as “Location” while t.co returned it as “location”. To get this right I use Enum.find on the headers Keyword list and downcase each key before checking against “location”.
In Unshortener.expand/1 I use a case statement to pattern match against the HTTP result code. For code 301, a redirect, I recursively expand whatever the location field gives as the target of the redirect. For successful codes I just return the original URL (stored in short). If anything else happens I return an error.
Organize FetcherSingle, Parser, and Unshortener into a pipeline with potential for parallelization
I needed to put everything together. Here’s the test:
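Something like this (the expected URL is a placeholder for the known final destination of the test Tweet’s shortened link):

```elixir
defmodule PipelineTest do
  use ExUnit.Case

  test "it runs the whole pipeline" do
    # Fetch, parse, and unshorten end to end for the single test Tweet.
    assert Pipeline.run() == ["http://example.com/full-article"]
  end
end
```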
And here’s the implementation:
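In sketch form (Pipeline is a plausible name for the module; the text doesn’t name it):

```elixir
defmodule Pipeline do
  # Fetch the test Tweet, parse out its URLs, and expand each one.
  def run do
    FetcherSingle.fetch()
    |> Parser.urls()
    |> Enum.map(&Unshortener.expand/1)
  end
end
```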
This just runs each of the 3 activities in a pipeline and returns a list of the results.
Twitter as a Stream
ExTwitter has a stream version. This provides live Tweets as an Elixir Stream! What could be more exciting?
I can build a streaming Fetcher quite simply like this:
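A sketch of the streaming fetcher:

```elixir
defmodule Fetcher do
  # A live, lazy Stream of Tweets matching the filter keyword.
  def stream do
    ExTwitter.stream_filter(track: "apple")
  end
end
```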
ExTwitter.stream_filter requires a filter. The ExTwitter README’s example uses “apple” which is a frequently occurring keyword. I’ve used the same in the prototype and it seems to provide plenty of Tweets.
Drinking from the firehose
Here’s a test which isn’t actually verifying its result. It can’t really, since the data is live. Instead it just prints out a bunch of URLs.
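It looked something like this (PipelineStream is a stand-in name for the streaming pipeline module; the original may have been named differently):

```elixir
test "it prints URLs from the live stream" do
  PipelineStream.run()
  |> Stream.map(&IO.puts/1)   # print each result as it flows through
  |> Enum.take(10)            # force evaluation of the first 10
end
```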
Here’s the implementation:
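In sketch form, again with PipelineStream as a stand-in module name:

```elixir
defmodule PipelineStream do
  # Compose the live Tweet Stream with parsing and unshortening.
  # Nothing executes until the Stream is forced.
  def run do
    Fetcher.stream()
    |> Stream.flat_map(&Parser.urls/1)
    |> Stream.map(&Unshortener.expand/1)
  end
end
```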
This takes the Stream of Tweets and composes it with the Parser. Recall that Parser.urls returns a list, so we use Stream.flat_map to flatten into a single Stream of URLs. Finally, we compose in Unshortener.expand to create a Stream of expanded URLs. The result is a Stream of lazily evaluated, composed computation. No Tweets are fetched yet; no parsing or unshortening has been done.
The test itself composes in
IO.puts to print each list and then takes 10 results as a list, forcing evaluation. Here’s the result for a single run:
```
http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337506718&customid=&lgeo=1&vectorid=229466&item=331653890787&pub=5575041009
http://gekoo.co/buy/01/?query=271991150643
https://itunes.apple.com/us/app/atn-wigan-edition/id1028785917?ls=1&mt=8
http://www.noktadan.com/nexus-10dan-yeni-sizintilar.html
error
http://www.youbidder.com/Bid/Free-Last-Second-Bid-Snipe-Apple-iPhone-6-Latest-Model-16GB-Silver-Verizon-Smartphone-.html/?item=121757452567
http://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10044&campid=5337506718&customid=&lgeo=1&vectorid=229466&item=361387296439&pub=5575041009
error
https://www.google.com/url?rct=j&sa=t&url=http://www.fool.com/investing/general/2015/09/19/apple-is-preparing-to-launch-apple-pay-in-china.aspx&ct=ga&cd=CAIyGjA0OWJlZTQ2ZTU1MjZiOWU6Y29tOmVuOlVT&usg=AFQjCNF0L0-ImRrlB5oKUQrUck-kiBoHUw
https://itunes.apple.com/ru/app/id668427550?mt=8
```
At this point I think the prototype is done and is a success. I still have some lessons to take away.
Lessons learned and next steps
First, I didn’t unshorten all the URLs. These didn’t make it into the example above, but I’ve seen these URLs come through:
These are shortened but perhaps use something other than code 301? Also, there are errors listed in the URL list. What causes them? I will explore this in a future prototype / post.
Second, I didn’t complete this objective for the prototype:
Organize the 3 into a pipeline with potential for parallelization
because I didn’t consider parallelization. I will need to explore this further. I should be able to have one process reading the Twitter
Stream and other processes doing the parsing and unshortening. Doing this within the framework of
Stream composition may be interesting. I’ll explore this in a future prototype / post.
I expect to be working on and writing about this new project for some time and I hope you join me here on Learning Elixir to see how this project develops.