
About 6 months ago I wrote about idiomatic patterns for Streams in Elixir. Over the last week, while working on my new project, I happened across another very interesting pattern.

In this post I’ll describe the pattern I found in the ExTwitter module as well as show how I’ve adapted the pattern to my own project.

ExTwitter

The ExTwitter module provides an Elixir interface to the Twitter API. It lets you tweet and receive Streams of tweets. Since this post is about Stream patterns, I’ll focus on the Stream aspect.

The ExTwitter README goes into great detail on setup and use of the API and includes this example of the Stream API:

stream = ExTwitter.stream_filter(track: "apple") |>
  Stream.map(fn(x) -> x.text end) |>
  Stream.map(fn(x) -> IO.puts "#{x}\n---------------\n" end)
Enum.to_list(stream)

# => Tweets will be displayed in the console as follows.
# Apple 'iWatch' rumour round-up
# ---------------
# Apple iPhone 4s 16GB Black Verizon - Cracked Screen, WORKS PERFECTLY!
# ---------------
# Apple iPod nano 7th Generation (PRODUCT) RED (16 GB) (Latest Model) - Full read by
# ---------------

For this to work, ExTwitter must be receiving data from Twitter on an ongoing basis. In fact, I suspected that must be happening in a separate process, because the process calling ExTwitter.stream_filter is busy printing the output shown above.

Server-based Streams

So, I think that ExTwitter is building a Stream that contains data coming from a separate process. Let’s look into the source to see if this is true and, if so, how it works:

# from lib/extwitter/api/streaming.ex
def stream_filter(options, timeout \\ @default_stream_timeout) do
  {options, configs} = seperate_configs_from_options(options)
  params = ExTwitter.Parser.parse_request_params(options)
  pid = async_request(self, :post, "1.1/statuses/filter.json", params, configs)
  create_stream(pid, timeout)
end

This is the API function. After collecting some configuration data it calls async_request and then create_stream, which I assume creates a stream backed by that process.

async_request returns a pid, so it must be creating a new process to do asynchronous work. It is passed :post and "1.1/statuses/filter.json". I’m not even going to look into that function; I’m going to confidently assume that it sends a POST request to Twitter’s “1.1/statuses/filter.json” endpoint.

But let’s take a look at create_stream:

# from lib/extwitter/api/streaming.ex
defp create_stream(pid, timeout) do
  Stream.resource(
    fn -> pid end,
    fn(pid) -> receive_next_tweet(pid, timeout) end,
    fn(pid) -> send pid, {:cancel, self} end)
end

We can see this just wraps up a call to Stream.resource. But, what does Stream.resource/3 do? From the docs:

def resource(start_fun, next_fun, after_fun)

Emits a sequence of values for the given resource.

Similar to transform/2 but the initial accumulated value is computed lazily via start_fun and executes an after_fun at the end of enumeration (both in cases of success and failure).

Successive values are generated by calling next_fun with the previous accumulator (the initial value being the result returned by start_fun) and it must return a tuple containing a list of items to be emitted and the next accumulator. The enumeration finishes if it returns {:halt, acc}.

As the name says, this function is useful to stream values from resources.
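
To make Stream.resource/3 concrete before we return to ExTwitter, here is a typical use, separate from the main example: streaming the lines of a file, opening it lazily and closing it when enumeration ends. The file name here is just a placeholder.

lines = Stream.resource(
  fn -> File.open!("data.txt") end,          # start_fun: open the resource
  fn file ->
    case IO.read(file, :line) do
      line when is_binary(line) -> {[line], file}  # emit one line
      _ -> {:halt, file}                           # eof: finish enumeration
    end
  end,
  fn file -> File.close(file) end)           # after_fun: clean up

Enum.take(lines, 3)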

Ok, so let’s consider the arguments in our particular call:

  1. start_fun is passed fn -> pid end. This will just evaluate to pid. So the “accumulator” is the pid of the async request process created in stream_filter. This is the process that I believe is receiving data from Twitter.

  2. next_fun is fn(pid) -> receive_next_tweet(pid, timeout) end. I’m going to assume that this fetches a single tweet from the async process. This means that when the Stream is evaluated it will call receive_next_tweet/2 to generate successive values.

  3. after_fun is fn(pid) -> send pid, {:cancel, self} end. I guess this is a message to shut down the async process.

So, more generally the pattern is:

  • Create a server that generates or retrieves data on an ongoing basis. Note that you could use a GenServer or an Agent for this server (a minimal sketch follows this list).
  • Include a function / message for that server that returns one data item at a time.
  • Use Stream.resource to generate an Elixir Stream from the stream of generated items.
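
Here is a minimal sketch of this pattern outside of ExTwitter. The Counter module and its :next message are hypothetical names, standing in for any server that generates data on an ongoing basis:

# A sketch: a GenServer that produces data, wrapped as a Stream.
defmodule Counter do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, 0)
  def init(n), do: {:ok, n}

  # Return one data item at a time.
  def handle_call(:next, _from, n), do: {:reply, n, n + 1}

  def stream do
    Stream.resource(
      fn ->                                   # start_fun: start the server
        {:ok, pid} = start_link()
        pid
      end,
      fn pid -> {[GenServer.call(pid, :next)], pid} end,  # next_fun
      fn pid -> GenServer.stop(pid) end)      # after_fun: shut it down
  end
end

Counter.stream() |> Enum.take(3)  # => [0, 1, 2]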

For the specific case of ExTwitter.stream_filter there are many details I’ve skipped over. For example, we haven’t looked at how to actually receive or manage data from Twitter. But I wanted to highlight the Stream pattern here, which I think is applicable to servers that stream any kind of data.

The benefit of this pattern is that it allows us to use the Stream APIs to compose work on top of the stream of data or events coming out of a server process.
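
For example, something like this composes filtering and truncation on top of the tweet stream. This is a sketch reusing the stream_filter call from earlier:

ExTwitter.stream_filter(track: "apple")
|> Stream.map(fn tweet -> tweet.text end)
|> Stream.filter(fn text -> String.contains?(text, "iPhone") end)
|> Enum.take(10)  # evaluates only as much of the stream as needed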

If you like this post then you should know that I'm writing a book full of patterns like these called Idiomatic Elixir.

If you’re interested in the book, then sign up below to follow its progress and be notified when it launches. You’ll also receive two free chapters as a sample of what the book will contain.

Along the way you will also receive Elixir tips based on my research for the book.

Decoupled processing of Streams

Imagine you had a Stream of tweets as we’ve seen above and you want to do some post-processing on them. This could be any task that takes a little time: for example, saving them to a database, sending them to another system, or, in my new project, extracting and expanding any shortened URLs that appear within the tweet.

Based on this post-processing I want to produce a Stream of data - a Stream of URLs. One way to do this would be:

ExTwitter.stream_filter(track: "apple")
|> Stream.flat_map(&Parser.urls/1)
|> Stream.map(&Unshortener.expand/1)

But, this leads to very serial processing and throws away all of the concurrency goodness available from the Erlang VM.

To improve the situation I could imagine doing something like:

ExTwitter.stream_filter(track: "apple")
|> Stream.flat_map(&Parser.urls/1)
|> Stream.map(fn url ->
  spawn_async_task(fn ->
    Unshortener.expand(url)
  end)
end)

But this doesn’t quite work either, because we need some way to wait for the async tasks, and we need to decouple the spawning and waiting from each other. If we don’t decouple the two, each spawn and wait completes before subsequent elements in the stream are generated, and we only end up with one async task running at a time.

The question that remains is how do we decouple the spawning and awaiting of tasks?

Well, one way we can approach this problem is to put a server with a queue between the two halves of our computation and use a variation of the pattern from the first part of this post:

{:ok, queue} = BlockingQueue.start_link(5)
spawn_link(fn ->
  ExTwitter.stream_filter(track: "apple")
  |> Stream.map(&spawn_some_work/1)
  |> Enum.each(fn x -> BlockingQueue.push(queue, x) end)
end)

Stream.repeatedly(fn -> BlockingQueue.pop(queue) end)
|> Stream.map(&extract_work/1)

This uses my BlockingQueue module and starts by creating a queue that can hold up to 5 elements. BlockingQueue blocks the caller from pushing when the queue is full. Similarly, it blocks the caller from popping when the queue is empty. In this example, the size 5 is chosen arbitrarily.
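
As a quick aside, a toy example shows the blocking behavior, using only the push and pop calls from above:

{:ok, queue} = BlockingQueue.start_link(2)
BlockingQueue.push(queue, 1)
BlockingQueue.push(queue, 2)

# The queue is now full, so this push blocks in its own process until
# an item is popped off the other end.
spawn(fn -> BlockingQueue.push(queue, 3) end)

BlockingQueue.pop(queue)  # => 1, which unblocks the pushing process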

Back to the main example: after creating the queue, the code above spawns a new process.

Within the new process we create the ExTwitter Stream. Let’s ignore spawn_some_work for the moment. The process then uses Enum.each to push items into the queue. Note that Enum.each is eager. Ordinarily, using an eager function would try to evaluate the entire, infinite Twitter Stream. But the BlockingQueue controls the rate at which the Stream is evaluated: at most 5 elements can be evaluated and pushed into the queue before the producer blocks.

The queue is blocking, and this means it will block evaluation of the ExTwitter stream until items are popped off the other end. This way, BlockingQueue controls the rate of evaluation of infinite Streams.

In the parent process, the code above creates a new stream. This is slightly different from the pattern we started with. Here, I use Stream.repeatedly/1 to repeatedly pop values from the queue. I don’t need the setup and teardown cases from Stream.resource/3, so Stream.repeatedly/1 works just fine.

The Stream of results coming out of the anonymous process is then mapped with extract_work. We’ll come back to this function in a moment.

The resulting Stream can be composed with further computation and eventually evaluated. As it is evaluated it will take results from the queue. As it takes results from the queue it will unblock the anonymous server to take more tweets from the ExTwitter Stream.

This is the general pattern:

  • Create a queue
  • Create a process which evaluates the input stream and pushes elements into the queue
  • Create a new stream which repeatedly pops items from the queue.

This effectively decouples the computation of the streams.
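
Wrapped up as a reusable helper, the pattern might look like the sketch below. Decouple is a hypothetical module name, and the code assumes the BlockingQueue module from above:

defmodule Decouple do
  # Evaluate `input` in a separate process and return a new Stream that
  # pops the results. The producer runs at most `depth` elements ahead.
  def stream(input, depth \\ 5) do
    {:ok, queue} = BlockingQueue.start_link(depth)

    spawn_link(fn ->
      Enum.each(input, fn x -> BlockingQueue.push(queue, x) end)
    end)

    Stream.repeatedly(fn -> BlockingQueue.pop(queue) end)
  end
end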

Asynchronous Work

In the example above I have placeholder functions spawn_some_work and extract_work. These represent creating additional processes to do work like unshortening URLs. I would have liked to use Elixir’s Task module to kick off the work. However, Task won’t actually serve our purposes: a Task must be started and awaited in the same process. This is something I learned and wrote about in Learning Elixir Task.

While working with these decoupled streams I found that I can instead use Elixir’s Agent module to manage the work. Here’s how to use Agent to act like a Task:

{:ok, pid} = Agent.start_link(fn -> work end)
# Do something else
result = Agent.get(pid, fn x -> x end)
Agent.stop(pid)

Here Agent.start_link/2 serves the role of Task.async/1. The Agent will start up and run the given function to set its initial state.

Agent.get/3 retrieves the state. It takes an additional function to transform the state. We don’t need any transformation, so we use the identity function fn x -> x end. If the Agent hasn’t finished initializing then Agent.get/3 will block. This is much like Task.await/2.

Agent can maintain its state across multiple get and update calls. For our purposes we just need to get the value once. So, after we are done we call Agent.stop/1 to shut the Agent down.

Now, to put this all together:

# Start an Agent whose initial state is the result of the work.
def spawn_some_work(url) do
  Agent.start(fn -> Unshortener.expand(url) end)
end

# Retrieve the result, then shut the Agent down.
def extract_work({:ok, pid}) do
  result = Agent.get(pid, fn x -> x end)
  Agent.stop(pid)

  result
end

{:ok, queue} = BlockingQueue.start_link(5)
spawn_link(fn ->
  ExTwitter.stream_filter(track: "apple")
  |> Stream.flat_map(&Parser.urls/1)
  |> Stream.map(&spawn_some_work/1)
  |> Enum.each(fn x -> BlockingQueue.push(queue, x) end)
end)

Stream.repeatedly(fn -> BlockingQueue.pop(queue) end)
|> Stream.map(&extract_work/1)

And this is it. This will stream tweets through the parser and then kick off another process to unshorten the URLs. Up to 5 such processes can be outstanding in the queue, and one more may have been popped and be waiting in Agent.get/3. The calling process is thus decoupled from the evaluation of the ExTwitter stream; it picks up the results when they are done.

The way I look at this is that the anonymous server process runs about 5-6 elements ahead of the calling process. This adds some latency to the pipeline of work, which allows time for the URL requests to complete. It lets multiple URL requests be outstanding at once, so the system waits on many requests concurrently.

The depth of the queue, 5 elements, is tunable. It can be adjusted based on the number of processes or web requests you want to allow outstanding at one time. 5 is just an example; if you try this, experiment with the value to find the best setting for your workload.

Server Replies to Caller Pattern

Here’s a bonus Elixir pattern - not related to Streams.

As I mentioned above, a Task must be started and awaited in the same process. This is required because Task saves the pid of the process calling Task.async/1 and sends its result as a message only to that process.

ExTwitter does the same thing. That is, it hangs on to the pid of the calling process and uses it to send back messages. This means that the Stream returned by ExTwitter.stream_filter can only be evaluated by the process that created it.
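
The mechanism behind both cases looks roughly like this sketch: the worker captures the caller’s pid when it is created and sends its result only to that process, so only the original caller can receive it.

caller = self()

spawn_link(fn ->
  result = 6 * 7  # stand-in for the real work
  send(caller, {:result, result})
end)

receive do
  {:result, value} -> value  # only the saved caller pid receives this
end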

These two cases suggest a pattern. I have to say, I’m not that excited about this pattern and would prefer that functions like Task.await and Streams be usable from any process. At least Task documents this requirement.

Next Steps

The code above is far from production ready. It needs some work in order to clean up some of the processes once things are done. But, I hope the code was clear and helped to illustrate the patterns and techniques. I’m going to be putting these patterns into my new project and I hope to write about my experiences in the near future.

That said, there may be a few other posts coming up:

  • Elixir 1.1.0 was recently released and I’d like to take it for a spin and write about some of its new features.

  • ElixirConf 2015 starts at the end of the week. I plan to write a post on my favorite talks from the conference.