Twitter Heron and Dhalion

Two recent papers have given more detail on Twitter’s distributed stream processing system Heron (the successor to Storm).

The first paper, Twitter Heron: Towards Extensible Streaming Engines, details how Heron has evolved into a modular architecture that makes it extremely flexible. Adrian Colyer gives a good summary of the paper on his blog The Morning Paper. The main takeaway for me is the ease with which Heron can be augmented with new features. As I read it, I saw how I could adapt my modelling work into a module that could aid in auto-scaling Heron topologies. Of course, not long after this the second paper, Dhalion: Self-Regulating Stream Processing in Heron, came out. Someone had already done it, but this is by no means a bad thing.

As before, Adrian Colyer gives a good summary of the paper. Dhalion is a framework for regulating and maintaining a running Heron cluster. The authors have implemented monitoring, diagnostic and remedial systems to identify and fix performance issues, and have shown impressive results.

However, I see one issue with Dhalion’s current approach to resolving performance problems. The system currently has no way to know if a proposed resolution is likely to succeed before it is deployed. Dhalion will implement a resolution and observe its results in order to assess if it is successful. Updating a Heron topology (the equivalent of rebalancing in Storm) has a latency cost and it takes some time for the topology’s performance to stabilise after an update. Only after the topology has stabilised is it clear if a resolution has been successful or not. If not, then Dhalion has to repeat the process, potentially incurring further latency costs.

If Dhalion had a way to model the effects of a proposed resolution, then it could iterate to an effective solution much faster. The authors of Dhalion have already done the lion's share of the work, providing monitoring code and deployment options. I am looking forward to them open-sourcing Dhalion so I can investigate how easy it would be to integrate a modelling system, like the one I am developing for Storm, into the resolution process.
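
To make the argument a little more concrete, here is a toy sketch in Python of the difference between deploying every candidate resolution and screening candidates with a model first. None of this is Dhalion or Heron code: the function names, the fixed stabilisation cost, the throughput figures and the idea of a resolution being a single parallelism value are all assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical cost, in seconds, of one topology update plus the wait
# for performance to stabilise. Not a measured Heron figure.
STABILISE_SECONDS = 120


def trial_and_error(candidates: List[int],
                    meets_target: Callable[[int], bool]) -> Tuple[int, int]:
    """Deploy each candidate in turn until one works.

    Returns (chosen parallelism, total seconds spent on deployments).
    """
    spent = 0
    for parallelism in candidates:
        spent += STABILISE_SECONDS  # every attempt is a real deployment
        if meets_target(parallelism):
            return parallelism, spent
    raise RuntimeError("no candidate met the target")


def model_first(candidates: List[int],
                predicted_ok: Callable[[int], bool],
                meets_target: Callable[[int], bool]) -> Tuple[int, int]:
    """Only deploy candidates that a performance model predicts will work."""
    spent = 0
    for parallelism in candidates:
        if not predicted_ok(parallelism):
            continue  # rejected by the model, so no deployment cost
        spent += STABILISE_SECONDS
        if meets_target(parallelism):
            return parallelism, spent
    raise RuntimeError("no candidate met the target")


# Toy numbers (assumptions): tuples per second arriving, and what one
# instance can really process versus what the model believes it can.
INPUT_RATE = 1000
TRUE_RATE_PER_INSTANCE = 300
MODELLED_RATE_PER_INSTANCE = 280  # deliberately a little pessimistic


def meets_target(parallelism: int) -> bool:
    return parallelism * TRUE_RATE_PER_INSTANCE >= INPUT_RATE


def predicted_ok(parallelism: int) -> bool:
    return parallelism * MODELLED_RATE_PER_INSTANCE >= INPUT_RATE


candidates = [2, 3, 4, 5]
print(trial_and_error(candidates, meets_target))            # (4, 360)
print(model_first(candidates, predicted_ok, meets_target))  # (4, 120)
```

With these made-up numbers both approaches settle on the same parallelism, but the model-first loop pays the stabilisation cost once instead of three times. The model does not need to be perfect for this to help. It only needs to rule out candidates that are clearly going to fail.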