Building a Live Event View: In Theory and in Practice
Talks describe an idealized version of the system. If you want to know what the system was really like, you have to talk to the speaker after the talk and the public Q&A.
In this post, I want to explore how the design of a system changes before, during, and after it’s built. The first part covers the initial design of the system and was written before I started building it. Part 2 goes over the unexpected issues that came up while building the system and how the design changed to reflect them. Several months from now, I plan on releasing a third part covering how the system has changed after running in production.
Part 1 - The System in Theory
One challenge our customers have run into is ensuring they created their events correctly. That’s where the live view comes in. The live view shows you all the events as Freshpaint receives them. That way you can make sure your event fires when you expect it to.
Many other products have their own live view. To give you an idea of what the end result will look like, here is what the live event view for Mixpanel looks like:
In building the live view, there are three main priorities I have:
- It should be fast to build. I’m the only engineer at Freshpaint right now so it’s critical that I make the best use of my time.
- It should require little ongoing maintenance. Again, I’m the only developer, so my time is important.
- The live view should introduce a minimal amount of technical debt. Over time, the Freshpaint architecture is going to need to change. The live view should not make it more difficult to change the Freshpaint architecture.
I also have a few secondary considerations that I’m not particularly concerned about:
- Scalability - The system should be able to keep up with our incoming traffic and it should be possible to scale the system as needed. Given that the live view is only going to be active when someone is using it, and even then it will only be processing a fraction of our data, I don’t think this will be an issue.
- Costs - Since this system is only going to be processing a tiny subset of our data, I don’t think costs are going to be a concern.
- Availability - While it’s important that this feature stays up so our customers can use it, there aren’t any dire consequences if it temporarily stops working. I think an uptime of 99% for the live view is more than enough.
How it Will Work - In Theory
There are two main pieces of the live event view:
- The backend that evaluates events and forwards them to the frontend.
- How the backend will communicate with the frontend.
Starting with the communication method, I think a clear solution is to use websockets. Websockets are one way you can maintain a two-way communication channel between the browser and your backend. After doing some research, I discovered the websockets functionality provided by AWS API Gateway. The interface API Gateway gives for handling websockets is ideal for what I’m doing. When a new websocket connection is created, you are provided a token. You can then pass that token to API Gateway to send data to the frontend. This is convenient because multiple backend processes can communicate with the frontend as long as each of them has the token.
The plan is that when a new live view connection is initiated from the frontend, the token provided by API Gateway will be written to a Postgres DB. When the websocket connection is closed, the token will be deleted from the DB. Pretty straightforward.
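As a rough sketch of what those connect and disconnect handlers could look like (the `live_view_connections` table and its columns are hypothetical, and I’m using TypeScript with the node `pg` client purely for illustration):

```typescript
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
import { Pool } from "pg";

// Connection settings come from the standard PG* environment variables.
const pool = new Pool();

// Handler for API Gateway's $connect route: remember the connection token.
export const onConnect = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  await pool.query(
    "INSERT INTO live_view_connections (connection_id) VALUES ($1)",
    [event.requestContext.connectionId]
  );
  return { statusCode: 200, body: "connected" };
};

// Handler for the $disconnect route: the live view is no longer active.
export const onDisconnect = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  await pool.query(
    "DELETE FROM live_view_connections WHERE connection_id = $1",
    [event.requestContext.connectionId]
  );
  return { statusCode: 200, body: "disconnected" };
};
```

API Gateway invokes the $connect route when the browser opens the websocket and the $disconnect route when it closes, so the table always reflects the set of live views that are currently open.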
As for the backend event evaluation, right now all incoming Freshpaint events are placed into an AWS Kinesis stream (Kinesis is a queue service, similar to Kafka). I have an existing lambda function that does basic processing of the incoming data. That function forwards the data to the different destinations.
I plan on adding an additional lambda function that will consume the incoming data. When the lambda function starts, it will first query Postgres for a list of active live views. The lambda will then scan through the incoming events, check if any of them are applicable to any of the active live views, and forward those events to the frontend by making a request to API Gateway. I already have code that handles event filtering so I can reuse that code for the live view.
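Sketched out, that lambda could look roughly like the following. The filtering function is a stand-in for the existing Freshpaint code, the table is the hypothetical one from the sketch above, and the actual API Gateway call is stubbed out (it comes up again in Part 2):

```typescript
import { KinesisStreamEvent } from "aws-lambda";
import { Pool } from "pg";

const pool = new Pool();

interface LiveView {
  connection_id: string;
  filter: string | null;
}

// Stand-in for Freshpaint's existing event filtering code.
const matchesLiveView = (event: { name?: string }, liveView: LiveView): boolean =>
  liveView.filter == null || event.name === liveView.filter;

// Stand-in for the API Gateway postToConnection call covered in Part 2.
async function forwardEvent(connectionId: string, event: unknown): Promise<void> {
  /* postToConnection call goes here */
}

export const handler = async (kinesisEvent: KinesisStreamEvent): Promise<void> => {
  // Which live views are currently open?
  const { rows: liveViews } = await pool.query<LiveView>(
    "SELECT connection_id, filter FROM live_view_connections"
  );
  if (liveViews.length === 0) return;

  for (const record of kinesisEvent.Records) {
    // Kinesis record payloads arrive base64 encoded.
    const event = JSON.parse(
      Buffer.from(record.kinesis.data, "base64").toString("utf8")
    );
    for (const liveView of liveViews) {
      if (matchesLiveView(event, liveView)) {
        await forwardEvent(liveView.connection_id, event);
      }
    }
  }
};
```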
Is data going to be sent to the live view too quickly? - Initially I plan on sending every event to the frontend. I estimate this shouldn’t be a problem as long as we are sending less than 1000 events a second. If it does wind up being a problem, I would limit the amount of data being sent from the backend to the frontend. I would use Postgres to keep track of how much data is being sent to the frontend. Before sending events to the frontend, I would have the lambda function check Postgres to see if the rate limit is exceeded.
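A hypothetical version of that check, assuming a per-connection counter column that a periodic job resets to zero:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Hypothetical cap; the events_sent counter would be reset
// by a periodic job (e.g. once a minute).
const MAX_EVENTS_PER_WINDOW = 1000;

// Increment the counter and report whether this connection is still under the limit.
async function underRateLimit(connectionId: string): Promise<boolean> {
  const { rows } = await pool.query(
    `UPDATE live_view_connections
        SET events_sent = events_sent + 1
      WHERE connection_id = $1
        AND events_sent < $2
      RETURNING events_sent`,
    [connectionId, MAX_EVENTS_PER_WINDOW]
  );
  return rows.length > 0;
}
```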
I do plan on implementing rate limiting on the frontend to limit the rate at which data is displayed.
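One straightforward way to do that is to buffer incoming messages and only flush them to the UI on a timer. A sketch, with a placeholder endpoint URL and a stand-in for the actual rendering:

```typescript
// Buffer incoming events and render them in batches a few times per second,
// so a burst of traffic doesn't make the live view unreadable.
const socket = new WebSocket(
  "wss://example.execute-api.us-east-1.amazonaws.com/production" // placeholder URL
);
const pending: unknown[] = [];

socket.onmessage = (message: MessageEvent) => {
  pending.push(JSON.parse(message.data));
};

setInterval(() => {
  if (pending.length === 0) return;
  const batch = pending.splice(0, pending.length);
  // Stand-in for whatever actually appends rows to the live view table.
  console.log(`rendering ${batch.length} new events`);
}, 250);
```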
Is AWS Lambda the best fit? - The one big downside I see to using AWS lambda is that each function invocation is short lived. This means in order to implement something like rate limiting, the lambda would need to communicate through a shared Postgres DB.
The main alternative I see to using AWS lambda is to set up a single server that will read data from Kinesis and forward events to the frontend. This does solve the rate limiting problem, but it requires me to provision a server and set up a deployment process for it. I’m also biased towards lambda because the existing Freshpaint code uses lambda.
Is API Gateway the best fit? - The big advantage I get out of using API Gateway is that it allows multiple backend processes to communicate with the frontend. If I’m using AWS Lambda to implement the live view backend, this is a requirement. If I were setting up a standalone server to handle the live view, I could have it directly maintain websocket connections to the frontend. I don’t think there’s any big downside to using API Gateway. I don’t expect that much total data to flow through it so costs should stay low.
I estimate it will take me four days to completely build the live view. That’s one day for the backend, two days to implement the frontend, and one day for any unknown unknowns.
Part 2 - The System in Practice
Here’s what the finished feature looks like:
Surprisingly, it took a lot less time to build than I was expecting. In total, it took me about a day and a half to build everything: half a day for the backend and a full day for the frontend. I think there were two reasons I had overestimated how long it would take to build:
Unknown Unknowns - I had added some time to my estimate in case there was a major flaw in my design. In the end, there weren’t any. I think if anything, I got lucky that my initial design wound up being a good one. I’m still going to include some time for unknown unknowns in projects I work on in the future.
Frontend Reuse - I had overestimated the amount of time it would take to build the frontend. I thought it was going to take two days when it only took one. The reason for this is that all the components in the live view UI already exist elsewhere in the product. For example, I had already designed a search bar that’s used elsewhere in Freshpaint, so I was able to reuse that component instead of designing everything from scratch. In the end, almost everything on the page was reused from another part of the Freshpaint UI. Going forward, I’m going to pay close attention to how much frontend code I’ll be able to reuse.
In building the system there were two key technical takeaways I had:
It can be a challenge to figure out how to correctly use AWS services - It took me a while to figure out how to send messages from the backend server to the frontend through the websockets interface provided by API Gateway. The complete code for doing so is not included in Amazon’s post on websockets. It turns out that you need to:
- Pass in the URL for your websocket API when initializing the backend API Gateway client.
- Use the postToConnection method provided by the API Gateway client.
I was able to figure out the second step, but it wasn’t working until I figured out the first step. I’ve run into similar issues in the past with Kinesis. I guess the lesson is that it can sometimes be a pain to figure out how to actually use an AWS service.
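For reference, here’s roughly what the combination of those two steps looks like, assuming the Node.js v2 AWS SDK (the endpoint URL is a placeholder for your websocket API’s URL; other SDKs name things slightly differently):

```typescript
import { ApiGatewayManagementApi } from "aws-sdk";

// Step 1: initialize the client with the URL of your websocket API.
// Without this, the postToConnection call below won't work.
const client = new ApiGatewayManagementApi({
  endpoint: "https://example.execute-api.us-east-1.amazonaws.com/production", // placeholder
});

// Step 2: send data to a specific frontend connection using its token.
async function forwardEvent(connectionId: string, event: unknown): Promise<void> {
  await client
    .postToConnection({
      ConnectionId: connectionId,
      Data: JSON.stringify(event),
    })
    .promise();
}
```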
Minimizing the blast radius is critical - Before writing Part 1 of this post, I had intended for the live view code to be part of the existing lambda function that does basic backend processing. When I started working on this post, I realized that would be a bad idea: if the code for the live view failed, it would cause the general backend processing to fail as well. I then changed my design to use the separate lambda function mentioned in Part 1.
This concern was proven valid when I initially deployed the live view to production. There was an issue where the DB schema did not match the schema the live view was expecting. This caused the live view code to immediately fail. If I had not deployed the live view as a separate function, that would have had a negative impact on existing backend functionality.
Overall the shipped product wound up being a lot closer to my initial design than I was expecting. It is interesting to reflect on what I learned after completely building a feature. I plan on writing posts similar to this when I have the opportunity.