Building a Cloudflare Durable Objects Alternative in Go

Cloudflare Durable Objects might be my favorite piece of infrastructure in recent years. Being able to quickly define a consistent, single view of some state is amazing. Unfortunately, it comes with a pretty hefty price tag for our use at Tangia.
We use DOs (Durable Objects) as a caching layer in front of a fairly heavy API query that fetches online interactions for the Tangia extension and web page. When something changes on the server side, we immediately push updates to the DO, and clients hold a websocket connection to it so they receive those updates in real time.
This architecture is amazing. It’s ~100 lines of code, greatly improves the user experience, and relieves significant pressure from our database. The problem is that we keep these DOs alive the entire time someone is viewing a channel’s Tangia extension or viewer page, as well as whenever the streamer is online.
If you take an average of a 4-hour stream 3x a week across 300 live streamers… well, it adds up fast. Like 23,000,000 GB-hours per month fast.
Considering that we went from 2M hours in the prior period to 23M hours this period, we clearly needed another solution: we were going to outgrow this billing model really fast. While we could work with Cloudflare on an enterprise deal, I’ve been itching to build my own version of this.
For a sense of how much time that is: we went from the Bronze Age to casual space travel in a month.
$287? Like, no.
It’s not like we’re using anywhere near that much memory or CPU, either: most of these objects are only a few MB at most, and they’re effectively idle the vast majority of the time.
As mentioned in my FireScroll post, it’s more often than not far more efficient to build something explicitly dedicated to the use case than to pay the extra feature tax of generalized tools.
Cloudflare is apparently releasing something that will allow Durable Objects to sleep and be woken by websockets/requests. However, that still leaves the lack of TCP connections for connecting to the database! Plus, I still want to build it!

Requirements

To understand why I built Stateful Objects (name WIP) the way I did, let’s look at what we need and don’t need out of Durable Objects:

Need

  1. Global coordination of a single instance of state and code across arbitrary host instances around the world
  2. A scheduler for said global coordination
  3. In-memory caching
  4. Websocket and HTTP support

Don’t Need

This section is really important!
  1. Multi-tenant code - we only run our code, and we write it directly into the workers
  2. Dedicated resources per instance - since it’s our code, we can track memory usage at both the instance and hypervisor level, enforcing any limits needed and moving instances around as required
The greatest advantage is that we don’t need to worry about sandboxing and managing multi-tenant code. Being able to just invoke a new instance of an interface is magic!
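To make that concrete, here’s a rough sketch in Go of what a single-tenant instance interface could look like. All names here are hypothetical illustrations, not a final API:

```go
package objects

import "net/http"

// Instance is one globally-unique stateful object. The host guarantees
// at most one live copy per ID across the whole fleet.
type Instance interface {
	Start(id string) error // called when the scheduler places the instance here
	http.Handler           // HTTP and websocket-upgrade requests routed by ID
	Stop() error           // called before shutdown or migration
}

// InteractionCache is a stub of the interaction-caching object described
// earlier, just to show the shape.
type InteractionCache struct{ id string }

func (c *InteractionCache) Start(id string) error { c.id = id; return nil }
func (c *InteractionCache) Stop() error           { return nil }
func (c *InteractionCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("interactions for " + c.id))
}

// Since it's all our own code, "invoking a new instance" is just a
// constructor call: no sandbox, no isolate, no cold start.
func New(kind string) Instance {
	if kind == "interaction-cache" {
		return &InteractionCache{}
	}
	return nil
}
```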

Bonuses!

  1. Supporting TCP connections to databases (they can share connection pools! See the sketch after this list.)
  2. Serializable and consistent transactions to disk (we don’t use this, but it’s pretty trivial to add, and we needed the datastore for coordination anyway)
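The first bonus falls out of running in a normal process: every instance on a worker can share one database/sql pool, which is exactly what a V8 isolate can’t give you. A minimal sketch, where the driver choice and env var name are assumptions:

```go
package objects

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq" // any Postgres-wire driver works for CRDB
)

// db is one connection pool per worker process, shared by every
// instance scheduled onto this worker.
var db *sql.DB

func init() {
	var err error
	db, err = sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(32) // bounded regardless of how many instances run here
}
```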

Why not use Temporal

Temporal handles the scheduling of Workflows in ways we could certainly use for this, but there are a few reasons why we’d prefer not to.
The first is that we don’t want to clog up the event history with mutations, as there could be MANY. That includes both changes to the in-memory state, as well as sending messages out to thousands of connected websockets. While we can do some clever tricks for ignoring those operations during replay and just using queries, it’s abusing the system.

How we would do it with Temporal

A workflow is an instance of a DO. Your in-memory state is managed by a globally cached KV store, such as ristretto combined with CockroachDB. You never keep anything in actual memory; you store it in the KV store and fetch it when queries come in. Since it’s cached, the fetch will be instant, and the global KV store can handle request collapsing for concurrent requests.
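As a sketch of that read path: ristretto for the local cache, singleflight for the request collapsing, and CockroachDB as the backing store. The table and column names are made up for illustration:

```go
package kvstate

import (
	"context"
	"database/sql"

	"github.com/dgraph-io/ristretto"
	"golang.org/x/sync/singleflight"
)

// Store is a read-through cache over CockroachDB. Reads hit the local
// ristretto cache first; concurrent misses for the same key collapse
// into a single DB query via singleflight.
type Store struct {
	db    *sql.DB
	cache *ristretto.Cache
	group singleflight.Group
}

func NewStore(db *sql.DB) (*Store, error) {
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1e6,     // keys tracked for admission
		MaxCost:     1 << 28, // ~256MB of cached state per worker
		BufferItems: 64,
	})
	if err != nil {
		return nil, err
	}
	return &Store{db: db, cache: cache}, nil
}

// Get returns the serialized state for an instance key.
func (s *Store) Get(ctx context.Context, key string) ([]byte, error) {
	if v, ok := s.cache.Get(key); ok {
		return v.([]byte), nil
	}
	v, err, _ := s.group.Do(key, func() (interface{}, error) {
		var state []byte
		err := s.db.QueryRowContext(ctx,
			`SELECT state FROM instance_state WHERE instance_id = $1`, key,
		).Scan(&state)
		if err != nil {
			return nil, err
		}
		s.cache.Set(key, state, int64(len(state)))
		return state, nil
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```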
The only time an activity is used (and thus we fill the event history) is to fill the cache on start in the event that the workflow stopped and had to be restarted on another worker.
When the worker shuts down, it clears the KV (cache and CRDB).
Websockets will either need to be routed to the correct worker, or workers will have to be in a mesh. Then you can create websocket “rooms” based on the workflow. By building this websocket mesh on some pub/sub layer with NATS/Redis, or something like Centrifugal, you never have to worry about scheduling or coordination.
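A rough sketch of one worker’s side of that mesh, using NATS for the pub/sub and gorilla/websocket for the connections. The subject naming scheme is an assumption:

```go
package wsmesh

import (
	"sync"

	"github.com/gorilla/websocket"
	"github.com/nats-io/nats.go"
)

// Room fans messages for one workflow out to the websockets connected
// to this worker; NATS carries the message to every worker in the mesh.
type Room struct {
	mu    sync.Mutex
	conns map[*websocket.Conn]struct{}
	sub   *nats.Subscription
}

// JoinRoom subscribes this worker to the workflow's subject and relays
// every message to its local connections.
func JoinRoom(nc *nats.Conn, workflowID string) (*Room, error) {
	r := &Room{conns: make(map[*websocket.Conn]struct{})}
	sub, err := nc.Subscribe("room."+workflowID, func(m *nats.Msg) {
		r.mu.Lock()
		defer r.mu.Unlock()
		for c := range r.conns {
			// Best-effort fanout: drop connections that error out.
			if err := c.WriteMessage(websocket.TextMessage, m.Data); err != nil {
				c.Close()
				delete(r.conns, c)
			}
		}
	})
	if err != nil {
		return nil, err
	}
	r.sub = sub
	return r, nil
}

// Add registers a local websocket connection with the room.
func (r *Room) Add(c *websocket.Conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[c] = struct{}{}
}
```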
Mutation operations like modifying the KV store can still be queries, but as long as they are handled in serializable transactions (managed by CRDB), it’s safe to do them outside the event history.
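Extending the Store sketch above, a mutation could look like this, using the official crdb retry helper (CockroachDB runs at SERIALIZABLE isolation by default). The schema is again hypothetical:

```go
import (
	"context"
	"database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// Mutate applies a state change inside a serializable transaction,
// retrying automatically on contention via the crdb helper. It's safe
// outside Temporal's event history because CRDB, not Temporal, orders
// the writes.
func (s *Store) Mutate(ctx context.Context, key string, apply func([]byte) ([]byte, error)) error {
	return crdb.ExecuteTx(ctx, s.db, nil, func(tx *sql.Tx) error {
		var state []byte
		err := tx.QueryRowContext(ctx,
			`SELECT state FROM instance_state WHERE instance_id = $1 FOR UPDATE`, key,
		).Scan(&state)
		if err != nil && err != sql.ErrNoRows {
			return err
		}
		next, err := apply(state)
		if err != nil {
			return err
		}
		_, err = tx.ExecContext(ctx,
			`UPSERT INTO instance_state (instance_id, state) VALUES ($1, $2)`, key, next)
		return err
	})
}
```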
To prevent shutdown, you listen on a shutdown channel within the workflow that the query handler can write to. This can also have an inactivity timeout that gets refreshed on activity.
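For illustration, here’s roughly what that keep-alive loop could look like with Temporal’s Go SDK. I’ve sketched the refresh as a signal rather than a query-handler write, since queries aren’t supposed to mutate workflow state; the signal name and timeout are assumptions:

```go
package objects

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// InstanceWorkflow keeps one "stateful object" alive until it has seen
// no activity for idleTimeout. Each ping on the signal channel restarts
// the timer; query handlers for reads would be registered alongside this.
func InstanceWorkflow(ctx workflow.Context) error {
	const idleTimeout = 10 * time.Minute
	pings := workflow.GetSignalChannel(ctx, "activity-ping")

	for {
		timerCtx, cancelTimer := workflow.WithCancel(ctx)
		timer := workflow.NewTimer(timerCtx, idleTimeout)
		idle := false

		selector := workflow.NewSelector(ctx)
		selector.AddReceive(pings, func(c workflow.ReceiveChannel, more bool) {
			c.Receive(ctx, nil) // drain the ping; the loop restarts the timer
		})
		selector.AddFuture(timer, func(f workflow.Future) {
			idle = true
		})
		selector.Select(ctx)
		cancelTimer()

		if idle {
			return nil // no activity for idleTimeout: let the instance shut down
		}
	}
}
```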
As you can tell, this is definitely not how Temporal is supposed to be used, and they’d surely have a heart attack if I mentioned this to them in their Slack 😅.

The scheduler

Following the fly.io model, we need something that will succeed or fail fast. Thankfully we don’t have to place anything; we just need to find an eligible worker that already has the code to run the instance.
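Since we already have the datastore for coordination, the fail-fast claim can be a single conditional insert. A sketch, with a hypothetical table name:

```go
package scheduler

import (
	"context"
	"database/sql"
)

// ClaimInstance tries to register workerID as the single owner of an
// instance, succeeding or failing in one round trip. If another worker
// already owns it, the insert is a no-op and we fail fast.
func ClaimInstance(ctx context.Context, db *sql.DB, instanceID, workerID string) (bool, error) {
	res, err := db.ExecContext(ctx,
		`INSERT INTO instance_owners (instance_id, worker_id, claimed_at)
		 VALUES ($1, $2, now())
		 ON CONFLICT (instance_id) DO NOTHING`,
		instanceID, workerID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```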