IceDB v3 - Third Time’s the Charm

Published September 27, 2023
I can hear the meme in my head:
“Ah shit… here we go again”
~ CJ, GTA San Andreas

What is IceDB?

IceDB is an in-process Parquet merge engine for better data warehousing in S3, using only S3.
It runs stateless, making it far easier to manage and scale than other open source systems.
It does so in open, easily readable formats so that any language or framework can parse the log (JSONL) and read the data (parquet).
It tracks table schema as standard SQL types, and supports dynamic schema evolution.
It's queryable by anything that understands parquet, and runs 54x cheaper than managed solutions such as BigQuery, Snowflake, and Athena.
It retains many of the features of modern OLAP systems (such as ClickHouse’s Materialized Views), adds some new ones, and makes it way easier to build scalable data systems with a focus on true multi-tenancy.
IceDB can replace systems like BigQuery, Athena, and Snowflake, but with clever data design can also replace provisioned solutions such as a ClickHouse cluster, Redshift, and more.
Query engines such as DuckDB, ClickHouse, chDB, DataFusion, Pandas, or custom parquet readers in any language can easily read IceDB data in tens to hundreds of milliseconds. When combined with the IceDB S3 Proxy, IceDB can be queried transparently (the client just thinks it’s S3), e.g. with the ClickHouse S3 function s3('https://icedb-s3-proxy/**/*.parquet') or DuckDB’s read_parquet('s3://icedb-s3-proxy/**/*.parquet').

Why I shouldn’t have made this… a third time

We don’t need it for Tangia. We could use it, but we don’t need it.
I’ve spent so many hours… so many nights and weekends working on IceDB. Adding features, optimizing, fixing new bugs, writing tests, fixing tests, rewriting features to be faster, adding features again… the cycle where you start coding at 8am and all of a sudden it’s dinner.
I built IceDB because I found a gap in the space that existing technologies clearly fail to fill: fully serverless, multi-tenant, embedded data warehousing that is easy to run. That gap was itching the part of my brain that doesn’t understand why it exists, or how hard it would be to fill.
How hard would it be to build? How interested are folks in something that fills that gap? Why are the closest solutions so complex and proprietary? What do I not know about that space? Is this something I can build? Would people be interested in what’s effectively an open source BigQuery? Does filling that gap enable the construction of software previously too difficult for someone interested to tackle? Why do the closest solutions cost so much, and why are they all closed source?
For a space with such an open source presence, it’s dominated by big tech (or those trying to be).

What’s different from v2

There’s only one piece of required external infrastructure: S3.
The (cata)log is no longer in Postgres/CockroachDB, it’s now right in S3. You do need some locking (merge and tombstone clean locks) if you run concurrent modifications on the same tables, but you could have 1M instances each writing to their own table in the same bucket and no coordination needed.
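As a sketch of that coordination model, only mutations of the same table contend. The names below are hypothetical, and this is an in-process illustration; a real multi-node deployment would use a distributed lock rather than an in-memory one:

```python
import threading
from collections import defaultdict

# Hypothetical single-process sketch: merges and tombstone cleans on the
# SAME table must serialize, but writers to different tables never contend,
# so a million instances each writing their own table need no coordination.
_table_locks = defaultdict(threading.Lock)

def with_table_lock(table: str, mutation):
    """Run a table mutation (merge / tombstone clean) under that table's lock."""
    with _table_locks[table]:
        return mutation()

# Writers on different tables proceed with zero coordination between them.
print(with_table_lock("tenant_a/events", lambda: "merged tenant_a"))
print(with_table_lock("tenant_b/events", lambda: "merged tenant_b"))
```

The point of the sketch: the lock is scoped per table, so contention only exists where concurrent modification actually happens.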
This means it’s far easier to manage (especially for single-node or Kafka-like situations where one consumer is dedicated to a partition), far cheaper to run, and still fully portable (Minio is actually enormously faster than AWS S3).
There are also many more essential features like custom insert functions and schema inspection/validation that help extend functionality even further than what exists in comparable tools.

Brief Comparisons

The best place to find (updated) comparisons, performance tests, etc. is the GitHub repo.
I’ve omitted systems I’ve never used before, and I don’t have extensive experience with some (such as Snowflake).
I won’t dive into all the details here about why not BigQuery (expensive AF), Athena, Spark, ClickHouse, and more, but we all want an open source DB that’s both easy to run and highly versatile.
A big question I foresee is: Why not Iceberg? (ignore the naming similarities)
Iceberg has a few problems right now in my eyes (see more detail on github):
  1. Terrible DX: fragmented support and overly verbose; good luck trying to quickly write to a table (or do so in anything but Python)
  2. Very few systems support querying it
  3. Terrible operational experience (nobody is excited to run Spark)
  4. Impossible to natively integrate with, whereas you can write an IceDB log parser in <30min
My joke is that you need 7 PhDs in Java stack traces to run a Spark cluster successfully, 2 if cloud-managed.
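To back up the point about native integration, here is a minimal sketch of what a log parser could look like. The JSONL layout here (an `action` field of `add`/`tombstone` plus a parquet file path) is an assumption for illustration; the actual log schema is documented in the repo:

```python
import json

def parse_log(lines):
    """Toy parser for a hypothetical JSONL merge log: each line is a JSON
    object with an action ('add' or 'tombstone') and a parquet file path.
    Returns the set of currently-alive data files to query."""
    alive = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry["action"] == "add":
            alive.add(entry["file"])
        elif entry["action"] == "tombstone":
            alive.discard(entry["file"])
    return alive

# Example: two inserted files merged into one, with the originals tombstoned
log = [
    '{"action": "add", "file": "t/part1.parquet"}',
    '{"action": "add", "file": "t/part2.parquet"}',
    '{"action": "add", "file": "t/merged.parquet"}',
    '{"action": "tombstone", "file": "t/part1.parquet"}',
    '{"action": "tombstone", "file": "t/part2.parquet"}',
]
print(sorted(parse_log(log)))  # ['t/merged.parquet']
```

Because the log is plain JSONL, this kind of reader can be written in any language with a JSON parser, which is the whole point.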
I specifically designed IceDB to be super easy to use in any language and with any DB:
  1. Extremely simple DX and operational experience (see examples in github); the log is just JSONL
  2. With the S3 proxy, readers don’t need to understand the log, just S3 and parquet
  3. Multi-tenancy such that I can execute user SQL unchecked in ephemeral execution envs (when used with the S3 proxy for auth)
  4. Schema definition not required - IceDB will track it for you and gives you the tools to prevent breaking things, but it is happy to have new columns appear and old ones disappear; just don’t change data types :)
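To make that last point concrete, here is a hedged sketch of how dynamic schema tracking could work; `merge_schema` and its merge rules are illustrative assumptions, not IceDB’s actual API. New columns merge in, absent columns are fine, and a type change is rejected:

```python
def merge_schema(known, batch):
    """Merge the SQL types of an incoming row batch into the known schema.
    New columns are added, absent columns stay, a type conflict errors."""
    merged = dict(known)
    for col, sql_type in batch.items():
        if col in merged and merged[col] != sql_type:
            raise TypeError(f"column {col!r}: {merged[col]} != {sql_type}")
        merged[col] = sql_type
    return merged

schema = {"user_id": "VARCHAR", "ts": "BIGINT"}
# A new column appearing is fine; 'ts' being absent from this batch is too.
schema = merge_schema(schema, {"user_id": "VARCHAR", "country": "VARCHAR"})
print(schema)  # {'user_id': 'VARCHAR', 'ts': 'BIGINT', 'country': 'VARCHAR'}
# merge_schema(schema, {"ts": "VARCHAR"})  # would raise: type changes not allowed
```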
The README and ARCHITECTURE docs both have excellent descriptions of functionality and design, check them out if you’re interested!
I tend to compare more against BigQuery, as the Iceberg experience doesn’t concern me yet. Maybe it will get there with Tabular’s new round, but rarely do these kinds of things have pleasant ops and DX.

So why did I build this?

I ask myself this all the time…

Projects keep me sane

If you know me, this should come as no surprise.
For better or (probably) for worse, coding is basically all I do. I challenge myself with projects outside my comfort zone in search of deliberate learning and practice: I get uncomfortable to get comfortable again.
It also keeps me from burning out at my startup. I find that coding and engineering is not necessarily the thing that I would burn out on, but the same context.
I can find refreshing pools of energy in changing language, project, work context (Tangia vs. learning), and even workspace. Changing the keyboard in the middle of the day has a surprisingly positive impact on my energy for typing. These projects serve as a sort of wind-down from working on Tangia upwards of 14 hours a day, while making sure I feel like I am spending my time productively (as opposed to watching TV, for example).
I find the change in context makes a massive impact on how I experience the work emotionally. It’s working without hard constraints, only self-imposed ones. It means I can move as fast or slow as I want. I can waste time exploring multiple ways to do something, over-optimize the hell out of something just to prove to myself I can, or see how quickly I can build something that “works”. It’s a sort of self-discovery on its own, as you never really know what you are capable of until you push the envelope on all sides.

I found this problem extremely interesting

It’s not just building in an ignored gap in the market; it’s also just a really interesting problem.
How do you build a database that can run “stateless”, the only storage is S3, and uses open data formats so anything can query the data? Bonus points for making it multitenant so anyone can query with any DB without having to mangle queries for security.
1 table with 1 million rows should behave roughly the same as 1 million tables with 1 row. The presence of either should not slow down queries of individual tables (e.g. don’t load all metadata in, just what you need for a query).
Basically: How would I make a DB that behaves like BigQuery, but from Google’s perspective, not a GCP customer’s?
That was a fun question to answer.

I see a lot of utility, and so do others

There are a ton of ideas I have that require allowing end-users to directly write SQL queries on data. For brevity I won’t dive into them, but it’s clear that others see the issue that IceDB solves.
I’ve had robotics companies reach out to me that are currently using systems like Athena. They have tons of data coming in, but read it infrequently. They don’t need queries to be millisecond speed, but they don’t want to spend $10 every time they query it (nor spend the effort on materialized views, etc).
I’ve had observability companies reach out that have a baseline cost of ~$120/mo for their customers to run ClickHouse clusters for their dedicated hosted offering. Because IceDB can be queried with ClickHouse using nearly identical queries (just change the FROM clause), they could reduce that to $0 baseline cost, passing those savings to their customers.
I’ve had analytics companies reach out in a similar vein, looking to reduce their costs and operational complexity of managing their own massive OLAP systems.
I’m already getting inbound interest from folks who find this a solution to a painful problem, and none of them overlap with the ideas I had when building it!
I’ve built many projects in the past that I think have immense utility in business and infrastructure, but this is really the first time that I’ve made something where it’s starting to feel like that “build it and they will come” prophecy that all engineers hope for. I’m by no means drowning in inbound, but I also haven’t been pushing anyone else to use IceDB. Perhaps that’s some initial customer validation that this has a practical case as a startup (or to power a startup built on top, analogous to Relativity Space using 3D printing to support their rocket manufacturing).
Speaking of Relativity Space, last night (as of writing) at the 2023 YC reunion I had the pleasure of listening to Tim Ellis speak about the launch industry as well as meeting him after the talk. *Bucket list item checked*

I believe this could be a unicorn 🦄

Maybe it will be the next startup I do, maybe I can reach a point where I can start it concurrently with Tangia, maybe this stays an open source passion project that nobody uses, or maybe I never touch it again. But looking at BigQuery’s revenue (billions), and all these (inexplicably large) funding rounds going to database companies like MotherDuck and Tabular, I feel pretty good about the potential of IceDB.

The value from the journey

As a result of all of the above, even if I put down IceDB and never touch it again, I’ve gotten so much out of the adventure: I’ve met new people and gained new friends that I never would have otherwise, many at the pinnacle of their industries. I’ve received multiple unsolicited job offers to work in tangential and lateral spaces. I’ve gained recognition in the space from industry leaders.
All of this - you really can’t get any other way, all at once, in such a short amount of time. You have to build to earn that.

Wrapping up

I’ve built a lot of things in the past (just see my github), many of them very cool and things I was similarly excited about. The difference now is that I see so many things to build on top of IceDB that I am similarly excited about, not just IceDB itself.