BigHouse - Run ClickHouse Like BigQuery

Outline

Overview

Basic overview of how it works

Walk through the query lifecycle
Performance test on a single 16-core machine vs 8, on Parquet and CSV data, saturating CPU cores and NICs
Optimize further with IceDB

Why do I want to run CH like BQ?

The serverless prophecy for OLAP
You can get way more resources. The largest EC2 machine you would realistically use has 192 cores, and that’s insanely expensive to run, especially when paying for idle time. With fly.io I can spin up 30 16-vCPU machines (480 cores), execute a query, and shut them down in 15 seconds.
So much cheaper. That query I mentioned above? It costs less than $0.09 to run with BigHouse (maybe an extra cent for R2 API calls). With 30 saturated 2 Gbit/s NICs and 10:1 average compression, the cluster reads 2 Gbit/s * 10 ratio * 30 NICs = 600 Gbit/s, or 75 GB/s, of uncompressed data. Over the full 15 seconds that would be 9 Tbit, or 1.125 TB. If the query processing itself took 10s, that’s 0.75 TB of data. If that was run on BigQuery it would cost 0.75 * $5 = $3.75, on TinyBird $0.07 * 750 = $52.50 (to be fair that’s absolutely not how you are supposed to use TinyBird, but it is the closest thing ClickHouse-wise). For 15 seconds that EC2 instance would cost $0.03 for 192 cores. Multiply by 2.5x to get to 480 cores and that’s already $0.075, nearly the same cost, and that’s before considering the 30s first boot time (or 10s subsequent boot times), the cost of the disk, etc. But we aren’t really here to optimize for per-query cost; we are optimizing for not paying for anything between queries.
CH is expensive to run as a warehouse because compute and storage are not fully decoupled.
Better SQL and features
Better community, open source
Dedicated resources per query.
Scale in seconds to only the resources you need. It’s also elastic so if you know you are reading small data, only ask for 8 cores. If you are reading a lot, ask for 800 cores. Can even scale the resources based on something like how many files IceDB returned.
And finally because I can, super fun project
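
The bandwidth and cost arithmetic above is easy to get wrong on units, so here is a minimal sketch that recomputes it. It assumes decimal units (1 TB = 8,000 Gbit) and takes the per-TB BigQuery rate and per-GB TinyBird rate as they are implied by the text. The helper names are made up for illustration and are not part of BigHouse.

```python
# Back-of-the-envelope check of scan volume and per-query cost.
# Inputs are the figures quoted in the text; helper names are hypothetical.

GBIT_PER_TB = 8_000  # decimal units: 1 TB = 1,000 GB = 8,000 Gbit


def scan_rate_gbit_per_s(nic_gbit_per_s, compression_ratio, nics):
    """Aggregate uncompressed read rate across all NICs, in Gbit/s."""
    return nic_gbit_per_s * compression_ratio * nics


def scanned_tb(rate_gbit_per_s, seconds):
    """Total uncompressed data read over `seconds`, in TB."""
    return rate_gbit_per_s * seconds / GBIT_PER_TB


# 30 nodes, each with a saturated 2 Gbit/s NIC, reading 10:1 compressed data.
rate = scan_rate_gbit_per_s(nic_gbit_per_s=2, compression_ratio=10, nics=30)

tb_full_run = scanned_tb(rate, seconds=15)    # whole 15 s machine lifetime
tb_processing = scanned_tb(rate, seconds=10)  # 10 s of actual query processing

# Assumed rates, as implied by the text: $5 per TB and $0.07 per GB.
bigquery_cost = tb_processing * 5.00
tinybird_cost = tb_processing * 1_000 * 0.07

print(f"rate: {rate} Gbit/s ({rate / 8} GB/s)")
print(f"scanned in 15 s: {tb_full_run} TB, in 10 s: {tb_processing} TB")
print(f"BigQuery: ${bigquery_cost:.2f}, TinyBird: ${tinybird_cost:.2f}")
```

The point to watch is that link speed times compression ratio times NIC count is a rate, and only multiplying by a duration turns it into a volume.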

What is this useful for?

Primarily a proof of concept, as there is a lot of room for improvement
Large offline analysis
Multi-tenant DBs
Extremely low-cost self-managed analytics
Open source BigQuery/Snowflake alternative
Providers other than AWS/GCP, where per-machine NIC bandwidth is a lot lower

What is BigHouse not?

Real-time analytics. It is not trying to replace CH or TinyBird; it goes after Spark/BQ/Athena workloads, where you can tolerate single-digit seconds of delay between write and read
Production ready. This is an MVP currently used only by the people who make it. There are no guarantees at this moment, but it is openly available to use

Why isn’t it faster than BigQuery?

The data is farther away from the compute
ClickHouse hits some odd hard limits on s3Cluster and urlCluster performance, and does not intelligently spread the load.
Doesn’t really matter because it’s fast enough, and the savings are well worth that wait.

Where do we go from here?

Integrate IceDB as a table function or engine
Optimize CH build to make it even smaller and rip out unused functionality (table engines, functions, etc.)
More configuration options for the node at boot (max_threads, etc.)