Backing up ClickHouse S3-backed Tables Without Duplicating Data

Tags: Infrastructure, ClickHouse
Published: May 5, 2023

ClickHouse is by far my favorite OLAP DB, and I love decoupling storage and compute by using S3-backed tables.
While it’s still relatively new functionality, the documentation around operating this feature set is thin. Specifically: backups.
S3 already manages the replication of our data, so there is no reason to back up the literal data we are storing in S3 back into S3.
If we wanted to build a DR system that accounts for the loss of local nodes (a deleted cluster, a lost disk), the only documented process involves duplicating your data into S3. If you have many TBs or even PBs of data, that’s a lot of duplicate data stored for no reason.
Even if we could do it without duplicating the data, we definitely don’t want to re-write all the data into ClickHouse during the restore process to get the indexes back.
Since the data is already in S3, why can’t we just restore the knowledge of the files in S3 to ClickHouse?
Turns out, we can:

Quick Anatomy of S3 Disks & Tables

💡
I will be using ClickHouse nodes from the Altinity Operator running on Kubernetes for the examples below. The file system paths are the same on standard Linux installations.
S3-backed tables are created out of storage policies that use an S3 disk.
S3 disks are a bit special in that the data for a table is not stored under the traditional /var/lib/clickhouse/data directory, but rather under a path of the form /var/lib/clickhouse/disks/{disk_name}/store/{first 3 chars of table UUID}/{table UUID}:
root@chi-repl-01-replcluster-0-0-0: ls /var/lib/clickhouse/disks/s3/store/877/8779b9e3-538c-4ce7-8fb8-e4f499720493/0_1_1_0/
checksums.txt  count.txt  data.mrk3  minmax_id.idx  primary.idx
columns.txt    data.bin   default_compression_codec.txt  partition.dat
We can see here that 8779b9e3-538c-4ce7-8fb8-e4f499720493 is the UUID of our table, and 0_1_1_0 is one of the part names. In this case I have an S3 disk called s3. The 877 directory is just the first 3 characters of the table UUID.
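If you want to locate this directory for one of your own tables, here’s a quick sketch (it assumes the disk is named s3 and clickhouse-client is available on the node; the database and table names are just the ones from this example):
# Look up the table's UUID from system.tables and print its store path on the "s3" disk.
DB=default
TABLE=test_s3
UUID=$(clickhouse-client --query "SELECT uuid FROM system.tables WHERE database = '${DB}' AND name = '${TABLE}'")
echo "/var/lib/clickhouse/disks/s3/store/${UUID:0:3}/${UUID}"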
While the file structure looks very familiar, there are a few differences, such as the data files storing references to the S3 objects rather than the actual data. Double Cloud does a great job discussing this on their blog.
As a result, the files contained within this folder are actually very small: on the order of KBs for the largest files, with most being far smaller. This is an important piece of knowledge for building our backup strategy.
The UUID is mapped to a table from the /var/lib/clickhouse/metadata/{db}/{table}.sql file:
root@chi-repl-01-replcluster-0-0-0: cat /var/lib/clickhouse/metadata/default/test_s3.sql
ATTACH TABLE _ UUID '8779b9e3-538c-4ce7-8fb8-e4f499720493'
(
    `id` Int64
)
ENGINE = MergeTree
PARTITION BY intDiv(id, 100)
ORDER BY id
SETTINGS storage_policy = 's3', index_granularity = 8192
On startup, ClickHouse actually runs this SQL query, replacing the _ after ATTACH TABLE with the name of the table from the file name.
That query is just the equivalent CREATE TABLE query, with the default index_granularity specified (we will need this to recover our data)!

Backing Up an S3 Disk

TL;DR: We can use rclone to back up the files under the /var/lib/clickhouse/disks/{disk_name}/store directory, as well as the /var/lib/clickhouse/metadata/{db}/{table}.sql file for each table!
Once you have set up an S3 destination, you can use a pair of rclone commands to sync a table with S3:
rclone sync --cache-db-path /tmp/rclone-cache-disks \
  /var/lib/clickhouse/disks/s3/store/877/8779b9e3-538c-4ce7-8fb8-e4f499720493 \
  s3:bucket-name/ch-s3-disks/s3/store/877/8779b9e3-538c-4ce7-8fb8-e4f499720493

rclone sync --cache-db-path /tmp/rclone-cache-metadata \
  /var/lib/clickhouse/metadata/default/test_s3.sql \
  s3:bucket-name/ch-s3-metadata/default/test_s3.sql
💡
You can use any tool here; it doesn’t have to be rclone!
The first command will sync the local table metadata (references to the S3 files, and the indexes of those files) to the backup bucket. sync is used so that the copying is only one way (deleting files in the backup destination will not delete your local data in ClickHouse), while parts that are deleted locally will also be removed from the destination. This also lets us use the --cache-db-path flag, which caches the files we have already copied to S3. That prevents massive listing operations as the number of parts grows, and since the ClickHouse file system is append-only (files are not mutated), we know that a file’s content will not change after we initially copy it over.
In this example I have also chosen to store the files in S3 in a path format very similar to the one on the local ClickHouse node. This not only makes it clear where the files should be restored on a ClickHouse node, but also makes the restore process easy, as we can just copy the folder back into ClickHouse.
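To cover every table on the disk rather than a single one, here is a minimal sketch of a backup script. The bucket name, remote name, destination prefixes, and the idea of driving the loop from system.tables are my own assumptions; adapt it to your setup:
#!/usr/bin/env bash
# Sketch: back up every table on the 's3' storage policy, one rclone sync per table.
set -euo pipefail

DEST=s3:bucket-name   # placeholder rclone remote and bucket

clickhouse-client --query "
  SELECT database, name, uuid
  FROM system.tables
  WHERE storage_policy = 's3'
  FORMAT TSV" |
while IFS=$'\t' read -r db table uuid; do
  prefix="${uuid:0:3}"
  # Local part metadata (references to S3 objects + indexes) for this table
  rclone sync --cache-db-path /tmp/rclone-cache-disks \
    "/var/lib/clickhouse/disks/s3/store/${prefix}/${uuid}" \
    "${DEST}/ch-s3-disks/s3/store/${prefix}/${uuid}"
  # The table's ATTACH statement
  rclone sync --cache-db-path /tmp/rclone-cache-metadata \
    "/var/lib/clickhouse/metadata/${db}/${table}.sql" \
    "${DEST}/ch-s3-metadata/${db}/${table}.sql"
done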
Use a scheduler like cron to run these commands on an interval, and you’re set!
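For example, a crontab entry along these lines (the script path, log location, and 15-minute interval are all placeholders):
# Run the (hypothetical) backup script every 15 minutes
*/15 * * * * /usr/local/bin/ch-s3-backup.sh >> /var/log/ch-s3-backup.log 2>&1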

A Note On Replication

Whether or not you use the allow_remote_fs_zero_copy_replication setting, you can optimize by running this on a single node (one replica), as all nodes should be aware of all of the files in S3 for that replica.
While the allow_remote_fs_zero_copy_replication setting is considered “not production ready”, it’s used in ClickHouse Cloud as of now (as far as we know).
If you do not use this setting, then you will have duplicate files stored in S3 (but not duplicate data in the tables). The same goes for metadata: you will have a duplicate per replica.
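One way to run the backup from a single node is to make the backup script a no-op on every replica except one. A sketch (the hostname pattern follows the Altinity Operator naming from the examples above and is an assumption; adjust it to your own naming):
# Only run the backup from the designated replica (here: replica 0 of each shard).
# Altinity Operator pod hostnames look like chi-<name>-<cluster>-<shard>-<replica>-<ordinal>.
if [[ "$(hostname)" != chi-repl-01-replcluster-*-0-0 ]]; then
  echo "not the designated backup replica, skipping"
  exit 0
fi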

Using the Altinity Operator

It might be best to build a custom container that will attach to the ClickHouse nodes as a sidecar, sharing the same disk as ClickHouse so it can see the files.
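A minimal sketch of what that sidecar’s entrypoint could look like (the script path and loop interval are assumptions, and the pod/volume wiring through the ClickHouseInstallation spec isn’t shown here):
#!/usr/bin/env bash
# Hypothetical sidecar entrypoint: run the backup script in a loop against the shared volume.
set -euo pipefail
while true; do
  /usr/local/bin/ch-s3-backup.sh || echo "backup failed, will retry next cycle" >&2
  sleep 900  # 15 minutes between runs; the interval is arbitrary
done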

Restoring an S3 Disk

To restore, we simply flip the backup command:
rclone sync --cache-db-path /tmp/rclone-cache-disks s3:bucket-name/ch-s3-disks/s3/store/877/8779b9e3-538c-4ce7-8fb8-e4f499720493 /var/lib/clickhouse/disks/s3/store/877/8779b9e3-538c-4ce7-8fb8-e4f499720493
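If you are restoring an entire disk rather than a single table, the same idea applies to the whole store prefix; a sketch using the bucket layout from the backup section:
# Sketch: pull every backed-up table directory back onto the local "s3" disk.
rclone sync --cache-db-path /tmp/rclone-cache-restore \
  s3:bucket-name/ch-s3-disks/s3/store \
  /var/lib/clickhouse/disks/s3/store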
Note how we did not restore the .sql file. This is because we don’t need it! You can restore this file, but you will need to restart ClickHouse to pick up the table (there might be some reload command you can use).
Instead, we can run the SQL query:
ATTACH TABLE test_s3 UUID '8779b9e3-538c-4ce7-8fb8-e4f499720493'
(
    `id` Int64
)
ENGINE = MergeTree
PARTITION BY intDiv(id, 100)
ORDER BY id
SETTINGS storage_policy = 's3', index_granularity = 8192
This was just the query from the .sql file, but we replaced the _ with the table name.
💡
Note that this assumes the storage policy is called s3
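Since we also backed up the metadata .sql files, this statement can be generated instead of typed by hand. A sketch (the rclone cat plus sed substitution of the _ placeholder is my own glue, not a ClickHouse feature):
# Fetch the backed-up metadata file, swap the "_" placeholder for the real table name,
# and run the resulting ATTACH statement.
DB=default
TABLE=test_s3
rclone cat "s3:bucket-name/ch-s3-metadata/${DB}/${TABLE}.sql" \
  | sed "s/ATTACH TABLE _/ATTACH TABLE ${TABLE}/" \
  | clickhouse-client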
Now, we can see that our table is back, and so is our data!
:) ATTACH TABLE test_s3 UUID '8779b9e3-538c-4ce7-8fb8-e4f499720493' ( `id` Int64 ) ENGINE = MergeTree PARTITION BY intDiv(id, 100) ORDER BY id SETTINGS storage_policy = 's3', index_granularity = 8192

Ok.

0 rows in set. Elapsed: 0.203 sec.

:) show tables

┌─name────┐
│ test_s3 │
└─────────┘

1 row in set. Elapsed: 0.002 sec.

:) select * from test_s3

┌─id─┐
│  1 │
│  2 │
│  3 │
└────┘

3 rows in set. Elapsed: 0.027 sec.
The data is now accessible from our ClickHouse node, and we can continue to use the S3 table.
You will need to do this for every replica on the cluster, but that’s a small price to pay for the benefits of zero-copy S3 table backup!
💡
Your ClickHouse node(s) will need to have the same S3 configuration for the disk in the config.xml file to be able to restore!
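A quick way to sanity-check that before restoring (the exact set of columns in system.disks varies a bit between ClickHouse versions):
# Verify the target node already knows about the "s3" disk before copying files back.
clickhouse-client --query "SELECT name, path, type FROM system.disks WHERE name = 's3'"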

Summary

Through clever cloning of the local file system that ClickHouse uses for S3-backed tables, we can back up our data without duplicating what’s already stored in S3, and restore it, indexes and all, without re-writing any of the data!
The benefit of this process increases with scale as well: the more data you have stored in S3, the more the transfer and API calls of a full-copy backup would cost!

Bonus: Snapshot Read Replicas!

If you guarantee that a ClickHouse node only has read access to the S3 bucket, you can create decoupled read replicas of the shard!
Please test this before running on your production environment, and please please please make sure that node only has read access to S3!

Here’s the full blog cover image (from Midjourney), prompt:
Beautiful flaming scroll icon for a database backup diagram. Concept yellow, black, red, vertical lines, vertical stripes, data, database, data bucket, cloud, data lake. Flat render, subtle gradients, no letters.