Citus Con 2023 is a wrap! 🎁 Thanks for joining the fun. Missed it? You can still watch all 37 talks online 🖥
Update in October 2022: The Citus managed database service is now available in the cloud as Azure Cosmos DB for PostgreSQL. And as always, the Citus database is also available as open source: you can find the Citus repo on GitHub or download Citus here.
One of the big new things in Citus 10 is that you can now shard Postgres on a single Citus node. So in addition to using the Citus extension to Postgres to scale out Postgres across a distributed cluster, you can now also:
The Citus 10 release is chock full of new capabilities like columnar storage for Postgres, the open sourcing of the shard rebalancer, as well as the feature we are going to explore here: using Citus on a single node. No matter what type of application you run on top of Citus—multi-tenant SaaS apps, customer-facing analytics dashboards, time-series workloads, high-throughput transactional apps—there is something for everyone in Citus 10.
In this post, let’s walk through how to try out Citus on a single node. And let’s dive deep into some of the ways sharding Postgres on a single node can help you.
My favorite way to get started with Citus is to use a Docker container to run Citus on a single node.
If you prefer, you could instead download Citus open source packages for both single-node and multi-node Citus installations, too.
The Docker container includes:
CREATE EXTENSION citus;
# run PostgreSQL with single-node Citus on port 5500
docker run -d --name citus_coordinator -p 5500:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus

# connect using psql within the Docker container
docker exec -it citus_coordinator psql -U postgres
Now that you have a Postgres database with Citus installed (Citus 10.0 or later), you can use your favorite tool—such as psql—to connect to the database and start creating distributed tables on your single Citus node.
-- Create a table with the usual PostgreSQL syntax
CREATE TABLE users_table (user_id bigserial primary key, age int);

-- Convert the table to a distributed table
SELECT create_distributed_table('users_table', 'user_id');
The create_distributed_table function will divide the Postgres table across 32 shards by default, but you can still query it as if all the data were in one table. Because the shards are regular Postgres tables, you can still rely on the wealth of Postgres relational database features like transactions, indexes, database constraints, JOINs, and more.
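If you're curious how the table was divided, you can peek at the shard metadata. A minimal sketch, assuming you run it on the node where you created the table (the citus_shards view ships with Citus 10 and later; shard IDs and sizes will vary):

```sql
-- list the shards backing users_table, with the node and size of each
SELECT shardid, nodename, nodeport, shard_size
FROM citus_shards
WHERE table_name = 'users_table'::regclass;
```

On a fresh single-node install, every shard will show the coordinator as its node—until you add workers and rebalance, as shown later in this post.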
With just a few commands, you can run a Citus database on a single node, with a distributed table. Next, let's look at the value that sharding Postgres on a single node gives you.
Sharding Postgres has long been associated with large-scale data sizes. In fact, when most of you think about how Citus shards Postgres, you probably imagine a distributed database cluster with 2 or 4 worker nodes, or perhaps 20 or 50 or even 100 worker nodes. But with the introduction of single-node Citus in Citus 10, we can all think about sharding differently.
Even if the data sizes are not large, sharding Postgres on a single node may provide immediate benefits. By using a distributed data model on a single Citus node, you can get:
For example, after sharding the users_table above with the create_distributed_table function, the following SQL commands will run in parallel across the shards, which can significantly decrease execution times:
-- load data, ingest happens in parallel across shards
INSERT INTO users_table (age)
SELECT 20 + (random() * 70)::int FROM generate_series(0, 100000);

-- this query runs in parallel across all shards
SELECT avg(age) FROM users_table;

-- index created in parallel across all shards
CREATE INDEX user_age ON users_table (age);
Queries on distributed Citus tables are not always faster than queries on regular Postgres tables. For example, Citus cannot provide any parallelism benefit for a SQL query that only involves a single shard, such as the SELECT query below:
-- this query hits a single shard as sharding key is on WHERE clause SELECT age FROM users_table WHERE user_id = 15;
Citus needs to do some additional CPU work on top of what Postgres does. The work involves two steps:
In practice, this additional overhead is small enough that many applications will not even notice the difference.
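You can see this routing work with EXPLAIN: on a distributed table, the plan is wrapped in a Citus custom scan node that shows how many shards the query is sent to. A sketch of what to expect (the exact plan text varies by Citus version):

```sql
EXPLAIN SELECT age FROM users_table WHERE user_id = 15;
-- Custom Scan (Citus Adaptive)
--   Task Count: 1
--   ...
```

A Task Count of 1 means Citus pruned the query down to a single shard based on the sharding key in the WHERE clause, so only that one shard's Postgres table is scanned.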
When your data no longer fits in memory—or if the database server becomes CPU bound—then with Citus on a single node, your data will already be sharded and you’ll be able to easily add more nodes to your cluster to keep your database performant. In other words, you are already prepared to scale out, or what I like to call “scale-out-ready.”
The result: you can scale out your Citus database to as many nodes as you require. While a shard is being moved, writes to that shard are blocked, but all reads and writes to other shards can continue. If you are using Citus on Azure, we have some extra tricks to make writes to the shards being moved non-blocking as well. Update in October 2022: good news, as of Citus 11.0 the Citus open source shard rebalancer is also non-blocking for writes.
Let’s add 2 more Citus worker nodes and scale out the database:
# start the first worker node on port 5501
docker run -d --name citus_worker_1 -p 5501:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus

# start the second worker node on port 5502
docker run -d --name citus_worker_2 -p 5502:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus
In the example below, I use the <hostname or your ip address> parameters as ‘192.168.2.1’. When you try this on your network, make sure to use your machine’s IP address (or hostname) so that the Citus nodes can communicate with each other.
Now, let’s connect back to the coordinator via psql:
# connect using psql to the Citus coordinator's Docker container
docker exec -it citus_coordinator psql -U postgres
And scale out your Postgres database with a few SQL commands:
-- tell future worker nodes how to reach the coordinator
SELECT citus_set_coordinator_host('<hostname or your ip address>', 5500);

-- add worker nodes to Citus metadata
SELECT citus_add_node('<hostname or your ip address>', 5501);
SELECT citus_add_node('<hostname or your ip address>', 5502);

-- move shards to new worker node(s)
SELECT rebalance_table_shards();

NOTICE:  Moving shard 102008 from 192.168.2.1:5500 to 192.168.2.1:5501 ...
NOTICE:  Moving shard 102009 from 192.168.2.1:5500 to 192.168.2.1:5502 ...
....
NOTICE:  Moving shard 102028 from 192.168.2.1:5500 to 192.168.2.1:5501 ...

-- OPTIONAL: if you do not want any shards on the coordinator,
-- move shards from coordinator to worker nodes
SELECT citus_drain_node('<hostname or your ip address>', 5500);
Now that the shards have been distributed, the database can use the resources on the worker node(s) as well. From your application’s perspective, nothing has changed. After adding 2 new nodes to the Citus database cluster, and after rebalancing shards across the cluster, your application is still talking to the same Postgres database. Congratulations, you have seamlessly scaled out your Postgres database with Citus!
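If you want to double-check which workers the coordinator knows about, you can query the cluster metadata. A minimal sketch (citus_get_active_worker_nodes is the Citus 10+ name for this function):

```sql
-- list the active worker nodes in the cluster
SELECT * FROM citus_get_active_worker_nodes();
```

After the steps above, this should return the two worker nodes you added on ports 5501 and 5502.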
You may be wondering, “Why use Citus if I don’t need to scale out Postgres right now?” Well, if your database is not going to grow in the future (and will remain smaller than ~100 GB), then vanilla Postgres is likely to handle your workload very well. However, if you expect your database to grow over time, then being scale-out-ready becomes relevant to you.
There are a few things you need to take into account to make your database scale-out-ready. In scale-out architectures, the data is distributed based on a sharding key (what we sometimes call a distribution column in Citus). The sharding key you choose will influence not only how the data gets distributed, but also what kind of query performance improvements you will get. So a bit of up-front thinking about your query patterns and your data model can go a long way. For example, it is often not performant (or even possible) to enforce UNIQUE constraints on columns that do not include the distribution key.
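To make that constraint rule concrete, here is a hypothetical sketch (it assumes users_table also had an email column, which the example table in this post does not): a uniqueness guarantee that includes the distribution key can be enforced locally on each shard, while one that omits it cannot.

```sql
-- works on a distributed table: the constraint includes
-- the distribution key (user_id), so each shard can enforce it locally
ALTER TABLE users_table
  ADD CONSTRAINT user_email_unique UNIQUE (user_id, email);

-- fails on a distributed table: email alone does not include the
-- sharding key, so no single shard can enforce global uniqueness
-- ALTER TABLE users_table ADD CONSTRAINT email_unique UNIQUE (email);
```

This is why picking the sharding key early matters: it determines which constraints, and which single-shard query patterns, remain cheap.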
If you follow the data modeling best practices of a distributed database like Citus, you can be what I like to call “scale-out-ready”. The Citus best practices for scaling out your Postgres database seamlessly include:
If you build your database conforming to the best practices outlined above, the promise of Citus is that you’ll be able to scale out your Postgres database to some pretty large database cluster sizes. The key point here is that once you get used to the mindset of scale-out systems, you’ll realize how easy it is to follow the data modeling best practices.
A third interesting scenario for using Citus on a single node is in your CI/CD pipelines. If Citus is part of your CI/CD workflow, setting up a test environment is now much simpler. Instead of running two or more database servers, you can set up Citus on a single node and get the same experience and coverage. In other words, whatever you can (or cannot) do on multi-node Citus behaves the same on the new single-node Citus capability.
Back before we made single-node Citus a first-class citizen in this Citus 10 release, we had shared a sneak preview of sharding on a single node in our Citus 9.5 blog post. And we received this lovely feedback from Ivan Vyazmitinov, one of our Citus users, just a few weeks later:
“A piece of feedback about the new ‘single node cluster’ feature from the blog post: after a little bit of testing we’ve immediately switched all of our integration tests to it. It is literally a lifesaver in terms of complexity of configuration of tests and it also completely eliminated our tests flakiness issue due to errors from unhealthy citus containers on tests’ startup.”
One of the easiest ways to run Citus is in the cloud: you can deploy the Hyperscale (Citus) option in the Azure Database for PostgreSQL managed service.
For those of you who use Citus as part of the Azure Database for PostgreSQL managed service, the good news is: as of August 2021, Citus 10 is available on Citus on Azure with the new single-node configuration. More details are in Nik’s initial preview post on what’s new and his follow-on post announcing that Citus 10 is GA in Hyperscale (Citus).
We are very excited to bring you the new Citus on a single node feature. And while “opens the door to new possibilities” may sound lofty, it’s true. Single-node Citus gives you a way to be “scale-out-ready” on day zero. If your application is already running on single-node Postgres, you now have a way to adopt a data model that allows you to scale your Postgres database as much as you need in the future. In essence, as your application grows and you need to scale, you won’t face a database migration challenge.
Perhaps more importantly, with the new Citus single node feature in Citus 10, it’s easier than ever to try out Citus. And you can do so with just a single docker run command.
Here are some of the best links if you want to try out Citus today: