Businesses use data analytics to extract meaningful, decision-making information for their day-to-day operations. With analytics we can discover hidden relations, find unexpected patterns, learn about customer behavior, understand the past, and predict the future.
In this tutorial, we’ll learn about ClickHouse, a distributed column-based database for real-time analytics. We will explore how it works and take it for a test drive using StackPath Containers, a distributed edge compute product that lets you deploy images to multiple points of presence to reduce latency and bandwidth.
Before diving into the tutorial, first we’ll provide an overview of column-based databases, ClickHouse, and the benefits of running ClickHouse on StackPath.
Compared with traditional row-based databases, column-based databases have a subtle yet profound difference in how data is stored on disk.
A traditional database stores data as rows.
A column-based database serialize them by columns.
|COMPANY||Microsoft (1,2)||Apple (3,5)||Facebook (4)|
|CEO_NAME||Bill Gates (1)||Satya Nadella (2)||Steve Cook (3)||M. Zuckerberg (4)||Steve Jobs (5)|
|YEAR_FROM||1975 (1)||2014 (2)||2011 (3)||2014 (4)||1997 (5)|
|YEAR_TO||2000 (1)||2011 (5)|
Our expected workload determines what type of database we should choose. Row I/O is designed to get all the information for a small set of records while column I/O is designed to get a small subset of columns for a huge set of records (or even a whole table).
Now, imagine we ask the server to calculate how many years each of the persons on the tables above were CEOs of their respective companies. A row database has to retrieve each row, then get to the columns with the relevant years and discard the rest. A column database only needs two I/O operations (regardless of the size of the table): one for each date column. No data is discarded. A column system will perform much better in this case and scales much better with large tables.
As a result of their design, column databases have the following unique properties.
- Efficient column aggregation and materialized views
- Wide tables with optional columns pose no problems; on the contrary, they are encouraged
- Because similar data compress best, column orientation uses less space on disk than rows
- Due to less reliance on locking, column architectures can run with a much higher degree of parallelization
- Batch inserts are more efficient with columns than with rows
- Column databases have less reliance on indexes which reduces overhead in the system.
ClickHouse: a Distributed Column-Based DBMS
ClickHouse is a distributed database management system (DBMS) created by Yandex, the Russian Internet giant and the second-largest web analytics platform in the world. ClickHouse is available as open-source software under the Apache 2.0 License.
Beyond Yandex, many other large companies have adopted ClickHouse for their analytics, including Cisco, CERN, Bloomberg, and Spotify.
The main features of ClickHouse include:
- True column-based DBMS
- Horizontal scalability
- Multi-petabyte scale support
- Massive parallel processing on a single node
- Distributed queries using multiple nodes
- Support of approximated calculations through sampling
- Fault tolerance and no-downtime upgrades
- A subset of the SQL Language, feature-rich with analytics and statistics functions
ClickHouse supports two mechanisms to scale out databases. Sharding distributes big tables among multiple machines and requests get automatically parallelized. Replication creates redundant copies of the data and can be used to distribute and filter data to different data centers. Also, ClickHouse supports multi-master replication.
Is ClickHouse right for your project?
ClickHouse IS good for:
- OLAP (Online Analytical Processing) and OLEP (Online Event Processing) workloads, as long as they don’t need full-text search
- Data Warehouse: ClickHouse can store and process petabytes of data and build customized reports on the fly.
- Big tables: tables with hundreds or thousands of columns
- Applications that need a high read throughput
- Applications that make heavy use of aggregated columns or materialized views
While ClickHouse IS NOT good for:
- OLTP (Online Transactional Processing) workloads: ClickHouse doesn’t support full-fledged transactions.
- Applications that need to insert new data batches greater than once per second per clickHouse worker
- Applications with complex or random write or update activity
- Applications that need full-text search
Running ClickHouse on StackPath
Databases have unique requirements that can clash with common cloud offerings:
- Cost: high per-GB costs can make downloads and uploads expensive
- Bandwidth: big databases can take a long time to load or export
- Latency: round trip times can add unacceptable latencies to the queries
Edge computing solves these issues by moving servers as close as possible to customers and sources of data. As for StackPath, it offers the best of both worlds: the convenience and flexibility of the cloud with on premise-like performance.
Single Instance Configuration
In this section we’ll deploy a single ClickHouse edge computing node in order to get familiar with the system. Then, we’ll learn how to implement a cluster.
First, we need to create a custom Docker image with our private config. Before continuing, ensure you have the following things.
- Docker installed on your workstation. On Mac and Windows, we recommend installing Docker Desktop. On Linux machines install Docker from the repository.
- A Docker Hub account to store the images
Prepare the Docker image
Yandex supplies two ready-to-use Docker images:
- clickhouse-server is the server component. The server stores data, accepts client connections, and processes requests.
- clickhouse-client is the official command-line client, much like psql or mysql.
To download the images to your machine, use the following commands.
docker pull yandex/clickhouse-server docker pull yandex/clickhouse-client
By default, the ClickHouse server comes with a single blank-password user called default. For testing purposes, having a blank password is not a problem, but since we’re going to deploy the node in production, we need to secure it.
To customize the user settings, we need to change a file called
users.xml inside the clickhouse-server image.
To do this, create a directory to store the project files.
mkdir my-clickhouse-deployment cd my-clickhouse-deployment
Start an instance of the container so we can copy out the file.
docker create -it --name clickhouse-server clickhouse server
Copy the file from the container to your machine. (Don’t forget the dot at the end of the command.)
users.xml with a text editor and locate the following lines.
<users> <default> <!-- This is the default user with a blank password --> <password></password> . . . </default> </users>
ClickHouse supports several password settings:
- Plaintext: insecure and not recommended
- Double SHA1: uses a cryptographic hash function to make the password harder to guess for anyone looking at our image. Double SHA1 strings go inside the <password_double_sha1_hex></password_double_sha1_hex>
- SHA256: arguably the most secure alternative and the one we will use. This option uses the <password_double_sha256_hex></password_sha256_hex>
To generate a SHA256 password in Linux or Mac, execute the following command in a new terminal.
PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
The first line printed is the password, the second is the encoded SHA256 string. Alternatively, you can use an online SHA256 generator to obtain the string. In either case, keep both the password and the SHA string at hand as we’ll need them both soon.
Going back to
Remove the <password></password> line to get rid of the blank password. Then add the following line with your SHA256 string.
<users> <default> <password_sha256_hex>78bb49329e64a3df1ea5a48fb09ce22ed5223171cf140e3b08683ab3926a0b9b</password_sha256_hex> . . . </default> </users>
Build the image
Create a file called Dockerfile with the following content.
FROM yandex/clickhouse-server COPY users.xml /etc/clickhouse-server
Then build the image.
docker build -t my-clickhouse-server .
To test that it’s working correctly, run the following.
docker run -it -d --name my-server my-clickhouse-server
Now, try connecting using your password.
docker run -it --rm --link clickhouse:clickhouse-server yandex/clickhouse-client --host clickhouse-server --user default --password YOUR_PASSWORD
Push image to registry
Before we can use the image in StackPath, we have to send it to Docker Hub. To do this, first log in to Docker Hub with your account.
$ docker login -u YOUR_DOCKERHUB_USERNAME
Build the image using your Docker Hub username as a prefix.
$ docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest
Push the image to Docker Hub.
$ docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest
Deploy to StackPath
Once we have pushed the image to the Docker Registry, we can deploy it on StackPath. Head over to StackPath and login or sign up for a new account. Then click Create and select Workload.
We’ll name the workload “clickhouse-servers” and set the type to Container. For Image, type the address to your Docker image:
If you are using a private repository or a different registry, check the Add Image Pull Credentials box and type in the pull details.
Click Continue to Settings.
If you want the servers to be accessible to everyone, add the following ports under Public ports. This will assign Public IPs and open the ports to the Internet.
All containers get a Private IP assigned automatically.
- Port 8123 TCP is the default HTTP API Endpoint
- Port 9000 TCP is the default port for the native clients
Click Continue to Spec.
In Spec, choose the machine size. ClickHouse developers recommend a minimum of 4GB of RAM to allow for complex queries. But if you choose the smallest machine you can always scale the machine up with a few clicks later on.
Next we’ll need a volume to persist our data. Uunder Additional volume, enter
/var/lib/clickhouse and set its size to your requirements.
Under Deployment Target, group servers with similar scaling needs. Type a descriptive name like “clickhouse-servers”. Then select the PoP nearest to you or your customers, set the number of instances to 1, and click Create Workload.
Wait a few seconds until the instance is on the Running state.
Test the connection
The following test will only work if you have opened ports 8123 and 9000.
First, click on the running instance to get its details. Then scroll down to the Instance details section and copy the Public IP address.
Next, open a terminal and try the
/ping endpoint with Curl or browser. If the server is running it should respond with Ok.
$ curl -w "\n" YOUR_PUBLIC_IP:8123/ping Ok
We can also connect with ClickHouse Native Client.
docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD
Now you can run ClickHouse SQL commands on your new node.
clickhouse-servers-clickhouse-servers-jfk-0 :) create database my_db; CREATE DATABASE my_db Ok. 0 rows in set. Elapsed: 0.393 sec. clickhouse-servers-clickhouse-servers-jfk-0 :) show databases; SHOW DATABASES ┌─name────┐ │ default │ │ my_db │ │ system │ └─────────┘ 3 rows in set. Elapsed: 0.326 sec.
Sharded Cluster Configuration
In this section, we’ll deploy two more database instances to create a 3-node edge compute cluster.
Create more instances
First, we’ll create the remaining instances so that we can get their Private IPs.
On the left navigation menu, click Workloads and select your container workload. Then click the gear icon to edit the workload configuration.
Click the Edit icon next to Spec, Storage & Deployment and increase the number of Instances per PoP to 3. Then click Save Changes.
StackPath will create the remaining instances in only a few seconds.
When the instances start running, copy the Private IP address of each one.
Configure ClickHouse shards
We have to tell ClickHouse where all the nodes are. To do this, we need to get the
config.xml file from the ClickHouse image.
Open a terminal and type the following command to copy the file out.
docker cp clickhouse-server:/etc/clickhouse-server/config.xml
Then, open the file and locate the
<remote_servers incl="clickhouse_remote_servers" > <!-- Test only shard config for testing distributed storage --> <test_shard_localhost> <shard> <replica> <host>localhost</host> <port>9000</port> </replica> </shard> </test_shard_localhost> . . . </remote_servers>
Remove all the lines inside
<remote_servers><remote_servers></remote_servers> and enter the following config, replacing the IP addresses with your servers’ Private IPs.
<remote_servers incl="clickhouse_remote_servers" > <my_sharded_config> <shard> <replica> <host>SERVER_1_PRIVATE_IP</host> <port>9000</port> </replica> </shard> <shard> <replica> <host>SERVER_2_PRIVATE_IP</host> <port>9000</port> </replica> </shard> <shard> <replica> <host>SERVER_3_PRIVATE_IP</host> <port>9000</port> </replica> </shard> </my_sharded_config> </remote_servers>
Dockerfile to add the modified config:
FROM yandex/clickhouse-server COPY users.xml /etc/clickhouse-server COPY config.xml /etc/clickhouse-server
Build a new version of the image. (Notice how we’re using a different tag here.)
docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded
Push the new image to the Docker Registry like we did before.
docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded
Update the StackPath workload
The final step is to tell StackPath to use our new configuration. To do this, go to your StackPath Workload and click the gear icon again to change its configuration. Edit the Image name with the new tag:
Click Save Changes.
StackPath will start recycling the containers. When this is done, you should find all of them running the newest version.
Testing the cluster
Again, we can use Curl or any browser to check each of the ClickHouse nodes. Just replace the IP addresses with your Private IPs.
$ curl -w "\n" YOUR_PUBLIC_IP_1:8123/ping Ok $ curl -w "\n" YOUR_PUBLIC_IP_2:8123/ping Ok $ curl -w "\n" YOUR_PUBLIC_IP_2:8123/ping Ok
We can also connect to any of the instances and check the cluster health. Run the following query to see the cluster status.
select * from system.clusters;
$ docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD ClickHouse client version 18.104.22.168 (official build). Connecting to 22.214.171.124:9000 as user default. Connected to ClickHouse server version 20.3.2 revision 54433. clickhouse-servers-clickhouse-servers-jfk-0 :) select * from system.clusters; SELECT cluster, shard_num, host_address, is_local, errors_count FROM system.clusters ┌─cluster────────────┬─shard_num─┬─host_address─┬─is_local─┬─errors_count─┐ │ my_sharded_config │ 1 │ 10.128.96.2 │ 0 │ 0 │ │ my_sharded_config │ 2 │ 10.128.96.4 │ 1 │ 0 │ │ my_sharded_config │ 3 │ 10.128.96.3 │ 0 │ 0 │ └────────────────────┴───────────┴──────────────┴──────────┴──────────────┘ 3 rows in set. Elapsed: 2.140 sec.
And that’s it! If you made it this far, you have sucessfully deployed a 3-node cluster on StackPath.
For the next step in your road to mastering ClickHouse, try setting up a second cluster on a different PoP and establishing table replication between the two data centers.
If you want to learn more, check out these related docs and projects: