ClickHouse Kubernetes Cluster Deployment and Maintenance Guide
2026-05-29 12:30:03
# ClickHouse Cluster Deployment and Maintenance Guide

## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Components](#components)
- [Initial Deployment](#initial-deployment)
- [Configuration](#configuration)
- [Account and Authentication Management](#account-and-authentication-management)
- [Routine Maintenance](#routine-maintenance)
- [DDL Best Practices](#ddl-best-practices)
- [Data Operations](#data-operations)
- [Monitoring and Troubleshooting](#monitoring-and-troubleshooting)
- [Scaling Guide](#scaling-guide)

---

## Architecture Overview

```
           ┌─────────────────────────────────────┐
           │          AWS Internal NLB           │
           │ (port 8123 HTTP / 9000 native TCP)  │
           └──────────────┬──────────────────────┘
                          │
              Route53 A Record (alias)
        clickhouse.nonprod.internal.icbc.com
                          │
       ┌──────────────────┴──────────────────┐
       │                                     │
┌──────▼──────┐                   ┌──────────▼──────┐
│    Pod-0    │                   │      Pod-1      │
│   Shard 1   │◄── replication ──►│     Shard 1     │
│  Replica 1  │    (Keeper/ZK)    │    Replica 2    │
└─────────────┘                   └─────────────────┘

  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │  Keeper-0   │  │  Keeper-1   │  │  Keeper-2   │
  │  (quorum)   │  │  (quorum)   │  │  (quorum)   │
  └─────────────┘  └─────────────┘  └─────────────┘
```

**Topology:** 1 shard × 2 replicas (high availability; each replica holds a full copy of the data)

**Coordination layer:** KeeperCluster with 3 nodes (ZooKeeper protocol; an odd node count is used so that a majority quorum survives the loss of one node)

---

## Components

| Component | Count | Role |
|------|------|------|
| ClickHouseCluster | 2 Pods | Data storage and queries; the two Pods are replicas of each other |
| KeeperCluster | 3 Pods | Distributed coordination (replica synchronization, DDL distribution) |
| NLB (internal) | 1 | Load balancing; exposes ports 8123/9000 outside the cluster |
| ArgoCD Application | 2 | GitOps automated sync and deployment |

---

## Initial Deployment

### Prerequisites

- ArgoCD is installed and connected to the target cluster
- AWS Load Balancer Controller is installed
- A `gp3` StorageClass exists in the cluster

### Deployment Steps

```bash
# 1. Apply this single file; ArgoCD takes care of the rest
kubectl apply -f clickhouse/nonprod/argocd-apps.yaml -n argocd

# 2. Watch sync progress (sync-wave 0 installs the operator first, wave 1 installs the cluster)
argocd app get clickhouse-operator
argocd app get clickhouse-cluster

# 3. Wait for all Pods to become ready
kubectl get pods -n clickhouse-operator -w
```

### Expected Pod Status

```
clickhouse-clickhouse-0-0-0        1/1   Running
clickhouse-clickhouse-0-1-0        1/1   Running
clickhouse-keeper-keeper-0-0       1/1   Running
clickhouse-keeper-keeper-1-0       1/1   Running
clickhouse-keeper-keeper-2-0       1/1   Running
clickhouse-operator-controller-*   1/1   Running
```

### Route53 Configuration

In AWS Console → Route53 → private hosted zone, create:

| Record name | Type | Target |
|--------|------|------|
| `clickhouse.nonprod.internal.icbc.com` | A (alias) | NLB DNS name |

To get the NLB DNS name:

```bash
kubectl get svc clickhouse-external -n clickhouse-operator \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```

---

## Configuration

### File Layout

```
clickhouse/nonprod/
├── argocd-apps.yaml     # ArgoCD Applications (operator + cluster)
├── kustomization.yaml   # Kustomize entry point
├── keeper.yaml          # KeeperCluster + PDB
└── clickhouse.yaml      # ClickHouseCluster + NLB Service + PDB
```

### Resource Specifications

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage |
|------|-------------|-----------|----------------|--------------|------|
| ClickHouse | 2 cores | 4 cores | 8 Gi | 16 Gi | 300 Gi (gp3) |
| Keeper | 500m | 1 core | 1 Gi | 2 Gi | 100 Gi (gp3) |

### Connection Details

| Interface | Port | Protocol | Used by |
|------|------|------|------|
| HTTP API | 8123 | HTTP | curl / SDKs / BI tools |
| Native TCP | 9000 | Binary | clickhouse-client / Go and Python drivers |

---

## Account and Authentication Management

### Overview

| Account | Managed by | Stored in |
|------|----------|----------|
| `default` | Operator (native support) + Kubernetes Secret | `clickhouse-credentials` Secret |
| Application accounts | SQL `CREATE USER ... ON CLUSTER` | ClickHouse PV (persistent) |

### Default User Password

The password is injected through a Kubernetes Secret. The operator reads it and writes it into `users.yaml`, so both replicas receive the same value.

**The Secret is created manually by the operations team and is not committed to git:**

```bash
# Initial creation
kubectl create secret generic clickhouse-credentials \
  --from-literal=default-password='your_strong_password' \
  -n clickhouse-operator

# Change the password (after the Secret is updated the operator hot-reloads it; no Pod restart is needed)
kubectl patch secret clickhouse-credentials \
  -n clickhouse-operator \
  --type merge \
  -p '{"stringData":{"default-password":"new_strong_password"}}'
```
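The `your_strong_password` placeholder above has to be replaced with a real value. The following is a minimal sketch of one way to generate it, assuming a POSIX shell with `/dev/urandom` and `od` available. The function name `gen_password` is hypothetical and not part of the original setup. It emits only hex characters, so the value can be embedded in the JSON merge patch without escaping.

```shell
# gen_password BYTES: read BYTES random bytes and print them as 2*BYTES
# lowercase hex characters. Hypothetical helper, not part of the deployment.
gen_password() {
  od -An -N "${1:-24}" -tx1 /dev/urandom | tr -d ' \n'
}

NEW_PWD="$(gen_password 24)"
echo "generated a password of ${#NEW_PWD} characters"
```

The result can then be passed on, for example as `--from-literal=default-password="$NEW_PWD"`.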
**Reference configuration in `clickhouse.yaml`:**

```yaml
settings:
  defaultUserPassword:
    secret:
      name: clickhouse-credentials
      key: default-password
    passwordType: password   # ClickHouse config key, not 'plaintext_password'
```

> **Note:** `passwordType` must be a native field name from ClickHouse's `users.yaml`:
> - `password`: plaintext (acceptable on an internal network)
> - `password_sha256_hex`: SHA256 hash (recommended for production)
> - `no_password`: no password (testing only)

**Using a SHA256 password (more secure):**

```bash
# Generate the SHA256 hash
echo -n 'your_strong_password' | sha256sum | awk '{print $1}'

# Store the hash in the Secret (the value below is only an example)
kubectl create secret generic clickhouse-credentials \
  --from-literal=default-password='a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3' \
  -n clickhouse-operator

# In clickhouse.yaml, change the type to
# passwordType: password_sha256_hex
```

### Creating Application Accounts

Once the default password is in effect, create the application accounts with SQL. They are persisted on the PV and survive Pod restarts:

```bash
NLB="clickhouse.nonprod.internal.icbc.com"
DEFAULT_PWD="your_strong_password"

# Read-write account
curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: ${DEFAULT_PWD}" \
  --data "CREATE USER IF NOT EXISTS app_user ON CLUSTER default IDENTIFIED WITH sha256_password BY 'app_password'"

# In GRANT, the ON CLUSTER clause goes directly after the GRANT keyword
curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: ${DEFAULT_PWD}" \
  --data "GRANT ON CLUSTER default SELECT, INSERT, CREATE TABLE, CREATE DATABASE ON *.* TO app_user"

# Read-only account
curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: ${DEFAULT_PWD}" \
  --data "CREATE USER IF NOT EXISTS readonly_user ON CLUSTER default IDENTIFIED WITH sha256_password BY 'readonly_password'"

curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: ${DEFAULT_PWD}" \
  --data "GRANT ON CLUSTER default SELECT ON *.* TO readonly_user"
```

### Verifying Authentication

```bash
# Correct password → returns default
curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: your_strong_password" \
  --data "SELECT currentUser()"

# No password → returns 401 Unauthorized
curl -s "http://${NLB}:8123/?query=SELECT+1"

# List all users
curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: your_strong_password" \
  --data "SELECT name, auth_type FROM system.users FORMAT Pretty"
```

### Common Authentication Problems

| Symptom | Cause | Fix |
|------|------|------|
| Pod in CrashLoopBackOff, log shows `BAD_ARGUMENTS` | Wrong `passwordType` value (for example `plaintext_password`) | Change it to `password` |
| Pod in CrashLoopBackOff, log shows `no such file` | The Secret does not exist or the key name does not match | Check the Secret and the key name |
| 401 Unauthorized | Wrong password or missing authentication headers | Check `-H "X-ClickHouse-Key"` |
| Password change does not take effect | The operator did not trigger a hot reload | Delete one Pod to force a restart |

---

## Routine Maintenance

### Connectivity Checks

```bash
NLB="clickhouse.nonprod.internal.icbc.com"
# Once the default password is set, add the X-ClickHouse-User and
# X-ClickHouse-Key headers to the query requests in this section.

# Basic ping
curl -s "http://${NLB}:8123/ping"
# Expected: Ok.

# Check the version
curl -s "http://${NLB}:8123/?query=SELECT+version()"

# Check replica synchronization status
curl -s "http://${NLB}:8123/?query=SELECT+database,table,is_leader,total_replicas,active_replicas+FROM+system.replicas+FORMAT+Pretty"

# Check the connection to Keeper
curl -s "http://${NLB}:8123/?query=SELECT+*+FROM+system.zookeeper+WHERE+path+%3D+%27%2F%27+FORMAT+Pretty"
```

### Viewing Cluster Node Status

```bash
curl -s "http://${NLB}:8123/?query=SELECT+cluster,shard_num,replica_num,host_name,errors_count,is_local+FROM+system.clusters+FORMAT+Pretty"
```

### Pod Operations

```bash
# List Pods
kubectl get pods -n clickhouse-operator

# Follow the ClickHouse log
kubectl logs -n clickhouse-operator clickhouse-clickhouse-0-0-0 -f

# Follow the Keeper log
kubectl logs -n clickhouse-operator clickhouse-keeper-keeper-0-0 -f

# Open a client session inside a ClickHouse Pod
kubectl exec -it clickhouse-clickhouse-0-0-0 -n clickhouse-operator -- \
  clickhouse-client
```

### Restarts

```bash
# Rolling restart of ClickHouse (one Pod at a time, no service interruption)
kubectl rollout restart statefulset -n clickhouse-operator \
  -l clickhouse.com/role=clickhouse-server

# Restart Keeper (with care: only one node at a time)
kubectl delete pod clickhouse-keeper-keeper-0-0 -n clickhouse-operator
# Wait for it to recover before touching the next one
kubectl wait pod/clickhouse-keeper-keeper-0-0 -n clickhouse-operator \
  --for=condition=Ready --timeout=120s
```

### Disk Usage

```bash
# Disk usage per database
curl -s "http://${NLB}:8123/?query=SELECT+database,formatReadableSize(sum(bytes_on_disk))+AS+size+FROM+system.parts+GROUP+BY+database+FORMAT+Pretty"

# Disk usage per table
curl -s "http://${NLB}:8123/?query=SELECT+database,table,formatReadableSize(sum(bytes_on_disk))+AS+size,sum(rows)+AS+rows+FROM+system.parts+WHERE+active=1+GROUP+BY+database,table+ORDER+BY+sum(bytes_on_disk)+DESC+FORMAT+Pretty"

# PVC status
kubectl get pvc -n clickhouse-operator
```

---

## DDL Best Practices

> **Core rule: run every DDL statement with `ON CLUSTER default`.**
>
> `ReplicatedMergeTree` replicates inserted data and `ALTER` statements on an existing replicated table, but `CREATE`, `DROP`, and `RENAME` are not replicated.
> Without `ON CLUSTER`, such a statement runs only on the Pod that received the request, leaving the two replicas with different schemas.

### Creating a Database

```sql
CREATE DATABASE IF NOT EXISTS mydb ON CLUSTER default;
```

### Creating a Table (Standard Template)

```sql
CREATE TABLE IF NOT EXISTS mydb.my_table ON CLUSTER default
(
    id         UInt64,
    created_at DateTime DEFAULT now(),
    -- business fields
    symbol     String,
    price      Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_table', '{replica}')
PARTITION BY toYYYYMM(created_at)
ORDER BY (created_at, id)
TTL created_at + INTERVAL 90 DAY; -- optional: expire rows after 90 days
```

**`{shard}` and `{replica}` are macros whose values the operator configures for each Pod. Leave them exactly as written.**

### Altering a Table

```sql
-- Add a column
ALTER TABLE mydb.my_table ON CLUSTER default
    ADD COLUMN volume Float64 DEFAULT 0;

-- Drop a column
ALTER TABLE mydb.my_table ON CLUSTER default
    DROP COLUMN volume;

-- Change a column type
ALTER TABLE mydb.my_table ON CLUSTER default
    MODIFY COLUMN price Decimal(20, 8);
```

### Dropping a Database or Table

```sql
DROP TABLE IF EXISTS mydb.my_table ON CLUSTER default;
DROP DATABASE IF EXISTS mydb ON CLUSTER default;
```

---

## Data Operations

### Writing

```bash
# Single-row insert
curl -s "http://${NLB}:8123/" \
  --data "INSERT INTO mydb.my_table (id, symbol, price) VALUES (1, 'BTCUSDT', 67500.5)"

# Batch insert (recommended: produces fewer small parts)
curl -s "http://${NLB}:8123/" \
  --data "INSERT INTO mydb.my_table (id, symbol, price) VALUES (1, 'BTCUSDT', 67500.5), (2, 'ETHUSDT', 3500.0)"

# Bulk CSV import
cat data.csv | curl -s "http://${NLB}:8123/?query=INSERT+INTO+mydb.my_table+FORMAT+CSV" \
  --data-binary @-
```

### Strongly Consistent Writes (Financial Workloads)

```bash
# insert_quorum=2: the insert succeeds only after both replicas have confirmed it
curl -s "http://${NLB}:8123/?insert_quorum=2" \
  --data "INSERT INTO mydb.my_table ..."

# Pair it with sequentially consistent reads
curl -s "http://${NLB}:8123/?select_sequential_consistency=1&query=SELECT+*+FROM+mydb.my_table"
```

### Querying

```bash
# Plain query
curl -s "http://${NLB}:8123/?query=SELECT+*+FROM+mydb.my_table+LIMIT+10+FORMAT+Pretty"

# With authentication
curl -s "http://${NLB}:8123/" \
  -H "X-ClickHouse-User: default" \
  -H "X-ClickHouse-Key: your_password" \
  --data "SELECT count() FROM mydb.my_table"
```

---

## Monitoring and Troubleshooting

### Replica Lag

```sql
-- Show how many operations each replica is behind
SELECT
    database,
    table,
    replica_name,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    log_pointer,
    total_replicas,
    active_replicas
FROM system.replicas
WHERE queue_size > 0;
```

### Slow Queries

```sql
-- Queries currently running
SELECT query_id, user, elapsed, query
FROM system.processes
ORDER BY elapsed DESC;

-- Kill a slow query
KILL QUERY WHERE query_id = 'xxx';

-- Historical slow queries (top 10)
SELECT query, query_duration_ms, read_rows, memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 10;
```

### Common Problems

| Symptom | How to check | Likely cause |
|------|----------|----------|
| INSERT fails with `UNKNOWN_DATABASE` | Run `SHOW DATABASES` on each Pod separately | DDL was run without `ON CLUSTER` |
| SELECT returns duplicate rows | Run `SELECT count()` on each Pod separately and compare | Not expected in this topology, because the NLB sends each request to a single replica; look for rows that were inserted more than once or a `Distributed` table that treats the replicas as separate shards |
| Replication has stopped | `SELECT * FROM system.replicas` | Keeper is unhealthy; check the Keeper logs |
| Pod OOMKilled | `kubectl describe pod <pod>` | Memory limit too low; adjust the resource settings |
| NLB target unhealthy | `kubectl get endpoints clickhouse-external -n clickhouse-operator` | Pod not Ready, or the selector does not match |

### Keeper Health Checks

```bash
# Check that ClickHouse can reach a working Keeper quorum
curl -s "http://${NLB}:8123/?query=SELECT+*+FROM+system.zookeeper+WHERE+path%3D%27%2F%27"

# Keeper node status (run inside the Pod)
kubectl exec -it clickhouse-keeper-keeper-0-0 -n clickhouse-operator -- \
  clickhouse-keeper-client -h localhost -p 9181 -q "ruok"
# Expected: imok
```

---

## Scaling Guide

### Adding Replicas (Currently 1 Shard, 2 Replicas)

Edit `clickhouse/nonprod/clickhouse.yaml`:

```yaml
spec:
  replicas: 3   # changed from 2 to 3
```

After the change is pushed, ArgoCD rolls out the new replica automatically. The new replica then fetches the full data set from the existing replicas, with Keeper coordinating the process.

### Adding Shards (When Data Reaches the TB Range)

```yaml
spec:
  shards: 2     # add a shard
  replicas: 2   # 2 replicas per shard
```

After adding shards, create a `Distributed` table as the query entry point. Existing data is not rebalanced across shards automatically.

```sql
CREATE TABLE mydb.my_table_dist ON CLUSTER default
AS mydb.my_table
ENGINE = Distributed(default, mydb, my_table, rand());
```

### Expanding Storage

Edit `dataVolumeClaimSpec.resources.requests.storage`. **Note that a PVC can be expanded but not shrunk**, and the StorageClass must allow volume expansion:

```yaml
dataVolumeClaimSpec:
  storageClassName: gp3
  resources:
    requests:
      storage: 500Gi   # changed from 300Gi to 500Gi
```

---

## Version Information

| Item | Version |
|------|------|
| ClickHouse | 26.5.1.882 |
| clickhouse-operator-helm | 0.0.5 |
| Keeper node count | 3 |
| Cluster name | default |
| Namespace | clickhouse-operator |
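Many of the curl examples in this guide embed hand-encoded SQL in the URL (`+` for spaces, `%27` for quotes), which is easy to get wrong. As a rough sketch, assuming a POSIX shell with `od`, `tr`, and `sed`, the hypothetical helper `ch_urlencode` below percent-encodes every byte of its argument. That is valid percent-encoding, although it is longer than strictly necessary.

```shell
# ch_urlencode STRING: percent-encode every byte of STRING as lowercase %xx.
# Hypothetical helper, not part of the original runbook.
ch_urlencode() {
  printf '%s' "$1" | od -An -tx1 -v | tr -d ' \n' | sed 's/../%&/g'
}

ch_urlencode "SELECT 1"
echo
```

It could be used as `curl -s "http://${NLB}:8123/?query=$(ch_urlencode "SELECT version()")"`. curl can also do the encoding itself with `-G --data-urlencode "query=SELECT version()"`.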
Appendix: ArgoCD and Kustomize manifests:
# argocd-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clickhouse-operator
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: ghcr.io/clickhouse
    chart: clickhouse-operator-helm
    targetRevision: 0.0.5
    helm:
      releaseName: clickhouse-operator
  destination:
    server: https://kubernetes.default.svc
    namespace: clickhouse-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clickhouse-cluster
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: git@github.com:icbc/devops-infra-deploy-manifests.git
    targetRevision: main
    path: clickhouse/nonprod
  destination:
    server: https://icbc.yl4.ap-northeast-1.eks.amazonaws.com
    namespace: clickhouse-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: clickhouse-operator
labels:
  - pairs:
      app: clickhouse
      team: platform
    includeSelectors: false
    includeTemplates: false
resources:
  - keeper.yaml
  - clickhouse.yaml
# keeper.yaml
apiVersion: clickhouse.com/v1alpha1
kind: KeeperCluster
metadata:
  name: clickhouse-keeper
spec:
  replicas: 3
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
  dataVolumeClaimSpec:
    storageClassName: gp3
    resources:
      requests:
        storage: 100Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-keeper-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: clickhouse-keeper
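The `minAvailable: 2` in the Keeper PodDisruptionBudget matches the majority of a 3-node ensemble. A small sketch of that arithmetic follows. The function names are hypothetical, and the majority rule (more than half of the nodes) is the usual quorum definition for ZooKeeper-style coordination rather than something read from this cluster.

```shell
# keeper_quorum N: smallest majority of an N-node ensemble.
keeper_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# keeper_tolerated_failures N: nodes that can be lost while a majority remains.
keeper_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "nodes=$n quorum=$(keeper_quorum "$n") tolerated_failures=$(keeper_tolerated_failures "$n")"
done
```

Going from 3 to 4 nodes raises the quorum without tolerating any additional failure, which is why an odd node count is the usual choice.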
# clickhouse.yaml
apiVersion: clickhouse.com/v1alpha1
kind: ClickHouseCluster
metadata:
  name: clickhouse
spec:
  replicas: 2
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  dataVolumeClaimSpec:
    storageClassName: gp3
    resources:
      requests:
        storage: 300Gi
  keeperClusterRef:
    name: clickhouse-keeper
  settings:
    defaultUserPassword:
      secret:
        name: clickhouse-credentials
        key: default-password
      passwordType: password
    enableDatabaseSync: true
---
apiVersion: v1
kind: Service
metadata:
  name: clickhouse-external
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
spec:
  type: LoadBalancer
  selector:
    clickhouse.com/role: clickhouse-server
  ports:
    - name: http
      port: 8123
      targetPort: 8123
    - name: native
      port: 9000
      targetPort: 9000
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      clickhouse.com/role: clickhouse-server
kubectl create secret generic clickhouse-credentials --from-literal=default-password='123456789xxx' -n clickhouse-operator --dry-run=client -o yaml | kubectl apply -f -
secret/clickhouse-credentials created

kubectl get secret clickhouse-credentials -n clickhouse-operator
NAME                     TYPE     DATA   AGE
clickhouse-credentials   Opaque   1      3m17s
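`kubectl get secret` only confirms that the Secret exists with one data key. To check which value was actually stored, the base64-encoded entry can be decoded. This is a minimal sketch, assuming a `base64` binary that accepts `--decode`. The helper name `decode_secret_value` is hypothetical, and the kubectl call is left as a comment because it needs cluster access.

```shell
# decode_secret_value B64: decode one base64 value taken from a Secret's .data map.
# Hypothetical helper, not part of the original transcript.
decode_secret_value() {
  printf '%s' "$1" | base64 --decode
}

# Against the cluster (requires kubectl access, not executed here):
#   encoded="$(kubectl get secret clickhouse-credentials -n clickhouse-operator \
#     -o jsonpath='{.data.default-password}')"
#   decode_secret_value "$encoded"

# Offline demonstration with a made-up value:
decode_secret_value 'ZXhhbXBsZS1wYXNzd29yZA=='
echo
```

Decoding prints the password in clear text, so it is best run only in a trusted terminal.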
Source: https://www.cnblogs.com/Jame-mei/p/20213879
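This article was submitted by an internet user. The views expressed are the author's own and do not represent the position of this site. This site only provides information storage space, holds no ownership, and assumes no related legal liability. If the content is infringing, unlawful, or factually incorrect, please report it by email to jacktools123@163.com; once verified, it will be removed immediately.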

