Background
A few days ago I suddenly noticed that NiFi was failing to sync data to ClickHouse in our test environment. The NiFi failure logs showed that the ClickHouse tables had gone into read-only mode. At first I assumed only those few tables were affected, so I planned to fix it by simply recreating them.
But creating a new table failed as well.
ClickHouse version 25.3.3.42; the error messages:
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper
Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: yyyyyy:2181, local: xxxx:43274): while receiving handshake from ZooKeeper. (NETWORK_ERROR)
All connection tries failed while connecting to ZooKeeper. nodes: 10.100.100.1:2181
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.9.1.3278 (official build)), 10.100.100.1:2181
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.9.1.3278 (official build)), 10.100.100.1:2181
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.9.1.3278 (official build)), 10.100.100.1:2181
After several attempts all ending in the same error, I realized this was a much bigger problem, not something that recreating tables alone could fix.
So I searched for solutions online and found a GitHub issue with the same error: "clickhouse keeper Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale" · Issue #39756 · ClickHouse/ClickHouse. At the end someone proposed this fix:
systemctl stop clickhouse-keeper
rm -rf /var/lib/clickhouse/coordination/log
I tried it, but it did not solve my problem.
Putting the later findings together, the root cause of this error was that the ZooKeeper cluster itself was broken.
Investigating the Cause
I gave part of the error output to DeepSeek and asked for a solution. Its conclusion:
ZooKeeper connection rejected: the ClickHouse nodes could not establish a connection to the ZooKeeper cluster, for one of several possible causes. So I went through the checks DeepSeek suggested:
1. Check ZooKeeper cluster status
# Connect to each ZooKeeper node and check its status
echo stat | nc localhost 2181
# Or use zkCli.sh (the stat command takes a znode path)
/path/to/zkCli.sh -server localhost:2181 stat /
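When stat returns nothing useful, it helps to probe every node in the ensemble, not just localhost. A minimal sketch with placeholder hostnames (on ZooKeeper 3.5+ the four-letter-word commands must be whitelisted via 4lw.commands.whitelist in zoo.cfg):
# Placeholder hostnames; replace with your actual ensemble members
for host in zk1.example.com zk2.example.com zk3.example.com; do
  echo "--- $host ---"
  echo ruok | nc "$host" 2181              # a healthy node answers "imok"
  echo srvr | nc "$host" 2181 | grep Mode  # reports leader / follower
done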
2. Check ClickHouse configuration
Running the query below returned an error instead of the normal result:
-- Inspect ZooKeeper from inside ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/';
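Since that query itself depends on a working ZooKeeper connection, it is also worth double-checking which ZooKeeper addresses ClickHouse is configured with. A quick look, assuming the default config path (your install may keep this in a config.d override instead):
# Show the <zookeeper> section of the server config
grep -A 15 '<zookeeper>' /etc/clickhouse-server/config.xml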
3. Diagnose network connectivity
Running these checks on the ClickHouse server showed that one of the ZooKeeper hosts was unreachable:
# Test port connectivity
telnet zk1.example.com 2181
telnet zk2.example.com 2181
telnet zk3.example.com 2181
# Check DNS resolution
nslookup zk1.example.com
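If telnet is not installed on the server, nc can run the same reachability check; a sketch with the same placeholder hostnames:
# -z: probe without sending data, -w 3: three-second timeout
for host in zk1.example.com zk2.example.com zk3.example.com; do
  nc -vz -w 3 "$host" 2181
done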
4. Review detailed logs
The ClickHouse error log was full of repeated messages about failing to connect to the ZooKeeper port, tables being in read-only mode, and attempts to recover.
# ClickHouse logs
tail -f /var/log/clickhouse-server/clickhouse-server.log
tail -f /var/log/clickhouse-server/clickhouse-server.err.log
# ZooKeeper logs
tail -f /path/to/zookeeper/logs/zookeeper.out
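Rather than watching tail -f scroll by, filtering the error log for the keywords in question makes these messages easier to spot; a sketch against the default log path:
# Show only ZooKeeper- and read-only-related errors
grep -iE 'zookeeper|readonly' /var/log/clickhouse-server/clickhouse-server.err.log | tail -n 50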
At this point the ZooKeeper cluster was clearly in trouble: the ZooKeeper process on the unreachable host had died.
Given this conclusion, all I needed to do was restart the dead ZooKeeper instance; the whole ClickHouse cluster would then reconnect to the ZooKeeper ensemble, and the tables could be taken out of read-only mode.
Solution (standard cluster setup)
1. Restart the downed ZooKeeper node
# If ZooKeeper runs as a systemd service (after starting, check each node's ZooKeeper log to confirm it is healthy)
systemctl start zookeeper
# If it is not registered as a service
cd <ZooKeeper bin directory>
./zkServer.sh start
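Before moving on, it is worth confirming the restarted node actually rejoined the ensemble. A quick check run on the restarted host (the four-letter command again assumes it is whitelisted on ZooKeeper 3.5+):
./zkServer.sh status             # should report Mode: follower or Mode: leader
echo ruok | nc localhost 2181    # a healthy node answers "imok"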
2. Check that the ClickHouse logs look normal
cd /var/log/clickhouse-server/
tail -200f clickhouse-server.err.log
tail -200f clickhouse-server.log
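To narrow that output to the relevant lines, a simple filter on the same logs helps:
# Show only ZooKeeper-related lines from the recent log
tail -n 500 clickhouse-server.log | grep -i zookeeper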
3. Connect to ClickHouse and run SYSTEM RESTORE REPLICA to clear read-only mode
-- Find which tables are in read-only mode
SELECT database, table, zookeeper_path, replica_path
FROM system.replicas
WHERE is_readonly;
-- SYSTEM RESTORE REPLICA [database].[table_name];
SYSTEM RESTORE REPLICA ods.test;
-- If many tables are read-only, run this query, then copy and execute its output
SELECT 'SYSTEM RESTORE REPLICA ' || database || '.' || table || ';' AS restore_cmd
FROM system.replicas
WHERE is_readonly;
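If there are too many statements to copy by hand, the generation and execution can be chained from the shell. A sketch, assuming clickhouse-client connects with default options and that database and table names need no extra quoting:
# Generate the RESTORE statements (no trailing semicolons) and run each one
clickhouse-client --query "
  SELECT 'SYSTEM RESTORE REPLICA ' || database || '.' || table
  FROM system.replicas WHERE is_readonly
  FORMAT TSVRaw" |
while IFS= read -r stmt; do
  echo "Running: $stmt"
  clickhouse-client --query "$stmt"
done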
Note that for tables on a cluster, this has to be executed on every ClickHouse node. The official docs say you can also add ON CLUSTER, as in the sketch below.
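A minimal sketch of the ON CLUSTER form, which I have not verified myself; my_cluster is a placeholder for the cluster name defined in your remote_servers configuration:
-- Per the docs, this runs the restore on every replica in the cluster
SYSTEM RESTORE REPLICA ods.test ON CLUSTER my_cluster;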
With that, a standard setup would be fully recovered. Unfortunately, our test environment is set up in a rather convoluted way, so restarting ZooKeeper alone was not enough. The follow-up steps are covered in the next article.
The Real Culprit
In hindsight, the virtual machines must have been rebooted, and one ZooKeeper instance that had not been configured to start on boot never came back up, breaking the cluster.
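To keep this from happening again, the fix is to register ZooKeeper as a boot-time service. Assuming a systemd-managed install whose unit is actually named zookeeper (packaging may use a different unit name):
# Enable at boot and start immediately, then verify
systemctl enable --now zookeeper
systemctl is-enabled zookeeper   # should print "enabled"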