Background
A few days ago I suddenly noticed that NiFi was failing to sync data to ClickHouse in our test environment. The NiFi failure logs showed that the ClickHouse tables had gone into read-only mode. At first I assumed only those few tables were affected, so I planned to fix it by simply recreating them.
But creating a new table failed as well.
ClickHouse version 25.3.3.42; the error messages:
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper
Code: 210. DB::NetException: Connection reset by peer, while reading from socket (peer: yyyyyy:2181, local: xxxx:43274): while receiving handshake from ZooKeeper. (NETWORK_ERROR)
All connection tries failed while connecting to ZooKeeper. nodes: 10.100.100.1:2181
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.9.1.3278 (official build)), 10.100.100.1:2181
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.9.1.3278 (official build)), 10.100.100.1:2181
Code: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.9.1.3278 (official build)), 10.100.100.1:2181
After several attempts all ending in the same error, I realized this was a much bigger problem, not something that recreating tables alone could fix.
So I searched for solutions online and found a GitHub issue with the same error: "clickhouse keeper Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or stale" · Issue #39756 · ClickHouse/ClickHouse. At the end someone proposed this fix:
systemctl stop clickhouse-keeper
rm -rf /var/lib/clickhouse/coordination/log
I tried it, but it did not solve my problem.
Putting the later findings together, the root cause of this error was that the ZooKeeper cluster itself was broken.
Investigating the Cause
I gave part of the error output to DeepSeek and asked for a solution. Its conclusion:
ZooKeeper connection rejected: the ClickHouse nodes could not establish a connection to the ZooKeeper cluster, for one of several possible causes. So I went through the checks DeepSeek suggested:
1. Check ZooKeeper cluster status
# Connect to each ZooKeeper node and check its status
echo stat | nc localhost 2181
# Or use zkCli.sh (the stat command takes a znode path)
/path/to/zkCli.sh -server localhost:2181 stat /
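When stat returns nothing useful, it helps to probe every node in the ensemble, not just localhost. A minimal sketch with placeholder hostnames (on ZooKeeper 3.5+ the four-letter-word commands must be whitelisted via 4lw.commands.whitelist in zoo.cfg):
# Placeholder hostnames; replace with your actual ensemble members
for host in zk1.example.com zk2.example.com zk3.example.com; do
  echo "--- $host ---"
  echo ruok | nc "$host" 2181              # a healthy node answers "imok"
  echo srvr | nc "$host" 2181 | grep Mode  # reports leader / follower
done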
2. Check ClickHouse configuration
Running the query below returned an error instead of the normal result:
-- Inspect ZooKeeper from inside ClickHouse
SELECT * FROM system.zookeeper WHERE path = '/';
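Since that query itself depends on a working ZooKeeper connection, it is also worth double-checking which ZooKeeper addresses ClickHouse is configured with. A quick look, assuming the default config path (your install may keep this in a config.d override instead):
# Show the <zookeeper> section of the server config
grep -A 15 '<zookeeper>' /etc/clickhouse-server/config.xml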
3. Diagnose network connectivity
Running these checks on the ClickHouse server showed that one of the ZooKeeper hosts was unreachable:
# Test port connectivity
telnet zk1.example.com 2181
telnet zk2.example.com 2181
telnet zk3.example.com 2181
# Check DNS resolution
nslookup zk1.example.com
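If telnet is not installed on the server, nc can run the same reachability check; a sketch with the same placeholder hostnames:
# -z: probe without sending data, -w 3: three-second timeout
for host in zk1.example.com zk2.example.com zk3.example.com; do
  nc -vz -w 3 "$host" 2181
done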
4. Review detailed logs
The ClickHouse error log was full of repeated messages about failing to connect to the ZooKeeper port, tables being in read-only mode, and attempts to recover.
# ClickHouse logs
tail -f /var/log/clickhouse-server/clickhouse-server.log
tail -f /var/log/clickhouse-server/clickhouse-server.err.log
# ZooKeeper logs
tail -f /path/to/zookeeper/logs/zookeeper.out
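Rather than watching tail -f scroll by, filtering the error log for the keywords in question makes these messages easier to spot; a sketch against the default log path:
# Show only ZooKeeper- and read-only-related errors
grep -iE 'zookeeper|readonly' /var/log/clickhouse-server/clickhouse-server.err.log | tail -n 50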
At this point the ZooKeeper cluster was clearly in trouble: the ZooKeeper process on the unreachable host had died.
Given this conclusion, all I needed to do was restart the dead ZooKeeper instance; the whole ClickHouse cluster would then reconnect to the ZooKeeper ensemble, and the tables could be taken out of read-only mode.
Solution (standard cluster setup)
1. Restart the downed ZooKeeper node
# If ZooKeeper runs as a systemd service (after starting, check each node's ZooKeeper log to confirm it is healthy)
systemctl start zookeeper
# If it is not registered as a service
cd <ZooKeeper bin directory>
./zkServer.sh start
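Before moving on, it is worth confirming the restarted node actually rejoined the ensemble. A quick check run on the restarted host (the four-letter command again assumes it is whitelisted on ZooKeeper 3.5+):
./zkServer.sh status             # should report Mode: follower or Mode: leader
echo ruok | nc localhost 2181    # a healthy node answers "imok"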
2. Check that the ClickHouse logs look normal
cd /var/log/clickhouse-server/
tail -200f clickhouse-server.err.log
tail -200f clickhouse-server.log
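To narrow that output to the relevant lines, a simple filter on the same logs helps:
# Show only ZooKeeper-related lines from the recent log
tail -n 500 clickhouse-server.log | grep -i zookeeper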
3. Connect to ClickHouse and run SYSTEM RESTORE REPLICA to clear read-only mode
-- Find which tables are in read-only mode
SELECT database, table, zookeeper_path, replica_path
FROM system.replicas
WHERE is_readonly;
-- SYSTEM RESTORE REPLICA [database].[table_name];
SYSTEM RESTORE REPLICA ods.test;
-- If many tables are read-only, run this query, then copy and execute its output
SELECT 'SYSTEM RESTORE REPLICA ' || database || '.' || table || ';' AS restore_cmd
FROM system.replicas
WHERE is_readonly;
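If there are too many statements to copy by hand, the generation and execution can be chained from the shell. A sketch, assuming clickhouse-client connects with default options and that database and table names need no extra quoting:
# Generate the RESTORE statements (no trailing semicolons) and run each one
clickhouse-client --query "
  SELECT 'SYSTEM RESTORE REPLICA ' || database || '.' || table
  FROM system.replicas WHERE is_readonly
  FORMAT TSVRaw" |
while IFS= read -r stmt; do
  echo "Running: $stmt"
  clickhouse-client --query "$stmt"
done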
Note that for tables on a cluster, this has to be executed on every ClickHouse node. The official docs say you can also add ON CLUSTER, as in the sketch below.
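A minimal sketch of the ON CLUSTER form, which I have not verified myself; my_cluster is a placeholder for the cluster name defined in your remote_servers configuration:
-- Per the docs, this runs the restore on every replica in the cluster
SYSTEM RESTORE REPLICA ods.test ON CLUSTER my_cluster;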
With that, a standard setup would be fully recovered. Unfortunately, our test environment is set up in a rather convoluted way, so restarting ZooKeeper alone was not enough. The follow-up steps are covered in the next article.
The Real Culprit
In hindsight, the virtual machines must have been rebooted, and one ZooKeeper instance that had not been configured to start on boot never came back up, breaking the cluster.
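To keep this from happening again, the fix is to register ZooKeeper as a boot-time service. Assuming a systemd-managed install whose unit is actually named zookeeper (packaging may use a different unit name):
# Enable at boot and start immediately, then verify
systemctl enable --now zookeeper
systemctl is-enabled zookeeper   # should print "enabled"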