Rakib Hossain

@rakibrgt

been deep in snapchain reliability work the past few weeks. here’s a recap of what we found and how we're fixing it.

issues:
- brittle p2p mesh, consensus mesh breaks easily, nodes don't auto-reconnect, weak support for k8s / docker bridge networking
- pinned to an alpha commit of malachite (the BFT consensus lib), so even bugfix bumps risk regressions
- rocksdb misconfigured → memory leaks + perf bottlenecks on hot paths
- write-ahead logs stranding nodes in divergent vote rounds after restarts
- firehose logging that made triage hard (even for an LLM)

improvements we’re making:
- tuned rocksdb across the common codepaths
- consolidated, less-brittle deploy pipeline
- granular consensus visibility tooling, now live on testnet
- saner log levels so on-call can actually find signal
- migrating from the stale informal-systems malachite fork to Circle's actively-maintained one