@rakibrgt
been deep in snapchain reliability work the past few weeks. here’s a recap of what we found and how we're fixing it.
issues:
- brittle p2p mesh: consensus mesh breaks easily, nodes don't auto-reconnect, weak support for k8s / docker bridge networking
- pinned to an alpha commit of malachite (the BFT consensus lib), so even bugfix bumps risk regressions
- rocksdb misconfigured → memory leaks + perf bottlenecks on hot paths
- write-ahead logs stranding nodes in divergent vote rounds after restarts
- firehose logging that made triage hard (even for an LLM)
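
for the "nodes don't auto-reconnect" item, here's a rough sketch of what capped exponential backoff on redial could look like. this is not snapchain's actual code: `backoff_delay`, `reconnect`, and the 500ms / 30s constants are made up for illustration, and `dial` / `sleep` are passed in as closures so the sketch stays std-only.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at `max`.
/// Hypothetical helper, not part of snapchain or malachite.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_shl returns None once the shift reaches 32 bits; saturate instead of wrapping.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // checked_mul returns None on overflow; in that case we are far past the cap anyway.
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `dial` until it succeeds or `max_attempts` is exhausted, calling `sleep`
/// with a growing delay after each failure.
/// Returns the 0-based index of the attempt that succeeded, if any.
fn reconnect<D, S>(mut dial: D, mut sleep: S, max_attempts: u32) -> Option<u32>
where
    D: FnMut() -> bool,
    S: FnMut(Duration),
{
    // Assumed constants, chosen only for the example.
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..max_attempts {
        if dial() {
            return Some(attempt);
        }
        sleep(backoff_delay(attempt, base, max));
    }
    None
}
```

in a real node you'd pass `std::thread::sleep` (or an async timer) as `sleep` and probably add jitter so peers don't all redial in lockstep.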
improvements we’re making:
- tuned rocksdb across the common codepaths
- consolidated, less-brittle deploy pipeline
- granular consensus visibility tooling, now live on testnet
- saner log levels so on-call can actually find signal
- migrating from the stale informal-systems malachite fork to Circle's actively-maintained one
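
on the log-level point, one common way to cut a firehose down is a default level plus per-module overrides, longest prefix wins. a minimal std-only sketch of that idea follows. the `LogFilter` type and the module names in the usage note are hypothetical, and it says nothing about which logging crate snapchain actually uses.

```rust
/// Severity levels, ordered from noisiest to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// Hypothetical filter: a default minimum level plus per-module-prefix overrides.
struct LogFilter {
    default_level: Level,
    overrides: Vec<(&'static str, Level)>,
}

impl LogFilter {
    /// True if a message from `target` at `level` should be emitted.
    /// The longest override prefix matching `target` decides the minimum level;
    /// with no matching override, the default applies.
    fn enabled(&self, target: &str, level: Level) -> bool {
        let min = self
            .overrides
            .iter()
            .filter(|o| target.starts_with(o.0))
            .max_by_key(|o| o.0.len())
            .map_or(self.default_level, |o| o.1);
        level >= min
    }
}
```

e.g. a default of `Level::Info` with overrides `("consensus", Level::Debug)` and `("p2p::gossip", Level::Warn)` keeps consensus debug lines while dropping gossip chatter below warn.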