Varun Srinivasan pfp
Varun Srinivasan
@v
Here's a deep dive into what caused this very painful outage. Tap into the thread for details.

When you link a token to a cast like $DEGEN, it creates an embed. Our server hydrates this embed with data like the image and token creator, and stores it in our database so we can show it to you quickly.

When we launched token news, we added a lot of info to the token object, like the casts that were included in the news. We didn't realize that our hydrator was including all the news and news casts in this object every time we created a token link.

We recently ended up in a recursive state because some of the casts in the token news had token links themselves. Now a cast with a token link would trigger our server to include the news object, which contained casts, some of which contained the same token link, and so on. These casts would quickly balloon to 5 to 10 MB each.

Our feed generator tries to fetch all the casts you want to read, order them, then compress them and put them into Redis. Compressing these huge casts made the CPU on the feed workers stall. It got very bad very fast: as soon as a worker came up, it would start picking jobs off the queue and stall immediately. We couldn't even SSH into the box to figure out what was going on.

We ended up handling it by shutting off various parts of feed generation until we could isolate a few problematic lines of code. We also scaled back the processor and forced it to run very slowly so we could get into the boxes and profile things. Both of these threads eventually led us to the culprits, which we fixed.
27 replies
25 recasts
253 reactions
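The recursion described in the post can be sketched as follows. This is a minimal illustration, not Warpcast's actual code: `hydrate_token_embed`, the dict shapes, and the `max_depth` cap are all hypothetical stand-ins for the hydrator and the fix.

```python
# Minimal sketch of the recursive hydration bug: a token's news casts
# are hydrated with *their* token links, and some of those link back
# to the same token. Without a depth cap, the object grows without bound.

def hydrate_token_embed(token_id, tokens, casts, depth=0, max_depth=1):
    """Hydrate a token link into an embed object (hypothetical)."""
    token = dict(tokens[token_id])   # base fields: image, creator, ...
    if depth >= max_depth:           # the fix: don't hydrate nested news
        token.pop("news", None)
        return token
    hydrated = []
    for cast_id in token.get("news", []):
        cast = dict(casts[cast_id])
        cast["embeds"] = [
            hydrate_token_embed(t, tokens, casts, depth + 1, max_depth)
            for t in cast.get("token_links", [])
        ]
        hydrated.append(cast)
    token["news"] = hydrated
    return token

# Toy data reproducing the cycle: $DEGEN's news contains a cast that
# itself links $DEGEN.
tokens = {"DEGEN": {"symbol": "DEGEN", "image": "degen.png", "news": ["c1"]}}
casts = {"c1": {"text": "news about $DEGEN", "token_links": ["DEGEN"]}}
embed = hydrate_token_embed("DEGEN", tokens, casts)
```

With the depth cap, the nested $DEGEN embed comes back without its `news` field, breaking the cycle; without it, the same token would be re-expanded on every pass.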

Varun Srinivasan pfp
Varun Srinivasan
@v
Just kidding, I forgot I didn't need to thread.
3 replies
1 recast
43 reactions

pugson pfp
pugson
@pugson
that’s crazy. only redis was affected or did any of your databases bubble up with recursive data as well?
1 reply
0 recast
3 reactions

derek pfp
derek
@derek
just curious: are the interfaces GQL? in my experience very easy to have a recursion issue
1 reply
0 recast
0 reaction

Nabil Abdellaoui pfp
Nabil Abdellaoui
@randombishop
Checks all the boxes of nasty bugs:
- Can't see it in staging
- Needs very specific data to reproduce
- Doesn't really error, silently bubbles up
- Explodes everything
Special bonus if it triggered a PagerDuty alert while the on-call dev is drunk.
0 reply
0 recast
8 reactions

Blinky Stitt pfp
Blinky Stitt
@flashprofits.eth
I'm a big fan of using cpuset so that apps never run on core 0. That way ssh always has some room.
0 reply
0 recast
6 reactions
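The tip above (reserving core 0 so sshd stays responsive when workers peg the CPU) can also be done per-process. This is a sketch, not a standard API: `avoid_core_zero` is a made-up name, and `os.sched_setaffinity` is Linux-only, hence the guard.

```python
# Per-process analogue of the cpuset idea: keep a worker off core 0 so
# sshd always has headroom even when the worker's CPU stalls.
import os

def avoid_core_zero(pid=0):
    """Restrict a process (pid=0 means the caller) to every core but 0."""
    cores = set(range(os.cpu_count() or 1)) - {0}   # empty on single-core boxes
    if cores and hasattr(os, "sched_setaffinity"):  # Linux-only syscall
        os.sched_setaffinity(pid, cores)
    return cores
```

System-wide, the same effect comes from a cpuset cgroup (or the `isolcpus` boot parameter) that excludes core 0 from the application slice, which is closer to what the comment describes.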

bertwurst pfp
bertwurst
@bertwurst.eth
1 reply
0 recast
11 reactions

Rakshita Philip pfp
Rakshita Philip
@awkquarian
i don’t understand any of it, but i’m happy for you
1 reply
0 recast
7 reactions

Hi, I’m Ashley pfp
Hi, I’m Ashley
@leadgen
♾️♾️♾️♾️♾️♾️♾️ sheesh! Recursive and deadlock bugs are two of the most difficult bugs to detect, a true on-call nightmare. If you've got a huge codebase, I feel bad for whoever was on call 😵‍💫
0 reply
0 recast
1 reaction

Aaron Ho φ pfp
Aaron Ho φ
@aho
add circuit breakers!
0 reply
0 recast
0 reaction

Colin Charles pfp
Colin Charles
@bytebot
Thank you for writing this blameless incident report!
0 reply
0 recast
2 reactions

MajorTom327 pfp
MajorTom327
@majortom327.eth
Interesting! Good postmortem! Yeah, that would have been hard to debug. I don't know if you could have reproduced it locally; if so, bisecting would have been a good way to find which lines were causing issues. But handling this kind of incident is hard, since everybody is racing to fix it fast, and it's easy to say what the right move was after the issue is solved. Good job!
0 reply
0 recast
0 reaction

dCommunity - Home of AvenueD pfp
dCommunity - Home of AvenueD
@dcommunity
Nice job to the team 👊 That is not ordinary and is made worse with the public facing nature. Glad you guys withstood the pressure 😌
0 reply
0 recast
0 reaction

TobyJaguar pfp
TobyJaguar
@tobyjaguar
🤔
0 reply
0 recast
0 reaction

Babooun pfp
Babooun
@babooun
Great communication, love the transparency! Do you use real data for testing? Like recording 24 hours of real prod events and dynamically replaying them on staging to test a new release.
0 reply
0 recast
0 reaction

InsideTheSim 🎩🍪 pfp
InsideTheSim 🎩🍪
@insidethesim.eth
Oooh, that’s insidious. Good job pulling that apart and finding the source.
0 reply
0 recast
0 reaction

UNIBROS 🔥 pfp
UNIBROS 🔥
@unibros
anyway we can get the cached token images refreshed? Mine has been showing the wrong image. Can’t get in touch w/ anyone on the team
0 reply
0 recast
0 reaction

sandymariposa pfp
sandymariposa
@sandymariposa
R we supposed to be jet lagged now?
0 reply
0 recast
0 reaction

Biggydaddy.eth pfp
Biggydaddy.eth
@biggydaddy
I am just here to read, everything seems gibberish to me 😩😩
0 reply
0 recast
0 reaction

C O M P Ξ Z 🧬 pfp
C O M P Ξ Z 🧬
@compez.eth
Love this debugging! 👍🏼
0 reply
0 recast
0 reaction