shoni.eth pfp
shoni.eth
@alexpaden
Just released the largest open dataset of Farcaster threads with embeddings!
📊 24.3M high-quality threads
🔍 512-dim Voyager embeddings (f32)
✨ Spam-filtered & engagement-ranked
📅 Complete Farcaster history to May 2025
Perfect for semantic search, clustering, recommendation systems & social analysis
🤗 https://huggingface.co/datasets/shoni/farcaster
8 replies
10 recasts
41 reactions
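The "semantic search" use case above can be sketched with plain cosine similarity over the 512-dim vectors. This is an illustrative sketch only: the column names and the use of random stand-in vectors are assumptions, not details from the post, and a real query would be embedded with the same Voyager model used for the dataset.

```python
# Sketch: top-k semantic search over 512-dim embeddings via cosine similarity.
# Random vectors stand in for the dataset's Voyager embeddings.
import numpy as np

def top_k(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity against every row
    return np.argsort(-sims)[:k]      # indices of the k highest scores

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 512)).astype(np.float32)
idx = top_k(emb[42], emb, k=3)        # a row queried against itself ranks first
```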

Mo pfp
Mo
@meb
This is very cool! Would you mind sharing more about the process?
- potential applications in mind
- how you ingested the data
- various embeddings and other data wrangling done
- cloud vs local usage
- total cost to generate this
1 reply
0 recast
2 reactions

shoni.eth pfp
shoni.eth
@alexpaden
Sure! I see threads as the most granular useful unit of data for analysis, so I'll be shifting to the next building block: user-based content.

I get all FC data from Neynar's parquet service, then use a set of custom Postgres procedures/tables to format/query/generate embeddings (it's a near-live pipeline).

The embeddings are float32; I use pgvector, which doesn't support int8, or I would have done that. You can easily convert the float32 to float16 for storage/query savings.

I run everything on my Mac Studio, but the goal is to create advanced live APIs that enable other builders/remove custom memory from the stack.

The embeddings themselves cost probably $50, AI tools for building the procedures were probably $200, Neynar parquet is about $300/mo, and the Mac Studio was $10k (it runs the Postgres server/pipeline/etc.). The whole nonspam dataset is about 2B tokens.
1 reply
0 recast
1 reaction
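The float32-to-float16 conversion mentioned above is a one-line cast. A minimal sketch (numpy assumed; the array shape is illustrative, not from the actual dataset) showing that it halves storage while keeping the values close:

```python
# Sketch: down-cast float32 embeddings to float16 for storage/query savings.
import numpy as np

rng = np.random.default_rng(1)
emb_f32 = rng.standard_normal((10_000, 512)).astype(np.float32)

emb_f16 = emb_f32.astype(np.float16)  # exactly half the bytes

# Round-trip error is small relative to unit-scale embedding values,
# so cosine rankings are typically preserved well at 512 dims.
max_err = float(np.max(np.abs(emb_f32 - emb_f16.astype(np.float32))))
```

Note that float16 trades ~3 decimal digits of precision for the 2x savings; int8 would go further, but as noted above pgvector does not support it.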