shoni.eth pfp
shoni.eth
@alexpaden
Just released the largest open dataset of Farcaster threads with embeddings!
📊 24.3M high-quality threads
🔍 512-dim Voyager embeddings (f32)
✨ Spam-filtered & engagement-ranked
📅 Complete Farcaster history to May 2025
Perfect for semantic search, clustering, recommendation systems & social analysis
🤗 https://huggingface.co/datasets/shoni/farcaster
8 replies
10 recasts
41 reactions
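The "semantic search" use case above can be sketched with plain cosine similarity over the 512-dim vectors. This is an illustrative sketch only: the column names and the use of random stand-in vectors are assumptions, not details from the post, and a real query would be embedded with the same Voyager model used for the dataset.

```python
# Sketch: top-k semantic search over 512-dim embeddings via cosine similarity.
# Random vectors stand in for the dataset's Voyager embeddings.
import numpy as np

def top_k(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar rows by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity against every row
    return np.argsort(-sims)[:k]      # indices of the k highest scores

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 512)).astype(np.float32)
idx = top_k(emb[42], emb, k=3)        # a row queried against itself ranks first
```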

Mo pfp
Mo
@meb
This is very cool! Would you mind sharing more about the process?
- potential applications in mind
- how you ingested the data
- various embeddings and other data wrangling done
- cloud vs local usage
- total cost to generate this
1 reply
0 recast
2 reactions

shoni.eth pfp
shoni.eth
@alexpaden
Sure! I see threads as the most granular useful unit of data for analysis, so I'll be shifting to the next building block: user-based content.

I get all FC data from Neynar's parquet service, then use a set of custom Postgres procedures/tables to format/query/generate embeddings (it's a near-live pipeline).

The embeddings are float32; I use pgvector, which doesn't support int8, or I would have done that. You can easily convert the float32 to float16 for storage/query savings.

I run everything on my Mac Studio, but the goal is to create advanced live APIs that enable other builders/remove custom memory from the stack.

The embeddings themselves cost probably $50, AI tools for building the procedures were probably $200, Neynar parquet is about $300/mo, and the Mac Studio was $10k (it runs the Postgres server/pipeline/etc.). The whole nonspam dataset is about 2B tokens.
1 reply
0 recast
1 reaction
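The float32-to-float16 conversion mentioned above is a one-line cast. A minimal sketch (numpy assumed; the array shape is illustrative, not from the actual dataset) showing that it halves storage while keeping the values close:

```python
# Sketch: down-cast float32 embeddings to float16 for storage/query savings.
import numpy as np

rng = np.random.default_rng(1)
emb_f32 = rng.standard_normal((10_000, 512)).astype(np.float32)

emb_f16 = emb_f32.astype(np.float16)  # exactly half the bytes

# Round-trip error is small relative to unit-scale embedding values,
# so cosine rankings are typically preserved well at 512 dims.
max_err = float(np.max(np.abs(emb_f32 - emb_f16.astype(np.float32))))
```

Note that float16 trades ~3 decimal digits of precision for the 2x savings; int8 would go further, but as noted above pgvector does not support it.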