christopher
@christopher
Launching Trek, an open source web content extraction library built in Rust! A core part of our work is understanding the content behind any link on the Internet. That also means extracting metadata quickly so users can get context, e.g. in a feed. To do this, we're building on @kepano's work on Defuddle, and then some. Trek also compiles to WASM, so anyone can extract clean, decluttered content data in a TS/JS project. It leverages lol_html from Cloudflare to stream HTML in for content extraction, instead of building the entire page and "trekking" the DOM as a normal scraper would. This makes it really fast and, more importantly, memory efficient. Check out the playground here: https://officialunofficial.github.io/trek/ Docs: https://github.com/officialunofficial/trek
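To make the streaming idea above concrete, here is a minimal, stdlib-only sketch of pulling a page title out of HTML as it arrives in chunks, without ever holding the whole page or a DOM in memory. This is not Trek's code and not the lol_html API; `TitleScanner` and `extract_title` are hypothetical names, and the matching is deliberately naive (case-sensitive, no attributes on `<title>`).

```rust
const OPEN: &str = "<title>";
const CLOSE: &str = "</title>";

/// Toy streaming scanner (hypothetical, for illustration only).
/// It keeps just a small window of input instead of the full document.
struct TitleScanner {
    window: String,        // unprocessed tail of the input seen so far
    in_title: bool,        // true once "<title>" has been consumed
    title: Option<String>, // set when "</title>" is reached
}

impl TitleScanner {
    fn new() -> Self {
        Self { window: String::new(), in_title: false, title: None }
    }

    /// Feed the next chunk of HTML. Chunks may split a tag anywhere.
    fn write(&mut self, chunk: &str) {
        if self.title.is_some() {
            return; // already done, ignore the rest of the stream
        }
        self.window.push_str(chunk);
        loop {
            if !self.in_title {
                match self.window.find(OPEN) {
                    Some(i) => {
                        // Drop everything up to and including "<title>".
                        self.window.drain(..i + OPEN.len());
                        self.in_title = true;
                    }
                    None => {
                        // Keep only a tail long enough to hold a split "<title>".
                        let mut cut = self.window.len().saturating_sub(OPEN.len() - 1);
                        while !self.window.is_char_boundary(cut) {
                            cut -= 1;
                        }
                        self.window.drain(..cut);
                        return;
                    }
                }
            } else {
                match self.window.find(CLOSE) {
                    Some(i) => {
                        self.title = Some(self.window[..i].trim().to_string());
                        self.window.clear();
                    }
                    None => {} // title text is still arriving; keep buffering
                }
                return;
            }
        }
    }
}

/// Convenience wrapper: stream chunks through the scanner, stopping early on a match.
fn extract_title(chunks: &[&str]) -> Option<String> {
    let mut scanner = TitleScanner::new();
    for chunk in chunks {
        scanner.write(chunk);
        if scanner.title.is_some() {
            break;
        }
    }
    scanner.title
}
```

Memory use here is bounded by the length of the title plus a few bytes of lookback, no matter how large the page is. A real streaming rewriter handles far more (attributes, case, entities, selectors), but the memory argument is the same.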

kepano
@kepano
cool! is there anything you learned building this that could be incorporated into defuddle?

christopher
@christopher
I think having a config TOML with obvious, sensible defaults would help a lot. I noticed that Defuddle would show certain elements, like navigation or menus, when the thresholds weren't met during declutter. Other than that, not much! https://github.com/kepano/defuddle/blob/9677af23a3c8e7f14349c6c557e30f7179d667ca/src/scoring.ts#L328
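As a rough illustration of the "config with sensible defaults" suggestion, the sketch below models declutter settings as a Rust struct with a `Default` impl, which is the shape a TOML file would typically deserialize into. Everything here is an assumption: `DeclutterConfig`, its fields, the default values, and `should_remove` are hypothetical and do not reflect how Defuddle or Trek actually score or configure anything.

```rust
/// Hypothetical declutter settings; names and values are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct DeclutterConfig {
    /// Blocks scoring below this threshold are treated as clutter.
    min_content_score: f64,
    /// Tags removed regardless of score, so a missed threshold
    /// cannot let navigation or menus through.
    always_remove_tags: Vec<String>,
}

impl Default for DeclutterConfig {
    fn default() -> Self {
        Self {
            min_content_score: 0.0,
            always_remove_tags: vec!["nav".to_string(), "menu".to_string()],
        }
    }
}

/// Decide whether a block should be dropped during declutter.
fn should_remove(cfg: &DeclutterConfig, tag: &str, score: f64) -> bool {
    let forced = cfg
        .always_remove_tags
        .iter()
        .any(|t| t.eq_ignore_ascii_case(tag));
    forced || score < cfg.min_content_score
}
```

A caller who only wants to tighten the threshold could write `DeclutterConfig { min_content_score: 5.0, ..Default::default() }` and keep the rest of the defaults untouched.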