Dan Romero
@dwr
Wonder if ChatGPT will be the last major model to be trained on the open web? Will robots.txt files start specifically disallowing crawling by LLMs unless the site gets paid for the data?
9 replies
0 recast
0 reaction
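
For illustration, here is a minimal sketch of what "robots.txt disallowing LLM crawlers" could look like, checked with Python's standard `urllib.robotparser`. The user-agent names `ExampleLLMBot` and `SomeSearchBot` and the example.com URL are hypothetical, made up for this sketch; no real crawler is implied.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (invented) LLM crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/post/1"
llm_allowed = is_allowed(ROBOTS_TXT, "ExampleLLMBot", page)
search_allowed = is_allowed(ROBOTS_TXT, "SomeSearchBot", page)
print(llm_allowed, search_allowed)
```

Under these assumed rules the invented LLM crawler is refused while other agents are allowed. Nothing here covers the "unless getting paid" part; robots.txt has no field for payment terms.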

Venkatesh Rao ☀️
@vgr
I doubt it. We’re at the start of an arms race between training and membership inference algorithms. https://arxiv.org/abs/2301.09956 Even if Western majors respect regulatory-type regimes and robots.txt directives, many others won’t. The only defense is encryption, not regulation.
1 reply
0 recast
0 reaction

0xbyron
@byron
I'm curious what the law is around crawlers that disregard robots.txt and post mirrors of content.
1 reply
0 recast
0 reaction

Shashank
@0xshash
Might be interesting if ChatGPT could include citations in its results, but it might become more like Google at that point.
0 reply
0 recast
0 reaction

Adam Baybutt
@baybutt
How do LLMs incentivize users to give feedback on answer quality? Offer a fee? But then users just maximize the number of feedback submissions. Offer a token for ~shared revenue? The goal is to incentivize credible feedback.
0 reply
0 recast
0 reaction

phil
@phil
I don’t think so. If we continue to see model sizes increase, I would expect GPT-4 and GPT-5 to also be trained on a similar corpus, with better results. What ~might~ happen is that new webpages have protection against this kind of scraping. That is hard to do retroactively, since the data is probably already cached.
0 reply
0 recast
0 reaction

🎩 MxVoid 🎩
@mxvoid
Could be. Microsoft is already being sued over Copilot; Stability AI, Midjourney, and DeviantArt are being sued over Stable Diffusion; it’s just a matter of time before OpenAI gets sued for their products, too. When the lawsuits start flying, so do the CYA measures.
1 reply
0 recast
0 reaction

William Saar
@saarw
If AIs can generate enough value, it might be worth paying armies of Mechanical Turk-style workers to manually visit and rewrite websites for copyright-approved training. Facts and ideas can't be copyrighted, only particular expression.
1 reply
0 recast
0 reaction

Justin Hunter
@polluterofminds
Aren’t robots.txt files just suggestions? Any crawler can ignore those files if it wants to, and Google often does, IIRC.
0 reply
0 recast
0 reaction
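
The "just suggestions" point can be sketched in a few lines: the robots.txt check runs inside the crawler, so honoring it is the crawler's own choice. The function name `should_fetch`, the `respect_robots` flag, the agent name `HypotheticalBot`, and the rules below are all assumptions made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every agent is asked to stay out of /private/.
RULES = """\
User-agent: *
Disallow: /private/
"""


def should_fetch(robots_txt: str, user_agent: str, url: str,
                 respect_robots: bool = True) -> bool:
    """Decide whether this crawler will request url.

    The decision is made entirely on the client side; the server
    never sees or enforces it.
    """
    if not respect_robots:
        # A crawler that skips the check simply proceeds.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/private/page"
polite = should_fetch(RULES, "HypotheticalBot", url)
rude = should_fetch(RULES, "HypotheticalBot", url, respect_robots=False)
print(polite, rude)
```

A well-behaved crawler skips the page and a non-compliant one fetches it anyway. Actually blocking the second kind takes server-side measures such as authentication or rate limiting, which this sketch does not cover.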

Heath
@hackley01
First, I’m impressed by the thoughtfulness of your responses; very bullish on what you’re building here. Second, I think the knee-jerk reactions will settle down.
0 reply
0 recast
0 reaction