LapisRising πŸ‘Œ

4 Followers
71 Following
57 Posts
I like good coffee, cycling and software engineering. Preferably in that order after I wake up.
@molly0xfff Update hell is bad on one machine, but what about 2 or 3? Many people have a work laptop and a home laptop. Add a beefy PC tower and you are updating things all the time :/

@artemesia @carnage4life Yes, they own their content. If they don't want their data to be scraped, they can use a wildcard rule in robots.txt and prohibit all bots from doing so.

It is their content and they should decide who gets access.

The internet is open by default, so the default should be "everybody can use the content".

Do you have another opinion?
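
For reference, the wildcard case mentioned above is just a two-line robots.txt served at the site root (example.com stands in for any domain):

# Block every crawler from every path
User-agent: *
Disallow: /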

@artemesia @carnage4life I would prefer to have categories such as "SearchEngine", "DataModel", "NewsAggregator" and the like.

robots.txt was not designed for that, but reinterpreting the User-agent value would technically allow it, similar to using multiple CSS classes on one element back in the day.
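
Purely as a sketch of that idea, a robots.txt reinterpreted this way could look like the block below. The category tokens are hypothetical; no crawler recognizes them today:

# Hypothetical category tokens, not part of any standard
User-agent: SearchEngine
Disallow:

User-agent: NewsAggregator
Disallow: /archive/

User-agent: DataModel
Disallow: /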

@artemesia @carnage4life
You can match all with * or name specific agents like Googlebot.

I believe the whitelisting and blacklisting must be done by the content owner, so robots.txt is the right place for it.

I also believe that we need categories instead of specific agents, otherwise we get a closed web very fast. Why should ChatGPT be allowed to scrape and Gemini not?

For "deals" in B2B APIs are the correct approach. They are more efficient and accurate.

https://moz.com/learn/seo/robotstxt

What Is A Robots.txt File? Best Practices For Robot.txt Syntax

Robots.txt is a text file webmasters create to instruct robots (typically search engine robots) how to crawl & index pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content,…

Moz
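
To make the whitelisting concrete: Python's standard library can already evaluate such rules via urllib.robotparser. The rules and the "SomeLLMBot" agent name below are made up for illustration:

import urllib.robotparser

# Made-up policy: Googlebot may crawl everything,
# every other agent is kept out of /articles/.
rules = """
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /articles/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/articles/story"))   # True
print(rp.can_fetch("SomeLLMBot", "https://example.com/articles/story"))  # False
print(rp.can_fetch("SomeLLMBot", "https://example.com/about"))           # True
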
@carnage4life technical solutions exist though. Take robots.txt, for example. Every site can specify which agents are allowed to crawl which parts of it. It would be easy to extend this to classifications of bots, for example search engines and GPTs.

The critical slope, where walking becomes more efficient than #cycling, is 13–15%.

That figure applies to recreational cyclists. Basically, as long as you are able to keep pedalling up the hill, cycling is more efficient.

https://pedalchile.com/blog/uphill

Walking Vs Biking Uphill | Pedal Chile

Road cycling: The "critical slope", or the incline where walking (or running) becomes more efficient than cycling, is 13–15% (recreational cyclists). Mountain biking: The "critical slope" is 8–11% before it becomes more efficient to walk than to continue pedaling.

Pedal Chile

@carnage4life We only have so many original documents and texts, and they have biases.

Filtering content for #LLMs only reduces the total amount of training data.

Until someone rewrites our history and knowledge within their worldview in its entirety, humanity will live in its bubble for better or worse.

@carnage4life the way out of this problem is fair legislation and a good judicial system.
They had better not be biased or corrupt. 🤞
@atineoSE @themarkup I believe the currently available tools are enough to regulate web scraping: robots.txt, IP-based blocking and throttling, and CAPTCHAs. Additionally, external APIs clearly define pricing and access control.
On the one hand, I don't want to see the prosecution of web scraping become commonplace. On the other hand, I don't like people circumventing paid APIs to get the data for free.
It boils down to the scale and to how the agent/scraper behaves.
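
As an aside, the IP-based throttling mentioned above is conceptually simple. A minimal Python sketch of a sliding-window limiter (the limits are made up; real setups usually do this at the reverse proxy or CDN):

import time
from collections import defaultdict, deque

MAX_REQUESTS = 60   # requests allowed per IP ...
WINDOW = 60.0       # ... within this many seconds

recent_hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(ip):
    """Return True if this request from `ip` should be served, False to throttle."""
    now = time.monotonic()
    hits = recent_hits[ip]
    # Drop timestamps that have fallen out of the window.
    while hits and now - hits[0] > WINDOW:
        hits.popleft()
    if len(hits) >= MAX_REQUESTS:
        return False
    hits.append(now)
    return True
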
@carnage4life I just hope that the same news sites don't use these AI models to generate their articles.
This would lead to paywalls for artificial content and new AI models that don't improve because they train on the data they produce.