Back in 2010 #twitter launched their “t.co” link shortener. And I thought at the time how profoundly terrible that is. If the database and/or the ability to look up t.co links ever changed, we would literally lose information. We would no longer know what URL was tweeted. It burns a huge hunk of internet history to the ground if we lose that. It was vain and shortsighted and stupid. Now in 2022, I wonder how many months we have left when we can resolve those links. And what happens to collective internet history when they’re done?

In another funny twist, what about trials or legal matters that hinge on the content of a tweet? And suddenly you can’t get the content of that tweet? Or where did that link point to?

Nobody saw this coming except for all of us who saw this coming.

@paco this is a really great point. The t.co links created a walled garden where the wall itself was hidden (aka not clearly a wall to everyone at the time), but really the same as Facebook or other sites that divide up the open internet.
@paco good point, thought that comes up:
does "waybackmachine" (archive.org) snapshot shorties ?
@paco uhmm forget that one, I think its only domains n not sub, n that would be a mess to handle 😆😂
(Reminder to self, think before toot)

@paco @gsuberland It looks like both the shortened and full urls are present in each user's downloadable archive. I don't know if it's exhaustive of all urls you've seen, but it seems to list the ones I've clicked on and at least a few I hadn't.

As a bonus, it revealed that the link attached to an unsolicited nude in a message I ignored was actually a gdrive link. Phishing maybe?

@paco At this rate the whole database will be hacked and leaked anyways, so Internet history will be preserved that way.
@paco So, it's time for a http://t.co crawler it seems

URLTeam - Archiveteam

@mwfc @masta @paco are we sure? I can't see "t.co" in the list https://tracker.archiveteam.org:1338/
URLTeam Tracker

@Barredo @masta @paco
You are right I cannot find them here
https://tracker.archiveteam.org:1338/status

Will ask, I was kind of sure that t.co was on it. My fault.

URLTeam Status

@Barredo @masta @paco
It was a tag on an old urlteam upload to archive. Guess I look into a more recent file soonish. But first other stuff to do.

@mwfc I could find something here, but only crawls "1% of tweet", i'd guess from the firehose

https://archive.org/details/twitteroutlinks

Twitter Outlinks : Free Web : Free Download, Borrow and Streaming : Internet Archive

@paco Hey all, maybe check this out 👌🏼
https://mathstodon.xyz/@timhutton/109316834651128246
Tim Hutton (@timhutton@mathstodon.xyz)

If you download your #Twitter archive it arrives wrapped as a static HTML page, which is not very useful for doing anything with, and worse: it requires the original account to be still active to do useful things like enlarge the images since they use t.co links. So here's a #Python script to convert a Twitter archive to #markdown or other formats: https://github.com/timhutton/twitter-archive-parser Now you can archive your tweets in any way you want.

Mathstodon

@paco well this happens to all link shorteners. Especially the ones which are "home grown hosted in my own server"

And also the famous goo.gl one which I loved once upon a time

@paco t.co link expansion issues is another reason to process exported twitter account archives.
Is there a tool that will auto expand links now while Twitter is still around?
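One way to expand a short link while the service still answers is to ask for it without following the redirect and read the `Location` header. The sketch below is only that, a sketch: it assumes the shortener replies to a `HEAD` request with an HTTP redirect status, which is not guaranteed for every client or every shortener, and the function name `resolve_one_hop` is made up for illustration.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_one_hop(short_url, timeout=10.0):
    """Return the redirect target of short_url, or None if it does not redirect.

    Only a single hop is resolved; the target may itself be another short link.
    """
    parts = urlsplit(short_url)
    # Pick the connection class from the URL scheme.
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # HEAD avoids downloading a body; http.client never follows redirects.
        conn.request("HEAD", path)
        resp = conn.getresponse()
        if resp.status in REDIRECT_CODES:
            return resp.getheader("Location")
        return None
    finally:
        conn.close()
```

Calling `resolve_one_hop("https://t.co/phcmnNyYG9")` would return whatever target the service reports at that moment, so results are worth saving alongside the short link rather than looked up again later.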
@paco It'll be like the Kakaotalk outage, but site stays down/gone.
@paco The Internet is forever, except...
@paco and is this not already true for some of the early functionality. maybe twtimg and some other stuff?
@paco doesn’t the waybackmachine still scrape the actual pages though?
Tane Piper (@[email protected])

Think of all the dead t.co links that are about to just never resolve. We really did not think about a sustainable web future.

Tane's Fedeverse
@paco archiving the data would be an ambitious project for the good people at https://www.reddit.com/r/DataHoarder/
It's A Digital Disease! • r/DataHoarder

This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

reddit
@paco Well said. Relies entirely on projects like the Internet Archive and hope -- hope that it scoops up all the t.co links before they go dark. The right way to do link shortening would be to open the database for archiving/backup by third parties.
@paco Here is a py script to change the twitter archive to markdown with full size images and the original links. No global solution, but better than nothing, I suppose. https://github.com/timhutton/twitter-archive-parser
GitHub - timhutton/twitter-archive-parser: Python code to parse a Twitter archive and output in various ways

Python code to parse a Twitter archive and output in various ways - timhutton/twitter-archive-parser

GitHub
@paco so much of the net is ephemeral. Even some of my published writings have disappeared.

@paco

Saw this mentioned a couple of days ago. A sort of solution is possible, albeit a bit hand crafted.

Download Twitter data: Settings > Your account > Download an archive of your data

Then parse it to resolve all those shortened URLs with this tool: https://github.com/timhutton/twitter-archive-parser

Clunky (and I had to edit the script myself to fix a thing) but at least it gives you the needed URL information.

@paco @matthew_d_green don’t give musk an out claiming the need for government support.
@paco if a Twitter user downloads an archive of their data, the resulting files contain the original URL linked to the t.co URL. It’s not easy but it is possible
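For anyone who wants that mapping without a full converter, here is a minimal sketch of pulling short-to-original pairs out of an archive file. It rests on assumptions about the archive layout that are not stated in this thread: that the tweet data is a JavaScript file consisting of an assignment such as `window.YTD.tweets.part0 = [...]` followed by a JSON array, and that each tweet carries `entities.urls` entries with `url` and `expanded_url` fields. The helper names are hypothetical.

```python
import json


def load_ytd_js(text):
    """Parse an archive .js file by skipping the leading assignment.

    Assumption: everything from the first '[' onward is a JSON array.
    """
    start = text.index("[")
    return json.loads(text[start:])


def short_to_expanded(tweets):
    """Map each shortened URL to the original URL recorded beside it."""
    mapping = {}
    for item in tweets:
        # Assumption: each entry wraps the tweet under a "tweet" key;
        # fall back to the entry itself if it does not.
        tweet = item.get("tweet", item)
        for link in tweet.get("entities", {}).get("urls", []):
            short, full = link.get("url"), link.get("expanded_url")
            if short and full:
                mapping[short] = full
    return mapping
```

A typical use would be reading the tweets file from the unzipped archive with `open(path, encoding="utf-8").read()`, passing the text through `load_ytd_js`, and saving the result of `short_to_expanded` as JSON. The exact file name inside the archive varies, so check the unzipped folder rather than trusting a hard-coded path.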
@paco not too dissimilar to systems with "persistent" links such as SharePoint. When they break it's impossible to work out what file was linked to. No domain. No path. No filename.
@paco if a trial or legal matter hinges on any digital media, there's a hard copy and multiple backups to ensure it's not lost or deleted. But that raises a good point about coming legal actions against twitter itself and the obligation to preserve evidence.

@paco
Do you know why crawlers of the Internet Wayback Machine do not follow these t.co links?

Here's a link from a tweet from @UN: https://web.archive.org/web/20221108031258/https://t.co/phcmnNyYG9
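The Wayback link in the post above packs two pieces of information into one URL: a 14-digit capture timestamp and the original address. A small sketch of splitting them apart follows. The pattern is inferred from the links quoted in this thread, including the optional two-letter suffix such as `mp_` after the timestamp, and is an assumption rather than a documented format; `parse_wayback_url` is a made-up name.

```python
import re
from datetime import datetime

# Assumed shape: /web/<YYYYMMDDhhmmss><optional two-letter modifier + "_">/<original URL>
_WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z]{2}_)?/(.+)$"
)


def parse_wayback_url(url):
    """Return (capture_time, original_url), or None if the URL does not match."""
    match = _WAYBACK.match(url)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)
```

This only recovers the short link that was captured, not where it pointed. Whether the snapshot itself holds the redirect target depends on what the crawler stored.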

@paco for legal uses, perma.cc exists as a persistent URL redirector that also keeps a cached copy of the original resource; it is the best way to cite a URL in a legal case.

@paco so, this made me think 1) Ok, bit rot and link rot are things that are fairly well documented, and someone's gotta have a solution,

then 2) how does archive.org deal with this?

So, firstly I picked a random non-offensive twitter user to stalk, so I chose Ray Redacted because he rocks. And found a post with a link, in this case a retweet:

1/n

@paco

This clearly has the text:
More: http://tcrn.ch/2WbiHWo
but the link that's embedded:
https://web.archive.org/web/20200317032935mp_/https://t.co/d7JZZL2k1j

So, that archive.org link goes nowhere, it wasn't saved but it DOES still show that original URL.
2/3


@paco
Conclusion: Donate to archive.org, use it liberally!

So the fun part is the question about the courts, so that's a question for #lawyers - how does this impact the Best-Evidence rule?
I personally think an indifferent third-party service (archive.org) that archives them as part of their normal business practices would be admissible. If not, hire a PI to "research", print, and testify this is what he printed. IDK. #IANAL, but I did a report on the admissibility of logs for my CIO

@paco this sounds like a call to action for archivists, like we saw for geocities. 🤔
@paco This is why I have never liked people using personal image hosting servers and putting those links online. You lose the pictures and have little/no context to the text.
@paco shredding evidence is a white-collar criminal favorite.

@paco nearly half of all links referenced in Supreme Court filings had disappeared. Back in 2013.

https://www.theverge.com/2013/9/23/4763646/half-of-supreme-court-web-citations-have-changed-or-disappeared

One of the big, ignored differences between the physical and the digital world is pretty simple: It takes energy to destroy atoms; it takes energy to preserve bits. Ephemerality is the default online, but we try to apply one set of laws across both realms.

Supreme Court citations are falling apart as web links begin to change and disappear

Half of all the court's web citations have succumbed to link rot

The Verge
@paco but none of that is unique to Twitter. Websites get restructured periodically (and also go offline entirely), so even if you have the non-shortened URL a few months or years later there's a decent chance that it won't work anymore. And no legal proceeding is going to be relying on links, they might include the link but they're going to be primarily focused on screenshots for precisely this reason -- links change; they always have and always will.
@paco Not staking my life on it, but I think, for the lawyers, there would be some ‘best evidence’ workarounds, particularly if there had been a cratering.
@paco Archive Team are archiving URL shortener links for this exact reason. https://wiki.archiveteam.org/index.php/URLTeam. Anyone can help their efforts by running one of their VMs or Docker images - I've got two spare VPSes running it.

@paco in theory a raw feed of Twitter has been going to the library of Congress for some years. https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-archive-at-the-library-of-congress-2/
Update on the Twitter Archive at the Library of Congress | Timeless

An update on the status of the Twitter archives at the Library of Congress.

The Library of Congress

@paco I tend to express my lack of consent at each cookie wall I encounter, which often means pages don’t load the various third party nonsense.

Not only does that result in lighter pages, it has also prepared me for what a lot of folks are going to see eventually.

An example:

@paco I'm sure others have pointed out that the internet archive is on the case
@paco Same with all those email sec solutions replacing URLs. If the original URL is not a part of the replaced link... things get difficult.

@paco I'm parsing my Twitter Archive now to remove those t.co links while I can.

I need to do the same with bit.ly links before that magically disappears one day 😕🤷‍♂️
