Mastodawn

Stefan Tilkov Nov 11, 2022

Back in 2010 #twitter launched their “t.co” link shortener. And I thought at the time how profoundly terrible that is. If the database and/or ability to look up t.co links ever changed, we would literally lose information. We will no longer know what URL was tweeted. It burns a huge hunk of internet history to the ground if we lose that. It was vain and shortsighted and stupid. Now in 2022, I wonder how many months we have left when we can resolve those links. And what happens to collective internet history when they’re done?

In another funny twist, what about trials or legal matters that hinge on the content of a tweet? And suddenly you can’t get the content of that tweet? Or where did that link point to?

Nobody saw this coming except for all of us who saw this coming.

kravse 🍂 Nov 11, 2022

@paco this is a really great point. The t.co links created a walled garden where the wall itself was hidden (aka not clearly a wall to everyone at the time), but really the same as Facebook or other sites that divide up the open internet.

𝖈0𝖗𝖊𝖉𝖚𝖒𝖕𝖊𝖉 🇸🇪 Nov 11, 2022

@paco good point. A thought that comes up:
does "waybackmachine" (archive.org) snapshot shorties?

𝖈0𝖗𝖊𝖉𝖚𝖒𝖕𝖊𝖉 🇸🇪 Nov 11, 2022

@paco uhmm forget that one, I think it's only domains n not sub, n that would be a mess to handle 😆😂
(Reminder to self, think before toot)

Eric Kobrin Nov 11, 2022

@paco @gsuberland It looks like both the shortened and full URLs are present in each user's downloadable archive. I don't know if it's exhaustive of all URLs you've seen, but it seems to list the ones I've clicked on and at least a few I hadn't.

As a bonus, it revealed that the link attached to an unsolicited nude in a message I ignored was actually a gdrive link. Phishing maybe?

Allison Nixon Nov 11, 2022

@paco At this rate the whole database will be hacked and leaked anyways, so Internet history will be preserved that way.

Nico Erfurth Nov 11, 2022

@paco So, it's time for a http://t.co crawler it seems

…might work for coffee… Nov 11, 2022

@masta @paco
It has been running for years.
https://wiki.archiveteam.org/index.php/URLTeam

URLTeam - Archiveteam

Alex Barredo 📉 Nov 11, 2022

@mwfc @masta @paco are we sure? I can't see "t.co" in the list https://tracker.archiveteam.org:1338/

URLTeam Tracker

…might work for coffee… Nov 11, 2022

@Barredo @masta @paco
You are right, I cannot find them here
https://tracker.archiveteam.org:1338/status

Will ask, I was kind of sure that t.co was on it. My fault.

URLTeam Status

…might work for coffee… Nov 11, 2022

@Barredo @masta @paco
It was a tag on an old urlteam upload to archive. Guess I look into a more recent file soonish. But first other stuff to do.

Alex Barredo 📉 Nov 11, 2022

@mwfc I could find something here, but it only crawls "1% of tweets", I'd guess from the firehose

https://archive.org/details/twitteroutlinks

Twitter Outlinks : Free Web : Free Download, Borrow and Streaming : Internet Archive

9Lukas5 🚂 🐧 Nov 11, 2022

@paco Hey all, maybe check this out 👌🏼
https://mathstodon.xyz/@timhutton/109316834651128246

Tim Hutton (@timhutton@mathstodon.xyz)

If you download your #Twitter archive it arrives wrapped as a static HTML page, which is not very useful for doing anything with, and worse: it requires the original account to be still active to do useful things like enlarge the images since they use t.co links. So here's a #Python script to convert a Twitter archive to #markdown or other formats: https://github.com/timhutton/twitter-archive-parser Now you can archive your tweets in any way you want.

Mathstodon

naught101 Nov 11, 2022

@paco does archive.org archive them?

…might work for coffee… Nov 11, 2022

@naught101 @paco
Yep, Archiveteam does it
https://wiki.archiveteam.org/index.php/URLTeam
https://archive.org/search.php?query=subject%253A%22urlteam%22

I do not know which ones are just Twitter ones. :)

rabimba Nov 11, 2022

@paco well, this happens to all link shorteners. Especially the ones which are "home grown hosted in my own server"

And also the famous goo.gl one which I loved once upon a time

Ian Fogg Nov 11, 2022

@paco t.co link expansion issues are another reason to process exported Twitter account archives.
Is there a tool that will auto expand links now while Twitter is still around?
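
One way to expand a short link by hand is to ask the shortener for the link without following the redirect, then read the `Location` header of the answer. The following is a minimal sketch under assumptions, not a description of how t.co is known to behave: it assumes the service answers a plain `HEAD` request with an HTTP redirect status, and the function name `resolve_once` and its `connection_factory` parameter are made up for illustration.

```python
import http.client
from urllib.parse import urlsplit

# Status codes that carry a redirect target in the Location header.
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def resolve_once(short_url, timeout=10.0,
                 connection_factory=http.client.HTTPSConnection):
    """Return the redirect target of short_url, or None if there is none.

    Only one hop is inspected; the target itself is never fetched.
    connection_factory is a hypothetical hook so the function can be
    exercised without touching the network.
    """
    parts = urlsplit(short_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    conn = connection_factory(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in REDIRECT_STATUSES:
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

Calling `resolve_once("https://t.co/...")` for every short link in an export and storing the results next to the tweets would keep the mapping even if the shortener later stops answering. Anyone running this over many links should add pauses between requests.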

ShrikeTron🔠💉x5 Nov 11, 2022

@paco It'll be like the Kakaotalk outage, but site stays down/gone.

Julie Webgirl Nov 11, 2022

@paco The Internet is forever, except...

James Green Nov 11, 2022

@paco and is this not already true for some of the early functionality. maybe twtimg and some other stuff?

Monika Viktorova Nov 11, 2022

@paco doesn’t the waybackmachine still scrape the actual pages though?

Tane Piper ⁂ Nov 11, 2022

@paco @craignicol

Yes https://tane.codes/@tanepiper/109304110500291493

Tane Piper (@tanepiper@tane.codes)

Think of all the dead t.co links that are about to just never resolve. We really did not think about a sustainable web future.

Tane's Fedeverse

Peter Yang, MD - Yaz Nov 11, 2022

@paco archiving the data would be an ambitious project for the good people at https://www.reddit.com/r/DataHoarder/

It's A Digital Disease! • r/DataHoarder

This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

reddit

Andrew Leahey Nov 11, 2022

@paco Well said. Relies entirely on projects like the Internet Archive and hope -- hope that it scoops up all the t.co links before they go dark. The right way to do link shortening would be to open the database for archiving/backup by third parties.

Owe Jessen Nov 11, 2022

@paco Here is a py script to change the twitter archive to markdown with full size images and the original links. No global solution, but better than nothing, I suppose. https://github.com/timhutton/twitter-archive-parser

GitHub - timhutton/twitter-archive-parser: Python code to parse a Twitter archive and output in various ways

Python code to parse a Twitter archive and output in various ways - timhutton/twitter-archive-parser

GitHub

…might work for coffee… Nov 11, 2022

Run a warrior :)
https://wiki.archiveteam.org/index.php/URLTeam
https://tracker.archiveteam.org:1338/

Gary McGraw Nov 11, 2022

@paco so much of the net is ephemeral. Even some of my published writings have disappeared.

Matthew Malthouse Nov 11, 2022

Saw this mentioned a couple of days ago. A sort of solution is possible, albeit a bit hand crafted.

Download Twitter data: Settings > Your account > Download an archive of your data

Then parse it to resolve all those shortened URLs with this tool: https://github.com/timhutton/twitter-archive-parser

Clunky (and I had to edit the script myself to fix a thing) but at least it gives you the needed URL information.

nobletrout Nov 11, 2022

@paco @matthew_d_green don’t give musk an out claiming the need for government support.

Mark Nov 11, 2022

@paco if a Twitter user downloads an archive of their data, the resulting files contain the original URL linked to the t.co URL. It’s not easy but it is possible
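
The pairing of t.co link and original URL can be pulled out of the export with a few lines of Python. This is a minimal sketch and not the archive's documented format: it assumes the tweets sit in a file such as `data/tweets.js` that begins with a `window.YTD.tweets.part0 = ` assignment followed by a JSON array, and that each tweet has `entities.urls` entries with a `url` (the t.co link) and an `expanded_url` field. The function names are hypothetical.

```python
import json


def load_ytd_js(text):
    """Parse an archive .js file by dropping the leading JS assignment.

    Assumes the file looks like `window.YTD.<name>.part0 = [ ...JSON... ]`.
    """
    _prefix, _sep, payload = text.partition("=")
    return json.loads(payload)


def build_tco_map(tweets):
    """Map each shortened URL to the expanded URL stored beside it."""
    mapping = {}
    for item in tweets:
        # Entries are assumed to be wrapped as {"tweet": {...}}.
        tweet = item.get("tweet", item)
        for entry in tweet.get("entities", {}).get("urls", []):
            short = entry.get("url")
            full = entry.get("expanded_url")
            if short and full:
                mapping[short] = full
    return mapping
```

Reading the file with `pathlib.Path("data/tweets.js").read_text(encoding="utf-8")` and passing the result through both functions gives a dictionary that can be saved as JSON or CSV, independent of whether t.co keeps resolving.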

MattHawkins Nov 11, 2022

@paco not too dissimilar to systems with "persistent" links such as SharePoint. When they break it's impossible to work out what file was linked to. No domain. No path. No filename.

Jade in the North End Nov 11, 2022

@paco if a trial or legal matter hinges on any digital media, there's a hard copy and multiple backups to ensure it's not lost or deleted. But that raises a good point about coming legal actions against twitter itself and the obligation to preserve evidence.

Yoan Nov 11, 2022

@paco
Do you know why crawlers of the Internet Wayback Machine do not follow these t.co links?

Here's a link from a tweet from @UN: https://web.archive.org/web/20221108031258/https://t.co/phcmnNyYG9

Andy Ellis Nov 11, 2022

@paco for legal uses, perma.cc exists as a persistent URL redirector that keeps a cached copy of the original resource; it is the best way to cite a URL in a legal case.

Mrs Mouse at work Nov 11, 2022

@paco so, this made me think 1) Ok, bit-rot and link rot are things that are fairly well documented, and someone's gotta have a solution,

then 2) how does archive.org deal with this?

So, firstly I picked a random non-offensive twitter user to stalk, so I chose Ray Redacted because he rocks. And found a post with a link, in this case a retweet:

1/n

Mrs Mouse at work Nov 11, 2022

This clearly has the text:
More: http://tcrn.ch/2WbiHWo
but the link that's embedded:
https://web.archive.org/web/20200317032935mp_/https://t.co/d7JZZL2k1j

So, that archive.org link goes nowhere; it wasn't saved, but it DOES still show that original URL.
2/3
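
The Wayback Machine links quoted in this thread carry the captured URL inside their own path, so the short link can be recovered from the archive link as plain text even when the capture is missing. A minimal sketch, assuming the `https://web.archive.org/web/<timestamp><optional modifier>/<original URL>` layout seen in the two links above; the `Capture` type and `parse_wayback_url` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Timestamp digits, an optional two-letter modifier such as "mp_",
# then the original URL exactly as it was requested.
_WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<ts>\d{4,14})(?P<mod>[a-z]{2}_)?/(?P<orig>.+)$"
)


class Capture(NamedTuple):
    timestamp: str
    modifier: Optional[str]
    original_url: str


def parse_wayback_url(url):
    """Split a Wayback Machine URL into its parts, or return None."""
    match = _WAYBACK.match(url)
    if match is None:
        return None
    return Capture(match["ts"], match["mod"], match["orig"])
```

This only recovers the t.co address, not its target; the target is known only if the redirect itself was captured or the mapping was saved some other way.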

Mrs Mouse at work Nov 11, 2022

@paco
Conclusion: Donate to archive.org, use it liberally!

So the fun part is the question about the courts, so that's a question for #lawyers - how does this impact the Best-Evidence rule?
I personally think an indifferent third-party service (archive.org) that archives them as part of their normal business practices would be admissible. If not, hire a PI to "research", print, and testify this is what he printed. IDK. #IANAL, but I did a report on the admissibility of logs for my CIO

@paco this sounds like a call to action for archivists, like we saw for geocities. 🤔

Eddie Coldrick 💻 Nov 11, 2022

@paco This is why I have never liked people using personal image hosting servers and putting those links online. You lose the pictures and have little/no context to the text.

@paco shredding evidence is a white-collar criminal favorite.

acroll Nov 11, 2022

@paco nearly half of all links referenced in Supreme Court filings had disappeared. Back in 2013.

https://www.theverge.com/2013/9/23/4763646/half-of-supreme-court-web-citations-have-changed-or-disappeared

One of the big, ignored differences between the physical and the digital world is pretty simple: It takes energy to destroy atoms; it takes energy to preserve bits. Ephemerality is the default online, but we try to apply one set of laws across both realms.

Supreme Court citations are falling apart as web links begin to change and disappear

Half of all the court's web citations have succumbed to link rot

The Verge

SlightlyCyberpunk Nov 11, 2022

@paco but none of that is unique to Twitter. Websites get restructured periodically (and also go offline entirely), so even if you have the non-shortened URL a few months or years later there's a decent chance that it won't work anymore. And no legal proceeding is going to be relying on links, they might include the link but they're going to be primarily focused on screenshots for precisely this reason -- links change; they always have and always will.

WndlB Nov 11, 2022

@paco Not staking my life on it, but I think, for the lawyers, there would be some ‘best evidence’ workarounds, particularly if there had been a cratering.

Daniel Lo Nigro 🇦🇺 Nov 11, 2022

@paco Archive Team are archiving URL shortener links for this exact reason. https://wiki.archiveteam.org/index.php/URLTeam. Anyone can help their efforts by running one of their VMs or Docker images - I've got two spare VPSes running it.

Nemo_bis 🌈 Nov 11, 2022

@paco Join #URLTeam! #ArchiveTeam
https://wiki.archiveteam.org/index.php/URLTeam

Rob Landley Nov 11, 2022

@paco in theory a raw feed of Twitter has been going to the Library of Congress for some years. https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-archive-at-the-library-of-congress-2/

Update on the Twitter Archive at the Library of Congress | Timeless

An update on the status of the Twitter archives at the Library of Congress.

The Library of Congress

Max Nov 11, 2022

@paco I tend to express my lack of consent at each cookie wall I encounter, which often means pages don’t load the various third party nonsense.

Not only does that result in lighter pages, it has also prepared me for what a lot of folks are going to see eventually.

An example:

@paco I'm sure others have pointed out that the Internet Archive is on the case

Tobias Fiebig Nov 12, 2022

@paco Same with all those email sec solutions replacing URLs. If the original URL is not a part of the replaced link... things get difficult.

Simon Zerafa Nov 12, 2022

@paco I'm parsing my Twitter Archive now to remove those t.co links while I can.

I need to do the same with bit.ly links before that magically disappears one day 😕🤷‍♂️

gabe is not ghost Nov 12, 2022

@paco if you've not seen https://github.com/timhutton/twitter-archive-parser
