Your periodic reminder that just because a URL is saved at archive.org doesn't mean it's going to stay there.

Last year, I wrote a series about proxy services marketed to cybercriminals, and that relied heavily on Archive.org links to document various connections. After my story ran, the person that those links concerned asked Archive to remove those links from their database, which they did. The person in question came back and said hey, what you said in your story is wrong because there's no supporting evidence and you must remove this. Archive.org confirmed they removed all of the pages at the request of the domain holder, and that was that.

If you stumble upon a page that is in archive.org and you want to make sure there is a record that won't be deleted at some point, consider saving the page to archive.today/archive.ph

Alternatively, of course, you could save the page locally, using something like Firefox's built-in full page screenshot (right click on page). Better yet, save the Archive.org pages you want locally.

@briankrebs

Indeed. The original historic website "buttcoin.com" was saved to archive.org, but the registration to the domain lapsed and was acquired by a domain name squatting company. The company demanded that archive.org erase the archived page and made it link to the company's sale page.

@JorgeStolfi @briankrebs that is particularly interesting/concerning. Does simply owning the domain name give one rights over content that was previously hosted on the same domain?
@aptmoniker @JorgeStolfi yes. So, consider that the Chief Twit could legally request that all Twitter links be removed. Not saying it's going to happen, but it most definitely could.
@briankrebs @JorgeStolfi I was not aware, very good to know - thank you. My hatred for cybersquatters has just multiplied, which I didn't realise was possible.
@aptmoniker @JorgeStolfi @briankrebs I very much doubt it, at least in that generality. The trouble is that you only need to find one major jurisdiction where it does, and that lets you enforce it. Archive.org and similar offerings have no interest in lengthy litigation.
@JorgeStolfi @briankrebs I’m surprised they could do this, since they weren’t the original owner of the page. Archive.org should have given them a big one-finger salute.

@PeteF @briankrebs

Yes, the archive should have retained the original content of that URL, no matter what. Even if the creators object. That is what it is supposed to do.

But while the domain squatters did not own the page, they legally owned the domain name (and apparently still do). I suppose their argument was that the sale value of the domain would be impacted if it had prior archived contents unrelated to the buyer.

@JorgeStolfi @PeteF @briankrebs if that was their argument, was it an appeal to an archive.org policy or to some law that binds archive.org?
@jasonskiles @JorgeStolfi @PeteF If you're the domain holder, you can legit request them to de-index your domain pages. And they will.
@briankrebs It's not apparent why a snapshot at archive.ph would be more durable in the face of take-down requests. Does it have to do with the laws of the Philippines (PH)?
@neirbowj @briankrebs I think the operator(s) largely ignore take down requests and there may be an intentional lack of transparency about who runs it. My question is in the same vein as John's. Do you consider the service reliable and trustworthy?
@neirbowj @briankrebs It's run by someone in a basement with no impressum.

@neirbowj @briankrebs "In the Philippines the laws are only a suggestion"*.

*cookie line under signature block on my first work email received in Philippines - - from the head of the national Internet peering exchange.

Truest cookie line I've seen.

@briankrebs What's the story with archive.today? Hadn't heard of it before. Do you know who runs it and what its policies are? It looks like it's only been used to save 284 websites so far.
@briankrebs The reliable way to save things is LOCALLY, and post them somewhere you control (then at least you'll be the one who has to argue with the people who want them taken down). If you want datestamped proof they existed at a given time, post hashes somewhere, and save *that* page to archive.org (and archive.today, etc).
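As a rough illustration of the "post hashes" idea, here is a minimal Python sketch that computes a SHA-256 digest of a locally saved page and formats a line that could be published somewhere. The function names and the layout of the manifest line are assumptions made for this example, not something specified in the thread.

```python
import hashlib
from pathlib import Path


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a saved page in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_line(url: str, saved_on: str, digest: str) -> str:
    """Hypothetical one-line record: digest, date saved, source URL."""
    return f"{digest}  {saved_on}  {url}"
```

Anyone who later receives the saved file can recompute the digest and compare it with the published line. The digest only shows that the file has not changed since the line was posted; it says nothing by itself about where the file came from.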
@JavierKolstad @briankrebs a service The Internet Archive could really use against this is an ability to query for signed hashes of any content they've ever archived. I expect no law would require them to take the hashes down?
@gpshead @briankrebs Yes, that would be excellent. Ben Trask ran a similar service (with help/encouragement from IA) at https://web.archive.org/web/20211207050148/https://hash-archive.org/ but it's down right now. There's another instance up at https://hash-archive.carlboettiger.info/

@briankrebs I wonder if the Internet Archive would be willing to host *hashes* of removed content.

@varx @briankrebs

Hashes on the blockchain?

@zeeclor @varx @briankrebs yes, that's definitely the way to go.

What's needed is a common, openly documented website archive format that can be used to produce snapshots from which to derive a hash.

Anyone wishing to notarize a given URL + content would have to first validate a previous notarization request by archiving the same URL and comparing the checksum. Only the hashes need to be stored on the blockchain, actual archives can be stored somewhere else (locally/online/IPFS/etc.)

@varx @briankrebs that would still present the issue of having a central point of attack (if you control the archive, you can change any hash at any time).

Only solution is a distributed blockchain.

@ligma @briankrebs Nothing so fancy needed. The attack scenario here is legal, not technical; if the Archive can defend holding hashes, then you yourself can host the hashed content.
@varx @briankrebs the problem you’re missing with having a centralized database of hashes is that they can simply be tampered with by anyone who has admin access. So you’re still trusting a very small group of people to safeguard the proof of authenticity for potentially valuable evidence, and that presents a very obvious attack vector for those with the means to execute it.
@varx @briankrebs even if they were going to use a blockchain internally to create an immutable record, without decentralized storage and distributed consensus it can still be tampered with, given enough hash power.
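To make the tampering argument concrete, here is a minimal Python sketch of a hash-chained log. It is not a blockchain and not anything the Internet Archive is known to run; the entry fields and function names are assumptions for illustration. Each entry's hash covers the previous entry's hash, so a change to an old entry is detectable by anyone who kept a copy of a later head hash.

```python
import hashlib
import json


GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, url: str, content_hash: str, timestamp: str) -> str:
    """Hash one log entry together with the hash of the entry before it."""
    payload = json.dumps([prev_hash, url, content_hash, timestamp]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append(log: list, url: str, content_hash: str, timestamp: str) -> str:
    """Append an entry and return the new head hash of the chain."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    new_hash = entry_hash(prev_hash, url, content_hash, timestamp)
    log.append({"url": url, "content_hash": content_hash,
                "timestamp": timestamp, "prev": prev_hash, "hash": new_hash})
    return new_hash


def verify(log: list, expected_head: str) -> bool:
    """Recompute every link and compare the result with a head hash
    that the verifier obtained independently of the log operator."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        prev_hash = entry_hash(prev_hash, entry["url"],
                               entry["content_hash"], entry["timestamp"])
        if entry["hash"] != prev_hash:
            return False
    return prev_hash == expected_head
```

An operator with admin access could still rewrite an entry and recompute every later hash, which produces an internally consistent log with a different head. That is only caught if outsiders hold earlier head hashes, which is roughly where the disagreement in this thread about central versus distributed storage comes from.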

@ligma I think you're trying to solve a different problem.

The problem I'm trying to solve is this: The Internet Archive is a party who I already trust; I trust them both as a timestamping service ("this document existed in this form at this time", essentially like a notary) and as competent administrators. They are presumably being legally coerced via copyright claims into removing content, not into *altering* it. The removal is the problem to solve.

Blockchains are unnecessarily complicated solutions to the deletion problem and still don't solve the fundamental issue of "who will be the notary". They can say "this hash existed at this time" but they can't say "this hash was a true representation of X".

@varx I understand that, but what if the Internet Archive WASN'T trustworthy?

What if you had a digital document that ended up becoming evidence in a high profile court case, and the defendant managed to scrub the hash that authenticates it from the Archive (via corruption, bribery, extortion, whatever)?

Your document would be considered a potential forgery and the defendant walks free. THAT's the problem I was trying to solve in addition to yours.

@varx

> Blockchains are unnecessarily complicated solutions to the deletion problem and still don't solve the fundamental issue of "who will be the notary".

Wrong. A blockchain IS a necessary ingredient to the deletion problem IF you want to make deletions as difficult as possible to discourage them from happening, whether purposeful or accidental.

A *distributed* blockchain additionally solves the problem of who will be notary: whomever gets to produce the next block.

@varx

> They can say "this hash existed at this time" but they can't say "this hash was a true representation of X".

Not true for the system I proposed. In my proposal, the block producer(s) would only see pairs of URLs and their alleged hash as input, and would have to download and checksum their content independently before signing a block. Hence the requirement for a common archive format if those URLs represent web pages.

Block producers would be incentivized to keep each other honest.
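The verification step described in this post could look roughly like the following Python sketch. `fetch_snapshot` is a hypothetical function standing in for "archive the URL into the agreed common format and return the bytes"; no such format or function exists in the thread, so it is passed in as a parameter.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple


def verify_claims(
    claims: Iterable[Tuple[str, str]],
    fetch_snapshot: Callable[[str], bytes],
) -> List[Tuple[str, str]]:
    """Return only the (url, hash) claims that the producer could reproduce.

    fetch_snapshot is a hypothetical function that archives a URL into the
    agreed common format and returns the snapshot bytes.
    """
    accepted = []
    for url, claimed_hash in claims:
        snapshot = fetch_snapshot(url)
        actual_hash = hashlib.sha256(snapshot).hexdigest()
        # Only claims whose hash matches the independent capture are kept.
        if actual_hash == claimed_hash.lower():
            accepted.append((url, actual_hash))
    return accepted
```

The sketch only works if two independent captures of the same page yield identical bytes, which is the reason given above for needing a common, deterministic archive format.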

@briankrebs Would it be legal to save a PDF of the site to your own storage and additionally include that PDF?

I am not familiar with publishing / republishing law on the Internet.

@chadgeidel @briankrebs legal? Sure. Same as taking a picture of anything you see in public. Unless you share it online in a way that may circumvent copyright laws, no one can prevent you from doing that.

The problem is when it comes to using it as evidence. Would you be able to prove, if required, that your digital document hasn't been tampered with, and did, in fact, exist at the alleged URL at the given time?

@chadgeidel @briankrebs I am not a lawyer but I believe that would be a copyright violation.
@briankrebs one can do a deep copy, no problem, and if you don't know how, start screen shotting or get your coding friend to do some screen scraping at the picture level, that stuff will work regardless of the underlying shenanigans

@briankrebs Hi Brian,

Something about your wording confused me. Just to make sure I understood correctly:

You wrote an article explaining how someone (call him S) was marketing proxy services to criminals. Your article relied on evidence at archive.org. S went to archive.org and asked them to remove the evidence and archive.org complied. S then contacted you to demand your article be removed for lack of evidence, and you were forced to comply.

@trindflo You got that right, except I wasn't "forced" to comply.
@briankrebs Thank you for practicing journalistic integrity and for letting us know.

@briankrebs @trindflo Well, there’s a record of them erasing the evidence right?

Seems like that could generate a new article.

@briankrebs Just asking if the date accessed in your citation could help here? At least that way when someone did not find the article there would be a logical bridge as to why.
@briankrebs @donmelton there's also a good self hosted solution here https://github.com/ArchiveBox/ArchiveBox

@selfagency @briankrebs @donmelton

This looks very interesting. Thanks for sharing it!

@briankrebs I wonder if archive.org could periodically create new subsets of the archived data and make them available as bittorrent files to avoid data disappearing like this. Then maintaining archived internet slices would be up to everyone else.

@briankrebs Following up: https://help.archive.org/help/archive-bittorrents/

> As of summer 2012, the Internet Archive is beta-testing the distribution of our public collections via the BitTorrent protocol

However:

> My Torrent download never completes?
> Most likely, you have an out-of-date Torrent for the Item you are trying to download.
> Torrents for Items on the Internet Archive can become obsolete when the Item the Torrent is for changes.

Perhaps the official torrents may be changed anytime if data is removed.

Make Offline Mirror of a Site using `wget`

Sometimes you want to create an offline copy of a site that you can take and view even without internet access. Using wget you can make such a copy easily: wget --mirror --convert-links …

Guy Rutenberg
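For reference, a small Python sketch that builds and optionally runs a `wget` mirroring command. The mirror and convert-links options come from the excerpt above; the other options (`--page-requisites`, `--adjust-extension`, `--no-parent`, `--directory-prefix`) are standard wget options added here as an assumption about what a usable offline copy needs, and the helper names are hypothetical.

```python
import subprocess
from typing import List


def build_mirror_command(url: str, dest_dir: str) -> List[str]:
    """Build the wget argument list for an offline mirror of url."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so they work offline
        "--page-requisites",   # also fetch images, CSS and scripts
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # stay below the starting directory
        "--directory-prefix", dest_dir,
        url,
    ]


def mirror_site(url: str, dest_dir: str) -> None:
    """Run wget; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_mirror_command(url, dest_dir), check=True)
```

wget can exit with a non-zero status when only some of the requested files fail to download, so a caller may prefer to inspect the exit code rather than treat every failure as fatal.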

@briankrebs

archive.today/archive.ph can suffer from the same issues as archive.org if they don't already. I wouldn't trust them for long-term storage.

When saving something into the Internet Archive that really must be available in the future, I think a good policy is to also save it locally.

SingleFile is a Firefox extension that saves the entire webpage locally in a single HTML file. Screenshots are also good, but we can't click links on those.

@andrade @briankrebs
How does one avoid repudiation of your saved artifacts? If they're on archive.org, we acknowledge that archive folks wouldn't modify them. If they're saved on your computer, I'm not going to trust that you didn't take liberties with the content.

We don't have good "distributed chain of custody" tools for this stuff that I know of.

@mterhar That's a good point. 🤔

Perhaps the Internet Archive could be part of this chain of custody by allowing users to download a signed version of a saved webpage for local storage.

@andrade @briankrebs perma.cc is another option with libraries backing it and so hopefully more meaningful longevity.
@andrade @briankrebs
One solution could be that archive.org retains a hash of the content when deleting. That way someone can prove that their local copy of the content really used to be there.
Save Page WE – Get this Extension for 🦊 Firefox (en-US)

Download Save Page WE for Firefox. Save a complete web page (as currently displayed) as a single HTML file that can be opened in any browser. Save a single page, multiple selected pages or a list of page URLs. Automate saving from command line.

@andrade @briankrebs FWIW… Folks running recent Apple software (on iPhone, iPad, and Mac) can access links and other text embedded in images and (paused) video with the Live Text feature.

It’s one of the most helpful new features around, and it’s ~unknown by many…

Live Text is integrated into Spotlight search, so text found in images shows up in local searches, too.

https://support.apple.com/en-asia/guide/iphone/-iph37fdd714b/ios


@HoffmanLabs @briankrebs I hadn't heard of Apple's Live Text before. I imagine they're applying OCR to these images. It's a useful feature but only works when links are visible in the image.

In web pages we may have links where the URL and the link text are different (think "Home" vs "infosec.exchange/home") or the link is incomplete / shortened (like the Apple link you posted). In this case OCR'ing a screenshot doesn't work because the link information is lost.

SingleFile allows clicking (or copying) these links like we do on normal web pages. It's one of the reasons I like the extension.

GitHub - bellingcat/auto-archiver: Automatically archive links to videos, images, and social media content from Google Sheets (and more).

@andrade @briankrebs Would some kind of hash/signature archive service be immune to the takedown issue? Service that captures the page to a single file, then signs it together with URL and time stamp. It could keep a public record of the hash and signature but discard the exported content once you’ve downloaded it. If you held onto the file yourself, you’d have a sort of third party verification that the web page really had the saved content at that time.
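A minimal Python sketch of such a receipt, under loud assumptions: the field names and functions are made up for this example, and HMAC is used only as a stand-in because the Python standard library has no public-key signatures. HMAC is symmetric, so only the holder of the key can check a receipt; a real service would more plausibly use a public-key signature so that anyone can verify.

```python
import hashlib
import hmac
import json


def make_receipt(secret_key: bytes, url: str, captured_at: str, content: bytes) -> dict:
    """Build a receipt binding URL, capture time and content hash together."""
    record = {
        "url": url,
        "captured_at": captured_at,
        "sha256": hashlib.sha256(content).hexdigest(),
    }
    # The tag covers all three fields, serialized in a fixed key order.
    message = json.dumps(record, sort_keys=True).encode("utf-8")
    record["tag"] = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return record


def check_receipt(secret_key: bytes, receipt: dict, content: bytes) -> bool:
    """Check that the tag is genuine and that content matches the recorded hash."""
    record = {k: receipt[k] for k in ("url", "captured_at", "sha256")}
    message = json.dumps(record, sort_keys=True).encode("utf-8")
    expected = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    tag_ok = hmac.compare_digest(expected, receipt["tag"])
    hash_ok = hashlib.sha256(content).hexdigest() == receipt["sha256"]
    return tag_ok and hash_ok
```

The service would keep or publish only the receipt and could discard the captured content; whoever holds the file presents it together with the receipt when the page's earlier contents are disputed.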

@pmdj That might be an answer to the issue @mterhar raised earlier.

https://bfd.so/@mterhar/109620704925979134

It's kind of what I had in mind as well with the Internet Archive itself being the signature service.

@briankrebs also if anyone relies on archive.org for anything in their professional field, they should consider making a donation to @internetarchive
@briankrebs That does make me wonder, would it be possible to save the page with the https certificate attached in a way that could prove the authenticity of the content at that time? I'm no cryptographer so I have no idea, but it may be worth investigating.