My talk from Security Fest is now live for any interested security enthusiasts, pentesters, red teamers and password crackers. Fun fact, my voice is that annoying in real life too.
I have to get some sensitive documents notarized.
My library offers notary services for free.
I must say, the sensitivity and grace the librarians have continues to amaze me.
@dangoodin @troyhunt @benjojo @cR0w @Viss @matthew_d_green
This probably deserves a blog post vs. a forum posting so I'll keep some of my points brief but will be more than happy to fill in with more details and references.
As some background, I spent a lot of time digging into the security of the pwned passwords lookup service before the pandemic. This led to me creating a proof-of-concept traffic analysis attack against the lookup service that would use the size of the replies from the PP service to identify the hash prefix submitted by a user, even when this traffic was encrypted by TLS. Here is the code and a short writeup: https://github.com/lakiw/pwnedpasswords_padding
I then worked with @troyhunt and Junade Ali to implement optional random padding in Cloudflare's responses to requests that can be enabled in the API calls. More information about this can be found here: https://blog.cloudflare.com/pwned-passwords-padding-ft-lava-lamps-and-workers/
As a huge disclaimer though, this is totally different from what Cloudflare is actually doing since it focused on attacks against users of this service vs. attacks against the hosting service itself. It also targeted PP vs. Cloudflare's Might I Get Pwned implementation.
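For anyone who hasn't looked at the lookup itself, here is a minimal sketch of a Pwned Passwords range query with the optional padding turned on. The function names are mine and purely illustrative. The endpoint and the `Add-Padding` header are the public ones, and I'm assuming padded entries come back with a count of 0, as described in the write-up linked above.

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"  # public range endpoint


def split_hash(password):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest.

    Only the 5-character prefix is ever sent to the service.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def fetch_range(prefix, padding=True):
    """Fetch the bucket for a prefix (illustrative helper, needs network)."""
    headers = {"Add-Padding": "true"} if padding else {}
    request = urllib.request.Request(RANGE_URL + prefix, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def count_in_response(suffix, body):
    """Look for our suffix in a 'SUFFIX:COUNT' response body.

    Padded entries are assumed to carry a count of 0, so a padded
    line can never be mistaken for a real hit.
    """
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate == suffix:
            return int(count)
    return 0
```

Usage would be roughly `prefix, suffix = split_hash(pw)` followed by `count_in_response(suffix, fetch_range(prefix))`. The padding only changes how many lines come back, which is the whole point: the size of the encrypted reply stops being a fingerprint for the prefix.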
With all that out of the way, I have three main points to make about the current Cloudflare implementation of this service:
1) Pwned Passwords and Cloudflare's Might I Get Pwned (MIGP) do not use k-anonymity. Full stop.
- Cloudflare has gotten better at not using the term k-anonymity in their publications, but long story short, k-anonymity is a specific approach to anonymizing data, and they are not using it here.
- Pwned passwords does not use that technique, and has never used that technique, but the term sounds cool, so it was incorrectly applied to the bucketization method they use in their responses.
- I could go way into the weeds of this if anyone cares (you probably don't), but trying to match up k-anonymity to what they are doing leads only to frustration and madness. It would bring me joy for people to stop using this term when describing these implementations.
2) Cloudflare's MIGP service has the potential to do a decent job of segmenting password/credential information to their web application firewall (WAF) and not sharing it with their other servers. But there are association attacks, likely more academic in nature, that can still reveal a lot about users' passwords to the backend MIGP servers.
- If implemented as described in the various papers, Cloudflare's MIGP service sends a number of fake lookup requests along with the real one. Depending on the number of fake requests sent, it can make it harder for an attacker on the backend servers to know what hash prefix is the "real" one.
- There are other features such as applying the username to create individual buckets, and hashing the password. I'm going to gloss over them for now since they don't provide a huge degree of security. For example, the hashing is primarily done to make lookups easier. Cloudflare says as much in their documentation, and I can attest to that from my own work. Doing string comparisons on user-created data is a pain, and you see all sorts of weird stuff in passwords.
- As for the attacks, they roughly fall into association attacks. For example:
A) If a user logs in multiple times, and doesn't change their password between sessions, the value of random requests goes down since the attacker can simply look at shared requests across multiple sessions. This of course requires an attacker to distinguish between a login for a particular user vs. all the other users of a site. Now this is where some fun can happen with the individual bucket approach MIGP uses, but long story short, this is a theoretical vulnerability that would likely be complicated to pull off in practice.
B) The other weakness lies in the "similar" password detection that MIGP can do. I believe this feature is turned on based on Cloudflare's documentation. In short, the WAF mangles a user's password (such as changing "password1" to "password2") and then submits lookups for the variants as well. This also has the potential to allow for attacks against the "fake requests", since you can look for groups of requests that would match these variation lookups. A more problematic attack, though, is that when combined with the previous attack to remove the fake lookups, it can also leak additional data about the plaintext password. That is, the types of mangling that this "similar" password lookup does are based on the password itself. So if an attacker knows that three bucket requests contain "password1", "password2", and "password3", they can then assume that the user's password contained the word "password", even though only the hash prefixes were submitted.
- Attack B is also largely theoretical, but I would love to try and get a PoC working if I knew more details about the actual MIGP implementation. That's a fun research problem to tackle!
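To make attack A a bit more concrete, here is a toy sketch of the intersection idea. Everything in it is an assumption for illustration: the prefix space size, the number of decoys, and the helper names are made up, and it says nothing about how MIGP actually picks its fake requests.

```python
import random

# Assumption: prefixes are modeled as integers in a 5-hex-digit space.
# The real MIGP bucket identifier may be sized differently.
PREFIX_SPACE = 16 ** 5


def simulate_session(real_prefix, n_decoys, rng):
    """One login: the real prefix plus n_decoys random fake prefixes."""
    decoys = set()
    while len(decoys) < n_decoys:
        candidate = rng.randrange(PREFIX_SPACE)
        if candidate != real_prefix:
            decoys.add(candidate)
    return decoys | {real_prefix}


def surviving_prefixes(sessions):
    """Prefixes seen in every session the attacker linked to one user."""
    return set.intersection(*sessions)


if __name__ == "__main__":
    rng = random.Random(2024)
    real = 0xB2E98
    sessions = [simulate_session(real, 9, rng) for _ in range(3)]
    print(sorted(hex(p) for p in surviving_prefixes(sessions)))
```

The real prefix always survives the intersection, while a decoy only survives if it happens to be redrawn in every session. The hard part in practice is the linking step (knowing which sessions belong to the same user), which is why I'd still file this under theoretical.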
3) This whole debate is a great way to highlight that Cloudflare's WAFs see *everything*. This has been a problem in the past for other WAFs where they logged plaintext passwords [One recent reference: https://www.bleepingcomputer.com/news/security/wordpress-aios-plugin-used-by-1m-sites-logged-plaintext-passwords/], and this is something that is well known to red teamers and attackers. WAFs are also a great place to tap traffic if law enforcement agencies wanted to gain access to it but didn't want to work with the site owners directly. As a site owner, if you are worried about Cloudflare accessing user passwords then you need to be aware of this before using this service. As a user, well you really don't get much say in this...
- The value of compromised credential services like this can be huge. I don't want to sound negative about Cloudflare's approach even though I'm pointing out potential weaknesses. Many sites don't have the resources to implement compromised credential lookups themselves. Password reuse attacks are *EXTREMELY* prevalent so protecting against them will make users safer. So, I'm not giving Cloudflare or Pwned Passwords any grief for doing this. It is a valuable service!
- In conclusion, this debate makes me happy since passwords are a good way of highlighting the access that 3rd party WAFs have when viewing user traffic. Most of the conversations around this topic can be pretty dry, so the fact that this gets people interested in the larger security questions posed by WAFs is a really good outcome!
The protective value of "k-anonymity"¹ for Have I Been Pwned / Pwned Passwords API lookups is significantly reduced because frequency data is included. And the more common the password, the more this effect is magnified.
An example:
https://gist.github.com/roycewilliams/2034c9253d46fbcaefb13f8e5d42daa2
... with cracks:
https://gist.github.com/roycewilliams/2bb471cc90cce7f6834204344590fcac
Using "k-anonymity"¹ to return all hashes that begin with b2e98 is less "anonymous" ... when 98.6% of the passwords (by frequency across all leaks) are the top one.
It's not really hiding a needle in a haystack if you just lay it on top.
Edit: in fact, even without the frequency data, since some passwords are much more common than others ... left-skewed distribution is an intrinsic property of password data. Missing frequency data can be largely reconstructed from public cracking efforts. (And even if that weren't true, the hashes can just be cracked using traditional methods. If the cracking community can get a 97%+ cracking rate², what is being achieved other than plausible deniability?)
K-anonymity [as implemented by HIBP, anyway -- true K-anonymity is different¹] may just be a bad fit for password hashes.
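A rough way to put a number on the "needle on top of the haystack" effect is to compute how much of a bucket's total frequency belongs to its most common entry. This is only a sketch: the function name is mine, and the input is assumed to be the plain `SUFFIX:COUNT` lines that a range lookup returns.

```python
def top_share(range_body):
    """Return (suffix, share) for the most frequent entry in one bucket.

    share is that entry's count divided by the sum of all counts in
    the bucket, so 0.986 means one hash accounts for 98.6% of the
    observed frequency.
    """
    counts = {}
    for line in range_body.splitlines():
        line = line.strip()
        if not line:
            continue
        suffix, _, count = line.partition(":")
        counts[suffix] = int(count)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("bucket has no non-zero counts")
    top_suffix = max(counts, key=counts.get)
    return top_suffix, counts[top_suffix] / total
```

An observer who sees only the prefix and weights candidates by frequency would guess the top entry. The closer the share is to 1, the less the other hashes in the bucket hide anything.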
¹ Not actually k-anonymity at all:
https://en.wikipedia.org/wiki/K-anonymity
² Actually closer to 99.29% across the entire corpus, publicly:
https://gist.github.com/roycewilliams/40f0e8c93ec9c69f5b5a1874c76f2587
Starting to play around with llama LLMs. I need to get smarter about the current state of systems, and making API calls to OpenAI and Claude just isn't cutting it.
I fully expect to spend a week hitting my head against the wall but hopefully I'll have learned something at the end of it.
Warning: this is another side tangent that doesn't actually show any risks to only having a 6-digit OTP, but I figure it might help to give some background on the discussion @tychotithonus and I were having.
I was worried at one point that shorter OTP tokens could cause problems for attackers trying to clone the underlying token. The scenario is where an attacker has stolen all of the OTP seeds, but needs to match up a particular seed with data obtained via a keystroke logger to clone a particular token. [Attack Reference: The RSA hack https://www.wired.com/story/the-full-story-of-the-stunning-rsa-hack-can-finally-be-told/]
Part of the challenge for an attacker here is false positives from shorter tokens. I looked into this back in the day for the RSA hack and created a PoC tool called eRSAtz to help in this attack scenario (never released it since I don't want to get sued). A couple of high-level findings from that research:
1) There are actually a bunch of different RSA SecurID backend algorithms. The ones used during the time of the attack were updated from the original flawed implementation. There was actually a different algorithm for the digital-only tokens that focused a lot on making it so you couldn't just create your own fake seeds. For my research I used the SecurID version found in the Cain and Abel SecurID module.
2) Part of the "secret" was actually a serial number that isn't secret. These serial numbers are assigned to companies (and in fact were printed on the OUTSIDE of SecurID shipping boxes to make managing them in the warehouse easier). What this means is that an attacker in possession of all the seeds doesn't need to worry about collisions from serial numbers assigned to other companies they aren't targeting. This reduces the number of false positives dramatically.
3) Simply collecting two TOTP entries from your keystroke logger pretty much negates any issue with collisions.
4) This is a side tangent to a side tangent, but I have to say I was really impressed by the design of SecurID. There are a lot of design decisions that sound bad when talked about out of context (see my comment above about printing serial numbers on the outside of shipping boxes), but when you look at the overall design they don't cause any real weaknesses and they make the system much more usable. It's a really well designed algorithm, assuming of course that all of your seeds don't get stolen.
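As back-of-the-envelope math for points 2 and 3, here is a sketch of the expected number of wrong seeds that would still match the logged codes. It assumes each code behaves like a uniform, independent random number, which is a simplification, and the seed counts in the example are hypothetical numbers I picked for illustration, not real figures.

```python
def expected_false_matches(candidate_seeds, digits=6, observed_codes=1):
    """Expected number of wrong seeds that match every logged code.

    Assumption: each wrong seed matches one code with probability
    10**-digits, independently for each observed code.
    """
    wrong_seeds = candidate_seeds - 1
    return wrong_seeds / (10 ** (digits * observed_codes))


if __name__ == "__main__":
    # Hypothetical pool sizes: every stolen seed vs. only the serial
    # numbers assigned to the targeted company.
    for label, pool in [("all seeds", 40_000_000), ("one company", 5_000)]:
        for codes in (1, 2):
            estimate = expected_false_matches(pool, observed_codes=codes)
            print(f"{label}, {codes} code(s): {estimate:.6g}")
```

Narrowing the pool by serial number and collecting a second code both drive the estimate down by orders of magnitude, which is why collisions stop being a practical obstacle.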