This is gonna sound crazy, but I think you can skip the keylogger step!
You could make a “keystroke-sound-language-model” (so like a language model that combines various modalities, e.g., Flamingo), then train that with self-supervised learning to match “audio” with “text”, and have a system where:
- You listen to your target for a day or so, let’s say, 1000 words typed in 🤷🏻‍♂️
- Then the model could do something akin to anchor tokens in language-to-language translation, except in this case it would be more like fixing on easy words such as “the” to give away part of the sound-to-key map. Then keep repeating this with other common words to map more parts of the keyboard
- Eventually you try to extract passwords from your recordings and maybe bingo
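To make the anchor-word step a bit more concrete, here's a toy sketch. It is not the model described above, just plain pattern matching, and it assumes a lot: that keystroke sounds have already been clustered into integer IDs, that word boundaries are known, and that the most frequent three-key word with three distinct sounds is “the”. The names `anchor_mapping` and `decode` are made up for illustration.

```python
from collections import Counter


def pattern(seq):
    """Repetition pattern of a sequence, e.g. "the" -> (0, 1, 2), "all" -> (0, 1, 1)."""
    seen = {}
    return tuple(seen.setdefault(s, len(seen)) for s in seq)


def anchor_mapping(words, anchor="the"):
    """Guess a partial cluster-ID -> letter map from one anchor word.

    `words` is a list of tuples of cluster IDs (assumed input format).
    The most frequent word sharing the anchor's repetition pattern
    is assumed to be the anchor itself.
    """
    target = pattern(anchor)
    candidates = Counter(w for w in words if pattern(w) == target)
    if not candidates:
        return {}
    best, _ = candidates.most_common(1)[0]
    return dict(zip(best, anchor))


def decode(words, mapping):
    """Render each word with known letters filled in and '?' for unmapped clusters."""
    return ["".join(mapping.get(c, "?") for c in w) for w in words]
```

The partially decoded words (with `?` for unknown keys) are what you'd feed into the next round, picking another common word that fits to pin down more keys. A real recording would be far noisier than this, so treat it as the idea only.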
I think it’s very narrow to think that, just because this research case requires a keylogger, these systems couldn’t evolve over time to combine other techniques