Sergey Ovchinnikov

Scientist, pseudo-PI - Harvard University, #FirstGen
@sokrypton

Efficiently generate de novo proteins by
- optimizing residue logits for max AF confidence
- redesigning the sequence using ProteinMPNN
Tested in the lab, including cryo-EM structures
@chrisfrank662 @AKhoshouei @sokrypton @hendrik_dietz

https://www.biorxiv.org/content/10.1101/2023.02.24.529906v1
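The two-step recipe above (optimize residue logits for confidence, then redesign the sequence) can be sketched in miniature. This is a toy stand-in, not the real pipeline: the actual method backprops through AlphaFold's pLDDT/pTM and then runs ProteinMPNN on the designed backbone; here a low-entropy proxy replaces AF confidence and an argmax replaces ProteinMPNN.

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy(p):
    return -sum(pi * math.log(pi + 1e-9) for pi in p)

def toy_confidence(logits):
    # Toy stand-in for AF confidence: rewards sharp (low-entropy)
    # per-residue distributions. The real objective is pLDDT/pTM.
    return -sum(entropy(softmax(pos)) for pos in logits) / len(logits)

def optimize_logits(length, steps=50, lr=0.5):
    # Step 1: gradient ascent on the confidence proxy over residue logits.
    random.seed(0)
    logits = [[random.gauss(0.0, 0.1) for _ in AA] for _ in range(length)]
    for _ in range(steps):
        for pos in logits:
            p = softmax(pos)
            h = entropy(p)
            # analytic gradient of -entropy w.r.t. logit j: p_j * (log p_j + H)
            for j in range(len(pos)):
                pos[j] += lr * p[j] * (math.log(p[j] + 1e-9) + h)
    return logits

def redesign(logits):
    # Step 2 stand-in for ProteinMPNN: take the argmax residue per position.
    # The real method redesigns the sequence conditioned on the backbone.
    return "".join(AA[max(range(len(pos)), key=pos.__getitem__)] for pos in logits)
```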

Nailed it! I think I'm ready to retire... 😅
The first interview done! Time to prep for the next. 😎

@neuropunk
So during design, as soon as the model predicts the desired structure, there is no longer any signal to keep updating your sequence.

In this case, we wanted the structure to be fully encoded in the LM's contacts, and to avoid a situation where a more complex structure module starts hallucinating or improvising. (2/2)

@neuropunk
It's good for the structure prediction task: you want the model to be robust and to recognize even the bare minimum of signal in the input sequence. But it's not good for the design task.

Let's say you have a suboptimal sequence that only partly encodes the desired structure. If your model is "too good", it will fill in the rest of the structure. (1/2)

One thing to keep in mind: it's critical that this linear projection be as simple as possible. This avoids the "connect the dots" phenomenon we saw with TrRosetta, where the sequence encodes only some of the contacts and the rest of the layers fill in the remaining ones. 🤔
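A minimal sketch of such a linear contact head: a logistic regression over per-head attention values at each residue pair. (This omits the symmetrization and APC correction the real ESM head applies, and the weights here are made up.)

```python
import math

def attn_to_contacts(attn, head_weights, bias):
    # Minimal linear contact head: a logistic regression over the
    # per-head attention values at each residue pair (i, j).
    # attn: [heads][L][L]; head_weights: one scalar per head.
    # Because the projection is linear, a contact can only appear if
    # the attention maps themselves carry it -- no deeper layers exist
    # to "connect the dots".
    n_heads, L = len(attn), len(attn[0])
    contacts = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            z = bias + sum(head_weights[h] * attn[h][i][j] for h in range(n_heads))
            contacts[i][j] = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    return contacts
```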
Alright, first attempt at a tooter thread 😅

check out the preprint:
https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1

Thanks to all the amazing collaborators!
@robert_verkuil
@OriKabeli
@du_yilun
@BasileWicky
@LFMilles
@JustasDauparas
David Baker
@UWproteindesign
@TomSercu
@alexrives

Instead, if you use a language model, which models P(sequence), train an extra structure head on the attention maps (essentially modeling p(structure | sequence)), and optimize both objectives, you get working designs! The LM loves them (lower perplexity) and AlphaFold does not hate them (pTM > 0.5). (5/5)
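The two objectives can be combined into one design loss. A hedged sketch (the weighting, the cross-entropy form, and the toy inputs are assumptions, not the paper's exact loss):

```python
import math

def contact_bce(pred, target):
    # Mean binary cross-entropy between predicted and target contact maps.
    L, eps = len(pred), 1e-9
    total = 0.0
    for i in range(L):
        for j in range(L):
            p, t = pred[i][j], target[i][j]
            total += -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))
    return total / (L * L)

def joint_objective(nll_per_res, pred_contacts, target_contacts, w_struct=1.0):
    # Both terms from the thread: keep the sequence likely under
    # P(sequence) (low mean NLL, i.e. low perplexity) while the
    # structure head p(structure | sequence) matches the target contacts.
    lm_term = sum(nll_per_res) / len(nll_per_res)
    return lm_term + w_struct * contact_bce(pred_contacts, target_contacts)
```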
For comparison, we also used ColabDesign's AfDesign protocol (protocol="fixbb"), which only models p(xyz|seq). Not surprisingly, AF2 liked these sequences (high pTM values), but the LM did not (high perplexity values)... and most of them also did not work in the lab (not soluble and/or not monomeric by size-exclusion chromatography)... (4/5)
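For reference, the perplexity used to score designs under the LM is just the exponential of the mean per-residue negative log-likelihood (a generic definition, not the exact ESM scoring code):

```python
import math

def perplexity(nll_per_residue):
    # Sequence perplexity under the LM: exp of the mean per-residue
    # negative log-likelihood. Lower means the LM "likes" the sequence;
    # a uniform guess over 20 amino acids gives perplexity 20.
    return math.exp(sum(nll_per_residue) / len(nll_per_residue))
```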