At our #cryoEM data club yesterday, I shared my current strategy for atomic model building and refinement in a cryoEM map. So here it is (as of fall 2024; things change quickly in this field). 🧵
1/19
First, this assumes that I have a "final" map (no more image processing needed) and that its local resolution is better than ~5 Å in most places. Unless otherwise noted, this is the "raw" (unfiltered and unsharpened) map.
2/19

If the protein or complex was purified from its native host organism:
1. I sequence the map with `model_angelo build_no_seq`.
2. I identify all proteins with `model_angelo hmm_search` against the reference proteome of the host organism (downloaded from #Uniprot).

If the protein or complex was prepared recombinantly, I already know the sequences of all proteins.

So at this stage, either way, I know which proteins are in there.
3/19
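
A minimal sketch of the two ModelAngelo steps above, written as a Python helper that only assembles the command lines (it does not run them). The flags (`-v`, `-i`, `-f`, `-o`) are as I recall them from the ModelAngelo README and should be checked against `model_angelo build_no_seq --help` and `model_angelo hmm_search --help`; the file and directory names are hypothetical.

```python
import shlex


def model_angelo_commands(map_path, proteome_fasta,
                          build_dir="build_no_seq_out",
                          search_dir="hmm_search_out"):
    """Return the two command lines as argument lists.

    Flags are assumptions recalled from the ModelAngelo README;
    all paths and directory names are hypothetical placeholders.
    """
    # Step 1: build a model from the map alone, without sequence input.
    build = ["model_angelo", "build_no_seq", "-v", map_path, "-o", build_dir]
    # Step 2: search the HMM profiles from step 1 against a proteome FASTA.
    search = ["model_angelo", "hmm_search", "-i", build_dir,
              "-f", proteome_fasta, "-o", search_dir]
    return build, search


if __name__ == "__main__":
    for command in model_angelo_commands("map.mrc", "host_proteome.fasta"):
        print(shlex.join(command))
```

The second command reads the output directory of the first, which is why both are generated together.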

Having identified the proteins, I fetch their #AlphaFold2 predictions from AlphaFold-DB, or compute them if not in the DB. #AF2 models have excellent geometry and a complete, correctly numbered sequence, so they are excellent starting models. I rarely use #PDB entries as starting models anymore. Rare exceptions: a PDB entry I deposited myself, or one containing a post-translational modification or non-natural amino acid I need (never present in #AF2 models, only the 20 standard amino acids).
4/19
Using #ChimeraX, I dock these #AF2 models into the map. I have seen a case with two different protein subunits of very similar sequences, which makes assigning each AF2 model to the density pretty difficult. It is however very easy to align (with the `matchmaker` command in ChimeraX) the AF2 models to the model produced by #model_angelo. This unambiguously places each of the similar AF2 models where model_angelo detected its sequence.
5/19
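
As a rough illustration of the alignment step above, this Python helper writes out ChimeraX `matchmaker` commands that superimpose each AF2 model onto the ModelAngelo model. The model IDs are hypothetical, and the sketch assumes whole-model matching with default settings; restricting the match to specific chains may need extra options described in the ChimeraX documentation.

```python
def matchmaker_script(reference_model, af2_models):
    """Build ChimeraX command lines aligning AF2 models to a reference.

    reference_model: ChimeraX model ID of the ModelAngelo model (hypothetical).
    af2_models: iterable of ChimeraX model IDs for the AF2 predictions.
    """
    lines = []
    for model_id in af2_models:
        # ChimeraX syntax: matchmaker <structure to move> to <reference>
        lines.append(f"matchmaker #{model_id} to #{reference_model}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: ModelAngelo model open as #1, AF2 predictions as #2 and #3.
    print(matchmaker_script(1, [2, 3]))
```

The printed lines can be pasted into the ChimeraX command line or saved as a `.cxc` command file.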
If the map is symmetric, I generate the asymmetric unit (ASU) and store the symmetry info that regenerates the whole complex in my working CIF file containing what will eventually become the final atomic model.
6/19
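
To make "symmetry info that regenerates the whole complex" concrete, here is a small worked sketch in plain Python. It assumes, purely for illustration, cyclic (Cn) symmetry about the z axis through the origin; real maps may have other point groups, and the axis usually passes through the box center instead of the origin. The function names are hypothetical.

```python
import math


def cyclic_operators(n):
    """Rotation matrices for Cn symmetry about the z axis (assumption)."""
    operators = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        c, s = math.cos(angle), math.sin(angle)
        operators.append(((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0)))
    return operators


def apply_operator(operator, xyz):
    """Multiply a 3x3 rotation matrix by a coordinate triple."""
    return tuple(sum(row[i] * xyz[i] for i in range(3)) for row in operator)


def expand_asu(asu_coords, n):
    """Return n copies of the ASU coordinates, one per symmetry operator."""
    return [[apply_operator(op, xyz) for xyz in asu_coords]
            for op in cyclic_operators(n)]


if __name__ == "__main__":
    # One atom of the ASU, expanded into a C4 assembly.
    for copy in expand_asu([(1.0, 0.0, 5.0)], 4):
        print(copy)
```

Only the ASU coordinates and the list of operators need to be stored; the full complex is recomputed on demand, which is what keeping the symmetry information in the CIF file achieves.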
I finally trim the termini and loops not supported by any density. At this stage, I have a starting atomic model: it mostly fits the density, and there should remain only local problems to fix.
7/19
At this point, I use #ISOLDE to do manual fitting. If the map has a low resolution, I apply secondary structure restraints to the model. I start with a global simulation to let the model settle, which I let run until the dots in the interactive Ramachandran plot have reached an equilibrium. With #AF2 models, there shouldn't be Ramachandran outliers in the first place, but this is a good overview of how many residues are moving. I want them to settle in the nearest density before I continue.
8/19
Then comes the fun part in #ISOLDE: tug the atoms and they respond in a physically realistic way, how cool is that?! 🤩 This step can also be very tedious if the model is large. 😵 Using the `isolde step next` command, I walk along each chain from its first to last residue (in a symmetric map, only for the ASU), fixing problems. This includes placing water molecules, ions and ligands, where there is density supporting their presence. This is super fun to do, and teaches a lot about chemistry.
9/19
During this stage I often get help from two additional sources of information:
1. The model from #model_angelo sometimes helps move the #AF2 model to the correct location, in those cases where a segment of the AF2 model doesn't match the map. Very often these discrepancies between map and prediction point to regions of the protein involved in conformational change, so I make a note of the chain ID and residue range so I know to take a close look later and compare to related structures.
10/19
2. A post-processed map from #deepEMhancer used only as a visual guide (so not allowed to pull on the atoms in the iMDFF procedure in #ISOLDE) is often helpful in regions of lower resolution. I have seen cases where I couldn't find a stable conformation for a long side chain, until I placed it where the deepEMhancer map suggested: suddenly it was nicely stable in the combined pull of the raw map and MD force field.
11/19
After this exhaustive first pass inspecting each residue, I try to clear the biggest problems flagged by #ISOLDE's validation tools (there are always some). Clashes are rare because atoms that are too close strongly repel each other under the MD force field. Ramachandran and rotamer outliers are rare in #AF2 models, but can arise when the simulation distorts the model. These problems often point to regions where the map is ambiguous, and where the model needs more attention.
12/19
Once I am happy with the current state of the atomic model, the next step depends on the global resolution of the map:
1. For low resolution (~5 to 2.5 Å), I run the model through #phenix.real_space_refine using the parameter file generated by #ISOLDE. This turns off most things, but refines the coordinates with reference-model restraints (so they won't move far from where I left them with my manual fitting) and the atomic b-factors.
13/19
2. For high resolution (~2.5 Å and better), I refine the atomic model with #servalcat.
14/19
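
The choice between the two refinement programs can be summarized in a few lines of Python. This is only a restatement of the rule of thumb above: the ~2.5 Å cutoff is approximate, and assigning exactly 2.5 Å to servalcat is an arbitrary choice made for this sketch.

```python
def choose_refinement(global_resolution_angstrom, threshold=2.5):
    """Pick the final refinement program from the global map resolution.

    Smaller numbers mean better resolution. The threshold is approximate;
    treating exactly 2.5 as "high resolution" is an assumption of this sketch.
    """
    if global_resolution_angstrom <= threshold:
        return "servalcat"
    return "phenix.real_space_refine"


if __name__ == "__main__":
    for resolution in (3.2, 2.5, 1.8):
        print(resolution, choose_refinement(resolution))
```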
This step is important for two reasons. The MD force field will produce a distribution of bond lengths and angles that doesn't quite match the distribution expected by the validation suite run by the #PDB deposition server, so omitting this last refinement will cause many things to be flagged as outliers. The other reason is that #ISOLDE doesn't (yet) refine atomic b-factors, so omitting this last refinement would leave nonsensical b-factors in the model.
15/19
There is often one more round of #ISOLDE + final refinement to do, if only to check visually that the "final" refinement didn't mess anything up. Of note, #servalcat produces a sharpened map and a difference map, both extremely helpful to spot and fix problems. I have found very subtle modeling errors that I had totally missed until I examined the model against these maps. So, whenever I use servalcat, I always do at least a second round of visual inspection and manual fitting in ISOLDE.
16/19
There can be more rounds of interactive and automated refinement. I read somewhere (can't find the ref anymore, if this rings a bell please tell me where it's from) a quote along the lines of "high-resolution atomic model building is never complete, only abandoned". And this rings very true after having modeled a map at ~1.8 Å. With another project, I recently got one at ~1.4 Å... 😵‍💫 Wish me luck.
17/19

Once I'm happy, or have given up on modeling the last remaining bits of unexplained density or fixing the last subtle model errors in places mostly irrelevant to the biological question the model is meant to answer, I run `phenix.validation_cryoem` on the half-maps and atomic model. This gives me most of the numbers to put in the refinement table. Then I deposit the model into the #PDB and the maps into the #EMDB. Fin.

Happy to hear any comments or descriptions of how other people do this!

18/19

@Guillawme I think lots of people have made this point but the first I can recall was a CCP4 study weekend paper from Phil Evans, in the 1980s, saying that refinement is never done but you refine ad tedium, until it becomes too tedious to continue!

An addition to this 👆 from spring 2025: I now also use EMReady to produce a post-processed map. It follows a different formalism than deepEMhancer, and I have observed a case where they disagreed on the location of the main chain over a loop of ~4 residues. This was helpful to identify a mixture of states and model them properly.

EMReady paper: https://doi.org/10.1038/s41467-023-39031-1

EMReady website: http://huanglab.phys.hust.edu.cn/EMReady/

Regarding water molecules, I recently added phenix.douse to my toolbox. It is very easy to use with the integration between ChimeraX and Phenix, and in my (still limited) experience produces good results. I run it when the model is reasonably complete, to limit false positives (water molecules placed in density not yet modeled but clearly something else).

#CryoEM