| Website | www.theDNAgeek.com |
Current DNA analysis methods assume our matches are only related in one way. What if that's not the case?
https://thednageek.com/jan-and-the-complex-pedigree-analysis/
#pedigreecollapse #endogamy #doublecousins #BanyanDNA #DNAanalysis
Annemarie's grandparents were Anna and Jan, Afrikaners in South Africa. They married when Anna was 18 (her first marriage) and Jan was 36 (his second). Jan was the younger brother of Anna's stepfather Andries, who married Anna's mother, Maria, and fathered Anna's half sister Igna. With me so far?

[Figure: The paper trail for Anna and Jan]

How much DNA can we expect the children of Anna and Jan to share with a grandchild of Andries and Maria? They would be half first cousins (h1C), because Igna and Anna were half sisters, and also first cousins once removed (1C1R) through the brothers Jan and Andries. If so, we'd expect them to share ≈425 cM through the first relationship and another ≈425 cM through the second, totaling approximately 850 cM, the same amount as full first cousins.

But they didn't. S, the granddaughter of Andries and Maria, shares only 679, 635, and 592 cM with three daughters of Anna and Jan (M, A, and A, respectively). What's more, P, a grandchild of Jan's first marriage, shares only 156 cM with S, on the low end for the expected 2C relationship.

Are these observations just normal variation, or do they mean something more? Annemarie suspected that Jan might have been a half brother to Andries or might even have been adopted into the family. How can we tell? Annemarie's question, posted to The DNA Roundtable Facebook group, is a fabulous opportunity to explore complex pedigree analysis.

DNA-based Relationship Tools

Tools like the Shared cM Tool and What Are the Odds? can indicate which relationships are most likely for two or more people based on how much autosomal DNA they share. Basically, a computer program simulated thousands of first cousins and thousands of second cousins (and so on) to see what the typical shared DNA amounts are for each. With a little more analysis, we can then compare what's expected for a given relationship with what we actually see.
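As a rough illustration of comparing expected with observed sharing, here is a minimal Python sketch that uses only the figures quoted above. It follows the post's own arithmetic of adding the expected cM for each relationship path; the function names and dictionary labels are made up for this example and do not come from any genealogy tool.

```python
# Approximate expected sharing per relationship, in cM, as quoted in the post.
EXPECTED_CM = {"h1C": 425, "1C1R": 425}


def expected_total_cm(relationships):
    """Sum the expected shared cM over every path by which two people are related."""
    return sum(EXPECTED_CM[r] for r in relationships)


def shortfall(observed_cm, expected_cm):
    """Fraction by which an observed match falls below the expected amount."""
    return (expected_cm - observed_cm) / expected_cm


# S and each daughter of Anna and Jan are related twice under the paper trail.
expected = expected_total_cm(["h1C", "1C1R"])

# Observed matches to S for the three sisters (labels are illustrative).
observed = {"M": 679, "A (first)": 635, "A (second)": 592}

for name, cm in observed.items():
    print(f"{name}: {cm} cM, {shortfall(cm, expected):.0%} below the expected {expected} cM")
```

A shortfall alone does not say whether a match is unusual; that depends on how widely real matches scatter around the expected value, which is what simulations are for.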
However, those tools assume everyone is related only once. That's obviously not the case here. For Annemarie's question, we need data that can account for the double relationship between S and M+A+A. We need custom simulations.

A couple of software tools are available that can do such simulations for genetic genealogy, but they are not for the faint of heart, and they don't do the statistical analyses we need to test genealogical hypotheses. I used one called Ped-sim to help Annemarie after making some tweaks to the workflow to align it better with the DNA tests we use for genealogy. (For the nerdy: I used sex-specific crossover rates, accounted for crossover interference, and used a genetic map of ≈3500 cM.)

The Hypotheses

I tested three hypotheses for Jan's place in the tree:

Hypothesis 1 (H1): He was a full brother to Andries.
Hypothesis 2 (H2): He was a half brother to Andries.
Hypothesis 3 (H3): He was a first cousin to Andries and adopted by Andries' mother.

In the diagrams below, the hypotheses are shown in black, and the DNA testers are in blue. The branch in question is red.

The table below shows how each pair of DNA testers would be related to one another under each hypothesis. Because M, A, and A are full siblings to one another and have the same relationships to S and P, they were treated as one entity, MAA.

Next, I simulated DNA match data between S and MAA and between S and P for each hypothesis. Note that the relationship between P and MAA is the same for all three hypotheses, so there was no need to simulate or analyze those matches. I did 10,000 simulations for each relationship pair for each hypothesis, and used the results to generate expected histograms that we can compare to the actual match data.

Were Jan and Andries Brothers?

Total cM

This first set of histograms shows the total amounts of shared DNA between S and P (top of figure) and between S and MAA (bottom).
In each figure, the histograms represent, from left to right, H3 (blue), H2 (green), and H1 (red). The black arrowheads mark the actual shared DNA amounts. The highest bar beneath each arrowhead is the most likely relationship for that match.

The match between S and P is equivocal between H1 and H2 (the bars are almost the same height) but strongly disfavors H3. Eyeballing the bar heights, I estimated the probabilities at ≈51% for H1, ≈46% for H2, and ≈3% for H3. The matches between S and M, A, and A all favor H2, with rough probabilities of ≈49%, ≈54%, and ≈54%.

I also calculated an odds ratio (i.e., a WATO-type score) for each hypothesis by multiplying the individual probabilities to get a compound probability, then dividing by the smallest one to convert to odds ratios. The WATO scores were 1 for H1, 60 for H2, and 1 for H3. This is considered strong support for H2.

Number of segments

Next, I analyzed the number of segments shared by S and P (top of figure) and by S and MAA (bottom).

This time, the match between S and P strongly favored H1 (≈93%) over the other two hypotheses. The MAA siblings' matches to S individually favored three different hypotheses (bottom), although H2 was the only hypothesis that was not strongly disfavored by any of the three matches. Its individual probabilities for the three siblings were ≈38%, ≈38%, and ≈51%.

Again, I calculated the odds ratios for each hypothesis, this time based solely on the number of shared segments. The scores were 12 for H1, 47 for H2, and 1 for H3. This is strong support for H2 over H3 but only moderate support for H2 over H1.

Longest segment

The third factor considered was the size of the longest segment. For S and P (top of figure), H1 and H2 were equally probable at ≈39%. The matches between S and the three siblings (bottom) all slightly favored H1 (≈39%, ≈48%, and ≈39%). For the three sibling matches, the second most likely hypothesis was H2 at ≈35%, ≈30%, and ≈35%.
The odds ratios for longest shared segment were 8 for H1, 4 for H2, and 1 for H3. There was no meaningful difference between H1 and H2, and both had moderate support over H3.

Combined odds ratio

When all three factors are considered together (total cM, # segments, and longest segment), the results are unequivocal: Hypothesis 2 is very strongly supported with an odds ratio of 11,666, compared to 112 for H1 and 1 for H3. Thus, this analysis provides robust evidence that Jan and Andries were half brothers, not full brothers or first cousins.

The Next Frontier

Complex statistical analyses like this one are the future of genetic genealogy. Such analyses will allow us to address genealogical puzzles that are currently out of reach of autosomal DNA due to pedigree collapse, endogamy, and even incest. They can even let us leverage the DNA results of multiple close relatives to investigate research questions further back in our trees.

However, the work done for this blog post was both technical and tedious, and more sophisticated statistical analyses are available. A tool called BanyanDNA is currently in beta testing that will make complex hypothesis testing accessible to a much broader audience.

If you come from an endogamous population, you can help us to tailor the tool to your needs by submitting known match data to this survey. You can also sign up for the BanyanDNA mailing list to be among the first to hear about opportunities to beta test the tool and our official product launch at RootsTech 2024.
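The odds-ratio bookkeeping used in the analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, and combining the three factors by multiplication (which treats them as independent) is an assumption, since the post does not spell out how the combined score was computed.

```python
from math import prod


def odds_ratios(probabilities):
    """Multiply the per-match probabilities for each hypothesis, then divide
    every compound probability by the smallest one to get odds ratios."""
    compound = {h: prod(p) for h, p in probabilities.items()}
    smallest = min(compound.values())
    return {h: value / smallest for h, value in compound.items()}


def combine(*factor_scores):
    """Multiply per-factor odds ratios for each hypothesis.
    Assumption: the factors are treated as independent."""
    hypotheses = factor_scores[0].keys()
    return {h: prod(scores[h] for scores in factor_scores) for h in hypotheses}


# Per-factor odds ratios as reported in the post (rounded values).
total_cm = {"H1": 1, "H2": 60, "H3": 1}
segments = {"H1": 12, "H2": 47, "H3": 1}
longest = {"H1": 8, "H2": 4, "H3": 1}

combined = combine(total_cm, segments, longest)
print(combined)  # {'H1': 96, 'H2': 11280, 'H3': 1}
```

With the rounded per-factor scores, the products come to 96 and 11,280, close to but not identical with the reported combined scores of 112 and 11,666. The gap presumably reflects rounding in the published per-factor values, though that is a guess.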
At a family history event in mid-August, a company employee let slip that AncestryDNA was considering allowing DNA files created by other companies to be uploaded into their database. This report was confirmed to me directly by three independent sources.

If this were to happen, it would rock the genetic genealogy world, in more ways than one. Let's think this through.

The Good

Genetic genealogy is a numbers game, and AncestryDNA has, by far, the largest genealogical DNA database out there. It's about as large as the other direct-to-consumer databases combined. Almost everyone will benefit by having their family members in the AncestryDNA database.

That is not always possible, however. There are thousands of people in the smaller databases who have since passed away. Many of them tested at the request of avid genealogists who would dearly love to leverage AncestryDNA for their research but can't. These genealogists have often invested thousands of dollars to test extended family and would be willing to pay a little more to get those DNA kits into AncestryDNA's database.

It would be a win-win: a new revenue stream for AncestryDNA and a new life for legacy kits that are now stranded.

The Bad

Uploads at AncestryDNA would upend the genetic genealogy industry. Genealogists managing legacy kits would no longer need to cajole their AncestryDNA matches to transfer to the smaller databases to evaluate matches; they could simply transfer their own family kits the other way.

Smaller companies would falter, because there would be little incentive to upload there anymore. I'd hazard that 40% of the kits at MyHeritage and FamilyTreeDNA tested elsewhere, so growth at both places would be seriously impacted. MyHeritage has a solid foothold in Europe, so I think they'd adapt and survive. I'm not so sure about FamilyTreeDNA's autosomal database.

As much as I've questioned FamilyTreeDNA's actions over the past few years, I don't want them to go under.
A strong market needs competition, and genealogists need the Y-DNA and mitochondrial DNA offerings that only FamilyTreeDNA offers.

Sites like GEDmatch that are 100% uploads might well collapse.

The Ugly

There is a more sinister side to the possibility that AncestryDNA might accept uploaded files: law enforcement and bad actors could, nay would, invade the database, regardless of Ancestry's terms of service or what consumers want.

We've already seen that some law enforcement agents are uploading to sites forbidden to them by both Department of Justice policy and the databases themselves. We also recently learned that prominent forensic genetic genealogists, some of whom have spoken eloquently about ethics and trust, had been using a privacy hole at GEDmatch to see people who had not consented to law-enforcement matching.

It only takes a few bad apples to spoil everything. If leaders in forensic genetic genealogy cannot be trusted to play by the rules, why should anyone else? Ethical practitioners can't compete with people who cheat to solve cases and draw fawning press coverage.

If AncestryDNA starts taking unvetted uploads, law enforcement will be there in a heartbeat. The entire industry will suffer like it did in the wake of the Golden State Killer's arrest. You can see the damage in the graph below. The long straight lines represent the growth rates at AncestryDNA (green) and 23andMe (purple) in the year prior to the arrest (thicker lines) and since then (thinner lines).

Sales took a huge hit when the public learned that law enforcement had infiltrated some databases. The industry still hasn't recovered.

There is a solution, though.

Cryptosignatures

In 2018, Yaniv Erlich, then the Chief Science Officer at MyHeritage, proposed that DNA testing companies cryptographically sign their raw data files. (See the last paragraph of this scientific article.) The cryptosignatures could then be used by other sites to authenticate a file's origin before accepting it.
Such a system requires collaboration, though. If AncestryDNA wants to accept uploads from 23andMe, or vice versa, both companies would have to agree to use cryptosignatures. Each would have to negotiate its own arrangement with MyHeritage.

Back in 2018, there wasn't much incentive for the big companies to agree to all that. But now that AncestryDNA is considering uploads, the calculus has changed. (Calculus does that. 😉)

FamilyTreeDNA and GEDmatch, which both collaborate with law enforcement, would also benefit from cryptosignatures. They charge $700 per kit for forensic uploads, so law enforcement has a dual incentive to skirt the rules and upload as normal kits: the agents avoid a hefty fee and get to see kits that have not consented to forensic matching. Cryptosignatures would protect both the regular users at FamilyTreeDNA and GEDmatch and the companies' bottom lines.

Yaniv was right! It's time for the DNA testing companies to adopt cryptographic signatures to protect the entire industry.
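To make the idea of authenticating a file's origin concrete, here is a toy Python sketch using only the standard library. It is a stand-in, not the proposed scheme: it uses an HMAC, which relies on a secret shared between the issuing and accepting sites, whereas a real cross-company system would more likely use public-key signatures so that accepting sites never hold the signing key. The key, the sample data, and the function names are all hypothetical.

```python
import hashlib
import hmac


def sign_file(raw_data: bytes, company_key: bytes) -> str:
    """Return an authentication tag for a raw data file (HMAC-SHA256, hex)."""
    return hmac.new(company_key, raw_data, hashlib.sha256).hexdigest()


def verify_file(raw_data: bytes, tag: str, company_key: bytes) -> bool:
    """Check that the tag matches the file contents, using a constant-time compare."""
    expected = sign_file(raw_data, company_key)
    return hmac.compare_digest(expected, tag)


# Hypothetical key and file contents, for illustration only.
testing_company_key = b"example-secret-held-by-the-testing-company"
raw_file = b"rsid\tchromosome\tposition\tgenotype\nrs0000001\t1\t12345\tAA\n"

tag = sign_file(raw_file, testing_company_key)

# An untouched file verifies; an edited or fabricated file does not.
print(verify_file(raw_file, tag, testing_company_key))
print(verify_file(raw_file + b"rs0000002\t1\t23456\tGG\n", tag, testing_company_key))
```

The point of the sketch is the workflow: the testing company attaches a tag when it issues the file, and the accepting site rejects any upload whose contents no longer match that tag.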
Do you have even a passing interest in the power of genetic genealogy to solve crimes? Are you concerned about the ethical and privacy implications? Then you should read Barbara Rae-Venter's book "I Know Who You Are."
https://thednageek.com/book-review-i-know-who-you-are-by-barbara-rae-venter/
Book Review

I Know Who You Are: How an Amateur Sleuth Unmasked the Golden State Killer and Changed Crime Fighting Forever by Barbara Rae-Venter, 2023, Ballantine Books, New York

Do you have even a passing interest in using genetic genealogy for justice? Do you want to learn how the pros solve cases? Are you concerned about government agents and for-profit companies rifling through the genomic information of citizens? Then you should read Barbara Rae-Venter's book I Know Who You Are, because it's all here.1

Throughout, Rae-Venter interweaves true-crime horror with investigative details and vignettes from her own life. The result is an engaging, well-written book that gives valuable insight into the power and practice of forensic genetic genealogy (FGG; sometimes also called investigative genetic genealogy or IGG).

The book has a glaring weakness, though: its treatment of the ethical and constitutional issues surrounding FGG. It presents a classic example of what my colleague Lindsay Carter calls noble cause bias, in which righteous ends justify inappropriate means.

The Cases

The cases themselves are best described by Rae-Venter herself, so I will only touch on the highlights.

Rae-Venter's first law-enforcement case was that of Lisa Jensen, a woman who had been abducted as a toddler and abandoned a few years later in California. Lisa wanted to know her true identity, and Rae-Venter volunteered to help adoptees find their biological families as a hobby in retirement. A collaboration was a good fit.

In working Lisa's case, Rae-Venter's team stumbled across a grisly quadruple-murder cold case in New Hampshire. Despite being from opposite sides of the country, Lisa's case and the Bear Brook murders were inextricably related. And Rae-Venter solved them both.

Eventually, news that Rae-Venter had identified Lisa's mother reached FBI lawyer Steve Kramer and Contra Costa County Detective Paul Holes.
These two California-based law-enforcement personnel were investigating a long-dormant serial rapist and serial murderer dubbed the Golden State Killer. Although he'd escaped justice for more than 40 years, Rae-Venter was able to identify him as Joseph James DeAngelo in only 2 months using genetic genealogy. True to the title of the book, her work changed crime fighting forever.

Rae-Venter recounts several more cases in her book, including another serial rapist, two unidentified murder victims, and a Baby Doe (an infant abandoned immediately after birth). This work stands as testament to the power and promise of FGG.

The Methods

Genealogists will appreciate that Rae-Venter takes time to explain how she solved each of these cases. It's fascinating to see how genetic genealogy works, much like a ratchet notching you one cog at a time ever closer to the answer.

The overall process is the same whether you're trying to identify an unknown parent, a criminal suspect, or a Doe. You start with a list of people who share DNA with your case and build speculative trees for these so-called DNA matches. The goal is to figure out how some of them are related to one another. Then, you work your way forward in time from their most recent common ancestor (MRCA) until you find where your case fits in. This is, of course, an oversimplification, but it's a reasonable one given that this book is targeted to a lay audience.

A key theme in the book is scientific advance, both in DNA analysis and genealogy methods. DeAngelo's genetic profile came from a rape kit that had been stored in a freezer for decades. The DNA was still in excellent shape, and the lab analysis was fairly straightforward. The Bear Brook victims, on the other hand, were so badly decayed that even the standard law-enforcement profile, called CODIS, failed. Genetic genealogy profiles, which are about 35,000 times more detailed than CODIS (≈700,000 DNA data points versus ≈20), seemed hopeless.
Rae-Venter's grasshopper mind allowed her to extrapolate from a news article about 145-year-old remains unearthed during a home renovation to the potential of analyzing hair shafts of the Bear Brook victims. That led to a productive collaboration with Professor Ed Green of the University of California, Santa Cruz. Dr. Green's lab has since developed methods that have assisted numerous FGG cases by Rae-Venter and others.

In parallel, Rae-Venter highlights advances in genetic genealogy methods. DNA segment triangulation was cutting edge in 2015 when Rae-Venter began work on the Jensen case, but it has been superseded by more efficient and effective methods like clustering and the What Are the Odds (WATO) tool. Five years from now, clustering and WATO may well be outdated. Genetic genealogy is still a rapidly developing science and new methods arise frequently, which is part of what makes it so exciting.

One aspect of Rae-Venter's approach with which I disagree is her insistence that every name added to a speculative tree be supported by documentation. Think of it this way: a DNA match who is a 3rd cousin has sixteen great-great grandparents. Only two of them (highlighted in the hypothetical tree below) are relevant to your case. Initially, you don't know which two.

If you were to find documentary records for every one of the match's ancestors, you'd have to research 30 people when only five are relevant: one parent, one grandparent, one great grandparent, and the shared great-great grandparent couple. That's six times more work than is necessary. For a fourth cousin, the difference is more than 10-fold (62 ancestors with 6 relevant). Worse, to identify the MRCA, you'd have to build such trees for several matches until you found the ancestors common to all of them.

A far more efficient approach is to build a quick and dirty tree for each match with little to no documentation. This allows you to quickly identify likely MRCAs.
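The ancestor counts quoted above follow from simple doubling, which a short Python sketch can reproduce. The function names are invented for this illustration, and the sketch assumes a tree with no pedigree collapse, so every generation has twice as many ancestors as the one before.

```python
def ancestors_to_research(cousin_degree: int) -> int:
    """Total ancestors of a match, from parents up to the generation that holds
    the shared couple (3rd cousins share great-great grandparents, 4 generations up)."""
    generations = cousin_degree + 1
    return sum(2 ** g for g in range(1, generations + 1))


def relevant_ancestors(cousin_degree: int) -> int:
    """Ancestors on the one line that matters: one person per generation below
    the shared couple, plus the couple themselves."""
    generations = cousin_degree + 1
    return (generations - 1) + 2


for degree in (3, 4):
    total = ancestors_to_research(degree)
    relevant = relevant_ancestors(degree)
    print(f"cousin degree {degree}: {total} ancestors, {relevant} relevant, "
          f"{total / relevant:.1f}x the necessary work")
```

For a third cousin this gives 30 ancestors with 5 relevant (6 times the work), and for a fourth cousin 62 with 6 relevant (a little over 10 times), matching the figures in the text.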
Then, you double back to corroborate only the relevant lines with documentation. Rae-Venter argues that one errant name can set a search back countless days or even months, but I would argue the opposite; too much attention to irrelevant detail will prolong the search unnecessarily.

Noble Cause Bias

The term noble cause bias has its roots in the concept of noble cause corruption, in which someone perceives that their desired end is more important than the law. To be clear, Rae-Venter's actions and those of her law-enforcement colleagues were not corrupt in the legal sense; at the time, there were no laws about FGG at all, and few exist even now.

However, the field is an ethical and constitutional minefield, and Rae-Venter's blindness on that front is stunning. As she writes, "Identifying victims and criminals felt like a good and noble endeavor, and I did not see it as a two-sided issue."

The first clue that the issue does, indeed, have two sides came early on, when all but one genetic genealogy company refused to analyze the Golden State Killer's DNA. Steve Kramer approached all of them, but only FamilyTreeDNA was willing to help. Rae-Venter writes I personally did not feel that there were any pressing ethical issues stemming from the use of commercial DNA sites in criminal cases, even though the majority of the industry obviously disagreed.

The second clue was when then-CEO Bennett Greenspan agreed on the spot to participate but asked that his company's involvement be hidden from the public. Oh c'mon. If exposing his 1.5 million customers to a criminal investigation without their consent was ethically sound, why hide it? Instead, when the arrest was made, one of the investigative team members threw GEDmatch under the bus, although its owners had no knowledge of the case at the time.

The third clue was the immediate outrage among both genealogists and the broader public when the infiltration of genealogy databases was revealed.
The firestorm may have cost the industry more than a billion dollars (billion with a B). Consider the graph below. The dashed lines project how large each database would be today if sales hadn't taken a huge hit immediately after the Golden State Killer story became public in April 2018. By my estimates, nearly twice as many people would have done genealogy tests by now had the ethical implications been addressed before FGG was used.

And yet, Rae-Venter doubles down: "I believe the community actually needs to push for more access from [AncestryDNA and 23andMe], until they give their subscribers the option to help law-enforcement and permit forensic files to be uploaded to their sites."

Never mind that customers can already help law enforcement by uploading their DNA files to a database that permits that use; they don't want to.

The Ethics

I'll save a deep analysis of the ethical concerns surrounding FGG for another post. For now, ask yourself whether the U.S. government has the right to analyze your genome (and hand that information off to for-profit databases and private contractors) without your knowledge or consent. Bear in mind that your genome is more than just the DNA fingerprints that law enforcement has used for decades; it contains highly personal familial and biomedical information that the Supreme Court itself recognized as posing Fourth Amendment concerns.2

I'm not asking about criminals; I'm asking about you, your children, and all of the other innocent people in your life. Are you okay with that?

Because that's what happened to Teresa in the book. Teresa's crime? None. None whatsoever. Teresa's cousin was so traumatized by an unwanted pregnancy that she mentally detached from her own body for 9 months, gave birth in a dirty bathroom in less than 15 minutes, panicked, and walked away. And that apparently made Teresa's DNA fair game. She probably has no idea even now.

Regardless of where you stand on FGG, you should read this book. It may thrill you.
It may appall you. It may do both. But it will definitely get you thinking.

1 This is an affiliate link that will pay a small commission if you purchase the book after clicking on it. The cost is the same for you, and the income helps to keep this blog free.

2 Maryland v King, p. 27.
Are you a beginner at DNA? Intermediate? Advanced? How can you tell? It's not like we've got a standardized test for genetic genealogy.

Perhaps we should have one. We could call it the Genetic Analysis Competency Test, or GACT. Get it? If you don't, you might be a beginner.1

Or maybe I just tell bad jokes.

[Figure: Goldilocks, from Fairy Tales, 1918, compiled by Rose Allyn, Stanton and Van Vliet Company, p 73.]

Now that online education is so widespread and accessible, knowing where you stand is important. I've taught workshops where the biggest complaint was that the material was too advanced, followed closely by gripes that it wasn't advanced enough. In the same class!

Being able to gauge your experience level can help with this Goldilocks problem. With a standard metric, you can better decide which lectures or courses best fit your needs and your budget. And your instructors will be able to tailor their materials more precisely to their audiences.

So how do we gauge our skill levels? After all, there's no one bit of information you need to learn to go from beginner to intermediate nor a single skill you must master to jump from intermediate to advanced. It's a continuum.

Layer onto that the fact that genetic genealogy is still a rapidly developing field, and there's no one target we must hit. An advanced person who walked away today would come back in 2 years as intermediate. And if they left for 5 years they might well feel like a beginner. That's part of what makes genetic genealogy so exciting!

That's why we can't use a metric based on knowing A, B, and C. In 5 years, A, B, and C might be outdated (Remember mirroring?) and the essential skills will be X, Y, and Z.

Broadly, a beginner is new to the field and spends most of their effort learning how to use DNA for genealogy. An advanced person has mastered most of the methods and spends the bulk of their time applying their knowledge.
And intermediate is somewhere in the middle.

We can break this down further: terminology, methods, databases, interpretation, third-party tools, ethics, and more. Each of those can have beginner, intermediate, and advanced levels, as summarized in the table below.

Of course, very few people will fit tidily into just one column, perhaps only the absolute beginners and the most experienced among us. Everyone else will tick boxes in two or three columns. That's okay. This isn't a pass-fail exercise; it's meant to help you evaluate your own level so you can get the most out of your next learning experiences.

Finally, there is another level that transcends these categories: the pioneers. These are the people who push the boundaries of genetic genealogy. They invent new methods and create new tools to help us all. People like Dana Leeds, who invented the Leeds method. Or Jonny Perl, whose DNA Painter site has become almost mandatory for DNA interpretation. Or Rob Warthen and the DNAGedcom tools. Or Kevin Borland and his Borland Genetics database of reconstructed ancestral genomes. There are many other bright minds and novel thinkers, too many to list, who will push us beyond today's ABCs and into tomorrow's XYZs.

So where do you fit? What would you like to learn next?

1 For the beginners: DNA is a long chemical made up of four subunits, like beads on a string. Their true chemical names are beasts, so we use shorthand: guanine, adenine, cytosine, and thymine. G, A, C, and T. Get it now?