Charles Tapley Hoyt

@cthoyt@scholar.social
Bio/cheminformatician, software developer, open scientist. 🇺🇸 living in 🇩🇪🇪🇺 (he/him/searchable/#NoBridge)
Website: https://cthoyt.com
GitHub: https://github.com/cthoyt
ORCID: 0000-0003-4423-4370
Gravatar: https://gravatar.com/charlestapleyhoyt

@egonw are you aware of the work I've been doing on semantic mappings? This could be used to feed into @bridgedb (we mention this in the preprint)

preprint: https://www.biorxiv.org/content/10.1101/2025.04.16.649126

code: https://github.com/biopragmatics/semra

Assembly and reasoning over semantic mappings at scale for biomedical data integration

Motivation: Hundreds of resources assign identifiers to biomedical concepts including genes, small molecules, biological processes, diseases, and cell types. Often, these resources overlap by assigning identifiers to the same or related concepts. This creates a data interoperability bottleneck, as integrating data sets and knowledge bases that use identifiers for the same concepts from different resources requires such identifiers to be mapped to each other. However, available mappings are incomplete and fragmented across individual resources, motivating their large-scale integration.

Results: We developed SeMRA, a software tool that integrates mappings from multiple sources into a graph data structure. Using graph algorithms, it infers missing mappings implied by available ones while keeping track of provenance and confidence. This allows connecting identifier spaces between which direct mapping was previously not possible. SeMRA is customizable and takes a declarative specification as input describing sources to integrate with additional configuration parameters. We make available an aggregated mappings resource produced by SeMRA consisting of 43.4 million mappings from 127 sources that jointly cover identifiers from 445 ontologies and databases. We also describe benchmarks on specific use cases such as integrating mappings between resources cataloging diseases or cell types.

Availability: The code is available under the MIT license at https://github.com/biopragmatics/semra. The mappings database assembled by SeMRA is available at https://zenodo.org/records/15208251.

Competing Interest Statement: The authors have declared no competing interest.

bioRxiv
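
To make the idea of "inferring missing mappings implied by available ones" concrete, here is a toy sketch in plain Python. It is not SeMRA's API: the function name `infer_chained`, the identifiers, and the rule of multiplying confidences along a chain are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical exact-match mappings: (subject, object, confidence).
# The identifiers are made up and do not refer to real records.
MAPPINGS = [
    ("resourceA:1", "resourceB:1", 0.9),
    ("resourceB:1", "resourceC:1", 0.8),
]


def infer_chained(mappings):
    """Infer A-C mappings from two-step chains A-B-C.

    Returns {(a, c): (confidence, path)}. The path records provenance,
    and the confidence is the product of the two edge confidences
    (an assumed scoring rule, chosen only for this sketch).
    """
    neighbors = defaultdict(dict)
    for subject, obj, conf in mappings:
        neighbors[subject][obj] = conf
        neighbors[obj][subject] = conf  # treat exact matches as symmetric

    inferred = {}
    for middle, adjacent in neighbors.items():
        for a, conf_a in adjacent.items():
            for c, conf_c in adjacent.items():
                # Skip self-pairs, mirrored pairs, and pairs already mapped directly.
                if a >= c or c in neighbors[a]:
                    continue
                conf = conf_a * conf_c
                best = inferred.get((a, c))
                if best is None or conf > best[0]:
                    inferred[(a, c)] = (conf, [a, middle, c])
    return inferred


if __name__ == "__main__":
    for pair, (conf, path) in infer_chained(MAPPINGS).items():
        print(pair, round(conf, 2), " -> ".join(path))
```

The sketch only follows chains of length two and keeps the highest-confidence chain per pair. A real tool would also have to handle longer chains, mapping predicates other than exact matches, and conflicting evidence.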

@nichtich I've finally preprinted SeMRA - my tool for assembling and reasoning over semantic mappings (especially from SSSOM). Might be interesting for you

https://www.biorxiv.org/content/10.1101/2025.04.16.649126

code: https://github.com/biopragmatics/semra

Unable to specify a default value for a generic parameter · Issue #3737 · python/mypy

Simplified Example:

    from typing import TypeVar

    _T = TypeVar('_T')

    def foo(a: _T = 42) -> _T:  # E: Incompatible types in assignment (expression has type "int", variable has type "_T")
        return a

Real ...

GitHub
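
One commonly used way around this kind of error is `typing.overload`: one signature for the call without an argument, one generic signature for the call with an argument. The sketch below applies that to the simplified `foo` from the issue. It is an illustration of the general technique, not necessarily the resolution reached in the issue thread or in the python-typing-dilemma repository.

```python
from typing import Any, TypeVar, overload

_T = TypeVar("_T")


@overload
def foo() -> int: ...
@overload
def foo(a: _T) -> _T: ...
def foo(a: Any = 42) -> Any:
    # Runtime implementation; callers are type-checked against the
    # two overload signatures above, not against this one.
    return a


if __name__ == "__main__":
    print(foo())       # checked as int
    print(foo("abc"))  # checked as str
```

The cost is that the default value and its type are stated separately, once in the implementation and once in the zero-argument overload, so the two have to be kept in sync by hand.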

anyone have some time to help me with a tricky #python #typing and #mypy problem?

it's fully self-contained in https://github.com/cthoyt/python-typing-dilemma. It contains some examples of things I tried, and why they didn't work

it hinges on using PEP-696 defaults in typing.TypeVar, introduced in Python 3.13

Data Standardization and Integration with the Bioregistry at Biocuration 2025

GitHub site for the Biopragmatics Stack

Biopragmatics Stack
@PIDNetworkDE are you aware of @bioregistry which is a registry of PID schemas in the life and natural sciences?

If you are working with #SSSOM and are wondering about how you were supposed to pronounce the acronym, I’ve got you covered: incenp.org/notes/2025/sssom-pr…

(Sorry to all those who had never heard about #SSSOM until now; most likely you can safely continue to ignore it, unless you happen to be working on semantic mappings, in which case I hope it might interest you.)

SSSOM Pronunciation Alignment Chart

The final session of Day 2 of #biocuration2025 is on glycans. Kiyoko Aoki-Kinoshita explains that the name for GlyTouCan (https://glytoucan.org) is a play on words in Japanese, where "Tou" means sugar!

#glycans #glygen #glytoucan #biocuration

Glycan Repository

We have a survey up collecting information about the field of biocuration. This information will be incorporated into the biocuration careers workshop during the #Biocuration2025 conference.
https://forms.gle/QGfRoC3caFcUqxZb7
Biocuration Community Survey 2025

This survey aims to gather and analyze information about the field of biocuration. This survey is being conducted by the International Society for Biocuration (ISB) to identify potential gaps or inequities among biocurators and to identify areas where the ISB may be able to take actions to improve awareness of biocuration. This is a follow-up with our community to assess the progress made since we began surveying the community in 2017. The results from past surveys are available here: https://www.biocuration.org/dissemination/survey-results/ (see ISB Career Description Survey Results). The resulting data will be aggregated, analyzed, and shared with the community. No identifying information will be revealed in reporting results of this survey. Thank you for your participation.

Google Docs