DOI → https://doi.org/10.1016/j.imu.2025.101693
#genomics #bioinformatics #FAIRdata
PhenoQC: fast quality checks for clinical phenotype tables in genomic research
Phenotypic tables power genotype–phenotype studies, but errors, missing values, and inconsistent terminology slow analysis and bias results. PhenoQC is a configuration-driven toolkit that combines three steps in one workflow: schema validation, ontology mapping, and missing-data imputation. It checks table structure and data types against a JSON schema, aligns free-text phenotype terms to standard ontologies (HPO, DO, MPO) using exact, synonym, and fuzzy matching, and fills gaps with either simple baseline methods or model-based methods (KNN, MICE, and low-rank SVD). It then audits how imputation changed the data using the standardized mean difference, variance ratio, Kolmogorov–Smirnov statistic, population stability index, and Cramér’s V. It scales through chunk-based parallelism and runs from a CLI or a web GUI. In tests, PhenoQC processed up to 100k records with near-linear scaling and reached ≈97–99% ontology-mapping accuracy under text noise. On two UCI clinical datasets (CKD and Heart Disease), it imputed all missing numeric cells and produced clean reports. The output is analysis-ready and reproducible.
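To make the exact, synonym, and fuzzy matching cascade concrete, here is a minimal Python sketch. It is not PhenoQC's code or API: the `LEXICON` dictionary, the `map_term` helper, and the 0.85 similarity threshold are all assumptions chosen for illustration, and the standard library's `difflib` stands in for whatever fuzzy matcher the tool actually uses. The two HPO identifiers are real, but the synonym lists are a hand-written toy subset.

```python
from difflib import SequenceMatcher

# Toy lexicon (assumption): real HPO IDs, hand-picked synonyms for illustration.
LEXICON = {
    "HP:0001250": {"label": "Seizure", "synonyms": ["Seizures", "Epileptic seizure"]},
    "HP:0000822": {"label": "Hypertension", "synonyms": ["High blood pressure"]},
}


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def build_index(lexicon):
    """Return two lookup tables: normalized label -> ID and synonym -> ID."""
    labels, synonyms = {}, {}
    for term_id, entry in lexicon.items():
        labels[normalize(entry["label"])] = term_id
        for synonym in entry["synonyms"]:
            synonyms[normalize(synonym)] = term_id
    return labels, synonyms


def map_term(text, labels, synonyms, threshold=0.85):
    """Try exact label, then synonym, then fuzzy match.

    Returns (term_id, method, score); term_id is None when nothing
    reaches the (hypothetical) similarity threshold.
    """
    key = normalize(text)
    if key in labels:
        return labels[key], "exact", 1.0
    if key in synonyms:
        return synonyms[key], "synonym", 1.0

    # Fuzzy step: score the input against every known label and synonym.
    best_id, best_score = None, 0.0
    for candidate, term_id in {**synonyms, **labels}.items():
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best_id, best_score = term_id, score
    if best_score >= threshold:
        return best_id, "fuzzy", best_score
    return None, "unmapped", best_score


LABELS, SYNONYMS = build_index(LEXICON)

if __name__ == "__main__":
    for raw in ["Seizure", "high blood  pressure", "hypertensoin", "fatigue"]:
        print(raw, "->", map_term(raw, LABELS, SYNONYMS))
```

Ordering the steps from strictest to loosest means a fuzzy guess can never override a clean exact or synonym hit, and returning the method alongside the ID lets a report flag which mappings deserve manual review.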
