Genomic data from large knowledgebases (KBs) such as Gene Ontology (GO) are increasingly being integrated into clinical diagnosis and disease prediction. This integration is particularly useful for predicting complex diseases whose highly heterogeneous genotypes make biomarker identification difficult. Several types of machine learning (ML) models have been applied to identify the relatively small number of disease-associated genetic sequences among the large number of common variants carried by an individual [1]. Biomedical knowledge organization systems (KOSs) that encode relationships between variants, genes, and diseases promise to improve the precision and overall performance of such ML models.
I use GO as a case study of a biomedical KOS that provides rich annotation data in a directed acyclic graph (DAG) structure, which can serve as training data for ML algorithms that identify potential disease-triggering gene products. Unlike approaches in bioinformatics, my research does not focus on designing models and packages. Rather, I apply theories from knowledge organization and library science to evaluate the current design of KOSs for biological research, examining data quality, ontology structure, and cross-links to external resources such as the Disease Ontology. Past findings in this area have come from either bioinformatics or computer science scholars. My role is to show the importance of knowledge work in bridging these two communities, and to discuss how ontology data can be used with large language models (LLMs) to improve trustworthiness and precision.
I am currently testing the collection of GO annotation data to identify gene products that may be associated with autism, using three ML algorithms: Random Forest, Support Vector Machine, and Gradient Boosting. A demo of this preliminary step will be presented at the DCMI 2024 NKOS workshop (https://www.dublincore.org/conferences/2024/sessions/nkos-workshop/).
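To illustrate how DAG-structured annotations might be turned into input for such classifiers, the following is a minimal sketch and not the pipeline used in this study. It assumes a toy ontology and toy gene annotations: the term identifiers (`T:...`), gene names, and function names are all hypothetical and do not correspond to real GO terms or real data. Each gene's annotations are propagated to all ancestor terms, a common convention for ontology annotations, and then encoded as a binary feature vector.

```python
from collections import deque

# Hypothetical toy DAG: each term maps to its parent terms.
# These are placeholder identifiers, not real GO terms.
PARENTS = {
    "T:root": [],
    "T:signaling": ["T:root"],
    "T:synaptic": ["T:signaling"],
    "T:metabolic": ["T:root"],
    "T:neuro_metabolic": ["T:metabolic", "T:synaptic"],
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "gene1": {"T:synaptic"},
    "gene2": {"T:neuro_metabolic"},
    "gene3": {"T:metabolic"},
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def propagate(terms, parents):
    """Close a set of direct annotations under the ancestor relation."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed


def feature_matrix(annotations, parents):
    """Build one binary vector per gene, with one column per ontology term."""
    columns = sorted(parents)
    rows = {}
    for gene, terms in annotations.items():
        closed = propagate(terms, parents)
        rows[gene] = [1 if column in closed else 0 for column in columns]
    return columns, rows


columns, rows = feature_matrix(ANNOTATIONS, PARENTS)
for gene, vector in rows.items():
    print(gene, vector)
```

Vectors of this kind, paired with disease-association labels, could then be passed to classifiers such as the three named above. How labels are obtained and how data quality affects the result are outside the scope of this sketch.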