Abstract
Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60–80% and with an estimated precision of 78–94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.
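The recall and precision figures above are described as measured relative to existing manually curated knowledge bases. A minimal sketch of that kind of set-based comparison follows, assuming each association can be represented as a (variant ID, phenotype) pair. The function name `precision_recall` and the toy identifiers are hypothetical and are not taken from GWASkb or its evaluation protocol.

```python
def precision_recall(extracted, curated):
    """Set-based precision and recall of extracted pairs against a reference set."""
    extracted, curated = set(extracted), set(curated)
    matched = extracted & curated  # associations present in both sets
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical toy data: (variant ID, phenotype) pairs, not real associations.
curated = {
    ("rs0001", "trait A"),
    ("rs0002", "trait A"),
    ("rs0003", "trait B"),
    ("rs0004", "trait C"),
    ("rs0005", "trait C"),
}
extracted = {
    ("rs0001", "trait A"),
    ("rs0002", "trait A"),
    ("rs0003", "trait B"),
    ("rs0009", "trait B"),  # not in the curated set
}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch an extracted pair that is missing from the curated set counts against precision, even though it could be a correct association that the curators had not yet recorded. Estimates made relative to a curated reference should be read with that limitation in mind.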
| Original language | English (US) |
|---|---|
| Article number | 3341 |
| Journal | Nature Communications |
| Volume | 10 |
| Issue number | 1 |
| State | Published - Dec 1 2019 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- General Chemistry
- General Biochemistry, Genetics and Molecular Biology
- General
- General Physics and Astronomy