65 research outputs found
MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide
Bibliographic records often contain author affiliations as free-form text strings. Ideally one would be able to automatically identify all affiliations referring to any particular country or city such as Saint Petersburg, Russia. That introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24k extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1M extracted word n-grams, each pointing to a unique country (or a US state) for disambiguation. When applied to a collection of 12.7M affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1%. For the 4.2M mappings to the USA, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0%) and GeoMaker got 274 (91.3%) while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1%). The incompleteness rate is about 2%, mostly due to a lack of information, e.g., the affiliation simply says “University of Illinois”, which can refer to one of five different campuses. A search interface called MapAffil is available from http://abel.lis.illinois.edu/; the full PubMed affiliation dataset and batch processing are available upon request. The longitude and latitude of the geographical city-center are displayed when a city is identified.
This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data. NIH P01AG039347; NSF 1348742
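The two components named in the abstract (a name list for candidate look-up and word n-grams for disambiguation) can be illustrated with a minimal Python sketch. This is not the MapAffil implementation: the tiny `GAZETTEER` and `NGRAM_COUNTRY` tables, the function names, and the rule "keep a candidate only if exactly one matches a hinted country" are all assumptions made for illustration.

```python
import re

# Hypothetical toy gazetteer: name variant -> candidate (city, country, lat, lon).
SPB_RUSSIA = ("Saint Petersburg", "Russia", 59.94, 30.31)
SPB_USA = ("St. Petersburg", "USA", 27.77, -82.64)
GAZETTEER = {
    "saint petersburg": [SPB_RUSSIA, SPB_USA],
    "st petersburg": [SPB_RUSSIA, SPB_USA],
    "leningrad": [SPB_RUSSIA],
}

# Hypothetical disambiguation hints: word n-gram -> the single country it points to.
NGRAM_COUNTRY = {
    "russian academy": "Russia",
    "florida": "USA",
    "fl": "USA",
}


def word_ngrams(text, max_n=3):
    """Return the set of lowercase word n-grams (n = 1..max_n) in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def map_affiliation(affiliation):
    """Return one (city, country, lat, lon) tuple, or None if unresolved."""
    grams = word_ngrams(affiliation)

    # Step 1: candidate look-up against the name variants.
    candidates = []
    for gram in grams:
        for candidate in GAZETTEER.get(gram, []):
            if candidate not in candidates:
                candidates.append(candidate)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    # Step 2: disambiguate using n-grams that point to a unique country.
    hinted = {NGRAM_COUNTRY[g] for g in grams if g in NGRAM_COUNTRY}
    matching = [c for c in candidates if c[1] in hinted]
    return matching[0] if len(matching) == 1 else None
```

For example, `map_affiliation("University of South Florida, St. Petersburg, FL")` resolves to the USA candidate in this toy setup, while the bare string `"Saint Petersburg"` stays unresolved because nothing in it points to a country.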
A population-based statistical approach identifies parameters characteristic of human microRNA-mRNA interactions
BACKGROUND: MicroRNAs are ~17–24 nt noncoding RNAs found in all eukaryotes that degrade messenger RNAs via RNA interference (if they bind with perfect or near-perfect complementarity to the target mRNA), or arrest translation (if the binding is imperfect). Several microRNA targets have been identified in lower organisms, but only one mammalian microRNA target has yet been validated experimentally. RESULTS: We carried out a population-wide statistical analysis of how human microRNAs interact complementarily with human mRNAs, looking for characteristics that differ significantly as compared with scrambled control sequences. These characteristics were used to identify a set of 71 outlier mRNAs unlikely to have been hit by chance. Unlike the case in C. elegans and Drosophila, many human microRNAs exhibited long exact matches (10 or more bases in a row), up to and including perfect target complementarity. Human microRNAs hit outlier mRNAs within the protein coding region about 2/3 of the time. Moreover, the stretches of perfect complementarity within microRNA hits onto outlier mRNAs were not biased near the 5'-end of the microRNA. In several cases, an individual microRNA hit multiple mRNAs that belonged to the same functional class. CONCLUSIONS: The analysis supports the notion that sequence complementarity is the basis by which microRNAs recognize their biological targets, but raises the possibility that human microRNA-mRNA target interactions follow different rules than have been previously characterized in Drosophila and C. elegans.
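As a rough illustration of comparing real microRNAs against scrambled controls, the Python sketch below measures the longest stretch of exact Watson-Crick complementarity between a microRNA and an mRNA, then estimates how often shuffled versions of the microRNA do at least as well. The function names, the number of controls, and the choice of statistic are assumptions for illustration; the abstract does not specify the authors' actual procedure, and G-U wobble pairing is ignored here.

```python
import random

# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the sequence that pairs perfectly with `rna`, read 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def longest_exact_match(mirna, mrna):
    """Length of the longest run of consecutive perfectly paired bases."""
    site = reverse_complement(mirna)  # what the mRNA must contain to pair
    best = 0
    prev = [0] * (len(mrna) + 1)
    # Longest-common-substring dynamic programme over (site, mrna).
    for i in range(1, len(site) + 1):
        cur = [0] * (len(mrna) + 1)
        for j in range(1, len(mrna) + 1):
            if site[i - 1] == mrna[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def empirical_p_value(mirna, mrna, n_controls=200, seed=0):
    """Compare the observed match length with scrambled-microRNA controls."""
    rng = random.Random(seed)
    observed = longest_exact_match(mirna, mrna)
    at_least_as_long = 0
    for _ in range(n_controls):
        bases = list(mirna)
        rng.shuffle(bases)  # scrambled control keeps base composition
        if longest_exact_match("".join(bases), mrna) >= observed:
            at_least_as_long += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_long + 1) / (n_controls + 1)
```

An mRNA with an unusually small empirical p-value under such a test would be a candidate "outlier" in the sense used above, although the real analysis combined several characteristics rather than a single statistic.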
Measures of novelty in biomedical literature
We introduce several measures of novelty for a scientific article in MEDLINE based on the concepts associated with it. The concepts associated with an article are identified using the Medical Subject Headings (MeSH) assigned to the article. A temporal profile was computed for each MeSH term (and for each pair of MeSH terms) based on their overall occurrences in MEDLINE, after which papers were labeled by their most novel MeSH term and pair of MeSH terms, as measured in years and volume of prior work. Across all papers in MEDLINE published since 1985, we find that individual concept novelty is rare (5.4% of papers have a MeSH […] 50 papers about 90% had increasing individual novelty scores over their career on average, but the variability also increased. There is little, if any, correlation between the author age and the time-point of their most novel work. Our measures can be accessed at http://abel.lis.illinois.edu/gimli/novelty. NIH P01AG039347; NSF 1348742
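The idea of scoring a paper by its most novel MeSH term (or pair of terms), measured both in years and in volume of prior work, can be sketched in Python as follows. This is an illustrative reading of the abstract, not the authors' code: the input format, the function names, and the choice to count only papers from strictly earlier years as "prior work" are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def concepts(mesh_terms, pairs=False):
    """Return the individual terms, or all unordered pairs of terms."""
    terms = sorted(set(mesh_terms))
    return list(combinations(terms, 2)) if pairs else terms


def novelty_scores(papers, pairs=False):
    """Map paper_id -> (age_in_years, prior_papers) of its most novel concept.

    `papers` is a list of (paper_id, year, mesh_terms) tuples.
    Assumption: only papers from strictly earlier years count as prior work.
    """
    first_year = {}                # concept -> year it first appeared
    prior_count = defaultdict(int)  # concept -> papers seen in earlier years
    by_year = defaultdict(list)
    for paper_id, year, mesh_terms in papers:
        by_year[year].append((paper_id, mesh_terms))

    scores = {}
    for year in sorted(by_year):
        # Score this year's papers against the profile of earlier years.
        for paper_id, mesh_terms in by_year[year]:
            found = concepts(mesh_terms, pairs)
            if not found:
                continue
            age = min(year - first_year.get(c, year) for c in found)
            volume = min(prior_count[c] for c in found)
            scores[paper_id] = (age, volume)
        # Then fold this year's papers into the temporal profile.
        for paper_id, mesh_terms in by_year[year]:
            for c in concepts(mesh_terms, pairs):
                first_year.setdefault(c, year)
                prior_count[c] += 1
    return scores
```

Calling `novelty_scores(papers)` gives individual-concept novelty and `novelty_scores(papers, pairs=True)` gives combinatorial novelty; a score of `(0, 0)` means the paper introduces a concept (or pair) with no earlier history in the collection.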
Predicting Medical Subject Headings Based on Abstract Similarity and Citations to MEDLINE Records
We describe a classifier-enhanced nearest neighbor approach to assigning Medical Subject Headings (MeSH) to unlabeled documents using a combination of abstract similarities and direct citations to labeled MEDLINE records. The approach frames the classification problem by decomposing it into sets of siblings in the MeSH hierarchy (e.g., training a classifier for predicting "Heterocyclic Compounds, 2-Ring" vs. other "Heterocyclic Compounds"). Preliminary experiments using a small but diverse set of MeSH terms show the highest performance when using both abstracts and citations, compared to each alone, coupled with a non-naive classifier: 90+% precision and recall with 10-fold cross-validation. NLM's Medical Text Indexer (MTI) tool achieves similar overall performance but varies more across the terms tested. For example, MTI performs better on "Heterocyclic Compounds, 2-Ring", while our approach performs better on Alzheimer Disease and Neuroimaging. Our approach can be applied broadly to documents with abstracts that are similar to (or cite) MEDLINE abstracts, which would help linking and searching across bibliographic databases beyond MEDLINE.
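The nearest neighbor part of this approach, combining abstract similarity with direct citations to labeled records, might look roughly like the Python sketch below. It is a simplified stand-in: the bag-of-words cosine similarity, the additive `citation_bonus`, and the score-weighted vote are assumptions, and the sibling-level classifiers described in the abstract are not implemented here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_mesh(abstract, cited_ids, labeled, k=2, citation_bonus=0.5):
    """Rank MeSH terms by votes from the k most similar labeled records.

    `labeled` maps record_id -> (abstract_text, set_of_mesh_terms).
    Records the document cites directly get a fixed (assumed) score bonus.
    """
    query = bag_of_words(abstract)
    scored = []
    for record_id, (text, _mesh) in labeled.items():
        score = cosine(query, bag_of_words(text))
        if record_id in cited_ids:
            score += citation_bonus
        scored.append((score, record_id))
    scored.sort(reverse=True)

    votes = Counter()
    for score, record_id in scored[:k]:
        if score > 0:  # ignore neighbors with no evidence at all
            for term in labeled[record_id][1]:
                votes[term] += score
    return [term for term, _ in votes.most_common()]
```

In a fuller version, the ranked candidates would then be passed to per-sibling-set classifiers to decide, for instance, between "Heterocyclic Compounds, 2-Ring" and the other "Heterocyclic Compounds" terms.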
Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database
We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu. NIH P01AG039347; NSF 1348742
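The pipeline described here (look up instances of a name, attach countries of affiliation, map countries probabilistically to ethnicities, and assign the dominant ethnicity or pair) can be sketched in Python as follows. Everything concrete in it is hypothetical: the `NAME_INSTANCES` and `COUNTRY_TO_ETHNICITY` tables, the class labels, and the `pair_ratio` threshold for reporting a pair are illustrative assumptions, not Ethnea's data or rules.

```python
from collections import Counter

# Hypothetical P(ethnicity | country of affiliation).
COUNTRY_TO_ETHNICITY = {
    "Italy": {"ITALIAN": 1.0},
    "UK": {"ENGLISH": 1.0},
    "USA": {"ENGLISH": 0.6, "HISPANIC": 0.2, "CHINESE": 0.2},
}

# Hypothetical geo-coded instances of a name in a bibliographic database.
NAME_INSTANCES = {
    "andrea": ["Italy", "Italy", "Italy", "UK", "USA"],
}


def classify_ethnicity(name, pair_ratio=0.5):
    """Return the dominant ethnicity, a pair 'A-B', or None if unknown.

    A pair is reported when the runner-up has at least `pair_ratio`
    times the weight of the top ethnicity (an assumed rule).
    """
    countries = NAME_INSTANCES.get(name.lower())
    if not countries:
        return None

    # Accumulate ethnicity weights over all instances of the name.
    weights = Counter()
    for country in countries:
        for ethnicity, prob in COUNTRY_TO_ETHNICITY.get(country, {}).items():
            weights[ethnicity] += prob
    if not weights:
        return None

    ranked = weights.most_common(2)
    top, top_weight = ranked[0]
    if len(ranked) == 2 and ranked[1][1] >= pair_ratio * top_weight:
        return f"{top}-{ranked[1][0]}"
    return top
```

A pair label such as the one this toy data produces for "Andrea" is the kind of signal that could feed an ethnicity-specific gender prediction, since the name's typical gender differs between the two groups.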
Sex-bias in biomedical research: a bibliometric perspective
Models of human disease have traditionally been biased towards the male body. Here, we perform a retrospective study of factors that may have contributed to (reducing) this bias across a variety of biomedical topics and study types in the USA during 1987-2009. NIH P01AG039347; NSF 1348742
Quantifying conceptual novelty in the biomedical literature
We present several measures and methods for quantifying the conceptual novelty of an article in the biomedical literature, using a collection of 22 million MEDLINE articles. Our results show the prevalence of combinatorial novelty in the biomedical literature, along with its complex correlations with the age of the author and with impact as measured through citations. We make the data and source code available through a web-based interface at http://abel.lis.illinois.edu/gimli/. NIH P01AG039347; NSF 1348742
Introducing the Author-ity Exporter, and a case study of geo-temporal movement of authors
We introduce a web service, Author-ity Exporter, that permits searching and exporting data from Author-ity, a database in which PubMed author names have been disambiguated with a high degree of accuracy [1]. Each author is represented by a cluster of papers annotated by publication count, time-span, affiliations, topics, journals, co-authors, and citations, as well as imputed data from MapAffil [2], Genni [3], and Ethnea [4] and links to their NIH/NSF grants and USPTO patents; and we have plans for more. This service should enable and simplify new types of author-centered bibliometric analyses with a unique strength in funding, geography, and diversity (gender, ethnicity, and professional age). We also present an illustrative case study modeling authors’ career movements to and from a specific city based on data retrieved from Author-ity Exporter. The service (and the R code used in the case study) are available at http://abel.ischool.illinois.edu/cgi-bin/exporter/search.pl. NIH P01AG039347
- …