Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy
Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background on data and text mining, as well as on knowledge discovery in databases (KDD) and in text (KDT), is presented, followed by a brief review of Swanson's ideas and a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. A report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians. © 2006 Bekhuis; licensee BioMed Central Ltd.
Mining the Web for Medical Hypothesis: A Proof-of-Concept System
As the prevalence of blogs, discussion forums, and online news services continues to grow, so too does the portion of this Web content that relates to health and medicine. We propose that everyday, medically-oriented Web content is a valuable and viable data source for medical hypothesis generation and testing, despite its being noisy. In this paper, we present a proof-of-concept system supporting this notion. We construct a corpus comprising news articles relating to the drugs Vioxx, Naproxen and Ibuprofen, published between 1998 and 2002. Using this corpus, we show that there was a significant link between Vioxx and the concept "Myocardial Infarction" well before the drug was withdrawn from the market in 2004. Indeed, within the Vioxx-related content, the concept ranks amongst the top 3.3% in terms of importance. When compared with the Naproxen and Ibuprofen control literatures, the term occurs significantly more frequently in the Vioxx-related content.
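The corpus-comparison idea above can be sketched in a few lines: measure how often a concept term occurs, relative to corpus size, in the target collection versus a control collection. This is a toy illustration with hypothetical two-sentence "corpora", not the paper's actual data or ranking method:

```python
from collections import Counter
import re

def term_rate(docs, term):
    # Relative frequency of `term` among all word tokens in a document collection
    tokens = [t for d in docs for t in re.findall(r"[a-z]+", d.lower())]
    return tokens.count(term.lower()) / len(tokens)

# Hypothetical stand-ins for the drug-specific news corpora
vioxx_docs = ["vioxx linked to infarction risk in study",
              "infarction reports rise among vioxx users"]
naproxen_docs = ["naproxen relieves arthritis pain in trial",
                 "naproxen dosing guidance updated"]

# The concept occurs more often in the target corpus than in the control
print(term_rate(vioxx_docs, "infarction") > term_rate(naproxen_docs, "infarction"))  # True
```

A real system would use concept recognition rather than raw string matching and a statistical test for the frequency difference, but the comparison against a control corpus is the core of the approach.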
A Linked Data Approach to Sharing Workflows and Workflow Results
A bioinformatics analysis pipeline is often highly elaborate, due to the inherent complexity of biological systems and the variety and size of datasets. A digital equivalent of the "Materials and Methods" section in wet laboratory publications would be highly beneficial to bioinformatics, for evaluating evidence and examining data across related experiments, while introducing the potential to find associated resources and integrate them as data and services. We present initial steps towards preserving bioinformatics "materials and methods" by exploiting the workflow paradigm for capturing the design of a data analysis pipeline, and RDF to link the workflow, its component services, run-time provenance, and a personalized biological interpretation of the results. An example shows the reproduction of the unique graph of an analysis procedure, its results, provenance, and personal interpretation of a text mining experiment. It links data from Taverna, myExperiment.org, BioCatalogue.org, and ConceptWiki.org. The approach is relatively "light-weight" and unobtrusive to bioinformatics users.
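The linking described above boils down to an RDF graph: subject-predicate-object triples connecting a workflow, its run, its services, and a human interpretation. A minimal in-memory sketch, where all identifiers (ex:..., prov:...) are hypothetical stand-ins rather than the actual URIs used by Taverna or myExperiment.org:

```python
# Tiny triple set standing in for the RDF graph linking a workflow,
# one enactment (run), its result, and a personal interpretation.
triples = {
    ("ex:run42",     "prov:wasEnactmentOf", "ex:workflow1"),
    ("ex:workflow1", "ex:hasService",       "ex:textMiningService"),
    ("ex:run42",     "prov:generated",      "ex:resultGraph"),
    ("ex:note7",     "ex:interprets",       "ex:resultGraph"),
    ("ex:note7",     "ex:author",           "ex:biologist1"),
}

def objects(subject, predicate):
    """All objects linked from `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Follow links from a run back to the services its workflow invoked
workflow = objects("ex:run42", "prov:wasEnactmentOf").pop()
print(objects(workflow, "ex:hasService"))  # {'ex:textMiningService'}
```

In practice one would use an RDF library and published vocabularies (e.g. W3C PROV), but the point is the same: once the pipeline, its provenance, and its interpretation live in one linked graph, queries can traverse from any node to the others.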
Deer Herd Management Using the Internet: A Comparative Study of California Targeted By Data Mining the Internet
An ongoing project to investigate the use of the internet as an information source for decision support identified the decline of the California deer population as a significant issue. Using Google Alerts, an automated keyword search tool, text and numerical data were collected from a daily internet search and categorized by region and topic to allow for identification of information trends. This simple data mining approach determined that California is one of only four states that do not currently report total, finalized deer harvest (kill) data online, and that it is the only state that has reduced the amount of information made available over the internet in recent years. Contradictory information identified by the internet data mining prompted the analysis described in this paper, which indicates that the graphical information presented on the California Fish and Wildlife website significantly understates the severity of the deer population decline over the past 50 years. This paper presents a survey of how states use the internet in their deer management programs and an estimate of the California deer population over the last 100 years. It demonstrates how any organization can use the internet for data collection and discovery.
HypTrails: A Bayesian Approach for Comparing Hypotheses About Human Trails on the Web
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music playlists. Understanding the factors that drive the production of these trails can be useful for, e.g., improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors on the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaptation of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music played. Our work expands the repertoire of methods available for studying human trails on the Web.
Comment: Published in the proceedings of WWW'1
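The core mechanic described above can be sketched directly: express each hypothesis as Dirichlet pseudo-counts over a state's outgoing transitions (with equal total concentration for a fair comparison), then compare hypotheses by the marginal likelihood of the observed transition counts, i.e. a Bayes factor. This is a minimal single-state sketch with invented counts, not the paper's implementation or its prior-elicitation (roulette) step:

```python
from math import lgamma

def log_evidence(counts, prior):
    # Dirichlet-multinomial marginal likelihood for one state's outgoing
    # transitions: log p(counts | prior) = log B(counts + prior) - log B(prior),
    # where B is the multivariate Beta function.
    def log_beta(alphas):
        return sum(lgamma(a) for a in alphas) - lgamma(sum(alphas))
    return log_beta([c + a for c, a in zip(counts, prior)]) - log_beta(prior)

# Observed transition counts out of one state (e.g. clicks to 3 target pages)
counts = [40, 5, 5]

# Two hypotheses as Dirichlet pseudo-counts; both sum to the same
# concentration (10) so the comparison is fair.
h_popular = [8.0, 1.0, 1.0]          # most traffic goes to the first page
h_uniform = [10 / 3, 10 / 3, 10 / 3]  # traffic is spread evenly

log_bayes_factor = log_evidence(counts, h_popular) - log_evidence(counts, h_uniform)
print(log_bayes_factor > 0)  # True: the data favor the "popular page" hypothesis
```

HypTrails proper does this per state across the whole chain and sweeps the prior concentration to rank hypotheses at different strengths of belief; the single-state Bayes factor above is the building block.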
- …