10 research outputs found

    Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma

    Motivation: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continue to be deposited in public databases, it is becoming important to develop search engines that can retrieve relevant studies given a query study. While retrieval systems based on metadata already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential for uncovering novel biological insights.
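    A minimal sketch of the data-driven retrieval idea described above, assuming each study is summarised as a per-gene expression vector and the repository is ranked by cosine similarity to the query; the study identifiers and values below are hypothetical, and this is not the authors' actual model.

    import numpy as np

    def rank_studies(query_profile, repository, top_k=3):
        """Rank stored expression profiles by cosine similarity to the query.

        query_profile : 1-D array of per-gene expression values for the query study
        repository    : dict mapping study id -> 1-D array in the same gene order
        """
        q = query_profile / np.linalg.norm(query_profile)
        scores = {
            study_id: float(profile @ q / np.linalg.norm(profile))
            for study_id, profile in repository.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Toy repository: three stored studies, four genes each (hypothetical values).
    repo = {
        "study_A": np.array([2.1, 0.3, 5.0, 1.2]),
        "study_B": np.array([0.2, 4.8, 0.1, 3.9]),
        "study_C": np.array([2.0, 0.5, 4.7, 1.0]),
    }
    print(rank_studies(np.array([2.2, 0.4, 4.9, 1.1]), repo))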

    Multi-faceted information retrieval system for large scale email archives

    Development of a Method for Incorporating Fault Codes in Prognostic Analysis

    Information from fault codes associated with a component may be used as an indicator of its health. A fault code is defined as a timestamp at which a component is not operating according to recommended guidelines. The types of fault code relevant for this analysis represent mild or moderate deviations from normal behavior, rather than those requiring immediate repair. Potentially, fault codes may be used to determine the Remaining Useful Life (RUL) of a component by predicting its failure time, which would improve safety and reduce maintenance costs associated with the component. In this dissertation, methods have been developed to integrate the degradation information from fault codes into an existing prognostic parameter to improve the estimation of RUL. Optimization methods such as gradient descent were used to weight each fault code based on its relevance to degradation. Furthermore, topic models, a document analysis and clustering technique, were used both as a dimension-reduction method and for fault mode isolation. The methods developed for this dissertation were applied to two real-world data sets: an actuator system and monitored signals from a motor accelerated-degradation experiment. The best estimation of RUL for the actuator system was obtained with a topic model, with a mean absolute error of 6.41% of the data received, and the best estimation of RUL for the motor accelerated-degradation experiment was 5.7% of the average lifetime of the motors. The primary contributions of this research include a method to construct a prognostic parameter from fault codes alone, the integration of degradation information from fault codes into an existing prognostic parameter, the use of topic models in reliability analysis of fault codes, and a software suite that performs these functions on generic data sets.
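    One way to picture the weighting step mentioned above is a least-squares fit, by plain gradient descent, of one weight per fault code so that the weighted cumulative counts track a reference degradation signal. This is only an illustrative sketch under that assumption, not the dissertation's actual formulation, and all data below are invented.

    import numpy as np

    def fit_fault_code_weights(fault_counts, degradation, lr=0.01, steps=2000):
        """Fit one weight per fault code by plain gradient descent so that the
        weighted sum of cumulative fault-code counts tracks a reference
        degradation signal (ordinary least squares).

        fault_counts : (n_times, n_codes) cumulative occurrences of each code
        degradation  : (n_times,) reference degradation values
        """
        weights = np.zeros(fault_counts.shape[1])
        for _ in range(steps):
            residual = fault_counts @ weights - degradation
            gradient = fault_counts.T @ residual / len(degradation)
            weights -= lr * gradient
        return weights

    # Invented example: 5 time stamps, 3 fault codes, a made-up degradation trend.
    counts = np.array([[0, 1, 0],
                       [1, 1, 0],
                       [1, 2, 1],
                       [2, 2, 1],
                       [3, 3, 2]], dtype=float)
    degradation = np.array([0.1, 0.3, 0.5, 0.7, 1.0])
    weights = fit_fault_code_weights(counts, degradation)
    print("fault-code weights:", weights)
    print("prognostic parameter:", counts @ weights)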

    Retrieval of Gene Expression Measurements with Probabilistic Models

    A crucial problem in current biological and medical research is how to utilize the diverse set of existing biological knowledge and heterogeneous measurement data in order to gain insights into new data. As datasets continue to be deposited in public repositories, it is becoming important to develop search engines that can efficiently integrate existing data and search for relevant earlier studies given a new study. The search task is encountered in several biological applications including cancer genomics, pharmacokinetics, personalized medicine and meta-analysis of functional genomics. Most existing search engines rely on classical keyword- or annotation-based retrieval, which is limited to discovering known information and requires careful downstream annotation of the data. Data-driven, model-based methods, which retrieve studies based on similarities in the actual measurement data, have a greater potential for uncovering novel biological insights. In particular, probabilistic modeling provides promising model-based tools due to its ability to encode prior knowledge, represent uncertainty in model parameters and handle noise associated with the data. By introducing latent variables it is further possible to capture relationships among data features in the form of meaningful biological components underlying the data. This thesis adapts existing and develops new probabilistic models for retrieval of relevant measurement data in three different cases of background repositories. The first case is a background collection of data samples where each sample is represented by a single data type. The second case is a collection of multimodal data samples where each sample is represented by more than one data type. The third case is a background collection of datasets where each dataset, in turn, is a collection of multiple samples. In all three setups the proposed models are evaluated quantitatively, and case studies demonstrate that the models facilitate interpretable retrieval of relevant data, rigorous integration of diverse information sources and learning of latent components from partly related dataset collections.
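    As a toy illustration of model-based (rather than keyword-based) retrieval, the sketch below scores each stored sample by the log-likelihood of a query under an isotropic Gaussian centred on that sample. The thesis's actual probabilistic models are far richer, so treat this purely as an assumed, simplified stand-in with hypothetical data.

    import numpy as np

    def rank_by_likelihood(query, samples, sigma=1.0):
        """Score each stored sample by the log-likelihood of the query under an
        isotropic Gaussian centred on that sample, then rank (highest first)."""
        scores = {}
        for sample_id, mean in samples.items():
            diff = query - mean
            scores[sample_id] = float(
                -0.5 * np.sum(diff ** 2) / sigma ** 2
                - 0.5 * len(query) * np.log(2 * np.pi * sigma ** 2)
            )
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical stored samples (three measured features each).
    samples = {
        "sample_1": np.array([1.0, 0.2, 3.1]),
        "sample_2": np.array([0.1, 2.9, 0.4]),
    }
    print(rank_by_likelihood(np.array([0.9, 0.3, 3.0]), samples))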

    Twitter Mining for Syndromic Surveillance

    Enormous amounts of personalised data are generated daily on social media platforms. Twitter in particular generates vast textual streams in real time, accompanied by personal information. This big social media data offers a potential avenue for inferring public and social patterns. This PhD thesis investigates the use of Twitter data to deliver signals for syndromic surveillance, in order to assess its ability to augment existing syndromic surveillance efforts and to give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome: asthma/difficulty breathing. We seek to develop means of extracting reliable signals from the Twitter stream for syndromic surveillance purposes. We begin by outlining our data collection and preprocessing methods. However, we observe that even with keyword-based data collection, many of the collected tweets are not relevant because they represent chatter, or talk of awareness, rather than an individual suffering from a particular condition. In light of this, we set out to identify relevant tweets in order to collect a strong and reliable signal. We first develop novel features based on the emoji content of tweets and apply semi-supervised learning techniques to filter tweets. Next, we investigate the effectiveness of deep learning at this task. We propose a novel classification algorithm based on neural language models and compare it to existing successful and popular deep learning algorithms. Following this, we go on to propose an attentive bi-directional Recurrent Neural Network architecture for filtering tweets, which also offers additional syndromic surveillance utility by identifying keywords among syndromic tweets. In doing so, we are not only able to detect alarms, but also gain some clues into what each alarm involves. Lastly, we look towards optimizing the Twitter syndromic surveillance pipeline by selecting the best possible keywords to be supplied to the Twitter API. We developed algorithms to intelligently and automatically select keywords such that the quality, in terms of relevance, and the quantity of the tweets collected are maximised.
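    A small sketch of what emoji-derived features for tweet filtering could look like, assuming a hand-picked emoji vocabulary and simple count features; the regular expression, vocabulary and example tweet are all hypothetical and much cruder than the thesis's feature set.

    import re

    # Rough, non-exhaustive emoji ranges: emoticons/pictographs, transport, misc symbols.
    EMOJI_RE = re.compile("[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u27BF]")

    def emoji_features(tweet, vocab):
        """Return a simple feature vector for one tweet: the count of each emoji
        in a fixed vocabulary plus the total number of emojis found."""
        found = EMOJI_RE.findall(tweet)
        return [found.count(e) for e in vocab] + [len(found)]

    # Hypothetical vocabulary: face with medical mask, pill, crying face.
    vocab = ["\U0001F637", "\U0001F48A", "\U0001F622"]
    print(emoji_features("Can't breathe tonight \U0001F637\U0001F622 #asthma", vocab))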

    A Scalable Topic-Based Open Source Search Engine

    Site-based or topic-specific search engines work with mixed success because of the general difficulty of the information retrieval task and the lack of good link information to allow authorities to be identified. We advocate an open source approach to the problem because of its scope and its need for software components. We have adopted a topic-based search engine because it represents the next generation of capability. This paper outlines our scalable system for site-based or topic-specific search, and demonstrates the developing system on a small 250,000-document collection of EU and UN web pages.
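    To make the notion of topic-specific search concrete, here is a minimal, assumed sketch of an inverted index whose results are restricted to documents tagged with a topic; the document ids, topic labels and texts are invented, and the paper's actual system is of course far more elaborate.

    from collections import defaultdict

    def build_index(docs):
        """Build a minimal inverted index: term -> set of document ids."""
        index = defaultdict(set)
        for doc_id, (_topic, text) in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def topic_search(query, docs, index, topic):
        """Return documents that contain every query term and belong to the topic."""
        terms = query.lower().split()
        matches = set.intersection(*(index.get(t, set()) for t in terms)) if terms else set()
        return [d for d in matches if docs[d][0] == topic]

    # Invented documents tagged with a topic label.
    docs = {
        1: ("EU", "council regulation on fisheries policy"),
        2: ("UN", "general assembly resolution on fisheries"),
        3: ("EU", "directive on data protection"),
    }
    index = build_index(docs)
    print(topic_search("fisheries", docs, index, topic="EU"))  # -> [1]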

    Semantic component selection

    The means of locating information quickly and efficiently is a growing area of research. However, the real challenge is not locating bits of information but finding those that are relevant. Relevant information resides within unstructured 'natural' text, yet understanding natural text and judging the relevancy of information is a challenge. The challenge is partially addressed by the use of semantic models and reasoning approaches that allow categorisation and, in a limited fashion, provide understanding of this information. Nevertheless, many such methods depend on expert input and, consequently, are expensive to produce and do not scale. Although automated solutions exist, thus far they have not been able to approach the accuracy levels achievable through the use of expert input. This thesis presents SemaCS, a novel non-domain-specific automated framework for categorising and searching natural text. SemaCS does not rely on expert input; it is based on the actual data being searched and on statistical semantic distances between words. These semantic distances are used to perform basic reasoning and semantic query interpretation. The approach was tested through a feasibility study and two case studies. Based on reasoning and analyses of the data collected through these studies, it can be concluded that SemaCS provides a domain-independent approach to semantic model generation and query interpretation without expert input. Moreover, SemaCS can be further extended to provide a scalable solution applicable to large datasets (i.e. the World Wide Web). This thesis contributes to the current body of knowledge by establishing, adapting, and using novel techniques to define a generic selection/categorisation framework. Implementing the framework outlined in the thesis improves an existing algorithm for semantic distance acquisition. Finally, as a novel approach to the extraction of semantic information is proposed, there is a positive impact on the Information Retrieval domain and, specifically, on Natural Language Processing, word disambiguation and Web/Intranet search.
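    As a rough illustration of a statistical semantic distance computed from the data being searched, the sketch below derives a normalised co-occurrence distance between two terms over a tiny corpus, in the spirit of the Normalised Google Distance family; the corpus and term choices are hypothetical, and the thesis's improved acquisition algorithm is not reproduced here.

    import math

    def semantic_distance(term_a, term_b, documents):
        """Normalised co-occurrence distance between two terms over a small
        corpus: 0 when the terms always appear together, larger values when
        they rarely co-occur, infinity when they never do."""
        docs = [set(doc.lower().split()) for doc in documents]
        n = len(docs)
        freq_a = sum(term_a in d for d in docs)
        freq_b = sum(term_b in d for d in docs)
        freq_ab = sum(term_a in d and term_b in d for d in docs)
        if freq_a == 0 or freq_b == 0 or freq_ab == 0:
            return float("inf")
        log_a, log_b, log_ab = math.log(freq_a), math.log(freq_b), math.log(freq_ab)
        return (max(log_a, log_b) - log_ab) / (math.log(n) - min(log_a, log_b))

    # Invented three-document corpus.
    corpus = [
        "the search engine indexes web pages",
        "a web search query returns ranked pages",
        "semantic models categorise natural text",
    ]
    print(semantic_distance("search", "web", corpus))  # terms always co-occur -> 0.0
    print(semantic_distance("web", "engine", corpus))  # partial co-occurrence -> ~0.63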
