10 research outputs found

    Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma

    Motivation: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continue to be deposited in public databases, it is becoming important to develop search engines that can retrieve relevant studies given a query study. While retrieval systems based on metadata already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential for uncovering novel biological insights.
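    A minimal sketch of the data-driven retrieval idea described above, assuming each study is summarised as a per-gene expression vector and the repository is ranked by cosine similarity to the query; the study identifiers and values below are hypothetical, and this is not the authors' actual model.

    import numpy as np

    def rank_studies(query_profile, repository, top_k=3):
        """Rank stored expression profiles by cosine similarity to the query.

        query_profile : 1-D array of per-gene expression values for the query study
        repository    : dict mapping study id -> 1-D array in the same gene order
        """
        q = query_profile / np.linalg.norm(query_profile)
        scores = {
            study_id: float(profile @ q / np.linalg.norm(profile))
            for study_id, profile in repository.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Toy repository: three stored studies, four genes each (hypothetical values).
    repo = {
        "study_A": np.array([2.1, 0.3, 5.0, 1.2]),
        "study_B": np.array([0.2, 4.8, 0.1, 3.9]),
        "study_C": np.array([2.0, 0.5, 4.7, 1.0]),
    }
    print(rank_studies(np.array([2.2, 0.4, 4.9, 1.1]), repo))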

    Multi-faceted information retrieval system for large scale email archives

    Development of a Method for Incorporating Fault Codes in Prognostic Analysis

    Information from fault codes associated with a component may be used as an indicator of its health. A fault code is defined as a timestamp at which a component is not operating according to recommended guidelines. The types of fault code relevant for this analysis represent mild or moderate deviations from normal behavior, rather than those requiring immediate repair. Potentially, fault codes may be used to determine the Remaining Useful Life (RUL) of a component by predicting its failure time, which would improve safety and reduce maintenance costs associated with the component. In this dissertation, methods have been developed to integrate the degradation information from fault codes into an existing prognostic parameter to improve the estimation of RUL. Optimization methods such as gradient descent were used to weight each fault code based on its relevance to degradation. Furthermore, topic models, a document analysis and clustering technique, were used both as a dimension-reduction method and for fault mode isolation. The methods developed for this dissertation were applied to two real-world data sets: an actuator system and monitored signals from a motor accelerated-degradation experiment. The best estimation of RUL for the actuator system was obtained with a topic model, with a mean absolute error of 6.41% of the data received, and the best estimation of RUL for the motor accelerated-degradation experiment was 5.7% of the average lifetime of the motors. The primary contributions of this research include a method to construct a prognostic parameter from fault codes alone, the integration of degradation information from fault codes into an existing prognostic parameter, the use of topic models in reliability analysis of fault codes, and a software suite that performs these functions on generic data sets.
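    One way to picture the weighting step mentioned above is a least-squares fit, by plain gradient descent, of one weight per fault code so that the weighted cumulative counts track a reference degradation signal. This is only an illustrative sketch under that assumption, not the dissertation's actual formulation, and all data below are invented.

    import numpy as np

    def fit_fault_code_weights(fault_counts, degradation, lr=0.01, steps=2000):
        """Fit one weight per fault code by plain gradient descent so that the
        weighted sum of cumulative fault-code counts tracks a reference
        degradation signal (ordinary least squares).

        fault_counts : (n_times, n_codes) cumulative occurrences of each code
        degradation  : (n_times,) reference degradation values
        """
        weights = np.zeros(fault_counts.shape[1])
        for _ in range(steps):
            residual = fault_counts @ weights - degradation
            gradient = fault_counts.T @ residual / len(degradation)
            weights -= lr * gradient
        return weights

    # Invented example: 5 time stamps, 3 fault codes, a made-up degradation trend.
    counts = np.array([[0, 1, 0],
                       [1, 1, 0],
                       [1, 2, 1],
                       [2, 2, 1],
                       [3, 3, 2]], dtype=float)
    degradation = np.array([0.1, 0.3, 0.5, 0.7, 1.0])
    weights = fit_fault_code_weights(counts, degradation)
    print("fault-code weights:", weights)
    print("prognostic parameter:", counts @ weights)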

    Retrieval of Gene Expression Measurements with Probabilistic Models

    A crucial problem in current biological and medical research is how to utilize the diverse set of existing biological knowledge and heterogeneous measurement data in order to gain insights into new data. As datasets continue to be deposited in public repositories, it is becoming important to develop search engines that can efficiently integrate existing data and search for relevant earlier studies given a new study. The search task is encountered in several biological applications including cancer genomics, pharmacokinetics, personalized medicine and meta-analysis of functional genomics. Most existing search engines rely on classical keyword- or annotation-based retrieval, which is limited to discovering known information and requires careful downstream annotation of the data. Data-driven, model-based methods, which retrieve studies based on similarities in the actual measurement data, have a greater potential for uncovering novel biological insights. In particular, probabilistic modeling provides promising model-based tools due to its ability to encode prior knowledge, represent uncertainty in model parameters and handle noise associated with the data. By introducing latent variables it is further possible to capture relationships among data features in the form of meaningful biological components underlying the data. This thesis adapts existing and develops new probabilistic models for retrieval of relevant measurement data in three different cases of background repositories. The first case is a background collection of data samples where each sample is represented by a single data type. The second case is a collection of multimodal data samples where each sample is represented by more than one data type. The third case is a background collection of datasets where each dataset, in turn, is a collection of multiple samples. In all three setups the proposed models are evaluated quantitatively, and case studies demonstrate that the models facilitate interpretable retrieval of relevant data, rigorous integration of diverse information sources and learning of latent components from partly related dataset collections.
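    As a toy illustration of model-based (rather than keyword-based) retrieval, the sketch below scores each stored sample by the log-likelihood of a query under an isotropic Gaussian centred on that sample. The thesis's actual probabilistic models are far richer, so treat this purely as an assumed, simplified stand-in with hypothetical data.

    import numpy as np

    def rank_by_likelihood(query, samples, sigma=1.0):
        """Score each stored sample by the log-likelihood of the query under an
        isotropic Gaussian centred on that sample, then rank (highest first)."""
        scores = {}
        for sample_id, mean in samples.items():
            diff = query - mean
            scores[sample_id] = float(
                -0.5 * np.sum(diff ** 2) / sigma ** 2
                - 0.5 * len(query) * np.log(2 * np.pi * sigma ** 2)
            )
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical stored samples (three measured features each).
    samples = {
        "sample_1": np.array([1.0, 0.2, 3.1]),
        "sample_2": np.array([0.1, 2.9, 0.4]),
    }
    print(rank_by_likelihood(np.array([0.9, 0.3, 3.0]), samples))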

    Twitter Mining for Syndromic Surveillance

    Enormous amounts of personalised data are generated daily on social media platforms. Twitter in particular generates vast textual streams in real time, accompanied by personal information. This big social media data offers a potential avenue for inferring public and social patterns. This PhD thesis investigates the use of Twitter data to deliver signals for syndromic surveillance, in order to assess its ability to augment existing syndromic surveillance efforts and to give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome: asthma/difficulty breathing. We seek to develop means of extracting reliable signals from the Twitter stream for syndromic surveillance purposes. We begin by outlining our data collection and preprocessing methods. However, we observe that even with keyword-based data collection, many of the collected tweets are not relevant because they represent chatter, or talk of awareness, rather than an individual suffering from a particular condition. In light of this, we set out to identify relevant tweets in order to collect a strong and reliable signal. We first develop novel features based on the emoji content of tweets and apply semi-supervised learning techniques to filter tweets. Next, we investigate the effectiveness of deep learning at this task. We propose a novel classification algorithm based on neural language models and compare it to existing successful and popular deep learning algorithms. Following this, we go on to propose an attentive bi-directional Recurrent Neural Network architecture for filtering tweets, which also offers additional syndromic surveillance utility by identifying keywords among syndromic tweets. In doing so, we are not only able to detect alarms, but also gain some clues into what each alarm involves. Lastly, we look towards optimizing the Twitter syndromic surveillance pipeline by selecting the best possible keywords to be supplied to the Twitter API. We developed algorithms to intelligently and automatically select keywords such that the quality, in terms of relevance, and the quantity of the tweets collected are maximised.
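    A small sketch of what emoji-derived features for tweet filtering could look like, assuming a hand-picked emoji vocabulary and simple count features; the regular expression, vocabulary and example tweet are all hypothetical and much cruder than the thesis's feature set.

    import re

    # Rough, non-exhaustive emoji ranges: emoticons/pictographs, transport, misc symbols.
    EMOJI_RE = re.compile("[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u27BF]")

    def emoji_features(tweet, vocab):
        """Return a simple feature vector for one tweet: the count of each emoji
        in a fixed vocabulary plus the total number of emojis found."""
        found = EMOJI_RE.findall(tweet)
        return [found.count(e) for e in vocab] + [len(found)]

    # Hypothetical vocabulary: face with medical mask, pill, crying face.
    vocab = ["\U0001F637", "\U0001F48A", "\U0001F622"]
    print(emoji_features("Can't breathe tonight \U0001F637\U0001F622 #asthma", vocab))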

    A Scalable Topic-Based Open Source Search Engine

    Site-based or topic-specific search engines work with mixed success because of the general difficulty of the information retrieval task and the lack of good link information to allow authorities to be identified. We advocate an open source approach to the problem because of its scope and its need for software components. We have adopted a topic-based search engine because it represents the next generation of capability. This paper outlines our scalable system for site-based or topic-specific search, and demonstrates the developing system on a small 250,000-document collection of EU and UN web pages.
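    To make the notion of topic-specific search concrete, here is a minimal, assumed sketch of an inverted index whose results are restricted to documents tagged with a topic; the document ids, topic labels and texts are invented, and the paper's actual system is of course far more elaborate.

    from collections import defaultdict

    def build_index(docs):
        """Build a minimal inverted index: term -> set of document ids."""
        index = defaultdict(set)
        for doc_id, (_topic, text) in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def topic_search(query, docs, index, topic):
        """Return documents that contain every query term and belong to the topic."""
        terms = query.lower().split()
        matches = set.intersection(*(index.get(t, set()) for t in terms)) if terms else set()
        return [d for d in matches if docs[d][0] == topic]

    # Invented documents tagged with a topic label.
    docs = {
        1: ("EU", "council regulation on fisheries policy"),
        2: ("UN", "general assembly resolution on fisheries"),
        3: ("EU", "directive on data protection"),
    }
    index = build_index(docs)
    print(topic_search("fisheries", docs, index, topic="EU"))  # -> [1]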

    Semantic component selection

    The means of locating information quickly and efficiently is a growing area of research. However, the real challenge is not locating bits of information but finding those that are relevant. Relevant information resides within unstructured 'natural' text, yet understanding natural text and judging the relevancy of information is a challenge. The challenge is partially addressed by the use of semantic models and reasoning approaches that allow categorisation and, in a limited fashion, provide understanding of this information. Nevertheless, many such methods depend on expert input and, consequently, are expensive to produce and do not scale. Although automated solutions exist, thus far they have not been able to approach the accuracy levels achievable through the use of expert input. This thesis presents SemaCS, a novel non-domain-specific automated framework for categorising and searching natural text. SemaCS does not rely on expert input; it is based on the actual data being searched and on statistical semantic distances between words. These semantic distances are used to perform basic reasoning and semantic query interpretation. The approach was tested through a feasibility study and two case studies. Based on reasoning and analyses of the data collected through these studies, it can be concluded that SemaCS provides a domain-independent approach to semantic model generation and query interpretation without expert input. Moreover, SemaCS can be further extended to provide a scalable solution applicable to large datasets (i.e. the World Wide Web). This thesis contributes to the current body of knowledge by establishing, adapting, and using novel techniques to define a generic selection/categorisation framework. Implementing the framework outlined in the thesis improves an existing algorithm for semantic distance acquisition. Finally, as a novel approach to the extraction of semantic information is proposed, there is a positive impact on the Information Retrieval domain and, specifically, on Natural Language Processing, word disambiguation and Web/Intranet search.
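    As a rough illustration of a statistical semantic distance computed from the data being searched, the sketch below derives a normalised co-occurrence distance between two terms over a tiny corpus, in the spirit of the Normalised Google Distance family; the corpus and term choices are hypothetical, and the thesis's improved acquisition algorithm is not reproduced here.

    import math

    def semantic_distance(term_a, term_b, documents):
        """Normalised co-occurrence distance between two terms over a small
        corpus: 0 when the terms always appear together, larger values when
        they rarely co-occur, infinity when they never do."""
        docs = [set(doc.lower().split()) for doc in documents]
        n = len(docs)
        freq_a = sum(term_a in d for d in docs)
        freq_b = sum(term_b in d for d in docs)
        freq_ab = sum(term_a in d and term_b in d for d in docs)
        if freq_a == 0 or freq_b == 0 or freq_ab == 0:
            return float("inf")
        log_a, log_b, log_ab = math.log(freq_a), math.log(freq_b), math.log(freq_ab)
        return (max(log_a, log_b) - log_ab) / (math.log(n) - min(log_a, log_b))

    # Invented three-document corpus.
    corpus = [
        "the search engine indexes web pages",
        "a web search query returns ranked pages",
        "semantic models categorise natural text",
    ]
    print(semantic_distance("search", "web", corpus))  # terms always co-occur -> 0.0
    print(semantic_distance("web", "engine", corpus))  # partial co-occurrence -> ~0.63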
