
    Semi-supervised prediction of protein interaction sentences exploiting semantically encoded metrics

    Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, the labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
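    As a rough illustration of the kernel re-weighting idea described above, the sketch below builds a precomputed SVM kernel in which bag-of-words sentence vectors are smoothed by a word-similarity matrix; the toy vocabulary, similarity values and classifier choice are invented for the example and are not taken from the paper.

```python
# Hedged sketch: a semantically smoothed kernel for sentence classification.
# The matrix S stands in for word similarities gathered from a large
# unlabelled corpus; all values here are illustrative.
import numpy as np
from sklearn.svm import SVC

# Toy vocabulary: ["binds", "interacts", "expressed"]
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

def semantic_kernel(A, B, S):
    """Kernel between bag-of-words matrices A and B, re-weighted by S."""
    return A @ S @ B.T

X_train = np.array([[1, 0, 0],   # "protein A binds protein B"       -> interaction
                    [0, 0, 1]])  # "gene X is expressed in tissue Y" -> no interaction
y_train = np.array([1, 0])

clf = SVC(kernel="precomputed")
clf.fit(semantic_kernel(X_train, X_train, S), y_train)

X_new = np.array([[0, 1, 0]])    # "A interacts with B": unseen word, but similar to "binds"
print(clf.predict(semantic_kernel(X_new, X_train, S)))
```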

    Analyzing time series from eye tracking using Symbolic Aggregate Approximation

    This thesis explores the viability of transforming the data produced when tracking the eyes into a discrete symbolic representation. For this transformation, we utilize Symbolic Aggregate Approximation (SAX) to investigate a new possibility for effectively categorizing data collected via eye tracking technologies. This categorization illustrates tendencies towards, e.g., tracking problems, problems with the set-up, normal vision, or vision disturbances. Accordingly, this will contribute to evaluating the eyes' performance and allow professionals to develop a diagnosis based on evidence from objective measurements. The results are based on implementing a symbolic discretization method and applying it to a real-world dataset containing recordings of eye movements. In the future, the knowledge gained and the transformation via the SAX method can be used to make sense of data and identify anomalies in various domains and for multiple stakeholders. (Master's thesis in Software Development in collaboration with HVL; PROG399, MAMN-PRO)
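    A minimal sketch of the SAX discretization step referred to above: z-normalise a gaze signal, reduce it with Piecewise Aggregate Approximation, and map the segment means to letters using breakpoints of the standard normal distribution. The window size, alphabet and toy signal are illustrative and are not the settings used in the thesis.

```python
# Hedged SAX sketch; parameters are illustrative only.
import numpy as np

def sax(series, n_segments=4, alphabet="abcd"):
    """Convert a 1-D gaze signal into a short SAX word."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)                # z-normalise
    # Piecewise Aggregate Approximation: mean of each segment
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])
    # Breakpoints that cut the standard normal into 4 equiprobable regions
    breakpoints = [-0.6745, 0.0, 0.6745]
    return "".join(alphabet[i] for i in np.digitize(paa, breakpoints))

gaze_x = [0.1, 0.1, 0.2, 0.9, 1.1, 1.0, 0.2, 0.1]         # toy horizontal gaze trace
print(sax(gaze_x))                                        # e.g. "acda"
```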

    XML Matchers: approaches and challenges

    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in the Database and Artificial Intelligence research areas for many years. In the past, it was investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, which aim at finding semantic matches between concepts defined in DTDs and XSDs. XML Matchers do not simply take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact the Schema Matching task. Then we introduce a template, called the XML Matcher Template, that describes the main components of an XML Matcher, their role and their behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers. (Comment: 34 pages, 8 tables, 7 figures)
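    To make the idea of exploiting XML-specific structure concrete, here is a hedged sketch of a very small matcher that scores element pairs from two XSDs by combining name similarity with similarity of their hierarchical paths. The weights and threshold are arbitrary, and the sketch does not reproduce any particular system surveyed in the paper.

```python
# Hedged sketch of a tiny linguistic + structural XML matcher.
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

XS = "{http://www.w3.org/2001/XMLSchema}"

def element_paths(xsd_text):
    """Collect name -> hierarchical path for xs:element declarations in an XSD."""
    paths = {}
    def walk(node, prefix):
        for child in node:
            if child.tag == XS + "element" and "name" in child.attrib:
                name = child.attrib["name"]
                paths[name] = prefix + "/" + name
                walk(child, prefix + "/" + name)
            else:
                walk(child, prefix)
    walk(ET.fromstring(xsd_text), "")
    return paths

def match(xsd_a, xsd_b, threshold=0.6):
    a, b = element_paths(xsd_a), element_paths(xsd_b)
    for na, pa in a.items():
        for nb, pb in b.items():
            name_sim = SequenceMatcher(None, na.lower(), nb.lower()).ratio()
            path_sim = SequenceMatcher(None, pa.lower(), pb.lower()).ratio()
            score = 0.7 * name_sim + 0.3 * path_sim       # weights are arbitrary
            if score >= threshold:
                yield na, nb, round(score, 2)

xsd_1 = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
         '<xs:element name="author"/><xs:element name="title"/></xs:schema>')
xsd_2 = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
         '<xs:element name="writer"/><xs:element name="title"/></xs:schema>')
print(list(match(xsd_1, xsd_2)))
```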

    Unsupervised Extraction of Representative Concepts from Scientific Literature

    This paper studies the automated categorization and extraction of scientific concepts from the titles of scientific articles, in order to gain a deeper understanding of their key contributions and to facilitate the construction of a generic academic knowledge base. Towards this goal, we propose an unsupervised, domain-independent, and scalable two-phase algorithm to type and extract key concept mentions into aspects of interest (e.g., Techniques, Applications, etc.). In the first phase of our algorithm we propose PhraseType, a probabilistic generative model which exploits textual features and limited POS tags to broadly segment text snippets into aspect-typed phrases. We extend this model to simultaneously learn aspect-specific features and identify academic domains in multi-domain corpora, since the two tasks mutually enhance each other. In the second phase, we propose an approach based on adaptor grammars to extract fine-grained concept mentions from the aspect-typed phrases without the need for any external resources or human effort, in a purely data-driven manner. We apply our technique to study literature from diverse scientific domains and show significant gains over state-of-the-art concept extraction techniques. We also present a qualitative analysis of the results obtained. (Comment: Published as a conference paper at CIKM 201)
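    The toy sketch below only illustrates the kind of aspect-typed phrase output the two-phase pipeline produces; it is a hand-written heuristic based on a few cue words, not the PhraseType model or the adaptor-grammar extractor themselves.

```python
# Illustrative toy only: split a title at cue words and label the pieces.
CUES = {"for": "Application", "using": "Technique", "with": "Technique", "via": "Technique"}

def type_phrases(title):
    phrases, current, aspect = [], [], "Other"
    for tok in title.lower().replace(",", " ").split():
        if tok in CUES:
            if current:
                phrases.append((aspect, " ".join(current)))
            current, aspect = [], CUES[tok]
        else:
            current.append(tok)
    if current:
        phrases.append((aspect, " ".join(current)))
    return phrases

print(type_phrases("Semi-supervised prediction of protein interaction sentences "
                   "using semantically encoded metrics"))
# -> [('Other', 'semi-supervised prediction of protein interaction sentences'),
#     ('Technique', 'semantically encoded metrics')]
```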

    The mediated data integration (MeDInt): An approach to the integration of database and legacy systems

    The information required for decision making by executives in organizations is normally scattered across disparate data sources, including databases and legacy systems. To gain a competitive advantage, it is extremely important for executives to be able to obtain a single view of information in an accurate and timely manner. To do this, it is necessary to interoperate multiple data sources, which differ structurally and semantically. Particular problems occur when applying traditional integration approaches; for example, the global schema needs to be recreated whenever a component schema is modified. This research investigates the following heterogeneities between heterogeneous data sources: Data Model Heterogeneities, Schematic Heterogeneities and Semantic Heterogeneities. The problems of existing integration approaches are reviewed and solved by introducing and designing a new integration approach to logically interoperate heterogeneous data sources and to resolve the three previously classified heterogeneities. The research attempts to reduce the complexity of the integration process by maximising the degree of automation. Mediation and wrapping techniques are employed in this research. The Mediated Data Integration (MeDInt) architecture has been introduced to integrate heterogeneous data sources. Three major elements, the MeDInt Mediator, wrappers, and the Mediated Data Model (MDM), play important roles in the integration of heterogeneous data sources. The MeDInt Mediator acts as an intermediate layer transforming queries into sub-queries, resolving conflicts, and consolidating conflict-resolved results. Wrappers serve as translators between the MeDInt Mediator and the data sources. Both the mediator and the wrappers are well supported by the MDM, a semantically rich data model which can describe or represent heterogeneous data schematically and semantically. Some organisational information systems have been tested and evaluated using the MeDInt architecture. The results address all the research questions regarding the interoperability of heterogeneous data sources. In addition, the results confirm that the MeDInt architecture is able to provide integration that is transparent to users and that schema evolution does not affect the integration.
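    A minimal sketch of the mediator/wrapper pattern described above, assuming toy in-memory sources: each wrapper translates source-specific attribute names into a shared mediated vocabulary, and the mediator fans the query out to all wrappers and consolidates the answers. The class names and the conflict-resolution rule are illustrative only.

```python
# Hedged sketch of mediation and wrapping over heterogeneous toy sources.
class Wrapper:
    """Translates between one data source and the mediated vocabulary."""
    def __init__(self, records, rename):
        self.records = records       # toy stand-in for a database or legacy file
        self.rename = rename         # source attribute -> mediated attribute

    def query(self, key):
        raw = self.records.get(key, {})
        return {self.rename.get(k, k): v for k, v in raw.items()}

class Mediator:
    """Sends sub-queries to the wrappers and consolidates the results."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, key):
        merged = {}
        for w in self.wrappers:      # conflict rule: later sources override earlier ones
            merged.update(w.query(key))
        return merged

relational = Wrapper({"cust42": {"name": "Ada", "city": "Perth"}}, {})
legacy     = Wrapper({"cust42": {"CUST_NM": "Ada Lovelace"}}, {"CUST_NM": "name"})
print(Mediator([relational, legacy]).query("cust42"))
# -> {'name': 'Ada Lovelace', 'city': 'Perth'}
```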

    Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences

    Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously. (Comment: 22 pages. To appear in Natural Language Engineering)
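    As a rough, simplified illustration of segmenting with nothing but n-gram counts drawn from unsegmented text, the sketch below votes for a boundary whenever the n-grams that do not straddle a candidate gap are more frequent than those that do. The n-gram order, voting rule and threshold are simplifications and not the paper's exact algorithm.

```python
# Hedged, simplified boundary-voting segmenter; not the published method.
from collections import Counter

def ngram_counts(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def segment(seq, corpus, n=2, threshold=0.5):
    """Insert a space before position k when non-straddling n-grams beat straddling ones."""
    counts = ngram_counts(corpus, n)
    out = [seq[0]]
    for k in range(1, len(seq)):
        left, right = seq[max(0, k - n):k], seq[k:k + n]   # n-grams touching the gap
        straddles = [seq[s:s + n] for s in range(max(0, k - n + 1), k)]
        votes = total = 0
        for strad in straddles:
            for side in (left, right):
                total += 1
                votes += counts[side] > counts[strad]
        if total and votes / total >= threshold:
            out.append(" ")
        out.append(seq[k])
    return "".join(out)

corpus = "東京都に住む。京都は京都。東京は東京。"   # unsegmented "training" text
print(segment("東京都に住む", corpus))               # e.g. 東京都 に住む
```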

    RDF, the semantic web, Jordan, Jordan and Jordan

    This collection is addressed to archivists and library professionals, and so has a slight focus on the implications for them. This chapter is nonetheless intended to be a more-or-less generic introduction to the Semantic Web and RDF, and isn't specific to that domain.

    Mining LEGO data sets to support LEGO design

    The LEGO Group invented the LEGO brick. These bricks can be put together to build creative LEGO sets of different themes, such as a Volkswagen T1 Camper Van model (in the "Sculpture" theme) or The Simpsons House model (in the "Town" theme). LEGO accompanies nearly everyone from youth to adulthood: the age groups of fans range from pre-school kids to elderly people with grandchildren. With such a strong and huge fan base, many websites have appeared that provide LEGO data on sets, parts and minifigures, as well as online communities for fans to share their experience with LEGO sets. However, there is barely any research on this rich data source that discovers knowledge and insights about the role each LEGO part plays in a LEGO set, in its own part category and in a LEGO theme, or about how LEGO sets differ from one another in terms of theme and part composition. There are many interesting questions we can address from these datasets that will not only help improve LEGO designs, but will also help LEGO fans or potential customers make efficient purchasing decisions as they become more familiar with LEGO sets and parts. To address these needs, in this thesis we propose a systematic method of mining LEGO datasets of sets and parts to support LEGO design. Treating each LEGO set as a document and each part in it as a word, we are able to apply data mining techniques such as topic modelling and k-means clustering to find statistics of sets and parts. The preliminary experimental results show that the proposed methods can automatically construct a LEGO Brick Lexicon that shows a part's relationship with other parts, sets and themes, discover knowledge about typical LEGO construction patterns, and create hybrid theme recommendations. We believe this is a step forward in helping LEGO designers create more attractive sets with pragmatic parts, as well as improving the building experience of LEGO fans and builders.
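    A hedged sketch of the "set as document, part as word" framing using scikit-learn: each LEGO set becomes a bag of part IDs, which can then be fed to a topic model and to k-means clustering. The part names, the tiny set list and the numbers of topics/clusters are toy values, not the thesis's data.

```python
# Hedged sketch: topic modelling and clustering over bag-of-parts "documents".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Each "document" lists the (made-up) part IDs occurring in one LEGO set.
sets = [
    "brick2x4 brick2x4 plate1x2 wheel wheel axle",
    "brick2x4 plate1x2 plate1x2 window door roof",
    "wheel wheel axle axle brick1x1 plate1x2",
    "window door roof brick2x4 brick1x1",
]

X = CountVectorizer().fit_transform(sets)        # bag-of-parts matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print("set-topic mixtures:\n", lda.transform(X).round(2))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("set clusters:", km.labels_)
```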

    Integrative Systems Biology Resources and Approaches in Disease Analytics

    Currently, our analytical capabilities are struggling to keep up with the pace of in-depth analysis of all the large-scale data generated by high-throughput omics platforms. While substantial effort has been spent on enhancing methods for the technical aspects of many omics detection platforms, the development of integrative downstream approaches is still challenging. Systems biology has immense applicability in the biomedical and pharmacological areas, since its main goal is to translate measured outputs into potential markers of a human ailment and/or to provide new compound leads for drug discovery. This approach would become more straightforward and more realistic to use in standard analysis workflows if all available information on every component of a biological system were collated into a single database framework, instead of searching for and fetching a single component at a time across scattered database resources. Here, we describe several database resources and standalone and web-based tools applied in disease analytics workflows based on the data-driven integration of outputs from multi-omic detection platforms.