8 research outputs found

    MRAR : mining multi-relation association rules

    In this paper, we introduce a new class of association rules (ARs) named “Multi-Relation Association Rules”; in contrast to primitive ARs (which are usually extracted from multi-relational databases), each rule item consists of one entity and several relations. These relations indicate indirect relationships between entities. Consider the following Multi-Relation Association Rule, whose first item consists of the three relations live in, nearby and humid: “Those who live in a place which is nearby a city with a humid climate type and who are also younger than 20 have good health condition”. A new algorithm called MRAR is proposed to extract such rules from directed graphs with labeled edges, which are constructed from RDBMSs or semantic web data. We also answer the question of how to convert RDBMS data or semantic web data into a directed graph with labeled edges. To evaluate the proposed algorithm, experiments are performed on a sample dataset and on a real-world drug semantic web dataset. The obtained results confirm the ability of the proposed algorithm to mine Multi-Relation Association Rules.
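    The conversion the abstract refers to can be pictured with a small sketch. Below is a minimal, illustrative Python encoding of (subject, relation, object) triples, as might be exported from an RDBMS or an RDF store, into a directed graph with labeled edges, plus the enumeration of relation paths from which multi-relation rule items could be mined. The triples, names, and path-length bound are assumptions for illustration, not details from the paper.

```python
# Sketch: triples -> directed graph with labeled edges -> relation paths.
from collections import defaultdict

# Illustrative (subject, relation, object) triples.
triples = [
    ("alice", "lives_in", "springfield"),
    ("springfield", "nearby", "lakeview"),
    ("lakeview", "climate", "humid"),
]

# Adjacency list: node -> list of (relation_label, target_node).
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def relation_paths(graph, start, max_len=3):
    """Enumerate relation-labeled paths from `start`: the raw material for
    multi-relation rule items such as (lives_in, nearby, climate) -> humid."""
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if path:
            yield path, node
        if len(path) < max_len:
            for rel, nxt in graph.get(node, []):
                stack.append((nxt, path + [rel]))

for path, endpoint in relation_paths(graph, "alice"):
    print(path, "->", endpoint)
```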

    A Granular-based Approach for Semisupervised Web Information Labeling

    A key issue when mining web information is the labeling problem: data is abundant on the web but largely unlabeled. In this thesis, we address this problem by proposing i) a novel theoretical granular model that structures categorical noun phrase instances, as well as semantically related noun phrase pairs, from a given corpus of unstructured web pages, using a variant of the Tolerance Rough Sets Model (TRSM), and ii) a semi-supervised learning algorithm called Tolerant Pattern Learner (TPL) that labels categorical instances as well as relations. TRSM has so far been successfully employed for document retrieval and classification, but not for learning categorical and relational phrases. We use ontological information from the Never-Ending Language Learner (NELL) system. We compare the performance of our algorithm with the Coupled Bayesian Sets (CBS) and Coupled Pattern Learner (CPL) algorithms for categorical and relational labeling, respectively. Experimental results suggest that TPL achieves comparable performance with CBS and CPL in terms of precision. Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant 194376. Master of Science in Applied Computer Science.
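    For readers unfamiliar with TRSM, the sketch below shows the standard co-occurrence-based tolerance relation it builds on: two terms are tolerant when they co-occur in at least θ documents, and a document's upper approximation is the union of its terms' tolerance classes. The toy corpus and threshold are illustrative assumptions, not the thesis's data or exact parameterization.

```python
# Sketch: tolerance classes and a document's upper approximation in TRSM.
from itertools import combinations
from collections import Counter

corpus = [                       # illustrative documents as term sets
    {"city", "climate", "humid"},
    {"city", "climate", "rain"},
    {"drug", "dose", "trial"},
]
theta = 2                        # co-occurrence threshold (assumed)

cooc = Counter()
for doc in corpus:
    for a, b in combinations(sorted(doc), 2):
        cooc[(a, b)] += 1

def tolerance_class(term):
    """I_theta(term): the term plus every term co-occurring with it >= theta times."""
    cls = {term}
    for (a, b), n in cooc.items():
        if n >= theta and term in (a, b):
            cls.add(b if term == a else a)
    return cls

# Upper approximation of a document: union of its terms' tolerance classes.
doc = corpus[0]
print(set().union(*(tolerance_class(t) for t in doc)))
```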

    Temporal Information Extraction and Knowledge Base Population

    Temporal Information Extraction (TIE) from text plays an important role in many Natural Language Processing and Database applications. Many features of the world are time-dependent, and rich temporal knowledge is required for a more complete and precise understanding of the world. In this thesis, we address aspects of two core tasks in TIE. First, we provide a new corpus of labeled temporal relations between events and temporal expressions, dense enough to facilitate a change in research directions from relation classification to identification, and present a system designed to address the corresponding new challenges. Second, we implement a novel approach for the discovery and aggregation of temporal information about entity-centric fluent relations.
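    A sketch of what dense temporal-relation annotation looks like, under the assumption of a TimeML-style label set with a catch-all "vague" label; the events, time expressions, and labels below are invented for illustration and are not from the corpus described above.

```python
# Sketch: dense annotation labels every mention pair, falling back to "vague"
# when no order can be committed to, instead of labeling only obvious pairs.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Mention:
    id: str
    kind: str   # "EVENT" or "TIMEX"
    text: str

mentions = [
    Mention("e1", "EVENT", "elected"),
    Mention("t1", "TIMEX", "November 2008"),
    Mention("e2", "EVENT", "inaugurated"),
]

labels = {("e1", "t1"): "IS_INCLUDED", ("e1", "e2"): "BEFORE", ("t1", "e2"): "BEFORE"}

for a, b in combinations(mentions, 2):
    print(a.id, b.id, labels.get((a.id, b.id), "vague"))
```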

    What Search Engines Cannot Do: Holistic Entity Search on Web Data

    More than 50% of all Web queries are entity related. Users search either for entities or for entity information. Still, search engines do not accommodate entity-centric search very well. Building on the concept of the semiotic triangle from cognitive psychology, which models entity types in terms of intensions and extensions, we identified three types of queries for retrieving entities: type-based queries - searching for entities of a given type, prototype-based queries - searching for entities having certain properties, and instance-based queries - searching for entities similar to a given entity. For type-based queries we present a method that combines query expansion with a self-supervised vocabulary learning technique built on both structured and unstructured data. Our approach achieves a good tradeoff between precision and recall. For prototype-based queries we propose ProSWIP, a property-based system for retrieving entities from the Web. Since the number of properties given by the users can be quite small, ProSWIP relies on direct questions and user feedback to expand the set of properties to a set that captures the user’s intentions correctly. Our experiments show that within a maximum of four questions the system achieves perfect precision of the selected entities. In the case of instance-based queries the first challenge is to establish a query form that allows for disambiguating user intentions without putting too much cognitive pressure on the user. We propose a minimalistic instance-based query comprising the example entity and the intended entity type. With this query, and building on the concept of family resemblance, we present a practical way of retrieving entities directly from the Web. Our approach can even cope with queries which have proven problematic for benchmark tasks like related entity finding. Providing information about a given entity, entity summarization is another kind of entity-centric query. Google’s Knowledge Graph is the state of the art for this task. But relying entirely on manually curated knowledge bases, the Knowledge Graph does not include new and lesser-known entities. We propose to use a data-driven approach. Our experiments on real-world entities show the superiority of our method. We are confident that mastering these four query types enables holistic entity search on Web data for the next generation of search engines.
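    For the instance-based case, a rough sketch of family-resemblance ranking, assuming entities are represented as property sets harvested from the Web. Jaccard overlap is used here as a simple stand-in for a graded resemblance score; the thesis's actual retrieval pipeline is not reproduced, and all data below is invented.

```python
# Sketch: rank candidate entities by graded overlap with the example entity's
# properties; no single property is required, shared properties accumulate.
example = {"river", "in_europe", "navigable"}    # properties of the query example

candidates = {
    "Danube": {"river", "in_europe", "navigable", "long"},
    "Nile":   {"river", "navigable", "in_africa"},
    "Alps":   {"mountain_range", "in_europe"},
}

def family_resemblance(props, example):
    """Jaccard overlap as a simple stand-in for family resemblance."""
    return len(props & example) / len(props | example)

ranked = sorted(candidates,
                key=lambda e: family_resemblance(candidates[e], example),
                reverse=True)
print(ranked)    # ['Danube', 'Nile', 'Alps']
```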

    Ontology evolution in physics

    With the advent of reasoning problems in dynamic environments, there is an increasing need for automated reasoning systems to adapt automatically to unexpected changes in representations. In particular, the automation of the evolution of their ontologies needs to be enhanced without substantially sacrificing expressivity in the underlying representation. Revision of beliefs is not enough, as adding to or removing from beliefs does not change the underlying formal language. General reasoning systems employed in such environments should also address situations in which the language for representing knowledge is not shared among the involved entities, e.g., the ontologies in a multi-ontology environment or the agents in a multi-agent environment. Our techniques involve diagnosis of faults in existing, possibly heterogeneous, ontologies and then resolution of these faults by manipulating the signature and/or the axioms. This thesis describes the design, development and evaluation of GALILEO (Guided Analysis of Logical Inconsistencies Lead to Evolution of Ontologies), a system designed to detect conflicts in highly expressive ontologies and to resolve the detected conflicts by performing appropriate repair operations. The integrated mechanism that handles ontology evolution is able to distinguish between various types of conflicts, each corresponding to a unique kind of ontological fault. We apply and develop our techniques in the domain of physics. This is an excellent domain because many of its seminal advances can be seen as examples of ontology evolution, i.e. changing the way that physicists perceive the world, and its case studies are well documented, unlike those of many other domains. Our research covers analysing a wide-ranging development set of case studies and evaluating the performance of the system on a test set. Because the formal representations of most of the case studies are non-trivial and the underlying logic has a high degree of expressivity, we face some tricky technical challenges, including dealing with the potentially large number of choices in diagnosis and repair. In order to enhance the practicality and manageability of the ontology evolution process, GALILEO incorporates the functionality of generating physically meaningful diagnoses and repairs, as a result narrowing the search space to a manageable size.
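    As a point of reference for the diagnosis step, the sketch below shows one standard, generic ingredient of ontology debugging: shrinking an inconsistent axiom set to a minimal conflict by deletion debugging. This is not GALILEO's mechanism, which additionally distinguishes fault types and can repair signatures as well as axioms; the consistency check and the axioms are stubs for illustration only.

```python
# Sketch: linear deletion debugging of an inconsistent axiom set.
def is_consistent(axioms):
    """Stub check: here, a set is inconsistent iff it contains an
    assertion together with its explicit negation."""
    return not any(("not " + a) in axioms for a in axioms)

def minimize_conflict(axioms):
    """Shrink an inconsistent axiom set to a minimal conflict by trying to
    drop one axiom at a time."""
    core = list(axioms)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if not is_consistent(set(trial)):
            core = trial    # axiom i was not needed for the conflict
        else:
            i += 1          # axiom i is part of the minimal conflict
    return set(core)

axioms = {"mass is conserved", "not mass is conserved", "f = m*a"}
print(minimize_conflict(axioms))   # the conflicting pair survives
```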

    Meaning construction in popular science : an investigation into cognitive, digital, and empirical approaches to discourse reification

    This thesis uses cognitive linguistics and digital humanities techniques to analyse abstract conceptualization in a corpus of popular science texts. Combining techniques from Conceptual Integration Theory, corpus linguistics, data mining, cognitive pragmatics and computational linguistics, it presents a unified approach to understanding cross-domain mappings in this area, and through case studies of key extracts, describes how concept integration in these texts operates. In more detail, Part I of the thesis describes and implements a comprehensive procedure for semantically analysing large bodies of text using the recently completed database of the Historical Thesaurus of English. Using log-likelihood statistical measures and semantic annotation techniques on a 600,000-word corpus of abstract popular science, this part establishes both the existence and the extent of significant analogical content in the corpus. Part II then identifies samples from the corpus which are particularly high in analogical content, and proposes an adaptation of empirical and corpus methods to support and enhance conceptual integration (sometimes called conceptual blending) analyses, informed by Part I’s methodologies for the study of analogy on a wider scale. Finally, the thesis closes with a detailed analysis, using this methodology, of examples taken from the example corpus. This analysis illustrates the conclusions which can be drawn from such work, completing the methodological chain of reasoning from wide-scale corpora to narrow-focus semantics, and providing data about the nature of highly abstract popular science as a genre. The thesis’ original contribution to knowledge is therefore twofold: while contributing to the understanding of the reification of abstractions in discourse, it also focuses on methodological enhancements to existing tools and approaches, aiming to contribute to the established tradition of both analytic and procedural work advancing the digital humanities in the area of language and discourse.
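    The log-likelihood measure mentioned above is standardly Dunning's G2 keyness statistic for comparing a word's frequency across two corpora. A minimal sketch, with invented counts rather than figures from the thesis's corpus:

```python
# Sketch: Dunning log-likelihood (G2) keyness between two corpora.
import math

def log_likelihood(a, b, c, d):
    """G2 for a word occurring `a` times in corpus 1 (size c words) and
    `b` times in corpus 2 (size d words)."""
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus 2
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented example: 120 hits in a 600k-word study corpus vs 80 in a 1M-word
# reference corpus; G2 > 3.84 is significant at p < 0.05 (1 d.f.).
print(log_likelihood(120, 80, 600_000, 1_000_000))
```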

    Creating Digital Editions for Corpus Linguistics : The case of Potage Dyvers, a family of six Middle English recipe collections

    This thesis presents a corpus-linguistically oriented digital documentary edition of six 15th-century culinary recipe collections, known as the Potage Dyvers family, with an introduction to its historical context and an analysis of its dialectal and structural features, and defines an editorial framework for producing such editions for the purposes of corpus linguistic research. Traditionally, historical corpora have been compiled from printed editions not originally designed to serve as corpus linguistic data. Recently, both the digitalisation of textual editing and the turning of corpus compilers towards original sources have blurred the boundaries between these two crafts, placing corpus compilers into an editorial role. Despite the fact that traditional editorial approaches have been recognised as largely incompatible with the needs of linguistic research, and the established methods of corpus encoding do not satisfactorily represent the documentary context of manuscript texts, no explicitly linguistic editorial approach has so far been designed for editing manuscript sources for use in corpora. Even most digital editions, despite their advanced representational capabilities, are literary or historical in orientation and thus do not provide an adequate model. The editorial framework described here and the edition based on it have been explicitly designed to answer the needs of historical corpus linguistics. First, it aims at faithfully modelling the manuscript as a historical artefact, including both its textual content and its visual and material paratext, whose communicative importance has also been recognised by many historical linguists. Second, it presents this model in a form which not only allows the study of both text and paratext using corpus linguistic methods, but also allows resulting analytical metadata to be linked back to the edition, shared with other scholars, and used as the basis for further study. The edition itself is provided as a digital appendix to the thesis, in the form of both a digital data archive encoded in TEI XML and three editorial presentations of this data, and not only serves as a demonstration of the editorial approach but also provides a valuable new research resource. The choice of material is based on the insight that utilitarian texts like recipes provide valuable material especially for historical pragmatics and discourse studies. As one of the first vernacular text types, recipes also provide an excellent opportunity to study the diachronic development of a single textual genre. The Potage Dyvers family is the second largest known family of Middle English recipe collections, surviving in six physically diverse manuscripts. Of these, four were edited in 1888 by conflating them into two collections, but their complex interrelationships have so far escaped systematic study. The structural analysis of the six Potage Dyvers versions indicates that the family, containing a total of 371 unique recipes, in fact consists of three sibling pairs of MSS. Two of these contain largely the same material but in a different order, while the third shares only a core of 89 recipes with the others, deriving a large number of recipes from other sources.
In terms of their language, all six versions exhibit mainly Midlands forms and combine dialectally unmarked forms with more local variants from different areas, reflecting the 15th-century levelling of dialectal distinctions that had not yet reached orthographic or morphological uniformity, and indicating possible metropolitan associations.
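    To make the TEI XML side concrete, below is a minimal sketch of querying a TEI-encoded recipe with Python's standard library; the tiny fragment is invented for illustration and is not from the Potage Dyvers edition.

```python
# Sketch: reading a TEI XML transcription, keeping paratextual structure
# (headings, numbering) queryable alongside the text itself.
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
sample = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="recipe" n="1">
      <head>Potage dyvers.</head>
      <p>Take almaundes and grynde hem smale.</p>
    </div>
  </body></text>
</TEI>
"""

root = ET.fromstring(sample)
for div in root.iter(TEI_NS + "div"):
    head = div.find(TEI_NS + "head")
    print(div.get("n"), head.text)   # 1 Potage dyvers.
```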

    Search and Mining Entity-relationship Data

    This paper summarizes the details of the first international workshop on search and mining entity-relationship data. The workshop will bring together IR, DB, and KM researchers to seek novel solutions for search and data mining of rich entity-relationship data and its applications in various domains. We first provide an overview of the workshop and then briefly discuss the workshop program.