
    User driven information extraction with LODIE

    Information Extraction (IE) is the task of transforming unstructured or semi-structured data into a structured representation that can be processed by machines. In this paper we use a user-driven Information Extraction technique to wrap entity-centric Web pages. The user selects concepts and properties of interest from available Linked Data. Given a number of websites containing pages about the concepts of interest, the method exploits (i) recurrent structures in the Web pages and (ii) available knowledge in Linked Data to extract the information of interest from the Web pages.
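    As a rough illustration of the idea (a minimal Python sketch, not the LODIE implementation; the 80% recurrence threshold and the gazetteer of Linked Data labels are our own assumptions), a wrapper can be induced by aligning template slots across pages from the same site and keeping the slots whose values match labels already known from Linked Data:

    import re
    from collections import Counter
    from lxml import html

    def path_signature(xpath):
        # Drop positional indices so the same template slot aligns across pages.
        return re.sub(r"\[\d+\]", "", xpath)

    def text_paths(page_source):
        # Yield (template-slot signature, text) for every text-bearing node.
        tree = html.fromstring(page_source)
        for node in tree.iter():
            if node.text and node.text.strip():
                xpath = tree.getroottree().getpath(node)
                yield path_signature(xpath), node.text.strip()

    def induce_wrapper(pages, gazetteer):
        # Keep slots that (i) recur on most pages and (ii) sometimes hold a
        # value matching a label taken from Linked Data.
        recurrence, matches = Counter(), Counter()
        for source in pages:
            seen = set()
            for sig, text in text_paths(source):
                if sig not in seen:
                    recurrence[sig] += 1
                    seen.add(sig)
                if text in gazetteer:
                    matches[sig] += 1
        return [sig for sig in recurrence
                if recurrence[sig] >= 0.8 * len(pages) and matches[sig] > 0]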

    Distantly Supervised Web Relation Extraction for Knowledge Base Population

    Extracting information from Web pages for populating large, cross-domain knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise, and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results, one of those approaches being distant supervision. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations, creating training data for relation classifiers. In this paper we propose the use of distant supervision for relation extraction from the Web. Although the method is promising, existing approaches are still not suitable for Web extraction, as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and by extracting relations across sentence boundaries using unsupervised co-reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases, we present and evaluate several information integration strategies and show that these benefit immensely from additional relation mentions extracted using co-reference resolution, increasing precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%.
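    The core distant-supervision labelling step can be sketched in a few lines of Python (a toy example with hypothetical facts, not the paper's pipeline; note how the second sentence exhibits exactly the noise that the approach must then filter out):

    # (subject, object) -> relation, as found in a knowledge base
    # (toy facts for illustration only).
    KB_FACTS = {
        ("Ada Lovelace", "London"): "birthPlace",
        ("Alan Turing", "London"): "birthPlace",
    }

    def label_sentences(sentences):
        # A sentence mentioning both arguments of a known fact is assumed to
        # express that fact's relation; this assumption is what makes the
        # resulting training data noisy.
        examples = []
        for sent in sentences:
            for (subj, obj), rel in KB_FACTS.items():
                if subj in sent and obj in sent:
                    examples.append((sent, subj, obj, rel))
        return examples

    corpus = [
        "Ada Lovelace was born in London in 1815.",
        "Alan Turing left London for Cambridge.",  # false positive: wrong relation
    ]
    for example in label_sentences(corpus):
        print(example)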

    An unsupervised data-driven method to discover equivalent relations in large linked datasets

    This article addresses a number of limitations of state-of-the-art methods of Ontology Alignment: 1) they primarily address concepts and entities, while relations are less well studied; 2) many build on the assumption of the ‘well-formedness’ of ontologies, which is not necessarily true in the domain of Linked Open Data; 3) few have looked at schema heterogeneity from a single source, which is also a common issue, particularly in very large Linked Datasets created automatically from heterogeneous resources or integrated from multiple datasets. We propose a domain- and language-independent and completely unsupervised method to align equivalent relations across schemata based on their shared instances. We introduce a novel similarity measure able to cope with unbalanced populations of schema elements, an unsupervised technique to automatically decide the similarity threshold for asserting equivalence between a pair of relations, and an unsupervised clustering process to discover groups of equivalent relations across different schemata. Although the method is designed for aligning relations within a single dataset, it can also be adapted for cross-dataset alignment where sameAs links between datasets have been established. Using three gold standards created from DBpedia, we obtain encouraging results from a thorough evaluation involving four baseline similarity measures and over 15 comparative models based on variants of the proposed method. The proposed method makes significant improvements over baseline models in terms of F1 measure (mostly between 7% and 40%); it always scores the highest precision and is also among the top performers in terms of recall. We also make public the datasets used in this work, which we believe form the largest collection of gold standards for evaluating relation alignment in the LOD context.
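    To make the instance-based idea concrete, here is a minimal Python sketch (using the overlap coefficient as a stand-in for the article's own similarity measure; unlike Jaccard, it is not dominated by the larger of two unbalanced instance sets):

    def instance_pairs(relation, triples):
        # All (subject, object) pairs asserted with this predicate.
        return {(s, o) for s, p, o in triples if p == relation}

    def overlap_similarity(rel_a, rel_b, triples):
        a = instance_pairs(rel_a, triples)
        b = instance_pairs(rel_b, triples)
        if not a or not b:
            return 0.0
        # Overlap coefficient: shared pairs relative to the SMALLER set.
        return len(a & b) / min(len(a), len(b))

    # Toy data: two schemata describing the same instances in one dataset.
    triples = [
        ("Berlin", "dbo:country", "Germany"),
        ("Berlin", "dbp:country", "Germany"),
        ("Munich", "dbo:country", "Germany"),
    ]
    print(overlap_similarity("dbo:country", "dbp:country", triples))  # 1.0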

    A Wavelet-Based Approach to Pattern Discovery in Melodies


    Moving towards the semantic web: enabling new technologies through the semantic annotation of social contents.

    Social Web technologies have caused an exponential growth of the documents available through the Web, making enormous amounts of textual electronic resources available. Users may be overwhelmed by such an amount of content and, therefore, the automatic analysis and exploitation of all this information is of interest to the data mining community. Data mining algorithms exploit features of the entities in order to characterise, group or classify them according to their resemblance. Data by itself does not carry any meaning; it needs to be interpreted to convey information. Classical data analysis methods did not aim to “understand” the content: the data were treated as meaningless numbers, and statistics were calculated on them to build models that were interpreted manually by human domain experts. Nowadays, motivated by the Semantic Web, many researchers have proposed semantic-grounded data classification and clustering methods that are able to exploit textual data at a conceptual level. However, they usually rely on pre-annotated inputs to be able to semantically interpret textual data such as the content of Web pages. The usability of all these methods is related to the linkage between data and its meaning. This work focuses on the development of a general methodology able to detect the most relevant features of a particular textual resource, finding out their semantics (associating them to concepts modelled in ontologies) and detecting its main topics. The proposed methods are unsupervised (avoiding the manual annotation bottleneck), domain-independent (applicable to any area of knowledge) and flexible (able to deal with heterogeneous resources: raw text documents, and semi-structured user-generated documents such as Wikipedia articles or short and noisy tweets). The methods have been evaluated in different fields (Tourism, Oncology). This work is a first step towards the automatic semantic annotation of documents, needed to pave the way towards the Semantic Web vision.
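    A minimal Python sketch of the annotation idea (the toy ontology, the whitespace tokenisation and the tf-idf-style salience score are illustrative assumptions, not the thesis implementation):

    import math
    from collections import Counter

    # Toy stand-in for an ontology: concept label -> concept URI.
    ONTOLOGY = {
        "hotel": "onto:Accommodation",
        "beach": "onto:NaturalSite",
        "museum": "onto:CulturalSite",
    }

    def salience(tokens, background_docs):
        # tf-idf-style score: frequent in the document, rare in the background.
        tf, n = Counter(tokens), len(background_docs)
        return {term: freq * math.log((n + 1) / (1 + sum(term in d for d in background_docs)))
                for term, freq in tf.items()}

    def annotate(text, background_docs):
        tokens = text.lower().split()
        scores = salience(tokens, background_docs)
        # Link the most salient terms to ontology concepts by label match.
        return {t: ONTOLOGY[t]
                for t in sorted(scores, key=scores.get, reverse=True)
                if t in ONTOLOGY}

    background = [{"the", "city", "museum"}, {"the", "beach", "sun"}]
    print(annotate("The hotel near the beach", background))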

    California Water: Looking to the Future

    In general, California has abundant water resources, but they do not occur where people live and work, nor does precipitation occur when water is needed. To deal with these basic disparities, water agencies have built the most extensive plumbing system in the world. Local, regional, state, and federal agencies have constructed reservoirs and aqueducts throughout the State. None of the water projects was constructed easily or without controversy. From one perspective, the history of California is the history of arguing about water. More and more, however, the debates are changing from competition among water users to broader discussions of public concerns and preservation of common interests. Back in 1957, the Department of Water Resources published The California Water Plan (Bulletin 3). That report set forth an ultimate plan of potential water development, essentially demonstrating that the State's water resources are adequate to meet its ultimate needs. Bulletin 3 was followed by the Bulletin 160 series, published four times between 1966 and 1984 to update various elements of California's statewide water planning. These four technical documents examined then-current California water conditions in considerable detail, outlining the Department's expectations of water supplies and water demand in coming decades. The present report differs significantly in approach from its predecessors. Taking a broad view of water events and issues in California, Bulletin 160-87 examines current water use and supply and considers at length how California can continue to meet the water needs of a continually growing population. The report also discusses several leading water management concerns, such as the quality of water supplies, the status of the Sacramento-San Joaquin Delta, and evolving water policies. Overall, Bulletin 160-87 sets forth a wide range of information and views that we hope will aid water managers, elected officials, and the public.

    Investigating chromospheric magnetic activity on young stars and the wide field CAMera for UKIRT

    Hertzsprung-Russell diagrams are one of the most important tools for understanding pre-main sequence evolution when combined with theoretical evolutionary tracks. They are used not only to deduce the properties of the stars they chart but also to estimate the ages of the clusters that house them and to investigate age spreads within episodes of star formation. It is therefore vital that the determination of these diagrams and tracks is built on solid theoretical and observational foundations. However, work in recent years points to a potential problem. It has long been known that pre-main sequence stars exhibit regions of magnetic activity on their stellar surfaces similar to active regions observed on the Sun. What is not yet well known is the extent to which these active regions cover the stellar surfaces.

    Most spectral classification relies on moderate-resolution optical spectra, which tend to be dominated by the non-active photosphere, which is hotter than the active regions. Resultant effective temperatures are overestimated if a large portion of the pre-main sequence stellar surface is covered in active regions, which in turn can lead to substantial error in mass and age calculations. This thesis presents a novel approach to measuring the distribution of magnetic regions on T Tauri stars which aims to overcome limitations of other observing techniques such as Doppler imaging or Zeeman measurements. The central line emission from the strong visible Ca II H & K lines is a proxy indicator of surface magnetic fields and is known, through observations of the Sun, to be enhanced above active plage regions. Simultaneous optical spectroscopic and photometric observations of a significant sample of fast-rotating T Tauri stars in the nearby clusters ρ Ophiuchi and Upper Scorpius have allowed us to ascertain a direct correlation between variations in the Ca II doublet emission and light intensities. Computer simulations which model the surface conditions as understood on T Tauri stars, and which generate correlations mimicking those in the observational data, offer a manipulable tool for estimating how much of the stellar surface is covered.

    The Wide Field CAMera for the UKIRT telescope on Mauna Kea is currently the most capable wide-field infrared imaging survey camera in the world. The instrument focal plane consists of four Hawaii-II 2048 x 2048 IR detectors; to facilitate the best operating conditions and practices for the camera, these detectors must be carefully characterised so that inherent qualities can either be corrected or accounted for. The second part of this thesis details the detector characterisation work carried out prior to the instrument's delivery to the telescope. Obtaining a correct and stable operating temperature, regardless of ambient temperature in the dome enclosure, is key to the camera functioning optimally to carry out highly successful surveys. Presented here is a full model of the camera's thermal behaviour for the main instrument and the infrared detectors.
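    The simulation idea can be illustrated with a toy rotational-modulation model in Python (a single active region with ad hoc contrast values, our assumption for illustration; the thesis models are far more detailed):

    import numpy as np

    def simulate(f=0.2, spot_contrast=0.7, caii_boost=3.0, n=200):
        # One active region of covering fraction f rotates with the star: it
        # dims the broadband flux while enhancing Ca II H & K emission.
        phase = np.linspace(0, 4 * np.pi, n)           # two stellar rotations
        visible = f * np.clip(np.cos(phase), 0, None)  # projected active fraction
        flux = 1 - (1 - spot_contrast) * visible       # photometric light curve
        caii = 1 + (caii_boost - 1) * visible          # Ca II emission proxy
        return flux, caii

    flux, caii = simulate()
    # In this toy model the two series are strongly anticorrelated; comparing
    # synthetic correlations like this one with the observed correlation is
    # what constrains the covering fraction f.
    print(np.corrcoef(flux, caii)[0, 1])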

    What Controls the Controller: Structure and Function Characterizations of Transcription Factor PU.1 Uncover Its Regulatory Mechanism

    The ETS family transcription factor PU.1/Spi-1 is a master regulator of the self-renewal of hematopoietic stem cells and their differentiation along both major lymphoid and myeloid branches. PU.1 activity is determined in a dosage-dependent manner as a function of both its expression and real-time regulation at the DNA level. While control of PU.1 expression is well established, the molecular mechanisms of its real-time regulation remain elusive. Our work is focused on discovering a complete regulatory mechanism that governs the molecular interactions of PU.1. Structurally, PU.1 exhibits a classic transcription factor architecture in which intrinsically disordered regions (IDRs), comprising 66% of its primary structure, are tethered to a well-structured DNA-binding domain. The transcriptionally active form of PU.1 is a monomer that binds target DNA sites as a 1:1 complex. Our investigations show that the IDRs of PU.1 reciprocally control two separate inactive dimeric forms, with and without DNA. At high concentrations, PU.1 forms a non-canonical 2:1 complex at a single specific DNA site. In the absence of DNA, PU.1 also forms a dimer, but one that is incompatible with DNA binding. The DNA-free PU.1 dimer is further promoted by phosphomimetic mutants of IDR residues that are phosphorylated in B-lymphocytic activation. These results lead us to postulate a model of real-time PU.1 regulation, unknown in the ETS family, in which independent dimeric forms antagonize each other to control the dosage of active PU.1 monomer at its target DNA sites. To demonstrate the biological relevance of our model, cellular assays probing PU.1-specific reporters and native target genes show that PU.1 transactivation exhibits a distinct dose response consistent with negative feedback. In summary, we have established the first model for the general real-time regulation of PU.1 at the DNA/protein level, without the need for recruiting specific binding partners. These novel interactions present potential therapeutic targets for correcting deregulated PU.1 dosage in hematologic disorders, including leukemia, lymphoma, and myeloma.
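    The antagonism between the active monomer and an inactive dimer can be illustrated with a simple mass-action sketch (our assumption for illustration, not the paper's quantitative model): if only monomeric PU.1 binds DNA and dimerisation sequesters it, the active fraction grows only sublinearly with total protein, buffering PU.1 dosage as in a negative feedback.

    % Illustrative mass-action sketch: M = active monomer, D = inactive dimer,
    % T = total PU.1 concentration, K_a = association constant for dimerisation.
    \begin{align*}
      2M &\rightleftharpoons D, \qquad [D] = K_a[M]^2 \\
      T  &= [M] + 2[D] = [M] + 2K_a[M]^2 \\
      [M] &= \frac{-1 + \sqrt{1 + 8K_aT}}{4K_a}
           \;\approx\; \sqrt{\frac{T}{2K_a}} \quad (T \gg 1/K_a)
    \end{align*}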