4,651 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Get PDF
    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and ProtĂ©gĂ©, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org

    Learning Ontology Relations by Combining Corpus-Based Techniques and Reasoning on Data from Semantic Web Sources

    Get PDF
    The manual construction of formal domain conceptualizations (ontologies) is labor-intensive. Ontology learning, by contrast, provides (semi-)automatic ontology generation from input data such as domain text. This thesis proposes a novel approach for learning labels of non-taxonomic ontology relations. It combines corpus-based techniques with reasoning on Semantic Web data. Corpus-based methods apply vector space similarity of verbs co-occurring with labeled and unlabeled relations to calculate relation label suggestions from a set of candidates. A meta ontology in combination with Semantic Web sources such as DBpedia and OpenCyc allows reasoning to improve the suggested labels. An extensive formal evaluation demonstrates the superior accuracy of the presented hybrid approach

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Get PDF
    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and ProtĂ©gĂ©, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org

    Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches

    Get PDF
    While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontology from free-text documents. In this dissertation, I extended these methodologies into biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This previously unconducted review was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches that include three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method revealed an average of 21% and 11% new concept acceptance rates, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%. As a result of this study (published in the Journal of Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough that it can analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships would be missed as domain knowledge evolves

    Automatically Detecting the Resonance of Terrorist Movement Frames on the Web

    Get PDF
    The ever-increasing use of the internet by terrorist groups as a platform for the dissemination of radical, violent ideologies is well documented. The internet has, in this way, become a breeding ground for potential lone-wolf terrorists; that is, individuals who commit acts of terror inspired by the ideological rhetoric emitted by terrorist organizations. These individuals are characterized by their lack of formal affiliation with terror organizations, making them difficult to intercept with traditional intelligence techniques. The radicalization of individuals on the internet poses a considerable threat to law enforcement and national security officials. This new medium of radicalization, however, also presents new opportunities for the interdiction of lone wolf terrorism. This dissertation is an account of the development and evaluation of an information technology (IT) framework for detecting potentially radicalized individuals on social media sites and Web fora. Unifying Collective Action Framing Theory (CAFT) and a radicalization model of lone wolf terrorism, this dissertation analyzes a corpus of propaganda documents produced by several, radically different, terror organizations. This analysis provides the building blocks to define a knowledge model of terrorist ideological framing that is implemented as a Semantic Web Ontology. Using several techniques for ontology guided information extraction, the resultant ontology can be accurately processed from textual data sources. This dissertation subsequently defines several techniques that leverage the populated ontological representation for automatically identifying individuals who are potentially radicalized to one or more terrorist ideologies based on their postings on social media and other Web fora. The dissertation also discusses how the ontology can be queried using intuitive structured query languages to infer triggering events in the news. The prototype system is evaluated in the context of classification and is shown to provide state of the art results. The main outputs of this research are (1) an ontological model of terrorist ideologies (2) an information extraction framework capable of identifying and extracting terrorist ideologies from text, (3) a classification methodology for classifying Web content as resonating the ideology of one or more terrorist groups and (4) a methodology for rapidly identifying news content of relevance to one or more terrorist groups

    Natural Language Processing: Emerging Neural Approaches and Applications

    Get PDF
    This Special Issue highlights the most recent research being carried out in the NLP field to discuss relative open issues, with a particular focus on both emerging approaches for language learning, understanding, production, and grounding interactively or autonomously from data in cognitive and neural systems, as well as on their potential or real applications in different domains

    Studies on User Intent Analysis and Mining

    Get PDF
    Predicting the goals of users can be extremely useful in e-commerce, online entertainment, information retrieval, and many other online services and applications. In this thesis, we study the task of user intent understanding, trying to bridge the gap between user expressions to online services and their goals behind it. As far as we know, most of the existing user intent studies are focusing on web search and social media domain. Studies on other areas are not enough. For example, as people more and more rely our daily life on cellphone, our information needs expressing to mobile devices and related services are increasing dramatically. Studies of user intent mining on mobile devices are not much. And the intentions of using mobile devices are different from the ones we use web search engine or social network. So we cannot directly apply the existing user intention to this area. Besides, user's intents are not stable but changing over time. And different interests will impact each other. Modeling such kind of dynamic user interests can help accurately understand and predict user's intent. But there're few existing works in this area. Moreover, user intent could be explicitly or implicitly expressed by users. The implicit intent expression is more close to human's natural language and also have great value to recognize and mine. To make further studies of these challenges, we first try to answer the question of “What is the user intent?” By referring amount of previous studies, we give our definition of user intent as “User intent is a task-specific, predefined or latent concept, topic or knowledge-base that is under an expression from a user who is trying to express his goal of information or service need.“ Then, we focus on the driving scenario when a user using cellphone and study the user intent in this domain. As far as we know, it is the first time of user intent analysis and categorization in this domain. And we also build a dataset of user input and related intent category and attributes by crowdsourcing and carefully handcraft. With the user intent taxonomy and dataset in hand, we conduct a user intent classification and user intent attribute recognition by supervised machine learning models. To classify the user intent for a user intent query, we use a convolutional neural network model to build a multi-class classifier. And then we use a sequential labeling method to recognize the intent attribute in the query. The experiment results show that our proposed method outperforms several baseline models in precision, recall, and F-score. In addition, we study the implicit user intent mining method through web search log data. By using a Restricted Boltzmann Machine, we make use of the correlation of query and click information to learn the latent intent behind a user web search. We propose a user intent prediction model on online discussion forum using Multivariate Hawkes Process. It dynamically models user intentions change and interact over time.The method models both of the internal and external factors of user's online forum response motivations, and also integrated the time decay fact of user's interests. We also present a data visualization method, using an enriched domain ontology to highlight the domain-specific words and entity relations within an article.Ph.D., Information Studies -- Drexel University, 201
    • 

    corecore