4,726 research outputs found

    Deriving Verb Predicates By Clustering Verbs with Arguments

    Hand-built verb clusters such as the widely used Levin classes (Levin, 1993) have proved useful but have limited coverage. Verb classes automatically induced from corpus data, such as those from VerbKB (Wijaya, 2016), can give clusters with much larger coverage and can be adapted to specific corpora such as Twitter. We present a method for clustering the outputs of VerbKB: verbs with their multiple argument types, e.g. "marry(person, person)" and "feel(person, emotion)". We make use of a novel low-dimensional embedding of verbs and their arguments to produce high-quality clusters in which the same verb can appear in different clusters depending on its argument type. The resulting verb clusters predict sarcasm, sentiment, and locus of control in tweets better than hand-built clusters do.
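
    As an illustration of the clustering step, here is a minimal sketch that groups typed predicates such as "marry(person, person)" by embedding each verb together with its argument types. The paper's actual low-dimensional embedding is not reproduced; the averaged pretrained word vectors (the `vectors` lookup) and the use of k-means are stand-in assumptions.

    # A minimal sketch, not the paper's method: embed a typed predicate by
    # averaging pretrained vectors (assumed `vectors` dict: token -> np.ndarray)
    # for the verb and its argument types, then cluster with k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    def predicate_vector(verb, arg_types, vectors):
        # Average the verb vector with its argument-type vectors, so the
        # same verb gets different representations under different types.
        tokens = [verb] + list(arg_types)
        return np.mean([vectors[t] for t in tokens], axis=0)

    def cluster_predicates(predicates, vectors, n_clusters=50):
        # predicates: list of (verb, (arg_type, ...)) tuples.
        X = np.stack([predicate_vector(v, args, vectors) for v, args in predicates])
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
        return dict(zip(predicates, labels))

    # Example input: [("marry", ("person", "person")), ("feel", ("person", "emotion"))]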

    Semantically-guided evolutionary knowledge discovery from texts

    This thesis proposes a new approach to structured knowledge discovery from texts that considers the mining process itself, the model's evaluation of the discovered knowledge, and human assessment of the quality of the outcome. This is achieved by integrating Natural-Language technology and Genetic Algorithms to produce explanatory novel hypotheses. Natural-Language techniques are specifically used to extract genre-based information from text documents; additional semantic and rhetorical information is also captured, both for generating training data and for feeding a semistructured Latent Semantic Analysis process. The discovery process is modeled by a semantically-guided Genetic Algorithm which uses the training data to guide the search and optimization process. A number of novel criteria for evaluating the quality of the new knowledge are proposed. Consequently, new genetic operations suitable for text mining are designed, and techniques for Evolutionary Multi-Objective Optimization are adapted so that the model can trade off between the different criteria in the hypotheses. In an experiment, domain experts assessed the quality of the hypotheses produced by the model so as to establish their effectiveness in terms of novel and interesting knowledge. The assessment showed encouraging results both for the discovered knowledge and for the correlation between the model's judgments and the human opinions.
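
    As a rough, hypothetical illustration of the evolutionary search described above, the sketch below runs a generic genetic-algorithm loop over candidate hypotheses. The criterion functions, weights, and `crossover`/`mutate` operators are placeholders supplied by the caller; the weighted-sum fitness only approximates the thesis's multi-objective trade-off.

    # A generic GA loop, sketched under assumptions: hypotheses are opaque
    # objects, and each quality criterion is a caller-supplied function.
    import random

    def evolve(population, fitness_fns, weights, crossover, mutate,
               generations=100, mutation_rate=0.1):
        # Scalarize the criteria; a crude stand-in for true
        # multi-objective optimization over hypothesis quality.
        def fitness(h):
            return sum(w * f(h) for f, w in zip(fitness_fns, weights))

        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            parents = ranked[: len(population) // 2]  # keep the fitter half
            children = []
            while len(children) < len(population) - len(parents):
                a, b = random.sample(parents, 2)  # assumes population size >= 4
                child = crossover(a, b)
                if random.random() < mutation_rate:
                    child = mutate(child)
                children.append(child)
            population = parents + children
        return max(population, key=fitness)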

    Doctor of Philosophy

    The objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers using traditional information retrieval techniques to search large bibliographic databases are often overwhelmed by excessive, irrelevant data. Scientists have applied natural language processing technologies to improve retrieval. Text summarization, a natural language processing approach, simplifies bibliographic data while filtering it to address a user's need. Traditional text summarization can necessitate the use of multiple software applications to accommodate diverse processing refinements known as "points-of-view." A new, statistical approach to text summarization can transform this process. Combo, a statistical algorithm comprising three individual metrics, determines which elements within the input data are relevant to a user's specified information need, thus enabling a single software application to summarize text for many points-of-view. In this dissertation, I describe this algorithm and the four-study research process used to develop and test it. The goal of the first study was to create a conventional schema accommodating a genetic disease etiology point-of-view, together with an evaluative reference standard; this was accomplished by simulating the task of secondary genetic database curation. The second study addressed the development and initial evaluation of the algorithm, comparing its performance to the conventional schema against the previously established reference standard, again within the task of secondary genetic database curation. The third and fourth studies evaluated the algorithm's performance in accommodating additional points-of-view in a simulated clinical decision support task: the third study explored prevention, while the fourth evaluated performance for prevention and drug treatment, comparing results to a conventional treatment schema's output. Both summarization methods identified data salient to their tasks. The conventional genetic disease etiology and treatment schemas located salient information for database curation and decision support, respectively; the Combo algorithm located salient genetic disease etiology, treatment, and prevention data for the associated tasks. Dynamic text summarization could potentially serve additional purposes, such as consumer health information delivery, systematic review creation, and primary research, and may benefit many user groups.
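
    The abstract describes Combo only as a combination of three statistical metrics without naming them, so the three scorers in the sketch below (lexical diversity, query-term overlap, and sentence position) are hypothetical stand-ins. They illustrate the overall shape: score each candidate sentence against a point-of-view query, so that swapping the query terms re-targets the same code to a different point-of-view.

    # A hypothetical three-metric scorer in the spirit of Combo; the real
    # metrics are not given in the abstract.
    from collections import Counter

    def combo_score(sentence, query_terms, position, doc_len):
        words = sentence.lower().split()
        counts = Counter(words)
        diversity = len(counts) / max(len(words), 1)                        # stand-in metric 1
        overlap = sum(counts[t] for t in query_terms) / max(len(words), 1)  # stand-in metric 2
        salience = 1.0 - position / max(doc_len, 1)                         # stand-in metric 3
        return (diversity + overlap + salience) / 3.0

    def summarize(sentences, query_terms, k=3):
        # query_terms (lowercase) encode the point-of-view, e.g. etiology,
        # prevention, or drug-treatment vocabulary.
        scored = [(combo_score(s, query_terms, i, len(sentences)), s)
                  for i, s in enumerate(sentences)]
        return [s for _, s in sorted(scored, reverse=True)[:k]]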

    Automatic Identification of Interestingness in Biomedical Literature

    This thesis presents research on automatically identifying interestingness in a graph of semantic predications. Interestingness represents a subjective quality of information reflecting its value in meeting a user's known or unknown retrieval needs. The perception of information as interesting requires a level of utility for the user as well as a balance between significant novelty and sufficient familiarity. It can also be influenced by additional factors, such as unexpectedness or serendipitous connections with recent experiences. The ability to identify interesting information facilitates the development of user-centered retrieval, especially in semantic summarization and in iterative, step-wise searching such as discovery browsing systems. Ultimately, this allows biomedical researchers to more quickly identify information of greatest potential interest to them, whether expected or, perhaps more importantly, unexpected. Current discovery browsing systems use iterative information retrieval to discover new knowledge, a process that requires finding relevant co-occurring topics and relationships through consistent human involvement to identify interesting concepts. Although interestingness is subjective, this thesis identifies computable quantities in semantic data that correlate with interestingness in user searches. We compare several statistical and rule-based models that correlate graph data extracted from semantic predications with concept interestingness as demonstrated in PubMed queries. Semantic predications represent scientific assertions extracted from all of the biomedical literature contained in the MEDLINE database. They are of the form subject-predicate-object. Predications can easily be represented as graphs, where subjects and objects are nodes and predicates form edges; a graph of predications represents the assertions made in the citations from which the predications were extracted. This thesis uses graph metrics to identify features from the predication graph for model generation. These features are based on the degree centrality (connectedness) of the seed concept node and surrounding nodes, as well as on frequency-of-occurrence measures of the edges between the seed concept and surrounding nodes, and between those surrounding nodes and their neighbors. A PubMed query log is used for training and testing models of interestingness. This log contains a set of user searches over a 24-hour period, and we make the assumption that co-occurrence of a concept with the seed concept in searches demonstrates the interestingness of that concept with regard to the seed concept. Graph generation begins with the selection of all predications containing the seed concept from the Semantic Medline database (our training dataset uses Alzheimer's disease as the seed concept). The graph is built with the seed concept as the central node; additional nodes are added for each concept that occurs with the seed concept in the initial predications, and an edge is created for each instance of a predication containing the two concepts. Each edge is labeled with the specific predicate in the predication, and the graph is extended to include additional nodes within two leaps from the seed concept. The concepts in the PubMed query logs are normalized to UMLS concepts or Entrez Gene symbols using MetaMap, and token-based and user-based counts are collected for each co-occurring term. These measures are combined into a weighted score, which is used to determine three potential thresholds of interestingness based on deviation from the mean score. The concepts included in both the graph and the normalized log data are identified for use in model training and testing.
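
    A minimal sketch of these final steps, assuming predications arrive as (subject, predicate, object) triples and using networkx: build the labeled multigraph, read off degree centrality for the graph-side features, and combine token- and user-based counts into a weighted score with mean-deviation cutoffs. The equal weighting and the mean-plus-k-standard-deviations thresholds are assumptions, since the text does not give the exact formula.

    # Sketch under stated assumptions; not the thesis's exact scoring.
    import statistics
    import networkx as nx

    def build_predication_graph(predications):
        # One node per concept, one edge per predication,
        # labeled with its predicate.
        g = nx.MultiGraph()
        for subj, pred, obj in predications:
            g.add_edge(subj, obj, predicate=pred)
        return g

    def interestingness_thresholds(token_counts, user_counts,
                                   w_token=0.5, w_user=0.5):
        # Weighted combination of the two count types (weights assumed),
        # with three cutoffs at the mean plus 1, 2, and 3 standard
        # deviations of the combined scores.
        concepts = set(token_counts) | set(user_counts)
        scores = {c: w_token * token_counts.get(c, 0) + w_user * user_counts.get(c, 0)
                  for c in concepts}
        mu = statistics.mean(scores.values())
        sd = statistics.pstdev(scores.values())
        return scores, [mu + k * sd for k in (1, 2, 3)]

    # Degree centrality of the seed concept and its neighbors supplies the
    # graph features, e.g. centrality = nx.degree_centrality(g).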