531 research outputs found
Topic Distiller:distilling semantic topics from documents
Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics from textual documents. The design of this system, named Topic Distiller, is inspired by research conducted on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics.
The Topic Distiller is evaluated using methods and datasets used in information retrieval and automatic keyphrase extraction. On top of the common datasets used in the literature three additional datasets are created to evaluate the system.
The evaluation reveals that the Topic Distiller is able to find relevant and latent topics from textual documents, beating the state-of-the-art automatic keyphrase methods in performance when used on news articles and social media posts.Semanttisten aiheiden suodattaminen dokumenteista. Tiivistelmä. Tässä diplomityössä tarkastellaan järjestelmää, joka pystyy löytämään tekstistä relevantteja ja piileviä semanttisia aihealueita, sekä kyseisen järjestelmän suunnittelua ja implementaatiota. Tämän Topic Distiller -järjestelmän suunnittelu ammentaa inspiraatiota automaattisen termintunnistamisen ja automaattisen aiheiden nimeämisen tutkimuksesta sekä hyödyntää automaattista semanttista annotointia ja tietämyskantoja tekstin aihealueiden löytämisessä.
Topic Distiller -järjestelmän suorituskykyä mitataan hyödyntämällä kirjallisuudessa paljon käytettyjä automaattisen termintunnistamisen evaluontimenetelmiä ja aineistoja. Näiden yleisten aineistojen lisäksi esittelemme kolme uutta aineistoa, jotka on luotu Topic Distiller -järjestelmän arviointia varten.
Evaluointi tuo ilmi, että Topic Distiller kykenee löytämään relevantteja ja piileviä aiheita tekstistä. Se päihittää kirjallisuuden viimeisimmät automaattisen termintunnistamisen menetelmät suorituskyvyssä, kun sitä käytetään uutisartikkelien sekä sosiaalisen median julkaisujen analysointiin
Proceedings of the 3rd Workshop on Domain-Specific Language Design and Implementation (DSLDI 2015)
The goal of the DSLDI workshop is to bring together researchers and
practitioners interested in sharing ideas on how DSLs should be designed,
implemented, supported by tools, and applied in realistic application contexts.
We are both interested in discovering how already known domains such as graph
processing or machine learning can be best supported by DSLs, but also in
exploring new domains that could be targeted by DSLs. More generally, we are
interested in building a community that can drive forward the development of
modern DSLs. These informal post-proceedings contain the submitted talk
abstracts to the 3rd DSLDI workshop (DSLDI'15), and a summary of the panel
discussion on Language Composition
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering
Recently, the development of large language models (LLMs) has attracted wide
attention in academia and industry. Deploying LLMs to real scenarios is one of
the key directions in the current Internet industry. In this paper, we present
a novel pipeline to apply LLMs for domain-specific question answering (QA) that
incorporates domain knowledge graphs (KGs), addressing an important direction
of LLM application. As a real-world application, the content generated by LLMs
should be user-friendly to serve the customers. Additionally, the model needs
to utilize domain knowledge properly to generate reliable answers. These two
issues are the two major difficulties in the LLM application as vanilla
fine-tuning can not adequately address them. We think both requirements can be
unified as the model preference problem that needs to align with humans to
achieve practical application. Thus, we introduce Knowledgeable Preference
AlignmenT (KnowPAT), which constructs two kinds of preference set called style
preference set and knowledge preference set respectively to tackle the two
issues. Besides, we design a new alignment objective to align the LLM
preference with human preference, aiming to train a better LLM for
real-scenario domain-specific QA to generate reliable and user-friendly
answers. Adequate experiments and comprehensive with 15 baseline methods
demonstrate that our KnowPAT is an outperforming pipeline for real-scenario
domain-specific QA with LLMs. Our code is open-source at
https://github.com/zjukg/KnowPAT.Comment: Work in progress. Code is available at
https://github.com/zjukg/KnowPA
Recommended from our members
The development of a fuzzy expert system to help top decision makers in political and investment domains
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel UniversityThe world’s increasing interconnectedness and the recent increase in the number of notable regional and international events pose greater and greater challenges for political decision-making, especially the decision to strengthen bilateral economic relationships between friendly nations. Typically, such critical decisions are influenced by certain factors and variables that are based on heterogeneous and vague information that exists in different domains. A serious problem that the decision-maker faces is the difficulty in building efficient political decision support systems (DSS) with heterogeneous factors. One must take many factors into account, for example, language (natural or human language), the availability, or lack thereof, of precise data (vague information), and possible consequences (rule conclusions).
The basic concept is a linguistic variable whose values are words rather than numbers and are therefore closer to human intuition. A common language is thus needed to describe such information which requires human knowledge for interpretation. To achieve robustness and efficiency of interpretation, we need to apply a method that can be used to generate high-level knowledge and information integration. Fuzzy logic is based on natural language and is tolerant of imprecise data. Fuzzy logic’s greatest strength lies in its ability to handle imprecise data, and it is perfectly suited for this situation.
In this thesis, we propose to use ontology to integrate the scattered information resources from the political and investment domains. The process started with understanding each concept and extracting key ideas and relationships between sets of information by constructing object paradigm ontology. Re-engineering according to the object-paradigm (OP) provided quality for the developed ontology where conceptualization can provide more expressive, reusable object and temporal ontology. Then fuzzy logic has been integrated with ontology. And a fuzzy ontology membership value that reflects the strength of an inter-concept relationship to represent pairs of concepts across ontology has been consistently used.
Each concept is assigned a fixed numerical value representing the concept consistency. Concept consistency is computed as a function of strength of all the relationships associated with the concept. Fuzzy expert systems enable one to weigh the consequences (rule conclusions) of certain choices based on vague information. Rule conclusions follow from rules composed of two parts, the if antecedent (input) and the then consequent (output). With fuzzy expert systems, one uses fuzzy logic toolbox graphical user interface (GUI) tools to build up a fuzzy inference system (FIS) to aid in decision-making. This research includes four main phases to develop a prototype architecture for an intelligent DSS that can help top political decision makers
Using semantic technologies to resolve heterogeneity issues in sustainability and disaster management knowledge bases
This thesis examines issues of semantic heterogeneity in the domains of sustainability indicators and disaster management. We propose a model that links two domains with the following logic. While disaster management implies a proper and efficient response to a risk that has materialised as a disaster, sustainability can be defined as the preparedness to unexpected situations by applying measurements such as sustainability indicators. As a step to this direction, we investigate how semantic technologies can tackle the issues of heterogeneity in the aforementioned domains. First, we consider approaches to resolve the heterogeneity issues of representing the key concepts of sustainability indicator sets. To develop a knowledge base, we apply the METHONTOLOGY approach to guide the construction of two ontology design candidates: generic and specic. Of the two, the generic design is more abstract, with fewer classes and properties. Documents describing two indicator systems - the Global Reporting Initiative and the Organisation for Economic Co-operation and Development - are used in the design of both candidate ontologies. We then evaluate both ontology designs using the ROMEO approach, to calculate their level of coverage against the seen indicators, as well as against an unseen third indicator set (the United Nations Statistics Division). We also show that use of existing structured approaches like METHONTOLOGY and ROMEO can reduce ambiguity in ontology design and evaluation for domain-level ontologies. It is concluded that where an ontology needs to be designed for both seen and unseen indicator systems, a generic and reusable design is preferable. Second, having addressed the heterogeneity issues at the data level of sustainability indicators in the first phase of the research, we then develop a software for a sustainability reporting framework - Circles of Sustainability - which provides two mechanisms for browsing heterogeneous sustainability indicator sets: a Tabular view and a Circular view. In particular, the generic design of ontology developed during the first phase of the research is applied to this software. Next, we evaluate the overall usefulness and ease of use for the presented software and the associated user interfaces by conducting a user study. The analysis of quantitative and qualitative results of the user study concludes that the Circular view is the preferred interface by most participants for browsing semantic heterogeneous indicators. Third, in the context of disaster management, we present a geotagger method for the OzCrisisTracker application that automatically detects and disambiguates the heterogeneity of georeferences mentioned in the tweets' content with three possibilities: definite, ambiguous and no-location. Our method semantically annotates the tweet components utilising existing and new ontologies. We also concluded that the accuracy of geographic focus of our geotagger is considerably higher than other systems. From a more general perspective the research contributions can be articulated as follows. The knowledge bases developed in this research have been applied to the two domain applications. The thesis therefore demonstrates how semantic technologies, such as ontology design patterns, browsing tools and geocoding, can untangle data representation and navigation issues of semantic heterogeneity in sustainability and disaster management domains
Recommended from our members
Towards an aspect weaving BPEL engine
This position paper proposes the use of dynamic aspects and
the visitor design pattern to obtain a highly configurable and
extensible BPEL engine. Using these two techniques, the
core of this infrastructural software can be customised to
meet new requirements and add features such as debugging,
execution monitoring, or changing to another Web Service
selection policy. Additionally, it can easily be extended to
cope with customer-specific BPEL extensions. We propose
the use of dynamic aspects not only on the engine itself
but also on the workflow in order to tackle the problems of
Web Service hot deployment and hot fixes to long running
processes. In this way, composing aWeb Service "on-the-fly"
means weaving its choreography interface into the workflow
Understanding Patient Safety Reports via Multi-label Text Classification and Semantic Representation
Medical errors are the results of problems in health care delivery. One of the key steps to eliminate errors and improve patient safety is through patient safety event reporting. A patient safety report may record a number of critical factors that are involved in the health care when incidents, near misses, and unsafe conditions occur. Therefore, clinicians and risk management can generate actionable knowledge by harnessing useful information from reports. To date, efforts have been made to establish a nationwide reporting and error analysis mechanism. The increasing volume of reports has been driving improvement in quantity measures of patient safety. For example, statistical distributions of errors across types of error and health care settings have been well documented. Nevertheless, a shift to quality measure is highly demanded. In a health care system, errors are likely to occur if one or more components (e.g., procedures, equipment, etc.) that are intrinsically associated go wrong. However, our understanding of what and how these components are connected is limited for at least two reasons. Firstly, the patient safety reports present difficulties in aggregate analysis since they are large in volume and complicated in semantic representation. Secondly, an efficient and clinically valuable mechanism to identify and categorize these components is absent.
I strive to make my contribution by investigating the multi-labeled nature of patient safety reports. To facilitate clinical implementation, I propose that machine learning and semantic information of reports, e.g., semantic similarity between terms, can be used to jointly perform automated multi-label classification. My work is divided into three specific aims. In the first aim, I developed a patient safety ontology to enhance semantic representation of patient safety reports. The ontology supports a number of applications including automated text classification. In the second aim, I evaluated multilabel text classification algorithms on patient safety reports. The results demonstrated a list of productive algorithms with balanced predictive power and efficiency. In the third aim, to improve the performance of text classification, I developed a framework for incorporating semantic similarity and kernel-based multi-label text classification. Semantic similarity values produced by different semantic representation models are evaluated in the classification tasks. Both ontology-based and distributional semantic similarity exerted positive influence on classification performance but the latter one shown significant efficiency in terms of the measure of semantic similarity.
Our work provides insights into the nature of patient safety reports, that is a report can be labeled by multiple components (e.g., different procedures, settings, error types, and contributing factors) it contains. Multi-labeled reports hold promise to disclose system vulnerabilities since they provide the insight of the intrinsically correlated components of health care systems. I demonstrated the effectiveness and efficiency of the use of automated multi-label text classification embedded with semantic similarity information on patient safety reports. The proposed solution holds potential to incorporate with existing reporting systems, significantly reducing the workload of aggregate report analysis
- …