News Text Classification Based on an Improved Convolutional Neural Network
With the explosive growth of Internet news media and the disorganized state of news texts, this paper puts forward an automatic news classification model based on a Convolutional Neural Network (CNN). In the model, Word2vec is first combined with Latent Dirichlet Allocation (LDA) to generate an effective text feature representation. An attention mechanism is then incorporated into the model, assigning higher attention probability values to key features to achieve accurate judgments. The results show that the precision, recall and F1 value of the model reach 96.4%, 95.9% and 96.2% respectively, which indicates that the improved CNN, through its unique framework, can extract deep semantic features of the text and provide strong support for building an efficient and accurate news text classification model.
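The feature-fusion idea described above (merging Word2vec embeddings with LDA topic information) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the tiny random "embeddings" and the hand-written topic distribution stand in for a trained Word2vec model and a fitted LDA model.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors standing in for trained Word2vec
# embeddings (a real model would use 100+ dimensions).
rng = np.random.default_rng(0)
vocab = ["market", "stocks", "economy", "match", "goal", "team"]
word_vecs = {w: rng.normal(size=4) for w in vocab}

def doc_features(tokens, topic_dist):
    """Concatenate the mean word embedding with an LDA topic distribution."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    mean_vec = np.mean(vecs, axis=0) if vecs else np.zeros(4)
    return np.concatenate([mean_vec, topic_dist])

# The topic distribution would come from a fitted LDA model; here it is made up.
feats = doc_features(["market", "stocks"], np.array([0.7, 0.2, 0.1]))
print(feats.shape)  # (7,)
```

The concatenated vector (embedding dimensions plus topic proportions) would then be fed to the CNN as its input representation.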
Natural language processing for semiautomatic semantics extraction: encyclopedic entry disambiguation and relationship extraction using Wikipedia and WordNet
Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, September 200
Automatic text summarisation using linguistic knowledge-based semantics
Text summarisation is the task of reducing a text document to a short substitute summary. Since the commencement of the field, almost all summarisation research to date has involved the identification and extraction of the most important document/cluster segments, an approach called extraction. This typically involves scoring each document sentence according to a composite scoring function consisting of surface-level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both text semantic analysis and equipping computers with external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variations in CatVar to improve summary quality. These improvements are accomplished through sentence-level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness, using heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), leveraged with background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence similarity and summarisation methods were evaluated on standard publicly available datasets such as the Microsoft Research Paraphrase Corpus (MSRPC), TREC-9 Question Variants, and the Document Understanding Conference 2002, 2005 and 2006 (DUC 2002, DUC 2005, DUC 2006) corpora. The project also uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers' performances. Results showed the effectiveness of our systems compared to related state-of-the-art summarisation methods and baselines.
Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance.
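The ROUGE evaluation mentioned in the abstract above can be illustrated with a minimal ROUGE-1 recall computation. This is a didactic sketch, not the official ROUGE toolkit, which adds stemming, stopword options, and n-gram variants.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams covered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each reference token counts at most as often as it
    # appears in the candidate.
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

score = rouge1_recall("the cat sat on the mat", "the cat was on the mat")
print(round(score, 2))  # 0.83 (5 of 6 reference unigrams are covered)
```

ROUGE-2 and ROUGE-L follow the same clipped-overlap idea over bigrams and longest common subsequences, respectively.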
An ontology for human-like interaction systems
This report proposes and describes the development of a Ph.D. thesis aimed at building an ontological knowledge model supporting Human-Like Interaction systems. The main function of such a knowledge model in a human-like interaction system is to unify the representation of each concept, relating it to the appropriate terms, as well as to other concepts with which it shares semantic relations.
When developing human-like interactive systems, the inclusion of an ontological module can be valuable both for supporting interaction between participants and for enabling accurate cooperation among the diverse components of such an interaction system. On one hand, during human communication, the relation between cognition and messages relies on the formalization of concepts, linked to terms (or words) in a language that enables their utterance (at the expressive layer). Moreover, each participant has a unique conceptualization (ontology), different from any other individual's. Through interaction, it is the intersection of both parties' conceptualizations that enables communication. Therefore, for human-like interaction it is crucial to have a strong conceptualization, backed by a vast net of terms linked to its concepts, and the ability to map it onto any interlocutor's ontology to support denotation.
On the other hand, the diverse knowledge models comprising a human-like interaction system (situation model, user model, dialogue model, etc.) and its interface components (natural language processor, voice recognizer, gesture processor, etc.) will be continuously exchanging information during their operation. They are therefore also required to share a solid base of references to concepts, which provides consistency, completeness and quality to their processing.
Besides, humans usually handle a range of similar concepts they can use when building messages. The subject of similarity has been, and continues to be, widely studied in the fields and literature of computer science, psychology and sociolinguistics. Good similarity measures are necessary for several techniques from these fields, such as information retrieval, clustering, data mining, sense disambiguation, ontology translation and automatic schema matching. Furthermore, the ontological component should also be able to perform certain inferential processes, such as the calculation of semantic similarity between concepts. The principal benefit gained from this procedure is the ability to substitute one concept for another based on a calculation of the similarity of the two, given specific circumstances. From the human's perspective, the procedure enables referring to a given concept in cases where the interlocutor either does not know the term(s) initially applied to refer to that concept, or does not know the concept itself. In the first case, the use of synonyms suffices, while in the second it will be necessary to refer to the concept through other similar (semantically related) concepts...
Official Doctoral Programme in Computer Science and Technology. Secretary: Inés María Galván León.- Secretary: José María Cavero Barca.- Committee member: Yolanda García Rui
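The semantic-similarity inference described in the abstract above can be illustrated with a simple path-based measure over a toy is-a taxonomy. This is a generic sketch (the taxonomy and the specific formula are invented for illustration), not the thesis's actual similarity measure.

```python
# Toy is-a taxonomy (hypothetical); maps each concept to its parent.
parents = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": "entity",
}

def ancestors(concept):
    """Return the concept followed by its chain of ancestors up to the root."""
    chain = [concept]
    while concept in parents:
        concept = parents[concept]
        chain.append(concept)
    return chain

def path_similarity(a, b):
    """1 / (1 + shortest path length through the lowest common ancestor)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    common = next(c for c in anc_a if c in anc_b)  # lowest common ancestor
    dist = anc_a.index(common) + anc_b.index(common)
    return 1.0 / (1.0 + dist)

print(path_similarity("dog", "cat"))      # LCA is "mammal": 1/3
print(path_similarity("dog", "sparrow"))  # LCA is "animal": 0.2
```

A concept could then be substituted by another whenever their similarity exceeds a context-dependent threshold, as the abstract suggests.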
Introduction to Linguistics for English Language Teaching
We envisaged this book as a main reference for English language teachers. It lays out, in both theoretical and practical terms, why English language teachers should study linguistics for their future professional teaching careers, and why many of our most common views about the study of linguistics are fundamentally important. The book complements this theoretical grounding with practical assignments and authentic tasks. These are times that try language teachers' souls where linguistics is concerned, and, for that reason, this book offers its own modest contribution to knowledge development.
Quran Ontology: Review On Recent Development And Open Research Issues
The Quran is the holy book of Muslims that contains the commandments and words of Allah. The Quran provides instructions and guidance to humankind for achieving happiness both in this life and in the hereafter. As a holy book, the Quran contains rich knowledge and scientific facts. However, humans have difficulty in understanding the Quran's content, because the meaning of the searched message content depends on interpretation. An ontology is able to store a knowledge representation of the Holy Quran. This paper studies recent ontology research on the Holy Quran. We investigate the current trends and the technology being applied. This investigation covers several aspects, such as the outcomes of previous studies, the languages used in ontology development, the coverage areas of Quran ontologies, datasets, tools used for ontology development, ontology population techniques, approaches used to integrate the knowledge of the Quran and other resources into an ontology, ontology testing techniques, and the limitations of previous research. This review has identified four major issues involved in Quran ontology: the availability of Quran ontologies in various translations, ontology resources, the automated extraction of meronymy relationships, and instance classification. The review of existing studies will give future researchers a broad and useful background knowledge of the primary and essential aspects of this research field.
Combining Knowledge- and Corpus-based Word-Sense-Disambiguation Methods
In this paper we concentrate on the resolution of the lexical ambiguity that
arises when a given word has several different meanings. This specific task is
commonly referred to as word sense disambiguation (WSD). The task of WSD
consists of assigning the correct sense to words using an electronic dictionary
as the source of word definitions. We present two WSD methods based on two main
methodological approaches in this research area: a knowledge-based method and a
corpus-based method. Our hypothesis is that word-sense disambiguation requires
several knowledge sources in order to solve the semantic ambiguity of the
words. These sources can be of different kinds: for example, syntagmatic,
paradigmatic or statistical information. Our approach combines various sources
of knowledge, through combinations of the two WSD methods mentioned above.
Mainly, the paper concentrates on how to combine these methods and sources of
information in order to achieve good results in the disambiguation. Finally,
this paper presents a comprehensive study and experimental work on evaluation
of the methods and their combinations.
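One simple way to combine WSD methods of the kind the abstract above describes is a weighted vote over their sense predictions. This is a generic illustration of method combination, not the paper's actual combination scheme; the method names and weights are invented.

```python
from collections import Counter

def combine_wsd(predictions, weights=None):
    """Weighted vote over sense predictions from several WSD methods.

    predictions: dict mapping method name -> predicted sense.
    weights:     optional dict of per-method reliabilities (default 1.0 each).
    """
    weights = weights or {}
    tally = Counter()
    for method, sense in predictions.items():
        tally[sense] += weights.get(method, 1.0)
    return tally.most_common(1)[0][0]

# Hypothetical outputs for the word "bank" from different method families.
preds = {"knowledge_based": "bank/finance",
         "corpus_based": "bank/finance",
         "baseline_mfs": "bank/river"}
print(combine_wsd(preds))  # bank/finance
```

Richer combinations (e.g. letting one method act as a filter on another's candidate senses) follow the same pattern of treating each method's output as one weighted knowledge source.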
Resolving Other-Anaphora
Institute for Communicating and Collaborative Systems
Reference resolution is a major component of any natural language system. In the past
30 years significant progress has been made in coreference resolution. However, there
is more anaphora in texts than coreference. I present a computational treatment of
other-anaphora, i.e., referential noun phrases (NPs) with non-pronominal heads
modified by "other" or "another":
[. . . ] the move is designed to more accurately reflect the value of products
and to put steel on more equal footing with other commodities.
Such NPs are anaphoric (i.e., they cannot be interpreted in isolation), with an antecedent
that may occur in the previous discourse or the speaker's and hearer's mutual
knowledge. For instance, in the example above, the NP "other commodities" refers to
a set of commodities excluding steel, and it can be paraphrased as "commodities other
than steel".
Resolving such cases requires first identifying the correct antecedent(s) of the
other-anaphors. This task is the major focus of this dissertation. Specifically, the
dissertation achieves two goals. First, it describes a procedure by which antecedents
of other-anaphors can be found, including constraints and preferences which narrow
down the search. Second, it presents several symbolic, machine learning and hybrid
resolution algorithms designed specifically for other-anaphora. All the algorithms have
been implemented and tested on a corpus of examples from the Wall Street Journal.
The major results of this research are the following:
1. Grammatical salience plays a lesser role in resolving other-anaphors than in resolving
pronominal anaphora. Algorithms that solely rely on grammatical features
achieved worse results than algorithms that used semantic features as well.
2. Semantic knowledge (such as "steel is a commodity") is crucial in resolving
other-anaphors. Algorithms that operate solely on semantic features outperformed
those that operate on grammatical knowledge.
3. The quality and relevance of the semantic knowledge base is important to success.
WordNet proved insufficient as a source of semantic information for resolving
other-anaphora. Algorithms that use the Web as a knowledge base achieved better performance than those using WordNet, because the Web contains domain-specific
and general world knowledge which is not available from WordNet.
4. But semantic information by itself is not sufficient to resolve other-anaphors, as
it seems to overgenerate, leading to many false positives.
5. Although semantic information is more useful than grammatical information,
only integration of semantic and grammatical knowledge sources can handle the
full range of phenomena. The best results were obtained from a combination of
semantic and grammatical resources.
6. A probabilistic framework is best at handling the full spectrum of features, both
because it does not require commitment as to the order in which the features
should be applied, and because it allows features to be treated as preferences,
rather than as absolute constraints.
7. A full resolution procedure for other-anaphora requires both a probabilistic model
and a set of informed heuristics and back-off procedures. Such a hybrid system
achieved the best results so far on other-anaphora.
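Point 6 above, treating features as soft preferences inside a probabilistic model rather than as absolute constraints, can be sketched as a log-odds combination of per-feature evidence. This is a generic naive-Bayes-style illustration; the feature names and weights are invented, not taken from the dissertation.

```python
import math

# Hypothetical likelihood ratios P(feature | antecedent correct) /
# P(feature | antecedent incorrect), as a probabilistic model might estimate.
FEATURE_WEIGHTS = {
    "is_hypernym_on_web": 4.0,     # semantic evidence (strong)
    "same_grammatical_role": 1.5,  # grammatical salience (weak preference)
    "in_previous_sentence": 2.0,
}

def score_antecedent(features, prior=0.5):
    """Combine soft preferences into a log-odds score instead of hard filters."""
    log_odds = math.log(prior / (1 - prior))
    for feature, present in features.items():
        if present:
            log_odds += math.log(FEATURE_WEIGHTS.get(feature, 1.0))
    return log_odds

# Two hypothetical antecedent candidates for an other-anaphor.
cand_a = {"is_hypernym_on_web": True, "in_previous_sentence": True}
cand_b = {"same_grammatical_role": True}
print(score_antecedent(cand_a) > score_antecedent(cand_b))  # True
```

Because every feature only shifts the score rather than vetoing a candidate, no fixed application order is needed and weakly contradicted candidates are still considered, which is exactly the advantage point 6 claims for the probabilistic framework.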
Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches
While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain's concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontologies from free-text documents. In this dissertation, I extended these methodologies to biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This review, the first of its kind, was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches, covering the three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method achieved average new-concept acceptance rates of 21% and 11%, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%.
As a result of this study (published in Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough to analyze the performance of ontology enrichment methods in many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships will be missed as domain knowledge evolves.
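The Hearst method mentioned in the abstract above is based on lexico-syntactic patterns such as "NP such as NP, NP and NP". A minimal regex sketch of one such pattern follows; it handles only single-word, lowercase terms and is an illustration of the technique, not the dissertation's pipeline.

```python
import re

# One classic Hearst pattern: "<hypernym> such as <hyponym> (, <hyponym>)*
# ((and|or) <hyponym>)?". Restricted to single-word terms for simplicity.
PATTERN = re.compile(
    r"(\w+)\s+such as\s+(\w+(?:,\s*\w+)*(?:\s+(?:and|or)\s+\w+)?)"
)

def hearst_hyponyms(sentence):
    """Extract (hyponym, hypernym) pairs from 'such as' constructions."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        hypernym = m.group(1)
        hyponyms = re.split(r",\s*|\s+(?:and|or)\s+", m.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs

print(hearst_hyponyms("imaging findings such as edema, nodule and mass"))
# [('edema', 'findings'), ('nodule', 'findings'), ('mass', 'findings')]
```

Real systems add part-of-speech tagging or chunking so that multi-word noun phrases (e.g. "pleural effusion") are captured as single terms; the extracted pairs then become candidate concepts and is-a relationships for the target ontology.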