13 research outputs found

    Semantic Analyzer for Spanish, Using Ontologies

    Get PDF
    Analyzing a natural-language text on a particular topic and converting it into a structure that computers can understand is not a trivial task, partly because the conversion process must check at least the grammar, the semantics, and the pragmatics. This is one of the challenges of this project. With the falling cost of computers and access to a vast number of documents (for example, on the Internet), it becomes possible (and desirable) to use a computer to analyze relevant information in documents written in natural language, specifically in Spanish. Current analyzers rely mainly on syntax; that is, they attend to the grammatical rules of Spanish but do not take its semantics into account. For this reason, we have set out to build analyzers that use both grammatical and semantic information. Another challenge is the automatic extraction of knowledge from texts. The tools needed for this conversion are taggers and syntactic, morphological and semantic analyzers. Natural language is inherently ambiguous; humans resolve the ambiguity using common sense, which computers lack. For example, in "the door is yellow", "door" may refer to the door of a car or the door of a house, but how can the computer know this? This article presents a semantic-analysis tool that lets the computer resolve such problems. It describes the design and implementation of a semantic analyzer that identifies the semantic closeness between the words of a sentence using an ontology. A common definition of an ontology is a formal, explicit specification of a shared conceptualization. Such an ontology can be an important tool for representing a descriptive text and analyzing texts in Spanish.
The semantic analyzer extracts data from a common-knowledge ontology consisting of relations, concepts and values stored according to their meaning. The base ontology starts from qualitative data entered manually by a user. The analyzer answers "true" if a sentence is semantically related and "false" if it is not. It is useful for posing natural-language questions to a digital library and for letting service robots understand the meaning of the action to be performed. Its efficiency has been tested on the CoNLL-2009 corpus, obtaining good precision in its results
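A minimal sketch of how such an ontology-backed true/false check might look. The concept graph, relation structure and distance threshold below are invented stand-ins, not the authors' ontology:

```python
# Toy ontology as a concept graph; the analyzer-style check answers
# True/False depending on whether two concepts are close in the graph.
from collections import deque

ONTOLOGY = {  # hypothetical concepts and relations
    "door": {"house", "car"},
    "house": {"door", "window"},
    "car": {"door", "wheel"},
    "window": {"house"},
    "wheel": {"car"},
    "yellow": {"color"},
    "color": {"yellow", "property"},
    "property": {"color"},
}

def graph_distance(a, b, max_depth=3):
    """Breadth-first hop count between two concepts, or None if unreachable."""
    if a == b:
        return 0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d >= max_depth:
            continue
        for nxt in ONTOLOGY.get(node, ()):
            if nxt == b:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def semantically_related(word_a, word_b, threshold=3):
    """Answer True if the two concepts are within `threshold` hops."""
    d = graph_distance(word_a, word_b, max_depth=threshold)
    return d is not None and d <= threshold

print(semantically_related("door", "house"))
print(semantically_related("door", "property"))
```

A real analyzer would first map sentence words to ontology concepts; here that mapping is assumed.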

    BIOMEDICAL WORD SENSE DISAMBIGUATION WITH NEURAL WORD AND CONCEPT EMBEDDINGS

    Get PDF
    Addressing ambiguity issues is an important step in natural language processing (NLP) pipelines designed for information extraction and knowledge discovery. This problem is also common in biomedicine, where NLP applications have become indispensable for exploiting latent information from biomedical literature and clinical narratives in electronic medical records. In this thesis, we propose an ensemble model that employs recent advances in neural word embeddings along with knowledge-based approaches to build a biomedical word sense disambiguation (WSD) system. Specifically, our system identifies the correct sense from a given set of candidates for each ambiguous word when presented in its context (surrounding words). We use the MSH WSD dataset, a well-known public dataset consisting of 203 ambiguous terms, each with nearly 200 instances and an average of two candidate senses represented by concepts in the Unified Medical Language System (UMLS). Our linear-time (in the number of senses and the context length), unsupervised, knowledge-based approach improves over state-of-the-art methods by over 3% in accuracy. A more expensive approach based on the k-nearest neighbor framework improves over the prior best results by 5% in accuracy. Our results demonstrate that recent advances in neural dense word vector representations offer excellent potential for solving biomedical WSD
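A rough sketch of the core unsupervised step (not the thesis's full ensemble): average the embeddings of the context words and pick the candidate sense whose concept vector is most similar by cosine. The tiny 2-d vectors and sense labels are invented, not real UMLS concept embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Centroid of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def disambiguate(context_vectors, sense_vectors):
    """Pick the sense id whose vector is closest to the context centroid."""
    centroid = mean_vector(context_vectors)
    return max(sense_vectors, key=lambda s: cosine(sense_vectors[s], centroid))

# "cold" surrounded by temperature-flavored context words (toy vectors)
context = [[0.9, 0.1], [0.8, 0.2]]
senses = {"cold_illness": [0.1, 0.9], "cold_temperature": [0.9, 0.2]}
print(disambiguate(context, senses))
```

This is linear in the number of senses and the context length, which matches the complexity the abstract claims for the unsupervised variant.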

    Machine learning innovations in address matching: A practical comparison of word2vec and CRFs

    Get PDF
    © 2019 The Authors. Transactions in GIS, published by John Wiley & Sons Ltd. Record linkage is a frequent obstacle to unlocking the benefits of integrated (spatial) data sources. In the absence of unique identifiers to directly join records, practitioners often rely on text-based approaches for resolving candidate pairs of records to a match. In geographic information science, spatial record linkage is a form of geocoding that pertains to the resolution of text-based linkage between pairs of addresses into matches and non-matches. These approaches link text-based address sequences, integrating sources of data that would otherwise remain in isolation. While recent innovations in machine learning have been introduced in the wider record linkage literature, there is significant potential to apply machine learning to the address matching sub-field of geographic information science. In response, this paper introduces two recent developments in text-based machine learning, conditional random fields (CRFs) and word2vec, that have not previously been applied to address matching, evaluating their comparative strengths and drawbacks
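For intuition, a simple text-based baseline for resolving candidate address pairs is cosine similarity over character trigram counts; the methods in the paper replace this hand-built representation with learned ones (word2vec vectors, or CRF-labelled address components). The addresses and threshold below are invented:

```python
# Baseline address matching: cosine similarity of character trigram counts.
from collections import Counter
import math

def trigrams(text):
    """Character trigram counts, padded with spaces at both ends."""
    t = " " + text.lower() + " "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def is_match(addr_a, addr_b, threshold=0.6):
    """Resolve a candidate pair to match/non-match by trigram cosine."""
    return cosine(trigrams(addr_a), trigrams(addr_b)) >= threshold

print(is_match("12 High Street, Leeds", "12 High St Leeds"))
print(is_match("12 High Street, Leeds", "98 Low Road, York"))
```

Trigrams tolerate abbreviations ("Street" vs "St") that exact token matching would miss, which is the weakness the learned representations aim to handle more systematically.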

    An enhanced sequential exception technique for semantic-based text anomaly detection

    Get PDF
    The detection of semantic-based text anomaly is an interesting research area that has gained considerable attention from the data mining community. Text anomaly detection identifies information that deviates from the general information contained in documents. Text data are characterized by problems related to ambiguity, high dimensionality, sparsity and text representation. If these challenges are not properly resolved, identifying semantic-based text anomaly will be less accurate. This study proposes an Enhanced Sequential Exception Technique (ESET) to detect semantic-based text anomaly by achieving five objectives: (1) to modify the Sequential Exception Technique (SET) to process unstructured text; (2) to optimize Cosine Similarity for identifying similar and dissimilar text data; (3) to hybridize the modified SET with Latent Semantic Analysis (LSA); (4) to integrate the Lesk and Selectional Preference algorithms for disambiguating senses and identifying the canonical form of text; and (5) to represent semantic-based text anomaly using First Order Logic (FOL) and a Concept Network Graph (CNG). ESET performs text anomaly detection by employing optimized Cosine Similarity, hybridizing LSA with the modified SET, and integrating it with Word Sense Disambiguation algorithms, specifically Lesk and Selectional Preference. FOL and CNG are then used to represent the detected semantic-based text anomaly. To demonstrate the feasibility of the technique, four selected datasets, namely NIPS data, ENRON, Daily Koss blog, and 20Newsgroups, were experimented on. The experimental evaluation revealed that ESET significantly improves the accuracy of detecting semantic-based text anomaly in documents. The experimental results outperformed the benchmark methods, with improved F1-scores on all datasets: NIPS data 0.75, ENRON 0.82, Daily Koss blog 0.93 and 20Newsgroups 0.97.
The results generated by ESET have proven to be significant and support a growing notion of semantic-based text anomaly that is increasingly evident in the existing literature. Practically, this study contributes to topic modelling and concept coherence for the purposes of visualizing information, knowledge sharing and optimized decision making
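The sequential-exception idea at the core of SET can be sketched as follows (a simplified reading, not ESET itself, which layers LSA, Lesk and Selectional Preference on top): measure the dissimilarity of the document set, then flag the document whose removal reduces that dissimilarity the most. The documents are toy strings with bag-of-words vectors:

```python
# Simplified Sequential Exception sketch over bag-of-words documents.
from collections import Counter
import math

def cosine(c1, c2):
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def dissimilarity(vecs):
    """Average pairwise cosine distance over a set of term-count vectors."""
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    if not pairs:
        return 0.0
    return sum(1 - cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

def sequential_exception(docs):
    """Index of the document whose removal most reduces set dissimilarity."""
    vecs = [Counter(d.lower().split()) for d in docs]
    base = dissimilarity(vecs)
    reductions = [base - dissimilarity(vecs[:i] + vecs[i + 1:])
                  for i in range(len(vecs))]
    return reductions.index(max(reductions))

docs = [
    "stock market prices rose sharply today",
    "market prices fell after the stock report",
    "investors watched stock prices on the market",
    "the recipe needs flour sugar and butter",
]
print(sequential_exception(docs))
```

The off-topic recipe document is flagged because removing it leaves a much more homogeneous set.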

    Dynamic network analytics for recommending scientific collaborators

    Full text link
    Collaboration is one of the most important contributors to scientific advancement and a crucial aspect of an academic's career. However, the explosion in academic publications has, for some time, been making it more challenging to find suitable research partners. Recommendation approaches to help academics find potential collaborators are not new. However, the existing methods operate on static data, which can render many suggestions less useful or out of date. The approach presented in this paper simulates a dynamic network from static data to gain further insights into the changing research interests, activities and co-authorships of scholars in a field, all insights that can improve the quality of the recommendations produced. Following a detailed explanation of the entire framework, from data collection through to recommendation modelling, we provide a case study on the field of information science to demonstrate the reliability of the proposed method. The results provide empirical insights to support decision-making by related stakeholders, e.g., scientific funding agencies, research institutions and individual researchers in the field
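As an illustrative sketch only (the paper's model is richer), a static publication record can be sliced by year and candidate collaborators ranked by recency-weighted links through shared co-authors, so that recent activity counts for more. The author names, papers and half-life parameter below are all invented:

```python
# Recency-weighted collaborator recommendation over a toy publication record.
from collections import defaultdict

papers = [  # (year, authors) — hypothetical data
    (2018, ["ana", "bob"]),
    (2019, ["bob", "carol"]),
    (2021, ["carol", "dan"]),
    (2021, ["ana", "carol"]),
]

def recommend(target, papers, current_year=2022, half_life=2.0):
    """Rank non-co-authors linked to the target via shared co-authors."""
    coauthors = defaultdict(set)   # author -> co-authors over all years
    for year, authors in papers:
        for a in authors:
            coauthors[a].update(x for x in authors if x != a)
    weight = defaultdict(float)    # candidate -> recency-weighted score
    for year, authors in papers:
        w = 0.5 ** ((current_year - year) / half_life)  # decay by age
        for a in authors:
            if a == target or a in coauthors[target]:
                continue
            # candidate shares this paper with one of the target's co-authors
            if coauthors[target] & set(authors):
                weight[a] += w
    return sorted(weight, key=weight.get, reverse=True)

print(recommend("ana", papers))
```

The exponential decay is one simple way to encode the "dynamic" intuition that older co-authorships signal less about current interests.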

    A Semantic neighborhood approach to relatedness evaluation on well-founded domain ontologies

    Get PDF
    In the context of natural language processing and information retrieval, ontologies can improve the results of word sense disambiguation (WSD) techniques. By making the semantics of a term explicit, ontology-based semantic measures play a crucial role in determining how far different ontology classes have a similar or related meaning. In this context, it is common to use semantic similarity as a basis for WSD. However, such measures generally consider only taxonomic relationships, which negatively affects the discrimination of two ontology classes that are related by other relationship types. Semantic relatedness measures, on the other hand, consider diverse types of relationships to determine how much two classes in the ontology are related. However, these measures, especially the path-based approaches, have as their main drawback a high computational complexity for calculating the relatedness value. Also, for both types of semantic measures, it is impractical to store all similarity or relatedness values between all ontology classes in memory, especially for ontologies with a large number of classes. In this work, we propose a novel approach based on semantic neighbors that aims to improve the performance of knowledge-based measures in relatedness analysis. We also explain how to apply this proposal to path-based and feature-based measures. We evaluate our proposal on WSD using an existing domain ontology for well-core description. This ontology contains 929 classes related to rock facies. We also use a set of sentences from four different corpora in the Oil&Gas domain. In the experiments, we compare our proposal with state-of-the-art semantic relatedness measures, such as path-based, feature-based, information-content and hybrid methods, regarding F-score, evaluation time and memory consumption.
The experimental results show that the proposed method obtains F-score gains in WSD, as well as a low evaluation time and memory consumption compared with the traditional knowledge-based measures
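A hedged sketch of the neighborhood idea as described above: rather than storing relatedness values for every pair of classes, precompute for each class only its k-hop neighborhood in the ontology graph and answer queries from that small set. The toy rock-facies classes and the 1/(1+d) scoring are illustrative, not the paper's measure:

```python
# k-hop semantic neighborhoods over a toy ontology graph.
from collections import deque

EDGES = {  # class -> related classes (all relation types pooled)
    "sandstone": ["sedimentary_rock", "grain"],
    "sedimentary_rock": ["sandstone", "shale", "rock"],
    "shale": ["sedimentary_rock", "clay"],
    "rock": ["sedimentary_rock"],
    "grain": ["sandstone"],
    "clay": ["shale"],
}

def neighborhood(cls, k=2):
    """All classes within k hops of cls, mapped to their hop distance."""
    dist = {cls: 0}
    q = deque([cls])
    while q:
        node = q.popleft()
        if dist[node] == k:
            continue
        for nxt in EDGES.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                q.append(nxt)
    return dist

def relatedness(a, b, k=2):
    """1/(1+d) if b lies in a's k-hop neighborhood, else 0."""
    d = neighborhood(a, k).get(b)
    return 0.0 if d is None else 1.0 / (1.0 + d)

print(relatedness("sandstone", "shale"))
print(relatedness("sandstone", "clay"))
```

Memory then grows with neighborhood size rather than with the square of the class count, which is the practical gain the abstract reports.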

    An Application of Natural Language Processing for Triangulation of Cognitive Load Assessments in Third Level Education

    Get PDF
    Work has been done to measure Mental Workload (MWL) in applications mainly related to ergonomics, human factors, and Machine Learning. The influence of Machine Learning reflects an increased use of new technologies in areas conventionally dominated by theoretical approaches. However, collaboration between MWL research and Natural Language Processing techniques seems to happen rarely. In this sense, the objective of this research is to use Natural Language Processing techniques to contribute to the analysis of the relationship between subjective Mental Workload measures and Relative Frequency Ratios of keywords gathered during pre-tasks and post-tasks of MWL activities in third-level sessions under different topics and instructional designs. This research employs secondary, empirical and inductive methods to investigate Cognitive Load theory, instructional designs, Mental Workload foundations and measures, and Natural Language Processing techniques. Then, NASA-TLX, Workload Profile and Relative Frequency Ratios are calculated. Finally, the relationship between NASA-TLX and Workload Profile scores and Relative Frequency Ratios is analysed using parametric and non-parametric statistical techniques. Results show that the relationship between Mental Workload and Relative Frequency Ratios of keywords is only moderately correlated, or not correlated at all. Furthermore, it has been found that instructional designs based on hearing and seeing, and on interaction between participants, can outperform other approaches, such as those that use videos supported with images and text, or a lecturer's speech supported with slides
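One plausible reading of a relative frequency ratio, sketched with invented pre-task and post-task texts (the thesis's exact definition may differ): the keyword's length-normalised frequency after the task divided by its frequency before it.

```python
# Length-normalised keyword frequency ratio between two texts.
def relative_frequency_ratio(keyword, pre_text, post_text):
    """post-task rate / pre-task rate for one keyword (inf if absent before)."""
    pre = pre_text.lower().split()
    post = post_text.lower().split()
    pre_rate = pre.count(keyword) / len(pre)
    post_rate = post.count(keyword) / len(post)
    return post_rate / pre_rate if pre_rate else float("inf")

pre = "today we will study memory and attention in learning"
post = "memory depends on attention and memory load during learning tasks"
print(relative_frequency_ratio("memory", pre, post))
```

Ratios like this per keyword would then be correlated against the NASA-TLX and Workload Profile scores with standard parametric or rank-based tests.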

    Natural Language Processing Methods for Measuring Word Similarity (Luonnollisen kielen käsittelyn menetelmät sanojen samankaltaisuuden mittaamisessa)

    Get PDF
    An artificial intelligence application considered in this thesis was harnessed to extract competencies from job descriptions and higher-education curricula written in natural language. Using these extracted competencies, the application is able to visualize the skills supply of the schools and the skills demand of the labor market. However, to understand natural language, a computer must learn to evaluate the relatedness between words. The aim of the thesis is to propose the best methods for open text data mining and for measuring the semantic similarity and relatedness between words. Different words can have similar meanings in natural language. A computer can learn the relatedness between words mainly by two different methods. We can construct an ontology of the studied domain, which models the concepts of the domain as well as the relations between them. The ontology can be considered a directed graph: the nodes are the concepts of the domain, and the edges between the nodes describe their relations. The semantic similarity between concepts can be computed based on the distance and the strength of the relations between them. The other way to measure word relatedness is based on statistical language models. Such a model learns the similarity between words from their probability distributions in large corpora. Words appearing in similar contexts, i.e., surrounded by similar words, tend to have similar meanings. Words are often represented as continuous distributed word vectors, each dimension representing some feature of the word. A feature can be semantic, syntactic or morphological; however, the features are latent and usually not understandable to a human. If the angle between two word vectors in the feature space is small, the words share the same features and hence are similar. The study was conducted by reviewing the available literature and implementing a web scraper for retrieving open text data from the web. 
The scraped data was fed into the AI application, which extracted the skills from the data and visualized the result in semantic maps
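The vector-space view described above can be illustrated with toy 3-dimensional vectors (hand-made, not trained embeddings): related words have a small angle between their vectors, unrelated words a large one.

```python
# Word similarity as the angle between word vectors (toy data).
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def angle_degrees(u, v):
    """Angle between two vectors, clamped against rounding error."""
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(c))

vectors = {  # invented 3-d "embeddings"
    "programmer": [0.90, 0.80, 0.10],
    "developer":  [0.85, 0.75, 0.20],
    "banana":     [0.10, 0.05, 0.90],
}

print(angle_degrees(vectors["programmer"], vectors["developer"]))
print(angle_degrees(vectors["programmer"], vectors["banana"]))
```

Trained models such as word2vec learn vectors with hundreds of dimensions, but the geometric reading of similarity is the same.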