
    Normalisation of imprecise temporal expressions extracted from text

    Advisor: Prof. Dr. Marcos Didonet Del Fabro. Co-advisor: Prof. Dr. Angus Roberts. Doctoral thesis - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defended: Curitiba, 05/04/2016. Includes references: f. 95-105.
    Abstract: Information Extraction systems and techniques are able to deal with the increasing amount of unstructured data available nowadays. Time is amongst the different kinds of information that may be extracted from such unstructured data sources, including text documents. Time describes changes which happen through the occurrence of events, and provides a way to record, order, and measure the duration of such occurrences. The inability to identify and extract temporal information from text makes it difficult to understand how events are organised in chronological order. Moreover, in many situations, the meaning of temporal expressions is imprecise and cannot be accurately described, leading to interpretation errors. Existing solutions provide alternative ways of representing imprecise temporal expressions, though they are specific and hard to generalise. Furthermore, the analysis of temporal data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text; however, they are not rich enough to process misspellings in an efficient way. In this thesis, we present a methodology to analyse and normalise imprecise temporal expressions, in which, after collecting and pre-processing data on how people interpret vague descriptions of time in text, we compare different techniques in order to create and select the most appropriate normalisation model for different kinds of imprecise expressions. We also compare how a rule-based system and a machine learning approach perform when trying to identify temporal expressions in text, and we analyse the process of producing gold standards, identifying possible sources of issues and giving some recommendations to be considered in future manual annotation efforts. Finally, we propose a phonetic map and evaluate how encoding phonetic information could be used to assist similarity search methods and improve information extraction quality.
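
    To make the normalisation step described above concrete, the sketch below shows one plausible way to turn collected human interpretations of a vague time expression into a simple statistical model. The expressions, survey values, and the mean/standard-deviation summary are illustrative assumptions, not the models compared or selected in the thesis.

```python
# Hedged sketch: normalising an imprecise temporal expression by fitting a
# simple statistical summary to (hypothetical) human interpretation data.
from statistics import mean, stdev

# Hypothetical survey responses: hours of the day that respondents
# associated with each vague expression.
interpretations = {
    "early morning": [5, 6, 6, 7, 7, 8],
    "around noon":   [11, 12, 12, 12, 13],
}

def normalise(expression):
    """Return a (mean hour, standard deviation) summary for a vague expression."""
    hours = interpretations[expression]
    return mean(hours), stdev(hours)

mu, sigma = normalise("early morning")
print(f"early morning ~ {mu:.1f}h +/- {sigma:.1f}h")
```

    The thesis compares several modelling techniques and selects the most appropriate one per expression type; this sketch only illustrates the input and output shape of such a normalisation model.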

    Improving Risk Assessment of Miscarriage During Pregnancy with Knowledge Graph Embeddings

    Miscarriages are the most common type of pregnancy loss, mostly occurring in the first 12 weeks of pregnancy. Pregnancy risk assessment aims to quantify evidence to reduce such maternal morbidities, and personalized decision support systems are the cornerstone of high-quality, patient-centered care to improve diagnosis, treatment selection, and risk assessment. However, data sparsity and the increasing number of patient-level observations require more effective forms of representing clinical knowledge to encode known information that enables performing inference and reasoning. Whereas knowledge embedding representation has been widely explored on open-domain data, there are few efforts towards its application in the clinical domain. In this study, we contrast differences among multiple embedding strategies, and we demonstrate how these methods can assist in performing risk assessment of miscarriage before and during pregnancy. Our experiments show that simple knowledge embedding approaches that utilize domain-specific metadata perform better than complex embedding strategies, although both can improve results compared to a population probabilistic baseline in AUPRC, F1-score, and a proposed normalized version of these evaluation metrics that better reflects accuracy for unbalanced datasets. Finally, embedding approaches provide evidence about each individual, supporting explainability for model predictions in a way that humans can understand.
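
    As a hedged illustration of the evaluation described above, the snippet below computes AUPRC and F1 with scikit-learn, together with one common prevalence-based normalisation of AUPRC for unbalanced data. The labels, scores, decision threshold, and normalisation formula are assumptions and may differ from the exact metric proposed in the paper.

```python
# Illustrative evaluation sketch (hypothetical data, not the study's results).
from sklearn.metrics import average_precision_score, f1_score

y_true = [0, 0, 1, 0, 1, 0, 0, 1]                    # hypothetical outcomes
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9]   # model risk scores

auprc = average_precision_score(y_true, y_score)
f1 = f1_score(y_true, [int(s >= 0.5) for s in y_score])  # 0.5 threshold assumed

# One common normalisation: rescale AUPRC against the positive-class
# prevalence, i.e. the expected score of a random (population) baseline.
prevalence = sum(y_true) / len(y_true)
norm_auprc = (auprc - prevalence) / (1 - prevalence)

print(f"AUPRC={auprc:.3f}  F1={f1:.3f}  normalised AUPRC={norm_auprc:.3f}")
```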

    Effective techniques for Indonesian text retrieval

    The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information has focused on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes less than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora --- collections of documents that are direct translations of each other --- which are essential for cross-lingual information retrieval tasks. Well-curated parallel corpora are rare, and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
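
    Because the document alignment described above is based on the Needleman-Wunsch algorithm, the sketch below shows the core global-alignment recurrence on two toy strings. The scoring parameters are illustrative, and the thesis applies the idea to document content at a much larger granularity than single words.

```python
# Minimal Needleman-Wunsch global alignment score (illustrative parameters).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch("informasi", "information"))   # higher score = better alignment
```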

    Discovering knowledge structures in mind maps of mental health risks

    This thesis addressed the problem of risk analysis in mental healthcare, with respect to the GRiST project at Aston University. That project provides a risk-screening tool based on the knowledge of 46 experts, captured as mind maps that describe relationships between risks and patterns of behavioural cues. Mind mapping, though, fails to impose control over content, and is not considered to formally represent knowledge. In contrast, this thesis treated GRiST's mind maps as a rich knowledge base in need of refinement; that process drew on existing techniques for designing databases and knowledge bases. Identifying well-defined mind map concepts, though, was hindered by spelling mistakes, and by ambiguity and lack of coverage in the tools used for researching words. A novel use of the Edit Distance overcame those problems, by assessing similarities between mind map texts, and between spelling mistakes and suggested corrections. That algorithm further identified stems, the shortest text strings found in related word-forms. As opposed to existing approaches' reliance on built-in linguistic knowledge, this thesis devised a novel, more flexible text-based technique. An additional tool, Correspondence Analysis, found patterns in word usage that allowed machines to determine likely intended meanings for ambiguous words. Correspondence Analysis further produced clusters of related concepts, which in turn drove the automatic generation of novel mind maps. Such maps underpinned adjuncts to the mind mapping software used by GRiST; one such new facility generated novel mind maps, to reflect the collected expert knowledge on any specified concept. Mind maps from GRiST are stored as XML, which suggested storing them in an XML database. In fact, the entire approach here is "XML-centric", in that all stages rely on XML as far as possible. An XML-based query language allows users to retrieve information from the mind map knowledge base. The approach, it was concluded, will prove valuable to mind mapping in general, and to detecting patterns in any type of digital information.
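
    The edit-distance idea described above can be sketched as follows: a standard Levenshtein computation scores a misspelt mind-map term against candidate corrections. The vocabulary, example term, and selection rule are hypothetical and not drawn from the GRiST knowledge base.

```python
# Sketch: matching a misspelt term to its closest correction by edit distance.
def edit_distance(s, t):
    """Classic Levenshtein distance between strings s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cs != ct))) # substitution
        prev = curr
    return prev[-1]

vocabulary = ["anxiety", "suicide", "self-harm"]       # hypothetical concept terms
misspelt = "anxeity"
best = min(vocabulary, key=lambda w: edit_distance(misspelt, w))
print(best, edit_distance(misspelt, best))             # -> anxiety 2
```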

    Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

    BACKGROUND: There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, coupled with a supporting dictionary. However, they are not rich enough to encode both typing and phonetic misspellings. RESULTS: Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. CONCLUSION: We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact-match search or string-based similarity search methods.
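
    A minimal sketch of a hybrid string-plus-phonetic match in the spirit of this approach is shown below. The phonetic rules, the weighting, and the dictionary entries are toy assumptions, not the language-dependent phonetic encoding evaluated in the paper.

```python
# Hedged sketch: combine string similarity with a crude phonetic key.
from difflib import SequenceMatcher

def phonetic_key(word):
    """Very rough phonetic key for Portuguese-like spellings (toy rules only)."""
    w = word.lower()
    for old, new in [("ph", "f"), ("ç", "s"), ("z", "s"), ("y", "i"), ("k", "c")]:
        w = w.replace(old, new)
    out = w[0]                      # collapse repeated letters
    for ch in w[1:]:
        if ch != out[-1]:
            out += ch
    return out

def hybrid_score(a, b, alpha=0.7):
    """Weighted mix of surface string similarity and phonetic-key similarity."""
    string_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phonetic_sim = SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()
    return alpha * string_sim + (1 - alpha) * phonetic_sim

dictionary = ["dipirona", "paracetamol", "amoxicilina"]   # hypothetical entries
misspelt = "dipyrona"
print(max(dictionary, key=lambda d: hybrid_score(misspelt, d)))   # -> dipirona
```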

    Preface


    Creating large semantic lexical resources for the Finnish language

    Finnish belongs to the Finno-Ugric language family, and it is spoken by the vast majority of the people living in Finland. The motivation for this thesis is to contribute to the development of a semantic tagger for Finnish. This tool is a counterpart of the English Semantic Tagger, which has been developed at the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University since the beginning of the 1990s and which has over the years proven to be a very powerful tool in the automatic semantic analysis of English spoken and written data. The English Semantic Tagger has various successful applications in the fields of natural language processing and corpus linguistics, and new application areas emerge all the time. The semantic lexical resources that I have created in this thesis provide the knowledge base for the Finnish Semantic Tagger. My main contributions are the lexical resources themselves, along with a set of methods and guidelines for their creation and expansion as a general language resource and as tailored for domain-specific applications. Furthermore, I propose and carry out several methods for evaluating semantic lexical resources. In addition to the English Semantic Tagger, which was developed first, and the Finnish Semantic Tagger, which came second, equivalent semantic taggers have now been developed for Czech, Chinese, Dutch, French, Italian, Malay, Portuguese, Russian, Spanish, Urdu, and Welsh. All these semantic taggers taken together form a software framework called the UCREL Semantic Analysis System (USAS), which enables the development of not only monolingual but also various types of multilingual applications. Large-scale semantic lexical resources designed for Finnish using semantic fields as the organizing principle have not been attempted previously. Thus, the Finnish semantic lexicons created in this thesis are a unique and novel resource. The lexical coverage on the test corpora containing general modern standard Finnish, which has been the focus of the lexicon development, ranges from 94.58% to 97.91%. However, the results are also very promising in the analysis of domain-specific text (95.36%), older Finnish text (92.11–93.05%), and Internet discussions (91.97–94.14%). The results of the evaluation of lexical coverage are comparable to the results obtained with the English equivalents and thus indicate that the Finnish semantic lexical resources indeed cover the majority of core Finnish vocabulary.
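
    For readers unfamiliar with the coverage figures above, lexical coverage is usually computed as the share of corpus tokens that receive at least one tag from the lexicon. The sketch below illustrates the calculation with invented entries and tag codes; it is not the actual Finnish resource or evaluation pipeline.

```python
# Illustrative lexical-coverage calculation (invented lexicon and corpus).
lexicon = {"talvi": "N3", "on": "A3", "kylmä": "O4"}   # word -> semantic tag

def lexical_coverage(tokens, lexicon):
    """Percentage of tokens found in the semantic lexicon."""
    covered = sum(1 for t in tokens if t.lower() in lexicon)
    return 100.0 * covered / len(tokens)

corpus = "Talvi on kylmä ja pimeä".split()
print(f"{lexical_coverage(corpus, lexicon):.2f}% coverage")   # 3 of 5 tokens
```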

    Reader Response in the Digital Age. Letters to the editor vs. below-the-line comments. A synchronic comparison.

    Heralded by some as the biggest revolution of the Internet, with great egalitarian and democratic potential, web 2.0 and social media are frowned on by others as sites where users constantly compete to take centre stage, more often than not by sharing everyday banalities, thus flooding the web with “tedious piffle”. While it is true that it has never been so easy to put in your two cents’ worth, the concept of user-generated content – one of the buzzwords of today’s participatory web – can look back on a long tradition in newspapers, where letters to the editor have always been a highly popular way for readers to make their voices heard in public. In their move online, most newspapers added comment sections to their websites, thus taking readers’ letters to the digital level and providing the basis for the present synchronic study, which compares 1,000 below-the-line comments posted on the websites of the Guardian and the Times to 1,000 letters to the editor written to the same newspapers by addressing, one by one, four common claims about, or (mis-)conceptions of, this form of user-generated content. The analysis begins on the micro-linguistic level, comparing the data sets in terms of their orthographic, typographic, lexical and syntactic features and addressing the claim that the language used to communicate on the Internet differs substantially from the language used in other contexts. The focus then shifts to the interactional structures found in the two genres and the question of whether below-the-line comments, as a form of web 2.0, are really more interactive than traditional letters to the editor, which are commonly perceived as a means of ‘talking back’ to the newspaper or journalist rather than a forum for interactive debates among users. The discussion then moves on to matters of face, (im-)politeness and identity construction by first investigating the face-threatening act of criticising others as well as the act of providing positive feedback. This analysis was inspired by the fact that the two genres, although clearly related, are perceived very differently: while comment sections are often associated with aggressive and uninhibited verbal behaviour and numerous calls for their closure can be found, such concerns have not been voiced about letters pages in newspapers. Moreover, it has been claimed that via online comments, more and more private topics are entering the public sphere, thus leading to an increase in subjectivity and personalisation. This last claim is addressed by exploring strategies of personalisation and the moves used to construct an expert identity. The comparative analysis is thus concluded with a focus on the domain of social behaviour, investigating the different means contributors employ to create their own identity and that of the people talked about or addressed.

    Using Active Learning to Teach Critical and Contextual Studies: One Teaching Plan, Two Experiments, Three Videos.

    Since the 1970s, art and design education at UK universities has existed as a divided practice; on the one hand applying active learning in the studio and on the other hand enforcing passive learning in the lecture theatre. As a result, art and design students are in their vast majority reluctant about modules that may require them to think, read and write critically during their academic studies. This article describes, evaluates and analyses two individual active learning experiments designed to determine if it is possible to teach CCS modules in a manner that encourages student participation. The results reveal that opting for active learning methods improved academic achievement, encouraged cooperation, and enforced an inclusive classroom. Furthermore, and contrary to wider perception, the article demonstrates that active learning methods can be equally beneficial for small-size as well as large-size groups.