3,592 research outputs found

    Document analysis by means of data mining techniques

    Get PDF
    The huge amount of textual data produced everyday by scientists, journalists and Web users, allows investigating many different aspects of information stored in the published documents. Data mining and information retrieval techniques are exploited to manage and extract information from huge amount of unstructured textual data. Text mining also known as text data mining is the processing of extracting high quality information (focusing relevance, novelty and interestingness) from text by identifying patterns etc. Text mining typically involves the process of structuring input text by means of parsing and other linguistic features or sometimes by removing extra data and then finding patterns from structured data. Patterns are then evaluated at last and interpretation of output is performed to accomplish the desired task. Recently, text mining has got attention in several fields such as in security (involves analysis of Internet news), for commercial (for search and indexing purposes) and in academic departments (such as answering query). Beyond searching the documents consisting the words given in a user query, text mining may provide direct answer to user by semantic web for content based (content meaning and its context). It can also act as intelligence analyst and can also be used in some email spam filters for filtering out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document dataset. Furthermore, the knowledge existed in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, the problem of extracting useful information according to domain knowledge related to the user interests is a challenging task. The main topics have been to study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches, firmly established in the subject of transactional data (e.g., frequent itemset mining), to textual documents. Frequent itemset mining is a widely exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well-established, the usage of frequent itemsets in textual document summarization has never been investigated so far. A work is carried on exploiting frequent itemsets for the purpose of multi-document summarization so a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer) is presented, that is based on an itemset-based model, i.e., a framework comprise of frequent itemsets, taken out from the document collection. Highly representative and not redundant sentences are selected for generating summary by considering both sentence coverage, with respect to a sentence relevance score, based on tf-idf statistics, and a concise and highly informative itemset-based model. To evaluate the ItemSum performance a suite of experiments on a collection of news articles has been performed. Obtained results show that ItemSum significantly outperforms mostly used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC’04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance are investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not cover in an effective way all the semantically relevant data facets. A step beyond towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad-hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries could not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named Entity Recognition from Yago ontology is being used for the task of text summarization. The Named Entity Recognition (NER) task is concerned with marking occurrences of a specific object being mentioned. These mentions are then classified into a set of predefined categories. Standard categories include “person”, “location”, “geo-political organization”, “facility”, “organization”, and “time”. The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC’04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named as SociONewSum. The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary, which consists of most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content coming from Twitter has been exploited to discover current social trends and improve the appealing of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed

    Generating semantically enriched diagnostics for radiological images using machine learning

    Get PDF
    Development of Computer Aided Diagnostic (CAD) tools to aid radiologists in pathology detection and decision making relies considerably on manually annotated images. With the advancement of deep learning techniques for CAD development, these expert annotations no longer need to be hand-crafted, however, deep learning algorithms require large amounts of data in order to generalise well. One way in which to access large volumes of expert-annotated data is through radiological exams consisting of images and reports. Using past radiological exams obtained from hospital archiving systems has many advantages: they are expert annotations available in large quantities, covering a population-representative variety of pathologies, and they provide additional context to pathology diagnoses, such as anatomical location and severity. Learning to auto-generate such reports from images presents many challenges such as the difficulty in representing and generating long, unstructured textual information, accounting for spelling errors and repetition or redundancy, and the inconsistency across different annotators. In this thesis, the problem of learning to automate disease detection from radiological exams is approached from three directions. Firstly, a report generation model is developed such that it is conditioned on radiological image features. Secondly, a number of approaches are explored aimed at extracting diagnostic information from free-text reports. Finally, an alternative approach to image latent space learning from current state-of-the-art is developed that can be applied to accelerated image acquisition.Open Acces

    The horse before the cart: improving the accuracy of taxonomic directions when building tag hierarchies

    No full text
    Content on the Web is huge and constantly growing, and building taxonomies for such content can help with navigation and organisation, but building taxonomies manually is costly and time-consuming. An alternative is to allow users to construct folksonomies: collective social classifications. Yet, folksonomies are inconsistent and their use for searching and browsing is limited. Approaches have been suggested for acquiring implicit hierarchical structures from folksonomies, however, but these approaches suffer from the ‘popularity-generality’ problem, in that popularity is assumed to be a proxy for generality, i.e. high-level taxonomic terms will occur more often than low-level ones. To tackle this problem, we propose in this paper an improved approach. It is based on the Heymann–Benz algorithm, and works by checking the taxonomic directions against a corpus of text. Our results show that popularity works as a proxy for generality in at most 90.91% of cases, but this can be improved to 95.45% using our approach, which should translate to higher-quality tag hierarchy structure

    Identifying Outcomes of Care from Medical Records to Improve Doctor-Patient Communication

    Get PDF
    Between appointments, healthcare providers have limited interaction with their patients, but patients have similar patterns of care. Medications have common side effects; injuries have an expected healing time; and so on. By modeling patient interventions with outcomes, healthcare systems can equip providers with better feedback. In this work, we present a pipeline for analyzing medical records according to an ontology directed at allowing closed-loop feedback between medical encounters. Working with medical data from multiple domains, we use a combination of data processing, machine learning, and clinical expertise to extract knowledge from patient records. While our current focus is on technique, the ultimate goal of this research is to inform development of a system using these models to provide knowledge-driven clinical decision-making
    • 

    corecore