6 research outputs found

    Enhancing Performance in Medical Articles Summarization with Multi-Feature Selection

    This research aimed to produce summaries of information on extraordinary events for public health surveillance systems by extracting content from online medical articles. The data set comprises 7,346 items. Online medical articles typically consist of more than one paragraph, and the core of the story, or its important sentences, may be scattered at the beginning, middle, and end of a paragraph. This study therefore builds summaries that retain the important phrases about extraordinary events scattered across every paragraph of an online medical article. The summarization method is maximal marginal relevance with an n-best value of 0.7. Multi-feature selection here refers to the use of several features to improve the performance of the summarization system. The first group of features comprises the title, statistics on word and noun occurrences, and tf-idf weighting. A further feature is the word-level category in medical content patterns, used to identify the important sentences of each paragraph. Important sentences are classified into three categories: core sentences, explanatory sentences, and supporting sentences. System testing was divided into extrinsic and intrinsic tests. The extrinsic test compared the summaries chosen by experts with the output of the system. The intrinsic test compared three methods: n-best value weighting, the feature selection combination, and the feature selection combination merged with the word-level category in medical content. The extrinsic evaluation result was 72%. The intrinsic evaluation of the feature selection combination merged with the word category in medical content yielded 91.6% precision, 92.6% recall, and a 92.2% F-measure.
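
    The maximal marginal relevance (MMR) step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' system. It assumes that the reported value of 0.7 plays the role of the usual MMR trade-off weight (called `lam` here), that the article title serves as the query, and that sentences are compared by cosine similarity over smoothed tf-idf vectors. The function names `tfidf_vectors`, `cosine`, and `mmr_select` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse tf-idf vectors (dicts) from whitespace-tokenized texts."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    # Document frequency: number of texts in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption) keeps every weight strictly positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy."""
    vectors = tfidf_vectors(sentences + [query])
    query_vec, sent_vecs = vectors[-1], vectors[:-1]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(sent_vecs[i], query_vec)
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max(
                (cosine(sent_vecs[i], sent_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(selected)]
```

    With `lam` below 1, a sentence that repeats an already selected one is penalized, so the summary covers more distinct content. With `lam=1.0` the selection reduces to plain relevance ranking.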

    Exploring differential topic models for comparative summarization of scientific papers

    This paper investigates differential topic models (dTM) for summarizing the differences among document groups. Starting from a simple probabilistic generative model, we propose dTM-SAGE, which explicitly models the deviations of group-specific word distributions from a background word distribution to indicate how words are used differentially across document groups. This makes it more effective at capturing the unique characteristics needed to compare document groups. To generate dTM-based comparative summaries, we propose two sentence scoring methods for measuring the discriminative capacity of a sentence. Experimental results on a dataset of scientific papers show that our dTM-based comparative summarization methods significantly outperform the generic baselines and the state-of-the-art comparative summarization methods under ROUGE metrics.

    Multi-document summarization based on atomic semantic events and their temporal relations

    Automatic multi-document summarization (MDS) is the process of extracting the most important information, such as events and entities, from multiple natural language texts focused on the same topic. We extract all types of semantic atomic information and feed them to a topic model to experiment with their effects on a summary. We design a coherent summarization system by taking into account the relative positions of sentences in the original text. Our generic MDS system has outperformed the best recent multi-document summarization system on DUC 2004 in terms of ROUGE-1 recall and F1-measure. Our query-focused summarization system achieves a result statistically similar to that of the state-of-the-art unsupervised system for the DUC 2007 query-focused MDS task in terms of ROUGE-2 recall. Update summarization is a new form of MDS in which novel yet salient sentences are chosen as summary sentences, based on the assumption that the user has already read a given set of documents. In this thesis, we present an event-based update summarization approach in which novelty is detected from the temporal ordering of events and saliency is ensured by the event and entity distribution. To our knowledge, no other study has deeply investigated the effects of the novelty information acquired from the temporal ordering of events (assuming that a sentence contains one or more events) in the domain of update MDS. Our update MDS system has outperformed the state-of-the-art update MDS system in terms of ROUGE-2 and ROUGE-SU4 recall. Our MDS systems also generate quality summaries, which are manually evaluated based on popular evaluation criteria.
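
    The ROUGE-1 and ROUGE-2 recall measures cited in this abstract count how many reference n-grams also appear in the system summary. The following is a simplified sketch and not the official ROUGE toolkit. It assumes a single reference summary, lowercased whitespace tokenization, and no stemming or stop-word removal. The names `ngrams` and `rouge_n_recall` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams matched by the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

    For example, `rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=1)` matches five of the six reference unigrams.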

    Vers une représentation du contexte thématique en Recherche d'Information (Towards a representation of topical context in information retrieval)

    When searching for information within knowledge bases or document collections, humans use an information retrieval system (IRS). So that it can retrieve documents containing relevant information, users have to provide the IRS with a representation of their information need. Nowadays, this representation is composed of a small set of keywords, often referred to as the "query". A few words may, however, not be sufficient to accurately and effectively represent the complete cognitive state of a human with respect to her initial information need. A query may not contain sufficient information if the user is searching for a topic in which she is not confident at all. Hence, without some kind of context, the IRS could simply miss nuances or details that the user did not or could not provide in the query. In this thesis, we explore and propose various statistical, automatic, and unsupervised methods for representing the topical context of the query. More specifically, we aim to identify the latent concepts of a query without involving the user in the process or requiring explicit feedback. We experiment with using and combining several general information sources representing the main types of information we deal with on a daily basis while browsing the Web. We also leverage probabilistic topic models (such as Latent Dirichlet Allocation) in a pseudo-relevance feedback setting. In addition, we propose a method for jointly estimating the number of latent concepts of a query and the set of pseudo-relevant feedback documents most suitable for modeling these concepts. We evaluate our approaches using four large TREC test collections. In the appendix of this thesis, we also propose an approach for contextualizing short messages that leverages both information retrieval and automatic summarization techniques.
    AVIGNON-Bib. numérique (840079901) / Sudoc, France