
    Extended list of stop words: Does it work for keyphrase extraction from short texts?

    In this paper we study the problem of key phrase extraction from short texts written in Russian. As texts we consider messages posted on Internet car forums related to the purchase or repair of cars. The main assumption is that the construction of stop-word lists for key phrase extraction can be effective if performed on the basis of a small, expert-annotated collection. The results show that even a small number of texts annotated by an expert can be enough to build an extended list of stop words, and that the extracted stop words make it possible to improve the quality of the key phrase extraction algorithm. Previously, we used a similar approach for key phrase extraction from scientific abstracts in English; in this paper we work with Russian texts. The results show that the proposed approach works not only for texts that are well structured and literate, such as abstracts, but also for short texts, such as forum messages, in which many words may be misspelled and the text itself is poorly structured. Moreover, the results show that the proposed approach works well not only with English texts but also with texts in Russian.
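A common way to use a stop-word list during key phrase extraction is to split the token stream at stop words, so that each maximal run of non-stop-word tokens becomes a candidate phrase. The sketch below illustrates that idea; the stop-word list and the example forum-style sentence are invented for illustration, not taken from the paper.

```python
# Sketch: stop-word-based candidate phrase building.
# STOP_WORDS here is a toy "extended" list; the paper builds its list
# from an expert-annotated collection.
STOP_WORDS = {"the", "a", "of", "in", "please", "help", "thanks"}

def candidate_phrases(tokens, stop_words):
    """Split a token stream at stop words; each maximal run of
    non-stop-word tokens becomes one candidate key phrase."""
    candidates, current = [], []
    for tok in tokens:
        if tok.lower() in stop_words:
            if current:
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        candidates.append(" ".join(current))
    return candidates

tokens = "please help the engine oil leak in cold start".split()
print(candidate_phrases(tokens, STOP_WORDS))
# → ['engine oil leak', 'cold start']
```

Extending the stop-word list with forum-specific words such as "please" or "help" is what prunes noisy candidates from poorly structured messages.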

    TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation

    Keyphrase extraction is the task of finding phrases that represent the important content of a document: its aim is to propose textual units that capture the most important topics developed in the text. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases, with each output keyphrase considered correct if it matches one of the references. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective, and evaluation by exact matching underestimates performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures: an evaluation measure that is highly correlated with the manual evaluation is appropriate for evaluating automatic keyphrase extraction methods.

    KEYPHRASE EXTRACTION USING LEXICAL PATTERNS FROM ABSTRACTS OF SCIENTIFIC ARTICLES

    Nowadays people look for the best way to acquire knowledge and learning; however, this knowledge lies within documents containing enormous amounts of information, so people need an overview of the content in order to determine whether or not it is relevant to them. Keyphrases capture the main idea of a document and give the reader a description of it. Nevertheless, assigning them manually is costly and time-consuming. This drawback has led researchers to look for methods that extract keyphrases automatically while still capturing the main information of a document. This is where the keyphrase extraction task begins; it consists of two stages: the identification of candidate keyphrases and the selection of keyphrases. In this thesis, keyphrases are extracted using lexical patterns from the abstracts of scientific articles in English. Experiments are carried out on the Inspec corpus, which consists of 2000 abstracts of scientific articles in English; each abstract has 2 sets of keyphrases manually assigned by an expert. The results obtained are compared with other state-of-the-art methods.
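Lexical-pattern candidate identification is often implemented by scanning POS-tagged tokens for sequences such as "zero or more adjectives followed by one or more nouns". The sketch below assumes the tagging has already been done by an external tagger, and the ADJ* NOUN+ pattern is a common choice for illustration, not necessarily the exact pattern used in the thesis.

```python
# Sketch: candidate selection over POS-tagged tokens with the common
# pattern ADJ* NOUN+ (a hypothetical stand-in for the thesis's patterns).
def match_candidates(tagged):
    """Return maximal token runs matching ADJ* NOUN+ ."""
    candidates, current, seen_noun = [], [], False
    for word, tag in tagged:
        if tag == "ADJ" and not seen_noun:
            current.append(word)          # leading adjectives
        elif tag == "NOUN":
            current.append(word)          # noun head (possibly compound)
            seen_noun = True
        else:
            if seen_noun:                 # close a completed match
                candidates.append(" ".join(current))
            current, seen_noun = [], False
    if seen_noun:
        candidates.append(" ".join(current))
    return candidates

tagged = [("automatic", "ADJ"), ("keyphrase", "NOUN"), ("extraction", "NOUN"),
          ("is", "VERB"), ("useful", "ADJ")]
print(match_candidates(tagged))
# → ['automatic keyphrase extraction']
```

Note that a run of adjectives with no following noun (like the trailing "useful") is discarded, since the pattern requires a noun head.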

    DOCUMENT REPRESENTATION FOR CLUSTERING OF SCIENTIFIC ABSTRACTS

    The key issue of the present paper is clustering of narrow-domain short texts, such as scientific abstracts. The work is based on observations made while improving the performance of a key phrase extraction algorithm. An extended stop-words list was used that was built automatically for the purposes of key phrase extraction and made possible a considerable quality enhancement of the phrases extracted from scientific publications. A description of the stop-words list creation procedure is given. The main objective is to investigate the possibilities of increasing the performance and/or speed of clustering by means of the above-mentioned stop-words list, as well as information about lexeme parts of speech. In the latter case, a vocabulary is applied for the document representation that contains not all the words occurring in the collection, but only nouns and adjectives, or their sequences, encountered in the documents. Two base clustering algorithms are applied: k-means and hierarchical clustering (average agglomerative method). The results show that the use of an extended stop-words list and an adjective-noun document representation makes it possible to improve the performance and speed of k-means clustering, while in a similar case for the average agglomerative method a decline in performance quality may be observed. It is shown that the use of adjective-noun sequences for document representation lowers the clustering quality for both algorithms and can be justified only when a considerable reduction of feature-space dimensionality is necessary.
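The adjective/noun document representation described above can be sketched as a bag-of-words vector restricted to a filtered vocabulary. The toy documents below are assumed to be pre-tagged; the point is only to show how restricting the vocabulary to nouns and adjectives shrinks the feature space fed to the clustering algorithm.

```python
# Sketch: bag-of-words vectors over a noun/adjective-only vocabulary.
from collections import Counter

KEEP_TAGS = {"NOUN", "ADJ"}

def doc_vector(tagged_doc, vocab):
    """Count only nouns and adjectives, projected onto a fixed vocabulary."""
    counts = Counter(w for w, t in tagged_doc if t in KEEP_TAGS)
    return [counts[w] for w in vocab]

docs = [
    [("fast", "ADJ"), ("clustering", "NOUN"), ("works", "VERB")],
    [("short", "ADJ"), ("texts", "NOUN"), ("need", "VERB"), ("clustering", "NOUN")],
]
# Vocabulary built from nouns and adjectives only; verbs never enter it.
vocab = sorted({w for d in docs for w, t in d if t in KEEP_TAGS})
print(vocab)
print([doc_vector(d, vocab) for d in docs])
# → ['clustering', 'fast', 'short', 'texts']
# → [[1, 1, 0, 0], [1, 0, 1, 1]]
```

These vectors can then be handed to any standard k-means or agglomerative clustering implementation.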

    Stop-words in keyphrase extraction problem

    The keyword extraction problem is one of the most significant tasks in information retrieval. High-quality keyword extraction strongly influences progress in subsequent information retrieval subtasks: classification and clustering, data mining, knowledge extraction and representation, etc. The research community has established a standard pipeline for keyphrase extraction; however, some promising options remain outside this paradigm. In the paper the authors survey the scope of interdisciplinary methods applicable to automatic stop-list feeding. The chosen method belongs to the class of experiential models, and the research procedure based on it makes it possible to improve the quality of keyphrase extraction at the stage of candidate keyphrase building. Several approaches to automatic feeding of stop lists are proposed in the paper as well. One of them is based on provisions of lexical statistics, and the results of its application to the discussed task point out the non-Gaussian nature of text corpora. The second, based on using the Inspec training collection to feed the stop lists, improves the quality considerably.
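One simple statistical route to automatic stop-list feeding is to add to the stop list any word whose document frequency in a training collection exceeds a threshold, on the assumption that such words carry little phrase-level content. The corpus, threshold, and word lists below are invented for illustration; the paper's own feeding procedures differ in detail.

```python
# Sketch: document-frequency-based stop-list feeding (illustrative
# threshold; not the paper's exact procedure).
from collections import Counter

def extend_stop_list(docs, base_stop, df_threshold=0.5):
    """Add words appearing in more than df_threshold of the documents."""
    df = Counter()
    for doc in docs:
        df.update(set(w.lower() for w in doc))   # document frequency
    n = len(docs)
    frequent = {w for w, c in df.items() if c / n > df_threshold}
    return base_stop | frequent

docs = [
    "the car engine is loud".split(),
    "the engine oil is dirty".split(),
    "new brake pads arrived".split(),
]
print(sorted(extend_stop_list(docs, {"the", "is"}, 0.6)))
# → ['engine', 'is', 'the']
```

Note the pitfall this toy run exposes: a purely frequency-based rule can absorb domain terms like "engine", which is why expert-annotated or training-collection-based feeding can outperform raw lexical statistics.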

    Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations

    BACKGROUND: Many health organizations allow patients to access their own electronic health record (EHR) notes through online patient portals as a way to enhance patient-centered care. However, EHR notes are typically long and contain abundant medical jargon that can be difficult for patients to understand. In addition, many medical terms in patients' notes are not directly related to their health care needs. One way to help patients better comprehend their own notes is to reduce information overload and help them focus on the medical terms that matter most to them. Interventions can then be developed that give them targeted education to improve their EHR comprehension and the quality of care. OBJECTIVE: We aimed to develop a supervised natural language processing (NLP) system called Finding impOrtant medical Concepts most Useful to patientS (FOCUS) that automatically identifies and ranks medical terms in EHR notes based on their importance to the patients. METHODS: First, we built an expert-annotated corpus. For each EHR note, 2 physicians independently identified medical terms important to the patient. Using the physicians' agreement as the gold standard, we developed and evaluated FOCUS. FOCUS first identifies candidate terms from each EHR note using MetaMap and then ranks the terms using a support vector machine-based learning-to-rank algorithm. We explored rich learning features, including distributed word representations, Unified Medical Language System semantic types, topic features, and features derived from the consumer health vocabulary. We compared FOCUS with 2 strong baseline NLP systems. RESULTS: Physicians annotated 90 EHR notes and identified a mean of 9 (SD 5) important terms per note. The Cohen's kappa annotation agreement was .51. The 10-fold cross-validation results show that FOCUS achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.940 for ranking candidate terms from EHR notes to identify important terms. When including term identification, the performance of FOCUS for identifying important terms from EHR notes was 0.866 AUC-ROC. Both performance scores significantly exceeded the corresponding baseline system scores (P < .001). Rich learning features contributed substantially to FOCUS's performance. CONCLUSIONS: FOCUS can automatically rank terms from EHR notes based on their importance to patients. It may help develop future interventions that improve quality of care.
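The AUC-ROC used to evaluate FOCUS's ranking has a useful probabilistic reading: it equals the probability that a randomly chosen important term is ranked above a randomly chosen unimportant one. The sketch below computes it by direct pair counting; the scores and labels are made up for illustration and have nothing to do with the paper's data.

```python
# Sketch: AUC-ROC as the fraction of (important, unimportant) pairs
# ranked correctly, with ties counted as half-correct.
def auc_roc(scores, labels):
    pairs = correct = 0.0
    for s_pos, l_pos in zip(scores, labels):
        if l_pos != 1:
            continue
        for s_neg, l_neg in zip(scores, labels):
            if l_neg != 0:
                continue
            pairs += 1
            if s_pos > s_neg:
                correct += 1
            elif s_pos == s_neg:
                correct += 0.5
    return correct / pairs

scores = [0.9, 0.8, 0.4, 0.3]   # hypothetical ranker output per term
labels = [1, 0, 1, 0]           # 1 = important to the patient
print(auc_roc(scores, labels))
# → 0.75
```

The O(n²) pair count is fine for illustration; production evaluation code typically uses a rank-statistic formulation instead.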

    Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages

    The Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages contain 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4-6 October 2010.

    Approximate Matching for Evaluating Keyphrase Extraction

    We propose a new evaluation strategy for keyphrase extraction based on approximate keyphrase matching. It corresponds well with human judgments and is better suited to assessing the performance of keyphrase extraction approaches. Additionally, we propose a generalized framework for comprehensive analysis of keyphrase extraction that subsumes most existing approaches, which allows for fair testing conditions. For the first time, we compare the results of state-of-the-art unsupervised and supervised keyphrase extraction approaches on three evaluation datasets and show that the relative performance of the approaches depends heavily on the evaluation metric as well as on the properties of the evaluation dataset. Keywords: keyphrase extraction; approximate matching.
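The contrast with exact matching can be sketched concretely: under approximate matching, an extracted phrase counts as correct if it shares enough tokens with some reference keyphrase. The token-Jaccard criterion and 0.5 threshold below are illustrative stand-ins; the paper's exact matching criterion may differ.

```python
# Sketch: approximate keyphrase matching via token-level Jaccard overlap
# (an illustrative criterion, not necessarily the paper's).
def approx_match(extracted, reference, threshold=0.5):
    e = set(extracted.lower().split())
    r = set(reference.lower().split())
    return len(e & r) / len(e | r) >= threshold

def precision(extracted_list, references, threshold=0.5):
    """Fraction of extracted phrases that approximately match a reference."""
    hits = sum(any(approx_match(e, r, threshold) for r in references)
               for e in extracted_list)
    return hits / len(extracted_list)

refs = ["keyphrase extraction", "evaluation metric"]
out = ["automatic keyphrase extraction", "evaluation", "topic model"]
print(precision(out, refs))
```

Here "automatic keyphrase extraction" and "evaluation" each approximately match a reference while "topic model" does not, giving a precision of 2/3; under exact matching all three would count as wrong, illustrating how exact matching underestimates performance.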