6,952 research outputs found

    Improving chronological sentence ordering by precedence relation


    Utilizing graph-based representation of text in a hybrid approach to multiple documents summarization

    The aim of automatic text summarization is to process text in order to identify and present the most important information it contains. In this research, we investigate automatic multiple-document summarization using a hybrid of extractive and "shallow abstractive" methods. We utilize the graph-based representation of text proposed in [1] and [2] as part of our method, aiming to provide concise, informative and coherent summaries. We start by scoring sentences by significance to extract the top-scoring ones from each document in the set being summarized. In this step, we examine different criteria for scoring sentences, including the presence of highly frequent words of the document, the presence of highly frequent words of the document set, the presence of words found in the first and last sentences of the document, and combinations of these features. Our experiments show that the best combination is the presence of highly frequent words of the document together with the presence of words found in the first and last sentences of the document; on average, its f-score was 7.9% higher than the f-scores of the other features. Secondly, we address redundancy of information by clustering sentences carrying the same or similar information into one cluster, which is then compressed into a single sentence, thus avoiding redundancy as much as possible. We investigated clustering the extracted sentences using two similarity criteria: the first uses word frequency vectors as the similarity measure and the second uses word semantic similarity. Our experiment showed that the word frequency vector features yield much better clusters in terms of sentence similarity, producing 20% more clusters labeled as containing similar sentences than the word semantic features did. We then adopted the graph-based representation of text proposed in [1] and [2] to represent each sentence in a cluster and, using k-shortest paths, found the shortest path to serve as the final compressed sentence in the summary. Human evaluators scored sentences for grammatical correctness, and almost 74% of the 51 evaluated sentences received the top score of 2, indicating a perfect or near-perfect sentence. Finally, we propose a method for scoring the compressed sentences to determine the order in which they should appear in the final summary. We used the Document Understanding Conference dataset for 2014 to evaluate our final system, measuring performance with ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatic summaries to "ideal" human references. We also compared our summaries' ROUGE scores to those of summaries generated with the MEAD summarization tool. Our system achieved better precision and f-scores and comparable recall: on average, precision was 2% higher and f-score 1.6% higher than MEAD's, while MEAD's recall was 0.8% higher. In addition, our system produced a more compressed summary than MEAD.
    We finally ran an experiment to evaluate the sentence order and comprehensibility of the final summaries, showing that our ordering method produces comprehensible summaries: summaries with a perfect comprehensibility score constitute 72% of those evaluated. Evaluators were also asked to count the ungrammatical and incomprehensible sentences in the evaluated summaries; on average, these made up only 10.9% of the summaries' sentences. We believe our system provides a "shallow abstractive" summary of multiple documents that does not require intensive Natural Language Processing.
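
    For illustration, below is a minimal sketch of the sentence-scoring step described above (presence of the document's most frequent words plus overlap with the first and last sentences), written in plain Python; the function names, token handling and the tiny example document are illustrative assumptions, not the thesis' actual implementation.

```python
# Illustrative sketch only: score sentences by (a) overlap with the document's
# most frequent words and (b) overlap with the first and last sentences.
from collections import Counter
import re


def tokenize(text):
    """Lowercase word tokens; a stand-in for any proper tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def score_sentences(sentences, top_k_words=20):
    doc_tokens = [tok for s in sentences for tok in tokenize(s)]
    frequent = {w for w, _ in Counter(doc_tokens).most_common(top_k_words)}
    boundary = set(tokenize(sentences[0])) | set(tokenize(sentences[-1]))

    scores = []
    for s in sentences:
        toks = set(tokenize(s))
        # Combine the two features reported above as the best combination.
        scores.append(len(toks & frequent) + len(toks & boundary))
    return scores


doc = [
    "Automatic summarization selects the most important sentences.",
    "Graph representations can help compress clusters of similar sentences.",
    "The summary should stay concise, informative and coherent.",
]
ranked = sorted(zip(score_sentences(doc), doc), reverse=True)
print(ranked[0][1])  # highest-scoring sentence, a candidate for extraction
```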

    Content management by keywords: An analytical Study

    Various methods of content analysis are described, with special emphasis on keyword analysis. The paper is based on an analytical study of 97 keywords extracted from the titles and abstracts of 70 research articles from INSPEC, taking ten from each year from 2000 to 2006 in decreasing order of relevance, on Fermi liquid, a specific subject under Condensed Matter Physics. Only the keywords beginning with the letters 'A' to 'F' are considered for this study. The keywords are indexed to critically examine their physical structure, which is composed of three fundamental kernels, viz. key phrase, modulator and qualifier. The key phrase reflects the central concept, which is usually post-coordinated by the modulator to amend the central concept in accordance with the relevant context. The qualifier comes after the modulator to describe the particular state of the central and/or amended concept. The keywords are further classified into 36 classes on the basis of 10 parameters, of which 4 are intrinsic (associativeness, chronological appearance, frequency of occurrence and category) and the remaining 6 are extrinsic (clarity of meaning, type of meaning, scope of meaning, level of perception, mode of creation and area of occurrence). The number of classes under the 4 intrinsic parameters is 16, while that under the 6 extrinsic parameters is 20. A new taxonomy of keywords is proposed that will help to analyze the research trend of a subject and to identify potential research areas within its scope.

    DeepEva: A deep neural network architecture for assessing sentence complexity in Italian and English languages

    Automatic Text Complexity Evaluation (ATE) is a research field that aims at creating new methodologies to automate the evaluation of text complexity, that is, the study of text-linguistic features (e.g., lexical, syntactical, morphological) to measure the degree of comprehensibility of a text. ATE can positively affect several different contexts such as Finance, Health, and Education. Moreover, it can support research on Automatic Text Simplification (ATS), a research area that studies new methods for transforming a text by changing its lexicon and structure to meet specific reader needs. In this paper, we illustrate an ATE approach named DeepEva, a Deep Learning-based system capable of classifying both Italian and English sentences on the basis of their complexity. The system exploits the TreeTagger annotation tool, two Long Short-Term Memory (LSTM) neural unit layers, and a fully connected one. The last layer outputs the probability of a sentence belonging to the easy or complex class. The experimental results show the effectiveness of the approach for both languages, compared with several baselines such as Support Vector Machine, Gradient Boosting, and Random Forest.
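
    The architecture described above lends itself to a compact sketch. The following is one possible reading, using tf.keras as an assumed framework; the vocabulary size, embedding dimension and LSTM widths are placeholders, not the hyperparameters reported in the paper.

```python
# A hedged sketch of a DeepEva-like classifier: two stacked LSTM layers and a
# fully connected output giving the probability that a sentence is "complex".
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary built from TreeTagger-annotated tokens
EMBED_DIM = 128      # placeholder embedding size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(64, return_sequences=True),  # first LSTM layer
    tf.keras.layers.LSTM(64),                          # second LSTM layer
    tf.keras.layers.Dense(1, activation="sigmoid"),    # P(sentence is complex)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```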

    Teaching programme for 3º ESO. Inglés (The Use of SDGs and audiovisual materials)

    Master's thesis (Trabajo de Fin de Máster) for the Máster en Profesor de Educación Secundaria Obligatoria y Bachillerato, Formación Profesional y Enseñanza de Idiomas, academic year 2021-2022. Teaching programme for the 3rd ESO course of English as a Foreign Language, based on two fundamental pillars: the Sustainable Development Goals (SDGs) as the unifying thread on which the teaching units are based, and the use of audio-visual materials as support for the vast majority of activities and tasks, among which we can find fragments of films, documentaries, series, songs, etc. This programme aims to teach English while raising awareness among young people about important issues such as climate change, quality education, the eradication of poverty, gender equality, etc. Additionally, it should be noted that, due to the format limitations of the TFM, this proposal is not a complete teaching programme. The project consists of nine briefly described didactic units, of which only unit 6 is explained in depth; within it, two complete sessions are detailed task by task. Finally, there is a final reflection and an appendix with all the resources used in the programme.

    Authenticity and Teaching Idioms

    The concept of 'authenticity', or 'the authentic material', has been a controversial issue over the past 30 years. In recent years, however, more emphasis has been given to a multifaceted model, and there has been an attempt to put an end to binary definitions of authenticity. The present study adopts the five types of input authenticity proposed by Brown and Menasche (2008): genuine input authenticity, altered input authenticity, adapted input authenticity, simulated input authenticity and inauthenticity. We explored the effect of four of these five types on EFL learners' learning of idiomatic expressions. A quasi-experimental study was conducted: 62 male EFL learners were assigned to four groups, and four types of idiomatic materials differing in authenticity were prepared and taught to them. A one-way ANOVA run on the scores of a pre-test (which tested participants' understanding of idiomatic expressions) showed no significant difference among the participating groups (F = 0.39, p = .757). During the treatment period, which lasted two months with three sessions a week, the four groups received the four types of idiomatic materials. A one-way ANOVA run on the scores of the four groups then showed a statistically significant difference (F = 31.31, p < .001). To locate the differences, a follow-up analysis (LSD test) was conducted. Overall, the results suggest that the materials with less authenticity, namely simulated input authenticity and inauthenticity, can be more beneficial than the materials with a higher degree of authenticity, namely altered and adapted input authenticity. Keywords: genuine input authenticity, altered input authenticity, adapted input authenticity, simulated input authenticity, inauthenticity
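
    As a pointer to how such an analysis is typically run, the sketch below performs a one-way ANOVA over four groups with SciPy; the group names follow the four material types compared above, but the score lists are purely illustrative placeholders, not the study's data.

```python
# Illustrative only: one-way ANOVA across four groups of idiom test scores.
# The numbers below are placeholder values, not data from the study.
from scipy import stats

altered     = [14, 15, 13, 16, 12]
adapted     = [15, 14, 16, 13, 15]
simulated   = [18, 19, 17, 20, 18]
inauthentic = [19, 18, 20, 19, 17]

f_stat, p_value = stats.f_oneway(altered, adapted, simulated, inauthentic)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# A significant result would then be followed up with pairwise comparisons
# (e.g., an LSD test) to locate which groups differ.
```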

    Causality Management and Analysis in Requirement Manuscript for Software Designs

    For software design tasks involving natural language, the results of a causal investigation provide valuable and robust semantic information, especially for identifying key variables during product (software) design and optimization. As the interest in analytical data science shifts from correlations to a better understanding of causality, there is an equally important task of accurately extracting causality from textual artifacts to aid requirement engineering (RE) decisions. This thesis focuses on identifying, extracting, and classifying causal phrases using word and sentence labeling based on the Bidirectional Encoder Representations from Transformers (BERT) deep learning language model and five machine learning models. The aim is to understand the form and degree of causality based on its impact and prevalence in RE practice. Methodologically, our analysis is centered on RE practice, and we considered 12,438 sentences extracted from 50 requirement engineering manuscripts (REM) for training our models. Our research reports that causal expressions constitute about 32% of the sentences in REM. We applied four evaluation metrics, namely recall, accuracy, precision, and F1, to assess our models' performance and ensure the results' conformity with our study goal. The highest model accuracy was 85%, achieved by Naive Bayes. Finally, we note that our causal analytic framework is relevant to practitioners for different functionalities, such as generating test cases for requirement engineers and software developers, and auditing product performance for management stakeholders.
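
    As one concrete illustration of the sentence-labeling setup, the sketch below trains the simplest of the model families named above (Naive Bayes over bag-of-words features) with scikit-learn and reports the same kind of metrics; the toy sentences and labels are illustrative placeholders, not drawn from the REM corpus, and the BERT-based labeling is not shown.

```python
# Minimal sketch: classify requirement sentences as causal (1) or non-causal (0)
# with a TF-IDF + Naive Bayes pipeline, then compute precision/recall/F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_recall_fscore_support

# Placeholder training sentences (not from the REM corpus).
train_sentences = [
    "The system crashes because the buffer overflows.",
    "If the user is idle, the session shall time out.",
    "The report lists all registered users.",
    "The interface uses a dark color scheme.",
]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_sentences, train_labels)

test_sentences = ["Logging fails because the disk is full."]
test_labels = [1]
pred = clf.predict(test_sentences)
precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, pred, average="binary", zero_division=0
)
print(f"prediction={pred[0]}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```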

    Predicting Text Quality: Metrics for Content, Organization and Reader Interest

    When people read articles (news, fiction or technical), most of the time, if not always, they form perceptions about their quality. Some articles are well written and others are poorly written. This thesis explores whether such judgements can be automated so that they can be incorporated into applications such as information retrieval and automatic summarization. Text quality does not involve a single aspect but is a combination of numerous and diverse criteria, including spelling, grammar, organization, informativeness, creative and beautiful language use, and page layout. In the education domain, comprehensive lists of such properties are outlined in the rubrics used for assessing writing. But computational methods for text quality have addressed only a handful of these aspects, mainly related to spelling, grammar and organization. In addition, some text quality aspects may be more relevant for one genre than another, yet previous work has placed little focus on genre-specific metrics. This thesis proposes new insights and techniques to address these issues. We introduce metrics that score varied dimensions of quality: content, organization and reader interest. For content, we present two measures, specificity and verbosity level: specificity measures the amount of detail present in a text, while verbosity captures which details are essential to include. We measure organization quality by quantifying the regularity of the intentional structure in the article and by using the specificity levels of adjacent sentences in the text. Our reader interest metrics aim to identify engaging and interesting articles. The development of these measures is backed by articles from three different genres: academic writing, science journalism and automatically generated summaries. Proper presentation of content is critical in summarization because summaries have a word limit, so our specificity and verbosity metrics are developed with this genre as the focus. The argumentation structure of academic writing lends support to the idea of using intentional structure to model organization quality. Science journalism articles convey research findings in an engaging manner and are ideally suited for the development and evaluation of measures related to reader interest.

    Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art

    The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) during the past decade. However, the demands of long document analysis are quite different from those of shorter texts, and the ever-increasing size of documents uploaded online renders automated understanding of long texts a critical area of research. This article has two goals: a) it overviews the relevant neural building blocks, thus serving as a short tutorial, and b) it surveys the state of the art in long document NLP, mainly focusing on two central tasks: document classification and document summarization. Sentiment analysis for long texts is also covered, since it is typically treated as a particular case of document classification. Additionally, this article discusses the main challenges, issues and current solutions related to long document NLP. Finally, the relevant, publicly available, annotated datasets are presented in order to facilitate further research.
    Comment: 53 pages, 2 figures, 171 citations