4,776 research outputs found

    From corpus-based collocation frequencies to readability measure

    This paper provides a broad overview of three separate but related areas of research. Firstly, corpus linguistics is a growing discipline that applies analytical results from large language corpora to a wide variety of problems in linguistics and related disciplines. Secondly, readability research, as the name suggests, seeks to understand what makes texts more or less comprehensible to readers, and aims to apply this understanding to issues such as text rating and matching of texts to readers. Thirdly, collocation is a language feature that occurs when particular words are used frequently together for other than purely grammatical reasons. The intersection of these three areas provides the basis for on-going research within the Department of Computer and Information Sciences at the University of Strathclyde and is the motivation for this overview. Specifically, we aim, through analysis of collocation frequencies in major corpora, to afford valuable insight into the content of texts, which we believe will, in turn, provide a novel basis for estimating text readability.
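
    As a rough illustration of what a "collocation frequency" can look like in practice, the sketch below counts adjacent word pairs in a toy sentence and scores one pair with pointwise mutual information (PMI). The abstract does not name a specific measure, so PMI, the function names `bigram_counts` and `pmi`, and the sample sentence are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
import math
import re


def bigram_counts(text):
    """Count single words and adjacent word pairs in lower-cased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words), Counter(zip(words, words[1:]))


def pmi(pair, word_counts, pair_counts):
    """Pointwise mutual information of an adjacent pair.

    PMI is an assumed association measure, chosen only for illustration.
    The pair must occur at least once, otherwise log2(0) is undefined.
    """
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    p_pair = pair_counts[pair] / n_pairs
    p_first = word_counts[pair[0]] / n_words
    p_second = word_counts[pair[1]] / n_words
    return math.log2(p_pair / (p_first * p_second))


# Hypothetical toy corpus; a real study would use a large reference corpus.
sample = "strong tea and strong coffee but never powerful tea"
word_counts, pair_counts = bigram_counts(sample)
score = pmi(("strong", "tea"), word_counts, pair_counts)
```

    A positive score means the pair occurs together more often than its individual word frequencies would suggest; with a corpus this small the value is not meaningful beyond showing the calculation.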

    Satirical News Detection and Analysis using Attention Mechanism and Linguistic Features

    Satirical news is considered to be entertainment, but it is potentially deceptive and harmful. Despite the genre being embedded in the article, not everyone can recognize the satirical cues, and some readers therefore believe the news to be true. We observe that satirical cues are often reflected in certain paragraphs rather than the whole document. Existing works only consider document-level features to detect satire, which could be limiting. We consider paragraph-level linguistic features to unveil the satire by incorporating a neural network and an attention mechanism. We investigate the difference between paragraph-level features and document-level features, and analyze them on a large satirical news dataset. The evaluation shows that the proposed model detects satirical news effectively and reveals what features are important at which level. Comment: EMNLP 2017, 11 pages

    Summarising News Stories for Children

    This paper proposes a system to automatically summarise news articles in a manner suitable for children by deriving and combining statistical ratings for how important, positively oriented and easy to read each sentence is. Our results demonstrate that this approach succeeds in generating summaries that are suitable for children, and that there is further scope for combining this extractive approach with abstractive methods used in text simplification.

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler. Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201

    Improving Web Page Readability by Plain Language

    In today's world, the web is the first choice for anybody who wants to access information, because it is the only source that provides easy and instant access to information. However, web readers face many hurdles, including web page loading, text size, finding related information, and spelling and grammar. In addition, understanding web pages written in English creates great problems for non-native readers who have basic knowledge of English. In this paper, we propose a plain language text scheme that writes a local language (Urdu) in the English alphabet for web pages in Pakistan. For this purpose we developed two websites, one in normal English text and the other in the local-language text scheme using the English alphabet. We also administered a questionnaire to 40 users with different levels of English fluency in Pakistan to gather evidence of the practicality of our approach. The results show that the proposed plain language text scheme using the English alphabet improved reading comprehension for non-native English speakers in Pakistan.

    A Wikipedia Literature Review

    This paper was originally designed as a literature review for a doctoral dissertation focusing on Wikipedia. It describes the structure of Wikipedia and the latest trends in Wikipedia research.

    INEX Tweet Contextualization Task: Evaluation, Results and Lesson Learned

    Microblogging platforms such as Twitter are increasingly used for on-line client and market analysis. This motivated the proposal of a new track at the CLEF INEX lab: Tweet Contextualization. The objective of this task was to help a user understand a tweet by providing them with a short explanatory summary (500 words). This summary should be built automatically using resources like Wikipedia, by extracting relevant passages and aggregating them into a coherent summary. The task ran for four years, and the results show that the best systems combine NLP techniques with more traditional methods. More precisely, the best performing systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, part-of-speech (POS) analysis, anaphora detection, a content diversity measure, and sentence reordering. This paper provides a full summary report on the four-year task. While yearly overviews focused on system results, in this paper we provide a detailed report on the approaches proposed by the participants, which can be considered the state of the art for this task. As an important result of the four-year competition, we also describe the open access resources that have been built and collected. The evaluation measures for automatic summarization designed in DUC or MUC were not appropriate for evaluating tweet contextualization; we explain why and describe in detail the LogSim measure used to evaluate the informativeness of the produced contexts or summaries. Finally, we also mention the lessons we learned, which are worth considering when designing a task.

    Estimating the Socio-Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics

    With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes it harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we re-examine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes like the extent of their perceived usefulness. Our approach explores multiple aspects of review text, at the lexical, grammatical, semantic, and stylistic levels, to identify important text-based features. In addition, we examine multiple reviewer-level features, such as the average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective and highly subjective sentences have a negative effect on product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are considered more informative (or helpful) by the users. By using Random Forest based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. Products that have received widely fluctuating reviews also have reviews of widely fluctuating helpfulness. In particular, we find that highly detailed and readable reviews can have low helpfulness votes in cases when users tend to vote negatively not because they disapprove of the review quality but rather to convey their disapproval of the review polarity. We examine the relative importance of the three broad feature categories, "reviewer-related" features, "review subjectivity" features, and "review readability" features, and find that using any of the three feature sets results in performance statistically equivalent to using all available features. This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their socio-economic impact. Our results can have implications for the judicious design of opinion forums.

    THE IDENTIFICATION OF NOTEWORTHY HOTEL REVIEWS FOR HOTEL MANAGEMENT

    The rapid emergence of user-generated content (UGC) inspires knowledge sharing among Internet users. A good example is the well-known travel site TripAdvisor.com, which enables users to share their experiences and express their opinions on attractions, accommodations, restaurants, etc. UGC about travel provides valuable information to users as well as to staff in the travel industry. In particular, identifying reviews that are noteworthy for hotel management is critical to the success of hotels in the competitive travel industry. We employed two hotel managers to examine Taiwan’s hotel reviews on TripAdvisor.com and found that noteworthy reviews can be characterized by their content features, sentiments, and review qualities. Through experiments using TripAdvisor.com data, we find that all three types of features are important in identifying noteworthy hotel reviews. Specifically, content features are shown to have the most impact, followed by sentiments and review qualities. With respect to the various methods for representing content features, the LDA method achieves performance comparable to the TF-IDF method, with higher recall and far fewer features.