2,056 research outputs found

    DARIAH and the Benelux


    Analysing occupational safety culture through mass media monitoring

    In recent years, a group of researchers within the National Institute for Insurance against Accidents at Work (INAIL) has launched a pilot project on mass media monitoring, in order to find out how the press deals with the culture of safety and health at work. To monitor mass media, the Institute has created a relational database of news concerning occupational injuries and diseases, filled with information obtained from newspaper articles about work-related accidents and incidents, including the full text of the articles. The ultimate objective is to identify the major lines for awareness-raising actions on safety and health at work. In the first phase of this project, 1,858 news articles regarding 580 different accidents were collected; for each injury, not only the news texts but also several variables were identified. Our hypothesis is that, for different kinds of accidents, journalists use different language to narrate the events. To verify this, a text clustering procedure is applied to the articles, together with a Lexical Correspondence Analysis; our purpose is to find language distinctions connected to groups of similar injuries. The identification of different ways of reporting the events could, in fact, provide new elements to describe safety knowledge, and also establish collaborations with journalists in order to enhance communication and raise people's attention toward workers' safety.
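The core of the text clustering step this abstract describes — representing each article as a weighted term vector and measuring similarity between articles — can be sketched in plain Python. This is only an illustration of the technique, not the study's pipeline: the example headlines are invented, the real corpus is Italian news text, and the study pairs clustering with a Lexical Correspondence Analysis not shown here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for a list of tokenised documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency per term
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented example headlines standing in for the accident news texts.
docs = [
    "worker fell from scaffolding at construction site".split(),
    "construction worker injured in scaffolding fall".split(),
    "truck driver crushed in warehouse loading accident".split(),
]
vecs = tfidf_vectors(docs)
```

A clustering algorithm (e.g. k-means over these vectors) would then group articles describing similar accidents, since they share weighted vocabulary: the two scaffolding stories above score higher with each other than with the warehouse story.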

    Distantly Labeling Data for Large Scale Cross-Document Coreference

    Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on "distantly labeling" a data set from which we can train a discriminative cross-document coreference model. In particular, we build a dataset of more than a million person mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); then we train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach.
    Comment: 16 pages, submitted to ECML 201
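The idea behind distant labeling can be illustrated with a deliberately simplified sketch: if two mention strings can both be linked to the same Wikipedia entry, the pair gets a positive coreference label; different entries give a negative label; unlinkable mentions stay unlabeled. Everything here is invented for illustration — the paper links NYT mentions to Wikipedia with a generative model and measures the labeling's reliability, not with a literal lookup table.

```python
# Toy index mapping normalised mention strings to Wikipedia page titles.
# The names and the index itself are hypothetical illustration data.
WIKI_INDEX = {
    "barack obama": "Barack_Obama",
    "president obama": "Barack_Obama",
    "hillary clinton": "Hillary_Clinton",
}

def normalize(mention):
    """Lowercase and collapse whitespace before lookup."""
    return " ".join(mention.lower().split())

def distant_pair_label(m1, m2, index=WIKI_INDEX):
    """Distant label for a mention pair: True/False when both mentions
    link to Wikipedia entries, None (unlabeled) otherwise."""
    e1, e2 = index.get(normalize(m1)), index.get(normalize(m2))
    if e1 is None or e2 is None:
        return None
    return e1 == e2
```

Pairs labeled this way would then serve as (noisy) training data for a discriminative coreference model, which is where the conditional random field with entity- and mention-pair factors comes in.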

    Building a Document Genre Corpus: a Profile of the KRYS I Corpus

    This paper describes the KRYS I corpus (http://www.krys-corpus.eu/Info.html), consisting of documents classified into 70 genre classes. It has been constructed as part of an effort to automate document genre classification as distinct from topic detection. Previously there has been very little work on building corpora of texts classified with a non-topical genre palette. The reason for this is partly that genre as a concept is rooted in philosophy, rhetoric and literature, and is highly complex and domain dependent in its interpretation ([11]). The usefulness of genre in everyday information search is only now starting to be recognised, and no genre classification schema has yet been consolidated to have applicable value in this direction. By presenting our experiences in constructing the KRYS I corpus, we hope to shed light on information gathering and seeking behaviour and the role of genre in these activities, and to point a way forward for creating a better corpus for testing automated genre classification tasks and for applying these tasks to other domains.

    Analyzing Research Tendencies of ELT Researchers and Trajectory of English Language Teaching and Learning in the last Five Years

    In line with new advances in language teaching methodologies and the integration of high-technology tools and web applications, a great deal of scientific research has been published on English language teaching (ELT) and learning (ELL) in recent years. However, it remains a significant research question exactly which types of research topics are most studied among researchers from different countries, and which research groups lead the field worldwide. Although there are noteworthy studies that clarify the most studied topics and the trajectory of ELT research by means of literature reviews, there are very few studies that compare researchers' tendencies using a text/content mining methodology. Papers reviewing the literature are mostly limited in depicting a broad understanding of the scope of such studies; a corpus-based detection methodology that could illuminate those research tendencies and trajectories, and yield very informative descriptive results for the field, is missing. The current research therefore aims to find the most frequent research contexts and topics of the last five years by analyzing research papers published in leading academic journals in the field, to compare the tendencies of researchers from different institutions and countries in selecting their research contexts and topics, and to outline a trajectory for future studies. The researchers believe that there are different tendencies among researchers in selecting research contexts and topics, which should be revealed for future research. A corpus-based detection methodology is used in this study, consisting of storing variable data in .txt files and analyzing the variables with a concordancer.
    The corpus-based detection method covers gathering the textual data for the variables and analyzing them by means of a concordancer, AntConc. The corpus-based data from the variables are then analyzed with a statistical software package, JASP, to clarify potential differences among the researchers. A short analysis of the data indicates that researchers still focus on keywords such as explicit learning and knowledge, implicit learning and knowledge, as well as age and bilingualism. It is also observed that meta-analysis is an important topic in recently conducted studies. Further results of the study could be beneficial for researchers and learners inside and outside the field of ELT, and could help people focus on less frequently studied contexts and topics.
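The basic operation a concordancer such as AntConc performs — producing Key Word In Context (KWIC) lines for a search term — can be sketched in a few lines of Python. This is a minimal stand-in for illustration only; the example sentence is invented, and AntConc itself offers far richer sorting, wildcard, and collocation features.

```python
def kwic(tokens, keyword, window=3):
    """Key Word In Context lines for one search term: each hit is shown
    with `window` tokens of left and right context, the basic view a
    concordancer produces."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

tokens = "implicit learning and explicit learning differ in many ways".split()
for line in kwic(tokens, "learning", window=2):
    print(line)
```

Scanning such lines is how an analyst checks, for each keyword hit, which research context or topic it actually refers to before counting it.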

    Stance-taking and public discussion in blogs.

    Blogs, which can be written and read by anyone with a computer and an internet connection, would seem to expand the possibilities for engagement in public sphere debates. Indeed, blogs are full of the kind of vocabulary that suggests intense discussion. However, a closer look at the way this vocabulary is used in context suggests that the main concern of writers is self-presentation, positioning themselves in a crowded forum, in what has been called stance-taking. When writers mark their stances, for instance by saying I think, they enact different ways of signalling a relation to others, marking disagreement, enacting surprise, and ironicising previous contributions. All these moves are ways of presenting one's own contribution as distinctive, showing one's entitlement to a position. In this paper, I use concordance tools to identify strings that are very frequent in a corpus of blogs relative to a general corpus of written texts, focus on those relatively frequent words that mark stance, and analyse these markers in context. I argue that the prominence of stance-taking indicates the priority of individual positioning over collective and deliberative discussion.
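Identifying strings that are unusually frequent in one corpus relative to a reference corpus is a standard keyness comparison; one widely used statistic for it is Dunning's log-likelihood. The sketch below shows that computation with invented counts — the abstract does not name the statistic its concordance tools use, so this is an illustration of the general technique, not the paper's exact method.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Dunning's log-likelihood keyness statistic for a word occurring
    freq_a times in corpus A (size_a tokens) and freq_b times in the
    reference corpus B (size_b tokens). Higher means more distinctive."""
    total = freq_a + freq_b
    e_a = size_a * total / (size_a + size_b)   # expected count in A
    e_b = size_b * total / (size_a + size_b)   # expected count in B
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / e_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / e_b)
    return 2 * ll

# Hypothetical counts: a stance marker like "I think" overrepresented in a
# 10,000-token blog sample versus a 100,000-token general reference corpus.
keyness = log_likelihood(100, 10_000, 10, 100_000)
```

A word with the same relative frequency in both corpora scores near zero, while markers far more common in the blog corpus, like the stance vocabulary the paper examines, score high and surface at the top of the keyword list.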

    Text Summarization Techniques: A Brief Survey

    In recent years, there has been an explosion in the amount of text data from a variety of sources. This volume of text is an invaluable source of information and knowledge which needs to be effectively summarized to be useful. In this review, the main approaches to automatic text summarization are described. We review the different processes for summarization and describe the effectiveness and shortcomings of the different methods.
    Comment: Some of the reference formats have been updated
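Among the approaches such surveys cover, the simplest family is frequency-based extractive summarization: score each sentence by the frequency of its words in the whole text and keep the top-scoring ones. The sketch below shows that classic (Luhn-style) idea with an invented three-sentence example; it is a minimal illustration, not any particular method from the survey.

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Frequency-based extractive summarisation: keep the n sentences
    whose words are most frequent across the whole text."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'\w+', text.lower()))
    def score(sent):
        toks = re.findall(r'\w+', sent.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:n])
    return [s for s in sentences if s in top]   # preserve original order

text = ("Text summarization condenses text. "
        "Summarization of text data helps. "
        "Bananas are yellow.")
summary = summarize(text, n=1)
```

Here the off-topic sentence scores lowest because its words occur nowhere else, so the extract keeps a sentence built from the document's dominant vocabulary — the basic signal that more sophisticated extractive and abstractive methods refine.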