
    Strategies for Improving Semi-automated Topic Classification of Media and Parliamentary documents

    Since 1995, the techniques and capacity to store new electronic data and make it available to many people have become a common good. Since then, organizations such as research institutes, universities, libraries, and private companies (e.g. Google) have also begun scanning older documents and making them electronically available. This has generated many new research opportunities across academic disciplines. The use of software to analyze large datasets has become an important part of research in the social sciences. Most academics rely on human-coded datasets, in both qualitative and quantitative research. However, with the growing number of datasets and the complexity of the questions scholars pose to them, the quest for more efficient and effective methods is now on the agenda. One of the most common content analysis techniques is the Boolean keyword search. To find certain topics in a dataset, the researcher first creates a list of keywords combined with Boolean operators (AND, OR, etc.). The keys are usually grouped into families, and the entire list of keys and groups is called the ontology. The keywords are then searched in the dataset, retrieving all documents that contain them. The online newspaper database LexisNexis provides the user with such a Boolean search method. However, Boolean keyword search is not always satisfactory in terms of reliability and validity. For that reason social scientists rely on hand-coding. Two projects that do so are the congressional bills project (www.congressionalbills.org) and the policy agenda-setting project (see www.policyagendas.org). They developed a topic codebook and coded various sources, such as State of the Union speeches, bills, and newspaper articles. The continuously improving automated coding techniques and the increasing number of agenda-setting projects (especially in European countries), however, have made the use of automated coding software both a feasible option and a necessity.
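
    As a rough illustration of the Boolean keyword search described above, the following Python sketch groups hypothetical keys into families (an ontology) and retrieves every document matching a family under AND/OR logic; the keyword lists and function names are illustrative, not the actual ontology used by the agenda-setting projects.

    import re

    # Illustrative ontology: families of topic keys (all keywords hypothetical).
    ontology = {
        "environment": ["climate", "emissions", "pollution"],
        "health": ["hospital", "vaccine", "epidemic"],
    }

    def matches(text, keywords, mode="OR"):
        """Return True if the document satisfies the Boolean keyword query.

        mode="OR" requires at least one keyword; mode="AND" requires all of them.
        """
        words = set(re.findall(r"\w+", text.lower()))
        hits = [kw.lower() in words for kw in keywords]
        return all(hits) if mode == "AND" else any(hits)

    def retrieve(documents, topic, mode="OR"):
        """Retrieve every document matching the keyword family of a topic."""
        return [doc for doc in documents if matches(doc, ontology[topic], mode)]

    docs = ["New emissions rules aim to curb pollution.",
            "The hospital expanded its vaccine programme."]
    print(retrieve(docs, "environment"))  # OR search over the environment family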

    An XML-based Tool for Tracking English Inclusions in German Text

    The use of lexicons and corpora advances both linguistic research and the performance of current natural language processing (NLP) systems. We present a tool that exploits such resources, specifically English and German lexical databases and the World Wide Web, to recognise English inclusions in German newspaper articles. The output of the tool can assist lexical resource developers in monitoring changing patterns of English inclusion usage. The corpus used for the classification covers three different domains. We report the classification results and illustrate their value to linguistic and NLP research.
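
    A minimal sketch of the kind of lexicon-based lookup the abstract describes, assuming two small illustrative word lists in place of the full English and German lexical databases and Web evidence the actual tool uses:

    # Toy lexicons; the actual tool queries full English and German lexical
    # databases and the World Wide Web for additional evidence.
    ENGLISH_LEXICON = {"update", "software", "meeting", "download"}
    GERMAN_LEXICON = {"der", "die", "das", "neue", "kommt", "heute"}

    def tag_english_inclusions(tokens):
        """Label each token as an English inclusion ('EN') or German ('DE').

        A token found only in the English lexicon is treated as an inclusion;
        everything else defaults to German. Real systems add morphological,
        capitalization and Web-frequency cues.
        """
        tags = []
        for tok in tokens:
            low = tok.lower()
            if low in ENGLISH_LEXICON and low not in GERMAN_LEXICON:
                tags.append((tok, "EN"))
            else:
                tags.append((tok, "DE"))
        return tags

    print(tag_english_inclusions("Das neue Software Update kommt heute".split()))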

    DCU at MediaEval 2010 – Tagging task WildWildWeb

    We describe our runs and results for the fixed-label Wild Wild Web Tagging Task at MediaEval 2010. Our experiments indicate that including all words in the ASR transcripts of the document set results in better labeling accuracy than restricting the index to only words with recognition confidence above a fixed level. Additionally, our results show that tagging accuracy can be improved by incorporating additional metadata describing the documents where it is available.
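
    The comparison between indexing all ASR words and thresholding on recognition confidence can be sketched as below; the (word, confidence) transcript format and the threshold value are assumptions for illustration, not the task's actual data format.

    def index_terms(transcript, min_confidence=None):
        """Collect index terms from an ASR transcript.

        transcript: list of (word, confidence) pairs (assumed format).
        min_confidence: if None, keep every word; otherwise drop words
        recognised with confidence below the fixed level.
        """
        if min_confidence is None:
            return [word for word, _ in transcript]
        return [word for word, conf in transcript if conf >= min_confidence]

    asr = [("wildlife", 0.92), ("documentary", 0.85), ("gnu", 0.41)]
    print(index_terms(asr))                      # all words indexed
    print(index_terms(asr, min_confidence=0.5))  # low-confidence words dropped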

    Overview of VideoCLEF 2009: New perspectives on speech-based multimedia content enrichment

    VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided. The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the “Beeldenstorm” collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience-directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called “Finding Related Resources Across Languages,” involved linking video to material on the same subject in a different language. Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language “Beeldenstorm” collection and were expected to return target pages drawn from English-language Wikipedia. The best-performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names.
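
    The best-performing linking strategy described above can be sketched roughly as follows, with a toy in-memory stand-in for the Dutch Wikipedia index and simple term overlap in place of a real retrieval model; all page entries are illustrative.

    # Toy stand-in for an index of Dutch Wikipedia pages; each entry holds the
    # page text and its English interlanguage link (all values illustrative).
    NL_WIKI = {
        "Rembrandt": {"text": "schilder nachtwacht barok amsterdam",
                      "en_link": "Rembrandt"},
        "Van Gogh": {"text": "schilder zonnebloemen post impressionisme",
                     "en_link": "Vincent van Gogh"},
    }

    def link_anchor(anchor_transcript, top_k=1):
        """Link a spoken anchor to English Wikipedia via the Dutch edition.

        1. Use the words spoken during the anchor as the query.
        2. Rank Dutch pages by term overlap (a stand-in for a retrieval model).
        3. Follow the interlanguage links of the best Dutch pages to English.
        """
        query = set(anchor_transcript.lower().split())
        ranked = sorted(
            NL_WIKI.values(),
            key=lambda page: len(query & set(page["text"].split())),
            reverse=True,
        )
        return [page["en_link"] for page in ranked[:top_k]]

    print(link_anchor("de schilder van de nachtwacht in amsterdam"))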

    On Horizontal and Vertical Separation in Hierarchical Text Classification

    Hierarchy is a common and effective way of organizing data and representing their relationships at different levels of abstraction. However, hierarchical data dependencies cause difficulties in the estimation of "separable" models that can distinguish between the entities in the hierarchy. Extracting separable models of hierarchical entities requires us to take their relative position into account and to consider the different types of dependencies in the hierarchy. In this paper, we present an investigation of the effect of separability in text-based entity classification and argue that in hierarchical classification, a separation property should be established between entities not only in the same layer, but also in different layers. Our main findings are the following. First, we analyse the importance of separability for data representation in the classification task and, based on that, introduce a "Strong Separation Principle" for optimizing the expected effectiveness of classifier decisions based on the separation property. Second, we present Hierarchical Significant Words Language Models (HSWLM), which capture all, and only, the essential features of hierarchical entities according to their relative position in the hierarchy, resulting in horizontally and vertically separable models. Third, we validate our claims on real-world data and demonstrate how HSWLM improves the accuracy of classification and how it provides transferable models over time. Although the discussions in this paper focus on the classification problem, the models are applicable to any information access task on data that has, or can be mapped to, a hierarchical structure. Comment: Full paper (10 pages) accepted for publication in the proceedings of the ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR'16).
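
    A loose, heuristic approximation of the idea behind HSWLM is sketched below: terms frequent in a node are kept as significant, while terms that are also frequent in the parent layer (too general) or in sibling nodes (not specific) are discounted. The paper's actual estimation is an EM-style procedure over the whole hierarchy; this sketch and its toy data are only illustrative.

    from collections import Counter

    def significant_words(node_docs, parent_docs, sibling_docs, top_n=3):
        """Heuristic approximation of a node's significant-words model."""
        node = Counter(" ".join(node_docs).lower().split())
        parent = Counter(" ".join(parent_docs).lower().split())
        siblings = Counter(" ".join(sibling_docs).lower().split())

        def rel(counter, term):
            # Relative frequency of a term within one layer's documents.
            total = sum(counter.values())
            return counter[term] / total if total else 0.0

        # Reward node-specific terms, discount general and sibling-shared terms.
        scores = {t: rel(node, t) - rel(parent, t) - rel(siblings, t) for t in node}
        return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]

    # Toy example: one entity (node) contrasted with the layer above (parent)
    # and the other entities in its own layer (siblings).
    print(significant_words(
        node_docs=["green energy transition policy"],
        parent_docs=["policy debate budget policy"],
        sibling_docs=["budget tax policy"]))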

    Text pre-processing of multilingual for sentiment analysis based on social network data

    Sentiment analysis (SA) is an enduring area of research, especially in the field of text analysis. Text pre-processing is an important aspect of performing SA accurately. This paper presents a text-processing model for SA, using natural language processing techniques on Twitter data. The basic phases of the machine learning pipeline are text collection, text cleaning, pre-processing, feature extraction, and then categorizing the data according to SA techniques. Keeping the focus on Twitter data, the data is extracted in a domain-specific manner. In the data cleaning phase, noisy data, missing data, punctuation, tags and emoticons have been considered. For pre-processing, tokenization is performed, followed by stop word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset. The accuracy of the classification techniques improved after applying text pre-processing, and dimensionality was reduced. The proposed corpus can be utilized in the areas of market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as a baseline for applying predictive analytics, machine learning and deep learning algorithms, which can be extended according to the problem definition.
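
    A minimal sketch of such a pre-processing pipeline for tweets, covering cleaning, tokenization and stop word removal; the regular expressions and the stop word list are illustrative assumptions rather than the paper's exact implementation.

    import re

    # Small illustrative stop word list; a real pipeline would use a fuller,
    # language-specific list.
    STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}

    def clean(tweet):
        """Remove noisy elements: URLs, user tags, hashtags, simple emoticons,
        punctuation and redundant whitespace."""
        tweet = re.sub(r"https?://\S+", " ", tweet)        # URLs
        tweet = re.sub(r"[@#]\w+", " ", tweet)             # mentions / hashtags
        tweet = re.sub(r"[:;=8][-']?[)(DPp]", " ", tweet)  # simple emoticons
        tweet = re.sub(r"[^\w\s]", " ", tweet)             # punctuation
        return re.sub(r"\s+", " ", tweet).strip().lower()

    def preprocess(tweet):
        """Clean, tokenize and apply stop word removal (SWR)."""
        tokens = clean(tweet).split()
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("Loving the new phone!! :) battery is great @brand #happy http://t.co/x"))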