Strategies for Improving Semi-automated Topic Classification of Media and Parliamentary documents
Since 1995, the techniques and capacity to store new electronic data and make it available to many people have become a common good. Since then, different organizations, such as research institutes, universities, libraries, and private companies (e.g. Google), have started to scan older documents and make them electronically available as well. This has generated many new research opportunities for all kinds of academic disciplines. The use of software to analyze large datasets has become an important part of doing research in the social sciences. Most academics rely on human-coded datasets, in both qualitative and quantitative research. However, with the increasing number of datasets and the complexity of the questions scholars pose to them, the quest for more efficient and effective methods is now on the agenda. One of the most common techniques of content analysis is the Boolean keyword search method. To find certain topics in a dataset, the researcher first creates a list of keywords, combined with Boolean operators (AND, OR, etc.). All keys are usually grouped into families, and the entire list of keys and groups is called the ontology. The keywords are then searched in the dataset, retrieving all documents containing the specified keywords. The online newspaper database LexisNexis provides the user with such a Boolean search method. However, the Boolean keyword search is not always satisfactory in terms of reliability and validity. For that reason, social scientists rely on hand-coding. Two projects that do so are the Congressional Bills Project (www.congressionalbills.org) and the Policy Agendas Project (see www.policyagendas.org). They developed a topic codebook and coded various sources, such as State of the Union speeches, bills, and newspaper articles.
However, continuously improving automated coding techniques and the increasing number of agenda-setting projects (especially in European countries) have made the use of automated coding software both a feasible option and a necessity.
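The Boolean keyword search described above can be illustrated with a short sketch. The documents, keyword families, and the two-slot family structure (OR terms plus optional AND terms) are invented for illustration; this is not the LexisNexis interface, only the general idea of retrieval against an ontology of grouped keys.

```python
# Minimal sketch of Boolean keyword retrieval against an "ontology" of
# keyword families. All names and data here are illustrative.

documents = {
    1: "parliament debates the new healthcare budget",
    2: "hospital funding and insurance reform",
    3: "football results from the weekend",
}

# Each family: match ANY of these terms (OR), and ALL of these terms (AND).
ontology = {
    "health": {"any": {"healthcare", "hospital", "insurance"}, "all": set()},
    "budget": {"any": {"budget", "funding"}, "all": set()},
}

def matches(text, family):
    tokens = set(text.lower().split())
    return (not family["any"] or bool(tokens & family["any"])) and \
           family["all"] <= tokens

def search(topic):
    """Retrieve all document ids whose text satisfies the family's rule."""
    family = ontology[topic]
    return [doc_id for doc_id, text in documents.items() if matches(text, family)]

print(search("health"))  # -> [1, 2]
print(search("budget"))  # -> [1, 2]
```

The reliability problems the abstract mentions are visible even here: retrieval depends entirely on the keyword list, so a document discussing health policy without any listed term is silently missed.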
An XML-based Tool for Tracking English Inclusions in German Text
The use of lexicons and corpora advances both linguistic research and the performance of current natural language processing (NLP) systems. We present a tool that exploits such resources, specifically English and German lexical databases and the World Wide Web, to recognise English inclusions in German newspaper articles. The output of the tool can assist lexical resource developers in monitoring changing patterns of English inclusion usage. The corpus used for the classification covers three different domains. We report the classification results and illustrate their value to linguistic and NLP research.
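The lexicon-lookup core of such a tool can be sketched in a few lines. The tiny word lists below are stand-ins for the full English and German lexical databases (and Web frequency checks) the abstract mentions, so this only illustrates the basic idea: flag tokens found in the English lexicon but not the German one.

```python
# Toy sketch of lexicon-based detection of English inclusions in German text.
# These lexicons are tiny invented stand-ins for real lexical databases.

german_lexicon = {"die", "ist", "ein", "und", "firma", "neue", "neu"}
english_lexicon = {"meeting", "deadline", "business", "strategy", "update"}

def english_inclusions(sentence):
    """Return tokens found in the English lexicon but not the German one."""
    inclusions = []
    for token in sentence.lower().split():
        word = token.strip(".,;:!?")
        if word in english_lexicon and word not in german_lexicon:
            inclusions.append(word)
    return inclusions

print(english_inclusions("Die Firma ist ein Business und die Deadline ist neu"))
# -> ['business', 'deadline']
```

In practice the hard cases are words present in both lexicons (interlingual homographs), which is where corpus and Web evidence would come in.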
DCU at MediaEval 2010 – Tagging task WildWildWeb
We describe our runs and results for the fixed-label Wild Wild Web Tagging Task at MediaEval 2010. Our experiments indicate that including all words in the ASR transcripts of the document set results in better labeling accuracy than restricting the index to only words with recognition confidence above a fixed level. Additionally, our results show that tagging accuracy can be improved by incorporating additional metadata describing the documents where it is available.
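The comparison the abstract reports can be sketched as two indexing strategies over confidence-scored ASR output. The words and confidence scores below are invented; the point is only the difference between keeping every hypothesized word and applying a fixed confidence cutoff.

```python
# Sketch of the two indexing strategies compared: index every ASR word vs.
# only words above a confidence threshold. Words and scores are invented.

asr_output = [
    ("cooking", 0.92), ("show", 0.85), ("recipe", 0.40),
    ("pasta", 0.35), ("music", 0.10),
]

def index_terms(words, min_confidence=0.0):
    """Build a term index from (word, confidence) pairs."""
    return {w for w, conf in words if conf >= min_confidence}

all_words = index_terms(asr_output)       # keep everything
filtered = index_terms(asr_output, 0.5)   # fixed confidence cutoff

# Low-confidence words like "recipe" and "pasta" can still be useful cues
# for tagging, which is consistent with the finding that the unfiltered
# index labelled documents better.
print(sorted(all_words - filtered))  # -> ['music', 'pasta', 'recipe']
```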
Overview of VideoCLEF 2009: New perspectives on speech-based multimedia content enrichment
VideoCLEF 2009 offered three tasks related to enriching video content for improved multimedia access in a multilingual environment. For each task, video data (Dutch-language television, predominantly documentaries) accompanied by speech recognition transcripts were provided.
The Subject Classification Task involved automatic tagging of videos with subject theme labels. The best performance was achieved by approaching subject tagging as an information retrieval task and using both speech recognition transcripts and archival metadata. Alternatively, classifiers were trained using either the training data provided or data collected from Wikipedia or via general Web search. The Affect Task involved detecting narrative peaks, defined as points where viewers perceive heightened dramatic tension. The task was carried out on the “Beeldenstorm” collection containing 45 short-form documentaries on the visual arts. The best runs exploited affective vocabulary and audience directed speech. Other approaches included using topic changes, elevated speaking pitch, increased speaking intensity and radical visual changes. The Linking Task, also called “Finding Related Resources Across Languages,” involved linking video to material on the same subject in a different language.
Participants were provided with a list of multimedia anchors (short video segments) in the Dutch-language “Beeldenstorm” collection and were expected to return target pages drawn from English-language Wikipedia. The best performing methods used the transcript of the speech spoken during the multimedia anchor to build a query to search an index of the Dutch-language Wikipedia. The Dutch Wikipedia pages returned were used to identify related English pages. Participants also experimented with pseudo-relevance feedback, query translation and methods that targeted proper names.
On Horizontal and Vertical Separation in Hierarchical Text Classification
Hierarchy is a common and effective way of organizing data and representing their relationships at different levels of abstraction. However, hierarchical data dependencies cause difficulties in the estimation of "separable" models that can distinguish between the entities in the hierarchy. Extracting separable models of hierarchical entities requires us to take their relative position into account and to consider the different types of dependencies in the hierarchy. In this paper, we present an investigation of the effect of separability in text-based entity classification and argue that in hierarchical classification, a separation property should be established between entities not only in the same layer, but also in different layers. Our main findings are as follows. First, we analyse the importance of separability in the data representation for the task of classification and, based on that, introduce a "Strong Separation Principle" for optimizing the expected effectiveness of classifier decisions based on the separation property. Second, we present Hierarchical Significant Words Language Models (HSWLM), which capture all, and only, the essential features of hierarchical entities according to their relative position in the hierarchy, resulting in horizontally and vertically separable models. Third, we validate our claims on real-world data and demonstrate how HSWLM improves the accuracy of classification and how it provides transferable models over time. Although the discussions in this paper focus on the classification problem, the models are applicable to any information access task on data that has, or can be mapped to, a hierarchical structure.
Comment: Full paper (10 pages) accepted for publication in the proceedings of the ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR'16).
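The intuition behind horizontal and vertical separation can be illustrated with a toy filter. This is not the authors' HSWLM estimation procedure, only a rough sketch of the idea under invented data: a term is kept for a node's model only if it is markedly more frequent there than in the parent layer (vertical separation) and than in sibling nodes (horizontal separation).

```python
from collections import Counter

# Toy illustration of the separation intuition (not the HSWLM algorithm):
# keep a term for a node only if its relative frequency stands out against
# both the parent layer and the node's siblings. Data and names are invented.

docs = {
    "sports":   "goal match team goal news game",
    "politics": "vote law party vote news law",
}
parent_text = docs["sports"] + " " + docs["politics"]  # the shared top layer

def rel_freq(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def significant_words(node, ratio=1.5):
    node_lm = rel_freq(docs[node])
    parent_lm = rel_freq(parent_text)
    siblings = [rel_freq(t) for n, t in docs.items() if n != node]
    kept = set()
    for w, p in node_lm.items():
        vertical = p > ratio * parent_lm.get(w, 0.0)      # stands out vs parent
        horizontal = all(p > ratio * s.get(w, 0.0) for s in siblings)
        if vertical and horizontal:
            kept.add(w)
    return kept

print(sorted(significant_words("sports")))
# -> ['game', 'goal', 'match', 'team']  ("news", shared by both nodes, is dropped)
```

Dropping the shared term "news" is the point: general-layer vocabulary is pushed out of the leaf models, leaving only the features that separate a node both from its parent and from its siblings.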
Multilingual text pre-processing for sentiment analysis based on social network data
Sentiment analysis (SA) is an enduring area of research, especially in the field of text analysis. Text pre-processing is an important step in performing SA accurately. This paper presents a text-processing model for SA that uses natural language processing techniques on Twitter data. The basic phases for machine learning are text collection, text cleaning, pre-processing, feature extraction, and then categorization of the data according to SA techniques. Keeping the focus on Twitter data, the data is extracted in a domain-specific manner. In the data-cleaning phase, noisy data, missing data, punctuation, tags and emoticons are handled. For pre-processing, tokenization is performed, followed by stop-word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset. The accuracy of the classification techniques improved after applying text pre-processing, and the dimensionality of the data was reduced. The proposed corpus can be utilized in the areas of market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as a baseline for applying predictive analysis, machine learning and deep learning algorithms, which can be extended according to the problem definition.
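The pipeline phases described above (cleaning of noise, tags and emoticons, then tokenization, then stop-word removal) can be sketched as follows. The regular expressions, the tiny stop-word list, and the example tweet are stand-ins, not the paper's actual resources.

```python
import re

# Minimal sketch of the pre-processing phases described above: cleaning
# (URLs, @-mentions, hashtags, punctuation, emoticons), tokenization, and
# stop-word removal (SWR). The stop-word list is a small invented stand-in.

STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def clean(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", " ", tweet)   # strip URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)        # strip mentions and hashtags
    tweet = re.sub(r"[^a-z\s]", " ", tweet)       # strip punctuation, emoticons, digits
    return tweet

def preprocess(tweet):
    tokens = clean(tweet).split()                      # tokenization
    return [t for t in tokens if t not in STOPWORDS]   # stop-word removal

print(preprocess("Loving the new phone!! :) @BrandX #awesome http://t.co/x"))
# -> ['loving', 'new', 'phone']
```

Note the dimensionality effect the abstract reports: five of the raw tokens are removed before feature extraction, shrinking the vocabulary the classifier has to handle. A real pipeline would also decide whether hashtag words (here dropped entirely) carry sentiment worth keeping.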