Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, there is ambiguity
in the data, as well as a multitude of perspectives on the information
examples. We present an empirically derived methodology for efficiently
gathering ground truth data in a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics against that of majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers
leads to growth and stabilization in the quality of annotations, going against
the usual practice of employing a small number of annotators.
Comment: in publication at the Semantic Web Journal
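To make the contrast between majority vote and a disagreement-aware aggregate concrete, here is a minimal sketch. It is not the CrowdTruth metrics themselves: the function names `majority_vote` and `label_scores` and the example labels are hypothetical, and the cosine-style score is only a simplified stand-in for a metric that keeps the spread of worker answers instead of discarding it.

```python
from collections import Counter
from math import sqrt


def majority_vote(annotations):
    """Collapse all worker answers for one unit into the most frequent label."""
    return Counter(annotations).most_common(1)[0][0]


def label_scores(annotations, labels):
    """Score each label by the cosine between the summed vote vector
    and that label's unit vector (assumed, simplified disagreement measure).

    A score near 1.0 means workers agreed on that label; several
    mid-range scores mean the unit itself is ambiguous.
    """
    counts = Counter(annotations)
    norm = sqrt(sum(counts[label] ** 2 for label in labels))
    return {label: (counts[label] / norm if norm else 0.0) for label in labels}


if __name__ == "__main__":
    # Hypothetical votes from ten workers on one tweet.
    votes = ["event"] * 6 + ["no_event"] * 4
    print(majority_vote(votes))
    print(label_scores(votes, ["event", "no_event"]))
```

Majority vote reports only the winning label, while the per-label scores show how contested the unit was, which is the kind of information the abstract argues should be preserved.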
Automated identification of Fos expression
The concentration of Fos, a protein encoded by the immediate-early gene c-fos, provides a measure of synaptic activity that may not parallel the electrical activity of neurons. Such a measure is important for the difficult problem of identifying dynamic properties of neuronal circuitries activated by a variety of stimuli and behaviours. We employ two-stage statistical pattern recognition to identify cellular nuclei that express Fos in two-dimensional sections of rat forebrain after administration of antipsychotic drugs. In stage one, we distinguish dark-stained candidate nuclei from image background by a thresholding algorithm and record size and shape measurements of these objects. In stage two, we compare performance of linear and quadratic discriminants, nearest-neighbour and artificial neural network classifiers that employ functions of these measurements to label candidate objects as either Fos nuclei, two touching Fos nuclei or irrelevant background material. New images of neighbouring brain tissue serve as test sets to assess generalizability of the best derived classification rule, as determined by lowest cross-validation misclassification rate. Three experts, two internal and one external, compare manual and automated results for accuracy assessment. Analyses of a subset of images on two separate occasions provide quantitative measures of inter- and intra-expert consistency. We conclude that our automated procedure yields results that compare favourably with those of the experts and thus has potential to remove much of the tedium, subjectivity and irreproducibility of current Fos identification methods in digital microscopy
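The two-stage procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the fixed cutoff, the two measurements (area and bounding-box fill), the class names and the single nearest-neighbour rule are all hypothetical choices, and the abstract's other classifiers (linear and quadratic discriminants, neural networks) are not shown.

```python
from collections import deque


def threshold(image, cutoff):
    """Stage one: mark pixels darker than the cutoff as candidate foreground."""
    return [[pixel < cutoff for pixel in row] for row in image]


def connected_objects(mask):
    """Group foreground pixels into 4-connected candidate objects."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects


def measure(pixels):
    """Return (area, fill ratio of the bounding box) for one object."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    area = len(pixels)
    box = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return (float(area), area / box)


def nearest_neighbour(features, training):
    """Stage two: label an object with the class of its closest training example."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(training, key=lambda example: squared_distance(features, example[0]))[1]


if __name__ == "__main__":
    # Hypothetical grey-level image: low values are dark-stained pixels.
    image = [
        [200, 200, 200, 200, 200],
        [200, 50, 50, 200, 200],
        [200, 50, 50, 200, 40],
        [200, 200, 200, 200, 200],
    ]
    # Hypothetical labelled examples: (area, fill ratio) -> class.
    training = [((4.0, 1.0), "nucleus"), ((1.0, 1.0), "background")]
    for obj in connected_objects(threshold(image, 100)):
        print(measure(obj), nearest_neighbour(measure(obj), training))
```

In practice the classification rule would be chosen by cross-validation misclassification rate and then checked on images of neighbouring tissue, as the abstract describes.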
Unsupervised and knowledge-poor approaches to sentiment analysis
Sentiment analysis focuses upon automatic classification of a document's sentiment (and more generally extraction of opinion from text). Ways of expressing sentiment have been shown to be dependent on what a document is about (domain-dependency). This complicates supervised methods for sentiment analysis, which rely on extensive use of training data or linguistic resources that are usually either domain-specific or generic. Both kinds of resources prevent classifiers from performing well across a range of domains, as this requires appropriate in-domain (domain-specific) data.
This thesis presents a novel unsupervised, knowledge-poor approach to sentiment analysis aimed at creating a domain-independent and multilingual sentiment analysis system.
The approach extracts domain-specific resources from documents that are to be processed, and uses them for sentiment analysis. This approach does not require any training corpora, large sets of rules or generic sentiment lexicons, which makes it domain- and language-independent but at the same time able to utilise domain- and language-specific information.
The thesis describes and tests the approach, which is applied to different data, including customer reviews of various types of products, reviews of films and books, and news items; and to four languages: Chinese, English, Russian and Japanese. The approach is applied not only to binary sentiment classification, but also to three-way sentiment classification (positive, negative and neutral), subjectivity classification of documents and sentences, and to the extraction of opinion holders and opinion targets. Experimental results suggest that the approach is often a viable alternative to supervised systems, especially when applied to large document collections.