    Trusting the results in crosslingual keyword-based image retrieval

    This paper gives a brief description of the starting points for the experiments the SICS team performed in the 2006 interactive CLEF campaign.

    Active Learning for Dialogue Act Classification

    Active learning techniques were employed for the classification of dialogue acts over two dialogue corpora: the English human-human Switchboard corpus and the Spanish human-machine Dihana corpus. It is clearly shown that active learning improves on a baseline obtained through a passive learning approach to tagging the same data sets. An error reduction of 7% was obtained on Switchboard, while a fivefold reduction in the amount of labeled data needed for classification was achieved on Dihana. The passive Support Vector Machine learner used as the baseline itself significantly improves the state of the art in dialogue act classification on both corpora; on Switchboard it gives a 31% error reduction compared to the previously best reported result.
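
    The sketch below illustrates the kind of pool-based uncertainty sampling such a setup could use: an SVM is trained on the labelled utterances, and the unlabelled utterances with the smallest decision margin are queried next. It assumes scikit-learn, pre-computed numeric feature vectors and more than two dialogue-act classes; the features, learner configuration and corpora from the paper are not reproduced.

        # A minimal sketch of margin-based uncertainty sampling with an SVM baseline,
        # assuming scikit-learn; synthetic data stands in for the dialogue corpora.
        import numpy as np
        from sklearn.svm import LinearSVC

        def margins(scores):
            # Difference between the two highest per-class decision values;
            # a small margin means the current model is unsure about that utterance.
            top2 = np.sort(scores, axis=1)[:, -2:]
            return top2[:, 1] - top2[:, 0]

        def active_learning_round(X_lab, y_lab, X_pool, batch_size=10):
            clf = LinearSVC().fit(X_lab, y_lab)
            scores = clf.decision_function(X_pool)         # shape: (n_pool, n_classes)
            query = np.argsort(margins(scores))[:batch_size]
            return clf, query                              # hand `query` to the annotator

        rng = np.random.default_rng(0)
        X_lab, y_lab = rng.normal(size=(60, 8)), rng.integers(0, 4, 60)
        X_pool = rng.normal(size=(500, 8))
        _, query = active_learning_round(X_lab, y_lab, X_pool)
        print("utterances to label next:", query)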

    Tagging and Morphological Processing in the SVENSK System

    This thesis describes the work of providing separate morphological processing and part-of-speech tagging modules in the SVENSK system by integrating the Uppsala Chart Processor (UCP) and a Brill tagger into the system. SVENSK employs GATE (General Architecture for Text Engineering) as the platform in which the components are integrated. Two pre-processing modules, a tokeniser and a sentence splitter for Swedish, were developed in order to prepare the texts to be analysed by UCP and the Brill tagger. These four components were then integrated in GATE together with a newly developed viewer for displaying the results produced by UCP. The thesis introduces the reader to the SVENSK project, the GATE system and its underlying parts, especially the database architecture, which is based on the TIPSTER annotation model. Further, the issues connected with the development and design of the tokeniser and the sentence splitter for Swedish are elaborated on. The mechanisms behind the transformation-based error-driven learning method employed by the Brill tagger are introduced, as are the principles of chart processing in general and UCP in particular. The greater part of the thesis is devoted to the process of integrating the natural language (NL) modules in GATE using the Tcl/Tk application programmer's interface (API) and a so-called loose coupling. The results of the integration of the NL modules are very encouraging: it is possible to mix modules written in programming languages from completely different paradigms (in this case Common LISP, Perl and C) and to have them interact with each other, thus maintaining a high degree of reuse of algorithmic resources. However, the use of Tcl/Tk and the associated API for processing structurally relatively complex data, i.e. the output from UCP, is time-consuming and considerably slows processing in GATE.
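
    As an illustration of the kind of pre-processing described here, the sketch below shows a toy tokeniser and sentence splitter for Swedish. The abbreviation list and the regular expression are placeholders and do not reflect the rules implemented in the SVENSK modules or in UCP.

        # A minimal sketch of a tokeniser and sentence splitter for Swedish text;
        # the abbreviation lexicon is illustrative, not the one used in SVENSK.
        import re

        ABBREVIATIONS = {"t.ex.", "bl.a.", "m.m.", "dvs.", "osv.", "s.k."}

        def tokenise(text):
            # Split off punctuation as separate tokens, keeping known abbreviations intact.
            tokens = []
            for chunk in text.split():
                if chunk.lower() in ABBREVIATIONS:
                    tokens.append(chunk)
                else:
                    tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
            return tokens

        def split_sentences(tokens):
            # A sentence ends at '.', '!' or '?'; abbreviation periods never reach here
            # because abbreviations are kept as single tokens.
            sentences, current = [], []
            for tok in tokens:
                current.append(tok)
                if tok in {".", "!", "?"}:
                    sentences.append(current)
                    current = []
            if current:
                sentences.append(current)
            return sentences

        print(split_sentences(tokenise("Vi testar t.ex. en enkel mening. Fungerar det?")))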

    Direct Measurement of Fast Transients by Using Boot-strapped Waveform Averaging

    An approximation to coherent sampling, also known as boot-strapped waveform averaging, is presented. The method uses digital cavities to determine the condition for coherent sampling. It can be used to increase the effective sampling rate of a repetitive signal and the signal-to-noise ratio simultaneously. The method is demonstrated by using it to directly measure the fluorescence lifetime of rhodamine 6G by digitizing the signal from a fast avalanche photodiode. The obtained lifetime of 4.4 ± 0.1 ns is in agreement with the known values.
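
    The sketch below illustrates only the averaging idea behind such a measurement: summing many noisy repetitions of the same transient raises the signal-to-noise ratio roughly as the square root of the number of repetitions, after which a decay constant can be fitted. It assumes numpy and scipy; the digital-cavity test for the coherent-sampling condition is not reproduced, and the 4.4 ns lifetime is used only to generate synthetic data.

        # A minimal sketch of waveform averaging followed by a lifetime fit.
        import numpy as np
        from scipy.optimize import curve_fit

        rng = np.random.default_rng(1)
        t = np.linspace(0.0, 30.0, 600)          # time axis in ns
        true_tau = 4.4                           # ns, value reported in the paper

        def decay(t, amplitude, tau):
            return amplitude * np.exp(-t / tau)

        # Simulate 2000 noisy repetitions of the same transient and average them;
        # averaging suppresses the white noise while the repetitive signal remains.
        shots = decay(t, 1.0, true_tau) + rng.normal(0.0, 0.5, size=(2000, t.size))
        averaged = shots.mean(axis=0)

        (amp_fit, tau_fit), _ = curve_fit(decay, t, averaged, p0=(1.0, 3.0))
        print(f"fitted lifetime: {tau_fit:.2f} ns")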

    “Dad, yo soy una chica americana”: Migration, Identity and Language in Eduardo González Viaña's El Corrido de Dante

    The focus of this article is the representation of language and identity in Hispanic immigrant literature. It provides a framework for the analysis of linguistic and cultural constructions of migrant identities in literary texts, based on an exploration of the novel El Corrido de Dante by Eduardo González Viaña. The most significant finding is that González Viaña applies linguistic homogenization in order to stress a common Hispanic identity without effacing cultural, national and ethnic differences, as these are stylistically marked by means of strategic (re)creations of different varieties of Spanish and instances of code-switching between Spanish and English (Spanglish) that require no bilingual competence. The article also sheds light on three crucial language and identity conflicts in the novel: intergenerational conflicts between undocumented migrants and their U.S.-born children; conflicts between Chicano/as and “new” Latino/as; and asymmetric power relations between Hispanics and Anglo-Americans.

    Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora

    This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) manual annotation of a set of documents; (2) bootstrapping – active machine learning for the purpose of selecting which document to annotate next; (3) marking up the remaining unannotated documents of the original corpus using pre-tagging with revision. Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task, and as such require the context of a practical setting in order to be properly addressed. The emerging issues relate to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.
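
    The sketch below illustrates the flavour of the active selection step in phase two: a committee of named entity recognizers votes on each candidate document, and the document the committee disagrees on most is annotated next. Vote entropy over token-level predictions and the simple agreement threshold are illustrative stand-ins, not the disagreement measure or the intrinsic stopping criterion developed in the thesis.

        # A minimal sketch of committee-based selection of the next document to annotate.
        import numpy as np

        def vote_entropy(votes, n_labels):
            # votes: (n_committee, n_tokens) array of predicted label ids for one document.
            n_committee = votes.shape[0]
            entropies = []
            for token_votes in votes.T:
                counts = np.bincount(token_votes, minlength=n_labels) / n_committee
                nonzero = counts[counts > 0]
                entropies.append(-(nonzero * np.log(nonzero)).sum())
            return float(np.mean(entropies))

        def select_next_document(committee_votes_per_doc, n_labels, agreement_threshold=0.1):
            # Pick the unannotated document the committee disagrees on most; if even the
            # most contentious document falls below the threshold, signal that annotation
            # can stop (a simple stand-in for an intrinsic stopping criterion).
            scores = [vote_entropy(v, n_labels) for v in committee_votes_per_doc]
            best = int(np.argmax(scores))
            return best, scores[best] < agreement_threshold

        votes_per_doc = [np.array([[0, 1, 1], [0, 2, 1], [0, 1, 1]]),   # some disagreement
                         np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1]])]   # full agreement
        print(select_next_document(votes_per_doc, n_labels=3))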

    A literature survey of active machine learning in the context of natural language processing

    Active learning is a supervised machine learning technique in which the learner is in control of the data used for learning. That control is used by the learner to ask an oracle, typically a human with extensive knowledge of the domain at hand, about the classes of the instances for which the model learned so far makes unreliable predictions. The active learning process takes as input a set of labeled examples, as well as a larger set of unlabeled examples, and produces a classifier and a relatively small set of newly labeled data. The overall goal is to create as good a classifier as possible without having to mark up and supply the learner with more data than necessary. The learning process aims to keep the human annotation effort to a minimum, only asking for advice where the training utility of the result of such a query is high. Active learning has been successfully applied to a number of natural language processing tasks, such as information extraction, named entity recognition, text categorization, part-of-speech tagging, parsing, and word sense disambiguation. This report is a literature survey of active learning from the perspective of natural language processing.
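
    The generic loop described above can be sketched as follows, assuming scikit-learn; the classifier, the least-confidence query strategy and the oracle (here a lookup into held-back gold labels) are illustrative placeholders rather than a recommendation from the survey.

        # A minimal sketch of pool-based active learning: small labeled set in,
        # classifier plus a small set of newly labeled instances out.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def active_learning(X_lab, y_lab, X_pool, oracle, rounds=20):
            X_lab, y_lab = list(X_lab), list(y_lab)
            pool_idx = list(range(len(X_pool)))
            newly_labelled = []
            for _ in range(rounds):
                clf = LogisticRegression(max_iter=1000).fit(np.array(X_lab), np.array(y_lab))
                probs = clf.predict_proba(X_pool[pool_idx])
                query = pool_idx[int(np.argmin(probs.max(axis=1)))]   # least confident instance
                label = oracle(query)                                 # ask the human annotator
                X_lab.append(X_pool[query]); y_lab.append(label)
                newly_labelled.append((query, label))
                pool_idx.remove(query)
            return clf, newly_labelled

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 4)); y = (X[:, 0] > 0).astype(int)
        clf, labelled = active_learning(X[:20], y[:20], X[20:], oracle=lambda i: y[20 + i])
        print(f"queried {len(labelled)} instances from the pool")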

    “Tax Simplification”—Grave Threat to the Charitable Contribution Deduction: The Problem and a Proposed Solution

    The present National Administration has continued to support proposed legislative changes aimed at substantially reducing the number of income tax returns in which deductions are itemized. The author contends that these tax simplification proposals are incompatible with the preservation of the charitable contribution deduction and would undermine the position of voluntary charitable organizations by reducing the incentives for giving. He proposes a solution to this dilemma: promoting the charitable contribution deduction, with certain limitations, to the position of a deduction from gross income rather than a deduction from adjusted gross income.

    Non-Convex Rank/Sparsity Regularization and Local Minima

    This paper considers the problem of recovering either a low-rank matrix or a sparse vector from observations of linear combinations of the vector or matrix elements. Recent methods replace the non-convex regularization with $\ell_1$ or nuclear norm relaxations. It is well known that this approach can be guaranteed to recover a near-optimal solution if a so-called restricted isometry property (RIP) holds. On the other hand, it is also known to perform soft thresholding, which results in a shrinking bias that can degrade the solution. In this paper we study an alternative non-convex regularization term. This formulation does not penalize elements that are larger than a certain threshold, making it much less prone to small solutions. Our main theoretical results show that if a RIP holds then the stationary points are often well separated, in the sense that their differences must be of high cardinality/rank. Thus, with a suitable initial solution, the approach is unlikely to fall into a bad local minimum. Our numerical tests show that the approach is likely to converge to a better solution than the standard $\ell_1$/nuclear-norm relaxation even when starting from trivial initializations. In many cases our results can also be used to verify global optimality of our method.
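
    The sketch below illustrates the shrinking-bias point in terms of proximal operators: soft thresholding (the $\ell_1$ prox) shrinks every surviving element towards zero, while a hard-threshold-style operator leaves elements above the threshold untouched. It is a generic illustration of the contrast drawn in the abstract, not the specific regularizer analysed in the paper.

        # A minimal comparison of soft and hard thresholding on a small vector.
        import numpy as np

        def soft_threshold(x, lam):
            # Prox of lam*||x||_1: every surviving element is shrunk towards zero by lam.
            return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

        def hard_threshold(x, lam):
            # Hard-threshold-style operator: elements above the threshold pass through
            # unchanged, so large entries carry no bias.
            return np.where(np.abs(x) > lam, x, 0.0)

        x = np.array([-3.0, -0.4, 0.2, 1.5, 5.0])
        print(soft_threshold(x, 1.0))   # large entries shrunk: [-2, 0, 0, 0.5, 4]
        print(hard_threshold(x, 1.0))   # large entries kept:   [-3, 0, 0, 1.5, 5]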