132 research outputs found

    Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision

    Get PDF
    Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules and towards machine learning approaches. Still, recent machine learning approaches suffer from limited availability of annotated data, which is a serious shortcoming for building and maintaining large-scale NER systems.

    In this thesis, we present an NER system built with very little supervision. Human supervision is limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types.

    Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.

    We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances comparable to baseline supervised systems in the task of annotating NEs in texts.
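The acronym detection task mentioned in the abstract can be illustrated with a minimal sketch (an illustrative heuristic, not the thesis's actual algorithm): a parenthesized run of uppercase letters is accepted as an acronym when it matches the initials of the words immediately preceding it.

```python
import re

def find_acronym_pairs(text):
    """Return (expansion, acronym) pairs where a parenthesized uppercase
    token matches the initials of the words just before the parentheses."""
    pairs = []
    # A parenthesized run of 2+ uppercase letters is an acronym candidate.
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        n = len(acronym)
        if len(words) < n:
            continue
        candidate = words[-n:]  # the n words preceding the parentheses
        initials = "".join(w[0].upper() for w in candidate)
        if initials == acronym:
            pairs.append((" ".join(candidate), acronym))
    return pairs
```

A real alias-resolution system must also handle acronyms whose letters come from word-internal positions or skip stop words, which is what makes the problem difficult.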

    Emotion AI-Driven Sentiment Analysis: A Survey, Future Research Directions, and Open Issues

    Get PDF
    The essential use of natural language processing is to analyze the sentiment of the author via the context. This sentiment analysis (SA) is said to determine the exactness of the underlying emotion in the context. It has been used in several subject areas such as stock market prediction, social media data on product reviews, psychology, judiciary, forecasting, disease prediction, agriculture, etc. Many researchers have worked on these areas and have produced significant results. These outcomes are beneficial in their respective fields, as they help to understand the overall summary in a short time. Furthermore, SA helps in understanding actual feedback shared across different platforms such as Amazon, TripAdvisor, etc. The main objective of this thorough survey was to analyze some of the essential studies done so far and to provide an overview of SA models in the area of emotion AI-driven SA. In addition, this paper offers a review of ontology-based SA and lexicon-based SA, along with machine learning models that are used to analyze the sentiment of the given context. Furthermore, this work also discusses different neural network-based approaches for analyzing sentiment. Finally, these different approaches were also analyzed with sample data collected from Twitter. Among the four approaches considered in each domain, the aspect-based ontology method produced 83% accuracy among the ontology-based SAs, the term frequency approach produced 85% accuracy in the lexicon-based analysis, and the support vector machine-based approach achieved 90% accuracy among the other machine learning-based approaches. Funder: Ministry of Education (MOE), Taiwan.
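The term-frequency flavor of lexicon-based SA mentioned above can be sketched as follows. The lexicon here is a toy dictionary invented for illustration, not one from the survey: each message token is weighted by its frequency and by the polarity its lexicon entry carries.

```python
from collections import Counter

def lexicon_sentiment(tokens, lexicon):
    """Score a tokenized message by summing lexicon polarities,
    weighting each lexicon word by its frequency in the message."""
    counts = Counter(t.lower() for t in tokens)
    score = sum(counts[word] * polarity for word, polarity in lexicon.items())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, with a toy lexicon `{"good": 1, "great": 2, "bad": -1, "awful": -2}`, a message containing "good" twice and "bad" once scores +1 and is labeled positive.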

    Multifaceted Geotagging for Streaming News

    Get PDF
    News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called geotagging, and involves first finding in the document all textual references to geographic locations, known as toponyms, and second, assigning the correct lat/long values to each toponym, steps which are termed toponym recognition and toponym resolution, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging. To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques. First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning–based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed comma groups, whose toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. 
Understanding the local places known by an audience, termed its local lexicon, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed adaptive context, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing corpora as well as two newly created large corpora of streaming news, show great performance gains over several competing prominent geotagging methods.
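The comma-group idea can be sketched with a toy gazetteer. The field names, the data, and the "shared country" heuristic are illustrative assumptions, not the system described above: each toponym in the group is resolved to the interpretation whose country is shared by the most group members.

```python
from collections import Counter

def resolve_comma_group(toponyms, gazetteer):
    """Resolve a comma group of toponyms by preferring, for each member,
    the gazetteer interpretation whose country most members can share."""
    # Count how often each country can host some member of the group.
    countries = Counter(
        interp["country"]
        for name in toponyms
        for interp in gazetteer.get(name, [])
    )
    if not countries:
        return {}
    best_country = countries.most_common(1)[0][0]
    resolved = {}
    for name in toponyms:
        for interp in gazetteer.get(name, []):
            if interp["country"] == best_country:
                resolved[name] = interp
                break
    return resolved
```

With a toy gazetteer where "Paris" has both a French and a Texan interpretation, grouping it with "Springfield" (US-only) pulls "Paris" toward the US reading, which is the common thread the group provides.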

    Enhancing knowledge acquisition systems with user generated and crowdsourced resources

    Get PDF
    This thesis is about leveraging knowledge acquisition systems with collaborative data and crowdsourced work from the internet. We propose two strategies and apply them to building effective entity linking and question answering (QA) systems. The first strategy integrates an information extraction system with online collaborative knowledge bases, such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with the corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English knowledge base (KB), and to resolve the synonymy and polysemy of Chinese entities. To address those problems, we create a cross-lingual taxonomy and a Chinese KB. We investigate two methods of connecting the query representation with the KB representation. Based on our CLEL system's participation in the TAC KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy creates annotation for QA systems with the help of crowdsourcing, that is, distributing a task via the internet and recruiting many people to complete it simultaneously. Various kinds of annotated data are required to train the data-driven statistical machine learning algorithms for the underlying components of our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigates different statistical methods for enhancing the quality of crowdsourced annotation, and finally uses the enhanced annotation to train learning-to-rank models for passage ranking algorithms in QA.
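One of the simplest statistical methods for enhancing the quality of crowdsourced annotation is majority voting over redundant labels. A minimal sketch (the thesis investigates several methods; this is only the baseline idea):

```python
from collections import Counter, defaultdict

def majority_vote(annotations):
    """Aggregate crowdsourced (item, label) pairs per item by majority
    vote; each item keeps the label most of its annotators chose."""
    by_item = defaultdict(list)
    for item, label in annotations:
        by_item[item].append(label)
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in by_item.items()}
```

Redundancy is what makes this work: each micro-task is shown to several workers, and disagreement among them is averaged out before the labels are used for training.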

    New Perspectives in Critical Data Studies

    Get PDF
    This Open Access book examines the ambivalences of data power. Firstly, the ambivalences between global infrastructures and local invisibilities challenge the grand narrative of the ephemeral nature of a global data infrastructure. They make visible local working and living conditions, and the resources and arrangements required to operate and run them. Secondly, the book examines ambivalences between the state and data justice. It considers data justice in relation to state surveillance and data capitalism, and reflects on the ambivalences between an “entrepreneurial state” and a “welfare state”. Thirdly, the authors discuss ambivalences of everyday practices and collective action, in which civil society groups, communities, and movements try to position the interests of people against the “big players” in the tech industry. The book includes eighteen chapters that provide new and varied perspectives on the role of data and data infrastructures in our increasingly datafied societies

    Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu

    Get PDF
    University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages. Search is not a solved problem, even in the world of Google's and Bing's state-of-the-art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem: the terms in a document and in the user's information request do not overlap -- for example, "cars" and "automobiles". This phenomenon is called synonymy. Similarly, the user's term may be polysemous: a user inquiring about a river's bank is matched with documents about financial institutions. Vocabulary mismatch is exacerbated when the search occurs in a Morphologically Rich Language (MRL). Concept search techniques such as dimensionality reduction do not improve search in MRLs. Names frequently occur in news text and determine the "what," "where," "when," and "who" of the news. Named Entity Recognition (NER) attempts to recognize names automatically in text, but these techniques are far from mature in MRLs, especially in Arabic-script languages. Urdu is the focus MRL of this dissertation, alongside Arabic, Farsi, Hindi, and Russian, but it lacks the enabling technologies for NER and search. A corpus, a stop-word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search in Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, it highlights the challenges of research in low-resource MRLs.
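A light stemmer reduces vocabulary mismatch by stripping inflectional suffixes so that morphological variants share one index term. A minimal sketch with an illustrative English suffix list (the dissertation's Urdu stemmer would use Urdu suffixes and rules):

```python
def light_stem(word, suffixes=("ies", "ing", "ed", "es", "s")):
    """Strip the longest matching suffix once, keeping a stem of at
    least three characters; a crude light-stemming illustration."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

Because only a single suffix is stripped and no deeper morphological analysis is attempted, light stemming trades some accuracy for speed and robustness, which matters in low-resource settings.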

    Resolving Ambiguities in Confused Online Tamil Characters with Post Processing Algorithms

    No full text

    TextFrame: Cosmopolitanism and Non-Exclusively Anglophone Poetries

    Full text link
    This project proposes a replacement for some institutional-archival mechanisms of non-exclusively anglophone poetry as it is produced under racial capitalism and archived via its universities and grant-bearing nonprofits. The project argues specifically for the self-archiving of non-exclusively anglophone poetry, and by extension of poetry, in a manner that builds away from US-dominated, nationally-organized institutions. It argues that cosmopolitanist norm translation, as advocated by various critics, can function as part of a critique of institutional value creation used in maintaining inequalities through poetry. The US-based Poetry Foundation is currently the major online archive of contemporary anglophone poetry; the project comprises a series of related essays that culminate in a rough outline for a collaboratively designed, coded, and maintained application to replace the Foundation’s website. Whatever benefit might result, replacing archival mechanisms of racial capitalism while remaining within its systemic modes of value creation is at best a form of substitution: it is not an actual change in relations and not a transition to anything. Doing so may, however, allow greater clarity in understanding how poetry is situated within US-based institutions, beyond the images and values that poets and critics in the US often help to maintain. Chapter one, “‘Indianness’ and Omission: 60 Indian Poets,” reads the anthology 60 Indian Poets, published in 2008 in India and the UK (with US distribution), as an argument about the contours of Indian Poetry in English and about the contours of India’s relations in the world. It relates Rashmi Sadana’s work on the meanings of English in India to decisions made within the anthology, and looks further at Pollock’s conception of cosmopolitanism and vernacularity, particularly as it applies to the Indian North-East and the poetry of Kynpham Sing Nongkynrih.
The second chapter, “Archival Power: Individualization, the Racial State, and Institutional Poetry” engages Roderick Ferguson’s concept of archival power to explain the 2015 “crisis” within contemporary US poetry driven by practitioners of conceptual poetry, and an attempted archival act with regard to the Black Lives Matter movement. The chapter ends with a fragment of Alexis Pauline Gumbs’s recent account of US university life as experienced by Black artists and scholars. That chapter is followed by “The Poetry Foundation as Site of Archival Power,” which extends Jodi Melamed’s critique of US university value-creation mechanisms to Poetry magazine and the Poetry Foundation’s website. It argues that the Poetry Foundation functions as a de facto arm of the US university system as outlined in the previous chapter, and aids in capitalist value-creation. “TextFrame: An Open Archive for Poetry,” the fourth chapter, is an attempt to begin thinking a replacement for current mechanisms of archiving non-exclusively anglophone poetry. The fifth chapter, “Narayanan’s Language Events as Free-Tier Application,” documents work imagined for TextFrame, as an application, that has actually already been built: the poet and scholar Vivek Narayanan adapted Robert Desnos’s Language Events for the classroom using a variety of discrete free services, and the present author collaborated with Narayanan in creating a stand-alone Web application. Chapters six, seven, and eight function as case studies to be used in creating templates for providing context to specific poems within any built application. Both of the specific moments covered transmogrify the “anti-psychological.” The sixth chapter, “An Unendurable Age: Ashbery, O’Hara, and 1950s Precursors of ‘Self’ Psychology” thus argues that an anti-psychological ethos is developed in Ashbery and O’Hara’s poems of that moment. 
It shows that Frank O’Hara’s “Personism: A Manifesto” (1959) is almost certainly a parody of Gordon Allport’s theory of Personalism, of related strands of 1950s American psychology, and of the poetry that developed alongside them in the 1930s. It follows other critics in looking at midcentury conceptions of schizophrenia as a specifically homosexual disease, and argues for the importance of contemporarily published examples of schizophrenic discourse, particularly those of Harry Stack Sullivan. It argues that Ashbery’s poem “A Boy” can be read as directly engaging those ideas, and opposing them. The shorter discussions that follow consider the affinities that Some Trees has with anti- or a-psychological theories of mind that were being developed at Harvard and MIT at the time that Ashbery and O’Hara were in Cambridge, including generative grammar and critiques of philosophical analyticity. The eighth chapter, “Before Conceptualism: Disgust and Over-determination in White-dominated Experimental Poetry in New York, 1999-2003,” highlights Dan Farrell and Lytle Shaw’s very different uses of lyric’s peculiar staging of voice to foreground the multi-furcation of white identities and voice in response to state pressures. The last two chapters take up two corollaries, or theoretical concerns that fell out of trying to think a cosmopolitanist application. The first, “Why Not Reddit?” examines existing commercial cosmopolitanist solutions for some of the functionality proposed for the application, and reasons for rejecting them. In doing so, it discusses Thomas Farrell’s construct of “rhetorical culture” in detail, and traces a theory of communication and authorship within a community, particularly with regard to thinking history. The last chapter (and second corollary) is titled “Ethos in Pedagogy as a Limit on Norm Translation.” It establishes the Aristotelian concept of ethos as a pedagogical limit for norm translation.
The study’s governing interest is not the conflicts or differences between practitioners or tendencies that are detailed here, but the relative incomprehensibility of those differences outside of their formative contexts.

    Multilingual sentiment analysis in social media.

    Get PDF
    252 p. This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system, including Basque, English, French, and Spanish. The thesis addresses the following challenges to build such a system:
    - Analysing methods for creating sentiment lexicons, suitable for less-resourced languages.
    - Analysis of social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions from such messages. Language identification and microtext normalization are addressed.
    - Researching the state of the art in polarity classification, and developing a supervised classifier that is tested against well-known social media benchmarks.
    - Developing a social media monitor capable of analysing sentiment with respect to specific events, products, or organizations.
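Microtext normalization for tweets can be sketched as a few regular-expression passes. The rules below are illustrative assumptions, not the thesis's pipeline: strip URLs and user mentions, squeeze character elongations, and lowercase.

```python
import re

def normalize_microtext(text):
    """Crude tweet normalization: remove URLs and @mentions, squeeze
    character elongations ('sooooo' -> 'soo'), and lowercase."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # 3+ repeats -> 2
    return " ".join(text.lower().split())       # collapse whitespace
```

Keeping two repeated characters rather than one preserves a trace of the emphasis ("soo" vs. "so"), which some polarity classifiers exploit as a feature.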
