Adaptive text mining: Inferring structure from sequences

Author: Witten Ian H.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2004

Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.

Research Commons@Waikato

Enhanced ontology-based text classification algorithm for structurally organized documents

Author: Oleiwi Suha Sahib
Publication date: 01/01/2015

Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to the type of tag given in advance. Most TC algorithms represent a document by its terms and do not consider the relations among those terms. These algorithms represent documents in a space where every word is assumed to be a dimension. As a result, such representations are high-dimensional, which degrades classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating suitable feature vectors and reducing the dimensionality of the data, which will enhance classification accuracy. This research combines ontology and text representation for classification by developing five algorithms. The first and second algorithms, namely Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent the document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify the document into its related set of classes. The proposed algorithms were tested on five different scientific paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed algorithms CFV_TC and SFV_TC showed better average precision, recall, f-measure and accuracy than the SVM and RSS approaches. The work in this study contributes to the exploration of related documents in information retrieval and text mining research by using ontology in TC.

Universiti Utara Malaysia: UUM eTheses

From Frequency to Meaning: Vector Space Models of Semantics

Author: Pantel Patrick
Turney Peter D.
Publication venue: 'AI Access Foundation'
Publication date: 01/01/2010

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

Crossref

Towards Automated Related Work Summarization

Author: HOANG CONG DUY VU
Publication date: 27/12/2010

Master's (Master of Science)

ScholarBank@NUS

Republican Monsters: The Cultural Construction of American Positivist Criminology, 1767-1920

Author: Burton Chase Smith
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

This dissertation examines the history of and cultural influences on positivist criminology in the United States. From Benjamin Rush to the present day, the U.S. has produced an extensive corpus of empirical and theoretical studies that seeks to discern an objective, scientifically-grounded basis for criminal behavior. American positivist criminology has drawn on numerous subfields and theories, including rational choice / economic theory, biology, and psychology, but in all cases, maintains that a purely scientific explanation of offending is possible. This study proceeds from the perspective that divisions between scientific and non-scientific thought are untenable. Drawing on scholarship in literary criticism and sociology, I argue that positivist criminology confronts an inherent contradiction in purporting to develop a purely scientific account of phenomena that are defined by the moral and cultural sentiments of a society. I thus hypothesize that positivist criminology is in fact reliant on the irrational and fictive cultural tropes and images of crime that it claims to exorcize. The dissertation proceeds by reviewing the literature on the history of criminology, developing a set of functional types or tropes for character analysis, and then examining four separate periods in the development of scientific criminology: eighteenth century studies of rational action, nineteenth century studies of defective reasoning, early twentieth century studies of race and crime, and the development of scientifically informed criminalistics programs. Each of these cases captures a different period and focus in the development of scientific criminology. In threading continuity between these cases, I show how criminological positivism is consistently reliant on culturally informed tropes and characters to render itself sensible and coherent

eScholarship - University of California

Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004

Author: Armstrong Susan
Boitet Christian
Popescu-Belis Andrei
Sérasset Gilles
Tufis Dan
Publication venue: COLING
Publication date: 01/01/2004

In an ever expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 90's, most efforts in scaling up these resources remain the responsibility of the local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop and in other workshops and conferences dedicated to similar topics proves that dealing with multilingual linguistic resources has become a very hot problem in the Natural Language Processing community.
To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange, compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants

Hal - Université Grenoble Alpes

Hal-Diderot

Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

Author: EHRMANN MAUD
TURCHI MARCO
Publication venue: Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional, Mexico
Publication date: 09/08/2011

The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

JRC Publications Repository

‘Super disabilities’ vs ‘Disabilities’?: Theorizing the role of ableism in (mis)representational mythology of disability in the marketplace

Author: Addelson Kathryn Pyne.
Berger Peter L.
Campbell Joseph.
Downey Hilary
Eva Kipnis
Figueroa Peter.
Foucault Michel.
Goffman Erving.
Guttmann Ludwig.
Ian Brittain
Kaufman-Scarborough Carol.
Kvale Steinar.
Mahtani Minelle.
Ramirez Berg C.
Roman Ediberto.
Shauna Kearney
Shildrick Tracy.
Wendell Susan.
Whetten David A.
Worrell Tracy.
Publication venue: 'Informa UK Limited'
Publication date: 11/01/2019

People with disabilities (PWD) constitute one of the largest minority groups, with one in five people worldwide having a disability. While recognition and inclusion of this group in the marketplace has seen improvement, the effects of (mis)representation of PWD in shaping the discourse on fostering marketplace inclusion of socially marginalized consumers remain little understood. Although the effects of misrepresentation (e.g., idealized, exoticized or selective representation) on inclusion/exclusion perceptions and cognitions have received attention in the context of ethnic/racial groups, the world of disability has been largely neglected. By extending the theory of ableism into the context of PWD representation and applying it to the analysis of the We’re the Superhumans advertisement developed for the Rio 2016 Paralympic Games, this paper examines the relationship between the (mis)representation and the inclusion/exclusion discourse. By uncovering that PWD misrepresentations can partially mask and/or redress the root causes of exclusion experienced by PWD in their lived realities, it contributes to the research agenda on the transformative role of consumption cultures perpetuating harmful, exclusionary social perceptions of marginalized groups versus contributing to advancement of their inclusion.

Crossref

Coventry University Pure Portal

White Rose Research Online