
    Coherence Identification of Business Documents: Towards an Automated Message Processing System

    This paper describes our recent efforts in developing a text segmentation technique for our business document management system. The document analysis rests upon knowledge-based analysis of the documents' contents, automating the coherence identification process without a full semantic understanding. In this technique, document boundaries are identified by observing the shifts of segments from one cluster to another. Our experimental results show that the combination of heterogeneous knowledge is capable of addressing topic shifts. Given the increasing recognition of document structure in the fields of information retrieval and knowledge management, this approach provides a quantitative model and automatic classification of documents in a business document management system. This will be beneficial to the distribution of documents and the automatic launching of business processes in a workflow management system.
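    The abstract gives no implementation details, but the core idea — marking a boundary where adjacent segments fall into different clusters — can be sketched, purely as an illustration, by flagging a boundary wherever lexical similarity between neighbouring segments drops below a threshold. All names, the toy sentences, and the threshold below are hypothetical, not the paper's system:

    ```python
    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two bag-of-words Counters.
        num = sum(a[w] * b[w] for w in a if w in b)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    def segment_boundaries(sentences, threshold=0.1):
        # Mark a boundary wherever similarity between adjacent segments
        # collapses -- a crude stand-in for a "shift between clusters".
        vecs = [Counter(s.lower().split()) for s in sentences]
        return [i + 1 for i in range(len(vecs) - 1)
                if cosine(vecs[i], vecs[i + 1]) < threshold]

    docs = ["invoice total amount due",
            "payment of the invoice amount",
            "meeting agenda for monday",
            "agenda items for the meeting"]
    print(segment_boundaries(docs))  # boundary between topic blocks
    ```

    A real system would cluster segments globally rather than compare only neighbours, but the boundary-as-shift intuition is the same.
    
    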

    Designing Statistical Language Learners: Experiments on Noun Compounds

    The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: (i) it identifies a new class of designs by specifying an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements; and (ii) it explores the development of a mathematical theory to predict the expected accuracy of statistical language learning systems in terms of the volume of data used to train them. The theoretical work is illustrated by applying statistical language learning designs to the analysis of noun compounds. Both syntactic and semantic analysis of noun compounds are attempted using the proposed architecture. Empirical comparisons demonstrate that the proposed syntactic model is significantly better than those previously suggested, approaching the performance of human judges on the same task, and that the proposed semantic model, the first statistical approach to this problem, exhibits significantly better accuracy than the baseline strategy. These results suggest that the new class of designs identified is a promising one. The experiments also serve to highlight the need for a widely applicable theory of data requirements. Comment: PhD thesis (Macquarie University, Sydney; December 1995), LaTeX source, xii+214 pages.
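    As a purely illustrative sketch of the kind of statistical bracketing decision studied for noun compounds, a dependency-style model for a three-word compound compares how strongly the first word associates with each of the other two. The counts and helper names here are invented for the example and are not the thesis's actual model:

    ```python
    def bracket(compound, count):
        # For (w1, w2, w3): left-branching [[w1 w2] w3] if w1 associates
        # more strongly with w2; otherwise right-branching [w1 [w2 w3]].
        w1, w2, w3 = compound
        return "left" if count(w1, w2) >= count(w1, w3) else "right"

    # Toy corpus statistics (hypothetical modifier-head counts).
    counts = {("plastic", "bottle"): 8,
              ("plastic", "water"): 1,
              ("water", "bottle"): 12}
    count = lambda a, b: counts.get((a, b), 0)

    print(bracket(("plastic", "water", "bottle"), count))
    ```

    With these toy counts the model prefers the right-branching reading [plastic [water bottle]], matching intuition.
    
    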

    Word sense discovery and disambiguation

    The work is based on the assumption that words with similar syntactic usage have similar meaning, which was proposed by Zellig S. Harris (1954, 1968). We study his assumption from two aspects: firstly, different meanings (word senses) of a word should manifest themselves in different usages (contexts), and secondly, similar usages (contexts) should lead to similar meanings (word senses). If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labeling contexts in an unsupervised or weakly supervised manner (Publications 1, 2 and 3). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses. If we start with words in similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or multilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Publication 4). In the multilingual material, we find translations by supervised learning of transliterations (Publication 5). In both the monolingual and multilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets. In this introduction to the publications of the thesis, we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it is modeled. We also consider how to define, collect and represent contexts. We discuss how to evaluate the trained context classifiers and discovered word sense classifications, and finally we present the word sense discovery and disambiguation methods of the publications.
This work supports Harris' hypothesis by implementing three new methods modeled on his hypothesis. The methods have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes. Keywords: Word senses, Context, Evaluation, Word sense disambiguation, Word sense discovery
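    A minimal sketch of the weakly supervised setting described above — assigning each context of an ambiguous word to a sense by overlap with seed cue words — might look like the following. The seed sets, example sentences, and function names are hypothetical illustrations, not the thesis's methods:

    ```python
    def induce_senses(contexts, seeds):
        # Label each context with the seed sense whose cue words
        # it shares the most tokens with (weak supervision).
        labels = []
        for ctx in contexts:
            words = set(ctx.lower().split())
            best = max(seeds, key=lambda s: len(words & seeds[s]))
            labels.append(best)
        return labels

    # Hypothetical seed cue words for two senses of "bank".
    seeds = {"finance": {"money", "deposit", "loan"},
             "river": {"water", "shore", "fishing"}}
    contexts = ["deposit money at the bank",
                "fishing from the bank of the river"]
    print(induce_senses(contexts, seeds))
    ```

    Fully unsupervised variants would cluster the context vectors directly instead of relying on seeds, which is the harder setting the thesis also studies.
    
    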

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
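    The CHEMDNER systems surveyed in the Review are statistical taggers; purely as a toy illustration of what chemical entity recognition means as a task, a naive suffix-based matcher can be sketched as follows. The suffix list and pattern are assumptions for the example, not any real system's rules:

    ```python
    import re

    # Common chemical-name suffixes (an illustrative, incomplete list).
    SUFFIXES = r"(?:ol|ide|ane|ene|yne|ate|ine|one|acid)"
    PATTERN = re.compile(rf"\b[A-Za-z][a-z]*{SUFFIXES}\b")

    def find_chemicals(text):
        # Return tokens that merely *look* chemical by suffix; real
        # CHEMDNER systems use trained sequence taggers instead.
        return [m.group() for m in PATTERN.finditer(text)]

    print(find_chemicals("Ethanol and sodium chloride dissolve; acetone does not."))
    ```

    The sketch already shows the task's pitfalls: it misses "sodium" and would happily tag ordinary English words ending in the listed suffixes, which is why the surveyed challenge systems rely on machine learning rather than patterns.
    
    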

    Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population

    2013 - 2014
    The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method combines distributional semantic approaches and NL formalization theories in order to develop a framework that relies upon deep linguistic analysis... [edited by author]

    Can human association norms evaluate latent semantic analysis?

    This paper presents a comparison of word association norms created in a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm, and present an argument on how those automatically generated lists reflect real semantic relations.
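    Church and Hanks' association ratio is based on pointwise mutual information between co-occurring words. A compact sketch of the idea, with an invented window size and toy token stream (not the paper's corpus or exact formulation):

    ```python
    import math
    from collections import Counter

    def association_scores(tokens, window=2):
        # PMI between ordered word pairs co-occurring within a small
        # window, in the spirit of Church & Hanks' association ratio.
        unigrams = Counter(tokens)
        pairs = Counter()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                pairs[(w, v)] += 1
        n = len(tokens)
        return {(w, v): math.log2(c * n / (unigrams[w] * unigrams[v]))
                for (w, v), c in pairs.items()}

    tokens = "strong tea strong coffee weak tea".split()
    scores = association_scores(tokens)
    ```

    Rarer pairings that always co-occur score higher than frequent words that co-occur by chance, which is exactly the property that makes PMI-style lists comparable to human association norms.
    
    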

    Automated methods for describing the structure of scholarly text and the relation of selected elements to text quality

    Universal Semantic Language (USL) is a semi-formalized approach to the description of knowledge (a knowledge representation tool). The idea of USL was introduced by Vladimír Smetáček in the 1980s in the system called SEMAN (Universal Semantic Analyser), which was used for keyword extraction tasks in the former Information Centre of the Czechoslovak Republic. However, owing to the dissolution of the centre in the early 1990s, the system has been lost. This thesis reintroduces the idea of USL in the new context of quantitative content analysis. First we introduce the historical background and the problems of semantics and knowledge representation: semes, semantic fields, semantic primes and universals. The basic methodology of content analysis studies is illustrated on the example of three content analysis tools, and we describe the architecture of a new system. The application was built specifically for USL discovery, but it can also work in the context of classical content analysis. It contains Natural Language Processing (NLP) components and employs an algorithm for collocation discovery adapted to searching for co-occurrences between semantic annotations. The software is evaluated by comparing its pattern matching mechanism against another existing and established extractor. The semantic translation mechanism is evaluated in the task of...
    Institute of Information Studies and Librarianship, Faculty of Arts

    Word Knowledge and Word Usage

    Word storage and processing define a multi-factorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies and requires the synergic integration of a wide range of methods, techniques, and empirical and experimental findings. The present book approaches a few central issues concerning the organization, structure and functioning of the Mental Lexicon by asking domain experts to look at common, central topics from complementary standpoints and to discuss the advantages of developing converging perspectives. The book explores the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information-theoretic measures of word families, statistical correlations across psycholinguistic and cognitive evidence, principles of machine learning, and integrative brain models of word storage and processing. The main goal of the book is to map out the landscape of future research in this area, to foster the development of interdisciplinary curricula, and to help single-domain specialists understand and address issues and questions as they are raised in other disciplines.