169 research outputs found

Unification-Based Glossing

Authors: Hatzivassiloglou, Vasileios; Knight, Kevin
Publication date: 01/01/1995

We present an approach to syntax-based machine translation that combines unification-style interpretation with statistical processing. This approach enables us to translate any Japanese newspaper article into English, with quality far better than a word-for-word translation. Novel ideas include the use of feature structures to encode word lattices and the use of unification to compose and manipulate lattices. Unification also allows us to specify abstract features that delay target-language synthesis until enough source-language information is assembled. Our statistical component enables us to search efficiently among competing translations and locate those with high English fluency.

Comment: 8 pages, compressed and uuencoded PostScript. To appear: IJCAI-9

arXiv.org e-Print Archive

CiteSeerX

A Formal Model for Information Selection in Multi-Sentence Text Extraction

Authors: Filatova, Elena; Hatzivassiloglou, Vasileios
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2004

Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units with an associated mapping between these two dimensions. This representation is then used to describe the task of selecting textual units for a summary or answer as a formal optimization task. We provide approximation algorithms and empirically validate the performance of the proposed model when used with two very different sets of features: words and atomic events.
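
The selection task described in this abstract can be illustrated with a small sketch. The abstract does not say which approximation algorithms the authors use, so the greedy weighted-coverage heuristic below is an assumption chosen for illustration, and the names `greedy_select`, `unit_concepts`, and `concept_weights` are hypothetical.

```python
def greedy_select(unit_concepts, concept_weights, budget):
    """Pick up to `budget` textual units, greedily maximizing the total
    weight of conceptual units covered (illustrative heuristic only)."""
    selected = []
    covered = set()
    candidates = dict(unit_concepts)  # copy so the caller's mapping is untouched
    for _ in range(budget):
        best_unit, best_gain = None, 0.0
        for unit, concepts in candidates.items():
            # Only concepts not yet covered add value, which penalizes repetition.
            gain = sum(concept_weights.get(c, 0.0) for c in set(concepts) - covered)
            if gain > best_gain:
                best_unit, best_gain = unit, gain
        if best_unit is None:  # no remaining unit adds new concepts
            break
        selected.append(best_unit)
        covered |= set(candidates.pop(best_unit))
    return selected, covered


# Toy data (invented): s1 and s2 repeat the same concepts, s3 adds a new one.
units = {"s1": {"a", "b"}, "s2": {"a", "b"}, "s3": {"c"}}
weights = {"a": 3.0, "b": 2.0, "c": 1.0}
chosen, covered = greedy_select(units, weights, budget=2)
```

In the toy data, the second pick skips the redundant unit `s2` in favor of `s3`, because only concepts that are not yet covered contribute to a unit's gain.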

Crossref

Columbia University Academic Commons

Learning Anchor Verbs for Biological Interaction Patterns from Published Text Articles

Authors: Hatzivassiloglou, Vasileios; Weng, Wubin
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2002

Much of the knowledge modeling in the molecular biology domain involves interactions between proteins, genes, various forms of RNA, small molecules, etc. Interactions between these substances are typically extracted and codified manually, increasing the cost and time for modeling and substantially limiting the coverage of the resulting knowledge base. In this paper, we describe an automatic system that learns interaction verbs from text; these verbs can then form the core of automatically retrieved patterns which model classes of biological interactions. We investigate text features relating verbs with genes and proteins, and apply statistical tests and a logistic regression model to determine whether a given verb belongs to the class of interaction verbs. Our system, AVAD, achieves over 87% precision and 82% recall when tested on an 11-million-word corpus of journal articles. In addition, we compare the automatically obtained results with a manually constructed database of interaction verbs and show that the automatic approach can significantly enrich the manual list by detecting rarer interaction verbs that were omitted from the database.

CiteSeerX

Columbia University Academic Commons

Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences

Authors: Hatzivassiloglou, Vasileios; Yu, Hong
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2003

Opinion question answering is a challenging task for natural language processing. In this paper, we discuss a necessary component for an opinion question answering system: separating opinions from facts, at both the document and sentence level. We present a Bayesian classifier for discriminating between documents with a preponderance of opinions, such as editorials, and regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. We also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. Results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy).

CiteSeerX

Columbia University Academic Commons

Text-based approaches for non-topical image categorization

Authors: Sable, Carl L.; Hatzivassiloglou, Vasileios
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2000

The rapid expansion of multimedia digital collections brings to the fore the need for classifying not only text documents but their embedded non-textual parts as well. We propose a model for basing classification of multimedia on broad, non-topical features, and show how information on targeted nearby pieces of text can be used to effectively classify photographs on a first such feature, distinguishing between indoor and outdoor images. We examine several variations of a TF*IDF-based approach for this task, empirically analyze their effects, and evaluate our system on a large collection of images from current news newsgroups. In addition, we investigate alternative classification and evaluation methods, and the effects that secondary features have on indoor/outdoor classification. Using density estimation over the raw TF*IDF values, we obtain a classification accuracy of 82%, a number that outperforms baseline estimates and earlier, image-based approaches, at least in the domain of news articles, and that nears the accuracy of humans who perform the same task with access to comparable information.

Columbia University Academic Commons

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

Translating Collocations for Bilingual Lexicons: A Statistical Approach

Authors: Hatzivassiloglou, Vasileios; McKeown, Kathleen; Smadja, Frank
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1996

Collocations are notoriously difficult for non-native speakers to translate, primarily because they are opaque and cannot be translated on a word-by-word basis. We describe a program named Champollion which, given a pair of parallel corpora in two different languages and a list of collocations in one of them, automatically produces their translations. Our goal is to provide a tool for compiling bilingual lexical information above the word level in multiple languages, for different domains. The algorithm we use is based on statistical methods and produces p-word translations of n-word collocations in which n and p need not be the same. For example, Champollion translates "make...decision", "employment equity", and "stock market" into "prendre...décision", "équité en matière d'emploi", and "bourse" respectively. Testing Champollion on three years' worth of the Hansards corpus yielded the French translations of 300 collocations for each year, evaluated at 73% accuracy on average. In this paper, we describe the statistical measures used, the algorithm, and the implementation of Champollion, presenting our results and evaluation.
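
As a rough illustration of how a statistical association measure can link a source collocation to candidate target words in a sentence-aligned corpus, the sketch below scores each target word with the Dice coefficient. The abstract does not name the measures used, so the choice of Dice here is an assumption; the function `dice_scores` and the toy corpus are hypothetical and are not Champollion's implementation.

```python
from collections import Counter


def dice_scores(aligned_pairs, source_collocation):
    """Score each target word by its Dice coefficient with a source collocation.

    aligned_pairs: iterable of (source_tokens, target_tokens) sentence pairs.
    Dice(X, Y) = 2 * freq(X and Y) / (freq(X) + freq(Y)), counted per sentence pair.
    """
    source_words = set(source_collocation)
    source_count = 0          # pairs whose source side contains the whole collocation
    target_counts = Counter() # pairs whose target side contains each word
    joint_counts = Counter()  # pairs containing both the collocation and the word
    for source_sentence, target_sentence in aligned_pairs:
        target_words = set(target_sentence)
        target_counts.update(target_words)
        if source_words <= set(source_sentence):
            source_count += 1
            joint_counts.update(target_words)
    if source_count == 0:
        return {}
    return {
        word: 2.0 * joint / (source_count + target_counts[word])
        for word, joint in joint_counts.items()
    }


# Toy sentence-aligned corpus (invented for illustration).
pairs = [
    (["the", "stock", "market", "fell"], ["la", "bourse", "chute"]),
    (["the", "stock", "market", "rose"], ["la", "bourse", "monte"]),
    (["the", "house", "is", "open"], ["la", "maison", "est", "ouverte"]),
]
scores = dice_scores(pairs, ["stock", "market"])
best = max(scores, key=scores.get)
```

A frequent function word such as "la" co-occurs with the collocation but also appears elsewhere, so its score is lower than that of a word that appears only alongside the collocation.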

CiteSeerX

Columbia University Academic Commons

Using Density Estimation to Improve Text Categorization

Authors: Hatzivassiloglou, Vasileios; McKeown, Kathleen; Sable, Carl
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2002

This paper explores the use of a statistical technique known as density estimation to potentially improve the results of text categorization systems that label documents by computing similarities between documents and categories. In addition to potentially improving a system's overall accuracy, density estimation converts similarity scores to probabilities. These probabilities provide easily interpretable confidence measures for a system's predictions and could help to combine the results of various systems. We discuss the results of three complete experiments on three separate data sets applying density estimation to the results of a TF*IDF/Rocchio system, and we compare these results to those of many competing approaches.
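
One way to picture the conversion of similarity scores into probabilities is sketched below. It assumes Gaussian kernel density estimates of the scores of in-category and out-of-category training documents, combined with Bayes' rule; the abstract does not specify the estimator, so the kernel, the bandwidth, and the names `gaussian_kde` and `score_to_probability` are illustrative assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def score_to_probability(pos_scores, neg_scores, bandwidth=0.1):
    """Map a similarity score to P(document belongs to the category | score).

    pos_scores / neg_scores: similarity scores of training documents that do /
    do not belong to the category (assumed inputs for this sketch).
    """
    pos_density = gaussian_kde(pos_scores, bandwidth)
    neg_density = gaussian_kde(neg_scores, bandwidth)
    total = len(pos_scores) + len(neg_scores)
    prior_pos = len(pos_scores) / total
    prior_neg = len(neg_scores) / total

    def probability(score):
        p = pos_density(score) * prior_pos
        q = neg_density(score) * prior_neg
        if p + q == 0.0:  # score far from all training data: fall back to the prior
            return prior_pos
        return p / (p + q)  # Bayes' rule

    return probability


# Toy training scores (invented for illustration).
to_prob = score_to_probability([0.7, 0.8, 0.9], [0.1, 0.2, 0.3])
confidence = to_prob(0.85)
```

Because the output is a probability rather than a raw similarity, scores from systems with different scales become directly comparable, which is the property the abstract points to for combining systems.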

Columbia University Academic Commons

Filling Knowledge Gaps in a Broad-Coverage Machine Translation System

Authors: Chander, Ishwar; Haines, Matthew; Hatzivassiloglou, Vasileios; Hovy, Eduard; Iida, Masayo; Knight, Kevin; Luk, Steve K.; Whitney, Richard; Yamada, Kenji
Publication date: 01/01/1995

Knowledge-based machine translation (KBMT) techniques yield high quality in domains with detailed semantic models, limited vocabulary, and controlled input grammar. Scaling up along these dimensions means acquiring large knowledge resources. It also means behaving reasonably when definitive knowledge is not yet available. This paper describes how we can fill various KBMT knowledge gaps, often using robust statistical techniques. We describe quantitative and qualitative results from JAPANGLOSS, a broad-coverage Japanese-English MT system.

Comment: 7 pages, compressed and uuencoded PostScript. To appear: IJCAI-9

arXiv.org e-Print Archive

CiteSeerX

Generation and Evaluation of Intraoperative Inferences for Automated Health Care Briefings on Patient Status After Bypass Surgery

Authors: Concepcion, Kristian; Feiner, Steven; Hatzivassiloglou, Vasileios; Jordan, Desmond; McKeown, Kathleen
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2001

The authors present a system that scans electronic records from cardiac surgery and uses inference rules to identify and classify abnormal events (e.g., hypertension) that may occur during critical surgical points (e.g., start of bypass). This vital information is used as the content of automatically generated briefings designed by MAGIC, a multimedia system that they are developing to brief intensive care unit clinicians on patient status after cardiac surgery. By recognizing patterns in the patient record, inferences concisely summarize detailed patient data.

Columbia University Academic Commons

PubMed Central

Automatically Identifying Gene/Protein Terms in MEDLINE Abstracts

Authors: Hatzivassiloglou, Vasileios; Rzhetsky, Andrey; Wilbur, W John; Yu, Hong
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 31/10/2002

Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein–protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for “gene/protein-full name mark up”), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the mark-up process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation. Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts. Availability. A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/

CiteSeerX

Elsevier - Publisher Connector

Columbia University Academic Commons