10 research outputs found
Resource Generation from Structured Documents for Low-density Languages
The availability and use of electronic resources for both manual and automated language-related processing has increased tremendously in recent years. Nevertheless, many resources still exist only in printed form, which restricts their availability and use. This holds especially true for low-density languages, that is, languages with limited electronic resources. For such documents, automated conversion into electronic resources is highly desirable.
This thesis focuses on the semi-automated conversion of printed structured documents (dictionaries in particular) into usable electronic representations. In the first part we present an entry tagging system that recognizes, parses, and tags the entries of a printed dictionary to reproduce their representation electronically. The system exploits the consistent layout and structure of the dictionaries, and the features that impose this structure, to capture and recover lexicographic information. We accomplish this by adapting two methods: one rule-based and one HMM-based. The system is designed to produce results quickly, with minimal human assistance and reasonable accuracy. Using adaptive transformation-based learning as a post-processor at two points in the system yields significant improvements, even with an extremely small amount of user-provided training data.
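The transformation-based post-processing idea can be sketched as a toy Brill-style learner: repeatedly pick the retagging rule with the best net error reduction on a small amount of corrected data, apply it, and repeat. This is a minimal illustration under our own simplifying assumptions (single-token rule templates, invented label names), not the thesis's implementation:

```python
from collections import Counter

def learn_tbl_rules(tokens, predicted, gold, max_rules=5):
    """Greedy transformation-based learning: each rule is a triple
    (token, from_tag, to_tag) meaning "retag this token from from_tag
    to to_tag". Returns the learned rules and the corrected tags."""
    tags, rules = list(predicted), []
    for _ in range(max_rules):
        net = Counter()
        for tok, p, g in zip(tokens, tags, gold):
            for to in set(gold):
                if to == p:
                    continue
                # the rule fixes this position if gold == to, and breaks
                # it if the current tag was already correct (gold == p)
                net[(tok, p, to)] += 1 if g == to else (-1 if g == p else 0)
        if not net:
            break
        (tok, frm, to), gain = net.most_common(1)[0]
        if gain <= 0:  # no rule improves the tagging any further
            break
        rules.append((tok, frm, to))
        tags = [to if (t == tok and p == frm) else p
                for t, p in zip(tokens, tags)]
    return rules, tags
```

With four tokens where two part-of-speech markers were mistagged, the learner recovers one rule per marker and reproduces the gold tags.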
The second part of this thesis presents Morphology Induction from Noisy Data (MIND), a natural language morphology discovery framework that operates on the limited, noisy data obtained from the conversion process. To use the resulting resources effectively, however, users must be able to search for entries using the root form of a morphologically deformed variant found in the text. Stemming and data-driven methods are not suitable when data are sparse. Our approach is based on a novel application of string-searching algorithms. The evaluations show that MIND can segment words into roots and affixes from the noisy, limited data contained in a dictionary, and that it can extract prefixes, suffixes, circumfixes, and infixes. MIND can also identify morphophonemic changes, i.e., phonemic variations between allomorphs of a morpheme, specifically point-of-affixation stem changes. This, in turn, allows non-native speakers to perform multilingual tasks in applications where responses must be rapid and their language knowledge is limited. In addition, this analysis can feed other natural language processing tools requiring lexicons.
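A minimal sketch of the string-matching idea behind root/affix segmentation: align a variant with its candidate root via their longest common substring and read off the surrounding material as affixes. The function names and examples are ours, and this toy version handles only contiguous stems, whereas MIND's full algorithm also recovers infixes and point-of-affixation stem changes:

```python
def longest_common_substring(root, variant):
    """Dynamic-programming longest common substring.
    Returns (substring, start_index_in_variant)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(variant) + 1)
    for ch in root:
        cur = [0] * (len(variant) + 1)
        for j in range(1, len(variant) + 1):
            if ch == variant[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], j
        prev = cur
    return variant[best_end - best_len:best_end], best_end - best_len

def extract_affixes(root, variant):
    """Split a variant around its longest match with the root,
    yielding (prefix, stem, suffix)."""
    stem, start = longest_common_substring(root, variant)
    return variant[:start], stem, variant[start + len(stem):]
```

For example, extract_affixes("play", "replaying") yields ("re", "play", "ing"), and a contiguous-stem circumfix such as Indonesian ke-...-an falls out the same way.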
Context-Based Quotation Recommendation
While composing a new document, anything from a news article to an email or
essay, authors often utilize direct quotes from a variety of sources. Although
an author may know what point they would like to make, selecting an appropriate
quote for the specific context may be time-consuming and difficult. We
therefore propose a novel context-aware quote recommendation system which
utilizes the content an author has already written to generate a ranked list of
quotable paragraphs and spans of tokens from a given source document.
We approach quote recommendation as a variant of open-domain question
answering and adapt the state-of-the-art BERT-based methods from open-QA to our
task. We conduct experiments on a collection of speech transcripts and
associated news articles, evaluating models' paragraph ranking and span
prediction performances. Our experiments confirm the strong performance of
BERT-based methods on this task, which outperform bag-of-words and neural
ranking baselines by more than 30% relative across all ranking metrics.
Qualitative analyses show the difficulty of the paragraph and span
recommendation tasks and confirm the quotability of the best BERT model's
predictions, even if they are not the true selected quotes from the original
news articles.
Comment: 12 pages, 3 figures
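The bag-of-words ranking baseline that the BERT models are compared against can be sketched as cosine similarity between term-count vectors of the author's context and each candidate paragraph. This is a minimal illustration with our own function names and whitespace tokenization, not the paper's implementation:

```python
from collections import Counter
import math

def bow_score(context, paragraph):
    """Cosine similarity between bag-of-words term-count vectors."""
    a = Counter(context.lower().split())
    b = Counter(paragraph.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_paragraphs(context, paragraphs):
    """Rank candidate quote paragraphs by similarity to the
    context the author has already written."""
    return sorted(paragraphs, key=lambda p: bow_score(context, p),
                  reverse=True)
```

A neural or BERT-based ranker replaces bow_score with a learned relevance function but keeps the same rank-then-select interface.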
PARSING AND TAGGING OF BILINGUAL DICTIONARY
Bilingual dictionaries hold great potential as a source of lexical resources
for training and testing automated systems for optical character
recognition, machine translation, and cross-language information
retrieval. In this paper, we describe a system for extracting term
lexicons from printed bilingual dictionaries. Our work was divided into
three phases - dictionary segmentation, entry tagging, and generation. In
segmentation, pages are divided into logical entries based on structural
features learned from selected examples. The extracted entries are
associated with functional labels and passed to a tagging module which
associates linguistic labels with each word or phrase in the entry. The
output of the system is a structure that represents the entries from the
dictionary. We have used this approach to parse a variety of dictionaries
with both Latin and non-Latin alphabets, and demonstrate the results of
term lexicon generation for retrieval from a collection of French news
stories using English queries.
(LAMP-TR-106)
(CAR-TR-991)
(UMIACS-TR-2003-97)
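The HMM-based tagging step can be illustrated with a toy Viterbi decoder over lexicographic labels. Every state, probability, and emission cue below is an invented placeholder (the real system learns its features from selected examples), so this is a sketch of the decoding machinery only:

```python
import math

# Hypothetical entry labels and hand-set probabilities for illustration.
STATES = ["HEADWORD", "POS", "TRANSLATION"]
START = {"HEADWORD": 0.9, "POS": 0.05, "TRANSLATION": 0.05}
TRANS = {
    "HEADWORD":    {"HEADWORD": 0.1,  "POS": 0.6,  "TRANSLATION": 0.3},
    "POS":         {"HEADWORD": 0.05, "POS": 0.05, "TRANSLATION": 0.9},
    "TRANSLATION": {"HEADWORD": 0.1,  "POS": 0.1,  "TRANSLATION": 0.8},
}

def emit(state, token):
    """Crude emission model: dot-terminated abbreviations stand in for
    the structural and layout features the real system relies on."""
    is_pos_marker = token in {"n.", "v.", "adj.", "adv."}
    if state == "POS":
        return 0.9 if is_pos_marker else 0.01
    if state == "HEADWORD":
        return 0.05 if is_pos_marker else 0.3
    return 0.05 if is_pos_marker else 0.4  # TRANSLATION

def viterbi(tokens):
    """Return the most likely label sequence for an entry's tokens."""
    scores = [{s: math.log(START[s]) + math.log(emit(s, tokens[0]))
               for s in STATES}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES,
                       key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = (scores[-1][best] + math.log(TRANS[best][s])
                      + math.log(emit(s, tok)))
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    path = [max(scores[-1], key=scores[-1].get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

On a toy headword/POS/gloss entry such as viterbi(["rumah", "n.", "house"]) the decoder recovers the label sequence HEADWORD, POS, TRANSLATION.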
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Effective scaling and a flexible task interface enable large language models
to excel at many tasks. PaLI (Pathways Language and Image model) extends this
approach to the joint modeling of language and vision. PaLI generates text
based on visual and textual inputs, and with this interface performs many
vision, language, and multimodal tasks, in many languages. To train PaLI, we
make use of large pretrained encoder-decoder language models and Vision
Transformers (ViTs). This allows us to capitalize on their existing
capabilities and leverage the substantial cost of training them. We find that
joint scaling of the vision and language components is important. Since
existing Transformers for language are much larger than their vision
counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits
from even larger-capacity vision models. To train PaLI, we create a large
multilingual mix of pretraining tasks, based on a new image-text training set
containing 10B images and texts in over 100 languages. PaLI achieves
state-of-the-art in multiple vision and language tasks (such as captioning,
visual question-answering, scene-text understanding), while retaining a simple,
modular, and scalable design.
Use of OCR for Rapid Construction of Bilingual Lexicons
This paper describes an approach to analyzing the lexical structure of OCRed
bilingual dictionaries to construct resources suited for machine translation
of low-density languages, where online resources are limited. A rule-based
and an HMM-based method are used for rapid construction of MT lexicons based
on systematic structural clues provided in the original dictionary. We
evaluate the effectiveness of our techniques, concluding that: (1) the
rule-based method performs better on dictionaries with a simple structure;
(2) the stochastic method performs better on dictionaries with an enriched
structure; (3) regardless of the degree of dictionary richness, the
rule-based method gives better results for phrasal entries than for
single-word entries; and (4) our resulting bilingual lexicons are
comprehensive enough to provide reasonable MT results when compared to
human-constructed lexicons.
(LAMP-TR-104)
(CAR-TR-986)
(UMIACS-TR-2003-78)
Acquisition of Bilingual MT Lexicons from OCRed Dictionaries
This paper describes an approach to analyzing the lexical structure of OCRed bilingual dictionaries to construct resources suited for machine translation of low-density languages, where online resources are limited. A rule-based, an HMM-based, and a post-processed HMM-based method are used for rapid construction of MT lexicons based on systematic structural clues provided in the original dictionary. We evaluate the effectiveness of our techniques, concluding that: (1) the rule-based method performs better with dictionaries where the font is not an important distinguishing feature for determining information types; (2) the post-processed stochastic method improves the results of the stochastic method for phrasal entries; and (3) our resulting bilingual lexicons are comprehensive enough to provide the basis for reasonable translation results when compared to human translations.