7,341 research outputs found
Handwriting Recognition of Historical Documents with few labeled data
Historical documents present many challenges for offline handwriting
recognition systems, among them, the segmentation and labeling steps. Carefully
annotated textlines are needed to train an HTR system. In some scenarios,
transcripts are only available at the paragraph level with no text-line
information. In this work, we demonstrate how to train an HTR system with few
labeled data. Specifically, we train a deep convolutional recurrent neural
network (CRNN) system on only 10% of manually labeled text-line data from a
dataset and propose an incremental training procedure that covers the rest of
the data. Performance is further increased by augmenting the training set with
specially crafted multiscale data. We also propose a model-based normalization
scheme which considers the variability in the writing scale at the recognition
phase. We apply this approach to the publicly available READ dataset. Our
system achieved the second best result during the ICDAR2017 competition
Text segmentation on multilabel documents: A distant-supervised approach
Segmenting text into semantically coherent segments is an important task with
applications in information retrieval and text summarization. Developing
accurate topical segmentation requires the availability of training data with
ground truth information at the segment level. However, generating such labeled
datasets, especially for applications in which the meaning of the labels is
user-defined, is expensive and time-consuming. In this paper, we develop an
approach that instead of using segment-level ground truth information, it
instead uses the set of labels that are associated with a document and are
easier to obtain as the training data essentially corresponds to a multilabel
dataset. Our method, which can be thought of as an instance of distant
supervision, improves upon the previous approaches by exploiting the fact that
consecutive sentences in a document tend to talk about the same topic, and
hence, probably belong to the same class. Experiments on the text segmentation
task on a variety of datasets show that the segmentation produced by our method
beats the competing approaches on four out of five datasets and performs at par
on the fifth dataset. On the multilabel text classification task, our method
performs at par with the competing approaches, while requiring significantly
less time to estimate than the competing approaches.Comment: Accepted in 2018 IEEE International Conference on Data Mining (ICDM
Utilizing sub-topical structure of documents for information retrieval.
Text segmentation in natural language processing typically refers to the process of decomposing a document into constituent subtopics. Our work centers on the application of text segmentation techniques within information retrieval (IR) tasks. For example, for scoring a document by combining the retrieval scores of its constituent segments, exploiting the proximity of query terms in documents for ad-hoc search, and for question answering (QA), where retrieved passages from multiple documents are aggregated and presented as a single document to a searcher. Feedback in ad hoc IR task is shown to beneο¬t from the use of extracted sentences instead of terms from the pseudo relevant documents for query expansion. Retrieval effectiveness for patent prior art search task is enhanced by applying text segmentation to the patent queries. Another aspect of our work involves augmenting text segmentation techniques to produce segments which are more readable with less unresolved anaphora. This is particularly useful for QA and snippet generation tasks where the objective is to aggregate relevant and novel information from multiple documents satisfying user information need on one hand, and ensuring that the automatically generated content presented to the user is easily readable without reference to the original source document
- β¦