AIDING MODERN TEXTUAL SCHOLARSHIP USING A VIRTUAL HINMAN COLLATOR
Collation is an important step in textual criticism and is often an arduous task for scholars involved in scholarly editing. Finding variations is important for researchers in bibliography and book history as well. In the late 1940s Charlton Hinman invented a machine that became popularly known as the Hinman collator. Using optical means, the Hinman collator allowed manual comparison of separate copies of a text in order to detect any differences that had been introduced. Although such mechanical collation systems are helpful, they still require a lot of manual labor, and some scholars find them hard to use. Another approach sometimes used is to perform collation on the OCR output of a text. However, state-of-the-art OCR for 15th- and 16th-century books is not yet accurate enough (70-80% accuracy). Also, scholars doing textual criticism generally prefer to work on original copies or facsimiles rather than OCR versions of them, because the accuracy and some of the nuanced details of the original copy are important to them.
Thus there is a need for a tool that reduces the effort required in the collation process while maintaining (and sometimes improving) its usefulness and allowing scholars to work from original documents (high-quality facsimiles). This research focuses on this aspect of scholarly work and explores various approaches to performing digital collation with minimal effort. A prototype of the virtual Hinman (vHinman) collator was created, and a user evaluation was conducted among scholars experienced with collation work. Image-matching algorithms, along with context information, are used to match words, and the tool was integrated into the creativity support environment CritSpace. The tool was tested on books from the early modern and late modern periods for which multiple copies with slight variations were available.
The tool showed a high accuracy rate for the books tested, and most of the scholars found it very promising. A tool of this kind can save scholars a great deal of time and establish a paradigm of digital collation, encouraging even more scholars to find new uses for collation in their work.
Recommended from our members
An Efficient Framework for Searching Text in Noisy Document Images
An efficient word-spotting framework is proposed for searching text in scanned books. The proposed method allows one to search for words when optical character recognition (OCR) fails due to noise, or for languages for which no OCR exists. Given a query word image, the aim is to retrieve matching words in the book, sorted by similarity. In the offline stage, SIFT descriptors are extracted at the corner points of each word image. These features are quantized into visual terms (visterms) using the hierarchical K-Means algorithm and indexed using an inverted file. In the query resolution stage, candidate matches are efficiently identified using the inverted index. These word images are then forwarded to the next stage, where the configuration of visterms on the image plane is tested. Configuration matching is performed efficiently by projecting the visterms onto the horizontal axis and searching for the Longest Common Subsequence (LCS) between the sequences of visterms. The proposed framework is tested on one English and two Telugu books. It is shown that the proposed method resolves a typical user query in under 10 milliseconds while providing very high retrieval accuracy (Mean Average Precision 0.93). The search accuracy for the English book is comparable to that of searching text in the high-accuracy output of a commercial OCR engine.
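The two query stages described above (candidate lookup in an inverted index, then LCS-based configuration matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each word image has already been reduced to a list of integer visterm ids ordered by horizontal position, so SIFT extraction and hierarchical K-Means quantization are out of scope. The function names, the `min_shared` threshold, and the length-normalized score are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_inverted_index(word_visterms):
    """Map each visterm id to the set of word-image ids that contain it."""
    index = defaultdict(set)
    for word_id, visterms in word_visterms.items():
        for v in visterms:
            index[v].add(word_id)
    return index


def candidates(index, query, min_shared=2):
    """Return word ids sharing at least `min_shared` distinct visterms with the query.

    `min_shared` is an assumed filter, not a value taken from the abstract.
    """
    counts = defaultdict(int)
    for v in set(query):
        for word_id in index.get(v, ()):
            counts[word_id] += 1
    return [w for w, c in counts.items() if c >= min_shared]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def search(index, word_visterms, query, min_shared=2):
    """Rank candidate words by LCS length normalized by the longer sequence."""
    scored = []
    for w in candidates(index, query, min_shared):
        seq = word_visterms[w]
        score = lcs_length(query, seq) / max(len(query), len(seq))
        scored.append((score, w))
    # Highest score first; ties broken by word id for a stable order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(w, s) for s, w in scored]


# Toy data: visterm ids per word image, already ordered left to right.
word_visterms = {
    "w1": [3, 7, 7, 1, 9],
    "w2": [3, 7, 1, 9],
    "w3": [9, 1, 7, 3],   # same visterms as the query, reversed order
    "w4": [5, 6],         # shares nothing with the query
}
index = build_inverted_index(word_visterms)
results = search(index, word_visterms, [3, 7, 1, 9])
```

In this toy example `w3` passes the inverted-index stage because it shares all of the query's visterms, but it ranks last because its left-to-right order disagrees with the query. That is the role the abstract assigns to the configuration-matching stage.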