Exploiting Query Structure and Document Structure to Improve Document Retrieval Effectiveness
In this paper we present a systematic analysis of document
retrieval using unstructured and structured queries within
the score region algebra (SRA) structured retrieval framework. The behavior of different retrieval models, namely
Boolean, tf.idf, GPX, language models, and Okapi, is tested
using the transparent SRA framework in our three-level structured retrieval system called TIJAH. The retrieval models are implemented along four elementary retrieval aspects: element and term selection, element score computation, score combination, and score propagation.
The analysis is performed on numerous experiments
evaluated on TREC and CLEF collections, using manually
generated unstructured and structured queries. Unstructured queries range from the short title queries to long title
+ description + narrative queries. For generating structured
queries we exploit the knowledge of the document structure
and the content used to semantically describe or classify
documents. We show that such structured information can
be utilized in retrieval engines to give more precise answers to user queries than when using unstructured queries
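As a rough illustration of one of the retrieval models named in this abstract, the sketch below scores documents with a plain tf.idf sum. It is not the SRA framework or the TIJAH system; the function name `tfidf_scores`, the whitespace tokenization, and the unsmoothed logarithmic idf are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(query_terms, docs):
    """Return one tf.idf score per document (hypothetical helper).

    Assumptions: lowercase whitespace tokenization, raw term frequency,
    and idf = log(N / df) with no smoothing.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term absent from the collection
            score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


docs = ["xml retrieval model", "pdf structure tree", "xml xml structure"]
scores = tfidf_scores(["xml"], docs)
```

Here the third document ranks highest for the query term `xml` because it contains the term twice, while the second document, which lacks the term, scores zero.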
Data Structure Lower Bounds for Document Indexing Problems
We study data structure problems related to document indexing and pattern
matching queries and our main contribution is to show that the pointer machine
model of computation can be extremely useful in proving high and unconditional
lower bounds that cannot be obtained in any other known model of computation
with the current techniques. Often our lower bounds match the known space-query
time trade-off curve and in fact for all the problems considered, there is a
very good and reasonable match between our lower bounds and the known upper
bounds, at least for some choice of input parameters. The problems that we
consider are set intersection queries (both the reporting variant and the
semi-group counting variant), indexing a set of documents for two-pattern
queries, or forbidden-pattern queries, or queries with wild-cards, and
indexing an input set of gapped-patterns (or two-patterns) to find those
matching a document given at the query time.
Comment: Full version of the conference version that appeared at ICALP 2016, 25 pages
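To make two of the query types in this abstract concrete, here is a naive baseline for set intersection reporting and for two-pattern document queries. It is only a brute-force sketch for stating the problems; it is not one of the data structures the paper analyses, and the function names are hypothetical.

```python
def report_intersection(sets, i, j):
    """Report the elements common to sets[i] and sets[j].

    Naive approach: scan the smaller set and probe the larger one.
    """
    small, large = sorted((sets[i], sets[j]), key=len)
    return {x for x in small if x in large}


def two_pattern_docs(docs, p1, p2):
    """Return indices of documents containing both patterns as substrings.

    Naive approach: scan every document at query time, with no index.
    """
    return [idx for idx, doc in enumerate(docs) if p1 in doc and p2 in doc]


sets = [{1, 2, 3}, {2, 3, 4}, {5}]
common = report_intersection(sets, 0, 1)

docs = ["abracadabra", "banana", "cabana"]
matches = two_pattern_docs(docs, "ab", "an")
```

Both functions do all their work at query time; the lower bounds discussed in the abstract concern how much space a preprocessed index must use to answer such queries faster.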
The effect of WWW document structure on students' information retrieval
This experiment investigated the effect the structure of a WWW document has on the amount of information retained by a reader. Three structures common on the Internet were tested: one long page; a table of contents leading to individual sections; and short sections of text on separate pages with revision questions. Participants read information structured in one of these ways and were then tested on recall of that information. A further experiment investigated the effect that 'browsing' - moving between pages - has on retrieval. There was no difference between the structures for overall amount of information retained. The single page version was best for recall of facts, while the short sections of text with revision questions led to the most accurate inferences from the material. Browsing on its own had no significant impact on information retrieval. Revision questions rather than structure per se were therefore the key factor
Hyper-Document structure: maintaining discourse coherence in non-linear documents
The passage from linear text to hypertext poses the challenge of expressing discourse coherence in non-linear text, where linguistic discourse markers no longer work. While hypertext introduces new possibilities for discourse organisation, it also requires the use of new devices which can support the expression of coherence by exploiting the technical characteristics and expressive capabilities of the medium. In this paper we show how in hypertext the notion of abstract document structure encompasses animated graphics as a form of meta-language for discourse construction
Creating Structured PDF Files Using XML Templates
This paper describes a tool for recombining the logical structure from an XML document with the typeset appearance of the corresponding PDF document. The tool uses the XML representation as a template for the insertion of the logical structure into the existing PDF document, thereby creating a Structured/Tagged PDF. The addition of logical structure adds value to the PDF in three ways: the accessibility is improved (PDF screen readers for visually impaired users perform better), media options are enhanced (the ability to reflow PDF documents, using structure as a guide, makes PDF viable for use on hand-held devices) and the re-usability of the PDF documents benefits greatly from the presence of an XML-like structure tree to guide the process of text retrieval in reading order (e.g. when interfacing to XML applications and databases)
Towards a document structure editor for software requirements analysis
Of the six or seven phases of the software engineering life cycle, requirements analysis tends to be the least understood and the least formalized. Correspondingly, a scarcity of useful software tools exists which aid in the development of user and system requirements. It is proposed that requirements analysis should culminate in a set of documents similar to those that usually accompany a delivered software product. The design of a software tool, the Document Structure Editor, which facilitates the development of such documentation
Jointly Modeling Topics and Intents with Global Order Structure
Modeling document structure is of great importance for discourse analysis and
related applications. The goal of this research is to capture the document
intent structure by modeling documents as a mixture of topic words and
rhetorical words. While the topics are relatively unchanged through one
document, the rhetorical functions of sentences usually change following
certain orders in discourse. We propose GMM-LDA, a topic modeling based
Bayesian unsupervised model, to analyze the document intent structure
together with order information. Our model is flexible: it can also
incorporate annotations and perform supervised learning. Additionally, entropic
regularization can be introduced to model the significant divergence between
topics and intents. We perform experiments in both unsupervised and supervised
settings; the results show the superiority of our model over several
state-of-the-art baselines.
Comment: Accepted by AAAI 201
Mapping and Displaying Structural Transformations between XML and PDF
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving.
Until recently PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure but recent releases have now introduced an internal structure tree to create the so-called 'Tagged PDF'.
This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. In one window is shown an XML document original and in the other its Tagged PDF counterpart is seen, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window.
Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document