
    Visual exploration and retrieval of XML document collections with the generic system X2

    This article reports on the XML retrieval system X2, which has been developed at the University of Munich over the last five years. In a typical session with X2, the user first browses a structural summary of the XML database in order to select interesting elements and keywords occurring in documents. Using this intermediate result, queries combining structure and textual references are composed semi-automatically. After query evaluation, the full set of answers is presented in a visual and structured way. X2 largely exploits the structure found in documents, queries and answers to enable new interactive visualization and exploration techniques that support mixed IR- and database-oriented querying, thus bridging the gap between these three views on the data to be retrieved. Another salient characteristic of X2, which distinguishes it from other visual query systems for XML, is that it supports various levels of detail in the presentation of answers, as well as techniques for dynamically reordering and grouping retrieved elements once the complete answer set has been computed.
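The mixed structural-and-textual querying the abstract describes can be illustrated with a small sketch. The XML schema, element names, and keyword below are invented for illustration; this is not X2's actual query language.

```python
# Hypothetical sketch: a query combining a structural constraint with a
# textual keyword, in the spirit of X2's semi-automatically composed queries.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<library>
  <article year="2004">
    <title>XML retrieval with X2</title>
    <abstract>Visual exploration of XML document collections.</abstract>
  </article>
  <article year="2001">
    <title>Relational query processing</title>
    <abstract>Join algorithms for SQL systems.</abstract>
  </article>
</library>
""")

# Structural part (.//article that has an <abstract> child) expressed in
# ElementTree's limited XPath; the IR-style keyword test applied in Python.
hits = [a.findtext("title")
        for a in doc.findall(".//article[abstract]")
        if "XML" in a.findtext("abstract")]
print(hits)  # ['XML retrieval with X2']
```

A full-featured system would of course evaluate both constraints inside the query engine rather than filtering in application code.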

    Image Retrieval in Digital Libraries - A Large Scale Multicollection Experimentation of Machine Learning techniques

    While digital heritage libraries were historically first populated in image mode, they quickly took advantage of OCR technology to index printed collections and consequently improve the scope and performance of the information-retrieval services offered to users. But access to iconographic resources has not progressed in the same way, and these resources remain in the shadows: manual indexing that is incomplete and heterogeneous; data silos by iconographic genre; content-based image retrieval (CBIR) still barely operational on heritage collections. Today, however, it would be possible to make better use of these resources, especially by exploiting the enormous volumes of OCR produced over the last two decades, and thus to valorize these engravings, drawings, photographs, maps, etc., both for their own value and as an attractive entry point into the collections, supporting discovery and serendipity from document to document and collection to collection. This article presents an ETL (extract-transform-load) approach to this need, which aims to: identify and extract iconography wherever it may be found, in image collections but also in printed materials (dailies, magazines, monographs); transform, harmonize and enrich the image descriptive metadata (in particular with machine-learning classification tools); and load it all into a web app dedicated to image retrieval. The approach is pragmatic in two respects, since it involves leveraging existing digital resources and (virtually) off-the-shelf technologies.
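The extract-transform-load pipeline the article outlines can be sketched minimally as below. All field names, genre labels, and the keyword rule standing in for the machine-learning classifier are invented for illustration.

```python
# Minimal ETL sketch for harmonizing image metadata from heterogeneous
# sources. classify() is a naive keyword stand-in for the ML genre
# classifier the article describes; a real system would plug a trained
# model in here.

def extract(sources):
    # Pull raw metadata records out of each silo (image collections,
    # OCRed printed materials, ...).
    for source in sources:
        yield from source

def classify(record):
    # Stand-in for an ML image classifier: tag by a trivial keyword rule.
    title = record.get("title", "").lower()
    return "map" if "map" in title else "illustration"

def transform(records):
    # Harmonize field names and enrich records missing a genre.
    for r in records:
        yield {
            "id": r["id"],
            "title": r.get("title", "").strip(),
            "genre": r.get("genre") or classify(r),
        }

def load(records):
    # Toy "index" for the retrieval app: a dict keyed by record id.
    return {r["id"]: r for r in records}

ocr_docs = [{"id": "b1", "title": "Map of Paris "}]            # no genre yet
image_silos = [{"id": "i1", "title": "Portrait", "genre": "photograph"}]

index = load(transform(extract([ocr_docs, image_silos])))
print(index["b1"]["genre"])  # map
```

The enrichment step only fills in a genre when the source record lacks one, so curated metadata from the image silos is preserved.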

    Sentimental classification analysis of polarity multi-view textual data using data mining techniques

    The data and information available in most community environments are complex in nature. Sentimental data resources may consist of textual data collected from multiple information sources with different representations, usually handled by different analytical models. Data with these characteristics form multi-view polarity textual data. However, knowledge creation from this type of sentimental textual data requires considerable analytical effort and capability. Data mining practices in particular can provide exceptional results in handling textual data formats. Moreover, when the textual data exists in multi-view or unstructured formats, hybrid and integrated text data mining algorithms are vital for obtaining helpful results. The objective of this research is to enhance knowledge discovery from sentimental multi-view textual data, which can be considered an unstructured data format, by classifying polarity documents into two categories of useful information. This paper discusses a proposed framework with integrated data mining algorithms, achieved through the application of the X-means algorithm for clustering and the HotSpot algorithm for association rules. The analysis results show improved accuracy in classifying sentimental multi-view textual data into two categories when the proposed framework is applied to an online polarity user-review dataset on given topics.
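The first stage of such a framework, clustering documents into two polarity groups, can be sketched with off-the-shelf tools. Neither X-means nor Weka's HotSpot is available in scikit-learn, so this sketch substitutes plain k-means (which X-means extends by estimating the number of clusters); the tiny review corpus is invented.

```python
# Sketch of polarity clustering over TF-IDF vectors, using KMeans as a
# stand-in for the X-means algorithm named in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "great phone excellent battery",   # positive
    "great screen excellent camera",   # positive
    "terrible battery awful support",  # negative
    "terrible screen awful refund",    # negative
]

X = TfidfVectorizer().fit_transform(reviews)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# The two clusters approximate the two polarity categories; a rule-mining
# stage (HotSpot in the paper) would then characterize each cluster.
print(labels)
```

On this toy corpus the shared sentiment words ("great"/"excellent" vs. "terrible"/"awful") dominate the similarity, so the two clusters line up with polarity.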

    Adaptive Algorithms for Automated Processing of Document Images

    Large scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles or letters, and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections. We first propose, implement and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance-transform-based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach is in its determination of the best approximation to the clutter-content boundary with text-like structures. Second, we describe a page segmentation technique called Voronoi++ for complex layouts which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multilingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum-based [O'Gorman1993] angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features. Finally, our research proposes a generic model to segment and recognize characters for any complex syllabic or non-syllabic script, using font models. This concept is based on the fact that font files contain all the information necessary to render text, and thus a model for how to decompose it. Instead of script-specific routines, this work is a step towards a generic character segmentation and recognition scheme for both Latin and non-Latin scripts.
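The intuition behind distance-transform clutter detection can be shown on a toy binary image: thin, text-like strokes have small maximal distance-transform values, while large solid blobs contain pixels deep inside the foreground. This is only an illustration of the general idea, not the paper's algorithm; the image and threshold are invented.

```python
# Toy sketch of distance-transform-based clutter removal on a binary page.
import numpy as np
from scipy import ndimage

page = np.zeros((20, 20), dtype=bool)
page[2:4, 2:14] = True     # a thin, text-like stroke (2 pixels tall)
page[8:18, 8:18] = True    # a large solid blob, standing in for clutter

dist = ndimage.distance_transform_edt(page)  # distance to nearest background
labels, n = ndimage.label(page)              # connected components

cleaned = page.copy()
for comp in range(1, n + 1):
    # A component whose deepest pixel lies far from the background is too
    # thick to be text; drop it as clutter (threshold invented for the toy).
    if dist[labels == comp].max() > 2.0:
        cleaned[labels == comp] = False

print(cleaned[2:4, 2:14].all(), cleaned[8:18, 8:18].any())  # True False
```

The paper's adaptive technique goes further, estimating the clutter-content boundary instead of using a fixed threshold.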

    The Constitution and Legislative History

    In this article, the author provides an extended analysis of the constitutional claims against legislative history, arguing that, under textualists’ own preference for constitutional text, the use of legislative history should be constitutional to the extent it is supported by Congress’s rulemaking power, a constitutionally enumerated power. This article has five parts. In part I, the author explains the importance of this question, considering the vast range of cases to which this claim of unconstitutionality could possibly apply—after all, statutory interpretation cases are the vast bulk of the work of the federal courts. She also explains why these claims should be of greater concern to a variety of constitutional theorists, particularly those who embrace theories of popular and common law constitutionalism, but as well to originalists. In part II, the author considers the textualist arguments against the constitutionality of legislative history. Article I, Section 7 provides that any bill must pass the House and the Senate and be presented to the President for veto or signature. As a number of textualists have argued, legislative history is not passed by both houses or signed by the President. Call this the “bicameralism argument.” Her answer to the bicameralism argument lies in a constitutional text that statutory textualists seem to have forgotten: Article I, Section 5 gives explicit power to Congress to set its own procedures, a power that gives legitimacy to legislative history created pursuant to those procedures. In fact, new developments in statutory interpretation theory (decision process theory) suggest that, in some cases, the only way to resolve textual conflict is to consider legislative procedure. In part III, the author considers a second prominent argument against the constitutionality of legislative history: non-delegation. 
Critics argue that Congress may not delegate the “legislative power” granted under the Constitution to members or committees, as only the entire Congress may constitutionally exercise that power. Call this the “non-delegation” argument. Again, her response is based on constitutional text: Article I, Section 5 specifically sanctions delegation to less than the whole of Congress; more importantly, there is no general norm against self-delegation stated explicitly or even implicitly in the Constitution. Finally, the author suggests that there is a certain inconsistency in the assertion of these claims: the non-self-delegation and bicameralism arguments can both be used to indict canons of construction, which textualists offer as the leading alternative to legislative history, but which have no supporting text comparable to Article I, Section 5 in the Constitution. In part IV, she considers arguments that judges’ use of legislative history violates the separation of powers because it allows the legislature to exceed the bounds of the “judicial power.” This argument can rather easily be turned on its head: in the quotations offered at the beginning of this article, members of Congress argue that judges are exercising the “legislative power” when they rewrite statutes without considering legislative history. As has been argued at length elsewhere, the use of “adjectival” argument in structural controversies—relying upon the terms “legislative, executive, and judicial”—perpetuates a weak understanding of the separation of powers, and one that the Constitution’s own text belies. The separation of powers does not prevent recourse to legislative history; in fact, as the article explains, blindness to legislative history may create different kinds of structural risks—risks to federalism, rather than risks to the separation of powers. 
    Finally, in part V, the author concludes by suggesting that we should retire the strong form of the legislative history unconstitutionality argument, by which she means the claim that the Constitution bars any and all use of legislative history. Instead, we should far more actively interrogate serious questions about the use of legislative history in particular cases. Can it really be wise—or even constitutional—for a judge to impose a meaning on an ambiguous statute with reference to the statements of a filibustering minority, or privilege some texts in ways that violate Congress’s rules? Fidelity to Congress, and the importance of Congress’s constitutional rules—what Francis Lieber once called the “common law” of the Congress—has yet to be theorized within this more pressing, but particular, sphere.