11 research outputs found

    Next steps in near-duplicate detection for eRulemaking

    Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling

    Recent years have witnessed the rapid growth of the World Wide Web (WWW). Information is now accessible at one's fingertips, anytime and anywhere, through this massive web repository. The performance and reliability of web search engines therefore face serious problems because of the enormous amount of web data. The sheer volume of web documents makes search results less relevant to the user. In addition, duplicate and near-duplicate web documents create further overhead for search engines, critically affecting their performance. The demand for integrating data from heterogeneous sources leads to the problem of near-duplicate web pages. The detection of near-duplicate documents within a collection has recently become an area of great interest. In this research, we present an efficient approach for detecting near-duplicate web pages in web crawling that uses keywords and a distance measure. G.S. Manku et al.’s fingerprint-based approach, proposed in 2007, is considered one of the “state-of-the-art” algorithms for finding near-duplicate web pages. We implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and G.S. Manku et al.’s fingerprint-based approach. We analyzed our results in terms of time complexity, space complexity, memory usage, and the confusion matrix parameters. Taking these performance factors into account, the comparison study shows our approach to be the better (less complex) of the two on the factors considered. DOI: http://dx.doi.org/10.11591/ijece.v2i6.1746
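To make the fingerprint idea in this abstract concrete, here is a minimal sketch of a simhash-style fingerprint compared by Hamming distance, which is the general technique Manku et al.'s 2007 paper builds on. It is an illustration under stated assumptions, not either paper's implementation: the function names, the MD5-based token hash, the unweighted tokens, and the default threshold `k=3` on 64-bit fingerprints are all choices made for this example.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric tokens (tokenization is an assumption of this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Build a simhash-style fingerprint: each token votes +1/-1 on every bit."""
    weights = [0] * bits
    for tok in tokens(text):
        # Hypothetical choice: first 8 bytes of MD5 as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a, doc_b, k=3):
    """Treat two documents as near-duplicates if their fingerprints differ in at most k bits."""
    return hamming(simhash(doc_a), simhash(doc_b)) <= k
```

Documents that share most of their tokens tend to produce fingerprints that differ in few bits, so the threshold `k` trades recall against false positives; a suitable value depends on the collection.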

    The Transformation of the U.S. Rulemaking Process - For Better or Worse

    Automated classification of congressional legislation

    Next steps in near-duplicate detection for erulemaking

    Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increases processing and storage costs, but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing challenges to moving near-duplicate detection into operational rulemaking environments.
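As a rough illustration of the bag-of-words representation this abstract mentions, the sketch below compares a comment against a form letter by cosine similarity and lists the terms the commenter added. This is not DURIAN itself: the abstract does not give the algorithm's details, and the function names, the tokenization, and the use of cosine similarity are assumptions made for the example. It also ignores the metadata and content-structure signals the abstract describes.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term-frequency bag of words (tokenization is an assumption of this sketch)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def unique_terms(form_letter, comment):
    """Terms that occur in the comment more often than in the form letter."""
    extra = bag_of_words(comment) - bag_of_words(form_letter)
    return sorted(extra.keys())


# Hypothetical example texts, not taken from any real comment collection.
form = "please protect the wetlands"
edited = "please protect the wetlands near my farm"
similarity = cosine(bag_of_words(form), bag_of_words(edited))
added = unique_terms(form, edited)  # ['farm', 'my', 'near']
```

A high similarity flags the comment as a likely edited copy of the form letter, while the added terms point a reviewer to the customized text that might carry a substantive issue.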

    Achieving the Potential: The Future of Federal e-Rulemaking: A Report to Congress and the President

    Federal regulations are among the most important and widely used tools for implementing the laws of the land – affecting the food we eat, the air we breathe, the safety of consumer products, the quality of the workplace, the soundness of our financial institutions, the smooth operation of our businesses, and much more. Despite the central role of rulemaking in executing public policy, both regulated entities (especially small businesses) and the general public find it extremely difficult to follow the regulatory process; actively participating in it is even harder. E-rulemaking is the use of technology (particularly computers and the World Wide Web) to: (i) help develop proposed rules; (ii) make rulemaking materials broadly available online, along with tools for searching, analyzing, explaining and managing the information they contain; and (iii) enable more effective and diverse public participation. E-rulemaking has transformative potential to increase the comprehensibility, transparency and accountability of the regulatory process. Specifically, e-rulemaking – effectively implemented – can open the rulemaking process to a broader range of participants, offer easier access to rulemaking and implementation materials, facilitate dialogue among interested parties about policy and enforcement, enhance regulatory coordination, and help produce better decisions that lead to more effective, accepted and enforceable rules. If realized, this vision would greatly strengthen civic participation and our democratic form of government.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis managemen