11 research outputs found
Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
Recent years have witnessed the rapid growth of the World Wide Web (WWW). Information is accessible at one's fingertips anytime, anywhere through the massive web repository. The performance and reliability of web search engines therefore face serious challenges from the enormous volume of web data, which reduces the relevance of search results to the user. In addition, the presence of duplicate and near-duplicate web documents creates extra overhead for search engines, critically affecting their performance. The demand for integrating data from heterogeneous sources also leads to near-duplicate web pages. The detection of near-duplicate documents within a collection has recently become an area of great interest. In this research, we present an efficient approach for detecting near-duplicate web pages in web crawling that uses keywords and a distance measure. For comparison, the fingerprint-based approach proposed by G. S. Manku et al. in 2007 is considered one of the "state-of-the-art" algorithms for finding near-duplicate web pages. We implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and Manku et al.'s fingerprint-based approach, analyzing the results in terms of time complexity, space complexity, memory usage, and confusion-matrix parameters. Taking these performance factors into account, the comparison clearly shows our approach to be the better (less complex) of the two. DOI: http://dx.doi.org/10.11591/ijece.v2i6.1746
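For context, the Manku et al. approach is built on simhash fingerprints compared by Hamming distance. Below is a minimal Python sketch of that idea; the hash function, 64-bit width, and toy documents are illustrative simplifications rather than the paper's exact engineering:

```python
# Minimal simhash sketch: near-duplicate pages receive fingerprints
# that differ in only a few bit positions.
import hashlib

def simhash(tokens, bits=64):
    """Combine per-token hashes into a single fingerprint."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Pages are flagged as near-duplicates when their fingerprints differ in
# at most k bits (Manku et al. report k = 3 for 64-bit fingerprints).
doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
print(hamming_distance(simhash(doc1), simhash(doc2)))  # small distance
```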
Next steps in near-duplicate detection for erulemaking
Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents. This increases processing and storage costs but is rarely a serious problem in itself; the more serious concern is that form-letter customizations can raise substantive issues that agencies are likely to overlook. The identification of exact and near-duplicate texts, and the recognition of unique text within near-duplicate documents, is therefore an important component of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing the challenges of moving near-duplicate detection into operational rulemaking environments.
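As a rough illustration of the bag-of-words core that DURIAN builds on, the sketch below groups comments whose term-count vectors are nearly identical under cosine similarity. DURIAN's use of metadata and content structure, and its actual thresholds, are not reproduced here; the threshold value is a stand-in:

```python
# Hedged sketch: grouping public comments by bag-of-words cosine similarity.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicate_groups(comments, threshold=0.9):
    """Assign each comment to the first group whose representative
    it resembles; otherwise start a new group."""
    bags = [Counter(c.lower().split()) for c in comments]
    groups = []  # list of (representative index, member indices)
    for i, bag in enumerate(bags):
        for rep, members in groups:
            if cosine(bags[rep], bag) >= threshold:
                members.append(i)
                break
        else:
            groups.append((i, [i]))
    return groups
```

Edited copies of a form letter share most of their vocabulary, so their bag-of-words vectors stay close even when a few substantive sentences are inserted, which is exactly the unique text an agency would want surfaced.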
Achieving the Potential: The Future of Federal e-Rulemaking: A Report to Congress and the President
Federal regulations are among the most important and widely used tools for implementing the laws of the land – affecting the food we eat, the air we breathe, the safety of consumer products, the quality of the workplace, the soundness of our financial institutions, the smooth operation of our businesses, and much more. Despite the central role of rulemaking in executing public policy, both regulated entities (especially small businesses) and the general public find it extremely difficult to follow the regulatory process; actively participating in it is even harder.
E-rulemaking is the use of technology (particularly, computers and the World Wide Web) to: (i) help develop proposed rules; (ii) make rulemaking materials broadly available online, along with tools for searching, analyzing, explaining and managing the information they contain; and (iii) enable more effective and diverse public participation. E-rulemaking has transformative potential to increase the comprehensibility, transparency and accountability of the regulatory process. Specifically, e-rulemaking – effectively implemented – can open the rulemaking process to a broader range of participants, offer easier access to rulemaking and implementation materials, facilitate dialogue among interested parties about policy and enforcement, enhance regulatory coordination, and help produce better decisions that lead to more effective, accepted and enforceable rules. If realized, this vision would greatly strengthen civic participation and our democratic form of government.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends largely on its parallel training data; phrases not present in that data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, instead using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both in an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge-expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2 - Global security and crisis management
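A hedged sketch of the general idea, not the paper's method: starting from an existing phrase-table entry, generate new entries for morphological variants and down-weight each by a string similarity score. The lexicon, similarity function, and scores below are illustrative stand-ins for the paper's morphosyntactic resources:

```python
# Illustrative phrase-table expansion via morphological variants.
from difflib import SequenceMatcher

# Hypothetical morphological lexicon: surface form -> variants of the same lemma.
LEXICON = {"eat": ["eats", "eating", "ate"]}

def string_similarity(a: str, b: str) -> float:
    """Simple character-level similarity; a stand-in for the paper's
    morphosyntactically informed score."""
    return SequenceMatcher(None, a, b).ratio()

def expand_phrase_pair(src, tgt, score):
    """Yield new (source, target, score) phrase-table entries, one per
    morphological variant, down-weighted by similarity to the original."""
    for variant in LEXICON.get(src, []):
        yield (variant, tgt, score * string_similarity(src, variant))

for entry in expand_phrase_pair("eat", "mange", 0.8):
    print(entry)  # e.g. ('eats', 'mange', ...) with a reduced score
```

The point of the down-weighting is that the decoder can still prefer genuinely observed phrase pairs, falling back on generated variants only when a source phrase would otherwise be out-of-vocabulary.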
Applying latent semantic analysis to computer assisted assessment in the Computer Science domain: a framework, a tool, and an evaluation
This dissertation argues that automated assessment systems can be useful for both students and educators, provided that their results correspond well with those of human markers; evaluating such a system is therefore crucial. I present an evaluation framework and show how and why it can be useful for both producers and consumers of automated assessment systems. The framework is a refinement of a research taxonomy that came out of an effort to analyse the literature on systems based on Latent Semantic Analysis (LSA), a statistical natural language processing technique that has been used for automated assessment of essays. The evaluation framework can help developers publish their results in a format that is comprehensive, relatively compact, and useful to other researchers.
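For readers unfamiliar with LSA, the sketch below shows its core operation: a truncated SVD of a term-document matrix yields low-rank document vectors that can be compared by cosine similarity. EMMA's actual preprocessing, term weighting, and dimensionality are not shown; the toy matrix is an assumption for illustration:

```python
# Minimal LSA sketch: project answers into a low-rank "semantic" space
# and compare them by cosine similarity.
import numpy as np

def lsa_space(term_doc_matrix, k=2):
    """Return rank-k document vectors from an SVD of the term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-document matrix: rows are terms, columns are answers.
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
docs = lsa_space(X, k=2)
print(cosine(docs[0], docs[1]))  # answers sharing vocabulary score near 1
```

In an assessment setting, a student answer would be folded into the same space and scored against marked model answers; the dimensionality reduction is what lets different wordings of the same idea land near each other.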
The thesis claims that, in order to see a complete picture of an automated assessment system, certain pieces must be emphasised. It presents the framework as a jigsaw puzzle whose pieces join together to form the whole picture.
The dissertation uses the framework to compare the accuracy of human markers and EMMA, the LSA-based assessment system I wrote as part of this dissertation. EMMA marks short free-text answers in the domain of computer science. I conducted a study of five human markers and used their results as a benchmark against which to evaluate EMMA. An integral part of the evaluation was the success metric: the standard inter-rater reliability statistic was not useful, so I located a new statistic and applied it to the domain of computer-assisted assessment for, as far as I know, the first time.
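The dissertation's replacement statistic is not named in this abstract, so only the conventional baseline is sketched here: Cohen's kappa, a standard inter-rater reliability statistic that corrects raw agreement for agreement expected by chance. The toy marks are illustrative:

```python
# Cohen's kappa: chance-corrected agreement between two markers.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters assign the same mark at random.
    expected = sum(ca[m] * cb[m] for m in ca) / (n * n)
    return (observed - expected) / (1 - expected)

human = [1, 2, 2, 3, 1, 2]   # hypothetical marks from a human marker
system = [1, 2, 3, 3, 1, 1]  # hypothetical marks from an automated system
print(round(cohens_kappa(human, system), 3))  # 0.52: moderate agreement
```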
Although EMMA exceeds human markers on a few questions, overall it does not achieve the same level of agreement with humans as humans do with each other. The last chapter maps out a plan for further research to improve EMMA.