20 research outputs found

    Rank, select and access in grammar-compressed strings

    Given a string S of length N over a fixed alphabet of σ symbols, a grammar compressor produces a context-free grammar G of size n that generates S and only S. In this paper we describe data structures to support the following operations on a grammar-compressed string: rank_c(S, i) (return the number of occurrences of symbol c before position i in S); select_c(S, i) (return the position of the i-th occurrence of c in S); and access(S, i, j) (return the substring S[i, j]). For rank and select we describe data structures of size O(nσ log N) bits that support the two operations in O(log N) time. We propose another structure that uses O(nσ log(N/n) (log N)^{1+ε}) bits and supports the two queries in O(log N / log log N) time, where ε > 0 is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires O(n log N) bits of space and O(log N + m/log_σ N) time to extract m = j − i + 1 consecutive symbols from S. Alternatively, we can achieve O(log N / log log N + m/log_σ N) query time using O(n log(N/n) (log N)^{1+ε}) bits of space. This matches a lower bound stated by Verbin and Yu for strings where N is polynomially related to n.
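
    The basic mechanism behind such structures can be illustrated by a plain traversal of the grammar's parse tree: store, for every nonterminal, its expansion length and its per-symbol occurrence counts, and answer access and rank by descending from the start symbol. The following is only a minimal sketch of that idea for a toy grammar in Chomsky-like normal form; it runs in time proportional to the grammar's depth rather than within the paper's O(log N) or O(log N / log log N) bounds.

```python
from collections import Counter

# Hypothetical toy grammar: a nonterminal maps either to a terminal
# character or to a pair of symbols (Chomsky-like normal form).
rules = {
    "A": "a", "B": "b",
    "X": ("A", "B"),    # X -> ab
    "Y": ("X", "X"),    # Y -> abab
    "S": ("Y", "X"),    # S -> ababab
}

def preprocess(rules):
    """Compute expansion length and per-symbol counts for every nonterminal."""
    length, counts = {}, {}
    def walk(sym):
        if sym in length:
            return length[sym], counts[sym]
        rhs = rules[sym]
        if isinstance(rhs, str):                    # terminal rule
            length[sym], counts[sym] = 1, Counter(rhs)
        else:
            l1, c1 = walk(rhs[0])
            l2, c2 = walk(rhs[1])
            length[sym], counts[sym] = l1 + l2, c1 + c2
        return length[sym], counts[sym]
    walk("S")
    return length, counts

LEN, CNT = preprocess(rules)

def access(sym, i):
    """Return S[i] (0-based) by descending the parse tree of `sym`."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return rhs
    left = LEN[rhs[0]]
    return access(rhs[0], i) if i < left else access(rhs[1], i - left)

def rank(c, sym, i):
    """Number of occurrences of symbol c in S[0..i-1] under nonterminal `sym`."""
    if i <= 0:
        return 0
    rhs = rules[sym]
    if isinstance(rhs, str):
        return 1 if rhs == c else 0
    left = LEN[rhs[0]]
    if i <= left:
        return rank(c, rhs[0], i)
    return CNT[rhs[0]][c] + rank(c, rhs[1], i - left)

print(access("S", 4), rank("a", "S", 5))   # S = "ababab" -> prints 'a' and 3
```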

    Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling

    Recent years have witnessed the drastic development of the World Wide Web (WWW). Information is accessible at one's fingertips, anytime and anywhere, through the massive web repository. The performance and reliability of web search engines therefore face huge problems due to the presence of an enormous amount of web data. The voluminous amount of web documents has created problems for search engines, leading to search results of less relevance to the user. In addition, the presence of duplicate and near-duplicate web documents creates additional overhead for search engines, critically affecting their performance. The demand for integrating data from heterogeneous sources also leads to the problem of near-duplicate web pages. The detection of near-duplicate documents within a collection has recently become an area of great interest. In this research, we present an efficient approach for the detection of near-duplicate web pages in web crawling which uses keywords and a distance measure. In addition, G. S. Manku et al.'s fingerprint-based approach proposed in 2007 is considered one of the "state-of-the-art" algorithms for finding near-duplicate web pages. We have implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and G. S. Manku et al.'s fingerprint-based approach. We analyze our results in terms of time complexity, space complexity, memory usage and the confusion matrix parameters. Taking these performance factors into account, the comparison clearly portrays our approach as the better (less complex) of the two. DOI: http://dx.doi.org/10.11591/ijece.v2i6.1746
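
    Manku et al.'s fingerprint approach builds on Charikar's simhash: each document is condensed into a short fingerprint, and two documents are flagged as near-duplicates when their fingerprints differ in only a few bit positions. Below is a minimal illustrative simhash in Python; the 64-bit width, the whitespace tokenizer and the Hamming-distance threshold of 3 are assumptions for the sketch, not parameters from the paper.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint from whitespace-separated tokens."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Add +1 when the i-th bit of the token hash is set, -1 otherwise.
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(vector):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
f1, f2 = simhash(doc1), simhash(doc2)
# Flag as near-duplicates if the fingerprints differ in at most k bits (k = 3 here).
print(hamming(f1, f2), hamming(f1, f2) <= 3)
```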

    A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

    The drastic growth of the information accessible on the World Wide Web has made the employment of automated tools to locate resources of interest, and to track and analyze them, a necessity. Web mining is the branch of data mining that deals with the analysis of the World Wide Web. It draws on concepts from several areas, such as data mining, Internet technology, the World Wide Web and, more recently, the Semantic Web. Web mining can be defined as the procedure of discovering hidden yet potentially beneficial knowledge from the data accessible on the web. It comprises three subareas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from web pages and other web objects. Web structure mining is the process of mining knowledge about the link structure connecting web pages and other web objects. Web usage mining is the process of mining the usage patterns created by users accessing web pages. Search engine technology has contributed to the development of the World Wide Web. Search engines are the chief gateways for access to information on the web, and the ability to locate content of particular interest amid a huge heap of data has made businesses beneficial and productive. Search engines respond to queries by relying on web crawling, which populates an indexed repository of web pages: crawler programs construct a confined repository of the segment of the web that they visit by navigating the web graph and retrieving pages. There are two main types of crawling, namely generic and focused crawling. Generic crawlers crawl documents and links of diverse topics, whereas focused crawlers limit the number of pages crawled with the aid of previously obtained specialized knowledge. Systems that index, mine, and otherwise analyze pages (such as search engines) take their input from the repositories of web pages built by web crawlers. The drastic growth of the Internet and the growing necessity to integrate heterogeneous data are accompanied by the issue of near-duplicate data. Even though near-duplicate documents are not bit-wise identical, they are remarkably similar. Duplicate and near-duplicate web pages either increase index storage space or slow down serving and increase its cost, which annoys users and causes serious problems for web search engines. Hence it is essential to design algorithms to detect such pages.
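
    As a rough illustration of the generic crawling process described above (not code from the survey), the sketch below performs a breadth-first traversal of the web graph from a seed URL, fetching pages into a small repository and following extracted links; the seed URL, page cap and timeout are arbitrary assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def generic_crawl(seed, max_pages=20):
    """Breadth-first crawl: fetch pages, store them, and follow their links."""
    frontier = deque([seed])
    visited, repository = set(), {}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # unreachable or undecodable page
        repository[url] = html            # the "confined repository" of visited pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))
    return repository

pages = generic_crawl("https://example.com/")   # assumed seed URL
print(len(pages), "pages fetched")
```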

    Development of a quiz generator for “MOODLE” based on an ontology

    A knowledge-based information technology that solves the problem of automatically generating quizzes and grouping them according to the hierarchy of domain concepts was developed. Existing methods of ontology representation and available tools for automatic quiz generation were investigated. Based on an analysis of the formats and technologies used by the existing software tools Protégé and MOODLE, guidelines and recommendations for constructing the domain ontology were developed. The requirements were formulated, and the class of ontologies that can be used for further processing by the proposed methods was identified. A software tool, a quiz generator, was developed. Using quizzes in the educational process makes it possible to quickly check the knowledge of large groups of students, to monitor educational achievements, and to reduce data processing time. However, developing effective and verified quizzes is quite a time-consuming process that contains a lot of routine work. Four target types of closed-form quizzes suitable for automatic processing were identified. The generated quizzes are verified: since the questions are generated automatically from true statements, they contain no errors, the generated quiz does not need to be checked for mistakes, and the number of questions is sufficient not only for assessment but also for learning. The developed technology will make it possible to increase the number of educational quizzes, freeing the teacher from routine work in favor of its creative component and thereby enhancing the quality of education.
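
    To illustrate the general idea of generating verified closed-form questions from true statements in an ontology (this is a hypothetical sketch, not the algorithm or data format used in the described tool), the code below takes individual–property–value statements grouped under a concept and turns each into a single-choice question whose correct answer is the stated value and whose distractors are values borrowed from sibling individuals of the same concept.

```python
import random

# Hypothetical mini-ontology: concept -> individual -> {property: value}
ontology = {
    "Planet": {
        "Mercury": {"orbital position": "1st from the Sun"},
        "Venus":   {"orbital position": "2nd from the Sun"},
        "Earth":   {"orbital position": "3rd from the Sun"},
        "Mars":    {"orbital position": "4th from the Sun"},
    },
}

def generate_single_choice(ontology, n_options=4, seed=0):
    """Turn every true (individual, property, value) statement into a question.

    The correct answer comes from the statement itself, so the question is
    verified by construction; distractors are values of the same property
    taken from sibling individuals of the same concept.
    """
    rng = random.Random(seed)
    quizzes = {}
    for concept, individuals in ontology.items():
        questions = []
        for name, props in individuals.items():
            for prop, value in props.items():
                siblings = [p[prop] for ind, p in individuals.items()
                            if ind != name and prop in p]
                distractors = rng.sample(siblings, min(n_options - 1, len(siblings)))
                options = distractors + [value]
                rng.shuffle(options)
                questions.append({
                    "question": f"What is the {prop} of {name}?",
                    "options": options,
                    "answer": value,
                })
        quizzes[concept] = questions   # grouped by concept, following the hierarchy
    return quizzes

for q in generate_single_choice(ontology)["Planet"]:
    print(q["question"], q["options"], "->", q["answer"])
```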

    Evaluating text reuse discovery on the web


    Detecting and Analyzing Text Reuse with BLAST

    In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biomedical sequences, such as DNA and protein sequences. I explain the original BLAST algorithm in depth by going through it step by step, and I also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST-based text reuse detection method by comparing it against the previous state of the art and show that the proposed method beats it by a large margin. I apply the method to a dataset of 3 million documents of scanned Finnish newspapers and journals, which have been turned into text using OCR (Optical Character Recognition) software. I group the results into three categories: everyday text reuse, long-term reuse, and viral news. I describe each category, provide examples, and propose a novel method of calculating a virality score for the clusters.
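
    BLAST's core strategy, carried over to text, is seed-and-extend: find short exact matches (seeds) between two character sequences and extend each seed while the running alignment score stays close to its best value so far. The sketch below is a simplified, character-level illustration of that strategy with assumed scoring parameters (seed length 10, match +1, mismatch -1, drop-off 5); it is not the actual BLAST-based pipeline used in the thesis.

```python
def find_seeds(a: str, b: str, k: int = 10):
    """Yield positions (i, j) of exact k-character matches between a and b."""
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            yield i, j

def extend(a: str, b: str, i: int, j: int, k: int,
           match: int = 1, mismatch: int = -1, drop: int = 5):
    """Extend a seed to the right until the score falls `drop` below its maximum."""
    score = best = k * match
    end_a, end_b = i + k, j + k
    x, y = end_a, end_b
    while x < len(a) and y < len(b) and score > best - drop:
        score += match if a[x] == b[y] else mismatch
        x, y = x + 1, y + 1
        if score > best:
            best, end_a, end_b = score, x, y
    return a[i:end_a], b[j:end_b], best

text1 = "the senate passed the bill after a long debate on tuesday evening"
text2 = "reports say the senate passed the bill after a short recess yesterday"
hits = [extend(text1, text2, i, j, 10) for i, j in find_seeds(text1, text2)]
# Keep the highest-scoring local alignment found from any seed.
print(max(hits, key=lambda h: h[2]))
```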

    Next steps in near-duplicate detection for eRulemaking


    Product-Process Coupling to Enable Continuous Improvement of Assembly Processes

    The objective of this research is to couple product and process design knowledge to enable continuous improvement of assembly processes. Specifically, the use of assembly solid model similarity to mine databases and retrieve assembly process information is investigated. Nine techniques for computing solid model similarity from the literature are investigated for their correlation with human interpretation of assembly model similarity. A method of computing solid model similarity using frequency distributions of tessellation areas is developed and investigated. For each of the nine solid model similarity methods, the results from using component solid model similarity in conjunction with assembly model similarity are compared to the results when only assembly model similarity is used. A survey is conducted to gather human interpretation of assembly solid models from the perspective of assembly process similarity. From the tests conducted, it is found that the method using tessellation area distributions has weak correlation with human interpretation of assembly solid model similarity from the perspective of assembly processes. The D1 method, which uses the distance between the centroid and random points on the surface of solid models, was found to have the highest correlation with the survey results. The use of component model similarity in conjunction with similarity of the assembly model was found to improve the precision of the solid model similarity methods. Text similarity techniques from the literature are investigated for their correlation with human interpretation of assembly work instruction similarity. Through testing, Latent Semantic Analysis is chosen as the method of computing assembly work instruction similarity, since it has a moderately positive correlation with the survey results and is less sensitive to the use of synonyms than the three other methods of text similarity investigated in this research. The Jaccard method of computing similarity is inherently a measure of consistency in the terminology used between the two texts being compared, and this can be used to provide decision support while engineers author assembly work instructions, allowing authors to understand the level of consistency between their work instructions and the other work instructions within the specific enterprise. The D1 method of computing solid model similarity and Latent Semantic Analysis for assembly work instruction similarity are used to compare assembly solid models and assembly work instructions obtained from a survey in which participants were presented with assembly solid models and asked to author assembly work instructions. The correlation between the solid model similarity scores and the assembly work instruction similarity scores (within and across participants) indicates that, regardless of who authors the assembly work instructions, assembly solid models and assembly work instructions share a moderately positive correlation. These results, coupled with the understanding that the causation between assembly work instructions and solid models is limited to those work instructions which describe the handling and mating of components, can be used for process design knowledge retrieval and reuse. The results from this research can be used to mine databases by solid model similarity and retrieve assembly work instructions. This will couple product design and assembly process design and allow for continuous improvement of the latter.
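
    The D1 measure mentioned above belongs to the family of shape-distribution methods: sample random points on the surface of a tessellated model, record their distances to the model's centroid, and compare the resulting histograms. The following is a minimal sketch of that idea for triangle meshes; the uniform area-weighted sampling, the number of samples and bins, and the L1 histogram distance are illustrative assumptions rather than parameters taken from this research.

```python
import numpy as np

def sample_surface_points(vertices, triangles, n_samples=2000, rng=None):
    """Sample points uniformly on a triangle mesh, weighting triangles by area."""
    rng = rng or np.random.default_rng(0)
    v0, v1, v2 = (vertices[triangles[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    tri = rng.choice(len(triangles), size=n_samples, p=areas / areas.sum())
    # Random barycentric coordinates give uniform points within each triangle.
    r1, r2 = rng.random(n_samples), rng.random(n_samples)
    s1 = np.sqrt(r1)
    return ((1 - s1)[:, None] * v0[tri]
            + (s1 * (1 - r2))[:, None] * v1[tri]
            + (s1 * r2)[:, None] * v2[tri])

def d1_signature(vertices, triangles, bins=32):
    """D1 shape distribution: histogram of distances from centroid to surface points."""
    points = sample_surface_points(vertices, triangles)
    dists = np.linalg.norm(points - points.mean(axis=0), axis=1)
    hist, _ = np.histogram(dists / dists.max(), bins=bins, range=(0, 1))
    return hist / hist.sum()

def d1_similarity(sig_a, sig_b):
    """Similarity in [0, 1] from the L1 distance between two D1 histograms."""
    return 1.0 - 0.5 * np.abs(sig_a - sig_b).sum()

# Toy example: a unit cube mesh compared with itself.
verts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
faces = np.array([[0, 1, 3], [0, 3, 2], [4, 5, 7], [4, 7, 6],
                  [0, 1, 5], [0, 5, 4], [2, 3, 7], [2, 7, 6],
                  [0, 2, 6], [0, 6, 4], [1, 3, 7], [1, 7, 5]])
print(d1_similarity(d1_signature(verts, faces), d1_signature(verts, faces)))
```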

    Detection of duplicates in large-scale web databases

    This master thesis analyses the methods used for duplicate document detection and the possibilities of integrating them into a web search engine. It offers an overview of commonly used methods, from which it chooses approximation of the Jaccard similarity measure in combination with shingling. The chosen method is adapted for implementation in the Egothor web search engine environment. The aim of the thesis is to present this implementation, describe its features, and find the parameters that best allow the detection to run in real time. An important feature of the described method is also the ability to apply dynamic changes to the collection of indexed documents. Department of Software Engineering, Faculty of Mathematics and Physics.
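
    Approximating the Jaccard similarity of shingle sets is commonly done with min-hashing: each document is reduced to the set of its k-word shingles, and the Jaccard similarity of two documents is estimated as the fraction of seeded hash functions under which both shingle sets attain the same minimum. The sketch below illustrates this combination in Python; the shingle length, the number of hash functions, and the duplicate threshold are assumptions, not the parameters chosen in the thesis.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set: set, num_hashes: int = 128) -> list:
    """MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumped over the lazy dog near the river bank"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
# Estimate of |A ∩ B| / |A ∪ B|; flag as duplicates above a chosen threshold (e.g. 0.9).
print(estimated_jaccard(sig_a, sig_b))
```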