7 research outputs found

    Ubicrawler: a scalable fully distributed web crawler

    Get PDF
    We present the design and implementation of UbiCrawler, a scalable distributed web crawler, and we analyze its performance. The main features of UbiCrawler are platform independence, fault tolerance, a very effective assignment function for partitioning the domain to crawl, and more in general the complete decentralization of every task

    Mutable strings in Java: design, implementation and lightweight text-search algorithms

    Get PDF
    AbstractThe Java string classes, String and StringBuffer, lie at the extremes of a spectrum (immutable, reference based, and mutable, content based). Analogously, available text-search methods on string classes are implemented either as trivial, brute-force double loops, or as very sophisticated and resource-consuming regular-expression search methods. Motivated by our experience in data-intensive text applications, we propose a new string class, MutableString, which tries to get the right balance between extremes in both cases. Mutable strings can be in one of two states, compact and loose, in which they behave more like String and StringBuffer, respectively. Moreover, they support a wide range of sophisticated text-search algorithms with a very low resource usage and set-up time, using a new, very simple randomised data structure (a generalisation of Bloom filters) that stores an approximation from above of a lattice-valued function. Computing the function value requires a constant number of steps, and the error probability can be balanced with space usage. As a result, we obtain practical implementations of Boyer–Moore type algorithms that can be used with very large alphabets, such as Unicode collation elements. The techniques we develop are very general and amenable to a wide range of applications

    Comprehensive synchronization elimination for Java

    Get PDF
    AbstractIn this paper, we describe three novel analyses for eliminating unnecessary synchronization that remove over 70% of dynamic synchronization operations on the majority of our 15 benchmarks and improve the bottom-line performance of three by 37–53%. Our whole-program analyses attack three frequent forms of unnecessary synchronization: thread-local synchronization, reentrant synchronization, and enclosed lock synchronization. We motivate the design of our analyses with a study of the kinds of unnecessary synchronization found in a suite of single- and multi-threaded benchmarks of different sizes and drawn from a variety of domains. We analyze the performance of our optimizations in terms of dynamic operations removed and run-time speedup. We also show that our analyses may enable the use of simpler synchronization models than the model found in Java, at little or no additional cost in execution time. The synchronization optimizations, we describe enable programmers to design efficient, reusable and maintainable libraries and systems in Java without cumbersome manual code restructuring

    An evaluation of Java's I/O capabilities for high-performance computing

    Full text link
    Java is quickly becoming the preferred language for writing distributed applications because of its inherent support for programming on distributed platforms. In particular, Java provides compile-time and run-time security, automatic garbage collection, inherent support for multithreading, support for persistent objects and object migration, and portability. Given these significant advantages of Java, there is a growing interest in using Java for high-performance computing applications. To be successful in the high-performance computing domain, however, Java must have the capability to efficiently handle the significant I/O requirements commonly found in high-performance computing applications. While there has been significant research in high-performance I/O using languages such as C, C++, and Fortran, there has been relatively little research into the I/O capabilities of Java. In this paper, we evaluate the I/O capabilities of Java for high-performance computing. We examine several approaches that attempt to provide high-performance I/O--many of which are not obvious at first glance--and investigate their performance in both parallel and multithreaded environments. We also provide suggestions for expanding the I/O capabilities of Java to better support the needs of high-performance computing applications

    Automatic Generation of Thematically Focused Information Portals from Web Data

    Get PDF
    Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents the methodology of automatic generation of thematically focused portals from Web data. The key component of the proposed Web retrieval framework is the thematically focused Web crawler that is interested only in a specific, typically small, set of topics. The focused crawler uses classification methods for filtering of fetched documents and identifying most likely relevant Web sources for further downloads. We show that the human efforts for preparation of the focused crawl can be minimized by automatic extending of the training dataset using additional training samples coined archetypes. This thesis introduces the combining of classification results and link-based authority ranking methods for selecting archetypes, combined with periodical re-training of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss results of comprehensive use-case studies and evaluations with a prototype system BINGO!. Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinements of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in an restrictive manner, i.e. by leaving out some uncertain documents rather than assigning them to inappropriate topics or clusters with low confidence. We also introduce the methodology of collaborative crawl postprocessing for multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network. An important aspect of the thematically focused Web portal is the ranking of search results. This thesis addresses the aspect of search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns by profiles. Furthermore, we consider advanced link-based authority ranking algorithms that exploit the crawl-specific information, such as classification confidence grades for particular documents. This goal is achieved by weighting of edges in the link graph of the crawl and by adding virtual links between highly relevant documents of the topic. The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology

    Performance Limitations of the Java Core Libraries

    No full text
    Unlike applets, traditional systems programs written in Java place significant demands on the Java runtime and core libraries, and their performance is often critically important. This paper describes our experiences using Java to build such a systems program, namely, a high-performance web crawler. We found that our runtime, which includes a just-in-time compiler that compiles Java bytecodes to native machine code, performed well. However, we encountered several performance problems with the Java core libraries, including excessive synchronization, excessive allocation, and other performance problems. The paper describes the most serious pitfalls and how we programmed around them. In total, these workarounds more than doubled the speed of our crawler

    Performance limitations of the Java core libraries

    Full text link
    corecore