    A Brief History of Web Crawlers

    Web crawlers visit web applications, collect data, and discover new web pages from the pages they have already visited. Web crawlers have a long and interesting history. Early crawlers collected statistics about the web; in addition to collecting statistics and indexing applications for search engines, modern crawlers can also be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the growing complexity of web applications have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the issues and challenges that crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, and automatically capturing the model of a modern web application and extracting data from it is another open question. What follows is a brief history of the techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers and, based on these criteria, trace the evolution of web crawlers and compare their performance.
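
    As a minimal, hedged illustration of the crawling loop described above (a generic sketch, not any particular crawler from this survey), the following Python fragment maintains a frontier of URLs, fetches pages, and discovers new links. The requests and BeautifulSoup libraries, the seed URL, and the page limit are assumptions made for the example.

        # Minimal breadth-first crawler: a frontier of URLs, a visited set, and
        # link extraction from fetched pages. Politeness, robots.txt handling,
        # and JavaScript-heavy applications are deliberately out of scope.
        from collections import deque
        from urllib.parse import urljoin

        import requests                    # assumed HTTP client
        from bs4 import BeautifulSoup      # assumed HTML parser

        def crawl(seed_url, max_pages=50):
            frontier = deque([seed_url])   # discovered but not yet visited
            visited = set()
            while frontier and len(visited) < max_pages:
                url = frontier.popleft()
                if url in visited:
                    continue
                visited.add(url)
                try:
                    page = requests.get(url, timeout=5)
                except requests.RequestException:
                    continue               # skip unreachable pages
                soup = BeautifulSoup(page.text, "html.parser")
                for anchor in soup.find_all("a", href=True):
                    link = urljoin(url, anchor["href"])
                    if link not in visited:
                        frontier.append(link)   # learn about new pages from visited ones
            return visited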

    An Overlay Architecture for Personalized Object Access and Sharing in a Peer-to-Peer Environment

    Due to its exponential growth and decentralized nature, the Internet has evolved into a chaotic repository, making it difficult for users to discover and access resources of interest to them. As a result, users have to deal with the problem of information overload. The emergence of the Semantic Web provides Internet users with the ability to associate explicit, self-described semantics with resources. This ability will in turn facilitate the development of ontology-based resource discovery tools to help users retrieve information efficiently. However, it is widely believed that the Semantic Web of the future will be a complex web of smaller ontologies, mostly created by groups of web users who share a similar interest, referred to as Communities of Interest. This thesis proposes a solution to the information overload problem using a user-driven framework, referred to as a Personalized Web, that allows individual users to organize themselves into Communities of Interest based on ontologies agreed upon by all community members. Within this framework, users can define and augment their personalized views of the Internet by associating specific properties and attributes with resources and by defining constraint-functions and rules that govern the interpretation of the semantics associated with those resources. Such views capture the user's interests and can be integrated into a user-defined Personalized Web. As a proof of concept, a Personalized Web architecture is developed that employs ontology-based semantics and a structured Peer-to-Peer overlay network to provide a foundation for semantically-based resource indexing and advertising. To investigate mechanisms that support resource advertising and retrieval in the Personalized Web architecture, three agent-driven advertising and retrieval schemes, the Aggressive scheme, the Crawler-based scheme, and the Minimum-Cover-Rule scheme, were implemented and evaluated in both stable and churn environments. In addition to the development of a Personalized Web architecture that deals with typical web resources, this thesis used a case study to explore the potential of the architecture to support future web service workflow applications. The results demonstrated that the architecture can support the automation of service discovery, negotiation, and invocation, allowing service consumers to actualize a personalized web service workflow. Further investigation will be required to improve the performance of this automation and to allow it to be performed in a secure and robust manner. To support the next-generation Internet, further exploration will be needed to develop a Personalized Web that includes ubiquitous and pervasive resources
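
    The following toy sketch illustrates only the general idea of semantically-based resource indexing over a structured overlay: publishing a resource under the hash of an ontology concept so that community members can retrieve it by concept. It is an assumption-laden stand-in, not the thesis's Aggressive, Crawler-based, or Minimum-Cover-Rule schemes, and all class and function names are hypothetical.

        # Toy in-process stand-in for a structured P2P overlay: resources are
        # advertised under the hash of an ontology concept, so peers sharing the
        # concept can look them up by the same key.
        import hashlib
        from collections import defaultdict

        def concept_key(concept):
            """Map an ontology concept (e.g. 'photography/landscape') to a 160-bit key."""
            return int(hashlib.sha1(concept.encode()).hexdigest(), 16)

        class ToyOverlay:
            def __init__(self):
                self.store = defaultdict(list)     # key -> advertised resources

            def advertise(self, concept, resource):
                self.store[concept_key(concept)].append(resource)

            def retrieve(self, concept):
                return self.store[concept_key(concept)]

        overlay = ToyOverlay()
        overlay.advertise("community:photography/landscape", "http://example.org/gallery")
        print(overlay.retrieve("community:photography/landscape"))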

    OverCite: a cooperative digital research library

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (leaves 47-50). CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources. OverCite is a design for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture harnesses donated resources at many sites to provide document search and retrieval service to researchers worldwide. A preliminary evaluation of an initial OverCite prototype shows that it can service more queries per second than a centralized system, and that it increases total storage capacity by a factor of n/4 in a system of n nodes. OverCite can exploit these additional resources by supporting new features such as document alerts, and by scaling to larger data sets. By Jeremy Stribling. S.M.
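
    As a hedged sketch of the DHT-based placement idea (how donated nodes could each take responsibility for a slice of the document key space), the fragment below uses a toy consistent-hashing ring. The node names, hash function, and successor rule are illustrative assumptions, not OverCite's actual design, and the n/4 capacity figure is not derived here.

        # Toy consistent-hashing ring: each donated node owns the key-space slice
        # following its own hash, and a document is stored on the node whose
        # point is the successor of the document's content hash.
        import hashlib
        from bisect import bisect

        def key(data):
            return int(hashlib.sha1(data).hexdigest(), 16)

        class ToyRing:
            def __init__(self, node_names):
                self.ring = sorted((key(name.encode()), name) for name in node_names)

            def node_for(self, doc_key):
                points = [point for point, _ in self.ring]
                return self.ring[bisect(points, doc_key) % len(self.ring)][1]

        ring = ToyRing(["node-a.example.edu", "node-b.example.edu", "node-c.example.edu"])
        paper = b"%PDF-1.4 ...contents of a research paper..."
        print(ring.node_for(key(paper)))    # which donated node stores this document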

    Confidential Data-Outsourcing and Self-Optimizing P2P-Networks: Coping with the Challenges of Multi-Party Systems

    This work addresses the inherent lack of control and trust in Multi-Party Systems, using the Database-as-a-Service (DaaS) scenario and public Distributed Hash Tables (DHTs) as examples. In the DaaS field, it is shown how confidential information in a database can be protected while still allowing the external storage provider to process incoming queries. For public DHTs, it is shown how these highly dynamic systems can be managed by facilitating monitoring, simulation, and self-adaptation
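
    One common building block in this area is equality-searchable outsourcing via keyed deterministic tokens; the sketch below shows that general idea only and is not necessarily the scheme developed in this work. The column name, key handling, and example data are illustrative assumptions.

        # Equality-searchable outsourcing with keyed deterministic tokens: the
        # client stores HMAC(key, value) instead of the plaintext, so the
        # untrusted provider can answer equality queries by comparing tokens
        # without ever seeing the data. Range queries are out of scope here.
        import hashlib
        import hmac

        SECRET_KEY = b"client-side secret"        # never leaves the client

        def search_token(value):
            return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

        # Client side: outsource tokenised rows to the provider.
        outsourced_rows = [{"id": 1, "diagnosis": search_token("diabetes")},
                           {"id": 2, "diagnosis": search_token("asthma")}]

        # Provider side: evaluate "WHERE diagnosis = ?" purely on tokens.
        query_token = search_token("asthma")      # computed by the client
        print([row["id"] for row in outsourced_rows if row["diagnosis"] == query_token])  # [2]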

    DART: the distributed agent-based retrieval toolkit

    The technology of search engines is evolving from keyword-based indexing and classification of web resources towards more sophisticated techniques that take into account the meaning, context, and usage of textual information. When replying to a query, commercial search engines confront the user with a large number of results, mostly useless or only partially related to the request; the subsequent refinement, performed by downloading and examining as many pages as possible and simply ignoring whatever lies beyond the first few pages, is left to the user. Furthermore, architectures based on centralized indexes allow commercial search engines to control the advertisement of online information, in contrast to P2P architectures, which focus on user requirements and involve the end user in search engine maintenance and operation. To address these needs, new search engines should focus on three key aspects: semantics, geo-referencing, and collaboration/distribution. Semantic analysis increases the relevance of results. Geo-referencing of catalogued resources allows contextualisation based on the user's position. Collaboration distributes storage, processing, and trust across a world-wide network of nodes running on users' computers, getting rid of bottlenecks and central points of failure. In this paper, we describe the studies, concepts, and solutions developed in the DART project to introduce these three key features into a novel search engine architecture
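
    A minimal sketch of how the geo-referencing and distribution aspects could interact: combine a search term with a coarse geographic cell to form a single key under which resources are published on the overlay. The grid resolution, key layout, and coordinates are illustrative assumptions, not DART's actual scheme.

        # Combine a search term with a coarse geographic cell into one index key,
        # so a query issued near the publisher's position hashes to the same key.
        import hashlib

        def geo_cell(lat, lon, cell_deg=1.0):
            """Snap a coordinate to a coarse grid cell identifier."""
            return f"{int(lat // cell_deg)}:{int(lon // cell_deg)}"

        def index_key(term, lat, lon):
            """Key under which a geo-referenced resource is published on the overlay."""
            return hashlib.sha1(f"{term}|{geo_cell(lat, lon)}".encode()).hexdigest()

        # Two nearby users publishing and searching "restaurant" meet at the same key.
        print(index_key("restaurant", 45.07, 7.68) == index_key("restaurant", 45.06, 7.67))  # True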

    Blocks' Network: Redesign Architecture based on Blockchain Technology

    The Internet is a global network that uses communication protocols. It is considered the most important system humanity has built, and one that no one can abandon. However, this technology has become a weapon that threatens the privacy of users, especially in the client-server model, where data is stored and managed privately. Users have no power over data stored on a private server, which means their data may be intercepted by governments or sold by the service provider for profit. Blockchain is a technology that, if appropriately used, can solve issues related to the client-server model. However, blockchain technology relies on a consensus protocol, which creates an incontrovertible agreement between members across a distributed network. The consensus protocol deliberately slows members down from generating blocks too fast in order to control the network's block-creation pattern, which leads to scalability and latency problems. The proposed system presents a platform that leverages a modernized blockchain called the Blocks' Network. The design takes into consideration the privacy and confidentiality issues of the client-server model, and the scalability and latency issues of blockchain technology. The Blocks' Network is a public, permissioned network that uses a multi-dimensional hash to generate multiple chains in a systematic way. The proposed platform is an assembly point for users to create a decentralized network using P2P protocols. The system has a high data flow due to frequent use by participants (for example, general Internet use). In addition, the system stores all network traffic openly on the Blocks' Network
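
    Since the abstract only names the mechanism, the following is a loosely hedged sketch of one possible reading of "a multi-dimensional hash to generate multiple chains": blocks are routed to one of several parallel chains by hashing a chosen dimension (here the sender identifier). The dimension, chain count, and block layout are assumptions for illustration only, not the system's actual design.

        # Blocks are routed to one of several parallel chains by hashing a chosen
        # dimension (here the sender identifier); each chain links blocks by hash.
        import hashlib
        import json
        import time

        NUM_CHAINS = 4
        chains = [[] for _ in range(NUM_CHAINS)]

        def chain_index(sender_id):
            return int(hashlib.sha256(sender_id.encode()).hexdigest(), 16) % NUM_CHAINS

        def append_block(sender_id, payload):
            idx = chain_index(sender_id)
            prev = chains[idx][-1]["hash"] if chains[idx] else "0" * 64
            body = json.dumps({"prev": prev, "payload": payload, "ts": time.time()})
            block = {"prev": prev, "payload": payload,
                     "hash": hashlib.sha256(body.encode()).hexdigest()}
            chains[idx].append(block)
            return block

        append_block("alice", {"visited": "example.org"})
        print(len(chains[chain_index("alice")]))   # 1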

    A peer-to-peer approach to large-scale information monitoring

    Issued as final report. National Science Foundation (U.S.)

    Empirical and Analytical Perspectives on the Robustness of Blockchain-related Peer-to-Peer Networks

    The inception of Bitcoin has sparked a large interest in decentralized systems. In particular, popular narratives imply that decentralization automatically leads to high security and resilience against attacks, even against powerful adversaries. In this thesis, we investigate whether these ascriptions are appropriate and whether decentralized applications are as robust as they are made out to be. To this end, we exemplarily analyze three widely used systems that function as building blocks for blockchain applications: Ethereum as basic infrastructure, IPFS for distributed storage, and lastly "stablecoins" as tokens with a stable value. As recurring building blocks for decentralized applications, these examples significantly determine the security and resilience of the overall application. Furthermore, focusing on these building blocks allows us to look past individual applications and examine inherent systemic properties. The analysis is driven by a strong empirical, mostly network-layer based perspective, enriched with an economic point of view in the context of monetary stabilization. The resulting practical understanding allows us to delve into the systems' inherent properties. The fundamental results of this thesis include the demonstration of a network-layer Eclipse attack on the Ethereum overlay, which can be leveraged to impede the delivery of transactions and blocks, with dire consequences for applications built on top of Ethereum. Furthermore, we extensively map the IPFS network through (1) systematic crawling of its DHT and (2) monitoring of content requests. We show that while IPFS' hybrid overlay structure renders it quite robust against attacks, this virtue of the overlay is simultaneously a curse, as it allows for extensive monitoring of participating peers and the data they request. Lastly, we exchange the network-layer perspective for a mostly economic one in the context of monetary stabilization. We present a classification framework to (1) map out the stablecoin landscape and (2) provide a means to pigeonhole future system designs. With our work we not only scrutinize the ascriptions attributed to decentralized technologies; we also reached out to IPFS and Ethereum developers to discuss our results and remedy potential attack vectors
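
    As a generic, hedged sketch of the kind of systematic DHT crawl mentioned above (not the authors' actual measurement tooling), the fragment below repeatedly asks known peers for nodes close to random targets and continues until no new peers appear. The find_node callable stands in for the network RPC and must be supplied by the caller; its implementation is an assumption, not shown here.

        # Systematic Kademlia-style crawl: repeatedly query known peers for nodes
        # close to random targets and enqueue every previously unseen peer.
        import random
        from typing import Callable, Iterable, Set

        def crawl_dht(bootstrap_peers: Iterable[str],
                      find_node: Callable[[str, int], Iterable[str]],
                      queries_per_peer: int = 16) -> Set[str]:
            known = set(bootstrap_peers)
            unvisited = list(known)
            while unvisited:
                peer = unvisited.pop()
                for _ in range(queries_per_peer):
                    target = random.getrandbits(256)        # random point in the ID space
                    for neighbour in find_node(peer, target):
                        if neighbour not in known:
                            known.add(neighbour)
                            unvisited.append(neighbour)      # crawl newly learned peers too
            return known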

    A Location-Aware Middleware Framework for Collaborative Visual Information Discovery and Retrieval

    This work addresses the problem of scalable location-aware distributed indexing, enabling collaborative effort to be leveraged for the construction and maintenance of world-scale visual maps and models that could support numerous activities including navigation, visual localization, persistent surveillance, structure from motion, and hazard or disaster detection. Current distributed approaches to mapping and modeling fail to incorporate global geospatial addressing and are limited in their ability to customize search. Our solution is a peer-to-peer middleware framework based on XOR-distance routing that employs a Hilbert space-filling curve addressing scheme in a novel distributed geographic index. This allows for a universal addressing scheme supporting publish and search in dynamic environments while ensuring global availability of the model and scalability with respect to geographic size and number of users. The framework is evaluated using large-scale network simulations and a search application that supports visual navigation in real-world experiments
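
    A hedged sketch of the addressing idea: quantise a coordinate onto a grid, map it to a Hilbert space-filling curve index so nearby locations tend to receive nearby keys, and use that index as an identifier for XOR-distance routing. The grid order, quantisation, and demo coordinates are illustrative assumptions rather than the framework's actual parameters.

        # Map a coordinate onto a Hilbert space-filling curve index; the index can
        # then serve as (part of) an identifier for XOR-distance (Kademlia-like) routing.
        def xy_to_hilbert(n, x, y):
            """Classic Hilbert-curve mapping on an n x n grid (n a power of two)."""
            d, s = 0, n // 2
            while s > 0:
                rx = 1 if (x & s) > 0 else 0
                ry = 1 if (y & s) > 0 else 0
                d += s * s * ((3 * rx) ^ ry)
                if ry == 0:                       # rotate/reflect the quadrant
                    if rx == 1:
                        x, y = n - 1 - x, n - 1 - y
                    x, y = y, x
                s //= 2
            return d

        def geo_key(lat, lon, order=16):
            """Quantise lat/lon to a 2**order grid and return its Hilbert index."""
            n = 1 << order
            x = int((lon + 180.0) / 360.0 * (n - 1))
            y = int((lat + 90.0) / 180.0 * (n - 1))
            return xy_to_hilbert(n, x, y)

        key_a = geo_key(48.8566, 2.3522)
        key_b = geo_key(48.8570, 2.3530)
        print(key_a, key_b)   # nearby locations usually (not always) get nearby indices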

    Video-on-Demand over Internet: a survey of existing systems and solutions

    Video-on-Demand is a service where movies are delivered to distributed users with low delay and free interactivity. The traditional client/server architecture experiences scalability issues when providing video streaming services, so many systems have been proposed, mostly based on peer-to-peer or hybrid server/peer-to-peer solutions, to address this issue. This work presents a survey of existing and proposed systems, based on a subset of representative systems, and defines selection criteria that allow these systems to be classified. These criteria are based on common questions such as: is it video-on-demand or live streaming; is the architecture based on a content delivery network, peer-to-peer, or both; is the delivery overlay tree-based or mesh-based; is the system push-based or pull-based, single-stream or multi-stream; does it use data coding; and how do the clients choose their peers. Representative systems are briefly described to give a summarized overview of the proposed solutions, and four of them are analyzed in detail. Finally, we attempt to identify the most promising solutions for future experiments
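
    As a small, hedged illustration of the classification criteria listed above, the record type below captures the main axes as fields; the field names paraphrase the criteria and are not the survey's exact taxonomy, and the example system is hypothetical.

        # The survey's comparison axes expressed as a simple record type.
        from dataclasses import dataclass

        @dataclass
        class StreamingSystem:
            name: str
            service: str          # "video-on-demand" or "live"
            architecture: str     # "CDN", "P2P", or "hybrid"
            overlay: str          # "tree" or "mesh"
            scheduling: str       # "push" or "pull"
            streams: str          # "single" or "multi"
            data_coding: bool     # is data coding (e.g. network coding) used?

        example = StreamingSystem("ExampleSys", "video-on-demand", "hybrid",
                                  "mesh", "pull", "multi", data_coding=False)
        print(example.architecture)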