
    Enterprise Search Technology Using Solr and Cloud

    Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites. Databases and Solr have complementary strengths and weaknesses. SQL supports only very simple wildcard-based text search, with some basic normalization such as matching upper case to lower case; the problem is that these searches require full table scans. In Solr, all searchable words are stored in an inverted index, which makes searches orders of magnitude faster. Solr is a standalone/cloud enterprise search server with a REST-like API. You put documents into it (called indexing) via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results. The project will be implemented using Amazon/Azure cloud, Apache Solr, Windows/Linux, MS-SQL Server and open source tools.
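To make the contrast concrete, here is a minimal sketch in plain Python (a toy illustration, not Solr's actual implementation) of why an inverted index beats a wildcard table scan: each term maps directly to the documents containing it, so lookup is a single hash probe instead of an examination of every row.

```python
from collections import defaultdict

# Toy corpus: id -> text (stand-in for rows in a SQL table).
docs = {
    1: "solr is a fast search platform",
    2: "lucene powers full text search",
    3: "databases use table scans for wildcard search",
}

# SQL-style wildcard search: examine every row (a full table scan).
def scan_search(term):
    return [doc_id for doc_id, text in docs.items() if term in text.split()]

# Build an inverted index once: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Index lookup: one hash probe instead of scanning all rows.
def index_search(term):
    return sorted(index.get(term, set()))

print(scan_search("search"))   # → [1, 2, 3]
print(index_search("lucene"))  # → [2]
```

The index costs extra work at indexing time and extra storage, but every query thereafter avoids touching documents that cannot match, which is where the orders-of-magnitude speedup comes from.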

    A Critical Comparison of NOSQL Databases in the Context of ACID and BASE

    This starred paper will discuss two major types of databases – relational and NOSQL – and analyze the different models used by these databases. In particular, it will focus on whether the ACID or BASE model is more appropriate for NOSQL databases. NOSQL databases use the BASE model because they do not usually comply with the ACID model used by relational databases. However, some NOSQL databases adopt additional approaches and techniques to make the database comply with the ACID model. In this light, this paper will explore some of these approaches and explain why NOSQL databases cannot simply follow the ACID model. What are the reasons behind the extensive use of the BASE model? What are some of the advantages and disadvantages of not using ACID? Particular attention will be paid to analyzing whether one model is superior to the other. These questions will be answered by reviewing existing research conducted on NOSQL databases such as Cassandra, DynamoDB, MongoDB and Neo4j.
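The core of BASE (Basically Available, Soft state, Eventually consistent) can be shown with a toy two-replica store: writes land on one replica and propagate lazily, so reads may be stale until an anti-entropy pass converges the replicas. This is an illustrative sketch only, not the protocol of any particular database named above.

```python
import time

# Toy BASE-style store: two replicas; writes land on one and
# propagate lazily, so reads may be stale until convergence.
class Replica:
    def __init__(self):
        self.data = {}          # key -> (value, write timestamp)

    def write(self, key, value):
        self.data[key] = (value, time.monotonic())

    def read(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

def anti_entropy(a, b):
    # Merge replica states, last-writer-wins per key, so that
    # both replicas eventually hold the same values.
    for key in set(a.data) | set(b.data):
        va, vb = a.data.get(key), b.data.get(key)
        newest = max((x for x in (va, vb) if x), key=lambda x: x[1])
        a.data[key] = b.data[key] = newest

r1, r2 = Replica(), Replica()
r1.write("user:1", "alice")
print(r2.read("user:1"))   # None: replica 2 has not seen the write yet
anti_entropy(r1, r2)
print(r2.read("user:1"))   # "alice": replicas have converged
```

An ACID system would instead block the write (or the stale read) until both replicas agree, trading availability for consistency – exactly the trade-off the paper examines.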

    Data Storage and Dissemination in Pervasive Edge Computing Environments

    Nowadays, smart mobile devices generate huge amounts of data in all sorts of gatherings. Much of that data has localized and ephemeral interest, but can be of great use if shared among co-located devices. However, mobile devices often experience poor connectivity, leading to availability issues if application storage and logic are fully delegated to a remote cloud infrastructure. In turn, the edge computing paradigm pushes computation and storage beyond the data center, closer to the end-user devices where data is generated and consumed, enabling certain components of edge-enabled systems to run directly and cooperatively on edge devices. This thesis focuses on the design and evaluation of resilient and efficient data storage and dissemination solutions for pervasive edge computing environments, operating with or without access to the network infrastructure. In line with this dichotomy, our goal can be divided into two specific scenarios. The first is the absence of network infrastructure, addressed by providing a transient data storage and dissemination system for networks of co-located mobile devices. The second relates to the existence of network infrastructure access and the corresponding edge computing capabilities. First, the thesis presents time-aware reactive storage (TARS), a reactive data storage and dissemination model with intrinsic time-awareness that exploits synergies between the storage substrate and the publish/subscribe paradigm and allows queries within a specific time scope.
Next, it describes in more detail: i) Thyme, a data storage and dissemination system for wireless edge environments, implementing TARS; ii) Parsley, a flexible and resilient group-based distributed hash table with preemptive peer relocation and a dynamic data sharding mechanism; and iii) Thyme GardenBed, a framework for data storage and dissemination across multi-region edge networks that makes use of both device-to-device and edge interactions. The developed solutions present low overheads while providing adequate response times for interactive usage and low energy consumption, proving to be practical in a variety of situations. They also display good load balancing and fault tolerance properties.
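The time-scoped query idea behind TARS can be sketched as a store where every item carries a publication time and queries carry a time scope [start, end]. The class and method names below are illustrative assumptions, not Thyme's actual API.

```python
import bisect

# Minimal time-aware store: items are tagged with a publication time,
# and queries carry a time scope [start, end].
class TimeAwareStore:
    def __init__(self):
        self.items = []   # kept sorted as (timestamp, tag, payload)

    def publish(self, timestamp, tag, payload):
        bisect.insort(self.items, (timestamp, tag, payload))

    def query(self, tag, start, end):
        # Return payloads with a matching tag published within [start, end].
        lo = bisect.bisect_left(self.items, (start,))
        out = []
        for ts, t, payload in self.items[lo:]:
            if ts > end:
                break
            if t == tag:
                out.append(payload)
        return out

store = TimeAwareStore()
store.publish(10, "photo", "img_001")
store.publish(25, "photo", "img_002")
store.publish(40, "note", "txt_001")
print(store.query("photo", 0, 30))   # → ['img_001', 'img_002']
print(store.query("photo", 30, 60))  # → []
```

Keeping the items sorted by time lets a scoped query skip everything outside its window, which is what makes time-awareness cheap to support at the storage substrate rather than as a filter applied after retrieval.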

    Query scheduling techniques and power/latency trade-off model for large-scale search engines

    [Abstract] Web search engines have to deal with a rapid increase of information, demanded by high incoming query traffic. This situation has driven companies to build geographically distributed data centres housing thousands of computers, consuming enormous amounts of electricity and requiring a huge supporting infrastructure. At this scale, even minor efficiency improvements result in large financial savings. This thesis is a novel contribution to the state of the art in query scheduling and power consumption, helping large-scale data centres build more efficient search engines. On the one hand, it proposes new scheduling techniques that decrease query response time by estimating which server will be idle soonest. On the other hand, it defines a simple mathematical model that establishes a trade-off between the power and latency of a search engine. Using historical and current data, the model estimates the incoming query traffic and automatically increases/decreases the number of active machines in the system. We achieve high energy savings throughout the whole day without degrading latency. Our experiments attest to the effectiveness of both the scheduling methods and the power/latency trade-off model in improving efficiency and achieving high energy savings.
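The "idle soonest" scheduling idea can be sketched as a least-estimated-work assignment: keep each server's estimated idle time in a min-heap and send every incoming query to the server that will free up first. This is a generic illustration of the principle, not the thesis's actual algorithm.

```python
import heapq

# Toy scheduler: send each query to the server estimated to become
# idle soonest (smallest backlog of estimated work).
def schedule(queries, n_servers):
    # Min-heap of (estimated_idle_time, server_id).
    servers = [(0.0, s) for s in range(n_servers)]
    heapq.heapify(servers)
    assignment = []
    for query_cost in queries:
        idle_at, sid = heapq.heappop(servers)
        assignment.append(sid)
        heapq.heappush(servers, (idle_at + query_cost, sid))
    # Makespan: when the busiest server finishes its backlog.
    makespan = max(t for t, _ in servers)
    return assignment, makespan

queries = [3.0, 1.0, 2.0, 1.0, 4.0]
assignment, makespan = schedule(queries, 2)
print(assignment)  # → [0, 1, 1, 0, 1]
print(makespan)    # → 7.0
```

The same bookkeeping feeds the power/latency trade-off: if the estimated idle times stay low across all servers, the controller can power machines down; if backlogs grow, it brings machines back to keep latency within bounds.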

    An evaluation of non-relational database management systems as suitable storage for user generated text-based content in a distributed environment

    Non-relational database management systems address some of the limitations relational database management systems have when storing large volumes of unstructured, user-generated text-based data in distributed environments. They follow different approaches through the data model they use, their ability to scale data storage over distributed servers, and the programming interface they provide. An experimental approach was followed to measure how these alternative database management systems address the limitations of relational databases, in terms of their capability to store unstructured text-based data, their data warehousing capabilities, their ability to scale data storage across distributed servers, and the level of programming abstraction they provide. The results of the research highlighted the limitations of relational database management systems. The different database management systems do address certain limitations, but not all. Document-oriented databases provide the best results and successfully address the need to store large volumes of user-generated text-based data in a distributed environment. (School of Computing, M.Sc. Computer Science)
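What makes the document-oriented data model a good fit for unstructured user-generated text is that each record is a schema-free document: fields can differ from one document to the next with no table alteration. A minimal sketch (an illustrative toy, not any evaluated system's API):

```python
import json

# Toy document store: schema-free JSON documents keyed by id, as in a
# document-oriented database (illustrative only, not a real DBMS API).
class DocumentStore:
    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc):
        # No fixed schema: any JSON-serializable dict is accepted.
        self.docs[doc_id] = json.loads(json.dumps(doc))

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def find(self, field, value):
        # Match on a field that only some documents may have.
        return [d for d in self.docs.values() if d.get(field) == value]

store = DocumentStore()
store.put("p1", {"user": "ann", "text": "first post!", "tags": ["intro"]})
store.put("p2", {"user": "bob", "text": "reply", "reply_to": "p1"})

print(store.get("p2")["reply_to"])      # → p1
print(len(store.find("user", "ann")))   # → 1
```

Note that `p2` carries a `reply_to` field that `p1` lacks; a relational schema would force either a nullable column or a separate table for this, which is precisely the friction the evaluated systems avoid.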

    Towards a New Generation of Permissioned Blockchain Systems

    With the release of Satoshi Nakamoto's Bitcoin system in 2008, a new decentralized computation paradigm, known as blockchain, was born. Bitcoin promised a trading network for virtual coins, publicly available for anyone to participate in but owned by nobody. Any participant could propose a transaction, and a lottery mechanism decided the order in which these transactions would be recorded in a ledger, with an elegant mechanism to prevent double spending. The remarkable achievement of Nakamoto's protocol was that participants did not have to trust each other to behave correctly for it to work. As long as more than half of the network participants adhered to the correct code, the recorded transactions on the ledger would be both valid and immutable. Ethereum, the next major blockchain to appear, improved on the initial idea by introducing smart contracts: decentralized Turing-complete stored procedures that made blockchain technology interesting for the enterprise setting. However, its intrinsically public data and prohibitive energy costs needed to be overcome. This gave rise to a new type of system called permissioned blockchains. With these, access to the ledger is restricted and trust assumptions about malicious behaviour are weakened, allowing more efficient consensus mechanisms to find a global order of transactions. One of the most popular representatives of this kind of blockchain is Hyperledger Fabric. While it is much faster and more energy efficient than permissionless blockchains, it has to compete with conventional distributed databases in the enterprise sector. This thesis aims to mitigate three major shortcomings of Fabric. First, compared to conventional database systems, it is still far too slow. This thesis shows how its performance can be increased by a factor of seven by redesigning the transaction processing pipeline and introducing more efficient data structures.
Second, we present a novel solution to Fabric's intrinsically low throughput for workloads with transactions that access the same data. This is achieved by analyzing the dependencies of transactions and selectively re-executing them when a conflict is detected. Third, this thesis tackles the preservation of private data. Even though access to the blockchain as a whole can be restricted, in a setting where multiple enterprises collaborate this is not sufficient to protect sensitive proprietary data. Thus, this thesis introduces a new privacy-preserving blockchain protocol based on network sharding and targeted data dissemination. It also introduces an additional layer of abstraction for the creation of transactions and interaction with data on the blockchain, allowing developers to write applications without low-level knowledge of the blockchain system's internal data structures. In summary, this thesis addresses the shortcomings of the current generation of permissioned blockchain systems.
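Conflict detection over transaction dependencies can be sketched with read/write sets, in the spirit of optimistically executed transactions: within a block, a transaction conflicts if it read a key that an earlier transaction in the same block wrote. This is a simplified illustration of the idea, not Fabric's validation code.

```python
# Toy conflict detection via read/write sets: a transaction conflicts
# with an earlier one in the same block if it read a key that the
# earlier transaction wrote (its read is now stale).
def find_conflicts(block):
    written = set()
    conflicting = []
    for tx in block:
        if tx["reads"] & written:
            # Stale read: this tx must be re-executed (or aborted).
            conflicting.append(tx["id"])
        else:
            written |= tx["writes"]
    return conflicting

block = [
    {"id": "tx1", "reads": {"a"}, "writes": {"b"}},
    {"id": "tx2", "reads": {"b"}, "writes": {"c"}},  # read tx1's write
    {"id": "tx3", "reads": {"x"}, "writes": {"y"}},  # independent
]
print(find_conflicts(block))  # → ['tx2']
```

A baseline validator would simply abort `tx2`, wasting its work; selectively re-executing only the flagged transactions against the updated state is what recovers throughput under contended workloads.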