96 research outputs found
Enterprise Search Technology Using Solr and Cloud
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
Databases and Solr have complementary strengths and weaknesses. SQL supports only simple wildcard-based text search with basic normalization, such as matching upper case to lower case. The problem is that these queries require full table scans. In Solr, all searchable words are stored in an inverted index, which makes searches orders of magnitude faster.
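The contrast can be sketched with a toy inverted index (an illustrative sketch only, not Solr's actual implementation): instead of scanning every row for a match, the index maps each term directly to the documents that contain it, so a lookup replaces a full scan.

```python
# Toy inverted index: maps each term to the IDs of documents containing it.
# Illustrative only -- Solr's real index also stores positions, scores, etc.
from collections import defaultdict

docs = {
    1: "open source enterprise search",
    2: "distributed indexing and replication",
    3: "enterprise search with faceted navigation",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term):
    # One dictionary lookup instead of a scan over every document.
    return sorted(index.get(term.lower(), set()))

print(search("enterprise"))  # [1, 3]
```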
Solr is a standalone/cloud enterprise search server with a REST-like API. You put documents into it (called indexing) via XML, JSON, CSV or binary over HTTP, and you query it via HTTP GET, receiving XML, JSON, CSV or binary results. The project will be implemented using Amazon/Azure cloud, Apache Solr, Windows/Linux, MS-SQL Server and open source tools.
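As a rough sketch of that HTTP interface, the snippet below only constructs the URLs a client might use, without sending any request; the host, port and core name (`mycore`) are assumptions for illustration, not part of any specific deployment.

```python
# Sketch of Solr's REST-like HTTP interface. The host and core name
# ("mycore") are illustrative assumptions; no request is actually sent.
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr/mycore"

def update_url():
    # Documents are indexed by POSTing XML/JSON/CSV to the update handler.
    return f"{SOLR}/update?commit=true"

def select_url(query, fields="id,title", rows=10):
    # Queries go through HTTP GET on the select handler.
    params = urlencode({"q": query, "fl": fields, "rows": rows, "wt": "json"})
    return f"{SOLR}/select?{params}"

print(select_url("title:solr"))
```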
From Controlled Data-Center Environments to Open Distributed Environments: Scalable, Efficient, and Robust Systems with Extended Functionality
The past two decades have witnessed several paradigm shifts in computing environments, starting with cloud computing, which offers on-demand allocation of storage, network, compute, and memory resources, as well as other services, in a pay-as-you-go billing model, and ending with the rise of permissionless blockchain technology, a decentralized computing paradigm with lower trust assumptions and a limitless number of participants. Unlike in the cloud, where all the computing resources are owned by some trusted cloud provider, permissionless blockchains allow computing resources owned by possibly malicious parties to join and leave their network without obtaining permission from a centralized trusted authority. Still, in the presence of malicious parties, permissionless blockchain networks can perform general computations and make progress. Cloud computing is powered by geographically distributed data centers controlled and managed by trusted cloud service providers and promises theoretically infinite computing resources. On the other hand, permissionless blockchains are powered by open networks of geographically distributed computing nodes owned by entities that are not necessarily known or trusted. This paradigm shift requires a reconsideration of distributed data management protocols and distributed system designs that assume low latency across system components, inelastic computing resources, or fully trusted computing resources. In this dissertation, we propose new system designs and optimizations that address the scalability and efficiency of distributed data management systems in cloud environments. We also propose several protocols and new programming paradigms to extend the functionality and enhance the robustness of permissionless blockchains.
The work presented spans global-scale transaction processing, large-scale stream processing, atomic transaction processing across permissionless blockchains, and extending the functionality and use-cases of permissionless blockchains. In all these directions, the focus is on rethinking system and protocol designs to account for novel cloud and permissionless blockchain assumptions. For global-scale transaction processing, we propose GPlacer, a placement optimization framework that decides the replica placement of fully and partially geo-replicated databases. For large-scale stream processing, we propose Cache-on-Track (CoT), an adaptive and elastic client-side cache that addresses server-side load imbalances that occur in large-scale distributed storage layers. In permissionless blockchain transaction processing, we propose AC3WN, the first correct cross-chain commitment protocol that guarantees the atomicity of cross-chain transactions. Also, we propose TXSC, a transactional smart contract programming framework. TXSC provides smart contract developers with transaction primitives that allow them to write smart contracts without the need to reason about the anomalies that can arise from concurrent smart contract function executions. In addition, we propose a forward-looking architecture that unifies permissioned and permissionless blockchains and exploits the running infrastructure of permissionless blockchains to build global asset management systems.
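The general idea behind a client-side cache that absorbs server-side load imbalance can be sketched as follows. This is a minimal illustration of caching only the hottest keys at the client, not CoT's actual policy (which includes a separate tracker and elastic resizing); the class and method names are hypothetical.

```python
# Minimal sketch: a client-side cache that keeps only the hottest keys,
# so skewed traffic is absorbed before reaching the storage servers.
# Illustrative only -- Cache-on-Track's real algorithm is more involved.
from collections import Counter

class HotKeyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = Counter()   # access frequency per key
        self.cache = {}         # cached key -> value

    def get(self, key, fetch_from_server):
        self.hits[key] += 1
        if key in self.cache:
            return self.cache[key]       # hot key served locally
        value = fetch_from_server(key)
        # Cache the key only if it ranks among the hottest keys seen.
        hottest = {k for k, _ in self.hits.most_common(self.capacity)}
        if key in hottest:
            if len(self.cache) >= self.capacity:
                coldest = min(self.cache, key=lambda k: self.hits[k])
                del self.cache[coldest]  # evict the least-hot entry
            self.cache[key] = value
        return value
```

A skewed workload then hits the local cache for its few hot keys while cold keys pass through to the servers.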
A Critical Comparison of NOSQL Databases in the Context of Acid and Base
This starred paper will discuss two major types of databases, relational and NOSQL, and analyze the different models used by these databases. In particular, it will focus on whether the ACID or the BASE model is more appropriate for NOSQL databases. NOSQL databases use the BASE model because they do not usually comply with the ACID model used by relational databases. However, some NOSQL databases adopt additional approaches and techniques to make the database comply with the ACID model. In this light, this paper will explore some of these approaches and explain why NOSQL databases cannot simply follow the ACID model. What are the reasons behind the extensive use of the BASE model? What are some of the advantages and disadvantages of not using ACID? Particular attention will be paid to analyzing whether one model is better or superior to the other. These questions will be answered by reviewing existing research conducted on NOSQL databases such as Cassandra, DynamoDB, MongoDB and Neo4j.
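The core of the ACID side of the comparison, atomicity, can be illustrated with a toy store (a generic sketch, not the implementation of any of the databases named above): a transaction's writes either all take effect or none do.

```python
# Illustrative contrast for the ACID discussion: an atomic transaction is
# all-or-nothing. Not any specific database's implementation.

class AtomicStore:
    """All writes in a transaction commit together or not at all."""
    def __init__(self):
        self.data = {}

    def transact(self, updates):
        staged = dict(self.data)         # work on a private copy
        for key, value in updates:
            if value is None:            # treat None as an invalid write
                raise ValueError(f"aborting: bad value for {key!r}")
            staged[key] = value
        self.data = staged               # atomic swap: the commit point

store = AtomicStore()
store.transact([("balance", 100), ("owner", "alice")])
try:
    store.transact([("balance", 50), ("owner", None)])  # second write fails
except ValueError:
    pass
print(store.data)  # {'balance': 100, 'owner': 'alice'} -- unchanged
```

A BASE-style store would instead accept the first write immediately and rely on later reconciliation, trading this all-or-nothing guarantee for availability.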
Data Storage and Dissemination in Pervasive Edge Computing Environments
Nowadays, smart mobile devices generate huge amounts of data in all sorts of gatherings.
Much of that data has localized and ephemeral interest, but can be of great use if shared
among co-located devices. However, mobile devices often experience poor connectivity,
leading to availability issues if application storage and logic are fully delegated to a
remote cloud infrastructure. In turn, the edge computing paradigm pushes computations
and storage beyond the data center, closer to end-user devices where data is generated
and consumed, enabling the execution of certain components of edge-enabled
systems directly and cooperatively on edge devices.
This thesis focuses on the design and evaluation of resilient and efficient data storage
and dissemination solutions for pervasive edge computing environments, operating with
or without access to the network infrastructure. In line with this dichotomy, our goal can
be divided into two specific scenarios. The first one is related to the absence of network
infrastructure and the provision of a transient data storage and dissemination system
for networks of co-located mobile devices. The second one relates to the existence of
network infrastructure access and the corresponding edge computing capabilities.
First, the thesis presents time-aware reactive storage (TARS), a reactive data storage
and dissemination model with intrinsic time-awareness, that exploits synergies between
the storage substrate and the publish/subscribe paradigm, and allows queries within a
specific time scope. Next, it describes in more detail: i) Thyme, a data storage and
dissemination system for wireless edge environments, implementing TARS; ii) Parsley, a
flexible and resilient group-based distributed hash table with preemptive peer relocation
and a dynamic data sharding mechanism; and iii) Thyme GardenBed, a framework
for data storage and dissemination across multi-region edge networks, that makes use of
both device-to-device and edge interactions.
The developed solutions present low overheads, while providing adequate response
times for interactive usage and low energy consumption, proving to be practical in a
variety of situations. They also display good load balancing and fault tolerance properties.
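The time-scoped query idea behind TARS can be sketched as follows. This is an illustrative sketch under assumed semantics, not Thyme's actual API: each stored item carries a timestamp, and queries are restricted to a time interval.

```python
# Sketch of time-scoped storage queries in the spirit of TARS. The class
# and method names here are illustrative assumptions, not Thyme's API.
import bisect

class TimeAwareStore:
    def __init__(self):
        self.items = []                  # sorted list of (timestamp, item)

    def put(self, timestamp, item):
        bisect.insort(self.items, (timestamp, item))

    def query(self, start, end):
        """Return items published within the time scope [start, end]."""
        lo = bisect.bisect_left(self.items, (start, ""))
        hi = bisect.bisect_right(self.items, (end, chr(0x10FFFF)))
        return [item for _, item in self.items[lo:hi]]

store = TimeAwareStore()
store.put(10, "photo-a")
store.put(20, "photo-b")
store.put(35, "photo-c")
print(store.query(15, 30))  # ['photo-b']
```

A subscription in this model would be the reactive counterpart: a standing query whose time scope extends into the future, with matching items pushed to subscribers as they are published.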
Query scheduling techniques and power/latency trade-off model for large-scale search engines
[Abstract] Web search engines have to deal with a rapid increase of information, demanded by high incoming query traffic. This situation has driven companies to build geographically distributed data centres housing thousands of computers, consuming enormous amounts of electricity and requiring a huge infrastructure around. At this scale, even minor efficiency improvements result in large financial savings.
This thesis makes a novel contribution to the state of the art in query scheduling and power consumption, helping large-scale data centres build more efficient search engines.
On the one hand, this thesis proposes new scheduling techniques to decrease the
response time of queries, by estimating the server that will be idle soonest.
On the other hand, this thesis defines a simple mathematical model that captures the trade-off between the power consumption and latency of a search engine. Using historical and current data, the model estimates the incoming query traffic and automatically increases/decreases the number of active machines in the system. We achieve high energy savings throughout the day without degrading latency.
Our experiments attest to the effectiveness of both the scheduling methods and the power/latency trade-off model in improving efficiency and achieving high energy savings.
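The idle-soonest scheduling idea can be sketched as follows. This is a simplified illustration, not the thesis's actual estimator: each server keeps an estimate of when its queue will drain, and a new query goes to the server expected to be free first.

```python
# Simplified sketch of dispatching a query to the server estimated to be
# idle soonest. Real schedulers must predict per-query cost; here each
# query carries a known processing cost for illustration.

def schedule(busy_until, now, query_cost):
    """Pick the server whose queue drains first; return (server, finish time)."""
    server = min(range(len(busy_until)), key=lambda s: busy_until[s])
    start = max(now, busy_until[server])      # wait if the server is busy
    busy_until[server] = start + query_cost   # extend that server's queue
    return server, busy_until[server]

# Three servers, busy until t=5, t=2 and t=9 respectively.
busy_until = [5, 2, 9]
print(schedule(busy_until, now=0, query_cost=4))  # (1, 6)
print(schedule(busy_until, now=0, query_cost=4))  # (0, 9)
```

The power/latency model would then act one level above this loop, shrinking or growing the `busy_until` pool as the estimated query traffic falls or rises.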
An evaluation of non-relational database management systems as suitable storage for user generated text-based content in a distributed environment
Non-relational database management systems address some of the limitations relational database management systems have when storing large volumes of unstructured, user-generated text-based data in distributed environments. They differ in the data model they use, their ability to scale data storage over distributed servers, and the programming interface they provide.
An experimental approach was followed to measure how these alternative database management systems address the limitations of relational databases in terms of their capability to store unstructured text-based data, their data warehousing capabilities, their ability to scale data storage across distributed servers, and the level of programming abstraction they provide.
The results of the research highlighted the limitations of relational database management systems. The different database management systems do address certain limitations, but not all. Document-oriented databases provide the best results and successfully address the need to store large volumes of user generated text-based data in a distributed environment.
School of Computing, M.Sc. (Computer Science)
Towards a New Generation of Permissioned Blockchain Systems
With the release of Satoshi Nakamoto's Bitcoin system in 2008, a new decentralized computation paradigm, known as blockchain, was born. Bitcoin promised a trading network for virtual coins, publicly available for anyone to participate in but owned by nobody. Any participant could propose a transaction and a lottery mechanism decided in which order these transactions would be recorded in a ledger, with an elegant mechanism to prevent double spending. The remarkable achievement of Nakamoto's protocol was that participants did not have to trust each other to behave correctly for it to work. As long as more than half of the network participants adhered to the correct code, the recorded transactions on the ledger would both be valid and immutable.
Ethereum, as the next major blockchain to appear, improved on the initial idea by introducing smart contracts, which are decentralized Turing-complete stored procedures, thus making blockchain technology interesting for the enterprise setting. However, its intrinsically public data and prohibitive energy costs needed to be overcome. This gave rise to a new type of systems called permissioned blockchains. With these, access to the ledger is restricted and trust assumptions about malicious behaviour have been weakened, allowing more efficient consensus mechanisms to find a global order of transactions. One of the most popular representatives of this kind of blockchain is Hyperledger Fabric. While it is much faster and more energy efficient than permissionless blockchains, it has to compete with conventional distributed databases in the enterprise sector.
This thesis aims to mitigate Fabric's three major shortcomings. First, compared to conventional database systems, it is still far too slow. This thesis shows how the performance can be increased by a factor of seven by redesigning the transaction processing pipeline and introducing more efficient data structures. Second, we present a novel solution to Fabric's intrinsic problem of low throughput for workloads with transactions that access the same data. This is achieved by analyzing the dependencies of transactions and selectively re-executing transactions when a conflict is detected. Third, this thesis tackles the preservation of private data. Even though access to the blockchain as a whole can be restricted, in a setting where multiple enterprises collaborate this is not sufficient to protect sensitive proprietary data. Thus, this thesis introduces a new privacy-preserving blockchain protocol based on network sharding and targeted data dissemination. It also introduces an additional layer of abstraction for the creation of transactions and interaction with data on the blockchain. This allows developers to write applications without the need for low-level knowledge of the internal data structure of the blockchain system. In summary, this thesis addresses the shortcomings of the current generation of permissioned blockchain systems.
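The conflict detection described above can be sketched with read/write-set intersection, a generic illustration of the technique rather than Fabric's actual code: two transactions conflict if one writes a key the other reads or writes.

```python
# Generic read/write-set conflict check, used to decide whether
# concurrently executed transactions can commit in their simulated order.
# Illustrative only -- not Hyperledger Fabric's actual implementation.

def conflicts(t1, t2):
    """t1 and t2 are dicts with 'reads' and 'writes' key sets."""
    return bool(
        t1["writes"] & (t2["reads"] | t2["writes"])
        or t2["writes"] & t1["reads"]
    )

a = {"reads": {"x"}, "writes": {"y"}}
b = {"reads": {"y"}, "writes": {"z"}}
c = {"reads": {"q"}, "writes": {"q"}}
print(conflicts(a, b))  # True  -- a writes y, which b reads
print(conflicts(a, c))  # False -- disjoint key sets
```

Under the selective re-execution approach, only a conflicting transaction would be re-run against the committed state, instead of being aborted outright.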