Search CORE

933 research outputs found

Recommended from our members

Local search: A guide for the information retrieval practitioner

Author: Abramson
Althofer
Andrew MacFarlane
Andrew Tuson
Baeck
Battiti
Boughanem
Cartwright
Chen
Chen
Chen
Cleverdon
Collins
Cordon
Cordon
Corne
Darwin
Dorigo
Downsland
Dueck
Fan
Fan
Fan
Fan
Feo
Fernandez-Villacanas Martin
Fogel
Fogel
Frakes
Frakes
Garey
Glover
Glover
Glover
Goldberg
Hajek
Harman
Harman
Harman
Harman
Hasan
Hawking
Hertz
Hertz
Holland
Hooker
Horng
Kekäläinen
Kirkpatrick
Koza
Kuflik
Lam
Lopez-Pujalte
Lopez-Pujalte
Lopez-Pujalte
Luke
Lundy
Martin-Bautisata
Masters
Michalewicz
Mock
Mock
Newell
Ogbu
Oliveira
Osman
Osman
Osman
Osman
Papadimitriou
Pohlheim
Rechenburg
Reeves
Reeves
Robertson
Sebastiani
Semet
Sinclair
Smith
Sparck Jones
Stefik
Tamine
Thangiah
Trotman
Van Laarhoven
Vrajitoru
Wartik
Yang
Zweben
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009
Field of study

There are a number of combinatorial optimisation problems in information retrieval in which the use of local search methods are worthwhile. The purpose of this paper is to show how local search can be used to solve some well known tasks in information retrieval (IR), how previous research in the field is piecemeal, bereft of a structure and methodologically flawed, and to suggest more rigorous ways of applying local search methods to solve IR problems. We provide a query based taxonomy for analysing the use of local search in IR tasks and an overview of issues such as fitness functions, statistical significance and test collections when conducting experiments on combinatorial optimisation problems. The paper gives a guide on the pitfalls and problems for IR practitioners who wish to use local search to solve their research issues, and gives practical advice on the use of such methods. The query based taxonomy is a novel structure which can be used by the IR practitioner in order to examine the use of local search in IR

City Research Online

Crossref

Data generator for evaluating ETL process quality

Author: Abelló Gamazo Alberto
Jovanovic Petar
Nakuçi Emona
Theodorou Vasileios
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017
Field of study

Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known as a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, having a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate such demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework and here we report our experimental findings showing the effectiveness and scalability of our approach.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Real-time Feature Detection in Mass Spectrometer Data

Author: Hillman Chris
Publication venue
Publication date: 01/01/2018
Field of study

University of Dundee Online Publications

Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

Author: Alam Mansaf
Ali Syed Arshad
Khan Samiya
Liu Xiufeng
Publication venue
Publication date: 01/01/2019
Field of study

Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies for optimized solution to a specific real world problem, big data system are not an exception to any such rule. As far as the storage aspect of any big data system is concerned, the primary facet in this regard is a storage infrastructure and NoSQL seems to be the right technology that fulfills its requirements. However, every big data application has variable data characteristics and thus, the corresponding data fits into a different data model. This paper presents feature and use case analysis and comparison of the four main data models namely document oriented, key value, graph and wide column. Moreover, a feature analysis of 80 NoSQL solutions has been provided, elaborating on the criteria and points that a developer must consider while making a possible choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings forth second facet of big data storage, big data file formats, into picture. The second half of the research paper compares the advantages, shortcomings and possible use cases of available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage and its challenges and future prospects have also been discussed

arXiv.org e-Print Archive

Online Research Database In Technology

Metaheuristics “In the Large”

Author: Addis
Adriaensen
Agarwal
Ahmed
Altunay
Applegate
Asta
Asta
Bartz-Beielstein
Battiti
Bengio
Birattari
Bleuler
Boussemart
Burke
Burke
Cahon
Camacho-Villalón
Chakhlevitch
Cloete
Collberg
Consoli
Cowling
Cox
Dantzig
De Beukelaer
Di Gaspero
Drake
Durillo
Foster
Foster
Fuellerer
García-Nieto
García-Sánchez
Glover
Goh
Guervós
Hackney
Hammond
Hansen
Hooker
Hoos
Hughes
Hunt
Imade
Johnson
Kendall
Khalloof
Kheiri
Kheiri
Khichane
Kirkpatrick
Kocsis
Kocsis
Koza
König
Lim
Lukasiewycz
Luke
López-Ibáñez
López-Ibáñez
Malan
Manna
Marmion
Martin
Merelo Guervós
Merelo-Guervós
Miranda
Munawar
Nagata
Nallaperuma
Nallaperuma
Neumann
Nikzad
Pamparà
Pamparà
Pamparà
Pappa
Parejo
Parejo
Parejo
Parkes
Peer
Pellerin
Popper
Puchinger
Qu
Raidl
Rice
Rosenberg
Ross
Rotem-Gal-Oz
Scheibenpflug
Senington
Smith-Miles
Song
Song
Soria-Alcaraz
Stützle
Sutter
Swan
Swan
Swan
Sörensen
Sörensen
SörensenKenneth and Arnold Florian and Palhazi Cuervo, Daniel
Taillard
Taylor
Thabtah
Valipour
Wagner
Wagner
Weyland
Wolpert
Woodward
Xu
Publication venue: 'Elsevier BV'
Publication date: 03/06/2021
Field of study

Many people have generously given their time to the various activities of the MitL initiative. Particular gratitude is due to Adam Barwell, John A. Clark, Patrick De Causmaecker, Emma Hart, Zoltan A. Kocsis, Ben Kovitz, Krzysztof Krawiec, John McCall, Nelishia Pillay, Kevin Sim, Jim Smith, Thomas Stutzle, Eric Taillard and Stefan Wagner. J. Swan acknowledges the support of UK EPSRC grant EP/J017515/1 and the EU H2020 SAFIRE Factories project. P. GarciaSanchez and J. J. Merelo acknowledges the support of TIN201785727-C4-2-P by the Spanish Ministry of Economy and Competitiveness. M. Wagner acknowledges the support of the Australian Research Council grants DE160100850 and DP200102364.Following decades of sustained improvement, metaheuristics are one of the great success stories of opti- mization research. However, in order for research in metaheuristics to avoid fragmentation and a lack of reproducibility, there is a pressing need for stronger scientific and computational infrastructure to sup- port the development, analysis and comparison of new approaches. To this end, we present the vision and progress of the Metaheuristics “In the Large”project. The conceptual underpinnings of the project are: truly extensible algorithm templates that support reuse without modification, white box problem descriptions that provide generic support for the injection of domain specific knowledge, and remotely accessible frameworks, components and problems that will enhance reproducibility and accelerate the field’s progress. We argue that, via such principled choice of infrastructure support, the field can pur- sue a higher level of scientific enquiry. We describe our vision and report on progress, showing how the adoption of common protocols for all metaheuristics can help liberate the potential of the field, easing the exploration of the design space of metaheuristics.UK Research & Innovation (UKRI)Engineering & Physical Sciences Research Council (EPSRC) EP/J017515/1EU H2020 SAFIRE Factories projectSpanish Ministry of Economy and Competitiveness TIN201785727-C4-2-PAustralian Research Council DE160100850 DP20010236

arXiv.org e-Print Archive

Crossref

Stirling Online Research Repository (RIOXX)

Repository@Nottingham

Adelaide Research & Scholarship

Repositorio Institucional Universidad de Granada

Institutional Repository Universiteit Antwerpen

Stirling Online Research Repository

Lancaster E-Prints

Emergent relational schemas for RDF

Author: Pham M.-D. (Minh-Duc)
Publication venue
Publication date: 06/09/2018
Field of study

CWI's Institutional Repository

Effizienz in Cluster-Datenbanksystemen - Dynamische und Arbeitslastberücksichtigende Skalierung und Allokation

Author: Rabl Tilmann
Publication venue
Publication date: 08/12/2011
Field of study

Database systems have been vital in all forms of data processing for a long time. In recent years, the amount of processed data has been growing dramatically, even in small projects. Nevertheless, database management systems tend to be static in terms of size and performance which makes scaling a difficult and expensive task. Because of performance and especially cost advantages more and more installed systems have a shared nothing cluster architecture. Due to the massive parallelism of the hardware programming paradigms from high performance computing are translated into data processing. Database research struggles to keep up with this trend. A key feature of traditional database systems is to provide transparent access to the stored data. This introduces data dependencies and increases system complexity and inter process communication. Therefore, many developers are exchanging this feature for a better scalability. However, explicitly managing the data distribution and data flow requires a deep understanding of the distributed system and reduces the possibilities for automatic and autonomic optimization. In this thesis we present an approach for database system scaling and allocation that features good scalability although it keeps the data distribution transparent. The first part of this thesis analyzes the challenges and opportunities for self-scaling database management systems in cluster environments. Scalability is a major concern of Internet based applications. Access peaks that overload the application are a financial risk. Therefore, systems are usually configured to be able to process peaks at any given moment. As a result, server systems often have a very low utilization. In distributed systems the efficiency can be increased by adapting the number of nodes to the current workload. We propose a processing model and an architecture that allows efficient self-scaling of cluster database systems. In the second part we consider different allocation approaches. To increase the efficiency we present a workload-aware, query-centric model. The approach is formalized; optimal and heuristic algorithms are presented. The algorithms optimize the data distribution for local query execution and balance the workload according to the query history. We present different query classification schemes for different forms of partitioning. The approach is evaluated for OLTP and OLAP style workloads. It is shown that variants of the approach scale well for both fields of application. The third part of the thesis considers benchmarks for large, adaptive systems. First, we present a data generator for cloud-sized applications. Due to its architecture the data generator can easily be extended and configured. A key feature is the high degree of parallelism that makes linear speedup for arbitrary numbers of nodes possible. To simulate systems with user interaction, we have analyzed a productive online e-learning management system. Based on our findings, we present a model for workload generation that considers the temporal dependency of user interaction.Datenbanksysteme sind seit langem die Grundlage für alle Arten von Informationsverarbeitung. In den letzten Jahren ist das Datenaufkommen selbst in kleinen Projekten dramatisch angestiegen. Dennoch sind viele Datenbanksysteme statisch in Bezug auf ihre Kapazität und Verarbeitungsgeschwindigkeit was die Skalierung aufwendig und teuer macht. Aufgrund der guten Geschwindigkeit und vor allem aus Kostengründen haben immer mehr Systeme eine Shared-Nothing-Architektur, bestehen also aus unabhängigen, lose gekoppelten Rechnerknoten. Da dieses Konstruktionsprinzip einen sehr hohen Grad an Parallelität aufweist, werden zunehmend Programmierparadigmen aus dem klassischen Hochleistungsrechen für die Informationsverarbeitung eingesetzt. Dieser Trend stellt die Datenbankforschung vor große Herausforderungen. Eine der grundlegenden Eigenschaften traditioneller Datenbanksysteme ist der transparente Zugriff zu den gespeicherten Daten, der es dem Nutzer erlaubt unabhängig von der internen Organisation auf die Daten zuzugreifen. Die resultierende Unabhängigkeit führt zu Abhängigkeiten in den Daten und erhöht die Komplexität der Systeme und der Kommunikation zwischen einzelnen Prozessen. Daher wird Transparenz von vielen Entwicklern für eine bessere Skalierbarkeit geopfert. Diese Entscheidung führt dazu, dass der die Datenorganisation und der Datenfluss explizit behandelt werden muss, was die Möglichkeiten für eine automatische und autonome Optimierung des Systems einschränkt. Der in dieser Arbeit vorgestellte Ansatz zur Skalierung und Allokation erhält den transparenten Zugriff und zeichnet sich dabei durch seine vollständige Automatisierbarkeit und sehr gute Skalierbarkeit aus. Im ersten Teil dieser Dissertation werden die Herausforderungen und Chancen für selbst-skalierende Datenbankmanagementsysteme behandelt, die in auf Computerclustern betrieben werden. Gute Skalierbarkeit ist eine notwendige Eigenschaft für Anwendungen, die über das Internet zugreifbar sind. Lastspitzen im Zugriff, die die Anwendung überladen stellen ein finanzielles Risiko dar. Deshalb werden Systeme so konfiguriert, dass sie eventuelle Lastspitzen zu jedem Zeitpunkt verarbeiten können. Das führt meist zu einer im Schnitt sehr geringen Auslastung der unterliegenden Systeme. Eine Möglichkeit dieser Ineffizienz entgegen zu steuern ist es die Anzahl der verwendeten Rechnerknoten an die vorliegende Last anzupassen. In dieser Dissertation werden ein Modell und eine Architektur für die Anfrageverarbeitung vorgestellt, mit denen es möglich ist Datenbanksysteme auf Clusterrechnern einfach und effizient zu skalieren. Im zweiten Teil der Arbeit werden verschieden Möglichkeiten für die Datenverteilung behandelt. Um die Effizienz zu steigern wird ein Modell verwendet, das die Lastverteilung im Anfragestrom berücksichtigt. Der Ansatz ist formalisiert und optimale und heuristische Lösungen werden präsentiert. Die vorgestellten Algorithmen optimieren die Datenverteilung für eine lokale Ausführung aller Anfragen und balancieren die Last auf den Rechnerknoten. Es werden unterschiedliche Arten der Anfrageklassifizierung vorgestellt, die zu verschiedenen Arten von Partitionierung führen. Der Ansatz wird sowohl für Onlinetransaktionsverarbeitung, als auch Onlinedatenanalyse evaluiert. Die Evaluierung zeigt, dass der Ansatz für beide Felder sehr gut skaliert. Im letzten Teil der Arbeit werden verschiedene Techniken für die Leistungsmessung von großen, adaptiven Systemen präsentiert. Zunächst wird ein Datengenerierungsansatz gezeigt, der es ermöglicht sehr große Datenmengen völlig parallel zu erzeugen. Um die Benutzerinteraktion von Onlinesystemen zu simulieren wurde ein produktives E-learningsystem analysiert. Anhand der Analyse wurde ein Modell für die Generierung von Arbeitslasten erstellt, das die zeitlichen Abhängigkeiten von Benutzerinteraktion berücksichtigt