The LDBC social network benchmark: Business intelligence workload
The Social Network Benchmark’s Business Intelligence workload (SNB BI) is a comprehensive graph OLAP benchmark targeting analytical data systems capable of supporting graph workloads. This paper marks the finalization of almost a decade of research in academia and industry via the Linked Data Benchmark Council (LDBC). SNB BI advances the state of the art in synthetic and scalable analytical database benchmarks in many aspects. Its base is a sophisticated data generator, implemented on a scalable distributed infrastructure, that produces a social graph with small-world phenomena, whose value properties follow skewed and correlated distributions and where values correlate with structure. This is a temporal graph in which all nodes and edges follow lifespan-based rules with temporal skew, enabling realistic and consistent temporal inserts and (recursive) deletes. The query workload, which exploits this skew and correlation, is based on LDBC’s “choke point”-driven design methodology and will entice technical and scientific improvements in future (graph) database systems. SNB BI includes the first adoption of “parameter curation” in an analytical benchmark, a technique that ensures stable runtimes of query variants across different parameter values. Two performance metrics characterize peak single-query performance (power) and sustained concurrent query throughput. To demonstrate the portability of the benchmark, we present experimental results on a relational and a graph DBMS. Note that these do not constitute an official LDBC Benchmark Result; only audited results can use this trademarked term.
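To make the two metrics concrete, here is a minimal Python sketch. The exact formulas are defined in the LDBC specification; the version below assumes a hypothetical TPC-style design in which power is derived from the geometric mean of single-query runtimes and throughput from the wall-clock time of a concurrent query stream.

```python
import math

def power_metric(query_times_s, scale_factor):
    """Hypothetical power score: geometric mean of per-query runtimes,
    inverted and scaled, so faster single-query execution => higher power."""
    geomean = math.exp(sum(math.log(t) for t in query_times_s) / len(query_times_s))
    return 3600.0 * scale_factor / geomean  # queries-per-hour style scaling

def throughput_metric(num_queries, wall_clock_s, scale_factor):
    """Hypothetical throughput score: queries completed per hour during a
    sustained concurrent stream, scaled by dataset size."""
    return num_queries * 3600.0 * scale_factor / wall_clock_s

times = [0.8, 2.4, 1.1, 5.6]  # seconds per BI query variant (made up)
print(power_metric(times, scale_factor=100))
print(throughput_metric(num_queries=400, wall_clock_s=1800, scale_factor=100))
```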
Dynamic Feature Engineering and model selection methods for temporal tabular datasets with regime changes
Applying deep learning algorithms to temporal panel datasets is difficult due to heavy non-stationarities, which can lead to over-fitted models that under-perform under regime changes. In this work we propose a new machine learning pipeline for ranking predictions on temporal panel datasets that is robust under regime changes in the data. Different machine learning models, including Gradient Boosting Decision Trees (GBDTs) and Neural Networks with and without simple feature engineering, are evaluated in the pipeline under different settings. We find that GBDT models with dropout display high performance, robustness and generalisability with relatively low complexity and reduced computational cost. We then show that online learning techniques can be used in post-prediction processing to enhance the results. In particular, dynamic feature neutralisation, an efficient procedure that requires no retraining of models and can be applied post-prediction to any machine learning model, improves robustness by reducing drawdown during regime changes. Furthermore, we demonstrate that creating model ensembles through dynamic model selection based on recent model performance improves the Sharpe and Calmar ratios of out-of-sample predictions over the baseline. We also evaluate the robustness of our pipeline across different data splits and random seeds, with good reproducibility of results.
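Feature neutralisation in its basic, static form is simple post-processing: project the predictions onto a set of feature columns and keep only the residual, so the adjusted scores carry no linear exposure to those features. The numpy sketch below shows that core step; the "dynamic" part of the paper, choosing which features to neutralise against based on recent behaviour, is only stubbed here, and the five-feature selection is an invented placeholder.

```python
import numpy as np

def neutralise(predictions, features, proportion=1.0):
    """Linear feature neutralisation: subtract the component of the
    predictions that is linearly explained by the given features.
    Pure post-processing; no model retraining is needed."""
    X = np.column_stack([features, np.ones(len(features))])  # add bias term
    beta, *_ = np.linalg.lstsq(X, predictions, rcond=None)   # least-squares fit
    return predictions - proportion * (X @ beta)             # keep the residual

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 20))                # feature matrix for one era
preds = features[:, 0] * 0.5 + rng.normal(size=1000)  # raw model scores
# "Dynamic" variant: neutralise only against the currently riskiest features,
# e.g. those most exposed in recent eras (the selection logic is omitted).
risky = features[:, :5]
adjusted = neutralise(preds, risky, proportion=0.5)
```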
PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data
With calls for increasing transparency, governments are releasing greater amounts of data in multiple domains including finance, education and healthcare. The efficient exploratory analysis of healthcare data constitutes a significant challenge. Key concerns in public health include the quick identification and analysis of trends, and the detection of outliers; this allows policies to be rapidly adapted to changing circumstances. We present an efficient outlier detection technique, termed PIKS (Pruned iterative-k means searchlight), which combines an iterative k-means algorithm with a pruned searchlight-based scan. We apply this technique to identify outliers in two publicly available healthcare datasets from the New York Statewide Planning and Research Cooperative System, and California's Office of Statewide Health Planning and Development. We provide a comparison of our technique with three other existing outlier detection techniques: auto-encoders, isolation forests and feature bagging. We identified outliers in conditions including suicide rates, immunity disorders, social admissions, cardiomyopathies, and pregnancy in the third trimester. We demonstrate that the PIKS technique produces results consistent with other techniques such as the auto-encoder. However, the auto-encoder needs to be trained, which requires several parameters to be tuned; in comparison, the PIKS technique has far fewer parameters to tune. This makes it advantageous for fast, "out-of-the-box" data exploration. The PIKS technique is scalable and can readily ingest new datasets, so it can provide valuable, up-to-date insights to citizens, patients and policy-makers. We have made our code open source, and with the availability of open data, other researchers can easily reproduce and extend our work. This will help promote a deeper understanding of healthcare policies and public health issues.
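The abstract does not spell out the algorithm, but the distance-to-centroid core of a k-means-based outlier detector is easy to sketch. The following Python is a hypothetical illustration of that core only: PIKS additionally iterates the clustering and prunes a searchlight scan, both of which are omitted here, and the dataset is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_outliers(X, k=5, quantile=0.99):
    """Flag points whose distance to their assigned cluster centroid falls
    in the top (1 - quantile) tail. Only the centroid-distance core of a
    k-means outlier detector; PIKS's iteration and pruning are omitted."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return dist > np.quantile(dist, quantile)

# e.g. rows = (region, condition) pairs, columns = yearly admission rates
rates = np.random.default_rng(1).normal(size=(500, 8))
flags = kmeans_outliers(rates, k=6)
print(flags.sum(), "candidate outliers")
```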
Colour technologies for content production and distribution of broadcast content
Accurate colour reproduction has long been a priority driving the development of new colour imaging systems that maximise human perceptual plausibility. This thesis explores machine learning algorithms for colour processing to assist both content production and distribution. First, this research studies colourisation technologies with practical use cases in the restoration and processing of archived content. The research targets practical, deployable solutions, developing a cost-effective pipeline which integrates the activity of the producer into the processing workflow. In particular, a fully automatic image colourisation paradigm using Conditional GANs is proposed to improve the content generalisation and colourfulness of existing baselines. Moreover, a more conservative solution is considered by providing references to guide the system towards more accurate colour predictions. A fast end-to-end architecture is proposed to improve on existing exemplar-based image colourisation methods while decreasing complexity and runtime. Finally, the proposed image-based methods are integrated into a video colourisation pipeline. A general framework is proposed to reduce temporal flickering and the propagation of errors when such methods are applied frame-to-frame. The proposed model is jointly trained to stabilise the input video and to cluster its frames with the aim of learning scene-specific modes. Second, this research explores colour processing technologies for content distribution, with the aim of effectively delivering the processed content to a broad audience. In particular, video compression is tackled by introducing a novel methodology for chroma intra prediction based on attention models. Although the proposed architecture helped to gain control over the reference samples and better understand the prediction process, the complexity of the underlying neural network significantly increased the encoding and decoding time. Therefore, aiming at efficient deployment within the latest video coding standards, this work also focused on the simplification of the proposed architecture to obtain a more compact and explainable model.
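The intuition behind attention-based chroma intra prediction, that samples with similar luma tend to have similar chroma, can be conveyed with a toy numpy example. The thesis's network learns its attention weights; the sketch below substitutes a fixed luma-similarity softmax kernel and invented reference values, so it illustrates the mechanism rather than the proposed architecture.

```python
import numpy as np

def attention_chroma_pred(luma_block, luma_ref, chroma_ref, temperature=500.0):
    """Toy attention-based chroma intra prediction: each chroma sample in the
    block is a softmax-weighted average of boundary reference chroma samples,
    with weights driven by luma similarity (similar luma => similar chroma)."""
    # Negative squared luma distance as attention logits, one row per pixel.
    d = (luma_block.reshape(-1, 1) - luma_ref.reshape(1, -1)) ** 2
    logits = -d / temperature
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # rows sum to one
    return (w @ chroma_ref).reshape(luma_block.shape)

luma_block = np.random.default_rng(2).integers(0, 255, size=(4, 4)).astype(float)
luma_ref = np.array([30.0, 80.0, 120.0, 200.0])   # reconstructed boundary luma
chroma_ref = np.array([100.0, 110.0, 128.0, 140.0])
print(attention_chroma_pred(luma_block, luma_ref, chroma_ref))
```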
Annual report of the officers of the town of Jackson, New Hampshire for the fiscal year ending December 31, 2022.
This is an annual report containing vital statistics for a town/city in the state of New Hampshire.
A Benchmark Framework for Data Compression Techniques
Lightweight data compression is frequently applied in main-memory database systems to improve query performance. The data processed by such systems is highly diverse, and a large number of lightweight compression techniques exist. Therefore, choosing the optimal technique for a given dataset is non-trivial. Existing approaches are based on simple rules, which do not suffice for such a complex decision. In contrast, our vision is a cost-based approach. However, this requires a detailed cost model, which can only be obtained from a systematic benchmarking of many compression algorithms on many different datasets. A naïve benchmark evaluates every algorithm under consideration separately, which yields many redundant steps and is thus inefficient. We propose an efficient and extensible benchmark framework for compression techniques. Given an ensemble of algorithms, it minimizes the overall run time of the evaluation. We experimentally show that our approach outperforms the naïve approach.
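The redundancy argument is easy to see in code. The sketch below is a hypothetical minimal harness, not the paper's framework, that factors per-dataset setup out of the per-algorithm loop so each dataset is prepared only once; it uses only Python standard-library compressors, whereas the paper targets lightweight column-compression techniques and a more sophisticated sharing of work.

```python
import bz2, lzma, time, zlib

ALGORITHMS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def benchmark(datasets):
    """Evaluate every algorithm on every dataset while preparing each dataset
    only once, avoiding the redundant setup a naive benchmark performs when
    it treats each (algorithm, dataset) pair independently."""
    results = {}
    for name, data in datasets.items():          # shared, one-time data setup
        for algo, compress in ALGORITHMS.items():
            t0 = time.perf_counter()
            out = compress(data)
            results[(name, algo)] = (time.perf_counter() - t0,
                                     len(out) / len(data))  # time, ratio
    return results

data = {"skewed": bytes(1000) + bytes(range(256)) * 100,
        "uniform": bytes(range(256)) * 400}
for key, (sec, ratio) in benchmark(data).items():
    print(key, f"{sec:.4f}s ratio={ratio:.3f}")
```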
Neural Architecture Search: Insights from 1000 Papers
In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already produced architectures that outperform the best human-designed ones on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020 (Deng and Lindauer, 2021). In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
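The survey's three-part taxonomy (search space, search algorithm, speedup technique) maps onto a few lines of code even for the simplest baseline, random search. The search space and the scoring stub below are invented for illustration; in practice evaluate() would train the candidate architecture, and that cost is precisely what speedup techniques such as weight sharing and performance prediction attack.

```python
import random

# Hypothetical search space: each architecture is one configuration.
SEARCH_SPACE = {
    "depth": [2, 4, 8, 16],
    "width": [64, 128, 256],
    "activation": ["relu", "gelu", "swish"],
    "skip_connections": [True, False],
}

def sample_architecture(rng):
    """Draw one candidate uniformly at random from the search space."""
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    """Stub for the expensive step: train `arch` and return validation
    accuracy. Here it is a random placeholder score."""
    return random.random()

rng = random.Random(0)
best = max((sample_architecture(rng) for _ in range(50)), key=evaluate)
print("best architecture found:", best)
```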
GNN for Deep Full Event Interpretation and hierarchical reconstruction of heavy-hadron decays in proton-proton collisions
The LHCb experiment at the Large Hadron Collider (LHC) is designed to perform high-precision measurements of heavy-hadron decays, which requires the collection of large data samples and a good understanding and suppression of multiple background sources. Both factors are challenged by a five-fold increase in the average number of proton-proton collisions per bunch crossing, corresponding to a change in the detector operating conditions for the recently started LHCb Upgrade I phase. A further ten-fold increase is expected in the Upgrade II phase, planned for the next decade. The limited storage capacity of the trigger imposes an inverse relation between the number of particles selected for storage per event and the number of events that can be recorded, and background levels will rise due to the enlarged combinatorics. To tackle both challenges, we propose a novel approach, never attempted before at a hadronic collider: a Deep-learning based Full Event Interpretation (DFEI), which performs the simultaneous identification, isolation and hierarchical reconstruction of all the heavy-hadron decay chains in each event. This approach radically contrasts with the standard selection procedure used in LHCb to identify heavy-hadron decays, which looks individually at subsets of particles compatible with being the products of specific decay types, disregarding the contextual information from the rest of the event. We present the first prototype of the DFEI algorithm, which leverages the power of Graph Neural Networks (GNNs). This paper describes the design and development of the algorithm, and its performance under Upgrade I simulated conditions.
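The property the abstract contrasts with per-subset selections, namely that every particle's representation absorbs context from the rest of the event, comes from GNN message passing. The numpy toy below shows a single mean-aggregation layer over a random particle graph; the feature sizes, graph and weights are invented, and the actual DFEI architecture is far more elaborate.

```python
import numpy as np

def gnn_layer(H, A, W):
    """One message-passing layer: every particle (node) aggregates the
    feature vectors of its neighbours (mean) and mixes them with its own
    features through a learned weight matrix W, followed by a ReLU.
    Stacking such layers spreads event-level context to every node."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)   # avoid division by zero
    msgs = (A @ H) / deg                             # mean over neighbours
    return np.maximum(0.0, np.concatenate([H, msgs], axis=1) @ W)

rng = np.random.default_rng(3)
n_particles, n_feat = 6, 4                           # e.g. kinematics per particle
H = rng.normal(size=(n_particles, n_feat))           # node features
A = (rng.random((n_particles, n_particles)) < 0.4).astype(float)
A = np.maximum(A, A.T)                               # symmetric adjacency
np.fill_diagonal(A, 0)                               # no self-loops
W = rng.normal(size=(2 * n_feat, n_feat)) * 0.1
print(gnn_layer(H, A, W).shape)                      # (6, 4) updated embeddings
```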