K-nearest Neighbor Search by Random Projection Forests
K-nearest neighbor (kNN) search has wide applications in many areas,
including data mining, machine learning, statistics and many applied domains.
Inspired by the success of ensemble methods and the flexibility of tree-based
methodology, we propose random projection forests (rpForests) for kNN search.
rpForests finds kNNs by aggregating results from an ensemble of random
projection trees with each constructed recursively through a series of
carefully chosen random projections. rpForests achieves remarkable accuracy:
both the miss rate of kNNs and the discrepancy in the kNN distances decay
rapidly as the ensemble grows. rpForests has very low computational complexity.
The ensemble nature of rpForests makes it easy to run in parallel on multicore
or clustered computers; the running time is expected to be nearly inversely
proportional to the number of cores or machines. We give theoretical insights
by showing the exponential decay of the probability that neighboring points
would be separated by ensemble random projection trees when the ensemble size
increases. Our theory can be used to refine the choice of random projections in
the growth of trees, and experiments show that the effect is remarkable.
Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conference
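The core of rpForests (random-projection trees grown by recursive median splits, with candidates aggregated across the ensemble) can be sketched as follows; the leaf size, splitting rule, and aggregation step here are simplified assumptions, not the authors' implementation:

```python
import numpy as np

def build_rp_tree(points, indices, leaf_size=10, rng=None):
    # Recursively split the point set at the median of a random projection.
    rng = np.random.default_rng() if rng is None else rng
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = rng.normal(size=points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points[indices] @ direction
    median = np.median(proj)
    left, right = indices[proj <= median], indices[proj > median]
    if len(left) == 0 or len(right) == 0:  # degenerate split, stop here
        return {"leaf": indices}
    return {"dir": direction, "median": median,
            "left": build_rp_tree(points, left, leaf_size, rng),
            "right": build_rp_tree(points, right, leaf_size, rng)}

def query_tree(tree, q):
    # Descend to the leaf whose region contains q; its points are candidates.
    while "leaf" not in tree:
        tree = tree["left"] if q @ tree["dir"] <= tree["median"] else tree["right"]
    return set(tree["leaf"])

def rp_forest_knn(points, q, k=5, n_trees=10, seed=0):
    # Union the leaf candidates over the ensemble, then rank them exactly.
    rng = np.random.default_rng(seed)
    candidates = set()
    for _ in range(n_trees):
        tree = build_rp_tree(points, np.arange(len(points)), rng=rng)
        candidates |= query_tree(tree, q)
    cand = np.array(sorted(candidates))
    dists = np.linalg.norm(points[cand] - q, axis=1)
    return cand[np.argsort(dists)[:k]]

# Example: querying with a point from the dataset must return that point first.
pts = np.random.default_rng(1).normal(size=(200, 5))
neighbours = rp_forest_knn(pts, pts[0], k=5)
```

Each additional tree gives a neighbouring point another independent chance to land in the query's leaf, which is the mechanism behind the exponential decay of the miss probability discussed in the abstract.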
An intelligent information forwarder for healthcare big data systems with distributed wearable sensors
© 2016 IEEE. An increasing number of the elderly population wish to live an independent lifestyle rather than rely on intrusive care programmes. A big data solution is presented using wearable sensors capable of carrying out continuous monitoring of the elderly, alerting the relevant caregivers when necessary and forwarding pertinent information to a big data system for analysis. A challenge for such a solution is the development of context-awareness through the multidimensional, dynamic and nonlinear sensor readings that have a weak correlation with observable human behaviours and health conditions. To address this challenge, a wearable sensor system with an intelligent data forwarder is discussed in this paper. The forwarder adopts a Hidden Markov Model for human behaviour recognition. Locality sensitive hashing is proposed as an efficient mechanism to learn sensor patterns. A prototype solution is implemented to monitor the health conditions of dispersed users. It is shown that the intelligent forwarders can provide the remote sensors with context-awareness. They transmit only important information to the big data server for analytics when certain behaviours happen, avoiding overwhelming communication and data storage. The system functions unobtrusively, whilst giving the users peace of mind in the knowledge that their safety is being monitored and analysed.
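The locality-sensitive-hashing ingredient can be illustrated with a generic random-hyperplane (SimHash-style) scheme; the signature length and reading dimensionality below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def lsh_signature(x, hyperplanes):
    # Sign pattern of x against random hyperplanes: vectors with a small
    # angle between them tend to share most of their sign bits.
    return tuple(int(b) for b in (hyperplanes @ x > 0))

rng = np.random.default_rng(42)
planes = rng.normal(size=(16, 8))   # 16-bit signature for 8-dim sensor readings

reading = rng.normal(size=8)
similar = reading + 0.01 * rng.normal(size=8)   # slightly perturbed reading
sig_a = lsh_signature(reading, planes)
sig_b = lsh_signature(similar, planes)
# Hamming distance between signatures approximates the angular distance,
# so recurring sensor patterns hash to nearby (often identical) buckets.
hamming = sum(a != b for a, b in zip(sig_a, sig_b))
```

A forwarder can then compare cheap fixed-length signatures instead of raw high-rate sensor streams when deciding whether a reading matches a known pattern.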
New scalable machine learning methods: beyond classification and regression
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
The recent surge in available data has spawned a new and promising age of
machine learning. Success cases of machine learning are arriving at an
increasing rate as some algorithms are able to leverage immense amounts of
data to produce complex and highly accurate predictions. Still, many
algorithms in the toolbox of the machine learning practitioner have been
rendered useless in this new scenario due to the complications associated with
large-scale learning. Handling large datasets entails logistical problems,
limits the computational and spatial complexity of the algorithms used,
favours methods with few or no hyperparameters to be configured, and exhibits
specific characteristics that complicate learning. This thesis is centered on
the scalability of machine learning algorithms, that is, their capacity to
maintain their effectiveness as the scale of the data grows, and how it can be
improved. We focus on problems for which the existing solutions struggle when
the scale grows. Therefore, we skip classification and regression problems and
focus on feature selection, anomaly detection, graph construction and
explainable machine learning. We analyze four different strategies to obtain
scalable algorithms. First, we explore distributed computation, which is used
in all of the presented algorithms. Besides this technique, we also examine
the use of approximate models to speed up computations, the design of new
models that take advantage of a characteristic of the input data to simplify
training, and the enhancement of simple models to enable them to manage
large-scale learning. We have implemented four new algorithms and six
versions of existing ones that tackle the mentioned problems, and for each one
we report experimental results that show both their validity in comparison
with competing methods and their capacity to scale to large datasets. All the
presented algorithms have been made available for download and are being
published in journals to enable
practitioners and researchers to use them.
Data Imputation through the Identification of Local Anomalies
We introduce a comprehensive and statistical framework in a model free
setting for a complete treatment of localized data corruptions due to severe
noise sources, e.g., an occluder in the case of a visual recording. Within this
framework, we propose i) a novel algorithm to efficiently separate, i.e.,
detect and localize, possible corruptions from a given suspicious data instance
and ii) a Maximum A Posteriori (MAP) estimator to impute the corrupted data. As
a generalization to Euclidean distance, we also propose a novel distance
measure, which is based on the ranked deviations among the data attributes and
empirically shown to be superior in separating the corruptions. Our algorithm
first splits the suspicious instance into parts through a binary partitioning
tree in the space of data attributes and iteratively tests those parts to
detect local anomalies using the nominal statistics extracted from an
uncorrupted (clean) reference data set. Once each part is labeled as anomalous
vs normal, the corresponding binary patterns over this tree that characterize
corruptions are identified and the affected attributes are imputed. Under a
certain conditional independency structure assumed for the binary patterns, we
analytically show that the false alarm rate of the introduced algorithm in
detecting the corruptions is independent of the data and can be directly set
without any parameter tuning. The proposed framework is tested over several
well-known machine learning data sets with synthetically generated corruptions;
and experimentally shown to produce remarkable improvements in classification
performance, with strong corruption separation capabilities. Our experiments
also indicate that the proposed algorithms outperform typical approaches and
are robust to varying training-phase conditions.
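The ranked-deviation idea can be illustrated with a hypothetical trimmed variant of Euclidean distance: sort the per-attribute absolute deviations and keep only the smallest fraction, so that a few grossly corrupted attributes cannot dominate the distance. This is a sketch of the intuition, not the paper's exact measure:

```python
import numpy as np

def ranked_deviation_distance(x, y, keep=0.7):
    # Rank the per-attribute absolute deviations and keep only the smallest
    # `keep` fraction before aggregating (hypothetical reading of the idea:
    # localized corruptions inflate only the largest, discarded deviations).
    dev = np.sort(np.abs(np.asarray(x, float) - np.asarray(y, float)))
    m = max(1, int(keep * len(dev)))
    return float(np.sqrt(np.sum(dev[:m] ** 2)))
```

With one corrupted attribute out of ten, plain Euclidean distance is dominated by the corruption while the trimmed distance still sees the two instances as close, which is the separation behaviour the abstract attributes to its ranked measure.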
A Hierarchical Framework Using Approximated Local Outlier Factor for Efficient Anomaly Detection
Abstract: Anomaly detection aims to identify rare events that deviate remarkably from existing data. To satisfy real-world applications, various anomaly detection technologies have been proposed. Due to resource constraints, such as limited energy, computation ability and memory storage, most of them cannot be directly used in wireless sensor networks (WSNs). In this work, we propose a hierarchical anomaly detection framework to overcome the challenges of anomaly detection in WSNs. We aim to detect anomalies with an accurate model and an approximated model learned at the remote server and at the sink nodes, respectively. Besides the framework, we also propose an approximated local outlier factor algorithm, which can be learned at the sink nodes. The proposed algorithm is more efficient in computation and storage compared with the standard one. Experimental results verify the feasibility of our proposed method in terms of both accuracy and efficiency.
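For reference, the standard (unapproximated) local outlier factor that the sink-node algorithm approximates can be computed directly; the sketch below is the textbook LOF, not the paper's approximated variant:

```python
import numpy as np

def lof_scores(X, k=3):
    # Textbook Local Outlier Factor: scores near 1 mean a point is as dense
    # as its neighbourhood; scores well above 1 flag outliers.
    X = np.asarray(X, float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(D, np.inf)
    knn = np.argsort(D, axis=1)[:, :k]          # indices of the k nearest
    k_dist = D[np.arange(n), knn[:, -1]]        # distance to the k-th nearest
    # Reachability distance of p from neighbour o: max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)                 # local reachability density
    return lrd[knn].mean(axis=1) / lrd          # neighbours' density vs own

# Example: a tight cluster plus one distant point.
scores = lof_scores([[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1],
                     [0.05, 0.05], [5, 5]], k=3)
```

The O(n^2) distance matrix here is exactly the cost that makes standard LOF unsuitable for sink nodes and motivates the paper's approximation.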
CLAM-Accelerated K-Nearest Neighbors Entropy-Scaling Search of Large High-Dimensional Datasets via an Actualization of the Manifold Hypothesis
Many fields are experiencing a Big Data explosion, with data collection rates
outpacing the rate of computing performance improvements predicted by Moore's
Law.
Researchers are often interested in similarity search on such data.
We present CAKES (CLAM-Accelerated k-NN Entropy Scaling Search), a novel
algorithm for k-nearest-neighbor (k-NN) search which leverages geometric
and topological properties inherent in large datasets.
CAKES assumes the manifold hypothesis and performs best when data occupy a
low-dimensional manifold, even if the data occupy a very high-dimensional
embedding space.
We demonstrate performance improvements ranging from hundreds to tens of
thousands of times faster when compared to state-of-the-art approaches such as
FAISS and HNSW, when benchmarked on 5 standard datasets.
Unlike locality-sensitive hashing approaches, CAKES can work with any
user-defined distance function.
When data occupy a metric space, CAKES exhibits perfect recall.
Comment: As submitted to IEEE Big Data 202
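The geometric pruning behind entropy-scaling search can be sketched with a flat, single-level clustering: a range query of radius r may skip any cluster whose center lies farther than r plus the cluster radius, by the triangle inequality. This is a simplified illustration of the pruning rule, not CAKES itself:

```python
import numpy as np

def build_clusters(points, n_clusters, rng):
    # Crude partition: assign each point to its nearest randomly chosen center
    # and record each cluster's radius (max distance from center to a member).
    centers = rng.choice(len(points), size=n_clusters, replace=False)
    d = np.linalg.norm(points[:, None] - points[centers][None], axis=2)
    assign = d.argmin(axis=1)
    clusters = []
    for j, c in enumerate(centers):
        members = np.where(assign == j)[0]
        if len(members) == 0:
            continue
        radius = np.linalg.norm(points[members] - points[c], axis=1).max()
        clusters.append((c, radius, members))
    return clusters

def range_search(clusters, points, q, r):
    # Triangle inequality: if d(q, center) > r + radius, no member of that
    # cluster can lie within r of q, so the whole cluster is skipped.
    hits = []
    for c, radius, members in clusters:
        if np.linalg.norm(points[c] - q) > r + radius:
            continue
        dm = np.linalg.norm(points[members] - q, axis=1)
        hits.extend(members[dm <= r].tolist())
    return sorted(hits)
```

The pruning is exact, never approximate: a skipped cluster provably contains no results, which is consistent with the perfect-recall claim for metric spaces.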
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count-based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset using brute force would require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
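The minwise-hashing ingredient can be illustrated with classic per-permutation MinHash; FLASH relies on faster one-pass variants, so the sketch below shows only the underlying principle that two sets share a minimum hash value with probability equal to their Jaccard similarity:

```python
import random

def minhash_signature(item_set, n_hashes=64, seed=0):
    # One salted hash per "permutation" (classic MinHash, not FLASH's
    # one-pass scheme): keep the minimum hash value of the set under each.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(n_hashes)]
    return [min(hash((salt, x)) for x in item_set) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree estimates
    # |A ∩ B| / |A ∪ B| without ever computing a set intersection.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing short signatures instead of raw sets is what lets an LSH-style index avoid explicit similarity computations over millions of dimensions.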