
    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge amounts of georeferenced data streams arrive daily at data stream management systems deployed to serve highly scalable and dynamic applications. There are innumerable ways in which these streams can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data, in the form of maps and dashboards, for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness; these are the two predominant factors that greatly impact the overall quality of service. Data stream management systems must therefore be attuned to these factors, in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we design a quality-of-service-aware system, SpatialDSMS, composed of several subsystems that cover those loads and any mixed load that results from intermixing them. Most importantly, we natively incorporate quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This is achieved transparently on top of the codebases of emerging de facto standard, best-in-class representatives, thus relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with an embedded quality guarantee, leaving the logistics to the underlying layers. We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.
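    The abstract does not give SpatialDSMS's query interface, so the following is a minimal, hypothetical sketch of the idea of compiling a quality goal into a plan decision such as a load-shedding ratio. All names, the QosGoal fields, and the shedding heuristic are assumptions of this sketch, not the system's actual API:

        from dataclasses import dataclass

        @dataclass
        class QosGoal:
            """Quality goal attached to a continuous query (hypothetical names)."""
            max_latency_ms: float   # upper bound on end-to-end latency
            min_accuracy: float     # lower bound on answer accuracy (0..1)

        def plan_query(goal: QosGoal, arrival_rate: float, capacity: float) -> dict:
            """Pick a load-shedding ratio so the effective rate fits capacity.
            Keeping the rate at or below capacity is what honours the latency
            bound in this toy model; accuracy is traded off against it."""
            keep = 1.0 if arrival_rate <= capacity else capacity / arrival_rate
            est_accuracy = keep   # crude proxy: accuracy ~ fraction of tuples kept
            if est_accuracy < goal.min_accuracy:
                raise ValueError("quality goal infeasible at current arrival rate")
            return {"shed_ratio": 1 - keep, "est_accuracy": est_accuracy}

        print(plan_query(QosGoal(max_latency_ms=100, min_accuracy=0.8),
                         arrival_rate=12_000, capacity=10_000))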

    Advances in Streaming Novelty Detection

    This thesis first addresses a confusion between terms and problems, in which the same term is used to refer to different problems and, likewise, the same problem is referred to by different terms interchangeably. This hampers progress in the field, since related literature is hard to find and work tends to be duplicated. The first contribution proposes a one-to-one assignment of terms to problems and a formalization of the learning scenarios, in an effort to standardize the field. Second, the thesis addresses the problem of Streaming Novelty Detection. In this problem, a model is first learned from a supervised dataset. The model then receives new unlabelled instances and must predict their classes online, over a stream, updating itself to cope with concept drift. In this classification scenario, new classes are assumed to emerge dynamically, so the model must be able to discover them automatically and without supervision. In this context, the thesis makes two contributions: first, a solution based on Gaussian mixtures, where each class is modelled by one of the mixture components; and second, the use of neural networks, such as Autoencoders and Deep Support Vector Data Description networks, to handle time series.
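    As a concrete illustration of the first contribution's idea (not the thesis's exact model), here is a minimal sketch in which each known class is fitted with one Gaussian component and an instance is flagged as novel when its likelihood under the mixture falls below a threshold calibrated on the training data. The synthetic data, component count, and threshold percentile are assumptions:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        # Two known classes, one Gaussian component each (assumption of this sketch).
        X_known = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
        gmm = GaussianMixture(n_components=2, random_state=0).fit(X_known)

        # Novelty threshold: a low percentile of the training log-likelihoods.
        threshold = np.percentile(gmm.score_samples(X_known), 1)

        def is_novel(x):
            """Flag a streaming instance whose likelihood falls below the threshold."""
            return gmm.score_samples(x.reshape(1, -1))[0] < threshold

        print(is_novel(np.array([0.5, -0.2])))   # near a known class -> False
        print(is_novel(np.array([20.0, 20.0])))  # far from both classes -> True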

    Next Generation Machine Learning Based Real Time Fraud Detection

    This work defines a real-time monitoring architecture that can scale as the network of monitored devices grows. From the research carried out and knowledge of the nature of the business, it was possible to develop a clustering methodology over data streams that detects patterns on entities. The methodology is based on the concept of a micro-cluster, a structure that maintains a summary of the patterns detected on entities. In telecommunications there are several schemes to defraud telecommunications companies, causing great financial losses. We can consider three major categories of telecom fraud, based on whom the fraudsters are targeting: traffic pumping schemes, defrauding telecom service providers, and fraud conducted over the telephone. Traffic pumping schemes use "access stimulation" techniques to boost traffic to a high-cost destination, which then shares the revenue with the fraudster. Defrauding telecom service providers is the most complicated category, exploiting providers through SIP trunking, regulatory loopholes, and more. Fraud conducted over the telephone, also known as "phone fraud", covers all types of general fraud perpetrated over the telephone. Telecommunications fraud negatively impacts everyone, including legitimate paying customers, and the losses increase companies' operating costs. While telecom companies take every measure to stop fraud and reduce their losses, criminals continue to attack companies with perceived weaknesses. The telecom business is facing a serious hazard growing as fast as the industry itself: the Communications Fraud Control Association (CFCA) stated that telecom fraud represented nearly $30 billion globally in 2017 [telecomengine]. Another problem is staying on top of the game with effective anti-fraud technologies. The need to ensure a secure and trustworthy Internet of Things (IoT) network brings the challenge of continuously monitoring massive volumes of streaming machine data. A different approach is therefore required for fraud detection, where detection engines need to detect risk situations in real time and adapt themselves to evolving behaviour patterns. Machine-learning-based online anomaly detection can support this new approach. For applications involving several data streams, detecting anomalies has become harder over time, as data can dynamically evolve in subtle ways following changes in the underlying infrastructure. The goal of this paper is to survey existing online anomaly detection algorithms and select a set of candidates to test in fraud detection scenarios.
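    The abstract describes the micro-cluster only as a constant-size summary, so the sketch below shows the classic cluster-feature triple (count, linear sum, squared sum) used in CluStream-style stream clustering, from which the centroid and radius are derivable. This is a standard construction from the literature, not necessarily this paper's exact structure:

        import numpy as np

        class MicroCluster:
            """Cluster-feature summary (N, linear sum, squared sum): constant
            memory per cluster, updated incrementally as instances arrive."""
            def __init__(self, dim):
                self.n = 0
                self.ls = np.zeros(dim)   # linear sum of points
                self.ss = np.zeros(dim)   # element-wise squared sum

            def insert(self, x):
                self.n += 1
                self.ls += x
                self.ss += x * x

            def centroid(self):
                return self.ls / self.n

            def radius(self):
                # RMS deviation around the centroid, derived from the summary alone
                var = self.ss / self.n - self.centroid() ** 2
                return float(np.sqrt(np.clip(var, 0, None).sum()))

        mc = MicroCluster(dim=2)
        for x in np.random.default_rng(1).normal(0, 1, (100, 2)):
            mc.insert(x)
        print(mc.centroid(), mc.radius())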

    Interactive Learning in Decision Support

    According to the Priberam dictionary of the Portuguese language, fraud can be defined as an "unlawful action, punishable by law, that seeks to deceive a person or an entity or to evade legal obligations". This topic has been gaining relevance in recent times, with new cases frequently becoming public. There is thus a continuous search for solutions that, in a first stage, prevent fraud from occurring or, when it has already occurred, detect it as quickly as possible. This poses a major challenge: first, technological evolution enables increasingly complex and effective fraud schemes, which are therefore harder to detect and stop. Moreover, data, and the information that can be extracted from them, are seen as increasingly important in the social context. Consequently, individuals and companies have started to collect and store large amounts of data of all kinds. This is the concept of Big Data: large volumes of data of different types, with different degrees of complexity, produced at different rates and coming from different sources. This, in turn, has made traditional fraud detection technologies and algorithms unviable, since they cannot process such large and diverse datasets. It is in this context that Machine Learning has been increasingly explored in search of solutions to this problem. Machine Learning systems are usually seen as fully autonomous. In recent years, however, interactive systems in which human experts actively contribute to the learning process have shown superior performance compared to fully automated ones. This holds in scenarios with large volumes of data of diverse types and origins (Big Data), scenarios in which the input is a data stream, or when the context in which the data are embedded changes, a phenomenon known as concept drift. With this in mind, this document describes a project on the use of interactive learning in decision support, addressing digital audits and, more concretely, the detection of tax fraud. The proposed solution is the development of an interactive and dynamic Machine Learning system: one of its main goals is to allow a human domain expert not only to contribute their knowledge to the system's learning process, but also to contribute new knowledge, by suggesting a new variable or a new value for an existing variable, at any time. The system must then be able to integrate the new knowledge autonomously and continue operating normally. This is, in fact, the main innovative feature of the proposed solution, since traditional Machine Learning systems do not allow it: they assume a rigid dataset structure, and any such change would require restarting the whole model training process with the new dataset.

    Machine Learning has been evolving rapidly over the past years, with new algorithms and approaches being devised to solve the challenges that the new properties of data pose. Specifically, algorithms must now learn continuously and in real time, from very large and possibly distributed datasets. Usually, Machine Learning systems are seen as something fully automatic. Recently, however, interactive systems in which human experts actively contribute towards the learning process have shown improved performance when compared to fully automated ones. This may be so in scenarios of Big Data, scenarios in which the input is a data stream, or when there is concept drift. In this paper, we present a system that learns and adapts in real time by continuously incorporating user feedback, in a fully autonomous way. Moreover, it allows users to manage variables (e.g., add, edit, remove), reflecting these changes on the fly in the Machine Learning pipeline. This paper describes the main functionalities of the system, which, despite being general-purpose, is being developed in the context of a project in the domain of financial fraud detection.
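    The abstracts do not say how the pipeline absorbs a newly suggested variable without restarting training. One standard way to get that property (an assumption of this sketch, not necessarily the project's design) is to hash dict-encoded records into a fixed-size feature space and update the model incrementally, so new variables can appear at any time:

        from sklearn.feature_extraction import FeatureHasher
        from sklearn.linear_model import SGDClassifier

        # Fixed-size hashed feature space: the model's input dimension never
        # changes, even when a brand-new variable shows up in a record.
        hasher = FeatureHasher(n_features=2**10)
        model = SGDClassifier(loss="log_loss")

        def learn(records, labels):
            model.partial_fit(hasher.transform(records), labels, classes=[0, 1])

        # Initial data uses two variables...
        learn([{"amount": 120.0, "n_invoices": 3}], [0])
        # ...later, a domain expert introduces a new variable; no training restart.
        learn([{"amount": 9500.0, "n_invoices": 40, "offshore_flag": 1.0}], [1])

        print(model.predict(hasher.transform([{"amount": 9000.0, "offshore_flag": 1.0}])))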

    Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance

    In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics, and crashes has stressed the nonstationary nature of financial markets and the likelihood of drastic structural changes in them. The most recent literature suggests using conventional machine learning and statistical approaches for this. However, these techniques are unable, or slow, to adapt to non-stationarities and may require re-training over time, which is computationally expensive and brings financial risks. This thesis proposes a set of adaptive algorithms to deal with high-frequency data streams and applies them to the financial domain. We present approaches for handling different types of concept drift and for making predictions with up-to-date models. These mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The core experiments of the thesis are based on predicting the direction of price movements at different intraday resolutions in the SPDR S&P 500 exchange-traded fund. The proposed algorithms are benchmarked against other popular methods from the data stream mining literature and achieve competitive results. We believe this thesis opens promising research prospects for financial forecasting during market instability and structural breaks. The results show that our proposed methods can improve prediction accuracy in many of these scenarios. Indeed, the results obtained are compatible with arguments against the efficient market hypothesis; however, we cannot claim to consistently beat buy-and-hold, and therefore cannot reject it.
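    The thesis's own drift-handling mechanisms are not specified in the abstract. As a representative baseline from the data stream mining literature it benchmarks against, here is a minimal DDM-style detector that signals drift when the running error rate rises three standard deviations above its historical minimum; the warm-up length, threshold, and synthetic error stream are assumptions of this sketch:

        import numpy as np

        class DriftDetector:
            """DDM-style detector: flags drift when the running error rate rises
            well above its historical minimum (a standard stream-mining baseline,
            not this thesis's exact algorithm)."""
            def __init__(self, warmup=30):
                self.warmup = warmup
                self.n = 0
                self.errors = 0
                self.p_min = float("inf")
                self.s_min = float("inf")

            def update(self, mistake: bool) -> bool:
                self.n += 1
                self.errors += int(mistake)
                p = self.errors / self.n              # running error rate
                s = np.sqrt(p * (1 - p) / self.n)     # its standard deviation
                if self.n < self.warmup:
                    return False
                if p + s < self.p_min + self.s_min:   # track the best regime seen
                    self.p_min, self.s_min = p, s
                return p + s > self.p_min + 3 * self.s_min   # 3-sigma drift rule

        det = DriftDetector()
        rng = np.random.default_rng(2)
        for t in range(2000):
            err_rate = 0.1 if t < 1000 else 0.4       # abrupt regime change
            if det.update(rng.random() < err_rate):
                print(f"drift detected at t={t}: retrain on recent data only")
                break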

    A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams

    Unlabelled data appear in many domains and are particularly relevant to streaming applications, where, even though data is abundant, labelled data is rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leverage the unlabelled data (semi-supervised learning); or assume some labels will be available on request (active learning). The first approach is the simplest, yet the amount of labelled data available will limit the predictive performance. The second relies on finding and exploiting the underlying characteristics of the data distribution. The third depends on an external agent to provide the required labels in a timely fashion. This survey pays special attention to methods that leverage unlabelled data in a semi-supervised setting. We also discuss the delayed labelling issue, which impacts both fully supervised and semi-supervised methods. We propose a unified problem setting, discuss the learning guarantees and existing methods, and explain the differences between related problem settings. Finally, we review current benchmarking practices and propose adaptations to enhance them.
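    A common semi-supervised baseline in this setting is self-training: warm-start on the small labelled portion, then pseudo-label only high-confidence stream instances and update the model incrementally. The sketch below shows the idea; the confidence threshold, synthetic data, and model choice are illustrative assumptions, not the survey's recommendation:

        import numpy as np
        from sklearn.linear_model import SGDClassifier

        rng = np.random.default_rng(3)
        model = SGDClassifier(loss="log_loss")

        # Warm start on a small labelled batch, as in the survey's problem setting.
        X0 = rng.normal(0, 1, (50, 5))
        y0 = (X0[:, 0] > 0).astype(int)
        model.partial_fit(X0, y0, classes=[0, 1])

        for _ in range(1000):                      # unlabelled stream
            x = rng.normal(0, 1, (1, 5))
            proba = model.predict_proba(x)[0]
            if proba.max() > 0.95:                 # confident -> pseudo-label
                model.partial_fit(x, [model.classes_[proba.argmax()]])

        print("coef on the informative feature:", model.coef_[0][0])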