60 research outputs found

    Active Learning from Knowledge-Rich Data

    With the ever-increasing demand for quality and quantity in training samples, it is difficult to replicate the success of modern machine learning models in knowledge-rich domains, where the labeled data for training is scarce and labeling new data is expensive. While machine learning and AI have achieved significant progress in many common domains, the lack of large-scale labeled data samples poses a grand challenge for the wide application of advanced statistical learning models in key knowledge-rich domains, such as medicine, biology, physical science, and more. Active learning (AL) offers a promising and powerful learning paradigm that can significantly reduce the data-annotation stress by allowing the model to only sample the informative objects to learn from human experts. Previous AL models leverage simple criteria to explore the data space and achieve fast convergence of AL. However, those active sampling methods are less effective in exploring knowledge-rich data spaces and result in slow convergence of AL. In this thesis, we propose novel AL methods to address knowledge-rich data exploration challenges with respect to different types of machine learning tasks. Specifically, for multi-class tasks, we propose three approaches that leverage different types of sparse kernel machines to better capture the data covariance and use them to guide effective data exploration in a complex feature space. For multi-label tasks, it is essential to capture label correlations, and we model them in three different approaches to guide effective data exploration in a large and correlated label space. For data exploration in a very high-dimensional feature space, we present novel uncertainty measures to better control the exploration behavior of deep learning models and leverage a uniquely designed regularizer to achieve effective exploration in high-dimensional space.
Our proposed models not only exhibit good exploration behavior for different types of knowledge-rich data but also manage to achieve an optimal exploration-exploitation balance with strong theoretical underpinnings. In the end, we study active learning in a more realistic scenario where human annotators provide noisy labels. We propose a re-sampling paradigm that leverages the machine's awareness to reduce the noise rate. We theoretically prove the effectiveness of the re-sampling paradigm and design a novel spatial-temporal active re-sampling function by leveraging the critical spatial and temporal properties of the maximum-margin kernel classifiers.
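
As an illustration of the kind of simple sampling criterion the abstract attributes to earlier AL models, the sketch below ranks unlabeled pool items by predictive entropy and queries the most uncertain ones. This is a generic textbook baseline, not one of the thesis's proposed methods; the function names and the toy probabilities are assumptions made for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(pool_probs, batch_size):
    """Return indices of the `batch_size` pool items with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Toy predicted class probabilities for four unlabeled items (hypothetical values).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # close to uniform, highest entropy
]
queries = select_queries(pool_probs, batch_size=2)
```

The selected items would then be sent to a human expert for labeling and added to the training set before the model is retrained.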

    Artificial intelligence driven anomaly detection for big data systems

    The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. Late detection and manual resolution of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. Therefore, new precise and efficient performance management methods are the key to handling performance anomalies and interference impacts to improve the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of selecting the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on the RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning (ML) algorithms, as well as against four different monitoring datasets. The results prove that our proposed method outperforms other ML methods, typically achieving 98–99% F-scores. Moreover, we prove that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques.
We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to achieve high accuracy. The objective is to accelerate the search process for finding the size of the training dataset, optimizing neural network configurations, and improving the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system is performed, demonstrating that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments by up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting the completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within the system. Our interference detection model can estimate and alleviate the task slowdown caused by the interference. This model assists the system operators in making an accurate decision to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within the cloud environments.
We compare our model with three other baseline models (queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation of 4500 experiments using the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model by achieving up to 10% MAPE compared with the other models.
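
For readers unfamiliar with the metric, MAPE (mean absolute percentage error) averages the absolute prediction error relative to each measured value. The sketch below shows the standard definition applied to completion times; the function name and the sample numbers are hypothetical and are not taken from the thesis.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero, as is the case for
    measured job completion times.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured and predicted completion times, in seconds.
actual_seconds = [100.0, 200.0, 400.0]
predicted_seconds = [110.0, 180.0, 400.0]
error = mape(actual_seconds, predicted_seconds)
```

Here two predictions are off by 10% each and one is exact, so the result is roughly 6.67%.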

    Using contextual information to understand searching and browsing behavior

    There is a great imbalance between the richness of information on the web and the succinctness and poverty of the search requests of web users, making their queries only a partial description of the underlying complex information needs. Finding ways to better leverage contextual information and make search context-aware holds the promise to dramatically improve the search experience of users. We conducted a series of studies to discover, model and utilize contextual information in order to understand and improve users' searching and browsing behavior on the web. Our results capture important aspects of context under the realistic conditions of different online search services, aiming to ensure that our scientific insights and solutions transfer to the operational settings of real-world applications.

    Combination of web usage, content and structure information for diverse web mining applications in the tourism context and the context of users with disabilities

    188 p. This PhD focuses on the application of machine learning techniques for behaviour modelling in different types of websites. Using data mining techniques, two aspects which are problematic and difficult to solve have been addressed: getting the system to dynamically adapt to possible changes of user preferences, and trying to extract the information necessary to ensure the adaptation in a transparent manner for the users, without infringing on their privacy. The work in question combines information of different nature, such as usage information, content information and website structure, and uses appropriate web mining techniques to extract as much knowledge as possible from the websites. The extracted knowledge is used for different purposes, such as adapting websites to the users through proposals of interesting links, so that the users can get the relevant information more easily and comfortably; discovering interests or needs of users accessing the website and informing the service providers about them; or detecting problems during navigation. Systems have been successfully generated for two completely different fields: the field of tourism, working with the website of bidasoa turismo (www.bidasoaturismo.com), and the field of disabled people, working with the discapnet website (www.discapnet.com) from the ONCE/Tecnosite foundation.

    A Systematical Study on Application Performance Management Libraries for Apps

    Being able to automatically detect the performance issues in apps will significantly improve their quality as well as have a positive influence on user satisfaction. Although app developers have been exploiting application performance management (APM) tools to capture these potential performance issues, most of them do not fully understand the internals of these APM tools and the effect on their apps, such as security risks. To fill this gap, in this paper, we conduct the first systematic study on APMs for apps by scrutinizing 25 widely-used APMs for Android apps and develop a framework named APMHunter for exploring the usage of APMs in Android apps. Using APMHunter, we conduct a large-scale empirical study on 500,000 Android apps to explore the usage patterns of APMs and discover the potential misuses of APMs. We obtain two major findings: 1) some APMs still employ deprecated permissions and approaches, which causes the APMs to malfunction rather than perform as expected; 2) inappropriate utilization of APMs will cause privacy leakages. Thus, our study suggests that both APM vendors and developers should design and use APMs scrupulously.

    Cancer theranostics: multifunctional gold nanoparticles for diagnostics and therapy

    Doctorate in Biology, Specialty in Biotechnology. The use of gold nanoparticles (AuNPs) has been gaining momentum in molecular diagnostics due to their unique physico-chemical properties. These systems present huge advantages, such as increased sensitivity, reduced cost and potential for single-molecule characterisation. Because of their versatility and ease of functionalisation, multifunctional AuNPs have also been proposed as optimal delivery systems for therapy (nanovectors). Being able to produce such systems would mean the dawn of a new age in theranostics (diagnostics and therapy) driven by nanotechnology vehicles. Nanotechnology can be exploited for cancer theranostics via the development of diagnostics systems such as colorimetric assays and immunoassays, and in therapy approaches through gene therapy, drug delivery and tumour targeting systems. The unique characteristics of nanoparticles in the nanometre range, such as high surface-to-volume ratio or shape/size-dependent optical properties, are drastically different from those of their bulk materials and hold promise in the clinical field for disease therapeutics. This PhD project intends to optimise a gold-nanoparticle based technique for the detection of oncogenes’ transcripts (c-Myc and BCR-ABL) that can be used for the evaluation of the expression profile in cancer cells, while simultaneously developing an innovative platform of multifunctional gold nanoparticles (tumour markers, cell penetrating peptides, fluorescent dyes) loaded with siRNA capable of silencing the selected proto-oncogenes, which can be used to evaluate the level of expression and determine the efficiency of silencing. This work is a part of an ongoing collaboration between Research Centre for Human Molecular Genetics, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal and Biofunctional Nanoparticles and Surfaces Group, Instituto de Nanociencia de Aragón, Spain within a European project [NanoScieE+ - NANOTRUCK].
In order to achieve this goal we developed effective conjugation strategies to combine, in a highly controlled way, biomolecules to the surface of AuNPs with specific functions such as: ssDNA oligos to detect specific sequences and for mRNA quantification; Biofunctional spacers: Poly(ethylene glycol) (PEG) spacers used to increase solubility and biocompatibility and confer chemical functionality; Cell penetrating peptides: to overcome the lipophilic barrier of the cellular membranes and deliver molecules into cells, using TAT peptide to reach the cytoplasm and nucleus; Quaternary ammonium: to introduce a stable positive charge on the gold nanoparticle surface; and RNA interference: siRNA complementary to a master regulator gene, the proto-oncogene c-Myc, that is implicated in cell growth, proliferation, loss of differentiation, and cell death. In order to establish that they are viable alternatives to the available methods, these innovative nanoparticles were extensively characterized on their chemical functionalization, ease of uptake, cellular toxicity and inflammation, and knockdown of MYC protein expression in several cancer cell lines and in in vivo models. Funding: Fundação para a Ciência e Tecnologia - (SFRH/BD/62957/2009); PTDC/BIO/66514/2006; NANOLIGHT-PTDC/QUI-QUI/112597/2009; Silencing the silencers via multifunctional gold nanoconjugates towards cancer therapy - PTDC/BBB-NAN/1812/201