
    Feature selection, optimization and clustering strategies of text documents

    Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering spans a wide range of sectors, including consumer segmentation, categorization, collaborative filtering, document management, and indexing. Clustering must be studied carefully before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numeric. Efforts have also been made toward efficient clustering of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent clustering models along with the pros and cons of each. In addition, it focuses on the latest developments in clustering for social networks and associated environments.

    Identifying the most informative features using a structurally interacting elastic net

    Feature selection can efficiently identify the most informative features with respect to the target feature used in training. However, state-of-the-art vector-based methods are unable to encapsulate the relationships between feature samples in the feature selection process, leading to significant information loss. To address this problem, we propose a new graph-based structurally interacting elastic net method for feature selection. Specifically, we commence by constructing feature graphs that can incorporate pairwise relationships between samples. With the feature graphs in hand, we propose a new information-theoretic criterion to measure the joint relevance of different pairwise feature combinations with respect to the target feature graph representation. This measure is used to obtain a structural interaction matrix whose elements represent the proposed information-theoretic measure between feature pairs. We then formulate a new optimization model through the combination of the structural interaction matrix and an elastic net regression model for the feature subset selection problem. This allows us to (a) preserve the information of the original vectorial space, (b) remedy the information loss of the original feature space caused by using a graph representation, and (c) promote a sparse solution while encouraging correlated features to be selected. Because the proposed optimization problem is non-convex, we develop an efficient alternating direction method of multipliers (ADMM) to locate the optimal solutions. Extensive experiments on various datasets demonstrate the effectiveness of the proposed method.
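    The core idea of selecting features via the sparsity pattern of a penalized regression can be illustrated with a plain elastic net baseline. This is a minimal sketch, not the paper's graph-based structural interaction method: it assumes scikit-learn's `ElasticNet` and synthetic data where only the first three features are informative; features with nonzero coefficients form the selected subset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data: 100 samples, 20 features, only features 0-2 are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Elastic net combines an L1 penalty (sparsity) with an L2 penalty
# (which encourages correlated features to be selected together).
model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)

# Features with exactly-zero coefficients are dropped; the rest are selected.
selected = np.flatnonzero(model.coef_)
print(selected)
```

    The paper's contribution is to weight this penalty with an information-theoretic structural interaction matrix; the sketch above only shows the sparse-selection mechanism the elastic net contributes.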

    Integration and visualisation of clinical-omics datasets for medical knowledge discovery

    In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult. Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration. Modern omics datasets are extremely feature rich and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical metadata driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research. Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. 
    To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.
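    One common way to realise metadata-driven feature selection is to rank omics features by their association with a clinical variable. The sketch below is an illustrative assumption, not the thesis's actual pipeline: it uses scikit-learn's ANOVA F-test (`SelectKBest` with `f_classif`) on a toy matrix where a hypothetical binary case/control label drives the first five features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy "omics" matrix: 60 samples x 500 features; a hypothetical binary
# clinical label (e.g. case/control) shifts the first 5 features.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 500))
X[:, :5] += labels[:, None] * 2.0  # informative features differ by class

# Keep only the k features most associated with the clinical variable,
# narrowing a feature-rich dataset down to a tractable subset.
selector = SelectKBest(f_classif, k=10).fit(X, labels)
top = np.sort(selector.get_support(indices=True))
print(top)
```

    In a multi-omics setting the same ranking step can be applied per platform before integration, so that downstream analysis and visualisation focus on clinically relevant features.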

    Titanic smart objects


    Artificial Intelligence for Spectral Analysis: a Comprehensive Framework

    Spectral analysis is used in diverse academic and industrial domains to extract relevant elemental information. Qualitative analysis requires accurate identification of the elements present, while quantitative analysis requires precise determination of the concentrations of all relevant elements. Although current commercial approaches can deliver excellent element-quantification results, they still face limits: high computation time (especially for complex tasks), labour-intensive manual element identification, and substantial instrument-calibration costs. This dissertation designs a comprehensive neural-network-based system for large-scale spectral analysis. To establish a new and appropriate baseline covering most common elements (up to 28), extensive experiments are conducted to investigate the required training-data size, select suitable network architectures, and analyse problem-specific configurations. On quantification tasks, the presented approach matches the error rate of classical methods while running more than 400 times faster. For qualitative analysis, element classification is automated with an excellent accuracy of over 99% on real measurements, while the dimensionality of the input data is strongly reduced in an interpretable way. Furthermore, neural networks typically demand large compute and memory resources, so deployment can face latency, memory-footprint, and power-consumption problems, especially on low-power end devices.
    To solve this problem, a hybrid approach was developed that optimises and accelerates neural-network execution while preserving final performance. Results on various target hardware platforms show that, in most cases, this hybrid approach achieves up to 52x model-size compression and a 600x speed-up with even better performance, enabling low-cost deployment on edge devices. Finally, to clear the last hurdle on the way to large-scale industrial deployment across many devices, the calibration problem, a meta-learning-based approach is developed that achieves excellent calibration results at minimal cost by teaching neural networks to calibrate. The general spectral-analysis problem is formulated as a multi-device, multi-configuration task, and the approach achieves the best error rate before and after calibration on various unseen devices. Compared with the baseline approaches with calibration, it performs equally well even without calibration, which is very practical in a real-world scenario where an unknown device must be deployed without reference samples available for calibration. Moreover, the resource analysis shows that the approach requires significantly fewer resources for industrial deployment, contributing to enormous savings and growth potential.
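    The quantification task can be pictured as regressing element concentrations directly from spectra. The sketch below is a heavily simplified illustration under stated assumptions, not the dissertation's system: it uses scikit-learn's `MLPRegressor` and a synthetic single-line spectrum whose peak height scales with concentration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
wavelengths = np.linspace(0, 1, 64)

def spectrum(conc):
    # One Gaussian emission line whose height scales with concentration,
    # plus measurement noise -- a stand-in for a real calibration spectrum.
    peak = conc * np.exp(-((wavelengths - 0.5) ** 2) / 0.002)
    return peak + rng.normal(scale=0.01, size=wavelengths.size)

# Training set: spectra paired with their known concentrations.
concs = rng.uniform(0.1, 1.0, size=300)
X = np.stack([spectrum(c) for c in concs])

net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X, concs)

# Predict the concentration of an unseen spectrum.
pred = net.predict(spectrum(0.6).reshape(1, -1))[0]
print(round(pred, 2))
```

    A single forward pass replaces an iterative classical fit, which is the source of the large speed-ups the dissertation reports; real spectra, of course, contain many overlapping lines per element.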

    Learning from noisy data through robust feature selection, ensembles and simulation-based optimization

    The presence of noise and uncertainty in real scenarios makes machine learning a challenging task. Acquisition errors or missing values can lead to models that do not generalize well on new data. Under-fitting and over-fitting can occur because of feature redundancy in high-dimensional problems as well as data scarcity. In these contexts, the learning task can struggle to extract relevant and stable information from noisy features or from a limited set of samples with high variance. In some extreme cases, the availability of only aggregated data instead of individual samples prevents the use of instance-based learning; parametric models can then be learned through simulations to take into account the inherent stochastic nature of the processes involved. This dissertation includes contributions to different learning problems characterized by noise and uncertainty. In particular, we propose (i) a novel approach for robust feature selection based on the neighborhood entropy, (ii) an ensemble-based approach for robust salary prediction in the IT job market, and (iii) a parametric simulation-based approach for dynamic pricing and what-if analyses in hotel revenue management when only aggregated data are available.
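    One plausible reading of a neighborhood-entropy relevance score is: for each feature, measure how mixed the class labels are among each sample's nearest neighbors along that feature, with lower entropy meaning the feature separates classes more cleanly. The sketch below is an illustrative interpretation of that idea, not the dissertation's actual criterion, and assumes scikit-learn's `NearestNeighbors`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_entropy(feature, labels, k=5):
    """Mean entropy of class labels among each sample's k nearest
    neighbours along a single feature; lower = more discriminative."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feature.reshape(-1, 1))
    _, idx = nn.kneighbors(feature.reshape(-1, 1))
    entropies = []
    for neigh in idx[:, 1:]:  # drop the sample itself
        counts = np.bincount(labels[neigh], minlength=2)
        p = counts / counts.sum()
        p = p[p > 0]
        entropies.append(-(p * np.log2(p)).sum())
    return float(np.mean(entropies))

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=200)
informative = labels + rng.normal(scale=0.2, size=200)  # separates classes
noisy = rng.normal(size=200)                            # unrelated to labels

print(neighborhood_entropy(informative, labels),
      neighborhood_entropy(noisy, labels))
```

    Ranking features by this score and keeping the lowest-entropy ones gives a selection rule that is robust to a few noisy samples, since the entropy averages over every sample's neighborhood.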

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose column-reduction methods such as multi-time-scale time-series analysis for fraud forecasting, using binary-labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches.
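    The idea behind combining oversampling with noise filtering, point (iii) above, can be sketched in miniature. This is a simplified illustration, not KSMOTE itself: it generates SMOTE-style synthetic minority samples by interpolating between minority neighbours, then discards synthetic points whose nearest neighbours in the full dataset are mostly majority samples, on the assumption that such points landed in the wrong region.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)

# Imbalanced toy data: 200 majority (label 0) vs 20 minority (label 1) samples.
maj = rng.normal(loc=0.0, size=(200, 2))
mino = rng.normal(loc=3.0, scale=0.5, size=(20, 2))

# SMOTE-style oversampling: interpolate between minority neighbours.
nn = NearestNeighbors(n_neighbors=4).fit(mino)
_, idx = nn.kneighbors(mino)
synth = []
for i in range(180):
    a = mino[i % len(mino)]
    b = mino[idx[i % len(mino), rng.integers(1, 4)]]  # a random neighbour
    synth.append(a + rng.uniform() * (b - a))
synth = np.asarray(synth)

# Noise filter: drop synthetic points whose nearest neighbours in the
# full data are mostly majority samples (likely in the wrong region).
X_all = np.vstack([maj, mino])
y_all = np.array([0] * 200 + [1] * 20)
check = NearestNeighbors(n_neighbors=5).fit(X_all)
_, nidx = check.kneighbors(synth)
keep = synth[(y_all[nidx] == 1).mean(axis=1) >= 0.5]
print(len(synth), len(keep))
```

    Filtering both the raw data and the synthetic samples in this way yields a balanced training set that is less likely to teach the classifier from mislabeled or boundary-crossing points.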