
    Feature selection, optimization and clustering strategies of text documents

    Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering spans a wide range of sectors, including consumer segmentation, categorization, collaborative filtering, document management, and indexing. Clustering must be studied carefully before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numeric. Efforts have also been made toward efficient clustering of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent clustering models along with the pros and cons of each. In addition, it focuses on the latest developments in clustering for social networks and associated environments.

    Identifying the most informative features using a structurally interacting elastic net

    Feature selection can efficiently identify the most informative features with respect to the target feature used in training. However, state-of-the-art vector-based methods are unable to encapsulate the relationships between feature samples in the feature selection process, leading to significant information loss. To address this problem, we propose a new graph-based structurally interacting elastic net method for feature selection. Specifically, we commence by constructing feature graphs that can incorporate pairwise relationships between samples. With the feature graphs in hand, we propose a new information-theoretic criterion to measure the joint relevance of different pairwise feature combinations with respect to the target feature graph representation. This measure is used to obtain a structural interaction matrix whose elements represent the proposed information-theoretic measure between feature pairs. We then formulate a new optimization model through the combination of the structural interaction matrix and an elastic net regression model for the feature subset selection problem. This allows us to (a) preserve the information of the original vectorial space, (b) remedy the information loss of the original feature space caused by using a graph representation, and (c) promote a sparse solution while encouraging correlated features to be selected. Because the proposed optimization problem is non-convex, we develop an efficient alternating direction method of multipliers (ADMM) to locate the optimal solutions. Extensive experiments on various datasets demonstrate the effectiveness of the proposed method.
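    The core idea of selecting features via the sparsity pattern of a penalized regression can be illustrated with a plain elastic net baseline. This is a minimal sketch, not the paper's graph-based structural interaction method: it assumes scikit-learn's `ElasticNet` and synthetic data where only the first three features are informative; features with nonzero coefficients form the selected subset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data: 100 samples, 20 features, only features 0-2 are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Elastic net combines an L1 penalty (sparsity) with an L2 penalty
# (which encourages correlated features to be selected together).
model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)

# Features with exactly-zero coefficients are dropped; the rest are selected.
selected = np.flatnonzero(model.coef_)
print(selected)
```

    The paper's contribution is to weight this penalty with an information-theoretic structural interaction matrix; the sketch above only shows the sparse-selection mechanism the elastic net contributes.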

    Integration and visualisation of clinical-omics datasets for medical knowledge discovery

    In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult. Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration. Modern omics datasets are extremely feature rich and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical metadata driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research. Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. 
    To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.
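    One common way to realise metadata-driven feature selection is to rank omics features by their association with a clinical variable. The sketch below is an illustrative assumption, not the thesis's actual pipeline: it uses scikit-learn's ANOVA F-test (`SelectKBest` with `f_classif`) on a toy matrix where a hypothetical binary case/control label drives the first five features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy "omics" matrix: 60 samples x 500 features; a hypothetical binary
# clinical label (e.g. case/control) shifts the first 5 features.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 500))
X[:, :5] += labels[:, None] * 2.0  # informative features differ by class

# Keep only the k features most associated with the clinical variable,
# narrowing a feature-rich dataset down to a tractable subset.
selector = SelectKBest(f_classif, k=10).fit(X, labels)
top = np.sort(selector.get_support(indices=True))
print(top)
```

    In a multi-omics setting the same ranking step can be applied per platform before integration, so that downstream analysis and visualisation focus on clinically relevant features.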

    Titanic smart objects


    Artificial Intelligence for Spectral Analysis: a Comprehensive Framework

    Spectral analysis is used in diverse academic and industrial domains to extract relevant elemental information. Qualitative analysis requires accurate identification of the elements present, while quantitative analysis requires precise determination of the concentrations of all relevant elements. Although current commercial approaches can deliver excellent element-quantification results, they still face limits: high computation time (especially for complex tasks), labour-intensive manual element identification, and substantial instrument-calibration costs. This dissertation designs a comprehensive neural-network-based system for large-scale spectral analysis. To establish a new and appropriate baseline covering most common elements (up to 28), extensive experiments are conducted to investigate the required training-data size, select suitable network architectures, and analyse problem-specific configurations. On quantification tasks, the presented approach matches the error rate of classical methods while running more than 400 times faster. For qualitative analysis, element classification is automated with an excellent accuracy of over 99% on real measurements, while the dimensionality of the input data is strongly reduced in an interpretable way. Furthermore, neural networks typically demand large compute and memory resources, so deployment can face latency, memory-footprint, and power-consumption problems, especially on low-power end devices.
    To solve this problem, a hybrid approach was developed that optimises and accelerates neural-network execution while preserving final performance. Results on various target hardware platforms show that, in most cases, this hybrid approach achieves up to 52x model-size compression and a 600x speed-up with even better performance, enabling low-cost deployment on edge devices. Finally, to clear the last hurdle on the way to large-scale industrial deployment across many devices, the calibration problem, a meta-learning-based approach is developed that achieves excellent calibration results at minimal cost by teaching neural networks to calibrate. The general spectral-analysis problem is formulated as a multi-device, multi-configuration task, and the approach achieves the best error rate before and after calibration on various unseen devices. Compared with the baseline approaches with calibration, it performs equally well even without calibration, which is very practical in a real-world scenario where an unknown device must be deployed without reference samples available for calibration. Moreover, the resource analysis shows that the approach requires significantly fewer resources for industrial deployment, contributing to enormous savings and growth potential.
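    The quantification task can be pictured as regressing element concentrations directly from spectra. The sketch below is a heavily simplified illustration under stated assumptions, not the dissertation's system: it uses scikit-learn's `MLPRegressor` and a synthetic single-line spectrum whose peak height scales with concentration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
wavelengths = np.linspace(0, 1, 64)

def spectrum(conc):
    # One Gaussian emission line whose height scales with concentration,
    # plus measurement noise -- a stand-in for a real calibration spectrum.
    peak = conc * np.exp(-((wavelengths - 0.5) ** 2) / 0.002)
    return peak + rng.normal(scale=0.01, size=wavelengths.size)

# Training set: spectra paired with their known concentrations.
concs = rng.uniform(0.1, 1.0, size=300)
X = np.stack([spectrum(c) for c in concs])

net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X, concs)

# Predict the concentration of an unseen spectrum.
pred = net.predict(spectrum(0.6).reshape(1, -1))[0]
print(round(pred, 2))
```

    A single forward pass replaces an iterative classical fit, which is the source of the large speed-ups the dissertation reports; real spectra, of course, contain many overlapping lines per element.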

    Learning from noisy data through robust feature selection, ensembles and simulation-based optimization

    The presence of noise and uncertainty in real scenarios makes machine learning a challenging task. Acquisition errors or missing values can lead to models that do not generalize well on new data. Under-fitting and over-fitting can occur because of feature redundancy in high-dimensional problems as well as data scarcity. In these contexts, the learning task can struggle to extract relevant and stable information from noisy features or from a limited set of samples with high variance. In some extreme cases, the availability of only aggregated data instead of individual samples prevents the use of instance-based learning; parametric models can then be learned through simulations to take into account the inherent stochastic nature of the processes involved. This dissertation includes contributions to different learning problems characterized by noise and uncertainty. In particular, we propose (i) a novel approach for robust feature selection based on the neighborhood entropy, (ii) an ensemble-based approach for robust salary prediction in the IT job market, and (iii) a parametric simulation-based approach for dynamic pricing and what-if analyses in hotel revenue management when only aggregated data are available.
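    One plausible reading of a neighborhood-entropy relevance score is: for each feature, measure how mixed the class labels are among each sample's nearest neighbors along that feature, with lower entropy meaning the feature separates classes more cleanly. The sketch below is an illustrative interpretation of that idea, not the dissertation's actual criterion, and assumes scikit-learn's `NearestNeighbors`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_entropy(feature, labels, k=5):
    """Mean entropy of class labels among each sample's k nearest
    neighbours along a single feature; lower = more discriminative."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feature.reshape(-1, 1))
    _, idx = nn.kneighbors(feature.reshape(-1, 1))
    entropies = []
    for neigh in idx[:, 1:]:  # drop the sample itself
        counts = np.bincount(labels[neigh], minlength=2)
        p = counts / counts.sum()
        p = p[p > 0]
        entropies.append(-(p * np.log2(p)).sum())
    return float(np.mean(entropies))

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=200)
informative = labels + rng.normal(scale=0.2, size=200)  # separates classes
noisy = rng.normal(size=200)                            # unrelated to labels

print(neighborhood_entropy(informative, labels),
      neighborhood_entropy(noisy, labels))
```

    Ranking features by this score and keeping the lowest-entropy ones gives a selection rule that is robust to a few noisy samples, since the entropy averages over every sample's neighborhood.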

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose column-reduction methods such as multi-time-scale time-series analysis for fraud forecasting, using binary-labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches.
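    The idea behind combining oversampling with noise filtering, point (iii) above, can be sketched in miniature. This is a simplified illustration, not KSMOTE itself: it generates SMOTE-style synthetic minority samples by interpolating between minority neighbours, then discards synthetic points whose nearest neighbours in the full dataset are mostly majority samples, on the assumption that such points landed in the wrong region.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)

# Imbalanced toy data: 200 majority (label 0) vs 20 minority (label 1) samples.
maj = rng.normal(loc=0.0, size=(200, 2))
mino = rng.normal(loc=3.0, scale=0.5, size=(20, 2))

# SMOTE-style oversampling: interpolate between minority neighbours.
nn = NearestNeighbors(n_neighbors=4).fit(mino)
_, idx = nn.kneighbors(mino)
synth = []
for i in range(180):
    a = mino[i % len(mino)]
    b = mino[idx[i % len(mino), rng.integers(1, 4)]]  # a random neighbour
    synth.append(a + rng.uniform() * (b - a))
synth = np.asarray(synth)

# Noise filter: drop synthetic points whose nearest neighbours in the
# full data are mostly majority samples (likely in the wrong region).
X_all = np.vstack([maj, mino])
y_all = np.array([0] * 200 + [1] * 20)
check = NearestNeighbors(n_neighbors=5).fit(X_all)
_, nidx = check.kneighbors(synth)
keep = synth[(y_all[nidx] == 1).mean(axis=1) >= 0.5]
print(len(synth), len(keep))
```

    Filtering both the raw data and the synthetic samples in this way yields a balanced training set that is less likely to teach the classifier from mislabeled or boundary-crossing points.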