719 research outputs found

Crosslingual Document Embedding as Reduced-Rank Ridge Regression

Author: Jaggi Martin
Josifoski Martin
Paskov Hristo S.
Paskov Ivan S.
West Robert
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/02/2019

There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-words features to predict the concept that a given document is about. We show that, when the learned weight matrix is constrained to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.

Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)
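
The low-rank factoring described in the abstract above can be illustrated with generic reduced-rank ridge regression. The following is a minimal sketch, not the authors' Cr5 implementation: it assumes a single feature block rather than one block per language, uses random toy data, and the names `reduced_rank_ridge`, `embed`, and `decode` are hypothetical. It relies on the standard result that the rank-constrained solution is the ridge solution projected onto the top right singular vectors of the ridge fitted values.

```python
import numpy as np


def reduced_rank_ridge(X, Y, rank, lam):
    """Rank-constrained ridge regression via an SVD of the ridge fitted values."""
    n_features = X.shape[1]
    # Augmenting the data turns the ridge penalty into ordinary least squares.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    Y_aug = np.vstack([Y, np.zeros((n_features, Y.shape[1]))])
    # X_aug.T @ X_aug equals X.T @ X + lam * I, so this is the ridge solution.
    W_ridge = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y_aug)
    # The top right singular vectors of the fitted values span the best rank-r subspace.
    _, _, Vt = np.linalg.svd(X_aug @ W_ridge, full_matrices=False)
    V_r = Vt[:rank].T
    embed = W_ridge @ V_r   # maps bag-of-words features to an r-dimensional embedding
    decode = V_r.T          # maps the embedding to concept scores
    return embed, decode


# Toy data (hypothetical): 40 documents, 12 vocabulary features, 8 concepts.
rng = np.random.default_rng(0)
X = rng.random((40, 12))
Y = np.eye(8)[rng.integers(0, 8, size=40)]  # one-hot concept labels

embed, decode = reduced_rank_ridge(X, Y, rank=3, lam=1.0)
doc_vectors = X @ embed  # low-dimensional document embeddings
```

The product `embed @ decode` is the rank-constrained weight matrix; only `embed` is needed to place a document in the shared space.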

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Impact of the learners diversity and combination method on the generation of heterogeneous classifier ensembles

Author: Iglesias Martínez José Antonio
Ledezma Espino Agapito Ismael
Magan Lopez Elena
Sanchis de Miguel María Araceli
Sesmero Lorente María Paz
Publication venue: 'Elsevier BV'
Publication date: 15/07/2021

Ensembles of classifiers are a proven approach in machine learning, supported by a wide variety of research works. The main issue in ensembles of classifiers is not only the selection of the base classifiers, but also the combination of their outputs. According to the literature, much is to be gained from combining classifiers if those classifiers are accurate and diverse. However, it is still an open issue how to define the relation between accuracy and diversity in order to define the best possible ensemble of classifiers. In this paper, we propose a novel approach to evaluate the impact of the diversity of the learners on the generation of heterogeneous ensembles. We present an exhaustive study of this approach using 27 different multiclass datasets and analyse the results in detail. In addition, to determine the performance of the different results, the presence of labelling noise is also considered.

This work has been supported under projects PEAVAUTO-CM-UC3M–2020/00036/001, PID2019-104793RB-C31, and RTI2018-096036-B-C22, and by the Region of Madrid’s Excellence Program, Spain (EPUC3M17).

Universidad Carlos III de Madrid e-Archivo

Network Intrusion Detection with Two-Phased Hybrid Ensemble Learning and Automatic Feature Selection

Author: Chung Sunnie S.
Mananayaka Asanka Kavinda
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2023

The use of network-connected devices has grown exponentially in recent years, revolutionizing our daily lives. However, it has also attracted the attention of cybercriminals, so that attacks targeting these devices have increased not only in number but also in sophistication. To detect such attacks, a Network Intrusion Detection System (NIDS) has become a vital component in network applications. However, network devices produce large-scale, high-dimensional data, which makes it difficult to accurately detect various known and unknown attacks. Moreover, the complex nature of network data makes the feature selection process of a NIDS a challenging task. In this study, we propose a machine-learning-based NIDS with Two-phased Hybrid Ensemble learning and Automatic Feature Selection. The proposed framework leverages four different machine learning classifiers to perform automatic feature selection based on their ability to detect the most significant features. The two-phased hybrid ensemble learning algorithm consists of two learning phases, with the first phase constructed using classifiers built from an adaptation of the One-vs-One framework, and the second phase constructed using classifiers built from combinations of attack classes. The proposed framework was evaluated on two well-referenced datasets for both wired and wireless applications, and the results demonstrate that the two-phased ensemble learning framework combined with the automatic feature selection engine has superior attack detection capability compared to other similar studies found in the literature.
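
The One-vs-One idea mentioned in the abstract above can be sketched as follows. This shows only the textbook scheme (one binary classifier per pair of classes, combined by voting), not the paper's adaptation or its second phase. The one-dimensional threshold learner, the class names, and the toy data are all hypothetical stand-ins.

```python
from collections import Counter
from itertools import combinations


def class_mean(data, label):
    """Mean feature value of all examples carrying the given label."""
    values = [x for x, y in data if y == label]
    return sum(values) / len(values)


def train_pairwise(data, a, b):
    """Hypothetical binary learner: threshold halfway between two class means."""
    mean_a, mean_b = class_mean(data, a), class_mean(data, b)
    threshold = (mean_a + mean_b) / 2
    low, high = (a, b) if mean_a <= mean_b else (b, a)
    return lambda x: low if x <= threshold else high


def train_ovo(data):
    """Train one binary classifier for every unordered pair of classes."""
    classes = sorted({y for _, y in data})
    return [train_pairwise(data, a, b) for a, b in combinations(classes, 2)]


def predict_ovo(classifiers, x):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy one-dimensional data with made-up traffic classes.
data = [(1.0, "normal"), (1.4, "normal"), (4.8, "dos"),
        (5.2, "dos"), (8.6, "probe"), (9.4, "probe")]
classifiers = train_ovo(data)
```

With three classes this trains three pairwise classifiers; in general, k classes require k(k-1)/2 of them.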

Cleveland-Marshall College of Law

Parallelizing support vector machines for scalable image annotation

Author: Alham Nasullah Khalid
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2011

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.

Machine learning techniques have facilitated image retrieval by automatically classifying and annotating images with keywords. Among them, Support Vector Machines (SVMs) are used extensively due to their generalization properties. However, SVM training is a computationally intensive process, especially when the training dataset is large. In this thesis, distributed computing paradigms have been investigated to speed up SVM training by partitioning a large training dataset into small data chunks and processing each chunk in parallel, utilizing the resources of a cluster of computers. A resource-aware parallel SVM algorithm is introduced for large-scale image annotation in parallel using a cluster of computers. A genetic-algorithm-based load balancing scheme is designed to optimize the performance of the algorithm in heterogeneous computing environments. SVM was initially designed for binary classification. However, most classification problems arising in domains such as image annotation involve more than two classes. A resource-aware parallel multiclass SVM algorithm for large-scale image annotation in parallel using a cluster of computers is introduced. The combination of classifiers leads to a substantial reduction of classification error in a wide range of applications. Among them, SVM ensembles with bagging are shown to outperform a single SVM in terms of classification accuracy. However, SVM ensemble training is a computationally intensive process, especially when the number of replicated samples based on bootstrapping is large. A distributed SVM ensemble algorithm for image annotation is introduced, which re-samples the training data based on bootstrapping and trains an SVM on each sample in parallel using a cluster of computers.
The above algorithms are evaluated in both experimental and simulation environments, showing that the distributed SVM algorithm, the distributed multiclass SVM algorithm, and the distributed SVM ensemble algorithm reduce the training time significantly while maintaining a high level of accuracy in classification.
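
The bagging step described in the abstract above (bootstrap re-sampling, one model per sample trained in parallel, majority vote) can be sketched as follows. This is an illustration under loud assumptions, not the thesis's algorithm: a nearest-centroid learner stands in for the SVM, a local thread pool stands in for the cluster of computers, and the data and function names are hypothetical.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_centroid(sample):
    """Stand-in base learner (not an SVM): per-class mean of a 1-D feature."""
    sums, counts = {}, Counter()
    for x, y in sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def bootstrap(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]


def train_ensemble(data, n_models=5, seed=0):
    """Train one model per bootstrap sample; a thread pool stands in for a cluster."""
    rng = random.Random(seed)
    samples = [bootstrap(data, rng) for _ in range(n_models)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train_centroid, samples))


def predict_ensemble(models, x):
    """Combine the base models by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: class "a" clusters near 0, class "b" clusters near 1.
data = [(i / 100, "a") for i in range(10)] + [(1 + i / 100, "b") for i in range(10)]
models = train_ensemble(data)
label = predict_ensemble(models, 0.05)
```

Because each bootstrap sample is independent of the others, the training calls share no state, which is what makes this step straightforward to distribute.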

Brunel University Research Archive

Contextual models for object detection using boosted random fields

Author: Freeman William T.
Murphy Kevin P.
Torralba Antonio
Publication date: 01/01/2004

We seek to both detect and segment objects in images. To exploit both local image data and contextual information, we introduce Boosted Random Fields (BRFs), which use boosting to learn the graph structure and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection performance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes.

CiteSeerX

DSpace@MIT