
    An iterative distance-based model for unsupervised weighted rank aggregation

    No full text
    Rank aggregation is the popular problem of combining different ranked lists from various sources (frequently called voters or judges) into a single aggregated list with an improved ranking of its items. In this context, a portion of the existing methods attempt to address the problem by treating all voters equally. Nevertheless, several related works have shown that the careful and effective assignment of different weights to each voter leads to enhanced performance. In this article, we introduce an unsupervised algorithm for learning the weights of the voters for a specific topic or query. The proposed method is based on the observation that if a voter has submitted numerous elements which have been placed in high positions in the aggregated list, then this voter should be treated as an expert, compared to the voters whose suggestions appear in lower places or do not appear at all. The algorithm iteratively computes the distance between each input list and the aggregated list and modifies the weights of the voters until all weights converge. The effectiveness of the proposed method is experimentally demonstrated by aggregating input lists from six TREC conferences. © 2019 Association for Computing Machinery
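The iterative scheme the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' actual method: it assumes a weighted Borda count as the aggregation step and the Spearman footrule as the distance measure, and all function names are hypothetical.

```python
def weighted_borda(rankings, weights):
    """Aggregate ranked lists into a consensus list via a weighted Borda count."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + w * (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

def footrule_distance(ranking, consensus):
    """Spearman footrule against consensus positions; absent items count as last."""
    pos = {item: i for i, item in enumerate(consensus)}
    worst = len(consensus)
    return sum(abs(i - pos.get(item, worst)) for i, item in enumerate(ranking))

def iterative_weighted_aggregation(rankings, iterations=50, tol=1e-6):
    """Alternate between aggregating and re-weighting voters until convergence."""
    weights = [1.0 / len(rankings)] * len(rankings)
    for _ in range(iterations):
        consensus = weighted_borda(rankings, weights)
        dists = [footrule_distance(r, consensus) for r in rankings]
        raw = [1.0 / (1.0 + d) for d in dists]   # closer to consensus => expert => larger weight
        total = sum(raw)
        new_weights = [w / total for w in raw]
        if max(abs(a - b) for a, b in zip(new_weights, weights)) < tol:
            weights = new_weights
            break
        weights = new_weights
    return weighted_borda(rankings, weights), weights
```

With three voters broadly agreeing and one dissenting, the dissenter's weight shrinks while the consensus tracks the majority ordering.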

    Supervised papers classification on large-scale high-dimensional data with apache spark

    No full text
    The problem of classifying a research article into one or more fields of science is of particular importance for academic search engines and digital libraries. A robust classification algorithm offers the users a wide variety of useful tools, such as the refinement of their search results, the browsing of articles by category, the recommendation of other similar articles, etc. In the current literature we encounter approaches which attempt to address this problem without taking into consideration important parameters such as the previous history of the authors and the categorization of the scientific journals which publish the articles. In addition, the existing works overlook the huge volume of the involved academic data. In this paper, we expand an existing effective algorithm for research article classification, and we parallelize it on Apache Spark, a parallelization framework which is capable of caching large amounts of data in the main memory of the nodes of a cluster, to enable the processing of large academic datasets. Furthermore, we present data manipulation methodologies which are useful not only for this particular problem, but also for most parallel machine learning approaches. In our experimental evaluation, we demonstrate that our proposed algorithm is considerably more accurate than the supervised learning approaches implemented within the machine learning library of Spark, while also outperforming them in terms of execution speed by a significant margin. © 2018 IEEE

    Effective products categorization with importance scores and morphological analysis of the titles

    No full text
    During the past few years, e-commerce platforms and marketplaces have enriched their services with new features to improve their user experience and increase their profitability. Such features include relevant product suggestions, personalized recommendations, query understanding algorithms, and numerous others. To effectively implement all these features, a robust product categorization method is required. Due to its importance, the problem of automatically classifying products into a given taxonomy has attracted the attention of multiple researchers. In the current literature, we encounter a broad variety of solutions, ranging from classic supervised learning algorithms to deep learning models such as convolutional and recurrent neural networks. In this paper, we introduce a supervised learning method which performs morphological analysis of the product titles by extracting and processing a combination of words and n-grams. Subsequently, each of these tokens receives an importance score according to several criteria which reflect the strength of the correlation of the token with a category. Based on these importance scores, we also propose a dimensionality reduction technique to reduce the size of the feature space without sacrificing much of the performance of the algorithm. The experimental evaluation of our method was conducted by using a real-world dataset, comprising approximately 320 thousand product titles, which we acquired by crawling a product comparison Web platform. The results of this evaluation indicate that our approach is highly accurate, since it achieves a remarkable classification accuracy of over 95%. © 2018 IEEE
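As a rough illustration of the title-scoring idea, the sketch below extracts words and bigrams from each title and scores every token by the purity of its dominant category; classification then sums the scores of the tokens found in an unseen title. This is a simplified stand-in for the paper's criteria, with hypothetical names throughout.

```python
from collections import defaultdict

def tokenize(title, max_n=2):
    """Extract single words plus n-grams up to max_n from a product title."""
    words = title.lower().split()
    tokens = list(words)
    for n in range(2, max_n + 1):
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens

def train(titles, labels):
    """Assign each token its dominant category and a purity-based importance score."""
    counts = defaultdict(lambda: defaultdict(int))
    for title, label in zip(titles, labels):
        for tok in tokenize(title):
            counts[tok][label] += 1
    scores = {}
    for tok, per_cat in counts.items():
        total = sum(per_cat.values())
        best = max(per_cat, key=per_cat.get)
        scores[tok] = (best, per_cat[best] / total)  # (category, correlation strength)
    return scores

def classify(title, scores):
    """Sum the importance scores of known tokens per category; pick the maximum."""
    votes = defaultdict(float)
    for tok in tokenize(title):
        if tok in scores:
            cat, s = scores[tok]
            votes[cat] += s
    return max(votes, key=votes.get) if votes else None
```

A dimensionality reduction step in this spirit would simply drop tokens whose purity falls below a cutoff before the model is stored.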

    Evaluating the Effects of Modern Storage Devices on the Efficiency of Parallel Machine Learning Algorithms

    No full text
    Big Data analytics is presently one of the most rapidly emerging areas of research for both organizations and enterprises. The need to deploy efficient machine learning algorithms over huge amounts of data has led to the development of parallelization frameworks and of specialized libraries (like Mahout and MLlib) which implement the most important among these algorithms. Moreover, the recent advances in storage technology resulted in the introduction of high-performing devices, broadly known as Solid State Drives (SSDs). Compared to the traditional Hard Drives (HDDs), SSDs offer considerably higher performance and lower power consumption. Motivated by these appealing features and the growing necessity for efficient large-scale data processing, we compare the performance of several machine learning algorithms on MapReduce clusters whose nodes are equipped with HDDs, SSDs, and devices which implement the latest 3D XPoint technology. In particular, we evaluate several dataset preprocessing methods like vectorization and dimensionality reduction, two supervised classifiers, Naive Bayes and Linear Regression, and the popular k-Means clustering algorithm. We use an experimental cluster equipped with the three aforementioned storage devices under different configurations, and two large datasets, Wikipedia and HIGGS. The experiments showed that the benefits which derive from the usage of SSDs depend on the cluster setup and the nature of the applied algorithms. © 2020 World Scientific Publishing Company

    A self-verifying clustering approach to unsupervised matching of product titles

    No full text
    The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and the diversity of the product-related information increase quickly. These factors make it difficult for the users to identify and compare the features of their desired products. Recent studies showed that the standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employ external data sources to enrich the titles; these solutions are rather impractical, since the process of fetching external data is inefficient. In this paper, we introduce UPM, an unsupervised algorithm for matching products by their titles that is independent of any external sources. UPM consists of three stages. During the first stage, the algorithm analyzes the titles and extracts combinations of words out of them. These combinations are evaluated in the second stage according to several criteria, and the most appropriate of them are selected to form the initial clusters. The third stage is a post-processing verification stage that refines the initial clusters by correcting the erroneous matches. This stage is designed to operate in combination with all clustering approaches, especially when the data possess properties that prevent the co-existence of two data points within the same cluster. The experimental evaluation of UPM with multiple datasets demonstrates its superiority against the state-of-the-art clustering approaches and string similarity metrics, in terms of both efficiency and effectiveness. © 2020, Springer Nature B.V
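A toy rendition of the three stages might look as follows; the combination scoring here is reduced to raw frequency and the verification step to a simple token-overlap test, so this captures only the general shape of such a pipeline, with all identifiers invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def candidate_combinations(title, k=2):
    """Stage 1: extract all unordered k-word combinations from a title."""
    words = sorted(set(title.lower().split()))
    return [frozenset(c) for c in combinations(words, k)]

def cluster_titles(titles, k=2):
    """Stage 2: group titles under their most frequently shared combination."""
    freq = defaultdict(int)
    per_title = []
    for t in titles:
        combos = candidate_combinations(t, k)
        per_title.append(combos)
        for c in combos:
            freq[c] += 1
    clusters = defaultdict(list)
    for t, combos in zip(titles, per_title):
        clusters[max(combos, key=lambda c: freq[c])].append(t)
    return clusters

def verify(clusters, threshold=0.3):
    """Stage 3: evict titles that overlap too little with their cluster's core tokens."""
    refined = defaultdict(list)
    for key, members in clusters.items():
        sets = [set(m.lower().split()) for m in members]
        counts = defaultdict(int)
        for s in sets:
            for tok in s:
                counts[tok] += 1
        # "core" tokens appear in at least half of the cluster's titles
        core = {t for t, c in counts.items() if c >= len(members) / 2}
        for m, s in zip(members, sets):
            if core and len(s & core) / len(core) >= threshold:
                refined[key].append(m)
            else:
                refined[frozenset(s)].append(m)  # mismatched title gets its own cluster
    return dict(refined)
```

On a handful of phone titles, the two iPhone variants and the two Galaxy variants land in separate clusters despite sharing no exact title.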

    A Self-Pruning Classification Model for News

    No full text
    News aggregators are online services that collect articles from numerous reputable media and news providers and reorganize them in a convenient manner with the aim of helping their users access the information they seek. One of the most important tools offered by news aggregators is based on the classification of the articles into a fixed set of categories. In this article, we introduce a supervised classification method for news articles that analyzes their titles and constructs multiple types of tokens, including single words and n-grams of variable sizes. Subsequently, it employs several statistics, such as frequencies and token-class correlations, to assign two importance scores to each token. These scores reflect the ambiguity of a token, that is, how significant it is for classifying an article into a category. The tokens and their scores are stored in a support structure that is subsequently used to classify the unlabeled articles. In addition, we propose a dimensionality reduction approach that reduces the size of the model without significant degradation of its classification performance. The algorithm is experimentally evaluated by employing a popular dataset of news articles and is found to outperform standard classification methods. © 2019 IEEE

    Investigating the efficiency of machine learning algorithms on mapreduce clusters with SSDs

    No full text
    In the big data era, the efficient processing of large volumes of data has become a standard requirement for both organizations and enterprises. Since single workstations cannot sustain such tremendous workloads, MapReduce was introduced with the aim of providing a robust, easy, and fault-tolerant parallelization framework for the execution of applications on large clusters. One of the most representative examples of such applications is the machine learning algorithms which dominate the broad research area of data mining. Simultaneously, the recent advances in hardware technology led to the introduction of high-performing alternative devices for secondary storage, known as Solid State Drives (SSDs). In this paper, we examine the performance of several parallel data mining algorithms on MapReduce clusters equipped with such modern hardware. More specifically, we investigate standard dataset preprocessing methods including vectorization and dimensionality reduction, and two supervised classifiers, Naive Bayes and Linear Regression. We compare the execution times of these algorithms on an experimental cluster equipped with both standard magnetic disks and SSDs, by employing two different datasets and by applying several different cluster configurations. Our experiments demonstrate that the usage of SSDs can accelerate the execution of machine learning methods by a margin which depends on the cluster setup and the nature of the applied algorithms. © 2018 IEEE

    Indexing in flash storage devices: a survey on challenges, current approaches, and future trends

    No full text
    Indexes are special-purpose data structures, designed to facilitate and speed up the access to the contents of a file. Indexing has been actively and extensively investigated in DBMSes equipped with hard disk drives (HDDs). In recent years, solid-state drives (SSDs), based on NAND flash technology, have started replacing magnetic disks due to their appealing characteristics: high throughput/low latency, shock resistance, absence of mechanical parts, and low power consumption. However, treating SSDs as simply another category of block devices ignores their idiosyncrasies, like erase-before-write, wear-out, and asymmetric read/write performance, and may lead to poor performance. These peculiarities of SSDs dictate the refactoring or even the reinvention of the indexing techniques that have been designed primarily for HDDs. In this work, we present a concise overview of the SSD technology and the challenges it poses. We broadly survey 62 flash-aware indexes for various data types, analyze the main techniques they employ, and comment on their main advantages and disadvantages, aiming to provide a systematic and valuable resource for researchers working on algorithm design and index development for SSDs. Additionally, we discuss future trends and new lines of research related to this field. © 2019, Springer-Verlag GmbH Germany, part of Springer Nature

    A Scalable Short-Text Clustering Algorithm Using Apache Spark

    No full text
    Short text clustering deals with the problem of grouping together semantically similar documents with small lengths. Nowadays, huge amounts of text data are being generated by numerous applications such as microblogs, messengers, and services that generate or aggregate titled entities. This large volume of high-dimensional and sparse information may easily overwhelm the current serial approaches and render them inefficient, or even inapplicable. Although many traditional clustering algorithms have been successfully parallelized in the past, the parallelization of short text clustering algorithms is a rather overlooked problem. In this paper, we introduce pVEPHC, a short text clustering method that can be executed in parallel on large computer clusters. The algorithm draws inspiration from VEPHC, a recent two-stage approach with decent performance in several diverse tasks. More specifically, in this work we employ the Apache Spark framework to design parallel implementations of both stages of VEPHC. During the first stage, pVEPHC generates an initial clustering by identifying and modelling common low-dimensional vector representations of the original documents. Subsequently, the initial clustering is improved in the second stage by applying cluster split and merge operations in a hierarchical fashion. We have tested our implementation on an experimental Spark cluster and we report a near-linear improvement in the execution times of the algorithm. © 2021 IEEE

    An unsupervised distance-based model for weighted rank aggregation with list pruning

    No full text
    Combining multiple ranked lists of items, submitted by sources called voters, into a single consensus list is a popular problem with significant implications in numerous areas, including Bioinformatics, recommendation systems, metasearch engines, etc. Multiple recent solutions introduced supervised and unsupervised techniques that try to model the ordering of the list elements and identify common ranking patterns among the voters. Nevertheless, these works either require additional information (e.g. the element scores assigned by the voters, or training data), or they merge similar voters without evidence that similar voters are important voters. Furthermore, these models are computationally expensive. To overcome these problems, this paper introduces an unsupervised method that identifies the expert voters, thus enhancing the aggregation performance. Specifically, we build upon the concept that collective knowledge is superior to individual preferences. Therefore, the closer an individual list is to a consensus ranking, the stronger the respective voter is. By iteratively correcting these distances, we assign converging weights to each voter, leading to a final stable list. Moreover, to the best of our knowledge, this is the first work that employs these weights not only to assign scores to the individual elements, but also to determine how many elements the final list contains. The proposed model has been extensively evaluated both with well-established TREC datasets and synthetic ones. The results demonstrate substantial precision improvements over three baseline and two recent state-of-the-art methods. © 2022 Elsevier Ltd
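The pruning idea, i.e. using the learned voter weights not only to score elements but also to decide how many of them survive into the final list, can be sketched as follows. The weighted Borda scoring, the support threshold, and every name here are illustrative assumptions, not the paper's actual formulation.

```python
def prune_by_weighted_support(rankings, weights, support_threshold=0.5):
    """Keep an element only if the voters that ranked it carry enough total weight.

    Scores use a weighted Borda count; the threshold controls how aggressively
    the aggregated list is pruned, i.e. its final population.
    """
    scores, support = {}, {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + w * (n - pos)
            support[item] = support.get(item, 0.0) + w  # total weight backing this item
    kept = [item for item in scores if support[item] >= support_threshold]
    return sorted(kept, key=scores.get, reverse=True)
```

With weights already learned (e.g. by the iterative distance-based scheme), items suggested only by low-weight voters fall below the support threshold and are dropped from the consensus list.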