
    Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations

    The problem of matching product titles is of particular interest to both users and marketers. The former frequently search the Web to compare prices and characteristics, or to obtain and aggregate information provided by other users. The latter often require broad knowledge of competitive policies, prices, and features to organize a promotional campaign for a group of products. To address this interesting problem, recent studies have attempted to enrich the product titles by exploiting Web search engines. More specifically, these methods submit a query for each product title. After the results have been collected, the most important words appearing in the results are identified and appended to the titles. Subsequently, each word is assigned an importance score and, finally, a similarity measure is applied to determine whether two or more titles refer to the same product. Nonetheless, these methods suffer from multiple problems, including poor scalability, slow retrieval of the required additional search results, and lack of flexibility. In this paper, we present a different approach which addresses all these issues and is based on the morphological analysis of the product titles. In particular, our method operates in two phases. In the first phase, we compute the combinations of the words of the titles and we record several statistics, such as word proximity and frequency values. In the second phase, we use this information to assign a score to each combination. The highest-scoring combination is then declared the label of the cluster which contains each product. The experimental evaluation of the algorithm on a real-world dataset demonstrated that, compared to three popular string similarity metrics, our approach achieves up to 36% better matching performance and at least 13 times faster execution. © 2018 IEEE
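
    A minimal sketch of the two-phase idea described in the abstract follows; the tokenizer, the proximity statistic, and the exact scoring function are illustrative assumptions, not the authors' implementation.

        # Illustrative sketch: tokenizer, proximity statistic, and scoring
        # function are assumptions, not the paper's implementation.
        from itertools import combinations
        from collections import defaultdict

        def tokenize(title):
            return title.lower().split()

        def score_combinations(titles, k=2):
            # Phase 1: record frequency and proximity statistics for every
            # k-combination of words appearing in the titles.
            freq, prox = defaultdict(int), defaultdict(float)
            for title in titles:
                words = tokenize(title)
                for combo in combinations(sorted(set(words)), k):
                    freq[combo] += 1
                    i, j = words.index(combo[0]), words.index(combo[1])
                    prox[combo] += 1.0 / (1 + abs(i - j))
            # Phase 2: combine the statistics into one score per combination.
            return {c: freq[c] * prox[c] for c in freq}

        def cluster_label(title, scores, k=2):
            # The highest-scoring combination becomes the cluster label.
            words = sorted(set(tokenize(title)))
            cands = [c for c in combinations(words, k) if c in scores]
            return max(cands, key=lambda c: scores[c], default=None)

        titles = ["apple iphone 7 32gb black",
                  "iphone 7 black 32gb by apple",
                  "samsung galaxy s8 64gb"]
        scores = score_combinations(titles)
        clusters = defaultdict(list)
        for t in titles:
            clusters[cluster_label(t, scores)].append(t)
        print(dict(clusters))  # the two iPhone titles share one label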

    Computing scientometrics in large-scale academic search engines with MapReduce

    Apart from the well-established facility of searching for research articles, modern academic search engines also provide information regarding the scientists themselves. Until recently, this information was limited to the articles each scientist has authored, accompanied by their corresponding citations. Presently, the most popular scientific databases have enriched this information by including scientometrics, that is, metrics which evaluate the research activity of a scientist. Although the computation of scientometrics is relatively easy when dealing with small data sets, at larger scales the problem becomes more challenging, since the involved data is huge and cannot be handled efficiently by a single workstation. In this paper we attempt to address this interesting problem by employing MapReduce, a distributed, fault-tolerant framework used to solve large-scale problems without considering complex network programming details. We demonstrate that by formulating the problem in a manner that is compatible with MapReduce, we can achieve an effective and scalable solution. We propose four algorithms which exploit the features of the framework, and we compare their efficiency by conducting experiments on a large dataset comprising roughly 1.8 million scientific documents. © 2012 Springer-Verlag
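
    To illustrate how such a metric can be cast into the framework, here is a minimal sketch that simulates the map, shuffle, and reduce phases for the h-index in plain Python; the record format is an assumption, and this is not one of the paper's four proposed algorithms.

        # Sketch of an h-index computation phrased as MapReduce; the record
        # format and helpers are illustrative assumptions.
        from collections import defaultdict

        def mapper(doc):
            # Emit one (author, citations) pair per author of the paper.
            for author in doc["authors"]:
                yield author, doc["citations"]

        def reducer(author, citation_counts):
            # h-index: largest h such that the author has h papers
            # with at least h citations each.
            counts = sorted(citation_counts, reverse=True)
            h = 0
            for i, c in enumerate(counts, start=1):
                if c >= i:
                    h = i
            return author, h

        docs = [{"authors": ["a", "b"], "citations": 10},
                {"authors": ["a"], "citations": 3},
                {"authors": ["a"], "citations": 1}]

        # Shuffle phase: group the mapper output by key.
        groups = defaultdict(list)
        for doc in docs:
            for author, cites in mapper(doc):
                groups[author].append(cites)

        print([reducer(a, cs) for a, cs in groups.items()])
        # [('a', 2), ('b', 1)]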

    A supervised machine learning classification algorithm for research articles

    The issue of automatically classifying research articles into one or more fields of science is of primary importance for scientific databases and digital libraries. A sophisticated classification strategy renders searching more effective and assists the users in locating similar relevant items. Although most publishing services require the authors to categorize their articles themselves, there are still cases where older documents remain unclassified, or the taxonomy changes over time. In this work we attempt to address this interesting problem by introducing a machine learning algorithm which combines several parameters and metadata of a research article. In particular, our model exploits the training set to correlate keywords, authors, co-authorship, and publishing journals with a number of labels of the taxonomy. It then applies this information to classify the rest of the documents. The experiments we have conducted with a large dataset comprising about 1.5 million articles demonstrate that, in this specific application, our model outperforms the AdaBoost.MH and SVM methods. Copyright 2013 ACM
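
    A hypothetical sketch of the core idea, correlating keywords, authors, and journals with taxonomy labels, is given below; the additive evidence scheme is an assumption for illustration, not the paper's exact model.

        # Illustrative metadata-based classifier; the additive scoring
        # scheme is an assumption, not the published algorithm.
        from collections import defaultdict, Counter

        def train(labeled_articles):
            # Count how often each keyword, author, and journal
            # co-occurs with each taxonomy label.
            weights = defaultdict(Counter)  # feature -> label counts
            for art in labeled_articles:
                features = art["keywords"] + art["authors"] + [art["journal"]]
                for f in features:
                    for label in art["labels"]:
                        weights[f][label] += 1
            return weights

        def classify(article, weights):
            # Sum the per-feature label evidence and return the best label.
            scores = Counter()
            features = article["keywords"] + article["authors"] + [article["journal"]]
            for f in features:
                scores.update(weights.get(f, Counter()))
            return scores.most_common(1)[0][0] if scores else None

        train_set = [
            {"keywords": ["ranking", "index"], "authors": ["x"],
             "journal": "IR Journal", "labels": ["Information Retrieval"]},
            {"keywords": ["protein"], "authors": ["y"],
             "journal": "Bio Journal", "labels": ["Biology"]},
        ]
        w = train(train_set)
        print(classify({"keywords": ["ranking"], "authors": ["z"],
                        "journal": "IR Journal"}, w))  # Information Retrieval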

    Positional data organization and compression in web inverted indexes

    To sustain the tremendous workloads they suffer on a daily basis, Web search engines employ highly compressed data structures known as inverted indexes. Previous works demonstrated that organizing the inverted lists of the index in individual blocks of postings leads to significant efficiency improvements. Moreover, the recent literature has shown that the current state-of-the-art compression strategies, such as PForDelta and VSEncoding, perform well when used to encode the lists' docIDs. In this paper we examine their performance when used to compress the positional values. We expose their drawbacks and we introduce PFBC, a simple yet efficient encoding scheme which encodes the positional data of an inverted list block by using a fixed number of bits. PFBC allows direct access to the required data by avoiding costly look-ups and unnecessary information decoding, achieving positions decompression that is several times faster than the state-of-the-art approaches. © 2012 Springer-Verlag
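
    A sketch of fixed-bit block encoding in the spirit of PFBC follows: every position in a block is stored with the same bit width, so the i-th value can be read without decoding its predecessors. The exact packing layout is an assumption, not the published format.

        # Fixed-bit packing of one block of positions; layout is illustrative.
        def encode_block(positions):
            # Pack all values with the bit width needed for the largest one.
            width = max(positions).bit_length() or 1
            buf = 0
            for i, p in enumerate(positions):
                buf |= p << (i * width)
            n_bytes = (len(positions) * width + 7) // 8
            return width, buf.to_bytes(n_bytes, "little")

        def decode_at(width, data, i):
            # Direct access to the i-th position: no sequential decoding.
            buf = int.from_bytes(data, "little")
            return (buf >> (i * width)) & ((1 << width) - 1)

        width, data = encode_block([3, 17, 42, 99])
        print([decode_at(width, data, i) for i in range(4)])  # [3, 17, 42, 99]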

    Effective ranking fusion methods for personalized metasearch engines

    Metasearch engines are a significant part of the information retrieval process. Most Web users use them, directly or indirectly, to access information from more than one data source. The cornerstone of their technology is their rank aggregation method, that is, the algorithm they use to rank the collected results. In this paper we present three new rank aggregation methods. First, we propose a method that takes into consideration regional data about the user and the pages, and assigns scores according to a variety of user-defined parameters. In the second method, not all component engines are treated equally: the user is free to define the importance of each engine by setting appropriate weights. The third algorithm is designed to rank pages whose URLs contain subdomains. The three presented methods are combined into a single, personalized scoring formula, the Global KE. All algorithms have been implemented in QuadSearch, an experimental metasearch engine available at http://quadsearch.csd.auth.gr. © 2008 IEEE
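
    A minimal sketch of the weighting idea behind the second method follows; the Borda-style scoring is an illustrative assumption, not the actual KE formulas.

        # Weighted fusion where the user assigns a weight per engine.
        from collections import defaultdict

        def weighted_fusion(rankings, engine_weights, list_len=10):
            # rankings: engine name -> ordered list of URLs.
            scores = defaultdict(float)
            for engine, urls in rankings.items():
                w = engine_weights.get(engine, 1.0)
                for rank, url in enumerate(urls, start=1):
                    scores[url] += w * (list_len - rank + 1)
            return sorted(scores, key=scores.get, reverse=True)

        rankings = {"engineA": ["u1", "u2", "u3"],
                    "engineB": ["u3", "u2", "u1"]}
        print(weighted_fusion(rankings, {"engineA": 2.0, "engineB": 1.0}))
        # ['u1', 'u2', 'u3']: with equal weights all three URLs would tie,
        # but engineA's doubled weight dominates the ordering.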

    Modern web technologies

    Nowadays, the World Wide Web is one of the most significant tools that people employ to seek information, locate new sources of knowledge, communicate, share ideas and experiences, or even purchase products and make online bookings. This book chapter discusses the technologies adopted by modern Web applications. We summarize the most fundamental principles employed by the Web, such as the client-server model and the HTTP protocol, and then we continue by presenting current trends such as asynchronous communication, distributed applications, cloud computing, and mobile Web applications. Finally, we conduct a short discussion regarding the future of the Web and the technologies that are going to play key roles in the deployment of novel applications. © 2011 Springer-Verlag Berlin Heidelberg

    An iterative distance-based model for unsupervised weighted rank aggregation

    Rank aggregation is a popular problem in which ranked lists from various sources (frequently called voters or judges) are combined to generate a single aggregated list with an improved ranking of its items. In this context, a portion of the existing methods attempt to address the problem by treating all voters equally. Nevertheless, several related works have shown that the careful and effective assignment of different weights to each voter leads to enhanced performance. In this article, we introduce an unsupervised algorithm for learning the weights of the voters for a specific topic or query. The proposed method is based on the observation that if a voter has submitted numerous elements which have been placed in high positions in the aggregated list, then this voter should be treated as an expert, compared to the voters whose suggestions appear in lower places or do not appear at all. The algorithm iteratively computes the distance between each input list and the aggregated list, and modifies the weights of the voters until all weights converge. The effectiveness of the proposed method is experimentally demonstrated by aggregating input lists from six TREC conferences. © 2019 Association for Computing Machinery
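
    A minimal sketch of the iterative scheme follows, assuming a weighted Borda aggregation and the Spearman footrule as the distance; the paper's actual distance measure and weight-update rule may differ.

        # Voters whose lists agree with the current aggregate earn larger
        # weights; aggregation and update rule are illustrative assumptions.
        def aggregate(lists, weights):
            # Weighted Borda count over the input lists.
            scores = {}
            for lst, w in zip(lists, weights):
                for rank, item in enumerate(lst):
                    scores[item] = scores.get(item, 0.0) + w * (len(lst) - rank)
            return sorted(scores, key=scores.get, reverse=True)

        def footrule(lst, agg):
            # Spearman footrule distance between a list and the aggregate.
            pos = {item: r for r, item in enumerate(agg)}
            return sum(abs(r - pos[item]) for r, item in enumerate(lst))

        def learn_weights(lists, iters=20):
            weights = [1.0] * len(lists)
            for _ in range(iters):
                agg = aggregate(lists, weights)
                dists = [footrule(lst, agg) for lst in lists]
                weights = [1.0 / (1.0 + d) for d in dists]  # closer -> expert
            return weights, aggregate(lists, weights)

        lists = [["a", "b", "c"], ["a", "c", "b"], ["c", "b", "a"]]
        print(learn_weights(lists))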

    The f Index: Quantifying the Impact of Coterminal Citations on Scientists' Ranking

    Designing fair and unbiased metrics to measure the "level of excellence" of a scientist is a very significant task, because such metrics have recently also been taken into account when deciding faculty promotions, when allocating funds, and so on. Despite the criticism that such scientometric evaluators are confronted with, they do have their merits, and efforts should be spent to arm them with robustness and resistance to manipulation. This article aims at initiating the study of coterminal citations, their existence and implications, and presents them as a generalization of self-citations and co-citations; it also shows how they can be used to capture manipulation attempts against scientometric indicators, and finally presents a new index, the f index, which takes coterminal citations into account. The utility of the new index is validated using the academic production of a number of esteemed computer scientists. The results confirm that the new index can discriminate those individuals whose work penetrates many scientific communities.

    Supervised papers classification on large-scale high-dimensional data with Apache Spark

    The problem of classifying a research article into one or more fields of science is of particular importance for academic search engines and digital libraries. A robust classification algorithm offers the users a wide variety of useful tools, such as the refinement of their search results, the browsing of articles by category, the recommendation of other similar articles, etc. In the current literature we encounter approaches which attempt to address this problem without taking into consideration important parameters, such as the previous history of the authors and the categorization of the scientific journals which publish the articles. In addition, the existing works overlook the huge volume of the involved academic data. In this paper, we expand an existing effective algorithm for research article classification, and we parallelize it on Apache Spark, a parallelization framework which is capable of caching large amounts of data in the main memory of the nodes of a cluster, to enable the processing of large academic datasets. Furthermore, we present data manipulation methodologies which are useful not only for this particular problem, but also for most parallel machine learning approaches. In our experimental evaluation, we demonstrate that our proposed algorithm is considerably more accurate than the supervised learning approaches implemented within the machine learning library of Spark, while it also outperforms them in terms of execution speed by a significant margin. © 2018 IEEE
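
    A hypothetical PySpark sketch of the parallelization pattern follows: a trained model is broadcast once to the executors and the articles are classified in parallel. The stand-in keyword model is for illustration only and is not the expanded algorithm itself.

        # Broadcast-and-map pattern for parallel classification on Spark.
        from pyspark import SparkContext

        def classify(article, model):
            # Stand-in scoring: pick the label with the most keyword hits.
            scores = {label: sum(kw in article["keywords"] for kw in kws)
                      for label, kws in model.items()}
            return article["id"], max(scores, key=scores.get)

        if __name__ == "__main__":
            sc = SparkContext(appName="PaperClassification")
            model = {"IR": ["ranking", "index"], "ML": ["training", "classifier"]}
            bmodel = sc.broadcast(model)  # shipped once to every executor

            articles = sc.parallelize([
                {"id": 1, "keywords": ["ranking", "index"]},
                {"id": 2, "keywords": ["training", "classifier"]},
            ])
            labeled = articles.map(lambda a: classify(a, bmodel.value))
            print(labeled.collect())  # [(1, 'IR'), (2, 'ML')]
            sc.stop()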

    Effective rank aggregation for metasearching

    Nowadays, mashup services, and especially metasearch engines, play an increasingly important role on the Web. Most users use them directly or indirectly to access and aggregate information from more than one data source. Similarly to the rest of the search systems, the effectiveness of a metasearch engine is mainly determined by the quality of the results it returns in response to user queries. Since these services do not maintain their own document index, they exploit multiple search engines and use a rank aggregation method in order to rank the collected results. However, the rank aggregation methods which have been proposed until now utilize a very limited set of parameters regarding these results, such as the total number of the exploited resources and the rankings they receive from each individual resource. In this paper we present QuadRank, a new rank aggregation method which takes into consideration additional information regarding the query terms, the collected results, and the data correlated with each of these results (title, textual snippet, URL, individual rankings and others). We have implemented and tested QuadRank in a real-world metasearch engine, QuadSearch, a system developed as a testbed for algorithms related to the wider problem of metasearching. The name QuadSearch refers to the current number of exploited engines (four). We have exhaustively tested QuadRank for both effectiveness and efficiency in the real-world search environment of QuadSearch and also using a task from the recent TREC-2009 conference. The results of our experiments reveal that in most cases QuadRank outperformed all component engines, another metasearch engine (Dogpile), and two successful rank aggregation methods, Borda Count and the Outranking Approach. © 2010 Elsevier Inc. All rights reserved.
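
    An illustrative sketch of combining this kind of evidence (per-engine ranks, title, snippet, URL) into a single score is shown below; the weights and formula are assumptions, not the published QuadRank function.

        # Hypothetical multi-evidence score for one collected result.
        def result_score(query, result, num_engines=4, list_len=10):
            terms = query.lower().split()
            rank_score = sum(list_len - r + 1 for r in result["ranks"])
            title_hits = sum(t in result["title"].lower() for t in terms)
            snippet_hits = sum(t in result["snippet"].lower() for t in terms)
            url_hits = sum(t in result["url"].lower() for t in terms)
            # Reward results returned by many engines and rich in query terms.
            return (rank_score * len(result["ranks"]) / num_engines
                    + 3 * title_hits + 2 * snippet_hits + 1 * url_hits)

        r = {"ranks": [1, 3], "title": "QuadSearch metasearch engine",
             "snippet": "a metasearch engine for rank aggregation",
             "url": "http://quadsearch.csd.auth.gr"}
        print(result_score("metasearch engine", r))  # 19.0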