8 research outputs found

    Kodex, or how to organize information retrieval results through community detection on a bipartite graph?

    Information Retrieval Systems (IRS) generally display results as a list of documents. One may think that a deeper structure exists within results. This hypothesis is reinforced by the fact that most graphs produced from real data (e.g., graphs of documents) share certain structural properties, in particular a community structure. We propose to use these properties to better organize the set of documents returned for a query by a given IRS. For this purpose, the retrieved document set is modeled as a bipartite graph (Documents ↔ Terms) on which the Kodex community detection algorithm is applied. This paper presents Kodex and its evaluation: on the F1 measure, Kodex outperforms the Okapi BM25 baseline by 22%.
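    Kodex itself is not reproduced in this listing, so the following is only a minimal sketch of the Documents ↔ Terms modeling idea, assuming NetworkX's greedy modularity detector as a stand-in for Kodex and an invented set of retrieved documents:

    ```python
    # Sketch: model retrieved documents as a bipartite Documents <-> Terms
    # graph and group them by community detection. The detector below is a
    # stand-in, not the published Kodex algorithm.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Hypothetical retrieved documents for one query (doc id -> terms).
    results = {
        "d1": ["jaguar", "car", "engine"],
        "d2": ["jaguar", "cat", "habitat"],
        "d3": ["engine", "car", "speed"],
        "d4": ["cat", "habitat", "prey"],
    }

    G = nx.Graph()
    for doc, terms in results.items():
        for term in terms:
            # Document nodes and term nodes form the two sides of the graph.
            G.add_edge(("doc", doc), ("term", term))

    # Each community mixes document and term nodes, so the term side
    # doubles as a human-readable label for its group of documents.
    for community in greedy_modularity_communities(G):
        docs = sorted(n for kind, n in community if kind == "doc")
        terms = sorted(n for kind, n in community if kind == "term")
        print(docs, "labeled by", terms)
    ```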

    Discovering latent topical structure by second-order similarity analysis

    Document similarity models are typically derived from a term-document vector space representation by comparing all vector pairs using some similarity measure. Computing similarity directly from a ‘bag of words’ model can be problematic because term independence causes the relationships between synonymous and related terms, and the contextual influences that determine the ‘sense’ of polysemous terms, to be ignored. This paper compares two methods that potentially address these problems by modelling the higher-order relationships that lie latent within the original vector space. The first is latent semantic analysis (LSA), a dimension reduction method that is a well-known means of addressing the vocabulary mismatch problem in information retrieval systems. The second is the lesser-known, yet conceptually simple, approach of second-order similarity (SOS) analysis, where similarity is measured in terms of profiles of first-order similarities as computed directly from the term-document space. Nearest neighbour tests show that SOS analysis produces similarity models that are consistently better than both first-order and LSA-derived models at resolving both coarse and fine level semantic clusters. SOS analysis has been criticised for its cubic complexity. A second contribution is the novel application of vector truncation to reduce the run-time by a constant factor. Speed-ups of four to ten times are found to be easily achievable without losing the structural benefits associated with SOS analysis.
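    A minimal sketch of SOS analysis as described above, with an invented toy term-document matrix; the final vector-truncation step illustrates the run-time optimization, with k chosen arbitrarily:

    ```python
    # Sketch: first-order similarities are cosine similarities of document
    # vectors; second-order similarity compares the resulting similarity
    # profiles themselves. The matrix below is invented for illustration.
    import numpy as np

    def cosine_matrix(X):
        """Pairwise cosine similarities between the rows of X."""
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        Xn = X / np.clip(norms, 1e-12, None)
        return Xn @ Xn.T

    # Rows = documents, columns = terms (a toy 'bag of words' matrix).
    docs = np.array([
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [0, 0, 1, 2],
        [0, 1, 2, 1],
    ], dtype=float)

    first_order = cosine_matrix(docs)           # document-document similarities
    second_order = cosine_matrix(first_order)   # similarity of similarity profiles

    # Vector truncation, as described: keep only the k largest entries of
    # each first-order profile before the second pass (accuracy for speed).
    k = 2
    ranks = np.argsort(np.argsort(-first_order, axis=1), axis=1)
    truncated = np.where(ranks < k, first_order, 0.0)
    second_order_fast = cosine_matrix(truncated)
    ```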

    Mobile Search Engine using Clustering and Query Expansion

    Internet content is growing exponentially, and searching for useful content is a tedious task that we all deal with today. Mobile phones' lack of screen space and limited interaction methods make traditional search engine interfaces very inefficient. As the use of the mobile internet continues to grow, there is a need for an effective search tool. I have created a mobile search engine that uses clustering and query expansion to find relevant web pages efficiently. Clustering organizes web pages into groups that reflect different components of a query topic. Users can ignore clusters they find irrelevant, so they are not forced to sift through a long list of off-topic web pages. Query expansion uses query results, dictionaries, and cluster labels to formulate additional terms that refine the original query. The refined query gives more in-depth results and eliminates noise. I believe that these two techniques are effective and can be combined to make the ultimate mobile search engine.
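    As a rough sketch of how clustering and cluster-label expansion can interact (the snippets, cluster count, and expansion rule are invented; the engine's actual components are not published in this abstract):

    ```python
    # Sketch: cluster result snippets over TF-IDF, label each cluster by
    # its centroid's top terms, and expand the query with a chosen label.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical result snippets for the query "jaguar".
    snippets = [
        "jaguar the car maker releases a new engine",
        "jaguar car dealership prices and engine specs",
        "the jaguar is a big cat native to the americas",
        "jaguar habitat and prey in the rain forest",
    ]
    query = "jaguar"

    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(snippets)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    terms = np.array(vec.get_feature_names_out())
    for c in range(km.n_clusters):
        # Label each cluster by the top-weighted terms of its centroid;
        # appending a label to the query narrows it to that topic.
        label = terms[np.argsort(km.cluster_centers_[c])[::-1][:3]]
        print(f"cluster {c}: expanded query ->", query, " ".join(label))
    ```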

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Web archives have preserved the cultural history of the web for many years, but they still offer limited means of access. Most web archiving research has focused on crawling and preservation activities, with little focus on delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all of the users' needs. In this dissertation, we focus on access methods for archived web data that enable users, third-party developers, researchers, and others to gain knowledge from web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels, and for each level we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data; we develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level, where we extract the metadata from the archived web data and make it available to users; we implement two services, ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in web archives. The third level is the URI level, which uses the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level, in which we define a web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.
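    ArcSys itself is not reproduced here; the following sketch only illustrates the URI-level idea, assuming the public Wayback Machine availability API as the example archive endpoint:

    ```python
    # Sketch: use live-web HTTP redirection to canonicalize a URI before
    # querying an archive, so the query uses the final URI rather than an
    # alias that may have few captures.
    import requests

    def canonical_uri(uri):
        """Follow live-web redirects and return the final URI."""
        resp = requests.head(uri, allow_redirects=True, timeout=10)
        return resp.url

    uri = canonical_uri("http://cnn.com")  # e.g. resolves to https://www.cnn.com/
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": uri}, timeout=10)
    print(resp.json().get("archived_snapshots"))
    ```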

    Big data analysis: an application on the lean manufacturing literature

    In this study, articles published in the field of lean manufacturing were examined through a systematic literature review. A systematic review comprehensively surveys all studies published in a field, determines which studies to include by applying various inclusion and exclusion criteria, and synthesizes the findings of the included research in order to answer a research question or solve a problem. Systematic reviews contain more scientific knowledge and are important for producing stronger evidence. The most economical and effective way to conduct such research today is to use the internet and available databases. The literature review was conducted on Lean Manufacturing articles published from 1991 to 2018, retrieved from Scopus. For each article, the title, abstract, keywords, author names, year of publication, journal, and country were examined, and the records were exported from Scopus to Excel. Statistical analysis of the articles was done with Scopus's analysis tools, and text mining was done in the RapidMiner 5.0 program. The lean manufacturing studies published in the literature can be seen as big data, so various analyses are carried out on them within a big data framework: first statistical analyses, and then text mining analyses.
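    A minimal sketch of the same workflow using pandas in place of Scopus's analyzer and RapidMiner; the file name is hypothetical, and the column names assume a standard Scopus CSV export:

    ```python
    # Sketch: basic bibliometric statistics plus simple word-frequency
    # text mining over exported article metadata.
    from collections import Counter

    import pandas as pd

    df = pd.read_csv("scopus_lean_manufacturing.csv")  # hypothetical export

    # Statistical view: publications per year and per source journal.
    print(df["Year"].value_counts().sort_index())
    print(df["Source title"].value_counts().head(10))

    # Simple text mining: most frequent words across the abstracts.
    stop = {"the", "a", "of", "and", "in", "to", "is", "for", "with", "on"}
    words = Counter(
        w for abstract in df["Abstract"].dropna()
        for w in abstract.lower().split() if w not in stop)
    print(words.most_common(20))
    ```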

    A WEB PERSONALIZATION ARTIFACT FOR UTILITY-SENSITIVE REVIEW ANALYSIS

    Online customer reviews are web content voluntarily posted by the users of a product (e.g., a camera) or service (e.g., a hotel) to express their opinions about it. Online reviews are important resources for businesses and consumers. This dissertation focuses on the important consumer concern of review utility, i.e., the helpfulness or usefulness of online reviews in informing consumer purchase decisions. Review utility concerns consumers because not all online reviews are useful or helpful, and the quantity of online reviews for a product or service tends to be very large; manual assessment of review utility is not only time-consuming but also information-overloading. To address this issue, review helpfulness research (RHR) has become a very active research stream dedicated to studying utility-sensitive review analysis (USRA) techniques for automating review utility assessment. Unfortunately, prior RHR solutions are inadequate, and RHR researchers have called for more suitable USRA approaches. Our research responds to this call by addressing the research problem: what is an adequate USRA approach? We address this problem by offering novel Design Science (DS) artifacts for personalized USRA (PUSRA). Our proposed solution extends not only RHR but also web personalization research (WPR), which studies web-based solutions for personalized web provision. We have evaluated the proposed solution by applying three evaluation methods: analytical, descriptive, and experimental. The evaluations corroborate the practical efficacy of our proposed solution. This research contributes what we believe to be (1) the first DS artifacts in the knowledge body of RHR and WPR, and (2) the first PUSRA contribution to USRA practice. Moreover, we consider our evaluations of the proposed solution the first comprehensive assessment of USRA solutions. In addition, this research contributes to the advancement of decision support research and practice: the proposed solution is a web-based decision support artifact with the capability to substantially improve accurate personalized webpage provision. Website designers can also apply our solution to fundamentally transform their work, adding substantial value to businesses.

    Document clustering with optimized unsupervised feature selection and centroid allocation

    An effective document clustering system can significantly improve the tasks of document analysis, grouping, and retrieval. The performance of a document clustering system mainly depends on document preparation and the allocation of cluster positions. As achieving optimal document clustering is a combinatorial NP-hard optimization problem, it becomes essential to utilize non-traditional methods to look for optimal or near-optimal solutions. During the allocation of cluster positions, or the centroid allocation process, the extra text features that represent keywords in each document affect the clustering results, so the large number of features needs to be reduced using dimensionality reduction techniques. Feature selection is an important step that can be used to remove redundant and inconsistent features. Due to the large number of potential feature combinations, text feature selection is considered a complicated process. The persistent drawbacks of current text feature selection methods, such as local optima and the absence of class labels for features, are addressed in this thesis, and both supervised and unsupervised feature selection methods are investigated. To optimize supervised feature selection so as to improve document clustering, a memetic hybridization between filter and wrapper feature selection, known as Memetic Algorithm Feature Selection, is presented first. To deal with unlabelled features, an unsupervised feature selection method is also proposed; it integrates Simulated Annealing into a global search using Differential Evolution, again aiming to combine the advantages of both the wrapper and filter methods in a memetic scheme, but on an unsupervised basis. Two versions of this hybridization are proposed: the first, Differential Evolution Simulated Annealing, uses the standard mutation of Differential Evolution, and the second, Dichotomous Differential Evolution Simulated Annealing, uses the dichotomous mutation of Differential Evolution. After feature selection, two centroid allocation methods are proposed: the first combines Chaotic Logistic Search with a Discrete Differential Evolution global search and is named Differential Evolution Memetic Clustering (DEMC); the second is based on gradient search, using k-means as a local search with a modified Differential Harmony global search, and is named Memetic Differential Harmony Search (MDHS). To intensify the exploitation aspect of MDHS, a binomial crossover is used with it; the improved method is named Crossover Memetic Differential Harmony Search (CMDHS). Test results using the F-measure, Average Distance of Document to Cluster (ADDC), and nonparametric statistical tests show the superiority of CMDHS over the baseline methods, namely HS, DHS, k-means, and MDHS. The tests also show that CMDHS is better than the DEMC proposed earlier. Finally, the proposed CMDHS was compared with two current state-of-the-art methods, a Krill Herd (KH) based centroid allocation method and an Artificial Bee Colony (ABC) based method, and found to outperform them in most cases.
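    The exact DESA/DDESA procedures are specified in the thesis; the following is only a minimal sketch of the general hybrid, assuming an XOR-based binary mutation, a binomial crossover, a simulated-annealing acceptance test, and a k-means silhouette objective as stand-ins:

    ```python
    # Sketch: differential evolution over binary feature masks with a
    # simulated-annealing acceptance rule; fitness is the silhouette of
    # a k-means clustering on the selected features (an assumed objective).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X, _ = make_blobs(n_samples=200, n_features=20, centers=4, random_state=0)

    def fitness(mask):
        if mask.sum() < 2:
            return -1.0
        sel = X[:, mask.astype(bool)]
        labels = KMeans(n_clusters=4, n_init=5, random_state=0).fit_predict(sel)
        return silhouette_score(sel, labels)

    pop = rng.integers(0, 2, size=(12, X.shape[1]))
    scores = np.array([fitness(m) for m in pop])
    T = 1.0                                    # annealing temperature
    for gen in range(20):
        for i in range(len(pop)):
            a, b, c = pop[rng.choice(len(pop), 3, replace=False)]
            donor = np.where(rng.random(X.shape[1]) < 0.3, a ^ (b ^ c), pop[i])
            trial = np.where(rng.random(X.shape[1]) < 0.9, donor, pop[i])  # binomial crossover
            s = fitness(trial)
            # SA acceptance: keep worse trials with temperature-dependent probability.
            if s > scores[i] or rng.random() < np.exp((s - scores[i]) / T):
                pop[i], scores[i] = trial, s
        T *= 0.85                              # cool down
    print("best score:", scores.max(), "features kept:", int(pop[scores.argmax()].sum()))
    ```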

    A New Algorithm for Clustering Search Results

    We develop a new algorithm for clustering search results. Unlike many other clustering systems that have recently been proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets; instead it uses Latent Semantic Indexing on the whole document content. A main contribution of the paper is a novel strategy, called Dynamic SVD Clustering, for discovering the optimal number of singular values to be used for clustering purposes. Moreover, the SVD computation step performs well in practice, which makes it feasible to perform clustering whenever term vectors are available. We show that the algorithm has very good classification performance and that it can effectively cluster the results of a search engine to make them easier for users to browse. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.
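    The paper's Dynamic SVD Clustering strategy is not detailed in this abstract, so the sketch below only illustrates the surrounding pipeline, assuming a silhouette criterion as a stand-in for the paper's way of picking the number of singular values:

    ```python
    # Sketch: project whole-document TF-IDF vectors with a truncated SVD
    # (LSA) and sweep the number of singular values, keeping the one that
    # yields the best clustering quality.
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import silhouette_score

    docs = [  # hypothetical search results
        "jaguar car engine speed review",
        "new jaguar car dealership prices",
        "jaguar big cat habitat and prey",
        "jaguar cats hunt in the rain forest",
        "python programming language tutorial",
        "learn python with simple programs",
    ]

    X = TfidfVectorizer().fit_transform(docs)
    best = None
    for k in range(2, min(X.shape) - 1):       # candidate numbers of singular values
        Z = TruncatedSVD(n_components=k, random_state=0).fit_transform(X)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
        score = silhouette_score(Z, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    print("chose", best[1], "singular values; clusters:", best[2])
    ```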