152 research outputs found

    Document image classification combining textual and visual features.

    This research contributes to the problem of classifying document images. The main contribution of this thesis is the exploitation of textual and visual features through an approach that uses Convolutional Neural Networks. The study combines Optical Character Recognition and Natural Language Processing algorithms to extract and manipulate relevant text concepts from document images. This content information is embedded within the document images, with the aim of adding elements that help improve the classification results of a Convolutional Neural Network. The experiments show that the overall document classification accuracy of a Convolutional Neural Network trained on these text-augmented document images is considerably higher than that achieved by a similar model trained solely on classic document images, especially when different classes of documents share similar visual characteristics. The comparison between our method and state-of-the-art approaches demonstrates the effectiveness of combining visual and textual features. Although this thesis is about document image classification, the idea of using textual and visual features is not restricted to this context; it comes from the observation that textual and visual information are complementary and synergetic in many respects.

    Neural IR Meets Graph Embedding: A Ranking Model for Product Search

    Recently, neural models for information retrieval have become increasingly popular. They provide effective approaches for product search due to their competitive advantages in semantic matching. However, graph-based features, although shown to be very useful in the IR literature, are challenging to use in these neural approaches. In this paper, we leverage recent advances in graph embedding techniques to enable neural retrieval models to exploit graph-structured data for automatic feature extraction. The proposed approach can not only help to overcome the long-tail problem of click-through data, but also incorporate external heterogeneous information to improve search results. Extensive experiments on a real-world e-commerce dataset demonstrate significant improvement achieved by our proposed approach over multiple strong baselines, both as an individual retrieval model and as a feature used in learning-to-rank frameworks. Comment: A preliminary version of the work to appear in TheWebConf'19 (formerly WWW'19).

    Term-driven E-Commerce

    This thesis addresses the textual dimension of e-commerce. Its basic hypothesis is that information and transactions in electronic commerce are bound to text. Wherever products and services are offered, requested, perceived, and evaluated, natural-language expressions are used. Two things follow from this: first, it is important to capture the variance of textual descriptions in e-commerce; second, the extensive textual resources that arise from e-commerce interactions can be used to gain a better understanding of natural language.

    New Weighting Schemes for Document Ranking and Ranked Query Suggestion

    Term weighting is the process of scoring and ranking a term's relevance to a user's information need or the importance of a term to a document. This thesis investigates novel term weighting methods with applications in document representation for text classification, web document ranking, and ranked query suggestion. Firstly, this research proposes a new feature for document representation under the vector space model (VSM) framework, i.e., class-specific document frequency (CSDF), which leads to a new term weighting scheme based on term frequency (TF) and the newly proposed feature. The experimental results show that the proposed methods, CSDF and TF-CSDF, improve the performance of document classification in comparison with other widely used VSM document representations. Secondly, a new ranking method called GCrank is proposed for re-ranking web documents returned from search engines using document classification scores. The experimental results show that GCrank can improve the ranking of returned web documents in terms of several commonly used evaluation criteria. Finally, this research investigates several state-of-the-art ranked retrieval methods, then adapts and combines them into a new method for ranked query suggestion called Tfjac, which is based on a combination of the TF-IDF and Jaccard coefficient methods. The experimental results show that Tfjac is the best method for query suggestion among the methods evaluated. It outperforms the most widely used TF-IDF method by increasing the number of highly relevant query suggestions.
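
The abstract does not say how Tfjac combines the two scores, so the following is only a minimal illustrative sketch, not the thesis's method. It assumes a weighted sum of TF-IDF cosine similarity and the Jaccard coefficient over token sets. The function names, the whitespace tokenizer, the smoothed IDF formula, and the `alpha` weight are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard coefficient between the token sets of two strings."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_suggestions(query, candidates, alpha=0.5):
    """Rank candidates by alpha * TF-IDF cosine + (1 - alpha) * Jaccard.

    The weighted sum and the alpha parameter are assumptions made for
    illustration; the source abstract does not define the combination.
    """
    vectors = tfidf_vectors([query] + candidates)
    q_vec = vectors[0]
    scored = []
    for cand, vec in zip(candidates, vectors[1:]):
        score = alpha * cosine(q_vec, vec) + (1 - alpha) * jaccard(query, cand)
        scored.append((cand, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, calling `rank_suggestions("cheap flights", ["cheap flights to paris", "hotel deals", "cheap flights"])` places the exact match first and the candidate with no shared terms last.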

    Aggregated search: a new information retrieval paradigm

    Traditional search engines return ranked lists of search results. It is up to the user to scroll through this list, scan the different documents, and assemble the information that fulfills their information need. Aggregated search represents a new class of approaches in which the information is not only retrieved but also assembled. This is the current evolution in Web search, where diverse content (images, videos, ...) and relational content (similar entities, features) are included in search results. In this survey, we propose a simple analysis framework for aggregated search and give an overview of existing work. We start with work in related domains such as federated search, natural language generation, and question answering. Then we focus on more recent trends, namely cross-vertical aggregated search and relational aggregated search, which are already present in current Web search.

    Statistical learning for predictive targeting in online advertising
