7,678 research outputs found
Document image classification combining textual and visual features.
This research contributes to the problem of classifying document images. The main addition of this thesis is the exploitation of textual and visual features through an approach that uses Convolutional Neural Networks.
The study uses a combination of Optical Character Recognition and Natural Language Processing algorithms to extract and manipulate relevant text concepts from document images.
Such content information are embedded within document images, with the aim of adding elements which help to improve the classification results of a Convolutional Neural Network.
The experimental phase proves that the overall document classification accuracy of a Convolutional Neural Network trained using these text-augmented document images, is considerably higher than the one achieved by a similar model trained solely on classic document images, especially when different classes of documents share similar visual characteristics. The comparison between our method and state-of-the-art approaches demonstrates the effectiveness of combining visual and textual features.
Although this thesis is about document image classification, the idea of using textual and visual features is not restricted to this context and comes from the observation that textual and visual information are complementary and synergetic in many aspects
Data Mining in Electronic Commerce
Modern business is rushing toward e-commerce. If the transition is done
properly, it enables better management, new services, lower transaction costs
and better customer relations. Success depends on skilled information
technologists, among whom are statisticians. This paper focuses on some of the
contributions that statisticians are making to help change the business world,
especially through the development and application of data mining methods. This
is a very large area, and the topics we cover are chosen to avoid overlap with
other papers in this special issue, as well as to respect the limitations of
our expertise. Inevitably, electronic commerce has raised and is raising fresh
research problems in a very wide range of statistical areas, and we try to
emphasize those challenges.Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in the
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Beyond Keywords and Relevance: A Personalized Ad Retrieval Framework in E-Commerce Sponsored Search
On most sponsored search platforms, advertisers bid on some keywords for
their advertisements (ads). Given a search request, ad retrieval module
rewrites the query into bidding keywords, and uses these keywords as keys to
select Top N ads through inverted indexes. In this way, an ad will not be
retrieved even if queries are related when the advertiser does not bid on
corresponding keywords. Moreover, most ad retrieval approaches regard rewriting
and ad-selecting as two separated tasks, and focus on boosting relevance
between search queries and ads. Recently, in e-commerce sponsored search more
and more personalized information has been introduced, such as user profiles,
long-time and real-time clicks. Personalized information makes ad retrieval
able to employ more elements (e.g. real-time clicks) as search signals and
retrieval keys, however it makes ad retrieval more difficult to measure ads
retrieved through different signals. To address these problems, we propose a
novel ad retrieval framework beyond keywords and relevance in e-commerce
sponsored search. Firstly, we employ historical ad click data to initialize a
hierarchical network representing signals, keys and ads, in which personalized
information is introduced. Then we train a model on top of the hierarchical
network by learning the weights of edges. Finally we select the best edges
according to the model, boosting RPM/CTR. Experimental results on our
e-commerce platform demonstrate that our ad retrieval framework achieves good
performance
Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning--entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we obtained so far have shown clear promise of entity-aware search, in its usefulness, effectiveness, efficiency and scalability
Effectively Grouping Named Entities From Click- Through Data Into Clusters Of Generated Keywords1
Many studies show that named entities are closely related to users\u27 search behaviors, which brings increasing interest in studying named entities in search logs recently. This paper addresses the problem of forming fine grained semantic clusters of named entities within a broad domain such as “company”, and generating keywords for each cluster, which help users to interpret the embedded semantic information in the cluster. By exploring contexts, URLs and session IDs as features of named entities, a three-phase approach proposed in this paper first disambiguates named entities according to the features. Then it properly weights the features with a novel measurement, calculates the semantic similarity between named entities with the weighted feature space, and clusters named entities accordingly. After that, keywords for the clusters are generated using a text-oriented graph ranking algorithm. Each phase of the proposed approach solves problems that are not addressed in existing works, and experimental results obtained from a real click through data demonstrate the effectiveness of the proposed approach
Computing with Granular Words
Computational linguistics is a sub-field of artificial intelligence; it is an interdisciplinary field dealing with statistical and/or rule-based modeling of natural language from a computational perspective. Traditionally, fuzzy logic is used to deal with fuzziness among single linguistic terms in documents. However, linguistic terms may be related to other types of uncertainty. For instance, different users search ‘cheap hotel’ in a search engine, they may need distinct pieces of relevant hidden information such as shopping, transportation, weather, etc. Therefore, this research work focuses on studying granular words and developing new algorithms to process them to deal with uncertainty globally. To precisely describe the granular words, a new structure called Granular Information Hyper Tree (GIHT) is constructed. Furthermore, several technologies are developed to cooperate with computing with granular words in spam filtering and query recommendation. Based on simulation results, the GIHT-Bayesian algorithm can get more accurate spam filtering rate than conventional method Naive Bayesian and SVM; computing with granular word also generates better recommendation results based on users’ assessment when applied it to search engine
Forecasting Sales of Durable Goods – Does Search Data Help?
Search data can be used to forecast macroeconomic measures. The
present study extends this research direction by drawing on real sales data from a
household panel over two years. Specifically, the study analyzes whether search
data improves forecasts for seven products groups of durable goods. The forecast
model also includes the average weekly price and a dummy for the Christmas
season. Forecast accuracy is indeed improved when search data is included even
for product groups that have a short information and search phase. The product
groups, however, need to be chosen carefully, because some durable goods show
no lag between online search and purchase
- …