6 research outputs found
Collecte orientée sur le Web pour la recherche d'information spécialisée (Focused Web document gathering for specialized information retrieval)
Vertical search engines, which focus on specific segments of the Web, are becoming increasingly present in the Internet landscape. Topical search engines, in particular, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects and their properties, to satisfy user information needs. In this thesis, we tackle the first, unavoidable step of any topical search engine: focused document gathering from the Web.

A thorough study of the state of the art leads us to consider two strategies for gathering topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling).

The first part of our research is dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine and downloading the top-ranked documents. After empirically evaluating this approach over 340 topics drawn from the Open Directory, we propose to enhance it in two ways. Upstream of the search engine, we aim at formulating more relevant queries in order to increase the precision of the top retrieved documents; to do so, we define a metric based on a co-occurrence graph and a random walk algorithm that predicts the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the quality of the resulting collection; we do so by modeling the gathering process as a tripartite graph and applying a random walk with restart algorithm so as to simultaneously rank, by relevance, the documents and the terms appearing in them.

In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation, which was designed to scale horizontally, and then consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler. This ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the number of irrelevant ones. We propose to apply learning-to-rank algorithms to efficiently order the crawl frontier, and define a method to learn a topic-independent ranking function from existing, automatically annotated crawls.
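To make the downstream filtering step more concrete, the following is a minimal sketch of a random walk with restart over a small query–document–term graph, in the spirit of the tripartite-graph model described above. The node names, edge weights and restart probability are illustrative assumptions, not data or code from the thesis.

```python
import numpy as np

# Toy tripartite graph: seed queries link to the documents they retrieved,
# documents link to the terms they contain.  All names and weights here are
# illustrative assumptions.
edges = {
    "q:jazz guitar":  {"d:doc1": 1.0, "d:doc2": 1.0},
    "q:guitar chord": {"d:doc1": 1.0},
    "d:doc1":         {"t:guitar": 2.0, "t:chord": 1.0},
    "d:doc2":         {"t:guitar": 1.0, "t:mortgage": 3.0},  # mixes in an off-topic term
}

# Collect nodes and build a symmetric weighted adjacency matrix.
nodes = sorted({n for src, dsts in edges.items() for n in [src, *dsts]})
idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for src, dsts in edges.items():
    for dst, w in dsts.items():
        A[idx[src], idx[dst]] = A[idx[dst], idx[src]] = w

# Column-normalize to obtain transition probabilities for the walk.
P = A / np.maximum(A.sum(axis=0), 1e-12)

# Restart distribution biased towards the seed queries.
restart = np.zeros(len(nodes))
for q in ("q:jazz guitar", "q:guitar chord"):
    restart[idx[q]] = 0.5

# Random walk with restart by power iteration.
alpha = 0.15  # restart probability (assumed value)
score = restart.copy()
for _ in range(200):
    score = (1 - alpha) * P @ score + alpha * restart

# Documents and terms are then ranked by their stationary scores;
# with these toy weights, d:doc1 and t:guitar end up above t:mortgage.
for n in sorted(nodes, key=lambda n: -score[idx[n]]):
    print(f"{n:15s} {score[idx[n]]:.3f}")
```

In the thesis the graph is built from real queries, downloaded documents and extracted terms; here the restart mass simply sits on two seed queries to show how the bias propagates to documents and terms.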
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.
Dynamic User Profiling for Search Personalisation
The performance of a personalised search system largely depends upon the ability to build user profiles which accurately capture the user's search interests. However, many approaches to user profiling have neglected the dynamic nature of the user's search interests. That is, a user's search interests typically change in response to their interactions with the search system during the search period. Therefore, a profile built for previous searches might not reflect that user's current search interests.
A widely used type of profile represents the topical interests of the user. In these cases, a typical approach is to build a user profile using topics discussed in documents which the user has found relevant, and where the topics are obtained from a human-generated ontology or directory. However, a key limitation of these approaches is that many documents may not contain the topics covered in the ontology. Moreover, the human-generated ontology requires manual effort to determine the correct categories for each document.
In this research, we address these problems by proposing novel techniques for dynamically building user profiles which capture the user's search interests as they change over time. Instead of using a human-generated ontology, we use a topic modelling technique (Latent Dirichlet Allocation) for unsupervised extraction of the topics from documents. To dynamically build user profiles, we make two important assumptions: first, that the group of users with whom a user shares a set of common interests may differ depending upon the particular topic of interest; and second, that more recently clicked/relevant documents tell us more about the user's current search interests.
To test these assumptions, we develop and implement dynamic user profiles, and then evaluate them on two search personalisation tasks. Our first chosen task is personalising search results returned by a Web search engine, and the second is the task of personalising query suggestions made by an Intranet search engine. We found that dynamic user profiles can significantly improve the ranking quality over well-established baselines.
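As a rough illustration of what "dynamic" means here, the sketch below recency-weights per-document topic vectors (assumed to come from an already-trained LDA model) and re-ranks candidate results against the resulting profile. The half-life, the toy vectors and the function names are assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np

def dynamic_profile(clicked, half_life_days=7.0):
    """Recency-weighted average of per-document LDA topic vectors.

    `clicked` holds (age_in_days, topic_vector) pairs for documents the user
    found relevant; more recent clicks get exponentially more weight.
    (The half-life and the inputs are illustrative assumptions.)
    """
    weights = np.array([0.5 ** (age / half_life_days) for age, _ in clicked])
    vectors = np.array([vec for _, vec in clicked])
    profile = weights @ vectors / weights.sum()
    return profile / profile.sum()                 # keep it a distribution

def rerank(results, profile):
    """Re-rank candidate results by cosine similarity to the current profile."""
    def cosine(v):
        return float(v @ profile / (np.linalg.norm(v) * np.linalg.norm(profile)))
    return sorted(results, key=lambda r: -cosine(r["topics"]))

# Toy usage with 3 topics: two week-old clicks about topic 0, one fresh click about topic 2.
clicked = [(10.0, np.array([0.8, 0.1, 0.1])),
           (9.0,  np.array([0.7, 0.2, 0.1])),
           (0.5,  np.array([0.1, 0.1, 0.8]))]
profile = dynamic_profile(clicked)
results = [{"url": "a", "topics": np.array([0.90, 0.05, 0.05])},
           {"url": "b", "topics": np.array([0.05, 0.05, 0.90])}]
print([r["url"] for r in rerank(results, profile)])  # the fresh interest wins: ['b', 'a']
```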
Ranking for Scalable Information Extraction
Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple ⟨tsunami, Hawaii⟩ from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than what is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the-art approaches for estimating the number of documents in a collection that lead to the extraction of tuples as well as information extraction-specific approaches. Using these estimations we can identify the collections worth processing and ignore the rest, for efficiency. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences (modeled as groups of words) that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text.
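As one way to picture the adaptive document ranking idea, the sketch below uses a pointwise gradient-boosted regressor as a stand-in for the dissertation's learning-to-rank component: cheap document features predict expected tuple yield, the expensive extractor runs in that order, and the model is refit periodically as evidence about useful documents accumulates. The feature and extraction callables, batch size and model choice are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def adaptive_extraction(documents, featurize, extract, refresh_every=500):
    """Run an expensive extractor over documents in an adaptively re-learned order.

    `featurize(doc)` returns a cheap feature vector; `extract(doc)` runs the
    costly IE system and returns the tuples it finds.  A pointwise regressor
    (a stand-in for a learning-to-rank model) predicts expected tuple yield
    and is refit after every batch of `refresh_every` processed documents.
    """
    model, seen_X, seen_y, results = None, [], [], []
    pending = list(documents)
    while pending:
        if model is not None:
            # Revise the ranking of the remaining documents with the current model.
            scores = model.predict(np.array([featurize(d) for d in pending]))
            pending = [pending[i] for i in np.argsort(-scores)]
        batch, pending = pending[:refresh_every], pending[refresh_every:]
        for doc in batch:
            tuples = extract(doc)            # the expensive step
            results.extend(tuples)
            seen_X.append(featurize(doc))
            seen_y.append(len(tuples))       # training signal: observed yield
        if len(seen_X) >= 50:                # refit once enough evidence is available
            model = GradientBoostingRegressor().fit(np.array(seen_X), np.array(seen_y))
    return results
```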
Toponym Resolution in Text
Institute for Communicating and Collaborative Systems
Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text.
In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)).
However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g., for choosing a focus) and question answering (e.g., for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected.
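For instance, once both toponyms in that question have been grounded to coordinates, the answer reduces to a great-circle computation. The snippet below is a standard haversine calculation with approximate coordinates, purely to illustrate the payoff of grounding; the numbers are rough values, not data from the thesis.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance between two latitude/longitude points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlam = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Approximate referents once both toponyms have been grounded.
london    = (51.51, -0.13)     # London, UK
edinburgh = (55.95, -3.19)     # Edinburgh, UK
print(f"{haversine_km(*london, *edinburgh):.0f} km")   # roughly 530 km
```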
Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint.
The task of mapping from names to locations is hard due to insufficient and noisy databases, and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete.
Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model, robustly, on open-domain news text.
I begin by comparing the few algorithms proposed in the literature and, by comparing semi-formal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture.
Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined here and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges) and of compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin) is not attempted. Historic change is a major factor affecting gazetteer construction and ultimately toponym resolution; however, this is beyond the scope of this thesis.
Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus.
Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to describe them systematically and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence.
In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory.
I then compare the performance of the same TR algorithms under three different conditions, namely applying them to (i) the output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup procedure in a gazetteer.
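To give a flavour of conditions like (iii) and of the heuristic biases catalogued earlier, here is a toy sketch of a gazetteer lookup together with a single population-based bias. The gazetteer entries and numbers are rough illustrative values, and the systems studied in the thesis combine many more sources of evidence.

```python
# Toy gazetteer: each toponym maps to candidate referents with coordinates and
# population counts (all numbers are rough, illustrative values).
GAZETTEER = {
    "London": [
        {"path": "London > UK",               "lat": 51.51, "lon": -0.13,  "pop": 8_800_000},
        {"path": "London > Ontario > Canada", "lat": 42.98, "lon": -81.25, "pop": 420_000},
    ],
    "Cambridge": [
        {"path": "Cambridge > UK",                 "lat": 52.21, "lon": 0.12,   "pop": 125_000},
        {"path": "Cambridge > Massachusetts > US", "lat": 42.37, "lon": -71.11, "pop": 118_000},
    ],
}

def naive_lookup(toponym):
    """Naive resolution: return the first gazetteer entry, ignoring all context."""
    candidates = GAZETTEER.get(toponym, [])
    return candidates[0] if candidates else None

def resolve_by_population(toponym):
    """One heuristic bias: prefer the candidate referent with the largest population."""
    candidates = GAZETTEER.get(toponym, [])
    return max(candidates, key=lambda c: c["pop"], default=None)

for name in ["London", "Cambridge"]:
    print(name, "->", resolve_by_population(name)["path"])
```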
Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, caused by multiple near-duplicate entries in the reference gazetteer.
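One way such a lenient criterion can be realized is to accept a resolved referent whenever its coordinates fall within a small tolerance of the gold referent before computing precision and recall; the tolerance value and data layout below are assumptions for illustration, not the thesis's exact criterion.

```python
def lenient_match(pred, gold, tol_degrees=0.1):
    """Count a resolved toponym as correct if its coordinates fall within a small
    tolerance of the gold referent, so near-duplicate gazetteer entries for the
    same place are not penalized (the 0.1-degree tolerance is an assumed value)."""
    return (abs(pred["lat"] - gold["lat"]) <= tol_degrees
            and abs(pred["lon"] - gold["lon"]) <= tol_degrees)

def precision_recall(predictions, gold_standard):
    """`predictions` and `gold_standard` map toponym occurrence ids to referents;
    occurrences left unresolved by the system are simply absent from `predictions`."""
    correct = sum(1 for occ, gold in gold_standard.items()
                  if occ in predictions and lenient_match(predictions[occ], gold))
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(gold_standard) if gold_standard else 0.0
    return precision, recall

# Toy usage: the prediction comes from a near-duplicate gazetteer entry.
gold = {"doc1:London[0]": {"lat": 51.51, "lon": -0.13}}
pred = {"doc1:London[0]": {"lat": 51.50, "lon": -0.12}}
print(precision_recall(pred, gold))   # (1.0, 1.0)
```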
Main Contributions. The major contributions of this thesis are as follows:
• A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51°32′ North, 0°5′ West) as well as hierarchical path descriptions (such as London > UK) with respect to a worldwide-coverage geographic taxonomy constructed by combining several large but noisy gazetteers. The corpus contains news stories and comprises two sub-corpora, a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)) and a subset of the Fourth Message Understanding Contest (MUC-4; Chinchor (1995)), both available pre-annotated with gold-standard named entity annotation. This corpus will be made available as a reference evaluation resource;
• a new method and implemented system for resolving toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and of grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence;
• an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news-genre text;
• a comparison between a replicated method from the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and
• several exemplary prototypical applications showing how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically-aware document retrieval, and answers to spatial questions (How far...?) in an open-domain question answering system. These applications have only a demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and is left for future work.