How to Search the Internet Archive Without Indexing It
Significant parts of cultural heritage have been produced on the web over
the last decades. While easy access to the current web is a good baseline,
optimal access to the past web faces several challenges. These include
dealing with large-scale web archive collections and the lack of usage logs
that contain the implicit human feedback most relevant for today's web
search. In this paper, we
propose an entity-oriented search system to support retrieval and analytics on
the Internet Archive. We use Bing to retrieve a ranked list of results from the
current web. In addition, we link retrieved results to the Wayback Machine;
thus allowing keyword search on the Internet Archive without processing and
indexing its raw archived content. Our search system complements existing web
archive search tools through a user-friendly interface, which comes close to
the functionalities of modern web search engines (e.g., keyword search, query
auto-completion and related query suggestion), and offers the benefit of
taking user feedback on the current web into account for web archive search
as well. Through extensive experiments, we conduct quantitative and
qualitative analyses to provide insights that enable further research on,
and practical applications of, web archives.
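The linking step described above can be sketched in a few lines. This is an illustrative assumption, not the authors' implementation: the `wayback_url` helper simply constructs a Wayback Machine URL for a live-web result, which is what allows keyword search over the archive without indexing its raw content.

```python
def wayback_url(live_url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine URL for a live-web result.

    With an empty timestamp, the Wayback Machine redirects to the
    closest available capture of the page.
    """
    if timestamp:
        return f"https://web.archive.org/web/{timestamp}/{live_url}"
    return f"https://web.archive.org/web/{live_url}"

def link_results(results: list[str]) -> list[tuple[str, str]]:
    """Pair each current-web result (e.g. from Bing) with its archived
    counterpart, so no raw archived content has to be processed."""
    return [(url, wayback_url(url)) for url in results]
```

A ranked list from the current web thus maps one-to-one onto archive entry points, with the archive itself resolving each URL to its closest capture.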
Quantitative Considerations about the Semantic Relationship of Entities in a Document Corpus
Providing suggestions to internet users is an important task nowadays. For example, when we enter a search string into the Google interface, it suggests further terms based on queries previously formulated by other users of the search engine. In the context of an entity-based search engine, entity suggestion is an equally important task when the user specifies entities. Additionally, this feature can be used to suggest further entities that are related to the already specified ones. If the suggestions are apt, the user can formulate the search query very quickly. If the suggestions are based on the search corpus itself, new and previously unknown relationships between entities can be discovered along the way. The aim of this paper is a quantitative analysis of relationships between entities in a large document corpus, with a view to providing entity suggestions in real time.
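As a toy illustration of corpus-based entity suggestion (a naive baseline, not the paper's method), related entities can be ranked by how often they co-occur with an already specified entity in the same document:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how often each pair of entities appears in the same document.

    `docs` is an iterable of entity lists, one list per document.
    """
    counts = Counter()
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

def suggest(entity, counts, k=3):
    """Suggest the k entities most frequently co-occurring with `entity`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == entity:
            related[b] += n
        elif b == entity:
            related[a] += n
    return [e for e, _ in related.most_common(k)]
```

Scaling this to a big corpus with real-time response is exactly the quantitative question the paper studies; the sketch only fixes the underlying counting model.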
Behavioral Task Modeling for Entity Recommendation
Our everyday tasks involve interactions with a wide range of information. The information that we manage is often associated with a task context. However, current computer systems do not organize information in this way and do not help the user find information in task context; instead, they require explicit user actions such as searching and information seeking. We explore the use of task context to guide the delivery of information to the user proactively, that is, to have the right information easily available at the right time. In this thesis, we used two types of novel contextual information for task modeling: 24/7 behavioral recordings and spoken conversations. The task context is created by monitoring the user's information behavior from temporal, social, and topical aspects; it can be characterized by several entities such as applications, documents, people, time, and various keywords determining the task. By tracking the associations among these entities, we can infer the user's task context, predict future information access, and proactively retrieve relevant information for the task at hand. The approach is validated with a series of field studies, in which altogether 47 participants voluntarily installed a screen monitoring system on their laptops 24/7 to collect available digital activities, and their spoken conversations were recorded. Different aspects of the data were considered to train the models. In the evaluation, we treated information sourced from several applications, spoken conversations, and various aspects of the data as different kinds of influence on the prediction performance. The combined influences of multiple data sources and aspects were also considered in the models. Our findings revealed that task information could be found in a variety of applications and spoken conversations.
In addition, we found that task context models that consider behavioral information captured from the computer screen and spoken conversations could yield a promising improvement in recommendation quality compared to the conventional modeling approach that considers only pre-determined interaction logs, such as query logs or Web browsing history. We also showed how a task context model could support the users' work performance, reducing their effort in searching by ranking and suggesting relevant information. Our results and findings have direct implications for information personalization and recommendation systems that leverage contextual information to predict and proactively present personalized information to the user to improve the interaction experience with computer systems.
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research.
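The meaning conflation deficiency above can be shown with a tiny worked example (the vectors are hand-picked toy values, not learned embeddings): if the two senses of "bank" had orthogonal sense vectors, a word-level model that averages them would yield a single vector sitting ambiguously between both meanings.

```python
from math import sqrt

# Toy sense vectors for "bank" (illustrative assumption, not real embeddings).
sense_vectors = {
    "bank_finance": [1.0, 0.0],
    "bank_river":   [0.0, 1.0],
}

# A word-level model collapses both senses into one vector, here an
# equal-weight average -- the meaning conflation deficiency in miniature.
word_vector = [0.5 * f + 0.5 * r
               for f, r in zip(sense_vectors["bank_finance"],
                               sense_vectors["bank_river"])]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))
```

The word vector ends up equally similar to two unrelated senses, which is precisely what sense-level representations are meant to avoid.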
Dataset search: a survey
Generating value from data requires the ability to find, access and make
sense of datasets. There are many efforts underway to encourage data sharing
and reuse, from scientific publishers asking authors to submit data alongside
manuscripts to data marketplaces, open data portals and data communities.
Google recently beta released a search service for datasets, which allows users
to discover data stored in various online repositories via keyword queries.
These developments foreshadow an emerging research field around dataset search
or retrieval that broadly encompasses frameworks, methods and tools that help
match a user's data need against a collection of datasets. Here, we survey the
state of the art of research and commercial systems in dataset retrieval. We
identify what makes dataset search a research field in its own right, with
unique challenges and methods and highlight open problems. We look at
approaches and implementations from related areas that dataset search draws
upon, including information retrieval, databases, and entity-centric and
tabular search, in order to identify possible paths to resolve these open
problems as well as immediate next steps that will take the field forward.
Comment: 20 pages, 153 references.
Validating multilingual hybrid automatic term extraction for search engine optimisation : the use case of EBM-GUIDELINES
Tools that automatically extract terms and their equivalents in other languages from parallel corpora can contribute to multilingual professional communication in more than one way. By means of a use case with data from a medical website offering point-of-care evidence summaries (Ebpracticenet), we illustrate how hybrid multilingual automatic term extraction from parallel corpora works and how it can be used in a practical application such as search engine optimisation. The original aim was to use the result of the extraction to improve the recall of a search engine by allowing automated multilingual searches. Two additional possible applications emerged while examining the data: searching via related forms and searching via strongly semantically related words. The second stage of this research was to find the most suitable format for the required manual validation of the raw extraction results and to compare the validation process when performed by a domain expert versus a terminologist.
BertNet: Harvesting Knowledge Graphs with Arbitrary Relations from Pretrained Language Models
It is crucial to automatically construct knowledge graphs (KGs) of diverse
new relations to support knowledge discovery and broad applications. Previous
KG construction methods, based on either crowdsourcing or text mining, are
often limited to a small predefined set of relations due to manual cost or
restrictions of the text corpus. Recent research proposed to use pretrained
language models (LMs) as implicit knowledge bases that accept knowledge queries
with prompts. Yet, the implicit knowledge lacks many desirable properties of a
full-scale symbolic KG, such as easy access, navigation, editing, and quality
assurance. In this paper, we propose a new approach of harvesting massive KGs
of arbitrary relations from pretrained LMs. With minimal input of a relation
definition (a prompt and a few example entity pairs), the approach
efficiently searches the vast entity-pair space to extract diverse, accurate
knowledge of the desired relation. We develop an effective search-and-rescore
mechanism for improved efficiency and accuracy. We deploy the approach to
harvest KGs of over 400 new relations from different LMs. Extensive human and
automatic evaluations show that our approach manages to extract diverse,
knowledge, including tuples of complex relations (e.g., "A is capable of but
not good at B"). The resulting KGs as a symbolic interpretation of the source
LMs also reveal new insights into the LMs' knowledge capacities.
Comment: ACL 2023 (Findings); Code available at
https://github.com/tanyuqian/knowledge-harvest-from-lm
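The search-and-rescore mechanism can be sketched as a generic two-stage loop (a hedged sketch: in the paper both scores would come from prompting the pretrained LM with the relation definition, whereas here they are stand-in functions, and the function name is an assumption):

```python
def search_and_rescore(candidates, search_score, rescore,
                       search_k=10, keep_k=3):
    """Two-stage harvesting sketch.

    A cheap scoring pass shortlists candidate entity pairs from the vast
    search space; a (notionally more expensive) rescoring pass then
    re-ranks the shortlist and keeps only the best tuples.
    """
    shortlist = sorted(candidates, key=search_score, reverse=True)[:search_k]
    reranked = sorted(shortlist, key=rescore, reverse=True)
    return reranked[:keep_k]
```

The split pays off because the expensive score is only ever evaluated on the shortlist, not on the full entity-pair space.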