69 research outputs found

Methods for web spam filtering

Author: Csalogány Károly
Publication date: 01/01/2009

ELTE Digital Institutional Repository (EDIT)

World Wide Web? A closer look at the transnational online public discourse on climate change

Author: Reber Ueli
Publication venue: Universität Bern

This dissertation pursues three research objectives: (1) map how transnational the online public discourse on the global phenomenon of climate change is, (2) understand the role of the (trans)nationalized online public discourses on climate change in today’s hybrid media system, and (3) find, implement, and validate computational methods to study public discourses across different political and language spaces. Devoted to these objectives, the three articles included in this thesis produced the following results: (1) The public discourse on climate change is transnationalized to a considerable degree. First, the same topics define the issue in the countries studied. However, some of the topics are of different importance to the actors in these countries. Second, the discourses in the countries are shaped by both domestic and foreign actors. However, the scope of transnationalization is restricted to countries of the Global North, with a clear bias towards the United States. The Global South is thus a blind spot. (2) For the studied case of Germany, there is no evidence for continuous resonance among climate change skeptics’ online communication and legacy media. However, there are occasions of selective resonance when climate change skeptics manage to exploit specific events to push their perspectives and positions onto the mass media’s agenda. The influence of the transnational skeptical counter-movement on German mainstream discourse is therefore limited. (3) The combination of machine translation and topic models is a great option when it comes to the automated analysis of large multilingual corpora. Regardless of whether full texts or only the vocabulary of a corpus is translated, the approach produces reliable and robust results. Moreover, the analysis of transnational discourse convergence has shown that machine translation and topic models can also be used for comparative research.

BORIS Theses

Mining Web Dynamics for Search

Author: Dai Na
Publication venue: Lehigh Preserve

Billions of web users collectively contribute to a dynamic web that preserves how information sources and descriptions change over time. This dynamic process sheds light on the quality of web content, and even indicates the temporal properties of information needs expressed via queries. However, existing commercial search engines typically utilize one crawl of web content (the latest) without considering the complementary information concealed in web dynamics. As a result, the generated rankings may be biased due to the deficiency of knowledge on page or hyperlink evolution, and the time-sensitive facet within search quality, e.g., freshness, has to be neglected. While previous research efforts have explored the temporal dimension in the retrieval process, few of them showed consistent improvements on a large-scale real-world archival web corpus with a broad time span. We investigate how to utilize the changes of web pages and hyperlinks to improve search quality, in terms of freshness and relevance of search results. Three applications that I have focused on are: (1) document representation, in which the anchortext (short descriptive text associated with hyperlinks) importance is estimated by considering its historical status; (2) web authority estimation, in which web freshness is quantified and utilized for controlling the authority propagation; and (3) learning to rank, in which freshness and relevance are optimized simultaneously in an adaptive way depending on query type. The contributions of this thesis are: (1) incorporating web dynamics information into critical components within search infrastructure in a principled way; and (2) empirically verifying the proposed methods through experiments on a large-scale real-world archival web corpus and demonstrating their superiority over the existing state of the art.

Lehigh University: Lehigh Preserve

LDA-Based Topic Strength Analysis

Authors: Li Lei; Wang Jiamiao; Wu Xindong
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 09/02/2018

Topic strength is an important hotspot in topic research. The evolution of topic strength not only indicates emerging new topics, but also helps us to determine whether a topic will produce some fluctuation of topic strength over time. Thus, topic strength analysis can provide significant findings in public opinion monitoring and user personalization. In this paper, we present an LDA-based topic strength analysis approach. We take topic quality into our topic strength consideration by combining local LDA and global LDA. For empirical studies, we use three data sets in real applications: film critic data of "A Chinese Odyssey" in Douban Movies, corruption news data in Sina News, and public paper data. Compared to existing approaches, experimental results show that our proposed approach can obtain better results of topic strength analysis in detecting the time of event topic occurrences and distinguishing different types of topics, and it can be used to monitor the occurrences of public opinions and the changes of public concerns

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Web-scale profiling of semantic annotations in HTML pages

Author: Meusel Robert
Publication date: 01/01/2017

The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision which was adopted by large numbers of web sites in recent years. Semantic annotations are integrated into the code of HTML pages using one of the three markup languages Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful for parties other than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis first evaluates the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations in the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations from the same and across different web sites. The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal, the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze the capability of product-related semantic annotations to be integrated into an existing product categorization schema. In particular, the potential of exploiting the diverse category information given by the web sites providing semantic annotations is evaluated.

MAnnheim DOCument Server

The Search as Learning Spaceship: Toward a Comprehensive Model of Psychological and Technological Facets of Search as Learning

Authors: Dietze Stefan; Ewerth Ralph; Holtz Peter; Hoppe Anett; Kammerer Yvonne; Otto Christian; Pardi Georg; Rokicki Markus; von Hoyer Johannes; Yu Ran
Publication venue: Lausanne : Frontiers Research Foundation
Publication date: 01/01/2022

Using a Web search engine is one of today’s most frequent activities. Exploratory search activities which are carried out in order to gain knowledge are conceptualized and denoted as Search as Learning (SAL). In this paper, we introduce a novel framework model which incorporates the perspective of both psychology and computer science to describe the search as learning process by reviewing recent literature. The main entities of the model are the learner who is surrounded by a specific learning context, the interface that mediates between the learner and the information environment, the information retrieval (IR) backend which manages the processes between the interface and the set of Web resources, that is, the collective Web knowledge represented in resources of different modalities. At first, we provide an overview of the current state of the art with regard to the five main entities of our model, before we outline areas of future research to improve our understanding of search as learning processes. Copyright © 2022 von Hoyer, Hoppe, Kammerer, Otto, Pardi, Rokicki, Yu, Dietze, Ewerth and Holtz

PubMed Central

SSOAR - Social Science Open Access Repository

Repositorium für Naturwissenschaften und Technik

Institutionelles Repositorium der Leibniz Universität Hannover

Web Page Classification and Hierarchy Adaptation

Author: Qi Xiaoguang
Publication venue: Lehigh Preserve

Lehigh University: Lehigh Preserve

Scalability of findability: decentralized search and retrieval in large information networks

Author: Ke Weimao
Publication venue: University of North Carolina at Chapel Hill
Publication date: 01/08/2010

Amid the rapid growth of information today is the increasing challenge for people to survive and navigate its magnitude. Dynamics and heterogeneity of large information spaces such as the Web challenge information retrieval in these environments. Collection of information in advance and centralization of IR operations are hardly possible because systems are dynamic and information is distributed. While monolithic search systems continue to struggle with the scalability problems of today, the future of search likely requires a decentralized architecture where many information systems can participate. As individual systems interconnect to form a global structure, finding relevant information in distributed environments transforms into a problem concerning not only information retrieval but also complex networks. Understanding network connectivity will provide guidance on how decentralized search and retrieval methods can function in these information spaces. The dissertation studies one aspect of scalability challenges facing classic information retrieval models and presents a decentralized, organic view of information systems pertaining to search in large scale networks. It focuses on the impact of network structure on search performance and investigates a phenomenon we refer to as the Clustering Paradox, in which the topology of interconnected systems imposes a scalability limit. Experiments involving large scale benchmark collections provide evidence on the Clustering Paradox in the IR context. In an increasingly large, distributed environment, decentralized searches for relevant information can continue to function well only when systems interconnect in certain ways. Relying on partial indexes of distributed systems, some level of network clustering enables very efficient and effective discovery of relevant information in large scale networks. Increasing or reducing network clustering degrades search performance. Given this specific level of network clustering, search time is well explained by a poly-logarithmic relation to network size, indicating a high scalability potential for searching in a continuously growing information space.

Carolina Digital Repository