104,165 research outputs found
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
A literature survey of methods for analysis of subjective language
Subjective language is used to express attitudes and opinions towards things, ideas and people. While content and topic centred natural language processing is now part of everyday life, analysis of subjective aspects of natural language have until recently been largely neglected by the research community. The explosive growth of personal blogs, consumer opinion sites and social network applications in the last years, have however created increased interest in subjective language analysis. This paper provides an overview of recent research conducted in the area
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed
XML content warehousing: Improving sociological studies of mailing lists and web data
In this paper, we present the guidelines for an XML-based approach for the
sociological study of Web data such as the analysis of mailing lists or
databases available online. The use of an XML warehouse is a flexible solution
for storing and processing this kind of data. We propose an implemented
solution and show possible applications with our case study of profiles of
experts involved in W3C standard-setting activity. We illustrate the
sociological use of semi-structured databases by presenting our XML Schema for
mailing-list warehousing. An XML Schema allows many adjunctions or crossings of
data sources, without modifying existing data sets, while allowing possible
structural evolution. We also show that the existence of hidden data implies
increased complexity for traditional SQL users. XML content warehousing allows
altogether exhaustive warehousing and recursive queries through contents, with
far less dependence on the initial storage. We finally present the possibility
of exporting the data stored in the warehouse to commonly-used advanced
software devoted to sociological analysis
Detecting Family Resemblance: Automated Genre Classification.
This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targetted material for improving research. The current paper compares the role of visual layout, stylistic features and language model features in clustering documents and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.
Automatic extraction of knowledge from web documents
A large amount of digital information available is written as text documents in the form of web pages, reports, papers, emails, etc. Extracting the knowledge of interest from such documents from multiple sources in a timely fashion is therefore crucial. This paper provides an update on the Artequakt system which uses natural language tools to automatically extract knowledge about artists from multiple documents based on a predefined ontology. The ontology represents the type and form of knowledge to extract. This knowledge is then used to generate tailored biographies. The information extraction process of Artequakt is detailed and evaluated in this paper
Connecting Dream Networks Across Cultures
Many species dream, yet there remain many open research questions in the
study of dreams. The symbolism of dreams and their interpretation is present in
cultures throughout history. Analysis of online data sources for dream
interpretation using network science leads to understanding symbolism in dreams
and their associated meaning. In this study, we introduce dream interpretation
networks for English, Chinese and Arabic that represent different cultures from
various parts of the world. We analyze communities in these networks, finding
that symbols within a community are semantically related. The central nodes in
communities give insight about cultures and symbols in dreams. The community
structure of different networks highlights cultural similarities and
differences. Interconnections between different networks are also identified by
translating symbols from different languages into English. Structural
correlations across networks point out relationships between cultures.
Similarities between network communities are also investigated by analysis of
sentiment in symbol interpretations. We find that interpretations within a
community tend to have similar sentiment. Furthermore, we cluster communities
based on their sentiment, yielding three main categories of positive, negative,
and neutral dream symbols.Comment: 6 pages, 3 figure
- âŠ