6,227 research outputs found
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems
Automated user modeling for personalized digital libraries
Digital libraries (DL) have become one of the most typical ways of accessing any kind of digitalized information. Due to this key role, users welcome any improvements on the services they receive from digital libraries. One trend used to
improve digital services is through personalization. Up to now, the most common approach for personalization in digital libraries has been user-driven. Nevertheless, the design of efficient personalized services has to be done, at least in part, in
an automatic way. In this context, machine learning techniques automate the process of constructing user models. This paper proposes a new approach to construct digital libraries that satisfy user’s necessity for information: Adaptive Digital Libraries, libraries that automatically learn user preferences and goals and personalize their interaction using this information
Research and Development Workstation Environment: the new class of Current Research Information Systems
Against the backdrop of the development of modern technologies in the field
of scientific research the new class of Current Research Information Systems
(CRIS) and related intelligent information technologies has arisen. It was
called - Research and Development Workstation Environment (RDWE) - the
comprehensive problem-oriented information systems for scientific research and
development lifecycle support. The given paper describes design and development
fundamentals of the RDWE class systems. The RDWE class system's generalized
information model is represented in the article as a three-tuple composite web
service that include: a set of atomic web services, each of them can be
designed and developed as a microservice or a desktop application, that allows
them to be used as an independent software separately; a set of functions, the
functional filling-up of the Research and Development Workstation Environment;
a subset of atomic web services that are required to implement function of
composite web service. In accordance with the fundamental information model of
the RDWE class the system for supporting research in the field of ontology
engineering - the automated building of applied ontology in an arbitrary domain
area, scientific and technical creativity - the automated preparation of
application documents for patenting inventions in Ukraine was developed. It was
called - Personal Research Information System. A distinctive feature of such
systems is the possibility of their problematic orientation to various types of
scientific activities by combining on a variety of functional services and
adding new ones within the cloud integrated environment. The main results of
our work are focused on enhancing the effectiveness of the scientist's research
and development lifecycle in the arbitrary domain area.Comment: In English, 13 pages, 1 figure, 1 table, added references in Russian.
Published. Prepared for special issue (UkrPROG 2018 conference) of the
scientific journal "Problems of programming" (Founder: National Academy of
Sciences of Ukraine, Institute of Software Systems of NAS Ukraine
Syntactic and Semantic Analysis and Visualization of Unstructured English Texts
People have complex thoughts, and they often express their thoughts with complex sentences using natural languages. This complexity may facilitate efficient communications among the audience with the same knowledge base. But on the other hand, for a different or new audience this composition becomes cumbersome to understand and analyze. Analysis of such compositions using syntactic or semantic measures is a challenging job and defines the base step for natural language processing.
In this dissertation I explore and propose a number of new techniques to analyze and visualize the syntactic and semantic patterns of unstructured English texts.
The syntactic analysis is done through a proposed visualization technique which categorizes and compares different English compositions based on their different reading complexity metrics. For the semantic analysis I use Latent Semantic Analysis (LSA) to analyze the hidden patterns in complex compositions. I have used this technique to analyze comments from a social visualization web site for detecting the irrelevant ones (e.g., spam). The patterns of collaborations are also studied through statistical analysis.
Word sense disambiguation is used to figure out the correct sense of a word in a sentence or composition. Using textual similarity measure, based on the different word similarity measures and word sense disambiguation on collaborative text snippets from social collaborative environment, reveals a direction to untie the knots of complex hidden patterns of collaboration
Recommended from our members
Proceedings ICPW'07: 2nd International Conference on the Pragmatic Web, 22-23 Oct. 2007, Tilburg: NL
Proceedings ICPW'07: 2nd International Conference on the Pragmatic Web, 22-23 Oct. 2007, Tilburg: N
Text-based Sentiment Analysis and Music Emotion Recognition
Nowadays, with the expansion of social media, large amounts of user-generated
texts like tweets, blog posts or product reviews are shared online. Sentiment polarity
analysis of such texts has become highly attractive and is utilized in recommender
systems, market predictions, business intelligence and more. We also witness deep
learning techniques becoming top performers on those types of tasks. There are
however several problems that need to be solved for efficient use of deep neural
networks on text mining and text polarity analysis.
First of all, deep neural networks are data hungry. They need to be fed with
datasets that are big in size, cleaned and preprocessed as well as properly labeled.
Second, the modern natural language processing concept of word embeddings as a
dense and distributed text feature representation solves sparsity and dimensionality
problems of the traditional bag-of-words model. Still, there are various uncertainties
regarding the use of word vectors: should they be generated from the same dataset
that is used to train the model or it is better to source them from big and popular
collections that work as generic text feature representations? Third, it is not easy for
practitioners to find a simple and highly effective deep learning setup for various
document lengths and types. Recurrent neural networks are weak with longer texts
and optimal convolution-pooling combinations are not easily conceived. It is thus
convenient to have generic neural network architectures that are effective and can
adapt to various texts, encapsulating much of design complexity.
This thesis addresses the above problems to provide methodological and practical
insights for utilizing neural networks on sentiment analysis of texts and achieving
state of the art results. Regarding the first problem, the effectiveness of various
crowdsourcing alternatives is explored and two medium-sized and emotion-labeled
song datasets are created utilizing social tags. One of the research interests of Telecom
Italia was the exploration of relations between music emotional stimulation and
driving style. Consequently, a context-aware music recommender system that aims
to enhance driving comfort and safety was also designed. To address the second
problem, a series of experiments with large text collections of various contents and
domains were conducted. Word embeddings of different parameters were exercised
and results revealed that their quality is influenced (mostly but not only) by the
size of texts they were created from. When working with small text datasets, it is
thus important to source word features from popular and generic word embedding
collections. Regarding the third problem, a series of experiments involving convolutional
and max-pooling neural layers were conducted. Various patterns relating
text properties and network parameters with optimal classification accuracy were
observed. Combining convolutions of words, bigrams, and trigrams with regional
max-pooling layers in a couple of stacks produced the best results. The derived
architecture achieves competitive performance on sentiment polarity analysis of
movie, business and product reviews.
Given that labeled data are becoming the bottleneck of the current deep learning
systems, a future research direction could be the exploration of various data programming
possibilities for constructing even bigger labeled datasets. Investigation
of feature-level or decision-level ensemble techniques in the context of deep neural
networks could also be fruitful. Different feature types do usually represent complementary
characteristics of data. Combining word embedding and traditional text
features or utilizing recurrent networks on document splits and then aggregating the
predictions could further increase prediction accuracy of such models
A state-of-the-art review of built environment information modelling (BeIM)
Elements that constitute the built environment are vast and so are the independent systems developed to model its various aspects. Many of these systems have been developed under various assumptions and approaches to execute functions that are distinct, complementary or sometimes similar. Also, these systems are ever increasing in number and often assume similar nomenclatures and acronyms thereby exacerbating the challenges of understanding their peculiar functions, definitions and differences. The current societal demand to improve sustainability performance through collaboration, whole-systems and through-life thinking, is driving the need to integrate independent systems associated with different aspects and scales of the built environment to deliver smart solutions and services that improve the wellbeing of
citizens. The contemporary object-oriented digitization of real world elements appears to provide a leeway for amalgamating modelling systems of various domains in the built environment which we termed as built environment information modelling (BeIM). These domains included Architecture, Engineering, Construction and Urban Planning and Design. Applications such as Building Information Modelling, Geographic Information Systems and 3D City Modelling systems are now being integrated for city modelling purposes. The various works directed at integrating these systems are examined revealing that current research efforts on integration fall into three categories: (1) data/file conversion systems, (2) semantic mapping systems and (3) the hybrid of both. The review outcome suggests that a good
knowledge of these domains and how their respective systems operate is vital to pursuing holistic systems integration in the built environment
- …