Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential of cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.
Comment: Knowledge-based System
Alternative approach to tree-structured web log representation and mining
More recent approaches to web log data representation aim to capture user navigational patterns with respect to the overall structure of the web site. One such representation, tree-structured log files, is the focus of this work. Most existing methods for analyzing such data are based on the use of frequent subtree mining techniques to extract frequent user activity and navigational paths. In this paper we evaluate the use of other standard data mining techniques enabled by a recently proposed structure-preserving flat data representation for tree-structured data. The initially proposed framework was adjusted to better suit the web log mining task. Experimental evaluation is performed on two real-world web log datasets, and comparisons are made with an existing state-of-the-art classifier for tree-structured data. The results show the great potential of the method in enabling the application of a wider range of data mining/analysis techniques to tree-structured web log data.
OPEN—Enabling Non-expert Users to Extract, Integrate, and Analyze Open Data
Government initiatives for more transparency and participation have led to an increasing amount of structured data on the web in recent years. Many of these datasets have great potential. For example, a situational analysis and meaningful visualization of the data can assist in pointing out social or economic issues and raising people’s awareness. Unfortunately, the ad-hoc analysis of this so-called Open Data can prove very complex and time-consuming, partly due to a lack of efficient system support. On the one hand, search functionality is required to identify relevant datasets. Common document retrieval techniques used in web search, however, are not optimized for Open Data and do not address the semantic ambiguity inherent in it. On the other hand, semantic integration is necessary to perform analysis tasks across multiple datasets. To do so in an ad-hoc fashion, however, requires more flexibility and easier integration than most data integration systems provide. It is apparent that an optimal management system for Open Data must combine aspects from both classic approaches. In this article, we propose OPEN, a novel concept for the management and situational analysis of Open Data within a single system. In our approach, we extend a classic database management system, adding support for the identification and dynamic integration of public datasets. As most web users lack the experience and training required to formulate structured queries in a DBMS, we add support for non-expert users to our system, for example through keyword queries. Furthermore, we address the challenge of indexing Open Data.
A semantic framework for ontology usage analysis
The Semantic Web envisions a Web where information is accessible and processable by computers as well as humans. Ontologies are the cornerstones for realizing this vision of the Semantic Web by capturing domain knowledge by defining the terms and the relationship between these terms to provide a formal representation of the domain with machine-understandable semantics. Ontologies are used for semantic annotation, data interoperability and knowledge assimilation and dissemination. In the literature, different approaches have been proposed to build and evolve ontologies, but in addition to these, one more important concept needs to be considered in the ontology lifecycle, that is, its usage. Measuring the “usage” of ontologies will help us to effectively and efficiently make use of semantically annotated structured data published on the Web (formalized knowledge published on the Web), improve the state of ontology adoption and reusability, provide a usage-based feedback loop to the ontology maintenance process for a pragmatic conceptual model update, and source information accurately and automatically which can then be utilized in the other different areas of the ontology lifecycle. Ontology Usage Analysis is the area which evaluates, measures and analyses the use of ontologies on the Web. However, in spite of its importance, no formal approach is present in the literature which focuses on measuring the use of ontologies on the Web. This is in contrast to the approaches proposed in the literature on the other concepts of the ontology lifecycle, such as ontology development, ontology evaluation and ontology evolution. So, to address this gap, this thesis is an effort in such a direction to assess, analyse and represent the use of ontologies on the Web. In order to address the problem and realize the abovementioned benefits, an Ontology Usage Analysis Framework (OUSAF) is presented.
The OUSAF Framework implements a methodological approach that comprises identification, investigation, representation and utilization phases. These phases provide a complete solution for usage analysis by allowing users to identify the key ontologies, and investigate, represent and utilize usage analysis results. Various computation components with several methods, techniques, and metrics for each phase are presented and evaluated using the Semantic Web data crawled from the Web. For the dissemination of ontology-usage-related information accessible to machines and humans, the U Ontology is presented to formalize the conceptual model of the ontology usage domain. The evaluation of the framework, solution components, methods, and a formalized conceptual model is presented, indicating the usefulness of the overall proposed solution.
Use of Web 2.0 Technologies in the Teaching/Learning of Business Education in Nigerian Universities
This paper assessed the use of web 2.0 technologies in the teaching/learning of business education courses in Nigerian universities. The paper sought answers to the research question "What are the web 2.0 technologies used by lecturers and students of business education in Nigerian universities?" The study adopted both qualitative and quantitative approaches. A descriptive survey was the design used for the quantitative method and content analysis for the qualitative method. A sample of 38 lecturers and 113 students was used for the survey. A total of 151 copies of the questionnaire were administered, and all copies were retrieved and used for the study. A semi-structured interview was the instrument used to gather the qualitative data. Mean, standard deviation and ranks were used to analyze the quantitative data collected. An independent-samples t-test was used to test the null hypothesis at the 0.05 level of significance. The qualitative data were analyzed using two themes. The findings of the study revealed that web 2.0 technologies are not used in the teaching/learning of business education. It was also found that lack of technical expertise and uneasiness with openness and public discourse and interactions are some of the reasons why web tools are not used in teaching and learning. Based on the findings, it was concluded that graduates of business education would not be able to acquire the skills and competencies required to operate effectively in the 21st-century world of employment. The study recommends, among other things, that business education teachers and students be given technical support to help them divert their use of web 2.0 technologies from entertainment to educational purposes. Keywords: Use, web, tools, teaching, learning, business education, Nigeria, Universities DOI: 10.7176/JEP/10-9-23 Publication date: March 31st 201
A Bibliometric Analysis of Collaborative Supply Chain Risk Management in Crisis Situations
Crises including the COVID-19 pandemic have caused disruptive changes to many industries and supply chains around the world. Their severe impacts on business and the economy provide an opportunity to increase preparedness and reveal the importance of implementing a collaborative supply chain risk management process. This paper uses a bibliometric analysis based on a co-citation analysis to reveal the research areas and gaps concerning collaborative supply chain risk management with a focus on crisis situations. Using a structured approach based on Soni and Kodali (2011) and Gmür (2003), 269 papers were extracted from the database Web of Science (WOS) using a specific search string. Data filtering and preparation using title, abstract, and full paper screening, as well as the number of cited-in references, led to a final sum of 50 papers. These papers were prepared for the co-citation analysis based on a co-citation matrix that served as an input for the Organizational Risk Analyzer (ORA) software. The cluster analysis was carried out in the ORA software with a threshold of 0.01, and based on that, five clusters were extracted from the network. Extracted main research areas include collaboration approaches and criteria as well as decision-making approaches and lessons learned from COVID-19. Research gaps and suggested future research areas are presented based on the cluster analysis.
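As a rough illustration of the co-citation matrix mentioned in the abstract above, the following minimal sketch counts how often two references appear together in the same citing paper's reference list. The paper identifiers and reference lists are invented for the example, and nothing here reproduces the ORA workflow or the study's data.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each unordered pair of references, how many citing papers cite both."""
    counts = Counter()
    for refs in reference_lists.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical citing papers and their cited references.
example = {
    "paper_1": ["ref_A", "ref_B", "ref_C"],
    "paper_2": ["ref_A", "ref_B"],
    "paper_3": ["ref_B", "ref_C"],
}
counts = cocitation_counts(example)
```

The resulting pair counts are the off-diagonal entries of a symmetric co-citation matrix; a threshold on those counts (or on a normalized version of them) would then decide which pairs form edges in a network.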
A scalable analysis framework for large-scale RDF data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains
as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges
modern knowledge storage and discovery techniques. Research and engineering on RDF
data management systems is a very active area with many standalone systems being introduced.
However, as the size of RDF data increases, such single-machine approaches meet
performance bottlenecks, in terms of both data loading and querying, due to the limited
parallelism inherent to symmetric multi-threaded systems and the limited available system
I/O and system memory. Although several approaches for distributed RDF data processing
have been proposed, along with clustered versions of more traditional approaches, their
techniques are limited by the trade-off they exploit between loading complexity and query
efficiency in the presence of big RDF data. This thesis then introduces a scalable analysis
framework for processing large-scale RDF data, which focuses on various techniques to
reduce inter-machine communication, computation and load-imbalancing so as to achieve
fast data loading and querying on distributed infrastructures.
The first part of this thesis focuses on the study of RDF store implementation and parallel
hashing on big data processing. (1) A system-level investigation of RDF store implementation
has been conducted on the basis of a comparative analysis of runtime characteristics
of a representative set of RDF stores. The detailed time cost and system consumption is
measured for data loading and querying so as to provide insight into different triple store
implementation as well as an understanding of performance differences between different
platforms. (2) A high-level structured parallel hashing approach over distributed memory is
proposed and theoretically analyzed. The detailed performance of hashing implementations
using different lock-free strategies has been characterized through extensive experiments,
thereby allowing system developers to make a more informed choice for the implementation
of their high-performance analytical data processing systems.
The second part of this thesis proposes three main techniques for fast processing of large
RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding
algorithm, to avoid unnecessary disk-space consumption and reduce computational complexity of query execution. The presented implementation has achieved notable speedups
compared to the state-of-the-art method and also has achieved excellent scalability. (2) Several
novel parallel join algorithms, to efficiently handle skew over large data during query processing.
The approaches have achieved good load balancing and have been demonstrated
to be faster than the state-of-the-art techniques in both theoretical and experimental comparisons.
(3) A two-tier dynamic indexing approach for processing SPARQL queries has been
devised which keeps loading times low and decreases or in some instances removes inter-machine
data movement for subsequent queries that contain the same graph patterns. The
results demonstrate that this design can load data at least an order of magnitude faster than
a clustered store operating in RAM while remaining within an interactive range for query
processing and even outperforms current systems for various queries.
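Dictionary encoding, named in the abstract above, replaces each RDF term with a compact integer ID. The single-threaded sketch below only illustrates that idea; it is not the thesis's parallel algorithm, and the function name and example triples are assumptions made for the example.

```python
def encode_triples(triples):
    """Map each distinct RDF term to an integer ID and rewrite the triples as ID tuples."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            # Assign the next free ID the first time a term is seen.
            if term not in term_to_id:
                term_to_id[term] = len(term_to_id)
            ids.append(term_to_id[term])
        encoded.append(tuple(ids))
    return term_to_id, encoded


# Hypothetical triples written with prefixed names.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:alice"),
]
dictionary, encoded = encode_triples(triples)
```

Storing and joining fixed-width integers instead of long IRI strings is what saves space and comparison time; a distributed version additionally has to keep ID assignment consistent across machines, which this sketch does not attempt.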
Finding Structured and Unstructured Features to Improve the Search Result of Complex Question
Recently, search engines have been challenged to deal with natural language questions.
Sometimes these questions are complex. A complex question is a question that
consists of several clauses, expresses several intentions, or needs a long answer.
In this work we propose that finding structured and unstructured features of
questions, and using both structured and unstructured data, can improve the search results
for complex questions. Accordingly, we use two approaches: an IR approach and
structured retrieval with a QA template.
Our framework consists of three parts: Question Analysis, Resource Discovery, and
Analysis of the Relevant Answer. In Question Analysis we made a few assumptions and
tried to find structured and unstructured features of the questions. A structured feature
refers to structured data and an unstructured feature refers to unstructured data. In
Resource Discovery we integrated structured data (a relational database) and unstructured
data (web pages) to take advantage of both kinds of data and reach the
relevant answer. We find the top fragments from the context of the web page. In the
Relevant Answer part, we performed score matching between the results from structured data
and unstructured data, and finally used the QA template to reformulate the question.
The experimental results show that using structured and unstructured
features, both structured and unstructured data, and the IR and QA template
approaches can improve the search results for complex questions.
A unified view of data-intensive flows in business intelligence systems : a survey
Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still are to be addressed, and how the current solutions can be applied for addressing these challenges.
Peer Reviewed. Postprint (author's final draft)