4,331 research outputs found
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
A Data Set and a Convolutional Model for Iconography Classification in Paintings
Iconography in art is the discipline that studies the visual content of
artworks to determine their motifs and themes andto characterize the way these
are represented. It is a subject of active research for a variety of purposes,
including the interpretation of meaning, the investigation of the origin and
diffusion in time and space of representations, and the study of influences
across artists and art works. With the proliferation of digital archives of art
images, the possibility arises of applying Computer Vision techniques to the
analysis of art images at an unprecedented scale, which may support iconography
research and education. In this paper we introduce a novel paintings data set
for iconography classification and present the quantitativeand qualitative
results of applying a Convolutional Neural Network (CNN) classifier to the
recognition of the iconography of artworks. The proposed classifier achieves
good performances (71.17% Precision, 70.89% Recall, 70.25% F1-Score and 72.73%
Average Precision) in the task of identifying saints in Christian religious
paintings, a task made difficult by the presence of classes with very similar
visual features. Qualitative analysis of the results shows that the CNN focuses
on the traditional iconic motifs that characterize the representation of each
saint and exploits such hints to attain correct identification. The ultimate
goal of our work is to enable the automatic extraction, decomposition, and
comparison of iconography elements to support iconographic studies and
automatic art work annotation.Comment: Published at ACM Journal on Computing and Cultural Heritage (JOCCH)
https://doi.org/10.1145/345888
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
Information Extraction on Para-Relational Data.
Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly
relational data that shares the important qualities of relational data but does not
present itself in a relational format. Para-relational data often conveys highly valuable
information and is widely used in many different areas. If we can convert para-relational
data into the relational format, many existing tools can be leveraged for a
variety of interesting applications, such as data analysis with relational query systems
and data integration applications.
This dissertation aims to convert para-relational data into a high-quality relational
form with little user assistance. We have developed four standalone systems, each
addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet
database management system that extracts relational information from a large
number of spreadsheets. Anthias is an extension of the Senbazuru system to convert
a broader range of spreadsheets into a relational format. Lyretail is an extraction
system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is
a web-based search system that obtains a large number of diagrams automatically
extracted from web-crawled PDFs. Together, these four systems demonstrate that
converting para-relational data into the relational format is possible today, and also
suggest directions for future systems.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd
Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples
Machine Learning has been a big success story during the AI resurgence. One
particular stand out success relates to learning from a massive amount of data.
In spite of early assertions of the unreasonable effectiveness of data, there
is increasing recognition for utilizing knowledge whenever it is available or
can be created purposefully. In this paper, we discuss the indispensable role
of knowledge for deeper understanding of content where (i) large amounts of
training data are unavailable, (ii) the objects to be recognized are complex,
(e.g., implicit entities and highly subjective content), and (iii) applications
need to use complementary or related data in multiple modalities/media. What
brings us to the cusp of rapid progress is our ability to (a) create relevant
and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP
techniques. Using diverse examples, we seek to foretell unprecedented progress
in our ability for deeper understanding and exploitation of multimodal data and
continued incorporation of knowledge in learning techniques.Comment: Pre-print of the paper accepted at 2017 IEEE/WIC/ACM International
Conference on Web Intelligence (WI). arXiv admin note: substantial text
overlap with arXiv:1610.0770
- âŠ