A Supervised Term-Weighting Method and its Application to Variable Extraction from Digital Media
Successful modeling and prediction depend on effective methods for extracting domain-relevant variables. This paper proposes a methodology for identifying domain-specific terms. The proposed methodology relies on a collection of documents labeled as relevant or irrelevant to the domain under analysis. Based on the labeled document collection, we propose a supervised technique that weights terms according to their descriptive and discriminating power. Finally, the descriptive and discriminating values are combined into a general measure that, through an adjustable parameter, allows one to independently favor different aspects of retrieval, such as maximizing precision, maximizing recall, or achieving a balance between the two. The proposed technique is applied to the economic domain and is empirically evaluated through a human-subject experiment involving experts and non-experts in Economics. It is also evaluated as a term-weighting technique for query-term selection, showing promising results. We finally illustrate the potential of the proposal as a first step toward identifying different types of associations between words.
Maisonnave, Mariano; Delbianco, Fernando Andrés; Tohmé, Fernando Abel; Maguitman, Ana Gabriela. Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Centro Científico Tecnológico CONICET - Bahía Blanca; Universidad Nacional del Sur, Departamentos de Ciencias e Ingeniería de la Computación, Matemática y Economía; Argentina.
XIX Simposio Argentino de Inteligencia Artificial, Buenos Aires, Argentina. Universidad de Palermo, Facultad de Ingeniería; Asociación Argentina de Inteligencia Artificial; Sociedad Argentina de Informática.
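The abstract above describes combining a term's descriptive score and its discriminating score into one measure governed by an adjustable precision/recall parameter. A minimal sketch of that idea follows, with an F-measure-style combination and toy scoring formulas standing in for the paper's actual definitions; the function name, both scores, and the `beta` semantics are assumptions, not the authors' method:

```python
from collections import Counter

def term_weights(relevant_docs, irrelevant_docs, beta=1.0):
    """Toy supervised term weighting over tokenized documents.

    Each term gets a descriptive score (fraction of relevant documents
    containing it) and a discriminating score (how cleanly it separates
    relevant from irrelevant documents), combined F-measure style.
    """
    rel_counts = Counter(t for d in relevant_docs for t in set(d))
    irr_counts = Counter(t for d in irrelevant_docs for t in set(d))
    n_rel = len(relevant_docs)
    weights = {}
    for term, rc in rel_counts.items():
        descriptive = rc / n_rel                 # coverage of relevant docs
        ic = irr_counts.get(term, 0)
        discriminating = rc / (rc + ic)          # purity w.r.t. relevant docs
        # beta > 1 favors descriptive power (recall-like);
        # beta < 1 favors discriminating power (precision-like).
        weights[term] = ((1 + beta ** 2) * descriptive * discriminating
                         / (beta ** 2 * descriptive + discriminating + 1e-12))
    return weights

rel = [["inflation", "rates", "economy"], ["inflation", "gdp"]]
irr = [["football", "rates"], ["music"]]
w = term_weights(rel, irr)
```

With these toy documents, "inflation" (frequent in relevant documents, absent from irrelevant ones) outranks "rates", which appears on both sides.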
Explainable Text Classification in Legal Document Review: A Case Study of Explainable Predictive Coding
In today's legal environment, lawsuits and regulatory investigations require
companies to embark upon increasingly intensive data-focused engagements to
identify, collect and analyze large quantities of data. When documents are
staged for review, the process can require companies to dedicate an
extraordinary level of resources, both human and technological, to
intelligently sift through data. For several years, attorneys have been
using a variety of
tools to conduct this exercise, and most recently, they are accepting the use
of machine learning techniques like text classification to efficiently cull
massive volumes of data to identify responsive documents for use in these
matters. In recent years, a group of AI and Machine Learning researchers have
been actively researching Explainable AI. In an explainable AI system, actions
or decisions are understandable by humans. In typical legal `document review'
scenarios, a document is identified as responsive as long as one or more
of its text snippets are deemed responsive. In these scenarios,
if predictive coding can be used to locate these responsive snippets, then
attorneys could easily evaluate the model's document classification decision.
When deployed with defined and explainable results, predictive coding can
drastically enhance the overall quality and speed of the document review
process by reducing the time it takes to review documents. The authors of this
paper propose the concept of explainable predictive coding and simple
explainable predictive coding methods to locate responsive snippets within
responsive documents. We also report our preliminary experimental results using
the data from an actual legal matter that entailed this type of document
review.
Comment: 2018 IEEE International Conference on Big Data
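The core idea of the entry above, snippet-level explanations for a document-level responsiveness decision, can be sketched as follows. The keyword list and the per-snippet scoring are hypothetical stand-ins for a trained text classifier; only the overall shape (score snippets, decide at the document level, return the responsive snippets as evidence) reflects the concept described:

```python
# Hypothetical responsive-term lexicon; a real system would use a trained
# text classifier's per-snippet scores instead of keyword counts.
RESPONSIVE_TERMS = {"merger", "acquisition", "confidential", "settlement"}

def explainable_predictive_coding(document, threshold=1):
    """Split a document into sentence-like snippets, score each one,
    and return the document-level decision together with the responsive
    snippets that explain it."""
    snippets = [s.strip() for s in document.split(".") if s.strip()]
    hits = []
    for snippet in snippets:
        score = sum(1 for t in snippet.lower().split() if t in RESPONSIVE_TERMS)
        if score >= threshold:
            hits.append(snippet)
    return {"responsive": bool(hits), "evidence": hits}

doc = "The weather was fine. The settlement terms are confidential. Lunch at noon."
result = explainable_predictive_coding(doc)
```

An attorney reviewing `result["evidence"]` sees exactly which snippet triggered the responsive classification, rather than an opaque document-level label.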
Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval
This paper presents a new state-of-the-art for document image classification
and retrieval, using features learned by deep convolutional neural networks
(CNNs). In object and scene analysis, deep neural nets are capable of learning
a hierarchical chain of abstraction from pixel inputs to concise and
descriptive representations. The current work explores this capacity in the
realm of document analysis, and confirms that this representation strategy is
superior to a variety of popular hand-crafted alternatives. Experiments also
show that (i) features extracted from CNNs are robust to compression, (ii) CNNs
trained on non-document images transfer well to document analysis tasks, and
(iii) enforcing region-specific feature-learning is unnecessary given
sufficient training data. This work also makes available a new labelled subset
of the IIT-CDIP collection, containing 400,000 document images across 16
categories, useful for training new CNNs for document analysis.
A linguistically-driven methodology for detecting impending and unfolding emergencies from social media messages
Natural disasters have demonstrated the crucial role of social media before, during and after emergencies
(Haddow & Haddow 2013). Within our EU project Slándáil, we aim to ethically improve
the use of social media in enhancing the response of disaster-related agencies. To this end, we
have collected corpora of social and formal media to study newsroom communication of emergency
management organisations in English and Italian. Currently, emergency management agencies
in English-speaking countries use social media to differing extents and degrees,
whereas the Italian national Protezione Civile currently uses only Twitter. Our method is developed
with a view to identifying communicative strategies and detecting sentiment in order to
distinguish warnings from actual disasters and major from minor disasters. Our linguistic analysis
uses human raters to classify messages as alerts/warnings or as emergency response and mitigation messages, based
on the terminology used and the sentiment expressed. Results of linguistic analysis are then used
to train an application by tagging messages and detecting disaster- and/or emergency-related terminology
and emotive language to simulate human rating and forward information to an emergency
management system.
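The tagging step described above, separating warnings about impending emergencies from reports of unfolding disasters by terminology, can be sketched with a toy lexicon-based tagger. Both word lists are illustrative assumptions; the project derives its classification from human-rated messages, not from fixed lexicons:

```python
# Hypothetical lexicons (illustrative only).
WARNING_TERMS = {"alert", "warning", "evacuate", "approaching", "expected"}
DISASTER_TERMS = {"collapsed", "flooded", "trapped", "destroyed", "casualties"}

def tag_message(text):
    """Tag a social-media message as an impending-emergency warning or an
    unfolding-disaster report, by which terminology dominates."""
    tokens = set(text.lower().replace(",", " ").split())
    w = len(tokens & WARNING_TERMS)
    d = len(tokens & DISASTER_TERMS)
    if d > w:
        return "unfolding-disaster"
    if w > d:
        return "warning"
    return "unclassified"
```

A trained application would additionally weigh emotive language, as the entry notes, rather than rely on terminology alone.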
Literature classification for semi-automated updating of biological knowledgebases
BACKGROUND: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in the primary scientific literature. RESULTS: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded a classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. CONCLUSION: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.
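The classification step above, a k-NN classifier labeling abstracts as relevant or irrelevant, can be sketched with a self-contained cosine-similarity k-NN over bag-of-words vectors. The toy term-frequency features, the training abstracts, and k=3 are assumptions for illustration; the paper's actual feature representation and evaluation protocol are not reproduced here:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, query, k=3):
    """train: list of (text, label) pairs. Returns the majority label
    among the k training texts most similar to the query."""
    qv = tf_vector(query)
    ranked = sorted(train, key=lambda tl: cosine(qv, tf_vector(tl[0])), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

train = [
    ("tumor antigen epitope recognized by t cells", "relevant"),
    ("novel t cell epitope from tumor antigen", "relevant"),
    ("weather forecast model for rainfall", "irrelevant"),
    ("stock market price prediction", "irrelevant"),
]
pred = knn_classify(train, "characterization of a tumor epitope", k=3)
```

In practice the relevant/irrelevant vote from the k nearest abstracts would feed the semi-automated curation queue rather than update the knowledgebase directly.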
Learning a Deep Listwise Context Model for Ranking Refinement
Learning to rank has been intensively studied and widely applied in
information retrieval. Typically, a global ranking function is learned from a
set of labeled data, which can achieve good performance on average but may be
suboptimal for individual queries by ignoring the fact that relevant documents
for different queries may have different distributions in the feature space.
Inspired by the idea of pseudo relevance feedback where top ranked documents,
which we refer to as the \textit{local ranking context}, can provide important
information about the query's characteristics, we propose to use the inherent
feature distributions of the top results to learn a Deep Listwise Context Model
that helps us fine-tune the initial ranked list. Specifically, we employ a
recurrent neural network to sequentially encode the top results using their
feature vectors, learn a local context model and use it to re-rank the top
results. There are three merits with our model: (1) Our model can capture the
local ranking context based on the complex interactions between top results
using a deep neural network; (2) Our model can be built upon existing
learning-to-rank methods by directly using their extracted feature vectors; (3)
Our model is trained with an attention-based loss function, which is more
effective and efficient than many existing listwise methods. Experimental
results show that the proposed model can significantly improve the
state-of-the-art learning-to-rank methods on benchmark retrieval corpora.
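The local-ranking-context idea above can be sketched in miniature: summarize the feature vectors of the top-ranked results and boost documents whose features align with that summary. Mean-pooling here is a deliberate simplification standing in for the paper's recurrent encoder and attention-based loss; the function name and `tau` mixing weight are assumptions:

```python
def rerank_with_local_context(top_results, tau=1.0):
    """top_results: list of (doc_id, feature_vector, initial_score).

    Summarize the top results' feature vectors by their mean (a stand-in
    for the paper's RNN encoding of the local ranking context) and re-rank
    by the initial score plus each document's alignment with that summary.
    """
    k = len(top_results)
    dim = len(top_results[0][1])
    centroid = [sum(fv[d] for _, fv, _ in top_results) / k for d in range(dim)]

    def local_score(fv):
        # Dot product with the local-context summary vector.
        return sum(a * b for a, b in zip(fv, centroid))

    rescored = [(doc, s0 + tau * local_score(fv)) for doc, fv, s0 in top_results]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

out = rerank_with_local_context([
    ("a", [1.0, 0.0], 0.90),
    ("b", [1.0, 0.1], 0.80),
    ("c", [0.0, 1.0], 0.85),
])
```

Document "c" starts ahead of "b" but is demoted because its features disagree with the dominant pattern among the top results, which is exactly the query-specific adjustment a single global ranking function cannot make.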
Applying Rule Ensembles to the Search for Super-Symmetry at the Large Hadron Collider
In this note we give an example application of a recently presented
predictive learning method called Rule Ensembles. The application we present is
the search for super-symmetric particles at the Large Hadron Collider. In
particular, we consider the problem of separating the background coming from
top quark production from the signal of super-symmetric particles. The method
is based on an expansion of base learners, each learner being a rule, i.e. a
combination of cuts in the variable space describing signal and background.
These rules are generated from an ensemble of decision trees. One of the
results of the method is a set of rules (cuts) ordered according to their
importance, which gives useful tools for diagnosis of the model. We also
compare the method to a number of other multivariate methods, in particular
Artificial Neural Networks, the likelihood method and the recently presented
boosted decision tree method. We find better performance of Rule Ensembles in
all cases. For example, for a given significance, the amount of data needed to
claim SUSY discovery could be reduced by 15% using Rule Ensembles compared
to using a likelihood method.
Comment: 24 pages, 7 figures, replaced to match version accepted for publication in JHEP
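The structure described above, rules as conjunctions of cuts in the variable space, combined into a weighted ensemble, can be sketched as follows. The variable names (`missing_et`, `n_jets`, `n_bjets`), cut values, and weights are all invented for illustration; in the real method the rules come from an ensemble of decision trees and the weights are fitted, not hand-set:

```python
# Each rule is a conjunction of cuts on event variables; the ensemble
# score is a weighted sum over the rules an event satisfies. Positive
# weights mark signal-like regions, negative weights background-like ones.
rules = [
    (lambda ev: ev["missing_et"] > 100 and ev["n_jets"] >= 4, +0.8),  # signal-like
    (lambda ev: ev["missing_et"] < 40, -0.6),                         # background-like
    (lambda ev: ev["n_bjets"] >= 2 and ev["missing_et"] < 60, -0.4),  # top-quark-like
]

def rule_ensemble_score(event):
    """Sum the weights of every rule the event fires."""
    return sum(weight for rule, weight in rules if rule(event))

susy_like = {"missing_et": 150, "n_jets": 5, "n_bjets": 1}
ttbar_like = {"missing_et": 30, "n_jets": 4, "n_bjets": 2}
```

Because each rule is a readable set of cuts, ordering the fitted weights by importance yields the diagnostic tool the note describes: physicists can inspect which cut combinations drive the signal/background separation.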
History of art paintings through the lens of entropy and complexity
Art is the ultimate expression of human creativity that is deeply influenced
by the philosophy and culture of the corresponding historical epoch. The
quantitative analysis of art is therefore essential for better understanding
human cultural evolution. Here we present a large-scale quantitative analysis
of almost 140 thousand paintings, spanning nearly a millennium of art history.
Based on the local spatial patterns in the images of these paintings, we
estimate the permutation entropy and the statistical complexity of each
painting. These measures map the degree of visual order of artworks into a
scale of order-disorder and simplicity-complexity that locally reflects
qualitative categories proposed by art historians. The dynamical behavior of
these measures reveals a clear temporal evolution of art, marked by transitions
that agree with the main historical periods of art. Our research shows that
different artistic styles have a distinct average degree of entropy and
complexity, thus allowing a hierarchical organization and clustering of styles
according to these metrics. We have further verified that the identified groups
correspond well with the textual content used to qualitatively describe the
styles, and that the employed complexity-entropy measures can be used for an
effective classification of artworks.
Comment: 10 two-column pages, 5 figures; accepted for publication in PNAS
[supplementary information available at
http://www.pnas.org/highwire/filestream/824089/field_highwire_adjunct_files/0/pnas.1800083115.sapp.pdf]
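The permutation entropy used above can be illustrated in one dimension: slide a window over a series, record the rank-order pattern of each window, and take the normalized Shannon entropy of the pattern distribution. The paper applies the 2D analogue over local pixel patches of each painting; this 1D sketch shows only the underlying ordinal-pattern idea:

```python
import math

def permutation_entropy(series, order=3):
    """Normalized 1D permutation entropy of a numeric series.

    Each length-`order` window is mapped to its ordinal pattern (the
    permutation that sorts it); the entropy of the pattern histogram is
    normalized by log(order!) so the result lies in [0, 1].
    """
    counts = {}
    for i in range(len(series) - order + 1):
        window = series[i:i + order]
        pattern = tuple(sorted(range(order), key=lambda j: window[j]))
        counts[pattern] = counts.get(pattern, 0) + 1
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(math.factorial(order))
```

A perfectly ordered series exhibits a single ordinal pattern and entropy 0; an irregular series spreads probability over many patterns and approaches 1, which is the order-disorder scale the paper maps artworks onto.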