Predicting the Law Area and Decisions of French Supreme Court Cases
In this paper, we investigate the application of text classification methods
to predict the law area and the decision of cases judged by the French Supreme
Court. We also investigate the influence of the time period in which a ruling
was made over the textual form of the case description and the extent to which
it is necessary to mask the judge's motivation for a ruling to emulate a
real-world test scenario. We report results of 96% F1 score in predicting a
case ruling, 90% F1 score in predicting the law area of a case, and 75.9% F1
score in estimating the time span when a ruling has been issued using a linear
Support Vector Machine (SVM) classifier trained on lexical features.
Comment: RANLP 201
Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes
PURPOSE: The medical literature relevant to germline genetics is growing
exponentially. Clinicians need tools for monitoring and prioritizing the literature
to understand the clinical implications of the pathogenic genetic variants. We
developed and evaluated two machine learning models to classify abstracts as
relevant to the penetrance (risk of cancer for germline mutation carriers) or
prevalence of germline genetic mutations. METHODS: We conducted literature
searches in PubMed and retrieved paper titles and abstracts to create an
annotated dataset for training and evaluating the two machine learning
classification models. Our first model is a support vector machine (SVM) which
learns a linear decision rule based on the bag-of-ngrams representation of each
title and abstract. Our second model is a convolutional neural network (CNN)
which learns a complex nonlinear decision rule based on the raw title and
abstract. We evaluated the performance of the two models on the classification
of papers as relevant to penetrance or prevalence. RESULTS: For penetrance
classification, we annotated 3740 paper titles and abstracts and used 60% for
training the model, 20% for tuning the model, and 20% for evaluating the model.
The SVM model achieves 89.53% accuracy (percentage of papers that were
correctly classified) while the CNN model achieves 88.95% accuracy. For
prevalence classification, we annotated 3753 paper titles and abstracts. The
SVM model achieves 89.14% accuracy while the CNN model achieves 89.13%
accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts
as relevant to penetrance or prevalence. By facilitating literature review,
this tool could help clinicians and researchers keep abreast of the burgeoning
knowledge of gene-cancer associations and keep the knowledge bases for clinical
decision support tools up to date.
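The bag-of-ngrams representation and linear decision rule mentioned in this abstract can be illustrated with a minimal sketch. The whitespace tokenizer, the choice of unigrams plus bigrams, and the weights below are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def bag_of_ngrams(text, max_n=2):
    """Count word n-grams of length 1..max_n (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def linear_score(features, weights, bias=0.0):
    """Linear decision rule: the sign of w.x + b selects the class."""
    return sum(weights.get(f, 0.0) * c for f, c in features.items()) + bias


# Hypothetical title and hand-picked weights, for illustration only.
features = bag_of_ngrams("Cancer risk in BRCA1 mutation carriers")
weights = {"mutation carriers": 1.5, "in": -0.5}
score = linear_score(features, weights)
is_relevant = score > 0
```

In a trained SVM the weights would be learned from the annotated titles and abstracts rather than set by hand.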
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research.
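To make the bag formulation concrete, the sketch below applies one common MIL assumption: a bag is positive if at least one of its instances is positive. The abstract does not state which assumption a given method uses, so this rule, the threshold, and the instance scores are illustrative assumptions only.

```python
def bag_score(instance_scores):
    """Score a bag by its highest-scoring instance (assumed max-pooling rule)."""
    return max(instance_scores)


def predict_bag(instance_scores, threshold=0.5):
    """Label a bag positive when at least one instance reaches the threshold."""
    return bag_score(instance_scores) >= threshold


# Hypothetical instance-level scores for two bags, e.g. passages in two documents.
bags = {
    "doc_a": [0.1, 0.2, 0.9],  # one strongly positive instance
    "doc_b": [0.1, 0.3, 0.2],  # no instance reaches the threshold
}
predictions = {name: predict_bag(scores) for name, scores in bags.items()}
```

Only the bag-level labels would be available at training time; the instance scores here stand in for the output of some instance-level model.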
Learning to Predict Charges for Criminal Cases with Legal Basis
The charge prediction task is to determine appropriate charges for a given
case, which is helpful for legal assistant systems where the user input is fact
description. We argue that relevant law articles play an important role in this
task, and therefore propose an attention-based neural network method to jointly
model the charge prediction task and the relevant article extraction task in a
unified framework. The experimental results show that, besides providing legal
basis, the relevant articles can also clearly improve the charge prediction
results, and our full model can effectively predict appropriate charges for
cases with different expression styles.
Comment: 10 pages, accepted by EMNLP 201
Handwritten and Printed Text Separation in Real Document
The aim of the paper is to separate handwritten and printed text in real
documents that contain noise and graphics, including annotations. Relying on the
run-length smoothing algorithm (RLSA), the extracted pseudo-lines and
pseudo-words are used as basic blocks for classification. To handle this, a
multi-class support vector machine (SVM) with Gaussian kernel performs a first
labelling of each pseudo-word including the study of local neighbourhood. It
then propagates the context between neighbours so that we can correct possible
labelling errors. To limit running time, we propose linear-complexity
methods based on k-NN with constraint. When a kd-tree is used, the running time
is almost linearly proportional to the number of pseudo-words. The performance
of our system is close to 90%, even with a very small learning dataset whose
samples consist mainly of complex administrative documents.
Comment: Machine Vision Applications (2013
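As a rough illustration of the run-length smoothing step named in this abstract, the sketch below applies horizontal RLSA to a binary image: background gaps no longer than a threshold, bounded by foreground pixels on both sides, are filled so that nearby characters merge into blocks. The threshold value, the 0/1 pixel encoding, and the handling of leading and trailing gaps are assumptions, not details from the paper.

```python
def rlsa_row(row, threshold):
    """Fill background runs (0s) of length <= threshold that lie between two foreground pixels (1s)."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel
    for i, px in enumerate(row):
        if px == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply row-wise smoothing to a binary image given as a list of rows."""
    return [rlsa_row(row, threshold) for row in image]


# The short gap (2 pixels) is filled; the long gap (4 pixels) and the trailing 0 are kept.
smoothed = rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], threshold=2)
```

Connected regions of the smoothed image could then serve as candidate blocks such as the pseudo-words described in the abstract.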