14 research outputs found

    A preliminary approach to the multilabel classification problem of Portuguese juridical documents

    Portuguese juridical documents from Supreme Courts and the Attorney General’s Office are manually classified by juridical experts into a set of classes belonging to a taxonomy of concepts. This paper proposes a preliminary approach to classifying these juridical documents automatically. The basic strategy integrates natural language processing techniques with machine learning. Support Vector Machines (SVM) are used as the learning algorithm, and the results obtained are presented and compared with other approaches, such as C4.5 and Naive Bayes

    Analysing part-of-speech for Portuguese text classification

    This paper proposes and evaluates the use of linguistic information in the pre-processing phase of text classification. We present several experiments evaluating the selection of terms based on different measures and linguistic knowledge. To build the classifier we used Support Vector Machines (SVM), which are known to produce good results on text classification tasks. Our proposals were applied to two different datasets written in the Portuguese language: articles from a Brazilian newspaper (Folha de São Paulo) and juridical documents from the Portuguese Attorney General’s Office. The results show the relevance of part-of-speech information for the pre-processing phase of text classification, allowing for a strong reduction in the number of features needed for text classification
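
As a rough illustration of selecting terms by part of speech, the sketch below keeps only tokens whose tag is in a chosen set and counts them as features. The tag names, the kept set, and the function `select_terms` are assumptions made for this example; the abstract does not say which tags or measures the authors used.

```python
from collections import Counter

# Hypothetical tag set to keep; the paper's actual choice is not stated here.
KEPT_TAGS = {"NOUN", "VERB", "ADJ"}

def select_terms(tagged_tokens, kept_tags=KEPT_TAGS):
    """Count lemmas whose part-of-speech tag is in kept_tags."""
    return Counter(lemma for lemma, tag in tagged_tokens if tag in kept_tags)

# Illustrative pre-tagged document: (lemma, tag) pairs, assumed to come
# from some external tagger.
doc = [("o", "DET"), ("tribunal", "NOUN"), ("decidir", "VERB"),
       ("sobre", "ADP"), ("o", "DET"), ("recurso", "NOUN")]

features = select_terms(doc)
all_lemmas = {lemma for lemma, _ in doc}
print(len(all_lemmas), "distinct lemmas reduced to", len(features), "features")
```

Dropping function words this way shrinks the feature space before a classifier such as an SVM is trained on the remaining counts.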

    Using Linguistic Information and Machine Learning Techniques to Identify Entities from Juridical Documents

    Information extraction from legal documents is an important and open problem. A mixed approach, using linguistic information and machine learning techniques, is described in this paper. In this approach, top-level legal concepts are identified and used for document classification using Support Vector Machines. Named entities, such as locations, organizations, dates, and document references, are identified using semantic information from the output of a natural language parser. This information (legal concepts and named entities) may be used to populate a simple ontology, allowing the enrichment of documents and the creation of high-level legal information retrieval systems. The proposed methodology was applied to, and evaluated on, a corpus of legal documents from the EUR-Lex site. The results obtained were quite good and indicate that this may be a promising approach to the legal information extraction problem

    Is linguistic information relevant for the classification of legal texts?

    Text classification is an important task in the legal domain. In fact, most legal information is stored as text in a quite unstructured format, and it is important to be able to automatically classify these texts into a predefined set of concepts. Support Vector Machines (SVM), a machine learning algorithm, have been shown to be a good classifier for text bases [Joachims, 2002]. In this paper, SVMs are applied to the classification of European Portuguese legal texts (the Portuguese Attorney General’s Office Decisions), and the relevance of linguistic information in this domain, namely lemmatisation and part-of-speech tags, is evaluated. The results obtained show that some linguistic information (namely, lemmatisation and part-of-speech tags) can be successfully used to improve the classification results and, simultaneously, to decrease the number of features needed by the learning algorithm

    Matrix factorization for co-training algorithm to classify human rights abuses

    In the human rights domain, there is a need to filter, efficiently classify and prioritize the types of violation endured by victims in order to provide the necessary rehabilitation and support. However, the domain is dominated by unstructured data, whether from victims' accounts, doctors'/professionals' reports or material available online. Manual classification, which is extremely time-consuming, still prevails in this domain. This is a problem for non-government operated charities. To this end we have explored the application of the co-training algorithm in order to improve the performance of a semi-supervised learning algorithm by incorporating large amounts of unlabeled data into the training data set. However, it remains challenging to apply co-training to data without two independent and self-sufficient views. This paper puts forth a method of randomly dividing the available features and applying matrix factorization so as to discover latent features underlying the interactions between different kinds of entities present in a single-view dataset. These labeled views balance the biased information in the dataset, but still satisfy the co-training assumptions. In addition, the views are constrained such that pairs of labeled views create weak classifiers which in turn increase the prediction accuracy when combined. In the majority of cases, classification tries to connect a single class to each sample or object. However, in the human rights domain, a victim can be subjected to more than one type of violation or abuse. This is multi-label classification, where a sample can be assigned to more than one class. This paper aims to address all these aspects by bringing together a semi-supervised classification model that relies on the effectiveness of matrix collaborative filtering in order to classify stories narrated by victims into one or more types of human rights abuses. Experimental results demonstrate the efficiency of this approach when applied to real-world stories from different victims
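
The random division of features into two views can be sketched as follows. This covers only the splitting step; the matrix factorization and the co-training loop described in the abstract are not shown, and the names `split_views` and `project`, the even split, and the sample values are assumptions for illustration.

```python
import random

def split_views(n_features, seed=0):
    """Randomly partition feature indices 0..n_features-1 into two disjoint views."""
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    half = n_features // 2
    return sorted(indices[:half]), sorted(indices[half:])

def project(row, view):
    """Restrict one feature vector to the indices in a view."""
    return [row[i] for i in view]

view_a, view_b = split_views(6, seed=42)

# Illustrative feature vector for one document (values are made up).
sample = [0.1, 0.0, 2.0, 1.5, 0.0, 3.0]
sample_a = project(sample, view_a)
sample_b = project(sample, view_b)
```

Each view would then feed its own classifier, with the two classifiers labeling unlabeled examples for each other as in standard co-training.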

    A survey of genetic algorithms for multi-label classification

    In recent years, multi-label classification (MLC) has become an emerging research topic in big data analytics and machine learning. In this problem, each object of a dataset may belong to multiple class labels and the goal is to learn a classification model that can infer the correct labels of new, previously unseen, objects. This paper presents a survey of genetic algorithms (GAs) designed for MLC tasks. The study is organized in three parts. First, we propose a new taxonomy focused on GAs for MLC. In the second part, we provide an up-to-date overview of the work in this area, categorizing the approaches identified in the literature with respect to the taxonomy. In the third and last part, we discuss some new ideas for combining GAs with MLC
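
To make the multi-label setting concrete, the sketch below encodes each object's set of labels as a row of a binary indicator matrix, a common input representation for MLC methods. The function name `to_indicator` and the toy labels are assumptions for this example and are not taken from the survey.

```python
def to_indicator(label_sets, classes):
    """Encode each label set as a 0/1 row with one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]

# Toy data: three objects, one of them with no label at all.
label_sets = [{"a", "b"}, {"b"}, set()]
classes = sorted(set().union(*label_sets))

Y = to_indicator(label_sets, classes)
for row in Y:
    print(row)
```

Unlike single-label classification, a row here may contain several ones or none.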

    Common Sense Reasoning for Detection, Prevention, and Mitigation of Cyberbullying

    Cyberbullying (harassment on social networks) is widely recognized as a serious social problem, especially for adolescents. It is as much a threat to the viability of online social networks for youth today as spam once was to email in the early days of the Internet. Current work to tackle this problem has involved social and psychological studies on its prevalence as well as its negative effects on adolescents. While true solutions rest on teaching youth to have healthy personal relationships, few have considered innovative design of social network software as a tool for mitigating this problem. Mitigating cyberbullying involves two key components: robust techniques for effective detection and reflective user interfaces that encourage users to reflect upon their behavior and their choices. Spam filters have been successful by applying statistical approaches like Bayesian networks and hidden Markov models. They can, like Google’s Gmail, aggregate human spam judgments because spam is sent nearly identically to many people. Bullying is more personalized, varied, and contextual. In this work, we present an approach for bullying detection based on state-of-the-art natural language processing and a common sense knowledge base, which permits recognition over a broad spectrum of topics in everyday life. We analyze a narrower range of particular subject matter associated with bullying (e.g. appearance, intelligence, racial and ethnic slurs, social acceptance, and rejection), and construct BullySpace, a common sense knowledge base that encodes particular knowledge about bullying situations. We then perform joint reasoning with common sense knowledge about a wide range of everyday life topics. We analyze messages using our novel AnalogySpace common sense reasoning technique. We also take into account social network analysis and other factors. We evaluate the model on real-world instances that have been reported by users on Formspring, a social networking website that is popular with teenagers. On the intervention side, we explore a set of reflective user-interaction paradigms with the goal of promoting empathy among social network participants. We propose an “air traffic control”-like dashboard, which alerts moderators to large-scale outbreaks that appear to be escalating or spreading and helps them prioritize the current deluge of user complaints. For potential victims, we provide educational material that informs them about how to cope with the situation, and connects them with emotional support from others. A user evaluation shows that in-context, targeted, and dynamic help during cyberbullying situations fosters end-user reflection that promotes better coping strategies