
    Technology Assisted Review of Legal Documents

    A legal prediction-based approach helps judges and solicitors make judicial decisions on cases currently before the courts, and make predictions on new cases on the basis of existing references and judgments. The model also helps law students learn about legal references. The application was developed specifically for the Supreme Court of Pakistan (SCP) and the Pakistan Bar Council (PBC) to expedite their judgments and to provide legal guidance to lawyers based on historical data and constitutions.

    DEXTER: A workbench for automatic term extraction with specialized corpora

    Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard unlikely term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distributions of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms. Financial support for this research was provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P. Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering, 24(2), 163-198. https://doi.org/10.1017/S1351324917000365
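    DEXTER's actual salience/relevance/cohesion metric is not spelled out in the abstract, so the following is only a rough Python illustration of corpus-comparison termhood ranking, using a "weirdness"-style ratio of normalised frequencies; the ratio and the smoothing are assumptions made for illustration, not the paper's formula.

        from collections import Counter

        def weirdness_scores(domain_tokens, general_tokens):
            """Rank term candidates by how much more frequent they are in the
            domain corpus than in a general reference corpus. This is a
            stand-in illustration, not DEXTER's actual ranking metric."""
            dom = Counter(domain_tokens)
            gen = Counter(general_tokens)
            n_dom, n_gen = len(domain_tokens), len(general_tokens)
            scores = {}
            for word, f_dom in dom.items():
                # Add-one smoothing so words absent from the general corpus
                # do not cause division by zero.
                f_gen = gen.get(word, 0) + 1
                scores[word] = (f_dom / n_dom) / (f_gen / n_gen)
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    Words that dominate the domain corpus but are rare in the reference corpus rise to the top of the ranking, while common stopwords sink, which mirrors the role the general corpus plays in the workbench's stopword detection.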

    Towards improving WEBSOM with multi-word expressions

    Dissertation for the degree of Master in Computer Engineering. Large collections of free-text documents are usually rich in information and cover several topics. However, because such collections are very large, searching and filtering their data is an exhaustive task. A large text collection covers a set of topics, where each topic is associated with a group of documents. This thesis presents a method for building a document map of the core contents covered in the collection. WEBSOM is an approach that combines document encoding methods and Self-Organising Maps (SOM) to generate a document map. However, this methodology has a weakness in its document encoding method because it uses single words to characterise documents. Single words tend to be ambiguous and semantically vague, so some documents can be incorrectly related. This thesis proposes a new document encoding method that improves the WEBSOM approach by using multi-word expressions (MWEs) to describe documents. Previous research and ongoing experiments encourage the use of MWEs to characterise documents because they are semantically more accurate and more descriptive than single words.
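    The proposed change of encoding can be pictured as replacing a bag-of-words vector with a bag-of-MWEs vector. A minimal Python sketch follows, assuming an already-extracted MWE inventory; the thesis's actual encoding and the SOM training step are not shown, and the matching here is deliberately naive.

        def encode_document(text, mwe_vocab):
            """Count occurrences of each multi-word expression in a document,
            yielding the vector that a SOM would later be trained on.
            `mwe_vocab` is an assumed, pre-extracted list of MWE strings."""
            joined = " ".join(text.lower().split())
            # Simple substring counting; a real system would tokenize and
            # match MWEs with proper boundaries.
            return [joined.count(mwe) for mwe in mwe_vocab]

        vocab = ["document map", "multi word expression", "self organising map"]
        vec = encode_document("A document map built from multi word expression counts.", vocab)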

    The Distributional Learning of Multi-Word Expressions: A Computational Approach

    There has been much recent research in corpus and computational linguistics on distributional learning algorithms: computer code that induces latent linguistic structures in corpus data based on co-occurrences of transcribed units in that data. These algorithms have varied applications, from the investigation of human cognitive processes to the corpus extraction of relevant linguistic structures for lexicographic, second language learning, or natural language processing applications, among others. They also operate at various levels of linguistic structure, from phonetics to syntax. One area of research on distributional learning algorithms in which there remains relatively little work is the learning of multi-word, memorized, formulaic sequences based on the co-occurrences of words. Examples of such multi-word expressions (MWEs) include kick the bucket, New York City, sit down, and as a matter of fact. In this dissertation, I present a novel computational approach to the distributional learning of such sequences in corpora. Entitled MERGE (Multi-word Expressions from the Recursive Grouping of Elements), my algorithm works iteratively by (1) assigning a statistical 'attraction' score to each two-word sequence (bigram) in a corpus, based on the individual and co-occurrence frequencies of those two words in that corpus; and (2) merging the highest-scoring bigram into a single, lexicalized unit. These two steps repeat until some maximum number of iterations or minimum score threshold is reached (since, broadly speaking, the winning score progressively decreases over successive iterations). Because one or both of the 'words' making up a winning bigram may be a merged item output by a previous iteration, the algorithm can learn MWEs that are in principle of any length (e.g., apple pie versus I'll believe it when I see it). Moreover, these MWEs may contain one or more discontinuities of different sizes, up to some maximum size threshold (measured in words) specified by the user (e.g., as _ as in as tall as and as big as). Typically, the extraction of MWEs has been handled by algorithms that identify only continuous sequences and that require the user to specify the length(s) of the sequences to be extracted beforehand; MERGE offers a bottom-up, distribution-based approach that addresses both issues.
    In the present dissertation, in addition to describing the algorithm, I report three rating experiments and one corpus-based early child language study that validate the efficacy of MERGE in identifying MWEs. In one experiment, participants rate sequences extracted from a corpus by the algorithm for how well they instantiate true MWEs. As expected, the results reveal that the high-scoring output items that MERGE identifies early in its iterative process are rated as 'good' MWEs by participants (based on certain subjective criteria), with the quality of these ratings decreasing for output from later iterations (i.e., output items that were scored lower by the algorithm). In the other two experiments, participants rate high-ranking output both from MERGE and from an existing algorithm from the literature that also learns MWEs of various lengths, the Adjusted Frequency List (Brook O'Donnell 2011). Comparison of participant ratings reveals that the items that MERGE acquires are rated more highly than those acquired by the Adjusted Frequency List, suggesting that MERGE is a performance frontrunner among distributional learning algorithms for MWEs.
    More broadly, together the experiments suggest that MERGE acquires representations that are compatible with adult knowledge of formulaic language, and thus may be useful for any number of research applications that rely on formulaic language as a unit of analysis. Finally, in a study using two corpora of caregiver-child interactions, I run MERGE on caregiver utterances and show that, of the MWEs induced by the algorithm, those that are later acquired by the children receive higher scores from the algorithm than those that are not learned. These results suggest that, when applied to acquisition data, the algorithm is useful for identifying the structures of statistical co-occurrences in the caregiver input that are relevant to children in their acquisition of early multi-word knowledge. Overall, MERGE is shown to be a powerful computational approach to the distributional learning and extraction of MWEs, both when modeling adult knowledge of formulaic language and when accounting for the early multi-word structures acquired by children.
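    The core loop lends itself to a compact illustration. Below is a minimal Python sketch, assuming a PMI-style attraction score and contiguous merging only; the dissertation's actual scoring function and its handling of discontinuities are not reproduced here.

        from collections import Counter
        from math import log

        def merge_mwes(tokens, iterations=10):
            """Sketch of a MERGE-style loop: repeatedly score bigrams and
            lexicalize the highest-scoring one."""
            tokens = list(tokens)
            winners = []
            for _ in range(iterations):
                unigrams = Counter(tokens)
                bigrams = Counter(zip(tokens, tokens[1:]))
                if not bigrams:
                    break
                n = len(tokens)
                # Attraction: how much more often the pair co-occurs than its
                # unigram frequencies alone would predict (PMI-like stand-in).
                def attraction(item):
                    (w1, w2), f12 = item
                    return log(f12 * n / (unigrams[w1] * unigrams[w2]))
                (w1, w2), _ = max(bigrams.items(), key=attraction)
                winners.append((w1, w2))
                # Lexicalize the winner: rewrite the token stream so the pair
                # becomes one unit, eligible for merging again in later passes.
                rewritten, i = [], 0
                while i < len(tokens):
                    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (w1, w2):
                        rewritten.append(w1 + "_" + w2)
                        i += 2
                    else:
                        rewritten.append(tokens[i])
                        i += 1
                tokens = rewritten
            return winners

    Because a winner from one pass becomes a single token in the next, a two-word unit can later merge with a neighbour, which is how expressions of arbitrary length emerge from purely pairwise merges.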

    Corpus design for expressive speech: impact of the utterance length

    The voice corpus plays a crucial role in the quality of synthetic speech generation, especially under a length constraint. Creating a new voice is costly, and recording script selection for an expressive TTS task is generally treated as an optimization problem whose goal is a rich yet parsimonious corpus. In order to vocalize a given book using a TTS system, we investigate four script selection approaches. Based on preliminary observations, we simply propose to select the shortest utterances of the book, and we compare this method with state-of-the-art ones on two books with different utterance lengths and styles, using two kinds of concatenation-based TTS systems. The study of the TTS costs indicates that selecting the shortest utterances can result in better synthetic quality, which is confirmed by a perceptual test. Examining the criteria usually applied to corpus design in the literature, such as unit coverage or distribution similarity of units, we find that they are not pertinent metrics in the framework of this study.
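    The proposed baseline is simple enough to state directly. A minimal Python sketch of shortest-utterance selection under a length budget follows; measuring the budget in characters is an assumption made for illustration, not the paper's actual recording constraint or cost function.

        def select_script(utterances, budget_chars):
            """Pick the shortest utterances until the length budget is spent."""
            selected, used = [], 0
            for utt in sorted(utterances, key=len):
                if used + len(utt) > budget_chars:
                    break
                selected.append(utt)
                used += len(utt)
            return selected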

    Early stopping by correlating online indicators in neural networks

    In order to minimize the generalization error in neural networks, a novel technique to identify overfitting phenomena during training is formally introduced. This enables support of a reliable and trustworthy early stopping condition, thus improving the predictive power of this type of modeling. Our proposal exploits the correlation over time in a collection of online indicators, namely characteristic functions indicating whether a set of hypotheses is met, associated with a range of independent stopping conditions built from a canary judgment to evaluate the presence of overfitting. In that way, we provide a formal basis for the decision to interrupt the learning process. As opposed to previous approaches focused on a single criterion, we take advantage of subsidiarities between independent assessments, thus seeking both a wider operating range and greater diagnostic reliability. To illustrate the effectiveness of the halting condition described, we work in the sphere of natural language processing, an operational continuum increasingly based on machine learning. As a case study, we focus on parser generation, one of the most demanding and complex tasks in the domain. The selection of cross-validation as the canary function enables a direct comparison with the most representative early stopping conditions based on overfitting identification, pointing to a promising start toward optimal bias and variance control. Funded for open access publication by Universidade de Vigo/CISUG. Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-R. Agencia Estatal de Investigación | Ref. PID2020-113230RB-C22. Xunta de Galicia | Ref. ED431C 2018/5.
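    To picture the idea of pooling independent online indicators instead of trusting a single criterion, here is a minimal Python sketch; the three indicators below (generalization loss, an "up" strip, and a patience check) follow common early-stopping practice and are assumptions for illustration, not the paper's characteristic functions or its correlation analysis.

        def generalization_loss(val_losses):
            """Relative increase (%) of the current validation loss over the best seen."""
            return 100.0 * (val_losses[-1] / min(val_losses) - 1.0)

        def consecutive_increases(val_losses):
            """Length of the strictly increasing run at the tail of the loss curve."""
            k, i = 0, len(val_losses) - 1
            while i > 0 and val_losses[i] > val_losses[i - 1]:
                k += 1
                i -= 1
            return k

        def stalled(val_losses, patience=5):
            """True if the best loss has not improved in the last `patience` epochs."""
            if len(val_losses) <= patience:
                return False
            return min(val_losses[-patience:]) >= min(val_losses[:-patience])

        def should_stop(val_losses, gl_limit=5.0, up_limit=3, patience=5, quorum=2):
            """Halt only when at least `quorum` independent indicators agree."""
            fired = [
                generalization_loss(val_losses) > gl_limit,
                consecutive_increases(val_losses) >= up_limit,
                stalled(val_losses, patience),
            ]
            return sum(fired) >= quorum

    Requiring agreement between indicators widens the operating range: a single noisy uptick in validation loss does not halt training on its own.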

    Sensing Human Sentiment via Social Media Images: Methodologies and Applications

    Abstract: Social media refers to computer-based technology that allows the sharing of information and the building of virtual networks and communities. With the development of internet-based services and applications, users can engage with social media via computers and smart mobile devices. In recent years, social media has taken the form of different activities such as social networking, business networking, text sharing, photo sharing, and blogging. With its increasing popularity, social media has accumulated a large amount of data, which makes understanding human behavior possible. Compared with traditional survey-based methods, the analysis of social media provides a golden opportunity to understand individuals at scale and, in turn, allows us to design better services that can be tailored to individuals' needs. From this perspective, we can view social media as sensors that provide online signals, from a virtual world with no geographical boundaries, about real-world individuals' activity. One of the key features of social media is that it is social: users actively interact with each other by generating content and expressing opinions, for example through posts and comments on Facebook. As a result, sentiment analysis, which refers to a computational model that identifies, extracts, or characterizes subjective information expressed in a given piece of text, has successfully employed these user signals and enabled many real-world applications in domains such as e-commerce, politics, and marketing. The goal of sentiment analysis is to classify a user's attitude towards various topics into positive, negative, or neutral categories based on textual data in social media. Recently, however, an increasing number of people have started to use photos to express their daily life on social media platforms like Flickr and Instagram. Therefore, analyzing sentiment from visual data is poised to greatly improve user understanding. In this dissertation, I study the problem of understanding human sentiment from large-scale collections of social images based on both image features and contextual social network features. We show that neither visual features nor textual features are by themselves sufficient for accurate sentiment prediction. Therefore, we provide a way of using both, and formulate the sentiment prediction problem in two scenarios: supervised and unsupervised. We first show that the proposed framework has the flexibility to incorporate multiple modalities of information and the capability to learn from heterogeneous features jointly given sufficient training data. Secondly, we observe that negative sentiment may be related to mental health issues. Based on this observation, we aim to understand negative social media posts, especially posts related to depression, e.g., self-harm content. Our analysis, the first of its kind, reveals a number of important findings. Thirdly, we extend the proposed sentiment prediction task to a general multi-label visual recognition task to demonstrate the flexibility of the methodology behind our sentiment analysis model. Doctoral Dissertation, Computer Science, 201
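    The supervised scenario described above, learning jointly from visual and textual features, can be pictured as late fusion of the two feature vectors into one classifier. A minimal Python sketch under that assumption follows; the dissertation's actual model, features, and training procedure are not reproduced, and all names and dimensions below are illustrative.

        import numpy as np

        def predict_sentiment(visual_feat, text_feat, W, b):
            """Concatenate the two modalities and score the three sentiment
            classes with an assumed pre-trained linear model (W, b)."""
            x = np.concatenate([visual_feat, text_feat])
            scores = W @ x + b  # one score per class
            return ("negative", "neutral", "positive")[int(np.argmax(scores))]

        # Toy usage with random weights, just to show the shapes involved.
        rng = np.random.default_rng(0)
        visual_feat, text_feat = rng.normal(size=128), rng.normal(size=64)
        W, b = rng.normal(size=(3, 192)), np.zeros(3)
        label = predict_sentiment(visual_feat, text_feat, W, b)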

    Diffusion and Perfusion MRI in Paediatric Posterior Fossa Tumours

    Brain tumours in children frequently occur in the posterior fossa. Most undergo surgical resection, after which up to 25% develop cerebellar mutism syndrome (CMS), characterised by mutism, emotional lability and cerebellar motor signs; these typically improve over several months. This thesis examines the application of diffusion (dMRI) and arterial spin labelling (ASL) perfusion MRI in children with posterior fossa tumours. dMRI enables non-invasive in vivo investigation of brain microstructure and connectivity through a computational process known as tractography. The results of a unique survey of British neurosurgeons' attitudes towards tractography are presented, demonstrating its widespread adoption and numerous limitations. State-of-the-art modelling of dMRI data combined with tractography is used to probe the anatomy of cerebellofrontal tracts in healthy children, revealing the first evidence of a topographic organisation of projections to the frontal cortex at the superior cerebellar peduncle. Retrospective review of a large institutional series shows that CMS remains the most common complication of posterior fossa tumour resection, and that surgical approach does not influence surgical morbidity in this cohort. A prospective case-control study of children with posterior fossa tumours treated at Great Ormond Street Hospital is reported, in which children underwent longitudinal MR imaging at three timepoints. A region-of-interest based approach did not reveal any differences in dMRI metrics with respect to CMS status. However, the candidate also conducted an analysis of a separate retrospective cohort of medulloblastoma patients at Stanford University using an automated tractography pipeline. This demonstrated, in unprecedented spatiotemporal detail, a fine-grained evolution of changes in cerebellar white matter tracts in children with CMS. ASL studies in the prospective cohort showed that following tumour resection, increases in cortical cerebral blood flow were seen alongside reductions in blood arrival time, and these effects were modulated by clinical features of hydrocephalus and CMS. The results contained in this thesis are discussed in the context of the current understanding of CMS, and the novel anatomical insights presented provide a foundation for future research into the condition.