
    Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers

    With the advent of modern pre-trained Transformers, text preprocessing has come to be neglected and is no longer specifically addressed in recent NLP literature. However, from both a linguistic and a computer science point of view, we believe that even when modern Transformers are used, text preprocessing can significantly affect the performance of a classification model. In this study, we investigate and compare how preprocessing affects the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to TC tasks in different domains. To assess how much preprocessing affects classification performance, we apply the three most frequently referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Nine machine learning models, including modern Transformers, then receive the preprocessed text as input. The results show that an educated choice of text preprocessing strategy should be based on the task as well as on the model considered. They also show that choosing the best preprocessing technique, rather than the worst, can significantly improve classification accuracy (by up to 25%, as in the case of XLNet on the IMDB dataset). In some cases, with a suitable preprocessing strategy, even a simple Naïve Bayes classifier proved to outperform the best-performing Transformer (by 2% in accuracy). We also found that preprocessing has a marked impact on the TC performance of both Transformers and traditional models. Our main findings are: (1) even for modern pre-trained language models, preprocessing can affect performance, depending on the dataset and on the preprocessing technique or combination of techniques used; (2) in some cases, with a proper preprocessing strategy, simple models can outperform Transformers on TC tasks; (3) similar classes of models exhibit similar levels of sensitivity to text preprocessing.
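    A minimal sketch of this kind of comparison, assuming scikit-learn and an illustrative dataset and preprocessing steps (the paper's actual nine models, four datasets, and three techniques are not reproduced here); each strategy is applied to the same corpus before an identical simple classifier, so accuracy differences can be attributed to preprocessing alone:

```python
# Illustrative comparison of text preprocessing strategies for classification.
# Dataset, steps, and classifier are placeholders, not the paper's setup.
import re

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return re.sub(r"[^\w\s]", " ", text)

STRATEGIES = {
    "none": [],
    "lowercase": [lowercase],
    "lowercase+punctuation": [lowercase, strip_punctuation],
}

data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])

for name, steps in STRATEGIES.items():
    docs = data.data
    for step in steps:
        docs = [step(d) for d in docs]  # apply each preprocessing step in order
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    scores = cross_val_score(model, docs, data.target, cv=5)
    print(f"{name:>22}: mean accuracy {scores.mean():.3f}")
```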

    Multidisciplinary perspectives on Artificial Intelligence and the law

    This open access book presents an interdisciplinary, multi-authored, edited collection of chapters on Artificial Intelligence (‘AI’) and the Law. AI technology has come to play a central role in the modern data economy. Through a combination of increased computing power, the growing availability of data and the advancement of algorithms, AI has now become an umbrella term for some of the most transformational technological breakthroughs of this age. The importance of AI stems from both the opportunities that it offers and the challenges that it entails. While AI applications hold the promise of economic growth and efficiency gains, they also create significant risks and uncertainty. The potential and perils of AI have thus come to dominate modern discussions of technology and ethics – and although AI was initially allowed to largely develop without guidelines or rules, few would deny that the law is set to play a fundamental role in shaping the future of AI. As the debate over AI is far from over, the need for rigorous analysis has never been greater. This book thus brings together contributors from different fields and backgrounds to explore how the law might provide answers to some of the most pressing questions raised by AI. An outcome of the Católica Research Centre for the Future of Law and its interdisciplinary working group on Law and Artificial Intelligence, it includes contributions by leading scholars in the fields of technology, ethics and the law.

    Taking Politics at Face Value: How Features Expose Ideology

    Previous studies using computer vision neural networks to analyze facial images have uncovered patterns in the feature-extracted output that are indicative of individual dispositions. For example, Wang and Kosinski (2018) were able to predict the sexual orientation of a target from his or her facial image with surprising accuracy, while Kosinski (2021) was able to do the same for political orientation. These studies suggest that computer vision neural networks can be used to classify people into categories using only their facial images. However, there is some ambiguity regarding the degree to which the features extracted from facial images incorporate facial morphology when used to make predictions. Critics have suggested that a subject’s transient facial features, such as wearing makeup, having a tan, donning a beard, or wearing glasses, might be subtly indicative of group belonging (Agüera y Arcas et al., 2018). Further, previous research in this domain has found that accurate image categorization can occur without utilizing facial morphology at all, relying instead upon image brightness, color dominance, or the background of the image to make successful classifications (Leuner, 2019; Wang, 2022). This dissertation seeks to bring some clarity to this domain. Using an application programming interface (API) for the popular social networking site Twitter, a sample of nearly a quarter million images of ideological organization followers was created. These images depicted followers of organizations supportive of, or opposed to, the polarizing political issues of gun control and immigration. Through a series of strong comparisons, this research tests for the influence of facial morphology in image categorization. Facial images were converted into point and mesh coordinate representations of the subjects’ faces, thus eliminating the influence of transient facial features. Images could be classified using facial morphology alone at rates well above chance (64% accuracy across all models utilizing only facial points, 62% using facial mesh). These results provide the strongest evidence to date that images can be categorized into social categories by facial morphology alone.
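    A hedged sketch of the point/mesh representation idea, assuming mediapipe for landmark extraction (the dissertation's actual pipeline and data are not reproduced; file names and labels are hypothetical):

```python
# Represent a face purely by its geometry, in the spirit of the point/mesh
# approach above. mediapipe's FaceMesh yields 468 normalized 3D landmarks per
# face; flattening them discards texture, color, and background cues entirely.
import cv2
import mediapipe as mp
import numpy as np
from sklearn.linear_model import LogisticRegression

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1)

def mesh_features(image_path):
    """Return flattened (x, y, z) landmark coordinates, or None if no face is found."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    result = face_mesh.process(image)
    if not result.multi_face_landmarks:
        return None
    landmarks = result.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in landmarks]).ravel()  # 468 * 3 values

# Given labeled images (label 0/1 for the two follower groups), a plain linear
# classifier over these geometry-only features tests whether morphology alone
# separates the groups, e.g.:
#   X = np.stack([mesh_features(p) for p in image_paths])
#   clf = LogisticRegression(max_iter=1000).fit(X, labels)
```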

    Conversations on Empathy

    In the aftermath of a global pandemic, amidst new and ongoing wars, genocide, inequality, and staggering ecological collapse, some in the public and political arena have argued that we are in desperate need of greater empathy — be this with our neighbours, refugees, war victims, the vulnerable, or disappearing animal and plant species. This interdisciplinary volume asks the crucial questions: How does a better understanding of empathy contribute, if at all, to our understanding of others? How is it implicated in the ways we perceive, understand and constitute others as subjects? Conversations on Empathy examines how empathy might be enacted and experienced either as a way to highlight forms of otherness or, instead, to overcome what might otherwise appear to be irreducible differences. It explores the ways in which empathy enables us to understand, imagine and create sameness and otherness in our everyday intersubjective encounters, focusing on a varied range of "radical others" – others who are perceived as being dramatically different from oneself. With a focus on the importance of empathy for understanding difference, the book contends that the role of empathy is critical, now more than ever, for thinking about local and global challenges of interconnectedness, care and justice.

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Dividing and Conquering a BlackBox to a Mixture of Interpretable Models: Route, Interpret, Repeat

    ML model design either starts with an interpretable model or a Blackbox and explains it post hoc. Blackbox models are flexible but difficult to explain, while interpretable models are inherently explainable. Yet, interpretable models require extensive ML knowledge and tend to be less flexible and lower-performing than their Blackbox variants. This paper aims to blur the distinction between a post hoc explanation of a Blackbox and constructing interpretable models. Beginning with a Blackbox, we iteratively carve out a mixture of interpretable experts (MoIE) and a residual network. Each interpretable model specializes in a subset of samples and explains them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. We route the remaining samples through a flexible residual. We repeat the method on the residual network until all the interpretable models explain the desired proportion of data. Our extensive experiments show that our route, interpret, and repeat approach (1) identifies a diverse set of instance-specific concepts with high concept completeness via MoIE without compromising performance, (2) identifies the relatively "harder" samples to explain via residuals, (3) outperforms interpretable-by-design models by significant margins during test-time interventions, and (4) fixes the shortcut learned by the original Blackbox. The code for MoIE is publicly available at: https://github.com/batmanlab/ICML-2023-Route-interpret-repeat
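    A schematic sketch of the loop described above; the helper callables are hypothetical stand-ins for the paper's neural selectors, FOL-based experts over Blackbox-derived concepts, and residual network:

```python
# Schematic route-interpret-repeat loop. fit_selector, fit_expert, and
# fit_residual are placeholder callables, not the authors' API.
def route_interpret_repeat(data, fit_selector, fit_expert, fit_residual,
                           coverage_target=0.9, max_rounds=5):
    experts = []
    remaining = list(data)
    total = len(remaining)
    residual = None
    for _ in range(max_rounds):
        selector = fit_selector(remaining)                  # route: which samples can be explained?
        routed = [x for x in remaining if selector(x)]
        if routed:
            experts.append((selector, fit_expert(routed)))  # interpret: fit an FOL expert
        remaining = [x for x in remaining if not selector(x)]
        residual = fit_residual(remaining)                  # harder samples stay with the residual
        if 1 - len(remaining) / total >= coverage_target:   # repeat until coverage is reached
            break
    return experts, residual
```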

    Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capabilities to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. 
Experimental evaluations demonstrate favorable predictive performance, stability, and interpretability of results, which carries high potential for better data-driven decision support in clinical practice.
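    A minimal sketch of the ensemble idea behind a RENT-style selector, assuming scikit-learn (RENT itself builds on elastic net regularized models and applies further selection criteria beyond the raw frequency used here): repeat a sparse fit on subsamples and keep only the features selected consistently across runs:

```python
# RENT-style ensemble feature selection, simplified: fit an elastic-net
# regularized model on repeated subsamples and keep features whose weights
# are nonzero in a large fraction of runs (a stability criterion).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

def ensemble_feature_selection(X, y, n_repeats=100, keep_if=0.8, subsample=0.7):
    counts = np.zeros(X.shape[1])
    splitter = ShuffleSplit(n_splits=n_repeats, train_size=subsample, random_state=0)
    for train_idx, _ in splitter.split(X):
        model = LogisticRegression(penalty="elasticnet", solver="saga",
                                   l1_ratio=0.5, C=0.1, max_iter=5000)
        model.fit(X[train_idx], y[train_idx])
        counts += np.abs(model.coef_).ravel() > 1e-8  # features selected this run
    return np.where(counts / n_repeats >= keep_if)[0]  # indices of stable features
```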

    Archaeological palaeoenvironmental archives: challenges and potential

    This Arts and Humanities Research Council (AHRC) sponsored collaborative doctoral project represents one of the most significant efforts to collate quantitative and qualitative data that can elucidate practices related to archaeological palaeoenvironmental archiving in England. The research has revealed that archived palaeoenvironmental remains are valuable resources for archaeological research and can clarify subjects that include the adoption and importation of exotic species, plant and insect invasion, human health and diet, and plant and animal husbandry practices. In addition to scientific research, archived palaeoenvironmental remains can provide evidence-based narratives of human resilience and climate change and offer evidence of the scientific process, making them ideal resources for public science engagement. These areas of potential have been realised at a critical time, given that waterlogged palaeoenvironmental remains at significant sites such as Star Carr, Must Farm, and Flag Fen, as well as archaeological deposits in towns and cities, are at risk of decay due to climate change-related factors and unsustainable agricultural practices. Innovative approaches to collecting and archiving palaeoenvironmental remains and maintaining existing archives will permit the creation of an accessible and thorough national resource that can serve archaeologists and researchers in the related fields of biology and natural history. Furthermore, a concerted effort to recognise absences in archaeological archives, matched by an effort to supply these deficiencies, can produce a resource that contributes to an enduring geographical and temporal record of England's biodiversity, which can be used in perpetuity in the face of diminishing archaeological and contemporary natural resources. To realise these opportunities, particular challenges must be overcome. The most prominent of these include inconsistent collection policies, resulting from pressures associated with shortages in storage capacity and declining specialist knowledge in museums and repositories, combined with variable curation practices. Many of these challenges can be resolved by developing a dedicated storage facility that can focus on the ongoing conservation and curation of palaeoenvironmental remains. Combined with an OASIS+ module designed to handle and disseminate data pertaining to palaeoenvironmental archives, remains would be findable, accessible, and interoperable with biological archives and collections worldwide. Providing a national centre for curating palaeoenvironmental remains and a dedicated digital repository will require significant funding; funding sources could be identified through collaboration with other disciplines. If sufficient funding cannot be identified, options that would require less financial investment, such as high-level archive audits and the production of guidance documents, would still assist all stakeholders with the improved curation, management, and promotion of the archived resource.