295 research outputs found

    FairGer: Using NLP to Measure Support for Women and Migrants in 155 Years of German Parliamentary Debates

    Full text link
    We measure support for women and migrants in German political debates over the last 155 years. To do so, we (1) provide a gold standard of 1205 text snippets in context, annotated for support for our target groups, (2) train a BERT model on our annotated data, with which (3) we infer large-scale trends. These show that support for women is stronger than support for migrants, but both have steadily increased over time. While we find hardly any direct anti-support toward women, there is more polarization when it comes to migrants. We also discuss the difficulty of annotation, which stems from ambiguity in political discourse and from indirectness, i.e., politicians' tendency to reference stances attributed to political opponents. Overall, our results indicate that German society, as measured from its political elite, has become fairer over time.
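
    The pipeline described here (annotate snippets, fine-tune BERT, score the full corpus) could look roughly like the minimal sketch below. This is an illustration, not the authors' code: the checkpoint, label set, hyperparameters, and placeholder training texts are all assumptions.

        # Minimal sketch, assuming a Hugging Face setup; checkpoint, labels, and data are illustrative.
        from datasets import Dataset
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        LABELS = ["anti-support", "neutral", "support"]          # assumed label scheme
        tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-german-cased", num_labels=len(LABELS))

        # gold-standard snippets in context (placeholder examples)
        train = Dataset.from_dict({"text": ["snippet a", "snippet b"], "label": [2, 0]})
        train = train.map(lambda b: tok(b["text"], truncation=True,
                                        padding="max_length", max_length=128), batched=True)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="fairger-bert", num_train_epochs=3,
                                   per_device_train_batch_size=16),
            train_dataset=train,
        )
        trainer.train()   # afterwards, score every debate snippet to infer trends over time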

    iMind: An Intelligent Tool for Supporting Content Comprehension

    Get PDF
    Comprehension difficulty while reading often affects individual performance: it can lead to a slower learning process, lower work quality, and inefficient decision-making. This thesis introduces an intelligent tool called “iMind” which uses wearable devices (e.g., smartwatches) to evaluate user comprehension difficulty and engagement levels while reading digital content. Comprehension difficulty can occur when not enough mental resources are available for mental processing; the relevant mental resource is cognitive load (CL). Fluctuations in CL lead to physiological manifestations of the autonomic nervous system (ANS), such as an increase in heart rate, which can be measured by wearables like smartwatches. With low-cost eye trackers, it is possible to correlate content regions with measurements of ANS manifestations. In this sense, iMind uses a smartwatch and an eye tracker to identify comprehension difficulty at the level of content regions (where the user is looking). The tool uses machine learning techniques to classify content regions as difficult or non-difficult based on biometric and non-biometric features, and classified regions with 75% accuracy and an 80% F-score using linear regression (LR). With the classified regions, it will be possible, in the future, to create contextual support for the reader in real time by, e.g., translating the sentences that induced comprehension difficulty.
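
    The region-classification step could be sketched as follows. This is a minimal illustration on synthetic data: the feature names are assumptions, and the thesis's linear-regression (LR) classifier is approximated here by thresholding LinearRegression predictions at 0.5.

        # Minimal sketch: classify gaze regions as difficult vs. non-difficult from
        # illustrative smartwatch/eye-tracker features; data and features are synthetic.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import accuracy_score, f1_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.random((200, 4))          # e.g. mean heart rate, HR variability, fixation time, regressions
        y = rng.integers(0, 2, size=200)  # 1 = region annotated as inducing difficulty

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
        reg = LinearRegression().fit(X_tr, y_tr)
        pred = (reg.predict(X_te) >= 0.5).astype(int)   # threshold regression output into classes
        print("accuracy:", accuracy_score(y_te, pred), "f1:", f1_score(y_te, pred))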

    Min-Max, Min-Max-Median, and Min-Max-IQR in Deciding Optimal Diagnostic Thresholds: Performances of a Logistic Regression Approach on Simulated and Real Data

    Get PDF
    Combining biomarkers and their summary statistics is used to increase the prediction performance of a diagnosis, but no gold-standard method exists. We introduced and evaluated an approach using linear combinations of summary-based statistics tested in logistic regression models with 10-fold repeated cross-validation. We used the AUC (area under the receiver operating characteristic (ROC) curve), the Youden index, sensitivity (Se), specificity (Sp), the diagnostic odds ratio (DOR), the Efficiency Index (EI), and the Inefficiency Index (InI) as performance metrics on the real data set. We tested the approaches in multivariate normal distribution simulations with 4, 10, and 100 biomarkers and on real data. The results show that the summary-based models, especially the minimum-maximum-median regression model (LR(MMM)) and the minimum-maximum-interquartile range model (LR(MMIQR)), perform similarly to or slightly better than the classical LR model regardless of the imposed biomarker means or covariance matrices, on both simulated and real data. The differences in AUCs grew as the number of combined biomarkers increased (LR(MMIQR) model vs. LR model: 0.09 for equal or unequal means of four biomarkers; 0.26 for equal means and 0.11 for unequal means of 10 biomarkers). On the real data, the linear combination of four biomarkers in LR(MMM) and LR(MMIQR) slightly increases the AUCs compared to the LR model, but the gains were marginal and without clinical relevance. The linear combination of summary-based statistics, specifically LR(MMM) and LR(MMIQR), exhibits performance similar to the classical LR model when biomarkers are linearly combined to increase diagnostic accuracy. Although the models perform well on simulated data sets, no clinical relevance of the combination is observed on the applied real data.
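
    The summary-based feature construction with repeated 10-fold cross-validation could be sketched as below. This is only an illustration on simulated data in the spirit of LR(MMIQR); the paper's exact models, metrics, and simulation settings are not reproduced.

        # Sketch: build min/max/IQR summary features over a biomarker panel and evaluate a
        # logistic regression with repeated 10-fold cross-validation on simulated data.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

        rng = np.random.default_rng(0)
        biomarkers = rng.normal(size=(300, 10))          # raw biomarker panel per subject
        y = rng.integers(0, 2, size=300)                 # diagnosis (0/1)

        q75, q25 = np.percentile(biomarkers, [75, 25], axis=1)
        X_mmiqr = np.column_stack([biomarkers.min(axis=1),
                                   biomarkers.max(axis=1),
                                   q75 - q25])           # LR(MMIQR)-style features

        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
        auc = cross_val_score(LogisticRegression(), X_mmiqr, y, cv=cv, scoring="roc_auc")
        print("mean cross-validated AUC:", auc.mean())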

    Empirical comparison of cross-platform normalization methods for gene expression data

    Get PDF
    Background: Simultaneous measurement of gene expression on a genomic scale can be accomplished using microarray technology or by sequencing-based methods. Researchers who perform high-throughput gene expression assays often deposit their data in public databases, but heterogeneity of measurement platforms leads to challenges for the combination and comparison of data sets. Researchers wishing to perform cross-platform normalization face two major obstacles. First, a choice must be made about which method or methods to employ. Nine are currently available, and no rigorous comparison exists. Second, software for the selected method must be obtained and incorporated into a data analysis workflow.
    Results: Using two publicly available cross-platform testing data sets, cross-platform normalization methods are compared based on inter-platform concordance and on the consistency of gene lists obtained with transformed data. Scatter and ROC-like plots are produced, and new statistics based on those plots are introduced to measure the effectiveness of each method. Bootstrapping is employed to obtain distributions for those statistics. The consistency of platform effects across studies is explored theoretically and with respect to the testing data sets.
    Conclusions: Our comparisons indicate that four methods, DWD, EB, GQ, and XPN, are generally effective, while the remaining methods do not adequately correct for platform effects. Of the four successful methods, XPN generally shows the highest inter-platform concordance when treatment groups are equally sized, while DWD is most robust to differently sized treatment groups and consistently shows the smallest loss in gene detection. We provide an R package, CONOR, capable of performing the nine cross-platform normalization methods considered. The package can be downloaded at http://alborz.sdsu.edu/conor and is available from CRAN.
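
    CONOR itself is an R package; the Python snippet below only illustrates the general flavor of quantile-style cross-platform normalization (mapping one platform's per-gene values onto another platform's empirical quantiles). It is not any of the nine compared methods, nor CONOR's API.

        # Conceptual sketch only: align per-gene distributions of one expression matrix
        # (genes x samples) to a reference platform via rank/quantile mapping.
        import numpy as np

        def quantile_map(source, reference):
            """Map each gene's values in `source` onto the empirical quantiles of `reference`."""
            src_rank = source.argsort(axis=1).argsort(axis=1)        # per-gene ranks within source
            ref_sorted = np.sort(reference, axis=1)
            # position of the matching quantile in the reference samples
            pos = (src_rank / (source.shape[1] - 1) * (reference.shape[1] - 1)).round().astype(int)
            return np.take_along_axis(ref_sorted, pos, axis=1)

        platform_a = np.random.rand(1000, 20)    # e.g. microarray intensities
        platform_b = np.random.rand(1000, 30)    # e.g. values from a second platform
        platform_a_norm = quantile_map(platform_a, platform_b)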

    Homeowner Preferences after September 11th, a Microdata Approach

    Get PDF
    The existence of homeowner preferences, specifically homeowner preferences for neighbors, is fundamental to economic models of sorting. This paper investigates whether or not the terrorist attacks of September 11, 2001 (9/11) impacted local preferences for Arab neighbors. We test for changes in preferences using a differences-in-differences approach in a hedonic pricing model. Relative to sales before 9/11, we find properties within 0.1 miles of an Arab homeowner sold at a 1.4% discount in the 180 days after 9/11. The results are robust to a number of specifications, including time horizon, event date, distance, time, alternative ethnic groups, and the presence of nearby mosques. Previous research has shown price effects at neighborhood levels but has not identified effects at the micro or individual-property level, and for good reason: most transaction-level data sets do not include ethnic identifiers. Applying methods from the machine learning and biostatistics literature, we develop a binomial classifier using a supervised learning algorithm and identify Arab homeowners based on the name of the buyer. We train the binomial classifier using names from Summer Olympic rosters for 221 countries during the years 1948-2012. We demonstrate the flexibility of our methodology and perform an interesting counterfactual by identifying Hispanic and Asian homeowners in the data; unlike the statistically significant results for Arab homeowners, we find no meaningful results for Hispanic and Asian homeowners following 9/11.
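
    A name-based binomial classifier in the spirit of this approach could be sketched as follows. The training names, labels, and model choice (character n-grams with naive Bayes) are illustrative assumptions, not the paper's actual classifier.

        # Hypothetical sketch: classify buyer names by likely origin group using character
        # n-gram features; placeholder roster names stand in for the Olympic training data.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        names  = ["Haddad", "Smith", "Nasser", "Johnson"]   # placeholder roster surnames
        labels = [1, 0, 1, 0]                               # 1 = Arab-origin roster, 0 = otherwise

        clf = make_pipeline(
            CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-gram features
            MultinomialNB(),
        )
        clf.fit(names, labels)
        print(clf.predict(["Khalil"]))   # classify a buyer name from the transaction data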

    Prediction of drug classes based on gene expression data

    Get PDF
    Financial investment in pharmaceutical research and development has increased enormously. Drug safety is very important to health and drug development, and finding new uses for approved drugs has become important for the pharma industry. Accurate drug classification helps identify useful information for studying drugs and supports their accurate characterization. Gene expression data makes it possible to study such biological problems, and machine learning methods play an essential role in the analysis process; many machine learning methods have been applied to classification, clustering, and dynamic modeling in gene expression analysis. This thesis uses the R programming language and the SVM machine learning method to predict the ATC class of drugs from gene expression data, in order to see how well gene expression patterns after treatment correlate within a therapeutic/pharmacological subgroup. A dimensionality reduction method is used to reduce the dimensionality of the dataset and improve classification performance. The classifiers built with the SVM technique in this thesis showed limited ability to detect drug groups based on the ATC system.
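
    The thesis itself works in R; the Python sketch below only illustrates the described pipeline (dimensionality reduction followed by an SVM classifier over expression profiles). Data shapes, the component count, and the ATC labels are placeholder assumptions.

        # Illustrative sketch: predict ATC subgroup from post-treatment expression profiles
        # with PCA for dimensionality reduction and an SVM classifier; data are synthetic.
        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        X = np.random.rand(200, 978)                            # expression profiles after treatment
        y = np.random.choice(["C01", "N05", "L01"], size=200)   # assumed ATC subgroups

        model = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel="rbf"))
        print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())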

    FinDiff: Diffusion Models for Financial Tabular Data Generation

    Full text link
    The sharing of microdata, such as fund holdings and derivative instruments, by regulatory institutions presents a unique challenge due to strict data confidentiality and privacy regulations. These challenges often hinder the ability of both academics and practitioners to conduct collaborative research effectively. The emergence of generative models, particularly diffusion models, capable of synthesizing data that mimics the underlying distributions of real-world data presents a compelling solution. This work introduces 'FinDiff', a diffusion model designed to generate real-world financial tabular data for a variety of regulatory downstream tasks, for example, economic scenario modeling, stress tests, and fraud detection. The model uses embedding encodings to model mixed-modality financial data comprising both categorical and numeric attributes. The performance of FinDiff in generating synthetic tabular financial data is evaluated against state-of-the-art baseline models using three real-world financial datasets (including two publicly available datasets and one proprietary dataset). Empirical results demonstrate that FinDiff excels in generating synthetic tabular financial data with high fidelity, privacy, and utility.
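
    A heavily simplified sketch of the core idea (not the FinDiff architecture or code): embed categorical columns, concatenate them with numeric columns, and train a small network to predict the noise added at a random diffusion timestep. All column counts, cardinalities, and hyperparameters below are illustrative.

        # Toy denoising-diffusion training loop for mixed-modality tabular data (sketch only).
        import torch
        import torch.nn as nn

        n_rows, n_num, n_cat, cat_card, emb_dim, T = 256, 4, 2, 10, 8, 100

        num_x = torch.randn(n_rows, n_num)                       # numeric attributes (standardized)
        cat_x = torch.randint(0, cat_card, (n_rows, n_cat))      # categorical attribute codes

        emb = nn.Embedding(cat_card, emb_dim)                    # embedding encoding of categories
        betas = torch.linspace(1e-4, 0.02, T)                    # linear noise schedule
        alpha_bar = torch.cumprod(1 - betas, dim=0)

        d = n_num + n_cat * emb_dim
        model = nn.Sequential(nn.Linear(d + 1, 128), nn.ReLU(), nn.Linear(128, d))
        opt = torch.optim.Adam(list(model.parameters()) + list(emb.parameters()), lr=1e-3)

        for step in range(200):
            x0 = torch.cat([num_x, emb(cat_x).flatten(1)], dim=1)     # mixed-modality encoding
            t = torch.randint(0, T, (n_rows,))
            noise = torch.randn_like(x0)
            a = alpha_bar[t].unsqueeze(1)
            xt = a.sqrt() * x0 + (1 - a).sqrt() * noise               # forward diffusion at step t
            pred = model(torch.cat([xt, t.float().unsqueeze(1) / T], dim=1))
            loss = ((pred - noise) ** 2).mean()                       # noise-prediction objective
            opt.zero_grad(); loss.backward(); opt.step()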