Haplotype Variety Analysis of Human Populations: an Application to HapMap Data
We undertake a study to investigate the haplotype variety of distinct human populations. We use a natural measure of haplotype variety, the total number of haplotypes (TNH) present that reflects the number of haplotypes with nonzero frequencies estimated from the data at hand for each selection of multiple loci. For the analysis of real human populations, we use the haplotype data of the Denver Chinese, Tuscan Italians, Luhya Kenyans, and Gujarati Indians from release III of the HapMap database. Moreover, we show that the TNH statistic is biased in small sample data scenarios such as the HapMap and implement a nested simulation study to estimate and remove such bias. We perform a preliminary analysis of means and variances of the population allele frequencies in the four populations. Lastly, we implement a generalized linear model to detect and quantify the differences in haplotype structures of these populations. Our results show that all populations possess significantly different adjusted average TNH values. Our findings extend previous results based on alternative statistical approaches and demonstrate the existence of pronounced differences in the haplotype variety of the analyzed populations even after controlling for haplotype span as well as all allele frequencies and their two-way interactions
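The small-sample bias of the TNH statistic described above can be illustrated with a minimal sketch. The population below is hypothetical (eight made-up haplotypes over three biallelic loci) and the resampling scheme is an assumption for illustration only; it is not the nested simulation used in the study.

```python
import random

def total_number_of_haplotypes(haplotypes):
    """TNH: number of distinct haplotypes observed, i.e. with nonzero sample frequency."""
    return len(set(haplotypes))

def mean_tnh(population, sample_size, replicates=200, seed=0):
    """Average TNH over repeated samples (with replacement) of a given size."""
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        sample = [rng.choice(population) for _ in range(sample_size)]
        total += total_number_of_haplotypes(sample)
    return total / replicates

# Hypothetical population with 8 haplotypes at unequal frequencies.
population = (["000"] * 40 + ["001"] * 25 + ["010"] * 15 + ["011"] * 10
              + ["100"] * 5 + ["101"] * 3 + ["110"] * 1 + ["111"] * 1)

small = mean_tnh(population, sample_size=10)
large = mean_tnh(population, sample_size=200)
print(f"mean TNH, n=10:  {small:.2f}")
print(f"mean TNH, n=200: {large:.2f}")
```

Rare haplotypes are often missed in small samples, so the observed TNH tends to underestimate the true number (8 in this toy population).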
On the ranking of the disease susceptibility locus in family-based candidate gene studies: a simulation-based analysis
The ranking of the p-value of the true causal single nucleotide polymorphism (SNP) in the ordered list of individual SNP p-values is an important factor for achieving success in the ultimate objective of association studies: identifying deleterious genetic variants. Thus, we undertake a study to assess the implications of complex, multimarker correlation structure, sample size, and disease models on the ranking of the causal SNP. We carry out an extensive family-based candidate gene simulation study to analyze the position of the disease susceptibility locus (DSL) in the complete list of individual SNP p-values ordered according to their statistical significance. We simulate data based on the haplotype distributions of ten randomly selected genes extracted from the HapMap database, various sample sizes (600, 1000, and 2000) that current association studies employ, and disease models that mimic the characteristics of complex human disorders. We conclude that the average rankings of the causal SNP for sample sizes 600, 1000, and 2000 (10.97, 9.65, and 8.34, respectively) are dramatically distant from the most significant and intuitively appropriate top position. This result is even more pronounced for genes with high average correlation and a large number of common SNPs. Moreover, the gain in the DSL ranking when comparing sample sizes 600 to 1000 and 1000 to 2000, averaged over disease models, causal SNPs, and genes, was approximately 1.3. These outcomes both reveal the importance of the sample size and quantify the magnitude required to unequivocally determine the identity of the DSL in family-based candidate gene studies. Our results show the overwhelming importance of large sample sizes in the localization of deleterious SNPs even under simple disease models.
These conclusions possess pronounced importance for the design and result interpretation of candidate gene, next generation high-density genome-wide association studies, as well as for the construction and implementation of association tests based on the distribution of the most significant (minimum p-value) test statistics
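The quantity studied above, the position of the causal SNP in the list of p-values ordered by significance, can be sketched as follows. The function name, the tie-handling rule, and the p-values are illustrative assumptions, not taken from the study.

```python
def causal_snp_rank(p_values, causal_index):
    """1-based rank of the causal SNP when SNPs are ordered by increasing p-value.

    Assumption: ties are resolved in favour of the causal SNP, so the rank is
    one plus the number of SNPs with a strictly smaller p-value.
    """
    causal_p = p_values[causal_index]
    return 1 + sum(1 for p in p_values if p < causal_p)

def average_rank(replicates, causal_index):
    """Mean rank of the causal SNP over simulated replicates."""
    ranks = [causal_snp_rank(p_values, causal_index) for p_values in replicates]
    return sum(ranks) / len(ranks)

# Hypothetical p-values from two simulated replicates of a 4-SNP gene.
replicates = [
    [0.03, 0.001, 0.2, 0.01],   # causal SNP (index 0) is third most significant
    [0.002, 0.04, 0.5, 0.01],   # causal SNP (index 0) is most significant
]
print(average_rank(replicates, causal_index=0))
```

A rank of 1 means the causal SNP is the most significant; the averages reported in the abstract are far from that position.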
A Kinship-Based Modification of the Armitage Trend Test to Address Hidden Population Structure and Small Differential Genotyping Errors
BACKGROUND/AIMS: We propose a modification of the well-known Armitage trend test to address the problems associated with hidden population structure and hidden relatedness in genome-wide case-control association studies. METHODS: The new test adopts beneficial traits from three existing testing strategies (principal components, mixed model, and genomic control) while avoiding some of their disadvantageous characteristics, such as the tendency of the principal components method to over-correct in certain situations or the failure of the genomic control approach to reorder the adjusted tests based on their degree of alignment with the underlying hidden structure. The new procedure is based on Gauss-Markov estimators derived from a straightforward linear model with an imposed variance structure proportional to an empirical relatedness matrix. Conceptual and analytical similarities to and distinctions from other approaches are emphasized throughout. RESULTS: Our simulations show that the power of the proposed test is quite promising compared to the considered competing strategies. The power gains are especially large when small differential genotyping errors between cases and controls are present, a likely scenario when public controls are used in multiple studies. CONCLUSION: The proposed modified approach attains high power more consistently than the existing commonly implemented tests. Its performance improvement is most apparent when small but detectable systematic differences between cases and controls exist
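For orientation, the classical (unmodified) Armitage trend test that this work starts from can be sketched from genotype counts. This is the standard textbook statistic with additive weights (0, 1, 2); it does not include the kinship-based modification proposed in the abstract, and the counts are made up.

```python
def armitage_trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Classical Cochran-Armitage trend statistic (1 degree of freedom).

    cases, controls: genotype counts per category (e.g. 0, 1, 2 copies of an allele).
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    column_totals = [r + s for r, s in zip(cases, controls)]

    # Trend numerator: T = sum_i t_i * (r_i * S - s_i * R)
    trend = sum(t * (r * n_controls - s * n_cases)
                for t, r, s in zip(weights, cases, controls))

    # Var(T) = (R * S / N) * (N * sum t_i^2 n_i - (sum t_i n_i)^2)
    sum_t2n = sum(t * t * n for t, n in zip(weights, column_totals))
    sum_tn = sum(t * n for t, n in zip(weights, column_totals))
    variance = (n_cases * n_controls / n_total) * (n_total * sum_t2n - sum_tn ** 2)

    if variance == 0:
        return 0.0
    return trend * trend / variance

# Hypothetical genotype counts.
print(armitage_trend_chi2(cases=(10, 20, 30), controls=(30, 20, 10)))
```

The statistic is compared against a chi-square distribution with one degree of freedom; it is zero when cases and controls have proportional genotype counts.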
A Two-Light Version of the Classical Hundred Prisoners and a Light Bulb Problem: Optimizing Experimental Design through Simulations
We propose five original strategies of successively increasing complexity and efficiency that address a novel version of a classical mathematical problem that, in essence, focuses on the determination of an optimal protocol for exchanging limited amounts of information among a group of subjects with various prerogatives. The inherent intricacy of the problem-solving protocols eliminates the possibility of attaining an analytical solution. Therefore, we implemented a large-scale simulation study to exhaustively search through an extensive list of competing algorithms associated with the above-mentioned five generally defined protocols. Our results show that the consecutive improvements in the average amount of time necessary for the strategy-specific problem-solving completion over the previous simpler and less advantageously structured designs were 18, 30, 12, and 9%, respectively. The optimal multi-stage information exchange strategy allows for a successful execution of the task of interest in 1722 days (4.7 years) on average, with a standard deviation of 385 days. The execution of this protocol took as few as 1004 and as many as 4965 days, with a median of 1616 days
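As a point of reference for the simulation approach, the sketch below simulates the classical one-bulb version of the problem with the well-known single-counter strategy. It is not one of the five two-light strategies proposed in the abstract; the function name and parameters are illustrative assumptions.

```python
import random

def simulate_single_counter(num_prisoners=100, seed=None):
    """Days until the designated counter can declare that everyone has visited.

    Classical one-bulb protocol: each non-counter turns the light on once (the
    first time they find it off); the counter turns it off and increments a tally.
    """
    rng = random.Random(seed)
    light_on = False
    has_signalled = [False] * num_prisoners
    has_signalled[0] = True      # prisoner 0 is the counter and counts themselves
    count = 1
    days = 0
    while count < num_prisoners:
        days += 1
        visitor = rng.randrange(num_prisoners)   # one uniformly random visitor per day
        if visitor == 0:
            if light_on:
                light_on = False
                count += 1
        elif not light_on and not has_signalled[visitor]:
            light_on = True
            has_signalled[visitor] = True
    return days

runs = [simulate_single_counter(seed=s) for s in range(20)]
print("mean days over 20 runs:", sum(runs) / len(runs))
```

Repeating such runs many times gives the mean, median, and spread of completion times, which is the kind of summary reported above for the two-light strategies.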
A Comparative Study on Deep Learning Models for Text Classification of Unstructured Medical Notes with Various Levels of Class Imbalance
Background
Discharge medical notes written by physicians contain important information about the health condition of patients. Many deep learning algorithms have been successfully applied to extract important information from unstructured medical notes data that can entail subsequent actionable results in the medical domain. This study aims to explore the model performance of various deep learning algorithms in text classification tasks on medical notes with respect to different disease class imbalance scenarios.
Methods
In this study, we employed seven artificial intelligence models: a CNN (Convolutional Neural Network), a Transformer encoder, a pretrained BERT (Bidirectional Encoder Representations from Transformers), and four typical sequence neural network models, namely RNN (Recurrent Neural Network), GRU (Gated Recurrent Unit), LSTM (Long Short-Term Memory), and Bi-LSTM (Bi-directional Long Short-Term Memory), to classify the presence or absence of 16 disease conditions from patients’ discharge summary notes. We framed this task as 16 separate binary classification problems. The performance of the seven models on each of the 16 datasets, with various levels of imbalance between classes, was compared in terms of AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic), AUC-PR (Area Under the Curve of Precision and Recall), F1 Score, and Balanced Accuracy, as well as training time. The model performances were also compared in combination with different word embedding approaches (GloVe, BioWordVec, and no pre-trained word embeddings).
Results
The analyses of these 16 binary classification problems showed that the Transformer encoder model performs the best in nearly all scenarios. In addition, when the disease prevalence is close to or greater than 50%, the Convolutional Neural Network model achieved a comparable performance to the Transformer encoder, and its training time was 17.6% shorter than the second-fastest model, 91.3% shorter than the Transformer encoder, and 94.7% shorter than the pre-trained BERT-Base model. The BioWordVec embeddings slightly improved the performance of the Bi-LSTM model in most disease prevalence scenarios, while the CNN model performed better without pre-trained word embeddings. In addition, the training time was significantly reduced with the GloVe embeddings for all models.
Conclusions
For classification tasks on medical notes, Transformer encoders are the best choice if computational resources are not a constraint. Otherwise, when the classes are relatively balanced, CNNs are a leading candidate because of their competitive performance and computational efficiency
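Two of the metrics named above, F1 Score and Balanced Accuracy, matter most under class imbalance. The sketch below computes both from binary labels using their standard definitions; the labels are made up and the helper names are illustrative, not from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; defined as 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (assumes both classes are present)."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Hypothetical imbalanced problem: 5% prevalence, classifier always predicts "absent".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, balanced_accuracy(tp, tn, fp, fn), f1_score(tp, fp, fn))
```

In this toy case plain accuracy looks high even though no positive case is detected, while Balanced Accuracy and F1 expose the failure.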
Privacy-Preserving ECG Data Analysis with Differential Privacy: A Literature Review and A Case Study
Differential privacy has become the preeminent technique to protect the
privacy of individuals in a database while allowing useful results from data
analysis to be shared. Notably, it guarantees the amount of privacy loss in the
worst-case scenario. Although many theoretical research papers have been
published, practical real-life application of differential privacy demands
estimating several important parameters without any clear solutions or
guidelines. In the first part of the paper, we provide an overview of key
concepts in differential privacy, followed by a literature review and
discussion of its application to ECG analysis. In the second part of the paper,
we explore how to implement differentially private query release on an
arrhythmia database using a six-step process. We provide guidelines and discuss
the related literature for all the steps involved, such as selection of the ε
value, distribution of the total ε budget across the
queries, and estimation of the sensitivity for the query functions. At the end,
we discuss the shortcomings and challenges of applying differential privacy to
ECG datasets
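The steps mentioned above (choosing ε, splitting the total budget across queries, and using each query's sensitivity) can be sketched with the standard Laplace mechanism. This is a generic illustration, not the paper's six-step process: the even budget split, the sensitivity of 1 for counting queries, and the counts themselves are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_counts(true_counts, total_epsilon, sensitivity=1.0, seed=None):
    """Release noisy counts under a total privacy budget.

    Assumptions: the budget is split evenly across queries (sequential
    composition) and each counting query has sensitivity 1.
    """
    rng = random.Random(seed)
    epsilon_per_query = total_epsilon / len(true_counts)
    scale = sensitivity / epsilon_per_query      # Laplace scale b = sensitivity / epsilon
    return [count + laplace_noise(scale, rng) for count in true_counts]

# Hypothetical counts of records per rhythm class in an arrhythmia database.
true_counts = [245, 44, 15]
noisy = private_counts(true_counts, total_epsilon=1.0, seed=42)
print([round(value, 1) for value in noisy])
```

A smaller total ε, or more queries sharing the same budget, increases the noise scale and so lowers the accuracy of each released count.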
An R package for parametric estimation of causal effects
This article explains the usage of R package CausalModels, which is publicly
available on the Comprehensive R Archive Network. While packages are available
for estimating causal effects, no single package provides
a collection of structural models using the conventional statistical approach
developed by Hernán and Robins (2020). CausalModels addresses this deficiency
of software in R concerning causal inference by offering tools for methods that
account for biases in observational data without requiring extensive
statistical knowledge. These methods should not be ignored and may be more
appropriate or efficient in solving particular problems. While implementations
of these statistical models are distributed among a number of causal packages,
CausalModels introduces a simple and accessible framework for a consistent
modeling pipeline among a variety of statistical methods for estimating causal
effects in a single R package. It consists of common methods including
standardization, IP weighting, G-estimation, outcome regression, instrumental
variables and propensity matching
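To make one of the listed methods concrete, the sketch below applies nonparametric standardization to a toy dataset with a single binary confounder. It is written in Python for illustration and does not use or reproduce the CausalModels API; the data and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def standardized_mean(records, treatment):
    """Estimate E[Y^a] as sum over l of E[Y | A=a, L=l] * P(L=l).

    records: (confounder L, treatment A, outcome Y) triples. Assumes every
    confounder stratum contains at least one record with the given treatment
    (positivity); otherwise statistics.mean raises StatisticsError.
    """
    outcomes_by_stratum = defaultdict(list)
    stratum_sizes = defaultdict(int)
    for confounder, received, outcome in records:
        stratum_sizes[confounder] += 1
        if received == treatment:
            outcomes_by_stratum[confounder].append(outcome)
    n = len(records)
    return sum(mean(outcomes_by_stratum[level]) * size / n
               for level, size in stratum_sizes.items())

# Hypothetical observational data: (L, A, Y).
records = [
    (0, 0, 1.0), (0, 0, 3.0), (0, 1, 4.0), (0, 1, 6.0),
    (1, 0, 2.0), (1, 1, 10.0),
]
effect = standardized_mean(records, 1) - standardized_mean(records, 0)
print(f"standardized average causal effect: {effect:.3f}")
```

The estimate has a causal interpretation only under the usual identifiability assumptions, including no unmeasured confounding given L.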
Pitcher Effectiveness: A Step Forward for In Game Analytics and Pitcher Evaluation
With the introduction of Statcast in 2015, baseball analytics have become more precise. Statcast allows every play to be accurately tracked, and the data it generates is easily accessible through Baseball Savant, which opens the opportunity for improved performance statistics to be developed. In this paper we propose a new tool, Pitcher Effectiveness, that uses Statcast data to evaluate starting pitchers dynamically, based on the results of in-game outcomes after each pitch. Pitcher Effectiveness successfully predicts instances where starting pitchers give up several runs, which we believe makes it a new and important tool for the in-game and post-game evaluation of starting pitchers
Native vegetable oils: properties and mechanisms.
Introduction. Through their lipophilic compounds (polyunsaturated fatty acids, tocopherols, tocotrienols, phytosterols, carotenoids, organic acids, alcohols, esters, aldehydes, ketones) and hydrophilic compounds (phenolic acids, aldehydes, hydroxycinnamic esters, flavonols, glucosides, procyanidins), vegetable oils exhibit a range of biological properties. Aim. To analyze and systematize the properties and mechanisms of vegetable oils. Materials and methods. Articles from the PubMed database were selected and analyzed using the keywords "vegetable oils", "properties", and "mechanisms". Results. Vegetable oils demonstrated antimicrobial, antioxidant, anti-inflammatory, antitumor, regenerative, and cytoprotective activity. The antioxidant activity was attributed to the scavenging of free radicals (superoxide anions, peroxyl radicals, peroxynitrite radicals), inhibition of lipid peroxidation, a reduction in the levels of conjugated dienes and malondialdehyde, and increased expression and production of antioxidant enzymes (glutathione peroxidase, catalase, superoxide dismutase). The anti-inflammatory activity resulted from inhibition of the overproduction of nitric oxide, cytokines (interleukin-1beta, tumor necrosis factor), and pro-inflammatory prostaglandins, an increase in anti-inflammatory cytokines, and reduced infiltration of inflammatory cells and oxidative stress. The antitumor effect of vegetable oils involves several mechanisms, including apoptosis, DNA damage, and oxidative stress. Conclusions. The biological properties of vegetable oils were attributed to polyunsaturated fatty acids, polyphenols, procyanidins, tocopherols, tocotrienols, carotenoids, phytosterols, chlorophylls, flavonols, and glucosides
Predictors of Substance Use/Misuse in Youth
Introduction: Substance use/misuse is highly prevalent among youth, which is concerning given the associated adverse outcomes such as psychiatric conditions, interpersonal problems, and deficits in brain structure, function, and cognition. Considering the rising global burden of disease due to substance use disorders, especially for youth in low- to middle-income countries, it is crucial to develop strategies that enable prompt detection and early implementation of effective intervention measures for youth at the highest risk of substance use/misuse. The identification of predictors or other associated factors of youth substance use/misuse may facilitate the development of such strategies and inform policy or intervention efforts. Thus, this thesis aimed to identify predictors or other associated factors of substance use/misuse among youth in Brazil and across the globe. We also sought to determine the prevalence of underage alcohol use among youth in Brazil, since prior studies on similar topics have predominantly been conducted in high-income countries.
Results: Various predictors or other associated factors of youth substance use/misuse including sociodemographic and clinical characteristics were identified. Risk factors for underage drinking in a nationally representative population of Brazilian adolescents were having other/no religion, residing in rural areas, depression, tobacco use, and illicit drug use. Alcohol abuse/dependence, tobacco abuse/dependence, manic episode history, suicide risk, and male sex were identified as important predictors of illicit substance abuse/dependence among young adults in Brazil using machine learning techniques. A relatively high prevalence of current underage alcohol use in Brazil of 22.2% was also found. Our comprehensive systematic review and meta-analysis indicated that several different types of childhood maltreatment were predictive of various types of youth substance use/misuse, with the included studies conducted in 22 countries across the globe.
Conclusion: Several important sociodemographic and clinical predictors or associated factors of substance use/misuse among youth in Brazil and across the globe were found. The identification of significant predictors can facilitate the prompt detection of youth at the highest risk of substance use/misuse and enhance intervention efforts through measures that also address these predictors. Therefore, these findings highlight the severity of substance use/misuse among youth and indicate the promising potential of identifying predictors or other associated factors in informing early intervention efforts to prevent or mitigate youth substance use/misuse.
Thesis
Doctor of Philosophy (PhD)
Since youth are still developing, they are especially likely to engage in risky behaviours such as using or misusing substances and to experience severe consequences because of this behaviour. Identifying risk factors or other factors associated with youth substance use or misuse can help to improve efforts that are directed at preventing or reducing this problem. Previous research on youth substance use or misuse has focused on high-income countries; however, most of the world’s youth live in low- to middle-income countries such as Brazil. Therefore, this thesis aimed to identify risk factors or other associated factors of substance use or misuse among youth in Brazil and across the globe. We also determined the prevalence of underage drinking among Brazilian youth. Several different types of sociodemographic or clinical risk factors of youth substance use or misuse were found, including tobacco use or misuse, depression, other or no religion, living in rural areas, and childhood maltreatment. The identification of these risk factors or other associated factors of youth substance use or misuse may help to improve efforts to prevent this problem and the associated negative outcomes
