Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality being analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions based on KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details about how this can be achieved. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.
Efficient posterior sampling for high-dimensional imbalanced logistic regression
High-dimensional data are routinely collected in many areas. We are
particularly interested in Bayesian classification models in which one or more
variables are imbalanced. Current Markov chain Monte Carlo algorithms for
posterior computation are inefficient as the sample size n and/or the number of predictors p increase, due to
worsening time per step and mixing rates. One strategy is to use a
gradient-based sampler to improve mixing while using data sub-samples to reduce
per-step computational complexity. However, usual sub-sampling breaks down when
applied to imbalanced data. Instead, we generalize piece-wise deterministic
Markov chain Monte Carlo algorithms to include importance-weighted and
mini-batch sub-sampling. These approaches maintain the correct stationary
distribution with arbitrarily small sub-samples, and substantially outperform
current competitors. We provide theoretical support and illustrate gains in
simulated and real data applications.
Comment: 4 figures
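The idea of importance-weighted sub-sampling mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' piece-wise deterministic sampler: it only shows, under stated assumptions, how a weighted sub-sample can give an unbiased estimate of a full-data gradient while guaranteeing that the rare class is represented. The one-predictor logistic model, the synthetic data, the choice of placing half the sampling mass on the rare class, and all function names are hypothetical.

```python
import math
import random


def grad_term(x, y, beta):
    """Per-observation gradient of the logistic log-likelihood (one predictor)."""
    p = 1.0 / (1.0 + math.exp(-beta * x))
    return (y - p) * x


def weighted_subsample_gradient(xs, ys, beta, probs, m, rng):
    """Estimate the full-data gradient from m draws taken with probabilities probs.

    Each drawn term is divided by its sampling probability, which keeps the
    estimator unbiased for the full-data sum.
    """
    idx = rng.choices(range(len(xs)), weights=probs, k=m)
    return sum(grad_term(xs[i], ys[i], beta) / probs[i] for i in idx) / m


# Hypothetical imbalanced data: 2 positives among 100 observations.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [1 if i < 2 else 0 for i in range(100)]

# Assumed proposal: half of the sampling mass goes to the rare class.
n_pos = sum(ys)
n_neg = len(ys) - n_pos
probs = [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in ys]

beta = 0.3
full = sum(grad_term(x, y, beta) for x, y in zip(xs, ys))

# Averaging many sub-sample estimates should approach the full-data gradient.
estimates = [
    weighted_subsample_gradient(xs, ys, beta, probs, 10, rng)
    for _ in range(20000)
]
mean_estimate = sum(estimates) / len(estimates)
```

Uniform sub-sampling of size 10 would usually miss both positive observations here, whereas the weighted proposal draws from the rare class about half the time. The division by `probs[i]` is what corrects for that oversampling.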
Looking Beyond Label Noise: Shifted Label Distribution Matters in Distantly Supervised Relation Extraction
In recent years there has been a surge of interest in applying distant supervision
(DS) to automatically generate training data for relation extraction (RE). In
this paper, we study what limits the performance of DS-trained
neural models, conduct thorough analyses, and identify a factor that can
greatly influence performance: shifted label distribution. Specifically, we
found that this problem commonly exists in real-world DS datasets, and that without
special handling, typical DS-RE models cannot automatically adapt to this shift,
and thus suffer degraded performance. To further validate our intuition, we
develop a simple yet effective adaptation method for DS-trained models, bias
adjustment, which updates models learned over the source domain (i.e., DS
training set) with a label distribution estimated on the target domain (i.e.,
test set). Experiments demonstrate that bias adjustment achieves consistent
performance gains on DS-trained models, especially on neural models, with up
to a 23% relative F1 improvement, which verifies our assumptions. Our code and
data can be found at
https://github.com/INK-USC/shifted-label-distribution.
Comment: 13 pages: 10 pages paper, 3 pages appendix. Appears at EMNLP 201
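One common way to adapt a classifier to a shifted label distribution is to add the log-ratio of target to source class priors to each class logit. The sketch below illustrates that general idea only. It is an assumption about what such an adjustment can look like, not the authors' implementation (see their repository for that), and the function names, logits and priors are hypothetical.

```python
import math


def adjust_logits(logits, source_prior, target_prior):
    """Shift each class logit by log(target_prior / source_prior)."""
    return [
        z + math.log(t) - math.log(s)
        for z, s, t in zip(logits, source_prior, target_prior)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical three-class example.
logits = [2.0, 1.5, 0.1]          # scores from a model trained on the source data
source_prior = [0.7, 0.2, 0.1]    # label distribution in the training set
target_prior = [0.3, 0.5, 0.2]    # label distribution estimated on the target set

adjusted = adjust_logits(logits, source_prior, target_prior)
before = max(range(len(logits)), key=lambda k: logits[k])
after = max(range(len(adjusted)), key=lambda k: adjusted[k])
probabilities = softmax(adjusted)
```

In this example the prediction moves from class 0, which is over-represented in the source data, to class 1, which is more frequent in the target data. Only the class offsets change, so no retraining is needed.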