A New Computer-Aided Diagnosis System with Modified Genetic Feature Selection for BI-RADS Classification of Breast Masses in Mammograms
Mammography remains the most prevalent imaging tool for early breast cancer
screening. The language used to describe abnormalities in mammographic reports
is based on the Breast Imaging Reporting and Data System (BI-RADS). Assigning a
correct BI-RADS category to each examined mammogram is a strenuous and
challenging task, even for experts. This paper proposes a new and effective
computer-aided diagnosis (CAD) system to classify mammographic masses into four
assessment categories in BI-RADS. The mass regions are first enhanced by means
of histogram equalization and then semiautomatically segmented based on the
region growing technique. A total of 130 handcrafted BI-RADS features are then
extracted from the shape, margin, and density of each mass, together with the
mass size and the patient's age, as mentioned in BI-RADS mammography. Then, a
modified feature selection method based on the genetic algorithm (GA) is
proposed to select the most clinically significant BI-RADS features. Finally, a
back-propagation neural network (BPN) is employed for classification, and its
accuracy is used as the fitness in the GA. A set of 500 mammogram images from the
Digital Database for Screening Mammography (DDSM) is used for evaluation. Our
system achieves classification accuracy, positive predictive value, negative
predictive value, and Matthews correlation coefficient of 84.5%, 84.4%, 94.8%,
and 79.3%, respectively. To the best of our knowledge, this is the best current result
for BI-RADS classification of breast masses in mammography, which makes the
proposed system promising for supporting radiologists in deciding on proper patient
management based on the automatically assigned BI-RADS categories.
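The wrapper idea in this abstract (a genetic algorithm searching over feature subsets, with classifier accuracy as the fitness) can be sketched as follows. This is only an illustration under stated assumptions: it is a plain GA with elitism, not the paper's modified variant; a nearest-centroid classifier stands in for the back-propagation network; fitness is training accuracy on synthetic data; and all names (`ga_select`, `nearest_centroid_accuracy`) are hypothetical.

```python
import random


def nearest_centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the selected features.

    Assumption: this stands in for the BPN accuracy used as fitness in the abstract.
    """
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for k, j in enumerate(cols):
            acc[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    correct = 0
    for row, label in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((row[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == label
    return correct / len(y)


def ga_select(X, y, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Plain GA over boolean feature masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n = len(X[0])  # requires at least two features for one-point crossover

    def score(mask):
        return fitness(X, y, mask)

    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0][:]]           # elitism: keep the current best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation with probability p_mut per gene
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=score)
    return best, score(best)


# Synthetic two-class data: features 0 and 3 carry signal, the rest are noise.
data_rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = i % 2
    X.append([
        2.0 * label + data_rng.gauss(0, 0.3),
        data_rng.gauss(0, 1),
        data_rng.gauss(0, 1),
        -1.5 * label + data_rng.gauss(0, 0.3),
        data_rng.gauss(0, 1),
    ])
    y.append(label)

best_mask, best_acc = ga_select(X, y, nearest_centroid_accuracy)
print(best_mask, best_acc)
```

In practice the fitness would normally be estimated on held-out data or by cross-validation rather than on the training set, since training accuracy tends to favour overfitted subsets.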
Fuzzy-rough set models and fuzzy-rough data reduction
Rough set theory is a powerful tool for analyzing information systems. Fuzzy rough sets were introduced as a fuzzy generalization of rough sets. This paper reviews the most important contributions to rough set theory, fuzzy rough set theory, and their applications. In many real-world situations, some of the attribute values for an object may be in set-valued form. To handle this problem, we present a more general approach to the fuzzification of rough sets. Specifically, we define a broad family of fuzzy rough sets. This paper presents a new development for rough set theory by combining classical rough set theory with interval-valued fuzzy sets. The proposed methods are illustrated by a numerical example based on a real case.
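For readers unfamiliar with the classical theory that the fuzzy models generalize, the lower and upper approximations of a crisp rough set can be sketched as below. This covers only the classical (non-fuzzy) case, not the interval-valued fuzzy construction the abstract proposes; the decision table, attribute names, and the function `approximations` are hypothetical.

```python
from collections import defaultdict


def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of `target`.

    Objects that agree on every attribute in `attributes` are indiscernible
    and fall into the same equivalence class.
    """
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class lies entirely inside the target set
            lower |= block
        if block & target:    # class overlaps the target set
            upper |= block
    return lower, upper


# Hypothetical decision table: o1 and o2 are indiscernible.
table = {
    "o1": {"fever": "yes", "cough": "yes"},
    "o2": {"fever": "yes", "cough": "yes"},
    "o3": {"fever": "no", "cough": "yes"},
    "o4": {"fever": "no", "cough": "no"},
}
flu = {"o1", "o3"}

lower, upper = approximations(table, ["fever", "cough"], flu)
print(sorted(lower))  # ['o3']
print(sorted(upper))  # ['o1', 'o2', 'o3']
```

The boundary region, `upper - lower`, contains the objects whose membership in the target set cannot be decided from the chosen attributes; here it is `{o1, o2}`.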
Survey on Classification Algorithms for Data Mining: (Comparison and Evaluation)
Data mining is growing fast in popularity. It is a technology that involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The main goal of the data mining process is to extract information from large data sets into a form that is understandable for further use. Some data mining algorithms are used to solve classification problems in databases. In this paper, three classification algorithms are compared: the k-nearest neighbor classifier, the decision tree, and the Bayesian network. The paper demonstrates the strength and accuracy of each algorithm for classification in terms of performance efficiency and required time complexity. For model validation, a twenty-four-month data analysis is conducted on a mock-up basis. Keywords: decision tree, Bayesian network, k-nearest neighbour classifier
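Of the three classifiers compared, the k-nearest neighbor rule is the simplest to state in code. The following is a minimal sketch with Euclidean distance and majority voting, on hypothetical toy data; it is not taken from the survey, and the function name `knn_predict` is an assumption.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))  # a
print(knn_predict(train, (5.0, 5.1)))  # b
```

This naive version sorts the whole training set for every query, so each prediction costs O(n log n) in the number of training points, which is the kind of time-complexity trade-off the survey compares across algorithms.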
Feature Selection and Overlapping Clustering-Based Multilabel Classification Model
Multilabel classification (MLC) learning, which is widely applied in real-world applications, is a very important problem in machine learning. Some studies show that a clustering-based MLC framework performs effectively compared to a nonclustering framework. In this paper, we explore the clustering-based MLC problem. Multilabel feature selection also plays an important role in classification learning because many redundant and irrelevant features can degrade performance and a good feature selection algorithm can reduce computational complexity and improve classification accuracy. In this study, we consider feature dependence and feature interaction simultaneously, and we propose a multilabel feature selection algorithm as a preprocessing stage before MLC. Typically, existing cluster-based MLC frameworks employ a hard cluster method. In practice, the instances of multilabel datasets are distinguished in a single cluster by such frameworks; however, the overlapping nature of multilabel instances is such that, in real-life applications, instances may not belong to only a single class. Therefore, we propose an MLC model that combines feature selection with an overlapping clustering algorithm. Experimental results demonstrate that various clustering algorithms show different performance for MLC, and the proposed overlapping clustering-based MLC model may be more suitable.
Variational Autoencoder Based Estimation Of Distribution Algorithms And Applications To Individual Based Ecosystem Modeling Using EcoSim
Individual-based modeling provides a bottom-up approach wherein interactions give rise to high-level phenomena in patterns equivalent to those found in nature. This method generates an immense amount of data through artificial simulation and can be made tractable by machine learning where multidimensional data is optimized and transformed. Using an individual-based modeling platform known as EcoSim, we modeled the abilities of elitist sexual selection and communication of fear. Data from these experiments was reduced in dimension using a novel algorithm that we propose: Variational Autoencoder based Estimation of Distribution Algorithms with Population Queue and Adaptive Variance Scaling (VAE-EDA-Q AVS). We constructed a novel Estimation of Distribution Algorithm (EDA) by extending generative models known as variational autoencoders (VAE). VAE-EDA-Q smooths the data generation process using an iteratively updated queue (Q) of populations. Adaptive Variance Scaling (AVS) dynamically updates the variance at which models are sampled based on fitness. The combination of VAE-EDA-Q with AVS demonstrates high computational efficiency and requires few fitness evaluations. We extended VAE-EDA-Q AVS to act as a feature-reducing wrapper method in conjunction with C4.5 decision trees to reduce the dimensionality of data. The relationship between sexual selection, random selection, and speciation is a contested topic. Supporting evidence suggests that sexual selection drives speciation. Opposing evidence contends that the correlation is either negative or absent. We utilized EcoSim to model elitist and random mate selection. Our results demonstrated a significantly lower speciation rate, a significantly lower extinction rate, and a significantly higher turnover rate for sexual selection groups. Species diversification was found to display no significant difference.
The relationship between communication and foraging behavior similarly features opposing hypotheses, which claim both increases and decreases in foraging behavior in response to alarm communication. Through modeling with EcoSim, we found that alarm communication decreases foraging activity in most cases, yet gradually increases foraging activity in some other cases. Furthermore, we found that both outcomes resulting from alarm communication increase fitness compared to non-communication.
Regression-based heterogeneity analysis to identify overlapping subgroup structure in high-dimensional data
Heterogeneity is a hallmark of complex diseases. Regression-based
heterogeneity analysis, which is directly concerned with outcome-feature
relationships, has led to a deeper understanding of disease biology. Such an
analysis identifies the underlying subgroup structure and estimates the
subgroup-specific regression coefficients. However, most of the existing
regression-based heterogeneity analyses can only address disjoint subgroups;
that is, each sample is assigned to only one subgroup. In reality, some samples
have multiple labels, for example, many genes have several biological
functions, and some cells of pure cell types transition into other types over
time, which suggests that their outcome-feature relationships (regression
coefficients) can be a mixture of relationships in more than one subgroup, and
as a result, the disjoint subgrouping results can be unsatisfactory. To this
end, we develop a novel approach to regression-based heterogeneity analysis,
which takes into account possible overlaps between subgroups and high data
dimensions. A subgroup membership vector is introduced for each sample, which
is combined with a loss function. Considering the lack of information arising
from small sample sizes, a norm penalty is developed for each membership
vector to encourage similarity in its elements. A sparse penalization is also
applied for regularized estimation and feature selection. Extensive simulations
demonstrate its superiority over direct competitors. The analysis of Cancer
Cell Line Encyclopedia data and lung cancer data from The Cancer Genome Atlas
shows that the proposed approach can identify an overlapping subgroup structure
with favorable performance in prediction and stability.
Comment: 33 pages, 16 figures
Formation of regulatory modules by local sequence duplication
Turnover of regulatory sequence and function is an important part of
molecular evolution. But what are the modes of sequence evolution leading to
rapid formation and loss of regulatory sites? Here, we show that a large
fraction of neighboring transcription factor binding sites in the fly genome
have formed from a common sequence origin by local duplications. This mode of
evolution is found to produce regulatory information: duplications can seed new
sites in the neighborhood of existing sites. Duplicate seeds evolve
subsequently by point mutations, often towards binding a different factor than
their ancestral neighbor sites. These results are based on a statistical
analysis of 346 cis-regulatory modules in the Drosophila melanogaster genome,
and a comparison set of intergenic regulatory sequence in Saccharomyces
cerevisiae. In fly regulatory modules, pairs of binding sites show
significantly enhanced sequence similarity up to distances of about 50 bp. We
analyze these data in terms of an evolutionary model with two distinct modes of
site formation: (i) evolution from independent sequence origin and (ii)
divergent evolution following duplication of a common ancestor sequence. Our
results suggest that pervasive formation of binding sites by local sequence
duplications distinguishes the complex regulatory architecture of higher
eukaryotes from the simpler architecture of unicellular organisms.