Customer profile classification using transactional data
Customer profiles are by definition made up of factual and transactional data. Often, for reasons such as the high cost of data acquisition and/or protection, only the transactional data are available for data mining operations. Transactional data, however, tend to be highly sparse and skewed because a large proportion of customers engage in very few transactions. This can bias the prediction accuracy of classifiers built on such data towards the larger proportion of customers with fewer transactions. This paper investigates an approach for accurately and confidently grouping and classifying customers in bins on the basis of the number of their transactions. The experiments we conducted on a highly sparse and skewed real-world transactional dataset show that our proposed approach can be used to identify a critical point at which customer profiles can be more confidently distinguished.
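As a rough illustration of grouping customers into bins by transaction count, the following Python sketch counts transactions per customer and assigns each customer to a bin. The function name `bin_customers`, the bin edges, and the toy data are assumptions made for illustration; the abstract does not specify how the bins or the critical point are actually chosen.

```python
from bisect import bisect_right
from collections import Counter

def bin_customers(transactions, edges):
    """Group customer ids into bins by their number of transactions.

    transactions: iterable of customer ids, one entry per transaction.
    edges: sorted lower bounds of bins 1..k (hypothetical values);
           counts below edges[0] fall into bin 0.
    """
    counts = Counter(transactions)
    bins = {}
    for customer, n in counts.items():
        # bisect_right returns how many edges are <= n, i.e. the bin index.
        bins.setdefault(bisect_right(edges, n), []).append(customer)
    return bins

# Toy data: "a" has 1 transaction, "b" has 3, "c" has 6.
toy = ["a", "b", "b", "b", "c", "c", "c", "c", "c", "c"]
print(bin_customers(toy, edges=[2, 5]))
```

With skewed data most customers would land in bin 0, which is the imbalance the abstract describes; a classifier could then be trained and evaluated per bin.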
Innovative Hybridisation of Genetic Algorithms and Neural Networks in Detecting Marker Genes for Leukaemia Cancer
Methods for extracting marker genes that trigger the growth of cancerous cells from highly complex microarrays are of much interest to the computing community. Through the identified genes, the pathology of cancerous cells can be revealed and early precautions can be taken to prevent further proliferation of cancerous cells. In this paper, we propose an innovative hybridised gene identification framework based on genetic algorithms and neural networks to identify marker genes for leukaemia. Our approach confirms that high classification accuracy does not ensure that the optimal set of genes has been identified, and our model delivers a more promising set of genes even with a lower classification accuracy.
Winners’ notes. Using Multi-Resolution Clustering for Music Genre Identification
Article describing a less technical version of our winning entry in the ISMIS 2011 Music Genre competition.
Virtual Screening of Bioassay Data
Background: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data is highly imbalanced, with a low ratio of Active compounds to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.
Results: Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. In six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and C4.5 decision tree learner have performed relatively well. It was also found that the setting of the Weka cost matrix is dependent on the base classifier used and not solely on the ratio of class imbalance.
Conclusions: Understandably, pharmaceutical data is hard to obtain. However, it would be beneficial to both the pharmaceutical industry and to academics for curated primary screening and corresponding confirmatory data to be provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, the search space of compounds to be screened could be reduced, and second, by analysing the false positives that occur in the primary screening process, the technology may be improved. The number of false positives arising from primary screening leads to the issue of whether this type of data should be used for virtual screening. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing differing classifiers for the same dataset.
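The role of a misclassification cost matrix can be sketched with the generic minimum-expected-cost decision rule. The Python below is not Weka's API and is not taken from the paper; the function name `min_expected_cost_class`, the probabilities, and the cost values are all hypothetical and serve only to show how a cost matrix shifts a prediction on imbalanced Active/Inactive data.

```python
def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[i]: estimated probability that the true class is i.
    cost[i][j]: cost of predicting class j when the true class is i.
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)

# Classes: 0 = Inactive, 1 = Active. All numbers are illustrative assumptions.
probs = [0.9, 0.1]            # a classifier's estimate for one compound
uniform = [[0, 1], [1, 0]]    # every error costs the same
weighted = [[0, 1], [20, 0]]  # missing an Active costs 20 times more

print(min_expected_cost_class(probs, uniform))
print(min_expected_cost_class(probs, weighted))
```

Because the decision depends on the probability estimates as well as the costs, two base classifiers that output different estimates for the same compound can need different cost settings to reach the same decision, which is consistent with the finding that the cost matrix should not be set from the class ratio alone.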
Therapeutic opportunities within the DNA damage response
The DNA damage response (DDR) is essential for maintaining the genomic integrity of the cell, and its disruption is one of the hallmarks of cancer. Classically, defects in the DDR have been exploited therapeutically in the treatment of cancer with radiation therapies or genotoxic chemotherapies. More recently, protein components of the DDR systems have been identified as promising avenues for targeted cancer therapeutics. Here, we present an in-depth analysis of the function, role in cancer and therapeutic potential of 450 expert-curated human DDR genes. We discuss the DDR drugs that have been approved by the US Food and Drug Administration (FDA) or that are under clinical investigation. We examine large-scale genomic and expression data for 15 cancers to identify deregulated components of the DDR, and we apply systematic computational analysis to identify DDR proteins that are amenable to modulation by small molecules, highlighting potential novel therapeutic targets.
Distinctive Behaviors of Druggable Proteins in Cellular Networks
The interaction environment of a protein in a cellular network is important in defining the role that the protein plays in the system as a whole, and thus its potential suitability as a drug target. Despite the importance of the network environment, it is neglected during target selection for drug discovery. Here, we present the first systematic, comprehensive computational analysis of topological, community and graphical network parameters of the human interactome and identify discriminatory network patterns that strongly distinguish drug targets from the interactome as a whole. Importantly, we identify striking differences in the network behavior of targets of cancer drugs versus targets from other therapeutic areas and explore how they may relate to successful drug combinations to overcome acquired resistance to cancer drugs. We develop, computationally validate and provide the first public-domain predictive algorithm for identifying druggable neighborhoods based on network parameters. We also make available full predictions for 13,345 proteins to aid target selection for drug discovery. All target predictions are available through canSAR.icr.ac.uk (http://canSAR.icr.ac.uk). Underlying data and tools are available at https://cansar.icr.ac.uk/cansar/publications/druggable_network_neighbourhoods/.
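Two of the simpler topological parameters that such an analysis can draw on, node degree and PageRank, can be sketched in a few lines of Python. This is a minimal illustration on a made-up four-protein network, not the authors' algorithm; the protein labels, the damping factor and the iteration count are assumptions.

```python
def degree(adj):
    """Number of interaction partners for each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph with no isolated nodes."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        # Each node receives an equal share of the rank of every neighbour.
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nbr] / len(adj[nbr]) for nbr in adj[node])
            for node in adj
        }
    return rank

# Hypothetical toy interaction network in which P1 is a hub.
adj = {
    "P1": {"P2", "P3", "P4"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2"},
    "P4": {"P1"},
}

print(degree(adj))
print(pagerank(adj))
```

In this toy network the hub P1 has both the highest degree and the highest PageRank. A predictive model of the kind described would combine many such per-protein parameters as features.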
Enrichment and depletion of key parameters in drug targets over what can be expected at random from the interactome.
A) Graphlets and their constituent isomorphism orbits. The graph shows the graphlets and orbits, ordered by descending size and complexity, most enriched in cancer-drug targets (light blue bars). These same graphlets and orbits are either slightly depleted or not differentiated from random in targets of non-cancer drugs (dark blue). The gray line represents graphlet size and complexity (high-to-low). B) The distribution of detected community sizes and the enrichment or depletion of cancer drug targets (light blue) versus targets of drugs used to treat other diseases (dark blue). C) Box plots showing the distinction in degree and Google PageRank, as well as the vertex modularity, which distinguishes inter- versus intra-community communication of nodes. Further parameters are shown in the Supporting Information.
Cancer-drug targets are enriched for highly connected Graphlets.
A) Interaction network highlighting the distribution of targets of approved cancer drugs (pink); targets of approved drugs from non-cancer therapeutic areas (blue); and targets predicted to be druggable by different druggability prediction methodologies (light and dark green). Druggable proteins are spread widely across the network, while targets of currently approved drugs tend to cluster into a few areas. B) Cumulative fraction of all drug targets covered by communities. As indicated, a small number of communities cover the majority of drug targets. C) The network communities most enriched in drug targets are listed against the fold enrichment of the number of targets found in them (compared to what can be expected at random).
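The quantities in panels B and C can be illustrated with a short Python sketch. It assumes the simplest null model, in which the number of targets expected in a community is proportional to the community's size, and it assumes each target belongs to exactly one community; the caption does not state the exact null model used, and all names and numbers below are hypothetical.

```python
from itertools import accumulate

def fold_enrichment(targets_in_community, community_size,
                    total_targets, total_proteins):
    """Observed targets in a community divided by the number expected at random."""
    expected = community_size * total_targets / total_proteins
    return targets_in_community / expected

def cumulative_coverage(targets_per_community):
    """Cumulative fraction of all targets covered, largest communities first."""
    ordered = sorted(targets_per_community, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]

# Hypothetical numbers: 10 targets in a 50-protein community,
# out of 100 targets among 1000 proteins overall.
print(fold_enrichment(10, 50, 100, 1000))
# Hypothetical target counts for three communities.
print(cumulative_coverage([5, 30, 15]))
```

A fold enrichment above 1 means the community holds more targets than its size alone would predict, and a coverage curve that rises steeply corresponds to a few communities holding most of the targets.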
Network profiles and interactions between targets of drug combinations.
A) Radar plots showing representative network property profiles of targets of drug combinations. MEK and BRAF network property profiles are more similar to one another than the network profiles of CDKs and HMGCR. This may be related to the long-term effectiveness of the combinations of drugs targeting these proteins. B) Interactions between proteins targeted by drug combinations, showing a high level of connectivity between targets such as EGFR, BRAF and MEK. The dotted edge indicates that no direct interaction takes place between HMGCR and the other proteins in the network.