8 research outputs found

    SIMILARITY ENHANCEMENT IN TIME-AWARE RECOMMENDER SYSTEMS

    Time-aware recommender systems (TARS) are systems that take a time factor - the age of the user data - into account. There are three approaches to using a time factor: (1) the user data may be weighted according to its age, (2) it may be treated as a step in a biological process, and (3) it may be compared across different time frames to find significant patterns. This research deals with the last approach. When the data is divided into several time frames, matching users becomes more difficult - similarity between users that was identified over the total time frame may disappear when matching them within smaller time frames. The user matching problem is strongly affected by the sparsity problem, which is well known in the recommender system literature. Sparsity occurs when the number of actual interactions between users and items is much smaller than the number of possible interactions. Sparsity grows as the data is split into several time frames for comparison, and as it grows, matching similar users across time frames becomes harder, increasing the need to find relevant neighbouring users. Our research suggests a flexible solution for dealing with the similarity limitation of current methods. To overcome the similarity problem, we suggest decomposing items into multiple features. Using these features we extract several user interests, which can be compared among users. This comparison yields more user matches than current TARS methods produce.
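    A minimal sketch of the feature-based matching idea, assuming a toy rating set and hand-picked item features (neither comes from the paper): two users who share no rated items, and who would therefore show zero item-level similarity within a sparse time frame, can still be matched through the interests derived from item features.

        # Sketch: compare users on item *features* instead of raw items within
        # a single time frame, so sparse frames still yield matches.
        # All names and values below are illustrative, not from the paper.
        from collections import defaultdict
        from math import sqrt

        # (user, item, rating, timestamp) tuples for one time frame
        ratings = [
            ("alice", "m1", 5.0, 10), ("alice", "m2", 3.0, 12),
            ("bob",   "m3", 4.0, 11), ("bob",   "m4", 2.0, 14),
        ]
        # Each item is described by several features (e.g. genres)
        item_features = {
            "m1": {"sci-fi", "action"}, "m2": {"drama"},
            "m3": {"sci-fi"},           "m4": {"drama", "romance"},
        }

        def interest_profile(user):
            """Aggregate a user's ratings onto item features ('interests')."""
            profile = defaultdict(float)
            for u, item, r, _ in ratings:
                if u == user:
                    for f in item_features[item]:
                        profile[f] += r
            return profile

        def cosine(p, q):
            shared = set(p) & set(q)
            num = sum(p[f] * q[f] for f in shared)
            den = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
            return num / den if den else 0.0

        # alice and bob share no items, so item-level similarity would be 0,
        # but their feature-level interest profiles still overlap.
        print(cosine(interest_profile("alice"), interest_profile("bob")))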

    In Silico Toxicology Data Resources to Support Read-Across and (Q)SAR

    A plethora of databases exist online that can assist in in silico chemical or drug safety assessment. However, a systematic review and grouping of databases, based on purpose and information content and consolidated in a single source, has been lacking. To resolve this issue, this review provides a comprehensive listing of the key in silico data resources relevant to: chemical identity and properties, drug action, toxicology (including nanomaterial toxicity), exposure, omics, pathways, Absorption, Distribution, Metabolism and Elimination (ADME) properties, clinical trials, pharmacovigilance, patent-related databases, biological databases (genes, enzymes, proteins and other macromolecules), protein-protein interactions (PPIs), environmental exposure, and finally databases relating to animal alternatives in support of 3Rs policies. More than nine hundred databases were identified and reviewed against criteria relating to accessibility, data coverage, interoperability or application programming interface (API), appropriate identifiers, the types of in vitro, in vivo and clinical data recorded, and suitability for modelling, read-across or similarity searching. This review also specifically addresses the need for solutions for mapping and integrating databases into a common platform for better translatability of preclinical data to clinical data.

    Comprehensive survey on big data privacy protection

    In recent years, the ever-mounting problem of Internet phishing has threatened the secure propagation of sensitive data over the web, resulting in either the outright refusal of data providers to distribute data or the distribution of inaccurate data. User privacy has therefore evolved into a critical issue in various data mining operations and has become a foremost criterion for allowing the transfer of confidential information. The intense surge in storing the personal data of customers (i.e., big data) has given rise to a new research area referred to as privacy-preserving data mining (PPDM). A key issue in PPDM is how to modify data in a way that still allows a good data mining model to be built on the modified data, thereby meeting a specified privacy requirement with minimum loss of information for the intended data analysis task. The current review aims to show how data mining tasks can be carried out without risking the security of individuals' sensitive information, particularly at the record level. To this end, PPDM techniques are reviewed and classified according to the approach they use for data modification. Furthermore, a critical comparative analysis of the advantages and drawbacks of PPDM techniques is performed. This review also elaborates on the existing challenges and unresolved issues in PPDM.
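    A minimal sketch of one record-level data-modification approach in the spirit of PPDM - randomized response on a sensitive binary attribute - illustrating the trade-off between privacy and information loss. The flip probability, the 30% base rate and all names are illustrative assumptions, not taken from the review.

        # Sketch: perturb each individual's sensitive bit before release, then
        # recover the population-level statistic from the noisy reports.
        import random

        def randomize(value: bool, p: float = 0.25) -> bool:
            """Report the true value with probability 1 - p, otherwise flip it."""
            return value if random.random() >= p else not value

        def estimate_true_rate(reports, p: float = 0.25) -> float:
            """Unbias the observed rate: observed = true*(1-p) + (1-true)*p."""
            observed = sum(reports) / len(reports)
            return (observed - p) / (1.0 - 2.0 * p)

        random.seed(0)
        true_values = [random.random() < 0.3 for _ in range(100_000)]  # 30% sensitive
        reports = [randomize(v) for v in true_values]
        # Individual records are perturbed, yet the aggregate estimate stays close to 0.3.
        print(round(estimate_true_rate(reports), 3))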

    NanoSAR: In Silico Modelling of Nanomaterial Toxicity

    The number of engineered nanomaterials (ENMs) being exploited commercially is growing rapidly owing to their novel properties. Clearly, it is important to understand and ameliorate any risks to health or the environment posed by the presence of ENMs. However, there is still a critical gap in the literature on the (eco)toxicological properties of ENMs and the particular characteristics that influence their toxic effects. Given their increasing industrial and technological use, it is important to assess their potential health and environmental impacts in a time- and cost-effective manner. One strategy for coping with the large number and variety of ENMs is the development of data-driven models that decode the relationships between the biological activities of ENMs and their physicochemical characteristics. Although such structure-activity relationship (SAR) methods have proven effective in predicting the toxicity of substances in bulk form, their practical application to ENMs requires more research and further development. This study aimed to address this need by investigating the application to ENMs of data-driven toxicity modelling approaches (e.g. SAR), which are preferable to animal testing from a cost, time and ethical perspective. As part of this study, a large amount of data on ENM toxicity and properties was collected and analysed using quantitative methods to explore and explain the relationship between ENM properties and their toxic outcomes. More specifically, multi-dimensional data visualisation techniques, including heat maps combined with hierarchical clustering and parallel coordinate plots, were used for data exploration, while classification- and regression-based modelling tools, namely a genetic algorithm-based decision tree construction algorithm and partial least squares, were successfully applied to explain and predict ENM toxicity based on physicochemical characteristics. As a next step, the implementation of risk reduction measures for risks outside the range of tolerable limits was investigated. Overall, the results showed that computational methods hold considerable promise in their ability to identify and model the relationship between the physicochemical properties and biological effects of ENMs, making it possible to reach decisions more quickly and hence to provide practical solutions for the risk assessment problems caused by the diversity of ENMs.
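    A minimal sketch of a data-driven nanoSAR workflow of the kind described above, fitting partial least squares (one of the tools named in the abstract) with scikit-learn to relate physicochemical descriptors to a toxicity endpoint. The descriptor names, the generated values and the endpoint are illustrative assumptions and carry no experimental meaning.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(42)
        n = 200
        # Hypothetical descriptors: core size (nm), zeta potential (mV), surface area (m2/g)
        X = np.column_stack([
            rng.uniform(5, 100, n),
            rng.uniform(-40, 40, n),
            rng.uniform(10, 400, n),
        ])
        # Synthetic endpoint: smaller, more positively charged particles -> higher toxicity
        y = 2.0 - 0.01 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, n)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        pls = PLSRegression(n_components=2).fit(X_tr, y_tr)
        print("held-out R^2:", round(pls.score(X_te, y_te), 3))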

    Towards a review of the EC Recommendation for a definition of the term "nanomaterial"; Part 1: Compilation of information concerning the experience with the definition

    In October 2011 the European Commission (EC) published a Recommendation on the definition of nanomaterial (2011/696/EU). The purpose of this definition is to enable determination of when a material should be considered a nanomaterial for regulatory purposes in the European Union. In view of the upcoming review of the current EC definition of the term 'nanomaterial', and noting the need expressed by the EC Environment Directorate General and other Commission services for a set of scientifically sound reports as the basis for this review, the EC Joint Research Centre (JRC) is preparing three consecutive reports, of which this is the first. This Report 1 compiles information concerning experience with the definition regarding scientific-technical issues that should be considered when reviewing the current EC definition of nanomaterial. Based on this report and the feedback received, the JRC will write a second, follow-up report. In Report 2 the JRC will provide a detailed assessment of the scientific-technical issues compiled in Report 1, in relation to the objective of reviewing the current EC nanomaterial definition.

    Multifunctional nanoparticles and colloids based on transition element clusters and lanthanide complexes

    The first part of this work involves the development and characterization of novel multifunctional silica nanoparticles (NPs) with complex architectures. The challenge is to meet the increasing demand for new non-toxic colloidal systems that are magnetic and/or luminescent in the NIR region, for potential applications in biotechnology. This objective was achieved by closely associating molybdenum cluster compounds with maghemite and/or gold nanocrystals in 50 nm silica NPs. An evaluation of the cytotoxicity of NPs containing the transition element cluster compound Cs2Mo6Br14, and time-gated fluorescence microscopy of Cs2Mo6I8(C2F5COO)6@SiO2 NPs incorporated in cancer cells, are presented. In the second part, microcrystalline powders of heteronuclear lanthanide-based coordination polymers with the general chemical formula [Ln2-2xLn'2x(bdc)3,4H2O]∞, 0 ≤ x ≤ 1, were reduced to nanometric size in glycerol. These NPs exhibit luminescent properties identical to those of the bulk material. A detailed study of this new green synthetic route, together with a study of the stability of the obtained colloids over time and upon dilution, was performed.

    Scalability aspects of data cleaning

    Data cleaning has become one of the important pre-processing steps for many data science, data analytics, and machine learning applications. According to a survey by Gartner, more than 25% of the critical data in the world's top companies is flawed, which can result in economic losses amounting to trillions of dollars a year. Over the past few decades, several algorithms and tools have been developed to clean data. However, many of these solutions find it difficult to scale as the amount of data has increased over time. For example, these solutions often involve a quadratic number of tuple-pair comparisons or the generation of all possible column combinations. Both of these tasks can take days to finish if the dataset has millions of tuples or a few hundred columns, which is usually the case for real-world applications. Data cleaning tasks often exhibit a trade-off between scalability and the quality of the solution: one can achieve scalability by performing fewer computations, but at the cost of a lower-quality solution. Therefore, existing approaches exploit this trade-off when they need to scale to larger datasets, settling for a lower-quality solution. Some approaches have considered re-thinking solutions from scratch to achieve scalability and high quality. However, re-designing these solutions from scratch is a daunting task, as it would involve systematically analyzing the space of possible optimizations and then tuning the physical implementations for a specific computing framework, data size, and resources. Another component of these solutions that becomes critical with increasing data size is how the data is stored and fetched. For smaller datasets, most of the data fits in memory, so accessing it from a data store is not a bottleneck. For large datasets, however, these solutions need to constantly fetch and write data to a data store. As observed in this dissertation, data cleaning tasks have a lifecycle-driven data access pattern that is not well suited to traditional data stores, making these stores a bottleneck when cleaning large datasets. In this dissertation, we consider scalability as a first-class citizen for data cleaning tasks and propose that scalable and high-quality solutions can be achieved by adopting the following three principles: 1) a primitive-based re-writing of existing algorithms that allows efficient implementations for multiple computing frameworks, 2) efficiently involving a domain expert's knowledge to reduce computation and improve quality, and 3) using an adaptive data store that can transform the data layout based on the access pattern. We make contributions towards each of these principles. First, we present a set of primitive operations for discovering constraints from the data. These primitives facilitate re-writing efficient distributed implementations of existing discovery algorithms. Next, we present a framework involving domain experts for faster clustering selection for data de-duplication. This framework asks a bounded number of questions of a domain expert and uses their responses to select the best clustering with high accuracy. Finally, we present an adaptive data store that can change the layout of the data based on the workload's access pattern, hence speeding up data cleaning tasks.
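    A minimal sketch of why quadratic tuple-pair comparison becomes a scalability bottleneck, using blocking - a standard de-duplication technique, not the dissertation's own primitives - to cut down the candidate pairs. The toy records and the blocking key are illustrative assumptions.

        from collections import defaultdict
        from itertools import combinations

        # Toy records: (id, name). Comparing every pair is O(n^2) in the table size.
        records = [
            (1, "johnathan smith"), (2, "jonathan smith"),
            (3, "maria garcia"),    (4, "maria  garcia"),
            (5, "wei zhang"),
        ]

        def blocking_key(name: str) -> str:
            # Cheap key: first four characters of the last token of the name.
            return name.split()[-1][:4]

        # Group records into blocks; only records in the same block get compared.
        blocks = defaultdict(list)
        for rec in records:
            blocks[blocking_key(rec[1])].append(rec)

        candidate_pairs = [pair for block in blocks.values()
                           for pair in combinations(block, 2)]

        all_pairs = len(records) * (len(records) - 1) // 2
        print(f"compared {len(candidate_pairs)} candidate pairs instead of {all_pairs}")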

    Theoretical foundations for efficient clustering

    Clustering aims to group together data instances which are similar while simultaneously separating the dissimilar instances. The task of clustering is challenging due to many factors, the most well-studied being its high computational cost. The clustering task can be viewed as an optimization problem where the goal is to minimize a certain cost function (such as the k-means or k-median cost). Not only are these minimization problems NP-hard, they are often also NP-hard to approximate (within a constant factor). There are two other major issues in clustering, namely under-specificity and noise-robustness. The focus of this thesis is tackling these two issues while simultaneously ensuring low computational cost. Clustering is an under-specified task: the same dataset may need to be clustered in different ways depending upon the intended application, and different solution requirements need different approaches. In such situations, domain knowledge is needed to better define the clustering problem. We incorporate this by allowing the clustering algorithm to interact with an oracle by asking whether two points belong to the same or different clusters. In a preliminary work, we show that access to a small number of same-cluster queries makes an otherwise NP-hard k-means clustering problem computationally tractable. Next, we consider the problem of clustering for data de-duplication: detecting records which correspond to the same physical entity in a database. We propose a correlation clustering-like framework to model such record de-duplication problems. We show that access to a small number of same-cluster queries can help us solve the 'restricted' version of correlation clustering. Rather surprisingly, more relaxed versions of correlation clustering are intractable even when the algorithm is allowed to make a 'large' number of same-cluster queries. Next, we explore the issue of noise-robustness of clustering algorithms. Many real-world datasets have, on top of cohesive subsets, a significant number of points which are 'unstructured'. The addition of these noisy points makes it difficult to detect the structure of the remaining points. In the first line of work, we define noise as not having significantly large dense subsets. We provide computationally efficient clustering algorithms that capture all meaningful clusterings of the dataset, where the clusters are cohesive (defined formally by notions of clusterability) and the noise satisfies the gray background assumption. We complement our results by showing that when either the notion of structure or the noise requirement is relaxed, no such results are possible. In the second line of work, we develop a generic procedure that can transform objective-based clustering algorithms into ones that are robust to outliers (as long as the number of such points is not 'too large'). In particular, we develop efficient noise-robust versions of two common clustering algorithms and prove robustness guarantees for them.
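    A minimal sketch of the same-cluster-query interaction described above: an oracle answers whether two points belong to the same cluster, and each point is placed by querying one representative per cluster found so far. The toy points and the ground-truth labels standing in for the oracle are illustrative assumptions; this is not the thesis's query-efficient algorithm.

        from typing import List

        points = [0.1, 0.2, 5.0, 5.1, 9.8, 10.0]
        truth  = [0,   0,   1,   1,   2,   2]     # hidden labels the oracle consults

        def same_cluster(i: int, j: int) -> bool:
            """Oracle (e.g. a domain expert): do points i and j share a cluster?"""
            return truth[i] == truth[j]

        clusters: List[List[int]] = []
        queries = 0
        for idx in range(len(points)):
            placed = False
            for cluster in clusters:
                queries += 1
                if same_cluster(idx, cluster[0]):   # ask about one representative
                    cluster.append(idx)
                    placed = True
                    break
            if not placed:
                clusters.append([idx])

        # At most k queries per point, where k is the number of clusters found.
        print("clusters:", clusters, "| oracle queries:", queries)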