22 research outputs found

    Rapid Quantification of Molecular Diversity for Selective Database Acquisition

    There is an increasing need to expand the structural diversity of the molecules investigated in lead-discovery programs. One way to achieve this is to acquire external datasets that will enhance an existing database. This paper describes a rapid procedure for selecting external datasets using a measure of structural diversity that is calculated from sums of pairwise intermolecular structural similarities.
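A minimal sketch of a diversity score built from sums of pairwise similarities. The abstract does not name the similarity coefficient or the fingerprint type, so the cosine coefficient, the NumPy array input, and the function names `mean_pairwise_cosine` and `diversity` are all assumptions made for illustration. With cosine similarity, the sum over all pairs equals the squared length of the summed unit vectors, so no explicit pairwise loop is needed.

```python
import numpy as np

def mean_pairwise_cosine(fingerprints):
    """Mean cosine similarity over all ordered pairs (i, j), including i == j.

    Assumes every row has at least one non-zero entry.
    """
    X = np.asarray(fingerprints, dtype=float)
    # Scale each row to unit length so a dot product is a cosine similarity.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    # sum_i sum_j (u_i . u_j) == (sum_i u_i) . (sum_j u_j)
    s = U.sum(axis=0)
    n = U.shape[0]
    return float(s @ s) / (n * n)

def diversity(fingerprints):
    """Hypothetical diversity score: one minus the mean pairwise similarity."""
    return 1.0 - mean_pairwise_cosine(fingerprints)

if __name__ == "__main__":
    existing = [[1, 1, 0, 0], [1, 0, 1, 0]]
    candidate = [[0, 0, 1, 1]]
    print(diversity(existing), diversity(existing + candidate))
```

Under these assumptions, a candidate dataset could be ranked by how much it raises the score when merged with the existing database.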

    Development and evaluation of clustering techniques for finding people

    Typically, in a large organisation, much expertise and knowledge is held informally in employees' own memories. When employees leave an organisation, many documented links that go through them are broken, and usually no mechanism is available to repair those links. This matchmaking problem is related to the problem of finding potential work partners in a large, distributed organisation. This paper reports a comparative investigation into using standard information retrieval techniques to group employees together based on their web pages. This information can, hopefully, subsequently be used to redirect broken links to people who worked closely with a departed employee, or to highlight people, say in different departments, who work on similar topics. The paper reports the design and positive results of an experiment conducted at Risø National Laboratory comparing four different IR searching and clustering approaches using real users' web pages.

    Automatic document clustering using topic analysis

    Web users are demanding more of current search engines, as can be seen from their behaviour when interacting with them [12, 28]. Besides traditional query/results interactions, other tools are springing up on the web, among them web document clustering systems. The idea is for the user to interact with the system by navigating through an organised hierarchy of topics. Document clustering is ideal for unspecified search goals or for the exploration of a topic by the inexpert [21]. It transforms the current interaction of searching through a large number of links into a more efficient one of navigating through hierarchies. This report gives an overview of the major work in this area; we also present our current work, our progress, and the pitfalls being tackled.

    Chemoinformatics Research at the University of Sheffield: A History and Citation Analysis

    This paper reviews the work of the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield, focusing particularly on the work carried out in the period 1985-2002. Four major research areas are discussed, involving the development of methods for: substructure searching in databases of three-dimensional structures, including both rigid and flexible molecules; the representation and searching of the Markush structures that occur in chemical patents; similarity searching in databases of both two-dimensional and three-dimensional structures; and compound selection and the design of combinatorial libraries. An analysis of citations to 321 publications from the Group shows that it attracted a total of 3725 residual citations during the period 1980-2002. These citations appeared in 411 different journals, and involved 910 different citing organizations from 54 different countries, thus demonstrating the widespread impact of the Group's work.

    Fundamental activity constraints lead to specific interpretations of the connectome

    The continuous integration of experimental data into coherent models of the brain is an increasing challenge of modern neuroscience. Such models provide a bridge between structure and activity, and identify the mechanisms giving rise to experimental observations. Nevertheless, structurally realistic network models of spiking neurons are necessarily underconstrained even if experimental data on brain connectivity are incorporated to the best of our knowledge. Guided by physiological observations, any model must therefore explore the parameter ranges within the uncertainty of the data. Based on simulation results alone, however, the mechanisms underlying stable and physiologically realistic activity often remain obscure. We here employ a mean-field reduction of the dynamics, which allows us to incorporate activity constraints into the process of model construction. We shape the phase space of a multi-scale network model of the vision-related areas of macaque cortex by systematically refining its connectivity. Fundamental constraints on the activity, i.e., prohibiting quiescence and requiring global stability, prove sufficient to obtain realistic layer- and area-specific activity. Only small adaptations of the structure are required, showing that the network operates close to an instability. The procedure identifies components of the network critical to its collective dynamics and creates hypotheses for structural data and future experiments. The method can be applied to networks involving any neuron model with a known gain function. Comment: J. Schuecker and M. Schmidt contributed equally to this work.

    Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

    We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary, and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
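To make the tf-idf cosine step and the top-n filtering concrete, here is a small pure-Python sketch. It is not the study's pipeline: the whitespace tokenizer, the raw-count times log(N/df) weighting, and the function names `tfidf_vectors`, `cosine`, and `top_n_neighbours` are assumptions chosen for brevity, and the toy documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n_neighbours(docs, n_keep=1):
    """For each document, keep only its n_keep most similar other documents."""
    vectors = tfidf_vectors(docs)
    kept = {}
    for i, vi in enumerate(vectors):
        sims = [(cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by lower document index.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        kept[i] = [(j, s) for s, j in sims[:n_keep]]
    return kept

if __name__ == "__main__":
    toy_docs = [
        "gene expression in cancer cells",
        "cancer gene mutation analysis",
        "traffic beam alignment vehicles",
    ]
    print(top_n_neighbours(toy_docs, n_keep=1))
```

The filtered neighbour lists form a sparse similarity graph, which is the kind of input a subsequent clustering step would consume.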

    Mmwave Beam Management in Urban Vehicular Networks

    Millimeter-wave (mmwave) communication represents a potential solution to capacity shortage in vehicular networks. However, effective beam alignment between senders and receivers requires accurate knowledge of the vehicles' position for fast beam steering, which is often impractical to obtain in real time. We address this problem by leveraging the traffic signals regulating vehicular mobility: as an example, we may coordinate beams with red traffic lights, as they correspond to higher vehicle densities and lower speeds. To evaluate our intuition, we propose a tractable, yet accurate, mmwave communication model accounting for both the distance and the heading of vehicles being served. Using such a model, we optimize the beam design and define a low-complexity, heuristic strategy. For increased realism, we consider as a reference scenario a large-scale, real-world mobility trace of vehicles in Luxembourg. The results show that our approach closely matches the optimum and always outperforms static beam design based on road topology alone. Remarkably, it also yields better performance than solutions based on real-time mobility information.

    Improved adaptive semi-unsupervised weighted oversampling (IA-SUWO) using sparsity factor for imbalanced datasets

    The imbalanced data problem is common in data mining owing to the skewed nature of data, which negatively impacts classification in machine learning. As a preprocessing step, oversampling techniques have significantly benefitted the imbalanced domain: artificial data is generated in the minority class to increase the number of samples and balance the distribution of samples across both classes. However, existing oversampling techniques suffer from overfitting and over-generalization problems, which lessen classifier performance. Although many clustering-based oversampling techniques largely overcome these problems, most of them are not able to produce the appropriate number of synthetic samples in minority clusters. This study proposes an improved Adaptive Semi-unsupervised Weighted Oversampling (IA-SUWO) technique that uses a sparsity factor to determine the sparse minority samples in each minority cluster. The technique considers the sparse minority samples that are far from the decision boundary. These samples also carry important information for learning the minority class; if they are also considered for oversampling, the imbalance ratio is reduced further and the learnability of the classifiers could be enhanced. The outcomes of the proposed approach have been compared with existing oversampling techniques such as SMOTE, Borderline-SMOTE, Safe-level SMOTE, and the standard A-SUWO technique in terms of accuracy. The comparative analysis revealed that the proposed oversampling approach increased accuracy by 5% on average, from 85% to 90%, relative to the existing techniques compared.

    Dissimilarity-based algorithms for selecting structurally diverse sets of compounds

    This paper begins with a brief introduction to modern techniques for the computational analysis of molecular diversity and the design of combinatorial libraries. It then reviews dissimilarity-based algorithms for the selection of structurally diverse sets of compounds in chemical databases. Procedures are described for selecting a diverse subset of an entire database, and for selecting diverse combinatorial libraries using both reagent-based and product-based selection.
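As an illustration of what a dissimilarity-based selection procedure can look like, the sketch below implements a greedy MaxMin-style selection. The abstract does not say which algorithms the paper covers, so the choice of MaxMin, the Tanimoto distance on feature sets, the fixed seed compound, and the names `tanimoto_distance` and `maxmin_select` are all assumptions, and the compounds are toy data.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of structural features."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union

def maxmin_select(compounds, k, seed_index=0):
    """Greedily pick k compounds, each time taking the one whose nearest
    already-selected compound is farthest away."""
    selected = [seed_index]
    # Distance from every compound to its nearest selected compound so far.
    min_dist = [tanimoto_distance(c, compounds[seed_index]) for c in compounds]
    target = min(k, len(compounds))
    while len(selected) < target:
        candidates = [i for i in range(len(compounds)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        # Update nearest-selected distances with the newly chosen compound.
        for i, c in enumerate(compounds):
            min_dist[i] = min(min_dist[i], tanimoto_distance(c, compounds[nxt]))
    return selected

if __name__ == "__main__":
    toy_compounds = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
    print(maxmin_select(toy_compounds, k=3))
```

Keeping a running nearest-selected distance per compound means each added compound costs one pass over the database, rather than a recomputation of all pairwise distances.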