    Multi-Source Multi-View Clustering via Discrepancy Penalty

    With the advance of technology, entities can be observed in multiple views. Multiple views containing different types of features can be used for clustering. Although multi-view clustering has been applied successfully in many applications, previous methods usually assume a complete instance mapping between views. In many real-world applications, information is gathered from multiple sources, and each source can contain multiple views, which are more cohesive for learning. The views under the same source are usually fully mapped, but they can be very heterogeneous. Moreover, the mappings between different sources are usually incomplete and only partially observed, which makes it more difficult to integrate all the views across sources. In this paper, we propose MMC (Multi-source Multi-view Clustering), a framework based on collective spectral clustering with a discrepancy penalty across sources, to tackle these challenges. MMC has several advantages over existing methods. First, it can deal with incomplete mappings between sources. Second, it considers the disagreements between sources while treating the views in the same source as a cohesive set. Third, it tries to infer the instance similarities across sources to enhance clustering performance. Extensive experiments on real-world data demonstrate the effectiveness of the proposed approach.

    Online Unsupervised Multi-view Feature Selection

    In the era of big data, it is becoming common to have data with multiple modalities or coming from multiple sources, known as "multi-view data". Because multi-view data are usually unlabeled and come from high-dimensional spaces (such as language vocabularies), unsupervised multi-view feature selection is crucial to many applications. However, it is nontrivial due to the following challenges. First, there may be too many instances, or the feature dimensionality may be too large, so the data may not fit in memory. How can useful features be selected with limited memory space? Second, how can features be selected from streaming data while handling concept drift? Third, how can the consistent and complementary information from different views be leveraged to improve feature selection when the data are too big or arrive as streams? To the best of our knowledge, none of the previous works can solve all of these challenges simultaneously. In this paper, we propose an Online unsupervised Multi-View Feature Selection method, OMVFS, which deals with large-scale/streaming multi-view data in an online fashion. OMVFS embeds unsupervised feature selection into a clustering algorithm via nonnegative matrix factorization (NMF) with sparse learning. It further incorporates graph regularization to preserve local structure information and help select discriminative features. Instead of storing all the historical data, OMVFS processes the multi-view data chunk by chunk and aggregates all the necessary information into several small matrices. By using a buffering technique, OMVFS reduces the computational and storage cost while taking advantage of the structure information. Furthermore, OMVFS can capture concept drift in the data streams. Extensive experiments on four real-world datasets show the effectiveness and efficiency of the proposed method. More importantly, OMVFS is about 100 times faster than the off-line methods.
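The chunk-by-chunk aggregation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a plain NMF objective X ≈ U Vᵀ with the standard multiplicative update for the feature factor V, which depends on the data only through the small matrices UᵀU and XᵀU. This is not the OMVFS update rule (the abstract does not give it, and the sketch omits sparsity, graph regularization, and multiple views). The class `ChunkAggregator`, its methods, and the toy data are all hypothetical.

```python
def matmul_at_b(a, b):
    """Return a^T b for row-major lists of lists (a is n x p, b is n x q)."""
    p, q = len(a[0]), len(b[0])
    out = [[0.0] * q for _ in range(p)]
    for row_a, row_b in zip(a, b):
        for i in range(p):
            for j in range(q):
                out[i][j] += row_a[i] * row_b[j]
    return out


class ChunkAggregator:
    """Hypothetical helper: keeps only U^T U (k x k) and X^T U (d x k)."""

    def __init__(self, n_features, n_components):
        self.utu = [[0.0] * n_components for _ in range(n_components)]
        self.xtu = [[0.0] * n_components for _ in range(n_features)]

    def update(self, x_chunk, u_chunk):
        """Fold one chunk (instances x features, instances x components) in."""
        for acc, delta in ((self.utu, matmul_at_b(u_chunk, u_chunk)),
                           (self.xtu, matmul_at_b(x_chunk, u_chunk))):
            for i, row in enumerate(delta):
                for j, value in enumerate(row):
                    acc[i][j] += value

    def update_v(self, v, eps=1e-9):
        """Standard multiplicative NMF step: V <- V * (X^T U) / (V U^T U)."""
        k = len(self.utu)
        new_v = []
        for i, row in enumerate(v):
            denom = [sum(row[l] * self.utu[l][j] for l in range(k))
                     for j in range(k)]
            new_v.append([row[j] * self.xtu[i][j] / (denom[j] + eps)
                          for j in range(k)])
        return new_v


# Toy data: 4 instances, 3 features, 2 components (all values made up).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]

agg = ChunkAggregator(n_features=3, n_components=2)
for start in (0, 2):  # two chunks of two instances each
    agg.update(X[start:start + 2], U[start:start + 2])

V_new = agg.update_v([[1.0, 1.0] for _ in range(3)])
```

After both chunks are folded in, the aggregated matrices equal the ones computed from the full data at once, so past chunks do not need to be kept in memory for this particular update.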

    Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial

    Objective: It is important to identify separate publications that report outcomes from the same underlying clinical trial, in order to avoid over-counting these as independent pieces of evidence. Methods: We created positive and negative training sets (comprised of pairs of articles reporting on the same condition and intervention) that were, or were not, linked to the same clinicaltrials.gov trial registry number. Features were extracted from MEDLINE and PubMed metadata; pairwise similarity scores were modeled using logistic regression. Results: Article pairs from the same trial were identified with high accuracy (F1 score = 0.843). We also created a clustering tool, Aggregator, that takes as input a PubMed user query for RCTs on a given topic, and returns article clusters predicted to arise from the same clinical trial. Discussion: Although painstaking examination of full-text may be needed to be conclusive, metadata are surprisingly accurate in predicting when two articles derive from the same underlying clinical trial.
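The general idea in the abstract above, scoring article pairs with a logistic model over metadata similarities and then grouping the predicted pairs, might look roughly like the sketch below. Everything specific here is an assumption for illustration: the two Jaccard features, the hand-picked weights and bias (a real model would be fitted on labeled pairs), the 0.5 threshold, the connected-components grouping, and the toy records. None of it is taken from the Aggregator tool itself.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(x, y):
    """Hypothetical metadata features: author overlap and title-word overlap."""
    return [
        jaccard(x["authors"], y["authors"]),
        jaccard(x["title"].lower().split(), y["title"].lower().split()),
    ]


# Hypothetical coefficients, chosen by hand for the example only.
BIAS = -3.0
WEIGHTS = [4.0, 3.0]


def same_trial_probability(x, y):
    """Logistic model over the pairwise similarity features."""
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, pair_features(x, y)))
    return 1.0 / (1.0 + math.exp(-z))


def cluster_articles(articles, threshold=0.5):
    """Group articles by linking every pair scored at or above the threshold."""
    parent = list(range(len(articles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(articles)), 2):
        if same_trial_probability(articles[i], articles[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, article in enumerate(articles):
        clusters.setdefault(find(i), []).append(article["id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Made-up records standing in for metadata returned by a query.
ARTICLES = [
    {"id": "A", "authors": ["Smith J", "Lee K"],
     "title": "Drug X for asthma: a randomized trial"},
    {"id": "B", "authors": ["Smith J", "Lee K", "Wu T"],
     "title": "Drug X for asthma: two-year follow-up"},
    {"id": "C", "authors": ["Garcia M"],
     "title": "Exercise therapy for back pain"},
]

CLUSTERS = cluster_articles(ARTICLES)
```

With these toy records, A and B share most authors and title words and end up in one cluster, while C stays on its own.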

    Unsupervised Learning from Multi-view Data

    With the advance of technology, data often have multiple modalities or come from multiple sources. Such data are called multi-view data. Usually, multiple views provide complementary information for the semantically same data. Learning from multi-view data can achieve better performance than relying on a single view. Also, as data volumes explode, most multi-view data are unlabeled, and labeling them is expensive. Thus, unsupervised learning from multi-view data is very important in many real-world applications. However, in real-world applications, multi-view data are usually heterogeneous (different feature spaces for different views), incomplete, large-scale, and high-dimensional. These challenges prevent us from applying existing unsupervised learning methods to real-world multi-view data. This dissertation presents my Ph.D. research on unsupervised learning from multi-view data. First, we present the first algorithm to solve the multiple incomplete views clustering problem by collectively learning the kernel matrices for different views. Second, we propose a more general tensor-based multi-incomplete-view clustering method, which uses a tensor to model the multiple incomplete views and learns the latent features by sparse tensor factorization. Third, we present a faster multi-incomplete-view clustering algorithm based on weighted nonnegative matrix factorization. Lastly, we propose an online multi-view unsupervised feature selection algorithm to address the scalability and high-dimensionality challenges.

    Nuggets: findings shared in multiple clinical case reports

    OBJECTIVE: The researchers assessed the prevalence, in the clinical case report literature, of multiple reports independently reporting the same (or nearly the same) main finding. METHODS: Results from forty-five PubMed queries were examined for incidence and features of main findings (“nuggets”) shared in at least four case reports. RESULTS: The authors found that nuggets are surprisingly prevalent and large in the case report literature; the largest found so far was reported in seventeen articles. In most cases, the main findings of case reports were evident from examining titles alone. CONCLUSIONS: Our curated examples should serve as gold standards for developing specific automated methods for finding nuggets. Nuggets potentially enable finding-based (instead of topic-based) information retrieval.

    Improving Soil Enzyme Activities and Related Quality Properties of Reclaimed Soil by Applying Weathered Coal in Opencast-Mining Areas of the Chinese Loess Plateau

    Reclaimed soil in opencast-mining areas of the Loess Plateau of China suffers from many problems, such as poor soil structure and extremely low soil nutrient levels. To find a better way to improve soil quality, the current study applied weathered coal to repair soil media and investigated the physicochemical properties of the reclaimed soil and the changes in enzyme activities after planting Robinia pseudoacacia. The results showed that applying weathered coal significantly improved the quality of soil aggregates, increased the content of water-stable aggregates, and significantly improved the organic matter, humus, and cation exchange capacity of the topsoil, but it did not have a significant effect on soil pH. Planting R. pseudoacacia significantly enhanced the activities of soil catalase, urease, and invertase, but the application of weathered coal inhibited the activity of catalase. Although applying an appropriate amount of weathered coal significantly increased urease activity, the activities of catalase, urease, and invertase were closely linked to soil profile level and time. This study suggests that applying weathered coal could improve the physicochemical properties and soil enzyme activities of reclaimed soil in opencast-mining areas of the Loess Plateau of China, and that the optimum application rate of weathered coal for reclaimed soil remediation is about 27 000 kg hm⁻².