177,448 research outputs found

    Survey on Enhancing Clustering Output using Side Information by the Textual Extraction Mechanism

    Get PDF
    In text mining, more operations are based on the statistical analysis of a term, word or phrase. Clustering is a popular technique for automatically organizing a large collection of text; it is also used to text classification. Many text mining applications contains side information with text documents in the form of web documents, user access web-log, and different links attached with text files. This side information is helpful for clustering purpose but sometime it is risky to use side information because it may add noise to procedure. So we need a better technique for text mining to improve quality of presentation. In this paper, we are using different algorithms for enhancement of the clustering quality with the document-based, sentence-based, corpus-based, and combined approach concept analysis design, so as to maximize the benefits from using side information

    Strategies and algorithms for clustering large datasets: a review

    Get PDF
    The exploratory nature of data analysis and data mining makes clustering one of the most usual tasks in these kind of projects. More frequently these projects come from many different application areas like biology, text analysis, signal analysis, etc that involve larger and larger datasets in the number of examples and the number of attributes. Classical methods for clustering data like K-means or hierarchical clustering are beginning to reach its maximum capability to cope with this increase of dataset size. The limitation for these algorithms come either from the need of storing all the data in memory or because of their computational time complexity. These problems have opened an area for the search of algorithms able to reduce this data overload. Some solutions come from the side of data preprocessing by transforming the data to a lower dimensionality manifold that represents the structure of the data or by summarizing the dataset by obtaining a smaller subset of examples that represent an equivalent information. A different perspective is to modify the classical clustering algorithms or to derive other ones able to cluster larger datasets. This perspective relies on many different strategies. Techniques such as sampling, on-line processing, summarization, data distribution and efficient datastructures have being applied to the problem of scaling clustering algorithms. This paper presents a review of different strategies and clustering algorithms that apply these techniques. The aim is to cover the different range of methodologies applied for clustering data and how they can be scaled.Preprin

    User Behavior Analysis Using Web-based Machine Learning Features: new solutions for IT business

    Get PDF
    The development of information technologies in IT business increases the interest in executing machine learning models directly on the client browser, reducing the load on the server and the number of levels of access to it. At the same time, some features have advantages and disadvantages, associated with a smaller amount of information transmitted over the network, limited power of client devices, and others. Among modern client-side tools with machine learning capabilities, Tensorflow.js is suitable, which can be used to analyse user behaviour in web applications for classification and clustering models based on their behavioural patterns, predict future user behaviour trends, detect unusual or suspicious user actions, recommendation models based on their previous behaviour. The article analyses the features of implementation and the limitations associated with the use, specifically regarding the behaviour of users in social networks. The model was formed based on data from news posts on social networks Instagram and Facebook, with the following parameters of user activity, such as the number of likes, comments, and shares according to the post's text. These aspects are a significant addition to the tools that can be applied within the economic, technical, and other means of IT business development. Considering this, it is advisable to study the formation and development of the innovation management system in e-business in the future.The development of information technologies in IT business increases the interest in the execution of machine learning models directly on the client browser, reduces the load on the server and the number of levels of access to it. At the same time, there are some features that have advantages and disadvantages, which are associated with a smaller amount of information transmitted over the network, limited power of client devices, and others. Among modern client-side tools with machine learning capabilities, Tensorflow.js is suitable, which can be used to analyze user behavior in web applications for classification and clustering models based on their behavioral patterns, predict future user behavior trends, detect unusual or suspicious user actions, recommendation models based on their previous behavior. The article analyzes the features of implementation, the limitations associated with the use specifically for the behavior of users in social networks. The model was formed on the basis of data from news posts on social networks Instagram and Facebook with the following parameters of user activity as the number of likes, comments and shares according to the text of the post. These aspects are a significant addition to the tools that can be applied within the set of economic, technical and other means for IT business development. Taking this into account, in the future it is advisable to study the formation and development of the innovation management system in e-business

    CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

    Full text link
    Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manuallydefined feature spaces. Manual feature engineering is expensive and often sub-optimal. In order to overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI) - a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.Comment: Accepted at WWW 201

    On Graph Stream Clustering with Side Information

    Full text link
    Graph clustering becomes an important problem due to emerging applications involving the web, social networks and bio-informatics. Recently, many such applications generate data in the form of streams. Clustering massive, dynamic graph streams is significantly challenging because of the complex structures of graphs and computational difficulties of continuous data. Meanwhile, a large volume of side information is associated with graphs, which can be of various types. The examples include the properties of users in social network activities, the meta attributes associated with web click graph streams and the location information in mobile communication networks. Such attributes contain extremely useful information and has the potential to improve the clustering process, but are neglected by most recent graph stream mining techniques. In this paper, we define a unified distance measure on both link structures and side attributes for clustering. In addition, we propose a novel optimization framework DMO, which can dynamically optimize the distance metric and make it adapt to the newly received stream data. We further introduce a carefully designed statistics SGS(C) which consume constant storage spaces with the progression of streams. We demonstrate that the statistics maintained are sufficient for the clustering process as well as the distance optimization and can be scalable to massive graphs with side attributes. We will present experiment results to show the advantages of the approach in graph stream clustering with both links and side information over the baselines.Comment: Full version of SIAM SDM 2013 pape
    corecore