Survey on Enhancing Clustering Output using Side Information by the Textual Extraction Mechanism

Author: Mr. Amol B. Mahadik, Prof. Yogesh B. Gurav
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 31/12/2014

In text mining, many operations are based on the statistical analysis of a term, word, or phrase. Clustering is a popular technique for automatically organizing a large collection of text; it is also used for text classification. Many text mining applications contain side information alongside the text documents, in the form of web documents, user-access web logs, and various links attached to text files. This side information is helpful for clustering, but using it is sometimes risky because it may add noise to the procedure. So we need a better technique for text mining to improve the quality of presentation. In this paper, we use different algorithms to enhance clustering quality with document-based, sentence-based, corpus-based, and combined-approach concept analysis, so as to maximize the benefits of using side information.

International Journal on Recent and Innovation Trends in Computing and Communication

Strategies and algorithms for clustering large datasets: a review

Author: Béjar Alonso Javier
Publication date: 01/01/2013

The exploratory nature of data analysis and data mining makes clustering one of the most common tasks in this kind of project. Increasingly, these projects come from many different application areas, such as biology, text analysis, and signal analysis, that involve ever larger datasets in both the number of examples and the number of attributes. Classical clustering methods like K-means or hierarchical clustering are beginning to reach the limit of their capability to cope with this increase in dataset size. The limitations of these algorithms come either from the need to store all the data in memory or from their computational time complexity. These problems have opened an area of research into algorithms able to reduce this data overload. Some solutions come from data preprocessing: transforming the data to a lower-dimensional manifold that represents the structure of the data, or summarizing the dataset by obtaining a smaller subset of examples that represents equivalent information. A different perspective is to modify the classical clustering algorithms or to derive other ones able to cluster larger datasets. This perspective relies on many different strategies. Techniques such as sampling, on-line processing, summarization, data distribution, and efficient data structures have been applied to the problem of scaling clustering algorithms. This paper presents a review of different strategies and clustering algorithms that apply these techniques. The aim is to cover the range of methodologies applied to clustering data and how they can be scaled. Preprint

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

User Behavior Analysis Using Web-based Machine Learning Features: new solutions for IT business

Author: Honsor Yuriy, Kononenko Oleksii, Kuchmak Yuriy, Mysiuk Roman, Osadets Nazar, Ozhyhov Oleksii, Pohrebniak Andrii, Svystovych Andriy
Publication venue: Publishing Center Dialog
Publication date: 01/01/2024

The development of information technologies in IT business increases interest in executing machine learning models directly in the client browser, which reduces the load on the server and the number of levels of access to it. At the same time, this approach has advantages and disadvantages, associated with the smaller amount of information transmitted over the network, the limited power of client devices, and other factors. Among modern client-side tools with machine learning capabilities, TensorFlow.js is suitable: it can be used to analyse user behaviour in web applications, to build classification and clustering models based on behavioural patterns, to predict future user behaviour trends, to detect unusual or suspicious user actions, and to build recommendation models based on previous behaviour. The article analyses the features of implementation and the limitations associated with its use, specifically regarding the behaviour of users in social networks. The model was formed based on data from news posts on the social networks Instagram and Facebook, with user activity parameters such as the number of likes, comments, and shares according to the post's text. These aspects are a significant addition to the tools that can be applied within the economic, technical, and other means of IT business development. Considering this, it is advisable to study the formation and development of the innovation management system in e-business in the future.

SSOAR - Social Science Open Access Repository

Traektoria Nauki

CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information

Author: Jain Prince, Talukdar Partha, Vashishth Shikhar
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 31/01/2019

Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manually defined feature spaces. Manual feature engineering is expensive and often sub-optimal. To overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI), a novel approach which performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness. Comment: Accepted at WWW 201

arXiv.org e-Print Archive

Open Access Repository of IISc Research Publications

On Graph Stream Clustering with Side Information

Author: Yu Philip S., Zhao Yuchen
Publication date: 28/01/2013

Graph clustering has become an important problem due to emerging applications involving the web, social networks, and bioinformatics. Recently, many such applications generate data in the form of streams. Clustering massive, dynamic graph streams is highly challenging because of the complex structure of graphs and the computational difficulties of continuous data. Meanwhile, a large volume of side information of various types is associated with graphs. Examples include the properties of users in social network activities, the meta attributes associated with web click graph streams, and the location information in mobile communication networks. Such attributes contain extremely useful information and have the potential to improve the clustering process, but are neglected by most recent graph stream mining techniques. In this paper, we define a unified distance measure on both link structures and side attributes for clustering. In addition, we propose a novel optimization framework, DMO, which can dynamically optimize the distance metric and adapt it to newly received stream data. We further introduce carefully designed statistics, SGS(C), which consume constant storage space as the stream progresses. We demonstrate that the statistics maintained are sufficient for the clustering process as well as the distance optimization, and that they scale to massive graphs with side attributes. We present experimental results that show the advantages of the approach over the baselines in graph stream clustering with both links and side information. Comment: Full version of SIAM SDM 2013 paper

arXiv.org e-Print Archive

Crossref