Normalized Information Distance
The normalized information distance is a universal distance measure for
objects of all kinds. It is based on Kolmogorov complexity and thus
uncomputable, but there are ways to utilize it. First, compression algorithms
can be used to approximate the Kolmogorov complexity if the objects have a
string representation. Second, for names and abstract concepts, page count
statistics from the World Wide Web can be used. These practical realizations of
the normalized information distance can then be applied to machine learning
tasks, especially clustering, to perform feature-free and parameter-free data
mining. This chapter discusses the theoretical foundations of the normalized
information distance and both practical realizations. It presents numerous
examples of successful real-world applications based on these distance
measures, ranging from bioinformatics to music clustering to machine
translation.
Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in:
Information Theory and Statistical Learning, Eds. M. Dehmer, F.
Emmert-Streib, Springer-Verlag, New-York, To appea
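The compression-based realization described in this abstract is commonly written as the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the length of the compressed input. The following is a minimal Python sketch, assuming zlib as the compressor (the abstract does not name one) and using hypothetical function names:

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input; a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: zlib is the compressor; any real compressor only
    approximates the (uncomputable) Kolmogorov complexity.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"pack my box with five dozen liquor jugs " * 20
    print("distance to itself:     ", ncd(a, a))
    print("distance to other text: ", ncd(a, b))
```

Values near 0 suggest the two inputs share most of their information, and values near 1 suggest they share little. Because real compressors are imperfect, the distance of an object to itself is usually small but not exactly 0.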
An Empirical Proposal towards the Algorithmic Approach and Pattern in Web Mining for Assorted Applications
ABSTRACT: Data mining, the analysis phase of the knowledge discovery process, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The classical goal of data mining and machine learning is to extract information from a data set and transform it into an understandable structure for further use. Besides the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Web usage mining is a data mining technique that discovers interesting usage patterns in web data in order to better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behavior at a web site. Web usage mining may be classified further by the kind of usage data considered: web server data, application server data, and application-level data. Web server data correspond to the user logs collected at the web server; typical data collected and saved there include IP addresses, page references, and access times of the users. This paper proposes a new technique for discovering the web usage patterns of websites from server log files, founded on clustering and an improved Apriori algorithm.
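To make the Apriori step concrete, here is a minimal Python sketch of the classic Apriori algorithm applied to page sets per session. It is not the improved variant the paper proposes, whose details the abstract does not give. The session data, page paths, and function name are hypothetical.

```python
from itertools import combinations


def apriori(sessions, min_support):
    """Return {frozenset(pages): count} for page sets seen in >= min_support sessions."""
    sessions = [frozenset(s) for s in sessions]

    # Level 1: count how many sessions contain each single page.
    counts = {}
    for s in sessions:
        for page in s:
            key = frozenset([page])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: c for k, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        # Count the surviving candidates against the sessions.
        counts = {c: sum(1 for s in sessions if c <= s) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    # Hypothetical sessions, e.g. reconstructed from server log entries.
    sessions = [
        {"/home", "/products", "/cart"},
        {"/home", "/products"},
        {"/home", "/about"},
        {"/products", "/cart"},
    ]
    for pages, count in sorted(apriori(sessions, 2).items(), key=lambda kv: -kv[1]):
        print(sorted(pages), count)
```

In practice the sessions would first have to be reconstructed from the raw log (for example by grouping requests by IP address and access time), a preprocessing step this sketch leaves out.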
Structured and Unstructured Information Extraction Using Text Mining and Natural Language Processing Techniques
Information on the web is increasing ad infinitum. The web has thus become an unstructured global space in which information, even when available, cannot be used directly for the desired applications. Users often face information overload and need automated help. Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents by means of text mining and natural language processing (NLP) techniques. The extracted structured information can be used for a variety of enterprise or personal tasks of varying complexity. IE also draws on a set of knowledge in order to answer user queries in natural language. The system is based on a fuzzy logic engine, whose flexibility is used to manage sets of accumulated knowledge. These sets may be built in hierarchic levels using a tree structure. IE derives structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. Data mining research assumes that the information to be “mined” is already in the form of a relational database. IE can serve as an important technology for text mining. If the knowledge to be discovered is expressed directly in the documents to be mined, then IE alone can serve as an effective approach to text mining. However, if the documents contain concrete data in unstructured form rather than abstract knowledge, it may be useful to first use IE to transform the unstructured data in the document corpus into a structured database, and then use traditional data mining tools to identify abstract patterns in the extracted data. We propose a novel method for text mining with natural language processing techniques that extracts information from a database efficiently; the extraction time and accuracy are measured and plotted in simulation.
The method extracts the attributes of entities and relationship entities from structured and semi-structured information. Results are compared with conventional methods.
A comparative study of the AHP and TOPSIS methods for implementing load shedding scheme in a pulp mill system
The advancement of technology has encouraged mankind to design and create useful
equipment and devices that users can fully utilize in various
applications. A pulp mill is a heavy industry that consumes a large amount of
electricity in its production. Because of this, any malfunction of the equipment might
cause heavy losses to the company. In particular, the breakdown of one generator
would cause the other generators to be overloaded. In the meantime, subsequent
loads are shed until the remaining generators can supply enough power to the other
loads. Once the fault has been fixed, the load shedding scheme can be deactivated.
Thus, a load shedding scheme is the best way of handling such a condition. Selected loads
are shed under this scheme in order to protect the generators from being
damaged. Multi-Criteria Decision Making (MCDM) can be applied to determine
the load shedding scheme in an electric power system. In this thesis, two methods,
the Analytic Hierarchy Process (AHP) and the Technique for Order Preference by
Similarity to Ideal Solution (TOPSIS), are introduced and applied.
A series of analyses is conducted and the results are determined. Of these two
methods, the results show that TOPSIS is the better
MCDM method for the load shedding scheme in the pulp mill
system. TOPSIS is the more effective solution because it gives the higher percentage
effectiveness of load shedding of the two methods. The results of the AHP
and TOPSIS analyses of the pulp mill system are very promising.
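For readers unfamiliar with TOPSIS, the textbook procedure can be sketched in a few lines of Python. This is a generic illustration, not the thesis's implementation. The alternatives, criteria, weights, and numbers are hypothetical and do not come from the pulp mill study.

```python
import math


def topsis(matrix, weights, benefit):
    """Closeness score in [0, 1] for each alternative (row); higher is better.

    matrix[i][j] is the value of alternative i on criterion j.
    benefit[j] is True when larger values of criterion j are preferred.
    Assumes no criterion column is all zeros and the rows are not all identical.
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply its weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]

    # Ideal and anti-ideal points, taken per criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    scores = []
    for row in v:
        d_best = math.dist(row, ideal)   # Euclidean distance to the ideal point
        d_worst = math.dist(row, anti)   # Euclidean distance to the anti-ideal point
        scores.append(d_worst / (d_best + d_worst))
    return scores


if __name__ == "__main__":
    # Hypothetical candidates scored on two made-up criteria:
    # column 0 is a benefit criterion, column 1 is a cost criterion.
    candidates = [[9, 1], [5, 5], [1, 9]]
    print(topsis(candidates, weights=[0.5, 0.5], benefit=[True, False]))
```

The alternative closest to the ideal point and farthest from the anti-ideal point receives the highest score, and ranking the alternatives by this score gives the TOPSIS preference order.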
Astroinformatics, data mining and the future of astronomical research
Astronomy, like many other scientific disciplines, is facing a true data deluge
which is bound to change both the praxis and the methodology of everyday
research work. The emerging field of astroinformatics, while on the one hand
crucial to facing the technological challenges, is on the other opening
exciting new perspectives for astronomical discoveries through the
implementation of advanced data mining procedures. The complexity of
astronomical data and the variety of scientific problems, however, call for
innovative algorithms and methods as well as for an extreme usage of ICT
technologies.
Comment: To appear in the Proceedings of the 2-nd International Conference on
Frontiers on diagnostic technologie
Enumerating Maximal Bicliques from a Large Graph using MapReduce
We consider the enumeration of maximal bipartite cliques (bicliques) from a
large graph, a task central to many practical data mining problems in social
network analysis and bioinformatics. We present novel parallel algorithms for
the MapReduce platform, and an experimental evaluation using Hadoop MapReduce.
Our algorithm is based on clustering the input graph into smaller sized
subgraphs, followed by processing different subgraphs in parallel. Our
algorithm uses two ideas that enable it to scale to large graphs: (1) the
redundancy in work between different subgraph explorations is minimized through
a careful pruning of the search space, and (2) the load on different reducers
is balanced through the use of an appropriate total order among the vertices.
Our evaluation shows that the algorithm scales to large graphs with millions of
edges and tens of millions of maximal bicliques. To our knowledge, this is
the first work on maximal biclique enumeration for graphs of this scale.
Comment: A preliminary version of the paper was accepted at the Proceedings of
the 3rd IEEE International Congress on Big Data 201
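To illustrate what is being enumerated, the sketch below lists the maximal bicliques of a tiny bipartite graph by brute force. It is a sequential, exponential-time illustration restricted to bipartite input, not the paper's MapReduce algorithm, and the graph and function name are hypothetical.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques (A, B) of a small bipartite graph.

    adj maps each left vertex to the set of right vertices adjacent to it.
    Brute force over subsets of left vertices: suitable for toy inputs only.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right vertices adjacent to every vertex in the chosen subset.
            common = set.intersection(*(set(adj[u]) for u in subset))
            if not common:
                continue
            # Close the left side: all left vertices adjacent to all of `common`.
            closed_left = frozenset(u for u in left if common <= set(adj[u]))
            found.add((closed_left, frozenset(common)))
    return found


if __name__ == "__main__":
    graph = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3}}
    for a_side, b_side in maximal_bicliques(graph):
        print(sorted(a_side), sorted(b_side))
```

Each reported pair is closed on both sides, so neither side can be extended without breaking the biclique. The exponential loop over subsets is the cost that pruning and parallel subgraph processing are meant to avoid at scale.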