Feature Extraction and Duplicate Detection for Text Mining: A Survey
Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of the data makes it very difficult to focus on the most relevant information. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from such data. The survey covers text summarization, classification, and clustering methods for discovering useful features, as well as the discovery of query facets: multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time the user spends. When dealing with collections of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data, so as to deliver fine-grained knowledge to the user.
Discovering Dense Correlated Subgraphs in Dynamic Networks
Given a dynamic network, where edges appear and disappear over time, we are
interested in finding sets of edges that have similar temporal behavior and
form a dense subgraph. Formally, we define the problem as the enumeration of
the maximal subgraphs that satisfy specific density and similarity thresholds.
To measure the similarity of the temporal behavior, we use the correlation
between the binary time series that represent the activity of the edges. For
the density, we study two variants based on the average degree. For these
problem variants we enumerate the maximal subgraphs and compute a compact
subset of subgraphs that have limited overlap. We propose an approximate
algorithm that scales well with the size of the network, while achieving a high
accuracy. We evaluate our framework on both real and synthetic datasets. The
results of the synthetic data demonstrate the high accuracy of the
approximation and show the scalability of the framework.
Comment: Full version of the paper included in the proceedings of the PAKDD 2021 conference.
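As a rough illustration of the two measures named in this abstract, the sketch below computes the Pearson correlation between two binary edge-activity series and the average degree of an edge set. The function names, the convention of returning 0.0 for a constant series, and the toy data are assumptions made for illustration; they are not taken from the paper, and the paper's two density variants are not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length activity series (0/1 values)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("series must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant series; 0.0 is an assumed convention.
        return 0.0
    return cov / sqrt(var_x * var_y)


def average_degree(edges):
    """Average degree 2|E| / |V| of the subgraph induced by a list of (u, v) edges."""
    nodes = {node for edge in edges for node in edge}
    if not nodes:
        return 0.0
    return 2 * len(edges) / len(nodes)


# Hypothetical toy data: two edges active at the same time steps, on a triangle.
edge_a = [1, 0, 1, 0]
edge_b = [1, 0, 1, 0]
triangle = [(1, 2), (2, 3), (1, 3)]
print(pearson(edge_a, edge_b), average_degree(triangle))
```

A candidate edge set would then be kept only if both values clear thresholds chosen by the analyst; how the paper combines the two checks is not specified in the abstract.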
Taming Numbers and Durations in the Model Checking Integrated Planning System
The Model Checking Integrated Planning System (MIPS) is a temporal least
commitment heuristic search planner based on a flexible object-oriented
workbench architecture. Its design clearly separates explicit and symbolic
directed exploration algorithms from the set of on-line and off-line computed
estimates and associated data structures. MIPS has shown distinguished
performance in the last two international planning competitions. In the last
event the description language was extended from pure propositional planning to
include numerical state variables, action durations, and plan quality objective
functions. Plans were no longer sequences of actions but time-stamped
schedules. As a participant of the fully automated track of the competition,
MIPS has proven to be a general system; in each track and every benchmark
domain it efficiently computed plans of remarkable quality. This article
introduces and analyzes the most important algorithmic novelties that were
necessary to tackle the new layers of expressiveness in the benchmark problems
and to achieve a high level of performance. The extensions include critical
path analysis of sequentially generated plans to generate corresponding optimal
parallel plans. The linear time algorithm to compute the parallel plan bypasses
known NP hardness results for partial ordering by scheduling plans with respect
to the set of actions and the imposed precedence relations. The efficiency of
this algorithm also allows us to improve the exploration guidance: for each
encountered planning state the corresponding approximate sequential plan is
scheduled. One major strength of MIPS is its static analysis phase that grounds
and simplifies parameterized predicates, functions and operators, that infers
knowledge to minimize the state description length, and that detects domain
object symmetries. The latter aspect is analyzed in detail. MIPS has been
developed to serve as a complete and optimal state space planner, with
admissible estimates, exploration engines and branching cuts. In the
competition version, however, certain performance compromises had to be made,
including floating point arithmetic, weighted heuristic search exploration
according to an inadmissible estimate and parameterized optimization
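
The critical path analysis mentioned in this abstract can be pictured with a generic sketch: given a sequential plan and a set of precedence relations, each action starts as early as its predecessors allow. This is a textbook earliest-start computation under assumed inputs (`durations`, `precedences`, and the function name are hypothetical), not the MIPS implementation.

```python
def schedule_parallel(durations, precedences):
    """Earliest-start schedule for a sequential plan.

    durations[i] is the duration of the i-th action of the sequential plan.
    precedences is a list of (i, j) pairs with i < j, meaning action i must
    finish before action j starts. Runs in time linear in the number of
    actions plus the number of precedence pairs.
    """
    n = len(durations)
    predecessors = [[] for _ in range(n)]
    for i, j in precedences:
        if not 0 <= i < j < n:
            raise ValueError("precedences must respect the sequential order")
        predecessors[j].append(i)

    start = [0.0] * n
    # The sequential order is already a topological order, so one pass suffices.
    for j in range(n):
        for i in predecessors[j]:
            start[j] = max(start[j], start[i] + durations[i])

    makespan = max((s + d for s, d in zip(start, durations)), default=0.0)
    return start, makespan


# Hypothetical plan: actions 0 and 1 are independent, 2 needs both, 3 needs 2.
starts, makespan = schedule_parallel([3, 2, 4, 1], [(0, 2), (1, 2), (2, 3)])
print(starts, makespan)
```

Because the precedence relations are taken as given, the sketch only schedules; it does not decide which orderings are required.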
A Distributed Algorithm for Clustering Large Graphs (Un algorithme distribué pour le clustering de grands graphes)
Graph clustering is one of the key techniques for understanding the structures present in graph data. Detecting clusters and identifying bridges and noise are also critical tasks, since they play an important role in graph analysis. Recently, several graph clustering algorithms have been proposed and used in many application domains. Most of these algorithms are based on structural clustering algorithms. However, the latter were designed for processing small graphs, so their performance can degrade on large graphs, which pose additional challenges. In this paper, we propose DSCAN, a distributed graph clustering algorithm based on structural clustering. Our algorithm is implemented on top of BLADYG, a framework for processing large graphs. The experimental evaluation of DSCAN showed its effectiveness and competitiveness for processing large graphs.
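
For readers unfamiliar with structural clustering, the sketch below computes a structural similarity between two vertices, assuming the SCAN-style definition (shared closed neighborhood size divided by the geometric mean of the two neighborhood sizes). Whether DSCAN uses exactly this measure is an assumption; the abstract does not say, and the adjacency data is hypothetical.

```python
from math import sqrt


def structural_similarity(adjacency, u, v):
    """SCAN-style similarity: |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|).

    adjacency maps each vertex to the set of its neighbors; N[x] is the
    closed neighborhood, i.e. the neighbors of x plus x itself.
    """
    closed_u = set(adjacency[u]) | {u}
    closed_v = set(adjacency[v]) | {v}
    return len(closed_u & closed_v) / sqrt(len(closed_u) * len(closed_v))


# Hypothetical graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(structural_similarity(graph, 1, 2))
print(structural_similarity(graph, 3, 4))
```

In SCAN-style methods, vertices joined by enough high-similarity edges form clusters, while the remaining vertices are classified as bridges or noise.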