1,303 research outputs found
Feature selection, optimization and clustering strategies of text documents
Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering is observed across a wide range of sectors, including consumer segmentation, categorization, shared filtering, document management, and indexing. The clustering task must be studied before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been made to achieve efficient clustering of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. It also details prominent models proposed for clustering, along with the pros and cons of each model, and covers recent developments in clustering for social networks and associated environments.
Joint Clustering and Registration of Functional Data
Curve registration and clustering are fundamental tools in the analysis of
functional data. While several methods have been developed and explored for
either task individually, limited work has been done to infer functional
clusters and register curves simultaneously. We propose a hierarchical model
for joint curve clustering and registration. Our proposal combines a Dirichlet
process mixture model for clustering of common shapes, with a reproducing
kernel representation of phase variability for registration. We show how
inference can be carried out applying standard posterior simulation algorithms
and compare our method to several alternatives in both engineered data and a
benchmark analysis of the Berkeley growth data. We conclude our investigation
with an application to time course gene expression.
Resampling with neural networks for stochastic parameterization in multiscale systems
In simulations of multiscale dynamical systems, not all relevant processes
can be resolved explicitly. Taking the effect of the unresolved processes into
account is important, which introduces the need for parameterizations. We present
a machine-learning method, used for the conditional resampling of observations
or reference data from a fully resolved simulation. It is based on the
probabilistic classification of subsets of reference data, conditioned on
macroscopic variables. This method is used to formulate a parameterization that
is stochastic, taking the uncertainty of the unresolved scales into account. We
validate our approach on the Lorenz 96 system, using two different parameter
settings which are challenging for parameterization methods. Comment: 27 pages, 17 figures. Submitte
Supporting Source Code Search with Context-Aware and Semantics-Driven Query Reformulation
Software bugs and failures cost trillions of dollars every year, and can even lead to deadly accidents (e.g., the Therac-25 accident). During maintenance, software developers fix numerous bugs and implement hundreds of new features by making necessary changes to the existing software code. Once an issue report (e.g., bug report, change request) is assigned to a developer, she chooses a few important keywords from the report as a search query, and then attempts to find the exact locations in the software code that need to be either repaired or enhanced. As a part of this maintenance, developers also often select ad hoc queries on the fly, and attempt to locate reusable code on the Internet that could assist them either in bug fixing or in feature implementation. Unfortunately, even experienced developers often fail to construct the right search queries. Even if the developers come up with a few ad hoc queries, most of them require frequent modifications, which cost significant development time and effort. Thus, construction of an appropriate query for localizing software bugs, programming concepts or even reusable code is a major challenge. In this thesis, we overcome this query construction challenge with six studies, and develop a novel, effective code search solution (BugDoctor) that assists developers in localizing the software code of interest (e.g., bugs, concepts and reusable code) during software maintenance. In particular, we reformulate a given search query (1) by designing novel keyword selection algorithms (e.g., CodeRank) that outperform the traditional alternatives (e.g., TF-IDF), (2) by leveraging the bug report quality paradigm and source document structures, which were previously overlooked, and (3) by exploiting the crowd knowledge and word semantics derived from the Stack Overflow Q&A site, which were previously untapped.
Our experiment using 5000+ search queries (bug reports, change requests, and ad hoc queries) suggests that our proposed approach can improve the given queries significantly through automated query reformulations. Comparison with 10+ existing studies on bug localization, concept location and Internet-scale code search suggests that our approach can outperform the state-of-the-art approaches by a significant margin.
Novel techniques of computational intelligence for analysis of astronomical structures
Gravitational forces cause the formation and evolution of a variety of cosmological structures. The detailed investigation and study of these structures is a crucial step towards our understanding of the universe. This thesis provides several solutions for the detection and classification of such structures. In the first part of the thesis, we focus on astronomical simulations, and we propose two algorithms to extract stellar structures. Although they follow different strategies (the first is a downsampling method, while the second keeps all samples), both techniques help to build more effective probabilistic models. In the second part, we consider observational data, and the goal is to overcome some of the common challenges in observational data, such as noisy features and imbalanced classes. For instance, when not enough examples are present in the training set, two different strategies are used: a) a nearest neighbor technique and b) an outlier detection technique. In summary, both parts of the thesis show the effectiveness of automated algorithms in extracting valuable information from astronomical databases.