Text Mining Infrastructure in R
During the last decade, text mining has become a widely used discipline that draws on statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We give a survey of text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.
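As a rough illustration of the count-based analysis mentioned in this abstract, the sketch below builds a small document-term matrix. It is written in plain Python rather than R and does not use or mirror the tm package's API; the helper names `tokenize` and `document_term_matrix` and the sample documents are assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical sample documents, not taken from the paper.
docs = ["Text mining in R", "Mining text with kernels", "R packages"]
vocabulary, rows = document_term_matrix(docs)
```

Each row of `rows` is the term-frequency vector of one document, which is the usual starting point for clustering or classification.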
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of the
system that relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used both for deriving
the topics that run through the system and for clustering them. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct a contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on JEdit, an open source editor, demonstrates that topics inferred
by performing topic analysis on the contextual representations are more
meaningful than those inferred from the plain representation of the documents.
The proposed approach of introducing a context model for source code
identifiers paves the way for building tools that support developers in
program comprehension tasks such as application and domain concept location,
software modularization and topic analysis.
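The first context model described above, in which an identifier is abstracted to its type, can be loosely pictured as follows. This is only a sketch of the general idea, not the authors' implementation: the (identifier, declared type) pairs, the module names, and the use of cosine similarity over type counts are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def type_context_vector(declarations):
    """Count declared types, discarding the identifier names themselves."""
    return Counter(type_name for _identifier, type_name in declarations)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical (identifier, declared type) pairs for three modules.
editor = type_context_vector([("buf", "Buffer"), ("caret", "Position"), ("text", "Buffer")])
viewer = type_context_vector([("document", "Buffer"), ("cursor", "Position")])
logger = type_context_vector([("out", "Stream"), ("level", "int")])

sim_editor_viewer = cosine(editor, viewer)
sim_editor_logger = cosine(editor, logger)
```

In this toy setting, `editor` and `viewer` share no identifier names, so a plain token-based representation would treat them as unrelated, while the type-based vectors place them close together.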
Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces
Transfer operators such as the Perron--Frobenius or Koopman operator play an
important role in the global analysis of complex dynamical systems. The
eigenfunctions of these operators can be used to detect metastable sets, to
project the dynamics onto the dominant slow processes, or to separate
superimposed signals. We extend transfer operator theory to reproducing kernel
Hilbert spaces and show that these operators are related to Hilbert space
representations of conditional distributions, known as conditional mean
embeddings in the machine learning community. Moreover, numerical methods to
compute empirical estimates of these embeddings are akin to data-driven methods
for the approximation of transfer operators such as extended dynamic mode
decomposition and its variants. One main benefit of the presented kernel-based
approaches is that these methods can be applied to any domain where a
similarity measure given by a kernel is available. We illustrate the results
with the aid of guiding examples and highlight potential applications in
molecular dynamics as well as video and text data analysis.
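The remark that these methods apply wherever a kernel is available can be made concrete with a small sketch. Kernel-based estimators of this kind typically touch the data only through kernel evaluations collected in Gram matrices. The code below computes such matrices for time-lagged pairs from a toy trajectory and stops there; it does not perform the eigendecomposition. The Gaussian kernel, the bandwidth `sigma`, the names `G_xx` and `G_xy`, and the data are assumptions chosen for illustration, not details from the abstract.

```python
from math import exp


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2.0 * sigma ** 2))


def gram_matrix(xs, ys, kernel):
    """Matrix of kernel evaluations with entry [i][j] = kernel(xs[i], ys[j])."""
    return [[kernel(x, y) for y in ys] for x in xs]


# Hypothetical one-dimensional trajectory; Y holds the time-lagged states of X.
trajectory = [(0.0,), (0.5,), (0.9,), (1.2,)]
X = trajectory[:-1]
Y = trajectory[1:]

G_xx = gram_matrix(X, X, gaussian_kernel)  # similarities among current states
G_xy = gram_matrix(X, Y, gaussian_kernel)  # similarities between current and lagged states
```

Swapping `gaussian_kernel` for a kernel on strings or images would leave `gram_matrix` unchanged, which is the sense in which such methods are domain-agnostic.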