Identifying e-Commerce in Enterprises by means of Text Mining and Classification Algorithms
Monitoring specific features of enterprises, such as the adoption of e-commerce, is an important and basic task for several economic activities. This type of information is usually obtained by means of surveys, which are costly because of the number of personnel involved. Automatic detection of this information would allow considerable savings. Such detection is feasible because the information is, in general, publicly available online through corporate websites. This work describes how to convert the detection of e-commerce into a supervised classification problem, where each record is obtained from the automatic analysis of one corporate website, and the class is the presence or absence of e-commerce facilities. The automatic generation of such data records requires several Text Mining phases; in particular, we compare six strategies based on the selection of best words and best n-grams. We then classify the obtained dataset by means of four classification algorithms: Support Vector Machines, Random Forest, Statistical and Logical Analysis of Data, and Logistic Classifier. This turns out to be a difficult classification problem. However, after careful design and set-up of the whole procedure, the results on a practical case of Italian enterprises are encouraging.
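As a rough illustration of how website text might be turned into classification records, the following Python sketch extracts word n-grams, keeps the terms whose document frequency differs most between the two classes, and encodes each site as a binary vector. The scoring rule, the function names (`ngrams`, `select_terms`, `to_record`), and the toy texts are assumptions made for this example; the abstract does not say which six selection strategies the authors compare, and the classifier itself (for instance an SVM or Random Forest) is not shown.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) occurring in a text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_terms(docs, labels, top_k=5, n_max=2):
    """Keep the top_k n-grams by absolute difference in document frequency.

    Hypothetical selection criterion, chosen only for illustration.
    labels: 1 = site offers e-commerce, 0 = it does not.
    """
    pos, neg = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for text, label in zip(docs, labels):
        (pos if label else neg).update(ngrams(text, n_max))

    def score(term):
        return abs(pos[term] / max(n_pos, 1) - neg[term] / max(n_neg, 1))

    # Sort by descending score, then alphabetically so the result is deterministic.
    return sorted(set(pos) | set(neg), key=lambda t: (-score(t), t))[:top_k]


def to_record(text, terms, n_max=2):
    """Encode one website text as a binary presence vector over the selected terms."""
    present = ngrams(text, n_max)
    return [1 if t in present else 0 for t in terms]


# Toy corpus (invented): two sites with e-commerce, two without.
docs = [
    "add to cart checkout",
    "shopping cart checkout now",
    "company history",
    "contact us history",
]
labels = [1, 1, 0, 0]
terms = select_terms(docs, labels, top_k=4)
record = to_record("proceed to checkout with your cart", terms)
```

Each `record` produced this way is one row of the dataset; the class label comes from a manually annotated sample, and any standard supervised learner can then be trained on the rows.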
A practical index for approximate dictionary matching with few mismatches
Approximate dictionary matching is a classic string matching problem
(checking if a query string occurs in a collection of strings) with
applications in, e.g., spellchecking, online catalogs, geolocation, and web
searchers. We present a surprisingly simple solution called a split index,
which is based on the Dirichlet principle, for matching a keyword with few
mismatches, and experimentally show that it offers competitive space-time
tradeoffs. Our implementation in the C++ language is focused mostly on data
compaction, which is beneficial for the search speed (e.g., by being cache
friendly). We compare our solution with other algorithms and we show that it
performs better for the Hamming distance. Query times on the order of 1
microsecond were reported for one mismatch for the dictionary size of a few
megabytes on a medium-end PC. We also demonstrate that a basic compression
technique consisting in q-gram substitution can significantly reduce the
index size (up to 50% of the input text size for the DNA), while still keeping
the query time relatively low.
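The Dirichlet (pigeonhole) principle behind the split index can be sketched in a few lines of Python for the one-mismatch case: if a word is cut into two parts and the query differs from it in at most one position, at least one part must match exactly. This is a toy illustration under that assumption only; the class name `SplitIndex` and its layout are hypothetical, and the sketch ignores the data compaction and q-gram compression that the authors' C++ implementation focuses on.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Toy dictionary index for matching with at most one mismatch."""

    def __init__(self, words):
        # Key: (word length, which half, text of that half) -> words sharing it.
        self.buckets = defaultdict(list)
        for w in set(words):
            mid = len(w) // 2
            self.buckets[(len(w), 0, w[:mid])].append(w)
            self.buckets[(len(w), 1, w[mid:])].append(w)

    def query(self, q, max_mismatches=1):
        """Return the sorted dictionary words within Hamming distance 0 or 1 of q."""
        if max_mismatches not in (0, 1):
            raise ValueError("two parts only guarantee recall for <= 1 mismatch")
        mid = len(q) // 2
        # Pigeonhole: with <= 1 mismatch, one of the two halves is an exact match,
        # so every true match is in one of these two buckets.
        candidates = set(self.buckets.get((len(q), 0, q[:mid]), []))
        candidates.update(self.buckets.get((len(q), 1, q[mid:]), []))
        # Verify each candidate, since sharing a half is necessary but not sufficient.
        return sorted(w for w in candidates if hamming(w, q) <= max_mismatches)


index = SplitIndex(["cat", "cot", "dog", "cart"])
near_cut = index.query("cut")
exact_cat = index.query("cat", max_mismatches=0)
```

Handling k mismatches in the same spirit would mean splitting each word into k + 1 parts, at the cost of more buckets per word and more candidates to verify.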
Precise n-gram Probabilities from Stochastic Context-free Grammars
We present an algorithm for computing n-gram probabilities from stochastic
context-free grammars, a procedure that can alleviate some of the standard
problems associated with n-grams (estimation from sparse data, lack of
linguistic structure, among others). The method operates via the computation of
substring expectations, which in turn is accomplished by solving systems of
linear equations derived from the grammar. We discuss efficient implementation
of the algorithm and report our practical experience with it.
Comment: 12 pages, to appear in ACL-9
On the Termination of Linear and Affine Programs over the Integers
The termination problem for affine programs over the integers was left open
in [Braverman]. For more than a decade, it has been considered and cited as
a challenging open problem. To the best of our knowledge, we present here the
most complete response to this issue: we show that termination for affine
programs over Z is decidable under an assumption holding for almost all affine
programs, except for an extremely small class of zero Lebesgue measure. We use
the notion of asymptotically non-terminating initial variable values (ANT, for
short) for linear loop programs over Z. Those values are directly associated to
initial variable values for which the corresponding program does not terminate.
We reduce the termination problem of linear affine programs over the integers
to the emptiness check of a specific ANT set of initial variable values. For
this class of linear or affine programs, we prove that the corresponding ANT
set is a semi-linear space and we provide powerful computational methods
allowing the automatic generation of these sets. Moreover, we are able to
address the conditional termination problem too. In other words, by taking ANT
set complements, we obtain a precise under-approximation of the set of inputs
for which the program does terminate.
Comment: arXiv admin note: substantial text overlap with arXiv:1407.455