105,734 research outputs found

    Identifying e-Commerce in Enterprises by means of Text Mining and Classification Algorithms

    Get PDF
    Monitoring specific features of the enterprises, for example, the adoption of e-commerce, is an important and basic task for several economic activities. This type of information is usually obtained by means of surveys, which are costly due to the amount of personnel involved in the task. An automatic detection of this information would allow consistent savings. This can actually be performed by relying on computer engineering, since in general this information is publicly available on-line through the corporate websites. This work describes how to convert the detection of e-commerce into a supervised classification problem, where each record is obtained from the automatic analysis of one corporate website, and the class is the presence or the absence of e-commerce facilities. The automatic generation of similar data records requires the use of several Text Mining phases; in particular we compare six strategies based on the selection of best words and best n-grams. After this, we classify the obtained dataset by means of four classification algorithms: Support Vector Machines; Random Forest; Statistical and Logical Analysis of Data; Logistic Classifier. This turns out to be a difficult case of classification problem. However, after a careful design and set-up of the whole procedure, the results on a practical case of Italian enterprises are encouraging

    A practical index for approximate dictionary matching with few mismatches

    Get PDF
    Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in qq-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low

    Precise n-gram Probabilities from Stochastic Context-free Grammars

    Full text link
    We present an algorithm for computing n-gram probabilities from stochastic context-free grammars, a procedure that can alleviate some of the standard problems associated with n-grams (estimation from sparse data, lack of linguistic structure, among others). The method operates via the computation of substring expectations, which in turn is accomplished by solving systems of linear equations derived from the grammar. We discuss efficient implementation of the algorithm and report our practical experience with it.Comment: 12 pages, to appear in ACL-9

    On the Termination of Linear and Affine Programs over the Integers

    Full text link
    The termination problem for affine programs over the integers was left open in\cite{Braverman}. For more that a decade, it has been considered and cited as a challenging open problem. To the best of our knowledge, we present here the most complete response to this issue: we show that termination for affine programs over Z is decidable under an assumption holding for almost all affine programs, except for an extremely small class of zero Lesbegue measure. We use the notion of asymptotically non-terminating initial variable values} (ANT, for short) for linear loop programs over Z. Those values are directly associated to initial variable values for which the corresponding program does not terminate. We reduce the termination problem of linear affine programs over the integers to the emptiness check of a specific ANT set of initial variable values. For this class of linear or affine programs, we prove that the corresponding ANT set is a semi-linear space and we provide a powerful computational methods allowing the automatic generation of these ANTANT sets. Moreover, we are able to address the conditional termination problem too. In other words, by taking ANT set complements, we obtain a precise under-approximation of the set of inputs for which the program does terminate.Comment: arXiv admin note: substantial text overlap with arXiv:1407.455
    • …
    corecore