18,807 research outputs found

    Concept Extraction and Clustering for Topic Digital Library Construction

    Get PDF
    This paper is to introduce a new approach to build topic digital library using concept extraction and document clustering. Firstly, documents in a special domain are automatically produced by document classification approach. Then, the keywords of each document are extracted using the machine learning approach. The keywords are used to cluster the documents subset. The clustered result is the taxonomy of the subset. Lastly, the taxonomy is modified to the hierarchical structure for user navigation by manual adjustments. The topic digital library is constructed after combining the full-text retrieval and hierarchical navigation function

    How Much is the Whole Really More than the Sum of its Parts? 1 + 1 = 2.5: Superlinear Productivity in Collective Group Actions

    Full text link
    In a variety of open source software projects, we document a superlinear growth of production (R∼cβR \sim c^\beta) as a function of the number of active developers cc, with β≃4/3\beta \simeq 4/3 with large dispersions. For a typical project in this class, doubling of the group size multiplies typically the output by a factor 2β=2.52^\beta=2.5, explaining the title. This superlinear law is found to hold for group sizes ranging from 5 to a few hundred developers. We propose two classes of mechanisms, {\it interaction-based} and {\it large deviation}, along with a cascade model of productive activity, which unifies them. In this common framework, superlinear productivity requires that the involved social groups function at or close to criticality, in the sense of a subtle balance between order and disorder. We report the first empirical test of the renormalization of the exponent of the distribution of the sizes of first generation events into the renormalized exponent of the distribution of clusters resulting from the cascade of triggering over all generation in a critical branching process in the non-meanfield regime. Finally, we document a size effect in the strength and variability of the superlinear effect, with smaller groups exhibiting widely distributed superlinear exponents, some of them characterizing highly productive teams. In contrast, large groups tend to have a smaller superlinearity and less variability.Comment: 29 pages, 8 figure

    SQL Injection Detection Using Machine Learning Techniques and Multiple Data Sources

    Get PDF
    SQL Injection continues to be one of the most damaging security exploits in terms of personal information exposure as well as monetary loss. Injection attacks are the number one vulnerability in the most recent OWASP Top 10 report, and the number of these attacks continues to increase. Traditional defense strategies often involve static, signature-based IDS (Intrusion Detection System) rules which are mostly effective only against previously observed attacks but not unknown, or zero-day, attacks. Much current research involves the use of machine learning techniques, which are able to detect unknown attacks, but depending on the algorithm can be costly in terms of performance. In addition, most current intrusion detection strategies involve collection of traffic coming into the web application either from a network device or from the web application host, while other strategies collect data from the database server logs. In this project, we are collecting traffic from two points: the web application host, and a Datiphy appliance node located between the webapp host and the associated MySQL database server. In our analysis of these two datasets, and another dataset that is correlated between the two, we have been able to demonstrate that accuracy obtained with the correlated dataset using algorithms such as rule-based and decision tree are nearly the same as those with a neural network algorithm, but with greatly improved performance

    Function Based Design-by-Analogy: A Functional Vector Approach to Analogical Search

    Get PDF
    Design-by-analogy is a powerful approach to augment traditional concept generation methods by expanding the set of generated ideas using similarity relationships from solutions to analogous problems. While the concept of design-by-analogy has been known for some time, few actual methods and tools exist to assist designers in systematically seeking and identifying analogies from general data sources, databases, or repositories, such as patent databases. A new method for extracting functional analogies from data sources has been developed to provide this capability, here based on a functional basis rather than form or conflict descriptions. Building on past research, we utilize a functional vector space model (VSM) to quantify analogous similarity of an idea's functionality. We quantitatively evaluate the functional similarity between represented design problems and, in this case, patent descriptions of products. We also develop document parsing algorithms to reduce text descriptions of the data sources down to the key functions, for use in the functional similarity analysis and functional vector space modeling. To do this, we apply Zipf's law on word count order reduction to reduce the words within the documents down to the applicable functionally critical terms, thus providing a mapping process for function based search. The reduction of a document into functional analogous words enables the matching to novel ideas that are functionally similar, which can be customized various ways. This approach thereby provides relevant sources of design-by-analogy inspiration. As a verification of the approach, two original design problem case studies illustrate the distance range of analogical solutions that can be extracted. This range extends from very near-field, literal solutions to far-field cross-domain analogies.National Science Foundation (U.S.) (Grant CMMI-0855326)National Science Foundation (U.S.) (Grant CMMI-0855510)National Science Foundation (U.S.) (Grant CMMI-0855293)SUTD-MIT International Design Centre (IDC

    The BioPrompt-box: an ontology-based clustering tool for searching in biological databases

    Get PDF
    Background: High-throughput molecular biology provides new data at an incredible rate, so that the increase in the size of biological databanks is enormous and very rapid. This scenario generates severe problems not only at indexing time, where suitable algorithmic techniques for data indexing and retrieval are required, but also at query time, since a user query may produce such a large set of results that their browsing and understanding becomes humanly impractical. This problem is well known to the Web community, where a new generation of Web search engines is being developed, like Vivisimo. These tools organize on-the-fly the results of a user query in a hierarchy of labeled folders that ease their browsing and knowledge extraction. We investigate this approach on biological data, and propose the so called The BioPrompt-boxsoftware system which deploys ontology-driven clustering strategies for making the searching process of biologists more efficient and effective. Results: The BioPrompt-box (Bpb) defines a document as a biological sequence plus its associated meta-data taken from the underneath databank - like references to ontologies or to external databanks, and plain texts as comments of researchers and (title, abstracts or even body of) papers. Bpboffers several tools to customize the search and the clustering process over its indexed documents. The user can search a set of keywords within a specific field of the document schema, or can execute Blastto find documents relative to homologue sequences. In both cases the search task returns a set of documents (hits) which constitute the answer to the user query. Since the number of hits may be large, Bpbclusters them into groups of homogenous content, organized as a hierarchy of labeled clusters. The user can actually choose among several ontology-based hierarchical clustering strategies, each offering a different view of the returned hits. Bpbcomputes these views by exploiting the meta-data present within the retrieved documents such as the references to Gene Ontology, the taxonomy lineage, the organism and the keywords. Of course, the approach is flexible enough to leave room for future additions of other meta-information. The ultimate goal of the clustering process is to provide the user with several different readings of the (maybe numerous) query results and show possible hidden correlations among them, thus improving their browsing and understanding. Conclusion: Bpb is a powerful search engine that makes it very easy to perform complex queries over the indexed databanks (currently only UNIPROT is considered). The ontology-based clustering approach is efficient and effective, and could thus be applied successfully to larger databanks, like GenBank or EMBL
    • …
    corecore