18,807 research outputs found

Concept Extraction and Clustering for Topic Digital Library Construction

Author: Chengzhi Zhang
Dan Wu
Publication date: 01/12/2008

This paper introduces a new approach to building a topic digital library using concept extraction and document clustering. First, documents in a special domain are automatically produced by a document classification approach. Then, the keywords of each document are extracted using a machine learning approach. The keywords are used to cluster the document subset. The clustered result is the taxonomy of the subset. Lastly, the taxonomy is manually adjusted into a hierarchical structure for user navigation. The topic digital library is constructed by combining the full-text retrieval and hierarchical navigation functions.

Putting Context into Schema Matching

Author: Bohannon Philip
Elnahrawy Eiman
Fan Wenfei
Flaster Michael
Publication date: 01/01/2006

How Much is the Whole Really More than the Sum of its Parts? 1 + 1 = 2.5: Superlinear Productivity in Collective Group Actions

Author: Ghezzi Giacomo
Maillart Thomas
Sornette Didier
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 16/05/2014

In a variety of open source software projects, we document a superlinear growth of production ($R \sim c^\beta$) as a function of the number of active developers $c$, with $\beta \simeq 4/3$ with large dispersions. For a typical project in this class, doubling of the group size multiplies typically the output by a factor $2^\beta=2.5$, explaining the title. This superlinear law is found to hold for group sizes ranging from 5 to a few hundred developers. We propose two classes of mechanisms, interaction-based and large deviation, along with a cascade model of productive activity, which unifies them. In this common framework, superlinear productivity requires that the involved social groups function at or close to criticality, in the sense of a subtle balance between order and disorder. We report the first empirical test of the renormalization of the exponent of the distribution of the sizes of first generation events into the renormalized exponent of the distribution of clusters resulting from the cascade of triggering over all generations in a critical branching process in the non-meanfield regime. Finally, we document a size effect in the strength and variability of the superlinear effect, with smaller groups exhibiting widely distributed superlinear exponents, some of them characterizing highly productive teams. In contrast, large groups tend to have a smaller superlinearity and less variability. Comment: 29 pages, 8 figures
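As a quick check of the factor quoted in this abstract, the arithmetic can be written out. This is a sketch that assumes the scaling holds as an exact power law with a constant prefactor, which the abstract does not claim (it reports large dispersions):

```latex
\[
\frac{R(2c)}{R(c)} = \frac{(2c)^{\beta}}{c^{\beta}} = 2^{\beta},
\qquad
2^{4/3} = \sqrt[3]{16} \approx 2.52 ,
\]
```

which rounds to the factor of 2.5 given in the title.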

arXiv.org e-Print Archive

Repository for Publications and Research Data

Directory of Open Access Journals

Computerization of workflows, guidelines and care pathways: a review of implementation challenges for process-oriented health information systems

Author: Gooch P.
Roudsari A.
Publication venue: 'BMJ'
Publication date: 01/01/2011

There is a need to integrate the various theoretical frameworks and formalisms for modeling clinical guidelines, workflows, and pathways, in order to move beyond providing support for individual clinical decisions and toward the provision of process-oriented, patient-centered, health information systems (HIS). In this review, we analyze the challenges in developing process-oriented HIS that formally model guidelines, workflows, and care pathways. A qualitative meta-synthesis was performed on studies published in English between 1995 and 2010 that addressed the modeling process and reported the exposition of a new methodology, model, system implementation, or system architecture. Thematic analysis, principal component analysis (PCA) and data visualisation techniques were used to identify and cluster the underlying implementation ‘challenge’ themes. One hundred and eight relevant studies were selected for review. Twenty-five underlying ‘challenge’ themes were identified. These were clustered into 10 distinct groups, from which a conceptual model of the implementation process was developed. We found that the development of systems supporting individual clinical decisions is evolving toward the implementation of adaptable care pathways on the semantic web, incorporating formal, clinical, and organizational ontologies, and the use of workflow management systems. These architectures now need to be implemented and evaluated on a wider scale within clinical settings

City Research Online

SQL Injection Detection Using Machine Learning Techniques and Multiple Data Sources

Author: Ross Kevin
Publication venue: SJSU ScholarWorks
Publication date: 01/04/2018

SQL Injection continues to be one of the most damaging security exploits in terms of personal information exposure as well as monetary loss. Injection attacks are the number one vulnerability in the most recent OWASP Top 10 report, and the number of these attacks continues to increase. Traditional defense strategies often involve static, signature-based IDS (Intrusion Detection System) rules, which are mostly effective only against previously observed attacks but not unknown, or zero-day, attacks. Much current research involves the use of machine learning techniques, which are able to detect unknown attacks, but depending on the algorithm can be costly in terms of performance. In addition, most current intrusion detection strategies involve collection of traffic coming into the web application either from a network device or from the web application host, while other strategies collect data from the database server logs. In this project, we collect traffic from two points: the web application host, and a Datiphy appliance node located between the webapp host and the associated MySQL database server. In our analysis of these two datasets, and another dataset that is correlated between the two, we have been able to demonstrate that the accuracy obtained with the correlated dataset using algorithms such as rule-based and decision tree classifiers is nearly the same as that obtained with a neural network algorithm, but with greatly improved performance.

SJSU ScholarWorks

Function Based Design-by-Analogy: A Functional Vector Approach to Analogical Search

Author: Fu Katherine K
Jensen Dan
Murphy Jeremy
Otto Kevin
Wood Kristin
Yang Maria
Publication venue: 'ASME International'
Publication date: 01/07/2014

Design-by-analogy is a powerful approach to augment traditional concept generation methods by expanding the set of generated ideas using similarity relationships from solutions to analogous problems. While the concept of design-by-analogy has been known for some time, few actual methods and tools exist to assist designers in systematically seeking and identifying analogies from general data sources, databases, or repositories, such as patent databases. A new method for extracting functional analogies from data sources has been developed to provide this capability, here based on a functional basis rather than form or conflict descriptions. Building on past research, we utilize a functional vector space model (VSM) to quantify analogous similarity of an idea's functionality. We quantitatively evaluate the functional similarity between represented design problems and, in this case, patent descriptions of products. We also develop document parsing algorithms to reduce text descriptions of the data sources down to the key functions, for use in the functional similarity analysis and functional vector space modeling. To do this, we apply Zipf's law on word count order reduction to reduce the words within the documents down to the applicable functionally critical terms, thus providing a mapping process for function based search. The reduction of a document into functional analogous words enables the matching to novel ideas that are functionally similar, which can be customized in various ways. This approach thereby provides relevant sources of design-by-analogy inspiration. As a verification of the approach, two original design problem case studies illustrate the distance range of analogical solutions that can be extracted. This range extends from very near-field, literal solutions to far-field cross-domain analogies.
Funding: National Science Foundation (U.S.) (Grant CMMI-0855326); National Science Foundation (U.S.) (Grant CMMI-0855510); National Science Foundation (U.S.) (Grant CMMI-0855293); SUTD-MIT International Design Centre (IDC)
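
The vector space idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function vocabulary, the example texts, and the helper names `function_vector` and `cosine_similarity` are all hypothetical, and cosine similarity over raw term counts is assumed here as one common way to compare vectors in a VSM.

```python
import math
from collections import Counter

# Hypothetical vocabulary of function terms (an assumption for illustration).
VOCABULARY = ["convert", "store", "transfer", "separate", "guide"]


def function_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up design problem and patent descriptions.
problem = "convert energy and store energy then transfer energy"
patent_a = "store fluid and transfer fluid then convert motion"
patent_b = "separate solids and guide solids"

query = function_vector(problem, VOCABULARY)
sim_a = cosine_similarity(query, function_vector(patent_a, VOCABULARY))
sim_b = cosine_similarity(query, function_vector(patent_b, VOCABULARY))
print(sim_a > sim_b)  # patent_a shares function terms with the problem; patent_b shares none
```

In this toy setting, a patent from a different domain (fluids rather than energy) still ranks as similar because only function terms are compared, which is the intuition behind far-field analogies.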

The BioPrompt-box: an ontology-based clustering tool for searching in biological databases

Author: Corsi Claudio
Ferragina Paolo
Marangoni Roberto
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: High-throughput molecular biology provides new data at an incredible rate, so that the increase in the size of biological databanks is enormous and very rapid. This scenario generates severe problems not only at indexing time, where suitable algorithmic techniques for data indexing and retrieval are required, but also at query time, since a user query may produce such a large set of results that their browsing and understanding becomes humanly impractical. This problem is well known to the Web community, where a new generation of Web search engines is being developed, like Vivisimo. These tools organize on-the-fly the results of a user query in a hierarchy of labeled folders that ease their browsing and knowledge extraction. We investigate this approach on biological data, and propose the so-called BioPrompt-box software system, which deploys ontology-driven clustering strategies for making the searching process of biologists more efficient and effective. Results: The BioPrompt-box (Bpb) defines a document as a biological sequence plus its associated meta-data taken from the underlying databank, like references to ontologies or to external databanks, and plain texts as comments of researchers and (title, abstracts or even body of) papers. Bpb offers several tools to customize the search and the clustering process over its indexed documents. The user can search a set of keywords within a specific field of the document schema, or can execute Blast to find documents relative to homologous sequences. In both cases the search task returns a set of documents (hits) which constitute the answer to the user query. Since the number of hits may be large, Bpb clusters them into groups of homogeneous content, organized as a hierarchy of labeled clusters. The user can actually choose among several ontology-based hierarchical clustering strategies, each offering a different view of the returned hits. Bpb computes these views by exploiting the meta-data present within the retrieved documents such as the references to Gene Ontology, the taxonomy lineage, the organism and the keywords. Of course, the approach is flexible enough to leave room for future additions of other meta-information. The ultimate goal of the clustering process is to provide the user with several different readings of the (maybe numerous) query results and show possible hidden correlations among them, thus improving their browsing and understanding. Conclusion: Bpb is a powerful search engine that makes it very easy to perform complex queries over the indexed databanks (currently only UNIPROT is considered). The ontology-based clustering approach is efficient and effective, and could thus be applied successfully to larger databanks, like GenBank or EMBL.

Springer - Publisher Connector