
    Mining Optimized Association Rules for Numeric Attributes

    Given a huge database, we address the problem of finding association rules for numeric attributes, such as (Balance ∈ I) ⇒ (CardLoan = yes), which implies that bank customers whose balances fall in a range I are likely to use a card loan with a probability greater than p. The above rule is interesting only if the range I has some special feature with respect to the interrelation between Balance and CardLoan. It is required that the number of customers whose balances are contained in I (called the support of I) is sufficient, and also that the probability p of the condition CardLoan = yes being met (called the confidence ratio) be much higher than the average probability of the condition over all the data. Our goal is to realize a system that finds such appropriate ranges automatically. We mainly focus on computing two optimized ranges: one that maximizes the support on the condition that the confidence ratio is at least a given threshold value, and another that maximizes the confidence ratio on the condition that the support is at least a given threshold number. Using techniques from computational geometry, we present novel algorithms that compute the optimized ranges in linear time if the data are sorted. Since sorting data with respect to each numeric attribute is expensive in the case of huge databases that occupy much more space than the main memory, we instead apply randomized bucketing as the preprocessing method and thus obtain an efficient rule-finding system. Tests show that our implementation is fast not only in theory but also in practice. The efficiency of our algorithm enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time.
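    The optimization problems described here are concrete enough to illustrate in code. Below is a minimal sketch, under stated assumptions, of the first task on data that has already been bucketed by the numeric attribute: find the bucket range with maximum support whose confidence ratio meets a threshold. It uses a naive quadratic scan over bucket ranges rather than the paper's linear-time computational-geometry algorithm, and the bucket counts and threshold are invented for illustration.

```python
# Illustrative sketch only: a naive O(B^2) scan over bucket ranges, not the
# paper's linear-time algorithm. Bucket counts and threshold are assumptions.

def optimized_support_range(buckets, min_conf):
    """buckets: list of (n_total, n_yes) pairs, one per value bucket,
    ordered by the numeric attribute (e.g. Balance).
    Returns (support, i, j) for the bucket range [i, j] with maximum
    support whose confidence n_yes / n_total is at least min_conf."""
    best = None
    B = len(buckets)
    # Prefix sums so any range's support and confidence cost O(1) to evaluate.
    pre_n = [0] * (B + 1)
    pre_y = [0] * (B + 1)
    for k, (n, y) in enumerate(buckets):
        pre_n[k + 1] = pre_n[k] + n
        pre_y[k + 1] = pre_y[k] + y
    for i in range(B):
        for j in range(i, B):
            support = pre_n[j + 1] - pre_n[i]
            hits = pre_y[j + 1] - pre_y[i]
            if support > 0 and hits / support >= min_conf:
                if best is None or support > best[0]:
                    best = (support, i, j)
    return best

# Example: 5 balance buckets, each holding (customers, card-loan users).
print(optimized_support_range([(40, 2), (30, 12), (25, 14), (20, 9), (35, 3)],
                              min_conf=0.4))  # -> (75, 1, 3)
```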

    Agents for Integrating Distributed Data for Complex Computations

    Algorithms for many complex computations assume that all the relevant data are available on a single node of a computer network. In the emerging distributed and networked knowledge environments, databases relevant for computations may reside on a number of nodes connected by a communication network. These data resources cannot be moved to other network sites due to privacy, security, and size considerations. The desired global computation must be decomposed into local computations to match the distribution of data across the network. The capability to decompose computations must be general enough to handle different distributions of data and different participating nodes in each instance of the global computation. In this paper, we present a methodology wherein each distributed data source is represented by an agent. Each such agent has the capability to decompose global computations into local parts, for itself and for agents at other sites. The global computation is then performed by the agents, which either exchange minimal summaries with one another or travel to the sites and perform the local tasks that can be done at each local site. The objective is to perform global tasks with a minimum of communication or travel by participating agents across the network.
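    As a minimal illustration of the summary-exchange pattern sketched above (not the paper's agent framework), the following hypothetical example computes a global mean while each agent exposes only a small (sum, count) summary of its local data.

```python
# Minimal sketch of the summary-exchange pattern: each agent holds local data
# and exposes only a small summary; the global result is assembled from the
# summaries without moving raw records. Names and data are illustrative.

class DataSourceAgent:
    def __init__(self, records):
        self._records = records  # raw data stays at this "site"

    def local_summary(self):
        # The local part of the decomposed global computation.
        return (sum(self._records), len(self._records))

def global_mean(agents):
    # Combine the minimal summaries exchanged by the agents.
    total, count = 0, 0
    for agent in agents:
        s, n = agent.local_summary()
        total += s
        count += n
    return total / count if count else float("nan")

agents = [DataSourceAgent([3, 5, 7]), DataSourceAgent([10, 20]), DataSourceAgent([4])]
print(global_mean(agents))  # 49 / 6 ≈ 8.17
```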

    PPI-IRO: A two-stage method for protein-protein interaction extraction based on interaction relation ontology

    Mining Protein-Protein Interactions (PPIs) from the fast-growing biomedical literature resources has been proven to be an effective approach for the identification of biological regulatory networks. This paper presents a novel method based on the idea of an Interaction Relation Ontology (IRO), which specifies and organises words describing various protein interaction relationships. Our method is a two-stage PPI extraction method. First, the IRO is applied in a binary classifier to determine whether a sentence contains a relation or not. Then, the IRO is used to guide PPI extraction by building the sentence's dependency parse tree. Comprehensive and quantitative evaluations and detailed analyses demonstrate the significant performance of the IRO on relation sentence classification and PPI extraction. Our PPI extraction method yielded a recall of around 80% and 90% and an F1 of around 54% and 66% on the AIMed and Bioinfer corpora, respectively, which is superior to most existing extraction methods. Copyright © 2014 Inderscience Enterprises Ltd.
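    A schematic sketch of the two-stage structure follows. It is not the authors' implementation: a toy interaction-word lexicon stands in for the IRO, protein mentions are assumed to be pre-tagged, and a simple surface-order heuristic replaces the dependency-parse-guided extraction.

```python
# Schematic two-stage pipeline, not the paper's method: a toy lexicon stands
# in for the IRO and protein mentions are assumed to be pre-tagged.

INTERACTION_LEXICON = {"binds", "phosphorylates", "activates", "inhibits"}

def stage1_is_relation_sentence(tokens):
    # Stage 1: binary decision - does the sentence mention an interaction at all?
    return any(t.lower() in INTERACTION_LEXICON for t in tokens)

def stage2_extract_pairs(tokens, protein_spans):
    # Stage 2: pair up protein mentions that flank an interaction word.
    # (The paper guides this step with a dependency parse; a surface-order
    # heuristic is used here instead.)
    pairs = []
    for i, t in enumerate(tokens):
        if t.lower() in INTERACTION_LEXICON:
            left = [p for p in protein_spans if p[1] < i]
            right = [p for p in protein_spans if p[0] > i]
            if left and right:
                pairs.append((left[-1][2], right[0][2], t.lower()))
    return pairs

tokens = "RAD51 binds BRCA2 in vitro".split()
proteins = [(0, 0, "RAD51"), (2, 2, "BRCA2")]  # (start, end, name), pre-tagged
if stage1_is_relation_sentence(tokens):
    print(stage2_extract_pairs(tokens, proteins))  # [('RAD51', 'BRCA2', 'binds')]
```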

    Environmental information systems : the development and implementation of the Lake Rukwa Basin integrated project environmental information system (LRBIP-EIS) database, Tanzania

    The quest for sustenance inevitably forces mankind to exploit the natural resources found within its environs. In many cases, the exploitation results in massive environmental degradation that disrupts the ecosystem and causes loss of bio-diversity. There is generally a lack of information systems to monitor and provide quantitative information on the state of the affected environment. Decision-makers usually fail to make informed decisions with regard to conservation strategies. The need to provide decision-makers with quantitative environmental information formed the basis of this thesis. An integrated environmental information system (EIS) database was developed according to the Software Development Methodology for three of the identified environmental sectors. This involved a detailed user needs assessment to identify the information requirements (both spatial and textual) for each sector. The results were used to design separate data models that were later merged to create an integrated data model for the database application. A fisheries application prototype was developed to implement the proposed database design. The prototype has three major components. The Geographic Information System (GIS) handles the spatial data such as rivers, settlements, roads, and lakes. A relational database management system (RDBMS) was used to store and maintain the non-spatial data such as fishermen's personal details and fish catch data. Customized graphical user interfaces were designed to handle data visualization and restricted access to the GIS and RDBMS environments.
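    Purely as a hypothetical illustration of the non-spatial side of such a design (not the thesis's actual data model), a minimal relational schema for fisherman details and fish catch records might look like the following; all table, column, and sample values are invented.

```python
# Hypothetical illustration only: a minimal relational schema for the kind of
# non-spatial fisheries data the abstract mentions. Names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fisherman (
    fisherman_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    village      TEXT
);
CREATE TABLE fish_catch (
    catch_id     INTEGER PRIMARY KEY,
    fisherman_id INTEGER NOT NULL REFERENCES fisherman(fisherman_id),
    species      TEXT NOT NULL,
    weight_kg    REAL,
    catch_date   TEXT  -- ISO date; spatial features live in the GIS instead
);
""")
conn.execute("INSERT INTO fisherman VALUES (1, 'A. Fisher', 'Lakeside')")
conn.execute("INSERT INTO fish_catch VALUES (1, 1, 'Tilapia', 12.5, '2002-06-14')")
for row in conn.execute("""
        SELECT f.name, c.species, c.weight_kg
        FROM fish_catch c JOIN fisherman f USING (fisherman_id)"""):
    print(row)  # ('A. Fisher', 'Tilapia', 12.5)
```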

    An intelligent framework for monitoring student performance using fuzzy rule-based linguistic summarisation

    Monitoring students' activity and performance is vital to enable educators to provide effective teaching and learning, in order to better engage students with the subject and improve their understanding of the material being taught. We describe the use of a fuzzy Linguistic Summarisation (LS) technique for extracting linguistically interpretable scaled fuzzy weighted rules from student data, describing prominent relationships between activity/engagement characteristics and achieved performance. We propose an intelligent framework for monitoring individual or group performance during activity and problem-based learning tasks. The system can be used to more effectively evaluate new teaching approaches and methodologies, identify weaknesses and provide more personalised feedback on learners' progress. We present a case study and initial experiments in which we apply the fuzzy LS technique to analyse the effectiveness of using a Group Performance Model (GPM) to deploy Activity Led Learning (ALL) in a Master-level module. Results show that the fuzzy weighted rules can identify useful relationships between student engagement and performance, providing a mechanism that allows educators to transparently evaluate teaching and the factors affecting student performance, and which can be incorporated as part of an automated intelligent analysis and feedback system.
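    Fuzzy linguistic summarisation of this kind attaches a degree of truth to statements such as "most students with high engagement achieved high performance". The sketch below computes that degree with simple fuzzy memberships and a "most" quantifier; the membership functions, quantifier shape, and student data are assumptions for illustration, not the authors' GPM rules.

```python
# Toy linguistic-summary evaluation: degree of truth of
# "most students with HIGH engagement achieved HIGH performance".
# Membership functions, quantifier, and student data are illustrative.

def high(x):
    # Simple piecewise-linear membership for "high" on a 0-100 scale.
    return min(max((x - 50) / 40, 0.0), 1.0)

def most(p):
    # Fuzzy quantifier "most": fully true above 80%, fully false below 30%.
    return min(max((p - 0.3) / 0.5, 0.0), 1.0)

def truth_of_summary(students):
    # students: list of (engagement, performance) pairs on a 0-100 scale.
    num = sum(min(high(e), high(p)) for e, p in students)  # engaged AND performing
    den = sum(high(e) for e, p in students)                 # engaged at all
    return most(num / den) if den else 0.0

data = [(90, 85), (75, 70), (60, 40), (95, 90), (30, 55)]
print(round(truth_of_summary(data), 2))  # degree of truth in [0, 1]
```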

    Automated analysis of free-text comments and dashboard representations in patient experience surveys: a multimethod co-design study

    BACKGROUND: Patient experience surveys (PESs) often include informative free-text comments, but with no way of systematically, efficiently and usefully analysing and reporting these. The National Cancer Patient Experience Survey (CPES), used to model the approach reported here, generates > 70,000 free-text comments annually. MAIN AIM: To improve the use and usefulness of PES free-text comments in driving health service changes that improve the patient experience. SECONDARY AIMS: (1) To structure CPES free-text comments using rule-based information retrieval (IR) (‘text engineering’), drawing on health-care domain-specific gazetteers of terms, with in-built transferability to other surveys and conditions; (2) to display the results usefully for health-care professionals, in a digital toolkit dashboard display that drills down to the original free text; (3) to explore the usefulness of interdisciplinary mixed stakeholder co-design and consensus-forming approaches in technology development, ensuring that outputs have meaning for all; and (4) to explore the usefulness of Normalisation Process Theory (NPT) in structuring outputs for implementation and sustainability. DESIGN: A scoping review, rapid review and surveys with stakeholders in health care (patients, carers, health-care providers, commissioners, policy-makers and charities) explored clinical dashboard design/patient experience themes. The findings informed the rules for the draft rule-based IR [developed using half of the 2013 Wales CPES (WCPES) data set] and prototype toolkit dashboards summarising PES data. These were refined following mixed stakeholder, concept-mapping workshops and interviews, which were structured to enable consensus-forming ‘co-design’ work. IR validation used the second half of the WCPES, with comparison against its manual analysis; transferability was tested using further health-care data sets. A discrete choice experiment (DCE) explored which toolkit features were preferred by health-care professionals, with a simple cost–benefit analysis. Structured walk-throughs with NHS managers in Wessex, London and Leeds explored usability and general implementation into practice. KEY OUTCOMES: A taxonomy of ranked PES themes, a checklist of key features recommended for digital clinical toolkits, rule-based IR validation and transferability scores, usability, and goal-oriented, cost–benefit and marketability results. The secondary outputs were a survey, scoping and rapid review findings, and concordance and discordance between stakeholders and methods. RESULTS: (1) The surveys, rapid review and workshops showed that stakeholders differed in their understandings of the patient experience and priorities for change, but that they reached consensus on a shortlist of 19 themes; six were considered to be core; (2) the scoping review and one survey explored the clinical toolkit design, emphasising that such toolkits should be quick and easy to use, and embedded in workflows; the workshop discussions, the DCE and the walk-throughs confirmed this and foregrounded other features to form the toolkit design checklist; and (3) the rule-based IR, developed using noun and verb phrases and lookup gazetteers, was 86% accurate on the WCPES, but needs modification to improve this and to be accurate with other data sets. The DCE and the walk-through suggest that the toolkit would be well accepted, with a favourable cost–benefit ratio, if implemented into practice with appropriate infrastructure support. 
LIMITATIONS: Small participant numbers and sampling bias across component studies. The scoping review studies mostly used top-down approaches and focused on professional dashboards. The rapid review of themes had limited scope, with no second reviewer. The IR needs further refinement, especially for transferability. New governance restrictions further limit immediate use. CONCLUSIONS: Using a multidisciplinary, mixed-stakeholder co-design approach, proof of concept was shown for an automated display of patient experience free-text comments in a way that could drive health-care improvements in real time. The approach is easily modified for transferable application. FUTURE WORK: Further exploration is needed of implementation into practice, transferable uses and technology development co-design approaches. FUNDING: The National Institute for Health Research Health Services and Delivery Research programme.
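    A minimal, hypothetical sketch of the gazetteer-lookup idea behind the rule-based IR is shown below; the gazetteer entries, theme names, and matching rule are invented and far simpler than the validated system described above.

```python
# Illustrative gazetteer lookup for tagging free-text survey comments with
# themes. Gazetteer contents and theme names are invented; the real system
# uses noun/verb phrase chunking and much richer rules.
import re

GAZETTEERS = {
    "waiting times": {"wait", "waiting", "delay", "queue"},
    "staff attitude": {"kind", "rude", "caring", "dismissive"},
    "communication": {"explained", "information", "told", "informed"},
}

def tag_comment(comment):
    tokens = set(re.findall(r"[a-z]+", comment.lower()))
    return sorted(theme for theme, words in GAZETTEERS.items() if tokens & words)

print(tag_comment("The nurses were very caring but nobody explained the delay."))
# -> ['communication', 'staff attitude', 'waiting times']
```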

    ON THE USE OF NATURAL LANGUAGE PROCESSING FOR AUTOMATED CONCEPTUAL DATA MODELING

    This research involved the development of a natural language processing (NLP) architecture for the extraction of entity relation diagrams (ERDs) from natural language requirements specifications. Conceptual data modeling plays an important role in database and software design, and many approaches to automating and developing software tools for this process have been attempted. NLP approaches to this problem appear to be plausible because, compared to general free texts, natural language requirements documents are relatively formal and exhibit some special regularities which reduce the complexity of the problem. The approach taken here involves a loose integration of several linguistic components. Outputs from syntactic parsing are used by a set of heuristic rules developed for this particular domain to produce tuples representing the underlying meanings of the propositions in the documents, and semantic resources are used to distinguish between correct and incorrect tuples. Finally, the tuples are integrated into full ERD representations. The major challenge addressed in this research is how to bring the various resources to bear on the translation of the natural language documents into the formal language. This system is taken to be representative of a potential class of similar systems designed to translate documents in other restricted domains into corresponding formalisms. The system is incorporated into a tool that presents the final ERDs to users, who can modify them in the attempt to produce an accurate ERD for the requirements document. An experiment demonstrated that users with limited experience in ERD specification could produce better representations of requirements documents with the system than they could without it, and could do so in less time.
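    As a hedged illustration of the kind of heuristic rule the abstract describes (the original system's parser and rule set are not specified here), the following sketch turns subject-verb-object patterns in requirement sentences into candidate entity-relationship tuples, using spaCy purely as a stand-in parser.

```python
# Hypothetical sketch of one heuristic of the sort described above: map
# subject-verb-object patterns to candidate (entity, relationship, entity)
# tuples for an ERD. spaCy is used only as an illustrative parser.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def candidate_tuples(text):
    tuples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
                for s in subjects:
                    for o in objects:
                        tuples.append((s.lemma_, token.lemma_, o.lemma_))
    return tuples

print(candidate_tuples("A customer places one or more orders. Each order contains items."))
# e.g. [('customer', 'place', 'order'), ('order', 'contain', 'item')]
```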

    Data Mining and Machine Learning in Astronomy

    We review the current state of data mining and machine learning in astronomy. 'Data Mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines; applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science; and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box. Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra figures, some minor additions to the text.
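    Support vector machines are among the algorithms the review covers. As a generic illustration of supervised classification of the sort applied to astronomical catalogues (not an example from the review), the sketch below trains an SVM on synthetic two-class data with scikit-learn; the features, labels, and parameters are invented.

```python
# Generic supervised-classification sketch (not from the review): an SVM
# separating two synthetic "object classes" in a made-up 2-D feature space,
# standing in for e.g. colour-based star/galaxy separation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Two Gaussian blobs in a 2-D feature space (purely synthetic data).
class_a = rng.normal(loc=[0.3, 1.0], scale=0.25, size=(500, 2))
class_b = rng.normal(loc=[1.1, 0.4], scale=0.25, size=(500, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```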