    CRYSTAL: Inducing a Conceptual Dictionary

    One of the central knowledge sources of an information extraction system is a dictionary of linguistic patterns that can be used to identify the conceptual content of a text. This paper describes CRYSTAL, a system which automatically induces a dictionary of "concept-node definitions" sufficient to identify relevant information from a training corpus. Each of these concept-node definitions is generalized as far as possible without producing errors, so that a minimum number of dictionary entries cover the positive training instances. Because it tests the accuracy of each proposed definition, CRYSTAL can often surpass human intuitions in creating reliable extraction rules. Comment: 6 pages, Postscript, IJCAI-95 http://ciir.cs.umass.edu/info/psfiles/tepubs/tepubs.htm

    Barriers to Dissemination of Local Health Data Faced by US State Agencies: Survey Study of Behavioral Risk Factor Surveillance System Coordinators

    Background: Advances in information technology have paved the way to facilitate accessibility to population-level health data through web-based data query systems (WDQSs). Despite these advances in technology, US state agencies face many challenges related to the dissemination of their local health data. It is essential for the public to have access to high-quality data that are easy to interpret, reliable, and trusted. These challenges have been at the forefront throughout the COVID-19 pandemic. Objective: The purpose of this study is to identify the most significant challenges faced by state agencies, from the perspective of the Behavioral Risk Factor Surveillance System (BRFSS) coordinator from each state, and to assess if the coordinators from states with a WDQS perceive these challenges differently. Methods: We surveyed BRFSS coordinators (N=43) across all 50 US states and the District of Columbia. We surveyed the participants about contextual factors and asked them to rate system aspects and challenges they faced with their health data system on a Likert scale. We used two-sample t tests to compare the means of the ratings by participants from states with and without a WDQS. Results: Overall, 41/43 states (95%) make health data available over the internet, while 65% (28/43) employ a WDQS. States with a WDQS reported greater challenges (P=.01) related to the cost of hardware and software (mean score 3.44/4, 95% CI 3.09-3.78) than states without a WDQS (mean score 2.63/4, 95% CI 2.25-3.00). The system aspect of standardization of vocabulary scored more favorably (P=.01) in states with a WDQS (mean score 3.32/5, 95% CI 2.94-3.69) than in states without a WDQS (mean score 2.85/5, 95% CI 2.47-3.22). Conclusions: Securing of adequate resources and commitment to standardization are vital in the dissemination of local-level health data. 
    Factors such as receiving data in a timely manner, privacy, and political opposition are less significant barriers than anticipated.
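The comparison of mean ratings described in the Methods can be illustrated with a small sketch. The abstract does not say which variant of the two-sample t test was used, so this example assumes Welch's unequal-variance form; the function name `welch_t` and the ratings are made up for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t test."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean.
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical Likert ratings (assumption, NOT survey data).
with_wdqs = [4, 3, 4, 3, 4, 4, 3]
without_wdqs = [2, 3, 3, 2, 3, 2, 3]

t_stat, dof = welch_t(with_wdqs, without_wdqs)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a P value additionally requires the t distribution's cumulative distribution function, which a statistics package would normally supply.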

    Efficient Algorithms for Fast Integration on Large Data Sets from Multiple Sources

    Background: Recent large-scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literature that are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular, more than two) datasets efficiently. Methods: Hierarchical clustering-based solutions are used to integrate multiple (in particular, more than two) datasets. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the level above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which predicts the distance with the threshold by upper bounds on edit distance; and 4) a pre-processing blocking phase that limits dynamic computation within each block. Results: We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach. Conclusions: In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively, on a real dataset of 1,083,878 records.
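The idea of deciding whether two strings fall within a distance threshold without always finishing the full computation can be sketched as follows. This is not the authors' FCED technique, whose details the abstract does not give; it is a generic thresholded Levenshtein check with two standard early exits, and the function name `edit_distance_within` is hypothetical.

```python
def edit_distance_within(a: str, b: str, threshold: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= threshold."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > threshold:
        return False

    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        # The final distance can never be smaller than the minimum of any row,
        # so once a whole row exceeds the threshold the answer is known.
        if min(curr) > threshold:
            return False
        prev = curr
    return prev[-1] <= threshold


print(edit_distance_within("smith", "smyth", 1))
print(edit_distance_within("kitten", "sitting", 2))
```

A record-linkage pipeline only needs the yes/no answer at the clustering threshold, which is why a bounded check like this can replace the exact distance.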

    A Scoping Review of Transitions, Stress, and Adaptation Among Emerging Adults

    This scoping review examined research on transitions among emerging adults, 18- to 30-year-olds, to identify designs, populations, frameworks, transition types, and transition outcomes. A librarian conducted the search, yielding 2067 articles. Using predefined criteria, teams screened abstracts and reviewed articles, with 82% to 100% interrater agreement. Data from the final 160 articles were placed in evidence tables and summarized. Most frequently, the studies had exploratory-descriptive designs (69%), nondiagnosed samples (58%), no theoretical frameworks (58%), developmental transitions (34%), and health-related behavior outcomes (34%). This transition research is in an early stage of knowledge development and would benefit from further theory development.

    Feedlot Performance of Growing Steer Calves on a High Roughage Ration Supplemented with a High Bypass or an All Natural Protein Supplement

    This study was undertaken to compare a urea-based protein supplement containing meat and bone meal and dehydrated alfalfa as the primary bypass protein source with a protein supplement containing soybean meal and sunflower meal as the protein sources.

    RLT-S: A Web System for Record Linkage

    Background: Record linkage integrates records across multiple related data sources, identifying duplicates and accounting for possible errors. Real-life applications require efficient algorithms to merge these voluminous data sources and find all records belonging to the same individuals. Our recently devised highly efficient record linkage algorithms provide best-known solutions to this challenging problem. Method: We have developed RLT-S, a freely available web tool, which implements our single linkage clustering algorithm for record linkage. This tool requires input data sets and a small set of configuration settings about these files to work efficiently. RLT-S employs exact match clustering, blocking on a specified attribute, and single-linkage-based hierarchical clustering among these blocks. Results: RLT-S is an implementation package of our sequential record linkage algorithm. It outperforms previous best-known implementations by a large margin. The tool is at least two times faster for any dataset than the previous best-known tools. Conclusions: The RLT-S tool implements our record linkage algorithm, which outperforms previous best-known algorithms in this area. The website also contains necessary information such as instructions, submission history, feedback, publications, and some other sections to facilitate the usage of the tool. Availability: RLT-S is integrated into http://www.rlatools.com, which is currently serving this tool only. The tool is freely available and can be used without login. All data files used in this paper have been stored in https://github.com/abdullah009/DataRLATools. For copies of the relevant programs, please see https://github.com/abdullah009/RLATools.
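The combination of blocking and single linkage clustering mentioned in the Method can be sketched in a few lines. This is an illustrative toy, not the RLT-S implementation: it relies on the standard fact that cutting a single-linkage dendrogram at a threshold yields the connected components of the graph whose edges join records within that threshold. The function names, field names, and sample records are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def link_records(records, block_key, compare_key, threshold):
    """Group record indices: block on one field, single-link on another."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Blocking: only records sharing the block value are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_key]].append(idx)

    # Single linkage: one close pair is enough to merge two clusters.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if levenshtein(records[i][compare_key],
                           records[j][compare_key]) <= threshold:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical records (assumption, not data from the paper).
records = [
    {"name": "jon smith", "zip": "06269"},
    {"name": "john smith", "zip": "06269"},
    {"name": "john smyth", "zip": "06269"},
    {"name": "jon smith", "zip": "10001"},
    {"name": "mary jones", "zip": "06269"},
]
print(link_records(records, "zip", "name", threshold=1))
```

In this toy input, records 0 and 2 differ by two edits yet land in the same cluster because each is within one edit of record 1, which is the chaining behavior characteristic of single linkage. Record 3 stays separate because blocking never compares it with the others.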

    Privacy Protection and Aggregate Health Data: A Review of Tabular Cell Suppression Methods (Not) Employed in Public Health Data Systems

    Public health research often relies on individuals’ confidential medical data. Therefore, data collecting entities, such as states, seek to disseminate this medical data as widely as possible while still maintaining the privacy of the individual for legal and ethical reasons. One common way in which this medical data is released is through the use of Web-based Data Query Systems (WDQS). In this article, we examined WDQS listed in the National Association for Public Health Statistics and Information Systems (NAPHSIS), specifically reviewing them for how they prevent statistical disclosure in queries that produce a tabular response. One of the most common methods to combat this type of disclosure is suppression: if a cell count in a table is below a certain threshold, the true value is suppressed. This technique does prevent the direct disclosure of small cell counts; however, primary suppression by itself is not always enough to preserve privacy in tabular data. Here, we present several real examples of tabular response queries that employ suppression but for which we are able to infer the values of the suppressed cells, including cells with a count of 1, which could be linked to auxiliary data sources and thus have the potential to create an identity disclosure. We seek to stimulate awareness of the potential for disclosure of information that individuals may wish to keep private through an online query system. This research is undertaken in the hope that privacy concerns can be dealt with preemptively rather than only after a major disclosure has taken place. In the wake of such an event, a major concern is that state and local officials would react by permanently shutting down these sites and cutting off a valuable source of research data.
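One simple way a suppressed cell can be inferred is by subtraction from a published marginal total. The sketch below illustrates that general mechanism only; it does not reproduce the article's examples, and the function name `recover_suppressed` and the counts are hypothetical.

```python
def recover_suppressed(row, total):
    """Fill in a single suppressed cell (None) using the published row total."""
    hidden = [i for i, value in enumerate(row) if value is None]
    if len(hidden) != 1:
        # Zero or several unknowns: the total alone does not pin them down.
        return row
    filled = list(row)
    filled[hidden[0]] = total - sum(v for v in row if v is not None)
    return filled


# Hypothetical table row (assumption): the middle count was suppressed
# for being small, but the row total was still published.
published_row = [12, None, 30]
published_total = 43
print(recover_suppressed(published_row, published_total))
```

When a second cell in the same row is also withheld, subtraction no longer determines either value, which is the rationale usually given for complementary (secondary) suppression.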