Structure Selection from Streaming Relational Data
Statistical relational learning techniques have been successfully applied in
a wide range of relational domains. In most of these applications, the human
designers capitalized on their background knowledge by following a
trial-and-error trajectory, where relational features are manually defined by a
human engineer, parameters are learned for those features on the training data,
the resulting model is validated, and the cycle repeats as the engineer adjusts
the set of features. This paper seeks to streamline application development in
large relational domains by introducing a light-weight approach that
efficiently evaluates relational features on pieces of the relational graph
that are streamed to it one at a time. We evaluate our approach on two social
media tasks and demonstrate that it leads to more accurate models that are
learned faster.
A New Evolutionary Algorithm For Mining Noisy, Epistatic, Geospatial Survey Data Associated With Chagas Disease
The scientific community is just beginning to understand some of the profound effects that feature interactions and heterogeneity have on natural systems. Despite the belief that these nonlinear and heterogeneous interactions exist across numerous real-world systems (e.g., from the development of personalized drug therapies to market predictions of consumer behaviors), the tools for analysis have not kept pace. This research was motivated by the desire to mine data from large socioeconomic surveys aimed at identifying the drivers of household infestation by a Triatomine insect that transmits the life-threatening Chagas disease. To decrease the risk of transmission, our colleagues at the laboratory of applied entomology and parasitology have implemented mitigation strategies (known as Ecohealth interventions); however, limited resources necessitate the search for better risk models. Mining these complex Chagas survey data for potential predictive features is challenging due to imbalanced class outcomes, missing data, heterogeneity, and the non-independence of some features.
We develop an evolutionary algorithm (EA) to identify feature interactions in Big Datasets with desired categorical outcomes (e.g., disease or infestation). The method is non-parametric and uses the hypergeometric PMF as a fitness function to tackle challenges associated with using p-values in Big Data (e.g., p-values decrease inversely with the size of the dataset). To demonstrate the EA's effectiveness, we first test the algorithm on three benchmark datasets. These include two classic Boolean classifier problems: (1) the majority-on problem and (2) the multiplexer problem, as well as (3) a simulated single nucleotide polymorphism (SNP) disease dataset. Next, we apply the EA to real-world Chagas Disease survey data and successfully archive numerous high-order feature interactions associated with infestation that would not have been discovered using traditional statistics. These feature interactions are also explored using network analysis. The spatial autocorrelation of the genetic data (SNPs of Triatoma dimidiata) was captured using geostatistics. Specifically, a modified semivariogram analysis was performed to characterize the SNP data and help elucidate the movement of the vector within two villages. For both villages, the SNP information showed strong spatial autocorrelation, albeit with different geostatistical characteristics (sills, ranges, and nuggets). These metrics were leveraged to create risk maps that suggest the more forested village had a sylvatic source of infestation, while the other village had a domestic/peridomestic source. This initial exploration into using Big Data to analyze disease risk shows that novel and modified existing statistical tools can improve the assessment of risk at a fine scale.
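To make the idea of a hypergeometric PMF score concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name, the toy counts, and the convention that a smaller PMF value marks a more surprising (stronger) feature combination are all assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k positives in a sample of n drawn without
    replacement from N items, K of which are positive."""
    # Values of k outside the feasible range have probability zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Hypothetical toy numbers (not from the study): 1000 surveyed households,
# 100 of them infested; a candidate feature combination matches 50
# households, 20 of which are infested.
p = hypergeom_pmf(20, 1000, 100, 50)

# Under no association we would expect about 50 * 100 / 1000 = 5 infested
# matches, so observing 20 is far less probable than observing 5.
baseline = hypergeom_pmf(5, 1000, 100, 50)
print(p < baseline)
```

In this sketch an evolutionary search would prefer candidate feature combinations with a lower PMF value; how the actual method ranks, archives, or thresholds candidates is not specified in the abstract.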
Assessing unidimensionality: A comparison of Rasch Modeling, Parallel Analysis, and TETRAD
The evaluation of assessment dimensionality is a necessary stage in the gathering of evidence to support the validity of interpretations based on a total score, particularly when assessment development and analysis are conducted within an item response theory (IRT) framework. In this study, we employ polytomous item responses to compare two methods that have received increased attention in recent years (Rasch modeling and Parallel Analysis) with a method for evaluating assessment structure that is less well-known in the educational measurement community (TETRAD). The three methods were all found to be reasonably effective. Parallel Analysis successfully identified the correct number of factors, and while the Rasch approach did not show the item misfit that would indicate deviation from clear unidimensionality, the pattern of residuals did seem to indicate the presence of correlated, yet distinct, factors. TETRAD successfully confirmed one dimension in the single-construct data set and was able to confirm two dimensions in the combined data set, yet excluded one item from each cluster for no obvious reason. The outcomes of all three approaches substantiate the conviction that the assessment of dimensionality requires a good deal of judgment. Accessed 19,548 times on https://pareonline.net from October 08, 2007 to December 31, 2019.
Identifying Causal Structures from Cyberstalking: Behaviors Severity and Association
This paper presents an etiological study of cyberstalking, meaning the use of various technologies, and the internet in general, to harass or stalk someone. The novelty of the paper is its multivariate empirical approach to cyberstalking victimization, which has received less attention from the research community. There is also a lack of such studies from the causal perspective. This is because most studies prioritize identifying a single cause, whereas the data examination used for mining causal relationships in this paper is novel and has great potential to detect combined or multiple causal factors. The paper focuses on how variables such as age, gender, and whether the participant has ever harassed someone relate to being a victim of cyberstalking. The research aims to find the causes of cyberstalking among high school teenagers. Furthermore, an exploratory data analysis has been performed. Weak and moderate correlations between the factors in the dataset are highlighted. The odds ratios among the variables have been calculated, which imply that girls are twice as likely as boys to be cyberstalked. Similarly, concerning outcomes related to cyberstalking frequency recidivism are noticed.
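As a worked illustration of the odds ratio mentioned in this abstract, the following Python sketch computes it from a 2x2 table. The counts are hypothetical and chosen only so that the ratio comes out to 2; they are not the study's data, and the function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: group 1 with the outcome      b: group 1 without the outcome
    c: group 2 with the outcome      d: group 2 without the outcome
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Ratio of outcome probabilities between the two groups."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: 40 of 100 girls and 25 of 100 boys report the outcome.
print(odds_ratio(40, 60, 25, 75))   # (40 * 75) / (60 * 25) = 2.0
print(risk_ratio(40, 60, 25, 75))   # 0.40 / 0.25, about 1.6
```

With these toy counts the odds are twice as high for girls while the probability itself is about 1.6 times as high, so an odds ratio of 2 matches "twice as likely" only approximately, and most closely when the outcome is rare.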
Risk Management Based on Expert Rules and Data Mining: A Case Study in Insurance
Correctness, transparency and effectiveness are the principal attributes of knowledge derived from databases using data mining. In the current data mining research there is a focus on efficiency improvement of algorithms for knowledge discovery. However, improving the algorithms is often not sufficient. The limitations of data mining can only be overcome by the integration of knowledge of experts in the field, encoded in some accessible way, with knowledge derived from patterns in the databases. In this paper we discuss an approach for combining expert knowledge and knowledge derived from transactional databases. The approach proposed is applicable to a wide variety of risk management problems. We illustrate the approach with a case study on fraud detection in an insurance company. The case clearly shows that the combination of expert knowledge with monotonic neural networks leads to significant performance improvements.
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Chinese Traditional Medicine and biodiversity.
Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages
with 3 table