    Statistical structures for internet-scale data management

    Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We then show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental evaluation of our contributions in terms of efficiency, accuracy, and scalability.
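
    A minimal sketch of one decentralized way to estimate the Average aggregate, namely pairwise gossip averaging simulated on a single machine; the paper's own algorithms operate over structured overlays and are likely more elaborate, so the routine below is illustrative only.

        import random

        def gossip_average(values, rounds=200, seed=0):
            """Simulate pairwise gossip averaging: in each step two random nodes
            replace their local estimates with their mean.  All estimates converge
            to the true network average without any central coordinator."""
            rng = random.Random(seed)
            est = list(values)
            n = len(est)
            for _ in range(rounds):
                i, j = rng.sample(range(n), 2)
                m = (est[i] + est[j]) / 2.0
                est[i] = est[j] = m
            return est

        if __name__ == "__main__":
            vals = [3.0, 7.0, 10.0, 20.0, 0.0]
            print("true average:", sum(vals) / len(vals))
            print("gossip estimates:", [round(v, 3) for v in gossip_average(vals)])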

    Wrapper syntax for example-based machine translation

    TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation systems. Using linguistically motivated syntactic information, it automatically decomposes source-language sentences into shorter and syntactically simpler chunks, and recomposes their translations to form target-language sentences. This generally improves both the word order and the lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets extracted from Europarl and the Penn II Treebank, we show that our method can raise the BLEU score by up to 3.8% relative to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces better output than the baseline EBMT in terms of fluency in 55% of the cases and in terms of accuracy in 53% of the cases.
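
    As a rough illustration of the wrapper idea, the sketch below decomposes a source sentence into chunks, translates each chunk with a baseline engine, and recomposes the output; the chunking here is naive word-count splitting and the toy_mt function is a placeholder, whereas TransBooster uses linguistically motivated syntactic decomposition around real MT engines.

        def decompose(sentence, max_len=8):
            """Naive stand-in for TransBooster's syntactic decomposition:
            split the source sentence into chunks of at most max_len words."""
            words = sentence.split()
            return [" ".join(words[i:i + max_len]) for i in range(0, len(words), max_len)]

        def wrap_translate(sentence, baseline_mt):
            """Translate each chunk with the baseline MT engine, then recompose.
            Real recomposition reorders chunks using the source-side syntax."""
            chunks = decompose(sentence)
            return " ".join(baseline_mt(c) for c in chunks)

        if __name__ == "__main__":
            toy_mt = lambda s: s.upper()  # placeholder for an EBMT/SMT engine
            src = "this is a long source sentence that the wrapper splits into simpler chunks before translation"
            print(wrap_translate(src, toy_mt))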

    RealText-cs - Corpus based domain independent Content Selection model

    Content selection is a highly domain-dependent task responsible for retrieving relevant information from a knowledge source given a communicative goal. This paper presents a domain-independent content selection model that uses keywords as the communicative goal. We employ the DBpedia triple store as our knowledge source, and triples are selected based on weights assigned to each triple. The weights are calculated from the log-likelihood distance between a domain corpus and a general reference corpus. The method was evaluated using keywords extracted from the QALD dataset, and its performance was compared with cross-entropy-based statistical content selection. The evaluation results showed that the proposed method performs 32% better than cross-entropy-based statistical content selection.
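
    Corpus-comparison weighting of this kind is typically based on Dunning's log-likelihood (G2) statistic; the sketch below shows that standard keyness calculation for a single term, under the assumption that this is what the log-likelihood distance refers to (the paper's exact triple weighting may differ).

        import math

        def log_likelihood(freq_domain, freq_ref, size_domain, size_ref):
            """Dunning-style log-likelihood (G2) keyness of a term between a domain
            corpus and a general reference corpus; higher scores mean the term is
            more characteristic of the domain corpus."""
            e1 = size_domain * (freq_domain + freq_ref) / (size_domain + size_ref)
            e2 = size_ref * (freq_domain + freq_ref) / (size_domain + size_ref)
            g2 = 0.0
            if freq_domain > 0:
                g2 += freq_domain * math.log(freq_domain / e1)
            if freq_ref > 0:
                g2 += freq_ref * math.log(freq_ref / e2)
            return 2.0 * g2

        if __name__ == "__main__":
            # term occurs 120 times in a 50k-word domain corpus vs 30 times in a 1M-word reference corpus
            print(round(log_likelihood(120, 30, 50_000, 1_000_000), 2))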

    Online Index Extraction from Linked Open Data Sources

    The production of machine-readable data in the form of RDF datasets belonging to the Linked Open Data (LOD) Cloud is growing very fast. However, selecting relevant knowledge sources from the Cloud, assessing their quality, and extracting summary information from a LOD source are all tasks that require considerable human effort. This paper proposes an approach for automatically extracting the most representative information from a LOD source and creating a set of indexes that enrich the description of the dataset. These indexes collect statistical information about the size and complexity of the dataset (e.g. the number of instances), but they also depict all the instantiated classes and the properties among them, supplying the user with a concise view of the LOD source. The technique is fully implemented in LODeX, a tool able to deal with the performance issues of systems that expose SPARQL endpoints and to cope with the heterogeneity in the knowledge representation of RDF data. An evaluation of LODeX on a large number of endpoints (244) belonging to the LOD Cloud has been performed, and the effectiveness of the index extraction process is presented.
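
    A hedged sketch of the kind of statistical index such a tool might extract: counting instances per class through a SPARQL endpoint using the SPARQLWrapper library. This is not LODeX's actual implementation, which also copes with endpoint performance limits and builds richer indexes of classes and properties.

        from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

        QUERY = """
        SELECT ?class (COUNT(?s) AS ?n)
        WHERE { ?s a ?class }
        GROUP BY ?class
        ORDER BY DESC(?n)
        LIMIT 20
        """

        def class_statistics(endpoint_url):
            """Collect a small statistical index for a SPARQL endpoint: how many
            instances each class has, keeping only the 20 most populated classes.
            Real endpoints may time out on heavy queries, hence the LIMIT."""
            sparql = SPARQLWrapper(endpoint_url)
            sparql.setQuery(QUERY)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()
            return [(b["class"]["value"], int(b["n"]["value"]))
                    for b in results["results"]["bindings"]]

        if __name__ == "__main__":
            for cls, n in class_statistics("https://dbpedia.org/sparql"):
                print(n, cls)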

    Log4Perf: Suggesting and Updating Logging Locations for Web-based Systems' Performance Monitoring

    Performance assurance activities are an essential step in the release cycle of software systems. Logs have become one of the most important sources of information used to monitor, understand, and improve software performance. However, developers often face the challenge of making logging decisions: neither logging too little nor logging too much is desirable. Although prior research has proposed techniques to assist in logging decisions, those automated logging guidance techniques are rather general, without considering a particular goal, such as monitoring software performance. In this thesis, we present Log4Perf, an automated approach that suggests where to insert logging statements with the goal of monitoring web-based systems' software performance. In particular, our approach builds and manipulates a statistical performance model to identify the locations in the source code that statistically significantly influence software performance. To evaluate Log4Perf, we conduct case studies on open-source systems, i.e., CloudStore and OpenMRS, and one large-scale commercial system. Our evaluation results show that Log4Perf can build well-fit statistical performance models, indicating that such models can be leveraged to investigate the influence of source-code locations on performance. Also, the suggested logging locations are often small and simple methods that do not have logging statements and that are not performance hotspots, making our approach an ideal complement to traditional approaches based on software metrics or performance hotspots. In addition, we propose approaches that can suggest the need to update logging locations as software evolves. After evaluating our approach, we manually examine the logging locations that are newly suggested or deprecated and identify seven root causes. Log4Perf is integrated into the release engineering process of the commercial software to provide logging suggestions on a regular basis.
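
    The following sketch illustrates, with synthetic data, what a statistical performance model of this kind can look like: an ordinary least squares regression of response time on per-location execution counts, flagging locations whose coefficients are statistically significant. The variable names and data are hypothetical; Log4Perf's actual model building is more involved.

        import numpy as np
        import statsmodels.api as sm

        def significant_locations(counts, response_time, names, alpha=0.05):
            """Fit an OLS performance model (response time ~ per-location execution
            counts) and return the locations whose coefficients are statistically
            significant, i.e. candidates for performance-monitoring log statements."""
            X = sm.add_constant(counts)
            model = sm.OLS(response_time, X).fit()
            return [(name, p) for name, p in zip(names, model.pvalues[1:]) if p < alpha]

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            counts = rng.poisson(lam=[5, 20, 3], size=(200, 3))       # executions of 3 code locations
            time = 2.0 * counts[:, 1] + rng.normal(0, 1, 200) + 10.0  # only location B drives latency
            print(significant_locations(counts, time, ["locA", "locB", "locC"]))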

    Learning State-Augmented Policies for Information Routing in Communication Networks

    This paper examines the problem of information routing in a large-scale communication network, which can be formulated as a constrained statistical learning problem with access to only local information. We delineate a novel State Augmentation (SA) strategy to maximize the aggregate information at the source nodes using graph neural network (GNN) architectures, deploying graph convolutions over the topological links of the communication network. The proposed technique leverages only the local information available at each node and efficiently routes the desired information to the destination nodes. We use an unsupervised learning procedure to convert the output of the GNN architecture into optimal information routing strategies. In the experiments, we evaluate on real-time network topologies to validate our algorithms. Numerical simulations show the improved performance of the proposed method in training a GNN parameterization compared to baseline algorithms.
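
    To illustrate why graph convolutions need only local information, the toy sketch below implements a single numpy graph-convolution step in which each node mixes its own state with an average of its one-hop neighbours' states; the paper's actual state-augmented policy, constraints, and training procedure are not reproduced here.

        import numpy as np

        def graph_convolution(adjacency, node_states, weights):
            """One graph convolution: every node combines its own state with the
            states of its one-hop neighbours, so a routing policy built from such
            layers only needs information available locally at each node."""
            deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
            neighbour_avg = adjacency @ node_states / deg
            return np.tanh(node_states @ weights[0] + neighbour_avg @ weights[1])

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node line network
            X = rng.normal(size=(3, 4))                                   # local node states (e.g. queue info)
            W = [rng.normal(size=(4, 2)), rng.normal(size=(4, 2))]
            print(graph_convolution(A, X, W))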

    Performance evaluation of inpatient service in Beijing: a horizontal comparison with risk adjustment based on Diagnosis Related Groups

    Background: The medical performance evaluation, which provides a basis for rational decision-making, is an important part of medical service research. Current progress with health services reform in China is far from satisfactory, without sufficient regulation. To achieve better progress, an effective tool for evaluating medical performance needs to be established. In view of this, this study attempted to develop such a tool appropriate for the Chinese context. Methods: Data were collected from the front pages of medical records (FPMR) of all large general public hospitals (21 hospitals) in the third and fourth quarters of 2007. Locally developed Diagnosis Related Groups (DRGs) were introduced as a tool for risk adjustment, and performance evaluation indicators were established: the Charge Efficiency Index (CEI), the Time Efficiency Index (TEI), and inpatient mortality of low-risk group cases (IMLRG), reflecting work efficiency and medical service quality respectively. Using these indicators, the performance of inpatient services was horizontally compared among hospitals. The Case-Mix Index (CMI) was used to adjust the efficiency indices, producing the adjusted CEI (aCEI) and adjusted TEI (aTEI). Poisson distribution analysis was used to test the statistical significance of the IMLRG differences between hospitals. Results: Using the aCEI, aTEI, and IMLRG scores for the 21 hospitals, Hospitals A and C had relatively good overall performance because their medical charges were lower, their length of stay (LOS) shorter, and their IMLRG smaller. The performance of Hospitals P and Q was the worst due to their relatively high charge level, long LOS, and high IMLRG. Various performance problems also existed in the other hospitals. Conclusion: It is possible to develop an accurate and easy-to-run performance evaluation system using Case-Mix as the tool for risk adjustment, choosing indicators close to consumers and managers, and utilizing routine report forms as the basic information source. To keep such a system running effectively, it is necessary to improve the reliability of clinical information and the risk-adjustment ability of Case-Mix.
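
    The abstract does not spell out the exact Poisson analysis, but a common way to set up such a test is sketched below with hypothetical numbers: comparing a hospital's observed deaths in low-risk cases against the count expected under an overall low-risk mortality rate.

        from scipy.stats import poisson

        def imlrg_poisson_test(deaths, low_risk_cases, overall_rate):
            """One-sided Poisson test: is a hospital's mortality in low-risk DRG
            cases (IMLRG) significantly above the overall low-risk mortality rate?"""
            expected = low_risk_cases * overall_rate
            p_value = poisson.sf(deaths - 1, expected)  # P(X >= deaths) under H0
            return expected, p_value

        if __name__ == "__main__":
            # hypothetical hospital: 9 deaths in 4,000 low-risk cases vs an overall rate of 0.1%
            expected, p = imlrg_poisson_test(9, 4000, 0.001)
            print(f"expected deaths {expected:.1f}, p-value {p:.4f}")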

    Analyzing two-phase single-case data with nonoverlap and mean difference indices: Illustration, software tools, and alternatives

    Two-phase single-case designs, consisting of a baseline evaluation followed by an intervention, represent the most clinically straightforward option for combining professional practice and research. However, unless they are part of a multiple-baseline schedule, such designs do not allow a causal relation between the intervention and the behavior to be demonstrated. Although the statistical options reviewed here cannot overcome this methodological limitation, we aim to make practitioners and applied researchers aware of the appropriate options available for extracting maximum information from the data. In the current paper, we suggest that the evaluation of behavioral change should include visual and quantitative analyses, complementing the substantive criteria regarding the practical importance of the behavioral change. Specifically, we emphasize the need to use structured criteria for visual analysis, such as the ones summarized in the What Works Clearinghouse Standards, especially if such criteria are complemented by visual aids, as illustrated here. For quantitative analysis, we focus on the nonoverlap of all pairs and the slope and level change procedure, as they offer straightforward information and have shown reasonable performance. An illustration is provided of the use of these three pieces of information: visual, quantitative, and substantive. To make the use of visual and quantitative analysis feasible, open-source software is referred to and demonstrated. In order to provide practitioners and applied researchers with a more complete guide, several analytical alternatives are commented on, pointing out the situations (aims, data patterns) in which they are potentially useful.
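
    As a concrete example of one of the quantitative indices discussed, the sketch below computes the nonoverlap of all pairs (NAP) for a two-phase data set, counting ties as half an overlap and assuming that higher values represent improvement.

        def nonoverlap_of_all_pairs(baseline, intervention):
            """Nonoverlap of All Pairs (NAP): share of all baseline-intervention
            comparisons in which the intervention value exceeds the baseline value,
            counting ties as half (assumes higher values indicate improvement)."""
            wins = sum(1.0 if b < i else 0.5 if b == i else 0.0
                       for b in baseline for i in intervention)
            return wins / (len(baseline) * len(intervention))

        if __name__ == "__main__":
            baseline = [2, 3, 5, 4, 3]
            intervention = [6, 7, 5, 8, 9, 7]
            print(round(nonoverlap_of_all_pairs(baseline, intervention), 3))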