
    A Python Tool for Selecting Domain-Specific Data in Machine Translation

    As the volume of data for Machine Translation (MT) grows, the need for models that perform well in specific use cases, such as patent and medical translation, becomes increasingly important. Unfortunately, generic models do not work well in such cases, as they often fail to handle domain-specific style and terminology. Training MT systems only on datasets that cover domains similar to the target domain can effectively lead to high translation quality for a domain-specific use case (Wang et al., 2017; Pourmostafa Roshan Sharami et al., 2021; Pourmostafa Roshan Sharami et al., 2022). This highlights the limitation of data-driven MT trained on general-domain data, regardless of dataset size. To address this challenge, researchers have implemented various strategies to improve domain-specific translation using Domain Adaptation (DA) methods (Saunders, 2022; Sharami et al., 2023). The DA process involves initially training a generic model, which is then fine-tuned on a domain-specific dataset (Chu and Wang, 2018). One approach to generating a domain-specific dataset is to select similar data from generic corpora for a specific language pair and then use both the general corpus (to train) and the domain-specific corpus (to fine-tune) for MT. In line with this approach, we developed a language-agnostic Python tool implementing the methodology proposed by Sharami et al. (2022). The tool uses monolingual domain-specific corpora to generate a parallel in-domain corpus, facilitating data selection for DA.
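The selection step described above can be sketched in a few lines. This is a minimal illustration only, assuming a simple bag-of-words cosine score against the monolingual in-domain corpus; the function names and the scoring method are hypothetical and not the tool's actual API:

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term frequencies for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def select_in_domain(generic_pairs, domain_corpus, top_k):
    """Rank generic (src, tgt) pairs by source-side similarity to the
    monolingual in-domain corpus and keep the top_k pairs."""
    domain_vec = vectorize(" ".join(domain_corpus))
    scored = sorted(generic_pairs,
                    key=lambda p: cosine(vectorize(p[0]), domain_vec),
                    reverse=True)
    return scored[:top_k]

# Hypothetical generic corpus and monolingual medical corpus.
generic = [("the patient received a dose of insulin", "..."),
           ("the cat sat on the mat", "..."),
           ("clinical trials measure drug efficacy", "...")]
medical = ["patient dose insulin clinical drug"]
print(select_in_domain(generic, medical, 2))
```

Real data-selection pipelines typically use stronger signals (for example, language-model cross-entropy difference), but the ranking structure is the same.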

    Impact of sampling technique on the performance of surrogate models generated with artificial neural network (ANN): A case study for a natural gas stabilization unit

    Data-driven models are essential tools for the development of surrogate models that can be used for the design, operation, and optimization of industrial processes. One approach to developing surrogate models is to use input-output data obtained from a process simulator. To enhance model robustness, proper sampling techniques are required to cover the entire domain of the process variables uniformly. In the present work, Monte Carlo pseudo-random samples as well as Latin hypercube samples and quasi-Monte Carlo samples with Hammersley Sequence Sampling (HSS) are generated. The sampled data obtained from the process simulator are fitted with neural networks to generate a surrogate model. An illustrative case study is solved to predict the performance of a gas stabilization unit. From the developed surrogate models, it can be concluded that, of the different sampling methods, Latin hypercube sampling and HSS perform better than pseudo-random sampling for designing the surrogate model. This conclusion is based on the maximum absolute value, standard deviation, and confidence interval of the relative average error obtained with the different sampling techniques.
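The difference between pseudo-random and Latin hypercube sampling can be sketched as follows (a minimal, dependency-free illustration; the function names are assumptions, and Hammersley sequence generation is omitted):

```python
import random

def pseudo_random(n, dims, rng):
    """Plain Monte Carlo: n points drawn uniformly on [0, 1)^dims."""
    return [[rng.random() for _ in range(dims)] for _ in range(n)]

def latin_hypercube(n, dims, rng):
    """Latin hypercube: each dimension is split into n equal strata and
    every stratum is hit exactly once, giving more uniform coverage."""
    points = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        strata = list(range(n))
        rng.shuffle(strata)  # random assignment of strata to points
        for i in range(n):
            points[i][d] = (strata[i] + rng.random()) / n
    return points

rng = random.Random(0)
sample = latin_hypercube(10, 2, rng)
# Property check: exactly one point per stratum in each dimension.
for d in range(2):
    occupied = sorted(int(p[d] * 10) for p in sample)
    assert occupied == list(range(10))
```

Pseudo-random points can cluster and leave gaps; the stratification above is why Latin hypercube designs tend to give better-behaved surrogate fits for the same sample budget.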

    A PVS-Simulink Integrated Environment for Model-Based Analysis of Cyber-Physical Systems

    This paper presents a methodology, with a supporting tool, for formal modeling and analysis of software components in cyber-physical systems. Using our approach, developers can integrate a simulation of logic-based specifications of software components and Simulink models of continuous processes. The integrated simulation is useful to validate the characteristics of discrete system components early in the development process. The same logic-based specifications can also be formally verified using the Prototype Verification System (PVS), to gain additional confidence that the software design complies with specific safety requirements. Modeling patterns are defined for generating the logic-based specifications from the more familiar automata-based formalism. The ultimate aim of this work is to facilitate the introduction of formal verification technologies in the software development process of cyber-physical systems, which typically requires the integrated use of different formalisms and tools. A case study from the medical domain is used to illustrate the approach. A PVS model of a pacemaker is interfaced with a Simulink model of the human heart. The overall cyber-physical system is co-simulated to validate design requirements through exploration of relevant test scenarios. Formal verification with the PVS theorem prover is demonstrated for specific safety aspects of the pacemaker design.
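The co-simulation idea, a discrete controller stepped in lock-step with a continuous plant, can be sketched as a toy loop. This is a hypothetical illustration only: the pacing logic, the LRI value, and the beat model are assumptions, not the PVS or Simulink models used in the paper:

```python
# Toy stand-in for a PVS (discrete logic) + Simulink (plant) split.
# All names and numeric values here are assumptions for illustration.
LRI = 1.0  # lower-rate interval (s): pace if no beat is sensed within it

def pacemaker_step(time_since_beat):
    """Discrete controller step: decide whether to emit a pacing pulse."""
    return time_since_beat >= LRI

def cosimulate(intrinsic_beats, dt=0.05, horizon=3.0):
    """Fixed-step loop alternating the 'plant' (sensed beats) and the
    controller; returns the times at which pacing pulses were emitted."""
    paced, since_steps = [], 0
    for k in range(int(horizon / dt)):
        t = k * dt
        if any(abs(t - b) < dt / 2 for b in intrinsic_beats):
            since_steps = 0          # intrinsic beat sensed: reset timer
        elif pacemaker_step(since_steps * dt):
            paced.append(round(t, 2))
            since_steps = 0          # a paced beat also resets the timer
        since_steps += 1
    return paced

# A pause after the beat at t=0.5 s forces pacing until the next beat.
print(cosimulate([0.5, 2.6]))
```

The value of the integrated environment described above is that the same discrete logic simulated here informally can also be verified in a theorem prover against safety properties such as "never more than LRI between beats".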

    Multi-source Pseudo-label Learning of Semantic Segmentation for the Scene Recognition of Agricultural Mobile Robots

    This paper describes a novel method of training a semantic segmentation model for environment recognition by agricultural mobile robots via unsupervised domain adaptation, exploiting publicly available datasets of outdoor scenes that differ from our target environments, i.e., greenhouses. In conventional semantic segmentation methods, labels are produced by manual annotation, which is a tedious and time-consuming task. One way to work around the need for manual annotation is unsupervised domain adaptation (UDA), which transfers knowledge from labeled source datasets to unlabeled target datasets. Most UDA methods for semantic segmentation are validated on tasks that adapt from non-photorealistic synthetic images of urban scenes to real ones. However, their effectiveness is not well studied for adaptation to other types of environments, such as greenhouses. In addition, it is not always possible to prepare appropriate source datasets for such environments. In this paper, we adapt an existing UDA training method to the task of training a model on greenhouse images. We propose to use multiple publicly available datasets of outdoor images as source datasets, and also propose a simple yet effective method of generating pseudo-labels by transferring knowledge from source datasets whose appearance and label sets differ from those of the target datasets. We demonstrate in experiments that combining our pseudo-label generation method with the existing training method improved performance by up to 14.3% mIoU compared to the best single-source training score. Comment: 10 pages, 7 figures, submitted to Machine Vision and Applications
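A simplified, agreement-based view of multi-source pseudo-label generation can be sketched as follows. This is an assumption-laden toy, not the paper's exact transfer scheme: here a pixel keeps a label only when all source-trained models agree, and disagreements are marked with an ignore index so the training loss skips them:

```python
def fuse_pseudo_labels(predictions, ignore=255):
    """Per-pixel fusion of predictions from models trained on different
    source datasets: keep a label only where all models agree, otherwise
    mark the pixel with `ignore` so the loss skips it during training."""
    fused = []
    for pixel_preds in zip(*predictions):
        first = pixel_preds[0]
        fused.append(first if all(p == first for p in pixel_preds) else ignore)
    return fused

# Two hypothetical source models labeling the same 5-pixel image
# (values are class ids; 255 is the conventional ignore index).
model_a = [0, 1, 1, 2, 0]
model_b = [0, 1, 2, 2, 1]
print(fuse_pseudo_labels([model_a, model_b]))
```

Restricting the pseudo-labels to confident, agreed-upon pixels is a common way to keep noisy source predictions from misleading the target-domain training.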

    From napkin sketches to reliable software

    In the past few years, model-driven software engineering (MDSE) and domain-specific modeling languages (DSMLs) have received a lot of attention from both research and industry. The main goal of MDSE is generating software from models that describe systems on a high level of abstraction. DSMLs are languages specifically designed to create such models. High-level models are refined into models on lower levels of abstraction by means of model transformations. The ability to model systems on a high level of abstraction using graphical diagrams partially explains the popularity of the informal modeling language UML. However, even designing simple software systems using such graphical diagrams can lead to large models that are cumbersome to create. To deal with this problem, we investigated the integration of textual languages into large, existing modeling languages by comparing two approaches and designed a DSML with a concrete syntax consisting of both graphical and textual elements. The DSML, called the Simple Language of Communicating Objects (SLCO), is aimed at modeling the structure and behavior of concurrent, communicating objects and is used as a case study throughout this thesis. During the design of this language, we also designed and implemented a number of transformations to various other modeling languages, leading to an iterative evolution of the DSML, which was influenced by the problem domain, the target platforms, model quality, and model transformation quality. Traditionally, the state-space explosion problem in model checking is handled by applying abstractions and simplifications to the model that needs to be verified. As an alternative, we demonstrate a model-driven engineering approach that works the other way around using SLCO. Instead of making a concrete model more abstract, we refine abstract models by transformation to make them more concrete, aiming at the verification of models that are as close to the implementation as possible. 
The results show that it is possible to validate more concrete models when fine-grained transformations are applied instead of coarse-grained transformations. Semantics are a crucial part of the definition of a language, and to verify the correctness of model transformations, the semantics of both the input and the output language must be formalized. For these reasons, we implemented an executable prototype of the semantics of SLCO that can be used to transform SLCO models into labeled transition systems (LTSs), allowing us to apply existing tools for the visualization and verification of LTSs to SLCO models. For given input models, we can use the prototype in combination with these tools to show, for each transformation that refines SLCO models, that the input and output models exhibit the same observable behavior. This, however, does not prove the correctness of these transformations in general. To prove that, we first formalized the semantics of SLCO in the form of structural operational semantics (SOS), based on the aforementioned prototype. Then, equivalence relations between LTSs were defined based on each transformation, and finally, these relations were shown to be either strong or branching bisimulations. In addition to this approach, we studied property preservation of model transformations without restricting ourselves to a fixed set of transformations. Our technique takes a property and a transformation and checks whether the transformation preserves the property. If a property holds for the initial model, which is often small and easy to analyze, and the property is preserved, then the refined model need not be analyzed as well. Combining the MDSE techniques discussed in this thesis enables generating reliable and correct software by means of refining model transformations from concise, formal models specified on a high level of abstraction using DSMLs.
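Checking that an input model and its refined output exhibit the same observable behavior amounts to computing a bisimulation over their LTSs. A naive partition-refinement sketch for strong bisimulation (illustrative only; the thesis toolchain uses dedicated verification tools, and the example LTS below is invented):

```python
def bisimulation_classes(states, transitions):
    """Naive partition refinement: split blocks until two states share a
    block iff, for every action, they reach the same set of blocks.
    `transitions` maps state -> set of (action, successor) pairs."""
    partition = [set(states)]
    while True:
        def signature(s):
            # For each outgoing transition, record (action, target block).
            return frozenset(
                (a, next(i for i, b in enumerate(partition) if t in b))
                for a, t in transitions.get(s, set()))
        refined = []
        for block in partition:
            groups = {}
            for s in block:
                groups.setdefault(signature(s), set()).add(s)
            refined.extend(groups.values())
        if len(refined) == len(partition):  # stable: no block was split
            return refined
        partition = refined

# Two tiny LTSs in one state space: p0/p1 stand for an input model,
# q0/q1/q2 for the output of a refining transformation; both alternate
# 'a' and 'b' actions forever, so their initial states are bisimilar.
trans = {
    "p0": {("a", "p1")}, "p1": {("b", "p0")},
    "q0": {("a", "q1")}, "q1": {("b", "q2")}, "q2": {("a", "q1")},
}
blocks = bisimulation_classes(["p0", "p1", "q0", "q1", "q2"], trans)
same = any({"p0", "q0"} <= b for b in blocks)
print(same)  # True: p0 and q0 end up in the same equivalence class
```

Production tools use far more efficient algorithms (e.g., Paige-Tarjan) and also support branching bisimulation, but the fixed-point structure is the same.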

    Towards Agent-Based Model Specification of Smart Grid: A Cognitive Agent-Based Computing Approach

    A smart grid can be considered a complex network in which each node represents a generation unit or a consumer, and links represent transmission lines. One way to study such complex systems is the agent-based modeling paradigm, which represents a complex system as autonomous agents interacting with each other. A number of studies have previously applied agent-based modeling in the smart grid domain; however, to the best of our knowledge, none of them has focused on the specification aspect of the model. Model specification is important not only for understanding but also for replication of the model. To fill this gap, this study focuses on specification methods for smart grid modeling. We adopt two specification methods: Overview, Design concepts, and Details (ODD) and Descriptive agent-based modeling. Using these specification methods, we provide tutorials and guidelines for developing smart grid models, from conceptual modeling to a validated agent-based model through simulation. The specification study is exemplified through a case study from the smart grid domain, in which we consider a large network where different consumers and power generation units are connected with each other in different configurations. In such a network, communication takes place between consumers and generating units for energy transmission and data routing. We demonstrate how to effectively model a complex system such as a smart grid using the specification methods, and we analyze the two approaches both qualitatively and quantitatively. Extensive experiments demonstrate that Descriptive agent-based modeling is more useful than the Overview, Design concepts, and Details method both for modeling and for replicating models of the smart grid.
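A toy agent-based grid step, in the spirit of the models being specified (purely illustrative: the agent classes, demand range, and greedy dispatch rule are assumptions, not the case-study model):

```python
import random

class Generator:
    """Agent representing a generation unit with a fixed capacity (kW)."""
    def __init__(self, capacity):
        self.capacity = capacity

class Consumer:
    """Agent representing a household with a random load each tick."""
    def __init__(self, rng):
        self.rng = rng
    def demand(self):
        return self.rng.uniform(0.5, 1.5)  # kW, assumed load range

def simulate_step(generators, consumers):
    """One tick of a toy grid: total demand is served from available
    capacity; any shortfall is reported as unmet load."""
    demand = sum(c.demand() for c in consumers)
    supply = sum(g.capacity for g in generators)
    served = min(demand, supply)
    return demand, served, demand - served

rng = random.Random(42)
gens = [Generator(3.0), Generator(2.0)]
cons = [Consumer(rng) for _ in range(6)]
demand, served, unmet = simulate_step(gens, cons)
print(f"demand={demand:.2f} kW, served={served:.2f} kW, unmet={unmet:.2f} kW")
```

A specification method such as ODD would document exactly this kind of structure explicitly: the entities (Generator, Consumer), their state variables, the scheduling of a tick, and the stochastic elements, so that another group could rebuild the model from the description alone.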

    Investigation of Colorado front range winter storms using a nonhydrostatic mesoscale numerical model designed for operational use

    Fall 1993. Also issued as the author's dissertation (Ph.D.) -- Colorado State University, 1993. Includes bibliographical references. State-of-the-art data sources such as Doppler radar, automated surface observations, wind profilers, digital satellite imagery, and aircraft reports are for the first time providing the capability to generate real-time, operational three-dimensional gridded data sets with sufficient spatial and temporal resolution to diagnose the structure and evolution of mesoscale systems. A prototype data assimilation system of this type, called the Local Analysis and Prediction System (LAPS), is being developed at the National Oceanic and Atmospheric Administration's Forecast Systems Laboratory (FSL). This investigation utilizes the three-dimensional LAPS analyses to initialize the full-physics, nonhydrostatic Regional Atmospheric Modeling System (RAMS) developed at Colorado State University, creating a system capable of generating operational mesoscale predictions. The LAPS/RAMS system, structured for operational use, can add significant value to existing operational model output and can improve scientific understanding of mesoscale weather events. The results are presented through two case study analyses: the 7 January 1992 Colorado Front Range blizzard and the 8-9 March 1992 eastern Colorado snowstorm. Both cases are ideal for this investigation due to the significant mesoscale variation observed in the precipitation and flow structure. The case study results demonstrate the ability to successfully detect and predict mesoscale features using a mesoscale numerical model initialized with high-resolution (10 km horizontal grid interval), nonhomogeneous data. Conceptual models of the two snowstorms are developed by utilizing the RAMS model output in combination with observations and other larger-domain model simulations.
The strong influence of the Colorado topography on the resultant flow is suggested by the generation of a lee vortex that frequently develops east of the Front Range and south of the Cheyenne Ridge in stable, northwest synoptic flow. The lee vortex, often called the "Longmont anticyclone", exhibits surface flow characteristics similar to those of low-Froude-number flow around an isolated obstacle. A series of numerical experiments using RAMS with idealized topography and horizontally homogeneous initial conditions is presented to investigate typical low-Froude-number flow characteristics in the vicinity of barriers representative of the Colorado topography. The results are compared to the findings of previous investigations and to the case study observations and numerical predictions. The findings suggest that the Colorado orography significantly altered the low-level flow in both case studies, resulting in mesoscale variation of the observed precipitation. Improved representation of the topography by the model led to the majority of the forecast improvement.
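The low-Froude-number regime mentioned above is governed by the nondimensional ratio Fr = U / (N h). A quick illustrative computation (the numerical values are generic for stable flow near a roughly 1 km barrier, not the case-study values):

```python
def froude_number(wind_speed, brunt_vaisala, barrier_height):
    """Fr = U / (N h): values below ~1 favor flow around, rather than
    over, an obstacle, consistent with lee-vortex formation."""
    return wind_speed / (brunt_vaisala * barrier_height)

# Illustrative values (assumed, not taken from the case studies).
U = 8.0     # m/s, cross-barrier wind speed
N = 0.012   # 1/s, Brunt-Vaisala frequency in stably stratified air
h = 1000.0  # m, barrier height
Fr = froude_number(U, N, h)
print(f"Fr = {Fr:.2f}")  # ~0.67: blocked-flow regime
```

With Fr well below one, the stable air lacks the kinetic energy to climb the barrier and is deflected around it, which is the mechanism invoked for the Longmont anticyclone.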

    CausaLM: Causal Model Explanation Through Counterfactual Language Models

    Understanding predictions made by deep neural networks is notoriously difficult, but also crucial to their dissemination. Like all ML-based methods, they are only as good as their training data, and can also capture unwanted biases. While there are tools that can help understand whether such biases exist, they do not distinguish between correlation and causation, and might be ill-suited for text-based models and for reasoning about high-level language concepts. A key problem in estimating the causal effect of a concept of interest on a given model is that this estimation requires the generation of counterfactual examples, which is challenging with existing generation technology. To bridge that gap, we propose CausaLM, a framework for producing causal model explanations using counterfactual language representation models. Our approach is based on fine-tuning deep contextualized embedding models with auxiliary adversarial tasks derived from the causal graph of the problem. Concretely, we show that by carefully choosing auxiliary adversarial pre-training tasks, language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest and be used to estimate its true causal effect on model performance. A byproduct of our method is a language representation model that is unaffected by the tested concept, which can be useful in mitigating unwanted bias ingrained in the data. Comment: Our code and data are available at: https://amirfeder.github.io/CausaLM/ Under review for the Computational Linguistics journal
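The causal-effect estimate at the core of such a framework reduces to comparing model predictions under the original and counterfactual representations. A minimal sketch (the probabilities below are invented for illustration; CausaLM's actual estimator and adversarial training procedure are more involved):

```python
def average_treatment_effect(p_original, p_counterfactual):
    """Estimate of a concept's causal effect on a classifier: the mean
    change in predicted probability when the concept is 'removed' from
    the representation. Both prediction lists are given, not computed."""
    assert len(p_original) == len(p_counterfactual)
    diffs = [o - c for o, c in zip(p_original, p_counterfactual)]
    return sum(diffs) / len(diffs)

# Hypothetical per-example positive-class probabilities from a sentiment
# model, before and after the concept-removed counterfactual encoding.
p_orig = [0.91, 0.85, 0.40, 0.78]
p_cf   = [0.72, 0.70, 0.38, 0.61]
print(round(average_treatment_effect(p_orig, p_cf), 3))
```

A large average difference suggests the model's predictions causally depend on the concept; a near-zero difference suggests the apparent association was correlational.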