207 research outputs found

    Bile Leak following Laparoscopic Cholecystectomy due to Perforated Duodenal Ulcer in Patient with Roux-en-Y Gastric Bypass

    Get PDF
    Background and Aims. Screening for gastric diseases in symptomatic outpatients with conventional esophagogastroduodenoscopy (C-EGD) is expensive and has poor compliance. We aimed to explore the efficiency and safety of magnetic-controlled capsule gastroscopy (MCCG) in symptomatic outpatients who refused C-EGD. Methods. We performed a retrospective study of 76794 consecutive symptomatic outpatients from January 2014 to October 2019. A total of 2318 adults () in the MCCG group who refused C-EGD were matched with adults in the C-EGD group using propensity-score matching (PSM). The detection rates of abnormalities were analyzed to explore the application of MCCG in symptomatic patients. Results. Our study demonstrated a prevalence of gastric ulcers (GUs) in patients with functional dyspepsia- (FD-) like symptoms of 8.14%. The detection rate of esophagitis and Barrett’s esophagus was higher in patients with typical gastroesophageal reflux disease (GERD) symptoms than in patients in the other four groups (). The detection rates of gastric ulcers in the five groups (abdominal pain, bloating, heartburn, follow-up, and bleeding) were significantly different (). The total detection rate of gastric ulcers in symptomatic patients was 9.7%. A total of 7 advanced carcinomas were detected by MCCG and confirmed by endoscopic or surgical biopsy. The advanced gastric cancer detection rate was not significantly different between the MCCG group and the C-EGD matched group in terms of nonhematemesis GI bleeding (2 vs. 2, ). In addition, the overall focal lesion detection rate in the MCCG group was superior to that in the C-EGD matched group (224 vs. 184, ). MCCG gained a clinically meaningful small bowel diagnostic yield of 54.8% (17/31) out of 31 cases of suspected small bowel bleeding. No patient reported capsule retention at the two-week follow-up. Conclusion. MCCG is well tolerated, safe, and technically feasible and has a considerable diagnostic yield. The overall gastric diagnostic yield of gastric focal lesions with MCCG was comparable to that with C-EGD. MCCG offered a supplementary diagnosis in patients who had a previously undiagnostic C-EGD, indicating that MCCG could play an important role in the routine monitoring and follow-up of outpatient. MCCG shows its safety and efficiency in symptomatic outpatient applications

    Energy-Efficient Work-Stealing Language Runtimes

    Get PDF
    Work stealing is a promising approach to constructing multithreaded program runtimes of parallel programming languages. This paper presents HERMES, an energy-efficient work-stealing language runtime. The key insight is that threads in a work-stealing environment – thieves and victims – have varying impacts on the overall program running time, and a coordination of their execution “tempo ” can lead to energy efficiency with minimal performance loss. The centerpiece of HERMES is two complementary algorithms to coordinate thread tempo: the workpath-sensitive algorithm determines tempo for each thread based on thief-victim relationships on the execution path, whereas the workload-sensitive algorithm selects appropriate tempo based on the size of work-stealing deques. We construct HERMES on top of Intel Cilk Plus’s runtime, and implement tempo adjustment through standard Dynamic Voltage and Frequency Scaling (DVFS). Benchmarks running on HERMES demonstrate an average of 11-12 % energy savings with an average of 3-4% performance loss through meter-based measurements over commercial CPUs. 1

    Lace: non-blocking split deque for work-stealing

    Get PDF
    Work-stealing is an efficient method to implement load balancing in fine-grained task parallelism. Typically, concurrent deques are used for this purpose. A disadvantage of many concurrent deques is that they require expensive memory fences for local deque operations.\ud \ud In this paper, we propose a new non-blocking work-stealing deque based on the split task queue. Our design uses a dynamic split point between the shared and the private portions of the deque, and only requires memory fences when shrinking the shared portion.\ud \ud We present Lace, an implementation of work-stealing based on this deque, with an interface similar to the work-stealing library Wool, and an evaluation of Lace based on several common benchmarks. We also implement a recent approach using private deques in Lace. We show that the split deque and the private deque in Lace have similar low overhead and high scalability as Wool

    Porting Decision Tree Algorithms to Multicore using FastFlow

    Full text link
    The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.Comment: 18 pages + cove

    Resilient Optimistic Termination Detection for the Async-Finish Model

    Get PDF
    International audienceDriven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a ‘finish’ that signals the termination of all tasks within the group.For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional book-keeping when creating or completing tasks and finish-scopes. Runtime systems which perform this book-keeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation which can be resolved correctly at failure time, thereby reducing the overhead for failure-free execution.Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system offering automatic termination detection for dynamic graphs of non-migratable tasks

    TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism

    Get PDF
    As chip multi-processors (CMPs) are becoming more and more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilize complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards the use of fine grained parallelism. Task-based programming models need to be able to handle such workloads and offer performance and scalability. Using specialized hardware for boosting performance of task-based programming models is a common practice in the research community. Our paper makes the observation that task creation becomes a bottleneck when we execute fine grained parallel applications with many task-based programming models. As the number of cores increases the time spent generating the tasks of the application is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. We draw the requirements for this hardware in order to boost execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements up to 15×, averaging to 3.1× over the baseline.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, the authors would like to thank Thomas Grass for his valuable help with the simulator.Peer ReviewedPostprint (author's final draft

    A domain-specific high-level programming model

    No full text
    International audienceNowadays, computing hardware continues to move toward more parallelism and more heterogeneity, to obtain more computing power. From personal computers to supercomputers, we can find several levels of parallelism expressed by the interconnections of multi-core and many-core accelerators. On the other hand, computing software needs to adapt to this trend, and programmers can use parallel programming models (PPM) to fulfil this difficult task. There are different PPMs available that are based on tasks, directives, or low level languages or library. These offer higher or lower abstraction levels from the architecture by handling their own syntax. However, to offer an efficient PPM with a greater (additional) high-levelabstraction level while saving on performance, one idea is to restrict this to a specific domain and to adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation, and a dynamic runtime model of execution (StarPU). We show how the user can easily express this digital signal processing application, and can take advantage of task, data and graph parallelism in the implementation, to enhance the performances of targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi

    Revisiting the Cache Miss Analysis of Multithreaded Algorithms

    Full text link

    Energy-Efficient Multiprocessor Scheduling for Flow Time and Makespan

    Full text link
    We consider energy-efficient scheduling on multiprocessors, where the speed of each processor can be individually scaled, and a processor consumes power sαs^{\alpha} when running at speed ss, for α>1\alpha>1. A scheduling algorithm needs to decide at any time both processor allocations and processor speeds for a set of parallel jobs with time-varying parallelism. The objective is to minimize the sum of the total energy consumption and certain performance metric, which in this paper includes total flow time and makespan. For both objectives, we present instantaneous parallelism clairvoyant (IP-clairvoyant) algorithms that are aware of the instantaneous parallelism of the jobs at any time but not their future characteristics, such as remaining parallelism and work. For total flow time plus energy, we present an O(1)O(1)-competitive algorithm, which significantly improves upon the best known non-clairvoyant algorithm and is the first constant competitive result on multiprocessor speed scaling for parallel jobs. In the case of makespan plus energy, which is considered for the first time in the literature, we present an O(ln⁥1−1/αP)O(\ln^{1-1/\alpha}P)-competitive algorithm, where PP is the total number of processors. We show that this algorithm is asymptotically optimal by providing a matching lower bound. In addition, we also study non-clairvoyant scheduling for total flow time plus energy, and present an algorithm that achieves O(ln⁥P)O(\ln P)-competitive for jobs with arbitrary release time and O(ln⁥1/αP)O(\ln^{1/\alpha}P)-competitive for jobs with identical release time. Finally, we prove an Ω(ln⁥1/αP)\Omega(\ln^{1/\alpha}P) lower bound on the competitive ratio of any non-clairvoyant algorithm, matching the upper bound of our algorithm for jobs with identical release time

    Highly Scalable Multiplication for Distributed Sparse Multivariate Polynomials on Many-core Systems

    Full text link
    We present a highly scalable algorithm for multiplying sparse multivariate polynomials represented in a distributed format. This algo- rithm targets not only the shared memory multicore computers, but also computers clusters or specialized hardware attached to a host computer, such as graphics processing units or many-core coprocessors. The scal- ability on the large number of cores is ensured by the lacks of synchro- nizations, locks and false-sharing during the main parallel step.Comment: 15 pages, 5 figure
    • 

    corecore