207 research outputs found
Bile Leak following Laparoscopic Cholecystectomy due to Perforated Duodenal Ulcer in Patient with Roux-en-Y Gastric Bypass
Background and Aims. Screening for gastric diseases in symptomatic outpatients with conventional esophagogastroduodenoscopy (C-EGD) is expensive and has poor compliance. We aimed to explore the efficiency and safety of magnetic-controlled capsule gastroscopy (MCCG) in symptomatic outpatients who refused C-EGD. Methods. We performed a retrospective study of 76,794 consecutive symptomatic outpatients from January 2014 to October 2019. A total of 2,318 adults in the MCCG group who refused C-EGD were matched with adults in the C-EGD group using propensity-score matching (PSM). The detection rates of abnormalities were analyzed to explore the application of MCCG in symptomatic patients. Results. Our study demonstrated a prevalence of gastric ulcers (GUs) of 8.14% in patients with functional dyspepsia- (FD-) like symptoms. The detection rate of esophagitis and Barrett's esophagus was higher in patients with typical gastroesophageal reflux disease (GERD) symptoms than in patients in the other four groups. The detection rates of gastric ulcers in the five groups (abdominal pain, bloating, heartburn, follow-up, and bleeding) were significantly different. The total detection rate of gastric ulcers in symptomatic patients was 9.7%. A total of 7 advanced carcinomas were detected by MCCG and confirmed by endoscopic or surgical biopsy. The advanced gastric cancer detection rate did not differ significantly between the MCCG group and the matched C-EGD group in terms of nonhematemesis GI bleeding (2 vs. 2). In addition, the overall focal-lesion detection rate in the MCCG group was superior to that in the matched C-EGD group (224 vs. 184). MCCG achieved a clinically meaningful small-bowel diagnostic yield of 54.8% (17/31) in 31 cases of suspected small-bowel bleeding. No patient reported capsule retention at the two-week follow-up. Conclusion. MCCG is well tolerated, safe, and technically feasible, and has a considerable diagnostic yield.
The overall diagnostic yield for gastric focal lesions with MCCG was comparable to that with C-EGD. MCCG offered a supplementary diagnosis in patients whose previous C-EGD was nondiagnostic, indicating that MCCG could play an important role in the routine monitoring and follow-up of outpatients. MCCG is safe and efficient for symptomatic outpatient applications.
Energy-Efficient Work-Stealing Language Runtimes
Work stealing is a promising approach to constructing multithreaded runtimes for parallel programming languages. This paper presents HERMES, an energy-efficient work-stealing language runtime. The key insight is that threads in a work-stealing environment (thieves and victims) have varying impacts on the overall program running time, and coordinating their execution "tempo" can yield energy efficiency with minimal performance loss. The centerpiece of HERMES is two complementary algorithms to coordinate thread tempo: the workpath-sensitive algorithm determines the tempo of each thread based on thief-victim relationships on the execution path, whereas the workload-sensitive algorithm selects an appropriate tempo based on the size of work-stealing deques. We construct HERMES on top of Intel Cilk Plus's runtime and implement tempo adjustment through standard Dynamic Voltage and Frequency Scaling (DVFS). Benchmarks running on HERMES demonstrate an average of 11-12% energy savings with an average of 3-4% performance loss, based on meter-based measurements over commercial CPUs.
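To make the workload-sensitive idea concrete, here is a minimal, hypothetical Python sketch: a worker whose deque is nearly empty is likely to become a thief and can run at a low DVFS frequency, while a worker with a full deque is a likely victim and should run fast. The function name, thresholds, and frequency levels are all invented for illustration; they are not HERMES's actual policy.

```python
# Illustrative tempo selection from deque occupancy (thresholds and
# frequency levels in GHz are invented, not taken from the paper).
def select_tempo(deque_len, freqs=(1.2, 2.0, 2.8)):
    if deque_len == 0:
        return freqs[0]   # likely thief: run at the lowest frequency
    if deque_len < 8:
        return freqs[1]   # moderate load: middle frequency
    return freqs[2]       # heavily loaded victim: highest frequency
```

A real runtime would map the chosen tempo to a DVFS state, e.g. via the Linux cpufreq interface, rather than returning a number.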
Lace: non-blocking split deque for work-stealing
Work-stealing is an efficient method to implement load balancing in fine-grained task parallelism. Typically, concurrent deques are used for this purpose. A disadvantage of many concurrent deques is that they require expensive memory fences for local deque operations.

In this paper, we propose a new non-blocking work-stealing deque based on the split task queue. Our design uses a dynamic split point between the shared and the private portions of the deque, and only requires memory fences when shrinking the shared portion.

We present Lace, an implementation of work-stealing based on this deque, with an interface similar to the work-stealing library Wool, and an evaluation of Lace based on several common benchmarks. We also implement a recent approach using private deques in Lace. We show that the split deque and the private deque in Lace achieve low overhead and high scalability similar to Wool's.
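The split-point bookkeeping can be sketched as follows. This is a single-threaded, non-lock-free illustration of the idea only (the actual Lace deque uses atomic operations and careful fencing); the class and method names are invented.

```python
# Sketch of a split deque: tasks[:split] form the shared portion that
# thieves may steal from the head; tasks[split:] are private to the
# owner, who pushes and pops at the tail without synchronization.
class SplitDeque:
    def __init__(self):
        self.tasks = []
        self.split = 0  # number of tasks currently shared

    def push(self, task):
        self.tasks.append(task)          # owner: grow private portion

    def pop(self):
        if len(self.tasks) > self.split:
            return self.tasks.pop()      # owner: private work available
        return None                      # private portion is empty

    def grow_shared(self):
        # owner: move the split point to expose half the private work
        self.split = (self.split + len(self.tasks) + 1) // 2

    def steal(self):
        if self.split > 0:
            self.split -= 1              # shrink the shared portion
            return self.tasks.pop(0)     # thief: take from the head
        return None
```

In the real design, only operations that shrink the shared portion need a memory fence, which is what makes local pushes and pops cheap.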
Porting Decision Tree Algorithms to Multicore using FastFlow
The whole computer hardware industry has embraced multicores. On these machines, extreme optimisation of sequential algorithms is no longer sufficient to exploit the full machine power, which can only be tapped via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable for parallelisation. This paper presents an approach for easy yet efficient porting of an implementation of the C4.5 algorithm to multicores. The parallel port requires minimal changes to the original sequential code, and it achieves up to 7x speedup on an Intel dual quad-core machine.
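One source of the natural concurrency mentioned above is that candidate attribute splits at a tree node are independent and can be evaluated in parallel. The sketch below is hypothetical: it uses Gini impurity rather than C4.5's gain ratio, a `ThreadPoolExecutor` rather than FastFlow's farm pattern, and invented function names.

```python
# Hypothetical sketch: evaluate candidate splits of a decision-tree node
# concurrently; each attribute's evaluation is an independent task.
from concurrent.futures import ThreadPoolExecutor

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_score(rows, attr):
    # rows: list of (feature_tuple, label); lower score = purer split
    groups = {}
    for feats, label in rows:
        groups.setdefault(feats[attr], []).append(label)
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_split(rows, attrs):
    # evaluate every candidate attribute in parallel, keep the best
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda a: (split_score(rows, a), a), attrs))
    return min(scores)
```

The same pattern extends to expanding independent child nodes as separate tasks, which is where most of the parallelism in tree induction comes from.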
Resilient Optimistic Termination Detection for the Async-Finish Model
Driven by increasing core counts and decreasing mean time to failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a "finish" that signals the termination of all tasks within the group. For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional book-keeping when creating or completing tasks and finish scopes. Runtime systems which perform this book-keeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation, which can be resolved correctly at failure time, thereby reducing the overhead of failure-free execution. Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system offering automatic termination detection for dynamic graphs of non-migratable tasks.
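The basic async-finish semantics (before any resilience is added) can be sketched in a few lines: `finish` blocks until every task spawned inside it has terminated. This toy is shared-memory, not thread-safe, and single-process; the names `finish` and `async_` are illustrative stand-ins for the model's constructs.

```python
# Minimal sketch of async-finish: a finish scope collects the futures of
# all tasks spawned within it and waits for them on exit (termination
# detection). No distribution or failure handling is modeled here.
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

pool = ThreadPoolExecutor()
_scopes = []  # stack of pending-future lists, one per open finish

@contextmanager
def finish():
    _scopes.append([])
    try:
        yield
    finally:
        for f in _scopes.pop():
            f.result()  # block until every task in the group is done

def async_(fn, *args):
    _scopes[-1].append(pool.submit(fn, *args))
```

The protocol described in the abstract concerns exactly the bookkeeping that this `f.result()` loop hides: in a distributed, failure-prone setting, knowing that a group's tasks have all terminated requires extra messages, which optimistic finish defers until a failure actually occurs.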
TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism
As chip multi-processors (CMPs) become more and more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilizing complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards the use of fine-grained parallelism. Task-based programming models need to be able to handle such workloads and offer performance and scalability. Using specialized hardware to boost the performance of task-based programming models is a common practice in the research community.
Our paper makes the observation that task creation becomes a bottleneck when executing fine-grained parallel applications with many task-based programming models. As the number of cores increases, the time spent generating the tasks of the application becomes more critical to the overall execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then transfers this part of the runtime to specialized hardware. We derive the requirements for this hardware in order to accelerate the execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements of up to 15x, averaging 3.1x over the baseline. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, the authors would like to thank Thomas Grass for his valuable help with the simulator.
A domain-specific high-level programming model
Nowadays, computing hardware continues to move toward more parallelism and more heterogeneity to obtain more computing power. From personal computers to supercomputers, several levels of parallelism are expressed through the interconnection of multi-core and many-core accelerators. Computing software needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to fulfil this difficult task. The available PPMs are based on tasks, directives, or low-level languages or libraries, and offer higher or lower levels of abstraction from the architecture through their own syntax. However, to offer an efficient PPM with an additional, higher abstraction level while preserving performance, one idea is to restrict it to a specific domain and adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation and a dynamic runtime model of execution (StarPU). We show how the user can easily express a digital signal processing application and take advantage of task, data, and graph parallelism in the implementation, to enhance performance on targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi).
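The data-flow model of computation mentioned above can be illustrated with a toy scheduler: kernels are graph nodes, edges carry data, and any node whose inputs are ready may run. This sketch is sequential and purely illustrative; a runtime like StarPU would dispatch ready nodes concurrently across CPUs and accelerators.

```python
# Toy data-flow evaluator: run each node once all of its inputs exist.
def run_dataflow(nodes, edges, inputs):
    # nodes: {name: kernel_function}; edges: {name: [dependency names]}
    values = dict(inputs)
    pending = [n for n in nodes if n not in values]
    while pending:
        ready = [n for n in pending if all(d in values for d in edges[n])]
        for n in ready:
            values[n] = nodes[n](*[values[d] for d in edges[n]])
            pending.remove(n)
    return values
```

Task parallelism comes from running all `ready` nodes at once, data parallelism from splitting a node's data across workers, and graph parallelism from independent subgraphs: the three forms the abstract distinguishes.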
Energy-Efficient Multiprocessor Scheduling for Flow Time and Makespan
We consider energy-efficient scheduling on multiprocessors, where the speed of each processor can be individually scaled, and a processor consumes power s^α when running at speed s, for some constant α > 1. A scheduling algorithm needs to decide at any time both processor allocations and processor speeds for a set of parallel jobs with time-varying parallelism. The objective is to minimize the sum of the total energy consumption and a performance metric, which in this paper is either total flow time or makespan. For both objectives, we present instantaneous-parallelism clairvoyant (IP-clairvoyant) algorithms that are aware of the instantaneous parallelism of the jobs at any time but not their future characteristics, such as remaining parallelism and work. For total flow time plus energy, we present an O(1)-competitive algorithm, which significantly improves upon the best known non-clairvoyant algorithm and is the first constant-competitive result on multiprocessor speed scaling for parallel jobs. In the case of makespan plus energy, which is considered for the first time in the literature, we present a competitive algorithm whose ratio depends on the total number of processors. We show that this algorithm is asymptotically optimal by providing a matching lower bound. In addition, we also study non-clairvoyant scheduling for total flow time plus energy, and present an algorithm with a bounded competitive ratio for jobs with arbitrary release times and an improved ratio for jobs with identical release times. Finally, we prove a lower bound on the competitive ratio of any non-clairvoyant algorithm, matching the upper bound of our algorithm for jobs with identical release times.
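For concreteness, the two objectives can be written out in standard speed-scaling notation (a sketch, not reproduced from the paper: s_i(t) is the speed of processor i at time t, P the number of processors, F_j the flow time of job j, and T the makespan):

```latex
E \;=\; \int_{0}^{\infty} \sum_{i=1}^{P} s_i(t)^{\alpha}\, dt,
\qquad
\text{minimize } E + \sum_{j} F_j
\quad\text{or}\quad
\text{minimize } E + T .
```

The tension is that higher speeds shrink the performance term but inflate the energy term superlinearly (since α > 1), which is what the competitive analysis balances.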
Highly Scalable Multiplication for Distributed Sparse Multivariate Polynomials on Many-core Systems
We present a highly scalable algorithm for multiplying sparse multivariate polynomials represented in a distributed format. This algorithm targets not only shared-memory multicore computers, but also computer clusters and specialized hardware attached to a host computer, such as graphics processing units or many-core coprocessors. Scalability to a large number of cores is ensured by the absence of synchronization, locks, and false sharing during the main parallel step.
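A distributed (as opposed to recursive) polynomial representation stores the polynomial as a flat collection of (exponent vector, coefficient) terms. The sketch below shows that representation with a plain sequential multiply; the paper's contribution, the lock-free parallelization of this loop, is deliberately omitted, and the function name is illustrative.

```python
# Multiply two sparse multivariate polynomials in distributed form:
# each polynomial is a dict mapping an exponent tuple to a coefficient,
# e.g. x + y over variables (x, y) is {(1, 0): 1, (0, 1): 1}.
def poly_mul(p, q):
    r = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            e = tuple(a + b for a, b in zip(e1, e2))  # add exponents
            r[e] = r.get(e, 0) + c1 * c2              # accumulate coeff
    return {e: c for e, c in r.items() if c != 0}     # drop zero terms
```

The coefficient accumulation into `r` is exactly the shared-state update that a scalable parallel version must perform without locks or false sharing.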
- …