Instruction scheduling optimizations for energy efficient VLIW processors
Very Long Instruction Word (VLIW) processors are wide-issue statically scheduled
processors. Instruction scheduling for these processors is performed by the compiler
and is therefore a critical factor for their operation. Some VLIWs are clustered, a design
that improves scalability to higher issue widths while improving energy efficiency and
frequency. Their design is based on physically partitioning the shared hardware resources
(e.g., register file). Such designs further increase the challenges of instruction
scheduling since the compiler has the additional tasks of deciding on the placement
of the instructions to the corresponding clusters and orchestrating the data movements
across clusters.
In this thesis we propose instruction scheduling optimizations for energy-efficient
VLIW processors. Some of the techniques aim at improving the existing state-of-the-art
scheduling techniques, while others aim at using compiler techniques for closing
the gap between lightweight hardware designs and more complex ones. Each of the
proposed techniques targets individual features of energy-efficient VLIW architectures.
Our first technique, called Aligned Scheduling, makes use of a novel scheduling
heuristic for hiding memory latencies in lightweight VLIW processors without hardware
load-use interlocks (Stall-On-Miss, or SOM). With Aligned Scheduling, a software-only
technique, a SOM processor coupled with non-blocking caches can better cope with
cache latencies and perform closer to the heavyweight designs. Performance
is improved by up to 20% across a range of benchmarks from the Mediabench II and
SPEC CINT2000 benchmark suites.
The rest of the techniques target a class of VLIW processors known as clustered
VLIWs, which are more scalable and more energy efficient, and operate at higher frequencies,
than their monolithic counterparts.
The second scheme (LUCAS) is an improved scheduler for clustered VLIW processors
that addresses the high susceptibility of existing state-of-the-art schedulers to
inter-cluster communication latency. The proposed unified clustering
and scheduling technique is a hybrid scheme that performs instruction-by-instruction
switching between the two state-of-the-art clustering heuristics, leading to better
scheduling than either of them. It generates better-performing code than the
state of the art for a wide range of inter-cluster latency values on the Mediabench II
benchmarks.
The third technique (called CAeSaR) is a scheduler for clustered VLIW architectures
that minimizes inter-cluster communication by local caching and reuse of already
received data. Unlike dynamically scheduled processors, where this can be supported
by the register renaming hardware, in VLIWs it has to be done by the code generator.
The proposed instruction scheduler unifies cluster assignment, instruction scheduling
and communication minimization in a single algorithm, solving the phase ordering
issues between all three parts. The proposed scheduler shows an improvement
in execution time of up to 20.3% and 13.8% on average across a range of benchmarks
from the Mediabench II and SPEC CINT2000 benchmark suites.
The last technique applies to heterogeneous clustered VLIWs that support dynamic
voltage and frequency scaling (DVFS) independently per cluster. In these processors
there are no hardware interlocks between clusters to honor the data dependencies.
Instead, the scheduler has to be aware of the DVFS decisions to guarantee correct
execution. Effectively controlling DVFS, to selectively decrease the frequency of clusters
with slack in their schedule, can lead to significant energy savings. The proposed
technique (called UCIFF) solves the phase ordering problem between frequency selection
and scheduling that is present in existing algorithms. The results show that UCIFF
produces better code than the state of the art, and very close to optimal, across the
Mediabench II benchmarks.
Overall, the proposed instruction scheduling techniques either lead to better efficiency
on existing designs or allow simpler lightweight designs to be competitive
against ones with more complex hardware.
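To make the clustered scheduling problem described in this abstract concrete, here is a minimal sketch of a greedy scheduler that picks a cluster and an issue cycle for each instruction in one pass. It is not LUCAS, CAeSaR, or any algorithm from the thesis. The function name `schedule`, the one-instruction-per-cycle-per-cluster issue limit, and the fixed `icc_latency` copy cost are all assumptions made for illustration.

```python
from collections import defaultdict


def schedule(instrs, deps, latency, n_clusters, icc_latency):
    """Greedy unified cluster assignment and scheduling (illustrative only).

    instrs:      instruction names, already in topological order
    deps:        dict mapping an instruction to the instructions it reads from
    latency:     dict mapping an instruction to its latency in cycles
    icc_latency: extra cycles to copy a value between clusters (assumed fixed)
    Assumption: each cluster issues at most one instruction per cycle.
    Returns a dict mapping each instruction to (cluster, start_cycle).
    """
    placed = {}                # instruction -> (cluster, start cycle)
    busy = defaultdict(set)    # cluster -> issue cycles already taken
    for ins in instrs:
        best = None
        for c in range(n_clusters):
            # Earliest cycle at which every operand is available on cluster c.
            ready = 0
            for p in deps.get(ins, []):
                p_cluster, p_start = placed[p]
                copy_cost = icc_latency if p_cluster != c else 0
                ready = max(ready, p_start + latency[p] + copy_cost)
            # Skip cycles where this cluster's single issue slot is taken.
            while ready in busy[c]:
                ready += 1
            if best is None or ready < best[1]:
                best = (c, ready)
        placed[ins] = best
        busy[best[0]].add(best[1])
    return placed


if __name__ == "__main__":
    unit = {"a": 1, "b": 1, "c": 1}
    print(schedule(["a", "b", "c"], {"c": ["a", "b"]}, unit, 2, 2))
```

With two clusters and an inter-cluster latency of 2, this greedy choice places `b` on the second cluster and `c` cannot start before cycle 3, whereas keeping all three instructions on one cluster would let `c` start at cycle 2. That sensitivity to inter-cluster latency is the kind of weakness the abstract attributes to existing schedulers.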
COMET: Communication-optimised multi-threaded error-detection technique
© 2016 ACM. Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against these, current error detection techniques use different types of redundancy at the hardware or the software level. A consequence of these additional protection mechanisms is that these systems tend to become slower. In particular, software error-detection techniques degrade performance considerably, limiting their uptake. This paper focuses on software redundant multi-threading error detection, a compiler-based technique that makes use of redundant cores within a multi-core system to perform error checking. Implementations of this scheme feature two threads that execute almost the same code: the main thread runs the original code and the checker thread executes code to verify the correctness of the original. The main thread communicates the values that require checking to the checker thread to use in its comparisons. We identify a major performance bottleneck in existing schemes: poorly performing inter-core communication and the generated code associated with it. Our study shows this is a major performance impediment within existing techniques since the two threads require extremely fine-grained communication, on the order of every few instructions. We alleviate this bottleneck with a series of code generation optimisations at the compiler level. We propose COMET (Communication-Optimised Multi-threaded Error-detection Technique), which improves performance across the NAS parallel benchmarks by 31.4% (on average) compared to the state of the art, without affecting fault coverage.
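The main-thread/checker-thread structure described in this abstract can be sketched roughly as follows. This is a hand-written illustration of redundant multi-threading in general, not COMET's compiler-generated code. The names `run_with_checker` and `fault_at` are hypothetical, and the bit flip is an artificial stand-in for a soft error.

```python
import queue
import threading


def run_with_checker(inputs, compute, fault_at=None):
    """Run compute() on a main thread and re-check every result on a checker.

    fault_at is a hypothetical knob: the main thread flips one bit of the
    result for that input to simulate a transient error.
    Returns the list of inputs whose results failed the check.
    """
    channel = queue.Queue()    # main -> checker communication
    mismatches = []

    def main_thread():
        for x in inputs:
            y = compute(x)     # original computation
            if x == fault_at:
                y ^= 1         # injected bit flip (simulation only)
            channel.put((x, y))    # one message per value to be checked
        channel.put(None)      # sentinel: no more values

    def checker_thread():
        while True:
            item = channel.get()
            if item is None:
                break
            x, y = item
            if compute(x) != y:    # redundant execution and comparison
                mismatches.append(x)

    threads = [
        threading.Thread(target=main_thread),
        threading.Thread(target=checker_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches


if __name__ == "__main__":
    print(run_with_checker(range(5), lambda x: x * x, fault_at=3))
```

Every checked value costs one queue operation on each side, which illustrates the fine-grained inter-core communication that the abstract identifies as the bottleneck.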
Lynx: Using OS and hardware support for fast fine-grained inter-core communication
Designing high-performance software queues for fast inter-core communication is challenging, but critical for maximising software parallelism. State-of-the-art single-producer / single-consumer queues for streaming applications contain multiple sections, requiring the producer and consumer to operate independently on different sections from each other. While these queues perform well for coarse-grained data transfers, they perform poorly in the fine-grained case. This paper proposes Lynx, a novel SP/SC queue, specifically tuned for fine-grained communication. Lynx is built from the ground up, reducing the generated code on the critical path to just two operations per enqueue and dequeue. To achieve this it relies on existing commodity processor hardware and operating system exception handling support to deal with infrequent queue maintenance operations. Lynx outperforms the state of the art by up to 1.57× in total 64-bit throughput, reaching a peak throughput of 15.7 GB/s on a common desktop system. Real applications using Lynx get a performance improvement of up to 1.4×. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant reference EP/K026399/1. This is the author accepted manuscript. The final version is available from Association for Computing Machinery via http://dx.doi.org/10.1145/2925426.2926274
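For contrast with the two-operation critical path described in this abstract, the sketch below shows a conventional bounded SP/SC ring buffer in which every enqueue and dequeue performs its own full/empty check and index wraparound. It is a generic baseline written for illustration, not Lynx and not any of the queues the paper evaluates. The class name `SpscRing` and its methods are hypothetical, and Python is used only to show the bookkeeping, not to measure performance.

```python
class SpscRing:
    """Bounded single-producer/single-consumer ring buffer (generic baseline)."""

    def __init__(self, capacity):
        # One slot is kept empty so that "full" and "empty" are distinguishable.
        self._buf = [None] * (capacity + 1)
        self._head = 0    # next slot to read; advanced only by the consumer
        self._tail = 0    # next slot to write; advanced only by the producer

    def try_enqueue(self, item):
        """Producer side: returns False if the queue is full."""
        nxt = (self._tail + 1) % len(self._buf)    # wraparound on every call
        if nxt == self._head:                      # full check on every call
            return False
        self._buf[self._tail] = item
        self._tail = nxt
        return True

    def try_dequeue(self):
        """Consumer side: returns (True, item), or (False, None) if empty."""
        if self._head == self._tail:               # empty check on every call
            return (False, None)
        item = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        return (True, item)


if __name__ == "__main__":
    q = SpscRing(2)
    q.try_enqueue("x")
    print(q.try_dequeue())
```

According to the abstract, Lynx removes this kind of work from the common path by letting existing processor hardware and operating system exception handling deal with the infrequent maintenance operations.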
Improvements in reading and spelling skills after a phonological and morphological knowledge intervention in Greek children with spelling difficulties: a pilot study
In this study, we evaluated the effects of the online computer-based training program “Lexilogy-Greek” on the reading and spelling performance of young poor readers and spellers. The training is based on psycholinguistic principles that emphasize the importance of acquiring efficient phonological as well as morphological knowledge in remediating reading and spelling difficulties. Our sample consisted of fifteen 5th and 6th grade primary school children. Reading and spelling were tested at three points, with a no-intervention period and subsequently an intervention period in between these time points. We adopted a single group repeated measurement design and tested for intervention effects using repeated measures ANOVAs. The results revealed substantial treatment effects on spelling, word reading fluency and text reading fluency.
Developmental surface and phonological dyslexia in both Greek and English.
The hallmark of developmental surface dyslexia in English and French is inaccurate reading of words with atypical spelling-sound correspondences. According to Douklias, Masterson and Hanley (2009), surface dyslexia can also be observed in Greek (a transparent orthography for reading that does not contain words of this kind). Their findings suggested that surface dyslexia in Greek can be characterized by slow reading of familiar words, and by inaccurate spelling of words with atypical sound-spelling correspondences (Greek is less transparent for spelling than for reading). In this study, we report seven adult cases whose slow reading and impaired spelling accuracy satisfied these criteria for Greek surface dyslexia. When asked to read words with atypical grapheme-phoneme correspondences in English (their second language), their accuracy was severely impaired. A co-occurrence was also observed between impaired spelling of words with atypical phoneme-grapheme correspondences in English and Greek. These co-occurrences provide strong evidence that surface dyslexia genuinely exists in Greek and that slow reading of real words in Greek reflects the same underlying impairment as that which produces inaccurate reading of atypical words in English. Two further individuals were observed with impaired reading and spelling of nonwords in both languages, consistent with developmental phonological dyslexia. Neither of the phonological dyslexics read words slowly. In terms of computational models of reading aloud, these findings suggest that slow reading by dyslexics in transparent orthographies is the consequence of a developmental impairment of the lexical (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Perry, Ziegler, & Zorzi, 2010) or semantic reading route (Plaut, McClelland, Seidenberg, & Patterson, 1996).
This outcome provides evidence that the neurophysiological substrate(s) that support the lexical/semantic and the phonological pathways that are involved in reading and spelling are the same in both Greek and English.
HelexKids: a word frequency database for Greek and Cypriot primary school children
In this article, we introduce HelexKids, an online written-word database for Greek-speaking children in primary education (Grades 1 to 6). The database is organized on a grade-by-grade basis, and on a cumulative basis by combining Grade 1 with Grades 2 to 6. It provides values for Zipf, frequency per million, dispersion, estimated word frequency per million, standard word frequency, contextual diversity, orthographic Levenshtein distance, and lemma frequency. These values are derived from 116 textbooks used in primary education in Greece and Cyprus, producing a total of 68,692 different word types. HelexKids was developed to assist researchers in studying language development, educators in selecting age-appropriate items for teaching, as well as writers and authors of educational books for Greek/Cypriot children. The database is open access and can be searched online at www.helexkids.org
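As a small worked example of two of the measures listed in this abstract, the sketch below computes frequency per million and a Zipf value from a raw count. It assumes the commonly used Zipf scale, log10 of frequency per million plus 3. The abstract does not say exactly how HelexKids computes its values (for instance, whether counts are smoothed), so the function names and the numbers in the example are illustrative only.

```python
import math


def frequency_per_million(count, corpus_size):
    """Occurrences of a word per million tokens of the corpus."""
    return count * 1_000_000 / corpus_size


def zipf(count, corpus_size):
    """Zipf value under the assumed definition: log10(freq. per million) + 3."""
    return math.log10(frequency_per_million(count, corpus_size)) + 3


if __name__ == "__main__":
    # Hypothetical numbers: a word seen 50 times in a 2,000,000-token corpus.
    print(frequency_per_million(50, 2_000_000), zipf(50, 2_000_000))
```

On this scale a word occurring once per million tokens has a Zipf value of 3, and each tenfold increase in frequency adds 1.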