7 research outputs found

    УСОВЕРШЕНСТВОВАННЫЙ ПЛАНИРОВЩИК КООПЕРАТИВНОГО ВЫПОЛНЕНИЯ ПОТОКОВ НА МНОГОЯДЕРНОЙ СИСТЕМЕ

    Get PDF
    Three architectures of the cooperative thread scheduler in a multithreaded application that is executed on a multi-core system are considered. Architecture A0 is based on the synchronization and scheduling facilities, which are provided by the operating system. Architecture A1 introduces a new synchronization primitive and a single queue of the blocked threads in the scheduler, which reduces the interaction activity between the threads and operating system, and significantly speed up the processes of blocking and unblocking the threads. Architecture A2 replaces the single queue of blocked threads with dedicated queues, one for each of the synchronizing primitives, extends the number of internal states of the primitive, reduces the inter- dependence of the scheduling threads, and further significantly speeds up the processes of blocking and unblocking the threads. All scheduler architectures are implemented on Windows operating systems and based on the User Mode Scheduling. Important experimental results are obtained for multithreaded applications that implement two blocked parallel algorithms of solving the linear algebraic equation systems by the Gaussian elimination. The algorithms differ in the way of the data distribution among threads and by the thread synchronization models. The number of threads varied from 32 to 7936. Architecture A1 shows the acceleration of up to 8.65% and the architecture A2 shows the acceleration of up to 11.98% compared to A0 architecture for the blocked parallel algorithms computing the triangular form and performing the back substitution. On the back substitution stage of the algorithms, architecture A1 gives the acceleration of up to 125%, and architecture A2 gives the acceleration of up to 413% compared to architecture A0. The experiments clearly show that the proposed architectures, A1 and A2 outperform A0 depending on the number of thread blocking and unblocking operations, which happen during the execution of multi-threaded applications. The conducted computational experiments demonstrate the improvement of parameters of multithreaded applications on a heterogeneous multi-core system due the proposed advanced versions of the thread scheduler.Рассматриваются три архитектуры планировщика кооперативного выполнения потоков в многопоточном приложении, исполняемом на многоядерной системе. Архитектура А0 использует средства взаимодействия и синхронизации потоков, предоставляемые операционной системой. Архитектура А1 вводит новый примитив синхронизации потоков и единую для планировщика очередь заблокированных потоков, благодаря которым уменьшает активность взаимодействия потоков с операционной системой и значительно ускоряет процессы блокировки и разблокировки потоков. Архитектура А2 заменяет единую очередь заблокированных потоков на отдельные очереди для каждого примитива синхронизации и расширяет набор внутренних состояний примитива, уменьшая взаимозависимость потоков планирования и значительно ускоряя процессы блокировки и разблокировки рабочих потоков. Архитектуры планировщика реализованы в операционных системах Windows на базе технологии User Mode Scheduling. Важные экспериментальные результаты получены для многопоточных приложений, реализующих два блочно-параллельных алгоритма решения систем линейных алгебраических уравнений методом Гаусса. Алгоритмы различаются способами распределения данных между потоками и моделями синхронизации потоков. Число потоков варьировалось от 32 до 7936. Архитектура А1 показала ускорение до 8.65%, а архитектура А2 показала ускорение до 11.98 % по сравнению архитектурой А0 на блочно-параллельных алгоритмах с учетом их прямого и обратного хода. На обратном ходе алгоритмов архитектура А1 дала ускорение до 125 %, а архитектура А2 дала ускорение до 413 % по сравнению архитектурой А0. Эксперименты убедительно доказывают, что предлагаемые в статье архитектуры А1 и А2 выигрывают у А0 тем значительнее, чем большее количество блокировок и разблокировок потоков происходит во время выполнения многопоточного приложения

    Аналіз технології багатоядерного центрального процесора

    Get PDF
    Робота викладена на _30_ сторінках, у тому числі включає _9_ рисунків, список цитованої літератури із _19_ джерел.Об’єктом дослідження кваліфікаційної роботи є аналіз технології багатоядерного центрального процесора. Мета роботи полягає у аналізі технології що використовуються при побудові та створенню багатоядерного центрального процесора. При виконанні роботи було проаналізовано види, модифікації, переваги та недоліки технології багатоядерного центрального процесора, визначені фірми які є передовими в розробці багатоядерних процесорів, їхні особливості та принципи роботи з системою. У результаті проведення аналізу встановлено, що багатоядерний процесор є важливим доповненням у часовій шкалі мікропроцесорів та засобом вирішення проблем з високим споживанням енергії, ціною та розміром. Виділено архітектуру Intel та AMD, двох найважливіших постачальників сучасних багатоядерних систем, і порівняння зроблено з точки зору архітектур

    Task Activity Vectors: A Novel Metric for Temperature-Aware and Energy-Efficient Scheduling

    Get PDF
    This thesis introduces the abstraction of the task activity vector to characterize applications by the processor resources they utilize. Based on activity vectors, the thesis introduces scheduling policies for improving the temperature distribution on the processor chip and for increasing energy efficiency by reducing the contention for shared resources of multicore and multithreaded processors

    Analysis and evaluation of high performance web servers

    Get PDF
    Web servers are a very important tool when providing users with requested content on the Internet. Usage of the Internet is growing day-by-day, making those software applications essential. In the first part of the thesis, the web server world will be introduced to the reader, by giving a brief explanation of some of the available technologies as well as different dynamic protocols. Also, as there are different web servers available in the market, during this report it will be chosen the best performing ones. So, it will be presented a comparative chart between all of them in order to show the most important features of each one. Defining the scenario and the test cases is mandatory. For this reason, it is described the used hardware and software used to perform those benchmarks. The hardware is maintained equal during the whole test process, in order to let web server’s performance gaps to their internal architecture. Operating system and benchmarking tools are also described and given some examples. Furthermore, test cases are chosen to show some strengths and weakness of each web server, enabling us to compare the relative performance between them. Finally, the last part of the report consists on presenting the obtained results during the benchmark process, as well as presenting some lessons learned during the curse of the whole thesis, summing-up with some conclusions

    A shared-disk parallel cluster file system

    Get PDF
    Dissertação apresentada para obtenção do Grau de Doutor em Informática Pela Universidade Nova de Lisboa, Faculdade de Ciências e TecnologiaToday, clusters are the de facto cost effective platform both for high performance computing (HPC) as well as IT environments. HPC and IT are quite different environments and differences include, among others, their choices on file systems and storage: HPC favours parallel file systems geared towards maximum I/O bandwidth, but which are not fully POSIX-compliant and were devised to run on top of (fault prone) partitioned storage; conversely, IT data centres favour both external disk arrays (to provide highly available storage) and POSIX compliant file systems, (either general purpose or shared-disk cluster file systems, CFSs). These specialised file systems do perform very well in their target environments provided that applications do not require some lateral features, e.g., no file locking on parallel file systems, and no high performance writes over cluster-wide shared files on CFSs. In brief, we can say that none of the above approaches solves the problem of providing high levels of reliability and performance to both worlds. Our pCFS proposal makes a contribution to change this situation: the rationale is to take advantage on the best of both – the reliability of cluster file systems and the high performance of parallel file systems. We don’t claim to provide the absolute best of each, but we aim at full POSIX compliance, a rich feature set, and levels of reliability and performance good enough for broad usage – e.g., traditional as well as HPC applications, support of clustered DBMS engines that may run over regular files, and video streaming. pCFS’ main ideas include: · Cooperative caching, a technique that has been used in file systems for distributed disks but, as far as we know, was never used either in SAN based cluster file systems or in parallel file systems. As a result, pCFS may use all infrastructures (LAN and SAN) to move data. · Fine-grain locking, whereby processes running across distinct nodes may define nonoverlapping byte-range regions in a file (instead of the whole file) and access them in parallel, reading and writing over those regions at the infrastructure’s full speed (provided that no major metadata changes are required). A prototype was built on top of GFS (a Red Hat shared disk CFS): GFS’ kernel code was slightly modified, and two kernel modules and a user-level daemon were added. In the prototype, fine grain locking is fully implemented and a cluster-wide coherent cache is maintained through data (page fragments) movement over the LAN. Our benchmarks for non-overlapping writers over a single file shared among processes running on different nodes show that pCFS’ bandwidth is 2 times greater than NFS’ while being comparable to that of the Parallel Virtual File System (PVFS), both requiring about 10 times more CPU. And pCFS’ bandwidth also surpasses GFS’ (600 times for small record sizes, e.g., 4 KB, decreasing down to 2 times for large record sizes, e.g., 4 MB), at about the same CPU usage.Lusitania, Companhia de Seguros S.A, Programa IBM Shared University Research (SUR

    Resource Allocation for Software Pipelines in Many-core Systems

    Get PDF
    Many-core systems integrate a growing number of cores on a single chip and are expected to integrate hundreds and even thousands of cores soon. Despite their massive processing power, it is crucial to employ their resources efficiently to benefit from parallel processing. This dissertation tackles a major challenge, resource allocation, for complex, memory-intensive applications. The proposed methods allow to significantly improve the performance over the state of the art in many scenarios
    corecore