
    In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (TLBs)

    The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU flushes many instructions that had already been brought into the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline, representing significant work that is wasted. In addition, each exception detected incurs an overhead of several cycles and wasted energy to refetch and reexecute the flushed instructions. This paper concentrates on improving the performance of precisely handling software-managed translation look-aside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first-level interrupt handlers of TLBs are usually small, they can potentially fit in the reorder buffer alongside the user-level code already there. As a result, the instructions that would otherwise be flushed from the pipe need not be refetched and reexecuted, and instructions independent of the exceptional instruction can continue to execute in parallel with the handler code. In-lining the TLB interrupt handler thus provides lock-up free TLBs. This paper proposes the prepend and append schemes for in-lining the interrupt handler into the available reorder buffer space. The two schemes are implemented on a performance model of the Alpha 21264 processor built by Alpha designers at the Palo Alto Design Center (PADC), California. We compare the overhead and performance impact of handling TLB interrupts with the traditional scheme, the append in-lined scheme, and the prepend in-lined scheme. For small, medium, and large memory footprints, the overhead is quantified by comparing the number and pipeline state of instructions flushed, the energy savings, and the performance improvements. We find that lock-up free TLBs reduce the overhead of refetching and reexecuting the flushed instructions by 30-95 percent, reduce the execution time by 5-25 percent, and reduce the energy wasted by 30-90 percent.
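
    The core idea lends itself to a small illustration. The following Python sketch is a toy list-based model, not the paper's Alpha 21264 simulator; the ROB size, handler length, and the exact scheme semantics are simplified assumptions. It contrasts the traditional flush with the append and prepend in-lining schemes by counting how many instructions must be refetched:

```python
# Toy model: the ROB is a list of instruction tags; a TLB miss occurs at
# index miss_idx, and the first-level handler is a short instruction list.
ROB_SIZE = 16

def traditional(rob, miss_idx, handler):
    # Flush the faulting instruction and everything younger, run the
    # handler, then refetch the flushed instructions.
    flushed = rob[miss_idx:]
    return rob[:miss_idx] + handler + flushed, len(flushed)

def append_inline(rob, miss_idx, handler):
    # Append the handler at the ROB tail if it fits: nothing is flushed,
    # and instructions independent of the miss keep executing.
    if len(rob) + len(handler) <= ROB_SIZE:
        return rob + handler, 0
    return traditional(rob, miss_idx, handler)   # fall back on overflow

def prepend_inline(rob, miss_idx, handler):
    # Insert the handler just ahead of the faulting instruction so the
    # new TLB entry is ready when that instruction reissues.
    if len(rob) + len(handler) <= ROB_SIZE:
        return rob[:miss_idx] + handler + rob[miss_idx:], 0
    return traditional(rob, miss_idx, handler)

user_code = [f"i{n}" for n in range(12)]
handler = ["h0", "h1", "h2"]   # small first-level TLB-miss handler
for scheme in (traditional, append_inline, prepend_inline):
    _, refetched = scheme(user_code, 5, handler)
    print(f"{scheme.__name__:15} refetched={refetched}")
```

    In both in-lined schemes nothing is refetched when the handler fits; the prepend variant additionally places the handler ahead of the faulting instruction so that the new translation is available when that instruction replays.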

    The Effects of Aggressive Out-of-Order Mechanisms on the Memory Sub-System

    Contrary to existing work that demonstrates significant performance improvements with larger reorder buffers, the work presented in this dissertation shows that larger instruction windows do not necessarily provide significant improvements in performance. Using detailed models of the DRAM system and the memory subsystem, we show that increasing out-of-order aggressiveness by growing reorder buffer sizes beyond 128 entries no longer buys any improvement in processor performance; in fact, it can actually degrade performance. Additionally, this dissertation demonstrates a non-intuitive problem associated with the out-of-order execution of memory instructions: the reordering of memory instructions can degrade the performance of the memory subsystem. Specifically, we show that increasing out-of-order aggressiveness in terms of reorder buffer size increases the frequency of replay traps and data cache misses. The presentation of this problem is itself significant: the very mechanisms commonly used to improve performance are sources of performance degradation in the memory subsystem. We observe that while the negative effects of out-of-order execution arise for only a small fraction of the time with small reorder buffers, eliminating other sources of stalls by increasing out-of-order capability causes these unexpected side effects in the memory subsystem to become a significant overhead. This reveals that one cannot overlook rarely occurring events in the memory subsystem. To gain insight into the source of the problem, we measure the degree to which memory system performance relies on out-of-order execution. Borrowing the concept of windowing from network communication, we vary the load/store scheduling window independently of the ALU scheduling window. Our study reveals that memory instructions issued out of order are the primary reason for the increase in the frequency of replay traps; the out-of-order issue of memory instructions is likewise responsible for both the constructive and the destructive references to the data cache. Incorporating detailed memory subsystem models and a realistic DRAM model into existing simulators, and filtering the destructive references out of the total cache references, can allow aggressive out-of-order cores to reap the true benefits of out-of-order execution.
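
    A minimal sketch of the windowing experiment may help. The scheduling model below is our own simplification (single issue per cycle, an abstract per-instruction readiness time), not the dissertation's simulator. Shrinking mem_window toward 1 forces loads and stores back toward program order while ALU instructions still issue out of order:

```python
def issue_order(trace, mem_window, alu_window, ready_at):
    """trace: (kind, id) pairs in program order, kind in {'mem', 'alu'}.
    ready_at[id]: cycle at which that instruction's operands are ready.
    Each cycle, issue the oldest ready instruction whose distance from
    the head of the pending queue is within its class's window."""
    pending, order, cycle = list(trace), [], 0
    while pending:
        for pos, (kind, iid) in enumerate(pending):
            window = mem_window if kind == "mem" else alu_window
            if pos < window and ready_at[iid] <= cycle:
                order.append(pending.pop(pos))
                break
        cycle += 1          # one issue slot per cycle; stall otherwise
    return order

trace = [("mem", 0), ("alu", 1), ("mem", 2), ("alu", 3)]
ready = {0: 5, 1: 0, 2: 0, 3: 0}    # the oldest load waits on DRAM
# mem_window=1 keeps memory ops in program order: load 2 cannot bypass
# the stalled load 0, although the ALU ops still issue around it.
print(issue_order(trace, mem_window=1, alu_window=4, ready_at=ready))
# mem_window=4 lets load 2 issue out of order, ahead of load 0.
print(issue_order(trace, mem_window=4, alu_window=4, ready_at=ready))
```

    In this toy run, the load that issues out of order is exactly the kind of reference that, in a real machine, can displace useful cache lines or trigger a replay trap.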

    Address spaces and virtual memory: specification, implementation, and correctness

    In modern operating systems, tasks operate concurrently on a logical memory. Address spaces control access rights to and the sharing of that memory. They are associated with tasks and manipulated dynamically by memory management operations of the operating system. For cost reasons, logical memory and address spaces are not implemented directly but simulated. The contents of the logical memory are placed in two different memories, the main and the swap memory. Tasks access their address space through an architecturally defined address translation mechanism, which is implemented by the memory management unit (MMU) and optimized with a translation look-aside buffer (TLB). This mechanism either redirects a memory access to some main memory location or generates a page fault exception resulting in a call to the page fault handler, a low-level operating system procedure. This construction is correct iff it is transparent to the tasks, so that they behave as if they operated directly on the logical memory under the control of their address spaces. We call the formalization of this correctness criterion a virtual memory simulation theorem. In this thesis we formulate and prove such a theorem for an abstract multiprocessor. We apply the theorem to a concrete implementation, a VAMP [BJK+03] with a single-level address translation mechanism and an exemplary page fault handler. We show how to extend the architecture and proofs to support TLBs, multi-level translation, and multiprocessing.
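
    The translation mechanism described above can be sketched in a few lines of Python. The dictionary-based TLB and page table, the swap_in callback, and the 4 KiB page size are illustrative assumptions, not the VAMP formalization. A virtual access is either redirected to a main-memory address or raises a page fault that the handler services before the access replays, transparently to the task:

```python
PAGE_SIZE = 4096

class PageFault(Exception):
    """Raised when the accessed page is not present in main memory."""

def translate(vaddr, tlb, page_table):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                          # TLB hit: fast path
        return tlb[vpn] * PAGE_SIZE + offset
    entry = page_table.get(vpn)
    if entry is None or not entry["present"]:
        raise PageFault(vpn)                # page unmapped or swapped out
    tlb[vpn] = entry["ppn"]                 # refill the TLB after the walk
    return entry["ppn"] * PAGE_SIZE + offset

def access(vaddr, tlb, page_table, swap_in):
    # The task-visible operation: a fault is handled and the access is
    # replayed, so the task behaves as if it used logical memory directly.
    try:
        return translate(vaddr, tlb, page_table)
    except PageFault as fault:
        vpn = fault.args[0]
        page_table[vpn] = {"present": True, "ppn": swap_in(vpn)}
        return translate(vaddr, tlb, page_table)

tlb, page_table = {}, {}
frames = iter(range(8))                     # hypothetical free frame list
print(access(0x1234, tlb, page_table, swap_in=lambda vpn: next(frames)))
```

    The simulation theorem of the thesis essentially states that, for every task, sequences of such accesses are indistinguishable from direct operations on the logical memory under the control of its address space.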