Abstract: Traditional display protocols are limited by fixed frame rates, high bandwidth requirements, and a lack of precise control over when frames are displayed. We propose a novel, scalable packetized display protocol architecture that incorporates dynamic frame rates, high-speed capabilities, and dynamic synchronization to bridge these performance gaps. We further provide a modular FPGA implementation of the architecture for use on array emitters.
Introduction
Current fixed frame rate display technologies, such as DVI, HDMI, and DisplayPort [1], are commonly utilized but limiting for high-speed IR display systems [2], [3]. These display technologies, designed for relatively low speeds (generally 60 Hz), incorporate a number of design decisions that limit the ability to utilize them effectively with IR emitter technology. Firstly, because they are not designed to handle synchronization across multiple sources, they require custom-designed synchronization solutions and hardware to ensure correct synchronization when utilized with multiple sources. Secondly, their fixed frame rate nature imposes a static frame rate on all displayed frames, increasing bandwidth requirements by requiring that the same amount of data be sent for every frame regardless of what data changes. This necessarily means that the maximum frame rate is limited by the resolution of the imagery due to bandwidth limitations.
In this paper, we propose an alternative to traditional display technology: a packetized display protocol (PDP) architecture [4] capable of bridging the performance gap created by the ever-increasing speed requirements of high-speed projector systems. Our PDP architecture eschews some of the assumptions found within traditional display technology in order to provide scalability, reduce bandwidth requirements, increase performance, ease the synchronization burden, and provide a desirable set of features, such as dynamic sub-window frame rates, not found within current technology.
Our protocol architecture draws inspiration from the video processing field, where encoding schemes for video streaming represent a body of research that attempts to tackle a similar but more limited challenge [5]. Some of these encoding schemes attempt to provide a variable frame rate for segments of the incoming stream through differencing algorithms, but also rely on compression [6], which reduces quality and introduces artifacts. In our case, we require lossless quality and thus cannot utilize these schemes for our purposes. Instead, we seek to craft a lossless solution for the IRLED projector field that incorporates similar variable frame rate features.
The contributions of this paper are the architecture of a physical-layer-agnostic packetized display protocol with the following features: (1) intelligent dynamic per-frame bandwidth utilization, (2) fine-grained control over frame transmission and synchronization, (3) dynamically changing intra-frame rates, and (4) a realized implementation of the protocol for use on array emitter technology.
The paper is divided into the following sections: Design Methodology, which discusses the rationale for PDP; Abstract Architecture Model, which discusses the overall abstract PDP architecture; Packetized Display Protocol, which discusses the underlying PDP protocol; Example Scenario, which examines utilizing PDP within an example system; Implemented Architecture, which discusses an implementation of the protocol on an FPGA; Experimental Results, which shows sample data propagating through the system; and Future Work and Conclusion, which discusses current and future work on PDP.
Design Methodology
As discussed briefly in Section 1, our PDP architecture eschews assumptions found in traditional display technology in order to provide a rich set of desirable features. In this section, we discuss the assumptions of traditional display technology and how PDP differs, in order to provide the reader with the rationale for, as well as the benefits of, our proposed alternative.
Current display technology, such as HDMI, assumes a fixed frame rate display, which places a hard limit on frame timing and synchronization. In detail, the underlying display protocols operate in a best-effort fashion in which a buffer swap transmitting a frame of data occurs at a static, predetermined interval. If a new frame is unavailable to be transmitted at an interval due to any delay, such as processing delay, the previous frame will be retransmitted. This necessarily makes correct synchronization challenging because modern computing systems do not generally provide real-time guarantees for a number of reasons, such as variability in the work needed for frame generation, CPU scheduling, and I/O delays. While these challenges can be addressed to some degree, doing so requires custom hardware solutions on top of existing display protocols because end-to-end system synchronization is outside the scope of typical display standards, which are designed to push relatively low frame rates over single hardware links. Secondly, because of the static nature of the transmission interval (e.g. 100 Hz), the frame rate cannot be dynamically controlled or changed after initialization. Instead, these protocols have static bandwidth requirements for a given resolution and frame rate of the form found in Figure 1. This prevents fine-grained control over the frame rate in cases where a user might wish to dynamically change it to match the processing rate. In high-speed display scenarios, this introduces the problem of unavoidable frame drops. This issue is further compounded by the fact that traditional display protocols utilize proprietary drivers and hardware such that frame drops become effectively silent. Addressing this problem necessarily requires complete control over the entire end-to-end system, meaning that an effective and correct solution requires customized hardware and software that deviates from standard display protocol behavior.
Any solution that falls short of these requirements would necessarily fail to completely address the issue. In short, truly addressing this issue within traditional display protocols would essentially require creating a new device-specific protocol tied to particular custom hardware. Finally, display protocols disallow the transmission of sub-frames of data. A frame is necessarily transmitted in whole when the transmission interval is reached. Just as with the previous problem, alleviating this would necessarily require control over driver and hardware behavior in an end-to-end system.
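To make the static bandwidth requirement discussed above concrete, the following is a minimal sketch. It assumes the requirement takes the common form width × height × bits per pixel × frame rate and ignores blanking intervals; Figure 1 is not reproduced here, so this form and the function name are illustrative assumptions rather than the exact expression in the figure.

```python
def fixed_rate_bandwidth_bps(width, height, bits_per_pixel, frame_rate_hz):
    """Raw pixel bandwidth in bits per second for a fixed frame rate link.

    Assumed form: every pixel of every frame is sent at the full frame
    rate, and blanking intervals are ignored.
    """
    return width * height * bits_per_pixel * frame_rate_hz


# Example: a 512 x 512 array with 24-bit pixels refreshed at 100 Hz.
bandwidth = fixed_rate_bandwidth_bps(512, 512, 24, 100)
print(f"{bandwidth / 1e6:.1f} Mbit/s")  # 629.1 Mbit/s

# Under this model, doubling the frame rate doubles the required
# bandwidth, regardless of how much of the frame actually changes.
doubled = fixed_rate_bandwidth_bps(512, 512, 24, 200)
```

Under this assumed model the cost is independent of scene content, which is the property the sub-frame approach is intended to avoid.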
Our proposed architecture, on the other hand, is designed to utilize a dynamic source driven refresh rate through the coordination of both source (scene generator) and sink (display). In this architecture, frames are segmented into pieces or sub-frames and sent to the display based upon how often these segments need to update. An example of this is shown in Figure 4a where different regions of the display operate at different frame rates. By utilizing dynamic frame rate control at sub-frame resolutions, substantial bandwidth reductions can occur. This will be discussed in further detail in subsequent sections.
The underlying PDP protocol itself is designed to allow for fine-grained control over when and what data is transmitted, and to incorporate mechanisms that synchronize transmission and eliminate frame jitter. Furthermore, the protocol architecture is abstracted in such a way that the physical interconnect layers are transparent, enabling it to operate over a wide spectrum of hardware and future-proofing the protocol for use with future hardware. This reduces the risk of adopting the protocol, in that a hardware system implementing it could switch or upgrade physical components and still utilize the same protocol within the software stack, given an appropriately compatible physical layer. To facilitate this, we have chosen a packetized protocol structure capable of transmitting pixel data in a generalized way. In the subsequent section we will discuss the architecture of our protocol.
Abstract Architecture Model
In this section, we will describe the abstract architecture upon which our PDP protocol operates in order to motivate the current and future use-cases for our protocol. Additionally, we will discuss how different components operate in the system in order to give the reader an intuition of the underlying operation.
Traditional IRLED display systems are typically composed of three components: scene generation, non-uniformity correction, and the actual display of infrared imagery, with sensors used to capture data. We propose an Abstract Machine Model (AMM) to capture these three components, as shown in Figure 2a. This AMM likewise separates the operation of an IRLED system into three main components: scene generation, compositing, and display on IRLED array tiles, with links between each component. The relationship between these components remains abstracted in such a way that hardware components may be scaled to fit demand. At its most basic, a single scene generator, compositor, and IRLED array tile may be used. For higher speed requirements, hardware components may be mapped as needed. The links between components in the system utilize the PDP protocol for communication, data transfer, and synchronization. The compositing component differs from a traditional IRLED system in that it is responsible for taking imagery from many sources, possibly at different frame rates, and combining them into a single image for transmission to IRLED array tiles. This process is briefly shown in Figure 2b. During the compositing process, frame segments are ranked to determine which to send at high speeds and which to send at low speeds for intelligent bandwidth utilization. Once segmented and ranked into non-overlapping speed classes, frame segments are transmitted at the necessary rate.
For example, consider the simple case of a single scene generator. In this case, the compositor will receive data from a single source. As frame data is received, a differencing algorithm must be employed to determine how to segment the overall frame for optimal data transfer, based on the rate of change of individual portions of the frame relative to the prior frame. Portions that change rapidly will be sent more often than portions that change slowly in order to maximize the bandwidth available for high-speed transfer of hot portions of an image. This also has the effect of improving the performance of the analog chain of display devices by allowing devices to reserve more time to drive rapidly changing portions of a display than portions changing relatively slowly.
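As an illustration of the kind of differencing step described above, the following minimal sketch splits a frame into square tiles, measures how much each tile changed relative to the prior frame, and sorts the tiles into a fast class and a slow class. The tile size, the mean absolute difference metric, the single threshold, and all function names are assumptions made for illustration; the paper does not prescribe a particular differencing algorithm.

```python
def tile_change(prev, curr, x0, y0, size):
    """Mean absolute per-pixel difference over one square tile."""
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            total += abs(curr[y][x] - prev[y][x])
    return total / (size * size)


def classify_tiles(prev, curr, size, threshold):
    """Split a frame into tiles and rank them into two speed classes.

    Returns (fast, slow), each a list of inclusive
    (x_start, x_end, y_start, y_end) regions in global coordinates.
    Assumes the frame dimensions are multiples of the tile size.
    """
    height, width = len(curr), len(curr[0])
    fast, slow = [], []
    for y0 in range(0, height, size):
        for x0 in range(0, width, size):
            region = (x0, x0 + size - 1, y0, y0 + size - 1)
            if tile_change(prev, curr, x0, y0, size) > threshold:
                fast.append(region)
            else:
                slow.append(region)
    return fast, slow


# Example: an 8 x 8 frame in which only the top-right 4 x 4 tile changes.
prev = [[0] * 8 for _ in range(8)]
curr = [row[:] for row in prev]
for y in range(4):
    for x in range(4, 8):
        curr[y][x] = 100

fast, slow = classify_tiles(prev, curr, size=4, threshold=10)
print(fast)       # [(4, 7, 0, 3)]
print(len(slow))  # 3
```

The regions are returned with inclusive bounds so that they can be used directly as the coordinates of rectangular sub-regions in the global coordinate system.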
Packetized Display Protocol
In this section, we will discuss the underlying communication details of our packetized protocol architecture. First, we will discuss the individual communication packets used to send pixel data, as well as to coordinate and synchronize operation. Figure 3 shows the basic packets used for communication within PDP. These are strictly for data transfer and synchronization of system operations, and do not include other aspects such as system setup or enumeration. These packets are organized into type-specific fields of a set word size. The exact size of word fields is left abstracted to allow an optimal implementation to be chosen in practice. For example, a system may utilize a 24-bit word size if an array has a native 24-bit pixel size, or a 32-bit word size if the hardware transport layer has a specific optimal word size.
Typically, a word size that is a multiple of 8 bits would be utilized in practice, as most hardware architectures (such as x86) utilize some multiple of this size. In any given implementation, the word size of all fields must match in order to simplify decoding operations. This allows for fixed-size decoding of incoming data, which simplifies processing and firmware implementation, and can also ease timing constraints and enforce non-variability in the decoding time of incoming packets. In general, PDP packets are designed to carry a minimal amount of header data to lower overhead, and their fields are ordered so as to minimize buffering requirements and enable real-time processing.
In terms of the protocol itself, PDP uses a single global coordinate system to refer to pixel locations on a display array. For example, a 512 by 512 pixel array would have coordinates from 0 to 511 in both the horizontal and vertical directions. All packets referencing sub-regions of this display would utilize coordinates that map to some rectangular sub-region of the display. Any overlapping regions of data would be composited during system operation with priority given to data segments sent at higher frame-rates.
PDP packets are divided into three types: a draw region packet, an array reset packet, and a trigger packet. All packets include a Type ID field one word in size. The draw region packet is used to send a rectangular sub-region of pixel data in global array coordinates. It has fields for the start and stop horizontal and vertical coordinates (defined inclusively), followed by the individual pixel data. For example, suppose a scene generator were to send a packet covering the array region from 10 to 19 along the X axis and from 20 to 29 along the Y axis. Because the coordinates are inclusive, the packet specifies a 10 × 10 region, so a total of 100 pixels of data would follow the packet coordinates.
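The draw region example above can be sketched as a small encoder. This is a minimal illustration under stated assumptions: the Type ID value, the function name, and the representation of a packet as a flat list of words are hypothetical, and the field order (Type ID, X start, X end, Y start, Y end, pixel data) follows the order in which the fields are described later in the paper.

```python
DRAW_REGION_ID = 0x1  # hypothetical value; the paper does not list Type ID encodings


def draw_region_packet(x_start, x_end, y_start, y_end, pixels):
    """Build a draw region packet as a flat list of words.

    Coordinates are inclusive, so the region holds
    (x_end - x_start + 1) * (y_end - y_start + 1) pixels.
    """
    expected = (x_end - x_start + 1) * (y_end - y_start + 1)
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} pixels, got {len(pixels)}")
    return [DRAW_REGION_ID, x_start, x_end, y_start, y_end, *pixels]


# Region 10..19 in X and 20..29 in Y: 10 * 10 = 100 pixels of data.
packet = draw_region_packet(10, 19, 20, 29, [0] * 100)
print(len(packet))  # 105 (5 header words + 100 pixel words)
```

Because every field is one word wide, the packet length in words is simply the header length plus the pixel count.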
The second packet, array reset, is utilized to indicate that quadrants on a given array should be cleared. It consists of an array specific quadrant bit-mask used to indicate which quadrant to reset. Any unused bits are reserved. This type of packet would be utilized exclusively between compositor and array tile links.
The third packet, trigger, is used to implement trigger-based synchronization within PDP. It consists of a system-specific action bit-mask used to indicate the type of operation to trigger. In IRLED array systems, the coordinator of synchronization depends on the array itself and the different components within the system. In some systems, a sensor may be used as the source of synchronization; in other systems, another component may be utilized. Other aspects of system operation may even be triggered outside of the system synchronization interval based on other events. For this reason, PDP has opted for a trigger-based approach to synchronization. This approach allows synchronization, data transfer, and computation to be tailored to an individual system's use case. For example, the action mask could be used to trigger the generation of the next frame to be displayed when needed, the source of which is defined by the system itself. Another example would be to utilize the action mask to indicate that further computations (such as scene generation) should stall until otherwise indicated.
Example Scenario
In this section, we provide an example scenario of the communication protocol in action within a simplified abstracted architecture. For demonstration purposes, the underlying details of the hardware are omitted and only the utilization of the protocol itself is examined. Figure 4b shows a series of frames segmented within PDP. For the purposes of this example, assume a system utilizing a single scene generator, compositor, and IRLED array. Assume also that the average intensity of each region of a frame is computed during operation. The highlighted segments indicate a large change in intensity occurring within the region from frame to frame. Suppose that frames are generated at 4 times the rate of an external synchronization pulse. During operation within PDP, segments with a large change in intensity require a higher frame rate to reflect the rapid changes in intensity. The PDP protocol would be utilized to send these regions at a much higher frame rate than the slowly changing segments. Each segment would be sent using a draw region packet, one after another. Once these are sent, the remaining time would be utilized by the compositor to send the slower changing data at a much slower rate. In this particular example, we could send the fast data for the frames at a rate much higher than that of the slowly changing segments, only updating the other segments once the fast-moving data in all frames has been sent. Once all data is sent, an external synchronization pulse from a sensor would then be utilized to indicate that the data has been captured, and a corresponding trigger packet would be sent from the compositor to the scene generator to indicate that more frames should be generated. The same procedure would then continue for the next set of frames.
In a real system, frames would be segmented more finely than in this example, allowing small segments to be dynamically transmitted when needed. This would give fast-changing data the ability to update at rates far greater than a static fixed frame rate display could achieve under the same hardware constraints and limited bandwidth. Secondly, as discussed briefly above, this would give the analog portions of a display more time to settle, thereby improving display fidelity. In more detail, in fixed frame rate systems the analog portions of a display are time constrained by the number of pixels that need to be addressed in a given time. By de-prioritizing slowly changing data, these segments no longer need to be addressed at the same rate as the rapidly changing data in the display, allowing more time for the high frame rate pixels to settle if needed. Figure 5 shows a comparison between a classic HDMI/DVI frame and a PDP frame, where the test case is a 32 × 32 region of interest and the PDP frame requires no blanking period. Assuming the resolution for this test case is 1920 × 1080, the HDMI frame can only run at 60 Hz even though only a small portion of the frame is updating. In the PDP case, however, the frame can be packed within the total frame size (by removing blanking porches) for an effective resolution of 2185 × 1135. The PDP draw region packet for a 32 × 32 square requires only 1029 pixels (32 × 32 = 1024 data pixels plus a 5 pixel header overhead), allowing the same PDP packet to be sent within the HDMI transport frame 2410 times, for an effective frame rate of roughly 144 kHz and a speed-up of about 2410×.
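The arithmetic behind the Figure 5 comparison can be checked with a short calculation. The sketch below uses only the numbers stated above (a 2185 × 1135 transport frame, a 32 × 32 region, a 5 word header, and a 60 Hz transport rate); the variable names are illustrative.

```python
# Words available in one HDMI transport frame once blanking is removed.
total_words = 2185 * 1135

# One draw region packet for a 32 x 32 region: pixel data plus 5 header words.
packet_words = 32 * 32 + 5

# Whole packets that fit in one transport frame.
packets_per_frame = total_words // packet_words

# Each transport frame arrives at 60 Hz, so the region updates this often.
effective_rate_hz = packets_per_frame * 60

print(total_words)        # 2479975
print(packet_words)       # 1029
print(packets_per_frame)  # 2410
print(effective_rate_hz)  # 144600
```

The speed-up relative to the 60 Hz fixed frame rate is therefore the number of packets per transport frame, 2410.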
Implemented Architecture
In this section, we will describe the implemented FPGA version of the PDP architecture used on our arrays. For the purposes of this discussion, only relevant details are included to ease the reader's understanding. First, we will discuss the purpose of the implemented architecture. Following this, each component will be discussed at a high level. Finally, the operation of sub-components will be discussed.
The implemented architecture consists of the portion of the AMM that drives an IRLED tile (or emitter array) directly from data packets sent by a compositor. As such, it is responsible for receiving PDP packets, decoding and validating them, and drawing them to an IRLED array. This is shown in Figure 6a . In the current implementation, packets are sent using an underlying HDMI protocol layer. The incoming data is synchronized across two distinct clock domains utilizing a synchronized circular buffer (SCB). The input side consists of two separate HDMI inputs in order to meet system bandwidth requirements. Each input is assumed to contain clock skew relative to the other so separate SCBs are used to synchronize these to the system domain. At a high level, individual data words of 24-bit sized values come in per HDMI cycle. These are transitioned to the system domain and stored for retrieval by the array emitter module. The array emitter module is responsible for bringing in each 24-bit word value and emptying the corresponding SCB slot. As it brings in each word, it begins to decode them into PDP commands. Once enough data is buffered for a write or reset command, it sends the data to the write buffer module which then drives an emitter directly through IO lines. Figure 6b shows the details of the synchronized circular buffer utilized in the implementation. Internally, it consists of two controllers, two data routers, and the actual internal buffer storage itself. The write controller is used to coordinate which internal buffer to write data to. A buffer is marked as full when a trigger comes in, and a new buffer is selected when the previous is filled. This is triggered via an external write enable signal sent from HDMI. The write router does the actual data redirection based off of the buffer selected by the write controller. 
Internally, a request-acknowledgment handshake is used to ensure that the data has transitioned correctly across clock domains and is available on the other side. Once data becomes available, the read router will output the data lines, as well as a valid signal indicating that the data can be read and cleared. The read controller will clear the buffer once an empty trigger is sent from the array emitter indicating that the corresponding word has been read. It will then select the next buffer. Once the last buffer is written or read by either controller, the first buffer will be selected again. For clarity, note that the actual filling and clearing of the buffers is driven by the HDMI write enable on the writer side and by the array emitter on the reader side. Figure 7 shows the details of the array emitter module used in the implementation. Internally, the array emitter consists of individual controllers that each perform a particular role. Currently these are the write enable controller, which controls writing draw region packets; the reset controller, which controls the resetting process per frame; and the valid controller, which performs validation checks on the input in order to verify the correctness of packets. Each controller takes in similar input signals and produces similar output signals, with some exceptions depending on the function of the individual controller module. On the input side, each controller takes in a valid line and a data line corresponding to an individual word of a PDP packet. Additionally, an active line is brought into each controller to indicate whether another controller module is active. This is used to ensure, for correctness purposes, that other modules do not become active while another is currently processing a packet. During an idle phase, the write enable and reset controllers will wait for a corresponding packet ID to come in.
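The behaviour of the synchronized circular buffer described above can be modelled in software. The following is a minimal single-threaded sketch that captures only the slot selection, full marking, clearing, and wrap-around behaviour; it does not model the two clock domains or the request-acknowledgment handshake. The class and method names are hypothetical, and rejecting a write when the selected slot is still full is an assumed overflow policy that the paper does not specify.

```python
class SynchronizedCircularBuffer:
    """Behavioural model of the SCB: a ring of single-word slots."""

    def __init__(self, num_slots):
        self.data = [0] * num_slots
        self.full = [False] * num_slots
        self.write_sel = 0  # slot chosen by the write controller
        self.read_sel = 0   # slot chosen by the read controller

    def write(self, word):
        """Model one write enable pulse from the HDMI side.

        Returns False if the selected slot is still full
        (assumed overflow policy).
        """
        if self.full[self.write_sel]:
            return False
        self.data[self.write_sel] = word
        self.full[self.write_sel] = True
        self.write_sel = (self.write_sel + 1) % len(self.data)
        return True

    def peek(self):
        """Return (valid, word) for the slot selected by the read controller."""
        return self.full[self.read_sel], self.data[self.read_sel]

    def empty(self):
        """Model the empty trigger sent by the array emitter after a read."""
        if self.full[self.read_sel]:
            self.full[self.read_sel] = False
            self.read_sel = (self.read_sel + 1) % len(self.data)


scb = SynchronizedCircularBuffer(4)
for word in (0xA, 0xB, 0xC):
    scb.write(word)

received = []
while scb.peek()[0]:
    received.append(scb.peek()[1])
    scb.empty()
print(received)  # [10, 11, 12]
```

In this model both selectors wrap back to the first slot after the last one, mirroring the description of the controllers reselecting the first buffer.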
When a packet ID matching an operation handled by a given controller arrives, the corresponding controller will switch states and then wait for the rest of the incoming packet data to arrive. This is shown in Figure 8 . For example, item 3 shows each state machine waiting for a corresponding PDP OP code. If a draw region packet ID were to arrive, then the state machine would wait for the X start address, X end address, Y start address, and Y end address shown in item 4. Finally, the state machine would buffer the needed data for a write and move to the write state once all data has been buffered. In the write state it would send the data to the write buffer. If more data needed to be written for the PDP packet, it would then continue buffering the needed data, and wait for the write buffer to be idle to send the next set of data. The busy line in Figure 7 indicates when the write buffer is in the process of writing to the array. This would continue until all data was written for the packet, finally proceeding to the idle state. The reset controller contains similar logic, but for the reset process. The valid controller is used to ensure that incorrect or corrupt packet data is cleared. If an invalid packet OP code arrives during an array emitter idle phase, the valid controller will simply empty the corresponding SCB slot. Currently, this is done by checking that an OP code ID matches a known packet type in the backend. The Active-Out, Active-In, and Select lines are used to coordinate which array emitter module has control over the write buffer. After an array write, normally an array emitter will signal the write buffer that it is ceding control to the other array emitter, but only in cases where the other array emitter is currently active and waiting to write. In future revisions, CRC checks will be performed on packet data to ensure the header and body are valid. In more detail, the header will carry a precomputed 24-bit CRC for the header data. 
During decoding of each data word in the header, the firmware will recompute the CRC and then confirm that the header matches the provided CRC. If the header does not match, the packet will be discarded. The same process will be performed for the body of the packet to ensure the validity of the data to be written. If data corruption is detected, a flag will be set indicating that corruption has occurred.
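The packet decoding flow described in this section (wait for an op code while idle, buffer the coordinates and data for a draw region, and discard words that do not match a known packet type) can be sketched as a simple software decoder. This is a simplified model, not the FPGA implementation: the Type ID values, the single-word mask for reset and trigger packets, and the function name are assumptions, and the planned CRC checks are not modelled.

```python
# Hypothetical Type ID values; the paper does not list the actual encodings.
DRAW_REGION_ID, ARRAY_RESET_ID, TRIGGER_ID = 0x1, 0x2, 0x3


def decode(words):
    """Decode a flat word stream into a list of commands.

    Unknown op codes seen while idle are discarded one word at a time,
    mirroring the role of the valid controller. Decoding stops early if
    the stream ends in the middle of a packet.
    """
    commands = []
    i, n = 0, len(words)
    while i < n:
        op = words[i]
        if op == DRAW_REGION_ID:
            if i + 5 > n:
                break  # header incomplete; wait for more data
            x0, x1, y0, y1 = words[i + 1:i + 5]
            if x1 < x0 or y1 < y0:
                i += 1  # malformed region; treat the op code as invalid
                continue
            count = (x1 - x0 + 1) * (y1 - y0 + 1)
            if i + 5 + count > n:
                break  # pixel data incomplete
            commands.append(("draw", (x0, x1, y0, y1), words[i + 5:i + 5 + count]))
            i += 5 + count
        elif op in (ARRAY_RESET_ID, TRIGGER_ID):
            if i + 2 > n:
                break  # mask word missing
            name = "reset" if op == ARRAY_RESET_ID else "trigger"
            commands.append((name, words[i + 1]))
            i += 2
        else:
            i += 1  # not a known packet type: discard the word
    return commands


# One invalid word, a 2 x 1 draw region, then a reset of quadrant bit 0.
stream = [0xFF, DRAW_REGION_ID, 0, 1, 0, 0, 31, 30, ARRAY_RESET_ID, 0b0001]
commands = decode(stream)
print(commands)
# [('draw', (0, 1, 0, 0), [31, 30]), ('reset', 1)]
```

Note that pixel values equal to an op code are not misread here, because the decoder consumes exactly the number of pixel words implied by the region coordinates before returning to the idle state.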
Experimental Results
In this section, we provide a few captures of simulation inputs and outputs in order to show how packets arrive and are processed by the architecture. Figure 9 shows simulated HDMI input. PxlClk denotes the incoming HDMI pixel clock pulse. Vsync denotes the vertical sync pulse period. Hsync denotes the horizontal sync pulse period. VDE stands for video data enable and denotes when data is valid to be read from the data stream. Addr denotes the input HDMI address, which indicates the current pixel of the stream. HDMI_data is the data stream of input pixels. For our purposes, only PxlClk, VDE, and HDMI_data are used by our architecture to buffer incoming data. HDMI_data is used as a carrier line for incoming PDP data. When VDE goes high, words of data representing PDP packets start to stream in. These are indicated by Packet ID, X start, X end, Y start, Y end, and Packet Data. Each word would be stored in an SCB slot as indicated in the previous section. The final piece of data indicated is a reset packet. Note that the data prior to Packet ID would be ignored, as it does not represent a valid PDP command. It would be discarded by the valid controller. Figure 10 shows the final output driven to the array. Sysclk denotes the system clock driving the PDP architecture. WB_sel and WB_busy denote the select and busy lines shown in Figure 6a. Quad, load, dac, and array_reset denote signals that drive our array. DAC indicates the values driving the digital-to-analog driver chain to our array. Quad and load indicate which portion of the array to drive. State and count are state signals that denote which part of the array write process is currently occurring. Highlighted in red is data from the write enable packet. Note that all output values are shifted left by 5 bits to be received by the DACs in the system. Additionally, the values are shown in reverse order from the input diagram. For example, 992 corresponds to the value of 31 on the input side.
In purple the reset packet is shown with two stages of array writes. For the entirety of reset, the array_reset and quad lines are driven high. In the first stage, the load line goes low. In the second stage the load line goes high.
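The relationship between input values and the DAC values in Figure 10 can be checked with a one-line calculation. The sketch below assumes the shift is a plain left shift by 5 bits, as described above; the constant and function names are illustrative.

```python
DAC_SHIFT = 5  # output values are shifted left by 5 bits, per the text


def to_dac(value):
    """Map an input pixel value to the value driven to the DAC chain."""
    return value << DAC_SHIFT


print(to_dac(31))  # 992
print(to_dac(30))  # 960
```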
Future Work and Conclusion
In this paper, we described a packetized display protocol architecture and an associated abstract machine model to address the limitations of current fixed frame rate technology. We provide an alternative display architecture that eschews the design decisions of current technology in order to provide intelligent dynamic bandwidth utilization, fine-grained control over frame transmission and synchronization, and dynamically changing intra-frame rates. We believe this architecture has the potential to bridge the performance gap found in current technology, and will serve as a better-fit solution for future high-performance IRLED systems due to the scalable nature of the design and the carefully incorporated abstraction that allows different types of hardware and system setups to utilize the PDP architecture. Care has been taken in the design to accommodate many different possible system setups without limiting the use case of PDP to a specific hardware setup, while at the same time considering firmware implementation and the timing aspects of packet decoding.
Current work includes an FPGA-based implementation of a PDP decoder architecture utilizing HDMI. We have provided both a description of the implemented architecture and simulated sample data running on the architecture. Future work includes testing the architecture on an emitter array, performing scalability testing, and comparing the results to a classic architecture at matching pixel clock rates in order to show the effective speed-up with varying packet sizes. Further work is to be done to demonstrate dynamic frame rates in action on an array. Finally, a CRC is to be implemented to detect corrupted packets during operation. We also wish to scale the number of inputs in order to increase the effective hardware bandwidth beyond what is possible with a classical system.
