The architecture of VIS multimedia extension by Lee, Ben et al.
AN ABSTRACT OF THE THESIS OF
Young-Kyong Kwon for the degree of Master of Science in Electrical and Computer
Engineering presented on June 9, 1999. Title: The Architecture of VIS Multimedia
Extension
Abstract approved:
Ben Lee
In the past, multimedia technology focused mainly on designing high-quality audio
and graphical imagery and on providing the performance levels that users demand of
multimedia applications. However, the concept of multimedia has expanded into
New-Media, which involves varied use of multimedia data in consumer-oriented
applications such as video conferencing, virtual reality, and multimedia games with
3-D effects in video and audio. As the demand for multimedia applications increases,
vendors have come up with new cost-effective microprocessor designs to cope with the
complexity of new-media processing. One of the most efficient methods is to
incorporate a special-purpose multimedia processor into a general-purpose processor,
thereby offering multimedia-related functions at a small cost. The most effective way
to integrate the two processors is to extend the existing instruction set into a
multimedia-oriented Instruction Set Architecture, called Media ISA Extension.
Currently, many microprocessor vendors produce a variety of general-purpose
processors with multimedia extensions.
This thesis aims to present the overall design philosophy behind Media ISA Extension
and its effect on the overall performance of a general-purpose processor. In
particular, Sun Microsystems' Visual Instruction Set (VIS) media extension of
UltraSPARC-V9 is studied. On average, VIS provides a speed-up of 3 to 4 for various
multimedia applications. This performance improvement comes from the considerable
reduction in the number of instructions executed due to the Single Instruction
Multiple Data (SIMD) execution style of the VIS media extension. To examine how VIS
improves multimedia performance, the thesis studies the design concept, benefits, and
limitations of the VIS extension, and the performance of various multimedia
applications with and without it. Finally, possible architectural modifications to
the VIS extension for further performance enhancement are suggested.
Copyright by Young-Kyong Kwon
June 9, 1999
All Rights Reserved
The Architecture of VIS Multimedia Extension
by
Young-Kyong Kwon
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Master of Science
Presented June 9, 1999
Commencement June 2000
Master of Science thesis of Young-Kyong Kwon presented on June 9, 1999
APPROVED:
Major Professor, representing Electrical and Computer Engineering
Head or Chair of Department of Electrical and Computer Engineering
Dean of Graduate School
I understand that my thesis will become part of the permanent collection of Oregon State
University libraries. My signature below authorizes release of my thesis to any reader
upon request.
Young-Kyong Kwon, Author
ACKNOWLEDGMENT
This thesis is dedicated to my parents, Tae-gue Kwon and Chun-kil Lee, my
parents-in-law, my wife Hyoung-woo Choi, and my daughter Cheoung-yeon Kwon. I am
grateful to my elder and younger sisters, Mi-ok and Mi-sun, who helped me concentrate
on my studies. Without their unconditional support, this thesis would never have been
completed.
Special thanks to Prof. Ben Lee, my major professor. His guidance and advice have
been extremely valuable during my studies at Oregon State University. Also, special
thanks to the committee members, Prof. Alexandre F. Tenca, Prof. Robert J. Simpson,
and Prof. Robert M. Burton.
Thanks to my senior Han-tak Kwak. His help and kindness helped me successfully
complete my studies. I would like to thank the senior officers Jung-seo Hwang and
Youn-suk Kim, who have always been concerned for my well-being. Also, thanks to all
my friends at the Korean Air Force, especially the seven officers who have always
shared difficulties and happiness with me.
In addition, I would like to thank the Republic of Korea Air Force and the Division
of Armaments, which allowed me to study abroad and supported me financially.
TABLE OF CONTENTS
1 INTRODUCTION
2 MULTIMEDIA AND ITS PROCESSING
2.1 Definition of Multimedia
2.2 Multimedia Applications
2.2.1 Image
2.2.2 Graphics
2.2.3 Video and Audio
3 MEDIA PROCESSING ARCHITECTURE
3.1 Limitations of Special Purpose Processors
3.2 Multimedia Extensions
3.2.1 Multimedia Extension Architecture
3.2.2 Single Instruction Multiple Data (SIMD)
3.2.3 Multimedia Instruction Set Overview
4 VIS INSTRUCTION SET OF ULTRA-SPARC
4.1 Hardware Features of UltraSPARC
4.1.1 Overview of UltraSPARC Architecture
4.1.2 Hardware Support for VIS Instruction Set
4.2 VIS Data Format
4.2.1 Partitioned Data Formats
4.2.2 Fixed Data Formats
4.3 VIS Instructions
4.3.1 Conversion Instructions
4.3.2 Arithmetic Instruction
4.3.3 Logical Instructions
4.3.4 Address Manipulation Instructions
4.3.5 Pixel Comparison Instructions
4.3.6 Memory Access Instructions
4.3.7 Motion Estimation Instruction
4.4 Other Media ISA Extensions
5 IMPLEMENTATION OF MULTIMEDIA FUNCTION
5.1 Alpha Blending
5.1.1 Implementation and Code Comparison
5.1.2 Alpha Blending in C
5.1.3 Alpha Blending in VIS
5.1.4 Assembly Code Version of the Alpha Blending
5.2 Motion Estimation
5.2.1 Implementation and Code Comparison
5.2.2 C Implementation
5.2.3 VIS Implementation
5.2.4 Assembly Code Version of the Motion Estimation
5.3 Overall Comparison of VIS and C Implementation
6 PERFORMANCE ANALYSIS OF VIS
6.1 Simulation Environment
6.1.1 Organization of INCAS
6.1.2 Capabilities of the Simulator
6.2 Applications
6.2.1 Image & Graphics Applications
6.2.2 Audio Application
6.2.3 Video Application
6.3 Simulation Results
6.4 Analysis of the Simulation Results
6.4.1 Impacts of VIS Architecture Features
6.4.2 Advantage of VIS
6.4.3 Limitations of VIS Instructions
7 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY
LIST OF FIGURES
2.1: Hierarchy of Multimedia [3]
2.2: Example of Thresholding as a Point Operation
2.3: Filtering
2.4: Composites
2.5: Modeling
2.6: Adding Surface
2.7: Shading
2.8: Ray Tracing
2.9: Moving Picture with 30 Frames (30 fps)
2.10: Compression Structure
3.1: Capability of Multimedia Extension
3.2: Wasted Operand Space [10]
3.3: SIMD Processing in a Subword
3.4: Data Conversion
3.5: Arithmetic and Saturation
4.1: UltraSparc-I [11]
4.2: Dual 9-stage Pipeline [11]
4.3: Floating point Register for Multimedia Data [13]
4.4: Graphic Status Register [14]
4.5: Floating-point/Graphics Unit [15]
4.6: Partitioned 16/32-bit graphics adder [16]
4.7: Partitioned 16*8 bit Multiplier [16]
4.8: VIS Data Types [11]
4.9: Partitioned Data Format [11]
4.10: Conversion for Expand [11]
4.11: Conversion for Pack [11]
4.12: Add and Subtraction [11]
4.13: Multiplication [11]
4.14: Align Instruction [11]
4.15: Edge Handling Instruction [11]
4.16: Array Instruction [11]
4.17: Block Load and Store [17]
4.18: MMX Technology Pipeline Structure [18]
4.19: MMX Register Set [18]
5.1: Image Size determined by X and Y, and Color Band by Z
5.2: Blending Application
5.3: Flow Chart of Alpha Blending in VIS and C
5.4: Code of Ablending_C [11]
5.5: Image Construction by Loop Execution in C code
5.6: Code of Ablending_VIS [11]
5.7: Image Construction by Loop Execution in VIS code
5.8: Alpha Blending in VIS Assembly
5.9: Image Data Load and Expand
5.10: Appending Alpha value to Image
5.11: Blending of Image Data
5.12: Motion Estimation in Compression Algorithm
5.13: Motion Estimation between Two Frames
5.14: Flow Chart of Motion Estimation in VIS and C
5.15: Macroblock and SAD Area
5.16: C code for Sum of Absolute Difference
5.17: Comparison VIS Algorithm with C
5.18: VIS code for Sum of Absolute Difference [11]
5.19: SAD in Assembly Level
6.1: Organization of INCAS [19]
6.2: INCAS Accuracy Model [11]
6.3: Simulation Results shown by the Simulator INCAS
6.4: Performance comparisons of VIS with C
6.5: Overall Performance of Applications in VIS
6.6: Distribution of Instructions by Operation
6.7: IPC comparison FGU with no FGU
6.8: Instruction Counts between C and VIS
6.9: Loop Overheads in C Instructions
6.10: Impact of VIS Arithmetic Operation on C
6.11: Impact of VIS Branch Operation on C
6.12: Impact of VIS Memory Access Operation on C
6.13: Overhead of VIS instructions
LIST OF TABLES
2.1: Channel Depth for Colors
3.1: Calculation for Movie Size [3]
4.1: VIS Instruction Set [14]
4.2: Extensions Comparison
6.1: Simulation Result of Image Applications
6.2: Simulation Result of Audio and Video Applications
6.3: Summary of Distributed Instruction
THE ARCHITECTURE OF VIS MULTIMEDIA EXTENSION
1 INTRODUCTION
As the demand for multimedia applications increases, multimedia processing has
become an essential function of a computer. In addition, the term "multimedia" has
evolved into a new concept called New-Media, which points to the future direction of
multimedia technology. In the past, multimedia technology focused mainly on designing
high-quality audio and graphical imagery and on providing the performance levels that
users demand of multimedia applications. However, the concept of multimedia has
expanded into New-Media, which involves varied use of multimedia data in
consumer-oriented applications such as video conferencing, virtual reality, and
multimedia games with 3-D effects for video and audio.
This new multimedia technology gives general users broad use of multimedia data in
their computer applications. On the other hand, it has become a challenge for
computer manufacturers, since hardware designers are forced to find a cost-effective
way to boost processor performance. Vendors must therefore find solutions that not
only deliver the performance consumers desire but also alleviate the problem of
expensive design costs. To solve these difficulties, vendors have tried to
incorporate the dedicated hardware used for media processing into a single multimedia
chip. One such solution is to use microprocessors to handle multimedia data without
the assistance of add-on hardware. In other words, new-media processing technology
focuses on the use of general-purpose processors (GPPs) combined with specialized
hardware modules, such as a digital signal processor, an audio processor, and a pixel
processor.
However, processing a large amount of media data is quite demanding for a
general-purpose processor, even for the most powerful microprocessors. For example,
real-time, broadcast-quality video requires the manipulation of about 10.4 million
pixels per second [1]. Since each pixel contains information for three colors, this
amounts to more than 30 million pieces of data per second. Assuming a GPP with a
clock speed of 200 MHz, only about 20 clock cycles are available for processing each
pixel, so a GPP cannot handle such multimedia data directly. Microprocessor designers
have therefore searched for a proper solution to this newly emerging architectural
challenge. As a result, new instructions that efficiently process multimedia data
have been added to the Instruction Set Architecture (ISA) layer; this is commonly
referred to as Multimedia Extension.
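The cycle budget quoted above can be checked with a few lines of C. This is only an
illustrative sketch: the function names cycles_per_item and print_cycle_budget are
hypothetical, and the only inputs are the figures given in the text (a 200 MHz clock,
10.4 million pixels per second, three color values per pixel).

```c
#include <stdio.h>

/* Clock cycles available per item when `rate_per_sec` items must be
 * processed every second on a core clocked at `clock_hz`. */
double cycles_per_item(double clock_hz, double rate_per_sec)
{
    return clock_hz / rate_per_sec;
}

/* Prints the budget for the figures quoted in the text. */
void print_cycle_budget(void)
{
    const double clock_hz = 200e6;         /* 200 MHz GPP (from the text)      */
    const double pixels_per_sec = 10.4e6;  /* broadcast-quality video [1]      */
    const double components = 3.0;         /* three color values per pixel     */

    printf("cycles per pixel:           %.1f\n",
           cycles_per_item(clock_hz, pixels_per_sec));
    printf("cycles per color component: %.1f\n",
           cycles_per_item(clock_hz, pixels_per_sec * components));
}
```

With these inputs the budget works out to roughly 19 cycles per pixel, or about 6
cycles per color component, before any memory access or loop overhead is counted.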
Multimedia extension is based on the parallel architecture concept of Single
Instruction Multiple Data (SIMD), which enables the processing of multiple data items
with one instruction. The SIMD approach is a very effective way of enhancing
multimedia performance, since multimedia data consist of small native data types (8
or 16 bits) in large data sets, carry a large amount of inherent data parallelism,
and are usually computation-intensive. Multimedia extensions implement a SIMD
parallel architecture to support high-throughput mathematical processing at the CPU
level. By partitioning 64-bit floating-point registers into 8-, 16-, or 32-bit
components, a multimedia extension can operate on multiple media data items at the
same time.
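The idea of a partitioned register can be imitated in portable C. The sketch below
is an emulation under stated assumptions, not VIS code: it treats a uint64_t as four
independent 16-bit lanes and adds them lane by lane with wraparound, which is the
effect a partitioned 16-bit add achieves in hardware. The name
partitioned_add16 is hypothetical.

```c
#include <stdint.h>

/* Adds four 16-bit lanes packed in a 64-bit word, lane by lane.
 * Each lane wraps modulo 2^16 and no carry crosses a lane boundary. */
uint64_t partitioned_add16(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000800080008000ULL; /* top bit of every lane */

    /* Add the low 15 bits of each lane; the carry out of bit 14 lands
     * in bit 15 of the same lane, so it cannot spill into a neighbor. */
    uint64_t low = (a & ~top) + (b & ~top);

    /* Fold the original top bits back in with XOR (a carry-free add). */
    return low ^ ((a ^ b) & top);
}
```

For example, with lanes 0x0001, 0xFFFF, 0x0010, 0x7FFF added to 0x0001, 0x0001,
0x0020, 0x0001, the second lane wraps to 0x0000 without disturbing the first lane,
whereas an ordinary 64-bit addition would carry into it.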
Major microprocessor vendors have already introduced various types of multimedia
extensions: VIS by Sun Microsystems, MMX by Intel, and MAX by Hewlett-Packard.
Although there are various kinds of multimedia instruction sets, their basic concepts
and operations are similar, except for some specialized instructions that each vendor
provides to fit its own architectural needs. To examine the features of a multimedia
instruction set, this thesis provides a case study of VIS in terms of its
characteristics and performance.
By examining the VIS multimedia instruction set used in Sun Microsystems'
UltraSPARC-V9, this thesis attempts to present the general design philosophy of
multimedia instruction set architectures as well as their characteristics. In
addition, the performance of the VIS multimedia extension is analyzed via simulation
studies of multimedia benchmark programs using INCAS. Furthermore, a comparison of
VIS and C codes at both the source and assembly levels provides a better
understanding of multimedia processing.
The thesis is organized as follows: Chapter 2 introduces multimedia and multimedia
processing by explaining several examples used in image, graphics, and audio/video
applications. Chapter 3 discusses the concepts of a general media processing
architecture. The VIS multimedia extension is introduced as one of the multimedia
instruction sets in Chapter 4, where its specific features and advantages are
discussed in detail. Chapter 5 presents the implementation of the multimedia
functions embedded in multimedia applications and illustrates the algorithms and data
structures of the VIS and C codes. Chapter 6 presents the simulation results and
analyzes the performance of VIS. Finally, Chapter 7 provides a conclusion and
possible future work.
2 MULTIMEDIA AND ITS PROCESSING
During the last decade, computer and communication industries have shown tre-
mendous growth of interest in multimedia applications. Computers and networks are of-
ten used to process and transmit more than just text or images; video, audio and other
media data such as graphics have become an integral part of computer applications.
However, computers and networks need to be further improved in order to provide better
support for multimedia applications.
This chapter presents the general concept behind multimedia and multimedia
applications. Multimedia applications are categorized into three areas, namely image,
graphics, and video/audio, and the techniques associated with each area are
explained. Finally, techniques that are being applied to enhance the performance of
multimedia processing are discussed.
2.1 Definition of Multimedia
It is difficult to clearly define the term "multimedia" because of its broad scope and
applications. However, a definition of multimedia can be given as follows:
The use of more than one medium in a program or system such as the use of audio,
video, graphics, animation and computer data used together for a program. Multime-
dia means the joining of any two or more of these mediums [1].
Delivery of information, usually via a personal computer, that combines different
content formats (text, graphics, audio, still images, animation, motion video, etc.)
and/or storage media (magnetic disk, optical disc, video/audio tape, and RAM) [1].
Therefore, multimedia can be defined as a collection of data composed of image,
video, audio, graphics, and text for the conveyance of specific information. Figure
2.1 shows a hierarchy of multimedia and its associated data types. It shows that
mixed media deals with text, graphics, and imagery; when audio and/or video
information is added, it becomes multimedia.
[Figure 2.1 diagram labels: Imagery (Photograph), Graphics (2-D & 3-D), Mixed Media,
Multimedia, Audio (Voice & Sound)]
Figure 2.1: Hierarchy of Multimedia [3]
2.2 Multimedia Applications
In the previous section, audio, video, pictures, 2-D and 3-D graphics, animation,
text, and images were collectively called multimedia data. Multimedia applications
allow users to manipulate and utilize multimedia data in order to convey useful
information. For example, a video conferencing application is used to broadcast or
receive audio/video information over a communication network. These multimedia
applications can be organized into four categories: image, graphics, audio, and
video. To provide a basic understanding of each kind of multimedia application, its
characteristics, such as general concepts, representation, and associated operations,
are introduced.
2.2.1 Image
As a spatial representation of an object, an image is data representing a
two-dimensional scene. In the realm of multimedia applications, an image commonly
means a digital image, usually taken from the real world via digital cameras, frame
grabbers, or scanners [4]. The characteristics of an image can be explained in terms
of its representation and operations.
2.2.1.1 Image Representation
The image displayed on a screen is composed of thousands (or millions) of small dots
called pixels, a contraction of the phrase "picture element" [5]. The pixels in a
digital image are arranged in a rectangular array with a certain height and width.
Each pixel consists of one or more bits of information representing the following
factors of the image: color model, alpha channels, number of channels, channel depth,
interlacing, and pixel aspect ratio.
2.2.1.1.1 Color Model
The color model represents the encoding scheme of colors in the computer display.
Several standards are commonly used for digital color images: RGB, HSB, CMYK, and
YUV.
RGB: This is the most popular color model, composed of three intensities: red (R),
green (G), and blue (B). The combination of RGB bit values represents a certain
color. This model is suitable for video display drivers since the numeric values can
be easily mapped to voltages for red, green, and blue in color CRTs [6].
HSB: Color is represented by hue (H), saturation (S), and brightness (B). The hue
determines the color using a combination of three different colors represented as
angular values: red from 0 to 120 degrees, green from 120 to 240 degrees, and blue
from 240 to 360 degrees. The saturation is related to the intensity, and the
brightness controls the amount of gray. Generally, this color model deals with tint,
shade, and brightness.
YUV: This color model is widely used in the television industry and is based on the
video signal concept, i.e., the Y, U, and V signals. Y specifies luminance, which
determines the black-and-white portion of a video signal. U and V are color
difference signals that form the color portion of a video signal, called chrominance.
YUV is suitable for video broadcast since it makes efficient use of bandwidth: the
human eye is more sensitive to changes in luminance than to changes in color, so
available bandwidth is better utilized by allocating less to chroma and more to
luminance.
CMYK: Computer displays emit light, and color is produced by adding red, green, and
blue intensities. Paper, on the other hand, reflects light. To print a particular
color on a white page, one must apply inks that subtract (absorb) all colors except
the one desired. For this reason, printers use inks corresponding to the subtractive
primaries: cyan (C), magenta (M), and yellow (Y), the complements of red, green, and
blue. Furthermore, mixing equal quantities of pure cyan, magenta, and yellow inks
produces black. In practice, however, inks are not pure, and a black ink (K) is often
used to give a better black or gray [6].
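The complement relation between RGB and CMY, and the role of the black ink, can be
sketched in C. This is a simplified, idealized conversion for illustration only: it
assumes 8-bit channels, takes each subtractive primary as the complement of its
additive counterpart, and moves the gray component shared by all three inks into K.
Real printing pipelines use device-specific color profiles, and the names cmyk8 and
rgb_to_cmyk are hypothetical.

```c
#include <stdint.h>

/* Hypothetical 8-bit CMYK value; 255 means full ink coverage. */
typedef struct {
    uint8_t c, m, y, k;
} cmyk8;

static uint8_t min3(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

/* Idealized RGB to CMYK conversion. */
cmyk8 rgb_to_cmyk(uint8_t r, uint8_t g, uint8_t b)
{
    /* Subtractive primaries are the complements of red, green, blue. */
    uint8_t c = (uint8_t)(255 - r);
    uint8_t m = (uint8_t)(255 - g);
    uint8_t y = (uint8_t)(255 - b);

    /* Equal amounts of C, M, and Y make gray or black, so that shared
     * amount is printed with black ink instead. */
    uint8_t k = min3(c, m, y);

    cmyk8 out;
    out.c = (uint8_t)(c - k);
    out.m = (uint8_t)(m - k);
    out.y = (uint8_t)(y - k);
    out.k = k;
    return out;
}
```

Under this scheme pure red needs only magenta and yellow ink, while any gray level
is printed with black ink alone.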
2.2.1.1.2 Alpha Channels
In general, an alpha channel has 8-bit depth and determines the transparency and
opacity of the images by controlling the color information.
2.2.1.1.3 Channel Depths
The amount of information stored in a pixel is determined by the channel depth, i.e.,
the pixel attributes depend on this depth. For example, a pixel holding only an RGB
true-color value has a 24-bit channel depth. If an 8-bit alpha channel is added, the
channel depth becomes 32 bits. Table 2.1 shows the channel depths for color used in
PCs today.
Color Depth   Displayed Colors   Bytes per Pixel   Common Name for Color Depth
4-Bit         16                 0.5               Standard VGA
8-Bit         256                1.0               256-Color Mode
16-Bit        65,536             2.0               High Color
24-Bit        16,777,216         3.0               True Color
Table 2.1: Channel Depth for Colors [5]
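The columns of Table 2.1 follow directly from the channel depth: a depth of n bits
can encode 2^n colors and occupies n/8 bytes per pixel. A minimal sketch in C, with
hypothetical function names and an uncompressed, tightly packed image assumed:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of distinct colors a pixel of `depth_bits` bits can encode.
 * Valid for depths below 64 bits. */
uint64_t displayable_colors(unsigned depth_bits)
{
    return (uint64_t)1 << depth_bits;
}

/* Storage in bytes for an uncompressed, tightly packed image,
 * rounded up to a whole byte. */
size_t image_bytes(size_t width, size_t height, unsigned depth_bits)
{
    size_t bits = width * height * depth_bits;
    return (bits + 7) / 8;
}
```

For instance, a 640 x 480 image needs 153,600 bytes at 4-bit depth and 921,600 bytes
at 24-bit true color; adding an 8-bit alpha channel raises that to 1,228,800 bytes.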
2.2.1.1.4 Interlacing
Interlacing gives display priorities to images by controlling the order of channel
storage. Interlaced images are stored in memory according to their priority scheme,
and when such an image is retrieved, it is displayed on screen in its interlaced
order. A non-interlaced image, on the other hand, shows up in a linear fashion from
top to bottom. For example, suppose an image is composed of several interlaced
objects. The objects with high priority will be displayed sooner than the other
objects.
2.2.1.1.5 Pixel Aspect Ratio
This term refers to the ratio of pixel width to pixel height. In general, a pixel
takes a rectangular shape for simpler image representation, e.g., scanners work with
square pixels. The relation between the pixel aspect ratio of an image and that of
the screen is important for displaying the image correctly. If the two differ, the
image will be displayed stretched or squeezed. Also, the aspect ratio of an image
determines the number of horizontal pixels and vertical pixels.
2.2.1.2 Image Operations
"What is the image operation?" is the same question as "What is image processing?"
or "What is image manipulation?" Image operation thus means image processing, i.e.,
changing an original image into a new image. A new image is created by modifying the
original image with special tools or techniques. In general, image processing
involves (1) reading an image in a digitized format, and (2) manipulating the
digitized image into a new one using special imaging software. The special techniques
commonly used in image processing can be categorized into five operations: editing,
point operation, filtering, composition, and conversion [7].
2.2.1.2.1 Editing
Editing is a basic technique in image processing that permits the manipulation of
pixels in a certain portion of the whole image, e.g., changing pixel values. As a
result, that portion of the original image is altered into a different image. For
example, one object belonging to the original image may be cut out or moved to
another position. Image editors usually support cutting, copying, and pasting of
selected groups of pixels.
2.2.1.2.2 Point Oøerations
This technique is applied to every pixel in the original image rather than certain pix-
els as with editing technique. Thus, if this technique is performed on an image, there is
no pixels excluded from being modified.
The thresholding operation shown in Figure 2.2 is a good example of a point operation.
Thresholding changes the image into a binary image according to its intensity. It can be
useful in isolating certain areas of a picture by comparing the intensity of each pixel
with a threshold value. Depending on whether the pixel value is less than or greater than
the threshold value, the pixel is displayed in either black or white. In other words, by
applying a threshold to the image, it is possible to separate the objects into black and
white.
(a) Original (b) Thresholding Original
Figure 2.2: Example of Thresholding as a Point Operation
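As a minimal sketch of the thresholding described above, the following Python fragment maps each pixel of a grayscale image to black or white. The function name `threshold`, the 8-bit values 0 and 255, and the choice that a pixel equal to the threshold becomes white are assumptions made for illustration; the text does not specify them.

```python
def threshold(image, t):
    """Return a binary image: 255 (white) where pixel >= t, else 0 (black)."""
    # Every pixel is visited, which is what makes this a point operation.
    return [[255 if pixel >= t else 0 for pixel in row] for row in image]


# A tiny 2x3 grayscale image with 8-bit intensities (illustrative data).
image = [[12, 200, 90],
         [130, 40, 255]]
binary = threshold(image, 128)
```

Each output pixel depends only on the corresponding input pixel, so the operation can be applied to all pixels independently.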
2.2.1.2.3 Filtering
The filter operation is very similar to the point operation in that it is applied to all the
pixels in an image. The difference is that, after going through filtering, each pixel has a
new value. Filtering techniques can be used to sharpen, blur, or remove noise in the image.
Filtering can also be applied to the edges of an image to make them smooth. Figure 2.3
shows an example of a filtering technique that removes noise from the original image,
yielding a much clearer image.
Figure 2.3: Filtering
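One possible noise-removing filter can be sketched as follows. The text does not name a specific filter, so the 3x3 mean filter used here, the function name `mean_filter3x3`, and the handling of border pixels (averaging only the neighbors that exist) are all assumptions made for illustration.

```python
def mean_filter3x3(image):
    """Replace each pixel by the integer mean of its 3x3 neighborhood."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Border pixels average only the neighbors inside the image.
                    if 0 <= ny < height and 0 <= nx < width:
                        total += image[ny][nx]
                        count += 1
            out[y][x] = total // count
    return out


# A flat gray image with one noisy pixel in the middle (illustrative data).
noisy = [[10, 10, 10],
         [10, 100, 10],
         [10, 10, 10]]
smoothed = mean_filter3x3(noisy)
```

The isolated bright pixel is pulled toward the value of its neighbors, which is the noise-reducing effect described above.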
2.2.1.2.4 Composition
This technique is used to produce a new image derived from source images; the source
images are combined using a special channel such as an alpha channel, and the new image is
created based on the channel distribution, i.e., the new image is calculated from the
channel values. The relation between the new image and the source images is specified by
mathematical equations. Figure 2.4 shows examples of composites, where (a) is the alpha
channel composite and (b) is the general composite.
(a) Alpha Blending Composite (b) A General Composite
Figure 2.4: Composites
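The text states that the composite is specified by mathematical equations without giving them. A commonly used form of alpha blending is new = alpha * foreground + (1 - alpha) * background; the sketch below assumes this form with an 8-bit alpha channel (0 to 255). The function name `alpha_blend` and the rounding rule are illustrative assumptions.

```python
def alpha_blend(foreground, background, alpha):
    """Blend two pixel rows using a per-pixel 8-bit alpha channel.

    alpha = 255 selects the foreground pixel, alpha = 0 the background pixel.
    The +127 term rounds the integer division to the nearest value.
    """
    return [(a * f + (255 - a) * b + 127) // 255
            for f, b, a in zip(foreground, background, alpha)]


foreground = [200, 200, 200]
background = [100, 100, 100]
alpha = [0, 255, 128]
blended = alpha_blend(foreground, background, alpha)
```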
2.2.1.2.5 Conversions
With the variety of image formats in use, there is a frequent need to convert
from one format to another. This operation provides conversion among image formats
such as JPEG, GIF, bitmap, and TIFF.
2.2.2 Graphics
Graphics data and image data are often confused with each other. Graphics share the
same basic representational concepts as images, such as the pixel as a unit, color,
channel, and pixel aspect ratio. Once graphics are created, they may be called graphic
images; in that sense, graphics are a kind of image. However, strictly speaking, graphics
need a preparation process before becoming images. They must go through technical steps
used only for graphics, i.e., modeling and rendering. The modeling and rendering processes
turn graphics into more realistic graphic images, and these two steps differentiate
graphics from images.
2.2.2.1 Modeling
A model is an outline, like a schematic or blueprint, for the graphic images. To construct
a scene, the elements of the scene, called objects, need to be modeled first. Generally, a
graphic model contains groups of such graphics objects. There are many ways to model
an object, such as geometric models, solid models, and drawing models, but the basic
concept of modeling remains the same.
When modeling, one of the important elements to consider is whether the model is for
2-dimensional or 3-dimensional objects. Figure 2.5 shows examples of 2-D and 3-D modeling.
Two-dimensional primitives include lines, shapes such as rectangles and ellipses,
curves, and more general polygons. Three-dimensional primitives also include surfaces of
various forms.
(a) 2-Dimensional (b) 3-Dimensional
Figure 2.5: Modeling
2.2.2.2 Rendering
The definition of rendering in graphics varies among computer graphics designers;
to avoid confusion, rendering here refers to the process of calculating image
details and drawing them on a screen. Common rendering techniques can be classified as
follows: adding surface, shading, and ray tracing.
2.2.2.2.1 Adding Surface
This technique is applied to objects to attach realistic surfaces to any part of the
model; this requires a mapping from 3-D surface coordinates to 2-D image coordinates.
Given a point on the surface, the image is sampled, and the resulting value is applied to
the surface at that point. Each surface is then mapped with attributes such as
coloring, texture, transparency, and reflectivity. Figure 2.6 shows the process of adding
surfaces to a 2-D object, where (a) is a texture, (b) is a sphere-shaped object, and (c) is
the sphere mapped with the texture.
Figure 2.6: Adding Surface
2.2.2.2.2 Shading
The shading technique adds information to the images on how light interacts with the
various objects within the model. Depending on the sensitivity, shading is divided
into flat shading and curved shading. In flat shading, surfaces are described using
meshes of small polygonal surfaces or blocks of color. Curved shading, on the other hand,
goes a step further by blending the colors at the edges of each shape (interpolation) to
make the surfaces appear smoother. There is also an even more sophisticated method,
called Gouraud shading. Figure 2.7 illustrates examples of shading; (b) looks more
realistic than (a) because the surface is made smooth by linearly interpolating across
polygonal facets.
(a) Flat Shading (b) Gouraud Shading
Figure 2.7: Shading
2.2.2.2.3 Ray Tracing
Strictly speaking, ray tracing can be considered a more sophisticated shading
technique. Ray tracing retraces the exact path of every single ray of light that hits the
objects, following each ray as it bounces off the surfaces of the corresponding objects. It
causes all the shadows, reflections, and highlights in the scene to be drawn. As a result,
the final image is highly detailed and very realistic; on average, it involves tracing
the paths of over 300,000 rays. Figure 2.8 shows two examples: (a) is simple ray tracing
applied to two objects, and (b) shows an image where all the objects are individually
shaded using complex ray tracing.
(a) Simple Raytracing (b) Complex Raytracing
Figure 2.8: Ray Tracing
2.2.3 Video and Audio
In this section, the fundamental concepts and techniques of video/audio are intro-
duced. Also, the current processing methods, including compression techniques, are ex-
plained.
2.2.3.1 Video Representations
A moving picture on a monitor, commonly called video data, is actually a group of
still images displayed consecutively on screen so that the image appears to move. This
moving picture consists of sequences of video pictures, or frames, that are played back at
a fixed number of frames per second. The concepts of frame, frame rate, scan line, and
sampling are explained below:
•  Frame: The moving picture or video is a 3-dimensional image displayed in two
dimensions, and is made of a group of still images. Each still image is called a
frame.
•  Frame Rate: The number of frames displayed per second (fps). An impression of
smooth motion starts at about 16 frames per second. The frame rates of movies, American
TV, and European TV are 24, 30, and 25 fps, respectively.
•  Number of Scan Lines: A frame is divided into multiple scan lines, and each frame
usually consists of a few hundred lines.
•  Sampling: This is generally used for audio signals, and allows the conversion of a
continuous analog signal into a discrete digital signal.
(a) Moving Picture Structure (b) Frame Structure (scan lines)
Figure 2.9: Moving Picture with 30 Frames (30 fps)
Figure 2.9 illustrates a moving picture played at 30 fps. Each frame is a still image,
that is, a general digital image containing a number of pixels. Therefore, video frames
can also be modified and manipulated like graphics or images.
2.2.3.2 Audio Representations
Digital audio is produced by sampling the continuous audio signal of the source. An
analog-to-digital converter (A/D converter or ADC) takes as input an electrical signal
corresponding to the sound (coming from a microphone) and produces a digital data
stream. The analog signal is reproduced from this data by a digital-to-analog converter
(D/A converter or DAC), which consumes the data stream and generates an analog electrical
signal. The technical terms related to digital audio are as follows: sampling
frequency, sample size, and number of channels.
•  Sampling frequency (rate): This refers to the number of times per second that the
digitizing circuitry measures an analog signal to produce a digital value. The higher the
sampling rate, the finer the quality of the digitized media.
•  Sample size: This refers to the number of bits used to represent each sampled value.
•  Number of channels: This determines the spatial characteristics of the audio sound.
Stereo audio signals contain two channels, the left and the right. Some audio products use
four channels, while professional audio editing equipment deals with 16, 32, or more
channels.
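The three parameters above together determine the uncompressed data rate of digital audio. The sketch below multiplies them; the function name `pcm_bit_rate` is illustrative, and the CD audio parameters (44.1 kHz, 16 bits, 2 channels) are the standard values for that format rather than figures taken from the text.

```python
def pcm_bit_rate(sampling_hz, sample_bits, channels):
    """Uncompressed audio data rate in bits per second."""
    return sampling_hz * sample_bits * channels


# Standard CD audio: 44.1 kHz sampling, 16-bit samples, stereo.
cd_rate = pcm_bit_rate(44_100, 16, 2)
# A single-channel 8 kHz, 8-bit signal for comparison (illustrative values).
mono_rate = pcm_bit_rate(8_000, 8, 1)
```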
2.2.3.3 Processing of the Video and Audio Data
It is not practical to handle a huge amount of video and audio data without reducing
its size. Reducing the amount of data not only saves storage space, but also
increases the processing speed. This data reduction technique is called compression.
[Figure residue: audio (CD audio, DAT); image (lossy: JPEG; lossless); video (real-time; interframe vs. intraframe: MPEG, DVI)]
Figure 2.10: Compression Structure
Figure 2.10 shows the compression techniques widely used in audio, image, and
video applications [6]. Audio compression includes two methods. Pulse Code
Modulation (PCM) represents the audio signal as a series of pulses
at various sampling rates. Adaptive Differential Pulse Code Modulation (ADPCM) further
reduces the PCM data rate by encoding the differences between sample values.
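The idea of encoding differences between sample values can be sketched as follows. This is plain delta coding, not ADPCM itself: the adaptive quantization step that gives ADPCM its name is omitted, and the function names `delta_encode` and `delta_decode` are illustrative assumptions.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous sample."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas


def delta_decode(deltas):
    """Rebuild the samples by accumulating the differences."""
    total = 0
    samples = []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


samples = [1000, 1004, 1009, 1007]
deltas = delta_encode(samples)
restored = delta_decode(deltas)
```

Because neighboring audio samples tend to be close in value, the differences after the first sample are small and can be stored in fewer bits than the samples themselves.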
The image compression methods can be classified as either Lossless or Lossy.
The lossless method removes only redundant data, and the data is neither altered nor
lost in the process. Lossy compression, such as JPEG, on the other hand, removes
redundancies as well as some unnecessary information, which reduces the data
size more aggressively than lossless compression.
In video, lossy compression is widely accepted since it is not critical to
reconstruct the exact video data from the compressed data to obtain high-quality video.
Finally, Interframe compression is a relative compression method often used in
MPEG. This compression method takes advantage of the fact that successive
video frames tend to be similar temporally. Interframe compression, also called
frame-to-frame compression, exploits temporal redundancy and predicts the intermediate
frames from independently coded key frames. The key frame is one that differs
significantly from the immediately following frames.
3 MEDIA PROCESSING ARCHITECTURE
As the demand for multimedia applications increases, hardware solutions for multimedia
processing are moving away from the special-purpose processor combined with
specialized hardware modules, such as a digital signal processor, audio processor, and pixel
processor. Instead, general-purpose processors with enhanced multimedia capability are
being tested as an alternative. This chapter looks at the trend in multimedia processing,
and discusses the limitations of current methods and the feasibility of general-purpose
processors with enhanced multimedia processing.
3.1 Limitations of Special Purpose Processors
The amount of data that needs to be processed in a multimedia application can be
tremendous, and thus requires very intensive computing. For example, when compressed
video is retrieved and displayed on screen, the total amount of data that needs to be
processed indirectly reflects how much workload the processors have to handle. Table 3.1
shows the size of various movie clips. It shows that the size of a 1-second movie clip
ranges approximately from 2 Mb to 18 Mb. Thus, a large amount of multimedia data has to be
processed each second.
                          Movie 1     Movie 2     Movie 3
Image Width by Height     640*480     640*480     640*480
Number of Color Bits      2           8           8
Frames per Second         15          30          30
Compression Ratio         4:1         4:1         10:1
Compressed Image Size     2.304Mb     18.43Mb     7.372Mb
Table 3.1: Calculation for Movie Size [3]
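The entries of Table 3.1 can be reproduced with the following sketch, which assumes that the compressed size of one second of video is width x height x color bits x frames per second, divided by the compression ratio. The function name `compressed_bits_per_second` is illustrative.

```python
def compressed_bits_per_second(width, height, color_bits, fps, ratio):
    """Compressed size, in bits, of one second of video."""
    raw_bits = width * height * color_bits * fps
    return raw_bits / ratio


movie1 = compressed_bits_per_second(640, 480, 2, 15, 4)    # Movie 1
movie2 = compressed_bits_per_second(640, 480, 8, 30, 4)    # Movie 2
movie3 = compressed_bits_per_second(640, 480, 8, 30, 10)   # Movie 3
```

Dividing each result by one million gives the megabit values listed in the table.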
As a result, it is not possible for a general processor to operate on such a tremendous
amount of data without special hardware support. For this reason, hardware dedicated to
multimedia data, a so-called special-purpose device or chip, has been used. As
hardware technology has improved, such devices have changed from separate boards and
special-purpose chips to Digital Signal Processors (DSPs) [8]. However, the cost of such
devices has increased as the memory and processing requirements for multimedia processing
have increased. To reduce the cost, hardware designers and vendors have come up with
the idea of combining memory with processing hardware [8]. These movements are
ultimately focused on the utilization of a general-purpose processor as a way of processing
multimedia.
In order to run multimedia applications on a general-purpose processor (GPP), both
the core processor design and its Instruction Set Architecture (ISA) have to be considered.
As a hardware requirement, the core of the GPP must be able to handle the workload of
multimedia applications. Fortunately, semiconductor technology has made great advances
during the last decade. As a result, general-purpose processors
now have processing power that is comparable to that of special-purpose processors.
Therefore, a general-purpose processor can fully substitute for a special-purpose processor
for multimedia processing, with the additional advantage of being cost-effective.
Also, the ISA of a GPP has to be improved for processing multimedia data because of the
difficulties in manipulating the various data types used in multimedia. The common
characteristic of multimedia data is that they are mostly composed of small, native
data types. Therefore, multiple multimedia data items can be easily packed into a single
word, providing a way to exploit subword parallelism [9]. For example, four 16-bit
multimedia data items can be packed into a 64-bit word and manipulated at the same time.
Therefore, multimedia extensions based on subword parallelism for GPPs have been developed
by many microprocessor vendors, e.g., Intel's MMX, Sun Microsystems' VIS for
SPARC, Silicon Graphics' MDMX for MIPS V, DEC's MVI for Alpha, and Hewlett-Packard's
MAX2 for PA-RISC.
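The packing of four 16-bit data items into one 64-bit word, mentioned above, can be sketched with shifts and masks. The function names `pack16x4` and `unpack16x4` and the choice to place the first item in the most significant subword are assumptions made for illustration.

```python
def pack16x4(values):
    """Pack four 16-bit values into one 64-bit word, first value highest."""
    word = 0
    for value in values:
        word = (word << 16) | (value & 0xFFFF)
    return word


def unpack16x4(word):
    """Split a 64-bit word back into its four 16-bit subwords."""
    return [(word >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


packed = pack16x4([1, 2, 3, 4])
unpacked = unpack16x4(packed)
```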
3.2 Multimedia Extensions
Multimedia extension refers to an instruction set that is specially designed for
multimedia information processing at the ISA level. Its basic goal is to provide
high-performance media processing on a general-purpose microprocessor. Consider a
multimedia system with no multimedia extension, as shown in Figure 3.1 (a), where an Intel
Digital Video Interactive (DVI) board is used for real-time video playback. The DVI
board is composed of six components: Audio processor, Display processor,
VRAM, Pixel processor, Video digitizer, and Audio digitizer. A
description of the six components follows.
[Figure residue: (a) a host system (host processor, secondary storage, host memory, input devices, host video adaptor) with a delivery board (audio processor, pixel processor, VRAM, display processor) and an optional capture board (video digitizer, audio digitizer); (b) a host system only, with a host processor with multimedia extension, secondary storage, host memory, input devices, and host video adaptor]
(a) Without Multimedia Extension [6] (b) Multimedia Extension
Figure 3.1: Capability of Multimedia Extension
•  Audio processor decodes encoded audio data from VRAM and passes it to the
digital-to-analog converters.
•  Pixel processor performs operations such as graphics rendering, image processing,
image compression, and special effects on the video data.
•  VRAM buffers video frames composed of audio and video data. This memory space
provides temporary storage for each processor.
•  Display processor generates an analog signal suitable for displaying on a monitor.
•  Audio and Video digitizers capture analog signals and transform them into digital
signals. This board is optional, so it does not influence multimedia processing.
As shown in Figure 3.1 (b), GPP with multimedia extensions can replace the audio
processor, pixel processor, and VRAM components in a given multimedia system.
Therefore, multimedia enhanced GPP is cost-effective. This section explains the design
considerations for multimedia extension, such as the Single Instruction Multiple Data
(SIMD) technique, and provides an overview of a multimedia instruction set.
3.2.1 Multimedia Extension Architecture
Many current general-purpose processors support operations on 32-bit or 64-bit
data at a very high clock speed. These wide architectural enhancements offer better
performance for scientific and business applications. Unfortunately, multimedia-based
applications do not fully benefit from the enhanced architecture, not because of a
flaw in the architecture or a lack of raw computational power, but because of the
processor's inefficiency in dealing with digital media data types. For example, digital
audio uses a 16-bit data type, and pixels used in image, graphics, or video use an 8-bit
data type. When the processor loads data, if a single unit of 16-bit audio data or a single
8-bit pixel is fetched, the processor will operate on only a single 8- or 16-bit value. As
a result, the processor's resources cannot be fully utilized. Figure 3.2 shows an example
of the wasted operand space. In this example, an add operation is performed on Source A
and Source B, wasting almost 75% of the operands' space. If this space remains unused on
every operation, the amount of wasted resources will be quite large and performance will
be poor. Therefore, operating on multiple data in parallel is important to enhance the
overall performance of multimedia applications.
[Figure residue: Source A and Source B each hold one 8-bit value in bits 7-0 of a 64-bit operand; bits 63-8 are unused in both sources and in the Destination of the add]
Figure 3.2: Wasted Operand Space [10]
3.2.2 Single Instruction Multiple Data (SIMD)
The basic components of multimedia data are usually simple integers 8 or 16
bits wide; thus subword parallelism is easily found in multimedia data. SIMD processing
of subwords allows the same instruction to be applied to every subword within a word. A
subword is a lower-precision unit of data contained within a word [9]. For example, if the
word size is 64 bits, a subword can be 8, 16, or 32 bits wide, as shown in Figure 3.3
(a). Hence, an instruction can simultaneously operate on eight 8-bit subwords, four 16-bit
subwords, or two 32-bit subwords. Figure 3.3 (b) shows an add operation on multiple
subwords in an FU of a GPP. Therefore, the SIMD model can be extended into the GPP
architecture to improve multimedia performance.
(a) Subwords within a 64-bit word: one 64-bit word, two 32-bit subwords, four 16-bit subwords, or eight 8-bit subwords
(b) SIMD in a single processor: the Dispatch Unit issues one add to the Functional Unit, which combines subword src1 and subword src2 into subword dst
Figure 3.3: SIMD Processing in a Subword
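The effect of one SIMD add on a 64-bit word can be emulated in software as a minimal sketch. The function name `partitioned_add` is illustrative; the sketch assumes that each subword wraps around on overflow and that no carry crosses a subword boundary.

```python
def partitioned_add(a, b, lane_bits, word_bits=64):
    """Add two words subword by subword, with no carry between subwords."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Masking drops the carry out of the subword (wrap-around).
        result |= (lane_sum & mask) << shift
    return result


a = 0x0001_00FF_7FFF_FFFF
b = 0x0001_0001_0001_0001
sum16 = partitioned_add(a, b, 16)   # four 16-bit subwords
sum8 = partitioned_add(a, b, 8)     # eight 8-bit subwords
```

The same pair of operands gives different results for the two subword sizes because the carry out of each subword is discarded at a different boundary.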
3.2.3 Multimedia Instruction Set Overview
The class of instructions implemented in various multimedia extensions can be di-
vided into four categories: data conversion, arithmetic/logical, memory access, and mis-
cellaneous.
3.2.3.1 Data Conversion
In most multimedia applications, it is inevitable that the type or size of multimedia
data changes during the intermediate processing steps. In order to preserve
precision, data conversion instructions are used to convert among 8-, 16-, and 32-bit data
in packed or unpacked subwords.
[Figure residue: 8-bit, 16-bit, and 32-bit subwords linked by data conversion]
Figure 3.4: Data Conversion
Figure 3.4 illustrates the conversion process of an 8-bit word into a 32-bit word.
These conversion instructions include pack, unpack, and expand instructions.
3.2.3.2 Arithmetic and Logic
Addition, subtraction, multiplication, and division are the most frequently used
operations in multimedia applications. Consider the example in Figure 3.5, where an
arithmetic operation on each 8-bit subword is executed in parallel.
(a) 4 subwords with 8 bits (b) 4 subwords with 8 bits (c) 4 subwords with 8 bits
Figure 3.5: Arithmetic and Saturation
During the arithmetic operation, overflow or underflow may occur in some subwords,
as indicated by the shaded area, yielding a wrong result. However, each subword is actually
an independent data item, representing an element such as a pixel. Saturation guarantees
the correctness of each subword during arithmetic operations. Consider the 8-bit color
scheme where the value 0 represents black and the value 255 represents white. The result of
adding 1 to 255 should not change a white pixel into a black one, i.e., a result of 256
should saturate to 255 rather than wrap around to 0. Saturation arithmetic is often used
in 3-D rendering algorithms such as Gouraud shading and alpha blending.
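The difference between wrap-around and saturating addition on an 8-bit pixel can be sketched as follows. The function names are illustrative, and only the unsigned 8-bit case from the example above is covered.

```python
def wrapping_add_u8(a, b):
    """Add two unsigned 8-bit values, discarding the carry (wrap-around)."""
    return (a + b) & 0xFF


def saturating_add_u8(a, b):
    """Add two unsigned 8-bit values, clamping the result at 255."""
    return min(a + b, 255)


wrapped = wrapping_add_u8(255, 1)       # white pixel turns black
saturated = saturating_add_u8(255, 1)   # white pixel stays white
```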
3.2.3.3 Memory Access
The memory access instruction is one of the most important features that provide the
acceleration of media processing. Memory access instructions generally support the
following features:
•  Direct access to memory, bypassing the caches for data fetch.
•  Fewer instructions are required for data load and store, e.g., loading eight
8-bit data items usually requires 8 instructions, but can be done with 1 instruction if
all the 8-bit data are loaded simultaneously.
•  High memory bandwidth that allows the processor to move data at a fast rate, thus
improving the processing speed.
3.2.3.4 Miscellaneous
So far, the most common operations offered by multimedia extensions have been
introduced. In addition, there are several other instructions, such as edge, pdist, and
array, which are intended for specific graphics applications.
•  edge: This instruction is used for generating a mask for a partial store operation. A
partial store operation stores only 8-bit or 16-bit data into memory, requiring
invalidation or masking of the remaining 8- or 16-bit data in a word. This process is
called the mask operation, and the edge instruction performs it.
•  pdist: This instruction implements a pixel distance calculation. Pixel distance is the
absolute value of the difference between two pixels, and the calculation of
pixel distance is required in the MPEG video compression technique.
•  array: This instruction breaks down 3-D image data into 2-D planar image data. It
is commonly used in medical applications.
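The pixel distance calculation described for pdist can be sketched in software as a sum of absolute differences over a group of pixels. The function name `pixel_distance`, the group size of eight pixels, and the accumulator argument are assumptions made for illustration; they are not taken from the text.

```python
def pixel_distance(pixels_a, pixels_b, accumulator=0):
    """Add the absolute differences of corresponding pixels to accumulator."""
    return accumulator + sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))


# Two groups of eight 8-bit pixels (illustrative data).
block_a = [10, 20, 30, 40, 50, 60, 70, 80]
block_b = [12, 18, 30, 45, 50, 50, 75, 80]
distance = pixel_distance(block_a, block_b)
# Accumulating a second group onto the first result.
running = pixel_distance(block_a, block_a, accumulator=distance)
```

A small distance means the two pixel groups are similar, which is how a video encoder can compare candidate blocks between frames.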
4 VIS INSTRUCTION SET OF ULTRA-SPARC
The Visual Instruction Set (VIS) is a set of multimedia extensions for the UltraSPARC-V9
by Sun Microsystems. VIS is designed to speed up multimedia processing, and
provides support for 2D and 3D graphics, image, video/audio compression and
decompression, networking, and other algorithms [11]. Also, VIS enhances the performance of
multimedia processing by 2 to 4 times.
To understand how VIS improves the performance, this chapter describes the
architecture of UltraSPARC-V9 and the specific hardware designs that support VIS. In
addition, the fundamental operations provided by VIS are explained.
4.1 Hardware Features of UltraSPARC
The UltraSPARC microprocessor is based on the 64-bit SPARC-V9 architecture
and gives accelerated multimedia performance using VIS [11]. VIS is designed to
be well suited for UltraSPARC. This section explains the hardware structure of
UltraSPARC as well as the hardware features that support VIS.
4.1.1 Overview of UltraSPARC Architecture
SPARC stands for Scalable Processor ARChitecture. Based on a RISC architecture,
UltraSPARC is a superscalar processor capable of issuing four instructions per cycle, with
special emphasis on VIS [11]. Figure 4.1 shows a block diagram of the UltraSPARC-I
architecture, composed of five major functional units: Prefetch/Dispatch Unit (PDU),
Integer Execution Unit (IEU), Floating-Point/Graphics Unit (FGU), Load Store Unit (LSU),
and External Cache (E-cache).
[Figure residue: block diagram with the Prefetch and Dispatch Unit (I-Cache, branch prediction, next field, IMMU), Integer Execution Unit, Branch Unit, Load/Store Unit (Load Buffer, D-Cache, DMMU, Store Buffer), Floating-Point/Graphics Unit, and the Second-Level Cache Interface/System Interface]
Figure 4.1: UltraSPARC-I [11]
•  Prefetch/Dispatch Unit (PDU): This unit prefetches instructions based on a dynamic
branch prediction mechanism and a next field address. By predicting branches
accurately over 90% of the time, the PDU can supply four instructions per cycle to the
core execution unit.
•  Integer Execution Unit (IEU): The IEU performs all integer arithmetic/logical
operations. This unit incorporates a novel 3-D register file with support for 7 read and
3 write ports.
•  Floating-Point/Graphics Unit (FGU): The FGU integrates five functional units and a
Register File composed of 32 64-bit registers. The floating-point adder, multiplier,
and divider have been augmented with a graphics adder and multiplier to perform the
partitioned integer operations in VIS.
•  Load Store Unit (LSU): The LSU executes all instructions that transfer data between
the memory and the register files in the IEU and the FGU. Included in this unit are
the Data Cache (D-Cache), Load Buffer, Store Buffer, and Data Memory Management
Unit (DMMU).
•  External Cache (E-Cache): The E-Cache handles the misses in the instruction cache
of the PDU and the data cache of the LSU.
All instructions in UltraSPARC are executed in the following steps. When instructions
are fetched by the PDU, they are stored in the Instruction Buffer (IB). Then, four
instructions in the IB are decoded and issued to the corresponding Functional Units (FUs)
by the Dispatch Unit. At this point, the FUs execute the instructions. If an instruction is
a VIS instruction, it is executed in the FGU.
4.1.2 Hardware Support for VIS Instruction Set
To fully support over 40 instructions tailored for multimedia data manipulation,
several hardware design considerations are made for VIS. This section introduces the
hardware of VIS and its design concepts. In particular, the processor pipeline, Register
File, and Floating-point/Graphics Unit (FGU) are presented to show how they differ from
similar components in other general-purpose processors.
4.1.2.1 Pipeline
There are two major architectural supports for VIS: a dual pipeline and a 9-stage pipeline.
Figure 4.2 shows the UltraSPARC processor pipeline, composed of two pipelines, the Integer
Pipeline and the Floating-point/Graphics (FG) Pipeline. The FG pipeline is
added to the integer pipeline to exclusively support VIS. Thus, the FG pipeline
is mainly used for the execution of VIS instructions. This allows fast execution of
multimedia operations by guaranteeing a separate execution channel for multimedia data
processing. When VIS instructions are dispatched, they are executed through the FG
pipeline. On the other hand, the integer pipeline primarily executes instructions related
to the IEU and LSU. All load and store instructions are executed by the integer pipeline.
Figure 4.2: Dual 9-stage Pipeline [11]
Figure 4.2 shows the processor pipeline of UltraSPARC with its 9 stages. The functions
of each stage are explained below:
•  Fetch (F) stage: Instructions are brought in.
•  Decode (D) stage: Instructions are decoded and stored in the IB.
•  Grouping (G) stage: The instructions are grouped into the integer or FG category and
then dispatched to the corresponding pipe.
•  Execution (E) and Register (R) stages: Integer instructions are executed in the integer
pipe, and FG instructions are decoded again to see if they are FP or FG instructions.
•  Cache Access (C) and Execution (X1) stages: The cache is accessed for load or store. In
the FG pipe, execution is performed.
•  N1 and Execution (X2) stages: The cache hit/miss of an integer instruction is
determined. Execution of FG instructions continues.
•  N2 and Execution (X3) stages: Integer instructions are held to be synchronized with
FG instructions.
•  N3 stage: Integer and FG instructions are merged.
•  Write (W) stage: All results from the integer and FG pipes are written to the register
files.
Note that three stages, N1, N2, and N3, are added to the integer pipeline in order to
make the integer pipeline symmetrical to the FG pipeline. This also makes exception
handling simple.
4.1.2.2 Register File
An UltraSPARC processor includes two types of registers: general-purpose registers
and control/status registers. The general-purpose registers consist of about 160 integer
registers and 32 floating-point registers. The integer register file is organized into
register windows and a global register set [12].
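[Figure residue: (a) 64-bit double-precision registers Rd0 to Rd31, bits 63-0; (b) 32-bit single-precision registers paired as Rs0/Rs1, Rs2/Rs3, ..., Rs28/Rs29, Rs30/Rs31 in the upper (bits 63-32) and lower (bits 31-0) halves of the shared double-precision registers]
Figure 4.3: Floating point Register for Multimedia Data [13]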
Figure 4.3 shows the floating-point registers, consisting of (a) 32 64-bit
double-precision registers and (b) 32 32-bit single-precision registers, where 16
double-precision registers are shared with the 32 single-precision registers. For example,
Rd0 is shared with Rs0 and Rs1, where Rs and Rd denote single- and double-precision
floating-point registers, respectively. This arrangement of registers allows the processor
to independently access and update the high and low 32-bit halves of a 64-bit
double-precision register by referencing them as the corresponding single-precision
registers [13].
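The register sharing described above can be modeled as a minimal sketch. The class name `FPRegisters` and its methods are hypothetical. The numbering follows the text (Rd0 shared with Rs0 and Rs1); placing the even-numbered single-precision register in the upper half is an assumption read from the layout of Figure 4.3.

```python
class FPRegisters:
    """Model of 32 64-bit registers whose first 16 are also visible as
    32 single-precision halves (hypothetical helper, not a real API)."""

    def __init__(self):
        self.rd = [0] * 32  # 64-bit double-precision registers

    def read_rs(self, n):
        """Read single-precision register n (0..31) from its 64-bit parent."""
        word = self.rd[n // 2]
        # Assumption: even-numbered Rs is the upper half, odd the lower half.
        return (word >> 32) & 0xFFFFFFFF if n % 2 == 0 else word & 0xFFFFFFFF

    def write_rs(self, n, value):
        """Update one 32-bit half without disturbing the other half."""
        value &= 0xFFFFFFFF
        d = n // 2
        if n % 2 == 0:
            self.rd[d] = (self.rd[d] & 0x00000000FFFFFFFF) | (value << 32)
        else:
            self.rd[d] = (self.rd[d] & 0xFFFFFFFF00000000) | value


regs = FPRegisters()
regs.rd[0] = 0x1122334455667788
high_half = regs.read_rs(0)
low_half = regs.read_rs(1)
regs.write_rs(1, 0xAABBCCDD)
```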
In addition to the FP register files, VIS instructions use the Graphic Status Register
(GSR), introduced specially for multimedia processing. This special register is mainly
used for operations such as data format conversion or memory alignment. Figure 4.4
shows the GSR, composed of two fields: scale_factor and align_offset.
[Figure residue: 64-bit register with scale_factor in bits 6-3 and align_offset in bits 2-0]
Figure 4.4: Graphic Status Register [14]
The vis_pack() instruction uses the scale_factor field to specify how much shifting
occurs when the data is packed. Because scale_factor is 4 bits wide, the shift amount
ranges from 0 to 15.
In order to provide flexibility in loading multimedia data, align_offset is used
to specify the offset of a particular data item within a 64-bit address boundary. The
content of align_offset is determined by the rightmost three bits of the address computed
by the vis_alignaddr() instruction. When the data is loaded from memory, the actual data
address is determined by the GSR's align_offset relative to the current address pointer of
vis_alignaddr().
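The use of the two GSR fields can be illustrated with the following sketch. The helper names `gsr_fields`, `align_address`, and `extract_aligned` are hypothetical and do not reproduce the actual VIS instructions; the field positions (scale_factor in bits 6-3, align_offset in bits 2-0) are read from Figure 4.4, and the extraction of eight bytes from two neighboring 64-bit words is an assumed model of how an offset is applied.

```python
def gsr_fields(gsr):
    """Split a GSR value into (scale_factor, align_offset)."""
    return (gsr >> 3) & 0xF, gsr & 0x7


def align_address(address):
    """Return the 8-byte-aligned address and the 3-bit offset within it."""
    return address & ~0x7, address & 0x7


def extract_aligned(high_word, low_word, align_offset):
    """Take 8 bytes starting align_offset bytes into two adjacent words."""
    combined = (high_word << 64) | low_word
    shift = (8 - align_offset) * 8
    return (combined >> shift) & 0xFFFFFFFFFFFFFFFF


scale_factor, align_offset = gsr_fields(0b1010_011)
base, offset = align_address(0x1003)
data = extract_aligned(0x0011223344556677, 0x8899AABBCCDDEEFF, offset)
```

With an offset of 3, the extracted value starts at the fourth byte of the first word and continues into the second word.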
4.1.2.3 Floating point/Graphic Unit (FGU)
The FGU is provided for the operation of the FP and VIS instructions. Figure 4.5 shows
that the FGU is composed of five Functional Units (FUs), which can be grouped into two
categories:
4.1.2.3.1  Floating-point Instruction Execution Unit
•  Floating-point Adder: addition/subtraction/absolute value/negative 
•  Floating-point Multiplier: multiplication 
•  Floating-point Divider: divide/square root 
4.1.2.3.2  VIS Instructions Execution Unit 
•  Graphics Adder: align, merge, expand, and logical 
•  Graphics Multiplier: pack, compare and pdist 
[Figure residue: blocks for the Dispatch Unit, Load/Store Unit, Integer Execution Unit, Branch Unit, and Floating-point Graphic Unit (FGU), with paths labeled instructions, data, and load data]
Figure 4.5: Floating-point/Graphics Unit [15]
The graphics adder unit and the graphics multiplier unit operate only on VIS instructions.
The graphics adder performs the single-cycle partitioned data operations, that is, 16- and
32-bit partitioned add, subtract, data alignment, merge, and expand. As discussed in
Section 3.2.2, the design of the graphics adder unit differs from other adders in that it
does not propagate carries between its partitioned adders.
[Figure residue: four 16-bit adders, A3 to A0]
Figure 4.6: Partitioned 16/32-bit graphics adder [16]
The graphics adder, shown in Figure 4.6, is organized as 4 independent 16-bit adders,
A3, A2, A1, and A0. These adders are designed not to propagate carries and overflows.
For 32-bit operations, a carry is generated by adders A0 and A2, and is used to
select one of two results computed for adders A1 and A3; results are calculated for both
a "0" and a "1" carry input, analogous to a conditional sum adder. Selection of the correct
result is based on the value of the carry. Finally, there is no clamping performed on
individual results; carry outs are simply ignored, which means the results wrap around.
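The conditional-sum selection described above can be sketched for one 32-bit lane built from two 16-bit adders. The function names `add16` and `add32_conditional_sum` are illustrative, and the sketch models only the arithmetic behavior, not the circuit timing.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """One 16-bit adder: return (16-bit sum, carry out)."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def add32_conditional_sum(a, b):
    """32-bit wrap-around add using two 16-bit adders (A0 low, A1 high)."""
    low, carry = add16(a & MASK16, b & MASK16)             # adder A0
    a_high, b_high = (a >> 16) & MASK16, (b >> 16) & MASK16
    high_if_0, _ = add16(a_high, b_high, 0)                # A1, carry-in "0"
    high_if_1, _ = add16(a_high, b_high, 1)                # A1, carry-in "1"
    high = high_if_1 if carry else high_if_0               # select by A0 carry
    # The carry out of the high half is ignored, so the result wraps around.
    return (high << 16) | low


carried = add32_conditional_sum(0x0001FFFF, 0x00000001)
wrapped_sum = add32_conditional_sum(0xFFFFFFFF, 0x00000001)
```

Computing both candidate high halves in advance means the high adder does not wait for the low adder; only the final selection depends on the carry.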
[Figure residue: four 8*16 multipliers, each producing a 16-bit result]
Figure 4.7: Partitioned 16*8 bit Multiplier [16]
The graphics multiplier unit is composed of four 8*16 multipliers, as shown in Figure 4.7.
Combined together, the four multipliers allow the implementation of a 64-bit operation in
three cycles as follows:
•  In the first cycle, the operands are Booth encoded, and the corresponding data is
accessed through a MUX.
•  In the second cycle, multiplication is implemented by add and shift operations.
•  In the last cycle, results are propagated to the appropriate sub-unit, such as the
completion unit.
The VIS functions performed by the multiplier are mainly compare, pack, and pixel
distance operations.
4.2VIS Data Format
This section examines the data types tailored for VIS instructions. The VIS provides
eight data types to support multimedia applications. Figure 4.8 shows all the data types
in VIS, from 8-bit signed to 64-bit double.
Signed byte:     vis_s8   (sign bit 7, value bits 6-0)
Unsigned byte:   vis_u8   (bits 7-0)
Signed short:    vis_s16  (sign bit 15, value bits 14-0)
Unsigned short:  vis_u16  (bits 15-0)
Signed long:     vis_s32  (sign bit 31, value bits 30-0)
Unsigned long:   vis_u32  (bits 31-0)
Float:           vis_f32
Double:          vis_d64
Figure 4.8: VIS Data Types [11]
Data formats for VIS can be classified into Partitioned Data Format and Fixed Data
Format based on the multimedia data type. A data format differs from a data type because it
contains subwords. These two data formats are required for operations on data such as
graphics, audio, and coefficient values used in filtering.
4.2.1 Partitioned Data Formats
Partitioned data format refers to a word format composed of several subwords. The
choice of a format depends on application's data type. For example, in graphics or video
applications, pixels typically consist of four 8-bit unsigned integers contained in a 32-bit
word. Figure 4.9 illustrates several partitioned data formats used for various multimedia
applications.
(a) vis_f32: four u8 subwords (bits 31-24, 23-16, 15-8, 7-0)
(b) vis_f32: two s16 subwords (bits 31-16, 15-0)
(c) vis_d64: four s16 subwords (bits 63-48, 47-32, 31-16, 15-0)
(d) vis_d64: eight u8 subwords (bits 63-56, 55-48, 47-40, 39-32, 31-24, 23-16, 15-8, 7-0)
Figure 4.9: Partitioned Data Format [11]
Each 32-bit and 64-bit word has two types of partitioned data formats based on their
subword size. (a) is composed of four 8-bit unsigned integers. (b) and (c) are the 16-bit
signed fixed point values contained in 32-bit and 64-bit partitioned data format, respec-
tively. (d) represents 8-bit partitioned data format.
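To make the subword layout concrete, the following sketch packs and extracts subwords with ordinary shifts and masks. It is an illustration only; the helper names pack_4_u8, get_u8, and get_s16 are hypothetical, and plain integer types stand in for vis_f32 and vis_d64.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned 8-bit subwords into a 32-bit word (format (a));
 * p3 occupies bits 31..24 and p0 occupies bits 7..0. */
uint32_t pack_4_u8(uint8_t p3, uint8_t p2, uint8_t p1, uint8_t p0)
{
    return ((uint32_t)p3 << 24) | ((uint32_t)p2 << 16) |
           ((uint32_t)p1 << 8)  |  (uint32_t)p0;
}

/* Extract 8-bit subword `index` of a 32-bit word (0 = bits 7..0). */
uint8_t get_u8(uint32_t word, int index)
{
    assert(index >= 0 && index < 4);
    return (uint8_t)(word >> (8 * index));
}

/* Extract signed 16-bit subword `index` of a 64-bit word (format (c));
 * index 0 is bits 15..0 and index 3 is bits 63..48. */
int16_t get_s16(uint64_t word, int index)
{
    assert(index >= 0 && index < 4);
    return (int16_t)(uint16_t)(word >> (16 * index));
}
```

For example, a pixel with the four components 0x11, 0x22, 0x33, and 0x44 is held as the single word 0x11223344, so one 32-bit load fetches all four components.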
4.2.2 Fixed Data Formats
Fixed data format is either four 16-bit fixed-point components or two 32-bit fixed-
point components within a 64-bit word. For example, filtering or simple image compu-
tations on pixel values are implemented in this format. The fixed data format is also
called intermediate format since it is mainly used for intermediate results to support ad-
ditional precision or dynamic range during data processing.
4.3 VIS Instructions
VIS includes additional functions that are not common in other extensions, such as
motion estimation and 64-byte block load/store. These features distinguish VIS
from other multimedia extensions. Table 4.1 lists all the VIS instructions, and a
description of each instruction is provided according to its functionality.
Categories      Instructions        Description
Conversion      vis_fexpand         Four 8-bit to 16-bit expand
                vis_fpackfix        Two 32-bit to 16-bit fixed pack
                vis_fpack[16,32]    Four 16-bit/two 32-bit pixel pack
                vis_fmerge          Two 32-bit pixel to 64-bit pixel merge
Arithmetic      vis_fpadd[16,32]    Four 16-bit/two 32-bit partitioned add (single precision)
                vis_fpsub[16,32]    Four 16-bit/two 32-bit partitioned subtract (single precision)
                vis_fmul8sux16      Signed upper 8- x 16-bit partitioned product
                vis_fmul8ulx16      Unsigned lower 8- x 16-bit partitioned product
                vis_fmul8x16        8- x 16-bit partitioned product of corresponding components
                vis_fmul8x16al      8- x 16-bit lower partitioned product of 4 components
                vis_fmul8x16au      8- x 16-bit upper partitioned product of 4 components
                vis_fmuld8sux16     Signed upper 8- x 16-bit multiply to 32-bit partitioned product
                vis_fmuld8ulx16     Unsigned lower 8- x 16-bit multiply to 32-bit partitioned product
Logic           vis_fandnot1[s]     Negated src1 AND src2 (single precision)
                vis_fandnot2[s]     src1 AND negated src2 (single precision)
                vis_fnand[s]        Logical NAND (single precision)
                vis_fneg(s,d,q)     Floating-point negate
                vis_fnor[s]         Logical NOR (single precision)
                vis_fnot1[s]        Negate (1's complement) src1 (single precision)
                vis_fnot2[s]        Negate (1's complement) src2 (single precision)
                vis_fone[s]         One fill (single precision)
                vis_fornot1[s]      Negated src1 OR src2 (single precision)
                vis_fornot2[s]      src1 OR negated src2 (single precision)
                vis_for[s]          Logical OR (single precision)
                vis_fxnor           Logical XNOR (single precision)
                vis_fxor            Logical XOR (single precision)
                vis_fzero           Zero fill (single precision)
Address         vis_alignaddrl      Calculate address for misaligned data access (little-endian)
Manipulation    vis_array[8,16,32]  3-D address to blocked byte address conversion
                vis_faligndata      Perform data alignment for misaligned data
                vis_alignaddr       Calculate address for misaligned data access
Memory          vis_pst[8,16,32]    Eight 8-bit/four 16-bit/two 32-bit partial stores
Access          ldda                128-bit atomic load
                lddfa               Zero-extending 8-/16-bit load to a double-precision FP register
                stdfa               8-/16-bit store from a double-precision FP register
                bld                 64-byte block load
                bst                 64-byte block store
Compare         vis_fcmpeq[16,32]   Four 16-bit/two 32-bit compare; set integer dest if src1 = src2
                vis_fcmpgt[16,32]   Four 16-bit/two 32-bit compare; set integer dest if src1 > src2
                vis_fcmple[16,32]   Four 16-bit/two 32-bit compare; set integer dest if src1 <= src2
                vis_fcmpne[16,32]   Four 16-bit/two 32-bit compare; set integer dest if src1 != src2
Distance        vis_pdist           Distance between eight 8-bit components
Others          vis_fsrc1[s]        Copy src1 (single precision)
                vis_fsrc2[s]        Copy src2 (single precision)
                shutdown            Power-down support
Table 4.1: VIS Instruction Set [14]
4.3.1 Conversion Instructions
The conversion instructions change the data format among various types,
e.g., conversion from a partitioned data format to a fixed data format.
•  Expand
Expand is used at the beginning of an arithmetic operation to convert 8-bit data
into the 16-bit intermediate data format to preserve precision and dynamic range. For
example, 3-D rendering requires smooth interpolation between the maximum and the
minimum intensities. This operation requires at least a 16-bit data format to retain
sufficient precision during the interpolation.
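[Figure: a 32-bit data_4_8 word of four 8-bit components is expanded into a 64-bit result_4_16 word of four 16-bit components (bits 63-48, 47-32, 31-16, 15-0); each 8-bit data_4_8 component (bits 7-0) is placed inside its 16-bit result_4_16 component, and the remaining bits are filled with zeros]
Figure 4.10: Conversion for Expand [11]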
Figure 4.10 illustrates vis_fexpand(), which converts four unsigned 8-bit elements into
four 16-bit fixed elements by left-shifting each 8-bit value by 4 to provide additional
resolution and dynamic range, and zero-extending the result to a 16-bit fixed value to
prevent overflow.
•  Pack
The data expanded for precision have to be converted back when they are stored into
memory. The vis_fpack16/32() instructions convert the 16- or 32-bit intermediate fixed
format to a lower-precision 8- or 16-bit data format to restore the original format, e.g.,
8-bit for pixels and 16-bit for audio data.
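[Figure: a 64-bit data_4_16 word of four 16-bit components (bits 63-48, 47-32, 31-16, 15-0) is packed into a 32-bit result_4_8 word of four 8-bit components (bits 31-24, 23-16, 15-8, 7-0), under control of the 4-bit GSR.scale_factor field (shown as 1010)]
Figure 4.11: Conversion for Pack [11]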
For example, vis_fpack16() in Figure 4.11 takes four 16-bit fixed components in
data_4_16, and scales, truncates, and clips them into four 8-bit unsigned components.
This is accomplished by left-shifting each 16-bit component by the number of bits
specified in the scale_factor field of the GSR, discarding the least significant bits,
and clipping the remaining value to the range of an 8-bit unsigned integer. The
scale_factor field of the GSR allows control of the range and precision of the data.
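The expand and pack conversions described above can be sketched in plain C as follows. This is a software illustration, not the VIS intrinsics themselves: the names expand_4_u8 and pack16_4 are hypothetical, and the position of the binary point in the pack step (the integer part starting at bit 7 after the scale_factor shift) is an assumption chosen so that a scale factor of 3 exactly undoes the 4-bit shift of expand.

```c
#include <assert.h>
#include <stdint.h>

/* Expand: four unsigned 8-bit pixels -> four 16-bit fixed values.
 * Each pixel is shifted left by 4 and zero-extended to 16 bits. */
uint64_t expand_4_u8(uint32_t pixels)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t v = (uint16_t)(((pixels >> (8 * lane)) & 0xFFu) << 4);
        out |= (uint64_t)v << (16 * lane);
    }
    return out;
}

/* Pack: four signed 16-bit fixed values -> four unsigned 8-bit pixels.
 * Assumption: after shifting left by scale_factor, the low 7 bits are
 * the discarded fraction; the rest is clipped to 0..255. */
uint32_t pack16_4(uint64_t fixed, int scale_factor)
{
    assert(scale_factor >= 0 && scale_factor <= 15);
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        int32_t v = (int16_t)(uint16_t)(fixed >> (16 * lane));
        /* negative values clip to 0; otherwise scale and truncate */
        int32_t scaled = (v < 0) ? 0 : ((v << scale_factor) >> 7);
        if (scaled > 255)
            scaled = 255;               /* clip to the 8-bit range */
        out |= (uint32_t)scaled << (8 * lane);
    }
    return out;
}
```

Under these assumptions, pack16_4(expand_4_u8(p), 3) returns p for any pixel word p, while out-of-range intermediate values are clipped rather than wrapped.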
There are also other conversion instructions: vis_fpackfix() and vis_fmerge().
vis_fpackfix() converts two 32-bit partitioned data into two 16-bit partitioned data.
vis_fmerge() merges the contents of two registers into one by combining the subwords
from each register in an alternating fashion.
4.3.2 Arithmetic Instruction
The VIS arithmetic instructions perform partitioned addition, subtraction, or
multiplication.
•  Add and Subtraction
These instructions perform addition and subtraction on four 16-bit or two 32-bit
partitioned data, as listed in Table 4.1. They are applied to double-precision or
single-precision operands with vis_fpadd() and vis_fpsub().
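[Figure: data1_4_16 and data2_4_16 are combined component by component into sum_4_16 or difference_4_16; each word holds four 16-bit components (bits 63-48, 47-32, 31-16, 15-0)]
Figure 4.12: Add and Subtraction [11]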
Figure 4.12 illustrates the add and subtraction operations. vis_fpadd16() and
vis_fpsub16() perform partitioned addition and subtraction on 64-bit partitioned
variables. Overflow and underflow are not detected, and results wrap around because
no saturation is applied.
•  Multiplication
These instructions multiply 8-bit partitioned data by 16-bit partitioned data
and produce a 16-bit partitioned vis_d64 result. Multiplication instructions support
various types of the partitioned data format and can be used in pixel manipulation
and filtering operations.
Generally, the multiplicand is fixed as a pixel and the multiplier works as a scale factor,
i.e., vis_fmul computes pixel x scale factor. Therefore, an 8-bit pixel can be scaled by
various factors through multiplication, and the results form new pixel values that differ
from the original ones.
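[Figure: four 8-bit pixels (a 32-bit word) are multiplied by four 16-bit scale components (a 64-bit word); the most significant bits (MSB) of each product form one 16-bit component of the 64-bit result (bits 63-48, 47-32, 31-16, 15-0)]
Figure 4.13: Multiplication [11]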
Figure 4.13 shows the scaling of four 8-bit partitioned data by a 16-bit partitioned
scaling factor using the vis_fmul8x16() instruction, i.e., every pixel element is
multiplied by the corresponding scale factor.
vis_fmul8x16au() and vis_fmul8x16al() multiply each of the four 8-bit data by the
upper 16-bit fixed-point component and the lower 16-bit fixed-point component of a
32-bit word, respectively. The resulting 24-bit values are truncated to 16-bit data. The
other multiplication instructions are carried out in a similar manner.
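The partitioned 8 x 16-bit multiplication can be modeled per component as shown below. This is a sketch under stated assumptions: the function names are hypothetical, and the 24-bit product is rounded by adding 0x80 before the upper 16 bits are kept, which mirrors the "+ 0x80" term of the C blending expression in Chapter 5 rather than a statement taken from this section.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 256 that does not rely on shifting negative values. */
static int32_t floor_div256(int32_t v)
{
    return (v >= 0) ? (v >> 8) : -((-v + 255) >> 8);
}

/* One component: unsigned 8-bit pixel times signed 16-bit scale.
 * Assumption: the 24-bit product is rounded, then its upper 16 bits kept. */
int16_t mul8x16_lane(uint8_t pixel, int16_t scale)
{
    int32_t product = (int32_t)pixel * scale;
    assert(product >= -(1 << 23) && product < (1 << 23)); /* fits in 24 bits */
    return (int16_t)floor_div256(product + 0x80);
}

/* Four pixels times four corresponding scale components (vis_fmul8x16 style). */
uint64_t mul8x16(uint32_t pixels, uint64_t scales)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t p = (uint8_t)(pixels >> (8 * lane));
        int16_t s = (int16_t)(uint16_t)(scales >> (16 * lane));
        out |= (uint64_t)(uint16_t)mul8x16_lane(p, s) << (16 * lane);
    }
    return out;
}

/* Four pixels times one shared scale component (au/al style). */
uint64_t mul8x16_shared(uint32_t pixels, int16_t scale)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t p = (uint8_t)(pixels >> (8 * lane));
        out |= (uint64_t)(uint16_t)mul8x16_lane(p, scale) << (16 * lane);
    }
    return out;
}
```

With a scale component of 0x0100 (1.0 when the binary point is placed above bit 7), each pixel value is returned unchanged, which gives a convenient sanity check for the fixed-point convention.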
4.3.3 Logical Instructions
These instructions include logical operations involving zero, one, or two operands.
They can be classified into three groups based on their functions:
•  Set a variable to all ones or clear a variable to zero: vis_fone(), vis_fzero(),
respectively.
•  Copy a value or its complement: vis_fsrc(), vis_fnot().
•  Perform logical operations between two 32-bit or two vis_d64 partitioned
variables: vis_f[or, and, xor, nor, nand, xnor, ornot, andnot]().
4.3.4 Address Manipulation Instructions
These instructions are used to access a specific position in memory. They are used
to load and store pixel data that are not located exactly on a 64-bit boundary. These
instructions include align instructions, edge handling instructions, and array handling
instructions.
•  Align Instructions
vis_alignaddr() and vis_faligndata() are used for the align operation, which allows the
processor to access data starting at a specific byte within a 64-bit word.
[Figure: the start address 0x1003 is obtained as da + offset = 0x1005 + (-2); the shaded area marks the 8 bytes loaded from that address]
Figure 4.14: Align Instruction [11]
Figure 4.14 illustrates the calculation of the memory address and the loading of 64-bit
data using the vis_alignaddr(da, offset) and vis_faligndata(data_hi, data_lo)
instructions. Suppose the data pointer da points to the initial location of a 64-bit
memory load, but the processor wants to access the data starting at location ds.
vis_alignaddr(da, offset) sets the effective memory location to ds by adding the offset
value to da. Then, vis_faligndata(data_hi, data_lo) extracts the 8 bytes of data that
start at location ds, as indicated by the shaded area; i.e., in Figure 4.14 the processor
loads 8 bytes of data starting at memory location 0x1003 into an FP register.
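The two-step alignment can be sketched as follows. This is an illustrative model, not the VIS library: the names alignaddr, faligndata, and gsr_align are hypothetical, addresses are plain integers rather than pointers, and the byte order assumes that data_hi holds the lower addresses, as on a big-endian machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulated alignment offset; on the real processor this is kept in the GSR. */
static uint32_t gsr_align;

/* Return the 8-byte aligned address at or below (addr + offset) and
 * remember the byte position of the requested address inside that block. */
uint32_t alignaddr(uint32_t addr, int32_t offset)
{
    uint32_t target = addr + (uint32_t)offset;
    gsr_align = target & 7u;
    return target & ~7u;
}

/* Extract 8 bytes starting at the remembered byte position from the
 * 16 bytes formed by two consecutive aligned 8-byte blocks. */
void faligndata(const uint8_t hi[8], const uint8_t lo[8], uint8_t out[8])
{
    uint8_t both[16];
    assert(gsr_align < 8u);
    memcpy(both, hi, 8);
    memcpy(both + 8, lo, 8);
    memcpy(out, both + gsr_align, 8);
}
```

With da = 0x1005 and offset = -2, alignaddr returns 0x1000 and records a byte position of 3, so the following faligndata call yields bytes 3 through 10 of the two aligned blocks, i.e., the 8 bytes that start at 0x1003.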
•  Edge Instructions
vis_edge8(), vis_edge16(), and vis_edge32() compute a mask that identifies which 8-,
16-, or 32-bit components of a vis_d64 variable are valid for writing to an 8-byte
aligned address. These instructions are used with the partial store instructions, vis_pst().
[Figure: 8 bytes of data (64 bits) at start address 0x1000 are combined with emask = [00011111] by vis_pst_8; the first 3 bytes are masked off and the remaining bytes are partially stored]
Figure 4.15: Edge Handling Instruction [11]
Figure 4.15 shows an example of a mask. Suppose the partial store is to be performed
at the address 0x1000, and the mask is given as [00011111]. This mask disables the
writes to the locations 0x1000, 0x1001, and 0x1002, and enables the writes to the
locations 0x1003 through 0x1007. The edge instructions are often used to make the
pixel boundaries smooth, i.e., anti-aliasing.
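The interaction between the edge mask and the partial store can be sketched as follows. This is a simplified model with hypothetical names (edge8, pst8); it assumes that the most significant mask bit guards the lowest byte address, which is consistent with the [00011111] example above, and it represents addresses as plain integers.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for the 8-byte block that contains `start`. Bytes before `start`
 * are disabled; bytes after `end` are disabled only when `end` lies in
 * the same block. Bit 7 of the mask guards the lowest address. */
uint8_t edge8(uint32_t start, uint32_t end)
{
    assert(start <= end);
    uint8_t left  = (uint8_t)(0xFFu >> (start & 7u));
    uint8_t right = (uint8_t)(0xFFu << (7u - (end & 7u)));
    if ((start & ~7u) == (end & ~7u))
        return (uint8_t)(left & right);
    return left;
}

/* Partial store: write only the bytes whose mask bit is set. */
void pst8(const uint8_t data[8], uint8_t block[8], uint8_t mask)
{
    for (int i = 0; i < 8; i++) {
        if (mask & (0x80u >> i))
            block[i] = data[i];
    }
}
```

For a line that starts at 0x1003 and extends past the block, edge8 returns 0x1F, and pst8 then leaves the bytes at 0x1000 through 0x1002 untouched while writing the other five.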
•  Array Instructions
Array instructions provide efficient access to 3-dimensional data sets. These
instructions enhance 3-D image handling operations by converting a 3-dimensional
fixed-point address contained in the integer source register to a blocked-byte address
in the integer destination register.
[Figure: (a) One 3-D Volume; (b) n pieces of 2-D Planar Images]
(c) 3-D Array Fixed-Point Address Format: Z integer (bits 63-55), Z fraction (bits 54-44), Y integer (bits 43-33), Y fraction (bits 32-22), X integer (bits 21-11), X fraction (bits 10-0)
(d) 3-D Array Blocked Address Format: upper, middle, and lower groups, each containing fields derived from the Z, Y, and X coordinates
Figure 4.16: Array Instruction [11]
In Figure 4.16, the data structure of (a) corresponds to the format in (c), and (b) corre-
sponds to the format in (d). The 3-D cube is grouped by several upper, middle, and
lower planar images. The array operation transforms a cube into many 2-D planar
images, e.g., when a doctor wants to closely examine a 3-D tumor image, the array
operation breaks the 3-D tumor image into multiple slices of 2-D images.
4.3.5 Pixel Comparison Instructions
The vis_fcmp() instructions perform logical comparison operations between two sets of
pixels. Four basic comparisons are performed: equal, not equal, less than or equal, and
greater than. The comparison operations generate a 2- or 4-bit mask, i.e., a 2-bit mask
results from the comparison of two pairs of 32-bit pixels and a 4-bit mask from the
comparison of four pairs of 16-bit pixels. This is an important operation for 3-D rendering
when constructing objects in a graphics application. For example, if an object A is hidden
behind another object B, the object A should not be seen, and the Z-buffering algorithm
determines the hidden surface of the object A. This hidden-surface elimination is done
with the compare instructions.
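A partitioned comparison and its use for a Z-buffer style selection can be sketched as follows. The names cmpgt16 and select16 are hypothetical, and the assignment of mask bit 0 to the least significant 16-bit component is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Extract signed 16-bit component `lane` (0 = bits 15..0). */
static int16_t lane16(uint64_t word, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (int16_t)(uint16_t)(word >> (16 * lane));
}

/* 4-bit mask: bit i is set when component i of a is greater than
 * component i of b. Assumption: bit 0 maps to bits 15..0. */
unsigned cmpgt16(uint64_t a, uint64_t b)
{
    unsigned mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (lane16(a, lane) > lane16(b, lane))
            mask |= 1u << lane;
    }
    return mask;
}

/* Take components from new_px where the mask bit is set, else from old_px. */
uint64_t select16(uint64_t old_px, uint64_t new_px, unsigned mask)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint64_t src = (mask & (1u << lane)) ? new_px : old_px;
        out |= src & (0xFFFFULL << (16 * lane));
    }
    return out;
}
```

In a hidden-surface test, the mask would be computed from the depth values of the old and new pixels, and the same mask would then decide which components are updated.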
4.3.6 Memory Access Instructions
In VIS, the memory access instructions are among the most important features for
accelerating media applications. Accessing memory in various ways makes a processor
more flexible in transferring data. These instructions can be divided into three kinds
based on how much data is transferred and which register is used for the data: partial
store, short floating-point load and store, and block load and store instructions.
•  Partial Store Instructions
vis_pst() is used for partially storing a result in memory; it can store any number of
bytes from a register without disturbing adjacent data. It stores 8-, 16-, or 32-bit
components of a result under the control of a mask, which is typically determined by
the edge instructions. Typical uses include writing only selected channels of a
multi-channel image. As shown in Figure 4.15, vis_pst_8 stores 5 bytes of the 8-byte
data to memory using the mask operation.
•  Short Floating-point Load and Store Instructions
vis_ld_u() and vis_st_u() perform 8- and 16-bit loads and stores between FP registers
and memory. The vis_ld_u() instruction reads 8 or 16 bits of data from a specified
memory location, and vis_st_u() stores 8 or 16 bits of data into a specified memory
location. These instructions are only effective when 8- or 16-bit data sizes are to be
read or stored in a multimedia application.
•  Block Load and Store Instructions
The lddfa and stdfa instructions handle the transfer of a 64-byte block of data. They
are very useful in applications that require the movement of a large amount of data.
These instructions allow direct block transfer of 64 bytes of data between memory
and the FP register file without going through the caches.
[Figure: a block of 64 bytes is loaded and stored directly between the FP Register File in the processor and memory]
Figure 4.17: Block Load and Store [17]
Figure 4.17 depicts the block load/store operations, which help multimedia applications
in two ways: (1) In multimedia processing, handling a large amount of data is inevitable.
If every data access has to go through the caches, the data transfer takes a long time;
block transfers avoid this overhead. (2) Processors can move image data directly from main
memory to the screen at a fast rate, thus reducing screen flicker.
4.3.7 Motion Estimation Instruction
vis_pdist() computes the absolute values of the differences between two sets of
pixels, i.e., between eight pairs of vis_u8 components. It takes three double-precision
arguments, pixels1, pixels2, and accum, in the form vis_pdist(pixels1, pixels2, accum).
The pixels are subtracted from one another pair-wise, and the absolute values of the
differences are accumulated into accum. This instruction is intended to accelerate motion
estimation in support of real-time video compression in applications such as H.320 video
conferencing.
4.4 Other Media ISA Extensions
Many vendors have already implemented multimedia extensions on their microprocessor
architectures. Table 4.2 summarizes the architectural features and characteristics of
such extensions.
Characteristics                          MDMX          MMX       VIS          MAX
Architecture                             MIPS V        Pentium   UltraSPARC   PA-RISC2
Extension Features    Operands           3-4           2         3            3
                      Data Path          32(FP)        8(FP)     32(FP)       31(Int.)
Integer Storage Size  8*8-bit            Yes           Yes       Yes          No
                      4*16-bit           Yes           Yes       Yes          Yes
                      2*32-bit           Yes           Yes       Yes          No
Subword Handling      8*8-bit            8 or 24-bit   Yes       Yes          No
                      4*16-bit           16 or 48-bit  Yes       Yes          Yes
                      2*32-bit           No            Yes       Yes          No
Arithmetic            Add/Subtract       Yes           Yes       Yes          Yes
                      Multiply           Yes           Yes       Yes          Yes
                      Multiply&Add       Yes           No        No           No
                      Accumulation       Yes           No        No           No
                      Saturation         Yes           Yes       Pack only    Yes
Logical Instructions                     Yes           Yes       Yes          Yes
Data Conversion       Expand/Pack        Yes           Yes       Yes          No
                      Merge              Yes           Yes       Yes          Yes
                      Permute            8bit+16bit    No        Interleave   Yes
Compare                                  Yes           Yes       Yes          Yes
Others                Memory Access      No            No        Yes          No
                      Shift/Round        Yes           Yes       No           No
                      Partial Store      No            No        Yes          No
                      Address Handling   No            No        Yes          No
                      Motion Estimation  No            No        Yes          No
Table 4.2: Extensions Comparison
•  MIPS Digital Media Extension (MDMX): MDMX is a multimedia extension implemented
in the MIPS-V architecture of MIPS Technologies Inc.
•  Multi Media Extension (MMX): MMX is a multimedia extension in Pentium, a
trademark of Intel Corporation.
•  Visual Instruction Set (VIS): VIS is a multimedia extension in UltraSPARC, a
trademark of Sun Microsystems, Inc.
•  Multimedia Acceleration eXtension (MAX): MAX is a multimedia extension in the
PA-RISC2, a trademark of Hewlett Packard, Inc.
As an example, this section discusses the hardware features of Intel's MMX and the
instructions it supports. Intel introduced the Pentium with MMX technology in 1997 as a
32-bit, 2-way issue superscalar CISC processor with additional support for fixed-point
and floating-point arithmetic for multimedia processing. This new multimedia extension
is called MMX, and it enhances multimedia performance.
The most visible architectural changes in MMX are deeper pipelines, larger caches,
improved branch prediction, and 57 new MMX instructions. This new processor is de-
veloped for use in DSP, image, and video processing applications. Figure 4.18 shows the
new pipelining structure to support MMX technology. MMX pipeline is added to the two
integer pipelines. Two integer pipelines support the execution of two simultaneous inte-
ger instructions.
[Figure: pipeline stages of the MMX pipe, the main pipe, and the two integer pipes]
Figure 4.18: MMX Technology Pipeline Structure [18]
Another important hardware support for MMX extensions is the MMX Register Set,
composed of eight 64-bit registers. Figure 4.19 shows the MMX Register Set. The
MMX instructions operate on these specially designated registers. MMX registers
MM0-MM7 are aliased onto the 64-bit mantissas of the floating-point registers; that is,
the Intel Architecture (IA) uses the existing floating-point registers as MMX registers.
This register organization provides backward compatibility with the existing IA. The
FP Tag is provided to control the resource conflicts caused by sharing the
floating-point registers.
Figure 4.19: MMX Register Set [18]
All MMX instructions can be grouped into seven categories [18]: Arithmetic In-
structions, Comparison Instructions, Conversion Instructions, Logical Instructions, Shift
Instructions, Data Transfer Instructions, and Empty MMX State (EMMS) Instruction. In
particular, the Empty MMX State (EMMS) instruction is used to empty the MMX state.
The EMMS instruction must be used to clear the MMX state (e.g., empty the
floating-point tag word in Figure 4.20) after MMX instructions complete because the MMX
registers are shared with the floating-point registers. Otherwise, a conflict can occur if
a floating-point instruction is executed before the state is cleared.
5 IMPLEMENTATION OF MULTIMEDIA FUNCTION
This chapter introduces the algorithms that are functionally embedded in current
multimedia applications, such as image processing, audio/video applications, and graph-
ics editing packages. To see what advantages the multimedia extensions provide, some
multimedia operations are studied in detail. We examine two such operations, namely
Alpha Blending and Motion Estimation, implemented in both 'C' and 'VIS'. By analyz-
ing both high-level and assembly-level codes, the advantages of using VIS extensions are
characterized in terms of hardware resources utilization and the number of instructions
required.
5.1 Alpha Blending
Alpha blending is one of the standard image processing techniques; it combines two
individual images into a blended translucent image, and it is often used in pre-press,
image painting/editing software packages, and digital post-production.
Alpha blending creates a blended image from IMAGE1 and IMAGE2 according to
the transparency assigned by alpha as follows:
BLENDED IMAGE = alpha x IMAGE1 + (255 - alpha) x IMAGE2.
IMAGE1 and IMAGE2 are 2-dimensional planar images stored in a 3-dimensional
array, IMG[X][Y][Z] (see Figure 5.1). The X and Y values represent the
horizontal and vertical size of the image, respectively. Z is the color band factor that
defines the color scheme for the image, i.e., gray scale or color. When the value of Z is
3, the image is made up of three colors: Red, Green, and Blue. On the other hand, when
Z is given the value of 1, the image is black and white, containing
256 different gray levels. Figure 5.1 shows a conceptual representation of 512 x 512
image data.
[Figure: (a) 512*512 Image Size; (b) Color Band Factor Z]
Figure 5.1: Image Size determined by X and Y, and Color Band by Z
The alpha value is first multiplied by each pixel in IMAGE1, and the difference between
the maximum alpha value and the alpha value is multiplied by each pixel in IMAGE2. Then,
the two scaled images are added together into a blended image.
In general, the image color is represented by 32 bits per pixel in RGB model since
each of Red, Green, Blue, and alpha has 8 bits. The RGB maps to the color of each pixel
and the alpha to its opacity. The alpha value controls how the new color is combined
from the original colors so that both colors "show through". The value of alpha
usually ranges from 0 to 255, where 0 corresponds to absolute transparency and 255 to
absolute opacity. Therefore, transparent or translucent objects have lower opacity values
than opaque objects.
Figure 5.2 shows an example of an Alpha-blended image. Assume that the images of a
coffee in (a) and a pen in (b) are the source images IMAGE1 and IMAGE2, respectively;
the Alpha blending process then creates the BLENDED IMAGE shown in (c). The alpha
value was set to 128 for this example.
Figure 5.2: Blending Application
5.1.1 Implementation and Code Comparison
This section implements the Alpha blending in C and VIS, and compares their high-
and assembly-level codes in order to study the characteristics of VIS extension. In high-
level VIS implementation, we examine how VIS instructions are used in Alpha blending.
Also, we analyze assembly implementation to study how VIS allows better utilization of
the hardware resources such as register and memory.
Figure 5.3 illustrates the overall algorithm implemented in VIS and C. In the C im-
plementation, Alpha blending is performed pixel-by-pixel until all pixels are manipulated
according to the blending equation. On the other hand, VIS implementation operates on
8 pixels at a time, taking a SIMD approach to Alpha blending.
[Figure: flow chart. From START, a SIMD decision selects the path. NO leads to the C implementation: loops over horizontal pixel (j), vertical pixel (i), and color band (k), followed by blending. YES leads to the VIS implementation: loop over horizontal pixel (j), SIMD preparation, loop over vertical pixel (i), load of 8-pixel source and alpha data, VIS blending, and store. Both paths end at STOP.]
Figure 5.3: Flow Chart of Alpha Blending in VIS and C.
5.1.2 Alpha Blending in C
Figure 5.4 shows Alpha blending in C, named 'Ablending_C', which operates on three
image data arrays and one alpha value array. Among the three image data arrays, two
are used for the source 1 and source 2 images, and the other is used for the blended image.
Ablending_C uses the unsigned 8-bit data type, vis_u8, to handle an 8-bit pixel of the image
data.
void vdk_c_blend_8iis(vis_u8 *src1, int slb1,
                      vis_u8 *src2, int slb2,
                      vis_u8 *alpha, int alb,
                      vis_u8 *dst, int dlb,
                      int w, int h, int bands)
{
  int i, j, k;        /* indices for x, y, and z in the main loop */
  vis_u8 *sa1, *sl1;  /* pointers for pixel and line of source 1 */
  vis_u8 *sa2, *sl2;  /* pointers for pixel and line of source 2 */
  vis_u8 *aa, *al;    /* pointers for pixel and line of alpha image */
  vis_u8 *da, *dl;    /* pointers for pixel and line of destination */

  sl1 = sa1 = src1;
  sl2 = sa2 = src2;
  al = aa = alpha;
  dl = da = dst;
  for (j = 0; j < h; j++) {
    for (i = 0; i < w; i++) {
      for (k = 0; k < bands; k++) {
        /* blend one 8-bit component with 12-bit fixed-point alpha */
        *da = (((((*aa) << 4) * (*sa1) + 0x80) >> 8)
             + (((0x0ff0 - ((*aa) << 4)) * (*sa2) + 0x80) >> 8)) >> 4;
        sa1++;
        sa2++;
        da++;
        aa++;
      }
    }
    /* advance all pointers to the start of the next line */
    sl1 = sa1 = sl1 + slb1;
    sl2 = sa2 = sl2 + slb2;
    al = aa = al + alb;
    dl = da = dl + dlb;
  }
}
Figure 5.4: Code of Ablending_C [11]
[Figure: (a) Number of Vertical Pixel (middle loop, i); (b) Number of Horizontal Pixel (outer loop, j); (c) Image Size by Inner and Outer Loop]
Figure 5.5: Image Construction by Loop Execution in C code
Figure 5.5 depicts how the blending operations are performed on each image pixel.
The algorithm consists of a triple-nested loop with the indices i, j, and k.
The outer-loop index j counts the number of horizontal pixel lines, called scanlines,
which make up a frame and determine the vertical resolution.
The middle-loop index i counts the number of pixels in each line, i.e., the
horizontal resolution.
Finally, the inner-loop index k determines the pixel size. The inner loop counts
the number of bands used in each pixel. In this code, the Z value is defined to be 1, i.e.,
a single band, thus pixels are displayed in gray scale. If three bands are defined, each
pixel is normally represented by a three-layered pixel with 24 bits per pixel. The actual
alpha blending operation is included in this loop.
During the execution of the loops, each pixel is accessed and a new pixel of the blended
image is created according to the blending equation below:
dst = alpha x src1 + (255 - alpha) x src2    (1)
da = (((((*aa) << 4) x (*sa1) + 0x80) >> 8)
    + (((0x0ff0 - ((*aa) << 4)) x (*sa2) + 0x80) >> 8)) >> 4    (2)
The Alpha blending equation (1) is translated into the C expression (2). Variables
dst, src1, src2, and alpha correspond to BLENDED IMAGE, IMAGE1, IMAGE2,
and the alpha coefficient value, respectively. In expression (2), alpha is shifted left by 4
to gain 4 additional low-order bits of precision and is then multiplied by the pixel of
source image 1. As a result, the pixel of image 1 is turned into 16-bit image data scaled by
the 8-bit alpha value. Next, the product is rounded at bit 7 by adding 0x80, and only the
upper bits remain valid after the 8-bit right-shift operation. The same procedure is then
applied to source image 2, except that the alpha value is replaced by its complement
(0x0ff0 minus the shifted alpha). Next, the two scaled pixels are added together. Finally,
a right shift by 4 bits causes the correct 8 bits to be stored in da as the blended pixel.
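The arithmetic of expression (2) can be checked on a single pixel with the following sketch. The function name blend_pixel and the use of the standard uint8_t type in place of vis_u8 are choices made for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 8-bit component following expression (2). */
uint8_t blend_pixel(uint8_t alpha, uint8_t src1, uint8_t src2)
{
    uint32_t a  = (uint32_t)alpha << 4;        /* 12-bit fixed-point alpha */
    uint32_t b  = 0x0ff0u - a;                 /* complement of alpha */
    uint32_t t1 = (a * src1 + 0x80u) >> 8;     /* rounded, scaled source 1 */
    uint32_t t2 = (b * src2 + 0x80u) >> 8;     /* rounded, scaled source 2 */
    uint32_t result = (t1 + t2) >> 4;          /* drop the 4 extra bits */
    assert(result <= 255u);
    return (uint8_t)result;
}
```

For alpha = 128 and two source pixels of value 100, the intermediate terms are 800 and 794, and the stored result is 99. Because of truncation in the fixed-point steps, a fully opaque pixel of value 255 is returned as 254 rather than 255.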
The data structure used in the C implementation of Alpha blending is very similar to
that of matrix multiplication due to the characteristics of image data composed of pixels.
The execution time of Alpha blending depends on the number of instructions for
expression (2), the number of pixels counted by the outer and middle loops, and the pixel
size determined by the inner loop. Therefore, the execution time complexity of the C
implementation is on the order of O(n^2), where n is the pixel size of the image.
5.1.3 Alpha blending in VIS
The VIS implementation of Alpha blending follows a different approach compared
to the C version. Figure 5.6 shows the Alpha blending algorithm implemented with the VIS
extension, named Ablending_VIS.
#include <stdlib.h>
#include "vis_types.h"
#include "vis_proto.h"
/***************************************************************/
#define BLENDIM \
  adh = vis_fexpand_hi(a); \
  adl = vis_fexpand_lo(a); \
  bdh = vis_fpsub16(ones, adh); \
  bdl = vis_fpsub16(ones, adl); \
  sf1h = vis_read_hi(sd1); \
  sf1l = vis_read_lo(sd1); \
  rd1h = vis_fmul8x16(sf1h, adh); \
  rd1l = vis_fmul8x16(sf1l, adl); \
  sf2h = vis_read_hi(sd2); \
  sf2l = vis_read_lo(sd2); \
  rd2h = vis_fmul8x16(sf2h, bdh); \
  rd2l = vis_fmul8x16(sf2l, bdl); \
  rdh = vis_fpadd16(rd1h, rd2h); \
  rdl = vis_fpadd16(rd1l, rd2l); \
  dd = vis_fpack16_to_hi(dd, rdh); \
  dd = vis_fpack16_to_lo(dd, rdl);
/***************************************************************/
void vdk_vis_blend_8_1_inLs(vis_u8 *src1, int slb1,
                            vis_u8 *src2, int slb2,
                            vis_u8 *alpha, int alb,
                            vis_u8 *dst, int dlb,
                            int w, int h)
{
  vis_u8 *sa1, *sa2;      /* start point in source images */
  vis_u8 *sl1, *sl2;      /* start point of a line in source images */
  vis_d64 *sp1, *sp2;     /* 8-byte aligned start point in source images */
  vis_u8 *da, *dend;      /* start and end points of a line in destination */
  vis_d64 *dp;            /* 8-byte aligned start point in destination */
  int off;                /* offset of address alignment in destination */
  int emask;              /* edge masks */
  vis_d64 sd1, s11, s10;  /* source data */
  vis_d64 sd2, s21, s20;
  vis_f32 sf1h, sf2h;
  vis_f32 sf1l, sf2l;
  vis_d64 dd;             /* destination data */
  vis_d64 rd1h, rd2h;     /* intermediate result */
  vis_d64 rd1l, rd2l;
  vis_d64 rdh, *rph = &rdh;
  vis_d64 rdl, *rpl = &rdl;
  vis_d64 ones = vis_to_double_dup(0x0ff00ff0);
  vis_u8 *aa;             /* start point in alpha image */
  vis_u8 *al;             /* start point of a line in alpha image */
  vis_d64 *ap;            /* 8-byte aligned start point in alpha image */
  vis_d64 a, a0, a1;      /* alpha data */
  vis_d64 adh, bdh;
  vis_d64 adl, bdl;
  int j;

  /* initialize GSR scale factor */
  vis_write_gsr(3 << 3);
  sl1 = sa1 = src1;
  sl2 = sa2 = src2;
  al = aa = alpha;
  da = dst;
  /* row loop */
  for (j = 0; j < h; j++) {
    /* prepare the destination address */
    dp = (vis_d64 *) ((vis_u32) da & (~7));
    off = (vis_u32) dp - (vis_u32) da;
    dend = da + w - 1;
    /* generate edge mask for the start point */
    emask = vis_edge8(da, dend);
    /* 8-pixel column loop */
    while ((vis_u32) dp <= (vis_u32) dend) {
      /* load 8 bytes of source data */
      sp1 = (vis_d64 *) vis_alignaddr(sa1, off);
      s10 = sp1[0];
      s11 = sp1[1];
      sd1 = vis_faligndata(s10, s11);
      sp2 = (vis_d64 *) vis_alignaddr(sa2, off);
      s20 = sp2[0];
      s21 = sp2[1];
      sd2 = vis_faligndata(s20, s21);
      /* load 8 bytes of alpha data */
      ap = (vis_d64 *) vis_alignaddr(aa, off);
      a0 = ap[0];
      a1 = ap[1];
      a = vis_faligndata(a0, a1);
      BLENDIM;
      /* store 8 bytes of result */
      vis_pst_8(dd, dp, emask);
      sa1 += 8;
      sa2 += 8;
      aa += 8;
      dp++;
      /* prepare edge mask for the end point */
      emask = vis_edge8((vis_u8 *) dp, dend);
    }
    /* advance all pointers to the start of the next line */
    sl1 = sa1 = sl1 + slb1;
    sl2 = sa2 = sl2 + slb2;
    al = aa = al + alb;
    da += dlb;
  }
}
Figure 5.6: Code of Ablending_VIS [11]
As can be seen, the algorithm shown in Figure 5.6 is quite different from the C
version. To take full advantage of the SIMD capability, several modifications are made
to the data types, the loop structure, and the instructions that handle memory access.
The 64-bit data type, vis_d64, is used to hold 64 bits of data for SIMD processing. This
enables the manipulation of up to 8 pixels at a time and provides a significant
performance enhancement. In addition, the 32-bit data type, vis_f32, is introduced to
provide intermediate data storage. The first step is to access the memory locations from
which the input data are loaded. Within the while loop in Figure 5.6, vis_alignaddr and
vis_faligndata provide aligned accesses to multiple data items so that they can be loaded
into the registers. Also, 16 VIS instructions, defined as BLENDIM above, create the
blended image and correspond to the blending equation.
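The effect of the vis_alignaddr/vis_faligndata pair can be illustrated in portable C. The following is only a sketch of the idea, not the VIS implementation: the helper name load8_unaligned is hypothetical, and it assumes that the 16 bytes starting at the rounded-down address are readable.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the VIS API): emulates, on plain bytes,
 * what vis_alignaddr() followed by vis_faligndata() achieves. */
void load8_unaligned(const uint8_t *p, uint8_t out[8])
{
    uintptr_t addr = (uintptr_t)p;
    /* alignaddr step: round the address down to an 8-byte boundary ... */
    const uint8_t *base = (const uint8_t *)(addr & ~(uintptr_t)7);
    /* ... and remember the byte offset (kept in the GSR on real hardware). */
    unsigned off = (unsigned)(addr & 7);
    uint8_t lo[8], hi[8];

    memcpy(lo, base, 8);      /* first aligned 8-byte load  (sp[0]) */
    memcpy(hi, base + 8, 8);  /* second aligned 8-byte load (sp[1]) */

    /* faligndata step: pick 8 consecutive bytes starting at 'off'
     * out of the 16 bytes formed by lo followed by hi. */
    for (unsigned i = 0; i < 8; i++) {
        unsigned k = off + i;
        out[i] = (k < 8) ? lo[k] : hi[k - 8];
    }
}
```

Two aligned loads plus one extraction therefore deliver 8 pixels that start at an arbitrary byte address, which is why each source and the alpha image need this three-step sequence inside the column loop.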
The C version operates sequentially on one pixel at a time, treating the image as a
two-dimensional array. This two-dimensional image array is actually stored in a flat
memory strip with rows arranged in sequential order, so traversing a column involves
accessing memory at locations offset by a stride equal to the row length. This layout is
one to which the SIMD technique can be applied effectively. The code starts by modifying
the inner loop.
[Figure 5.7 diagram: (a) Number of Horizontal Pixel (outer loop, j); (b) Number of
Vertical Pixel (8-pixel column loop), grouped by 8 pixels; (c) Image Size by Inner and
Outer Loop]
Figure 5.7: Image Construction by Loop Execution in VIS code
Figure 5.7 shows the VIS loop execution modified from the C version. The loop
organization has been changed so that the inner loop fetches 8 pixels in each iteration.
Therefore, the VIS code breaks the image down into fragments of 1x8 pixels. This means
that every 8 pixels form one block that serves as an input vector, which is an 8-column
matrix. The image is then calculated by operating on 8 pixels at a time. The inner loop
is reconfigured for the SIMD data block, and accordingly, the number of inner-loop
iterations is reduced by a factor of 8.
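The change in loop organization can be sketched in plain C. This is an illustrative sketch rather than the thesis code: a simple per-pixel average stands in for the blend, the function names are hypothetical, and the inner k-loop of the blocked version stands in for what a single SIMD operation would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form: one pixel per inner-loop iteration. Returns the iteration count. */
size_t add_row_scalar(const uint8_t *s1, const uint8_t *s2, uint8_t *d, size_t w)
{
    size_t iters = 0;
    for (size_t i = 0; i < w; i++) {
        d[i] = (uint8_t)((s1[i] + s2[i]) / 2);
        iters++;
    }
    return iters;
}

/* Blocked form: one 1x8 fragment per inner-loop iteration. */
size_t add_row_blocked(const uint8_t *s1, const uint8_t *s2, uint8_t *d, size_t w)
{
    size_t iters = 0;
    for (size_t i = 0; i < w; i += 8) {
        size_t n = (w - i < 8) ? (w - i) : 8;  /* the last fragment may be partial */
        for (size_t k = 0; k < n; k++)         /* stands in for one SIMD operation */
            d[i + k] = (uint8_t)((s1[i + k] + s2[i + k]) / 2);
        iters++;
    }
    return iters;
}
```

For a row of 20 pixels the scalar loop iterates 20 times and the blocked loop 3 times; for widths that are multiples of 8 the ratio is exactly 8.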
5.1.4 Assembly Code version of the Alpha blending
We have compared the C and VIS versions of Alpha blending. It is clear that the
advantage of VIS comes from its SIMD capability. However, it is difficult to see how
hardware resources are utilized in the high-level version of VIS; thus the assembly-level
implementation is studied.
alignaddr  %i4,%o7,%o0
ldd        [%o0],%f12
ldd        [%o0+8],%f14
faligndata %f12,%f14,%f10
fexpand    %f10,%f12
fexpand    %f11,%f14
fpsub16    %f0,%f12,%f16
fmul8x16   %f6,%f12,%f12
fpsub16    %f0,%f14,%f18
fmul8x16   %f7,%f14,%f14
fmul8x16   %f8,%f16,%f16
fmul8x16   %f9,%f18,%f18
fpadd16    %f12,%f16,%f16
fpadd16    %f14,%f18,%f12
fpack16    %f16,%f4
fpack16    %f12,%f5
stda       %f4,[%g1]%o5,192
Figure 5.8: Alpha Blending in VIS Assembly
Figure 5.8 shows BLENDIM in assembly code. Based on the assembly code, the
following illustrates the Alpha blending operation in three major steps: image data load,
multiplication of alpha value, and blending of image data.
Figure 5.9 explains the first step, image data load. The 8 pixels of image data are
first loaded into Ra and Rb from memory. For this, the effective address for a 64-bit load
is calculated by the alignaddr instruction based on the offset. Then, faligndata moves
the upper and lower four bytes of data from Ra and Rb into the destination register Rc.
Finally, the fexpand instruction converts each 8-bit alpha value of Rc into the 16-bit
partitioned format of Rd and Re. This operation is required to preserve the precision of
the alpha value for the subsequent multiplications with the image data.
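The widening performed by fexpand can be emulated in portable C. The sketch below assumes the behavior described in the VIS documentation as I understand it, namely that each unsigned 8-bit value is placed in a 16-bit field shifted left by 4 bits; the function name fexpand_emul is hypothetical.

```c
#include <stdint.h>

/* Portable emulation (assumed semantics): each unsigned 8-bit value becomes
 * a 16-bit fixed-point value, zero-extended and shifted left by 4 bits. */
void fexpand_emul(const uint8_t in[4], int16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)((uint16_t)in[i] << 4);
}
```

Under this assumption the value 255 expands to 4080 (0x0ff0), which matches the constant 0x0ff00ff0 loaded into the variable ones in Figure 5.6.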
[Figure 5.9 diagram: memory location with the start data pointer, source registers Ra
and Rb, the aligned result Rc, and (c) fexpand producing the 16-bit fields of Rd and Re]
Figure 5.9: Image Data Load and Expand
The next step, shown in Figure 5.10, is the calculation of the terms alpha*src1 and
(255 - alpha)*src2. The term alpha*src1 is calculated by the fmul8x16 instruction, which
multiplies the 16-bit partitioned alpha value with the source image data in Rs. On the
other hand, the term (255 - alpha)*src2 requires a subtraction before the multiplication
by the alpha value. This subtraction is performed by the fpsub16 instruction.
[Figure 5.10 diagram: (a) sub Rz, Rd, Rf; (b) fmul8x16 Rs, Rd, Rd; (c) fmul8x16 Rt, Rf, Rf]
Figure 5.10: Appending Alpha value to Image
The third step combines the two images that have been weighted by the alpha values. In
Figure 5.11, the two images are simply mixed by the fpadd16 instruction, creating a
blended image. After the blended image is created, it needs to be packed into 8-bit
pixels by the fpack16 instruction because the intermediate pixel data are 16 bits wide.
Finally, the blended image of source images 1 and 2 is stored in the destination image,
DIMG [X][Y][Z], by the stda instruction.
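The three steps can be summarized by a scalar reference in C that processes one pixel per iteration. This is a sketch only: the normalization by 255 and the rounding are assumptions made for the illustration (the VIS code uses fixed-point scaling instead), and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: dst = (alpha*src1 + (255 - alpha)*src2) / 255, rounded.
 * The division by 255 and the rounding are assumptions of this sketch. */
void blend_scalar(const uint8_t *src1, const uint8_t *src2,
                  const uint8_t *alpha, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned a  = alpha[i];                  /* step 1: load             */
        unsigned t1 = a * src1[i];               /* step 2: alpha * src1     */
        unsigned t2 = (255u - a) * src2[i];      /* step 2: (255-alpha)*src2 */
        dst[i] = (uint8_t)((t1 + t2 + 127u) / 255u);  /* step 3: add, narrow */
    }
}
```

With alpha = 255 the result equals src1, and with alpha = 0 it equals src2, which is a convenient sanity check for any SIMD version.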
[Figure 5.11 diagram: (a) add16 combining the two alpha-weighted images]
Figure 5.11: Blending of image data
5.2 Motion Estimation
When a moving picture of size 640 by 480 is composed of 30 frames per second
with 8-bit color depth, the size of a 1-second clip is about 7.3 Mb [3]. Therefore, it is
not practical to handle such a huge amount of data without reducing its size. To process
video data properly, various compression methods have been devised and are widely used
in video applications. One of the most efficient compression techniques in video
processing is motion estimation.
[Figure 5.12 diagram: compression pipeline; recoverable labels: Motion Vector &
Difference, Entropy, File]
Figure 5.12: Motion Estimation in Compression Algorithm
Figure 5.12 illustrates a video compression system based on Motion Estimation
(ME). Motion estimation is a technique that compresses video data across frames by
saving only the differences between frames. The frame differences are calculated between
two sequential frames. ME plays an important role in the overall compression system in
that it contributes the most to achieving a high compression rate.
To illustrate ME, Figure 5.13 shows the frames divided into macroblocks. A
macroblock is defined as a unit of motion estimation area composed of a number of
vertical and horizontal pixels (i x j), generally 16x16 pixels. A macroblock is of
interest if it resides within a search area in ME. The search area is assigned as a
portion of the frame where motion is most likely to happen.
[Figure 5.13 diagram: (a) Macroblock (Mb) of i pixels by j pixels; (b) Motion
Estimation, showing frame k and frame k+1, the search area, and the current
macroblocks Mb(X) and Mb(Y)]
Figure 5.13: Motion Estimation between Two Frames
Assume that the macroblock X in frame (k+1) has moved from the original location
to Y in frame (k), as in Figure 5.13 (b). By comparing the macroblock in frame (k)
with the one in frame (k+1), a difference is calculated that indicates the movement of
the macroblock. If the difference is close to zero, no movement of the macroblock is
detected. On the other hand, if the difference is greater than zero, the macroblock has
moved between the two frames. Therefore, the changes between two frames can be identified
based on the differences. Now, we examine and compare the C and VIS implementations of
the ME algorithm.
5.2.1 Implementation and Code Comparison
This section studies ME based on Sum of Absolute Difference (SAD) in consecutive
frames. SAD is mathematically defined as follows:
\[ \mathit{Diff} = \sum_{i=0} \sum_{j=0} \mathrm{abs}\bigl(MB1(i,j) - MB2(i,j)\bigr), \]
where MB1(i,j) and MB2(i,j) indicate the pixels in the macroblock of frame 1 and
frame 2, respectively. Diff is the sum of absolute differences calculated between
every pair of pixels inside the macroblock. Therefore, Diff represents the overall
difference between two macroblocks rather than between individual pixels.
[Figure 5.14 flowchart: START, then the SIMD decision. NO (C Implementation): loop over
Horizontal Pixel (j) and Vertical Pixel (i), SAD, Store. YES (VIS Implementation):
8-pixel unit: loop over Horizontal Pixel (j) and Vertical Pixel (i), Load 8-pixel in
frame(k), Load 8-pixel in frame(k+1), SAD in VIS, Store; then 1-pixel unit (left-over):
loop over Horizontal Pixel (j) and Vertical Pixel (i), SAD, Store. STOP]
Figure 5.14: Flow Chart of Motion Estimation in VIS and C
Figure 5.14 shows the flowchart of the ME algorithm in both C and VIS. The SIMD
condition determines whether the algorithm is implemented in C or in VIS. SAD execution
in a macroblock is restricted to the inside of the search area. The VIS algorithm is
further split into two areas: an 8-pixel area for SIMD operation and a remaining area
with fewer than 8 pixels that cannot be implemented in SIMD. In the C implementation, on
the other hand, the macroblock is processed pixel by pixel, according to the variables i
and j, which designate the vertical and horizontal positions within the MB.
5.2.2 C Implementation
The calculation of SAD involves finding the effective SAD area within a macroblock
and computing the SAD value.
[Figure 5.15 diagram: Frame (A*B macroblocks); Macroblock T (16*16 pixels), nx(i) = 16]
Figure 5.15: Macroblock and SAD Area
Figure 5.15 illustrates the relationship among the search area, the macroblock, and the
effective SAD area. Assume that a frame is composed of A x B macroblocks, and each
macroblock consists of 16x16 pixels. Given a search area in a frame, the effective SAD
area is the intersection of the search area and macroblock T, which is defined by nx
horizontal pixels and ny vertical pixels. The area of macroblock T outside the search
area is excluded from the SAD computation. To find the SAD area, the algorithm first
looks for the intersection area, and then the effective SAD area is determined.
/*
 * f1lb      # of bytes in one row of frame1 (= width)
 * f2lb      # of bytes in one row of frame2 (= width)
 * f1x, f1y  upper left corner of 16x16 block in frame1
 * f2x, f2y  upper left corner of 16x16 block in frame2
 * sx, sy    upper left corner of search area in frame1
 * sh, sw    height and width of search area in frame1
 */
/* find intersection of search area and 16x16 block */
x = max(sx, f1x);
nx = min(sx+sw, f1x+16) - x;
y = max(sy, f1y);
ny = min(sy+sh, f1y+16) - y;
/* 16x16 block is outside search area */
if (nx <= 0 || ny <= 0) return 0;
accum = 0;
sl1 = sa1; sl2 = sa2;
for (j = 0; j < ny; j++) {
    for (i = 0; i < nx; i++) {
        accum += abs(*sa1 - *sa2);
        sa1++; sa2++;
    }
    sl1 = sa1 = sl1 + f1lb;
    sl2 = sa2 = sl2 + f2lb;
}
return accum;
Figure 5.16: C code for Sum of Absolute Difference [11]
Figure 5.16 shows the C implementation of the SAD function used in motion estimation.
The overall algorithm is divided into two functional parts: finding the effective SAD
area and computing the SAD value.
5.2.2.1 Finding effective SAD area
The effective SAD area is determined within the search area (see Figure 5.15). Assume
that the very upper left corner of the search area is set to zero as the starting point
for nx and ny. The region outside the search area is excluded by the statement
"if (nx <= 0 || ny <= 0) return 0", because non-positive values of nx or ny indicate
that the MB is out of the search area.
5.2.2.2 Computation of SAD value
The SAD computation is performed only on the pixels inside the effective SAD
area. This area is traversed by the inner-loop variable i (0 <= i < nx) and the
outer-loop variable j (0 <= j < ny).
Every pixel value in the effective SAD area is used in the computation
accum += abs(*sa1 - *sa2) in each iteration. Note that sa1 and sa2 point to the
current pixel in MB1 and MB2, respectively.
The accumulated result is returned.
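The two parts described above can be put together in a self-contained C function. This is a sketch under stated assumptions, not the code of Figure 5.16: the function name and parameter order are hypothetical, the starting pointers are derived from the intersection corner (x, y), and the same offset inside the block is assumed to apply in frame2.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* frame1/frame2: 8-bit frames with row strides f1lb/f2lb in bytes.
 * (f1x,f1y), (f2x,f2y): upper left corners of the 16x16 blocks.
 * (sx,sy), sw, sh: search area in frame1. */
int sad_block(const uint8_t *frame1, int f1lb, int f1x, int f1y,
              const uint8_t *frame2, int f2lb, int f2x, int f2y,
              int sx, int sy, int sw, int sh)
{
    /* Part 1: intersection of the search area and the 16x16 block. */
    int x  = imax(sx, f1x);
    int nx = imin(sx + sw, f1x + 16) - x;
    int y  = imax(sy, f1y);
    int ny = imin(sy + sh, f1y + 16) - y;
    if (nx <= 0 || ny <= 0) return 0;   /* block is outside the search area */

    /* Assumption: the offset (x - f1x, y - f1y) also applies in frame2. */
    const uint8_t *sl1 = frame1 + (size_t)y * f1lb + x;
    const uint8_t *sl2 = frame2 + (size_t)(f2y + (y - f1y)) * f2lb
                                + (f2x + (x - f1x));

    /* Part 2: accumulate absolute differences over the nx-by-ny area. */
    int accum = 0;
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++)
            accum += abs(sl1[i] - sl2[i]);
        sl1 += f1lb;
        sl2 += f2lb;
    }
    return accum;
}
```

For two frames whose pixels differ by 3 everywhere, a block fully inside the search area gives 16 x 16 x 3 = 768, while a search area starting at (8, 8) leaves an 8x8 effective area and gives 192.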
5.2.3 VIS implementation
While the VIS implementation of the SAD algorithm is very similar to the C
implementation, it utilizes a special instruction, vis_pdist(vis_d64 data1, vis_d64 data2,
vis_d64 accu). This instruction takes two operands of eight pixels each, accumulates the
absolute differences between them, and stores the result in the designated 64-bit
floating-point register in a single cycle.
[Figure 5.17 diagram: (a) SAD area in C, a single block of width nx; (b) SAD area in VIS,
split into 8-pixel units and the left-over pixels]
Figure 5.17: Comparison VIS Algorithm with C
Figure 5.17 shows the VIS implementation model in contrast with that of C. Although
the size of the effective SAD area is the same for both VIS and C, the actual
computation of the SAD values is performed differently. The C implementation sees the
effective SAD area simply as a single block, as shown in Figure 5.17 (a), and the SAD
value is calculated one pixel at a time throughout the entire block. On the other hand,
the VIS implementation splits the effective SAD area into a number of 8-pixel-wide areas
and a remaining area. The 8-pixel-wide area utilizes the vis_pdist() instruction, which
can operate on 8 pixels at a time; thus it reduces the inner-loop iterations to 1/8 of
those of the C implementation. The SAD computation of the remaining area, however, must
be processed pixel by pixel as in the C implementation because the remaining width of nx
is less than 8 pixels.
if (nx <= 0 || ny <= 0) return 0;
nx8 = nx >> 3;   /* compute width in 8-byte units */
accum = vis_fzero();
sl1 = sa1; sl2 = sa2;
for (j = 0; j < ny; j++) {   /* row loop */
    for (i = 0; i < nx8; i++) {
        /* load 8 bytes of source data from frame1 */
        sp1 = (vis_d64 *) vis_alignaddr(sa1, 0);
        s10 = sp1[0];
        s11 = sp1[1];
        sd1 = vis_faligndata(s10, s11);
        /* load 8 bytes of source data from frame2 */
        sp2 = (vis_d64 *) vis_alignaddr(sa2, 0);
        s20 = sp2[0];
        s21 = sp2[1];
        sd2 = vis_faligndata(s20, s21);
        accum = vis_pdist(sd1, sd2, accum);
        sa1 += 8;
        sa2 += 8;
    }
    sl1 = sa1 = sl1 + f1lb;
    sl2 = sa2 = sl2 + f2lb;
}
/* process what's left over (nx%8) in plain c code */
sa1 = sl1 = frame1 + f1lb*f1y + f1x + nx8*8;
sa2 = sl2 = frame2 + f2lb*f2y + f2x + nx8*8;
nx -= (nx8*8);
if (nx) {
    for (j = 0; j < ny; j++) {
        for (i = 0; i < nx; i++) {
            accum += abs(*sa1 - *sa2);
            sa1++; sa2++;
        }
        sl1 = sa1 = sl1 + f1lb;
        sl2 = sa2 = sl2 + f2lb;
    }
}
result.d64 = accum;
Figure 5.18: VIS code for Sum of Absolute Difference [11]
Figure 5.18 shows the SAD function in VIS. Unlike the C implementation, the VIS
implementation utilizes a 64-bit data type, so 8 pixels can be handled at the same time.
Therefore, the inner-loop bound nx is replaced by nx8 so that the VIS instructions
operate on 8 pixels at a time. To handle 8 pixels in each iteration, the vis_alignaddr()
and vis_faligndata() instructions prepare the loading of 8 pixels from memory into a
register. For the SAD calculation, the absolute differences are calculated by the
vis_pdist() instruction, and the results are accumulated in the variable accum. This
process corresponds to accum += abs(*sa1 - *sa2) in Figure 5.16.
Next, SAD needs to be computed for the remaining portion. For this, the variable nx
is updated to nx = nx - (nx8*8) to cover the area narrower than 8 pixels, and the
calculation of SAD follows the same process as implemented in C.
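The split between the 8-pixel part and the left-over part can be checked with a portable emulation. The sketch below is illustrative only: pdist_emul is a hypothetical stand-in for vis_pdist(), and, unlike Figure 5.18, it handles the left-over pixels at the end of each row rather than in a separate pass over all rows.

```c
#include <stdint.h>
#include <stdlib.h>

/* Portable stand-in for vis_pdist(): adds the sum of absolute differences
 * of eight byte pairs to an accumulator. */
uint64_t pdist_emul(const uint8_t a[8], const uint8_t b[8], uint64_t accum)
{
    for (int k = 0; k < 8; k++)
        accum += (uint64_t)abs(a[k] - b[k]);
    return accum;
}

/* SAD of one row: nx8 blocks of 8 pixels, then the nx % 8 left-over pixels. */
uint64_t sad_row_split(const uint8_t *sa1, const uint8_t *sa2, int nx)
{
    int nx8 = nx >> 3;            /* width in 8-byte units */
    uint64_t accum = 0;
    for (int i = 0; i < nx8; i++) {
        accum = pdist_emul(sa1, sa2, accum);
        sa1 += 8;
        sa2 += 8;
    }
    for (int i = 0; i < nx - nx8 * 8; i++)   /* plain C for what is left over */
        accum += (uint64_t)abs(sa1[i] - sa2[i]);
    return accum;
}
```

For nx = 13 the row is covered by one 8-pixel step plus five scalar steps, and the result is identical to a purely scalar loop over all 13 pixels.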
5.2.4 Assembly Code Version of the Motion Estimation
This section examines the assembly-level implementations of both VIS and C that
perform the calculation of the SAD values.
(a) SAD in the C version
      ldub   [%g1],%o0
      ldub   [%i2],%o1
      sub    %o0,%o1,%l5
      or     %g0,%l5,%l6
      subcc  %g0,%l5,%g0
      ble    .L9
      add    %g1,1,%g1
      sub    %g0,%l5,%l6
.L9:  add    %i0,%l6,%i0
(b) SAD in vis_sumabsdiff.s
      alignaddr  %l4,%g0,%i0
      ldd        [%i0+8],%f2
      ldd        [%i0],%f0
      faligndata %f0,%f2,%f0
      alignaddr  %g5,%g0,%l4
      ldd        [%l4],%f2
      ldd        [%l4+8],%f4
      faligndata %f2,%f4,%f10
      fmovd      %f8,%f4
      pdist      %f0,%f10,%f4
Figure 5.19: SAD in Assembly Level
Figure 5.19 compares the C implementation (a) with the VIS implementation (b),
where both assembly listings implement the SAD equation, accum += abs(*sa1 - *sa2).
Note that C and VIS require 9 and 10 instructions, respectively, to implement the SAD
equation. In C, seven instructions are needed to compute accum for one pixel after the
two pixels are loaded from memory. VIS, on the other hand, computes accum with one pdist
instruction for 8 pixels, and the other 9 instructions support the pdist instruction.
The alignaddr and faligndata instructions in (b), together with the ldd loads, bring
8 pixels from memory into a floating-point register. Next, pdist operates on all 8
pixels in the floating-point registers to produce the SAD result; that is, it performs
accum += abs(*sa1 - *sa2) for 8 pixel pairs. As a result, this one instruction operates
on two 64-bit source operands at a time. The trade-off is that supporting instructions
such as alignaddr and faligndata are required to prepare the operands for vis_pdist().
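A rough instruction count follows from the numbers above. It is only a back-of-the-envelope estimate that ignores loop overhead, instruction latencies, and the left-over pixels:

```latex
\[
\underbrace{8 \times 9}_{\text{C, 8 pixels}} = 72
\qquad\text{versus}\qquad
\underbrace{10}_{\text{VIS, 8 pixels}},
\qquad \frac{72}{10} = 7.2
\]
```

That is, for every 8 pixels the C sequence executes about 7 times as many instructions as the VIS sequence.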
5.3 Overall Comparison of VIS and C implementation
The studies of the Alpha blending and Motion Estimation functions show that the main
difference between the VIS and C implementations lies in their loop structure. The
primary reason is that the VIS implementations utilize the SIMD capability provided by
various VIS instructions to improve performance when dealing with 64-bit data. The SIMD
capabilities of VIS also allow a 64-bit operation to be done in one cycle and reduce the
number of loop iterations.
6 PERFORMANCE ANALYSIS OF VIS
As an architectural enhancement that provides remarkable multimedia performance,
multimedia ISA extensions for microprocessors were introduced and discussed in the
previous chapters. To verify the performance gains of multimedia extensions, this chap-
ter analyzes the simulation results of several representative multimedia functions with
VIS and without VIS. Multimedia functions were written in both 'C' and 'VIS', and
their performance results were compared using a near-cycle-accurate simulator, INCAS.
The overall performances are compared, and the performance differences are analyzed in
terms of hardware support and advantages provided by VIS. The limitations of VIS are
also identified from the simulation results.
6.1 Simulation Environment
INCAS is an execution-driven simulator used for code performance prediction and
cycle counting, and it allows the processor status to be examined at each cycle. INCAS
stands for "It's a Near Cycle Accurate Simulator," and it models the UltraSPARC-I
processor [11]. This subsection discusses the features of INCAS.
6.1.1 Organization of INCAS
As mentioned above, INCAS closely models the UltraSPARC-I. Figure 6.1 shows
the structure of INCAS, which is composed of five modules corresponding to the CPU
internal structure: IEU, LSU, PDU, ECU, and SDB.
Figure 6.1: Organization of INCAS [19]
Integer Execution Unit (IEU)
This module implements the Integer Execution Unit (IEU) and the Floating Point Graphic
Unit (FGU), and displays the values of all floating-point and integer registers and the
contents of the pipeline for the purpose of debugging.
Load and Store Unit (LSU)
This module implements the Load and Store Unit, which includes the Data Memory
Management Unit (DMMU) and the data cache. It maintains the internal state of the
load/store buffer, accesses the contents of a data cache block, and translates a data
virtual address to a physical address.
Prefetch and Dispatch Unit (PDU)
This module implements the Prefetch and Dispatch Unit (PDU), including the Instruction
Memory Management Unit (IMMU) and the instruction cache. The PDU module accesses the
contents of the instruction cache, shows a description of the entries in the ITLB, and
displays the contents of the branch pipeline and the instruction buffer.
External Cache Unit (ECU)
The ECU module accesses the contents of the external cache and shows the status of the
ECU controller and the number of ECU dispatched requests.
Spitfire Data Buffer Unit (SDB)
This module implements the Data Buffer Unit module and displays the contents of
one or more FIFOs.
6.1.2 Capabilities of the Simulator
Fundamentally, the purpose of INCAS is to evaluate the performance of UltraS-
PARC-I processor as closely as possible. Therefore, the primary goal of designing IN-
CAS simulator was focused on precisely following the design specification of UltraS-
PARC-I so that the performance can be accurately analyzed. Thus, it is necessary to un-
derstand the capabilities and the limitations of INCAS in order to properly analyze the
result from the simulator.
6.1.2.1 Limitation of INCAS
INCAS follows the UltraSPARC-I architecture, including the instruction cache, the
data cache, and the external or level-2 cache. However, the interaction of the processor
with the system controller and the main memory is modeled less accurately, as shown in
Figure 6.2. Therefore, when working with large data sets, level-2 cache misses will cause
frequent interaction with the main memory, which in turn results in inaccurate INCAS
cycle counts that may be greater or less than those of a real UltraSPARC system.
[Figure 6.2 diagram: "INCAS accurate" region: UltraSPARC-1 processor with 16 Kbytes
Instruction Cache & 16 Kbyte Data Cache, and External 2nd Level Cache of 512 Kbytes to
4 Mbytes; "INCAS less accurate" region: System Controller and Main Memory]
Figure 6.2: INCAS Accuracy Model [11]
6.1.2.2 Capability of the Simulator
INCAS provides two major features: cycle counting and debugging. The cycle
counting feature allows users to analyze the performance of a program. After an
executable binary file is loaded, it runs until the execution reaches a breakpoint. To
obtain the current cycle count at this point, the command 'time' can be used to display
several statistics such as user time, system time, cycle count, instruction count, CPI,
and IPC. In addition, it allows users to gather statistics on intermediate results
within a program by setting a breakpoint address.
Also, by monitoring each of the five modules in INCAS, users can examine the processor
state during the execution of a program. The commands watch pipe, watch load, watch disp,
and watch done monitor the status of the pipeline, the loading of instructions, the
dispatching of instructions, and the completion of instructions, respectively. The five
modules are in charge of executing the watch functions according to the corresponding
processor module. However, INCAS does not allow users to change architectural
parameters such as cache size, memory bandwidth, and function units.
6.2 Applications
This section introduces the benchmark programs used in the performance analysis
with INCAS. These benchmark programs represent the common operations embedded in
current multimedia applications. Multimedia applications are typically classified into
four categories: Image, Graphics, Audio, and Video. The benchmark programs used in
the simulation studies are listed as follows:
Image & Graphics applications: Image Addition, Alpha blending, and Threshold
Audio application: Fir filter
Video application: Motion estimation
Note that Image and Graphics are grouped as one application area since the basic
techniques they use in media applications are similar.
6.2.1 Image & Graphics applications
6.2.1.1 Image Addition (IA)
This benchmark adds two images and creates a new image. The basic function equation
is as follows:
Dest = (src1 + src2) / 2
The above equation averages the 8-bit values of two corresponding pixels. Here src1
and src2 are the first and second images, whose 8-bit pixels are designated pixel1 and
pixel2, respectively; after the two pixels are added, the resulting image is stored in
dest.
This function is a very basic operation and common to many imaging and graphics appli-
cations such as digital pre-press, digital, remote sensing, etc. The following image sizes
were used in the simulation: 64x64 pixels, 256x256 pixels, 512x512 pixels, and
1024x1024 pixels.
6.2.1.2 Alpha blending (AB)
Alpha blending combines two images and creates translucent objects in a scene. The
basic function equation is as follows:
Dest = alpha x src1 + (1 - alpha) x src2, 0 <= alpha <= 255
As already discussed in Chapter 5, this function is commonly used in pre-process, image
paint, and editing software. It is especially useful in digital post-production, where
the final frame is required to be a blend of many different images. The following image
sizes were used in the simulation: 64x64 pixels, 256x256 pixels, 512x512 pixels, and
1024x1024 pixels.
6.2.1.3 Threshold (TH)
This function uses a threshold value as a reference for the pixels and transforms 8-bit
color into 1-bit black and white according to that threshold value. If a pixel value is
greater than the threshold value, the resulting pixel is set to white. Otherwise, the
resulting pixel is set to black. This function is used as an editing technique.
Additionally, it prepares images for printers, which need a color image to be changed
into a black and white image for the printout. The following image sizes were again used
in the simulation: 64x64 pixels, 256x256 pixels, 512x512 pixels, and
1024x1024 pixels.
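The threshold rule can be written as a short scalar C function. This is a sketch, not the benchmark source: the function name is hypothetical, and it assumes one output byte per pixel holding the 1-bit result (1 for white, 0 for black) rather than a packed bitmap.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar threshold: pixels greater than 'thresh' become white (1),
 * all others black (0). One output byte per pixel is an assumption. */
void threshold_scalar(const uint8_t *src, uint8_t *dst, size_t n, uint8_t thresh)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] > thresh) ? 1 : 0;
}
```

Because every pixel is compared independently against the same value, the loop maps naturally onto an 8-pixel SIMD comparison.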
6.2.2 Audio Application
The Finite Impulse Response (FIR) filter (FF) is used for audio stream manipulation
according to the coefficients of the FIR filter.
\[ dest[n] = \sum_{k=0}^{flen-1} FIR[k] \times src[n+k], \qquad 0 \le n < dlen \]
Here flen and dlen represent the length of the filter and of the destination data,
respectively. FIR[k] and src[n+k] denote the filter coefficients and the source audio
stream. Each filter coefficient FIR[k] scales the src[n+k] audio sample, and the filtered
audio stream is stored in dest. This function determines the quality of the sound
according to the filter coefficients. For the simulation of the audio application, audio
streams of different sizes and a filter size are required as input data. A fundamental
audio operation is generally processed in the form of bit streams (n), and the filter
size is designated as the number of coefficients (k). The audio data are fed in four
different audio stream sizes based on the bit data stream: 1,000 streams, 2,000 streams,
3,000 streams, and 4,000 streams. Each stream corresponds to 32 filter coefficients.
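The FIR equation translates directly into a doubly nested loop. The following C sketch is an illustration under assumptions not stated in the text: 16-bit samples and coefficients, 32-bit accumulation, and a source buffer holding at least dlen + flen - 1 samples; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dest[n] = sum over k of FIR[k] * src[n+k], for 0 <= n < dlen.
 * Assumes src holds at least dlen + flen - 1 samples. */
void fir_scalar(const int16_t *src, const int16_t *fir,
                int32_t *dest, size_t dlen, size_t flen)
{
    for (size_t n = 0; n < dlen; n++) {
        int32_t acc = 0;
        for (size_t k = 0; k < flen; k++)
            acc += (int32_t)fir[k] * src[n + k];   /* multiply-accumulate */
        dest[n] = acc;
    }
}
```

With src = {1, 2, 3, 4, 5} and three coefficients all equal to 1, the first three outputs are 6, 9, and 12. Each output depends on flen multiply-accumulate steps, so the work grows with both the stream length and the number of coefficients.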
6.2.3 Video Application
Motion estimation (ME) is selected as one of the video applications, and the relevant
equation is as follows:
\[ \mathit{diff} = \sum_{i=0} \sum_{j=0} \mathrm{abs}\bigl(MB1(i,j) - MB2(i,j)\bigr) \]
The diff value determines the degree of motion between two consecutive frames. More
details are given in Section 5.2. As input data, different macroblock sizes are fed into
the simulator to examine the performance variations: 16x16 blocks, 32x32 blocks,
64x64 blocks, and 128x128 blocks.
6.3 Simulation Results
The simulation results below were obtained from INCAS running on an UltraSPARC-II
296 MHz system using the SPARC SC4.2 compiler. As input files, five benchmark programs
were fed into the simulator in the form of C and VIS codes.
Incas Release 2.0 - Beta
Configuration phase pwd is "/opt/SUNWincas/lib".
Preprocessing configuration file "/opt/SUNWincas/lib/us-l.conf".
Parsing configuration file "/opt/SUNWincas/lib/us-l.conf".
Creating C module classes.
Creating module instances and interfaces.
Performing interface configurations.
Performing shared object registrations.
Performing shared object lookups.
Performing interface configuration verifications.
Reading Ui commands from "/opt/SUNWincas/lib/incasrc".
Negative phase is active.
load: File "timing_blend8ims"
ieul: breakpoint 1 (stage G) at sim_break1 (0x8374) encountered.
real time Feb 23 13:12:09.289736  user time 1:03:36.450000  system time 0.06000
cycle count: 42314936 (11087.56 cps = 39.92 MCPH)
instr count: 8134194 (2131.33 ips = 7.67 MIPH)
cpi: 5.202, ipc: 0.192
ieul: breakpoint 2 (stage G) at sim_break2 (0x838c) encountered.
real time Feb 23 13:20:59.137198  user time 8:49.110000  system time 0.06000
cycle count: 5735622 (11058.58 cps = 39.81 MCPH)
instr count: 1057859 (2115.48 ips = 7.62 MIPH)
cpi: 5.422, ipc: 0.184
Figure 6.3: Simulation Results shown by the Simulator INCAS
Figure 6.3 shows the results of a simulation of the Alpha blending programs,
c_blend8ims.c and vis_blendims.c, with an input image of 512x512 pixels. The results are
given in terms of cycle count, instruction count, user time, CPI, and IPC. The cycle
count and instruction count are the total numbers of cycles and instructions required to
complete the Alpha blending program, respectively. The user time is the CPU time, or
execution time, and indicates the overall performance. CPI and IPC are clock cycles
per instruction and instructions per cycle, respectively. The first result represents the
simulation of the C code, and the second result is for the simulation of the VIS code.
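The derived figures in Figure 6.3 can be reproduced from the raw counts. The helper functions below are a hypothetical sketch (they are not INCAS commands); they recompute CPI, IPC, and the C/VIS speed-up from the cycle counts, instruction counts, and user times printed by the simulator.

```c
/* CPI = cycles / instructions; IPC is its reciprocal. */
double cpi(double cycles, double instrs) { return cycles / instrs; }
double ipc(double cycles, double instrs) { return instrs / cycles; }

/* Convert an h:mm:ss.s user time into seconds. */
double to_seconds(int h, int m, double s) { return h * 3600.0 + m * 60.0 + s; }

/* Speed-up of VIS over C, as a ratio of execution times in seconds. */
double speedup(double c_seconds, double vis_seconds) { return c_seconds / vis_seconds; }
```

For the C run, cpi(42314936, 8134194) is about 5.202 and ipc is about 0.192; for the VIS run, cpi(5735622, 1057859) is about 5.422. The ratio of the user times, speedup(to_seconds(1, 3, 36.45), to_seconds(0, 8, 49.11)), is about 7.2, which matches the C/VIS entry for 512x512 Alpha blending in Table 6.1.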
Table 6.1 and Table 6.2 show the simulation results in a tabular form; Image Appli-
cations results are given in Table 6.1, and Audio Application and Video Applications re-
sults are given in Table 6.2. Overall performances reflect the average of execution time
of each data size.
Data Size   Comparison Image Addition            Alpha Blending            Threshold
(Pixels)               C            VIS          C            VIS          C            VIS
64*64       CYCLE      42,675       9,979        127,204      15,142       61,733       13,793
            INSTS.     82,726       12,853       127,983      17,602       95,141       10,965
            CPI        0.516        0.776        0.994        0.860        0.649        1.258
            IPC        1.939        1.288        1.006        1.162        1.541        0.795
            Exec.Time  8.11         1.51         15.73        2.16         6.60         1.87
            C/VIS      5.4                       7.3                       3.5
256*256     CYCLE      1,377,046    262,028      3,011,503    392,657      975,232      225,758
            INSTS.     1,313,832    192,567      2,035,504    266,821      1,510,949    160,341
            CPI        1.048        1.361        1.479        1.472        0.645        1.408
            IPC        0.954        0.735        0.676        0.680        1.549        0.710
            Exec.Time  2:57.40      31.12        5:32.83      0:44.84      2:33.63      32.72
            C/VIS      5.7                       7.4                       4.7
512*512     CYCLE      25,498,040   2,304,548    42,314,936   5,735,622    3,907,516    894,926
            INSTS.     5,249,067    761,908      8,134,194    1,057,859    6,036,518    631,892
            CPI        4.858        3.025        5.202        5.422        0.647        1.416
            IPC        0.206        0.331        0.192        0.184        1.545        0.706
            Exec.Time  39:07        4:01         1:03:36      8:49         10:05.05     3:04.19
            C/VIS      6.8                       7.2                       3.3
1024*1024   CYCLE      137,657,504  17,566,000   185,293,728  23,958,992   28,781,624   7,345,404
            INSTS.     20,983,852   3,031,092    32,521,264   4,212,804    24,131,622   2,508,884
            CPI        6.560        5.795        5.698        5.687        1.193        2.928
            IPC        0.152        0.173        0.176        0.176        0.838        0.342
            Exec.Time  3:22:30      27:28        4:37:12      37:04        1:02:52      14:36
            C/VIS      7.4                       7.5                       4.3
Overall Performance    6.3                       7.3                       3.9
Table 6.1: Simulation Result of Image Applications

Data Size       Comparison  Audio Application         Video Application
(Audio: Length)             Fir Filter                Motion Estimation
(Video: Pixels)             C            VIS          C            VIS
Audio: 1000     CYCLE       268,628      125,204      3,737        1,666
Video: 16*16    INSTS.      223,020      172,792      3,275        753
                CPI         1.205        0.725        1.141        2.212
                IPC         0.830        1.380        0.876        0.452
                Exec. Time  33.77        20.56        0.42         0.18
                C/VIS       1.6                       2.3
Audio: 2000     CYCLE       536,841      250,013      11,865       5,014
Video: 32*32    INSTS.      446,025      345,543      12,619       2,353
                CPI         1.204        0.724        0.940        2.131
                IPC         0.831        1.382        1.064        0.469
                Exec. Time  1:08.03      42.94        1.41         0.54
                C/VIS       1.6                       2.6
Audio: 3000     CYCLE       805,120      374,766      44,470       16,915
Video: 64*64    INSTS.      669,026      518,298      49,741       8,335
                CPI         1.203        0.723        0.894        2.029
                IPC         0.831        1.383        1.119        0.493
                Exec. Time  1:41.84      1:10.93      5.34         1.97
                C/VIS       1.4                       2.7
Audio: 4000     CYCLE       117,330      441,816      106,756      50,578
Video: 128*128  INSTS.      992,025      611,049      120,875      25,789
                CPI         1.203        0.723        0.883        1.961
                IPC         0.831        1.383        1.132        0.510
                Exec. Time  2:30.47      1:41.01      12.97        5.62
                C/VIS       1.5                       2.3
Overall Performance         1.5                       2.5
Table 6.2: Simulation Result of Audio and Video Applications
Figure 6.4 illustrates the performance of image, audio, and video applications im-
plemented in C and VIS. The performance was measured for each benchmark program
using four different input data sizes. As the input data sizes become large, the execution
time also increases. Especially, Alpha blending and Image Addition show notable differ-
ences in execution time between C and VIS, i.e., VIS performance is much better in both
benchmarks. On the other hand, VIS shows less performance improvement in audio or
video benchmark applications.
[Figure 6.4 plots comparing the C and VIS versions. Recoverable panels and axes:
(b) Alpha Blending (AB), Image Size (pixels): 64, 256, 512, 1024; Threshold (TH), Image
Size (pixels): 64, 256, 512, 1024; (d) Fir Filter (FF), Audio Stream (bits): 1000, 2000,
3000, 4000; Motion Estimation, Search Area (pixels): 16, 32, 64, 128; (f) Legends: C, VIS]
Figure 6.4: Performance comparisons of VIS with C
The performance results vary depending on the characteristics of the application and
its execution algorithm. Alpha blending and Image Addition have similar execution
algorithms, and their functionality in multimedia applications is similar. To operate on
all of the input data (pixels), their execution algorithms use a triple-nested loop
structure, and operation continues until all input data are processed. These
characteristics, suitable for SIMD, allow better utilization of VIS instructions. This is
one of the reasons why the performance improvement provided by VIS varies from
application to application.
6.4 Analysis of the Simulation Results
VIS instructions designed for multimedia processing provide significant performance
improvements over the general instruction set architecture. On average, they perform
about 3 to 4 times better than the general architecture across all application areas.
[Figure 6.5 plot: Speed-up (C/VIS), on a scale of 0 to 8, for the benchmarks IA, AB, TH,
FF, and ME; legend: VIS.]
Figure 6.5: Overall Performance of Applications in VIS
Figure 6.5 shows the overall performance improvement of VIS normalized to the
execution times of the C versions. On average, the VIS versions run about 4 times
faster than the C versions. However, the speed-up varies from 1.5 to 7.3. Alpha
Blending shows the highest speed-up; on the other hand, Fir Filtering gives the lowest
performance improvement. These variations show that the performance of VIS is
application-dependent. The most dominant factors that affect the performance
improvement in VIS fall into two categories: multimedia architectural features and
benefits of VIS. Combined, the two factors result in a significant performance
improvement for multimedia processing.
6.4.1 Impacts of VIS Architecture Features
Performance improvements of the VIS architecture come from the utilization of the
Floating-point Graphic Unit (FGU). The FGU has a dedicated pipeline, called the FGU
pipe, and performs the graphic operations of the VIS instructions. Also, the FGU uses
only floating-point registers so that it does not interfere with integer operations.
Therefore, the FGU helps VIS instructions to yield better performance.
Applications     Image Addition   Alpha Blending   Threshold    Fir Filter   Motion Estimation
Program Type     C      VIS       C      VIS       C     VIS    C     VIS    C      VIS
Operation (%)
ALU              25     10        36     21        29    19     28    6      43     15
Memory           25     12        14     18        33    16     20    27     14     29
Branch           37     3         25     8         22    4      24    11     21     20
FGU              0      58        0      52        0     37     0     33     0      26
Others           13     17        25     1         16    14     28    23     22     10
Table 6.3: Summary of Distributed Instruction
Table 6.3 shows the distribution of instructions, which was calculated from the
assembly programs. Instructions are grouped according to their operation types: ALU,
Memory, Branch, FGU, and Others. The FGU operation indicates a graphic operation. FP
registers are accessed mostly by VIS instructions for the FGU. VIS programs rely on the
integer registers for addressing data (address calculations for loads and stores) and
on the floating-point registers for manipulating the operands of VIS instructions,
namely multimedia data.
[Figure 6.6 plot: stacked bars from 0% to 100% for the C and VIS versions of each
benchmark; legend: FGU, Others, Branch, Memory, ALU.]
Figure 6.6: Distribution of Instructions by Operation
Based on Table 6.3, Figure 6.6 depicts the distribution of total instructions executed
for the benchmarks. This figure emphasizes the importance of the FP registers engaged
by the FGU. The C programs for all applications do not use instructions that access FP
registers, which implies that C programs cannot take advantage of the FGU and its
pipeline. Accessing only integer registers complicates the datapath and causes a
bottleneck, since every instruction depends only on the integer pipeline. On the other
hand, the VIS programs fully use both integer and FP registers with the support of a
dual pipeline. In Figure 6.6, the share of ALU operations in the C versions is greater
than that in the VIS versions by a factor of about 2 on average. This is because the
VIS programs utilize both pipelines: one operating on integer registers and the other
operating on FP registers. Which registers are used depends on whether the operation is
related to multimedia processing. If so, it is executed in the FGU on data held in FP
registers; otherwise, it is executed in the ALU using integer registers. This signifies
that the VIS programs can keep the datapath simple through independent use of the
integer and floating-point registers. As a result, the use of both integer and FP
registers increases the data throughput by providing the maximum number of registers
and more instruction parallelism. Thus, the use of FP registers has a significant
influence on performance.
[Figure 6.7 plot: IPC, on a scale of 0 to 1.8, for the benchmarks IA, AB, TH, FF, and
ME; legend: IPC with FGU, IPC without FGU.]
Figure 6.7: IPC comparison between FGU and without FGU
Figure 6.7 compares the average IPCs of the VIS programs and the C programs. In
addition, the IPCs of the VIS programs were measured excluding the overhead in order to
study the effect of the FGU on overall performance. On average, the IPCs with FGU
support are better than those without the FGU. However, the IPCs of TH and ME in VIS
are lower than those of the C benchmarks because of the application characteristics.
The benchmark TH has many partial store instructions, which require two cycles and
other supporting instructions. The application ME implements a complex algorithm and
uses the vis_pdist() instruction, whose latency is three cycles. The C version of ME
only needs to process one block of a search area. On the other hand, the VIS version
processes two separate motion estimation areas, one for VIS and the left-over, as
discussed in Chapter 5.
6.4.2 Advantage of VIS
The speed-up of 1.5 to 7.3 obtained from the VIS programs comes from a significant
reduction in instruction counts. In all the benchmark programs, the instruction counts
of the VIS versions are much smaller than those of the C versions. Figure 6.8 shows the
number of instructions executed in each application. The following discussion explains
how VIS reduces instruction counts.
[Figure 6.8 plot: number of instructions executed by the C and VIS versions of each
benchmark, with the following data table.]
       IA         AB         TH         FF       MS
C      20983852   32521264   24131622   992025   186510
VIS    3031092    4212804    2508884    611049   37230
Figure 6.8: Instruction Counts between C and VIS
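The reduction can be expressed as the ratio of the C instruction count to the VIS instruction count for the same benchmark. The sketch below computes that ratio; the function name reduction_factor is hypothetical, and the counts quoted afterwards are the IA and FF values as read from the data table of Figure 6.8.

```c
#include <assert.h>

/* Ratio of instructions executed by the C version to those executed by
 * the VIS version of the same benchmark. A value above 1 means the VIS
 * version executed fewer instructions. */
double reduction_factor(unsigned long c_count, unsigned long vis_count)
{
    assert(vis_count > 0);
    return (double)c_count / (double)vis_count;
}
```

With these inputs, reduction_factor(20983852, 3031092) is about 6.9 for IA, while reduction_factor(992025, 611049) is about 1.6 for FF. The ratio covers instruction counts only; execution time also depends on the IPC of each version.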
6.4.2.1 Effect of Smaller Number of Loop Iteration
The primary reason for the smaller instruction counts lies in the loop characteristics
of the VIS programs. The many loop iterations of the C versions are replaced by SIMD
execution of the loop in the VIS programs, yielding about an 8-fold reduction in loop
iterations, as explained in Section 5.1.3. In addition, the reduction in loop
iterations also reduces the loop overhead inherent in each iteration. Several
instructions are required for each loop iteration: an index increment instruction and a
branch compare instruction. With a large number of loop iterations, this loop overhead
degrades the overall performance. Therefore, the VIS programs have the advantage of a
reduced instruction count as well as the additional effect of reduced loop overhead
inherent in loop control.
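The relationship between loop iterations and loop overhead can be sketched with a small calculation. The helper names below are hypothetical, and the overhead of two instructions per iteration (one index increment and one branch compare) is an assumption based on the description above rather than a measured value.

```c
#include <assert.h>
#include <stddef.h>

/* Number of loop iterations needed to process n elements when each
 * iteration handles per_iter elements (rounded up for a partial tail). */
size_t loop_iterations(size_t n, size_t per_iter)
{
    assert(per_iter > 0);
    return (n + per_iter - 1) / per_iter;
}

/* Loop-control instructions executed: iterations times the number of
 * overhead instructions per iteration (assumed, e.g. 2). */
size_t loop_overhead(size_t n, size_t per_iter, size_t overhead_per_iter)
{
    return loop_iterations(n, per_iter) * overhead_per_iter;
}
```

For example, for 262144 pixels, loop_iterations(262144, 1) gives 262144 iterations, while processing 8 pixels per iteration gives 32768, and the loop-control instruction count falls by the same factor of 8.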
[Figure 6.9 plots, with the following data tables.]
(a) Number of Loop Overhead Instructions
       IA       AB       TH       FF       MS
C      6E+06    6E+06    6E+06    256064   33024
VIS    264192   264192   264192   40000    4352
(b) Percentage of Loop Overhead
       IA   AB   TH   FF   MS
C      29   19   27   29   17
VIS    8    6    10   9    10
Figure 6.9: Loop Overheads in C Instructions
Figure 6.9 shows the effect of SIMD on instruction counts for VIS programs. The
instruction counts shown for the benchmarks are the loop overhead counts during loop
execution, i.e., the index increment and branch compare overhead. In Figure 6.9 (b),
the percentage of loop overhead varies among the benchmarks, depending on the total
instruction count of each application. Both the instruction counts and the percentages
of the C benchmarks are larger than those of the VIS benchmarks. Therefore, the much
smaller number of loop overhead instructions executed in VIS provides an additional
performance improvement.
6.4.2.2 Effect of Arithmetic Operation Instructions
The VIS packed arithmetic instructions, such as vis_fpadd(), vis_fpsub(), and
vis_fmul(), allow four arithmetic instructions to be replaced with one VIS instruction.
For example, four partitioned pixel values packed in a 32-bit word can be manipulated
using one VIS arithmetic instruction. Therefore, a VIS arithmetic operation can provide
up to four times the speed-up of typical arithmetic operations.
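The idea of a partitioned add can be imitated in portable C to show why one packed operation stands in for four scalar ones. The sketch below is only an emulation under stated assumptions: four 16-bit lanes held in one 64-bit word, with wrap-around arithmetic in each lane. The names packed_add16 and lane16 are hypothetical, and the code does not claim to reproduce the exact semantics of any particular VIS instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Add four independent 16-bit lanes packed in 64-bit words.
 * The top bit of each lane is masked off before the add so that no carry
 * can cross into the neighbouring lane; the top bits are then restored
 * with an exclusive-or. Each lane wraps around modulo 2^16. */
uint64_t packed_add16(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000800080008000ULL;
    uint64_t low_sum = (a & ~top) + (b & ~top);
    return low_sum ^ ((a ^ b) & top);
}

/* Extract lane 0..3 (lane 0 is the least significant 16 bits). */
uint16_t lane16(uint64_t word, unsigned lane)
{
    assert(lane < 4);
    return (uint16_t)(word >> (16u * lane));
}
```

A single call to packed_add16 produces the same four sums that a scalar loop would compute with four separate additions.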
[Figure 6.10 plots: (a) Number of Arithmetic Operation, with the following data table;
(b) Percentage of Arithmetic Instructions, for the C and VIS versions of IA, AB, TH, FF,
and MS.]
       IA       AB       TH       FF      MS
C      2E+06    9E+06    6E+06    78122   65279
VIS    303109   884689   175622   18331   1862
Figure 6.10: Impact of VIS Arithmetic Operation on C
Figure 6.10 shows the arithmetic instruction counts and their percentages for both the
VIS and C benchmarks. In calculating the instruction counts, only the instructions
related to the actual multimedia data processing were considered. A large number of
arithmetic instructions were executed in the benchmark AB because of the complexity of
the alpha blending equation. This increases the potential of the VIS AB program to
perform better, since more arithmetic operations contribute to the performance
improvement. In Figure 6.10 (b), MS differs from the other benchmarks. A sequence of 48
arithmetic instructions in the C program is required to calculate the Motion Estimation
for only one pixel. On the other hand, these 48 arithmetic instructions can be reduced
to one special instruction, vis_pdist(), in VIS.
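To make the comparison concrete, the work done by a pixel distance operation can be written out in scalar C. The sketch assumes that the operation accumulates the sum of absolute differences of eight pairs of 8-bit pixels, which is how the pixel distance instruction is generally described; the function name sad8 and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar emulation of a pixel distance step: add the absolute differences
 * of eight pairs of 8-bit pixels to a running accumulator. Each pair costs
 * a subtract, an absolute value, and an add in scalar code. */
uint32_t sad8(const uint8_t a[8], const uint8_t b[8], uint32_t acc)
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 8; i++) {
        int diff = (int)a[i] - (int)b[i];
        acc += (uint32_t)abs(diff);
    }
    return acc;
}
```

In the scalar form, every pixel pair needs its own subtract, absolute value, and accumulate, which is where the long arithmetic sequences in the C version come from.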
6.4.2.3 Effect of Boundary Manipulation Instruction for Branches
An operation common to every application is the edge masking operation using
vis_edge(). The vis_edge() instruction decreases the number of branches that occur in a
loop. Most multimedia applications require testing of edge boundaries to improve image
quality, for example to remove jagged edges. In C programs, branch instructions are
used to make function calls for examining the edge boundaries and processing the
boundary images. This requires a considerable number of instructions, and the overall
execution time increases.
On the other hand, the VIS versions place the edge instructions at the beginning of the
loop iteration as a replacement for the test function in the C versions. As a result,
VIS reduces the number of instructions and thus the execution time.
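The masking idea can be sketched in plain C. The helpers below are a simplified illustration, not the actual vis_edge() semantics: edge_mask8 and masked_store8 are hypothetical names, the mask covers one 8-byte block, and the bit ordering (bit i corresponds to byte i) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Build an 8-bit mask whose bit i is set when byte i of an 8-byte block
 * lies inside the inclusive range [first, last]. The mask is computed
 * without any per-byte branch. */
uint8_t edge_mask8(unsigned first, unsigned last)
{
    assert(first <= last && last < 8);
    return (uint8_t)((0xFFu << first) & (0xFFu >> (7u - last)));
}

/* Store only the bytes selected by the mask, leaving the others intact.
 * In hardware this selection would be done by a single partial store. */
void masked_store8(uint8_t dst[8], const uint8_t src[8], uint8_t mask)
{
    for (unsigned i = 0; i < 8; i++) {
        dst[i] = ((mask >> i) & 1u) ? src[i] : dst[i];
    }
}
```

With a mask computed once per block, the loop body no longer needs a separate boundary test for every pixel, which is the source of the branch savings discussed above.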
[Figure 6.11 plots: (a) Branch Instructions Increased in C, (b) Percentage of Branch
Instructions, for the C and VIS versions of IA, AB, TH, FF, and MS.]
Figure 6.11: Impact of VIS Branch Operation on C
Figure 6.11 shows that, on average, about 7% of the instructions in the VIS benchmarks
are branch instructions, compared to 11% in the C benchmarks. The differences are due
to the use of edge instructions. Because the C benchmarks lack such instructions, their
branch instruction counts grow as the loop iterations increase. The number of branch
instructions in FF differs from that of the other benchmarks. Since audio data does not
need edge or boundary processing, the vis_edge() instruction was not used. Therefore,
FF did not benefit from edge instructions. In addition, the C version of FF has only a
double-nested loop, so there is not much difference in loop iterations compared with
VIS.
6.4.2.4 Effect on Memory Instructions
VIS load and store instructions are capable of handling packed data types. Similar to
the VIS arithmetic instructions, the VIS load and store instructions, such as
vis_lddfa() and vis_stdfa(), can replace four general load/store instructions.
[Figure 6.12 plots, with the following data tables.]
(a) Load/Store Increased in C
       IA       AB       TH       FF       MS
C      5E+06    8E+06    4E+06    43401    22847
VIS    363731   800433   376833   164983   11169
(b) Percentage of Memory Access Instruction
       IA   AB   TH   FF   MS
C      25   27   18   5    14
VIS    12   19   15   27   3
Figure 6.12: Impact of VIS Memory Access Operation on C
In Figure 6.12 (a), the number of memory access instructions is largest for the AB
benchmark. Since AB requires additional alpha values, it needs more memory accesses
than the other benchmarks. In Figure 6.12 (b), the VIS version of FF shows poor
performance since FF does not use the vis_lddfa() and vis_stdfa() instructions.
Instead, FF uses general load/store instructions. As a result, the performance suffers
from additional memory accesses using general instructions.
In general, VIS applications show small performance improvements for the following
reasons: (1) application characteristics that prevent exploitation of VIS instructions,
(2) use of VIS instructions with a large overhead, and (3) small differences in the
number of loop iterations between C and VIS.
6.4.3 Limitations of VIS Instructions
Even though the overall performance of the VIS benchmarks is better than that of the C
benchmarks, there are some limitations that restrict further performance improvement
in VIS. These limiting factors will be referred to as VIS overhead, and they are
briefly discussed based on the simulation results. Most of the VIS overhead originates
from the following operations: aligning data, data conversion, and reading/writing
from/to the GSR. These operations are usually required to exploit SIMD in the VIS
extension.
Aligning Data: This operation is required when dealing with a group of 4 or 8 data
values at a time. For example, vis_alignaddr() and vis_faligndata() must be used before
loading and storing data in VIS code. The data alignment overhead accounts for about
10% of the execution time on average.
Data Conversion: To change data into a format suitable for VIS instructions, VIS
instructions such as vis_fpack() and vis_fexpand() are needed. On average, the data
conversion overhead accounts for 11% of the overall execution time.
Reading/Writing from/to GSR and Combining Data: GSR handling is required to support the
data conversion instruction vis_fexpand(). This is done using the vis_write_gsr() and
vis_read_gsr() instructions, which take about 2% of the execution time.
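The data conversion step listed above can be illustrated with a simplified narrowing routine. This is a sketch under stated assumptions, not the exact behavior of vis_fpack(): the name pack16_to_8 is hypothetical, the frac_bits parameter stands in for the scaling that the real instruction takes from the GSR, and negative inputs are simply clamped to zero.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow four 16-bit intermediate results to four 8-bit pixels.
 * Negative values clamp to 0; non-negative values drop frac_bits
 * low-order bits and then clamp to 255. */
void pack16_to_8(const int16_t in[4], uint8_t out[4], unsigned frac_bits)
{
    assert(frac_bits < 15);
    for (int i = 0; i < 4; i++) {
        int32_t v = in[i];
        v = (v < 0) ? 0 : (v >> frac_bits);
        out[i] = (uint8_t)(v > 255 ? 255 : v);
    }
}
```

A conversion of this kind is needed whenever 16-bit intermediate results must be written back as 8-bit pixels, so it appears once per group of results in a VIS loop and contributes to the overhead measured here.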
[Figure 6.13 plot: stacked bars from 0% to 100% for the applications IA, AB, TH, FF, and
MS; legend: OTHERS, EDGE, OVERHEAD, BRANCH, MEMORY, ALU.]
Figure 6.13: Overhead of VIS instructions
Figure 6.13 compares the VIS overhead with other instruction categories. The percentage
of VIS overhead varies with each application. The overheads of IA, AB, TH, FF, and MS
are 32%, 24%, 20%, 12%, and 20%, respectively. The application IA has the highest
overhead because it requires more data conversion instructions. On the other hand, the
overhead of FF is the lowest since audio stream filtering functions require only a few
data load instructions. The overall percentage of VIS overhead is about 23% on average
when all benchmark programs are considered.
7 CONCLUSION AND FUTURE WORK
As the demand for multimedia applications increases, multimedia processing has become
an essential function of a computer. Microprocessor vendors have searched for a
cost-effective way to use general-purpose processors for multimedia processing while
maintaining high performance. Based on this philosophy, many vendors have designed
multimedia extensions at the ISA level of general-purpose processors. The VIS extension
of the UltraSPARC processor is one of them.
The simulation studies show that applications using VIS exhibit, on average, a speed-up
of about 3 to 4 compared to multimedia processing based on the general ISA. This
performance improvement comes from the reduction in the number of instructions
executed. The primary reason for the reduction in instruction counts is the SIMD
processing technique. VIS supports SIMD processing with architectural enhancements and
various multimedia-oriented instructions, and enables the manipulation of multiple data
in a single cycle. Also, the separate pipeline of the FGU can exploit more
instruction-level parallelism when dealing with 64-bit data.
However, VIS has some limitations that restrict further performance improvement,
including the overhead required to exploit SIMD, as in the data alignment and data
conversion operations. The overhead of the vis_fpack() data conversion instruction
occurs when truncating 16/32-bit intermediate results into 8-bit final results before
the results are stored in memory. This vis_fpack() overhead accounts for 10% of overall
instructions. If this overhead can be reduced, the performance of VIS will improve
further. For example, if a large accumulator were provided in the floating-point
register file to store the 16- or 32-bit intermediate results directly, the vis_fpack()
operation would not be necessary. As a result, performance would improve and the
results would be more precise. Therefore, providing an accumulator in the VIS
architecture is worth investigating in future work as a way of improving performance.