An approach for designing a hybrid parallel system that can perform different levels of parallelism adaptively is presented. An adaptive parallel computer vision system (APVIS) is proposed to attain this goal. The APVIS is constructed by integrating two different types of parallel architectures, i.e., a multiprocessor based system (MBS) and a memory based processor array (MPA), tightly into a single machine. One important feature of the APVIS is that the programming interface for executing data parallel code on the MPA is the same as the usual subroutine calling mechanism. Thus the existence of the MPA is transparent to the programmers. This research aims to design an underlying base architecture that can be optimally utilized for a broad range of vision tasks. A performance model is provided to show the effectiveness of the APVIS. It turns out that the proposed APVIS can provide significant performance improvement and cost effectiveness for highly parallel applications having a mixed set of parallelisms. Also, an example application composed of a series of vision algorithms, from low level and medium level processing steps, is mapped onto the MPA. Consequently, the APVIS with a few or tens of MPA modules can perform the chosen example application in real time when multiple images arrive successively with an inter-arrival time of a few seconds.
Introduction
Computer vision has been regarded as one of the most complex and computationally intensive problems. In general, most computer vision algorithms require many different degrees of parallelism as well as diverse computational and communication types. Computer vision tasks, e.g., image understanding and pattern recognition of objects, consist of a complex procedure requiring a variety of steps that successively transform the iconic data into recognition information. A recognition methodology must pay substantial attention to each of the following six steps: image formation, conditioning, labeling, grouping, extracting, and matching [1]. Vision algorithms are generally divided into three levels based on the processing sequence. Each level can be characterized as follows:
Low level processing: There exists massive data parallelism, which is suitable for SIMD type computations. Computations are normally simple and data independent, and mainly involve numeric processing and manipulation of simple data structures such as pixels. The communication requires efficient broadcast and synchronization, e.g., in convolution, filtering, and edge detection algorithms.
Medium level processing: It manipulates symbolic data (e.g., regions). Computations are normally data dependent and irregular. Tasks in this class belong to medium or coarse grain parallelism and are suitable for MIMD type computations, e.g., in region analysis.
High level processing: It manipulates both simple numeric data and symbolic data. It is suitable for MIMD coarse grain parallelism. It requires flexible structure and distributed control for computations and communications, e.g., in object recognition.
Many computer vision systems of different sizes and configurations have been proposed and developed. Exploitation of parallelism is one way to achieve computational speedup for a broad range of applications in which data parallelism and/or function parallelism can be exploited [2]. In data parallelism, the same operation is performed over many data elements by many processors simultaneously. Function parallelism allows two or more tasks to be performed simultaneously. Also, the degree of parallelism can be divided into fine grain, medium grain, and coarse grain sizes. Data parallelism can be utilized with minimum overhead on SIMD machines and distributed memory MIMD machines. However, function parallelism and multiprocessing are best utilized on shared memory MIMD machines under the conventional programming model. A mixed-mode processing system, the PASM system, is an approach that exploits heterogeneous processing within a single machine to support the SIMD and/or MIMD mode of parallelism [3, 4]. Massively parallel SIMD systems are suited for low level vision algorithms. Examples of these systems are MP-1 [5], MPP [6, 7], Connection Machine [8, 9], PAPIA [10], MPP pyramid [11], and SliM [12]. However, these architectures are not suitable for medium to high level vision algorithms, which require more complex and nonuniform processing. IUA (Image Understanding Architecture) [13] and NETRA [14] are constructed as hierarchical structures. The CMU Warp processor is a systolic array machine built for image understanding tasks [15].
The adaptive parallel computer vision system (APVIS) approach is to design a hybrid system that can adapt to different types of parallelism. The APVIS can adapt to the different levels of processing steps in vision tasks as a cost-effective system. In particular, two machine architectures, optimal for medium to high level and for low level processing algorithms, are chosen as the multiprocessor system and the SIMD array system, respectively, and these are integrated into a single machine, tightly coupled via a single system bus. Specifically, the medium to high level processing steps are executed by the multiprocessor system, and the low level processing steps can be executed selectively by the SIMD array. The proposed SIMD array system is similar to previous memory-processor integrated arrays, such as CRAM [16], IMAP [17], and PIM [18], which integrate the processors and their local memories within a chip in order to overcome the low bandwidth to the local memory. Furthermore, in the APVIS, any interaction involving the program and shared data between the multiprocessor system and the SIMD array can be resolved by means of simple memory reads and writes.
The performance of the APVIS is evaluated through an analytical comparison of relative performance and through an example vision application.
In Section 2, the architecture model of APVIS is presented. Section 3 describes the system operational model and the performance modeling of APVIS. Also it presents preliminary performance evaluation. In Section 4, an example of computer vision application is presented. Finally, Section 5 provides a conclusion.
The APVIS Architecture
The APVIS approach is to design an underlying base architecture that efficiently executes all levels of processing required in computer vision tasks. In this section, the organization of the APVIS is presented and its design issues are discussed.
The Overview of APVIS Architecture
The basic construction of the APVIS is shown in Figure 1. The APVIS is divided into two major system modules, i.e., a multiprocessing base system (MBS) and a data parallel SIMD system called the memory based processor array (MPA). The MBS is constructed as a multiprocessor architecture that contains a host shared memory (HSM), multiple identical processors, and a single interconnection network. For simplicity, the interconnection network of the APVIS is constructed as a single shared bus, as in Figure 1.
The processors of the MBS, called host processors (HPs), are general purpose off-the-shelf high performance processors and share a single pool of memory to enhance resource sharing and communication among different processors. Each HP uses its private cache. Cache updating is currently based on the write-through scheme to address the coherence problem. Thus, a cache coherence protocol based on the snooping mechanism can be easily applied to support the operation of the MPA.
The MBS supports multiprocessing and medium to coarse grain multitasking by using the shared memory system model. For a given image understanding application, the MBS can execute many processes corresponding to the medium to high level processing steps of a vision task; these are based on coarse and medium grain parallelism and are executed simultaneously on many HPs. Most of the low level processing steps, based on fine grain data parallelism, can be supported by the MPA. The MPA is a passive system that executes only on request from the HPs and under their control. Operational interaction between an HP and the MPA is performed via the conventional subroutine calling mechanism. An arbitrary process running on an HP can invoke a parallel subroutine call to be executed by the MPA for data parallelism. This type of control transfer is called an MPA subroutine call, to differentiate it from a conventional subroutine call.
The Memory based Processor Array (MPA) Architecture
The MPA system is an efficient architecture for low level processing steps in computer vision tasks, which usually involve massive data parallelism. In this section, the MPA system [...] the memory or as the SIMD array. Any HP inputs and outputs the data in the form of memory reads/writes under the arbitration of the ACP. Unlike a back-end SIMD system, the HP does not need to fetch or transfer the shared data processed by the MPA system to the HSM in order to execute a sequential or any coarse grain task. The HP just reads the shared data in the ASM as if it were data in its local memory. Therefore, the MPA system is considered a part of the contiguous host memory address space. Also, the APVIS can achieve a performance gain from the memory interface structure. This effect is explained later in Section 4.1. In Section 3.3, the structure of memory address mapping between the HPs and the PUs is discussed in detail.
As a SIMD array, the MPA system is constructed as a one dimensional array architecture. Algorithm mapping onto the one dimensional structure of the MPA and the performance advantages of this structure are discussed in Section 4.
Since the input data for image processing are often gray level data, bit parallel computation is desirable as the basic building block of each PU in many cases. Therefore, each PU in the MPA is constructed from an 8-bit ALU, a shifter, a set of general purpose registers, two registers (S and R) for communication with neighboring PUs, and its associated dual ported memory, as shown in Figure 3. A set of PUs with their associated ACU is fabricated into a single chip, called an MPA module. The MPA system can be configured with several chip modules, and each module can be operated independently. For a given number of MPA modules, all the MPA modules can be configured as a single group or as multiple groups depending on computational needs. When a group of multiple MPA modules performs a single task, multiple MPA subroutines are loaded into their corresponding ACMs and invoked simultaneously. Thus, the APVIS can be partitioned to process multiple images continuously or independently.
Memory Address Mapping Structure for MPA and MBS
Each AM in the MPA is constructed from dual ported memory modules and provides two ways to access the ASM. In other words, it may be configured differently in the view of the HPs and in the view of the PUs. There are m PU and AM modules, and each module has a memory space of w words, as shown in Figure 4. The HP can access the ASM via the address arbitration of the ACP. If the MPA is not activated by an MPA subroutine call, the MUX in the ACP selects the global address from the system bus to access the ASM. Otherwise, the MUX in the ACP selects the local address generated by each PU. Each PU can access its own AM by its local address when it is operating.
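The two views of the ASM can be illustrated with a small sketch. It assumes, purely for illustration, that the m AMs occupy consecutive blocks of w words in the host address space; the text does not state whether the layout is contiguous or interleaved. The function names `global_to_local` and `mux_select` are hypothetical.

```python
def global_to_local(addr, m, w):
    """Map an ASM-relative global address to (AM index, local word offset).

    Assumption (not from the paper): AM k occupies the global
    addresses k*w .. (k+1)*w - 1.
    """
    if not 0 <= addr < m * w:
        raise ValueError("address outside the ASM")
    return divmod(addr, w)


def mux_select(mpa_active, global_addr, local_addr):
    """Model of the ACP multiplexer.

    While an MPA subroutine is running, the PU-generated local address
    is selected; otherwise the global address from the system bus is.
    """
    return local_addr if mpa_active else global_addr
```

Under this assumed layout, an HP reading global word 5 of an ASM with m = 4 and w = 4 would reach word 1 of the second AM.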
The overall execution flow of the APVIS can be classified into the following steps. First, an HP compiles an application program for a computer vision task and stores it on its secondary storage. When this program is to be executed, an HP is allocated and the program is loaded into memory. The code blocks for intermediate level or high level processing steps are loaded into the HSM, and the code and data blocks for low level processing steps are loaded into the ACM and the ASM, respectively. The code blocks for low level processing steps are formed by a set of MPA subroutines. Then the HP starts to execute the program. When the HP encounters any calling instruction to execute an MPA subroutine on, the control 
System Operational Model and Performance Evaluation
In this section, the operational model of the APVIS is presented. Based on this model, the execution flow of application programs on the APVIS is explained.
System Operational Model
General characteristics of computer vision application tasks, as well as the computer vision task model, are considered in this section. The task model describes the computational model underlying the computer vision application tasks. A computer vision task can be decomposed into a collection of non-overlapping code blocks such that within each code block 
[Figure 5: (a) the APVIS operational model; (b) the memory load map in the APVIS, showing the HSM and the ASM.]

An architecture type denotes a parallel architecture onto which the software parallelism identified by its code type is optimally mapped. Thus, a task may consist of a combination of scalar code blocks, medium to coarse grain parallel code blocks for medium to high level processing steps, and fine grain data parallel code blocks for low level processing steps. For the operational model of the APVIS, a scalar code block and a medium to coarse grain parallel code block are merged into a general code block (G) when they are adjacent, because these code blocks are to be executed by the HPs. A set of fine grain parallel code blocks is denoted by P. Therefore, an instruction stream in the APVIS can be characterized as a set of general code blocks interleaved with data parallel code blocks. $G_i$ and $P_i$ denote the i-th code block of G and P, respectively, where $G = \{G_1, G_2, \ldots, G_n\}$ and $P = \{P_1, P_2, \ldots, P_{n-1}\}$.
As shown in Figure 5(a), the layout of these code blocks for a vision application program is represented as the APVIS operational model. The programmer may decide the boundaries of $G_i$ and $P_i$ explicitly at programming time. The two kinds of code blocks, $G_i$ and $P_i$, are interleaved for all i. Thus, $G_i$ begins right after $P_{i-1}$ and ends before $P_i$. In this approach, $G_i$ includes an instruction at its end to initiate $P_i$. Figure 5(b) shows the memory load map of an application program in the APVIS. GD and PD are the data required to execute the G and P code blocks, respectively. When a compiled program is to be executed, a host processor is allocated and the program is loaded into its memory, i.e., the HSM, as in Figure 5(b). The general code blocks and their associated data blocks are loaded into the HSM, the fine grain parallel code blocks (P) into the ACM, and the parallel data blocks (PD) into the AMs. Each $P_i$ is formed as a set of instructions to perform an MPA subroutine. Then the host processor of the APVIS starts to execute the general code blocks.
System Performance Modeling
In this section, a performance model for the APVIS is presented and evaluated. The performance model is derived from the operational model for the APVIS and is expressed in terms of the parameters given by the operational model. Let $|G_i|$ and $|P_i|$ be the sizes of the i-th general code block and data parallel code block, respectively. In particular, $|G_i|$ means the number of unit operations to execute in $G_i$. Then, the parallel execution times on the pure multiprocessor system, called P-SMP (pure shared memory multiprocessor), and on the APVIS for given code blocks, i.e., G and P, can be obtained as:
where $T^M_{ex}$ is the time to execute a basic arithmetic or logic operation by a single processor of the multiprocessor system (the same processors are assumed in both the P-SMP and the APVIS), and $T^A_{ex}$ is the time to execute an eight-bit basic arithmetic or logic operation by a single PU of the MPA. Let a system parameter, s, be the ratio of the execution capabilities of these two different processors, i.e., $s = T^A_{ex} / T^M_{ex}$. The program parameter is defined as the ratio of the time to execute the set of data parallel code blocks, P, to the time to execute all the code blocks, on a single processor. It can be obtained from the following equation:
This analytic model is based on the comparison of the execution times for the P-SMP and the APVIS, relative to the execution time on a single processor system. Thus, the terms of Equation (1) need to be changed as follows:
where $n_a$ is the number of PUs in the MPA and $n_m$ is the number of HPs in the P-SMP. $T_{g_i}$ and $T_{p_i}$ denote the execution times on a single processor for $G_i$ and for $P_i$, respectively. For a given $G_i$, there exists a maximum possible degree of parallelism. Thus, the degree of parallelism for $G_i$ with the given number of HPs is the minimum of this maximum degree and $n_m$. $T^{p_i}_{oh}$ and $T^{Ap_i}_{oh}$ are the overheads incurred for the code block $P_i$ when it is executed by the P-SMP and by the MPA, respectively. $T^{p_i}_{oh}$ is usually larger than $T^{Ap_i}_{oh}$, because executing $P_i$ on the P-SMP incurs substantial overhead. This overhead is the cost of processor creation, initialization, termination, communication, and so forth [20, 21, 22].
$T^{g_i}_{oh}$ is the overhead incurred when $G_i$ is executed by the P-SMP or by the HPs of the APVIS. In general, this type of overhead tends to increase as the degree of parallelism grows. Thus the overhead can be expressed as $T^{g_i}_{oh} = k(d - 1)$, where k is the proportionality constant, i.e., the ratio of the overhead to the entire execution time, and d is the maximum degree of parallelism for $G_i$, i.e., the number of subtasks to be executed simultaneously.
For simplicity, a given application is assumed to be constructed from two general code blocks G and a parallel code block P. It is assumed that there exists only serial code in $G_2$ and that $G_2 \simeq 0$. Thus, $G = G_1$ and $P = P_1$. Also, let $d_m$ and $d_{APVIS}$ be the degrees of parallelism for G with $n_m$ and $n_{APVIS}$ processors, respectively, i.e., the minimum of the maximum degree of parallelism of G (excluding the scalar code block) and the corresponding number of processors, where $n_{APVIS}$ is the number of HPs in the APVIS. The following equation can be obtained from Equation (1) 
Performance Evaluation
The APVIS approach is to execute various types of code blocks on the optimally matched architectures to minimize the overall execution time. To evaluate this architectural feature, performance and cost effectiveness are considered. To compare the APVIS with the P-SMP, a cost constant is defined as the number of PUs in the APVIS that corresponds to the cost of a single processor node in the P-SMP, i.e., $C_{HP} / C_{PU}$, where $C_{HP}$ and $C_{PU}$ are the costs of each processor type. In general, this constant may range from 32 to much larger values.
The relative performance improvement between the two machine types is shown in Figure 6(a). It is assumed that the number of processors in the P-SMP ($N_{HP}$) is 16 and that the cost constant, expressed as a number of PUs ($N_{PU}$), is 64. APVIS 1 is constructed with $N_{HP} = 15$ and $N_{PU} = 64$, APVIS 4 has $N_{HP} = 12$ and $N_{PU} = 256$, and APVIS 8 has $N_{HP} = 8$ and $N_{PU} = 512$, so that every configuration has the same total cost as the 16-processor P-SMP. The parameters are chosen as $s = 0.25$, [...] $= 10$, $T^{p_i}_{oh} = 0.05$, $T^{Ap_i}_{oh} = 0.005$, and $k = 0.01$. Figure 6(a) shows the relative speedup of each system as the ratio changes. Figure 6(b) shows the relative speedup as the proportionality constant of overhead, k, is changed from 0.01 to 0.09. In Figure 6(c), the performance improvement is shown as the cost constant is changed. As shown in Figure 6(c), the performance of APVIS 1 is worse than that of the P-SMP when the cost constant is 32. There exists a threshold value of the speedup for each specific value of this constant: the larger the constant, the better the threshold value of speedup that can be obtained. 
An Example of Computer Vision Application
In this section, the performance of the APVIS is evaluated on a computer vision application. The example application is a pattern recognition task constructed as a sequence of several image processing algorithms, namely 2-D convolution, histogramming, finding the best threshold, thresholding, region segmentation with connected component labeling, region analysis, and statistical pattern recognition, in that order [1, 26]. This example is chosen because it offers data parallelism that can be exploited effectively and gives an easy view of the overall processing.
Low level and medium level vision algorithms, e.g., 2-D convolution, histogramming, thresholding, and connected component labeling, can be efficiently mapped onto the MPA system, which supports fine grain data parallel processing. Thus, low level vision algorithms or tasks correspond to the data parallel code blocks in the analysis. The sequential algorithm, finding the best threshold, should be processed in the MBS. The MBS can achieve a significant performance gain from the memory interface structure of the MPA system, because there is no need to transfer the shared data between the MBS and the MPA system. High level vision algorithms, e.g., region analysis and statistical pattern recognition, can be performed in the MBS, which supports coarse grain parallel processing.
Some algorithms can be performed by the APVIS with concurrent execution of fine grain and medium to coarse grain parallelism. This class of algorithms corresponds to concurrent multilevel processing of a single image or to pipelined processing of multiple images. One of the most representative algorithms that process a single image by applying multilevel concurrency is the derivation of the visible surface representation [25]. Multiple image pipelining repeatedly applies the same sequence of bottom-up operations to incoming images. Also, when multiple users run multiple computer vision programs, concurrent multilevel processing can be supported effectively by the APVIS. 
Low Level and Medium Level Computer Vision Algorithms
Some assumptions for the system configuration and the applications are first described.
2. There are P PUs in the MPA system, where P is a power of two. An MPA chip includes m PUs and the MPA system consists of n MPA chip modules. Therefore, $P = mn$ and $N \le P \le N^2$.
3. Each PU can process an $N \times 1$ partitioned subimage, as shown in Figure 7(b). In striped partitioning, the image pixels are divided into groups of complete columns, each MPA chip module is assigned one such group, and each PU is assigned the pixels of one column. Block-checkerboard partitioning, as shown in Figure 7(a), is usually used in SIMD mesh connected computers. 
To perform $W \times W$ convolutions on the MPA system, a parallel algorithm represented by pseudo code is shown in Algorithm 1 of Figure 10. Algorithm 1 uses a $W \times W$ filter array F. Every 
Although the communication hardware cost of the MPA system is lower than that of SIMD mesh connected computers, the performance of convolution on the MPA system, $O(W^2 N)$ when $N = P$, is similar to that of the SIMD mesh, owing to an efficient algorithm that utilizes a certain number of registers and the memory interface architecture. Therefore, the MPA system together with its algorithm is cost-effective for $W \times W$ convolutions.
To perform histogramming on the MPA system, a parallel algorithm represented by pseudo code is shown in Algorithm 2 of Figure 11. Specifically, the number of computation steps, $\mathit{histogramming}_{MPA}$, to execute histogramming by the MPA system with P PUs ($\le N^2$) is obtained as [...]

The algorithm to find the best threshold is based on Otsu's algorithm [23]. Let $P(1), \ldots, P(L)$ represent the histogram probabilities of the observed gray values $1, \ldots, L$. Here, the best threshold is the threshold chosen such that the weighted sum of group variances is minimized. Let $\sigma_W^2$ be the weighted sum of group variances, that is, the within-group variance. Let $\sigma_1^2(t)$ and $q_1(t)$ be the variance and the probability for the group with values less than or equal to t, respectively, and $\sigma_2^2(t)$ and $q_2(t)$ be the variance and the probability for the group with values greater than t, respectively. Let $\mu_1(t)$ be the mean for the first group and $\mu_2(t)$ the mean for the second group. Then the within-group variance $\sigma_W^2$ is defined by

$$\sigma_W^2(t) = q_1(t)\,\sigma_1^2(t) + q_2(t)\,\sigma_2^2(t)$$

where $q_1(t) = \sum_{i=1}^{t} P(i)$ and $q_2(t) = \sum_{i=t+1}^{L} P(i)$. The relationship between the total variance, $\sigma^2$, and the within-group variance can make the calculation of the best threshold less computationally complex. By rewriting $\sigma^2$, we have

$$\sigma^2 = \left[ q_1(t)\,\sigma_1^2(t) + q_2(t)\,\sigma_2^2(t) \right] + \left[ q_1(t)\,\bigl(1 - q_1(t)\bigr)\,\bigl(\mu_1(t) - \mu_2(t)\bigr)^2 \right]$$

The first bracketed term is called the within-group variance $\sigma_W^2$ and the second bracketed term is called the between-group variance $\sigma_B^2$. Because the total variance $\sigma^2$ does not depend on t, the t minimizing $\sigma_W^2(t)$ will be the t maximizing the between-group variance $\sigma_B^2(t)$. Algorithm 3 of Figure 12 gives the order of calculation followed to obtain the best threshold, BT.
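The search for the threshold that maximizes the between-group variance can be sketched in a few lines of Python. This is a minimal sequential illustration, not the paper's Algorithm 3: it assumes gray values indexed from 0 (as in a Python list), uses the standard form $\sigma_B^2(t) = q_1(t)\,q_2(t)\,(\mu_1(t) - \mu_2(t))^2$, and the function name `best_threshold` is hypothetical.

```python
def best_threshold(prob):
    """Return the threshold t that maximizes the between-group variance.

    prob[i] is the histogram probability of gray value i (0-based here,
    which is an assumption of this sketch). Group 1 holds values <= t.
    """
    total_mean = sum(i * p for i, p in enumerate(prob))
    q1 = 0.0    # running probability of group 1
    sum1 = 0.0  # running sum of i * P(i) over group 1
    best_t, best_var = 0, 0.0
    for t, p in enumerate(prob):
        q1 += p
        sum1 += t * p
        q2 = 1.0 - q1
        if q1 <= 0.0 or q2 <= 0.0:
            continue  # one group is empty, so the variance is undefined
        mu1 = sum1 / q1
        mu2 = (total_mean - sum1) / q2
        var_b = q1 * q2 * (mu1 - mu2) ** 2
        if var_b > best_var:  # strict: keeps the first maximizing t
            best_t, best_var = t, var_b
    return best_t
```

Because only running sums are updated, each candidate threshold costs a constant number of arithmetic operations, which matches the sequential character of this step on an HP.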
It is assumed that the HP executes a basic arithmetic or logic operation in O(1) time. Times for complex operations are assumed to be multiples of the time for a basic operation. In this research, it is assumed that the floating point unit (FPU) in the HP executes floating point addition, multiplication, and division in 3, 4, and 16 unit times, respectively, by means of pipelining and a radix-4 SRT divider.

[...] $L \lceil \log_2 N^2 \rceil$ [...]

Algorithm 3. Finding the best threshold (Input: histogram probabilities P[0..L-1]; Output: best threshold BT)
[...] when BT = t = 0: $q_1(0) = 0$, $q_2(0) = 1$, $\mu_1(0) = 0$, $\mu_2(0) = \mu$, and $\sigma_B^2(0) = 0$
for t = 1 to L do
    $q_1(t) = q_1(t-1) + P(t)$
    $\mu_1(t) = \bigl[ q_1(t-1)\,\mu_1(t-1) + t\,P(t) \bigr] / q_1(t)$
    $\mu_2(t) = \bigl[ \mu - q_1(t)\,\mu_1(t) \bigr]$ [...]

[...] needed. However, the HP with the MPA system need not transfer the shared data to the local memory, because the MPA system can be operated as a passive memory structure.
Therefore, the number of computation steps, $\mathit{FBT}_{MBS-MPA}$, to find the best threshold is obtained as 
The APVIS gains a performance improvement from the memory interface structure over the conventional back-end interfacing of a SIMD system configuration. This improvement can be defined as the ratio of the time to fetch the shared data to the time both to fetch the shared data and to execute the sequential program. The performance improvement for finding the best threshold can be obtained as [...] the previous section, the APVIS can benefit from the memory interface structure when performing a conventional program composed of sequential code blocks and parallel code blocks
To perform thresholding on the MPA system, a parallel algorithm represented by pseudo code is shown in Algorithm 4 of Figure 13. Therefore, the number of computation steps, $\mathit{thresholding}_{MPA}$, to execute thresholding by the MPA system with P PUs ($\le N^2$) is obtained as
A parallel algorithm to label connected components of the thresholded binary image is based on the component shrinking algorithm [24], which relies on a local shrinking operation. Note that I(0, 0) is in the topmost left corner of the array. When labeling 8-connected components, the value of pixel I(i, j) is determined by the $2 \times 2$ neighborhood of the pixel. The new value of a pixel I(i, j) is defined to be $h\bigl(h(I(i,j) + I(i,j-1) + I(i-1,j) - 1) + h(I(i,j) + I(i-1,j-1) - 1)\bigr)$, where h(t) is the Heaviside function defined as $h(t) = 0$ for $t \le 0$ and $h(t) = 1$ for $t > 0$. This is called the shrinking operation.
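One application of the shrinking operation to a whole binary image can be sketched as follows. This is a sequential Python illustration of the formula only, not the paper's Algorithm 5: it assumes that pixels outside the image are 0, and the names `h` and `shrink_step` are chosen here for illustration. All new values are computed from the old image, which mirrors the simultaneous update performed by the PUs.

```python
def h(t):
    """Heaviside function: 0 for t <= 0, 1 for t > 0."""
    return 1 if t > 0 else 0


def shrink_step(img):
    """Apply one shrinking operation to a binary image (list of rows).

    Pixels outside the image are treated as 0 (an assumption of this
    sketch). Every output pixel depends only on the input image.
    """
    rows, cols = len(img), len(img[0])

    def px(i, j):
        return img[i][j] if 0 <= i < rows and 0 <= j < cols else 0

    return [
        [
            h(
                h(px(i, j) + px(i, j - 1) + px(i - 1, j) - 1)
                + h(px(i, j) + px(i - 1, j - 1) - 1)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

In this sketch an isolated pixel vanishes after one step, while a larger component loses pixels from its top-left side and so shrinks toward its bottom-right corner.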
To perform the algorithm of labeling connected components by using the local shrinking operations on the MPA system, Algorithm 5 as in Figure 14 [...] Figure 15 shows the number of computation steps required for some low level and medium level vision algorithms by the MPA system. As shown in Figure 15, the total execution time, T, for a sequence of vision algorithms is dominated by the execution time of the step to label connected components. To show the speedup and throughput together, the APVIS is supposed to process multiple sequential images or to support multiple users and multiple vision programs. It is assumed that the number of MPA modules sufficient to support the processing of a single image is n and that the arrival rate of the images is $\lambda$. Thus, the inter-arrival time of incoming images is $1/\lambda$. In order to process the incoming images in real time, the incoming image tasks should not be queued. Thus, the service rate of the incoming image tasks should be higher than the arrival rate of incoming images. The service rate can be calculated as $n/T$. Therefore, if $\lambda T < n$, the APVIS can perform this application under the above environment in real time. Figure 16 shows the number of MPA modules required to perform this application in real time. When $\lambda = 10^{-6}$, the inter-arrival time of images is $10^6$ time units. If one time unit is one microsecond, then the inter-arrival time of images, $1/\lambda$, is one second. Figure 16 [...] low level vision tasks and the sum of the execution times for both some low level and medium level vision tasks, respectively. Since the execution time to label connected components, $\mathit{CCL}_{MPA}$, dominates the total execution time, T, the number of MPA modules shown in Figure 16(a) is hundreds of times smaller than that shown in Figure 16(b). 
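The real-time condition above (the service rate, n divided by T, must exceed the arrival rate) can be turned into a small calculation of the minimum number of MPA modules. The sketch below is illustrative only: it works with the inter-arrival time in integer time units to avoid rounding issues, the function name `min_mpa_modules` is hypothetical, and the step counts in the usage note are made-up inputs, not values from Figure 15 or Figure 16.

```python
def min_mpa_modules(total_steps, inter_arrival):
    """Smallest integer n whose service rate n / T exceeds the arrival rate.

    total_steps   -- total execution time T, in time units (int)
    inter_arrival -- time between incoming images, in the same units (int)

    The strict condition n / T > 1 / inter_arrival is equivalent to
    n > T / inter_arrival, hence the floor division plus one.
    """
    if total_steps < 0 or inter_arrival <= 0:
        raise ValueError("times must be non-negative and the interval positive")
    return total_steps // inter_arrival + 1
```

For example, with a hypothetical total of 2,500,000 time units per image and one image arriving every 1,000,000 time units, `min_mpa_modules(2_500_000, 1_000_000)` gives 3 modules.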
If the size of the incoming image is moderate, e.g., $256 \times 256$ or $512 \times 512$ pixels, an MPA system composed of a few or tens of MPA modules is capable of performing the low level and medium level tasks of the computer vision application in real time with an inter-arrival time of a few seconds. Figure 16(a) shows the case in which the MPA system is responsible for the low level vision tasks in real time and the powerful processors in the MBS, i.e., the HPs, perform the higher level vision tasks such as region segmentation. That is, in Figure 16(a) the medium to high level vision tasks are executed by the MBS and the low level vision tasks are executed by the MPA modules adaptively. Similarly, Figure 16(b) shows the case in which the MPA system is responsible for region segmentation as well as the low level vision tasks and the MBS is responsible for the remaining tasks. However, the number of MPA modules required in Figure 16(b) is very large, because the MPA system in the APVIS must then handle medium level tasks such as labeling connected components. Figure 16 shows that the APVIS is a more cost-effective approach than the pure SIMD approach. Consequently, Figure 16 indicates that the number of MPA modules in the APVIS can be determined according to the processing requirement of the given vision application.
Conclusions
The APVIS approach is to design a hybrid parallel system that can adapt to different types of parallelism in computer vision tasks. The APVIS is constructed by integrating two different types of parallel architectures tightly into a single machine. Operational interaction between these two systems is performed via the conventional subroutine calling mechanism to execute data parallel code on the MPA. Thus the existence of the MPA is transparent to the programmers. This research aims to design an underlying base architecture that can be optimally utilized for a broad range of vision tasks, from low level to high level processing algorithms. Another objective of the APVIS is to provide a balanced improvement in both speedup and system efficiency. Thus, the proposed APVIS can provide significant performance improvement and cost effectiveness for highly parallel applications having a mixed set of parallelisms. Also, an example application composed of a series of vision tasks, from low level and medium level processing steps, is mapped onto the MPA to show the relative performance gain. Consequently, the APVIS with a few or tens of MPA modules can perform any vision application in real time for multiple incoming images with an inter-arrival time of a few seconds.
