A theoretical memory with limited processing power and internal connectivity at each element is proposed. This memory carries out parallel processing within itself to solve generic array problems. The applicability of this in-memory finest-grain massive SIMD approach is studied in some detail. For an array of N items, it reduces the total instruction cycle count of universal operations such as insertion/deletion and match finding to ~1, local operations such as filtering and template matching to ~ the local operation size, and global operations such as sum, finding the global limit, and sorting to ~√N instruction cycles. It eliminates most streaming activities for data-processing purposes on the system bus. Yet it remains general-purpose, easy to use, pin-compatible with conventional memory, and practical for implementation. The increase of bus width or caching ability, and the small-scale usage of SIMD inside the CPU in the form of vector processors [5], only alleviate the problem to certain degrees [5] [6] [7].
Most activities on the system bus are still for processing purposes. The bus bottleneck problem seems to call for in-memory processing [5].
The earliest attempts using massive SIMD (single instruction, multiple data) approaches were enhanced versions of vector CPUs in special and dedicated systems [25]. Human logic tends to formulate serial solutions because both induction and deduction are serial in nature. So SIMD is more likely to be used in special steps of predominantly serial solutions, and it is unlikely to be the foundation of a general-purpose computer system. In-memory SIMD approaches using large-to-medium grain size [26] [27] [28] [29] do not seem to have clear advantages over MIMD approaches using a large number of in-memory small function units, because the former has a reduced processing bandwidth compared with a conceptual SIMD execution of the targeted application [28], while the latter is more flexible in usage. As a result, some of these SIMD approaches later evolved into mixed SIMD/MIMD approaches [28] [29]. Due to their nontrivial programming needs and control overheads, both approaches have compatibility issues with prevailing bus-sharing architectures and operating systems, which may be why both have received much less attention than the two prevailing MIMD approaches. So far, the only mainstream usage of pure in-memory massive SIMD is content addressable memory [30], whose success is due to (1) its specificity to a classic and ubiquitous array problem, (2) its distribution of processing power to fine-grain storage units, (3) its compatibility with prevailing bus-sharing architectures & operating systems, and (4) its trivial programming needs and minimal control overheads. The design of a smart memory family, the concurrent processing memory [9], or simply CPM, is a theoretical attempt to extend the success of content addressable memory to in-memory massive SIMD architecture in general.
At this moment, silicon integration [32] [33] has progressed rapidly in implementing Moore's law [34], and the silicon industry has entered a new era of billions of transistors per chip [35].
With such a huge transistor budget recently available, it is hotly debated [36] [37] [38] whether to establish parallelism at the instruction level [35], the thread level [37], or the data level [38].
Among them, data-level parallelism is the simplest in both programmability and hardware construct [38], while its weakness is its specificity to applications. With the advance in intranet speed and grid computing, a specific device such as an ultra-fast SQL engine can now be shared efficiently in a network by multiple tasks, which may create revived interest in using the in-memory massive SIMD approach for solving large-scale real-time array problems. The in-memory massive SIMD approach may now have a new life in a new era of silicon integration and grid computing.
The Concurrent Processing Memory

Architecture
Built upon the success of content addressable memory, and with silicon integration in mind, the design philosophy of CPM is to distribute limited and specific processing power to the smallest storage unit for its targeted application while maintaining compatibility with a traditional memory. Application specificity further means that the instruction set for each PE is limited to its targeted application(s) only, so that PE programming becomes trivial.
The basic rules for CPM, as shown in Figure 1 , are:
Rule 1.
A CPM is made of identical PEs (processing elements), each of which contains a fixed number of registers.
Rule 2.
Each PE has at least one addressable register, which can exclusively read from or write to an exclusive bus.
Rule 3.
Each PE has a unique element address, but no PE is aware of its own address, so that every PE exists in an identical environment.
Rule 4.
Multiple PEs can be activated concurrently if each of their element addresses is:
(1) no less than a start address, (2) no more than an end address, and (3) offset from the start address by an integer multiple of a carry number (a software sketch of this activation condition follows the rules below).
Rule 5.
Multiple activated PEs can concurrently read and execute the same instruction from a concurrent bus which broadcasts to all PEs.
Rule 6.
Multiple activated PEs can identify themselves concurrently.
Rule 7.
Neighboring PEs are connected so that a PE can read at least one register of each of its neighbors.
Rule 8.
There is an extra external command pin to indicate whether the address and data bus contains (1) an address and data or (2) an instruction for the CPM when it is enabled.
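As an illustration, the activation condition of Rule 4 can be stated as a simple predicate. The sketch below is a hypothetical software model, not part of the hardware specification; it computes the activation mask of a small PE array from a start address, an end address, and a carry number.

```python
# Hypothetical software model of Rule 4: a PE is activated iff its
# element address is (1) no less than the start address, (2) no more
# than the end address, and (3) offset from the start address by an
# integer multiple of the carry number.
def activation_mask(num_pes, start, end, carry):
    return [start <= a <= end and (a - start) % carry == 0
            for a in range(num_pes)]

# Example: 16 PEs, activate every 3rd PE from address 2 through 11.
mask = activation_mask(16, 2, 11, 3)
print([a for a, m in enumerate(mask) if m])   # -> [2, 5, 8, 11]
```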
As shown in Figure 1, a CPM has a control unit which controls every PE. The control unit may contain a micro-kernel [7] to translate instructions, cache instructions and data, make internal macro calls, and present results using normal synchronization techniques [6]. The micro-kernel may also hide the PE instruction set inside the memory, and expose an application-oriented instruction set to the user of the CPM. If the CPM has a higher execution rate than the system bus, its output also needs to be cached.
Except for Rule 4, all of the above rules have mature usages; e.g., Rule 5 is the signature of the SIMD approach in general [5], Rule 6 is used extensively in a content addressable memory [30], and Rule 8 is a common technique in programming devices using a bus [6]. Nor do the above rules necessarily capture high-performance designs. The PE activation in Rule 4 seems too rigid. The connectivity in Rule 7 is perhaps the most crude and least efficient among existing types of PE connectivity [5] [8]. But they may be the most practical ones for massive SIMD architectures specific to large-scale array problems. For example, if PE activation is done by a dedicated processor, then the number of PEs that can be activated or deactivated in each instruction cycle is limited by the word width of the processor, and is thus not suitable for a massive number of PEs, even though this approach is much more flexible in programming and is widely used in other SIMD and MIMD approaches. Instead, a general decoder implements Rule 4 in ~1 instruction cycle for any number of PEs. The set of rules of CPM captures the essence of a massive in-memory SIMD working in a traditional bus-sharing architecture, with identical PEs, minimal PE control overhead, the simplest PE-to-PE relation, a uniform PE-to-memory relation, and extreme specificity for its targeted application. As a massive SIMD approach, the silicon budget of each PE needs to be controlled carefully.
The construct of each PE is demonstrated at the logic design level [39] in this paper.
General Decoder
The ability to instantly activate PEs according to Rule 4 is crucial for the CPM. It is provided by a general decoder comprising (1) a carry-pattern generator, (2) a parallel shifter, (3) an all-line decoder, and (4) an AND gate array that combines the corresponding bit outputs from the parallel shifter and the all-line decoder.
A carry-pattern generator inputs a carry number (which is the carry number input for the general decoder) and activates all of its bit outputs whose addresses correspond to increments of the carry number.
The above expression can be generalized for an arbitrary number of inputs and transformed into standard product-of-sums format using either a K-map or the Quine-McCluskey method [6], and the carry-pattern generator can then be constructed using corresponding two-level gates.
The circuit diagram of a 3/8 all-line decoder is shown in Figure 3 .
The bit outputs of the parallel shifter are filtered through AND gates by the corresponding bit outputs of the all-line decoder, before becoming the bit outputs of the general decoder, as shown in Figure 4 .
If the carry number is a constant of 1 for Rule 4, the general decoder can be simplified. The start address is input into a first all-line decoder whose outputs are negatively assertive, and the end address is input into a second all-line decoder whose outputs are positively assertive. The corresponding outputs from the two all-line decoders are AND-combined, before becoming the bit outputs of the general decoder.
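The following software sketch mimics the general decoder's structure under the description above. The function names are illustrative only, and the all-line decoder is assumed here to assert every line whose address is no more than its input; the sketch produces the same mask as the Rule 4 predicate, while the hardware does so in ~1 instruction cycle.

```python
# Illustrative software model of the general decoder.
def carry_pattern(num_pes, carry):
    # Carry-pattern generator: active at every multiple of the carry number.
    return [a % carry == 0 for a in range(num_pes)]

def parallel_shift(pattern, start):
    # Parallel shifter: shift the pattern up by the start address.
    return [False] * start + pattern[:len(pattern) - start]

def all_line(num_pes, end):
    # All-line decoder (assumed): assert all lines up to the end address.
    return [a <= end for a in range(num_pes)]

def general_decoder(num_pes, start, end, carry):
    # The AND gate array combines the shifted pattern with the all-line outputs.
    shifted = parallel_shift(carry_pattern(num_pes, carry), start)
    return [s and e for s, e in zip(shifted, all_line(num_pes, end))]

mask = general_decoder(16, 2, 11, 3)
print([a for a, m in enumerate(mask) if m])   # -> [2, 5, 8, 11]
```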
Content Movable Memory
Construct and Basic Functionality
The structure of a content movable PE is shown in Figure 5. Each PE has only one addressable register, which can be read by its left and right neighboring PEs. Each PE has an additional temporary register, whose content can be copied to the addressable register. A multiplexer selects the addressable register of either the left or the right neighbor to copy to the temporary register. The concurrent bus has only two bits: one to select the output of the multiplexer, and the other to select which registers inside each PE to copy. The content of a neighboring addressable register is first copied to the temporary register, then to the addressable register of the PE. Thus, the temporary register only needs to remember its content for one clock cycle and can be made of a DRAM cell [6]. The copy operation is enabled by the enable line from the control unit. With the carry number being a constant of 1 for Rule 4, the contents of all the addressable registers within an address range can be moved concurrently. Content movable memory does not implement Rule 6 of CPM, and each of its PEs has no match line.
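A minimal software model of the concurrent move is sketched below (the helper name is hypothetical): every PE in the activated range latches its right neighbor's addressable register into its temporary register and then commits, so deleting an item costs ~1 instruction cycle regardless of the range length.

```python
# Software model of a concurrent left-shift in content movable memory:
# each PE in [start, end) latches its right neighbor's addressable
# register into its temporary register, then commits the copy.
def move_left(mem, start, end):
    temp = {i: mem[i + 1] for i in range(start, end)}  # latch concurrently
    for i, v in temp.items():                          # commit concurrently
        mem[i] = v

data = list("ABCXDEF")
move_left(data, 3, len(data) - 1)   # delete the 'X' at address 3
print("".join(data[:6]))            # -> ABCDEF
```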
Usage
A content movable memory can be used to manage data objects within itself. It can insert, delete, shrink, enlarge, or move data objects without overhead such as extensive copying [7]. Content movable memory is aimed at replacing both SRAM and DRAM due to its memory management functionality and its overall speed and density.
Programming techniques are tied to memory management. Traditional computer languages require explicitly or implicitly defining the size of an object before using it, so that a compiler can allocate the space for the object in a conventional memory [6] [41] [42]. The improper sizing of objects is a major source of errors, such as value overflow [6], value underflow [6] and buffer overrun [6]; while always allocating a very large size for safety wastes memory space and computation power. When using a conventional memory, such allocation and deallocation requirements are ubiquitous for all operating systems [41] and programming languages [42], either explicitly or implicitly. Such requirements favor procedural languages [42] over functional languages [42], because the latter tend to allocate most objects dynamically on the heap [42], creating the memory fragmentation problem. When using a content movable memory for programming, such size requirements can be dropped: for example, only one type is needed for integers and only one for floating-point values, and functional languages are favored over procedural languages, because the former are not tied to implementation details, have better optimization potential, and are more elegant in pursuing pure algorithmic goals [42].
Content Searchable Memory
A content searchable memory is similar to a content addressable memory [30], except that the content searchable memory has the smallest grain for searching, and it has local connectivity between neighboring PEs according to Rule 7 of CPM, to remove the length limit on the substring to search for and the alignment limit on the content to be searched.
Construct and Basic Functionality
The PE construct of a content searchable memory is shown in Figure 6 . It contains one addressable register and one storage bit register. The concurrent bus sends to every PE:
• A mask, which masks the content of the addressable register of the PE.
• A datum, whose value is compared with the masked data at an equal comparator.
• A comparison code of either = or ≠, which is matched against the output of the equal comparator, to become the comparison result bit. Using the datum and mask, more complicated comparison requirements, such as "do not care", can also be achieved.
• A self code. When it is true, the comparison result bit is saved into the storage bit register. Otherwise, the storage bit register becomes true only if the comparison result bit is true and the storage bit register of the right neighboring PE is also true.
Assume the string to be searched is loaded in a content searchable memory. A substring is then found by broadcasting its characters one by one and chaining the comparison results through neighboring PEs. Because most strings consist of either 8-bit or 16-bit characters, and in the most popular 16-bit character set the two bytes of each character have different formats [43], the addressable register should be byte-sized. The carry number is usually a constant of 1 for Rule 4, unless the content to be searched is structured, such as a look-up table.
Usage
Finding a substring in a large string is an extremely frequent operation which usually needs to be done at very high speed, as in web search. It logically calls for massive SIMD execution if the string to be searched is already in memory. When it is implemented by MIMD approaches, the algorithm is complicated, requiring pre-processing of the string to be searched for high-speed applications [3]. In contrast, content searchable memory finds a substring in an original string by concurrently matching each character of the substring, and it only takes ~M instruction cycles, in which M is the length of the substring. Thus, content searchable memory can vastly improve the efficiency of string-based content searches.
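A software model of this concurrent match is sketched below; for clarity it chains the storage bit through the left neighbor, which is equivalent to the construct above (chaining through the right neighbor) with the substring broadcast in reverse order.

```python
# Software model of substring search in content searchable memory:
# each broadcast character is compared in every PE concurrently, and
# the storage bit chains through a neighbor, so a substring of length
# M is located in ~M broadcasts regardless of text length.
def find_substring(text, sub):
    match = [False] * len(text)               # storage bit registers
    for k, ch in enumerate(sub):
        if k == 0:                            # self code true: start a match
            match = [t == ch for t in text]
        else:                                 # chain with the neighbor's bit
            match = [t == ch and i > 0 and match[i - 1]
                     for i, t in enumerate(text)]
    # A set storage bit marks the last character of a full match.
    return [i - len(sub) + 1 for i, m in enumerate(match) if m]

print(find_substring("abracadabra", "abra"))  # -> [0, 7]
```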
Content Change
It is easy to add the PE construct of content movable memory into the PE construct of content searchable memory, resulting in a CPM whose content can be searched concurrently and modified easily. Such a combination can apply to other types of CPM.
Content Comparable Memory
A content comparable memory extends the functionality of a content searchable memory from value matching to value comparing.
Construct and Basic Functionality
The PE construct of content comparable memory is shown in Figure 7 . It is very similar to the PE construct of content searchable memory shown in Figure 6 . The concurrent bus sends to every PE:
• A datum, whose value is compared with the masked data at a value comparator.
• A comparison code of either = or ≠ or < or > or ≤ or ≥, which is matched against the output of the value comparator in a match table, to become the comparison result bit.
• A select code bit, which selects the value of either the left or the right neighboring storage bit register as the selected bit.
• A self code bit, which selects either the selected bit, or the NAND combination of the comparison result bit and the current value of the storage bit register, as the input to the storage bit register.
• An update code bit, which is AND-combined with the comparison result bit to update the storage bit register with its input.
Any logic combination of the current values of the storage bit register and the comparison result bit can be constructed using a neighboring storage bit register that is not otherwise in use. In the worst cases, the data within the content comparable memory can still be accessed serially.
Thus, a content comparable memory can be used to implement SQL queries with vastly improved speed.
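For instance, a WHERE clause maps directly onto concurrent comparisons: one broadcast per predicate, with the storage bits combined in place. The sketch below is a software analogue with a made-up record layout.

```python
import numpy as np

# Software analogue of a content comparable memory evaluating
# "WHERE age >= 18 AND age < 65": one concurrent comparison per
# predicate, AND-combined into the storage bit registers.
age = np.array([12, 25, 70, 44, 18, 64, 65])   # one record per PE

bits = np.ones(len(age), dtype=bool)           # storage bit registers
bits &= (age >= 18)                            # broadcast datum 18, code >=
bits &= (age < 65)                             # broadcast datum 65, code <
print(np.flatnonzero(bits))                    # -> [1 3 4 5]
```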
Rule 4 of CPM requires that each array item be equal in size, so usually only the addresses of variable-size fields of large size can be stored in a content comparable memory; however, the variable-size fields themselves can be stored in a content searchable memory if quick content search capability is required.
Count for Statistics
By matching each section limit one by one, the histogram of M sections is constructed in ~M instruction cycles, which provides a basis for statistical characterization of the array.
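A software sketch of this counting scheme follows (the section limits are made-up): each limit is broadcast once, the matching PEs identify themselves (Rule 6), and the differences of the cumulative counts give the histogram.

```python
import numpy as np

# Histogram of M sections in ~M broadcasts: compare every item against
# each section limit concurrently, count the matching PEs, then take
# differences of the cumulative counts.
values = np.array([3, 7, 1, 9, 4, 6, 2, 8])
limits = [2, 5, 8, 10]                   # upper limits of M = 4 sections

cumulative = [int(np.count_nonzero(values <= lim)) for lim in limits]
print(np.diff([0] + cumulative))         # -> [2 2 3 1] items per section
```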
Content Computable Memory
A content computable memory is similar to content addressable processors [26] and associative processors [27] with local neighboring connectivity only. But content computable memory has more overall calculation power due to Rule 4 of CPM, even though both SIMD approaches have comparable processing power at the PE level. By processing all data related to each array item in each PE, such as all data related to one pixel in image processing, or all data related to one cell in modeling, each PE of content computable memory has the minimal grain size for the target application.
Dimension
In a 1-D content computable memory, each PE has two neighbors, of immediately higher and lower element addresses. In a 2-D content computable memory, each PE has four neighbors, and the PEs form a square lattice; the element address is partitioned into X and Y addresses, each of which obeys Rule 4 of CPM independently. Except for neighbor count, both types of content computable memory have identical PE constructs.
PE Construct and Basic Functionality
A content computable memory PE with a bit-serial logic is shown in Figure 8 . It has:
• Several data registers, shown as the 1st, 2nd, etc. data registers in Figure 8.
• A neighboring register, which can be read by its neighboring PEs.
• An operation register, which is involved in each operation.
• A match bit register, a status bit register and a carry bit register, shown as the squares marked by "M", "S" and "C" respectively in Figure 8 .
The concurrent bus sends a datum and an instruction to each PE. The instruction format is "condition: operation [bit] register[bit]", in which:
• One operand is the bit of the operation register at the first "[bit]" bit, which is selected by a multiplexer called bit multiplexer in Figure 8 .
• The other operand is the bit of the "register" at the second "[bit]" bit, which is selected by a multiplexer called register multiplexer in Figure 8 . The "register" could be any of its data registers, its neighboring register, or any of its neighbor's neighboring registers.
• The "[bit]" bit, the "register[bit]" bit, the status bit and the carry bit, together with their respective logic negation, are input into a multiplexer called condition multiplexer in Figure 8 , whose bit output V is selected by part of the "condition" code. V is combined with the datum bit D on the concurrent bus, a "compare" code bit C of the "condition" code, and current value of the M bit register according to Equation 7-1, to result in a Boolean value B.
Equation 7-1: B = M + C (V D + !V !D) + !C V;
• The "operation" decides which registers to write to. The Boolean value B can be saved into the match bit register. If B is true, the match bit register can be saved into (1) the status bit register, (2) the carry bit register, and (3) For simplicity of description of the following concurrent algorithms: (1) the operation registers of all the activated PEs are collectively referred to as the operation layer; (2) the neighboring registers of all the activated PEs are collectively referred to as the neighboring layer; (3) the neighboring layer of the PE whose element address is one less or one more than the activated PE is called the left layer or the right layer respectively; and (4) the neighboring layer of the PE whose Y element address is one less or one more than the activated PE is called the top layer or the bottom layer respectively. It is assumed that the values to be processed are always initially stored in the neighboring layer.
Local Operations
A special 1-D vector with an odd number of items is used to describe the content of the operation layer. When the operation layer is copied to or exchanged with the neighboring layer, successive operations are no longer additive, as in the following 3-point (1 2 1) Gaussian averaging algorithm:
1. Copy neighboring layer to operation layer.
2. Add left layer to operation layer.
3. Copy operation layer to neighboring layer.
4. Add right layer to operation layer. The result is in operation layer.
In the above algorithm, without Step 3, Step 4 would also be additive to Steps 1 and 2, and the algorithm result would be (1 1 1). When the result of a first operation A undergoes a second operation B, the overall operation C is expressed mathematically as C = A # B. This concept is extendable to 2-D local operations, such as a 9-point Gaussian averaging, which requires ~8 instruction cycles.
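The four-step (1 2 1) algorithm above can be modeled in software as follows, with each line acting on all PEs concurrently; np.roll stands in for reading the left or right layer, with wraparound boundaries for brevity.

```python
import numpy as np

# Software model of the 3-point (1 2 1) Gaussian averaging algorithm:
# four concurrent steps, independent of the array length N.
neighboring = np.array([0, 0, 8, 0, 0, 4, 0])

operation = neighboring.copy()           # 1. copy neighboring -> operation
operation += np.roll(neighboring, 1)     # 2. add left layer
neighboring = operation.copy()           # 3. copy operation -> neighboring
operation += np.roll(neighboring, -1)    # 4. add right layer
print(operation)                         # -> [0 8 16 8 4 8 4], i.e. (1 2 1)
```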
Generally, a local operation involving M neighbors takes ~ M instruction cycles.
Sum
To sum a one-dimensional array of N items, the array is divided into sections, each of which contains ~√N consecutive items. All sections are summed concurrently in ~√N instruction cycles, and the resulting ~√N section sums are then combined in another ~√N instruction cycles, for a total of ~√N instruction cycles.
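Under the section layout assumed above, the two phases can be modeled as follows; each phase stands for ~√N concurrent instruction cycles.

```python
import numpy as np

# Two-phase concurrent sum: ~sqrt(N) cycles to sum all sections in
# parallel, then ~sqrt(N) cycles to combine the section sums.
N = 16
data = np.arange(1, N + 1)
M = int(np.sqrt(N))                             # section size ~ sqrt(N)

section_sums = data.reshape(-1, M).sum(axis=1)  # phase 1: all sections at once
print(section_sums.sum())                       # phase 2: ~sqrt(N) sums -> 136
```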
Find Global Limit
A procedure similar to the sum can be used to find the global limit (i.e., the minimum or maximum) of an array.
Search for Template
In contrast to a substring search, the result of a template search does not have to match the template exactly. The difference between the template and a section of the original data is captured by a matching value. The goal of a template search is to find the best match or all acceptable matches. Template search is the foundation of image pattern recognition [2].
To search for a template of size M, the array is divided into N/M sections, each of which contains M consecutive items. The algorithm diagram is shown in Figure 11. In Step 1, the template to be matched is concurrently loaded into all sections in ~M instruction cycles. Then the point-to-point absolute difference is calculated concurrently for all points in ~1 instruction cycle, which is omitted from the algorithm flow diagram. In Step 2, the differences in all sections are summed concurrently into a matching value for each section.
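A software model of Steps 1 and 2 is sketched below; the section-aligned layout follows the text, while the per-section summation is an assumption consistent with the sum algorithm above.

```python
import numpy as np

# Template search: load the template into all N/M sections (~M cycles),
# take point-to-point absolute differences concurrently (~1 cycle),
# then sum within each section into a matching value.
template = np.array([2, 5, 2])
data = np.array([1, 1, 1, 2, 5, 2, 3, 6, 2])    # N = 9 items, M = 3

sections = data.reshape(-1, len(template))      # N / M sections
matching = np.abs(sections - template).sum(axis=1)
print(matching)                                 # -> [6 0 2]; section 1 is best
```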
Sort an Array
By asking all elements to identify themselves if their left layer is larger than their neighboring layer, the disordered items, which are the items stored in the neighboring layer that need to be moved for small-to-large order, can all be found immediately. Thus a sorting algorithm can stop immediately if no further sorting is required. The disordered item count also guides the sorting direction. If initially the disordered item count for small-to-large sorting is more than that for large-to-small sorting, the array should be sorted into large-to-small order, since sorting an array in either order is functionally equivalent. Thus, the worst case for sorting, sorting a nearly sorted array into the other order, can be avoided. A nearly sorted array contains three types of point defects:

• Peak: an insertion of a larger item into an otherwise ordered neighborhood. To restore order, the peak item can be moved to the left of the left-most item to its right which is larger than it, or to the right end of the sequence, in ~2 instruction cycles.
• Valley: an insertion of a smaller item into an otherwise ordered neighborhood. To restore order, the valley item can be moved to the right of the right-most item to its left which is smaller than it, or to the left end of the sequence, in ~2 instruction cycles.
• Fault: an exchange of two already sorted neighboring items. To restore order, the two faulty items can be exchanged in ~1 instruction cycle.
Only ~1 instruction cycle is required to concurrently exchange all even- and odd-numbered neighboring items once toward small-to-large order. Alternately repeating the exchange of (1) even- and odd- and (2) odd- and even-numbered neighboring items makes a local exchange sorting algorithm, which is good at removing random local disorders. Using the local exchange sorting algorithm, an originally randomly ordered array quickly becomes largely sorted, with only a few point defects. When M is sufficiently large, after M instruction cycles, the average distance between the remaining point defects is ~M. However, the local exchange sorting algorithm is inefficient at sorting a nearly sorted array, with peaks and valleys moving toward their sorting destinations one step at a time.
In contrast, a global moving sorting algorithm removes the peaks and valleys in a nearly sorted array and inserts them into their proper places. Using content computable memory, the concurrent detection of all of the above point defects in each 4-item neighborhood requires ~4 instruction cycles, the concurrent detection of the destination position of a peak or valley takes ~1 instruction cycle, and each insertion takes ~2 instruction cycles. Thus, the global moving sorting algorithm is very efficient at sorting a nearly sorted array. To sort an originally randomly ordered array, if the local exchange sorting algorithm is first used for M instruction cycles and the sorting is then finished by the global moving sorting algorithm, the total instruction cycle count is ~(M + N/M), which has a minimum of ~√N when M ~ √N.
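A software model of the local exchange phase is sketched below; each pass compare-and-swaps all even-odd (or odd-even) neighboring pairs at once, standing for ~1 instruction cycle. The global moving phase is omitted for brevity.

```python
import numpy as np

# Local exchange sorting: alternately compare-and-swap (even, odd) and
# (odd, even) neighboring pairs, all pairs concurrently per pass.
def local_exchange_pass(a, offset):
    stop = len(a) - (len(a) - offset) % 2     # drop an unpaired tail item
    pairs = a[offset:stop].reshape(-1, 2)     # view onto neighboring pairs
    pairs.sort(axis=1)                        # swap each disordered pair

a = np.array([5, 1, 4, 2, 8, 3, 7, 6])
for step in range(len(a)):                    # ~N passes sort fully; ~sqrt(N)
    local_exchange_pass(a, step % 2)          # passes leave sparse defects
print(a)                                      # -> [1 2 3 4 5 6 7 8]
```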
Thresholding
With their multiple dimensions of data, image processing and modeling generally require a large amount of calculation, which is proportional to the size of the data in each dimension.
Using a conventional bus-sharing MIMD computer, the instruction cycle count is linearly proportional to the amount of calculation. Thus, to solve a problem in a realistic time period, thresholding [2] is frequently used to ignore a large amount of data in subsequent processing.
Thresholding is a major problem [2] , because proper thresholding is difficult to achieve, and thresholding in different processing stages may interact with each other.
Using a content computable memory, the instruction cycle count is decoupled from the amount of calculation and is independent of the size of the data in each dimension. Thus, thresholding can be used only in the last stage, to qualify the result. Also, thresholding itself is reduced to a ~1 instruction cycle operation.
Line Detection
Due to its neighbor-to-neighbor connectivity, a 2-D content computable memory can treat the line detection problem as a neighbor-counting problem. To detect an edge line of pixel length L lying exactly along the X direction to the left of each pixel, the neighbor-counting algorithm is direct:
1. All pixels concurrently subtract the raw intensity of their bottom layer from that of their top layer, and store the result in the neighboring layer. 
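A sketch of this first step follows, together with a hypothetical continuation that sums the gradients of the L pixels to the left of each pixel, which is one plausible reading of the neighbor-counting algorithm (wraparound boundaries for brevity).

```python
import numpy as np

# Step 1: every pixel concurrently subtracts its bottom neighbor's raw
# intensity from its top neighbor's, giving a vertical gradient.
image = np.zeros((6, 8), dtype=int)
image[:3, :] = 10                        # a horizontal edge along X

gradient = np.roll(image, 1, axis=0) - np.roll(image, -1, axis=0)

# Hypothetical continuation: accumulate the gradients of the L pixels
# to the left of each pixel (~L instruction cycles) as an edge score.
L = 4
score = sum(np.roll(gradient, k, axis=1) for k in range(1, L + 1))
print(score[3])                          # strong response along the edge row
```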
Conclusion and Discussion
As an in-memory massive SIMD approach, CPM seems able to vastly improve the solutions to typical array problems within the framework of a traditional bus-sharing architecture running a prevailing multi-task operating system. Each type of CPM is specialized for a particular application in both its hardware construct and its software instruction set.
One important issue for the wide acceptance of in-memory massive SIMD architecture is its ability to be shared by multiple tasks. CPM allows task switching by allowing exclusive writes to one set of its addressable registers while concurrently operating on another set of its registers within the same memory. However, while a traditional task is completely captured by its register set, a SIMD task has data distributed all over the SIMD device, so the cost of a traditional task switch for a SIMD task may be too high. How to best incorporate such a SIMD task into a currently prevailing preemptive multitask operating system [5] [6] remains a question. A massive amount of data for a pending task needs to be provided to the SIMD device, which may make the bus bottleneck problem even worse; thus it may be worthwhile to introduce an additional bus with DMA capability dedicated to SIMD devices for task switching purposes. On the other hand, as a family member of CPM, the content movable memory may greatly simplify the memory management function of current operating systems. Thus, CPM may raise new requirements and new opportunities for current operating systems.
Another important issue is whether the set of rules for CPM is too restrictive in general, even though content movable memory, content searchable memory and content comparable memory all fit within it. Relaxing the rules brings additional hardware & software costs; e.g., according to Figure 16, the instruction set of each PE may depend on its element address, though on each level of super connectivity SIMD is still preserved. Whether such cost is justified for a massive SIMD approach needs further study.
As a theoretical discussion focusing on the applicability of an in-memory massive SIMD approach, this paper has left out all technical implementation questions, such as (1) how to broadcast the same instruction to a massive number of PEs on the same chip at decent speed, (2) how to deal with the grounding problem when a massive number of PEs change states concurrently, (3) how to chain CPMs of the same type together to achieve higher capacity, like chaining real memories, and (4) what the expected clock rate and number of PEs would be using today's technology.
Although these questions are vital for CPM, and they are probably very challenging questions in terms of currently available technologies, they can be solved only if there is a need to solve them; further studies are needed on the worth and practicability of CPM.
Simple physics may give rough estimates for the above questions. For example, assume that all PEs of a CPM are laid out on a square lattice, and a dedicated routing layer is used for each bit of the concurrent bus. Such a routing layer has a speed limit due to its capacitance and resistance, which depends on the overall size and the thickness of the copper layer. For a clock rate on the order of 1 GHz, the overall delay must be less than 0.5×10⁻⁹ sec, so the overall size L should be no more than ~0.5 mm. For content movable memory, a rough layout using 20-nm technology requires only a small silicon area per PE. Within the same chip, due to SIMD simplicity, the CPM only needs simple caching logic for input & output and simple synchronization logic between synchronized PE arrays, neither of which should occupy a significant amount of silicon area. Thus, at least content movable memory seems practical using today's best technologies.
As an independent research effort, the author of this paper feels indebted to the encouragements he has received during this work.
