Instruction Set For Dense Data
Technical Disclosure Commons 
Defensive Publications Series 
March 2021 
Recommended Citation 
N/A, "Instruction Set For Dense Data", Technical Disclosure Commons, (March 05, 2021) 
https://www.tdcommons.org/dpubs_series/4131 
This work is licensed under a Creative Commons Attribution 4.0 License. 
This Article is brought to you for free and open access by Technical Disclosure Commons. It has been accepted for 
inclusion in Defensive Publications Series by an authorized administrator of Technical Disclosure Commons. 
Instruction Set For Dense Data 
ABSTRACT 
Multi-core computing speeds up computation by parallelizing tasks among several 
simultaneously running compute cores. With multiple cores, data should be rapidly and 
near-simultaneously accessible by each core, which places heavy stress on the CPU-memory bus. 
Data formats are typically uncompressed, i.e., they assign a fixed number of bits to a data type, 
no matter how small the actual values are, and without regard to the distribution of those values. 
Data in uncompressed form consumes substantial memory, cache capacity, and memory 
bandwidth. This disclosure describes techniques that enable a compute core to fetch data in units 
of bits, rather than bytes. The core operates directly on compressed, prefix-encoded data 
streams; fetches only a few bits (instead of one or more bytes, as is typical); and operates on that 
compressed data. The techniques improve the efficiency of multi-core computing by reducing 
the load on the cache, the memory, and the core-memory buses. 
KEYWORDS 
● Entropy coding 
● ANS coding 
● Golomb-Rice coding 
● Prefix coded stream 
● Memory bandwidth 
● Cache bandwidth 
● Cache latency 
● Multicore 
BACKGROUND 
With the slowdown in Moore's law, further progress in computing speed and efficiency 
likely depends on improvements in algorithms and data representation. One technique to 
improve computing speed is parallelization, which involves the use of several simultaneously 
running compute cores. 
For optimal efficiency in processors, data is maintained close to the execution units. With 
multiple cores, it is useful for data to be near-simultaneously and rapidly accessible by each core. 
This requirement can stress the computer architecture in general and the CPU-memory buses in 
particular. To enable simultaneous data availability to multiple cores, a relatively complex 
hierarchical caching scheme is used. Such a caching scheme typically has three levels of caches - 
L1, L2, and L3 - with L1 and L2 dedicated to a single core, and the L3-cache operable at the chip 
level. For example, the caches may be sized as follows: L1 at around 32 kB, L2 at about 256 kB, 
and L3 with several megabytes. 
Data formats typically assign a fixed number of bits to a data type or symbol. For 
example, characters are encoded using 8 bits, and integers are encoded using 32 or 64 bits, no 
matter how small the actual values are, and without regard to the distribution of those values. 
Before a compute core accesses it for execution, compressed data such as video, images, audio, 
or text is first converted into an intermediate, uncompressed form. Data in uncompressed form 
consumes substantially more memory, cache capacity, and bandwidth between the cache and the 
core than the same data in compressed form. 
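The cost of fixed-width formats for small values can be illustrated with a short 
calculation. The sketch below is only an illustration: the sample values and the 
Golomb-Rice parameter k = 1 are assumptions chosen for the example and do not come 
from any measured workload. 

```python
def rice_code_length(value: int, k: int) -> int:
    """Bits used by a Golomb-Rice code with parameter k for a non-negative integer.

    The quotient (value >> k) is written in unary (that many 1-bits plus a
    terminating 0-bit), followed by the k low-order remainder bits.
    """
    return (value >> k) + 1 + k


# Hypothetical run of small integers, e.g. residuals after prediction.
values = [0, 3, 1, 7, 2, 0, 5, 1]

fixed_bits = 32 * len(values)                                # one 32-bit word per value
rice_bits = sum(rice_code_length(v, k=1) for v in values)    # variable-length code

print(f"fixed-width: {fixed_bits} bits, Golomb-Rice (k=1): {rice_bits} bits")
```

For these assumed values, the variable-length code needs roughly a tenth of the bits 
of the fixed 32-bit layout; the actual ratio depends entirely on the value distribution. 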
DESCRIPTION 
This disclosure describes techniques that enable a compute core to fetch data in units of 
bits, rather than bytes. The core can directly operate on compressed, prefix-encoded data 
streams; fetch only a few bits, or just as many as needed (instead of one or more bytes, as is 
typical); and operate on that compressed data. The substantial data-compression ratios of video 
(5-to-200), image (3-to-10), or audio (3-to-5) codecs translate into an almost equivalent 
reduction of the load on memory capacity, cache capacity, and memory bandwidth. 
Per the techniques, the following instructions are added to the instruction set of the CPU, 
which enable random, bitwise addressing of data from a prefix-encoded stream. 
● LOAD NBITS, INREG1, INREG2, OUTREG
○ NBITS: number of bits to be read 
○ INREG1: input register 
○ INREG2: input register 
○ OUTREG: output register 
The input and output are both bit-level addresses, the construction of which is illustrated 
by the following examples. 
For operating with a single prefix-entropy code, or for bit-loading: The input and output 
are byte-pointers with three additional bits to indicate a bit position, i.e., the byte-address 
and the bit-address. 
For operating with multiple prefix-entropy codes available in the CPU: The input and 
output include the byte-address, the bit-address, and the entropy-code ID. 
For operating with a single ANS-entropy code available in the CPU: The input and 
output include the byte-address, the bit-address, and the ANS-state. 
For operating with multiple ANS-entropy codes: The input and output include the byte-
address, the bit-address, the entropy code ID, and the ANS-state. 
Upon the execution of a LOAD command, the address is automatically incremented by 
NBITS bits. However, for prefix-coded data streams, the input argument NBITS is 
optional, since a symbol of a prefix code is uniquely decodable when a sufficient number 
of bits becomes available. The output register OUTREG holds the integer value 
represented by those bits. 
● STORE NBITS, INREG, OUTREG
○ NBITS: number of bits to be written 
○ INREG: input register 
○ OUTREG: output register 
The input and output are both bit-level addresses, the construction of which is illustrated 
by the following examples. 
For operating with a single prefix-entropy code, or for bit-loading: The input and output 
are byte-pointers with three additional bits to indicate a bit position, i.e., the byte-address 
and the bit-address. 
For operating with multiple prefix-entropy codes available in the CPU: The input and 
output include the byte-address, the bit-address, and the entropy-code ID. 
For operating with a single ANS-entropy code available in the CPU: The input and 
output include the byte-address, the bit-address, and the ANS-state. 
For operating with multiple ANS-entropy codes: The input and output include the byte-
address, the bit-address, the entropy code ID, and the ANS-state. 
Upon the execution of a STORE command, the address is automatically incremented by 
NBITS bits. The input register INREG holds the value of the bits to be written. 
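The behavior of the two instructions for plain bit-loading can be approximated in 
software. The following is a minimal sketch, not a definition of the instructions: the 
class name BitMemory, the packing of an address as (byte-address << 3) | bit-address, 
and the MSB-first bit order within a byte are all assumptions made for illustration, 
since the disclosure does not fix these details. 

```python
from __future__ import annotations


class BitMemory:
    """Toy byte memory accessed through bit-level addresses.

    Assumption: a bit-level address is (byte_address << 3) | bit_position,
    i.e. a byte pointer extended by three bits, with bits numbered MSB-first.
    """

    def __init__(self, size_bytes: int) -> None:
        self.mem = bytearray(size_bytes)

    def load(self, nbits: int, addr: int) -> tuple[int, int]:
        """Emulate LOAD: return (integer value of the bits, addr advanced by nbits)."""
        value = 0
        for i in range(addr, addr + nbits):
            bit = (self.mem[i >> 3] >> (7 - (i & 7))) & 1
            value = (value << 1) | bit
        return value, addr + nbits

    def store(self, nbits: int, value: int, addr: int) -> int:
        """Emulate STORE: write the low nbits of value, return addr advanced by nbits."""
        for offset in range(nbits):
            bit = (value >> (nbits - 1 - offset)) & 1
            i = addr + offset
            mask = 1 << (7 - (i & 7))
            if bit:
                self.mem[i >> 3] |= mask
            else:
                self.mem[i >> 3] &= ~mask & 0xFF
        return addr + nbits


memory = BitMemory(4)
write_addr = memory.store(3, 0b101, 0)             # 3-bit field at bit address 0
write_addr = memory.store(11, 0x2AB, write_addr)   # 11-bit field straddling bytes

value_a, read_addr = memory.load(3, 0)
value_b, read_addr = memory.load(11, read_addr)
print(value_a, hex(value_b), read_addr)
```

In this sketch the returned address plays the role of the automatically incremented 
bit-level address, so consecutive fields can be read or written without any explicit 
pointer arithmetic by the caller. 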
Instructions are similarly defined when Golomb-Rice, Huffman, or other entropy codes are used 
to compress the incoming data streams. As illustrated above, for ANS-coded data streams, the 
progression of pointers can be achieved by manipulating the ANS state. A simple probability 
model for a low symbol-count distribution can be stored in a special register, e.g., one of the 
single-instruction-multiple-data (SIMD) registers. The dictionary used to decode the incoming 
data stream can be stored in core registers, or can be part of the LOAD command (e.g., in the 
form of a pointer to the dictionary table). 
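A dictionary-driven LOAD on a prefix-coded stream, where NBITS is omitted, might behave 
like the following software emulation. This is a sketch under stated assumptions: the 
four-symbol code in CODE_TABLE, the function name load_symbol, and the MSB-first bit 
order are hypothetical and serve only to show that the symbol boundary is found from 
the code itself. 

```python
from __future__ import annotations

# Hypothetical dictionary for a four-symbol prefix code: code word -> symbol.
CODE_TABLE = {"0": 0, "10": 1, "110": 2, "111": 3}


def load_symbol(data: bytes, addr: int, table: dict[str, int]) -> tuple[int, int]:
    """Emulate a LOAD without NBITS on a prefix-coded stream.

    Bits are consumed one at a time from bit-level address addr until they
    match a code word in the table. Returns (symbol, advanced bit address).
    """
    prefix = ""
    while prefix not in table:
        if addr >= 8 * len(data):
            raise ValueError("ran past the end of the stream")
        bit = (data[addr >> 3] >> (7 - (addr & 7))) & 1
        prefix += str(bit)
        addr += 1
    return table[prefix], addr


# The symbols 2, 0, 1, 3, 0 encode to the bits 110 0 10 111 0, padded with zeros.
stream = bytes([0xCB, 0x80])

addr = 0
symbols = []
for _ in range(5):
    symbol, addr = load_symbol(stream, addr, CODE_TABLE)
    symbols.append(symbol)
print(symbols, addr)
```

Five symbols occupy ten bits here, and the address advances by a different amount for 
each symbol, which is why an explicit bit count is not needed for prefix codes. 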
Bit-wise addressing of data streams, as described herein, leverages the high compression 
ratios of data sources such as video, images, etc. and obviates traditional, multi-step bit 
operations used to isolate bits of interest, e.g., left-shift, right-shift, mask, etc. 
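For contrast, the conventional multi-step approach on a byte-addressed machine can be 
sketched as below. The function name extract_bits_shift_mask and the MSB-first bit 
numbering are assumptions for illustration; the point is only that a byte-granular 
fetch, a shift, and a mask are needed to isolate a field that a bit-addressed LOAD 
would return directly. 

```python
def extract_bits_shift_mask(data: bytes, bit_offset: int, nbits: int) -> int:
    """Isolate nbits starting at bit_offset (MSB-first) using byte loads, a shift, and a mask."""
    first = bit_offset >> 3                                   # first byte touched
    last = (bit_offset + nbits - 1) >> 3                      # last byte touched
    window = int.from_bytes(data[first:last + 1], "big")      # byte-granular fetch
    right_shift = 8 * (last - first + 1) - (bit_offset & 7) - nbits
    return (window >> right_shift) & ((1 << nbits) - 1)       # shift, then mask


stream = bytes([0xCB, 0x80])
field_a = extract_bits_shift_mask(stream, 0, 3)   # bits 0..2, inside one byte
field_b = extract_bits_shift_mask(stream, 6, 3)   # bits 6..8, straddling two bytes
print(field_a, field_b)
```

Note that the second field requires fetching two full bytes to obtain three bits. 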
In this manner, the techniques of this disclosure enable the storage and processing of data 
in a compact form. The techniques improve the efficiency of multi-core computing by reducing 
the load on the cache, the memory, and the core-memory buses. In an example application, the 
techniques can be employed to enable video-decoding hardware to keep symbols in compressed 
form until dedicated hardware consumes the data directly in the compressed form. 
CONCLUSION 
This disclosure describes techniques that enable a compute core to fetch data in units of 
bits, rather than bytes. The core operates directly on compressed, prefix-encoded data streams; 
fetches only a few bits (instead of one or more bytes, as is typical); and operates on that 
compressed data. The techniques improve the efficiency of multi-core computing by reducing 
the load on the cache, the memory, and the core-memory buses.
