CVC Verilog Compiler -- Fast Complex Language Compilers Can be Simple by Meyer, Steven
arXiv:1603.08059v2 [cs.PL] 12 Jan 2018
CVC Verilog Compiler – Fast Complex
Language Compilers Can be Simple
Steven Meyer
Tachyon Design Automation
Boston, MA 02109, USA
smeyer@tachyon-da.com
Abstract
This paper explains how to develop Verilog hardware description
language (HDL) optimized flow graph compiled simulators. It is
claimed that the methods and algorithms described here can be ap-
plied in the development of flow graph compilers for other complex
computer languages. The method uses the von Neumann computer
architecture (MRAM model) as the best abstract model of computation
and uses comparison and selection of alternative machine code
sequences to utilize modern processor low level parallelism. By using
the anti-formalist method described here, the fastest available
full IEEE 1364-2005 Verilog HDL standard simulator has been
developed. The compiler only required 95,000 lines of C code and
two developers. This paper explains how such a compiled simulator
validates the anti-formalism computer science methodology best
expressed by Peter Naur’s datalogy and provides specific guide-
lines for applying the method. Development history from a slow
interpreter into a fast flow graph based machine code compiled sim-
ulator is described. The failure of initial efforts that tried to convert
a full 1364 compliant interpreter into interpreted execution of pos-
sibly auto generated virtual machines is discussed. The argument
that fast Verilog simulation requires detail removing abstraction is
shown to be incorrect. Reasons parallel GPU Verilog simulation
has not succeeded are given.
1. Introduction
This paper presents a novel method for implementing fast compil-
ers for complex computer languages. The simple organization and
development methods used to create a fast Verilog hardware de-
scription language (HDL) machine code simulator are described.
In electronic design, systems are expressed as hardware description
language code. The HDL describes both procedural parts of elec-
tronic systems and declarative gate level parts. In the case of Ver-
ilog, the HDL is a low level language similar to Pascal (Wirth 1975)
with added low level parallelism and electronic gate descriptions.
The Verilog compiler translates the HDL description into machine
code for X86 and X86_64 machines running Linux. The machine code is
executed to simulate electronic system behavior after fabrication.
All electronic design automation (EDA) tools (and also physical
design tools not discussed here) run only on enterprise quality Linux
because of the large size of EDA system descriptions.
The organization of the compiled simulator can be viewed as
a modern analog of the multi-pass compilers applying problem
specific methods used to develop early compilers (Gries 1971)
(pp. 8-10 for history; pp. 410, 411 and 451-454 for methods). The early
compilers were developed by scientists who were mostly trained as
physicists. See Budiansky’s account of the role of English physicist
Patrick Blackett’s department during WWII that was perhaps the
first use of modern algorithmic thinking (Budiansky 2013).
Computers are now so fast and are equipped with so much fast
random access memory that the various (virtual) passes can be ex-
ecuted without storing results of each pass in secondary storage.
The organization in which one unified representation is continu-
ally modified and transformed is similar to the approach Peter Naur
used in developing the Gier Algol compiler (Gries 1971) (p. 9). For
example, pass 6 for a compiler run on a machine with only 128,000
words of memory checked types of identifiers and operands (and
updated information) then converted to Polish notation and output
the result for the next code generation pass. In the simulator de-
scribed here, the code generator does similar checking but converts
the source program (really the internal representation of source be-
cause of ample memory) into a flow graph of basic blocks that con-
tain virtual machine instructions.
2. Simulator Description
The Verilog hardware description language is defined by the IEEE
1364-2005 Verilog HDL standard (IEEE 2005). The standard in-
cludes both register transfer level (RTL) procedural descriptions
and gate level descriptions with rules for annotating accurate delay
information from manufacturing process measurements.
The simulator described here was developed by two people in
less than 10 years. The ten years included training a young programmer
who started as a college student intern. A majority of the time was
spent implementing the elaboration code for the generate feature
added by the 1364 standard committee. Generate is almost incompatible
with the instance based display machine register technology
required for very fast Verilog simulation (idp area).
The simulator consists of about 330,000 lines of C code, which
is 10% or less of the size of the competing commercial quality Verilog
full accuracy simulators. The simulator was first developed as
an interpreted Verilog simulator, then the flow graph based com-
piler was added. The Verilog elaborator (parser, fix up and simu-
lation preparation phases) is about 100,000 lines (numbers are ap-
proximate because there are common routines and overlap). Of the
100,000 lines, 20,000 or 20% are needed for the complicated Ver-
ilog 2005 generate feature.
Generate allows compile time variables called parameters to
change HDL variable sizes and design instance hierarchy structure.
This feature creates instance specific constants that must be treated
as variables during simulation. The other Verilog simulators flatten
designs, which results in much larger memory use for designs with
many repeated instances and makes based addressing difficult. It
is common for an IC to contain millions of instances of a latch or
flip flop macro cell. When the flattened elaboration method is used,
each instance requires storing not just its per instance state
information but also its own copy of the machine instructions. In the
simulator described here, only the machine code to simulate one
instance, the per instance state information and the per instance
base address changing machine instructions need to be stored. The
development of the Verilog 1364 standard by committee has continually
made Verilog more complicated (the language reference manual is
590 pages) and has made it more difficult to implement the simulator’s
per instance base address register algorithms. The display
offset algorithm results in a simpler code generator and faster
simulation at the cost of a much more complicated elaboration phase.
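The per instance base address idea can be illustrated with a minimal C sketch. It is not taken from the simulator source: the names dff_state, dff_eval and clock_all are hypothetical, and a real instance data area holds far more than three bytes. The point is only that one copy of the evaluation code serves every instance because the instance is selected by a base pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per instance state for a D flip flop macro cell. */
struct dff_state {
    uint8_t d;        /* data input, 0 or 1 */
    uint8_t q;        /* stored output, 0 or 1 */
    uint8_t last_clk; /* previous clock value, used for edge detection */
};

/* One shared routine: the instance is selected by the base pointer. */
static void dff_eval(struct dff_state *idp, uint8_t clk)
{
    if (clk && !idp->last_clk)   /* rising clock edge */
        idp->q = idp->d;
    idp->last_clk = clk;
}

/* Clock every instance using the single shared routine. */
static void clock_all(struct dff_state *insts, size_t n, uint8_t clk)
{
    for (size_t i = 0; i < n; i++)
        dff_eval(&insts[i], clk);
}
```

With a flattened design the body of dff_eval would be duplicated once per instance. Here memory grows only with the size of the state array.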
About 50,000 lines are needed to execute interpreted simulation.
The compiler itself is only about 95,000 lines of which 15,000
are executable binary support libraries. 70,000 lines are needed to
implement miscellaneous features used both by the compiler and
the interpreter: SDF delay annotation, four programming language
interface APIs (tf_, acc_, vpi_ and dpi_), a debugger for the
interpreter, toggle coverage recording and report generation, rarely
used switch level simulation, plus an expression evaluation variant
called X propagation that implements a more pessimistic unknown
(X state) evaluation algorithm.
3. Relation to Naur’s Datalogy and
Anti-Formalism
The theoretical background behind this simulator’s development
method follows Naur’s methods. In the 1990s Peter Naur, one of the
founders of computer science, realized that CS had become formal
mathematics separated from reality. Naur advocated the importance
of programmer specific computer program development that does
not use preconceptions. The clearest explanation of Naur’s method
that was used in developing this compiler appears in the book
Conversations - Pluralism in Software Engineering (Naur 2011). This
book amplifies the program development method Naur described
in his 2005 Turing Award lecture (Naur 2007). In (Naur 2011) page
30, the interviewer asks “... you basically say that there are no foun-
dations, there is no such thing as computer science, and we must
not formalize for the sake of formalization alone”. Naur answers,
“I am not sure I see it this way. I see these techniques as tools which
are applicable in some cases, but which definitely are not basic in
any sense.” Naur continues (p. 44) “The programmer has to real-
ize what these alternatives are and then choose the one that suits
his understanding best. This has nothing to do with formal proofs.”
Einstein described this as the 20th century split between axiomat-
ics and reality by saying that “axiomatics purges mathematics of
all extraneous elements” which makes it evident “that mathematics
as such can not predict anything” about reality (Einstein 1921). See
also (Naur 2005) and (Meyer 2013) for more detailed discussion of
Naur’s anti-formalism.
Current compiler development methodology chooses formal al-
gorithms over Naur’s programmer specific approach that rejects
pre-suppositions. During the development of this compiler, anoma-
lies in mathematical foundations of logic that current computer sci-
ence takes as truth beyond criticism influenced design decisions.
The first is Paul Finsler’s proof that the continuum hypothesis is
true (Finsler 1969). The proof is only indirectly related to computer
science. The second example is Juris Hartmanis’ proof that P=NP in
MRAM (parallel RAM) models (Hartmanis and Simon 1976). This
power of MRAM machines contributed to looking for other sources
of parallel speed improvement and motivated using von Neumann
machine architecture targeted language specific recursive descent
parsing.
A more direct result of rejecting formalism is the choice to
use the fast Cooper-Kennedy dominator algorithm (Cooper et al.
2006). This choice, and the algorithm’s use in building the crucial
define-use list optimization data structure, is easy once one starts
with skepticism toward algorithms that have been formally proven to
be correct. The Cooper algorithm’s “real” speed is viewed as
falsification of the axioms used in concrete complexity theory.
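To show how simple the iterative dominator calculation is, here is a minimal C sketch in the style of the Cooper et al. algorithm. It is an illustration, not the simulator's code. It assumes the basic blocks are already numbered in reverse postorder with block 0 as the entry, and the names bblk, intersect and compute_idoms are hypothetical.

```c
#include <stddef.h>

#define MAX_PREDS 4
#define UNDEF (-1)

/* Hypothetical basic block record: blocks are assumed to be numbered in
   reverse postorder, with block 0 as the entry block. */
struct bblk {
    int npreds;
    int preds[MAX_PREDS];
};

/* Walk both fingers up the partial dominator tree until they meet. */
static int intersect(const int *idom, int f1, int f2)
{
    while (f1 != f2) {
        while (f1 > f2) f1 = idom[f1];
        while (f2 > f1) f2 = idom[f2];
    }
    return f1;
}

/* Fill idom[] with immediate dominators; idom[0] is 0 by convention. */
static void compute_idoms(const struct bblk *g, int n, int *idom)
{
    for (int i = 0; i < n; i++) idom[i] = UNDEF;
    idom[0] = 0;
    int changed = 1;
    while (changed) {                 /* iterate to a fixed point */
        changed = 0;
        for (int b = 1; b < n; b++) {
            int new_idom = UNDEF;
            for (int k = 0; k < g[b].npreds; k++) {
                int p = g[b].preds[k];
                if (idom[p] == UNDEF) continue;  /* not processed yet */
                new_idom = (new_idom == UNDEF)
                         ? p : intersect(idom, p, new_idom);
            }
            if (new_idom != UNDEF && idom[b] != new_idom) {
                idom[b] = new_idom;
                changed = 1;
            }
        }
    }
}
```

For a diamond shaped flow graph (0 to 1, 0 to 2, 1 to 3, 2 to 3) the join block 3 gets block 0 as its immediate dominator. The whole algorithm is two short loops over an integer array.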
The main lesson of this anti-formal development method is to avoid
abstraction. Avoid machine generated tables and compiler
phases generated by automatic generators. Parse using language
specific recursive descent and use the C run time stack for
remembering context. Parse expressions using simple operator
precedence. This results in a very fast parser but is only possible
because assignment operators are not part of expressions in Verilog.
Assume that proofs use axioms that do not apply to reality unless it
has been specifically determined that the axioms are good. Finally,
following Naur, look for algorithms that are simple so they can be
adapted to the problem specific aspects of Verilog and the data
structures and algorithms of electronic design simulation.
4. Importance of von Neumann Observation on
Weakness of Formal Machines
There is a history of compiler development that was closely tied to
the von Neumann computer architecture. The best model of computing
is the von Neumann machine itself. Von Neumann was skeptical
of models that used automata in general and understood that
Turing Machines (TM) were too weak to model human program
writing, as expressed in his 1950s criticism of neural networks and
other simple automata. Von Neumann’s thinking is analyzed in detail
in (Aspray 1990) and (Meyer 2016). Von Neumann explicitly
justified machine models that consist of a finite number of randomly
accessible memory cells of unbounded size.
TMs are too weak a model of computation because they lack
random access and lack the ability to select bits from cells. In other
words, the P=?NP question only exists for TMs not for von Neu-
mann architecture MRAM machines for which the class P is equal
to the class NP (Hartmanis and Simon 1976) (Meyer 2016). For
von Neumann machines, guessing (non determinism) is no faster
than enumeration in the polynomial bound sense. This is expressed
in CVC by the use of optimization algorithms that search not ex-
haustively but using graph theory properties combined with human
conceptual problem solving. For decidable problems outside the
class NP such as the yes or no question “are two regular expres-
sions equivalent?”, compiled computer languages allow people to
code their understanding into a computer program or electronic cir-
cuit description so that the program can solve concrete regular ex-
pression equivalence problems.
Two concrete examples of traditional compiler development
anti-formalism are William McKeeman et al.’s development of the
XPL compiler system (McKeeman et al. 1971) and the development of
the Bell Labs C computer language and compiler (also the development
of Unix) (Ritchie 1993). McKeeman’s contribution was understanding
compiler bootstrapping. Dennis Ritchie and Ken Thompson understood
that the model they were developing a compiler and operating
system for was the von Neumann computer itself, instead of developing
for some abstract model of computation as was, for example,
attempted in the Multics project. These pioneers of compiler
development also understood the importance of methods that allowed
one or two people to develop large computer programs.
2 2018/1/16
5. Development History
The simulator described here was developed in the 1990s as a Ver-
ilog simulator to compete with the original Verilog XL simulator
that used interpreted execution. At the time Verilog semantics was
aimed at interpreted simulation because simulation properties such
as delays could be set at any time during a simulation run and be-
cause a command line debugger was an integral part of Verilog,
i.e. simulations often expected to read Verilog source from script
files at various times during a simulation run. In the late 1990s Ver-
ilog native machine code compilers were introduced and Verilog
semantics was changed to require specification of all design infor-
mation and properties at compile time. See (Thomas and Moorby
2002) for a historical description of the Verilog HDL.
By 2000 this simulator was no longer speed competitive so
it was used as the digital engine for an analog and mixed signal
(AMS) simulator. The speed problem was not as serious for mixed
signal simulation because analog simulation requires solving dif-
ferential equations. Analog simulation numerical algorithms run
orders of magnitude slower than digital simulation.
After the use in an AMS simulator, the simulator described
here needed to be improved so that it would be speed competitive
with compiled digital Verilog simulators. The most obvious speed
problem involved register transfer level (RTL) simulation. RTL
simulation is almost the same as normal programming language
execution except Verilog values require at least 2 computer bits
per Verilog bit and the RTL execution must interact with an event
driven scheduler.
The most obvious problem was that a number of if statements
in the interpreter C code were needed to select which interpreter
algorithm to run. For example, a simple logic and (&) operator needs
different evaluation C language code sections (usually one procedure
each) for scalars (1 bit), narrow vectors (up to 32 or 64 bits), wide
vectors and strength model bit vectors (one byte per bit required).
In addition, for all but scalars, both unsigned and signed execution
procedures are required. Sign extension for bit vectors that are not
an integral number of words wide requires significant calculation.
The Verilog standard requires that vectors at least up to one million
bits wide simulate correctly. It seemed that taking the interpreter
evaluation routines and converting them to high level instructions
for a virtual machine that could then be interpreted was a good idea
(see for example (Ertl and Gregg 2003)). Automatically generating a
virtual Verilog machine from the simulator’s interpreted evaluation
code was tried (Ertl et al. 2002). The development was not too hard,
but execution speed improved by only a small amount.
Although the if statement overhead was removed, extra overhead
to decode and execute the interpreter virtual instructions nullified
most of the gains.
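The kind of selection logic described above can be sketched in C as follows. This is an illustrative assumption of what such a dispatch might look like, not the simulator's code: the operand descriptor, the eval_kind names and both functions are hypothetical. An interpreter repeats the tests in select_eval on every evaluation, while a compiler can run them once and emit only the selected code.

```c
#include <stdint.h>

/* Hypothetical operand descriptor; not the simulator's real type. */
struct operand {
    int width;          /* vector width in bits, 1 or more */
    int is_signed;      /* nonzero for signed vectors */
    int has_strength;   /* nonzero for strength model vectors */
};

enum eval_kind {
    EV_SCALAR, EV_STRENGTH, EV_NARROW, EV_NARROW_SIGNED,
    EV_WIDE, EV_WIDE_SIGNED
};

/* The chain of tests an interpreter repeats on every evaluation. */
static enum eval_kind select_eval(const struct operand *op)
{
    if (op->has_strength) return EV_STRENGTH;
    if (op->width == 1) return EV_SCALAR;
    if (op->width <= 64)
        return op->is_signed ? EV_NARROW_SIGNED : EV_NARROW;
    return op->is_signed ? EV_WIDE_SIGNED : EV_WIDE;
}

/* Sign extend the low `width` bits (1..64) of v to a 64 bit value. */
static int64_t sign_extend(uint64_t v, int width)
{
    if (width >= 64) return (int64_t)v;
    uint64_t sign = (uint64_t)1 << (width - 1);
    v &= (sign << 1) - 1;             /* clear bits above the vector */
    return (int64_t)((v ^ sign) - sign);
}
```

Even the narrow case of sign extension takes several instructions. The wide case must also find the partially filled high word and fill every higher word.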
5.1 Concrete algorithms but not organization from Morgan’s
optimizing compiler book
It was realized that the instruction level parallelism and branch
prediction provided by modern microprocessors were required for fast
simulation. Development of a full code generator began. It was next
attempted to implement the concrete step by step method in Robert
Morgan’s book on building an optimizing compiler (Morgan 1998).
The book describes the method used in building the very good Dig-
ital Equipment Corporation Alpha microprocessor compilers.
This simulator does not use the code generator organization
from (Morgan 1998) (section 2.1, pp. 21-26). Morgan advocates a
basically breadth first filter down approach with a chain of
transformations, each of which has a different data representation. Morgan
writes: “Each phase has a simple interface” that can be tested in iso-
lation. Also, “No component of the compiler can use information
about how another component of the compiler is implemented” (p.
21).
Instead, modified versions of the very good algorithms spread
throughout the Morgan book are implemented. During development
of the code generator, new ideas were compared against the
concrete Morgan book approach to make sure they were no worse.
The Morgan book algorithms are especially useful because exceptions
are discussed. Examples are allowing non static single assignment
(SSA) form lists that violate the rule that each variable is
assigned only once, and allowing define-use chains sometimes to not
have exactly one element (p. 142).
Morgan’s idea that virtual instructions should be as close to
machine instructions as possible is used (Figure 2.2, p. 24). The one
exception is that Verilog requires very complicated, mostly boilerplate
prologue and epilogue instruction sequences, each of which is just
one virtual instruction.
Code generator steps use one master representation accessible
from the interpreted data base. Both flow graphs and temp names
(unbounded number of virtual registers that in Verilog are often
wide and require two bits to represent 4 values) are accessed in nu-
merous ways. Basic blocks are accessible directly from interpreter
execution form that then points to flow graphs. Flow graphs espe-
cially for net change propagation operators are accessed from in-
dexed tables and AVL trees (see igen.h in simulator source). The
machine code compiler adds information to the previous interpreted
simulator data structures.
A compiler organization that combines all information into one
master data base is best. Any code generation phase can use any
of the information that is accessed either directly from the inter-
preter execution net list, from indices (sometimes indexed tables
and sometimes trees), from code generation basic blocks or ma-
chine code routines. The idea is to continue to improve the global
data base during code generator development and to continually
add more information to all of the various parts of the one unified
representation. For example, flow graph building algorithm insights
were used to improve design elaboration data structures and inter-
preter execution data structures.
This unified data base where each part is kept consistent with
other parts is the key to simplicity and code quality. For Verilog,
because there are so many different types of operations from proce-
dural RTL to declarative gate level to load and driver propagation,
depth first code generation is better. Morgan’s type of low level
machine instructions (p. 24) are generated with some optimization
by expanding constructs all the way down to something close to
the final virtual instruction sequences when the flow graph virtual
instruction sequences are built. Later mapping to machine instructions,
except for X86 fixed registers, is straightforward. This approach
may be Verilog specific because Verilog allows values to be
read and written from anywhere in a design. Cross module refer-
ences are allowed and a programming language interface (PLI) can
run in any Verilog thread. There is effectively no usable context
information in Verilog.
For example, once some C code for the unified data structures of
the flow graph and basic block mechanism was written, the Morgan
book detailed algorithms on define-use lists and the importance of
SSA (12.5.1, p. 291 and 7.1, p. 142) could be applied. The heuristic
used is to generate (top down depth first) as many temporaries as
possible. Then SSA problems are fixed when needed, or code is even
allowed to violate SSA form in ways that are recognized by the
machine code expander. The crucial data structure used for optimization
is the define-use lists. Morgan suggests the idea of allowing non SSA
instructions and temporaries (p. 142). Extra temporaries can then be
eliminated during optimization passes through all flow graphs and
virtual instruction lists.
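The role of define-use information in eliminating extra temporaries can be sketched in C. This is a simplified illustration under stated assumptions, not the simulator's data structures: it keeps only definition and use counts instead of full lists, it works on a flat instruction array instead of a flow graph, and the names vinsn, du_info, build_du and remove_dead_temps are hypothetical.

```c
#include <string.h>

#define MAX_TEMPS 64   /* sketch limit; real temps are unbounded */

/* Hypothetical low level virtual instruction: dst = src1 op src2.
   A temp number of -1 means the operand is not used. */
struct vinsn {
    int op;
    int dst, src1, src2;
    int dead;            /* set when the instruction can be dropped */
};

struct du_info {
    int ndefs[MAX_TEMPS];   /* may exceed 1 when SSA form is violated */
    int nuses[MAX_TEMPS];
};

/* Count definitions and uses of every temp in the live instructions. */
static void build_du(const struct vinsn *code, int n, struct du_info *du)
{
    memset(du, 0, sizeof *du);
    for (int i = 0; i < n; i++) {
        if (code[i].dead) continue;
        if (code[i].dst >= 0) du->ndefs[code[i].dst]++;
        if (code[i].src1 >= 0) du->nuses[code[i].src1]++;
        if (code[i].src2 >= 0) du->nuses[code[i].src2]++;
    }
}

/* Repeatedly drop instructions whose result temp is never used.
   live_out[t] is nonzero for temps that must survive the sequence. */
static int remove_dead_temps(struct vinsn *code, int n, const int *live_out)
{
    struct du_info du;
    int removed = 0, changed = 1;
    while (changed) {
        changed = 0;
        build_du(code, n, &du);
        for (int i = 0; i < n; i++) {
            int d = code[i].dst;
            if (code[i].dead || d < 0) continue;
            if (du.nuses[d] == 0 && !live_out[d]) {
                code[i].dead = 1;
                removed++;
                changed = 1;
            }
        }
    }
    return removed;
}
```

Because the counts tolerate more than one definition per temp, the pass still works on instruction lists that violate SSA form. Generating many temporaries first and removing the unused ones later keeps the generator simple.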
5.2 Compiler file organization
Basic block creation and virtual instruction generation C code is in
the v_bbgen C files. Define-use list and other flow graph elaboration
code is in the v_bbopt.c file. The v_regasn.c file assigns machine
registers to the unbounded number of temps. The v_cvcms.c and
v_cvcrt.c files plus the v_asmlnk.c file contain support C procedures
plus code to generate the GNU AS assembly output, run gas
and link the final output executable called cvcsim. The v_aslib.c file
contains wrapper C procedures that are called from the generated
assembly but whose function is usually to call an interpreter
execution procedure. By using wrappers, early versions of the compiler
could compile almost all of Verilog but simulation was not yet fast
because execution used wrappers that executed the slow interpreter
code.
6. What is Verilog
Verilog is a Pascal like language (Wirth 1975) for the description
of electronic hardware. All variables are static because there are
no implicit stacks in hardware. Verilog is a combination of nor-
mal behavioral programming with parallelism, execution of hard-
ware described at the RTL level and low level primitive declar-
ative gates and flip flops. A common electronic design method
codes circuit descriptions in RTL then runs a program to synthe-
size the RTL into gates also coded in the Verilog language. Ver-
ilog is used to simulate (predict behavior when an IC is fabricated)
both the RTL and the synthesized gates with accurate timing. See
(Thomas and Moorby 2002) for a description of the Verilog HDL.
See (Sutherland et al. 2006), pp. 401-413 for a history of the Ver-
ilog HDL. See (Allen and Kennedy 2002), pp. 619-622 for a de-
scription of Verilog from the viewpoint of optimizing compiler de-
velopment.
7. Why Verilog Simulations Run Slowly
Compared to Computer Programs
First, Verilog RTL values require 2 bits (4 values) for every hard-
ware bit. Gate level accurate delay simulation also requires bus val-
ues that allow 127 different values and driving strengths. A simple
logic operation requires at least 3 or 4 instructions (not counting
loads and stores). Second, hardware registers are almost always
wider than the native von Neumann machine register width because
hardware design involves modeling the next generation electronics.
This requires evaluating multiple machine words for each operation.
Third, Verilog requires event driven simulation. When a delay
or event control (@(clk3) say) is executed, the simulation must
schedule a new event in an event queue and suspend the current
execution thread to be restarted later. An even slower process occurs
when a value is changed. Every right hand side expression that
references the changed variable must be re-evaluated and new
assignments made to the corresponding left hand sides. These affected
left hand side variables are called loads. Also, when a wire is
evaluated, it may be necessary to evaluate multiple drivers and
determine which is the strongest. See (Meyer 1988) for a data
structure that allows implementation of efficient load propagation
and driver competition algorithms.
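The scheduling cost can be made concrete with a minimal C sketch of an event queue. It is a hypothetical illustration only: a suspended thread is modeled as a callback, the queue is a simple time ordered linked list, and the names event, sched, schedule, run and probe_fire are not from the simulator. The load propagation and driver competition described above are not modeled.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical event: a suspended thread is modeled as a callback. */
struct event {
    uint64_t time;
    void (*resume)(void *ctx);
    void *ctx;
    struct event *next;
};

struct sched { uint64_t now; struct event *head; };

/* Insert keeping the list ordered by time; equal times stay FIFO. */
static int schedule(struct sched *s, uint64_t delay,
                    void (*fn)(void *), void *ctx)
{
    struct event *e = malloc(sizeof *e);
    if (!e) return -1;
    e->time = s->now + delay;
    e->resume = fn;
    e->ctx = ctx;
    struct event **pp = &s->head;
    while (*pp && (*pp)->time <= e->time) pp = &(*pp)->next;
    e->next = *pp;
    *pp = e;
    return 0;
}

/* Pop events in time order, advancing simulation time. */
static void run(struct sched *s)
{
    while (s->head) {
        struct event *e = s->head;
        s->head = e->next;
        s->now = e->time;
        e->resume(e->ctx);
        free(e);
    }
}

/* Example callback: records the simulation time at which it ran. */
struct probe { struct sched *s; uint64_t fired_at; };

static void probe_fire(void *ctx)
{
    struct probe *p = ctx;
    p->fired_at = p->s->now;
}
```

Every delay or event control pays for an allocation, an ordered insert and an indirect call, none of which an ordinary compiled program needs.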
8. Performance
The simulator described here is arguably the fastest full accuracy
1364 Verilog simulator. Evidence here is anecdotal from users because
the other commercial simulators are licensed with clauses that
prohibit public benchmarking. One anonymous published comparison
of this simulator with other compiled commercial simulators
for RTL but not gate level simulation speed (Anonymous 2011)
showed 2 to 5 times faster simulation by this simulator for two dis-
cussed designs.
Other anecdotal evidence is that optimization is so good that in
one case full accuracy 1364 Verilog simulation is almost as fast as
cycle based evaluation using a detail removing cycle based open
source Verilog evaluator called Verilator for a Motorola 68k pro-
cessor model. Such fast simulation is possible because the model
really only requires 2 state not 4 state variables.
8.1 Comparison with Open Source Icarus Verilog
Another open source Verilog simulator, Icarus Verilog (called
iverilog), implements most IEEE 1364 Verilog RTL features.
It is difficult to use for benchmarking because it does not
implement Verilog source macro cell library options (called -v and
-y libraries). A sixty four bit version of the simulator described
here is usually 35 to 45 times faster than the Icarus Verilog release
called 10.0 stable built with the default options (-g -O) and run on
an Intel Core i5-5200U 2.2 GHz low power processor running CentOS
6.7 Linux.
For one SHA1 checksum circuit, this simulator is 111 times
faster. The SHA1 model is almost just a Pascal program so again the
+2state option can be used. The two state option works by keeping
the X and Z values (called B part words) around but simulation only
needs to initialize the words to zero once. If a design simulation
really requires X and Z values, simulations run with this option
will be incorrect. Flow graph optimization normally removes B part
basic blocks from output machine code when X or Z values are
impossible. The two state option allows generating simpler flow
graphs that allow evaluation of the values for A (0 or 1) words to
be optimized even more.
9. Simple Design Method for Complex Language
Compilers
9.1 Develop an interpreter using language specific simple
concrete methods
This simulator was developed at the same time the IEEE
1364 Verilog standard was being defined and continually changed.
It is very difficult to simplify or shorten compiler development
duration when a computer language is also under development.
Some helpful ideas are:
9.2 Use simple organization
One centralized include file is used by every C source file with
defined and used prototypes at the top of each source file. This
organization makes it easy to eliminate occurrences of more than
one routine for the same basic function (wide vector sign extension
for example). Also, it is important to be able to make a change in
only one place when Verilog changes or when better simulation
algorithms are implemented. It is also usually not obvious when a
new Verilog feature is first introduced what part of simulation is the
same as some pre existing feature.
9.3 Use only one internal organization (net list data structure
for Verilog)
The first pass scans and parses Verilog source into normal module
lists, statement and gate lists, expression trees and linked symbol
tables. Then the next set of fix up procedures is used to fill the same
data structure with more information, which includes possibly totally
changing the instance tree hierarchy. Then the next elaboration
phase fills in more simulation preparation information and allocates
variable and state memory just before beginning interpreted sim-
ulation. This organization allows easily moving processing steps
forward and backwards when the Verilog language changes. The
compiler is run just before interpreted simulation would have been
executed. It outputs a binary that is then run for compiled simula-
tion.
9.4 Avoid generators, grammars and tables
The simplest and most powerful scanning method uses a giant
case statement. The simplest and most powerful parsing method
uses language specific recursive descent because normal push down
stack based parser tables use very weak push down automata, even
weaker than TMs. This is especially important in complicated lan-
guages such as Verilog where the scanner needs parser information
and the parser needs scanner information. The 1364 standard com-
mittee does not worry about constraining the language so that it
can be coded as a context free grammar. Verilog assignments are
separate from expressions so simple and fast operator precedence
parsing can be used (Gries 1971) (section 6.1, pp. 122-132). This
simulator can elaborate 5 million line designs in less than 15 to 20
seconds on a modern fast CPU. Fast elaboration speeds up development of
code generator algorithms, allows faster compiler debugging and
assists in finding faster machine code instruction patterns for the
output machine language program.
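Hand written operator precedence parsing needs very little code, as the following C sketch suggests. It is an illustration under assumptions, not the simulator's parser: it evaluates decimal integer expressions instead of building expression trees, handles only six binary operators and parentheses, and the names parser, prec_of, parse_primary and parse_expr are hypothetical. Context is remembered on the C run time stack and operators are classified with a case statement.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical parser state: a cursor into the source text. */
struct parser { const char *p; };

static long parse_expr(struct parser *ps, int min_prec);

/* Binding strength of a binary operator, 0 if ch is not an operator. */
static int prec_of(char ch)
{
    switch (ch) {
    case '|': return 1;
    case '&': return 2;
    case '+': case '-': return 3;
    case '*': case '/': return 4;
    default: return 0;
    }
}

static void skip_ws(struct parser *ps)
{
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* primary := decimal number | '(' expr ')' */
static long parse_primary(struct parser *ps)
{
    skip_ws(ps);
    if (*ps->p == '(') {
        ps->p++;
        long v = parse_expr(ps, 1);
        skip_ws(ps);
        if (*ps->p == ')') ps->p++;
        return v;
    }
    char *end;
    long v = strtol(ps->p, &end, 10);
    ps->p = end;
    return v;
}

/* Operator precedence loop; recursion on the C stack holds context. */
static long parse_expr(struct parser *ps, int min_prec)
{
    long lhs = parse_primary(ps);
    for (;;) {
        skip_ws(ps);
        char op = *ps->p;
        int prec = prec_of(op);
        if (prec == 0 || prec < min_prec) return lhs;
        ps->p++;
        long rhs = parse_expr(ps, prec + 1);   /* left associative */
        switch (op) {
        case '|': lhs |= rhs; break;
        case '&': lhs &= rhs; break;
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs = rhs ? lhs / rhs : 0; break;
        }
    }
}
```

Calling parse_expr with a minimum precedence of 1 on the text "1 + 2 * 3" returns 7. The sketch stays this small because there is no assignment operator inside expressions to complicate it.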
9.5 For complex languages first develop an interpreter
The first step in fast compiled Verilog development requires devel-
oping a 1364 standard compliant interpreted simulator so that com-
piler machine code execution can be regressed against the inter-
preter standard. Once simulation is running, many speed improve-
ments will become obvious (made visible) without needing to wait
for a long design phase before experimental data is available.
9.6 Add interpreter wrapper capability
There are I_CALL_ASLPROC and I_CALL_ASLFUNC virtual
instructions that work with the normal unbounded register temps
and define-use predominator algorithms, i.e. the interface is no
different from a low level virtual machine instruction. This allows
running simulations with only some constructs compiled. During
initial development, only narrow (less than 32 or 64 bit) variables were
compiled. All wider expressions and assignments were evaluated
with wrappers. Then step by step wrappers were replaced by low
level machine instructions. Some very complicated Verilog algorithms
such as multiple path conditional delay selection and switch
level simulation are still just wrappers.
9.7 Make writing virtual instruction generation the same as
interpreter execution
For example, Verilog binary operator evaluation starts with a wrapper,
which is then replaced by a version of the same procedure with calls
to gen_tn to create temporary registers and to start_bblk to
replace if statements (see the eval_binary procedures in the v_ex2.c
file near line 6600, for which the code generation procedure is in
file v_bbgen.c). This allows code generation to mostly involve replacing
C program evaluation statements with the corresponding temp register,
basic block and virtual operation generation procedure calls.
Code generation for more complicated simulation operations such
as scheduled event processing can be simplified this way also.
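The parallel between an interpreter evaluation routine and its code generating twin, with a wrapper as the fallback, can be sketched in C. This is a hypothetical illustration, not the simulator's code: the names vop, vcode, codebuf, new_temp, emit, eval_binary_now and gen_binary are invented for the sketch, and the fixed size instruction buffer exists only to keep it short.

```c
#include <stdint.h>

#define MAX_INSNS 64   /* sketch limit only */

/* Hypothetical virtual operations; V_CALL_WRAPPER stands for a call
   back into an interpreter evaluation routine. */
enum vop { V_AND, V_ADD, V_CALL_WRAPPER };

struct vcode { enum vop op; int dst, src1, src2; };

struct codebuf {
    struct vcode insns[MAX_INSNS];
    int ninsns;
    int ntemps;    /* next free temp from the unbounded temp set */
};

/* Interpreter form: compute the value immediately. */
static uint64_t eval_binary_now(char op, uint64_t a, uint64_t b)
{
    if (op == '&') return a & b;
    return a + b;
}

/* Allocate a fresh temporary register. */
static int new_temp(struct codebuf *cb) { return cb->ntemps++; }

/* Append one virtual instruction; returns its result temp or -1. */
static int emit(struct codebuf *cb, enum vop op, int s1, int s2)
{
    if (cb->ninsns >= MAX_INSNS) return -1;
    int dst = new_temp(cb);
    struct vcode v = { op, dst, s1, s2 };
    cb->insns[cb->ninsns++] = v;
    return dst;
}

/* Generator form: same shape as eval_binary_now, but each computation
   becomes an emitted virtual instruction on temps. Constructs that are
   not yet compiled fall back to a wrapper call. */
static int gen_binary(struct codebuf *cb, char op, int ta, int tb,
                      int compiled)
{
    if (!compiled) return emit(cb, V_CALL_WRAPPER, ta, tb);
    if (op == '&') return emit(cb, V_AND, ta, tb);
    return emit(cb, V_ADD, ta, tb);
}
```

The wrapper path produces an ordinary instruction with a result temp, so later optimization passes need no special case for constructs that are still interpreted.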
9.8 Modify Morgan’s dominator-based optimization
algorithms
The book Building an Optimizing Compiler by Robert Morgan
contains a concrete simplified method for optimizing flow graphs
and computing define-use data structures. Start with that, simplify
even more and modify it to fit your complex language. This is
the hard, very important part of developing a fast compiler. Once
good flow graph optimizations are implemented, the register allocator
becomes much easier and better. See optimize_1mod_flowgraphs
at the beginning of file v_bbopt.c for the code that implements
this.
9.9 Use low level virtual instructions close to machine
instructions
Virtual instructions are defined and emitted that are mapped into
machine instructions. It is best to generate low level virtual
instructions because the human coder of the code generation for a
particular Verilog feature can craft good instruction sequences.
Verilog operations at minimum require operating on an A part (one or
more computer RAM words) for the 0 and 1 value and a B part for the
X and Z value so even simple operations require multiple machine
instructions and probably two by two vector cross product evaluation.
The copy operation is an exception especially for Verilog be-
cause much simulation is copying data that can be very wide. Initial
code generation inserts a complicated virtual I COPY instruction
whenever there is a possibility of a need for a copy. Then during
mapping from the I COPY virtual instruction to machine instruc-
tions, the copy usually can be removed without needing to emit
anything.
9.10 Avoid extra representations such as tuples and breadth
first transformations
One of the best methods for simplifying the compiler and increasing
simulation speed is depth first code generation. This method is
intentionally the opposite of what (Morgan 1998), page 212 recommends.
Morgan recommends breadth first code generation with successive
transformations to lower levels. It is much simpler to use depth first
generation of almost machine instructions because it allows the
compiler writer to control machine code sequences and reduces the
number of lines of compiler code.
9.11 Use experimentation to find fast CPU instruction
parallelism code sequences
Once a working code generator was written and debugged, the best
speed improvement method was to set up shell scripts that would
run two different versions of the compiler on a speed test regression
suite and compare results. The low level machine instruction
sequence from the best result would then be used. The developers
did not have access to the X86_64 multi-issue pipeline optimization
rules documentation, so they needed to run experiments. The
experimental approach may be better in general because the rules
are so complex. This was the most important idea for speeding up
simulation execution. The other important idea was compiling the
change propagation and scheduled event processing code into flow
graphs.
10. Allen and Kennedy Hardware Simulation as
Abstraction Contradicted
In the book Optimizing Compilers for Modern Architectures,
Randy Allen and Ken Kennedy argue that the task of optimiza-
tion of hardware descriptions abstracts hardware descriptions to a
less detailed level (Allen and Kennedy 2002). Allen and Kennedy
write “Another way of saying this is that simulation speed is re-
lated to the level of abstraction of the simulated design more than
anything else – the higher the level, the faster the simulation” (p.
624). For Verilog this claim is wrong. Low level exact Verilog and
machine code detail modeling results in faster simulation because
it allows maximum utilization of the low level instruction parallelism
that is built into modern microprocessors. The optimizations include
processing as wide a bit vector chunk at a time as possible,
emitting instruction sequences that keep instruction prefetch queues
and pipelines full, and emitting instruction sequences that work well
with microprocessor branch prediction algorithms. Basic blocks in
Verilog are small but very numerous because separate A parts and
B parts require conditionals.
This compiler development project shows that a not very complicated
code generator (less than 100,000 lines of C code) can produce
good code using the experimentation method of running speed
regression test suites with different versions of the instruction
sequences to be emitted and choosing the fastest. The standard way
(almost but not quite required by the IEEE 1364 standard) to store
Verilog 4 value logic and wire values is as two separate machine
word areas. The B part selects unknown values: X (unknown) and
Z (high impedance). If the B part is zero, the A part selects
0 or 1 values. Bit and part select operations need to be optimized
as fast load, shift and mask operations that use mostly boiler plate
instruction sequences. Four value logic operations can also be
generated as combinations of machine logic instructions combining the
A and B parts. However, complex logic operations are sometimes
better implemented using separated A and B parts or even as table
look ups.
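As an illustration of a four value operation built from machine logic
instructions on the A and B parts, the following sketch computes a
bitwise AND on one 64 bit chunk. The val4_t type and the and4 procedure
are hypothetical, and the per bit encoding is an assumption: the text
above only states that the B part selects X and Z, so the sketch assumes
(a=0,b=0) is 0, (a=1,b=0) is 1, (a=0,b=1) is Z and (a=1,b=1) is X.

```c
#include <assert.h>
#include <stdint.h>

/* One 64 bit chunk of a four value vector (hypothetical layout).
   Assumed encoding per bit: a=0,b=0 -> 0   a=1,b=0 -> 1
                             a=0,b=1 -> Z   a=1,b=1 -> X */
typedef struct { uint64_t a; uint64_t b; } val4_t;

/* Four value bitwise AND: a known 0 on either side forces 0,
   two known 1s give 1, every other combination gives X. */
static val4_t and4(val4_t x, val4_t y)
{
    uint64_t not0_x = x.a | x.b;     /* bits of x that are 1, X or Z */
    uint64_t not0_y = y.a | y.b;     /* bits of y that are 1, X or Z */
    val4_t r;

    r.a = not0_x & not0_y;           /* result is 1 or X */
    r.b = r.a & (x.b | y.b);         /* X when either input is unknown */
    assert((r.b & ~r.a) == 0);       /* AND never produces Z */
    return r;
}
```

When both B parts are zero, the result B part is zero and the A part
reduces to the ordinary machine AND of the two A parts, which is the
case a two state evaluation would use directly.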
The remainder of this section criticizes specific ideas for imple-
menting simulation abstraction from the Allen and Kennedy book.
10.1 Inlining modules, p. 624
Allen and Kennedy call expanding modules inline a fundamental
optimization. This is incorrect because it is much better to use the
normal programming language optimization of not inlining and instead
separating machine code from state and variable data, which are
accessed using per instance based addressing with traditional display
technology (Gries 1971, pp. 172-175). Not only is the code smaller,
but separating state from model description makes many optimizations
visible, especially ideas for pre-compiling event scheduling and
processing.
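A minimal sketch of per instance based addressing follows. It is an
illustration only and not CVC generated code: the counter module, the
word offsets and the counter_posedge procedure are hypothetical. One
code body is shared by every instance of the module, and all state is
reached through a base pointer (named idp here) into that instance's
data area.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical word offsets inside one instance's state area for a
   module containing a 4 bit counter with an enable input. */
enum { OFS_COUNT = 0, OFS_ENABLE = 1, INST_WORDS = 2 };

/* Single shared code body for the module. The instance is selected
   only by the base pointer, so the code is emitted once no matter
   how many times the module is instantiated. */
static void counter_posedge(uint32_t *idp)
{
    assert(idp != 0);
    if (idp[OFS_ENABLE])
        idp[OFS_COUNT] = (idp[OFS_COUNT] + 1u) & 0xFu;
}
```

Calling counter_posedge with two different state areas updates each
instance independently while executing the same machine code.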
10.2 HDL level execution ordering, p. 626
Allen and Kennedy argue for HDL statement reordering optimizations.
Far more important is machine instruction reordering to maximize
low level microprocessor instruction parallelism. Part of the
reason for this is that although Verilog has fork-join constructs,
they are rarely used. Parallelism in Verilog comes from a very large
number of different always blocks, usually triggered with an event
control (for example always @(clk)).
10.3 Dynamic versus static scheduling, p. 627
This section discusses oblivious evaluation. Instead of the normal
Verilog algorithm that “dynamically tracks changes and propagates
those changes”, the alternative blindly evaluates without any
change propagation overhead. The oblivious method cannot work
for Verilog because it is common to have gaps of thousands of ticks
between edges that have very high event rates. Static scheduling and
compiling events after change recording into pre-compiled flow
graphs to the extent possible is more important.
10.4 Fusing Always Blocks, p. 628
Here Allen and Kennedy are correct. Fusing always blocks makes
a huge speed improvement.
10.5 Vectorizing Always Blocks, p. 632
The idea is that Verilog code generation optimization should
“rederive the higher-level abstraction that was the original intent”.
The problem with this is that the decision to code scalarized or
vectored is best left to the designer (or the HDL generation program).
Vectorizing is not always better because if only a few bits change per
clock cycle in a wide vector, it is better to simulate the scalarized
individual bits. There is no way to determine the bit switching
frequency for a given simulation a priori. Change detection of
vector selects is fast because it only requires a mask, shift and xor
followed by a branch, but propagation of the changes can be expensive
if the selected part select bits used in right hand side expressions
do not exactly match. It is much better not to change the Verilog HDL
code, but to find fast instruction sequences.
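The mask, shift and xor change detection mentioned above can be
sketched as follows. The psel_changed procedure is hypothetical and
handles only one 64 bit word of an A part. For four value data it is
assumed that the same test would be applied to the B part word as well
and the two results combined.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when bits [hi:lo] differ between the old and new
   value of a word. The caller branches on the result to decide
   whether change propagation is needed. */
static int psel_changed(uint64_t old_w, uint64_t new_w,
                        unsigned hi, unsigned lo)
{
    unsigned width;
    uint64_t mask;

    assert(hi < 64 && lo <= hi);
    width = hi - lo + 1;
    /* Avoid shifting by 64, which is undefined in C. */
    mask = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    return (((old_w ^ new_w) >> lo) & mask) != 0;
}
```

For a select with constant bounds, the mask and shift count are
compile time constants, so the test reduces to a few machine
instructions followed by a conditional branch.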
10.6 Two-State versus Four-State Logic, p. 637
Two state simulation is good when it is possible. The problem is
that the reason for hardware simulation is to find mistakes that
will result in unknown X states (IC state that will sometimes be 1
and sometimes 0) in the fabricated integrated circuit. Most real
HDL designs do not allow two-state simulation, but when it is
possible, simulation is much faster because evaluations can directly
use hardware machine instructions. Most four state evaluations are
similar to evaluating two by two vector cross products. A low level
design feature of this simulator treats each temporary as one temp,
even for four state values that require both an A part word memory
area and a B part word memory area. When an expression evaluation can
be executed as two state, the B part instruction sequences are either
not emitted or, for expressions containing non two state elements,
optimized away during flow graph simplification. Verilog variables
can be declared as having a two state type.
10.7 Rewriting Block Conditions, p. 637
Allen and Kennedy advocate changing trigger conditions on Verilog
blocks by abstracting and guessing user intent to eliminate the
need to propagate change operators, which in Verilog are usually
event controls on always blocks. Instead of trying to rewrite the
Verilog source to a higher level, it is better to compile event queue
propagation by generating flow graphs because most event controls are
simple. Change operator occurrence scheduling and wake up event
processing should also be compiled into flow graphs. State data
should first be coded into the per instance (idp) based display area
(see procedure alloc_fill_ctevtab_idp_map_els near line 3820 of
file v_prp.c). Then separate flow graphs are coded to schedule the
changes and to propagate the changes. The generated flow graphs
are optimized the normal way, so most change operators require
just a few instructions to schedule and a few instructions when the
flow graph is jumped to from the event queue processing code. The
flow graph generation procedure for the scheduling part of delay
controls (@(clk) say) is the gen_dce_schd_tev routine near line 10400
in file v_bbgen3.c.
11. GPU or Other Multi-Core Parallelism Does
Not Work for Verilog
There have been many projects that attempt to use specialized
computer hardware rather than optimizing flow graph compilers
to speed up Verilog simulation. The reason for this is that Verilog
simulations are a significant consumer of electronics company
compute cycles, with sometimes entire server farms dedicated to
running only Verilog. Verilog compiled executable speed is of crucial
importance and has huge economic value. At least so far, efforts to
speed up HDL simulation with either special purpose hardware or
GPUs have failed. It is possible to build special purpose hardware
that will run unoptimized Verilog 10 times faster. The problem is
that a good flow graph based optimizing compiler run on a modern
fast multi-issue microprocessor can be 10 or more times faster
than the naive unoptimized computer software algorithms that the
special hardware is compared to.
Hardware emulation uses a different approach in which a design
is converted to gate level and then fabricated into FPGAs (“laid
out”). Emulated designs run much faster than software compiled
simulation, so they are good for early software development, but
they do not “simulate” a design in sufficient detail for hardware
debugging.
The reason parallel Verilog simulation has not succeeded for
most large system models is that Verilog HDL basic blocks tend
to be very small with a high proportion of jumps and
synchronizations. Small basic blocks parallelize efficiently on
multi-issue CPUs but have too much synchronization overhead for
coarser parallelism.
In the compiler described here, parallelism is used in one place.
Value change dump file output (FST format is smallest and fastest)
that is used by wave form viewers (software oscilloscopes) to debug
hardware may be run on up to 2 additional cores to encode value
changes. This use of parallelism can improve simulation times for
simulations that generate huge value change dump files by a factor
of two. Very large value change files are also written because they
can be input into back end physical design tools that analyze the
value changes to optimize IC power usage.
One area where parallel Verilog simulation has worked to some
extent is providing tools that guess and advise users how to
partition designs among multiple X86_64 cores. Simulation can then be
run in parallel on multiple cores because the assignments to cores
result in minimal communication overhead. Parallel Verilog
simulation on GPUs has not succeeded.
12. Conclusions
The general area of how to represent digital electronic circuits is
currently rather unsettled. Many designers would prefer to write
programming language (C usually) code and have a program au-
tomatically convert the code into a hardware design that is then
represented in Verilog (called high level synthesis).
However, if a design really can be coded as a computer program
and implemented using FPGA integrated circuit technology, it may
be cheaper and better in speed, cost and power to implement the
function in software as a highly optimized compiled low level
program on off the shelf multicore SoCs containing conventional
CPUs, signal processing CPUs, GPUs and other types of CPUs.
Another way of expressing this observation is that if digital circuits
can be expressed using only two state logic with X and Z cross
products ignored, implementation as computer programs may be
better.
In my view there is a need for a simple Verilog like HDL that
is intended to be generated by computer programs, maybe called V--.
Verilog elaboration and use of constant parameters that are
sometimes not really constant at run time make sense when HDL
designs are coded by hand because they make Verilog easier to write,
but not when Verilog is machine generated. If such a language
eliminated non locality such as cross module references and
force-release statements (originally intended to allow coding reset
buttons), much faster simulation might be possible. Such simulation
could use some type of graph theory connectivity compilation.
13. Acknowledgments
Andrew Vanvick (AIV in source) was the other CVC developer.
References
R. Allen and K. Kennedy. Optimizing Compilers for Modern
Architectures. Academic Press, 2002.
Anonymous. A Verilog simulator comparison, September 2011. URL
http://www.semiwiki.com/forum/content/777-verilog-simulator-comparison.html .
W. Aspray. John von Neumann and the Origins of Modern Computing. MIT
Press, 1990.
S. Budiansky. Blackett’s War: The Men Who Defeated the Nazi U-Boats
and Brought Science to the Art of Warfare. Knopf, 2013.
K. Cooper, T. Harvey, and K. Kennedy. A simple, fast dominance algo-
rithm. Technical Report TR-06-33870, Rice University Department of
Computer Science, 2006.
A. Einstein. Geometry and experience, January 1921. URL
http://www.relativitycalculator.com/pdfs/einstein_geometry_and_experience_1921.pdf .
Lecture before Prussian Academy of Sciences, Berlin.
A. Ertl and D. Gregg. The structure and performance of efficient inter-
preters. Journal of Instruction-Level Parallelism, 5, 2003. electronic
archive journal.
A. Ertl, D. Gregg, A. Krall, and B. Paysan. vmgen – a generator of efficient
virtual machine interpreters. Software Practice and Experience, 32(3):
265–295, 2002.
P. Finsler. Über die Unabhängigkeit der Continuumshypothese. Dialectica,
23(1):67–78, 1969.
D. Gries. Compiler Construction for Digital Computers. John Wiley and
Sons, 1971.
J. Hartmanis and J. Simon. On the structure of feasible computations. In
M. Rubinoff and M. Yovits, editors, Advances in Computers, volume 14,
pages 1–43. Academic Press, 1976.
IEEE. IEEE Std 1364-2005 Verilog hardware description language reference
manual, 2005.
W. McKeeman, H. James, and D. Wortman. A Compiler Generator. Pren-
tice Hall, 1971.
S. Meyer. A data structure for circuit net lists. In Proceedings 25th Design
Automation Conference, pages 613–616. ACM/IEEE, June 1988.
S. Meyer. Adding methodological testing to Naur’s anti-formalism. In
IACAP 2013 Proceedings. International Association for Computing and
Philosophy, July 2013.
S. Meyer. Philosophical solution to P=?NP: P is equal to NP. arXiv, cs.GL,
March 2016. URL https://arxiv.org/abs/1603.06018 .
R. Morgan. Building an Optimizing Compiler. Digital Press, 1998.
P. Naur. Computing as science, pages Appendix 2, 208–217. naur.com
Publishing, 2005. www.Naur.com/Nauranat-ref.html URL of Oct.
2014.
P. Naur. Computing versus human thinking. Communications of the ACM,
50(1):85–94, 2007.
P. Naur. Conversations - Pluralism in Software Engineering. Lonely
Scholar Publishing, Belgium, 2011. Edgar Daylight ed.
D. Ritchie. The development of the C language. In T. Bergin and R. Gibson,
editors, History of Programming Languages II, pages 671–698. ACM,
April 1993.
S. Sutherland, S. Davidmann, and P. Flake. System Verilog for Design.
Springer, 2006.
D. Thomas and P. Moorby. The Verilog Hardware Description Language.
Springer, 2002.
N. Wirth. The programming language Pascal. Acta Informatica, 1:35–63,
1975.
