A Stochastic Petri Net Reverse Engineering Methodology for Deep Understanding of Technical Documents by Rematska, Giorgia
Wright State University 
2018 
A Stochastic Petri Net Reverse Engineering Methodology for Deep 
Understanding of Technical Documents 
Giorgia Rematska 
Wright State University 
 Part of the Computer Engineering Commons, and the Computer Sciences Commons 
Repository Citation 
Rematska, Giorgia, "A Stochastic Petri Net Reverse Engineering Methodology for Deep Understanding of 
Technical Documents" (2018). Browse all Theses and Dissertations. 1946. 
https://corescholar.libraries.wright.edu/etd_all/1946 
This Dissertation is brought to you for free and open access by the Theses and Dissertations at CORE Scholar. It 
has been accepted for inclusion in Browse all Theses and Dissertations by an authorized administrator of CORE 
Scholar. For more information, please contact library-corescholar@wright.edu. 
 
 
A STOCHASTIC PETRI NET REVERSE 
ENGINEERING METHODOLOGY FOR DEEP 
UNDERSTANDING OF TECHNICAL 
DOCUMENTS 
 
 
A dissertation submitted in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy 
 
 
By 
 
 
GIORGIA REMATSKA 
               M.Sc., Technical University of Crete, 2015 
                B.S., Technical University of Crete, 2013 
 
 
 
________________________________________ 
 
 
 
2018 
Wright State University
 
 
 
 
 
 
 
 
COPYRIGHT BY 
GIORGIA REMATSKA 
2018 
 
 
 
WRIGHT STATE UNIVERSITY 
 
GRADUATE SCHOOL 
 
April 23, 2018 
I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY 
SUPERVISION BY Giorgia Rematska ENTITLED A Stochastic Petri Net Reverse 
Engineering Methodology for Deep Understanding of Technical Documents BE 
ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE 
DEGREE OF Doctor of Philosophy
 
 
 
 
 
 
Committee on 
Final Examination 
 
Nikolaos G. Bourbakis, Ph.D. 
 
Soon M. Chung, Ph.D. 
 
Bin Wang, Ph.D. 
 
Sukarno Mertoguno, Ph.D. 
 
 
Nikolaos Bourbakis, Ph.D.  
Dissertation Director 
 
Michael Raymer, Ph.D. 
Director, Computer Science 
and Engineering 
Ph.D. Program 
 
Barry Milligan, Ph.D.  
Interim Dean of the Graduate School 
 
ABSTRACT 
 
 
Rematska, Giorgia Ph.D., Computer Science and Engineering Ph.D. program, Wright State 
University, 2018. A Stochastic Petri Net Reverse Engineering Methodology for Deep 
Understanding of Technical Documents. 
 
 
 
Systems reverse engineering has attracted growing attention over time and touches 
numerous research areas. Its importance derives from several technological needs, 
among them security analysis and learning, both of which benefit greatly from reverse 
engineering. In particular, the deeper automatic understanding of technical documents 
is an area to which reverse engineering can contribute substantially. 
In this PhD dissertation we develop a novel reverse engineering methodology for the 
deep understanding of architectural descriptions of digital hardware systems that appear 
in technical documents. We first survey the reverse engineering of electronic or digital 
systems. We also classify the research methods within this field and present a maturity 
metric that highlights the strengths and weaknesses of the methodologies and systems 
currently available. 
A technical document (TD) is typically composed of several modalities, such as 
natural language (NL) text, system diagrams, tables, math formulas, graphics, and 
pictures. Thus, automatic deep understanding of technical documents requires a 
synergistic collaboration among these modalities. Here we deal with the synergistic 
collaboration between NL text and system diagrams for a better and deeper 
understanding of a TD. In particular, a technical document is decomposed into two 
modalities: NL text and figures of system diagrams. The NL text is processed with a 
Natural Language text Understanding (NLU) method, and a Convolutional Neural Network 
classifies the text sentences into five categories. A Diagram-Image-Modeling (DIM) 
method processes the figures by extracting the system diagrams. More specifically, NLU 
processes the text of the document and determines the associations among the nouns and 
their interactions by creating their stochastic Petri net (SPN) graph model. DIM 
processes and analyzes the figures to transform each diagram into a graph model that 
holds all relevant information appearing in the diagram. We then combine (associate) 
these models in a synergistic way to create a synergistic SPN graph. From this SPN graph 
we obtain the functional specifications that form the behavior of the system, expressed 
as pseudocode. In parallel, we extract a flowchart to help the reader understand the 
pseudocode and the hardware system as a whole. 
 
 
TABLE OF CONTENTS 
 
ABSTRACT ............................................................................................................... IV 
ACKNOWLEDGEMENTS ............................................................................................. 1 
1 INTRODUCTION ................................................................................................ 3 
1.1 Motivation ........................................................................................................................ 3 
1.2 Outline .............................................................................................................................. 4 
2 LITERATURE REVIEW ON REVERSE ENGINEERING OF DIAGRAMS ..................... 6 
2.1 Introduction ...................................................................................................................... 6 
2.2 Classification scheme ........................................................................................................ 7 
2.2.1 PCB ................................................................................................................................. 8 
2.2.2 IC .................................................................................................................................. 10 
2.2.3 DC ................................................................................................................................. 14 
2.2.4 BD ................................................................................................................................. 17 
2.3 Features .......................................................................................................................... 18 
2.4 Evaluation results ............................................................................................................ 23 
2.5 Conclusions ..................................................................................................................... 27 
3 OVERVIEW OF THE PROPOSED METHODOLOGY ............................................ 28 
4 STRUCTURE ANALYSIS .................................................................................... 33 
4.1 Introduction .................................................................................................................... 33 
4.2 Overview ......................................................................................................................... 34 
4.3 Figure and NL text extraction .......................................................................................... 35 
4.4 Caption Detection ........................................................................................................... 35 
4.5 Feature Selection ............................................................................................................ 37 
4.6 Structure Clustering and Classification ............................................................................ 38 
4.7 Figure Caption and Paragraph Association ...................................................................... 41 
4.8 Results ............................................................................................................................. 41 
4.9 Conclusions ..................................................................................................................... 42 
5 AN IMPROVED SPN BASED NLU SCHEME ....................................................... 44 
5.1 Introduction .................................................................................................................... 44 
5.2 Improving Kernel Extraction (A-V-P) ................................................................................ 45 
5.2.1 Rules for the kernel extraction .................................................................................... 46 
5.2.2 Extracting the A->V->P Kernel and Converting NL Sentences into Graphs [54] [79] ... 49 
5.2.3 Converting NL Graphs into SPN graphs [54] [79] ......................................................... 50 
5.2.4 Text SPN Examples ....................................................................................................... 51 
5.3 Partial sentences ............................................................................................................. 53 
5.4 Conclusions ..................................................................................................................... 54 
6 THE DIAGRAM MODEL ................................................................................... 55 
6.1 Block diagram Detection ................................................................................................. 55 
6.2 Diagram Extraction .......................................................................................................... 56 
6.3 Diagram Graph Model ..................................................................................................... 57 
6.4 Diagram Text Recognition ............................................................................................... 57 
6.5 Filtering ........................................................................................................................... 60 
6.6 Diagram Graph ................................................................................................................ 60 
6.7 Synthesis of nouns in text SPN graph .............................................................................. 61 
6.8 Noun replacement........................................................................................................... 62 
6.9 Results ............................................................................................................................. 63 
6.10 Conclusions ................................................................................................................. 67 
7 CLASSIFICATION ............................................................................................. 68 
7.1 Text Preprocessing .......................................................................................................... 68 
7.1.1 Introduction ................................................................................................................. 68 
7.1.2 Filtering ........................................................................................................................ 69 
7.1.3 Stemming and Lemmatization ..................................................................................... 69 
7.1.4 Feature Space............................................................................................................... 70 
7.1.5 Bag of Words ................................................................................................................ 70 
7.1.6 Neural Word Embeddings ............................................................................................ 71 
7.2 Classification algorithms.................................................................................................. 72 
7.2.1 Naïve Bayes Classifier ................................................................................................... 73 
7.2.2 Support Vector Machine .............................................................................................. 73 
7.2.3 Decision trees ............................................................................................................... 74 
7.2.4 Random Forest Classification ....................................................................................... 75 
7.2.5 Neural Networks .......................................................................................................... 75 
7.2.6 Convolutional Neural Networks ................................................................................... 77 
7.3 Sentence Classification .................................................................................................... 78 
7.3.1 Introduction ................................................................................................................. 78 
7.3.2 Timing Sentences (TS): ................................................................................................. 79 
7.3.3 Conditional Sentences (CS) .......................................................................................... 79 
7.3.4 Link Sentences (LS) ....................................................................................................... 79 
7.3.5 Processing Sentences (PS) ............................................................................................ 80 
7.3.6 Information Sentences (IS) ........................................................................................... 80 
7.4 Experimental Setup ......................................................................................................... 80 
7.4.1 Dataset ......................................................................................................................... 80 
7.4.2 Preprocessing ............................................................................................................... 81 
7.4.3 Training and Testing Set ............................................................................................... 82 
7.4.4 Feature Scaling ............................................................................................................. 82 
7.4.5 Classifiers ..................................................................................................................... 82 
7.4.6 Results and discussion ................................................................................................. 84 
7.5 Conclusions ..................................................................................................................... 85 
8 FORMAL MODELING ...................................................................................... 86 
8.1 Introduction .................................................................................................................... 86 
8.2 Synergy Language ............................................................................................................ 89 
8.2.1 Overview ...................................................................................................................... 89 
8.2.2 Definition ..................................................................................................................... 91 
8.2.3 Example ........................................................................................................................ 97 
8.3 From Synergy to SPN mapping ...................................................................................... 106 
8.4 SPN Synthesis ................................................................................................................ 112 
8.4.1 Sentence grouping ..................................................................................................... 112 
8.4.2 The synthesis of SPNs................................................................................................. 114 
8.5 Conclusions ................................................................................................................... 119 
9 PSEUDOCODE EXTRACTION PROCESS .......................................................... 120 
9.1 Extraction Process ......................................................................................................... 120 
9.2 Results ........................................................................................................................... 125 
9.2.1 Example 1 ................................................................................................................... 126 
9.2.2 Example 2 ................................................................................................................... 130 
9.2.3 Example 3 ................................................................................................................... 132 
9.3 Conclusions ................................................................................................................... 146 
10 CONCLUSIONS-CONTRIBUTIONS .................................................................. 147 
10.1 Limitations and Future work ..................................................................................... 148 
REFERENCES ........................................................................................................ 149 
APPENDIX A ......................................................................................................... 158 
 
LIST OF TABLES 
 
TABLE 2-1. OVERVIEW OF THE METHODOLOGIES USED. ----------------------------------------------------------------- 19 
TABLE 2-2. FEATURES USED IN THE EVALUATION PROCESS. ------------------------------------------------------------ 21 
TABLE 2-3. WEIGHT ASSOCIATED WITH EACH FEATURE. ----------------------------------------------------------------- 22 
TABLE 2-4. MATURITY FOR EACH METHODOLOGY. ------------------------------------------------------------------------ 24 
TABLE 2-5. SCORES OBTAINED FROM THE USER AND THE DEVELOPER. --------------------------------------------- 25 
TABLE 4-1. CAPTION DETECTION EXAMPLE. --------------------------------------------------------------------------------- 36 
TABLE 4-2 FEATURES. --------------------------------------------------------------------------------------------------------------- 37 
TABLE 4-3 SCORES FOR CLUSTERING AND PARAGRAPH IDENTIFICATION. ------------------------------------------ 42 
TABLE 6-1. COSINE SIMILARITY BETWEEN WORDS. ------------------------------------------------------------------------ 59 
TABLE 7-1. ERROR RATE OF THE CLASSIFICATION ALGORITHMS. ------------------------------------------------------ 84 
TABLE 8-1. PARTIAL SENTENCES. ------------------------------------------------------------------------------------------------ 98 
TABLE 8-2. KERNELS. ---------------------------------------------------------------------------------------------------------------- 98 
TABLE 8-3. GROUPED NL TEXT. ------------------------------------------------------------------------------------------------ 112 
TABLE 9-1. NL TEXT OF CONTROL UNIT. ------------------------------------------------------------------------------------- 133 
TABLE 9-2. NL TEXT OF STORAGE UNIT. ------------------------------------------------------------------------------------- 135 
TABLE 9-3. NL-TEXT OF S-BOX UNIT. ----------------------------------------------------------------------------------------- 138 
 
LIST OF FIGURES 
 
FIGURE 2-1. CLASSIFICATION SCHEME. ........................................................................................................... 8 
FIGURE 2-2. PRINTED CIRCUIT BOARDS [8], [9]. .............................................................................................. 9 
FIGURE 2-3. EXAMPLES OF IC [17] [18]. ........................................................................................................ 11 
FIGURE 2-4. EXAMPLE OF VLCI [23]. .............................................................................................................. 12 
FIGURE 2-5. DIGITAL CIRCUIT REPRESENTING A FULL ADDER. ...................................................................... 14 
FIGURE 2-6. EXAMPLES OF BLOCK DIAGRAM [44]. ....................................................................................... 17 
FIGURE 2-7. MATURITY OBTAINED FOR EACH METHODOLOGY. .................................................................. 24 
FIGURE 2-8. COMPARISON OF TOTAL MATURITY AND SCORES OBTAINED FROM P FEATURES. .................. 25 
FIGURE 3-1. ADU METHODOLOGY AND ITS MODALITIES. ............................................................................ 30 
FIGURE 3-2. REVERSE ENGINEERING METHODOLOGY. ................................................................................. 31 
FIGURE 4-1. OVERVIEW OF THE DOCUMENT STRUCTURE ANALYSIS. ........................................................... 35 
FIGURE 4-2. EXAMPLE OF GROUPING WITHIN EACH CLUSTER. .................................................................... 40 
FIGURE 4-3. EXAMPLE OF FIGURE CAPTION WITH MULTIPLE TEXT LINES [65]. ............................................ 40 
FIGURE 5-1. THE GRAPH OF THE KERNEL OF A SENTENCE. ........................................................................... 49 
FIGURE 5-2. BASIC STATE MACHINE AND THE SPN TO REPRESENT THE KERNEL: V CHANGES STATE OF P TO 
P’ AND A TO A’. .................................................................................................................................... 50 
FIGURE 5-3. SPN GRAPHS: SYMBOLIC REPRESENTATION OF KERNELS. ........................................................ 51 
FIGURE 5-4. (A) UNCOMBINED SPN GRAPH OBTAINED FROM THE METHODOLOGY IN [53] OF THE TEXT 
“ONCE THE REGISTERS’ OPERANDS HAVE BEEN FETCHED, THEY CAN BE OPERATED ON BY ALU TO 
COMPUTE A MEMORY ADDRESS, TO COMPUTE AN ARITHMETIC RESULT, OR A COMPARE” [80],(B) 
UNCOMBINED SPN GRAPH OF THE MODIFIED KERNEL. ...................................................................... 52 
FIGURE 5-5. (A) UNCOMBINED SPN GRAPH OBTAINED FROM THE METHODOLOGY IN [53] OF THE TEXT 
“THE MULTIPLEXER WHOSE OUTPUT RETURNS TO THE REGISTER FILE IS USED TO STEER THE 
OUTPUT OF THE ALU OR THE OUTPUT OF THE DATA MEMORY FOR WRITING INTO THE REGISTER 
FILE.” [80], (B). UNCOMBINED SPN GRAPH OF THE MODIFIED KERNEL. ............................................. 53 
FIGURE 6-1. EXAMPLE OF THE BLOCK-TEXT RECOGNITION PROCESS. (A) BLOCK DIAGRAM [41] (B) NL TEXT.
 ............................................................................................................................................................. 59 
FIGURE 6-2. (A) BLOCK DIAGRAM (B) DIAGRAM GRAPH. .............................................................................. 61 
FIGURE 6-3 TEXT SPN GRAPH OF EXAMPLE 2. (A) TEXT-SPN, (B) D-GRAPH. ................................................. 62 
FIGURE 6-4 TEXT SPN GRAPH AFTER SYNTHESIS OF NOUNS. ....................................................................... 62 
FIGURE 6-5. T-SPN AND D-GRAPH OF THE BLOCK DIAGRAM SHOWN IN FIG.1. (A) T-SPN. (B) D-GRAPH (C) 
T-SPN. (D) D-SPN. ................................................................................................................................. 63 
FIGURE 6-6. TESTED BLOCK DIAGRAM. (A) BLOCK-DIAGRAM [80]. (B) NL TEXT. .......................................... 64 
FIGURE 6-7. DIAGRAM GRAPH. ..................................................................................................................... 65 
FIGURE 6-8 TEXT SPN (T-SPN1) GRAPH. ......................................................................................................... 66 
FIGURE 6-9 TEXT SPN (T-SPN2) GRAPH AFTER APPLYING SYNTHESIS OF NOUNS AND NOUN REPLACEMENT.
 ............................................................................................................................................................. 66 
FIGURE 7-1. SUPPORT VECTOR MACHINE. .................................................................................................... 74 
FIGURE 7-2. (A) CELL BODY OF A SIMPLE NEURON (B) ABSTRACT REPRESENTATION OF DEEP NEURAL 
NETWORK. ........................................................................................................................................... 76 
FIGURE 7-3. CONVOLUTIONAL NEURAL NETWORK. IMAGE TAKEN FROM [133]. ........................................ 77 
FIGURE 7-4. MODEL ARCHITECTURE. ............................................................................................................ 83 
FIGURE 8-1. BUILDING BLOCKS OF FUNCTIONAL SPECIFICATIONS. .............................................................. 87 
FIGURE 8-2. FUNCTIONAL SPECIFICATION EXTRACTION PROCESS. .............................................................. 89 
FIGURE 8-3. ILLUSTRATION OF SYNERGY STRINGS AND THE SEQUENCE OF STRINGS. ................................. 91 
FIGURE 8-4. A SIMPLE BLOCK DIAGRAM EXAMPLE. ...................................................................................... 92 
FIGURE 8-5. LEFT: EXAMPLE BLOCK DIAGRAM. IMAGE TAKEN FROM [140]. RIGHT: NL TEXT ASSOCIATED 
WITH THE BLOCK DIAGRAM. ............................................................................................................... 97 
FIGURE 8-6. GRAPH OF BLOCK DIAGRAMS IN FIGURE 8-5. YELLOW RECTANGLES INDICATE AN INPUT TO 
THE BLOCK DIAGRAM AND BLUE INDICATE AN OUTPUT. ................................................................... 99 
FIGURE 8-7. MAPPING FROM SYNERGY LANGUAGE TO A FUNCTIONAL SPECIFICATION BUILDING BLOCK.
 ........................................................................................................................................................... 103 
FIGURE 8-8. MAPPING FROM SYNERGY LANGUAGE TO A FUNCTION DEFINITION BUILDING BLOCK. ....... 104 
FIGURE 8-9. MAPPING FROM SYNERGY TO THE SECOND FUNCTION DEFINITION. .................................... 105 
FIGURE 8-10. MAPPING FROM SYNERGY TO THE THIRD FUNCTION DEFINITION. ...................................... 106 
FIGURE 8-11.MAPPING FROM SYNERGY TO SPN: PROCESSING CASE. ........................................................ 109 
FIGURE 8-12. MAPPING FROM SYNERGY TO SPN: CONDITIONAL CASE. .................................................... 110 
FIGURE 8-13. MAPPING FROM SYNERGY TO SPN: IO CASE. ........................................................................ 111 
FIGURE 8-14. MAPPING FROM SYNERGY TO SPN: FUNCTIONS. ................................................................. 111 
FIGURE 8-15. MAPPING FROM SYNERGY TO SPN: CONNECTIONS.............................................................. 112 
FIGURE 8-16. (A) P1 PARTIALLY MATCHES A2. (B) COMBINED SPN GRAPH OF TWO KERNELS OF TS. ......... 116 
FIGURE 8-17. (A) P1 PARTIALLY MATCHES A2 AND A3 AND I<J<T. (B) COMBINED SPN GRAPH OF THREE 
KERNELS OF TS. .................................................................................................................................. 116 
FIGURE 8-18. (A) SPN REPRESENTATION OF NL KERNELS OF TYPE TS. (B) BASIC SPN GRAPH. (C) ENRICHED 
SPN THAT CONTAINS BOTH SPNS. ..................................................................................................... 117 
FIGURE 8-19. CONTROL OF PARALLEL CONNECTIONS OF TS. ..................................................................... 117 
FIGURE 8-20. (A) NLSI,J IS THE KERNEL OF A NON-TIMING SENTENCE BELONGING TO A GROUP 
CONTROLLED BY NLSI,1.(B) ENRICHED SPN GRAPH OCCURRING WHEN A2 PARTIALLY MATCHES X1 
AND V2 PARTIALLY MATCHES TR1. IN A SIMILAR MANNER WE CAN OBTAIN AN ENRICHED SPN 
GRAPH FOR WHEN A2 AND P2 PARTIAL MATCHES X1 AND X2 IN ANY ORDER, OR ONE OF THE PLACES 
(A1 OR P2) PARTIALLY MATCHES ONE OF THE PLACES (X1 OR X2) AND V2 PARTIALLY MATCHES TR1. 118 
FIGURE 8-21. (A) NLSI,J IS THE KERNEL OF A NON-TIMING SENTENCE BELONGING TO A GROUP 
CONTROLLED BY NLSI,1. (B) ENRICHED SPN GRAPH OCCURRING WHEN A2 PARTIALLY MATCHES X1.
 ........................................................................................................................................................... 119 
FIGURE 9-1. (A) SPN PLACES THAT CONTAINS “I_” PREFIX. (B) PSEUDOCODE OF SPN IN (A). (C) SPN PLACES 
THAT CONTAINS “O_” PREFIX. (D) PSEUDOCODE OF SPN IN (C). (E) SPN GRAPH OF A FUNCTION 
DEFINITION. (F) PSEUDOCODE OF SPN IN (E). (G) SPN GRAPH OF A LINK TRANSITION. (H) 
PSEUDOCODE OF SPN IN (G). ............................................................................................................. 122 
FIGURE 9-2. (A) LINK TO AN SPN GRAPH OF FUNCTION DEFINITION BLOCK (B) PSEUDOCODE OF SPN IN (A). 
(C) A SIMPLE TRANSITION PLACE WITH ONE INPUT PLACE AND ONE OUTPUT PLACE. (D) 
PSEUDOCODE OF SPN IN (C). (E)A SIMPLE TRANSITION CONTROLLED BY AN IF PLACE. (F) 
PSEUDOCODE OF SPN IN (E). ............................................................................................................. 123 
FIGURE 9-3. (A) A COMPLEX SPN GRAPH. (B) TRANSLATION TO PSEUDOCODE. ........................................ 124 
FIGURE 9-4.(A) A COMPLEX SPN GRAPH WITH TIMING INFORMATION. (B) PSEUDOCODE TRANSLATION.
 ........................................................................................................................................................... 125 
FIGURE 9-5. SIMPLE EXAMPLE: (A) BLOCK DIAGRAM (B) NL TEXT. ............................................................. 126 
FIGURE 9-6. TEXT SPN GRAPH. .................................................................................................................... 127 
FIGURE 9-7. DIAGRAM GRAPH. ................................................................................................................... 128 
FIGURE 9-8. FINAL SPN GRAPH. ................................................................................................................... 128 
FIGURE 9-9. PSEUDOCODE OF EXAMPLE 1. ................................................................................................. 129 
FIGURE 9-10. CODE EXAMPLE. .................................................................................................................... 130 
FIGURE 9-11. NL TEXT OF THE BLOCK DIAGRAM SHOWN IN FIGURE 6-6(A). ............................................. 131 
FIGURE 9-12. PSEUDOCODE OF EXAMPLE 2. ............................................................................................... 132 
FIGURE 9-13. GENERAL BLOCK DIAGRAM OF THE APPROACH IN [140]. ..................................................... 133 
FIGURE 9-14. DIAGRAM GRAPH OF CONTROL UNIT. .................................................................................. 133 
FIGURE 9-15. TEXT SPN OF CONTROL UNIT. ................................................................................................ 134 
FIGURE 9-16. SECOND BLOCK DIAGRAM OF THE APPROACH IN [140] AND IS THE STORAGE-UNIT OF THE 
BLOCK IN THE ABOVE FIGURE. ........................................................................................................... 135 
FIGURE 9-17. DIAGRAM GRAPH OF STORAGE UNIT. ................................................................................... 136 
FIGURE 9-18. TEXT SPN OF STORAGE UNIT. ................................................................................................ 137 
FIGURE 9-19. S-BOX RAM BLOCK OF THE ABOVE DIAGRAM [140]. ............................................................ 137 
FIGURE 9-20. TEXT SPN OF S-BOX UNIT. ..................................................................................................... 138 
FIGURE 9-21. DIAGRAM GRAPH OF S-BOX. ................................................................................................. 139 
FIGURE 9-22. FINAL SPN GRAPH OF ONE BLOCK OF S-BOX UNIT. ............................................................... 140 
FIGURE 9-23. PSEUDOCODE FOR S-BOX UNIT. ............................................................................................ 141 
FIGURE 9-24. FLOWCHART OF ‘(I_BLOCK)_256_BYTES_RAM’ FUNCTION. ................................................. 142 
FIGURE 9-25. GENERAL FLOWCHART OF THE ‘S-BOX’ UNIT. ....................................................................... 143 
FIGURE 9-26. SIMPLIFIED VERSION OF PSEUDOCODE OF THE COMPLETE SYSTEM. ................................... 144 
FIGURE 9-27. SIMPLIFIED FLOWCHART OF THE COMPLETE SYSTEM. ......................................................... 145 
FIGURE 9-28. COMPARISON BETWEEN OUR METHODOLOGY AND THE ONE IN [140]............................... 146 
 
 
ACKNOWLEDGEMENTS
 
I owe my deepest gratitude to my supervisor, Prof. Nikolaos Bourbakis, who
provided constant encouragement and support and without whom this work would not have
been possible. He provided guidance and advice while allowing me the freedom
to choose my own directions.
I would like to place on record my sincere thanks to Dr. Soon M. Chung, Dr.
Bin Wang and Dr. Sukarno Mertoguno for accepting and reviewing this work and participating
in the examination committee. In particular, I would like to thank Dr. Mertoguno for his
support through an ONR grant.
I also thank my friends Alexia Papadopoulou and Froso Sarri, who, even with a continent
between us, have provided me with constant support.
Moreover, I would like to specially acknowledge my friend and colleague Argyris
Angeleas, as well as all my friends and colleagues: Spyridon Manganas, Stavros Mallios,
Adamantia Psarologou, Michael Tsakalakis, Zacharias Chasparis and Iosif Papadakis-Ktistakis.
Finally, I would like to extend a special thank you to my family for their love and
constant support throughout these years. Last but not least, I would like to thank Michael
Telcide who, during the last year of my studies, has been supportive and made this arduous
journey easier.
 
 
 
 
 
 
 
This dissertation is dedicated to my parents, for their moral support and for giving
me the opportunity for a better future.
 
 
1 INTRODUCTION

1.1 MOTIVATION
“How does it work?” Most people, at some point in their lives, have asked this
question. In the context of computer systems and applications, an answer to this question
derives from reverse engineering tools and techniques. Reverse Engineering (RE) is the
process of retrieving information from systems without any pre-existing knowledge of the
system’s overall behavior, by exploiting existing knowledge of some of the system’s
components.
This scientific area has been studied for several decades and includes both hardware
and software RE. This work focuses on electronic and digital systems, and particularly
on the technical documents that describe such systems with high-level design details.
The question that arises is whether these documents can be automatically understood and
reverse engineered. Reverse engineering in this case leads to a first-level
simulation of the system described in the technical documents.
The need to obtain functional specifications and design details of existing
products is one reason to follow a reverse engineering direction, but not the most
important one. The scientific endeavor of obtaining the functional characteristics of a
technical document could greatly assist the academic community. Different scientific
publications could be analyzed and compared with other similar methodologies. Furthermore,
designs can be simulated, and important issues could be detected and corrected.
Security is another area in which RE of technical documents can be applied. By
obtaining the functional characteristics and design details of a system, a possible
intellectual property infringement can be detected. Higher-level reverse engineering could
also reduce the computational cost incurred by lower-level circuits, which carry more
design detail. Furthermore, it is less expensive than other approaches, such as chip RE, which
require specific tools to analyze the product.
Most reverse engineering approaches for electronic or digital systems are
applied to the product itself, or to diagrams showing low-level design details. Reverse
engineering of architectural designs appearing in the form of block diagrams has not yet been
implemented, mostly due to the abstract nature of the blocks that appear in the diagram. To
reverse engineer technical documents, a methodology for
automatic deep understanding (ADU) of technical documents is required.
1.2 OUTLINE 
In the following chapters the proposed reverse engineering methodology for ADU
is presented. More specifically, Chapter 2 reviews a significant number of publications in
the area of reverse engineering of electronic or digital systems and defines a metric to
evaluate each system based on a set of features. Chapter 3 gives an
overview of the proposed methodology. Chapter 4 presents the structure analysis method,
which extracts the natural language (NL) text and the diagrams. Chapter 5 presents an improved
SPN-based natural language understanding (NLU) scheme for the extraction of kernels in
sentences. Chapter 6 provides the details for the extraction of the blocks in a diagram
and their modeling as graphs. Chapter 7 provides the classification schemes used for
sentence classification. Chapter 8 presents the Synergy formal language and its mapping to
Stochastic Petri Nets. Chapter 9 provides the details of our pseudocode extraction
methodology. Finally, Chapter 10 provides conclusions and future work.
 
 
2 LITERATURE REVIEW ON REVERSE ENGINEERING OF DIAGRAMS

2.1 INTRODUCTION
Although this review is not limited to reverse engineering of technical documents, it is
confined to the area of electronic or digital systems, where there are several reasons why
reverse engineering is important. The most appealing one is associated with the need to
verify a system or design. At the core of this activity is product verification when
building and improving a system, which is usually accomplished by testing and fault
diagnosis. A related area of reverse engineering is security analysis, where great
importance is given to verifying the integrity of integrated circuits (ICs). Verification
of components obtained from different sources has become a necessity due to malicious
alteration and infringement. Other uses of reverse engineering include repairing or
integrating parts in a system for which no documentation exists.
The main focus of this literature review is reverse engineering of different
kinds of diagrams associated with electronic or digital systems. In general, any
form of visual representation of circuits is considered. A system can be visualized by
different types of diagrams. The end product can be imaged and processed, while the internal
structure of a chip can be seen using, for example, an electron microscope. Different
schematic layouts are generated during the design process and, on a higher level, block
diagrams are used to conceptualize the basic idea of an electronic system. The basic steps
in reverse engineering diagrams are image capture or layout representation, creation of a
netlist representing the connections and associations of the layout, identification and
connection of the components, and finally extraction of the functional characteristics of
those components.
In this review, different methodologies that are used in reverse engineering of
electronic or digital systems are presented and evaluated. This review does not cover
reverse engineering in general, since that is a large area. Our focus is visual reverse
engineering of electronic or digital circuits. A more comprehensive survey of chip
and system reverse engineering can be found in [1], where both reverse and anti-reverse
engineering techniques are described.
This chapter is organized as follows. In section 2.2 a classification scheme of 
diagrams representing a system and their methodologies is presented. In section 2.3 the 
features and a metric that will be used in the evaluation process are presented.  In section 
2.4 we present the evaluation results and in section 2.5 a conclusion of this chapter is stated. 
2.2 CLASSIFICATION SCHEME 
Our classification scheme is presented in Figure 2-1. Printed Circuit Boards (PCB) allow the
interconnection of different electronic components to form an operational circuit [2].
Integrated Circuits (IC) are made from transistors and several other electronic components
placed on a semiconductor wafer [3]. Digital Circuits (DC) consist of a set of
interconnected logic gates created from transistors that implement Boolean logic [4].
Finally, in Block Diagrams (BD), blocks connected by lines represent the relationship
between components of a system [5].

[Figure: tree with root “Diagrams” and four branches: PCB, IC, DC, BD.]
Figure 2-1. Classification Scheme.
2.2.1 PCB 
A PCB usually consists of many layers. In his paper, Grand [6] provides a set of both
destructive and non-destructive techniques used in reverse engineering of PCBs. After
a brief introduction to the PCB fabrication process, techniques to access different
layers of a multi-layer PCB are analyzed. The obtained results are used to characterize and
assess each technique. The components that make up a PCB can usually be identified by the
labels printed on them. Koutsougeras et al. [7] regenerate the PCB by recognizing those
labels. In their paper, information from a real PCB is extracted and the structural
description of the system is generated as HDL code. The implemented system comprises
image analysis of the PCB, generation of a graph representing the PCB’s components, and an
HDL code generator that converts the previously obtained information into an HDL model.

[Figure: two panels, (a) and (b).]
Figure 2-2. Printed Circuit Boards [8], [9].
In the project presented by Johnson [10], a Printed Circuit Board is reverse
engineered by obtaining the board’s netlists. Different methodologies are used to
distinguish between board and copper. These methodologies include the K-means algorithm,
thresholding and various image processing techniques. Connected regions are located on
each layer, as well as component pads, which are then combined to generate the netlist.
Templates provided by a library are used to recognize the components. Each layer is
processed separately and netlists from both layers are combined in order to recognize
connections from both sides.
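The board/copper separation step can be illustrated with a minimal sketch. It assumes a grayscale image in which copper appears brighter than the substrate and clusters pixel intensities with a two-center K-means. The function name `kmeans_1d`, the patch values and the brightness assumption are hypothetical and are not taken from [10].

```python
import numpy as np

def kmeans_1d(values, k=2, iters=20):
    """Cluster scalar intensities into k groups with plain Lloyd iterations."""
    values = np.asarray(values, dtype=float)
    # Spread the initial centers evenly between the darkest and brightest pixel.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        # Assign every pixel to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of the pixels assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = values[labels == c].mean()
    return labels, centers

# Hypothetical 4x4 grayscale patch: dark substrate (~30), bright copper (~200).
patch = np.array([[28,  31, 198, 205],
                  [30,  29, 201, 199],
                  [33, 202, 200,  27],
                  [31, 197,  30,  29]])

labels, centers = kmeans_1d(patch.ravel())
# Assumption: the cluster with the brighter center is copper.
copper_mask = (labels == np.argmax(centers)).reshape(patch.shape)
```

The resulting Boolean mask is the kind of input on which connected regions and pads could then be located; that later step is not shown here.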
Assuring the quality of a PCB plays a significant role in the production of electronic
parts. Naidu et al. [11] introduce the reverse engineering of a two-layer board from images
of the board. Netlist generation of the components within the board is used in error
detection. Wu et al. [12] present an automated inspection system for printed circuit boards
that utilizes machine vision. More specifically, a template matching technique is used for
defect detection by means of subtraction and elimination, and each recognized defect is
subsequently classified into one of seven categories.
 
Mat et al. [13] present PCB track segmentation using mathematical
morphological operations. The image obtained from the PCB undergoes image processing
in MATLAB. The processing consists of several steps, which include binarization, edge
detection, dilation, hole filling and erosion. By selecting a linear structuring element,
characteristics such as the shape and size of the input image are maintained. Logbotham et al.
[14] utilize stereo imaging to assist reverse engineering of Circuit Card Assemblies/Printed
Circuit Boards (CCA/PCB). Four X-ray images of the scene are used to deal with the
correspondence problem. A trace map is reconstructed from the stereo images, making it
possible to distinguish traces on different layers.
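A minimal sketch of three of the morphological steps named above (binarization, dilation and erosion with a linear structuring element) is given below. The work in [13] uses MATLAB; this NumPy version is only an illustration, the helper names `dilate` and `erode`, the threshold of 128 and the pixel values are assumptions, and edge detection and hole filling are omitted.

```python
import numpy as np

def dilate(img, se):
    """Binary dilation: a pixel is set if any pixel under the element is set."""
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=False)
    out = np.zeros(img.shape, dtype=bool)
    for dy, dx in zip(*np.nonzero(se)):
        out |= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def erode(img, se):
    """Binary erosion: a pixel survives only if every pixel under the element is set."""
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=False)
    out = np.ones(img.shape, dtype=bool)
    for dy, dx in zip(*np.nonzero(se)):
        out &= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

# Hypothetical grayscale strip: a horizontal track with a one-pixel break.
gray = np.array([[10,  10,  10, 10,  10,  10, 10],
                 [10, 200, 200, 20, 200, 200, 10],
                 [10,  10,  10, 10,  10,  10, 10]])

binary = gray > 128                   # binarization with an assumed threshold
line = np.ones((1, 3), dtype=bool)    # linear (horizontal) structuring element
closed = erode(dilate(binary, line), line)  # dilation followed by erosion
```

Because the structuring element is a horizontal line, the break in the track is bridged while the track's height and end points are left unchanged.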
Detailed reference to inspection techniques of printed circuit boards is out of the 
scope of this dissertation and a detailed survey is presented in [1]. 
2.2.2 IC 
Information on techniques for IC reverse engineering can be found in [15], where
the state of the art in IC reverse engineering is presented. Nohl et al. [16] present the details
of reverse engineering a cipher from a silicon implementation. The authors’ main focus is
to show the low cost and the degree of automation with which silicon can be reverse
engineered. The process consists of delayering a Mifare RFID tag, photographing it, and
applying template matching to locate different instances of gates. Template matching uses
normalized cross-correlation, and the methodology exposes the vulnerabilities of the
encryption algorithm implemented by the cipher.
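Template matching with normalized cross-correlation can be sketched as follows. This is a brute-force illustration and not the implementation used in [16]; the function names `ncc` and `match_template`, the cross-shaped template and the synthetic image are hypothetical.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation of two equally sized arrays, in [-1, 1]."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    # A flat patch or template has no structure to correlate with.
    return 0.0 if denom == 0 else float((p * t).sum() / denom)

def match_template(image, template):
    """Slide the template over the image and score every position."""
    ih, iw = image.shape
    th, tw = template.shape
    scores = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = ncc(image[y:y + th, x:x + tw], template)
    return scores

# Hypothetical 3x3 "gate" template and an image holding a brighter copy of it.
template = np.array([[0., 1., 0.],
                     [1., 1., 1.],
                     [0., 1., 0.]])
image = np.zeros((6, 8))
image[2:5, 4:7] = 3.0 * template + 0.5   # different gain and offset

scores = match_template(image, template)
best = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
```

Because each window is mean-centered and normalized, the copy is found at its top-left corner `(2, 4)` even though its brightness and contrast differ from the template, which is the property that makes this score suitable for photographs of delayered chips.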

[Figure: two panels, (a) and (b).]
Figure 2-3. Examples of IC [17] [18].
Wilk et al. [19] use temporal logic to specify hardware circuits. The temporal logic
described in their paper is composed of a small number of operators and is considered to
be important in automatic verification systems. Lammens et al. [20] describe a tautology
checker. The Boolean expressions derived from the circuit’s netlist and the Boolean
expressions derived from the behavioral specification can be compared with the tautology
checker, thus providing a verification tool for VLSI systems. Lin et al. [21] describe a
Layout Expert System (LES) for VLSI design that produces a symbolic layout
representation. LES utilizes algorithmic and rule-based techniques in order to provide the
layout representation.
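The idea behind a tautology checker can be sketched by exhaustive enumeration: two Boolean expressions are equivalent when "f if and only if g" holds for every input assignment. The sketch below is not the checker of [20]; the full-adder expressions and all function names are hypothetical, and practical tools typically rely on BDDs or SAT solvers rather than enumeration, which grows exponentially with the number of inputs.

```python
from itertools import product

def equivalent(f, g, n_inputs):
    """True when f <-> g is a tautology, i.e. f and g agree on every input."""
    return all(f(*bits) == g(*bits)
               for bits in product([False, True], repeat=n_inputs))

def xor_gates(a, b):
    # XOR written with AND/OR/NOT, as it might be read off a gate-level netlist.
    return (a or b) and not (a and b)

def sum_from_netlist(a, b, cin):
    # Hypothetical netlist-derived expression for the sum bit of a full adder.
    return xor_gates(xor_gates(a, b), cin)

def sum_from_spec(a, b, cin):
    # Behavioral specification: the sum bit is 1 for an odd number of 1 inputs.
    return (a + b + cin) % 2 == 1

def carry_from_spec(a, b, cin):
    # Behavioral specification: carry out is 1 when at least two inputs are 1.
    return (a + b + cin) >= 2

def carry_buggy(a, b, cin):
    # Deliberately incomplete implementation: ignores the carry-in terms.
    return a and b

netlist_ok = equivalent(sum_from_netlist, sum_from_spec, 3)
carry_ok = equivalent(carry_buggy, carry_from_spec, 3)
```

Here `netlist_ok` is true because the two sum expressions agree on all eight assignments, while `carry_ok` is false because the incomplete carry expression disagrees with its specification, for example when only `a` and `cin` are 1.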
A method to verify the correctness of a VLSI layout circuit by comparing it with a
logic plan circuit in a hierarchical manner is presented in [22]. Equivalent circuits are
derived from the transformation of both layouts and are subsequently compared to establish
the validity of the VLSI circuit.

Figure 2-4. Example of VLSI [23].
A method that relies on imaging while automating the processing is introduced in
[24]. A portion of a semiconductor integrated circuit is examined comprehensively and
information indicating design specifications is obtained. The method includes image
registration, polygon representation with transformation into a specific format, netlist
creation and schematic acquisition.
During the design process of integrated circuits, layout standard cells are used to
incorporate the design. An expert system for VLSI layout diagram reverse engineering is
presented in [25]. In this system, colored VLSI layout diagrams are processed in a way that
allows the extraction of the functional behavior of the diagrams. The system comprises
several components that perform image processing to generate an attributed graph of the
layout shape. A knowledge base, an abstraction mechanism and an inference engine are
implemented to link the layout to higher levels of abstraction.
In the invention introduced in [26], a system to automatically acquire layout
information from multiple fabrication layers of an IC is described. The system includes
means to capture the image of a wafer or die and transform it using image processing
algorithms to obtain specific features of the transistors and the metal interconnects. These
features are later compared to a reference library in order to recognize specific circuits of
the die. A method for reverse engineering the physical layout of CMOS integrated circuits
is introduced in [27]. The paper deals with identifying vias, which are the metal-layer
interconnects of an IC, by using image processing techniques
and taking into consideration their technology rules. Two methods are implemented and
compared: the first is based on blob feature analysis and the second on cross-correlation
analysis.
Singh et al. [28] present a methodology to extract RTL models from transistor
netlists. In their approach, a switch-level simulator is utilized to distinguish between
sequential and combinational circuits. Binary Decision Diagrams are used to represent
the relations and the sets in the algorithm described, and the results are the logic gates and
latches of the circuit that capture the behavior of the system. Another approach that utilizes
logic simulation is presented in [29]. In that paper, a representation function is
constructed to model specifications and is evaluated by the logic simulator. The method
demonstrated can be applied to pipelined circuits.
An approach that differs from the others is presented in [30]. In this paper no
netlists are extracted, as opposed to other methodologies. Instead, reverse engineering of the
IC is used to extract features that describe the IC’s layout, which are subsequently
used as features of a one-class Support Vector Machine (SVM). Images are obtained from
each layer and compared to reference layout grids in order to extract features from area and
centroid differences. The SVM is used to classify ICs as Trojan-free or Trojan-infected based
on those features.
2.2.3 DC 
Reverse engineering of digital circuits has been widely addressed and different
approaches are utilized. As in the case of ICs, the approaches taken relate either to
diagrams representing logic circuits or to netlists describing the connections of the gates
that make up the circuit. In the invention of Chisholm et al. [31], software along with a
methodology is described for reverse engineering of integrated circuits by analyzing
netlists or graphs. In this invention, different tools are utilized that allow semantic and
syntactic matching and their association.
Li et al. [32] provide a tool for deriving word-level structures by processing the
gate-level netlist. They utilize both structural and functional matching techniques to
identify the words. In the structural technique, wires are considered equivalent based on
their shape. In the functional technique, a feasible cut of a wire in the netlist characterizes
the function. A number of techniques are used, including shape hashing, bit-slice
aggregation, symbolic evaluation and Quantified Boolean Formulas. The tool that was
developed includes a graph representation of the extracted words.
 
Figure 2-5. Digital Circuit representing a full adder. 
Hansen et al. [33] identify high-level structures in logic circuits. The ISCAS-85
benchmark is used and several techniques are combined to assist the reverse engineering
process. A program was implemented by the authors that identifies sub-circuits from a
library of known circuits and replaces the sub-circuits with higher-level modules. A
methodology to derive the high-level function of circuits is presented in [34]. In this paper
the high-level functionality is obtained by processing simulation traces of the component’s
netlist. The behavioral patterns mined by this methodology are represented as graphs,
which are subsequently compared to components of an abstract library.
Chowdhary et al. [35] provide an approach that extracts logically equivalent sub 
circuits from a datapath circuit. Two sets of algorithms are implemented that generate tree 
structure and single principal output-graph templates needed to understand their structure. 
Poirot et al. [36] describe a methodology for synthesis of integrated circuits. In this patent 
zero-suppressed binary decision diagrams were used to represent Boolean expressions. 
Boolean expressions are subsequently decomposed by using a candidate Boolean divisor. 
An automated system to extract functional and structural characteristics of digital circuits 
is described in [37]. The system is first represented by an attributed graph and 
subsequently, it is converted to an SPN graph in order to extract the behavior of the 
components of a digital circuit. 
An automatic reader for hand-drawn diagrams is developed in [38]. The
methodologies used include pattern recognition techniques for the circuit diagram. A
two-stage approach that includes loop-structure analysis is utilized to recognize the symbols.
In order to identify the symbols, methodologies such as template matching and feature
extraction are used. The system also includes string recognition of characters and
connection line analysis. A method of recognizing roughly hand-drawn logical diagrams is
presented in [39]. In their paper, multi-layered knowledge is employed by the authors.
Heuristic processing is exploited to compute the certainty factor used to generate the
hypotheses of symbols, based on the concept that segmentation should be flexible.
Subranayan et al. [40] propose a set of algorithms that identify combinational and
sequential components from unstructured netlists of an IC. The tool developed can infer
higher-level netlists consisting of components such as register files, adders and counters. A
range of algorithms that involve bit-slice identification and aggregation, as well as
aggregation of multibit components, is described. Additionally, they include word
identification and word propagation, which lead to module identification and matching against
predefined reference modules. Derived combinational modules are further processed to
generate larger modules. The developed algorithms can also be used to facilitate hardware
Trojan detection by an analyst.
Fisler et al. [41] provide a logical formalization of hardware diagrams. In this paper 
definitions are provided that describe the physical structure of a device, as well as the 
functionality of those components. The basic idea of Hyperproof is applied and inference 
rules of circuit and timing diagrams are combined. A different approach that utilizes AI 
techniques is introduced by Rokach et al. [42]. An algorithm is introduced for computing 
circuit decomposition related to circuit synthesis. The algorithm uses a greedy approach 
for learning function and is based on decision trees. A set of metrics is also introduced to 
measure the performance of decompositions. 
 
2.2.4 BD 
The main problem that we encountered during our research is the lack of 
methodologies in reverse engineering of block diagrams. Nevertheless, an effort has been 
made to provide closely related methodologies [43]. 
 
Figure 2-6. Examples of block diagram [44]. 
Burch et al. [45] express the set of states in a datapath circuit and their transition
relations with Binary Decision Diagrams. Circuits are specified using Computation Tree
Logic, which describes properties of computation paths. Both synchronous and
asynchronous circuits with datapath logic are verified using this methodology.
Examples of the verification of a pipeline design and a stack are provided to support their
methodology. The drawback of this methodology is that it requires knowledge of the
sequence of states occurring at some point of the execution.
Butler et al. [46] provide a semantic model of data flow diagrams (DFD) in printed
documents. After scanning the documents, features of the DFD are extracted and semantic
analysis is performed. The techniques used in the semantic model development include
formalization and semantic inference of DFDs, where the formalization is based on
π-calculus.
Borger et al. [47] present a design approach for constructing formal models of a 
real-life processor by utilizing the reverse engineering task. The formal models refer to the 
block diagram and each component is formally specified as an evolving algebra. Later on, 
these models are combined resulting in composition of datapath components. 
Bunke [48] presents attributed programmed graph grammars and their application
to extracting descriptions from schematic diagrams. Although this constitutes a reliable
method for extracting structural characteristics, no functional descriptions are obtained.
Furthermore, in other papers [49], a logic is demonstrated to support reasoning about
diagrammatic representations of hardware.
The methods and apparatus introduced in this survey could be applied to reverse
engineering of block diagrams. However, an appropriate knowledge base and a formal
logic language to communicate with this knowledge base are required. A formal logical
language that can interact with a knowledge base is presented in [50].
2.3 FEATURES 
This section provides the features that are used in the evaluation process of the 
methodologies and systems introduced in the papers of the previous section. The most 
comprehensive methodologies were included in the evaluation process. Their reference, 
along with their category and a brief description is presented in Table 2-1. 
 

Table 2-1. Overview of the methodologies used.
ID    Ref.   Category   Brief Description
R1    [27]   IC         RE of the chip physical level by observing metal layer interconnects.
R2    [21]   IC         Produces a symbolic layout representation by utilizing algorithmic and rule-based techniques.
R3    [30]   IC         Feature extraction from images for use in a one-class SVM.
R4    [20]   IC         Tautology checker for Boolean expressions obtained from a circuit’s netlist.
R5    [19]   IC         Linear time temporal logic for verification of VLSI systems.
R6    [38]   PCB        Automatic reader for hand-drawn diagrams. Includes pattern recognition techniques, template matching and feature extraction.
R7    [40]   DC         Introduces bit-slice matching and aggregation. Overview of algorithms for counter, shift register and RAM detection.
R8    [41]   DC         Logical formalization of hardware design diagrams.
R9    [25]   IC         Knowledge representation of VLSI layout diagrams by utilizing a hierarchical attributed graph structure and a knowledge base.
R10   [37]   DC         SPN to represent visual data extracted from PCB.
R11   [39]   DC         Method for roughly hand-drawn logical diagrams based on the concept that no strict segmentation should be applied, utilizing many kinds of knowledge.
R12   [45]   BD         Temporal logic model checking algorithm that represents the state graph using BDDs.
R13   [16]   IC         Template matching with normalized cross-correlation to RE a cipher from a silicon implementation.
R14   [48]   BD         Graph grammars for schematic diagram interpretation.
R15   [28]   IC         RTL models are extracted from transistor netlists by utilizing a switch-level simulator.
R16   [42]   DC         Algorithmic analysis of logic synthesis.
R17   [29]   IC         Symbolic switch simulator for verifying pipelined hardware designs.
R18   [7]    PCB        Translation of a graph obtained from a real PCB to Verilog HDL.
R19   [33]   DC         Identification of high-level structures in logic circuits.
R20   [47]   BD         Components of a processor are specified as evolving algebras.
R21   [12]   PCB        Template matching technique for defect detection.
R22   [13]   PCB        Mathematical morphological operations for PCB track segmentation.
R23   [10]   PCB        K-means algorithm and thresholding to obtain the netlists of a PCB.
R24   [14]   PCB        Stereo imaging for trace map reconstruction.
R25   [11]   PCB        Netlist generation of the layers of a PCB to be used in error detection.
R26   [40]   DC         Sequential and combinational component identification from unstructured netlists.
R27   [26]   IC         Layout information is obtained from the fabrication layers.
R28   [31]   DC         Semantic and syntactic matching to RE an IC.
R29   [24]   IC         Image registration, netlist creation and schematic acquisition are utilized to examine a semiconductor.
R30   [22]   IC         A method to verify the correctness of a VLSI layout circuit.
R31   [34]   DC         Simulation traces of a circuit’s netlist are processed.
R32   [32]   DC         Word-level structures are derived by processing gate-level netlists.
R33   [46]   BD         Semantic model derivation for data flow diagrams.
 
A number of features [51] are selected to evaluate those methodologies. Their names
and descriptions, along with their abbreviations, are listed in Table 2-2. These features
were selected based on several characteristics that each methodology should comply with.
These characteristics relate to the performance of the methodologies used, their
computational requirements, the quality of their results and their efficiency. Other
characteristics include user interaction, where systems that require no user intervention
are considered friendlier than those that not only interact with the user but also
require training for users who have no prior knowledge of the diagrams examined. In
addition, complete systems that cover the entire reverse engineering process are rated better
than those for which only the methodology is described.
 
 
 
 

Table 2-2. Features used in the evaluation process.
Feature                     Description
Reliability (P1)            The methodology produces expected results under normal operating conditions [47].
Robustness (P2)             The methodology produces results under extreme conditions.
Complexity (N1)             Expresses the difficulty in implementing a methodology due to a large number of components or associations. Also refers to computational and memory requirements.
Originality (P3)            A novel methodology is presented.
Scalability (P4)            The methodology can process multiple or larger diagrams regardless of the incremental complexity of the system.
Friendliness (P5)           The system offers a user-friendly interface.
Accuracy (P6)               The precision of the results.
Efficiency (P7)             The methodology can achieve the desired results in an efficient way.
Portability (P8)            The implementation of the methodology does not depend on a particular operating system.
Availability (P9)           The methodology is defined well enough to allow re-implementation of the system.
Cost (N2)                   The amount of money needed to use and/or implement the system based on the description provided [47].
Prototype (P10)             The results obtained are a product of a preliminary model of the methodology developed.
Product (P)                 The methodology has been implemented in a commercial setting [47].
Speed (P11)                 Processing time of the methodology presented.
Further Improvements (N3)   Enhancements required in the design.
 
In order to have a more quantitative approach in evaluating the methodologies, a
score is assigned to each feature for each methodology from two perspectives: the
developer’s and the user’s. The minimum and maximum scores that a methodology can
receive are 1 and 5, respectively, and a weight is assigned to the scores relating
to users and to those relating to developers. This weight is different for each feature and
is set independently for the user and for the developer, as shown in Table 2-3.
The features labeled as P features have a positive impact on the overall evaluation
process, while the others have a negative impact. To model this behavior, a maturity metric
is selected in which the sum of the scores relating to negative (N) features is subtracted from
the sum of the scores relating to positive (P) features.
Table 2-3. Weight associated with each feature.
Feature   User weight (wu)   Developer weight (wd)
P1        1                  1
P2        1                  1
N1        0.2                1
P3        0.1                0.9
P4        1                  1
P5        1                  0.9
P6        1                  1
P7        0.5                0.8
P8        1                  0.9
P9        0.4                1
N2        1                  1
P10       1                  1
P11       1                  1
N3        0.4                0.7
 
A total score is derived for each methodology and represented by the following
maturity metric.

$$M = \sum_{i=1}^{N} P_{avg_i} - \sum_{j=1}^{K} N_{avg_j} + P \tag{1}$$
In equation (1), M is the maturity metric, Pavg is the weighted average of the scores of a P
feature obtained from the user’s and the developer’s perspectives, and Navg is the
weighted average of the scores of an N feature. The sums run over the N positive features
and the K negative features. P represents the feature Product, which is scored 5 if a product
of the methodology exists and 0 if it does not, and is independent of the user and the
developer. Equations (2) and (3) define Pavg and Navg, respectively.
\[ P_{avg} = \frac{P_U \, w_U + P_D \, w_D}{w_U + w_D} \tag{2} \]
\[ N_{avg} = \frac{N_U \, w_U + N_D \, w_D}{w_U + w_D} \tag{3} \]
In the above equations, PU and PD are the scores assigned by the user and the 
developer for the positive features, and NU and ND are the corresponding scores for the 
negative features. Finally, wu and wd are the weights representing the importance of each 
score. The weights for each feature are shown in Table 2-3. 
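As an illustration of equations (1) to (3), the following minimal Python sketch computes the maturity metric for one methodology. The weights are copied from Table 2-3 and the scores are the R1 column of Table 2-5; the function names (`weighted_avg`, `maturity`) are illustrative and not part of the original work. The rounding used for Table 2-4 is not stated, so small differences from the published values are possible.

```python
# Weights (w_u, w_d) per feature, copied from Table 2-3.
WEIGHTS = {
    "P1": (1, 1), "P2": (1, 1), "N1": (0.2, 1), "P3": (0.1, 0.9),
    "P4": (1, 1), "P5": (1, 0.9), "P6": (1, 1), "P7": (0.5, 0.8),
    "P8": (1, 0.9), "P9": (0.4, 1), "N2": (1, 1), "P10": (1, 1),
    "P11": (1, 1), "N3": (0.4, 0.7),
}

# Scores for R1 as (developer, user) pairs, copied from Table 2-5.
R1_SCORES = {
    "P1": (4, 4), "P2": (4, 4), "N1": (1, 3), "P3": (2, 3),
    "P4": (4, 3), "P5": (5, 2), "P6": (3, 4), "P7": (4, 3),
    "P8": (4, 3), "P9": (2, 2), "N2": (4, 4), "P10": (2, 1),
    "P11": (1, 1), "N3": (3, 4),
}


def weighted_avg(user, dev, w_u, w_d):
    """Equations (2) and (3): weighted average of user and developer scores."""
    return (user * w_u + dev * w_d) / (w_u + w_d)


def maturity(scores, product_exists):
    """Equation (1): sum of P averages minus sum of N averages plus P."""
    m = 5.0 if product_exists else 0.0  # the Product feature is 5 or 0
    for feature, (dev, user) in scores.items():
        w_u, w_d = WEIGHTS[feature]
        avg = weighted_avg(user, dev, w_u, w_d)
        m += avg if feature.startswith("P") else -avg
    return m


print(round(maturity(R1_SCORES, product_exists=False), 2))
```

Keeping the weights in a single table makes it easy to see how a change in one weight shifts the metric for every methodology.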
2.4 EVALUATION RESULTS 
The final results from the maturity metric for each methodology/system are shown 
in Table 2-4. Table 2-5 shows the scores obtained from the user and the developer for each 
methodology. To make this review easier to follow, we placed Table 2-5 at the end of this 
section. Figure 2-7 displays the maturity that was obtained for each methodology, along 
with the total maximum value, the maximum value obtained from only the user’s 
perspective and the maximum value obtained from only the developer’s perspective.  
Table 2-4. Maturity for each methodology. 
Methodology R1 R2 R3 R4 R5 R6 R7 
M 23.5 26.2 31.6 17 17.2 23.1 27.3 
        
Methodology R8 R9 R10 R11 R12 R13 R14 
M 17.2 25.5 25.3 17.6 21.2 30.4 27.3 
        
Methodology R15 R16 R17 R18 R19 R20 R21 
M 25.7 27.6 16.3 22.2 20.6 22 25.6 
        
Methodology R22 R23 R24 R25 R26 R27 R28 
M 21.3 19.2 20.8 15.9 28 32.1 33.7 
        
Methodology R29 R30 R31 R32 R33 
M 31.2 32.7 25.7 27.3 23.1 
 
 
Figure 2-7. Maturity obtained for each methodology. 
[Bar chart: x-axis Methodology (R1 to R33); y-axis Maturity (0 to 60); series Max, Developer, User.] 
By examining the scores obtained from the evaluation process, we observe that 
systems that have higher scores based on positive features do not necessarily have higher 
total scores. Consider, for example, R13 and R16 in Figure 2-8. R16 has higher scores than 
R13 based on the positive features, but R13 has higher maturity. We also observe that when 
a methodology is implemented to its full extent, or when a complete system is presented, 
it is considered expensive, which leads to lower maturity. 
 
Figure 2-8. Comparison of total maturity and scores obtained from P features. 
 
Table 2-5. Scores obtained from the user and the developer. 
Ref. R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 
Perspective D U D U D U D U D U D U D U D U D U D U D U 
P1 4 4 4 4 4 4 3 3 2 3 4 4 4 4 3 3 4 4 4 3 3 2 
P2 4 4 4 4 4 4 4 3 3 1 3 1 4 1 3 1 4 1 3 1 3 1 
N1 1 3 5 3 2 1 3 2 3 2 4 2 3 1 3 1 3 3 3 2 3 1 
P3 2 3 2 4 4 5 3 1 1 1 3 1 3 1 3 1 3 2 3 1 3 1 
P4 4 3 3 3 5 5 3 3 3 3 4 3 4 3 3 2 4 3 4 3 3 1 
P5 5 2 5 3 5 4 2 2 3 4 4 2 4 3 4 4 4 4 5 4 4 3 
P6 3 4 3 4 5 4 3 2 2 1 4 4 3 3 3 3 3 2 3 3 3 2 
P7 4 3 4 3 4 4 3 3 3 3 4 3 4 4 4 2 4 3 4 3 3 3 
P8 4 3 2 3 4 4 3 2 3 2 3 2 3 3 4 4 3 3 3 3 3 3 
P9 2 2 5 4 5 4 3 2 3 3 4 3 4 3 5 4 4 4 5 4 2 2 
N2 4 4 3 3 2 2 1 2 1 1 2 2 3 3 1 1 2 3 3 4 2 2 
P10 2 1 4 3 5 4 4 3 2 1 2 3 4 2 3 3 3 3 3 3 2 2 
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
P11 1 1 3 3 4 4 1 1 3 2 1 1 3 3 1 1 1 1 1 1 1 1 
N3 3 4 4 4 2 2 5 5 5 5 4 3 3 3 5 5 3 3 3 3 4 4 
Ref R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 
Perspective D U D U D U D U D U D U D U D U D U D U D U 
P1 3 2 3 4 3 4 3 4 4 3 2 1 4 3 3 3 3 2 3 2 3 2 
P2 3 2 4 1 3 1 3 1 3 1 3 1 3 2 3 1 4 2 4 2 2 1 
N1 4 2 3 1 3 1 3 1 5 4 4 2 3 1 3 1 4 3 2 1 2 1 
P3 3 1 3 2 2 3 4 3 5 5 2 3 3 1 3 2 5 4 3 2 2 2 
P4 3 1 3 2 3 2 2 3 3 2 2 3 3 2 4 3 3 1 3 3 4 3 
P5 4 2 5 3 5 4 5 4 5 4 4 2 5 4 4 3 5 4 5 4 5 4 
P6 3 3 4 3 3 4 3 2 3 3 3 2 3 3 3 3 4 3 3 4 3 3 
P7 3 3 4 3 3 4 3 4 4 5 3 4 3 4 3 4 3 3 3 4 3 3 
P8 3 3 3 2 2 2 2 2 4 3 3 3 3 3 3 3 2 1 3 1 3 1 
P9 5 3 5 4 5 4 4 5 5 4 2 3 4 3 2 1 5 4 5 4 5 4 
N2 2 2 2 2 3 3 3 3 3 4 2 3 3 5 3 3 3 3 4 5 2 1 
P10 3 2 3 5 3 3 3 4 5 4 2 3 3 3 3 3 3 2 3 2 1 1 
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
P11 2 2 4 3 4 5 3 3 3 3 3 2 3 3 2 2 1 1 3 3 1 1 
N3 4 3 2 2 3 2 4 3 5 4 5 4 4 3 5 4 4 5 3 3 4 4 
Ref R23 R24 R25 R26 R27 R28 R29 R30 R31 R32 R33 
Perspective D U D U D U D U D U D U D U D U D U D U D U 
P1 3 1 4 3 3 1 4 3 4 3 4 3 2 2 2 3 4 3 4 3 3 3 
P2 2 1 4 2 2 1 5 4 3 3 4 4 4 3 2 3 4 4 4 4 3 2 
N1 2 1 3 2 2 1 5 4 3 3 3 3 2 3 2 2 4 3 3 3 3 3 
P3 2 2 3 1 2 1 4 3 5 4 5 4 5 4 5 4 4 3 3 2 2 3 
P4 3 1 3 1 2 2 4 2 3 3 3 3 2 3 2 3 2 3 3 4 3 3 
P5 5 5 4 4 4 3 4 3 3 3 2 2 4 3 5 4 4 4 4 3 3 2 
P6 3 3 3 3 3 3 3 3 4 3 3 4 3 3 3 3 4 3 3 3 2 3 
P7 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 2 3 3 3 3 
P8 4 3 3 2 5 3 4 3 4 3 2 2 3 3 4 3 3 3 3 2 4 3 
P9 4 4 5 5 2 3 5 4 4 3 4 3 1 1 2 3 4 4 5 3 5 4 
N2 4 3 5 5 3 3 3 3 4 3 2 2 2 2 2 2 3 3 2 3 2 2 
P10 2 1 3 2 1 1 4 3 4 3 4 4 4 3 4 4 3 3 4 3 3 3 
P 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 
P11 1 1 1 1 1 1 2 2 2 2 2 3 2 3 2 2 2 3 2 2 2 2 
N3 4 4 4 3 5 3 2 3 3 3 3 3 2 2 3 3 4 4 3 3 4 3 
 
2.5 CONCLUSIONS 
In this chapter, a summary of reverse engineering of diagrams is presented, and a 
classification scheme is provided for the examined publications. The classification scheme 
resulted in four categories: PCB, IC, DC and BD. Based on the analysis of the referenced 
papers, a maturity metric was developed. The results of the maturity analysis show the 
weak and strong points of each paper; they are not intended to compare the methodologies 
or systems with each other, but rather to compare them to their own potential. Despite the 
significant progress in the area of reverse engineering, our research showed that little 
effort has been directed towards reverse engineering of diagrams. All things considered, 
due to the abstract nature of block diagrams, reverse engineering them will be a challenge.
 
3 OVERVIEW OF THE PROPOSED METHODOLOGY 
Technical documents are composed of different modalities, such as Natural 
Language (NL) text, block diagrams, math formulas, pseudo-code, tables, graphics, special 
features and pictures. Thus, understanding a technical document requires knowledge from 
different scientific domains. In this research effort we focus only on the understanding and 
association of the NL-Text and the system’s diagrams in order to automatically generate 
the system’s operational pseudo-code (functional algorithm). Parts of the NL text describe 
the block diagrams, which provide the general architecture design of a specific system. We 
focus our attention only on technical documents of electronic or digital systems. Block 
diagrams and their associated NL text appearing in technical documents provide all the 
details required to understand how a specific system works and how it is implemented. 
However, this understanding comes naturally only to humans, especially when they have a 
certain amount of knowledge in that area. Building an automatic system that could 
understand technical documents the way that human experts do is challenging and, to our 
knowledge, a system that can provide this capability does not yet exist. 
A methodology for automatic deep understanding (ADU) of technical documents 
is a very challenging problem with many applications and is based on multiple modality 
research areas [52] [43] [142]. It requires knowledge from various domains, like Natural 
Language Processing (NLP), Natural Language Understanding (NLU), Image Processing 
(IP), Pattern Recognition (PR), Image Understanding (IU), Knowledge Representation 
(KR), Functional Modeling (FM), Machine Learning, and others. It reflects the complex 
way that human experts understand and study technical documents, and it resembles a 
reverse engineering approach based on different modalities. Reverse engineering 
methodologies are usually domain dependent and use limited modalities; for example, 
reverse engineering of digital circuits uses image processing, pattern recognition and 
domain knowledge (Boolean logic) to understand the functionality of a specific circuit.  
However, deep understanding of a technical document involves more modalities and 
expertise from different domains. Thus, it is a more difficult and challenging problem, 
especially when the technical documents offer limited descriptions of their diagrams and 
their functionality. 
Figure 3-1 shows the ADU methodology, which consists of several modalities. The 
ADU methodology combines and associates the outcomes from these modalities in order 
to obtain a deeper understanding of the document. In this dissertation we only consider 
modalities related to NL-Text and block diagrams and provide the association between 
them in order to achieve the automatic extraction of functional specifications (operations) 
of the architectural design. 
Figure 3-1. ADU Methodology and its modalities. 
The overview of the Reverse Engineering methodology follows (Figure 3-2). 
It shows the modules participating in the automatic extraction of the functional 
specifications (operations) of the architectural design of a hardware system that is depicted 
in a block diagram and explained with Natural Language text. Note that the functionality 
is expressed in the form of SPN graphs and represented with pseudocode.  
[Figure 3-1 labels: ADU, NL-Text, Figures (block diagrams), Tables, Graphics, Equations, Pictures, Associations.] 
 
Figure 3-2. Reverse Engineering Methodology. 
Structure Analysis extracts the structure of a PDF document. Figures, captions and 
the paragraphs that make up the NL text are extracted and associated with one another.  
In the NLU module, sentences are extracted from the natural language text and are 
used in the extraction of the A->V->P kernel. The A->V->P kernel is used to create a first 
version of the text SPN graph as explained in [53] [54]. The kernel extraction process is 
subsequently improved by eliminating most of the errors in the extraction of kernels. This 
improved A->V->P kernel is used in the creation of the second version of the text SPN 
graph, while a third version of the text SPN graph is created using synthesis of nouns, noun 
and replacement.  
[Figure 3-2 labels: Technical Document, Structure Analysis, DIM, NLU, NL text, Classification, Synergy, Functional Specification, SPN, Diagram Graph, Sentence/Kernels/SPN graph, Category, Knowledge Base.] 
In the DIM methodology, the blocks of the diagram are recognized and the graph of 
the diagram is created. By collaborating with the NL text, erroneous words recognized for 
each block are corrected. As the diagram shows, there is an interaction between the NLU 
and DIM methodologies. By collaborating, these methodologies enhance one another and 
ultimately lead to the extraction of the overall functionality. 
Classification utilizes a Convolutional Neural Network to identify the category that 
each text sentence belongs to. Synergy fuses all this information together and provides a 
Stochastic Petri Net (SPN) graph.  
Finally, the SPN obtained from Synergy is parsed and functional specifications 
relating to the architectural design are obtained. These functional specifications are 
provided in pseudocode as well as in flowchart form. 
 
4 STRUCTURE ANALYSIS 
4.1 INTRODUCTION 
A technical document contains various types of components, such as NL text, 
figures, tables, formulas and others. NL text is also organized in several ways, with title, 
abstract, sections, paragraphs, sentences, etc. The objective here is threefold. First, to 
identify and separate the NL text from the other components and at the same time to obtain 
its structure. Second, to separate from the NL text the captions that relate to specific 
figures. Finally, to associate figures with specific paragraphs of the NL text, which is of 
utmost importance. 
The topic here is related to document layout analysis or document structure analysis, 
which has been extensively studied [55] [56] [57]. This structural information has various 
applications in indexing and information retrieval. Bui et al. [58] extract raw text from 
PDF documents and classify it into different categories through a text classification 
algorithm that follows multi-passive frameworks. However, the use of resources from 
MEDLINE [59] makes this approach inapplicable to us. In [55] a model is described for a 
figure search engine that extracts figures and their associated metadata from PDF 
documents. However, in this approach the PDF is processed as an image, on which an 
image segmentation algorithm is performed to identify text and images.  
Another approach that extracts images and text is provided in [60]. In that 
approach, figure metadata is extracted by utilizing document layout, font information as 
well as lexical features. In [61] a methodology is described that separates tables, formulas 
and pseudocode from natural language. However, this approach requires an annotated 
corpus. 
In this chapter we describe an extraction process to achieve our objectives. 
We show the process of classifying the NL text into body text, figure captions, and 
metadata such as headers, footnotes and other non-essential information. Finally, we 
present the association of captions with their respective paragraphs in the NL text.  
Note that in this thesis we only handle technical documents that are in PDF format, 
and we assume that they are rigorous, concise and well written, following the appropriate 
guidelines. Documents that have structural irregularities, e.g. misplaced captions, will not 
be studied as they may yield irregular results.  
4.2 OVERVIEW 
In this research, we present an approach to extract the PDF structure and associate 
figure components with NL parts. Certain aspects of this approach were borrowed from 
[62] and adjusted to the needs of this research. The PDFBox tool [63] was utilized to 
extract raw text as well as figures from the PDF document. PDFBox is a free, open-source 
tool that has been utilized in several approaches to document structure extraction [60] [64]. 
The overview of the document structure analysis process is shown in Figure 4-1.  
 
Figure 4-1. Overview of the document structure analysis. 
4.3 FIGURE AND NL TEXT EXTRACTION 
In this step, raw text and figures are extracted with the PDFBox tool. We 
assume that all the information can be read directly from the PDF through PDFBox. 
Therefore, no additional steps are required to obtain, for example, the reading order of the 
text within the columns of a page, since it is provided by the tool. Raw text is organized in 
text lines ordered as they appear in the document. 
4.4 CAPTION DETECTION 
In this section, the NL text lines from the previous step are processed to identify 
the text lines that contain figure captions. The first step is to identify the text lines that 
contain the term figure or its lexicographical variations. We then follow the approach in 
[60] to determine the figure id. Text lines that hold a figure id are considered candidate 
figure caption text lines. To later categorize these candidates as figure captions or not, we 
also examine whether they appear immediately after a figure, meaning that no other text 
line exists between the text line holding the figure term and the image.  
[Figure 4-1 stages: Figure & NL Text Extraction, Caption Detection, Feature Selection, Structure Clustering & Classification, Figure Caption and Paragraph Association; intermediate outputs: Candidate Captions, Features, Classes.] 
Furthermore, following the same approach as in [60], lexical features of the text line 
are examined. The basic idea is that in a figure caption a new sentence starts with the text 
line, and the figure id is followed by a noun phrase. In other text lines, however, the noun 
phrase is the figure id itself. We also examine whether the character following the figure 
id is uppercase or lowercase [60]. 
All information obtained in this module will be used as features later in the 
clustering process. To detect the lexical variations, the Stanford Parser was used, as in the 
case of kernel extraction that will be explained in chapter 5. 
Consider for example the three text lines in Table 4-1. The first one is a figure 
caption, while the other two merely reference a figure. Therefore, in the first example, the 
word following the figure id starts with an uppercase letter, a new sentence starts, and the 
figure id is followed by a noun phrase. In contrast, in the other two the figure id is not 
followed by an uppercase letter and the phrase tag is a verb phrase.  
Table 4-1. Caption Detection Example. 
NL text line | Figure Id | Capital Letter | Phrase Tag 
“Fig. 3. This figure shows the system-architecture template used for hardware [65]” | Fig. 3. | Yes | Noun Phrase (NP) 
“From one application to another. Figure 3 shows the system” [65] | Figure 3 | No | Verb phrase (VP) 
“Figure 2 illustrates the different steps in this process. In” [65] | Figure 2 | No | Verb phrase (VP) 
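The two lexical checks that do not need a parser (finding the figure id and testing the case of the character that follows it) can be sketched with a regular expression. This is a minimal illustration under stated assumptions: the pattern `FIGURE_ID` and the function `caption_features` are hypothetical, the pattern covers only "Fig." and "Figure" followed by a number, and the Phrase Tag column is omitted because it requires a syntactic parser.

```python
import re

# Hypothetical pattern: "Fig", "Fig." or "Figure" followed by a number.
FIGURE_ID = re.compile(r"\b(fig(?:ure)?\.?)\s*(\d+)\.?", re.IGNORECASE)


def caption_features(line):
    """Return simple caption-detection features, or None if no figure id is found."""
    match = FIGURE_ID.search(line)
    if match is None:
        return None
    rest = line[match.end():].lstrip(" :.-")
    return {
        "figure_id": int(match.group(2)),
        # A caption is expected to open the text line with the figure id.
        "starts_line": match.start() == 0,
        # Case of the first character after the figure id.
        "capital_letter": rest[:1].isupper(),
    }


for text in (
    "Fig. 3. This figure shows the system-architecture template used for hardware",
    "From one application to another. Figure 3 shows the system",
    "Figure 2 illustrates the different steps in this process. In",
):
    print(caption_features(text))
```

On the three lines of Table 4-1, only the first line is reported with a capital letter after the figure id, in agreement with the Capital Letter column.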
 
4.5 FEATURE SELECTION 
The PDFBox-based figure and NL text extraction step provides, among other things, 
the geometrical characteristics of each text line. Feature extraction then calculates layout 
and font features of the text line. From these geometrical characteristics, the whitespace 
size between text lines is calculated, as well as the font size, the font type, the average line 
length and whether the line appears in bold font.  
These parameters serve as features for the structure classification module. 
The features also include whether the text line is a candidate figure caption, whether the 
figure id is followed by a capital letter and whether the text after the figure id is a noun 
phrase, as obtained from the caption detection module. Table 4-2 summarizes the features 
that make up the feature vector used by the clustering algorithm. 
Table 4-2 Features. 
Feature | Definition 
Whitespace Size | Holds a numerical value and indicates the size of the whitespace between two consecutive text lines. 
Font Size | Holds a numerical value and indicates the font size used for each text line. 
Line Length | Holds a numerical value and indicates the length of each text line in the document. 
Bold Font | Holds a binary value and indicates whether the font is bold. 
Caption | Holds a binary value and indicates whether the text line is a candidate figure caption. 
Phrase Tag | Holds a binary value and indicates whether the text after the figure id is a noun phrase. 1: a noun phrase follows. 0: a verb phrase follows. 
Capital Letter | Holds a binary value and indicates whether the word following the figure id starts with a capital letter. 1: capital letter. 0: not a capital letter. 
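The feature vector of Table 4-2 could be represented as follows. This is only a sketch: the class name `TextLineFeatures`, the field names and the order of the vector components are assumptions made for illustration, and binary features are encoded as 0.0 or 1.0.

```python
from dataclasses import dataclass


@dataclass
class TextLineFeatures:
    """One record per text line, mirroring the rows of Table 4-2."""
    whitespace_size: float
    font_size: float
    line_length: float
    bold_font: bool
    caption: bool
    phrase_tag: bool      # True: a noun phrase follows the figure id
    capital_letter: bool  # True: a capital letter follows the figure id

    def to_vector(self):
        """Flatten the record into a numeric vector for a clustering algorithm."""
        return [
            self.whitespace_size,
            self.font_size,
            self.line_length,
            float(self.bold_font),
            float(self.caption),
            float(self.phrase_tag),
            float(self.capital_letter),
        ]


line = TextLineFeatures(2.5, 9.0, 240.0, False, True, True, True)
print(line.to_vector())
```

In practice the numerical features would probably need scaling before clustering so that line length does not dominate the distance; the source does not state whether this was done.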
 
4.6 STRUCTURE CLUSTERING AND CLASSIFICATION 
The features obtained through the feature selection process compose the feature 
vector used by a clustering algorithm. Here, K-means clustering is used; an extensive 
survey of other text clustering algorithms can be found in [66] [67]. We briefly describe 
the intuition behind K-means clustering. The algorithm requires the number of clusters K 
in advance, and K points (centroids) are selected around which the clusters are built. 
Initially, each data point is assigned to the closest centroid, forming K clusters. In an 
iterative process, the centroids are recalculated and the data points are reassigned to the 
closest centroids until no more reassignment takes place. To avoid the random 
initialization trap, K-means++ is utilized. 
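The procedure described above can be sketched in plain Python. This is a dependency-free illustration of K-means with K-means++ seeding, not the implementation used in this work; the function names and the toy data are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: pick each new centroid with probability proportional
    to its squared distance from the closest centroid chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with an existing centroid
            centroids.append(list(rng.choice(points)))
            continue
        threshold = rng.uniform(0, total)
        acc = 0.0
        for p, d in zip(points, d2):
            acc += d
            if d > 0 and acc >= threshold:
                centroids.append(list(p))
                break
    return centroids


def kmeans(points, k, seed=0, max_iter=100):
    """Assign points to the closest centroid and recompute centroids
    until no point changes cluster."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

The seeding step spreads the initial centroids apart, which is why it reduces the sensitivity to a poor random start.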
Clustering yields an initial structure of the text lines in the document. To 
accurately describe each structure, further processing is required: a methodology that can 
automatically label each cluster. Labeling is based on certain features as well as on the 
number of text lines that each cluster holds. K-means clustering was used with K=7; the 
number of clusters was chosen based on the elbow method [64]. Although our goal is to 
obtain three categories (body text, metadata and figure captions), we use a larger number 
of clusters, because a larger number of clusters helps to better recognize the different 
structures in the document.  
We consider as metadata any part of the natural language text of the document that 
has a descriptive role in the document organization. Thus, blocks of text such as the title, 
authors, table elements, table captions, headers, footnotes and references are part of the 
metadata category. The body text is the natural language text that forms the main contents 
of the document. Finally, figure captions are the groups of natural language text that form 
the captions of figures.  
After obtaining the clusters, a methodology is defined that reduces the initial number 
of clusters to three. Initially, each cluster contains text lines from the NL text that are 
grouped based on our features. The key assumption is that the body text is the largest part 
of the NL text in a document. Therefore, the cluster with the most text lines is labeled as 
body text.  
Furthermore, we label clusters that contain candidate figure captions as the figure 
caption cluster. The number of candidate figure captions in an initial cluster must exceed a 
certain threshold for the cluster to be considered part of the figure caption category. 
Let ci, where i={1,2,…,K=7}, denote the clusters and |ci| denote the number of elements in 
cluster ci; the threshold is set to τ = a*|ci|. After several observations, a is set to 0.75.  
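The labeling rules above can be sketched as follows. The function name `label_clusters` and the input layout (one cluster id and one candidate-caption flag per text line) are assumptions for illustration. The source does not say which rule wins when the largest cluster also exceeds the caption threshold; this sketch assumes that the body text rule takes precedence.

```python
from collections import Counter


def label_clusters(cluster_ids, is_candidate_caption, a=0.75):
    """Map each cluster id to 'body text', 'figure caption' or 'metadata'."""
    sizes = Counter(cluster_ids)
    caption_counts = Counter(
        c for c, cand in zip(cluster_ids, is_candidate_caption) if cand
    )
    body = max(sizes, key=sizes.get)  # the largest cluster holds the body text
    labels = {}
    for c, size in sizes.items():
        if c == body:
            labels[c] = "body text"
        elif caption_counts[c] > a * size:  # threshold tau = a * |c_i|
            labels[c] = "figure caption"
        else:
            labels[c] = "metadata"  # remaining clusters are merged as metadata
    return labels


ids = [0] * 10 + [1] * 4 + [2] * 3
flags = [True] + [False] * 9 + [True] * 4 + [False] * 3
print(label_clusters(ids, flags))
```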
The remaining clusters, which are identified neither as figure captions nor as body 
text, are merged into one cluster that constitutes the metadata. Therefore, the final clusters 
represent the body text, metadata and figure captions. Each cluster is further decomposed 
into smaller groups. Let Pi = {pi1, pi2, …, piGi} denote the set of groups within cluster ci, 
where Gi is the number of groups in the cluster. Each group pij, where i ∈ ℤ, i < |c| and 
j ∈ ℤ, j < Gi, is formed by consecutive text lines of the cluster. Consider for example 
the cluster in Figure 4-2. Consecutive text lines in this cluster are grouped together. Once 
the continuity of the text lines is broken (x≠z), a new group starts. 
[Figure 4-2: a cluster whose text lines (Line 1, Line 2, ..., Line k, Line k+1, Line k+2, ..., Line k+x, Line k+z, Line k+z+1, ..., Line n) are split into groups P2,1 and P2,2.] 
Figure 4-2. Example of grouping within each cluster. 
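The grouping of consecutive text lines within a cluster can be sketched as below; the function name `group_consecutive` and the representation of text lines by their line numbers are assumptions for illustration.

```python
def group_consecutive(line_numbers):
    """Split the line numbers of one cluster into runs of consecutive lines."""
    groups = []
    for n in sorted(line_numbers):
        if groups and n == groups[-1][-1] + 1:
            groups[-1].append(n)   # continuity holds: extend the current group
        else:
            groups.append([n])     # continuity is broken: start a new group
    return groups


print(group_consecutive([1, 2, 3, 7, 8, 12]))
```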
 
Figure 4-3. Example of figure caption with multiple text lines [65]. 
Nonetheless, the figure caption cluster will contain only the first text line of each 
figure caption, although in certain cases a caption consists of more than one text line 
(Figure 4-3). Therefore, the remaining text lines of a figure caption must also be placed 
in the figure caption cluster. To achieve that, the groups in the metadata cluster (P1) are 
examined. If the first element of a group immediately follows a figure caption text line and 
meets certain criteria, then that group becomes part of the figure caption group. Let us 
further define L_{i,j,l}, where i ∈ ℤ, i < |c| and j ∈ ℤ, j < G_i, to be text line l of group j 
in cluster c_i. A group of the metadata cluster then becomes part of the captions’ cluster, 
or not, based on the following criteria: 
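\[ C_{q,r} = \begin{cases} P_{q,j} \in P_{r,h}, & \text{if } L_{q,j,0} = L_{r,h,|G_r|-1} + 1 \text{ and } FDiff_{q,r,j,h} \le a \\ P_{q,j} \notin P_{r,h}, & \text{if } L_{q,j,0} = L_{r,h,|G_r|-1} + 1 \text{ and } FDiff_{q,r,j,h} > a \end{cases} \tag{1} \]
where FDiff_{q,r,j,h} is defined in (2), q, r ∈ ℤ, q, r < |c| and q ≠ r: 
\[ FDiff_{q,r,j,h} = \left| F(L_{q,j,0}) - F(L_{r,h,|G_r|-1}) \right|, \quad q, r \in \mathbb{Z},\ q, r < |c|,\ j \in \mathbb{Z},\ j < G_q,\ h \in \mathbb{Z},\ h < G_r \tag{2} \]
F in this case denotes the font size of a text line, q=1, r=0, and a is set after 
experiments to a = 0.01.  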
The same approach and criteria are used to correct any mislabeled body text 
lines. Thus, a group of the metadata cluster becomes part of the body text cluster, or not, 
also based on criteria (1).  
After re-arranging the elements between the clusters, the body text cluster is further 
regrouped. Groups in the cluster are merged together based on criteria (1), where q=2, 
r=2 and a=0.1. The final groups within the body text cluster are in fact the paragraphs 
forming the body text. Therefore, P2 is the set of paragraphs in the PDF document. 
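Criteria (1) and (2) can be read as a simple test on two groups: the first line of one group must directly follow the last line of the other, and their font sizes must differ by at most a. The sketch below is one possible reading under that assumption; the names `Line` and `should_merge` are hypothetical.

```python
from typing import NamedTuple


class Line(NamedTuple):
    number: int       # position of the text line in the document
    font_size: float  # F in equation (2)


def should_merge(group, target, a):
    """Return True when `group` should join `target`: its first line directly
    follows the last line of `target` and the font sizes differ by at most a."""
    first, last = group[0], target[-1]
    follows = first.number == last.number + 1
    font_diff = abs(first.font_size - last.font_size)
    return follows and font_diff <= a


caption = [Line(10, 8.0)]
continuation = [Line(11, 8.0), Line(12, 8.0)]
heading = [Line(11, 10.0)]

print(should_merge(continuation, caption, a=0.01))  # caption case, a = 0.01
print(should_merge(heading, caption, a=0.01))
```

The same test with a = 0.1 would apply to the merging of body text groups into paragraphs.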
4.7 FIGURE CAPTION AND PARAGRAPH ASSOCIATION 
Paragraphs are further associated with the figure captions. Each figure id is 
examined to identify the paragraphs that mention it. This is simply obtained by examining 
candidate caption lines in the paragraphs and comparing their figure id with the figure id 
of caption lines. 
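A minimal sketch of this association is given below, assuming that the figure ids of the captions and the figure ids referenced in each paragraph have already been extracted; the function name `associate_figures` and the data layout are illustrative.

```python
def associate_figures(caption_ids, paragraph_refs):
    """Map each caption's figure id to the indices of the paragraphs that mention it.

    caption_ids: figure ids found in caption lines.
    paragraph_refs: one set of referenced figure ids per paragraph.
    """
    return {
        fid: [i for i, refs in enumerate(paragraph_refs) if fid in refs]
        for fid in caption_ids
    }


print(associate_figures([1, 2, 3], [{1}, {1, 2}, set()]))
```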
4.8 RESULTS 
We tested 90 randomly selected scientific publications in PDF form from the IEEE 
digital library [68], restricted to the domain of hardware architecture. We manually 
examined the documents, and the results are shown in Table 4-3. We used precision, recall 
and F1-score as evaluation metrics for our results. Precision measures the correct 
predictions of elements in a category with respect to the total number of predicted elements 
for that category. Recall measures the correct predictions of elements in a category with 
respect to the actual elements that belong to that category. The F1 score is the harmonic 
mean of precision and recall.  
We write the formulas for each metric in terms of true positives (TP), true 
negatives (TN), false positives (FP) and false negatives (FN). 
\[ precision = \frac{TP}{TP + FP}, \quad recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times precision \times recall}{precision + recall} \]
Table 4-3 Scores for clustering and paragraph identification. 
Task | Precision | Recall | F1 
Figure caption detection | 94.5% | 89.8% | 92.09% 
Paragraph identification | 93.8% | 89.9% | 91.8% 
Paragraph reference identification | 97.5% | 91.7% | 94.5% 
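The three metrics can be written as short Python functions. As a consistency check, the F1 values of Table 4-3 can be recomputed from the reported precision and recall; the function names are illustrative.

```python
def precision(tp, fp):
    """Correct predictions over all predictions for a category."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Correct predictions over all actual members of a category."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Precision and recall (in percent) as reported in Table 4-3.
for name, p, r in (
    ("Figure caption detection", 94.5, 89.8),
    ("Paragraph identification", 93.8, 89.9),
    ("Paragraph reference identification", 97.5, 91.7),
):
    print(name, round(f1_score(p, r), 2))
```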
Results show that there is room for improvement. However, they are adequate for 
the purposes of this dissertation.  
4.9 CONCLUSIONS 
All in all, each text line in the document was categorized, based on several features, 
into three categories: body text, metadata, and figure captions. K-means clustering was 
first utilized with a higher number of clusters; the above methodology then reduced the 
number of clusters to three and corrected mislabeled lines. Furthermore, we were able 
to obtain a further categorization that refers to the paragraphs of the body text. These 
approaches can work for any technical document, regardless of the scientific domain. 
 
5 AN IMPROVED SPN BASED NLU SCHEME 
5.1 INTRODUCTION 
Our goal is the reverse engineering of technical documents, and understanding 
these documents is necessary to achieve it. Design details of a system are usually revealed 
in an abstract form by block diagrams but become more specific in the NL text. Therefore, 
an NLU scheme is required to assist our efforts in reverse engineering. 
NLU is a subfield of NLP, which has been studied for several decades using a 
variety of techniques. A recent review of the current state of NLP research is given in [69]. 
NLU undertakes the work of obtaining the meaning behind an NL text and represents that 
knowledge in a form understandable by a machine. There are several ways to integrate and 
represent this knowledge [70]. Graphs are a compelling representation formalism 
applicable to language processing. The importance of graph representation of text is shown 
in [71], while a summary of graph-based methods is given in [51]. 
Representing a sentence’s meaning using state machines, and more importantly 
using SPNs, allows the incorporation of the meaning and timing of events/actions 
appearing in an NL text. To our knowledge, only Mills [72] [73] [74] and Psarologou [75] 
[76] [53] have used SPNs for NLU purposes. The former set the foundations of a 
theoretical background, while the latter presents a formal model for representing the 
meaning of an NL sentence as SPN graphs. The work in [53] provides a methodology that 
extracts the kernel from NL sentences. The basic features of an NL text can be represented 
by each kernel (Agent-Verb-Patient). The kernel provides the information of “who” is 
doing “what” to “whom”. A formal language, called Glossa [75], was created to formally 
represent the kernel’s structure, which is subsequently mapped to an SPN state machine. 
The kernels of different sentences are finally combined, which allows events/actions to be 
associated with one another. 
In this chapter, we focus on the work in [53]; we show the challenges and problems 
that occurred in the extraction of kernels and describe the development of an improved 
algorithm for kernel extraction. 
5.2 IMPROVING KERNEL EXTRACTION (A-V-P) 
The algorithm for the extraction of kernels [54] [75] was created based on studies 
of different parse tree structures. As a result, errors occur in the extraction of kernels from 
sentences whose parse tree structures differ from those already studied. These errors made 
the extraction of kernels from complex sentences a challenge and subsequently resulted in 
a very limited association with the diagram graph, which is required to later extract the 
functional specifications of the architectural design. To overcome these limitations, the 
algorithm for kernel extraction was modified and the previously developed methodology 
for kernel extraction was adjusted accordingly. The rest of the features remained 
unchanged. 
Before extracting the kernel, the parse tree is obtained using the Stanford parser 
[77] [78]. The parser output contains tags for each word in the sentence. Table 1 and 
Table 2 of appendix A show the Part of Speech (POS) and chunk tags, respectively. 
5.2.1 Rules for the kernel extraction  
Several steps are followed for the extraction of kernels, which are analyzed in [53]. 
We modified this algorithm, and the steps of the modified algorithm are as follows: 
1) “Nouns and their corresponding connections appearing before a verb phrase (VP) 
are collected” [53]. 
2) “Verbs and their corresponding connections are collected” [53]. 
3) “Nouns and their corresponding connections appearing in a verb phrase (VP) are 
collected” [53]. 
4) Closing of a verb phrase (VP) initiates the kernel formation. 
5) Kernel formation examines the voice of the verb: 
a) “Active voice of the verb leads in nouns before the verb phrase to be considered 
as agents and all the nouns in the verb phrase to be considered as patient” [53]. 
b) “Otherwise, passive voice of the verb leads in swapping agents with patients if 
they are followed by a preposition” [53]. 
6) Saves the kernels [53].  
Our modification of the algorithm concerns the fourth rule of [53]. The previous 
rule, as stated in [53], had two parts that affected the formation of the kernels, both 
dependent on the voice of the verb. We removed the first part of this rule, since forming 
kernels only when the very first verb phrase was closed leads to grouping verbs together 
as one, which subsequently led to wrong associations between agents, verbs and patients. 
Grouping of verbs added ambiguity in discerning the correct triplet (A-V-P), that is, which 
agent and patient belong to which verb. Regardless of the voice of the verb, forming the 
kernels the moment the verb phrase is closed helps to extract the correct A-V-P. 
Additionally, dissociating the start of the kernel formation from the verb’s voice enables 
us to form the kernels consisting of the inner verbs first, and later associate the outer verb 
phrase with each of the kernels formed from the inner verbs. 
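The effect of forming a kernel the moment a verb phrase closes can be illustrated with a heavily simplified sketch. It is not the algorithm of [53] or its modified version: it assumes a parse tree given as nested tuples `(label, child, ...)` with leaves `(POS tag, word)`, treats every verb as active voice, and ignores the additional rules and the later association of outer and inner kernels. The names `extract_kernels` and `walk` and the example tree are hypothetical.

```python
def extract_kernels(tree):
    """Collect (agents, verb, patients) triplets; a kernel is formed whenever
    a VP node is closed, so kernels of inner VPs are produced first."""
    kernels = []
    agents = []  # nouns seen outside any verb phrase

    def walk(node, frame):
        label = node[0]
        if len(node) == 2 and isinstance(node[1], str):  # leaf: (POS, word)
            word = node[1]
            if label.startswith("NN"):
                (frame["patients"] if frame is not None else agents).append(word)
            elif label.startswith("VB") and frame is not None:
                frame["verbs"].append(word)
            return
        if label == "VP":
            inner = {"verbs": [], "patients": []}
            for child in node[1:]:
                walk(child, inner)
            if inner["verbs"]:  # closing the VP triggers kernel formation
                kernels.append(
                    (tuple(agents), " ".join(inner["verbs"]), tuple(inner["patients"]))
                )
        else:
            for child in node[1:]:
                walk(child, frame)

    walk(tree, None)
    return kernels


# "The controller reads data to update the cache" (hand-written parse tree).
tree = ("S",
        ("NP", ("DT", "The"), ("NN", "controller")),
        ("VP", ("VBZ", "reads"),
               ("NP", ("NN", "data")),
               ("S", ("VP", ("TO", "to"),
                            ("VP", ("VB", "update"),
                                   ("NP", ("DT", "the"), ("NN", "cache")))))))

for kernel in extract_kernels(tree):
    print(kernel)
```

Because each VP keeps its own list of verbs and patients, the verbs "reads" and "update" are not grouped together, which is the ambiguity the modification is meant to avoid.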
In addition to the aforementioned algorithm, a set of extra rules is taken into 
consideration for the extraction of kernels. These rules are generally considered 
assumptions; below we refer to each of the assumptions obtained from [53] and, where 
appropriate, point out their problems and present their modifications. 
1) Phrases (in a sentence) built around verb forms such as “by performing” or “for 
updating” are considered explanatory phrases. These phrases provide a more detailed 
explanation of the previous kernel and refer to the preceding agent. Phrases of the 
category “by doing” usually answer the question of how an agent is performing a 
specific action, and phrases of the category “for doing” usually answer the question 
of why an agent is performing a specific action [53]. 
2) Phrases starting with the word “which” or “that” are processed as regular sentences, 
but since they are a continuation of their preceding sentence they have as an agent 
the patient of their preceding action (previous kernel). 
3) If no agents or patients are available, then adjectives, if any, are considered 
agents or patients [53]. 
4) When the prepositions ‘before’ or ‘after’ precede nouns, these nouns are considered 
agents or patients. The prepositions ‘before’ and ‘after’ provide important timing 
information. 
5) “If a modal auxiliary verb precedes a verb, then it is considered part of the verb” 
[53]. 
6) Prepositions that precede the verb and belong to specific prepositional verbs are 
also considered part of the verb [53]. 
Besides modifying assumptions that were considered in [53], we added a few 
extra rules (assumptions). These assumptions are as follows: 
7) If a verb phrase immediately follows an infinitival to (TO tag), then the sentence is 
considered an explanatory phrase. These phrases usually refer to the preceding 
patient. 
8) Personal pronouns that are not resolved by Anaphora Resolution (AR), as suggested in 
[53], are replaced with the preceding agent (agent of the previous kernel). 
9) If a main verb (action) has an explanatory verb, then when another explanatory 
phrase is detected it becomes a main verb having as an agent the patient of the 
explanatory phrase, provided that they don’t have any conjunction between them. 
10) When parentheses are detected, then depending on their contents we have two cases: 
a) If no verb exists inside the parentheses and two or more nouns are detected, then 
a new kernel is created having as agent the noun(s) preceding the parentheses 
and as patients the nouns inside the parentheses. A verb (action) is placed 
between them. 
b) If a verb exists inside the parentheses, then a new kernel is formed based on the 
contents of the parentheses following the rules described so far. An additional 
agent is added, obtained from the noun(s) preceding the parentheses. 
11) If a kernel has no agent, then the patient of the preceding kernel becomes the agent. 
12) When multiple verb phrases are nested in other verb phrases in the parse tree then 
the kernel that is created from the verb phrase that is in the highest level in the parse 
tree, has as patients the agents of each kernel that might be created from the nested 
verb phrases. 
13) Adjectives that belong to one of the words detected in the diagram are also 
considered agents or patients. 
Information about adjectives and cardinal numbers is stored for each kernel. 
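As an illustration of how such assumptions can be applied after kernel formation, the following sketch implements assumption 11 only. The dictionary layout of a kernel and the function name `fill_missing_agents` are hypothetical, and the two example kernels are made up for the demonstration.

```python
def fill_missing_agents(kernels):
    """Apply assumption 11 to a list of kernels given in sentence order.

    Each kernel is a dict with the keys "agents", "verb" and "patients"
    (a hypothetical layout chosen for this example).
    """
    previous = None
    for kernel in kernels:
        if not kernel["agents"] and previous is not None:
            # Inherit the patients of the preceding kernel as agents.
            kernel["agents"] = list(previous["patients"])
        previous = kernel
    return kernels


kernels = fill_missing_agents([
    {"agents": ["ALU"], "verb": "computes", "patients": ["memory_address"]},
    {"agents": [], "verb": "is_stored", "patients": ["register_file"]},
])
```

After the call, the second kernel has "memory_address" as its agent, while the first kernel is left unchanged.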
5.2.2 Extracting the A->V->P Kernel and Converting NL Sentences 
into Graphs [54] [79] 
After extracting the A->V->P kernel of the sentence each part of the kernel (agent 
-> action (verb) -> patient) is associated based on the graph shown in Figure 5-1. 
Agent -> Action -> Patient 
Figure 5-1. The graph of the kernel of a sentence. 
The graph of the sentence is constructed based on the kernel. More specifically the 
agent node is connected to the verb/action node and the verb/action node is connected to 
the patient node. 
5.2.3 Converting NL Graphs into SPN graphs [54] [79] 
Understanding natural language sentences requires representing them in a way 
that allows the expression of timing and associations of actions. Therefore, the graph form 
of an NL sentence is converted to SPN graphs. We will refer to the SPN graphs obtained 
from the NL text as text SPN graphs. SPNs consist of places and transitions. Transitions 
are connected to places, and when a transition fires the places change their states (Figure 
5-2). As far as kernels are concerned, agents and patients are mapped to places in an SPN 
graph, while verbs are mapped to transitions. Each agent and patient has two states, before 
and after the firing of a verb/transition. As we can see from Figure 5-2, when a verb fires at 
time T1, the agent (patient) at time T0 changes its state. 
  
Figure 5-2. Basic state machine and the SPN to represent the kernel: V changes 
state of P to P’ and A to A’. 
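The mapping of a kernel to places and a transition can be sketched as a tiny Petri net. This is a minimal illustration, not the tool of [53]: the function names `kernel_to_net` and `fire` and the dictionary layout are hypothetical, and the stochastic firing delays of an SPN are left out so that only the state change is shown.

```python
def kernel_to_net(agent, verb, patient):
    """Map a kernel A-V-P to four places and one transition.

    The places hold token counts: one token in A and P before firing,
    and one token in A' and P' after firing.
    """
    places = {agent: 1, patient: 1, agent + "'": 0, patient + "'": 0}
    transition = {
        "name": verb,
        "inputs": [agent, patient],
        "outputs": [agent + "'", patient + "'"],
    }
    return places, transition


def fire(places, transition):
    """Fire the transition if every input place holds a token."""
    if any(places[p] < 1 for p in transition["inputs"]):
        return False  # transition is not enabled
    for p in transition["inputs"]:
        places[p] -= 1
    for p in transition["outputs"]:
        places[p] += 1
    return True


places, verb = kernel_to_net("A", "V", "P")
fired_once = fire(places, verb)    # moves the tokens to A' and P'
fired_twice = fire(places, verb)   # no tokens left in A and P
```

Firing V once moves the tokens from A and P to A' and P'; a second firing is not enabled because the input places are empty.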
When the number of kernels becomes large, a symbolic visual 
representation of the kernel is introduced, where one place embodies both states (before 
and after the firing of a verb) of an agent (patient). This representation is shown in Figure 
5-3. 
Figure 5-3. SPN graphs: symbolic representation of kernels. 
5.2.4 Text SPN Examples 
In this section, we will provide different examples of complex sentences and 
compare our modified extraction of kernels with the previous one. The results are shown 
in the form of SPN graphs, where the rules for the mapping of kernels to SPNs are provided 
in [53]. 
Figure 5-4a shows the uncombined SPN graph of the following text: “Once the 
registers’ operands have been fetched, they can be operated on by ALU to compute a 
memory address, to compute an arithmetic result, or a compare”, obtained from the 
methodology in [53]. Figure 5-4b shows the uncombined graph of the same text from our 
methodology. We can observe in Figure 5-4a that in the previous methodology a 
kernel that should have been extracted as one (“register operands have been fetched”) is 
broken into two pieces. Figure 5-4b shows that this problem is removed. Furthermore, we 
can observe in Figure 5-4a the existence of a personal pronoun that has not been replaced 
by the AR. Figure 5-4b shows how such a case has been resolved as stated in assumption 
8 above. 
Figure 5-4. (a) Uncombined SPN graph obtained from the methodology in [53] of 
the text “Once the registers’ operands have been fetched, they can be operated on by 
ALU to compute a memory address, to compute an arithmetic result, or a compare” 
[80]. (b) Uncombined SPN graph of the modified kernel. 
Figure 5-5a shows the uncombined SPN graph of the following text: “The 
multiplexer whose output returns to the register file is used to steer the output of the ALU 
or the output of the data memory for writing into the register file.”, obtained from the 
methodology in [53]. Figure 5-5b shows the uncombined graph of the same text from our 
methodology. Again, in such sentences the previous methodology would split each detected 
noun into a different patient and each one would have an “and” conjunction. We can see 
that our methodology (Figure 5-5b) combines the nouns in an appropriate way and any 
other problem is resolved. 
Figure 5-5. (a) Uncombined SPN graph obtained from the methodology in [53] of 
the text “The multiplexer whose output returns to the register file is used to steer the 
output of the ALU or the output of the data memory for writing into the register file.” 
[80]. (b) Uncombined SPN graph of the modified kernel. 
5.3 PARTIAL SENTENCES 
For reasons that will become clear later, NLU splits sentences based 
on the number of kernels that are detected. The resulting pieces constitute the partial sentences 
and they are used in the sentence classification that will be shown in chapter 7. To better 
understand partial sentences, consider for example the sentence “The S-Box is initialized 
linearly, when the reset state occurs”. This sentence consists of two kernels, K1= (S-Box)-
(is)-(linearly) and K2= (Reset_state)- (occurs). Therefore, the sentence is split into the 
partial sentences “The S-Box is initialized linearly” and “when the reset state occurs”. 
5.4 CONCLUSIONS 
In this chapter, challenges and problems that occurred in the extraction of kernels 
in previous work were analyzed. These challenges led to an improved algorithm for kernel 
extraction that allows processing of more complex sentences. Furthermore, NLU 
splits sentences into partial sentences. 
6 THE DIAGRAM MODEL 
Block diagrams have an abstract nature and obtaining their functionality poses a 
challenge. Furthermore, all techniques referenced are very useful in extracting the diagram 
but are not applicable in our case, which requires diagram understanding. Our research 
showed that little effort has been directed towards understanding a block diagram in a way 
that leads to a functional description. 
6.1 BLOCK DIAGRAM DETECTION 
DIM receives the captions from structure analysis along with all the figures 
detected. However, figures and their associated captions need to be extracted and 
classified into block diagrams and non-block diagrams. Therefore, we will show how 
figures that contain block diagrams are distinguished from other types of 
figures. 
First, caption text lines are processed and a caption referencing a block diagram is 
detected based on the appearance of certain keywords in that caption or captions. The 
respective figures for these captions are classified as block diagrams. Next, the identified 
block diagrams are processed as mentioned above and the recognized words serve as new 
keywords to examine the remaining captions. If the remaining captions contain these new 
keywords, then the figures associated with these captions are also considered block 
diagrams. Steps one through three summarize the aforementioned process. 
Step 1: Initial block diagrams are selected by the appearance of keywords in their 
captions, e.g. “block diagram”, “architecture”. 
Step 2: Identified diagrams are processed and text abbreviations form new 
keywords. 
Step 3: Repeat step 1 and step 2 until no other captions are detected. 
From these steps a hierarchy between the block diagrams is also obtained: 
if keywords of a figure appear in the caption of another figure, then the second 
block diagram is part of the first block diagram. 
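The three steps and the resulting hierarchy can be sketched as follows. The sketch assumes that captions and the block-text of each figure are already available as dictionaries; the function name `detect_block_diagrams`, the figure identifiers and the example captions are hypothetical, and simple substring matching stands in for whatever keyword matching the actual implementation uses.

```python
def detect_block_diagrams(captions, block_text,
                          seeds=("block diagram", "architecture")):
    """Return {figure: parent figure or None} for figures judged block diagrams.

    captions:   {figure id: caption text}
    block_text: {figure id: list of words recognized inside that figure}
    """
    keywords = {seed.lower(): None for seed in seeds}  # keyword -> source figure
    parent = {}
    changed = True
    while changed:                                     # Step 3: repeat
        changed = False
        for figure, caption in captions.items():
            if figure in parent:
                continue
            text = caption.lower()
            for keyword, source in list(keywords.items()):
                if keyword in text:                    # Step 1: keyword in caption
                    parent[figure] = source            # hierarchy link
                    for word in block_text.get(figure, []):
                        # Step 2: block-text of the diagram forms new keywords.
                        keywords.setdefault(word.lower(), figure)
                    changed = True
                    break
    return parent


captions = {
    "Fig. 1": "Block diagram of the processor",
    "Fig. 2": "Datapath of the ALU",
    "Fig. 3": "Simulation waveforms",
}
block_text = {"Fig. 1": ["ALU", "PC"], "Fig. 2": ["Adder"]}
hierarchy = detect_block_diagrams(captions, block_text)
```

Here "Fig. 1" is detected through a seed keyword, "Fig. 2" is detected through the block-text "ALU" of "Fig. 1" and therefore becomes its child, and "Fig. 3" is not classified as a block diagram.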
6.2 DIAGRAM EXTRACTION 
The information presented in technical documents is not only in text form but in 
the majority of cases also includes block diagrams that represent either the overall structure 
of the system or parts of subsystems. Block diagrams provide minimal details since they 
are composed of a set of interconnected blocks and each block is described by a text 
alluding to its association with a system component. 
To detect each block in the diagram we followed the idea of Li and Dori [81]. They 
developed the Sparse Pixel Vectorization Algorithm (SPV) to track lines in a diagram. Due 
to the nature of block diagrams, where the image consists of line segments connected to 
one another, SPV algorithm seemed like a suitable choice [82].  
During the extraction procedure, we track each line using the SPV algorithm from 
end to end, in order to first recognize the blocks and then follow their connections. The text 
abbreviations in image blocks are recognized by using the Tesseract OCR Engine [83]. 
From now on we will refer to this text as block-text. 
To account for imperfections in the image we apply morphological operations. We 
first apply the opening operator, using a rectangle as the structuring element. The size of the 
structuring element is different for the two operations: a 5x5 structuring element was used 
for erosion and a 3x3 structuring element was used for dilation. 
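The opening operation (erosion followed by dilation) can be illustrated on a small binary image. The sketch below is written in plain Python for readability and is not the implementation used in this work; it assumes that foreground pixels are encoded as 1, that pixels outside the image count as background, and the function names are hypothetical.

```python
def erode(image, size):
    """Keep a pixel only if the whole size x size window around it is foreground."""
    h, w, r = len(image), len(image[0]), size // 2
    return [[int(all(
                0 <= y + dy < h and 0 <= x + dx < w and image[y + dy][x + dx]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
             for x in range(w)] for y in range(h)]


def dilate(image, size):
    """Set a pixel if any pixel in the size x size window around it is foreground."""
    h, w, r = len(image), len(image[0]), size // 2
    return [[int(any(
                0 <= y + dy < h and 0 <= x + dx < w and image[y + dy][x + dx]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
             for x in range(w)] for y in range(h)]


def opening(image, erosion_size=5, dilation_size=3):
    """Erosion with a 5x5 element followed by dilation with a 3x3 element."""
    return dilate(erode(image, erosion_size), dilation_size)


# 9x9 test image: a filled 5x5 square plus one isolated noise pixel.
image = [[1 if 2 <= y <= 6 and 2 <= x <= 6 else 0 for x in range(9)]
         for y in range(9)]
image[0][8] = 1
opened = opening(image)
```

In this example the isolated pixel disappears, while the square survives as a smaller 3x3 region because the erosion element is larger than the dilation element.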
6.3 DIAGRAM GRAPH MODEL 
To obtain the graph of the diagram, blocks are considered as nodes and connections 
are considered as arcs. Each node holds additional information that includes the 
block-text, its position in the image and, in general, any information that allows the 
re-generation of the image. Overall, the graph provides the structural description of the 
block diagrams. 
6.4 DIAGRAM TEXT RECOGNITION 
As already mentioned, to recognize the block-text we used the OCR Engine 
provided by Tesseract. Although this engine performs several image processing operations 
internally before conducting the actual OCR, there are conditions that limit the accuracy of 
the results. These conditions can vary from low image quality to very unusual fonts. To 
overcome these limitations, several operations are performed that involve the original NL 
text. For our purposes it is highly important to correctly recognize the block-text, since 
it provides a direct link to the NL text describing the technical document. 
After obtaining an initial version of the block-text, a comparison is performed with 
the NL text. Erroneous words obtained from the OCR engine are corrected to match the 
words shown in the NL text. Block-text may contain either one word or a sentence. We 
measure the similarity of each word in the block-text with the words in the NL text. To 
measure the similarity between the characters of words we employ the cosine similarity. More 
specifically, words from the NL text and from the image are converted to Boolean vectors, and 
for each vector pair we calculate the cosine similarity. The word that has the highest cosine 
similarity is selected to replace the erroneous word obtained from the image. Furthermore, 
words located next to each other in the NL text are also given higher similarity than words 
located in scattered positions. The correct word is the one from the NL text that has the best 
similarity. The formula for the cosine similarity is provided below. 
$$\text{similarity} = \cos(\theta) = \frac{\text{word\_text\_vector} \cdot \text{diagram\_word\_vector}}{\lVert \text{word\_text\_vector} \rVert \, \lVert \text{diagram\_word\_vector} \rVert}$$
To clarify the steps performed during the block-text recognition process we will 
use as an example the diagram shown in Figure 6-1. Assuming that the OCR engine 
gives “Conlmol” for the first block and “Address Genet-lor” for the second block, then by 
comparing each word with the NL text we obtain the cosine similarities shown in Table 
6-1. We select the words that have the maximum similarity. For “Conlmol” we select 
“Control” and for “Address Genet-lor” we select “Address Generator”, which are the 
correct words of the block diagram. 
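The word-correction step can be sketched as follows. The exact construction of the Boolean vectors is not given here, so the sketch assumes one simple choice: a vector with one entry per letter or digit that is 1 if the character occurs in the word. With this assumption the numbers differ from those in Table 6-1, but the selected words are the same. The names `to_boolean_vector`, `cosine_similarity` and `correct_word` are hypothetical.

```python
import math
import string

# Assumed character inventory for the Boolean vectors (letters and digits).
ALPHABET = string.ascii_lowercase + string.digits


def to_boolean_vector(word):
    """1 if the character occurs in the word (case-insensitive), else 0."""
    chars = set(word.lower())
    return [1 if c in chars else 0 for c in ALPHABET]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correct_word(ocr_word, nl_words):
    """Return the NL word whose vector is most similar to the OCR word."""
    target = to_boolean_vector(ocr_word)
    return max(nl_words,
               key=lambda w: cosine_similarity(target, to_boolean_vector(w)))


nl_words = ["Control", "Module", "generates", "appropriate",
            "signals", "Address", "Generator"]
corrected = [correct_word(w, nl_words)
             for w in ["Conlmol", "Address", "Genet-lor"]]
```

A character-presence vector ignores character order and repetition, which is one reason why ambiguous cases such as “stackl” and “stackZ” need the additional handling described later in this section.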
 
Figure 6-1. Example of the block-text recognition process. (a) Block diagram 
[41] with the blocks “Control” and “Address Generator”. (b) NL text: “Control module 
generates the appropriate signals for the Address Generator.” 
Table 6-1. Cosine Similarity between words. 
NL words       Block-text character matches 
               Conlmol    Address    Genet-lor 
Control        0.84       0          0.47 
Module         0.37       0.37       0 
generates      0.33       0          0.66 
appropriate    0          0.42       0 
signals        0          0.37       0 
Address        0          0.99       0 
Generator      0.33       0          0.81 
 
There are conditions where the methodology might be in a quandary over which word 
to select from the NL text. Consider for example the words “Stack1” and “Stack2”, for which the 
OCR engine recognizes from the diagram the words “stackl” and “stackZ”, respectively. 
Both words have the same similarity when compared to the words “Stack1” and “Stack2” of 
the NL text, which makes the selection of the word difficult. To overcome this uncertainty, 
we follow the methodology presented in [67] in order to recognize the characters that are 
difficult to differentiate. 
The orientation of the text describing a block in a block diagram can be 
either vertical or horizontal and, in rare cases, diagonal. To obtain more accurate results, the 
block-text is examined in both the horizontal and the vertical direction, and with the previously 
described procedure we select the best match. 
6.5 FILTERING 
A lot of information appearing in the NL text might not be relevant to the functional 
specifications of the system. Therefore, we chose to exclude NL sentences that do not 
contain any information relating to block diagrams. To achieve that, all words that appear 
in the block diagrams form a keyword set. This keyword set determines whether a 
sentence is included or not. More specifically, if a sentence contains at least one of the 
keywords then it is included; otherwise it is excluded. 
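The filtering rule can be sketched in a few lines. The function name `filter_sentences`, the tokenization by a regular expression and the second example sentence are assumptions made for illustration.

```python
import re


def filter_sentences(sentences, block_words):
    """Keep the sentences that contain at least one block-diagram keyword."""
    keywords = {word.lower() for word in block_words}
    kept = []
    for sentence in sentences:
        # Lower-case the sentence and split it into alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if tokens & keywords:
            kept.append(sentence)
    return kept


sentences = [
    "ALU updates to DCache.",
    "This section gives some historical background.",
]
kept = filter_sentences(sentences, ["ALU", "DCache", "PC"])
```

Only the first sentence is kept, because it mentions the block names “ALU” and “DCache”.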
6.6 DIAGRAM GRAPH 
The graph of the diagram is a visual representation of the block diagram; the 
conversion is quite straightforward. Blocks become nodes of the graph. Inputs and outputs 
are drawn as small yellow and blue rounded rectangles, respectively, and are connected to 
the node. Connections between blocks are translated into connections between nodes, where 
signal names appearing in the connection are placed on the arc. Figure 6-2 provides a graph 
of an example block diagram. 
 
 
 
Figure 6-2. (a) Block diagram. (b) Diagram graph. 
6.7 SYNTHESIS OF NOUNS IN TEXT SPN GRAPH 
To provide the design rationale we will work on the example shown in Figure 6-3. 
In this example, the text SPN graph (Figure 6-3a) has one agent “block” and four patients 
named “block1”, “logic”, “block2” and “logic”. If we examine the diagram graph (Figure 
6-3b) we can see that “block” is connected to the two nodes “Block1 Logic” and “Block2 
Logic”. Therefore, we synthesize the patients of the text-SPN based on the connections of 
the diagram graph. The synthesis results in a new text SPN graph as shown in Figure 6-4. 
Synthesis applies to both agents and patients, as long as they agree with the diagram graph. 
Figure 6-3. Text SPN graph of example 2. (a) Text-SPN: agent “Block”, transition 
“Connects”, patients “Block1”, “Logic”, “Block2” and “Logic”. (b) D-graph: node “Block” 
connected to the nodes “Block1 Logic” and “Block2 Logic”. 
Figure 6-4. Text SPN graph after synthesis of nouns: agent “Block”, transition 
“Connects”, patients “Block1 Logic” and “Block2 Logic”. 
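The synthesis of nouns in the example above can be sketched as a greedy merge of consecutive nouns whose combination matches a node of the diagram graph. The function name `synthesize` and the greedy longest-match strategy are assumptions made for this illustration.

```python
def synthesize(nouns, node_names):
    """Merge consecutive nouns that together match a diagram node name.

    nouns:      patients (or agents) of a kernel, in sentence order
    node_names: names of the diagram nodes connected to the other side
    """
    targets = {tuple(name.lower().split()): name for name in node_names}
    longest = max((len(key) for key in targets), default=1)
    result, i = [], 0
    while i < len(nouns):
        # Try the longest possible run of nouns first, down to two nouns.
        for size in range(min(longest, len(nouns) - i), 1, -1):
            key = tuple(n.lower() for n in nouns[i:i + size])
            if key in targets:
                result.append(targets[key])
                i += size
                break
        else:
            result.append(nouns[i])  # no match: keep the noun unchanged
            i += 1
    return result


patients = synthesize(["block1", "logic", "block2", "logic"],
                      ["Block1 Logic", "Block2 Logic"])
```

The four patients of Figure 6-3a are merged into the two patients of Figure 6-4; nouns that do not match any node are left as they are.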
6.8 NOUN REPLACEMENT 
Following the idea of noun synthesis, we now separate each noun, where possible, 
into individual words (nouns) so that it can be replaced with words related to the block-text. Again, 
let us work on an example to explain our methodology. The Text-SPN and 
the Diagram-graph of the block diagram are shown in Figure 6-5a and Figure 6-5b, 
respectively. From the D-graph we see that a node exists with the name “Address 
generator” and that part of the agent in the Text-SPN matches the node “Address generator”. 
Therefore, we replace the agent “Signals_Address_generator” in the Text-SPN graph with 
the agent “Address_generator”. The new T-SPN graph is shown in Figure 6-5c. 
 
Figure 6-5. T-SPN and D-graph of the block diagram shown in Fig. 1. (a) T-SPN. 
(b) D-graph. (c) T-SPN. (d) D-SPN. 
The association between the block-text and the agents and patients of the text-SPN 
also includes partial matches between them. 
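Noun replacement can be sketched as a search for a diagram node name inside the words of a compound noun. The function name `replace_noun`, the underscore-separated noun format and the preference for the longest matching node name are assumptions of this illustration.

```python
def replace_noun(noun, node_names):
    """Replace a compound noun by the diagram node name it contains, if any."""
    parts = noun.lower().split("_")
    best = None
    for name in node_names:
        words = name.lower().split()
        n = len(words)
        # Does the node name occur as a contiguous run of words in the noun?
        if any(parts[i:i + n] == words for i in range(len(parts) - n + 1)):
            if best is None or n > len(best.split()):
                best = name  # prefer the longest matching node name
    return "_".join(best.split()) if best else noun


agent = replace_noun("Signals_Address_generator",
                     ["Control Module", "Address generator"])
```

For the example of Figure 6-5 the agent “Signals_Address_generator” becomes “Address_generator”; a noun without a matching node is returned unchanged.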
6.9 RESULTS 
The methodology described in the previous chapters has been implemented in Java. 
The program takes as input the image containing the block diagram and the NL text and 
generates the diagram SPN graph. We will present one diagram, although the methodology 
has been tested on more block diagrams (Fig. 7 in [84], Fig. 5 in [85], Fig. 10 in [86], Fig. 
1 in [87], Fig. 2 in [44]). Figure 6-6 shows the block diagram (a), taken from Fig. 8 in 
[80], along with a simplified NL description (b). 
 
 
 
 
“PC sends address to ICache. Instructions from ICache go through EDIFT coprocessor. 
Register File Tags1 sends signals to Stack1 Propagation Logic. Register File Tags2 
sends signals to Stack2 Propagation Logic. Stack1 & Stack2 registers connects to Stack1 
Propagation Logic and Stack2 Propagation Logic. Stack1 Propagation Logic stores the 
Stack1 tags. Stack2 Propagation Logic stores the Stack2 tags. Writeback Logic updates 
tag information in the Register File. Register File sends address to ALU. ALU updates 
to DCache. Check Logic communicates with Tags writeback Logic.” 
(b) 
Figure 6-6. Tested Block Diagram. (a) Block-diagram [80]. (b) NL text. 
The graph extracted from the image after word correction is presented in Figure 6-7. 
As already explained, the graph provides the connections of the blocks. 
Figure 6-7. Diagram graph. 
Using the tool presented in [53] we extract an initial version of the text SPN-graph 
(T-SPN1), as shown in Figure 6-8. Figure 6-9 shows the text-SPN (T-SPN2) after applying 
noun replacement and synthesis of nouns. As we can see from Figure 6-9, problems that 
occur due to incomplete kernels and kernels that have uncombined nouns are resolved. 
Moreover, in Figure 6-9 the nouns have been replaced with simpler nouns that are in 
accordance with the block diagram. 
Figure 6-8. Text SPN (T-SPN1) graph. 
Figure 6-9. Text SPN (T-SPN2) graph after applying synthesis of nouns and noun 
replacement. 
Comparing the two methodologies for kernel extraction in this particular 
example, 62% of the kernels are extracted correctly by the methodology in [53], while 
91% are extracted correctly by the improved methodology. 
6.10 CONCLUSIONS 
In this chapter, the process of extracting the information depicted in block diagrams 
was presented. After recognizing each block in the diagram, a graph representation is selected 
to represent that information. Furthermore, figures are grouped into figures that contain block 
diagrams and figures that do not contain block diagrams. Moreover, a synergistic model 
between NLU and DIM methods was presented. More specifically, we associated text SPN 
graphs with diagram graphs. First, we improved the diagram graph by associating the NL 
text with the misrecognized text of the diagram. Last, we provided a better association in 
the text-SPN graph by synthesizing or replacing nouns based on information provided by 
the diagram graph. 
7 CLASSIFICATION 
Text classification is the process of assigning text documents to a set of predefined 
categories. Traditionally, machine learning algorithms are used for this task. In the machine 
learning setting, the availability of an initial corpus is required where text documents have 
been labeled based on their category [88]. The pre-classified documents are then used in 
the training phase of a machine learning algorithm, where a classifier is obtained. Based 
on this classifier a category is assigned to a text document. Text classification is applicable 
to a wide variety of domains. Some examples include, but are not limited to, social media, 
multimedia networks, news filtering and organization, and opinion mining. 
An important step required to support the application of machine learning 
algorithms to text classification is the process of transforming raw text into a 
numerical representation. This step is called text preprocessing and has been extensively 
studied in the literature [89], [90], [91], [92], [93], [94]. Assuming the text has been 
segmented into sentences, there are several steps involved in text preprocessing, as we 
will briefly discuss in the following section. 
7.1 TEXT PREPROCESSING 
7.1.1 Introduction 
Natural language text cannot be used in raw format in any machine learning 
algorithm. A preprocessing step is required where text is transformed into a suitable form 
understandable by machine learning algorithms, i.e. the set of words contained in it. Essentially, 
each sentence from the text is split into tokens of individual words (terms), which are then 
transformed into a vector representation in what is commonly called the feature space. The most 
common methods for representing a text document in vector form include the Bag of Words 
(BoW) and, among the most recent approaches, word embeddings. We will briefly discuss 
each of them in the following sections. 
7.1.2 Filtering 
Filtering aims at reducing the number of terms (T) initially obtained from the text 
documents. The most widely used method is stop word filtering. Words from the text that 
are found in a stop word list are removed from the text. Several stop word lists have been 
built manually [94] [95] to be used in such a process. Approaches have also been developed 
that attempt to extract stop word lists automatically [96] [97]. What all these lists have in 
common is that they include words that do not contribute to information extraction, 
clustering or classification. Examples of stop words include prepositions such as after, 
about, before, or pronouns such as we, they, he. 
7.1.3 Stemming and Lemmatization 
The purpose of stemming is to remove inflectional [98] and in some cases 
derivational affixes. In this case the words are replaced by the morphological root of the 
word (stem). The stem of a word represents a group of words with similar meaning. A survey 
of the most well-known stemming algorithms in information retrieval 
can be found in [99]. 
Lemmatization selects the lemma of a word depending on the context. Typically, 
lemmatization makes use of morphological analysis of words and a dictionary to obtain the 
lemma of a word by removing inflectional endings. 
Stemming and lemmatization have closely related meanings. Both aim at reducing 
the inflectional forms of a word to a common base. With stemming, however, capturing 
different meanings of words based on their part of speech is not feasible, as opposed to 
lemmatization. Stemming trims the word with a crude heuristic process, while 
lemmatization is focused on extracting the proper base form of a word. Lemmatization 
is time consuming, and in practice stemming methods are usually applied. 
In our case, however, lemmatization is applied instead of stemming for reasons that will be 
explained later. 
7.1.4 Feature Space 
Feature space is an n-dimensional space (R^n) where features lie. In this case, n is 
the number of features and each feature is a representation of raw data, which in our case 
are the unique words of the text. The following are methods used in the transformation of text data 
to the feature space model. 
7.1.5 Bag of Words 
The most popular model for representing a text document is the Bag of Words 
(BoW) [100]. The BoW representation is based on the frequency of a word in a document. 
Assuming a document (di) consists of several unique tokens (ti), each ti is a feature 
whose value is the number of times that ti appears in the document. The total number of 
tokens determines the dimension of the feature space. Furthermore, by normalizing 
each feature, each value in the feature space represents the probability of a token occurrence 
in the text. 
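A minimal sketch of the BoW representation and its normalization is given below. The function names and the small vocabulary are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


def normalize(vector):
    """Turn counts into relative frequencies that sum to one."""
    total = sum(vector)
    return [value / total for value in vector] if total else list(vector)


vocabulary = ["alu", "dcache", "updates"]
counts = bag_of_words(["alu", "updates", "alu"], vocabulary)
frequencies = normalize(counts)
```

The document is mapped to the count vector [2, 0, 1]; after normalization the entries are the relative frequencies of the tokens in the document.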
The frequency of each token is not the only value that a vector representation can 
take. Each token can be assigned a numerical value that characterizes its importance in the 
document. In such vector representations words are given a certain weight to show their 
significance in the document. The most used approach in the literature is the term 
frequency-inverse document frequency (tf-idf) [101], as shown by the following equation. 
$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$
In this equation $w_{i,j}$ is the weight that is assigned to token $i$ in document $j$, 
$tf_{i,j}$ is the frequency of token $i$ in document $j$, $N$ is the number of documents and $df_i$ is 
the number of documents that contain token $i$. 
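The tf-idf weighting can be computed directly from the equation. The sketch below assumes the natural logarithm (the base is not specified in the text) and raw counts as term frequency; the function name `tf_idf` and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dictionary per tokenized document."""
    n = len(documents)
    # df[t]: number of documents that contain token t at least once.
    df = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw frequency of each token in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


documents = [
    ["alu", "updates", "dcache"],
    ["alu", "sends", "address"],
]
weights = tf_idf(documents)
```

The token “alu” appears in both documents, so its weight is zero, while a token that appears in only one of the two documents receives the weight log(2).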
The process of assigning a weight to each token of the feature space is called 
feature weighting. There are two types of feature weighting, which differ in whether or not 
the class label is used. The first one is unsupervised feature weighting, which is built on 
statistical measures of term occurrences, e.g. tf-idf; the other one is supervised feature 
weighting, which takes advantage of the category that each text belongs to. A comparative 
study on term weighting schemes for text classification can be found in [102], while [103] 
provides a survey on term dependencies and term weighting. 
7.1.6 Neural Word Embeddings 
Neural word embeddings are vectors used to represent words. Unlike the bag of words 
method, where each word is mapped into a vector of the same size as the vocabulary, neural 
word embeddings map words to a lower dimensional vector space. This type of vector 
representation is greatly advantageous, as these embeddings have proven to be informative and 
provide information about similarity and other linguistic properties. Several applications 
of neural embeddings have been shown to perform better on NLP tasks [104] as well 
as to capture syntactic and semantic word relationships [105]–[108]. A recent method to 
obtain word embeddings is Word2Vec [105], [109]. Word2Vec is a two-layer neural 
network that trains words or phrases against other words or phrases in the input text. 
Therefore, it can capture associations or similarities of words with other words. To compute 
the word embeddings, Word2Vec provides efficient implementations of two architectures: 
continuous bag of words (CBOW) and skip-gram [105], [106], [109]. The first uses the 
surrounding words in the text to predict a target word, while the latter uses a word to predict 
its surrounding words. Word2Vec can capture not only similarity between words but also 
similarities between vectors of words; similar words are likely to have similar 
vectors. Word2Vec has applications in diverse fields of scientific research, ranging 
from sentiment analysis to e-commerce. 
7.2 CLASSIFICATION ALGORITHMS 
As we already mentioned, text classification assigns labels to text documents based 
on pre-defined categories. All classification tasks start with an already labeled set of text 
documents that constitutes the training set. From this training set the goal is to determine an 
appropriate classification model that predicts the correct category of a new text 
document. The following is a brief reference to some of the state of the art classification 
algorithms. More detailed approaches as well as a comparative evaluation can be found in 
several surveys [90], [91], [110], [111]. 
7.2.1 Naïve Bayes Classifier 
Naïve Bayes is a probabilistic classifier. Therefore, it returns the posterior 
probability of a document belonging to a class given a set of features. It does that by 
calculating the prior probability, the marginal likelihood and finally the likelihood from an 
input corpus. 
$$P(L \mid X) = \frac{P(X \mid L)\,P(L)}{P(X)} \qquad (1)$$
where $L$ is the class label and $X$ is the set of features $X = [x_1, x_2, \ldots, x_n]$. $P(L)$ is the 
prior probability, i.e. the probability of a document belonging to class $L$. $P(X \mid L)$ stands for the 
likelihood, i.e. the probability of the features given a specific class. Equation (1) 
can be written as: 
$$P(L \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid L)\,P(L)}{P(x_1, x_2, \ldots, x_n)} \qquad (2)$$
The naïve independence assumption states that words are not associated 
with each other within a document, therefore the above equation can be transformed as: 
$$P(L \mid x_1, x_2, \ldots, x_n) = \frac{P(L) \prod_{i=1}^{n} P(x_i \mid L)}{P(x_1)P(x_2) \cdots P(x_n)} \qquad (3)$$
Documents are assigned to the class with the highest posterior probability. Despite the 
naïve assumption of independence between words, the classifier yields good results for 
sentence classification. 
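The classifier above can be sketched as follows. The sketch makes several assumptions that are not part of the text: add-one (Laplace) smoothing is used to avoid zero probabilities, scores are computed in log space, and the denominator is dropped because it is the same for every class. The function names, the class labels “functional” and “other”, and the training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (tokens, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest (unnormalized) log posterior."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                      # log P(L)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:                          # skip unseen words
                # log P(x_i | L) with add-one smoothing
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    (["alu", "updates", "dcache"], "functional"),
    (["register", "sends", "address"], "functional"),
    (["paper", "is", "organized"], "other"),
    (["section", "presents", "results"], "other"),
])
label_a = predict(model, ["alu", "sends", "address"])
label_b = predict(model, ["paper", "presents"])
```

With this toy training set the first sentence is assigned to “functional” and the second to “other”.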
7.2.2 Support Vector Machine 
Support Vector Machine (SVM) is a linear supervised classification algorithm 
proposed in [112] that has been applied in numerous text classification applications. As 
shown in Figure 7-1 (a two-dimensional feature space is used so that it can be visualized), 
SVM maps each document into the feature space. Red points belong to one class and green 
points belong to the other class. A single SVM can separate documents into only two classes. 
The goal is to determine a hyperplane in the feature space that separates the points of the 
two classes with the largest possible margin. Training is therefore an optimization problem 
in which the maximum-margin hyperplane must be found. The points lying at distance l from 
the hyperplane are called support vectors. 
Figure 7-1. Support Vector Machine (labels: Support Vectors, Margin l). 
The ability of an SVM to learn does not depend on the size of the feature space, which 
makes SVM suitable for text classification, where the number of features is large. 
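For reference, the maximum-margin problem described above is commonly written in the following textbook (hard-margin) form. The symbols w, b, x_i, y_i and m are introduced here for illustration and are not taken from this thesis: w and b define the hyperplane, x_i is the feature vector of document i, y_i ∈ {−1, +1} is its class, and m is the number of training documents.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \ldots, m
```

Under this scaling the support vectors are the points that satisfy the constraint with equality, and their distance to the hyperplane is l = 1/‖w‖, so minimizing ‖w‖ maximizes the margin.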
7.2.3 Decision Trees 
Decision trees define a set of rules in which the class assignments are positioned 
at the leaves. During training, the algorithm recursively partitions the training set into 
two sets, T1 and T2. T1 contains the documents that contain a word t1, and T2 contains the 
documents that do not contain t1. The aim is to find the word t1 that most accurately 
predicts the class of the document. Decision trees are fast and can be used on large 
training sets with a large number of features. However, decision trees are prone to overfitting.  
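A single partitioning step can be sketched as follows. The text does not state which criterion is used to pick t1, so this sketch assumes Gini impurity; the function names and the toy documents are likewise hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_word(docs, labels):
    """Pick the word t1 whose presence/absence split has the lowest weighted impurity."""
    vocab = sorted({w for d in docs for w in d})
    best_word, best_score = None, float("inf")
    for word in vocab:
        t1 = [l for d, l in zip(docs, labels) if word in d]       # T1: documents with the word
        t2 = [l for d, l in zip(docs, labels) if word not in d]   # T2: documents without it
        score = (len(t1) * gini(t1) + len(t2) * gini(t2)) / len(labels)
        if score < best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical toy documents represented as sets of words.
docs = [{"if", "reset", "signal"}, {"if", "write", "enable"},
        {"data", "stored", "memory"}, {"values", "added"}]
labels = ["CS", "CS", "PS", "PS"]
print(best_split_word(docs, labels))
```

A full tree would apply the same step recursively to T1 and T2 until the sets are pure or a depth limit is reached.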
 
7.2.4 Random Forest Classification 
Random Forest classification builds on decision trees: instead of building a single 
decision tree, it builds many. Each decision tree is built from K randomly selected 
features of the training set. This process is repeated N times, where N is the number of 
trees. Every tree predicts a class, and the final predicted class is the one that receives 
the most votes from the N trees.  
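The two ingredients described above, random feature selection per tree and majority voting, can be sketched as follows. The helper names, the feature list, and the per-tree predictions are hypothetical; training the individual trees is omitted.

```python
import random
from collections import Counter

def sample_feature_subsets(features, k, n_trees, seed=0):
    """Draw K random features for each of the N trees."""
    rng = random.Random(seed)
    return [rng.sample(features, k) for _ in range(n_trees)]

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

features = ["if", "reset", "signal", "data", "stored", "memory"]
subsets = sample_feature_subsets(features, k=3, n_trees=5)

# Hypothetical predictions of the N = 5 trees for one sentence.
votes = ["CS", "PS", "CS", "TS", "CS"]
final_class = majority_vote(votes)
print(len(subsets), final_class)
```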
7.2.5 Neural Networks 
Neural networks, and especially deep neural networks, have recently gained great 
attention for text classification tasks. The difference between shallow and deep neural 
networks lies simply in the number of hidden layers used, where hidden layers are the layers 
between the input and the output layer. Deep neural nets have been successful in solving 
machine learning problems involving images, video, and speech, and have recently 
made major advances in natural language understanding and text classification. 
In both machine learning and deep learning, the most frequent form of learning 
is supervised learning, which relies on training with datasets whose examples carry 
predefined labels.  
Neural networks are composed of unit blocks called neurons. Figure 7-2 (a) shows 
a simple neuron. In this unit the weighted sum of the inputs is calculated and then passed 
through a nonlinear activation function Φ. The rectified linear unit (ReLU) is currently the 
most popular activation function. ReLU is defined as f(x) = max(x, 0). Why this function 
is so beneficial is analyzed in [113], and a performance analysis of 
various other activation functions can be found in [114]. 
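The computation of a single neuron can be sketched as follows. The function names and the example weights are illustrative, and the bias term is an assumption added here (the text mentions only the weighted sum).

```python
def relu(x):
    """Rectified linear unit: f(x) = max(x, 0)."""
    return max(x, 0.0)

def neuron(inputs, weights, bias=0.0, activation=relu):
    """Weighted sum of the inputs followed by a nonlinear activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

print(neuron([1.0, 2.0], [0.5, -1.0]))   # weighted sum is -1.5, so ReLU outputs 0.0
print(neuron([1.0, 2.0], [0.5, 1.0]))    # weighted sum is 2.5, passed through unchanged
```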
 
Figure 7-2. (a) Cell body of a simple neuron with inputs x1, x2, …, xn and a single 
output. (b) Abstract representation of a deep neural network. 
An abstract representation of a deep neural network is shown in Figure 7-2 (b). 
The figure depicts a feedforward neural network architecture. Similar architectures are 
used in many applications of deep learning. The inputs to the neurons of a layer are the 
outputs of the neurons of the previous layer.  
The power of deep neural nets derives from the fact that hidden units allow a 
non-linear distortion of the input to achieve linear separability at the output. Deep learning 
and the architectural analysis of deep neural nets are outside the scope of this thesis; a good 
review of deep learning can be found in [115], and more information on deep learning 
and deep neural nets in general can be found in [116], [117]. Here we are concerned only 
with Convolutional Neural Networks (CNNs) [118], since they appear in the 
majority of approaches [119]–[124] to text categorization. Other approaches make use of 
Recurrent Neural Networks (RNNs) [125], [126] and other types 
of deep neural networks [127]–[129], but they are outside the scope of this work.  
 
7.2.6 Convolutional Neural Networks 
CNNs are conventionally used for image classification. They have been extensively 
described and explained in several research papers [118], [130]–[132], so here we give 
only a brief introduction. CNNs are designed to process data with an array topology, 
for example a 2D image. Unlike in traditional neural nets, the input goes through a series of 
convolutional layers followed by a fully connected layer, as can be seen in Figure 7-3. 
 
Figure 7-3. Convolutional Neural Network. Image taken from [133]. 
A convolutional layer performs the convolution operation, in which a kernel 
(also known as a feature detector, neuron, or filter) slides over the input and computes its 
similarity with each local region of the input. This operation produces a feature map. To 
introduce non-linearity, the feature map is then passed through an activation unit such as ReLU.  
Pooling discards a large amount of information that adds no value for distinguishing 
features and merges semantically similar features together. Typical pooling methods are 
max pooling and average pooling, in which the maximum or average value of each local 
region of the feature map is selected, creating the pooled feature map. Pooling is highly 
beneficial since it reduces the number of parameters passed to the final layers of the 
neural network, provides spatial invariance, and helps prevent overfitting. Here we 
consider pooling to be part of the convolutional layer.  
After a series of convolutional layers, a fully connected layer with a linear activation 
function is used. In a fully connected layer, the output of each neuron in a layer is 
connected to all neurons of the next layer.  
7.3 SENTENCE CLASSIFICATION 
7.3.1 Introduction 
Towards our goal of extracting the pseudocode from an architectural design of a 
hardware system, we realized that the way to achieve this is to extract information from 
each sentence of the natural language (NL) text. Considering how an expert would understand 
the behavior of a hardware system by reading the text and observing the block diagram, 
we realized that key features (words) of the sentences in the NL text provide insight into the 
functionality of the hardware system.  
We therefore categorize each sentence, based on the observation of a large number 
of sentences. Each category indicates how we perceive a different type of 
operation in the architectural design. More specifically, a category can indicate whether 
an operation occurs or when it is executed, or simply mark content as 
informational. Furthermore, each sentence is selected in a way that can 
guide the recognition of functional specifications.  
All things considered, we decided to classify the sentences into five categories: 
Timing Sentences, Conditional Sentences, Link Sentences, Processing Sentences, and 
Informational Sentences. Each category is explained below. 
7.3.2 Timing Sentences (TS) 
Timing sentences control the flow of data in the block diagram. Essentially, 
they carry timing-related information about the system’s operation and answer the 
question of when an operation (or step) is performed. For example, a sentence could start 
with “In the first clock cycle the two values are added”. This suggests an operation that 
happens at a certain time.  
7.3.3 Conditional Sentences (CS) 
Conditional sentences place restrictions on the operations performed in a 
hardware system: they state a precondition for an operation to be executed. 
Consider for example the sentence “if write enable is activated then data are stored in the 
memory”. In this example the action of storing data in the memory depends on a 
signal (write enable).  
7.3.4 Link Sentences (LS) 
Link sentences show connections between different blocks in the diagram or 
between diagrams. Consider for example the sentence “Adder takes as input the result of 
multiplier”. This sentence suggests a connection between an output of the multiplier and 
an input of the adder. 
 
7.3.5 Processing Sentences (PS) 
As the name suggests processing sentences are an indication that a process or 
operation is being performed. Consider for example the sentence “data are stored in 
memory”. In this example the sentence shows the type of operation (store) that will be 
executed.  
7.3.6 Informational Sentences (IS) 
Informational sentences include all remaining sentences and have a purely informative 
role. They don’t provide any indication of the functional specifications of the hardware 
system. An example of an informational sentence is “Block diagram consist of three 
blocks”. This sentence does not provide any information beyond what we already know. 
For our problem of sentence classification, we applied several state-of-the-art 
classification algorithms: SVM, Naïve Bayes, decision trees, and random forests. 
Furthermore, influenced by the literature, we also examined the effect of CNNs.  
7.4 EXPERIMENTAL SETUP 
7.4.1 Dataset 
Since this categorization of sentences is specific to our research, we could not find any 
publicly available dataset of sentences labeled with these classes. Therefore, a dataset 
had to be created manually. For that reason, approximately 50 hundred papers 
were parsed, and each sentence was manually labeled with one of the above categories.  
 
7.4.2 Preprocessing 
Stop word filtering is performed on each sentence of our dataset. The NLTK library 
[134] was used to obtain the stop word list. As already explained, stop word filtering 
removes words that are insignificant for classification. The stop word list also includes 
prepositions. In our case, however, prepositions play an important role in sentence 
classification, so they were excluded from the stop word list and are therefore retained.  
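The effect of excluding prepositions from the stop word list can be sketched as follows. The two word lists below are tiny hypothetical stand-ins used only to keep the example self-contained; in this work the stop word list comes from NLTK.

```python
# Hypothetical stand-in lists; the actual stop word list is taken from NLTK.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "on", "then"}
PREPOSITIONS = {"in", "of", "to", "on"}

# Prepositions are removed from the stop word list, so they stay in the sentence.
EFFECTIVE_STOP_WORDS = STOP_WORDS - PREPOSITIONS

def filter_stop_words(sentence):
    """Lowercase, split on whitespace, and drop the remaining stop words."""
    tokens = sentence.lower().split()
    return [t for t in tokens if t not in EFFECTIVE_STOP_WORDS]

print(filter_stop_words("In the first clock cycle the two values are added"))
```

In the example the preposition "in" survives filtering, which preserves the timing cue at the start of the sentence.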
The next step is the lemmatization of each word, in order to remove inflectional 
forms from the sentences. We chose lemmatization over stemming since it provides a 
proper base form of a word. Although lemmatization is time consuming, timing in our case 
is not as important as in information systems. The NLTK library [134] was also used for the 
lemmatization process. 
To encode the sentence for the state of the art classification algorithms, we are using 
the BoW vector representation where each word is given a weight based on tf-idf measure. 
BoW and tf-idf have already been discussed at the beginning of this chapter.  
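As a reminder of how the encoding works, the sketch below builds tf-idf weighted BoW vectors for tokenized sentences. It assumes one common tf-idf variant (term frequency normalized by sentence length, idf = log(N/df)); the exact weighting used in this work may differ, and the function name and toy sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one tf-idf weighted BoW vector per document."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    df = Counter(w for d in docs for w in set(d))   # number of documents containing each word
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors

# Hypothetical lemmatized, stop-word-filtered sentences.
docs = [["data", "stored", "memory"], ["data", "added"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

A word that occurs in every sentence (here "data") receives weight zero, since it carries no information for separating the classes.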
Moreover, to encode the sentences for our CNN we use neural word 
embeddings obtained from word2vec in two ways. In the first case we used the publicly 
available word embeddings pre-trained on 100 billion words of Google News 
[105], [109]. In the second case word embeddings were obtained by training word2vec 
on a set of documents in the area of computer architecture, using the continuous 
bag-of-words architecture with a dimensionality of 100. To accomplish that we used the 
word2vec implementation of the gensim [135] library. 
 
7.4.3 Training and Testing Set 
To build the training set we used 80% of the sentences of our dataset, while the 
other 20% was used as a testing set to validate the performance of our classification 
algorithms. The same set was used in all classification algorithms. 
7.4.4 Feature Scaling  
Before the features are fed to the classifiers, they are standardized by removing the 
mean and scaling to unit variance. The StandardScaler class from the preprocessing module 
of the sklearn library was used. 
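What standardization does to a single feature column can be sketched in plain Python. This is an illustration of the transformation, not the library code; it assumes the population variance (dividing by the number of samples), and the guard for constant features is an assumption added here.

```python
import math

def standardize(column):
    """Remove the mean of one feature column and scale it to unit variance."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)   # population variance
    std = math.sqrt(var) or 1.0    # leave constant features unscaled instead of dividing by zero
    return [(v - mean) / std for v in column]

scaled = standardize([1.0, 2.0, 3.0])
print(scaled)
```

In practice the mean and variance are estimated on the training set and then reused to transform the testing set.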
7.4.5 Classifiers 
7.4.5.1 State of the art ML algorithms 
 
Traditionally, SVM does not support multi-class classification; however, the 
“one-against-one” [136] or the “one-vs-the-rest” [136] multi-class strategy can be followed. 
In the first approach N*(N-1)/2 classifiers (N is the number of classes) are constructed and 
each one is trained on data from two classes, while in the latter N classifiers are constructed. 
Both approaches yield similar results, so in our case the “one-against-one” approach was chosen. 
To implement the other state-of-the-art classification algorithms we used the sklearn library. 
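For the five sentence categories, the number of binary classifiers required by each strategy can be checked with a short sketch. The helper names are hypothetical, and combining the pairwise decisions by voting is the usual convention, assumed here rather than stated in the text.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["TS", "CS", "LS", "PS", "IS"]

def one_vs_one_pairs(classes):
    """One binary classifier per unordered pair of classes."""
    return list(combinations(classes, 2))

def vote(pairwise_winners):
    """Combine the winners of the pairwise classifiers by majority voting (assumed scheme)."""
    return Counter(pairwise_winners).most_common(1)[0][0]

pairs = one_vs_one_pairs(CLASSES)
print(len(pairs))      # N*(N-1)/2 = 10 classifiers for one-against-one
print(len(CLASSES))    # N = 5 classifiers for one-vs-the-rest
```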
7.4.5.2 CNNs 
 
To build our CNN model we followed an approach similar to [121]. Figure 7-4 
shows the model architecture with one convolutional layer (L1) and two fully connected 
layers (L2 and L3). We represent a sentence as a vector S = {s1, s2, ..., sn}, where si are the 
terms that make up the sentence, preserving the order in which they appear. All 
sentences are set to a fixed length n, and therefore each sentence vector S is padded 
with empty strings, or truncated if its length is larger than n. The word si of the sentence is 
then represented by a 100-dimensional word embedding vector si ∈ R^100, as already 
explained. 
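The padding and truncation step can be sketched as follows. The function name and the example length n are illustrative assumptions; the empty-string padding token follows the description above.

```python
PAD = ""   # empty-string padding token

def fix_length(tokens, n):
    """Truncate a tokenized sentence to n terms, or pad it with empty strings up to n."""
    return tokens[:n] + [PAD] * max(0, n - len(tokens))

print(fix_length(["data", "are", "stored"], 5))
print(fix_length(["a", "b", "c", "d", "e", "f"], 4))
```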
Figure 7-4. Model architecture. Word embeddings feed L1 (convolution, ReLU, max 
pooling), followed by L2 (FC, ReLU) and L3 (FC, softmax categorical probability 
distribution over TS, LS, CS, PS, IS). 
The CNN performs 1D convolution over sentence vectors in a manner similar 
to the 2D convolution of images. In this case the embedding vector si of each term 
in the sentence plays the role of a pixel. 
Max pooling is performed over the result from each neuron performing the 
convolution to capture the most important characteristics and obtain the feature map. 
Several different filters are used in the convolutional process, yielding several 
different feature maps. 
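The convolution and pooling steps for a single filter can be sketched in plain Python. The toy two-dimensional embeddings and the kernel values are hypothetical, and the sketch assumes that pooling takes the maximum over the whole feature map of a filter (one value per filter).

```python
def conv1d(sentence, kernel):
    """Slide a kernel spanning len(kernel) consecutive word vectors over the sentence."""
    width = len(kernel)
    feature_map = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        z = sum(w * x
                for k_vec, s_vec in zip(kernel, window)
                for w, x in zip(k_vec, s_vec))
        feature_map.append(max(z, 0.0))   # ReLU applied to each convolution output
    return feature_map

def max_pool(feature_map):
    """Keep only the strongest response of the filter."""
    return max(feature_map)

# Toy sentence of four words with 2-dimensional embeddings (the model itself uses 100).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]        # one filter covering two consecutive words

feature_map = conv1d(sentence, kernel)
print(feature_map, max_pool(feature_map))
```

With several filters, the pooled values are concatenated into the feature vector that is passed on to the fully connected layers.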
Features from the convolutional layer described above are passed to the first fully 
connected layer (L2), where ReLU is used as the activation function, and from there to the 
second fully connected layer (L3), where a softmax activation provides the probability 
distribution over the labels. 
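The softmax step of the last layer can be sketched as follows. The raw scores are hypothetical values chosen for illustration; only the label names come from the text.

```python
import math

LABELS = ["TS", "LS", "CS", "PS", "IS"]

def softmax(scores):
    """Turn raw scores into a probability distribution over the labels."""
    m = max(scores)                              # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 0.5, 3.0, 2.0, 0.1])       # hypothetical outputs of L3 before softmax
predicted = LABELS[probs.index(max(probs))]
print(predicted)
```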
7.4.6 Results and discussion 
The results of the CNN model against the state-of-the-art classification algorithms are 
shown in Table 7-1. We can see that the CNN performs slightly better than the other 
classification algorithms. Because our sentence classification task uses custom 
categories that are specific to our work, it is not feasible to compare 
it with other sentence classification approaches in the literature. The classifiers could, 
however, be evaluated on publicly available datasets in the future. The error rate is high 
for all classification algorithms. However, we believe that increasing the size of the dataset 
will yield a lower error rate as well as a better understanding of which algorithm performs best.  
Table 7-1. Error Rate of the classification algorithms. 
Algorithm Error Rate 
SVM 27.6% 
Naïve Bayes 29.3% 
Decision Tree 29.4% 
Random Forest 28.3% 
CNN 26.7% 
 
In conventional deep neural networks, a text document consisting of a sequence 
of words is represented by a sequence of one-hot vectors, each of which is projected into a 
continuous vector space by multiplication with a weight matrix. However, using word 
embeddings as extra word features leads to accuracy improvements.  
7.5 CONCLUSIONS 
In this chapter a number of classification algorithms are presented and applied to 
sentence classification. A custom-built dataset is used to train each algorithm to 
classify our sentences into five categories. These sentences play the most important 
role in extracting the functional specifications.  
 
 
8 FORMAL MODELING 
8.1 INTRODUCTION 
As already mentioned, the goal of this thesis is to automatically extract the functional 
specifications (operations) of architectural designs of hardware systems depicted in block 
diagram(s). More precisely, our goal is to identify all the steps involved in the 
operations, or in other words the functionality, of the hardware system over time. We 
therefore consider these steps to be the specifications from which the operations involved 
in the functionality of the system are derived. 
Functional specifications can be numerous; therefore, a form of generality is 
imperative for our extraction process. More specifically, there is a need to introduce specific 
parts that form these functional specifications. These parts can be a form of categorization 
that depends on the structure, the steps, and the conditions present in the operation of 
the hardware system. Hence, we adopt specific conventions to denote groups of steps or, 
more appropriately, the building blocks of functional specifications.  
 
Figure 8-1. Building Blocks of functional specifications: Inputs, Outputs, Function 
Definition, Conditions, Processes, Function Calls. 
As shown in Figure 8-1, inputs refer to any type of input connection in the block 
diagram of the architectural design. The values of these inputs are either entered manually 
by a user or expected from another hardware system or block diagram. With respect 
to hardware systems, outputs refer to any type of output signal in the block diagram of the 
architectural design. The values of these outputs are obtained after the execution of the 
functional specifications. Processes consist of the steps or actions that define the operations 
of the system. Conditions provide different paths in the execution of operations. A function 
definition describes certain operations that are used several times and 
consists of inputs, outputs, processes, conditions, and function calls. Function calls are the 
actual executions of functions. 
All things considered, having a general structure for the functional specifications 
assists our extraction process by mapping each step to a specific building block. 
The rest of this chapter provides a more detailed description of our functional 
specification extraction process and defines the Synergy language, which provides the formal 
definition of functional specifications in our extraction mechanism.  
 
A formal language [137], [138] named Synergy was developed to define as 
accurately as possible the building blocks of the algorithm extraction process. In essence, the 
Synergy language offers a comprehensive description of how functional specifications 
are modeled in our algorithm extraction process, which is explained in the following 
sections. By the definition of a formal language [137], strings are formed over an 
alphabet. Accordingly, in our case strings constitute the building blocks of 
functional specifications, and each string is formed from information in hardware blocks, 
kernels obtained from NLU, and their categories obtained after classification. The 
combination of those strings subsequently provides the order of the building blocks of the 
functional specifications. The important point here is that Synergy provides an efficient 
representation of combinations of the building blocks of functional specifications. 
Another important role of the Synergy string is to validate the 
implementation of the extraction process.  
Although the Synergy language provides a well-defined representation of functional 
specifications in terms of the order of structural parts, it lacks the ability to model timing 
information (what is executed and when). This raises the need for a model 
that can integrate functional behavior as well as structural characteristics. To 
simulate the functional behavior and provide a more detailed structural description, we 
use Stochastic Petri Nets (SPNs) [139]. SPNs can show the flow of data in 
real time and yield performance characteristics. Furthermore, SPNs are applicable to 
concurrency control and hazard detection in hardware designs. 
 
Figure 8-2. Functional specification extraction process. Components: Block 
Information, Sentence Kernels, Sentence Categories, Synergy extraction (Production 
Rules), Synergy, Knowledge Base, SPN Synthesis, Flowchart, Pseudocode. 
Figure 8-2 gives an overview of the functional specification extraction process. As 
already explained, the goal is to extract the pseudocode of an architectural approach of a 
hardware system. Block information obtained from the DIM unit, sentence kernels 
obtained from the NLU unit, and sentence categories obtained from the classification unit 
are all fused in the Synergy unit. The Synergy unit extracts the Synergy string, which is 
forwarded, along with information obtained from an external knowledge base (KB), to the 
SPN Synthesis unit. The SPN Synthesis unit maps each Synergy string to an SPN graph and 
then combines all SPN graphs from the NLU unit, the KB, and the Synergy mapping to 
extract the pseudocode and depict it as a flowchart.  
In the following sections the Synergy language is defined and the mapping between the 
Synergy language and the SPN model is analyzed.  
8.2 SYNERGY LANGUAGE 
8.2.1 Overview 
A formal language is defined to formally represent the building blocks 
and their order in functional specifications. Since it is used to represent the 
association between text and diagrams, it is named Synergy. Synergy’s purpose is to 
provide a mathematical model around which the functional specification extraction 
process is designed. The Synergy language couples information from block diagrams, 
sentence kernels, and sentence categories and represents it in an efficient and 
compact way.  
To represent a building block in the functional specifications process, the following 
are defined: 
Definition 1: A synergy string represents a building block of the functional 
specifications. 
To be more specific, as shown in Figure 8-1, we define six types of building blocks in our 
functional specifications. The first three represent parts of the structure of the functional 
specifications and, consequently, of the architectural design. The remaining three are more 
operation oriented: they can be combined interchangeably to form a 
series of operations, appearing in series or in parallel, that express the functionality 
of the architectural design.  
Definition 2: The sequence of Synergy strings represents a general form of 
functional specifications extracted from the block diagram of an architectural design of a 
hardware system. 
Essentially, Synergy strings show a sequence of steps and conditions derived from 
each building block and represent a general form of functional specifications. Therefore, 
functional specifications can be represented based on several possible combinations of 
actions (steps and conditions) as we can see in Figure 8-3. Each part of the synergy string 
 
can be mapped into one building block. It’s imperative to clarify here that each block 
diagram has its own sequence of strings.  
Figure 8-3. Illustration of Synergy strings and the sequence of strings. Each building 
block (Inputs, Outputs, Function Definition, Conditions 1, Process 1, Process 2, Process 3, 
…, Process N) corresponds to one part (Part 1 to Part N) of the Synergy string, and the 
parts together form the sequence. 
To be clear, Synergy provides the sequence of steps of the 
operations, together with all their possible conditions, that take place in the block diagram. 
However, it cannot provide the timing association of each operation. Further steps are 
therefore required to determine which operation is active at a given time, as will be 
explained in the following sections.  
8.2.2 Definition 
Synergy is defined to assist in the formal representation of blocks and their order 
in the algorithm.  
Definition 3. The grammar 
G = (V, Σ, P, S) 
is used to define Synergy, where V is the finite non-empty set of non-terminal symbols, Σ 
is the finite non-empty set of terminal symbols, S ∈ V is the start symbol, and P is the finite 
non-empty set of production rules of the form α → β, where α ∈ V and β ∈ (Σ ∪ V)*. Since 
every production rule in P satisfies |α| = 1, this grammar is a context-free grammar.  
The set of terminal symbols follows. Throughout this chapter terminal 
symbols are placed in bold.  
Ln: where {Ln | n ∈ string of words}. Words that appear in the block diagram and are 
associated to the name of an input or output in the block, will be modeled by this symbol. 
For example, consider the block shown in Figure 8-4, in this case “Input_name”, 
“Output_name”, “In1”, “Out1”, “In2” and “Out2” will be modeled as LInput_name, LOutput_name, 
LIn1, LOut1, LIn2, LOut2. 
Figure 8-4. A simple block diagram example: Input_name → In1 [Block X1] Out1 → 
In2 [Block X2] Out2 → Output_name. 
Bn: where {Bn | n∈ string of words}. Words that appear inside each box of the block 
diagram and are associated to the name of a specific block are modeled by this symbol. For 
example, in the block diagram of Figure 8-4, “Block X1” and “Block X2” blocks will be 
modeled as BBlock_X1 and BBlock_X2. 
 
Ki,j,namej: where {Ki,j,namej | i ∈  [1,2,3,4] and j ∈  [1,2,3], name ∈ {value of j kernel 
component}}. For this symbol, index i refers to each one of the sentence categories as 
follows: 
1. Timing Sentences that show the flow of data in the block diagram. 
2. Conditional Sentences that provide restrictions in the flow of data. 
3. Link sentences that show information of block connections. 
4. Processing sentences that provide processing operations in the system. 
5. Information sentences that have a more informative role. 
Index j refers to each one of the kernels’ components in the Glossa Language [75]: 
1. Agents: Refers to all the agents extracted along with their operators from one 
sentence. 
2. Verbs: Refers to all verbs extracted along with their operators from one sentence. 
3. Patients: Refers to all patients extracted along with their operators from one 
sentence.  
This symbol models information related to the category of each partial sentence and its 
respective kernel.  
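The construction of a K symbol from a sentence category and a kernel component can be sketched as follows. The dictionaries and the function name are hypothetical helpers introduced for illustration; the index values follow the two numbered lists above.

```python
# Index i: sentence category, numbered as in the list above.
CATEGORY_INDEX = {"TS": 1, "CS": 2, "LS": 3, "PS": 4, "IS": 5}
# Index j: kernel component, numbered as in the list above.
COMPONENT_INDEX = {"agent": 1, "verb": 2, "patient": 3}

def k_symbol(category, component, name):
    """Build the flattened textual form K<i>,<j>,<name> of a kernel symbol."""
    i = CATEGORY_INDEX[category]
    j = COMPONENT_INDEX[component]
    return f"K{i},{j},{name}"

print(k_symbol("CS", "agent", "reset_signal"))
print(k_symbol("PS", "verb", "are_initialized"))
```

For example, the agent "reset_signal" of a conditional sentence maps to K2,1,reset_signal, and the verb "are_initialized" of a processing sentence maps to K4,2,are_initialized.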
_IC_: It is a connectivity symbol and connects an input of the block diagram to one of the 
inputs of a block in the diagram. For example, in Figure 8-4 it will be used to connect 
“Input_name” with “In1”.  
_OC_: It is a connectivity symbol and connects an output of a block in the diagram to an 
output of the block diagram. For example, in Figure 8-4 it will be used to connect “Out2” 
with “Output_name”. 
 
C: Connectivity symbol to model the connections between two blocks. For example, in 
Figure 8-4 it will be used to connect “Block X1” with “Block X2”.  
I: A symbol used to model the input building block of functional specifications’ 
representation. 
O: A symbol used to model the output building block of functional specifications’ 
representation. 
FD_Bk: Models a function definition building block, k in this case refers to the name of 
blocks in the diagrams. The idea is that each block is modeled as a function. 
CN: Models a conditional building block of functional specifications’ representation.  
PR: Models a processing building block.  
F_Bk: Models a function call building block, k in this case refers to the name of blocks in 
the diagrams. In order for F_Bk to work a function definition FD_Bk is required.  
_SC_: Models the connection of two building blocks sequentially. 
_BC_: Models the connection of multiple building blocks in parallel. 
Symbols “<<”, “>>”, “,”: Bind symbols together.  
Symbols such as “(“, “)”, “[“, “]” are used to separate symbols and to determine the 
scope of different operators.  
This model provides a general view of all the components that participate in the 
functional specification extraction process. Data image extraction and modeling, the 
natural language understanding module, and sentence classification all work synergistically 
to provide the building blocks of functional specifications. Therefore, this model can 
capture and model interdisciplinary information.  
Non-terminal symbols represent steps in the functional specification extraction 
process and are used as intermediate steps in the production rules. The most important one 
is AG, which represents all the building blocks of functional specifications. Production rules 
provide a more elaborate representation of the steps that are followed to extract the building 
blocks of the functional specifications. In the following presentation of production rules, “|” 
is used to compress production rules of the form α → β and α → γ into α → β|γ. We 
define the production rules as:  
R1. Input →Ln_IC_Lk, n≠k 
Here we define connections of the block diagram’s inputs with their respective block inputs. 
R2. Output → Ln_OC_Lk, n≠k 
Here we define connections of the block diagram’s outputs with their respective block outputs. 
R3. Link → Ln_C_Lk, n≠k 
Here we define a connection between inputs and outputs of blocks. 
R4. Processing_step → PR<< (Ki,1,name1|Ln|Bn), (Ki,2,name2| Ki,3,name3|Lk|Bk) >> |<< 
(Ki,1,name1|Ln|Bn), (Ki,2,name2), (Ki,3,name3|Lk|Bk) >>, n≠k 
Here we define a processing step along with the parameters that participate in that 
processing step. 
R5. Condition  → CN ((Ln | Ki,j,namej), Ln | Ki,k,namej, Ln | Ki,l,namej|ϵ), i=2, j ∈ [1,3], k≠2 
 
Here a conditional building block is defined with its parameters. 
R6. F_parameters → (Link|Inputs|Outputs), F_parameters|(Link|Inputs|Outputs) 
R7. F_call → F_Bk<<F_parameters>> 
Here we model a function call along with its parameters. 
R8. AG → (Processing_step| Condition| F_call |_BC_<< AG >>|_SC_<< AG >>) 
Identifies a sequence of building blocks that are connected either sequentially or in parallel. 
R9. IO_parameters →  Ln, IO_parameters|(Ln) 
R10. IO → [I<<IO_parameters>>][O<<IO_parameters>>] 
Builds inputs and outputs of functional specifications or functions. 
R11. Functions →FD_Bk[IO][AG], Functions| FD_Bk[IO][AG] 
Defines a function with its inputs, outputs, and the building blocks that make up that function.  
R12. S →[IO][Functions][AG] 
Defines the final string that will represent the order in which the building blocks of 
functional specifications appear. 
All components of the synergy language have been defined and described. To 
illustrate the effectiveness of synergy language we are going to show an example of a 
hardware implementation obtained from a scientific document. 
 
8.2.3 Example 
Following the example shown in previous chapters of a hardware implementation 
of an RC4 stream cipher [140], we present how that implementation is expressed 
in the Synergy language. To keep the length of this dissertation to a minimum, we use 
only a small part of the natural language text of the document and only one of the 
block diagrams.  
 
 
S1 “The block diagram of the S-box RAM is shown Fig. 4.” [140] 
S2 “It consists of three 256 bytes RAM blocks.” [140] 
S3 “Each RAM block has four inputs and one output.” [140] 
S4 “The two inputs are the read and write signals while the other two are the address and data signals.” [140] 
S5 “Also, all the three RAM boxes have the same signals of clock and reset.” [140] 
S6 “The operation of RAM blocks is quite simple.” [140] 
S7 “If the reset signal occurs the blocks are initialized linearly.” [140] 
S8 “For each block, if the write signal occurs new data are stored in the address position, or if the read signal occurs the data in the address position are available on the output of the block.” [140] 
Figure 8-5. Left: Example block diagram. Image taken from [140]. Right: NL text 
associated with the block diagram. 
Figure 8-5 shows one of the block diagrams of this document and its associated 
paragraph. Each sentence of the paragraph is marked with a number to facilitate later 
reference. Table 8-1 holds the partial sentences of each sentence. Table 8-2 holds the 
kernels (A, V, P) of each partial sentence. The Category column indicates 
the category obtained from our classification scheme. The Synergy symbol rows of Table 8-2 
show the representation of each Agent, Verb, and Patient in the Synergy language. 
Table 8-1. Partial Sentences. 
Sentence Partial Sentence 
S7 “If the reset signal occurs “(P1) [140] 
S7 “the blocks are initialized linearly “(P2) [140] 
S8 “if the write signal occurs “(P3) [140] 
S8 “new data are stored in the address position “(P4) [140] 
S8 “if the read signal occurs “(P5) [140] 
S8 “For each block ,,or the data in the address position are available on the 
output of the block “(P6) [140] 
Table 8-2. Kernels. 
Partial Sentence | Row | Agent (A) | Verb (V) | Patient | Category
P1 | Kernel component | reset_signal | occurs | | CS
P1 | Synergy symbol | K2,1,reset_signal | K2,2,occurs | |
P2 | Kernel component | Linearly | are_initialized | Blocks | PS
P2 | Synergy symbol | K4,1,linearly | K4,2,are_initialized | K4,3,blocks |
P3 | Kernel component | write_signal | Occurs | new_data | PS
P3 | Synergy symbol | K2,1,write_signal | K2,2,occurs | K4,3,new_data |
P4 | Kernel component | Address_position | are_stored | new_data | CS
P4 | Synergy symbol | K4,1,address_position | K4,2,are_stored | K2,3,new_data |
P5 | Kernel component | Signal | read_occurs | | CS
P5 | Synergy symbol | K2,1,signal | K2,2,read_occurs | |
P6 | Kernel component | block, data, address_position | are | Output_block | PS
P6 | Synergy symbol | K4,1,block, K4,1,data, K4,1,address_position | K4,2,are | K4,3,output_block |
 
 
The symbolic representation in the synergy language of each Agent, Verb, and 
Patient that belongs to a specific category is Kij, as already explained above. Therefore, 
the Agent of partial sentence P2 in Table 8-2 is K4,1,linearly, since it belongs to a partial 
sentence of category PS and is an Agent. In a similar manner, the Verb of partial sentence P2 
is K4,2,are_initialized and its Patient is K4,3,blocks. 
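As an illustration, the construction of these symbols can be sketched in a few lines of Python. This is only a minimal sketch: the function name kernel_symbols is hypothetical, and the category indices (2 for CS, 4 for PS) are an assumption read off the symbols printed in Table 8-2, not a complete list of categories.

```python
# Category-to-index mapping inferred from Table 8-2 (assumption):
# CS kernels carry index 2 and PS kernels carry index 4.
CATEGORY_INDEX = {"CS": 2, "PS": 4}

# Position j inside the kernel: 1 = Agent, 2 = Verb, 3 = Patient.
ROLE_INDEX = {"agent": 1, "verb": 2, "patient": 3}


def kernel_symbols(category, agent=None, verb=None, patient=None):
    """Return the K(i,j,name) symbols of one kernel as plain strings."""
    i = CATEGORY_INDEX[category]
    parts = {"agent": agent, "verb": verb, "patient": patient}
    symbols = []
    for role, name in parts.items():
        if name:  # empty kernel components produce no symbol
            symbols.append(f"K{i},{ROLE_INDEX[role]},{name.lower()}")
    return symbols


if __name__ == "__main__":
    # Partial sentence P2 of Table 8-2
    print(kernel_symbols("PS", "Linearly", "are_initialized", "Blocks"))
```

For partial sentence P2 the sketch yields the three symbols listed in the P2 rows of Table 8-2.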
As Figure 8-5 shows, the block diagram consists of three blocks, fourteen 
inputs, and three outputs. Figure 8-6 shows the corresponding graph of the block diagram. 
This graph is obtained from the DIM unit and is reproduced here for convenience. 
[Figure 8-6 graph. Blocks: (i_block)_256_RAM, (j_block)_256_RAM, (t_block)_256_RAM. Inputs: Si_write, Si_read, data, address, Sj_write, Sj_read, data#1, address#1, St_write, St_read, data#2, address#2, Clock, reset. Outputs: Si, Sj, St.]
 
Figure 8-6. Graph of block diagrams in figure 8-5. Yellow rectangles indicate an 
input to the block diagram and blue indicate an output. 
 
Following is a step by step process to represent the building blocks of the above 
diagram with a synergy string. We mark with green the non-terminal symbols that are 
replaced with their respective production rule. 
S →[IO][Functions][AG] 
By replacing IO with rule R10 we get: 
S →[[I<<IO_parameters>> ][O<<IO_parameters>>]][Functions][AG] (1) 
IO_parameters are replaced with R9 and the appropriate parameter for R9 are chosen to 
reflect the inputs of the block diagram.  
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >> ][O<<IO_parameters>>]][Functions][AG] 
IO_parameters are replaced with R9 and the appropriate parameter for R9 are chosen to 
reflect the outputs of the block diagram.  
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] [Functions][AG] 
Functions are replaced three times  with rule R11 as follows: 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[IO][AG],Functions][AG] 
 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[IO][AG], FD_B(j_block)_256_bytes_RAM [IO][AG],Functions][AG] 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[IO][AG], FD_B(j_block)_256_bytes_RAM [IO][AG], 
FD_B(t_block)_256_bytes_RAM [IO][AG]][AG] 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[[I<<IO_parameters>> ][O<<IO_parameters>>]][AG], 
FD_B(j_block)_256_bytes_RAM [IO][AG], FD_B(t_block)_256_bytes_RAM [IO][AG]][AG] 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[[I<< LSi_read, LSi_write, Ldata, Laddress, Lclock, Lreset >> ][O<< 
LSi>>]][AG], FD_B(j_block)_256_bytes_RAM [IO][AG], FD_B(t_block)_256_bytes_RAM 
[IO][AG]][AG] 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[[I<< LSi_read, LSi_write, Ldata, Laddress, Lclock, Lreset >> ][O<< 
LSi>>]][ BC_<< AG >>], FD_B(j_block)_256_bytes_RAM [IO][AG], FD_B(t_block)_256_bytes_RAM 
[IO][AG]][] 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[[I<< LSi_read, LSi_write, Ldata, Laddress, Lclock, Lreset >> ][O<< 
LSi>>]][ BC_<< _SC_<<Condition, Processing_step>>,_SC_<< Condition, 
Processing_step >>,_SC_<< Condition, Processing_step >> >>], 
FD_B(j_block)_256_bytes_RAM [IO][AG], FD_B(t_block)_256_bytes_RAM [IO][AG]][] 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[[I<< LSi_read, LSi_write, Ldata, Laddress, Lclock, Lreset >> ][O<< 
LSi>>]][ BC_<< _SC_<< CN (Lreset,K2,2,occurs),PR4 (K4,1,linearly, K4,2,are_initialized 
,B(i_block)_256_bytes_RAM)>>,_SC_<< CN (LSi_write,K2,2,occurs, Ldata),PR4 (Laddress, K4,2,are_stored, 
Ldata)>>,_SC_<< CN (LSi_read,K2,2,occurs),PR4 (Laddress!data!block, K4,2,are, LSi)>> >>], 
FD_B(j_block)_256_bytes_RAM [IO][AG], FD_B(t_block)_256_bytes_RAM [IO][AG]][] 
The same steps apply to the other two function definitions. The final synergy string 
that represents the functional specifications is the following. 
S → [[I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lsj_read, Lsj_write, Ldata1, 
Laddress1,Lst_read, Lst_write, Ldata2, Laddress2 >>] [O<< LSi, LSj, LSt,>>]] 
[FD_B(i_block)_256_bytes_RAM[[I<< LSi_read, LSi_write, Ldata, Laddress, Lclock, Lreset>> ][O<< 
LSi>>]][ BC_<< _SC_<< CN (Lreset,K2,2,occurs),PR4 (K4,1,linearly, K4,2,are_initialized 
,B(i_block)_256_bytes_RAM)>>,_SC_<< CN (LSi_write,K2,2,occurs, Ldata),PR4 (Laddress, K4,2,are_stored, 
Ldata)>>,_SC_<< CN (LSi_read,K2,2,occurs),PR4 (Laddress!data!block, K4,2,are, LSi)>> >>], 
FD_B(j_block)_256_bytes_RAM[[I<< LSj_read, LSj_write, Ldata1, Laddress1, Lclock, Lreset >> ][O<< 
LSj>>]][ BC_<< _SC_<< CN (Lreset,K2,2,occurs),PR4 (K4,1,linearly, K4,2,are_initialized 
,B(j_block)_256_bytes_RAM)>>,_SC_<< CN (LSj_write,K2,2,occurs, Ldata),PR4 (Laddress, K4,2,are_stored, 
Ldata)>>,_SC_<< CN (LSj_read,K2,2,occurs),PR4 (Laddress!data!block, K4,2,are, LSj)>> >>], 
FD_B(t_block)_256_bytes_RAM[[I<< LSt_read, LSt_write, Ldata2, Laddress2, Lclock, Lreset >> ][O<< 
LSt>>]][ BC_<< _SC_<< CN (Lreset,K2,2,occurs),PR4 (K4,1,linearly, K4,2,are_initialized 
,B(t_block)_256_bytes_RAM)>>,_SC_<< CN (LSt_write,K2,2,occurs, Ldata),PR4 (Laddress, K4,2,are_stored, 
Ldata)>>,_SC_<< CN (LSt_read,K2,2,occurs),PR4 (Laddress!data!block, K4,2,are, LSt)>> >>]][] 
A visual illustration of building blocks is provided to better understand the 
representation deriving from this string. Therefore, the string for representing inputs and 
outputs is shown in Figure 8-7. In this figure, the acquisition of inputs for the functional 
specifications from the string of Synergy language is more discernible.  
I<< Lreset, Lclock, Lsi_read, Lsi_write, Ldata, Laddress, Lreset, Lclock, 
Lsj_read, Lsj_write, Ldata1, Laddress1, Lreset, Lclock, Lst_read, Lst_write, 
Ldata2, Laddress2 >>
Inputs: reset, clock, Si_read, 
Si_write, data, address, Sj_read, 
Sj_write, data1, address1, St_read, 
St_write, data2, address2
 
(a) 
O<< LSi, LSj, LSt,>> Outputs: Si, Sj, St
 
(b) 
Figure 8-7. Mapping from Synergy language to a functional specification building 
block. 
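The mapping of Figure 8-7 can be sketched as a small parser. This is a minimal sketch under our own assumptions: the function name parse_io is hypothetical, the regular expression only looks at the first I<<...>> and O<<...>> groups of the string, and the leading "L" of each terminal is assumed to be a type marker that is dropped.

```python
import re


def parse_io(synergy):
    """Extract input and output names from the first I<<...>> and O<<...>> groups."""
    result = {}
    for key, label in (("I", "Inputs"), ("O", "Outputs")):
        match = re.search(rf"\[{key}<<(.*?)>>", synergy)
        names = []
        if match:
            for token in match.group(1).split(","):
                token = token.strip()
                if token:  # tolerate a trailing comma such as "LSt,>>"
                    # Drop the leading "L" that marks the terminal type.
                    names.append(token[1:] if token.startswith("L") else token)
        result[label] = names
    return result


if __name__ == "__main__":
    io = parse_io("[[I<< Lreset, Lclock >>] [O<< LSi, LSj, LSt,>>]]")
    for label, names in io.items():
        print(f"{label}: {', '.join(names)}")
```

The same function can be applied to the [I<<...>>][O<<...>>] part of a single function definition to obtain its local inputs and outputs.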
Figure 8-8 shows how a function definition building block is built. As can be seen 
from the figure, a function is created whose name is obtained from the Synergy string. Inputs 
and outputs are obtained in the same manner as in Figure 8-7. Different colors are used to 
show how a synergy string is mapped to a building block. Note that this 
is not the final representation of the building blocks of functional specifications. It is 
only a visual representation of how the Synergy language can model building blocks. 
 
[Figure 8-8 diagram: the FD_B(i_block)_256_bytes_RAM part of the synergy string mapped to a function definition building block. The block is headed "Function: (i_block)_256_bytes_Ram()", with "Inputs: Si_write, Si_read, data, address, clock, reset" obtained from [I<< LSi_read, LSi_write, Ldata, Laddress Lclock, Lreset>>] and "Outputs: Si" obtained from [O<< LSi>>]. Each _SC_<< CN (...), PR4 (...) >> term maps to a condition box followed by a processing box; for example, CN (Lreset,K2,2,occurs) maps to "Condition: Reset occurs" and PR4 (K4,1,linearly, K4,2,are_initialized, B(i_block)_256_bytes_RAM) maps to "(i_block)_256_bytes_RAM are_initialized linearly".]
Figure 8-8. Mapping from Synergy language to a function definition building 
block.  
Figure 8-9 and Figure 8-10 show the mapping of the other two function definition 
building blocks.  
[Figure 8-9 diagram: the FD_B(j_block)_256_bytes_RAM part of the synergy string mapped to a function definition building block. The block is headed "Function: (j_block)_256_bytes_Ram()", with "Inputs: Sj_write, Sj_read, data1, address1, clock, reset" obtained from [I<< LSj_read, LSj_write, Ldata1, Laddress1 Lclock, Lreset>>] and "Outputs: Sj" obtained from [O<< LSj>>]. The three _SC_ terms map to three condition and processing pairs: "Condition: Reset occurs" with "(j_block)_256_bytes_RAM are_initialized linearly", "Condition: Sj_write occurs data" with "Data are_stored address", and "Condition: Sj_read occurs" with "Sj are address and data and block".]
Figure 8-9. Mapping from synergy to the second function definition. 
[Figure 8-10 diagram: the FD_B(t_block)_256_bytes_RAM part of the synergy string mapped to a function definition building block. The block is headed "Function: (t_block)_256_bytes_Ram()", with "Inputs: St_write, St_read, data2, address2, clock, reset" obtained from [I<< LSt_read, LSt_write, Ldata2, Laddress2 Lclock, Lreset>>] and "Outputs: St" obtained from [O<< LSt>>]. The three _SC_ terms map to three condition and processing pairs: "Condition: Reset occurs" with "(t_block)_256_bytes_RAM are_initialized linearly", "Condition: St_write occurs data" with "Data are_stored address", and "Condition: St_read occurs" with "St are address and data and block".]
Figure 8-10. Mapping from Synergy to the third function definition. 
These visual representations show how the building blocks of the functional 
specifications are obtained from the Synergy string alone. The following 
sections explain how a more elaborate representation of those specifications, one that is 
more accurate in terms of the system’s functionality, is obtained from the Synergy 
language. 
8.3 FROM SYNERGY TO SPN MAPPING 
The previous sections described how the synergy language formally represents the 
building blocks of functional specifications and how information from NL text and images 
is combined under it. As already explained, the synergy language provides a well-defined 
representation of the order of the structural parts of the functional specifications. 
However, it does not model which structural part of the functional specifications is 
active at a given point in time. To account for this limitation we have chosen Stochastic 
Petri Nets (SPNs). SPNs are well suited to representing the flow of data in real time and to 
obtaining performance characteristics. In this section, we show step by step how parts of 
synergy strings are mapped to SPN graphs. 
The synergy string representing a block diagram is parsed from left to right and 
separated into parts that we call partial strings. A partial string contains at least 
one terminal symbol, followed in certain cases by other terminal symbols that are separated 
by commas and enclosed in parentheses or brackets. The mapping of each synergy part into 
an SPN graph follows. We use colors to better show the mapping from one part 
to the other. 
We start by showing how a partial string that contains PR is mapped. PR 
represents a processing building block and, from rule R4 above, it can appear in the 
following combinations: 
PR<< (Ki,1,name1|Ln|Bn), (Ki,2,name2) >> | << (Ki,1,name1|Ln|Bn), (Ki,2,name2), (Ki,3,name3|Lk|Bk) >> 
Therefore, PR is always followed by angle quotes (‘<<’, ‘>>’) that enclose two or three 
terminal symbols separated by commas. If we think of PR as an operator, then PR is 
either a binary or a ternary operator. The rule can be rewritten to give a general form of 
a processing step: 
PR<< Termname1, Termname2 >> or PR<< Termname1, Termname2, Termname3 >>, where Term 
refers to one of the K, L and B type terminal symbols. 
In this general form of a processing step, the second term is depicted on the SPN 
graph as a transition box and the remaining terms are mapped as places. 
Figure 8-11 shows the mapping for these two cases. If Termname1 or 
Termname3 consists of sub-terms (Tn1, Tn2) connected with an operator, the mapping is 
similar to the approach in [53]. The symbols “*, @, %, $” appearing in this figure are operators 
of the Glossa language [75]. The first two are “and” connections and the other two are “or” 
connections. 
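Synergy string | SPN graph representation (node labels)
PR<< Termname1, Termname2 >> | place name1; transition name2
PR<< Termname1, Termname2, Termname3 >> | places name1, name3; transition name2
PR<< Tn1*Tn2, Termname2, Termname3 >> | places n1, n2 (“and” connection), name3; transition name2
PR<< Tn1%Tn2, Termname2, Termname3 >> | places n1, n2 (“or” connection), name3; transition name2
PR<< Termname1, Termname2, Tn1@Tn2 >> | places name1, n1, n2 (“and” connection); transition name2
PR<< Termname1, Termname2, Tn1$Tn2 >> | places name1, n1, n2 (“or” connection); transition name2
Figure 8-11. Mapping from synergy to SPN: processing case. 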
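The processing case can be sketched in Python as follows. This is only an illustrative sketch: the names split_term and map_pr and the dictionary layout are our own assumptions, and the sketch records which terms become places and which becomes the transition without modeling arc directions.

```python
AND_OPS = "*@"  # "and" connections
OR_OPS = "%$"   # "or" connections


def split_term(term):
    """Split a term such as 'n1*n2' into (sub-term names, connective)."""
    for op in AND_OPS + OR_OPS:
        if op in term:
            names = [t.strip() for t in term.split(op)]
            return names, ("and" if op in AND_OPS else "or")
    return [term.strip()], None


def map_pr(terms):
    """Map the two or three terms of PR<<...>> to places and one transition.

    The second term is the transition; the first and third terms are places.
    """
    if len(terms) not in (2, 3):
        raise ValueError("PR takes two or three terms")
    first_places, first_connective = split_term(terms[0])
    spn = {
        "transition": terms[1].strip(),
        "first_places": first_places,
        "first_connective": first_connective,
        "third_places": [],
        "third_connective": None,
    }
    if len(terms) == 3:
        spn["third_places"], spn["third_connective"] = split_term(terms[2])
    return spn


if __name__ == "__main__":
    print(map_pr(["n1*n2", "name2", "name3"]))
```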
Next, we show how a partial string that contains CN is mapped. CN represents 
a conditional building block: CN ((Ln | Ki,j,namej), Ln | Ki,k,namej, Ln | Ki,l,namej|ϵ), i=2, j ∈ 
[1,3], k≠2. CN is followed by a pair of parentheses that enclose two or three terminal 
symbols separated by commas. The rule can be rewritten to give a general form of a 
conditional step: 
CN (Termname1, Termname2, Termname3), where Term refers to one of the K, L, and ϵ type 
of terminal symbols. 
In this general form of a conditional step, the name index of every term is taken and 
the names are combined into one string with the prefix “if_”. This combined string is 
mapped to a single place in the SPN graph. If Termname1 or Termname3 consists of 
sub-terms (Tn1, Tn2) connected with an operator, the prefix “if_” is placed on each 
sub-term and the resulting places are combined according to their operator, as shown in 
Figure 8-12. 
Synergy string | SPN graph representation (node labels)
CN<< Termname1, Termname2, Termname3 >> | one place If_name1_name2_name3
CN << Tn1*Tn2, Termname2, Termname3 >> | places If_n1 and If_n2
CN << Tn1%Tn2, Termname2, Termname3 >> | places If_n1 or If_n2
Figure 8-12. Mapping from synergy to SPN: conditional case. 
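A possible reading of the conditional case is sketched below. The function name map_cn and the returned (places, connective) pair are assumptions made for illustration; following Figure 8-12, only the sub-terms of a compound term are kept when an operator is present.

```python
# Glossa operators as described in the text: "*" and "@" are "and"
# connections, "%" and "$" are "or" connections.
OPERATORS = {"*": "and", "@": "and", "%": "or", "$": "or"}


def map_cn(terms):
    """Return (place names, connective) for the terms of a CN step."""
    names = [t.strip() for t in terms if t and t.strip()]
    # Inspect the first and the last term for a compound sub-term list.
    for term in (names[0], names[-1]):
        for op, connective in OPERATORS.items():
            if op in term:
                return [f"if_{sub.strip()}" for sub in term.split(op)], connective
    # Simple case: all names are joined into one "if_" place.
    return ["if_" + "_".join(names)], None


if __name__ == "__main__":
    print(map_cn(["reset", "occurs"]))
    print(map_cn(["n1%n2", "name2", "name3"]))
```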
Next is the mapping of the connections defined in rules R1, R2 and R3, 
as well as of the inputs and outputs defined in rule R10. 
 
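Synergy string | SPN graph representation (node labels)
Ln_IC_Lk, | place n; transition input; place k
Ln_OC_Lk, | place n; transition output; place k
Ln_C_Lk, | place n; transition Link; place k
I<< Ln1 ,Ln2,…, Lnk >> | places I_n1, I_n2, ..., I_nk
O<< Ln1 ,Ln2,…, Lnk >> | places O_n1, O_n2, ..., O_nk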
Figure 8-13. Mapping from synergy to SPN: IO case. 
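The IO case lends itself to a short sketch as well. The helper names map_connection and map_io_list are hypothetical, and the regular expression assumes that each label starts with "L" and that the connection kind (IC, OC or C) is delimited by underscores.

```python
import re

# Transition labels used in Figure 8-13 for each connection kind.
CONNECTION_LABELS = {"IC": "input", "OC": "output", "C": "Link"}


def map_connection(symbol):
    """Map a string such as 'Ln_IC_Lk' to (place n, transition label, place k)."""
    match = re.fullmatch(r"L(.+?)_(IC|OC|C)_L(.+)", symbol.strip())
    if match is None:
        raise ValueError(f"not a connection symbol: {symbol!r}")
    source, kind, target = match.groups()
    return source, CONNECTION_LABELS[kind], target


def map_io_list(kind, labels):
    """Map the labels of I<<...>> or O<<...>> to places named I_n or O_n."""
    if kind not in ("I", "O"):
        raise ValueError("kind must be 'I' or 'O'")
    return [f"{kind}_{label[1:] if label.startswith('L') else label}"
            for label in labels]


if __name__ == "__main__":
    print(map_connection("Ln_C_Lk"))
    print(map_io_list("I", ["Lreset", "Lclock"]))
```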
Figure 8-14 shows the modeling of a function call and function definitions. 
Specifically, for the function call case the example provided has only one parameter. 
Synergy string | SPN graph representation (node labels)
F_Bk<< Ln_C_Lr,>> | n; transition Link; k
FD_Bk (I<<Ln1>>O<<Ln2 >> partial synergy string) | block k containing places I_n1 and O_n2 and the SPN mapping of the partial synergy string
Figure 8-14. Mapping from synergy to SPN: Functions. 
Figure 8-15 shows how _SC_ and _BC_ are mapped. 
 
 
 
 
Synergy string | SPN graph representation
_SC_<<partial string 1, partial string 2>> | last place of the SPN of partial string 1, connected to the transition of the SPN of partial string 2
_BC_<<partial string 1, partial string 2>> | SPN of partial string 1 and SPN of partial string 2
Figure 8-15. Mapping from synergy to SPN: Connections. 
8.4 SPN SYNTHESIS 
8.4.1 Sentence grouping 
As already explained, sentences are broken down into partial sentences, and each 
partial sentence is sent to the classification unit, which assigns it a category. The 
category of each partial sentence, together with its kernel, is used to form groups of NL 
sentences and kernels. 
Timing sentences determine these groups: the partial sentences, and their kernels, that 
follow a timing sentence are grouped together with it. 
Consider the NL text in Table 8-3. Each sentence contains one or more kernels and each 
kernel belongs to a certain category. 
Table 8-3. Grouped NL text. 
Sentence (Figure 8-1) | Kernel(s) | Category | #
S1 | A→(key_setup_phase); V→(is_divided); P→(steps) | TS | NLS1,1
S2 | A→first_step_S-box; V→is_filled | TS | NLS2,1
S3 | A→reset_state; V→occurs | CS | NLS2,2
S3 | A→linearly; V→is_initialized; P→S-box | PS | NLS2,3
S4 | A→randomly; V→is_filled; P→second_step_key_setup@S-Box | TS | NLS3,1
S5 | A→S-Box; V→is_used; P→3x256_bytes_RAM_memory | IS | NLS3,2
S5 | A→Fig.4; V→'s_shown; P→S-Box | LS | NLS3,3
S6 | A→algorithm_swapping; V→are_used_for; P→Si__, Sj_registers | IS | NLS3,4
S7 | A→temporary_store_intermediate_variables; V→are_used; P→j__@t_registers | IS | NLS3,5
S7 | A→; V→are_produced; P→temporary_store_intermediate_variables | PS | NLS3,6
S8 | A→address_first_RAM_block; V→is_used; P→first_clock_cycle_counter_i | TS | NLS4,1
S8 | A→; V→see; P→Fig.3 | LS | NLS4,2
S9 | A→new_value_j; V→is_used_for; P→value_Si | IS | NLS4,3
S9 | A→Fig.3; V→is_shown; P→new_value_j | LS | NLS4,4
S10 | A→Si_register; V→is_stored; P→value_Si | PS | NLS4,5
S11 | A→new_value_j; V→are_used_for; P→two_adders | IS | NLS4,6
S12 | A→two_adders; V→accept; P→input_values_Ki@ Si | LS | NLS4,7
S13 | A→address_second_RAM_block; V→is_used_for; P→second_clock_cycle_new_value_j | TS | NLS5,1
S14 | A→Sj_register_temporary; V→is_stored; P→stored_value_address | PS | NLS5,2
S15 | A→j_addresses*i_addresses_correspondingly; V→are_written; P→third_cycle_contents_Si_register@contents_Sj_register | TS | NLS6,1
S16 | A→; V→is_achieved; P→procedure_swapping | IS | NLS6,2
 
As already explained, the timing sentences (TS) determine the groups. In 
the above example six groups are formed. A timing sentence and the sentences that follow 
it, up to the next timing sentence, are placed into one group. In Table 8-3 kernels 
that belong to the same group are shown in the same color and share the same first index 
in the # column. The intuition behind this approach is that timing sentences may refer to 
clock cycles, to different phases of the functional specification, or to different steps 
in the execution path. In the majority of cases, whatever follows a timing sentence is 
executed at the time to which that timing sentence refers. 
Each sentence is also given a number for future reference, chosen so that it 
identifies the type and the group of the sentence. 
NLSi,j refers to the jth sentence of group i. For 
timing sentences j is always 1. 
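The grouping and numbering rule can be sketched as follows. The function name number_kernels is hypothetical, and group 0 is our own placeholder for kernels that would precede the first timing sentence, a case that does not occur in Table 8-3.

```python
def number_kernels(categories):
    """Assign (group, position) pairs to kernels listed in document order.

    A "TS" kernel opens a new group and takes position 1; every other kernel
    takes the next position inside the current group.
    """
    labels = []
    group, position = 0, 1  # group 0: kernels before any timing sentence
    for category in categories:
        if category == "TS":
            group += 1
            position = 1
        else:
            position += 1
        labels.append((group, position))
    return labels


if __name__ == "__main__":
    # Category column of Table 8-3, top to bottom.
    table_8_3 = ["TS", "TS", "CS", "PS", "TS", "IS", "LS", "IS", "IS", "PS",
                 "TS", "LS", "IS", "LS", "PS", "IS", "LS", "TS", "PS", "TS", "IS"]
    for i, j in number_kernels(table_8_3):
        print(f"NLS{i},{j}")
```

Applied to the category column of Table 8-3, the sketch reproduces the numbers in the last column, ending with NLS6,2.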
8.4.2 The synthesis of SPNs 
As explained in section 8.4.1, kernels are grouped based on the order of timing 
sentences, and a number identifies the group and the type of each sentence. 
A kernel number NLSi,j refers to the jth kernel of the i-th group. If j is 1 the kernel 
belongs to a timing sentence; otherwise it belongs to a non-timing sentence. 
This section describes how these groups enrich the SPN graph obtained from the 
Synergy mapping, add restrictions to its transitions, and provide a more elaborate SPN model. 
The enriched model is representative of the system’s behavior. We call the SPN 
graph obtained from the Synergy mapping the basic SPN. 
The first step is to retrieve from the NLU the SPN graph of each NLSi,j, as defined in 
[53]. The following figures present the cases in which the basic SPN is modified based on 
the NLSi,j SPN graphs. For reasons that will become clear in the next chapter, the names in 
SPNs that represent kernels of timing sentences are modified by attaching the prefix “T_”. For 
example, a timing kernel K = (A1, V1, P1) becomes K = 
(T_A1, T_V1, T_P1). 
SPNs that belong to timing sentences are integrated, in sequence, into a 
combined SPN graph. The combination is based on the names of the Agents (A) and Patients 
(P). More specifically, if the name of an Agent or Patient contains keywords that 
appear in the block diagram, these keywords are excluded from the matching process. 
The remaining string is separated into words, if 
applicable, and the lemma of each word is obtained. If more than half of the lemmatized 
words match, the two nodes are connected with a transition. If matches occur between more 
than one timing kernel, the kernels are connected in parallel. Figure 8-16 and Figure 8-17 
show different examples of combinations. 
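The matching rule can be sketched as shown below. Several points are assumptions made only for this illustration: names are split on underscores, the lemmatizer is passed in as a function (plain lowercasing is used as a stand-in, a real lemmatizer would be substituted), and "more than half" is measured against the shorter of the two word sets.

```python
def content_lemmas(name, diagram_keywords, lemmatize=str.lower):
    """Split a kernel name into words, drop diagram keywords, lemmatize the rest."""
    skip = {k.lower() for k in diagram_keywords}
    return {lemmatize(w) for w in name.split("_") if w and w.lower() not in skip}


def partially_match(name_a, name_b, diagram_keywords=(), lemmatize=str.lower):
    """Return True if more than half of the lemmatized words match.

    Assumption: the ratio is taken over the shorter of the two word sets.
    """
    a = content_lemmas(name_a, diagram_keywords, lemmatize)
    b = content_lemmas(name_b, diagram_keywords, lemmatize)
    if not a or not b:
        return False
    return len(a & b) > min(len(a), len(b)) / 2


if __name__ == "__main__":
    keywords = ("Si", "j")  # hypothetical keywords taken from a block diagram
    print(partially_match("new_value_j", "value_Si", keywords))
    print(partially_match("two_adders", "stored_value_address", keywords))
```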
[Figure 8-16 diagram: (a) the separate SPN graphs of the timing kernels NLSi,1 = (A1, V1, P1) and NLSj,1 = (A2, V2, P2), i<j, with nodes T_A1, T_V1, T_P1 and T_A2, T_V2, T_P2; (b) the same nodes joined into one combined SPN graph.]
Figure 8-16. (a) P1 partially matches A2. (b) Combined SPN graph of two kernels 
of TS. 
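[Figure 8-17 diagram: (a) the separate SPN graphs of the timing kernels NLSi,1 = (A1, V1, P1), NLSj,1 = (A2, V2, P2) and NLSt,1 = (A3, V3, P3), with nodes T_A1, T_V1, T_P1, T_A2, T_V2, T_P2, T_A3, T_V3, T_P3; (b) the same nodes joined into one combined SPN graph.]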
Figure 8-17. (a) P1 partially matches A2 and A3 and i<j<t. (b) Combined SPN 
graph of three kernels of TS. 
The combined SPN of the timing kernels becomes part of the basic SPN, and together 
they constitute an enriched SPN graph (Figure 8-18). 
 
[Figure 8-18 diagram: (a) the combined SPN of the timing kernels, with nodes T_A1, T_V1, T_P1, T_A2, T_V2, T_P2, T_A3, T_V3, T_P3; (b) the basic SPN graph; (c) the enriched SPN graph that contains both.]
Figure 8-18. (a) SPN representation of NL kernels of type TS. (b) Basic SPN 
graph. (c) Enriched SPN that contains both SPNs. 
In the case of Figure 8-18(c), the transition of the third kernel is controlled by the 
second kernel. This control comes from the block of SPNs that the second timing kernel 
controls, as shown in Figure 8-19. 
[Figure 8-19 diagram: the timing nodes T_A1, T_V1, T_P1, T_A2, T_V2, T_P2, T_A3, T_V3, T_P3 together with the block labelled "SPN graph controlled by T_A2".]
 
Figure 8-19. Control of parallel connections of TS. 
Therefore, kernels that belong to timing sentences are combined and included in 
the SPN graph obtained from the synergy language, thus forming an enriched SPN graph. 
We now examine how non-timing sentences (NLSi,j, j≠1) alter the form of the basic SPN 
graph. Non-timing sentences are part of a group controlled by a timing sentence. Figure 
8-20 provides an illustrative example of the transformation of the basic SPN graph 
in this case. 
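[Figure 8-20 diagram: (a) a small part of the basic SPN graph (place X1, transition Tr1, place X2), the SPN of the non-timing kernel NLSi,j = (A2, V2, P2), and the SPN of the timing kernel NLSi,1 = (A1, V1, P1) with nodes T_A1, T_V1, T_P1; (b) the enriched graph, in which T_A1 is attached to the fragment X1, Tr1, X2.]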
Figure 8-20. (a) NLSi,j is the kernel of a non-timing sentence belonging to a group 
controlled by NLSi,1. (b) Enriched SPN graph obtained when A2 partially matches X1 and 
V2 partially matches Tr1. An enriched SPN graph is obtained in a similar manner 
when A2 and P2 partially match X1 and X2 in any order, or when one of the places (A2 or P2) 
partially matches one of the places (X1 or X2) and V2 partially matches Tr1. 
Assume now that a place in a non-timing sentence partially matches the name of 
a function in the SPN graph or the name of one place of the basic SPN graph. In this case 
the transition attached to the matched function or place is controlled by the timing 
sentence of the group to which that non-timing sentence belongs. An illustration is 
presented in Figure 8-21. 
 
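[Figure 8-21 diagram: (a) a basic SPN fragment (place X1, transition Tr1, place X2, block B1), the SPN of the non-timing kernel NLSi,j = (A2, V2, P2), and the SPN of the timing kernel NLSi,1 = (A1, V1, P1) with nodes T_A1, T_V1, T_P1; (b) the enriched graph, in which T_A1 is attached to the fragment X1, Tr1, X2 with block B1.]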
Figure 8-21. (a) NLSi,j is the kernel of a non-timing sentence belonging to a group 
controlled by NLSi,1. (b) Enriched SPN graph occurring when A2 partially matches X1. 
8.5 CONCLUSIONS 
This chapter presented a formal modeling of our methodology. For this purpose, 
a formal language named Synergy was defined to represent the building blocks and 
their order in the functional specifications. Furthermore, the mapping between the Synergy 
language and SPNs was shown. 
 
9  PSEUDOCODE EXTRACTION PROCESS 
So far, we have shown how a document is decomposed to obtain the relevant NL text that 
relates to figures of block diagrams. We further presented an NLU unit that extracts the 
kernel holding the basic meaning of each sentence and, at the same time, breaks 
sentences into partial sentences. Moreover, we presented a classification scheme that 
assigns a category to each partial sentence. The DIM unit performs image processing and 
transforms the diagram into a graph that holds all the relevant information of the 
diagram. All this information is combined and represented with the Synergy formal 
language, which is mapped into an SPN graph that is later enriched with timing 
information. 
In this chapter we show how the pseudocode of the system described in the technical 
document, whose architectural description is given in the form of block diagrams, is 
obtained from an SPN graph. This pseudocode refers to the functional specifications that form 
the behavior of the system. In parallel we extract a flowchart to enhance the reader’s 
understanding of the pseudocode and of the hardware system as a whole. 
9.1 EXTRACTION PROCESS 
Pseudocode is derived directly from the enriched SPN graph of the previous 
chapter. It is based on the names of the places and transitions in the SPN graph. More 
specifically, two cases are distinguished when extracting the pseudocode from the SPN graph. 
The first extracts a more general pseudocode and applies when no timing 
information exists. The second applies when timing information is 
available. For the first case we perform the following steps. 
1. Places that contain the prefix “I_” and do not belong to a block of SPNs representing 
a function definition constitute the inputs of the functional specifications. A string 
“Inputs: I1, I2,...,In” that contains the n inputs is created. 
2. Places that contain the prefix “O_” and do not belong to a block of SPNs 
representing a function definition constitute the outputs of the functional 
specifications. A string “Outputs: O1, O2,…,On” that contains the n outputs is 
created and placed after the input string. 
3. SPNs grouped together under a name become a function with that name. 
The string “Function: name” is created. The body of the function is extracted from 
the group of SPNs based on the two previous steps and the following ones. After 
the body of the function is extracted, the string “End function” is created. 
4. Transitions named “link” that have an input place “X1” and an output place “X2” are 
written as “X2=X1”. 
5. All transitions named link that connect to a function definition block are translated 
as function calls. A function call is written as “name (Xi=Xj, Xk=Xl)”, 
where name is the name of the function appearing in the function definition section 
and Xi and Xk are links to Xj and Xl respectively. 
6. Places that have the “if_” prefix become conditions and control the transitions they 
are connected to. A condition is written as “if name”, where name is the name of the 
place having the “if_” prefix. 
7. Transitions that are not controlled by an if place become processing steps. 
Transitions that are controlled by an if place also become processing steps and are 
placed immediately after the if condition. A processing step is written as 
“name_of_output_place” “name_of_transition” “name_of_input_places”. If there is 
more than one input place, an ‘or’ or ‘and’ connection, derived from the SPN graph, 
is placed between them. 
8. The flow of arrows determines the order of the different steps. 
Figure 9-1 and Figure 9-2 serve as visual aids to clarify the aforementioned steps. 
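Steps 1, 2, 4, 6 and 7 can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: the Transition data class and the function names are hypothetical, function definitions and calls (steps 3 and 5) are left out, and the transitions are assumed to be supplied already ordered along the arrows (step 8).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transition:
    name: str
    inputs: List[str]                 # names of the input places
    output: str                       # name of the output place
    connective: str = "and"           # how several input places are combined
    condition: Optional[str] = None   # controlling "if_" place, if any


def bare_name(place):
    """Strip an 'if_' prefix and then an 'I_' or 'O_' prefix from a place name."""
    name = place
    if name.lower().startswith("if_"):
        name = name[3:]
    if name.startswith(("I_", "O_")):
        name = name[2:]
    return name


def to_pseudocode(places, transitions):
    """Translate places and ordered transitions into pseudocode lines."""
    lines = []
    inputs = [p[2:] for p in places if p.startswith("I_")]    # step 1
    outputs = [p[2:] for p in places if p.startswith("O_")]   # step 2
    if inputs:
        lines.append("Inputs: " + ", ".join(inputs))
    if outputs:
        lines.append("Outputs: " + ", ".join(outputs))
    for t in transitions:
        if t.condition:                                       # step 6
            lines.append("if " + bare_name(t.condition))
        out = bare_name(t.output)
        ins = [bare_name(p) for p in t.inputs]
        if t.name.lower() == "link":                          # step 4
            lines.append(f"{out}={ins[0]}")
        else:                                                 # step 7
            operand = ins[0] if len(ins) == 1 else "(" + f" {t.connective} ".join(ins) + ")"
            lines.append(f"{out} {t.name} {operand}")
    return lines


if __name__ == "__main__":
    spn = [Transition("Tr", ["I_X2"], "O_X3", condition="if_X1")]
    print("\n".join(to_pseudocode(["I_X2", "if_X1", "O_X3"], spn)))
```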
SPN graph | Pseudocode
(a) places I_n1, I_n2, ..., I_nk | (b) Inputs: n1, n2, ..., nk
(c) places O_n1, O_n2, ..., O_nk | (d) Outputs: n1, n2, ..., nk
(e) block k containing places I_n1 and O_n2 and an SPN graph | (f) Function: k / Inputs: n1 / Outputs: n2 / “Body of function by translating SPN graph” / End function
(g) n, transition Link, k | (h) k=n
Figure 9-1. (a) SPN places that contain the “I_” prefix. (b) Pseudocode of SPN in (a). (c) 
SPN places that contain the “O_” prefix. (d) Pseudocode of SPN in (c). (e) SPN graph of a 
function definition. (f) Pseudocode of SPN in (e). (g) SPN graph of a link transition. (h) 
Pseudocode of SPN in (g). 
SPN graph | Pseudocode
(a) places X1 and X4, each joined by a Link transition to the block k that contains places I_X2, I_X3 and an SPN graph | (b) k (X2=X1, X4=X3)
(c) place X2, transition Tr, place X3 | (d) X3 Tr X2
(e) place If_X1 controlling transition Tr between places X2 and X3 | (f) If X1 / X3 Tr X2
Figure 9-2. (a) Link to an SPN graph of a function definition block. (b) Pseudocode 
of SPN in (a). (c) A simple transition with one input place and one output place. (d) 
Pseudocode of SPN in (c). (e) A simple transition controlled by an if place. (f) 
Pseudocode of SPN in (e). 
To better understand these steps, an illustrative example is provided in Figure 9-
3(a) of a more complex SPN graph. The pseudocode is shown in Figure 9-3(b). 
 
 
 
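[Figure 9-3(a) diagram: an SPN graph with places I_X4, X5, I_X6, O_X7, X1, I_X2, X8, O_X3, O_X9, the conditional places If_I_n1 or If_I_n2, the transitions Tr1, Tr2, Tr3, Tr4, two link transitions, and a block B1.]
Figure 9-3(b) pseudocode:
Inputs: X4, n1, n2, X1, X8, X6 
Outputs: X7, X9 
Function: B1 
Inputs: n1, n2, X1, X8 
Outputs: X3 
if n1 or n2 
X3 Tr1 (X1 and X2) 
If not (n1 or n2) 
X3 Tr2 X8 
End function 
X5 Tr3 X4 
B1(X2=X5, X9=X3) 
X7 Tr4 X6 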
Figure 9-3. (a) A complex SPN graph. (b) Translation to pseudocode. 
When timing information is available, it gives a direction to the flow of data in 
the SPN graph by activating and restricting transitions in time. In that case, different 
steps are followed to extract the pseudocode, as explained below. 
i. Function definitions and function calls are not applicable here, and the SPN graph 
is processed as a single unit. 
ii. The timing transitions provide the order of the steps used to extract the pseudocode 
from the places and transitions of the SPN graph. 
iii. Starting from the first timing transition, we follow the steps above, except steps 3 
and 5, to extract the pseudocode. 
Modifying the example in Figure 9-3 to include timing information gives the SPN 
graph of Figure 9-4(a). The pseudocode for this case is shown in Figure 9-4(b). 
(a) SPN graph elements: the places, transitions, link transitions and function block B1 of Figure 9-3(a), plus the timing nodes T_A1, T_V1, T_P1, T_A2, T_V2, T_P2, T_A3, T_V3, T_P3. 
(b) Inputs: X4, n1, n2, X1, X8, X6 
    Outputs: X7, X9 
    X5 Tr3 X4 
    X2=X5 
    if (n1 or n2) 
       X3 Tr1 (X1 and X2) 
    if not (n1 or n2) 
       X3 Tr2 X8 
    X7 Tr4 X6 
Figure 9-4.(a) A complex SPN graph with timing information. (b) Pseudocode 
translation.  
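The effect of steps i to iii can be sketched as follows. This is only an illustration under stated assumptions: each group of pseudocode lines is tagged with a hypothetical integer rank that stands for the position of its timing transition, and the function name `timed_pseudocode` is invented for the example. How the ranks are derived from the timing transitions is not modeled here. The sample data reproduce the body of Figure 9-4(b), where the function call of Figure 9-3(b) is replaced by the link assignment X2=X5 followed by the inlined steps.

```python
def timed_pseudocode(steps):
    """Flatten (rank, lines) pairs into one ordered list of pseudocode lines.

    Hypothetical helper: 'rank' stands for the position of the timing
    transition that activates the step group. No function definitions or
    calls are emitted; the graph is treated as a single unit.
    """
    ordered = sorted(steps, key=lambda pair: pair[0])  # sorted() is stable
    lines = []
    for _rank, group in ordered:
        lines.extend(group)
    return lines


if __name__ == "__main__":
    # Ranks are assumed for illustration; the lines come from Figure 9-4(b).
    steps = [
        (3, ["X7 Tr4 X6"]),
        (1, ["X5 Tr3 X4"]),
        (2, ["X2=X5",
             "if (n1 or n2)",
             "   X3 Tr1 (X1 and X2)",
             "if not (n1 or n2)",
             "   X3 Tr2 X8"]),
    ]
    print("\n".join(timed_pseudocode(steps)))
```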
9.2 RESULTS 
The methodology described above has been implemented in Java and Python. 
The program takes a technical document in PDF as input and generates the pseudocode 
for the architectural design of the hardware system it describes. Three examples are provided 
here. Since the results of each step are given in previous chapters, only the final 
pseudocode is shown here. The first example was created manually and 
consists of a short description followed by its block diagram. The second is obtained from 
[80]; it has one block diagram, and its text is less descriptive of the operations 
performed in the system. Finally, the third example [140] has three block diagrams, and its 
text is relatively descriptive of the operations performed. Note that 
these examples serve as a proof of concept for our methodology. 
9.2.1 Example 1 
Figure 9-5(a) shows the block diagram of our manually created example and Figure 
9-5(b) shows its respective NL text. 
[Block diagram: modules SUB, Comparator and Mux; labels I1, I2, S1 and 0; input signals Value1 and Value2; output Max.] 
(a) 
“We created an example of block diagrams that finds the maximum number 
between integer values. Figure 1 shows the hardware implementation that finds the 
maximum values between two numbers. First, the 32-bit number value1 and value2 are 
inserted at the SUB module. SUB module subtracts the two inputs. The result of SUB 
module is sent to a comparator unit. The comparator unit compares the result of SUB 
module with 0. The comparator unit will yield 1 if the result is greater than 0 and the 
comparator will yield 0 if the result is less than 0. If the result from comparator is 1, 
value1 will be selected as the max value from multiplexer. If the result from comparator 
is 0, value2 will be selected as the max value from multiplexer.” 
(b) 
Figure 9-5. Simple example: (a) block diagram (b) NL text. 
Figure 9-6 shows the SPN graph for the NL text only. The SPN graph is a visual 
representation of the kernels detected in the NL text. Figure 9-7 shows the diagram graph 
of the model, and Figure 9-8 shows the final SPN graph. 
Figure 9-6. Text SPN graph. 
Figure 9-7. Diagram Graph. 
Figure 9-8. Final SPN graph. 
Figure 9-9(a) shows the pseudocode extracted by our methodology. We visualize 
the pseudocode with a flowchart, as shown in Figure 9-9(b). 
(a) Input1=Value1 
    Input2=Value2 
    SUB subtracts (Input1, Input2) 
    comparator =SUB 
    comparator compares (SUB,0) 
    if result_is_greator_0 
       comparator will_yield (1) 
    if result_is_less_0 
       will_yield2 (0_ comparator) 
    if comparator_is_1 
       max will_be_selected value1 
    If comparator_is_0 
       Max will_be_selected2 value2 
(b) [Flowchart of the pseudocode in (a): Start; the five sequential steps from Input1=Value1 to comparator compares (SUB,0); a branch on result_is_greator_0 and result_is_less_0 with the two will_yield steps; a branch on the comparator value with the two will_be_selected steps; End.] 
Figure 9-9. Pseudocode of example 1. 
For this particular example, code was written manually to show that, given a 
well-defined and well-explained architectural design of a hardware system, reverse 
engineering the system to code level is possible. Figure 9-10 provides the code, written 
in Python, for this particular example. 
 
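import sys

# Read the two integer operands from the command line.
input1 = int(sys.argv[1])
input2 = int(sys.argv[2])

# SUB module: subtract the two inputs.
sub = input1 - input2

# Comparator: yield 1 if the difference is greater than 0, otherwise 0.
if sub > 0:
    comparator = 1
else:
    comparator = 0

# Multiplexer: select value1 when the comparator output is 1, otherwise value2.
if comparator == 1:
    max_value = input1
else:
    max_value = input2
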
Figure 9-10. Code example. 
9.2.2 Example 2 
For our second example, consider the block diagram in Figure 6-6, with the NL 
text shown in Figure 9-11. The pseudocode for this example is shown in Figure 9-12. 
Some steps, as well as the flowchart, are omitted to better fit the example in this 
dissertation. In this example, details about the operations of the hardware system are 
limited in the NL text. Therefore, the extracted information derives mainly from the 
knowledge base. It is worth noting that the level of detail in the NL text 
affects the performance of our methodology: the more detail the NL text contains, the 
better our methodology can extract the steps and operations that are performed. 
“Lastly, in this section, we will discuss about our future work and more 
specifically the hardware implementation of EDIFT, which will greatly minimize the 
method’s real-time performance overhead. Figure 5.7 illustrates how EDIFT can be 
applied into a standard 5-stage architecture, where the upper part represents the 
system’s CPU and the lower part represents the EDIFT parallel coprocessor. After an 
instruction has been fetched, signals go through both the CPU’s Decode stage and 
through EDIFT coprocessor, where the tags of the used registers are retrieved. Register 
File Tags1 refers to tags of the CPU registers (i.e. edx, ecx, eax in x86 Assembly ) that 
follow the upward-growing stack, while Register File Tags2 follow the downwards-
growing stack. The Stack1 & Stack2 registers component holds additional, EDIFT 
internal registers, which are used for the tag stacks’ manipulation. Parallel to the CPU’s 
execution stage is the EDIFT’s tag Propagation Logic, which then stores the new tags 
in the Stack1 Tags and Stack2 Tags memory. These refer to the tags that track untrusted 
information inside the program’s stack, for the upward-growing and downward-growing 
stacks respectively. 
 Finally, the last stage, includes the Tags Writeback Logic for updating the tag 
information in the Register Files or the Check Logic, which checks for inconsistent tag 
information and sends the Detection Signal when an attack is being performed. 
Moreover, since each one of EDIFT’s components, including the propagation and check 
logic components which are described in Tables 5.1 and 5.2, consist of very simple 
arithmetic and data manipulation instructions, rather than, for instance, a complex 
encryption algorithm, they can fit within the standard processor’s clock cycle (1/4*10^-9 
seconds per cycle for a 4 GHz processor). Hence, EDIFT could be completely masked 
within the CPU’s instruction pipeline and does not introduce any delay since the 
comparison and the execution of the security operations are done in parallel.” 
Figure 9-11. NL text of the block diagram shown in Figure 6-6(a). 
 
Signals go_through (Decode and EDIFT_coprocessor) 
Check_Logic sends (Detection_Signal) 
Tags_Writeback_Logic =Stack1_Tags or Stack2_Tag 
Tags_Writeback_Logic for_updating (Check_Logic or Register_File ) 
Register_File_Tags2 follow (downwards-growing_stack) 
tags_CPU_registers=Register_File_Tags1 
if clk and not_reset and write_enable 
   mem_ICache write (data_ICache, address_ICache) 
if clk and reset 
   mem_ICache initialize (0, address_ICache) 
if clk and not_reset and read_enable 
   data_out_ICache read (data_ICache, address_ICache) 
if clk and not_reset and write_enable1 
   mem_DCache write (data_DCache, address_DCache) 
if clk and reset 
   mem_DCache initialize (0, address_DCache) 
if clk and not_reset and read_enable1 
   data_out_DCache read (data_DCache, address_DCache) 
if clk and not_reset and write_enable3 
   mem_ Register_File write (data_ Register_File, address_ Register_File) 
if clk and reset 
   mem_ Register_File initialize1 (0, address_ Register_File) 
if clk and not_reset and read_enable1 
   data_out_ Register_File read (data_ Register_File, address_ Register_File) 
I3A=I3 or I3B= I3 or I3C=I3 
if I3A 
   ALU_out OP1 (I1,I2) 
if I3B 
   ALU_out OP2 (I1,I2) 
if I3C 
   ALU_out OP3 (I1,I2) 
 
Figure 9-12. Pseudocode of example 2. 
9.2.3 Example 3 
Let us now consider a more complex example obtained from [140]. This document 
contains three block diagrams that are related hierarchically. More 
specifically, the first diagram is the general diagram, the second block diagram gives a 
detailed view of one of the blocks in the general block diagram, and the third 
gives a detailed view of one of the blocks in the second diagram. Figure 9-13 
shows the top-level block diagram and Table 9-1 shows its respective NL text. Figures 9-14 
and 9-15 show the respective diagram graph and text SPN graph. 
 
Figure 9-13. General Block diagram of the approach in [140]. 
Table 9-1. NL text of Control Unit. 
#    NL Text 
S1   “The operation of the storage unit is synchronized by the control unit.” [140] 
S2   “The control unit is responsible for the generation of the clock and control signals.” [140] 
 
 
Figure 9-14. Diagram graph of control Unit. 
Figure 9-15. Text SPN of control unit.  
Figure 9-16 shows the middle-level block diagram, referred to as the storage unit 
implementation. Table 9-2 shows its respective NL text. The diagram graph 
and text SPN graph follow (Figure 9-17, Figure 9-18). 
Figure 9-16. Second block diagram of the approach in [140]: the storage unit 
of the general block diagram in Figure 9-13. 
Table 9-2. NL text of Storage Unit. 
#    NL Text 
S1   “The K-Box consists of the key ,repeating as necessary times ,in order to fill the array.” [140] 
S2   “First ,the key setup phase is divided in two steps.” [140] 
S3   “In the first step the S-box is filled.” [140] 
S4   “The S-Box is initialized linearly ,such as S0 = 0 ,S1 = 1 ,S2 = 2 ,...when the reset state occurs.” [140] 
S5   “In the second step of the key setup ,the S-Box is randomly filled.” [140] 
S6   “For the S-Box ,3x256-bytes RAM memory is used as it 's shown in Fig. 4.” [140] 
S7   “The Si _ and Sj_registers are used for the necessary by the algorithm swapping that are produced.” [140] 
S8   “The j _ and t_registers are used in order to temporary store all the intermediate variables see Fig. 3.” [140] 
S9   “At the first clock cycle the value of counter i is used as address in the first RAM block.” [140] 
S10  “The value of Si is used for the computation of the new value of j as it is shown in Fig .3.” [140] 
S11  “It is stored in the Si_register.” [140] 
S12  “The two adders are used for the computation of the new value of j.” [140] 
S13  “They accept as input the values of Ki and Si.” [140] 
S14  “At the second clock cycle the new produced value j is used as address for the second RAM block.” [140] 
S15  “The stored value in this address is temporary stored in the Sj_register.” [140] 
S16  “At the third cycle the contents of the Si_register and Sj_register are written at the j and i addresses correspondingly.” [140] 
S17  “With this procedure ,the swapping is achieved.” [140] 
 
 
Figure 9-17. Diagram graph of Storage Unit. 
Figure 9-18. Text SPN of Storage Unit.  
Finally, Figure 9-19 shows the lower-level block diagram, referred to as the S-Box 
unit. Table 9-3 shows its respective NL text. The text SPN graph and diagram graph 
follow (Figure 9-20, Figure 9-21). 
 
 
 
Figure 9-19. S-Box ram block of the above diagram [140]. 
 
 
Table 9-3. NL-text of S-Box unit. 
#    NL Text 
S1   “The block diagram of the S-box RAM is shown Fig. 4”. [140] 
S2   “It consists of three 256 bytes RAM blocks.” [140] 
S3   “Each RAM block has four inputs and one output.” [140] 
S4   “The two inputs are the read and write signals while the other two are the address and data signals.” [140] 
S5   “Also, all the three RAM boxes have the same signals of clock and reset.” [140] 
S6   “The operation of RAM blocks is quite simple.” [140] 
S7   “If the reset signal occurs the blocks are initialized linearly.” [140] 
S8   “For each block, if the write signal occurs new data are stored in the address position, or if the read signal occurs the data in the address position are available on the output of the block.” [140] 
 
 
Figure 9-20. Text SPN of S-Box unit. 
Figure 9-21. Diagram graph of S-Box. 
As already explained, a bottom-up approach is followed in handling the diagrams. 
The block diagram (S-Box unit) in Figure 9-19 is processed first. The final SPN graph for 
one of the blocks of the S-Box unit (“(i-block) 256 bytes RAM”) is shown in Figure 9-22. The 
places and transitions for this block were obtained after mapping from the Synergy language. 
Figure 9-22. Final SPN graph of one block of S-Box unit. 
The natural language text for the “S-Box” did not contain any timing sentences; 
therefore, we can obtain a more general block diagram. The pseudocode for the “S-Box” 
alone, obtained by parsing the final SPN graph, is shown in Figure 9-23. 
Inputs: reset,clock,Si_read,Si_write,data, 
address, Sj_read, Sj_write, data#1, address#1, St_write, St_read, data#2, address#2 
Outputs:Si, Sj, St 
Function: i_block_256_bytes_RAM() 
Inputs: Si_read, Si_write, Data, Address, Reset, Clock 
Outputs: Si 
Reset_ occurs=Reset 
Clock=Clock 
Address= Address 
Data=Data 
Si_write=Si_write 
Si_read=Si_read 
If reset_occurs  
(I_block)_256_bytes_RAM are_initialized (linearly) 
If Si_write_occurs 
   are_stored (address,data) 
if Si_read_occurs 
      Si are (data,address) 
End function 
Function: j_block_256_bytes_RAM() 
Inputs: Sj_read, Sj_write, Data#1, Address#1, Reset#1, Clock#1 
Outputs: Sj 
Reset_ccurs#1=Reset 
Clock#1=Clock#1 
Address#1= Address#1 
Data#1=Data#1 
Sj_write=Sj_write 
Sj_read=Sj_read 
If reset_occurs#1  
(j block) 256 bytes RAM are_initialized#1 (linearly#1) 
If Sj_write_occurs 
   are_stored#1 (address#1, data#1) 
if Sj_read_occurs 
      Sj are#1 (data#1, address#1) 
End function 
Function: t_block_256_bytes_RAM() 
Inputs: St_read, St_write, Data#2, Address#2, Reset#2, Clock#2 
Outputs: St 
Reset_ccurs#2=Reset#2 
Clock#2=Clock#2 
Address#2= Address#2 
Data#2=Data#2 
St_write=St_write 
St_read=St_read 
If reset_occurs#1  
(t_block)_256_bytes_RAM are_initialized#2 (linearly#2) 
If St_write_occurs 
   are_stored#2 (address#2, data#2) 
if St_read_occurs 
      St are#2 (data#2, address#2) 
End function 
i_block_256_bytes_RAM(Reset=Reset, clock=clock, Si_read=Si_read, 
Si_write=Si_write, data=data, address=address) 
j_block_256_bytes_RAM(Reset#1=Reset, clock#1=clock, Sj_read=Sj_read, 
Sj_write=Sj_write, data#1=data#1, address#1=address#1) 
t_block_256_bytes_RAM(Reset#2=Reset, clock#2=clock, St_read=St_read, 
St_write=St_write, data#2=data#2, address#2=address#2) 
 
Figure 9-23. Pseudocode for S-Box unit. 
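To make the behavior described by Table 9-3 and the pseudocode of Figure 9-23 concrete, the following Python sketch models one 256-byte RAM block and an S-Box made of three such blocks that share the reset signal. It is a behavioral illustration, not the implementation of [140] or of our tool: the class and method names are hypothetical, and the priority of reset over write over read within one clock cycle is an assumption, since the text does not state it.

```python
class RamBlock:
    """Behavioral sketch of one 256-byte RAM block (Table 9-3, S3 to S8)."""

    SIZE = 256

    def __init__(self):
        self.mem = [0] * self.SIZE
        self.out = None  # block output; None until a read has occurred

    def clock(self, reset=False, read=False, write=False, address=0, data=0):
        """Evaluate one clock cycle; address is expected in 0..255."""
        if reset:
            # "If the reset signal occurs the blocks are initialized linearly."
            self.mem = list(range(self.SIZE))
        elif write:
            # New data are stored in the address position (kept to 8 bits).
            self.mem[address] = data & 0xFF
        elif read:
            # The data in the address position become available on the output.
            self.out = self.mem[address]
        return self.out


class SBox:
    """Three RAM blocks (i, j, t) that receive the same reset signal."""

    def __init__(self):
        self.blocks = {"i": RamBlock(), "j": RamBlock(), "t": RamBlock()}

    def reset(self):
        for block in self.blocks.values():
            block.clock(reset=True)


if __name__ == "__main__":
    sbox = SBox()
    sbox.reset()
    i_block = sbox.blocks["i"]
    i_block.clock(write=True, address=5, data=200)
    print(i_block.clock(read=True, address=5))
    print(sbox.blocks["j"].clock(read=True, address=5))
```

Each block keeps its own read, write, address and data signals, which mirrors the three function calls at the end of Figure 9-23.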
Figure 9-24 shows the flowchart for only one of the functions of the above 
pseudocode.  
[Flowchart: Start; inputs Si_read, Si_write, Data, Address, Reset, Clock; decisions “If reset_occurs”, “If Si_write_occurs” and “If Si_read_occurs”; steps “(I block) 256 bytes RAM are (intialized, linearly)” and “Si are (data, address)”; output Si; end.] 
Figure 9-24. Flowchart of ‘(i_block)_256_bytes_RAM’ function. 
The flowcharts of the other two units (the “t” and “j” “block 256 bytes RAM”) are the same as 
that of the “i-block 256 bytes RAM”, but with their own signals. As already explained, the 
“S-Box” unit does not have any timing sentences. Therefore, it can be translated to pseudocode 
based on the rules described in Section 9.1. Each separate block is a function, and 
its code is given in Figure 9-23. The pseudocode for the “S-Box” then simply 
calls those functions. Figure 9-25 shows the general flowchart for that unit. 
[Flowchart: Start; inputs Si_read, Si_write, Data, Address, Reset, Clock, Sj_read, Sj_write, Data#1, Address#1, St_read, St_write, Data#2, Address#2; blocks for the 256-byte RAM functions, including (i_block)_256_bytes_RAM and (t_block)_256_bytes_RAM; outputs Si, Sj, St; end.] 
Figure 9-25. General Flowchart of the ‘S-box’ unit. 
We do not provide the final SPN graph of the complete system here, since it 
contains so many places and transitions that they would be impossible to distinguish. 
Instead, we provide a simplified version of the pseudocode and flowchart in Figure 
9-26 and Figure 9-27. 
 
 
 
 Inputs: reset,clock, Si_read,Si_write, data, address, Sj_read, Sj_write, data#1, 
address#1, St_write, St_read, data#2, address#2, key_Input[127:0], 
read_enable_key, write_enable_key, key_sel, (couner)_i_[7:0] 
Outputs: Stream: 
Key=key_Input[127:0] 
sel = key_sel 
Necessary_times 
Repeating key 
K-Box consist_of key 
is_filled (S-Box) 
Randomly is_filled2 S-Box 
if Reset_occurs 
   (I_block)_256_bytes_RAM are_initialized Linearly 
O3=j[7:0] 
If sel  
   O0=key_out[7:0] 
O1= add ( O0,O3) 
Si[9:0]_1=Si[7:0] 
J[7:0]=add( O1,Si[7:0]) 
Sj[7:0]=Sj[7:0] 
Address = J[7:0] 
Address2=i[7:0] 
Are_stored (data,address2) 
Are_stored2(data,address2 
 
 
Figure 9-26. Simplified version of pseudocode of the complete system. 
 
 
Figure 9-27. Simplified flowchart of the complete system. 
This example was selected because the authors provide the flowchart for 
their block diagram. Figure 9-28 shows that flowchart, as obtained from [140] but redrawn 
for visualization purposes. 
[Figure 9-28, flowchart obtained with our methodology, nodes in extraction order: Inputs; Start; Key=key_input[127:0]; sel = key_sel; Necessary_time; Repeating key; K-Box consist of key; Is_filled (S-Box); reset_occurs; (I_block)_256 bytes_RAM are (intialized, linearly); O3=j[7:0]; sel; O0=key_out[7:0]; O1=add (O0,O3); Si[9:0]_1=Si[7:0]; J[7:0]=add (O1,Si[7:0]); Sj[7:0]=Sj[7:0]; Address = J[7:0]; Address2=i[7:0]; Are_stored (data,address2); Are_stored2(data,address2; Stream; end; ...; O0=I2_mux-2-in-1; Randomly is_fille2 (S-Box).] 
Figure 9-28. Comparison between our methodology and the one in [140]. 
As the figure shows, we have successfully detected the most important 
operations described in the paper. However, loops that indicate repetition of steps have not 
yet been extracted; this is left for future work. 
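For illustration, the loop that remains to be extracted can be written by hand from the reference flowchart of [140] (j=i=0; j=j+Si+Ki; swap Si, Sj; i=i+1) and the sentences of Table 9-2. The sketch below is not output of our methodology. The function name `key_setup`, the 256-entry S-Box and K-Box, and the modulo-256 arithmetic (suggested by the 8-bit signals such as j[7:0]) are assumptions.

```python
def key_setup(key):
    """Hand-written sketch of the key setup loop described in Table 9-2.

    Assumptions: 'key' is a non-empty bytes object, the S-Box and K-Box
    hold 256 entries each, and all additions wrap around at 256.
    """
    if len(key) == 0:
        raise ValueError("key must not be empty")
    size = 256
    # S4: the S-Box is initialized linearly (S0 = 0, S1 = 1, S2 = 2, ...).
    s_box = list(range(size))
    # S1: the K-Box consists of the key, repeated as needed to fill the array.
    k_box = [key[i % len(key)] for i in range(size)]
    j = 0
    for i in range(size):
        # S10, S12, S13: the new value of j is computed from Si and Ki.
        j = (j + s_box[i] + k_box[i]) % size
        # S16, S17: Si and Sj are written back swapped.
        s_box[i], s_box[j] = s_box[j], s_box[i]
    return s_box


if __name__ == "__main__":
    table = key_setup(b"example key")
    print(len(table), sorted(table) == list(range(256)))
```

Because each iteration only swaps two entries, the result is always a permutation of 0 to 255, which gives a simple sanity check for any future automatically extracted loop.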
9.3 CONCLUSIONS 
In this chapter, the extraction of pseudocode from the final SPN graph was presented. 
Furthermore, three examples of different levels of difficulty were shown in order to 
demonstrate the effectiveness of our methodology. 
[Figure 9-28, further panel content. Flowchart obtained with our methodology: O0=key_out[7:0]; O1=add (O0,O3); Si[9:0]_1=Si[7:0]; J[7:0]=add (O1,Si[7:0]); Sj[7:0]=Sj[7:0]; Address = J[7:0]; Address2=i[7:0]; Are_stored (data,address2); Are_stored2(data,address2; Out_counter= add (out_counter, init); (counter) i =out_counter. Flowchart from [140]: Start; j=i=0; j=j+Si+Ki; Swap Si,Sj; i=i+1; i<256?] 
10 CONCLUSIONS-CONTRIBUTIONS 
Reverse engineering has been studied for several decades across 
different research fields. Some domains still remain largely unexplored, among them the 
reverse engineering of architectural designs of hardware systems that appear in technical 
documents. The design details in technical documents are given in the most abstract 
form, which makes them challenging to understand. 
Here, a novel reverse engineering methodology for deep understanding of technical 
documents is proposed. In particular, first a survey of reverse engineering of diagrams is 
presented. Second, an improved NLU scheme is analyzed, and the extraction of diagrams and 
their modeling into graphs is explained. Third, a synergistic model between NLU, DIM 
and sentence classification is introduced, and a formal language named Synergy is 
developed. Finally, a novel methodology for pseudocode extraction from architectural 
designs of hardware systems is implemented. In summary, the main contributions of our 
dissertation are: 
• An extensive survey was conducted and publications in the area of electronic or 
digital systems were presented. 
• Categorization of each publication into a classification scheme and evaluation 
based on a maturity metric [141]. 
• Use of K-means clustering to identify groups of text lines in the document and 
automatically label each group as Caption, Body text or Metadata. 
• Improvement of an algorithm [53] for kernel extraction that allows processing of 
more complex sentences. 
• Sentence classification with Convolutional Neural Networks. 
• Implementation of a novel methodology that utilizes machine learning, NLU and 
DIM to extract functional specifications from architectural designs of hardware 
systems. 
• Pseudocode extraction representing the functionality (operations) of architectural 
designs of hardware systems that appear in the form of block diagrams. 
10.1 LIMITATIONS AND FUTURE WORK 
For our future work, performing Optical Character Recognition (OCR) is important for 
document structure extraction, which is currently limited by the PDFBox tool. Furthermore, 
to provide better detection of block diagrams, one or more classification schemes will be 
explored. The dataset for sentence classification is planned to be extended, which should lead 
to a more accurate classification scheme. Moreover, more classification models for 
sentence classification are to be explored. As far as pseudocode extraction is concerned, 
we plan to derive iteration information and, more importantly, to incorporate tables, graphs and 
equations in the functional specifications model. 
REFERENCES 
 
[1] S. E. Quadir et al., “A Survey on Chip to System Reverse Engineering,” J. Emerg. 
Technol. Comput. Syst., vol. 13, no. 1, pp. 6:1–6:34, Apr. 2016.
[2] “How printed circuit board is made - material, manufacture, making, history, used, 
processing, parts, components, steps.” [Online]. Available: http://www.madehow.com/ 
Volume-2/Printed-Circuit-Board.html. [Accessed: 13-Apr-2017]. 
[3] “What is integrated circuit (IC)? - Definition from WhatIs.com,” WhatIs.com. 
[Online]. Available: http://whatis.techtarget.com/definition/integrated-circuit-IC. 
[Accessed: 13-Apr-2017]. 
[4] “Digital Circuits/Digital Circuit Types - Wikibooks, open books for an open world.” 
[Online]. Available: https://en.wikibooks.org/wiki/Digital_Circuits 
/Digital_Circuit_Types. [Accessed: 13-Apr-2017]. 
[5] “Search Form.” [Online]. Available: https://pascal.computer.org/sev_display/ 
index.action. [Accessed: 13-Apr-2017]. 
[6] J. Grand, “Printed Circuit Board Deconstruction Techniques,” in Proceedings of the 
8th USENIX Conference on Offensive Technologies, Berkeley, CA, USA, 2014, pp. 
11–11. 
[7] C. Koutsougeras, N. Bourbakis, and V. Gallardo, “Reverse engineering of real PCB 
level design using VERILOG HDL,” ENGINEERING INTELLIGENT SYSTEMS 
FOR ELECTRICAL ENGINEERING AND COMMUNICATIONS, vol. 10, no. 2, pp. 
63–68, Jun. 2002. 
[8] “Types of Printed Circuit Boards | PCB Universe, Inc.” . 
[9] “Printed Circuit Board Manufacturers in Chennai, Double Sided PCB,” krcircuits.com. 
[Online]. Available: http://www.krcircuits.com/products.html. [Accessed: 13-Apr-
2017]. 
[10] Ben Johnson, “EE368: Reverse Engineering of Printed Circuit Boards.” 
[11] J. A. Naidu, N. Sowjanya, T. A. S. Bhargav, R. Ganesh Sainath, and M. Sujatha, 
“Reverse Engineering for Error Detections of Printed Circuit Board (PCB),” 
International Journal For Technological Research In Engineering, vol. 2, no. 7, Mar. 
2015. 
[12] W. Wen-Yen, J. W. Mao-Jiun, and L. Chih-Ming, “Automated inspection of printed 
circuit boards through machine vision,” Computers in Industry, vol. 28, no. 2, pp. 103–
111, 1996. 
[13] Mat Ruzinoor Che, S. Azmi, R. Daud, A. N. Zulkifli, and F. K. Ahmad, 
“Morphological Operation on Printed Circuit Board (PCB) Reverse Engineering using 
MATLAB.” 
[14] H. G. Longbotham, P. Yan, H. N. Kothari, and J. Zhou, “Nondestructive reverse 
engineering of trace maps in multilayered PCBs,” in AUTOTESTCON ’95. Systems 
Readiness: Test Technology for the 21st Century. Conference Record, 1995, pp. 390–
397. 
[15] R. Torrance and D. James, “The State-of-the-Art in IC Reverse Engineering,” in 
Cryptographic Hardware and Embedded Systems - CHES 2009, Springer, Berlin, 
Heidelberg, 2009, pp. 363–381. 
[16] K. Nohl, D. Evans, S. Starbug, and H. Plötz, “Reverse-engineering a Cryptographic 
RFID Tag,” in Proceedings of the 17th Conference on Security Symposium, Berkeley, 
CA, USA, 2008, pp. 185–193. 
[17] “Analog Integrated Circuits: Batteries Not Needed,” The MITRE Corporation, 
Aug. 2013. 
[18] “VLSI TRAINING IN THRISSUR - Software Training, Embedded Systems 
Course In West Thrissur - Click.in.” [Online]. Available: http://thrissur.click.in/vlsi-
training-in-thrissur-c72-v4284693. [Accessed: 13-Apr-2017]. 
[19] A. Wilk and A. Pnueli, “Specification and verification of VLSI systems,” in 1989 
IEEE International Conference on Computer-Aided Design. Digest of Technical 
Papers, 1989, pp. 460–463. 
[20] P. Lammens, L. Claesen, and H. D. Man, “Correctness verification of VLSI 
modules supported by a very efficient Boolean prover,” in Proceedings 1989 IEEE 
International Conference on Computer Design: VLSI in Computers and Processors, 
1989, pp. 266–269. 
[21] Y. L. Lin and D. D. Gajski, “LES: a layout expert system,” IEEE Transactions on 
Computer-Aided Design of Integrated Circuits and Systems, vol. 7, no. 8, pp. 868–876, 
Aug. 1988. 
[22] W. Meier, “Method for hierarchic logic verification of VLSI circuits,” US5671399 
A, 23-Sep-1997. 
[23] “Integrated circuit design,” Wikipedia. 03-Apr-2017. 
[24] G. Chamberlain and L. Lam, “Computer-assisted design analysis method for 
extracting device and interconnect information,” US6289116 B1, 11-Sep-2001. 
[25] N. G. Bourbakis, A. Mogzadeh, S. Mertoguno, and C. Koutsougeras, “A 
knowledge-based expert system for automatic visual VLSI reverse-engineering: VLSI 
layout version,” IEEE Transactions on Systems, Man, and Cybernetics - Part A: 
Systems and Humans, vol. 32, no. 3, pp. 428–436, May 2002. 
[26] K. K. Yu and C. N. Berglund, “Automated system for extracting design and layout 
information from an integrated circuit,” US5086477 A, 04-Feb-1992. 
[27] G. Masalskis and R. Navickas, “Reverse Engineering of CMOS Integrated 
Circuits,” Elektronika ir Elektrotechnika, vol. 88, no. 8, pp. 25–28, Mar. 2015. 
[28] K. J. Singh and P. A. Subrahmanyam, “Extracting RTL models from transistor 
netlists,” in Proceedings of IEEE International Conference on Computer Aided Design 
(ICCAD), 1995, pp. 11–17. 
[29] S. Bose and A. L. Fisher, “Verifying pipelined hardware using symbolic logic 
simulation,” in Proceedings 1989 IEEE International Conference on Computer Design: 
VLSI in Computers and Processors, 1989, pp. 217–221. 
[30] C. Bao, D. Forte, and A. Srivastava, “On application of one-class SVM to reverse 
engineering-based hardware Trojan detection,” in Fifteenth International Symposium 
on Quality Electronic Design, 2014, pp. 47–54. 
[31] G. H. Chisholm, S. T. Eckmann, C. M. Lain, and R. L. Veroff, “Reverse 
engineering of integrated circuits,” US6536018 B1, 18-Mar-2003. 
[32] W. Li et al., “WordRev: Finding word-level structures in a sea of bit-level gates,” 
in 2013 IEEE International Symposium on Hardware-Oriented Security and Trust 
(HOST), 2013, pp. 67–74. 
[33] M. C. Hansen, H. Yalcin, and J. P. Hayes, “Unveiling the ISCAS-85 benchmarks: 
a case study in reverse engineering,” IEEE Design Test of Computers, vol. 16, no. 3, 
pp. 72–80, 1999. 
[34] W. Li, Z. Wasson, and S. A. Seshia, “Reverse engineering circuits using behavioral 
pattern mining,” in 2012 IEEE International Symposium on Hardware-Oriented 
Security and Trust, 2012, pp. 83–88. 
[35] A. Chowdhary, S. Kale, P. K. Saripella, N. K. Sehgal, and R. K. Gupta, “Extraction 
of functional regularity in datapath circuits,” IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, vol. 18, no. 9, pp. 1279–1296, Sep. 1999. 
[36] F. Poirot, R. Roane, and G. Tarroux, “Automatic synthesis of integrated circuits 
employing boolean decomposition,” US5805462 A, 08-Sep-1998. 
[37] J. Gattiker, S. Mertoguno, A. Moghaddamzadeh, and N. Bourbakis, “Visual reverse 
engineering using SPNs for automated testing and diagnosis of digital circuits,” in 
AUTOTESTCON ’95. Systems Readiness: Test Technology for the 21st Century. 
Conference Record, 1995, pp. 236–242. 
[38] A. Okazaki, T. Kondo, K. Mori, S. Tsunekawa, and E. Kawamoto, “An automatic 
circuit diagram reader with loop-structure-based symbol recognition,” IEEE 
Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 3, pp. 331–
341, May 1988. 
[39] H. Kato and S. Inokuchi, “The recognition method for roughly hand-drawn logical 
diagrams based on hybrid utilization of multi-layered knowledge,” in 10th 
International Conference on Pattern Recognition [1990] Proceedings, 1990, vol. i, pp. 
578–582 vol.1. 
[40] P. Subramanyan et al., “Reverse Engineering Digital Circuits Using Structural and 
Functional Analyses,” IEEE Transactions on Emerging Topics in Computing, vol. 2, 
no. 1, pp. 63–80, Mar. 2014. 
[41] K. Fisler, “A logical formalization of hardware design diagrams.,” Indiana 
University Department of Computer Science, Technical Report TR416, 1994. 
[42] L. Rokach, A. Feldman, M. Kalech, and G. Provan, “Machine-learning-based 
circuit synthesis,” in 2012 IEEE 27th Convention of Electrical and Electronics 
Engineers in Israel, 2012, pp. 1–5. 
[43] S. Srihari et al., “Document Understanding: Research Directions,” 1992. 
[44] R. P. Colwell, R. P. Nix, J. J. O’Donnell, D. B. Papworth, and P. K. Rodman, “A 
VLIW architecture for a trace scheduling compiler,” IEEE Transactions on Computers, 
vol. 37, no. 8, pp. 967–979, Aug. 1988. 
[45] J. R. Burch, E. M. Clarke, K. L. McMillan, and D. L. Dill, “Sequential Circuit 
Verification Using Symbolic Model Checking,” in Proceedings of the 27th ACM/IEEE 
Design Automation Conference, New York, NY, USA, 1990, pp. 46–51. 
[46] G. Butler, P. Grogono, R. Shinghal, and I. Tjandra, “Retrieving information from 
data flow diagrams,” in Proceedings of 2nd Working Conference on Reverse 
Engineering, 1995, pp. 22–29. 
[47] E. Borger and G. D. Castillo, “A formal method for provably correct composition 
of a real-life processor out of basic components. (The APE100 Reverse Engineering 
Study),” in First IEEE International Conference on Engineering of Complex 
Computer Systems, 1995. Held jointly with 5th CSESAW, 3rd IEEE RTAW and 20th 
IFAC/IFIP WRTP, Proceedings, 1995, pp. 145–148. 
[48] H. Bunke, “Attributed Programmed Graph Grammars and Their Application to 
Schematic Diagram Interpretation,” IEEE Transactions on Pattern Analysis and 
Machine Intelligence, vol. PAMI-4, no. 6, pp. 574–582, Nov. 1982. 
[49] K. Fisler and S. D. Johnson, “Integrating design and verification environments 
through a logic supporting hardware diagrams,” in Design Automation Conference, 
1995. Proceedings of the ASP-DAC ’95/CHDL ’95/VLSI ’95., IFIP International 
Conference on Hardware Description Languages. IFIP International Conference on 
Very Large Scal, 1995, pp. 669–674. 
[50] H. J. Levesque, “The Interaction with Incomplete Knowledge Bases: A Formal 
Treatment,” in Proceedings of the 7th International Joint Conference on Artificial 
Intelligence - Volume 1, San Francisco, CA, USA, 1981, pp. 240–245. 
[51] M. T. Mills and N. G. Bourbakis, “Graph-Based Methods for Natural Language 
Processing and Understanding—A Survey and Analysis,” IEEE Transactions on 
Systems, Man, and Cybernetics: Systems, vol. 44, no. 1, pp. 59–71, Jan. 2014. 
[52] S. N. Srihari, “Document Image Understanding,” in Proceedings of 1986 ACM Fall 
Joint Computer Conference, Los Alamitos, CA, USA, 1986, pp. 87–96. 
[53] A. Psarologou, “A Stochastic Petri Net based NLU Scheme for Technical 
Documents Understanding,” Wright State University, 2016. 
[54] A. Psarologou, A. Esposito, and N. Bourbakis, “A Synthesis of Stochastic Petri Net 
(SPN) Graphs for Natural Language Understanding (NLU) Event/Action Association,” 
in 2015 IEEE 27th International Conference on Tools with Artificial Intelligence 
(ICTAI), 2015, pp. 218–225. 
[55] S. Ray Choudhury and C. L. Giles, “An Architecture for Information Extraction 
from Figures in Digital Libraries,” 2015, pp. 667–672. 
[56] N. G. Bourbakis, “A methodology for document processing: separating text from 
images,” Engineering Applications of Artificial Intelligence, vol. 14, no. 1, pp. 35–41, 
Feb. 2001. 
[57] R. Keefer and N. Bourbakis, “A Survey on Document Image Processing Methods 
Useful for Assistive Technology for the Blind,” International Journal of Image and 
Graphics, vol. 15, no. 01, p. 1550005, Jan. 2015. 
[58] D. D. A. Bui, G. Del Fiol, and S. Jonnalagadda, “PDF text classification to leverage 
information extraction from publication reports,” Journal of Biomedical Informatics, 
vol. 61, pp. 141–148, Jun. 2016. 
[59] “MEDLINE®/PubMed® Data Element (Field) Descriptions.” [Online]. Available: 
https://www.nlm.nih.gov/bsd/mms/medlineelements.html. [Accessed: 15-Apr-2018]. 
[60] S. R. Choudhury et al., “Figure Metadata Extraction from Digital Documents,” 
2013, pp. 135–139. 
[61] M. Jang, J. D. Choi, and J. Allan, “Improving Document Clustering by Eliminating 
Unnatural Language,” arXiv:1703.05706 [cs], Mar. 2017. 
[62] S. Li, L. Gao, Z. Tang, and Y. Yu, “Cross-reference identification within a PDF 
document,” 2015, p. 940209. 
[63] “Apache PDFBox | A Java PDF Library.” [Online]. Available: 
https://pdfbox.apache.org/. [Accessed: 15-Apr-2018]. 
[64] P. Bholowalia, EBK-Means: A Clustering Technique based on Elbow Method and 
K-Means in WSN. 
[65] N. George et al., “Hardware system synthesis from Domain-Specific Languages,” 
in 2014 24th International Conference on Field Programmable Logic and Applications 
(FPL), 2014, pp. 1–8. 
[66] C. C. Aggarwal and C. Zhai, “A Survey of Text Clustering Algorithms,” in Mining 
Text Data, C. C. Aggarwal and C. Zhai, Eds. Boston, MA: Springer US, 2012, pp. 77–
128. 
[67] P. Berkhin, “A Survey of Clustering Data Mining Techniques,” in Grouping 
Multidimensional Data, J. Kogan, C. Nicholas, and M. Teboulle, Eds. 
Berlin/Heidelberg: Springer-Verlag, 2006, pp. 25–71. 
[68] “IEEE Xplore Digital Library.” [Online]. Available: 
https://ieeexplore.ieee.org/Xplore/home.jsp. [Accessed: 15-Apr-2018]. 
[69] D. Liu, Y. Li, and M. A. Thomas, “A Roadmap for Natural Language Processing 
Research in Information Systems,” 2017. 
[70] E. Ovchinnikova, Integration of World Knowledge for Natural Language 
Understanding. Springer Science & Business Media, 2012. 
[71] V. Nastase, R. Mihalcea, and D. R. Radev, “A survey of graphs in natural language 
processing,” Natural Language Engineering, vol. 21, no. 05, pp. 665–698, Nov. 2015. 
[72] N. Bourbakis and M. Mills, “Converting natural language text sentences into SPN 
representations for associating events,” Int. J. Semantic Computing, vol. 06, no. 03, pp. 
353–370, Sep. 2012. 
[73] M. Mills, “Natural Language Document and Event Association Using Stochastic 
Petri Net Modeling,” 2013. 
[74] M. Mills, A. Psarologou, and N. Bourbakis, “Modeling Natural Language 
Sentences into SPN Graphs,” in 2013 IEEE 25th International Conference on Tools 
with Artificial Intelligence, 2013, pp. 889–896. 
[75] A. Psarologou and N. Bourbakis, “Glossa — A Formal Language as a Mapping 
Mechanism of NL Sentences into SPN State Machine for Actions/Events Association,” 
Int. J. Artif. Intell. Tools, vol. 26, no. 02, p. 1750012, Feb. 2017. 
[76] A. Psarologou, N. Bourbakis, and M. Virvou, “A mapping mechanism of NL 
sentences onto an SPN state machine for understanding purposes,” in IISA 2014, The 
5th International Conference on Information, Intelligence, Systems and Applications, 
2014, pp. 321–324. 
[77] “The Stanford Natural Language Processing Group.” [Online]. Available: 
https://nlp.stanford.edu/software/lex-parser.shtml. [Accessed: 13-Apr-2017]. 
[78] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, 
“The Stanford CoreNLP natural language processing toolkit,” presented at the ACL 
(System Demonstrations), 2014, pp. 55–60. 
[79] N. G. Bourbakis and B. Manaris, “An SPN based methodology for document 
understanding,” in Proceedings Tenth IEEE International Conference on Tools with 
Artificial Intelligence (Cat. No.98CH36294), 1998, pp. 10–15. 
[80] A. Trikalinou and N. Bourbakis, “An Enhanced Dynamic Information Flow 
Tracking Method with Reverse Stack Execution,” IJMSTR, vol. 3, no. 1, pp. 40–58, 
Jan. 2015. 
[81] L. Wenyin and D. Dori, “Sparse pixel tracking: a fast vectorization algorithm 
applied to engineering drawings,” in Proceedings of 13th International Conference on 
Pattern Recognition, 1996, vol. 3, pp. 808–812. 
[82] L. Wenyin and D. Dori, “From Raster to Vectors: Extracting Visual Information 
from Line Drawings,” Pattern Analysis & Applications, vol. 2, no. 1, pp. 10–21, Apr. 
1999. 
[83] R. Smith, “An Overview of the Tesseract OCR Engine,” in Ninth International 
Conference on Document Analysis and Recognition (ICDAR 2007), 2007, vol. 2, pp. 
629–633. 
[84] V. Bonato, E. Marques, and G. A. Constantinides, “A Parallel Hardware 
Architecture for Scale and Rotation Invariant Feature Detection,” IEEE Transactions 
on Circuits and Systems for Video Technology, vol. 18, no. 12, pp. 1703–1712, Dec. 
2008. 
[85] T.-C. Chen, Y.-W. Huang, and L.-G. Chen, “Fully utilized and reusable architecture 
for fractional motion estimation of H.264/AVC,” in 2004 IEEE International 
Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 5, pp. V-9–12. 
[86] T.-C. Chen, C.-J. Lian, and L.-G. Chen, “Hardware Architecture Design of an 
H.264/AVC Video Codec,” in Proceedings of the 2006 Asia and South Pacific Design 
Automation Conference, Piscataway, NJ, USA, 2006, pp. 750–757. 
[87] N. Bourbakis, N. Pereira, and S. Mertoguno, “Hardware design of a letter-driven 
OCR and document processing system,” Journal of Network and Computer 
Applications, vol. 19, no. 3, pp. 275–294, Jul. 1996. 
[88] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM 
Comput. Surv., vol. 34, no. 1, pp. 1–47, Mar. 2002. 
[89] V. Srividhya and R. Anitha, “Evaluating Preprocessing Techniques in Text 
Categorization,” no. 2010, p. 3, 2010. 
[90] A. Hotho, A. Nurnberger, G. Paaß, and S. Augustin, “A Brief Survey of Text 
Mining,” p. 37. 
[91] M. W. Berry, Survey of Text Mining II. Springer, 2008. 
[92] V. Gupta and G. S. Lehal, “A Survey of Text Mining Techniques and 
Applications,” Journal of Emerging Technologies in Web Intelligence, vol. 1, no. 1, 
Aug. 2009. 
[93] R. Alghamdi and K. Alfalqi, “A Survey of Topic Modeling in Text Mining,” 
International Journal of Advanced Computer Science and Applications, vol. 6, no. 1, 
2015. 
[94] A. Blanchard, “Understanding and customizing stopword lists for enhanced patent 
mapping,” World Patent Information, vol. 29, no. 4, pp. 308–316, Dec. 2007. 
[95] E. Dragut, F. Fang, P. Sistla, C. Yu, and W. Meng, “Stop Word and Related 
Problems in Web Interface Integration,” Proc. VLDB Endow., vol. 2, no. 1, pp. 349–
360, Aug. 2009. 
[96] I. Ounis, “Automatically Building a Stopword List for an Information Retrieval 
System,” Journal of Digital Information Management: Special Issue on the 5th 
Dutch-Belgian Information Retrieval Workshop (DIR’05), 2005. 
[97] H. Saif, M. Fernández, and H. Alani, “Automatic stopword generation using 
contextual semantics for sentiment analysis of Twitter,” in CEUR Workshop 
Proceedings, Riva del Garda, Trentino, Italy, 2014, vol. 1272. 
[98] M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–
137, Mar. 1980. 
[99] C. Moral, A. de Antonio, R. Imbert, and J. Ramírez, “A Survey of Stemming 
Algorithms in Information Retrieval,” Information Research: An International 
Electronic Journal, vol. 19, no. 1, Mar. 2014. 
[100] Y. Zhang, R. Jin, and Z.-H. Zhou, “Understanding bag-of-words model: a statistical 
framework,” Int. J. Mach. Learn. & Cyber., vol. 1, no. 1–4, pp. 43–52, Dec. 2010. 
[101] G. Salton and C. Buckley, “Term-weighting approaches in automatic text 
retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, Jan. 
1988. 
[102] A. Mazyad, F. Teytaud, and C. Fonlupt, “A Comparative Study on Term Weighting 
Schemes for Text Classification,” in Machine Learning, Optimization, and Big Data, 
2017, pp. 100–108. 
[103] D. Mabrouk, S. Rady, N. Badr, and M. E. Khalifa, “A survey on information 
retrieval systems’ modeling using term dependencies and term weighting,” in 2017 
Eighth International Conference on Intelligent Computing and Information Systems 
(ICICIS), 2017, pp. 321–328. 
[104] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, 
“Natural Language Processing (Almost) from Scratch,” Journal of Machine Learning 
Research, vol. 12, no. Aug, pp. 2493–2537, 2011. 
[105] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word 
Representations in Vector Space,” arXiv:1301.3781 [cs], Jan. 2013. 
[106] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed 
Representations of Words and Phrases and their Compositionality,” in Advances in 
Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. 
Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–
3119. 
[107] G. Zhou, T. He, J. Zhao, and P. Hu, “Learning Continuous Word Embedding with 
Metadata for Question Retrieval in Community Question Answering,” in Proceedings 
of the 53rd Annual Meeting of the Association for Computational Linguistics and the 
7th International Joint Conference on Natural Language Processing (Volume 1: Long 
Papers), Beijing, China, 2015, pp. 250–259. 
[108] O. Levy and Y. Goldberg, “Neural Word Embedding as Implicit Matrix 
Factorization,” in Advances in Neural Information Processing Systems 27, Z. 
Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. 
Curran Associates, Inc., 2014, pp. 2177–2185. 
[109] “Google Code Archive - Long-term storage for Google Code Project Hosting.” 
[Online]. Available: https://code.google.com/archive/p/word2vec/. [Accessed: 09-
May-2018]. 
[110] C. C. Aggarwal and C. Zhai, “A Survey of Text Classification Algorithms,” in 
Mining Text Data, Springer, Boston, MA, 2012, pp. 163–222. 
[111] F. Colas and P. Brazdil, “Comparison of SVM and Some Older Classification 
Algorithms in Text Classification Tasks,” in Artificial Intelligence in Theory and 
Practice, 2006, pp. 169–178. 
[112] C. Cortes and V. Vapnik, “Support-vector networks,” Mach Learn, vol. 20, no. 3, 
pp. 273–297, Sep. 1995. 
[113] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” p. 
9. 
[114] B. Karlik and A. V. Olgac, “Performance Analysis of Various Activation Functions 
in Generalized MLP Architectures of Neural Networks,” p. 12. 
[115] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, 
pp. 436–444, May 2015. 
[116] G. Bebis and M. Georgiopoulos, “Feed-forward neural networks,” IEEE Potentials, 
vol. 13, no. 4, pp. 27–31, Oct. 1994. 
[117] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. 
[118] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied 
to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 
Nov. 1998. 
[119] R. Johnson and T. Zhang, “Effective Use of Word Order for Text Categorization 
with Convolutional Neural Networks,” arXiv:1412.1058 [cs, stat], Dec. 2014. 
[120] R. Johnson and T. Zhang, “Semi-supervised Convolutional Neural Networks for 
Text Categorization via Region Embedding,” in Advances in Neural Information 
Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. 
Garnett, Eds. Curran Associates, Inc., 2015, pp. 919–927. 
[121] Y. Chen, “Convolutional Neural Network for Sentence Classification,” Aug. 2015. 
[122] A. Hassan and A. Mahmood, “Deep learning for sentence classification,” in 2017 
IEEE Long Island Systems, Applications and Technology Conference (LISAT), 2017, 
pp. 1–5. 
[123] X. Zhang, J. Zhao, and Y. LeCun, “Character-level Convolutional Networks for 
Text Classification,” in Advances in Neural Information Processing Systems 28, C. 
Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran 
Associates, Inc., 2015, pp. 649–657. 
[124] C. dos Santos and M. Gatti, “Deep Convolutional Neural Networks for Sentiment 
Analysis of Short Texts,” in Proceedings of COLING 2014, the 25th International 
Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, 2014, 
pp. 69–78. 
[125] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural Networks for 
Text Classification,” in Proceedings of the Twenty-Ninth AAAI Conference on 
Artificial Intelligence, Austin, Texas, 2015, pp. 2267–2273. 
[126] D. Tang, B. Qin, and T. Liu, “Document Modeling with Gated Recurrent Neural 
Network for Sentiment Classification,” in Proceedings of the 2015 Conference on 
Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 1422–
1432. 
[127] X. Glorot, A. Bordes, and Y. Bengio, “Domain Adaptation for Large-Scale 
Sentiment Classification: A Deep Learning Approach,” p. 8. 
[128] M.-L. Zhang and Z.-H. Zhou, “Multilabel Neural Networks with Applications to 
Functional Genomics and Text Categorization,” IEEE Transactions on Knowledge and 
Data Engineering, vol. 18, no. 10, pp. 1338–1351, Oct. 2006. 
[129] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III, “Deep Unordered 
Composition Rivals Syntactic Methods for Text Classification,” in Proceedings of the 
53rd Annual Meeting of the Association for Computational Linguistics and the 7th 
International Joint Conference on Natural Language Processing (Volume 1: Long 
Papers), Beijing, China, 2015, pp. 1681–1691. 
[130] J. Wu, “Introduction to Convolutional Neural Networks,” p. 31. 
[131] C.-C. J. Kuo, “Understanding convolutional neural networks with a mathematical 
model,” Journal of Visual Communication and Image Representation, vol. 41, pp. 406–
413, Nov. 2016. 
[132] D. Scherer, A. Müller, and S. Behnke, “Evaluation of Pooling Operations in 
Convolutional Architectures for Object Recognition,” in Artificial Neural Networks – 
ICANN 2010, 2010, pp. 92–101. 
[133] S. Albelwi and A. Mahmood, “A Framework for Designing the Architectures of 
Deep Convolutional Neural Networks,” Entropy, vol. 19, no. 6, p. 242, May 2017. 
[134] “Natural Language Toolkit — NLTK 3.3 documentation.” [Online]. Available: 
https://www.nltk.org/. [Accessed: 09-May-2018]. 
[135] “gensim: topic modelling for humans.” [Online]. Available: 
https://radimrehurek.com/gensim/models/word2vec.html. [Accessed: 09-May-2018]. 
[136] “Pattern Recognition and Machine Learning.” [Online]. Available: 
https://www.spiedigitallibrary.org/journals/Journal-of-Electronic-Imaging/volume-
16/issue-4/049901/Pattern-Recognition-and-Machine-
Learning/10.1117/1.2819119.short?SSO=1. [Accessed: 09-May-2018]. 
[137] M. A. Harrison, Introduction to Formal Language Theory, 1st ed. Boston, MA, 
USA: Addison-Wesley Longman Publishing Co., Inc., 1978. 
[138] T. Jiang, M. Li, B. Ravikumar, and K. W. Regan, “Algorithms and Theory of 
Computation Handbook,” M. J. Atallah and M. Blanton, Eds. Chapman & Hall/CRC, 
2010, pp. 20–20. 
[139] P. J. Haas, Stochastic Petri Nets: Modelling, Stability, Simulation. Springer Science 
& Business Media, 2006. 
[140] P. Kitsos, G. Kostopoulos, N. Sklavos, and O. Koufopavlou, “Hardware 
implementation of the RC4 stream cipher,” 2003, vol. 3, pp. 1363–1366. 
[141] G. Rematska and N. G. Bourbakis, “A survey on reverse engineering of technical 
diagrams,” 2016, pp. 1–8. 
[142] N. Bourbakis, “Converting Diagrams, Symbols, Formulas, Tables and Graphics into 
SPN and NL Text sentences for Automatic Deep Understanding of Technical 
Documents,” in Proc. Int. IEEE Conference on ICTAI, Boston, MA, Nov. 2017. 
 
 
 
APPENDIX A 
Each word in the parse tree carries a tag that reflects its syntactic and grammatical role. 
The Part of Speech (POS) tags are listed in Table 1 and the chunk tags in Table 2.
Table 1. Part of Speech (POS) tags
Tag     Description
CC      conjunction, coordinating
CD      cardinal number
DT      determiner
EX      existential there
FW      foreign word
IN      conjunction, subordinating or preposition
JJ      adjective
JJR     adjective, comparative
JJS     adjective, superlative
LS      list item marker
MD      verb, modal auxiliary
NN      noun, singular or mass
NNS     noun, plural
NNP     noun, proper singular
NNPS    noun, proper plural
PDT     predeterminer
POS     possessive ending
PRP     pronoun, personal
PRP$    pronoun, possessive
RB      adverb
RBR     adverb, comparative
RBS     adverb, superlative
RP      adverb, particle
SYM     symbol
TO      infinitival to
UH      interjection
VB      verb, base form
VBZ     verb, 3rd person singular present
VBP     verb, non-3rd person singular present
VBD     verb, past tense
VBN     verb, past participle
VBG     verb, gerund or present participle
WDT     wh-determiner
WP      wh-pronoun, personal
WP$     wh-pronoun, possessive
WRB     wh-adverb
.       punctuation mark, sentence closer
,       punctuation mark, comma
:       punctuation mark, colon
(       contextual separator, left paren
)       contextual separator, right paren
 
Table 2. Chunk tags
Tag     Description
NP      noun phrase
PP      prepositional phrase
VP      verb phrase
ADVP    adverb phrase
ADJP    adjective phrase
SBAR    subordinating conjunction
PRT     particle
INTJ    interjection
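
As a reading aid for Table 1 and Table 2, the minimal Python sketch below shows how POS tags and chunk tags might be attached to the words of one sentence. The sentence and its annotations are hypothetical and written by hand for illustration; they are not output of the parser used in this work. The names POS_TAGS, CHUNK_TAGS, SENTENCE, describe and group_chunks are assumptions introduced only for this example.

```python
from itertools import groupby

# Subset of the POS tags (Table 1) and chunk tags (Table 2).
POS_TAGS = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "IN": "conjunction, subordinating or preposition",
    ".": "punctuation mark, sentence closer",
}
CHUNK_TAGS = {
    "NP": "noun phrase",
    "VP": "verb phrase",
    "PP": "prepositional phrase",
}

# Hypothetical hand-annotated sentence: (word, POS tag, chunk tag or None).
SENTENCE = [
    ("The", "DT", "NP"),
    ("register", "NN", "NP"),
    ("stores", "VBZ", "VP"),
    ("the", "DT", "NP"),
    ("output", "NN", "NP"),
    ("of", "IN", "PP"),
    ("the", "DT", "NP"),
    ("adder", "NN", "NP"),
    (".", ".", None),  # punctuation belongs to no chunk
]


def describe(tokens):
    """Return (word, POS description) pairs for each token."""
    return [(word, POS_TAGS[pos]) for word, pos, _ in tokens]


def group_chunks(tokens):
    """Merge consecutive tokens sharing a chunk tag into (tag, phrase) pairs.

    Simplification: two neighbouring chunks of the same type would be merged;
    a begin/inside marker per token would be needed to keep them apart.
    """
    chunks = []
    for tag, group in groupby(tokens, key=lambda t: t[2]):
        if tag is not None:
            chunks.append((tag, " ".join(word for word, _, _ in group)))
    return chunks


if __name__ == "__main__":
    for tag, phrase in group_chunks(SENTENCE):
        print(f"{tag:4} ({CHUNK_TAGS[tag]}): {phrase}")
```

Grouping by chunk tag yields the phrase-level units (noun, verb and prepositional phrases), while the POS tags stay attached to the individual words.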
 
