The dcc Decompiler

The dcc decompiler was developed by Cristina Cifuentes while a PhD student at the Queensland University of Technology (QUT), Australia, 1991-4, under the supervision of Professor John Gough. Mike Van Emmerik developed the library signature recognition algorithms while employed by QUT. The dcc distribution is made available under the GPL. The readme file provides information about the distribution. We do not provide support for this decompiler; if you email, you'll get the standard reply. However, we participate in the Boomerang open source project, which aims at creating a retargetable decompiler based on some of the dcc and UQBT ideas, design, and/or implementations.

Notice
Decompilation is a technique that allows you to recover lost source code. It is also needed in some cases for computer security, interoperability and error correction. dcc, and any decompiler in general, should not be used for "cracking" other programs, as programs are protected by copyright. Cracking programs is not only illegal; it also rides on others' creative effort. See the ethics of decompilation for more information.

dcc

The dcc decompiler decompiles .exe files from the (i386, DOS) platform to C programs. The final C program contains assembler code for any subroutines that cannot be decompiled to a higher level than assembler.

The analysis performed by dcc is based on traditional compiler optimization techniques and graph theory. The former is capable of eliminating registers and intermediate instructions to reconstruct high-level statements; the latter is capable of determining the control structures in each subroutine.
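As a rough illustration of the first idea, the sketch below propagates symbolic register contents through a short straight-line instruction sequence, so that a chain of register moves and additions collapses into one expression. It is a toy and not dcc's actual data flow analysis: the `Insn` type, the opcode names and `fold_registers` are all hypothetical, and a real analysis has to work across a whole control flow graph rather than a single block.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mini intermediate code: three registers, three operations. */
enum Reg { AX, DX, SI, NUM_REGS };
enum Op  { MOV_REG, ADD_IMM, ADD_REG };

typedef struct {
    enum Op  op;
    enum Reg dst;
    enum Reg src;   /* used by MOV_REG and ADD_REG */
    int      imm;   /* used by ADD_IMM */
} Insn;

#define EXPR_MAX 128

/* Walk a straight-line block, keeping for every register the symbolic
 * expression it currently holds. On entry, expr[r] names the initial
 * contents of r; on return, it holds the folded expression. */
void fold_registers(const Insn *code, size_t n, char expr[NUM_REGS][EXPR_MAX])
{
    char tmp[EXPR_MAX];

    for (size_t i = 0; i < n; i++) {
        const Insn *in = &code[i];
        int len = 0;

        switch (in->op) {
        case MOV_REG:   /* dst = src: copy the source expression */
            len = snprintf(tmp, sizeof tmp, "%s", expr[in->src]);
            break;
        case ADD_IMM:   /* dst = dst + constant, printed with its sign */
            len = snprintf(tmp, sizeof tmp, "(%s %c %d)", expr[in->dst],
                           in->imm < 0 ? '-' : '+',
                           in->imm < 0 ? -in->imm : in->imm);
            break;
        case ADD_REG:   /* dst = dst + src */
            len = snprintf(tmp, sizeof tmp, "(%s + %s)",
                           expr[in->dst], expr[in->src]);
            break;
        }
        assert(len >= 0 && (size_t)len < sizeof tmp);  /* no truncation */
        strcpy(expr[in->dst], tmp);
    }
}
```

Starting with `si` holding `arg0`, the sequence `MOV ax, si` followed by `ADD ax, -1` leaves the expression `(arg0 - 1)` in `ax`; the register `ax` itself no longer appears in the result.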

 Please note that at present, only C source is produced; dcc cannot (as yet) produce C++ source.

 The structure of a decompiler resembles that of a compiler: a front-, middle-, and back-end which perform separate tasks. The front-end is a machine-language dependent module that reads in machine code for a particular machine and transforms it into an intermediate, machine-independent representation of the program. The middle-end (aka the Universal Decompiling Machine or UDM) is a machine and language independent module that performs the core of the decompiling analysis: data flow and control flow analysis. Finally, the back-end is high-level language dependent and generates code for the program (C in the case of dcc).

In practice, several programs are used with the decompiler to create the high-level program. These programs aid in the detection of compiler and library signatures, hence augmenting the readability of programs and eliminating compiler start-up and library routines from the decompilation analysis.
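To give a feel for what signature detection involves, here is a minimal sketch of matching the leading bytes of a routine against a table of known patterns, with wildcard positions for bytes that vary from one linked program to another (such as addresses). The `Signature` layout, `SIG_LEN`, and both function names are assumptions made for this example; it uses a plain linear scan and does not reproduce dcc's signature file format or lookup method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIG_LEN 8   /* illustrative length, chosen for this sketch */

typedef struct {
    const char *name;            /* library routine name to report */
    uint8_t     bytes[SIG_LEN];  /* leading bytes of the routine */
    uint8_t     fixed[SIG_LEN];  /* 1 = byte must match, 0 = wildcard */
} Signature;

/* Return 1 if the code at `code` matches `sig`, ignoring wildcard bytes. */
int sig_matches(const Signature *sig, const uint8_t *code, size_t code_len)
{
    assert(sig != NULL && code != NULL);
    if (code_len < SIG_LEN)
        return 0;
    for (size_t i = 0; i < SIG_LEN; i++) {
        if (sig->fixed[i] && sig->bytes[i] != code[i])
            return 0;
    }
    return 1;
}

/* Linear search over a signature table; returns the routine name or NULL. */
const char *identify_routine(const Signature *table, size_t n,
                             const uint8_t *code, size_t code_len)
{
    for (size_t i = 0; i < n; i++) {
        if (sig_matches(&table[i], code, code_len))
            return table[i].name;
    }
    return NULL;
}
```

A routine recognized this way can be emitted as a call by name and left out of the decompilation analysis, which is the effect described above.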

Example of Decompilation

We illustrate the decompilation of a Fibonacci program (see Figure 4). Figure 1 shows the relevant machine code of this binary. No library or compiler start-up code is included. Figure 2 presents the disassembly of the binary program. All calls to library routines were detected by dccSign (the signature matcher), and thus not included in the analysis. Figure 3 is the final output from dcc. This C program can be compared with the original C program in Figure 4.
         55 8B EC 83 EC 04 56 57 1E B8 94 00 50 9A 
   0E 00 3C 17 59 59 16 8D 46 FC 50 1E B8 B1 00 50 
   9A 07 00 F0 17 83 C4 08 BE 01 00 EB 3B 1E B8 B4
   00 50 9A 0E 00 3C 17 59 59 16 8D 46 FE 50 1E B8
   C3 00 50 9A 07 00 F0 17 83 C4 08 FF 76 FE 9A 7C
   00 3B 16 59 8B F8 57 FF 76 FE 1E B8 C6 00 50 9A
   0E 00 3C 17 83 C4 08 46 3B 76 FC 7E C0 33 C0 50
   9A 0A 00 49 16 59 5F 5E 8B E5 5D CB 55 8B EC 56
   8B 76 06 83 FE 02 7E 1E 8B C6 48 50 0E E8 EC FF
   59 50 8B C6 05 FE FF 50 0E E8 E0 FF 59 8B D0 58
   03 C2 EB 07 EB 05 B8 01 00 EB 00 5E 5D CB
Figure 1 - Machine Code for Fibonacci.exe
                proc_1  PROC  FAR                        
000 00053C 55                  PUSH           bp         
001 00053D 8BEC                MOV            bp, sp      
002 00053F 56                  PUSH           si          
003 000540 8B7606              MOV            si, [bp+6]  
004 000543 83FE02              CMP            si, 2       
005 000546 7E1E                JLE            L1          
006 000548 8BC6                MOV            ax, si      
007 00054A 48                  DEC            ax          
008 00054B 50                  PUSH           ax          
009 00054C 0E                  PUSH           cs          
010 00054D E8ECFF              CALL  near ptr proc_1      
011 000550 59                  POP            cx          
012 000551 50                  PUSH           ax          
013 000552 8BC6                MOV            ax, si      
014 000554 05FEFF              ADD            ax, 0FFFEh  
015 000557 50                  PUSH           ax          
016 000558 0E                  PUSH           cs          
017 000559 E8E0FF              CALL  near ptr proc_1      
018 00055C 59                  POP            cx          
019 00055D 8BD0                MOV            dx, ax      
020 00055F 58                  POP            ax          
021 000560 03C2                ADD            ax, dx      
023 00056B 5E             L2:  POP            si          
024 00056C 5D                  POP            bp          
025 00056D CB                  RETF                       
026 000566 B80100         L1:  MOV            ax, 1       
027 000569 EB00                JMP            L2          
                proc_1  ENDP                              
                                                          
                main  PROC  FAR                           
000 0004C2 55                  PUSH           bp          
001 0004C3 8BEC                MOV            bp, sp      
002 0004C5 83EC04              SUB            sp, 4       
003 0004C8 56                  PUSH           si          
004 0004C9 57                  PUSH           di          
005 0004CA 1E                  PUSH           ds          
006 0004CB B89400              MOV            ax, 94h     
007 0004CE 50                  PUSH           ax          
008 0004CF 9A0E004D01          CALL   far ptr printf      
009 0004D4 59                  POP            cx          
010 0004D5 59                  POP            cx          
011 0004D6 16                  PUSH           ss          
012 0004D7 8D46FC              LEA            ax, [bp-4]  
013 0004DA 50                  PUSH           ax          
014 0004DB 1E                  PUSH           ds          
015 0004DC B8B100              MOV            ax, 0B1h    
016 0004DF 50                  PUSH           ax          
017 0004E0 9A07000102          CALL   far ptr scanf       
018 0004E5 83C408              ADD            sp, 8       
019 0004E8 BE0100              MOV            si, 1       
021 000528 3B76FC         L3:  CMP            si, [bp-4]  
022 00052B 7EC0                JLE            L4          
023 00052D 33C0                XOR            ax, ax      
024 00052F 50                  PUSH           ax          
025 000530 9A0A005A00          CALL   far ptr exit        
026 000535 59                  POP            cx          
027 000536 5F                  POP            di          
028 000537 5E                  POP            si          
029 000538 8BE5                MOV            sp, bp      
030 00053A 5D                  POP            bp          
031 00053B CB                  RETF                       
032 0004ED 1E             L4:  PUSH           ds          
033 0004EE B8B400              MOV            ax, 0B4h    
034 0004F1 50                  PUSH           ax          
035 0004F2 9A0E004D01          CALL   far ptr printf      
036 0004F7 59                  POP            cx          
037 0004F8 59                  POP            cx          
038 0004F9 16                  PUSH           ss          
039 0004FA 8D46FE              LEA            ax, [bp-2]  
040 0004FD 50                  PUSH           ax          
041 0004FE 1E                  PUSH           ds          
042 0004FF B8C300              MOV            ax, 0C3h    
043 000502 50                  PUSH           ax          
044 000503 9A07000102          CALL   far ptr scanf       
045 000508 83C408              ADD            sp, 8       
046 00050B FF76FE              PUSH  word ptr [bp-2]      
047 00050E 9A7C004C00          CALL   far ptr proc_1      
048 000513 59                  POP            cx          
049 000514 8BF8                MOV            di, ax      
050 000516 57                  PUSH           di          
051 000517 FF76FE              PUSH  word ptr [bp-2]      
052 00051A 1E                  PUSH           ds          
053 00051B B8C600              MOV            ax, 0C6h    
054 00051E 50                  PUSH           ax          
055 00051F 9A0E004D01          CALL   far ptr printf      
056 000524 83C408              ADD            sp, 8       
057 000527 46                  INC            si          
058                            JMP            L3         ;Synthetic inst 
                main  ENDP
Figure 2 - Code produced by the Disassembler
/*                                                            
 * Input file   : fibo.exe                                    
 * File type    : EXE                                         
 */                                                           
                                                              
int proc_1 (int arg0)                                         
/* Takes 2 bytes of parameters.                               
 * High-level language prologue code.                         
 * C calling convention.                                      
 */                                                           
{                                                             
int loc1;                                                     
int loc2; /* ax */                                            
                                                              
    loc1 = arg0;                                              
    if (loc1 > 2) {                                           
        loc2 = (proc_1 ((loc1 - 1)) + proc_1 ((loc1 + 0xFFFE)));  
    }                                                         
    else {                                                    
        loc2 = 1;                                             
    }                                                         
    return (loc2);                                            
}                                                             
                                                              
                                                              
void main ()                                                  
/* Takes no parameters.                                       
 * High-level language prologue code.                         
 */                                                           
{                                                             
int loc1;                                                     
int loc2;                                                    
int loc3;                                                     
int loc4;                                                     
                                                              
    printf ("Input number of iterations: ");                  
    scanf ("%d", &loc1);                                      
    loc3 = 1;                                                 
    while ((loc3 <= loc1)) {                                  
        printf ("Input number: ");                            
        scanf ("%d", &loc2);                                  
        loc4 = proc_1 (loc2);                                 
        printf ("fibonacci(%d) = %u\n", loc2, loc4);          
        loc3 = (loc3 + 1);                                    
    } /* end of while */                                      
    exit (0);                                                 
}
Figure 3 - Code produced by dcc in C
#include <stdio.h>                                            
                                                              
int main()                                                    
{ int i, numtimes, number;                                    
  unsigned value, fib();                                      
                                                              
   printf("Input number of iterations: ");                    
   scanf ("%d", &numtimes);                                   
   for (i = 1; i <= numtimes; i++)                            
   {                                                          
      printf ("Input number: ");                              
      scanf ("%d", &number);                                  
      value = fib(number);                                    
      printf("fibonacci(%d) = %u\n", number, value);          
   }                                                          
   exit(0);                                                   
}                                                             
                                                              
unsigned fib(x)                 /* compute fibonacci number recursively */
int x;                                                        
{                                                             
   if (x > 2)                                                 
      return (fib(x - 1) + fib(x - 2));                       
   else                                                       
      return (1);                                             
}
Figure 4 - Initial C Program
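
One difference between Figures 3 and 4 deserves a comment: dcc prints `loc1 + 0xFFFE` where the original program has `x - 2`. The two agree because the code in Figure 2 works on 16-bit registers, where adding 0xFFFE wraps around to the same result as subtracting 2. The short check below demonstrates this with fixed-width types. The function names are invented for illustration, and the code models the 16-bit `ADD ax, 0FFFEh` instruction rather than anything inside dcc.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the 16-bit instruction ADD ax, 0FFFEh: the sum wraps modulo 2^16. */
uint16_t add_fffe_16bit(uint16_t ax)
{
    return (uint16_t)(ax + 0xFFFEu);
}

/* The source-level expression x - 2, reduced to the same 16-bit width. */
uint16_t sub_2_16bit(uint16_t x)
{
    return (uint16_t)(x - 2u);
}

/* Compare the two forms over every possible 16-bit value. */
void check_all_16bit_values(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; v++)
        assert(add_fffe_16bit((uint16_t)v) == sub_2_16bit((uint16_t)v));
}
```

On a compiler where `int` is wider than 16 bits, the expression exactly as printed in Figure 3 would not wrap, so the decompiled output there is best read with 16-bit `int` in mind.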

PhD Thesis

C. Cifuentes. Reverse Compilation Techniques, Queensland University of Technology, Department of Computer Science, PhD thesis. July 1994. (474 Kb compressed postscript file). Also available in compressed dvi format (365 Kb).

ABSTRACT

Techniques for writing reverse compilers or decompilers are presented in this thesis. These techniques are based on compiler and optimization theory, and are applied to decompilation in a unique way; these techniques have never before been published.

A decompiler is composed of several phases which are grouped into modules dependent on language or machine features. The front-end is a machine dependent module that parses the binary program, analyzes the semantics of the instructions in the program, and generates an intermediate low-level representation of the program, as well as a control flow graph of each subroutine. The universal decompiling machine is a language and machine independent module that analyzes the low-level intermediate code and transforms it into a high-level representation available in any high-level language, and analyzes the structure of the control flow graph(s) and transforms them into graphs that make use of high-level control structures. Finally, the back-end is a target language dependent module that generates code for the target language.

Decompilation is a process that involves the use of tools to load the binary program into memory, parse or disassemble such a program, and decompile or analyze the program to generate a high-level language program. This process benefits from compiler and library signatures to recognize particular compilers and library subroutines. Whenever a compiler signature is recognized in the binary program, the compiler start-up and library subroutines are not decompiled: in the former case, the routines are eliminated from the final target program and the entry point to the main program is used for the decompiler analysis; in the latter case, the subroutines are replaced by their library name.

The presented techniques were implemented in a prototype decompiler for the Intel i80286 architecture running under the DOS operating system, dcc, which produces target C programs for source .exe or .com files. Sample decompiled programs, comparisons against the initial high-level language program, and an analysis of results are presented in Chapter 9.

Chapter 1 gives an introduction to decompilation from a compiler point of view, Chapter 2 gives an overview of the history of decompilation since its appearance in the early 1960s, Chapter 3 presents the relations between the static binary code of the source binary program and the actions performed at run-time to implement the program, Chapter 4 describes the phases of the front-end module, Chapter 5 defines data optimization techniques to analyze the intermediate code and transform it into a higher-level representation, Chapter 6 defines control structure transformation techniques to analyze the structure of the control flow graph and transform it into a graph of high-level control structures, Chapter 7 describes the back-end module, Chapter 8 presents the decompilation tool programs, Chapter 9 gives an overview of the implementation of dcc and the results obtained, and Chapter 10 gives the conclusions and future work of this research.

The techniques presented in this thesis expand on earlier work described in the literature. Previous work in decompilation did not document the interprocedural register analysis required to determine register arguments and register return values, the analysis required to eliminate stack-related instructions (i.e. push and pop), or the structuring of a generic set of control structures. Innovative work done for this research is described in Chapters 5, 6, and 8. Chapter 5, Sections 5.2 and 5.4 illustrate and describe nine different types of optimizations that transform the low-level intermediate code into a high-level representation. These optimizations take into account condition codes, subroutine calls (i.e. interprocedural analysis) and register spilling, eliminating all low-level features of the intermediate instructions (such as condition codes and registers) and introducing the high-level concept of expressions into the intermediate representation. Chapter 6, Sections 6.2 and 6.6 illustrate and describe algorithms to structure different types of loops and conditionals, including multi-way branch conditionals (e.g. case statements). Previous work in this area has concentrated on the structuring of loops; few papers attempt to structure 2-way conditional branches, and no work on multi-way conditional branches is described in the literature. This thesis presents a complete method for structuring all types of structures based on a predetermined, generic set of high-level control structures. A criterion for determining the generic set of control structures is given in Chapter 6, Section 6.4. Chapter 8 describes all tools used to decompile programs; the most important tool is the signature generator (Section 8.2), which is used to determine compiler and library signatures in architectures that have an operating system that does not share libraries, such as the DOS operating system.

Future Work - A Retargetable Decompiler

A retargetable decompiler engine can be built based on ideas and code from the UQBT project, by reusing the front-end of that framework and writing a new back-end that supports the RTL and HRTL intermediate representations of the UQBT system. Please refer to the open source project Boomerang.

dcc Distribution

The dcc source code distribution is made available under the GNU General Public License (GPL).

The dcc distribution is available in gzip tar format for Unix users (dcc.tar.gz and dcc_oo.tar.gz), and as individual .zip files for PC users (see the dcc files page). Read the readme file for a description of what is included in the distribution and for installation instructions. If you do not have the tar and/or pkunzip programs, contact your system administrator.

There is now a second version of the decompiler; mainly to distinguish it from the first, we'll call it the "OO" version (it has the beginnings of Object Orientation, but there is still much to be done). This version fixes a bug which caused the output to be wrong some of the time (randomly; successive runs would result in different output). The dcc source has also been converted to C++ (dcc itself still does not produce C++ source), so those users wishing to use a C compiler without C++ facilities will have to stick to the original version. The file dccsrcoo.zip has the source for the later version, and dcc_oo.tar.gz has the whole distribution, with dccsrcoo.zip instead of dccsrc.zip and dcc32.zip instead of dcc.zip. This version has a better chance of working on PC compilers such as Microsoft Visual C++ and Borland C++. There is no longer any use of the curses library; it was found to be too much of a distribution hassle.

 The OO version of dcc is the most recent, and has bug fixes that the original does not. For most purposes, the OO version is the one to start working with.

Support
Please note that the authors are not currently working on this project and therefore cannot support any changes required on dcc. Source code is provided "as is". Read the documentation first.

Likewise, please don't email the authors with requests for modifications to dcc, or specific questions about its inner workings. If you do, you will just get a reply with this form letter.

Note
dcc has a fundamental implementation flaw that limits it to about 30KB of input binary program, i.e. it currently handles toy programs only! The problem is that pointers are kept in many places, and many of these pointers point to elements of arrays. The arrays are all of variable size; the realloc library function can and will change the virtual addresses of these arrays, thus invalidating the pointers. Because of this, results are unpredictable as soon as one array is resized (although a segmentation fault is likely when this happens). The arrays are sized such that they don't get reallocated for input binaries less than about 30KB.

Before any serious work can be done with dcc, this implementation flaw has to be corrected. As noted above, the authors do not have the time to correct this error, or to offer any suggestions as to how to do this.
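
The hazard described in the note can be shown in a few lines. The sketch below is not dcc code and is not the authors' fix; `IntVec` and its functions are hypothetical. It shows a growable array whose elements are referred to by index rather than by pointer, which is one common way to keep references valid when `realloc` moves the storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array, standing in for a variable-size table. */
typedef struct {
    int    *data;
    size_t  len;
    size_t  cap;
} IntVec;

/* Append a value and return its index, or (size_t)-1 on allocation failure.
 * realloc may move the block, so any int* into v->data obtained before
 * this call must be treated as invalid afterwards. */
size_t vec_push(IntVec *v, int value)
{
    assert(v != NULL);
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 4;
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (p == NULL)
            return (size_t)-1;
        v->data = p;
        v->cap = new_cap;
    }
    v->data[v->len] = value;
    return v->len++;
}

/* Resolve an index to a pointer only at the moment of use. */
int *vec_at(IntVec *v, size_t index)
{
    assert(v != NULL && index < v->len);
    return &v->data[index];
}

/* Release the storage and reset the vector to its empty state. */
void vec_free(IntVec *v)
{
    assert(v != NULL);
    free(v->data);
    v->data = NULL;
    v->len = v->cap = 0;
}
```

A caller that kept the result of `vec_at(&v, i)` across a later `vec_push` would be in exactly the situation the note describes. Holding the index `i` and calling `vec_at` again after the array has grown avoids it, at the cost of one extra lookup per access.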