The dcc Decompiler
The dcc decompiler was developed by Cristina Cifuentes while a PhD student at the Queensland University of Technology (QUT), Australia, from 1991 to 1994, under the supervision of Professor John Gough. Mike Van Emmerik developed the library signature recognition algorithms while employed by QUT. The dcc distribution is made available under the GPL.
Notice
Decompilation is a technique that allows you to recover lost source code. It is also needed in some cases for computer security, interoperability and error correction. dcc, and any decompiler in general, should not be used for "cracking" other programs, as programs are protected by copyright. Cracking programs is not only illegal but also rides on others' creative effort.
dcc
The dcc decompiler decompiles .exe files from the (i386, DOS) platform to C programs. The final C program contains assembler code for any subroutines that cannot be decompiled to a higher level than assembler.
The analysis performed by dcc is based on traditional compiler optimization techniques and graph theory. The former eliminate registers and intermediate instructions to reconstruct high-level statements; the latter determines the control structures in each subroutine.
Please note that at present, only C source is produced; dcc cannot (as yet) produce C++ source.
The structure of a decompiler resembles that of a compiler: a front-, middle-, and back-end which perform separate tasks. The front-end is a machine-language dependent module that reads in machine code for a particular machine and transforms it into an intermediate, machine-independent representation of the program. The middle-end (aka the Universal Decompiling Machine or UDM) is a machine and language independent module that performs the core of the decompiling analysis: data flow and control flow analysis. Finally, the back-end is high-level language dependent and generates code for the program (C in the case of dcc).
In practice, several programs are used with the decompiler to create the high-level program. These programs aid in the detection of compiler and library signatures, hence augmenting the readability of programs and eliminating compiler start-up and library routines from the decompilation analysis.
Example of Decompilation
We illustrate the decompilation of a Fibonacci program (see Figure 4). Figure 1 illustrates the relevant machine code of this binary. No library or compiler start-up code is included. Figure 2 presents the disassembly of the binary program. All calls to library routines were detected by dccSign (the signature matcher), and thus not included in the analysis. Figure 3 is the final output from dcc. This C program can be compared with the original C program in Figure 4.

55 8B EC 83 EC 04 56 57 1E B8 94 00 50 9A 0E 00
3C 17 59 59 16 8D 46 FC 50 1E B8 B1 00 50 9A 07
00 F0 17 83 C4 08 BE 01 00 EB 3B 1E B8 B4 00 50
9A 0E 00 3C 17 59 59 16 8D 46 FE 50 1E B8 C3 00
50 9A 07 00 F0 17 83 C4 08 FF 76 FE 9A 7C 00 3B
16 59 8B F8 57 FF 76 FE 1E B8 C6 00 50 9A 0E 00
3C 17 83 C4 08 46 3B 76 FC 7E C0 33 C0 50 9A 0A
00 49 16 59 5F 5E 8B E5 5D CB 55 8B EC 56 8B 76
06 83 FE 02 7E 1E 8B C6 48 50 0E E8 EC FF 59 50
8B C6 05 FE FF 50 0E E8 E0 FF 59 8B D0 58 03 C2
EB 07 EB 05 B8 01 00 EB 00 5E 5D CB
Figure 1 - Machine Code for Fibonacci.exe
                proc_1  PROC  FAR
000 00053C 55                   PUSH  bp
001 00053D 8BEC                 MOV   bp, sp
002 00053F 56                   PUSH  si
003 000540 8B7606               MOV   si, [bp+6]
004 000543 83FE02               CMP   si, 2
005 000546 7E1E                 JLE   L1
006 000548 8BC6                 MOV   ax, si
007 00054A 48                   DEC   ax
008 00054B 50                   PUSH  ax
009 00054C 0E                   PUSH  cs
010 00054D E8ECFF               CALL  near ptr proc_1
011 000550 59                   POP   cx
012 000551 50                   PUSH  ax
013 000552 8BC6                 MOV   ax, si
014 000554 05FEFF               ADD   ax, 0FFFEh
015 000557 50                   PUSH  ax
016 000558 0E                   PUSH  cs
017 000559 E8E0FF               CALL  near ptr proc_1
018 00055C 59                   POP   cx
019 00055D 8BD0                 MOV   dx, ax
020 00055F 58                   POP   ax
021 000560 03C2                 ADD   ax, dx
023 00056B 5E              L2:  POP   si
024 00056C 5D                   POP   bp
025 00056D CB                   RETF
026 000566 B80100          L1:  MOV   ax, 1
027 000569 EB00                 JMP   L2
                proc_1  ENDP

                main  PROC  FAR
000 0004C2 55                   PUSH  bp
001 0004C3 8BEC                 MOV   bp, sp
002 0004C5 83EC04               SUB   sp, 4
003 0004C8 56                   PUSH  si
004 0004C9 57                   PUSH  di
005 0004CA 1E                   PUSH  ds
006 0004CB B89400               MOV   ax, 94h
007 0004CE 50                   PUSH  ax
008 0004CF 9A0E004D01           CALL  far ptr printf
009 0004D4 59                   POP   cx
010 0004D5 59                   POP   cx
011 0004D6 16                   PUSH  ss
012 0004D7 8D46FC               LEA   ax, [bp-4]
013 0004DA 50                   PUSH  ax
014 0004DB 1E                   PUSH  ds
015 0004DC B8B100               MOV   ax, 0B1h
016 0004DF 50                   PUSH  ax
017 0004E0 9A07000102           CALL  far ptr scanf
018 0004E5 83C408               ADD   sp, 8
019 0004E8 BE0100               MOV   si, 1
021 000528 3B76FC          L3:  CMP   si, [bp-4]
022 00052B 7EC0                 JLE   L4
023 00052D 33C0                 XOR   ax, ax
024 00052F 50                   PUSH  ax
025 000530 9A0A005A00           CALL  far ptr exit
026 000535 59                   POP   cx
027 000536 5F                   POP   di
028 000537 5E                   POP   si
029 000538 8BE5                 MOV   sp, bp
030 00053A 5D                   POP   bp
031 00053B CB                   RETF
032 0004ED 1E              L4:  PUSH  ds
033 0004EE B8B400               MOV   ax, 0B4h
034 0004F1 50                   PUSH  ax
035 0004F2 9A0E004D01           CALL  far ptr printf
036 0004F7 59                   POP   cx
037 0004F8 59                   POP   cx
038 0004F9 16                   PUSH  ss
039 0004FA 8D46FE               LEA   ax, [bp-2]
040 0004FD 50                   PUSH  ax
041 0004FE 1E                   PUSH  ds
042 0004FF B8C300               MOV   ax, 0C3h
043 000502 50                   PUSH  ax
044 000503 9A07000102           CALL  far ptr scanf
045 000508 83C408               ADD   sp, 8
046 00050B FF76FE               PUSH  word ptr [bp-2]
047 00050E 9A7C004C00           CALL  far ptr proc_1
048 000513 59                   POP   cx
049 000514 8BF8                 MOV   di, ax
050 000516 57                   PUSH  di
051 000517 FF76FE               PUSH  word ptr [bp-2]
052 00051A 1E                   PUSH  ds
053 00051B B8C600               MOV   ax, 0C6h
054 00051E 50                   PUSH  ax
055 00051F 9A0E004D01           CALL  far ptr printf
056 000524 83C408               ADD   sp, 8
057 000527 46                   INC   si
058                             JMP   L3        ;Synthetic inst
                main  ENDP
Figure 2 - Code produced by the Disassembler
/*
 * Input file : fibo.exe
 * File type : EXE
 */

int proc_1 (int arg0)
/* Takes 2 bytes of parameters.
 * High-level language prologue code.
 * C calling convention.
 */
{
int loc1;
int loc2; /* ax */

    loc1 = arg0;
    if (loc1 > 2) {
        loc2 = (proc_1 ((loc1 - 1)) + proc_1 ((loc1 + 0xFFFE)));
    }
    else {
        loc2 = 1;
    }
    return (loc2);
}

void main ()
/* Takes no parameters.
 * High-level language prologue code.
 */
{
int loc1;
int loc2;
int loc3;
int loc4;

    printf ("Input number of iterations: ");
    scanf ("%d", &loc1);
    loc3 = 1;
    while ((loc3 <= loc1)) {
        printf ("Input number: ");
        scanf ("%d", &loc2);
        loc4 = proc_1 (loc2);
        printf ("fibonacci(%d) = %u\n", loc2, loc4);
        loc3 = (loc3 + 1);
    } /* end of while */
    exit (0);
}
Figure 3 - Code produced by dcc in C
#include <stdio.h>

int main()
{
    int i, numtimes, number;
    unsigned value, fib();

    printf("Input number of iterations: ");
    scanf ("%d", &numtimes);
    for (i = 1; i <= numtimes; i++)
    {
        printf ("Input number: ");
        scanf ("%d", &number);
        value = fib(number);
        printf("fibonacci(%d) = %u\n", number, value);
    }
    exit(0);
}

unsigned fib(x)    /* compute fibonacci number recursively */
int x;
{
    if (x > 2)
        return (fib(x - 1) + fib(x - 2));
    else
        return (1);
}
Figure 4 - Initial C Program
PhD Thesis
C. Cifuentes. Reverse Compilation Techniques, Queensland University of Technology, Department of Computer Science, PhD thesis. July 1994. (474 Kb compressed postscript file). Also available in compressed dvi format (365 Kb).
ABSTRACT
Techniques for writing reverse compilers or decompilers are presented in this thesis. These techniques are based on compiler and optimization theory, and are applied to decompilation in a unique way; these techniques have never before been published.
A decompiler is composed of several phases which are grouped into modules dependent on language or machine features. The front-end is a machine dependent module that parses the binary program, analyzes the semantics of the instructions in the program, and generates an intermediate low-level representation of the program, as well as a control flow graph of each subroutine. The universal decompiling machine is a language and machine independent module that analyzes the low-level intermediate code and transforms it into a high-level representation available in any high-level language, and analyzes the structure of the control flow graph(s) and transforms them into graphs that make use of high-level control structures. Finally, the back-end is a target language dependent module that generates code for the target language.
Decompilation is a process that involves the use of tools to load the binary program into memory, parse or disassemble such a program, and decompile or analyze the program to generate a high-level language program. This process benefits from compiler and library signatures to recognize particular compilers and library subroutines. Whenever a compiler signature is recognized in the binary program, the compiler start-up and library subroutines are not decompiled. In the former case, the routines are eliminated from the final target program and the entry point to the main program is used for the decompiler analysis; in the latter case, the subroutines are replaced by their library name.
The presented techniques were implemented in a prototype decompiler for the Intel i80286 architecture running under the DOS operating system, dcc, which produces target C programs for source .exe or .com files. Sample decompiled programs, comparisons against the initial high-level language program, and an analysis of results are presented in Chapter 9.
Chapter 1 gives an introduction to decompilation from a compiler point of view; Chapter 2 gives an overview of the history of decompilation since its appearance in the early 1960s; Chapter 3 presents the relations between the static binary code of the source binary program and the actions performed at run-time to implement the program; Chapter 4 describes the phases of the front-end module; Chapter 5 defines data optimization techniques to analyze the intermediate code and transform it into a higher-level representation; Chapter 6 defines control structure transformation techniques to analyze the structure of the control flow graph and transform it into a graph of high-level control structures; Chapter 7 describes the back-end module; Chapter 8 presents the decompilation tool programs; Chapter 9 gives an overview of the implementation of dcc and the results obtained; and Chapter 10 gives the conclusions and future work of this research.
The techniques presented in this thesis expand on earlier work described in the literature. Previous work in decompilation did not document the interprocedural register analysis required to determine register arguments and register return values, the analysis required to eliminate stack-related instructions (i.e. push and pop), or the structuring of a generic set of control structures. Innovative work done for this research is described in Chapters 5, 6, and 8. Chapter 5, Sections 5.2 and 5.4 illustrate and describe nine different types of optimizations that transform the low-level intermediate code into a high-level representation. These optimizations take into account condition codes, subroutine calls (i.e. interprocedural analysis) and register spilling, eliminating all low-level features of the intermediate instructions (such as condition codes and registers) and introducing the high-level concept of expressions into the intermediate representation. Chapter 6, Sections 6.2 and 6.6 illustrate and describe algorithms to structure different types of loops and conditionals, including multi-way branch conditionals (e.g. case statements). Previous work in this area has concentrated on the structuring of loops; few papers attempt to structure 2-way conditional branches, and no work on multi-way conditional branches is described in the literature. This thesis presents a complete method for structuring all types of structures based on a predetermined, generic set of high-level control structures. A criterion for determining the generic set of control structures is given in Chapter 6, Section 6.4. Chapter 8 describes all tools used to decompile programs; the most important tool is the signature generator (Section 8.2), which is used to determine compiler and library signatures in architectures that have an operating system that does not share libraries, such as the DOS operating system.
Future Work - A Retargetable Decompiler
A retargetable decompiler engine can be built based on ideas and code from the UQBT project, by reusing the frontend of that framework and writing a new backend that supports the RTL and HRTL intermediate representation of the UQBT system. Please refer to the open source project Boomerang.
dcc Distribution
The dcc source code distribution is made available under the GNU General Public License (GPL).
The dcc distribution is available in gzip tar format for Unix users (dcc.tar.gz and dcc_oo.tar.gz), and as individual .zip files for PC users (see the dcc files pages). Read the readme file for a description of what is included in the distribution and for installation instructions. If you do not have the tar and/or pkunzip programs, contact your system administrator.
There is now a second version of the decompiler; mainly to distinguish it from the first, we'll call it the "OO" version (it has the beginnings of Object Orientation, but there is still much to be done). This version fixes a bug which caused the output to be wrong some of the time (randomly; successive runs would result in different output). The source for dcc has also been converted to C++ (dcc itself still does not produce C++ source), so those users wishing to use a C compiler without C++ facilities will have to stick to the original version. The file dccsrcoo.zip has the source for the later version, and dcc_oo.tar.gz has the whole distribution, with dccsrcoo.zip instead of dccsrc.zip and dcc32.zip instead of dcc.zip. This version has a better chance of working on PC compilers such as Microsoft Visual C++ and Borland C++. There is no longer any use of the curses library; it was found to be too much of a distribution hassle.
The OO version of dcc is the most recent, and has bug fixes that the original does not. For most purposes, the OO version is the one to start working with.
Support
Please note that the authors are not currently working on this project and therefore cannot support any changes required on dcc. Source code is provided "as is". Read the documentation first.
Likewise, please don't email the authors with requests for modifications to dcc, or specific questions about its inner workings. If you do, you will just get a reply with this form letter.
Note
dcc has a fundamental implementation flaw that limits it to about 30KB of input binary program, i.e. it currently handles toy programs only! The problem is that pointers are kept in many places; many of these pointers point to elements of arrays. The arrays are all of variable size; the realloc library function can and will change the virtual addresses of these arrays, thus invalidating the pointers. Because of this, results are unpredictable as soon as one array is resized (although a segmentation fault is likely when this happens). The arrays are sized such that they don't get reallocated for input binaries smaller than about 30KB.
Before any serious work can be done with dcc, this implementation flaw has to be corrected. As noted above, the authors do not have the time to correct this error, or to offer any suggestions as to how to do this.