Study Notes of CS:APP (Till Book 3.8 & Lecture 8.1, Regularly Updated)

Computer Systems: A Programmer's Perspective, Third Edition, Pearson, 2016

15-213/18-213: Introduction to Computer Systems (ICS)

Study Notes of the Book CS:APP and its ICS+ Course 15-213

Book & Course Information

Instructors

Randal E. Bryant and David R. O'Hallaron

Textbooks

Randal E. Bryant and David R. O'Hallaron,

Computer Systems: A Programmer's Perspective, Third Edition, Pearson, 2016

Brian W. Kernighan and Dennis M. Ritchie,

The C Programming Language, Second Edition, Prentice Hall, 1988

Home

http://www.cs.cmu.edu/~213

Related Materials

Reading Notes:

嵌入式與Linux那些事's cnblogs posts

北洛's cnblogs posts

FannieGirl's cnblogs posts

頔瀟's CSDN posts

Learning Materials:

CS:APP3e Book Site

15-213/18-213 Fall 2019 (Latest Course Lectured by Randy Bryant, Up-to-date Slides, etc. Available)

SJTU ICS SE101 2019

Compiler Explorer

My Foreword

Content Covered

Currently I have decided to study only the common parts of the five suggested system courses and the 15-213 implementation, as listed in the table below.

Chapter   Phase   Sections
1         1       Overview: 1.1-1.10
2                 2.1-2.3, 2.5
3                 3.1-3.10, 3.12
4
5
6                 6.1-6.4, 6.7
7
8
9                 9.1-9.8, 9.13
10
11
12

First, read a section or sections of the book with the assistance of Eudic and take down the key points. Second, watch the corresponding 15-213 video(s) if available and add the complementary material provided by the lecture slides.

Book errata will be included.

Notes Organization

The structural style of my notes lies between that of a normal article and that of a set of slides.

Notes of "Summary" sections are no more and no less than the original text with topic words highlighted.

Formatting

Follow the style scheme of the global edition as closely as possible, to the degree that formatting does not become a burden on me.

Text size and font have been adjusted for online reading on a browser.

Fixing Post Style

Macros for Blog Post on Word

Reminder to myself: do not apply these to the original Word document!

Sub BeforePublish()
    Call ConvertNumbersToText
    Call PictureResize(110)
    Call AddSmallCapsMarks
    Call AddAllCapsMarks
    Call IndentToBlankSpaces
End Sub

Replacements for HTML on VS Code.

Replace: \$MALL([^$@]*)CAP\$
With: <span style="font-variant: small-caps;">$1</span>

Replace: @LL([^$@]*)C@PS
With: <span style="text-transform: uppercase;">$1</span>

Replace: <p(>(<.*>*)?)
With: <p style="text-indent:2em"$1

Replace: &nbsp;&nbsp;&nbsp;&nbsp;
With: &nbsp;&nbsp;

Replace: (<p style=")margin-left: (\d)0pt;(">(?:<strong>)?<span style=".*)(">(?:<strong>)?&bull;&nbsp;&nbsp;)
With: $1padding-left:$2em;$3margin-left:-1em$4

Alternatively, replacements using a Java program (unstable).

To-do List

Have more useful chapters or sections covered. By "useful" I mean content that may contribute to my personal advancement; I may include parts that are required by some exam or some job.

Refine my notes. The first versions are definitely way too rambling. Also, avoid abusing \r\ns and list items! In particular:

Remove all derivations.

Remove in-line CSS styles.

Fully match labels (such as '<', '>' and '/', etc.) for accuracy.

Course Overview: Topics

Programs and Data

Bits operations, arithmetic, assembly language programs

Representation of C control and data structures

Includes aspects of architecture and compilers

The Memory Hierarchy

Memory technology, memory hierarchy, caches, disks, locality

Includes aspects of architecture and OS

Exceptional Control Flow

Hardware exceptions, processes, process control, Unix signals, nonlocal jumps

Includes aspects of compilers, OS, and architecture

Virtual Memory

Virtual memory, address translation, dynamic storage allocation

Includes aspects of architecture and OS

Networking and Concurrency

High level and low-level I/O, network programming

Internet services, Web servers

Concurrency, concurrent server design, threads

I/O multiplexing with select

Includes aspects of networking, OS, and architecture

Chapter 1 A Tour of Computer Systems

A computer system consists of hardware and systems software that work together to run application programs.

1.1 Information Is Bits + Context

A program begins life as a source program (or source file). The source program is a sequence of bits, each with a value of 0 or 1, organized in bytes (8-bit chunks). Each byte represents some text character in the program.

Most computer systems represent text characters using the ASCII standard that represents each character with a unique byte-size integer value.

Text files: files that consist exclusively of ASCII characters. Binary files: all other files.

A fundamental idea: All information in a system is represented as a bunch of bits. The only thing that distinguishes different data objects is the context in which we view them.

1.2 Programs Are Translated by Other Programs into Different Forms

In order to run a .c program on the system, the individual C statements must be translated by other programs into a sequence of low-level machine-language instructions. These instructions are then packaged in an executable object program (or an executable object file) and stored as a binary disk file.

The compilation system: the programs that perform the four phases (preprocessor, compiler, assembler, and linker).

Preprocessing phase. The preprocessor (cpp) modifies the original C program according to directives that begin with the '#' character. The result is another C program, typically with the .i suffix.

Compilation phase. The compiler (cc1) translates .i into .s, which contains an assembly-language program. Assembly language is useful because it provides a common output language for different compilers for different high-level languages.

Assembly phase. Next, the assembler (as) translates .s into machine language instructions, packages them in a relocatable object program, and stores the result in the object file .o.

Linking phase. The printf function is part of the standard C library provided by every C compiler. The printf function resides in a separate precompiled object file printf.o, which is merged with our .o program by the linker (ld). The result is an executable object file (or simply executable) that is ready to be loaded into memory and executed by the system.

1.3 It Pays to Understand How Compilation Systems Work

Some important reasons:

Optimizing program performance.

Understanding link-time errors.

Avoiding security holes.

Buffer overflow vulnerabilities

1.4 Processors Read and Interpret Instructions Stored in Memory

The shell is a command-line interpreter that prints a prompt, waits for you to type a command line, and then performs the command. If the first word of the command line does not correspond to a built-in shell command, then the shell assumes that it is the name of an executable file that it should load and run.

1.4.1 Hardware Organization of a System

Buses

Buses: a collection of electrical conduits running throughout the system that carry bytes of information back and forth between the components.

Typically designed to transfer words (fixed-size chunks of bytes).

The word size (the number of bytes in a word) is a fundamental system parameter that varies across systems. Most machines: either 4 bytes (32 bits) or 8 bytes (64 bits).

I/O Devices

Input/output (I/O) devices: the system's connection to the external world.

The example system has four: a keyboard, a mouse, a display, and a disk drive (or simply disk).

Each is connected to the I/O bus by either a controller or an adapter.
The distinction: packaging.

Controllers are chip sets in the device itself or on the motherboard (the system's main printed circuit board).

An adapter is a card that plugs into a slot on the motherboard.

The purpose of each: to transfer information back and forth between the I/O bus and an I/O device.

Main Memory

The main memory: a temporary storage device that holds both a program and the data it manipulates while the processor is executing the program.

Physically, consists of a collection of dynamic random access memory (DRAM) chips.

Logically, is organized as a linear array of bytes, each with its own unique address (array index) starting at zero.

In general, each of the machine instructions that constitute a program can consist of a variable number of bytes. The sizes of data items that correspond to C program variables vary according to type.

Processor

The central processing unit (CPU), or simply processor: the engine that executes (interprets) instructions stored in main memory.

The program counter (PC): a register (a word-size storage device) at its core.

At any point in time, points at (contains the address of) some machine-language instruction in main memory.

Repeatedly

executes the instruction pointed at by the PC

updates the PC to point to the next instruction

Appears to operate according to a very simple instruction execution model, defined by its instruction set architecture.

Instructions execute in strict sequence, and executing a single instruction involves performing a series of steps.

The processor

reads the instruction from memory pointed at by the program counter (PC),

interprets the bits in the instruction,

performs some simple operation dictated by the instruction, and then

updates the PC to point to the next instruction, which may or may not be contiguous in memory to the instruction that was just executed.

Operations revolve around

main memory,

the register file, and

the arithmetic/logic unit (ALU).

The register file: a small storage device that consists of a collection of word-size registers, each with its own unique name.

The ALU: computes new data and address values.

Examples of operations:

Load: Copy a byte or a word from main memory into a register, overwriting the previous contents of the register.

Store: Copy a byte or a word from a register to a location in main memory, overwriting the previous contents of that location.

Operate: Copy the contents of two registers to the ALU, perform an arithmetic operation on the two words, and store the result in a register, overwriting the previous contents of that register.

Jump: Extract a word from the instruction itself and copy that word into the PC, overwriting the previous value of the PC.

Distinguish the processor's instruction set architecture,

describing the effect of each machine-code instruction,

from its microarchitecture,

describing how the processor is actually implemented.

1.4.2 Running the hello Program

Using direct memory access (DMA), the data travel directly from disk to main memory, without passing through the processor.

1.5 Caches Matter

Lesson: A system spends a lot of time moving information from one place to another.

The machine instructions in the program

Stored on disk

Copied to main memory

Copied into the processor

The data string

On disk

Copied to main memory

Copied to the display device

Much of this copying is overhead: slows down the "real work" of the program.

A major goal for system designers: to make these copy operations run as fast as possible.

Larger storage devices are slower than smaller ones. And faster devices are more expensive to build than their slower counterparts.

The processor–memory gap: The processor can read data from the register file much faster than from memory.

To deal with it: Cache memories (or simply caches): smaller, faster storage devices that serve as temporary staging areas for information that the processor is likely to need in the near future.

An L1 cache

On the processor chip

Holds many bytes

Can be accessed nearly as fast as the register file

An L2 cache

Larger, holding hundreds of thousands to millions of bytes

Connected to the processor by a special bus

Slower to access than L1, but still much faster than accessing the main memory.

The L1 and L2 caches are implemented with static random access memory (SRAM, a hardware technology).

Newer and more powerful systems: 3 levels.

The idea behind:

Exploit locality (the tendency for programs to access data and code in localized regions) to get a very large and fast memory.

Set up caches to hold data that are likely to be accessed often.

1.6 Storage Devices Form a Hierarchy

The storage devices in every computer system are organized as a memory hierarchy, as shown in the book's figure.

The main idea: storage at one level serves as a cache for storage at the next lower level.

1.7 The Operating System Manages the Hardware

Programs rely on the services provided by the operating system to access the hardware.

The operating system:

A layer of software interposed between the application program and the hardware.

All attempts by an application program to manipulate the hardware must go through the operating system.

Two primary purposes:
(1) to protect the hardware from misuse by runaway applications and
(2) to provide applications with simple and uniform mechanisms for manipulating complicated and often wildly different low-level hardware devices.

Both goals are achieved via the fundamental abstractions: processes, virtual memory, and files.

1.7.1Processes

A process: the operating system's abstraction for a running program.

Multiple processes can run concurrently on the same system, and each process appears to have exclusive use of the hardware.
Concurrently: The instructions of one process are interleaved with the instructions of another process.

Traditional systems could only execute one program at a time, while newer multicore processors can execute several programs simultaneously.

In either case, a single CPU can appear to execute multiple processes concurrently by having the processor switch among them.

The operating system performs this interleaving with context switching (a mechanism).

Uniprocessor systems contain a single CPU; multiprocessor systems contain several.

Context switching of a uniprocessor system:

The operating system keeps track of all the context (state) information that the process needs in order to run, including the current values of the PC, the register file, and the contents of main memory.

At any point in time, a uniprocessor system can only execute the code for a single process.
When the operating system decides to transfer control from the current process to some new process, it performs a context switch by

saving the context of the current process,

restoring the context of the new process,

and then passing control to the new process.

The new process picks up exactly where it left off.

The basic idea for the example scenario:

The kernel: the portion of the operating system code that is always resident in memory.

The transition from one process to another is managed by the operating system kernel.

When an application program requires some action by the operating system, it executes a special system call instruction, transferring control to the kernel.
The kernel then performs the requested operation and returns control to the application program.

Note:

Not a separate process.

Instead, a collection of code and data structures that the system uses to manage all the processes.

1.7.2 Threads

A process can actually consist of multiple threads (execution units)

Each running in the context of the process and sharing the same code and global data.

Increasingly important:

Requirement for concurrency in network servers

Easier to share data between threads than between processes

More efficient than processes

Multi-threading: make programs run faster

1.7.3 Virtual Memory

Virtual memory: an abstraction that provides each process with the illusion that it has exclusive use of the main memory.

Each process has the same uniform view of memory: virtual address space.

For Linux processes:

Areas (starting with the lowest addresses and working our way up):

Program code and data.

Code begins at the same fixed address for all processes, followed by data locations that correspond to global C variables.

Are initialized directly from the contents of an executable object file.

Heap.

Follows the code and data areas immediately.

Expands and contracts dynamically at run time as a result of calls to C standard library routines.

Shared libraries.

Near the middle of the address space.

Holds the code and data for shared libraries.

Powerful but difficult.

Stack (user stack).

At the top of the user's virtual address space.

Used by the compiler to implement function calls.

Expands and contracts dynamically during the execution of the program.
In particular, each time:

call a function, grows.

return from a function, contracts.

Kernel virtual memory.

The top region reserved.

Applications must invoke the kernel to read or write the contents of this area or to call functions defined in the kernel code.

For virtual memory to work, a sophisticated interaction is required between the hardware and the operating system software.

The basic idea: to store the contents of a process's virtual memory on disk and then use the main memory as a cache for the disk.

1.7.4 Files

A file: a sequence of bytes.

Every I/O device is modeled as a file.

All input and output in the system is performed by reading and writing files, using Unix I/O (a small set of system calls).

Very powerful

Provides applications with a uniform view of all the varied I/O devices that might be contained in the system.

1.8 Systems Communicate with Other Systems Using Networks

From the point of view of an individual system, the network can be viewed as just another I/O device.

When the system copies a sequence of bytes from main memory to the network adapter, the data flow across the network to another machine.

Similarly, the system can read data sent from other machines and copy these data to its main memory.

1.9 Important Themes

A system is a collection of intertwined hardware and systems software that must cooperate in order to achieve the ultimate goal of running application programs.

1.9.1 Amdahl's Law

Amdahl's law:

An observation about the effectiveness of improving the performance of one part of a system.

The main idea: when we speed up one part of a system, the effect on the overall system performance depends on both how significant this part was and how much it sped up.

Consider a system in which executing some application requires time T_old. Suppose some part of the system requires a fraction α of this time, and that we improve its performance by a factor of k. The overall execution time becomes T_new = T_old · [(1 − α) + α/k], so the speedup is S = T_old/T_new = 1/[(1 − α) + α/k].

The major insight: to significantly speed up the entire system, we must improve the speed of a very large fraction of the overall system. One interesting special case: setting k to ∞ gives S = 1/(1 − α).
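A quick worked example (mine, for concreteness): if a part consuming α = 0.6 of the time is sped up by k = 3, then S = 1/(0.4 + 0.6/3) = 1/0.6 ≈ 1.67, even though that part itself now runs 3 times as fast.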

A general principle for improving any process.

Most meaningful for computers. Performance is routinely improved by high factors.

1.9.2 Concurrency and Parallelism

Concurrency: the general concept of a system with multiple, simultaneous activities.

Parallelism: the use of concurrency to make a system run faster.

Can be exploited at multiple levels of abstraction in a computer system.

Three levels are highlighted here, from the highest to the lowest level in the system hierarchy.

Thread-Level Concurrency

With the process abstraction, multiple programs execute at the same time (concurrency). With threads, multiple control flows execute within a single process.

A uniprocessor system: a configuration in which a single processor must switch among multiple tasks.

Since the advent of time-sharing

Only simulated, by having a single computer rapidly switch among its executing processes

Allows:

multiple users to interact with a system at the same time

a single user to engage in multiple tasks concurrently

A multiprocessor system: a system consisting of multiple processors all under the control of a single operating system kernel.

Have become commonplace with the advent of multi-core processors and hyperthreading

Multi-core processors have several CPUs ("cores") integrated onto a single integrated-circuit chip.

Each L1 cache is split into two parts—

one to hold recently fetched instructions

one to hold data

The cores share higher levels of cache as well as the interface to main memory.

Hyperthreading (simultaneous multi-threading): a technique that allows a single CPU to execute multiple flows of control.

It involves having multiple copies of some of the CPU hardware,

such as program counters and register files,

while having only single copies of other parts of the hardware,

such as the units that perform floating-point arithmetic.

A hyperthreaded processor decides which of its threads to execute on a cycle-by-cycle basis.

It enables the CPU to take better advantage of its processing resources.

Improve system performance—

Reduces the need to simulate concurrency when performing multiple tasks.

Can run a single application program faster, but only if that program is expressed in terms of multiple threads that can effectively execute in parallel.

Instruction-Level Parallelism

Instruction-level parallelism: the property that processors can execute multiple instructions at one time.

Pipelining: the actions required to execute an instruction are partitioned into different steps and the processor hardware is organized as a series of stages, each performing one of these steps.

The stages can operate in parallel, working on different parts of different instructions.

Superscalar processors: Processors that can sustain execution rates faster than 1 instruction per cycle.

Most modern processors support superscalar operation.

Application programmers can use a high-level model to understand the performance of their programs.

They can then write programs such that the generated code achieves higher degrees of instruction-level parallelism and therefore runs faster.

Single-Instruction, Multiple-Data (SIMD) Parallelism

Single-instruction, multiple-data (SIMD) parallelism: a mode in which special hardware allows a single instruction to cause multiple operations to be performed in parallel.

1.9.3 The Importance of Abstractions in Computer Systems

The use of abstractions is one of the most important concepts in computer science.

On the processor side

The instruction set architecture: an abstraction of the actual processor hardware

A machine-code program behaves as if it were executed on a processor that performs just one instruction at a time.

The underlying hardware may execute instructions far more elaborately, but always in a way consistent with this simple model.

Different processor implementations can execute the same machine code while offering a range of cost and performance.

On the operating system side

Files: an abstraction of I/O devices

Virtual memory: an abstraction of program memory

Processes: an abstraction of a running program

The virtual machine: an abstraction of the entire computer

A way to manage computers that must be able to run programs designed for multiple operating systems or different versions of the same operating system.

1.10 Summary

A computer system consists of hardware and systems software that cooperate to run application programs. Information inside the computer is represented as groups of bits that are interpreted in different ways, depending on the context. Programs are translated by other programs into different forms, beginning as ASCII text and then translated by compilers and linkers into binary executable files.

Processors read and interpret binary instructions that are stored in main memory. Since computers spend most of their time copying data between memory, I/O devices, and the CPU registers, the storage devices in a system are arranged in a hierarchy, with the CPU registers at the top, followed by multiple levels of hardware cache memories, DRAM main memory, and disk storage. Storage devices that are higher in the hierarchy are faster and more costly per bit than those lower in the hierarchy. Storage devices that are higher in the hierarchy serve as caches for devices that are lower in the hierarchy. Programmers can optimize the performance of their C programs by understanding and exploiting the memory hierarchy.

The operating system kernel serves as an intermediary between the application and the hardware. It provides three fundamental abstractions: (1) Files are abstractions for I/O devices. (2) Virtual memory is an abstraction for both main memory and disks. (3) Processes are abstractions for the processor, main memory, and I/O devices.

Finally, networks provide ways for computer systems to communicate with one another. From the viewpoint of a particular system, the network is just another I/O device.

Part I Program Structure and Execution

How application programs are represented and executed.

Chapter 2 Representing and Manipulating Information

Bits

Computers store and process information represented as two-valued signals.

Bits form the basis of the digital revolution.

The decimal, or base-10, representation: natural for humans. Binary values: when building machines that store and process information.

Two-valued signals can readily be represented, stored, and transmitted

Encodings

Group bits together and apply some interpretation and represent the elements of any finite set.

The 3 most important representations of numbers.

Unsigned encodings

Based on traditional binary notation

Represent numbers ≥ 0

Two's-complement encodings

Represent signed integers

Either positive or negative

Floating-point encodings

A base-2 version of scientific notation for representing real numbers

Computers implement arithmetic operations with them.

Some operations can overflow when the results are too large to be represented.

The different mathematical properties of integer versus floating-point arithmetic:

Integer computer arithmetic satisfies many of the familiar properties of true integer arithmetic.

Floating-point arithmetic has altogether different mathematical properties.

Stem from the difference in how they handle the finiteness of their representations—

Integer representations encode a comparatively small range of values precisely

Floating-point representations encode a wide range of values approximately

A number of computer security vulnerabilities have arisen due to some of the subtleties of computer arithmetic.

Computers use several different binary representations to encode numeric values.

2.1 Information Storage

Computers use bytes (blocks of 8 bits) as the smallest addressable unit of memory.

A machine-level program views memory as a very large array of bytes — virtual memory.

Every byte of memory is identified by a unique number — its address.

The virtual address space: the set of all possible addresses. Just a conceptual image presented to the machine-level program.

Program objects: program data, instructions, and control information. The management of the storage is all performed within the virtual address space. Example: The value of a pointer in C is the virtual address of the first byte of some block of storage. Type information also is associated with each pointer.

2.1.1 Hexadecimal Notation

A Byte in Different Notations

A byte = 8 bits.

Binary: 00000000_2 to 11111111_2.

Decimal: 0_10 to 255_10.

Hexadecimal (or "hex", base-16): very convenient for describing bit patterns.

Uses '0' through '9' along with 'A' through 'F' to represent the 16 possible values.

00_16 to FF_16.

In C, hexadecimal constants start with 0x or 0X.

Example: write FA1D37B_16 as 0xFA1D37B, as 0xfa1d37b, or even mixing cases.

Manually converting between decimal, binary, and hexadecimal representations of bit patterns:

Converting between binary and hexadecimal: straightforward, since each hexadecimal digit corresponds to 4 bits.

Convert hexadecimal to binary. Example: 0x173A4C expands digit by digit to 0001 0111 0011 1010 0100 1100.

Convert binary to hexadecimal. Note: if the total number of bits is not a multiple of 4, make the leftmost group be the one with fewer than 4 bits, effectively padding the number with leading 0s. Example: 1111001010110110110011 groups as 11 1100 1010 1101 1011 0011, i.e., 0x3CADB3.

When x = 2^n for some nonnegative integer n, the binary representation of x is simply 1 followed by n 0s. A hexadecimal digit 0 represents 4 binary 0s. So, for n = i + 4j, where 0 ≤ i ≤ 3, write x with a leading hex digit of 2^i (that is, 1, 2, 4, or 8), followed by j hexadecimal 0s. Example: x = 2,048 = 2^11; n = 11 = 3 + 4 · 2, giving 0x800.

Converting between decimal and hexadecimal: requires multiplication or division.

Convert decimal to hexadecimal. To convert a decimal number x to hexadecimal, repeatedly write x = q · 16 + r. Use r as the least significant digit and generate the remaining digits by repeating the process on q. Example: decimal 314,156 converts to 0x4CB2C.

Convert hexadecimal to decimal. Multiply each of the hexadecimal digits by the appropriate power of 16. Example: 0x7AF = 7 · 16^2 + 10 · 16 + 15 = 7 · 256 + 10 · 16 + 15 = 1,967.
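A minimal C sketch of both directions (my own illustration; printf and strtol from the standard library would normally do this):

#include <stdio.h>
#include <stdlib.h>

/* Decimal to hex by repeated division: x = q * 16 + r. */
static void print_hex(unsigned long x)
{
    char digits[16];
    int n = 0;
    do {
        digits[n++] = "0123456789ABCDEF"[x % 16]; /* r: next least significant digit */
        x /= 16;                                  /* repeat the process on q */
    } while (x != 0);
    printf("0x");
    while (n > 0)
        putchar(digits[--n]);
    putchar('\n');
}

int main(void)
{
    print_hex(314156);                        /* prints 0x4CB2C */
    printf("%ld\n", strtol("7AF", NULL, 16)); /* hex to decimal: prints 1967 */
    return 0;
}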

2.1.2 Data Sizes

Every computer has a word size (the nominal size of pointer data). For a w-bit word size, the virtual addresses range from 0 to 2^w − 1.

A widespread shift from 32-bit machines to 64-bit ones (with a virtual address space of 16 exabytes).

32-bit programs vs 64-bit programs: the distinction lies in how a program is compiled, rather than the type of machine on which it runs.

Computers and compilers support multiple data formats using different ways to encode data.

The C language supports multiple data formats for both integer and floating point data.

Integer data:

Signed: negative, zero, and positive values.

Unsigned: nonnegative values.

char: a single byte. Can also be used to store integer values.

short, int, and long: provide a range of sizes.

A pointer (e.g., char *): uses the full word size.

Two different floating-point formats:

float: single precision, 4 bytes

double: double precision, 8 bytes

Fixed-size integer types: int32_t and int64_t. Using them is the best way for programmers to have close control over data representations.

Most of the data types encode signed values, unless prefixed by unsigned or using the specific unsigned declaration for fixed-size data types. The exception: char. The C standard does not guarantee that a plain char is signed. Use signed char to guarantee a 1-byte signed value. In many contexts, however, programs are insensitive to whether char is signed.

The C language allows a variety of ways to order the keywords and to include or omit optional keywords.

One aspect of portability is to make the program insensitive to the exact sizes of the different data types. The C standards set lower bounds on the numeric ranges of the different data types, but there are no upper bounds (except with the fixed-size types).

2.1.3 Addressing and Byte Ordering

Must establish two conventions for multi-byte objects: addressing and byte ordering.

Addressing

A multi-byte object is stored as a contiguous sequence of bytes. Address: the smallest address of the bytes used.

Example: an int x at address 0x100. (Assuming a 32-bit int) the 4 bytes would be stored at addresses 0x100, 0x101, 0x102, and 0x103.

Byte Ordering

Two common conventions.
A w-bit integer [xw −1, xw −2, . . . , x1, x0], xw−1: the most significant bit, x0: the least.

Assuming w: a multiple of 8. The most significant byte: [xw−1, xw−2, …, xw−8], the least significant byte: [x7, x6, …, x0] and the other bytes: bits from the middle.

Little endian: the least significant byte comes first

Big endian: the most significant byte comes first

Example: an int x at address 0x100 with value 0x01234567. The ordering of the bytes depends on the type of machine:

Note: The high-order byte is 0x01, while the low-order byte is 0x67. A big-endian machine stores 01 23 45 67 at addresses 0x100 through 0x103; a little-endian machine stores 67 45 23 01.

Machines:

Little-endian: most Intel-compatible machines, machines that use Intel-compatible processors manufactured by IBM or Oracle, Android, iOS.

Big-endian: most machines from IBM and Oracle

Bi-endian: ARM microprocessors

At times, it becomes an issue.

When binary data are communicated over a network between different machines.

When looking at the byte sequences representing integer data.

Often when inspecting machine-level programs

4004d3: 01 05 43 0b 20 00 add %eax,0x200b43(%rip)
generated by a disassembler

Disassembler: a tool that determines the instruction sequence represented by an executable program file

Add 0x200b43 to the current value of the program counter

Having bytes appear in reverse order is common when reading machine-level program representations generated for little-endian machines

When programs are written that circumvent the normal type system.

In C, a cast or a union

size_t: the preferred data type for expressing the sizes of data structures

sizeof(T) returns the number of bytes required to store an object of type T.

To write portable code

The different machine/operating system configurations use different conventions for storage allocation
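A sketch in the spirit of the book's show_bytes example (reconstructed from memory; details may differ from the book's exact code):

#include <stdio.h>

typedef unsigned char *byte_pointer;

/* Print the bytes of an object, lowest address first. */
void show_bytes(byte_pointer start, size_t len)
{
    for (size_t i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

int main(void)
{
    int x = 0x01234567;
    show_bytes((byte_pointer) &x, sizeof(x));
    /* Little-endian machines print " 67 45 23 01"; big-endian, " 01 23 45 67". */
    return 0;
}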

2.1.4 Representing Strings

A string in C

Encoded by an array of characters

Terminated by the null (having value 0) character.

Each character represented by the ASCII character code.

The ASCII code for decimal digit x happens to be 0x3x.

The terminating byte = 0x00

Independent of byte ordering and word size conventions

2.1.5 Representing Code

Binary code is seldom portable across different combinations of machine and operating system.

A fundamental concept: a program is simply a sequence of bytes.

2.1.6 Introduction to Boolean Algebra

Boolean algebra.

The work of George Boole

Encode true and false as 1 and 0

Boolean Operations

The simplest Boolean algebra: defined over {0, 1}.

The 4 Boolean operations and their corresponding logical operations:

~   not   (¬)
&   and   (∧)
|   or    (∨)
^   exclusive-or   (⊕)

Boolean Operations over Bit Vectors

Extending the 4 Boolean operations to bit vectors (strings of 0's and 1's of fixed length w).

Defined by applying the operation to the matching elements of the two arguments.

Examples:

Application: to represent finite sets.

Encode A ⊆ {0, 1, …, w − 1} with [a_{w−1}, …, a_1, a_0], where a_i = 1 if and only if i ∈ A.

Example: a = [01101001] encodes A = {0, 3, 5, 6}, b = [01010101] encodes B = {0, 2, 4, 6}.

Boolean operations and their corresponding set operations:

|   set union
&   set intersection
~   set complement

Continuing the example: a & b yields [01000001], i.e., A ∩ B = {0, 6}.

Practical applications example: There are a number of different signals that can interrupt the execution of a program. Selectively enable or disable different signals by specifying a bit-vector mask, where a 1 in bit position i indicates that signal i is enabled and a 0 indicates that it is disabled. Thus, the mask represents the set of enabled signals.
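A tiny C illustration of the set view (my own sketch, reusing the example vectors above):

#include <stdio.h>

int main(void)
{
    unsigned char a = 0x69; /* [01101001] encodes A = {0, 3, 5, 6} */
    unsigned char b = 0x55; /* [01010101] encodes B = {0, 2, 4, 6} */
    printf("%.2x\n", a & b); /* 41: [01000001], the intersection {0, 6} */
    printf("%.2x\n", a | b); /* 7d: [01111101], the union {0, 2, 3, 4, 5, 6} */
    return 0;
}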

2.1.7 Bit-Level Operations in C

The symbols for the Boolean operations can be applied to any "integral" data type.

Examples: expression evaluation for char:

Evaluating bit-level expression: (1) Expand hexadecimal to binary, (2) perform the operations, and (3) convert back to hexadecimal.

Use: to implement masking operations

A mask: a bit pattern that indicates a selected set of bits within a word.

Example:

x & 0xFF: the least significant byte of x and 0's (all others).

Example: x = 0x89ABCDEF, 0x000000EF.

~0: all 1's

Could be written 0xFFFFFFFF when int is 32 bits, but that is not as portable.
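A short C sketch of these masking idioms (my own illustration):

#include <stdio.h>

int main(void)
{
    unsigned x = 0x89ABCDEF;
    printf("%.8x\n", x & 0xFF); /* 000000ef: select the low-order byte */
    printf("%.8x\n", ~0u);      /* ffffffff: all ones, regardless of word size */
    return 0;
}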

2.1.8 Logical Operations in C

The logical operators and their corresponding logic operations:

||   or
&&   and
!    not

Treat any nonzero argument as true and argument 0 as false.

Return either 1 or 0, indicating a result of either true or false.

A bitwise operation matches that of its logical counterpart only when the arguments are restricted to 0 or 1.

The logical operators do not evaluate the second argument if the result can be determined by evaluating the first.

Examples: a && 5/a, p && *p++.

2.1.9 Shift Operations in C

Shift operations shift bit patterns to the left and to the right.

x: [x_{w−1}, x_{w−2}, …, x_0]

Left shift operation: x << k

[x_{w−k−1}, x_{w−k−2}, …, x_0, 0, …, 0]

x is shifted k bits to the left, dropping off the k most significant bits and filling the right end with k 0's.

0 ≤ k ≤ w − 1

Left associative: x << j << k is equivalent to (x << j) << k.

Right shift operation: x >> k, 2 forms

Logical.

The left end is filled with k 0's

[0, …, 0, x_{w−1}, x_{w−2}, …, x_k]

Arithmetic.

The left end is filled with k repetitions of the most significant bit

[x_{w−1}, …, x_{w−1}, x_{w−1}, x_{w−2}, …, x_k]

Useful for operating on signed integer data.

Example (the italicized digits fill the ends):

Not precisely defined by the C standards

In practice, arithmetic shifts are used for signed data; for unsigned data, right shifts must be logical.

The definition in Java: x >> k arithmetically, x >>> k logically
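A small C sketch contrasting the two right-shift forms (my own illustration; arithmetic shifting of signed values is the common machine behavior, not a C-standard guarantee):

#include <stdio.h>

int main(void)
{
    int sx = -16;              /* 0xfffffff0 on a 32-bit two's-complement machine */
    unsigned ux = 0xfffffff0u;
    printf("%d\n", sx >> 2);   /* -4: arithmetic shift, sign bit replicated */
    printf("%u\n", ux >> 2);   /* 1073741820: logical shift, zeros shifted in */
    return 0;
}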

2.2 Integer Representations

2.2.1 Integral Data Types

Integral data types represent finite ranges of integers.

Size: char, short, long

All nonnegative or possibly negative: unsigned or the default

The only machine-dependent: long (8-byte with 64-bit, 4-byte with 32-bit)

Typical ranges: asymmetric (the negative range is one larger than the positive range)

Guaranteed ranges (set by the C standards): ≤ the typical ranges; symmetric, except for the fixed-size data types

int could be 2-byte (mostly for 16-bit machines); long can be 4-byte (typically for 32-bit programs)

The fixed-size data types

Ranges = those of the typical ones

Asymmetric

2.2.2 Unsigned Encodings

Consider an integer data type of w bits.

A bit vector is written as:

x⃗, to denote the entire vector

[x_{w−1}, x_{w−2}, …, x_0], to denote the individual bits within the vector

The unsigned interpretation of x⃗: treated as a number written in binary notation.

x_i = 0 or 1 (x_i = 1 indicates that 2^i is part of the value).

principle: Definition of unsigned encoding

For vector x⃗ = [x_{w−1}, x_{w−2}, …, x_0]:

B2U_w(x⃗) ≐ Σ_{i=0}^{w−1} x_i · 2^i

≐: the left-hand side is defined to be equal to the right-hand side.

Examples:

UMax_w ≐ Σ_{i=0}^{w−1} 2^i = 2^w − 1

A mapping: B2U_w: {0, 1}^w → {0, …, UMax_w}.

principle: Uniqueness of unsigned encoding

Function B2Uw is a bijection.

Bijection: a function f that goes two ways: it maps a value x to a value y where y = f(x), but it can also operate in reverse, since for every y there is a unique value x such that f(x) = y, given by the inverse function x = f^{−1}(y).

U2Bw: the inverse of B2Uw

2.2.3 Two's-Complement Encodings

Two's-complement form: the most common representation of signed numbers

principle: Definition of two's-complement encoding

For vector x⃗ = [x_{w−1}, x_{w−2}, …, x_0]:

B2T_w(x⃗) ≐ −x_{w−1} · 2^{w−1} + Σ_{i=0}^{w−2} x_i · 2^i

The sign bit: x_{w−1}

"Weight": −2^{w−1}

1: negative; 0: nonnegative.

TMin_w ≐ −2^{w−1}; TMax_w ≐ Σ_{i=0}^{w−2} 2^i = 2^{w−1} − 1.

A mapping: B2T_w: {0, 1}^w → {TMin_w, …, TMax_w}.

principle: Uniqueness of two's-complement encoding

Function B2Tw is a bijection.

T2Bw: the inverse of B2Tw

Drop w: UMax, TMin, and TMax

Points worth highlighting:

Asymmetric Range: |TMin| = |TMax| + 1

UMax = 2 · TMax + 1

−1: same representation as UMax

0: a string of all 0's in both

Two's-complement in languages:

Not required by C. <limits.h> defines a set of constants delimiting the ranges of the different integer data types for the particular machine.

Example: for a two's-complement machine, INT_MAX = TMax_w, INT_MIN = TMin_w, and UINT_MAX = UMax_w.

Required by Java.

Example to get a better understanding:
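A minimal C illustration (mine), printing the <limits.h> constants just mentioned:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* On a two's-complement machine with 32-bit int: */
    printf("%d\n", INT_MIN);  /* -2147483648 = TMin_32 */
    printf("%d\n", INT_MAX);  /*  2147483647 = TMax_32 */
    printf("%u\n", UINT_MAX); /*  4294967295 = UMax_32 = 2 * TMax_32 + 1 */
    /* Note the asymmetry: |INT_MIN| = INT_MAX + 1. */
    return 0;
}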

2.2.4 Conversions between Signed and Unsigned

Casting example:

int x, unsigned u

(unsigned) x converts x to unsigned, (int) u converts u to int.

A general rule of handling conversions between signed and unsigned numbers with the same word size—the numeric values might change, but the bit patterns do not.

U2B_w and T2B_w

U2B_w

For 0 ≤ x ≤ UMax_w, U2B_w(x):

the unique unsigned bit representation

T2B_w

For TMin_w ≤ x ≤ TMax_w, T2B_w(x):

the unique two's-complement bit representation

T2U_w and U2T_w

T2U_w

For TMin_w ≤ x ≤ TMax_w, T2U_w(x) ≐ B2U_w(T2B_w(x))

0 ≤ T2U_w(x) ≤ UMax_w, same bit representation

principle: Conversion from two's complement to unsigned

For x such that TMin_w ≤ x ≤ TMax_w:

T2U_w(x) = x + 2^w, if x < 0; x, if x ≥ 0

derivation: Conversion from two's complement to unsigned

B2U_w(T2B_w(x)) = T2U_w(x) = x + x_{w−1} · 2^w

In a two's-complement representation of x, bit x_{w−1} determines whether or not x is negative.

Examples:

The behavior of T2U:

< 0: converted to large positive

≥ 0: unchanged

U2T_w(u)

U2T_w: for 0 ≤ x ≤ UMax_w, U2T_w(x) ≐ B2T_w(U2B_w(x))

TMin_w ≤ U2T_w(x) ≤ TMax_w, same bit representation

principle: Unsigned to two's-complement conversion

For u such that 0 ≤ u ≤ UMax_w:

U2T_w(u) = u, if u ≤ TMax_w; u − 2^w, if u > TMax_w

derivation: Unsigned to two's-complement conversion

U2T_w(u) = −u_{w−1} · 2^w + u

In the unsigned representation of u, bit u_{w−1} determines whether or not u is greater than TMax_w = 2^{w−1} − 1.

The behavior of U2T:

≤ TMax_w: unchanged

> TMax_w: converted to negative

Summary

The effects of converting in both directions:

0 ≤ x ≤ TMax_w

T2U_w(x) = x, U2T_w(x) = x: identical

Outside this range

Add or subtract 2^w

2 extremes:

T2U_w(−1) = UMax_w

T2U_w(TMin_w) = TMax_w + 1

2.2.5 Signed versus Unsigned in C

Numbers

Signed: by default

Example: 12345 or 0x1A2B

Unsigned: adding 'U' or 'u' as a suffix

Example: 12345U or 0x1A2Bu

Conversion between:

Allowed but not fully specified by the standard. Most systems: U2T_w and T2U_w (the bit pattern is preserved).

Explicit casting and implicit casting (when an expression of one type is assigned to a variable of another)

printf does not use type information, so directives such as %d and %u print the same bit pattern under different interpretations.

Possibly nonintuitive behavior:
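For instance (my sketch, assuming 32-bit int), the implicit conversion of signed operands to unsigned makes some comparisons surprising:

#include <stdio.h>

int main(void)
{
    /* When one operand is unsigned, the signed one converts to unsigned. */
    printf("%d\n", -1 < 0u);                       /* 0: -1 becomes UMax = 4294967295 */
    printf("%d\n", 2147483647 > -2147483647 - 1);  /* 1: plain signed comparison */
    printf("%d\n", 2147483647u > -2147483647 - 1); /* 0: TMin converts to TMin + 2^32 */
    return 0;
}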

2.2.6 Expanding the Bit Representation of a Number

One common operation: to convert between integers having different word sizes while retaining the same numeric value. Converting from a smaller to a larger data type should always be possible.

Converting Unsigned to Larger

Zero extension: adding leading zeros to the representation.

principle: Expansion of an unsigned number by zero extension

Define bit vectors u⃗ = [u_{w−1}, u_{w−2}, …, u_0] of width w and u⃗′ = [0, …, 0, u_{w−1}, u_{w−2}, …, u_0] of width w′, where w′ > w. Then B2U_w(u⃗) = B2U_{w′}(u⃗′).

Converting Two's-complement to Larger

Sign extension: adding copies of the most significant bit to the representation.

principle: Expansion of a two's-complement number by sign extension

Define bit vectors x⃗ = [x_{w−1}, x_{w−2}, …, x_0] of width w and x⃗′ = [x_{w−1}, …, x_{w−1}, x_{w−1}, x_{w−2}, …, x_0] of width w′, where w′ > w. Then B2T_w(x⃗) = B2T_{w′}(x⃗′).

The value is preserved:

derivation: Expansion of a two's-complement number by sign extension

The relative order of conversion when both size and type change (e.g., short to unsigned): the program first changes the size and then the type.
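A C sketch of both extensions (my own illustration, assuming 16-bit short and 32-bit int):

#include <stdio.h>

int main(void)
{
    short sx = -12345;         /* 16-bit pattern 0xcfc7 */
    unsigned short ux = 53191; /* the same 16-bit pattern 0xcfc7 */
    int x = sx;                /* sign extension: 0xffffcfc7 */
    unsigned u = ux;           /* zero extension: 0x0000cfc7 */
    printf("%x\n", x);         /* ffffcfc7, value -12345 */
    printf("%x\n", u);         /* cfc7, value 53191 */
    return 0;
}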

2.2.7 Truncating Numbers

Truncating x⃗ = [x_{w−1}, x_{w−2}, …, x_0] to k bits: drop the high-order w − k bits, giving x⃗′ = [x_{k−1}, x_{k−2}, …, x_0]. Truncating a number can alter its value, a form of overflow.

Truncating Unsigned

principle: Truncation of an unsigned number

Let x⃗ be the bit vector [x_{w−1}, x_{w−2}, …, x_0], and let x⃗′ be the result of truncating it to k bits: x⃗′ = [x_{k−1}, x_{k−2}, …, x_0]. Let x = B2U_w(x⃗) and x′ = B2U_k(x⃗′). Then x′ = x mod 2^k.

Truncating Two's-complement

principle: Truncation of a two's-complement number

Let x⃗ be the bit vector [x_{w−1}, x_{w−2}, …, x_0], and let x⃗′ be the result of truncating it to k bits: x⃗′ = [x_{k−1}, x_{k−2}, …, x_0]. Let x = B2T_w(x⃗) and x′ = B2T_k(x⃗′). Then x′ = U2T_k(x mod 2^k).

derivation: Truncation of a two's-complement number

B2T_w([x_{w−1}, x_{w−2}, …, x_0]) mod 2^k = B2U_k([x_{k−1}, x_{k−2}, …, x_0])

Summary

The effect of truncation

For unsigned:

B2U_k([x_{k−1}, x_{k−2}, …, x_0]) = B2U_w([x_{w−1}, x_{w−2}, …, x_0]) mod 2^k

For two's-complement:

B2T_k([x_{k−1}, x_{k−2}, …, x_0]) = U2T_k(B2U_w([x_{w−1}, x_{w−2}, …, x_0]) mod 2^k)

2.2.8 Advice on Signed versus Unsigned

The implicit conversion can lead to errors or vulnerabilities.

One way to avoid: to never use unsigned.

Example: Java.

Unsigned values are useful

When words are treated as collections of bits. Examples:

Packing a word with flags describing various Boolean conditions

Addresses (naturally unsigned)

When implementing mathematical packages for modular / multiprecision arithmetic.

2.3 Integer Arithmetic

2.3.1 Unsigned Addition

"Word size inflation":

Some programming languages support arbitrary size arithmetic;
More commonly, programming languages support fixed-size arithmetic.

x +^u_w y for arguments x and y, where 0 ≤ x, y < 2^w: the result of truncating the integer sum x + y to w bits, viewed as an unsigned number.

Characterized as a form of modular arithmetic: discarding any bits with weight greater than 2^{w−1}.

principle: Unsigned addition

For x and y such that 0 ≤ x, y < 2w:

x y =

Illustration:

Overflow: an arithmetic operation whose integer result cannot fit within the word size limits of the data type.

Occurs when the two operands sum to 2w or more

Not signaled as errors

principle: Detecting overflow of unsigned addition

For x and y in the range 0 ≤ x, y ≤ UMax_w, let s ≐ x +^u_w y. Then the computation of s overflowed if and only if s < x (or equivalently, s < y).

Modular addition forms an abelian group (a mathematical structure).

Commutative and associative

The identity element: 0; every element has an additive inverse

A value -^u_w x for every value x: x +^u_w (-^u_w x) = 0

principle: Unsigned negation

For any number x such that 0 ≤ x < 2w, its w-bit unsigned negation x is given by the following:

x =
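A sketch of the overflow test as C code (uadd_ok is the name used in the book's practice problems; this body is my own):

#include <stdio.h>

/* Can x + y be computed without unsigned overflow? */
int uadd_ok(unsigned x, unsigned y)
{
    unsigned s = x + y;
    return s >= x;          /* overflowed if and only if s < x */
}

int main(void)
{
    printf("%d\n", uadd_ok(0xffffffffu, 1u)); /* 0: the sum wraps to 0 */
    printf("%d\n", uadd_ok(100u, 200u));      /* 1 */
    return 0;
}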

2.3.2 Two's-Complement Addition

x +^t_w y, given integer values x and y where −2^{w−1} ≤ x, y ≤ 2^{w−1} − 1: the result of truncating the integer sum x + y to w bits, viewed as a two's-complement number.

principle: Two's-complement addition

For integer values x and y in the range −2^{w−1} ≤ x, y ≤ 2^{w−1} − 1:

x +^t_w y = x + y − 2^w, if 2^{w−1} ≤ x + y (positive overflow); x + y, if −2^{w−1} ≤ x + y < 2^{w−1} (normal); x + y + 2^w, if x + y < −2^{w−1} (negative overflow)

Illustration:

Positive overflow: x + y exceeds TMaxw (case 4)

Negative overflow: x + y is less than TMinw (case 1)

Has the same bit-level representation as the unsigned sum

Examples:

principle: Detecting overflow in two's-complement addition

For x and y in the range TMin_w ≤ x, y ≤ TMax_w, let s ≐ x +^t_w y. Then the computation of s has had positive overflow if and only if x > 0 and y > 0 but s ≤ 0. The computation has had negative overflow if and only if x < 0 and y < 0 but s ≥ 0.

Illustrations:
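A corresponding C sketch (tadd_ok is the book's practice-problem name, the body is mine; note that signed overflow is technically undefined behavior in C, so this relies on the wraparound behavior of typical machines):

#include <stdio.h>

/* Can x + y be computed without two's-complement overflow? */
int tadd_ok(int x, int y)
{
    int s = x + y;          /* wraps around on typical hardware */
    int pos_over = x > 0 && y > 0 && s <= 0;
    int neg_over = x < 0 && y < 0 && s >= 0;
    return !pos_over && !neg_over;
}

int main(void)
{
    printf("%d\n", tadd_ok(0x7fffffff, 1)); /* 0: positive overflow */
    printf("%d\n", tadd_ok(-5, 10));        /* 1 */
    return 0;
}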

2.3.3 Two's-Complement Negation

-^t_w x: the additive inverse under +^t_w

principle: Two's-complement negation

For x in the range TMin_w ≤ x ≤ TMax_w, its two's-complement negation -^t_w x is given by the formula

-^t_w x = TMin_w, if x = TMin_w; −x, if x > TMin_w

2.3.4 Unsigned Multiplication

x *^u_w y for integers x and y where 0 ≤ x, y ≤ 2^w − 1: the result of truncating the 2w-bit product x · y to w bits, viewed as an unsigned number.

principle: Unsigned multiplication

For x and y such that 0 ≤ x, y ≤ UMax_w:

x *^u_w y = (x · y) mod 2^w

2.3.5 Two's-Complement Multiplication

x *^t_w y for integers x and y where −2^{w−1} ≤ x, y ≤ 2^{w−1} − 1: the result of truncating the 2w-bit product x · y to w bits, viewed as a two's-complement number.

principle: Two's-complement multiplication

For x and y such that TMin_w ≤ x, y ≤ TMax_w:

x *^t_w y = U2T_w((x · y) mod 2^w)

principle: Bit-level equivalence of unsigned and two's-complement multiplication

Let x⃗ and y⃗ be bit vectors of length w. Define integers x and y as the values represented by these bits in two's-complement form: x = B2T_w(x⃗) and y = B2T_w(y⃗). Define nonnegative integers x′ and y′ as the values represented by these bits in unsigned form: x′ = B2U_w(x⃗) and y′ = B2U_w(y⃗). Then

T2B_w(x *^t_w y) = U2B_w(x′ *^u_w y′)

Illustrations:

2.3.6 Multiplying by Constants

Integer multiply is slow. Optimization: to replace multiplications by constants with combinations of shift and addition operations.

principle: Multiplication by a power of 2

Let x be the unsigned integer represented by bit pattern [x_{w−1}, x_{w−2}, …, x_0]. Then for any k ≥ 0, the (w + k)-bit unsigned representation of x · 2^k is given by [x_{w−1}, x_{w−2}, …, x_0, 0, …, 0], where k zeros have been added to the right.

principle: Unsigned multiplication by a power of 2

For C variables x and k with unsigned values x and k, such that 0 ≤ k < w, the C expression x << k yields the value x *^u_w 2^k.

principle: Two's-complement multiplication by a power of 2

For C variables x and k with two's-complement value x and unsigned value k, such that 0 ≤ k < w, the C expression x << k yields the value x *^t_w 2^k.

The task of generating code for the expression x * K, for some constant K.

K: [(0…0) (1…1) (0…0) . . . (1…1)].

Compute a run of 1's from bit position n down to bit position m (n ≥ m) using either form:

Form A: (x<<n) + (x<<(n − 1)) + . . . + (x<<m)

Form B: (x<<(n + 1)) - (x<<m)

Compute x * K by adding together the results for each run.
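For instance, with K = 14 = [1110], the run of 1's goes from n = 3 down to m = 1; a C sketch of both forms (my own illustration):

#include <stdio.h>

int main(void)
{
    long x = 5;
    long a = (x << 3) + (x << 2) + (x << 1); /* form A: 14 = 2^3 + 2^2 + 2^1 */
    long b = (x << 4) - (x << 1);            /* form B: 14 = 2^4 - 2^1 */
    printf("%ld %ld %ld\n", x * 14, a, b);   /* 70 70 70 */
    return 0;
}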

2.3.7 Dividing by Powers of 2

Performed using a right shift:

Logical: unsigned

Arithmetic: two's-complement

Integer division always rounds toward 0. Some notation for any real number a:

⌊a⌋ (floor): the unique integer a′ such that a′ ≤ a < a′ + 1.

⌈a⌉ (ceiling): the unique integer a′ such that a′ − 1 < a ≤ a′.

Dividing by a Power of 2 with Unsigned Arithmetic

principle: Unsigned division by a power of 2

For C variables x and k with unsigned values x and k, such that 0 ≤ k < w, the C expression x >> k yields the value ⌊x/2^k⌋.

Examples:

Dividing by a Power of 2 with Two's-complement Arithmetic

Using an arithmetic right shift:

principle: Two's-complement division by a power of 2, rounding down

Let C variables x and k have two's-complement value x and unsigned value k, respectively, such that 0 ≤ k < w. The C expression x >> k, when the shift is performed arithmetically, yields the value ⌊x/2^k⌋.

Examples:

Correcting for the improper rounding that occurs when a negative number is shifted right by "biasing" the value before shifting.

principle: Two's-complement division by a power of 2, rounding up

Let C variables x and k have two's-complement value x and unsigned value k, respectively, such that 0 ≤ k < w. The C expression (x + (1 << k) - 1) >> k, when the shift is performed arithmetically, yields the value ⌈x/2^k⌉.

Demonstration:

(x<0 ? x+(1<<k)-1 : x) >> k will compute x/2k.

2.3.8 Final Thoughts on Integer Arithmetic

2.4 Floating Point

2.4.1 Fractional Binary Numbers

2.4.2 IEEE Floating-Point Representation

2.4.3 Example Numbers

2.4.4 Rounding

2.4.5 Floating-Point Operations

2.4.6 Floating Point in C

2.5 Summary

Computers encode information as bits, generally organized as sequences of bytes. Different encodings are used for representing integers, real numbers, and character strings. Different models of computers use different conventions for encoding numbers and for ordering the bytes within multi-byte data.

The C language is designed to accommodate a wide range of different implementations in terms of word sizes and numeric encodings. Machines with 64-bit word sizes have become increasingly common, replacing the 32-bit machines that dominated the market for around 30 years. Because 64-bit machines can also run programs compiled for 32-bit machines, we have focused on the distinction between 32- and 64-bit programs, rather than machines. The advantage of 64-bit programs is that they can go beyond the 4 GB address limitation of 32-bit programs.

Most machines encode signed numbers using a two's-complement representation and encode floating-point numbers using IEEE Standard 754. Understanding these encodings at the bit level, as well as understanding the mathematical characteristics of the arithmetic operations, is important for writing programs that operate correctly over the full range of numeric values.

When casting between signed and unsigned integers of the same size, most C implementations follow the convention that the underlying bit pattern does not change. On a two's-complement machine, this behavior is characterized by functions T2Uw and U2Tw, for a w-bit value. The implicit casting of C gives results that many programmers do not anticipate, often leading to program bugs.

Due to the finite lengths of the encodings, computer arithmetic has properties quite different from conventional integer and real arithmetic. The finite length can cause numbers to overflow, when they exceed the range of the representation. Floating-point values can also underflow, when they are so close to 0.0 that they are changed to zero.

The finite integer arithmetic implemented by C, as well as most other programming languages, has some peculiar properties compared to true integer arithmetic. For example, the expression x*x can evaluate to a negative number due to overflow. Nonetheless, both unsigned and two's-complement arithmetic satisfy many of the other properties of integer arithmetic, including associativity, commutativity, and distributivity. This allows compilers to do many optimizations. For example, in replacing the expression 7*x by (x<<3)-x, we make use of the associative, commutative, and distributive properties, along with the relationship between shifting and multiplying by powers of 2.

We have seen several clever ways to exploit combinations of bit-level operations and arithmetic operations. For example, we saw that with two's-complement arithmetic, ~x+1 is equivalent to -x. As another example, suppose we want a bit pattern of the form [0, …, 0, 1, …, 1], consisting of w − k zeros followed by k ones. Such bit patterns are useful for masking operations. This pattern can be generated by the C expression (1<<k)-1, exploiting the property that the desired bit pattern has numeric value 2^k − 1. For example, the expression (1<<8)-1 will generate the bit pattern 0xFF.

Floating-point representations approximate real numbers by encoding numbers of the form x × 2^y. IEEE Standard 754 provides for several different precisions, with the most common being single (32 bits) and double (64 bits). IEEE floating point also has representations for special values representing plus and minus infinity, as well as not-a-number.

Floating-point arithmetic must be used very carefully, because it has only limited range and precision, and because it does not obey common mathematical properties such as associativity.

Chapter 3 Machine-Level Representation of Programs

Machine Code and Assembly Code

Computers execute machine code.

Machine code: sequences of bytes encoding the low-level operations that manipulate data, manage memory, read and write data on storage devices, and communicate over networks.

A compiler generates machine code through a series of stages, based on the rules of the programming language, the instruction set of the target machine, and the conventions followed by the operating system.

The gcc C compiler

Generates its output in the form of assembly code.

Assembly code: a textual representation of the machine code giving the individual instructions in the program.

Then invokes both an assembler and a linker to generate the executable machine code from the assembly code.

A High-level Language versus Low-level instructions

A high-level language

Low-level instructions

Shields programmers from the detailed machine-level implementation. Much more productive and reliable.

Must be specified by a programmer.

A program can be compiled and executed on a number of different machines.

Assembly code is highly machine specific.

The Importance of Learning Machine Code

Code optimization.

Understanding the run-time behavior of a program.

Concurrent programming.

Guarding against attacks.

The Relation between Source Code and the Generated Assembly

Understanding the relation between source code and the generated assembly: a form of reverse engineering.

Reverse engineering: trying to understand the process by which a system was created by studying the system and working backward.

The system is a machine-generated assembly language program.

x86-64

The machine language for most processors in laptop and desktop machines, data centers and supercomputers.

Started as Intel's 16-bit architecture, expanded to 32 bits, and most recently to 64 bits.

Its rival: Advanced Micro Devices (AMD).

The Transition from 32-bit to 64-bit Machines

A 32-bit machine:

Can only use around 4 gigabytes (2^32 bytes) of RAM.

Current 64-bit machines:

Can use up to 256 terabytes (2^48 bytes)

Could readily be extended to use up to 16 exabytes (2^64 bytes)

3.1 A Historical Perspective

The Intel processor line (x86) has followed a long evolutionary development.

Some models of Intel processors and some of their key features:

8086

One of the first 16-bit microprocessors.

A variant 8088: IBM PCs and MS-DOS.

i386

Expanded the architecture to 32 bits.

Added the flat addressing model. The first to fully support Unix.

PentiumPro

Introduced P6 microarchitecture (a radically new processor design).

Pentium 4E

Added hyperthreading and EM64T.

Hyperthreading: a method to run two programs simultaneously on a single processor.

EM64T(x86-64): Intel's implementation of a 64-bit extension to IA32 developed by Advanced Micro Devices (AMD).

Core 2.

First multi-core Intel microprocessor.

Multi-core processor: multiple processors are implemented on a single chip.

Core i7, Nehalem.

Incorporated both hyperthreading and multi-core, with the initial version supporting two executing programs on each core and up to four cores on each chip.

Backward compatible: able to run code compiled for any earlier version.

Intel's names for their processor line:

IA32: "Intel Architecture 32-bit"

Intel64 (x86-64): the 64-bit extension to IA32

"x86" (colloquial): the overall line

Advanced Micro Devices (AMD) have produced Intel-compatible processors. Introduced x86-64.

3.2 Program Encodings

Suppose a C program consists of two files, p1.c and p2.c. Compiling using a Unix command line: linux> gcc -Og -o p p1.c p2.c

The command gcc: the gcc C compiler.

Since default on Linux, also cc

The command-line option -Og: a level of optimization.

Higher levels of optimization: -O1, -O2.

The command-line directive -o p: name the final executable code file p.

The gcc command invokes an entire sequence of programs to turn the source code into executable code.

First, the C preprocessor expands the source code

To include any files specified with #include commands

To expand any macros, specified with #define declarations

Second, the compiler generates assembly-code versions of the two source files: p1.s and p2.s.

Next, the assembler converts the assembly code into binary object-code files p1.o and p2.o.

Object code: One form of machine code

Contains binary representations of all of the instructions, but the addresses of global values are not yet filled in.

Finally, the linker merges these two object-code files along with code implementing library functions (e.g., printf) and generates the final executable code file p.

Executable code: the second form of machine code

The exact form of code that is executed by the processor.
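The stages can also be run individually; a minimal sketch of the corresponding commands (file names follow the p1.c/p2.c example above, and the flags are standard gcc options):

linux> gcc -Og -S p1.c          # stop after compiling: produces p1.s
linux> gcc -Og -c p1.c          # stop after assembling: produces p1.o
linux> gcc -Og -o p p1.o p2.o   # link the object files into p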

3.2.1 Machine-Level Code

Computer systems employ several different forms of abstraction, hiding details of an implementation through the use of a simpler abstract model. Two important forms of abstraction for machine-level programming:

The instruction set architecture, or ISA, defines the format and behavior of a machine-level program.

Defines the processor state, the format of the instructions, and the effect each of these instructions will have on the state.

Most ISAs describe the behavior of a program as if each instruction is executed in sequence.

The processor hardware: executes instructions concurrently, but employs safeguards so that the overall behavior matches the sequential execution the ISA describes.

Virtual addresses: the memory addresses used by a machine-level program.

Providing a memory model that appears to be a very large byte array.

The actual implementation of the memory system: A combination of multiple hardware memories and operating system software

The compiler does most of the work in the overall compilation sequence, transforming programs into instructions. The main feature of the assembly-code representation: in a more readable textual format.

Visible parts of the x86-64 processor state:

The program counter (the PC, %rip in x86-64) indicates the address in memory of the next instruction to be executed.

The integer register file contains 16 registers.

Registers: named locations storing 64-bit values.

Hold addresses or integer data.

Some keep track of critical parts of the program state. Others hold temporary data.

The condition code registers hold status information about the most recently executed arithmetic or logical instruction.

Implement conditional changes in the control or data flow.

A set of vector registers can each hold one or more values.

Machine code views the memory as a large byte-addressable array.

Aggregate data types: contiguous collections of bytes.

Scalar data types: machine code makes no distinctions among them (e.g., between signed and unsigned integers, or between pointers and integers).

The program memory

Contains

The executable machine code for the program

Some information required by the operating system

A run-time stack for managing procedure calls and returns

Blocks of memory allocated by the user (e.g., malloc).

Addressed using virtual addresses.

Only limited subranges are valid.

The operating system manages this virtual address space, translating virtual addresses into the physical addresses of values in the actual processor memory.

A single machine instruction performs only a very elementary operation.

3.2.2 Code Examples

Suppose mstore.c:
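The book's mstore.c (reproduced from memory; treat as a sketch):

long mult2(long, long);

void multstore(long x, long y, long *dest) {
    long t = mult2(x, y);   /* call mult2, then store the result at *dest */
    *dest = t;
}

It can be compiled to assembly with gcc -Og -S mstore.c.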

Generating mstore.s

The assembly-code file:

Each indented line: a single machine instruction.

All information about local variable names or data types has been stripped away.

Generating mstore.o

The object-code file:

In binary format

The hexadecimal representation:

A key lesson: the program executed by the machine is simply a sequence of bytes encoding a series of instructions.

Generating prog

Requires running a linker on the set of object-code files, one of which must contain main.

Suppose main.c:
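The book's main.c (again reproduced from memory as a sketch):

#include <stdio.h>

void multstore(long, long, long *);

int main() {
    long d;
    multstore(2, 3, &d);
    printf("2 * 3 --> %ld\n", d);
    return 0;
}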

The resulting file prog contains not just the machine code for the procedures but also code used to start and terminate the program as well as to interact with the operating system.

Disassembling mstore.o and prog

Disassemblers:

Used to inspect the contents of machine-code files.

Generate a format similar to assembly code from the machine code.

With Linux systems, objdump (for "object dump") given -d:
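The book's command (a sketch):

linux> objdump -d mstore.o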

The result:

Features about machine code and its disassembled representation:

x86-64 instructions can range in length from 1 to 15 bytes.

Commonly used instructions and those with fewer operands require a smaller number of bytes

The instruction format: from a given starting position, there is a unique decoding of the bytes into machine instructions.

Example: only pushq %rbx can start with the byte value 53.

The disassembler determines the assembly code based purely on the byte sequences in the machine-code file.

The disassembler uses a slightly different naming convention for the instructions than does the assembly code generated by gcc.

Example: the omissions or additions of the suffix 'q'.

Disassembling prog:

Extract various code sequences:

Almost identical to that generated by the disassembly of mstore.o.

Differences:

The addresses—the linker has shifted the location of this code to a different range of addresses.

The linker has filled in the address that callq should use in calling mult2.

One task for the linker: to match function calls with the locations of the executable code for those functions.

Two additional lines of code (nop instructions) were inserted.

They have no effect on the program; they pad the code to improve memory system performance.

3.2.3 Notes on Formatting

Generating mstore.s

The full content:

Lines beginning with '.': directives to guide the assembler and linker.

A clearer presentation:

3.3 Data Formats

"Word": 2 bytes

"Double words": 4 bytes

"Quad words": 8 bytes

3.4 Accessing Information

An x86-64 central processing unit (CPU) contains a set of 16 general-purpose registers storing 64-bit values.

General-purpose registers store integer data and pointers.

2 conventions for instructions that copy and generate values with registers as destinations:

Instructions generating 1 or 2 bytes: the remaining bytes of the register are unchanged.

Instructions generating 4 bytes: the remaining (high-order) bytes are set to 0.

Different registers serve different roles in typical programs.

Most unique: %rsp

Used to indicate the end position in the run-time stack

Specifically read and written by some instructions

The other 15 registers: more flexible

A small number of instructions make specific use of certain registers.

A set of standard programming conventions governs how the registers are to be used for managing the stack, passing function arguments, returning values from functions, and storing local and temporary data.

3.4.1 Operand Specifiers

Operands: the source values to use in performing an operation and the destination location into which to place the result.

Operand types:

Immediate

Constant values

A '$' followed by an integer using standard C notation

Example: $-577 or $0x1F.

Different instructions allow different ranges of immediate values; the assembler will automatically select the most compact way of encoding a value.

Register

The contents of a register: either one of the sixteen 8-byte registers or one of their low-order 1-, 2-, or 4-byte portions.

r_a: an arbitrary register a.

R[r_a]: its value, viewing the set of registers as an array R indexed by register identifiers.

A memory reference

Memory location is accessed according to the effective address (a computed address)

M_b[Addr]: a reference to the b-byte value stored in memory starting at address Addr.

The subscript b is usually dropped.

An addressing mode: a form of memory references.

The most general form: Imm(r_b, r_i, s)

4 components:

An immediate offset Imm

A base register r_b (64-bit)

An index register r_i (64-bit)

A scale factor s (1, 2, 4, or 8)

The effective address = Imm + R[r_b] + R[r_i] · s

Often seen when referencing elements of arrays.

The other forms: special cases.

The more complex addressing modes are useful when referencing array and structure elements.

3.4.2 Data Movement Instructions

Grouping different instructions into instruction classes: The instructions in a class perform the same operation but with different operand sizes.

Simple Data Movement Instructions

mov: Copy data from a source location to a destination location, without any transformation.

Operands:

S (source): immediate, register, or memory

D (destination): register or memory

Copying from one memory location to another requires two instructions, since both operands cannot be memory:

mov Memory, Register

mov Register, Memory

Register operands for these instructions can be the labeled portions of any of the 16 registers

These instructions update only the specific register bytes or memory locations indicated by D; the exception is movl with a register destination.

Convention: Any instruction that generates a 32-bit value for a register also sets the high-order portion of the register to 0.

Examples showing 5 possible combinations:
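The book's five combinations (reproduced from memory as a sketch):

movl $0x4050,%eax       # Immediate -> Register, 4 bytes
movw %bp,%sp            # Register  -> Register, 2 bytes
movb (%rdi,%rcx),%al    # Memory    -> Register, 1 byte
movb $-17,(%esp)        # Immediate -> Memory,   1 byte
movq %rax,-12(%rbp)     # Register  -> Memory,   8 bytes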

movabsq: can have an arbitrary 64-bit immediate value as its source.

movq: can only take a 32-bit immediate, which is then sign-extended to 64 bits.

Operands:

I (source): immediate (64-bit)

R (destination): register

Zero- and Sign-extending Data Movement Instructions

Copying a smaller source value to a larger destination

movz & movs: Zero extension and sign extension

Operands:

S (source): register or memory

R (destination): register

Final 2 characters of each instruction: Size designators

The absence of explicit "movzlq".

Instead, movl having D: a register

Property: an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros.

cltq: The same effect as movslq %eax, %rax.

3.4.3 Data Movement Example

C "Pointers" = addresses

Dereferencing a pointer involves copying that pointer into a register, and then using this register in a memory reference.

Local variables are often kept in registers rather than stored in memory locations.

Register access is much faster than memory access.

3.4.4 Pushing and Popping Stack Data

Push and pop instructions

The stack data structure

Discipline: "Last-in, first-out"

Operations:

Push: Add data to a stack

Pop: Remove data

Array implementation

Insert and remove elements from top (one end of the array)

The program stack

Stored in some region of memory.

The stack pointer %rsp holds the address of the top stack element.

pushq: Push data

Operand:

S (source): register

Behavior: pushq %rbp is equivalent to
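As the book shows, it behaves like the following pair (except that pushq is encoded in a single byte):

subq $8, %rsp        # decrement stack pointer
movq %rbp, (%rsp)    # store %rbp on stack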

popq: Pop data

Operand:

R (destination): register

Behavior: popq %rax is equivalent to
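Again following the book:

movq (%rsp), %rax    # read %rax from stack
addq $8, %rsp        # increment stack pointer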

The popped value remains until overwritten

Arbitrary stack positions can be addressed

Example: movq 8(%rsp),%rdx

3.5 Arithmetic and Logical Operations

The x86-64 integer and logic operations.

leaq (no size variants) + instruction classes (each having 4 size variants: b, w, l, q)

4 groups:

Load effective address

Unary: 1 operand

Binary: 2 operands

Shifts

3.5.1 Load Effective Address

leaq (the load effective address instruction): Copy the effective address to the destination

Operands:

S (source): memory

D (destination): register

Uses

To generate pointers for later memory references.

To compactly describe common arithmetic operations.

Example: if %rdx holds x, then leaq 7(%rdx,%rdx,4), %rax sets %rax to 5x + 7.

Clever uses by compilers.

Illustration: a C program and the arithmetic operations it compiles to:
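Reproduced from the book's illustration from memory, as a sketch; the register assignments (x in %rdi, y in %rsi, z in %rdx) follow the standard argument conventions:

long scale(long x, long y, long z) {
    long t = x + 4 * y + 12 * z;
    return t;
}

/* Generated arithmetic (shown as comments):
   leaq (%rdi,%rsi,4), %rax      x + 4*y
   leaq (%rdx,%rdx,2), %rdx      z + 2*z = 3*z
   leaq (%rax,%rdx,4), %rax      (x + 4*y) + 4*(3*z) = x + 4*y + 12*z
*/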

The ability to perform addition and limited forms of multiplication proves useful when compiling simple arithmetic expressions.

3.5.2 Unary and Binary Operations

Unary Operations

Unary operations: inc (increment), dec (decrement), neg (negate), and not (complement).

Operand:

D (both source and destination): register or memory

Example: incq (%rsp) adds 1 to the 8-byte element on the top of the stack.

Binary Operations

Binary operations: add, sub, imul (multiply), xor, or, and.

Operands:

S (source): immediate, register, or memory

D (both source and destination): register or memory

S and D cannot both be memory locations.

When D is a memory location, the processor must read the value from memory, perform the operation, and then write the result back to memory.

Example: subq %rax,%rdx sets %rdx to %rdx - %rax (note the operand order).

3.5.3 Shift Operations

Shift operations:

Operands:

k (shift amount): immediate or the single-byte register %cl

D (value to shift): register or memory

With x86-64, a shift instruction operating on w-bit data takes its shift amount from the low-order m bits of %cl, where 2^m = w.

Example: When %cl = 0xFF, then

salb would shift by 7

salw would shift by 15

sall would shift by 31

salq would shift by 63

Left shift

sal and shl: Fill from the right with zeros.

Right shift

sar (arithmetic, >>A): Fill with copies of the sign bit

shr (logical, >>L): Fill with zeros

3.5.4 Discussion

Most instructions shown (except sar and shr) can be used for either unsigned or two's-complement arithmetic.

This is one reason why two's-complement arithmetic is the preferred way to implement signed integer arithmetic.

Example:

In general, compilers generate code that uses individual registers for multiple program values and moves program values among the registers.

3.5.5 Special Arithmetic Operations

Operations involving 128-bit (16-byte) numbers:

Oct word: A 16-byte quantity

Full Multiply Operations

mulq (for unsigned) and imulq (for two's-complement): compute the full 128-bit product of two 64-bit values.

Operand (1-operand form):

S (source): register or memory

Arguments: %rax and S.

Product: %rdx (high-order 64 bits) and %rax (low-order 64 bits).

2 forms of imulq:

The other: a member of the imul instruction class, a 2-operand multiply computing the low-order 64 bits of the product (the same for unsigned and two's-complement multiplication).

The assembler can tell which is intended by counting the number of operands.

mulq example:

Declarations:

uint64_t: declared in inttypes.h (part of ISO C99).

__int128: Support provided by gcc

Assembly code:
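A sketch reproducing the book's example from memory, with the generated assembly shown in comments:

#include <inttypes.h>

typedef unsigned __int128 uint128_t;   /* 128-bit type via gcc's __int128 */

void store_uprod(uint128_t *dest, uint64_t x, uint64_t y) {
    *dest = (uint128_t) x * (uint128_t) y;   /* full 64x64 -> 128-bit product */
}

/* Generated code (dest in %rdi, x in %rsi, y in %rdx):
   movq %rsi, %rax      copy x to multiplicand
   mulq %rdx            unsigned multiply by y
   movq %rax, (%rdi)    store lower 8 bytes at dest
   movq %rdx, 8(%rdi)   store upper 8 bytes at dest+8
*/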

Division or Modulus Operations

The 1-operand divide instructions:

idivq: Signed division instruction

Parts

Dividend: %rdx (high-order 64 bits) and %rax (low-order 64 bits)

Divisor: S

Quotient: %rax

Remainder: %rdx

64-bit division:

Dividend: %rax (64-bit)

Set bits of %rdx =

0's (unsigned arithmetic)

Sign bit of %rax (signed arithmetic), using cqto

cqto: reads the sign bit from %rax and copies it across all of %rdx.

No operands

Illustration: a function computing both quotient and remainder:
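The book's remdiv, reproduced from memory as a sketch:

void remdiv(long x, long y, long *qp, long *rp) {
    long q = x / y;    /* quotient: left by idivq in %rax */
    long r = x % y;    /* remainder: left by idivq in %rdx */
    *qp = q;
    *rp = r;
}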

divq: Unsigned division instruction.

Set %rdx = 0 beforehand

3.6 Control

Sequential and Conditional Behavior

Sequential behavior:

Straight-line code

Instructions follow one another in sequence

Conditional behavior:

C control constructs

Such as conditionals, loops, and switches

Conditional execution

The sequence of operations that get performed depends on the outcomes of tests applied to the data

2 strategies for implementing conditional operations

Conditional control transfers

The execution order:

Normally, sequential: Statements are in the order they appear in the program.

Alternatively, a jump instruction: Control should pass to some other part of the program.

Conditional data transfers

3.6.1 Condition Codes

Condition code registers:

Single-bit

Describe attributes of the most recent arithmetic or logical operation

Tested to perform conditional branches.

Most useful condition codes:

CF: Carry flag.

The most recent operation generated a carry out of the most significant bit.

Used to detect overflow for unsigned operations.

ZF: Zero flag.

The most recent operation yielded zero.

SF: Sign flag.

The most recent operation yielded a negative value.

OF: Overflow flag.

The most recent operation caused a two's-complement overflow—either negative or positive.

Example: add, t = a+b, integers

Condition codes:
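A sketch of the flag settings for t = a + b, written as C expressions (following the lecture slides; a, b, and t are same-size signed integers):

CF:  (unsigned) t < (unsigned) a                   /* carry out: unsigned overflow */
ZF:  t == 0                                        /* zero */
SF:  t < 0                                         /* negative */
OF:  ((a < 0) == (b < 0)) && ((t < 0) != (a < 0))  /* signed overflow */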

The setting of condition codes by instructions:

Integer arithmetic operations set the condition codes.

leaq does not alter any condition codes; it is intended for address computations.

The remaining ones set them as follows:

Logical operations (e.g., xor): CF and OF are set to 0.

Shift operations: CF is set to the last bit shifted out; OF is set to 0.

inc and dec: OF (and ZF) are set; CF is left unchanged.

cmp and test: Set without altering any other registers

cmp: Set the condition codes according to the differences of their two operands

Behavior: sub without updating destinations

ATT format: the operands appear in reverse order (cmp S1, S2 sets the codes according to S2 - S1).

Flags:

ZF: Set if S1 = S2

The others: Determine ordering relation

test:

Behavior: and without altering destinations

Operands:

Typically, S1 = S2

E.g., testq %rax,%rax: tests whether %rax is negative, zero, or positive.

Or one is a mask indicating which bits should be tested

3.6.2 Accessing the Condition Codes

Ways of using the condition codes:

The set instructions

The conditional jump instructions

Conditional data transfers

set: Set a single byte to 0 or to 1 depending on some combination of the condition codes.

The suffixes: Different conditions

Operand:

D (destination): register or memory; length: 1 byte

To generate a 32-bit or 64-bit result: Clear the high-order bits

Typical instruction sequence to compute a < b (a, b: long)
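A sketch of the book's sequence (a in %rdi, b in %rsi; result returned in %eax):

comp:
  cmpq   %rsi, %rdi     # compare a:b
  setl   %al            # set low-order byte of %eax to 0 or 1
  movzbl %al, %eax      # clear the rest of %eax (and of %rax)
  ret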

"Synonyms"

Set condition codes according to the computation t = a-b

Comparison tests

Signed comparisons: Combinations of SF ^ OF and ZF

Unsigned comparisons: Combinations of CF and ZF

How machine code does or does not distinguish between signed and unsigned values:

It mostly uses the same instructions

Some circumstances require different instructions

Different versions of right shifts, division and multiplication instructions

Different combinations of condition codes

3.6.3 Jump Instructions

A jump instruction: Causes the execution to switch to a completely new position in the program.

Jump destinations: Indicated in assembly code by a label.

Example:
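The book's small example, reproduced from memory as a sketch:

  movq $0,%rax        # set %rax to 0
  jmp .L1             # jump to .L1
  movq (%rax),%rdx    # null pointer dereference (skipped over)
.L1:
  popq %rdx           # jump target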

In generating the object-code file, the assembler determines the addresses of all labeled instructions and encodes the jump targets (the addresses of the destination instructions) as part of the jump instructions.

The different jump instructions

jmp

Unconditional

Either direct or indirect

A direct jump: The jump target is encoded as part of the instruction

The jump target: A label

Example: .L1

An indirect jump: The jump target is read from a register or a memory location.

Written as '*' followed by an operand specifier.

Examples:

jmp *%rax uses %rax as the jump target

jmp *(%rax) reads the jump target from memory, using the value in %rax as the read address

The remaining: conditional—they either jump or continue executing at the next instruction in the code sequence, depending on some combination of the condition codes.

The names and the conditions match those of set.

"Synonyms"

Can only be direct

3.6.4 Jump Instruction Encodings

Jump encodings:

PC-relative addressing.

Encode the difference between the address of the target instruction and the address of the instruction immediately following the jump.

These offsets can be encoded using 1, 2, or 4 bytes.

Absolute addressing.

Give an "absolute" address, using 4 bytes to directly specify the target.

The assembler and linker select the appropriate encodings of the jump destinations.

PC-relative addressing example: branch.c

The assembly code:

2 jumps: jmp, jg

The disassembled version of .o

PC = The address of the instruction following the jump

The disassembled version of the program after linking:

The jump instructions provide a means to implement conditional execution (if), as well as several different loop constructs.

3.6.5 Implementing Conditional Branches with Conditional Control

Implementing conditional branches

Most general: Conditional control transfers

Alternative: Conditional data transfers

Conditional control transfers:

Example:

The assembly implementation of if-else

The general form of if-else in C

The form of assembly implementation:
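A sketch of the book's running example, reproduced from memory (lt_cnt and ge_cnt are global counters that give the branches side effects):

long lt_cnt = 0;
long ge_cnt = 0;

long absdiff_se(long x, long y) {
    long result;
    if (x < y) {
        lt_cnt++;
        result = y - x;
    } else {
        ge_cnt++;
        result = x - y;
    }
    return result;
}

/* Goto version mirroring the generated control flow */
long gotodiff_se(long x, long y) {
    long result;
    if (x >= y)
        goto x_ge_y;
    lt_cnt++;
    result = y - x;
    return result;
x_ge_y:
    ge_cnt++;
    result = x - y;
    return result;
}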

3.6.6 Implementing Conditional Branches with Conditional Moves

Implementing conditional operations:

Conventional: Conditional transfer of control

The program follows one execution path when a condition holds and another when it does not.

Simple and general, but very inefficient on modern processors.

Alternate: Conditional transfer of data.

Computes both outcomes of a conditional operation and then selects one based on whether or not the condition holds.

Makes sense only in restricted cases

Implemented by a simple conditional move instruction

Better matched to the performance characteristics of modern processors.

Conditional control transfer

Example:

The relative performance of using conditional data transfers versus conditional control transfers

Processors achieve high performance through pipelining

Pipelining: An instruction is processed via a sequence of stages, each operating concurrently

E.g.,

Fetching the instruction from memory

Determining the instruction type

Reading from memory

Performing an arithmetic operation

Writing to memory

Updating the program counter

Achieves high performance by overlapping the steps of the successive instructions

Such as fetching one while performing the arithmetic operations for a previous one.

Requires being able to determine the sequence well ahead of time in order to keep the pipeline full.

When the machine encounters a conditional jump (a "branch"), it cannot determine which way the branch will go until it has evaluated the branch condition.

Branch prediction logic is employed.

Guessing reliably: the pipeline stays full of instructions.

Mispredicting a jump: the processor must discard the work done on the wrongly guessed path and begin fetching instructions from the correct location.

This incurs a serious misprediction performance penalty.

Conditional move instructions:

Operands:

S (source): register or memory; length: 16, 32, or 64 bits

R (destination): register

The outcome: depends on the values of the condition codes.

As with the different set and jump instructions

S is copied to D only if the specified condition holds.

Single-byte: Not supported.

The operand length: Inferred from R

Unlike the unconditional mov instructions, where the operand length is encoded explicitly in the instruction name.

The processor can execute conditional move instructions without having to predict the outcome of the test.

The processor simply reads the source value (possibly from memory), checks the condition code, and then either updates the destination register or keeps it the same.

Unlike conditional jumps

Implementing conditional operations via conditional data transfers

The general form of conditional expression and assignment:

Conditional control transfer:

Combines conditional and unconditional jumps

Conditional move:

The final statement: A conditional move
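A sketch of the two translations of v = test-expr ? then-expr : else-expr, using the book's absdiff example reproduced from memory (the name cmovdiff is mine, for illustration):

/* Branching version: only one side is evaluated */
long absdiff(long x, long y) {
    long result;
    if (x < y)
        result = y - x;
    else
        result = x - y;
    return result;
}

/* Conditional-move-friendly version: compute both, then select */
long cmovdiff(long x, long y) {
    long rval = y - x;
    long eval = x - y;
    if (x >= y)
        rval = eval;    /* compiles to a single cmov */
    return rval;
}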

Bad cases for conditional moves.

Invalid behavior

The case for the earlier example

Illustration:

C function:
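The book's example (a sketch):

long cread(long *xp) {
    /* Unsafe as a conditional move: *xp would be read even when xp is NULL */
    return (xp ? *xp : 0);
}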

Invalid implementation:

Null pointer dereferencing error.

Must be compiled using branching code

Code efficiency.

Example: wasted computation.

Compilers must take into account the relative performance of wasted computation versus the potential for performance penalty due to branch misprediction.

Used by gcc only when the two expressions can be computed very easily.

3.6.7 Loops

C looping constructs: do-while, while, and for.

Implementation

No corresponding instructions

Instead, combinations of conditional tests and jumps

Compilers generate loop code based on the two basic loop patterns.

Do-While Loops

The general do-while Translation:

C code

Equivalent goto version

Example:
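The book's example, reproduced from memory as a sketch:

long fact_do(long n) {
    long result = 1;
    do {
        result *= n;
        n = n - 1;
    } while (n > 1);   /* test at the bottom: the body always runs at least once */
    return result;
}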

Reverse engineering assembly code requires determining which registers are used for which program values

While Loops

The general while translation:

while version

2 translation methods:

Jump to middle translation: Performs the initial test by performing an unconditional jump to the test at the end of the loop.

-Og

Equivalent goto version:

Example:
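A sketch of the jump-to-middle translation applied to the book's fact_while (reproduced from memory):

long fact_while_jm_goto(long n) {
    long result = 1;
    goto test;       /* initial test via an unconditional jump to the bottom */
loop:
    result *= n;
    n = n - 1;
test:
    if (n > 1)
        goto loop;
    return result;
}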

Guarded do translation: First transforms the code into a do-while loop by using a conditional branch to skip over the loop if the initial test fails.

-O1

Equivalent do-while version:

Equivalent goto version:
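A sketch of the guarded-do translation for the same function:

long fact_while_gd_goto(long n) {
    long result = 1;
    if (n <= 1)
        goto done;   /* guard: skip the loop if the initial test fails */
loop:
    result *= n;
    n = n - 1;
    if (n != 1)
        goto loop;
done:
    return result;
}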

The compiler can often optimize the initial test, for example, determining that the test condition will always hold.

Example:

For Loops

The general for translation:

for version:

Equivalent while version:

Equivalent goto version:

Following the jump-to-middle strategy:

Following the guarded-do strategy:

Examples:

for version:
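The book's example, reproduced from memory as a sketch:

long fact_for(long n) {
    long i;
    long result = 1;
    for (i = 2; i <= n; i++)   /* init: i = 2; test: i <= n; update: i++ */
        result *= i;
    return result;
}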

Components:

Equivalent while version

Equivalent goto version (jump-to-middle):

Corresponding assembly-language code (-Og)

3.6.8 Switch Statements

switch statements allow jump table implementation.

Jump table: An array where entry i is the address of a code segment implementing the action the program should take when the switch index equals i.

The code performs a jump table reference using the switch index to determine the jump target.

Advantage over if-else: The time taken to perform the switch is independent of the number of switch cases.

Used by gcc when there are a number of cases and they span a small range of values.

Example:

switch_eg and switch_eg_impl
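A sketch of the book's switch_eg, i.e., part (a), reproduced from memory:

void switch_eg(long x, long n, long *dest) {
    long val = x;
    switch (n) {
    case 100:
        val *= 13;
        break;
    case 102:
        val += 10;
        /* fall through */
    case 103:
        val += 11;
        break;
    case 104:
    case 106:
        val *= val;
        break;
    default:
        val = 0;
    }
    *dest = val;
}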

Features of (a)

Case labels that do not span a contiguous range

Cases with multiple labels

Cases that fall through to other cases

Assembly code for switch statement

gcc's && operator (a language extension): creates a pointer to a code location.

The case range is shifted so that it starts at 0 (by subtracting the smallest label from the switch index).

Treating the shifted index as unsigned lets a single comparison catch values both below and above the range.

This simplifies the branching possibilities.

Key step in executing: To access a code location through the jump table.

In (b), computed goto (gcc's extension): goto *jt[index];

In assembly code for switch, indirect jmp

Jump table

In (b), an array

Duplicate cases: Same code label

Missing cases: Default label

In assembly code, declarations

.rodata (for "read-only data"): Segment of the object-code file

A sequence of 7 "quad" words:

Value of each = address associated with the labels.

.L4: The start of this allocation.

The address associated: Base for the indirect jump.

The use of a jump table allows a very efficient way to implement a multiway branch.

3.7 Procedures

Procedures: Key abstraction

Suppose procedure P calls procedure Q, and Q then executes and returns back to P.

Mechanisms:

Passing control.

The program counter must be set to the starting address of the code for Q upon entry and then set to the instruction in P following the call to Q upon return.

Passing data.

P must be able to provide one or more parameters to Q, and Q must be able to return a value back to P.

Allocating and deallocating memory.

Q may need to allocate space for local variables when it begins and then free that storage before it returns.

The x86-64 implementation: Minimalist strategy

Only as much as is required.

3.7.1 The Run-Time Stack

Storage management using a stack:

LIFO discipline

The stack and the registers store the information required for:

Passing control and data

Allocating memory

The x86-64 stack

Grows toward lower addresses

%rsp points to the top element

Storing data on and retrieving it from the stack:

pushq

popq

Allocating and deallocating space:

Decrement %rsp

Increment %rsp

The procedure's stack frame: The region where a procedure allocates space on the stack

When it requires storage beyond what it can hold in registers

General structure:

The frame for the executing procedure is always at the top.

Frame for the caller

Portions:

Arguments 7-n

If required by the callee

The return address: where within the caller the program should resume execution once the callee returns.

Pushed onto the stack by the call instruction in the caller.

Frame for the callee

Allocated by extending the current stack boundary.

Portions:

Saved registers

Local variables

Argument build area

Sizes

Fixed-size frames

Allocated at the beginning of the procedure.

Variable-size frames

Procedures allocate only the portions of stack frames they require.

A leaf procedure: All of the local variables can be held in registers and the function does not call any other functions.

3.7.2 Control Transfer

call and ret:

call: Pushes the return address onto the stack and sets the PC to the beginning of the callee.

The return address: The address of the instruction immediately following call.

ret: Pops the return address off the stack and sets the PC to it.

The general forms: call Label (direct), call *Operand (indirect), and ret.

In the disassembly by objdump: callq and retq

'q': x86-64 versions

call

The target: The address of the instruction where the called procedure starts.

Either direct or indirect

The target of a direct call: Label

The target of an indirect call: *Operand

Example: The execution of call and ret for multstore and main

Excerpts of disassembly:

More detailed example: Detailed execution of top and leaf

The standard call/return mechanism conveniently matches the LIFO memory management discipline.

3.7.3 Data Transfer

Data passing

Calls may involve passing data as arguments

Returning may also involve returning a value

Mostly via registers

Passing integral (i.e., integer and pointer) arguments

Passing up to 6 arguments via registers

The registers, in argument order: %rdi, %rsi, %rdx, %rcx, %r8, and %r9 (the return value goes in %rax).

Passing arguments 7–n on the stack (n > 6)

Stack top: Argument 7.

All data sizes are rounded up to be multiples of 8.

The portion "Argument build area": Space allocated within a procedure's stack frame for these arguments.

Example:

3.7.4 Local Storage on the Stack

Common cases where local data must be stored in memory:

Not enough registers

The address operator '&'

Arrays or structures

The portion of the stack frame labeled "Local variables": space allocated by a procedure on the stack frame by decrementing the stack pointer.

Example of the handling of '&'

The run-time stack provides a simple mechanism for allocating local storage when it is required and deallocating it when the function completes.

More complex example:

3.7.5 Local Storage in Registers

The set of program registers acts as a single resource shared by all of the procedures.

When one procedure (the caller) calls another (the callee), the callee must not overwrite any register value that the caller planned to use later.

The uniform set of conventions for register usage:

Callee-saved registers: %rbx, %rbp, and %r12-%r15.

Their values must be preserved by the callee:

By not changing it at all

By pushq-ing it, altering it, and then popq-ing it before ret.

The portion "Saved registers": Created by the pushing of register values.

With this convention, the caller can safely store a value in a callee-saved register, call, and then use it without risk of corruption.

Caller-saved registers: %rax, %rdi, %rsi, %rdx, %rcx, and %r8-%r11

Can be modified by any function.

Example: P

3.7.6 Recursive Procedures

Procedures can call themselves recursively

Provided by the stack discipline

Example: rfact

Mechanism: Each invocation of a function has its own private storage for state information

Return address

Callee-saved registers

The stack discipline of allocation and deallocation naturally matches the call-return ordering of functions.

Even works for mutual recursion

E.g., when P calls Q, which in turn calls P.

3.8 Array Allocation and Access

Pointers to elements within arrays are translated into address computations in machine code.

3.8.1 Basic Principles

Declaration

For data type T and integer constant N

x_A: the starting location

2 effects:

Allocates a contiguous region of L · N bytes in memory

L: the size (in bytes) of data type T

Introduces an identifier A that can be used as a pointer to x_A.

For 0 ≤ i ≤ N−1, A[i] is at address

&A[i] = x_A + L · i

Examples

Declarations:

Arrays generated:

Array access

Example: Evaluating E[i]

Suppose E: an int array

E in %rdx, and i in %rcx

Address computation:
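A sketch of the book's address computation:

movl (%rdx,%rcx,4), %eax    # read E[i] = M[x_E + 4*i] into %eax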

3.8.2 Pointer Arithmetic

Arithmetic on pointers: if a pointer p of type T * has value x_p, then p + i has value x_p + L · i, where L is the size of T.

The generation and dereferencing of pointers: '&' and '*'.

Example:

Expressions involving E each with an assembly-code implementation

E in %rdx, and i in %rcx

Result: Data in %eax, and pointers in %rax

3.8.3 Nested Arrays

The general principles hold even for arrays of arrays

Example

Declaration

Elements order in memory

Row-major

A[0], followed by A[1], and so on.

Illustration:

Consequence of the nested declaration.

To access elements of multidimensional arrays

Compute the offset

The generated mov instruction computes the address with the array start x_D as base, C · i + j as index, and L as the scale factor.

In general

Declaration: T D[R][C]

D[i][j] is at address

&D[i][j] = x_D + L(C · i + j)

L: the size of data type T in bytes

Example

Declaration:

Copying A[i][j] to %eax:
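A sketch assuming the book's int A[5][3] (so C = 3, L = 4), with A in %rdi, i in %rsi, and j in %rdx:

leaq (%rsi,%rsi,2), %rax    # 3*i
leaq (%rdi,%rax,4), %rax    # x_A + 12*i
movl (%rax,%rdx,4), %eax    # read M[x_A + 12*i + 4*j]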

3.8.4 Fixed-Size Arrays

Optimizing code operating on multidimensional arrays of fixed size.

Example: fix_prod_ele (-O1)

Declaration of fix_matrix

fix_prod_ele and fix_prod_ele_opt

Optimizations:

Generating Aptr

Generating Bptr

Generating Bend

Assembly code

3.8.5 Variable-Size Arrays

Variable-size arrays

Array dimension expressions: Computed as the array is being allocated

Declaration

expr1 and expr2 are evaluated as the declaration is encountered

Example:

var_ele: Access A[i][j] of A[n][n]

Code
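The book's var_ele, reproduced from memory as a sketch (n in %rdi, A in %rsi, i in %rdx, j in %rcx):

int var_ele(long n, int A[n][n], long i, long j) {
    return A[i][j];
}

/* Generated code (as comments):
   imulq %rdx, %rdi            n * i
   leaq (%rsi,%rdi,4), %rax    x_A + 4*(n*i)
   movl (%rax,%rcx,4), %eax    read M[x_A + 4*(n*i) + 4*j]
   ret
*/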

&A[i][j] = x_A + 4(n · i) + 4j = x_A + 4(n · i + j)

Must use imul to compute n · i: can incur a significant performance penalty, but is unavoidable here.

Optimized when referenced within a loop:

Optimize the index computations by exploiting the regularity of the access patterns.

Example: var_prod_ele

var_prod_ele and var_prod_ele_opt

Assembly code for the loop

3.9 Heterogeneous Data Structures

3.9.1 Structures

3.9.2 Unions

3.9.3 Data Alignment

3.10 Combining Control and Data in Machine-Level Programs

3.10.1 Understanding Pointers

3.10.2 Life in the Real World: Using the gdb Debugger

3.10.3 Out-of-Bounds Memory References and Buffer Overflow

3.10.4 Thwarting Buffer Overflow Attacks

3.10.5 Supporting Variable-Size Stack Frames

3.11 Floating-Point Code

3.11.1 Floating-Point Movement and Conversion Operations

3.11.2 Floating-Point Code in Procedures

3.11.3 Floating-Point Arithmetic Operations

3.11.4 Defining and Using Floating-Point Constants

3.11.5 Using Bitwise Operations in Floating-Point Code

3.11.6 Floating-Point Comparison Operations

3.11.7 Observations about Floating-Point Code

3.12 Summary

In this chapter, we have peered beneath the layer of abstraction provided by the C language to get a view of machine-level programming. By having the compiler generate an assembly-code representation of the machine-level program, we gain insights into both the compiler and its optimization capabilities, along with the machine, its data types, and its instruction set. In Chapter 5, we will see that knowing the characteristics of a compiler can help when trying to write programs that have efficient mappings onto the machine. We have also gotten a more complete picture of how the program stores data in different memory regions. In Chapter 12, we will see many examples where application programmers need to know whether a program variable is on the run-time stack, in some dynamically allocated data structure, or part of the global program data. Understanding how programs map onto machines makes it easier to understand the differences between these kinds of storage.

Machine-level programs, and their representation by assembly code, differ in many ways from C programs. There is minimal distinction between different data types. The program is expressed as a sequence of instructions, each of which performs a single operation. Parts of the program state, such as registers and the run-time stack, are directly visible to the programmer. Only low-level operations are provided to support data manipulation and program control. The compiler must use multiple instructions to generate and operate on different data structures and to implement control constructs such as conditionals, loops, and procedures. We have covered many different aspects of C and how it gets compiled. We have seen that the lack of bounds checking in C makes many programs prone to buffer overflows. This has made many systems vulnerable to attacks by malicious intruders, although recent safeguards provided by the run-time system and the compiler help make programs more secure.

We have only examined the mapping of C onto x86-64, but much of what we have covered is handled in a similar way for other combinations of language and machine. For example, compiling C++ is very similar to compiling C. In fact, early implementations of C++ first performed a source-to-source conversion from C++ to C and generated object code by running a C compiler on the result. C++ objects are represented by structures, similar to a C struct. Methods are represented by pointers to the code implementing the methods. By contrast, Java is implemented in an entirely different fashion. The object code of Java is a special binary representation known as Java byte code. This code can be viewed as a machine-level program for a virtual machine. As its name suggests, this machine is not implemented directly in hardware. Instead, software interpreters process the byte code, simulating the behavior of the virtual machine. Alternatively, an approach known as just-in-time compilation dynamically translates byte code sequences into machine instructions. This approach provides faster execution when code is executed multiple times, such as in loops. The advantage of using byte code as the low-level representation of a program is that the same code can be "executed" on many different machines, whereas the machine code we have considered runs only on x86-64 machines.

Chapter 4

Chapter 5

Chapter 6 The Memory Hierarchy

6.1 Storage Technologies

6.1.1 Random Access Memory

6.1.2 Disk Storage

6.1.3 Solid State Disks

6.1.4 Storage Technology Trends

6.2 Locality

6.2.1 Locality of References to Program Data

6.2.2 Locality of Instruction Fetches

6.2.3 Summary of Locality

6.3 The Memory Hierarchy

6.3.1 Caching in the Memory Hierarchy

6.3.2 Summary of Memory Hierarchy Concepts

6.4 Cache Memories

6.4.1 Generic Cache Memory Organization

6.4.2 Direct-Mapped Caches

6.4.3 Set Associative Caches

6.4.4 Fully Associative Caches

6.4.5 Issues with Writes

6.4.6 Anatomy of a Real Cache Hierarchy

6.4.7 Performance Impact of Cache Parameters

6.5

6.6

6.7 Summary

Part II Running Programs on a System

The interaction between your programs and the hardware.

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Chapter 9 Virtual Memory

9.1 Physical and Virtual Addressing

9.2 Address Spaces

9.3 VM as a Tool for Caching

9.3.1 DRAM Cache Organization

9.3.2 Page Tables

9.3.3 Page Hits

9.3.4 Page Faults

9.3.5 Allocating Pages

9.3.6 Locality to the Rescue Again

9.4 VM as a Tool for Memory Management

9.5 VM as a Tool for Memory Protection

9.6 Address Translation

9.6.1 Integrating Caches and VM

9.6.2 Speeding Up Address Translation with a TLB

9.6.3 Multi-Level Page Tables

9.6.4 Putting It Together: End-to-End Address Translation

9.7 Case Study: The Intel Core i7/Linux Memory System

9.7.1 Core i7 Address Translation

9.7.2 Linux Virtual Memory System

9.8 Memory Mapping

9.8.1 Shared Objects Revisited

9.8.2 The fork Function Revisited

9.8.3 The execve Function Revisited

9.8.4 User-Level Memory Mapping with the mmap Function

9.9

9.10 Garbage Collection

9.10.1 Garbage Collector Basics

9.10.2 Mark&Sweep Garbage Collectors

9.10.3 Conservative Mark&Sweep for C Programs

9.11 Common Memory-Related Bugs in C Programs

9.11.1 Dereferencing Bad Pointers

9.11.2 Reading Uninitialized Memory

9.11.3 Allowing Stack Buffer Overflows

9.11.4 Assuming That Pointers and the Objects They Point to Are the Same Size

9.11.5 Making Off-by-One Errors

9.11.6 Referencing a Pointer Instead of the Object It Points To

9.11.7 Misunderstanding Pointer Arithmetic

9.11.8 Referencing Nonexistent Variables

9.11.9 Referencing Data in Free Heap Blocks

9.11.10 Introducing Memory Leaks

9.12 Summary

Part III Interaction and Communication between Programs

The basic I/O services provided by Unix operating systems and how to use these services to build applications.