google ads

google search

Monday, June 8, 2009

Assembly Process

It is useful to consider how a person would process a program before trying to think about how it is done by a program. For this purpose, consider the program in Figure 2.1. It is important to note that the assembly process does not require any understanding of the program being assembled. Thus, it is unnecessary to understand the integer division algorithm implemented by the code in Figure 2.1, and little understanding of the particular machine code being used is needed (for those who are curious, the code is written for an R6502 microprocessor, the processor used in the historically important Apple II family of personal computers from the late 1970's).
; UNSIGNED INTEGER DIVIDE ROUTINE
; Takes dividend in A, divisor in Y
; Returns remainder in A, quotient in Y
START: STA IDENDL ;Store the low half of the dividend
STY ISOR ;Store the divisor
LDA #0 ;Zero the high half of the dividend (in register A)
TAX ;Zero the loop counter (in register X)
LOOP: ASL IDENDL ;Shift the dividend left (low half first)
ROL ; (high half second)
CMP ISOR ;Compare high dividend with divisor
BCC NOSUB ;If IDEND < ISOR don't subtract
SBC ISOR ;Subtract ISOR from IDEND
INC IDENDL ;Put a one bit in the quotient
NOSUB: INX ;Count times through the loop
CPX #8
BNE LOOP ;Repeat loop 8 times
LDY IDENDL ;Return quotient in Y
RTS ;Return remainder in A

IDENDL:B 0 ;Reserve storage for the low dividend/quotient
ISOR: B 0 ;Reserve storage for the divisor
Figure 2.1. An example assembly language program.
When a person who knows the Roman alphabet looks at text such as that illustrated in Figure 2.1, an important, almost unconscious processing step takes place: The text is seen not as a random pattern on the page, but as a sequence of lines, each composed of a sequence of punctuation marks, numbers, and word-like strings. This processing step is formally called lexical analysis, and the words and similar structures recognized at this level are called lexemes.
If the person knows the language in which the text is written, a second and still possibly unconscious processing step will occur: Lexical elements of the text will be classified into structures according to their function in the text. In the case of an assembly language, these might be labels, opcodes, operands, and comments; in English, they might be subjects, objects, verbs, and subsidiary phrases. This level of analysis is called syntactic analysis, and is performed with respect to the grammar or syntax of the language in question.
A person trying to hand translate the above example program must know that the R6502 microprocessor has a 16 bit memory address, that memory is addressed in 8 bit (one byte) units, and that instructions have a one byte opcode field followed optionally by additional bytes for the operands. The first step would typically involve looking at each instruction to find out how many bytes of memory it occupies. Table 2.1 lists the instructions used in the above example and gives the necessary information for this step.
Opcode Bytes Hex Code

ASL 3 0E aa aa
B 1 cc
BCC 2 90 oo
BNE 2 D0 oo
CMP 3 CD aa aa
CPX # 2 E0 cc
INC 3 EE aa aa
INX 1 E8
LDA # 2 A9 cc
LDY 3 AC aa aa
ROL 1 2A
RTS 1 60
SBC 3 ED aa aa
STA 3 8D aa aa
STY 3 8C aa aa
TAX 1 AA

Notes: aa aa - two byte address, least significant byte first.
oo - one byte relative address.
cc - one byte of constant data.
Table 2.1. Opcodes on the R6502.
To begin the translation of the example program to machine code, we take the data from table 2.1 and attach it to each line of code. Each significant line of an assembly language program includes the symbolic name of one machine instruction, for example, STA. This is called the opcode or operation code for that line. The programmer, of course, needs to know what the program is supposed to do and what these opcodes are supposed to do, but the translator has no need to know this! For the curious, the STA instruction stores the contents of the accumulator register in the indicated memory address, but you do not need to know this to assemble the program!
Table 2.1 shows the numerical equivalent of each opcode code in hexadecimal, base 16. We could have used any number base; inside the computer, the bytes are stored in binary, and because hexidecimal to binary conversion is trivial, we use that base here. While we're at it, we will strip off all the irrelevant commentary and formatting that was only included only for the human reader, and leave only the textual description of the program.
8D START: STA IDENDL
aa
aa
8C STY ISOR
aa
aa
A9 LDA #0
cc
AA TAX
0E LOOP: ASL IDENDL
aa
aa
2A ROL
CD CMP ISOR
aa
aa
90 BCC NOSUB
oo
ED SBC ISOR
aa
aa
EE INC IDENDL
aa
aa
E8 NOSUB: INX
E0 CPX #8
cc
D0 BNE LOOP
oo
AC LDY IDENDL
aa
aa
60 RTS
cc IDENDL:B 0
cc ISOR: B 0
Figure 2.2. Partial translation of the example to machine language
The result of this first step in the translation is shown in Figure 2.2. This certainly does not complete the job! Table 2.1 included constant data, relative offsets and addresses, as indicated by the lower case notatons cc, oo and aaaa, and to finish the translation to machine code, we must substitute numeric values for these!
Constants are the easiest. We simply incorporate the appropriate constants from the source code into the machine code, translating each to hexadecimal. Relative offsets are a bit more difficult! These give the number of bytes ahead (if positive) or behind (if negative) the location immediately after the location that references the offset. Negative offsets are represented using 2's complement notation.
8D START: STA IDENDL
aa
aa
8C STY ISOR
aa
aa
A9 LDA #0
00
AA TAX
0E LOOP: ASL IDENDL
aa
aa
2A ROL
CD CMP ISOR
aa
aa
90 BCC NOSUB
06
ED SBC ISOR
aa
aa
EE INC IDENDL
aa
aa
E8 NOSUB: INX
E0 CPX #8
08
D0 BNE LOOP
EC
AC LDY IDENDL
aa
aa
60 RTS
00 IDENDL:B 0
00 ISOR: B 0
Figure 2.3. Additional translation of the example to machine language
The result of this next translation step is shown in boldface in Figure 2.3. We cannot complete the translation without determining where the code will be placed in memory. Suppose, for example, that we place this code in memory starting at location 020016. This allows us to determine which byte goes in what memory location, and it allows us to assign values to the two labels IDENDL and ISOR, and thus, fill out the values of all of the 2-byte address fields to complete the translation.
0200: 8D START: STA IDENDL
0201: 21
0202: 02
0203: 8C STY ISOR
0204: 22
0205: 02
0206: A9 LDA #0
0207: 00
0208: AA TAX
0209: 0E LOOP: ASL IDENDL
020A: 21
020B: 02
020C: 2A ROL
020D: CD CMP ISOR
020E: 22
020F: 02
0210: 90 BCC NOSUB
0211: 06
0212: ED SBC ISOR
0213: 22
0214: 02
0215: EE INC IDENDL
0216: 21
0217: 02
0218: E8 NOSUB: INX
0219: E0 CPX #8
021A: 08
021B: D0 BNE LOOP
021C: EC
021D: AC LDY IDENDL
021E: 21
021F: 02
0220: 60 RTS
0221: 00 IDENDL:B 0
0222: 00 ISOR: B 0
Figure 2.4. Complete translation of the example to machine language
Again, in completing the translation to machine code, the changes from Figure 2.3 to Figure 2.4 are shown in boldface. For hand assembly of a small program, we don't need anything additional, but if we were assembling a program that ran on for pages and pages, it would be helpful to read through it once to find the numerical addresses of each label in the program, and then read through it again, substituting those numerical values into the code where they are needed.
symbol address

START 0200
LOOP 0209
NOSUB 0218
IDENDL 0221
ISOR 0222
Table 2.2. The symbol table for Figure 2.4.
Table 2.2 shows the symbol table for this small example, sorted into numerical order. For a really large program, we might rewrite the table into alphabetical order to before using it to finish the assembly.
It is worth noting the role which the meaning of the assembly code played in the assembly process. None! The programmer writing the line STA IDENDL must have understood its meaning, "store the value of the A register in the location labeled IDENDL", and the CPU, when it executes the corresponding binary instruction 8D 21 02 must know that this means "store the value of the A register in the location 0221", but there is no need for the person or computer program that translates assembly code to machine code to understand this!
This same assertion holds for compilers for high level languages. A C++ compiler does not understand that for(;;)x(); involves a loop, but only that, prior to the code for a call to the function x, the compiler should note the current memory address, and after the call, the compiler should output some particular instruction that references that address. The person who wrote the compiler knew that this instruction is a branch back to the start of the loop, but the compiler has no understanding of this!
To translator performing the assembly process, whether that translator is a human clerk or an assembler, the line STA IDENDL means "allocate 3 consecutive bytes of memory, put 8D in the first byte, and put the 16 bit value of the symbol IDENDL in the remaining 2 bytes." If the symbol IDENDL is mapped to the value 0221 by the symbol table, then the interpretation of the result of the assembler's interpretation of the source code is the same as the programmers interpretation. These relationships may be illustrated in Figure 2.5.
Source Text
/ \ compiler or
programmer's / \ assembler's
view of meaning / \ view of meaning
/ \
Abstract Meaning ----- Machine Code

hardware's
view of meaning
Figure 2.5. Views of the meaning of a program.

No comments:

Post a Comment