This chapter describes lexical conventions associated with the following items:
You can use blank and tab characters anywhere between operators, identifiers, and constants. Adjacent identifiers or constants that are not otherwise separated must be separated by a blank or tab.
These characters can also be used within character constants; however, they are not allowed within operators and identifiers.
The number sign character (#) introduces a comment. Comments that start with a number sign extend through the end of the line on which they appear. You can also use C language notation (/*...*/) to delimit comments.
Do not start a comment with a number sign in column one. The assembler uses cpp (the C language preprocessor) to preprocess assembler code, and cpp interprets a number sign in the first column as the start of a preprocessor directive.
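The column-one rule can be checked mechanically before a file is assembled. The following Python sketch is only an illustration: the function name `column_one_hash_lines` is hypothetical, and the check also reports intentional cpp directives (such as `#include`), so its output still needs review.

```python
def column_one_hash_lines(source):
    """Return the 1-based numbers of lines whose first column is '#'.

    Such lines are handed to cpp as preprocessor directives, so a comment
    written this way is at risk of being misinterpreted.
    """
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if line.startswith("#")
    ]


sample = (
    "# counter setup\n"                      # column one: reported
    "\tldq $1, 0($2)  # load counter\n"      # trailing comment: fine
    "  # indented comment\n"                 # not in column one: fine
)
flagged = column_one_hash_lines(sample)
```

Indenting the comment by a single blank, or using /*...*/ notation, avoids the problem.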
An identifier consists of a case-sensitive sequence of alphanumeric characters (A-Z, a-z, 0-9) and the following special characters:
Identifiers can be up to 31 characters long, and the first character cannot be numeric (0-9).
If an undefined identifier is referenced, the assembler assumes that the identifier is an external symbol. The assembler treats the identifier like a name specified by a .globl directive (see Chapter 5).
If the identifier is defined to the assembler and the identifier has not been specified as global, the assembler assumes that the identifier is a local symbol.
The assembler supports the following constants:
The assembler interprets all scalar constants as two's complement numbers. Scalar constants can contain any of the digits 0123456789abcdefABCDEF.
Scalar constants can be decimal, hexadecimal, or octal constants:
Floating-point constants can appear only in floating-point directives (see Chapter 5) and in the floating-point load immediate instructions (see Section 4.2). Floating-point constants have the following format:
±d1[.d2][e|E±d3]
The "+" symbol (plus sign) is optional.
For example, the number .02173 can be represented as follows:
21.73E-3
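The decimal format above maps naturally onto a regular expression. The following Python sketch is a reading of that format, not the assembler's own parser: the names `DECIMAL_FLOAT` and `parse_decimal_float` are hypothetical, and it assumes that the sign in front of the exponent is optional in the same way as the leading sign.

```python
import re

# ±d1[.d2][e|E±d3]: d1 is required; the fraction and exponent are optional.
# Assumption: both signs may be omitted.
DECIMAL_FLOAT = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")


def parse_decimal_float(text):
    """Validate text against the documented format and return its value."""
    if DECIMAL_FLOAT.fullmatch(text) is None:
        raise ValueError(f"not in the documented format: {text!r}")
    return float(text)


value = parse_decimal_float("21.73E-3")
```

Here `value` compares equal to the Python literal `0.02173`, which matches the example in the text.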
The floating-point directives, such as .float and .double, may optionally use hexadecimal floating-point constants instead of decimal constants. A hexadecimal floating-point constant consists of the following elements:
[+|-]0x[1|0].<hex-digits>h0x<hex-digits>
The assembler places the first set of hexadecimal digits (excluding the 0 or 1 preceding the decimal point) in the mantissa field of the floating-point format without attempting to normalize it. It stores the second set of hexadecimal digits in the exponent field without biasing them. If the mantissa appears to be denormalized, it checks to determine whether the exponent is appropriate. Hexadecimal floating-point constants are useful for generating IEEE special symbols and for writing hardware diagnostics.
For example, either of the following directives generates the single-precision number 1.0:
.float 1.0e+0
.float 0x1.0h0x7f
The assembler uses normal (nearest) rounding mode to convert floating-point constants.
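To see why 0x1.0h0x7f yields 1.0, it helps to build the IEEE single-precision bit pattern by hand. The Python sketch below is an illustration under stated assumptions, not the assembler's algorithm: the function names are hypothetical, the mantissa digits are assumed to be left-justified in the 23-bit fraction field (the text does not state the alignment), and the denormalized-mantissa check is not modeled.

```python
import re
import struct

HEX_FLOAT = re.compile(r"([+-]?)0x([01])\.([0-9a-fA-F]+)h0x([0-9a-fA-F]+)")


def single_bits(constant):
    """Return the 32-bit single-precision pattern for a hex float constant."""
    match = HEX_FLOAT.fullmatch(constant)
    if match is None:
        raise ValueError(f"not a hexadecimal floating-point constant: {constant!r}")
    sign, _leading_digit, mantissa, exponent = match.groups()

    # Assumption: mantissa digits are left-justified in the 23-bit field.
    width = 4 * len(mantissa)
    fraction = int(mantissa, 16)
    if width > 23:
        fraction >>= width - 23
    else:
        fraction <<= 23 - width

    # The exponent digits are stored as given, without biasing.
    exponent_field = int(exponent, 16)
    if exponent_field > 0xFF:
        raise ValueError("exponent does not fit in 8 bits")

    sign_bit = 1 if sign == "-" else 0
    return (sign_bit << 31) | (exponent_field << 23) | fraction


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


one = single_bits("0x1.0h0x7f")        # exponent field 0x7f is the bias for 2**0
infinity = single_bits("0x1.0h0xff")   # all-ones exponent, zero fraction
```

Because the exponent is not biased by the assembler, the value 0x7f must already include the single-precision bias of 127. An exponent field of 0xff with a zero mantissa gives an IEEE infinity, which is the kind of special value the text mentions.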
All characters except the newline character are allowed in string constants. String constants begin and end with double quotation marks (").
The assembler observes most of the backslash conventions used by the C language. Table 2-1 shows the assembler's backslash conventions.
Convention | Meaning |
\a | Alert (0x07) |
\b | Backspace (0x08) |
\f | Form feed (0x0c) |
\n | Newline (0x0a) |
\r | Carriage return (0x0d) |
\t | Horizontal tab (0x09) |
\v | Vertical tab (0x0b) |
\\ | Backslash (0x5c) |
\" | Quotation mark (0x22) |
\' | Single quote (0x27) |
\nnn | Character whose octal value is nnn (where n is 0-7) |
\Xnn | Character whose hexadecimal value is nn (where n is 0-9, a-f, or A-F) |
Deviations from C conventions are as follows:
For octal notation, the backslash conventions require three characters when the next character could be confused with the octal number.
For hexadecimal notation, the backslash conventions require two characters when the next character could be confused with the hexadecimal number. Insert a 0 (zero) as the first character of the single-character hexadecimal number when this condition occurs.
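The two deviations are easier to see with a small decoder for the conventions in Table 2-1. This Python sketch is a model only: the function name `decode` is hypothetical, and it assumes that an octal escape consumes up to three digits and a hexadecimal escape up to two, taking as many as are available.

```python
SIMPLE = {
    "a": "\x07", "b": "\x08", "f": "\x0c", "n": "\x0a", "r": "\x0d",
    "t": "\x09", "v": "\x0b", "\\": "\\", '"': '"', "'": "'",
}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode(text):
    """Expand the backslash conventions of Table 2-1 in text."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            out.append(ch)
            continue
        if i >= len(text):
            raise ValueError("backslash at end of string")
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc in OCTAL_DIGITS:
            digits = esc
            # Assumption: take up to three octal digits greedily.
            while len(digits) < 3 and i < len(text) and text[i] in OCTAL_DIGITS:
                digits += text[i]
                i += 1
            out.append(chr(int(digits, 8)))
        elif esc == "X":
            digits = ""
            # Assumption: take up to two hexadecimal digits greedily.
            while len(digits) < 2 and i < len(text) and text[i] in HEX_DIGITS:
                digits += text[i]
                i += 1
            if not digits:
                raise ValueError("\\X needs at least one hexadecimal digit")
            out.append(chr(int(digits, 16)))
        else:
            raise ValueError(f"unknown backslash convention: \\{esc}")
    return "".join(out)
```

Under these assumptions, `\11` decodes to a single tab character, so the character with octal value 1 followed by the digit 1 has to be written with all three digits, as `\0011`. Likewise `\Xab` is one character (0xab), so a newline followed by the letter b needs the leading zero: `\X0ab`.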
You can include multiple statements on the same line by separating the statements with semicolons. Note, however, that the assembler does not recognize semicolons as separators when they follow comment symbols (# or /*).
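The interaction between semicolons and comments can be modeled in a few lines. This Python sketch is deliberately simplified and the name `split_statements` is hypothetical: it handles only # comments, and it ignores /*...*/ comments and any # or ; characters that appear inside string constants.

```python
def split_statements(line):
    """Split one source line into statements and a trailing # comment.

    Semicolons separate statements only in the part of the line that
    precedes the comment symbol.
    """
    code, marker, comment = line.partition("#")
    statements = [part.strip() for part in code.split(";")]
    return statements, (marker + comment).strip()


statements, comment = split_statements(
    "ldq $1, 0($2); addq $1, 1, $1  # bump; not a separator"
)
```

The semicolon after "bump" stays inside the comment, so the line contains two statements, not three.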
The assembler supports the following types of statements:
Each keyword statement can include an optional label, an operation code (mnemonic or directive), and zero or more operands (with an optional comment following the last operand on the statement):
[ label: ] opcode operand [ ; opcode operand; ... ] [ # comment ]
Some keyword statements also support relocation operands (see Section 2.6.4).
Labels can consist of label definitions or numeric values.
Label definitions always end with a colon. You can put a label definition on a line by itself.
A null statement is an empty statement that the assembler ignores. Null statements can have label definitions. For example, the following line has three null statements in it:
label: ; ;
A keyword statement contains a predefined keyword. The syntax for the rest of the statement depends on the keyword. Keywords are either assembler instructions (mnemonics) or directives.
Assembler instructions in the main instruction set and the floating-point instruction set are described in Chapter 3 and Chapter 4, respectively. Assembler directives are described in Chapter 5.
Relocation operands are generally useful in only two situations:
Some macro instructions (for example, ldgp) require special coordination between the machine-code instructions and the relocation sequences given to the linker. By using the macro instructions, the assembler programmer relies on the assembler to generate the appropriate relocation sequences.
In some instances, the use of macro instructions may be undesirable. For example, a compiler that supports the generation of assembly language files may not want to defer instruction scheduling to the assembler. Such a compiler will want to schedule some or all of the machine-code instructions. To do this, the compiler must have a mechanism for emitting an object file's relocation sequences without using macro instructions. The mechanism for establishing these sequences is the relocation operand.
A relocation operand can be placed after the normal operand on an assembly language statement:
opcode operand relocation_operand
The syntax of the relocation_operand is as follows:
!relocation_type! sequence_number
The relocation types must be enclosed within a pair of exclamation points (!) and are not case sensitive. See Table 7-11 for descriptions of the different types of relocation operations.
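A tool that reads or emits such statements needs to separate the normal operand from the relocation operand. The following Python sketch shows one way to do that. The names `RELOC` and `split_relocation` are hypothetical, and lowercasing the type is just one way to reflect that relocation types are not case sensitive.

```python
import re

# operand, then !type!, then the sequence number (optionally after blanks).
RELOC = re.compile(r"^(?P<operand>.*?)!(?P<type>[A-Za-z_]+)!\s*(?P<seq>\d+)$")


def split_relocation(text):
    """Return (operand, relocation_type, sequence_number).

    The last two items are None when no relocation operand is present.
    """
    text = text.strip()
    match = RELOC.match(text)
    if match is None:
        return text, None, None
    return (
        match.group("operand"),
        match.group("type").lower(),
        int(match.group("seq")),
    )


parsed = split_relocation("sym1($gp)!literal!1")
```

Statements that share a sequence number, such as a !literal!1 load and the !lituse_base!1 instructions that use its result, can then be grouped by the third element of the tuple.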
The following examples contain relocation operands in the source code:
# Equivalent C statement:
#     sym1 += sym2 (Both external)

# Assembly statements containing macro instructions:
        ldq     $1, sym1
        ldq     $2, sym2
        addq    $1, $2, $3
        stq     $3, sym1

# Assembly statements containing machine-code instructions
# requiring relocation operands:
        ldq     $1, sym1($gp)!literal!1
        ldq     $2, sym2($gp)!literal!2
        ldq     $3, sym1($1)!lituse_base!1
        ldq     $4, sym2($2)!lituse_base!2
        addq    $3, $4, $3
        stq     $3, sym1($1)!lituse_base!1
The assembler stores the sym1 and sym2 address constants in the .lita section.
In this example, the code with relocation operands provides better performance than the other code because it saves on register usage and on the length of machine-code instruction sequences.
# Assembly statements containing macro instructions:
        beq     $2, L
        stq     $31, sym
        ldgp    $gp, 0($27)
        ...

# Assembly statements containing machine-code instructions that
# require relocation operands:
        ldq     $at, sym($gp)!literal!1
        beq     $2, L                       # crosses basic block boundary
        ldah    $gp, 0($27)!gpdisp!2
        stq     $31, sym($at)!lituse_base!1
        lda     $gp, 0($gp)!gpdisp!2
In this example, the programmer has elected to schedule the load of the address of sym before the conditional branch.
# Assembly statements containing macro instructions:
        jsr     sym1
        ldgp    $gp, 0($ra)

        .extern sym1
        .text

# Assembly statements containing machine-code instructions that
# require relocation operands:
        ldq     $27, sym1($gp)!literal!1
        jsr     $26, ($27), sym1!lituse_jsr!1   # as1 puts in an R_HINT for the jsr instruction
        ldah    $gp, 0($ra)!gpdisp!2
        lda     $gp, 0($gp)!gpdisp!2
In this example, the code with relocation operands does not provide any significant gains over the other code. This example is only provided to show the different coding methods.
An expression is a sequence of symbols that represents a value. Each expression and its result have data types. The assembler does arithmetic in two's complement integers with 64 bits of precision. Expressions follow precedence rules and consist of the following elements:
You can also use a single-character string in place of an integer within an expression. For example, the following two pairs of statements are equivalent:
.byte "a" ; .word "a"+0x19
.byte 0x61 ; .word 0x7a
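The equivalence follows from the character code of the letter. As a quick check in Python, assuming the ASCII character set (the helper name `char_value` is hypothetical):

```python
def char_value(text):
    """Return the integer value of a single-character string."""
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


byte_operand = char_value("a")          # 0x61, the code for "a"
word_operand = char_value("a") + 0x19   # 0x61 + 0x19 = 0x7a, the code for "z"
```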
The assembler supports the operators shown in Table 2-2.
Operator | Meaning |
+ | Addition |
- | Subtraction |
* | Multiplication |
/ | Division |
% | Remainder |
<< | Shift left |
>> | Shift right (sign is not extended) |
^ | Bitwise EXCLUSIVE OR |
& | Bitwise AND |
| | Bitwise OR |
- | Minus (unary) |
+ | Identity (unary) |
~ | Complement |
For the order of operator evaluation within expressions, you can rely on the precedence rules or you can group expressions with parentheses. Unless parentheses enforce precedence, the assembler evaluates all operators of the same precedence strictly from left to right. Because parentheses also designate index registers, ambiguity can arise from parentheses in expressions. To resolve this ambiguity, put a unary + in front of parentheses in expressions.
The assembler has three precedence levels. The following table lists the precedence rules from lowest to highest:
Precedence | Operators |
Lowest (least binding) | Binary +, - |
Middle | Binary *, /, %, <<, >>, ^, &, | |
Highest (most binding) | Unary -, +, ~ |
Note
The assembler's precedence scheme differs from that of the C language.
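The practical effect of the three-level scheme is easiest to see by evaluating a few expressions. The Python sketch below is a model built from the rules stated above, not the assembler's code: the names `tokenize` and `evaluate` are hypothetical, only decimal and 0x-prefixed constants are accepted, the / and % operators are omitted for brevity, and results are kept as unsigned 64-bit patterns so that >> does not extend the sign.

```python
import re

MASK = (1 << 64) - 1  # keep every result within 64 bits
TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|\d+|<<|>>|[-+*^&|~()])")

LOW = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
MID = {
    "*": lambda a, b: a * b,
    "<<": lambda a, b: a << b,
    ">>": lambda a, b: a >> b,  # operands are unsigned here: no sign extension
    "^": lambda a, b: a ^ b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    """Evaluate text with the three precedence levels described above."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def unary():  # highest precedence: unary -, +, ~
        token = take()
        if token == "-":
            return -unary() & MASK
        if token == "+":
            return unary()
        if token == "~":
            return ~unary() & MASK
        if token == "(":
            value = low()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        base = 16 if token[:2].lower() == "0x" else 10
        return int(token, base) & MASK

    def binary(operand, table):  # one level, strictly left to right
        value = operand()
        while peek() in table:
            op = take()
            value = table[op](value, operand()) & MASK
        return value

    def mid():
        return binary(unary, MID)

    def low():
        return binary(mid, LOW)

    result = low()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result
```

In this model `1 + 2 << 3` evaluates to 17, because << binds more tightly than binary +, whereas C groups it as (1 + 2) << 3 and yields 24. Similarly, `6 & 3 * 2` evaluates left to right as (6 & 3) * 2 = 4, while C computes 6 & (3 * 2) = 6. Parentheses remove the difference.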
Each symbol you reference or define in an assembly program belongs to one of the type categories shown in Table 2-4.
Type | Description |
undefined | Any symbol that is referenced but not defined becomes global undefined. (Declaring such a symbol in a .globl directive merely makes its status clearer.) |
absolute | A constant defined in an assignment (=) expression. |
text | Any symbol defined while the .text directive is in effect belongs to the text section. The text section contains the program's instructions, which are not modifiable during execution. |
data | Any symbol defined while the .data directive is in effect belongs to the data section. The data section contains memory that the linker can initialize to nonzero values before your program begins to execute. |
sdata | The type sdata is similar to the type data, except that defining a symbol while the .sdata ("small data") directive is in effect causes the linker to place it within the small data section. This increases the chance that the linker will be able to optimize memory references to the item by using gp-relative addressing. |
rdata and rconst | Any symbol defined while the .rdata or .rconst directives are in effect belongs to this category. The only difference between the types rdata and rconst is that the former is allowed to have dynamic relocations and the latter is not. (The types rdata and rconst are also similar to the type data but, unlike data, cannot be modified during execution.) |
bss and sbss | Any symbol defined in a .comm or .lcomm directive belongs to these sections, except that a .data, .sdata, .rdata, or .rconst directive can override a .comm directive. The .bss and .sbss sections consist of memory that the kernel loader initializes to zero before your program begins to execute. If a symbol's size is less than the number of bytes specified by the -G compilation option (which defaults to eight), it belongs to the .sbss section (small bss section), and the linker places it within the small data section. This increases the chance that the linker will be able to optimize memory references to the item by using gp-relative addressing. Local symbols in the .bss or .sbss sections defined by .lcomm directives are allocated memory by the assembler, global symbols are allocated memory by the linker, and symbols defined by .comm directives are overlaid upon like-named symbols (in the fashion of Fortran COMMON blocks) by the linker. |
Symbols in the undefined category are always global; that is, they are visible to the linker and can be shared with other modules of your program. Symbols in the absolute, text, data, sdata, rdata, rconst, bss, and sbss type categories are local unless declared in a .globl directive.
For any expression, the result's type depends on the types of the operands and the operator. The following type propagation rules are used in expressions:
The assembler accepts addresses expressed in the formats described in Table 2-5.
Format | Address Description |
(base-register) | Specifies an indexed address, which assumes a zero offset. The base register's contents specify the address. |
expression | Specifies an absolute address. The assembler generates the most locally efficient code for referencing the value at the specified address. |
expression(base-register) | Specifies a based address. To get the address, the value of the expression is added to the contents of the base register. The assembler generates the most locally efficient code for referencing the value at the specified address. |
relocatable-symbol | Specifies a relocatable address. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. |
relocatable-symbol±expression | Specifies a relocatable address. To get the address, the value of the expression, which has an absolute value, is added to or subtracted from the relocatable symbol. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external. |
relocatable-symbol(index-register) | Specifies an indexed relocatable address. To get the address, the index register is added to the relocatable symbol's address. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external. |
relocatable-symbol±expression(index-register) | Specifies an indexed relocatable address. To get the address, the assembler adds the expression to, or subtracts it from, the relocatable symbol's address and then adds the contents of the index register. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external. |