2 Lexical Conventions

This chapter describes lexical conventions associated with the following items:

Blank and tab characters (Section 2.1)
Comments (Section 2.2)
Identifiers (Section 2.3)
Constants (Section 2.4)
Physical lines (Section 2.5)
Statements (Section 2.6)
Expressions (Section 2.7)
Address formats (Section 2.8)

2.1 Blank and Tab Characters

You can use blank and tab characters anywhere between operators, identifiers, and constants. Adjacent identifiers or constants that are not otherwise separated must be separated by a blank or tab.

These characters can also be used within character constants; however, they are not allowed within operators and identifiers.

2.2 Comments

The number sign character (#) introduces a comment. Comments that start with a number sign extend through the end of the line on which they appear. You can also use C language notation (/*...*/) to delimit comments.

Do not start a comment with a number sign in column one; the assembler uses cpp (the C language preprocessor) to preprocess assembler code and cpp interprets number signs in the first column as preprocessor directives.

2.3 Identifiers

An identifier consists of a case-sensitive sequence of alphanumeric characters (A-Z, a-z, 0-9) and the following special characters:

. (period)
_ (underscore)
$ (dollar sign)

Identifiers can be up to 31 characters long, and the first character cannot be numeric (0-9).
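
The identifier rules above can be restated as a single pattern. The following Python sketch is illustrative only and is not part of the assembler; the name `is_identifier` is hypothetical.

```python
import re

# Section 2.3 rules: letters, digits, '.', '_', and '$' are allowed;
# the first character cannot be a digit; at most 31 characters in total.
IDENTIFIER_RE = re.compile(r"[A-Za-z._$][A-Za-z0-9._$]{0,30}")


def is_identifier(text):
    """Return True if text satisfies the identifier rules stated above."""
    return IDENTIFIER_RE.fullmatch(text) is not None


for name in ("main", "$L1", ".loop_2", "2fast", "a-b", "x" * 32):
    print(repr(name), is_identifier(name))
```

Because identifiers are case sensitive, `Main` and `main` both pass this check but name different symbols.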

If an undefined identifier is referenced, the assembler assumes that the identifier is an external symbol. The assembler treats the identifier like a name specified by a .globl directive (see Chapter 5).

If the identifier is defined to the assembler and the identifier has not been specified as global, the assembler assumes that the identifier is a local symbol.

2.4 Constants

The assembler supports the following constants:

Scalar constants
Floating-point constants
String constants

2.4.1 Scalar Constants

The assembler interprets all scalar constants as twos complement numbers. Scalar constants can be any of the digits 0123456789abcdefABCDEF.

Scalar constants can be decimal, hexadecimal, or octal:

Decimal constants consist of a sequence of decimal digits (0-9) without a leading zero.
Hexadecimal constants consist of the characters 0x (or 0X) followed by a sequence of hexadecimal digits (0-9abcdefABCDEF).
Octal constants consist of a leading zero followed by a sequence of octal digits (0-7).
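
The three forms can be told apart by their prefix alone. The Python sketch below classifies a constant according to the rules listed above; it is a model of those rules, not the assembler's own code. The name `parse_scalar` is hypothetical, and treating a lone `0` as an octal zero is an assumption.

```python
import re

DECIMAL_RE = re.compile(r"[1-9][0-9]*")    # decimal digits, no leading zero
HEX_RE = re.compile(r"0[xX][0-9a-fA-F]+")  # 0x or 0X, then hexadecimal digits
OCTAL_RE = re.compile(r"0[0-7]*")          # leading zero, then octal digits


def parse_scalar(text):
    """Return (kind, value) for a scalar constant written as in Section 2.4.1."""
    if DECIMAL_RE.fullmatch(text):
        return "decimal", int(text, 10)
    if HEX_RE.fullmatch(text):
        return "hexadecimal", int(text, 16)
    if OCTAL_RE.fullmatch(text):
        # Assumption: a lone "0" is accepted here as an octal zero.
        return "octal", int(text, 8)
    raise ValueError("not a scalar constant: %r" % text)


for text in ("31", "0x1F", "037"):
    print(text, parse_scalar(text))
```

A constant such as `08` is rejected by this sketch: the leading zero selects octal, and 8 is not an octal digit.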

2.4.2 Floating-Point Constants

Floating-point constants can appear only in floating-point directives (see Chapter 5) and in the floating-point load immediate instructions (see Section 4.2). Floating-point constants have the following format:

±d1[.d2][e|E±d3]

d1: is written as a decimal integer and denotes the integral part of the floating-point value.

d2: is written as a decimal integer and denotes the fractional part of the floating-point value.

d3: is written as a decimal integer and denotes a power of 10.

The "+" symbol (plus sign) is optional.

For example, the number .02173 can be represented as follows:

21.73E-3

The floating-point directives, such as .float and .double, may optionally use hexadecimal floating-point constants instead of decimal constants. A hexadecimal floating-point constant consists of the following elements:

[+|-]0x[1|0].<hex-digits>h0x<hex-digits>

The assembler places the first set of hexadecimal digits (excluding the 0 or 1 preceding the decimal point) in the mantissa field of the floating-point format without attempting to normalize it. It stores the second set of hexadecimal digits in the exponent field without biasing them. If the mantissa appears to be denormalized, it checks to determine whether the exponent is appropriate. Hexadecimal floating-point constants are useful for generating IEEE special symbols and for writing hardware diagnostics.

For example, either of the following directives generates the single-precision number 1.0:

.float 1.0e+0
.float 0x1.0h0x7f

The assembler uses normal (nearest) rounding mode to convert floating-point constants.
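
The two worked examples in this section can be checked outside the assembler. The Python sketch below packs raw sign, exponent, and mantissa fields into an IEEE 754 single-precision value, which is what the hexadecimal form specifies for `0x1.0h0x7f`. The helper name `single_from_fields` is hypothetical. The sketch only uses a zero mantissa; how the assembler aligns longer mantissa digit strings is not modeled here.

```python
import struct


def single_from_fields(sign, exponent, mantissa):
    """Pack raw sign, exponent, and mantissa fields into an IEEE 754 single."""
    bits = ((sign & 1) << 31) | ((exponent & 0xFF) << 23) | (mantissa & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Fields of .float 0x1.0h0x7f: zero mantissa, exponent field 0x7f (stored unbiased).
one = single_from_fields(0, 0x7F, 0)

# In the IEEE single format, exponent field 0xff with a zero mantissa is infinity.
infinity = single_from_fields(0, 0xFF, 0)

# The decimal form 21.73E-3 denotes the same value as .02173.
small = float("21.73E-3")

print(one, infinity, small)
```

The infinity case illustrates why raw fields are convenient for IEEE special values: no decimal constant of the form shown earlier denotes infinity.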

2.4.3 String Constants

All characters except the newline character are allowed in string constants. String constants begin and end with double quotation marks (").

The assembler observes most of the backslash conventions used by the C language. Table 2-1 shows the assembler's backslash conventions.

Table 2-1: Backslash Conventions

Convention	Meaning
\a	Alert (0x07)
\b	Backspace (0x08)
\f	Form feed (0x0c)
\n	Newline (0x0a)
\r	Carriage return (0x0d)
\t	Horizontal tab (0x09)
\v	Vertical tab (0x0b)
\\	Backslash (0x5c)
\"	Quotation mark (0x22)
\'	Single quote (0x27)
\`nnn`	Character whose octal value is `nnn` (where `n` is 0-7)
\X`nn`	Character whose hexadecimal value is `nn` (where `n` is 0-9, a-f, or A-F)

Deviations from C conventions are as follows:

The assembler does not recognize "\?".
The assembler does not recognize the prefix "L" (wide character constant).
The assembler limits hexadecimal constants to two characters.
The assembler allows the leading "x" character in a hexadecimal constant to be either uppercase or lowercase; that is, both \xnn and \Xnn are allowed.

For octal notation, the backslash conventions require three characters when the next character could be confused with the octal number.

For hexadecimal notation, the backslash conventions require two characters when the next character could be confused with the hexadecimal number. Insert a 0 (zero) as the first character of the single-character hexadecimal number when this condition occurs.
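
The padding rules above follow from how many digits an escape may consume. The Python sketch below decodes the body of a string constant (the text between the quotation marks) using the conventions in Table 2-1. It is a model under assumptions, not the assembler's implementation: the name `decode_string` is hypothetical, the input is assumed to be ASCII, and octal values above 0377 are simply rejected.

```python
SIMPLE = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
          "t": 0x09, "v": 0x0B, "\\": 0x5C, '"': 0x22, "'": 0x27}
HEX_DIGITS = "0123456789abcdefABCDEF"
OCT_DIGITS = "01234567"


def decode_string(body):
    """Decode backslash conventions in a string constant body into bytes."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))      # assumption: ASCII input
            i += 1
            continue
        i += 1                       # skip the backslash
        if i >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i]
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
            i += 1
        elif esc in "xX":
            j = i + 1                # at most two hexadecimal digits
            while j < len(body) and j - (i + 1) < 2 and body[j] in HEX_DIGITS:
                j += 1
            if j == i + 1:
                raise ValueError("hexadecimal escape needs a digit")
            out.append(int(body[i + 1:j], 16))
            i = j
        elif esc in OCT_DIGITS:
            j = i                    # at most three octal digits
            while j < len(body) and j - i < 3 and body[j] in OCT_DIGITS:
                j += 1
            out.append(int(body[i:j], 8))  # values above 0377 raise ValueError
            i = j
        else:
            raise ValueError("unrecognized escape: \\" + esc)
    return bytes(out)


print(decode_string(r"\xab"))   # one byte: b absorbed as a second hex digit
print(decode_string(r"\x0ab"))  # two bytes: 0x0a, then the letter b
print(decode_string(r"\71"))    # one byte: octal 71
print(decode_string(r"\0071"))  # two bytes: 0x07, then the digit 1
```

In the second and fourth calls, the leading zeros stop the escape before the following character, which is the padding the text recommends.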

2.5 Multiple Lines Per Physical Line

You can include multiple statements on the same line by separating the statements with semicolons. Note, however, that the assembler does not recognize semicolons as separators when they follow comment symbols (# or /*).

2.6 Statements

The assembler supports the following types of statements:

Null statements
Keyword statements

Each keyword statement can include an optional label, an operation code (mnemonic or directive), and zero or more operands (with an optional comment following the last operand on the statement):

[ label: ] opcode operand [ ; opcode operand; ... ] [ # comment ]

Some keyword statements also support relocation operands (see Section 2.6.4).

2.6.1 Labels

Labels can consist of label definitions or numeric values.

A label definition consists of an identifier followed by a colon. (See Section 2.3 for the rules governing identifiers.) Label definitions assign the current value and type of the location counter to the name. An error results when the name is already defined.
Label definitions always end with a colon. You can put a label definition on a line by itself.
A numeric label is a single numeric value (1-255). Unlike label definitions, the value of a numeric label can be applied to any number of statements in a program. To reference a numeric label, put an f (forward) or a b (backward) immediately after the referencing digit in an instruction, for example, br 7f (which is a forward branch to numeric label 7). The reference directs the assembler to look for the nearest numeric label that corresponds to the specified number in the lexically forward or backward direction.
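
The lookup for a numeric label reference can be sketched as a scan from the referencing line. The Python model below is illustrative only: the function name `resolve_numeric_label` and the sample lines are hypothetical, and treating a label on the referencing line itself as a backward match is an assumption.

```python
import re

LABEL_RE = re.compile(r"\s*(\d+):")  # a numeric label at the start of a line


def resolve_numeric_label(lines, ref_index, ref):
    """Return the index of the line that a reference such as '7f' or '7b' names.

    lines     -- source lines, one statement per line
    ref_index -- index of the line containing the reference
    ref       -- the label number followed by 'f' (forward) or 'b' (backward)
    """
    number, direction = int(ref[:-1]), ref[-1]
    if direction == "f":
        candidates = range(ref_index + 1, len(lines))
    elif direction == "b":
        # Assumption: a label on the referencing line counts as backward.
        candidates = range(ref_index, -1, -1)
    else:
        raise ValueError("reference must end in f or b: %r" % ref)
    for i in candidates:
        match = LABEL_RE.match(lines[i])
        if match and int(match.group(1)) == number:
            return i
    raise LookupError("no numeric label %d in that direction" % number)


program = [
    "7:  addq $1, $2, $3",
    "    beq  $2, 7f",
    "    br   7b",
    "7:  stq  $3, sym1",
]
print(resolve_numeric_label(program, 1, "7f"))
print(resolve_numeric_label(program, 2, "7b"))
```

Both definitions of label 7 coexist without error. Each reference binds to the nearest one in its stated direction.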

2.6.2 Null Statements

A null statement is an empty statement that the assembler ignores. Null statements can have label definitions. For example, the following line has three null statements in it:

label: ; ;

2.6.3 Keyword Statements

A keyword statement contains a predefined keyword. The syntax for the rest of the statement depends on the keyword. Keywords are either assembler instructions (mnemonics) or directives.

Assembler instructions in the main instruction set and the floating-point instruction set are described in Chapter 3 and Chapter 4, respectively. Assembler directives are described in Chapter 5.

2.6.4 Relocation Operands

Relocation operands are generally useful in only two situations:

In application programs in which the programmer needs precise control over scheduling
In source code written for compiler development

Some macro instructions (for example, ldgp) require special coordination between the machine-code instructions and the relocation sequences given to the linker. By using the macro instructions, the assembler programmer relies on the assembler to generate the appropriate relocation sequences.

In some instances, the use of macro instructions may be undesirable. For example, a compiler that supports the generation of assembly language files may not want to defer instruction scheduling to the assembler. Such a compiler will want to schedule some or all of the machine-code instructions. To do this, the compiler must have a mechanism for emitting an object file's relocation sequences without using macro instructions. The mechanism for establishing these sequences is the relocation operand.

A relocation operand can be placed after the normal operand on an assembly language statement:

opcode operand relocation_operand

The syntax of the relocation_operand is as follows:

!relocation_type! sequence_number

relocation_type

Any one of the following relocation types can be specified:

literal
lituse_base
lituse_bytoff
lituse_jsr
gpdisp
gprelhigh
gprellow

The relocation types must be enclosed within a pair of exclamation points (!) and are not case sensitive. See Table 7-11 for descriptions of the different types of relocation operations.

sequence_number: The sequence number is a numeric constant with a value range of 1 to 2147483647. The constant can be base 8, 10, or 16. Bases other than 10 require a prefix (see Section 2.4.1).
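
The syntax rules for a relocation operand can be summarized in a small validator. The Python sketch below is a model of the rules stated above, not the assembler's parser; the name `parse_relocation_operand` is hypothetical.

```python
import re

RELOCATION_TYPES = {"literal", "lituse_base", "lituse_bytoff", "lituse_jsr",
                    "gpdisp", "gprelhigh", "gprellow"}

# !relocation_type! followed by a sequence number at the end of the operand.
RELOC_RE = re.compile(r"!([A-Za-z_]+)!\s*(0[xX][0-9a-fA-F]+|[0-9]+)\s*$")


def parse_relocation_operand(operand):
    """Return (relocation_type, sequence_number) for an operand's trailing part."""
    match = RELOC_RE.search(operand)
    if match is None:
        raise ValueError("no relocation operand in %r" % operand)
    rtype = match.group(1).lower()       # relocation types are not case sensitive
    if rtype not in RELOCATION_TYPES:
        raise ValueError("unknown relocation type %r" % rtype)
    text = match.group(2)
    if text[:2].lower() == "0x":
        seq = int(text, 16)              # hexadecimal prefix
    elif text.startswith("0"):
        seq = int(text, 8)               # leading zero selects octal
    else:
        seq = int(text, 10)
    if not 1 <= seq <= 2147483647:
        raise ValueError("sequence number out of range: %d" % seq)
    return rtype, seq


print(parse_relocation_operand("sym1($gp)!literal!1"))
print(parse_relocation_operand("0($27)!GPDISP!0x10"))
```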

The following examples contain relocation operands in the source code:

Example 1: Referencing multiple lituse_base relocations

# Equivalent C statement:
# sym1 += sym2  (Both external)

# Assembly statements containing macro instructions:
ldq   $1, sym1
ldq   $2, sym2
addq  $1, $2, $3
stq   $3, sym1

# Assembly statements containing machine-code instructions
# requiring relocation operands:
ldq   $1, sym1($gp)!literal!1
ldq   $2, sym2($gp)!literal!2

ldq   $3, sym1($1)!lituse_base!1
ldq   $4, sym2($2)!lituse_base!2
addq  $3, $4, $3
stq   $3, sym1($1)!lituse_base!1

The assembler stores the sym1 and sym2 address constants in the .lita section.

In this example, the code with relocation operands provides better performance than the other code because it saves on register usage and on the length of machine-code instruction sequences.

Example 2: Referencing an ldgp sequence that is scheduled inside a lituse_base relocation

# Assembly statements containing macro instructions:
beq   $2, L
stq   $31, sym
ldgp  $gp, 0($27)
...

# Assembly statements containing machine-code instructions that
# require relocation operands:
ldq   $at, sym($gp)!literal!1
beq   $2, L            # crosses basic block boundary
ldah  $gp, 0($27)!gpdisp!2
stq   $31, sym($at)!lituse_base!1
lda   $gp, 0($gp)!gpdisp!2

In this example, the programmer has elected to schedule the load of the address of sym before the conditional branch.

Example 3: A routine call

# Assembly statements containing macro instructions:
jsr   sym1
ldgp  $gp, 0($ra)

.extern sym1

.text

# Assembly statements containing machine-code instructions that
# require relocation operands:
ldq   $27, sym1($gp)!literal!1
jsr   $26, ($27), sym1!lituse_jsr!1
# as1 puts in an R_HINT for the jsr instruction
ldah  $gp, 0($ra)!gpdisp!2
lda   $gp, 0($gp)!gpdisp!2

In this example, the code with relocation operands does not provide any significant gains over the other code. This example is only provided to show the different coding methods.

2.7 Expressions

An expression is a sequence of symbols that represents a value. Each expression and its result have data types. The assembler does arithmetic in twos complement integers with 64 bits of precision. Expressions follow precedence rules and consist of the following elements:

Operators
Identifiers
Constants

You can also use a single-character string in place of an integer within an expression. For example, the following two pairs of statements are equivalent:

.byte "a" ; .word "a"+0x19
.byte 0x61 ; .word 0x7a

2.7.1 Expression Operators

The assembler supports the operators shown in Table 2-2.

Table 2-2: Expression Operators

Operator	Meaning
+	Addition
-	Subtraction
*	Multiplication
/	Division
%	Remainder
<<	Shift left
>>	Shift right (sign is not extended)
^	Bitwise EXCLUSIVE OR
&	Bitwise AND
|	Bitwise OR
-	Minus (unary)
+	Identity (unary)
~	Complement

2.7.2 Expression Operator Precedence Rules

For the order of operator evaluation within expressions, you can rely on the precedence rules or you can group expressions with parentheses. Unless parentheses enforce precedence, the assembler evaluates all operators of the same precedence strictly from left to right. Because parentheses also designate index registers, ambiguity can arise from parentheses in expressions. To resolve this ambiguity, put a unary + in front of parentheses in expressions.

The assembler has three precedence levels. The following table lists the precedence rules from lowest to highest:

Table 2-3: Operator Precedence

Precedence	Operators
Least binding, lowest precedence	Binary +, -
Intermediate precedence	Binary *, /, %, <<, >>, ^, &, |
Most binding, highest precedence	Unary -, +, ~

Note
The assembler's precedence scheme differs from that of the C language.
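
To make the difference from C concrete, the Python sketch below evaluates constant expressions using the three precedence levels of Table 2-3, left-to-right evaluation within a level, 64-bit twos complement arithmetic, and a right shift that does not extend the sign. It is a model of the stated rules, not the assembler's evaluator. The names `evaluate`, `combine`, and `Parser` are hypothetical, and division that truncates toward zero is an assumption.

```python
import re

MASK = (1 << 64) - 1
TOKEN_RE = re.compile(r"\s*(0[xX][0-9a-fA-F]+|[0-9]+|<<|>>|[-+*/%^&|~()])")
LOW = {"+", "-"}                                   # least binding
MID = {"*", "/", "%", "<<", ">>", "^", "&", "|"}   # intermediate


def to_signed(value):
    """Wrap an integer to a signed 64-bit twos complement value."""
    value &= MASK
    return value - (1 << 64) if value >> 63 else value


def parse_int(tok):
    if tok[:2].lower() == "0x":
        return int(tok, 16)
    return int(tok, 8) if tok.startswith("0") else int(tok, 10)


def combine(op, a, b):
    if op == "+":
        r = a + b
    elif op == "-":
        r = a - b
    elif op == "*":
        r = a * b
    elif op in ("/", "%"):
        q = abs(a) // abs(b)           # assumption: truncate toward zero
        if (a < 0) != (b < 0):
            q = -q
        r = q if op == "/" else a - b * q
    elif op == "<<":
        r = a << b
    elif op == ">>":
        r = (a & MASK) >> b            # sign is not extended
    elif op == "^":
        r = a ^ b
    elif op == "&":
        r = a & b
    else:
        r = a | b
    return to_signed(r)


class Parser:
    def __init__(self, text):
        text = text.rstrip()
        self.tokens, pos = [], 0
        while pos < len(text):
            match = TOKEN_RE.match(text, pos)
            if match is None:
                raise ValueError("bad token at offset %d" % pos)
            self.tokens.append(match.group(1))
            pos = match.end()
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def level(self, operators, operand):
        value = operand()
        while self.peek() in operators:   # same level: strictly left to right
            op = self.take()
            value = combine(op, value, operand())
        return value

    def low(self):
        return self.level(LOW, self.mid)

    def mid(self):
        return self.level(MID, self.unary)

    def unary(self):
        tok = self.take()
        if tok == "-":
            return to_signed(-self.unary())
        if tok == "+":
            return self.unary()
        if tok == "~":
            return to_signed(~self.unary())
        if tok == "(":
            value = self.low()
            if self.take() != ")":
                raise ValueError("missing )")
            return value
        if tok is None or not tok[0].isdigit():
            raise ValueError("operand expected")
        return to_signed(parse_int(tok))


def evaluate(text):
    parser = Parser(text)
    value = parser.low()
    if parser.peek() is not None:
        raise ValueError("unexpected token %r" % parser.peek())
    return value


print(evaluate("1 + 2 << 3"))   # shift binds tighter than addition here
print(evaluate("6 & 3 * 2"))    # & and * share a level: left to right
```

Under these rules `1 + 2 << 3` groups as `1 + (2 << 3)`, whereas C groups it as `(1 + 2) << 3`. Likewise `6 & 3 * 2` groups as `(6 & 3) * 2` rather than C's `6 & (3 * 2)`.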

2.7.3 Data Types

Each symbol you reference or define in an assembly program belongs to one of the type categories shown in Table 2-4.

Table 2-4: Data Types

Type	Description
undefined	Any symbol that is referenced but not defined becomes global undefined. (Declaring such a symbol in a `.globl` directive merely makes its status clearer.)
absolute	A constant defined in an assignment (=) expression.
text	Any symbol defined while the `.text` directive is in effect belongs to the text section. The text section contains the program's instructions, which are not modifiable during execution.
data	Any symbol defined while the `.data` directive is in effect belongs to the data section. The data section contains memory that the linker can initialize to nonzero values before your program begins to execute.
sdata	The type sdata is similar to the type data, except that defining a symbol while the `.sdata` ("small data") directive is in effect causes the linker to place it within the small data section. This increases the chance that the linker will be able to optimize memory references to the item by using gp-relative addressing.
rdata and rconst	Any symbol defined while the `.rdata` or `.rconst` directives are in effect belongs to this category. The only difference between the types rdata and rconst is that the former is allowed to have dynamic relocations and the latter is not. (The types rdata and rconst are also similar to the type data but, unlike data, cannot be modified during execution.)
bss and sbss	Any symbol defined in a `.comm` or `.lcomm` directive belongs to these sections, except that a `.data`, `.sdata`, `.rdata`, or `.rconst` directive can override a `.comm` directive. The `.bss` and `.sbss` sections consist of memory that the kernel loader initializes to zero before your program begins to execute. If a symbol's size is less than the number of bytes specified by the `-G` compilation option (which defaults to eight), it belongs to the `.sbss` section (small bss section), and the linker places it within the small data section. This increases the chance that the linker will be able to optimize memory references to the item by using gp-relative addressing. Local symbols in the `.bss` or `.sbss` sections defined by `.lcomm` directives are allocated memory by the assembler, global symbols are allocated memory by the linker, and symbols defined by `.comm` directives are overlaid upon like-named symbols (in the fashion of Fortran COMMON blocks) by the linker.

Symbols in the undefined category are always global; that is, they are visible to the linker and can be shared with other modules of your program. Symbols in the absolute, text, data, sdata, rdata, rconst, bss, and sbss type categories are local unless declared in a .globl directive.

2.7.4 Type Propagation in Expressions

For any expression, the result's type depends on the types of the operands and the operator. The following type propagation rules are used in expressions:

If an operand is undefined, the result is undefined.
If both operands are absolute, the result is absolute.
If the operator is a plus sign (+) and the first operand refers to an undefined external symbol or a relocatable symbol in a .text section, .data section, or .bss section, the result has the first operand's type and the other operand must be absolute.
If the operator is a minus sign (-) and the first operand refers to a relocatable symbol in a .text section, .data section, or .bss section, the type propagation rules can vary:
- The second operand can be absolute (if it was previously defined) and the result has the first operand's type.
- The second operand can have the same type as the first operand and the result is absolute.
- If the first operand is external undefined, the second operand must be absolute.
The operators *, /, %, <<, >>, ~, ^, &, and | apply only to absolute symbols.
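
The propagation rules for binary operators can be read as a lookup on the operator and the two operand types. The Python sketch below is one possible reading of the rules listed above, not the assembler's implementation. The name `result_type` is hypothetical, only the text, data, and bss relocatable types named in the rules are modeled, and any combination the rules do not explicitly cover is rejected, which is an assumption.

```python
RELOCATABLE = {"text", "data", "bss"}
ABSOLUTE_ONLY = {"*", "/", "%", "<<", ">>", "^", "&", "|"}


def result_type(op, left, right):
    """Return the type of 'left op right', or raise TypeError if not allowed."""
    if op in ABSOLUTE_ONLY:
        if left == "absolute" and right == "absolute":
            return "absolute"
        raise TypeError("%s applies only to absolute operands" % op)
    if left == "absolute" and right == "absolute":
        return "absolute"
    if op == "+" and right == "absolute":
        if left == "undefined" or left in RELOCATABLE:
            return left                  # result keeps the first operand's type
    if op == "-":
        if left in RELOCATABLE and right == "absolute":
            return left
        if left in RELOCATABLE and right == left:
            return "absolute"            # same-section difference is a constant
        if left == "undefined" and right == "absolute":
            return "undefined"
    # Assumption: anything the listed rules do not cover is an error.
    raise TypeError("unsupported operand types: %s %s %s" % (left, op, right))


print(result_type("+", "text", "absolute"))
print(result_type("-", "data", "data"))
```

The second call shows the common idiom of subtracting two labels in the same section to obtain an absolute size or offset.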

2.8 Address Formats

The assembler accepts addresses expressed in the formats described in Table 2-5.

Table 2-5: Address Formats

Format	Address Description
(`base-register`)	Specifies an indexed address, which assumes a zero offset. The base register's contents specify the address.
`expression`	Specifies an absolute address. The assembler generates the most locally efficient code for referencing the value at the specified address.
`expression(base-register)`	Specifies a based address. To get the address, the value of the expression is added to the contents of the base register. The assembler generates the most locally efficient code for referencing the value at the specified address.
`relocatable-symbol`	Specifies a relocatable address. The assembler generates the necessary instructions to address the item and generates relocation information for the linker.
`relocatable-symbol±expression`	Specifies a relocatable address. To get the address, the value of the expression, which has an absolute value, is added to or subtracted from the relocatable symbol. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external.
`relocatable-symbol(index-register)`	Specifies an indexed relocatable address. To get the address, the index register is added to the relocatable symbol's address. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external.
`relocatable-symbol±expression(index-register)`	Specifies an indexed relocatable address. To get the address, the assembler adds the value of the expression to, or subtracts it from, the relocatable symbol's address and then adds the contents of the index register. The assembler generates the necessary instructions to address the item and generates relocation information for the linker. If the symbol name does not appear as a label anywhere in the assembly, the assembler assumes that the symbol is external.