Feeble attempt at updating the documentation; remove Appendix B

Feeble attempt to document 64-bit support. Also, remove Appendix B since we have been utterly useless at keeping it up to date, and it's redundant with the processor manufacturer's documentation anyway.
author: H. Peter Anvin <hpa@zytor.com> 2007-09-11 23:52:01 +0000
committer: H. Peter Anvin <hpa@zytor.com> 2007-09-11 23:52:01 +0000
commit: 9b49e24e1fe1a4afc021f6c3a01720fcabdc47ca (patch)
tree: 19cdcae470bc747d6ffe4b0ce17a1e178fcf5141 /doc
parent: 62cb606f6876b01c5d89ad00b6d3d4a3a2ffccf2 (diff)
2 files changed, 6803 insertions, 6750 deletions
diff --git a/doc/insref.src b/doc/insref.src
new file mode 100644
index 00000000..1406f871
--- /dev/null
+++ b/doc/insref.src
@@ -0,0 +1,6732 @@
+\A{iref} x86 Instruction Reference
+
+This appendix provides a complete list of the machine instructions
+which NASM will assemble, and a short description of the function of
+each one.
+
+It is not intended to be exhaustive documentation of the fine
+details of the instructions' function, such as which exceptions they
+can trigger: for such documentation, you should go to Intel's Web
+site, \W{http://developer.intel.com/design/Pentium4/manuals/}\c{http://developer.intel.com/design/Pentium4/manuals/}.
+
+Instead, this appendix is intended primarily to provide
+documentation on the way the instructions may be used within NASM.
+For example, looking up \c{LOOP} will tell you that NASM allows
+\c{CX} or \c{ECX} to be specified as an optional second argument to
+the \c{LOOP} instruction, to enforce which of the two possible
+counter registers should be used if the default is not the one
+desired.
+
+The instructions are not quite listed in alphabetical order, since
+groups of instructions with similar functions are lumped together in
+the same entry. Most of them don't move very far from their
+alphabetic position because of this.
+
+
+\H{iref-opr} Key to Operand Specifications
+
+The instruction descriptions in this appendix specify their operands
+using the following notation:
+
+\b Registers: \c{reg8} denotes an 8-bit \i{general purpose
+register}, \c{reg16} denotes a 16-bit general purpose register,
+\c{reg32} a 32-bit one and \c{reg64} a 64-bit one. \c{fpureg} denotes
+one of the eight FPU stack registers, \c{mmxreg} denotes one of the
+eight 64-bit MMX registers, and \c{segreg} denotes a segment register.
+\c{xmmreg} denotes one of the 8, or 16 in x64 long mode, SSE XMM registers.
+In addition, some registers (such as \c{AL}, \c{DX}, \c{ECX} or \c{RAX})
+may be specified explicitly.
+
+\b Immediate operands: \c{imm} denotes a generic \i{immediate operand}.
+\c{imm8}, \c{imm16} and \c{imm32} are used when the operand is
+intended to be a specific size. For some of these instructions, NASM
+needs an explicit specifier: for example, \c{ADD ESP,16} could be
+interpreted as either \c{ADD r/m32,imm32} or \c{ADD r/m32,imm8}.
+NASM chooses the former by default, and so you must specify \c{ADD
+ESP,BYTE 16} for the latter. As a special case, particular x64 forms
+of the \c{MOV} instruction also allow an \c{imm64} operand.
+
+\b Memory references: \c{mem} denotes a generic \i{memory reference};
+\c{mem8}, \c{mem16}, \c{mem32}, \c{mem64} and \c{mem80} are used
+when the operand needs to be a specific size. Again, a specifier is
+needed in some cases: \c{DEC [address]} is ambiguous and will be
+rejected by NASM. You must specify \c{DEC BYTE [address]}, \c{DEC
+WORD [address]} or \c{DEC DWORD [address]} instead.
+
+\b \i{Restricted memory references}: one form of the \c{MOV}
+instruction allows a memory address to be specified \e{without}
+allowing the normal range of register combinations and effective
+address processing. This is denoted by \c{memoffs8}, \c{memoffs16},
+\c{memoffs32} or \c{memoffs64}.
+
+\b Register or memory choices: many instructions can accept either a
+register \e{or} a memory reference as an operand. \c{r/m8} is
+shorthand for \c{reg8/mem8}; similarly \c{r/m16} and \c{r/m32}.
+In legacy x86 modes, \c{r/m64} is MMX-related, and is shorthand for
+\c{mmxreg/mem64}. When utilizing the x86-64 architecture extension,
+\c{r/m64} denotes use of a 64-bit GPR as well, and is shorthand for
+\c{reg64/mem64}.
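+
+As a brief illustration of the size specifiers described above, the
+following sketch shows how they look in source code. The label
+\c{counter} is hypothetical, and the default choice between the two
+\c{ADD} forms is the one described in the text above; it may differ
+with other NASM versions or optimization settings.
+
+\c         BITS 32
+\c
+\c         section .text
+\c         add     esp,16          ; r/m32,imm32 form, per the text above
+\c         add     esp,byte 16     ; explicitly requests r/m32,imm8
+\c
+\c         dec     byte [counter]  ; 8-bit memory operand
+\c         dec     word [counter]  ; 16-bit memory operand
+\c         dec     dword [counter] ; 32-bit memory operand
+\c
+\c         section .data
+\c counter: dd     0               ; hypothetical data item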
+
+
+\H{iref-opc} Key to Opcode Descriptions
+
+This appendix also provides the opcodes which NASM will generate for
+each form of each instruction. The opcodes are listed in the
+following way:
+
+\b A hex number, such as \c{3F}, indicates a fixed byte containing
+that number.
+
+\b A hex number followed by \c{+r}, such as \c{C8+r}, indicates that
+one of the operands to the instruction is a register, and the
+`register value' of that register should be added to the hex number
+to produce the generated byte. For example, EDX has register value
+2, so the code \c{C8+r}, when the register operand is EDX, generates
+the hex byte \c{CA}. Register values for specific registers are
+given in \k{iref-rv}.
+
+\b A hex number followed by \c{+cc}, such as \c{40+cc}, indicates
+that the instruction name has a condition code suffix, and the
+numeric representation of the condition code should be added to the
+hex number to produce the generated byte. For example, the code
+\c{40+cc}, when the instruction contains the \c{NE} condition,
+generates the hex byte \c{45}. Condition codes and their numeric
+representations are given in \k{iref-cc}.
+
+\b A slash followed by a digit, such as \c{/2}, indicates that one
+of the operands to the instruction is a memory address or register
+(denoted \c{mem} or \c{r/m}, with an optional size). This is to be
+encoded as an effective address, with a \i{ModR/M byte}, an optional
+\i{SIB byte}, and an optional displacement, and the spare (register)
+field of the ModR/M byte should be the digit given (which will be
+from 0 to 7, so it fits in three bits). The encoding of effective
+addresses is given in \k{iref-ea}.
+
+\b The code \c{/r} combines the above two: it indicates that one of
+the operands is a memory address or \c{r/m}, and another is a
+register, and that an effective address should be generated with the
+spare (register) field in the ModR/M byte being equal to the
+`register value' of the register operand. The encoding of effective
+addresses is given in \k{iref-ea}; register values are given in
+\k{iref-rv}.
+
+\b The codes \c{ib}, \c{iw} and \c{id} indicate that one of the
+operands to the instruction is an immediate value, and that this is
+to be encoded as a byte, little-endian word or little-endian
+doubleword respectively.
+
+\b The codes \c{rb}, \c{rw} and \c{rd} indicate that one of the
+operands to the instruction is an immediate value, and that the
+\e{difference} between this value and the address of the end of the
+instruction is to be encoded as a byte, word or doubleword
+respectively. Where the form \c{rw/rd} appears, it indicates that
+either \c{rw} or \c{rd} should be used according to whether assembly
+is being performed in \c{BITS 16} or \c{BITS 32} state respectively.
+
+\b The codes \c{ow} and \c{od} indicate that one of the operands to
+the instruction is a reference to the contents of a memory address
+specified as an immediate value: this encoding is used in some forms
+of the \c{MOV} instruction in place of the standard
+effective-address mechanism. The displacement is encoded as a word
+or doubleword. Again, \c{ow/od} denotes that \c{ow} or \c{od} should
+be chosen according to the \c{BITS} setting.
+
+\b The codes \c{o16} and \c{o32} indicate that the given form of the
+instruction should be assembled with operand size 16 or 32 bits. In
+other words, \c{o16} indicates a \c{66} prefix in \c{BITS 32} state,
+but generates no code in \c{BITS 16} state; and \c{o32} indicates a
+\c{66} prefix in \c{BITS 16} state but generates nothing in \c{BITS
+32}.
+
+\b The codes \c{a16} and \c{a32}, similarly to \c{o16} and \c{o32},
+indicate the address size of the given form of the instruction.
+Where this does not match the \c{BITS} setting, a \c{67} prefix is
+required. Please note that \c{a16} is useless in long mode as
+16-bit addressing is deprecated on the x86-64 architecture extension.
+
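+As a worked illustration of the notation above, the following sketch
+shows a few instructions together with the bytes the opcode key
+implies. The opcode forms quoted in the comments (\c{0F C8+r} for
+\c{BSWAP}, \c{0F 40+cc /r} for \c{CMOVcc}, \c{EB rb} for a short
+\c{JMP} and \c{o16 89 /r} for \c{MOV}) are taken from the processor
+documentation, and the byte values are worked out by hand from them
+rather than copied from an assembler listing.
+
+\c         BITS 32
+\c
+\c         bswap   edx             ; 0F C8+r, EDX is 2:     0F CA
+\c         cmovne  eax,ebx         ; 0F 40+cc /r, NE is 5:  0F 45 C3
+\c         jmp     short $         ; EB rb, target is 2 bytes before
+\c                                 ; the end of the instruction: EB FE
+\c         mov     ax,bx           ; o16 89 /r in BITS 32:  66 89 D8
+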
+
+\S{iref-rv} Register Values
+
+Where an instruction requires a register value, it is already
+implicit in the encoding of the rest of the instruction what type of
+register is intended: an 8-bit general-purpose register, a segment
+register, a debug register, an MMX register, or whatever. Therefore
+there is no problem with registers of different types sharing an
+encoding value.
+
+Please note that, for the register classes listed below, the register
+extension (REX) classes require the use of the REX prefix, which is
+only available in long mode on an x86-64 processor. This applies, in
+particular, to any register with a value higher than 7.
+
+The encodings for the various classes of register are:
+
+\b 8-bit general registers: \c{AL} is 0, \c{CL} is 1, \c{DL} is 2,
+\c{BL} is 3, \c{AH} is 4, \c{CH} is 5, \c{DH} is 6 and \c{BH} is
+7. Please note that \c{AH}, \c{BH}, \c{CH} and \c{DH} are not
+addressable when using the REX prefix in long mode.
+
+\b 8-bit general register extensions (REX): \c{SPL} is 4, \c{BPL} is 5,
+\c{SIL} is 6, \c{DIL} is 7, \c{R8B} is 8, \c{R9B} is 9, \c{R10B} is 10,
+\c{R11B} is 11, \c{R12B} is 12, \c{R13B} is 13, \c{R14B} is 14 and
+\c{R15B} is 15.
+
+\b 16-bit general registers: \c{AX} is 0, \c{CX} is 1, \c{DX} is 2,
+\c{BX} is 3, \c{SP} is 4, \c{BP} is 5, \c{SI} is 6, and \c{DI} is 7.
+
+\b 16-bit general register extensions (REX): \c{R8W} is 8, \c{R9W} is 9,
+\c{R10W} is 10, \c{R11W} is 11, \c{R12W} is 12, \c{R13W} is 13, \c{R14W}
+is 14 and \c{R15W} is 15.
+
+\b 32-bit general registers: \c{EAX} is 0, \c{ECX} is 1, \c{EDX} is
+2, \c{EBX} is 3, \c{ESP} is 4, \c{EBP} is 5, \c{ESI} is 6, and
+\c{EDI} is 7.
+
+\b 32-bit general register extensions (REX): \c{R8D} is 8, \c{R9D} is 9,
+\c{R10D} is 10, \c{R11D} is 11, \c{R12D} is 12, \c{R13D} is 13, \c{R14D}
+is 14 and \c{R15D} is 15.
+
+\b 64-bit general register extensions (REX): \c{RAX} is 0, \c{RCX} is 1,
+\c{RDX} is 2, \c{RBX} is 3, \c{RSP} is 4, \c{RBP} is 5, \c{RSI} is 6,
+\c{RDI} is 7, \c{R8} is 8, \c{R9} is 9, \c{R10} is 10, \c{R11} is 11,
+\c{R12} is 12, \c{R13} is 13, \c{R14} is 14 and \c{R15} is 15.
+
+\b \i{Segment registers}: \c{ES} is 0, \c{CS} is 1, \c{SS} is 2, \c{DS}
+is 3, \c{FS} is 4, and \c{GS} is 5.
+
+\b \I{floating-point, registers}Floating-point registers: \c{ST0}
+is 0, \c{ST1} is 1, \c{ST2} is 2, \c{ST3} is 3, \c{ST4} is 4,
+\c{ST5} is 5, \c{ST6} is 6, and \c{ST7} is 7.
+
+\b 64-bit \i{MMX registers}: \c{MM0} is 0, \c{MM1} is 1, \c{MM2} is 2,
+\c{MM3} is 3, \c{MM4} is 4, \c{MM5} is 5, \c{MM6} is 6, and \c{MM7}
+is 7.
+
+\b 128-bit \i{XMM (SSE) registers}: \c{XMM0} is 0, \c{XMM1} is 1,
+\c{XMM2} is 2, \c{XMM3} is 3, \c{XMM4} is 4, \c{XMM5} is 5, \c{XMM6} is
+6 and \c{XMM7} is 7.
+
+\b 128-bit \i{XMM (SSE) register} extensions (REX): \c{XMM8} is 8,
+\c{XMM9} is 9, \c{XMM10} is 10, \c{XMM11} is 11, \c{XMM12} is 12,
+\c{XMM13} is 13, \c{XMM14} is 14 and \c{XMM15} is 15.
+
+\b \i{Control registers}: \c{CR0} is 0, \c{CR2} is 2, \c{CR3} is 3,
+and \c{CR4} is 4.
+
+\b \i{Control register} extensions: \c{CR8} is 8.
+
+\b \i{Debug registers}: \c{DR0} is 0, \c{DR1} is 1, \c{DR2} is 2,
+\c{DR3} is 3, \c{DR6} is 6, and \c{DR7} is 7.
+
+\b \i{Test registers}: \c{TR3} is 3, \c{TR4} is 4, \c{TR5} is 5,
+\c{TR6} is 6, and \c{TR7} is 7.
+
+(Note that wherever a register name contains a number, that number
+is also the register value for that register.)
+
+
+\S{iref-cc} \i{Condition Codes}
+
+The available condition codes are given here, along with their
+numeric representations as part of opcodes. Many of these condition
+codes have synonyms, so several will be listed at a time.
+
+In the following descriptions, the word `either', when applied to two
+possible trigger conditions, is used to mean `either or both'. If
+`either but not both' is meant, the phrase `exactly one of' is used.
+
+\b \c{O} is 0 (trigger if the overflow flag is set); \c{NO} is 1.
+
+\b \c{B}, \c{C} and \c{NAE} are 2 (trigger if the carry flag is
+set); \c{AE}, \c{NB} and \c{NC} are 3.
+
+\b \c{E} and \c{Z} are 4 (trigger if the zero flag is set); \c{NE}
+and \c{NZ} are 5.
+
+\b \c{BE} and \c{NA} are 6 (trigger if either of the carry or zero
+flags is set); \c{A} and \c{NBE} are 7.
+
+\b \c{S} is 8 (trigger if the sign flag is set); \c{NS} is 9.
+
+\b \c{P} and \c{PE} are 10 (trigger if the parity flag is set);
+\c{NP} and \c{PO} are 11.
+
+\b \c{L} and \c{NGE} are 12 (trigger if exactly one of the sign and
+overflow flags is set); \c{GE} and \c{NL} are 13.
+
+\b \c{LE} and \c{NG} are 14 (trigger if either the zero flag is set,
+or exactly one of the sign and overflow flags is set); \c{G} and
+\c{NLE} are 15.
+
+Note that in all cases, the sense of a condition code may be
+reversed by changing the low bit of the numeric representation.
+
+For details of when an instruction sets each of the status flags,
+see the individual instruction, plus the Status Flags reference
+in \k{iref-Flags}.
+
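+As an illustration of the numeric representations above, the short
+conditional jump is encoded as \c{70+cc rb} (this opcode form comes
+from the processor documentation). The byte values below are worked
+out by hand from the list above; \c{$} is used as the target only so
+that the displacement byte is the same on every line.
+
+\c         je      short $         ; E is 4:          74 FE
+\c         jz      short $         ; Z is a synonym:  74 FE
+\c         jne     short $         ; NE is 5:         75 FE
+\c         jb      short $         ; B is 2:          72 FE
+\c         jae     short $         ; AE is 3:         73 FE
+\c         jl      short $         ; L is 12:         7C FE
+\c         jge     short $         ; GE is 13:        7D FE
+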
+
+\S{iref-SSE-cc} \i{SSE Condition Predicates}
+
+The condition predicates for SSE comparison instructions are the
+codes used as part of the opcode, to determine what form of
+comparison is being carried out. In each case, the imm8 value is
+the final byte of the opcode encoding, and the predicate is the
+code used as part of the mnemonic for the instruction (equivalent
+to the "cc" in an integer instruction that used a condition code).
+The instructions that use this will give details of what the various
+mnemonics are; this table is intended to help you work out the details
+of what is happening.
+
+\c Predi-  imm8  Description Relation where:   Emula- Result   QNaN
+\c  cate  Encod-             A Is 1st Operand  tion   if NaN   Signal
+\c         ing               B Is 2nd Operand         Operand  Invalid
+\c
+\c EQ     000B   equal       A = B                    False     No
+\c
+\c LT     001B   less-than   A < B                    False     Yes
+\c
+\c LE     010B   less-than-  A <= B                   False     Yes
+\c                or-equal
+\c
+\c ---    ----   greater     A > B             Swap   False     Yes
+\c               than                          Operands,
+\c                                             Use LT
+\c
+\c ---    ----   greater-    A >= B            Swap   False     Yes
+\c               than-or-equal                 Operands,
+\c                                             Use LE
+\c
+\c UNORD  011B   unordered   A, B = Unordered         True      No
+\c
+\c NEQ    100B   not-equal   A != B                   True      No
+\c
+\c NLT    101B   not-less-   NOT(A < B)               True      Yes
+\c               than
+\c
+\c NLE    110B   not-less-   NOT(A <= B)              True      Yes
+\c               than-or-
+\c               equal
+\c
+\c ---    ----   not-greater NOT(A > B)        Swap   True      Yes
+\c               than                          Operands,
+\c                                             Use NLT
+\c
+\c ---    ----   not-greater NOT(A >= B)       Swap   True      Yes
+\c               than-                         Operands,
+\c               or-equal                      Use NLE
+\c
+\c ORD    111B   ordered     A, B = Ordered           False     No
+
+The unordered relationship is true when at least one of the two
+values being compared is a NaN or in an unsupported format.
+
+Note that the comparisons which are listed as not having a predicate
+or encoding can only be achieved through software emulation, as
+described in the "emulation" column. Note in particular that an
+comparison such as \c{greater-than} is not the same as \c{NLE}, as,
+unlike with the \c{CMP} instruction, it has to take into account the
+possibility of one operand containing a NaN or an unsupported numeric
+format.
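+
+As a sketch of how the table is used, assume that \c{XMM0} holds A
+and \c{XMM1} holds B. The \c{CMPPS} opcode form \c{0F C2 /r ib} is
+quoted from the processor documentation, and the register assignment
+is only an assumption made for this example. The first two lines are
+alternative spellings of the same comparison, not a sequence.
+
+\c         cmpps   xmm0,xmm1,1     ; predicate LT (001B): 0F C2 C1 01
+\c         cmpltps xmm0,xmm1       ; the same comparison by mnemonic
+\c
+\c         ; A > B has no predicate: swap the operands and use LT.
+\c         ; The result mask is then left in XMM1 rather than XMM0.
+\c         cmpltps xmm1,xmm0       ; B < A, which is A > B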
+
+
+\S{iref-Flags} \i{Status Flags}
+
+The status flags provide some information about the result of the
+arithmetic instructions. This information can be used by conditional
+instructions (such as \c{Jcc} and \c{CMOVcc}) as well as by some of
+the other instructions (such as \c{ADC} and \c{INTO}).
+
+There are 6 status flags:
+
+\c CF - Carry flag.
+
+Set if an arithmetic operation generates a
+carry or a borrow out of the most-significant bit of the result;
+cleared otherwise. This flag indicates an overflow condition for
+unsigned-integer arithmetic. It is also used in multiple-precision
+arithmetic.
+
+\c PF - Parity flag.
+
+Set if the least-significant byte of the result contains an even
+number of 1 bits; cleared otherwise.
+
+\c AF - Adjust flag.
+
+Set if an arithmetic operation generates a carry or a borrow
+out of bit 3 of the result; cleared otherwise. This flag is used
+in binary-coded decimal (BCD) arithmetic.
+
+\c ZF - Zero flag.
+
+Set if the result is zero; cleared otherwise.
+
+\c SF - Sign flag.
+
+Set equal to the most-significant bit of the result, which is the
+sign bit of a signed integer. (0 indicates a positive value and 1
+indicates a negative value.)
+
+\c OF - Overflow flag.
+
+Set if the integer result is too large a positive number or too
+small a negative number (excluding the sign-bit) to fit in the
+destination operand; cleared otherwise. This flag indicates an
+overflow condition for signed-integer (two's complement) arithmetic.
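+
+The following sketch shows how two 8-bit additions affect the status
+flags. The flag values in the comments are worked out by hand from
+the descriptions above.
+
+\c         mov     al,0xFF
+\c         add     al,1            ; AL = 0x00
+\c                                 ; CF=1 PF=1 AF=1 ZF=1 SF=0 OF=0
+\c
+\c         mov     al,0x7F
+\c         add     al,1            ; AL = 0x80
+\c                                 ; CF=0 PF=0 AF=1 ZF=0 SF=1 OF=1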
+
+
+\S{iref-ea} Effective Address Encoding: \i{ModR/M} and \i{SIB}
+
+An \i{effective address} is encoded in up to three parts: a ModR/M
+byte, an optional SIB byte, and an optional byte, word or doubleword
+displacement field.
+
+The ModR/M byte consists of three fields: the \c{mod} field, ranging
+from 0 to 3, in the upper two bits of the byte, the \c{r/m} field,
+ranging from 0 to 7, in the lower three bits, and the spare
+(register) field in the middle (bit 3 to bit 5). The spare field is
+not relevant to the effective address being encoded, and either
+contains an extension to the instruction opcode or the register
+value of another operand.
+
+The ModR/M system can be used to encode a direct register reference
+rather than a memory access. This is always done by setting the
+\c{mod} field to 3 and the \c{r/m} field to the register value of
+the register in question (it must be a general-purpose register, and
+the size of the register must already be implicit in the encoding of
+the rest of the instruction). In this case, the SIB byte and
+displacement field are both absent.
+
+In 16-bit addressing mode (either \c{BITS 16} with no \c{67} prefix,
+or \c{BITS 32} with a \c{67} prefix), the SIB byte is never used.
+The general rules for \c{mod} and \c{r/m} (there is an exception,
+given below) are:
+
+\b The \c{mod} field gives the length of the displacement field: 0
+means no displacement, 1 means one byte, and 2 means two bytes.
+
+\b The \c{r/m} field encodes the combination of registers to be
+added to the displacement to give the accessed address: 0 means
+\c{BX+SI}, 1 means \c{BX+DI}, 2 means \c{BP+SI}, 3 means \c{BP+DI},
+4 means \c{SI} only, 5 means \c{DI} only, 6 means \c{BP} only, and 7
+means \c{BX} only.
+
+However, there is a special case:
+
+\b If \c{mod} is 0 and \c{r/m} is 6, the effective address encoded
+is not \c{[BP]} as the above rules would suggest, but instead
+\c{[disp16]}: the displacement field is present and is two bytes
+long, and no registers are added to the displacement.
+
+Therefore the effective address \c{[BP]} cannot be encoded as
+efficiently as \c{[BX]}; so if you code \c{[BP]} in a program, NASM
+adds a notional 8-bit zero displacement, and sets \c{mod} to 1,
+\c{r/m} to 6, and the one-byte displacement field to 0.
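+
+The 16-bit rules above can be illustrated with the \c{MOV reg16,r/m16}
+form, \c{8B /r}. The bytes in the comments are worked out by hand
+from the rules, so treat them as the expected output rather than as a
+verified listing.
+
+\c         BITS 16
+\c
+\c         mov     ax,[bx]         ; mod=0, r/m=7:           8B 07
+\c         mov     ax,[bx+si+4]    ; mod=1, r/m=0, disp8:    8B 40 04
+\c         mov     cx,[0x1234]     ; mod=0, r/m=6, disp16:   8B 0E 34 12
+\c         mov     ax,[bp]         ; mod=1, r/m=6, disp8=0:  8B 46 00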
+
+In 32-bit addressing mode (either \c{BITS 16} with a \c{67} prefix,
+or \c{BITS 32} with no \c{67} prefix) the general rules (again,
+there are exceptions) for \c{mod} and \c{r/m} are:
+
+\b The \c{mod} field gives the length of the displacement field: 0
+means no displacement, 1 means one byte, and 2 means four bytes.
+
+\b If only one register is to be added to the displacement, and it
+is not \c{ESP}, the \c{r/m} field gives its register value, and the
+SIB byte is absent. If the \c{r/m} field is 4 (which would encode
+\c{ESP}), the SIB byte is present and gives the combination and
+scaling of registers to be added to the displacement.
+
+If the SIB byte is present, it describes the combination of
+registers (an optional base register, and an optional index register
+scaled by multiplication by 1, 2, 4 or 8) to be added to the
+displacement. The SIB byte is divided into the \c{scale} field, in
+the top two bits, the \c{index} field in the next three, and the
+\c{base} field in the bottom three. The general rules are:
+
+\b The \c{base} field encodes the register value of the base
+register.
+
+\b The \c{index} field encodes the register value of the index
+register, unless it is 4, in which case no index register is used
+(so \c{ESP} cannot be used as an index register).
+
+\b The \c{scale} field encodes the multiplier by which the index
+register is scaled before adding it to the base and displacement: 0
+encodes a multiplier of 1, 1 encodes 2, 2 encodes 4 and 3 encodes 8.
+
+The exceptions to the 32-bit encoding rules are:
+
+\b If \c{mod} is 0 and \c{r/m} is 5, the effective address encoded
+is not \c{[EBP]} as the above rules would suggest, but instead
+\c{[disp32]}: the displacement field is present and is four bytes
+long, and no registers are added to the displacement.
+
+\b If \c{mod} is 0, \c{r/m} is 4 (meaning the SIB byte is present)
+and \c{base} is 5, the effective address encoded is not
+\c{[EBP+index]} as the above rules would suggest, but instead
+\c{[disp32+index]}: the displacement field is present and is four
+bytes long, and there is no base register (but the index register is
+still processed in the normal way).
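+
+The 32-bit rules and their exceptions can be illustrated with the
+\c{MOV reg32,r/m32} form, \c{8B /r}. The bytes in the comments are
+worked out by hand from the rules above, so treat them as the
+expected output rather than as a verified listing.
+
+\c         BITS 32
+\c
+\c         mov     eax,[ebx]        ; mod=0, r/m=3:        8B 03
+\c         mov     eax,[esp]        ; r/m=4, SIB base=4,
+\c                                  ; index=4 (none):      8B 04 24
+\c         mov     eax,[ebx+esi*4]  ; SIB scale=2, index=6,
+\c                                  ; base=3:              8B 04 B3
+\c         mov     ecx,[0x12345678] ; mod=0, r/m=5, disp32:
+\c                                  ; 8B 0D 78 56 34 12
+\c         mov     eax,[esi*4+0x10] ; mod=0, SIB base=5, disp32:
+\c                                  ; 8B 04 B5 10 00 00 00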
+
+
+\S{iref-rex} Register Extensions: The \i{REX} Prefix
+
+The Register Extensions prefix, or \i{REX} for short, is the means
+of accessing extended registers on the x86-64 architecture. \i{REX}
+is considered an instruction prefix, but it must come after all
+other prefixes, immediately before the first opcode byte of the
+instruction. \i{REX} can therefore be thought of as an "Opcode
+Prefix" instead. The \i{REX} prefix itself is a byte with the value
+0x4X, where X is one of 16 different combinations of the actual
+\i{REX} flags.
+
+The \i{REX} prefix flags consist of four 1-bit extension fields.
+These flags are found in the lower nibble of the actual \i{REX}
+prefix byte. Below is the list of \i{REX} prefix flags, from
+high bit to low bit.
+
+\c{REX.W}: When set, this flag indicates the use of a 64-bit operand,
+as opposed to the default of using 32-bit operands as found in 32-bit
+Protected Mode.
+
+\c{REX.R}: When set, this flag extends the \c{reg (spare)} field of
+the \c{ModRM} byte. Overall, this raises the number of addressable
+registers in this field from 8 to 16.
+
+\c{REX.X}: When set, this flag extends the \c{index} field of the
+\c{SIB} byte. Overall, this raises the number of addressable
+registers in this field from 8 to 16.
+
+\c{REX.B}: When set, this flag extends the \c{r/m} field of the
+\c{ModRM} byte or the \c{base} field of the \c{SIB} byte. This flag
+can also represent an extension to the opcode register \c{(+r)}
+field. The determination of which is used varies depending on which
+instruction is used. Overall, this raises the number of addressable
+registers in these fields from 8 to 16.
+
+Internal use of the \i{REX} prefix by the processor is consistent,
+yet non-trivial. Most instructions use the \i{REX} prefix as
+indicated by the above flags. Some instructions require the \i{REX}
+prefix to be present even if the flags are empty. Some instructions
+default to a 64-bit operand and require the \i{REX} prefix only for
+actual register extensions, and thus ignore the \c{REX.W} field
+completely.
+
+At any rate, NASM is designed to handle, and fully supports, the
+\i{REX} prefix internally. Please read the appropriate processor
+documentation for further information on the \i{REX} prefix.
+
+You may have noticed that opcodes 0x40 through 0x4F are actually
+opcodes for the INC/DEC instructions for each General Purpose
+Register. This is, of course, correct... for legacy x86. While
+in long mode, opcodes 0x40 through 0x4F are reserved for use as
+the REX prefix. The other opcode forms of the INC/DEC instructions
+are used instead.
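+
+As an illustration of the flags described above, the following sketch
+shows some instructions in \c{BITS 64} mode together with the prefix
+each one needs. The bytes in the comments are worked out by hand from
+the \c{MOV} forms \c{88 /r}, \c{89 /r} and \c{8B /r}, the \c{PUSH}
+form \c{50+r} and the \c{INC} form \c{FF /0}, all quoted from the
+processor documentation; treat them as the expected output rather
+than as a verified listing.
+
+\c         BITS 64
+\c
+\c         mov     rax,rbx         ; REX.W:           48 89 D8
+\c         mov     r8,rax          ; REX.W and REX.B: 49 89 C0
+\c         mov     rax,r8          ; REX.W and REX.R: 4C 89 C0
+\c         mov     rax,[rbx+r9*2]  ; REX.W and REX.X: 4A 8B 04 4B
+\c         mov     sil,al          ; REX, no flags:   40 88 C6
+\c         push    r9              ; REX.B only:      41 51
+\c         inc     eax             ; FF /0 form:      FF C0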
+
+
+\H{iref-flg} Key to Instruction Flags
+
+Given along with each instruction in this appendix is a set of
+flags, denoting the type of the instruction. The types are as follows:
+
+\b \c{8086}, \c{186}, \c{286}, \c{386}, \c{486}, \c{PENT} and \c{P6}
+denote the lowest processor type that supports the instruction. Most
+instructions run on all processors above the given type; those that
+do not are documented. The Pentium II contains no additional
+instructions beyond the P6 (Pentium Pro); from the point of view of
+its instruction set, it can be thought of as a P6 with MMX
+capability.
+
+\b \c{3DNOW} indicates that the instruction is a 3DNow! one, and will
+run on the AMD K6-2 and later processors. ATHLON extensions to the
+3DNow! instruction set are documented as such.
+
+\b \c{CYRIX} indicates that the instruction is specific to Cyrix
+processors, for example the extra MMX instructions in the Cyrix
+extended MMX instruction set.
+
+\b \c{FPU} indicates that the instruction is a floating-point one,
+and will only run on machines with a coprocessor (automatically
+including 486DX, Pentium and above).
+
+\b \c{KATMAI} indicates that the instruction was introduced as part
+of the Katmai New Instruction set. These instructions are available
+on the Pentium III and later processors. Those which are not
+specifically SSE instructions are also available on the AMD Athlon.
+
+\b \c{MMX} indicates that the instruction is an MMX one, and will
+run on MMX-capable Pentium processors and the Pentium II.
+
+\b \c{PRIV} indicates that the instruction is a protected-mode
+management instruction. Many of these may only be used in protected
+mode, or only at privilege level zero.
+
+\b \c{SSE} and \c{SSE2} indicate that the instruction is a Streaming
+SIMD Extension instruction. These instructions operate on multiple
+values in a single operation. SSE was introduced with the Pentium III
+and SSE2 was introduced with the Pentium 4.
+
+\b \c{UNDOC} indicates that the instruction is an undocumented one,
+and not part of the official Intel Architecture; it may or may not
+be supported on any given machine.
+
+\b \c{WILLAMETTE} indicates that the instruction was introduced as
+part of the new instruction set in the Pentium 4 and Intel Xeon
+processors. These instructions are also known as SSE2 instructions.
+
+\b \c{X64} indicates that the instruction was introduced as part of
+the new instruction set in the x86-64 architecture extension,
+commonly referred to as x64, AMD64 or EM64T.
+
+
+\H{iref-inst} x86 Instruction Set
+
+
+\S{insAAA} \i\c{AAA}, \i\c{AAS}, \i\c{AAM}, \i\c{AAD}: ASCII
+Adjustments
+
+\c AAA                           ; 37                   [8086]
+
+\c AAS                           ; 3F                   [8086]
+
+\c AAD                           ; D5 0A                [8086]
+\c AAD imm                       ; D5 ib                [8086]
+
+\c AAM                           ; D4 0A                [8086]
+\c AAM imm                       ; D4 ib                [8086]
+
+These instructions are used in conjunction with the add, subtract,
+multiply and divide instructions to perform binary-coded decimal
+arithmetic in \e{unpacked} (one BCD digit per byte - easy to
+translate to and from \c{ASCII}, hence the instruction names) form.
+There are also packed BCD instructions \c{DAA} and \c{DAS}: see
+\k{insDAA}.
+
+\b \c{AAA} (ASCII Adjust After Addition) should be used after a
+one-byte \c{ADD} instruction whose destination was the \c{AL}
+register: by means of examining the value in the low nibble of
+\c{AL} and also the auxiliary carry flag \c{AF}, it determines
+whether the addition has overflowed, and adjusts it (and sets
+the carry flag) if so. You can add long BCD strings together
+by doing \c{ADD}/\c{AAA} on the low digits, then doing
+\c{ADC}/\c{AAA} on each subsequent digit.
+
+\b \c{AAS} (ASCII Adjust AL After Subtraction) works similarly to
+\c{AAA}, but is for use after \c{SUB} instructions rather than
+\c{ADD}.
+
+\b \c{AAM} (ASCII Adjust AX After Multiply) is for use after you
+have multiplied two decimal digits together and left the result
+in \c{AL}: it divides \c{AL} by ten and stores the quotient in
+\c{AH}, leaving the remainder in \c{AL}. The divisor 10 can be
+changed by specifying an operand to the instruction: a particularly
+handy use of this is \c{AAM 16}, causing the two nibbles in \c{AL}
+to be separated into \c{AH} and \c{AL}.
+
+\b \c{AAD} (ASCII Adjust AX Before Division) performs the inverse
+operation to \c{AAM}: it multiplies \c{AH} by ten, adds it to
+\c{AL}, and sets \c{AH} to zero. Again, the multiplier 10 can
+be changed.
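+
+The following sketch works through each adjustment on small values.
+The register contents in the comments are worked out by hand from the
+descriptions above, and \c{BITS 16} is an arbitrary choice made for
+the example.
+
+\c         BITS 16
+\c
+\c         mov     ax,0x0009       ; AH = 0, AL = 9
+\c         add     al,8            ; AL = 0x11, AF set
+\c         aaa                     ; AX = 0x0107, CF set: decimal 17
+\c
+\c         mov     al,7
+\c         mov     bl,6
+\c         mul     bl              ; AX = 0x002A (42)
+\c         aam                     ; AH = 4, AL = 2
+\c
+\c         mov     ax,0x0402       ; unpacked BCD 42
+\c         aad                     ; AL = 0x2A (42), AH = 0
+\c
+\c         mov     al,0x5C
+\c         aam     16              ; AH = 0x05, AL = 0x0C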
+
+
+\S{insADC} \i\c{ADC}: Add with Carry
+
+\c ADC r/m8,reg8                 ; 10 /r                [8086]
+\c ADC r/m16,reg16               ; o16 11 /r            [8086]
+\c ADC r/m32,reg32               ; o32 11 /r            [386]
+
+\c ADC reg8,r/m8                 ; 12 /r                [8086]
+\c ADC reg16,r/m16               ; o16 13 /r            [8086]
+\c ADC reg32,r/m32               ; o32 13 /r            [386]
+
+\c ADC r/m8,imm8                 ; 80 /2 ib             [8086]
+\c ADC r/m16,imm16               ; o16 81 /2 iw         [8086]
+\c ADC r/m32,imm32               ; o32 81 /2 id         [386]
+
+\c ADC r/m16,imm8                ; o16 83 /2 ib         [8086]
+\c ADC r/m32,imm8                ; o32 83 /2 ib         [386]
+
+\c ADC AL,imm8                   ; 14 ib                [8086]
+\c ADC AX,imm16                  ; o16 15 iw            [8086]
+\c ADC EAX,imm32                 ; o32 15 id            [386]
+
+\c{ADC} performs integer addition: it adds its two operands
+together, plus the value of the carry flag, and leaves the result in
+its destination (first) operand. The destination operand can be a
+register or a memory location. The source operand can be a register,
+a memory location or an immediate value.
+
+The flags are set according to the result of the operation: in
+particular, the carry flag is affected and can be used by a
+subsequent \c{ADC} instruction.
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
+
+To add two numbers without also adding the contents of the carry
+flag, use \c{ADD} (\k{insADD}).
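+
+A typical use of \c{ADC} is multiple-precision addition. The sketch
+below adds the 64-bit value held in \c{EBX:ECX} to the one held in
+\c{EDX:EAX}; this register assignment is only an assumption made for
+the example.
+
+\c         BITS 32
+\c
+\c         add     eax,ecx         ; add the low doublewords, setting CF
+\c         adc     edx,ebx         ; add the high doublewords plus CF
+\c
+\c         adc     edx,byte -1     ; r/m32,imm8 form: the immediate is
+\c                                 ; sign-extended to 32 bits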
+
+
+\S{insADD} \i\c{ADD}: Add Integers
+
+\c ADD r/m8,reg8                 ; 00 /r                [8086]
+\c ADD r/m16,reg16               ; o16 01 /r            [8086]
+\c ADD r/m32,reg32               ; o32 01 /r            [386]
+
+\c ADD reg8,r/m8                 ; 02 /r                [8086]
+\c ADD reg16,r/m16               ; o16 03 /r            [8086]
+\c ADD reg32,r/m32               ; o32 03 /r            [386]
+
+\c ADD r/m8,imm8                 ; 80 /0 ib             [8086]
+\c ADD r/m16,imm16               ; o16 81 /0 iw         [8086]
+\c ADD r/m32,imm32               ; o32 81 /0 id         [386]
+
+\c ADD r/m16,imm8                ; o16 83 /0 ib         [8086]
+\c ADD r/m32,imm8                ; o32 83 /0 ib         [386]
+
+\c ADD AL,imm8                   ; 04 ib                [8086]
+\c ADD AX,imm16                  ; o16 05 iw            [8086]
+\c ADD EAX,imm32                 ; o32 05 id            [386]
+
+\c{ADD} performs integer addition: it adds its two operands
+together, and leaves the result in its destination (first) operand.
+The destination operand can be a register or a memory location.
+The source operand can be a register, a memory location or an
+immediate value.
+
+The flags are set according to the result of the operation: in
+particular, the carry flag is affected and can be used by a
+subsequent \c{ADC} instruction.
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
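+
+As an illustration of the two immediate forms, assuming 32-bit code
+and an arbitrarily chosen register, the following fragment shows the
+short and the full-size encoding. The byte values in the comments
+follow the standard Intel encoding of \c{ADD}; a NASM build with
+optimisation enabled may select the short form for small constants
+even without the \c{BYTE} qualifier.
+
+\c         BITS 32
+\c         add     ebx,byte 4      ; 83 C3 04: sign-extended imm8
+\c         add     ebx,0x12345678  ; 81 C3 78 56 34 12: full imm32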
+
+
+\S{insADDPD} \i\c{ADDPD}: ADD Packed Double-Precision FP Values
+
+\c ADDPD xmm1,xmm2/mem128        ; 66 0F 58 /r     [WILLAMETTE,SSE2]
+
+\c{ADDPD} performs addition on each of two packed double-precision
+FP value pairs.
+
+\c    dst[0-63]   := dst[0-63]   + src[0-63],
+\c    dst[64-127] := dst[64-127] + src[64-127].
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
+
+
+\S{insADDPS} \i\c{ADDPS}: ADD Packed Single-Precision FP Values
+
+\c ADDPS xmm1,xmm2/mem128        ; 0F 58 /r        [KATMAI,SSE]
+
+\c{ADDPS} performs addition on each of four packed single-precision
+FP value pairs.
+
+\c    dst[0-31]   := dst[0-31]   + src[0-31],
+\c    dst[32-63]  := dst[32-63]  + src[32-63],
+\c    dst[64-95]  := dst[64-95]  + src[64-95],
+\c    dst[96-127] := dst[96-127] + src[96-127].
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
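+
+A minimal sketch of a packed addition, assuming 32-bit code; the
+labels \c{vec_a}, \c{vec_b} and \c{vec_sum} are hypothetical, and the
+data is aligned to 16 bytes because \c{MOVAPS} and a memory source
+operand of \c{ADDPS} require it:
+
+\c         BITS 32
+\c         section .text
+\c         movaps  xmm0,[vec_a]    ; load four single-precision values
+\c         addps   xmm0,[vec_b]    ; add the four pairs element-wise
+\c         movaps  [vec_sum],xmm0  ; store the four sums
+\c         section .data
+\c         align   16
+\c vec_a:   dd     1.0, 2.0, 3.0, 4.0
+\c vec_b:   dd     0.5, 0.5, 0.5, 0.5
+\c vec_sum: dd     0.0, 0.0, 0.0, 0.0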
+
+
+\S{insADDSD} \i\c{ADDSD}: ADD Scalar Double-Precision FP Values
+
+\c ADDSD xmm1,xmm2/mem64         ; F2 0F 58 /r     [WILLAMETTE,SSE2]
+
+\c{ADDSD} adds the low double-precision FP values from the source
+and destination operands and stores the double-precision FP result
+in the destination operand.
+
+\c    dst[0-63]   := dst[0-63] + src[0-63],
+\c    dst[64-127] remains unchanged.
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 64-bit memory location.
+
+
+\S{insADDSS} \i\c{ADDSS}: ADD Scalar Single-Precision FP Values
+
+\c ADDSS xmm1,xmm2/mem32         ; F3 0F 58 /r     [KATMAI,SSE]
+
+\c{ADDSS} adds the low single-precision FP values from the source
+and destination operands and stores the single-precision FP result
+in the destination operand.
+
+\c    dst[0-31]   := dst[0-31] + src[0-31],
+\c    dst[32-127] remains unchanged.
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 32-bit memory location.
+
+
+\S{insAND} \i\c{AND}: Bitwise AND
+
+\c AND r/m8,reg8                 ; 20 /r                [8086]
+\c AND r/m16,reg16               ; o16 21 /r            [8086]
+\c AND r/m32,reg32               ; o32 21 /r            [386]
+
+\c AND reg8,r/m8                 ; 22 /r                [8086]
+\c AND reg16,r/m16               ; o16 23 /r            [8086]
+\c AND reg32,r/m32               ; o32 23 /r            [386]
+
+\c AND r/m8,imm8                 ; 80 /4 ib             [8086]
+\c AND r/m16,imm16               ; o16 81 /4 iw         [8086]
+\c AND r/m32,imm32               ; o32 81 /4 id         [386]
+
+\c AND r/m16,imm8                ; o16 83 /4 ib         [8086]
+\c AND r/m32,imm8                ; o32 83 /4 ib         [386]
+
+\c AND AL,imm8                   ; 24 ib                [8086]
+\c AND AX,imm16                  ; o16 25 iw            [8086]
+\c AND EAX,imm32                 ; o32 25 id            [386]
+
+\c{AND} performs a bitwise AND operation between its two operands
+(i.e. each bit of the result is 1 if and only if the corresponding
+bits of the two inputs were both 1), and stores the result in the
+destination (first) operand. The destination operand can be a
+register or a memory location. The source operand can be a register,
+a memory location or an immediate value.
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
+
+The \c{MMX} instruction \c{PAND} (see \k{insPAND}) performs the same
+operation on the 64-bit \c{MMX} registers.
+
+
+\S{insANDNPD} \i\c{ANDNPD}: Bitwise Logical AND NOT of
+Packed Double-Precision FP Values
+
+\c ANDNPD xmm1,xmm2/mem128       ; 66 0F 55 /r     [WILLAMETTE,SSE2]
+
+\c{ANDNPD} inverts the bits of the two double-precision
+floating-point values in the destination register, and then
+performs a logical AND between the two double-precision
+floating-point values in the source operand and the temporary
+inverted result, storing the result in the destination register.
+
+\c    dst[0-63]   := src[0-63]   AND NOT dst[0-63],
+\c    dst[64-127] := src[64-127] AND NOT dst[64-127].
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
+
+
+\S{insANDNPS} \i\c{ANDNPS}: Bitwise Logical AND NOT of
+Packed Single-Precision FP Values
+
+\c ANDNPS xmm1,xmm2/mem128       ; 0F 55 /r        [KATMAI,SSE]
+
+\c{ANDNPS} inverts the bits of the four single-precision
+floating-point values in the destination register, and then
+performs a logical AND between the four single-precision
+floating-point values in the source operand and the temporary
+inverted result, storing the result in the destination register.
+
+\c    dst[0-31]   := src[0-31]   AND NOT dst[0-31],
+\c    dst[32-63]  := src[32-63]  AND NOT dst[32-63],
+\c    dst[64-95]  := src[64-95]  AND NOT dst[64-95],
+\c    dst[96-127] := src[96-127] AND NOT dst[96-127].
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
+
+
+\S{insANDPD} \i\c{ANDPD}: Bitwise Logical AND For Double FP
+
+\c ANDPD xmm1,xmm2/mem128        ; 66 0F 54 /r     [WILLAMETTE,SSE2]
+
+\c{ANDPD} performs a bitwise logical AND of the two double-precision
+floating point values in the source and destination operands, and
+stores the result in the destination register.
+
+\c    dst[0-63]   := src[0-63]   AND dst[0-63],
+\c    dst[64-127] := src[64-127] AND dst[64-127].
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
+
+
+\S{insANDPS} \i\c{ANDPS}: Bitwise Logical AND For Single FP
+
+\c ANDPS xmm1,xmm2/mem128        ; 0F 54 /r        [KATMAI,SSE]
+
+\c{ANDPS} performs a bitwise logical AND of the four single-precision
+floating point values in the source and destination operands, and
+stores the result in the destination register.
+
+\c    dst[0-31]   := src[0-31]   AND dst[0-31],
+\c    dst[32-63]  := src[32-63]  AND dst[32-63],
+\c    dst[64-95]  := src[64-95]  AND dst[64-95],
+\c    dst[96-127] := src[96-127] AND dst[96-127].
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
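+
+One well-known use of \c{ANDPS} is to clear the sign bit of each
+element, giving its absolute value. The following is a sketch only,
+assuming 32-bit code; the labels \c{values} and \c{abs_mask} are
+hypothetical:
+
+\c         BITS 32
+\c         section .text
+\c         movaps  xmm0,[values]
+\c         andps   xmm0,[abs_mask] ; clear bit 31 of each element
+\c         section .data
+\c         align   16
+\c values:   dd    -1.5, 2.0, -3.25, 4.0
+\c abs_mask: times 4 dd 0x7FFFFFFF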
+
+
+\S{insARPL} \i\c{ARPL}: Adjust RPL Field of Selector
+
+\c ARPL r/m16,reg16              ; 63 /r                [286,PRIV]
+
+\c{ARPL} expects its two word operands to be segment selectors. It
+adjusts the \i\c{RPL} (requested privilege level - stored in the bottom
+two bits of the selector) field of the destination (first) operand
+to ensure that it is no less than (i.e. no more privileged than) the \c{RPL}
+field of the source operand. The zero flag is set if and only if a
+change had to be made.
+
+
+\S{insBOUND} \i\c{BOUND}: Check Array Index against Bounds
+
+\c BOUND reg16,mem               ; o16 62 /r            [186]
+\c BOUND reg32,mem               ; o32 62 /r            [386]
+
+\c{BOUND} expects its second operand to point to an area of memory
+containing two signed values of the same size as its first operand
+(i.e. two words for the 16-bit form; two doublewords for the 32-bit
+form). It performs two signed comparisons: if the value in the
+register passed as its first operand is less than the first of the
+in-memory values, or is greater than the second, it
+throws a \c{BR} exception. Otherwise, it does nothing.
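+
+As a small sketch, assuming 32-bit code and a hypothetical label
+\c{limits} holding the two bounds, an index check might be written
+like this:
+
+\c         BITS 32
+\c         section .text
+\c         mov     eax,5           ; index to check
+\c         bound   eax,[limits]    ; 5 lies between the bounds: no fault
+\c         section .data
+\c limits: dd      0, 9            ; lower bound, then upper bound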
+
+
+\S{insBSF} \i\c{BSF}, \i\c{BSR}: Bit Scan
+
+\c BSF reg16,r/m16               ; o16 0F BC /r         [386]
+\c BSF reg32,r/m32               ; o32 0F BC /r         [386]
+
+\c BSR reg16,r/m16               ; o16 0F BD /r         [386]
+\c BSR reg32,r/m32               ; o32 0F BD /r         [386]
+
+\b \c{BSF} searches for the least significant set bit in its source
+(second) operand, and if it finds one, stores the index in
+its destination (first) operand. If no set bit is found, the
+contents of the destination operand are undefined. If the source
+operand is zero, the zero flag is set.
+
+\b \c{BSR} performs the same function, but searches from the top
+instead, so it finds the most significant set bit.
+
+Bit indices are from 0 (least significant) to 15 or 31 (most
+significant). The destination operand can only be a register.
+The source operand can be a register or a memory location.
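+
+For illustration, assuming 32-bit code (the registers and the label
+\c{start} are arbitrary choices), the two scans of the same value
+give:
+
+\c         BITS 32
+\c start:  mov     eax,0x28        ; binary 101000: bits 3 and 5 set
+\c         bsf     ecx,eax         ; ECX = 3 (lowest set bit)
+\c         bsr     edx,eax         ; EDX = 5 (highest set bit)
+\c         bsf     ecx,ebx
+\c         jz      .none           ; EBX was zero: ECX is undefined
+\c .none: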
+
+
+\S{insBSWAP} \i\c{BSWAP}: Byte Swap
+
+\c BSWAP reg32                   ; o32 0F C8+r          [486]
+
+\c{BSWAP} swaps the order of the four bytes of a 32-bit register:
+bits 0-7 exchange places with bits 24-31, and bits 8-15 swap with
+bits 16-23. There is no explicit 16-bit equivalent: to byte-swap
+\c{AX}, \c{BX}, \c{CX} or \c{DX}, \c{XCHG} can be used. When \c{BSWAP}
+is used with a 16-bit register, the result is undefined.
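+
+A short sketch of both cases, assuming 32-bit code and arbitrarily
+chosen registers:
+
+\c         BITS 32
+\c         mov     eax,0x12345678
+\c         bswap   eax             ; EAX = 0x78563412
+\c         mov     bx,0x1234
+\c         xchg    bl,bh           ; BX = 0x3412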
+
+
+\S{insBT} \i\c{BT}, \i\c{BTC}, \i\c{BTR}, \i\c{BTS}: Bit Test
+
+\c BT r/m16,reg16                ; o16 0F A3 /r         [386]
+\c BT r/m32,reg32                ; o32 0F A3 /r         [386]
+\c BT r/m16,imm8                 ; o16 0F BA /4 ib      [386]
+\c BT r/m32,imm8                 ; o32 0F BA /4 ib      [386]
+
+\c BTC r/m16,reg16               ; o16 0F BB /r         [386]
+\c BTC r/m32,reg32               ; o32 0F BB /r         [386]
+\c BTC r/m16,imm8                ; o16 0F BA /7 ib      [386]
+\c BTC r/m32,imm8                ; o32 0F BA /7 ib      [386]
+
+\c BTR r/m16,reg16               ; o16 0F B3 /r         [386]
+\c BTR r/m32,reg32               ; o32 0F B3 /r         [386]
+\c BTR r/m16,imm8                ; o16 0F BA /6 ib      [386]
+\c BTR r/m32,imm8                ; o32 0F BA /6 ib      [386]
+
+\c BTS r/m16,reg16               ; o16 0F AB /r         [386]
+\c BTS r/m32,reg32               ; o32 0F AB /r         [386]
+\c BTS r/m16,imm8                ; o16 0F BA /5 ib      [386]
+\c BTS r/m32,imm8                ; o32 0F BA /5 ib      [386]
+
+These instructions all test one bit of their first operand, whose
+index is given by the second operand, and store the value of that
+bit into the carry flag. Bit indices are from 0 (least significant)
+to 15 or 31 (most significant).
+
+In addition to storing the original value of the bit into the carry
+flag, \c{BTR} also resets (clears) the bit in the operand itself.
+\c{BTS} sets the bit, and \c{BTC} complements the bit. \c{BT} does
+not modify its operands.
+
+The destination can be a register or a memory location. The source can
+be a register or an immediate value.
+
+If the destination operand is a register, the bit offset should be
+in the range 0-15 (for 16-bit operands) or 0-31 (for 32-bit operands).
+An immediate value outside these ranges will be taken modulo 16/32
+by the processor.
+
+If the destination operand is a memory location, then an immediate
+bit offset follows the same rules as for a register. If the bit offset
+is in a register, then it can be anything within the signed range of
+the register used (i.e. for a 32-bit operand, from -2^31 to 2^31 - 1).
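+
+The effect of the four instructions on a register operand can be
+sketched as follows, assuming 32-bit code; the comments show the
+carry flag and \c{EAX} after each step:
+
+\c         BITS 32
+\c         xor     eax,eax         ; EAX = 0
+\c         bts     eax,4           ; CF = 0 (old bit), EAX = 0x10
+\c         bt      eax,4           ; CF = 1, EAX unchanged
+\c         btc     eax,4           ; CF = 1, EAX = 0
+\c         btr     eax,4           ; CF = 0, EAX = 0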
+
+
+\S{insCALL} \i\c{CALL}: Call Subroutine
+
+\c CALL imm                      ; E8 rw/rd             [8086]
+\c CALL imm:imm16                ; o16 9A iw iw         [8086]
+\c CALL imm:imm32                ; o32 9A id iw         [386]
+\c CALL FAR mem16                ; o16 FF /3            [8086]
+\c CALL FAR mem32                ; o32 FF /3            [386]
+\c CALL r/m16                    ; o16 FF /2            [8086]
+\c CALL r/m32                    ; o32 FF /2            [386]
+
+\c{CALL} calls a subroutine, by means of pushing the current
+instruction pointer (\c{IP}) and optionally \c{CS} as well on the
+stack, and then jumping to a given address.
+
+\c{CS} is pushed as well as \c{IP} if and only if the call is a far
+call, i.e. a destination segment address is specified in the
+instruction. The forms involving two colon-separated arguments are
+far calls; so are the \c{CALL FAR mem} forms.
+
+The immediate \i{near call} takes one of two forms (\c{CALL imm16} or
+\c{CALL imm32}), determined by the current segment size limit. For
+16-bit operands, you would use \c{CALL 0x1234}, and for 32-bit
+operands you would use \c{CALL 0x12345678}. The value passed as an
+operand is a relative offset.
+
+You can choose between the two immediate \i{far call} forms
+(\c{CALL imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords:
+\c{CALL WORD 0x1234:0x5678} or \c{CALL DWORD 0x1234:0x56789abc}.
+
+The \c{CALL FAR mem} forms execute a far call by loading the
+destination address out of memory. The address loaded consists of 16
+or 32 bits of offset (depending on the operand size), and 16 bits of
+segment. The operand size may be overridden using \c{CALL WORD FAR
+mem} or \c{CALL DWORD FAR mem}.
+
+The \c{CALL r/m} forms execute a \i{near call} (within the same
+segment), loading the destination address out of memory or out of a
+register. The keyword \c{NEAR} may be specified, for clarity, in
+these forms, but is not necessary. Again, operand size can be
+overridden using \c{CALL WORD mem} or \c{CALL DWORD mem}.
+
+As a convenience, NASM does not require you to call a far procedure
+symbol by coding the cumbersome \c{CALL SEG routine:routine}, but
+instead allows the easier synonym \c{CALL FAR routine}.
+
+The \c{CALL r/m} forms given above are near calls; NASM will accept
+the \c{NEAR} keyword (e.g. \c{CALL NEAR [address]}), even though it
+is not strictly necessary.
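+
+The near forms might be used as in the following sketch, which
+assumes 32-bit code; the labels \c{start}, \c{routine} and \c{vector}
+are hypothetical:
+
+\c         BITS 32
+\c start:  call    routine         ; direct near call
+\c         mov     eax,routine
+\c         call    eax             ; near call through a register
+\c         call    dword [vector]  ; near call through memory
+\c         ret
+\c routine: ret
+\c vector: dd      routine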
+
+
+\S{insCBW} \i\c{CBW}, \i\c{CWD}, \i\c{CDQ}, \i\c{CWDE}: Sign Extensions
+
+\c CBW                           ; o16 98               [8086]
+\c CWDE                          ; o32 98               [386]
+
+\c CWD                           ; o16 99               [8086]
+\c CDQ                           ; o32 99               [386]
+
+All these instructions sign-extend a short value into a longer one,
+by replicating the top bit of the original value to fill the
+extended one.
+
+\c{CBW} extends \c{AL} into \c{AX} by repeating the top bit of
+\c{AL} in every bit of \c{AH}. \c{CWDE} extends \c{AX} into
+\c{EAX}. \c{CWD} extends \c{AX} into \c{DX:AX} by repeating
+the top bit of \c{AX} throughout \c{DX}, and \c{CDQ} extends
+\c{EAX} into \c{EDX:EAX}.
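+
+As an illustration, assuming 32-bit code, the following fragment
+shows the extensions on negative values, and the common use of
+\c{CDQ} to prepare \c{EDX:EAX} for a signed division:
+
+\c         BITS 32
+\c         mov     al,-2           ; AL = 0xFE
+\c         cbw                     ; AX = 0xFFFE
+\c         cwd                     ; DX:AX = 0xFFFF:0xFFFE
+\c         mov     eax,-100
+\c         cdq                     ; EDX = 0xFFFFFFFF
+\c         mov     ecx,7
+\c         idiv    ecx             ; EAX = -14, EDX = -2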
+
+
+\S{insCLC} \i\c{CLC}, \i\c{CLD}, \i\c{CLI}, \i\c{CLTS}: Clear Flags
+
+\c CLC                           ; F8                   [8086]
+\c CLD                           ; FC                   [8086]
+\c CLI                           ; FA                   [8086]
+\c CLTS                          ; 0F 06                [286,PRIV]
+
+These instructions clear various flags. \c{CLC} clears the carry
+flag; \c{CLD} clears the direction flag; \c{CLI} clears the
+interrupt flag (thus disabling interrupts); and \c{CLTS} clears the
+task-switched (\c{TS}) flag in \c{CR0}.
+
+To set the carry, direction, or interrupt flags, use the \c{STC},
+\c{STD} and \c{STI} instructions (\k{insSTC}). To invert the carry
+flag, use \c{CMC} (\k{insCMC}).
+
+
+\S{insCLFLUSH} \i\c{CLFLUSH}: Flush Cache Line
+
+\c CLFLUSH mem                   ; 0F AE /7        [WILLAMETTE,SSE2]
+
+\c{CLFLUSH} invalidates the cache line that contains the linear address
+specified by the source operand from all levels of the processor cache
+hierarchy (data and instruction). If, at any level of the cache
+hierarchy, the line is inconsistent with memory (dirty) it is written
+to memory before invalidation. The source operand points to a
+byte-sized memory location.
+
+Although \c{CLFLUSH} is flagged \c{SSE2} and above, it may not be
+present on all processors which have \c{SSE2} support, and it may be
+supported on other processors; the \c{CPUID} instruction (\k{insCPUID})
+will return a bit which indicates support for the \c{CLFLUSH} instruction.
+
+
+\S{insCMC} \i\c{CMC}: Complement Carry Flag
+
+\c CMC                           ; F5                   [8086]
+
+\c{CMC} changes the value of the carry flag: if it was 0, it sets it
+to 1, and vice versa.
+
+
+\S{insCMOVcc} \i\c{CMOVcc}: Conditional Move
+
+\c CMOVcc reg16,r/m16            ; o16 0F 40+cc /r      [P6]
+\c CMOVcc reg32,r/m32            ; o32 0F 40+cc /r      [P6]
+
+\c{CMOV} moves its source (second) operand into its destination
+(first) operand if the given condition code is satisfied; otherwise
+it does nothing.
+
+For a list of condition codes, see \k{iref-cc}.
+
+Although the \c{CMOV} instructions are flagged \c{P6} and above, they
+may not be supported by all Pentium Pro processors; the \c{CPUID}
+instruction (\k{insCPUID}) will return a bit which indicates whether
+conditional moves are supported.
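+
+A typical use is to replace a short conditional branch. As a sketch,
+assuming 32-bit code and arbitrarily chosen registers, the signed
+maximum of two values can be computed without a jump:
+
+\c         BITS 32
+\c         ; EAX := max(EAX, EBX), treating both as signed
+\c         cmp     eax,ebx
+\c         cmovl   eax,ebx         ; moves only if EAX < EBX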
+
+
+\S{insCMP} \i\c{CMP}: Compare Integers
+
+\c CMP r/m8,reg8                 ; 38 /r                [8086]
+\c CMP r/m16,reg16               ; o16 39 /r            [8086]
+\c CMP r/m32,reg32               ; o32 39 /r            [386]
+
+\c CMP reg8,r/m8                 ; 3A /r                [8086]
+\c CMP reg16,r/m16               ; o16 3B /r            [8086]
+\c CMP reg32,r/m32               ; o32 3B /r            [386]
+
+\c CMP r/m8,imm8                 ; 80 /7 ib             [8086]
+\c CMP r/m16,imm16               ; o16 81 /7 iw         [8086]
+\c CMP r/m32,imm32               ; o32 81 /7 id         [386]
+
+\c CMP r/m16,imm8                ; o16 83 /7 ib         [8086]
+\c CMP r/m32,imm8                ; o32 83 /7 ib         [386]
+
+\c CMP AL,imm8                   ; 3C ib                [8086]
+\c CMP AX,imm16                  ; o16 3D iw            [8086]
+\c CMP EAX,imm32                 ; o32 3D id            [386]
+
+\c{CMP} performs a `mental' subtraction of its second operand from
+its first operand, and affects the flags as if the subtraction had
+taken place, but does not store the result of the subtraction
+anywhere.
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
+
+The destination operand can be a register or a memory location. The
+source can be a register, memory location or an immediate value of
+the same size as the destination.
+
+
+\S{insCMPccPD} \i\c{CMPccPD}: Packed Double-Precision FP Compare
+\I\c{CMPEQPD} \I\c{CMPLTPD} \I\c{CMPLEPD} \I\c{CMPUNORDPD}
+\I\c{CMPNEQPD} \I\c{CMPNLTPD} \I\c{CMPNLEPD} \I\c{CMPORDPD}
+
+\c CMPPD xmm1,xmm2/mem128,imm8   ; 66 0F C2 /r ib  [WILLAMETTE,SSE2]
+
+\c CMPEQPD xmm1,xmm2/mem128      ; 66 0F C2 /r 00  [WILLAMETTE,SSE2]
+\c CMPLTPD xmm1,xmm2/mem128      ; 66 0F C2 /r 01  [WILLAMETTE,SSE2]
+\c CMPLEPD xmm1,xmm2/mem128      ; 66 0F C2 /r 02  [WILLAMETTE,SSE2]
+\c CMPUNORDPD xmm1,xmm2/mem128   ; 66 0F C2 /r 03  [WILLAMETTE,SSE2]
+\c CMPNEQPD xmm1,xmm2/mem128     ; 66 0F C2 /r 04  [WILLAMETTE,SSE2]
+\c CMPNLTPD xmm1,xmm2/mem128     ; 66 0F C2 /r 05  [WILLAMETTE,SSE2]
+\c CMPNLEPD xmm1,xmm2/mem128     ; 66 0F C2 /r 06  [WILLAMETTE,SSE2]
+\c CMPORDPD xmm1,xmm2/mem128     ; 66 0F C2 /r 07  [WILLAMETTE,SSE2]
+
+The \c{CMPccPD} instructions compare the two packed double-precision
+FP values in the source and destination operands, and return the
+result of the comparison in the destination register. The result of
+each comparison is a quadword mask of all 1s (comparison true) or
+all 0s (comparison false).
+
+The destination is an \c{XMM} register. The source can be either an
+\c{XMM} register or a 128-bit memory location.
+
+The third operand is an 8-bit immediate value, of which the low 3
+bits define the type of comparison. For ease of programming, the
+8 two-operand pseudo-instructions are provided, with the third
+operand already filled in. The \I{Condition Predicates}
+\c{Condition Predicates} are:
+
+\c EQ     0   Equal
+\c LT     1   Less-than
+\c LE     2   Less-than-or-equal
+\c UNORD  3   Unordered
+\c NEQ    4   Not-equal
+\c NLT    5   Not-less-than
+\c NLE    6   Not-less-than-or-equal
+\c ORD    7   Ordered
+
+For more details of the comparison predicates, and details of how
+to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
+
+
+\S{insCMPccPS} \i\c{CMPccPS}: Packed Single-Precision FP Compare
+\I\c{CMPEQPS} \I\c{CMPLTPS} \I\c{CMPLEPS} \I\c{CMPUNORDPS}
+\I\c{CMPNEQPS} \I\c{CMPNLTPS} \I\c{CMPNLEPS} \I\c{CMPORDPS}
+
+\c CMPPS xmm1,xmm2/mem128,imm8   ; 0F C2 /r ib     [KATMAI,SSE]
+
+\c CMPEQPS xmm1,xmm2/mem128      ; 0F C2 /r 00     [KATMAI,SSE]
+\c CMPLTPS xmm1,xmm2/mem128      ; 0F C2 /r 01     [KATMAI,SSE]
+\c CMPLEPS xmm1,xmm2/mem128      ; 0F C2 /r 02     [KATMAI,SSE]
+\c CMPUNORDPS xmm1,xmm2/mem128   ; 0F C2 /r 03     [KATMAI,SSE]
+\c CMPNEQPS xmm1,xmm2/mem128     ; 0F C2 /r 04     [KATMAI,SSE]
+\c CMPNLTPS xmm1,xmm2/mem128     ; 0F C2 /r 05     [KATMAI,SSE]
+\c CMPNLEPS xmm1,xmm2/mem128     ; 0F C2 /r 06     [KATMAI,SSE]
+\c CMPORDPS xmm1,xmm2/mem128     ; 0F C2 /r 07     [KATMAI,SSE]
+
+The \c{CMPccPS} instructions compare the two packed single-precision
+FP values in the source and destination operands, and return the
+result of the comparison in the destination register. The result of
+each comparison is a doubleword mask of all 1s (comparison true) or
+all 0s (comparison false).
+
+The destination is an \c{XMM} register. The source can be either an
+\c{XMM} register or a 128-bit memory location.
+
+The third operand is an 8-bit immediate value, of which the low 3
+bits define the type of comparison. For ease of programming, the
+8 two-operand pseudo-instructions are provided, with the third
+operand already filled in. The \I{Condition Predicates}
+\c{Condition Predicates} are:
+
+\c EQ     0   Equal
+\c LT     1   Less-than
+\c LE     2   Less-than-or-equal
+\c UNORD  3   Unordered
+\c NEQ    4   Not-equal
+\c NLT    5   Not-less-than
+\c NLE    6   Not-less-than-or-equal
+\c ORD    7   Ordered
+
+For more details of the comparison predicates, and details of how
+to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
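+
+The resulting masks are usually combined with the bitwise
+instructions. The following sketch assumes 32-bit code and the
+hypothetical labels \c{vec_a} and \c{vec_b}; it keeps each element of
+\c{vec_a} that is less than the matching element of \c{vec_b}, and
+zeroes the others:
+
+\c         BITS 32
+\c         section .text
+\c         movaps  xmm0,[vec_a]
+\c         movaps  xmm1,xmm0
+\c         cmpltps xmm1,[vec_b]    ; same as CMPPS xmm1,[vec_b],1
+\c         andps   xmm0,xmm1       ; xmm0 = 1.0, 0.0, 3.0, 0.0
+\c         section .data
+\c         align   16
+\c vec_a:  dd      1.0, 5.0, 3.0, 9.0
+\c vec_b:  dd      2.0, 4.0, 6.0, 8.0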
+
+
+\S{insCMPSB} \i\c{CMPSB}, \i\c{CMPSW}, \i\c{CMPSD}: Compare Strings
+
+\c CMPSB                         ; A6                   [8086]
+\c CMPSW                         ; o16 A7               [8086]
+\c CMPSD                         ; o32 A7               [386]
+
+\c{CMPSB} compares the byte at \c{[DS:SI]} or \c{[DS:ESI]} with the
+byte at \c{[ES:DI]} or \c{[ES:EDI]}, and sets the flags accordingly.
+It then increments or decrements (depending on the direction flag:
+increments if the flag is clear, decrements if it is set) \c{SI} and
+\c{DI} (or \c{ESI} and \c{EDI}).
+
+The registers used are \c{SI} and \c{DI} if the address size is 16
+bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use
+an address size not equal to the current \c{BITS} setting, you can
+use an explicit \i\c{a16} or \i\c{a32} prefix.
+
+The segment register used to load from \c{[SI]} or \c{[ESI]} can be
+overridden by using a segment register name as a prefix (for
+example, \c{ES CMPSB}). The use of \c{ES} for the load from \c{[DI]}
+or \c{[EDI]} cannot be overridden.
+
+\c{CMPSW} and \c{CMPSD} work in the same way, but they compare a
+word or a doubleword instead of a byte, and increment or decrement
+the addressing registers by 2 or 4 instead of 1.
+
+The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and
+\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or
+\c{ECX} - again, the address size chooses which) times until the
+first unequal or equal byte is found.
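+
+As a sketch of a repeated comparison, assuming 32-bit code in which
+\c{DS} and \c{ES} address the same memory; the labels \c{str1},
+\c{str2} and \c{len} are hypothetical:
+
+\c         BITS 32
+\c         section .text
+\c start:  cld                     ; clear DF: compare upwards
+\c         mov     esi,str1
+\c         mov     edi,str2
+\c         mov     ecx,len         ; byte count, non-zero here
+\c         repe cmpsb              ; stop at the first mismatch
+\c         jne     .differ         ; ZF clear: a byte differed
+\c         ; falls through when all bytes were equal
+\c .differ:
+\c         section .data
+\c str1:   db      "assemble"
+\c len     equ     $-str1
+\c str2:   db      "assembly"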
+
+
+\S{insCMPccSD} \i\c{CMPccSD}: Scalar Double-Precision FP Compare
+\I\c{CMPEQSD} \I\c{CMPLTSD} \I\c{CMPLESD} \I\c{CMPUNORDSD}
+\I\c{CMPNEQSD} \I\c{CMPNLTSD} \I\c{CMPNLESD} \I\c{CMPORDSD}
+
+\c CMPSD xmm1,xmm2/mem64,imm8    ; F2 0F C2 /r ib  [WILLAMETTE,SSE2]
+
+\c CMPEQSD xmm1,xmm2/mem64       ; F2 0F C2 /r 00  [WILLAMETTE,SSE2]
+\c CMPLTSD xmm1,xmm2/mem64       ; F2 0F C2 /r 01  [WILLAMETTE,SSE2]
+\c CMPLESD xmm1,xmm2/mem64       ; F2 0F C2 /r 02  [WILLAMETTE,SSE2]
+\c CMPUNORDSD xmm1,xmm2/mem64    ; F2 0F C2 /r 03  [WILLAMETTE,SSE2]
+\c CMPNEQSD xmm1,xmm2/mem64      ; F2 0F C2 /r 04  [WILLAMETTE,SSE2]
+\c CMPNLTSD xmm1,xmm2/mem64      ; F2 0F C2 /r 05  [WILLAMETTE,SSE2]
+\c CMPNLESD xmm1,xmm2/mem64      ; F2 0F C2 /r 06  [WILLAMETTE,SSE2]
+\c CMPORDSD xmm1,xmm2/mem64      ; F2 0F C2 /r 07  [WILLAMETTE,SSE2]
+
+The \c{CMPccSD} instructions compare the low-order double-precision
+FP values in the source and destination operands, and return the
+result of the comparison in the low quadword of the destination
+register. The result is a quadword mask of all 1s (comparison true)
+or all 0s (comparison false).
+
+The destination is an \c{XMM} register. The source can be either an
+\c{XMM} register or a 64-bit memory location.
+
+The third operand is an 8-bit immediate value, of which the low 3
+bits define the type of comparison. For ease of programming, the
+8 two-operand pseudo-instructions are provided, with the third
+operand already filled in. The \I{Condition Predicates}
+\c{Condition Predicates} are:
+
+\c EQ     0   Equal
+\c LT     1   Less-than
+\c LE     2   Less-than-or-equal
+\c UNORD  3   Unordered
+\c NEQ    4   Not-equal
+\c NLT    5   Not-less-than
+\c NLE    6   Not-less-than-or-equal
+\c ORD    7   Ordered
+
+For more details of the comparison predicates, and details of how
+to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
+
+
+\S{insCMPccSS} \i\c{CMPccSS}: Scalar Single-Precision FP Compare
+\I\c{CMPEQSS} \I\c{CMPLTSS} \I\c{CMPLESS} \I\c{CMPUNORDSS}
+\I\c{CMPNEQSS} \I\c{CMPNLTSS} \I\c{CMPNLESS} \I\c{CMPORDSS}
+
+\c CMPSS xmm1,xmm2/mem32,imm8    ; F3 0F C2 /r ib  [KATMAI,SSE]
+
+\c CMPEQSS xmm1,xmm2/mem32       ; F3 0F C2 /r 00  [KATMAI,SSE]
+\c CMPLTSS xmm1,xmm2/mem32       ; F3 0F C2 /r 01  [KATMAI,SSE]
+\c CMPLESS xmm1,xmm2/mem32       ; F3 0F C2 /r 02  [KATMAI,SSE]
+\c CMPUNORDSS xmm1,xmm2/mem32    ; F3 0F C2 /r 03  [KATMAI,SSE]
+\c CMPNEQSS xmm1,xmm2/mem32      ; F3 0F C2 /r 04  [KATMAI,SSE]
+\c CMPNLTSS xmm1,xmm2/mem32      ; F3 0F C2 /r 05  [KATMAI,SSE]
+\c CMPNLESS xmm1,xmm2/mem32      ; F3 0F C2 /r 06  [KATMAI,SSE]
+\c CMPORDSS xmm1,xmm2/mem32      ; F3 0F C2 /r 07  [KATMAI,SSE]
+
+The \c{CMPccSS} instructions compare the low-order single-precision
+FP values in the source and destination operands, and return the
+result of the comparison in the low doubleword of the destination
+register. The result is a doubleword mask of all 1s (comparison
+true) or all 0s (comparison false).
+
+The destination is an \c{XMM} register. The source can be either an
+\c{XMM} register or a 32-bit memory location.
+
+The third operand is an 8-bit immediate value, of which the low 3
+bits define the type of comparison. For ease of programming, the
+8 two-operand pseudo-instructions are provided, with the third
+operand already filled in. The \I{Condition Predicates}
+\c{Condition Predicates} are:
+
+\c EQ     0   Equal
+\c LT     1   Less-than
+\c LE     2   Less-than-or-equal
+\c UNORD  3   Unordered
+\c NEQ    4   Not-equal
+\c NLT    5   Not-less-than
+\c NLE    6   Not-less-than-or-equal
+\c ORD    7   Ordered
+
+For more details of the comparison predicates, and details of how
+to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
+
+
+\S{insCMPXCHG} \i\c{CMPXCHG}, \i\c{CMPXCHG486}: Compare and Exchange
+
+\c CMPXCHG r/m8,reg8             ; 0F B0 /r             [PENT]
+\c CMPXCHG r/m16,reg16           ; o16 0F B1 /r         [PENT]
+\c CMPXCHG r/m32,reg32           ; o32 0F B1 /r         [PENT]
+
+\c CMPXCHG486 r/m8,reg8          ; 0F A6 /r             [486,UNDOC]
+\c CMPXCHG486 r/m16,reg16        ; o16 0F A7 /r         [486,UNDOC]
+\c CMPXCHG486 r/m32,reg32        ; o32 0F A7 /r         [486,UNDOC]
+
+These two instructions perform exactly the same operation; however,
+apparently some (not all) 486 processors support it under a
+non-standard opcode, so NASM provides the undocumented
+\c{CMPXCHG486} form to generate the non-standard opcode.
+
+\c{CMPXCHG} compares its destination (first) operand to the value in
+\c{AL}, \c{AX} or \c{EAX} (depending on the operand size of the
+instruction). If they are equal, it copies its source (second)
+operand into the destination and sets the zero flag. Otherwise, it
+clears the zero flag and copies the destination operand to \c{AL},
+\c{AX} or \c{EAX}.
+
+The destination can be either a register or a memory location. The
+source is a register.
+
+\c{CMPXCHG} is intended to be used for atomic operations in
+multitasking or multiprocessor environments. To safely update a
+value in shared memory, for example, you might load the value into
+\c{EAX}, load the updated value into \c{EBX}, and then execute the
+instruction \c{LOCK CMPXCHG [value],EBX}. If \c{value} has not
+changed since being loaded, it is updated with your desired new
+value, and the zero flag is set to let you know it has worked. (The
+\c{LOCK} prefix prevents another processor doing anything in the
+middle of this operation: it guarantees atomicity.) However, if
+another processor has modified the value in between your load and
+your attempted store, the store does not happen, and you are
+notified of the failure by a cleared zero flag, so you can go round
+and try again.
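+
+The retry loop described above might be sketched as follows. This
+assumes 32-bit code; the label \c{value} and the choice of an
+increment as the update are made up for illustration:
+
+\c         BITS 32
+\c         section .text
+\c start:  mov     eax,[value]     ; expected old value
+\c .retry: lea     ebx,[eax+1]     ; desired new value
+\c         lock cmpxchg [value],ebx
+\c         jnz     .retry          ; failed: EAX now holds the current
+\c                                 ; value, so recompute and retry
+\c         section .data
+\c value:  dd      0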
+
+
+\S{insCMPXCHG8B} \i\c{CMPXCHG8B}: Compare and Exchange Eight Bytes
+
+\c CMPXCHG8B mem                 ; 0F C7 /1             [PENT]
+
+This is a larger and more unwieldy version of \c{CMPXCHG}: it
+compares the 64-bit (eight-byte) value stored at \c{[mem]} with the
+value in \c{EDX:EAX}. If they are equal, it sets the zero flag and
+stores \c{ECX:EBX} into the memory area. If they are unequal, it
+clears the zero flag and stores the memory contents into \c{EDX:EAX}.
+
+\c{CMPXCHG8B} can be used with the \c{LOCK} prefix, to allow atomic
+execution. This is useful in multi-processor and multi-tasking
+environments.
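+
+As a sketch, an atomic increment of a 64-bit counter could be written
+as below. It assumes 32-bit code, and \c{counter} is a hypothetical
+label. The two initial loads are not atomic as a pair, but a
+mismatched pair simply makes the first exchange fail and reload
+\c{EDX:EAX}.
+
+\c         BITS 32
+\c         section .text
+\c start:  mov     eax,[counter]   ; low dword of expected value
+\c         mov     edx,[counter+4] ; high dword of expected value
+\c .retry: mov     ebx,eax
+\c         mov     ecx,edx
+\c         add     ebx,1           ; ECX:EBX := EDX:EAX + 1
+\c         adc     ecx,0
+\c         lock cmpxchg8b [counter]
+\c         jnz     .retry          ; EDX:EAX was reloaded on failure
+\c         section .data
+\c         align   8
+\c counter: dd     0, 0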
+
+
+\S{insCOMISD} \i\c{COMISD}: Scalar Ordered Double-Precision FP Compare and Set EFLAGS
+
+\c COMISD xmm1,xmm2/mem64        ; 66 0F 2F /r     [WILLAMETTE,SSE2]
+
+\c{COMISD} compares the low-order double-precision FP value in the
+two source operands. ZF, PF and CF are set according to the result.
+OF, SF and AF are cleared. The unordered result is returned if either
+source is a NaN (QNaN or SNaN).
+
+The destination operand is an \c{XMM} register. The source can be either
+an \c{XMM} register or a memory location.
+
+The flags are set according to the following rules:
+
+\c    Result          Flags        Values
+
+\c    UNORDERED:      ZF,PF,CF <-- 111;
+\c    GREATER_THAN:   ZF,PF,CF <-- 000;
+\c    LESS_THAN:      ZF,PF,CF <-- 001;
+\c    EQUAL:          ZF,PF,CF <-- 100;
+
+
+\S{insCOMISS} \i\c{COMISS}: Scalar Ordered Single-Precision FP Compare and Set EFLAGS
+
+\c COMISS xmm1,xmm2/mem32        ; 0F 2F /r        [KATMAI,SSE]
+
+\c{COMISS} compares the low-order single-precision FP value in the
+two source operands. ZF, PF and CF are set according to the result.
+OF, SF and AF are cleared. The unordered result is returned if either
+source is a NaN (QNaN or SNaN).
+
+The destination operand is an \c{XMM} register. The source can be either
+an \c{XMM} register or a memory location.
+
+The flags are set according to the following rules:
+
+\c    Result          Flags        Values
+
+\c    UNORDERED:      ZF,PF,CF <-- 111;
+\c    GREATER_THAN:   ZF,PF,CF <-- 000;
+\c    LESS_THAN:      ZF,PF,CF <-- 001;
+\c    EQUAL:          ZF,PF,CF <-- 100;
+
+
+\S{insCPUID} \i\c{CPUID}: Get CPU Identification Code
+
+\c CPUID                         ; 0F A2                [PENT]
+
+\c{CPUID} returns various information about the processor it is
+being executed on. It fills the four registers \c{EAX}, \c{EBX},
+\c{ECX} and \c{EDX} with information, which varies depending on the
+input contents of \c{EAX}.
+
+\c{CPUID} also acts as a barrier to serialize instruction execution:
+executing the \c{CPUID} instruction guarantees that all the effects
+(memory modification, flag modification, register modification) of
+previous instructions have been completed before the next
+instruction gets fetched.
+
+The information returned is as follows:
+
+\b If \c{EAX} is zero on input, \c{EAX} on output holds the maximum
+acceptable input value of \c{EAX}, and \c{EBX:EDX:ECX} contain the
+string \c{"GenuineIntel"} (or not, if you have a clone processor).
+That is to say, \c{EBX} contains \c{"Genu"} (in NASM's own sense of
+character constants, described in \k{chrconst}), \c{EDX} contains
+\c{"ineI"} and \c{ECX} contains \c{"ntel"}.
+
+\b If \c{EAX} is one on input, \c{EAX} on output contains version
+information about the processor, and \c{EDX} contains a set of
+feature flags, showing the presence and absence of various features.
+For example, bit 8 is set if the \c{CMPXCHG8B} instruction
+(\k{insCMPXCHG8B}) is supported, bit 15 is set if the conditional
+move instructions (\k{insCMOVcc} and \k{insFCMOVB}) are supported,
+and bit 23 is set if \c{MMX} instructions are supported.
+
+\b If \c{EAX} is two on input, \c{EAX}, \c{EBX}, \c{ECX} and \c{EDX}
+all contain information about caches and TLBs (Translation Lookaside
+Buffers).
+
+For more information on the data returned from \c{CPUID}, see the
+documentation from Intel and other processor manufacturers.
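+
+As a sketch of the first two cases (the labels \c{vendor} and
+\c{no_mmx} are hypothetical, and \c{vendor} is assumed to name a
+12-byte buffer), the vendor string can be stored in reading order and
+the \c{MMX} feature bit tested as follows:
+
+\c           XOR     EAX,EAX             ; input value 0
+\c           CPUID
+\c           MOV     [vendor],EBX        ; "Genu" on Intel processors
+\c           MOV     [vendor+4],EDX      ; "ineI"
+\c           MOV     [vendor+8],ECX      ; "ntel"
+\c
+\c           MOV     EAX,1               ; input value 1
+\c           CPUID
+\c           TEST    EDX,1<<23           ; bit 23: MMX supported?
+\c           JZ      no_mmx
+
+Note that \c{CPUID} overwrites all four registers, so any live values
+in \c{EBX} and \c{ECX} must be saved beforehand.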
+
+
+\S{insCVTDQ2PD} \i\c{CVTDQ2PD}:
+Packed Signed INT32 to Packed Double-Precision FP Conversion
+
+\c CVTDQ2PD xmm1,xmm2/mem64      ; F3 0F E6 /r     [WILLAMETTE,SSE2]
+
+\c{CVTDQ2PD} converts two packed signed doublewords from the source
+operand to two packed double-precision FP values in the destination
+operand.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the
+source is a register, the packed integers are in the low quadword.
+
+
+\S{insCVTDQ2PS} \i\c{CVTDQ2PS}:
+Packed Signed INT32 to Packed Single-Precision FP Conversion
+
+\c CVTDQ2PS xmm1,xmm2/mem128     ; 0F 5B /r        [WILLAMETTE,SSE2]
+
+\c{CVTDQ2PS} converts four packed signed doublewords from the source
+operand to four packed single-precision FP values in the destination
+operand.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPD2DQ} \i\c{CVTPD2DQ}:
+Packed Double-Precision FP to Packed Signed INT32 Conversion
+
+\c CVTPD2DQ xmm1,xmm2/mem128     ; F2 0F E6 /r     [WILLAMETTE,SSE2]
+
+\c{CVTPD2DQ} converts two packed double-precision FP values from the
+source operand to two packed signed doublewords in the low quadword
+of the destination operand. The high quadword of the destination is
+set to all 0s.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPD2PI} \i\c{CVTPD2PI}:
+Packed Double-Precision FP to Packed Signed INT32 Conversion
+
+\c CVTPD2PI mm,xmm/mem128        ; 66 0F 2D /r     [WILLAMETTE,SSE2]
+
+\c{CVTPD2PI} converts two packed double-precision FP values from the
+source operand to two packed signed doublewords in the destination
+operand.
+
+The destination operand is an \c{MMX} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPD2PS} \i\c{CVTPD2PS}:
+Packed Double-Precision FP to Packed Single-Precision FP Conversion
+
+\c CVTPD2PS xmm1,xmm2/mem128     ; 66 0F 5A /r     [WILLAMETTE,SSE2]
+
+\c{CVTPD2PS} converts two packed double-precision FP values from the
+source operand to two packed single-precision FP values in the low
+quadword of the destination operand. The high quadword of the
+destination is set to all 0s.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPI2PD} \i\c{CVTPI2PD}:
+Packed Signed INT32 to Packed Double-Precision FP Conversion
+
+\c CVTPI2PD xmm,mm/mem64         ; 66 0F 2A /r     [WILLAMETTE,SSE2]
+
+\c{CVTPI2PD} converts two packed signed doublewords from the source
+operand to two packed double-precision FP values in the destination
+operand.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{MMX} register or a 64-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPI2PS} \i\c{CVTPI2PS}:
+Packed Signed INT32 to Packed Single-Precision FP Conversion
+
+\c CVTPI2PS xmm,mm/mem64         ; 0F 2A /r        [KATMAI,SSE]
+
+\c{CVTPI2PS} converts two packed signed doublewords from the source
+operand to two packed single-precision FP values in the low quadword
+of the destination operand. The high quadword of the destination
+remains unchanged.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{MMX} register or a 64-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPS2DQ} \i\c{CVTPS2DQ}:
+Packed Single-Precision FP to Packed Signed INT32 Conversion
+
+\c CVTPS2DQ xmm1,xmm2/mem128     ; 66 0F 5B /r     [WILLAMETTE,SSE2]
+
+\c{CVTPS2DQ} converts four packed single-precision FP values from the
+source operand to four packed signed doublewords in the destination operand.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPS2PD} \i\c{CVTPS2PD}:
+Packed Single-Precision FP to Packed Double-Precision FP Conversion
+
+\c CVTPS2PD xmm1,xmm2/mem64      ; 0F 5A /r        [WILLAMETTE,SSE2]
+
+\c{CVTPS2PD} converts two packed single-precision FP values from the
+source operand to two packed double-precision FP values in the destination
+operand.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the source
+is a register, the input values are in the low quadword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTPS2PI} \i\c{CVTPS2PI}:
+Packed Single-Precision FP to Packed Signed INT32 Conversion
+
+\c CVTPS2PI mm,xmm/mem64         ; 0F 2D /r        [KATMAI,SSE]
+
+\c{CVTPS2PI} converts two packed single-precision FP values from
+the source operand to two packed signed doublewords in the destination
+operand.
+
+The destination operand is an \c{MMX} register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the
+source is a register, the input values are in the low quadword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTSD2SI} \i\c{CVTSD2SI}:
+Scalar Double-Precision FP to Signed INT32 Conversion
+
+\c CVTSD2SI reg32,xmm/mem64      ; F2 0F 2D /r     [WILLAMETTE,SSE2]
+
+\c{CVTSD2SI} converts a double-precision FP value from the source
+operand to a signed doubleword in the destination operand.
+
+The destination operand is a general purpose register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the
+source is a register, the input value is in the low quadword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTSD2SS} \i\c{CVTSD2SS}:
+Scalar Double-Precision FP to Scalar Single-Precision FP Conversion
+
+\c CVTSD2SS xmm1,xmm2/mem64      ; F2 0F 5A /r     [WILLAMETTE,SSE2]
+
+\c{CVTSD2SS} converts a double-precision FP value from the source
+operand to a single-precision FP value in the low doubleword of the
+destination operand. The upper 3 doublewords are left unchanged.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the
+source is a register, the input value is in the low quadword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTSI2SD} \i\c{CVTSI2SD}:
+Signed INT32 to Scalar Double-Precision FP Conversion
+
+\c CVTSI2SD xmm,r/m32            ; F2 0F 2A /r     [WILLAMETTE,SSE2]
+
+\c{CVTSI2SD} converts a signed doubleword from the source operand to
+a double-precision FP value in the low quadword of the destination
+operand. The high quadword is left unchanged.
+
+The destination operand is an \c{XMM} register. The source can be either
+a general purpose register or a 32-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTSI2SS} \i\c{CVTSI2SS}:
+Signed INT32 to Scalar Single-Precision FP Conversion
+
+\c CVTSI2SS xmm,r/m32            ; F3 0F 2A /r     [KATMAI,SSE]
+
+\c{CVTSI2SS} converts a signed doubleword from the source operand to a
+single-precision FP value in the low doubleword of the destination operand.
+The upper 3 doublewords are left unchanged.
+
+The destination operand is an \c{XMM} register. The source can be either
+a general purpose register or a 32-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTSS2SD} \i\c{CVTSS2SD}:
+Scalar Single-Precision FP to Scalar Double-Precision FP Conversion
+
+\c CVTSS2SD xmm1,xmm2/mem32      ; F3 0F 5A /r     [WILLAMETTE,SSE2]
+
+\c{CVTSS2SD} converts a single-precision FP value from the source operand
+to a double-precision FP value in the low quadword of the destination
+operand. The upper quadword is left unchanged.
+
+The destination operand is an \c{XMM} register. The source can be either
+an \c{XMM} register or a 32-bit memory location. If the source is a
+register, the input value is contained in the low doubleword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTSS2SI} \i\c{CVTSS2SI}:
+Scalar Single-Precision FP to Signed INT32 Conversion
+
+\c CVTSS2SI reg32,xmm/mem32      ; F3 0F 2D /r     [KATMAI,SSE]
+
+\c{CVTSS2SI} converts a single-precision FP value from the source
+operand to a signed doubleword in the destination operand.
+
+The destination operand is a general purpose register. The source can be
+either an \c{XMM} register or a 32-bit memory location. If the
+source is a register, the input value is in the low doubleword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTTPD2DQ} \i\c{CVTTPD2DQ}:
+Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation
+
+\c CVTTPD2DQ xmm1,xmm2/mem128    ; 66 0F E6 /r     [WILLAMETTE,SSE2]
+
+\c{CVTTPD2DQ} converts two packed double-precision FP values in the source
+operand to two packed signed doublewords in the low quadword of the
+destination operand. If the result is inexact, it is truncated (rounded
+toward zero). The high quadword of the destination is set to all 0s.
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTTPD2PI} \i\c{CVTTPD2PI}:
+Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation
+
+\c CVTTPD2PI mm,xmm/mem128        ; 66 0F 2C /r     [WILLAMETTE,SSE2]
+
+\c{CVTTPD2PI} converts two packed double-precision FP values in the source
+operand to two packed signed doublewords in the destination operand.
+If the result is inexact, it is truncated (rounded toward zero).
+
+The destination operand is an \c{MMX} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTTPS2DQ} \i\c{CVTTPS2DQ}:
+Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation
+
+\c CVTTPS2DQ xmm1,xmm2/mem128    ; F3 0F 5B /r     [WILLAMETTE,SSE2]
+
+\c{CVTTPS2DQ} converts four packed single-precision FP values in the source
+operand to four packed signed doublewords in the destination operand.
+If the result is inexact, it is truncated (rounded toward zero).
+
+The destination operand is an \c{XMM} register. The source can be
+either an \c{XMM} register or a 128-bit memory location.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTTPS2PI} \i\c{CVTTPS2PI}:
+Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation
+
+\c CVTTPS2PI mm,xmm/mem64         ; 0F 2C /r       [KATMAI,SSE]
+
+\c{CVTTPS2PI} converts two packed single-precision FP values in the source
+operand to two packed signed doublewords in the destination operand.
+If the result is inexact, it is truncated (rounded toward zero).
+
+The destination operand is an \c{MMX} register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the source
+is a register, the input values are in the low quadword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insCVTTSD2SI} \i\c{CVTTSD2SI}:
+Scalar Double-Precision FP to Signed INT32 Conversion with Truncation
+
+\c CVTTSD2SI reg32,xmm/mem64      ; F2 0F 2C /r    [WILLAMETTE,SSE2]
+
+\c{CVTTSD2SI} converts a double-precision FP value in the source operand
+to a signed doubleword in the destination operand. If the result is
+inexact, it is truncated (rounded toward zero).
+
+The destination operand is a general purpose register. The source can be
+either an \c{XMM} register or a 64-bit memory location. If the source is a
+register, the input value is in the low quadword.
+
+For more details of this instruction, see the Intel Processor manuals.
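+
+The difference from \c{CVTSD2SI} (\k{insCVTSD2SI}) can be sketched as
+follows, assuming the default \c{MXCSR} rounding mode (round to
+nearest) and a hypothetical label \c{val} naming a double-precision
+variable that holds 2.75:
+
+\c           MOVSD     XMM0,[val]        ; XMM0 low quadword = 2.75
+\c           CVTSD2SI  EAX,XMM0          ; EAX = 3 (rounded to nearest)
+\c           CVTTSD2SI ECX,XMM0          ; ECX = 2 (truncated)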
+
+
+\S{insCVTTSS2SI} \i\c{CVTTSS2SI}:
+Scalar Single-Precision FP to Signed INT32 Conversion with Truncation
+
+\c CVTTSS2SI reg32,xmm/mem32      ; F3 0F 2C /r    [KATMAI,SSE]
+
+\c{CVTTSS2SI} converts a single-precision FP value in the source operand
+to a signed doubleword in the destination operand. If the result is
+inexact, it is truncated (rounded toward zero).
+
+The destination operand is a general purpose register. The source can be
+either an \c{XMM} register or a 32-bit memory location. If the source is a
+register, the input value is in the low doubleword.
+
+For more details of this instruction, see the Intel Processor manuals.
+
+
+\S{insDAA} \i\c{DAA}, \i\c{DAS}: Decimal Adjustments
+
+\c DAA                           ; 27                   [8086]
+\c DAS                           ; 2F                   [8086]
+
+These instructions are used in conjunction with the add and subtract
+instructions to perform binary-coded decimal arithmetic in
+\e{packed} (one BCD digit per nibble) form. For the unpacked
+equivalents, see \k{insAAA}.
+
+\c{DAA} should be used after a one-byte \c{ADD} instruction whose
+destination was the \c{AL} register: by examining the value in
+\c{AL} and also the auxiliary carry flag \c{AF}, it
+determines whether either digit of the addition has overflowed, and
+adjusts it (and sets the carry and auxiliary-carry flags) if so. You
+can add long BCD strings together by doing \c{ADD}/\c{DAA} on the
+low two digits, then doing \c{ADC}/\c{DAA} on each subsequent pair
+of digits.
+
+\c{DAS} works similarly to \c{DAA}, but is for use after \c{SUB}
+instructions rather than \c{ADD}.
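+
+For instance, adding the packed BCD values 38 and 45 might look like
+this (a sketch with arbitrarily chosen values, for 16-bit or 32-bit
+code):
+
+\c           MOV     AL,0x38             ; packed BCD 38
+\c           ADD     AL,0x45             ; AL = 0x7D, not valid BCD
+\c           DAA                         ; AL = 0x83, carry flag clear
+
+The low nibble \c{D} exceeds 9, so \c{DAA} adds 6 to it; the result
+does not exceed 99, so no carry is generated.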
+
+
+\S{insDEC} \i\c{DEC}: Decrement Integer
+
+\c DEC reg16                     ; o16 48+r             [8086]
+\c DEC reg32                     ; o32 48+r             [386]
+\c DEC r/m8                      ; FE /1                [8086]
+\c DEC r/m16                     ; o16 FF /1            [8086]
+\c DEC r/m32                     ; o32 FF /1            [386]
+
+\c{DEC} subtracts 1 from its operand. It does \e{not} affect the
+carry flag: to affect the carry flag, use \c{SUB something,1} (see
+\k{insSUB}). \c{DEC} affects all the other flags according to the result.
+
+This instruction can be used with a \c{LOCK} prefix to allow atomic
+execution.
+
+See also \c{INC} (\k{insINC}).
+
+
+\S{insDIV} \i\c{DIV}: Unsigned Integer Divide
+
+\c DIV r/m8                      ; F6 /6                [8086]
+\c DIV r/m16                     ; o16 F7 /6            [8086]
+\c DIV r/m32                     ; o32 F7 /6            [386]
+
+\c{DIV} performs unsigned integer division. The explicit operand
+provided is the divisor; the dividend and destination operands are
+implicit, in the following way:
+
+\b For \c{DIV r/m8}, \c{AX} is divided by the given operand; the
+quotient is stored in \c{AL} and the remainder in \c{AH}.
+
+\b For \c{DIV r/m16}, \c{DX:AX} is divided by the given operand; the
+quotient is stored in \c{AX} and the remainder in \c{DX}.
+
+\b For \c{DIV r/m32}, \c{EDX:EAX} is divided by the given operand;
+the quotient is stored in \c{EAX} and the remainder in \c{EDX}.
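+
+Because the dividend is twice the width of the operand, its upper
+half must be set up before the division. As a sketch, dividing the
+unsigned 32-bit value in \c{EAX} by 10:
+
+\c           XOR     EDX,EDX             ; zero-extend EAX into EDX:EAX
+\c           MOV     ECX,10
+\c           DIV     ECX                 ; EAX = quotient, EDX = remainder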
+
+Signed integer division is performed by the \c{IDIV} instruction:
+see \k{insIDIV}.
+
+
+\S{insDIVPD} \i\c{DIVPD}: Packed Double-Precision FP Divide
+
+\c DIVPD xmm1,xmm2/mem128        ; 66 0F 5E /r     [WILLAMETTE,SSE2]
+
+\c{DIVPD} divides the two packed double-precision FP values in
+the destination operand by the two packed double-precision FP
+values in the source operand, and stores the packed double-precision
+results in the destination register.
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
+
+\c    dst[0-63]   := dst[0-63]   / src[0-63],
+\c    dst[64-127] := dst[64-127] / src[64-127].
+
+
+\S{insDIVPS} \i\c{DIVPS}: Packed Single-Precision FP Divide
+
+\c DIVPS xmm1,xmm2/mem128        ; 0F 5E /r        [KATMAI,SSE]
+
+\c{DIVPS} divides the four packed single-precision FP values in
+the destination operand by the four packed single-precision FP
+values in the source operand, and stores the packed single-precision
+results in the destination register.
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 128-bit memory location.
+
+\c    dst[0-31]   := dst[0-31]   / src[0-31],
+\c    dst[32-63]  := dst[32-63]  / src[32-63],
+\c    dst[64-95]  := dst[64-95]  / src[64-95],
+\c    dst[96-127] := dst[96-127] / src[96-127].
+
+
+\S{insDIVSD} \i\c{DIVSD}: Scalar Double-Precision FP Divide
+
+\c DIVSD xmm1,xmm2/mem64         ; F2 0F 5E /r     [WILLAMETTE,SSE2]
+
+\c{DIVSD} divides the low-order double-precision FP value in the
+destination operand by the low-order double-precision FP value in
+the source operand, and stores the double-precision result in the
+destination register.
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 64-bit memory location.
+
+\c    dst[0-63]   := dst[0-63] / src[0-63],
+\c    dst[64-127] remains unchanged.
+
+
+\S{insDIVSS} \i\c{DIVSS}: Scalar Single-Precision FP Divide
+
+\c DIVSS xmm1,xmm2/mem32         ; F3 0F 5E /r     [KATMAI,SSE]
+
+\c{DIVSS} divides the low-order single-precision FP value in the
+destination operand by the low-order single-precision FP value in
+the source operand, and stores the single-precision result in the
+destination register.
+
+The destination is an \c{XMM} register. The source operand can be
+either an \c{XMM} register or a 32-bit memory location.
+
+\c    dst[0-31]   := dst[0-31] / src[0-31],
+\c    dst[32-127] remains unchanged.
+
+
+\S{insEMMS} \i\c{EMMS}: Empty MMX State
+
+\c EMMS                          ; 0F 77                [PENT,MMX]
+
+\c{EMMS} sets the FPU tag word (marking which floating-point registers
+are available) to all ones, meaning all registers are available for
+the FPU to use. It should be used after executing \c{MMX} instructions
+and before executing any subsequent floating-point operations.
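+
+A sketch of the intended usage follows; the labels \c{src}, \c{dst}
+and \c{scale} are hypothetical:
+
+\c           MOVQ    MM0,[src]           ; MMX code
+\c           PADDB   MM0,[src+8]
+\c           MOVQ    [dst],MM0
+\c           EMMS                        ; mark all FPU registers empty
+\c           FLD     dword [scale]       ; floating-point code may follow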
+
+
+\S{insENTER} \i\c{ENTER}: Create Stack Frame
+
+\c ENTER imm,imm                 ; C8 iw ib             [186]
+
+\c{ENTER} constructs a \i\c{stack frame} for a high-level language
+procedure call. The first operand (the \c{iw} in the opcode
+definition above refers to the first operand) gives the amount of
+stack space to allocate for local variables; the second (the \c{ib}
+above) gives the nesting level of the procedure (for languages like
+Pascal, with nested procedures).
+
+The function of \c{ENTER}, with a nesting level of zero, is
+equivalent to
+
+\c           PUSH EBP            ; or PUSH BP         in 16 bits
+\c           MOV EBP,ESP         ; or MOV BP,SP       in 16 bits
+\c           SUB ESP,operand1    ; or SUB SP,operand1 in 16 bits
+
+This creates a stack frame with the procedure parameters accessible
+upwards from \c{EBP}, and local variables accessible downwards from
+\c{EBP}.
+
+With a nesting level of one, the stack frame created is 4 (or 2)
+bytes bigger, and the value of the final frame pointer \c{EBP} is
+accessible in memory at \c{[EBP-4]}.
+
+This allows \c{ENTER}, when called with a nesting level of two, to
+look at the stack frame described by the \e{previous} value of
+\c{EBP}, find the frame pointer at offset -4 from that, and push it
+along with its new frame pointer, so that when a level-two procedure
+is called from within a level-one procedure, \c{[EBP-4]} holds the
+frame pointer of the most recent level-one procedure call and
+\c{[EBP-8]} holds that of the most recent level-two call. And so on,
+for nesting levels up to 31.
+
+Stack frames created by \c{ENTER} can be destroyed by the \c{LEAVE}
+instruction: see \k{insLEAVE}.
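+
+As a minimal sketch, assuming 32-bit code, a near call, and one
+doubleword parameter pushed by the caller (the label \c{proc} is
+hypothetical), a procedure with eight bytes of local storage could be
+written as:
+
+\c proc:     ENTER   8,0                 ; two local dwords, nesting level 0
+\c           MOV     EAX,[EBP+8]         ; first parameter
+\c           MOV     [EBP-4],EAX         ; first local variable
+\c           LEAVE                       ; MOV ESP,EBP then POP EBP
+\c           RET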
+
+
+\S{insF2XM1} \i\c{F2XM1}: Calculate 2**X-1
+
+\c F2XM1                         ; D9 F0                [8086,FPU]
+
+\c{F2XM1} raises 2 to the power of \c{ST0}, subtracts one, and
+stores the result back into \c{ST0}. The initial contents of \c{ST0}
+must be a number in the range -1.0 to +1.0.
+
+
+\S{insFABS} \i\c{FABS}: Floating-Point Absolute Value
+
+\c FABS                          ; D9 E1                [8086,FPU]
+
+\c{FABS} computes the absolute value of \c{ST0}, by clearing the sign
+bit, and stores the result back in \c{ST0}.
+
+
+\S{insFADD} \i\c{FADD}, \i\c{FADDP}: Floating-Point Addition
+
+\c FADD mem32                    ; D8 /0                [8086,FPU]
+\c FADD mem64                    ; DC /0                [8086,FPU]
+
+\c FADD fpureg                   ; D8 C0+r              [8086,FPU]
+\c FADD ST0,fpureg               ; D8 C0+r              [8086,FPU]
+
+\c FADD TO fpureg                ; DC C0+r              [8086,FPU]
+\c FADD fpureg,ST0               ; DC C0+r              [8086,FPU]
+
+\c FADDP fpureg                  ; DE C0+r              [8086,FPU]
+\c FADDP fpureg,ST0              ; DE C0+r              [8086,FPU]
+
+\b \c{FADD}, given one operand, adds the operand to \c{ST0} and stores
+the result back in \c{ST0}. If the operand has the \c{TO} modifier,
+the result is stored in the register given rather than in \c{ST0}.
+
+\b \c{FADDP} performs the same function as \c{FADD TO}, but pops the
+register stack after storing the result.
+
+The given two-operand forms are synonyms for the one-operand forms.
+
+To add an integer value to \c{ST0}, use the \c{FIADD} instruction
+(\k{insFIADD}).
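+
+The effect of each form can be sketched as follows (the label \c{x}
+is hypothetical and is assumed to name a single-precision variable):
+
+\c           FADD    dword [x]           ; ST0 := ST0 + [x]
+\c           FADD    ST1                 ; ST0 := ST0 + ST1
+\c           FADD    TO ST1              ; ST1 := ST1 + ST0
+\c           FADDP   ST1                 ; ST1 := ST1 + ST0, then pop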
+
+
+\S{insFBLD} \i\c{FBLD}, \i\c{FBSTP}: BCD Floating-Point Load and Store
+
+\c FBLD mem80                    ; DF /4                [8086,FPU]
+\c FBSTP mem80                   ; DF /6                [8086,FPU]
+
+\c{FBLD} loads an 80-bit (ten-byte) packed binary-coded decimal
+number from the given memory address, converts it to a real, and
+pushes it on the register stack. \c{FBSTP} stores the value of
+\c{ST0}, in packed BCD, at the given address and then pops the
+register stack.
+
+
+\S{insFCHS} \i\c{FCHS}: Floating-Point Change Sign
+
+\c FCHS                          ; D9 E0                [8086,FPU]
+
+\c{FCHS} negates the number in \c{ST0}, by inverting the sign bit:
+negative numbers become positive, and vice versa.
+
+
+\S{insFCLEX} \i\c{FCLEX}, \i\c{FNCLEX}: Clear Floating-Point Exceptions
+
+\c FCLEX                         ; 9B DB E2             [8086,FPU]
+\c FNCLEX                        ; DB E2                [8086,FPU]
+
+\c{FCLEX} clears any floating-point exceptions which may be pending.
+\c{FNCLEX} does the same thing but doesn't wait for previous
+floating-point operations (including the \e{handling} of pending
+exceptions) to finish first.
+
+
+\S{insFCMOVB} \i\c{FCMOVcc}: Floating-Point Conditional Move
+
+\c FCMOVB fpureg                 ; DA C0+r              [P6,FPU]
+\c FCMOVB ST0,fpureg             ; DA C0+r              [P6,FPU]
+
+\c FCMOVE fpureg                 ; DA C8+r              [P6,FPU]
+\c FCMOVE ST0,fpureg             ; DA C8+r              [P6,FPU]
+
+\c FCMOVBE fpureg                ; DA D0+r              [P6,FPU]
+\c FCMOVBE ST0,fpureg            ; DA D0+r              [P6,FPU]
+
+\c FCMOVU fpureg                 ; DA D8+r              [P6,FPU]
+\c FCMOVU ST0,fpureg             ; DA D8+r              [P6,FPU]
+
+\c FCMOVNB fpureg                ; DB C0+r              [P6,FPU]
+\c FCMOVNB ST0,fpureg            ; DB C0+r              [P6,FPU]
+
+\c FCMOVNE fpureg                ; DB C8+r              [P6,FPU]
+\c FCMOVNE ST0,fpureg            ; DB C8+r              [P6,FPU]
+
+\c FCMOVNBE fpureg               ; DB D0+r              [P6,FPU]
+\c FCMOVNBE ST0,fpureg           ; DB D0+r              [P6,FPU]
+
+\c FCMOVNU fpureg                ; DB D8+r              [P6,FPU]
+\c FCMOVNU ST0,fpureg            ; DB D8+r              [P6,FPU]
+
+The \c{FCMOV} instructions perform conditional move operations: each
+of them moves the contents of the given register into \c{ST0} if its
+condition is satisfied, and does nothing if not.
+
+The conditions are not the same as the standard condition codes used
+with conditional jump instructions. The conditions \c{B}, \c{BE},
+\c{NB}, \c{NBE}, \c{E} and \c{NE} are exactly as normal, but none of
+the other standard ones are supported. Instead, the condition \c{U}
+and its counterpart \c{NU} are provided; the \c{U} condition is
+satisfied if the last two floating-point numbers compared were
+\e{unordered}, i.e. they were not equal but neither one could be
+said to be greater than the other, for example if they were NaNs.
+(The flag state which signals this is the setting of the parity
+flag: so the \c{U} condition is notionally equivalent to \c{PE}, and
+\c{NU} is equivalent to \c{PO}.)
+
+The \c{FCMOV} conditions test the main processor's status flags, not
+the FPU status flags, so using \c{FCMOV} directly after \c{FCOM}
+will not work. Instead, you should either use \c{FCOMI} which writes
+directly to the main CPU flags word, or use \c{FSTSW} and \c{SAHF} to
+copy the FPU flags into the CPU flags.
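+
+As a sketch, the following sequence leaves the smaller of \c{ST0} and
+\c{ST1} in \c{ST0} (the behaviour for NaN operands is not considered
+here):
+
+\c           FCOMI   ST0,ST1             ; sets ZF, PF and CF directly
+\c           FCMOVNB ST0,ST1             ; if ST0 >= ST1, ST0 := ST1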
+
+Although the \c{FCMOV} instructions are flagged \c{P6} above, they
+may not be supported by all Pentium Pro processors; the \c{CPUID}
+instruction (\k{insCPUID}) will return a bit which indicates whether
+conditional moves are supported.
+
+
+\S{insFCOM} \i\c{FCOM}, \i\c{FCOMP}, \i\c{FCOMPP}, \i\c{FCOMI},
+\i\c{FCOMIP}: Floating-Point Compare
+
+\c FCOM mem32                    ; D8 /2                [8086,FPU]
+\c FCOM mem64                    ; DC /2                [8086,FPU]
+\c FCOM fpureg                   ; D8 D0+r              [8086,FPU]
+\c FCOM ST0,fpureg               ; D8 D0+r              [8086,FPU]
+
+\c FCOMP mem32                   ; D8 /3                [8086,FPU]
+\c FCOMP mem64                   ; DC /3                [8086,FPU]
+\c FCOMP fpureg                  ; D8 D8+r              [8086,FPU]
+\c FCOMP ST0,fpureg              ; D8 D8+r              [8086,FPU]
+
+\c FCOMPP                        ; DE D9                [8086,FPU]
+
+\c FCOMI fpureg                  ; DB F0+r              [P6,FPU]
+\c FCOMI ST0,fpureg              ; DB F0+r              [P6,FPU]
+
+\c FCOMIP fpureg                 ; DF F0+r              [P6,FPU]
+\c FCOMIP ST0,fpureg             ; DF F0+r              [P6,FPU]
+
+\c{FCOM} compares \c{ST0} with the given operand, and sets the FPU
+flags accordingly. \c{ST0} is treated as the left-hand side of the
+comparison, so that the carry flag is set (for a `less-than' result)
+if \c{ST0} is less than the given operand.
+
+\c{FCOMP} does the same as \c{FCOM}, but pops the register stack
+afterwards. \c{FCOMPP} compares \c{ST0} with \c{ST1} and then pops
+the register stack twice.
+
+\c{FCOMI} and \c{FCOMIP} work like the corresponding forms of
+\c{FCOM} and \c{FCOMP}, but write their results directly to the CPU
+flags register rather than the FPU status word, so they can be
+immediately followed by conditional jump or conditional move
+instructions.
+
+The \c{FCOM} instructions differ from the \c{FUCOM} instructions
+(\k{insFUCOM}) only in the way they handle quiet NaNs: \c{FUCOM}
+will handle them silently and set the condition code flags to an
+`unordered' result, whereas \c{FCOM} will generate an exception.
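+
+On processors without \c{FCOMI}, the result of \c{FCOM} has to be
+copied to the CPU flags before it can be tested. A common sequence,
+shown here only as a sketch with a hypothetical label \c{less}, is:
+
+\c           FCOM    ST1                 ; compare ST0 with ST1
+\c           FNSTSW  AX                  ; copy FPU status word into AX
+\c           SAHF                        ; C0 -> CF, C2 -> PF, C3 -> ZF
+\c           JB      less                ; ST0 < ST1 (or unordered)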
+
+
+\S{insFCOS} \i\c{FCOS}: Cosine
+
+\c FCOS                          ; D9 FF                [386,FPU]
+
+\c{FCOS} computes the cosine of \c{ST0} (in radians), and stores the
+result in \c{ST0}. The absolute value of \c{ST0} must be less than 2**63.
+
+See also \c{FSINCOS} (\k{insFSIN}).
+
+
+\S{insFDECSTP} \i\c{FDECSTP}: Decrement Floating-Point Stack Pointer
+
+\c FDECSTP                       ; D9 F6                [8086,FPU]
+
+\c{FDECSTP} decrements the `top' field in the floating-point status
+word. This has the effect of rotating the FPU register stack by one,
+as if the contents of \c{ST7} had been pushed on the stack. See also
+\c{FINCSTP} (\k{insFINCSTP}).
+
+
+\S{insFDISI} \i\c{FxDISI}, \i\c{FxENI}: Disable and Enable Floating-Point Interrupts
+
+\c FDISI                         ; 9B DB E1             [8086,FPU]
+\c FNDISI                        ; DB E1                [8086,FPU]
+
+\c FENI                          ; 9B DB E0             [8086,FPU]
+\c FNENI                         ; DB E0                [8086,FPU]
+
+\c{FDISI} and \c{FENI} disable and enable floating-point interrupts.
+These instructions are only meaningful on original 8087 processors:
+the 287 and above treat them as no-operation instructions.
+
+\c{FNDISI} and \c{FNENI} do the same thing as \c{FDISI} and \c{FENI}
+respectively, but without waiting for the floating-point processor
+to finish what it was doing first.
+
+
+\S{insFDIV} \i\c{FDIV}, \i\c{FDIVP}, \i\c{FDIVR}, \i\c{FDIVRP}: Floating-Point Division
+
+\c FDIV mem32                    ; D8 /6                [8086,FPU]
+\c FDIV mem64                    ; DC /6                [8086,FPU]
+
+\c FDIV fpureg                   ; D8 F0+r              [8086,FPU]
+\c FDIV ST0,fpureg               ; D8 F0+r              [8086,FPU]
+
+\c FDIV TO fpureg                ; DC F8+r              [8086,FPU]
+\c FDIV fpureg,ST0               ; DC F8+r              [8086,FPU]
+
+\c FDIVR mem32                   ; D8 /7                [8086,FPU]
+\c FDIVR mem64                   ; DC /7                [8086,FPU]
+
+\c FDIVR fpureg                  ; D8 F8+r              [8086,FPU]
+\c FDIVR ST0,fpureg              ; D8 F8+r              [8086,FPU]
+
+\c FDIVR TO fpureg               ; DC F0+r              [8086,FPU]
+\c FDIVR fpureg,ST0              ; DC F0+r              [8086,FPU]
+
+\c FDIVP fpureg                  ; DE F8+r              [8086,FPU]
+\c FDIVP fpureg,ST0              ; DE F8+r              [8086,FPU]
+
+\c FDIVRP fpureg                 ; DE F0+r              [8086,FPU]
+\c FDIVRP fpureg,ST0             ; DE F0+r              [8086,FPU]
+
+\b \c{FDIV} divides \c{ST0} by the given operand and stores the result
+back in \c{ST0}, unless the \c{TO} qualifier is given, in which case
+it divides the given operand by \c{ST0} and stores the result in the
+operand.
+
+\b \c{FDIVR} does the same thing, but does the division the other way
+up: so if \c{TO} is not given, it divides the given operand by
+\c{ST0} and stores the result in \c{ST0}, whereas if \c{TO} is given
+it divides \c{ST0} by its operand and stores the result in the
+operand.
+
+\b \c{FDIVP} operates like \c{FDIV TO}, but pops the register stack
+once it has finished.
+
+\b \c{FDIVRP} operates like \c{FDIVR TO}, but pops the register stack
+once it has finished.
+
+For FP/Integer divisions, see \c{FIDIV} (\k{insFIDIV}).
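+
+As a minimal sketch of the popping form, the following fragment
+computes a quotient and stores it. The labels \c{a}, \c{b} and \c{q}
+are hypothetical single-precision variables, not anything defined by
+NASM:
+
+\c         fld     dword [a]       ; ST0 = a
+\c         fld     dword [b]       ; ST0 = b, ST1 = a
+\c         fdivp   st1             ; ST1 = a / b, then pop: ST0 = a / b
+\c         fstp    dword [q]       ; store the quotient and pop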
+
+
+\S{insFEMMS} \i\c{FEMMS}: Faster Enter/Exit of the MMX or floating-point state
+
+\c FEMMS                         ; 0F 0E           [PENT,3DNOW]
+
+\c{FEMMS} can be used in place of the \c{EMMS} instruction on
+processors which support the 3DNow! instruction set. Following
+execution of \c{FEMMS}, the state of the \c{MMX/FP} registers
+is undefined, and this allows a faster context switch between
+\c{FP} and \c{MMX} instructions. The \c{FEMMS} instruction can
+also be used \e{before} executing \c{MMX} instructions.
+
+
+\S{insFFREE} \i\c{FFREE}: Flag Floating-Point Register as Unused
+
+\c FFREE fpureg                  ; DD C0+r              [8086,FPU]
+\c FFREEP fpureg                 ; DF C0+r              [286,FPU,UNDOC]
+
+\c{FFREE} marks the given register as being empty.
+
+\c{FFREEP} marks the given register as being empty, and then
+pops the register stack.
+
+
+\S{insFIADD} \i\c{FIADD}: Floating-Point/Integer Addition
+
+\c FIADD mem16                   ; DE /0                [8086,FPU]
+\c FIADD mem32                   ; DA /0                [8086,FPU]
+
+\c{FIADD} adds the 16-bit or 32-bit integer stored in the given
+memory location to \c{ST0}, storing the result in \c{ST0}.
+
+
+\S{insFICOM} \i\c{FICOM}, \i\c{FICOMP}: Floating-Point/Integer Compare
+
+\c FICOM mem16                   ; DE /2                [8086,FPU]
+\c FICOM mem32                   ; DA /2                [8086,FPU]
+
+\c FICOMP mem16                  ; DE /3                [8086,FPU]
+\c FICOMP mem32                  ; DA /3                [8086,FPU]
+
+\c{FICOM} compares \c{ST0} with the 16-bit or 32-bit integer stored
+in the given memory location, and sets the FPU flags accordingly.
+\c{FICOMP} does the same, but pops the register stack afterwards.
+
+
+\S{insFIDIV} \i\c{FIDIV}, \i\c{FIDIVR}: Floating-Point/Integer Division
+
+\c FIDIV mem16                   ; DE /6                [8086,FPU]
+\c FIDIV mem32                   ; DA /6                [8086,FPU]
+
+\c FIDIVR mem16                  ; DE /7                [8086,FPU]
+\c FIDIVR mem32                  ; DA /7                [8086,FPU]
+
+\c{FIDIV} divides \c{ST0} by the 16-bit or 32-bit integer stored in
+the given memory location, and stores the result in \c{ST0}.
+\c{FIDIVR} does the division the other way up: it divides the
+integer by \c{ST0}, but still stores the result in \c{ST0}.
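+
+For illustration only, dividing a floating-point total by a 16-bit
+integer count might look like this (\c{total}, \c{count} and
+\c{mean} are assumed labels):
+
+\c         fld     qword [total]   ; ST0 = total
+\c         fidiv   word [count]    ; ST0 = total / count
+\c         fstp    qword [mean]    ; store the quotient and pop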
+
+
+\S{insFILD} \i\c{FILD}, \i\c{FIST}, \i\c{FISTP}: Floating-Point/Integer Conversion
+
+\c FILD mem16                    ; DF /0                [8086,FPU]
+\c FILD mem32                    ; DB /0                [8086,FPU]
+\c FILD mem64                    ; DF /5                [8086,FPU]
+
+\c FIST mem16                    ; DF /2                [8086,FPU]
+\c FIST mem32                    ; DB /2                [8086,FPU]
+
+\c FISTP mem16                   ; DF /3                [8086,FPU]
+\c FISTP mem32                   ; DB /3                [8086,FPU]
+\c FISTP mem64                   ; DF /7                [8086,FPU]
+
+\c{FILD} loads an integer out of a memory location, converts it to a
+real, and pushes it on the FPU register stack. \c{FIST} converts
+\c{ST0} to an integer and stores that in memory; \c{FISTP} does the
+same as \c{FIST}, but pops the register stack afterwards.
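+
+A short sketch of a round trip through the FPU, assuming the
+hypothetical labels \c{count} (a 16-bit integer) and \c{result} (a
+32-bit integer). The conversion back to an integer is rounded
+according to the rounding mode currently set in the FPU control word:
+
+\c         fild    word [count]    ; push the integer at count as a real
+\c         fistp   dword [result]  ; convert ST0 to an integer, store, pop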
+
+
+\S{insFIMUL} \i\c{FIMUL}: Floating-Point/Integer Multiplication
+
+\c FIMUL mem16                   ; DE /1                [8086,FPU]
+\c FIMUL mem32                   ; DA /1                [8086,FPU]
+
+\c{FIMUL} multiplies \c{ST0} by the 16-bit or 32-bit integer stored
+in the given memory location, and stores the result in \c{ST0}.
+
+
+\S{insFINCSTP} \i\c{FINCSTP}: Increment Floating-Point Stack Pointer
+
+\c FINCSTP                       ; D9 F7                [8086,FPU]
+
+\c{FINCSTP} increments the `top' field in the floating-point status
+word. This has the effect of rotating the FPU register stack by one,
+as if the register stack had been popped; however, unlike the
+popping of the stack performed by many FPU instructions, it does not
+flag the new \c{ST7} (previously \c{ST0}) as empty. See also
+\c{FDECSTP} (\k{insFDECSTP}).
+
+
+\S{insFINIT} \i\c{FINIT}, \i\c{FNINIT}: Initialize Floating-Point Unit
+
+\c FINIT                         ; 9B DB E3             [8086,FPU]
+\c FNINIT                        ; DB E3                [8086,FPU]
+
+\c{FINIT} initializes the FPU to its default state. It flags all
+registers as empty, without actually changing their values, and
+clears the top of stack pointer. \c{FNINIT} does the same, without
+first waiting for pending exceptions to clear.
+
+
+\S{insFISUB} \i\c{FISUB}, \i\c{FISUBR}: Floating-Point/Integer Subtraction
+
+\c FISUB mem16                   ; DE /4                [8086,FPU]
+\c FISUB mem32                   ; DA /4                [8086,FPU]
+
+\c FISUBR mem16                  ; DE /5                [8086,FPU]
+\c FISUBR mem32                  ; DA /5                [8086,FPU]
+
+\c{FISUB} subtracts the 16-bit or 32-bit integer stored in the given
+memory location from \c{ST0}, and stores the result in \c{ST0}.
+\c{FISUBR} does the subtraction the other way round, i.e. it
+subtracts \c{ST0} from the given integer, but still stores the
+result in \c{ST0}.
+
+
+\S{insFLD} \i\c{FLD}: Floating-Point Load
+
+\c FLD mem32                     ; D9 /0                [8086,FPU]
+\c FLD mem64                     ; DD /0                [8086,FPU]
+\c FLD mem80                     ; DB /5                [8086,FPU]
+\c FLD fpureg                    ; D9 C0+r              [8086,FPU]
+
+\c{FLD} loads a floating-point value out of the given register or
+memory location, and pushes it on the FPU register stack.
+
+
+\S{insFLD1} \i\c{FLDxx}: Floating-Point Load Constants
+
+\c FLD1                          ; D9 E8                [8086,FPU]
+\c FLDL2E                        ; D9 EA                [8086,FPU]
+\c FLDL2T                        ; D9 E9                [8086,FPU]
+\c FLDLG2                        ; D9 EC                [8086,FPU]
+\c FLDLN2                        ; D9 ED                [8086,FPU]
+\c FLDPI                         ; D9 EB                [8086,FPU]
+\c FLDZ                          ; D9 EE                [8086,FPU]
+
+These instructions push specific standard constants on the FPU
+register stack.
+
+\c  Instruction    Constant pushed
+
+\c  FLD1           1
+\c  FLDL2E         base-2 logarithm of e
+\c  FLDL2T         base-2 log of 10
+\c  FLDLG2         base-10 log of 2
+\c  FLDLN2         base-e log of 2
+\c  FLDPI          pi
+\c  FLDZ           zero
+
+
+\S{insFLDCW} \i\c{FLDCW}: Load Floating-Point Control Word
+
+\c FLDCW mem16                   ; D9 /5                [8086,FPU]
+
+\c{FLDCW} loads a 16-bit value out of memory and stores it into the
+FPU control word (governing things like the rounding mode, the
+precision, and the exception masks). See also \c{FSTCW}
+(\k{insFSTCW}). If exceptions are enabled and you don't want to
+generate one, use \c{FCLEX} or \c{FNCLEX} (\k{insFCLEX}) before
+loading the new control word.
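+
+One common use is to change the rounding mode temporarily. The
+sketch below assumes that the rounding-control field occupies bits
+10 and 11 of the control word, that a value is already in \c{ST0},
+and that \c{oldcw}, \c{newcw} and \c{result} are hypothetical
+labels:
+
+\c         fnstcw  word [oldcw]    ; save the current control word
+\c         mov     ax,[oldcw]
+\c         or      ax,0x0C00       ; rounding control = 11b (toward zero)
+\c         mov     [newcw],ax
+\c         fldcw   word [newcw]    ; switch to truncation
+\c         fistp   dword [result]  ; store ST0 as a truncated integer
+\c         fldcw   word [oldcw]    ; restore the original control word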
+
+
+\S{insFLDENV} \i\c{FLDENV}: Load Floating-Point Environment
+
+\c FLDENV mem                    ; D9 /4                [8086,FPU]
+
+\c{FLDENV} loads the FPU operating environment (control word, status
+word, tag word, instruction pointer, data pointer and last opcode)
+from memory. The memory area is 14 or 28 bytes long, depending on
+the CPU mode at the time. See also \c{FSTENV} (\k{insFSTENV}).
+
+
+\S{insFMUL} \i\c{FMUL}, \i\c{FMULP}: Floating-Point Multiply
+
+\c FMUL mem32                    ; D8 /1                [8086,FPU]
+\c FMUL mem64                    ; DC /1                [8086,FPU]
+
+\c FMUL fpureg                   ; D8 C8+r              [8086,FPU]
+\c FMUL ST0,fpureg               ; D8 C8+r              [8086,FPU]
+
+\c FMUL TO fpureg                ; DC C8+r              [8086,FPU]
+\c FMUL fpureg,ST0               ; DC C8+r              [8086,FPU]
+
+\c FMULP fpureg                  ; DE C8+r              [8086,FPU]
+\c FMULP fpureg,ST0              ; DE C8+r              [8086,FPU]
+
+\c{FMUL} multiplies \c{ST0} by the given operand, and stores the
+result in \c{ST0}, unless the \c{TO} qualifier is used in which case
+it stores the result in the operand. \c{FMULP} performs the same
+operation as \c{FMUL TO}, and then pops the register stack.
+
+
+\S{insFNOP} \i\c{FNOP}: Floating-Point No Operation
+
+\c FNOP                          ; D9 D0                [8086,FPU]
+
+\c{FNOP} does nothing.
+
+
+\S{insFPATAN} \i\c{FPATAN}, \i\c{FPTAN}: Arctangent and Tangent
+
+\c FPATAN                        ; D9 F3                [8086,FPU]
+\c FPTAN                         ; D9 F2                [8086,FPU]
+
+\c{FPATAN} computes the arctangent, in radians, of the result of
+dividing \c{ST1} by \c{ST0}, stores the result in \c{ST1}, and pops
+the register stack. It works like the C \c{atan2} function, in that
+changing the sign of both \c{ST0} and \c{ST1} changes the output
+value by pi (so it performs true rectangular-to-polar coordinate
+conversion, with \c{ST1} being the Y coordinate and \c{ST0} being
+the X coordinate, not merely an arctangent).
+
+\c{FPTAN} computes the tangent of the value in \c{ST0} (in radians),
+stores the result back into \c{ST0}, and then pushes the constant 1.0
+on the register stack, so that the tangent ends up in \c{ST1}.
+
+The absolute value of \c{ST0} must be less than 2**63.
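+
+As a sketch of the rectangular-to-polar use of \c{FPATAN} described
+above, with \c{y}, \c{x} and \c{angle} as hypothetical
+double-precision variables:
+
+\c         fld     qword [y]       ; ST0 = y
+\c         fld     qword [x]       ; ST0 = x, ST1 = y
+\c         fpatan                  ; ST0 = angle of (x,y), stack popped once
+\c         fstp    qword [angle]   ; store the angle in radians and pop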
+
+
+\S{insFPREM} \i\c{FPREM}, \i\c{FPREM1}: Floating-Point Partial Remainder
+
+\c FPREM                         ; D9 F8                [8086,FPU]
+\c FPREM1                        ; D9 F5                [386,FPU]
+
+These instructions both produce the remainder obtained by dividing
+\c{ST0} by \c{ST1}. This is calculated, notionally, by dividing
+\c{ST0} by \c{ST1}, rounding the result to an integer, multiplying
+by \c{ST1} again, and computing the value which would need to be
+added back on to the result to get back to the original value in
+\c{ST0}.
+
+The two instructions differ in the way the notional round-to-integer
+operation is performed. \c{FPREM} does it by rounding towards zero,
+so that the remainder it returns always has the same sign as the
+original value in \c{ST0}; \c{FPREM1} does it by rounding to the
+nearest integer, so that the remainder always has at most half the
+magnitude of \c{ST1}.
+
+Both instructions calculate \e{partial} remainders, meaning that
+they may not manage to provide the final result, but might leave
+intermediate results in \c{ST0} instead. If this happens, they will
+set the C2 flag in the FPU status word; therefore, to calculate a
+remainder, you should repeatedly execute \c{FPREM} or \c{FPREM1}
+until C2 becomes clear.
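+
+The repeat-until-C2-clear loop can be sketched as follows. The
+labels are hypothetical, and the fragment assumes a 286 or later so
+that \c{FNSTSW AX} is available; C2 is bit 10 of the status word,
+which is bit 2 of \c{AH} once the status word is in \c{AX}:
+
+\c         fld     qword [divisor] ; ST0 = divisor
+\c         fld     qword [dividend] ; ST0 = dividend, ST1 = divisor
+\c fprem_loop:
+\c         fprem                   ; ST0 = partial remainder
+\c         fnstsw  ax              ; AX = FPU status word
+\c         test    ah,0x04         ; is C2 still set?
+\c         jnz     fprem_loop      ; yes: reduction is incomplete
+\c         fstp    st1             ; drop the divisor, remainder in ST0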
+
+
+\S{insFRNDINT} \i\c{FRNDINT}: Floating-Point Round to Integer
+
+\c FRNDINT                       ; D9 FC                [8086,FPU]
+
+\c{FRNDINT} rounds the contents of \c{ST0} to an integer, according
+to the current rounding mode set in the FPU control word, and stores
+the result back in \c{ST0}.
+
+
+\S{insFRSTOR} \i\c{FSAVE}, \i\c{FRSTOR}: Save/Restore Floating-Point State
+
+\c FSAVE mem                     ; 9B DD /6             [8086,FPU]
+\c FNSAVE mem                    ; DD /6                [8086,FPU]
+
+\c FRSTOR mem                    ; DD /4                [8086,FPU]
+
+\c{FSAVE} saves the entire floating-point unit state, including all
+the information saved by \c{FSTENV} (\k{insFSTENV}) plus the
+contents of all the registers, to a 94 or 108 byte area of memory
+(depending on the CPU mode). \c{FRSTOR} restores the floating-point
+state from the same area of memory.
+
+\c{FNSAVE} does the same as \c{FSAVE}, without first waiting for
+pending floating-point exceptions to clear.
+
+
+\S{insFSCALE} \i\c{FSCALE}: Scale Floating-Point Value by Power of Two
+
+\c FSCALE                        ; D9 FD                [8086,FPU]
+
+\c{FSCALE} scales a number by a power of two: it rounds \c{ST1}
+towards zero to obtain an integer, then multiplies \c{ST0} by two to
+the power of that integer, and stores the result in \c{ST0}.
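+
+For example, to multiply a value by two to the power of an integer
+held in memory (\c{n} and \c{x} are assumed labels), one possible
+sequence is:
+
+\c         fild    word [n]        ; ST0 = n
+\c         fld     qword [x]       ; ST0 = x, ST1 = n
+\c         fscale                  ; ST0 = x * 2**n
+\c         fstp    st1             ; discard n, result remains in ST0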
+
+
+\S{insFSETPM} \i\c{FSETPM}: Set Protected Mode
+
+\c FSETPM                        ; DB E4                [286,FPU]
+
+This instruction initializes protected mode on the 287 floating-point
+coprocessor. It is only meaningful on that processor: the 387 and
+above treat the instruction as a no-operation.
+
+
+\S{insFSIN} \i\c{FSIN}, \i\c{FSINCOS}: Sine and Cosine
+
+\c FSIN                          ; D9 FE                [386,FPU]
+\c FSINCOS                       ; D9 FB                [386,FPU]
+
+\c{FSIN} calculates the sine of \c{ST0} (in radians) and stores the
+result in \c{ST0}. \c{FSINCOS} does the same, but then pushes the
+cosine of the same value on the register stack, so that the sine
+ends up in \c{ST1} and the cosine in \c{ST0}. \c{FSINCOS} is faster
+than executing \c{FSIN} and \c{FCOS} (see \k{insFCOS}) in succession.
+
+The absolute value of \c{ST0} must be less than 2**63.
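+
+A minimal sketch showing where the two results of \c{FSINCOS} end
+up, assuming hypothetical double-precision variables \c{angle},
+\c{cosval} and \c{sinval}:
+
+\c         fld     qword [angle]   ; ST0 = angle in radians
+\c         fsincos                 ; ST0 = cosine, ST1 = sine
+\c         fstp    qword [cosval]  ; store the cosine and pop
+\c         fstp    qword [sinval]  ; store the sine and pop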
+
+
+\S{insFSQRT} \i\c{FSQRT}: Floating-Point Square Root
+
+\c FSQRT                         ; D9 FA                [8086,FPU]
+
+\c{FSQRT} calculates the square root of \c{ST0} and stores the
+result in \c{ST0}.
+
+
+\S{insFST} \i\c{FST}, \i\c{FSTP}: Floating-Point Store
+
+\c FST mem32                     ; D9 /2                [8086,FPU]
+\c FST mem64                     ; DD /2                [8086,FPU]
+\c FST fpureg                    ; DD D0+r              [8086,FPU]
+
+\c FSTP mem32                    ; D9 /3                [8086,FPU]
+\c FSTP mem64                    ; DD /3                [8086,FPU]
+\c FSTP mem80                    ; DB /7                [8086,FPU]
+\c FSTP fpureg                   ; DD D8+r              [8086,FPU]
+
+\c{FST} stores the value in \c{ST0} into the given memory location
+or other FPU register. \c{FSTP} does the same, but then pops the
+register stack.
+
+
+\S{insFSTCW} \i\c{FSTCW}: Store Floating-Point Control Word
+
+\c FSTCW mem16                   ; 9B D9 /7             [8086,FPU]
+\c FNSTCW mem16                  ; D9 /7                [8086,FPU]
+
+\c{FSTCW} stores the \c{FPU} control word (governing things like the
+rounding mode, the precision, and the exception masks) into a 2-byte
+memory area. See also \c{FLDCW} (\k{insFLDCW}).
+
+\c{FNSTCW} does the same thing as \c{FSTCW}, without first waiting
+for pending floating-point exceptions to clear.
+
+
+\S{insFSTENV} \i\c{FSTENV}: Store Floating-Point Environment
+
+\c FSTENV mem                    ; 9B D9 /6             [8086,FPU]
+\c FNSTENV mem                   ; D9 /6                [8086,FPU]
+
+\c{FSTENV} stores the \c{FPU} operating environment (control word,
+status word, tag word, instruction pointer, data pointer and last
+opcode) into memory. The memory area is 14 or 28 bytes long,
+depending on the CPU mode at the time. See also \c{FLDENV}
+(\k{insFLDENV}).
+
+\c{FNSTENV} does the same thing as \c{FSTENV}, without first waiting
+for pending floating-point exceptions to clear.
+
+
+\S{insFSTSW} \i\c{FSTSW}: Store Floating-Point Status Word
+
+\c FSTSW mem16                   ; 9B DD /7             [8086,FPU]
+\c FSTSW AX                      ; 9B DF E0             [286,FPU]
+
+\c FNSTSW mem16                  ; DD /7                [8086,FPU]
+\c FNSTSW AX                     ; DF E0                [286,FPU]
+
+\c{FSTSW} stores the \c{FPU} status word into \c{AX} or into a 2-byte
+memory area.
+
+\c{FNSTSW} does the same thing as \c{FSTSW}, without first waiting
+for pending floating-point exceptions to clear.
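+
+The \c{AX} form is typically used to branch on the result of a
+floating-point comparison. The following sketch assumes a 286 or
+later and uses the hypothetical jump targets \c{unordered} and
+\c{st0_less}. After \c{SAHF}, C0 appears in the carry flag, C2 in
+the parity flag and C3 in the zero flag:
+
+\c         fcom    st1             ; compare ST0 with ST1
+\c         fnstsw  ax              ; AX = FPU status word
+\c         sahf                    ; copy AH into the CPU flags
+\c         jp      unordered       ; C2 set: operands were unordered
+\c         jb      st0_less        ; C0 set: ST0 is less than ST1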
+
+
+\S{insFSUB} \i\c{FSUB}, \i\c{FSUBP}, \i\c{FSUBR}, \i\c{FSUBRP}: Floating-Point Subtract
+
+\c FSUB mem32                    ; D8 /4                [8086,FPU]
+\c FSUB mem64                    ; DC /4                [8086,FPU]
+
+\c FSUB fpureg                   ; D8 E0+r              [8086,FPU]
+\c FSUB ST0,fpureg               ; D8 E0+r              [8086,FPU]
+
+\c FSUB TO fpureg                ; DC E8+r              [8086,FPU]
+\c FSUB fpureg,ST0               ; DC E8+r              [8086,FPU]
+
+\c FSUBR mem32                   ; D8 /5                [8086,FPU]
+\c FSUBR mem64                   ; DC /5                [8086,FPU]
+
+\c FSUBR fpureg                  ; D8 E8+r              [8086,FPU]
+\c FSUBR ST0,fpureg              ; D8 E8+r              [8086,FPU]
+
+\c FSUBR TO fpureg               ; DC E0+r              [8086,FPU]
+\c FSUBR fpureg,ST0              ; DC E0+r              [8086,FPU]
+
+\c FSUBP fpureg                  ; DE E8+r              [8086,FPU]
+\c FSUBP fpureg,ST0              ; DE E8+r              [8086,FPU]
+
+\c FSUBRP fpureg                 ; DE E0+r              [8086,FPU]
+\c FSUBRP fpureg,ST0             ; DE E0+r              [8086,FPU]
+
+\b \c{FSUB} subtracts the given operand from \c{ST0} and stores the
+result back in \c{ST0}, unless the \c{TO} qualifier is given, in
+which case it subtracts \c{ST0} from the given operand and stores
+the result in the operand.
+
+\b \c{FSUBR} does the same thing, but does the subtraction the other
+way up: so if \c{TO} is not given, it subtracts \c{ST0} from the given
+operand and stores the result in \c{ST0}, whereas if \c{TO} is given
+it subtracts its operand from \c{ST0} and stores the result in the
+operand.
+
+\b \c{FSUBP} operates like \c{FSUB TO}, but pops the register stack
+once it has finished.
+
+\b \c{FSUBRP} operates like \c{FSUBR TO}, but pops the register stack
+once it has finished.
+
+
+\S{insFTST} \i\c{FTST}: Test \c{ST0} Against Zero
+
+\c FTST                          ; D9 E4                [8086,FPU]
+
+\c{FTST} compares \c{ST0} with zero and sets the FPU flags
+accordingly. \c{ST0} is treated as the left-hand side of the
+comparison, so that a `less-than' result is generated if \c{ST0} is
+negative.
+
+
+\S{insFUCOM} \i\c{FUCOMxx}: Floating-Point Unordered Compare
+
+\c FUCOM fpureg                  ; DD E0+r              [386,FPU]
+\c FUCOM ST0,fpureg              ; DD E0+r              [386,FPU]
+
+\c FUCOMP fpureg                 ; DD E8+r              [386,FPU]
+\c FUCOMP ST0,fpureg             ; DD E8+r              [386,FPU]
+
+\c FUCOMPP                       ; DA E9                [386,FPU]
+
+\c FUCOMI fpureg                 ; DB E8+r              [P6,FPU]
+\c FUCOMI ST0,fpureg             ; DB E8+r              [P6,FPU]
+
+\c FUCOMIP fpureg                ; DF E8+r              [P6,FPU]
+\c FUCOMIP ST0,fpureg            ; DF E8+r              [P6,FPU]
+
+\b \c{FUCOM} compares \c{ST0} with the given operand, and sets the
+FPU flags accordingly. \c{ST0} is treated as the left-hand side of
+the comparison, so that the carry flag is set (for a `less-than'
+result) if \c{ST0} is less than the given operand.
+
+\b \c{FUCOMP} does the same as \c{FUCOM}, but pops the register stack
+afterwards. \c{FUCOMPP} compares \c{ST0} with \c{ST1} and then pops
+the register stack twice.
+
+\b \c{FUCOMI} and \c{FUCOMIP} work like the corresponding forms of
+\c{FUCOM} and \c{FUCOMP}, but write their results directly to the CPU
+flags register rather than the FPU status word, so they can be
+immediately followed by conditional jump or conditional move
+instructions.
+
+The \c{FUCOM} instructions differ from the \c{FCOM} instructions
+(\k{insFCOM}) only in the way they handle quiet NaNs: \c{FUCOM} will
+handle them silently and set the condition code flags to an
+`unordered' result, whereas \c{FCOM} will generate an exception.
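+
+As an illustration of the P6 forms, the fragment below compares two
+values and branches directly on the CPU flags. The variables \c{a}
+and \c{b} and the jump targets are hypothetical; an unordered result
+sets the parity flag, so it is tested first:
+
+\c         fld     qword [b]       ; ST0 = b
+\c         fld     qword [a]       ; ST0 = a, ST1 = b
+\c         fucomip st0,st1         ; compare a with b, set flags, pop a
+\c         fstp    st0             ; discard b (flags are not changed)
+\c         jp      unordered       ; at least one operand was a NaN
+\c         jb      a_less          ; a is less than b
+\c         je      a_equal         ; a equals b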
+
+
+\S{insFXAM} \i\c{FXAM}: Examine Class of Value in \c{ST0}
+
+\c FXAM                          ; D9 E5                [8086,FPU]
+
+\c{FXAM} sets the FPU flags \c{C3}, \c{C2} and \c{C0} depending on
+the type of value stored in \c{ST0}:
+
+\c  Register contents     Flags
+
+\c  Unsupported format    000
+\c  NaN                   001
+\c  Finite number         010
+\c  Infinity              011
+\c  Zero                  100
+\c  Empty register        101
+\c  Denormal              110
+
+Additionally, the \c{C1} flag is set to the sign of the number.
+
+
+\S{insFXCH} \i\c{FXCH}: Floating-Point Exchange
+
+\c FXCH                          ; D9 C9                [8086,FPU]
+\c FXCH fpureg                   ; D9 C8+r              [8086,FPU]
+\c FXCH fpureg,ST0               ; D9 C8+r              [8086,FPU]
+\c FXCH ST0,fpureg               ; D9 C8+r              [8086,FPU]
+
+\c{FXCH} exchanges \c{ST0} with a given FPU register. The no-operand
+form exchanges \c{ST0} with \c{ST1}.
+
+
+\S{insFXRSTOR} \i\c{FXRSTOR}: Restore \c{FP}, \c{MMX} and \c{SSE} State
+
+\c FXRSTOR memory                ; 0F AE /1             [P6,SSE,FPU]
+
+The \c{FXRSTOR} instruction reloads the \c{FPU}, \c{MMX} and \c{SSE}
+state (environment and registers), from the 512 byte memory area defined
+by the source operand. This data should have been written by a previous
+\c{FXSAVE}.
+
+
+\S{insFXSAVE} \i\c{FXSAVE}: Store \c{FP}, \c{MMX} and \c{SSE} State
+
+\c FXSAVE memory                 ; 0F AE /0             [P6,SSE,FPU]
+
+The \c{FXSAVE} instruction writes the current \c{FPU}, \c{MMX}
+and \c{SSE} technology states (environment and registers), to the
+512 byte memory area defined by the destination operand. It does this
+without checking for pending unmasked floating-point exceptions
+(similar to the operation of \c{FNSAVE}).
+
+Unlike the \c{FSAVE/FNSAVE} instructions, the processor retains the
+contents of the \c{FPU}, \c{MMX} and \c{SSE} state in the processor
+after the state has been saved. This instruction has been optimized
+to maximize floating-point save performance.
+
+
+\S{insFXTRACT} \i\c{FXTRACT}: Extract Exponent and Significand
+
+\c FXTRACT                       ; D9 F4                [8086,FPU]
+
+\c{FXTRACT} separates the number in \c{ST0} into its exponent and
+significand (mantissa), stores the exponent back into \c{ST0}, and
+then pushes the significand on the register stack (so that the
+significand ends up in \c{ST0}, and the exponent in \c{ST1}).
+
+
+\S{insFYL2X} \i\c{FYL2X}, \i\c{FYL2XP1}: Compute Y times Log2(X) or Log2(X+1)
+
+\c FYL2X                         ; D9 F1                [8086,FPU]
+\c FYL2XP1                       ; D9 F9                [8086,FPU]
+
+\c{FYL2X} multiplies \c{ST1} by the base-2 logarithm of \c{ST0},
+stores the result in \c{ST1}, and pops the register stack (so that
+the result ends up in \c{ST0}). \c{ST0} must be non-zero and
+positive.
+
+\c{FYL2XP1} works the same way, but replacing the base-2 log of
+\c{ST0} with that of \c{ST0} plus one. This time, \c{ST0} must have
+magnitude no greater than 1 minus half the square root of two.
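+
+Because \c{FYL2X} multiplies by an arbitrary value in \c{ST1},
+logarithms to other bases can be obtained by loading a suitable
+constant first. As a sketch, a natural logarithm can be computed as
+ln(2) times log2(x), with \c{x} and \c{lnx} as hypothetical
+variables:
+
+\c         fldln2                  ; ST0 = ln(2)
+\c         fld     qword [x]       ; ST0 = x, ST1 = ln(2)
+\c         fyl2x                   ; ST0 = ln(2) * log2(x) = ln(x)
+\c         fstp    qword [lnx]     ; store the result and pop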
+
+
+\S{insHLT} \i\c{HLT}: Halt Processor
+
+\c HLT                           ; F4                   [8086,PRIV]
+
+\c{HLT} puts the processor into a halted state, where it will
+perform no more operations until restarted by an interrupt or a
+reset.
+
+On the 286 and later processors, this is a privileged instruction.
+
+
+\S{insIBTS} \i\c{IBTS}: Insert Bit String
+
+\c IBTS r/m16,reg16              ; o16 0F A7 /r         [386,UNDOC]
+\c IBTS r/m32,reg32              ; o32 0F A7 /r         [386,UNDOC]
+
+The implied operation of this instruction is:
+
+\c IBTS r/m16,AX,CL,reg16
+\c IBTS r/m32,EAX,CL,reg32
+
+Writes a bit string from the source operand to the destination.
+\c{CL} indicates the number of bits to be copied, from the low bits
+of the source. \c{(E)AX} indicates the low order bit offset in the
+destination that is written to. For example, if \c{CL} is set to 4
+and \c{AX} (for 16-bit code) is set to 5, bits 0-3 of \c{src} will
+be copied to bits 5-8 of \c{dst}. This instruction is very poorly
+documented, and I have been unable to find any official source of
+documentation on it.
+
+\c{IBTS} is supported only on the early Intel 386s, and conflicts
+with the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM
+supports it only for completeness. Its counterpart is \c{XBTS}
+(see \k{insXBTS}).
+
+
+\S{insIDIV} \i\c{IDIV}: Signed Integer Divide
+
+\c IDIV r/m8                     ; F6 /7                [8086]
+\c IDIV r/m16                    ; o16 F7 /7            [8086]
+\c IDIV r/m32                    ; o32 F7 /7            [386]
+
+\c{IDIV} performs signed integer division. The explicit operand
+provided is the divisor; the dividend and destination operands
+are implicit, in the following way:
+
+\b For \c{IDIV r/m8}, \c{AX} is divided by the given operand;
+the quotient is stored in \c{AL} and the remainder in \c{AH}.
+
+\b For \c{IDIV r/m16}, \c{DX:AX} is divided by the given operand;
+the quotient is stored in \c{AX} and the remainder in \c{DX}.
+
+\b For \c{IDIV r/m32}, \c{EDX:EAX} is divided by the given operand;
+the quotient is stored in \c{EAX} and the remainder in \c{EDX}.
+
+Unsigned integer division is performed by the \c{DIV} instruction:
+see \k{insDIV}.
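+
+Since the dividend is twice the width of the divisor, a value that
+fits in \c{AX} or \c{EAX} is normally sign-extended first, using
+\c{CWD} or \c{CDQ}. A sketch for the 16-bit case, with
+\c{dividend} and \c{divisor} as assumed labels:
+
+\c         mov     ax,[dividend]   ; 16-bit signed dividend
+\c         cwd                     ; sign-extend AX into DX:AX
+\c         idiv    word [divisor]  ; AX = quotient, DX = remainder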
+
+
+\S{insIMUL} \i\c{IMUL}: Signed Integer Multiply
+
+\c IMUL r/m8                     ; F6 /5                [8086]
+\c IMUL r/m16                    ; o16 F7 /5            [8086]
+\c IMUL r/m32                    ; o32 F7 /5            [386]
+
+\c IMUL reg16,r/m16              ; o16 0F AF /r         [386]
+\c IMUL reg32,r/m32              ; o32 0F AF /r         [386]
+
+\c IMUL reg16,imm8               ; o16 6B /r ib         [186]
+\c IMUL reg16,imm16              ; o16 69 /r iw         [186]
+\c IMUL reg32,imm8               ; o32 6B /r ib         [386]
+\c IMUL reg32,imm32              ; o32 69 /r id         [386]
+
+\c IMUL reg16,r/m16,imm8         ; o16 6B /r ib         [186]
+\c IMUL reg16,r/m16,imm16        ; o16 69 /r iw         [186]
+\c IMUL reg32,r/m32,imm8         ; o32 6B /r ib         [386]
+\c IMUL reg32,r/m32,imm32        ; o32 69 /r id         [386]
+
+\c{IMUL} performs signed integer multiplication. For the
+single-operand form, the other operand and destination are
+implicit, in the following way:
+
+\b For \c{IMUL r/m8}, \c{AL} is multiplied by the given operand;
+the product is stored in \c{AX}.
+
+\b For \c{IMUL r/m16}, \c{AX} is multiplied by the given operand;
+the product is stored in \c{DX:AX}.
+
+\b For \c{IMUL r/m32}, \c{EAX} is multiplied by the given operand;
+the product is stored in \c{EDX:EAX}.
+
+The two-operand form multiplies its two operands and stores the
+result in the destination (first) operand. The three-operand
+form multiplies its last two operands and stores the result in
+the first operand.
+
+The two-operand form with an immediate second operand is in
+fact a shorthand for the three-operand form, as can be seen by
+examining the opcode descriptions: in the two-operand form, the
+code \c{/r} takes both its register and \c{r/m} parts from the
+same operand (the first one).
+
+In the forms with an 8-bit immediate operand and another longer
+source operand, the immediate operand is considered to be signed,
+and is sign-extended to the length of the other source operand.
+In these cases, the \c{BYTE} qualifier is necessary to force
+NASM to generate this form of the instruction.
+
+Unsigned integer multiplication is performed by the \c{MUL}
+instruction: see \k{insMUL}.
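+
+The following lines sketch one example of each form; \c{factor} is
+a hypothetical doubleword variable and the register choices are
+arbitrary:
+
+\c         imul    bx              ; DX:AX = AX * BX
+\c         imul    eax,[factor]    ; EAX = EAX * dword at factor
+\c         imul    ecx,10          ; shorthand for IMUL ECX,ECX,10
+\c         imul    edx,ebx,byte -3 ; EDX = EBX * -3, 8-bit immediate form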
+
+
+\S{insIN} \i\c{IN}: Input from I/O Port
+
+\c IN AL,imm8                    ; E4 ib                [8086]
+\c IN AX,imm8                    ; o16 E5 ib            [8086]
+\c IN EAX,imm8                   ; o32 E5 ib            [386]
+\c IN AL,DX                      ; EC                   [8086]
+\c IN AX,DX                      ; o16 ED               [8086]
+\c IN EAX,DX                     ; o32 ED               [386]
+
+\c{IN} reads a byte, word or doubleword from the specified I/O port,
+and stores it in the given destination register. The port number may
+be specified as an immediate value if it is between 0 and 255, and
+otherwise must be stored in \c{DX}. See also \c{OUT} (\k{insOUT}).
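+
+The two ways of giving the port number can be sketched like this;
+the port numbers themselves are purely illustrative:
+
+\c         in      al,0x60         ; port fits in 8 bits: immediate form
+\c         mov     dx,0x3F8        ; port above 255 must go in DX
+\c         in      al,dx           ; read a byte from the port in DX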
+
+
+\S{insINC} \i\c{INC}: Increment Integer
+
+\c INC reg16                     ; o16 40+r             [8086]
+\c INC reg32                     ; o32 40+r             [386]
+\c INC r/m8                      ; FE /0                [8086]
+\c INC r/m16                     ; o16 FF /0            [8086]
+\c INC r/m32                     ; o32 FF /0            [386]
+
+\c{INC} adds 1 to its operand. It does \e{not} affect the carry
+flag: to affect the carry flag, use \c{ADD something,1} (see
+\k{insADD}). \c{INC} affects all the other flags according to the result.
+
+This instruction can be used with a \c{LOCK} prefix to allow atomic execution.
+
+See also \c{DEC} (\k{insDEC}).
+
+
+\S{insINSB} \i\c{INSB}, \i\c{INSW}, \i\c{INSD}: Input String from I/O Port
+
+\c INSB                          ; 6C                   [186]
+\c INSW                          ; o16 6D               [186]
+\c INSD                          ; o32 6D               [386]
+
+\c{INSB} inputs a byte from the I/O port specified in \c{DX} and
+stores it at \c{[ES:DI]} or \c{[ES:EDI]}. It then increments or
+decrements (depending on the direction flag: increments if the flag
+is clear, decrements if it is set) \c{DI} or \c{EDI}.
+
+The register used is \c{DI} if the address size is 16 bits, and
+\c{EDI} if it is 32 bits. If you need to use an address size not
+equal to the current \c{BITS} setting, you can use an explicit
+\i\c{a16} or \i\c{a32} prefix.
+
+Segment override prefixes have no effect for this instruction: the
+use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be
+overridden.
+
+\c{INSW} and \c{INSD} work in the same way, but they input a word or
+a doubleword instead of a byte, and increment or decrement the
+addressing register by 2 or 4 instead of 1.
+
+The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
+\c{ECX} - again, the address size chooses which) times.
+
+See also \c{OUTSB}, \c{OUTSW} and \c{OUTSD} (\k{insOUTSB}).
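+
+A sketch of a block read in 16-bit code, assuming that \c{ES}
+already points at the destination segment and that \c{buffer} and
+the port number are hypothetical:
+
+\c         cld                     ; clear direction flag: DI increments
+\c         mov     dx,0x1F0        ; I/O port to read from
+\c         mov     di,buffer       ; ES:DI = destination
+\c         mov     cx,256          ; number of words to read
+\c         rep insw                ; read CX words into [ES:DI]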
+
+
+\S{insINT} \i\c{INT}: Software Interrupt
+
+\c INT imm8                      ; CD ib                [8086]
+
+\c{INT} causes a software interrupt through a specified vector
+number from 0 to 255.
+
+The code generated by the \c{INT} instruction is always two bytes
+long: although there are short forms for some \c{INT} instructions,
+NASM does not generate them when it sees the \c{INT} mnemonic. In
+order to generate single-byte breakpoint instructions, use the
+\c{INT3} or \c{INT1} instructions (see \k{insINT1}) instead.
+
+
+\S{insINT1} \i\c{INT3}, \i\c{INT1}, \i\c{ICEBP}, \i\c{INT01}: Breakpoints
+
+\c INT1                          ; F1                   [P6]
+\c ICEBP                         ; F1                   [P6]
+\c INT01                         ; F1                   [P6]
+
+\c INT3                          ; CC                   [8086]
+\c INT03                         ; CC                   [8086]
+
+\c{INT1} and \c{INT3} are short one-byte forms of the instructions
+\c{INT 1} and \c{INT 3} (see \k{insINT}). They perform a similar
+function to their longer counterparts, but take up less code space.
+They are used as breakpoints by debuggers.
+
+\b \c{INT1}, and its alternative synonyms \c{INT01} and \c{ICEBP}, is
+an instruction used by in-circuit emulators (ICEs). It is present,
+though not documented, on some processors down to the 286, but is
+only documented for the Pentium Pro. \c{INT3} is the instruction
+normally used as a breakpoint by debuggers.
+
+\b \c{INT3}, and its synonym \c{INT03}, is not precisely equivalent to
+\c{INT 3}: the short form, since it is designed to be used as a
+breakpoint, bypasses the normal \c{IOPL} checks in virtual-8086 mode,
+and also does not go through interrupt redirection.
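+
+As a small illustration of the size difference described above (the
+byte values in the comments are taken from the opcode tables in this
+section and in \k{insINT}):
+
+\c         int     3               ; CD 03: two-byte general form
+\c         int3                    ; CC: one-byte breakpoint form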
+
+
+\S{insINTO} \i\c{INTO}: Interrupt if Overflow
+
+\c INTO                          ; CE                   [8086]
+
+\c{INTO} performs an \c{INT 4} software interrupt (see \k{insINT})
+if and only if the overflow flag is set.
+
+
+\S{insINVD} \i\c{INVD}: Invalidate Internal Caches
+
+\c INVD                          ; 0F 08                [486]
+
+\c{INVD} invalidates and empties the processor's internal caches,
+and causes the processor to instruct external caches to do the same.
+It does not write the contents of the caches back to memory first:
+any modified data held in the caches will be lost. To write the data
+back first, use \c{WBINVD} (\k{insWBINVD}).
+
+
+\S{insINVLPG} \i\c{INVLPG}: Invalidate TLB Entry
+
+\c INVLPG mem                    ; 0F 01 /7             [486]
+
+\c{INVLPG} invalidates the translation lookaside buffer (TLB) entry
+associated with the supplied memory address.
+
+
+\S{insIRET} \i\c{IRET}, \i\c{IRETW}, \i\c{IRETD}: Return from Interrupt
+
+\c IRET                          ; CF                   [8086]
+\c IRETW                         ; o16 CF               [8086]
+\c IRETD                         ; o32 CF               [386]
+
+\c{IRET} returns from an interrupt (hardware or software) by means
+of popping \c{IP} (or \c{EIP}), \c{CS} and the flags off the stack
+and then continuing execution from the new \c{CS:IP}.
+
+\c{IRETW} pops \c{IP}, \c{CS} and the flags as 2 bytes each, taking
+6 bytes off the stack in total. \c{IRETD} pops \c{EIP} as 4 bytes,
+pops a further 4 bytes of which the top two are discarded and the
+bottom two go into \c{CS}, and pops the flags as 4 bytes as well,
+taking 12 bytes off the stack.
+
+\c{IRET} is a shorthand for either \c{IRETW} or \c{IRETD}, depending
+on the default \c{BITS} setting at the time.
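+
+A minimal sketch of a 16-bit interrupt handler, assuming \c{BITS 16}
+is in effect; the label \c{handler} and the choice of saved register
+are purely illustrative:
+
+\c handler:
+\c         push    ax              ; save whatever the handler clobbers
+\c         ; ... service the interrupt here ...
+\c         pop     ax
+\c         iret                    ; assembled as IRETW under BITS 16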
+
+
+\S{insJcc} \i\c{Jcc}: Conditional Branch
+
+\c Jcc imm                       ; 70+cc rb             [8086]
+\c Jcc NEAR imm                  ; 0F 80+cc rw/rd       [386]
+
+The \i{conditional jump} instructions execute a near (same segment)
+jump if and only if their conditions are satisfied. For example,
+\c{JNZ} jumps only if the zero flag is not set.
+
+The ordinary form of the instructions has only a 128-byte range; the
+\c{NEAR} form is a 386 extension to the instruction set, and can
+span the full size of a segment. NASM will not override your choice
+of jump instruction: if you want \c{Jcc NEAR}, you have to use the
+\c{NEAR} keyword.
+
+The \c{SHORT} keyword is allowed on the first form of the
+instruction, for clarity, but is not necessary.
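+
+As a sketch of how the forms are written (the labels \c{nearby} and
+\c{faraway} are hypothetical, and the encodings in the comments
+follow the opcode table above using the condition code for \c{NZ}):
+
+\c         cmp     ax,bx
+\c         jnz     nearby          ; 75 rb: short form
+\c         jnz     short nearby    ; identical, SHORT stated for clarity
+\c         jnz     near faraway    ; 0F 85 rw/rd: 386 near form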
+
+For details of the condition codes, see \k{iref-cc}.
+
+
+\S{insJCXZ} \i\c{JCXZ}, \i\c{JECXZ}: Jump if CX/ECX Zero
+
+\c JCXZ imm                      ; a16 E3 rb            [8086]
+\c JECXZ imm                     ; a32 E3 rb            [386]
+
+\c{JCXZ} performs a short jump (with maximum range 128 bytes) if and
+only if the contents of the \c{CX} register is 0. \c{JECXZ} does the
+same thing, but with \c{ECX}.
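+
+A common use, sketched here for 16-bit code with hypothetical labels
+\c{again} and \c{skip}, is to guard a \c{LOOP} (\k{insLOOP}) against
+a zero count, since \c{LOOP} decrements before it tests:
+
+\c         jcxz    skip            ; CX = 0: bypass the loop entirely
+\c again:
+\c         ; ... loop body ...
+\c         loop    again           ; without the guard, CX = 0 would wrap
+\c skip: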
+
+
+\S{insJMP} \i\c{JMP}: Jump
+
+\c JMP imm                       ; E9 rw/rd             [8086]
+\c JMP SHORT imm                 ; EB rb                [8086]
+\c JMP imm:imm16                 ; o16 EA iw iw         [8086]
+\c JMP imm:imm32                 ; o32 EA id iw         [386]
+\c JMP FAR mem                   ; o16 FF /5            [8086]
+\c JMP FAR mem32                 ; o32 FF /5            [386]
+\c JMP r/m16                     ; o16 FF /4            [8086]
+\c JMP r/m32                     ; o32 FF /4            [386]
+
+\c{JMP} jumps to a given address. The address may be specified as an
+absolute segment and offset, or as a relative jump within the
+current segment.
+
+\c{JMP SHORT imm} has a maximum range of 128 bytes, since the
+displacement is specified as only 8 bits, but takes up less code
+space. NASM does not choose when to generate \c{JMP SHORT} for you:
+you must explicitly code \c{SHORT} every time you want a short jump.
+
+You can choose between the two immediate \i{far jump} forms (\c{JMP
+imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: \c{JMP
+WORD 0x1234:0x5678} or \c{JMP DWORD 0x1234:0x56789abc}.
+
+The \c{JMP FAR mem} forms execute a far jump by loading the
+destination address out of memory. The address loaded consists of 16
+or 32 bits of offset (depending on the operand size), and 16 bits of
+segment. The operand size may be overridden using \c{JMP WORD FAR
+mem} or \c{JMP DWORD FAR mem}.
+
+The \c{JMP r/m} forms execute a \i{near jump} (within the same
+segment), loading the destination address out of memory or out of a
+register. The keyword \c{NEAR} may be specified, for clarity, in
+these forms, but is not necessary. Again, operand size can be
+overridden using \c{JMP WORD mem} or \c{JMP DWORD mem}.
+
+As a convenience, NASM does not require you to jump to a far symbol
+by coding the cumbersome \c{JMP SEG routine:routine}, but instead
+allows the easier synonym \c{JMP FAR routine}.
+
+The \c{JMP r/m} forms given above are near jumps; NASM will accept
+the \c{NEAR} keyword (e.g. \c{JMP NEAR [address]}), even though it
+is not strictly necessary.
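+
+The forms discussed above might be written as follows; the labels
+\c{nearby}, \c{target}, \c{farptr}, \c{nearptr} and \c{routine} are
+placeholders rather than names NASM defines:
+
+\c         jmp     short nearby            ; EB rb
+\c         jmp     target                  ; E9 rw/rd
+\c         jmp     word 0x1234:0x5678      ; o16 EA iw iw
+\c         jmp     dword 0x1234:0x56789abc ; o32 EA id iw
+\c         jmp     far [farptr]            ; FF /5: far pointer in memory
+\c         jmp     near [nearptr]          ; FF /4: offset in memory
+\c         jmp     far routine             ; same as JMP SEG routine:routine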
+
+
+\S{insLAHF} \i\c{LAHF}: Load AH from Flags
+
+\c LAHF                          ; 9F                   [8086]
+
+\c{LAHF} sets the \c{AH} register according to the contents of the
+low byte of the flags word.
+
+The operation of \c{LAHF} is:
+
+\c  AH <-- SF:ZF:0:AF:0:PF:1:CF
+
+See also \c{SAHF} (\k{insSAHF}).
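+
+For instance, a saved copy of the zero flag can be tested later by
+examining bit 6 of \c{AH}, which is where the layout above places
+\c{ZF} (the label \c{was_zero} is hypothetical):
+
+\c         lahf                    ; AH = SF:ZF:0:AF:0:PF:1:CF
+\c         ; ... code that changes the flags but not AH ...
+\c         test    ah,0x40         ; bit 6 of AH holds the saved ZF
+\c         jnz     was_zero        ; taken if ZF was set at the LAHF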
+
+
+\S{insLAR} \i\c{LAR}: Load Access Rights
+
+\c LAR reg16,r/m16               ; o16 0F 02 /r         [286,PRIV]
+\c LAR reg32,r/m32               ; o32 0F 02 /r         [386,PRIV]
+
+\c{LAR} takes the segment selector specified by its source (second)
+operand, finds the corresponding segment descriptor in the GDT or
+LDT, and loads the access-rights byte of the descriptor into its
+destination (first) operand.
+
+
+\S{insLDMXCSR} \i\c{LDMXCSR}: Load Streaming SIMD Extension
+ Control/Status
+
+\c LDMXCSR mem32                 ; 0F AE /2        [KATMAI,SSE]
+
+\c{LDMXCSR} loads 32 bits of data from the specified memory location
+into the \c{MXCSR} control/status register. \c{MXCSR} is used to
+enable masked/unmasked exception handling, to set rounding modes,
+to set flush-to-zero mode, and to view exception status flags.
+
+For details of the \c{MXCSR} register, see the Intel processor docs.
+
+See also \c{STMXCSR} (\k{insSTMXCSR}).
+
+
+\S{insLDS} \i\c{LDS}, \i\c{LES}, \i\c{LFS}, \i\c{LGS}, \i\c{LSS}: Load Far Pointer
+
+\c LDS reg16,mem                 ; o16 C5 /r            [8086]
+\c LDS reg32,mem                 ; o32 C5 /r            [386]
+
+\c LES reg16,mem                 ; o16 C4 /r            [8086]
+\c LES reg32,mem                 ; o32 C4 /r            [386]
+
+\c LFS reg16,mem                 ; o16 0F B4 /r         [386]
+\c LFS reg32,mem                 ; o32 0F B4 /r         [386]
+
+\c LGS reg16,mem                 ; o16 0F B5 /r         [386]
+\c LGS reg32,mem                 ; o32 0F B5 /r         [386]
+
+\c LSS reg16,mem                 ; o16 0F B2 /r         [386]
+\c LSS reg32,mem                 ; o32 0F B2 /r         [386]
+
+These instructions load an entire far pointer (16 or 32 bits of
+offset, plus 16 bits of segment) out of memory in one go. \c{LDS},
+for example, loads 16 or 32 bits from the given memory address into
+the given register (depending on the size of the register), then
+loads the \e{next} 16 bits from memory into \c{DS}. \c{LES},
+\c{LFS}, \c{LGS} and \c{LSS} work in the same way but use the other
+segment registers.
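+
+As an illustration for 16-bit code (the label \c{farptr} and the
+values stored at it are made up for the example):
+
+\c farptr: dw      0x5678          ; offset comes first in memory
+\c         dw      0x1234          ; segment follows it
+\c         ; ... later, in code:
+\c         les     di,[farptr]     ; DI = 0x5678, ES = 0x1234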
+
+
+\S{insLEA} \i\c{LEA}: Load Effective Address
+
+\c LEA reg16,mem                 ; o16 8D /r            [8086]
+\c LEA reg32,mem                 ; o32 8D /r            [386]
+
+\c{LEA}, despite its syntax, does not access memory. It calculates
+the effective address specified by its second operand as if it were
+going to load or store data from it, but instead it stores the
+calculated address into the register specified by its first operand.
+This can be used to perform quite complex calculations (e.g. \c{LEA
+EAX,[EBX+ECX*4+100]}) in one instruction.
+
+\c{LEA}, despite being a purely arithmetic instruction which
+accesses no memory, still requires square brackets around its second
+operand, as if it were a memory reference.
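+
+Two brief examples, offered as a sketch; the second uses \c{LEA} as
+a cheap multiply by five:
+
+\c         lea     eax,[ebx+ecx*4+100]     ; EAX = EBX + ECX*4 + 100
+\c         lea     eax,[eax+eax*4]         ; EAX = EAX * 5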
+
+The size of the calculation is the current \e{address} size, and the
+size that the result is stored as is the current \e{operand} size.
+If the address and operand sizes differ, then a 32-bit address is
+truncated to its low 16 bits, and a 16-bit address is zero-extended
+to 32 bits, before being stored.
+
+
+\S{insLEAVE} \i\c{LEAVE}: Destroy Stack Frame
+
+\c LEAVE                         ; C9                   [186]
+
+\c{LEAVE} destroys a stack frame of the form created by the
+\c{ENTER} instruction (see \k{insENTER}). It is functionally
+equivalent to \c{MOV ESP,EBP} followed by \c{POP EBP} (or \c{MOV
+SP,BP} followed by \c{POP BP} in 16-bit mode).
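+
+A sketch of a 32-bit procedure that builds its frame by hand and
+tears it down with \c{LEAVE}; the label \c{myfunc} and the 8 bytes
+of local storage are arbitrary choices for the example:
+
+\c myfunc:
+\c         push    ebp
+\c         mov     ebp,esp         ; establish the stack frame
+\c         sub     esp,8           ; reserve 8 bytes of locals
+\c         ; ... body of the procedure ...
+\c         leave                   ; MOV ESP,EBP then POP EBP
+\c         ret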
+
+
+\S{insLFENCE} \i\c{LFENCE}: Load Fence
+
+\c LFENCE                        ; 0F AE /5        [WILLAMETTE,SSE2]
+
+\c{LFENCE} performs a serialising operation on all loads from memory
+that were issued before the \c{LFENCE} instruction. This guarantees that
+all memory reads before the \c{LFENCE} instruction are visible before any
+reads after the \c{LFENCE} instruction.
+
+\c{LFENCE} is ordered with respect to other \c{LFENCE} instructions,
+\c{MFENCE}, any memory read and any other serialising instruction (such
+as \c{CPUID}).
+
+Weakly ordered memory types can be used to achieve higher processor
+performance through such techniques as out-of-order issue and
+speculative reads. The degree to which a consumer of data recognizes
+or knows that the data is weakly ordered varies among applications
+and may be unknown to the producer of this data. The \c{LFENCE}
+instruction provides a performance-efficient way of ensuring load
+ordering between routines that produce weakly-ordered results and
+routines that consume that data.
+
+\c{LFENCE} uses the following ModRM encoding:
+
+\c           Mod (7:6)        = 11B
+\c           Reg/Opcode (5:3) = 101B
+\c           R/M (2:0)        = 000B
+
+All other ModRM encodings are defined to be reserved, and use
+of these encodings risks incompatibility with future processors.
+
+See also \c{SFENCE} (\k{insSFENCE}) and \c{MFENCE} (\k{insMFENCE}).
+
+
+\S{insLGDT} \i\c{LGDT}, \i\c{LIDT}, \i\c{LLDT}: Load Descriptor Tables
+
+\c LGDT mem                      ; 0F 01 /2             [286,PRIV]
+\c LIDT mem                      ; 0F 01 /3             [286,PRIV]
+\c LLDT r/m16                    ; 0F 00 /2             [286,PRIV]
+
+\c{LGDT} and \c{LIDT} both take a 6-byte memory area as an operand:
+they load a 16-bit size limit and a 32-bit linear address from that
+area (in that order) into the \c{GDTR} (global descriptor table
+register) or \c{IDTR} (interrupt descriptor table register). These are
+the only instructions which directly use \e{linear} addresses, rather
+than segment/offset pairs.
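+
+A sketch of such a 6-byte operand, assuming hypothetical labels
+\c{gdt} and \c{gdt_end} that bracket the descriptor table, and
+assuming the assembled offset of \c{gdt} equals its linear address
+(which depends on how and where the code is loaded):
+
+\c gdt_desc:
+\c         dw      gdt_end - gdt - 1       ; 16-bit limit: table size minus one
+\c         dd      gdt                     ; 32-bit linear base address
+\c         ; ... later, in privileged code:
+\c         lgdt    [gdt_desc]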
+
+\c{LLDT} takes a segment selector as an operand. The processor looks
+up that selector in the GDT and stores the limit and base address
+given there into the \c{LDTR} (local descriptor table register).
+
+See also \c{SGDT}, \c{SIDT} and \c{SLDT} (\k{insSGDT}).
+
+
+\S{insLMSW} \i\c{LMSW}: Load Machine Status Word
+
+\c LMSW r/m16                    ; 0F 01 /6             [286,PRIV]
+
+\c{LMSW} loads the bottom four bits of the source operand into the
+bottom four bits of the \c{CR0} control register (or the Machine
+Status Word, on 286 processors). See also \c{SMSW} (\k{insSMSW}).
+
+
+\S{insLOADALL} \i\c{LOADALL}, \i\c{LOADALL286}: Load Processor State
+
+\c LOADALL                       ; 0F 07                [386,UNDOC]
+\c LOADALL286                    ; 0F 05                [286,UNDOC]
+
+This instruction, in its two different-opcode forms, is apparently
+supported on most 286 processors, some 386 and possibly some 486.
+The opcode differs between the 286 and the 386.
+
+The function of the instruction is to load all information relating
+to the state of the processor out of a block of memory: on the 286,
+this block is located implicitly at absolute address \c{0x800}, and
+on the 386 and 486 it is at \c{[ES:EDI]}.
+
+
+\S{insLODSB} \i\c{LODSB}, \i\c{LODSW}, \i\c{LODSD}: Load from String
+
+\c LODSB                         ; AC                   [8086]
+\c LODSW                         ; o16 AD               [8086]
+\c LODSD                         ; o32 AD               [386]
+
+\c{LODSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} into \c{AL}.
+It then increments or decrements (depending on the direction flag:
+increments if the flag is clear, decrements if it is set) \c{SI} or
+\c{ESI}.
+
+The register used is \c{SI} if the address size is 16 bits, and
+\c{ESI} if it is 32 bits. If you need to use an address size not
+equal to the current \c{BITS} setting, you can use an explicit
+\i\c{a16} or \i\c{a32} prefix.
+
+The segment register used to load from \c{[SI]} or \c{[ESI]} can be
+overridden by using a segment register name as a prefix (for
+example, \c{ES LODSB}).
+
+\c{LODSW} and \c{LODSD} work in the same way, but they load a
+word or a doubleword instead of a byte, and increment or decrement
+the addressing register by 2 or 4 instead of 1.
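+
+As a sketch for 16-bit code, the following walks a zero-terminated
+string one byte at a time (\c{msg} and \c{next} are hypothetical
+labels):
+
+\c         cld                     ; clear DF so that SI increments
+\c         mov     si,msg
+\c next:   lodsb                   ; AL = [DS:SI], then SI = SI + 1
+\c         test    al,al
+\c         jnz     next            ; repeat until a zero byte is loaded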
+
+
+\S{insLOOP} \i\c{LOOP}, \i\c{LOOPE}, \i\c{LOOPZ}, \i\c{LOOPNE}, \i\c{LOOPNZ}: Loop with Counter
+
+\c LOOP imm                      ; E2 rb                [8086]
+\c LOOP imm,CX                   ; a16 E2 rb            [8086]
+\c LOOP imm,ECX                  ; a32 E2 rb            [386]
+
+\c LOOPE imm                     ; E1 rb                [8086]
+\c LOOPE imm,CX                  ; a16 E1 rb            [8086]
+\c LOOPE imm,ECX                 ; a32 E1 rb            [386]
+\c LOOPZ imm                     ; E1 rb                [8086]
+\c LOOPZ imm,CX                  ; a16 E1 rb            [8086]
+\c LOOPZ imm,ECX                 ; a32 E1 rb            [386]
+
+\c LOOPNE imm                    ; E0 rb                [8086]
+\c LOOPNE imm,CX                 ; a16 E0 rb            [8086]
+\c LOOPNE imm,ECX                ; a32 E0 rb            [386]
+\c LOOPNZ imm                    ; E0 rb                [8086]
+\c LOOPNZ imm,CX                 ; a16 E0 rb            [8086]
+\c LOOPNZ imm,ECX                ; a32 E0 rb            [386]
+
+\c{LOOP} decrements its counter register (either \c{CX} or \c{ECX} -
+if one is not specified explicitly, the \c{BITS} setting dictates
+which is used) by one, and if the counter does not become zero as a
+result of this operation, it jumps to the given label. The jump has
+a range of 128 bytes.
+
+\c{LOOPE} (or its synonym \c{LOOPZ}) adds the additional condition
+that it only jumps if the counter is nonzero \e{and} the zero flag
+is set. Similarly, \c{LOOPNE} (and \c{LOOPNZ}) jumps only if the
+counter is nonzero and the zero flag is clear.
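+
+For example, to run a loop body ten times with \c{ECX} as the
+counter regardless of the \c{BITS} setting (the label \c{again} is
+hypothetical):
+
+\c         mov     ecx,10
+\c again:
+\c         ; ... loop body, which must preserve ECX ...
+\c         loop    again,ecx       ; decrement ECX, jump if nonzero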
+
+
+\S{insLSL} \i\c{LSL}: Load Segment Limit
+
+\c LSL reg16,r/m16               ; o16 0F 03 /r         [286,PRIV]
+\c LSL reg32,r/m32               ; o32 0F 03 /r         [386,PRIV]
+
+\c{LSL} is given a segment selector in its source (second) operand;
+it computes the segment limit value by loading the segment limit
+field from the associated segment descriptor in the \c{GDT} or \c{LDT}.
+(This involves shifting left by 12 bits if the segment limit is
+page-granular, and not if it is byte-granular; so you end up with a
+byte limit in either case.) The segment limit obtained is then
+loaded into the destination (first) operand.
+
+
+\S{insLTR} \i\c{LTR}: Load Task Register
+
+\c LTR r/m16                     ; 0F 00 /3             [286,PRIV]
+
+\c{LTR} looks up the segment base and limit in the GDT or LDT
+descriptor specified by the segment selector given as its operand,
+and loads them into the Task Register.
+
+
+\S{insMASKMOVDQU} \i\c{MASKMOVDQU}: Byte Mask Write
+
+\c MASKMOVDQU xmm1,xmm2          ; 66 0F F7 /r     [WILLAMETTE,SSE2]
+
+\c{MASKMOVDQU} stores data from xmm1 to the location specified by
+\c{DS:(E)DI}. The size of the store depends on the address-size
+attribute. The most significant bit in each byte of the mask
+register xmm2 is used to selectively write the data (0 = no write,
+1 = write) on a per-byte basis.
+
+
+\S{insMASKMOVQ} \i\c{MASKMOVQ}: Byte Mask Write
+
+\c MASKMOVQ mm1,mm2              ; 0F F7 /r        [KATMAI,MMX]
+
+\c{MASKMOVQ} stores data from mm1 to the location specified by
+\c{DS:(E)DI}. The size of the store depends on the address-size
+attribute. The most significant bit in each byte of the mask
+register mm2 is used to selectively write the data (0 = no write,
+1 = write) on a per-byte basis.
+
+
+\S{insMAXPD} \i\c{MAXPD}: Return Packed Double-Precision FP Maximum
+
+\c MAXPD xmm1,xmm2/m128          ; 66 0F 5F /r     [WILLAMETTE,SSE2]
+
+\c{MAXPD} performs a SIMD compare of the packed double-precision
+FP numbers from xmm1 and xmm2/mem, and stores the maximum values
+of each pair of values in xmm1. If the values being compared are
+both zeroes, source2 (xmm2/m128) would be returned. If source2
+(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
+destination (i.e., a QNaN version of the SNaN is not returned).
+
+
+\S{insMAXPS} \i\c{MAXPS}: Return Packed Single-Precision FP Maximum
+
+\c MAXPS xmm1,xmm2/m128          ; 0F 5F /r        [KATMAI,SSE]
+
+\c{MAXPS} performs a SIMD compare of the packed single-precision
+FP numbers from xmm1 and xmm2/mem, and stores the maximum values
+of each pair of values in xmm1. If the values being compared are
+both zeroes, source2 (xmm2/m128) would be returned. If source2
+(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
+destination (i.e., a QNaN version of the SNaN is not returned).
+
+
+\S{insMAXSD} \i\c{MAXSD}: Return Scalar Double-Precision FP Maximum
+
+\c MAXSD xmm1,xmm2/m64           ; F2 0F 5F /r     [WILLAMETTE,SSE2]
+
+\c{MAXSD} compares the low-order double-precision FP numbers from
+xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the
+values being compared are both zeroes, source2 (xmm2/m64) would
+be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is
+forwarded unchanged to the destination (i.e., a QNaN version of
+the SNaN is not returned). The high quadword of the destination
+is left unchanged.
+
+
+\S{insMAXSS} \i\c{MAXSS}: Return Scalar Single-Precision FP Maximum
+
+\c MAXSS xmm1,xmm2/m32           ; F3 0F 5F /r     [KATMAI,SSE]
+
+\c{MAXSS} compares the low-order single-precision FP numbers from
+xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the
+values being compared are both zeroes, source2 (xmm2/m32) would
+be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is
+forwarded unchanged to the destination (i.e., a QNaN version of
+the SNaN is not returned). The high three doublewords of the
+destination are left unchanged.
+
+
+\S{insMFENCE} \i\c{MFENCE}: Memory Fence
+
+\c MFENCE                        ; 0F AE /6        [WILLAMETTE,SSE2]
+
+\c{MFENCE} performs a serialising operation on all loads from memory
+and writes to memory that were issued before the \c{MFENCE} instruction.
+This guarantees that all memory reads and writes before the \c{MFENCE}
+instruction are completed before any reads and writes after the
+\c{MFENCE} instruction.
+
+\c{MFENCE} is ordered with respect to other \c{MFENCE} instructions,
+\c{LFENCE}, \c{SFENCE}, any memory read and any other serialising
+instruction (such as \c{CPUID}).
+
+Weakly ordered memory types can be used to achieve higher processor
+performance through such techniques as out-of-order issue, speculative
+reads, write-combining, and write-collapsing. The degree to which a
+consumer of data recognizes or knows that the data is weakly ordered
+varies among applications and may be unknown to the producer of this
+data. The \c{MFENCE} instruction provides a performance-efficient way
+of ensuring load and store ordering between routines that produce
+weakly-ordered results and routines that consume that data.
+
+\c{MFENCE} uses the following ModRM encoding:
+
+\c           Mod (7:6)        = 11B
+\c           Reg/Opcode (5:3) = 110B
+\c           R/M (2:0)        = 000B
+
+All other ModRM encodings are defined to be reserved, and use
+of these encodings risks incompatibility with future processors.
+
+See also \c{LFENCE} (\k{insLFENCE}) and \c{SFENCE} (\k{insSFENCE}).
+
+
+\S{insMINPD} \i\c{MINPD}: Return Packed Double-Precision FP Minimum
+
+\c MINPD xmm1,xmm2/m128          ; 66 0F 5D /r     [WILLAMETTE,SSE2]
+
+\c{MINPD} performs a SIMD compare of the packed double-precision
+FP numbers from xmm1 and xmm2/mem, and stores the minimum values
+of each pair of values in xmm1. If the values being compared are
+both zeroes, source2 (xmm2/m128) would be returned. If source2
+(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
+destination (i.e., a QNaN version of the SNaN is not returned).
+
+
+\S{insMINPS} \i\c{MINPS}: Return Packed Single-Precision FP Minimum
+
+\c MINPS xmm1,xmm2/m128          ; 0F 5D /r        [KATMAI,SSE]
+
+\c{MINPS} performs a SIMD compare of the packed single-precision
+FP numbers from xmm1 and xmm2/mem, and stores the minimum values
+of each pair of values in xmm1. If the values being compared are
+both zeroes, source2 (xmm2/m128) would be returned. If source2
+(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
+destination (i.e., a QNaN version of the SNaN is not returned).
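+
+Used together with \c{MAXPS} (\k{insMAXPS}), this gives a simple
+element-wise clamp. In this sketch \c{lo} and \c{hi} are hypothetical
+16-byte-aligned memory locations, each holding four single-precision
+bounds:
+
+\c         maxps   xmm0,[lo]       ; each element = max(element, lower bound)
+\c         minps   xmm0,[hi]       ; each element = min(element, upper bound)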
+
+
+\S{insMINSD} \i\c{MINSD}: Return Scalar Double-Precision FP Minimum
+
+\c MINSD xmm1,xmm2/m64           ; F2 0F 5D /r     [WILLAMETTE,SSE2]
+
+\c{MINSD} compares the low-order double-precision FP numbers from
+xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the
+values being compared are both zeroes, source2 (xmm2/m64) would
+be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is
+forwarded unchanged to the destination (i.e., a QNaN version of
+the SNaN is not returned). The high quadword of the destination
+is left unchanged.
+
+
+\S{insMINSS} \i\c{MINSS}: Return Scalar Single-Precision FP Minimum
+
+\c MINSS xmm1,xmm2/m32           ; F3 0F 5D /r     [KATMAI,SSE]
+
+\c{MINSS} compares the low-order single-precision FP numbers from
+xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the
+values being compared are both zeroes, source2 (xmm2/m32) would
+be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is
+forwarded unchanged to the destination (i.e., a QNaN version of
+the SNaN is not returned). The high three doublewords of the
+destination are left unchanged.
+
+
+\S{insMOV} \i\c{MOV}: Move Data
+
+\c MOV r/m8,reg8                 ; 88 /r                [8086]
+\c MOV r/m16,reg16               ; o16 89 /r            [8086]
+\c MOV r/m32,reg32               ; o32 89 /r            [386]
+\c MOV reg8,r/m8                 ; 8A /r                [8086]
+\c MOV reg16,r/m16               ; o16 8B /r            [8086]
+\c MOV reg32,r/m32               ; o32 8B /r            [386]
+
+\c MOV reg8,imm8                 ; B0+r ib              [8086]
+\c MOV reg16,imm16               ; o16 B8+r iw          [8086]
+\c MOV reg32,imm32               ; o32 B8+r id          [386]
+\c MOV r/m8,imm8                 ; C6 /0 ib             [8086]
+\c MOV r/m16,imm16               ; o16 C7 /0 iw         [8086]
+\c MOV r/m32,imm32               ; o32 C7 /0 id         [386]
+
+\c MOV AL,memoffs8               ; A0 ow/od             [8086]
+\c MOV AX,memoffs16              ; o16 A1 ow/od         [8086]
+\c MOV EAX,memoffs32             ; o32 A1 ow/od         [386]
+\c MOV memoffs8,AL               ; A2 ow/od             [8086]
+\c MOV memoffs16,AX              ; o16 A3 ow/od         [8086]
+\c MOV memoffs32,EAX             ; o32 A3 ow/od         [386]
+
+\c MOV r/m16,segreg              ; o16 8C /r            [8086]
+\c MOV r/m32,segreg              ; o32 8C /r            [386]
+\c MOV segreg,r/m16              ; o16 8E /r            [8086]
+\c MOV segreg,r/m32              ; o32 8E /r            [386]
+
+\c MOV reg32,CR0/2/3/4           ; 0F 20 /r             [386]
+\c MOV reg32,DR0/1/2/3/6/7       ; 0F 21 /r             [386]
+\c MOV reg32,TR3/4/5/6/7         ; 0F 24 /r             [386]
+\c MOV CR0/2/3/4,reg32           ; 0F 22 /r             [386]
+\c MOV DR0/1/2/3/6/7,reg32       ; 0F 23 /r             [386]
+\c MOV TR3/4/5/6/7,reg32         ; 0F 26 /r             [386]
+
+\c{MOV} copies the contents of its source (second) operand into its
+destination (first) operand.
+
+In all forms of the \c{MOV} instruction, the two operands are the
+same size, except for moving between a segment register and an
+\c{r/m32} operand. These instructions are treated exactly like the
+corresponding 16-bit equivalent (so that, for example, \c{MOV
+DS,EAX} functions identically to \c{MOV DS,AX} but saves a prefix
+when in 32-bit mode), except that when a segment register is moved
+into a 32-bit destination, the top two bytes of the result are
+undefined.
+
+\c{MOV} may not use \c{CS} as a destination.
+
+\c{CR4} is only a supported register on the Pentium and above.
+
+Test registers are supported on 386/486 processors and on some
+non-Intel Pentium class processors.
+
+
+\S{insMOVAPD} \i\c{MOVAPD}: Move Aligned Packed Double-Precision FP Values
+
+\c MOVAPD xmm1,xmm2/mem128       ; 66 0F 28 /r     [WILLAMETTE,SSE2]
+\c MOVAPD xmm1/mem128,xmm2       ; 66 0F 29 /r     [WILLAMETTE,SSE2]
+
+\c{MOVAPD} moves a double quadword containing 2 packed double-precision
+FP values from the source operand to the destination. When the source
+or destination operand is a memory location, it must be aligned on a
+16-byte boundary.
+
+To move data in and out of memory locations that are not known to be on
+16-byte boundaries, use the \c{MOVUPD} instruction (\k{insMOVUPD}).
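+
+A sketch of the distinction; the label \c{vec} is hypothetical, and
+the example assumes an output format in which the data section is
+itself aligned to at least 16 bytes:
+
+\c         section .data
+\c         align   16
+\c vec:    dq      1.0, 2.0, 3.0, 4.0
+\c         section .text
+\c         movapd  xmm0,[vec]      ; vec is on a 16-byte boundary
+\c         movupd  xmm1,[vec+8]    ; vec+8 is not, so use MOVUPD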
+
+
+\S{insMOVAPS} \i\c{MOVAPS}: Move Aligned Packed Single-Precision FP Values
+
+\c MOVAPS xmm1,xmm2/mem128       ; 0F 28 /r        [KATMAI,SSE]
+\c MOVAPS xmm1/mem128,xmm2       ; 0F 29 /r        [KATMAI,SSE]
+
+\c{MOVAPS} moves a double quadword containing 4 packed single-precision
+FP values from the source operand to the destination. When the source
+or destination operand is a memory location, it must be aligned on a
+16-byte boundary.
+
+To move data in and out of memory locations that are not known to be on
+16-byte boundaries, use the \c{MOVUPS} instruction (\k{insMOVUPS}).
+
+
+\S{insMOVD} \i\c{MOVD}: Move Doubleword to/from MMX Register
+
+\c MOVD mm,r/m32                 ; 0F 6E /r             [PENT,MMX]
+\c MOVD r/m32,mm                 ; 0F 7E /r             [PENT,MMX]
+\c MOVD xmm,r/m32                ; 66 0F 6E /r     [WILLAMETTE,SSE2]
+\c MOVD r/m32,xmm                ; 66 0F 7E /r     [WILLAMETTE,SSE2]
+
+\c{MOVD} copies 32 bits from its source (second) operand into its
+destination (first) operand. When the destination is a 64-bit \c{MMX}
+register or a 128-bit \c{XMM} register, the input value is zero-extended
+to fill the destination register.
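+
+As a brief sketch of the zero extension described above (the
+constant is arbitrary):
+
+\c         mov     eax,0x12345678
+\c         movd    mm0,eax         ; MM0  = 0x0000000012345678
+\c         movd    xmm0,eax        ; XMM0 = 0x12345678 in bits 0-31, zero above
+\c         movd    ebx,mm0         ; EBX  = 0x12345678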
+
+
+\S{insMOVDQ2Q} \i\c{MOVDQ2Q}: Move Quadword from XMM to MMX Register
+
+\c MOVDQ2Q mm,xmm                ; F2 0F D6 /r     [WILLAMETTE,SSE2]
+
+\c{MOVDQ2Q} moves the low quadword from the source operand to the
+destination operand.
+
+
+\S{insMOVDQA} \i\c{MOVDQA}: Move Aligned Double Quadword
+
+\c MOVDQA xmm1,xmm2/m128         ; 66 0F 6F /r     [WILLAMETTE,SSE2]
+\c MOVDQA xmm1/m128,xmm2         ; 66 0F 7F /r     [WILLAMETTE,SSE2]
+
+\c{MOVDQA} moves a double quadword from the source operand to the
+destination operand. When the source or destination operand is a
+memory location, it must be aligned to a 16-byte boundary.
+
+To move a double quadword to or from unaligned memory locations,
+use the \c{MOVDQU} instruction (\k{insMOVDQU}).
+
+
+\S{insMOVDQU} \i\c{MOVDQU}: Move Unaligned Double Quadword
+
+\c MOVDQU xmm1,xmm2/m128         ; F3 0F 6F /r     [WILLAMETTE,SSE2]
+\c MOVDQU xmm1/m128,xmm2         ; F3 0F 7F /r     [WILLAMETTE,SSE2]
+
+\c{MOVDQU} moves a double quadword from the source operand to the
+destination operand. When the source or destination operand is a
+memory location, the memory may be unaligned.
+
+To move a double quadword to or from known aligned memory locations,
+use the \c{MOVDQA} instruction (\k{insMOVDQA}).
+
+
+\S{insMOVHLPS} \i\c{MOVHLPS}: Move Packed Single-Precision FP High to Low
+
+\c MOVHLPS xmm1,xmm2             ; 0F 12 /r        [KATMAI,SSE]
+
+\c{MOVHLPS} moves the two packed single-precision FP values from the
+high quadword of the source register xmm2 to the low quadword of the
+destination register, xmm1. The upper quadword of xmm1 is left unchanged.
+
+The operation of this instruction is:
+
+\c    dst[0-63]   := src[64-127],
+\c    dst[64-127] remains unchanged.
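+
+As a worked example, writing each register as four single-precision
+elements from high to low (the element names are just notation):
+
+\c         ; before: XMM1 = a3:a2:a1:a0, XMM2 = b3:b2:b1:b0
+\c         movhlps xmm1,xmm2
+\c         ; after:  XMM1 = a3:a2:b3:b2, XMM2 unchanged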
+
+
+\S{insMOVHPD} \i\c{MOVHPD}: Move High Packed Double-Precision FP
+
+\c MOVHPD xmm,m64               ; 66 0F 16 /r      [WILLAMETTE,SSE2]
+\c MOVHPD m64,xmm               ; 66 0F 17 /r      [WILLAMETTE,SSE2]
+
+\c{MOVHPD} moves a double-precision FP value between the source and
+destination operands. One of the operands is a 64-bit memory location,
+the other is the high quadword of an \c{XMM} register.
+
+The operation of this instruction is:
+
+\c    mem[0-63]   := xmm[64-127];
+
+or
+
+\c    xmm[0-63]   remains unchanged;
+\c    xmm[64-127] := mem[0-63].
+
+
+\S{insMOVHPS} \i\c{MOVHPS}: Move High Packed Single-Precision FP
+
+\c MOVHPS xmm,m64               ; 0F 16 /r         [KATMAI,SSE]
+\c MOVHPS m64,xmm               ; 0F 17 /r         [KATMAI,SSE]
+
+\c{MOVHPS} moves two packed single-precision FP values between the source
+and destination operands. One of the operands is a 64-bit memory location,
+the other is the high quadword of an \c{XMM} register.
+
+The operation of this instruction is:
+
+\c    mem[0-63]   := xmm[64-127];
+
+or
+
+\c    xmm[0-63]   remains unchanged;
+\c    xmm[64-127] := mem[0-63].
+
+
+\S{insMOVLHPS} \i\c{MOVLHPS}: Move Packed Single-Precision FP Low to High
+
+\c MOVLHPS xmm1,xmm2             ; 0F 16 /r        [KATMAI,SSE]
+
+\c{MOVLHPS} moves the two packed single-precision FP values from the
+low quadword of the source register xmm2 to the high quadword of the
+destination register, xmm1. The low quadword of xmm1 is left unchanged.
+
+The operation of this instruction is:
+
+\c    dst[0-63]   remains unchanged;
+\c    dst[64-127] := src[0-63].
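+
+One idiom this permits, shown here as a sketch, is duplicating the
+low quadword of a register into its own high quadword:
+
+\c         ; before: XMM0 = a3:a2:a1:a0 (high to low)
+\c         movlhps xmm0,xmm0
+\c         ; after:  XMM0 = a1:a0:a1:a0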
+
+\S{insMOVLPD} \i\c{MOVLPD}: Move Low Packed Double-Precision FP
+
+\c MOVLPD xmm,m64                ; 66 0F 12 /r     [WILLAMETTE,SSE2]
+\c MOVLPD m64,xmm                ; 66 0F 13 /r     [WILLAMETTE,SSE2]
+
+\c{MOVLPD} moves a double-precision FP value between the source and
+destination operands. One of the operands is a 64-bit memory location,
+the other is the low quadword of an \c{XMM} register.
+
+The operation of this instruction is:
+
+\c    mem(0-63)   := xmm(0-63);
+
+or
+
+\c    xmm(0-63)   := mem(0-63);
+\c    xmm(64-127) remains unchanged.
+
+\S{insMOVLPS} \i\c{MOVLPS}: Move Low Packed Single-Precision FP
+
+\c MOVLPS xmm,m64                ; 0F 12 /r        [KATMAI,SSE]
+\c MOVLPS m64,xmm                ; 0F 13 /r        [KATMAI,SSE]
+
+\c{MOVLPS} moves two packed single-precision FP values between the source
+and destination operands. One of the operands is a 64-bit memory location,
+the other is the low quadword of an \c{XMM} register.
+
+The operation of this instruction is:
+
+\c    mem(0-63)   := xmm(0-63);
+
+or
+
+\c    xmm(0-63)   := mem(0-63);
+\c    xmm(64-127) remains unchanged.
+
+
+\S{insMOVMSKPD} \i\c{MOVMSKPD}: Extract Packed Double-Precision FP Sign Mask
+
+\c MOVMSKPD reg32,xmm              ; 66 0F 50 /r   [WILLAMETTE,SSE2]
+
+\c{MOVMSKPD} inserts a 2-bit mask in r32, formed of the most significant
+bits of each double-precision FP number of the source operand.
+
+
+\S{insMOVMSKPS} \i\c{MOVMSKPS}: Extract Packed Single-Precision FP Sign Mask
+
+\c MOVMSKPS reg32,xmm              ; 0F 50 /r      [KATMAI,SSE]
+
+\c{MOVMSKPS} inserts a 4-bit mask in r32, formed of the most significant
+bits of each single-precision FP number of the source operand.
+
+
+\S{insMOVNTDQ} \i\c{MOVNTDQ}: Move Double Quadword Non Temporal
+
+\c MOVNTDQ m128,xmm              ; 66 0F E7 /r     [WILLAMETTE,SSE2]
+
+\c{MOVNTDQ} moves the double quadword from the \c{XMM} source
+register to the destination memory location, using a non-temporal
+hint. This store instruction minimizes cache pollution.
+
+
+\S{insMOVNTI} \i\c{MOVNTI}: Move Doubleword Non Temporal
+
+\c MOVNTI m32,reg32              ; 0F C3 /r        [WILLAMETTE,SSE2]
+
+\c{MOVNTI} moves the doubleword in the source register
+to the destination memory location, using a non-temporal
+hint. This store instruction minimizes cache pollution.
+
+
+\S{insMOVNTPD} \i\c{MOVNTPD}: Move Aligned Two Packed Double-Precision
+FP Values Non Temporal
+
+\c MOVNTPD m128,xmm              ; 66 0F 2B /r     [WILLAMETTE,SSE2]
+
+\c{MOVNTPD} moves the double quadword from the \c{XMM} source
+register to the destination memory location, using a non-temporal
+hint. This store instruction minimizes cache pollution. The memory
+location must be aligned to a 16-byte boundary.
+
+
+\S{insMOVNTPS} \i\c{MOVNTPS}: Move Aligned Four Packed Single-Precision
+FP Values Non Temporal
+
+\c MOVNTPS m128,xmm              ; 0F 2B /r        [KATMAI,SSE]
+
+\c{MOVNTPS} moves the double quadword from the \c{XMM} source
+register to the destination memory location, using a non-temporal
+hint. This store instruction minimizes cache pollution. The memory
+location must be aligned to a 16-byte boundary.
+
+
+\S{insMOVNTQ} \i\c{MOVNTQ}: Move Quadword Non Temporal
+
+\c MOVNTQ m64,mm                 ; 0F E7 /r        [KATMAI,MMX]
+
+\c{MOVNTQ} moves the quadword in the \c{MMX} source register
+to the destination memory location, using a non-temporal
+hint. This store instruction minimizes cache pollution.
+
+
+\S{insMOVQ} \i\c{MOVQ}: Move Quadword to/from MMX Register
+
+\c MOVQ mm1,mm2/m64               ; 0F 6F /r             [PENT,MMX]
+\c MOVQ mm1/m64,mm2               ; 0F 7F /r             [PENT,MMX]
+
+\c MOVQ xmm1,xmm2/m64             ; F3 0F 7E /r    [WILLAMETTE,SSE2]
+\c MOVQ xmm1/m64,xmm2             ; 66 0F D6 /r    [WILLAMETTE,SSE2]
+
+\c{MOVQ} copies 64 bits from its source (second) operand into its
+destination (first) operand. When the source is an \c{XMM} register,
+the low quadword is moved. When the destination is an \c{XMM} register,
+the destination is the low quadword, and the high quadword is cleared.
+
+
+\S{insMOVQ2DQ} \i\c{MOVQ2DQ}: Move Quadword from MMX to XMM Register
+
+\c MOVQ2DQ xmm,mm                ; F3 0F D6 /r     [WILLAMETTE,SSE2]
+
+\c{MOVQ2DQ} moves the quadword from the source operand to the low
+quadword of the destination operand, and clears the high quadword.
+
+
+\S{insMOVSB} \i\c{MOVSB}, \i\c{MOVSW}, \i\c{MOVSD}: Move String
+
+\c MOVSB                         ; A4                   [8086]
+\c MOVSW                         ; o16 A5               [8086]
+\c MOVSD                         ; o32 A5               [386]
+
+\c{MOVSB} copies the byte at \c{[DS:SI]} or \c{[DS:ESI]} to
+\c{[ES:DI]} or \c{[ES:EDI]}. It then increments or decrements
+(depending on the direction flag: increments if the flag is clear,
+decrements if it is set) \c{SI} and \c{DI} (or \c{ESI} and \c{EDI}).
+
+The registers used are \c{SI} and \c{DI} if the address size is 16
+bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use
+an address size not equal to the current \c{BITS} setting, you can
+use an explicit \i\c{a16} or \i\c{a32} prefix.
+
+The segment register used to load from \c{[SI]} or \c{[ESI]} can be
+overridden by using a segment register name as a prefix (for
+example, \c{es movsb}). The use of \c{ES} for the store to \c{[DI]}
+or \c{[EDI]} cannot be overridden.
+
+\c{MOVSW} and \c{MOVSD} work in the same way, but they copy a word
+or a doubleword instead of a byte, and increment or decrement the
+addressing registers by 2 or 4 instead of 1.
+
+The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
+\c{ECX} - again, the address size chooses which) times.
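+
+As a minimal sketch of a block copy, assuming \c{BITS 16} code in which
+\c{DS} and \c{ES} are already set up, and where \c{src} and \c{dst} are
+hypothetical labels used only for illustration:
+
+\c         cld                     ; clear DF so SI and DI increment
+\c         mov     si,src          ; DS:SI -> source (hypothetical label)
+\c         mov     di,dst          ; ES:DI -> destination (hypothetical label)
+\c         mov     cx,16           ; number of bytes to copy
+\c         rep movsb               ; copy CX bytes from DS:SI to ES:DI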
+
+
+\S{insMOVSD} \i\c{MOVSD}: Move Scalar Double-Precision FP Value
+
+\c MOVSD xmm1,xmm2/m64           ; F2 0F 10 /r     [WILLAMETTE,SSE2]
+\c MOVSD xmm1/m64,xmm2           ; F2 0F 11 /r     [WILLAMETTE,SSE2]
+
+\c{MOVSD} moves a double-precision FP value from the source operand
+to the destination operand. When the source or destination is a
+register, the low-order FP value is read or written.
+
+
+\S{insMOVSS} \i\c{MOVSS}: Move Scalar Single-Precision FP Value
+
+\c MOVSS xmm1,xmm2/m32           ; F3 0F 10 /r     [KATMAI,SSE]
+\c MOVSS xmm1/m32,xmm2           ; F3 0F 11 /r     [KATMAI,SSE]
+
+\c{MOVSS} moves a single-precision FP value from the source operand
+to the destination operand. When the source or destination is a
+register, the low-order FP value is read or written.
+
+
+\S{insMOVSX} \i\c{MOVSX}, \i\c{MOVZX}: Move Data with Sign or Zero Extend
+
+\c MOVSX reg16,r/m8              ; o16 0F BE /r         [386]
+\c MOVSX reg32,r/m8              ; o32 0F BE /r         [386]
+\c MOVSX reg32,r/m16             ; o32 0F BF /r         [386]
+
+\c MOVZX reg16,r/m8              ; o16 0F B6 /r         [386]
+\c MOVZX reg32,r/m8              ; o32 0F B6 /r         [386]
+\c MOVZX reg32,r/m16             ; o32 0F B7 /r         [386]
+
+\c{MOVSX} sign-extends its source (second) operand to the length of
+its destination (first) operand, and copies the result into the
+destination operand. \c{MOVZX} does the same, but zero-extends
+rather than sign-extending.
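+
+For illustration, a short sketch of the difference between the two
+instructions (the register choice is arbitrary):
+
+\c         mov     al,0x80
+\c         movsx   bx,al           ; BX = 0FF80h (sign bit copied upward)
+\c         movzx   cx,al           ; CX = 0080h (upper bits cleared)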
+
+
+\S{insMOVUPD} \i\c{MOVUPD}: Move Unaligned Packed Double-Precision FP Values
+
+\c MOVUPD xmm1,xmm2/mem128       ; 66 0F 10 /r     [WILLAMETTE,SSE2]
+\c MOVUPD xmm1/mem128,xmm2       ; 66 0F 11 /r     [WILLAMETTE,SSE2]
+
+\c{MOVUPD} moves a double quadword containing 2 packed double-precision
+FP values from the source operand to the destination. This instruction
+makes no assumptions about alignment of memory operands.
+
+To move data in and out of memory locations that are known to be on 16-byte
+boundaries, use the \c{MOVAPD} instruction (\k{insMOVAPD}).
+
+
+\S{insMOVUPS} \i\c{MOVUPS}: Move Unaligned Packed Single-Precision FP Values
+
+\c MOVUPS xmm1,xmm2/mem128       ; 0F 10 /r        [KATMAI,SSE]
+\c MOVUPS xmm1/mem128,xmm2       ; 0F 11 /r        [KATMAI,SSE]
+
+\c{MOVUPS} moves a double quadword containing 4 packed single-precision
+FP values from the source operand to the destination. This instruction
+makes no assumptions about alignment of memory operands.
+
+To move data in and out of memory locations that are known to be on 16-byte
+boundaries, use the \c{MOVAPS} instruction (\k{insMOVAPS}).
+
+
+\S{insMUL} \i\c{MUL}: Unsigned Integer Multiply
+
+\c MUL r/m8                      ; F6 /4                [8086]
+\c MUL r/m16                     ; o16 F7 /4            [8086]
+\c MUL r/m32                     ; o32 F7 /4            [386]
+
+\c{MUL} performs unsigned integer multiplication. The other operand
+to the multiplication, and the destination operand, are implicit, in
+the following way:
+
+\b For \c{MUL r/m8}, \c{AL} is multiplied by the given operand; the
+product is stored in \c{AX}.
+
+\b For \c{MUL r/m16}, \c{AX} is multiplied by the given operand;
+the product is stored in \c{DX:AX}.
+
+\b For \c{MUL r/m32}, \c{EAX} is multiplied by the given operand;
+the product is stored in \c{EDX:EAX}.
+
+Signed integer multiplication is performed by the \c{IMUL}
+instruction: see \k{insIMUL}.
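+
+A small worked example of the 8-bit form, with arbitrarily chosen
+values, shows how the product spills from \c{AL} into \c{AH}:
+
+\c         mov     al,200
+\c         mov     bl,3
+\c         mul     bl              ; AX = 600 (0258h): AH = 02h, AL = 58h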
+
+
+\S{insMULPD} \i\c{MULPD}: Packed Double-FP Multiply
+
+\c MULPD xmm1,xmm2/mem128        ; 66 0F 59 /r     [WILLAMETTE,SSE2]
+
+\c{MULPD} performs a SIMD multiply of the packed double-precision FP
+values in both operands, and stores the results in the destination register.
+
+
+\S{insMULPS} \i\c{MULPS}: Packed Single-FP Multiply
+
+\c MULPS xmm1,xmm2/mem128        ; 0F 59 /r        [KATMAI,SSE]
+
+\c{MULPS} performs a SIMD multiply of the packed single-precision FP
+values in both operands, and stores the results in the destination register.
+
+
+\S{insMULSD} \i\c{MULSD}: Scalar Double-FP Multiply
+
+\c MULSD xmm1,xmm2/mem64         ; F2 0F 59 /r     [WILLAMETTE,SSE2]
+
+\c{MULSD} multiplies the lowest double-precision FP values of both
+operands, and stores the result in the low quadword of xmm1.
+
+
+\S{insMULSS} \i\c{MULSS}: Scalar Single-FP Multiply
+
+\c MULSS xmm1,xmm2/mem32         ; F3 0F 59 /r     [KATMAI,SSE]
+
+\c{MULSS} multiplies the lowest single-precision FP values of both
+operands, and stores the result in the low doubleword of xmm1.
+
+
+\S{insNEG} \i\c{NEG}, \i\c{NOT}: Two's and One's Complement
+
+\c NEG r/m8                      ; F6 /3                [8086]
+\c NEG r/m16                     ; o16 F7 /3            [8086]
+\c NEG r/m32                     ; o32 F7 /3            [386]
+
+\c NOT r/m8                      ; F6 /2                [8086]
+\c NOT r/m16                     ; o16 F7 /2            [8086]
+\c NOT r/m32                     ; o32 F7 /2            [386]
+
+\c{NEG} replaces the contents of its operand by the two's complement
+negation (invert all the bits and then add one) of the original
+value. \c{NOT}, similarly, performs one's complement (inverts all
+the bits).
+
+
+\S{insNOP} \i\c{NOP}: No Operation
+
+\c NOP                           ; 90                   [8086]
+
+\c{NOP} performs no operation. Its opcode is the same as that
+generated by \c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the
+processor mode; see \k{insXCHG}).
+
+
+\S{insOR} \i\c{OR}: Bitwise OR
+
+\c OR r/m8,reg8                  ; 08 /r                [8086]
+\c OR r/m16,reg16                ; o16 09 /r            [8086]
+\c OR r/m32,reg32                ; o32 09 /r            [386]
+
+\c OR reg8,r/m8                  ; 0A /r                [8086]
+\c OR reg16,r/m16                ; o16 0B /r            [8086]
+\c OR reg32,r/m32                ; o32 0B /r            [386]
+
+\c OR r/m8,imm8                  ; 80 /1 ib             [8086]
+\c OR r/m16,imm16                ; o16 81 /1 iw         [8086]
+\c OR r/m32,imm32                ; o32 81 /1 id         [386]
+
+\c OR r/m16,imm8                 ; o16 83 /1 ib         [8086]
+\c OR r/m32,imm8                 ; o32 83 /1 ib         [386]
+
+\c OR AL,imm8                    ; 0C ib                [8086]
+\c OR AX,imm16                   ; o16 0D iw            [8086]
+\c OR EAX,imm32                  ; o32 0D id            [386]
+
+\c{OR} performs a bitwise OR operation between its two operands
+(i.e. each bit of the result is 1 if and only if at least one of the
+corresponding bits of the two inputs was 1), and stores the result
+in the destination (first) operand.
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
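+
+As a sketch of the sign-extended forms (the registers and constants
+are arbitrary, and the encodings noted are those listed above):
+
+\c         or      bx,byte 0x40    ; o16 83 /1 ib: operand becomes 0040h
+\c         or      ebx,byte -16    ; o32 83 /1 ib: operand becomes 0FFFFFFF0h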
+
+The MMX instruction \c{POR} (see \k{insPOR}) performs the same
+operation on the 64-bit MMX registers.
+
+
+\S{insORPD} \i\c{ORPD}: Bit-wise Logical OR of Double-Precision FP Data
+
+\c ORPD xmm1,xmm2/m128           ; 66 0F 56 /r     [WILLAMETTE,SSE2]
+
+\c{ORPD} performs a bit-wise logical OR between xmm1 and xmm2/mem,
+and stores the result in xmm1. If the source operand is a memory
+location, it must be aligned to a 16-byte boundary.
+
+
+\S{insORPS} \i\c{ORPS}: Bit-wise Logical OR of Single-Precision FP Data
+
+\c ORPS xmm1,xmm2/m128           ; 0F 56 /r        [KATMAI,SSE]
+
+\c{ORPS} performs a bit-wise logical OR between xmm1 and xmm2/mem,
+and stores the result in xmm1. If the source operand is a memory
+location, it must be aligned to a 16-byte boundary.
+
+
+\S{insOUT} \i\c{OUT}: Output Data to I/O Port
+
+\c OUT imm8,AL                   ; E6 ib                [8086]
+\c OUT imm8,AX                   ; o16 E7 ib            [8086]
+\c OUT imm8,EAX                  ; o32 E7 ib            [386]
+\c OUT DX,AL                     ; EE                   [8086]
+\c OUT DX,AX                     ; o16 EF               [8086]
+\c OUT DX,EAX                    ; o32 EF               [386]
+
+\c{OUT} writes the contents of the given source register to the
+specified I/O port. The port number may be specified as an immediate
+value if it is between 0 and 255, and otherwise must be stored in
+\c{DX}. See also \c{IN} (\k{insIN}).
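+
+For example, a sketch of both addressing forms; the port numbers
+\c{0x20} and \c{0x3F8} are chosen purely for illustration:
+
+\c         mov     al,0x20
+\c         out     0x20,al         ; port fits in 8 bits: immediate form
+\c         mov     dx,0x3F8
+\c         out     dx,al           ; port above 255: must go through DX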
+
+
+\S{insOUTSB} \i\c{OUTSB}, \i\c{OUTSW}, \i\c{OUTSD}: Output String to I/O Port
+
+\c OUTSB                         ; 6E                   [186]
+\c OUTSW                         ; o16 6F               [186]
+\c OUTSD                         ; o32 6F               [386]
+
+\c{OUTSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} and writes
+it to the I/O port specified in \c{DX}. It then increments or
+decrements (depending on the direction flag: increments if the flag
+is clear, decrements if it is set) \c{SI} or \c{ESI}.
+
+The register used is \c{SI} if the address size is 16 bits, and
+\c{ESI} if it is 32 bits. If you need to use an address size not
+equal to the current \c{BITS} setting, you can use an explicit
+\i\c{a16} or \i\c{a32} prefix.
+
+The segment register used to load from \c{[SI]} or \c{[ESI]} can be
+overridden by using a segment register name as a prefix (for
+example, \c{es outsb}).
+
+\c{OUTSW} and \c{OUTSD} work in the same way, but they output a
+word or a doubleword instead of a byte, and increment or decrement
+the addressing registers by 2 or 4 instead of 1.
+
+The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
+\c{ECX} - again, the address size chooses which) times.
+
+
+\S{insPACKSSDW} \i\c{PACKSSDW}, \i\c{PACKSSWB}, \i\c{PACKUSWB}: Pack Data
+
+\c PACKSSDW mm1,mm2/m64          ; 0F 6B /r             [PENT,MMX]
+\c PACKSSWB mm1,mm2/m64          ; 0F 63 /r             [PENT,MMX]
+\c PACKUSWB mm1,mm2/m64          ; 0F 67 /r             [PENT,MMX]
+
+\c PACKSSDW xmm1,xmm2/m128       ; 66 0F 6B /r     [WILLAMETTE,SSE2]
+\c PACKSSWB xmm1,xmm2/m128       ; 66 0F 63 /r     [WILLAMETTE,SSE2]
+\c PACKUSWB xmm1,xmm2/m128       ; 66 0F 67 /r     [WILLAMETTE,SSE2]
+
+All these instructions start by combining the source and destination
+operands, and then splitting the result into smaller sections, which
+they then pack into the destination register. The \c{MMX} versions pack
+two 64-bit operands into one 64-bit register, while the \c{SSE}
+versions pack two 128-bit operands into one 128-bit register.
+
+\b \c{PACKSSWB} splits the combined value into words, and then reduces
+the words to bytes, using signed saturation. It then packs the bytes
+into the destination register in the same order the words were in.
+
+\b \c{PACKSSDW} performs the same operation as \c{PACKSSWB}, except that
+it reduces doublewords to words, then packs them into the destination
+register.
+
+\b \c{PACKUSWB} performs the same operation as \c{PACKSSWB}, except that
+it uses unsigned saturation when reducing the size of the elements.
+
+To perform signed saturation on a number, if it is too large it is
+replaced by the largest signed number (\c{7FFFh} or \c{7Fh}) that
+\e{will} fit, and if it is too small it is replaced by the smallest
+signed number (\c{8000h} or \c{80h}) that will fit. For the unsigned
+saturation performed by \c{PACKUSWB}, the input words are still treated
+as signed: a negative input is replaced by zero, and an input that is
+too large is replaced by the largest unsigned number (\c{FFh}) that
+will fit.
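+
+As a worked illustration of signed saturation from words to bytes, as
+done by \c{PACKSSWB} (the input values are arbitrary):
+
+\c    word 0123h (+291)  -> byte 7Fh (too large, clamped to +127)
+\c    word FF00h (-256)  -> byte 80h (too small, clamped to -128)
+\c    word 0042h (+66)   -> byte 42h (fits, unchanged)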
+
+
+\S{insPADDB} \i\c{PADDB}, \i\c{PADDW}, \i\c{PADDD}: Add Packed Integers
+
+\c PADDB mm1,mm2/m64             ; 0F FC /r             [PENT,MMX]
+\c PADDW mm1,mm2/m64             ; 0F FD /r             [PENT,MMX]
+\c PADDD mm1,mm2/m64             ; 0F FE /r             [PENT,MMX]
+
+\c PADDB xmm1,xmm2/m128          ; 66 0F FC /r     [WILLAMETTE,SSE2]
+\c PADDW xmm1,xmm2/m128          ; 66 0F FD /r     [WILLAMETTE,SSE2]
+\c PADDD xmm1,xmm2/m128          ; 66 0F FE /r     [WILLAMETTE,SSE2]
+
+\c{PADDx} performs packed addition of the two operands, storing the
+result in the destination (first) operand.
+
+\b \c{PADDB} treats the operands as packed bytes, and adds each byte
+individually;
+
+\b \c{PADDW} treats the operands as packed words;
+
+\b \c{PADDD} treats its operands as packed doublewords.
+
+When an individual result is too large to fit in its destination, it
+is wrapped around and the low bits are stored, with the carry bit
+discarded.
+
+
+\S{insPADDQ} \i\c{PADDQ}: Add Packed Quadword Integers
+
+\c PADDQ mm1,mm2/m64             ; 0F D4 /r             [PENT,MMX]
+
+\c PADDQ xmm1,xmm2/m128          ; 66 0F D4 /r     [WILLAMETTE,SSE2]
+
+\c{PADDQ} adds the quadwords in the source and destination operands, and
+stores the result in the destination register.
+
+When an individual result is too large to fit in its destination, it
+is wrapped around and the low bits are stored, with the carry bit
+discarded.
+
+
+\S{insPADDSB} \i\c{PADDSB}, \i\c{PADDSW}: Add Packed Signed Integers With Saturation
+
+\c PADDSB mm1,mm2/m64            ; 0F EC /r             [PENT,MMX]
+\c PADDSW mm1,mm2/m64            ; 0F ED /r             [PENT,MMX]
+
+\c PADDSB xmm1,xmm2/m128         ; 66 0F EC /r     [WILLAMETTE,SSE2]
+\c PADDSW xmm1,xmm2/m128         ; 66 0F ED /r     [WILLAMETTE,SSE2]
+
+\c{PADDSx} performs packed addition of the two operands, storing the
+result in the destination (first) operand.
+\c{PADDSB} treats the operands as packed bytes, and adds each byte
+individually; and \c{PADDSW} treats the operands as packed words.
+
+When an individual result is too large to fit in its destination, a
+saturated value is stored. The resulting value is the value with the
+largest magnitude of the same sign as the result which will fit in
+the available space.
+
+
+\S{insPADDSIW} \i\c{PADDSIW}: MMX Packed Addition to Implicit Destination
+
+\c PADDSIW mmxreg,r/m64          ; 0F 51 /r             [CYRIX,MMX]
+
+\c{PADDSIW}, specific to the Cyrix extensions to the MMX instruction
+set, performs the same function as \c{PADDSW}, except that the result
+is placed in an implied register.
+
+To work out the implied register, invert the lowest bit in the register
+number. So \c{PADDSIW MM0,MM2} would put the result in \c{MM1}, but
+\c{PADDSIW MM1,MM2} would put the result in \c{MM0}.
+
+
+\S{insPADDUSB} \i\c{PADDUSB}, \i\c{PADDUSW}: Add Packed Unsigned Integers With Saturation
+
+\c PADDUSB mm1,mm2/m64           ; 0F DC /r             [PENT,MMX]
+\c PADDUSW mm1,mm2/m64           ; 0F DD /r             [PENT,MMX]
+
+\c PADDUSB xmm1,xmm2/m128         ; 66 0F DC /r    [WILLAMETTE,SSE2]
+\c PADDUSW xmm1,xmm2/m128         ; 66 0F DD /r    [WILLAMETTE,SSE2]
+
+\c{PADDUSx} performs packed addition of the two operands, storing the
+result in the destination (first) operand.
+\c{PADDUSB} treats the operands as packed bytes, and adds each byte
+individually; and \c{PADDUSW} treats the operands as packed words.
+
+When an individual result is too large to fit in its destination, a
+saturated value is stored. The resulting value is the maximum value
+that will fit in the available space.
+
+
+\S{insPAND} \i\c{PAND}, \i\c{PANDN}: MMX Bitwise AND and AND-NOT
+
+\c PAND mm1,mm2/m64              ; 0F DB /r             [PENT,MMX]
+\c PANDN mm1,mm2/m64             ; 0F DF /r             [PENT,MMX]
+
+\c PAND xmm1,xmm2/m128           ; 66 0F DB /r     [WILLAMETTE,SSE2]
+\c PANDN xmm1,xmm2/m128          ; 66 0F DF /r     [WILLAMETTE,SSE2]
+
+
+\c{PAND} performs a bitwise AND operation between its two operands
+(i.e. each bit of the result is 1 if and only if the corresponding
+bits of the two inputs were both 1), and stores the result in the
+destination (first) operand.
+
+\c{PANDN} performs the same operation, but performs a one's
+complement operation on the destination (first) operand first.
+
+
+\S{insPAUSE} \i\c{PAUSE}: Spin Loop Hint
+
+\c PAUSE                         ; F3 90           [WILLAMETTE,SSE2]
+
+\c{PAUSE} provides a hint to the processor that the following code
+is a spin loop. This improves processor performance by bypassing
+possible memory order violations. On older processors, this instruction
+operates as a \c{NOP}.
+
+
+\S{insPAVEB} \i\c{PAVEB}: MMX Packed Average
+
+\c PAVEB mmxreg,r/m64            ; 0F 50 /r             [CYRIX,MMX]
+
+\c{PAVEB}, specific to the Cyrix MMX extensions, treats its two
+operands as vectors of eight unsigned bytes, and calculates the
+average of the corresponding bytes in the operands. The resulting
+vector of eight averages is stored in the first operand.
+
+This opcode maps to \c{MOVMSKPS r32, xmm} on processors that support
+the SSE instruction set.
+
+
+\S{insPAVGB} \i\c{PAVGB}, \i\c{PAVGW}: Average Packed Integers
+
+\c PAVGB mm1,mm2/m64             ; 0F E0 /r        [KATMAI,MMX]
+\c PAVGW mm1,mm2/m64             ; 0F E3 /r        [KATMAI,MMX,SM]
+
+\c PAVGB xmm1,xmm2/m128          ; 66 0F E0 /r     [WILLAMETTE,SSE2]
+\c PAVGW xmm1,xmm2/m128          ; 66 0F E3 /r     [WILLAMETTE,SSE2]
+
+\c{PAVGB} and \c{PAVGW} add the unsigned data elements of the source
+operand to the unsigned data elements of the destination register,
+then add 1 to the temporary results. The results of the add are then
+each independently right-shifted by one bit position. The high order
+bits of each element are filled with the carry bits of the corresponding
+sum.
+
+\b \c{PAVGB} operates on packed unsigned bytes, and
+
+\b \c{PAVGW} operates on packed unsigned words.
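+
+As a worked illustration for a single byte element (the values are
+arbitrary), the rounding average is (dst + src + 1) shifted right by
+one bit, with the carry out of the sum kept as the top bit:
+
+\c    dst byte FFh, src byte 01h:  (255 + 1 + 1) >> 1 = 128 = 80h
+\c    dst byte 03h, src byte 04h:  (3 + 4 + 1) >> 1   = 4   = 04h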
+
+
+\S{insPAVGUSB} \i\c{PAVGUSB}: Average of unsigned packed 8-bit values
+
+\c PAVGUSB mm1,mm2/m64           ; 0F 0F /r BF          [PENT,3DNOW]
+
+\c{PAVGUSB} adds the unsigned data elements of the source operand to
+the unsigned data elements of the destination register, then adds 1
+to the temporary results. The results of the add are then each
+independently right-shifted by one bit position. The high order bits
+of each element are filled with the carry bits of the corresponding
+sum.
+
+This instruction performs exactly the same operations as the \c{PAVGB}
+\c{MMX} instruction (\k{insPAVGB}).
+
+
+\S{insPCMPEQB} \i\c{PCMPxx}: Compare Packed Integers
+
+\c PCMPEQB mm1,mm2/m64           ; 0F 74 /r             [PENT,MMX]
+\c PCMPEQW mm1,mm2/m64           ; 0F 75 /r             [PENT,MMX]
+\c PCMPEQD mm1,mm2/m64           ; 0F 76 /r             [PENT,MMX]
+
+\c PCMPGTB mm1,mm2/m64           ; 0F 64 /r             [PENT,MMX]
+\c PCMPGTW mm1,mm2/m64           ; 0F 65 /r             [PENT,MMX]
+\c PCMPGTD mm1,mm2/m64           ; 0F 66 /r             [PENT,MMX]
+
+\c PCMPEQB xmm1,xmm2/m128        ; 66 0F 74 /r     [WILLAMETTE,SSE2]
+\c PCMPEQW xmm1,xmm2/m128        ; 66 0F 75 /r     [WILLAMETTE,SSE2]
+\c PCMPEQD xmm1,xmm2/m128        ; 66 0F 76 /r     [WILLAMETTE,SSE2]
+
+\c PCMPGTB xmm1,xmm2/m128        ; 66 0F 64 /r     [WILLAMETTE,SSE2]
+\c PCMPGTW xmm1,xmm2/m128        ; 66 0F 65 /r     [WILLAMETTE,SSE2]
+\c PCMPGTD xmm1,xmm2/m128        ; 66 0F 66 /r     [WILLAMETTE,SSE2]
+
+The \c{PCMPxx} instructions all treat their operands as vectors of
+bytes, words, or doublewords; corresponding elements of the source
+and destination are compared, and the corresponding element of the
+destination (first) operand is set to all zeros or all ones
+depending on the result of the comparison.
+
+\b \c{PCMPxxB} treats the operands as vectors of bytes;
+
+\b \c{PCMPxxW} treats the operands as vectors of words;
+
+\b \c{PCMPxxD} treats the operands as vectors of doublewords;
+
+\b \c{PCMPEQx} sets the corresponding element of the destination
+operand to all ones if the two elements compared are equal;
+
+\b \c{PCMPGTx} sets the destination element to all ones if the element
+of the first (destination) operand is greater (treated as a signed
+integer) than that of the second (source) operand.
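+
+As a small sketch, comparing a register with itself gives a
+predictable mask whatever the register held beforehand (the register
+choice is arbitrary):
+
+\c         pcmpeqd mm0,mm0         ; every element equals itself: MM0 = all ones
+\c         pcmpgtb mm1,mm1         ; no element exceeds itself: MM1 = all zeros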
+
+
+\S{insPDISTIB} \i\c{PDISTIB}: MMX Packed Distance and Accumulate
+with Implied Register
+
+\c PDISTIB mm,m64                ; 0F 54 /r             [CYRIX,MMX]
+
+\c{PDISTIB}, specific to the Cyrix MMX extensions, treats its two
+input operands as vectors of eight unsigned bytes. For each byte
+position, it finds the absolute difference between the bytes in that
+position in the two input operands, and adds that value to the byte
+in the same position in the implied output register. The addition is
+saturated to an unsigned byte in the same way as \c{PADDUSB}.
+
+To work out the implied register, invert the lowest bit in the register
+number. So \c{PDISTIB MM0,M64} would put the result in \c{MM1}, but
+\c{PDISTIB MM1,M64} would put the result in \c{MM0}.
+
+Note that \c{PDISTIB} cannot take a register as its second source
+operand.
+
+Operation:
+
+\c    dstI[0-7]     := dstI[0-7]   + ABS(src0[0-7] - src1[0-7]),
+\c    dstI[8-15]    := dstI[8-15]  + ABS(src0[8-15] - src1[8-15]),
+\c    .......
+\c    .......
+\c    dstI[56-63]   := dstI[56-63] + ABS(src0[56-63] - src1[56-63]).
+
+
+\S{insPEXTRW} \i\c{PEXTRW}: Extract Word
+
+\c PEXTRW reg32,mm,imm8          ; 0F C5 /r ib     [KATMAI,MMX]
+\c PEXTRW reg32,xmm,imm8         ; 66 0F C5 /r ib  [WILLAMETTE,SSE2]
+
+\c{PEXTRW} moves the word in the source register (second operand)
+that is pointed to by the count operand (third operand), into the
+lower half of a 32-bit general purpose register. The upper half of
+the register is cleared to all 0s.
+
+When the source operand is an \c{MMX} register, the two least
+significant bits of the count specify the source word. When it is
+an \c{SSE} register, the three least significant bits specify the
+word location.
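+
+For example, a sketch with arbitrarily chosen registers and word
+indices:
+
+\c         pextrw  eax,mm0,2       ; EAX = bits 32-47 of MM0, zero-extended
+\c         pextrw  eax,xmm1,7      ; EAX = bits 112-127 of XMM1, zero-extended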
+
+
+\S{insPF2ID} \i\c{PF2ID}: Packed Single-Precision FP to Integer Convert
+
+\c PF2ID mm1,mm2/m64             ; 0F 0F /r 1D          [PENT,3DNOW]
+
+\c{PF2ID} converts two single-precision FP values in the source operand
+to signed 32-bit integers, using truncation, and stores them in the
+destination operand. Source values that are outside the range supported
+by the destination are saturated to the largest absolute value of the
+same sign.
+
+
+\S{insPF2IW} \i\c{PF2IW}: Packed Single-Precision FP to Integer Word Convert
+
+\c PF2IW mm1,mm2/m64             ; 0F 0F /r 1C          [PENT,3DNOW]
+
+\c{PF2IW} converts two single-precision FP values in the source operand
+to signed 16-bit integers, using truncation, and stores them in the
+destination operand. Source values that are outside the range supported
+by the destination are saturated to the largest absolute value of the
+same sign.
+
+\b In the K6-2 and K6-III, the 16-bit value is zero-extended to 32 bits
+before storing.
+
+\b In the K6-2+, K6-III+ and Athlon processors, the value is sign-extended
+to 32 bits before storing.
+
+
+\S{insPFACC} \i\c{PFACC}: Packed Single-Precision FP Accumulate
+
+\c PFACC mm1,mm2/m64             ; 0F 0F /r AE          [PENT,3DNOW]
+
+\c{PFACC} adds the two single-precision FP values from the destination
+operand together, then adds the two single-precision FP values from the
+source operand, and places the results in the low and high doublewords
+of the destination operand.
+
+The operation is:
+
+\c    dst[0-31]   := dst[0-31] + dst[32-63],
+\c    dst[32-63]  := src[0-31] + src[32-63].
+
+
+\S{insPFADD} \i\c{PFADD}: Packed Single-Precision FP Addition
+
+\c PFADD mm1,mm2/m64             ; 0F 0F /r 9E          [PENT,3DNOW]
+
+\c{PFADD} performs addition on each of two packed single-precision
+FP value pairs.
+
+\c    dst[0-31]   := dst[0-31]  + src[0-31],
+\c    dst[32-63]  := dst[32-63] + src[32-63].
+
+
+\S{insPFCMP} \i\c{PFCMPxx}: Packed Single-Precision FP Compare
+\I\c{PFCMPEQ} \I\c{PFCMPGE} \I\c{PFCMPGT}
+
+\c PFCMPEQ mm1,mm2/m64           ; 0F 0F /r B0          [PENT,3DNOW]
+\c PFCMPGE mm1,mm2/m64           ; 0F 0F /r 90          [PENT,3DNOW]
+\c PFCMPGT mm1,mm2/m64           ; 0F 0F /r A0          [PENT,3DNOW]
+
+The \c{PFCMPxx} instructions compare the packed single-precision FP values
+in the source and destination operands, and set the destination
+according to the result. If the condition is true, the destination is
+set to all 1s, otherwise it's set to all 0s.
+
+\b \c{PFCMPEQ} tests whether dst == src;
+
+\b \c{PFCMPGE} tests whether dst >= src;
+
+\b \c{PFCMPGT} tests whether dst >  src.
+
+
+\S{insPFMAX} \i\c{PFMAX}: Packed Single-Precision FP Maximum
+
+\c PFMAX mm1,mm2/m64             ; 0F 0F /r A4          [PENT,3DNOW]
+
+\c{PFMAX} returns the higher of each pair of single-precision FP values.
+If the higher value is zero, it is returned as positive zero.
+
+
+\S{insPFMIN} \i\c{PFMIN}: Packed Single-Precision FP Minimum
+
+\c PFMIN mm1,mm2/m64             ; 0F 0F /r 94          [PENT,3DNOW]
+
+\c{PFMIN} returns the lower of each pair of single-precision FP values.
+If the lower value is zero, it is returned as positive zero.
+
+
+\S{insPFMUL} \i\c{PFMUL}: Packed Single-Precision FP Multiply
+
+\c PFMUL mm1,mm2/m64             ; 0F 0F /r B4          [PENT,3DNOW]
+
+\c{PFMUL} returns the product of each pair of single-precision FP values.
+
+\c    dst[0-31]  := dst[0-31]  * src[0-31],
+\c    dst[32-63] := dst[32-63] * src[32-63].
+
+
+\S{insPFNACC} \i\c{PFNACC}: Packed Single-Precision FP Negative Accumulate
+
+\c PFNACC mm1,mm2/m64            ; 0F 0F /r 8A          [PENT,3DNOW]
+
+\c{PFNACC} performs a negative accumulate of the two single-precision
+FP values in the source and destination registers. The result of the
+accumulate from the destination register is stored in the low doubleword
+of the destination, and the result of the source accumulate is stored in
+the high doubleword of the destination register.
+
+The operation is:
+
+\c    dst[0-31]  := dst[0-31] - dst[32-63],
+\c    dst[32-63] := src[0-31] - src[32-63].
+
+
+\S{insPFPNACC} \i\c{PFPNACC}: Packed Single-Precision FP Mixed Accumulate
+
+\c PFPNACC mm1,mm2/m64           ; 0F 0F /r 8E          [PENT,3DNOW]
+
+\c{PFPNACC} performs a positive accumulate of the two single-precision
+FP values in the source register and a negative accumulate of the
+destination register. The result of the accumulate from the destination
+register is stored in the low doubleword of the destination, and the
+result of the source accumulate is stored in the high doubleword of the
+destination register.
+
+The operation is:
+
+\c    dst[0-31]  := dst[0-31] - dst[32-63],
+\c    dst[32-63] := src[0-31] + src[32-63].
+
+
+\S{insPFRCP} \i\c{PFRCP}: Packed Single-Precision FP Reciprocal Approximation
+
+\c PFRCP mm1,mm2/m64             ; 0F 0F /r 96          [PENT,3DNOW]
+
+\c{PFRCP} performs a low precision estimate of the reciprocal of the
+low-order single-precision FP value in the source operand, storing the
+result in both halves of the destination register. The result is accurate
+to 14 bits.
+
+For higher precision reciprocals, this instruction should be followed by
+two more instructions: \c{PFRCPIT1} (\k{insPFRCPIT1}) and \c{PFRCPIT2}
+(\k{insPFRCPIT2}). This gives 24-bit accuracy. For more details,
+see the AMD 3DNow! technology manual.
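+
+A minimal sketch of the full three-instruction sequence, assuming the
+value x to be inverted is in the low doubleword of \c{MM0}; the
+register choice is arbitrary, and the operand order follows the
+descriptions of \c{PFRCPIT1} and \c{PFRCPIT2} in this appendix:
+
+\c         pfrcp    mm1,mm0        ; MM1 = low-precision estimate of 1/x
+\c         pfrcpit1 mm0,mm1        ; first step: MM0 = original x, MM1 = estimate
+\c         pfrcpit2 mm0,mm1        ; final step: MM0 = 1/x to 24 bits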
+
+
+\S{insPFRCPIT1} \i\c{PFRCPIT1}: Packed Single-Precision FP Reciprocal,
+First Iteration Step
+
+\c PFRCPIT1 mm1,mm2/m64          ; 0F 0F /r A6          [PENT,3DNOW]
+
+\c{PFRCPIT1} performs the first intermediate step in the calculation of
+the reciprocal of a single-precision FP value. The first source value
+(\c{mm1}) is the original value, and the second source value (\c{mm2/m64})
+is the result of a \c{PFRCP} instruction.
+
+For the final step in a reciprocal, returning the full 24-bit accuracy
+of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For
+more details, see the AMD 3DNow! technology manual.
+
+
+\S{insPFRCPIT2} \i\c{PFRCPIT2}: Packed Single-Precision FP
+Reciprocal/ Reciprocal Square Root, Second Iteration Step
+
+\c PFRCPIT2 mm1,mm2/m64          ; 0F 0F /r B6          [PENT,3DNOW]
+
+\c{PFRCPIT2} performs the second and final intermediate step in the
+calculation of a reciprocal or reciprocal square root, refining the
+values returned by the \c{PFRCP} and \c{PFRSQRT} instructions,
+respectively.
+
+The first source value (\c{mm1}) is the output of either a \c{PFRCPIT1}
+or a \c{PFRSQIT1} instruction, and the second source is the output of
+either the \c{PFRCP} or the \c{PFRSQRT} instruction. For more details,
+see the AMD 3DNow! technology manual.
+
+
+\S{insPFRSQIT1} \i\c{PFRSQIT1}: Packed Single-Precision FP Reciprocal
+Square Root, First Iteration Step
+
+\c PFRSQIT1 mm1,mm2/m64          ; 0F 0F /r A7          [PENT,3DNOW]
+
+\c{PFRSQIT1} performs the first intermediate step in the calculation of
+the reciprocal square root of a single-precision FP value. The first
+source value (\c{mm1}) is the square of the result of a \c{PFRSQRT}
+instruction, and the second source value (\c{mm2/m64}) is the original
+value.
+
+For the final step in a calculation, returning the full 24-bit accuracy
+of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For
+more details, see the AMD 3DNow! technology manual.
+
+
+\S{insPFRSQRT} \i\c{PFRSQRT}: Packed Single-Precision FP Reciprocal
+Square Root Approximation
+
+\c PFRSQRT mm1,mm2/m64           ; 0F 0F /r 97          [PENT,3DNOW]
+
+\c{PFRSQRT} performs a low precision estimate of the reciprocal square
+root of the low-order single-precision FP value in the source operand,
+storing the result in both halves of the destination register. The result
+is accurate to 15 bits.
+
+For higher precision reciprocal square roots, this instruction should be
+followed by two more instructions: \c{PFRSQIT1} (\k{insPFRSQIT1}) and
+\c{PFRCPIT2} (\k{insPFRCPIT2}). This will result in 24-bit accuracy. For
+more details, see the AMD 3DNow! technology manual.
+
+
+\S{insPFSUB} \i\c{PFSUB}: Packed Single-Precision FP Subtract
+
+\c PFSUB mm1,mm2/m64             ; 0F 0F /r 9A          [PENT,3DNOW]
+
+\c{PFSUB} subtracts the single-precision FP values in the source from
+those in the destination, and stores the result in the destination
+operand.
+
+\c    dst[0-31]  := dst[0-31]  - src[0-31],
+\c    dst[32-63] := dst[32-63] - src[32-63].
+
+
+\S{insPFSUBR} \i\c{PFSUBR}: Packed Single-Precision FP Reverse Subtract
+
+\c PFSUBR mm1,mm2/m64            ; 0F 0F /r AA          [PENT,3DNOW]
+
+\c{PFSUBR} subtracts the single-precision FP values in the destination
+from those in the source, and stores the result in the destination
+operand.
+
+\c    dst[0-31]  := src[0-31]  - dst[0-31],
+\c    dst[32-63] := src[32-63] - dst[32-63].
+
+
+\S{insPI2FD} \i\c{PI2FD}: Packed Doubleword Integer to Single-Precision FP Convert
+
+\c PI2FD mm1,mm2/m64             ; 0F 0F /r 0D          [PENT,3DNOW]
+
+\c{PI2FD} converts two signed 32-bit integers in the source operand
+to single-precision FP values, using truncation of significant digits,
+and stores them in the destination operand.
+
+
+\S{insPF2IW} \i\c{PI2FW}: Packed Word Integer to Single-Precision FP Convert
+
+\c PI2FW mm1,mm2/m64             ; 0F 0F /r 0C          [PENT,3DNOW]
+
+\c{PI2FW} converts two signed 16-bit integers in the source operand
+to single-precision FP values, and stores them in the destination
+operand. The input values are in the low word of each doubleword.
+
+
+\S{insPINSRW} \i\c{PINSRW}: Insert Word
+
+\c PINSRW mm,r16/r32/m16,imm8    ; 0F C4 /r ib     [KATMAI,MMX]
+\c PINSRW xmm,r16/r32/m16,imm8   ; 66 0F C4 /r ib  [WILLAMETTE,SSE2]
+
+\c{PINSRW} takes a word from a 16-bit register (or the low half of a
+32-bit register), or from memory, and inserts it at the word position
+in the destination register selected by the count operand (third
+operand). If the destination is an \c{MMX} register, the low two bits
+of the count byte are used; if it is an \c{XMM} register, the low three
+bits are used. The insertion is done in such a way that the other
+words in the destination register are left untouched.
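+
+As a rough illustration of how the count operand selects the word
+position, the following Python sketch models the insertion on a register
+held as a plain integer. It is not NASM source; the function name
+\c{pinsrw} and its \c{width} parameter are invented for this example.
+

```python
def pinsrw(dst, word, count, width=64):
    """Model PINSRW: width=64 for an MMX register, 128 for an XMM register."""
    lanes = width // 16            # 4 words (MMX) or 8 words (XMM)
    pos = count & (lanes - 1)      # only the low 2 or low 3 bits of the count are used
    shift = 16 * pos
    mask = 0xFFFF << shift
    # Clear the selected word, then drop the new word in; other words are untouched.
    return (dst & ~mask) | ((word & 0xFFFF) << shift)


print(hex(pinsrw(0x1111222233334444, 0xABCD, 2)))  # word position 2 is bits 32-47
```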
+
+
+\S{insPMACHRIW} \i\c{PMACHRIW}: Packed Multiply and Accumulate with Rounding
+
+\c PMACHRIW mm,m64               ; 0F 5E /r             [CYRIX,MMX]
+
+\c{PMACHRIW} takes two packed 16-bit integer inputs, multiplies the
+values in the inputs, rounds on bit 15 of each result, then adds bits
+15-30 of each result to the corresponding position of the \e{implied}
+destination register.
+
+The operation of this instruction is:
+
+\c    dstI[0-15]  := dstI[0-15]  + (mm[0-15] *m64[0-15]
+\c                                           + 0x00004000)[15-30],
+\c    dstI[16-31] := dstI[16-31] + (mm[16-31]*m64[16-31]
+\c                                           + 0x00004000)[15-30],
+\c    dstI[32-47] := dstI[32-47] + (mm[32-47]*m64[32-47]
+\c                                           + 0x00004000)[15-30],
+\c    dstI[48-63] := dstI[48-63] + (mm[48-63]*m64[48-63]
+\c                                           + 0x00004000)[15-30].
+
+Note that \c{PMACHRIW} cannot take a register as its second source
+operand.
+
+
+\S{insPMADDWD} \i\c{PMADDWD}: MMX Packed Multiply and Add
+
+\c PMADDWD mm1,mm2/m64           ; 0F F5 /r             [PENT,MMX]
+\c PMADDWD xmm1,xmm2/m128        ; 66 0F F5 /r     [WILLAMETTE,SSE2]
+
+\c{PMADDWD} treats its two inputs as vectors of signed words. It
+multiplies corresponding elements of the two operands, giving doubleword
+results. These are then added together in pairs and stored in the
+destination operand.
+
+The operation of this instruction is:
+
+\c    dst[0-31]   := (dst[0-15] * src[0-15])
+\c                                + (dst[16-31] * src[16-31]);
+\c    dst[32-63]  := (dst[32-47] * src[32-47])
+\c                                + (dst[48-63] * src[48-63]);
+
+The following additionally apply to the \c{SSE2} (\c{XMM}) version of the
+instruction:
+
+\c    dst[64-95]  := (dst[64-79] * src[64-79])
+\c                                + (dst[80-95] * src[80-95]);
+\c    dst[96-127] := (dst[96-111] * src[96-111])
+\c                                + (dst[112-127] * src[112-127]).
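+
+The operation can also be sketched in Python, assuming each operand is
+given as a list of 16-bit words with the lowest word first. The helper
+names \c{s16} and \c{pmaddwd} are illustrative only.
+

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed word."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmaddwd(dst, src):
    """dst, src: lists of words (4 for MMX, 8 for XMM), lowest word first.
    Returns the doubleword results, lowest first."""
    out = []
    for i in range(0, len(dst), 2):
        total = s16(dst[i]) * s16(src[i]) + s16(dst[i + 1]) * s16(src[i + 1])
        out.append(total & 0xFFFFFFFF)   # each result is stored as 32 bits
    return out


print(pmaddwd([1, 2, 0xFFFF, 4], [3, 4, 5, 6]))
```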
+
+
+\S{insPMAGW} \i\c{PMAGW}: MMX Packed Magnitude
+
+\c PMAGW mm1,mm2/m64             ; 0F 52 /r             [CYRIX,MMX]
+
+\c{PMAGW}, specific to the Cyrix MMX extensions, treats both its
+operands as vectors of four signed words. It compares the absolute
+values of the words in corresponding positions, and sets each word
+of the destination (first) operand to whichever of the two words in
+that position had the larger absolute value.
+
+
+\S{insPMAXSW} \i\c{PMAXSW}: Packed Signed Integer Word Maximum
+
+\c PMAXSW mm1,mm2/m64            ; 0F EE /r        [KATMAI,MMX]
+\c PMAXSW xmm1,xmm2/m128         ; 66 0F EE /r     [WILLAMETTE,SSE2]
+
+\c{PMAXSW} compares each pair of words in the two source operands, and
+for each pair it stores the maximum value in the destination register.
+
+
+\S{insPMAXUB} \i\c{PMAXUB}: Packed Unsigned Integer Byte Maximum
+
+\c PMAXUB mm1,mm2/m64            ; 0F DE /r        [KATMAI,MMX]
+\c PMAXUB xmm1,xmm2/m128         ; 66 0F DE /r     [WILLAMETTE,SSE2]
+
+\c{PMAXUB} compares each pair of bytes in the two source operands, and
+for each pair it stores the maximum value in the destination register.
+
+
+\S{insPMINSW} \i\c{PMINSW}: Packed Signed Integer Word Minimum
+
+\c PMINSW mm1,mm2/m64            ; 0F EA /r        [KATMAI,MMX]
+\c PMINSW xmm1,xmm2/m128         ; 66 0F EA /r     [WILLAMETTE,SSE2]
+
+\c{PMINSW} compares each pair of words in the two source operands, and
+for each pair it stores the minimum value in the destination register.
+
+
+\S{insPMINUB} \i\c{PMINUB}: Packed Unsigned Integer Byte Minimum
+
+\c PMINUB mm1,mm2/m64            ; 0F DA /r        [KATMAI,MMX]
+\c PMINUB xmm1,xmm2/m128         ; 66 0F DA /r     [WILLAMETTE,SSE2]
+
+\c{PMINUB} compares each pair of bytes in the two source operands, and
+for each pair it stores the minimum value in the destination register.
+
+
+\S{insPMOVMSKB} \i\c{PMOVMSKB}: Move Byte Mask To Integer
+
+\c PMOVMSKB reg32,mm             ; 0F D7 /r        [KATMAI,MMX]
+\c PMOVMSKB reg32,xmm            ; 66 0F D7 /r     [WILLAMETTE,SSE2]
+
+\c{PMOVMSKB} returns an 8-bit or 16-bit mask formed of the most
+significant bits of each byte of the source operand (8 bits for an
+\c{MMX} register, 16 bits for an \c{XMM} register).
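+
+A minimal Python sketch of the mask construction, assuming the source is
+given as a list of byte values with the lowest byte first, so that bit
+\e{i} of the mask comes from byte \e{i}. The name \c{pmovmskb} is used
+here only for illustration.
+

```python
def pmovmskb(src_bytes):
    """src_bytes: 8 (MMX) or 16 (XMM) byte values, lowest byte first."""
    mask = 0
    for i, b in enumerate(src_bytes):
        mask |= ((b >> 7) & 1) << i   # bit i of the mask is the top bit of byte i
    return mask


print(bin(pmovmskb([0x80, 0x00, 0xFF, 0x7F, 0, 0, 0, 0x81])))
```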
+
+
+\S{insPMULHRW} \i\c{PMULHRWC}, \i\c{PMULHRIW}: Multiply Packed 16-bit Integers
+With Rounding, and Store High Word
+
+\c PMULHRWC mm1,mm2/m64         ; 0F 59 /r              [CYRIX,MMX]
+\c PMULHRIW mm1,mm2/m64         ; 0F 5D /r              [CYRIX,MMX]
+
+These instructions take two packed 16-bit integer inputs, multiply the
+values in the inputs, round on bit 15 of each result, then store bits
+15-30 of each result to the corresponding position of the destination
+register.
+
+\b For \c{PMULHRWC}, the destination is the first source operand.
+
+\b For \c{PMULHRIW}, the destination is an implied register (worked out
+as described for \c{PADDSIW} (\k{insPADDSIW})).
+
+The operation of this instruction is:
+
+\c    dst[0-15]  := (src1[0-15] *src2[0-15]  + 0x00004000)[15-30]
+\c    dst[16-31] := (src1[16-31]*src2[16-31] + 0x00004000)[15-30]
+\c    dst[32-47] := (src1[32-47]*src2[32-47] + 0x00004000)[15-30]
+\c    dst[48-63] := (src1[48-63]*src2[48-63] + 0x00004000)[15-30]
+
+See also \c{PMULHRWA} (\k{insPMULHRWA}) for a 3DNow! version of this
+instruction.
+
+
+\S{insPMULHRWA} \i\c{PMULHRWA}: Multiply Packed 16-bit Integers
+With Rounding, and Store High Word
+
+\c PMULHRWA mm1,mm2/m64          ; 0F 0F /r B7     [PENT,3DNOW]
+
+\c{PMULHRWA} takes two packed 16-bit integer inputs, multiplies
+the values in the inputs, rounds on bit 16 of each result, then
+stores bits 16-31 of each result to the corresponding position
+of the destination register.
+
+The operation of this instruction is:
+
+\c    dst[0-15]  := (src1[0-15] *src2[0-15]  + 0x00008000)[16-31];
+\c    dst[16-31] := (src1[16-31]*src2[16-31] + 0x00008000)[16-31];
+\c    dst[32-47] := (src1[32-47]*src2[32-47] + 0x00008000)[16-31];
+\c    dst[48-63] := (src1[48-63]*src2[48-63] + 0x00008000)[16-31].
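+
+For a single 16-bit lane, the formula above can be sketched in Python as
+follows. This assumes the inputs are treated as signed words; the helper
+names are invented for the example.
+

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed word."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmulhrwa_word(a, b):
    """One lane of PMULHRWA: (a*b + 0x00008000)[16-31]."""
    product = s16(a) * s16(b) + 0x8000   # add the rounding constant
    return (product >> 16) & 0xFFFF      # keep bits 16-31 of the sum


print(hex(pmulhrwa_word(0x4000, 0x4000)))
```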
+
+See also \c{PMULHRWC} (\k{insPMULHRW}) for a Cyrix version of this
+instruction.
+
+
+\S{insPMULHUW} \i\c{PMULHUW}: Multiply Packed 16-bit Integers,
+and Store High Word
+
+\c PMULHUW mm1,mm2/m64           ; 0F E4 /r        [KATMAI,MMX]
+\c PMULHUW xmm1,xmm2/m128        ; 66 0F E4 /r     [WILLAMETTE,SSE2]
+
+\c{PMULHUW} takes two packed unsigned 16-bit integer inputs, multiplies
+the values in the inputs, then stores bits 16-31 of each result to the
+corresponding position of the destination register.
+
+
+\S{insPMULHW} \i\c{PMULHW}, \i\c{PMULLW}: Multiply Packed 16-bit Integers,
+and Store
+
+\c PMULHW mm1,mm2/m64            ; 0F E5 /r             [PENT,MMX]
+\c PMULLW mm1,mm2/m64            ; 0F D5 /r             [PENT,MMX]
+
+\c PMULHW xmm1,xmm2/m128         ; 66 0F E5 /r     [WILLAMETTE,SSE2]
+\c PMULLW xmm1,xmm2/m128         ; 66 0F D5 /r     [WILLAMETTE,SSE2]
+
+\c{PMULxW} takes two packed signed 16-bit integer inputs, and
+multiplies the values in the inputs, forming doubleword results.
+
+\b \c{PMULHW} then stores the top 16 bits of each doubleword in the
+destination (first) operand;
+
+\b \c{PMULLW} stores the bottom 16 bits of each doubleword in the
+destination operand.
+
+
+\S{insPMULUDQ} \i\c{PMULUDQ}: Multiply Packed Unsigned
+32-bit Integers, and Store
+
+\c PMULUDQ mm1,mm2/m64           ; 0F F4 /r        [WILLAMETTE,SSE2]
+\c PMULUDQ xmm1,xmm2/m128        ; 66 0F F4 /r     [WILLAMETTE,SSE2]
+
+\c{PMULUDQ} takes two packed unsigned 32-bit integer inputs, and
+multiplies the values in the inputs, forming quadword results. Each
+source is either an unsigned doubleword in the low doubleword of a
+64-bit operand, or two unsigned doublewords in the first and
+third doublewords of a 128-bit operand. This produces either one or
+two 64-bit results, which are stored in the respective quadword
+locations of the destination register.
+
+The operation is:
+
+\c    dst[0-63]   := dst[0-31]  * src[0-31];
+\c    dst[64-127] := dst[64-95] * src[64-95].
+
+
+\S{insPMVccZB} \i\c{PMVccZB}: MMX Packed Conditional Move
+
+\c PMVZB mmxreg,mem64            ; 0F 58 /r             [CYRIX,MMX]
+\c PMVNZB mmxreg,mem64           ; 0F 5A /r             [CYRIX,MMX]
+\c PMVLZB mmxreg,mem64           ; 0F 5B /r             [CYRIX,MMX]
+\c PMVGEZB mmxreg,mem64          ; 0F 5C /r             [CYRIX,MMX]
+
+These instructions, specific to the Cyrix MMX extensions, perform
+parallel conditional moves. The two input operands are treated as
+vectors of eight bytes. Each byte of the destination (first) operand
+is either written from the corresponding byte of the source (second)
+operand, or left alone, depending on the value of the byte in the
+\e{implied} operand (specified in the same way as \c{PADDSIW}, in
+\k{insPADDSIW}).
+
+\b \c{PMVZB} performs each move if the corresponding byte in the
+implied operand is zero;
+
+\b \c{PMVNZB} moves if the byte is non-zero;
+
+\b \c{PMVLZB} moves if the byte is less than zero;
+
+\b \c{PMVGEZB} moves if the byte is greater than or equal to zero.
+
+Note that these instructions cannot take a register as their second
+source operand.
+
+
+\S{insPOP} \i\c{POP}: Pop Data from Stack
+
+\c POP reg16                     ; o16 58+r             [8086]
+\c POP reg32                     ; o32 58+r             [386]
+
+\c POP r/m16                     ; o16 8F /0            [8086]
+\c POP r/m32                     ; o32 8F /0            [386]
+
+\c POP CS                        ; 0F                   [8086,UNDOC]
+\c POP DS                        ; 1F                   [8086]
+\c POP ES                        ; 07                   [8086]
+\c POP SS                        ; 17                   [8086]
+\c POP FS                        ; 0F A1                [386]
+\c POP GS                        ; 0F A9                [386]
+
+\c{POP} loads a value from the stack (from \c{[SS:SP]} or
+\c{[SS:ESP]}) and then increments the stack pointer.
+
+The address-size attribute of the instruction determines whether
+\c{SP} or \c{ESP} is used as the stack pointer: to deliberately
+override the default given by the \c{BITS} setting, you can use an
+\i\c{a16} or \i\c{a32} prefix.
+
+The operand-size attribute of the instruction determines whether the
+stack pointer is incremented by 2 or 4: this means that segment
+register pops in \c{BITS 32} mode will pop 4 bytes off the stack and
+discard the upper two of them. If you need to override that, you can
+use an \i\c{o16} or \i\c{o32} prefix.
+
+The above opcode listings give two forms for general-purpose
+register pop instructions: for example, \c{POP BX} has the two forms
+\c{5B} and \c{8F C3}. NASM will always generate the shorter form
+when given \c{POP BX}. NDISASM will disassemble both.
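+
+The two encodings can be derived from the register number, as in the
+following Python sketch. It assumes the usual register numbering in which
+\c{BX} is register 3; the function name \c{pop_reg_encodings} is
+illustrative and not part of NASM.
+

```python
def pop_reg_encodings(reg):
    """reg: register number 0-7 (for example, BX/EBX is 3)."""
    short_form = bytes([0x58 + reg])          # 58+r
    # 8F /0 with a register operand: ModRM has mod=11, reg field=0, r/m=reg
    long_form = bytes([0x8F, 0xC0 | reg])
    return short_form, long_form


short_form, long_form = pop_reg_encodings(3)
print(short_form.hex(), long_form.hex())
```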
+
+\c{POP CS} is not a documented instruction, and is not supported on
+any processor above the 8086 (since they use \c{0Fh} as an opcode
+prefix for instruction set extensions). However, at least some 8086
+processors do support it, and so NASM generates it for completeness.
+
+
+\S{insPOPA} \i\c{POPAx}: Pop All General-Purpose Registers
+
+\c POPA                          ; 61                   [186]
+\c POPAW                         ; o16 61               [186]
+\c POPAD                         ; o32 61               [386]
+
+\b \c{POPAW} pops a word from the stack into each of, successively,
+\c{DI}, \c{SI}, \c{BP}, nothing (it discards a word from the stack
+which was a placeholder for \c{SP}), \c{BX}, \c{DX}, \c{CX} and
+\c{AX}. It is intended to reverse the operation of \c{PUSHAW} (see
+\k{insPUSHA}), but it ignores the value for \c{SP} that was pushed
+on the stack by \c{PUSHAW}.
+
+\b \c{POPAD} pops twice as much data, and places the results in
+\c{EDI}, \c{ESI}, \c{EBP}, nothing (placeholder for \c{ESP}),
+\c{EBX}, \c{EDX}, \c{ECX} and \c{EAX}. It reverses the operation of
+\c{PUSHAD}.
+
+\c{POPA} is an alias mnemonic for either \c{POPAW} or \c{POPAD},
+depending on the current \c{BITS} setting.
+
+Note that the registers are popped in reverse order of their numeric
+values in opcodes (see \k{iref-rv}).
+
+
+\S{insPOPF} \i\c{POPFx}: Pop Flags Register
+
+\c POPF                          ; 9D                   [8086]
+\c POPFW                         ; o16 9D               [8086]
+\c POPFD                         ; o32 9D               [386]
+
+\b \c{POPFW} pops a word from the stack and stores it in the bottom 16
+bits of the flags register (or the whole flags register, on
+processors below a 386).
+
+\b \c{POPFD} pops a doubleword and stores it in the entire flags register.
+
+\c{POPF} is an alias mnemonic for either \c{POPFW} or \c{POPFD},
+depending on the current \c{BITS} setting.
+
+See also \c{PUSHF} (\k{insPUSHF}).
+
+
+\S{insPOR} \i\c{POR}: MMX Bitwise OR
+
+\c POR mm1,mm2/m64               ; 0F EB /r             [PENT,MMX]
+\c POR xmm1,xmm2/m128            ; 66 0F EB /r     [WILLAMETTE,SSE2]
+
+\c{POR} performs a bitwise OR operation between its two operands
+(i.e. each bit of the result is 1 if and only if at least one of the
+corresponding bits of the two inputs was 1), and stores the result
+in the destination (first) operand.
+
+
+\S{insPREFETCH} \i\c{PREFETCH}: Prefetch Data Into Caches
+
+\c PREFETCH mem8                 ; 0F 0D /0             [PENT,3DNOW]
+\c PREFETCHW mem8                ; 0F 0D /1             [PENT,3DNOW]
+
+\c{PREFETCH} and \c{PREFETCHW} fetch the line of data from memory that
+contains the specified byte. \c{PREFETCHW} behaves differently on the
+Athlon than on earlier processors.
+
+For more details, see the 3DNow! Technology Manual.
+
+
+\S{insPREFETCHh} \i\c{PREFETCHh}: Prefetch Data Into Caches
+\I\c{PREFETCHNTA} \I\c{PREFETCHT0} \I\c{PREFETCHT1} \I\c{PREFETCHT2}
+
+\c PREFETCHNTA m8                ; 0F 18 /0        [KATMAI]
+\c PREFETCHT0 m8                 ; 0F 18 /1        [KATMAI]
+\c PREFETCHT1 m8                 ; 0F 18 /2        [KATMAI]
+\c PREFETCHT2 m8                 ; 0F 18 /3        [KATMAI]
+
+The \c{PREFETCHh} instructions fetch the line of data from memory
+that contains the specified byte. It is placed in the cache
+according to the rules specified by the locality hint \c{h}.
+
+The hints are:
+
+\b \c{T0} (temporal data) - prefetch data into all levels of the
+cache hierarchy.
+
+\b \c{T1} (temporal data with respect to first level cache) -
+prefetch data into level 2 cache and higher.
+
+\b \c{T2} (temporal data with respect to second level cache) -
+prefetch data into level 2 cache and higher.
+
+\b \c{NTA} (non-temporal data with respect to all cache levels) -
+prefetch data into non-temporal cache structure and into a
+location close to the processor, minimizing cache pollution.
+
+Note that this group of instructions doesn't provide a guarantee
+that the data will be in the cache when it is needed. For more
+details, see the Intel IA32 Software Developer Manual, Volume 2.
+
+
+\S{insPSADBW} \i\c{PSADBW}: Packed Sum of Absolute Differences
+
+\c PSADBW mm1,mm2/m64            ; 0F F6 /r        [KATMAI,MMX]
+\c PSADBW xmm1,xmm2/m128         ; 66 0F F6 /r     [WILLAMETTE,SSE2]
+
+\c{PSADBW} computes the absolute value of the difference of each pair
+of packed unsigned bytes in the two source operands. These differences
+are then summed to produce a word result in the lower 16-bit field of
+the destination register; the rest of the register is cleared. For an
+\c{XMM} destination, each quadword is summed separately, giving one
+word result in the low 16 bits of each quadword. The destination operand
+is an \c{MMX} or an \c{XMM} register. The source operand can either be
+a register or a memory operand.
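+
+For one quadword, the computation can be sketched in Python as below,
+assuming each operand is given as a list of eight unsigned byte values.
+The name \c{psadbw} is used only for this example.
+

```python
def psadbw(dst_bytes, src_bytes):
    """Sum of absolute differences over one quadword (8 unsigned bytes each)."""
    total = sum(abs(a - b) for a, b in zip(dst_bytes, src_bytes))
    return total & 0xFFFF   # at most 8 * 255, so it always fits in a word


print(psadbw([10, 0, 255, 3, 0, 0, 0, 0], [3, 5, 0, 3, 0, 0, 0, 0]))
```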
+
+
+\S{insPSHUFD} \i\c{PSHUFD}: Shuffle Packed Doublewords
+
+\c PSHUFD xmm1,xmm2/m128,imm8    ; 66 0F 70 /r ib  [WILLAMETTE,SSE2]
+
+\c{PSHUFD} shuffles the doublewords in the source (second) operand
+according to the encoding specified by imm8, and stores the result
+in the destination (first) operand.
+
+Bits 0 and 1 of imm8 encode the source position of the doubleword to
+be copied to position 0 in the destination operand. Bits 2 and 3
+encode for position 1, bits 4 and 5 encode for position 2, and bits
+6 and 7 encode for position 3. For example, an encoding of 10 in
+bits 0 and 1 of imm8 indicates that the doubleword at bits 64-95 of
+the source operand will be copied to bits 0-31 of the destination.
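+
+The selection rule can be sketched in Python, assuming the source is
+given as a list of four doublewords with position 0 (bits 0-31) first.
+The function name \c{pshufd} is illustrative only.
+

```python
def pshufd(src, imm8):
    """src: list of 4 doublewords, position 0 (bits 0-31) first."""
    # Bits 2*i and 2*i+1 of imm8 pick the source position for destination position i.
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


print(pshufd([10, 11, 12, 13], 0b00011011))   # reverses the four doublewords
```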
+
+
+\S{insPSHUFHW} \i\c{PSHUFHW}: Shuffle Packed High Words
+
+\c PSHUFHW xmm1,xmm2/m128,imm8   ; F3 0F 70 /r ib  [WILLAMETTE,SSE2]
+
+\c{PSHUFHW} shuffles the words in the high quadword of the source
+(second) operand according to the encoding specified by imm8, and
+stores the result in the high quadword of the destination (first)
+operand.
+
+The operation of this instruction is similar to the \c{PSHUFW}
+instruction, except that the source and destination are the top
+quadword of a 128-bit operand, instead of being 64-bit operands.
+The low quadword is copied from the source to the destination
+without any changes.
+
+
+\S{insPSHUFLW} \i\c{PSHUFLW}: Shuffle Packed Low Words
+
+\c PSHUFLW xmm1,xmm2/m128,imm8   ; F2 0F 70 /r ib  [WILLAMETTE,SSE2]
+
+\c{PSHUFLW} shuffles the words in the low quadword of the source
+(second) operand according to the encoding specified by imm8, and
+stores the result in the low quadword of the destination (first)
+operand.
+
+The operation of this instruction is similar to the \c{PSHUFW}
+instruction, except that the source and destination are the low
+quadword of a 128-bit operand, instead of being 64-bit operands.
+The high quadword is copied from the source to the destination
+without any changes.
+
+
+\S{insPSHUFW} \i\c{PSHUFW}: Shuffle Packed Words
+
+\c PSHUFW mm1,mm2/m64,imm8       ; 0F 70 /r ib     [KATMAI,MMX]
+
+\c{PSHUFW} shuffles the words in the source (second) operand
+according to the encoding specified by imm8, and stores the result
+in the destination (first) operand.
+
+Bits 0 and 1 of imm8 encode the source position of the word to be
+copied to position 0 in the destination operand. Bits 2 and 3 encode
+for position 1, bits 4 and 5 encode for position 2, and bits 6 and 7
+encode for position 3. For example, an encoding of 10 in bits 0 and 1
+of imm8 indicates that the word at bits 32-47 of the source operand
+will be copied to bits 0-15 of the destination.
+
+
+\S{insPSLLD} \i\c{PSLLx}: Packed Data Bit Shift Left Logical
+
+\c PSLLW mm1,mm2/m64             ; 0F F1 /r             [PENT,MMX]
+\c PSLLW mm,imm8                 ; 0F 71 /6 ib          [PENT,MMX]
+
+\c PSLLW xmm1,xmm2/m128          ; 66 0F F1 /r     [WILLAMETTE,SSE2]
+\c PSLLW xmm,imm8                ; 66 0F 71 /6 ib  [WILLAMETTE,SSE2]
+
+\c PSLLD mm1,mm2/m64             ; 0F F2 /r             [PENT,MMX]
+\c PSLLD mm,imm8                 ; 0F 72 /6 ib          [PENT,MMX]
+
+\c PSLLD xmm1,xmm2/m128          ; 66 0F F2 /r     [WILLAMETTE,SSE2]
+\c PSLLD xmm,imm8                ; 66 0F 72 /6 ib  [WILLAMETTE,SSE2]
+
+\c PSLLQ mm1,mm2/m64             ; 0F F3 /r             [PENT,MMX]
+\c PSLLQ mm,imm8                 ; 0F 73 /6 ib          [PENT,MMX]
+
+\c PSLLQ xmm1,xmm2/m128          ; 66 0F F3 /r     [WILLAMETTE,SSE2]
+\c PSLLQ xmm,imm8                ; 66 0F 73 /6 ib  [WILLAMETTE,SSE2]
+
+\c PSLLDQ xmm1,imm8              ; 66 0F 73 /7 ib  [WILLAMETTE,SSE2]
+
+\c{PSLLx} performs logical left shifts of the data elements in the
+destination (first) operand, moving each bit in the separate elements
+left by the number of bits specified in the source (second) operand,
+clearing the low-order bits as they are vacated. \c{PSLLDQ} shifts
+by a count of bytes, not bits.
+
+\b \c{PSLLW} shifts word sized elements.
+
+\b \c{PSLLD} shifts doubleword sized elements.
+
+\b \c{PSLLQ} shifts quadword sized elements.
+
+\b \c{PSLLDQ} shifts double quadword sized elements.
+
+
+\S{insPSRAD} \i\c{PSRAx}: Packed Data Bit Shift Right Arithmetic
+
+\c PSRAW mm1,mm2/m64             ; 0F E1 /r             [PENT,MMX]
+\c PSRAW mm,imm8                 ; 0F 71 /4 ib          [PENT,MMX]
+
+\c PSRAW xmm1,xmm2/m128          ; 66 0F E1 /r     [WILLAMETTE,SSE2]
+\c PSRAW xmm,imm8                ; 66 0F 71 /4 ib  [WILLAMETTE,SSE2]
+
+\c PSRAD mm1,mm2/m64             ; 0F E2 /r             [PENT,MMX]
+\c PSRAD mm,imm8                 ; 0F 72 /4 ib          [PENT,MMX]
+
+\c PSRAD xmm1,xmm2/m128          ; 66 0F E2 /r     [WILLAMETTE,SSE2]
+\c PSRAD xmm,imm8                ; 66 0F 72 /4 ib  [WILLAMETTE,SSE2]
+
+\c{PSRAx} performs arithmetic right shifts of the data elements in the
+destination (first) operand, moving each bit in the separate elements
+right by the number of bits specified in the source (second) operand,
+setting the high-order bits to the value of the original sign bit.
+
+\b \c{PSRAW} shifts word sized elements.
+
+\b \c{PSRAD} shifts doubleword sized elements.
+
+
+\S{insPSRLD} \i\c{PSRLx}: Packed Data Bit Shift Right Logical
+
+\c PSRLW mm1,mm2/m64             ; 0F D1 /r             [PENT,MMX]
+\c PSRLW mm,imm8                 ; 0F 71 /2 ib          [PENT,MMX]
+
+\c PSRLW xmm1,xmm2/m128          ; 66 0F D1 /r     [WILLAMETTE,SSE2]
+\c PSRLW xmm,imm8                ; 66 0F 71 /2 ib  [WILLAMETTE,SSE2]
+
+\c PSRLD mm1,mm2/m64             ; 0F D2 /r             [PENT,MMX]
+\c PSRLD mm,imm8                 ; 0F 72 /2 ib          [PENT,MMX]
+
+\c PSRLD xmm1,xmm2/m128          ; 66 0F D2 /r     [WILLAMETTE,SSE2]
+\c PSRLD xmm,imm8                ; 66 0F 72 /2 ib  [WILLAMETTE,SSE2]
+
+\c PSRLQ mm1,mm2/m64             ; 0F D3 /r             [PENT,MMX]
+\c PSRLQ mm,imm8                 ; 0F 73 /2 ib          [PENT,MMX]
+
+\c PSRLQ xmm1,xmm2/m128          ; 66 0F D3 /r     [WILLAMETTE,SSE2]
+\c PSRLQ xmm,imm8                ; 66 0F 73 /2 ib  [WILLAMETTE,SSE2]
+
+\c PSRLDQ xmm1,imm8              ; 66 0F 73 /3 ib  [WILLAMETTE,SSE2]
+
+\c{PSRLx} performs logical right shifts of the data elements in the
+destination (first) operand, moving each bit in the separate elements
+right by the number of bits specified in the source (second) operand,
+clearing the high-order bits as they are vacated. \c{PSRLDQ} shifts
+by a count of bytes, not bits.
+
+\b \c{PSRLW} shifts word sized elements.
+
+\b \c{PSRLD} shifts doubleword sized elements.
+
+\b \c{PSRLQ} shifts quadword sized elements.
+
+\b \c{PSRLDQ} shifts double quadword sized elements.
+
+
+\S{insPSUBB} \i\c{PSUBx}: Subtract Packed Integers
+
+\c PSUBB mm1,mm2/m64             ; 0F F8 /r             [PENT,MMX]
+\c PSUBW mm1,mm2/m64             ; 0F F9 /r             [PENT,MMX]
+\c PSUBD mm1,mm2/m64             ; 0F FA /r             [PENT,MMX]
+\c PSUBQ mm1,mm2/m64             ; 0F FB /r        [WILLAMETTE,SSE2]
+
+\c PSUBB xmm1,xmm2/m128          ; 66 0F F8 /r     [WILLAMETTE,SSE2]
+\c PSUBW xmm1,xmm2/m128          ; 66 0F F9 /r     [WILLAMETTE,SSE2]
+\c PSUBD xmm1,xmm2/m128          ; 66 0F FA /r     [WILLAMETTE,SSE2]
+\c PSUBQ xmm1,xmm2/m128          ; 66 0F FB /r     [WILLAMETTE,SSE2]
+
+\c{PSUBx} subtracts packed integers in the source operand from those
+in the destination operand. It doesn't differentiate between signed
+and unsigned integers, and doesn't set any of the flags.
+
+\b \c{PSUBB} operates on byte sized elements.
+
+\b \c{PSUBW} operates on word sized elements.
+
+\b \c{PSUBD} operates on doubleword sized elements.
+
+\b \c{PSUBQ} operates on quadword sized elements.
+
+
+\S{insPSUBSB} \i\c{PSUBSxx}, \i\c{PSUBUSx}: Subtract Packed Integers With Saturation
+
+\c PSUBSB mm1,mm2/m64            ; 0F E8 /r             [PENT,MMX]
+\c PSUBSW mm1,mm2/m64            ; 0F E9 /r             [PENT,MMX]
+
+\c PSUBSB xmm1,xmm2/m128         ; 66 0F E8 /r     [WILLAMETTE,SSE2]
+\c PSUBSW xmm1,xmm2/m128         ; 66 0F E9 /r     [WILLAMETTE,SSE2]
+
+\c PSUBUSB mm1,mm2/m64           ; 0F D8 /r             [PENT,MMX]
+\c PSUBUSW mm1,mm2/m64           ; 0F D9 /r             [PENT,MMX]
+
+\c PSUBUSB xmm1,xmm2/m128        ; 66 0F D8 /r     [WILLAMETTE,SSE2]
+\c PSUBUSW xmm1,xmm2/m128        ; 66 0F D9 /r     [WILLAMETTE,SSE2]
+
+\c{PSUBSx} and \c{PSUBUSx} subtract packed integers in the source
+operand from those in the destination operand, and use saturation for
+results that are outside the range supported by the destination operand.
+
+\b \c{PSUBSB} operates on signed bytes, and uses signed saturation on the
+results.
+
+\b \c{PSUBSW} operates on signed words, and uses signed saturation on the
+results.
+
+\b \c{PSUBUSB} operates on unsigned bytes, and uses unsigned saturation on
+the results.
+
+\b \c{PSUBUSW} operates on unsigned words, and uses unsigned saturation on
+the results.
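+
+The signed saturation used by \c{PSUBSB} and \c{PSUBSW} can be sketched
+for a single element in Python as follows. The function name \c{psubs}
+and its \c{bits} parameter are invented for this example, and the inputs
+are assumed to be ordinary integers already within the signed range.
+

```python
def psubs(a, b, bits):
    """Signed saturating subtract of one element; bits is 8 (PSUBSB) or 16 (PSUBSW)."""
    lo = -(1 << (bits - 1))          # most negative representable value
    hi = (1 << (bits - 1)) - 1       # most positive representable value
    return max(lo, min(hi, a - b))   # clamp instead of wrapping around


print(psubs(-100, 100, 8), psubs(100, -100, 8))
```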
+
+
+\S{insPSUBSIW} \i\c{PSUBSIW}: MMX Packed Subtract with Saturation to
+Implied Destination
+
+\c PSUBSIW mm1,mm2/m64           ; 0F 55 /r             [CYRIX,MMX]
+
+\c{PSUBSIW}, specific to the Cyrix extensions to the MMX instruction
+set, performs the same function as \c{PSUBSW}, except that the
+result is not placed in the register specified by the first operand,
+but instead in the implied destination register, specified as for
+\c{PADDSIW} (\k{insPADDSIW}).
+
+
+\S{insPSWAPD} \i\c{PSWAPD}: Swap Packed Data
+\I\c{PSWAPW}
+
+\c PSWAPD mm1,mm2/m64            ; 0F 0F /r BB     [PENT,3DNOW]
+
+\c{PSWAPD} swaps the packed doublewords in the source operand, and
+stores the result in the destination operand.
+
+In the \c{K6-2} and \c{K6-III} processors, this opcode uses the
+mnemonic \c{PSWAPW}, and it swaps the order of words when copying
+from the source to the destination.
+
+The operation in the \c{K6-2} and \c{K6-III} processors is:
+
+\c    dst[0-15]  = src[48-63];
+\c    dst[16-31] = src[32-47];
+\c    dst[32-47] = src[16-31];
+\c    dst[48-63] = src[0-15].
+
+The operation in the \c{K6-x+}, \c{ATHLON} and later processors is:
+
+\c    dst[0-31]  = src[32-63];
+\c    dst[32-63] = src[0-31].
+
+
+\S{insPUNPCKHBW} \i\c{PUNPCKxxx}: Unpack and Interleave Data
+
+\c PUNPCKHBW mm1,mm2/m64         ; 0F 68 /r             [PENT,MMX]
+\c PUNPCKHWD mm1,mm2/m64         ; 0F 69 /r             [PENT,MMX]
+\c PUNPCKHDQ mm1,mm2/m64         ; 0F 6A /r             [PENT,MMX]
+
+\c PUNPCKHBW xmm1,xmm2/m128      ; 66 0F 68 /r     [WILLAMETTE,SSE2]
+\c PUNPCKHWD xmm1,xmm2/m128      ; 66 0F 69 /r     [WILLAMETTE,SSE2]
+\c PUNPCKHDQ xmm1,xmm2/m128      ; 66 0F 6A /r     [WILLAMETTE,SSE2]
+\c PUNPCKHQDQ xmm1,xmm2/m128     ; 66 0F 6D /r     [WILLAMETTE,SSE2]
+
+\c PUNPCKLBW mm1,mm2/m32         ; 0F 60 /r             [PENT,MMX]
+\c PUNPCKLWD mm1,mm2/m32         ; 0F 61 /r             [PENT,MMX]
+\c PUNPCKLDQ mm1,mm2/m32         ; 0F 62 /r             [PENT,MMX]
+
+\c PUNPCKLBW xmm1,xmm2/m128      ; 66 0F 60 /r     [WILLAMETTE,SSE2]
+\c PUNPCKLWD xmm1,xmm2/m128      ; 66 0F 61 /r     [WILLAMETTE,SSE2]
+\c PUNPCKLDQ xmm1,xmm2/m128      ; 66 0F 62 /r     [WILLAMETTE,SSE2]
+\c PUNPCKLQDQ xmm1,xmm2/m128     ; 66 0F 6C /r     [WILLAMETTE,SSE2]
+
+\c{PUNPCKxx} all treat their operands as vectors, and produce a new
+vector generated by interleaving elements from the two inputs. The
+\c{PUNPCKHxx} instructions start by throwing away the bottom half of
+each input operand, and the \c{PUNPCKLxx} instructions throw away
+the top half.
+
+The remaining elements are then interleaved into the destination,
+alternating elements from the second (source) operand and the first
+(destination) operand: so the leftmost part of each element in the
+result always comes from the second operand, and the rightmost from
+the destination.
+
+\b \c{PUNPCKxBW} works a byte at a time, producing word sized output
+elements.
+
+\b \c{PUNPCKxWD} works a word at a time, producing doubleword sized
+output elements.
+
+\b \c{PUNPCKxDQ} works a doubleword at a time, producing quadword sized
+output elements.
+
+\b \c{PUNPCKxQDQ} works a quadword at a time, producing double quadword
+sized output elements.
+
+So, for example, for \c{MMX} operands, if the first operand held
+\c{0x7A6A5A4A3A2A1A0A} and the second held \c{0x7B6B5B4B3B2B1B0B},
+then:
+
+\b \c{PUNPCKHBW} would return \c{0x7B7A6B6A5B5A4B4A}.
+
+\b \c{PUNPCKHWD} would return \c{0x7B6B7A6A5B4B5A4A}.
+
+\b \c{PUNPCKHDQ} would return \c{0x7B6B5B4B7A6A5A4A}.
+
+\b \c{PUNPCKLBW} would return \c{0x3B3A2B2A1B1A0B0A}.
+
+\b \c{PUNPCKLWD} would return \c{0x3B2B3A2A1B0B1A0A}.
+
+\b \c{PUNPCKLDQ} would return \c{0x3B2B1B0B3A2A1A0A}.
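+
+The interleaving rule can be checked against these values with a short
+Python sketch that holds each register as a plain integer. The function
+\c{punpck} and its parameters are illustrative only and do not correspond
+to anything in NASM.
+

```python
def punpck(dst, src, elem_bits, high, width=64):
    """Interleave elements of dst (first operand) and src (second operand).

    elem_bits: 8 for BW, 16 for WD, 32 for DQ, 64 for QDQ.
    high: True for PUNPCKHxx, False for PUNPCKLxx.
    width: 64 for MMX operands, 128 for XMM operands.
    """
    mask = (1 << elem_bits) - 1
    half = width // 2
    base = half if high else 0            # which half of each input is kept
    result = 0
    for i in range(half // elem_bits):
        d = (dst >> (base + i * elem_bits)) & mask
        s = (src >> (base + i * elem_bits)) & mask
        pair = (s << elem_bits) | d       # src element on the left, dst on the right
        result |= pair << (2 * i * elem_bits)
    return result


a, b = 0x7A6A5A4A3A2A1A0A, 0x7B6B5B4B3B2B1B0B
print(hex(punpck(a, b, 8, True)))         # models PUNPCKHBW
```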
+
+
+\S{insPUSH} \i\c{PUSH}: Push Data on Stack
+
+\c PUSH reg16                    ; o16 50+r             [8086]
+\c PUSH reg32                    ; o32 50+r             [386]
+
+\c PUSH r/m16                    ; o16 FF /6            [8086]
+\c PUSH r/m32                    ; o32 FF /6            [386]
+
+\c PUSH CS                       ; 0E                   [8086]
+\c PUSH DS                       ; 1E                   [8086]
+\c PUSH ES                       ; 06                   [8086]
+\c PUSH SS                       ; 16                   [8086]
+\c PUSH FS                       ; 0F A0                [386]
+\c PUSH GS                       ; 0F A8                [386]
+
+\c PUSH imm8                     ; 6A ib                [186]
+\c PUSH imm16                    ; o16 68 iw            [186]
+\c PUSH imm32                    ; o32 68 id            [386]
+
+\c{PUSH} decrements the stack pointer (\c{SP} or \c{ESP}) by 2 or 4,
+and then stores the given value at \c{[SS:SP]} or \c{[SS:ESP]}.
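+
+As an illustration of the order of operations, a 16-bit push can be
+modelled in C as below. This is a sketch under stated assumptions:
+the \c{stack16} structure and the \c{push16} function are
+hypothetical stand-ins for the stack segment and the instruction,
+and the caller is assumed to keep \c{sp} within the array.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a tiny stack segment: mem stands in for the
 * memory addressed through SS, and sp for the SP register. */
struct stack16 {
    uint8_t  mem[16];
    uint16_t sp;
};

/* PUSH with a 16-bit operand: decrement SP by 2 first, then store the
 * value at [SS:SP] in little-endian byte order.
 * Assumes 2 <= sp <= sizeof mem on entry. */
void push16(struct stack16 *s, uint16_t value)
{
    s->sp = (uint16_t)(s->sp - 2);
    s->mem[s->sp]     = (uint8_t)(value & 0xFF);
    s->mem[s->sp + 1] = (uint8_t)(value >> 8);
}

void push16_example(void)
{
    struct stack16 s = { {0}, 16 };  /* empty stack, SP just past the end */

    push16(&s, 0x1234);
    assert(s.sp == 14);
    assert(s.mem[14] == 0x34 && s.mem[15] == 0x12);
}
```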
+
+The address-size attribute of the instruction determines whether
+\c{SP} or \c{ESP} is used as the stack pointer: to deliberately
+override the default given by the \c{BITS} setting, you can use an
+\i\c{a16} or \i\c{a32} prefix.
+
+The operand-size attribute of the instruction determines whether the
+stack pointer is decremented by 2 or 4: this means that segment
+register pushes in \c{BITS 32} mode will push 4 bytes on the stack,
+of which the upper two are undefined. If you need to override that,
+you can use an \i\c{o16} or \i\c{o32} prefix.
+
+The above opcode listings give two forms for general-purpose
+\i{register push} instructions: for example, \c{PUSH BX} has the two
+forms \c{53} and \c{FF F3}. NASM will always generate the shorter
+form when given \c{PUSH BX}. NDISASM will disassemble both.
+
+Unlike the undocumented and barely supported \c{POP CS}, \c{PUSH CS}
+is a perfectly valid and sensible instruction, supported on all
+processors.
+
+The instruction \c{PUSH SP} may be used to distinguish an 8086 from
+later processors: on an 8086, the value of \c{SP} stored is the
+value it has \e{after} the push instruction, whereas on later
+processors it is the value \e{before} the push instruction.
+
+
+\S{insPUSHA} \i\c{PUSHAx}: Push All General-Purpose Registers
+
+\c PUSHA                         ; 60                   [186]
+\c PUSHAD                        ; o32 60               [386]
+\c PUSHAW                        ; o16 60               [186]
+
+\c{PUSHAW} pushes, in succession, \c{AX}, \c{CX}, \c{DX}, \c{BX},
+\c{SP}, \c{BP}, \c{SI} and \c{DI} on the stack, decrementing the
+stack pointer by a total of 16.
+
+\c{PUSHAD} pushes, in succession, \c{EAX}, \c{ECX}, \c{EDX},
+\c{EBX}, \c{ESP}, \c{EBP}, \c{ESI} and \c{EDI} on the stack,
+decrementing the stack pointer by a total of 32.
+
+In both cases, the value of \c{SP} or \c{ESP} pushed is its
+\e{original} value, as it had before the instruction was executed.
+
+\c{PUSHA} is an alias mnemonic for either \c{PUSHAW} or \c{PUSHAD},
+depending on the current \c{BITS} setting.
+
+Note that the registers are pushed in order of their numeric values
+in opcodes (see \k{iref-rv}).
+
+See also \c{POPA} (\k{insPOPA}).
+
+
+\S{insPUSHF} \i\c{PUSHFx}: Push Flags Register
+
+\c PUSHF                         ; 9C                   [8086]
+\c PUSHFD                        ; o32 9C               [386]
+\c PUSHFW                        ; o16 9C               [8086]
+
+\b \c{PUSHFW} pushes the bottom 16 bits of the flags register 
+(or the whole flags register, on processors below a 386) onto
+the stack.
+
+\b \c{PUSHFD} pushes the entire flags register onto the stack.
+
+\c{PUSHF} is an alias mnemonic for either \c{PUSHFW} or \c{PUSHFD},
+depending on the current \c{BITS} setting.
+
+See also \c{POPF} (\k{insPOPF}).
+
+
+\S{insPXOR} \i\c{PXOR}: MMX Bitwise XOR
+
+\c PXOR mm1,mm2/m64              ; 0F EF /r             [PENT,MMX]
+\c PXOR xmm1,xmm2/m128           ; 66 0F EF /r     [WILLAMETTE,SSE2]
+
+\c{PXOR} performs a bitwise XOR operation between its two operands
+(i.e. each bit of the result is 1 if and only if exactly one of the
+corresponding bits of the two inputs was 1), and stores the result
+in the destination (first) operand.
+
+
+\S{insRCL} \i\c{RCL}, \i\c{RCR}: Bitwise Rotate through Carry Bit
+
+\c RCL r/m8,1                    ; D0 /2                [8086]
+\c RCL r/m8,CL                   ; D2 /2                [8086]
+\c RCL r/m8,imm8                 ; C0 /2 ib             [186]
+\c RCL r/m16,1                   ; o16 D1 /2            [8086]
+\c RCL r/m16,CL                  ; o16 D3 /2            [8086]
+\c RCL r/m16,imm8                ; o16 C1 /2 ib         [186]
+\c RCL r/m32,1                   ; o32 D1 /2            [386]
+\c RCL r/m32,CL                  ; o32 D3 /2            [386]
+\c RCL r/m32,imm8                ; o32 C1 /2 ib         [386]
+
+\c RCR r/m8,1                    ; D0 /3                [8086]
+\c RCR r/m8,CL                   ; D2 /3                [8086]
+\c RCR r/m8,imm8                 ; C0 /3 ib             [186]
+\c RCR r/m16,1                   ; o16 D1 /3            [8086]
+\c RCR r/m16,CL                  ; o16 D3 /3            [8086]
+\c RCR r/m16,imm8                ; o16 C1 /3 ib         [186]
+\c RCR r/m32,1                   ; o32 D1 /3            [386]
+\c RCR r/m32,CL                  ; o32 D3 /3            [386]
+\c RCR r/m32,imm8                ; o32 C1 /3 ib         [386]
+
+\c{RCL} and \c{RCR} perform a 9-bit, 17-bit or 33-bit bitwise
+rotation operation, involving the given source/destination (first)
+operand and the carry bit. Thus, for example, in the operation
+\c{RCL AL,1}, a 9-bit rotation is performed in which \c{AL} is
+shifted left by 1, the top bit of \c{AL} moves into the carry flag,
+and the original value of the carry flag is placed in the low bit of
+\c{AL}.
+
+The number of bits to rotate by is given by the second operand. Only
+the bottom five bits of the rotation count are considered by
+processors above the 8086.
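+
+The 9-bit case can be sketched in C as follows. This is an
+illustrative model only: the name \c{rcl8} is hypothetical, and the
+reduction of the masked count modulo 9 is an assumption that follows
+from a 9-bit rotation returning to its starting point after nine
+steps.
+
```c
#include <assert.h>
#include <stdint.h>

/* Model of RCL on an 8-bit operand: rotates the 9-bit quantity
 * CF:value left by count bits. *carry holds CF on entry and exit. */
uint8_t rcl8(uint8_t value, unsigned count, int *carry)
{
    unsigned wide = ((unsigned)(*carry != 0) << 8) | value;  /* CF:value */

    count = (count & 31) % 9;  /* five-bit count, then full turns removed */
    if (count != 0)
        wide = ((wide << count) | (wide >> (9 - count))) & 0x1FF;

    *carry = (int)((wide >> 8) & 1);
    return (uint8_t)wide;
}

void rcl8_example(void)
{
    int cf = 0;

    /* RCL AL,1 with AL = 0x80, CF = 0: the top bit moves into CF. */
    assert(rcl8(0x80, 1, &cf) == 0x00);
    assert(cf == 1);

    /* A second RCL AL,1 brings the old carry into the low bit. */
    assert(rcl8(0x00, 1, &cf) == 0x01);
    assert(cf == 0);
}
```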
+
+You can force the longer (186 and upwards, beginning with a \c{C1}
+byte) form of \c{RCL foo,1} by using a \c{BYTE} prefix: \c{RCL
+foo,BYTE 1}. Similarly with \c{RCR}.
+
+
+\S{insRCPPS} \i\c{RCPPS}: Packed Single-Precision FP Reciprocal
+
+\c RCPPS xmm1,xmm2/m128          ; 0F 53 /r        [KATMAI,SSE]
+
+\c{RCPPS} returns an approximation of the reciprocal of the packed
+single-precision FP values from xmm2/m128. The maximum error for this
+approximation is: |Error| <= 1.5 x 2^-12
+
+
+\S{insRCPSS} \i\c{RCPSS}: Scalar Single-Precision FP Reciprocal
+
+\c RCPSS xmm1,xmm2/m32           ; F3 0F 53 /r     [KATMAI,SSE]
+
+\c{RCPSS} returns an approximation of the reciprocal of the lower
+single-precision FP value from xmm2/m32; the upper three fields are
+passed through from xmm1. The maximum error for this approximation is:
+|Error| <= 1.5 x 2^-12
+
+
+\S{insRDMSR} \i\c{RDMSR}: Read Model-Specific Registers
+
+\c RDMSR                         ; 0F 32                [PENT,PRIV]
+
+\c{RDMSR} reads the processor Model-Specific Register (MSR) whose
+index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}.
+See also \c{WRMSR} (\k{insWRMSR}).
+
+
+\S{insRDPMC} \i\c{RDPMC}: Read Performance-Monitoring Counters
+
+\c RDPMC                         ; 0F 33                [P6]
+
+\c{RDPMC} reads the processor performance-monitoring counter whose
+index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}.
+
+This instruction is available on P6 and later processors and on MMX
+class processors.
+
+
+\S{insRDSHR} \i\c{RDSHR}: Read SMM Header Pointer Register
+
+\c RDSHR r/m32                   ; 0F 36 /0        [386,CYRIX,SMM]
+
+\c{RDSHR} reads the contents of the SMM header pointer register and
+saves it to the destination operand, which can be either a 32-bit
+memory location or a 32-bit register.
+
+See also \c{WRSHR} (\k{insWRSHR}).
+
+
+\S{insRDTSC} \i\c{RDTSC}: Read Time-Stamp Counter
+
+\c RDTSC                         ; 0F 31                [PENT]
+
+\c{RDTSC} reads the processor's time-stamp counter into \c{EDX:EAX}.
+
+
+\S{insRET} \i\c{RET}, \i\c{RETF}, \i\c{RETN}: Return from Procedure Call
+
+\c RET                           ; C3                   [8086]
+\c RET imm16                     ; C2 iw                [8086]
+
+\c RETF                          ; CB                   [8086]
+\c RETF imm16                    ; CA iw                [8086]
+
+\c RETN                          ; C3                   [8086]
+\c RETN imm16                    ; C2 iw                [8086]
+
+\b \c{RET}, and its exact synonym \c{RETN}, pop \c{IP} or \c{EIP} from
+the stack and transfer control to the new address. Optionally, if a
+numeric operand is provided, they increment the stack pointer
+by a further \c{imm16} bytes after popping the return address.
+
+\b \c{RETF} executes a far return: after popping \c{IP}/\c{EIP}, it
+then pops \c{CS}, and \e{then} increments the stack pointer by the
+optional argument if present.
+
+
+\S{insROL} \i\c{ROL}, \i\c{ROR}: Bitwise Rotate
+
+\c ROL r/m8,1                    ; D0 /0                [8086]
+\c ROL r/m8,CL                   ; D2 /0                [8086]
+\c ROL r/m8,imm8                 ; C0 /0 ib             [186]
+\c ROL r/m16,1                   ; o16 D1 /0            [8086]
+\c ROL r/m16,CL                  ; o16 D3 /0            [8086]
+\c ROL r/m16,imm8                ; o16 C1 /0 ib         [186]
+\c ROL r/m32,1                   ; o32 D1 /0            [386]
+\c ROL r/m32,CL                  ; o32 D3 /0            [386]
+\c ROL r/m32,imm8                ; o32 C1 /0 ib         [386]
+
+\c ROR r/m8,1                    ; D0 /1                [8086]
+\c ROR r/m8,CL                   ; D2 /1                [8086]
+\c ROR r/m8,imm8                 ; C0 /1 ib             [186]
+\c ROR r/m16,1                   ; o16 D1 /1            [8086]
+\c ROR r/m16,CL                  ; o16 D3 /1            [8086]
+\c ROR r/m16,imm8                ; o16 C1 /1 ib         [186]
+\c ROR r/m32,1                   ; o32 D1 /1            [386]
+\c ROR r/m32,CL                  ; o32 D3 /1            [386]
+\c ROR r/m32,imm8                ; o32 C1 /1 ib         [386]
+
+\c{ROL} and \c{ROR} perform a bitwise rotation operation on the given
+source/destination (first) operand. Thus, for example, in the
+operation \c{ROL AL,1}, an 8-bit rotation is performed in which
+\c{AL} is shifted left by 1 and the original top bit of \c{AL} moves
+round into the low bit.
+
+The number of bits to rotate by is given by the second operand. Only
+the bottom five bits of the rotation count are considered by processors
+above the 8086.
+
+You can force the longer (186 and upwards, beginning with a \c{C1}
+byte) form of \c{ROL foo,1} by using a \c{BYTE} prefix: \c{ROL
+foo,BYTE 1}. Similarly with \c{ROR}.
+
+
+\S{insRSDC} \i\c{RSDC}: Restore Segment Register and Descriptor
+
+\c RSDC segreg,m80               ; 0F 79 /r        [486,CYRIX,SMM]
+
+\c{RSDC} restores a segment register (DS, ES, FS, GS, or SS) from mem80,
+and sets up its descriptor.
+
+
+\S{insRSLDT} \i\c{RSLDT}: Restore LDTR and Descriptor
+
+\c RSLDT m80                     ; 0F 7B /0        [486,CYRIX,SMM]
+
+\c{RSLDT} restores the Local Descriptor Table Register (LDTR) from mem80.
+
+
+\S{insRSM} \i\c{RSM}: Resume from System-Management Mode
+
+\c RSM                           ; 0F AA                [PENT]
+
+\c{RSM} returns the processor to its normal operating mode when it
+was in System-Management Mode.
+
+
+\S{insRSQRTPS} \i\c{RSQRTPS}: Packed Single-Precision FP Square Root Reciprocal
+
+\c RSQRTPS xmm1,xmm2/m128        ; 0F 52 /r        [KATMAI,SSE]
+
+\c{RSQRTPS} computes the approximate reciprocals of the square
+roots of the packed single-precision floating-point values in the
+source and stores the results in xmm1. The maximum error for this
+approximation is: |Error| <= 1.5 x 2^-12
+
+
+\S{insRSQRTSS} \i\c{RSQRTSS}: Scalar Single-Precision FP Square Root Reciprocal
+
+\c RSQRTSS xmm1,xmm2/m32         ; F3 0F 52 /r     [KATMAI,SSE]
+
+\c{RSQRTSS} returns an approximation of the reciprocal of the
+square root of the lowest order single-precision FP value from
+the source, and stores it in the low doubleword of the destination
+register. The upper three fields of xmm1 are preserved. The maximum
+error for this approximation is: |Error| <= 1.5 x 2^-12
+
+
+\S{insRSTS} \i\c{RSTS}: Restore TSR and Descriptor
+
+\c RSTS m80                      ; 0F 7D /0        [486,CYRIX,SMM]
+
+\c{RSTS} restores the Task State Register (TSR) from mem80.
+
+
+\S{insSAHF} \i\c{SAHF}: Store AH to Flags
+
+\c SAHF                          ; 9E                   [8086]
+
+\c{SAHF} sets the low byte of the flags word according to the
+contents of the \c{AH} register.
+
+The operation of \c{SAHF} is:
+
+\c  AH --> SF:ZF:0:AF:0:PF:1:CF
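+
+The mapping above can be expressed as a bit mask. The following C
+sketch is illustrative only, and the function name
+\c{sahf_flags_low} is hypothetical: it keeps the five flag bits taken
+from \c{AH}, forces bits 5 and 3 to zero, and forces bit 1 to one.
+
```c
#include <assert.h>
#include <stdint.h>

/* Low byte of the flags word after SAHF, given the contents of AH.
 * Bit layout, high to low: SF ZF 0 AF 0 PF 1 CF. */
uint8_t sahf_flags_low(uint8_t ah)
{
    return (uint8_t)((ah & 0xD5) | 0x02);  /* 0xD5 selects SF, ZF, AF, PF, CF */
}

void sahf_example(void)
{
    assert(sahf_flags_low(0xFF) == 0xD7);  /* all five flags set */
    assert(sahf_flags_low(0x00) == 0x02);  /* only the fixed bit 1 remains */
}
```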
+
+See also \c{LAHF} (\k{insLAHF}).
+
+
+\S{insSAL} \i\c{SAL}, \i\c{SAR}: Bitwise Arithmetic Shifts
+
+\c SAL r/m8,1                    ; D0 /4                [8086]
+\c SAL r/m8,CL                   ; D2 /4                [8086]
+\c SAL r/m8,imm8                 ; C0 /4 ib             [186]
+\c SAL r/m16,1                   ; o16 D1 /4            [8086]
+\c SAL r/m16,CL                  ; o16 D3 /4            [8086]
+\c SAL r/m16,imm8                ; o16 C1 /4 ib         [186]
+\c SAL r/m32,1                   ; o32 D1 /4            [386]
+\c SAL r/m32,CL                  ; o32 D3 /4            [386]
+\c SAL r/m32,imm8                ; o32 C1 /4 ib         [386]
+
+\c SAR r/m8,1                    ; D0 /7                [8086]
+\c SAR r/m8,CL                   ; D2 /7                [8086]
+\c SAR r/m8,imm8                 ; C0 /7 ib             [186]
+\c SAR r/m16,1                   ; o16 D1 /7            [8086]
+\c SAR r/m16,CL                  ; o16 D3 /7            [8086]
+\c SAR r/m16,imm8                ; o16 C1 /7 ib         [186]
+\c SAR r/m32,1                   ; o32 D1 /7            [386]
+\c SAR r/m32,CL                  ; o32 D3 /7            [386]
+\c SAR r/m32,imm8                ; o32 C1 /7 ib         [386]
+
+\c{SAL} and \c{SAR} perform an arithmetic shift operation on the given
+source/destination (first) operand. The vacated bits are filled with
+zero for \c{SAL}, and with copies of the original high bit of the
+source operand for \c{SAR}.
+
+\c{SAL} is a synonym for \c{SHL} (see \k{insSHL}). NASM will
+assemble either one to the same code, but NDISASM will always
+disassemble that code as \c{SHL}.
+
+The number of bits to shift by is given by the second operand. Only
+the bottom five bits of the shift count are considered by processors
+above the 8086.
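+
+The difference between the two fill rules can be seen in a small C
+model of the 8-bit forms. This is a sketch, not a description of how
+any processor implements the shift; the names \c{sal8} and \c{sar8}
+are hypothetical, and the flags are not modelled.
+
```c
#include <assert.h>
#include <stdint.h>

/* SAL on an 8-bit operand: vacated low bits are filled with zero. */
uint8_t sal8(uint8_t value, unsigned count)
{
    count &= 31;  /* only the bottom five bits of the count are used */
    return count >= 8 ? 0 : (uint8_t)(value << count);
}

/* SAR on an 8-bit operand: vacated high bits are filled with copies
 * of the original top (sign) bit. */
uint8_t sar8(uint8_t value, unsigned count)
{
    uint8_t sign = value & 0x80;

    for (count &= 31; count != 0; count--)
        value = (uint8_t)((value >> 1) | sign);
    return value;
}

void shift_example(void)
{
    assert(sal8(0x90, 1) == 0x20);  /* top bit shifted out, zero shifted in */
    assert(sar8(0x90, 2) == 0xE4);  /* negative value: ones shifted in */
    assert(sar8(0x40, 1) == 0x20);  /* positive value: zeros shifted in */
}
```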
+
+You can force the longer (186 and upwards, beginning with a \c{C1}
+byte) form of \c{SAL foo,1} by using a \c{BYTE} prefix: \c{SAL
+foo,BYTE 1}. Similarly with \c{SAR}.
+
+
+\S{insSALC} \i\c{SALC}: Set AL from Carry Flag
+
+\c SALC                          ; D6                  [8086,UNDOC]
+
+\c{SALC} is an early undocumented instruction similar in concept to
+\c{SETcc} (\k{insSETcc}). Its function is to set \c{AL} to zero if
+the carry flag is clear, or to \c{0xFF} if it is set.
+
+
+\S{insSBB} \i\c{SBB}: Subtract with Borrow
+
+\c SBB r/m8,reg8                 ; 18 /r                [8086]
+\c SBB r/m16,reg16               ; o16 19 /r            [8086]
+\c SBB r/m32,reg32               ; o32 19 /r            [386]
+
+\c SBB reg8,r/m8                 ; 1A /r                [8086]
+\c SBB reg16,r/m16               ; o16 1B /r            [8086]
+\c SBB reg32,r/m32               ; o32 1B /r            [386]
+
+\c SBB r/m8,imm8                 ; 80 /3 ib             [8086]
+\c SBB r/m16,imm16               ; o16 81 /3 iw         [8086]
+\c SBB r/m32,imm32               ; o32 81 /3 id         [386]
+
+\c SBB r/m16,imm8                ; o16 83 /3 ib         [8086]
+\c SBB r/m32,imm8                ; o32 83 /3 ib         [386]
+
+\c SBB AL,imm8                   ; 1C ib                [8086]
+\c SBB AX,imm16                  ; o16 1D iw            [8086]
+\c SBB EAX,imm32                 ; o32 1D id            [386]
+
+\c{SBB} performs integer subtraction: it subtracts its second
+operand, plus the value of the carry flag, from its first, and
+leaves the result in its destination (first) operand. The flags are
+set according to the result of the operation: in particular, the
+carry flag is affected and can be used by a subsequent \c{SBB}
+instruction.
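+
+The usual reason to chain the carry flag in this way is to subtract
+numbers wider than a register. The C sketch below models a 64-bit
+subtraction built from two 32-bit steps; the names \c{sbb32} and
+\c{sub64} are hypothetical, and only the carry flag is modelled.
+
```c
#include <assert.h>
#include <stdint.h>

/* SBB on 32-bit operands: computes a - (b + CF). *carry holds CF on
 * entry and receives the borrow out of the subtraction on exit. */
uint32_t sbb32(uint32_t a, uint32_t b, int *carry)
{
    uint64_t wide = (uint64_t)a - b - (uint64_t)(*carry != 0);

    *carry = (int)((wide >> 32) & 1);  /* set when the subtraction borrowed */
    return (uint32_t)wide;
}

/* 64-bit subtraction as two 32-bit steps: the low halves first, then
 * the high halves with the borrow carried across. */
uint64_t sub64(uint64_t x, uint64_t y)
{
    int carry = 0;  /* with CF clear, SBB behaves like SUB */
    uint32_t lo = sbb32((uint32_t)x, (uint32_t)y, &carry);
    uint32_t hi = sbb32((uint32_t)(x >> 32), (uint32_t)(y >> 32), &carry);

    return ((uint64_t)hi << 32) | lo;
}

void sbb_example(void)
{
    /* The low half borrows, so the high half must drop by one. */
    assert(sub64(0x0000000100000000ULL, 1) == 0x00000000FFFFFFFFULL);
    assert(sub64(5, 3) == 2);
}
```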
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
+
+To subtract one number from another without also subtracting the
+contents of the carry flag, use \c{SUB} (\k{insSUB}).
+
+
+\S{insSCASB} \i\c{SCASB}, \i\c{SCASW}, \i\c{SCASD}: Scan String
+
+\c SCASB                         ; AE                   [8086]
+\c SCASW                         ; o16 AF               [8086]
+\c SCASD                         ; o32 AF               [386]
+
+\c{SCASB} compares the byte in \c{AL} with the byte at \c{[ES:DI]}
+or \c{[ES:EDI]}, and sets the flags accordingly. It then increments
+or decrements (depending on the direction flag: increments if the
+flag is clear, decrements if it is set) \c{DI} (or \c{EDI}).
+
+The register used is \c{DI} if the address size is 16 bits, and
+\c{EDI} if it is 32 bits. If you need to use an address size not
+equal to the current \c{BITS} setting, you can use an explicit
+\i\c{a16} or \i\c{a32} prefix.
+
+Segment override prefixes have no effect for this instruction: the
+use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be
+overridden.
+
+\c{SCASW} and \c{SCASD} work in the same way, but they compare a
+word to \c{AX} or a doubleword to \c{EAX} instead of a byte to
+\c{AL}, and increment or decrement the addressing registers by 2 or
+4 instead of 1.
+
+The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and
+\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or
+\c{ECX} - again, the address size chooses which) times, stopping
+early at the first unequal element (\c{REPE}) or the first equal
+element (\c{REPNE}).
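+
+A \c{REPNE SCASB} loop with the direction flag clear behaves much
+like a bounded byte search. The C sketch below is a simplified model:
+the function name \c{repne_scasb} is hypothetical, segmentation is
+ignored, and the zero flag is assumed to start out clear.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of REPNE SCASB with the direction flag clear. buf stands in
 * for the memory at [ES:DI], count for CX/ECX and al for AL.
 * Returns how far DI advanced; *zf is set if a matching byte stopped
 * the scan, and is left clear otherwise. */
size_t repne_scasb(const uint8_t *buf, size_t count, uint8_t al, int *zf)
{
    size_t di = 0;

    *zf = 0;  /* assumption: ZF is clear before the instruction runs */
    while (count != 0) {
        *zf = (buf[di] == al);  /* the comparison sets the flags */
        di++;                   /* DI advances even on a match */
        count--;
        if (*zf)
            break;              /* REPNE stops once the bytes are equal */
    }
    return di;
}

void repne_scasb_example(void)
{
    const uint8_t text[5] = { 'h', 'e', 'l', 'l', 'o' };
    int zf;

    /* The first 'l' is at index 2, so DI ends up one past it. */
    assert(repne_scasb(text, 5, 'l', &zf) == 3);
    assert(zf == 1);

    /* No match: the whole buffer is scanned and ZF stays clear. */
    assert(repne_scasb(text, 5, 'z', &zf) == 5);
    assert(zf == 0);
}
```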
+
+
+\S{insSETcc} \i\c{SETcc}: Set Register from Condition
+
+\c SETcc r/m8                    ; 0F 90+cc /2          [386]
+
+\c{SETcc} sets the given 8-bit operand to zero if its condition is
+not satisfied, and to 1 if it is.
+
+
+\S{insSFENCE} \i\c{SFENCE}: Store Fence
+
+\c SFENCE                        ; 0F AE /7        [KATMAI]
+
+\c{SFENCE} performs a serialising operation on all writes to memory
+that were issued before the \c{SFENCE} instruction. This guarantees that
+all memory writes before the \c{SFENCE} instruction are visible before any
+writes after the \c{SFENCE} instruction.
+
+\c{SFENCE} is ordered with respect to other \c{SFENCE} instructions,
+\c{MFENCE}, any memory write and any other serialising instruction
+(such as \c{CPUID}).
+
+Weakly ordered memory types can be used to achieve higher processor
+performance through such techniques as out-of-order issue,
+write-combining, and write-collapsing. The degree to which a consumer
+of data recognizes or knows that the data is weakly ordered varies
+among applications and may be unknown to the producer of this data.
+The \c{SFENCE} instruction provides a performance-efficient way of
+ensuring store ordering between routines that produce weakly-ordered
+results and routines that consume this data.
+
+\c{SFENCE} uses the following ModRM encoding:
+
+\c           Mod (7:6)        = 11B
+\c           Reg/Opcode (5:3) = 111B
+\c           R/M (2:0)        = 000B
+
+All other ModRM encodings are defined to be reserved, and use
+of these encodings risks incompatibility with future processors.
+
+See also \c{LFENCE} (\k{insLFENCE}) and \c{MFENCE} (\k{insMFENCE}).
+
+
+\S{insSGDT} \i\c{SGDT}, \i\c{SIDT}, \i\c{SLDT}: Store Descriptor Table Pointers
+
+\c SGDT mem                      ; 0F 01 /0             [286,PRIV]
+\c SIDT mem                      ; 0F 01 /1             [286,PRIV]
+\c SLDT r/m16                    ; 0F 00 /0             [286,PRIV]
+
+\c{SGDT} and \c{SIDT} both take a 6-byte memory area as an operand:
+they store the contents of the GDTR (global descriptor table
+register) or IDTR (interrupt descriptor table register) into that
+area as a 16-bit size limit followed by a 32-bit linear address
+(the limit occupies the two lowest-addressed bytes). These are the
+only instructions which directly use \e{linear} addresses, rather
+than segment/offset pairs.
+
+\c{SLDT} stores the segment selector corresponding to the LDT (local
+descriptor table) into the given operand.
+
+See also \c{LGDT}, \c{LIDT} and \c{LLDT} (\k{insLGDT}).
+
+
+\S{insSHL} \i\c{SHL}, \i\c{SHR}: Bitwise Logical Shifts
+
+\c SHL r/m8,1                    ; D0 /4                [8086]
+\c SHL r/m8,CL                   ; D2 /4                [8086]
+\c SHL r/m8,imm8                 ; C0 /4 ib             [186]
+\c SHL r/m16,1                   ; o16 D1 /4            [8086]
+\c SHL r/m16,CL                  ; o16 D3 /4            [8086]
+\c SHL r/m16,imm8                ; o16 C1 /4 ib         [186]
+\c SHL r/m32,1                   ; o32 D1 /4            [386]
+\c SHL r/m32,CL                  ; o32 D3 /4            [386]
+\c SHL r/m32,imm8                ; o32 C1 /4 ib         [386]
+
+\c SHR r/m8,1                    ; D0 /5                [8086]
+\c SHR r/m8,CL                   ; D2 /5                [8086]
+\c SHR r/m8,imm8                 ; C0 /5 ib             [186]
+\c SHR r/m16,1                   ; o16 D1 /5            [8086]
+\c SHR r/m16,CL                  ; o16 D3 /5            [8086]
+\c SHR r/m16,imm8                ; o16 C1 /5 ib         [186]
+\c SHR r/m32,1                   ; o32 D1 /5            [386]
+\c SHR r/m32,CL                  ; o32 D3 /5            [386]
+\c SHR r/m32,imm8                ; o32 C1 /5 ib         [386]
+
+\c{SHL} and \c{SHR} perform a logical shift operation on the given
+source/destination (first) operand. The vacated bits are filled with
+zero.
+
+A synonym for \c{SHL} is \c{SAL} (see \k{insSAL}). NASM will
+assemble either one to the same code, but NDISASM will always
+disassemble that code as \c{SHL}.
+
+The number of bits to shift by is given by the second operand. Only
+the bottom five bits of the shift count are considered by processors
+above the 8086.
+
+You can force the longer (186 and upwards, beginning with a \c{C1}
+byte) form of \c{SHL foo,1} by using a \c{BYTE} prefix: \c{SHL
+foo,BYTE 1}. Similarly with \c{SHR}.
+
+
+\S{insSHLD} \i\c{SHLD}, \i\c{SHRD}: Bitwise Double-Precision Shifts
+
+\c SHLD r/m16,reg16,imm8         ; o16 0F A4 /r ib      [386]
+\c SHLD r/m32,reg32,imm8         ; o32 0F A4 /r ib      [386]
+\c SHLD r/m16,reg16,CL           ; o16 0F A5 /r         [386]
+\c SHLD r/m32,reg32,CL           ; o32 0F A5 /r         [386]
+
+\c SHRD r/m16,reg16,imm8         ; o16 0F AC /r ib      [386]
+\c SHRD r/m32,reg32,imm8         ; o32 0F AC /r ib      [386]
+\c SHRD r/m16,reg16,CL           ; o16 0F AD /r         [386]
+\c SHRD r/m32,reg32,CL           ; o32 0F AD /r         [386]
+
+\b \c{SHLD} performs a double-precision left shift. It notionally
+places its second operand to the right of its first, then shifts
+the entire bit string thus generated to the left by a number of
+bits specified in the third operand. It then updates only the
+\e{first} operand according to the result of this. The second
+operand is not modified.
+
+\b \c{SHRD} performs the corresponding right shift: it notionally
+places the second operand to the \e{left} of the first, shifts the
+whole bit string right, and updates only the first operand.
+
+For example, if \c{EAX} holds \c{0x01234567} and \c{EBX} holds
+\c{0x89ABCDEF}, then the instruction \c{SHLD EAX,EBX,4} would update
+\c{EAX} to hold \c{0x12345678}. Under the same conditions, \c{SHRD
+EAX,EBX,4} would update \c{EAX} to hold \c{0xF0123456}.
+
+The number of bits to shift by is given by the third operand. Only
+the bottom five bits of the shift count are considered.
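+
+The 32-bit forms can be modelled in C as shown below. This is a
+minimal sketch with hypothetical function names \c{shld32} and
+\c{shrd32}; the flags are not modelled. \c{shxd_example} checks the
+two results quoted above.
+
```c
#include <assert.h>
#include <stdint.h>

/* SHLD on 32-bit operands: shift dst left, filling the vacated low
 * bits from the top of src. Returns the new value of dst. */
uint32_t shld32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;  /* only the bottom five bits of the count are used */
    if (count == 0)
        return dst;
    return (dst << count) | (src >> (32 - count));
}

/* SHRD on 32-bit operands: shift dst right, filling the vacated high
 * bits from the bottom of src. Returns the new value of dst. */
uint32_t shrd32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dst;
    return (dst >> count) | (src << (32 - count));
}

void shxd_example(void)
{
    const uint32_t eax = 0x01234567;
    const uint32_t ebx = 0x89ABCDEF;

    assert(shld32(eax, ebx, 4) == 0x12345678);  /* SHLD EAX,EBX,4 */
    assert(shrd32(eax, ebx, 4) == 0xF0123456);  /* SHRD EAX,EBX,4 */
}
```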
+
+
+\S{insSHUFPD} \i\c{SHUFPD}: Shuffle Packed Double-Precision FP Values
+
+\c SHUFPD xmm1,xmm2/m128,imm8    ; 66 0F C6 /r ib  [WILLAMETTE,SSE2]
+
+\c{SHUFPD} moves one of the packed double-precision FP values from
+the destination operand into the low quadword of the destination
+operand; the upper quadword is generated by moving one of the
+double-precision FP values from the source operand into the
+destination. The select (third) operand selects which of the values
+are moved to the destination register.
+
+The select operand is an 8-bit immediate: bit 0 selects which value
+is moved from the destination operand to the result (where 0 selects
+the low quadword and 1 selects the high quadword) and bit 1 selects
+which value is moved from the source operand to the result.
+Bits 2 through 7 of the shuffle operand are reserved.
+
+
+\S{insSHUFPS} \i\c{SHUFPS}: Shuffle Packed Single-Precision FP Values
+
+\c SHUFPS xmm1,xmm2/m128,imm8    ; 0F C6 /r ib     [KATMAI,SSE]
+
+\c{SHUFPS} moves two of the packed single-precision FP values from
+the destination operand into the low quadword of the destination
+operand; the upper quadword is generated by moving two of the
+single-precision FP values from the source operand into the
+destination. The select (third) operand selects which of the
+values are moved to the destination register.
+
+The select operand is an 8-bit immediate: bits 0 and 1 select the
+value to be moved from the destination operand to the low doubleword
+of the result, bits 2 and 3 select the value to be moved from the
+destination operand to the second doubleword of the result, bits 4
+and 5 select the value to be moved from the source operand to the
+third doubleword of the result, and bits 6 and 7 select the value to
+be moved from the source operand to the high doubleword of the result.
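+
+The selection can be sketched in C by treating each register as an
+array of four doublewords, index 0 being the low one. The function
+name \c{shufps} is hypothetical, and the elements are modelled as raw
+32-bit patterns rather than floating-point values, since the
+instruction only moves them.
+
```c
#include <assert.h>
#include <stdint.h>

/* Model of SHUFPS: dst and src each hold four doublewords, lowest
 * first. Each two-bit field of sel picks one input element. */
void shufps(uint32_t dst[4], const uint32_t src[4], uint8_t sel)
{
    /* Read every input before writing, since dst is also an input. */
    uint32_t r0 = dst[sel & 3];         /* bits 0-1: from destination */
    uint32_t r1 = dst[(sel >> 2) & 3];  /* bits 2-3: from destination */
    uint32_t r2 = src[(sel >> 4) & 3];  /* bits 4-5: from source */
    uint32_t r3 = src[(sel >> 6) & 3];  /* bits 6-7: from source */

    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}

void shufps_example(void)
{
    uint32_t dst[4] = { 10, 11, 12, 13 };
    const uint32_t src[4] = { 20, 21, 22, 23 };

    /* 0x1B is binary 00 01 10 11: fields select 3, 2, 1, 0 from low to high. */
    shufps(dst, src, 0x1B);
    assert(dst[0] == 13 && dst[1] == 12);
    assert(dst[2] == 21 && dst[3] == 20);
}
```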
+
+
+\S{insSMI} \i\c{SMI}: System Management Interrupt
+
+\c SMI                           ; F1                   [386,UNDOC]
+
+\c{SMI} puts some AMD processors into SMM mode. It is available on some
+386 and 486 processors, and is only available when DR7 bit 12 is set,
+otherwise it generates an Int 1.
+
+
+\S{insSMINT} \i\c{SMINT}, \i\c{SMINTOLD}: Software SMM Entry (CYRIX)
+
+\c SMINT                         ; 0F 38                [PENT,CYRIX]
+\c SMINTOLD                      ; 0F 7E                [486,CYRIX]
+
+\c{SMINT} puts the processor into SMM mode. The CPU state information is
+saved in the SMM memory header, and then execution begins at the SMM base
+address.
+
+\c{SMINTOLD} is the same as \c{SMINT}, but was the opcode used on the 486.
+
+This pair of opcodes are specific to the Cyrix and compatible range of
+processors (Cyrix, IBM, Via).
+
+
+\S{insSMSW} \i\c{SMSW}: Store Machine Status Word
+
+\c SMSW r/m16                    ; 0F 01 /4             [286,PRIV]
+
+\c{SMSW} stores the bottom half of the \c{CR0} control register (or
+the Machine Status Word, on 286 processors) into the destination
+operand. See also \c{LMSW} (\k{insLMSW}).
+
+For 32-bit code, this would store all of \c{CR0} in the specified
+register (or the bottom 16 bits if the destination is a memory location),
+without needing an operand size override byte.
+
+
+\S{insSQRTPD} \i\c{SQRTPD}: Packed Double-Precision FP Square Root
+
+\c SQRTPD xmm1,xmm2/m128         ; 66 0F 51 /r     [WILLAMETTE,SSE2]
+
+\c{SQRTPD} calculates the square roots of the packed double-precision
+FP values from the source operand, and stores the double-precision
+results in the destination register.
+
+
+\S{insSQRTPS} \i\c{SQRTPS}: Packed Single-Precision FP Square Root
+
+\c SQRTPS xmm1,xmm2/m128         ; 0F 51 /r        [KATMAI,SSE]
+
+\c{SQRTPS} calculates the square roots of the packed single-precision
+FP values from the source operand, and stores the single-precision
+results in the destination register.
+
+
+\S{insSQRTSD} \i\c{SQRTSD}: Scalar Double-Precision FP Square Root
+
+\c SQRTSD xmm1,xmm2/m64          ; F2 0F 51 /r     [WILLAMETTE,SSE2]
+
+\c{SQRTSD} calculates the square root of the low-order double-precision
+FP value from the source operand, and stores the double-precision
+result in the destination register. The high-quadword remains unchanged.
+
+
+\S{insSQRTSS} \i\c{SQRTSS}: Scalar Single-Precision FP Square Root
+
+\c SQRTSS xmm1,xmm2/m32          ; F3 0F 51 /r     [KATMAI,SSE]
+
+\c{SQRTSS} calculates the square root of the low-order single-precision
+FP value from the source operand, and stores the single-precision
+result in the destination register. The three high doublewords remain
+unchanged.
+
+
+\S{insSTC} \i\c{STC}, \i\c{STD}, \i\c{STI}: Set Flags
+
+\c STC                           ; F9                   [8086]
+\c STD                           ; FD                   [8086]
+\c STI                           ; FB                   [8086]
+
+These instructions set various flags. \c{STC} sets the carry flag;
+\c{STD} sets the direction flag; and \c{STI} sets the interrupt flag
+(thus enabling interrupts).
+
+To clear the carry, direction, or interrupt flags, use the \c{CLC},
+\c{CLD} and \c{CLI} instructions (\k{insCLC}). To invert the carry
+flag, use \c{CMC} (\k{insCMC}).
+
+
+\S{insSTMXCSR} \i\c{STMXCSR}: Store Streaming SIMD Extension
+ Control/Status
+
+\c STMXCSR m32                   ; 0F AE /3        [KATMAI,SSE]
+
+\c{STMXCSR} stores the contents of the \c{MXCSR} control/status
+register to the specified memory location. \c{MXCSR} is used to
+enable masked/unmasked exception handling, to set rounding modes,
+to set flush-to-zero mode, and to view exception status flags.
+The reserved bits in the \c{MXCSR} register are stored as 0s.
+
+For details of the \c{MXCSR} register, see the Intel processor docs.
+
+See also \c{LDMXCSR} (\k{insLDMXCSR}).
+
+
+\S{insSTOSB} \i\c{STOSB}, \i\c{STOSW}, \i\c{STOSD}: Store Byte to String
+
+\c STOSB                         ; AA                   [8086]
+\c STOSW                         ; o16 AB               [8086]
+\c STOSD                         ; o32 AB               [386]
+
+\c{STOSB} stores the byte in \c{AL} at \c{[ES:DI]} or \c{[ES:EDI]}.
+The flags are not affected. It then increments or decrements
+(depending on the direction flag: increments if the flag is clear,
+decrements if it is set) \c{DI} (or \c{EDI}).
+
+The register used is \c{DI} if the address size is 16 bits, and
+\c{EDI} if it is 32 bits. If you need to use an address size not
+equal to the current \c{BITS} setting, you can use an explicit
+\i\c{a16} or \i\c{a32} prefix.
+
+Segment override prefixes have no effect for this instruction: the
+use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be
+overridden.
+
+\c{STOSW} and \c{STOSD} work in the same way, but they store the
+word in \c{AX} or the doubleword in \c{EAX} instead of the byte in
+\c{AL}, and increment or decrement the addressing registers by 2 or
+4 instead of 1.
+
+The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
+\c{ECX} - again, the address size chooses which) times.
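+
+With a \c{REP} prefix, \c{STOSB} acts as a simple fill loop. The C
+sketch below models it with the destination as an array index so that
+both directions can be shown; the function name \c{rep_stosb} and its
+parameters are hypothetical, and segmentation is ignored.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of REP STOSB. mem stands in for the memory addressed through
 * ES, di for DI/EDI, count for CX/ECX, al for AL and df for the
 * direction flag. Returns the final value of di. */
ptrdiff_t rep_stosb(uint8_t *mem, ptrdiff_t di, size_t count,
                    uint8_t al, int df)
{
    while (count != 0) {
        mem[di] = al;
        di += df ? -1 : 1;  /* DF clear: increment; DF set: decrement */
        count--;
    }
    return di;
}

void rep_stosb_example(void)
{
    uint8_t up[4] = { 0 };
    uint8_t down[4] = { 0 };

    /* Direction flag clear: fill forwards from index 0. */
    assert(rep_stosb(up, 0, 4, 0xAA, 0) == 4);
    assert(up[0] == 0xAA && up[3] == 0xAA);

    /* Direction flag set: fill backwards from index 3. */
    assert(rep_stosb(down, 3, 2, 0x55, 1) == 1);
    assert(down[3] == 0x55 && down[2] == 0x55 && down[1] == 0x00);
}
```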
+
+
+\S{insSTR} \i\c{STR}: Store Task Register
+
+\c STR r/m16                     ; 0F 00 /1             [286,PRIV]
+
+\c{STR} stores the segment selector corresponding to the contents of
+the Task Register into its operand. When the operand size is 32 bits and
+the destination is a register, the upper 16 bits are cleared to 0s.
+When the destination operand is a memory location, 16 bits are
+written regardless of the operand size.
+
+
+\S{insSUB} \i\c{SUB}: Subtract Integers
+
+\c SUB r/m8,reg8                 ; 28 /r                [8086]
+\c SUB r/m16,reg16               ; o16 29 /r            [8086]
+\c SUB r/m32,reg32               ; o32 29 /r            [386]
+
+\c SUB reg8,r/m8                 ; 2A /r                [8086]
+\c SUB reg16,r/m16               ; o16 2B /r            [8086]
+\c SUB reg32,r/m32               ; o32 2B /r            [386]
+
+\c SUB r/m8,imm8                 ; 80 /5 ib             [8086]
+\c SUB r/m16,imm16               ; o16 81 /5 iw         [8086]
+\c SUB r/m32,imm32               ; o32 81 /5 id         [386]
+
+\c SUB r/m16,imm8                ; o16 83 /5 ib         [8086]
+\c SUB r/m32,imm8                ; o32 83 /5 ib         [386]
+
+\c SUB AL,imm8                   ; 2C ib                [8086]
+\c SUB AX,imm16                  ; o16 2D iw            [8086]
+\c SUB EAX,imm32                 ; o32 2D id            [386]
+
+\c{SUB} performs integer subtraction: it subtracts its second
+operand from its first, and leaves the result in its destination
+(first) operand. The flags are set according to the result of the
+operation: in particular, the carry flag is affected and can be used
+by a subsequent \c{SBB} instruction (\k{insSBB}).
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
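+
+For example, in 32-bit code and with the behaviour described above,
+the two lines below would be expected to select different encodings:
+
+\c         sub     esp,16          ; o32 81 /5 id: 32-bit immediate
+\c         sub     esp,byte 16     ; o32 83 /5 ib: sign-extended 8-bit immediate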
+
+
+\S{insSUBPD} \i\c{SUBPD}: Packed Double-Precision FP Subtract
+
+\c SUBPD xmm1,xmm2/m128          ; 66 0F 5C /r     [WILLAMETTE,SSE2]
+
+\c{SUBPD} subtracts the packed double-precision FP values of
+the source operand from those of the destination operand, and
+stores the result in the destination operand.
+
+
+\S{insSUBPS} \i\c{SUBPS}: Packed Single-Precision FP Subtract
+
+\c SUBPS xmm1,xmm2/m128          ; 0F 5C /r        [KATMAI,SSE]
+
+\c{SUBPS} subtracts the packed single-precision FP values of
+the source operand from those of the destination operand, and
+stores the result in the destination operand.
+
+
+\S{insSUBSD} \i\c{SUBSD}: Scalar Double-FP Subtract
+
+\c SUBSD xmm1,xmm2/m64           ; F2 0F 5C /r     [WILLAMETTE,SSE2]
+
+\c{SUBSD} subtracts the low-order double-precision FP value of
+the source operand from that of the destination operand, and
+stores the result in the destination operand. The high
+quadword is unchanged.
+
+
+\S{insSUBSS} \i\c{SUBSS}: Scalar Single-FP Subtract
+
+\c SUBSS xmm1,xmm2/m32           ; F3 0F 5C /r     [KATMAI,SSE]
+
+\c{SUBSS} subtracts the low-order single-precision FP value of
+the source operand from that of the destination operand, and
+stores the result in the destination operand. The three high
+doublewords are unchanged.
+
+
+\S{insSVDC} \i\c{SVDC}: Save Segment Register and Descriptor
+
+\c SVDC m80,segreg               ; 0F 78 /r        [486,CYRIX,SMM]
+
+\c{SVDC} saves a segment register (DS, ES, FS, GS, or SS) and its
+descriptor to mem80.
+
+
+\S{insSVLDT} \i\c{SVLDT}: Save LDTR and Descriptor
+
+\c SVLDT m80                     ; 0F 7A /0        [486,CYRIX,SMM]
+
+\c{SVLDT} saves the Local Descriptor Table Register (LDTR) to mem80.
+
+
+\S{insSVTS} \i\c{SVTS}: Save TSR and Descriptor
+
+\c SVTS m80                      ; 0F 7C /0        [486,CYRIX,SMM]
+
+\c{SVTS} saves the Task State Register (TSR) to mem80.
+
+
+\S{insSYSCALL} \i\c{SYSCALL}: Call Operating System
+
+\c SYSCALL                       ; 0F 05                [P6,AMD]
+
+\c{SYSCALL} provides a fast method of transferring control to a fixed
+entry point in an operating system.
+
+\b The \c{EIP} register is copied into the \c{ECX} register.
+
+\b Bits [31-0] of the 64-bit SYSCALL/SYSRET Target Address Register
+(\c{STAR}) are copied into the \c{EIP} register.
+
+\b Bits [47-32] of the \c{STAR} register specify the selector that is
+copied into the \c{CS} register.
+
+\b Bits [47-32]+1000b of the \c{STAR} register specify the selector that
+is copied into the \c{SS} register.
+
+The \c{CS} and \c{SS} registers should not be modified by the operating
+system between the execution of the \c{SYSCALL} instruction and its
+corresponding \c{SYSRET} instruction.
+
+For more information, see the \c{SYSCALL and SYSRET Instruction Specification}
+(AMD document number 21086.pdf).
+
+
+\S{insSYSENTER} \i\c{SYSENTER}: Fast System Call
+
+\c SYSENTER                      ; 0F 34                [P6]
+
+\c{SYSENTER} executes a fast call to a level 0 system procedure or
+routine. Before using this instruction, various MSRs need to be set
+up:
+
+\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the
+privilege level 0 code segment. (This value is also used to compute
+the segment selector of the privilege level 0 stack segment.)
+
+\b \c{SYSENTER_EIP_MSR} contains the 32-bit offset into the privilege
+level 0 code segment to the first instruction of the selected operating
+system procedure or routine.
+
+\b \c{SYSENTER_ESP_MSR} contains the 32-bit stack pointer for the
+privilege level 0 stack.
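+
+A minimal sketch of that setup follows. The MSR indices used (174h,
+175h and 176h) are those given in Intel's documentation; the symbols
+\c{KERNEL_CS}, \c{kernel_stack_top} and \c{sysenter_entry} are
+hypothetical and stand for whatever your operating system defines.
+The code must itself run at privilege level 0, since it uses
+\c{WRMSR} (\k{insWRMSR}).
+
+\c         xor     edx,edx                 ; high 32 bits of each MSR := 0
+\c         mov     ecx,174h                ; SYSENTER_CS_MSR
+\c         mov     eax,KERNEL_CS           ; ring 0 code selector (assumed)
+\c         wrmsr
+\c         mov     ecx,175h                ; SYSENTER_ESP_MSR
+\c         mov     eax,kernel_stack_top    ; ring 0 stack pointer (assumed)
+\c         wrmsr
+\c         mov     ecx,176h                ; SYSENTER_EIP_MSR
+\c         mov     eax,sysenter_entry      ; ring 0 entry point (assumed)
+\c         wrmsr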
+
+\c{SYSENTER} performs the following sequence of operations:
+
+\b Loads the segment selector from the \c{SYSENTER_CS_MSR} into the
+\c{CS} register.
+
+\b Loads the instruction pointer from the \c{SYSENTER_EIP_MSR} into
+the \c{EIP} register.
+
+\b Adds 8 to the value in \c{SYSENTER_CS_MSR} and loads it into the
+\c{SS} register.
+
+\b Loads the stack pointer from the \c{SYSENTER_ESP_MSR} into the
+\c{ESP} register.
+
+\b Switches to privilege level 0.
+
+\b Clears the \c{VM} flag in the \c{EFLAGS} register, if the flag
+is set.
+
+\b Begins executing the selected system procedure.
+
+In particular, note that this instruction does not save the values of
+\c{CS} or \c{(E)IP}. If you need to return to the calling code, you
+need to write your code to cater for this.
+
+For more information, see the Intel Architecture Software Developer's
+Manual, Volume 2.
+
+
+\S{insSYSEXIT} \i\c{SYSEXIT}: Fast Return From System Call
+
+\c SYSEXIT                       ; 0F 35                [P6,PRIV]
+
+\c{SYSEXIT} executes a fast return to privilege level 3 user code.
+This instruction is a companion instruction to the \c{SYSENTER}
+instruction, and can only be executed by privilege level 0 code.
+Various registers need to be set up before calling this instruction:
+
+\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the
+privilege level 0 code segment in which the processor is currently
+executing. (This value is used to compute the segment selectors for
+the privilege level 3 code and stack segments.)
+
+\b \c{EDX} contains the 32-bit offset into the privilege level 3 code
+segment to the first instruction to be executed in the user code.
+
+\b \c{ECX} contains the 32-bit stack pointer for the privilege level 3
+stack.
+
+\c{SYSEXIT} performs the following sequence of operations:
+
+\b Adds 16 to the value in \c{SYSENTER_CS_MSR} and loads the sum into
+the \c{CS} selector register.
+
+\b Loads the instruction pointer from the \c{EDX} register into the
+\c{EIP} register.
+
+\b Adds 24 to the value in \c{SYSENTER_CS_MSR} and loads the sum
+into the \c{SS} selector register.
+
+\b Loads the stack pointer from the \c{ECX} register into the \c{ESP}
+register.
+
+\b Switches to privilege level 3.
+
+\b Begins executing the user code at the \c{EIP} address.
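+
+A sketch of the return path, assuming the user-mode instruction and
+stack pointers were saved on entry in the hypothetical variables
+\c{user_eip} and \c{user_esp}:
+
+\c         mov     edx,[user_eip]  ; becomes EIP at privilege level 3
+\c         mov     ecx,[user_esp]  ; becomes ESP at privilege level 3
+\c         sysexit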
+
+For more information on the use of the \c{SYSENTER} and \c{SYSEXIT}
+instructions, see the Intel Architecture Software Developer's
+Manual, Volume 2.
+
+
+\S{insSYSRET} \i\c{SYSRET}: Return From Operating System
+
+\c SYSRET                        ; 0F 07                [P6,AMD,PRIV]
+
+\c{SYSRET} is the return instruction used in conjunction with the
+\c{SYSCALL} instruction to provide fast entry/exit to an operating system.
+
+\b The \c{ECX} register, which points to the next sequential instruction
+after the corresponding \c{SYSCALL} instruction, is copied into the \c{EIP}
+register.
+
+\b Bits [63-48] of the \c{STAR} register specify the selector that is copied
+into the \c{CS} register.
+
+\b Bits [63-48]+1000b of the \c{STAR} register specify the selector that is
+copied into the \c{SS} register.
+
+\b Bits [1-0] of the \c{SS} register are set to 11b (RPL of 3) regardless of
+the value of bits [49-48] of the \c{STAR} register.
+
+The \c{CS} and \c{SS} registers should not be modified by the operating
+system between the execution of the \c{SYSCALL} instruction and its
+corresponding \c{SYSRET} instruction.
+
+For more information, see the \c{SYSCALL and SYSRET Instruction Specification}
+(AMD document number 21086.pdf).
+
+
+\S{insTEST} \i\c{TEST}: Test Bits (notional bitwise AND)
+
+\c TEST r/m8,reg8                ; 84 /r                [8086]
+\c TEST r/m16,reg16              ; o16 85 /r            [8086]
+\c TEST r/m32,reg32              ; o32 85 /r            [386]
+
+\c TEST r/m8,imm8                ; F6 /0 ib             [8086]
+\c TEST r/m16,imm16              ; o16 F7 /0 iw         [8086]
+\c TEST r/m32,imm32              ; o32 F7 /0 id         [386]
+
+\c TEST AL,imm8                  ; A8 ib                [8086]
+\c TEST AX,imm16                 ; o16 A9 iw            [8086]
+\c TEST EAX,imm32                ; o32 A9 id            [386]
+
+\c{TEST} performs a `mental' bitwise AND of its two operands, and
+affects the flags as if the operation had taken place, but does not
+store the result of the operation anywhere.
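+
+A typical use is to branch on a single bit without disturbing the
+register being examined. In this sketch, \c{is_odd} is a hypothetical
+label:
+
+\c         test    al,1            ; ZF is set if (AL AND 1) is zero; AL unchanged
+\c         jnz     is_odd          ; taken if bit 0 of AL was set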
+
+
+\S{insUCOMISD} \i\c{UCOMISD}: Unordered Scalar Double-Precision FP
+compare and set EFLAGS
+
+\c UCOMISD xmm1,xmm2/m64         ; 66 0F 2E /r     [WILLAMETTE,SSE2]
+
+\c{UCOMISD} compares the low-order double-precision FP numbers in the
+two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the
+\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits
+in the \c{EFLAGS} register are zeroed out. The unordered predicate
+(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source
+operand is a \c{NaN} (\c{qNaN} or \c{sNaN}).
+
+
+\S{insUCOMISS} \i\c{UCOMISS}: Unordered Scalar Single-Precision FP
+compare and set EFLAGS
+
+\c UCOMISS xmm1,xmm2/m32         ; 0F 2E /r        [KATMAI,SSE]
+
+\c{UCOMISS} compares the low-order single-precision FP numbers in the
+two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the
+\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits
+in the \c{EFLAGS} register are zeroed out. The unordered predicate
+(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source
+operand is a \c{NaN} (\c{qNaN} or \c{sNaN}).
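+
+Because the result is delivered in \c{ZF}, \c{PF} and \c{CF}, the
+unsigned conditional jumps can be used to act on it. The following is
+a sketch only; the labels are hypothetical:
+
+\c         ucomiss xmm0,xmm1
+\c         jp      unordered       ; PF set: at least one operand is a NaN
+\c         ja      greater         ; CF and ZF both clear: xmm0 > xmm1
+\c         je      equal           ; ZF set: xmm0 == xmm1
+\c                                 ; fall through: xmm0 < xmm1 (CF set)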
+
+
+\S{insUD2} \i\c{UD0}, \i\c{UD1}, \i\c{UD2}: Undefined Instruction
+
+\c UD0                           ; 0F FF                [186,UNDOC]
+\c UD1                           ; 0F B9                [186,UNDOC]
+\c UD2                           ; 0F 0B                [186]
+
+\c{UDx} can be used to generate an invalid opcode exception, for testing
+purposes.
+
+\c{UD0} is specifically documented by AMD as being reserved for this
+purpose.
+
+\c{UD1} is documented by Intel as being available for this purpose.
+
+\c{UD2} is specifically documented by Intel as being reserved for this
+purpose. Intel document this as the preferred method of generating an
+invalid opcode exception.
+
+All these opcodes can be used to generate invalid opcode exceptions on
+all currently available processors.
+
+
+\S{insUMOV} \i\c{UMOV}: User Move Data
+
+\c UMOV r/m8,reg8                ; 0F 10 /r             [386,UNDOC]
+\c UMOV r/m16,reg16              ; o16 0F 11 /r         [386,UNDOC]
+\c UMOV r/m32,reg32              ; o32 0F 11 /r         [386,UNDOC]
+
+\c UMOV reg8,r/m8                ; 0F 12 /r             [386,UNDOC]
+\c UMOV reg16,r/m16              ; o16 0F 13 /r         [386,UNDOC]
+\c UMOV reg32,r/m32              ; o32 0F 13 /r         [386,UNDOC]
+
+This undocumented instruction is used by in-circuit emulators to
+access user memory (as opposed to host memory). It is used just like
+an ordinary memory/register or register/register \c{MOV}
+instruction, but accesses user space.
+
+This instruction is only available on some AMD and IBM 386 and 486
+processors.
+
+
+\S{insUNPCKHPD} \i\c{UNPCKHPD}: Unpack and Interleave High Packed
+Double-Precision FP Values
+
+\c UNPCKHPD xmm1,xmm2/m128       ; 66 0F 15 /r     [WILLAMETTE,SSE2]
+
+\c{UNPCKHPD} performs an interleaved unpack of the high-order data
+elements of the source and destination operands, saving the result
+in \c{xmm1}. It ignores the lower half of the sources.
+
+The operation of this instruction is:
+
+\c    dst[63-0]   := dst[127-64];
+\c    dst[127-64] := src[127-64].
+
+
+\S{insUNPCKHPS} \i\c{UNPCKHPS}: Unpack and Interleave High Packed
+Single-Precision FP Values
+
+\c UNPCKHPS xmm1,xmm2/m128       ; 0F 15 /r        [KATMAI,SSE]
+
+\c{UNPCKHPS} performs an interleaved unpack of the high-order data
+elements of the source and destination operands, saving the result
+in \c{xmm1}. It ignores the lower half of the sources.
+
+The operation of this instruction is:
+
+\c    dst[31-0]   := dst[95-64];
+\c    dst[63-32]  := src[95-64];
+\c    dst[95-64]  := dst[127-96];
+\c    dst[127-96] := src[127-96].
+
+
+\S{insUNPCKLPD} \i\c{UNPCKLPD}: Unpack and Interleave Low Packed
+Double-Precision FP Data
+
+\c UNPCKLPD xmm1,xmm2/m128       ; 66 0F 14 /r     [WILLAMETTE,SSE2]
+
+\c{UNPCKLPD} performs an interleaved unpack of the low-order data
+elements of the source and destination operands, saving the result
+in \c{xmm1}. It ignores the upper half of the sources.
+
+The operation of this instruction is:
+
+\c    dst[63-0]   := dst[63-0];
+\c    dst[127-64] := src[63-0].
+
+
+\S{insUNPCKLPS} \i\c{UNPCKLPS}: Unpack and Interleave Low Packed
+Single-Precision FP Data
+
+\c UNPCKLPS xmm1,xmm2/m128       ; 0F 14 /r        [KATMAI,SSE]
+
+\c{UNPCKLPS} performs an interleaved unpack of the low-order data
+elements of the source and destination operands, saving the result
+in \c{xmm1}. It ignores the upper half of the sources.
+
+The operation of this instruction is:
+
+\c    dst[31-0]   := dst[31-0];
+\c    dst[63-32]  := src[31-0];
+\c    dst[95-64]  := dst[63-32];
+\c    dst[127-96] := src[63-32].
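+
+As a worked illustration, writing each register as four
+single-precision elements from high to low (the element names are
+purely notational):
+
+\c    dst before = [ d3, d2, d1, d0 ]
+\c    src        = [ s3, s2, s1, s0 ]
+\c    dst after  = [ s1, d1, s0, d0 ]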
+
+
+\S{insVERR} \i\c{VERR}, \i\c{VERW}: Verify Segment Readability/Writability
+
+\c VERR r/m16                    ; 0F 00 /4             [286,PRIV]
+
+\c VERW r/m16                    ; 0F 00 /5             [286,PRIV]
+
+\b \c{VERR} sets the zero flag if the segment specified by the selector
+in its operand can be read from at the current privilege level.
+Otherwise it is cleared.
+
+\b \c{VERW} sets the zero flag if the segment can be written to at the
+current privilege level. Otherwise it is cleared.
+
+
+\S{insWAIT} \i\c{WAIT}: Wait for Floating-Point Processor
+
+\c WAIT                          ; 9B                   [8086]
+\c FWAIT                         ; 9B                   [8086]
+
+\c{WAIT}, on 8086 systems with a separate 8087 FPU, waits for the
+FPU to have finished any operation it is engaged in before
+continuing main processor operations, so that (for example) an FPU
+store to main memory can be guaranteed to have completed before the
+CPU tries to read the result back out.
+
+On higher processors, \c{WAIT} is unnecessary for this purpose, and
+it has the alternative purpose of ensuring that any pending unmasked
+FPU exceptions have happened before execution continues.
+
+
+\S{insWBINVD} \i\c{WBINVD}: Write Back and Invalidate Cache
+
+\c WBINVD                        ; 0F 09                [486]
+
+\c{WBINVD} invalidates and empties the processor's internal caches,
+and causes the processor to instruct external caches to do the same.
+It writes the contents of the caches back to memory first, so no
+data is lost. To flush the caches quickly without bothering to write
+the data back first, use \c{INVD} (\k{insINVD}).
+
+
+\S{insWRMSR} \i\c{WRMSR}: Write Model-Specific Registers
+
+\c WRMSR                         ; 0F 30                [PENT]
+
+\c{WRMSR} writes the value in \c{EDX:EAX} to the processor
+Model-Specific Register (MSR) whose index is stored in \c{ECX}.
+See also \c{RDMSR} (\k{insRDMSR}).
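+
+The usual pattern is a read-modify-write using \c{RDMSR}. As a
+sketch, assuming a processor that implements the \c{EFER} register
+at index \c{C0000080h} with the \c{SYSCALL} enable bit (\c{SCE}) in
+bit 0, as AMD documents it, and code running at privilege level 0:
+
+\c         mov     ecx,0C0000080h  ; MSR index (EFER, assumed to be present)
+\c         rdmsr                   ; EDX:EAX := current value
+\c         or      eax,1           ; set bit 0 (SCE)
+\c         wrmsr                   ; write EDX:EAX back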
+
+
+\S{insWRSHR} \i\c{WRSHR}: Write SMM Header Pointer Register
+
+\c WRSHR r/m32                   ; 0F 37 /0        [386,CYRIX,SMM]
+
+\c{WRSHR} loads the contents of either a 32-bit memory location or a
+32-bit register into the SMM header pointer register.
+
+See also \c{RDSHR} (\k{insRDSHR}).
+
+
+\S{insXADD} \i\c{XADD}: Exchange and Add
+
+\c XADD r/m8,reg8                ; 0F C0 /r             [486]
+\c XADD r/m16,reg16              ; o16 0F C1 /r         [486]
+\c XADD r/m32,reg32              ; o32 0F C1 /r         [486]
+
+\c{XADD} exchanges the values in its two operands, and then adds
+them together and writes the result into the destination (first)
+operand. This instruction can be used with a \c{LOCK} prefix for
+multi-processor synchronisation purposes.
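+
+For instance, an atomic fetch-and-add might be sketched as follows,
+where \c{counter} is a hypothetical doubleword variable:
+
+\c         mov     eax,1
+\c         lock xadd [counter],eax ; counter += 1; EAX := old value of counter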
+
+
+\S{insXBTS} \i\c{XBTS}: Extract Bit String
+
+\c XBTS reg16,r/m16              ; o16 0F A6 /r         [386,UNDOC]
+\c XBTS reg32,r/m32              ; o32 0F A6 /r         [386,UNDOC]
+
+The implied operation of this instruction is:
+
+\c XBTS r/m16,reg16,AX,CL
+\c XBTS r/m32,reg32,EAX,CL
+
+Writes a bit string from the source operand to the destination. \c{CL}
+indicates the number of bits to be copied, and \c{(E)AX} indicates the
+low order bit offset in the source. The bits are written to the low
+order bits of the destination register. For example, if \c{CL} is set
+to 4 and \c{AX} (for 16-bit code) is set to 5, bits 5-8 of \c{src} will
+be copied to bits 0-3 of \c{dst}. This instruction is very poorly
+documented, and I have been unable to find any official source of
+documentation on it.
+
+\c{XBTS} is supported only on the early Intel 386s, and conflicts with
+the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM supports it
+only for completeness. Its counterpart is \c{IBTS} (see \k{insIBTS}).
+
+
+\S{insXCHG} \i\c{XCHG}: Exchange
+
+\c XCHG reg8,r/m8                ; 86 /r                [8086]
+\c XCHG reg16,r/m16              ; o16 87 /r            [8086]
+\c XCHG reg32,r/m32              ; o32 87 /r            [386]
+
+\c XCHG r/m8,reg8                ; 86 /r                [8086]
+\c XCHG r/m16,reg16              ; o16 87 /r            [8086]
+\c XCHG r/m32,reg32              ; o32 87 /r            [386]
+
+\c XCHG AX,reg16                 ; o16 90+r             [8086]
+\c XCHG EAX,reg32                ; o32 90+r             [386]
+\c XCHG reg16,AX                 ; o16 90+r             [8086]
+\c XCHG reg32,EAX                ; o32 90+r             [386]
+
+\c{XCHG} exchanges the values in its two operands. It can be used
+with a \c{LOCK} prefix for purposes of multi-processor
+synchronisation.
+
+\c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the \c{BITS}
+setting) generates the opcode \c{90h}, and so is a synonym for
+\c{NOP} (\k{insNOP}).
+
+
+\S{insXLATB} \i\c{XLATB}: Translate Byte in Lookup Table
+
+\c XLAT                          ; D7                   [8086]
+\c XLATB                         ; D7                   [8086]
+
+\c{XLATB} adds the value in \c{AL}, treated as an unsigned byte, to
+\c{BX} or \c{EBX}, and loads the byte from the resulting address (in
+the segment specified by \c{DS}) back into \c{AL}.
+
+The base register used is \c{BX} if the address size is 16 bits, and
+\c{EBX} if it is 32 bits. If you need to use an address size not
+equal to the current \c{BITS} setting, you can use an explicit
+\i\c{a16} or \i\c{a32} prefix.
+
+The segment register used to load from \c{[BX+AL]} or \c{[EBX+AL]}
+can be overridden by using a segment register name as a prefix (for
+example, \c{es xlatb}).
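+
+As an illustration, assuming 32-bit code and a hypothetical table
+\c{hexdigits} placed in a data section addressed through \c{DS}:
+
+\c hexdigits db    '0123456789ABCDEF'
+
+the low four bits of \c{AL} could be converted to an ASCII hex digit
+with:
+
+\c         mov     ebx,hexdigits   ; DS:EBX -> 16-byte table
+\c         and     al,0Fh          ; keep the index in the range 0..15
+\c         xlatb                   ; AL := [DS:EBX+AL]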
+
+
+\S{insXOR} \i\c{XOR}: Bitwise Exclusive OR
+
+\c XOR r/m8,reg8                 ; 30 /r                [8086]
+\c XOR r/m16,reg16               ; o16 31 /r            [8086]
+\c XOR r/m32,reg32               ; o32 31 /r            [386]
+
+\c XOR reg8,r/m8                 ; 32 /r                [8086]
+\c XOR reg16,r/m16               ; o16 33 /r            [8086]
+\c XOR reg32,r/m32               ; o32 33 /r            [386]
+
+\c XOR r/m8,imm8                 ; 80 /6 ib             [8086]
+\c XOR r/m16,imm16               ; o16 81 /6 iw         [8086]
+\c XOR r/m32,imm32               ; o32 81 /6 id         [386]
+
+\c XOR r/m16,imm8                ; o16 83 /6 ib         [8086]
+\c XOR r/m32,imm8                ; o32 83 /6 ib         [386]
+
+\c XOR AL,imm8                   ; 34 ib                [8086]
+\c XOR AX,imm16                  ; o16 35 iw            [8086]
+\c XOR EAX,imm32                 ; o32 35 id            [386]
+
+\c{XOR} performs a bitwise XOR operation between its two operands
+(i.e. each bit of the result is 1 if and only if exactly one of the
+corresponding bits of the two inputs was 1), and stores the result
+in the destination (first) operand.
+
+In the forms with an 8-bit immediate second operand and a longer
+first operand, the second operand is considered to be signed, and is
+sign-extended to the length of the first operand. In these cases,
+the \c{BYTE} qualifier is necessary to force NASM to generate this
+form of the instruction.
+
+The \c{MMX} instruction \c{PXOR} (see \k{insPXOR}) performs the same
+operation on the 64-bit \c{MMX} registers.
+
+
+\S{insXORPD} \i\c{XORPD}: Bitwise Logical XOR of Double-Precision FP Values
+
+\c XORPD xmm1,xmm2/m128          ; 66 0F 57 /r     [WILLAMETTE,SSE2]
+
+\c{XORPD} returns a bit-wise logical XOR between the source and
+destination operands, storing the result in the destination operand.
+
+
+\S{insXORPS} \i\c{XORPS}: Bitwise Logical XOR of Single-Precision FP Values
+
+\c XORPS xmm1,xmm2/m128          ; 0F 57 /r        [KATMAI,SSE]
+
+\c{XORPS} returns a bit-wise logical XOR between the source and
+destination operands, storing the result in the destination operand.
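+
+One common use, shown here as an illustration rather than anything
+specific to NASM, is clearing a register by XORing it with itself:
+
+\c         xorps   xmm0,xmm0       ; every bit of XMM0 := 0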
+
+
diff --git a/doc/nasmdoc.src b/doc/nasmdoc.src
index ee8c0f62..197011af 100644
--- a/doc/nasmdoc.src
+++ b/doc/nasmdoc.src
@@ -4,7 +4,7 @@
 \#
 \M{category}{Programming}
 \M{title}{NASM - The Netwide Assembler}
-\M{year}{2003}
+\M{year}{2007}
 \M{author}{The NASM Development Team}
 \M{license}{All rights reserved. This document is redistributable under the license given in the file "COPYING" distributed in the NASM archive.}
 \M{summary}{This file documents NASM, the Netwide Assembler: an assembler targetting the Intel x86 series of processors, with portable source.}
@@ -1096,9 +1096,11 @@ they can be \i{effective addresses} (see \k{effaddr}), constants
 
 For \i{floating-point} instructions, NASM accepts a wide range of
 syntaxes: you can use two-operand forms like MASM supports, or you
-can use NASM's native single-operand forms in most cases. Details of
-all forms of each supported instruction are given in
-\k{iref}. For example, you can code:
+can use NASM's native single-operand forms in most cases.
+\# Details of
+\# all forms of each supported instruction are given in
+\# \k{iref}.
+For example, you can code:
 
 \c         fadd    st1             ; this sets st0 := st0 + st1
 \c         fadd    st0,st1         ; so does this
@@ -1304,6 +1306,11 @@ fact, it will also split \c{[eax*2+offset]} into
 the \c{NOSPLIT} keyword: \c{[nosplit eax*2]} will force
 \c{[eax*2+0]} to be generated literally.
 
+In 64-bit mode, NASM will by default generate absolute addresses.  The
+\i\c{REL} keyword makes it produce \c{RIP}-relative addresses. Since
+this is frequently the desired behaviour, see the \c{DEFAULT}
+directive.  The keyword \i\c{ABS} overrides \i\c{REL}.
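+
+A brief sketch, in which \c{foo} is a hypothetical label:
+
+\c         mov     rax,[foo]       ; absolute address (the default)
+\c         mov     rax,[rel foo]   ; RIP-relative address
+\c         mov     rax,[abs foo]   ; absolute, even if REL is the default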
+
 
 \H{const} \i{Constants}
 
@@ -1350,7 +1357,8 @@ then the constant generated is not \c{0x61626364}, but
 \c{0x64636261}, so that if you were then to store the value into
 memory, it would read \c{abcd} rather than \c{dcba}. This is also
 the sense of character constants understood by the Pentium's
-\i\c{CPUID} instruction (see \k{insCPUID}).
+\i\c{CPUID} instruction.
+\# (see \k{insCPUID})
 
 
 \S{strconst} String Constants
@@ -3262,7 +3270,8 @@ local variables in C are an example of this kind of variable. The
 (see \k{stacksize} and is also compatible with the \c{%arg} directive
 (see \k{arg}). It allows simplified reference to variables on the
 stack which have been allocated typically by using the \c{ENTER}
-instruction (see \k{insENTER} for a description of that instruction).
+instruction.
+\# (see \k{insENTER} for a description of that instruction).
 An example of its use is the following:
 
 \c silly_swap:
@@ -3428,7 +3437,7 @@ the REX prefix is used. In summary, the \c{REX} prefix causes the addressing
 of AH, BH, CH and DH to be replaced by SPL, BPL, SIL and DIL.
 
 The \c{BITS} directive has an exactly equivalent primitive form,
-\c{[BITS 16]}, \c{[BITS 32]} and \c{BITS 64]}. The user-level form is
+\c{[BITS 16]}, \c{[BITS 32]} and \c{[BITS 64]}. The user-level form is
 a macro which has no function other than to call the primitive form.
 
 Note that the space is neccessary, e.g. \c{BITS32} will \e{not} work!
@@ -3439,6 +3448,25 @@ The `\c{USE16}' and `\c{USE32}' directives can be used in place of
 `\c{BITS 16}' and `\c{BITS 32}', for compatibility with other assemblers.
 
 
+\H{default} \i\c{DEFAULT}: Change the assembler defaults
+
+The \c{DEFAULT} directive changes the assembler defaults.  Normally,
+NASM defaults to a mode where the programmer is expected to explicitly
+specify most features directly.  However, this is occasionally
+obnoxious, as the explicit form is pretty much the only one that one
+wishes to use.
+
+Currently, the only \c{DEFAULT} that is settable is whether
+registerless instructions in 64-bit mode are \c{RIP}-relative.
+By default, they are absolute unless overridden with the \i\c{REL}
+specifier.  However, if \c{DEFAULT REL} is specified, \c{REL} is
+default, unless overridden with the \c{ABS} specifier, \e{except when
+used with an \c{FS} or \c{GS} segment override}.  The special handling
+of \c{FS} and \c{GS} overrides is due to the fact that these
+registers are generally used as thread pointers or other special
+functions in 64-bit mode, and generating \c{RIP}-relative addresses
+would be extremely confusing.
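+
+The behaviour described above can be sketched as follows, with
+\c{foo} again standing for a hypothetical label:
+
+\c         default rel
+\c         mov     rax,[foo]       ; now RIP-relative
+\c         mov     rax,[abs foo]   ; explicitly absolute
+\c         mov     rax,[fs:foo]    ; FS override: remains absolute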
+
 \H{section} \i\c{SECTION} or \i\c{SEGMENT}: Changing and \i{Defining
 Sections}
 
@@ -6140,10 +6168,14 @@ corresponding \c{a16} prefix can be used.
 The \c{a16} and \c{a32} prefixes can be applied to any instruction
 in NASM's instruction table, but most of them can generate all the
 useful forms without them. The prefixes are necessary only for
-instructions with implicit addressing: \c{CMPSx} (\k{insCMPSB}),
-\c{SCASx} (\k{insSCASB}), \c{LODSx} (\k{insLODSB}), \c{STOSx}
-(\k{insSTOSB}), \c{MOVSx} (\k{insMOVSB}), \c{INSx} (\k{insINSB}),
-\c{OUTSx} (\k{insOUTSB}), and \c{XLATB} (\k{insXLATB}). Also, the
+instructions with implicit addressing:
+\# \c{CMPSx} (\k{insCMPSB}),
+\# \c{SCASx} (\k{insSCASB}), \c{LODSx} (\k{insLODSB}), \c{STOSx}
+\# (\k{insSTOSB}), \c{MOVSx} (\k{insMOVSB}), \c{INSx} (\k{insINSB}),
+\# \c{OUTSx} (\k{insOUTSB}), and \c{XLATB} (\k{insXLATB}).
+\c{CMPSx}, \c{SCASx}, \c{LODSx}, \c{STOSx}, \c{MOVSx}, \c{INSx},
+\c{OUTSx}, and \c{XLATB}.
+Also, the
 various push and pop instructions (\c{PUSHA} and \c{POPF} as well as
 the more usual \c{PUSH} and \c{POP}) can accept \c{a16} or \c{a32}
 prefixes to force a particular one of \c{SP} or \c{ESP} to be used
@@ -6168,6 +6200,28 @@ one.
 when in 16-bit mode, but this seems less useful.)
 
 
+\C{64bit} Writing 64-bit Code (Unix, Win64)
+
+This chapter attempts to cover some of the common issues involved when
+writing 64-bit code, to run under \i{Win64} or Unix.  It covers how to
+write assembly code to interface with 64-bit C routines, and how to
+write position-independent code for shared libraries.
+
+All 64-bit code uses a flat memory model, since segmentation is not
+available in 64-bit mode.  The one exception is the \c{FS} and \c{GS}
+registers, which still add their bases.
+
+Position independence in 64-bit mode is significantly simpler, since
+the processor supports \c{RIP}-relative addressing directly; see the
+\c{REL} keyword (\k{effaddr}).
+
+64-bit programming is relatively similar to 32-bit programming, but
+of course pointers are 64 bits long; additionally, all existing
+platforms pass arguments in registers rather than on the stack.
+Furthermore, 64-bit platforms use SSE2 by default for floating point.
+Please see the ABI documentation for your platform.
+
+
 \C{trouble} Troubleshooting
 
 This chapter describes some of the common problems that users have
@@ -6394,12 +6448,12 @@ are on a Unix system.
 
 To disassemble a file, you will typically use a command of the form
 
-\c        ndisasm [-b16 | -b32] filename
+\c        ndisasm -b {16|32|64} filename
 
-NDISASM can disassemble 16-bit code or 32-bit code equally easily,
+NDISASM can disassemble 16-, 32- or 64-bit code equally easily,
 provided of course that you remember to specify which it is to work
-with. If no \i\c{-b} switch is present, NDISASM works in 16-bit mode by
-default. The \i\c{-u} switch (for USE32) also invokes 32-bit mode.
+with. If no \i\c{-b} switch is present, NDISASM works in 16-bit mode
+by default. The \i\c{-u} switch (for USE32) also invokes 32-bit mode.
 
 Two more command line options are \i\c{-r} which reports the version
 number of NDISASM you are running, and \i\c{-h} which gives a short
@@ -6541,8 +6595,8 @@ anyway.
 \H{ndisbugs} Bugs and Improvements
 
 There are no known bugs. However, any you find, with patches if
-possible, should be sent to \W{mailto:jules@dsf.org.uk}\c{jules@dsf.org.uk}
-or \W{mailto:anakin@pobox.com}\c{anakin@pobox.com}, or to the
+possible, should be sent to
+\W{mailto:nasm-bugs@lists.sourceforge.net}\c{nasm-bugs@lists.sourceforge.net}, or to the
 developer's site at
 \W{https://sourceforge.net/projects/nasm/}\c{https://sourceforge.net/projects/nasm/}
 and we'll try to fix them. Feel free to send contributions and
@@ -6562,6736 +6616,3 @@ I don't recommend taking NDISASM apart to see how an efficient
 disassembler works, because as far as I know, it isn't an efficient
 one anyway. You have been warned.
 
-
-\A{iref} x86 Instruction Reference
-
-This appendix provides a complete list of the machine instructions
-which NASM will assemble, and a short description of the function of
-each one.
-
-It is not intended to be an exhaustive documentation on the fine
-details of the instructions' function, such as which exceptions they
-can trigger: for such documentation, you should go to Intel's Web
-site, \W{http://developer.intel.com/design/Pentium4/manuals/}\c{http://developer.intel.com/design/Pentium4/manuals/}.
-
-Instead, this appendix is intended primarily to provide
-documentation on the way the instructions may be used within NASM.
-For example, looking up \c{LOOP} will tell you that NASM allows
-\c{CX} or \c{ECX} to be specified as an optional second argument to
-the \c{LOOP} instruction, to enforce which of the two possible
-counter registers should be used if the default is not the one
-desired.
-
-The instructions are not quite listed in alphabetical order, since
-groups of instructions with similar functions are lumped together in
-the same entry. Most of them don't move very far from their
-alphabetic position because of this.
-
-
-\H{iref-opr} Key to Operand Specifications
-
-The instruction descriptions in this appendix specify their operands
-using the following notation:
-
-\b Registers: \c{reg8} denotes an 8-bit \i{general purpose
-register}, \c{reg16} denotes a 16-bit general purpose register,
-\c{reg32} a 32-bit one and \c{reg64} a 64-bit one. \c{fpureg} denotes
-one of the eight FPU stack registers, \c{mmxreg} denotes one of the
-eight 64-bit MMX registers, and \c{segreg} denotes a segment register.
-\c{xmmreg} denotes one of the 8, or 16 in x64 long mode, SSE XMM registers.
-In addition, some registers (such as \c{AL}, \c{DX}, \c{ECX} or \c{RAX})
-may be specified explicitly.
-
-\b Immediate operands: \c{imm} denotes a generic \i{immediate operand}.
-\c{imm8}, \c{imm16} and \c{imm32} are used when the operand is
-intended to be a specific size. For some of these instructions, NASM
-needs an explicit specifier: for example, \c{ADD ESP,16} could be
-interpreted as either \c{ADD r/m32,imm32} or \c{ADD r/m32,imm8}.
-NASM chooses the former by default, and so you must specify \c{ADD
-ESP,BYTE 16} for the latter. There is a special case of the allowance
-of an \c{imm64} for particular x64 versions of the MOV instruction.
-
-\b Memory references: \c{mem} denotes a generic \i{memory reference};
-\c{mem8}, \c{mem16}, \c{mem32}, \c{mem64} and \c{mem80} are used
-when the operand needs to be a specific size. Again, a specifier is
-needed in some cases: \c{DEC [address]} is ambiguous and will be
-rejected by NASM. You must specify \c{DEC BYTE [address]}, \c{DEC
-WORD [address]} or \c{DEC DWORD [address]} instead.
-
-\b \i{Restricted memory references}: one form of the \c{MOV}
-instruction allows a memory address to be specified \e{without}
-allowing the normal range of register combinations and effective
-address processing. This is denoted by \c{memoffs8}, \c{memoffs16},
-\c{memoffs32} or \c{memoffs64}.
-
-\b Register or memory choices: many instructions can accept either a
-register \e{or} a memory reference as an operand. \c{r/m8} is
-shorthand for \c{reg8/mem8}; similarly \c{r/m16} and \c{r/m32}.
-In legacy x86 modes, \c{r/m64} is MMX-related, and is shorthand for
-\c{mmxreg/mem64}. When utilizing the x86-64 architecture extension,
-\c{r/m64} denotes use of a 64-bit GPR as well, and is shorthand for
-\c{reg64/mem64}.
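-
-As a minimal sketch of the size specifiers described above (assuming
-\c{BITS 32}; the label \c{counter} is hypothetical and used only for
-illustration), the following fragment shows how an otherwise
-ambiguous operand is pinned down:
-
-\c         BITS 32
-\c
-\c         add esp, byte 16        ; ADD r/m32,imm8:  83 C4 10
-\c         dec byte [counter]      ; operand is mem8
-\c         dec word [counter]      ; operand is mem16
-\c         dec dword [counter]     ; operand is mem32
-\c
-\c counter: dd 0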
-
-
-\H{iref-opc} Key to Opcode Descriptions
-
-This appendix also provides the opcodes which NASM will generate for
-each form of each instruction. The opcodes are listed in the
-following way:
-
-\b A hex number, such as \c{3F}, indicates a fixed byte containing
-that number.
-
-\b A hex number followed by \c{+r}, such as \c{C8+r}, indicates that
-one of the operands to the instruction is a register, and the
-`register value' of that register should be added to the hex number
-to produce the generated byte. For example, EDX has register value
-2, so the code \c{C8+r}, when the register operand is EDX, generates
-the hex byte \c{CA}. Register values for specific registers are
-given in \k{iref-rv}.
-
-\b A hex number followed by \c{+cc}, such as \c{40+cc}, indicates
-that the instruction name has a condition code suffix, and the
-numeric representation of the condition code should be added to the
-hex number to produce the generated byte. For example, the code
-\c{40+cc}, when the instruction contains the \c{NE} condition,
-generates the hex byte \c{45}. Condition codes and their numeric
-representations are given in \k{iref-cc}.
-
-\b A slash followed by a digit, such as \c{/2}, indicates that one
-of the operands to the instruction is a memory address or register
-(denoted \c{mem} or \c{r/m}, with an optional size). This is to be
-encoded as an effective address, with a \i{ModR/M byte}, an optional
-\i{SIB byte}, and an optional displacement, and the spare (register)
-field of the ModR/M byte should be the digit given (which will be
-from 0 to 7, so it fits in three bits). The encoding of effective
-addresses is given in \k{iref-ea}.
-
-\b The code \c{/r} combines the above two: it indicates that one of
-the operands is a memory address or \c{r/m}, and another is a
-register, and that an effective address should be generated with the
-spare (register) field in the ModR/M byte being equal to the
-`register value' of the register operand. The encoding of effective
-addresses is given in \k{iref-ea}; register values are given in
-\k{iref-rv}.
-
-\b The codes \c{ib}, \c{iw} and \c{id} indicate that one of the
-operands to the instruction is an immediate value, and that this is
-to be encoded as a byte, little-endian word or little-endian
-doubleword respectively.
-
-\b The codes \c{rb}, \c{rw} and \c{rd} indicate that one of the
-operands to the instruction is an immediate value, and that the
-\e{difference} between this value and the address of the end of the
-instruction is to be encoded as a byte, word or doubleword
-respectively. Where the form \c{rw/rd} appears, it indicates that
-either \c{rw} or \c{rd} should be used according to whether assembly
-is being performed in \c{BITS 16} or \c{BITS 32} state respectively.
-
-\b The codes \c{ow} and \c{od} indicate that one of the operands to
-the instruction is a reference to the contents of a memory address
-specified as an immediate value: this encoding is used in some forms
-of the \c{MOV} instruction in place of the standard
-effective-address mechanism. The displacement is encoded as a word
-or doubleword. Again, \c{ow/od} denotes that \c{ow} or \c{od} should
-be chosen according to the \c{BITS} setting.
-
-\b The codes \c{o16} and \c{o32} indicate that the given form of the
-instruction should be assembled with operand size 16 or 32 bits. In
-other words, \c{o16} indicates a \c{66} prefix in \c{BITS 32} state,
-but generates no code in \c{BITS 16} state; and \c{o32} indicates a
-\c{66} prefix in \c{BITS 16} state but generates nothing in \c{BITS
-32}.
-
-\b The codes \c{a16} and \c{a32}, similarly to \c{o16} and \c{o32},
-indicate the address size of the given form of the instruction.
-Where this does not match the \c{BITS} setting, a \c{67} prefix is
-required. Please note that \c{a16} is useless in long mode, as
-16-bit addressing is deprecated on the x86-64 architecture extension.
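-
-The following fragment is a hand-worked sketch of how these codes
-map onto generated bytes, assuming \c{BITS 32}. The instructions
-were picked only as illustrations; the opcode description of each
-one is quoted in its comment.
-
-\c         BITS 32
-\c
-\c         bswap edx               ; o32 0F C8+r, EDX = 2:   0F CA
-\c         cmovne eax, ebx         ; o32 0F 40+cc /r, NE = 5: 0F 45 C3
-\c         adc byte [ebx], 5       ; 80 /2 ib:               80 13 05
-\c         adc [ebx], cl           ; 10 /r:                  10 0B
-\c         adc ax, 0x1234          ; o16 15 iw:              66 15 34 12
-\c         jmp short $             ; EB rb:                  EB FE
-
-In the last line the target \c{$} is the start of the instruction,
-two bytes before its end, so the encoded difference is -2, or
-\c{FE}.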
-
-
-\S{iref-rv} Register Values
-
-Where an instruction requires a register value, it is already
-implicit in the encoding of the rest of the instruction what type of
-register is intended: an 8-bit general-purpose register, a segment
-register, a debug register, an MMX register, or whatever. Therefore
-there is no problem with registers of different types sharing an
-encoding value.
-
-Please note that for the register classes listed below, the register
-extensions (REX) classes require the use of the REX prefix, which
-is only available when in long mode on the x86-64 processor. This
-applies to every register numbered above 7, and also to \c{SPL},
-\c{BPL}, \c{SIL} and \c{DIL}.
-
-The encodings for the various classes of register are:
-
-\b 8-bit general registers: \c{AL} is 0, \c{CL} is 1, \c{DL} is 2,
-\c{BL} is 3, \c{AH} is 4, \c{CH} is 5, \c{DH} is 6 and \c{BH} is
-7. Please note that \c{AH}, \c{BH}, \c{CH} and \c{DH} are not
-addressable when using the REX prefix in long mode.
-
-\b 8-bit general register extensions (REX): \c{SPL} is 4, \c{BPL} is 5,
-\c{SIL} is 6, \c{DIL} is 7, \c{R8B} is 8, \c{R9B} is 9, \c{R10B} is 10,
-\c{R11B} is 11, \c{R12B} is 12, \c{R13B} is 13, \c{R14B} is 14 and
-\c{R15B} is 15.
-
-\b 16-bit general registers: \c{AX} is 0, \c{CX} is 1, \c{DX} is 2,
-\c{BX} is 3, \c{SP} is 4, \c{BP} is 5, \c{SI} is 6, and \c{DI} is 7.
-
-\b 16-bit general register extensions (REX): \c{R8W} is 8, \c{R9W} is 9,
-\c{R10W} is 10, \c{R11W} is 11, \c{R12W} is 12, \c{R13W} is 13, \c{R14W}
-is 14 and \c{R15W} is 15.
-
-\b 32-bit general registers: \c{EAX} is 0, \c{ECX} is 1, \c{EDX} is
-2, \c{EBX} is 3, \c{ESP} is 4, \c{EBP} is 5, \c{ESI} is 6, and
-\c{EDI} is 7.
-
-\b 32-bit general register extensions (REX): \c{R8D} is 8, \c{R9D} is 9,
-\c{R10D} is 10, \c{R11D} is 11, \c{R12D} is 12, \c{R13D} is 13, \c{R14D}
-is 14 and \c{R15D} is 15.
-
-\b 64-bit general register extensions (REX): \c{RAX} is 0, \c{RCX} is 1,
-\c{RDX} is 2, \c{RBX} is 3, \c{RSP} is 4, \c{RBP} is 5, \c{RSI} is 6,
-\c{RDI} is 7, \c{R8} is 8, \c{R9} is 9, \c{R10} is 10, \c{R11} is 11,
-\c{R12} is 12, \c{R13} is 13, \c{R14} is 14 and \c{R15} is 15.
-
-\b \i{Segment registers}: \c{ES} is 0, \c{CS} is 1, \c{SS} is 2, \c{DS}
-is 3, \c{FS} is 4, and \c{GS} is 5.
-
-\b \I{floating-point, registers}Floating-point registers: \c{ST0}
-is 0, \c{ST1} is 1, \c{ST2} is 2, \c{ST3} is 3, \c{ST4} is 4,
-\c{ST5} is 5, \c{ST6} is 6, and \c{ST7} is 7.
-
-\b 64-bit \i{MMX registers}: \c{MM0} is 0, \c{MM1} is 1, \c{MM2} is 2,
-\c{MM3} is 3, \c{MM4} is 4, \c{MM5} is 5, \c{MM6} is 6, and \c{MM7}
-is 7.
-
-\b 128-bit \i{XMM (SSE) registers}: \c{XMM0} is 0, \c{XMM1} is 1,
-\c{XMM2} is 2, \c{XMM3} is 3, \c{XMM4} is 4, \c{XMM5} is 5, \c{XMM6} is
-6 and \c{XMM7} is 7.
-
-\b 128-bit \i{XMM (SSE) register} extensions (REX): \c{XMM8} is 8,
-\c{XMM9} is 9, \c{XMM10} is 10, \c{XMM11} is 11, \c{XMM12} is 12,
-\c{XMM13} is 13, \c{XMM14} is 14 and \c{XMM15} is 15.
-
-\b \i{Control registers}: \c{CR0} is 0, \c{CR2} is 2, \c{CR3} is 3,
-and \c{CR4} is 4.
-
-\b \i{Control register} extensions: \c{CR8} is 8.
-
-\b \i{Debug registers}: \c{DR0} is 0, \c{DR1} is 1, \c{DR2} is 2,
-\c{DR3} is 3, \c{DR6} is 6, and \c{DR7} is 7.
-
-\b \i{Test registers}: \c{TR3} is 3, \c{TR4} is 4, \c{TR5} is 5,
-\c{TR6} is 6, and \c{TR7} is 7.
-
-(Note that wherever a register name contains a number, that number
-is also the register value for that register.)
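-
-As a sketch of how register values appear in generated code
-(assuming \c{BITS 32}, with the byte values worked out by hand from
-the list above):
-
-\c         BITS 32
-\c
-\c         push ecx                ; 50+r, ECX = 1:       51
-\c         push edi                ; 50+r, EDI = 7:       57
-\c         mov eax, cr3            ; 0F 20 /r, CR3 = 3:   0F 20 D8
-\c         mov eax, ebx            ; 89 /r, EBX = 3:      89 D8
-
-The last two instructions end in the same ModR/M byte, \c{D8}; only
-the opcode determines whether the value 3 in the register field
-refers to \c{CR3} or to \c{EBX}.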
-
-
-\S{iref-cc} \i{Condition Codes}
-
-The available condition codes are given here, along with their
-numeric representations as part of opcodes. Many of these condition
-codes have synonyms, so several will be listed at a time.
-
-In the following descriptions, the word `either', when applied to two
-possible trigger conditions, is used to mean `either or both'. If
-`either but not both' is meant, the phrase `exactly one of' is used.
-
-\b \c{O} is 0 (trigger if the overflow flag is set); \c{NO} is 1.
-
-\b \c{B}, \c{C} and \c{NAE} are 2 (trigger if the carry flag is
-set); \c{AE}, \c{NB} and \c{NC} are 3.
-
-\b \c{E} and \c{Z} are 4 (trigger if the zero flag is set); \c{NE}
-and \c{NZ} are 5.
-
-\b \c{BE} and \c{NA} are 6 (trigger if either of the carry or zero
-flags is set); \c{A} and \c{NBE} are 7.
-
-\b \c{S} is 8 (trigger if the sign flag is set); \c{NS} is 9.
-
-\b \c{P} and \c{PE} are 10 (trigger if the parity flag is set);
-\c{NP} and \c{PO} are 11.
-
-\b \c{L} and \c{NGE} are 12 (trigger if exactly one of the sign and
-overflow flags is set); \c{GE} and \c{NL} are 13.
-
-\b \c{LE} and \c{NG} are 14 (trigger if either the zero flag is set,
-or exactly one of the sign and overflow flags is set); \c{G} and
-\c{NLE} are 15.
-
-Note that in all cases, the sense of a condition code may be
-reversed by changing the low bit of the numeric representation.
-
-For details of when an instruction sets each of the status flags,
-see the individual instruction, plus the Status Flags reference
-in \k{iref-Flags}.
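-
-A minimal sketch, assuming \c{BITS 32} and a hypothetical label
-\c{target}, of how the condition code value is added into the
-opcode of the short conditional jump (\c{70+cc rb}):
-
-\c         BITS 32
-\c
-\c target:
-\c         jo  short target        ; cc = 0:   first byte 70
-\c         je  short target        ; cc = 4:   first byte 74
-\c         jz  short target        ; synonym:  first byte 74
-\c         jne short target        ; cc = 5:   first byte 75
-\c         jg  short target        ; cc = 15:  first byte 7F
-
-\c{JE} and \c{JNE} differ only in the low bit of the opcode, which
-is the reversal of sense described above.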
-
-
-\S{iref-SSE-cc} \i{SSE Condition Predicates}
-
-The condition predicates for SSE comparison instructions are the
-codes used as part of the opcode, to determine what form of
-comparison is being carried out. In each case, the imm8 value is
-the final byte of the opcode encoding, and the predicate is the
-code used as part of the mnemonic for the instruction (equivalent
-to the "cc" in an integer instruction that used a condition code).
-The instructions that use this will give details of what the various
-mnemonics are; this table is intended to help you work out the details
-of what is happening.
-
-\c Predi-  imm8  Description Relation where:   Emula- Result   QNaN
-\c  cate  Encod-             A Is 1st Operand  tion   if NaN   Signal
-\c         ing               B Is 2nd Operand         Operand  Invalid
-\c
-\c EQ     000B   equal       A = B                    False     No
-\c
-\c LT     001B   less-than   A < B                    False     Yes
-\c
-\c LE     010B   less-than-  A <= B                   False     Yes
-\c                or-equal
-\c
-\c ---    ----   greater     A > B             Swap   False     Yes
-\c               than                          Operands,
-\c                                             Use LT
-\c
-\c ---    ----   greater-    A >= B            Swap   False     Yes
-\c               than-or-equal                 Operands,
-\c                                             Use LE
-\c
-\c UNORD  011B   unordered   A, B = Unordered         True      No
-\c
-\c NEQ    100B   not-equal   A != B                   True      No
-\c
-\c NLT    101B   not-less-   NOT(A < B)               True      Yes
-\c               than
-\c
-\c NLE    110B   not-less-   NOT(A <= B)              True      Yes
-\c               than-or-
-\c               equal
-\c
-\c ---    ----   not-greater NOT(A > B)        Swap   True      Yes
-\c               than                          Operands,
-\c                                             Use NLT
-\c
-\c ---    ----   not-greater NOT(A >= B)       Swap   True      Yes
-\c               than-                         Operands,
-\c               or-equal                      Use NLE
-\c
-\c ORD    111B   ordered     A, B = Ordered           False     No
-
-The unordered relationship is true when at least one of the two
-values being compared is a NaN or in an unsupported format.
-
-Note that the comparisons which are listed as not having a predicate
-or encoding can only be achieved through software emulation, as
-described in the "emulation" column. Note in particular that a
-comparison such as \c{greater-than} is not the same as \c{NLE}, as,
-unlike with the \c{CMP} instruction, it has to take into account the
-possibility of one operand containing a NaN or an unsupported numeric
-format.
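-
-As a sketch (assuming \c{BITS 32}, with \c{A} held in \c{XMM0} and
-\c{B} in \c{XMM1}; the register assignment is purely illustrative
-and each line is meant to be read on its own), the predicate can be
-given either as part of the mnemonic or as an explicit \c{imm8}, and
-a greater-than test is obtained by swapping the operands:
-
-\c         BITS 32
-\c
-\c         cmpltps xmm0, xmm1      ; A < B, predicate LT
-\c         cmpps   xmm0, xmm1, 1   ; the same comparison, imm8 = 001B
-\c
-\c         cmpltps xmm1, xmm0      ; B < A, i.e. A > B; result in XMM1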
-
-
-\S{iref-Flags} \i{Status Flags}
-
-The status flags provide some information about the result of the
-arithmetic instructions. This information can be used by conditional
-instructions (such as \c{Jcc} and \c{CMOVcc}) as well as by some of
-the other instructions (such as \c{ADC} and \c{INTO}).
-
-There are 6 status flags:
-
-\c CF - Carry flag.
-
-Set if an arithmetic operation generates a
-carry or a borrow out of the most-significant bit of the result;
-cleared otherwise. This flag indicates an overflow condition for
-unsigned-integer arithmetic. It is also used in multiple-precision
-arithmetic.
-
-\c PF - Parity flag.
-
-Set if the least-significant byte of the result contains an even
-number of 1 bits; cleared otherwise.
-
-\c AF - Adjust flag.
-
-Set if an arithmetic operation generates a carry or a borrow
-out of bit 3 of the result; cleared otherwise. This flag is used
-in binary-coded decimal (BCD) arithmetic.
-
-\c ZF - Zero flag.
-
-Set if the result is zero; cleared otherwise.
-
-\c SF - Sign flag.
-
-Set equal to the most-significant bit of the result, which is the
-sign bit of a signed integer. (0 indicates a positive value and 1
-indicates a negative value.)
-
-\c OF - Overflow flag.
-
-Set if the integer result is too large a positive number or too
-small a negative number (excluding the sign-bit) to fit in the
-destination operand; cleared otherwise. This flag indicates an
-overflow condition for signed-integer (two's complement) arithmetic.
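-
-Two worked cases, as a sketch (assuming \c{BITS 32}; the flag values
-in the comments are derived by hand from the definitions above):
-
-\c         BITS 32
-\c
-\c         mov al, 0x7F
-\c         add al, 1               ; AL = 0x80
-\c                                 ; CF=0 PF=0 AF=1 ZF=0 SF=1 OF=1
-\c
-\c         mov al, 0xFF
-\c         add al, 1               ; AL = 0x00
-\c                                 ; CF=1 PF=1 AF=1 ZF=1 SF=0 OF=0
-
-The first addition overflows as a signed operation but not as an
-unsigned one; the second does the opposite.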
-
-
-\S{iref-ea} Effective Address Encoding: \i{ModR/M} and \i{SIB}
-
-An \i{effective address} is encoded in up to three parts: a ModR/M
-byte, an optional SIB byte, and an optional byte, word or doubleword
-displacement field.
-
-The ModR/M byte consists of three fields: the \c{mod} field, ranging
-from 0 to 3, in the upper two bits of the byte, the \c{r/m} field,
-ranging from 0 to 7, in the lower three bits, and the spare
-(register) field in the middle (bit 3 to bit 5). The spare field is
-not relevant to the effective address being encoded, and either
-contains an extension to the instruction opcode or the register
-value of another operand.
-
-The ModR/M system can be used to encode a direct register reference
-rather than a memory access. This is always done by setting the
-\c{mod} field to 3 and the \c{r/m} field to the register value of
-the register in question (it must be a general-purpose register, and
-the size of the register must already be implicit in the encoding of
-the rest of the instruction). In this case, the SIB byte and
-displacement field are both absent.
-
-In 16-bit addressing mode (either \c{BITS 16} with no \c{67} prefix,
-or \c{BITS 32} with a \c{67} prefix), the SIB byte is never used.
-The general rules for \c{mod} and \c{r/m} (there is an exception,
-given below) are:
-
-\b The \c{mod} field gives the length of the displacement field: 0
-means no displacement, 1 means one byte, and 2 means two bytes.
-
-\b The \c{r/m} field encodes the combination of registers to be
-added to the displacement to give the accessed address: 0 means
-\c{BX+SI}, 1 means \c{BX+DI}, 2 means \c{BP+SI}, 3 means \c{BP+DI},
-4 means \c{SI} only, 5 means \c{DI} only, 6 means \c{BP} only, and 7
-means \c{BX} only.
-
-However, there is a special case:
-
-\b If \c{mod} is 0 and \c{r/m} is 6, the effective address encoded
-is not \c{[BP]} as the above rules would suggest, but instead
-\c{[disp16]}: the displacement field is present and is two bytes
-long, and no registers are added to the displacement.
-
-Therefore the effective address \c{[BP]} cannot be encoded as
-efficiently as \c{[BX]}; so if you code \c{[BP]} in a program, NASM
-adds a notional 8-bit zero displacement, and sets \c{mod} to 1,
-\c{r/m} to 6, and the one-byte displacement field to 0.
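-
-The 16-bit rules can be sketched as follows, assuming \c{BITS 16}
-and the \c{MOV reg16,r/m16} form (\c{8B /r}); the byte values in the
-comments are worked out by hand from the rules above:
-
-\c         BITS 16
-\c
-\c         mov ax, [bx+si]         ; mod=0 r/m=0:          8B 00
-\c         mov ax, [bx]            ; mod=0 r/m=7:          8B 07
-\c         mov ax, [si+4]          ; mod=1 r/m=4, disp8:   8B 44 04
-\c         mov ax, [bp]            ; mod=1 r/m=6, disp8=0: 8B 46 00
-\c         mov cx, [0x1234]        ; mod=0 r/m=6, disp16:  8B 0E 34 12
-
-\c{CX} is used in the last line because a load into \c{AX} from an
-absolute address would be assembled with the shorter \c{memoffs}
-form of \c{MOV}, which has no ModR/M byte.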
-
-In 32-bit addressing mode (either \c{BITS 16} with a \c{67} prefix,
-or \c{BITS 32} with no \c{67} prefix) the general rules (again,
-there are exceptions) for \c{mod} and \c{r/m} are:
-
-\b The \c{mod} field gives the length of the displacement field: 0
-means no displacement, 1 means one byte, and 2 means four bytes.
-
-\b If only one register is to be added to the displacement, and it
-is not \c{ESP}, the \c{r/m} field gives its register value, and the
-SIB byte is absent. If the \c{r/m} field is 4 (which would encode
-\c{ESP}), the SIB byte is present and gives the combination and
-scaling of registers to be added to the displacement.
-
-If the SIB byte is present, it describes the combination of
-registers (an optional base register, and an optional index register
-scaled by multiplication by 1, 2, 4 or 8) to be added to the
-displacement. The SIB byte is divided into the \c{scale} field, in
-the top two bits, the \c{index} field in the next three, and the
-\c{base} field in the bottom three. The general rules are:
-
-\b The \c{base} field encodes the register value of the base
-register.
-
-\b The \c{index} field encodes the register value of the index
-register, unless it is 4, in which case no index register is used
-(so \c{ESP} cannot be used as an index register).
-
-\b The \c{scale} field encodes the multiplier by which the index
-register is scaled before adding it to the base and displacement: 0
-encodes a multiplier of 1, 1 encodes 2, 2 encodes 4 and 3 encodes 8.
-
-The exceptions to the 32-bit encoding rules are:
-
-\b If \c{mod} is 0 and \c{r/m} is 5, the effective address encoded
-is not \c{[EBP]} as the above rules would suggest, but instead
-\c{[disp32]}: the displacement field is present and is four bytes
-long, and no registers are added to the displacement.
-
-\b If \c{mod} is 0, \c{r/m} is 4 (meaning the SIB byte is present)
-and \c{base} is 5, the effective address encoded is not
-\c{[EBP+index]} as the above rules would suggest, but instead
-\c{[disp32+index]}: the displacement field is present and is four
-bytes long, and there is no base register (but the index register is
-still processed in the normal way).
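-
-The 32-bit rules and their exceptions can be sketched as follows,
-assuming \c{BITS 32} and the \c{MOV reg32,r/m32} form (\c{8B /r});
-the byte values in the comments are worked out by hand:
-
-\c         BITS 32
-\c
-\c         mov eax, [ebx]          ; mod=0 r/m=3:           8B 03
-\c         mov eax, [esp]          ; r/m=4, SIB=24:         8B 04 24
-\c         mov eax, [ebx+esi*4]    ; r/m=4, SIB=B3:         8B 04 B3
-\c         mov eax, [ebp]          ; mod=1 r/m=5, disp8=0:  8B 45 00
-\c         mov ecx, [0x12345678]   ; mod=0 r/m=5, disp32:   8B 0D 78 56 34 12
-\c         mov eax, [esi*4+0x10]   ; mod=0 SIB=B5, disp32:  8B 04 B5 10 00 00 00
-
-The SIB byte \c{B3} breaks down as \c{scale} 2 (multiplier 4),
-\c{index} 6 (\c{ESI}) and \c{base} 3 (\c{EBX}); \c{24} is \c{scale}
-0, \c{index} 4 (none) and \c{base} 4 (\c{ESP}).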
-
-
-\S{iref-rex} Register Extensions: The \i{REX} Prefix
-
-The Register Extensions prefix, or \i{REX} for short, is the means
-of accessing extended registers on the x86-64 architecture. \i{REX}
-is considered an instruction prefix, but is required to be after
-all other prefixes and thus immediately before the first instruction
-opcode itself. So overall, \i{REX} can be thought of as an "Opcode
-Prefix" instead. The \i{REX} prefix itself is indicated by a value
-of 0x4X, where X is one of 16 different combinations of the actual
-\i{REX} flags.
-
-The \i{REX} prefix flags consist of four 1-bit extension fields.
-These flags are found in the lower nibble of the actual \i{REX}
-prefix opcode. Below is the list of \i{REX} prefix flags, from
-high bit to low bit.
-
-\c{REX.W}: When set, this flag indicates the use of a 64-bit operand,
-as opposed to the default of using 32-bit operands as found in 32-bit
-Protected Mode.
-
-\c{REX.R}: When set, this flag extends the \c{reg (spare)} field of
-the \c{ModRM} byte. Overall, this raises the number of addressable
-registers in this field from 8 to 16.
-
-\c{REX.X}: When set, this flag extends the \c{index} field of the
-\c{SIB} byte. Overall, this raises the number of addressable
-registers in this field from 8 to 16.
-
-\c{REX.B}: When set, this flag extends the \c{r/m} field of the
-\c{ModRM} byte. This flag can also represent an extension to the
-\c{base} field of the \c{SIB} byte or to the opcode register
-\c{(+r)} field. The determination of which is used varies depending
-on which instruction is used. Overall, this raises the number of
-addressable registers in these fields from 8 to 16.
-
-Internal use of the \i{REX} prefix by the processor is consistent,
-yet non-trivial. Most instructions use the \i{REX} prefix as
-indicated by the above flags. Some instructions require the \i{REX}
-prefix to be present even if the flags are empty. Some instructions
-default to a 64-bit operand and require the \i{REX} prefix only for
-actual register extensions, and thus ignore the \c{REX.W} field
-completely.
-
-At any rate, NASM is designed to handle, and fully supports, the
-\i{REX} prefix internally. Please read the appropriate processor
-documentation for further information on the \i{REX} prefix.
-
-You may have noticed that opcodes 0x40 through 0x4F are actually
-opcodes for the INC/DEC instructions for each General Purpose
-Register. This is, of course, correct... for legacy x86. While
-in long mode, opcodes 0x40 through 0x4F are reserved for use as
-the REX prefix. The other opcode forms of the INC/DEC instructions
-are used instead.
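-
-As a sketch of the prefix in use (assuming \c{BITS 64}; the byte
-values in the comments are worked out by hand from the flag layout
-described above, with \c{W}, \c{R}, \c{X} and \c{B} occupying bits 3
-to 0 of the prefix):
-
-\c         BITS 64
-\c
-\c         mov rax, rbx            ; REX.W:         48 89 D8
-\c         mov rax, r8             ; REX.W+R:       4C 89 C0
-\c         mov r8, rax             ; REX.W+B:       49 89 C0
-\c         mov rax, [rbx+r9*8]     ; REX.W+X:       4A 8B 04 CB
-\c         mov sil, al             ; REX, no flags: 40 88 C6
-\c         push r8                 ; REX.B only:    41 50
-\c         inc eax                 ; not 40:        FF C0
-
-\c{MOV SIL,AL} shows a prefix that must be present with all flags
-clear, and \c{PUSH R8} an instruction that defaults to a 64-bit
-operand and so needs no \c{REX.W}.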
-
-
-\H{iref-flg} Key to Instruction Flags
-
-Given along with each instruction in this appendix is a set of
-flags, denoting the type of the instruction. The types are as follows:
-
-\b \c{8086}, \c{186}, \c{286}, \c{386}, \c{486}, \c{PENT} and \c{P6}
-denote the lowest processor type that supports the instruction. Most
-instructions run on all processors above the given type; those that
-do not are documented. The Pentium II contains no additional
-instructions beyond the P6 (Pentium Pro); from the point of view of
-its instruction set, it can be thought of as a P6 with MMX
-capability.
-
-\b \c{3DNOW} indicates that the instruction is a 3DNow! one, and will
-run on the AMD K6-2 and later processors. ATHLON extensions to the
-3DNow! instruction set are documented as such.
-
-\b \c{CYRIX} indicates that the instruction is specific to Cyrix
-processors, for example the extra MMX instructions in the Cyrix
-extended MMX instruction set.
-
-\b \c{FPU} indicates that the instruction is a floating-point one,
-and will only run on machines with a coprocessor (automatically
-including 486DX, Pentium and above).
-
-\b \c{KATMAI} indicates that the instruction was introduced as part
-of the Katmai New Instruction set. These instructions are available
-on the Pentium III and later processors. Those which are not
-specifically SSE instructions are also available on the AMD Athlon.
-
-\b \c{MMX} indicates that the instruction is an MMX one, and will
-run on MMX-capable Pentium processors and the Pentium II.
-
-\b \c{PRIV} indicates that the instruction is a protected-mode
-management instruction. Many of these may only be used in protected
-mode, or only at privilege level zero.
-
-\b \c{SSE} and \c{SSE2} indicate that the instruction is a Streaming
-SIMD Extension instruction. These instructions operate on multiple
-values in a single operation. SSE was introduced with the Pentium III
-and SSE2 was introduced with the Pentium 4.
-
-\b \c{UNDOC} indicates that the instruction is an undocumented one,
-and not part of the official Intel Architecture; it may or may not
-be supported on any given machine.
-
-\b \c{WILLAMETTE} indicates that the instruction was introduced as
-part of the new instruction set in the Pentium 4 and Intel Xeon
-processors. These instructions are also known as SSE2 instructions.
-
-\b \c{X64} indicates that the instruction was introduced as part of
-the new instruction set in the x86-64 architecture extension,
-commonly referred to as x64, AMD64 or EM64T.
-
-
-\H{iref-inst} x86 Instruction Set
-
-
-\S{insAAA} \i\c{AAA}, \i\c{AAS}, \i\c{AAM}, \i\c{AAD}: ASCII
-Adjustments
-
-\c AAA                           ; 37                   [8086]
-
-\c AAS                           ; 3F                   [8086]
-
-\c AAD                           ; D5 0A                [8086]
-\c AAD imm                       ; D5 ib                [8086]
-
-\c AAM                           ; D4 0A                [8086]
-\c AAM imm                       ; D4 ib                [8086]
-
-These instructions are used in conjunction with the add, subtract,
-multiply and divide instructions to perform binary-coded decimal
-arithmetic in \e{unpacked} (one BCD digit per byte - easy to
-translate to and from \c{ASCII}, hence the instruction names) form.
-There are also packed BCD instructions \c{DAA} and \c{DAS}: see
-\k{insDAA}.
-
-\b \c{AAA} (ASCII Adjust After Addition) should be used after a
-one-byte \c{ADD} instruction whose destination was the \c{AL}
-register: by means of examining the value in the low nibble of
-\c{AL} and also the auxiliary carry flag \c{AF}, it determines
-whether the addition has overflowed, and adjusts it (and sets
-the carry flag) if so. You can add long BCD strings together
-by doing \c{ADD}/\c{AAA} on the low digits, then doing
-\c{ADC}/\c{AAA} on each subsequent digit.
-
-\b \c{AAS} (ASCII Adjust AL After Subtraction) works similarly to
-\c{AAA}, but is for use after \c{SUB} instructions rather than
-\c{ADD}.
-
-\b \c{AAM} (ASCII Adjust AX After Multiply) is for use after you
-have multiplied two decimal digits together and left the result
-in \c{AL}: it divides \c{AL} by ten and stores the quotient in
-\c{AH}, leaving the remainder in \c{AL}. The divisor 10 can be
-changed by specifying an operand to the instruction: a particularly
-handy use of this is \c{AAM 16}, causing the two nibbles in \c{AL}
-to be separated into \c{AH} and \c{AL}.
-
-\b \c{AAD} (ASCII Adjust AX Before Division) performs the inverse
-operation to \c{AAM}: it multiplies \c{AH} by ten, adds it to
-\c{AL}, and sets \c{AH} to zero. Again, the multiplier 10 can
-be changed.
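-
-A short worked sketch of the four instructions, assuming \c{BITS 16}
-(the register contents in the comments are derived by hand from the
-descriptions above):
-
-\c         BITS 16
-\c
-\c         mov ax, 0x0009          ; unpacked BCD 9
-\c         add al, 0x08            ; AL = 0x11, AF set
-\c         aaa                     ; AX = 0x0107: digits 1 and 7
-\c
-\c         mov al, 63
-\c         aam                     ; AH = 6, AL = 3
-\c         aad                     ; AL = 63, AH = 0
-\c
-\c         mov al, 0xA7
-\c         aam 16                  ; AH = 0x0A, AL = 0x07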
-
-
-\S{insADC} \i\c{ADC}: Add with Carry
-
-\c ADC r/m8,reg8                 ; 10 /r                [8086]
-\c ADC r/m16,reg16               ; o16 11 /r            [8086]
-\c ADC r/m32,reg32               ; o32 11 /r            [386]
-
-\c ADC reg8,r/m8                 ; 12 /r                [8086]
-\c ADC reg16,r/m16               ; o16 13 /r            [8086]
-\c ADC reg32,r/m32               ; o32 13 /r            [386]
-
-\c ADC r/m8,imm8                 ; 80 /2 ib             [8086]
-\c ADC r/m16,imm16               ; o16 81 /2 iw         [8086]
-\c ADC r/m32,imm32               ; o32 81 /2 id         [386]
-
-\c ADC r/m16,imm8                ; o16 83 /2 ib         [8086]
-\c ADC r/m32,imm8                ; o32 83 /2 ib         [386]
-
-\c ADC AL,imm8                   ; 14 ib                [8086]
-\c ADC AX,imm16                  ; o16 15 iw            [8086]
-\c ADC EAX,imm32                 ; o32 15 id            [386]
-
-\c{ADC} performs integer addition: it adds its two operands
-together, plus the value of the carry flag, and leaves the result in
-its destination (first) operand. The destination operand can be a
-register or a memory location. The source operand can be a register,
-a memory location or an immediate value.
-
-The flags are set according to the result of the operation: in
-particular, the carry flag is affected and can be used by a
-subsequent \c{ADC} instruction.
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-To add two numbers without also adding the contents of the carry
-flag, use \c{ADD} (\k{insADD}).
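-
-A common use, sketched here under the assumption of \c{BITS 32} with
-one 64-bit value in \c{EDX:EAX} and another in \c{EBX:ECX} (the
-register assignment is illustrative only), is multi-word addition:
-
-\c         BITS 32
-\c
-\c         add eax, ecx            ; low halves, sets CF:   01 C8
-\c         adc edx, ebx            ; high halves plus CF:   11 DA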
-
-
-\S{insADD} \i\c{ADD}: Add Integers
-
-\c ADD r/m8,reg8                 ; 00 /r                [8086]
-\c ADD r/m16,reg16               ; o16 01 /r            [8086]
-\c ADD r/m32,reg32               ; o32 01 /r            [386]
-
-\c ADD reg8,r/m8                 ; 02 /r                [8086]
-\c ADD reg16,r/m16               ; o16 03 /r            [8086]
-\c ADD reg32,r/m32               ; o32 03 /r            [386]
-
-\c ADD r/m8,imm8                 ; 80 /0 ib             [8086]
-\c ADD r/m16,imm16               ; o16 81 /0 iw         [8086]
-\c ADD r/m32,imm32               ; o32 81 /0 id         [386]
-
-\c ADD r/m16,imm8                ; o16 83 /0 ib         [8086]
-\c ADD r/m32,imm8                ; o32 83 /0 ib         [386]
-
-\c ADD AL,imm8                   ; 04 ib                [8086]
-\c ADD AX,imm16                  ; o16 05 iw            [8086]
-\c ADD EAX,imm32                 ; o32 05 id            [386]
-
-\c{ADD} performs integer addition: it adds its two operands
-together, and leaves the result in its destination (first) operand.
-The destination operand can be a register or a memory location.
-The source operand can be a register, a memory location or an
-immediate value.
-
-The flags are set according to the result of the operation: in
-particular, the carry flag is affected and can be used by a
-subsequent \c{ADC} instruction.
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-
-\S{insADDPD} \i\c{ADDPD}: ADD Packed Double-Precision FP Values
-
-\c ADDPD xmm1,xmm2/mem128        ; 66 0F 58 /r     [WILLAMETTE,SSE2]
-
-\c{ADDPD} performs addition on each of two packed double-precision
-FP value pairs.
-
-\c    dst[0-63]   := dst[0-63]   + src[0-63],
-\c    dst[64-127] := dst[64-127] + src[64-127].
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-
-\S{insADDPS} \i\c{ADDPS}: ADD Packed Single-Precision FP Values
-
-\c ADDPS xmm1,xmm2/mem128        ; 0F 58 /r        [KATMAI,SSE]
-
-\c{ADDPS} performs addition on each of four packed single-precision
-FP value pairs.
-
-\c    dst[0-31]   := dst[0-31]   + src[0-31],
-\c    dst[32-63]  := dst[32-63]  + src[32-63],
-\c    dst[64-95]  := dst[64-95]  + src[64-95],
-\c    dst[96-127] := dst[96-127] + src[96-127].
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-
-\S{insADDSD} \i\c{ADDSD}: ADD Scalar Double-Precision FP Values
-
-\c ADDSD xmm1,xmm2/mem64         ; F2 0F 58 /r     [WILLAMETTE,SSE2]
-
-\c{ADDSD} adds the low double-precision FP values from the source
-and destination operands and stores the double-precision FP result
-in the destination operand.
-
-\c    dst[0-63]   := dst[0-63] + src[0-63],
-\c    dst[64-127] remains unchanged.
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 64-bit memory location.
-
-
-\S{insADDSS} \i\c{ADDSS}: ADD Scalar Single-Precision FP Values
-
-\c ADDSS xmm1,xmm2/mem32         ; F3 0F 58 /r     [KATMAI,SSE]
-
-\c{ADDSS} adds the low single-precision FP values from the source
-and destination operands and stores the single-precision FP result
-in the destination operand.
-
-\c    dst[0-31]   := dst[0-31] + src[0-31],
-\c    dst[32-127] remains unchanged.
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 32-bit memory location.
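-
-As a sketch (assuming \c{BITS 32}), the four SSE addition
-instructions \c{ADDPS}, \c{ADDPD}, \c{ADDSS} and \c{ADDSD} share the
-opcode \c{0F 58 /r} and differ only in their prefix byte:
-
-\c         BITS 32
-\c
-\c         addps xmm0, xmm1        ; 0F 58 C1
-\c         addpd xmm0, xmm1        ; 66 0F 58 C1
-\c         addss xmm0, xmm1        ; F3 0F 58 C1
-\c         addsd xmm0, xmm1        ; F2 0F 58 C1
-\c         addss xmm0, [esi]       ; F3 0F 58 06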
-
-
-\S{insAND} \i\c{AND}: Bitwise AND
-
-\c AND r/m8,reg8                 ; 20 /r                [8086]
-\c AND r/m16,reg16               ; o16 21 /r            [8086]
-\c AND r/m32,reg32               ; o32 21 /r            [386]
-
-\c AND reg8,r/m8                 ; 22 /r                [8086]
-\c AND reg16,r/m16               ; o16 23 /r            [8086]
-\c AND reg32,r/m32               ; o32 23 /r            [386]
-
-\c AND r/m8,imm8                 ; 80 /4 ib             [8086]
-\c AND r/m16,imm16               ; o16 81 /4 iw         [8086]
-\c AND r/m32,imm32               ; o32 81 /4 id         [386]
-
-\c AND r/m16,imm8                ; o16 83 /4 ib         [8086]
-\c AND r/m32,imm8                ; o32 83 /4 ib         [386]
-
-\c AND AL,imm8                   ; 24 ib                [8086]
-\c AND AX,imm16                  ; o16 25 iw            [8086]
-\c AND EAX,imm32                 ; o32 25 id            [386]
-
-\c{AND} performs a bitwise AND operation between its two operands
-(i.e. each bit of the result is 1 if and only if the corresponding
-bits of the two inputs were both 1), and stores the result in the
-destination (first) operand. The destination operand can be a
-register or a memory location. The source operand can be a register,
-a memory location or an immediate value.
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
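-
-As a brief illustration of the sign-extended form, here is a sketch
-for 32-bit code; the register and the mask value are arbitrary
-choices made for this example:
-
-\c         and     eax,byte -16    ; 83 /4 ib: the imm8 0xF0 is sign-extended
-\c                                 ; to 0xFFFFFFF0, clearing the low 4 bits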
-
-The \c{MMX} instruction \c{PAND} (see \k{insPAND}) performs the same
-operation on the 64-bit \c{MMX} registers.
-
-
-\S{insANDNPD} \i\c{ANDNPD}: Bitwise Logical AND NOT of
-Packed Double-Precision FP Values
-
-\c ANDNPD xmm1,xmm2/mem128       ; 66 0F 55 /r     [WILLAMETTE,SSE2]
-
-\c{ANDNPD} inverts the bits of the two double-precision
-floating-point values in the destination register, and then
-performs a logical AND between the two double-precision
-floating-point values in the source operand and the temporary
-inverted result, storing the result in the destination register.
-
-\c    dst[0-63]   := src[0-63]   AND NOT dst[0-63],
-\c    dst[64-127] := src[64-127] AND NOT dst[64-127].
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-
-\S{insANDNPS} \i\c{ANDNPS}: Bitwise Logical AND NOT of
-Packed Single-Precision FP Values
-
-\c ANDNPS xmm1,xmm2/mem128       ; 0F 55 /r        [KATMAI,SSE]
-
-\c{ANDNPS} inverts the bits of the four single-precision
-floating-point values in the destination register, and then
-performs a logical AND between the four single-precision
-floating-point values in the source operand and the temporary
-inverted result, storing the result in the destination register.
-
-\c    dst[0-31]   := src[0-31]   AND NOT dst[0-31],
-\c    dst[32-63]  := src[32-63]  AND NOT dst[32-63],
-\c    dst[64-95]  := src[64-95]  AND NOT dst[64-95],
-\c    dst[96-127] := src[96-127] AND NOT dst[96-127].
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-
-\S{insANDPD} \i\c{ANDPD}: Bitwise Logical AND For Double FP
-
-\c ANDPD xmm1,xmm2/mem128        ; 66 0F 54 /r     [WILLAMETTE,SSE2]
-
-\c{ANDPD} performs a bitwise logical AND of the two double-precision
-floating point values in the source and destination operand, and
-stores the result in the destination register.
-
-\c    dst[0-63]   := src[0-63]   AND dst[0-63],
-\c    dst[64-127] := src[64-127] AND dst[64-127].
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-
-\S{insANDPS} \i\c{ANDPS}: Bitwise Logical AND For Single FP
-
-\c ANDPS xmm1,xmm2/mem128        ; 0F 54 /r        [KATMAI,SSE]
-
-\c{ANDPS} performs a bitwise logical AND of the four single-precision
-floating point values in the source and destination operand, and
-stores the result in the destination register.
-
-\c    dst[0-31]   := src[0-31]   AND dst[0-31],
-\c    dst[32-63]  := src[32-63]  AND dst[32-63],
-\c    dst[64-95]  := src[64-95]  AND dst[64-95],
-\c    dst[96-127] := src[96-127] AND dst[96-127].
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-
-\S{insARPL} \i\c{ARPL}: Adjust RPL Field of Selector
-
-\c ARPL r/m16,reg16              ; 63 /r                [286,PRIV]
-
-\c{ARPL} expects its two word operands to be segment selectors. It
-adjusts the \i\c{RPL} (requested privilege level - stored in the bottom
-two bits of the selector) field of the destination (first) operand
-to ensure that it is no less than (i.e. no more privileged than) the \c{RPL}
-field of the source operand. The zero flag is set if and only if a
-change had to be made.
-
-
-\S{insBOUND} \i\c{BOUND}: Check Array Index against Bounds
-
-\c BOUND reg16,mem               ; o16 62 /r            [186]
-\c BOUND reg32,mem               ; o32 62 /r            [386]
-
-\c{BOUND} expects its second operand to point to an area of memory
-containing two signed values of the same size as its first operand
-(i.e. two words for the 16-bit form; two doublewords for the 32-bit
-form). It performs two signed comparisons: if the value in the
-register passed as its first operand is less than the first of the
-in-memory values, or is greater than the second, it
-throws a \c{BR} exception. Otherwise, it does nothing.
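-
-A small sketch of how the pair of bounds might be laid out and
-checked in 32-bit code. The label \c{limits} and the values are made
-up for this example:
-
-\c limits: dd      0               ; lower bound (signed doubleword)
-\c         dd      99              ; upper bound (signed doubleword)
-
-\c         mov     eax,42          ; index to check
-\c         bound   eax,[limits]    ; 42 lies between 0 and 99: no exception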
-
-
-\S{insBSF} \i\c{BSF}, \i\c{BSR}: Bit Scan
-
-\c BSF reg16,r/m16               ; o16 0F BC /r         [386]
-\c BSF reg32,r/m32               ; o32 0F BC /r         [386]
-
-\c BSR reg16,r/m16               ; o16 0F BD /r         [386]
-\c BSR reg32,r/m32               ; o32 0F BD /r         [386]
-
-\b \c{BSF} searches for the least significant set bit in its source
-(second) operand, and if it finds one, stores the index in
-its destination (first) operand. If no set bit is found, the
-contents of the destination operand are undefined. If the source
-operand is zero, the zero flag is set.
-
-\b \c{BSR} performs the same function, but searches from the top
-instead, so it finds the most significant set bit.
-
-Bit indices are from 0 (least significant) to 15 or 31 (most
-significant). The destination operand can only be a register.
-The source operand can be a register or a memory location.
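-
-For instance, a sketch of guarding against the undefined destination
-when the source is zero. The label \c{.none} is hypothetical:
-
-\c         mov     eax,0x00000028  ; bits 3 and 5 set
-\c         bsf     ecx,eax         ; ECX := 3 (lowest set bit)
-\c         jz      .none           ; taken only if EAX had been zero
-\c         bsr     edx,eax         ; EDX := 5 (highest set bit)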
-
-
-\S{insBSWAP} \i\c{BSWAP}: Byte Swap
-
-\c BSWAP reg32                   ; o32 0F C8+r          [486]
-
-\c{BSWAP} swaps the order of the four bytes of a 32-bit register:
-bits 0-7 exchange places with bits 24-31, and bits 8-15 swap with
-bits 16-23. There is no explicit 16-bit equivalent: to byte-swap
-\c{AX}, \c{BX}, \c{CX} or \c{DX}, \c{XCHG} can be used. When \c{BSWAP}
-is used with a 16-bit register, the result is undefined.
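-
-The two cases might be sketched as follows; the constants are
-arbitrary example values:
-
-\c         mov     eax,0x12345678
-\c         bswap   eax             ; EAX = 0x78563412
-\c         mov     ax,0x1234
-\c         xchg    al,ah           ; AX = 0x3412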
-
-
-\S{insBT} \i\c{BT}, \i\c{BTC}, \i\c{BTR}, \i\c{BTS}: Bit Test
-
-\c BT r/m16,reg16                ; o16 0F A3 /r         [386]
-\c BT r/m32,reg32                ; o32 0F A3 /r         [386]
-\c BT r/m16,imm8                 ; o16 0F BA /4 ib      [386]
-\c BT r/m32,imm8                 ; o32 0F BA /4 ib      [386]
-
-\c BTC r/m16,reg16               ; o16 0F BB /r         [386]
-\c BTC r/m32,reg32               ; o32 0F BB /r         [386]
-\c BTC r/m16,imm8                ; o16 0F BA /7 ib      [386]
-\c BTC r/m32,imm8                ; o32 0F BA /7 ib      [386]
-
-\c BTR r/m16,reg16               ; o16 0F B3 /r         [386]
-\c BTR r/m32,reg32               ; o32 0F B3 /r         [386]
-\c BTR r/m16,imm8                ; o16 0F BA /6 ib      [386]
-\c BTR r/m32,imm8                ; o32 0F BA /6 ib      [386]
-
-\c BTS r/m16,reg16               ; o16 0F AB /r         [386]
-\c BTS r/m32,reg32               ; o32 0F AB /r         [386]
-\c BTS r/m16,imm8                ; o16 0F BA /5 ib      [386]
-\c BTS r/m32,imm8                ; o32 0F BA /5 ib      [386]
-
-These instructions all test one bit of their first operand, whose
-index is given by the second operand, and store the value of that
-bit into the carry flag. Bit indices are from 0 (least significant)
-to 15 or 31 (most significant).
-
-In addition to storing the original value of the bit into the carry
-flag, \c{BTR} also resets (clears) the bit in the operand itself.
-\c{BTS} sets the bit, and \c{BTC} complements the bit. \c{BT} does
-not modify its operands.
-
-The destination can be a register or a memory location. The source can
-be a register or an immediate value.
-
-If the destination operand is a register, the bit offset should be
-in the range 0-15 (for 16-bit operands) or 0-31 (for 32-bit operands).
-An immediate value outside these ranges will be taken modulo 16/32
-by the processor.
-
-If the destination operand is a memory location, then an immediate
-bit offset follows the same rules as for a register. If the bit offset
-is in a register, then it can be anything within the signed range of
-the register used (i.e. for a 32-bit operand, it can be -2^31 to 2^31 - 1).
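-
-A short worked sequence on a register operand may make the
-differences clearer; the starting value is an arbitrary example:
-
-\c         mov     eax,0x00000005  ; bits 0 and 2 set
-\c         bt      eax,2           ; CF := 1, EAX unchanged
-\c         btr     eax,2           ; CF := 1, EAX = 0x00000001
-\c         bts     eax,4           ; CF := 0, EAX = 0x00000011
-\c         btc     eax,0           ; CF := 1, EAX = 0x00000010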
-
-
-\S{insCALL} \i\c{CALL}: Call Subroutine
-
-\c CALL imm                      ; E8 rw/rd             [8086]
-\c CALL imm:imm16                ; o16 9A iw iw         [8086]
-\c CALL imm:imm32                ; o32 9A id iw         [386]
-\c CALL FAR mem16                ; o16 FF /3            [8086]
-\c CALL FAR mem32                ; o32 FF /3            [386]
-\c CALL r/m16                    ; o16 FF /2            [8086]
-\c CALL r/m32                    ; o32 FF /2            [386]
-
-\c{CALL} calls a subroutine, by means of pushing the current
-instruction pointer (\c{IP}) and optionally \c{CS} as well on the
-stack, and then jumping to a given address.
-
-\c{CS} is pushed as well as \c{IP} if and only if the call is a far
-call, i.e. a destination segment address is specified in the
-instruction. The forms involving two colon-separated arguments are
-far calls; so are the \c{CALL FAR mem} forms.
-
-The immediate \i{near call} takes one of two forms (\c{CALL imm16} or
-\c{CALL imm32}), determined by the default operand size of the current
-segment. For 16-bit operands,
-you would use \c{CALL 0x1234}, and for 32-bit operands you would use
-\c{CALL 0x12345678}. The value passed as an operand is a relative offset.
-
-You can choose between the two immediate \i{far call} forms
-(\c{CALL imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords:
-\c{CALL WORD 0x1234:0x5678} or \c{CALL DWORD 0x1234:0x56789abc}.
-
-The \c{CALL FAR mem} forms execute a far call by loading the
-destination address out of memory. The address loaded consists of 16
-or 32 bits of offset (depending on the operand size), and 16 bits of
-segment. The operand size may be overridden using \c{CALL WORD FAR
-mem} or \c{CALL DWORD FAR mem}.
-
-The \c{CALL r/m} forms execute a \i{near call} (within the same
-segment), loading the destination address out of memory or out of a
-register. The keyword \c{NEAR} may be specified, for clarity, in
-these forms, but is not necessary. Again, operand size can be
-overridden using \c{CALL WORD mem} or \c{CALL DWORD mem}.
-
-As a convenience, NASM does not require you to call a far procedure
-symbol by coding the cumbersome \c{CALL SEG routine:routine}, but
-instead allows the easier synonym \c{CALL FAR routine}.
-
-The \c{CALL r/m} forms given above are near calls; NASM will accept
-the \c{NEAR} keyword (e.g. \c{CALL NEAR [address]}), even though it
-is not strictly necessary.
-
-
-\S{insCBW} \i\c{CBW}, \i\c{CWD}, \i\c{CDQ}, \i\c{CWDE}: Sign Extensions
-
-\c CBW                           ; o16 98               [8086]
-\c CWDE                          ; o32 98               [386]
-
-\c CWD                           ; o16 99               [8086]
-\c CDQ                           ; o32 99               [386]
-
-All these instructions sign-extend a short value into a longer one,
-by replicating the top bit of the original value to fill the
-extended one.
-
-\c{CBW} extends \c{AL} into \c{AX} by repeating the top bit of
-\c{AL} in every bit of \c{AH}. \c{CWDE} extends \c{AX} into
-\c{EAX}. \c{CWD} extends \c{AX} into \c{DX:AX} by repeating
-the top bit of \c{AX} throughout \c{DX}, and \c{CDQ} extends
-\c{EAX} into \c{EDX:EAX}.
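-
-A typical use is preparing a signed dividend before a division. The
-sketch below uses \c{IDIV}, which is documented elsewhere, and the
-values are arbitrary:
-
-\c         mov     ax,-7           ; dividend in AX
-\c         cwd                     ; DX:AX := sign-extended AX (DX = 0xFFFF)
-\c         mov     bx,2
-\c         idiv    bx              ; AX = -3 (quotient), DX = -1 (remainder)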
-
-
-\S{insCLC} \i\c{CLC}, \i\c{CLD}, \i\c{CLI}, \i\c{CLTS}: Clear Flags
-
-\c CLC                           ; F8                   [8086]
-\c CLD                           ; FC                   [8086]
-\c CLI                           ; FA                   [8086]
-\c CLTS                          ; 0F 06                [286,PRIV]
-
-These instructions clear various flags. \c{CLC} clears the carry
-flag; \c{CLD} clears the direction flag; \c{CLI} clears the
-interrupt flag (thus disabling interrupts); and \c{CLTS} clears the
-task-switched (\c{TS}) flag in \c{CR0}.
-
-To set the carry, direction, or interrupt flags, use the \c{STC},
-\c{STD} and \c{STI} instructions (\k{insSTC}). To invert the carry
-flag, use \c{CMC} (\k{insCMC}).
-
-
-\S{insCLFLUSH} \i\c{CLFLUSH}: Flush Cache Line
-
-\c CLFLUSH mem                   ; 0F AE /7        [WILLAMETTE,SSE2]
-
-\c{CLFLUSH} invalidates the cache line that contains the linear address
-specified by the source operand from all levels of the processor cache
-hierarchy (data and instruction). If, at any level of the cache
-hierarchy, the line is inconsistent with memory (dirty) it is written
-to memory before invalidation. The source operand points to a
-byte-sized memory location.
-
-Although \c{CLFLUSH} is flagged \c{SSE2} and above, it may not be
-present on all processors which have \c{SSE2} support, and it may be
-supported on other processors; the \c{CPUID} instruction (\k{insCPUID})
-will return a bit which indicates support for the \c{CLFLUSH} instruction.
-
-
-\S{insCMC} \i\c{CMC}: Complement Carry Flag
-
-\c CMC                           ; F5                   [8086]
-
-\c{CMC} changes the value of the carry flag: if it was 0, it sets it
-to 1, and vice versa.
-
-
-\S{insCMOVcc} \i\c{CMOVcc}: Conditional Move
-
-\c CMOVcc reg16,r/m16            ; o16 0F 40+cc /r      [P6]
-\c CMOVcc reg32,r/m32            ; o32 0F 40+cc /r      [P6]
-
-\c{CMOV} moves its source (second) operand into its destination
-(first) operand if the given condition code is satisfied; otherwise
-it does nothing.
-
-For a list of condition codes, see \k{iref-cc}.
-
-Although the \c{CMOV} instructions are flagged \c{P6} and above, they
-may not be supported by all Pentium Pro processors; the \c{CPUID}
-instruction (\k{insCPUID}) will return a bit which indicates whether
-conditional moves are supported.
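-
-As a sketch, a branch-free signed maximum might be written like
-this, assuming the processor has been checked for conditional-move
-support:
-
-\c         cmp     eax,ebx
-\c         cmovl   eax,ebx         ; if EAX < EBX (signed), EAX := EBX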
-
-
-\S{insCMP} \i\c{CMP}: Compare Integers
-
-\c CMP r/m8,reg8                 ; 38 /r                [8086]
-\c CMP r/m16,reg16               ; o16 39 /r            [8086]
-\c CMP r/m32,reg32               ; o32 39 /r            [386]
-
-\c CMP reg8,r/m8                 ; 3A /r                [8086]
-\c CMP reg16,r/m16               ; o16 3B /r            [8086]
-\c CMP reg32,r/m32               ; o32 3B /r            [386]
-
-\c CMP r/m8,imm8                 ; 80 /7 ib             [8086]
-\c CMP r/m16,imm16               ; o16 81 /7 iw         [8086]
-\c CMP r/m32,imm32               ; o32 81 /7 id         [386]
-
-\c CMP r/m16,imm8                ; o16 83 /7 ib         [8086]
-\c CMP r/m32,imm8                ; o32 83 /7 ib         [386]
-
-\c CMP AL,imm8                   ; 3C ib                [8086]
-\c CMP AX,imm16                  ; o16 3D iw            [8086]
-\c CMP EAX,imm32                 ; o32 3D id            [386]
-
-\c{CMP} performs a `mental' subtraction of its second operand from
-its first operand, and affects the flags as if the subtraction had
-taken place, but does not store the result of the subtraction
-anywhere.
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-The destination operand can be a register or a memory location. The
-source can be a register, memory location or an immediate value of
-the same size as the destination.
-
-
-\S{insCMPccPD} \i\c{CMPccPD}: Packed Double-Precision FP Compare
-\I\c{CMPEQPD} \I\c{CMPLTPD} \I\c{CMPLEPD} \I\c{CMPUNORDPD}
-\I\c{CMPNEQPD} \I\c{CMPNLTPD} \I\c{CMPNLEPD} \I\c{CMPORDPD}
-
-\c CMPPD xmm1,xmm2/mem128,imm8   ; 66 0F C2 /r ib  [WILLAMETTE,SSE2]
-
-\c CMPEQPD xmm1,xmm2/mem128      ; 66 0F C2 /r 00  [WILLAMETTE,SSE2]
-\c CMPLTPD xmm1,xmm2/mem128      ; 66 0F C2 /r 01  [WILLAMETTE,SSE2]
-\c CMPLEPD xmm1,xmm2/mem128      ; 66 0F C2 /r 02  [WILLAMETTE,SSE2]
-\c CMPUNORDPD xmm1,xmm2/mem128   ; 66 0F C2 /r 03  [WILLAMETTE,SSE2]
-\c CMPNEQPD xmm1,xmm2/mem128     ; 66 0F C2 /r 04  [WILLAMETTE,SSE2]
-\c CMPNLTPD xmm1,xmm2/mem128     ; 66 0F C2 /r 05  [WILLAMETTE,SSE2]
-\c CMPNLEPD xmm1,xmm2/mem128     ; 66 0F C2 /r 06  [WILLAMETTE,SSE2]
-\c CMPORDPD xmm1,xmm2/mem128     ; 66 0F C2 /r 07  [WILLAMETTE,SSE2]
-
-The \c{CMPccPD} instructions compare the two packed double-precision
-FP values in the source and destination operands, and return the
-result of the comparison in the destination register. The result of
-each comparison is a quadword mask of all 1s (comparison true) or
-all 0s (comparison false).
-
-The destination is an \c{XMM} register. The source can be either an
-\c{XMM} register or a 128-bit memory location.
-
-The third operand is an 8-bit immediate value, of which the low 3
-bits define the type of comparison. For ease of programming, the
-8 two-operand pseudo-instructions are provided, with the third
-operand already filled in. The \I{Condition Predicates}
-\c{Condition Predicates} are:
-
-\c EQ     0   Equal
-\c LT     1   Less-than
-\c LE     2   Less-than-or-equal
-\c UNORD  3   Unordered
-\c NEQ    4   Not-equal
-\c NLT    5   Not-less-than
-\c NLE    6   Not-less-than-or-equal
-\c ORD    7   Ordered
-
-For more details of the comparison predicates, and details of how
-to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
-
-
-\S{insCMPccPS} \i\c{CMPccPS}: Packed Single-Precision FP Compare
-\I\c{CMPEQPS} \I\c{CMPLTPS} \I\c{CMPLEPS} \I\c{CMPUNORDPS}
-\I\c{CMPNEQPS} \I\c{CMPNLTPS} \I\c{CMPNLEPS} \I\c{CMPORDPS}
-
-\c CMPPS xmm1,xmm2/mem128,imm8   ; 0F C2 /r ib     [KATMAI,SSE]
-
-\c CMPEQPS xmm1,xmm2/mem128      ; 0F C2 /r 00     [KATMAI,SSE]
-\c CMPLTPS xmm1,xmm2/mem128      ; 0F C2 /r 01     [KATMAI,SSE]
-\c CMPLEPS xmm1,xmm2/mem128      ; 0F C2 /r 02     [KATMAI,SSE]
-\c CMPUNORDPS xmm1,xmm2/mem128   ; 0F C2 /r 03     [KATMAI,SSE]
-\c CMPNEQPS xmm1,xmm2/mem128     ; 0F C2 /r 04     [KATMAI,SSE]
-\c CMPNLTPS xmm1,xmm2/mem128     ; 0F C2 /r 05     [KATMAI,SSE]
-\c CMPNLEPS xmm1,xmm2/mem128     ; 0F C2 /r 06     [KATMAI,SSE]
-\c CMPORDPS xmm1,xmm2/mem128     ; 0F C2 /r 07     [KATMAI,SSE]
-
-The \c{CMPccPS} instructions compare the four packed single-precision
-FP values in the source and destination operands, and return the
-result of the comparison in the destination register. The result of
-each comparison is a doubleword mask of all 1s (comparison true) or
-all 0s (comparison false).
-
-The destination is an \c{XMM} register. The source can be either an
-\c{XMM} register or a 128-bit memory location.
-
-The third operand is an 8-bit immediate value, of which the low 3
-bits define the type of comparison. For ease of programming, the
-8 two-operand pseudo-instructions are provided, with the third
-operand already filled in. The \I{Condition Predicates}
-\c{Condition Predicates} are:
-
-\c EQ     0   Equal
-\c LT     1   Less-than
-\c LE     2   Less-than-or-equal
-\c UNORD  3   Unordered
-\c NEQ    4   Not-equal
-\c NLT    5   Not-less-than
-\c NLE    6   Not-less-than-or-equal
-\c ORD    7   Ordered
-
-For more details of the comparison predicates, and details of how
-to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
-
-
-\S{insCMPSB} \i\c{CMPSB}, \i\c{CMPSW}, \i\c{CMPSD}: Compare Strings
-
-\c CMPSB                         ; A6                   [8086]
-\c CMPSW                         ; o16 A7               [8086]
-\c CMPSD                         ; o32 A7               [386]
-
-\c{CMPSB} compares the byte at \c{[DS:SI]} or \c{[DS:ESI]} with the
-byte at \c{[ES:DI]} or \c{[ES:EDI]}, and sets the flags accordingly.
-It then increments or decrements (depending on the direction flag:
-increments if the flag is clear, decrements if it is set) \c{SI} and
-\c{DI} (or \c{ESI} and \c{EDI}).
-
-The registers used are \c{SI} and \c{DI} if the address size is 16
-bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use
-an address size not equal to the current \c{BITS} setting, you can
-use an explicit \i\c{a16} or \i\c{a32} prefix.
-
-The segment register used to load from \c{[SI]} or \c{[ESI]} can be
-overridden by using a segment register name as a prefix (for
-example, \c{ES CMPSB}). The use of \c{ES} for the load from \c{[DI]}
-or \c{[EDI]} cannot be overridden.
-
-\c{CMPSW} and \c{CMPSD} work in the same way, but they compare a
-word or a doubleword instead of a byte, and increment or decrement
-the addressing registers by 2 or 4 instead of 1.
-
-The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and
-\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or
-\c{ECX} - again, the address size chooses which) times until the
-first unequal or equal byte is found.
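-
-A minimal sketch of comparing two buffers in flat 32-bit code. The
-names \c{buf1}, \c{buf2}, \c{len} and \c{.differ} are hypothetical,
-\c{DS} and \c{ES} are assumed to address the same memory, and
-\c{len} is assumed to be non-zero:
-
-\c         cld                     ; direction flag clear: count upwards
-\c         mov     esi,buf1
-\c         mov     edi,buf2
-\c         mov     ecx,len
-\c         repe cmpsb              ; stop at first mismatch or when ECX = 0
-\c         jne     .differ         ; ZF clear: a mismatching byte was found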
-
-
-\S{insCMPccSD} \i\c{CMPccSD}: Scalar Double-Precision FP Compare
-\I\c{CMPEQSD} \I\c{CMPLTSD} \I\c{CMPLESD} \I\c{CMPUNORDSD}
-\I\c{CMPNEQSD} \I\c{CMPNLTSD} \I\c{CMPNLESD} \I\c{CMPORDSD}
-
-\c CMPSD xmm1,xmm2/mem64,imm8    ; F2 0F C2 /r ib  [WILLAMETTE,SSE2]
-
-\c CMPEQSD xmm1,xmm2/mem64       ; F2 0F C2 /r 00  [WILLAMETTE,SSE2]
-\c CMPLTSD xmm1,xmm2/mem64       ; F2 0F C2 /r 01  [WILLAMETTE,SSE2]
-\c CMPLESD xmm1,xmm2/mem64       ; F2 0F C2 /r 02  [WILLAMETTE,SSE2]
-\c CMPUNORDSD xmm1,xmm2/mem64    ; F2 0F C2 /r 03  [WILLAMETTE,SSE2]
-\c CMPNEQSD xmm1,xmm2/mem64      ; F2 0F C2 /r 04  [WILLAMETTE,SSE2]
-\c CMPNLTSD xmm1,xmm2/mem64      ; F2 0F C2 /r 05  [WILLAMETTE,SSE2]
-\c CMPNLESD xmm1,xmm2/mem64      ; F2 0F C2 /r 06  [WILLAMETTE,SSE2]
-\c CMPORDSD xmm1,xmm2/mem64      ; F2 0F C2 /r 07  [WILLAMETTE,SSE2]
-
-The \c{CMPccSD} instructions compare the low-order double-precision
-FP values in the source and destination operands, and return the
-result of the comparison in the destination register. The result of
-the comparison is a quadword mask of all 1s (comparison true) or
-all 0s (comparison false).
-
-The destination is an \c{XMM} register. The source can be either an
-\c{XMM} register or a 64-bit memory location.
-
-The third operand is an 8-bit immediate value, of which the low 3
-bits define the type of comparison. For ease of programming, the
-8 two-operand pseudo-instructions are provided, with the third
-operand already filled in. The \I{Condition Predicates}
-\c{Condition Predicates} are:
-
-\c EQ     0   Equal
-\c LT     1   Less-than
-\c LE     2   Less-than-or-equal
-\c UNORD  3   Unordered
-\c NEQ    4   Not-equal
-\c NLT    5   Not-less-than
-\c NLE    6   Not-less-than-or-equal
-\c ORD    7   Ordered
-
-For more details of the comparison predicates, and details of how
-to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
-
-
-\S{insCMPccSS} \i\c{CMPccSS}: Scalar Single-Precision FP Compare
-\I\c{CMPEQSS} \I\c{CMPLTSS} \I\c{CMPLESS} \I\c{CMPUNORDSS}
-\I\c{CMPNEQSS} \I\c{CMPNLTSS} \I\c{CMPNLESS} \I\c{CMPORDSS}
-
-\c CMPSS xmm1,xmm2/mem32,imm8    ; F3 0F C2 /r ib  [KATMAI,SSE]
-
-\c CMPEQSS xmm1,xmm2/mem32       ; F3 0F C2 /r 00  [KATMAI,SSE]
-\c CMPLTSS xmm1,xmm2/mem32       ; F3 0F C2 /r 01  [KATMAI,SSE]
-\c CMPLESS xmm1,xmm2/mem32       ; F3 0F C2 /r 02  [KATMAI,SSE]
-\c CMPUNORDSS xmm1,xmm2/mem32    ; F3 0F C2 /r 03  [KATMAI,SSE]
-\c CMPNEQSS xmm1,xmm2/mem32      ; F3 0F C2 /r 04  [KATMAI,SSE]
-\c CMPNLTSS xmm1,xmm2/mem32      ; F3 0F C2 /r 05  [KATMAI,SSE]
-\c CMPNLESS xmm1,xmm2/mem32      ; F3 0F C2 /r 06  [KATMAI,SSE]
-\c CMPORDSS xmm1,xmm2/mem32      ; F3 0F C2 /r 07  [KATMAI,SSE]
-
-The \c{CMPccSS} instructions compare the low-order single-precision
-FP values in the source and destination operands, and return the
-result of the comparison in the destination register. The result of
-the comparison is a doubleword mask of all 1s (comparison true) or
-all 0s (comparison false).
-
-The destination is an \c{XMM} register. The source can be either an
-\c{XMM} register or a 32-bit memory location.
-
-The third operand is an 8-bit immediate value, of which the low 3
-bits define the type of comparison. For ease of programming, the
-8 two-operand pseudo-instructions are provided, with the third
-operand already filled in. The \I{Condition Predicates}
-\c{Condition Predicates} are:
-
-\c EQ     0   Equal
-\c LT     1   Less-than
-\c LE     2   Less-than-or-equal
-\c UNORD  3   Unordered
-\c NEQ    4   Not-equal
-\c NLT    5   Not-less-than
-\c NLE    6   Not-less-than-or-equal
-\c ORD    7   Ordered
-
-For more details of the comparison predicates, and details of how
-to emulate the "greater-than" equivalents, see \k{iref-SSE-cc}.
-
-
-\S{insCMPXCHG} \i\c{CMPXCHG}, \i\c{CMPXCHG486}: Compare and Exchange
-
-\c CMPXCHG r/m8,reg8             ; 0F B0 /r             [PENT]
-\c CMPXCHG r/m16,reg16           ; o16 0F B1 /r         [PENT]
-\c CMPXCHG r/m32,reg32           ; o32 0F B1 /r         [PENT]
-
-\c CMPXCHG486 r/m8,reg8          ; 0F A6 /r             [486,UNDOC]
-\c CMPXCHG486 r/m16,reg16        ; o16 0F A7 /r         [486,UNDOC]
-\c CMPXCHG486 r/m32,reg32        ; o32 0F A7 /r         [486,UNDOC]
-
-These two instructions perform exactly the same operation; however,
-apparently some (not all) 486 processors support it under a
-non-standard opcode, so NASM provides the undocumented
-\c{CMPXCHG486} form to generate the non-standard opcode.
-
-\c{CMPXCHG} compares its destination (first) operand to the value in
-\c{AL}, \c{AX} or \c{EAX} (depending on the operand size of the
-instruction). If they are equal, it copies its source (second)
-operand into the destination and sets the zero flag. Otherwise, it
-clears the zero flag and copies the destination operand to \c{AL},
-\c{AX} or \c{EAX}.
-
-The destination can be either a register or a memory location. The
-source is a register.
-
-\c{CMPXCHG} is intended to be used for atomic operations in
-multitasking or multiprocessor environments. To safely update a
-value in shared memory, for example, you might load the value into
-\c{EAX}, load the updated value into \c{EBX}, and then execute the
-instruction \c{LOCK CMPXCHG [value],EBX}. If \c{value} has not
-changed since being loaded, it is updated with your desired new
-value, and the zero flag is set to let you know it has worked. (The
-\c{LOCK} prefix prevents another processor doing anything in the
-middle of this operation: it guarantees atomicity.) However, if
-another processor has modified the value in between your load and
-your attempted store, the store does not happen, and you are
-notified of the failure by a cleared zero flag, so you can go round
-and try again.
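-
-The retry loop described above might be sketched as follows, here
-atomically incrementing a shared doubleword. The labels \c{retry}
-and \c{value} are hypothetical:
-
-\c retry:  mov     eax,[value]     ; load the current value
-\c         mov     ebx,eax
-\c         inc     ebx             ; EBX := desired new value
-\c         lock cmpxchg [value],ebx ; store only if [value] still = EAX
-\c         jnz     retry           ; ZF clear: value changed, try again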
-
-
-\S{insCMPXCHG8B} \i\c{CMPXCHG8B}: Compare and Exchange Eight Bytes
-
-\c CMPXCHG8B mem                 ; 0F C7 /1             [PENT]
-
-This is a larger and more unwieldy version of \c{CMPXCHG}: it
-compares the 64-bit (eight-byte) value stored at \c{[mem]} with the
-value in \c{EDX:EAX}. If they are equal, it sets the zero flag and
-stores \c{ECX:EBX} into the memory area. If they are unequal, it
-clears the zero flag and stores the memory contents into \c{EDX:EAX}.
-
-\c{CMPXCHG8B} can be used with the \c{LOCK} prefix, to allow atomic
-execution. This is useful in multi-processor and multi-tasking
-environments.
-
-
-\S{insCOMISD} \i\c{COMISD}: Scalar Ordered Double-Precision FP Compare and Set EFLAGS
-
-\c COMISD xmm1,xmm2/mem64        ; 66 0F 2F /r     [WILLAMETTE,SSE2]
-
-\c{COMISD} compares the low-order double-precision FP value in the
-two source operands. ZF, PF and CF are set according to the result.
-OF, SF and AF are cleared. The unordered result is returned if either
-source is a NaN (QNaN or SNaN).
-
-The destination operand is an \c{XMM} register. The source can be either
-an \c{XMM} register or a 64-bit memory location.
-
-The flags are set according to the following rules:
-
-\c    Result          Flags        Values
-
-\c    UNORDERED:      ZF,PF,CF <-- 111;
-\c    GREATER_THAN:   ZF,PF,CF <-- 000;
-\c    LESS_THAN:      ZF,PF,CF <-- 001;
-\c    EQUAL:          ZF,PF,CF <-- 100;
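-
-Given these flag settings, the result can be tested with the
-ordinary unsigned conditional jumps. A sketch, with hypothetical
-label names:
-
-\c         comisd  xmm0,xmm1
-\c         jp      .unordered      ; PF set: at least one operand is a NaN
-\c         ja      .greater        ; CF=0 and ZF=0: xmm0 > xmm1
-\c         jb      .less           ; CF=1: xmm0 < xmm1
-\c                                 ; fall through: the values are equal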
-
-
-\S{insCOMISS} \i\c{COMISS}: Scalar Ordered Single-Precision FP Compare and Set EFLAGS
-
-\c COMISS xmm1,xmm2/mem32        ; 0F 2F /r        [KATMAI,SSE]
-
-\c{COMISS} compares the low-order single-precision FP value in the
-two source operands. ZF, PF and CF are set according to the result.
-OF, SF and AF are cleared. The unordered result is returned if either
-source is a NaN (QNaN or SNaN).
-
-The destination operand is an \c{XMM} register. The source can be either
-an \c{XMM} register or a 32-bit memory location.
-
-The flags are set according to the following rules:
-
-\c    Result          Flags        Values
-
-\c    UNORDERED:      ZF,PF,CF <-- 111;
-\c    GREATER_THAN:   ZF,PF,CF <-- 000;
-\c    LESS_THAN:      ZF,PF,CF <-- 001;
-\c    EQUAL:          ZF,PF,CF <-- 100;
-
-
-\S{insCPUID} \i\c{CPUID}: Get CPU Identification Code
-
-\c CPUID                         ; 0F A2                [PENT]
-
-\c{CPUID} returns various information about the processor it is
-being executed on. It fills the four registers \c{EAX}, \c{EBX},
-\c{ECX} and \c{EDX} with information, which varies depending on the
-input contents of \c{EAX}.
-
-\c{CPUID} also acts as a barrier to serialize instruction execution:
-executing the \c{CPUID} instruction guarantees that all the effects
-(memory modification, flag modification, register modification) of
-previous instructions have been completed before the next
-instruction gets fetched.
-
-The information returned is as follows:
-
-\b If \c{EAX} is zero on input, \c{EAX} on output holds the maximum
-acceptable input value of \c{EAX}, and \c{EBX:EDX:ECX} contain the
-string \c{"GenuineIntel"} (or not, if you have a clone processor).
-That is to say, \c{EBX} contains \c{"Genu"} (in NASM's own sense of
-character constants, described in \k{chrconst}), \c{EDX} contains
-\c{"ineI"} and \c{ECX} contains \c{"ntel"}.
-
-\b If \c{EAX} is one on input, \c{EAX} on output contains version
-information about the processor, and \c{EDX} contains a set of
-feature flags, showing the presence and absence of various features.
-For example, bit 8 is set if the \c{CMPXCHG8B} instruction
-(\k{insCMPXCHG8B}) is supported, bit 15 is set if the conditional
-move instructions (\k{insCMOVcc} and \k{insFCMOVB}) are supported,
-and bit 23 is set if \c{MMX} instructions are supported.
-
-\b If \c{EAX} is two on input, \c{EAX}, \c{EBX}, \c{ECX} and \c{EDX}
-all contain information about caches and TLBs (Translation Lookaside
-Buffers).
-
-For more information on the data returned from \c{CPUID}, see the
-documentation from Intel and other processor manufacturers.
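-
-As a sketch, the vendor string might be collected like this, where
-\c{vendor} is a hypothetical 12-byte buffer. Note the register
-order:
-
-\c         xor     eax,eax         ; request leaf 0
-\c         cpuid
-\c         mov     [vendor],ebx    ; first four characters, e.g. "Genu"
-\c         mov     [vendor+4],edx  ; next four, e.g. "ineI"
-\c         mov     [vendor+8],ecx  ; last four, e.g. "ntel"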
-
-
-\S{insCVTDQ2PD} \i\c{CVTDQ2PD}:
-Packed Signed INT32 to Packed Double-Precision FP Conversion
-
-\c CVTDQ2PD xmm1,xmm2/mem64      ; F3 0F E6 /r     [WILLAMETTE,SSE2]
-
-\c{CVTDQ2PD} converts two packed signed doublewords from the source
-operand to two packed double-precision FP values in the destination
-operand.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the
-source is a register, the packed integers are in the low quadword.
-
-
-\S{insCVTDQ2PS} \i\c{CVTDQ2PS}:
-Packed Signed INT32 to Packed Single-Precision FP Conversion
-
-\c CVTDQ2PS xmm1,xmm2/mem128     ; 0F 5B /r        [WILLAMETTE,SSE2]
-
-\c{CVTDQ2PS} converts four packed signed doublewords from the source
-operand to four packed single-precision FP values in the destination
-operand.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPD2DQ} \i\c{CVTPD2DQ}:
-Packed Double-Precision FP to Packed Signed INT32 Conversion
-
-\c CVTPD2DQ xmm1,xmm2/mem128     ; F2 0F E6 /r     [WILLAMETTE,SSE2]
-
-\c{CVTPD2DQ} converts two packed double-precision FP values from the
-source operand to two packed signed doublewords in the low quadword
-of the destination operand. The high quadword of the destination is
-set to all 0s.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPD2PI} \i\c{CVTPD2PI}:
-Packed Double-Precision FP to Packed Signed INT32 Conversion
-
-\c CVTPD2PI mm,xmm/mem128        ; 66 0F 2D /r     [WILLAMETTE,SSE2]
-
-\c{CVTPD2PI} converts two packed double-precision FP values from the
-source operand to two packed signed doublewords in the destination
-operand.
-
-The destination operand is an \c{MMX} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPD2PS} \i\c{CVTPD2PS}:
-Packed Double-Precision FP to Packed Single-Precision FP Conversion
-
-\c CVTPD2PS xmm1,xmm2/mem128     ; 66 0F 5A /r     [WILLAMETTE,SSE2]
-
-\c{CVTPD2PS} converts two packed double-precision FP values from the
-source operand to two packed single-precision FP values in the low
-quadword of the destination operand. The high quadword of the
-destination is set to all 0s.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPI2PD} \i\c{CVTPI2PD}:
-Packed Signed INT32 to Packed Double-Precision FP Conversion
-
-\c CVTPI2PD xmm,mm/mem64         ; 66 0F 2A /r     [WILLAMETTE,SSE2]
-
-\c{CVTPI2PD} converts two packed signed doublewords from the source
-operand to two packed double-precision FP values in the destination
-operand.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{MMX} register or a 64-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPI2PS} \i\c{CVTPI2PS}:
-Packed Signed INT32 to Packed Single-Precision FP Conversion
-
-\c CVTPI2PS xmm,mm/mem64         ; 0F 2A /r        [KATMAI,SSE]
-
-\c{CVTPI2PS} converts two packed signed doublewords from the source
-operand to two packed single-precision FP values in the low quadword
-of the destination operand. The high quadword of the destination
-remains unchanged.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{MMX} register or a 64-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPS2DQ} \i\c{CVTPS2DQ}:
-Packed Single-Precision FP to Packed Signed INT32 Conversion
-
-\c CVTPS2DQ xmm1,xmm2/mem128     ; 66 0F 5B /r     [WILLAMETTE,SSE2]
-
-\c{CVTPS2DQ} converts four packed single-precision FP values from the
-source operand to four packed signed doublewords in the destination operand.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPS2PD} \i\c{CVTPS2PD}:
-Packed Single-Precision FP to Packed Double-Precision FP Conversion
-
-\c CVTPS2PD xmm1,xmm2/mem64      ; 0F 5A /r        [WILLAMETTE,SSE2]
-
-\c{CVTPS2PD} converts two packed single-precision FP values from the
-source operand to two packed double-precision FP values in the destination
-operand.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the source
-is a register, the input values are in the low quadword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTPS2PI} \i\c{CVTPS2PI}:
-Packed Single-Precision FP to Packed Signed INT32 Conversion
-
-\c CVTPS2PI mm,xmm/mem64         ; 0F 2D /r        [KATMAI,SSE]
-
-\c{CVTPS2PI} converts two packed single-precision FP values from
-the source operand to two packed signed doublewords in the destination
-operand.
-
-The destination operand is an \c{MMX} register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the
-source is a register, the input values are in the low quadword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTSD2SI} \i\c{CVTSD2SI}:
-Scalar Double-Precision FP to Signed INT32 Conversion
-
-\c CVTSD2SI reg32,xmm/mem64      ; F2 0F 2D /r     [WILLAMETTE,SSE2]
-
-\c{CVTSD2SI} converts a double-precision FP value from the source
-operand to a signed doubleword in the destination operand.
-
-The destination operand is a general purpose register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the
-source is a register, the input value is in the low quadword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTSD2SS} \i\c{CVTSD2SS}:
-Scalar Double-Precision FP to Scalar Single-Precision FP Conversion
-
-\c CVTSD2SS xmm1,xmm2/mem64      ; F2 0F 5A /r     [WILLAMETTE,SSE2]
-
-\c{CVTSD2SS} converts a double-precision FP value from the source
-operand to a single-precision FP value in the low doubleword of the
-destination operand. The upper 3 doublewords are left unchanged.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the
-source is a register, the input value is in the low quadword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTSI2SD} \i\c{CVTSI2SD}:
-Signed INT32 to Scalar Double-Precision FP Conversion
-
-\c CVTSI2SD xmm,r/m32            ; F2 0F 2A /r     [WILLAMETTE,SSE2]
-
-\c{CVTSI2SD} converts a signed doubleword from the source operand to
-a double-precision FP value in the low quadword of the destination
-operand. The high quadword is left unchanged.
-
-The destination operand is an \c{XMM} register. The source can be either
-a general purpose register or a 32-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTSI2SS} \i\c{CVTSI2SS}:
-Signed INT32 to Scalar Single-Precision FP Conversion
-
-\c CVTSI2SS xmm,r/m32            ; F3 0F 2A /r     [KATMAI,SSE]
-
-\c{CVTSI2SS} converts a signed doubleword from the source operand to a
-single-precision FP value in the low doubleword of the destination operand.
-The upper 3 doublewords are left unchanged.
-
-The destination operand is an \c{XMM} register. The source can be either
-a general purpose register or a 32-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTSS2SD} \i\c{CVTSS2SD}:
-Scalar Single-Precision FP to Scalar Double-Precision FP Conversion
-
-\c CVTSS2SD xmm1,xmm2/mem32      ; F3 0F 5A /r     [WILLAMETTE,SSE2]
-
-\c{CVTSS2SD} converts a single-precision FP value from the source operand
-to a double-precision FP value in the low quadword of the destination
-operand. The upper quadword is left unchanged.
-
-The destination operand is an \c{XMM} register. The source can be either
-an \c{XMM} register or a 32-bit memory location. If the source is a
-register, the input value is contained in the low doubleword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTSS2SI} \i\c{CVTSS2SI}:
-Scalar Single-Precision FP to Signed INT32 Conversion
-
-\c CVTSS2SI reg32,xmm/mem32      ; F3 0F 2D /r     [KATMAI,SSE]
-
-\c{CVTSS2SI} converts a single-precision FP value from the source
-operand to a signed doubleword in the destination operand.
-
-The destination operand is a general purpose register. The source can be
-either an \c{XMM} register or a 32-bit memory location. If the
-source is a register, the input value is in the low doubleword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTTPD2DQ} \i\c{CVTTPD2DQ}:
-Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation
-
-\c CVTTPD2DQ xmm1,xmm2/mem128    ; 66 0F E6 /r     [WILLAMETTE,SSE2]
-
-\c{CVTTPD2DQ} converts two packed double-precision FP values in the source
-operand to two packed signed doublewords in the low quadword of the
-destination operand. If the result is inexact, it is truncated (rounded
-toward zero). The high quadword of the destination is set to all 0s.
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTTPD2PI} \i\c{CVTTPD2PI}:
-Packed Double-Precision FP to Packed Signed INT32 Conversion with Truncation
-
-\c CVTTPD2PI mm,xmm/mem128        ; 66 0F 2C /r     [WILLAMETTE,SSE2]
-
-\c{CVTTPD2PI} converts two packed double-precision FP values in the source
-operand to two packed signed doublewords in the destination operand.
-If the result is inexact, it is truncated (rounded toward zero).
-
-The destination operand is an \c{MMX} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTTPS2DQ} \i\c{CVTTPS2DQ}:
-Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation
-
-\c CVTTPS2DQ xmm1,xmm2/mem128    ; F3 0F 5B /r     [WILLAMETTE,SSE2]
-
-\c{CVTTPS2DQ} converts four packed single-precision FP values in the source
-operand to four packed signed doublewords in the destination operand.
-If the result is inexact, it is truncated (rounded toward zero).
-
-The destination operand is an \c{XMM} register. The source can be
-either an \c{XMM} register or a 128-bit memory location.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTTPS2PI} \i\c{CVTTPS2PI}:
-Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation
-
-\c CVTTPS2PI mm,xmm/mem64         ; 0F 2C /r       [KATMAI,SSE]
-
-\c{CVTTPS2PI} converts two packed single-precision FP values in the source
-operand to two packed signed doublewords in the destination operand.
-If the result is inexact, it is truncated (rounded toward zero).
-
-The destination operand is an \c{MMX} register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the source
-is a register, the input values are in the low quadword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTTSD2SI} \i\c{CVTTSD2SI}:
-Scalar Double-Precision FP to Signed INT32 Conversion with Truncation
-
-\c CVTTSD2SI reg32,xmm/mem64      ; F2 0F 2C /r    [WILLAMETTE,SSE2]
-
-\c{CVTTSD2SI} converts a double-precision FP value in the source operand
-to a signed doubleword in the destination operand. If the result is
-inexact, it is truncated (rounded toward zero).
-
-The destination operand is a general purpose register. The source can be
-either an \c{XMM} register or a 64-bit memory location. If the source is a
-register, the input value is in the low quadword.
-
-For more details of this instruction, see the Intel Processor manuals.
-
-
-\S{insCVTTSS2SI} \i\c{CVTTSS2SI}:
-Scalar Single-Precision FP to Signed INT32 Conversion with Truncation
-
-\c CVTTSS2SI reg32,xmm/mem32      ; F3 0F 2C /r    [KATMAI,SSE]
-
-\c{CVTTSS2SI} converts a single-precision FP value in the source operand
-to a signed doubleword in the destination operand. If the result is
-inexact, it is truncated (rounded toward zero).
-
-The destination operand is a general purpose register. The source can be
-either an \c{XMM} register or a 32-bit memory location. If the source is a
-register, the input value is in the low doubleword.
-
-For more details of this instruction, see the Intel Processor manuals.
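-
-As a brief illustration of the difference between the truncating and
-non-truncating conversions, the following sketch converts the same value
-with both instructions. The label \c{value} is hypothetical and is assumed
-to hold a single-precision number; \c{MXCSR} is assumed to select the
-default round-to-nearest mode.
-
-\c           cvtss2si  eax,[value]   ; rounded according to MXCSR
-\c           cvttss2si ebx,[value]   ; always rounded toward zero
-
-For an input of 2.75, this would leave 3 in \c{EAX} and 2 in \c{EBX}.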
-
-
-\S{insDAA} \i\c{DAA}, \i\c{DAS}: Decimal Adjustments
-
-\c DAA                           ; 27                   [8086]
-\c DAS                           ; 2F                   [8086]
-
-These instructions are used in conjunction with the add and subtract
-instructions to perform binary-coded decimal arithmetic in
-\e{packed} (one BCD digit per nibble) form. For the unpacked
-equivalents, see \k{insAAA}.
-
-\c{DAA} should be used after a one-byte \c{ADD} instruction whose
-destination was the \c{AL} register: by examining the value in
-\c{AL} and the auxiliary carry flag \c{AF}, it
-determines whether either digit of the addition has overflowed, and
-adjusts it (and sets the carry and auxiliary-carry flags) if so. You
-can add long BCD strings together by doing \c{ADD}/\c{DAA} on the
-low two digits, then doing \c{ADC}/\c{DAA} on each subsequent pair
-of digits.
-
-\c{DAS} works similarly to \c{DAA}, but is for use after \c{SUB}
-instructions rather than \c{ADD}.
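-
-A minimal sketch of the multi-byte addition described above, assuming
-two four-digit packed BCD numbers stored low byte first at the
-hypothetical labels \c{bcd1} and \c{bcd2}, with the result written to
-the equally hypothetical \c{sum}:
-
-\c           mov al,[bcd1]           ; low two digits of the first number
-\c           add al,[bcd2]           ; plain binary addition
-\c           daa                     ; adjust AL; CF = decimal carry out
-\c           mov [sum],al            ; MOV leaves the flags untouched
-\c           mov al,[bcd1+1]         ; next two digits
-\c           adc al,[bcd2+1]         ; include the carry from the low pair
-\c           daa
-\c           mov [sum+1],al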
-
-
-\S{insDEC} \i\c{DEC}: Decrement Integer
-
-\c DEC reg16                     ; o16 48+r             [8086]
-\c DEC reg32                     ; o32 48+r             [386]
-\c DEC r/m8                      ; FE /1                [8086]
-\c DEC r/m16                     ; o16 FF /1            [8086]
-\c DEC r/m32                     ; o32 FF /1            [386]
-
-\c{DEC} subtracts 1 from its operand. It does \e{not} affect the
-carry flag: to affect the carry flag, use \c{SUB something,1} (see
-\k{insSUB}). \c{DEC} affects all the other flags according to the result.
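-
-Because \c{DEC} leaves the carry flag alone, it can serve as the loop
-counter inside a multi-precision addition. The following is only a
-sketch: it assumes that \c{ESI} and \c{EDI} point to two arrays of
-doublewords stored least significant first, and that \c{ECX} holds
-their (non-zero) length.
-
-\c           clc                     ; no carry into the lowest doubleword
-\c .next:    mov eax,[esi]
-\c           adc [edi],eax           ; add with carry from the previous step
-\c           lea esi,[esi+4]         ; LEA advances the pointers without
-\c           lea edi,[edi+4]         ;   touching the flags
-\c           dec ecx                 ; ZF is updated, CF is preserved
-\c           jnz .next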
-
-This instruction can be used with a \c{LOCK} prefix to allow atomic
-execution.
-
-See also \c{INC} (\k{insINC}).
-
-
-\S{insDIV} \i\c{DIV}: Unsigned Integer Divide
-
-\c DIV r/m8                      ; F6 /6                [8086]
-\c DIV r/m16                     ; o16 F7 /6            [8086]
-\c DIV r/m32                     ; o32 F7 /6            [386]
-
-\c{DIV} performs unsigned integer division. The explicit operand
-provided is the divisor; the dividend and destination operands are
-implicit, in the following way:
-
-\b For \c{DIV r/m8}, \c{AX} is divided by the given operand; the
-quotient is stored in \c{AL} and the remainder in \c{AH}.
-
-\b For \c{DIV r/m16}, \c{DX:AX} is divided by the given operand; the
-quotient is stored in \c{AX} and the remainder in \c{DX}.
-
-\b For \c{DIV r/m32}, \c{EDX:EAX} is divided by the given operand;
-the quotient is stored in \c{EAX} and the remainder in \c{EDX}.
-
-Signed integer division is performed by the \c{IDIV} instruction:
-see \k{insIDIV}.
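-
-As a short sketch of the 32-bit form, the following divides a 32-bit
-unsigned value by another. The labels \c{dividend} and \c{divisor} are
-hypothetical. Since the dividend is taken from \c{EDX:EAX}, \c{EDX}
-must be cleared first when the value fits in 32 bits.
-
-\c           mov eax,[dividend]      ; low 32 bits of the dividend
-\c           xor edx,edx             ; high 32 bits are zero
-\c           div dword [divisor]     ; EAX = quotient, EDX = remainder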
-
-
-\S{insDIVPD} \i\c{DIVPD}: Packed Double-Precision FP Divide
-
-\c DIVPD xmm1,xmm2/mem128        ; 66 0F 5E /r     [WILLAMETTE,SSE2]
-
-\c{DIVPD} divides the two packed double-precision FP values in
-the destination operand by the two packed double-precision FP
-values in the source operand, and stores the packed double-precision
-results in the destination register.
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-\c    dst[0-63]   := dst[0-63]   / src[0-63],
-\c    dst[64-127] := dst[64-127] / src[64-127].
-
-
-\S{insDIVPS} \i\c{DIVPS}: Packed Single-Precision FP Divide
-
-\c DIVPS xmm1,xmm2/mem128        ; 0F 5E /r        [KATMAI,SSE]
-
-\c{DIVPS} divides the four packed single-precision FP values in
-the destination operand by the four packed single-precision FP
-values in the source operand, and stores the packed single-precision
-results in the destination register.
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 128-bit memory location.
-
-\c    dst[0-31]   := dst[0-31]   / src[0-31],
-\c    dst[32-63]  := dst[32-63]  / src[32-63],
-\c    dst[64-95]  := dst[64-95]  / src[64-95],
-\c    dst[96-127] := dst[96-127] / src[96-127].
-
-
-\S{insDIVSD} \i\c{DIVSD}: Scalar Double-Precision FP Divide
-
-\c DIVSD xmm1,xmm2/mem64         ; F2 0F 5E /r     [WILLAMETTE,SSE2]
-
-\c{DIVSD} divides the low-order double-precision FP value in the
-destination operand by the low-order double-precision FP value in
-the source operand, and stores the double-precision result in the
-destination register.
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 64-bit memory location.
-
-\c    dst[0-63]   := dst[0-63] / src[0-63],
-\c    dst[64-127] remains unchanged.
-
-
-\S{insDIVSS} \i\c{DIVSS}: Scalar Single-Precision FP Divide
-
-\c DIVSS xmm1,xmm2/mem32         ; F3 0F 5E /r     [KATMAI,SSE]
-
-\c{DIVSS} divides the low-order single-precision FP value in the
-destination operand by the low-order single-precision FP value in
-the source operand, and stores the single-precision result in the
-destination register.
-
-The destination is an \c{XMM} register. The source operand can be
-either an \c{XMM} register or a 32-bit memory location.
-
-\c    dst[0-31]   := dst[0-31] / src[0-31],
-\c    dst[32-127] remains unchanged.
-
-
-\S{insEMMS} \i\c{EMMS}: Empty MMX State
-
-\c EMMS                          ; 0F 77                [PENT,MMX]
-
-\c{EMMS} sets the FPU tag word (marking which floating-point registers
-are available) to all ones, meaning all registers are available for
-the FPU to use. It should be used after executing \c{MMX} instructions
-and before executing any subsequent floating-point operations.
-
-
-\S{insENTER} \i\c{ENTER}: Create Stack Frame
-
-\c ENTER imm,imm                 ; C8 iw ib             [186]
-
-\c{ENTER} constructs a \i\c{stack frame} for a high-level language
-procedure call. The first operand (the \c{iw} in the opcode
-definition above refers to the first operand) gives the amount of
-stack space to allocate for local variables; the second (the \c{ib}
-above) gives the nesting level of the procedure (for languages like
-Pascal, with nested procedures).
-
-The function of \c{ENTER}, with a nesting level of zero, is
-equivalent to
-
-\c           PUSH EBP            ; or PUSH BP         in 16 bits
-\c           MOV EBP,ESP         ; or MOV BP,SP       in 16 bits
-\c           SUB ESP,operand1    ; or SUB SP,operand1 in 16 bits
-
-This creates a stack frame with the procedure parameters accessible
-upwards from \c{EBP}, and local variables accessible downwards from
-\c{EBP}.
-
-With a nesting level of one, the stack frame created is 4 (or 2)
-bytes bigger, and the value of the final frame pointer \c{EBP} is
-accessible in memory at \c{[EBP-4]}.
-
-This allows \c{ENTER}, when called with a nesting level of two, to
-look at the stack frame described by the \e{previous} value of
-\c{EBP}, find the frame pointer at offset -4 from that, and push it
-along with its new frame pointer, so that when a level-two procedure
-is called from within a level-one procedure, \c{[EBP-4]} holds the
-frame pointer of the most recent level-one procedure call and
-\c{[EBP-8]} holds that of the most recent level-two call. And so on,
-for nesting levels up to 31.
-
-Stack frames created by \c{ENTER} can be destroyed by the \c{LEAVE}
-instruction: see \k{insLEAVE}.
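-
-A minimal sketch of a 32-bit procedure using \c{ENTER} with a nesting
-level of zero. The label \c{proc} is hypothetical, and the parameter
-offset assumes a near call with the first parameter pushed last:
-
-\c proc:     enter 16,0              ; 16 bytes of locals, nesting level 0
-\c           mov dword [ebp-4],0     ; first local variable
-\c           mov eax,[ebp+8]         ; first parameter
-\c           leave                   ; undo the stack frame
-\c           ret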
-
-
-\S{insF2XM1} \i\c{F2XM1}: Calculate 2**X-1
-
-\c F2XM1                         ; D9 F0                [8086,FPU]
-
-\c{F2XM1} raises 2 to the power of \c{ST0}, subtracts one, and
-stores the result back into \c{ST0}. The initial contents of \c{ST0}
-must be a number in the range -1.0 to +1.0.
-
-
-\S{insFABS} \i\c{FABS}: Floating-Point Absolute Value
-
-\c FABS                          ; D9 E1                [8086,FPU]
-
-\c{FABS} computes the absolute value of \c{ST0}, by clearing the sign
-bit, and stores the result back in \c{ST0}.
-
-
-\S{insFADD} \i\c{FADD}, \i\c{FADDP}: Floating-Point Addition
-
-\c FADD mem32                    ; D8 /0                [8086,FPU]
-\c FADD mem64                    ; DC /0                [8086,FPU]
-
-\c FADD fpureg                   ; D8 C0+r              [8086,FPU]
-\c FADD ST0,fpureg               ; D8 C0+r              [8086,FPU]
-
-\c FADD TO fpureg                ; DC C0+r              [8086,FPU]
-\c FADD fpureg,ST0               ; DC C0+r              [8086,FPU]
-
-\c FADDP fpureg                  ; DE C0+r              [8086,FPU]
-\c FADDP fpureg,ST0              ; DE C0+r              [8086,FPU]
-
-\b \c{FADD}, given one operand, adds the operand to \c{ST0} and stores
-the result back in \c{ST0}. If the operand has the \c{TO} modifier,
-the result is stored in the register given rather than in \c{ST0}.
-
-\b \c{FADDP} performs the same function as \c{FADD TO}, but pops the
-register stack after storing the result.
-
-The given two-operand forms are synonyms for the one-operand forms.
-
-To add an integer value to \c{ST0}, use the \c{FIADD} instruction
-(\k{insFIADD}).
-
-
-\S{insFBLD} \i\c{FBLD}, \i\c{FBSTP}: BCD Floating-Point Load and Store
-
-\c FBLD mem80                    ; DF /4                [8086,FPU]
-\c FBSTP mem80                   ; DF /6                [8086,FPU]
-
-\c{FBLD} loads an 80-bit (ten-byte) packed binary-coded decimal
-number from the given memory address, converts it to a real, and
-pushes it on the register stack. \c{FBSTP} stores the value of
-\c{ST0}, in packed BCD, at the given address and then pops the
-register stack.
-
-
-\S{insFCHS} \i\c{FCHS}: Floating-Point Change Sign
-
-\c FCHS                          ; D9 E0                [8086,FPU]
-
-\c{FCHS} negates the number in \c{ST0}, by inverting the sign bit:
-negative numbers become positive, and vice versa.
-
-
-\S{insFCLEX} \i\c{FCLEX}, \i\c{FNCLEX}: Clear Floating-Point Exceptions
-
-\c FCLEX                         ; 9B DB E2             [8086,FPU]
-\c FNCLEX                        ; DB E2                [8086,FPU]
-
-\c{FCLEX} clears any floating-point exceptions which may be pending.
-\c{FNCLEX} does the same thing but doesn't wait for previous
-floating-point operations (including the \e{handling} of pending
-exceptions) to finish first.
-
-
-\S{insFCMOVB} \i\c{FCMOVcc}: Floating-Point Conditional Move
-
-\c FCMOVB fpureg                 ; DA C0+r              [P6,FPU]
-\c FCMOVB ST0,fpureg             ; DA C0+r              [P6,FPU]
-
-\c FCMOVE fpureg                 ; DA C8+r              [P6,FPU]
-\c FCMOVE ST0,fpureg             ; DA C8+r              [P6,FPU]
-
-\c FCMOVBE fpureg                ; DA D0+r              [P6,FPU]
-\c FCMOVBE ST0,fpureg            ; DA D0+r              [P6,FPU]
-
-\c FCMOVU fpureg                 ; DA D8+r              [P6,FPU]
-\c FCMOVU ST0,fpureg             ; DA D8+r              [P6,FPU]
-
-\c FCMOVNB fpureg                ; DB C0+r              [P6,FPU]
-\c FCMOVNB ST0,fpureg            ; DB C0+r              [P6,FPU]
-
-\c FCMOVNE fpureg                ; DB C8+r              [P6,FPU]
-\c FCMOVNE ST0,fpureg            ; DB C8+r              [P6,FPU]
-
-\c FCMOVNBE fpureg               ; DB D0+r              [P6,FPU]
-\c FCMOVNBE ST0,fpureg           ; DB D0+r              [P6,FPU]
-
-\c FCMOVNU fpureg                ; DB D8+r              [P6,FPU]
-\c FCMOVNU ST0,fpureg            ; DB D8+r              [P6,FPU]
-
-The \c{FCMOV} instructions perform conditional move operations: each
-of them moves the contents of the given register into \c{ST0} if its
-condition is satisfied, and does nothing if not.
-
-The conditions are not the same as the standard condition codes used
-with conditional jump instructions. The conditions \c{B}, \c{BE},
-\c{NB}, \c{NBE}, \c{E} and \c{NE} are exactly as normal, but none of
-the other standard ones are supported. Instead, the condition \c{U}
-and its counterpart \c{NU} are provided; the \c{U} condition is
-satisfied if the last two floating-point numbers compared were
-\e{unordered}, i.e. they were not equal but neither one could be
-said to be greater than the other, for example if they were NaNs.
-(The flag state which signals this is the setting of the parity
-flag: so the \c{U} condition is notionally equivalent to \c{PE}, and
-\c{NU} is equivalent to \c{PO}.)
-
-The \c{FCMOV} conditions test the main processor's status flags, not
-the FPU status flags, so using \c{FCMOV} directly after \c{FCOM}
-will not work. Instead, you should either use \c{FCOMI} which writes
-directly to the main CPU flags word, or use \c{FSTSW} to extract the
-FPU flags.
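-
-For instance, the following sketch, which assumes that \c{ST0} and
-\c{ST1} both hold ordinary (non-NaN) numbers, leaves the larger of the
-two in \c{ST0}:
-
-\c           fcomi st0,st1           ; compare ST0 with ST1, set CPU flags
-\c           fcmovb st0,st1          ; if ST0 < ST1, copy ST1 into ST0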
-
-Although the \c{FCMOV} instructions are flagged \c{P6} above, they
-may not be supported by all Pentium Pro processors; the \c{CPUID}
-instruction (\k{insCPUID}) will return a bit which indicates whether
-conditional moves are supported.
-
-
-\S{insFCOM} \i\c{FCOM}, \i\c{FCOMP}, \i\c{FCOMPP}, \i\c{FCOMI},
-\i\c{FCOMIP}: Floating-Point Compare
-
-\c FCOM mem32                    ; D8 /2                [8086,FPU]
-\c FCOM mem64                    ; DC /2                [8086,FPU]
-\c FCOM fpureg                   ; D8 D0+r              [8086,FPU]
-\c FCOM ST0,fpureg               ; D8 D0+r              [8086,FPU]
-
-\c FCOMP mem32                   ; D8 /3                [8086,FPU]
-\c FCOMP mem64                   ; DC /3                [8086,FPU]
-\c FCOMP fpureg                  ; D8 D8+r              [8086,FPU]
-\c FCOMP ST0,fpureg              ; D8 D8+r              [8086,FPU]
-
-\c FCOMPP                        ; DE D9                [8086,FPU]
-
-\c FCOMI fpureg                  ; DB F0+r              [P6,FPU]
-\c FCOMI ST0,fpureg              ; DB F0+r              [P6,FPU]
-
-\c FCOMIP fpureg                 ; DF F0+r              [P6,FPU]
-\c FCOMIP ST0,fpureg             ; DF F0+r              [P6,FPU]
-
-\c{FCOM} compares \c{ST0} with the given operand, and sets the FPU
-flags accordingly. \c{ST0} is treated as the left-hand side of the
-comparison, so that the carry flag is set (for a `less-than' result)
-if \c{ST0} is less than the given operand.
-
-\c{FCOMP} does the same as \c{FCOM}, but pops the register stack
-afterwards. \c{FCOMPP} compares \c{ST0} with \c{ST1} and then pops
-the register stack twice.
-
-\c{FCOMI} and \c{FCOMIP} work like the corresponding forms of
-\c{FCOM} and \c{FCOMP}, but write their results directly to the CPU
-flags register rather than the FPU status word, so they can be
-immediately followed by conditional jump or conditional move
-instructions.
-
-The \c{FCOM} instructions differ from the \c{FUCOM} instructions
-(\k{insFUCOM}) only in the way they handle quiet NaNs: \c{FUCOM}
-will handle them silently and set the condition code flags to an
-`unordered' result, whereas \c{FCOM} will generate an exception.
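-
-On processors without \c{FCOMI}, the result of \c{FCOM} is usually
-transferred to the CPU flags by hand. A sketch of the usual idiom,
-assuming a 287 or later (for the \c{FNSTSW AX} form) and a
-hypothetical label \c{.less}:
-
-\c           fcom st1                ; compare ST0 with ST1
-\c           fnstsw ax               ; copy the FPU status word into AX
-\c           sahf                    ; AH -> CPU flags: C0 lands in CF
-\c           jb .less                ; taken if ST0 < ST1 (or unordered)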
-
-
-\S{insFCOS} \i\c{FCOS}: Cosine
-
-\c FCOS                          ; D9 FF                [386,FPU]
-
-\c{FCOS} computes the cosine of \c{ST0} (in radians), and stores the
-result in \c{ST0}. The absolute value of \c{ST0} must be less than 2**63.
-
-See also \c{FSINCOS} (\k{insFSIN}).
-
-
-\S{insFDECSTP} \i\c{FDECSTP}: Decrement Floating-Point Stack Pointer
-
-\c FDECSTP                       ; D9 F6                [8086,FPU]
-
-\c{FDECSTP} decrements the `top' field in the floating-point status
-word. This has the effect of rotating the FPU register stack by one,
-as if the contents of \c{ST7} had been pushed on the stack. See also
-\c{FINCSTP} (\k{insFINCSTP}).
-
-
-\S{insFDISI} \i\c{FxDISI}, \i\c{FxENI}: Disable and Enable Floating-Point Interrupts
-
-\c FDISI                         ; 9B DB E1             [8086,FPU]
-\c FNDISI                        ; DB E1                [8086,FPU]
-
-\c FENI                          ; 9B DB E0             [8086,FPU]
-\c FNENI                         ; DB E0                [8086,FPU]
-
-\c{FDISI} and \c{FENI} disable and enable floating-point interrupts.
-These instructions are only meaningful on original 8087 processors:
-the 287 and above treat them as no-operation instructions.
-
-\c{FNDISI} and \c{FNENI} do the same thing as \c{FDISI} and \c{FENI}
-respectively, but without waiting for the floating-point processor
-to finish what it was doing first.
-
-
-\S{insFDIV} \i\c{FDIV}, \i\c{FDIVP}, \i\c{FDIVR}, \i\c{FDIVRP}: Floating-Point Division
-
-\c FDIV mem32                    ; D8 /6                [8086,FPU]
-\c FDIV mem64                    ; DC /6                [8086,FPU]
-
-\c FDIV fpureg                   ; D8 F0+r              [8086,FPU]
-\c FDIV ST0,fpureg               ; D8 F0+r              [8086,FPU]
-
-\c FDIV TO fpureg                ; DC F8+r              [8086,FPU]
-\c FDIV fpureg,ST0               ; DC F8+r              [8086,FPU]
-
-\c FDIVR mem32                   ; D8 /7                [8086,FPU]
-\c FDIVR mem64                   ; DC /7                [8086,FPU]
-
-\c FDIVR fpureg                  ; D8 F8+r              [8086,FPU]
-\c FDIVR ST0,fpureg              ; D8 F8+r              [8086,FPU]
-
-\c FDIVR TO fpureg               ; DC F0+r              [8086,FPU]
-\c FDIVR fpureg,ST0              ; DC F0+r              [8086,FPU]
-
-\c FDIVP fpureg                  ; DE F8+r              [8086,FPU]
-\c FDIVP fpureg,ST0              ; DE F8+r              [8086,FPU]
-
-\c FDIVRP fpureg                 ; DE F0+r              [8086,FPU]
-\c FDIVRP fpureg,ST0             ; DE F0+r              [8086,FPU]
-
-\b \c{FDIV} divides \c{ST0} by the given operand and stores the result
-back in \c{ST0}, unless the \c{TO} qualifier is given, in which case
-it divides the given operand by \c{ST0} and stores the result in the
-operand.
-
-\b \c{FDIVR} does the same thing, but does the division the other way
-up: so if \c{TO} is not given, it divides the given operand by
-\c{ST0} and stores the result in \c{ST0}, whereas if \c{TO} is given
-it divides \c{ST0} by its operand and stores the result in the
-operand.
-
-\b \c{FDIVP} operates like \c{FDIV TO}, but pops the register stack
-once it has finished.
-
-\b \c{FDIVRP} operates like \c{FDIVR TO}, but pops the register stack
-once it has finished.
-
-For FP/Integer divisions, see \c{FIDIV} (\k{insFIDIV}).
-
-
-\S{insFEMMS} \i\c{FEMMS}: Faster Enter/Exit of the MMX or floating-point state
-
-\c FEMMS                         ; 0F 0E           [PENT,3DNOW]
-
-\c{FEMMS} can be used in place of the \c{EMMS} instruction on
-processors which support the 3DNow! instruction set. Following
-execution of \c{FEMMS}, the state of the \c{MMX/FP} registers
-is undefined, and this allows a faster context switch between
-\c{FP} and \c{MMX} instructions. The \c{FEMMS} instruction can
-also be used \e{before} executing \c{MMX} instructions.
-
-
-\S{insFFREE} \i\c{FFREE}: Flag Floating-Point Register as Unused
-
-\c FFREE fpureg                  ; DD C0+r              [8086,FPU]
-\c FFREEP fpureg                 ; DF C0+r              [286,FPU,UNDOC]
-
-\c{FFREE} marks the given register as being empty.
-
-\c{FFREEP} marks the given register as being empty, and then
-pops the register stack.
-
-
-\S{insFIADD} \i\c{FIADD}: Floating-Point/Integer Addition
-
-\c FIADD mem16                   ; DE /0                [8086,FPU]
-\c FIADD mem32                   ; DA /0                [8086,FPU]
-
-\c{FIADD} adds the 16-bit or 32-bit integer stored in the given
-memory location to \c{ST0}, storing the result in \c{ST0}.
-
-
-\S{insFICOM} \i\c{FICOM}, \i\c{FICOMP}: Floating-Point/Integer Compare
-
-\c FICOM mem16                   ; DE /2                [8086,FPU]
-\c FICOM mem32                   ; DA /2                [8086,FPU]
-
-\c FICOMP mem16                  ; DE /3                [8086,FPU]
-\c FICOMP mem32                  ; DA /3                [8086,FPU]
-
-\c{FICOM} compares \c{ST0} with the 16-bit or 32-bit integer stored
-in the given memory location, and sets the FPU flags accordingly.
-\c{FICOMP} does the same, but pops the register stack afterwards.
-
-
-\S{insFIDIV} \i\c{FIDIV}, \i\c{FIDIVR}: Floating-Point/Integer Division
-
-\c FIDIV mem16                   ; DE /6                [8086,FPU]
-\c FIDIV mem32                   ; DA /6                [8086,FPU]
-
-\c FIDIVR mem16                  ; DE /7                [8086,FPU]
-\c FIDIVR mem32                  ; DA /7                [8086,FPU]
-
-\c{FIDIV} divides \c{ST0} by the 16-bit or 32-bit integer stored in
-the given memory location, and stores the result in \c{ST0}.
-\c{FIDIVR} does the division the other way up: it divides the
-integer by \c{ST0}, but still stores the result in \c{ST0}.
-
-
-\S{insFILD} \i\c{FILD}, \i\c{FIST}, \i\c{FISTP}: Floating-Point/Integer Conversion
-
-\c FILD mem16                    ; DF /0                [8086,FPU]
-\c FILD mem32                    ; DB /0                [8086,FPU]
-\c FILD mem64                    ; DF /5                [8086,FPU]
-
-\c FIST mem16                    ; DF /2                [8086,FPU]
-\c FIST mem32                    ; DB /2                [8086,FPU]
-
-\c FISTP mem16                   ; DF /3                [8086,FPU]
-\c FISTP mem32                   ; DB /3                [8086,FPU]
-\c FISTP mem64                   ; DF /7                [8086,FPU]
-
-\c{FILD} loads an integer out of a memory location, converts it to a
-real, and pushes it on the FPU register stack. \c{FIST} converts
-\c{ST0} to an integer and stores that in memory; \c{FISTP} does the
-same as \c{FIST}, but pops the register stack afterwards.
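-
-As a minimal sketch of a round trip (the labels \c{count} and
-\c{result} are hypothetical 32-bit variables, not part of NASM), the
-following loads an integer, doubles it on the FPU, and stores it back
-as an integer, leaving the register stack as it was found:
-
-\c         fild    dword [count]   ; push the integer, converted to real
-\c         fadd    st0,st0         ; ST0 = ST0 + ST0
-\c         fistp   dword [result]  ; convert back, store, and pop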
-
-
-\S{insFIMUL} \i\c{FIMUL}: Floating-Point/Integer Multiplication
-
-\c FIMUL mem16                   ; DE /1                [8086,FPU]
-\c FIMUL mem32                   ; DA /1                [8086,FPU]
-
-\c{FIMUL} multiplies \c{ST0} by the 16-bit or 32-bit integer stored
-in the given memory location, and stores the result in \c{ST0}.
-
-
-\S{insFINCSTP} \i\c{FINCSTP}: Increment Floating-Point Stack Pointer
-
-\c FINCSTP                       ; D9 F7                [8086,FPU]
-
-\c{FINCSTP} increments the `top' field in the floating-point status
-word. This has the effect of rotating the FPU register stack by one,
-as if the register stack had been popped; however, unlike the
-popping of the stack performed by many FPU instructions, it does not
-flag the new \c{ST7} (previously \c{ST0}) as empty. See also
-\c{FDECSTP} (\k{insFDECSTP}).
-
-
-\S{insFINIT} \i\c{FINIT}, \i\c{FNINIT}: Initialize Floating-Point Unit
-
-\c FINIT                         ; 9B DB E3             [8086,FPU]
-\c FNINIT                        ; DB E3                [8086,FPU]
-
-\c{FINIT} initializes the FPU to its default state. It flags all
-registers as empty, without actually changing their values, and
-clears the top-of-stack pointer. \c{FNINIT} does the same, without
-first waiting for pending exceptions to clear.
-
-
-\S{insFISUB} \i\c{FISUB}, \i\c{FISUBR}: Floating-Point/Integer Subtraction
-
-\c FISUB mem16                   ; DE /4                [8086,FPU]
-\c FISUB mem32                   ; DA /4                [8086,FPU]
-
-\c FISUBR mem16                  ; DE /5                [8086,FPU]
-\c FISUBR mem32                  ; DA /5                [8086,FPU]
-
-\c{FISUB} subtracts the 16-bit or 32-bit integer stored in the given
-memory location from \c{ST0}, and stores the result in \c{ST0}.
-\c{FISUBR} does the subtraction the other way round, i.e. it
-subtracts \c{ST0} from the given integer, but still stores the
-result in \c{ST0}.
-
-
-\S{insFLD} \i\c{FLD}: Floating-Point Load
-
-\c FLD mem32                     ; D9 /0                [8086,FPU]
-\c FLD mem64                     ; DD /0                [8086,FPU]
-\c FLD mem80                     ; DB /5                [8086,FPU]
-\c FLD fpureg                    ; D9 C0+r              [8086,FPU]
-
-\c{FLD} loads a floating-point value out of the given register or
-memory location, and pushes it on the FPU register stack.
-
-
-\S{insFLD1} \i\c{FLDxx}: Floating-Point Load Constants
-
-\c FLD1                          ; D9 E8                [8086,FPU]
-\c FLDL2E                        ; D9 EA                [8086,FPU]
-\c FLDL2T                        ; D9 E9                [8086,FPU]
-\c FLDLG2                        ; D9 EC                [8086,FPU]
-\c FLDLN2                        ; D9 ED                [8086,FPU]
-\c FLDPI                         ; D9 EB                [8086,FPU]
-\c FLDZ                          ; D9 EE                [8086,FPU]
-
-These instructions push specific standard constants on the FPU
-register stack.
-
-\c  Instruction    Constant pushed
-
-\c  FLD1           1
-\c  FLDL2E         base-2 logarithm of e
-\c  FLDL2T         base-2 log of 10
-\c  FLDLG2         base-10 log of 2
-\c  FLDLN2         base-e log of 2
-\c  FLDPI          pi
-\c  FLDZ           zero
-
-
-\S{insFLDCW} \i\c{FLDCW}: Load Floating-Point Control Word
-
-\c FLDCW mem16                   ; D9 /5                [8086,FPU]
-
-\c{FLDCW} loads a 16-bit value out of memory and stores it into the
-FPU control word (governing things like the rounding mode, the
-precision, and the exception masks). See also \c{FSTCW}
-(\k{insFSTCW}). If exceptions are enabled and you don't want to
-generate one, use \c{FCLEX} or \c{FNCLEX} (\k{insFCLEX}) before
-loading the new control word.
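-
-One common use, sketched here under the assumption that \c{oldcw},
-\c{newcw} and \c{result} are hypothetical variables you have reserved
-yourself, is to switch the rounding-control field (bits 10 and 11 of
-the control word) to truncation for a single conversion and then put
-the previous setting back:
-
-\c         fnstcw  word [oldcw]    ; save the current control word
-\c         mov     ax,[oldcw]
-\c         or      ax,0x0C00       ; rounding control = 11b (truncate)
-\c         mov     [newcw],ax
-\c         fldcw   word [newcw]    ; activate truncation
-\c         fistp   dword [result]  ; store ST0 rounded towards zero
-\c         fldcw   word [oldcw]    ; restore the previous rounding mode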
-
-
-\S{insFLDENV} \i\c{FLDENV}: Load Floating-Point Environment
-
-\c FLDENV mem                    ; D9 /4                [8086,FPU]
-
-\c{FLDENV} loads the FPU operating environment (control word, status
-word, tag word, instruction pointer, data pointer and last opcode)
-from memory. The memory area is 14 or 28 bytes long, depending on
-the CPU mode at the time. See also \c{FSTENV} (\k{insFSTENV}).
-
-
-\S{insFMUL} \i\c{FMUL}, \i\c{FMULP}: Floating-Point Multiply
-
-\c FMUL mem32                    ; D8 /1                [8086,FPU]
-\c FMUL mem64                    ; DC /1                [8086,FPU]
-
-\c FMUL fpureg                   ; D8 C8+r              [8086,FPU]
-\c FMUL ST0,fpureg               ; D8 C8+r              [8086,FPU]
-
-\c FMUL TO fpureg                ; DC C8+r              [8086,FPU]
-\c FMUL fpureg,ST0               ; DC C8+r              [8086,FPU]
-
-\c FMULP fpureg                  ; DE C8+r              [8086,FPU]
-\c FMULP fpureg,ST0              ; DE C8+r              [8086,FPU]
-
-\c{FMUL} multiplies \c{ST0} by the given operand, and stores the
-result in \c{ST0}, unless the \c{TO} qualifier is used in which case
-it stores the result in the operand. \c{FMULP} performs the same
-operation as \c{FMUL TO}, and then pops the register stack.
-
-
-\S{insFNOP} \i\c{FNOP}: Floating-Point No Operation
-
-\c FNOP                          ; D9 D0                [8086,FPU]
-
-\c{FNOP} does nothing.
-
-
-\S{insFPATAN} \i\c{FPATAN}, \i\c{FPTAN}: Arctangent and Tangent
-
-\c FPATAN                        ; D9 F3                [8086,FPU]
-\c FPTAN                         ; D9 F2                [8086,FPU]
-
-\c{FPATAN} computes the arctangent, in radians, of the result of
-dividing \c{ST1} by \c{ST0}, stores the result in \c{ST1}, and pops
-the register stack. It works like the C \c{atan2} function, in that
-changing the sign of both \c{ST0} and \c{ST1} changes the output
-value by pi (so it performs true rectangular-to-polar coordinate
-conversion, with \c{ST1} being the Y coordinate and \c{ST0} being
-the X coordinate, not merely an arctangent).
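-
-A short sketch of the \c{atan2}-style usage follows; \c{y}, \c{x} and
-\c{angle} are hypothetical 64-bit variables. The Y coordinate has to
-be pushed first so that it ends up in \c{ST1}:
-
-\c         fld     qword [y]       ; ST0 = y
-\c         fld     qword [x]       ; ST0 = x, ST1 = y
-\c         fpatan                  ; ST0 = angle of (x,y), stack popped once
-\c         fstp    qword [angle]   ; store the angle and pop it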
-
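-\c{FPTAN} computes the tangent of the value in \c{ST0} (in radians),
-stores the result back into \c{ST0}, and then pushes the constant
-1.0 on the register stack, so that the tangent ends up in \c{ST1}
-and \c{ST0} holds 1.0.
-
-For \c{FPTAN}, the absolute value of \c{ST0} must be less than 2**63.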
-
-
-\S{insFPREM} \i\c{FPREM}, \i\c{FPREM1}: Floating-Point Partial Remainder
-
-\c FPREM                         ; D9 F8                [8086,FPU]
-\c FPREM1                        ; D9 F5                [386,FPU]
-
-These instructions both produce the remainder obtained by dividing
-\c{ST0} by \c{ST1}. This is calculated, notionally, by dividing
-\c{ST0} by \c{ST1}, rounding the result to an integer, multiplying
-by \c{ST1} again, and computing the value which would need to be
-added back on to the result to get back to the original value in
-\c{ST0}.
-
-The two instructions differ in the way the notional round-to-integer
-operation is performed. \c{FPREM} does it by rounding towards zero,
-so that the remainder it returns always has the same sign as the
-original value in \c{ST0}; \c{FPREM1} does it by rounding to the
-nearest integer, so that the remainder always has at most half the
-magnitude of \c{ST1}.
-
-Both instructions calculate \e{partial} remainders, meaning that
-they may not manage to provide the final result, but might leave
-intermediate results in \c{ST0} instead. If this happens, they will
-set the C2 flag in the FPU status word; therefore, to calculate a
-remainder, you should repeatedly execute \c{FPREM} or \c{FPREM1}
-until C2 becomes clear.
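-
-The loop described above might be sketched as follows. The variable
-names \c{value} and \c{divisor} and the local label \c{.reduce} are
-hypothetical; the sketch also assumes a 287 or later, since it uses
-the \c{FNSTSW AX} form (\k{insFSTSW}). C2 is bit 10 of the status
-word, which is bit 2 of \c{AH} once the word has been copied to
-\c{AX}.
-
-\c         fld     qword [divisor] ; ST0 = divisor
-\c         fld     qword [value]   ; ST0 = value, ST1 = divisor
-\c .reduce:
-\c         fprem                   ; ST0 = partial remainder
-\c         fnstsw  ax              ; copy the status word to AX
-\c         test    ah,0x04         ; test C2
-\c         jnz     .reduce         ; C2 set: not finished yet
-\c         fstp    st1             ; drop the divisor, keep the remainder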
-
-
-\S{insFRNDINT} \i\c{FRNDINT}: Floating-Point Round to Integer
-
-\c FRNDINT                       ; D9 FC                [8086,FPU]
-
-\c{FRNDINT} rounds the contents of \c{ST0} to an integer, according
-to the current rounding mode set in the FPU control word, and stores
-the result back in \c{ST0}.
-
-
-\S{insFRSTOR} \i\c{FSAVE}, \i\c{FRSTOR}: Save/Restore Floating-Point State
-
-\c FSAVE mem                     ; 9B DD /6             [8086,FPU]
-\c FNSAVE mem                    ; DD /6                [8086,FPU]
-
-\c FRSTOR mem                    ; DD /4                [8086,FPU]
-
-\c{FSAVE} saves the entire floating-point unit state, including all
-the information saved by \c{FSTENV} (\k{insFSTENV}) plus the
-contents of all the registers, to a 94 or 108 byte area of memory
-(depending on the CPU mode). \c{FRSTOR} restores the floating-point
-state from the same area of memory.
-
-\c{FNSAVE} does the same as \c{FSAVE}, without first waiting for
-pending floating-point exceptions to clear.
-
-
-\S{insFSCALE} \i\c{FSCALE}: Scale Floating-Point Value by Power of Two
-
-\c FSCALE                        ; D9 FD                [8086,FPU]
-
-\c{FSCALE} scales a number by a power of two: it rounds \c{ST1}
-towards zero to obtain an integer, then multiplies \c{ST0} by two to
-the power of that integer, and stores the result in \c{ST0}.
-
-
-\S{insFSETPM} \i\c{FSETPM}: Set Protected Mode
-
-\c FSETPM                        ; DB E4                [286,FPU]
-
-This instruction initializes protected mode on the 287 floating-point
-coprocessor. It is only meaningful on that processor: the 387 and
-above treat the instruction as a no-operation.
-
-
-\S{insFSIN} \i\c{FSIN}, \i\c{FSINCOS}: Sine and Cosine
-
-\c FSIN                          ; D9 FE                [386,FPU]
-\c FSINCOS                       ; D9 FB                [386,FPU]
-
-\c{FSIN} calculates the sine of \c{ST0} (in radians) and stores the
-result in \c{ST0}. \c{FSINCOS} does the same, but then pushes the
-cosine of the same value on the register stack, so that the sine
-ends up in \c{ST1} and the cosine in \c{ST0}. \c{FSINCOS} is faster
-than executing \c{FSIN} and \c{FCOS} (see \k{insFCOS}) in succession.
-
-The absolute value of \c{ST0} must be less than 2**63.
-
-
-\S{insFSQRT} \i\c{FSQRT}: Floating-Point Square Root
-
-\c FSQRT                         ; D9 FA                [8086,FPU]
-
-\c{FSQRT} calculates the square root of \c{ST0} and stores the
-result in \c{ST0}.
-
-
-\S{insFST} \i\c{FST}, \i\c{FSTP}: Floating-Point Store
-
-\c FST mem32                     ; D9 /2                [8086,FPU]
-\c FST mem64                     ; DD /2                [8086,FPU]
-\c FST fpureg                    ; DD D0+r              [8086,FPU]
-
-\c FSTP mem32                    ; D9 /3                [8086,FPU]
-\c FSTP mem64                    ; DD /3                [8086,FPU]
-\c FSTP mem80                    ; DB /7                [8086,FPU]
-\c FSTP fpureg                   ; DD D8+r              [8086,FPU]
-
-\c{FST} stores the value in \c{ST0} into the given memory location
-or other FPU register. \c{FSTP} does the same, but then pops the
-register stack.
-
-
-\S{insFSTCW} \i\c{FSTCW}: Store Floating-Point Control Word
-
-\c FSTCW mem16                   ; 9B D9 /7             [8086,FPU]
-\c FNSTCW mem16                  ; D9 /7                [8086,FPU]
-
-\c{FSTCW} stores the \c{FPU} control word (governing things like the
-rounding mode, the precision, and the exception masks) into a 2-byte
-memory area. See also \c{FLDCW} (\k{insFLDCW}).
-
-\c{FNSTCW} does the same thing as \c{FSTCW}, without first waiting
-for pending floating-point exceptions to clear.
-
-
-\S{insFSTENV} \i\c{FSTENV}: Store Floating-Point Environment
-
-\c FSTENV mem                    ; 9B D9 /6             [8086,FPU]
-\c FNSTENV mem                   ; D9 /6                [8086,FPU]
-
-\c{FSTENV} stores the \c{FPU} operating environment (control word,
-status word, tag word, instruction pointer, data pointer and last
-opcode) into memory. The memory area is 14 or 28 bytes long,
-depending on the CPU mode at the time. See also \c{FLDENV}
-(\k{insFLDENV}).
-
-\c{FNSTENV} does the same thing as \c{FSTENV}, without first waiting
-for pending floating-point exceptions to clear.
-
-
-\S{insFSTSW} \i\c{FSTSW}: Store Floating-Point Status Word
-
-\c FSTSW mem16                   ; 9B DD /7             [8086,FPU]
-\c FSTSW AX                      ; 9B DF E0             [286,FPU]
-
-\c FNSTSW mem16                  ; DD /7                [8086,FPU]
-\c FNSTSW AX                     ; DF E0                [286,FPU]
-
-\c{FSTSW} stores the \c{FPU} status word into \c{AX} or into a 2-byte
-memory area.
-
-\c{FNSTSW} does the same thing as \c{FSTSW}, without first waiting
-for pending floating-point exceptions to clear.
-
-
-\S{insFSUB} \i\c{FSUB}, \i\c{FSUBP}, \i\c{FSUBR}, \i\c{FSUBRP}: Floating-Point Subtract
-
-\c FSUB mem32                    ; D8 /4                [8086,FPU]
-\c FSUB mem64                    ; DC /4                [8086,FPU]
-
-\c FSUB fpureg                   ; D8 E0+r              [8086,FPU]
-\c FSUB ST0,fpureg               ; D8 E0+r              [8086,FPU]
-
-\c FSUB TO fpureg                ; DC E8+r              [8086,FPU]
-\c FSUB fpureg,ST0               ; DC E8+r              [8086,FPU]
-
-\c FSUBR mem32                   ; D8 /5                [8086,FPU]
-\c FSUBR mem64                   ; DC /5                [8086,FPU]
-
-\c FSUBR fpureg                  ; D8 E8+r              [8086,FPU]
-\c FSUBR ST0,fpureg              ; D8 E8+r              [8086,FPU]
-
-\c FSUBR TO fpureg               ; DC E0+r              [8086,FPU]
-\c FSUBR fpureg,ST0              ; DC E0+r              [8086,FPU]
-
-\c FSUBP fpureg                  ; DE E8+r              [8086,FPU]
-\c FSUBP fpureg,ST0              ; DE E8+r              [8086,FPU]
-
-\c FSUBRP fpureg                 ; DE E0+r              [8086,FPU]
-\c FSUBRP fpureg,ST0             ; DE E0+r              [8086,FPU]
-
-\b \c{FSUB} subtracts the given operand from \c{ST0} and stores the
-result back in \c{ST0}, unless the \c{TO} qualifier is given, in
-which case it subtracts \c{ST0} from the given operand and stores
-the result in the operand.
-
-\b \c{FSUBR} does the same thing, but does the subtraction the other
-way up: so if \c{TO} is not given, it subtracts \c{ST0} from the given
-operand and stores the result in \c{ST0}, whereas if \c{TO} is given
-it subtracts its operand from \c{ST0} and stores the result in the
-operand.
-
-\b \c{FSUBP} operates like \c{FSUB TO}, but pops the register stack
-once it has finished.
-
-\b \c{FSUBRP} operates like \c{FSUBR TO}, but pops the register stack
-once it has finished.
-
-
-\S{insFTST} \i\c{FTST}: Test \c{ST0} Against Zero
-
-\c FTST                          ; D9 E4                [8086,FPU]
-
-\c{FTST} compares \c{ST0} with zero and sets the FPU flags
-accordingly. \c{ST0} is treated as the left-hand side of the
-comparison, so that a `less-than' result is generated if \c{ST0} is
-negative.
-
-
-\S{insFUCOM} \i\c{FUCOMxx}: Floating-Point Unordered Compare
-
-\c FUCOM fpureg                  ; DD E0+r              [386,FPU]
-\c FUCOM ST0,fpureg              ; DD E0+r              [386,FPU]
-
-\c FUCOMP fpureg                 ; DD E8+r              [386,FPU]
-\c FUCOMP ST0,fpureg             ; DD E8+r              [386,FPU]
-
-\c FUCOMPP                       ; DA E9                [386,FPU]
-
-\c FUCOMI fpureg                 ; DB E8+r              [P6,FPU]
-\c FUCOMI ST0,fpureg             ; DB E8+r              [P6,FPU]
-
-\c FUCOMIP fpureg                ; DF E8+r              [P6,FPU]
-\c FUCOMIP ST0,fpureg            ; DF E8+r              [P6,FPU]
-
-\b \c{FUCOM} compares \c{ST0} with the given operand, and sets the
-FPU flags accordingly. \c{ST0} is treated as the left-hand side of
-the comparison, so that the \c{C0} flag is set (for a `less-than'
-result) if \c{ST0} is less than the given operand. \c{C0} becomes the
-CPU carry flag once the status word has been copied to the flags
-with \c{FSTSW AX} and \c{SAHF}.
-
-\b \c{FUCOMP} does the same as \c{FUCOM}, but pops the register stack
-afterwards. \c{FUCOMPP} compares \c{ST0} with \c{ST1} and then pops
-the register stack twice.
-
-\b \c{FUCOMI} and \c{FUCOMIP} work like the corresponding forms of
-\c{FUCOM} and \c{FUCOMP}, but write their results directly to the CPU
-flags register rather than the FPU status word, so they can be
-immediately followed by conditional jump or conditional move
-instructions.
-
-The \c{FUCOM} instructions differ from the \c{FCOM} instructions
-(\k{insFCOM}) only in the way they handle quiet NaNs: \c{FUCOM} will
-handle them silently and set the condition code flags to an
-`unordered' result, whereas \c{FCOM} will generate an exception.
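-
-As an illustration of the flag-setting forms, the sketch below
-compares two hypothetical 64-bit variables \c{a} and \c{b} on a P6 or
-later processor; the local labels are also hypothetical. The
-`unordered' case is tested first because it sets the carry and zero
-flags as well as the parity flag.
-
-\c         fld     qword [b]
-\c         fld     qword [a]       ; ST0 = a, ST1 = b
-\c         fucomip st0,st1         ; compare a with b, then pop a
-\c         fstp    st0             ; pop b too; CPU flags are unchanged
-\c         jp      .unordered      ; PF set: an operand was a NaN
-\c         jb      .a_less         ; CF set: a < b
-\c         je      .equal          ; ZF set: a = b
-\c                                 ; otherwise a > b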
-
-
-\S{insFXAM} \i\c{FXAM}: Examine Class of Value in \c{ST0}
-
-\c FXAM                          ; D9 E5                [8086,FPU]
-
-\c{FXAM} sets the FPU flags \c{C3}, \c{C2} and \c{C0} depending on
-the type of value stored in \c{ST0}:
-
-\c  Register contents     Flags
-
-\c  Unsupported format    000
-\c  NaN                   001
-\c  Finite number         010
-\c  Infinity              011
-\c  Zero                  100
-\c  Empty register        101
-\c  Denormal              110
-
-Additionally, the \c{C1} flag is set to the sign of the number.
-
-
-\S{insFXCH} \i\c{FXCH}: Floating-Point Exchange
-
-\c FXCH                          ; D9 C9                [8086,FPU]
-\c FXCH fpureg                   ; D9 C8+r              [8086,FPU]
-\c FXCH fpureg,ST0               ; D9 C8+r              [8086,FPU]
-\c FXCH ST0,fpureg               ; D9 C8+r              [8086,FPU]
-
-\c{FXCH} exchanges \c{ST0} with a given FPU register. The no-operand
-form exchanges \c{ST0} with \c{ST1}.
-
-
-\S{insFXRSTOR} \i\c{FXRSTOR}: Restore \c{FP}, \c{MMX} and \c{SSE} State
-
-\c FXRSTOR memory                ; 0F AE /1             [P6,SSE,FPU]
-
-The \c{FXRSTOR} instruction reloads the \c{FPU}, \c{MMX} and \c{SSE}
-state (environment and registers), from the 512 byte memory area defined
-by the source operand. This data should have been written by a previous
-\c{FXSAVE}.
-
-
-\S{insFXSAVE} \i\c{FXSAVE}: Store \c{FP}, \c{MMX} and \c{SSE} State
-
-\c FXSAVE memory                 ; 0F AE /0             [P6,SSE,FPU]
-
-The \c{FXSAVE} instruction writes the current \c{FPU}, \c{MMX}
-and \c{SSE} technology states (environment and registers), to the
-512 byte memory area defined by the destination operand. It does this
-without checking for pending unmasked floating-point exceptions
-(similar to the operation of \c{FNSAVE}).
-
-Unlike the \c{FSAVE/FNSAVE} instructions, the processor retains the
-contents of the \c{FPU}, \c{MMX} and \c{SSE} state in the processor
-after the state has been saved. This instruction has been optimized
-to maximize floating-point save performance.
-
-
-\S{insFXTRACT} \i\c{FXTRACT}: Extract Exponent and Significand
-
-\c FXTRACT                       ; D9 F4                [8086,FPU]
-
-\c{FXTRACT} separates the number in \c{ST0} into its exponent and
-significand (mantissa), stores the exponent back into \c{ST0}, and
-then pushes the significand on the register stack (so that the
-significand ends up in \c{ST0}, and the exponent in \c{ST1}).
-
-
-\S{insFYL2X} \i\c{FYL2X}, \i\c{FYL2XP1}: Compute Y times Log2(X) or Log2(X+1)
-
-\c FYL2X                         ; D9 F1                [8086,FPU]
-\c FYL2XP1                       ; D9 F9                [8086,FPU]
-
-\c{FYL2X} multiplies \c{ST1} by the base-2 logarithm of \c{ST0},
-stores the result in \c{ST1}, and pops the register stack (so that
-the result ends up in \c{ST0}). \c{ST0} must be non-zero and
-positive.
-
-\c{FYL2XP1} works the same way, but replacing the base-2 log of
-\c{ST0} with that of \c{ST0} plus one. This time, \c{ST0} must have
-magnitude no greater than 1 minus half the square root of two.
-
-
-\S{insHLT} \i\c{HLT}: Halt Processor
-
-\c HLT                           ; F4                   [8086,PRIV]
-
-\c{HLT} puts the processor into a halted state, where it will
-perform no more operations until restarted by an interrupt or a
-reset.
-
-On the 286 and later processors, this is a privileged instruction.
-
-
-\S{insIBTS} \i\c{IBTS}: Insert Bit String
-
-\c IBTS r/m16,reg16              ; o16 0F A7 /r         [386,UNDOC]
-\c IBTS r/m32,reg32              ; o32 0F A7 /r         [386,UNDOC]
-
-The implied operation of this instruction is:
-
-\c IBTS r/m16,AX,CL,reg16
-\c IBTS r/m32,EAX,CL,reg32
-
-Writes a bit string from the source operand to the destination.
-\c{CL} indicates the number of bits to be copied, from the low bits
-of the source. \c{(E)AX} indicates the low order bit offset in the
-destination that is written to. For example, if \c{CL} is set to 4
-and \c{AX} (for 16-bit code) is set to 5, bits 0-3 of \c{src} will
-be copied to bits 5-8 of \c{dst}. This instruction is very poorly
-documented, and I have been unable to find any official source of
-documentation on it.
-
-\c{IBTS} is supported only on the early Intel 386s, and conflicts
-with the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM
-supports it only for completeness. Its counterpart is \c{XBTS}
-(see \k{insXBTS}).
-
-
-\S{insIDIV} \i\c{IDIV}: Signed Integer Divide
-
-\c IDIV r/m8                     ; F6 /7                [8086]
-\c IDIV r/m16                    ; o16 F7 /7            [8086]
-\c IDIV r/m32                    ; o32 F7 /7            [386]
-
-\c{IDIV} performs signed integer division. The explicit operand
-provided is the divisor; the dividend and destination operands
-are implicit, in the following way:
-
-\b For \c{IDIV r/m8}, \c{AX} is divided by the given operand;
-the quotient is stored in \c{AL} and the remainder in \c{AH}.
-
-\b For \c{IDIV r/m16}, \c{DX:AX} is divided by the given operand;
-the quotient is stored in \c{AX} and the remainder in \c{DX}.
-
-\b For \c{IDIV r/m32}, \c{EDX:EAX} is divided by the given operand;
-the quotient is stored in \c{EAX} and the remainder in \c{EDX}.
-
-Unsigned integer division is performed by the \c{DIV} instruction:
-see \k{insDIV}.
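-
-Because the dividend is twice as wide as the divisor, the upper half
-normally has to be prepared by sign extension before dividing. A
-minimal 32-bit sketch, in which \c{dividend} and \c{divisor} are
-hypothetical variables:
-
-\c         mov     eax,[dividend]  ; low half of the dividend
-\c         cdq                     ; sign-extend EAX into EDX:EAX
-\c         idiv    dword [divisor] ; EAX = quotient, EDX = remainder
-
-A divisor of zero, or a quotient too large for the destination,
-raises a divide error exception.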
-
-
-\S{insIMUL} \i\c{IMUL}: Signed Integer Multiply
-
-\c IMUL r/m8                     ; F6 /5                [8086]
-\c IMUL r/m16                    ; o16 F7 /5            [8086]
-\c IMUL r/m32                    ; o32 F7 /5            [386]
-
-\c IMUL reg16,r/m16              ; o16 0F AF /r         [386]
-\c IMUL reg32,r/m32              ; o32 0F AF /r         [386]
-
-\c IMUL reg16,imm8               ; o16 6B /r ib         [186]
-\c IMUL reg16,imm16              ; o16 69 /r iw         [186]
-\c IMUL reg32,imm8               ; o32 6B /r ib         [386]
-\c IMUL reg32,imm32              ; o32 69 /r id         [386]
-
-\c IMUL reg16,r/m16,imm8         ; o16 6B /r ib         [186]
-\c IMUL reg16,r/m16,imm16        ; o16 69 /r iw         [186]
-\c IMUL reg32,r/m32,imm8         ; o32 6B /r ib         [386]
-\c IMUL reg32,r/m32,imm32        ; o32 69 /r id         [386]
-
-\c{IMUL} performs signed integer multiplication. For the
-single-operand form, the other operand and destination are
-implicit, in the following way:
-
-\b For \c{IMUL r/m8}, \c{AL} is multiplied by the given operand;
-the product is stored in \c{AX}.
-
-\b For \c{IMUL r/m16}, \c{AX} is multiplied by the given operand;
-the product is stored in \c{DX:AX}.
-
-\b For \c{IMUL r/m32}, \c{EAX} is multiplied by the given operand;
-the product is stored in \c{EDX:EAX}.
-
-The two-operand form multiplies its two operands and stores the
-result in the destination (first) operand. The three-operand
-form multiplies its last two operands and stores the result in
-the first operand.
-
-The two-operand form with an immediate second operand is in
-fact a shorthand for the three-operand form, as can be seen by
-examining the opcode descriptions: in the two-operand form, the
-code \c{/r} takes both its register and \c{r/m} parts from the
-same operand (the first one).
-
-In the forms with an 8-bit immediate operand and another longer
-source operand, the immediate operand is considered to be signed,
-and is sign-extended to the length of the other source operand.
-In these cases, the \c{BYTE} qualifier is necessary to force
-NASM to generate this form of the instruction.
-
-Unsigned integer multiplication is performed by the \c{MUL}
-instruction: see \k{insMUL}.
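-
-The lines below sketch one instance of each group of forms; they are
-independent illustrations rather than a meaningful sequence, and the
-register choices are arbitrary:
-
-\c         imul    bx              ; DX:AX = AX * BX
-\c         imul    eax,ecx         ; EAX = EAX * ECX
-\c         imul    eax,byte 10     ; EAX = EAX * 10, 8-bit immediate
-\c         imul    eax,ecx,1000    ; EAX = ECX * 1000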
-
-
-\S{insIN} \i\c{IN}: Input from I/O Port
-
-\c IN AL,imm8                    ; E4 ib                [8086]
-\c IN AX,imm8                    ; o16 E5 ib            [8086]
-\c IN EAX,imm8                   ; o32 E5 ib            [386]
-\c IN AL,DX                      ; EC                   [8086]
-\c IN AX,DX                      ; o16 ED               [8086]
-\c IN EAX,DX                     ; o32 ED               [386]
-
-\c{IN} reads a byte, word or doubleword from the specified I/O port,
-and stores it in the given destination register. The port number may
-be specified as an immediate value if it is between 0 and 255, and
-otherwise must be stored in \c{DX}. See also \c{OUT} (\k{insOUT}).
-
-
-\S{insINC} \i\c{INC}: Increment Integer
-
-\c INC reg16                     ; o16 40+r             [8086]
-\c INC reg32                     ; o32 40+r             [386]
-\c INC r/m8                      ; FE /0                [8086]
-\c INC r/m16                     ; o16 FF /0            [8086]
-\c INC r/m32                     ; o32 FF /0            [386]
-
-\c{INC} adds 1 to its operand. It does \e{not} affect the carry
-flag: to affect the carry flag, use \c{ADD something,1} (see
-\k{insADD}). \c{INC} affects all the other flags according to the result.
-
-This instruction can be used with a \c{LOCK} prefix to allow atomic execution.
-
-See also \c{DEC} (\k{insDEC}).
-
-
-\S{insINSB} \i\c{INSB}, \i\c{INSW}, \i\c{INSD}: Input String from I/O Port
-
-\c INSB                          ; 6C                   [186]
-\c INSW                          ; o16 6D               [186]
-\c INSD                          ; o32 6D               [386]
-
-\c{INSB} inputs a byte from the I/O port specified in \c{DX} and
-stores it at \c{[ES:DI]} or \c{[ES:EDI]}. It then increments or
-decrements (depending on the direction flag: increments if the flag
-is clear, decrements if it is set) \c{DI} or \c{EDI}.
-
-The register used is \c{DI} if the address size is 16 bits, and
-\c{EDI} if it is 32 bits. If you need to use an address size not
-equal to the current \c{BITS} setting, you can use an explicit
-\i\c{a16} or \i\c{a32} prefix.
-
-Segment override prefixes have no effect for this instruction: the
-use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be
-overridden.
-
-\c{INSW} and \c{INSD} work in the same way, but they input a word or
-a doubleword instead of a byte, and increment or decrement the
-addressing register by 2 or 4 instead of 1.
-
-The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
-\c{ECX} - again, the address size chooses which) times.
-
-See also \c{OUTSB}, \c{OUTSW} and \c{OUTSD} (\k{insOUTSB}).
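-
-A sketch of a repeated transfer in 16-bit code follows. The port
-number \c{0x1F0} and the label \c{buffer} are purely illustrative
-assumptions, and \c{ES} is assumed to have been set up already:
-
-\c         cld                     ; direction flag clear: DI counts up
-\c         mov     dx,0x1F0        ; hypothetical port number
-\c         mov     di,buffer       ; ES:DI points at the destination
-\c         mov     cx,256          ; number of words to read
-\c         rep insw                ; read CX words from port DX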
-
-
-\S{insINT} \i\c{INT}: Software Interrupt
-
-\c INT imm8                      ; CD ib                [8086]
-
-\c{INT} causes a software interrupt through a specified vector
-number from 0 to 255.
-
-The code generated by the \c{INT} instruction is always two bytes
-long: although there are short forms for some \c{INT} instructions,
-NASM does not generate them when it sees the \c{INT} mnemonic. In
-order to generate single-byte breakpoint instructions, use the
-\c{INT3} or \c{INT1} instructions (see \k{insINT1}) instead.
-
-
-\S{insINT1} \i\c{INT3}, \i\c{INT1}, \i\c{ICEBP}, \i\c{INT01}: Breakpoints
-
-\c INT1                          ; F1                   [P6]
-\c ICEBP                         ; F1                   [P6]
-\c INT01                         ; F1                   [P6]
-
-\c INT3                          ; CC                   [8086]
-\c INT03                         ; CC                   [8086]
-
-\c{INT1} and \c{INT3} are short one-byte forms of the instructions
-\c{INT 1} and \c{INT 3} (see \k{insINT}). They perform a similar
-function to their longer counterparts, but take up less code space.
-They are used as breakpoints by debuggers.
-
-\b \c{INT1}, and its alternative synonyms \c{INT01} and \c{ICEBP}, is
-an instruction used by in-circuit emulators (ICEs). It is present,
-though not documented, on some processors down to the 286, but is
-only documented for the Pentium Pro. \c{INT3} is the instruction
-normally used as a breakpoint by debuggers.
-
-\b \c{INT3}, and its synonym \c{INT03}, is not precisely equivalent to
-\c{INT 3}: the short form, since it is designed to be used as a
-breakpoint, bypasses the normal \c{IOPL} checks in virtual-8086 mode,
-and also does not go through interrupt redirection.
-
-
-\S{insINTO} \i\c{INTO}: Interrupt if Overflow
-
-\c INTO                          ; CE                   [8086]
-
-\c{INTO} performs an \c{INT 4} software interrupt (see \k{insINT})
-if and only if the overflow flag is set.
-
-
-\S{insINVD} \i\c{INVD}: Invalidate Internal Caches
-
-\c INVD                          ; 0F 08                [486]
-
-\c{INVD} invalidates and empties the processor's internal caches,
-and causes the processor to instruct external caches to do the same.
-It does not write the contents of the caches back to memory first:
-any modified data held in the caches will be lost. To write the data
-back first, use \c{WBINVD} (\k{insWBINVD}).
-
-
-\S{insINVLPG} \i\c{INVLPG}: Invalidate TLB Entry
-
-\c INVLPG mem                    ; 0F 01 /7             [486]
-
-\c{INVLPG} invalidates the translation lookaside buffer (TLB) entry
-associated with the supplied memory address.
-
-
-\S{insIRET} \i\c{IRET}, \i\c{IRETW}, \i\c{IRETD}: Return from Interrupt
-
-\c IRET                          ; CF                   [8086]
-\c IRETW                         ; o16 CF               [8086]
-\c IRETD                         ; o32 CF               [386]
-
-\c{IRET} returns from an interrupt (hardware or software) by means
-of popping \c{IP} (or \c{EIP}), \c{CS} and the flags off the stack
-and then continuing execution from the new \c{CS:IP}.
-
-\c{IRETW} pops \c{IP}, \c{CS} and the flags as 2 bytes each, taking
-6 bytes off the stack in total. \c{IRETD} pops \c{EIP} as 4 bytes,
-pops a further 4 bytes of which the top two are discarded and the
-bottom two go into \c{CS}, and pops the flags as 4 bytes as well,
-taking 12 bytes off the stack.
-
-\c{IRET} is a shorthand for either \c{IRETW} or \c{IRETD}, depending
-on the default \c{BITS} setting at the time.
-
-
-\S{insJcc} \i\c{Jcc}: Conditional Branch
-
-\c Jcc imm                       ; 70+cc rb             [8086]
-\c Jcc NEAR imm                  ; 0F 80+cc rw/rd       [386]
-
-The \i{conditional jump} instructions execute a near (same segment)
-jump if and only if their conditions are satisfied. For example,
-\c{JNZ} jumps only if the zero flag is not set.
-
-The ordinary form of the instructions has only a short range (a
-signed 8-bit displacement, reaching from 128 bytes before to 127
-bytes after the end of the instruction); the \c{NEAR} form is a 386
-extension to the instruction set, and can span the full size of a
-segment. NASM will not override your choice of jump instruction: if
-you want \c{Jcc NEAR}, you have to use the \c{NEAR} keyword.
-
-The \c{SHORT} keyword is allowed on the first form of the
-instruction, for clarity, but is not necessary.
-
-For details of the condition codes, see \k{iref-cc}.
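-
-For instance, assuming a hypothetical local label \c{.differ}, the
-two forms would be written as shown below. The lines are alternatives
-for comparison, not a sequence to be assembled together.
-
-\c         cmp     ax,bx
-\c         jne     .differ         ; short form, 70+cc rb
-\c         jne     near .differ    ; near form, 0F 80+cc rw/rd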
-
-
-\S{insJCXZ} \i\c{JCXZ}, \i\c{JECXZ}: Jump if CX/ECX Zero
-
-\c JCXZ imm                      ; a16 E3 rb            [8086]
-\c JECXZ imm                     ; a32 E3 rb            [386]
-
-\c{JCXZ} performs a short jump (with maximum range 128 bytes) if and
-only if the contents of the \c{CX} register is 0. \c{JECXZ} does the
-same thing, but with \c{ECX}.
-
-
-\S{insJMP} \i\c{JMP}: Jump
-
-\c JMP imm                       ; E9 rw/rd             [8086]
-\c JMP SHORT imm                 ; EB rb                [8086]
-\c JMP imm:imm16                 ; o16 EA iw iw         [8086]
-\c JMP imm:imm32                 ; o32 EA id iw         [386]
-\c JMP FAR mem                   ; o16 FF /5            [8086]
-\c JMP FAR mem32                 ; o32 FF /5            [386]
-\c JMP r/m16                     ; o16 FF /4            [8086]
-\c JMP r/m32                     ; o32 FF /4            [386]
-
-\c{JMP} jumps to a given address. The address may be specified as an
-absolute segment and offset, or as a relative jump within the
-current segment.
-
-\c{JMP SHORT imm} has a maximum range of 128 bytes, since the
-displacement is specified as only 8 bits, but takes up less code
-space. NASM does not choose when to generate \c{JMP SHORT} for you:
-you must explicitly code \c{SHORT} every time you want a short jump.
-
-You can choose between the two immediate \i{far jump} forms (\c{JMP
-imm:imm}) by the use of the \c{WORD} and \c{DWORD} keywords: \c{JMP
-WORD 0x1234:0x5678} or \c{JMP DWORD 0x1234:0x56789abc}.
-
-The \c{JMP FAR mem} forms execute a far jump by loading the
-destination address out of memory. The address loaded consists of 16
-or 32 bits of offset (depending on the operand size), and 16 bits of
-segment. The operand size may be overridden using \c{JMP WORD FAR
-mem} or \c{JMP DWORD FAR mem}.
-
-The \c{JMP r/m} forms execute a \i{near jump} (within the same
-segment), loading the destination address out of memory or out of a
-register. The keyword \c{NEAR} may be specified, for clarity, in
-these forms, but is not necessary. Again, operand size can be
-overridden using \c{JMP WORD mem} or \c{JMP DWORD mem}.
-
-As a convenience, NASM does not require you to jump to a far symbol
-by coding the cumbersome \c{JMP SEG routine:routine}, but instead
-allows the easier synonym \c{JMP FAR routine}.
-
-The \c{JMP r/m} forms given above are near jumps; NASM will accept
-the \c{NEAR} keyword (e.g. \c{JMP NEAR [address]}), even though it
-is not strictly necessary.
-
-
-\S{insLAHF} \i\c{LAHF}: Load AH from Flags
-
-\c LAHF                          ; 9F                   [8086]
-
-\c{LAHF} sets the \c{AH} register according to the contents of the
-low byte of the flags word.
-
-The operation of \c{LAHF} is:
-
-\c  AH <-- SF:ZF:0:AF:0:PF:1:CF
-
-See also \c{SAHF} (\k{insSAHF}).
-
-
-\S{insLAR} \i\c{LAR}: Load Access Rights
-
-\c LAR reg16,r/m16               ; o16 0F 02 /r         [286,PRIV]
-\c LAR reg32,r/m32               ; o32 0F 02 /r         [286,PRIV]
-
-\c{LAR} takes the segment selector specified by its source (second)
-operand, finds the corresponding segment descriptor in the GDT or
-LDT, and loads the access-rights byte of the descriptor into its
-destination (first) operand.
-
-
-\S{insLDMXCSR} \i\c{LDMXCSR}: Load Streaming SIMD Extension
- Control/Status
-
-\c LDMXCSR mem32                 ; 0F AE /2        [KATMAI,SSE]
-
-\c{LDMXCSR} loads 32 bits of data from the specified memory location
-into the \c{MXCSR} control/status register. \c{MXCSR} is used to
-enable masked/unmasked exception handling, to set rounding modes,
-to set flush-to-zero mode, and to view exception status flags.
-
-For details of the \c{MXCSR} register, see the Intel processor docs.
-
-See also \c{STMXCSR} (\k{insSTMXCSR}).
-
-
-\S{insLDS} \i\c{LDS}, \i\c{LES}, \i\c{LFS}, \i\c{LGS}, \i\c{LSS}: Load Far Pointer
-
-\c LDS reg16,mem                 ; o16 C5 /r            [8086]
-\c LDS reg32,mem                 ; o32 C5 /r            [386]
-
-\c LES reg16,mem                 ; o16 C4 /r            [8086]
-\c LES reg32,mem                 ; o32 C4 /r            [386]
-
-\c LFS reg16,mem                 ; o16 0F B4 /r         [386]
-\c LFS reg32,mem                 ; o32 0F B4 /r         [386]
-
-\c LGS reg16,mem                 ; o16 0F B5 /r         [386]
-\c LGS reg32,mem                 ; o32 0F B5 /r         [386]
-
-\c LSS reg16,mem                 ; o16 0F B2 /r         [386]
-\c LSS reg32,mem                 ; o32 0F B2 /r         [386]
-
-These instructions load an entire far pointer (16 or 32 bits of
-offset, plus 16 bits of segment) out of memory in one go. \c{LDS},
-for example, loads 16 or 32 bits from the given memory address into
-the given register (depending on the size of the register), then
-loads the \e{next} 16 bits from memory into \c{DS}. \c{LES},
-\c{LFS}, \c{LGS} and \c{LSS} work in the same way but use the other
-segment registers.
-
-
-\S{insLEA} \i\c{LEA}: Load Effective Address
-
-\c LEA reg16,mem                 ; o16 8D /r            [8086]
-\c LEA reg32,mem                 ; o32 8D /r            [386]
-
-\c{LEA}, despite its syntax, does not access memory. It calculates
-the effective address specified by its second operand as if it were
-going to load or store data from it, but instead it stores the
-calculated address into the register specified by its first operand.
-This can be used to perform quite complex calculations (e.g. \c{LEA
-EAX,[EBX+ECX*4+100]}) in one instruction.
-
-\c{LEA}, despite being a purely arithmetic instruction which
-accesses no memory, still requires square brackets around its second
-operand, as if it were a memory reference.
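-
-For instance, the following sketch (the register choices are
-arbitrary) uses \c{LEA} to multiply by five and to add two registers
-and a constant, each in a single instruction:
-
-\c         lea     eax,[ebx+ebx*4]     ; EAX = EBX * 5
-\c         lea     ecx,[eax+edx+8]     ; ECX = EAX + EDX + 8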
-
-The size of the calculation is the current \e{address} size, and the
-size that the result is stored as is the current \e{operand} size.
-If the address and operand sizes differ, the result is adjusted to
-fit: a 32-bit address is truncated to its low 16 bits, and a 16-bit
-address is zero-extended to 32 bits before it is stored.
-
-
-\S{insLEAVE} \i\c{LEAVE}: Destroy Stack Frame
-
-\c LEAVE                         ; C9                   [186]
-
-\c{LEAVE} destroys a stack frame of the form created by the
-\c{ENTER} instruction (see \k{insENTER}). It is functionally
-equivalent to \c{MOV ESP,EBP} followed by \c{POP EBP} (or \c{MOV
-SP,BP} followed by \c{POP BP} in 16-bit mode).
-
-
-\S{insLFENCE} \i\c{LFENCE}: Load Fence
-
-\c LFENCE                        ; 0F AE /5        [WILLAMETTE,SSE2]
-
-\c{LFENCE} performs a serialising operation on all loads from memory
-that were issued before the \c{LFENCE} instruction. This guarantees that
-all memory reads before the \c{LFENCE} instruction are visible before any
-reads after the \c{LFENCE} instruction.
-
-\c{LFENCE} is ordered with respect to other \c{LFENCE} instructions,
-\c{MFENCE}, any memory read and any other serialising instruction
-(such as \c{CPUID}).
-
-Weakly ordered memory types can be used to achieve higher processor
-performance through such techniques as out-of-order issue and
-speculative reads. The degree to which a consumer of data recognizes
-or knows that the data is weakly ordered varies among applications
-and may be unknown to the producer of this data. The \c{LFENCE}
-instruction provides a performance-efficient way of ensuring load
-ordering between routines that produce weakly-ordered results and
-routines that consume that data.
-
-\c{LFENCE} uses the following ModRM encoding:
-
-\c           Mod (7:6)        = 11B
-\c           Reg/Opcode (5:3) = 101B
-\c           R/M (2:0)        = 000B
-
-All other ModRM encodings are defined to be reserved, and use
-of these encodings risks incompatibility with future processors.
-
-See also \c{SFENCE} (\k{insSFENCE}) and \c{MFENCE} (\k{insMFENCE}).
-
-
-\S{insLGDT} \i\c{LGDT}, \i\c{LIDT}, \i\c{LLDT}: Load Descriptor Tables
-
-\c LGDT mem                      ; 0F 01 /2             [286,PRIV]
-\c LIDT mem                      ; 0F 01 /3             [286,PRIV]
-\c LLDT r/m16                    ; 0F 00 /2             [286,PRIV]
-
-\c{LGDT} and \c{LIDT} both take a 6-byte memory area as an operand:
-they load a 16-bit size limit and a 32-bit linear address from that
-area (in the opposite order) into the \c{GDTR} (global descriptor table
-register) or \c{IDTR} (interrupt descriptor table register). These are
-the only instructions which directly use \e{linear} addresses, rather
-than segment/offset pairs.
-
-\c{LLDT} takes a segment selector as an operand. The processor looks
-up that selector in the GDT and stores the limit and base address
-given there into the \c{LDTR} (local descriptor table register).
-
-See also \c{SGDT}, \c{SIDT} and \c{SLDT} (\k{insSGDT}).
-
-
-\S{insLMSW} \i\c{LMSW}: Load Machine Status Word
-
-\c LMSW r/m16                    ; 0F 01 /6             [286,PRIV]
-
-\c{LMSW} loads the bottom four bits of the source operand into the
-bottom four bits of the \c{CR0} control register (or the Machine
-Status Word, on 286 processors). See also \c{SMSW} (\k{insSMSW}).
-
-
-\S{insLOADALL} \i\c{LOADALL}, \i\c{LOADALL286}: Load Processor State
-
-\c LOADALL                       ; 0F 07                [386,UNDOC]
-\c LOADALL286                    ; 0F 05                [286,UNDOC]
-
-This instruction, in its two different-opcode forms, is apparently
-supported on most 286 processors, some 386 and possibly some 486.
-The opcode differs between the 286 and the 386.
-
-The function of the instruction is to load all information relating
-to the state of the processor out of a block of memory: on the 286,
-this block is located implicitly at absolute address \c{0x800}, and
-on the 386 and 486 it is at \c{[ES:EDI]}.
-
-
-\S{insLODSB} \i\c{LODSB}, \i\c{LODSW}, \i\c{LODSD}: Load from String
-
-\c LODSB                         ; AC                   [8086]
-\c LODSW                         ; o16 AD               [8086]
-\c LODSD                         ; o32 AD               [386]
-
-\c{LODSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} into \c{AL}.
-It then increments or decrements (depending on the direction flag:
-increments if the flag is clear, decrements if it is set) \c{SI} or
-\c{ESI}.
-
-The register used is \c{SI} if the address size is 16 bits, and
-\c{ESI} if it is 32 bits. If you need to use an address size not
-equal to the current \c{BITS} setting, you can use an explicit
-\i\c{a16} or \i\c{a32} prefix.
-
-The segment register used to load from \c{[SI]} or \c{[ESI]} can be
-overridden by using a segment register name as a prefix (for
-example, \c{ES LODSB}).
-
-\c{LODSW} and \c{LODSD} work in the same way, but they load a
-word or a doubleword instead of a byte, and increment or decrement
-the addressing registers by 2 or 4 instead of 1.
-
-
-\S{insLOOP} \i\c{LOOP}, \i\c{LOOPE}, \i\c{LOOPZ}, \i\c{LOOPNE}, \i\c{LOOPNZ}: Loop with Counter
-
-\c LOOP imm                      ; E2 rb                [8086]
-\c LOOP imm,CX                   ; a16 E2 rb            [8086]
-\c LOOP imm,ECX                  ; a32 E2 rb            [386]
-
-\c LOOPE imm                     ; E1 rb                [8086]
-\c LOOPE imm,CX                  ; a16 E1 rb            [8086]
-\c LOOPE imm,ECX                 ; a32 E1 rb            [386]
-\c LOOPZ imm                     ; E1 rb                [8086]
-\c LOOPZ imm,CX                  ; a16 E1 rb            [8086]
-\c LOOPZ imm,ECX                 ; a32 E1 rb            [386]
-
-\c LOOPNE imm                    ; E0 rb                [8086]
-\c LOOPNE imm,CX                 ; a16 E0 rb            [8086]
-\c LOOPNE imm,ECX                ; a32 E0 rb            [386]
-\c LOOPNZ imm                    ; E0 rb                [8086]
-\c LOOPNZ imm,CX                 ; a16 E0 rb            [8086]
-\c LOOPNZ imm,ECX                ; a32 E0 rb            [386]
-
-\c{LOOP} decrements its counter register (either \c{CX} or \c{ECX} -
-if one is not specified explicitly, the \c{BITS} setting dictates
-which is used) by one, and if the counter does not become zero as a
-result of this operation, it jumps to the given label. The jump has
-a range of 128 bytes.
-
-\c{LOOPE} (or its synonym \c{LOOPZ}) adds the additional condition
-that it only jumps if the counter is nonzero \e{and} the zero flag
-is set. Similarly, \c{LOOPNE} (and \c{LOOPNZ}) jumps only if the
-counter is nonzero and the zero flag is clear.
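-
-A minimal sketch of a counted loop, assuming a hypothetical label
-\c{next} and forcing \c{ECX} as the counter regardless of the
-\c{BITS} setting:
-
-\c         mov     ecx,10          ; iteration count
-\c         xor     eax,eax         ; running total
-\c next:   add     eax,ecx         ; adds 10, 9, ..., 1
-\c         loop    next,ecx        ; decrement ECX, jump if nonzero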
-
-
-\S{insLSL} \i\c{LSL}: Load Segment Limit
-
-\c LSL reg16,r/m16               ; o16 0F 03 /r         [286,PRIV]
-\c LSL reg32,r/m32               ; o32 0F 03 /r         [286,PRIV]
-
-\c{LSL} is given a segment selector in its source (second) operand;
-it computes the segment limit value by loading the segment limit
-field from the associated segment descriptor in the \c{GDT} or \c{LDT}.
-(This involves shifting left by 12 bits if the segment limit is
-page-granular, and not if it is byte-granular; so you end up with a
-byte limit in either case.) The segment limit obtained is then
-loaded into the destination (first) operand.
-
-
-\S{insLTR} \i\c{LTR}: Load Task Register
-
-\c LTR r/m16                     ; 0F 00 /3             [286,PRIV]
-
-\c{LTR} looks up the segment base and limit in the GDT or LDT
-descriptor specified by the segment selector given as its operand,
-and loads them into the Task Register.
-
-
-\S{insMASKMOVDQU} \i\c{MASKMOVDQU}: Byte Mask Write
-
-\c MASKMOVDQU xmm1,xmm2          ; 66 0F F7 /r     [WILLAMETTE,SSE2]
-
-\c{MASKMOVDQU} stores data from xmm1 to the location specified by
-\c{DS:(E)DI}. The size of the store depends on the address-size
-attribute. The most significant bit in each byte of the mask
-register xmm2 is used to selectively write the data (0 = no write,
-1 = write) on a per-byte basis.
-
-
-\S{insMASKMOVQ} \i\c{MASKMOVQ}: Byte Mask Write
-
-\c MASKMOVQ mm1,mm2              ; 0F F7 /r        [KATMAI,MMX]
-
-\c{MASKMOVQ} stores data from mm1 to the location specified by
-\c{DS:(E)DI}. The size of the store depends on the address-size
-attribute. The most significant bit in each byte of the mask
-register mm2 is used to selectively write the data (0 = no write,
-1 = write) on a per-byte basis.
-
-
-\S{insMAXPD} \i\c{MAXPD}: Return Packed Double-Precision FP Maximum
-
-\c MAXPD xmm1,xmm2/m128          ; 66 0F 5F /r     [WILLAMETTE,SSE2]
-
-\c{MAXPD} performs a SIMD compare of the packed double-precision
-FP numbers from xmm1 and xmm2/mem, and stores the maximum values
-of each pair of values in xmm1. If the values being compared are
-both zeroes, source2 (xmm2/m128) would be returned. If source2
-(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
-destination (i.e., a QNaN version of the SNaN is not returned).
-
-
-\S{insMAXPS} \i\c{MAXPS}: Return Packed Single-Precision FP Maximum
-
-\c MAXPS xmm1,xmm2/m128          ; 0F 5F /r        [KATMAI,SSE]
-
-\c{MAXPS} performs a SIMD compare of the packed single-precision
-FP numbers from xmm1 and xmm2/mem, and stores the maximum values
-of each pair of values in xmm1. If the values being compared are
-both zeroes, source2 (xmm2/m128) would be returned. If source2
-(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
-destination (i.e., a QNaN version of the SNaN is not returned).
-
-
-\S{insMAXSD} \i\c{MAXSD}: Return Scalar Double-Precision FP Maximum
-
-\c MAXSD xmm1,xmm2/m64           ; F2 0F 5F /r     [WILLAMETTE,SSE2]
-
-\c{MAXSD} compares the low-order double-precision FP numbers from
-xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the
-values being compared are both zeroes, source2 (xmm2/m64) would
-be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is
-forwarded unchanged to the destination (i.e., a QNaN version of
-the SNaN is not returned). The high quadword of the destination
-is left unchanged.
-
-
-\S{insMAXSS} \i\c{MAXSS}: Return Scalar Single-Precision FP Maximum
-
-\c MAXSS xmm1,xmm2/m32           ; F3 0F 5F /r     [KATMAI,SSE]
-
-\c{MAXSS} compares the low-order single-precision FP numbers from
-xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the
-values being compared are both zeroes, source2 (xmm2/m32) would
-be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is
-forwarded unchanged to the destination (i.e., a QNaN version of
-the SNaN is not returned). The high three doublewords of the
-destination are left unchanged.
-
-
-\S{insMFENCE} \i\c{MFENCE}: Memory Fence
-
-\c MFENCE                        ; 0F AE /6        [WILLAMETTE,SSE2]
-
-\c{MFENCE} performs a serialising operation on all loads from memory
-and writes to memory that were issued before the \c{MFENCE} instruction.
-This guarantees that all memory reads and writes before the \c{MFENCE}
-instruction are completed before any reads and writes after the
-\c{MFENCE} instruction.
-
-\c{MFENCE} is ordered with respect to other \c{MFENCE} instructions,
-\c{LFENCE}, \c{SFENCE}, any memory read and any other serialising
-instruction (such as \c{CPUID}).
-
-Weakly ordered memory types can be used to achieve higher processor
-performance through such techniques as out-of-order issue, speculative
-reads, write-combining, and write-collapsing. The degree to which a
-consumer of data recognizes or knows that the data is weakly ordered
-varies among applications and may be unknown to the producer of this
-data. The \c{MFENCE} instruction provides a performance-efficient way
-of ensuring load and store ordering between routines that produce
-weakly-ordered results and routines that consume that data.
-
-\c{MFENCE} uses the following ModRM encoding:
-
-\c           Mod (7:6)        = 11B
-\c           Reg/Opcode (5:3) = 110B
-\c           R/M (2:0)        = 000B
-
-All other ModRM encodings are defined to be reserved, and use
-of these encodings risks incompatibility with future processors.
-
-See also \c{LFENCE} (\k{insLFENCE}) and \c{SFENCE} (\k{insSFENCE}).
-
-
-\S{insMINPD} \i\c{MINPD}: Return Packed Double-Precision FP Minimum
-
-\c MINPD xmm1,xmm2/m128          ; 66 0F 5D /r     [WILLAMETTE,SSE2]
-
-\c{MINPD} performs a SIMD compare of the packed double-precision
-FP numbers from xmm1 and xmm2/mem, and stores the minimum values
-of each pair of values in xmm1. If the values being compared are
-both zeroes, source2 (xmm2/m128) would be returned. If source2
-(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
-destination (i.e., a QNaN version of the SNaN is not returned).
-
-
-\S{insMINPS} \i\c{MINPS}: Return Packed Single-Precision FP Minimum
-
-\c MINPS xmm1,xmm2/m128          ; 0F 5D /r        [KATMAI,SSE]
-
-\c{MINPS} performs a SIMD compare of the packed single-precision
-FP numbers from xmm1 and xmm2/mem, and stores the minimum values
-of each pair of values in xmm1. If the values being compared are
-both zeroes, source2 (xmm2/m128) would be returned. If source2
-(xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the
-destination (i.e., a QNaN version of the SNaN is not returned).
-
-
-\S{insMINSD} \i\c{MINSD}: Return Scalar Double-Precision FP Minimum
-
-\c MINSD xmm1,xmm2/m64           ; F2 0F 5D /r     [WILLAMETTE,SSE2]
-
-\c{MINSD} compares the low-order double-precision FP numbers from
-xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the
-values being compared are both zeroes, source2 (xmm2/m64) would
-be returned. If source2 (xmm2/m64) is an SNaN, this SNaN is
-forwarded unchanged to the destination (i.e., a QNaN version of
-the SNaN is not returned). The high quadword of the destination
-is left unchanged.
-
-
-\S{insMINSS} \i\c{MINSS}: Return Scalar Single-Precision FP Minimum
-
-\c MINSS xmm1,xmm2/m32           ; F3 0F 5D /r     [KATMAI,SSE]
-
-\c{MINSS} compares the low-order single-precision FP numbers from
-xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the
-values being compared are both zeroes, source2 (xmm2/m32) would
-be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is
-forwarded unchanged to the destination (i.e., a QNaN version of
-the SNaN is not returned). The high three doublewords of the
-destination are left unchanged.
-
-
-\S{insMOV} \i\c{MOV}: Move Data
-
-\c MOV r/m8,reg8                 ; 88 /r                [8086]
-\c MOV r/m16,reg16               ; o16 89 /r            [8086]
-\c MOV r/m32,reg32               ; o32 89 /r            [386]
-\c MOV reg8,r/m8                 ; 8A /r                [8086]
-\c MOV reg16,r/m16               ; o16 8B /r            [8086]
-\c MOV reg32,r/m32               ; o32 8B /r            [386]
-
-\c MOV reg8,imm8                 ; B0+r ib              [8086]
-\c MOV reg16,imm16               ; o16 B8+r iw          [8086]
-\c MOV reg32,imm32               ; o32 B8+r id          [386]
-\c MOV r/m8,imm8                 ; C6 /0 ib             [8086]
-\c MOV r/m16,imm16               ; o16 C7 /0 iw         [8086]
-\c MOV r/m32,imm32               ; o32 C7 /0 id         [386]
-
-\c MOV AL,memoffs8               ; A0 ow/od             [8086]
-\c MOV AX,memoffs16              ; o16 A1 ow/od         [8086]
-\c MOV EAX,memoffs32             ; o32 A1 ow/od         [386]
-\c MOV memoffs8,AL               ; A2 ow/od             [8086]
-\c MOV memoffs16,AX              ; o16 A3 ow/od         [8086]
-\c MOV memoffs32,EAX             ; o32 A3 ow/od         [386]
-
-\c MOV r/m16,segreg              ; o16 8C /r            [8086]
-\c MOV r/m32,segreg              ; o32 8C /r            [386]
-\c MOV segreg,r/m16              ; o16 8E /r            [8086]
-\c MOV segreg,r/m32              ; o32 8E /r            [386]
-
-\c MOV reg32,CR0/2/3/4           ; 0F 20 /r             [386]
-\c MOV reg32,DR0/1/2/3/6/7       ; 0F 21 /r             [386]
-\c MOV reg32,TR3/4/5/6/7         ; 0F 24 /r             [386]
-\c MOV CR0/2/3/4,reg32           ; 0F 22 /r             [386]
-\c MOV DR0/1/2/3/6/7,reg32       ; 0F 23 /r             [386]
-\c MOV TR3/4/5/6/7,reg32         ; 0F 26 /r             [386]
-
-\c{MOV} copies the contents of its source (second) operand into its
-destination (first) operand.
-
-In all forms of the \c{MOV} instruction, the two operands are the
-same size, except for moving between a segment register and an
-\c{r/m32} operand. These instructions are treated exactly like the
-corresponding 16-bit equivalent (so that, for example, \c{MOV
-DS,EAX} functions identically to \c{MOV DS,AX} but saves a prefix
-when in 32-bit mode), except that when a segment register is moved
-into a 32-bit destination, the top two bytes of the result are
-undefined.
-
-\c{MOV} may not use \c{CS} as a destination.
-
-\c{CR4} is only a supported register on the Pentium and above.
-
-Test registers are supported on 386/486 processors and on some
-non-Intel Pentium class processors.
-
-
-\S{insMOVAPD} \i\c{MOVAPD}: Move Aligned Packed Double-Precision FP Values
-
-\c MOVAPD xmm1,xmm2/mem128       ; 66 0F 28 /r     [WILLAMETTE,SSE2]
-\c MOVAPD xmm1/mem128,xmm2       ; 66 0F 29 /r     [WILLAMETTE,SSE2]
-
-\c{MOVAPD} moves a double quadword containing 2 packed double-precision
-FP values from the source operand to the destination. When the source
-or destination operand is a memory location, it must be aligned on a
-16-byte boundary.
-
-To move data in and out of memory locations that are not known to be on
-16-byte boundaries, use the \c{MOVUPD} instruction (\k{insMOVUPD}).
-
-
-\S{insMOVAPS} \i\c{MOVAPS}: Move Aligned Packed Single-Precision FP Values
-
-\c MOVAPS xmm1,xmm2/mem128       ; 0F 28 /r        [KATMAI,SSE]
-\c MOVAPS xmm1/mem128,xmm2       ; 0F 29 /r        [KATMAI,SSE]
-
-\c{MOVAPS} moves a double quadword containing 4 packed single-precision
-FP values from the source operand to the destination. When the source
-or destination operand is a memory location, it must be aligned on a
-16-byte boundary.
-
-To move data in and out of memory locations that are not known to be on
-16-byte boundaries, use the \c{MOVUPS} instruction (\k{insMOVUPS}).
-
-
-\S{insMOVD} \i\c{MOVD}: Move Doubleword to/from MMX Register
-
-\c MOVD mm,r/m32                 ; 0F 6E /r             [PENT,MMX]
-\c MOVD r/m32,mm                 ; 0F 7E /r             [PENT,MMX]
-\c MOVD xmm,r/m32                ; 66 0F 6E /r     [WILLAMETTE,SSE2]
-\c MOVD r/m32,xmm                ; 66 0F 7E /r     [WILLAMETTE,SSE2]
-
-\c{MOVD} copies 32 bits from its source (second) operand into its
-destination (first) operand. When the destination is a 64-bit \c{MMX}
-register or a 128-bit \c{XMM} register, the input value is zero-extended
-to fill the destination register.
-
-
-\S{insMOVDQ2Q} \i\c{MOVDQ2Q}: Move Quadword from XMM to MMX Register
-
-\c MOVDQ2Q mm,xmm                ; F2 0F D6 /r     [WILLAMETTE,SSE2]
-
-\c{MOVDQ2Q} moves the low quadword from the source operand to the
-destination operand.
-
-
-\S{insMOVDQA} \i\c{MOVDQA}: Move Aligned Double Quadword
-
-\c MOVDQA xmm1,xmm2/m128         ; 66 0F 6F /r     [WILLAMETTE,SSE2]
-\c MOVDQA xmm1/m128,xmm2         ; 66 0F 7F /r     [WILLAMETTE,SSE2]
-
-\c{MOVDQA} moves a double quadword from the source operand to the
-destination operand. When the source or destination operand is a
-memory location, it must be aligned to a 16-byte boundary.
-
-To move a double quadword to or from unaligned memory locations,
-use the \c{MOVDQU} instruction (\k{insMOVDQU}).
-
-
-\S{insMOVDQU} \i\c{MOVDQU}: Move Unaligned Double Quadword
-
-\c MOVDQU xmm1,xmm2/m128         ; F3 0F 6F /r     [WILLAMETTE,SSE2]
-\c MOVDQU xmm1/m128,xmm2         ; F3 0F 7F /r     [WILLAMETTE,SSE2]
-
-\c{MOVDQU} moves a double quadword from the source operand to the
-destination operand. When the source or destination operand is a
-memory location, the memory may be unaligned.
-
-To move a double quadword to or from known aligned memory locations,
-use the \c{MOVDQA} instruction (\k{insMOVDQA}).
-
-
-\S{insMOVHLPS} \i\c{MOVHLPS}: Move Packed Single-Precision FP High to Low
-
-\c MOVHLPS xmm1,xmm2             ; 0F 12 /r        [KATMAI,SSE]
-
-\c{MOVHLPS} moves the two packed single-precision FP values from the
-high quadword of the source register xmm2 to the low quadword of the
-destination register, xmm1. The upper quadword of xmm1 is left unchanged.
-
-The operation of this instruction is:
-
-\c    dst[0-63]   := src[64-127],
-\c    dst[64-127] remains unchanged.
-
-
-\S{insMOVHPD} \i\c{MOVHPD}: Move High Packed Double-Precision FP
-
-\c MOVHPD xmm,m64               ; 66 0F 16 /r      [WILLAMETTE,SSE2]
-\c MOVHPD m64,xmm               ; 66 0F 17 /r      [WILLAMETTE,SSE2]
-
-\c{MOVHPD} moves a double-precision FP value between the source and
-destination operands. One of the operands is a 64-bit memory location,
-the other is the high quadword of an \c{XMM} register.
-
-The operation of this instruction is:
-
-\c    mem[0-63]   := xmm[64-127];
-
-or
-
-\c    xmm[0-63]   remains unchanged;
-\c    xmm[64-127] := mem[0-63].
-
-
-\S{insMOVHPS} \i\c{MOVHPS}: Move High Packed Single-Precision FP
-
-\c MOVHPS xmm,m64               ; 0F 16 /r         [KATMAI,SSE]
-\c MOVHPS m64,xmm               ; 0F 17 /r         [KATMAI,SSE]
-
-\c{MOVHPS} moves two packed single-precision FP values between the source
-and destination operands. One of the operands is a 64-bit memory location,
-the other is the high quadword of an \c{XMM} register.
-
-The operation of this instruction is:
-
-\c    mem[0-63]   := xmm[64-127];
-
-or
-
-\c    xmm[0-63]   remains unchanged;
-\c    xmm[64-127] := mem[0-63].
-
-
-\S{insMOVLHPS} \i\c{MOVLHPS}: Move Packed Single-Precision FP Low to High
-
-\c MOVLHPS xmm1,xmm2             ; 0F 16 /r         [KATMAI,SSE]
-
-\c{MOVLHPS} moves the two packed single-precision FP values from the
-low quadword of the source register xmm2 to the high quadword of the
-destination register, xmm1. The low quadword of xmm1 is left unchanged.
-
-The operation of this instruction is:
-
-\c    dst[0-63]   remains unchanged;
-\c    dst[64-127] := src[0-63].
-
-\S{insMOVLPD} \i\c{MOVLPD}: Move Low Packed Double-Precision FP
-
-\c MOVLPD xmm,m64                ; 66 0F 12 /r     [WILLAMETTE,SSE2]
-\c MOVLPD m64,xmm                ; 66 0F 13 /r     [WILLAMETTE,SSE2]
-
-\c{MOVLPD} moves a double-precision FP value between the source and
-destination operands. One of the operands is a 64-bit memory location,
-the other is the low quadword of an \c{XMM} register.
-
-The operation of this instruction is:
-
-\c    mem[0-63]   := xmm[0-63];
-
-or
-
-\c    xmm[0-63]   := mem[0-63];
-\c    xmm[64-127] remains unchanged.
-
-\S{insMOVLPS} \i\c{MOVLPS}: Move Low Packed Single-Precision FP
-
-\c MOVLPS xmm,m64                ; 0F 12 /r        [KATMAI,SSE]
-\c MOVLPS m64,xmm                ; 0F 13 /r        [KATMAI,SSE]
-
-\c{MOVLPS} moves two packed single-precision FP values between the source
-and destination operands. One of the operands is a 64-bit memory location,
-the other is the low quadword of an \c{XMM} register.
-
-The operation of this instruction is:
-
-\c    mem[0-63]   := xmm[0-63];
-
-or
-
-\c    xmm[0-63]   := mem[0-63];
-\c    xmm[64-127] remains unchanged.
-
-
-\S{insMOVMSKPD} \i\c{MOVMSKPD}: Extract Packed Double-Precision FP Sign Mask
-
-\c MOVMSKPD reg32,xmm              ; 66 0F 50 /r   [WILLAMETTE,SSE2]
-
-\c{MOVMSKPD} inserts a 2-bit mask in r32, formed of the most significant
-bits of each double-precision FP number of the source operand.
-
-
-\S{insMOVMSKPS} \i\c{MOVMSKPS}: Extract Packed Single-Precision FP Sign Mask
-
-\c MOVMSKPS reg32,xmm              ; 0F 50 /r      [KATMAI,SSE]
-
-\c{MOVMSKPS} inserts a 4-bit mask in r32, formed of the most significant
-bits of each single-precision FP number of the source operand.
-
-
-\S{insMOVNTDQ} \i\c{MOVNTDQ}: Move Double Quadword Non Temporal
-
-\c MOVNTDQ m128,xmm              ; 66 0F E7 /r     [WILLAMETTE,SSE2]
-
-\c{MOVNTDQ} moves the double quadword from the \c{XMM} source
-register to the destination memory location, using a non-temporal
-hint. This store instruction minimizes cache pollution.
-
-
-\S{insMOVNTI} \i\c{MOVNTI}: Move Doubleword Non Temporal
-
-\c MOVNTI m32,reg32              ; 0F C3 /r        [WILLAMETTE,SSE2]
-
-\c{MOVNTI} moves the doubleword in the source register
-to the destination memory location, using a non-temporal
-hint. This store instruction minimizes cache pollution.
-
-
-\S{insMOVNTPD} \i\c{MOVNTPD}: Move Aligned Two Packed Double-Precision
-FP Values Non Temporal
-
-\c MOVNTPD m128,xmm              ; 66 0F 2B /r     [WILLAMETTE,SSE2]
-
-\c{MOVNTPD} moves the double quadword from the \c{XMM} source
-register to the destination memory location, using a non-temporal
-hint. This store instruction minimizes cache pollution. The memory
-location must be aligned to a 16-byte boundary.
-
-
-\S{insMOVNTPS} \i\c{MOVNTPS}: Move Aligned Four Packed Single-Precision
-FP Values Non Temporal
-
-\c MOVNTPS m128,xmm              ; 0F 2B /r        [KATMAI,SSE]
-
-\c{MOVNTPS} moves the double quadword from the \c{XMM} source
-register to the destination memory location, using a non-temporal
-hint. This store instruction minimizes cache pollution. The memory
-location must be aligned to a 16-byte boundary.
-
-
-\S{insMOVNTQ} \i\c{MOVNTQ}: Move Quadword Non Temporal
-
-\c MOVNTQ m64,mm                 ; 0F E7 /r        [KATMAI,MMX]
-
-\c{MOVNTQ} moves the quadword in the \c{MMX} source register
-to the destination memory location, using a non-temporal
-hint. This store instruction minimizes cache pollution.
-
-
-\S{insMOVQ} \i\c{MOVQ}: Move Quadword to/from MMX Register
-
-\c MOVQ mm1,mm2/m64               ; 0F 6F /r             [PENT,MMX]
-\c MOVQ mm1/m64,mm2               ; 0F 7F /r             [PENT,MMX]
-
-\c MOVQ xmm1,xmm2/m64             ; F3 0F 7E /r    [WILLAMETTE,SSE2]
-\c MOVQ xmm1/m64,xmm2             ; 66 0F D6 /r    [WILLAMETTE,SSE2]
-
-\c{MOVQ} copies 64 bits from its source (second) operand into its
-destination (first) operand. When the source is an \c{XMM} register,
-the low quadword is moved. When the destination is an \c{XMM} register,
-the destination is the low quadword, and the high quadword is cleared.
-
-
-\S{insMOVQ2DQ} \i\c{MOVQ2DQ}: Move Quadword from MMX to XMM Register
-
-\c MOVQ2DQ xmm,mm                ; F3 0F D6 /r     [WILLAMETTE,SSE2]
-
-\c{MOVQ2DQ} moves the quadword from the source operand to the low
-quadword of the destination operand, and clears the high quadword.
-
-
-\S{insMOVSB} \i\c{MOVSB}, \i\c{MOVSW}, \i\c{MOVSD}: Move String
-
-\c MOVSB                         ; A4                   [8086]
-\c MOVSW                         ; o16 A5               [8086]
-\c MOVSD                         ; o32 A5               [386]
-
-\c{MOVSB} copies the byte at \c{[DS:SI]} or \c{[DS:ESI]} to
-\c{[ES:DI]} or \c{[ES:EDI]}. It then increments or decrements
-(depending on the direction flag: increments if the flag is clear,
-decrements if it is set) \c{SI} and \c{DI} (or \c{ESI} and \c{EDI}).
-
-The registers used are \c{SI} and \c{DI} if the address size is 16
-bits, and \c{ESI} and \c{EDI} if it is 32 bits. If you need to use
-an address size not equal to the current \c{BITS} setting, you can
-use an explicit \i\c{a16} or \i\c{a32} prefix.
-
-The segment register used to load from \c{[SI]} or \c{[ESI]} can be
-overridden by using a segment register name as a prefix (for
-example, \c{es movsb}). The use of \c{ES} for the store to \c{[DI]}
-or \c{[EDI]} cannot be overridden.
-
-\c{MOVSW} and \c{MOVSD} work in the same way, but they copy a word
-or a doubleword instead of a byte, and increment or decrement the
-addressing registers by 2 or 4 instead of 1.
-
-The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
-\c{ECX} - again, the address size chooses which) times.
-
-
-\S{insMOVSD} \i\c{MOVSD}: Move Scalar Double-Precision FP Value
-
-\c MOVSD xmm1,xmm2/m64           ; F2 0F 10 /r     [WILLAMETTE,SSE2]
-\c MOVSD xmm1/m64,xmm2           ; F2 0F 11 /r     [WILLAMETTE,SSE2]
-
-\c{MOVSD} moves a double-precision FP value from the source operand
-to the destination operand. When the source or destination is a
-register, the low-order FP value is read or written.
-
-
-\S{insMOVSS} \i\c{MOVSS}: Move Scalar Single-Precision FP Value
-
-\c MOVSS xmm1,xmm2/m32           ; F3 0F 10 /r     [KATMAI,SSE]
-\c MOVSS xmm1/m32,xmm2           ; F3 0F 11 /r     [KATMAI,SSE]
-
-\c{MOVSS} moves a single-precision FP value from the source operand
-to the destination operand. When the source or destination is a
-register, the low-order FP value is read or written.
-
-
-\S{insMOVSX} \i\c{MOVSX}, \i\c{MOVZX}: Move Data with Sign or Zero Extend
-
-\c MOVSX reg16,r/m8              ; o16 0F BE /r         [386]
-\c MOVSX reg32,r/m8              ; o32 0F BE /r         [386]
-\c MOVSX reg32,r/m16             ; o32 0F BF /r         [386]
-
-\c MOVZX reg16,r/m8              ; o16 0F B6 /r         [386]
-\c MOVZX reg32,r/m8              ; o32 0F B6 /r         [386]
-\c MOVZX reg32,r/m16             ; o32 0F B7 /r         [386]
-
-\c{MOVSX} sign-extends its source (second) operand to the length of
-its destination (first) operand, and copies the result into the
-destination operand. \c{MOVZX} does the same, but zero-extends
-rather than sign-extending.
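
The difference between the two extensions can be sketched in a few lines of Python. This is only an illustrative model, not NASM or processor code; the function names `movzx` and `movsx` and their width parameters are hypothetical.

```python
def movzx(value, src_bits):
    """Zero-extend: keep the low src_bits, upper bits become 0."""
    return value & ((1 << src_bits) - 1)


def movsx(value, src_bits, dst_bits):
    """Sign-extend: copy the source's top bit into all upper bits."""
    value &= (1 << src_bits) - 1
    sign = 1 << (src_bits - 1)
    signed = (value ^ sign) - sign          # reinterpret as a signed integer
    return signed & ((1 << dst_bits) - 1)   # re-encode at the wider width
```

For instance, under this model the byte `80h` sign-extends to `FFFFFF80h` in a 32-bit destination, while zero-extension leaves it as `00000080h`.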
-
-
-\S{insMOVUPD} \i\c{MOVUPD}: Move Unaligned Packed Double-Precision FP Values
-
-\c MOVUPD xmm1,xmm2/mem128       ; 66 0F 10 /r     [WILLAMETTE,SSE2]
-\c MOVUPD xmm1/mem128,xmm2       ; 66 0F 11 /r     [WILLAMETTE,SSE2]
-
-\c{MOVUPD} moves a double quadword containing 2 packed double-precision
-FP values from the source operand to the destination. This instruction
-makes no assumptions about alignment of memory operands.
-
-To move data in and out of memory locations that are known to be on 16-byte
-boundaries, use the \c{MOVAPD} instruction (\k{insMOVAPD}).
-
-
-\S{insMOVUPS} \i\c{MOVUPS}: Move Unaligned Packed Single-Precision FP Values
-
-\c MOVUPS xmm1,xmm2/mem128       ; 0F 10 /r        [KATMAI,SSE]
-\c MOVUPS xmm1/mem128,xmm2       ; 0F 11 /r        [KATMAI,SSE]
-
-\c{MOVUPS} moves a double quadword containing 4 packed single-precision
-FP values from the source operand to the destination. This instruction
-makes no assumptions about alignment of memory operands.
-
-To move data in and out of memory locations that are known to be on 16-byte
-boundaries, use the \c{MOVAPS} instruction (\k{insMOVAPS}).
-
-
-\S{insMUL} \i\c{MUL}: Unsigned Integer Multiply
-
-\c MUL r/m8                      ; F6 /4                [8086]
-\c MUL r/m16                     ; o16 F7 /4            [8086]
-\c MUL r/m32                     ; o32 F7 /4            [386]
-
-\c{MUL} performs unsigned integer multiplication. The other operand
-to the multiplication, and the destination operand, are implicit, in
-the following way:
-
-\b For \c{MUL r/m8}, \c{AL} is multiplied by the given operand; the
-product is stored in \c{AX}.
-
-\b For \c{MUL r/m16}, \c{AX} is multiplied by the given operand;
-the product is stored in \c{DX:AX}.
-
-\b For \c{MUL r/m32}, \c{EAX} is multiplied by the given operand;
-the product is stored in \c{EDX:EAX}.
-
-Signed integer multiplication is performed by the \c{IMUL}
-instruction: see \k{insIMUL}.
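
The implicit operands can be modelled as follows. This is a sketch under the assumption that the product is simply split into a high and a low half of the operand size; the helper name `mul` is hypothetical.

```python
def mul(accumulator, operand, bits):
    """Model of MUL for an operand size of 8, 16 or 32 bits.

    Returns (high, low): AH:AL (that is, AX) for 8 bits,
    DX:AX for 16 bits, EDX:EAX for 32 bits.
    """
    mask = (1 << bits) - 1
    product = (accumulator & mask) * (operand & mask)   # unsigned multiply
    return (product >> bits) & mask, product & mask
```

With a 16-bit operand, multiplying `FFFFh` by `FFFFh` would leave `FFFEh` in the high half and `0001h` in the low half.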
-
-
-\S{insMULPD} \i\c{MULPD}: Packed Double-FP Multiply
-
-\c MULPD xmm1,xmm2/mem128        ; 66 0F 59 /r     [WILLAMETTE,SSE2]
-
-\c{MULPD} performs a SIMD multiply of the packed double-precision FP
-values in both operands, and stores the results in the destination register.
-
-
-\S{insMULPS} \i\c{MULPS}: Packed Single-FP Multiply
-
-\c MULPS xmm1,xmm2/mem128        ; 0F 59 /r        [KATMAI,SSE]
-
-\c{MULPS} performs a SIMD multiply of the packed single-precision FP
-values in both operands, and stores the results in the destination register.
-
-
-\S{insMULSD} \i\c{MULSD}: Scalar Double-FP Multiply
-
-\c MULSD xmm1,xmm2/mem64         ; F2 0F 59 /r     [WILLAMETTE,SSE2]
-
-\c{MULSD} multiplies the lowest double-precision FP values of both
-operands, and stores the result in the low quadword of xmm1.
-
-
-\S{insMULSS} \i\c{MULSS}: Scalar Single-FP Multiply
-
-\c MULSS xmm1,xmm2/mem32         ; F3 0F 59 /r     [KATMAI,SSE]
-
-\c{MULSS} multiplies the lowest single-precision FP values of both
-operands, and stores the result in the low doubleword of xmm1.
-
-
-\S{insNEG} \i\c{NEG}, \i\c{NOT}: Two's and One's Complement
-
-\c NEG r/m8                      ; F6 /3                [8086]
-\c NEG r/m16                     ; o16 F7 /3            [8086]
-\c NEG r/m32                     ; o32 F7 /3            [386]
-
-\c NOT r/m8                      ; F6 /2                [8086]
-\c NOT r/m16                     ; o16 F7 /2            [8086]
-\c NOT r/m32                     ; o32 F7 /2            [386]
-
-\c{NEG} replaces the contents of its operand by the two's complement
-negation (invert all the bits and then add one) of the original
-value. \c{NOT}, similarly, performs one's complement (inverts all
-the bits).
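
As a rough illustration of the two complements on a fixed-width operand (the helper names `not_` and `neg` are hypothetical, and flags are not modelled):

```python
def not_(value, bits):
    """One's complement: invert every bit of a bits-wide operand."""
    return ~value & ((1 << bits) - 1)


def neg(value, bits):
    """Two's complement: invert every bit, then add one (mod 2**bits)."""
    return (not_(value, bits) + 1) & ((1 << bits) - 1)
```

Note that in this model `neg(0x80, 8)` gives `80h` again, since the most negative value has no positive counterpart at the same width.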
-
-
-\S{insNOP} \i\c{NOP}: No Operation
-
-\c NOP                           ; 90                   [8086]
-
-\c{NOP} performs no operation. Its opcode is the same as that
-generated by \c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the
-processor mode; see \k{insXCHG}).
-
-
-\S{insOR} \i\c{OR}: Bitwise OR
-
-\c OR r/m8,reg8                  ; 08 /r                [8086]
-\c OR r/m16,reg16                ; o16 09 /r            [8086]
-\c OR r/m32,reg32                ; o32 09 /r            [386]
-
-\c OR reg8,r/m8                  ; 0A /r                [8086]
-\c OR reg16,r/m16                ; o16 0B /r            [8086]
-\c OR reg32,r/m32                ; o32 0B /r            [386]
-
-\c OR r/m8,imm8                  ; 80 /1 ib             [8086]
-\c OR r/m16,imm16                ; o16 81 /1 iw         [8086]
-\c OR r/m32,imm32                ; o32 81 /1 id         [386]
-
-\c OR r/m16,imm8                 ; o16 83 /1 ib         [8086]
-\c OR r/m32,imm8                 ; o32 83 /1 ib         [386]
-
-\c OR AL,imm8                    ; 0C ib                [8086]
-\c OR AX,imm16                   ; o16 0D iw            [8086]
-\c OR EAX,imm32                  ; o32 0D id            [386]
-
-\c{OR} performs a bitwise OR operation between its two operands
-(i.e. each bit of the result is 1 if and only if at least one of the
-corresponding bits of the two inputs was 1), and stores the result
-in the destination (first) operand.
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-The MMX instruction \c{POR} (see \k{insPOR}) performs the same
-operation on the 64-bit MMX registers.
-
-
-\S{insORPD} \i\c{ORPD}: Bit-wise Logical OR of Double-Precision FP Data
-
-\c ORPD xmm1,xmm2/m128           ; 66 0F 56 /r     [WILLAMETTE,SSE2]
-
-\c{ORPD} performs a bit-wise logical OR between xmm1 and xmm2/mem,
-and stores the result in xmm1. If the source operand is a memory
-location, it must be aligned to a 16-byte boundary.
-
-
-\S{insORPS} \i\c{ORPS}: Bit-wise Logical OR of Single-Precision FP Data
-
-\c ORPS xmm1,xmm2/m128           ; 0F 56 /r        [KATMAI,SSE]
-
-\c{ORPS} performs a bit-wise logical OR between xmm1 and xmm2/mem,
-and stores the result in xmm1. If the source operand is a memory
-location, it must be aligned to a 16-byte boundary.
-
-
-\S{insOUT} \i\c{OUT}: Output Data to I/O Port
-
-\c OUT imm8,AL                   ; E6 ib                [8086]
-\c OUT imm8,AX                   ; o16 E7 ib            [8086]
-\c OUT imm8,EAX                  ; o32 E7 ib            [386]
-\c OUT DX,AL                     ; EE                   [8086]
-\c OUT DX,AX                     ; o16 EF               [8086]
-\c OUT DX,EAX                    ; o32 EF               [386]
-
-\c{OUT} writes the contents of the given source register to the
-specified I/O port. The port number may be specified as an immediate
-value if it is between 0 and 255, and otherwise must be stored in
-\c{DX}. See also \c{IN} (\k{insIN}).
-
-
-\S{insOUTSB} \i\c{OUTSB}, \i\c{OUTSW}, \i\c{OUTSD}: Output String to I/O Port
-
-\c OUTSB                         ; 6E                   [186]
-\c OUTSW                         ; o16 6F               [186]
-\c OUTSD                         ; o32 6F               [386]
-
-\c{OUTSB} loads a byte from \c{[DS:SI]} or \c{[DS:ESI]} and writes
-it to the I/O port specified in \c{DX}. It then increments or
-decrements (depending on the direction flag: increments if the flag
-is clear, decrements if it is set) \c{SI} or \c{ESI}.
-
-The register used is \c{SI} if the address size is 16 bits, and
-\c{ESI} if it is 32 bits. If you need to use an address size not
-equal to the current \c{BITS} setting, you can use an explicit
-\i\c{a16} or \i\c{a32} prefix.
-
-The segment register used to load from \c{[SI]} or \c{[ESI]} can be
-overridden by using a segment register name as a prefix (for
-example, \c{es outsb}).
-
-\c{OUTSW} and \c{OUTSD} work in the same way, but they output a
-word or a doubleword instead of a byte, and increment or decrement
-the addressing registers by 2 or 4 instead of 1.
-
-The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
-\c{ECX} - again, the address size chooses which) times.
-
-
-\S{insPACKSSDW} \i\c{PACKSSDW}, \i\c{PACKSSWB}, \i\c{PACKUSWB}: Pack Data
-
-\c PACKSSDW mm1,mm2/m64          ; 0F 6B /r             [PENT,MMX]
-\c PACKSSWB mm1,mm2/m64          ; 0F 63 /r             [PENT,MMX]
-\c PACKUSWB mm1,mm2/m64          ; 0F 67 /r             [PENT,MMX]
-
-\c PACKSSDW xmm1,xmm2/m128       ; 66 0F 6B /r     [WILLAMETTE,SSE2]
-\c PACKSSWB xmm1,xmm2/m128       ; 66 0F 63 /r     [WILLAMETTE,SSE2]
-\c PACKUSWB xmm1,xmm2/m128       ; 66 0F 67 /r     [WILLAMETTE,SSE2]
-
-All these instructions start by combining the source and destination
-operands, and then split the result into smaller sections, which they
-then pack into the destination register. The \c{MMX} versions pack
-two 64-bit operands into one 64-bit register, while the \c{SSE}
-versions pack two 128-bit operands into one 128-bit register.
-
-\b \c{PACKSSWB} splits the combined value into words, and then reduces
-the words to bytes, using signed saturation. It then packs the bytes
-into the destination register in the same order the words were in.
-
-\b \c{PACKSSDW} performs the same operation as \c{PACKSSWB}, except that
-it reduces doublewords to words, then packs them into the destination
-register.
-
-\b \c{PACKUSWB} performs the same operation as \c{PACKSSWB}, except that
-it uses unsigned saturation when reducing the size of the elements.
-
-To perform signed saturation on a number, if it is too large it is
-replaced by the largest signed number (\c{7FFFh} or \c{7Fh}) that
-\e{will} fit, and if it is too small it is replaced by the smallest
-signed number (\c{8000h} or \c{80h}) that will fit. To perform unsigned
-saturation, the input is still read as a signed number: a negative
-input is replaced by zero, and an input that is too large is replaced
-by the largest unsigned number (\c{FFh}) that will fit.
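
The pack-with-saturation behaviour can be sketched in Python. This is a model only: the function names are hypothetical, vectors are lists of signed integers with the least significant element first, and it assumes (as in Intel's documentation) that the inputs to \c{PACKUSWB} are read as signed words before being clamped to the unsigned byte range.

```python
def saturate_signed(value, bits):
    """Clamp to the signed range of a bits-wide field."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def saturate_unsigned(value, bits):
    """Clamp to the unsigned range of a bits-wide field."""
    return max(0, min((1 << bits) - 1, value))


def packsswb(dst_words, src_words):
    # Destination elements fill the low half, source elements the high half.
    return [saturate_signed(w, 8) for w in dst_words + src_words]


def packuswb(dst_words, src_words):
    return [saturate_unsigned(w, 8) for w in dst_words + src_words]
```

For example, the word 300 packs to 127 with signed saturation and to 255 with unsigned saturation, while -300 packs to -128 and 0 respectively.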
-
-
-\S{insPADDB} \i\c{PADDB}, \i\c{PADDW}, \i\c{PADDD}: Add Packed Integers
-
-\c PADDB mm1,mm2/m64             ; 0F FC /r             [PENT,MMX]
-\c PADDW mm1,mm2/m64             ; 0F FD /r             [PENT,MMX]
-\c PADDD mm1,mm2/m64             ; 0F FE /r             [PENT,MMX]
-
-\c PADDB xmm1,xmm2/m128          ; 66 0F FC /r     [WILLAMETTE,SSE2]
-\c PADDW xmm1,xmm2/m128          ; 66 0F FD /r     [WILLAMETTE,SSE2]
-\c PADDD xmm1,xmm2/m128          ; 66 0F FE /r     [WILLAMETTE,SSE2]
-
-\c{PADDx} performs packed addition of the two operands, storing the
-result in the destination (first) operand.
-
-\b \c{PADDB} treats the operands as packed bytes, and adds each byte
-individually;
-
-\b \c{PADDW} treats the operands as packed words;
-
-\b \c{PADDD} treats its operands as packed doublewords.
-
-When an individual result is too large to fit in its destination, it
-is wrapped around and the low bits are stored, with the carry bit
-discarded.
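
A minimal sketch of the wraparound behaviour, assuming each register is represented as a list of unsigned elements (the name `padd` and its `bits` parameter are hypothetical; pass 8, 16 or 32 for \c{PADDB}, \c{PADDW} or \c{PADDD}):

```python
def padd(dst, src, bits):
    """Element-wise add; each sum keeps only its low bits (carry discarded)."""
    mask = (1 << bits) - 1
    return [(a + b) & mask for a, b in zip(dst, src)]
```

In this model, adding the bytes `FFh` and `01h` yields `00h`, and the carry does not spill into the neighbouring element.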
-
-
-\S{insPADDQ} \i\c{PADDQ}: Add Packed Quadword Integers
-
-\c PADDQ mm1,mm2/m64             ; 0F D4 /r             [PENT,MMX]
-
-\c PADDQ xmm1,xmm2/m128          ; 66 0F D4 /r     [WILLAMETTE,SSE2]
-
-\c{PADDQ} adds the quadwords in the source and destination operands, and
-stores the result in the destination register.
-
-When an individual result is too large to fit in its destination, it
-is wrapped around and the low bits are stored, with the carry bit
-discarded.
-
-
-\S{insPADDSB} \i\c{PADDSB}, \i\c{PADDSW}: Add Packed Signed Integers With Saturation
-
-\c PADDSB mm1,mm2/m64            ; 0F EC /r             [PENT,MMX]
-\c PADDSW mm1,mm2/m64            ; 0F ED /r             [PENT,MMX]
-
-\c PADDSB xmm1,xmm2/m128         ; 66 0F EC /r     [WILLAMETTE,SSE2]
-\c PADDSW xmm1,xmm2/m128         ; 66 0F ED /r     [WILLAMETTE,SSE2]
-
-\c{PADDSx} performs packed addition of the two operands, storing the
-result in the destination (first) operand.
-\c{PADDSB} treats the operands as packed bytes, and adds each byte
-individually; and \c{PADDSW} treats the operands as packed words.
-
-When an individual result is too large to fit in its destination, a
-saturated value is stored. The resulting value is the value with the
-largest magnitude of the same sign as the result which will fit in
-the available space.
-
-
-\S{insPADDSIW} \i\c{PADDSIW}: MMX Packed Addition to Implicit Destination
-
-\c PADDSIW mmxreg,r/m64          ; 0F 51 /r             [CYRIX,MMX]
-
-\c{PADDSIW}, specific to the Cyrix extensions to the MMX instruction
-set, performs the same function as \c{PADDSW}, except that the result
-is placed in an implied register.
-
-To work out the implied register, invert the lowest bit in the register
-number. So \c{PADDSIW MM0,MM2} would put the result in \c{MM1}, but
-\c{PADDSIW MM1,MM2} would put the result in \c{MM0}.
-
-
-\S{insPADDUSB} \i\c{PADDUSB}, \i\c{PADDUSW}: Add Packed Unsigned Integers With Saturation
-
-\c PADDUSB mm1,mm2/m64           ; 0F DC /r             [PENT,MMX]
-\c PADDUSW mm1,mm2/m64           ; 0F DD /r             [PENT,MMX]
-
-\c PADDUSB xmm1,xmm2/m128         ; 66 0F DC /r    [WILLAMETTE,SSE2]
-\c PADDUSW xmm1,xmm2/m128         ; 66 0F DD /r    [WILLAMETTE,SSE2]
-
-\c{PADDUSx} performs packed addition of the two operands, storing the
-result in the destination (first) operand.
-\c{PADDUSB} treats the operands as packed bytes, and adds each byte
-individually; and \c{PADDUSW} treats the operands as packed words.
-
-When an individual result is too large to fit in its destination, a
-saturated value is stored. The resulting value is the maximum value
-that will fit in the available space.
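
To contrast the unsigned saturating add with the signed form (\c{PADDSB} and \c{PADDSW}), here is a small Python model. The names `paddus` and `padds` are hypothetical; elements are plain Python integers, unsigned for `paddus` and signed for `padds`.

```python
def paddus(dst, src, bits):
    """Unsigned saturating add: sums above the maximum stick at the maximum."""
    top = (1 << bits) - 1
    return [min(a + b, top) for a, b in zip(dst, src)]


def padds(dst, src, bits):
    """Signed saturating add: sums clamp to the signed range of the element."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, a + b)) for a, b in zip(dst, src)]
```

With bytes, 200 + 100 saturates to 255 in the unsigned form, while 100 + 100 saturates to 127 and -100 + -100 to -128 in the signed form.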
-
-
-\S{insPAND} \i\c{PAND}, \i\c{PANDN}: MMX Bitwise AND and AND-NOT
-
-\c PAND mm1,mm2/m64              ; 0F DB /r             [PENT,MMX]
-\c PANDN mm1,mm2/m64             ; 0F DF /r             [PENT,MMX]
-
-\c PAND xmm1,xmm2/m128           ; 66 0F DB /r     [WILLAMETTE,SSE2]
-\c PANDN xmm1,xmm2/m128          ; 66 0F DF /r     [WILLAMETTE,SSE2]
-
-
-\c{PAND} performs a bitwise AND operation between its two operands
-(i.e. each bit of the result is 1 if and only if the corresponding
-bits of the two inputs were both 1), and stores the result in the
-destination (first) operand.
-
-\c{PANDN} performs the same operation, but performs a one's
-complement operation on the destination (first) operand first.
-
-
-\S{insPAUSE} \i\c{PAUSE}: Spin Loop Hint
-
-\c PAUSE                         ; F3 90           [WILLAMETTE,SSE2]
-
-\c{PAUSE} provides a hint to the processor that the following code
-is a spin loop. This improves processor performance by bypassing
-possible memory order violations. On older processors, this instruction
-operates as a \c{NOP}.
-
-
-\S{insPAVEB} \i\c{PAVEB}: MMX Packed Average
-
-\c PAVEB mmxreg,r/m64            ; 0F 50 /r             [CYRIX,MMX]
-
-\c{PAVEB}, specific to the Cyrix MMX extensions, treats its two
-operands as vectors of eight unsigned bytes, and calculates the
-average of the corresponding bytes in the operands. The resulting
-vector of eight averages is stored in the first operand.
-
-This opcode maps to \c{MOVMSKPS r32, xmm} on processors that support
-the SSE instruction set.
-
-
-\S{insPAVGB} \i\c{PAVGB}, \i\c{PAVGW}: Average Packed Integers
-
-\c PAVGB mm1,mm2/m64             ; 0F E0 /r        [KATMAI,MMX]
-\c PAVGW mm1,mm2/m64             ; 0F E3 /r        [KATMAI,MMX,SM]
-
-\c PAVGB xmm1,xmm2/m128          ; 66 0F E0 /r     [WILLAMETTE,SSE2]
-\c PAVGW xmm1,xmm2/m128          ; 66 0F E3 /r     [WILLAMETTE,SSE2]
-
-\c{PAVGB} and \c{PAVGW} add the unsigned data elements of the source
-operand to the unsigned data elements of the destination register,
-then add 1 to the temporary results. The results of the add are then
-each independently right-shifted by one bit position. The high order
-bits of each element are filled with the carry bits of the corresponding
-sum.
-
-\b \c{PAVGB} operates on packed unsigned bytes, and
-
-\b \c{PAVGW} operates on packed unsigned words.
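
The add, increment and shift sequence described above amounts to a rounded average. A minimal sketch, assuming unsigned elements held in Python lists (the name `pavg` is hypothetical):

```python
def pavg(dst, src):
    """Rounded average of unsigned elements: (a + b + 1) >> 1.

    Python integers do not overflow, so the carry out of the element
    width is kept automatically and ends up in the top bit of the result.
    """
    return [(a + b + 1) >> 1 for a, b in zip(dst, src)]
```

The same function models both \c{PAVGB} and \c{PAVGW}, since only the element range differs; averaging 1 and 2 gives 2, because the added 1 rounds halves upward.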
-
-
-\S{insPAVGUSB} \i\c{PAVGUSB}: Average of unsigned packed 8-bit values
-
-\c PAVGUSB mm1,mm2/m64           ; 0F 0F /r BF          [PENT,3DNOW]
-
-\c{PAVGUSB} adds the unsigned data elements of the source operand to
-the unsigned data elements of the destination register, then adds 1
-to the temporary results. The results of the add are then each
-independently right-shifted by one bit position. The high order bits
-of each element are filled with the carry bits of the corresponding
-sum.
-
-This instruction performs exactly the same operations as the \c{PAVGB}
-\c{MMX} instruction (\k{insPAVGB}).
-
-
-\S{insPCMPEQB} \i\c{PCMPxx}: Compare Packed Integers
-
-\c PCMPEQB mm1,mm2/m64           ; 0F 74 /r             [PENT,MMX]
-\c PCMPEQW mm1,mm2/m64           ; 0F 75 /r             [PENT,MMX]
-\c PCMPEQD mm1,mm2/m64           ; 0F 76 /r             [PENT,MMX]
-
-\c PCMPGTB mm1,mm2/m64           ; 0F 64 /r             [PENT,MMX]
-\c PCMPGTW mm1,mm2/m64           ; 0F 65 /r             [PENT,MMX]
-\c PCMPGTD mm1,mm2/m64           ; 0F 66 /r             [PENT,MMX]
-
-\c PCMPEQB xmm1,xmm2/m128        ; 66 0F 74 /r     [WILLAMETTE,SSE2]
-\c PCMPEQW xmm1,xmm2/m128        ; 66 0F 75 /r     [WILLAMETTE,SSE2]
-\c PCMPEQD xmm1,xmm2/m128        ; 66 0F 76 /r     [WILLAMETTE,SSE2]
-
-\c PCMPGTB xmm1,xmm2/m128        ; 66 0F 64 /r     [WILLAMETTE,SSE2]
-\c PCMPGTW xmm1,xmm2/m128        ; 66 0F 65 /r     [WILLAMETTE,SSE2]
-\c PCMPGTD xmm1,xmm2/m128        ; 66 0F 66 /r     [WILLAMETTE,SSE2]
-
-The \c{PCMPxx} instructions all treat their operands as vectors of
-bytes, words, or doublewords; corresponding elements of the source
-and destination are compared, and the corresponding element of the
-destination (first) operand is set to all zeros or all ones
-depending on the result of the comparison.
-
-\b \c{PCMPxxB} treats the operands as vectors of bytes;
-
-\b \c{PCMPxxW} treats the operands as vectors of words;
-
-\b \c{PCMPxxD} treats the operands as vectors of doublewords;
-
-\b \c{PCMPEQx} sets the corresponding element of the destination
-operand to all ones if the two elements compared are equal;
-
-\b \c{PCMPGTx} sets the destination element to all ones if the element
-of the first (destination) operand is greater (treated as a signed
-integer) than that of the second (source) operand.
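
The mask-producing behaviour can be illustrated as follows. This is a sketch with hypothetical helper names; elements are given as unsigned bit patterns, and `bits` is 8, 16 or 32 for the B, W and D forms.

```python
def _signed(value, bits):
    """Reinterpret an unsigned bit pattern as a signed integer."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign


def pcmpeq(dst, src, bits):
    ones = (1 << bits) - 1
    return [ones if a == b else 0 for a, b in zip(dst, src)]


def pcmpgt(dst, src, bits):
    ones = (1 << bits) - 1
    return [ones if _signed(a, bits) > _signed(b, bits) else 0
            for a, b in zip(dst, src)]
```

Because the comparison is signed, the byte `01h` compares greater than `FFh` (which is -1) in this model.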
-
-
-\S{insPDISTIB} \i\c{PDISTIB}: MMX Packed Distance and Accumulate
-with Implied Register
-
-\c PDISTIB mm,m64                ; 0F 54 /r             [CYRIX,MMX]
-
-\c{PDISTIB}, specific to the Cyrix MMX extensions, treats its two
-input operands as vectors of eight unsigned bytes. For each byte
-position, it finds the absolute difference between the bytes in that
-position in the two input operands, and adds that value to the byte
-in the same position in the implied output register. The addition is
-saturated to an unsigned byte in the same way as \c{PADDUSB}.
-
-To work out the implied register, invert the lowest bit in the register
-number. So \c{PDISTIB MM0,M64} would put the result in \c{MM1}, but
-\c{PDISTIB MM1,M64} would put the result in \c{MM0}.
-
-Note that \c{PDISTIB} cannot take a register as its second source
-operand.
-
-Operation:
-
-\c    dstI[0-7]     := dstI[0-7]   + ABS(src0[0-7] - src1[0-7]),
-\c    dstI[8-15]    := dstI[8-15]  + ABS(src0[8-15] - src1[8-15]),
-\c    .......
-\c    .......
-\c    dstI[56-63]   := dstI[56-63] + ABS(src0[56-63] - src1[56-63]).
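
A rough Python model of the implied-register rule and the saturating accumulate follows. It is a sketch based only on the description above; the names `implied_register` and `pdistib` are hypothetical, and each register is a list of eight unsigned bytes.

```python
def implied_register(n):
    """Number of the implied MMX register: lowest bit of n inverted."""
    return n ^ 1


def pdistib(implied, src0, src1):
    """Add |src0 - src1| to each byte of the implied register,
    saturating at FFh in the same way as PADDUSB."""
    return [min(acc + abs(a - b), 0xFF)
            for acc, a, b in zip(implied, src0, src1)]
```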
-
-
-\S{insPEXTRW} \i\c{PEXTRW}: Extract Word
-
-\c PEXTRW reg32,mm,imm8          ; 0F C5 /r ib     [KATMAI,MMX]
-\c PEXTRW reg32,xmm,imm8         ; 66 0F C5 /r ib  [WILLAMETTE,SSE2]
-
-\c{PEXTRW} moves the word in the source register (second operand)
-that is pointed to by the count operand (third operand), into the
-lower half of a 32-bit general purpose register. The upper half of
-the register is cleared to all 0s.
-
-When the source operand is an \c{MMX} register, the two least
-significant bits of the count specify the source word. When it is
-an \c{SSE} register, the three least significant bits specify the
-word location.
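
The word selection can be sketched like this, treating the source register as a single integer (the name `pextrw` and the `reg_bits` parameter, 64 for MMX and 128 for XMM, are hypothetical):

```python
def pextrw(src, imm8, reg_bits):
    """Return the selected 16-bit word, zero-extended."""
    lanes = reg_bits // 16          # 4 words for MMX, 8 for XMM
    index = imm8 & (lanes - 1)      # only the low 2 or 3 count bits are used
    return (src >> (16 * index)) & 0xFFFF
```

Since the higher count bits are ignored, a count of 6 on an MMX source selects the same word as a count of 2 in this model.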
-
-
-\S{insPF2ID} \i\c{PF2ID}: Packed Single-Precision FP to Integer Convert
-
-\c PF2ID mm1,mm2/m64             ; 0F 0F /r 1D          [PENT,3DNOW]
-
-\c{PF2ID} converts two single-precision FP values in the source operand
-to signed 32-bit integers, using truncation, and stores them in the
-destination operand. Source values that are outside the range supported
-by the destination are saturated to the largest absolute value of the
-same sign.
-
-
-\S{insPF2IW} \i\c{PF2IW}: Packed Single-Precision FP to Integer Word Convert
-
-\c PF2IW mm1,mm2/m64             ; 0F 0F /r 1C          [PENT,3DNOW]
-
-\c{PF2IW} converts two single-precision FP values in the source operand
-to signed 16-bit integers, using truncation, and stores them in the
-destination operand. Source values that are outside the range supported
-by the destination are saturated to the largest absolute value of the
-same sign.
-
-\b In the K6-2 and K6-III, the 16-bit value is zero-extended to 32-bits
-before storing.
-
-\b In the K6-2+, K6-III+ and Athlon processors, the value is sign-extended
-to 32-bits before storing.
-
-
-\S{insPFACC} \i\c{PFACC}: Packed Single-Precision FP Accumulate
-
-\c PFACC mm1,mm2/m64             ; 0F 0F /r AE          [PENT,3DNOW]
-
-\c{PFACC} adds the two single-precision FP values from the destination
-operand together, then adds the two single-precision FP values from the
-source operand, and places the results in the low and high doublewords
-of the destination operand.
-
-The operation is:
-
-\c    dst[0-31]   := dst[0-31] + dst[32-63],
-\c    dst[32-63]  := src[0-31] + src[32-63].
-
-
-\S{insPFADD} \i\c{PFADD}: Packed Single-Precision FP Addition
-
-\c PFADD mm1,mm2/m64             ; 0F 0F /r 9E          [PENT,3DNOW]
-
-\c{PFADD} performs addition on each of two packed single-precision
-FP value pairs.
-
-\c    dst[0-31]   := dst[0-31]  + src[0-31],
-\c    dst[32-63]  := dst[32-63] + src[32-63].
-
-
-\S{insPFCMP} \i\c{PFCMPxx}: Packed Single-Precision FP Compare
-\I\c{PFCMPEQ} \I\c{PFCMPGE} \I\c{PFCMPGT}
-
-\c PFCMPEQ mm1,mm2/m64           ; 0F 0F /r B0          [PENT,3DNOW]
-\c PFCMPGE mm1,mm2/m64           ; 0F 0F /r 90          [PENT,3DNOW]
-\c PFCMPGT mm1,mm2/m64           ; 0F 0F /r A0          [PENT,3DNOW]
-
-The \c{PFCMPxx} instructions compare the packed single-precision FP values
-in the source and destination operands, and set the destination
-according to the result. If the condition is true, the destination is
-set to all 1s, otherwise it's set to all 0s.
-
-\b \c{PFCMPEQ} tests whether dst == src;
-
-\b \c{PFCMPGE} tests whether dst >= src;
-
-\b \c{PFCMPGT} tests whether dst >  src.
-
-
-\S{insPFMAX} \i\c{PFMAX}: Packed Single-Precision FP Maximum
-
-\c PFMAX mm1,mm2/m64             ; 0F 0F /r A4          [PENT,3DNOW]
-
-\c{PFMAX} returns the higher of each pair of single-precision FP values.
-If the higher value is zero, it is returned as positive zero.
-
-
-\S{insPFMIN} \i\c{PFMIN}: Packed Single-Precision FP Minimum
-
-\c PFMIN mm1,mm2/m64             ; 0F 0F /r 94          [PENT,3DNOW]
-
-\c{PFMIN} returns the lower of each pair of single-precision FP values.
-If the lower value is zero, it is returned as positive zero.
-
-
-\S{insPFMUL} \i\c{PFMUL}: Packed Single-Precision FP Multiply
-
-\c PFMUL mm1,mm2/m64             ; 0F 0F /r B4          [PENT,3DNOW]
-
-\c{PFMUL} returns the product of each pair of single-precision FP values.
-
-\c    dst[0-31]  := dst[0-31]  * src[0-31],
-\c    dst[32-63] := dst[32-63] * src[32-63].
-
-
-\S{insPFNACC} \i\c{PFNACC}: Packed Single-Precision FP Negative Accumulate
-
-\c PFNACC mm1,mm2/m64            ; 0F 0F /r 8A          [PENT,3DNOW]
-
-\c{PFNACC} performs a negative accumulate of the two single-precision
-FP values in the source and destination registers. The result of the
-accumulate from the destination register is stored in the low doubleword
-of the destination, and the result of the source accumulate is stored in
-the high doubleword of the destination register.
-
-The operation is:
-
-\c    dst[0-31]  := dst[0-31] - dst[32-63],
-\c    dst[32-63] := src[0-31] - src[32-63].
-
-
-\S{insPFPNACC} \i\c{PFPNACC}: Packed Single-Precision FP Mixed Accumulate
-
-\c PFPNACC mm1,mm2/m64           ; 0F 0F /r 8E          [PENT,3DNOW]
-
-\c{PFPNACC} performs a positive accumulate of the two single-precision
-FP values in the source register and a negative accumulate of the
-destination register. The result of the accumulate from the destination
-register is stored in the low doubleword of the destination, and the
-result of the source accumulate is stored in the high doubleword of the
-destination register.
-
-The operation is:
-
-\c    dst[0-31]  := dst[0-31] - dst[32-63],
-\c    dst[32-63] := src[0-31] + src[32-63].
-
-
-\S{insPFRCP} \i\c{PFRCP}: Packed Single-Precision FP Reciprocal Approximation
-
-\c PFRCP mm1,mm2/m64             ; 0F 0F /r 96          [PENT,3DNOW]
-
-\c{PFRCP} performs a low precision estimate of the reciprocal of the
-low-order single-precision FP value in the source operand, storing the
-result in both halves of the destination register. The result is accurate
-to 14 bits.
-
-For higher precision reciprocals, this instruction should be followed by
-two more instructions: \c{PFRCPIT1} (\k{insPFRCPIT1}) and \c{PFRCPIT2}
-(\k{insPFRCPIT2}). This will result in a 24-bit accuracy. For more details,
-see the AMD 3DNow! technology manual.
-
-
-\S{insPFRCPIT1} \i\c{PFRCPIT1}: Packed Single-Precision FP Reciprocal,
-First Iteration Step
-
-\c PFRCPIT1 mm1,mm2/m64          ; 0F 0F /r A6          [PENT,3DNOW]
-
-\c{PFRCPIT1} performs the first intermediate step in the calculation of
-the reciprocal of a single-precision FP value. The first source value
-(\c{mm1}) is the original value, and the second source value (\c{mm2/m64})
-is the result of a \c{PFRCP} instruction.
-
-For the final step in a reciprocal, returning the full 24-bit accuracy
-of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For
-more details, see the AMD 3DNow! technology manual.
-
-
-\S{insPFRCPIT2} \i\c{PFRCPIT2}: Packed Single-Precision FP
-Reciprocal/ Reciprocal Square Root, Second Iteration Step
-
-\c PFRCPIT2 mm1,mm2/m64          ; 0F 0F /r B6          [PENT,3DNOW]
-
-\c{PFRCPIT2} performs the second and final intermediate step in the
-calculation of a reciprocal or reciprocal square root, refining the
-values returned by the \c{PFRCP} and \c{PFRSQRT} instructions,
-respectively.
-
-The first source value (\c{mm1}) is the output of either a \c{PFRCPIT1}
-or a \c{PFRSQIT1} instruction, and the second source is the output of
-either the \c{PFRCP} or the \c{PFRSQRT} instruction. For more details,
-see the AMD 3DNow! technology manual.
-
-
-\S{insPFRSQIT1} \i\c{PFRSQIT1}: Packed Single-Precision FP Reciprocal
-Square Root, First Iteration Step
-
-\c PFRSQIT1 mm1,mm2/m64          ; 0F 0F /r A7          [PENT,3DNOW]
-
-\c{PFRSQIT1} performs the first intermediate step in the calculation of
-the reciprocal square root of a single-precision FP value. The first
-source value (\c{mm1}) is the square of the result of a \c{PFRSQRT}
-instruction, and the second source value (\c{mm2/m64}) is the original
-value.
-
-For the final step in a calculation, returning the full 24-bit accuracy
-of a single-precision FP value, see \c{PFRCPIT2} (\k{insPFRCPIT2}). For
-more details, see the AMD 3DNow! technology manual.
-
-
-\S{insPFRSQRT} \i\c{PFRSQRT}: Packed Single-Precision FP Reciprocal
-Square Root Approximation
-
-\c PFRSQRT mm1,mm2/m64           ; 0F 0F /r 97          [PENT,3DNOW]
-
-\c{PFRSQRT} performs a low precision estimate of the reciprocal square
-root of the low-order single-precision FP value in the source operand,
-storing the result in both halves of the destination register. The result
-is accurate to 15 bits.
-
-For higher precision reciprocal square roots, this instruction should be
-followed by two more instructions: \c{PFRSQIT1} (\k{insPFRSQIT1}) and
-\c{PFRCPIT2} (\k{insPFRCPIT2}). This gives 24-bit accuracy. For more details,
-see the AMD 3DNow! technology manual.
-
-
-\S{insPFSUB} \i\c{PFSUB}: Packed Single-Precision FP Subtract
-
-\c PFSUB mm1,mm2/m64             ; 0F 0F /r 9A          [PENT,3DNOW]
-
-\c{PFSUB} subtracts the single-precision FP values in the source from
-those in the destination, and stores the result in the destination
-operand.
-
-\c    dst[0-31]  := dst[0-31]  - src[0-31],
-\c    dst[32-63] := dst[32-63] - src[32-63].
-
-
-\S{insPFSUBR} \i\c{PFSUBR}: Packed Single-Precision FP Reverse Subtract
-
-\c PFSUBR mm1,mm2/m64            ; 0F 0F /r AA          [PENT,3DNOW]
-
-\c{PFSUBR} subtracts the single-precision FP values in the destination
-from those in the source, and stores the result in the destination
-operand.
-
-\c    dst[0-31]  := src[0-31]  - dst[0-31],
-\c    dst[32-63] := src[32-63] - dst[32-63].
-
-
-\S{insPI2FD} \i\c{PI2FD}: Packed Doubleword Integer to Single-Precision FP Convert
-
-\c PI2FD mm1,mm2/m64             ; 0F 0F /r 0D          [PENT,3DNOW]
-
-\c{PI2FD} converts two signed 32-bit integers in the source operand
-to single-precision FP values, using truncation of significant digits,
-and stores them in the destination operand.
-
-
-\S{insPI2FW} \i\c{PI2FW}: Packed Word Integer to Single-Precision FP Convert
-
-\c PI2FW mm1,mm2/m64             ; 0F 0F /r 0C          [PENT,3DNOW]
-
-\c{PI2FW} converts two signed 16-bit integers in the source operand
-to single-precision FP values, and stores them in the destination
-operand. The input values are in the low word of each doubleword.
-
-
-\S{insPINSRW} \i\c{PINSRW}: Insert Word
-
-\c PINSRW mm,r16/r32/m16,imm8    ; 0F C4 /r ib     [KATMAI,MMX]
-\c PINSRW xmm,r16/r32/m16,imm8   ; 66 0F C4 /r ib  [WILLAMETTE,SSE2]
-
-\c{PINSRW} loads a word from a 16-bit register (or the low half of a
-32-bit register), or from memory, and inserts it at the word position
-in the destination register selected by the count operand (third
-operand). If the destination is an \c{MMX} register, the low two bits
-of the count byte are used; if it is an \c{XMM} register, the low three
-bits are used. The insertion is done in such a way that the other
-words from the destination register are left untouched.
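
The insertion can be modelled with a mask and a shift. This is an illustrative sketch only; the name `pinsrw` and the `reg_bits` parameter (64 for MMX, 128 for XMM) are hypothetical, and the destination register is treated as one integer.

```python
def pinsrw(dst, word, imm8, reg_bits):
    """Replace one 16-bit word of dst, leaving the other words untouched."""
    lanes = reg_bits // 16                    # 4 words for MMX, 8 for XMM
    shift = 16 * (imm8 & (lanes - 1))         # low 2 or 3 count bits select the word
    cleared = dst & ~(0xFFFF << shift)        # clear the target word
    return cleared | ((word & 0xFFFF) << shift)
```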
-
-
-\S{insPMACHRIW} \i\c{PMACHRIW}: Packed Multiply and Accumulate with Rounding
-
-\c PMACHRIW mm,m64               ; 0F 5E /r             [CYRIX,MMX]
-
-\c{PMACHRIW} takes two packed 16-bit integer inputs, multiplies the
-values in the inputs, rounds on bit 15 of each result, then adds bits
-15-30 of each result to the corresponding position of the \e{implied}
-destination register.
-
-The operation of this instruction is:
-
-\c    dstI[0-15]  := dstI[0-15]  + (mm[0-15] *m64[0-15]
-\c                                           + 0x00004000)[15-30],
-\c    dstI[16-31] := dstI[16-31] + (mm[16-31]*m64[16-31]
-\c                                           + 0x00004000)[15-30],
-\c    dstI[32-47] := dstI[32-47] + (mm[32-47]*m64[32-47]
-\c                                           + 0x00004000)[15-30],
-\c    dstI[48-63] := dstI[48-63] + (mm[48-63]*m64[48-63]
-\c                                           + 0x00004000)[15-30].
-
-Note that \c{PMACHRIW} cannot take a register as its second source
-operand.
-
-
-\S{insPMADDWD} \i\c{PMADDWD}: MMX Packed Multiply and Add
-
-\c PMADDWD mm1,mm2/m64           ; 0F F5 /r             [PENT,MMX]
-\c PMADDWD xmm1,xmm2/m128        ; 66 0F F5 /r     [WILLAMETTE,SSE2]
-
-\c{PMADDWD} treats its two inputs as vectors of signed words. It
-multiplies corresponding elements of the two operands, giving doubleword
-results. These are then added together in pairs and stored in the
-destination operand.
-
-The operation of this instruction is:
-
-\c    dst[0-31]   := (dst[0-15] * src[0-15])
-\c                                + (dst[16-31] * src[16-31]);
-\c    dst[32-63]  := (dst[32-47] * src[32-47])
-\c                                + (dst[48-63] * src[48-63]);
-
-The following apply to the \c{SSE2} version of the instruction:
-
-\c    dst[64-95]  := (dst[64-79] * src[64-79])
-\c                                + (dst[80-95] * src[80-95]);
-\c    dst[96-127] := (dst[96-111] * src[96-111])
-\c                                + (dst[112-127] * src[112-127]).
-
-
-\S{insPMAGW} \i\c{PMAGW}: MMX Packed Magnitude
-
-\c PMAGW mm1,mm2/m64             ; 0F 52 /r             [CYRIX,MMX]
-
-\c{PMAGW}, specific to the Cyrix MMX extensions, treats both its
-operands as vectors of four signed words. It compares the absolute
-values of the words in corresponding positions, and sets each word
-of the destination (first) operand to whichever of the two words in
-that position had the larger absolute value.
-
-
-\S{insPMAXSW} \i\c{PMAXSW}: Packed Signed Integer Word Maximum
-
-\c PMAXSW mm1,mm2/m64            ; 0F EE /r        [KATMAI,MMX]
-\c PMAXSW xmm1,xmm2/m128         ; 66 0F EE /r     [WILLAMETTE,SSE2]
-
-\c{PMAXSW} compares each pair of words in the two source operands, and
-for each pair it stores the maximum value in the destination register.
-
-
-\S{insPMAXUB} \i\c{PMAXUB}: Packed Unsigned Integer Byte Maximum
-
-\c PMAXUB mm1,mm2/m64            ; 0F DE /r        [KATMAI,MMX]
-\c PMAXUB xmm1,xmm2/m128         ; 66 0F DE /r     [WILLAMETTE,SSE2]
-
-\c{PMAXUB} compares each pair of bytes in the two source operands, and
-for each pair it stores the maximum value in the destination register.
-
-
-\S{insPMINSW} \i\c{PMINSW}: Packed Signed Integer Word Minimum
-
-\c PMINSW mm1,mm2/m64            ; 0F EA /r        [KATMAI,MMX]
-\c PMINSW xmm1,xmm2/m128         ; 66 0F EA /r     [WILLAMETTE,SSE2]
-
-\c{PMINSW} compares each pair of words in the two source operands, and
-for each pair it stores the minimum value in the destination register.
-
-
-\S{insPMINUB} \i\c{PMINUB}: Packed Unsigned Integer Byte Minimum
-
-\c PMINUB mm1,mm2/m64            ; 0F DA /r        [KATMAI,MMX]
-\c PMINUB xmm1,xmm2/m128         ; 66 0F DA /r     [WILLAMETTE,SSE2]
-
-\c{PMINUB} compares each pair of bytes in the two source operands, and
-for each pair it stores the minimum value in the destination register.
-
-
-\S{insPMOVMSKB} \i\c{PMOVMSKB}: Move Byte Mask To Integer
-
-\c PMOVMSKB reg32,mm             ; 0F D7 /r        [KATMAI,MMX]
-\c PMOVMSKB reg32,xmm            ; 66 0F D7 /r     [WILLAMETTE,SSE2]
-
-\c{PMOVMSKB} returns an 8-bit or 16-bit mask formed of the most
-significant bits of each byte of the source operand (8 bits for an
-\c{MMX} register, 16 bits for an \c{XMM} register).
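-
-A typical use, sketched here under the assumption that \c{MM0} and
-\c{MM1} hold the bytes to be compared, is to turn the result of a
-packed comparison into an ordinary integer that can be tested:
-
-\c         pcmpeqb  mm0,mm1        ; each byte: FFh if equal, else 00h
-\c         pmovmskb eax,mm0        ; bit n of EAX set if byte n matched
-\c         test     eax,eax        ; ZF set if no byte matched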
-
-
-\S{insPMULHRW} \i\c{PMULHRWC}, \i\c{PMULHRIW}: Multiply Packed 16-bit Integers
-With Rounding, and Store High Word
-
-\c PMULHRWC mm1,mm2/m64         ; 0F 59 /r              [CYRIX,MMX]
-\c PMULHRIW mm1,mm2/m64         ; 0F 5D /r              [CYRIX,MMX]
-
-These instructions take two packed 16-bit integer inputs, multiply the
-values in the inputs, round on bit 15 of each result, then store bits
-15-30 of each result to the corresponding position of the destination
-register.
-
-\b For \c{PMULHRWC}, the destination is the first source operand.
-
-\b For \c{PMULHRIW}, the destination is an implied register (worked out
-as described for \c{PADDSIW} (\k{insPADDSIW})).
-
-The operation of this instruction is:
-
-\c    dst[0-15]  := (src1[0-15] *src2[0-15]  + 0x00004000)[15-30]
-\c    dst[16-31] := (src1[16-31]*src2[16-31] + 0x00004000)[15-30]
-\c    dst[32-47] := (src1[32-47]*src2[32-47] + 0x00004000)[15-30]
-\c    dst[48-63] := (src1[48-63]*src2[48-63] + 0x00004000)[15-30]
-
-See also \c{PMULHRWA} (\k{insPMULHRWA}) for a 3DNow! version of this
-instruction.
-
-
-\S{insPMULHRWA} \i\c{PMULHRWA}: Multiply Packed 16-bit Integers
-With Rounding, and Store High Word
-
-\c PMULHRWA mm1,mm2/m64          ; 0F 0F /r B7     [PENT,3DNOW]
-
-\c{PMULHRWA} takes two packed 16-bit integer inputs, multiplies
-the values in the inputs, rounds on bit 16 of each result, then
-stores bits 16-31 of each result to the corresponding position
-of the destination register.
-
-The operation of this instruction is:
-
-\c    dst[0-15]  := (src1[0-15] *src2[0-15]  + 0x00008000)[16-31];
-\c    dst[16-31] := (src1[16-31]*src2[16-31] + 0x00008000)[16-31];
-\c    dst[32-47] := (src1[32-47]*src2[32-47] + 0x00008000)[16-31];
-\c    dst[48-63] := (src1[48-63]*src2[48-63] + 0x00008000)[16-31].
-
-See also \c{PMULHRWC} (\k{insPMULHRW}) for a Cyrix version of this
-instruction.
-
-
-\S{insPMULHUW} \i\c{PMULHUW}: Multiply Packed 16-bit Integers,
-and Store High Word
-
-\c PMULHUW mm1,mm2/m64           ; 0F E4 /r        [KATMAI,MMX]
-\c PMULHUW xmm1,xmm2/m128        ; 66 0F E4 /r     [WILLAMETTE,SSE2]
-
-\c{PMULHUW} takes two packed unsigned 16-bit integer inputs, multiplies
-the values in the inputs, then stores bits 16-31 of each result to the
-corresponding position of the destination register.
-
-
-\S{insPMULHW} \i\c{PMULHW}, \i\c{PMULLW}: Multiply Packed 16-bit Integers,
-and Store
-
-\c PMULHW mm1,mm2/m64            ; 0F E5 /r             [PENT,MMX]
-\c PMULLW mm1,mm2/m64            ; 0F D5 /r             [PENT,MMX]
-
-\c PMULHW xmm1,xmm2/m128         ; 66 0F E5 /r     [WILLAMETTE,SSE2]
-\c PMULLW xmm1,xmm2/m128         ; 66 0F D5 /r     [WILLAMETTE,SSE2]
-
-\c{PMULxW} takes two packed signed 16-bit integer inputs, and
-multiplies the values in the inputs, forming doubleword results.
-
-\b \c{PMULHW} then stores the top 16 bits of each doubleword in the
-destination (first) operand;
-
-\b \c{PMULLW} stores the bottom 16 bits of each doubleword in the
-destination operand.
-
-
-\S{insPMULUDQ} \i\c{PMULUDQ}: Multiply Packed Unsigned
-32-bit Integers, and Store
-
-\c PMULUDQ mm1,mm2/m64           ; 0F F4 /r        [WILLAMETTE,SSE2]
-\c PMULUDQ xmm1,xmm2/m128        ; 66 0F F4 /r     [WILLAMETTE,SSE2]
-
-\c{PMULUDQ} takes two packed unsigned 32-bit integer inputs, and
-multiplies the values in the inputs, forming quadword results. The
-source is either an unsigned doubleword in the low doubleword of a
-64-bit operand, or two unsigned doublewords in the first and
-third doublewords of a 128-bit operand. This produces either one or
-two 64-bit results, which are stored in the respective quadword
-locations of the destination register.
-
-The operation is:
-
-\c    dst[0-63]   := dst[0-31]  * src[0-31];
-\c    dst[64-127] := dst[64-95] * src[64-95].
-
-
-\S{insPMVccZB} \i\c{PMVccZB}: MMX Packed Conditional Move
-
-\c PMVZB mmxreg,mem64            ; 0F 58 /r             [CYRIX,MMX]
-\c PMVNZB mmxreg,mem64           ; 0F 5A /r             [CYRIX,MMX]
-\c PMVLZB mmxreg,mem64           ; 0F 5B /r             [CYRIX,MMX]
-\c PMVGEZB mmxreg,mem64          ; 0F 5C /r             [CYRIX,MMX]
-
-These instructions, specific to the Cyrix MMX extensions, perform
-parallel conditional moves. The two input operands are treated as
-vectors of eight bytes. Each byte of the destination (first) operand
-is either written from the corresponding byte of the source (second)
-operand, or left alone, depending on the value of the byte in the
-\e{implied} operand (specified in the same way as \c{PADDSIW}, in
-\k{insPADDSIW}).
-
-\b \c{PMVZB} performs each move if the corresponding byte in the
-implied operand is zero;
-
-\b \c{PMVNZB} moves if the byte is non-zero;
-
-\b \c{PMVLZB} moves if the byte is less than zero;
-
-\b \c{PMVGEZB} moves if the byte is greater than or equal to zero.
-
-Note that these instructions cannot take a register as their second
-source operand.
-
-
-\S{insPOP} \i\c{POP}: Pop Data from Stack
-
-\c POP reg16                     ; o16 58+r             [8086]
-\c POP reg32                     ; o32 58+r             [386]
-
-\c POP r/m16                     ; o16 8F /0            [8086]
-\c POP r/m32                     ; o32 8F /0            [386]
-
-\c POP CS                        ; 0F                   [8086,UNDOC]
-\c POP DS                        ; 1F                   [8086]
-\c POP ES                        ; 07                   [8086]
-\c POP SS                        ; 17                   [8086]
-\c POP FS                        ; 0F A1                [386]
-\c POP GS                        ; 0F A9                [386]
-
-\c{POP} loads a value from the stack (from \c{[SS:SP]} or
-\c{[SS:ESP]}) and then increments the stack pointer.
-
-The address-size attribute of the instruction determines whether
-\c{SP} or \c{ESP} is used as the stack pointer: to deliberately
-override the default given by the \c{BITS} setting, you can use an
-\i\c{a16} or \i\c{a32} prefix.
-
-The operand-size attribute of the instruction determines whether the
-stack pointer is incremented by 2 or 4: this means that segment
-register pops in \c{BITS 32} mode will pop 4 bytes off the stack and
-discard the upper two of them. If you need to override that, you can
-use an \i\c{o16} or \i\c{o32} prefix.
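-
-For instance, in \c{BITS 32} code (a sketch; the opcode bytes shown
-follow from the listings above and the usual \c{66h} operand-size
-prefix):
-
-\c         pop     ds              ; 1F: pops 4 bytes
-\c         o16 pop ds              ; 66 1F: pops only 2 bytes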
-
-The above opcode listings give two forms for general-purpose
-register pop instructions: for example, \c{POP BX} has the two forms
-\c{5B} and \c{8F C3}. NASM will always generate the shorter form
-when given \c{POP BX}. NDISASM will disassemble both.
-
-\c{POP CS} is not a documented instruction, and is not supported on
-any processor above the 8086 (since they use \c{0Fh} as an opcode
-prefix for instruction set extensions). However, at least some 8086
-processors do support it, and so NASM generates it for completeness.
-
-
-\S{insPOPA} \i\c{POPAx}: Pop All General-Purpose Registers
-
-\c POPA                          ; 61                   [186]
-\c POPAW                         ; o16 61               [186]
-\c POPAD                         ; o32 61               [386]
-
-\b \c{POPAW} pops a word from the stack into each of, successively,
-\c{DI}, \c{SI}, \c{BP}, nothing (it discards a word from the stack
-which was a placeholder for \c{SP}), \c{BX}, \c{DX}, \c{CX} and
-\c{AX}. It is intended to reverse the operation of \c{PUSHAW} (see
-\k{insPUSHA}), but it ignores the value for \c{SP} that was pushed
-on the stack by \c{PUSHAW}.
-
-\b \c{POPAD} pops twice as much data, and places the results in
-\c{EDI}, \c{ESI}, \c{EBP}, nothing (placeholder for \c{ESP}),
-\c{EBX}, \c{EDX}, \c{ECX} and \c{EAX}. It reverses the operation of
-\c{PUSHAD}.
-
-\c{POPA} is an alias mnemonic for either \c{POPAW} or \c{POPAD},
-depending on the current \c{BITS} setting.
-
-Note that the registers are popped in reverse order of their numeric
-values in opcodes (see \k{iref-rv}).
-
-
-\S{insPOPF} \i\c{POPFx}: Pop Flags Register
-
-\c POPF                          ; 9D                   [8086]
-\c POPFW                         ; o16 9D               [8086]
-\c POPFD                         ; o32 9D               [386]
-
-\b \c{POPFW} pops a word from the stack and stores it in the bottom 16
-bits of the flags register (or the whole flags register, on
-processors below a 386).
-
-\b \c{POPFD} pops a doubleword and stores it in the entire flags register.
-
-\c{POPF} is an alias mnemonic for either \c{POPFW} or \c{POPFD},
-depending on the current \c{BITS} setting.
-
-See also \c{PUSHF} (\k{insPUSHF}).
-
-
-\S{insPOR} \i\c{POR}: MMX Bitwise OR
-
-\c POR mm1,mm2/m64               ; 0F EB /r             [PENT,MMX]
-\c POR xmm1,xmm2/m128            ; 66 0F EB /r     [WILLAMETTE,SSE2]
-
-\c{POR} performs a bitwise OR operation between its two operands
-(i.e. each bit of the result is 1 if and only if at least one of the
-corresponding bits of the two inputs was 1), and stores the result
-in the destination (first) operand.
-
-
-\S{insPREFETCH} \i\c{PREFETCH}: Prefetch Data Into Caches
-
-\c PREFETCH mem8                 ; 0F 0D /0             [PENT,3DNOW]
-\c PREFETCHW mem8                ; 0F 0D /1             [PENT,3DNOW]
-
-\c{PREFETCH} and \c{PREFETCHW} fetch the line of data from memory that
-contains the specified byte. \c{PREFETCHW} behaves differently on the
-Athlon than on earlier processors.
-
-For more details, see the 3DNow! Technology Manual.
-
-
-\S{insPREFETCHh} \i\c{PREFETCHh}: Prefetch Data Into Caches
-\I\c{PREFETCHNTA} \I\c{PREFETCHT0} \I\c{PREFETCHT1} \I\c{PREFETCHT2}
-
-\c PREFETCHNTA m8                ; 0F 18 /0        [KATMAI]
-\c PREFETCHT0 m8                 ; 0F 18 /1        [KATMAI]
-\c PREFETCHT1 m8                 ; 0F 18 /2        [KATMAI]
-\c PREFETCHT2 m8                 ; 0F 18 /3        [KATMAI]
-
-The \c{PREFETCHh} instructions fetch the line of data from memory
-that contains the specified byte. It is placed in the cache
-according to rules specified by the locality hint \c{h}.
-
-The hints are:
-
-\b \c{T0} (temporal data) - prefetch data into all levels of the
-cache hierarchy.
-
-\b \c{T1} (temporal data with respect to first level cache) -
-prefetch data into level 2 cache and higher.
-
-\b \c{T2} (temporal data with respect to second level cache) -
-prefetch data into level 2 cache and higher.
-
-\b \c{NTA} (non-temporal data with respect to all cache levels) -
-prefetch data into non-temporal cache structure and into a
-location close to the processor, minimizing cache pollution.
-
-Note that this group of instructions doesn't provide a guarantee
-that the data will be in the cache when it is needed. For more
-details, see the Intel IA32 Software Developer Manual, Volume 2.
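-
-A sketch of typical usage, assuming \c{ESI} points into a buffer
-that is being read sequentially (the offset of 64 is an arbitrary
-choice, not a recommendation):
-
-\c         prefetchnta [esi+64]    ; hint: data ahead will be needed soon
-\c         movq        mm0,[esi]   ; work on the current data meanwhile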
-
-
-\S{insPSADBW} \i\c{PSADBW}: Packed Sum of Absolute Differences
-
-\c PSADBW mm1,mm2/m64            ; 0F F6 /r        [KATMAI,MMX]
-\c PSADBW xmm1,xmm2/m128         ; 66 0F F6 /r     [WILLAMETTE,SSE2]
-
-\c{PSADBW} computes the absolute value of the
-difference of the packed unsigned bytes in the two source operands.
-These differences are then summed to produce a word result in the lower
-16-bit field of the destination register; the rest of the register is
-cleared. The destination operand is an \c{MMX} or an \c{XMM} register.
-The source operand can either be a register or a memory operand.
-
-
-\S{insPSHUFD} \i\c{PSHUFD}: Shuffle Packed Doublewords
-
-\c PSHUFD xmm1,xmm2/m128,imm8    ; 66 0F 70 /r ib  [WILLAMETTE,SSE2]
-
-\c{PSHUFD} shuffles the doublewords in the source (second) operand
-according to the encoding specified by imm8, and stores the result
-in the destination (first) operand.
-
-Bits 0 and 1 of imm8 encode the source position of the doubleword to
-be copied to position 0 in the destination operand. Bits 2 and 3
-encode for position 1, bits 4 and 5 encode for position 2, and bits
-6 and 7 encode for position 3. For example, an encoding of 10 in
-bits 0 and 1 of imm8 indicates that the doubleword at bits 64-95 of
-the source operand will be copied to bits 0-31 of the destination.
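-
-As a sketch of how the encoding works (operand registers chosen
-arbitrarily), the following two immediates reverse the order of the
-doublewords and broadcast the lowest doubleword, respectively:
-
-\c         pshufd  xmm0,xmm1,00011011b  ; positions 0..3 take dwords 3,2,1,0
-\c         pshufd  xmm0,xmm1,00000000b  ; every position takes dword 0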
-
-
-\S{insPSHUFHW} \i\c{PSHUFHW}: Shuffle Packed High Words
-
-\c PSHUFHW xmm1,xmm2/m128,imm8   ; F3 0F 70 /r ib  [WILLAMETTE,SSE2]
-
-\c{PSHUFHW} shuffles the words in the high quadword of the source
-(second) operand according to the encoding specified by imm8, and
-stores the result in the high quadword of the destination (first)
-operand.
-
-The operation of this instruction is similar to the \c{PSHUFW}
-instruction, except that the source and destination are the top
-quadword of a 128-bit operand, instead of being 64-bit operands.
-The low quadword is copied from the source to the destination
-without any changes.
-
-
-\S{insPSHUFLW} \i\c{PSHUFLW}: Shuffle Packed Low Words
-
-\c PSHUFLW xmm1,xmm2/m128,imm8   ; F2 0F 70 /r ib  [WILLAMETTE,SSE2]
-
-\c{PSHUFLW} shuffles the words in the low quadword of the source
-(second) operand according to the encoding specified by imm8, and
-stores the result in the low quadword of the destination (first)
-operand.
-
-The operation of this instruction is similar to the \c{PSHUFW}
-instruction, except that the source and destination are the low
-quadword of a 128-bit operand, instead of being 64-bit operands.
-The high quadword is copied from the source to the destination
-without any changes.
-
-
-\S{insPSHUFW} \i\c{PSHUFW}: Shuffle Packed Words
-
-\c PSHUFW mm1,mm2/m64,imm8       ; 0F 70 /r ib     [KATMAI,MMX]
-
-\c{PSHUFW} shuffles the words in the source (second) operand
-according to the encoding specified by imm8, and stores the result
-in the destination (first) operand.
-
-Bits 0 and 1 of imm8 encode the source position of the word to be
-copied to position 0 in the destination operand. Bits 2 and 3 encode
-for position 1, bits 4 and 5 encode for position 2, and bits 6 and 7
-encode for position 3. For example, an encoding of 10 in bits 0 and 1
-of imm8 indicates that the word at bits 32-47 of the source operand
-will be copied to bits 0-15 of the destination.
-
-
-\S{insPSLLD} \i\c{PSLLx}: Packed Data Bit Shift Left Logical
-
-\c PSLLW mm1,mm2/m64             ; 0F F1 /r             [PENT,MMX]
-\c PSLLW mm,imm8                 ; 0F 71 /6 ib          [PENT,MMX]
-
-\c PSLLW xmm1,xmm2/m128          ; 66 0F F1 /r     [WILLAMETTE,SSE2]
-\c PSLLW xmm,imm8                ; 66 0F 71 /6 ib  [WILLAMETTE,SSE2]
-
-\c PSLLD mm1,mm2/m64             ; 0F F2 /r             [PENT,MMX]
-\c PSLLD mm,imm8                 ; 0F 72 /6 ib          [PENT,MMX]
-
-\c PSLLD xmm1,xmm2/m128          ; 66 0F F2 /r     [WILLAMETTE,SSE2]
-\c PSLLD xmm,imm8                ; 66 0F 72 /6 ib  [WILLAMETTE,SSE2]
-
-\c PSLLQ mm1,mm2/m64             ; 0F F3 /r             [PENT,MMX]
-\c PSLLQ mm,imm8                 ; 0F 73 /6 ib          [PENT,MMX]
-
-\c PSLLQ xmm1,xmm2/m128          ; 66 0F F3 /r     [WILLAMETTE,SSE2]
-\c PSLLQ xmm,imm8                ; 66 0F 73 /6 ib  [WILLAMETTE,SSE2]
-
-\c PSLLDQ xmm1,imm8              ; 66 0F 73 /7 ib  [WILLAMETTE,SSE2]
-
-\c{PSLLx} performs logical left shifts of the data elements in the
-destination (first) operand, moving each bit in the separate elements
-left by the number of bits specified in the source (second) operand,
-clearing the low-order bits as they are vacated. \c{PSLLDQ} 
-shifts bytes, not bits.
-
-\b \c{PSLLW} shifts word sized elements.
-
-\b \c{PSLLD} shifts doubleword sized elements.
-
-\b \c{PSLLQ} shifts quadword sized elements.
-
-\b \c{PSLLDQ} shifts double quadword sized elements.
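-
-The difference between the bit and byte forms can be seen in this
-sketch (register chosen arbitrarily). Each line, taken on its own,
-shifts by eight bit positions, but only \c{PSLLDQ} carries bits
-across the boundary between the two quadwords:
-
-\c         psllq   xmm0,8          ; each quadword shifted left by 8 bits
-\c         pslldq  xmm0,1          ; whole register shifted left by 1 byte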
-
-
-\S{insPSRAD} \i\c{PSRAx}: Packed Data Bit Shift Right Arithmetic
-
-\c PSRAW mm1,mm2/m64             ; 0F E1 /r             [PENT,MMX]
-\c PSRAW mm,imm8                 ; 0F 71 /4 ib          [PENT,MMX]
-
-\c PSRAW xmm1,xmm2/m128          ; 66 0F E1 /r     [WILLAMETTE,SSE2]
-\c PSRAW xmm,imm8                ; 66 0F 71 /4 ib  [WILLAMETTE,SSE2]
-
-\c PSRAD mm1,mm2/m64             ; 0F E2 /r             [PENT,MMX]
-\c PSRAD mm,imm8                 ; 0F 72 /4 ib          [PENT,MMX]
-
-\c PSRAD xmm1,xmm2/m128          ; 66 0F E2 /r     [WILLAMETTE,SSE2]
-\c PSRAD xmm,imm8                ; 66 0F 72 /4 ib  [WILLAMETTE,SSE2]
-
-\c{PSRAx} performs arithmetic right shifts of the data elements in the
-destination (first) operand, moving each bit in the separate elements
-right by the number of bits specified in the source (second) operand,
-setting the high-order bits to the value of the original sign bit.
-
-\b \c{PSRAW} shifts word sized elements.
-
-\b \c{PSRAD} shifts doubleword sized elements.
-
-
-\S{insPSRLD} \i\c{PSRLx}: Packed Data Bit Shift Right Logical
-
-\c PSRLW mm1,mm2/m64             ; 0F D1 /r             [PENT,MMX]
-\c PSRLW mm,imm8                 ; 0F 71 /2 ib          [PENT,MMX]
-
-\c PSRLW xmm1,xmm2/m128          ; 66 0F D1 /r     [WILLAMETTE,SSE2]
-\c PSRLW xmm,imm8                ; 66 0F 71 /2 ib  [WILLAMETTE,SSE2]
-
-\c PSRLD mm1,mm2/m64             ; 0F D2 /r             [PENT,MMX]
-\c PSRLD mm,imm8                 ; 0F 72 /2 ib          [PENT,MMX]
-
-\c PSRLD xmm1,xmm2/m128          ; 66 0F D2 /r     [WILLAMETTE,SSE2]
-\c PSRLD xmm,imm8                ; 66 0F 72 /2 ib  [WILLAMETTE,SSE2]
-
-\c PSRLQ mm1,mm2/m64             ; 0F D3 /r             [PENT,MMX]
-\c PSRLQ mm,imm8                 ; 0F 73 /2 ib          [PENT,MMX]
-
-\c PSRLQ xmm1,xmm2/m128          ; 66 0F D3 /r     [WILLAMETTE,SSE2]
-\c PSRLQ xmm,imm8                ; 66 0F 73 /2 ib  [WILLAMETTE,SSE2]
-
-\c PSRLDQ xmm1,imm8              ; 66 0F 73 /3 ib  [WILLAMETTE,SSE2]
-
-\c{PSRLx} performs logical right shifts of the data elements in the
-destination (first) operand, moving each bit in the separate elements
-right by the number of bits specified in the source (second) operand,
-clearing the high-order bits as they are vacated. \c{PSRLDQ} 
-shifts bytes, not bits.
-
-\b \c{PSRLW} shifts word sized elements.
-
-\b \c{PSRLD} shifts doubleword sized elements.
-
-\b \c{PSRLQ} shifts quadword sized elements.
-
-\b \c{PSRLDQ} shifts double quadword sized elements.
-
-
-\S{insPSUBB} \i\c{PSUBx}: Subtract Packed Integers
-
-\c PSUBB mm1,mm2/m64             ; 0F F8 /r             [PENT,MMX]
-\c PSUBW mm1,mm2/m64             ; 0F F9 /r             [PENT,MMX]
-\c PSUBD mm1,mm2/m64             ; 0F FA /r             [PENT,MMX]
-\c PSUBQ mm1,mm2/m64             ; 0F FB /r        [WILLAMETTE,SSE2]
-
-\c PSUBB xmm1,xmm2/m128          ; 66 0F F8 /r     [WILLAMETTE,SSE2]
-\c PSUBW xmm1,xmm2/m128          ; 66 0F F9 /r     [WILLAMETTE,SSE2]
-\c PSUBD xmm1,xmm2/m128          ; 66 0F FA /r     [WILLAMETTE,SSE2]
-\c PSUBQ xmm1,xmm2/m128          ; 66 0F FB /r     [WILLAMETTE,SSE2]
-
-\c{PSUBx} subtracts packed integers in the source operand from those
-in the destination operand. It doesn't differentiate between signed
-and unsigned integers, and doesn't set any of the flags.
-
-\b \c{PSUBB} operates on byte sized elements.
-
-\b \c{PSUBW} operates on word sized elements.
-
-\b \c{PSUBD} operates on doubleword sized elements.
-
-\b \c{PSUBQ} operates on quadword sized elements.
-
-
-\S{insPSUBSB} \i\c{PSUBSx}, \i\c{PSUBUSx}: Subtract Packed Integers With Saturation
-
-\c PSUBSB mm1,mm2/m64            ; 0F E8 /r             [PENT,MMX]
-\c PSUBSW mm1,mm2/m64            ; 0F E9 /r             [PENT,MMX]
-
-\c PSUBSB xmm1,xmm2/m128         ; 66 0F E8 /r     [WILLAMETTE,SSE2]
-\c PSUBSW xmm1,xmm2/m128         ; 66 0F E9 /r     [WILLAMETTE,SSE2]
-
-\c PSUBUSB mm1,mm2/m64           ; 0F D8 /r             [PENT,MMX]
-\c PSUBUSW mm1,mm2/m64           ; 0F D9 /r             [PENT,MMX]
-
-\c PSUBUSB xmm1,xmm2/m128        ; 66 0F D8 /r     [WILLAMETTE,SSE2]
-\c PSUBUSW xmm1,xmm2/m128        ; 66 0F D9 /r     [WILLAMETTE,SSE2]
-
-\c{PSUBSx} and \c{PSUBUSx} subtract packed integers in the source
-operand from those in the destination operand, and use saturation for
-results that are outside the range supported by the destination operand.
-
-\b \c{PSUBSB} operates on signed bytes, and uses signed saturation on the
-results.
-
-\b \c{PSUBSW} operates on signed words, and uses signed saturation on the
-results.
-
-\b \c{PSUBUSB} operates on unsigned bytes, and uses unsigned saturation on
-the results.
-
-\b \c{PSUBUSW} operates on unsigned words, and uses unsigned saturation on
-the results.
-
-
-\S{insPSUBSIW} \i\c{PSUBSIW}: MMX Packed Subtract with Saturation to
-Implied Destination
-
-\c PSUBSIW mm1,mm2/m64           ; 0F 55 /r             [CYRIX,MMX]
-
-\c{PSUBSIW}, specific to the Cyrix extensions to the MMX instruction
-set, performs the same function as \c{PSUBSW}, except that the
-result is not placed in the register specified by the first operand,
-but instead in the implied destination register, specified as for
-\c{PADDSIW} (\k{insPADDSIW}).
-
-
-\S{insPSWAPD} \i\c{PSWAPD}: Swap Packed Data
-\I\c{PSWAPW}
-
-\c PSWAPD mm1,mm2/m64            ; 0F 0F /r BB     [PENT,3DNOW]
-
-\c{PSWAPD} swaps the packed doublewords in the source operand, and
-stores the result in the destination operand.
-
-In the \c{K6-2} and \c{K6-III} processors, this opcode uses the
-mnemonic \c{PSWAPW}, and it swaps the order of words when copying
-from the source to the destination.
-
-The operation in the \c{K6-2} and \c{K6-III} processors is:
-
-\c    dst[0-15]  = src[48-63];
-\c    dst[16-31] = src[32-47];
-\c    dst[32-47] = src[16-31];
-\c    dst[48-63] = src[0-15].
-
-The operation in the \c{K6-x+}, \c{ATHLON} and later processors is:
-
-\c    dst[0-31]  = src[32-63];
-\c    dst[32-63] = src[0-31].
-
-
-\S{insPUNPCKHBW} \i\c{PUNPCKxxx}: Unpack and Interleave Data
-
-\c PUNPCKHBW mm1,mm2/m64         ; 0F 68 /r             [PENT,MMX]
-\c PUNPCKHWD mm1,mm2/m64         ; 0F 69 /r             [PENT,MMX]
-\c PUNPCKHDQ mm1,mm2/m64         ; 0F 6A /r             [PENT,MMX]
-
-\c PUNPCKHBW xmm1,xmm2/m128      ; 66 0F 68 /r     [WILLAMETTE,SSE2]
-\c PUNPCKHWD xmm1,xmm2/m128      ; 66 0F 69 /r     [WILLAMETTE,SSE2]
-\c PUNPCKHDQ xmm1,xmm2/m128      ; 66 0F 6A /r     [WILLAMETTE,SSE2]
-\c PUNPCKHQDQ xmm1,xmm2/m128     ; 66 0F 6D /r     [WILLAMETTE,SSE2]
-
-\c PUNPCKLBW mm1,mm2/m32         ; 0F 60 /r             [PENT,MMX]
-\c PUNPCKLWD mm1,mm2/m32         ; 0F 61 /r             [PENT,MMX]
-\c PUNPCKLDQ mm1,mm2/m32         ; 0F 62 /r             [PENT,MMX]
-
-\c PUNPCKLBW xmm1,xmm2/m128      ; 66 0F 60 /r     [WILLAMETTE,SSE2]
-\c PUNPCKLWD xmm1,xmm2/m128      ; 66 0F 61 /r     [WILLAMETTE,SSE2]
-\c PUNPCKLDQ xmm1,xmm2/m128      ; 66 0F 62 /r     [WILLAMETTE,SSE2]
-\c PUNPCKLQDQ xmm1,xmm2/m128     ; 66 0F 6C /r     [WILLAMETTE,SSE2]
-
-\c{PUNPCKxx} all treat their operands as vectors, and produce a new
-vector generated by interleaving elements from the two inputs. The
-\c{PUNPCKHxx} instructions start by throwing away the bottom half of
-each input operand, and the \c{PUNPCKLxx} instructions throw away
-the top half.
-
-The remaining elements are then interleaved into the destination,
-alternating elements from the second (source) operand and the first
-(destination) operand: so the leftmost part of each element in the
-result always comes from the second operand, and the rightmost from
-the destination.
-
-\b \c{PUNPCKxBW} works a byte at a time, producing word sized output
-elements.
-
-\b \c{PUNPCKxWD} works a word at a time, producing doubleword sized
-output elements.
-
-\b \c{PUNPCKxDQ} works a doubleword at a time, producing quadword sized
-output elements.
-
-\b \c{PUNPCKxQDQ} works a quadword at a time, producing double quadword
-sized output elements.
-
-So, for example, for \c{MMX} operands, if the first operand held
-\c{0x7A6A5A4A3A2A1A0A} and the second held \c{0x7B6B5B4B3B2B1B0B},
-then:
-
-\b \c{PUNPCKHBW} would return \c{0x7B7A6B6A5B5A4B4A}.
-
-\b \c{PUNPCKHWD} would return \c{0x7B6B7A6A5B4B5A4A}.
-
-\b \c{PUNPCKHDQ} would return \c{0x7B6B5B4B7A6A5A4A}.
-
-\b \c{PUNPCKLBW} would return \c{0x3B3A2B2A1B1A0B0A}.
-
-\b \c{PUNPCKLWD} would return \c{0x3B2B3A2A1B0B1A0A}.
-
-\b \c{PUNPCKLDQ} would return \c{0x3B2B1B0B3A2A1A0A}.
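-
-The \c{PUNPCKLBW} case above could be reproduced with a fragment
-along these lines (the labels \c{a} and \c{b} are hypothetical):
-
-\c a:      dq      0x7A6A5A4A3A2A1A0A
-\c b:      dq      0x7B6B5B4B3B2B1B0B
-
-\c         movq      mm0,[a]       ; first (destination) operand
-\c         punpcklbw mm0,[b]       ; mm0 = 0x3B3A2B2A1B1A0B0A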
-
-
-\S{insPUSH} \i\c{PUSH}: Push Data on Stack
-
-\c PUSH reg16                    ; o16 50+r             [8086]
-\c PUSH reg32                    ; o32 50+r             [386]
-
-\c PUSH r/m16                    ; o16 FF /6            [8086]
-\c PUSH r/m32                    ; o32 FF /6            [386]
-
-\c PUSH CS                       ; 0E                   [8086]
-\c PUSH DS                       ; 1E                   [8086]
-\c PUSH ES                       ; 06                   [8086]
-\c PUSH SS                       ; 16                   [8086]
-\c PUSH FS                       ; 0F A0                [386]
-\c PUSH GS                       ; 0F A8                [386]
-
-\c PUSH imm8                     ; 6A ib                [186]
-\c PUSH imm16                    ; o16 68 iw            [186]
-\c PUSH imm32                    ; o32 68 id            [386]
-
-\c{PUSH} decrements the stack pointer (\c{SP} or \c{ESP}) by 2 or 4,
-and then stores the given value at \c{[SS:SP]} or \c{[SS:ESP]}.
-
-The address-size attribute of the instruction determines whether
-\c{SP} or \c{ESP} is used as the stack pointer: to deliberately
-override the default given by the \c{BITS} setting, you can use an
-\i\c{a16} or \i\c{a32} prefix.
-
-The operand-size attribute of the instruction determines whether the
-stack pointer is decremented by 2 or 4: this means that segment
-register pushes in \c{BITS 32} mode will push 4 bytes on the stack,
-of which the upper two are undefined. If you need to override that,
-you can use an \i\c{o16} or \i\c{o32} prefix.
-
-The above opcode listings give two forms for general-purpose
-\i{register push} instructions: for example, \c{PUSH BX} has the two
-forms \c{53} and \c{FF F3}. NASM will always generate the shorter
-form when given \c{PUSH BX}. NDISASM will disassemble both.
-
-Unlike the undocumented and barely supported \c{POP CS}, \c{PUSH CS}
-is a perfectly valid and sensible instruction, supported on all
-processors.
-
-The instruction \c{PUSH SP} may be used to distinguish an 8086 from
-later processors: on an 8086, the value of \c{SP} stored is the
-value it has \e{after} the push instruction, whereas on later
-processors it is the value \e{before} the push instruction.
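-
-A test of this kind might be sketched as follows (the label
-\c{pushes_new_sp} is hypothetical):
-
-\c         push    sp
-\c         pop     ax              ; AX = the value that was stored
-\c         cmp     ax,sp           ; SP is back to its original value
-\c         jne     pushes_new_sp   ; differ: SP was stored after decrement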
-
-
-\S{insPUSHA} \i\c{PUSHAx}: Push All General-Purpose Registers
-
-\c PUSHA                         ; 60                   [186]
-\c PUSHAD                        ; o32 60               [386]
-\c PUSHAW                        ; o16 60               [186]
-
-\c{PUSHAW} pushes, in succession, \c{AX}, \c{CX}, \c{DX}, \c{BX},
-\c{SP}, \c{BP}, \c{SI} and \c{DI} on the stack, decrementing the
-stack pointer by a total of 16.
-
-\c{PUSHAD} pushes, in succession, \c{EAX}, \c{ECX}, \c{EDX},
-\c{EBX}, \c{ESP}, \c{EBP}, \c{ESI} and \c{EDI} on the stack,
-decrementing the stack pointer by a total of 32.
-
-In both cases, the value of \c{SP} or \c{ESP} pushed is its
-\e{original} value, as it had before the instruction was executed.
-
-\c{PUSHA} is an alias mnemonic for either \c{PUSHAW} or \c{PUSHAD},
-depending on the current \c{BITS} setting.
-
-Note that the registers are pushed in order of their numeric values
-in opcodes (see \k{iref-rv}).
-
-See also \c{POPA} (\k{insPOPA}).
-
-
-\S{insPUSHF} \i\c{PUSHFx}: Push Flags Register
-
-\c PUSHF                         ; 9C                   [8086]
-\c PUSHFD                        ; o32 9C               [386]
-\c PUSHFW                        ; o16 9C               [8086]
-
-\b \c{PUSHFW} pushes the bottom 16 bits of the flags register 
-(or the whole flags register, on processors below a 386) onto
-the stack.
-
-\b \c{PUSHFD} pushes the entire flags register onto the stack.
-
-\c{PUSHF} is an alias mnemonic for either \c{PUSHFW} or \c{PUSHFD},
-depending on the current \c{BITS} setting.
-
-See also \c{POPF} (\k{insPOPF}).
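-
-Since the flags register cannot be read directly with \c{MOV}, one
-common idiom, sketched here for 32-bit code, is to go via the stack:
-
-\c         pushfd
-\c         pop     eax             ; EAX = copy of the flags register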
-
-
-\S{insPXOR} \i\c{PXOR}: MMX Bitwise XOR
-
-\c PXOR mm1,mm2/m64              ; 0F EF /r             [PENT,MMX]
-\c PXOR xmm1,xmm2/m128           ; 66 0F EF /r     [WILLAMETTE,SSE2]
-
-\c{PXOR} performs a bitwise XOR operation between its two operands
-(i.e. each bit of the result is 1 if and only if exactly one of the
-corresponding bits of the two inputs was 1), and stores the result
-in the destination (first) operand.
-
-
-\S{insRCL} \i\c{RCL}, \i\c{RCR}: Bitwise Rotate through Carry Bit
-
-\c RCL r/m8,1                    ; D0 /2                [8086]
-\c RCL r/m8,CL                   ; D2 /2                [8086]
-\c RCL r/m8,imm8                 ; C0 /2 ib             [186]
-\c RCL r/m16,1                   ; o16 D1 /2            [8086]
-\c RCL r/m16,CL                  ; o16 D3 /2            [8086]
-\c RCL r/m16,imm8                ; o16 C1 /2 ib         [186]
-\c RCL r/m32,1                   ; o32 D1 /2            [386]
-\c RCL r/m32,CL                  ; o32 D3 /2            [386]
-\c RCL r/m32,imm8                ; o32 C1 /2 ib         [386]
-
-\c RCR r/m8,1                    ; D0 /3                [8086]
-\c RCR r/m8,CL                   ; D2 /3                [8086]
-\c RCR r/m8,imm8                 ; C0 /3 ib             [186]
-\c RCR r/m16,1                   ; o16 D1 /3            [8086]
-\c RCR r/m16,CL                  ; o16 D3 /3            [8086]
-\c RCR r/m16,imm8                ; o16 C1 /3 ib         [186]
-\c RCR r/m32,1                   ; o32 D1 /3            [386]
-\c RCR r/m32,CL                  ; o32 D3 /3            [386]
-\c RCR r/m32,imm8                ; o32 C1 /3 ib         [386]
-
-\c{RCL} and \c{RCR} perform a 9-bit, 17-bit or 33-bit bitwise
-rotation operation, involving the given source/destination (first)
-operand and the carry bit. Thus, for example, in the operation
-\c{RCL AL,1}, a 9-bit rotation is performed in which \c{AL} is
-shifted left by 1, the top bit of \c{AL} moves into the carry flag,
-and the original value of the carry flag is placed in the low bit of
-\c{AL}.
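-
-One common use, sketched here on the assumption that a 64-bit
-quantity is held in \c{EDX:EAX}, is to carry a bit from one register
-into the next during a multi-word shift:
-
-\c         shl     eax,1           ; top bit of EAX moves into CF
-\c         rcl     edx,1           ; CF enters the low bit of EDX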
-
-The number of bits to rotate by is given by the second operand. Only
-the bottom five bits of the rotation count are considered by
-processors above the 8086.
-
-You can force the longer (186 and upwards, beginning with a \c{C1}
-byte) form of \c{RCL foo,1} by using a \c{BYTE} prefix: \c{RCL
-foo,BYTE 1}. Similarly with \c{RCR}.
-
-
-\S{insRCPPS} \i\c{RCPPS}: Packed Single-Precision FP Reciprocal
-
-\c RCPPS xmm1,xmm2/m128          ; 0F 53 /r        [KATMAI,SSE]
-
-\c{RCPPS} returns an approximation of the reciprocal of the packed
-single-precision FP values from xmm2/m128. The maximum error for this
-approximation is: |Error| <= 1.5 x 2^-12
-
-
-\S{insRCPSS} \i\c{RCPSS}: Scalar Single-Precision FP Reciprocal
-
-\c RCPSS xmm1,xmm2/m32           ; F3 0F 53 /r     [KATMAI,SSE]
-
-\c{RCPSS} returns an approximation of the reciprocal of the lower
-single-precision FP value from xmm2/m32; the upper three fields are
-passed through from xmm1. The maximum error for this approximation is:
-|Error| <= 1.5 x 2^-12
-
-
-\S{insRDMSR} \i\c{RDMSR}: Read Model-Specific Registers
-
-\c RDMSR                         ; 0F 32                [PENT,PRIV]
-
-\c{RDMSR} reads the processor Model-Specific Register (MSR) whose
-index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}.
-See also \c{WRMSR} (\k{insWRMSR}).
-
-
-\S{insRDPMC} \i\c{RDPMC}: Read Performance-Monitoring Counters
-
-\c RDPMC                         ; 0F 33                [P6]
-
-\c{RDPMC} reads the processor performance-monitoring counter whose
-index is stored in \c{ECX}, and stores the result in \c{EDX:EAX}.
-
-This instruction is available on P6 and later processors and on MMX
-class processors.
-
-
-\S{insRDSHR} \i\c{RDSHR}: Read SMM Header Pointer Register
-
-\c RDSHR r/m32                   ; 0F 36 /0        [386,CYRIX,SMM]
-
-\c{RDSHR} reads the contents of the SMM header pointer register and
-saves it to the destination operand, which can be either a 32 bit
-memory location or a 32 bit register.
-
-See also \c{WRSHR} (\k{insWRSHR}).
-
-
-\S{insRDTSC} \i\c{RDTSC}: Read Time-Stamp Counter
-
-\c RDTSC                         ; 0F 31                [PENT]
-
-\c{RDTSC} reads the processor's time-stamp counter into \c{EDX:EAX}.
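-
-A rough sketch of interval timing follows. The choice of \c{ESI} and
-\c{EDI} as scratch registers is an assumption of the example, and
-since \c{RDTSC} is not a serialising instruction the result should be
-treated as approximate:
-
-\c         rdtsc                   ; EDX:EAX = starting count
-\c         mov     esi,eax
-\c         mov     edi,edx
-\c         ; ... code to be timed (must preserve ESI and EDI) ...
-\c         rdtsc                   ; EDX:EAX = ending count
-\c         sub     eax,esi
-\c         sbb     edx,edi         ; EDX:EAX = elapsed count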
-
-
-\S{insRET} \i\c{RET}, \i\c{RETF}, \i\c{RETN}: Return from Procedure Call
-
-\c RET                           ; C3                   [8086]
-\c RET imm16                     ; C2 iw                [8086]
-
-\c RETF                          ; CB                   [8086]
-\c RETF imm16                    ; CA iw                [8086]
-
-\c RETN                          ; C3                   [8086]
-\c RETN imm16                    ; C2 iw                [8086]
-
-\b \c{RET}, and its exact synonym \c{RETN}, pop \c{IP} or \c{EIP} from
-the stack and transfer control to the new address. Optionally, if a
-numeric operand is provided, they increment the stack pointer
-by a further \c{imm16} bytes after popping the return address.
-
-\b \c{RETF} executes a far return: after popping \c{IP}/\c{EIP}, it
-then pops \c{CS}, and \e{then} increments the stack pointer by the
-optional argument if present.
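-
-For instance, a 16-bit near procedure whose caller pushes two word
-arguments could remove them itself on return. The label \c{addwords}
-and the stack layout are assumptions made for this sketch:
-
-\c addwords:
-\c         push    bp
-\c         mov     bp,sp
-\c         mov     ax,[bp+4]       ; argument pushed last
-\c         add     ax,[bp+6]       ; argument pushed first
-\c         pop     bp
-\c         ret     4               ; pop IP, then discard the 4 argument bytes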
-
-
-\S{insROL} \i\c{ROL}, \i\c{ROR}: Bitwise Rotate
-
-\c ROL r/m8,1                    ; D0 /0                [8086]
-\c ROL r/m8,CL                   ; D2 /0                [8086]
-\c ROL r/m8,imm8                 ; C0 /0 ib             [186]
-\c ROL r/m16,1                   ; o16 D1 /0            [8086]
-\c ROL r/m16,CL                  ; o16 D3 /0            [8086]
-\c ROL r/m16,imm8                ; o16 C1 /0 ib         [186]
-\c ROL r/m32,1                   ; o32 D1 /0            [386]
-\c ROL r/m32,CL                  ; o32 D3 /0            [386]
-\c ROL r/m32,imm8                ; o32 C1 /0 ib         [386]
-
-\c ROR r/m8,1                    ; D0 /1                [8086]
-\c ROR r/m8,CL                   ; D2 /1                [8086]
-\c ROR r/m8,imm8                 ; C0 /1 ib             [186]
-\c ROR r/m16,1                   ; o16 D1 /1            [8086]
-\c ROR r/m16,CL                  ; o16 D3 /1            [8086]
-\c ROR r/m16,imm8                ; o16 C1 /1 ib         [186]
-\c ROR r/m32,1                   ; o32 D1 /1            [386]
-\c ROR r/m32,CL                  ; o32 D3 /1            [386]
-\c ROR r/m32,imm8                ; o32 C1 /1 ib         [386]
-
-\c{ROL} and \c{ROR} perform a bitwise rotation operation on the given
-source/destination (first) operand. Thus, for example, in the
-operation \c{ROL AL,1}, an 8-bit rotation is performed in which
-\c{AL} is shifted left by 1 and the original top bit of \c{AL} moves
-round into the low bit.
-
-The number of bits to rotate by is given by the second operand. Only
-the bottom five bits of the rotation count are considered by processors
-above the 8086.
-
-You can force the longer (186 and upwards, beginning with a \c{C1}
-byte) form of \c{ROL foo,1} by using a \c{BYTE} prefix: \c{ROL
-foo,BYTE 1}. Similarly with \c{ROR}.
-
-
-\S{insRSDC} \i\c{RSDC}: Restore Segment Register and Descriptor
-
-\c RSDC segreg,m80               ; 0F 79 /r        [486,CYRIX,SMM]
-
-\c{RSDC} restores a segment register (DS, ES, FS, GS, or SS) from mem80,
-and sets up its descriptor.
-
-
-\S{insRSLDT} \i\c{RSLDT}: Restore LDTR and Descriptor
-
-\c RSLDT m80                     ; 0F 7B /0        [486,CYRIX,SMM]
-
-\c{RSLDT} restores the Local Descriptor Table Register (LDTR) from mem80.
-
-
-\S{insRSM} \i\c{RSM}: Resume from System-Management Mode
-
-\c RSM                           ; 0F AA                [PENT]
-
-\c{RSM} returns the processor to its normal operating mode when it
-was in System-Management Mode.
-
-
-\S{insRSQRTPS} \i\c{RSQRTPS}: Packed Single-Precision FP Square Root Reciprocal
-
-\c RSQRTPS xmm1,xmm2/m128        ; 0F 52 /r        [KATMAI,SSE]
-
-\c{RSQRTPS} computes the approximate reciprocals of the square
-roots of the packed single-precision floating-point values in the
-source and stores the results in xmm1. The maximum error for this
-approximation is: |Error| <= 1.5 x 2^-12
-
-
-\S{insRSQRTSS} \i\c{RSQRTSS}: Scalar Single-Precision FP Square Root Reciprocal
-
-\c RSQRTSS xmm1,xmm2/m32         ; F3 0F 52 /r     [KATMAI,SSE]
-
-\c{RSQRTSS} returns an approximation of the reciprocal of the
-square root of the lowest order single-precision FP value from
-the source, and stores it in the low doubleword of the destination
-register. The upper three fields of xmm1 are preserved. The maximum
-error for this approximation is: |Error| <= 1.5 x 2^-12
-
-
-\S{insRSTS} \i\c{RSTS}: Restore TSR and Descriptor
-
-\c RSTS m80                      ; 0F 7D /0        [486,CYRIX,SMM]
-
-\c{RSTS} restores the Task State Register (TSR) from mem80.
-
-
-\S{insSAHF} \i\c{SAHF}: Store AH to Flags
-
-\c SAHF                          ; 9E                   [8086]
-
-\c{SAHF} sets the low byte of the flags word according to the
-contents of the \c{AH} register.
-
-The operation of \c{SAHF} is:
-
-\c  AH --> SF:ZF:0:AF:0:PF:1:CF
-
-See also \c{LAHF} (\k{insLAHF}).
-
-
-\S{insSAL} \i\c{SAL}, \i\c{SAR}: Bitwise Arithmetic Shifts
-
-\c SAL r/m8,1                    ; D0 /4                [8086]
-\c SAL r/m8,CL                   ; D2 /4                [8086]
-\c SAL r/m8,imm8                 ; C0 /4 ib             [186]
-\c SAL r/m16,1                   ; o16 D1 /4            [8086]
-\c SAL r/m16,CL                  ; o16 D3 /4            [8086]
-\c SAL r/m16,imm8                ; o16 C1 /4 ib         [186]
-\c SAL r/m32,1                   ; o32 D1 /4            [386]
-\c SAL r/m32,CL                  ; o32 D3 /4            [386]
-\c SAL r/m32,imm8                ; o32 C1 /4 ib         [386]
-
-\c SAR r/m8,1                    ; D0 /7                [8086]
-\c SAR r/m8,CL                   ; D2 /7                [8086]
-\c SAR r/m8,imm8                 ; C0 /7 ib             [186]
-\c SAR r/m16,1                   ; o16 D1 /7            [8086]
-\c SAR r/m16,CL                  ; o16 D3 /7            [8086]
-\c SAR r/m16,imm8                ; o16 C1 /7 ib         [186]
-\c SAR r/m32,1                   ; o32 D1 /7            [386]
-\c SAR r/m32,CL                  ; o32 D3 /7            [386]
-\c SAR r/m32,imm8                ; o32 C1 /7 ib         [386]
-
-\c{SAL} and \c{SAR} perform an arithmetic shift operation on the given
-source/destination (first) operand. The vacated bits are filled with
-zero for \c{SAL}, and with copies of the original high bit of the
-source operand for \c{SAR}.
-
-\c{SAL} is a synonym for \c{SHL} (see \k{insSHL}). NASM will
-assemble either one to the same code, but NDISASM will always
-disassemble that code as \c{SHL}.
-
-The number of bits to shift by is given by the second operand. Only
-the bottom five bits of the shift count are considered by processors
-above the 8086.
-
-You can force the longer (186 and upwards, beginning with a \c{C1}
-byte) form of \c{SAL foo,1} by using a \c{BYTE} prefix: \c{SAL
-foo,BYTE 1}. Similarly with \c{SAR}.
-
-
-\S{insSALC} \i\c{SALC}: Set AL from Carry Flag
-
-\c SALC                          ; D6                  [8086,UNDOC]
-
-\c{SALC} is an early undocumented instruction similar in concept to
-\c{SETcc} (\k{insSETcc}). Its function is to set \c{AL} to zero if
-the carry flag is clear, or to \c{0xFF} if it is set.
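-
-Where an undocumented opcode is best avoided, a similar effect can
-be sketched with \c{SBB} (\k{insSBB}); unlike \c{SALC}, this
-alternative modifies the flags:
-
-\c         sbb     al,al           ; AL = AL - AL - CF: 0x00 or 0xFF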
-
-
-\S{insSBB} \i\c{SBB}: Subtract with Borrow
-
-\c SBB r/m8,reg8                 ; 18 /r                [8086]
-\c SBB r/m16,reg16               ; o16 19 /r            [8086]
-\c SBB r/m32,reg32               ; o32 19 /r            [386]
-
-\c SBB reg8,r/m8                 ; 1A /r                [8086]
-\c SBB reg16,r/m16               ; o16 1B /r            [8086]
-\c SBB reg32,r/m32               ; o32 1B /r            [386]
-
-\c SBB r/m8,imm8                 ; 80 /3 ib             [8086]
-\c SBB r/m16,imm16               ; o16 81 /3 iw         [8086]
-\c SBB r/m32,imm32               ; o32 81 /3 id         [386]
-
-\c SBB r/m16,imm8                ; o16 83 /3 ib         [8086]
-\c SBB r/m32,imm8                ; o32 83 /3 ib         [386]
-
-\c SBB AL,imm8                   ; 1C ib                [8086]
-\c SBB AX,imm16                  ; o16 1D iw            [8086]
-\c SBB EAX,imm32                 ; o32 1D id            [386]
-
-\c{SBB} performs integer subtraction: it subtracts its second
-operand, plus the value of the carry flag, from its first, and
-leaves the result in its destination (first) operand. The flags are
-set according to the result of the operation: in particular, the
-carry flag is affected and can be used by a subsequent \c{SBB}
-instruction.
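-
-As a sketch of this chaining, the following subtracts the 64-bit
-value in \c{EBX:ECX} from the one in \c{EDX:EAX}; the register
-assignment is an assumption of the example:
-
-\c         sub     eax,ecx         ; low halves; CF = borrow out
-\c         sbb     edx,ebx         ; high halves, minus the borrow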
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-To subtract one number from another without also subtracting the
-contents of the carry flag, use \c{SUB} (\k{insSUB}).
-
-
-\S{insSCASB} \i\c{SCASB}, \i\c{SCASW}, \i\c{SCASD}: Scan String
-
-\c SCASB                         ; AE                   [8086]
-\c SCASW                         ; o16 AF               [8086]
-\c SCASD                         ; o32 AF               [386]
-
-\c{SCASB} compares the byte in \c{AL} with the byte at \c{[ES:DI]}
-or \c{[ES:EDI]}, and sets the flags accordingly. It then increments
-or decrements (depending on the direction flag: increments if the
-flag is clear, decrements if it is set) \c{DI} (or \c{EDI}).
-
-The register used is \c{DI} if the address size is 16 bits, and
-\c{EDI} if it is 32 bits. If you need to use an address size not
-equal to the current \c{BITS} setting, you can use an explicit
-\i\c{a16} or \i\c{a32} prefix.
-
-Segment override prefixes have no effect for this instruction: the
-use of \c{ES} for the load from \c{[DI]} or \c{[EDI]} cannot be
-overridden.
-
-\c{SCASW} and \c{SCASD} work in the same way, but they compare a
-word to \c{AX} or a doubleword to \c{EAX} instead of a byte to
-\c{AL}, and increment or decrement the addressing registers by 2 or
-4 instead of 1.
-
-The \c{REPE} and \c{REPNE} prefixes (equivalently, \c{REPZ} and
-\c{REPNZ}) may be used to repeat the instruction up to \c{CX} (or
-\c{ECX} - again, the address size chooses which) times until the
-first unequal or equal byte is found.
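-
-A well-known application is finding the length of a NUL-terminated
-string. The sketch below assumes 32-bit code, that \c{ES} addresses
-the string, and a hypothetical label \c{string}:
-
-\c         cld                     ; scan upwards through memory
-\c         mov     edi,string
-\c         xor     al,al           ; byte to search for
-\c         mov     ecx,-1          ; effectively no limit
-\c         repne scasb             ; stops once the NUL has been compared
-\c         not     ecx
-\c         dec     ecx             ; ECX = length, excluding the NUL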
-
-
-\S{insSETcc} \i\c{SETcc}: Set Register from Condition
-
-\c SETcc r/m8                    ; 0F 90+cc /2          [386]
-
-\c{SETcc} sets the given 8-bit operand to zero if its condition is
-not satisfied, and to 1 if it is.
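-
-For example, a comparison result can be turned into a 0 or 1 value
-without a branch; the registers used here are arbitrary:
-
-\c         cmp     eax,ebx
-\c         setl    cl              ; CL = 1 if EAX < EBX (signed), else 0
-\c         movzx   ecx,cl          ; widen the result to 32 bits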
-
-
-\S{insSFENCE} \i\c{SFENCE}: Store Fence
-
-\c SFENCE                 ; 0F AE /7               [KATMAI]
-
-\c{SFENCE} performs a serialising operation on all writes to memory
-that were issued before the \c{SFENCE} instruction. This guarantees that
-all memory writes before the \c{SFENCE} instruction are visible before any
-writes after the \c{SFENCE} instruction.
-
-\c{SFENCE} is ordered with respect to other \c{SFENCE} instructions,
-\c{MFENCE}, any memory write and any other serialising instruction
-(such as \c{CPUID}).
-
-Weakly ordered memory types can be used to achieve higher processor
-performance through such techniques as out-of-order issue,
-write-combining, and write-collapsing. The degree to which a consumer
-of data recognizes or knows that the data is weakly ordered varies
-among applications and may be unknown to the producer of this data.
-The \c{SFENCE} instruction provides a performance-efficient way of
-ensuring store ordering between routines that produce weakly-ordered
-results and routines that consume this data.
-
-\c{SFENCE} uses the following ModRM encoding:
-
-\c           Mod (7:6)        = 11B
-\c           Reg/Opcode (5:3) = 111B
-\c           R/M (2:0)        = 000B
-
-All other ModRM encodings are defined to be reserved, and use
-of these encodings risks incompatibility with future processors.
-
-See also \c{LFENCE} (\k{insLFENCE}) and \c{MFENCE} (\k{insMFENCE}).
-
-
-\S{insSGDT} \i\c{SGDT}, \i\c{SIDT}, \i\c{SLDT}: Store Descriptor Table Pointers
-
-\c SGDT mem                      ; 0F 01 /0             [286,PRIV]
-\c SIDT mem                      ; 0F 01 /1             [286,PRIV]
-\c SLDT r/m16                    ; 0F 00 /0             [286,PRIV]
-
-\c{SGDT} and \c{SIDT} both take a 6-byte memory area as an operand:
-they store the contents of the GDTR (global descriptor table
-register) or IDTR (interrupt descriptor table register) into that
-area as a 16-bit size limit followed by a 32-bit linear address.
-These are the only instructions which directly
-use \e{linear} addresses, rather than segment/offset pairs.
-
-\c{SLDT} stores the segment selector corresponding to the LDT (local
-descriptor table) into the given operand.
-
-See also \c{LGDT}, \c{LIDT} and \c{LLDT} (\k{insLGDT}).
-
-
-\S{insSHL} \i\c{SHL}, \i\c{SHR}: Bitwise Logical Shifts
-
-\c SHL r/m8,1                    ; D0 /4                [8086]
-\c SHL r/m8,CL                   ; D2 /4                [8086]
-\c SHL r/m8,imm8                 ; C0 /4 ib             [186]
-\c SHL r/m16,1                   ; o16 D1 /4            [8086]
-\c SHL r/m16,CL                  ; o16 D3 /4            [8086]
-\c SHL r/m16,imm8                ; o16 C1 /4 ib         [186]
-\c SHL r/m32,1                   ; o32 D1 /4            [386]
-\c SHL r/m32,CL                  ; o32 D3 /4            [386]
-\c SHL r/m32,imm8                ; o32 C1 /4 ib         [386]
-
-\c SHR r/m8,1                    ; D0 /5                [8086]
-\c SHR r/m8,CL                   ; D2 /5                [8086]
-\c SHR r/m8,imm8                 ; C0 /5 ib             [186]
-\c SHR r/m16,1                   ; o16 D1 /5            [8086]
-\c SHR r/m16,CL                  ; o16 D3 /5            [8086]
-\c SHR r/m16,imm8                ; o16 C1 /5 ib         [186]
-\c SHR r/m32,1                   ; o32 D1 /5            [386]
-\c SHR r/m32,CL                  ; o32 D3 /5            [386]
-\c SHR r/m32,imm8                ; o32 C1 /5 ib         [386]
-
-\c{SHL} and \c{SHR} perform a logical shift operation on the given
-source/destination (first) operand. The vacated bits are filled with
-zero.
-
-A synonym for \c{SHL} is \c{SAL} (see \k{insSAL}). NASM will
-assemble either one to the same code, but NDISASM will always
-disassemble that code as \c{SHL}.
-
-The number of bits to shift by is given by the second operand. Only
-the bottom five bits of the shift count are considered by processors
-above the 8086.
-
-You can force the longer (186 and upwards, beginning with a \c{C1}
-byte) form of \c{SHL foo,1} by using a \c{BYTE} prefix: \c{SHL
-foo,BYTE 1}. Similarly with \c{SHR}.
-
-
-\S{insSHLD} \i\c{SHLD}, \i\c{SHRD}: Bitwise Double-Precision Shifts
-
-\c SHLD r/m16,reg16,imm8         ; o16 0F A4 /r ib      [386]
-\c SHLD r/m32,reg32,imm8         ; o32 0F A4 /r ib      [386]
-\c SHLD r/m16,reg16,CL           ; o16 0F A5 /r         [386]
-\c SHLD r/m32,reg32,CL           ; o32 0F A5 /r         [386]
-
-\c SHRD r/m16,reg16,imm8         ; o16 0F AC /r ib      [386]
-\c SHRD r/m32,reg32,imm8         ; o32 0F AC /r ib      [386]
-\c SHRD r/m16,reg16,CL           ; o16 0F AD /r         [386]
-\c SHRD r/m32,reg32,CL           ; o32 0F AD /r         [386]
-
-\b \c{SHLD} performs a double-precision left shift. It notionally
-places its second operand to the right of its first, then shifts
-the entire bit string thus generated to the left by a number of
-bits specified in the third operand. It then updates only the
-\e{first} operand according to the result of this. The second
-operand is not modified.
-
-\b \c{SHRD} performs the corresponding right shift: it notionally
-places the second operand to the \e{left} of the first, shifts the
-whole bit string right, and updates only the first operand.
-
-For example, if \c{EAX} holds \c{0x01234567} and \c{EBX} holds
-\c{0x89ABCDEF}, then the instruction \c{SHLD EAX,EBX,4} would update
-\c{EAX} to hold \c{0x12345678}. Under the same conditions, \c{SHRD
-EAX,EBX,4} would update \c{EAX} to hold \c{0xF0123456}.
-
-The number of bits to shift by is given by the third operand. Only
-the bottom five bits of the shift count are considered.
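-
-A typical application, sketched on the assumption that a 64-bit
-value is held in \c{EDX:EAX} and that the count in \c{CL} is below
-32, is a multi-word left shift:
-
-\c         shld    edx,eax,cl      ; EDX takes its new low bits from EAX
-\c         shl     eax,cl          ; then shift the low half itself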
-
-
-\S{insSHUFPD} \i\c{SHUFPD}: Shuffle Packed Double-Precision FP Values
-
-\c SHUFPD xmm1,xmm2/m128,imm8    ; 66 0F C6 /r ib  [WILLAMETTE,SSE2]
-
-\c{SHUFPD} moves one of the packed double-precision FP values from
-the destination operand into the low quadword of the destination
-operand; the upper quadword is generated by moving one of the
-double-precision FP values from the source operand into the
-destination. The select (third) operand selects which of the values
-are moved to the destination register.
-
-The select operand is an 8-bit immediate: bit 0 selects which value
-is moved from the destination operand to the result (where 0 selects
-the low quadword and 1 selects the high quadword) and bit 1 selects
-which value is moved from the source operand to the result.
-Bits 2 through 7 of the shuffle operand are reserved.
-
-
-\S{insSHUFPS} \i\c{SHUFPS}: Shuffle Packed Single-Precision FP Values
-
-\c SHUFPS xmm1,xmm2/m128,imm8    ; 0F C6 /r ib     [KATMAI,SSE]
-
-\c{SHUFPS} moves two of the packed single-precision FP values from
-the destination operand into the low quadword of the destination
-operand; the upper quadword is generated by moving two of the
-single-precision FP values from the source operand into the
-destination. The select (third) operand selects which of the
-values are moved to the destination register.
-
-The select operand is an 8-bit immediate: bits 0 and 1 select the
-value to be moved from the destination operand to the low doubleword
-of the result, bits 2 and 3 select the value to be moved from the
-destination operand to the second doubleword of the result, bits 4
-and 5 select the value to be moved from the source operand to the
-third doubleword of the result, and bits 6 and 7 select the value to
-be moved from the source operand to the high doubleword of the result.
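-
-Two illustrative select values, with the same register used as both
-operands (the choice of \c{XMM0} and \c{XMM1} is arbitrary):
-
-\c         shufps  xmm0,xmm0,0x00  ; copy the low doubleword to all four
-\c         shufps  xmm1,xmm1,0x1B  ; reverse the order of the four doublewords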
-
-
-\S{insSMI} \i\c{SMI}: System Management Interrupt
-
-\c SMI                           ; F1                   [386,UNDOC]
-
-\c{SMI} puts some AMD processors into SMM mode. It is available on some
-386 and 486 processors, and is only available when DR7 bit 12 is set,
-otherwise it generates an Int 1.
-
-
-\S{insSMINT} \i\c{SMINT}, \i\c{SMINTOLD}: Software SMM Entry (CYRIX)
-
-\c SMINT                         ; 0F 38                [PENT,CYRIX]
-\c SMINTOLD                      ; 0F 7E                [486,CYRIX]
-
-\c{SMINT} puts the processor into SMM mode. The CPU state information is
-saved in the SMM memory header, and then execution begins at the SMM base
-address.
-
-\c{SMINTOLD} is the same as \c{SMINT}, but was the opcode used on the 486.
-
-This pair of opcodes is specific to the Cyrix and compatible range of
-processors (Cyrix, IBM, Via).
-
-
-\S{insSMSW} \i\c{SMSW}: Store Machine Status Word
-
-\c SMSW r/m16                    ; 0F 01 /4             [286,PRIV]
-
-\c{SMSW} stores the bottom half of the \c{CR0} control register (or
-the Machine Status Word, on 286 processors) into the destination
-operand. See also \c{LMSW} (\k{insLMSW}).
-
-For 32-bit code, this would store all of \c{CR0} in the specified
-register (or the bottom 16 bits if the destination is a memory location),
-without needing an operand size override byte.
-
-
-\S{insSQRTPD} \i\c{SQRTPD}: Packed Double-Precision FP Square Root
-
-\c SQRTPD xmm1,xmm2/m128         ; 66 0F 51 /r     [WILLAMETTE,SSE2]
-
-\c{SQRTPD} calculates the square roots of the packed double-precision
-FP values from the source operand, and stores the double-precision
-results in the destination register.
-
-
-\S{insSQRTPS} \i\c{SQRTPS}: Packed Single-Precision FP Square Root
-
-\c SQRTPS xmm1,xmm2/m128         ; 0F 51 /r        [KATMAI,SSE]
-
-\c{SQRTPS} calculates the square roots of the packed single-precision
-FP values from the source operand, and stores the single-precision
-results in the destination register.
-
-
-\S{insSQRTSD} \i\c{SQRTSD}: Scalar Double-Precision FP Square Root
-
-\c SQRTSD xmm1,xmm2/m64          ; F2 0F 51 /r     [WILLAMETTE,SSE2]
-
-\c{SQRTSD} calculates the square root of the low-order double-precision
-FP value from the source operand, and stores the double-precision
-result in the destination register. The high-quadword remains unchanged.
-
-
-\S{insSQRTSS} \i\c{SQRTSS}: Scalar Single-Precision FP Square Root
-
-\c SQRTSS xmm1,xmm2/m32          ; F3 0F 51 /r     [KATMAI,SSE]
-
-\c{SQRTSS} calculates the square root of the low-order single-precision
-FP value from the source operand, and stores the single-precision
-result in the destination register. The three high doublewords remain
-unchanged.
-
-
-\S{insSTC} \i\c{STC}, \i\c{STD}, \i\c{STI}: Set Flags
-
-\c STC                           ; F9                   [8086]
-\c STD                           ; FD                   [8086]
-\c STI                           ; FB                   [8086]
-
-These instructions set various flags. \c{STC} sets the carry flag;
-\c{STD} sets the direction flag; and \c{STI} sets the interrupt flag
-(thus enabling interrupts).
-
-To clear the carry, direction, or interrupt flags, use the \c{CLC},
-\c{CLD} and \c{CLI} instructions (\k{insCLC}). To invert the carry
-flag, use \c{CMC} (\k{insCMC}).
-
-
-\S{insSTMXCSR} \i\c{STMXCSR}: Store Streaming SIMD Extension
- Control/Status
-
-\c STMXCSR m32                   ; 0F AE /3        [KATMAI,SSE]
-
-\c{STMXCSR} stores the contents of the \c{MXCSR} control/status
-register to the specified memory location. \c{MXCSR} is used to
-enable masked/unmasked exception handling, to set rounding modes,
-to set flush-to-zero mode, and to view exception status flags.
-The reserved bits in the \c{MXCSR} register are stored as 0s.
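-
-\c{MXCSR} is usually changed by a store, edit, reload sequence. A
-sketch, assuming a hypothetical doubleword variable \c{mxcsr_tmp}
-and using the flush-to-zero bit (bit 15) as the example:
-
-\c         stmxcsr [mxcsr_tmp]
-\c         or      dword [mxcsr_tmp],0x8000   ; set FZ (bit 15)
-\c         ldmxcsr [mxcsr_tmp]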
-
-For details of the \c{MXCSR} register, see the Intel processor docs.
-
-See also \c{LDMXCSR} (\k{insLDMXCSR}).
-
-
-\S{insSTOSB} \i\c{STOSB}, \i\c{STOSW}, \i\c{STOSD}: Store Byte to String
-
-\c STOSB                         ; AA                   [8086]
-\c STOSW                         ; o16 AB               [8086]
-\c STOSD                         ; o32 AB               [386]
-
-\c{STOSB} stores the byte in \c{AL} at \c{[ES:DI]} or \c{[ES:EDI]}.
-It then increments or decrements (depending on the direction flag:
-increments if the flag is clear, decrements if it is set) \c{DI} (or
-\c{EDI}). No flags are affected.
-
-The register used is \c{DI} if the address size is 16 bits, and
-\c{EDI} if it is 32 bits. If you need to use an address size not
-equal to the current \c{BITS} setting, you can use an explicit
-\i\c{a16} or \i\c{a32} prefix.
-
-Segment override prefixes have no effect for this instruction: the
-use of \c{ES} for the store to \c{[DI]} or \c{[EDI]} cannot be
-overridden.
-
-\c{STOSW} and \c{STOSD} work in the same way, but they store the
-word in \c{AX} or the doubleword in \c{EAX} instead of the byte in
-\c{AL}, and increment or decrement the addressing registers by 2 or
-4 instead of 1.
-
-The \c{REP} prefix may be used to repeat the instruction \c{CX} (or
-\c{ECX} - again, the address size chooses which) times.
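-
-As a sketch, the following clears a 1024-byte buffer. It assumes
-32-bit code, that \c{ES} addresses the buffer, and a hypothetical
-label \c{buffer}:
-
-\c         cld                     ; store upwards through memory
-\c         mov     edi,buffer
-\c         mov     ecx,256         ; number of doublewords to store
-\c         xor     eax,eax         ; value to store
-\c         rep stosd               ; writes 256 * 4 = 1024 bytes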
-
-
-\S{insSTR} \i\c{STR}: Store Task Register
-
-\c STR r/m16                     ; 0F 00 /1             [286,PRIV]
-
-\c{STR} stores the segment selector corresponding to the contents of
-the Task Register into its operand. When the operand size is 32 bits and
-the destination is a register, the upper 16 bits are cleared to 0s.
-When the destination operand is a memory location, 16 bits are
-written regardless of the operand size.
-
-
-\S{insSUB} \i\c{SUB}: Subtract Integers
-
-\c SUB r/m8,reg8                 ; 28 /r                [8086]
-\c SUB r/m16,reg16               ; o16 29 /r            [8086]
-\c SUB r/m32,reg32               ; o32 29 /r            [386]
-
-\c SUB reg8,r/m8                 ; 2A /r                [8086]
-\c SUB reg16,r/m16               ; o16 2B /r            [8086]
-\c SUB reg32,r/m32               ; o32 2B /r            [386]
-
-\c SUB r/m8,imm8                 ; 80 /5 ib             [8086]
-\c SUB r/m16,imm16               ; o16 81 /5 iw         [8086]
-\c SUB r/m32,imm32               ; o32 81 /5 id         [386]
-
-\c SUB r/m16,imm8                ; o16 83 /5 ib         [8086]
-\c SUB r/m32,imm8                ; o32 83 /5 ib         [386]
-
-\c SUB AL,imm8                   ; 2C ib                [8086]
-\c SUB AX,imm16                  ; o16 2D iw            [8086]
-\c SUB EAX,imm32                 ; o32 2D id            [386]
-
-\c{SUB} performs integer subtraction: it subtracts its second
-operand from its first, and leaves the result in its destination
-(first) operand. The flags are set according to the result of the
-operation: in particular, the carry flag is affected and can be used
-by a subsequent \c{SBB} instruction (\k{insSBB}).
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-
-\S{insSUBPD} \i\c{SUBPD}: Packed Double-Precision FP Subtract
-
-\c SUBPD xmm1,xmm2/m128          ; 66 0F 5C /r     [WILLAMETTE,SSE2]
-
-\c{SUBPD} subtracts the packed double-precision FP values of
-the source operand from those of the destination operand, and
-stores the result in the destination operand.
-
-
-\S{insSUBPS} \i\c{SUBPS}: Packed Single-Precision FP Subtract
-
-\c SUBPS xmm1,xmm2/m128          ; 0F 5C /r        [KATMAI,SSE]
-
-\c{SUBPS} subtracts the packed single-precision FP values of
-the source operand from those of the destination operand, and
-stores the result in the destination operand.
-
-
-\S{insSUBSD} \i\c{SUBSD}: Scalar Double-Precision FP Subtract
-
-\c SUBSD xmm1,xmm2/m64           ; F2 0F 5C /r     [WILLAMETTE,SSE2]
-
-\c{SUBSD} subtracts the low-order double-precision FP value of
-the source operand from that of the destination operand, and
-stores the result in the destination operand. The high
-quadword is unchanged.
-
-
-\S{insSUBSS} \i\c{SUBSS}: Scalar Single-Precision FP Subtract
-
-\c SUBSS xmm1,xmm2/m32           ; F3 0F 5C /r     [KATMAI,SSE]
-
-\c{SUBSS} subtracts the low-order single-precision FP value of
-the source operand from that of the destination operand, and
-stores the result in the destination operand. The three high
-doublewords are unchanged.
-
-
-\S{insSVDC} \i\c{SVDC}: Save Segment Register and Descriptor
-
-\c SVDC m80,segreg               ; 0F 78 /r        [486,CYRIX,SMM]
-
-\c{SVDC} saves a segment register (DS, ES, FS, GS, or SS) and its
-descriptor to mem80.
-
-
-\S{insSVLDT} \i\c{SVLDT}: Save LDTR and Descriptor
-
-\c SVLDT m80                     ; 0F 7A /0        [486,CYRIX,SMM]
-
-\c{SVLDT} saves the Local Descriptor Table Register (LDTR) to mem80.
-
-
-\S{insSVTS} \i\c{SVTS}: Save TSR and Descriptor
-
-\c SVTS m80                      ; 0F 7C /0        [486,CYRIX,SMM]
-
-\c{SVTS} saves the Task State Register (TSR) to mem80.
-
-
-\S{insSYSCALL} \i\c{SYSCALL}: Call Operating System
-
-\c SYSCALL                       ; 0F 05                [P6,AMD]
-
-\c{SYSCALL} provides a fast method of transferring control to a fixed
-entry point in an operating system.
-
-\b The \c{EIP} register is copied into the \c{ECX} register.
-
-\b Bits [31-0] of the 64-bit SYSCALL/SYSRET Target Address Register
-(\c{STAR}) are copied into the \c{EIP} register.
-
-\b Bits [47-32] of the \c{STAR} register specify the selector that is
-copied into the \c{CS} register.
-
-\b Bits [47-32]+1000b of the \c{STAR} register specify the selector that
-is copied into the SS register.
-
-The \c{CS} and \c{SS} registers should not be modified by the operating
-system between the execution of the \c{SYSCALL} instruction and its
-corresponding \c{SYSRET} instruction.
-
-For more information, see the \c{SYSCALL and SYSRET Instruction Specification}
-(AMD document number 21086.pdf).
-
-
-\S{insSYSENTER} \i\c{SYSENTER}: Fast System Call
-
-\c SYSENTER                      ; 0F 34                [P6]
-
-\c{SYSENTER} executes a fast call to a level 0 system procedure or
-routine. Before using this instruction, various MSRs need to be set
-up:
-
-\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the
-privilege level 0 code segment. (This value is also used to compute
-the segment selector of the privilege level 0 stack segment.)
-
-\b \c{SYSENTER_EIP_MSR} contains the 32-bit offset into the privilege
-level 0 code segment to the first instruction of the selected operating
-procedure or routine.
-
-\b \c{SYSENTER_ESP_MSR} contains the 32-bit stack pointer for the
-privilege level 0 stack.
-
-\c{SYSENTER} performs the following sequence of operations:
-
-\b Loads the segment selector from the \c{SYSENTER_CS_MSR} into the
-\c{CS} register.
-
-\b Loads the instruction pointer from the \c{SYSENTER_EIP_MSR} into
-the \c{EIP} register.
-
-\b Adds 8 to the value in \c{SYSENTER_CS_MSR} and loads it into the
-\c{SS} register.
-
-\b Loads the stack pointer from the \c{SYSENTER_ESP_MSR} into the
-\c{ESP} register.
-
-\b Switches to privilege level 0.
-
-\b Clears the \c{VM} flag in the \c{EFLAGS} register, if the flag
-is set.
-
-\b Begins executing the selected system procedure.
-
-In particular, note that this instruction does not save the values of
-\c{CS} or \c{(E)IP}. If you need to return to the calling code, you
-need to write your code to cater for this.
-
-For more information, see the Intel Architecture Software Developer's
-Manual, Volume 2.
-
-
-\S{insSYSEXIT} \i\c{SYSEXIT}: Fast Return From System Call
-
-\c SYSEXIT                       ; 0F 35                [P6,PRIV]
-
-\c{SYSEXIT} executes a fast return to privilege level 3 user code.
-This instruction is a companion instruction to the \c{SYSENTER}
-instruction, and can only be executed by privilege level 0 code.
-Various registers need to be set up before calling this instruction:
-
-\b \c{SYSENTER_CS_MSR} contains the 32-bit segment selector for the
-privilege level 0 code segment in which the processor is currently
-executing. (This value is used to compute the segment selectors for
-the privilege level 3 code and stack segments.)
-
-\b \c{EDX} contains the 32-bit offset into the privilege level 3 code
-segment to the first instruction to be executed in the user code.
-
-\b \c{ECX} contains the 32-bit stack pointer for the privilege level 3
-stack.
-
-\c{SYSEXIT} performs the following sequence of operations:
-
-\b Adds 16 to the value in \c{SYSENTER_CS_MSR} and loads the sum into
-the \c{CS} selector register.
-
-\b Loads the instruction pointer from the \c{EDX} register into the
-\c{EIP} register.
-
-\b Adds 24 to the value in \c{SYSENTER_CS_MSR} and loads the sum
-into the \c{SS} selector register.
-
-\b Loads the stack pointer from the \c{ECX} register into the \c{ESP}
-register.
-
-\b Switches to privilege level 3.
-
-\b Begins executing the user code at the \c{EIP} address.
-
-For more information on the use of the \c{SYSENTER} and \c{SYSEXIT}
-instructions, see the Intel Architecture Software Developer's
-Manual, Volume 2.
-
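The selector arithmetic in the \c{SYSENTER} and \c{SYSEXIT} step lists above can be sketched as a small Python model. This is only an illustration of the additions described in the text, not processor behaviour in full: the function names are hypothetical, and truncating each result to 16 bits is an assumption made because selectors are 16-bit values.

```python
def sysenter_selectors(sysenter_cs_msr):
    """Return (CS, SS) as loaded by SYSENTER, following the steps in the text."""
    cs = sysenter_cs_msr & 0xFFFF          # CS is taken directly from the MSR
    ss = (sysenter_cs_msr + 8) & 0xFFFF    # SS is the MSR value plus 8
    return cs, ss


def sysexit_selectors(sysenter_cs_msr):
    """Return (CS, SS) as loaded by SYSEXIT, following the steps in the text."""
    cs = (sysenter_cs_msr + 16) & 0xFFFF   # CS is the MSR value plus 16
    ss = (sysenter_cs_msr + 24) & 0xFFFF   # SS is the MSR value plus 24
    return cs, ss
```

With a hypothetical MSR value of \c{0x10}, the model yields \c{CS=0x10, SS=0x18} on entry and \c{CS=0x20, SS=0x28} on exit, which shows why the four descriptors are expected to sit next to each other in the descriptor table.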
-
-\S{insSYSRET} \i\c{SYSRET}: Return From Operating System
-
-\c SYSRET                        ; 0F 07                [P6,AMD,PRIV]
-
-\c{SYSRET} is the return instruction used in conjunction with the
-\c{SYSCALL} instruction to provide fast entry/exit to an operating system.
-
-\b The \c{ECX} register, which points to the next sequential instruction
-after the corresponding \c{SYSCALL} instruction, is copied into the \c{EIP}
-register.
-
-\b Bits [63-48] of the \c{STAR} register specify the selector that is copied
-into the \c{CS} register.
-
-\b Bits [63-48]+1000b of the \c{STAR} register specify the selector that is
-copied into the \c{SS} register.
-
-\b Bits [1-0] of the \c{SS} register are set to 11b (RPL of 3) regardless of
-the value of bits [49-48] of the \c{STAR} register.
-
-The \c{CS} and \c{SS} registers should not be modified by the operating
-system between the execution of the \c{SYSCALL} instruction and its
-corresponding \c{SYSRET} instruction.
-
-For more information, see the \c{SYSCALL and SYSRET Instruction Specification}
-(AMD document number 21086.pdf).
-
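The way \c{SYSCALL} and \c{SYSRET} pick their values out of the 64-bit \c{STAR} register can be sketched as follows. This is a minimal Python model of the bit fields named in the two entries above; the function names are hypothetical and the example \c{STAR} value is made up for illustration.

```python
def syscall_targets(star):
    """Return (EIP, CS, SS) taken from STAR by SYSCALL, per the text."""
    eip = star & 0xFFFFFFFF            # bits [31-0]
    cs = (star >> 32) & 0xFFFF         # bits [47-32]
    ss = (cs + 0b1000) & 0xFFFF        # bits [47-32] + 1000b
    return eip, cs, ss


def sysret_selectors(star):
    """Return (CS, SS) taken from STAR by SYSRET, per the text."""
    cs = (star >> 48) & 0xFFFF         # bits [63-48]
    ss = (cs + 0b1000) & 0xFFFF        # bits [63-48] + 1000b
    ss |= 0b11                         # bits [1-0] of SS are forced to 11b
    return cs, ss


# Hypothetical STAR contents, chosen only to exercise each field.
example_star = (0x0023 << 48) | (0x0008 << 32) | 0x00401000
```

Here \c{syscall_targets(example_star)} gives \c{EIP=0x00401000}, \c{CS=0x08} and \c{SS=0x10}, while \c{sysret_selectors(example_star)} gives \c{CS=0x23} and \c{SS=0x2B}.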
-
-\S{insTEST} \i\c{TEST}: Test Bits (notional bitwise AND)
-
-\c TEST r/m8,reg8                ; 84 /r                [8086]
-\c TEST r/m16,reg16              ; o16 85 /r            [8086]
-\c TEST r/m32,reg32              ; o32 85 /r            [386]
-
-\c TEST r/m8,imm8                ; F6 /0 ib             [8086]
-\c TEST r/m16,imm16              ; o16 F7 /0 iw         [8086]
-\c TEST r/m32,imm32              ; o32 F7 /0 id         [386]
-
-\c TEST AL,imm8                  ; A8 ib                [8086]
-\c TEST AX,imm16                 ; o16 A9 iw            [8086]
-\c TEST EAX,imm32                ; o32 A9 id            [386]
-
-\c{TEST} performs a `mental' bitwise AND of its two operands, and
-affects the flags as if the operation had taken place, but does not
-store the result of the operation anywhere.
-
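The effect of \c{TEST} on the flags can be modelled in a few lines of Python. This sketch assumes the usual behaviour of a logical AND on x86 (\c{CF} and \c{OF} cleared, \c{ZF}, \c{SF} and \c{PF} set from the result), which the entry above does not spell out; the function name is hypothetical.

```python
def flags_after_test(a, b, width=8):
    """Model the flags left by TEST a,b; the AND result itself is discarded."""
    mask = (1 << width) - 1
    result = a & b & mask
    return {
        "ZF": int(result == 0),                            # result is zero
        "SF": (result >> (width - 1)) & 1,                 # top bit of result
        "PF": int(bin(result & 0xFF).count("1") % 2 == 0), # even parity, low byte
        "CF": 0,                                           # cleared by logical ops
        "OF": 0,                                           # cleared by logical ops
    }
```

For instance, testing \c{0x0F} against \c{0xF0} sets \c{ZF} because the operands share no bits, which is the common idiom behind \c{TEST reg,reg} followed by a conditional jump.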
-
-\S{insUCOMISD} \i\c{UCOMISD}: Unordered Scalar Double-Precision FP
-compare and set EFLAGS
-
-\c UCOMISD xmm1,xmm2/m128        ; 66 0F 2E /r     [WILLAMETTE,SSE2]
-
-\c{UCOMISD} compares the low-order double-precision FP numbers in the
-two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the
-\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits
-in the \c{EFLAGS} register are zeroed out. The unordered predicate
-(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source
-operand is a \c{NaN} (\c{qNaN} or \c{sNaN}).
-
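The flag results of \c{UCOMISD} can be sketched with Python floats standing in for the low-order double-precision values. The unordered case follows the description above; the encodings for the greater, less and equal cases are taken from the processor manufacturers' documentation rather than from this entry, and the function name is hypothetical. The same table applies to \c{UCOMISS}.

```python
import math


def ucomisd_flags(dst, src):
    """Model the EFLAGS bits after UCOMISD dst,src."""
    if math.isnan(dst) or math.isnan(src):
        zf, pf, cf = 1, 1, 1      # unordered: either operand is a NaN
    elif dst > src:
        zf, pf, cf = 0, 0, 0      # destination greater than source
    elif dst < src:
        zf, pf, cf = 0, 0, 1      # destination less than source
    else:
        zf, pf, cf = 1, 0, 0      # operands equal
    # OF, SF and AF are always zeroed out.
    return {"ZF": zf, "PF": pf, "CF": cf, "OF": 0, "SF": 0, "AF": 0}
```

Because the unordered case sets \c{PF}, code that must treat NaNs specially typically checks the parity flag before acting on \c{ZF} or \c{CF}.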
-
-\S{insUCOMISS} \i\c{UCOMISS}: Unordered Scalar Single-Precision FP
-compare and set EFLAGS
-
-\c UCOMISS xmm1,xmm2/m128        ; 0F 2E /r        [KATMAI,SSE]
-
-\c{UCOMISS} compares the low-order single-precision FP numbers in the
-two operands, and sets the \c{ZF}, \c{PF} and \c{CF} bits in the
-\c{EFLAGS} register. In addition, the \c{OF}, \c{SF} and \c{AF} bits
-in the \c{EFLAGS} register are zeroed out. The unordered predicate
-(\c{ZF}, \c{PF} and \c{CF} all set) is returned if either source
-operand is a \c{NaN} (\c{qNaN} or \c{sNaN}).
-
-
-\S{insUD2} \i\c{UD0}, \i\c{UD1}, \i\c{UD2}: Undefined Instruction
-
-\c UD0                           ; 0F FF                [186,UNDOC]
-\c UD1                           ; 0F B9                [186,UNDOC]
-\c UD2                           ; 0F 0B                [186]
-
-\c{UDx} can be used to generate an invalid opcode exception, for testing
-purposes.
-
-\c{UD0} is specifically documented by AMD as being reserved for this
-purpose.
-
-\c{UD1} is documented by Intel as being available for this purpose.
-
-\c{UD2} is specifically documented by Intel as being reserved for this
-purpose. Intel document this as the preferred method of generating an
-invalid opcode exception.
-
-All these opcodes can be used to generate invalid opcode exceptions on
-all currently available processors.
-
-
-\S{insUMOV} \i\c{UMOV}: User Move Data
-
-\c UMOV r/m8,reg8                ; 0F 10 /r             [386,UNDOC]
-\c UMOV r/m16,reg16              ; o16 0F 11 /r         [386,UNDOC]
-\c UMOV r/m32,reg32              ; o32 0F 11 /r         [386,UNDOC]
-
-\c UMOV reg8,r/m8                ; 0F 12 /r             [386,UNDOC]
-\c UMOV reg16,r/m16              ; o16 0F 13 /r         [386,UNDOC]
-\c UMOV reg32,r/m32              ; o32 0F 13 /r         [386,UNDOC]
-
-This undocumented instruction is used by in-circuit emulators to
-access user memory (as opposed to host memory). It is used just like
-an ordinary memory/register or register/register \c{MOV}
-instruction, but accesses user space.
-
-This instruction is only available on some AMD and IBM 386 and 486
-processors.
-
-
-\S{insUNPCKHPD} \i\c{UNPCKHPD}: Unpack and Interleave High Packed
-Double-Precision FP Values
-
-\c UNPCKHPD xmm1,xmm2/m128       ; 66 0F 15 /r     [WILLAMETTE,SSE2]
-
-\c{UNPCKHPD} performs an interleaved unpack of the high-order data
-elements of the source and destination operands, saving the result
-in \c{xmm1}. It ignores the lower half of the sources.
-
-The operation of this instruction is:
-
-\c    dst[63-0]   := dst[127-64];
-\c    dst[127-64] := src[127-64].
-
-
-\S{insUNPCKHPS} \i\c{UNPCKHPS}: Unpack and Interleave High Packed
-Single-Precision FP Values
-
-\c UNPCKHPS xmm1,xmm2/m128       ; 0F 15 /r        [KATMAI,SSE]
-
-\c{UNPCKHPS} performs an interleaved unpack of the high-order data
-elements of the source and destination operands, saving the result
-in \c{xmm1}. It ignores the lower half of the sources.
-
-The operation of this instruction is:
-
-\c    dst[31-0]   := dst[95-64];
-\c    dst[63-32]  := src[95-64];
-\c    dst[95-64]  := dst[127-96];
-\c    dst[127-96] := src[127-96].
-
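The high unpack operations shown for \c{UNPCKHPD} and \c{UNPCKHPS} can be sketched in Python by treating each register as a list of lanes, with index 0 holding the lowest-order element. The function names and the list representation are assumptions made for illustration.

```python
def unpckhpd(dst, src):
    """dst, src: two 64-bit lanes each, index 0 = bits [63-0]."""
    return [dst[1], src[1]]


def unpckhps(dst, src):
    """dst, src: four 32-bit lanes each, index 0 = bits [31-0]."""
    return [dst[2], src[2], dst[3], src[3]]
```

Only lanes from the upper half of each operand appear in the result, matching the statement that the lower halves are ignored.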
-
-\S{insUNPCKLPD} \i\c{UNPCKLPD}: Unpack and Interleave Low Packed
-Double-Precision FP Data
-
-\c UNPCKLPD xmm1,xmm2/m128       ; 66 0F 14 /r     [WILLAMETTE,SSE2]
-
-\c{UNPCKLPD} performs an interleaved unpack of the low-order data
-elements of the source and destination operands, saving the result
-in \c{xmm1}. It ignores the upper half of the sources.
-
-The operation of this instruction is:
-
-\c    dst[63-0]   := dst[63-0];
-\c    dst[127-64] := src[63-0].
-
-
-\S{insUNPCKLPS} \i\c{UNPCKLPS}: Unpack and Interleave Low Packed
-Single-Precision FP Data
-
-\c UNPCKLPS xmm1,xmm2/m128       ; 0F 14 /r        [KATMAI,SSE]
-
-\c{UNPCKLPS} performs an interleaved unpack of the low-order data
-elements of the source and destination operands, saving the result
-in \c{xmm1}. It ignores the upper half of the sources.
-
-The operation of this instruction is:
-
-\c    dst[31-0]   := dst[31-0];
-\c    dst[63-32]  := src[31-0];
-\c    dst[95-64]  := dst[63-32];
-\c    dst[127-96] := src[63-32].
-
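The low unpack operations shown for \c{UNPCKLPD} and \c{UNPCKLPS} can be sketched in Python in the same lane-list style, with index 0 holding the lowest-order element. The function names and the list representation are assumptions made for illustration.

```python
def unpcklpd(dst, src):
    """dst, src: two 64-bit lanes each, index 0 = bits [63-0]."""
    return [dst[0], src[0]]


def unpcklps(dst, src):
    """dst, src: four 32-bit lanes each, index 0 = bits [31-0]."""
    return [dst[0], src[0], dst[1], src[1]]
```

Every lane in the result comes from bits [63-0] of one of the operands, so the contents of bits [127-64] of either operand have no effect.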
-
-\S{insVERR} \i\c{VERR}, \i\c{VERW}: Verify Segment Readability/Writability
-
-\c VERR r/m16                    ; 0F 00 /4             [286,PRIV]
-
-\c VERW r/m16                    ; 0F 00 /5             [286,PRIV]
-
-\b \c{VERR} sets the zero flag if the segment specified by the selector
-in its operand can be read from at the current privilege level.
-Otherwise it is cleared.
-
-\b \c{VERW} sets the zero flag if the segment can be written.
-
-
-\S{insWAIT} \i\c{WAIT}: Wait for Floating-Point Processor
-
-\c WAIT                          ; 9B                   [8086]
-\c FWAIT                         ; 9B                   [8086]
-
-\c{WAIT}, on 8086 systems with a separate 8087 FPU, waits for the
-FPU to have finished any operation it is engaged in before
-continuing main processor operations, so that (for example) an FPU
-store to main memory can be guaranteed to have completed before the
-CPU tries to read the result back out.
-
-On higher processors, \c{WAIT} is unnecessary for this purpose, and
-it has the alternative purpose of ensuring that any pending unmasked
-FPU exceptions have happened before execution continues.
-
-
-\S{insWBINVD} \i\c{WBINVD}: Write Back and Invalidate Cache
-
-\c WBINVD                        ; 0F 09                [486]
-
-\c{WBINVD} invalidates and empties the processor's internal caches,
-and causes the processor to instruct external caches to do the same.
-It writes the contents of the caches back to memory first, so no
-data is lost. To flush the caches quickly without bothering to write
-the data back first, use \c{INVD} (\k{insINVD}).
-
-
-\S{insWRMSR} \i\c{WRMSR}: Write Model-Specific Registers
-
-\c WRMSR                         ; 0F 30                [PENT]
-
-\c{WRMSR} writes the value in \c{EDX:EAX} to the processor
-Model-Specific Register (MSR) whose index is stored in \c{ECX}.
-See also \c{RDMSR} (\k{insRDMSR}).
-
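The \c{EDX:EAX} notation means that \c{EDX} carries the upper 32 bits and \c{EAX} the lower 32 bits of the 64-bit MSR value. A minimal Python sketch of that split, using hypothetical helper names, is:

```python
def split_msr_value(value):
    """Split a 64-bit MSR value into the (EDX, EAX) pair WRMSR expects."""
    eax = value & 0xFFFFFFFF            # low 32 bits
    edx = (value >> 32) & 0xFFFFFFFF    # high 32 bits
    return edx, eax


def join_edx_eax(edx, eax):
    """Rebuild the 64-bit value that a RDMSR-style EDX:EAX pair represents."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
```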
-
-\S{insWRSHR} \i\c{WRSHR}: Write SMM Header Pointer Register
-
-\c WRSHR r/m32                   ; 0F 37 /0        [386,CYRIX,SMM]
-
-\c{WRSHR} loads the contents of either a 32-bit memory location or a
-32-bit register into the SMM header pointer register.
-
-See also \c{RDSHR} (\k{insRDSHR}).
-
-
-\S{insXADD} \i\c{XADD}: Exchange and Add
-
-\c XADD r/m8,reg8                ; 0F C0 /r             [486]
-\c XADD r/m16,reg16              ; o16 0F C1 /r         [486]
-\c XADD r/m32,reg32              ; o32 0F C1 /r         [486]
-
-\c{XADD} exchanges the values in its two operands, and then adds
-them together and writes the result into the destination (first)
-operand. This instruction can be used with a \c{LOCK} prefix for
-multi-processor synchronisation purposes.
-
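The net effect of \c{XADD} on its two operands can be sketched in Python as follows. The function name is hypothetical, flags are not modelled, and the wrap-around to the operand width is the usual x86 behaviour rather than something stated in the entry above.

```python
def xadd(dst, src, width=32):
    """Return (new_dst, new_src) after XADD dst,src."""
    mask = (1 << width) - 1
    total = (dst + src) & mask   # the sum goes to the destination
    return total, dst & mask     # the source receives the old destination
```

The returned source value is the destination's previous contents, which is what makes a locked \c{XADD} usable as a fetch-and-add primitive.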
-
-\S{insXBTS} \i\c{XBTS}: Extract Bit String
-
-\c XBTS reg16,r/m16              ; o16 0F A6 /r         [386,UNDOC]
-\c XBTS reg32,r/m32              ; o32 0F A6 /r         [386,UNDOC]
-
-The implied operation of this instruction is:
-
-\c XBTS r/m16,reg16,AX,CL
-\c XBTS r/m32,reg32,EAX,CL
-
-Writes a bit string from the source operand to the destination. \c{CL}
-indicates the number of bits to be copied, and \c{(E)AX} indicates the
-low order bit offset in the source. The bits are written to the low
-order bits of the destination register. For example, if \c{CL} is set
-to 4 and \c{AX} (for 16-bit code) is set to 5, bits 5-8 of \c{src} will
-be copied to bits 0-3 of \c{dst}. This instruction is very poorly
-documented, and I have been unable to find any official source of
-documentation on it.
-
-\c{XBTS} is supported only on the early Intel 386s, and conflicts with
-the opcodes for \c{CMPXCHG486} (on early Intel 486s). NASM supports it
-only for completeness. Its counterpart is \c{IBTS} (see \k{insIBTS}).
-
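The bit extraction described for \c{XBTS} can be sketched in Python as a shift and mask. This follows the description above only; given how poorly the instruction is documented, the model is an assumption about its behaviour and the function name is hypothetical.

```python
def xbts(src, offset, count):
    """Extract `count` bits of `src` starting at bit `offset` ((E)AX, CL)."""
    return (src >> offset) & ((1 << count) - 1)
```

Using the example from the text, \c{xbts(src, 5, 4)} places bits 5-8 of \c{src} in bits 0-3 of the result.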
-
-\S{insXCHG} \i\c{XCHG}: Exchange
-
-\c XCHG reg8,r/m8                ; 86 /r                [8086]
-\c XCHG reg16,r/m16              ; o16 87 /r            [8086]
-\c XCHG reg32,r/m32              ; o32 87 /r            [386]
-
-\c XCHG r/m8,reg8                ; 86 /r                [8086]
-\c XCHG r/m16,reg16              ; o16 87 /r            [8086]
-\c XCHG r/m32,reg32              ; o32 87 /r            [386]
-
-\c XCHG AX,reg16                 ; o16 90+r             [8086]
-\c XCHG EAX,reg32                ; o32 90+r             [386]
-\c XCHG reg16,AX                 ; o16 90+r             [8086]
-\c XCHG reg32,EAX                ; o32 90+r             [386]
-
-\c{XCHG} exchanges the values in its two operands. It can be used
-with a \c{LOCK} prefix for purposes of multi-processor
-synchronisation.
-
-\c{XCHG AX,AX} or \c{XCHG EAX,EAX} (depending on the \c{BITS}
-setting) generates the opcode \c{90h}, and so is a synonym for
-\c{NOP} (\k{insNOP}).
-
-
-\S{insXLATB} \i\c{XLATB}: Translate Byte in Lookup Table
-
-\c XLAT                          ; D7                   [8086]
-\c XLATB                         ; D7                   [8086]
-
-\c{XLATB} adds the value in \c{AL}, treated as an unsigned byte, to
-\c{BX} or \c{EBX}, and loads the byte from the resulting address (in
-the segment specified by \c{DS}) back into \c{AL}.
-
-The base register used is \c{BX} if the address size is 16 bits, and
-\c{EBX} if it is 32 bits. If you need to use an address size not
-equal to the current \c{BITS} setting, you can use an explicit
-\i\c{a16} or \i\c{a32} prefix.
-
-The segment register used to load from \c{[BX+AL]} or \c{[EBX+AL]}
-can be overridden by using a segment register name as a prefix (for
-example, \c{es xlatb}).
-
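The table lookup performed by \c{XLATB} can be sketched in Python, with a \c{bytearray} standing in for the data segment. The function name is hypothetical, and wrap-around of the address at the 16-bit or 32-bit limit is deliberately not modelled.

```python
def xlatb(memory, bx, al):
    """Return the new AL: the byte at offset (E)BX plus unsigned AL."""
    return memory[bx + (al & 0xFF)]
```

Typical use is a 256-byte translation table whose base is in \c{BX} or \c{EBX}, so that every possible value of \c{AL} maps to one table entry.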
-
-\S{insXOR} \i\c{XOR}: Bitwise Exclusive OR
-
-\c XOR r/m8,reg8                 ; 30 /r                [8086]
-\c XOR r/m16,reg16               ; o16 31 /r            [8086]
-\c XOR r/m32,reg32               ; o32 31 /r            [386]
-
-\c XOR reg8,r/m8                 ; 32 /r                [8086]
-\c XOR reg16,r/m16               ; o16 33 /r            [8086]
-\c XOR reg32,r/m32               ; o32 33 /r            [386]
-
-\c XOR r/m8,imm8                 ; 80 /6 ib             [8086]
-\c XOR r/m16,imm16               ; o16 81 /6 iw         [8086]
-\c XOR r/m32,imm32               ; o32 81 /6 id         [386]
-
-\c XOR r/m16,imm8                ; o16 83 /6 ib         [8086]
-\c XOR r/m32,imm8                ; o32 83 /6 ib         [386]
-
-\c XOR AL,imm8                   ; 34 ib                [8086]
-\c XOR AX,imm16                  ; o16 35 iw            [8086]
-\c XOR EAX,imm32                 ; o32 35 id            [386]
-
-\c{XOR} performs a bitwise XOR operation between its two operands
-(i.e. each bit of the result is 1 if and only if exactly one of the
-corresponding bits of the two inputs was 1), and stores the result
-in the destination (first) operand.
-
-In the forms with an 8-bit immediate second operand and a longer
-first operand, the second operand is considered to be signed, and is
-sign-extended to the length of the first operand. In these cases,
-the \c{BYTE} qualifier is necessary to force NASM to generate this
-form of the instruction.
-
-The \c{MMX} instruction \c{PXOR} (see \k{insPXOR}) performs the same
-operation on the 64-bit \c{MMX} registers.
-
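The sign extension applied to an 8-bit immediate in the \c{83 /6 ib} forms can be sketched in Python as follows. The function name is hypothetical and flags are not modelled.

```python
def xor_imm8(value, imm8, width):
    """Model XOR r/m16,imm8 or XOR r/m32,imm8 (width is 16 or 32)."""
    mask = (1 << width) - 1
    imm = imm8 & 0xFF
    if imm & 0x80:               # negative as a signed byte
        imm |= mask & ~0xFF      # fill the upper bits with ones
    return (value ^ imm) & mask
```

An immediate of \c{0xFF} therefore acts as \c{0xFFFF} on a 16-bit operand and inverts every bit, whereas \c{0x7F} only affects the low seven bits.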
-
-\S{insXORPD} \i\c{XORPD}: Bitwise Logical XOR of Double-Precision FP Values
-
-\c XORPD xmm1,xmm2/m128          ; 66 0F 57 /r     [WILLAMETTE,SSE2]
-
-\c{XORPD} returns a bit-wise logical XOR between the source and
-destination operands, storing the result in the destination operand.
-
-
-\S{insXORPS} \i\c{XORPS}: Bitwise Logical XOR of Single-Precision FP Values
-
-\c XORPS xmm1,xmm2/m128          ; 0F 57 /r        [KATMAI,SSE]
-
-\c{XORPS} returns a bit-wise logical XOR between the source and
-destination operands, storing the result in the destination operand.
-
-