author     heretic <heretic@yandex-team.ru>  2022-02-10 16:45:43 +0300
committer  Daniil Cherednik <dcherednik@yandex-team.ru>  2022-02-10 16:45:43 +0300
commit     397cbe258b9e064f49c4ca575279f02f39fef76e (patch)
tree       a0b0eb3cca6a14e4e8ea715393637672fa651284 /contrib/libs/llvm12/lib/Target/ARM/README.txt
parent     43f5a35593ebc9f6bcea619bb170394ea7ae468e (diff)
download   ydb-397cbe258b9e064f49c4ca575279f02f39fef76e.tar.gz

Restoring authorship annotation for <heretic@yandex-team.ru>. Commit 1 of 2.
Diffstat (limited to 'contrib/libs/llvm12/lib/Target/ARM/README.txt')
-rw-r--r--  contrib/libs/llvm12/lib/Target/ARM/README.txt  1464
1 files changed, 732 insertions, 732 deletions
diff --git a/contrib/libs/llvm12/lib/Target/ARM/README.txt b/contrib/libs/llvm12/lib/Target/ARM/README.txt
index def67cfae7..1a93bc7bb7 100644
--- a/contrib/libs/llvm12/lib/Target/ARM/README.txt
+++ b/contrib/libs/llvm12/lib/Target/ARM/README.txt
@@ -1,732 +1,732 @@
+//===---------------------------------------------------------------------===//
+// Random ideas for the ARM backend.
+//===---------------------------------------------------------------------===//
+
+Reimplement 'select' in terms of 'SEL'.
+
+* We would really like to support UXTAB16, but we need to prove that the
+  add cannot overflow between the two 16-bit chunks (see the sketch after
+  this list).
+
+* Implement pre/post increment support. (e.g. PR935)
+* Implement smarter constant generation for binops with large immediates.
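+
+A minimal C sketch for the UXTAB16 item above (the helper name is made up for
+illustration): the 32-bit add below is equivalent to UXTAB16 only when the
+low-half add cannot carry into the high half, which is exactly what has to be
+proven.
+
+/* b0 and b1 stand for bytes 0 and 2 of the second UXTAB16 operand. */
+unsigned add_bytes_to_halves(unsigned a, unsigned char b0, unsigned char b1) {
+  return a + ((unsigned)b1 << 16) + b0;
+}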
+
+A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX
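+
+For instance, plain C bitfield accesses like these could be matched (a sketch;
+the struct and helpers are illustrative only): the extracts map to SBFX/UBFX
+and the update maps to BFI.
+
+struct bits { signed int s : 5; unsigned u : 7; unsigned pad : 20; };
+
+int      get_s(const struct bits *p)       { return p->s; }  /* sbfx */
+unsigned get_u(const struct bits *p)       { return p->u; }  /* ubfx */
+void     set_u(struct bits *p, unsigned v) { p->u = v; }     /* bfi  */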
+
+Interesting optimization for PIC codegen on arm-linux:
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129
+
+//===---------------------------------------------------------------------===//
+
+Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
+time regalloc happens, these values are now in a 32-bit register, usually with
+the top-bits known to be sign or zero extended. If spilled, we should be able
+to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part
+of the reload.
+
+Doing this reduces the size of the stack frame (important for thumb etc), and
+also increases the likelihood that we will be able to reload multiple values
+from the stack with a single load.
+
+//===---------------------------------------------------------------------===//
+
+The constant island pass is in good shape. Some cleanups might be desirable,
+but there is unlikely to be much improvement in the generated code.
+
+1. There may be some advantage to trying to be smarter about the initial
+placement, rather than putting everything at the end.
+
+2. There might be some compile-time efficiency to be had by representing
+consecutive islands as a single block rather than multiple blocks.
+
+3. Use a priority queue to sort constant pool users in inverse order of
+   position so we always process the one closest to the end of the function
+   first. This may simplify CreateNewWater.
+
+//===---------------------------------------------------------------------===//
+
+Eliminate copysign custom expansion. We are still generating crappy code with
+default expansion + if-conversion.
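+
+For reference, the operation in question (plain C99 copysign; ideally this
+becomes a short bit-manipulation sequence on the sign bit rather than a
+branchy one):
+
+#include <math.h>
+double copysign_test(double mag, double sgn) { return copysign(mag, sgn); }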
+
+//===---------------------------------------------------------------------===//
+
+Eliminate one instruction from:
+
+define i32 @_Z6slow4bii(i32 %x, i32 %y) {
+ %tmp = icmp sgt i32 %x, %y
+ %retval = select i1 %tmp, i32 %x, i32 %y
+ ret i32 %retval
+}
+
+__Z6slow4bii:
+ cmp r0, r1
+ movgt r1, r0
+ mov r0, r1
+ bx lr
+=>
+
+__Z6slow4bii:
+ cmp r0, r1
+ movle r0, r1
+ bx lr
+
+//===---------------------------------------------------------------------===//
+
+Implement long long "X-3" with instructions that fold the immediate in. These
+were disabled due to badness with the ARM carry flag on subtracts.
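+
+A small C reproducer (illustrative only): the constant should fold into the
+immediate operands of the 64-bit subtract instead of being materialized in
+registers first.
+
+long long sub_small_imm(long long x) { return x - 3; }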
+
+//===---------------------------------------------------------------------===//
+
+More load / store optimizations:
+1) Better representation for block transfer? This is from Olden/power:
+
+ fldd d0, [r4]
+ fstd d0, [r4, #+32]
+ fldd d0, [r4, #+8]
+ fstd d0, [r4, #+40]
+ fldd d0, [r4, #+16]
+ fstd d0, [r4, #+48]
+ fldd d0, [r4, #+24]
+ fstd d0, [r4, #+56]
+
+If we can spare the registers, it would be better to use fldm and fstm here.
+Need major register allocator enhancement though.
+
+2) Can we recognize the relative position of constantpool entries? i.e. Treat
+
+ ldr r0, LCPI17_3
+ ldr r1, LCPI17_4
+ ldr r2, LCPI17_5
+
+ as
+ ldr r0, LCPI17
+ ldr r1, LCPI17+4
+ ldr r2, LCPI17+8
+
+ Then the ldr's can be combined into a single ldm. See Olden/power.
+
+Note that for ARMv4, gcc uses ldmia to load a pair of 32-bit values representing
+a 64-bit double FP constant:
+
+ adr r0, L6
+ ldmia r0, {r0-r1}
+
+ .align 2
+L6:
+ .long -858993459
+ .long 1074318540
+
+3) struct copies appear to be done field by field
+instead of by words, at least sometimes:
+
+struct foo { int x; short s; char c1; char c2; };
+void cpy(struct foo*a, struct foo*b) { *a = *b; }
+
+llvm code (-O2)
+ ldrb r3, [r1, #+6]
+ ldr r2, [r1]
+ ldrb r12, [r1, #+7]
+ ldrh r1, [r1, #+4]
+ str r2, [r0]
+ strh r1, [r0, #+4]
+ strb r3, [r0, #+6]
+ strb r12, [r0, #+7]
+gcc code (-O2)
+ ldmia r1, {r1-r2}
+ stmia r0, {r1-r2}
+
+In this benchmark poor handling of aggregate copies has shown up as
+having a large effect on size, and possibly speed as well (we don't have
+a good way to measure on ARM).
+
+//===---------------------------------------------------------------------===//
+
+* Consider this silly example:
+
+double bar(double x) {
+ double r = foo(3.1);
+ return x+r;
+}
+
+_bar:
+ stmfd sp!, {r4, r5, r7, lr}
+ add r7, sp, #8
+ mov r4, r0
+ mov r5, r1
+ fldd d0, LCPI1_0
+ fmrrd r0, r1, d0
+ bl _foo
+ fmdrr d0, r4, r5
+ fmsr s2, r0
+ fsitod d1, s2
+ faddd d0, d1, d0
+ fmrrd r0, r1, d0
+ ldmfd sp!, {r4, r5, r7, pc}
+
+Ignore the prologue and epilogue stuff for a second. Note
+ mov r4, r0
+ mov r5, r1
+the copies to callee-save registers and the fact that they are only used by the
+fmdrr instruction. It would have been better had the fmdrr been scheduled
+before the call, placing the result in a callee-save DPR register. The two
+mov ops would then not have been necessary.
+
+//===---------------------------------------------------------------------===//
+
+Calling convention related stuff:
+
+* gcc's parameter passing implementation is terrible and we suffer as a result:
+
+e.g.
+struct s {
+ double d1;
+ int s1;
+};
+
+void foo(struct s S) {
+ printf("%g, %d\n", S.d1, S.s1);
+}
+
+'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and
+then reloads them into r1, r2, and r3 before issuing the call (r0 contains the
+address of the format string):
+
+ stmfd sp!, {r7, lr}
+ add r7, sp, #0
+ sub sp, sp, #12
+ stmia sp, {r0, r1, r2}
+ ldmia sp, {r1-r2}
+ ldr r0, L5
+ ldr r3, [sp, #8]
+L2:
+ add r0, pc, r0
+ bl L_printf$stub
+
+Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?
+
+* Returning an aggregate type is even worse:
+
+e.g.
+struct s foo(void) {
+ struct s S = {1.1, 2};
+ return S;
+}
+
+ mov ip, r0
+ ldr r0, L5
+ sub sp, sp, #12
+L2:
+ add r0, pc, r0
+ @ lr needed for prologue
+ ldmia r0, {r0, r1, r2}
+ stmia sp, {r0, r1, r2}
+ stmia ip, {r0, r1, r2}
+ mov r0, ip
+ add sp, sp, #12
+ bx lr
+
+r0 (and later ip) is the hidden parameter from the caller pointing to where the
+value should be stored. The first ldmia loads the constants into r0, r1, r2.
+The last stmia stores r0, r1, r2 to the address passed in. However, there is one
+additional stmia that stores r0, r1, and r2 to some stack location. That store
+is dead.
+
+The llvm-gcc generated code looks like this:
+
+csretcc void %foo(%struct.s* %agg.result) {
+entry:
+ %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1]
+ %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1]
+ cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2]
+ call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )
+ cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2]
+ call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )
+ cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1]
+ call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )
+ ret void
+}
+
+llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from the
+constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated
+into a number of loads and stores, or 2) custom lower memcpy (of small size) to
+be ldmia / stmia. I think option 2 is better, but the current register
+allocator cannot allocate a chunk of registers at a time.
+
+A feasible temporary solution is to use specific physical registers at
+lowering time for small (<= 4 words?) transfer sizes.
+
+* ARM CSRet calling convention requires the hidden argument to be returned by
+the callee.
+
+//===---------------------------------------------------------------------===//
+
+We can definitely do a better job on BB placements to eliminate some branches.
+It's very common to see llvm generated assembly code that looks like this:
+
+LBB3:
+ ...
+LBB4:
+...
+ beq LBB3
+ b LBB2
+
+If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can
+then eliminate the beq and turn the unconditional branch to LBB2 into a bne.
+
+See McCat/18-imp/ComputeBoundingBoxes for an example.
+
+//===---------------------------------------------------------------------===//
+
+Pre-/post- indexed load / stores:
+
+1) We should not perform the pre/post-indexed load/store transform if the base
+ptr is guaranteed to be live beyond the load/store. This can happen if the base
+ptr is live out of the block in which we are performing the optimization, e.g.
+
+mov r1, r2
+ldr r3, [r1], #4
+...
+
+vs.
+
+ldr r3, [r2]
+add r1, r2, #4
+...
+
+In most cases, this is just a wasted optimization. However, sometimes it can
+negatively impact performance because two-address code is more restrictive
+when it comes to scheduling.
+
+Unfortunately, liveout information is currently unavailable during DAG combine
+time.
+
+2) Consider splitting an indexed load / store into a pair of add/sub + load/store
+   to solve #1 (in TwoAddressInstructionPass.cpp).
+
+3) Enhance LSR to generate more opportunities for indexed ops.
+
+4) Once we have added support for multiple-result patterns, write indexed load
+   patterns instead of C++ instruction selection code.
+
+5) Use VLDM / VSTM to emulate indexed FP load / store.
+
+//===---------------------------------------------------------------------===//
+
+Implement support for some more tricky ways to materialize immediates. For
+example, to get 0xffff8000 (note that 0x3f8000 - 0x400000 wraps to exactly
+0xffff8000 in 32 bits), we can use:
+
+mov r9, #&3f8000
+sub r9, r9, #&400000
+
+//===---------------------------------------------------------------------===//
+
+We sometimes generate multiple add / sub instructions to update sp in the
+prologue and epilogue if the inc / dec value is too large to fit in a single
+immediate operand. In some cases it might be better to load the value from a
+constantpool instead.
+
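+A hypothetical trigger (the exact threshold depends on saved registers and
+alignment): a frame this large cannot be encoded in a single so_imm operand,
+so the sp adjustment is currently split into multiple instructions.
+
+void big_frame(void) {
+  volatile char buf[0x10004];  /* 0x10004 is not a valid rotated-imm8 value */
+  buf[0] = 1;
+}
+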
+//===---------------------------------------------------------------------===//
+
+GCC generates significantly better code for this function.
+
+int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {
+ int i = 0;
+
+ if (StackPtr != 0) {
+ while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))
+ Line[i++] = Stack[--StackPtr];
+ if (LineLen > 32768)
+ {
+ while (StackPtr != 0 && i < LineLen)
+ {
+ i++;
+ --StackPtr;
+ }
+ }
+ }
+ return StackPtr;
+}
+
+//===---------------------------------------------------------------------===//
+
+This should compile to the mlas instruction:
+int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; }
+
+//===---------------------------------------------------------------------===//
+
+At some point, we should triage these to see if they still apply to us:
+
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016
+
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982
+
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663
+
+http://www.inf.u-szeged.hu/gcc-arm/
+http://citeseer.ist.psu.edu/debus04linktime.html
+
+//===---------------------------------------------------------------------===//
+
+gcc generates smaller code for this function at -O2 or -Os:
+
+void foo(signed char* p) {
+ if (*p == 3)
+ bar();
+ else if (*p == 4)
+ baz();
+ else if (*p == 5)
+ quux();
+}
+
+llvm decides it's a good idea to turn the repeated if...else into a
+binary tree, as if it were a switch; the resulting code requires one fewer
+compare-and-branch when *p<=2 or *p==5, the same number if *p==4
+or *p>6, and one more if *p==3. So it should be a speed win
+(on balance). However, the revised code is larger, with 4 conditional
+branches instead of 3.
+
+More seriously, there is a byte->word extend before
+each comparison, where there should be only one, and the condition codes
+are not remembered when the same two values are compared twice.
+
+//===---------------------------------------------------------------------===//
+
+More LSR enhancements possible:
+
+1. Teach LSR about pre- and post-indexed ops to allow the iv increment to be
+   merged into a load / store.
+2. Allow iv reuse even when a type conversion is required. For example, i8
+   and i32 load / store addressing modes are identical. (See the loop sketch
+   below.)
+
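+A sketch of the kind of loop meant in item 2 (the function is illustrative
+only): the same induction variable could address both the i8 and the i32
+array, but a separate iv tends to be created for the scaled i32 access today.
+
+void accumulate_bytes(const signed char *a, int *b, int n) {
+  for (int i = 0; i < n; ++i)
+    b[i] += a[i];
+}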
+
+//===---------------------------------------------------------------------===//
+
+This:
+
+int foo(int a, int b, int c, int d) {
+ long long acc = (long long)a * (long long)b;
+ acc += (long long)c * (long long)d;
+ return (int)(acc >> 32);
+}
+
+Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
+two signed 32-bit values to produce a 64-bit value, and accumulates this with
+a 64-bit value.
+
+We currently get this with both v4 and v6:
+
+_foo:
+ smull r1, r0, r1, r0
+ smull r3, r2, r3, r2
+ adds r3, r3, r1
+ adc r0, r2, r0
+ bx lr
+
+//===---------------------------------------------------------------------===//
+
+This:
+ #include <algorithm>
+ std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
+ { return std::make_pair(a + b, a + b < a); }
+ bool no_overflow(unsigned a, unsigned b)
+ { return !full_add(a, b).second; }
+
+Should compile to:
+
+_Z8full_addjj:
+ adds r2, r1, r2
+ movcc r1, #0
+ movcs r1, #1
+ str r2, [r0, #0]
+ strb r1, [r0, #4]
+ mov pc, lr
+
+_Z11no_overflowjj:
+ cmn r0, r1
+ movcs r0, #0
+ movcc r0, #1
+ mov pc, lr
+
+not:
+
+__Z8full_addjj:
+ add r3, r2, r1
+ str r3, [r0]
+ mov r2, #1
+ mov r12, #0
+ cmp r3, r1
+ movlo r12, r2
+ str r12, [r0, #+4]
+ bx lr
+__Z11no_overflowjj:
+ add r3, r1, r0
+ mov r2, #1
+ mov r1, #0
+ cmp r3, r0
+ movhs r1, r2
+ mov r0, r1
+ bx lr
+
+//===---------------------------------------------------------------------===//
+
+Some of the NEON intrinsics may be appropriate for more general use, either
+as target-independent intrinsics or perhaps elsewhere in the ARM backend.
+Some of them may also be lowered to target-independent SDNodes, and perhaps
+some new SDNodes could be added.
+
+For example, maximum, minimum, and absolute value operations are well-defined
+and standard operations, both for vector and scalar types.
+
+The current NEON-specific intrinsics for count leading zeros and count one
+bits could perhaps be replaced by the target-independent ctlz and ctpop
+intrinsics. It may also make sense to add a target-independent "ctls"
+intrinsic for "count leading sign bits". Likewise, the backend could use
+the target-independent SDNodes for these operations.
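+
+A scalar C sketch of the proposed "ctls" semantics (the helper name and the
+clz-based formulation are assumptions, not an existing intrinsic; an
+arithmetic right shift is assumed for signed >>):
+
+#include <stdint.h>
+
+static inline int ctls32(int32_t x) {
+  /* XOR with the sign mask turns leading sign bits into leading zeros. */
+  uint32_t t = (uint32_t)(x ^ (x >> 31));
+  if (t == 0)
+    return 31;                  /* every bit matches the sign bit */
+  return __builtin_clz(t) - 1;  /* exclude the sign bit itself */
+}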
+
+ARMv6 has scalar saturating and halving adds and subtracts. The same
+intrinsics could possibly be used for both NEON's vector implementations of
+those operations and the ARMv6 scalar versions.
+
+//===---------------------------------------------------------------------===//
+
+Split out LDR (literal) from the normal ARM LDR instruction. Also consider
+splitting LDR into imm12 and so_reg forms. This would allow us to clean up some
+code, e.g. ARMLoadStoreOptimizer would not need to look at LDR (literal) and
+LDR (so_reg), while ARMConstantIslandPass would only need to worry about
+LDR (literal).
+
+//===---------------------------------------------------------------------===//
+
+Constant island pass should make use of full range SoImm values for LEApcrel.
+Be careful though as the last attempt caused infinite looping on lencod.
+
+//===---------------------------------------------------------------------===//
+
+Predication issue. This function:
+
+extern unsigned array[ 128 ];
+int foo( int x ) {
+ int y;
+ y = array[ x & 127 ];
+ if ( x & 128 )
+ y = 123456789 & ( y >> 2 );
+ else
+ y = 123456789 & y;
+ return y;
+}
+
+compiles to:
+
+_foo:
+ and r1, r0, #127
+ ldr r2, LCPI1_0
+ ldr r2, [r2]
+ ldr r1, [r2, +r1, lsl #2]
+ mov r2, r1, lsr #2
+ tst r0, #128
+ moveq r2, r1
+ ldr r0, LCPI1_1
+ and r0, r2, r0
+ bx lr
+
+It would be better to do something like this, to fold the shift into the
+conditional move:
+
+ and r1, r0, #127
+ ldr r2, LCPI1_0
+ ldr r2, [r2]
+ ldr r1, [r2, +r1, lsl #2]
+ tst r0, #128
+ movne r1, r1, lsr #2
+ ldr r0, LCPI1_1
+ and r0, r1, r0
+ bx lr
+
+This saves an instruction and a register.
+
+//===---------------------------------------------------------------------===//
+
+It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
+with the same bottom half.
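+
+For example (illustrative constants): both stores below need immediates with
+the same low half 0x5678, so a single movw could be shared and only the movt
+would differ.
+
+void store_two(unsigned *p, unsigned *q) {
+  *p = 0x12345678u;
+  *q = 0xabcd5678u;
+}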
+
+//===---------------------------------------------------------------------===//
+
+Robert Muth started working on an alternate jump table implementation that
+does not put the tables in-line in the text. This is more like the llvm
+default jump table implementation. This might be useful sometime. Several
+revisions of patches are on the mailing list, beginning at:
+http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html
+
+//===---------------------------------------------------------------------===//
+
+Make use of the "rbit" instruction.
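+
+A generic C bit reversal (no ARM intrinsics assumed) that a pattern match
+could turn into a single rbit on cores that have it:
+
+unsigned reverse_bits(unsigned x) {
+  unsigned r = 0;
+  for (int i = 0; i < 32; ++i) {  /* assumes 32-bit unsigned */
+    r = (r << 1) | (x & 1u);
+    x >>= 1;
+  }
+  return r;
+}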
+
+//===---------------------------------------------------------------------===//
+
+Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
+to licm and cse the unnecessary load from cp#1.
+
+//===---------------------------------------------------------------------===//
+
+The CMN instruction sets the flags like an ADD instruction, while CMP sets
+them like a subtract. Therefore, to be able to use CMN for comparisons other
+than the Z bit, we'll need additional logic to reverse the conditionals
+associated with the comparison. Perhaps a pseudo-instruction for the
+comparison, with a post-codegen pass to clean up and handle the condition
+codes? See PR5694 for a testcase.
+
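+A small illustration (codegen will vary): comparing against a negated value is
+where CMN applies. The equality case only needs the Z bit, so it is already
+safe; the ordered case is where the condition-reversal logic described above
+comes in.
+
+int eq_neg(int x, int y) { return x == -y; }  /* cmn, then test Z            */
+int lt_neg(int x, int y) { return x <  -y; }  /* needs an adjusted condition */
+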
+//===---------------------------------------------------------------------===//
+
+Given the following on armv5:
+int test1(int A, int B) {
+ return (A&-8388481)|(B&8388480);
+}
+
+We currently generate:
+ ldr r2, .LCPI0_0
+ and r0, r0, r2
+ ldr r2, .LCPI0_1
+ and r1, r1, r2
+ orr r0, r1, r0
+ bx lr
+
+We should be able to replace the second ldr+and with a bic, i.e. reuse the
+constant which was already loaded (note that -8388481 == ~8388480, so the two
+masks are complements of each other). Not sure what's necessary to do that.
+
+//===---------------------------------------------------------------------===//
+
+The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal:
+
+int a(int x) { return __builtin_bswap32(x); }
+
+a:
+ mov r1, #255, 24
+ mov r2, #255, 16
+ and r1, r1, r0, lsr #8
+ and r2, r2, r0, lsl #8
+ orr r1, r1, r0, lsr #24
+ orr r0, r2, r0, lsl #24
+ orr r0, r0, r1
+ bx lr
+
+Something like the following would be better (fewer instructions/registers):
+ eor r1, r0, r0, ror #16
+ bic r1, r1, #0xff0000
+ mov r1, r1, lsr #8
+ eor r0, r1, r0, ror #8
+ bx lr
+
+A custom Thumb version would also be a slight improvement over the generic
+version.
+
+//===---------------------------------------------------------------------===//
+
+Consider the following simple C code:
+
+void foo(unsigned char *a, unsigned char *b, int *c) {
+ if ((*a | *b) == 0) *c = 0;
+}
+
+Currently llvm-gcc generates something like this (nice branchless code, I'd say):
+
+ ldrb r0, [r0]
+ ldrb r1, [r1]
+ orr r0, r1, r0
+ tst r0, #255
+ moveq r0, #0
+ streq r0, [r2]
+ bx lr
+
+Note that both "tst" and "moveq" are redundant: the orr could set the flags
+itself (orrs), and when the eq condition holds r0 is already zero anyway.
+
+//===---------------------------------------------------------------------===//
+
+When loading immediate constants with movt/movw, if there are multiple
+constants needed with the same low 16 bits, and those values are not live at
+the same time, it would be possible to use a single movw instruction, followed
+by multiple movt instructions to rewrite the high bits to different values.
+For example:
+
+ volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4,
+ !tbaa
+!0
+ volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4,
+ !tbaa
+!0
+
+is compiled and optimized to:
+
+ movw r0, #32796
+ mov.w r1, #-1
+ movt r0, #20480
+ str r1, [r0]
+ movw r0, #32796 @ <= this MOVW is not needed, value is there already
+ movt r0, #20482
+ str r1, [r0]
+
+//===---------------------------------------------------------------------===//
+
+Improve codegen for selects:
+if (x != 0) x = 1
+if (x == 1) x = 1
+
+ARM codegen used to look like this:
+ mov r1, r0
+ cmp r1, #1
+ mov r0, #0
+ moveq r0, #1
+
+The naive lowering selects between two different values. It should recognize
+that the test is an equality test, so this is more of a conditional move than a
+select:
+ cmp r0, #1
+ movne r0, #0
+
+Currently this is an ARM-specific dag combine. We should probably make it a
+target-neutral one.
+
+//===---------------------------------------------------------------------===//
+
+Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins
+are specified to be undefined at zero, so portable code must check for zero
+and handle it as a special case. That is unnecessary on ARM where those
+operations are implemented in a way that is well-defined for zero. For
+example:
+
+int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; }
+
+should just be implemented with a CLZ instruction. Since there are other
+targets, e.g., PPC, that share this behavior, it would be best to implement
+this in a target-independent way: we should probably fold that (when using
+"undefined at zero" semantics) to set the "defined at zero" bit and have
+the code generator expand out the right code.
+
+//===---------------------------------------------------------------------===//
+
+Clean up the test/MC/ARM files to have more robust register choices.
+
+R0 should not be used as a register operand in the assembler tests because it
+is then not possible to distinguish between a correct encoding and a missing
+operand encoding, since zero is the default value for the binary encoder.
+e.g.,
+ add r0, r0 // bad
+ add r3, r5 // good
+
+Register operands should be distinct. That is, when the encoding does not
+require two syntactical operands to refer to the same register, two different
+registers should be used in the test so as to catch errors where the
+operands are swapped in the encoding.
+e.g.,
+ subs.w r1, r1, r1 // bad
+ subs.w r1, r2, r3 // good
+