author | heretic <heretic@yandex-team.ru> | 2022-02-10 16:45:43 +0300
committer | Daniil Cherednik <dcherednik@yandex-team.ru> | 2022-02-10 16:45:43 +0300
commit | 397cbe258b9e064f49c4ca575279f02f39fef76e (patch)
tree | a0b0eb3cca6a14e4e8ea715393637672fa651284 /contrib/libs/llvm12/lib/Target/ARM/README.txt
parent | 43f5a35593ebc9f6bcea619bb170394ea7ae468e (diff)
download | ydb-397cbe258b9e064f49c4ca575279f02f39fef76e.tar.gz
Restoring authorship annotation for <heretic@yandex-team.ru>. Commit 1 of 2.
Diffstat (limited to 'contrib/libs/llvm12/lib/Target/ARM/README.txt')
-rw-r--r-- | contrib/libs/llvm12/lib/Target/ARM/README.txt | 1464
1 file changed, 732 insertions(+), 732 deletions(-)
diff --git a/contrib/libs/llvm12/lib/Target/ARM/README.txt b/contrib/libs/llvm12/lib/Target/ARM/README.txt index def67cfae7..1a93bc7bb7 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/README.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/README.txt @@ -1,732 +1,732 @@ -//===---------------------------------------------------------------------===// -// Random ideas for the ARM backend. -//===---------------------------------------------------------------------===// - -Reimplement 'select' in terms of 'SEL'. - -* We would really like to support UXTAB16, but we need to prove that the - add doesn't need to overflow between the two 16-bit chunks. - -* Implement pre/post increment support. (e.g. PR935) -* Implement smarter constant generation for binops with large immediates. - -A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX - -Interesting optimization for PIC codegen on arm-linux: -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129 - -//===---------------------------------------------------------------------===// - -Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the -time regalloc happens, these values are now in a 32-bit register, usually with -the top-bits known to be sign or zero extended. If spilled, we should be able -to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part -of the reload. - -Doing this reduces the size of the stack frame (important for thumb etc), and -also increases the likelihood that we will be able to reload multiple values -from the stack with a single load. - -//===---------------------------------------------------------------------===// - -The constant island pass is in good shape. Some cleanups might be desirable, -but there is unlikely to be much improvement in the generated code. - -1. There may be some advantage to trying to be smarter about the initial -placement, rather than putting everything at the end. - -2. There might be some compile-time efficiency to be had by representing -consecutive islands as a single block rather than multiple blocks. - -3. Use a priority queue to sort constant pool users in inverse order of - position so we always process the one closed to the end of functions - first. This may simply CreateNewWater. - -//===---------------------------------------------------------------------===// - -Eliminate copysign custom expansion. We are still generating crappy code with -default expansion + if-conversion. - -//===---------------------------------------------------------------------===// - -Eliminate one instruction from: - -define i32 @_Z6slow4bii(i32 %x, i32 %y) { - %tmp = icmp sgt i32 %x, %y - %retval = select i1 %tmp, i32 %x, i32 %y - ret i32 %retval -} - -__Z6slow4bii: - cmp r0, r1 - movgt r1, r0 - mov r0, r1 - bx lr -=> - -__Z6slow4bii: - cmp r0, r1 - movle r0, r1 - bx lr - -//===---------------------------------------------------------------------===// - -Implement long long "X-3" with instructions that fold the immediate in. These -were disabled due to badness with the ARM carry flag on subtracts. - -//===---------------------------------------------------------------------===// - -More load / store optimizations: -1) Better representation for block transfer? This is from Olden/power: - - fldd d0, [r4] - fstd d0, [r4, #+32] - fldd d0, [r4, #+8] - fstd d0, [r4, #+40] - fldd d0, [r4, #+16] - fstd d0, [r4, #+48] - fldd d0, [r4, #+24] - fstd d0, [r4, #+56] - -If we can spare the registers, it would be better to use fldm and fstm here. 
-Need major register allocator enhancement though. - -2) Can we recognize the relative position of constantpool entries? i.e. Treat - - ldr r0, LCPI17_3 - ldr r1, LCPI17_4 - ldr r2, LCPI17_5 - - as - ldr r0, LCPI17 - ldr r1, LCPI17+4 - ldr r2, LCPI17+8 - - Then the ldr's can be combined into a single ldm. See Olden/power. - -Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a -double 64-bit FP constant: - - adr r0, L6 - ldmia r0, {r0-r1} - - .align 2 -L6: - .long -858993459 - .long 1074318540 - -3) struct copies appear to be done field by field -instead of by words, at least sometimes: - -struct foo { int x; short s; char c1; char c2; }; -void cpy(struct foo*a, struct foo*b) { *a = *b; } - -llvm code (-O2) - ldrb r3, [r1, #+6] - ldr r2, [r1] - ldrb r12, [r1, #+7] - ldrh r1, [r1, #+4] - str r2, [r0] - strh r1, [r0, #+4] - strb r3, [r0, #+6] - strb r12, [r0, #+7] -gcc code (-O2) - ldmia r1, {r1-r2} - stmia r0, {r1-r2} - -In this benchmark poor handling of aggregate copies has shown up as -having a large effect on size, and possibly speed as well (we don't have -a good way to measure on ARM). - -//===---------------------------------------------------------------------===// - -* Consider this silly example: - -double bar(double x) { - double r = foo(3.1); - return x+r; -} - -_bar: - stmfd sp!, {r4, r5, r7, lr} - add r7, sp, #8 - mov r4, r0 - mov r5, r1 - fldd d0, LCPI1_0 - fmrrd r0, r1, d0 - bl _foo - fmdrr d0, r4, r5 - fmsr s2, r0 - fsitod d1, s2 - faddd d0, d1, d0 - fmrrd r0, r1, d0 - ldmfd sp!, {r4, r5, r7, pc} - -Ignore the prologue and epilogue stuff for a second. Note - mov r4, r0 - mov r5, r1 -the copys to callee-save registers and the fact they are only being used by the -fmdrr instruction. It would have been better had the fmdrr been scheduled -before the call and place the result in a callee-save DPR register. The two -mov ops would not have been necessary. - -//===---------------------------------------------------------------------===// - -Calling convention related stuff: - -* gcc's parameter passing implementation is terrible and we suffer as a result: - -e.g. -struct s { - double d1; - int s1; -}; - -void foo(struct s S) { - printf("%g, %d\n", S.d1, S.s1); -} - -'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and -then reload them to r1, r2, and r3 before issuing the call (r0 contains the -address of the format string): - - stmfd sp!, {r7, lr} - add r7, sp, #0 - sub sp, sp, #12 - stmia sp, {r0, r1, r2} - ldmia sp, {r1-r2} - ldr r0, L5 - ldr r3, [sp, #8] -L2: - add r0, pc, r0 - bl L_printf$stub - -Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves? - -* Return an aggregate type is even worse: - -e.g. -struct s foo(void) { - struct s S = {1.1, 2}; - return S; -} - - mov ip, r0 - ldr r0, L5 - sub sp, sp, #12 -L2: - add r0, pc, r0 - @ lr needed for prologue - ldmia r0, {r0, r1, r2} - stmia sp, {r0, r1, r2} - stmia ip, {r0, r1, r2} - mov r0, ip - add sp, sp, #12 - bx lr - -r0 (and later ip) is the hidden parameter from caller to store the value in. The -first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1, -r2 into the address passed in. However, there is one additional stmia that -stores r0, r1, and r2 to some stack location. The store is dead. 
- -The llvm-gcc generated code looks like this: - -csretcc void %foo(%struct.s* %agg.result) { -entry: - %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1] - %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1] - cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2] - call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 ) - cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2] - call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 ) - cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1] - call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 ) - ret void -} - -llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from -constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated -into a number of load and stores, or 2) custom lower memcpy (of small size) to -be ldmia / stmia. I think option 2 is better but the current register -allocator cannot allocate a chunk of registers at a time. - -A feasible temporary solution is to use specific physical registers at the -lowering time for small (<= 4 words?) transfer size. - -* ARM CSRet calling convention requires the hidden argument to be returned by -the callee. - -//===---------------------------------------------------------------------===// - -We can definitely do a better job on BB placements to eliminate some branches. -It's very common to see llvm generated assembly code that looks like this: - -LBB3: - ... -LBB4: -... - beq LBB3 - b LBB2 - -If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can -then eliminate beq and turn the unconditional branch to LBB2 to a bne. - -See McCat/18-imp/ComputeBoundingBoxes for an example. - -//===---------------------------------------------------------------------===// - -Pre-/post- indexed load / stores: - -1) We should not make the pre/post- indexed load/store transform if the base ptr -is guaranteed to be live beyond the load/store. This can happen if the base -ptr is live out of the block we are performing the optimization. e.g. - -mov r1, r2 -ldr r3, [r1], #4 -... - -vs. - -ldr r3, [r2] -add r1, r2, #4 -... - -In most cases, this is just a wasted optimization. However, sometimes it can -negatively impact the performance because two-address code is more restrictive -when it comes to scheduling. - -Unfortunately, liveout information is currently unavailable during DAG combine -time. - -2) Consider spliting a indexed load / store into a pair of add/sub + load/store - to solve #1 (in TwoAddressInstructionPass.cpp). - -3) Enhance LSR to generate more opportunities for indexed ops. - -4) Once we added support for multiple result patterns, write indexed loads - patterns instead of C++ instruction selection code. - -5) Use VLDM / VSTM to emulate indexed FP load / store. - -//===---------------------------------------------------------------------===// - -Implement support for some more tricky ways to materialize immediates. For -example, to get 0xffff8000, we can use: - -mov r9, #&3f8000 -sub r9, r9, #&400000 - -//===---------------------------------------------------------------------===// - -We sometimes generate multiple add / sub instructions to update sp in prologue -and epilogue if the inc / dec value is too large to fit in a single immediate -operand. In some cases, perhaps it might be better to load the value from a -constantpool instead. 
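As a rough illustration of why constants such as 0xffff8000 need tricks like the
mov/sub pair above (or a constantpool load for large sp adjustments): an ARM
data-processing immediate must be an 8-bit value rotated right by an even
amount. A minimal checker for that property might look like the sketch below;
the helper name is made up for illustration and is not an existing LLVM API.

    #include <stdint.h>

    /* Return 1 if v is encodable as an ARM "rotated 8-bit" immediate,
       i.e. an 8-bit value rotated right by an even amount. */
    static int is_rotated_imm8(uint32_t v) {
      for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotate left by rot bits, undoing a ROR-by-rot encoding. */
        uint32_t r = rot ? ((v << rot) | (v >> (32u - rot))) : v;
        if (r <= 0xffu)
          return 1;
      }
      return 0;
    }

Under this test 0x003f8000 and 0x00400000 both qualify while 0xffff8000 does
not, which is exactly why a two-instruction sequence (or a load) is needed.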
- -//===---------------------------------------------------------------------===// - -GCC generates significantly better code for this function. - -int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) { - int i = 0; - - if (StackPtr != 0) { - while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768))) - Line[i++] = Stack[--StackPtr]; - if (LineLen > 32768) - { - while (StackPtr != 0 && i < LineLen) - { - i++; - --StackPtr; - } - } - } - return StackPtr; -} - -//===---------------------------------------------------------------------===// - -This should compile to the mlas instruction: -int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; } - -//===---------------------------------------------------------------------===// - -At some point, we should triage these to see if they still apply to us: - -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016 - -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982 - -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663 - -http://www.inf.u-szeged.hu/gcc-arm/ -http://citeseer.ist.psu.edu/debus04linktime.html - -//===---------------------------------------------------------------------===// - -gcc generates smaller code for this function at -O2 or -Os: - -void foo(signed char* p) { - if (*p == 3) - bar(); - else if (*p == 4) - baz(); - else if (*p == 5) - quux(); -} - -llvm decides it's a good idea to turn the repeated if...else into a -binary tree, as if it were a switch; the resulting code requires -1 -compare-and-branches when *p<=2 or *p==5, the same number if *p==4 -or *p>6, and +1 if *p==3. So it should be a speed win -(on balance). However, the revised code is larger, with 4 conditional -branches instead of 3. - -More seriously, there is a byte->word extend before -each comparison, where there should be only one, and the condition codes -are not remembered when the same two values are compared twice. - -//===---------------------------------------------------------------------===// - -More LSR enhancements possible: - -1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged - in a load / store. -2. Allow iv reuse even when a type conversion is required. For example, i8 - and i32 load / store addressing modes are identical. - - -//===---------------------------------------------------------------------===// - -This: - -int foo(int a, int b, int c, int d) { - long long acc = (long long)a * (long long)b; - acc += (long long)c * (long long)d; - return (int)(acc >> 32); -} - -Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies -two signed 32-bit values to produce a 64-bit value, and accumulates this with -a 64-bit value. 
- -We currently get this with both v4 and v6: - -_foo: - smull r1, r0, r1, r0 - smull r3, r2, r3, r2 - adds r3, r3, r1 - adc r0, r2, r0 - bx lr - -//===---------------------------------------------------------------------===// - -This: - #include <algorithm> - std::pair<unsigned, bool> full_add(unsigned a, unsigned b) - { return std::make_pair(a + b, a + b < a); } - bool no_overflow(unsigned a, unsigned b) - { return !full_add(a, b).second; } - -Should compile to: - -_Z8full_addjj: - adds r2, r1, r2 - movcc r1, #0 - movcs r1, #1 - str r2, [r0, #0] - strb r1, [r0, #4] - mov pc, lr - -_Z11no_overflowjj: - cmn r0, r1 - movcs r0, #0 - movcc r0, #1 - mov pc, lr - -not: - -__Z8full_addjj: - add r3, r2, r1 - str r3, [r0] - mov r2, #1 - mov r12, #0 - cmp r3, r1 - movlo r12, r2 - str r12, [r0, #+4] - bx lr -__Z11no_overflowjj: - add r3, r1, r0 - mov r2, #1 - mov r1, #0 - cmp r3, r0 - movhs r1, r2 - mov r0, r1 - bx lr - -//===---------------------------------------------------------------------===// - -Some of the NEON intrinsics may be appropriate for more general use, either -as target-independent intrinsics or perhaps elsewhere in the ARM backend. -Some of them may also be lowered to target-independent SDNodes, and perhaps -some new SDNodes could be added. - -For example, maximum, minimum, and absolute value operations are well-defined -and standard operations, both for vector and scalar types. - -The current NEON-specific intrinsics for count leading zeros and count one -bits could perhaps be replaced by the target-independent ctlz and ctpop -intrinsics. It may also make sense to add a target-independent "ctls" -intrinsic for "count leading sign bits". Likewise, the backend could use -the target-independent SDNodes for these operations. - -ARMv6 has scalar saturating and halving adds and subtracts. The same -intrinsics could possibly be used for both NEON's vector implementations of -those operations and the ARMv6 scalar versions. - -//===---------------------------------------------------------------------===// - -Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting -LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g. -ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg) -while ARMConstantIslandPass only need to worry about LDR (literal). - -//===---------------------------------------------------------------------===// - -Constant island pass should make use of full range SoImm values for LEApcrel. -Be careful though as the last attempt caused infinite looping on lencod. - -//===---------------------------------------------------------------------===// - -Predication issue. This function: - -extern unsigned array[ 128 ]; -int foo( int x ) { - int y; - y = array[ x & 127 ]; - if ( x & 128 ) - y = 123456789 & ( y >> 2 ); - else - y = 123456789 & y; - return y; -} - -compiles to: - -_foo: - and r1, r0, #127 - ldr r2, LCPI1_0 - ldr r2, [r2] - ldr r1, [r2, +r1, lsl #2] - mov r2, r1, lsr #2 - tst r0, #128 - moveq r2, r1 - ldr r0, LCPI1_1 - and r0, r2, r0 - bx lr - -It would be better to do something like this, to fold the shift into the -conditional move: - - and r1, r0, #127 - ldr r2, LCPI1_0 - ldr r2, [r2] - ldr r1, [r2, +r1, lsl #2] - tst r0, #128 - movne r1, r1, lsr #2 - ldr r0, LCPI1_1 - and r0, r1, r0 - bx lr - -it saves an instruction and a register. 
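At the source level the fold is legal because the 123456789 mask distributes
over the select, so the shift is the only thing that actually depends on the
condition. A C sketch of the rewritten form (illustrative only; it mirrors the
function above but uses an unsigned temporary):

    extern unsigned array[128];

    int foo_rewritten(int x) {
      unsigned y = array[x & 127];
      if (x & 128)                    /* becomes the predicated shift:
                                         movne r1, r1, lsr #2            */
        y >>= 2;
      return (int)(y & 123456789u);   /* one AND shared by both arms     */
    }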
- -//===---------------------------------------------------------------------===// - -It might be profitable to cse MOVi16 if there are lots of 32-bit immediates -with the same bottom half. - -//===---------------------------------------------------------------------===// - -Robert Muth started working on an alternate jump table implementation that -does not put the tables in-line in the text. This is more like the llvm -default jump table implementation. This might be useful sometime. Several -revisions of patches are on the mailing list, beginning at: -http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html - -//===---------------------------------------------------------------------===// - -Make use of the "rbit" instruction. - -//===---------------------------------------------------------------------===// - -Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how -to licm and cse the unnecessary load from cp#1. - -//===---------------------------------------------------------------------===// - -The CMN instruction sets the flags like an ADD instruction, while CMP sets -them like a subtract. Therefore to be able to use CMN for comparisons other -than the Z bit, we'll need additional logic to reverse the conditionals -associated with the comparison. Perhaps a pseudo-instruction for the comparison, -with a post-codegen pass to clean up and handle the condition codes? -See PR5694 for testcase. - -//===---------------------------------------------------------------------===// - -Given the following on armv5: -int test1(int A, int B) { - return (A&-8388481)|(B&8388480); -} - -We currently generate: - ldr r2, .LCPI0_0 - and r0, r0, r2 - ldr r2, .LCPI0_1 - and r1, r1, r2 - orr r0, r1, r0 - bx lr - -We should be able to replace the second ldr+and with a bic (i.e. reuse the -constant which was already loaded). Not sure what's necessary to do that. - -//===---------------------------------------------------------------------===// - -The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal: - -int a(int x) { return __builtin_bswap32(x); } - -a: - mov r1, #255, 24 - mov r2, #255, 16 - and r1, r1, r0, lsr #8 - and r2, r2, r0, lsl #8 - orr r1, r1, r0, lsr #24 - orr r0, r2, r0, lsl #24 - orr r0, r0, r1 - bx lr - -Something like the following would be better (fewer instructions/registers): - eor r1, r0, r0, ror #16 - bic r1, r1, #0xff0000 - mov r1, r1, lsr #8 - eor r0, r1, r0, ror #8 - bx lr - -A custom Thumb version would also be a slight improvement over the generic -version. - -//===---------------------------------------------------------------------===// - -Consider the following simple C code: - -void foo(unsigned char *a, unsigned char *b, int *c) { - if ((*a | *b) == 0) *c = 0; -} - -currently llvm-gcc generates something like this (nice branchless code I'd say): - - ldrb r0, [r0] - ldrb r1, [r1] - orr r0, r1, r0 - tst r0, #255 - moveq r0, #0 - streq r0, [r2] - bx lr - -Note that both "tst" and "moveq" are redundant. - -//===---------------------------------------------------------------------===// - -When loading immediate constants with movt/movw, if there are multiple -constants needed with the same low 16 bits, and those values are not live at -the same time, it would be possible to use a single movw instruction, followed -by multiple movt instructions to rewrite the high bits to different values. 
-For example: - - volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, - !tbaa -!0 - volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, - !tbaa -!0 - -is compiled and optimized to: - - movw r0, #32796 - mov.w r1, #-1 - movt r0, #20480 - str r1, [r0] - movw r0, #32796 @ <= this MOVW is not needed, value is there already - movt r0, #20482 - str r1, [r0] - -//===---------------------------------------------------------------------===// - -Improve codegen for select's: -if (x != 0) x = 1 -if (x == 1) x = 1 - -ARM codegen used to look like this: - mov r1, r0 - cmp r1, #1 - mov r0, #0 - moveq r0, #1 - -The naive lowering select between two different values. It should recognize the -test is equality test so it's more a conditional move rather than a select: - cmp r0, #1 - movne r0, #0 - -Currently this is a ARM specific dag combine. We probably should make it into a -target-neutral one. - -//===---------------------------------------------------------------------===// - -Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins -are specified to be undefined at zero, so portable code must check for zero -and handle it as a special case. That is unnecessary on ARM where those -operations are implemented in a way that is well-defined for zero. For -example: - -int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; } - -should just be implemented with a CLZ instruction. Since there are other -targets, e.g., PPC, that share this behavior, it would be best to implement -this in a target-independent way: we should probably fold that (when using -"undefined at zero" semantics) to set the "defined at zero" bit and have -the code generator expand out the right code. - -//===---------------------------------------------------------------------===// - -Clean up the test/MC/ARM files to have more robust register choices. - -R0 should not be used as a register operand in the assembler tests as it's then -not possible to distinguish between a correct encoding and a missing operand -encoding, as zero is the default value for the binary encoder. -e.g., - add r0, r0 // bad - add r3, r5 // good - -Register operands should be distinct. That is, when the encoding does not -require two syntactical operands to refer to the same register, two different -registers should be used in the test so as to catch errors where the -operands are swapped in the encoding. -e.g., - subs.w r1, r1, r1 // bad - subs.w r1, r2, r3 // good - +//===---------------------------------------------------------------------===// +// Random ideas for the ARM backend. +//===---------------------------------------------------------------------===// + +Reimplement 'select' in terms of 'SEL'. + +* We would really like to support UXTAB16, but we need to prove that the + add doesn't need to overflow between the two 16-bit chunks. + +* Implement pre/post increment support. (e.g. PR935) +* Implement smarter constant generation for binops with large immediates. + +A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX + +Interesting optimization for PIC codegen on arm-linux: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129 + +//===---------------------------------------------------------------------===// + +Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the +time regalloc happens, these values are now in a 32-bit register, usually with +the top-bits known to be sign or zero extended. 
If spilled, we should be able +to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part +of the reload. + +Doing this reduces the size of the stack frame (important for thumb etc), and +also increases the likelihood that we will be able to reload multiple values +from the stack with a single load. + +//===---------------------------------------------------------------------===// + +The constant island pass is in good shape. Some cleanups might be desirable, +but there is unlikely to be much improvement in the generated code. + +1. There may be some advantage to trying to be smarter about the initial +placement, rather than putting everything at the end. + +2. There might be some compile-time efficiency to be had by representing +consecutive islands as a single block rather than multiple blocks. + +3. Use a priority queue to sort constant pool users in inverse order of + position so we always process the one closed to the end of functions + first. This may simply CreateNewWater. + +//===---------------------------------------------------------------------===// + +Eliminate copysign custom expansion. We are still generating crappy code with +default expansion + if-conversion. + +//===---------------------------------------------------------------------===// + +Eliminate one instruction from: + +define i32 @_Z6slow4bii(i32 %x, i32 %y) { + %tmp = icmp sgt i32 %x, %y + %retval = select i1 %tmp, i32 %x, i32 %y + ret i32 %retval +} + +__Z6slow4bii: + cmp r0, r1 + movgt r1, r0 + mov r0, r1 + bx lr +=> + +__Z6slow4bii: + cmp r0, r1 + movle r0, r1 + bx lr + +//===---------------------------------------------------------------------===// + +Implement long long "X-3" with instructions that fold the immediate in. These +were disabled due to badness with the ARM carry flag on subtracts. + +//===---------------------------------------------------------------------===// + +More load / store optimizations: +1) Better representation for block transfer? This is from Olden/power: + + fldd d0, [r4] + fstd d0, [r4, #+32] + fldd d0, [r4, #+8] + fstd d0, [r4, #+40] + fldd d0, [r4, #+16] + fstd d0, [r4, #+48] + fldd d0, [r4, #+24] + fstd d0, [r4, #+56] + +If we can spare the registers, it would be better to use fldm and fstm here. +Need major register allocator enhancement though. + +2) Can we recognize the relative position of constantpool entries? i.e. Treat + + ldr r0, LCPI17_3 + ldr r1, LCPI17_4 + ldr r2, LCPI17_5 + + as + ldr r0, LCPI17 + ldr r1, LCPI17+4 + ldr r2, LCPI17+8 + + Then the ldr's can be combined into a single ldm. See Olden/power. + +Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a +double 64-bit FP constant: + + adr r0, L6 + ldmia r0, {r0-r1} + + .align 2 +L6: + .long -858993459 + .long 1074318540 + +3) struct copies appear to be done field by field +instead of by words, at least sometimes: + +struct foo { int x; short s; char c1; char c2; }; +void cpy(struct foo*a, struct foo*b) { *a = *b; } + +llvm code (-O2) + ldrb r3, [r1, #+6] + ldr r2, [r1] + ldrb r12, [r1, #+7] + ldrh r1, [r1, #+4] + str r2, [r0] + strh r1, [r0, #+4] + strb r3, [r0, #+6] + strb r12, [r0, #+7] +gcc code (-O2) + ldmia r1, {r1-r2} + stmia r0, {r1-r2} + +In this benchmark poor handling of aggregate copies has shown up as +having a large effect on size, and possibly speed as well (we don't have +a good way to measure on ARM). 
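For comparison, the ldmia/stmia form gcc produces corresponds to treating the
assignment as one aligned 8-byte block move rather than four field accesses.
The sketch below only illustrates that shape at the source level; it is not a
proposed source change, and the function name is made up:

    #include <string.h>

    struct foo { int x; short s; char c1; char c2; };  /* 8 bytes, 4-byte aligned */

    void cpy_block(struct foo *a, struct foo *b) {
      /* A fixed-size, word-aligned copy is what the two-instruction
         ldmia r1, {r1-r2} / stmia r0, {r1-r2} sequence implements. */
      memcpy(a, b, sizeof *a);
    }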
+ +//===---------------------------------------------------------------------===// + +* Consider this silly example: + +double bar(double x) { + double r = foo(3.1); + return x+r; +} + +_bar: + stmfd sp!, {r4, r5, r7, lr} + add r7, sp, #8 + mov r4, r0 + mov r5, r1 + fldd d0, LCPI1_0 + fmrrd r0, r1, d0 + bl _foo + fmdrr d0, r4, r5 + fmsr s2, r0 + fsitod d1, s2 + faddd d0, d1, d0 + fmrrd r0, r1, d0 + ldmfd sp!, {r4, r5, r7, pc} + +Ignore the prologue and epilogue stuff for a second. Note + mov r4, r0 + mov r5, r1 +the copys to callee-save registers and the fact they are only being used by the +fmdrr instruction. It would have been better had the fmdrr been scheduled +before the call and place the result in a callee-save DPR register. The two +mov ops would not have been necessary. + +//===---------------------------------------------------------------------===// + +Calling convention related stuff: + +* gcc's parameter passing implementation is terrible and we suffer as a result: + +e.g. +struct s { + double d1; + int s1; +}; + +void foo(struct s S) { + printf("%g, %d\n", S.d1, S.s1); +} + +'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and +then reload them to r1, r2, and r3 before issuing the call (r0 contains the +address of the format string): + + stmfd sp!, {r7, lr} + add r7, sp, #0 + sub sp, sp, #12 + stmia sp, {r0, r1, r2} + ldmia sp, {r1-r2} + ldr r0, L5 + ldr r3, [sp, #8] +L2: + add r0, pc, r0 + bl L_printf$stub + +Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves? + +* Return an aggregate type is even worse: + +e.g. +struct s foo(void) { + struct s S = {1.1, 2}; + return S; +} + + mov ip, r0 + ldr r0, L5 + sub sp, sp, #12 +L2: + add r0, pc, r0 + @ lr needed for prologue + ldmia r0, {r0, r1, r2} + stmia sp, {r0, r1, r2} + stmia ip, {r0, r1, r2} + mov r0, ip + add sp, sp, #12 + bx lr + +r0 (and later ip) is the hidden parameter from caller to store the value in. The +first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1, +r2 into the address passed in. However, there is one additional stmia that +stores r0, r1, and r2 to some stack location. The store is dead. + +The llvm-gcc generated code looks like this: + +csretcc void %foo(%struct.s* %agg.result) { +entry: + %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1] + %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1] + cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2] + call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 ) + cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2] + call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 ) + cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1] + call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 ) + ret void +} + +llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from +constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated +into a number of load and stores, or 2) custom lower memcpy (of small size) to +be ldmia / stmia. I think option 2 is better but the current register +allocator cannot allocate a chunk of registers at a time. + +A feasible temporary solution is to use specific physical registers at the +lowering time for small (<= 4 words?) transfer size. + +* ARM CSRet calling convention requires the hidden argument to be returned by +the callee. 
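To make the hidden-argument mechanics concrete, the aggregate return above is
roughly equivalent to the hand-lowered C below. This is a sketch of the
convention only; the function name is illustrative and the code is not
compiler output:

    struct s { double d1; int s1; };

    /* 'struct s foo(void)' under the CSRet convention: the caller passes a
       pointer to the result slot in r0, and the callee both fills it in and
       hands the same pointer back. */
    struct s *foo_lowered(struct s *hidden_result) {
      hidden_result->d1 = 1.1;
      hidden_result->s1 = 2;
      return hidden_result;
    }

The dead stmia noted above is a spurious extra copy of this result into a
stack temporary (the %memtmp in the llvm-gcc IR) that nothing ever reads.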
+ +//===---------------------------------------------------------------------===// + +We can definitely do a better job on BB placements to eliminate some branches. +It's very common to see llvm generated assembly code that looks like this: + +LBB3: + ... +LBB4: +... + beq LBB3 + b LBB2 + +If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can +then eliminate beq and turn the unconditional branch to LBB2 to a bne. + +See McCat/18-imp/ComputeBoundingBoxes for an example. + +//===---------------------------------------------------------------------===// + +Pre-/post- indexed load / stores: + +1) We should not make the pre/post- indexed load/store transform if the base ptr +is guaranteed to be live beyond the load/store. This can happen if the base +ptr is live out of the block we are performing the optimization. e.g. + +mov r1, r2 +ldr r3, [r1], #4 +... + +vs. + +ldr r3, [r2] +add r1, r2, #4 +... + +In most cases, this is just a wasted optimization. However, sometimes it can +negatively impact the performance because two-address code is more restrictive +when it comes to scheduling. + +Unfortunately, liveout information is currently unavailable during DAG combine +time. + +2) Consider spliting a indexed load / store into a pair of add/sub + load/store + to solve #1 (in TwoAddressInstructionPass.cpp). + +3) Enhance LSR to generate more opportunities for indexed ops. + +4) Once we added support for multiple result patterns, write indexed loads + patterns instead of C++ instruction selection code. + +5) Use VLDM / VSTM to emulate indexed FP load / store. + +//===---------------------------------------------------------------------===// + +Implement support for some more tricky ways to materialize immediates. For +example, to get 0xffff8000, we can use: + +mov r9, #&3f8000 +sub r9, r9, #&400000 + +//===---------------------------------------------------------------------===// + +We sometimes generate multiple add / sub instructions to update sp in prologue +and epilogue if the inc / dec value is too large to fit in a single immediate +operand. In some cases, perhaps it might be better to load the value from a +constantpool instead. + +//===---------------------------------------------------------------------===// + +GCC generates significantly better code for this function. + +int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) { + int i = 0; + + if (StackPtr != 0) { + while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768))) + Line[i++] = Stack[--StackPtr]; + if (LineLen > 32768) + { + while (StackPtr != 0 && i < LineLen) + { + i++; + --StackPtr; + } + } + } + return StackPtr; +} + +//===---------------------------------------------------------------------===// + +This should compile to the mlas instruction: +int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 
7 : 13; } + +//===---------------------------------------------------------------------===// + +At some point, we should triage these to see if they still apply to us: + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016 + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982 + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663 + +http://www.inf.u-szeged.hu/gcc-arm/ +http://citeseer.ist.psu.edu/debus04linktime.html + +//===---------------------------------------------------------------------===// + +gcc generates smaller code for this function at -O2 or -Os: + +void foo(signed char* p) { + if (*p == 3) + bar(); + else if (*p == 4) + baz(); + else if (*p == 5) + quux(); +} + +llvm decides it's a good idea to turn the repeated if...else into a +binary tree, as if it were a switch; the resulting code requires -1 +compare-and-branches when *p<=2 or *p==5, the same number if *p==4 +or *p>6, and +1 if *p==3. So it should be a speed win +(on balance). However, the revised code is larger, with 4 conditional +branches instead of 3. + +More seriously, there is a byte->word extend before +each comparison, where there should be only one, and the condition codes +are not remembered when the same two values are compared twice. + +//===---------------------------------------------------------------------===// + +More LSR enhancements possible: + +1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged + in a load / store. +2. Allow iv reuse even when a type conversion is required. For example, i8 + and i32 load / store addressing modes are identical. + + +//===---------------------------------------------------------------------===// + +This: + +int foo(int a, int b, int c, int d) { + long long acc = (long long)a * (long long)b; + acc += (long long)c * (long long)d; + return (int)(acc >> 32); +} + +Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies +two signed 32-bit values to produce a 64-bit value, and accumulates this with +a 64-bit value. 
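A hand-written sketch of the smlal-based sequence the note is asking for (this
is not current compiler output; register assignments follow the r0-r3 argument
registers and match the operand order of the existing smull below):

    _foo:
            smull   r1, r0, r1, r0     @ r0:r1  = (long long)a * (long long)b
            smlal   r1, r0, r3, r2     @ r0:r1 += (long long)c * (long long)d
            bx      lr                 @ acc >> 32 is the high word, already in r0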
+ +We currently get this with both v4 and v6: + +_foo: + smull r1, r0, r1, r0 + smull r3, r2, r3, r2 + adds r3, r3, r1 + adc r0, r2, r0 + bx lr + +//===---------------------------------------------------------------------===// + +This: + #include <algorithm> + std::pair<unsigned, bool> full_add(unsigned a, unsigned b) + { return std::make_pair(a + b, a + b < a); } + bool no_overflow(unsigned a, unsigned b) + { return !full_add(a, b).second; } + +Should compile to: + +_Z8full_addjj: + adds r2, r1, r2 + movcc r1, #0 + movcs r1, #1 + str r2, [r0, #0] + strb r1, [r0, #4] + mov pc, lr + +_Z11no_overflowjj: + cmn r0, r1 + movcs r0, #0 + movcc r0, #1 + mov pc, lr + +not: + +__Z8full_addjj: + add r3, r2, r1 + str r3, [r0] + mov r2, #1 + mov r12, #0 + cmp r3, r1 + movlo r12, r2 + str r12, [r0, #+4] + bx lr +__Z11no_overflowjj: + add r3, r1, r0 + mov r2, #1 + mov r1, #0 + cmp r3, r0 + movhs r1, r2 + mov r0, r1 + bx lr + +//===---------------------------------------------------------------------===// + +Some of the NEON intrinsics may be appropriate for more general use, either +as target-independent intrinsics or perhaps elsewhere in the ARM backend. +Some of them may also be lowered to target-independent SDNodes, and perhaps +some new SDNodes could be added. + +For example, maximum, minimum, and absolute value operations are well-defined +and standard operations, both for vector and scalar types. + +The current NEON-specific intrinsics for count leading zeros and count one +bits could perhaps be replaced by the target-independent ctlz and ctpop +intrinsics. It may also make sense to add a target-independent "ctls" +intrinsic for "count leading sign bits". Likewise, the backend could use +the target-independent SDNodes for these operations. + +ARMv6 has scalar saturating and halving adds and subtracts. The same +intrinsics could possibly be used for both NEON's vector implementations of +those operations and the ARMv6 scalar versions. + +//===---------------------------------------------------------------------===// + +Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting +LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g. +ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg) +while ARMConstantIslandPass only need to worry about LDR (literal). + +//===---------------------------------------------------------------------===// + +Constant island pass should make use of full range SoImm values for LEApcrel. +Be careful though as the last attempt caused infinite looping on lencod. + +//===---------------------------------------------------------------------===// + +Predication issue. This function: + +extern unsigned array[ 128 ]; +int foo( int x ) { + int y; + y = array[ x & 127 ]; + if ( x & 128 ) + y = 123456789 & ( y >> 2 ); + else + y = 123456789 & y; + return y; +} + +compiles to: + +_foo: + and r1, r0, #127 + ldr r2, LCPI1_0 + ldr r2, [r2] + ldr r1, [r2, +r1, lsl #2] + mov r2, r1, lsr #2 + tst r0, #128 + moveq r2, r1 + ldr r0, LCPI1_1 + and r0, r2, r0 + bx lr + +It would be better to do something like this, to fold the shift into the +conditional move: + + and r1, r0, #127 + ldr r2, LCPI1_0 + ldr r2, [r2] + ldr r1, [r2, +r1, lsl #2] + tst r0, #128 + movne r1, r1, lsr #2 + ldr r0, LCPI1_1 + and r0, r1, r0 + bx lr + +it saves an instruction and a register. 
+ +//===---------------------------------------------------------------------===// + +It might be profitable to cse MOVi16 if there are lots of 32-bit immediates +with the same bottom half. + +//===---------------------------------------------------------------------===// + +Robert Muth started working on an alternate jump table implementation that +does not put the tables in-line in the text. This is more like the llvm +default jump table implementation. This might be useful sometime. Several +revisions of patches are on the mailing list, beginning at: +http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html + +//===---------------------------------------------------------------------===// + +Make use of the "rbit" instruction. + +//===---------------------------------------------------------------------===// + +Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how +to licm and cse the unnecessary load from cp#1. + +//===---------------------------------------------------------------------===// + +The CMN instruction sets the flags like an ADD instruction, while CMP sets +them like a subtract. Therefore to be able to use CMN for comparisons other +than the Z bit, we'll need additional logic to reverse the conditionals +associated with the comparison. Perhaps a pseudo-instruction for the comparison, +with a post-codegen pass to clean up and handle the condition codes? +See PR5694 for testcase. + +//===---------------------------------------------------------------------===// + +Given the following on armv5: +int test1(int A, int B) { + return (A&-8388481)|(B&8388480); +} + +We currently generate: + ldr r2, .LCPI0_0 + and r0, r0, r2 + ldr r2, .LCPI0_1 + and r1, r1, r2 + orr r0, r1, r0 + bx lr + +We should be able to replace the second ldr+and with a bic (i.e. reuse the +constant which was already loaded). Not sure what's necessary to do that. + +//===---------------------------------------------------------------------===// + +The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal: + +int a(int x) { return __builtin_bswap32(x); } + +a: + mov r1, #255, 24 + mov r2, #255, 16 + and r1, r1, r0, lsr #8 + and r2, r2, r0, lsl #8 + orr r1, r1, r0, lsr #24 + orr r0, r2, r0, lsl #24 + orr r0, r0, r1 + bx lr + +Something like the following would be better (fewer instructions/registers): + eor r1, r0, r0, ror #16 + bic r1, r1, #0xff0000 + mov r1, r1, lsr #8 + eor r0, r1, r0, ror #8 + bx lr + +A custom Thumb version would also be a slight improvement over the generic +version. + +//===---------------------------------------------------------------------===// + +Consider the following simple C code: + +void foo(unsigned char *a, unsigned char *b, int *c) { + if ((*a | *b) == 0) *c = 0; +} + +currently llvm-gcc generates something like this (nice branchless code I'd say): + + ldrb r0, [r0] + ldrb r1, [r1] + orr r0, r1, r0 + tst r0, #255 + moveq r0, #0 + streq r0, [r2] + bx lr + +Note that both "tst" and "moveq" are redundant. + +//===---------------------------------------------------------------------===// + +When loading immediate constants with movt/movw, if there are multiple +constants needed with the same low 16 bits, and those values are not live at +the same time, it would be possible to use a single movw instruction, followed +by multiple movt instructions to rewrite the high bits to different values. 
+For example: + + volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, + !tbaa +!0 + volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, + !tbaa +!0 + +is compiled and optimized to: + + movw r0, #32796 + mov.w r1, #-1 + movt r0, #20480 + str r1, [r0] + movw r0, #32796 @ <= this MOVW is not needed, value is there already + movt r0, #20482 + str r1, [r0] + +//===---------------------------------------------------------------------===// + +Improve codegen for select's: +if (x != 0) x = 1 +if (x == 1) x = 1 + +ARM codegen used to look like this: + mov r1, r0 + cmp r1, #1 + mov r0, #0 + moveq r0, #1 + +The naive lowering select between two different values. It should recognize the +test is equality test so it's more a conditional move rather than a select: + cmp r0, #1 + movne r0, #0 + +Currently this is a ARM specific dag combine. We probably should make it into a +target-neutral one. + +//===---------------------------------------------------------------------===// + +Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins +are specified to be undefined at zero, so portable code must check for zero +and handle it as a special case. That is unnecessary on ARM where those +operations are implemented in a way that is well-defined for zero. For +example: + +int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; } + +should just be implemented with a CLZ instruction. Since there are other +targets, e.g., PPC, that share this behavior, it would be best to implement +this in a target-independent way: we should probably fold that (when using +"undefined at zero" semantics) to set the "defined at zero" bit and have +the code generator expand out the right code. + +//===---------------------------------------------------------------------===// + +Clean up the test/MC/ARM files to have more robust register choices. + +R0 should not be used as a register operand in the assembler tests as it's then +not possible to distinguish between a correct encoding and a missing operand +encoding, as zero is the default value for the binary encoder. +e.g., + add r0, r0 // bad + add r3, r5 // good + +Register operands should be distinct. That is, when the encoding does not +require two syntactical operands to refer to the same register, two different +registers should be used in the test so as to catch errors where the +operands are swapped in the encoding. +e.g., + subs.w r1, r1, r1 // bad + subs.w r1, r2, r3 // good + |