author | heretic <heretic@yandex-team.ru> | 2022-02-10 16:45:43 +0300
committer | Daniil Cherednik <dcherednik@yandex-team.ru> | 2022-02-10 16:45:43 +0300
commit | 397cbe258b9e064f49c4ca575279f02f39fef76e (patch)
tree | a0b0eb3cca6a14e4e8ea715393637672fa651284 /contrib/libs/llvm12/lib/Target/ARM/README.txt
parent | 43f5a35593ebc9f6bcea619bb170394ea7ae468e (diff)
download | ydb-397cbe258b9e064f49c4ca575279f02f39fef76e.tar.gz
Restoring authorship annotation for <heretic@yandex-team.ru>. Commit 1 of 2.
Diffstat (limited to 'contrib/libs/llvm12/lib/Target/ARM/README.txt')
-rw-r--r-- | contrib/libs/llvm12/lib/Target/ARM/README.txt | 1464
1 file changed, 732 insertions(+), 732 deletions(-)
diff --git a/contrib/libs/llvm12/lib/Target/ARM/README.txt b/contrib/libs/llvm12/lib/Target/ARM/README.txt index def67cfae7..1a93bc7bb7 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/README.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/README.txt @@ -1,732 +1,732 @@ -//===---------------------------------------------------------------------===// -// Random ideas for the ARM backend. -//===---------------------------------------------------------------------===// - -Reimplement 'select' in terms of 'SEL'. - -* We would really like to support UXTAB16, but we need to prove that the - add doesn't need to overflow between the two 16-bit chunks. - -* Implement pre/post increment support. (e.g. PR935) -* Implement smarter constant generation for binops with large immediates. - -A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX - -Interesting optimization for PIC codegen on arm-linux: -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129 - -//===---------------------------------------------------------------------===// - -Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the -time regalloc happens, these values are now in a 32-bit register, usually with -the top-bits known to be sign or zero extended. If spilled, we should be able -to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part -of the reload. - -Doing this reduces the size of the stack frame (important for thumb etc), and -also increases the likelihood that we will be able to reload multiple values -from the stack with a single load. - -//===---------------------------------------------------------------------===// - -The constant island pass is in good shape. Some cleanups might be desirable, -but there is unlikely to be much improvement in the generated code. - -1. There may be some advantage to trying to be smarter about the initial -placement, rather than putting everything at the end. - -2. There might be some compile-time efficiency to be had by representing -consecutive islands as a single block rather than multiple blocks. - -3. Use a priority queue to sort constant pool users in inverse order of - position so we always process the one closed to the end of functions - first. This may simply CreateNewWater. - -//===---------------------------------------------------------------------===// - -Eliminate copysign custom expansion. We are still generating crappy code with -default expansion + if-conversion. - -//===---------------------------------------------------------------------===// - -Eliminate one instruction from: - -define i32 @_Z6slow4bii(i32 %x, i32 %y) { - %tmp = icmp sgt i32 %x, %y - %retval = select i1 %tmp, i32 %x, i32 %y - ret i32 %retval -} - -__Z6slow4bii: - cmp r0, r1 - movgt r1, r0 - mov r0, r1 - bx lr -=> - -__Z6slow4bii: - cmp r0, r1 - movle r0, r1 - bx lr - -//===---------------------------------------------------------------------===// - -Implement long long "X-3" with instructions that fold the immediate in. These -were disabled due to badness with the ARM carry flag on subtracts. - -//===---------------------------------------------------------------------===// - -More load / store optimizations: -1) Better representation for block transfer? This is from Olden/power: - - fldd d0, [r4] - fstd d0, [r4, #+32] - fldd d0, [r4, #+8] - fstd d0, [r4, #+40] - fldd d0, [r4, #+16] - fstd d0, [r4, #+48] - fldd d0, [r4, #+24] - fstd d0, [r4, #+56] - -If we can spare the registers, it would be better to use fldm and fstm here. 
-Need major register allocator enhancement though. - -2) Can we recognize the relative position of constantpool entries? i.e. Treat - - ldr r0, LCPI17_3 - ldr r1, LCPI17_4 - ldr r2, LCPI17_5 - - as - ldr r0, LCPI17 - ldr r1, LCPI17+4 - ldr r2, LCPI17+8 - - Then the ldr's can be combined into a single ldm. See Olden/power. - -Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a -double 64-bit FP constant: - - adr r0, L6 - ldmia r0, {r0-r1} - - .align 2 -L6: - .long -858993459 - .long 1074318540 - -3) struct copies appear to be done field by field -instead of by words, at least sometimes: - -struct foo { int x; short s; char c1; char c2; }; -void cpy(struct foo*a, struct foo*b) { *a = *b; } - -llvm code (-O2) - ldrb r3, [r1, #+6] - ldr r2, [r1] - ldrb r12, [r1, #+7] - ldrh r1, [r1, #+4] - str r2, [r0] - strh r1, [r0, #+4] - strb r3, [r0, #+6] - strb r12, [r0, #+7] -gcc code (-O2) - ldmia r1, {r1-r2} - stmia r0, {r1-r2} - -In this benchmark poor handling of aggregate copies has shown up as -having a large effect on size, and possibly speed as well (we don't have -a good way to measure on ARM). - -//===---------------------------------------------------------------------===// - -* Consider this silly example: - -double bar(double x) { - double r = foo(3.1); - return x+r; -} - -_bar: - stmfd sp!, {r4, r5, r7, lr} - add r7, sp, #8 - mov r4, r0 - mov r5, r1 - fldd d0, LCPI1_0 - fmrrd r0, r1, d0 - bl _foo - fmdrr d0, r4, r5 - fmsr s2, r0 - fsitod d1, s2 - faddd d0, d1, d0 - fmrrd r0, r1, d0 - ldmfd sp!, {r4, r5, r7, pc} - -Ignore the prologue and epilogue stuff for a second. Note - mov r4, r0 - mov r5, r1 -the copys to callee-save registers and the fact they are only being used by the -fmdrr instruction. It would have been better had the fmdrr been scheduled -before the call and place the result in a callee-save DPR register. The two -mov ops would not have been necessary. - -//===---------------------------------------------------------------------===// - -Calling convention related stuff: - -* gcc's parameter passing implementation is terrible and we suffer as a result: - -e.g. -struct s { - double d1; - int s1; -}; - -void foo(struct s S) { - printf("%g, %d\n", S.d1, S.s1); -} - -'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and -then reload them to r1, r2, and r3 before issuing the call (r0 contains the -address of the format string): - - stmfd sp!, {r7, lr} - add r7, sp, #0 - sub sp, sp, #12 - stmia sp, {r0, r1, r2} - ldmia sp, {r1-r2} - ldr r0, L5 - ldr r3, [sp, #8] -L2: - add r0, pc, r0 - bl L_printf$stub - -Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves? - -* Return an aggregate type is even worse: - -e.g. -struct s foo(void) { - struct s S = {1.1, 2}; - return S; -} - - mov ip, r0 - ldr r0, L5 - sub sp, sp, #12 -L2: - add r0, pc, r0 - @ lr needed for prologue - ldmia r0, {r0, r1, r2} - stmia sp, {r0, r1, r2} - stmia ip, {r0, r1, r2} - mov r0, ip - add sp, sp, #12 - bx lr - -r0 (and later ip) is the hidden parameter from caller to store the value in. The -first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1, -r2 into the address passed in. However, there is one additional stmia that -stores r0, r1, and r2 to some stack location. The store is dead. 
- -The llvm-gcc generated code looks like this: - -csretcc void %foo(%struct.s* %agg.result) { -entry: - %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1] - %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1] - cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2] - call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 ) - cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2] - call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 ) - cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1] - call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 ) - ret void -} - -llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from -constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated -into a number of load and stores, or 2) custom lower memcpy (of small size) to -be ldmia / stmia. I think option 2 is better but the current register -allocator cannot allocate a chunk of registers at a time. - -A feasible temporary solution is to use specific physical registers at the -lowering time for small (<= 4 words?) transfer size. - -* ARM CSRet calling convention requires the hidden argument to be returned by -the callee. - -//===---------------------------------------------------------------------===// - -We can definitely do a better job on BB placements to eliminate some branches. -It's very common to see llvm generated assembly code that looks like this: - -LBB3: - ... -LBB4: -... - beq LBB3 - b LBB2 - -If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can -then eliminate beq and turn the unconditional branch to LBB2 to a bne. - -See McCat/18-imp/ComputeBoundingBoxes for an example. - -//===---------------------------------------------------------------------===// - -Pre-/post- indexed load / stores: - -1) We should not make the pre/post- indexed load/store transform if the base ptr -is guaranteed to be live beyond the load/store. This can happen if the base -ptr is live out of the block we are performing the optimization. e.g. - -mov r1, r2 -ldr r3, [r1], #4 -... - -vs. - -ldr r3, [r2] -add r1, r2, #4 -... - -In most cases, this is just a wasted optimization. However, sometimes it can -negatively impact the performance because two-address code is more restrictive -when it comes to scheduling. - -Unfortunately, liveout information is currently unavailable during DAG combine -time. - -2) Consider spliting a indexed load / store into a pair of add/sub + load/store - to solve #1 (in TwoAddressInstructionPass.cpp). - -3) Enhance LSR to generate more opportunities for indexed ops. - -4) Once we added support for multiple result patterns, write indexed loads - patterns instead of C++ instruction selection code. - -5) Use VLDM / VSTM to emulate indexed FP load / store. - -//===---------------------------------------------------------------------===// - -Implement support for some more tricky ways to materialize immediates. For -example, to get 0xffff8000, we can use: - -mov r9, #&3f8000 -sub r9, r9, #&400000 - -//===---------------------------------------------------------------------===// - -We sometimes generate multiple add / sub instructions to update sp in prologue -and epilogue if the inc / dec value is too large to fit in a single immediate -operand. In some cases, perhaps it might be better to load the value from a -constantpool instead. 
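As a rough illustration of why constants such as 0xffff8000 need tricks like the
mov/sub pair above (or a constantpool load for large sp adjustments): an ARM
data-processing immediate must be an 8-bit value rotated right by an even
amount. A minimal checker for that property might look like the sketch below;
the helper name is made up for illustration and is not an existing LLVM API.

    #include <stdint.h>

    /* Return 1 if v is encodable as an ARM "rotated 8-bit" immediate,
       i.e. an 8-bit value rotated right by an even amount. */
    static int is_rotated_imm8(uint32_t v) {
      for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotate left by rot bits, undoing a ROR-by-rot encoding. */
        uint32_t r = rot ? ((v << rot) | (v >> (32u - rot))) : v;
        if (r <= 0xffu)
          return 1;
      }
      return 0;
    }

Under this test 0x003f8000 and 0x00400000 both qualify while 0xffff8000 does
not, which is exactly why a two-instruction sequence (or a load) is needed.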
- -//===---------------------------------------------------------------------===// - -GCC generates significantly better code for this function. - -int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) { - int i = 0; - - if (StackPtr != 0) { - while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768))) - Line[i++] = Stack[--StackPtr]; - if (LineLen > 32768) - { - while (StackPtr != 0 && i < LineLen) - { - i++; - --StackPtr; - } - } - } - return StackPtr; -} - -//===---------------------------------------------------------------------===// - -This should compile to the mlas instruction: -int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; } - -//===---------------------------------------------------------------------===// - -At some point, we should triage these to see if they still apply to us: - -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016 - -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982 - -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702 -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663 - -http://www.inf.u-szeged.hu/gcc-arm/ -http://citeseer.ist.psu.edu/debus04linktime.html - -//===---------------------------------------------------------------------===// - -gcc generates smaller code for this function at -O2 or -Os: - -void foo(signed char* p) { - if (*p == 3) - bar(); - else if (*p == 4) - baz(); - else if (*p == 5) - quux(); -} - -llvm decides it's a good idea to turn the repeated if...else into a -binary tree, as if it were a switch; the resulting code requires -1 -compare-and-branches when *p<=2 or *p==5, the same number if *p==4 -or *p>6, and +1 if *p==3. So it should be a speed win -(on balance). However, the revised code is larger, with 4 conditional -branches instead of 3. - -More seriously, there is a byte->word extend before -each comparison, where there should be only one, and the condition codes -are not remembered when the same two values are compared twice. - -//===---------------------------------------------------------------------===// - -More LSR enhancements possible: - -1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged - in a load / store. -2. Allow iv reuse even when a type conversion is required. For example, i8 - and i32 load / store addressing modes are identical. - - -//===---------------------------------------------------------------------===// - -This: - -int foo(int a, int b, int c, int d) { - long long acc = (long long)a * (long long)b; - acc += (long long)c * (long long)d; - return (int)(acc >> 32); -} - -Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies -two signed 32-bit values to produce a 64-bit value, and accumulates this with -a 64-bit value. 
- -We currently get this with both v4 and v6: - -_foo: - smull r1, r0, r1, r0 - smull r3, r2, r3, r2 - adds r3, r3, r1 - adc r0, r2, r0 - bx lr - -//===---------------------------------------------------------------------===// - -This: - #include <algorithm> - std::pair<unsigned, bool> full_add(unsigned a, unsigned b) - { return std::make_pair(a + b, a + b < a); } - bool no_overflow(unsigned a, unsigned b) - { return !full_add(a, b).second; } - -Should compile to: - -_Z8full_addjj: - adds r2, r1, r2 - movcc r1, #0 - movcs r1, #1 - str r2, [r0, #0] - strb r1, [r0, #4] - mov pc, lr - -_Z11no_overflowjj: - cmn r0, r1 - movcs r0, #0 - movcc r0, #1 - mov pc, lr - -not: - -__Z8full_addjj: - add r3, r2, r1 - str r3, [r0] - mov r2, #1 - mov r12, #0 - cmp r3, r1 - movlo r12, r2 - str r12, [r0, #+4] - bx lr -__Z11no_overflowjj: - add r3, r1, r0 - mov r2, #1 - mov r1, #0 - cmp r3, r0 - movhs r1, r2 - mov r0, r1 - bx lr - -//===---------------------------------------------------------------------===// - -Some of the NEON intrinsics may be appropriate for more general use, either -as target-independent intrinsics or perhaps elsewhere in the ARM backend. -Some of them may also be lowered to target-independent SDNodes, and perhaps -some new SDNodes could be added. - -For example, maximum, minimum, and absolute value operations are well-defined -and standard operations, both for vector and scalar types. - -The current NEON-specific intrinsics for count leading zeros and count one -bits could perhaps be replaced by the target-independent ctlz and ctpop -intrinsics. It may also make sense to add a target-independent "ctls" -intrinsic for "count leading sign bits". Likewise, the backend could use -the target-independent SDNodes for these operations. - -ARMv6 has scalar saturating and halving adds and subtracts. The same -intrinsics could possibly be used for both NEON's vector implementations of -those operations and the ARMv6 scalar versions. - -//===---------------------------------------------------------------------===// - -Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting -LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g. -ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg) -while ARMConstantIslandPass only need to worry about LDR (literal). - -//===---------------------------------------------------------------------===// - -Constant island pass should make use of full range SoImm values for LEApcrel. -Be careful though as the last attempt caused infinite looping on lencod. - -//===---------------------------------------------------------------------===// - -Predication issue. This function: - -extern unsigned array[ 128 ]; -int foo( int x ) { - int y; - y = array[ x & 127 ]; - if ( x & 128 ) - y = 123456789 & ( y >> 2 ); - else - y = 123456789 & y; - return y; -} - -compiles to: - -_foo: - and r1, r0, #127 - ldr r2, LCPI1_0 - ldr r2, [r2] - ldr r1, [r2, +r1, lsl #2] - mov r2, r1, lsr #2 - tst r0, #128 - moveq r2, r1 - ldr r0, LCPI1_1 - and r0, r2, r0 - bx lr - -It would be better to do something like this, to fold the shift into the -conditional move: - - and r1, r0, #127 - ldr r2, LCPI1_0 - ldr r2, [r2] - ldr r1, [r2, +r1, lsl #2] - tst r0, #128 - movne r1, r1, lsr #2 - ldr r0, LCPI1_1 - and r0, r1, r0 - bx lr - -it saves an instruction and a register. 
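At the source level the fold is legal because the 123456789 mask distributes
over the select, so the shift is the only thing that actually depends on the
condition. A C sketch of the rewritten form (illustrative only; it mirrors the
function above but uses an unsigned temporary):

    extern unsigned array[128];

    int foo_rewritten(int x) {
      unsigned y = array[x & 127];
      if (x & 128)                    /* becomes the predicated shift:
                                         movne r1, r1, lsr #2            */
        y >>= 2;
      return (int)(y & 123456789u);   /* one AND shared by both arms     */
    }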
- -//===---------------------------------------------------------------------===// - -It might be profitable to cse MOVi16 if there are lots of 32-bit immediates -with the same bottom half. - -//===---------------------------------------------------------------------===// - -Robert Muth started working on an alternate jump table implementation that -does not put the tables in-line in the text. This is more like the llvm -default jump table implementation. This might be useful sometime. Several -revisions of patches are on the mailing list, beginning at: -http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html - -//===---------------------------------------------------------------------===// - -Make use of the "rbit" instruction. - -//===---------------------------------------------------------------------===// - -Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how -to licm and cse the unnecessary load from cp#1. - -//===---------------------------------------------------------------------===// - -The CMN instruction sets the flags like an ADD instruction, while CMP sets -them like a subtract. Therefore to be able to use CMN for comparisons other -than the Z bit, we'll need additional logic to reverse the conditionals -associated with the comparison. Perhaps a pseudo-instruction for the comparison, -with a post-codegen pass to clean up and handle the condition codes? -See PR5694 for testcase. - -//===---------------------------------------------------------------------===// - -Given the following on armv5: -int test1(int A, int B) { - return (A&-8388481)|(B&8388480); -} - -We currently generate: - ldr r2, .LCPI0_0 - and r0, r0, r2 - ldr r2, .LCPI0_1 - and r1, r1, r2 - orr r0, r1, r0 - bx lr - -We should be able to replace the second ldr+and with a bic (i.e. reuse the -constant which was already loaded). Not sure what's necessary to do that. - -//===---------------------------------------------------------------------===// - -The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal: - -int a(int x) { return __builtin_bswap32(x); } - -a: - mov r1, #255, 24 - mov r2, #255, 16 - and r1, r1, r0, lsr #8 - and r2, r2, r0, lsl #8 - orr r1, r1, r0, lsr #24 - orr r0, r2, r0, lsl #24 - orr r0, r0, r1 - bx lr - -Something like the following would be better (fewer instructions/registers): - eor r1, r0, r0, ror #16 - bic r1, r1, #0xff0000 - mov r1, r1, lsr #8 - eor r0, r1, r0, ror #8 - bx lr - -A custom Thumb version would also be a slight improvement over the generic -version. - -//===---------------------------------------------------------------------===// - -Consider the following simple C code: - -void foo(unsigned char *a, unsigned char *b, int *c) { - if ((*a | *b) == 0) *c = 0; -} - -currently llvm-gcc generates something like this (nice branchless code I'd say): - - ldrb r0, [r0] - ldrb r1, [r1] - orr r0, r1, r0 - tst r0, #255 - moveq r0, #0 - streq r0, [r2] - bx lr - -Note that both "tst" and "moveq" are redundant. - -//===---------------------------------------------------------------------===// - -When loading immediate constants with movt/movw, if there are multiple -constants needed with the same low 16 bits, and those values are not live at -the same time, it would be possible to use a single movw instruction, followed -by multiple movt instructions to rewrite the high bits to different values. 
-For example: - - volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, - !tbaa -!0 - volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, - !tbaa -!0 - -is compiled and optimized to: - - movw r0, #32796 - mov.w r1, #-1 - movt r0, #20480 - str r1, [r0] - movw r0, #32796 @ <= this MOVW is not needed, value is there already - movt r0, #20482 - str r1, [r0] - -//===---------------------------------------------------------------------===// - -Improve codegen for select's: -if (x != 0) x = 1 -if (x == 1) x = 1 - -ARM codegen used to look like this: - mov r1, r0 - cmp r1, #1 - mov r0, #0 - moveq r0, #1 - -The naive lowering select between two different values. It should recognize the -test is equality test so it's more a conditional move rather than a select: - cmp r0, #1 - movne r0, #0 - -Currently this is a ARM specific dag combine. We probably should make it into a -target-neutral one. - -//===---------------------------------------------------------------------===// - -Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins -are specified to be undefined at zero, so portable code must check for zero -and handle it as a special case. That is unnecessary on ARM where those -operations are implemented in a way that is well-defined for zero. For -example: - -int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; } - -should just be implemented with a CLZ instruction. Since there are other -targets, e.g., PPC, that share this behavior, it would be best to implement -this in a target-independent way: we should probably fold that (when using -"undefined at zero" semantics) to set the "defined at zero" bit and have -the code generator expand out the right code. - -//===---------------------------------------------------------------------===// - -Clean up the test/MC/ARM files to have more robust register choices. - -R0 should not be used as a register operand in the assembler tests as it's then -not possible to distinguish between a correct encoding and a missing operand -encoding, as zero is the default value for the binary encoder. -e.g., - add r0, r0 // bad - add r3, r5 // good - -Register operands should be distinct. That is, when the encoding does not -require two syntactical operands to refer to the same register, two different -registers should be used in the test so as to catch errors where the -operands are swapped in the encoding. -e.g., - subs.w r1, r1, r1 // bad - subs.w r1, r2, r3 // good - +//===---------------------------------------------------------------------===// +// Random ideas for the ARM backend. +//===---------------------------------------------------------------------===// + +Reimplement 'select' in terms of 'SEL'. + +* We would really like to support UXTAB16, but we need to prove that the + add doesn't need to overflow between the two 16-bit chunks. + +* Implement pre/post increment support. (e.g. PR935) +* Implement smarter constant generation for binops with large immediates. + +A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX + +Interesting optimization for PIC codegen on arm-linux: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129 + +//===---------------------------------------------------------------------===// + +Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the +time regalloc happens, these values are now in a 32-bit register, usually with +the top-bits known to be sign or zero extended. 
If spilled, we should be able +to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part +of the reload. + +Doing this reduces the size of the stack frame (important for thumb etc), and +also increases the likelihood that we will be able to reload multiple values +from the stack with a single load. + +//===---------------------------------------------------------------------===// + +The constant island pass is in good shape. Some cleanups might be desirable, +but there is unlikely to be much improvement in the generated code. + +1. There may be some advantage to trying to be smarter about the initial +placement, rather than putting everything at the end. + +2. There might be some compile-time efficiency to be had by representing +consecutive islands as a single block rather than multiple blocks. + +3. Use a priority queue to sort constant pool users in inverse order of + position so we always process the one closed to the end of functions + first. This may simply CreateNewWater. + +//===---------------------------------------------------------------------===// + +Eliminate copysign custom expansion. We are still generating crappy code with +default expansion + if-conversion. + +//===---------------------------------------------------------------------===// + +Eliminate one instruction from: + +define i32 @_Z6slow4bii(i32 %x, i32 %y) { + %tmp = icmp sgt i32 %x, %y + %retval = select i1 %tmp, i32 %x, i32 %y + ret i32 %retval +} + +__Z6slow4bii: + cmp r0, r1 + movgt r1, r0 + mov r0, r1 + bx lr +=> + +__Z6slow4bii: + cmp r0, r1 + movle r0, r1 + bx lr + +//===---------------------------------------------------------------------===// + +Implement long long "X-3" with instructions that fold the immediate in. These +were disabled due to badness with the ARM carry flag on subtracts. + +//===---------------------------------------------------------------------===// + +More load / store optimizations: +1) Better representation for block transfer? This is from Olden/power: + + fldd d0, [r4] + fstd d0, [r4, #+32] + fldd d0, [r4, #+8] + fstd d0, [r4, #+40] + fldd d0, [r4, #+16] + fstd d0, [r4, #+48] + fldd d0, [r4, #+24] + fstd d0, [r4, #+56] + +If we can spare the registers, it would be better to use fldm and fstm here. +Need major register allocator enhancement though. + +2) Can we recognize the relative position of constantpool entries? i.e. Treat + + ldr r0, LCPI17_3 + ldr r1, LCPI17_4 + ldr r2, LCPI17_5 + + as + ldr r0, LCPI17 + ldr r1, LCPI17+4 + ldr r2, LCPI17+8 + + Then the ldr's can be combined into a single ldm. See Olden/power. + +Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a +double 64-bit FP constant: + + adr r0, L6 + ldmia r0, {r0-r1} + + .align 2 +L6: + .long -858993459 + .long 1074318540 + +3) struct copies appear to be done field by field +instead of by words, at least sometimes: + +struct foo { int x; short s; char c1; char c2; }; +void cpy(struct foo*a, struct foo*b) { *a = *b; } + +llvm code (-O2) + ldrb r3, [r1, #+6] + ldr r2, [r1] + ldrb r12, [r1, #+7] + ldrh r1, [r1, #+4] + str r2, [r0] + strh r1, [r0, #+4] + strb r3, [r0, #+6] + strb r12, [r0, #+7] +gcc code (-O2) + ldmia r1, {r1-r2} + stmia r0, {r1-r2} + +In this benchmark poor handling of aggregate copies has shown up as +having a large effect on size, and possibly speed as well (we don't have +a good way to measure on ARM). 
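For comparison, the ldmia/stmia form gcc produces corresponds to treating the
assignment as one aligned 8-byte block move rather than four field accesses.
The sketch below only illustrates that shape at the source level; it is not a
proposed source change, and the function name is made up:

    #include <string.h>

    struct foo { int x; short s; char c1; char c2; };  /* 8 bytes, 4-byte aligned */

    void cpy_block(struct foo *a, struct foo *b) {
      /* A fixed-size, word-aligned copy is what the two-instruction
         ldmia r1, {r1-r2} / stmia r0, {r1-r2} sequence implements. */
      memcpy(a, b, sizeof *a);
    }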
+ +//===---------------------------------------------------------------------===// + +* Consider this silly example: + +double bar(double x) { + double r = foo(3.1); + return x+r; +} + +_bar: + stmfd sp!, {r4, r5, r7, lr} + add r7, sp, #8 + mov r4, r0 + mov r5, r1 + fldd d0, LCPI1_0 + fmrrd r0, r1, d0 + bl _foo + fmdrr d0, r4, r5 + fmsr s2, r0 + fsitod d1, s2 + faddd d0, d1, d0 + fmrrd r0, r1, d0 + ldmfd sp!, {r4, r5, r7, pc} + +Ignore the prologue and epilogue stuff for a second. Note + mov r4, r0 + mov r5, r1 +the copys to callee-save registers and the fact they are only being used by the +fmdrr instruction. It would have been better had the fmdrr been scheduled +before the call and place the result in a callee-save DPR register. The two +mov ops would not have been necessary. + +//===---------------------------------------------------------------------===// + +Calling convention related stuff: + +* gcc's parameter passing implementation is terrible and we suffer as a result: + +e.g. +struct s { + double d1; + int s1; +}; + +void foo(struct s S) { + printf("%g, %d\n", S.d1, S.s1); +} + +'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and +then reload them to r1, r2, and r3 before issuing the call (r0 contains the +address of the format string): + + stmfd sp!, {r7, lr} + add r7, sp, #0 + sub sp, sp, #12 + stmia sp, {r0, r1, r2} + ldmia sp, {r1-r2} + ldr r0, L5 + ldr r3, [sp, #8] +L2: + add r0, pc, r0 + bl L_printf$stub + +Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves? + +* Return an aggregate type is even worse: + +e.g. +struct s foo(void) { + struct s S = {1.1, 2}; + return S; +} + + mov ip, r0 + ldr r0, L5 + sub sp, sp, #12 +L2: + add r0, pc, r0 + @ lr needed for prologue + ldmia r0, {r0, r1, r2} + stmia sp, {r0, r1, r2} + stmia ip, {r0, r1, r2} + mov r0, ip + add sp, sp, #12 + bx lr + +r0 (and later ip) is the hidden parameter from caller to store the value in. The +first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1, +r2 into the address passed in. However, there is one additional stmia that +stores r0, r1, and r2 to some stack location. The store is dead. + +The llvm-gcc generated code looks like this: + +csretcc void %foo(%struct.s* %agg.result) { +entry: + %S = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1] + %memtmp = alloca %struct.s ; <%struct.s*> [#uses=1] + cast %struct.s* %S to sbyte* ; <sbyte*>:0 [#uses=2] + call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 ) + cast %struct.s* %agg.result to sbyte* ; <sbyte*>:1 [#uses=2] + call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 ) + cast %struct.s* %memtmp to sbyte* ; <sbyte*>:2 [#uses=1] + call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 ) + ret void +} + +llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from +constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated +into a number of load and stores, or 2) custom lower memcpy (of small size) to +be ldmia / stmia. I think option 2 is better but the current register +allocator cannot allocate a chunk of registers at a time. + +A feasible temporary solution is to use specific physical registers at the +lowering time for small (<= 4 words?) transfer size. + +* ARM CSRet calling convention requires the hidden argument to be returned by +the callee. 
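To make the hidden-argument mechanics concrete, the aggregate return above is
roughly equivalent to the hand-lowered C below. This is a sketch of the
convention only; the function name is illustrative and the code is not
compiler output:

    struct s { double d1; int s1; };

    /* 'struct s foo(void)' under the CSRet convention: the caller passes a
       pointer to the result slot in r0, and the callee both fills it in and
       hands the same pointer back. */
    struct s *foo_lowered(struct s *hidden_result) {
      hidden_result->d1 = 1.1;
      hidden_result->s1 = 2;
      return hidden_result;
    }

The dead stmia noted above is a spurious extra copy of this result into a
stack temporary (the %memtmp in the llvm-gcc IR) that nothing ever reads.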
+ +//===---------------------------------------------------------------------===// + +We can definitely do a better job on BB placements to eliminate some branches. +It's very common to see llvm generated assembly code that looks like this: + +LBB3: + ... +LBB4: +... + beq LBB3 + b LBB2 + +If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can +then eliminate beq and turn the unconditional branch to LBB2 to a bne. + +See McCat/18-imp/ComputeBoundingBoxes for an example. + +//===---------------------------------------------------------------------===// + +Pre-/post- indexed load / stores: + +1) We should not make the pre/post- indexed load/store transform if the base ptr +is guaranteed to be live beyond the load/store. This can happen if the base +ptr is live out of the block we are performing the optimization. e.g. + +mov r1, r2 +ldr r3, [r1], #4 +... + +vs. + +ldr r3, [r2] +add r1, r2, #4 +... + +In most cases, this is just a wasted optimization. However, sometimes it can +negatively impact the performance because two-address code is more restrictive +when it comes to scheduling. + +Unfortunately, liveout information is currently unavailable during DAG combine +time. + +2) Consider spliting a indexed load / store into a pair of add/sub + load/store + to solve #1 (in TwoAddressInstructionPass.cpp). + +3) Enhance LSR to generate more opportunities for indexed ops. + +4) Once we added support for multiple result patterns, write indexed loads + patterns instead of C++ instruction selection code. + +5) Use VLDM / VSTM to emulate indexed FP load / store. + +//===---------------------------------------------------------------------===// + +Implement support for some more tricky ways to materialize immediates. For +example, to get 0xffff8000, we can use: + +mov r9, #&3f8000 +sub r9, r9, #&400000 + +//===---------------------------------------------------------------------===// + +We sometimes generate multiple add / sub instructions to update sp in prologue +and epilogue if the inc / dec value is too large to fit in a single immediate +operand. In some cases, perhaps it might be better to load the value from a +constantpool instead. + +//===---------------------------------------------------------------------===// + +GCC generates significantly better code for this function. + +int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) { + int i = 0; + + if (StackPtr != 0) { + while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768))) + Line[i++] = Stack[--StackPtr]; + if (LineLen > 32768) + { + while (StackPtr != 0 && i < LineLen) + { + i++; + --StackPtr; + } + } + } + return StackPtr; +} + +//===---------------------------------------------------------------------===// + +This should compile to the mlas instruction: +int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 
7 : 13; } + +//===---------------------------------------------------------------------===// + +At some point, we should triage these to see if they still apply to us: + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016 + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982 + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663 + +http://www.inf.u-szeged.hu/gcc-arm/ +http://citeseer.ist.psu.edu/debus04linktime.html + +//===---------------------------------------------------------------------===// + +gcc generates smaller code for this function at -O2 or -Os: + +void foo(signed char* p) { + if (*p == 3) + bar(); + else if (*p == 4) + baz(); + else if (*p == 5) + quux(); +} + +llvm decides it's a good idea to turn the repeated if...else into a +binary tree, as if it were a switch; the resulting code requires -1 +compare-and-branches when *p<=2 or *p==5, the same number if *p==4 +or *p>6, and +1 if *p==3. So it should be a speed win +(on balance). However, the revised code is larger, with 4 conditional +branches instead of 3. + +More seriously, there is a byte->word extend before +each comparison, where there should be only one, and the condition codes +are not remembered when the same two values are compared twice. + +//===---------------------------------------------------------------------===// + +More LSR enhancements possible: + +1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged + in a load / store. +2. Allow iv reuse even when a type conversion is required. For example, i8 + and i32 load / store addressing modes are identical. + + +//===---------------------------------------------------------------------===// + +This: + +int foo(int a, int b, int c, int d) { + long long acc = (long long)a * (long long)b; + acc += (long long)c * (long long)d; + return (int)(acc >> 32); +} + +Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies +two signed 32-bit values to produce a 64-bit value, and accumulates this with +a 64-bit value. 
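A hand-written sketch of the smlal-based sequence the note is asking for (this
is not current compiler output; register assignments follow the r0-r3 argument
registers and match the operand order of the existing smull below):

    _foo:
            smull   r1, r0, r1, r0     @ r0:r1  = (long long)a * (long long)b
            smlal   r1, r0, r3, r2     @ r0:r1 += (long long)c * (long long)d
            bx      lr                 @ acc >> 32 is the high word, already in r0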
+ +We currently get this with both v4 and v6: + +_foo: + smull r1, r0, r1, r0 + smull r3, r2, r3, r2 + adds r3, r3, r1 + adc r0, r2, r0 + bx lr + +//===---------------------------------------------------------------------===// + +This: + #include <algorithm> + std::pair<unsigned, bool> full_add(unsigned a, unsigned b) + { return std::make_pair(a + b, a + b < a); } + bool no_overflow(unsigned a, unsigned b) + { return !full_add(a, b).second; } + +Should compile to: + +_Z8full_addjj: + adds r2, r1, r2 + movcc r1, #0 + movcs r1, #1 + str r2, [r0, #0] + strb r1, [r0, #4] + mov pc, lr + +_Z11no_overflowjj: + cmn r0, r1 + movcs r0, #0 + movcc r0, #1 + mov pc, lr + +not: + +__Z8full_addjj: + add r3, r2, r1 + str r3, [r0] + mov r2, #1 + mov r12, #0 + cmp r3, r1 + movlo r12, r2 + str r12, [r0, #+4] + bx lr +__Z11no_overflowjj: + add r3, r1, r0 + mov r2, #1 + mov r1, #0 + cmp r3, r0 + movhs r1, r2 + mov r0, r1 + bx lr + +//===---------------------------------------------------------------------===// + +Some of the NEON intrinsics may be appropriate for more general use, either +as target-independent intrinsics or perhaps elsewhere in the ARM backend. +Some of them may also be lowered to target-independent SDNodes, and perhaps +some new SDNodes could be added. + +For example, maximum, minimum, and absolute value operations are well-defined +and standard operations, both for vector and scalar types. + +The current NEON-specific intrinsics for count leading zeros and count one +bits could perhaps be replaced by the target-independent ctlz and ctpop +intrinsics. It may also make sense to add a target-independent "ctls" +intrinsic for "count leading sign bits". Likewise, the backend could use +the target-independent SDNodes for these operations. + +ARMv6 has scalar saturating and halving adds and subtracts. The same +intrinsics could possibly be used for both NEON's vector implementations of +those operations and the ARMv6 scalar versions. + +//===---------------------------------------------------------------------===// + +Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting +LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g. +ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg) +while ARMConstantIslandPass only need to worry about LDR (literal). + +//===---------------------------------------------------------------------===// + +Constant island pass should make use of full range SoImm values for LEApcrel. +Be careful though as the last attempt caused infinite looping on lencod. + +//===---------------------------------------------------------------------===// + +Predication issue. This function: + +extern unsigned array[ 128 ]; +int foo( int x ) { + int y; + y = array[ x & 127 ]; + if ( x & 128 ) + y = 123456789 & ( y >> 2 ); + else + y = 123456789 & y; + return y; +} + +compiles to: + +_foo: + and r1, r0, #127 + ldr r2, LCPI1_0 + ldr r2, [r2] + ldr r1, [r2, +r1, lsl #2] + mov r2, r1, lsr #2 + tst r0, #128 + moveq r2, r1 + ldr r0, LCPI1_1 + and r0, r2, r0 + bx lr + +It would be better to do something like this, to fold the shift into the +conditional move: + + and r1, r0, #127 + ldr r2, LCPI1_0 + ldr r2, [r2] + ldr r1, [r2, +r1, lsl #2] + tst r0, #128 + movne r1, r1, lsr #2 + ldr r0, LCPI1_1 + and r0, r1, r0 + bx lr + +it saves an instruction and a register. 
+ +//===---------------------------------------------------------------------===// + +It might be profitable to cse MOVi16 if there are lots of 32-bit immediates +with the same bottom half. + +//===---------------------------------------------------------------------===// + +Robert Muth started working on an alternate jump table implementation that +does not put the tables in-line in the text. This is more like the llvm +default jump table implementation. This might be useful sometime. Several +revisions of patches are on the mailing list, beginning at: +http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html + +//===---------------------------------------------------------------------===// + +Make use of the "rbit" instruction. + +//===---------------------------------------------------------------------===// + +Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how +to licm and cse the unnecessary load from cp#1. + +//===---------------------------------------------------------------------===// + +The CMN instruction sets the flags like an ADD instruction, while CMP sets +them like a subtract. Therefore to be able to use CMN for comparisons other +than the Z bit, we'll need additional logic to reverse the conditionals +associated with the comparison. Perhaps a pseudo-instruction for the comparison, +with a post-codegen pass to clean up and handle the condition codes? +See PR5694 for testcase. + +//===---------------------------------------------------------------------===// + +Given the following on armv5: +int test1(int A, int B) { + return (A&-8388481)|(B&8388480); +} + +We currently generate: + ldr r2, .LCPI0_0 + and r0, r0, r2 + ldr r2, .LCPI0_1 + and r1, r1, r2 + orr r0, r1, r0 + bx lr + +We should be able to replace the second ldr+and with a bic (i.e. reuse the +constant which was already loaded). Not sure what's necessary to do that. + +//===---------------------------------------------------------------------===// + +The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal: + +int a(int x) { return __builtin_bswap32(x); } + +a: + mov r1, #255, 24 + mov r2, #255, 16 + and r1, r1, r0, lsr #8 + and r2, r2, r0, lsl #8 + orr r1, r1, r0, lsr #24 + orr r0, r2, r0, lsl #24 + orr r0, r0, r1 + bx lr + +Something like the following would be better (fewer instructions/registers): + eor r1, r0, r0, ror #16 + bic r1, r1, #0xff0000 + mov r1, r1, lsr #8 + eor r0, r1, r0, ror #8 + bx lr + +A custom Thumb version would also be a slight improvement over the generic +version. + +//===---------------------------------------------------------------------===// + +Consider the following simple C code: + +void foo(unsigned char *a, unsigned char *b, int *c) { + if ((*a | *b) == 0) *c = 0; +} + +currently llvm-gcc generates something like this (nice branchless code I'd say): + + ldrb r0, [r0] + ldrb r1, [r1] + orr r0, r1, r0 + tst r0, #255 + moveq r0, #0 + streq r0, [r2] + bx lr + +Note that both "tst" and "moveq" are redundant. + +//===---------------------------------------------------------------------===// + +When loading immediate constants with movt/movw, if there are multiple +constants needed with the same low 16 bits, and those values are not live at +the same time, it would be possible to use a single movw instruction, followed +by multiple movt instructions to rewrite the high bits to different values. 
+For example: + + volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, + !tbaa +!0 + volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, + !tbaa +!0 + +is compiled and optimized to: + + movw r0, #32796 + mov.w r1, #-1 + movt r0, #20480 + str r1, [r0] + movw r0, #32796 @ <= this MOVW is not needed, value is there already + movt r0, #20482 + str r1, [r0] + +//===---------------------------------------------------------------------===// + +Improve codegen for select's: +if (x != 0) x = 1 +if (x == 1) x = 1 + +ARM codegen used to look like this: + mov r1, r0 + cmp r1, #1 + mov r0, #0 + moveq r0, #1 + +The naive lowering select between two different values. It should recognize the +test is equality test so it's more a conditional move rather than a select: + cmp r0, #1 + movne r0, #0 + +Currently this is a ARM specific dag combine. We probably should make it into a +target-neutral one. + +//===---------------------------------------------------------------------===// + +Optimize unnecessary checks for zero with __builtin_clz/ctz. Those builtins +are specified to be undefined at zero, so portable code must check for zero +and handle it as a special case. That is unnecessary on ARM where those +operations are implemented in a way that is well-defined for zero. For +example: + +int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; } + +should just be implemented with a CLZ instruction. Since there are other +targets, e.g., PPC, that share this behavior, it would be best to implement +this in a target-independent way: we should probably fold that (when using +"undefined at zero" semantics) to set the "defined at zero" bit and have +the code generator expand out the right code. + +//===---------------------------------------------------------------------===// + +Clean up the test/MC/ARM files to have more robust register choices. + +R0 should not be used as a register operand in the assembler tests as it's then +not possible to distinguish between a correct encoding and a missing operand +encoding, as zero is the default value for the binary encoder. +e.g., + add r0, r0 // bad + add r3, r5 // good + +Register operands should be distinct. That is, when the encoding does not +require two syntactical operands to refer to the same register, two different +registers should be used in the test so as to catch errors where the +operands are swapped in the encoding. +e.g., + subs.w r1, r1, r1 // bad + subs.w r1, r2, r3 // good + |