| field     | value                                                                     | date                      |
|-----------|---------------------------------------------------------------------------|---------------------------|
| author    | heretic <[email protected]>                                               | 2022-02-10 16:45:46 +0300 |
| committer | Daniil Cherednik <[email protected]>                                     | 2022-02-10 16:45:46 +0300 |
| commit    | 81eddc8c0b55990194e112b02d127b87d54164a9 (patch)                          |                           |
| tree      | 9142afc54d335ea52910662635b898e79e192e49 /contrib/libs/llvm12/lib/Target  |                           |
| parent    | 397cbe258b9e064f49c4ca575279f02f39fef76e (diff)                           |                           |
Restoring authorship annotation for <[email protected]>. Commit 2 of 2.
Diffstat (limited to 'contrib/libs/llvm12/lib/Target')

72 files changed, 9107 insertions, 9107 deletions
diff --git a/contrib/libs/llvm12/lib/Target/AArch64/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/AArch64/.yandex_meta/licenses.list.txt index b0b34714ca8..ad3879fc450 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/AArch64/.yandex_meta/licenses.list.txt @@ -1,303 +1,303 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  -  -  -====================File: LICENSE.TXT====================  -==============================================================================  -The LLVM Project is under the Apache License v2.0 with LLVM Exceptions:  -==============================================================================  -  -                                 Apache License  -                           Version 2.0, January 2004  -                        http://www.apache.org/licenses/  -  -    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION  -  -    1. Definitions.  -  -      "License" shall mean the terms and conditions for use, reproduction,  -      and distribution as defined by Sections 1 through 9 of this document.  -  -      "Licensor" shall mean the copyright owner or entity authorized by  -      the copyright owner that is granting the License.  -  -      "Legal Entity" shall mean the union of the acting entity and all  -      other entities that control, are controlled by, or are under common  -      control with that entity. For the purposes of this definition,  -      "control" means (i) the power, direct or indirect, to cause the  -      direction or management of such entity, whether by contract or  -      otherwise, or (ii) ownership of fifty percent (50%) or more of the  -      outstanding shares, or (iii) beneficial ownership of such entity.  -  -      "You" (or "Your") shall mean an individual or Legal Entity  -      exercising permissions granted by this License.  -  -      "Source" form shall mean the preferred form for making modifications,  -      including but not limited to software source code, documentation  -      source, and configuration files.  -  -      "Object" form shall mean any form resulting from mechanical  -      transformation or translation of a Source form, including but  -      not limited to compiled object code, generated documentation,  -      and conversions to other media types.  -  -      "Work" shall mean the work of authorship, whether in Source or  -      Object form, made available under the License, as indicated by a  -      copyright notice that is included in or attached to the work  -      (an example is provided in the Appendix below).  -  -      "Derivative Works" shall mean any work, whether in Source or Object  -      form, that is based on (or derived from) the Work and for which the  -      editorial revisions, annotations, elaborations, or other modifications  -      represent, as a whole, an original work of authorship. For the purposes  -      of this License, Derivative Works shall not include works that remain  -      separable from, or merely link (or bind by name) to the interfaces of,  -      the Work and Derivative Works thereof.  
-  -      "Contribution" shall mean any work of authorship, including  -      the original version of the Work and any modifications or additions  -      to that Work or Derivative Works thereof, that is intentionally  -      submitted to Licensor for inclusion in the Work by the copyright owner  -      or by an individual or Legal Entity authorized to submit on behalf of  -      the copyright owner. For the purposes of this definition, "submitted"  -      means any form of electronic, verbal, or written communication sent  -      to the Licensor or its representatives, including but not limited to  -      communication on electronic mailing lists, source code control systems,  -      and issue tracking systems that are managed by, or on behalf of, the  -      Licensor for the purpose of discussing and improving the Work, but  -      excluding communication that is conspicuously marked or otherwise  -      designated in writing by the copyright owner as "Not a Contribution."  -  -      "Contributor" shall mean Licensor and any individual or Legal Entity  -      on behalf of whom a Contribution has been received by Licensor and  -      subsequently incorporated within the Work.  -  -    2. Grant of Copyright License. Subject to the terms and conditions of  -      this License, each Contributor hereby grants to You a perpetual,  -      worldwide, non-exclusive, no-charge, royalty-free, irrevocable  -      copyright license to reproduce, prepare Derivative Works of,  -      publicly display, publicly perform, sublicense, and distribute the  -      Work and such Derivative Works in Source or Object form.  -  -    3. Grant of Patent License. Subject to the terms and conditions of  -      this License, each Contributor hereby grants to You a perpetual,  -      worldwide, non-exclusive, no-charge, royalty-free, irrevocable  -      (except as stated in this section) patent license to make, have made,  -      use, offer to sell, sell, import, and otherwise transfer the Work,  -      where such license applies only to those patent claims licensable  -      by such Contributor that are necessarily infringed by their  -      Contribution(s) alone or by combination of their Contribution(s)  -      with the Work to which such Contribution(s) was submitted. If You  -      institute patent litigation against any entity (including a  -      cross-claim or counterclaim in a lawsuit) alleging that the Work  -      or a Contribution incorporated within the Work constitutes direct  -      or contributory patent infringement, then any patent licenses  -      granted to You under this License for that Work shall terminate  -      as of the date such litigation is filed.  -  -    4. Redistribution. 
You may reproduce and distribute copies of the  -      Work or Derivative Works thereof in any medium, with or without  -      modifications, and in Source or Object form, provided that You  -      meet the following conditions:  -  -      (a) You must give any other recipients of the Work or  -          Derivative Works a copy of this License; and  -  -      (b) You must cause any modified files to carry prominent notices  -          stating that You changed the files; and  -  -      (c) You must retain, in the Source form of any Derivative Works  -          that You distribute, all copyright, patent, trademark, and  -          attribution notices from the Source form of the Work,  -          excluding those notices that do not pertain to any part of  -          the Derivative Works; and  -  -      (d) If the Work includes a "NOTICE" text file as part of its  -          distribution, then any Derivative Works that You distribute must  -          include a readable copy of the attribution notices contained  -          within such NOTICE file, excluding those notices that do not  -          pertain to any part of the Derivative Works, in at least one  -          of the following places: within a NOTICE text file distributed  -          as part of the Derivative Works; within the Source form or  -          documentation, if provided along with the Derivative Works; or,  -          within a display generated by the Derivative Works, if and  -          wherever such third-party notices normally appear. The contents  -          of the NOTICE file are for informational purposes only and  -          do not modify the License. You may add Your own attribution  -          notices within Derivative Works that You distribute, alongside  -          or as an addendum to the NOTICE text from the Work, provided  -          that such additional attribution notices cannot be construed  -          as modifying the License.  -  -      You may add Your own copyright statement to Your modifications and  -      may provide additional or different license terms and conditions  -      for use, reproduction, or distribution of Your modifications, or  -      for any such Derivative Works as a whole, provided Your use,  -      reproduction, and distribution of the Work otherwise complies with  -      the conditions stated in this License.  -  -    5. Submission of Contributions. Unless You explicitly state otherwise,  -      any Contribution intentionally submitted for inclusion in the Work  -      by You to the Licensor shall be under the terms and conditions of  -      this License, without any additional terms or conditions.  -      Notwithstanding the above, nothing herein shall supersede or modify  -      the terms of any separate license agreement you may have executed  -      with Licensor regarding such Contributions.  -  -    6. Trademarks. This License does not grant permission to use the trade  -      names, trademarks, service marks, or product names of the Licensor,  -      except as required for reasonable and customary use in describing the  -      origin of the Work and reproducing the content of the NOTICE file.  -  -    7. Disclaimer of Warranty. 
Unless required by applicable law or  -      agreed to in writing, Licensor provides the Work (and each  -      Contributor provides its Contributions) on an "AS IS" BASIS,  -      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  -      implied, including, without limitation, any warranties or conditions  -      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A  -      PARTICULAR PURPOSE. You are solely responsible for determining the  -      appropriateness of using or redistributing the Work and assume any  -      risks associated with Your exercise of permissions under this License.  -  -    8. Limitation of Liability. In no event and under no legal theory,  -      whether in tort (including negligence), contract, or otherwise,  -      unless required by applicable law (such as deliberate and grossly  -      negligent acts) or agreed to in writing, shall any Contributor be  -      liable to You for damages, including any direct, indirect, special,  -      incidental, or consequential damages of any character arising as a  -      result of this License or out of the use or inability to use the  -      Work (including but not limited to damages for loss of goodwill,  -      work stoppage, computer failure or malfunction, or any and all  -      other commercial damages or losses), even if such Contributor  -      has been advised of the possibility of such damages.  -  -    9. Accepting Warranty or Additional Liability. While redistributing  -      the Work or Derivative Works thereof, You may choose to offer,  -      and charge a fee for, acceptance of support, warranty, indemnity,  -      or other liability obligations and/or rights consistent with this  -      License. However, in accepting such obligations, You may act only  -      on Your own behalf and on Your sole responsibility, not on behalf  -      of any other Contributor, and only if You agree to indemnify,  -      defend, and hold each Contributor harmless for any liability  -      incurred by, or claims asserted against, such Contributor by reason  -      of your accepting any such warranty or additional liability.  -  -    END OF TERMS AND CONDITIONS  -  -    APPENDIX: How to apply the Apache License to your work.  -  -      To apply the Apache License to your work, attach the following  -      boilerplate notice, with the fields enclosed by brackets "[]"  -      replaced with your own identifying information. (Don't include  -      the brackets!)  The text should be enclosed in the appropriate  -      comment syntax for the file format. We also recommend that a  -      file or class name and description of purpose be included on the  -      same "printed page" as the copyright notice for easier  -      identification within third-party archives.  -  -    Copyright [yyyy] [name of copyright owner]  -  -    Licensed under the Apache License, Version 2.0 (the "License");  -    you may not use this file except in compliance with the License.  -    You may obtain a copy of the License at  -  -       http://www.apache.org/licenses/LICENSE-2.0  -  -    Unless required by applicable law or agreed to in writing, software  -    distributed under the License is distributed on an "AS IS" BASIS,  -    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  -    See the License for the specific language governing permissions and  -    limitations under the License.  
-  -  ----- LLVM Exceptions to the Apache 2.0 License ----  -  -As an exception, if, as a result of your compiling your source code, portions  -of this Software are embedded into an Object form of such source code, you  -may redistribute such embedded portions in such Object form without complying  -with the conditions of Sections 4(a), 4(b) and 4(d) of the License.  -  -In addition, if you combine or link compiled forms of this Software with  -software that is licensed under the GPLv2 ("Combined Software") and if a  -court of competent jurisdiction determines that the patent provision (Section  -3), the indemnity provision (Section 9) or other Section of the License  -conflicts with the conditions of the GPLv2, you may retroactively and  -prospectively choose to deem waived or otherwise exclude such Section(s) of  -the License, but only in their entirety and only with respect to the Combined  -Software.  -  -==============================================================================  -Software from third parties included in the LLVM Project:  -==============================================================================  -The LLVM Project contains third party software which is under different license  -terms. All such code will be identified clearly using at least one of two  -mechanisms:  -1) It will be in a separate directory tree with its own `LICENSE.txt` or  -   `LICENSE` file at the top containing the specific license and restrictions  -   which apply to that software, or  -2) It will contain specific license and restriction terms at the top of every  -   file.  -  -==============================================================================  -Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy):  -==============================================================================  -University of Illinois/NCSA  -Open Source License  -  -Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign.  -All rights reserved.  -  -Developed by:  -  -    LLVM Team  -  -    University of Illinois at Urbana-Champaign  -  -    http://llvm.org  -  -Permission is hereby granted, free of charge, to any person obtaining a copy of  -this software and associated documentation files (the "Software"), to deal with  -the Software without restriction, including without limitation the rights to  -use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies  -of the Software, and to permit persons to whom the Software is furnished to do  -so, subject to the following conditions:  -  -    * Redistributions of source code must retain the above copyright notice,  -      this list of conditions and the following disclaimers.  -  -    * Redistributions in binary form must reproduce the above copyright notice,  -      this list of conditions and the following disclaimers in the  -      documentation and/or other materials provided with the distribution.  -  -    * Neither the names of the LLVM Team, University of Illinois at  -      Urbana-Champaign, nor the names of its contributors may be used to  -      endorse or promote products derived from this Software without specific  -      prior written permission.  -  -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR  -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS  -FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  
IN NO EVENT SHALL THE  -CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER  -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,  -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE  -SOFTWARE.  -  -  -  -====================File: include/llvm/Support/LICENSE.TXT====================  -LLVM System Interface Library  --------------------------------------------------------------------------------  -The LLVM System Interface Library is licensed under the Illinois Open Source  -License and has the following additional copyright:  -  -Copyright (C) 2004 eXtensible Systems, Inc.  -  -  -====================NCSA====================  -// This file is distributed under the University of Illinois Open Source  -// License. See LICENSE.TXT for details.  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + + +====================File: LICENSE.TXT==================== +============================================================================== +The LLVM Project is under the Apache License v2.0 with LLVM Exceptions: +============================================================================== + +                                 Apache License +                           Version 2.0, January 2004 +                        http://www.apache.org/licenses/ + +    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + +    1. Definitions. + +      "License" shall mean the terms and conditions for use, reproduction, +      and distribution as defined by Sections 1 through 9 of this document. + +      "Licensor" shall mean the copyright owner or entity authorized by +      the copyright owner that is granting the License. + +      "Legal Entity" shall mean the union of the acting entity and all +      other entities that control, are controlled by, or are under common +      control with that entity. For the purposes of this definition, +      "control" means (i) the power, direct or indirect, to cause the +      direction or management of such entity, whether by contract or +      otherwise, or (ii) ownership of fifty percent (50%) or more of the +      outstanding shares, or (iii) beneficial ownership of such entity. + +      "You" (or "Your") shall mean an individual or Legal Entity +      exercising permissions granted by this License. + +      "Source" form shall mean the preferred form for making modifications, +      including but not limited to software source code, documentation +      source, and configuration files. + +      "Object" form shall mean any form resulting from mechanical +      transformation or translation of a Source form, including but +      not limited to compiled object code, generated documentation, +      and conversions to other media types. + +      "Work" shall mean the work of authorship, whether in Source or +      Object form, made available under the License, as indicated by a +      copyright notice that is included in or attached to the work +      (an example is provided in the Appendix below). 
+ +      "Derivative Works" shall mean any work, whether in Source or Object +      form, that is based on (or derived from) the Work and for which the +      editorial revisions, annotations, elaborations, or other modifications +      represent, as a whole, an original work of authorship. For the purposes +      of this License, Derivative Works shall not include works that remain +      separable from, or merely link (or bind by name) to the interfaces of, +      the Work and Derivative Works thereof. + +      "Contribution" shall mean any work of authorship, including +      the original version of the Work and any modifications or additions +      to that Work or Derivative Works thereof, that is intentionally +      submitted to Licensor for inclusion in the Work by the copyright owner +      or by an individual or Legal Entity authorized to submit on behalf of +      the copyright owner. For the purposes of this definition, "submitted" +      means any form of electronic, verbal, or written communication sent +      to the Licensor or its representatives, including but not limited to +      communication on electronic mailing lists, source code control systems, +      and issue tracking systems that are managed by, or on behalf of, the +      Licensor for the purpose of discussing and improving the Work, but +      excluding communication that is conspicuously marked or otherwise +      designated in writing by the copyright owner as "Not a Contribution." + +      "Contributor" shall mean Licensor and any individual or Legal Entity +      on behalf of whom a Contribution has been received by Licensor and +      subsequently incorporated within the Work. + +    2. Grant of Copyright License. Subject to the terms and conditions of +      this License, each Contributor hereby grants to You a perpetual, +      worldwide, non-exclusive, no-charge, royalty-free, irrevocable +      copyright license to reproduce, prepare Derivative Works of, +      publicly display, publicly perform, sublicense, and distribute the +      Work and such Derivative Works in Source or Object form. + +    3. Grant of Patent License. Subject to the terms and conditions of +      this License, each Contributor hereby grants to You a perpetual, +      worldwide, non-exclusive, no-charge, royalty-free, irrevocable +      (except as stated in this section) patent license to make, have made, +      use, offer to sell, sell, import, and otherwise transfer the Work, +      where such license applies only to those patent claims licensable +      by such Contributor that are necessarily infringed by their +      Contribution(s) alone or by combination of their Contribution(s) +      with the Work to which such Contribution(s) was submitted. If You +      institute patent litigation against any entity (including a +      cross-claim or counterclaim in a lawsuit) alleging that the Work +      or a Contribution incorporated within the Work constitutes direct +      or contributory patent infringement, then any patent licenses +      granted to You under this License for that Work shall terminate +      as of the date such litigation is filed. + +    4. Redistribution. 
You may reproduce and distribute copies of the +      Work or Derivative Works thereof in any medium, with or without +      modifications, and in Source or Object form, provided that You +      meet the following conditions: + +      (a) You must give any other recipients of the Work or +          Derivative Works a copy of this License; and + +      (b) You must cause any modified files to carry prominent notices +          stating that You changed the files; and + +      (c) You must retain, in the Source form of any Derivative Works +          that You distribute, all copyright, patent, trademark, and +          attribution notices from the Source form of the Work, +          excluding those notices that do not pertain to any part of +          the Derivative Works; and + +      (d) If the Work includes a "NOTICE" text file as part of its +          distribution, then any Derivative Works that You distribute must +          include a readable copy of the attribution notices contained +          within such NOTICE file, excluding those notices that do not +          pertain to any part of the Derivative Works, in at least one +          of the following places: within a NOTICE text file distributed +          as part of the Derivative Works; within the Source form or +          documentation, if provided along with the Derivative Works; or, +          within a display generated by the Derivative Works, if and +          wherever such third-party notices normally appear. The contents +          of the NOTICE file are for informational purposes only and +          do not modify the License. You may add Your own attribution +          notices within Derivative Works that You distribute, alongside +          or as an addendum to the NOTICE text from the Work, provided +          that such additional attribution notices cannot be construed +          as modifying the License. + +      You may add Your own copyright statement to Your modifications and +      may provide additional or different license terms and conditions +      for use, reproduction, or distribution of Your modifications, or +      for any such Derivative Works as a whole, provided Your use, +      reproduction, and distribution of the Work otherwise complies with +      the conditions stated in this License. + +    5. Submission of Contributions. Unless You explicitly state otherwise, +      any Contribution intentionally submitted for inclusion in the Work +      by You to the Licensor shall be under the terms and conditions of +      this License, without any additional terms or conditions. +      Notwithstanding the above, nothing herein shall supersede or modify +      the terms of any separate license agreement you may have executed +      with Licensor regarding such Contributions. + +    6. Trademarks. This License does not grant permission to use the trade +      names, trademarks, service marks, or product names of the Licensor, +      except as required for reasonable and customary use in describing the +      origin of the Work and reproducing the content of the NOTICE file. + +    7. Disclaimer of Warranty. 
Unless required by applicable law or +      agreed to in writing, Licensor provides the Work (and each +      Contributor provides its Contributions) on an "AS IS" BASIS, +      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +      implied, including, without limitation, any warranties or conditions +      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A +      PARTICULAR PURPOSE. You are solely responsible for determining the +      appropriateness of using or redistributing the Work and assume any +      risks associated with Your exercise of permissions under this License. + +    8. Limitation of Liability. In no event and under no legal theory, +      whether in tort (including negligence), contract, or otherwise, +      unless required by applicable law (such as deliberate and grossly +      negligent acts) or agreed to in writing, shall any Contributor be +      liable to You for damages, including any direct, indirect, special, +      incidental, or consequential damages of any character arising as a +      result of this License or out of the use or inability to use the +      Work (including but not limited to damages for loss of goodwill, +      work stoppage, computer failure or malfunction, or any and all +      other commercial damages or losses), even if such Contributor +      has been advised of the possibility of such damages. + +    9. Accepting Warranty or Additional Liability. While redistributing +      the Work or Derivative Works thereof, You may choose to offer, +      and charge a fee for, acceptance of support, warranty, indemnity, +      or other liability obligations and/or rights consistent with this +      License. However, in accepting such obligations, You may act only +      on Your own behalf and on Your sole responsibility, not on behalf +      of any other Contributor, and only if You agree to indemnify, +      defend, and hold each Contributor harmless for any liability +      incurred by, or claims asserted against, such Contributor by reason +      of your accepting any such warranty or additional liability. + +    END OF TERMS AND CONDITIONS + +    APPENDIX: How to apply the Apache License to your work. + +      To apply the Apache License to your work, attach the following +      boilerplate notice, with the fields enclosed by brackets "[]" +      replaced with your own identifying information. (Don't include +      the brackets!)  The text should be enclosed in the appropriate +      comment syntax for the file format. We also recommend that a +      file or class name and description of purpose be included on the +      same "printed page" as the copyright notice for easier +      identification within third-party archives. + +    Copyright [yyyy] [name of copyright owner] + +    Licensed under the Apache License, Version 2.0 (the "License"); +    you may not use this file except in compliance with the License. +    You may obtain a copy of the License at + +       http://www.apache.org/licenses/LICENSE-2.0 + +    Unless required by applicable law or agreed to in writing, software +    distributed under the License is distributed on an "AS IS" BASIS, +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +    See the License for the specific language governing permissions and +    limitations under the License. 
+ + +---- LLVM Exceptions to the Apache 2.0 License ---- + +As an exception, if, as a result of your compiling your source code, portions +of this Software are embedded into an Object form of such source code, you +may redistribute such embedded portions in such Object form without complying +with the conditions of Sections 4(a), 4(b) and 4(d) of the License. + +In addition, if you combine or link compiled forms of this Software with +software that is licensed under the GPLv2 ("Combined Software") and if a +court of competent jurisdiction determines that the patent provision (Section +3), the indemnity provision (Section 9) or other Section of the License +conflicts with the conditions of the GPLv2, you may retroactively and +prospectively choose to deem waived or otherwise exclude such Section(s) of +the License, but only in their entirety and only with respect to the Combined +Software. + +============================================================================== +Software from third parties included in the LLVM Project: +============================================================================== +The LLVM Project contains third party software which is under different license +terms. All such code will be identified clearly using at least one of two +mechanisms: +1) It will be in a separate directory tree with its own `LICENSE.txt` or +   `LICENSE` file at the top containing the specific license and restrictions +   which apply to that software, or +2) It will contain specific license and restriction terms at the top of every +   file. + +============================================================================== +Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy): +============================================================================== +University of Illinois/NCSA +Open Source License + +Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign. +All rights reserved. + +Developed by: + +    LLVM Team + +    University of Illinois at Urbana-Champaign + +    http://llvm.org + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal with +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +    * Redistributions of source code must retain the above copyright notice, +      this list of conditions and the following disclaimers. + +    * Redistributions in binary form must reproduce the above copyright notice, +      this list of conditions and the following disclaimers in the +      documentation and/or other materials provided with the distribution. + +    * Neither the names of the LLVM Team, University of Illinois at +      Urbana-Champaign, nor the names of its contributors may be used to +      endorse or promote products derived from this Software without specific +      prior written permission. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS +FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  
IN NO EVENT SHALL THE +CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE +SOFTWARE. + + + +====================File: include/llvm/Support/LICENSE.TXT==================== +LLVM System Interface Library +------------------------------------------------------------------------------- +The LLVM System Interface Library is licensed under the Illinois Open Source +License and has the following additional copyright: + +Copyright (C) 2004 eXtensible Systems, Inc. + + +====================NCSA==================== +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. diff --git a/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/ya.make b/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/ya.make index 0434db6bfac..512f510d853 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/ya.make +++ b/contrib/libs/llvm12/lib/Target/AArch64/AsmParser/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 
+// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/ya.make b/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/ya.make index 5c499b5c3e8..096b55cd68c 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/ya.make +++ b/contrib/libs/llvm12/lib/Target/AArch64/Disassembler/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/ya.make b/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/ya.make index fee67a8c71d..18b5c7460fd 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/ya.make +++ b/contrib/libs/llvm12/lib/Target/AArch64/MCTargetDesc/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  
-  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/ya.make b/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/ya.make index 9d0c8855904..bb7d4a2c890 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/ya.make +++ b/contrib/libs/llvm12/lib/Target/AArch64/TargetInfo/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/lib/Support diff --git a/contrib/libs/llvm12/lib/Target/AArch64/Utils/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/AArch64/Utils/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/Utils/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/AArch64/Utils/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/AArch64/Utils/ya.make b/contrib/libs/llvm12/lib/Target/AArch64/Utils/ya.make index a0b39d5c954..3668c2a6509 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/Utils/ya.make +++ b/contrib/libs/llvm12/lib/Target/AArch64/Utils/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/AArch64/ya.make b/contrib/libs/llvm12/lib/Target/AArch64/ya.make index e5ef1b3dcb7..244cbc7f34f 100644 --- a/contrib/libs/llvm12/lib/Target/AArch64/ya.make +++ b/contrib/libs/llvm12/lib/Target/AArch64/ya.make @@ -2,18 +2,18 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE( +    Apache-2.0 WITH LLVM-exception AND +    NCSA +) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) -LICENSE(  -    Apache-2.0 WITH LLVM-exception AND  -    NCSA  -)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -   PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/ARM/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/ARM/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/ARM/AsmParser/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/ARM/AsmParser/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/AsmParser/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/AsmParser/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/ARM/AsmParser/ya.make b/contrib/libs/llvm12/lib/Target/ARM/AsmParser/ya.make index f5c567afbcd..572d301570d 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/AsmParser/ya.make +++ b/contrib/libs/llvm12/lib/Target/ARM/AsmParser/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/ARM/Disassembler/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/ARM/Disassembler/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/Disassembler/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/Disassembler/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/ARM/Disassembler/ya.make b/contrib/libs/llvm12/lib/Target/ARM/Disassembler/ya.make index 5e4e1b3e6ac..f8ce0c24d92 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/Disassembler/ya.make +++ b/contrib/libs/llvm12/lib/Target/ARM/Disassembler/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/ya.make b/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/ya.make index 2a6b0715e78..b92b47d0572 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/ya.make +++ b/contrib/libs/llvm12/lib/Target/ARM/MCTargetDesc/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/ARM/README-Thumb.txt b/contrib/libs/llvm12/lib/Target/ARM/README-Thumb.txt index 041f5508d7d..d9cc086da82 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/README-Thumb.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/README-Thumb.txt @@ -1,261 +1,261 @@ -//===---------------------------------------------------------------------===//  -// Random ideas for the ARM backend (Thumb specific).  -//===---------------------------------------------------------------------===//  -  -* Add support for compiling functions in both ARM and Thumb mode, then taking  -  the smallest.  -  -* Add support for compiling individual basic blocks in thumb mode, when in a   -  larger ARM function.  This can be used for presumed cold code, like paths  -  to abort (failure path of asserts), EH handling code, etc.  -  -* Thumb doesn't have normal pre/post increment addressing modes, but you can  -  load/store 32-bit integers with pre/postinc by using load/store multiple  -  instrs with a single register.  -  -* Make better use of high registers r8, r10, r11, r12 (ip). Some variants of add  -  and cmp instructions can use high registers. Also, we can use them as  -  temporaries to spill values into.  -  -* In thumb mode, short, byte, and bool preferred alignments are currently set  -  to 4 to accommodate ISA restriction (i.e. add sp, #imm, imm must be multiple  -  of 4).  -  -//===---------------------------------------------------------------------===//  -  -Potential jumptable improvements:  -  -* If we know function size is less than (1 << 16) * 2 bytes, we can use 16-bit  -  jumptable entries (e.g. (L1 - L2) >> 1). Or even smaller entries if the  -  function is even smaller. This also applies to ARM.  -  -* Thumb jumptable codegen can improve given some help from the assembler. This  -  is what we generate right now:  -  -	.set PCRELV0, (LJTI1_0_0-(LPCRELL0+4))  -LPCRELL0:  -	mov r1, #PCRELV0  -	add r1, pc  -	ldr r0, [r0, r1]  -	mov pc, r0   -	.align	2  -LJTI1_0_0:  -	.long	 LBB1_3  -        ...  -  -Note there is another pc relative add that we can take advantage of.  
-     add r1, pc, #imm_8 * 4  -  -We should be able to generate:  -  -LPCRELL0:  -	add r1, LJTI1_0_0  -	ldr r0, [r0, r1]  -	mov pc, r0   -	.align	2  -LJTI1_0_0:  -	.long	 LBB1_3  -  -if the assembler can translate the add to:  -       add r1, pc, #((LJTI1_0_0-(LPCRELL0+4))&0xfffffffc)  -  -Note the assembler also does something similar to constpool load:  -LPCRELL0:  -     ldr r0, LCPI1_0  -=>  -     ldr r0, pc, #((LCPI1_0-(LPCRELL0+4))&0xfffffffc)  -  -  -//===---------------------------------------------------------------------===//  -  -We compile the following:  -  -define i16 @func_entry_2E_ce(i32 %i) {  -        switch i32 %i, label %bb12.exitStub [  -                 i32 0, label %bb4.exitStub  -                 i32 1, label %bb9.exitStub  -                 i32 2, label %bb4.exitStub  -                 i32 3, label %bb4.exitStub  -                 i32 7, label %bb9.exitStub  -                 i32 8, label %bb.exitStub  -                 i32 9, label %bb9.exitStub  -        ]  -  -bb12.exitStub:  -        ret i16 0  -  -bb4.exitStub:  -        ret i16 1  -  -bb9.exitStub:  -        ret i16 2  -  -bb.exitStub:  -        ret i16 3  -}  -  -into:  -  -_func_entry_2E_ce:  -        mov r2, #1  -        lsl r2, r0  -        cmp r0, #9  -        bhi LBB1_4      @bb12.exitStub  -LBB1_1: @newFuncRoot  -        mov r1, #13  -        tst r2, r1  -        bne LBB1_5      @bb4.exitStub  -LBB1_2: @newFuncRoot  -        ldr r1, LCPI1_0  -        tst r2, r1  -        bne LBB1_6      @bb9.exitStub  -LBB1_3: @newFuncRoot  -        mov r1, #1  -        lsl r1, r1, #8  -        tst r2, r1  -        bne LBB1_7      @bb.exitStub  -LBB1_4: @bb12.exitStub  -        mov r0, #0  -        bx lr  -LBB1_5: @bb4.exitStub  -        mov r0, #1  -        bx lr  -LBB1_6: @bb9.exitStub  -        mov r0, #2  -        bx lr  -LBB1_7: @bb.exitStub  -        mov r0, #3  -        bx lr  -LBB1_8:  -        .align  2  -LCPI1_0:  -        .long   642  -  -  -gcc compiles to:  -  -	cmp	r0, #9  -	@ lr needed for prologue  -	bhi	L2  -	ldr	r3, L11  -	mov	r2, #1  -	mov	r1, r2, asl r0  -	ands	r0, r3, r2, asl r0  -	movne	r0, #2  -	bxne	lr  -	tst	r1, #13  -	beq	L9  -L3:  -	mov	r0, r2  -	bx	lr  -L9:  -	tst	r1, #256  -	movne	r0, #3  -	bxne	lr  -L2:  -	mov	r0, #0  -	bx	lr  -L12:  -	.align 2  -L11:  -	.long	642  -          -  -GCC is doing a couple of clever things here:  -  1. It is predicating one of the returns.  This isn't a clear win though: in  -     cases where that return isn't taken, it is replacing one condbranch with  -     two 'ne' predicated instructions.  -  2. It is sinking the shift of "1 << i" into the tst, and using ands instead of  -     tst.  This will probably require whole function isel.  -  3. GCC emits:  -  	tst	r1, #256  -     we emit:  -        mov r1, #1  -        lsl r1, r1, #8  -        tst r2, r1  -  -//===---------------------------------------------------------------------===//  -  -When spilling in thumb mode and the sp offset is too large to fit in the ldr /  -str offset field, we load the offset from a constpool entry and add it to sp:  -  -ldr r2, LCPI  -add r2, sp  -ldr r2, [r2]  -  -These instructions preserve the condition code which is important if the spill  -is between a cmp and a bcc instruction. However, we can use the (potentially)  -cheaper sequence if we know it's ok to clobber the condition register.  -  -add r2, sp, #255 * 4  -add r2, #132  -ldr r2, [r2, #7 * 4]  -  -This is especially bad when dynamic alloca is used. 
The all fixed size stack  -objects are referenced off the frame pointer with negative offsets. See  -oggenc for an example.  -  -//===---------------------------------------------------------------------===//  -  -Poor codegen test/CodeGen/ARM/select.ll f7:  -  -	ldr r5, LCPI1_0  -LPC0:  -	add r5, pc  -	ldr r6, LCPI1_1  -	ldr r2, LCPI1_2  -	mov r3, r6  -	mov lr, pc  -	bx r5  -  -//===---------------------------------------------------------------------===//  -  -Make register allocator / spiller smarter so we can re-materialize "mov r, imm",  -etc. Almost all Thumb instructions clobber condition code.  -  -//===---------------------------------------------------------------------===//  -  -Thumb load / store address mode offsets are scaled. The values kept in the  -instruction operands are pre-scale values. This probably ought to be changed  -to avoid extra work when we convert Thumb2 instructions to Thumb1 instructions.  -  -//===---------------------------------------------------------------------===//  -  -We need to make (some of the) Thumb1 instructions predicable. That will allow  -shrinking of predicated Thumb2 instructions. To allow this, we need to be able  -to toggle the 's' bit since they do not set CPSR when they are inside IT blocks.  -  -//===---------------------------------------------------------------------===//  -  -Make use of hi register variants of cmp: tCMPhir / tCMPZhir.  -  -//===---------------------------------------------------------------------===//  -  -Thumb1 immediate field sometimes keep pre-scaled values. See  -ThumbRegisterInfo::eliminateFrameIndex. This is inconsistent from ARM and  -Thumb2.  -  -//===---------------------------------------------------------------------===//  -  -Rather than having tBR_JTr print a ".align 2" and constant island pass pad it,  -add a target specific ALIGN instruction instead. That way, getInstSizeInBytes  -won't have to over-estimate. It can also be used for loop alignment pass.  -  -//===---------------------------------------------------------------------===//  -  -We generate conditional code for icmp when we don't need to. This code:  -  -  int foo(int s) {  -    return s == 1;  -  }  -  -produces:  -  -foo:  -        cmp     r0, #1  -        mov.w   r0, #0  -        it      eq  -        moveq   r0, #1  -        bx      lr  -  -when it could use subs + adcs. This is GCC PR46975.  +//===---------------------------------------------------------------------===// +// Random ideas for the ARM backend (Thumb specific). +//===---------------------------------------------------------------------===// + +* Add support for compiling functions in both ARM and Thumb mode, then taking +  the smallest. + +* Add support for compiling individual basic blocks in thumb mode, when in a  +  larger ARM function.  This can be used for presumed cold code, like paths +  to abort (failure path of asserts), EH handling code, etc. + +* Thumb doesn't have normal pre/post increment addressing modes, but you can +  load/store 32-bit integers with pre/postinc by using load/store multiple +  instrs with a single register. + +* Make better use of high registers r8, r10, r11, r12 (ip). Some variants of add +  and cmp instructions can use high registers. Also, we can use them as +  temporaries to spill values into. + +* In thumb mode, short, byte, and bool preferred alignments are currently set +  to 4 to accommodate ISA restriction (i.e. add sp, #imm, imm must be multiple +  of 4). 
+ +//===---------------------------------------------------------------------===// + +Potential jumptable improvements: + +* If we know function size is less than (1 << 16) * 2 bytes, we can use 16-bit +  jumptable entries (e.g. (L1 - L2) >> 1). Or even smaller entries if the +  function is even smaller. This also applies to ARM. + +* Thumb jumptable codegen can improve given some help from the assembler. This +  is what we generate right now: + +	.set PCRELV0, (LJTI1_0_0-(LPCRELL0+4)) +LPCRELL0: +	mov r1, #PCRELV0 +	add r1, pc +	ldr r0, [r0, r1] +	mov pc, r0  +	.align	2 +LJTI1_0_0: +	.long	 LBB1_3 +        ... + +Note there is another pc relative add that we can take advantage of. +     add r1, pc, #imm_8 * 4 + +We should be able to generate: + +LPCRELL0: +	add r1, LJTI1_0_0 +	ldr r0, [r0, r1] +	mov pc, r0  +	.align	2 +LJTI1_0_0: +	.long	 LBB1_3 + +if the assembler can translate the add to: +       add r1, pc, #((LJTI1_0_0-(LPCRELL0+4))&0xfffffffc) + +Note the assembler also does something similar to constpool load: +LPCRELL0: +     ldr r0, LCPI1_0 +=> +     ldr r0, pc, #((LCPI1_0-(LPCRELL0+4))&0xfffffffc) + + +//===---------------------------------------------------------------------===// + +We compile the following: + +define i16 @func_entry_2E_ce(i32 %i) { +        switch i32 %i, label %bb12.exitStub [ +                 i32 0, label %bb4.exitStub +                 i32 1, label %bb9.exitStub +                 i32 2, label %bb4.exitStub +                 i32 3, label %bb4.exitStub +                 i32 7, label %bb9.exitStub +                 i32 8, label %bb.exitStub +                 i32 9, label %bb9.exitStub +        ] + +bb12.exitStub: +        ret i16 0 + +bb4.exitStub: +        ret i16 1 + +bb9.exitStub: +        ret i16 2 + +bb.exitStub: +        ret i16 3 +} + +into: + +_func_entry_2E_ce: +        mov r2, #1 +        lsl r2, r0 +        cmp r0, #9 +        bhi LBB1_4      @bb12.exitStub +LBB1_1: @newFuncRoot +        mov r1, #13 +        tst r2, r1 +        bne LBB1_5      @bb4.exitStub +LBB1_2: @newFuncRoot +        ldr r1, LCPI1_0 +        tst r2, r1 +        bne LBB1_6      @bb9.exitStub +LBB1_3: @newFuncRoot +        mov r1, #1 +        lsl r1, r1, #8 +        tst r2, r1 +        bne LBB1_7      @bb.exitStub +LBB1_4: @bb12.exitStub +        mov r0, #0 +        bx lr +LBB1_5: @bb4.exitStub +        mov r0, #1 +        bx lr +LBB1_6: @bb9.exitStub +        mov r0, #2 +        bx lr +LBB1_7: @bb.exitStub +        mov r0, #3 +        bx lr +LBB1_8: +        .align  2 +LCPI1_0: +        .long   642 + + +gcc compiles to: + +	cmp	r0, #9 +	@ lr needed for prologue +	bhi	L2 +	ldr	r3, L11 +	mov	r2, #1 +	mov	r1, r2, asl r0 +	ands	r0, r3, r2, asl r0 +	movne	r0, #2 +	bxne	lr +	tst	r1, #13 +	beq	L9 +L3: +	mov	r0, r2 +	bx	lr +L9: +	tst	r1, #256 +	movne	r0, #3 +	bxne	lr +L2: +	mov	r0, #0 +	bx	lr +L12: +	.align 2 +L11: +	.long	642 +         + +GCC is doing a couple of clever things here: +  1. It is predicating one of the returns.  This isn't a clear win though: in +     cases where that return isn't taken, it is replacing one condbranch with +     two 'ne' predicated instructions. +  2. It is sinking the shift of "1 << i" into the tst, and using ands instead of +     tst.  This will probably require whole function isel. +  3. 
GCC emits: +  	tst	r1, #256 +     we emit: +        mov r1, #1 +        lsl r1, r1, #8 +        tst r2, r1 + +//===---------------------------------------------------------------------===// + +When spilling in thumb mode and the sp offset is too large to fit in the ldr / +str offset field, we load the offset from a constpool entry and add it to sp: + +ldr r2, LCPI +add r2, sp +ldr r2, [r2] + +These instructions preserve the condition code which is important if the spill +is between a cmp and a bcc instruction. However, we can use the (potentially) +cheaper sequence if we know it's ok to clobber the condition register. + +add r2, sp, #255 * 4 +add r2, #132 +ldr r2, [r2, #7 * 4] + +This is especially bad when dynamic alloca is used. The all fixed size stack +objects are referenced off the frame pointer with negative offsets. See +oggenc for an example. + +//===---------------------------------------------------------------------===// + +Poor codegen test/CodeGen/ARM/select.ll f7: + +	ldr r5, LCPI1_0 +LPC0: +	add r5, pc +	ldr r6, LCPI1_1 +	ldr r2, LCPI1_2 +	mov r3, r6 +	mov lr, pc +	bx r5 + +//===---------------------------------------------------------------------===// + +Make register allocator / spiller smarter so we can re-materialize "mov r, imm", +etc. Almost all Thumb instructions clobber condition code. + +//===---------------------------------------------------------------------===// + +Thumb load / store address mode offsets are scaled. The values kept in the +instruction operands are pre-scale values. This probably ought to be changed +to avoid extra work when we convert Thumb2 instructions to Thumb1 instructions. + +//===---------------------------------------------------------------------===// + +We need to make (some of the) Thumb1 instructions predicable. That will allow +shrinking of predicated Thumb2 instructions. To allow this, we need to be able +to toggle the 's' bit since they do not set CPSR when they are inside IT blocks. + +//===---------------------------------------------------------------------===// + +Make use of hi register variants of cmp: tCMPhir / tCMPZhir. + +//===---------------------------------------------------------------------===// + +Thumb1 immediate field sometimes keep pre-scaled values. See +ThumbRegisterInfo::eliminateFrameIndex. This is inconsistent from ARM and +Thumb2. + +//===---------------------------------------------------------------------===// + +Rather than having tBR_JTr print a ".align 2" and constant island pass pad it, +add a target specific ALIGN instruction instead. That way, getInstSizeInBytes +won't have to over-estimate. It can also be used for loop alignment pass. + +//===---------------------------------------------------------------------===// + +We generate conditional code for icmp when we don't need to. This code: + +  int foo(int s) { +    return s == 1; +  } + +produces: + +foo: +        cmp     r0, #1 +        mov.w   r0, #0 +        it      eq +        moveq   r0, #1 +        bx      lr + +when it could use subs + adcs. This is GCC PR46975. diff --git a/contrib/libs/llvm12/lib/Target/ARM/README-Thumb2.txt b/contrib/libs/llvm12/lib/Target/ARM/README-Thumb2.txt index 227746ec13c..e7c2552d9e4 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/README-Thumb2.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/README-Thumb2.txt @@ -1,6 +1,6 @@ -//===---------------------------------------------------------------------===//  -// Random ideas for the ARM backend (Thumb2 specific).  
-//===---------------------------------------------------------------------===//  -  -Make sure jumptable destinations are below the jumptable in order to make use  -of tbb / tbh.  +//===---------------------------------------------------------------------===// +// Random ideas for the ARM backend (Thumb2 specific). +//===---------------------------------------------------------------------===// + +Make sure jumptable destinations are below the jumptable in order to make use +of tbb / tbh. diff --git a/contrib/libs/llvm12/lib/Target/ARM/README.txt b/contrib/libs/llvm12/lib/Target/ARM/README.txt index 1a93bc7bb79..def67cfae72 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/README.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/README.txt @@ -1,732 +1,732 @@ -//===---------------------------------------------------------------------===//  -// Random ideas for the ARM backend.  -//===---------------------------------------------------------------------===//  -  -Reimplement 'select' in terms of 'SEL'.  -  -* We would really like to support UXTAB16, but we need to prove that the  -  add doesn't need to overflow between the two 16-bit chunks.  -  -* Implement pre/post increment support.  (e.g. PR935)  -* Implement smarter constant generation for binops with large immediates.  -  -A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX  -  -Interesting optimization for PIC codegen on arm-linux:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129  -  -//===---------------------------------------------------------------------===//  -  -Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values.  By the  -time regalloc happens, these values are now in a 32-bit register, usually with  -the top-bits known to be sign or zero extended.  If spilled, we should be able  -to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part  -of the reload.  -  -Doing this reduces the size of the stack frame (important for thumb etc), and  -also increases the likelihood that we will be able to reload multiple values  -from the stack with a single load.  -  -//===---------------------------------------------------------------------===//  -  -The constant island pass is in good shape.  Some cleanups might be desirable,  -but there is unlikely to be much improvement in the generated code.  -  -1.  There may be some advantage to trying to be smarter about the initial  -placement, rather than putting everything at the end.  -  -2.  There might be some compile-time efficiency to be had by representing  -consecutive islands as a single block rather than multiple blocks.  -  -3.  Use a priority queue to sort constant pool users in inverse order of  -    position so we always process the one closed to the end of functions  -    first. This may simply CreateNewWater.  -  -//===---------------------------------------------------------------------===//  -  -Eliminate copysign custom expansion. We are still generating crappy code with  -default expansion + if-conversion.  
-  -//===---------------------------------------------------------------------===//  -  -Eliminate one instruction from:  -  -define i32 @_Z6slow4bii(i32 %x, i32 %y) {  -        %tmp = icmp sgt i32 %x, %y  -        %retval = select i1 %tmp, i32 %x, i32 %y  -        ret i32 %retval  -}  -  -__Z6slow4bii:  -        cmp r0, r1  -        movgt r1, r0  -        mov r0, r1  -        bx lr  -=>  -  -__Z6slow4bii:  -        cmp r0, r1  -        movle r0, r1  -        bx lr  -  -//===---------------------------------------------------------------------===//  -  -Implement long long "X-3" with instructions that fold the immediate in.  These  -were disabled due to badness with the ARM carry flag on subtracts.  -  -//===---------------------------------------------------------------------===//  -  -More load / store optimizations:  -1) Better representation for block transfer? This is from Olden/power:  -  -	fldd d0, [r4]  -	fstd d0, [r4, #+32]  -	fldd d0, [r4, #+8]  -	fstd d0, [r4, #+40]  -	fldd d0, [r4, #+16]  -	fstd d0, [r4, #+48]  -	fldd d0, [r4, #+24]  -	fstd d0, [r4, #+56]  -  -If we can spare the registers, it would be better to use fldm and fstm here.  -Need major register allocator enhancement though.  -  -2) Can we recognize the relative position of constantpool entries? i.e. Treat  -  -	ldr r0, LCPI17_3  -	ldr r1, LCPI17_4  -	ldr r2, LCPI17_5  -  -   as  -	ldr r0, LCPI17  -	ldr r1, LCPI17+4  -	ldr r2, LCPI17+8  -  -   Then the ldr's can be combined into a single ldm. See Olden/power.  -  -Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a  -double 64-bit FP constant:  -  -	adr	r0, L6  -	ldmia	r0, {r0-r1}  -  -	.align 2  -L6:  -	.long	-858993459  -	.long	1074318540  -  -3) struct copies appear to be done field by field  -instead of by words, at least sometimes:  -  -struct foo { int x; short s; char c1; char c2; };  -void cpy(struct foo*a, struct foo*b) { *a = *b; }  -  -llvm code (-O2)  -        ldrb r3, [r1, #+6]  -        ldr r2, [r1]  -        ldrb r12, [r1, #+7]  -        ldrh r1, [r1, #+4]  -        str r2, [r0]  -        strh r1, [r0, #+4]  -        strb r3, [r0, #+6]  -        strb r12, [r0, #+7]  -gcc code (-O2)  -        ldmia   r1, {r1-r2}  -        stmia   r0, {r1-r2}  -  -In this benchmark poor handling of aggregate copies has shown up as  -having a large effect on size, and possibly speed as well (we don't have  -a good way to measure on ARM).  -  -//===---------------------------------------------------------------------===//  -  -* Consider this silly example:  -  -double bar(double x) {  -  double r = foo(3.1);  -  return x+r;  -}  -  -_bar:  -        stmfd sp!, {r4, r5, r7, lr}  -        add r7, sp, #8  -        mov r4, r0  -        mov r5, r1  -        fldd d0, LCPI1_0  -        fmrrd r0, r1, d0  -        bl _foo  -        fmdrr d0, r4, r5  -        fmsr s2, r0  -        fsitod d1, s2  -        faddd d0, d1, d0  -        fmrrd r0, r1, d0  -        ldmfd sp!, {r4, r5, r7, pc}  -  -Ignore the prologue and epilogue stuff for a second. Note  -	mov r4, r0  -	mov r5, r1  -the copys to callee-save registers and the fact they are only being used by the  -fmdrr instruction. It would have been better had the fmdrr been scheduled  -before the call and place the result in a callee-save DPR register. The two  -mov ops would not have been necessary.  
-  -//===---------------------------------------------------------------------===//  -  -Calling convention related stuff:  -  -* gcc's parameter passing implementation is terrible and we suffer as a result:  -  -e.g.  -struct s {  -  double d1;  -  int s1;  -};  -  -void foo(struct s S) {  -  printf("%g, %d\n", S.d1, S.s1);  -}  -  -'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and  -then reload them to r1, r2, and r3 before issuing the call (r0 contains the  -address of the format string):  -  -	stmfd	sp!, {r7, lr}  -	add	r7, sp, #0  -	sub	sp, sp, #12  -	stmia	sp, {r0, r1, r2}  -	ldmia	sp, {r1-r2}  -	ldr	r0, L5  -	ldr	r3, [sp, #8]  -L2:  -	add	r0, pc, r0  -	bl	L_printf$stub  -  -Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves?  -  -* Return an aggregate type is even worse:  -  -e.g.  -struct s foo(void) {  -  struct s S = {1.1, 2};  -  return S;  -}  -  -	mov	ip, r0  -	ldr	r0, L5  -	sub	sp, sp, #12  -L2:  -	add	r0, pc, r0  -	@ lr needed for prologue  -	ldmia	r0, {r0, r1, r2}  -	stmia	sp, {r0, r1, r2}  -	stmia	ip, {r0, r1, r2}  -	mov	r0, ip  -	add	sp, sp, #12  -	bx	lr  -  -r0 (and later ip) is the hidden parameter from caller to store the value in. The  -first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1,  -r2 into the address passed in. However, there is one additional stmia that  -stores r0, r1, and r2 to some stack location. The store is dead.  -  -The llvm-gcc generated code looks like this:  -  -csretcc void %foo(%struct.s* %agg.result) {  -entry:  -	%S = alloca %struct.s, align 4		; <%struct.s*> [#uses=1]  -	%memtmp = alloca %struct.s		; <%struct.s*> [#uses=1]  -	cast %struct.s* %S to sbyte*		; <sbyte*>:0 [#uses=2]  -	call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 )  -	cast %struct.s* %agg.result to sbyte*		; <sbyte*>:1 [#uses=2]  -	call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 )  -	cast %struct.s* %memtmp to sbyte*		; <sbyte*>:2 [#uses=1]  -	call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 )  -	ret void  -}  -  -llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from  -constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated  -into a number of load and stores, or 2) custom lower memcpy (of small size) to  -be ldmia / stmia. I think option 2 is better but the current register  -allocator cannot allocate a chunk of registers at a time.  -  -A feasible temporary solution is to use specific physical registers at the  -lowering time for small (<= 4 words?) transfer size.  -  -* ARM CSRet calling convention requires the hidden argument to be returned by  -the callee.  -  -//===---------------------------------------------------------------------===//  -  -We can definitely do a better job on BB placements to eliminate some branches.  -It's very common to see llvm generated assembly code that looks like this:  -  -LBB3:  - ...  -LBB4:  -...  -  beq LBB3  -  b LBB2  -  -If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can  -then eliminate beq and turn the unconditional branch to LBB2 to a bne.  -  -See McCat/18-imp/ComputeBoundingBoxes for an example.  -  -//===---------------------------------------------------------------------===//  -  -Pre-/post- indexed load / stores:  -  -1) We should not make the pre/post- indexed load/store transform if the base ptr  -is guaranteed to be live beyond the load/store. 
This can happen if the base  -ptr is live out of the block we are performing the optimization. e.g.  -  -mov r1, r2  -ldr r3, [r1], #4  -...  -  -vs.  -  -ldr r3, [r2]  -add r1, r2, #4  -...  -  -In most cases, this is just a wasted optimization. However, sometimes it can  -negatively impact the performance because two-address code is more restrictive  -when it comes to scheduling.  -  -Unfortunately, liveout information is currently unavailable during DAG combine  -time.  -  -2) Consider spliting a indexed load / store into a pair of add/sub + load/store  -   to solve #1 (in TwoAddressInstructionPass.cpp).  -  -3) Enhance LSR to generate more opportunities for indexed ops.  -  -4) Once we added support for multiple result patterns, write indexed loads  -   patterns instead of C++ instruction selection code.  -  -5) Use VLDM / VSTM to emulate indexed FP load / store.  -  -//===---------------------------------------------------------------------===//  -  -Implement support for some more tricky ways to materialize immediates.  For  -example, to get 0xffff8000, we can use:  -  -mov r9, #&3f8000  -sub r9, r9, #&400000  -  -//===---------------------------------------------------------------------===//  -  -We sometimes generate multiple add / sub instructions to update sp in prologue  -and epilogue if the inc / dec value is too large to fit in a single immediate  -operand. In some cases, perhaps it might be better to load the value from a  -constantpool instead.  -  -//===---------------------------------------------------------------------===//  -  -GCC generates significantly better code for this function.  -  -int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) {  -    int i = 0;  -  -    if (StackPtr != 0) {  -       while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768)))  -          Line[i++] = Stack[--StackPtr];  -        if (LineLen > 32768)  -        {  -            while (StackPtr != 0 && i < LineLen)  -            {  -                i++;  -                --StackPtr;  -            }  -        }  -    }  -    return StackPtr;  -}  -  -//===---------------------------------------------------------------------===//  -  -This should compile to the mlas instruction:  -int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 
7 : 13; }  -  -//===---------------------------------------------------------------------===//  -  -At some point, we should triage these to see if they still apply to us:  -  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016  -  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982  -  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663  -  -http://www.inf.u-szeged.hu/gcc-arm/  -http://citeseer.ist.psu.edu/debus04linktime.html  -  -//===---------------------------------------------------------------------===//  -  -gcc generates smaller code for this function at -O2 or -Os:  -  -void foo(signed char* p) {  -  if (*p == 3)  -     bar();  -   else if (*p == 4)  -    baz();  -  else if (*p == 5)  -    quux();  -}  -  -llvm decides it's a good idea to turn the repeated if...else into a  -binary tree, as if it were a switch; the resulting code requires -1  -compare-and-branches when *p<=2 or *p==5, the same number if *p==4  -or *p>6, and +1 if *p==3.  So it should be a speed win  -(on balance).  However, the revised code is larger, with 4 conditional  -branches instead of 3.  -  -More seriously, there is a byte->word extend before  -each comparison, where there should be only one, and the condition codes  -are not remembered when the same two values are compared twice.  -  -//===---------------------------------------------------------------------===//  -  -More LSR enhancements possible:  -  -1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged  -   in a load / store.  -2. Allow iv reuse even when a type conversion is required. For example, i8  -   and i32 load / store addressing modes are identical.  -  -  -//===---------------------------------------------------------------------===//  -  -This:  -  -int foo(int a, int b, int c, int d) {  -  long long acc = (long long)a * (long long)b;  -  acc += (long long)c * (long long)d;  -  return (int)(acc >> 32);  -}  -  -Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies  -two signed 32-bit values to produce a 64-bit value, and accumulates this with  -a 64-bit value.  
-  -We currently get this with both v4 and v6:  -  -_foo:  -        smull r1, r0, r1, r0  -        smull r3, r2, r3, r2  -        adds r3, r3, r1  -        adc r0, r2, r0  -        bx lr  -  -//===---------------------------------------------------------------------===//  -  -This:  -        #include <algorithm>  -        std::pair<unsigned, bool> full_add(unsigned a, unsigned b)  -        { return std::make_pair(a + b, a + b < a); }  -        bool no_overflow(unsigned a, unsigned b)  -        { return !full_add(a, b).second; }  -  -Should compile to:  -  -_Z8full_addjj:  -	adds	r2, r1, r2  -	movcc	r1, #0  -	movcs	r1, #1  -	str	r2, [r0, #0]  -	strb	r1, [r0, #4]  -	mov	pc, lr  -  -_Z11no_overflowjj:  -	cmn	r0, r1  -	movcs	r0, #0  -	movcc	r0, #1  -	mov	pc, lr  -  -not:  -  -__Z8full_addjj:  -        add r3, r2, r1  -        str r3, [r0]  -        mov r2, #1  -        mov r12, #0  -        cmp r3, r1  -        movlo r12, r2  -        str r12, [r0, #+4]  -        bx lr  -__Z11no_overflowjj:  -        add r3, r1, r0  -        mov r2, #1  -        mov r1, #0  -        cmp r3, r0  -        movhs r1, r2  -        mov r0, r1  -        bx lr  -  -//===---------------------------------------------------------------------===//  -  -Some of the NEON intrinsics may be appropriate for more general use, either  -as target-independent intrinsics or perhaps elsewhere in the ARM backend.  -Some of them may also be lowered to target-independent SDNodes, and perhaps  -some new SDNodes could be added.  -  -For example, maximum, minimum, and absolute value operations are well-defined  -and standard operations, both for vector and scalar types.  -  -The current NEON-specific intrinsics for count leading zeros and count one  -bits could perhaps be replaced by the target-independent ctlz and ctpop  -intrinsics.  It may also make sense to add a target-independent "ctls"  -intrinsic for "count leading sign bits".  Likewise, the backend could use  -the target-independent SDNodes for these operations.  -  -ARMv6 has scalar saturating and halving adds and subtracts.  The same  -intrinsics could possibly be used for both NEON's vector implementations of  -those operations and the ARMv6 scalar versions.  -  -//===---------------------------------------------------------------------===//  -  -Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting  -LDR into imm12 and so_reg forms.  This allows us to clean up some code. e.g.  -ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg)  -while ARMConstantIslandPass only need to worry about LDR (literal).  -  -//===---------------------------------------------------------------------===//  -  -Constant island pass should make use of full range SoImm values for LEApcrel.  -Be careful though as the last attempt caused infinite looping on lencod.  -  -//===---------------------------------------------------------------------===//  -  -Predication issue. 
This function:  -  -extern unsigned array[ 128 ];  -int     foo( int x ) {  -  int     y;  -  y = array[ x & 127 ];  -  if ( x & 128 )  -     y = 123456789 & ( y >> 2 );  -  else  -     y = 123456789 & y;  -  return y;  -}  -  -compiles to:  -  -_foo:  -	and r1, r0, #127  -	ldr r2, LCPI1_0  -	ldr r2, [r2]  -	ldr r1, [r2, +r1, lsl #2]  -	mov r2, r1, lsr #2  -	tst r0, #128  -	moveq r2, r1  -	ldr r0, LCPI1_1  -	and r0, r2, r0  -	bx lr  -  -It would be better to do something like this, to fold the shift into the  -conditional move:  -  -	and r1, r0, #127  -	ldr r2, LCPI1_0  -	ldr r2, [r2]  -	ldr r1, [r2, +r1, lsl #2]  -	tst r0, #128  -	movne r1, r1, lsr #2  -	ldr r0, LCPI1_1  -	and r0, r1, r0  -	bx lr  -  -it saves an instruction and a register.  -  -//===---------------------------------------------------------------------===//  -  -It might be profitable to cse MOVi16 if there are lots of 32-bit immediates  -with the same bottom half.  -  -//===---------------------------------------------------------------------===//  -  -Robert Muth started working on an alternate jump table implementation that  -does not put the tables in-line in the text.  This is more like the llvm  -default jump table implementation.  This might be useful sometime.  Several  -revisions of patches are on the mailing list, beginning at:  -http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html  -  -//===---------------------------------------------------------------------===//  -  -Make use of the "rbit" instruction.  -  -//===---------------------------------------------------------------------===//  -  -Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how  -to licm and cse the unnecessary load from cp#1.  -  -//===---------------------------------------------------------------------===//  -  -The CMN instruction sets the flags like an ADD instruction, while CMP sets  -them like a subtract. Therefore to be able to use CMN for comparisons other  -than the Z bit, we'll need additional logic to reverse the conditionals  -associated with the comparison. Perhaps a pseudo-instruction for the comparison,  -with a post-codegen pass to clean up and handle the condition codes?  -See PR5694 for testcase.  -  -//===---------------------------------------------------------------------===//  -  -Given the following on armv5:  -int test1(int A, int B) {  -  return (A&-8388481)|(B&8388480);  -}  -  -We currently generate:  -	ldr	r2, .LCPI0_0  -	and	r0, r0, r2  -	ldr	r2, .LCPI0_1  -	and	r1, r1, r2  -	orr	r0, r1, r0  -	bx	lr  -  -We should be able to replace the second ldr+and with a bic (i.e. reuse the  -constant which was already loaded).  Not sure what's necessary to do that.  -  -//===---------------------------------------------------------------------===//  -  -The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal:  -  -int a(int x) { return __builtin_bswap32(x); }  -  -a:  -	mov	r1, #255, 24  -	mov	r2, #255, 16  -	and	r1, r1, r0, lsr #8  -	and	r2, r2, r0, lsl #8  -	orr	r1, r1, r0, lsr #24  -	orr	r0, r2, r0, lsl #24  -	orr	r0, r0, r1  -	bx	lr  -  -Something like the following would be better (fewer instructions/registers):  -	eor     r1, r0, r0, ror #16  -	bic     r1, r1, #0xff0000  -	mov     r1, r1, lsr #8  -	eor     r0, r1, r0, ror #8  -	bx	lr  -  -A custom Thumb version would also be a slight improvement over the generic  -version.  
-  -//===---------------------------------------------------------------------===//  -  -Consider the following simple C code:  -  -void foo(unsigned char *a, unsigned char *b, int *c) {  - if ((*a | *b) == 0) *c = 0;  -}  -  -currently llvm-gcc generates something like this (nice branchless code I'd say):  -  -       ldrb    r0, [r0]  -       ldrb    r1, [r1]  -       orr     r0, r1, r0  -       tst     r0, #255  -       moveq   r0, #0  -       streq   r0, [r2]  -       bx      lr  -  -Note that both "tst" and "moveq" are redundant.  -  -//===---------------------------------------------------------------------===//  -  -When loading immediate constants with movt/movw, if there are multiple  -constants needed with the same low 16 bits, and those values are not live at  -the same time, it would be possible to use a single movw instruction, followed  -by multiple movt instructions to rewrite the high bits to different values.  -For example:  -  -  volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4,  -  !tbaa  -!0  -  volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4,  -  !tbaa  -!0  -  -is compiled and optimized to:  -  -    movw    r0, #32796  -    mov.w    r1, #-1  -    movt    r0, #20480  -    str    r1, [r0]  -    movw    r0, #32796    @ <= this MOVW is not needed, value is there already  -    movt    r0, #20482  -    str    r1, [r0]  -  -//===---------------------------------------------------------------------===//  -  -Improve codegen for select's:  -if (x != 0) x = 1  -if (x == 1) x = 1  -  -ARM codegen used to look like this:  -       mov     r1, r0  -       cmp     r1, #1  -       mov     r0, #0  -       moveq   r0, #1  -  -The naive lowering select between two different values. It should recognize the  -test is equality test so it's more a conditional move rather than a select:  -       cmp     r0, #1  -       movne   r0, #0  -  -Currently this is a ARM specific dag combine. We probably should make it into a  -target-neutral one.  -  -//===---------------------------------------------------------------------===//  -  -Optimize unnecessary checks for zero with __builtin_clz/ctz.  Those builtins  -are specified to be undefined at zero, so portable code must check for zero  -and handle it as a special case.  That is unnecessary on ARM where those  -operations are implemented in a way that is well-defined for zero.  For  -example:  -  -int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; }  -  -should just be implemented with a CLZ instruction.  Since there are other  -targets, e.g., PPC, that share this behavior, it would be best to implement  -this in a target-independent way: we should probably fold that (when using  -"undefined at zero" semantics) to set the "defined at zero" bit and have  -the code generator expand out the right code.  -  -//===---------------------------------------------------------------------===//  -  -Clean up the test/MC/ARM files to have more robust register choices.  -  -R0 should not be used as a register operand in the assembler tests as it's then  -not possible to distinguish between a correct encoding and a missing operand  -encoding, as zero is the default value for the binary encoder.  -e.g.,  -    add r0, r0  // bad  -    add r3, r5  // good  -  -Register operands should be distinct. 
That is, when the encoding does not  -require two syntactical operands to refer to the same register, two different  -registers should be used in the test so as to catch errors where the  -operands are swapped in the encoding.  -e.g.,  -    subs.w r1, r1, r1 // bad  -    subs.w r1, r2, r3 // good  -  +//===---------------------------------------------------------------------===// +// Random ideas for the ARM backend. +//===---------------------------------------------------------------------===// + +Reimplement 'select' in terms of 'SEL'. + +* We would really like to support UXTAB16, but we need to prove that the +  add doesn't need to overflow between the two 16-bit chunks. + +* Implement pre/post increment support.  (e.g. PR935) +* Implement smarter constant generation for binops with large immediates. + +A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX + +Interesting optimization for PIC codegen on arm-linux: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129 + +//===---------------------------------------------------------------------===// + +Crazy idea:  Consider code that uses lots of 8-bit or 16-bit values.  By the +time regalloc happens, these values are now in a 32-bit register, usually with +the top-bits known to be sign or zero extended.  If spilled, we should be able +to spill these to a 8-bit or 16-bit stack slot, zero or sign extending as part +of the reload. + +Doing this reduces the size of the stack frame (important for thumb etc), and +also increases the likelihood that we will be able to reload multiple values +from the stack with a single load. + +//===---------------------------------------------------------------------===// + +The constant island pass is in good shape.  Some cleanups might be desirable, +but there is unlikely to be much improvement in the generated code. + +1.  There may be some advantage to trying to be smarter about the initial +placement, rather than putting everything at the end. + +2.  There might be some compile-time efficiency to be had by representing +consecutive islands as a single block rather than multiple blocks. + +3.  Use a priority queue to sort constant pool users in inverse order of +    position so we always process the one closed to the end of functions +    first. This may simply CreateNewWater. + +//===---------------------------------------------------------------------===// + +Eliminate copysign custom expansion. We are still generating crappy code with +default expansion + if-conversion. + +//===---------------------------------------------------------------------===// + +Eliminate one instruction from: + +define i32 @_Z6slow4bii(i32 %x, i32 %y) { +        %tmp = icmp sgt i32 %x, %y +        %retval = select i1 %tmp, i32 %x, i32 %y +        ret i32 %retval +} + +__Z6slow4bii: +        cmp r0, r1 +        movgt r1, r0 +        mov r0, r1 +        bx lr +=> + +__Z6slow4bii: +        cmp r0, r1 +        movle r0, r1 +        bx lr + +//===---------------------------------------------------------------------===// + +Implement long long "X-3" with instructions that fold the immediate in.  These +were disabled due to badness with the ARM carry flag on subtracts. + +//===---------------------------------------------------------------------===// + +More load / store optimizations: +1) Better representation for block transfer? 
This is from Olden/power: + +	fldd d0, [r4] +	fstd d0, [r4, #+32] +	fldd d0, [r4, #+8] +	fstd d0, [r4, #+40] +	fldd d0, [r4, #+16] +	fstd d0, [r4, #+48] +	fldd d0, [r4, #+24] +	fstd d0, [r4, #+56] + +If we can spare the registers, it would be better to use fldm and fstm here. +Need major register allocator enhancement though. + +2) Can we recognize the relative position of constantpool entries? i.e. Treat + +	ldr r0, LCPI17_3 +	ldr r1, LCPI17_4 +	ldr r2, LCPI17_5 + +   as +	ldr r0, LCPI17 +	ldr r1, LCPI17+4 +	ldr r2, LCPI17+8 + +   Then the ldr's can be combined into a single ldm. See Olden/power. + +Note for ARM v4 gcc uses ldmia to load a pair of 32-bit values to represent a +double 64-bit FP constant: + +	adr	r0, L6 +	ldmia	r0, {r0-r1} + +	.align 2 +L6: +	.long	-858993459 +	.long	1074318540 + +3) struct copies appear to be done field by field +instead of by words, at least sometimes: + +struct foo { int x; short s; char c1; char c2; }; +void cpy(struct foo*a, struct foo*b) { *a = *b; } + +llvm code (-O2) +        ldrb r3, [r1, #+6] +        ldr r2, [r1] +        ldrb r12, [r1, #+7] +        ldrh r1, [r1, #+4] +        str r2, [r0] +        strh r1, [r0, #+4] +        strb r3, [r0, #+6] +        strb r12, [r0, #+7] +gcc code (-O2) +        ldmia   r1, {r1-r2} +        stmia   r0, {r1-r2} + +In this benchmark poor handling of aggregate copies has shown up as +having a large effect on size, and possibly speed as well (we don't have +a good way to measure on ARM). + +//===---------------------------------------------------------------------===// + +* Consider this silly example: + +double bar(double x) { +  double r = foo(3.1); +  return x+r; +} + +_bar: +        stmfd sp!, {r4, r5, r7, lr} +        add r7, sp, #8 +        mov r4, r0 +        mov r5, r1 +        fldd d0, LCPI1_0 +        fmrrd r0, r1, d0 +        bl _foo +        fmdrr d0, r4, r5 +        fmsr s2, r0 +        fsitod d1, s2 +        faddd d0, d1, d0 +        fmrrd r0, r1, d0 +        ldmfd sp!, {r4, r5, r7, pc} + +Ignore the prologue and epilogue stuff for a second. Note +	mov r4, r0 +	mov r5, r1 +the copys to callee-save registers and the fact they are only being used by the +fmdrr instruction. It would have been better had the fmdrr been scheduled +before the call and place the result in a callee-save DPR register. The two +mov ops would not have been necessary. + +//===---------------------------------------------------------------------===// + +Calling convention related stuff: + +* gcc's parameter passing implementation is terrible and we suffer as a result: + +e.g. +struct s { +  double d1; +  int s1; +}; + +void foo(struct s S) { +  printf("%g, %d\n", S.d1, S.s1); +} + +'S' is passed via registers r0, r1, r2. But gcc stores them to the stack, and +then reload them to r1, r2, and r3 before issuing the call (r0 contains the +address of the format string): + +	stmfd	sp!, {r7, lr} +	add	r7, sp, #0 +	sub	sp, sp, #12 +	stmia	sp, {r0, r1, r2} +	ldmia	sp, {r1-r2} +	ldr	r0, L5 +	ldr	r3, [sp, #8] +L2: +	add	r0, pc, r0 +	bl	L_printf$stub + +Instead of a stmia, ldmia, and a ldr, wouldn't it be better to do three moves? + +* Return an aggregate type is even worse: + +e.g. +struct s foo(void) { +  struct s S = {1.1, 2}; +  return S; +} + +	mov	ip, r0 +	ldr	r0, L5 +	sub	sp, sp, #12 +L2: +	add	r0, pc, r0 +	@ lr needed for prologue +	ldmia	r0, {r0, r1, r2} +	stmia	sp, {r0, r1, r2} +	stmia	ip, {r0, r1, r2} +	mov	r0, ip +	add	sp, sp, #12 +	bx	lr + +r0 (and later ip) is the hidden parameter from caller to store the value in. 
The +first ldmia loads the constants into r0, r1, r2. The last stmia stores r0, r1, +r2 into the address passed in. However, there is one additional stmia that +stores r0, r1, and r2 to some stack location. The store is dead. + +The llvm-gcc generated code looks like this: + +csretcc void %foo(%struct.s* %agg.result) { +entry: +	%S = alloca %struct.s, align 4		; <%struct.s*> [#uses=1] +	%memtmp = alloca %struct.s		; <%struct.s*> [#uses=1] +	cast %struct.s* %S to sbyte*		; <sbyte*>:0 [#uses=2] +	call void %llvm.memcpy.i32( sbyte* %0, sbyte* cast ({ double, int }* %C.0.904 to sbyte*), uint 12, uint 4 ) +	cast %struct.s* %agg.result to sbyte*		; <sbyte*>:1 [#uses=2] +	call void %llvm.memcpy.i32( sbyte* %1, sbyte* %0, uint 12, uint 0 ) +	cast %struct.s* %memtmp to sbyte*		; <sbyte*>:2 [#uses=1] +	call void %llvm.memcpy.i32( sbyte* %2, sbyte* %1, uint 12, uint 0 ) +	ret void +} + +llc ends up issuing two memcpy's (the first memcpy becomes 3 loads from +constantpool). Perhaps we should 1) fix llvm-gcc so the memcpy is translated +into a number of load and stores, or 2) custom lower memcpy (of small size) to +be ldmia / stmia. I think option 2 is better but the current register +allocator cannot allocate a chunk of registers at a time. + +A feasible temporary solution is to use specific physical registers at the +lowering time for small (<= 4 words?) transfer size. + +* ARM CSRet calling convention requires the hidden argument to be returned by +the callee. + +//===---------------------------------------------------------------------===// + +We can definitely do a better job on BB placements to eliminate some branches. +It's very common to see llvm generated assembly code that looks like this: + +LBB3: + ... +LBB4: +... +  beq LBB3 +  b LBB2 + +If BB4 is the only predecessor of BB3, then we can emit BB3 after BB4. We can +then eliminate beq and turn the unconditional branch to LBB2 to a bne. + +See McCat/18-imp/ComputeBoundingBoxes for an example. + +//===---------------------------------------------------------------------===// + +Pre-/post- indexed load / stores: + +1) We should not make the pre/post- indexed load/store transform if the base ptr +is guaranteed to be live beyond the load/store. This can happen if the base +ptr is live out of the block we are performing the optimization. e.g. + +mov r1, r2 +ldr r3, [r1], #4 +... + +vs. + +ldr r3, [r2] +add r1, r2, #4 +... + +In most cases, this is just a wasted optimization. However, sometimes it can +negatively impact the performance because two-address code is more restrictive +when it comes to scheduling. + +Unfortunately, liveout information is currently unavailable during DAG combine +time. + +2) Consider spliting a indexed load / store into a pair of add/sub + load/store +   to solve #1 (in TwoAddressInstructionPass.cpp). + +3) Enhance LSR to generate more opportunities for indexed ops. + +4) Once we added support for multiple result patterns, write indexed loads +   patterns instead of C++ instruction selection code. + +5) Use VLDM / VSTM to emulate indexed FP load / store. + +//===---------------------------------------------------------------------===// + +Implement support for some more tricky ways to materialize immediates.  
For +example, to get 0xffff8000, we can use: + +mov r9, #&3f8000 +sub r9, r9, #&400000 + +//===---------------------------------------------------------------------===// + +We sometimes generate multiple add / sub instructions to update sp in prologue +and epilogue if the inc / dec value is too large to fit in a single immediate +operand. In some cases, perhaps it might be better to load the value from a +constantpool instead. + +//===---------------------------------------------------------------------===// + +GCC generates significantly better code for this function. + +int foo(int StackPtr, unsigned char *Line, unsigned char *Stack, int LineLen) { +    int i = 0; + +    if (StackPtr != 0) { +       while (StackPtr != 0 && i < (((LineLen) < (32768))? (LineLen) : (32768))) +          Line[i++] = Stack[--StackPtr]; +        if (LineLen > 32768) +        { +            while (StackPtr != 0 && i < LineLen) +            { +                i++; +                --StackPtr; +            } +        } +    } +    return StackPtr; +} + +//===---------------------------------------------------------------------===// + +This should compile to the mlas instruction: +int mlas(int x, int y, int z) { return ((x * y + z) < 0) ? 7 : 13; } + +//===---------------------------------------------------------------------===// + +At some point, we should triage these to see if they still apply to us: + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19598 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18560 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27016 + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11831 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11826 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11825 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11824 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11820 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10982 + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10242 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9760 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9759 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9703 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9702 +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9663 + +http://www.inf.u-szeged.hu/gcc-arm/ +http://citeseer.ist.psu.edu/debus04linktime.html + +//===---------------------------------------------------------------------===// + +gcc generates smaller code for this function at -O2 or -Os: + +void foo(signed char* p) { +  if (*p == 3) +     bar(); +   else if (*p == 4) +    baz(); +  else if (*p == 5) +    quux(); +} + +llvm decides it's a good idea to turn the repeated if...else into a +binary tree, as if it were a switch; the resulting code requires -1 +compare-and-branches when *p<=2 or *p==5, the same number if *p==4 +or *p>6, and +1 if *p==3.  So it should be a speed win +(on balance).  However, the revised code is larger, with 4 conditional +branches instead of 3. + +More seriously, there is a byte->word extend before +each comparison, where there should be only one, and the condition codes +are not remembered when the same two values are compared twice. + +//===---------------------------------------------------------------------===// + +More LSR enhancements possible: + +1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged +   in a load / store. +2. Allow iv reuse even when a type conversion is required. For example, i8 +   and i32 load / store addressing modes are identical. 
+ + +//===---------------------------------------------------------------------===// + +This: + +int foo(int a, int b, int c, int d) { +  long long acc = (long long)a * (long long)b; +  acc += (long long)c * (long long)d; +  return (int)(acc >> 32); +} + +Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies +two signed 32-bit values to produce a 64-bit value, and accumulates this with +a 64-bit value. + +We currently get this with both v4 and v6: + +_foo: +        smull r1, r0, r1, r0 +        smull r3, r2, r3, r2 +        adds r3, r3, r1 +        adc r0, r2, r0 +        bx lr + +//===---------------------------------------------------------------------===// + +This: +        #include <algorithm> +        std::pair<unsigned, bool> full_add(unsigned a, unsigned b) +        { return std::make_pair(a + b, a + b < a); } +        bool no_overflow(unsigned a, unsigned b) +        { return !full_add(a, b).second; } + +Should compile to: + +_Z8full_addjj: +	adds	r2, r1, r2 +	movcc	r1, #0 +	movcs	r1, #1 +	str	r2, [r0, #0] +	strb	r1, [r0, #4] +	mov	pc, lr + +_Z11no_overflowjj: +	cmn	r0, r1 +	movcs	r0, #0 +	movcc	r0, #1 +	mov	pc, lr + +not: + +__Z8full_addjj: +        add r3, r2, r1 +        str r3, [r0] +        mov r2, #1 +        mov r12, #0 +        cmp r3, r1 +        movlo r12, r2 +        str r12, [r0, #+4] +        bx lr +__Z11no_overflowjj: +        add r3, r1, r0 +        mov r2, #1 +        mov r1, #0 +        cmp r3, r0 +        movhs r1, r2 +        mov r0, r1 +        bx lr + +//===---------------------------------------------------------------------===// + +Some of the NEON intrinsics may be appropriate for more general use, either +as target-independent intrinsics or perhaps elsewhere in the ARM backend. +Some of them may also be lowered to target-independent SDNodes, and perhaps +some new SDNodes could be added. + +For example, maximum, minimum, and absolute value operations are well-defined +and standard operations, both for vector and scalar types. + +The current NEON-specific intrinsics for count leading zeros and count one +bits could perhaps be replaced by the target-independent ctlz and ctpop +intrinsics.  It may also make sense to add a target-independent "ctls" +intrinsic for "count leading sign bits".  Likewise, the backend could use +the target-independent SDNodes for these operations. + +ARMv6 has scalar saturating and halving adds and subtracts.  The same +intrinsics could possibly be used for both NEON's vector implementations of +those operations and the ARMv6 scalar versions. + +//===---------------------------------------------------------------------===// + +Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting +LDR into imm12 and so_reg forms.  This allows us to clean up some code. e.g. +ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg) +while ARMConstantIslandPass only need to worry about LDR (literal). + +//===---------------------------------------------------------------------===// + +Constant island pass should make use of full range SoImm values for LEApcrel. +Be careful though as the last attempt caused infinite looping on lencod. + +//===---------------------------------------------------------------------===// + +Predication issue. 
This function: + +extern unsigned array[ 128 ]; +int     foo( int x ) { +  int     y; +  y = array[ x & 127 ]; +  if ( x & 128 ) +     y = 123456789 & ( y >> 2 ); +  else +     y = 123456789 & y; +  return y; +} + +compiles to: + +_foo: +	and r1, r0, #127 +	ldr r2, LCPI1_0 +	ldr r2, [r2] +	ldr r1, [r2, +r1, lsl #2] +	mov r2, r1, lsr #2 +	tst r0, #128 +	moveq r2, r1 +	ldr r0, LCPI1_1 +	and r0, r2, r0 +	bx lr + +It would be better to do something like this, to fold the shift into the +conditional move: + +	and r1, r0, #127 +	ldr r2, LCPI1_0 +	ldr r2, [r2] +	ldr r1, [r2, +r1, lsl #2] +	tst r0, #128 +	movne r1, r1, lsr #2 +	ldr r0, LCPI1_1 +	and r0, r1, r0 +	bx lr + +it saves an instruction and a register. + +//===---------------------------------------------------------------------===// + +It might be profitable to cse MOVi16 if there are lots of 32-bit immediates +with the same bottom half. + +//===---------------------------------------------------------------------===// + +Robert Muth started working on an alternate jump table implementation that +does not put the tables in-line in the text.  This is more like the llvm +default jump table implementation.  This might be useful sometime.  Several +revisions of patches are on the mailing list, beginning at: +http://lists.llvm.org/pipermail/llvm-dev/2009-June/022763.html + +//===---------------------------------------------------------------------===// + +Make use of the "rbit" instruction. + +//===---------------------------------------------------------------------===// + +Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how +to licm and cse the unnecessary load from cp#1. + +//===---------------------------------------------------------------------===// + +The CMN instruction sets the flags like an ADD instruction, while CMP sets +them like a subtract. Therefore to be able to use CMN for comparisons other +than the Z bit, we'll need additional logic to reverse the conditionals +associated with the comparison. Perhaps a pseudo-instruction for the comparison, +with a post-codegen pass to clean up and handle the condition codes? +See PR5694 for testcase. + +//===---------------------------------------------------------------------===// + +Given the following on armv5: +int test1(int A, int B) { +  return (A&-8388481)|(B&8388480); +} + +We currently generate: +	ldr	r2, .LCPI0_0 +	and	r0, r0, r2 +	ldr	r2, .LCPI0_1 +	and	r1, r1, r2 +	orr	r0, r1, r0 +	bx	lr + +We should be able to replace the second ldr+and with a bic (i.e. reuse the +constant which was already loaded).  Not sure what's necessary to do that. + +//===---------------------------------------------------------------------===// + +The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal: + +int a(int x) { return __builtin_bswap32(x); } + +a: +	mov	r1, #255, 24 +	mov	r2, #255, 16 +	and	r1, r1, r0, lsr #8 +	and	r2, r2, r0, lsl #8 +	orr	r1, r1, r0, lsr #24 +	orr	r0, r2, r0, lsl #24 +	orr	r0, r0, r1 +	bx	lr + +Something like the following would be better (fewer instructions/registers): +	eor     r1, r0, r0, ror #16 +	bic     r1, r1, #0xff0000 +	mov     r1, r1, lsr #8 +	eor     r0, r1, r0, ror #8 +	bx	lr + +A custom Thumb version would also be a slight improvement over the generic +version. 
+ +//===---------------------------------------------------------------------===// + +Consider the following simple C code: + +void foo(unsigned char *a, unsigned char *b, int *c) { + if ((*a | *b) == 0) *c = 0; +} + +currently llvm-gcc generates something like this (nice branchless code I'd say): + +       ldrb    r0, [r0] +       ldrb    r1, [r1] +       orr     r0, r1, r0 +       tst     r0, #255 +       moveq   r0, #0 +       streq   r0, [r2] +       bx      lr + +Note that both "tst" and "moveq" are redundant. + +//===---------------------------------------------------------------------===// + +When loading immediate constants with movt/movw, if there are multiple +constants needed with the same low 16 bits, and those values are not live at +the same time, it would be possible to use a single movw instruction, followed +by multiple movt instructions to rewrite the high bits to different values. +For example: + +  volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, +  !tbaa +!0 +  volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, +  !tbaa +!0 + +is compiled and optimized to: + +    movw    r0, #32796 +    mov.w    r1, #-1 +    movt    r0, #20480 +    str    r1, [r0] +    movw    r0, #32796    @ <= this MOVW is not needed, value is there already +    movt    r0, #20482 +    str    r1, [r0] + +//===---------------------------------------------------------------------===// + +Improve codegen for select's: +if (x != 0) x = 1 +if (x == 1) x = 1 + +ARM codegen used to look like this: +       mov     r1, r0 +       cmp     r1, #1 +       mov     r0, #0 +       moveq   r0, #1 + +The naive lowering select between two different values. It should recognize the +test is equality test so it's more a conditional move rather than a select: +       cmp     r0, #1 +       movne   r0, #0 + +Currently this is a ARM specific dag combine. We probably should make it into a +target-neutral one. + +//===---------------------------------------------------------------------===// + +Optimize unnecessary checks for zero with __builtin_clz/ctz.  Those builtins +are specified to be undefined at zero, so portable code must check for zero +and handle it as a special case.  That is unnecessary on ARM where those +operations are implemented in a way that is well-defined for zero.  For +example: + +int f(int x) { return x ? __builtin_clz(x) : sizeof(int)*8; } + +should just be implemented with a CLZ instruction.  Since there are other +targets, e.g., PPC, that share this behavior, it would be best to implement +this in a target-independent way: we should probably fold that (when using +"undefined at zero" semantics) to set the "defined at zero" bit and have +the code generator expand out the right code. + +//===---------------------------------------------------------------------===// + +Clean up the test/MC/ARM files to have more robust register choices. + +R0 should not be used as a register operand in the assembler tests as it's then +not possible to distinguish between a correct encoding and a missing operand +encoding, as zero is the default value for the binary encoder. +e.g., +    add r0, r0  // bad +    add r3, r5  // good + +Register operands should be distinct. That is, when the encoding does not +require two syntactical operands to refer to the same register, two different +registers should be used in the test so as to catch errors where the +operands are swapped in the encoding. 
+e.g., +    subs.w r1, r1, r1 // bad +    subs.w r1, r2, r3 // good + diff --git a/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/ya.make b/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/ya.make index 260ad441904..089e7bf2069 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/ya.make +++ b/contrib/libs/llvm12/lib/Target/ARM/TargetInfo/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/lib/Support diff --git a/contrib/libs/llvm12/lib/Target/ARM/Utils/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/ARM/Utils/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/Utils/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/ARM/Utils/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/ARM/Utils/ya.make b/contrib/libs/llvm12/lib/Target/ARM/Utils/ya.make index 216fd023f60..7a980b708c3 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/Utils/ya.make +++ b/contrib/libs/llvm12/lib/Target/ARM/Utils/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/ARM/ya.make b/contrib/libs/llvm12/lib/Target/ARM/ya.make index 1fe4babbea6..9551f9f11bb 100644 --- a/contrib/libs/llvm12/lib/Target/ARM/ya.make +++ b/contrib/libs/llvm12/lib/Target/ARM/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -   PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/BPF/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/BPF/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/BPF/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/BPF/AsmParser/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/BPF/AsmParser/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/AsmParser/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/BPF/AsmParser/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/BPF/AsmParser/ya.make b/contrib/libs/llvm12/lib/Target/BPF/AsmParser/ya.make index adaba48b957..b61ac06cdd2 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/AsmParser/ya.make +++ b/contrib/libs/llvm12/lib/Target/BPF/AsmParser/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/BPF/Disassembler/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/BPF/Disassembler/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/Disassembler/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/BPF/Disassembler/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/BPF/Disassembler/ya.make b/contrib/libs/llvm12/lib/Target/BPF/Disassembler/ya.make index f31d0e82005..cb7872eeeeb 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/Disassembler/ya.make +++ b/contrib/libs/llvm12/lib/Target/BPF/Disassembler/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/ya.make b/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/ya.make index b44f9daa1ab..6522c7ef00c 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/ya.make +++ b/contrib/libs/llvm12/lib/Target/BPF/MCTargetDesc/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/ya.make b/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/ya.make index 6a1c950d092..3a882dad3ef 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/ya.make +++ b/contrib/libs/llvm12/lib/Target/BPF/TargetInfo/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/lib/Support diff --git a/contrib/libs/llvm12/lib/Target/BPF/ya.make b/contrib/libs/llvm12/lib/Target/BPF/ya.make index 0a3900df456..0f122e4afe2 100644 --- a/contrib/libs/llvm12/lib/Target/BPF/ya.make +++ b/contrib/libs/llvm12/lib/Target/BPF/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/NVPTX/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/NVPTX/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/NVPTX/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/NVPTX/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ 
-====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/ya.make b/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/ya.make index 51150a2c8e2..81ad30663e7 100644 --- a/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/ya.make +++ b/contrib/libs/llvm12/lib/Target/NVPTX/MCTargetDesc/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/ya.make b/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/ya.make index c49c23bb18c..52ef1e5f5ba 100644 --- a/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/ya.make +++ b/contrib/libs/llvm12/lib/Target/NVPTX/TargetInfo/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/lib/Support diff --git a/contrib/libs/llvm12/lib/Target/NVPTX/ya.make b/contrib/libs/llvm12/lib/Target/NVPTX/ya.make index 7701b9ded40..4f7542eb652 100644 --- a/contrib/libs/llvm12/lib/Target/NVPTX/ya.make +++ b/contrib/libs/llvm12/lib/Target/NVPTX/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -   PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/PowerPC/.yandex_meta/licenses.list.txt index 3a4cf0af9fa..2f43d3f2722 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/.yandex_meta/licenses.list.txt @@ -1,16 +1,16 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https)//llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier) Apache-2.0 WITH LLVM-exception  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https)//llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier) Apache-2.0 WITH LLVM-exception + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/ya.make b/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/ya.make index 2388d58641f..24183440dc2 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/ya.make +++ b/contrib/libs/llvm12/lib/Target/PowerPC/AsmParser/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/ya.make b/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/ya.make index c43266cf40b..a412740df21 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/ya.make +++ b/contrib/libs/llvm12/lib/Target/PowerPC/Disassembler/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/.yandex_meta/licenses.list.txt index b0b34714ca8..ad3879fc450 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/.yandex_meta/licenses.list.txt @@ -1,303 +1,303 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  -  -  -====================File: LICENSE.TXT====================  -==============================================================================  -The LLVM Project is under the Apache License v2.0 with LLVM Exceptions:  -==============================================================================  -  -                                 Apache License  -                           Version 2.0, January 2004  -                        http://www.apache.org/licenses/  -  -    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION  -  -    1. Definitions.  -  -      "License" shall mean the terms and conditions for use, reproduction,  -      and distribution as defined by Sections 1 through 9 of this document.  -  -      "Licensor" shall mean the copyright owner or entity authorized by  -      the copyright owner that is granting the License.  -  -      "Legal Entity" shall mean the union of the acting entity and all  -      other entities that control, are controlled by, or are under common  -      control with that entity. For the purposes of this definition,  -      "control" means (i) the power, direct or indirect, to cause the  -      direction or management of such entity, whether by contract or  -      otherwise, or (ii) ownership of fifty percent (50%) or more of the  -      outstanding shares, or (iii) beneficial ownership of such entity.  -  -      "You" (or "Your") shall mean an individual or Legal Entity  -      exercising permissions granted by this License.  -  -      "Source" form shall mean the preferred form for making modifications,  -      including but not limited to software source code, documentation  -      source, and configuration files.  -  -      "Object" form shall mean any form resulting from mechanical  -      transformation or translation of a Source form, including but  -      not limited to compiled object code, generated documentation,  -      and conversions to other media types.  
-  -      "Work" shall mean the work of authorship, whether in Source or  -      Object form, made available under the License, as indicated by a  -      copyright notice that is included in or attached to the work  -      (an example is provided in the Appendix below).  -  -      "Derivative Works" shall mean any work, whether in Source or Object  -      form, that is based on (or derived from) the Work and for which the  -      editorial revisions, annotations, elaborations, or other modifications  -      represent, as a whole, an original work of authorship. For the purposes  -      of this License, Derivative Works shall not include works that remain  -      separable from, or merely link (or bind by name) to the interfaces of,  -      the Work and Derivative Works thereof.  -  -      "Contribution" shall mean any work of authorship, including  -      the original version of the Work and any modifications or additions  -      to that Work or Derivative Works thereof, that is intentionally  -      submitted to Licensor for inclusion in the Work by the copyright owner  -      or by an individual or Legal Entity authorized to submit on behalf of  -      the copyright owner. For the purposes of this definition, "submitted"  -      means any form of electronic, verbal, or written communication sent  -      to the Licensor or its representatives, including but not limited to  -      communication on electronic mailing lists, source code control systems,  -      and issue tracking systems that are managed by, or on behalf of, the  -      Licensor for the purpose of discussing and improving the Work, but  -      excluding communication that is conspicuously marked or otherwise  -      designated in writing by the copyright owner as "Not a Contribution."  -  -      "Contributor" shall mean Licensor and any individual or Legal Entity  -      on behalf of whom a Contribution has been received by Licensor and  -      subsequently incorporated within the Work.  -  -    2. Grant of Copyright License. Subject to the terms and conditions of  -      this License, each Contributor hereby grants to You a perpetual,  -      worldwide, non-exclusive, no-charge, royalty-free, irrevocable  -      copyright license to reproduce, prepare Derivative Works of,  -      publicly display, publicly perform, sublicense, and distribute the  -      Work and such Derivative Works in Source or Object form.  -  -    3. Grant of Patent License. Subject to the terms and conditions of  -      this License, each Contributor hereby grants to You a perpetual,  -      worldwide, non-exclusive, no-charge, royalty-free, irrevocable  -      (except as stated in this section) patent license to make, have made,  -      use, offer to sell, sell, import, and otherwise transfer the Work,  -      where such license applies only to those patent claims licensable  -      by such Contributor that are necessarily infringed by their  -      Contribution(s) alone or by combination of their Contribution(s)  -      with the Work to which such Contribution(s) was submitted. If You  -      institute patent litigation against any entity (including a  -      cross-claim or counterclaim in a lawsuit) alleging that the Work  -      or a Contribution incorporated within the Work constitutes direct  -      or contributory patent infringement, then any patent licenses  -      granted to You under this License for that Work shall terminate  -      as of the date such litigation is filed.  -  -    4. Redistribution. 
You may reproduce and distribute copies of the  -      Work or Derivative Works thereof in any medium, with or without  -      modifications, and in Source or Object form, provided that You  -      meet the following conditions:  -  -      (a) You must give any other recipients of the Work or  -          Derivative Works a copy of this License; and  -  -      (b) You must cause any modified files to carry prominent notices  -          stating that You changed the files; and  -  -      (c) You must retain, in the Source form of any Derivative Works  -          that You distribute, all copyright, patent, trademark, and  -          attribution notices from the Source form of the Work,  -          excluding those notices that do not pertain to any part of  -          the Derivative Works; and  -  -      (d) If the Work includes a "NOTICE" text file as part of its  -          distribution, then any Derivative Works that You distribute must  -          include a readable copy of the attribution notices contained  -          within such NOTICE file, excluding those notices that do not  -          pertain to any part of the Derivative Works, in at least one  -          of the following places: within a NOTICE text file distributed  -          as part of the Derivative Works; within the Source form or  -          documentation, if provided along with the Derivative Works; or,  -          within a display generated by the Derivative Works, if and  -          wherever such third-party notices normally appear. The contents  -          of the NOTICE file are for informational purposes only and  -          do not modify the License. You may add Your own attribution  -          notices within Derivative Works that You distribute, alongside  -          or as an addendum to the NOTICE text from the Work, provided  -          that such additional attribution notices cannot be construed  -          as modifying the License.  -  -      You may add Your own copyright statement to Your modifications and  -      may provide additional or different license terms and conditions  -      for use, reproduction, or distribution of Your modifications, or  -      for any such Derivative Works as a whole, provided Your use,  -      reproduction, and distribution of the Work otherwise complies with  -      the conditions stated in this License.  -  -    5. Submission of Contributions. Unless You explicitly state otherwise,  -      any Contribution intentionally submitted for inclusion in the Work  -      by You to the Licensor shall be under the terms and conditions of  -      this License, without any additional terms or conditions.  -      Notwithstanding the above, nothing herein shall supersede or modify  -      the terms of any separate license agreement you may have executed  -      with Licensor regarding such Contributions.  -  -    6. Trademarks. This License does not grant permission to use the trade  -      names, trademarks, service marks, or product names of the Licensor,  -      except as required for reasonable and customary use in describing the  -      origin of the Work and reproducing the content of the NOTICE file.  -  -    7. Disclaimer of Warranty. 
Unless required by applicable law or  -      agreed to in writing, Licensor provides the Work (and each  -      Contributor provides its Contributions) on an "AS IS" BASIS,  -      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  -      implied, including, without limitation, any warranties or conditions  -      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A  -      PARTICULAR PURPOSE. You are solely responsible for determining the  -      appropriateness of using or redistributing the Work and assume any  -      risks associated with Your exercise of permissions under this License.  -  -    8. Limitation of Liability. In no event and under no legal theory,  -      whether in tort (including negligence), contract, or otherwise,  -      unless required by applicable law (such as deliberate and grossly  -      negligent acts) or agreed to in writing, shall any Contributor be  -      liable to You for damages, including any direct, indirect, special,  -      incidental, or consequential damages of any character arising as a  -      result of this License or out of the use or inability to use the  -      Work (including but not limited to damages for loss of goodwill,  -      work stoppage, computer failure or malfunction, or any and all  -      other commercial damages or losses), even if such Contributor  -      has been advised of the possibility of such damages.  -  -    9. Accepting Warranty or Additional Liability. While redistributing  -      the Work or Derivative Works thereof, You may choose to offer,  -      and charge a fee for, acceptance of support, warranty, indemnity,  -      or other liability obligations and/or rights consistent with this  -      License. However, in accepting such obligations, You may act only  -      on Your own behalf and on Your sole responsibility, not on behalf  -      of any other Contributor, and only if You agree to indemnify,  -      defend, and hold each Contributor harmless for any liability  -      incurred by, or claims asserted against, such Contributor by reason  -      of your accepting any such warranty or additional liability.  -  -    END OF TERMS AND CONDITIONS  -  -    APPENDIX: How to apply the Apache License to your work.  -  -      To apply the Apache License to your work, attach the following  -      boilerplate notice, with the fields enclosed by brackets "[]"  -      replaced with your own identifying information. (Don't include  -      the brackets!)  The text should be enclosed in the appropriate  -      comment syntax for the file format. We also recommend that a  -      file or class name and description of purpose be included on the  -      same "printed page" as the copyright notice for easier  -      identification within third-party archives.  -  -    Copyright [yyyy] [name of copyright owner]  -  -    Licensed under the Apache License, Version 2.0 (the "License");  -    you may not use this file except in compliance with the License.  -    You may obtain a copy of the License at  -  -       http://www.apache.org/licenses/LICENSE-2.0  -  -    Unless required by applicable law or agreed to in writing, software  -    distributed under the License is distributed on an "AS IS" BASIS,  -    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  -    See the License for the specific language governing permissions and  -    limitations under the License.  
-  -  ----- LLVM Exceptions to the Apache 2.0 License ----  -  -As an exception, if, as a result of your compiling your source code, portions  -of this Software are embedded into an Object form of such source code, you  -may redistribute such embedded portions in such Object form without complying  -with the conditions of Sections 4(a), 4(b) and 4(d) of the License.  -  -In addition, if you combine or link compiled forms of this Software with  -software that is licensed under the GPLv2 ("Combined Software") and if a  -court of competent jurisdiction determines that the patent provision (Section  -3), the indemnity provision (Section 9) or other Section of the License  -conflicts with the conditions of the GPLv2, you may retroactively and  -prospectively choose to deem waived or otherwise exclude such Section(s) of  -the License, but only in their entirety and only with respect to the Combined  -Software.  -  -==============================================================================  -Software from third parties included in the LLVM Project:  -==============================================================================  -The LLVM Project contains third party software which is under different license  -terms. All such code will be identified clearly using at least one of two  -mechanisms:  -1) It will be in a separate directory tree with its own `LICENSE.txt` or  -   `LICENSE` file at the top containing the specific license and restrictions  -   which apply to that software, or  -2) It will contain specific license and restriction terms at the top of every  -   file.  -  -==============================================================================  -Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy):  -==============================================================================  -University of Illinois/NCSA  -Open Source License  -  -Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign.  -All rights reserved.  -  -Developed by:  -  -    LLVM Team  -  -    University of Illinois at Urbana-Champaign  -  -    http://llvm.org  -  -Permission is hereby granted, free of charge, to any person obtaining a copy of  -this software and associated documentation files (the "Software"), to deal with  -the Software without restriction, including without limitation the rights to  -use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies  -of the Software, and to permit persons to whom the Software is furnished to do  -so, subject to the following conditions:  -  -    * Redistributions of source code must retain the above copyright notice,  -      this list of conditions and the following disclaimers.  -  -    * Redistributions in binary form must reproduce the above copyright notice,  -      this list of conditions and the following disclaimers in the  -      documentation and/or other materials provided with the distribution.  -  -    * Neither the names of the LLVM Team, University of Illinois at  -      Urbana-Champaign, nor the names of its contributors may be used to  -      endorse or promote products derived from this Software without specific  -      prior written permission.  -  -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR  -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS  -FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  
IN NO EVENT SHALL THE  -CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER  -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,  -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE  -SOFTWARE.  -  -  -  -====================File: include/llvm/Support/LICENSE.TXT====================  -LLVM System Interface Library  --------------------------------------------------------------------------------  -The LLVM System Interface Library is licensed under the Illinois Open Source  -License and has the following additional copyright:  -  -Copyright (C) 2004 eXtensible Systems, Inc.  -  -  -====================NCSA====================  -// This file is distributed under the University of Illinois Open Source  -// License. See LICENSE.TXT for details.  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + + +====================File: LICENSE.TXT==================== +============================================================================== +The LLVM Project is under the Apache License v2.0 with LLVM Exceptions: +============================================================================== + +                                 Apache License +                           Version 2.0, January 2004 +                        http://www.apache.org/licenses/ + +    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + +    1. Definitions. + +      "License" shall mean the terms and conditions for use, reproduction, +      and distribution as defined by Sections 1 through 9 of this document. + +      "Licensor" shall mean the copyright owner or entity authorized by +      the copyright owner that is granting the License. + +      "Legal Entity" shall mean the union of the acting entity and all +      other entities that control, are controlled by, or are under common +      control with that entity. For the purposes of this definition, +      "control" means (i) the power, direct or indirect, to cause the +      direction or management of such entity, whether by contract or +      otherwise, or (ii) ownership of fifty percent (50%) or more of the +      outstanding shares, or (iii) beneficial ownership of such entity. + +      "You" (or "Your") shall mean an individual or Legal Entity +      exercising permissions granted by this License. + +      "Source" form shall mean the preferred form for making modifications, +      including but not limited to software source code, documentation +      source, and configuration files. + +      "Object" form shall mean any form resulting from mechanical +      transformation or translation of a Source form, including but +      not limited to compiled object code, generated documentation, +      and conversions to other media types. + +      "Work" shall mean the work of authorship, whether in Source or +      Object form, made available under the License, as indicated by a +      copyright notice that is included in or attached to the work +      (an example is provided in the Appendix below). 
+ +      "Derivative Works" shall mean any work, whether in Source or Object +      form, that is based on (or derived from) the Work and for which the +      editorial revisions, annotations, elaborations, or other modifications +      represent, as a whole, an original work of authorship. For the purposes +      of this License, Derivative Works shall not include works that remain +      separable from, or merely link (or bind by name) to the interfaces of, +      the Work and Derivative Works thereof. + +      "Contribution" shall mean any work of authorship, including +      the original version of the Work and any modifications or additions +      to that Work or Derivative Works thereof, that is intentionally +      submitted to Licensor for inclusion in the Work by the copyright owner +      or by an individual or Legal Entity authorized to submit on behalf of +      the copyright owner. For the purposes of this definition, "submitted" +      means any form of electronic, verbal, or written communication sent +      to the Licensor or its representatives, including but not limited to +      communication on electronic mailing lists, source code control systems, +      and issue tracking systems that are managed by, or on behalf of, the +      Licensor for the purpose of discussing and improving the Work, but +      excluding communication that is conspicuously marked or otherwise +      designated in writing by the copyright owner as "Not a Contribution." + +      "Contributor" shall mean Licensor and any individual or Legal Entity +      on behalf of whom a Contribution has been received by Licensor and +      subsequently incorporated within the Work. + +    2. Grant of Copyright License. Subject to the terms and conditions of +      this License, each Contributor hereby grants to You a perpetual, +      worldwide, non-exclusive, no-charge, royalty-free, irrevocable +      copyright license to reproduce, prepare Derivative Works of, +      publicly display, publicly perform, sublicense, and distribute the +      Work and such Derivative Works in Source or Object form. + +    3. Grant of Patent License. Subject to the terms and conditions of +      this License, each Contributor hereby grants to You a perpetual, +      worldwide, non-exclusive, no-charge, royalty-free, irrevocable +      (except as stated in this section) patent license to make, have made, +      use, offer to sell, sell, import, and otherwise transfer the Work, +      where such license applies only to those patent claims licensable +      by such Contributor that are necessarily infringed by their +      Contribution(s) alone or by combination of their Contribution(s) +      with the Work to which such Contribution(s) was submitted. If You +      institute patent litigation against any entity (including a +      cross-claim or counterclaim in a lawsuit) alleging that the Work +      or a Contribution incorporated within the Work constitutes direct +      or contributory patent infringement, then any patent licenses +      granted to You under this License for that Work shall terminate +      as of the date such litigation is filed. + +    4. Redistribution. 
You may reproduce and distribute copies of the +      Work or Derivative Works thereof in any medium, with or without +      modifications, and in Source or Object form, provided that You +      meet the following conditions: + +      (a) You must give any other recipients of the Work or +          Derivative Works a copy of this License; and + +      (b) You must cause any modified files to carry prominent notices +          stating that You changed the files; and + +      (c) You must retain, in the Source form of any Derivative Works +          that You distribute, all copyright, patent, trademark, and +          attribution notices from the Source form of the Work, +          excluding those notices that do not pertain to any part of +          the Derivative Works; and + +      (d) If the Work includes a "NOTICE" text file as part of its +          distribution, then any Derivative Works that You distribute must +          include a readable copy of the attribution notices contained +          within such NOTICE file, excluding those notices that do not +          pertain to any part of the Derivative Works, in at least one +          of the following places: within a NOTICE text file distributed +          as part of the Derivative Works; within the Source form or +          documentation, if provided along with the Derivative Works; or, +          within a display generated by the Derivative Works, if and +          wherever such third-party notices normally appear. The contents +          of the NOTICE file are for informational purposes only and +          do not modify the License. You may add Your own attribution +          notices within Derivative Works that You distribute, alongside +          or as an addendum to the NOTICE text from the Work, provided +          that such additional attribution notices cannot be construed +          as modifying the License. + +      You may add Your own copyright statement to Your modifications and +      may provide additional or different license terms and conditions +      for use, reproduction, or distribution of Your modifications, or +      for any such Derivative Works as a whole, provided Your use, +      reproduction, and distribution of the Work otherwise complies with +      the conditions stated in this License. + +    5. Submission of Contributions. Unless You explicitly state otherwise, +      any Contribution intentionally submitted for inclusion in the Work +      by You to the Licensor shall be under the terms and conditions of +      this License, without any additional terms or conditions. +      Notwithstanding the above, nothing herein shall supersede or modify +      the terms of any separate license agreement you may have executed +      with Licensor regarding such Contributions. + +    6. Trademarks. This License does not grant permission to use the trade +      names, trademarks, service marks, or product names of the Licensor, +      except as required for reasonable and customary use in describing the +      origin of the Work and reproducing the content of the NOTICE file. + +    7. Disclaimer of Warranty. 
Unless required by applicable law or +      agreed to in writing, Licensor provides the Work (and each +      Contributor provides its Contributions) on an "AS IS" BASIS, +      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +      implied, including, without limitation, any warranties or conditions +      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A +      PARTICULAR PURPOSE. You are solely responsible for determining the +      appropriateness of using or redistributing the Work and assume any +      risks associated with Your exercise of permissions under this License. + +    8. Limitation of Liability. In no event and under no legal theory, +      whether in tort (including negligence), contract, or otherwise, +      unless required by applicable law (such as deliberate and grossly +      negligent acts) or agreed to in writing, shall any Contributor be +      liable to You for damages, including any direct, indirect, special, +      incidental, or consequential damages of any character arising as a +      result of this License or out of the use or inability to use the +      Work (including but not limited to damages for loss of goodwill, +      work stoppage, computer failure or malfunction, or any and all +      other commercial damages or losses), even if such Contributor +      has been advised of the possibility of such damages. + +    9. Accepting Warranty or Additional Liability. While redistributing +      the Work or Derivative Works thereof, You may choose to offer, +      and charge a fee for, acceptance of support, warranty, indemnity, +      or other liability obligations and/or rights consistent with this +      License. However, in accepting such obligations, You may act only +      on Your own behalf and on Your sole responsibility, not on behalf +      of any other Contributor, and only if You agree to indemnify, +      defend, and hold each Contributor harmless for any liability +      incurred by, or claims asserted against, such Contributor by reason +      of your accepting any such warranty or additional liability. + +    END OF TERMS AND CONDITIONS + +    APPENDIX: How to apply the Apache License to your work. + +      To apply the Apache License to your work, attach the following +      boilerplate notice, with the fields enclosed by brackets "[]" +      replaced with your own identifying information. (Don't include +      the brackets!)  The text should be enclosed in the appropriate +      comment syntax for the file format. We also recommend that a +      file or class name and description of purpose be included on the +      same "printed page" as the copyright notice for easier +      identification within third-party archives. + +    Copyright [yyyy] [name of copyright owner] + +    Licensed under the Apache License, Version 2.0 (the "License"); +    you may not use this file except in compliance with the License. +    You may obtain a copy of the License at + +       http://www.apache.org/licenses/LICENSE-2.0 + +    Unless required by applicable law or agreed to in writing, software +    distributed under the License is distributed on an "AS IS" BASIS, +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +    See the License for the specific language governing permissions and +    limitations under the License. 
+ + +---- LLVM Exceptions to the Apache 2.0 License ---- + +As an exception, if, as a result of your compiling your source code, portions +of this Software are embedded into an Object form of such source code, you +may redistribute such embedded portions in such Object form without complying +with the conditions of Sections 4(a), 4(b) and 4(d) of the License. + +In addition, if you combine or link compiled forms of this Software with +software that is licensed under the GPLv2 ("Combined Software") and if a +court of competent jurisdiction determines that the patent provision (Section +3), the indemnity provision (Section 9) or other Section of the License +conflicts with the conditions of the GPLv2, you may retroactively and +prospectively choose to deem waived or otherwise exclude such Section(s) of +the License, but only in their entirety and only with respect to the Combined +Software. + +============================================================================== +Software from third parties included in the LLVM Project: +============================================================================== +The LLVM Project contains third party software which is under different license +terms. All such code will be identified clearly using at least one of two +mechanisms: +1) It will be in a separate directory tree with its own `LICENSE.txt` or +   `LICENSE` file at the top containing the specific license and restrictions +   which apply to that software, or +2) It will contain specific license and restriction terms at the top of every +   file. + +============================================================================== +Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy): +============================================================================== +University of Illinois/NCSA +Open Source License + +Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign. +All rights reserved. + +Developed by: + +    LLVM Team + +    University of Illinois at Urbana-Champaign + +    http://llvm.org + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal with +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +    * Redistributions of source code must retain the above copyright notice, +      this list of conditions and the following disclaimers. + +    * Redistributions in binary form must reproduce the above copyright notice, +      this list of conditions and the following disclaimers in the +      documentation and/or other materials provided with the distribution. + +    * Neither the names of the LLVM Team, University of Illinois at +      Urbana-Champaign, nor the names of its contributors may be used to +      endorse or promote products derived from this Software without specific +      prior written permission. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS +FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  
IN NO EVENT SHALL THE +CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE +SOFTWARE. + + + +====================File: include/llvm/Support/LICENSE.TXT==================== +LLVM System Interface Library +------------------------------------------------------------------------------- +The LLVM System Interface Library is licensed under the Illinois Open Source +License and has the following additional copyright: + +Copyright (C) 2004 eXtensible Systems, Inc. + + +====================NCSA==================== +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/ya.make b/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/ya.make index 0e037d61de9..903dc6ec7f7 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/ya.make +++ b/contrib/libs/llvm12/lib/Target/PowerPC/MCTargetDesc/ya.make @@ -2,18 +2,18 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(  -    Apache-2.0 WITH LLVM-exception AND  -    NCSA  -)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE( +    Apache-2.0 WITH LLVM-exception AND +    NCSA +) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/README.txt b/contrib/libs/llvm12/lib/Target/PowerPC/README.txt index 0902298a4f3..492eb22af2c 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/README.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/README.txt @@ -1,607 +1,607 @@ -//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//  -  -TODO:  -* lmw/stmw pass a la arm load store optimizer for prolog/epilog  -  -===-------------------------------------------------------------------------===  -  -This code:  -  -unsigned add32carry(unsigned sum, unsigned x) {  - unsigned z = sum + x;  - if (sum + x < x)  -     z++;  - return z;  -}  -  -Should compile to something like:  -  -	addc r3,r3,r4  -	addze r3,r3  -  -instead we get:  -  -	add r3, r4, r3  -	cmplw cr7, r3, r4  -	mfcr r4 ; 1  -	rlwinm r4, r4, 29, 31, 31  -	add r3, r3, r4  -  -Ick.  -  -===-------------------------------------------------------------------------===  -  -We compile the hottest inner loop of viterbi to:  -  -        li r6, 0  -        b LBB1_84       ;bb432.i  -LBB1_83:        ;bb420.i  -        lbzx r8, r5, r7  -        addi r6, r7, 1  -        stbx r8, r4, r7  -LBB1_84:        ;bb432.i  -        mr r7, r6  -        cmplwi cr0, r7, 143  -        bne cr0, LBB1_83        ;bb420.i  -  -The CBE manages to produce:  -  -	li r0, 143  -	mtctr r0  -loop:  -	lbzx r2, r2, r11  -	stbx r0, r2, r9  -	addi r2, r2, 1  -	bdz later  -	b loop  -  -This could be much better (bdnz instead of bdz) but it still beats us.  If we  -produced this with bdnz, the loop would be a single dispatch group.  -  -===-------------------------------------------------------------------------===  -  -Lump the constant pool for each function into ONE pic object, and reference  -pieces of it as offsets from the start.  
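An aside on the add32carry note above (editor's sketch; the name add32carry_builtin is invented and this is not taken from the README): with the GCC/Clang __builtin_add_overflow builtin the carry out of the unsigned add can be written directly, instead of being recovered from the (sum + x < x) comparison; whether a particular compiler then produces the addc/addze pair the note asks for is a separate question. (The constant-pool example continues just below.)

/* Hypothetical variant of add32carry, editor's sketch only. */
unsigned add32carry_builtin(unsigned sum, unsigned x) {
  unsigned z;
  /* For unsigned operands, "overflow" of the add is exactly the carry
     that the original version detects with (sum + x < x). */
  if (__builtin_add_overflow(sum, x, &z))
    ++z;
  return z;
}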
For functions like this (contrived  -to have lots of constants obviously):  -  -double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }  -  -We generate:  -  -_X:  -        lis r2, ha16(.CPI_X_0)  -        lfd f0, lo16(.CPI_X_0)(r2)  -        lis r2, ha16(.CPI_X_1)  -        lfd f2, lo16(.CPI_X_1)(r2)  -        fmadd f0, f1, f0, f2  -        lis r2, ha16(.CPI_X_2)  -        lfd f1, lo16(.CPI_X_2)(r2)  -        lis r2, ha16(.CPI_X_3)  -        lfd f2, lo16(.CPI_X_3)(r2)  -        fmadd f1, f0, f1, f2  +//===- README.txt - Notes for improving PowerPC-specific code gen ---------===// + +TODO: +* lmw/stmw pass a la arm load store optimizer for prolog/epilog + +===-------------------------------------------------------------------------=== + +This code: + +unsigned add32carry(unsigned sum, unsigned x) { + unsigned z = sum + x; + if (sum + x < x) +     z++; + return z; +} + +Should compile to something like: + +	addc r3,r3,r4 +	addze r3,r3 + +instead we get: + +	add r3, r4, r3 +	cmplw cr7, r3, r4 +	mfcr r4 ; 1 +	rlwinm r4, r4, 29, 31, 31 +	add r3, r3, r4 + +Ick. + +===-------------------------------------------------------------------------=== + +We compile the hottest inner loop of viterbi to: + +        li r6, 0 +        b LBB1_84       ;bb432.i +LBB1_83:        ;bb420.i +        lbzx r8, r5, r7 +        addi r6, r7, 1 +        stbx r8, r4, r7 +LBB1_84:        ;bb432.i +        mr r7, r6 +        cmplwi cr0, r7, 143 +        bne cr0, LBB1_83        ;bb420.i + +The CBE manages to produce: + +	li r0, 143 +	mtctr r0 +loop: +	lbzx r2, r2, r11 +	stbx r0, r2, r9 +	addi r2, r2, 1 +	bdz later +	b loop + +This could be much better (bdnz instead of bdz) but it still beats us.  If we +produced this with bdnz, the loop would be a single dispatch group. + +===-------------------------------------------------------------------------=== + +Lump the constant pool for each function into ONE pic object, and reference +pieces of it as offsets from the start.  For functions like this (contrived +to have lots of constants obviously): + +double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; } + +We generate: + +_X: +        lis r2, ha16(.CPI_X_0) +        lfd f0, lo16(.CPI_X_0)(r2) +        lis r2, ha16(.CPI_X_1) +        lfd f2, lo16(.CPI_X_1)(r2) +        fmadd f0, f1, f0, f2 +        lis r2, ha16(.CPI_X_2) +        lfd f1, lo16(.CPI_X_2)(r2) +        lis r2, ha16(.CPI_X_3) +        lfd f2, lo16(.CPI_X_3)(r2) +        fmadd f1, f0, f1, f2 +        blr + +It would be better to materialize .CPI_X into a register, then use immediates +off of the register to avoid the lis's.  This is even more important in PIC  +mode. + +Note that this (and the static variable version) is discussed here for GCC: +http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html + +Here's another example (the sgn function): +double testf(double a) { +       return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0); +} + +it produces a BB like this: +LBB1_1: ; cond_true +        lis r2, ha16(LCPI1_0) +        lfs f0, lo16(LCPI1_0)(r2) +        lis r2, ha16(LCPI1_1) +        lis r3, ha16(LCPI1_2) +        lfs f2, lo16(LCPI1_2)(r3) +        lfs f3, lo16(LCPI1_1)(r2) +        fsub f0, f0, f1 +        fsel f1, f0, f2, f3          blr  -  -It would be better to materialize .CPI_X into a register, then use immediates  -off of the register to avoid the lis's.  This is even more important in PIC   -mode.  
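Editor's aside on the constant-pool note above (a source-level analogy only; kPoolX and X_pooled are invented names and this is not what LLVM emits): the "one pool per function, addressed as offsets from a single base" idea corresponds roughly to hand-hoisting the constants into one static table, so that a single base address serves all four loads instead of a lis/lfd pair being materialized per constant.

/* Hand-written analogy of a lumped per-function constant pool. */
static const double kPoolX[4] = { 1.23, 4.512, 2.34, 14.38 };

double X_pooled(double Y) {
  /* One base address (kPoolX), four loads at small fixed offsets from it. */
  return (Y * kPoolX[0] + kPoolX[1]) * kPoolX[2] + kPoolX[3];
}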
-  -Note that this (and the static variable version) is discussed here for GCC:  -http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html  -  -Here's another example (the sgn function):  -double testf(double a) {  -       return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);  -}  -  -it produces a BB like this:  -LBB1_1: ; cond_true  -        lis r2, ha16(LCPI1_0)  -        lfs f0, lo16(LCPI1_0)(r2)  -        lis r2, ha16(LCPI1_1)  -        lis r3, ha16(LCPI1_2)  -        lfs f2, lo16(LCPI1_2)(r3)  -        lfs f3, lo16(LCPI1_1)(r2)  -        fsub f0, f0, f1  -        fsel f1, f0, f2, f3  -        blr   -  -===-------------------------------------------------------------------------===  -  -PIC Code Gen IPO optimization:  -  -Squish small scalar globals together into a single global struct, allowing the   -address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size  -of the GOT on targets with one).  -  -Note that this is discussed here for GCC:  -http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html  -  -===-------------------------------------------------------------------------===  -  -Fold add and sub with constant into non-extern, non-weak addresses so this:  -  -static int a;  -void bar(int b) { a = b; }  -void foo(unsigned char *c) {  -  *c = a;  -}  -  -So that   -  -_foo:  -        lis r2, ha16(_a)  -        la r2, lo16(_a)(r2)  -        lbz r2, 3(r2)  -        stb r2, 0(r3)  -        blr  -  -Becomes  -  -_foo:  -        lis r2, ha16(_a+3)  -        lbz r2, lo16(_a+3)(r2)  -        stb r2, 0(r3)  -        blr  -  -===-------------------------------------------------------------------------===  -  -We should compile these two functions to the same thing:  -  -#include <stdlib.h>  -void f(int a, int b, int *P) {  -  *P = (a-b)>=0?(a-b):(b-a);  -}  -void g(int a, int b, int *P) {  -  *P = abs(a-b);  -}  -  -Further, they should compile to something better than:  -  -_g:  -        subf r2, r4, r3  -        subfic r3, r2, 0  -        cmpwi cr0, r2, -1  -        bgt cr0, LBB2_2 ; entry  -LBB2_1: ; entry  -        mr r2, r3  -LBB2_2: ; entry  -        stw r2, 0(r5)  -        blr  -  -GCC produces:  -  -_g:  -        subf r4,r4,r3  -        srawi r2,r4,31  -        xor r0,r2,r4  -        subf r0,r2,r0  -        stw r0,0(r5)  -        blr  -  -... which is much nicer.  -  -This theoretically may help improve twolf slightly (used in dimbox.c:142?).  -  -===-------------------------------------------------------------------------===  -  -PR5945: This:   -define i32 @clamp0g(i32 %a) {  -entry:  -        %cmp = icmp slt i32 %a, 0  -        %sel = select i1 %cmp, i32 0, i32 %a  -        ret i32 %sel  -}  -  -Is compile to this with the PowerPC (32-bit) backend:  -  -_clamp0g:  -        cmpwi cr0, r3, 0  -        li r2, 0  -        blt cr0, LBB1_2  -; %bb.1:                                                    ; %entry  -        mr r2, r3  -LBB1_2:                                                     ; %entry  -        mr r3, r2  -        blr  -  -This could be reduced to the much simpler:  -  -_clamp0g:  -        srawi r2, r3, 31  -        andc r3, r3, r2  -        blr  -  -===-------------------------------------------------------------------------===  -  -int foo(int N, int ***W, int **TK, int X) {  -  int t, i;  -    -  for (t = 0; t < N; ++t)  -    for (i = 0; i < 4; ++i)  -      W[t / X][i][t % X] = TK[i][t];  -        -  return 5;  -}  -  -We generate relatively atrocious code for this loop compared to gcc.  
-  -We could also strength reduce the rem and the div:  -http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf  -  -===-------------------------------------------------------------------------===  -  -We generate ugly code for this:  -  -void func(unsigned int *ret, float dx, float dy, float dz, float dw) {  -  unsigned code = 0;  -  if(dx < -dw) code |= 1;  -  if(dx > dw)  code |= 2;  -  if(dy < -dw) code |= 4;  -  if(dy > dw)  code |= 8;  -  if(dz < -dw) code |= 16;  -  if(dz > dw)  code |= 32;  -  *ret = code;  -}  -  -===-------------------------------------------------------------------------===  -  -%struct.B = type { i8, [3 x i8] }  -  -define void @bar(%struct.B* %b) {  -entry:  -        %tmp = bitcast %struct.B* %b to i32*              ; <uint*> [#uses=1]  -        %tmp = load i32* %tmp          ; <uint> [#uses=1]  -        %tmp3 = bitcast %struct.B* %b to i32*             ; <uint*> [#uses=1]  -        %tmp4 = load i32* %tmp3                ; <uint> [#uses=1]  -        %tmp8 = bitcast %struct.B* %b to i32*             ; <uint*> [#uses=2]  -        %tmp9 = load i32* %tmp8                ; <uint> [#uses=1]  -        %tmp4.mask17 = shl i32 %tmp4, i8 1          ; <uint> [#uses=1]  -        %tmp1415 = and i32 %tmp4.mask17, 2147483648            ; <uint> [#uses=1]  -        %tmp.masked = and i32 %tmp, 2147483648         ; <uint> [#uses=1]  -        %tmp11 = or i32 %tmp1415, %tmp.masked          ; <uint> [#uses=1]  -        %tmp12 = and i32 %tmp9, 2147483647             ; <uint> [#uses=1]  -        %tmp13 = or i32 %tmp12, %tmp11         ; <uint> [#uses=1]  -        store i32 %tmp13, i32* %tmp8  -        ret void  -}  -  -We emit:  -  -_foo:  -        lwz r2, 0(r3)  -        slwi r4, r2, 1  -        or r4, r4, r2  -        rlwimi r2, r4, 0, 0, 0  -        stw r2, 0(r3)  -        blr  -  -We could collapse a bunch of those ORs and ANDs and generate the following  -equivalent code:  -  -_foo:  -        lwz r2, 0(r3)  -        rlwinm r4, r2, 1, 0, 0  -        or r2, r2, r4  -        stw r2, 0(r3)  -        blr  -  -===-------------------------------------------------------------------------===  -  -Consider a function like this:  -  -float foo(float X) { return X + 1234.4123f; }  -  -The FP constant ends up in the constant pool, so we need to get the LR register.  - This ends up producing code like this:  -  -_foo:  -.LBB_foo_0:     ; entry  -        mflr r11  -***     stw r11, 8(r1)  -        bl "L00000$pb"  -"L00000$pb":  -        mflr r2  -        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")  -        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)  -        fadds f1, f1, f0  -***     lwz r11, 8(r1)  -        mtlr r11  -        blr  -  -This is functional, but there is no reason to spill the LR register all the way  -to the stack (the two marked instrs): spilling it to a GPR is quite enough.  -  -Implementing this will require some codegen improvements.  Nate writes:  -  -"So basically what we need to support the "no stack frame save and restore" is a  -generalization of the LR optimization to "callee-save regs".  -  -Currently, we have LR marked as a callee-save reg.  The register allocator sees  -that it's callee save, and spills it directly to the stack.  -  -Ideally, something like this would happen:  -  -LR would be in a separate register class from the GPRs. The class of LR would be  -marked "unspillable".  
When the register allocator came across an unspillable  -reg, it would ask "what is the best class to copy this into that I *can* spill"  -If it gets a class back, which it will in this case (the gprs), it grabs a free  -register of that class.  If it is then later necessary to spill that reg, so be  -it.  -  -===-------------------------------------------------------------------------===  -  -We compile this:  -int test(_Bool X) {  -  return X ? 524288 : 0;  -}  -  -to:   -_test:  -        cmplwi cr0, r3, 0  -        lis r2, 8  -        li r3, 0  -        beq cr0, LBB1_2 ;entry  -LBB1_1: ;entry  -        mr r3, r2  -LBB1_2: ;entry  -        blr   -  -instead of:  -_test:  -        addic r2,r3,-1  -        subfe r0,r2,r3  -        slwi r3,r0,19  -        blr  -  -This sort of thing occurs a lot due to globalopt.  -  -===-------------------------------------------------------------------------===  -  -We compile:  -  -define i32 @bar(i32 %x) nounwind readnone ssp {  -entry:  -  %0 = icmp eq i32 %x, 0                          ; <i1> [#uses=1]  -  %neg = sext i1 %0 to i32              ; <i32> [#uses=1]  -  ret i32 %neg  -}  -  + +===-------------------------------------------------------------------------=== + +PIC Code Gen IPO optimization: + +Squish small scalar globals together into a single global struct, allowing the  +address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size +of the GOT on targets with one). + +Note that this is discussed here for GCC: +http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html + +===-------------------------------------------------------------------------=== + +Fold add and sub with constant into non-extern, non-weak addresses so this: + +static int a; +void bar(int b) { a = b; } +void foo(unsigned char *c) { +  *c = a; +} + +So that  + +_foo: +        lis r2, ha16(_a) +        la r2, lo16(_a)(r2) +        lbz r2, 3(r2) +        stb r2, 0(r3) +        blr + +Becomes + +_foo: +        lis r2, ha16(_a+3) +        lbz r2, lo16(_a+3)(r2) +        stb r2, 0(r3) +        blr + +===-------------------------------------------------------------------------=== + +We should compile these two functions to the same thing: + +#include <stdlib.h> +void f(int a, int b, int *P) { +  *P = (a-b)>=0?(a-b):(b-a); +} +void g(int a, int b, int *P) { +  *P = abs(a-b); +} + +Further, they should compile to something better than: + +_g: +        subf r2, r4, r3 +        subfic r3, r2, 0 +        cmpwi cr0, r2, -1 +        bgt cr0, LBB2_2 ; entry +LBB2_1: ; entry +        mr r2, r3 +LBB2_2: ; entry +        stw r2, 0(r5) +        blr + +GCC produces: + +_g: +        subf r4,r4,r3 +        srawi r2,r4,31 +        xor r0,r2,r4 +        subf r0,r2,r0 +        stw r0,0(r5) +        blr + +... which is much nicer. + +This theoretically may help improve twolf slightly (used in dimbox.c:142?). 
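For reference, the branch-free form that GCC's srawi/xor/subf sequence implements, written as C (a sketch; it assumes >> of a negative int is an arithmetic shift, which holds for the PPC ABIs):

void g_branchless(int a, int b, int *P) {
  int d = a - b;
  int m = d >> 31;      /* 0 when d >= 0, -1 when d < 0    (srawi)     */
  *P = (d ^ m) - m;     /* conditionally negate d          (xor, subf) */
}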
+ +===-------------------------------------------------------------------------=== + +PR5945: This:  +define i32 @clamp0g(i32 %a) { +entry: +        %cmp = icmp slt i32 %a, 0 +        %sel = select i1 %cmp, i32 0, i32 %a +        ret i32 %sel +} + +Is compile to this with the PowerPC (32-bit) backend: + +_clamp0g: +        cmpwi cr0, r3, 0 +        li r2, 0 +        blt cr0, LBB1_2 +; %bb.1:                                                    ; %entry +        mr r2, r3 +LBB1_2:                                                     ; %entry +        mr r3, r2 +        blr + +This could be reduced to the much simpler: + +_clamp0g: +        srawi r2, r3, 31 +        andc r3, r3, r2 +        blr + +===-------------------------------------------------------------------------=== + +int foo(int N, int ***W, int **TK, int X) { +  int t, i; +   +  for (t = 0; t < N; ++t) +    for (i = 0; i < 4; ++i) +      W[t / X][i][t % X] = TK[i][t]; +       +  return 5; +} + +We generate relatively atrocious code for this loop compared to gcc. + +We could also strength reduce the rem and the div: +http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf + +===-------------------------------------------------------------------------=== + +We generate ugly code for this: + +void func(unsigned int *ret, float dx, float dy, float dz, float dw) { +  unsigned code = 0; +  if(dx < -dw) code |= 1; +  if(dx > dw)  code |= 2; +  if(dy < -dw) code |= 4; +  if(dy > dw)  code |= 8; +  if(dz < -dw) code |= 16; +  if(dz > dw)  code |= 32; +  *ret = code; +} + +===-------------------------------------------------------------------------=== + +%struct.B = type { i8, [3 x i8] } + +define void @bar(%struct.B* %b) { +entry: +        %tmp = bitcast %struct.B* %b to i32*              ; <uint*> [#uses=1] +        %tmp = load i32* %tmp          ; <uint> [#uses=1] +        %tmp3 = bitcast %struct.B* %b to i32*             ; <uint*> [#uses=1] +        %tmp4 = load i32* %tmp3                ; <uint> [#uses=1] +        %tmp8 = bitcast %struct.B* %b to i32*             ; <uint*> [#uses=2] +        %tmp9 = load i32* %tmp8                ; <uint> [#uses=1] +        %tmp4.mask17 = shl i32 %tmp4, i8 1          ; <uint> [#uses=1] +        %tmp1415 = and i32 %tmp4.mask17, 2147483648            ; <uint> [#uses=1] +        %tmp.masked = and i32 %tmp, 2147483648         ; <uint> [#uses=1] +        %tmp11 = or i32 %tmp1415, %tmp.masked          ; <uint> [#uses=1] +        %tmp12 = and i32 %tmp9, 2147483647             ; <uint> [#uses=1] +        %tmp13 = or i32 %tmp12, %tmp11         ; <uint> [#uses=1] +        store i32 %tmp13, i32* %tmp8 +        ret void +} + +We emit: + +_foo: +        lwz r2, 0(r3) +        slwi r4, r2, 1 +        or r4, r4, r2 +        rlwimi r2, r4, 0, 0, 0 +        stw r2, 0(r3) +        blr + +We could collapse a bunch of those ORs and ANDs and generate the following +equivalent code: + +_foo: +        lwz r2, 0(r3) +        rlwinm r4, r2, 1, 0, 0 +        or r2, r2, r4 +        stw r2, 0(r3) +        blr + +===-------------------------------------------------------------------------=== + +Consider a function like this: + +float foo(float X) { return X + 1234.4123f; } + +The FP constant ends up in the constant pool, so we need to get the LR register. 
+ This ends up producing code like this: + +_foo: +.LBB_foo_0:     ; entry +        mflr r11 +***     stw r11, 8(r1) +        bl "L00000$pb" +"L00000$pb": +        mflr r2 +        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb") +        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2) +        fadds f1, f1, f0 +***     lwz r11, 8(r1) +        mtlr r11 +        blr + +This is functional, but there is no reason to spill the LR register all the way +to the stack (the two marked instrs): spilling it to a GPR is quite enough. + +Implementing this will require some codegen improvements.  Nate writes: + +"So basically what we need to support the "no stack frame save and restore" is a +generalization of the LR optimization to "callee-save regs". + +Currently, we have LR marked as a callee-save reg.  The register allocator sees +that it's callee save, and spills it directly to the stack. + +Ideally, something like this would happen: + +LR would be in a separate register class from the GPRs. The class of LR would be +marked "unspillable".  When the register allocator came across an unspillable +reg, it would ask "what is the best class to copy this into that I *can* spill" +If it gets a class back, which it will in this case (the gprs), it grabs a free +register of that class.  If it is then later necessary to spill that reg, so be +it. + +===-------------------------------------------------------------------------=== + +We compile this: +int test(_Bool X) { +  return X ? 524288 : 0; +} +  to:  -  -_bar:  -	cntlzw r2, r3  -	slwi r2, r2, 26  -	srawi r3, r2, 31  -	blr   -  -it would be better to produce:  -  -_bar:   -        addic r3,r3,-1  -        subfe r3,r3,r3  +_test: +        cmplwi cr0, r3, 0 +        lis r2, 8 +        li r3, 0 +        beq cr0, LBB1_2 ;entry +LBB1_1: ;entry +        mr r3, r2 +LBB1_2: ;entry          blr  -  -===-------------------------------------------------------------------------===  -  -We generate horrible ppc code for this:  -  -#define N  2000000  -double   a[N],c[N];  -void simpleloop() {  -   int j;  -   for (j=0; j<N; j++)  -     c[j] = a[j];  -}  -  -LBB1_1: ;bb  -        lfdx f0, r3, r4  -        addi r5, r5, 1                 ;; Extra IV for the exit value compare.  -        stfdx f0, r2, r4  -        addi r4, r4, 8  -  -        xoris r6, r5, 30               ;; This is due to a large immediate.  -        cmplwi cr0, r6, 33920  -        bne cr0, LBB1_1  -  -//===---------------------------------------------------------------------===//  -  -This:  -        #include <algorithm>  -        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)  -        { return std::make_pair(a + b, a + b < a); }  -        bool no_overflow(unsigned a, unsigned b)  -        { return !full_add(a, b).second; }  -  -Should compile to:  -  -__Z11no_overflowjj:  -        add r4,r3,r4  -        subfc r3,r3,r4  -        li r3,0  -        adde r3,r3,r3  + +instead of: +_test: +        addic r2,r3,-1 +        subfe r0,r2,r3 +        slwi r3,r0,19 +        blr + +This sort of thing occurs a lot due to globalopt. 
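Branch-free C forms that compute the same value (illustrative only; 524288 == 1 << 19, and a _Bool is already 0 or 1):

int test_shift(_Bool X) { return (int)X << 19; }       /* 0 or 524288               */
int test_mask(_Bool X)  { return -(int)X & 524288; }   /* AND with a 0 or -1 mask   */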
+ +===-------------------------------------------------------------------------=== + +We compile: + +define i32 @bar(i32 %x) nounwind readnone ssp { +entry: +  %0 = icmp eq i32 %x, 0                          ; <i1> [#uses=1] +  %neg = sext i1 %0 to i32              ; <i32> [#uses=1] +  ret i32 %neg +} + +to: + +_bar: +	cntlzw r2, r3 +	slwi r2, r2, 26 +	srawi r3, r2, 31 +	blr  + +it would be better to produce: + +_bar:  +        addic r3,r3,-1 +        subfe r3,r3,r3 +        blr + +===-------------------------------------------------------------------------=== + +We generate horrible ppc code for this: + +#define N  2000000 +double   a[N],c[N]; +void simpleloop() { +   int j; +   for (j=0; j<N; j++) +     c[j] = a[j]; +} + +LBB1_1: ;bb +        lfdx f0, r3, r4 +        addi r5, r5, 1                 ;; Extra IV for the exit value compare. +        stfdx f0, r2, r4 +        addi r4, r4, 8 + +        xoris r6, r5, 30               ;; This is due to a large immediate. +        cmplwi cr0, r6, 33920 +        bne cr0, LBB1_1 + +//===---------------------------------------------------------------------===// + +This: +        #include <algorithm> +        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b) +        { return std::make_pair(a + b, a + b < a); } +        bool no_overflow(unsigned a, unsigned b) +        { return !full_add(a, b).second; } + +Should compile to: + +__Z11no_overflowjj: +        add r4,r3,r4 +        subfc r3,r3,r4 +        li r3,0 +        adde r3,r3,r3 +        blr + +(or better) not: + +__Z11no_overflowjj: +        add r2, r4, r3 +        cmplw cr7, r2, r3 +        mfcr r2 +        rlwinm r2, r2, 29, 31, 31 +        xori r3, r2, 1          blr  -  -(or better) not:  -  -__Z11no_overflowjj:  -        add r2, r4, r3  -        cmplw cr7, r2, r3  -        mfcr r2  -        rlwinm r2, r2, 29, 31, 31  -        xori r3, r2, 1  -        blr   -  -//===---------------------------------------------------------------------===//  -  -We compile some FP comparisons into an mfcr with two rlwinms and an or.  For  -example:  -#include <math.h>  -int test(double x, double y) { return islessequal(x, y);}  -int test2(double x, double y) {  return islessgreater(x, y);}  -int test3(double x, double y) {  return !islessequal(x, y);}  -  -Compiles into (all three are similar, but the bits differ):  -  -_test:  -	fcmpu cr7, f1, f2  -	mfcr r2  -	rlwinm r3, r2, 29, 31, 31  -	rlwinm r2, r2, 31, 31, 31  -	or r3, r2, r3  -	blr   -  -GCC compiles this into:  -  - _test:  -	fcmpu cr7,f1,f2  -	cror 30,28,30  -	mfcr r3  -	rlwinm r3,r3,31,1  + +//===---------------------------------------------------------------------===// + +We compile some FP comparisons into an mfcr with two rlwinms and an or.  For +example: +#include <math.h> +int test(double x, double y) { return islessequal(x, y);} +int test2(double x, double y) {  return islessgreater(x, y);} +int test3(double x, double y) {  return !islessequal(x, y);} + +Compiles into (all three are similar, but the bits differ): + +_test: +	fcmpu cr7, f1, f2 +	mfcr r2 +	rlwinm r3, r2, 29, 31, 31 +	rlwinm r2, r2, 31, 31, 31 +	or r3, r2, r3 +	blr  + +GCC compiles this into: + + _test: +	fcmpu cr7,f1,f2 +	cror 30,28,30 +	mfcr r3 +	rlwinm r3,r3,31,1 +	blr +         +which is more efficient and can use mfocr.  See PR642 for some more context. 
+ +//===---------------------------------------------------------------------===// + +void foo(float *data, float d) { +   long i; +   for (i = 0; i < 8000; i++) +      data[i] = d; +} +void foo2(float *data, float d) { +   long i; +   data--; +   for (i = 0; i < 8000; i++) { +      data[1] = d; +      data++; +   } +} + +These compile to: + +_foo: +	li r2, 0 +LBB1_1:	; bb +	addi r4, r2, 4 +	stfsx f1, r3, r2 +	cmplwi cr0, r4, 32000 +	mr r2, r4 +	bne cr0, LBB1_1	; bb +	blr  +_foo2: +	li r2, 0 +LBB2_1:	; bb +	addi r4, r2, 4 +	stfsx f1, r3, r2 +	cmplwi cr0, r4, 32000 +	mr r2, r4 +	bne cr0, LBB2_1	; bb  	blr  -          -which is more efficient and can use mfocr.  See PR642 for some more context.  -  -//===---------------------------------------------------------------------===//  -  -void foo(float *data, float d) {  -   long i;  -   for (i = 0; i < 8000; i++)  -      data[i] = d;  -}  -void foo2(float *data, float d) {  -   long i;  -   data--;  -   for (i = 0; i < 8000; i++) {  -      data[1] = d;  -      data++;  -   }  -}  -  -These compile to:  -  -_foo:  -	li r2, 0  -LBB1_1:	; bb  -	addi r4, r2, 4  -	stfsx f1, r3, r2  -	cmplwi cr0, r4, 32000  -	mr r2, r4  -	bne cr0, LBB1_1	; bb  -	blr   -_foo2:  -	li r2, 0  -LBB2_1:	; bb  -	addi r4, r2, 4  -	stfsx f1, r3, r2  -	cmplwi cr0, r4, 32000  -	mr r2, r4  -	bne cr0, LBB2_1	; bb  -	blr   -  -The 'mr' could be eliminated to folding the add into the cmp better.  -  -//===---------------------------------------------------------------------===//  -Codegen for the following (low-probability) case deteriorated considerably   -when the correctness fixes for unordered comparisons went in (PR 642, 58871).  -It should be possible to recover the code quality described in the comments.  -  -; RUN: llvm-as < %s | llc -march=ppc32  | grep or | count 3  -; This should produce one 'or' or 'cror' instruction per function.  -  -; RUN: llvm-as < %s | llc -march=ppc32  | grep mfcr | count 3  -; PR2964  -  -define i32 @test(double %x, double %y) nounwind  {  -entry:  -	%tmp3 = fcmp ole double %x, %y		; <i1> [#uses=1]  -	%tmp345 = zext i1 %tmp3 to i32		; <i32> [#uses=1]  -	ret i32 %tmp345  -}  -  -define i32 @test2(double %x, double %y) nounwind  {  -entry:  -	%tmp3 = fcmp one double %x, %y		; <i1> [#uses=1]  -	%tmp345 = zext i1 %tmp3 to i32		; <i32> [#uses=1]  -	ret i32 %tmp345  -}  -  -define i32 @test3(double %x, double %y) nounwind  {  -entry:  -	%tmp3 = fcmp ugt double %x, %y		; <i1> [#uses=1]  -	%tmp34 = zext i1 %tmp3 to i32		; <i32> [#uses=1]  -	ret i32 %tmp34  -}  -  -//===---------------------------------------------------------------------===//  -for the following code:  -  -void foo (float *__restrict__ a, int *__restrict__ b, int n) {  -      a[n] = b[n]  * 2.321;  -}  -  -we load b[n] to GPR, then move it VSX register and convert it float. We should   -use vsx scalar integer load instructions to avoid direct moves  -  -//===----------------------------------------------------------------------===//  -; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg  -  -; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and   -; should not be generated except with -enable-finite-only-fp-math or the like).  -; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to  -; recognize a more elaborate tree than a simple SETxx.  
-  -define double @test_FNEG_sel(double %A, double %B, double %C) {  -        %D = fsub double -0.000000e+00, %A               ; <double> [#uses=1]  -        %Cond = fcmp ugt double %D, -0.000000e+00               ; <i1> [#uses=1]  -        %E = select i1 %Cond, double %B, double %C              ; <double> [#uses=1]  -        ret double %E  -}  -  -//===----------------------------------------------------------------------===//  -The save/restore sequence for CR in prolog/epilog is terrible:  -- Each CR subreg is saved individually, rather than doing one save as a unit.  -- On Darwin, the save is done after the decrement of SP, which means the offset  -from SP of the save slot can be too big for a store instruction, which means we  -need an additional register (currently hacked in 96015+96020; the solution there  -is correct, but poor).  -- On SVR4 the same thing can happen, and I don't think saving before the SP  -decrement is safe on that target, as there is no red zone.  This is currently  -broken AFAIK, although it's not a target I can exercise.  -The following demonstrates the problem:  -extern void bar(char *p);  -void foo() {  -  char x[100000];  -  bar(x);  -  __asm__("" ::: "cr2");  -}  -  -//===-------------------------------------------------------------------------===  -Naming convention for instruction formats is very haphazard.  -We have agreed on a naming scheme as follows:  -  -<INST_form>{_<OP_type><OP_len>}+  -  -Where:  -INST_form is the instruction format (X-form, etc.)  -OP_type is the operand type - one of OPC (opcode), RD (register destination),  -                              RS (register source),  -                              RDp (destination register pair),  -                              RSp (source register pair), IM (immediate),  -                              XO (extended opcode)  -OP_len is the length of the operand in bits  -  -VSX register operands would be of length 6 (split across two fields),  -condition register fields of length 3.  -We would not need denote reserved fields in names of instruction formats.  -  -//===----------------------------------------------------------------------===//  -  -Instruction fusion was introduced in ISA 2.06 and more opportunities added in  -ISA 2.07.  LLVM needs to add infrastructure to recognize fusion opportunities  -and force instruction pairs to be scheduled together.  -  ------------------------------------------------------------------------------  -  -More general handling of any_extend and zero_extend:  -  -See https://reviews.llvm.org/D24924#555306  + +The 'mr' could be eliminated to folding the add into the cmp better. + +//===---------------------------------------------------------------------===// +Codegen for the following (low-probability) case deteriorated considerably  +when the correctness fixes for unordered comparisons went in (PR 642, 58871). +It should be possible to recover the code quality described in the comments. + +; RUN: llvm-as < %s | llc -march=ppc32  | grep or | count 3 +; This should produce one 'or' or 'cror' instruction per function. 
+ +; RUN: llvm-as < %s | llc -march=ppc32  | grep mfcr | count 3 +; PR2964 + +define i32 @test(double %x, double %y) nounwind  { +entry: +	%tmp3 = fcmp ole double %x, %y		; <i1> [#uses=1] +	%tmp345 = zext i1 %tmp3 to i32		; <i32> [#uses=1] +	ret i32 %tmp345 +} + +define i32 @test2(double %x, double %y) nounwind  { +entry: +	%tmp3 = fcmp one double %x, %y		; <i1> [#uses=1] +	%tmp345 = zext i1 %tmp3 to i32		; <i32> [#uses=1] +	ret i32 %tmp345 +} + +define i32 @test3(double %x, double %y) nounwind  { +entry: +	%tmp3 = fcmp ugt double %x, %y		; <i1> [#uses=1] +	%tmp34 = zext i1 %tmp3 to i32		; <i32> [#uses=1] +	ret i32 %tmp34 +} + +//===---------------------------------------------------------------------===// +for the following code: + +void foo (float *__restrict__ a, int *__restrict__ b, int n) { +      a[n] = b[n]  * 2.321; +} + +we load b[n] to GPR, then move it VSX register and convert it float. We should  +use vsx scalar integer load instructions to avoid direct moves + +//===----------------------------------------------------------------------===// +; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg + +; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and  +; should not be generated except with -enable-finite-only-fp-math or the like). +; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to +; recognize a more elaborate tree than a simple SETxx. + +define double @test_FNEG_sel(double %A, double %B, double %C) { +        %D = fsub double -0.000000e+00, %A               ; <double> [#uses=1] +        %Cond = fcmp ugt double %D, -0.000000e+00               ; <i1> [#uses=1] +        %E = select i1 %Cond, double %B, double %C              ; <double> [#uses=1] +        ret double %E +} + +//===----------------------------------------------------------------------===// +The save/restore sequence for CR in prolog/epilog is terrible: +- Each CR subreg is saved individually, rather than doing one save as a unit. +- On Darwin, the save is done after the decrement of SP, which means the offset +from SP of the save slot can be too big for a store instruction, which means we +need an additional register (currently hacked in 96015+96020; the solution there +is correct, but poor). +- On SVR4 the same thing can happen, and I don't think saving before the SP +decrement is safe on that target, as there is no red zone.  This is currently +broken AFAIK, although it's not a target I can exercise. +The following demonstrates the problem: +extern void bar(char *p); +void foo() { +  char x[100000]; +  bar(x); +  __asm__("" ::: "cr2"); +} + +//===-------------------------------------------------------------------------=== +Naming convention for instruction formats is very haphazard. +We have agreed on a naming scheme as follows: + +<INST_form>{_<OP_type><OP_len>}+ + +Where: +INST_form is the instruction format (X-form, etc.) +OP_type is the operand type - one of OPC (opcode), RD (register destination), +                              RS (register source), +                              RDp (destination register pair), +                              RSp (source register pair), IM (immediate), +                              XO (extended opcode) +OP_len is the length of the operand in bits + +VSX register operands would be of length 6 (split across two fields), +condition register fields of length 3. +We would not need denote reserved fields in names of instruction formats. 
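As an illustrative reading of the scheme (the concrete spelling here is made up, not taken from PPCInstrFormats.td): an X-form instruction with a 6-bit opcode, a 5-bit destination register, two 5-bit source registers and a 10-bit extended opcode would be named something like XForm_OPC6_RD5_RS5_RS5_XO10.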
+ +//===----------------------------------------------------------------------===// + +Instruction fusion was introduced in ISA 2.06 and more opportunities added in +ISA 2.07.  LLVM needs to add infrastructure to recognize fusion opportunities +and force instruction pairs to be scheduled together. + +----------------------------------------------------------------------------- + +More general handling of any_extend and zero_extend: + +See https://reviews.llvm.org/D24924#555306 diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/README_ALTIVEC.txt b/contrib/libs/llvm12/lib/Target/PowerPC/README_ALTIVEC.txt index 47d18ecfca6..6d32e76ed8d 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/README_ALTIVEC.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/README_ALTIVEC.txt @@ -1,338 +1,338 @@ -//===- README_ALTIVEC.txt - Notes for improving Altivec code gen ----------===//  -  -Implement PPCInstrInfo::isLoadFromStackSlot/isStoreToStackSlot for vector  -registers, to generate better spill code.  -  -//===----------------------------------------------------------------------===//  -  -The first should be a single lvx from the constant pool, the second should be   -a xor/stvx:  -  -void foo(void) {  -  int x[8] __attribute__((aligned(128))) = { 1, 1, 1, 17, 1, 1, 1, 1 };  -  bar (x);  -}  -  -#include <string.h>  -void foo(void) {  -  int x[8] __attribute__((aligned(128)));  -  memset (x, 0, sizeof (x));  -  bar (x);  -}  -  -//===----------------------------------------------------------------------===//  -  -Altivec: Codegen'ing MUL with vector FMADD should add -0.0, not 0.0:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8763  -  -When -ffast-math is on, we can use 0.0.  -  -//===----------------------------------------------------------------------===//  -  -  Consider this:  -  v4f32 Vector;  -  v4f32 Vector2 = { Vector.X, Vector.X, Vector.X, Vector.X };  -  -Since we know that "Vector" is 16-byte aligned and we know the element offset   -of ".X", we should change the load into a lve*x instruction, instead of doing  -a load/store/lve*x sequence.  -  -//===----------------------------------------------------------------------===//  -  -Implement passing vectors by value into calls and receiving them as arguments.  -  -//===----------------------------------------------------------------------===//  -  -GCC apparently tries to codegen { C1, C2, Variable, C3 } as a constant pool load  -of C1/C2/C3, then a load and vperm of Variable.  -  -//===----------------------------------------------------------------------===//  -  -We need a way to teach tblgen that some operands of an intrinsic are required to  -be constants.  The verifier should enforce this constraint.  -  -//===----------------------------------------------------------------------===//  -  -We currently codegen SCALAR_TO_VECTOR as a store of the scalar to a 16-byte  -aligned stack slot, followed by a load/vperm.  We should probably just store it  -to a scalar stack slot, then use lvsl/vperm to load it.  If the value is already  -in memory this is a big win.  -  -//===----------------------------------------------------------------------===//  -  -extract_vector_elt of an arbitrary constant vector can be done with the   -following instructions:  -  -vTemp = vec_splat(v0,2);    // 2 is the element the src is in.  -vec_ste(&destloc,0,vTemp);  -  -We can do an arbitrary non-constant value by using lvsr/perm/ste.  
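Spelled out as a compilable AltiVec intrinsic sequence (a sketch with an illustrative function name; per the AltiVec PIM the store-element intrinsic takes the pointer last, vec_ste(vec, offset, ptr)):

#include <altivec.h>

void store_elt2(float *destloc, vector float v0) {
  vector float vTemp = vec_splat(v0, 2);   /* broadcast element 2 into every lane */
  vec_ste(vTemp, 0, destloc);              /* store one element to *destloc       */
}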
-  -//===----------------------------------------------------------------------===//  -  -If we want to tie instruction selection into the scheduler, we can do some  -constant formation with different instructions.  For example, we can generate  -"vsplti -1" with "vcmpequw R,R" and 1,1,1,1 with "vsubcuw R,R", and 0,0,0,0 with  -"vsplti 0" or "vxor", each of which use different execution units, thus could  -help scheduling.  -  -This is probably only reasonable for a post-pass scheduler.  -  -//===----------------------------------------------------------------------===//  -  -For this function:  -  -void test(vector float *A, vector float *B) {  -  vector float C = (vector float)vec_cmpeq(*A, *B);  -  if (!vec_any_eq(*A, *B))  -    *B = (vector float){0,0,0,0};  -  *A = C;  -}  -  -we get the following basic block:  -  -	...  -        lvx v2, 0, r4  -        lvx v3, 0, r3  -        vcmpeqfp v4, v3, v2  -        vcmpeqfp. v2, v3, v2  -        bne cr6, LBB1_2 ; cond_next  -  -The vcmpeqfp/vcmpeqfp. instructions currently cannot be merged when the  -vcmpeqfp. result is used by a branch.  This can be improved.  -  -//===----------------------------------------------------------------------===//  -  -The code generated for this is truly aweful:  -  -vector float test(float a, float b) {  - return (vector float){ 0.0, a, 0.0, 0.0};   -}  -  -LCPI1_0:                                        ;  float  -        .space  4  -        .text  -        .globl  _test  -        .align  4  -_test:  -        mfspr r2, 256  -        oris r3, r2, 4096  -        mtspr 256, r3  -        lis r3, ha16(LCPI1_0)  -        addi r4, r1, -32  -        stfs f1, -16(r1)  -        addi r5, r1, -16  -        lfs f0, lo16(LCPI1_0)(r3)  -        stfs f0, -32(r1)  -        lvx v2, 0, r4  -        lvx v3, 0, r5  -        vmrghw v3, v3, v2  -        vspltw v2, v2, 0  -        vmrghw v2, v2, v3  -        mtspr 256, r2  -        blr  -  -//===----------------------------------------------------------------------===//  -  -int foo(vector float *x, vector float *y) {  -        if (vec_all_eq(*x,*y)) return 3245;   -        else return 12;  -}  -  -A predicate compare being used in a select_cc should have the same peephole  -applied to it as a predicate compare used by a br_cc.  There should be no  -mfcr here:  -  -_foo:  -        mfspr r2, 256  -        oris r5, r2, 12288  -        mtspr 256, r5  -        li r5, 12  -        li r6, 3245  -        lvx v2, 0, r4  -        lvx v3, 0, r3  -        vcmpeqfp. v2, v3, v2  -        mfcr r3, 2  -        rlwinm r3, r3, 25, 31, 31  -        cmpwi cr0, r3, 0  -        bne cr0, LBB1_2 ; entry  -LBB1_1: ; entry  -        mr r6, r5  -LBB1_2: ; entry  -        mr r3, r6  -        mtspr 256, r2  -        blr  -  -//===----------------------------------------------------------------------===//  -  -CodeGen/PowerPC/vec_constants.ll has an and operation that should be  -codegen'd to andc.  The issue is that the 'all ones' build vector is  -SelectNodeTo'd a VSPLTISB instruction node before the and/xor is selected  -which prevents the vnot pattern from matching.  -  -  -//===----------------------------------------------------------------------===//  -  -An alternative to the store/store/load approach for illegal insert element   -lowering would be:  -  -1. store element to any ol' slot  -2. lvx the slot  -3. lvsl 0; splat index; vcmpeq to generate a select mask  -4. lvsl slot + x; vperm to rotate result into correct slot  -5. vsel result together.  
-  -//===----------------------------------------------------------------------===//  -  -Should codegen branches on vec_any/vec_all to avoid mfcr.  Two examples:  -  -#include <altivec.h>  - int f(vector float a, vector float b)  - {  -  int aa = 0;  -  if (vec_all_ge(a, b))  -    aa |= 0x1;  -  if (vec_any_ge(a,b))  -    aa |= 0x2;  -  return aa;  -}  -  -vector float f(vector float a, vector float b) {   -  if (vec_any_eq(a, b))   -    return a;   -  else   -    return b;   -}  -  -//===----------------------------------------------------------------------===//  -  -We should do a little better with eliminating dead stores.  -The stores to the stack are dead since %a and %b are not needed  -  -; Function Attrs: nounwind  -define <16 x i8> @test_vpmsumb() #0 {  -  entry:  -  %a = alloca <16 x i8>, align 16  -  %b = alloca <16 x i8>, align 16  -  store <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, <16 x i8>* %a, align 16  -  store <16 x i8> <i8 113, i8 114, i8 115, i8 116, i8 117, i8 118, i8 119, i8 120, i8 121, i8 122, i8 123, i8 124, i8 125, i8 126, i8 127, i8 112>, <16 x i8>* %b, align 16  -  %0 = load <16 x i8>* %a, align 16  -  %1 = load <16 x i8>* %b, align 16  -  %2 = call <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8> %0, <16 x i8> %1)  -  ret <16 x i8> %2  -}  -  -  -; Function Attrs: nounwind readnone  -declare <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8>, <16 x i8>) #1  -  -  -Produces the following code with -mtriple=powerpc64-unknown-linux-gnu:  -# %bb.0:                                # %entry  -    addis 3, 2, .LCPI0_0@toc@ha  -    addis 4, 2, .LCPI0_1@toc@ha  -    addi 3, 3, .LCPI0_0@toc@l  -    addi 4, 4, .LCPI0_1@toc@l  -    lxvw4x 0, 0, 3  -    addi 3, 1, -16  -    lxvw4x 35, 0, 4  -    stxvw4x 0, 0, 3  -    ori 2, 2, 0  -    lxvw4x 34, 0, 3  -    addi 3, 1, -32  -    stxvw4x 35, 0, 3  -    vpmsumb 2, 2, 3  -    blr  -    .long   0  -    .quad   0  -  -The two stxvw4x instructions are not needed.  -With -mtriple=powerpc64le-unknown-linux-gnu, the associated permutes  -are present too.  -  -//===----------------------------------------------------------------------===//  -  -The following example is found in test/CodeGen/PowerPC/vec_add_sub_doubleword.ll:  -  -define <2 x i64> @increment_by_val(<2 x i64> %x, i64 %val) nounwind {  -       %tmpvec = insertelement <2 x i64> <i64 0, i64 0>, i64 %val, i32 0  -       %tmpvec2 = insertelement <2 x i64> %tmpvec, i64 %val, i32 1  -       %result = add <2 x i64> %x, %tmpvec2  -       ret <2 x i64> %result  -  -This will generate the following instruction sequence:  -        std 5, -8(1)  -        std 5, -16(1)  -        addi 3, 1, -16  -        ori 2, 2, 0  -        lxvd2x 35, 0, 3  -        vaddudm 2, 2, 3  -        blr  -  -This will almost certainly cause a load-hit-store hazard.    -Since val is a value parameter, it should not need to be saved onto  -the stack, unless it's being done set up the vector register. Instead,  -it would be better to splat the value into a vector register, and then  -remove the (dead) stores to the stack.  -  -//===----------------------------------------------------------------------===//  -  -At the moment we always generate a lxsdx in preference to lfd, or stxsdx in  -preference to stfd.  When we have a reg-immediate addressing mode, this is a  -poor choice, since we have to load the address into an index register.  This  -should be fixed for P7/P8.   
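A small case where the D-form wins (a sketch; the struct and function names are illustrative): the field sits at a known small offset from the base pointer, so lfd can fold the offset into the load, while the indexed lxsdx would first need an addi to materialize the address in an index register.

struct pair { double a, b; };

double load_b(const struct pair *p) {
  return p->b;          /* reg+imm slot: ideal for lfd, awkward for lxsdx */
}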
-  -//===----------------------------------------------------------------------===//  -  -Right now, ShuffleKind 0 is supported only on BE, and ShuffleKind 2 only on LE.  -However, we could actually support both kinds on either endianness, if we check  -for the appropriate shufflevector pattern for each case ...  this would cause  -some additional shufflevectors to be recognized and implemented via the  -"swapped" form.  -  -//===----------------------------------------------------------------------===//  -  -There is a utility program called PerfectShuffle that generates a table of the  -shortest instruction sequence for implementing a shufflevector operation on  -PowerPC.  However, this was designed for big-endian code generation.  We could  -modify this program to create a little endian version of the table.  The table  -is used in PPCISelLowering.cpp, PPCTargetLowering::LOWERVECTOR_SHUFFLE().  -  -//===----------------------------------------------------------------------===//  -  -Opportunies to use instructions from PPCInstrVSX.td during code gen  -  - Conversion instructions (Sections 7.6.1.5 and 7.6.1.6 of ISA 2.07)  -  - Scalar comparisons (xscmpodp and xscmpudp)  -  - Min and max (xsmaxdp, xsmindp, xvmaxdp, xvmindp, xvmaxsp, xvminsp)  -  -Related to this: we currently do not generate the lxvw4x instruction for either  -v4f32 or v4i32, probably because adding a dag pattern to the recognizer requires  -a single target type.  This should probably be addressed in the PPCISelDAGToDAG logic.  -  -//===----------------------------------------------------------------------===//  -  -Currently EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT are type-legal only  -for v2f64 with VSX available.  We should create custom lowering  -support for the other vector types.  Without this support, we generate  -sequences with load-hit-store hazards.  -  -v4f32 can be supported with VSX by shifting the correct element into  -big-endian lane 0, using xscvspdpn to produce a double-precision  -representation of the single-precision value in big-endian  -double-precision lane 0, and reinterpreting lane 0 as an FPR or  -vector-scalar register.  -  -v2i64 can be supported with VSX and P8Vector in the same manner as  -v2f64, followed by a direct move to a GPR.  -  -v4i32 can be supported with VSX and P8Vector by shifting the correct  -element into big-endian lane 1, using a direct move to a GPR, and  -sign-extending the 32-bit result to 64 bits.  -  -v8i16 can be supported with VSX and P8Vector by shifting the correct  -element into big-endian lane 3, using a direct move to a GPR, and  -sign-extending the 16-bit result to 64 bits.  -  -v16i8 can be supported with VSX and P8Vector by shifting the correct  -element into big-endian lane 7, using a direct move to a GPR, and  -sign-extending the 8-bit result to 64 bits.  +//===- README_ALTIVEC.txt - Notes for improving Altivec code gen ----------===// + +Implement PPCInstrInfo::isLoadFromStackSlot/isStoreToStackSlot for vector +registers, to generate better spill code. 
+ +//===----------------------------------------------------------------------===// + +The first should be a single lvx from the constant pool, the second should be  +a xor/stvx: + +void foo(void) { +  int x[8] __attribute__((aligned(128))) = { 1, 1, 1, 17, 1, 1, 1, 1 }; +  bar (x); +} + +#include <string.h> +void foo(void) { +  int x[8] __attribute__((aligned(128))); +  memset (x, 0, sizeof (x)); +  bar (x); +} + +//===----------------------------------------------------------------------===// + +Altivec: Codegen'ing MUL with vector FMADD should add -0.0, not 0.0: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8763 + +When -ffast-math is on, we can use 0.0. + +//===----------------------------------------------------------------------===// + +  Consider this: +  v4f32 Vector; +  v4f32 Vector2 = { Vector.X, Vector.X, Vector.X, Vector.X }; + +Since we know that "Vector" is 16-byte aligned and we know the element offset  +of ".X", we should change the load into a lve*x instruction, instead of doing +a load/store/lve*x sequence. + +//===----------------------------------------------------------------------===// + +Implement passing vectors by value into calls and receiving them as arguments. + +//===----------------------------------------------------------------------===// + +GCC apparently tries to codegen { C1, C2, Variable, C3 } as a constant pool load +of C1/C2/C3, then a load and vperm of Variable. + +//===----------------------------------------------------------------------===// + +We need a way to teach tblgen that some operands of an intrinsic are required to +be constants.  The verifier should enforce this constraint. + +//===----------------------------------------------------------------------===// + +We currently codegen SCALAR_TO_VECTOR as a store of the scalar to a 16-byte +aligned stack slot, followed by a load/vperm.  We should probably just store it +to a scalar stack slot, then use lvsl/vperm to load it.  If the value is already +in memory this is a big win. + +//===----------------------------------------------------------------------===// + +extract_vector_elt of an arbitrary constant vector can be done with the  +following instructions: + +vTemp = vec_splat(v0,2);    // 2 is the element the src is in. +vec_ste(&destloc,0,vTemp); + +We can do an arbitrary non-constant value by using lvsr/perm/ste. + +//===----------------------------------------------------------------------===// + +If we want to tie instruction selection into the scheduler, we can do some +constant formation with different instructions.  For example, we can generate +"vsplti -1" with "vcmpequw R,R" and 1,1,1,1 with "vsubcuw R,R", and 0,0,0,0 with +"vsplti 0" or "vxor", each of which use different execution units, thus could +help scheduling. + +This is probably only reasonable for a post-pass scheduler. + +//===----------------------------------------------------------------------===// + +For this function: + +void test(vector float *A, vector float *B) { +  vector float C = (vector float)vec_cmpeq(*A, *B); +  if (!vec_any_eq(*A, *B)) +    *B = (vector float){0,0,0,0}; +  *A = C; +} + +we get the following basic block: + +	... +        lvx v2, 0, r4 +        lvx v3, 0, r3 +        vcmpeqfp v4, v3, v2 +        vcmpeqfp. v2, v3, v2 +        bne cr6, LBB1_2 ; cond_next + +The vcmpeqfp/vcmpeqfp. instructions currently cannot be merged when the +vcmpeqfp. result is used by a branch.  This can be improved. 
+ +//===----------------------------------------------------------------------===// + +The code generated for this is truly aweful: + +vector float test(float a, float b) { + return (vector float){ 0.0, a, 0.0, 0.0};  +} + +LCPI1_0:                                        ;  float +        .space  4 +        .text +        .globl  _test +        .align  4 +_test: +        mfspr r2, 256 +        oris r3, r2, 4096 +        mtspr 256, r3 +        lis r3, ha16(LCPI1_0) +        addi r4, r1, -32 +        stfs f1, -16(r1) +        addi r5, r1, -16 +        lfs f0, lo16(LCPI1_0)(r3) +        stfs f0, -32(r1) +        lvx v2, 0, r4 +        lvx v3, 0, r5 +        vmrghw v3, v3, v2 +        vspltw v2, v2, 0 +        vmrghw v2, v2, v3 +        mtspr 256, r2 +        blr + +//===----------------------------------------------------------------------===// + +int foo(vector float *x, vector float *y) { +        if (vec_all_eq(*x,*y)) return 3245;  +        else return 12; +} + +A predicate compare being used in a select_cc should have the same peephole +applied to it as a predicate compare used by a br_cc.  There should be no +mfcr here: + +_foo: +        mfspr r2, 256 +        oris r5, r2, 12288 +        mtspr 256, r5 +        li r5, 12 +        li r6, 3245 +        lvx v2, 0, r4 +        lvx v3, 0, r3 +        vcmpeqfp. v2, v3, v2 +        mfcr r3, 2 +        rlwinm r3, r3, 25, 31, 31 +        cmpwi cr0, r3, 0 +        bne cr0, LBB1_2 ; entry +LBB1_1: ; entry +        mr r6, r5 +LBB1_2: ; entry +        mr r3, r6 +        mtspr 256, r2 +        blr + +//===----------------------------------------------------------------------===// + +CodeGen/PowerPC/vec_constants.ll has an and operation that should be +codegen'd to andc.  The issue is that the 'all ones' build vector is +SelectNodeTo'd a VSPLTISB instruction node before the and/xor is selected +which prevents the vnot pattern from matching. + + +//===----------------------------------------------------------------------===// + +An alternative to the store/store/load approach for illegal insert element  +lowering would be: + +1. store element to any ol' slot +2. lvx the slot +3. lvsl 0; splat index; vcmpeq to generate a select mask +4. lvsl slot + x; vperm to rotate result into correct slot +5. vsel result together. + +//===----------------------------------------------------------------------===// + +Should codegen branches on vec_any/vec_all to avoid mfcr.  Two examples: + +#include <altivec.h> + int f(vector float a, vector float b) + { +  int aa = 0; +  if (vec_all_ge(a, b)) +    aa |= 0x1; +  if (vec_any_ge(a,b)) +    aa |= 0x2; +  return aa; +} + +vector float f(vector float a, vector float b) {  +  if (vec_any_eq(a, b))  +    return a;  +  else  +    return b;  +} + +//===----------------------------------------------------------------------===// + +We should do a little better with eliminating dead stores. 
+The stores to the stack are dead since %a and %b are not needed + +; Function Attrs: nounwind +define <16 x i8> @test_vpmsumb() #0 { +  entry: +  %a = alloca <16 x i8>, align 16 +  %b = alloca <16 x i8>, align 16 +  store <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, <16 x i8>* %a, align 16 +  store <16 x i8> <i8 113, i8 114, i8 115, i8 116, i8 117, i8 118, i8 119, i8 120, i8 121, i8 122, i8 123, i8 124, i8 125, i8 126, i8 127, i8 112>, <16 x i8>* %b, align 16 +  %0 = load <16 x i8>* %a, align 16 +  %1 = load <16 x i8>* %b, align 16 +  %2 = call <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8> %0, <16 x i8> %1) +  ret <16 x i8> %2 +} + + +; Function Attrs: nounwind readnone +declare <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8>, <16 x i8>) #1 + + +Produces the following code with -mtriple=powerpc64-unknown-linux-gnu: +# %bb.0:                                # %entry +    addis 3, 2, .LCPI0_0@toc@ha +    addis 4, 2, .LCPI0_1@toc@ha +    addi 3, 3, .LCPI0_0@toc@l +    addi 4, 4, .LCPI0_1@toc@l +    lxvw4x 0, 0, 3 +    addi 3, 1, -16 +    lxvw4x 35, 0, 4 +    stxvw4x 0, 0, 3 +    ori 2, 2, 0 +    lxvw4x 34, 0, 3 +    addi 3, 1, -32 +    stxvw4x 35, 0, 3 +    vpmsumb 2, 2, 3 +    blr +    .long   0 +    .quad   0 + +The two stxvw4x instructions are not needed. +With -mtriple=powerpc64le-unknown-linux-gnu, the associated permutes +are present too. + +//===----------------------------------------------------------------------===// + +The following example is found in test/CodeGen/PowerPC/vec_add_sub_doubleword.ll: + +define <2 x i64> @increment_by_val(<2 x i64> %x, i64 %val) nounwind { +       %tmpvec = insertelement <2 x i64> <i64 0, i64 0>, i64 %val, i32 0 +       %tmpvec2 = insertelement <2 x i64> %tmpvec, i64 %val, i32 1 +       %result = add <2 x i64> %x, %tmpvec2 +       ret <2 x i64> %result + +This will generate the following instruction sequence: +        std 5, -8(1) +        std 5, -16(1) +        addi 3, 1, -16 +        ori 2, 2, 0 +        lxvd2x 35, 0, 3 +        vaddudm 2, 2, 3 +        blr + +This will almost certainly cause a load-hit-store hazard.   +Since val is a value parameter, it should not need to be saved onto +the stack, unless it's being done set up the vector register. Instead, +it would be better to splat the value into a vector register, and then +remove the (dead) stores to the stack. + +//===----------------------------------------------------------------------===// + +At the moment we always generate a lxsdx in preference to lfd, or stxsdx in +preference to stfd.  When we have a reg-immediate addressing mode, this is a +poor choice, since we have to load the address into an index register.  This +should be fixed for P7/P8.  + +//===----------------------------------------------------------------------===// + +Right now, ShuffleKind 0 is supported only on BE, and ShuffleKind 2 only on LE. +However, we could actually support both kinds on either endianness, if we check +for the appropriate shufflevector pattern for each case ...  this would cause +some additional shufflevectors to be recognized and implemented via the +"swapped" form. + +//===----------------------------------------------------------------------===// + +There is a utility program called PerfectShuffle that generates a table of the +shortest instruction sequence for implementing a shufflevector operation on +PowerPC.  However, this was designed for big-endian code generation.  
We could +modify this program to create a little endian version of the table.  The table +is used in PPCISelLowering.cpp, PPCTargetLowering::LOWERVECTOR_SHUFFLE(). + +//===----------------------------------------------------------------------===// + +Opportunies to use instructions from PPCInstrVSX.td during code gen +  - Conversion instructions (Sections 7.6.1.5 and 7.6.1.6 of ISA 2.07) +  - Scalar comparisons (xscmpodp and xscmpudp) +  - Min and max (xsmaxdp, xsmindp, xvmaxdp, xvmindp, xvmaxsp, xvminsp) + +Related to this: we currently do not generate the lxvw4x instruction for either +v4f32 or v4i32, probably because adding a dag pattern to the recognizer requires +a single target type.  This should probably be addressed in the PPCISelDAGToDAG logic. + +//===----------------------------------------------------------------------===// + +Currently EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT are type-legal only +for v2f64 with VSX available.  We should create custom lowering +support for the other vector types.  Without this support, we generate +sequences with load-hit-store hazards. + +v4f32 can be supported with VSX by shifting the correct element into +big-endian lane 0, using xscvspdpn to produce a double-precision +representation of the single-precision value in big-endian +double-precision lane 0, and reinterpreting lane 0 as an FPR or +vector-scalar register. + +v2i64 can be supported with VSX and P8Vector in the same manner as +v2f64, followed by a direct move to a GPR. + +v4i32 can be supported with VSX and P8Vector by shifting the correct +element into big-endian lane 1, using a direct move to a GPR, and +sign-extending the 32-bit result to 64 bits. + +v8i16 can be supported with VSX and P8Vector by shifting the correct +element into big-endian lane 3, using a direct move to a GPR, and +sign-extending the 16-bit result to 64 bits. + +v16i8 can be supported with VSX and P8Vector by shifting the correct +element into big-endian lane 7, using a direct move to a GPR, and +sign-extending the 8-bit result to 64 bits. diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/README_P9.txt b/contrib/libs/llvm12/lib/Target/PowerPC/README_P9.txt index 79cb6ccecad..c9984b7604b 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/README_P9.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/README_P9.txt @@ -1,605 +1,605 @@ -//===- README_P9.txt - Notes for improving Power9 code gen ----------------===//  -  -TODO: Instructions Need Implement Instrinstics or Map to LLVM IR  -  -Altivec:  -- Vector Compare Not Equal (Zero):  -  vcmpneb(.) vcmpneh(.) vcmpnew(.)  -  vcmpnezb(.) vcmpnezh(.) vcmpnezw(.)  -  . Same as other VCMP*, use VCMP/VCMPo form (support intrinsic)  -  -- Vector Extract Unsigned: vextractub vextractuh vextractuw vextractd  -  . Don't use llvm extractelement because they have different semantics  -  . Use instrinstics:  -    (set v2i64:$vD, (int_ppc_altivec_vextractub v16i8:$vA, imm:$UIMM))  -    (set v2i64:$vD, (int_ppc_altivec_vextractuh v8i16:$vA, imm:$UIMM))  -    (set v2i64:$vD, (int_ppc_altivec_vextractuw v4i32:$vA, imm:$UIMM))  -    (set v2i64:$vD, (int_ppc_altivec_vextractd  v2i64:$vA, imm:$UIMM))  -  -- Vector Extract Unsigned Byte Left/Right-Indexed:  -  vextublx vextubrx vextuhlx vextuhrx vextuwlx vextuwrx  -  . 
Use instrinstics:  -    // Left-Indexed  -    (set i64:$rD, (int_ppc_altivec_vextublx i64:$rA, v16i8:$vB))  -    (set i64:$rD, (int_ppc_altivec_vextuhlx i64:$rA, v8i16:$vB))  -    (set i64:$rD, (int_ppc_altivec_vextuwlx i64:$rA, v4i32:$vB))  -  -    // Right-Indexed  -    (set i64:$rD, (int_ppc_altivec_vextubrx i64:$rA, v16i8:$vB))  -    (set i64:$rD, (int_ppc_altivec_vextuhrx i64:$rA, v8i16:$vB))  -    (set i64:$rD, (int_ppc_altivec_vextuwrx i64:$rA, v4i32:$vB))  -  -- Vector Insert Element Instructions: vinsertb vinsertd vinserth vinsertw  -    (set v16i8:$vD, (int_ppc_altivec_vinsertb v16i8:$vA, imm:$UIMM))  -    (set v8i16:$vD, (int_ppc_altivec_vinsertd v8i16:$vA, imm:$UIMM))  -    (set v4i32:$vD, (int_ppc_altivec_vinserth v4i32:$vA, imm:$UIMM))  -    (set v2i64:$vD, (int_ppc_altivec_vinsertw v2i64:$vA, imm:$UIMM))  -  -- Vector Count Leading/Trailing Zero LSB. Result is placed into GPR[rD]:  -  vclzlsbb vctzlsbb  -  . Use intrinsic:  -    (set i64:$rD, (int_ppc_altivec_vclzlsbb v16i8:$vB))  -    (set i64:$rD, (int_ppc_altivec_vctzlsbb v16i8:$vB))  -  -- Vector Count Trailing Zeros: vctzb vctzh vctzw vctzd  -  . Map to llvm cttz  -    (set v16i8:$vD, (cttz v16i8:$vB))     // vctzb  -    (set v8i16:$vD, (cttz v8i16:$vB))     // vctzh  -    (set v4i32:$vD, (cttz v4i32:$vB))     // vctzw  -    (set v2i64:$vD, (cttz v2i64:$vB))     // vctzd  -  -- Vector Extend Sign: vextsb2w vextsh2w vextsb2d vextsh2d vextsw2d  -  . vextsb2w:  -    (set v4i32:$vD, (sext v4i8:$vB))  -  -    // PowerISA_V3.0:  -    do i = 0 to 3  -       VR[VRT].word[i] ← EXTS32(VR[VRB].word[i].byte[3])  -    end  -  -  . vextsh2w:  -    (set v4i32:$vD, (sext v4i16:$vB))  -  -    // PowerISA_V3.0:  -    do i = 0 to 3  -       VR[VRT].word[i] ← EXTS32(VR[VRB].word[i].hword[1])  -    end  -  -  . vextsb2d  -    (set v2i64:$vD, (sext v2i8:$vB))  -  -    // PowerISA_V3.0:  -    do i = 0 to 1  -       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].byte[7])  -    end  -  -  . vextsh2d  -    (set v2i64:$vD, (sext v2i16:$vB))  -  -    // PowerISA_V3.0:  -    do i = 0 to 1  -       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].hword[3])  -    end  -  -  . vextsw2d  -    (set v2i64:$vD, (sext v2i32:$vB))  -  -    // PowerISA_V3.0:  -    do i = 0 to 1  -       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].word[1])  -    end  -  -- Vector Integer Negate: vnegw vnegd  -  . Map to llvm ineg  -    (set v4i32:$rT, (ineg v4i32:$rA))       // vnegw  -    (set v2i64:$rT, (ineg v2i64:$rA))       // vnegd  -  -- Vector Parity Byte: vprtybw vprtybd vprtybq  -  . Use intrinsic:  -    (set v4i32:$rD, (int_ppc_altivec_vprtybw v4i32:$vB))  -    (set v2i64:$rD, (int_ppc_altivec_vprtybd v2i64:$vB))  -    (set v1i128:$rD, (int_ppc_altivec_vprtybq v1i128:$vB))  -  -- Vector (Bit) Permute (Right-indexed):  -  . vbpermd: Same as "vbpermq", use VX1_Int_Ty2:  -    VX1_Int_Ty2<1484, "vbpermd", int_ppc_altivec_vbpermd, v2i64, v2i64>;  -  -  . vpermr: use VA1a_Int_Ty3  -    VA1a_Int_Ty3<59, "vpermr", int_ppc_altivec_vpermr, v16i8, v16i8, v16i8>;  -  -- Vector Rotate Left Mask/Mask-Insert: vrlwnm vrlwmi vrldnm vrldmi  -  . Use intrinsic:  -    VX1_Int_Ty<389, "vrlwnm", int_ppc_altivec_vrlwnm, v4i32>;  -    VX1_Int_Ty<133, "vrlwmi", int_ppc_altivec_vrlwmi, v4i32>;  -    VX1_Int_Ty<453, "vrldnm", int_ppc_altivec_vrldnm, v2i64>;  -    VX1_Int_Ty<197, "vrldmi", int_ppc_altivec_vrldmi, v2i64>;  -  -- Vector Shift Left/Right: vslv vsrv  -  . Use intrinsic, don't map to llvm shl and lshr, because they have different  -    semantics, e.g. 
vslv:  -  -      do i = 0 to 15  -         sh ← VR[VRB].byte[i].bit[5:7]  -         VR[VRT].byte[i] ← src.byte[i:i+1].bit[sh:sh+7]  -      end  -  -    VR[VRT].byte[i] is composed of 2 bytes from src.byte[i:i+1]  -  -  . VX1_Int_Ty<1860, "vslv", int_ppc_altivec_vslv, v16i8>;  -    VX1_Int_Ty<1796, "vsrv", int_ppc_altivec_vsrv, v16i8>;  -  -- Vector Multiply-by-10 (& Write Carry) Unsigned Quadword:  -  vmul10uq vmul10cuq  -  . Use intrinsic:  -    VX1_Int_Ty<513, "vmul10uq",   int_ppc_altivec_vmul10uq,  v1i128>;  -    VX1_Int_Ty<  1, "vmul10cuq",  int_ppc_altivec_vmul10cuq, v1i128>;  -  -- Vector Multiply-by-10 Extended (& Write Carry) Unsigned Quadword:  -  vmul10euq vmul10ecuq  -  . Use intrinsic:  -    VX1_Int_Ty<577, "vmul10euq",  int_ppc_altivec_vmul10euq, v1i128>;  -    VX1_Int_Ty< 65, "vmul10ecuq", int_ppc_altivec_vmul10ecuq, v1i128>;  -  -- Decimal Convert From/to National/Zoned/Signed-QWord:  -  bcdcfn. bcdcfz. bcdctn. bcdctz. bcdcfsq. bcdctsq.  -  . Use instrinstics:  -    (set v1i128:$vD, (int_ppc_altivec_bcdcfno  v1i128:$vB, i1:$PS))  -    (set v1i128:$vD, (int_ppc_altivec_bcdcfzo  v1i128:$vB, i1:$PS))  -    (set v1i128:$vD, (int_ppc_altivec_bcdctno  v1i128:$vB))  -    (set v1i128:$vD, (int_ppc_altivec_bcdctzo  v1i128:$vB, i1:$PS))  -    (set v1i128:$vD, (int_ppc_altivec_bcdcfsqo v1i128:$vB, i1:$PS))  -    (set v1i128:$vD, (int_ppc_altivec_bcdctsqo v1i128:$vB))  -  -- Decimal Copy-Sign/Set-Sign: bcdcpsgn. bcdsetsgn.  -  . Use instrinstics:  -    (set v1i128:$vD, (int_ppc_altivec_bcdcpsgno v1i128:$vA, v1i128:$vB))  -    (set v1i128:$vD, (int_ppc_altivec_bcdsetsgno v1i128:$vB, i1:$PS))  -  -- Decimal Shift/Unsigned-Shift/Shift-and-Round: bcds. bcdus. bcdsr.  -  . Use instrinstics:  -    (set v1i128:$vD, (int_ppc_altivec_bcdso  v1i128:$vA, v1i128:$vB, i1:$PS))  -    (set v1i128:$vD, (int_ppc_altivec_bcduso v1i128:$vA, v1i128:$vB))  -    (set v1i128:$vD, (int_ppc_altivec_bcdsro v1i128:$vA, v1i128:$vB, i1:$PS))  -  -  . Note! Their VA is accessed only 1 byte, i.e. VA.byte[7]  -  -- Decimal (Unsigned) Truncate: bcdtrunc. bcdutrunc.  -  . Use instrinstics:  -    (set v1i128:$vD, (int_ppc_altivec_bcdso  v1i128:$vA, v1i128:$vB, i1:$PS))  -    (set v1i128:$vD, (int_ppc_altivec_bcduso v1i128:$vA, v1i128:$vB))  -  -  . Note! Their VA is accessed only 2 byte, i.e. VA.hword[3] (VA.bit[48:63])  -  -VSX:  -- QP Copy Sign: xscpsgnqp  -  . Similar to xscpsgndp  -  . (set f128:$vT, (fcopysign f128:$vB, f128:$vA)  -  -- QP Absolute/Negative-Absolute/Negate: xsabsqp xsnabsqp xsnegqp  -  . Similar to xsabsdp/xsnabsdp/xsnegdp  -  . (set f128:$vT, (fabs f128:$vB))             // xsabsqp  -    (set f128:$vT, (fneg (fabs f128:$vB)))      // xsnabsqp  -    (set f128:$vT, (fneg f128:$vB))             // xsnegqp  -  -- QP Add/Divide/Multiply/Subtract/Square-Root:  -  xsaddqp xsdivqp xsmulqp xssubqp xssqrtqp  -  . Similar to xsadddp  -  . isCommutable = 1  -    (set f128:$vT, (fadd f128:$vA, f128:$vB))   // xsaddqp  -    (set f128:$vT, (fmul f128:$vA, f128:$vB))   // xsmulqp  -  -  . isCommutable = 0  -    (set f128:$vT, (fdiv f128:$vA, f128:$vB))   // xsdivqp  -    (set f128:$vT, (fsub f128:$vA, f128:$vB))   // xssubqp  -    (set f128:$vT, (fsqrt f128:$vB)))           // xssqrtqp  -  -- Round to Odd of QP Add/Divide/Multiply/Subtract/Square-Root:  -  xsaddqpo xsdivqpo xsmulqpo xssubqpo xssqrtqpo  -  . Similar to xsrsqrtedp??  
-      def XSRSQRTEDP : XX2Form<60, 74,  -                               (outs vsfrc:$XT), (ins vsfrc:$XB),  -                               "xsrsqrtedp $XT, $XB", IIC_VecFP,  -                               [(set f64:$XT, (PPCfrsqrte f64:$XB))]>;  -  -  . Define DAG Node in PPCInstrInfo.td:  -    def PPCfaddrto: SDNode<"PPCISD::FADDRTO", SDTFPBinOp, []>;  -    def PPCfdivrto: SDNode<"PPCISD::FDIVRTO", SDTFPBinOp, []>;  -    def PPCfmulrto: SDNode<"PPCISD::FMULRTO", SDTFPBinOp, []>;  -    def PPCfsubrto: SDNode<"PPCISD::FSUBRTO", SDTFPBinOp, []>;  -    def PPCfsqrtrto: SDNode<"PPCISD::FSQRTRTO", SDTFPUnaryOp, []>;  -  -    DAG patterns of each instruction (PPCInstrVSX.td):  -    . isCommutable = 1  -      (set f128:$vT, (PPCfaddrto f128:$vA, f128:$vB))   // xsaddqpo  -      (set f128:$vT, (PPCfmulrto f128:$vA, f128:$vB))   // xsmulqpo  -  -    . isCommutable = 0  -      (set f128:$vT, (PPCfdivrto f128:$vA, f128:$vB))   // xsdivqpo  -      (set f128:$vT, (PPCfsubrto f128:$vA, f128:$vB))   // xssubqpo  -      (set f128:$vT, (PPCfsqrtrto f128:$vB))            // xssqrtqpo  -  -- QP (Negative) Multiply-{Add/Subtract}: xsmaddqp xsmsubqp xsnmaddqp xsnmsubqp  -  . Ref: xsmaddadp/xsmsubadp/xsnmaddadp/xsnmsubadp  -  -  . isCommutable = 1  -    // xsmaddqp  -    [(set f128:$vT, (fma f128:$vA, f128:$vB, f128:$vTi))]>,  -    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -    AltVSXFMARel;  -  -    // xsmsubqp  -    [(set f128:$vT, (fma f128:$vA, f128:$vB, (fneg f128:$vTi)))]>,  -    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -    AltVSXFMARel;  -  -    // xsnmaddqp  -    [(set f128:$vT, (fneg (fma f128:$vA, f128:$vB, f128:$vTi)))]>,  -    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -    AltVSXFMARel;  -  -    // xsnmsubqp  -    [(set f128:$vT, (fneg (fma f128:$vA, f128:$vB, (fneg f128:$vTi))))]>,  -    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -    AltVSXFMARel;  -  -- Round to Odd of QP (Negative) Multiply-{Add/Subtract}:  -  xsmaddqpo xsmsubqpo xsnmaddqpo xsnmsubqpo  -  . Similar to xsrsqrtedp??  -  -  . Define DAG Node in PPCInstrInfo.td:  -    def PPCfmarto: SDNode<"PPCISD::FMARTO", SDTFPTernaryOp, []>;  -  -    It looks like we only need to define "PPCfmarto" for these instructions,  -    because according to PowerISA_V3.0, these instructions perform RTO on  -    fma's result:  -        xsmaddqp(o)  -        v      ← bfp_MULTIPLY_ADD(src1, src3, src2)  -        rnd    ← bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v)  -        result ← bfp_CONVERT_TO_BFP128(rnd)  -  -        xsmsubqp(o)  -        v      ← bfp_MULTIPLY_ADD(src1, src3, bfp_NEGATE(src2))  -        rnd    ← bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v)  -        result ← bfp_CONVERT_TO_BFP128(rnd)  -  -        xsnmaddqp(o)  -        v      ← bfp_MULTIPLY_ADD(src1,src3,src2)  -        rnd    ← bfp_NEGATE(bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v))  -        result ← bfp_CONVERT_TO_BFP128(rnd)  -  -        xsnmsubqp(o)  -        v      ← bfp_MULTIPLY_ADD(src1, src3, bfp_NEGATE(src2))  -        rnd    ← bfp_NEGATE(bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v))  -        result ← bfp_CONVERT_TO_BFP128(rnd)  -  -    DAG patterns of each instruction (PPCInstrVSX.td):  -    . 
isCommutable = 1  -      // xsmaddqpo  -      [(set f128:$vT, (PPCfmarto f128:$vA, f128:$vB, f128:$vTi))]>,  -      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -      AltVSXFMARel;  -  -      // xsmsubqpo  -      [(set f128:$vT, (PPCfmarto f128:$vA, f128:$vB, (fneg f128:$vTi)))]>,  -      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -      AltVSXFMARel;  -  -      // xsnmaddqpo  -      [(set f128:$vT, (fneg (PPCfmarto f128:$vA, f128:$vB, f128:$vTi)))]>,  -      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -      AltVSXFMARel;  -  -      // xsnmsubqpo  -      [(set f128:$vT, (fneg (PPCfmarto f128:$vA, f128:$vB, (fneg f128:$vTi))))]>,  -      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">,  -      AltVSXFMARel;  -  -- QP Compare Ordered/Unordered: xscmpoqp xscmpuqp  -  . ref: XSCMPUDP  -      def XSCMPUDP : XX3Form_1<60, 35,  -                               (outs crrc:$crD), (ins vsfrc:$XA, vsfrc:$XB),  -                               "xscmpudp $crD, $XA, $XB", IIC_FPCompare, []>;  -  -  . No SDAG, intrinsic, builtin are required??  -    Or llvm fcmp order/unorder compare??  -  -- DP/QP Compare Exponents: xscmpexpdp xscmpexpqp  -  . No SDAG, intrinsic, builtin are required?  -  -- DP Compare ==, >=, >, !=: xscmpeqdp xscmpgedp xscmpgtdp xscmpnedp  -  . I checked existing instruction "XSCMPUDP". They are different in target  -    register. "XSCMPUDP" write to CR field, xscmp*dp write to VSX register  -  -  . Use instrinsic:  -    (set i128:$XT, (int_ppc_vsx_xscmpeqdp f64:$XA, f64:$XB))  -    (set i128:$XT, (int_ppc_vsx_xscmpgedp f64:$XA, f64:$XB))  -    (set i128:$XT, (int_ppc_vsx_xscmpgtdp f64:$XA, f64:$XB))  -    (set i128:$XT, (int_ppc_vsx_xscmpnedp f64:$XA, f64:$XB))  -  -- Vector Compare Not Equal: xvcmpnedp xvcmpnedp. xvcmpnesp xvcmpnesp.  -  . Similar to xvcmpeqdp:  -      defm XVCMPEQDP : XX3Form_Rcr<60, 99,  -                                 "xvcmpeqdp", "$XT, $XA, $XB", IIC_VecFPCompare,  -                                 int_ppc_vsx_xvcmpeqdp, v2i64, v2f64>;  -  -  . So we should use "XX3Form_Rcr" to implement instrinsic  -  -- Convert DP -> QP: xscvdpqp  -  . Similar to XSCVDPSP:  -      def XSCVDPSP : XX2Form<60, 265,  -                          (outs vsfrc:$XT), (ins vsfrc:$XB),  -                          "xscvdpsp $XT, $XB", IIC_VecFP, []>;  -  . So, No SDAG, intrinsic, builtin are required??  -  -- Round & Convert QP -> DP (dword[1] is set to zero): xscvqpdp xscvqpdpo  -  . Similar to XSCVDPSP  -  . No SDAG, intrinsic, builtin are required??  -  -- Truncate & Convert QP -> (Un)Signed (D)Word (dword[1] is set to zero):  -  xscvqpsdz xscvqpswz xscvqpudz xscvqpuwz  -  . According to PowerISA_V3.0, these are similar to "XSCVDPSXDS", "XSCVDPSXWS",  -    "XSCVDPUXDS", "XSCVDPUXWS"  -  -  . DAG patterns:  -    (set f128:$XT, (PPCfctidz f128:$XB))    // xscvqpsdz  -    (set f128:$XT, (PPCfctiwz f128:$XB))    // xscvqpswz  -    (set f128:$XT, (PPCfctiduz f128:$XB))   // xscvqpudz  -    (set f128:$XT, (PPCfctiwuz f128:$XB))   // xscvqpuwz  -  -- Convert (Un)Signed DWord -> QP: xscvsdqp xscvudqp  -  . Similar to XSCVSXDSP  -  . (set f128:$XT, (PPCfcfids f64:$XB))     // xscvsdqp  -    (set f128:$XT, (PPCfcfidus f64:$XB))    // xscvudqp  -  -- (Round &) Convert DP <-> HP: xscvdphp xscvhpdp  -  . Similar to XSCVDPSP  -  . No SDAG, intrinsic, builtin are required??  -  -- Vector HP -> SP: xvcvhpsp xvcvsphp  -  . 
Similar to XVCVDPSP:  -      def XVCVDPSP : XX2Form<60, 393,  -                          (outs vsrc:$XT), (ins vsrc:$XB),  -                          "xvcvdpsp $XT, $XB", IIC_VecFP, []>;  -  . No SDAG, intrinsic, builtin are required??  -  -- Round to Quad-Precision Integer: xsrqpi xsrqpix  -  . These are combination of "XSRDPI", "XSRDPIC", "XSRDPIM", .., because you  -    need to assign rounding mode in instruction  -  . Provide builtin?  -    (set f128:$vT, (int_ppc_vsx_xsrqpi f128:$vB))  -    (set f128:$vT, (int_ppc_vsx_xsrqpix f128:$vB))  -  -- Round Quad-Precision to Double-Extended Precision (fp80): xsrqpxp  -  . Provide builtin?  -    (set f128:$vT, (int_ppc_vsx_xsrqpxp f128:$vB))  -  -Fixed Point Facility:  -  -- Exploit cmprb and cmpeqb (perhaps for something like  -  isalpha/isdigit/isupper/islower and isspace respectivelly). This can  -  perhaps be done through a builtin.  -  -- Provide testing for cnttz[dw]  -- Insert Exponent DP/QP: xsiexpdp xsiexpqp  -  . Use intrinsic?  -  . xsiexpdp:  -    // Note: rA and rB are the unsigned integer value.  -    (set f128:$XT, (int_ppc_vsx_xsiexpdp i64:$rA, i64:$rB))  -  -  . xsiexpqp:  -    (set f128:$vT, (int_ppc_vsx_xsiexpqp f128:$vA, f64:$vB))  -  -- Extract Exponent/Significand DP/QP: xsxexpdp xsxsigdp xsxexpqp xsxsigqp  -  . Use intrinsic?  -  . (set i64:$rT, (int_ppc_vsx_xsxexpdp f64$XB))    // xsxexpdp  -    (set i64:$rT, (int_ppc_vsx_xsxsigdp f64$XB))    // xsxsigdp  -    (set f128:$vT, (int_ppc_vsx_xsxexpqp f128$vB))  // xsxexpqp  -    (set f128:$vT, (int_ppc_vsx_xsxsigqp f128$vB))  // xsxsigqp  -  -- Vector Insert Word: xxinsertw  -  - Useful for inserting f32/i32 elements into vectors (the element to be  -    inserted needs to be prepared)  -  . Note: llvm has insertelem in "Vector Operations"  -    ; yields <n x <ty>>  -    <result> = insertelement <n x <ty>> <val>, <ty> <elt>, <ty2> <idx>  -  -    But how to map to it??  -    [(set v1f128:$XT, (insertelement v1f128:$XTi, f128:$XB, i4:$UIMM))]>,  -    RegConstraint<"$XTi = $XT">, NoEncode<"$XTi">,  -  -  . Or use intrinsic?  -    (set v1f128:$XT, (int_ppc_vsx_xxinsertw v1f128:$XTi, f128:$XB, i4:$UIMM))  -  -- Vector Extract Unsigned Word: xxextractuw  -  - Not useful for extraction of f32 from v4f32 (the current pattern is better -  -    shift->convert)  -  - It is useful for (uint_to_fp (vector_extract v4i32, N))  -  - Unfortunately, it can't be used for (sint_to_fp (vector_extract v4i32, N))  -  . Note: llvm has extractelement in "Vector Operations"  -    ; yields <ty>  -    <result> = extractelement <n x <ty>> <val>, <ty2> <idx>  -  -    How to map to it??  -    [(set f128:$XT, (extractelement v1f128:$XB, i4:$UIMM))]  -  -  . Or use intrinsic?  -    (set f128:$XT, (int_ppc_vsx_xxextractuw v1f128:$XB, i4:$UIMM))  -  -- Vector Insert Exponent DP/SP: xviexpdp xviexpsp  -  . Use intrinsic  -    (set v2f64:$XT, (int_ppc_vsx_xviexpdp v2f64:$XA, v2f64:$XB))  -    (set v4f32:$XT, (int_ppc_vsx_xviexpsp v4f32:$XA, v4f32:$XB))  -  -- Vector Extract Exponent/Significand DP/SP: xvxexpdp xvxexpsp xvxsigdp xvxsigsp  -  . Use intrinsic  -    (set v2f64:$XT, (int_ppc_vsx_xvxexpdp v2f64:$XB))  -    (set v4f32:$XT, (int_ppc_vsx_xvxexpsp v4f32:$XB))  -    (set v2f64:$XT, (int_ppc_vsx_xvxsigdp v2f64:$XB))  -    (set v4f32:$XT, (int_ppc_vsx_xvxsigsp v4f32:$XB))  -  -- Test Data Class SP/DP/QP: xststdcsp xststdcdp xststdcqp  -  . No SDAG, intrinsic, builtin are required?  -    Because it seems that we have no way to map BF field?  
-  -    Instruction Form: [PO T XO B XO BX TX]  -    Asm: xststd* BF,XB,DCMX  -  -    BF is an index to CR register field.  -  -- Vector Test Data Class SP/DP: xvtstdcsp xvtstdcdp  -  . Use intrinsic  -    (set v4f32:$XT, (int_ppc_vsx_xvtstdcsp v4f32:$XB, i7:$DCMX))  -    (set v2f64:$XT, (int_ppc_vsx_xvtstdcdp v2f64:$XB, i7:$DCMX))  -  -- Maximum/Minimum Type-C/Type-J DP: xsmaxcdp xsmaxjdp xsmincdp xsminjdp  -  . PowerISA_V3.0:  -    "xsmaxcdp can be used to implement the C/C++/Java conditional operation  -     (x>y)?x:y for single-precision and double-precision arguments."  -  -    Note! c type and j type have different behavior when:  -    1. Either input is NaN  -    2. Both input are +-Infinity, +-Zero  -  -  . dtype map to llvm fmaxnum/fminnum  -    jtype use intrinsic  -  -  . xsmaxcdp xsmincdp  -    (set f64:$XT, (fmaxnum f64:$XA, f64:$XB))  -    (set f64:$XT, (fminnum f64:$XA, f64:$XB))  -  -  . xsmaxjdp xsminjdp  -    (set f64:$XT, (int_ppc_vsx_xsmaxjdp f64:$XA, f64:$XB))  -    (set f64:$XT, (int_ppc_vsx_xsminjdp f64:$XA, f64:$XB))  -  -- Vector Byte-Reverse H/W/D/Q Word: xxbrh xxbrw xxbrd xxbrq  -  . Use intrinsic  -    (set v8i16:$XT, (int_ppc_vsx_xxbrh v8i16:$XB))  -    (set v4i32:$XT, (int_ppc_vsx_xxbrw v4i32:$XB))  -    (set v2i64:$XT, (int_ppc_vsx_xxbrd v2i64:$XB))  -    (set v1i128:$XT, (int_ppc_vsx_xxbrq v1i128:$XB))  -  -- Vector Permute: xxperm xxpermr  -  . I have checked "PPCxxswapd" in PPCInstrVSX.td, but they are different  -  . Use intrinsic  -    (set v16i8:$XT, (int_ppc_vsx_xxperm v16i8:$XA, v16i8:$XB))  -    (set v16i8:$XT, (int_ppc_vsx_xxpermr v16i8:$XA, v16i8:$XB))  -  -- Vector Splat Immediate Byte: xxspltib  -  . Similar to XXSPLTW:  -      def XXSPLTW : XX2Form_2<60, 164,  -                           (outs vsrc:$XT), (ins vsrc:$XB, u2imm:$UIM),  -                           "xxspltw $XT, $XB, $UIM", IIC_VecPerm, []>;  -  -  . No SDAG, intrinsic, builtin are required?  -  -- Load/Store Vector: lxv stxv  -  . Has likely SDAG match:  -    (set v?:$XT, (load ix16addr:$src))  -    (set v?:$XT, (store ix16addr:$dst))  -  -  . Need define ix16addr in PPCInstrInfo.td  -    ix16addr: 16-byte aligned, see "def memrix16" in PPCInstrInfo.td  -  -- Load/Store Vector Indexed: lxvx stxvx  -  . Has likely SDAG match:  -    (set v?:$XT, (load xoaddr:$src))  -    (set v?:$XT, (store xoaddr:$dst))  -  -- Load/Store DWord: lxsd stxsd  -  . Similar to lxsdx/stxsdx:  -    def LXSDX : XX1Form<31, 588,  -                        (outs vsfrc:$XT), (ins memrr:$src),  -                        "lxsdx $XT, $src", IIC_LdStLFD,  -                        [(set f64:$XT, (load xoaddr:$src))]>;  -  -  . (set f64:$XT, (load iaddrX4:$src))  -    (set f64:$XT, (store iaddrX4:$dst))  -  -- Load/Store SP, with conversion from/to DP: lxssp stxssp  -  . Similar to lxsspx/stxsspx:  -    def LXSSPX : XX1Form<31, 524, (outs vssrc:$XT), (ins memrr:$src),  -                         "lxsspx $XT, $src", IIC_LdStLFD,  -                         [(set f32:$XT, (load xoaddr:$src))]>;  -  -  . (set f32:$XT, (load iaddrX4:$src))  -    (set f32:$XT, (store iaddrX4:$dst))  -  -- Load as Integer Byte/Halfword & Zero Indexed: lxsibzx lxsihzx  -  . Similar to lxsiwzx:  -    def LXSIWZX : XX1Form<31, 12, (outs vsfrc:$XT), (ins memrr:$src),  -                          "lxsiwzx $XT, $src", IIC_LdStLFD,  -                          [(set f64:$XT, (PPClfiwzx xoaddr:$src))]>;  -  -  . (set f64:$XT, (PPClfiwzx xoaddr:$src))  -  -- Store as Integer Byte/Halfword Indexed: stxsibx stxsihx  -  . 
Similar to stxsiwx:  -    def STXSIWX : XX1Form<31, 140, (outs), (ins vsfrc:$XT, memrr:$dst),  -                          "stxsiwx $XT, $dst", IIC_LdStSTFD,  -                          [(PPCstfiwx f64:$XT, xoaddr:$dst)]>;  -  -  . (PPCstfiwx f64:$XT, xoaddr:$dst)  -  -- Load Vector Halfword*8/Byte*16 Indexed: lxvh8x lxvb16x  -  . Similar to lxvd2x/lxvw4x:  -    def LXVD2X : XX1Form<31, 844,  -                         (outs vsrc:$XT), (ins memrr:$src),  -                         "lxvd2x $XT, $src", IIC_LdStLFD,  -                         [(set v2f64:$XT, (int_ppc_vsx_lxvd2x xoaddr:$src))]>;  -  -  . (set v8i16:$XT, (int_ppc_vsx_lxvh8x xoaddr:$src))  -    (set v16i8:$XT, (int_ppc_vsx_lxvb16x xoaddr:$src))  -  -- Store Vector Halfword*8/Byte*16 Indexed: stxvh8x stxvb16x  -  . Similar to stxvd2x/stxvw4x:  -    def STXVD2X : XX1Form<31, 972,  -                         (outs), (ins vsrc:$XT, memrr:$dst),  -                         "stxvd2x $XT, $dst", IIC_LdStSTFD,  -                         [(store v2f64:$XT, xoaddr:$dst)]>;  -  -  . (store v8i16:$XT, xoaddr:$dst)  -    (store v16i8:$XT, xoaddr:$dst)  -  -- Load/Store Vector (Left-justified) with Length: lxvl lxvll stxvl stxvll  -  . Likely needs an intrinsic  -  . (set v?:$XT, (int_ppc_vsx_lxvl xoaddr:$src))  -    (set v?:$XT, (int_ppc_vsx_lxvll xoaddr:$src))  -  -  . (int_ppc_vsx_stxvl xoaddr:$dst))  -    (int_ppc_vsx_stxvll xoaddr:$dst))  -  -- Load Vector Word & Splat Indexed: lxvwsx  -  . Likely needs an intrinsic  -  . (set v?:$XT, (int_ppc_vsx_lxvwsx xoaddr:$src))  -  -Atomic operations (l[dw]at, st[dw]at):  -- Provide custom lowering for common atomic operations to use these  -  instructions with the correct Function Code  -- Ensure the operands are in the correct register (i.e. RT+1, RT+2)  -- Provide builtins since not all FC's necessarily have an existing LLVM  -  atomic operation  -  -Load Doubleword Monitored (ldmx):  -- Investigate whether there are any uses for this. It seems to be related to  -  Garbage Collection so it isn't likely to be all that useful for most  -  languages we deal with.  -  -Move to CR from XER Extended (mcrxrx):  -- Is there a use for this in LLVM?  -  -Fixed Point Facility:  -  -- Copy-Paste Facility: copy copy_first cp_abort paste paste. paste_last  -  . Use instrinstics:  -    (int_ppc_copy_first i32:$rA, i32:$rB)  -    (int_ppc_copy i32:$rA, i32:$rB)  -  -    (int_ppc_paste i32:$rA, i32:$rB)  -    (int_ppc_paste_last i32:$rA, i32:$rB)  -  -    (int_cp_abort)  -  -- Message Synchronize: msgsync  -- SLB*: slbieg slbsync  -- stop  -  . No instrinstics  +//===- README_P9.txt - Notes for improving Power9 code gen ----------------===// + +TODO: Instructions Need Implement Instrinstics or Map to LLVM IR + +Altivec: +- Vector Compare Not Equal (Zero): +  vcmpneb(.) vcmpneh(.) vcmpnew(.) +  vcmpnezb(.) vcmpnezh(.) vcmpnezw(.) +  . Same as other VCMP*, use VCMP/VCMPo form (support intrinsic) + +- Vector Extract Unsigned: vextractub vextractuh vextractuw vextractd +  . Don't use llvm extractelement because they have different semantics +  . Use instrinstics: +    (set v2i64:$vD, (int_ppc_altivec_vextractub v16i8:$vA, imm:$UIMM)) +    (set v2i64:$vD, (int_ppc_altivec_vextractuh v8i16:$vA, imm:$UIMM)) +    (set v2i64:$vD, (int_ppc_altivec_vextractuw v4i32:$vA, imm:$UIMM)) +    (set v2i64:$vD, (int_ppc_altivec_vextractd  v2i64:$vA, imm:$UIMM)) + +- Vector Extract Unsigned Byte Left/Right-Indexed: +  vextublx vextubrx vextuhlx vextuhrx vextuwlx vextuwrx +  . 
Use instrinstics: +    // Left-Indexed +    (set i64:$rD, (int_ppc_altivec_vextublx i64:$rA, v16i8:$vB)) +    (set i64:$rD, (int_ppc_altivec_vextuhlx i64:$rA, v8i16:$vB)) +    (set i64:$rD, (int_ppc_altivec_vextuwlx i64:$rA, v4i32:$vB)) + +    // Right-Indexed +    (set i64:$rD, (int_ppc_altivec_vextubrx i64:$rA, v16i8:$vB)) +    (set i64:$rD, (int_ppc_altivec_vextuhrx i64:$rA, v8i16:$vB)) +    (set i64:$rD, (int_ppc_altivec_vextuwrx i64:$rA, v4i32:$vB)) + +- Vector Insert Element Instructions: vinsertb vinsertd vinserth vinsertw +    (set v16i8:$vD, (int_ppc_altivec_vinsertb v16i8:$vA, imm:$UIMM)) +    (set v8i16:$vD, (int_ppc_altivec_vinsertd v8i16:$vA, imm:$UIMM)) +    (set v4i32:$vD, (int_ppc_altivec_vinserth v4i32:$vA, imm:$UIMM)) +    (set v2i64:$vD, (int_ppc_altivec_vinsertw v2i64:$vA, imm:$UIMM)) + +- Vector Count Leading/Trailing Zero LSB. Result is placed into GPR[rD]: +  vclzlsbb vctzlsbb +  . Use intrinsic: +    (set i64:$rD, (int_ppc_altivec_vclzlsbb v16i8:$vB)) +    (set i64:$rD, (int_ppc_altivec_vctzlsbb v16i8:$vB)) + +- Vector Count Trailing Zeros: vctzb vctzh vctzw vctzd +  . Map to llvm cttz +    (set v16i8:$vD, (cttz v16i8:$vB))     // vctzb +    (set v8i16:$vD, (cttz v8i16:$vB))     // vctzh +    (set v4i32:$vD, (cttz v4i32:$vB))     // vctzw +    (set v2i64:$vD, (cttz v2i64:$vB))     // vctzd + +- Vector Extend Sign: vextsb2w vextsh2w vextsb2d vextsh2d vextsw2d +  . vextsb2w: +    (set v4i32:$vD, (sext v4i8:$vB)) + +    // PowerISA_V3.0: +    do i = 0 to 3 +       VR[VRT].word[i] ← EXTS32(VR[VRB].word[i].byte[3]) +    end + +  . vextsh2w: +    (set v4i32:$vD, (sext v4i16:$vB)) + +    // PowerISA_V3.0: +    do i = 0 to 3 +       VR[VRT].word[i] ← EXTS32(VR[VRB].word[i].hword[1]) +    end + +  . vextsb2d +    (set v2i64:$vD, (sext v2i8:$vB)) + +    // PowerISA_V3.0: +    do i = 0 to 1 +       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].byte[7]) +    end + +  . vextsh2d +    (set v2i64:$vD, (sext v2i16:$vB)) + +    // PowerISA_V3.0: +    do i = 0 to 1 +       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].hword[3]) +    end + +  . vextsw2d +    (set v2i64:$vD, (sext v2i32:$vB)) + +    // PowerISA_V3.0: +    do i = 0 to 1 +       VR[VRT].dword[i] ← EXTS64(VR[VRB].dword[i].word[1]) +    end + +- Vector Integer Negate: vnegw vnegd +  . Map to llvm ineg +    (set v4i32:$rT, (ineg v4i32:$rA))       // vnegw +    (set v2i64:$rT, (ineg v2i64:$rA))       // vnegd + +- Vector Parity Byte: vprtybw vprtybd vprtybq +  . Use intrinsic: +    (set v4i32:$rD, (int_ppc_altivec_vprtybw v4i32:$vB)) +    (set v2i64:$rD, (int_ppc_altivec_vprtybd v2i64:$vB)) +    (set v1i128:$rD, (int_ppc_altivec_vprtybq v1i128:$vB)) + +- Vector (Bit) Permute (Right-indexed): +  . vbpermd: Same as "vbpermq", use VX1_Int_Ty2: +    VX1_Int_Ty2<1484, "vbpermd", int_ppc_altivec_vbpermd, v2i64, v2i64>; + +  . vpermr: use VA1a_Int_Ty3 +    VA1a_Int_Ty3<59, "vpermr", int_ppc_altivec_vpermr, v16i8, v16i8, v16i8>; + +- Vector Rotate Left Mask/Mask-Insert: vrlwnm vrlwmi vrldnm vrldmi +  . Use intrinsic: +    VX1_Int_Ty<389, "vrlwnm", int_ppc_altivec_vrlwnm, v4i32>; +    VX1_Int_Ty<133, "vrlwmi", int_ppc_altivec_vrlwmi, v4i32>; +    VX1_Int_Ty<453, "vrldnm", int_ppc_altivec_vrldnm, v2i64>; +    VX1_Int_Ty<197, "vrldmi", int_ppc_altivec_vrldmi, v2i64>; + +- Vector Shift Left/Right: vslv vsrv +  . Use intrinsic, don't map to llvm shl and lshr, because they have different +    semantics, e.g. 
vslv: + +      do i = 0 to 15 +         sh ← VR[VRB].byte[i].bit[5:7] +         VR[VRT].byte[i] ← src.byte[i:i+1].bit[sh:sh+7] +      end + +    VR[VRT].byte[i] is composed of 2 bytes from src.byte[i:i+1] + +  . VX1_Int_Ty<1860, "vslv", int_ppc_altivec_vslv, v16i8>; +    VX1_Int_Ty<1796, "vsrv", int_ppc_altivec_vsrv, v16i8>; + +- Vector Multiply-by-10 (& Write Carry) Unsigned Quadword: +  vmul10uq vmul10cuq +  . Use intrinsic: +    VX1_Int_Ty<513, "vmul10uq",   int_ppc_altivec_vmul10uq,  v1i128>; +    VX1_Int_Ty<  1, "vmul10cuq",  int_ppc_altivec_vmul10cuq, v1i128>; + +- Vector Multiply-by-10 Extended (& Write Carry) Unsigned Quadword: +  vmul10euq vmul10ecuq +  . Use intrinsic: +    VX1_Int_Ty<577, "vmul10euq",  int_ppc_altivec_vmul10euq, v1i128>; +    VX1_Int_Ty< 65, "vmul10ecuq", int_ppc_altivec_vmul10ecuq, v1i128>; + +- Decimal Convert From/to National/Zoned/Signed-QWord: +  bcdcfn. bcdcfz. bcdctn. bcdctz. bcdcfsq. bcdctsq. +  . Use instrinstics: +    (set v1i128:$vD, (int_ppc_altivec_bcdcfno  v1i128:$vB, i1:$PS)) +    (set v1i128:$vD, (int_ppc_altivec_bcdcfzo  v1i128:$vB, i1:$PS)) +    (set v1i128:$vD, (int_ppc_altivec_bcdctno  v1i128:$vB)) +    (set v1i128:$vD, (int_ppc_altivec_bcdctzo  v1i128:$vB, i1:$PS)) +    (set v1i128:$vD, (int_ppc_altivec_bcdcfsqo v1i128:$vB, i1:$PS)) +    (set v1i128:$vD, (int_ppc_altivec_bcdctsqo v1i128:$vB)) + +- Decimal Copy-Sign/Set-Sign: bcdcpsgn. bcdsetsgn. +  . Use instrinstics: +    (set v1i128:$vD, (int_ppc_altivec_bcdcpsgno v1i128:$vA, v1i128:$vB)) +    (set v1i128:$vD, (int_ppc_altivec_bcdsetsgno v1i128:$vB, i1:$PS)) + +- Decimal Shift/Unsigned-Shift/Shift-and-Round: bcds. bcdus. bcdsr. +  . Use instrinstics: +    (set v1i128:$vD, (int_ppc_altivec_bcdso  v1i128:$vA, v1i128:$vB, i1:$PS)) +    (set v1i128:$vD, (int_ppc_altivec_bcduso v1i128:$vA, v1i128:$vB)) +    (set v1i128:$vD, (int_ppc_altivec_bcdsro v1i128:$vA, v1i128:$vB, i1:$PS)) + +  . Note! Their VA is accessed only 1 byte, i.e. VA.byte[7] + +- Decimal (Unsigned) Truncate: bcdtrunc. bcdutrunc. +  . Use instrinstics: +    (set v1i128:$vD, (int_ppc_altivec_bcdso  v1i128:$vA, v1i128:$vB, i1:$PS)) +    (set v1i128:$vD, (int_ppc_altivec_bcduso v1i128:$vA, v1i128:$vB)) + +  . Note! Their VA is accessed only 2 byte, i.e. VA.hword[3] (VA.bit[48:63]) + +VSX: +- QP Copy Sign: xscpsgnqp +  . Similar to xscpsgndp +  . (set f128:$vT, (fcopysign f128:$vB, f128:$vA) + +- QP Absolute/Negative-Absolute/Negate: xsabsqp xsnabsqp xsnegqp +  . Similar to xsabsdp/xsnabsdp/xsnegdp +  . (set f128:$vT, (fabs f128:$vB))             // xsabsqp +    (set f128:$vT, (fneg (fabs f128:$vB)))      // xsnabsqp +    (set f128:$vT, (fneg f128:$vB))             // xsnegqp + +- QP Add/Divide/Multiply/Subtract/Square-Root: +  xsaddqp xsdivqp xsmulqp xssubqp xssqrtqp +  . Similar to xsadddp +  . isCommutable = 1 +    (set f128:$vT, (fadd f128:$vA, f128:$vB))   // xsaddqp +    (set f128:$vT, (fmul f128:$vA, f128:$vB))   // xsmulqp + +  . isCommutable = 0 +    (set f128:$vT, (fdiv f128:$vA, f128:$vB))   // xsdivqp +    (set f128:$vT, (fsub f128:$vA, f128:$vB))   // xssubqp +    (set f128:$vT, (fsqrt f128:$vB)))           // xssqrtqp + +- Round to Odd of QP Add/Divide/Multiply/Subtract/Square-Root: +  xsaddqpo xsdivqpo xsmulqpo xssubqpo xssqrtqpo +  . Similar to xsrsqrtedp?? +      def XSRSQRTEDP : XX2Form<60, 74, +                               (outs vsfrc:$XT), (ins vsfrc:$XB), +                               "xsrsqrtedp $XT, $XB", IIC_VecFP, +                               [(set f64:$XT, (PPCfrsqrte f64:$XB))]>; + +  . 
Define DAG Node in PPCInstrInfo.td: +    def PPCfaddrto: SDNode<"PPCISD::FADDRTO", SDTFPBinOp, []>; +    def PPCfdivrto: SDNode<"PPCISD::FDIVRTO", SDTFPBinOp, []>; +    def PPCfmulrto: SDNode<"PPCISD::FMULRTO", SDTFPBinOp, []>; +    def PPCfsubrto: SDNode<"PPCISD::FSUBRTO", SDTFPBinOp, []>; +    def PPCfsqrtrto: SDNode<"PPCISD::FSQRTRTO", SDTFPUnaryOp, []>; + +    DAG patterns of each instruction (PPCInstrVSX.td): +    . isCommutable = 1 +      (set f128:$vT, (PPCfaddrto f128:$vA, f128:$vB))   // xsaddqpo +      (set f128:$vT, (PPCfmulrto f128:$vA, f128:$vB))   // xsmulqpo + +    . isCommutable = 0 +      (set f128:$vT, (PPCfdivrto f128:$vA, f128:$vB))   // xsdivqpo +      (set f128:$vT, (PPCfsubrto f128:$vA, f128:$vB))   // xssubqpo +      (set f128:$vT, (PPCfsqrtrto f128:$vB))            // xssqrtqpo + +- QP (Negative) Multiply-{Add/Subtract}: xsmaddqp xsmsubqp xsnmaddqp xsnmsubqp +  . Ref: xsmaddadp/xsmsubadp/xsnmaddadp/xsnmsubadp + +  . isCommutable = 1 +    // xsmaddqp +    [(set f128:$vT, (fma f128:$vA, f128:$vB, f128:$vTi))]>, +    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +    AltVSXFMARel; + +    // xsmsubqp +    [(set f128:$vT, (fma f128:$vA, f128:$vB, (fneg f128:$vTi)))]>, +    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +    AltVSXFMARel; + +    // xsnmaddqp +    [(set f128:$vT, (fneg (fma f128:$vA, f128:$vB, f128:$vTi)))]>, +    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +    AltVSXFMARel; + +    // xsnmsubqp +    [(set f128:$vT, (fneg (fma f128:$vA, f128:$vB, (fneg f128:$vTi))))]>, +    RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +    AltVSXFMARel; + +- Round to Odd of QP (Negative) Multiply-{Add/Subtract}: +  xsmaddqpo xsmsubqpo xsnmaddqpo xsnmsubqpo +  . Similar to xsrsqrtedp?? + +  . Define DAG Node in PPCInstrInfo.td: +    def PPCfmarto: SDNode<"PPCISD::FMARTO", SDTFPTernaryOp, []>; + +    It looks like we only need to define "PPCfmarto" for these instructions, +    because according to PowerISA_V3.0, these instructions perform RTO on +    fma's result: +        xsmaddqp(o) +        v      ← bfp_MULTIPLY_ADD(src1, src3, src2) +        rnd    ← bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v) +        result ← bfp_CONVERT_TO_BFP128(rnd) + +        xsmsubqp(o) +        v      ← bfp_MULTIPLY_ADD(src1, src3, bfp_NEGATE(src2)) +        rnd    ← bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v) +        result ← bfp_CONVERT_TO_BFP128(rnd) + +        xsnmaddqp(o) +        v      ← bfp_MULTIPLY_ADD(src1,src3,src2) +        rnd    ← bfp_NEGATE(bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v)) +        result ← bfp_CONVERT_TO_BFP128(rnd) + +        xsnmsubqp(o) +        v      ← bfp_MULTIPLY_ADD(src1, src3, bfp_NEGATE(src2)) +        rnd    ← bfp_NEGATE(bfp_ROUND_TO_BFP128(RO, FPSCR.RN, v)) +        result ← bfp_CONVERT_TO_BFP128(rnd) + +    DAG patterns of each instruction (PPCInstrVSX.td): +    . 
isCommutable = 1 +      // xsmaddqpo +      [(set f128:$vT, (PPCfmarto f128:$vA, f128:$vB, f128:$vTi))]>, +      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +      AltVSXFMARel; + +      // xsmsubqpo +      [(set f128:$vT, (PPCfmarto f128:$vA, f128:$vB, (fneg f128:$vTi)))]>, +      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +      AltVSXFMARel; + +      // xsnmaddqpo +      [(set f128:$vT, (fneg (PPCfmarto f128:$vA, f128:$vB, f128:$vTi)))]>, +      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +      AltVSXFMARel; + +      // xsnmsubqpo +      [(set f128:$vT, (fneg (PPCfmarto f128:$vA, f128:$vB, (fneg f128:$vTi))))]>, +      RegConstraint<"$vTi = $vT">, NoEncode<"$vTi">, +      AltVSXFMARel; + +- QP Compare Ordered/Unordered: xscmpoqp xscmpuqp +  . ref: XSCMPUDP +      def XSCMPUDP : XX3Form_1<60, 35, +                               (outs crrc:$crD), (ins vsfrc:$XA, vsfrc:$XB), +                               "xscmpudp $crD, $XA, $XB", IIC_FPCompare, []>; + +  . No SDAG, intrinsic, builtin are required?? +    Or llvm fcmp order/unorder compare?? + +- DP/QP Compare Exponents: xscmpexpdp xscmpexpqp +  . No SDAG, intrinsic, builtin are required? + +- DP Compare ==, >=, >, !=: xscmpeqdp xscmpgedp xscmpgtdp xscmpnedp +  . I checked existing instruction "XSCMPUDP". They are different in target +    register. "XSCMPUDP" write to CR field, xscmp*dp write to VSX register + +  . Use instrinsic: +    (set i128:$XT, (int_ppc_vsx_xscmpeqdp f64:$XA, f64:$XB)) +    (set i128:$XT, (int_ppc_vsx_xscmpgedp f64:$XA, f64:$XB)) +    (set i128:$XT, (int_ppc_vsx_xscmpgtdp f64:$XA, f64:$XB)) +    (set i128:$XT, (int_ppc_vsx_xscmpnedp f64:$XA, f64:$XB)) + +- Vector Compare Not Equal: xvcmpnedp xvcmpnedp. xvcmpnesp xvcmpnesp. +  . Similar to xvcmpeqdp: +      defm XVCMPEQDP : XX3Form_Rcr<60, 99, +                                 "xvcmpeqdp", "$XT, $XA, $XB", IIC_VecFPCompare, +                                 int_ppc_vsx_xvcmpeqdp, v2i64, v2f64>; + +  . So we should use "XX3Form_Rcr" to implement instrinsic + +- Convert DP -> QP: xscvdpqp +  . Similar to XSCVDPSP: +      def XSCVDPSP : XX2Form<60, 265, +                          (outs vsfrc:$XT), (ins vsfrc:$XB), +                          "xscvdpsp $XT, $XB", IIC_VecFP, []>; +  . So, No SDAG, intrinsic, builtin are required?? + +- Round & Convert QP -> DP (dword[1] is set to zero): xscvqpdp xscvqpdpo +  . Similar to XSCVDPSP +  . No SDAG, intrinsic, builtin are required?? + +- Truncate & Convert QP -> (Un)Signed (D)Word (dword[1] is set to zero): +  xscvqpsdz xscvqpswz xscvqpudz xscvqpuwz +  . According to PowerISA_V3.0, these are similar to "XSCVDPSXDS", "XSCVDPSXWS", +    "XSCVDPUXDS", "XSCVDPUXWS" + +  . DAG patterns: +    (set f128:$XT, (PPCfctidz f128:$XB))    // xscvqpsdz +    (set f128:$XT, (PPCfctiwz f128:$XB))    // xscvqpswz +    (set f128:$XT, (PPCfctiduz f128:$XB))   // xscvqpudz +    (set f128:$XT, (PPCfctiwuz f128:$XB))   // xscvqpuwz + +- Convert (Un)Signed DWord -> QP: xscvsdqp xscvudqp +  . Similar to XSCVSXDSP +  . (set f128:$XT, (PPCfcfids f64:$XB))     // xscvsdqp +    (set f128:$XT, (PPCfcfidus f64:$XB))    // xscvudqp + +- (Round &) Convert DP <-> HP: xscvdphp xscvhpdp +  . Similar to XSCVDPSP +  . No SDAG, intrinsic, builtin are required?? + +- Vector HP -> SP: xvcvhpsp xvcvsphp +  . Similar to XVCVDPSP: +      def XVCVDPSP : XX2Form<60, 393, +                          (outs vsrc:$XT), (ins vsrc:$XB), +                          "xvcvdpsp $XT, $XB", IIC_VecFP, []>; +  . No SDAG, intrinsic, builtin are required?? 
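For reference, the source-level shape that the QP <-> integer conversion
entries above would eventually match, assuming the front end maps
__float128 to the f128 type used in these patterns (the helper names are
illustrative and this is only a sketch, not existing support):

    long long  qp_to_i64(__float128 x) { return (long long)x;  }   // xscvqpsdz
    __float128 i64_to_qp(long long x)  { return (__float128)x; }   // xscvsdqp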
+ +- Round to Quad-Precision Integer: xsrqpi xsrqpix +  . These are combination of "XSRDPI", "XSRDPIC", "XSRDPIM", .., because you +    need to assign rounding mode in instruction +  . Provide builtin? +    (set f128:$vT, (int_ppc_vsx_xsrqpi f128:$vB)) +    (set f128:$vT, (int_ppc_vsx_xsrqpix f128:$vB)) + +- Round Quad-Precision to Double-Extended Precision (fp80): xsrqpxp +  . Provide builtin? +    (set f128:$vT, (int_ppc_vsx_xsrqpxp f128:$vB)) + +Fixed Point Facility: + +- Exploit cmprb and cmpeqb (perhaps for something like +  isalpha/isdigit/isupper/islower and isspace respectivelly). This can +  perhaps be done through a builtin. + +- Provide testing for cnttz[dw] +- Insert Exponent DP/QP: xsiexpdp xsiexpqp +  . Use intrinsic? +  . xsiexpdp: +    // Note: rA and rB are the unsigned integer value. +    (set f128:$XT, (int_ppc_vsx_xsiexpdp i64:$rA, i64:$rB)) + +  . xsiexpqp: +    (set f128:$vT, (int_ppc_vsx_xsiexpqp f128:$vA, f64:$vB)) + +- Extract Exponent/Significand DP/QP: xsxexpdp xsxsigdp xsxexpqp xsxsigqp +  . Use intrinsic? +  . (set i64:$rT, (int_ppc_vsx_xsxexpdp f64$XB))    // xsxexpdp +    (set i64:$rT, (int_ppc_vsx_xsxsigdp f64$XB))    // xsxsigdp +    (set f128:$vT, (int_ppc_vsx_xsxexpqp f128$vB))  // xsxexpqp +    (set f128:$vT, (int_ppc_vsx_xsxsigqp f128$vB))  // xsxsigqp + +- Vector Insert Word: xxinsertw +  - Useful for inserting f32/i32 elements into vectors (the element to be +    inserted needs to be prepared) +  . Note: llvm has insertelem in "Vector Operations" +    ; yields <n x <ty>> +    <result> = insertelement <n x <ty>> <val>, <ty> <elt>, <ty2> <idx> + +    But how to map to it?? +    [(set v1f128:$XT, (insertelement v1f128:$XTi, f128:$XB, i4:$UIMM))]>, +    RegConstraint<"$XTi = $XT">, NoEncode<"$XTi">, + +  . Or use intrinsic? +    (set v1f128:$XT, (int_ppc_vsx_xxinsertw v1f128:$XTi, f128:$XB, i4:$UIMM)) + +- Vector Extract Unsigned Word: xxextractuw +  - Not useful for extraction of f32 from v4f32 (the current pattern is better - +    shift->convert) +  - It is useful for (uint_to_fp (vector_extract v4i32, N)) +  - Unfortunately, it can't be used for (sint_to_fp (vector_extract v4i32, N)) +  . Note: llvm has extractelement in "Vector Operations" +    ; yields <ty> +    <result> = extractelement <n x <ty>> <val>, <ty2> <idx> + +    How to map to it?? +    [(set f128:$XT, (extractelement v1f128:$XB, i4:$UIMM))] + +  . Or use intrinsic? +    (set f128:$XT, (int_ppc_vsx_xxextractuw v1f128:$XB, i4:$UIMM)) + +- Vector Insert Exponent DP/SP: xviexpdp xviexpsp +  . Use intrinsic +    (set v2f64:$XT, (int_ppc_vsx_xviexpdp v2f64:$XA, v2f64:$XB)) +    (set v4f32:$XT, (int_ppc_vsx_xviexpsp v4f32:$XA, v4f32:$XB)) + +- Vector Extract Exponent/Significand DP/SP: xvxexpdp xvxexpsp xvxsigdp xvxsigsp +  . Use intrinsic +    (set v2f64:$XT, (int_ppc_vsx_xvxexpdp v2f64:$XB)) +    (set v4f32:$XT, (int_ppc_vsx_xvxexpsp v4f32:$XB)) +    (set v2f64:$XT, (int_ppc_vsx_xvxsigdp v2f64:$XB)) +    (set v4f32:$XT, (int_ppc_vsx_xvxsigsp v4f32:$XB)) + +- Test Data Class SP/DP/QP: xststdcsp xststdcdp xststdcqp +  . No SDAG, intrinsic, builtin are required? +    Because it seems that we have no way to map BF field? + +    Instruction Form: [PO T XO B XO BX TX] +    Asm: xststd* BF,XB,DCMX + +    BF is an index to CR register field. + +- Vector Test Data Class SP/DP: xvtstdcsp xvtstdcdp +  . 
Use intrinsic +    (set v4f32:$XT, (int_ppc_vsx_xvtstdcsp v4f32:$XB, i7:$DCMX)) +    (set v2f64:$XT, (int_ppc_vsx_xvtstdcdp v2f64:$XB, i7:$DCMX)) + +- Maximum/Minimum Type-C/Type-J DP: xsmaxcdp xsmaxjdp xsmincdp xsminjdp +  . PowerISA_V3.0: +    "xsmaxcdp can be used to implement the C/C++/Java conditional operation +     (x>y)?x:y for single-precision and double-precision arguments." + +    Note! c type and j type have different behavior when: +    1. Either input is NaN +    2. Both input are +-Infinity, +-Zero + +  . dtype map to llvm fmaxnum/fminnum +    jtype use intrinsic + +  . xsmaxcdp xsmincdp +    (set f64:$XT, (fmaxnum f64:$XA, f64:$XB)) +    (set f64:$XT, (fminnum f64:$XA, f64:$XB)) + +  . xsmaxjdp xsminjdp +    (set f64:$XT, (int_ppc_vsx_xsmaxjdp f64:$XA, f64:$XB)) +    (set f64:$XT, (int_ppc_vsx_xsminjdp f64:$XA, f64:$XB)) + +- Vector Byte-Reverse H/W/D/Q Word: xxbrh xxbrw xxbrd xxbrq +  . Use intrinsic +    (set v8i16:$XT, (int_ppc_vsx_xxbrh v8i16:$XB)) +    (set v4i32:$XT, (int_ppc_vsx_xxbrw v4i32:$XB)) +    (set v2i64:$XT, (int_ppc_vsx_xxbrd v2i64:$XB)) +    (set v1i128:$XT, (int_ppc_vsx_xxbrq v1i128:$XB)) + +- Vector Permute: xxperm xxpermr +  . I have checked "PPCxxswapd" in PPCInstrVSX.td, but they are different +  . Use intrinsic +    (set v16i8:$XT, (int_ppc_vsx_xxperm v16i8:$XA, v16i8:$XB)) +    (set v16i8:$XT, (int_ppc_vsx_xxpermr v16i8:$XA, v16i8:$XB)) + +- Vector Splat Immediate Byte: xxspltib +  . Similar to XXSPLTW: +      def XXSPLTW : XX2Form_2<60, 164, +                           (outs vsrc:$XT), (ins vsrc:$XB, u2imm:$UIM), +                           "xxspltw $XT, $XB, $UIM", IIC_VecPerm, []>; + +  . No SDAG, intrinsic, builtin are required? + +- Load/Store Vector: lxv stxv +  . Has likely SDAG match: +    (set v?:$XT, (load ix16addr:$src)) +    (set v?:$XT, (store ix16addr:$dst)) + +  . Need define ix16addr in PPCInstrInfo.td +    ix16addr: 16-byte aligned, see "def memrix16" in PPCInstrInfo.td + +- Load/Store Vector Indexed: lxvx stxvx +  . Has likely SDAG match: +    (set v?:$XT, (load xoaddr:$src)) +    (set v?:$XT, (store xoaddr:$dst)) + +- Load/Store DWord: lxsd stxsd +  . Similar to lxsdx/stxsdx: +    def LXSDX : XX1Form<31, 588, +                        (outs vsfrc:$XT), (ins memrr:$src), +                        "lxsdx $XT, $src", IIC_LdStLFD, +                        [(set f64:$XT, (load xoaddr:$src))]>; + +  . (set f64:$XT, (load iaddrX4:$src)) +    (set f64:$XT, (store iaddrX4:$dst)) + +- Load/Store SP, with conversion from/to DP: lxssp stxssp +  . Similar to lxsspx/stxsspx: +    def LXSSPX : XX1Form<31, 524, (outs vssrc:$XT), (ins memrr:$src), +                         "lxsspx $XT, $src", IIC_LdStLFD, +                         [(set f32:$XT, (load xoaddr:$src))]>; + +  . (set f32:$XT, (load iaddrX4:$src)) +    (set f32:$XT, (store iaddrX4:$dst)) + +- Load as Integer Byte/Halfword & Zero Indexed: lxsibzx lxsihzx +  . Similar to lxsiwzx: +    def LXSIWZX : XX1Form<31, 12, (outs vsfrc:$XT), (ins memrr:$src), +                          "lxsiwzx $XT, $src", IIC_LdStLFD, +                          [(set f64:$XT, (PPClfiwzx xoaddr:$src))]>; + +  . (set f64:$XT, (PPClfiwzx xoaddr:$src)) + +- Store as Integer Byte/Halfword Indexed: stxsibx stxsihx +  . Similar to stxsiwx: +    def STXSIWX : XX1Form<31, 140, (outs), (ins vsfrc:$XT, memrr:$dst), +                          "stxsiwx $XT, $dst", IIC_LdStSTFD, +                          [(PPCstfiwx f64:$XT, xoaddr:$dst)]>; + +  . 
(PPCstfiwx f64:$XT, xoaddr:$dst) + +- Load Vector Halfword*8/Byte*16 Indexed: lxvh8x lxvb16x +  . Similar to lxvd2x/lxvw4x: +    def LXVD2X : XX1Form<31, 844, +                         (outs vsrc:$XT), (ins memrr:$src), +                         "lxvd2x $XT, $src", IIC_LdStLFD, +                         [(set v2f64:$XT, (int_ppc_vsx_lxvd2x xoaddr:$src))]>; + +  . (set v8i16:$XT, (int_ppc_vsx_lxvh8x xoaddr:$src)) +    (set v16i8:$XT, (int_ppc_vsx_lxvb16x xoaddr:$src)) + +- Store Vector Halfword*8/Byte*16 Indexed: stxvh8x stxvb16x +  . Similar to stxvd2x/stxvw4x: +    def STXVD2X : XX1Form<31, 972, +                         (outs), (ins vsrc:$XT, memrr:$dst), +                         "stxvd2x $XT, $dst", IIC_LdStSTFD, +                         [(store v2f64:$XT, xoaddr:$dst)]>; + +  . (store v8i16:$XT, xoaddr:$dst) +    (store v16i8:$XT, xoaddr:$dst) + +- Load/Store Vector (Left-justified) with Length: lxvl lxvll stxvl stxvll +  . Likely needs an intrinsic +  . (set v?:$XT, (int_ppc_vsx_lxvl xoaddr:$src)) +    (set v?:$XT, (int_ppc_vsx_lxvll xoaddr:$src)) + +  . (int_ppc_vsx_stxvl xoaddr:$dst)) +    (int_ppc_vsx_stxvll xoaddr:$dst)) + +- Load Vector Word & Splat Indexed: lxvwsx +  . Likely needs an intrinsic +  . (set v?:$XT, (int_ppc_vsx_lxvwsx xoaddr:$src)) + +Atomic operations (l[dw]at, st[dw]at): +- Provide custom lowering for common atomic operations to use these +  instructions with the correct Function Code +- Ensure the operands are in the correct register (i.e. RT+1, RT+2) +- Provide builtins since not all FC's necessarily have an existing LLVM +  atomic operation + +Load Doubleword Monitored (ldmx): +- Investigate whether there are any uses for this. It seems to be related to +  Garbage Collection so it isn't likely to be all that useful for most +  languages we deal with. + +Move to CR from XER Extended (mcrxrx): +- Is there a use for this in LLVM? + +Fixed Point Facility: + +- Copy-Paste Facility: copy copy_first cp_abort paste paste. paste_last +  . Use instrinstics: +    (int_ppc_copy_first i32:$rA, i32:$rB) +    (int_ppc_copy i32:$rA, i32:$rB) + +    (int_ppc_paste i32:$rA, i32:$rB) +    (int_ppc_paste_last i32:$rA, i32:$rB) + +    (int_cp_abort) + +- Message Synchronize: msgsync +- SLB*: slbieg slbsync +- stop +  . No instrinstics diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/ya.make b/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/ya.make index 9903560dccf..68badb4490f 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/ya.make +++ b/contrib/libs/llvm12/lib/Target/PowerPC/TargetInfo/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/lib/Support diff --git a/contrib/libs/llvm12/lib/Target/PowerPC/ya.make b/contrib/libs/llvm12/lib/Target/PowerPC/ya.make index a6812524a81..8c7039a5758 100644 --- a/contrib/libs/llvm12/lib/Target/PowerPC/ya.make +++ b/contrib/libs/llvm12/lib/Target/PowerPC/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -   PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/README.txt b/contrib/libs/llvm12/lib/Target/README.txt index d918287ed05..e172abbbd85 100644 --- a/contrib/libs/llvm12/lib/Target/README.txt +++ b/contrib/libs/llvm12/lib/Target/README.txt @@ -1,2279 +1,2279 @@ -Target Independent Opportunities:  -  -//===---------------------------------------------------------------------===//  -  -We should recognized various "overflow detection" idioms and translate them into  -llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom:  -  -unsigned int mul(unsigned int a,unsigned int b) {  - if ((unsigned long long)a*b>0xffffffff)  -   exit(0);  -  return a*b;  -}  -  -The legalization code for mul-with-overflow needs to be made more robust before  -this can be implemented though.  -  -//===---------------------------------------------------------------------===//  -  -Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and  -precision don't matter (ffastmath).  Misc/mandel will like this. :)  This isn't  -safe in general, even on darwin.  See the libm implementation of hypot for  -examples (which special case when x/y are exactly zero to get signed zeros etc  -right).  -  -//===---------------------------------------------------------------------===//  -  -On targets with expensive 64-bit multiply, we could LSR this:  -  -for (i = ...; ++i) {  -   x = 1ULL << i;  -  -into:  - long long tmp = 1;  - for (i = ...; ++i, tmp+=tmp)  -   x = tmp;  -  -This would be a win on ppc32, but not x86 or ppc64.  -  -//===---------------------------------------------------------------------===//  -  -Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)  -  -//===---------------------------------------------------------------------===//  -  -Reassociate should turn things like:  -  -int factorial(int X) {  - return X*X*X*X*X*X*X*X;  -}  -  -into llvm.powi calls, allowing the code generator to produce balanced  -multiplication trees.  
-  -First, the intrinsic needs to be extended to support integers, and second the  -code generator needs to be enhanced to lower these to multiplication trees.  -  -//===---------------------------------------------------------------------===//  -  -Interesting? testcase for add/shift/mul reassoc:  -  -int bar(int x, int y) {  -  return x*x*x+y+x*x*x*x*x*y*y*y*y;  -}  -int foo(int z, int n) {  -  return bar(z, n) + bar(2*z, 2*n);  -}  -  -This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The issue  -is that we end up getting t = 2*X  s = t*t   and don't turn this into 4*X*X,  -which is the same number of multiplies and is canonical, because the 2*X has  -multiple uses.  Here's a simple example:  -  -define i32 @test15(i32 %X1) {  -  %B = mul i32 %X1, 47   ; X1*47  -  %C = mul i32 %B, %B  -  ret i32 %C  -}  -  -  -//===---------------------------------------------------------------------===//  -  -Reassociate should handle the example in GCC PR16157:  -  -extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;   -void f () {  /* this can be optimized to four additions... */   -        b4 = a4 + a3 + a2 + a1 + a0;   -        b3 = a3 + a2 + a1 + a0;   -        b2 = a2 + a1 + a0;   -        b1 = a1 + a0;   -}   -  -This requires reassociating to forms of expressions that are already available,  -something that reassoc doesn't think about yet.  -  -  -//===---------------------------------------------------------------------===//  -  -These two functions should generate the same code on big-endian systems:  -  -int g(int *j,int *l)  {  return memcmp(j,l,4);  }  -int h(int *j, int *l) {  return *j - *l; }  -  -this could be done in SelectionDAGISel.cpp, along with other special cases,  -for 1,2,4,8 bytes.  -  -//===---------------------------------------------------------------------===//  -  -It would be nice to revert this patch:  -http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html  -  -And teach the dag combiner enough to simplify the code expanded before   -legalize.  It seems plausible that this knowledge would let it simplify other  -stuff too.  -  -//===---------------------------------------------------------------------===//  -  -For vector types, DataLayout.cpp::getTypeInfo() returns alignment that is equal  -to the type size. It works but can be overly conservative as the alignment of  -specific vector types are target dependent.  -  -//===---------------------------------------------------------------------===//  -  -We should produce an unaligned load from code like this:  -  -v4sf example(float *P) {  -  return (v4sf){P[0], P[1], P[2], P[3] };  -}  -  -//===---------------------------------------------------------------------===//  -  -Add support for conditional increments, and other related patterns.  Instead  -of:  -  -	movl 136(%esp), %eax  -	cmpl $0, %eax  -	je LBB16_2	#cond_next  -LBB16_1:	#cond_true  -	incl _foo  -LBB16_2:	#cond_next  -  -emit:  -	movl	_foo, %eax  -	cmpl	$1, %edi  -	sbbl	$-1, %eax  -	movl	%eax, _foo  -  -//===---------------------------------------------------------------------===//  -  -Combine: a = sin(x), b = cos(x) into a,b = sincos(x).  -  -Expand these to calls of sin/cos and stores:  -      double sincos(double x, double *sin, double *cos);  -      float sincosf(float x, float *sin, float *cos);  -      long double sincosl(long double x, long double *sin, long double *cos);  -  -Doing so could allow SROA of the destination pointers.  
See also:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687  -  -This is now easily doable with MRVs.  We could even make an intrinsic for this  -if anyone cared enough about sincos.  -  -//===---------------------------------------------------------------------===//  -  -quantum_sigma_x in 462.libquantum contains the following loop:  -  -      for(i=0; i<reg->size; i++)  -	{  -	  /* Flip the target bit of each basis state */  -	  reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);  -	}   -  -Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just  -so cool to turn it into something like:  -  -   long long Res = ((MAX_UNSIGNED) 1 << target);  -   if (target < 32) {  -     for(i=0; i<reg->size; i++)  -       reg->node[i].state ^= Res & 0xFFFFFFFFULL;  -   } else {  -     for(i=0; i<reg->size; i++)  -       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL  -   }  -     -... which would only do one 32-bit XOR per loop iteration instead of two.  -  -It would also be nice to recognize the reg->size doesn't alias reg->node[i],  -but this requires TBAA.  -  -//===---------------------------------------------------------------------===//  -  -This isn't recognized as bswap by instcombine (yes, it really is bswap):  -  -unsigned long reverse(unsigned v) {  -    unsigned t;  -    t = v ^ ((v << 16) | (v >> 16));  -    t &= ~0xff0000;  -    v = (v << 24) | (v >> 8);  -    return v ^ (t >> 8);  -}  -  -//===---------------------------------------------------------------------===//  -  -[LOOP DELETION]  -  -We don't delete this output free loop, because trip count analysis doesn't  -realize that it is finite (if it were infinite, it would be undefined).  Not  -having this blocks Loop Idiom from matching strlen and friends.    -  -void foo(char *C) {  -  int x = 0;  -  while (*C)  -    ++x,++C;  -}  -  -//===---------------------------------------------------------------------===//  -  -[LOOP RECOGNITION]  -  -These idioms should be recognized as popcount (see PR1488):  -  -unsigned countbits_slow(unsigned v) {  -  unsigned c;  -  for (c = 0; v; v >>= 1)  -    c += v & 1;  -  return c;  -}  -  -unsigned int popcount(unsigned int input) {  -  unsigned int count = 0;  -  for (unsigned int i =  0; i < 4 * 8; i++)  -    count += (input >> i) & i;  -  return count;  -}  -  -This should be recognized as CLZ:  rdar://8459039  -  -unsigned clz_a(unsigned a) {  -  int i;  -  for (i=0;i<32;i++)  -    if (a & (1<<(31-i)))  -      return i;  -  return 32;  -}  -  -This sort of thing should be added to the loop idiom pass.  -  -//===---------------------------------------------------------------------===//  -  -These should turn into single 16-bit (unaligned?) loads on little/big endian  -processors.  -  -unsigned short read_16_le(const unsigned char *adr) {  -  return adr[0] | (adr[1] << 8);  -}  -unsigned short read_16_be(const unsigned char *adr) {  -  return (adr[0] << 8) | adr[1];  -}  -  -//===---------------------------------------------------------------------===//  -  --instcombine should handle this transform:  -   icmp pred (sdiv X / C1 ), C2  -when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.   -  -Currently InstCombine avoids this transform but will do it when the signs of  -the operands and the sign of the divide match. See the FIXME in   -InstructionCombining.cpp in the visitSetCondInst method after the switch case   -for Instruction::UDiv (around line 4447) for more details.  
-  -The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of  -this construct.   -  -//===---------------------------------------------------------------------===//  -  -[LOOP OPTIMIZATION]  -  -SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization  -opportunities in its double_array_divs_variable function: it needs loop  -interchange, memory promotion (which LICM already does), vectorization and  -variable trip count loop unrolling (since it has a constant trip count). ICC  -apparently produces this very nice code with -ffast-math:  -  -..B1.70:                        # Preds ..B1.70 ..B1.69  -       mulpd     %xmm0, %xmm1                                  #108.2  -       mulpd     %xmm0, %xmm1                                  #108.2  -       mulpd     %xmm0, %xmm1                                  #108.2  -       mulpd     %xmm0, %xmm1                                  #108.2  -       addl      $8, %edx                                      #  -       cmpl      $131072, %edx                                 #108.2  -       jb        ..B1.70       # Prob 99%                      #108.2  -  -It would be better to count down to zero, but this is a lot better than what we  -do.  -  -//===---------------------------------------------------------------------===//  -  -Consider:  -  -typedef unsigned U32;  -typedef unsigned long long U64;  -int test (U32 *inst, U64 *regs) {  -    U64 effective_addr2;  -    U32 temp = *inst;  -    int r1 = (temp >> 20) & 0xf;  -    int b2 = (temp >> 16) & 0xf;  -    effective_addr2 = temp & 0xfff;  -    if (b2) effective_addr2 += regs[b2];  -    b2 = (temp >> 12) & 0xf;  -    if (b2) effective_addr2 += regs[b2];  -    effective_addr2 &= regs[4];  -     if ((effective_addr2 & 3) == 0)  -        return 1;  -    return 0;  -}  -  -Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,  -we don't eliminate the computation of the top half of effective_addr2 because  -we don't have whole-function selection dags.  On x86, this means we use one  -extra register for the function when effective_addr2 is declared as U64 than  -when it is declared U32.  -  -PHI Slicing could be extended to do this.  -  -//===---------------------------------------------------------------------===//  -  -Tail call elim should be more aggressive, checking to see if the call is  -followed by an uncond branch to an exit block.  -  -; This testcase is due to tail-duplication not wanting to copy the return  -; instruction into the terminating blocks because there was other code  -; optimized out of the function after the taildup happened.  
-; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call  -  -define i32 @t4(i32 %a) {  -entry:  -	%tmp.1 = and i32 %a, 1		; <i32> [#uses=1]  -	%tmp.2 = icmp ne i32 %tmp.1, 0		; <i1> [#uses=1]  -	br i1 %tmp.2, label %then.0, label %else.0  -  -then.0:		; preds = %entry  -	%tmp.5 = add i32 %a, -1		; <i32> [#uses=1]  -	%tmp.3 = call i32 @t4( i32 %tmp.5 )		; <i32> [#uses=1]  -	br label %return  -  -else.0:		; preds = %entry  -	%tmp.7 = icmp ne i32 %a, 0		; <i1> [#uses=1]  -	br i1 %tmp.7, label %then.1, label %return  -  -then.1:		; preds = %else.0  -	%tmp.11 = add i32 %a, -2		; <i32> [#uses=1]  -	%tmp.9 = call i32 @t4( i32 %tmp.11 )		; <i32> [#uses=1]  -	br label %return  -  -return:		; preds = %then.1, %else.0, %then.0  -	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],  -                            [ %tmp.9, %then.1 ]  -	ret i32 %result.0  -}  -  -//===---------------------------------------------------------------------===//  -  -Tail recursion elimination should handle:  -  -int pow2m1(int n) {  - if (n == 0)  -   return 0;  - return 2 * pow2m1 (n - 1) + 1;  -}  -  -Also, multiplies can be turned into SHL's, so they should be handled as if  -they were associative.  "return foo() << 1" can be tail recursion eliminated.  -  -//===---------------------------------------------------------------------===//  -  -Argument promotion should promote arguments for recursive functions, like   -this:  -  -; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val  -  -define internal i32 @foo(i32* %x) {  -entry:  -	%tmp = load i32* %x		; <i32> [#uses=0]  -	%tmp.foo = call i32 @foo( i32* %x )		; <i32> [#uses=1]  -	ret i32 %tmp.foo  -}  -  -define i32 @bar(i32* %x) {  -entry:  -	%tmp3 = call i32 @foo( i32* %x )		; <i32> [#uses=1]  -	ret i32 %tmp3  -}  -  -//===---------------------------------------------------------------------===//  -  -We should investigate an instruction sinking pass.  Consider this silly  -example in pic mode:  -  -#include <assert.h>  -void foo(int x) {  -  assert(x);  -  //...  -}  -  -we compile this to:  -_foo:  -	subl	$28, %esp  -	call	"L1$pb"  -"L1$pb":  -	popl	%eax  -	cmpl	$0, 32(%esp)  -	je	LBB1_2	# cond_true  -LBB1_1:	# return  -	# ...  -	addl	$28, %esp  -	ret  -LBB1_2:	# cond_true  -...  -  -The PIC base computation (call+popl) is only used on one path through the   -code, but is currently always computed in the entry block.  It would be   -better to sink the picbase computation down into the block for the   -assertion, as it is the only one that uses it.  This happens for a lot of   -code with early outs.  -  -Another example is loads of arguments, which are usually emitted into the   -entry block on targets like x86.  If not used in all paths through a   -function, they should be sunk into the ones that do.  -  -In this case, whole-function-isel would also handle this.  -  -//===---------------------------------------------------------------------===//  -  -Investigate lowering of sparse switch statements into perfect hash tables:  -http://burtleburtle.net/bob/hash/perfect.html  -  -//===---------------------------------------------------------------------===//  -  -We should turn things like "load+fabs+store" and "load+fneg+store" into the  -corresponding integer operations.  
On a yonah, this loop:  -  -double a[256];  -void foo() {  -  int i, b;  -  for (b = 0; b < 10000000; b++)  -  for (i = 0; i < 256; i++)  -    a[i] = -a[i];  -}  -  -is twice as slow as this loop:  -  -long long a[256];  -void foo() {  -  int i, b;  -  for (b = 0; b < 10000000; b++)  -  for (i = 0; i < 256; i++)  -    a[i] ^= (1ULL << 63);  -}  -  -and I suspect other processors are similar.  On X86 in particular this is a  -big win because doing this with integers allows the use of read/modify/write  -instructions.  -  -//===---------------------------------------------------------------------===//  -  -DAG Combiner should try to combine small loads into larger loads when   -profitable.  For example, we compile this C++ example:  -  -struct THotKey { short Key; bool Control; bool Shift; bool Alt; };  -extern THotKey m_HotKey;  -THotKey GetHotKey () { return m_HotKey; }  -  -into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):  -  -__Z9GetHotKeyv:                         ## @_Z9GetHotKeyv  -	movq	_m_HotKey@GOTPCREL(%rip), %rax  -	movzwl	(%rax), %ecx  -	movzbl	2(%rax), %edx  -	shlq	$16, %rdx  -	orq	%rcx, %rdx  -	movzbl	3(%rax), %ecx  -	shlq	$24, %rcx  -	orq	%rdx, %rcx  -	movzbl	4(%rax), %eax  -	shlq	$32, %rax  -	orq	%rcx, %rax  -	ret  -  -//===---------------------------------------------------------------------===//  -  -We should add an FRINT node to the DAG to model targets that have legal  -implementations of ceil/floor/rint.  -  -//===---------------------------------------------------------------------===//  -  -Consider:  -  -int test() {  -  long long input[8] = {1,0,1,0,1,0,1,0};  -  foo(input);  -}  -  -Clang compiles this into:  -  -  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)  -  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0  -  store i64 1, i64* %0, align 16  -  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2  -  store i64 1, i64* %1, align 16  -  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4  -  store i64 1, i64* %2, align 16  -  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6  -  store i64 1, i64* %3, align 16  -  -Which gets codegen'd into:  -  -	pxor	%xmm0, %xmm0  -	movaps	%xmm0, -16(%rbp)  -	movaps	%xmm0, -32(%rbp)  -	movaps	%xmm0, -48(%rbp)  -	movaps	%xmm0, -64(%rbp)  -	movq	$1, -64(%rbp)  -	movq	$1, -48(%rbp)  -	movq	$1, -32(%rbp)  -	movq	$1, -16(%rbp)  -  -It would be better to have 4 movq's of 0 instead of the movaps's.  -  -//===---------------------------------------------------------------------===//  -  -http://llvm.org/PR717:  -  -The following code should compile into "ret int undef". Instead, LLVM  -produces "ret int 0":  -  -int f() {  -  int x = 4;  -  int y;  -  if (x == 3) y = 0;  -  return y;  -}  -  -//===---------------------------------------------------------------------===//  -  -The loop unroller should partially unroll loops (instead of peeling them)  -when code growth isn't too bad and when an unroll count allows simplification  -of some code within the loop.  One trivial example is:  -  -#include <stdio.h>  -int main() {  -    int nRet = 17;  -    int nLoop;  -    for ( nLoop = 0; nLoop < 1000; nLoop++ ) {  -        if ( nLoop & 1 )  -            nRet += 2;  -        else  -            nRet -= 1;  -    }  -    return nRet;  -}  -  -Unrolling by 2 would eliminate the '&1' in both copies, leading to a net  -reduction in code size.  The resultant code would then also be suitable for  -exit value computation.  
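For illustration, here is a hand-unrolled-by-2 form of that loop (just a sketch of the shape, not actual unroller output): each pair of iterations has known parity, so the '& 1' test and its branch disappear entirely:

#include <stdio.h>
int main() {
    int nRet = 17;
    int nLoop;
    /* trip count 1000 is even, so no remainder loop is needed */
    for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
        nRet -= 1;      /* even iteration:  nLoop    & 1 == 0 */
        nRet += 2;      /* odd iteration:  (nLoop+1) & 1 == 1 */
    }
    /* net +1 per pair of iterations, so exit value computation
       can fold the whole thing to 17 + 500 == 517 */
    return nRet;
}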
-  -//===---------------------------------------------------------------------===//  -  -We miss a bunch of rotate opportunities on various targets, including ppc, x86,  -etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate  -matching code in dag combine doesn't look through truncates aggressively   -enough.  Here are some testcases reduces from GCC PR17886:  -  -unsigned long long f5(unsigned long long x, unsigned long long y) {  -  return (x << 8) | ((y >> 48) & 0xffull);  -}  -unsigned long long f6(unsigned long long x, unsigned long long y, int z) {  -  switch(z) {  -  case 1:  -    return (x << 8) | ((y >> 48) & 0xffull);  -  case 2:  -    return (x << 16) | ((y >> 40) & 0xffffull);  -  case 3:  -    return (x << 24) | ((y >> 32) & 0xffffffull);  -  case 4:  -    return (x << 32) | ((y >> 24) & 0xffffffffull);  -  default:  -    return (x << 40) | ((y >> 16) & 0xffffffffffull);  -  }  -}  -  -//===---------------------------------------------------------------------===//  -  -This (and similar related idioms):  -  -unsigned int foo(unsigned char i) {  -  return i | (i<<8) | (i<<16) | (i<<24);  -}   -  -compiles into:  -  -define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {  -entry:  -  %conv = zext i8 %i to i32  -  %shl = shl i32 %conv, 8  -  %shl5 = shl i32 %conv, 16  -  %shl9 = shl i32 %conv, 24  -  %or = or i32 %shl9, %conv  -  %or6 = or i32 %or, %shl5  -  %or10 = or i32 %or6, %shl  -  ret i32 %or10  -}  -  -it would be better as:  -  -unsigned int bar(unsigned char i) {  -  unsigned int j=i | (i << 8);   -  return j | (j<<16);  -}  -  -aka:  -  -define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {  -entry:  -  %conv = zext i8 %i to i32  -  %shl = shl i32 %conv, 8  -  %or = or i32 %shl, %conv  -  %shl5 = shl i32 %or, 16  -  %or6 = or i32 %shl5, %or  -  ret i32 %or6  -}  -  -or even i*0x01010101, depending on the speed of the multiplier.  The best way to  -handle this is to canonicalize it to a multiply in IR and have codegen handle  -lowering multiplies to shifts on cpus where shifts are faster.  -  -//===---------------------------------------------------------------------===//  -  -We do a number of simplifications in simplify libcalls to strength reduce  -standard library functions, but we don't currently merge them together.  For  -example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only  -be done safely if "b" isn't modified between the strlen and memcpy of course.  -  -//===---------------------------------------------------------------------===//  -  -We compile this program: (from GCC PR11680)  -http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487  -  -Into code that runs the same speed in fast/slow modes, but both modes run 2x  -slower than when compile with GCC (either 4.0 or 4.2):  -  -$ llvm-g++ perf.cpp -O3 -fno-exceptions  -$ time ./a.out fast  -1.821u 0.003s 0:01.82 100.0%	0+0k 0+0io 0pf+0w  -  -$ g++ perf.cpp -O3 -fno-exceptions  -$ time ./a.out fast  -0.821u 0.001s 0:00.82 100.0%	0+0k 0+0io 0pf+0w  -  -It looks like we are making the same inlining decisions, so this may be raw  -codegen badness or something else (haven't investigated).  -  -//===---------------------------------------------------------------------===//  -  -Divisibility by constant can be simplified (according to GCC PR12849) from  -being a mulhi to being a mul lo (cheaper).  
Testcase:  -  -void bar(unsigned n) {  -  if (n % 3 == 0)  -    true();  -}  -  -This is equivalent to the following, where 2863311531 is the multiplicative  -inverse of 3, and 1431655766 is ((2^32)-1)/3+1:  -void bar(unsigned n) {  -  if (n * 2863311531U < 1431655766U)  -    true();  -}  -  -The same transformation can work with an even modulo with the addition of a  -rotate: rotate the result of the multiply to the right by the number of bits  -which need to be zero for the condition to be true, and shrink the compare RHS  -by the same amount.  Unless the target supports rotates, though, that  -transformation probably isn't worthwhile.  -  -The transformation can also easily be made to work with non-zero equality  -comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".  -  -//===---------------------------------------------------------------------===//  -  -Better mod/ref analysis for scanf would allow us to eliminate the vtable and a  -bunch of other stuff from this example (see PR1604):   -  -#include <cstdio>  -struct test {  -    int val;  -    virtual ~test() {}  -};  -  -int main() {  -    test t;  -    std::scanf("%d", &t.val);  -    std::printf("%d\n", t.val);  -}  -  -//===---------------------------------------------------------------------===//  -  -These functions perform the same computation, but produce different assembly.  -  -define i8 @select(i8 %x) readnone nounwind {  -  %A = icmp ult i8 %x, 250  -  %B = select i1 %A, i8 0, i8 1  -  ret i8 %B   -}  -  -define i8 @addshr(i8 %x) readnone nounwind {  -  %A = zext i8 %x to i9  -  %B = add i9 %A, 6       ;; 256 - 250 == 6  -  %C = lshr i9 %B, 8  -  %D = trunc i9 %C to i8  -  ret i8 %D  -}  -  -//===---------------------------------------------------------------------===//  -  -From gcc bug 24696:  -int  -f (unsigned long a, unsigned long b, unsigned long c)  -{  -  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);  -}  -int  -f (unsigned long a, unsigned long b, unsigned long c)  -{  -  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);  -}  -Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with  -"clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -From GCC Bug 20192:  -#define PMD_MASK    (~((1UL << 23) - 1))  -void clear_pmd_range(unsigned long start, unsigned long end)  -{  -   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))  -       f();  -}  -The expression should optimize to something like  -"!((start|end)&~PMD_MASK). Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return  -i;}  -unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}  -These should combine to the same thing.  Currently, the first function  -produces better code on X86.  -  -//===---------------------------------------------------------------------===//  -  -From GCC Bug 15784:  -#define abs(x) x>0?x:-x  -int f(int x, int y)  -{  - return (abs(x)) >= 0;  -}  -This should optimize to x == INT_MIN. (With -fwrapv.)  Currently not  -optimized with "clang -emit-llvm-bc | opt -O3".  
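Spelling the case analysis out (assuming -fwrapv semantics for the negation): for x > 0 the expression is x >= 0, which always holds; for INT_MIN < x <= 0 it is -x >= 0, which also holds; only for x == INT_MIN does -x wrap back to INT_MIN and the comparison fail.  So the whole function hinges on a single compare against INT_MIN, roughly:

#include <limits.h>
/* Value-equivalent sketch under -fwrapv: the result is false only when
   x == INT_MIN, so one comparison suffices. */
int f(int x, int y)
{
 (void)y;               /* y is unused, as in the original */
 return x != INT_MIN;
}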
-  -//===---------------------------------------------------------------------===//  -  -From GCC Bug 14753:  -void  -rotate_cst (unsigned int a)  -{  - a = (a << 10) | (a >> 22);  - if (a == 123)  -   bar ();  -}  -void  -minus_cst (unsigned int a)  -{  - unsigned int tem;  -  - tem = 20 - a;  - if (tem == 5)  -   bar ();  -}  -void  -mask_gt (unsigned int a)  -{  - /* This is equivalent to a > 15.  */  - if ((a & ~7) > 8)  -   bar ();  -}  -void  -rshift_gt (unsigned int a)  -{  - /* This is equivalent to a > 23.  */  - if ((a >> 2) > 5)  -   bar ();  -}  -  -All should simplify to a single comparison.  All of these are  -currently not optimized with "clang -emit-llvm-bc | opt  --O3".  -  -//===---------------------------------------------------------------------===//  -  -From GCC Bug 32605:  -int c(int* x) {return (char*)x+2 == (char*)x;}  -Should combine to 0.  Currently not optimized with "clang  --emit-llvm-bc | opt -O3" (although llc can optimize it).  -  -//===---------------------------------------------------------------------===//  -  -int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}  -Should be combined to  "((b >> 1) | b) & 1".  Currently not optimized  -with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}  -Should combine to "x | (y & 3)".  Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}  -Should fold to "(~a & c) | (a & b)".  Currently not optimized with  -"clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int a,int b) {return (~(a|b))|a;}  -Should fold to "a|~b".  Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int a, int b) {return (a&&b) || (a&&!b);}  -Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc  -| opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int a, int b, int c) {return (a&&b) || (!a&&c);}  -Should fold to "a ? b : c", or at least something sane.  Currently not  -optimized with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}  -Should fold to a && (b || c).  Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int x) {return x | ((x & 8) ^ 8);}  -Should combine to x | 8.  Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int x) {return x ^ ((x & 8) ^ 8);}  -Should also combine to x | 8.  Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int a(int x) {return ((x | -9) ^ 8) & x;}  -Should combine to x & -9.  Currently not optimized with "clang  --emit-llvm-bc | opt -O3".  
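These last few bit-twiddling folds are easy to sanity-check by brute force; a minimal throwaway harness (illustration only, not a proposed regression test) would be:

#include <assert.h>
int main(void)
{
  int x;
  for (x = -4096; x <= 4096; ++x) {
    assert((x | ((x & 8) ^ 8)) == (x | 8));      /* x | ((x&8)^8)   ->  x | 8  */
    assert((x ^ ((x & 8) ^ 8)) == (x | 8));      /* x ^ ((x&8)^8)   ->  x | 8  */
    assert((((x | -9) ^ 8) & x) == (x & -9));    /* ((x|-9)^8) & x  ->  x & -9 */
  }
  return 0;
}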
-  -//===---------------------------------------------------------------------===//  -  -unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}  -Should combine to "a * 0x88888888 >> 31".  Currently not optimized  -with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -unsigned a(char* x) {if ((*x & 32) == 0) return b();}  -There's an unnecessary zext in the generated code with "clang  --emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -unsigned a(unsigned long long x) {return 40 * (x >> 1);}  -Should combine to "20 * (((unsigned)x) & -2)".  Currently not  -optimized with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int g(int x) { return (x - 10) < 0; }  -Should combine to "x <= 9" (the sub has nsw).  Currently not  -optimized with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int g(int x) { return (x + 10) < 0; }  -Should combine to "x < -10" (the add has nsw).  Currently not  -optimized with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -int f(int i, int j) { return i < j + 1; }  -int g(int i, int j) { return j > i - 1; }  -Should combine to "i <= j" (the add/sub has nsw).  Currently not  -optimized with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -unsigned f(unsigned x) { return ((x & 7) + 1) & 15; }  -The & 15 part should be optimized away, it doesn't change the result. Currently  -not optimized with "clang -emit-llvm-bc | opt -O3".  -  -//===---------------------------------------------------------------------===//  -  -This was noticed in the entryblock for grokdeclarator in 403.gcc:  -  -        %tmp = icmp eq i32 %decl_context, 4            -        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context   -        %tmp1 = icmp eq i32 %decl_context_addr.0, 1   -        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0  -  -tmp1 should be simplified to something like:  -  (!tmp || decl_context == 1)  -  -This allows recursive simplifications, tmp1 is used all over the place in  -the function, e.g. by:  -  -        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1]  -        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1]  -        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]  -  -later.  -  -//===---------------------------------------------------------------------===//  -  -[STORE SINKING]  -  -Store sinking: This code:  -  -void f (int n, int *cond, int *res) {  -    int i;  -    *res = 0;  -    for (i = 0; i < n; i++)  -        if (*cond)  -            *res ^= 234; /* (*) */  -}  -  -On this function GVN hoists the fully redundant value of *res, but nothing  -moves the store out.  
This gives us this code:  -  -bb:		; preds = %bb2, %entry  -	%.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]	  -	%i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]  -	%1 = load i32* %cond, align 4  -	%2 = icmp eq i32 %1, 0  -	br i1 %2, label %bb2, label %bb1  -  -bb1:		; preds = %bb  -	%3 = xor i32 %.rle, 234	  -	store i32 %3, i32* %res, align 4  -	br label %bb2  -  -bb2:		; preds = %bb, %bb1  -	%.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]	  -	%indvar.next = add i32 %i.05, 1	  -	%exitcond = icmp eq i32 %indvar.next, %n  -	br i1 %exitcond, label %return, label %bb  -  -DSE should sink partially dead stores to get the store out of the loop.  -  -Here's another partial dead case:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395  -  -//===---------------------------------------------------------------------===//  -  -Scalar PRE hoists the mul in the common block up to the else:  -  -int test (int a, int b, int c, int g) {  -  int d, e;  -  if (a)  -    d = b * c;  -  else  -    d = b - c;  -  e = b * c + g;  -  return d + e;  -}  -  -It would be better to do the mul once to reduce codesize above the if.  -This is GCC PR38204.  -  -  -//===---------------------------------------------------------------------===//  -This simple function from 179.art:  -  -int winner, numf2s;  -struct { double y; int   reset; } *Y;  -  -void find_match() {  -   int i;  -   winner = 0;  -   for (i=0;i<numf2s;i++)  -       if (Y[i].y > Y[winner].y)  -              winner =i;  -}  -  -Compiles into (with clang TBAA):  -  -for.body:                                         ; preds = %for.inc, %bb.nph  -  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]  -  %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]  -  %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0  -  %tmp5 = load double* %tmp4, align 8, !tbaa !4  -  %idxprom7 = sext i32 %i.01718 to i64  -  %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0  -  %tmp11 = load double* %tmp10, align 8, !tbaa !4  -  %cmp12 = fcmp ogt double %tmp5, %tmp11  -  br i1 %cmp12, label %if.then, label %for.inc  -  -if.then:                                          ; preds = %for.body  -  %i.017 = trunc i64 %indvar to i32  -  br label %for.inc  -  -for.inc:                                          ; preds = %for.body, %if.then  -  %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]  -  %indvar.next = add i64 %indvar, 1  -  %exitcond = icmp eq i64 %indvar.next, %tmp22  -  br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body  -  -  -It is good that we hoisted the reloads of numf2's, and Y out of the loop and  -sunk the store to winner out.  -  -However, this is awful on several levels: the conditional truncate in the loop  -(-indvars at fault? why can't we completely promote the IV to i64?).  -  -Beyond that, we have a partially redundant load in the loop: if "winner" (aka   -%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.  -Similarly, the addressing that feeds it (including the sext) is redundant. 
In  -the end we get this generated assembly:  -  -LBB0_2:                                 ## %for.body  -                                        ## =>This Inner Loop Header: Depth=1  -	movsd	(%rdi), %xmm0  -	movslq	%edx, %r8  -	shlq	$4, %r8  -	ucomisd	(%rcx,%r8), %xmm0  -	jbe	LBB0_4  -	movl	%esi, %edx  -LBB0_4:                                 ## %for.inc  -	addq	$16, %rdi  -	incq	%rsi  -	cmpq	%rsi, %rax  -	jne	LBB0_2  -  -All things considered this isn't too bad, but we shouldn't need the movslq or  -the shlq instruction, or the load folded into ucomisd every time through the  -loop.  -  -On an x86-specific topic, if the loop can't be restructure, the movl should be a  -cmov.  -  -//===---------------------------------------------------------------------===//  -  -[STORE SINKING]  -  -GCC PR37810 is an interesting case where we should sink load/store reload  -into the if block and outside the loop, so we don't reload/store it on the  -non-call path.  -  -for () {  -  *P += 1;  -  if ()  -    call();  -  else  -    ...  -->  -tmp = *P  -for () {  -  tmp += 1;  -  if () {  -    *P = tmp;  -    call();  -    tmp = *P;  -  } else ...  -}  -*P = tmp;  -  -We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but  -we don't sink the store.  We need partially dead store sinking.  -  -//===---------------------------------------------------------------------===//  -  -[LOAD PRE CRIT EDGE SPLITTING]  -  -GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack  -leading to excess stack traffic. This could be handled by GVN with some crazy  -symbolic phi translation.  The code we get looks like (g is on the stack):  -  -bb2:		; preds = %bb1  -..  -	%9 = getelementptr %struct.f* %g, i32 0, i32 0		  -	store i32 %8, i32* %9, align  bel %bb3  -  -bb3:		; preds = %bb1, %bb2, %bb  -	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]  -	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]  -	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0  -	%11 = load i32* %10, align 4  -  -%11 is partially redundant, an in BB2 it should have the value %8.  -  -GCC PR33344 and PR35287 are similar cases.  -  -  -//===---------------------------------------------------------------------===//  -  -[LOAD PRE]  -  -There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the  -GCC testsuite, ones we don't get yet are (checked through loadpre25):  -  -[CRIT EDGE BREAKING]  -predcom-4.c  -  -[PRE OF READONLY CALL]  -loadpre5.c  -  -[TURN SELECT INTO BRANCH]  -loadpre14.c loadpre15.c   -  -actually a conditional increment: loadpre18.c loadpre19.c  -  -//===---------------------------------------------------------------------===//  -  -[LOAD PRE / STORE SINKING / SPEC HACK]  -  -This is a chunk of code from 456.hmmer:  -  -int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,  -     int *tpdm, int xmb, int *bp, int *ms) {  - int k, sc;  - for (k = 1; k <= M; k++) {  -     mc[k] = mpp[k-1]   + tpmm[k-1];  -     if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;  -     if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;  -     if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;  -     mc[k] += ms[k];  -   }  -}  -  -It is very profitable for this benchmark to turn the conditional stores to mc[k]  -into a conditional move (select instr in IR) and allow the final store to do the  -store.  See GCC PR27313 for more details.  
Note that this is valid to xform even  -with the new C++ memory model, since mc[k] is previously loaded and later  -stored.  -  -//===---------------------------------------------------------------------===//  -  -[SCALAR PRE]  -There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the  -GCC testsuite.  -  -//===---------------------------------------------------------------------===//  -  -There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the  -GCC testsuite.  For example, we get the first example in predcom-1.c, but   -miss the second one:  -  -unsigned fib[1000];  -unsigned avg[1000];  -  -__attribute__ ((noinline))  -void count_averages(int n) {  -  int i;  -  for (i = 1; i < n; i++)  -    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;  -}  -  -which compiles into two loads instead of one in the loop.  -  -predcom-2.c is the same as predcom-1.c  -  -predcom-3.c is very similar but needs loads feeding each other instead of  -store->load.  -  -  -//===---------------------------------------------------------------------===//  -  -[ALIAS ANALYSIS]  -  -Type based alias analysis:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705  -  -We should do better analysis of posix_memalign.  At the least it should  -no-capture its pointer argument, at best, we should know that the out-value  -result doesn't point to anything (like malloc).  One example of this is in  -SingleSource/Benchmarks/Misc/dt.c  -  -//===---------------------------------------------------------------------===//  -  -Interesting missed case because of control flow flattening (should be 2 loads):  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629  -With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |   -             opt -mem2reg -gvn -instcombine | llvm-dis  -we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT  -VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS  -  -//===---------------------------------------------------------------------===//  -  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633  -We could eliminate the branch condition here, loading from null is undefined:  -  -struct S { int w, x, y, z; };  -struct T { int r; struct S s; };  -void bar (struct S, int);  -void foo (int a, struct T b)  -{  -  struct S *c = 0;  -  if (a)  -    c = &b.s;  -  bar (*c, a);  -}  -  -//===---------------------------------------------------------------------===//  -  -simplifylibcalls should do several optimizations for strspn/strcspn:  -  -strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):  -  -size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,  -                     int __reject3) {  -  register size_t __result = 0;  -  while (__s[__result] != '\0' && __s[__result] != __reject1 &&  -         __s[__result] != __reject2 && __s[__result] != __reject3)  -    ++__result;  -  return __result;  -}  -  -This should turn into a switch on the character.  See PR3253 for some notes on  -codegen.  -  -456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.  -  -//===---------------------------------------------------------------------===//  -  -simplifylibcalls should turn these snprintf idioms into memcpy (GCC PR47917)  -  -char buf1[6], buf2[6], buf3[4], buf4[4];  -int i;  -  -int foo (void) {  -  int ret = snprintf (buf1, sizeof buf1, "abcde");  -  ret += snprintf (buf2, sizeof buf2, "abcdef") * 16;  -  ret += snprintf (buf3, sizeof buf3, "%s", i++ < 6 ? 
"abc" : "def") * 256;  -  ret += snprintf (buf4, sizeof buf4, "%s", i++ > 10 ? "abcde" : "defgh")*4096;  -  return ret;  -}  -  -//===---------------------------------------------------------------------===//  -  -"gas" uses this idiom:  -  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))  -..  -  else if (strchr ("<>", *intel_parser.op_string)  -  -Those should be turned into a switch.  SimplifyLibCalls only gets the second  -case.  -  -//===---------------------------------------------------------------------===//  -  -252.eon contains this interesting code:  -  -        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0  -        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind  -        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1  -        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen  -        call void @llvm.memcpy.i32(i8* %endptr,   -          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)  -        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly   -          -This is interesting for a couple reasons.  First, in this:  -  -The memcpy+strlen strlen can be replaced with:  -  -        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly   -  -Because the destination was just copied into the specified memory buffer.  This,  -in turn, can be constant folded to "4".  -  -In other code, it contains:  -  -        %endptr6978 = bitcast i8* %endptr69 to i32*              -        store i32 7107374, i32* %endptr6978, align 1  -        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly      -  -Which could also be constant folded.  Whatever is producing this should probably  -be fixed to leave this as a memcpy from a string.  -  -Further, eon also has an interesting partially redundant strlen call:  -  -bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit  -        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]  -        %683 = load i8** %682, align 4          ; <i8*> [#uses=4]  -        %684 = load i8* %683, align 1           ; <i8> [#uses=1]  -        %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1]  -        br i1 %685, label %bb10, label %bb9  -  -bb9:            ; preds = %bb8  -        %686 = call i32 @strlen(i8* %683) nounwind readonly            -        %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1]  -        br i1 %687, label %bb10, label %bb11  -  -bb10:           ; preds = %bb9, %bb8  -        %688 = call i32 @strlen(i8* %683) nounwind readonly            -  -This could be eliminated by doing the strlen once in bb8, saving code size and  -improving perf on the bb8->9->10 path.  -  -//===---------------------------------------------------------------------===//  -  -I see an interesting fully redundant call to strlen left in 186.crafty:InputMove  -which looks like:  -       %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0   -   -  -bb62:           ; preds = %bb55, %bb53  -        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]               -        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1  -        %172 = add i32 %171, -1         ; <i32> [#uses=1]  -        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172         -  -...  no stores ...  
-       br i1 %or.cond, label %bb65, label %bb72  -  -bb65:           ; preds = %bb62  -        store i8 0, i8* %173, align 1  -        br label %bb72  -  -bb72:           ; preds = %bb65, %bb62  -        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]              -        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1  -  -Note that on the bb62->bb72 path, that the %177 strlen call is partially  -redundant with the %171 call.  At worst, we could shove the %177 strlen call  -up into the bb65 block moving it out of the bb62->bb72 path.   However, note  -that bb65 stores to the string, zeroing out the last byte.  This means that on  -that path the value of %177 is actually just %171-1.  A sub is cheaper than a  -strlen!  -  -This pattern repeats several times, basically doing:  -  -  A = strlen(P);  -  P[A-1] = 0;  -  B = strlen(P);  -  where it is "obvious" that B = A-1.  -  -//===---------------------------------------------------------------------===//  -  -186.crafty has this interesting pattern with the "out.4543" variable:  -  -call void @llvm.memcpy.i32(  -        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),  -       i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)   -%101 = call@printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind   -  -It is basically doing:  -  -  memcpy(globalarray, "string");  -  printf(...,  globalarray);  +Target Independent Opportunities: + +//===---------------------------------------------------------------------===// + +We should recognized various "overflow detection" idioms and translate them into +llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom: + +unsigned int mul(unsigned int a,unsigned int b) { + if ((unsigned long long)a*b>0xffffffff) +   exit(0); +  return a*b; +} + +The legalization code for mul-with-overflow needs to be made more robust before +this can be implemented though. + +//===---------------------------------------------------------------------===// + +Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and +precision don't matter (ffastmath).  Misc/mandel will like this. :)  This isn't +safe in general, even on darwin.  See the libm implementation of hypot for +examples (which special case when x/y are exactly zero to get signed zeros etc +right). + +//===---------------------------------------------------------------------===// + +On targets with expensive 64-bit multiply, we could LSR this: + +for (i = ...; ++i) { +   x = 1ULL << i; + +into: + long long tmp = 1; + for (i = ...; ++i, tmp+=tmp) +   x = tmp; + +This would be a win on ppc32, but not x86 or ppc64. + +//===---------------------------------------------------------------------===// + +Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0) + +//===---------------------------------------------------------------------===// + +Reassociate should turn things like: + +int factorial(int X) { + return X*X*X*X*X*X*X*X; +} + +into llvm.powi calls, allowing the code generator to produce balanced +multiplication trees. + +First, the intrinsic needs to be extended to support integers, and second the +code generator needs to be enhanced to lower these to multiplication trees. + +//===---------------------------------------------------------------------===// + +Interesting? 
testcase for add/shift/mul reassoc: + +int bar(int x, int y) { +  return x*x*x+y+x*x*x*x*x*y*y*y*y; +} +int foo(int z, int n) { +  return bar(z, n) + bar(2*z, 2*n); +} + +This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The issue +is that we end up getting t = 2*X  s = t*t   and don't turn this into 4*X*X, +which is the same number of multiplies and is canonical, because the 2*X has +multiple uses.  Here's a simple example: + +define i32 @test15(i32 %X1) { +  %B = mul i32 %X1, 47   ; X1*47 +  %C = mul i32 %B, %B +  ret i32 %C +} + + +//===---------------------------------------------------------------------===// + +Reassociate should handle the example in GCC PR16157: + +extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;  +void f () {  /* this can be optimized to four additions... */  +        b4 = a4 + a3 + a2 + a1 + a0;  +        b3 = a3 + a2 + a1 + a0;  +        b2 = a2 + a1 + a0;  +        b1 = a1 + a0;  +}  + +This requires reassociating to forms of expressions that are already available, +something that reassoc doesn't think about yet. + + +//===---------------------------------------------------------------------===// + +These two functions should generate the same code on big-endian systems: + +int g(int *j,int *l)  {  return memcmp(j,l,4);  } +int h(int *j, int *l) {  return *j - *l; } + +this could be done in SelectionDAGISel.cpp, along with other special cases, +for 1,2,4,8 bytes. + +//===---------------------------------------------------------------------===// + +It would be nice to revert this patch: +http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html + +And teach the dag combiner enough to simplify the code expanded before  +legalize.  It seems plausible that this knowledge would let it simplify other +stuff too. + +//===---------------------------------------------------------------------===// + +For vector types, DataLayout.cpp::getTypeInfo() returns alignment that is equal +to the type size. It works but can be overly conservative as the alignment of +specific vector types are target dependent. + +//===---------------------------------------------------------------------===// + +We should produce an unaligned load from code like this: + +v4sf example(float *P) { +  return (v4sf){P[0], P[1], P[2], P[3] }; +} + +//===---------------------------------------------------------------------===// + +Add support for conditional increments, and other related patterns.  Instead +of: + +	movl 136(%esp), %eax +	cmpl $0, %eax +	je LBB16_2	#cond_next +LBB16_1:	#cond_true +	incl _foo +LBB16_2:	#cond_next + +emit: +	movl	_foo, %eax +	cmpl	$1, %edi +	sbbl	$-1, %eax +	movl	%eax, _foo + +//===---------------------------------------------------------------------===// + +Combine: a = sin(x), b = cos(x) into a,b = sincos(x). + +Expand these to calls of sin/cos and stores: +      double sincos(double x, double *sin, double *cos); +      float sincosf(float x, float *sin, float *cos); +      long double sincosl(long double x, long double *sin, long double *cos); + +Doing so could allow SROA of the destination pointers.  See also: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687 + +This is now easily doable with MRVs.  We could even make an intrinsic for this +if anyone cared enough about sincos. 
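For concreteness, the source-level pattern the combine would target looks like this (sketch only; assumes a GNU-style sincos() or the proposed intrinsic is available on the target):

#include <math.h>
void polar_to_xy(double r, double theta, double *px, double *py)
{
  /* Today: two independent libcalls that share the same argument. */
  double s = sin(theta);
  double c = cos(theta);
  /* The proposed combine would merge the pair into a single
     sincos(theta, &s, &c) call (or intrinsic) computing both results. */
  *px = r * c;
  *py = r * s;
}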
+ +//===---------------------------------------------------------------------===// + +quantum_sigma_x in 462.libquantum contains the following loop: + +      for(i=0; i<reg->size; i++) +	{ +	  /* Flip the target bit of each basis state */ +	  reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target); +	}  + +Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be just +so cool to turn it into something like: + +   long long Res = ((MAX_UNSIGNED) 1 << target); +   if (target < 32) { +     for(i=0; i<reg->size; i++) +       reg->node[i].state ^= Res & 0xFFFFFFFFULL; +   } else { +     for(i=0; i<reg->size; i++) +       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL +   } -Anyway, by knowing that printf just reads the memory and forward substituting  -the string directly into the printf, this eliminates reads from globalarray.  -Since this pattern occurs frequently in crafty (due to the "DisplayTime" and  -other similar functions) there are many stores to "out".  Once all the printfs  -stop using "out", all that is left is the memcpy's into it.  This should allow  -globalopt to remove the "stored only" global.  -  -//===---------------------------------------------------------------------===//  -  -This code:  -  -define inreg i32 @foo(i8* inreg %p) nounwind {  -  %tmp0 = load i8* %p  -  %tmp1 = ashr i8 %tmp0, 5  -  %tmp2 = sext i8 %tmp1 to i32  -  ret i32 %tmp2  -}  -  -could be dagcombine'd to a sign-extending load with a shift.  -For example, on x86 this currently gets this:  -  -	movb	(%eax), %al  -	sarb	$5, %al  -	movsbl	%al, %eax  -  -while it could get this:  -  -	movsbl	(%eax), %eax  -	sarl	$5, %eax  -  -//===---------------------------------------------------------------------===//  -  -GCC PR31029:  -  -int test(int x) { return 1-x == x; }     // --> return false  -int test2(int x) { return 2-x == x; }    // --> return x == 1 ?  -  -Always foldable for odd constants, what is the rule for even?  -  -//===---------------------------------------------------------------------===//  -  -PR 3381: GEP to field of size 0 inside a struct could be turned into GEP  -for next field in struct (which is at same address).  -  -For example: store of float into { {{}}, float } could be turned into a store to  -the float directly.  -  -//===---------------------------------------------------------------------===//  -  -The arg promotion pass should make use of nocapture to make its alias analysis  -stuff much more precise.  
-  -//===---------------------------------------------------------------------===//  -  -The following functions should be optimized to use a select instead of a  -branch (from gcc PR40072):  -  -char char_int(int m) {if(m>7) return 0; return m;}  -int int_char(char m) {if(m>7) return 0; return m;}  -  -//===---------------------------------------------------------------------===//  -  -int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }  -  -Generates this:  -  -define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {  -entry:  -  %0 = and i32 %a, 128                            ; <i32> [#uses=1]  -  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]  -  %2 = or i32 %b, 128                             ; <i32> [#uses=1]  -  %3 = and i32 %b, -129                           ; <i32> [#uses=1]  -  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]  -  ret i32 %b_addr.0  -}  -  -However, it's functionally equivalent to:  -  -         b = (b & ~0x80) | (a & 0x80);  -  -Which generates this:  -  -define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {  -entry:  -  %0 = and i32 %b, -129                           ; <i32> [#uses=1]  -  %1 = and i32 %a, 128                            ; <i32> [#uses=1]  -  %2 = or i32 %0, %1                              ; <i32> [#uses=1]  -  ret i32 %2  -}  -  -This can be generalized for other forms:  -  -     b = (b & ~0x80) | (a & 0x40) << 1;  -  -//===---------------------------------------------------------------------===//  -  -These two functions produce different code. They shouldn't:  -  -#include <stdint.h>  -   -uint8_t p1(uint8_t b, uint8_t a) {  -  b = (b & ~0xc0) | (a & 0xc0);  -  return (b);  -}  +... which would only do one 32-bit XOR per loop iteration instead of two. + +It would also be nice to recognize the reg->size doesn't alias reg->node[i], +but this requires TBAA. + +//===---------------------------------------------------------------------===// + +This isn't recognized as bswap by instcombine (yes, it really is bswap): + +unsigned long reverse(unsigned v) { +    unsigned t; +    t = v ^ ((v << 16) | (v >> 16)); +    t &= ~0xff0000; +    v = (v << 24) | (v >> 8); +    return v ^ (t >> 8); +} + +//===---------------------------------------------------------------------===// + +[LOOP DELETION] + +We don't delete this output free loop, because trip count analysis doesn't +realize that it is finite (if it were infinite, it would be undefined).  Not +having this blocks Loop Idiom from matching strlen and friends.   + +void foo(char *C) { +  int x = 0; +  while (*C) +    ++x,++C; +} + +//===---------------------------------------------------------------------===// + +[LOOP RECOGNITION] + +These idioms should be recognized as popcount (see PR1488): + +unsigned countbits_slow(unsigned v) { +  unsigned c; +  for (c = 0; v; v >>= 1) +    c += v & 1; +  return c; +} + +unsigned int popcount(unsigned int input) { +  unsigned int count = 0; +  for (unsigned int i =  0; i < 4 * 8; i++) +    count += (input >> i) & i; +  return count; +} + +This should be recognized as CLZ:  rdar://8459039 + +unsigned clz_a(unsigned a) { +  int i; +  for (i=0;i<32;i++) +    if (a & (1<<(31-i))) +      return i; +  return 32; +} + +This sort of thing should be added to the loop idiom pass. + +//===---------------------------------------------------------------------===// + +These should turn into single 16-bit (unaligned?) loads on little/big endian +processors. 
+ +unsigned short read_16_le(const unsigned char *adr) { +  return adr[0] | (adr[1] << 8); +} +unsigned short read_16_be(const unsigned char *adr) { +  return (adr[0] << 8) | adr[1]; +} + +//===---------------------------------------------------------------------===// + +-instcombine should handle this transform: +   icmp pred (sdiv X / C1 ), C2 +when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.  + +Currently InstCombine avoids this transform but will do it when the signs of +the operands and the sign of the divide match. See the FIXME in  +InstructionCombining.cpp in the visitSetCondInst method after the switch case  +for Instruction::UDiv (around line 4447) for more details. + +The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of +this construct.  + +//===---------------------------------------------------------------------===// + +[LOOP OPTIMIZATION] + +SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization +opportunities in its double_array_divs_variable function: it needs loop +interchange, memory promotion (which LICM already does), vectorization and +variable trip count loop unrolling (since it has a constant trip count). ICC +apparently produces this very nice code with -ffast-math: + +..B1.70:                        # Preds ..B1.70 ..B1.69 +       mulpd     %xmm0, %xmm1                                  #108.2 +       mulpd     %xmm0, %xmm1                                  #108.2 +       mulpd     %xmm0, %xmm1                                  #108.2 +       mulpd     %xmm0, %xmm1                                  #108.2 +       addl      $8, %edx                                      # +       cmpl      $131072, %edx                                 #108.2 +       jb        ..B1.70       # Prob 99%                      #108.2 + +It would be better to count down to zero, but this is a lot better than what we +do. + +//===---------------------------------------------------------------------===// + +Consider: + +typedef unsigned U32; +typedef unsigned long long U64; +int test (U32 *inst, U64 *regs) { +    U64 effective_addr2; +    U32 temp = *inst; +    int r1 = (temp >> 20) & 0xf; +    int b2 = (temp >> 16) & 0xf; +    effective_addr2 = temp & 0xfff; +    if (b2) effective_addr2 += regs[b2]; +    b2 = (temp >> 12) & 0xf; +    if (b2) effective_addr2 += regs[b2]; +    effective_addr2 &= regs[4]; +     if ((effective_addr2 & 3) == 0) +        return 1; +    return 0; +} + +Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems, +we don't eliminate the computation of the top half of effective_addr2 because +we don't have whole-function selection dags.  On x86, this means we use one +extra register for the function when effective_addr2 is declared as U64 than +when it is declared U32. + +PHI Slicing could be extended to do this. + +//===---------------------------------------------------------------------===// + +Tail call elim should be more aggressive, checking to see if the call is +followed by an uncond branch to an exit block. + +; This testcase is due to tail-duplication not wanting to copy the return +; instruction into the terminating blocks because there was other code +; optimized out of the function after the taildup happened. 
+; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call + +define i32 @t4(i32 %a) { +entry: +	%tmp.1 = and i32 %a, 1		; <i32> [#uses=1] +	%tmp.2 = icmp ne i32 %tmp.1, 0		; <i1> [#uses=1] +	br i1 %tmp.2, label %then.0, label %else.0 + +then.0:		; preds = %entry +	%tmp.5 = add i32 %a, -1		; <i32> [#uses=1] +	%tmp.3 = call i32 @t4( i32 %tmp.5 )		; <i32> [#uses=1] +	br label %return + +else.0:		; preds = %entry +	%tmp.7 = icmp ne i32 %a, 0		; <i1> [#uses=1] +	br i1 %tmp.7, label %then.1, label %return + +then.1:		; preds = %else.0 +	%tmp.11 = add i32 %a, -2		; <i32> [#uses=1] +	%tmp.9 = call i32 @t4( i32 %tmp.11 )		; <i32> [#uses=1] +	br label %return + +return:		; preds = %then.1, %else.0, %then.0 +	%result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ], +                            [ %tmp.9, %then.1 ] +	ret i32 %result.0 +} + +//===---------------------------------------------------------------------===// + +Tail recursion elimination should handle: + +int pow2m1(int n) { + if (n == 0) +   return 0; + return 2 * pow2m1 (n - 1) + 1; +} + +Also, multiplies can be turned into SHL's, so they should be handled as if +they were associative.  "return foo() << 1" can be tail recursion eliminated. + +//===---------------------------------------------------------------------===// + +Argument promotion should promote arguments for recursive functions, like  +this: + +; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val + +define internal i32 @foo(i32* %x) { +entry: +	%tmp = load i32* %x		; <i32> [#uses=0] +	%tmp.foo = call i32 @foo( i32* %x )		; <i32> [#uses=1] +	ret i32 %tmp.foo +} + +define i32 @bar(i32* %x) { +entry: +	%tmp3 = call i32 @foo( i32* %x )		; <i32> [#uses=1] +	ret i32 %tmp3 +} + +//===---------------------------------------------------------------------===// + +We should investigate an instruction sinking pass.  Consider this silly +example in pic mode: + +#include <assert.h> +void foo(int x) { +  assert(x); +  //... +} + +we compile this to: +_foo: +	subl	$28, %esp +	call	"L1$pb" +"L1$pb": +	popl	%eax +	cmpl	$0, 32(%esp) +	je	LBB1_2	# cond_true +LBB1_1:	# return +	# ... +	addl	$28, %esp +	ret +LBB1_2:	# cond_true +... + +The PIC base computation (call+popl) is only used on one path through the  +code, but is currently always computed in the entry block.  It would be  +better to sink the picbase computation down into the block for the  +assertion, as it is the only one that uses it.  This happens for a lot of  +code with early outs. + +Another example is loads of arguments, which are usually emitted into the  +entry block on targets like x86.  If not used in all paths through a  +function, they should be sunk into the ones that do. + +In this case, whole-function-isel would also handle this. + +//===---------------------------------------------------------------------===// + +Investigate lowering of sparse switch statements into perfect hash tables: +http://burtleburtle.net/bob/hash/perfect.html + +//===---------------------------------------------------------------------===// + +We should turn things like "load+fabs+store" and "load+fneg+store" into the +corresponding integer operations.  On a yonah, this loop: + +double a[256]; +void foo() { +  int i, b; +  for (b = 0; b < 10000000; b++) +  for (i = 0; i < 256; i++) +    a[i] = -a[i]; +} + +is twice as slow as this loop: + +long long a[256]; +void foo() { +  int i, b; +  for (b = 0; b < 10000000; b++) +  for (i = 0; i < 256; i++) +    a[i] ^= (1ULL << 63); +} + +and I suspect other processors are similar.  
On X86 in particular this is a +big win because doing this with integers allows the use of read/modify/write +instructions. + +//===---------------------------------------------------------------------===// + +DAG Combiner should try to combine small loads into larger loads when  +profitable.  For example, we compile this C++ example: + +struct THotKey { short Key; bool Control; bool Shift; bool Alt; }; +extern THotKey m_HotKey; +THotKey GetHotKey () { return m_HotKey; } + +into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer): + +__Z9GetHotKeyv:                         ## @_Z9GetHotKeyv +	movq	_m_HotKey@GOTPCREL(%rip), %rax +	movzwl	(%rax), %ecx +	movzbl	2(%rax), %edx +	shlq	$16, %rdx +	orq	%rcx, %rdx +	movzbl	3(%rax), %ecx +	shlq	$24, %rcx +	orq	%rdx, %rcx +	movzbl	4(%rax), %eax +	shlq	$32, %rax +	orq	%rcx, %rax +	ret + +//===---------------------------------------------------------------------===// + +We should add an FRINT node to the DAG to model targets that have legal +implementations of ceil/floor/rint. + +//===---------------------------------------------------------------------===// + +Consider: + +int test() { +  long long input[8] = {1,0,1,0,1,0,1,0}; +  foo(input); +} + +Clang compiles this into: + +  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false) +  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0 +  store i64 1, i64* %0, align 16 +  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2 +  store i64 1, i64* %1, align 16 +  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4 +  store i64 1, i64* %2, align 16 +  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6 +  store i64 1, i64* %3, align 16 + +Which gets codegen'd into: + +	pxor	%xmm0, %xmm0 +	movaps	%xmm0, -16(%rbp) +	movaps	%xmm0, -32(%rbp) +	movaps	%xmm0, -48(%rbp) +	movaps	%xmm0, -64(%rbp) +	movq	$1, -64(%rbp) +	movq	$1, -48(%rbp) +	movq	$1, -32(%rbp) +	movq	$1, -16(%rbp) + +It would be better to have 4 movq's of 0 instead of the movaps's. + +//===---------------------------------------------------------------------===// + +http://llvm.org/PR717: + +The following code should compile into "ret int undef". Instead, LLVM +produces "ret int 0": + +int f() { +  int x = 4; +  int y; +  if (x == 3) y = 0; +  return y; +} + +//===---------------------------------------------------------------------===// + +The loop unroller should partially unroll loops (instead of peeling them) +when code growth isn't too bad and when an unroll count allows simplification +of some code within the loop.  One trivial example is: + +#include <stdio.h> +int main() { +    int nRet = 17; +    int nLoop; +    for ( nLoop = 0; nLoop < 1000; nLoop++ ) { +        if ( nLoop & 1 ) +            nRet += 2; +        else +            nRet -= 1; +    } +    return nRet; +} + +Unrolling by 2 would eliminate the '&1' in both copies, leading to a net +reduction in code size.  The resultant code would then also be suitable for +exit value computation. + +//===---------------------------------------------------------------------===// + +We miss a bunch of rotate opportunities on various targets, including ppc, x86, +etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate +matching code in dag combine doesn't look through truncates aggressively  +enough.  
Here are some testcases reduces from GCC PR17886: + +unsigned long long f5(unsigned long long x, unsigned long long y) { +  return (x << 8) | ((y >> 48) & 0xffull); +} +unsigned long long f6(unsigned long long x, unsigned long long y, int z) { +  switch(z) { +  case 1: +    return (x << 8) | ((y >> 48) & 0xffull); +  case 2: +    return (x << 16) | ((y >> 40) & 0xffffull); +  case 3: +    return (x << 24) | ((y >> 32) & 0xffffffull); +  case 4: +    return (x << 32) | ((y >> 24) & 0xffffffffull); +  default: +    return (x << 40) | ((y >> 16) & 0xffffffffffull); +  } +} + +//===---------------------------------------------------------------------===// + +This (and similar related idioms): + +unsigned int foo(unsigned char i) { +  return i | (i<<8) | (i<<16) | (i<<24); +}  + +compiles into: + +define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone { +entry: +  %conv = zext i8 %i to i32 +  %shl = shl i32 %conv, 8 +  %shl5 = shl i32 %conv, 16 +  %shl9 = shl i32 %conv, 24 +  %or = or i32 %shl9, %conv +  %or6 = or i32 %or, %shl5 +  %or10 = or i32 %or6, %shl +  ret i32 %or10 +} + +it would be better as: + +unsigned int bar(unsigned char i) { +  unsigned int j=i | (i << 8);  +  return j | (j<<16); +} + +aka: + +define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone { +entry: +  %conv = zext i8 %i to i32 +  %shl = shl i32 %conv, 8 +  %or = or i32 %shl, %conv +  %shl5 = shl i32 %or, 16 +  %or6 = or i32 %shl5, %or +  ret i32 %or6 +} + +or even i*0x01010101, depending on the speed of the multiplier.  The best way to +handle this is to canonicalize it to a multiply in IR and have codegen handle +lowering multiplies to shifts on cpus where shifts are faster. + +//===---------------------------------------------------------------------===// + +We do a number of simplifications in simplify libcalls to strength reduce +standard library functions, but we don't currently merge them together.  For +example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only +be done safely if "b" isn't modified between the strlen and memcpy of course. + +//===---------------------------------------------------------------------===// + +We compile this program: (from GCC PR11680) +http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487 + +Into code that runs the same speed in fast/slow modes, but both modes run 2x +slower than when compile with GCC (either 4.0 or 4.2): + +$ llvm-g++ perf.cpp -O3 -fno-exceptions +$ time ./a.out fast +1.821u 0.003s 0:01.82 100.0%	0+0k 0+0io 0pf+0w + +$ g++ perf.cpp -O3 -fno-exceptions +$ time ./a.out fast +0.821u 0.001s 0:00.82 100.0%	0+0k 0+0io 0pf+0w + +It looks like we are making the same inlining decisions, so this may be raw +codegen badness or something else (haven't investigated). + +//===---------------------------------------------------------------------===// + +Divisibility by constant can be simplified (according to GCC PR12849) from +being a mulhi to being a mul lo (cheaper).  Testcase: + +void bar(unsigned n) { +  if (n % 3 == 0) +    true(); +} + +This is equivalent to the following, where 2863311531 is the multiplicative +inverse of 3, and 1431655766 is ((2^32)-1)/3+1: +void bar(unsigned n) { +  if (n * 2863311531U < 1431655766U) +    true(); +} + +The same transformation can work with an even modulo with the addition of a +rotate: rotate the result of the multiply to the right by the number of bits +which need to be zero for the condition to be true, and shrink the compare RHS +by the same amount.  
Unless the target supports rotates, though, that +transformation probably isn't worthwhile. + +The transformation can also easily be made to work with non-zero equality +comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0". + +//===---------------------------------------------------------------------===// + +Better mod/ref analysis for scanf would allow us to eliminate the vtable and a +bunch of other stuff from this example (see PR1604):  + +#include <cstdio> +struct test { +    int val; +    virtual ~test() {} +}; + +int main() { +    test t; +    std::scanf("%d", &t.val); +    std::printf("%d\n", t.val); +} + +//===---------------------------------------------------------------------===// + +These functions perform the same computation, but produce different assembly. + +define i8 @select(i8 %x) readnone nounwind { +  %A = icmp ult i8 %x, 250 +  %B = select i1 %A, i8 0, i8 1 +  ret i8 %B  +} + +define i8 @addshr(i8 %x) readnone nounwind { +  %A = zext i8 %x to i9 +  %B = add i9 %A, 6       ;; 256 - 250 == 6 +  %C = lshr i9 %B, 8 +  %D = trunc i9 %C to i8 +  ret i8 %D +} + +//===---------------------------------------------------------------------===// + +From gcc bug 24696: +int +f (unsigned long a, unsigned long b, unsigned long c) +{ +  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0); +} +int +f (unsigned long a, unsigned long b, unsigned long c) +{ +  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0); +} +Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with +"clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +From GCC Bug 20192: +#define PMD_MASK    (~((1UL << 23) - 1)) +void clear_pmd_range(unsigned long start, unsigned long end) +{ +   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK)) +       f(); +} +The expression should optimize to something like +"!((start|end)&~PMD_MASK). Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return +i;} +unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;} +These should combine to the same thing.  Currently, the first function +produces better code on X86. + +//===---------------------------------------------------------------------===// + +From GCC Bug 15784: +#define abs(x) x>0?x:-x +int f(int x, int y) +{ + return (abs(x)) >= 0; +} +This should optimize to x == INT_MIN. (With -fwrapv.)  Currently not +optimized with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +From GCC Bug 14753: +void +rotate_cst (unsigned int a) +{ + a = (a << 10) | (a >> 22); + if (a == 123) +   bar (); +} +void +minus_cst (unsigned int a) +{ + unsigned int tem; + + tem = 20 - a; + if (tem == 5) +   bar (); +} +void +mask_gt (unsigned int a) +{ + /* This is equivalent to a > 15.  */ + if ((a & ~7) > 8) +   bar (); +} +void +rshift_gt (unsigned int a) +{ + /* This is equivalent to a > 23.  */ + if ((a >> 2) > 5) +   bar (); +} + +All should simplify to a single comparison.  All of these are +currently not optimized with "clang -emit-llvm-bc | opt +-O3". + +//===---------------------------------------------------------------------===// + +From GCC Bug 32605: +int c(int* x) {return (char*)x+2 == (char*)x;} +Should combine to 0.  
Currently not optimized with "clang +-emit-llvm-bc | opt -O3" (although llc can optimize it). + +//===---------------------------------------------------------------------===// + +int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;} +Should be combined to  "((b >> 1) | b) & 1".  Currently not optimized +with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);} +Should combine to "x | (y & 3)".  Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);} +Should fold to "(~a & c) | (a & b)".  Currently not optimized with +"clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int a,int b) {return (~(a|b))|a;} +Should fold to "a|~b".  Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int a, int b) {return (a&&b) || (a&&!b);} +Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc +| opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int a, int b, int c) {return (a&&b) || (!a&&c);} +Should fold to "a ? b : c", or at least something sane.  Currently not +optimized with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);} +Should fold to a && (b || c).  Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int x) {return x | ((x & 8) ^ 8);} +Should combine to x | 8.  Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int x) {return x ^ ((x & 8) ^ 8);} +Should also combine to x | 8.  Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int a(int x) {return ((x | -9) ^ 8) & x;} +Should combine to x & -9.  Currently not optimized with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;} +Should combine to "a * 0x88888888 >> 31".  Currently not optimized +with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +unsigned a(char* x) {if ((*x & 32) == 0) return b();} +There's an unnecessary zext in the generated code with "clang +-emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +unsigned a(unsigned long long x) {return 40 * (x >> 1);} +Should combine to "20 * (((unsigned)x) & -2)".  Currently not +optimized with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int g(int x) { return (x - 10) < 0; } +Should combine to "x <= 9" (the sub has nsw).  Currently not +optimized with "clang -emit-llvm-bc | opt -O3". 
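+
+As an illustration of why the nsw flag is needed, consider a hypothetical
+wrapping model of the subtract (g_wrap is not from the testcase; 32-bit int
+assumed):
+
+int g_wrap(int x) { return (int)((unsigned)x - 10u) < 0; }
+
+g_wrap(INT_MIN + 5) returns 0, because the subtract wraps around to a large
+positive value, yet INT_MIN + 5 <= 9 is true; the combine to "x <= 9" is only
+licensed because the sub carries nsw.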
+ +//===---------------------------------------------------------------------===// + +int g(int x) { return (x + 10) < 0; } +Should combine to "x < -10" (the add has nsw).  Currently not +optimized with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +int f(int i, int j) { return i < j + 1; } +int g(int i, int j) { return j > i - 1; } +Should combine to "i <= j" (the add/sub has nsw).  Currently not +optimized with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +unsigned f(unsigned x) { return ((x & 7) + 1) & 15; } +The & 15 part should be optimized away, it doesn't change the result. Currently +not optimized with "clang -emit-llvm-bc | opt -O3". + +//===---------------------------------------------------------------------===// + +This was noticed in the entryblock for grokdeclarator in 403.gcc: + +        %tmp = icmp eq i32 %decl_context, 4           +        %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context  +        %tmp1 = icmp eq i32 %decl_context_addr.0, 1  +        %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0 + +tmp1 should be simplified to something like: +  (!tmp || decl_context == 1) + +This allows recursive simplifications, tmp1 is used all over the place in +the function, e.g. by: + +        %tmp23 = icmp eq i32 %decl_context_addr.1, 0            ; <i1> [#uses=1] +        %tmp24 = xor i1 %tmp1, true             ; <i1> [#uses=1] +        %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1] + +later. + +//===---------------------------------------------------------------------===// + +[STORE SINKING] + +Store sinking: This code: + +void f (int n, int *cond, int *res) { +    int i; +    *res = 0; +    for (i = 0; i < n; i++) +        if (*cond) +            *res ^= 234; /* (*) */ +} + +On this function GVN hoists the fully redundant value of *res, but nothing +moves the store out.  This gives us this code: + +bb:		; preds = %bb2, %entry +	%.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]	 +	%i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ] +	%1 = load i32* %cond, align 4 +	%2 = icmp eq i32 %1, 0 +	br i1 %2, label %bb2, label %bb1 + +bb1:		; preds = %bb +	%3 = xor i32 %.rle, 234	 +	store i32 %3, i32* %res, align 4 +	br label %bb2 + +bb2:		; preds = %bb, %bb1 +	%.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]	 +	%indvar.next = add i32 %i.05, 1	 +	%exitcond = icmp eq i32 %indvar.next, %n +	br i1 %exitcond, label %return, label %bb + +DSE should sink partially dead stores to get the store out of the loop. + +Here's another partial dead case: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395 + +//===---------------------------------------------------------------------===// + +Scalar PRE hoists the mul in the common block up to the else: + +int test (int a, int b, int c, int g) { +  int d, e; +  if (a) +    d = b * c; +  else +    d = b - c; +  e = b * c + g; +  return d + e; +} + +It would be better to do the mul once to reduce codesize above the if. +This is GCC PR38204. 
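+
+That is, roughly the source-level shape we would like to reach (a sketch only):
+
+int test_hoisted(int a, int b, int c, int g) {
+  int d, e;
+  int t = b * c;   /* single multiply, hoisted above the if */
+  if (a)
+    d = t;
+  else
+    d = b - c;
+  e = t + g;
+  return d + e;
+}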
+ + +//===---------------------------------------------------------------------===// +This simple function from 179.art: + +int winner, numf2s; +struct { double y; int   reset; } *Y; + +void find_match() { +   int i; +   winner = 0; +   for (i=0;i<numf2s;i++) +       if (Y[i].y > Y[winner].y) +              winner =i; +} + +Compiles into (with clang TBAA): + +for.body:                                         ; preds = %for.inc, %bb.nph +  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ] +  %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ] +  %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0 +  %tmp5 = load double* %tmp4, align 8, !tbaa !4 +  %idxprom7 = sext i32 %i.01718 to i64 +  %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0 +  %tmp11 = load double* %tmp10, align 8, !tbaa !4 +  %cmp12 = fcmp ogt double %tmp5, %tmp11 +  br i1 %cmp12, label %if.then, label %for.inc + +if.then:                                          ; preds = %for.body +  %i.017 = trunc i64 %indvar to i32 +  br label %for.inc + +for.inc:                                          ; preds = %for.body, %if.then +  %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ] +  %indvar.next = add i64 %indvar, 1 +  %exitcond = icmp eq i64 %indvar.next, %tmp22 +  br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body + + +It is good that we hoisted the reloads of numf2's, and Y out of the loop and +sunk the store to winner out. + +However, this is awful on several levels: the conditional truncate in the loop +(-indvars at fault? why can't we completely promote the IV to i64?). + +Beyond that, we have a partially redundant load in the loop: if "winner" (aka  +%i.01718) isn't updated, we reload Y[winner].y the next time through the loop. +Similarly, the addressing that feeds it (including the sext) is redundant. In +the end we get this generated assembly: + +LBB0_2:                                 ## %for.body +                                        ## =>This Inner Loop Header: Depth=1 +	movsd	(%rdi), %xmm0 +	movslq	%edx, %r8 +	shlq	$4, %r8 +	ucomisd	(%rcx,%r8), %xmm0 +	jbe	LBB0_4 +	movl	%esi, %edx +LBB0_4:                                 ## %for.inc +	addq	$16, %rdi +	incq	%rsi +	cmpq	%rsi, %rax +	jne	LBB0_2 + +All things considered this isn't too bad, but we shouldn't need the movslq or +the shlq instruction, or the load folded into ucomisd every time through the +loop. + +On an x86-specific topic, if the loop can't be restructure, the movl should be a +cmov. + +//===---------------------------------------------------------------------===// + +[STORE SINKING] + +GCC PR37810 is an interesting case where we should sink load/store reload +into the if block and outside the loop, so we don't reload/store it on the +non-call path. + +for () { +  *P += 1; +  if () +    call(); +  else +    ... +-> +tmp = *P +for () { +  tmp += 1; +  if () { +    *P = tmp; +    call(); +    tmp = *P; +  } else ... +} +*P = tmp; + +We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but +we don't sink the store.  We need partially dead store sinking. + +//===---------------------------------------------------------------------===// + +[LOAD PRE CRIT EDGE SPLITTING] + +GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack +leading to excess stack traffic. This could be handled by GVN with some crazy +symbolic phi translation.  The code we get looks like (g is on the stack): + +bb2:		; preds = %bb1 +.. 
+	%9 = getelementptr %struct.f* %g, i32 0, i32 0		 +	store i32 %8, i32* %9, align  bel %bb3 + +bb3:		; preds = %bb1, %bb2, %bb +	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ] +	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ] +	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0 +	%11 = load i32* %10, align 4 + +%11 is partially redundant, an in BB2 it should have the value %8. + +GCC PR33344 and PR35287 are similar cases. + + +//===---------------------------------------------------------------------===// + +[LOAD PRE] + +There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the +GCC testsuite, ones we don't get yet are (checked through loadpre25): + +[CRIT EDGE BREAKING] +predcom-4.c + +[PRE OF READONLY CALL] +loadpre5.c + +[TURN SELECT INTO BRANCH] +loadpre14.c loadpre15.c  + +actually a conditional increment: loadpre18.c loadpre19.c + +//===---------------------------------------------------------------------===// + +[LOAD PRE / STORE SINKING / SPEC HACK] + +This is a chunk of code from 456.hmmer: + +int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp, +     int *tpdm, int xmb, int *bp, int *ms) { + int k, sc; + for (k = 1; k <= M; k++) { +     mc[k] = mpp[k-1]   + tpmm[k-1]; +     if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc; +     if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc; +     if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc; +     mc[k] += ms[k]; +   } +} + +It is very profitable for this benchmark to turn the conditional stores to mc[k] +into a conditional move (select instr in IR) and allow the final store to do the +store.  See GCC PR27313 for more details.  Note that this is valid to xform even +with the new C++ memory model, since mc[k] is previously loaded and later +stored. + +//===---------------------------------------------------------------------===// + +[SCALAR PRE] +There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the +GCC testsuite. + +//===---------------------------------------------------------------------===// + +There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the +GCC testsuite.  For example, we get the first example in predcom-1.c, but  +miss the second one: + +unsigned fib[1000]; +unsigned avg[1000]; + +__attribute__ ((noinline)) +void count_averages(int n) { +  int i; +  for (i = 1; i < n; i++) +    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff; +} + +which compiles into two loads instead of one in the loop. + +predcom-2.c is the same as predcom-1.c + +predcom-3.c is very similar but needs loads feeding each other instead of +store->load. + + +//===---------------------------------------------------------------------===// + +[ALIAS ANALYSIS] + +Type based alias analysis: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705 + +We should do better analysis of posix_memalign.  At the least it should +no-capture its pointer argument, at best, we should know that the out-value +result doesn't point to anything (like malloc).  
One example of this is in +SingleSource/Benchmarks/Misc/dt.c + +//===---------------------------------------------------------------------===// + +Interesting missed case because of control flow flattening (should be 2 loads): +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629 +With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |  +             opt -mem2reg -gvn -instcombine | llvm-dis +we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT +VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS + +//===---------------------------------------------------------------------===// + +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633 +We could eliminate the branch condition here, loading from null is undefined: + +struct S { int w, x, y, z; }; +struct T { int r; struct S s; }; +void bar (struct S, int); +void foo (int a, struct T b) +{ +  struct S *c = 0; +  if (a) +    c = &b.s; +  bar (*c, a); +} + +//===---------------------------------------------------------------------===// + +simplifylibcalls should do several optimizations for strspn/strcspn: + +strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn): + +size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2, +                     int __reject3) { +  register size_t __result = 0; +  while (__s[__result] != '\0' && __s[__result] != __reject1 && +         __s[__result] != __reject2 && __s[__result] != __reject3) +    ++__result; +  return __result; +} + +This should turn into a switch on the character.  See PR3253 for some notes on +codegen. + +456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn. + +//===---------------------------------------------------------------------===// + +simplifylibcalls should turn these snprintf idioms into memcpy (GCC PR47917) + +char buf1[6], buf2[6], buf3[4], buf4[4]; +int i; + +int foo (void) { +  int ret = snprintf (buf1, sizeof buf1, "abcde"); +  ret += snprintf (buf2, sizeof buf2, "abcdef") * 16; +  ret += snprintf (buf3, sizeof buf3, "%s", i++ < 6 ? "abc" : "def") * 256; +  ret += snprintf (buf4, sizeof buf4, "%s", i++ > 10 ? "abcde" : "defgh")*4096; +  return ret; +} + +//===---------------------------------------------------------------------===// + +"gas" uses this idiom: +  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string)) +.. +  else if (strchr ("<>", *intel_parser.op_string) + +Those should be turned into a switch.  SimplifyLibCalls only gets the second +case. + +//===---------------------------------------------------------------------===// + +252.eon contains this interesting code: + +        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0 +        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind +        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1 +        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen +        call void @llvm.memcpy.i32(i8* %endptr,  +          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1) +        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly  +         +This is interesting for a couple reasons.  First, in this: + +The memcpy+strlen strlen can be replaced with: + +        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly  + +Because the destination was just copied into the specified memory buffer.  This, +in turn, can be constant folded to "4". 
+ +In other code, it contains: + +        %endptr6978 = bitcast i8* %endptr69 to i32*             +        store i32 7107374, i32* %endptr6978, align 1 +        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly     + +Which could also be constant folded.  Whatever is producing this should probably +be fixed to leave this as a memcpy from a string. + +Further, eon also has an interesting partially redundant strlen call: + +bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit +        %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2] +        %683 = load i8** %682, align 4          ; <i8*> [#uses=4] +        %684 = load i8* %683, align 1           ; <i8> [#uses=1] +        %685 = icmp eq i8 %684, 0               ; <i1> [#uses=1] +        br i1 %685, label %bb10, label %bb9 + +bb9:            ; preds = %bb8 +        %686 = call i32 @strlen(i8* %683) nounwind readonly           +        %687 = icmp ugt i32 %686, 254           ; <i1> [#uses=1] +        br i1 %687, label %bb10, label %bb11 + +bb10:           ; preds = %bb9, %bb8 +        %688 = call i32 @strlen(i8* %683) nounwind readonly           + +This could be eliminated by doing the strlen once in bb8, saving code size and +improving perf on the bb8->9->10 path. + +//===---------------------------------------------------------------------===// + +I see an interesting fully redundant call to strlen left in 186.crafty:InputMove +which looks like: +       %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0  +  + +bb62:           ; preds = %bb55, %bb53 +        %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]              +        %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1 +        %172 = add i32 %171, -1         ; <i32> [#uses=1] +        %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172        + +...  no stores ... +       br i1 %or.cond, label %bb65, label %bb72 + +bb65:           ; preds = %bb62 +        store i8 0, i8* %173, align 1 +        br label %bb72 + +bb72:           ; preds = %bb65, %bb62 +        %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]             +        %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1 + +Note that on the bb62->bb72 path, that the %177 strlen call is partially +redundant with the %171 call.  At worst, we could shove the %177 strlen call +up into the bb65 block moving it out of the bb62->bb72 path.   However, note +that bb65 stores to the string, zeroing out the last byte.  This means that on +that path the value of %177 is actually just %171-1.  A sub is cheaper than a +strlen! + +This pattern repeats several times, basically doing: + +  A = strlen(P); +  P[A-1] = 0; +  B = strlen(P); +  where it is "obvious" that B = A-1. + +//===---------------------------------------------------------------------===// + +186.crafty has this interesting pattern with the "out.4543" variable: + +call void @llvm.memcpy.i32( +        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0), +       i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)  +%101 = call@printf(i8* ...   
@out.4543, i32 0, i32 0)) nounwind  + +It is basically doing: + +  memcpy(globalarray, "string"); +  printf(...,  globalarray); -uint8_t p2(uint8_t b, uint8_t a) {  -  b = (b & ~0x40) | (a & 0x40);  -  b = (b & ~0x80) | (a & 0x80);  -  return (b);  -}  -  -define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {  -entry:  -  %0 = and i8 %b, 63                              ; <i8> [#uses=1]  -  %1 = and i8 %a, -64                             ; <i8> [#uses=1]  -  %2 = or i8 %1, %0                               ; <i8> [#uses=1]  -  ret i8 %2  -}  -  -define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {  -entry:  -  %0 = and i8 %b, 63                              ; <i8> [#uses=1]  -  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]  -  %1 = and i8 %a, -128                            ; <i8> [#uses=1]  -  %2 = or i8 %1, %0                               ; <i8> [#uses=1]  -  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]  -  ret i8 %3  -}  -  -//===---------------------------------------------------------------------===//  -  -IPSCCP does not currently propagate argument dependent constants through  -functions where it does not not all of the callers.  This includes functions  -with normal external linkage as well as templates, C99 inline functions etc.  -Specifically, it does nothing to:  -  -define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {  -entry:  -  %0 = add nsw i32 %y, %z                           -  %1 = mul i32 %0, %x                               -  %2 = mul i32 %y, %z                               -  %3 = add nsw i32 %1, %2                           -  ret i32 %3  -}  -  -define i32 @test2() nounwind {  -entry:  -  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind  -  ret i32 %0  -}  -  -It would be interesting extend IPSCCP to be able to handle simple cases like  -this, where all of the arguments to a call are constant.  Because IPSCCP runs  -before inlining, trivial templates and inline functions are not yet inlined.  -The results for a function + set of constant arguments should be memoized in a  -map.  -  -//===---------------------------------------------------------------------===//  -  -The libcall constant folding stuff should be moved out of SimplifyLibcalls into  -libanalysis' constantfolding logic.  This would allow IPSCCP to be able to  -handle simple things like this:  -  -static int foo(const char *X) { return strlen(X); }  -int bar() { return foo("abcd"); }  -  -//===---------------------------------------------------------------------===//  -  +Anyway, by knowing that printf just reads the memory and forward substituting +the string directly into the printf, this eliminates reads from globalarray. +Since this pattern occurs frequently in crafty (due to the "DisplayTime" and +other similar functions) there are many stores to "out".  Once all the printfs +stop using "out", all that is left is the memcpy's into it.  This should allow +globalopt to remove the "stored only" global. + +//===---------------------------------------------------------------------===// + +This code: + +define inreg i32 @foo(i8* inreg %p) nounwind { +  %tmp0 = load i8* %p +  %tmp1 = ashr i8 %tmp0, 5 +  %tmp2 = sext i8 %tmp1 to i32 +  ret i32 %tmp2 +} + +could be dagcombine'd to a sign-extending load with a shift. 
+For example, on x86 this currently gets this: + +	movb	(%eax), %al +	sarb	$5, %al +	movsbl	%al, %eax + +while it could get this: + +	movsbl	(%eax), %eax +	sarl	$5, %eax + +//===---------------------------------------------------------------------===// + +GCC PR31029: + +int test(int x) { return 1-x == x; }     // --> return false +int test2(int x) { return 2-x == x; }    // --> return x == 1 ? + +Always foldable for odd constants, what is the rule for even? + +//===---------------------------------------------------------------------===// + +PR 3381: GEP to field of size 0 inside a struct could be turned into GEP +for next field in struct (which is at same address). + +For example: store of float into { {{}}, float } could be turned into a store to +the float directly. + +//===---------------------------------------------------------------------===// + +The arg promotion pass should make use of nocapture to make its alias analysis +stuff much more precise. + +//===---------------------------------------------------------------------===// + +The following functions should be optimized to use a select instead of a +branch (from gcc PR40072): + +char char_int(int m) {if(m>7) return 0; return m;} +int int_char(char m) {if(m>7) return 0; return m;} + +//===---------------------------------------------------------------------===// + +int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; } + +Generates this: + +define i32 @func(i32 %a, i32 %b) nounwind readnone ssp { +entry: +  %0 = and i32 %a, 128                            ; <i32> [#uses=1] +  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1] +  %2 = or i32 %b, 128                             ; <i32> [#uses=1] +  %3 = and i32 %b, -129                           ; <i32> [#uses=1] +  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1] +  ret i32 %b_addr.0 +} + +However, it's functionally equivalent to: + +         b = (b & ~0x80) | (a & 0x80); + +Which generates this: + +define i32 @func(i32 %a, i32 %b) nounwind readnone ssp { +entry: +  %0 = and i32 %b, -129                           ; <i32> [#uses=1] +  %1 = and i32 %a, 128                            ; <i32> [#uses=1] +  %2 = or i32 %0, %1                              ; <i32> [#uses=1] +  ret i32 %2 +} + +This can be generalized for other forms: + +     b = (b & ~0x80) | (a & 0x40) << 1; + +//===---------------------------------------------------------------------===// + +These two functions produce different code. 
They shouldn't: + +#include <stdint.h> +  +uint8_t p1(uint8_t b, uint8_t a) { +  b = (b & ~0xc0) | (a & 0xc0); +  return (b); +} +  +uint8_t p2(uint8_t b, uint8_t a) { +  b = (b & ~0x40) | (a & 0x40); +  b = (b & ~0x80) | (a & 0x80); +  return (b); +} + +define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp { +entry: +  %0 = and i8 %b, 63                              ; <i8> [#uses=1] +  %1 = and i8 %a, -64                             ; <i8> [#uses=1] +  %2 = or i8 %1, %0                               ; <i8> [#uses=1] +  ret i8 %2 +} + +define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp { +entry: +  %0 = and i8 %b, 63                              ; <i8> [#uses=1] +  %.masked = and i8 %a, 64                        ; <i8> [#uses=1] +  %1 = and i8 %a, -128                            ; <i8> [#uses=1] +  %2 = or i8 %1, %0                               ; <i8> [#uses=1] +  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1] +  ret i8 %3 +} + +//===---------------------------------------------------------------------===// + +IPSCCP does not currently propagate argument dependent constants through +functions where it does not not all of the callers.  This includes functions +with normal external linkage as well as templates, C99 inline functions etc. +Specifically, it does nothing to: + +define i32 @test(i32 %x, i32 %y, i32 %z) nounwind { +entry: +  %0 = add nsw i32 %y, %z                          +  %1 = mul i32 %0, %x                              +  %2 = mul i32 %y, %z                              +  %3 = add nsw i32 %1, %2                          +  ret i32 %3 +} + +define i32 @test2() nounwind { +entry: +  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind +  ret i32 %0 +} + +It would be interesting extend IPSCCP to be able to handle simple cases like +this, where all of the arguments to a call are constant.  Because IPSCCP runs +before inlining, trivial templates and inline functions are not yet inlined. +The results for a function + set of constant arguments should be memoized in a +map. + +//===---------------------------------------------------------------------===// + +The libcall constant folding stuff should be moved out of SimplifyLibcalls into +libanalysis' constantfolding logic.  This would allow IPSCCP to be able to +handle simple things like this: + +static int foo(const char *X) { return strlen(X); } +int bar() { return foo("abcd"); } + +//===---------------------------------------------------------------------===// +  function-attrs doesn't know much about memcpy/memset.  
This function should be -marked readnone rather than readonly, since it only twiddles local memory, but  +marked readnone rather than readonly, since it only twiddles local memory, but  function-attrs doesn't handle memset/memcpy/memmove aggressively: -  -struct X { int *p; int *q; };  -int foo() {  - int i = 0, j = 1;  - struct X x, y;  - int **p;  - y.p = &i;  - x.q = &j;  - p = __builtin_memcpy (&x, &y, sizeof (int *));  - return **p;  -}  -  -This can be seen at:  + +struct X { int *p; int *q; }; +int foo() { + int i = 0, j = 1; + struct X x, y; + int **p; + y.p = &i; + x.q = &j; + p = __builtin_memcpy (&x, &y, sizeof (int *)); + return **p; +} + +This can be seen at:  $ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -function-attrs -S -  -  -//===---------------------------------------------------------------------===//  -  -Missed instcombine transformation:  -define i1 @a(i32 %x) nounwind readnone {  -entry:  -  %cmp = icmp eq i32 %x, 30  -  %sub = add i32 %x, -30  -  %cmp2 = icmp ugt i32 %sub, 9  -  %or = or i1 %cmp, %cmp2  -  ret i1 %or  -}  -This should be optimized to a single compare.  Testcase derived from gcc.  -  -//===---------------------------------------------------------------------===//  -  -Missed instcombine or reassociate transformation:  -int a(int a, int b) { return (a==12)&(b>47)&(b<58); }  -  -The sgt and slt should be combined into a single comparison. Testcase derived  -from gcc.  -  -//===---------------------------------------------------------------------===//  -  -Missed instcombine transformation:  -  -  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1]  -  %383 = zext i32 %382 to i64                     ; [#uses=1]  -  %384 = shl i64 %381, %383                       ; [#uses=1]  -  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1]  -  -The srem can be transformed to an and because if %tmp14.i is negative, the  -shift is undefined.  Testcase derived from 403.gcc.  -  -//===---------------------------------------------------------------------===//  -  -This is a range comparison on a divided result (from 403.gcc):  -  -  %1337 = sdiv i32 %1336, 8                       ; [#uses=1]  -  %.off.i208 = add i32 %1336, 7                   ; [#uses=1]  -  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1]  -    -We already catch this (removing the sdiv) if there isn't an add, we should  -handle the 'add' as well.  This is a common idiom with it's builtin_alloca code.  -C testcase:  -  -int a(int x) { return (unsigned)(x/16+7) < 15; }  -  -Another similar case involves truncations on 64-bit targets:  -  -  %361 = sdiv i64 %.046, 8                        ; [#uses=1]  -  %362 = trunc i64 %361 to i32                    ; [#uses=2]  -...  -  %367 = icmp eq i32 %362, 0                      ; [#uses=1]  -  -//===---------------------------------------------------------------------===//  -  -Missed instcombine/dagcombine transformation:  -define void @lshift_lt(i8 zeroext %a) nounwind {  -entry:  -  %conv = zext i8 %a to i32  -  %shl = shl i32 %conv, 3  -  %cmp = icmp ult i32 %shl, 33  -  br i1 %cmp, label %if.then, label %if.end  -  -if.then:  -  tail call void @bar() nounwind  -  ret void  -  -if.end:  -  ret void  -}  -declare void @bar() nounwind  -  -The shift should be eliminated.  Testcase derived from gcc.  
-  -//===---------------------------------------------------------------------===//  -  -These compile into different code, one gets recognized as a switch and the  -other doesn't due to phase ordering issues (PR6212):  -  -int test1(int mainType, int subType) {  -  if (mainType == 7)  -    subType = 4;  -  else if (mainType == 9)  -    subType = 6;  -  else if (mainType == 11)  -    subType = 9;  -  return subType;  -}  -  -int test2(int mainType, int subType) {  -  if (mainType == 7)  -    subType = 4;  -  if (mainType == 9)  -    subType = 6;  -  if (mainType == 11)  -    subType = 9;  -  return subType;  -}  -  -//===---------------------------------------------------------------------===//  -  -The following test case (from PR6576):  -  -define i32 @mul(i32 %a, i32 %b) nounwind readnone {  -entry:  - %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1]  - br i1 %cond1, label %exit, label %bb.nph  -bb.nph:                                           ; preds = %entry  - %tmp = mul i32 %b, %a                           ; <i32> [#uses=1]  - ret i32 %tmp  -exit:                                             ; preds = %entry  - ret i32 0  -}  -  -could be reduced to:  -  -define i32 @mul(i32 %a, i32 %b) nounwind readnone {  -entry:  - %tmp = mul i32 %b, %a  - ret i32 %tmp  -}  -  -//===---------------------------------------------------------------------===//  -  -We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.  -See GCC PR34949  -  -Another interesting case is that something related could be used for variables  -that go const after their ctor has finished.  In these cases, globalopt (which  -can statically run the constructor) could mark the global const (so it gets put  -in the readonly section).  A testcase would be:  -  -#include <complex>  -using namespace std;  -const complex<char> should_be_in_rodata (42,-42);  -complex<char> should_be_in_data (42,-42);  -complex<char> should_be_in_bss;  -  -Where we currently evaluate the ctors but the globals don't become const because  -the optimizer doesn't know they "become const" after the ctor is done.  See  -GCC PR4131 for more examples.  -  -//===---------------------------------------------------------------------===//  -  -In this code:  -  -long foo(long x) {  -  return x > 1 ? x : 1;  -}  -  -LLVM emits a comparison with 1 instead of 0. 0 would be equivalent  -and cheaper on most targets.  -  -LLVM prefers comparisons with zero over non-zero in general, but in this  -case it choses instead to keep the max operation obvious.  
-  -//===---------------------------------------------------------------------===//  -  -define void @a(i32 %x) nounwind {  -entry:  -  switch i32 %x, label %if.end [  -    i32 0, label %if.then  -    i32 1, label %if.then  -    i32 2, label %if.then  -    i32 3, label %if.then  -    i32 5, label %if.then  -  ]  -if.then:  -  tail call void @foo() nounwind  -  ret void  -if.end:  -  ret void  -}  -declare void @foo()  -  -Generated code on x86-64 (other platforms give similar results):  -a:  -	cmpl	$5, %edi  -	ja	LBB2_2  -	cmpl	$4, %edi  -	jne	LBB2_3  -.LBB0_2:  -	ret  -.LBB0_3:  -	jmp	foo  # TAILCALL  -  -If we wanted to be really clever, we could simplify the whole thing to  -something like the following, which eliminates a branch:  -	xorl    $1, %edi  -	cmpl	$4, %edi  -	ja	.LBB0_2  -	ret  -.LBB0_2:  -	jmp	foo  # TAILCALL  -  -//===---------------------------------------------------------------------===//  -  -We compile this:  -  -int foo(int a) { return (a & (~15)) / 16; }  -  -Into:  -  -define i32 @foo(i32 %a) nounwind readnone ssp {  -entry:  -  %and = and i32 %a, -16  -  %div = sdiv i32 %and, 16  -  ret i32 %div  -}  -  -but this code (X & -A)/A is X >> log2(A) when A is a power of 2, so this case  -should be instcombined into just "a >> 4".  -  -We do get this at the codegen level, so something knows about it, but   -instcombine should catch it earlier:  -  -_foo:                                   ## @foo  -## %bb.0:                               ## %entry  -	movl	%edi, %eax  -	sarl	$4, %eax  -	ret  -  -//===---------------------------------------------------------------------===//  -  -This code (from GCC PR28685):  -  -int test(int a, int b) {  -  int lt = a < b;  -  int eq = a == b;  -  if (lt)  -    return 1;  -  return eq;  -}  -  -Is compiled to:  -  -define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {  -entry:  -  %cmp = icmp slt i32 %a, %b  -  br i1 %cmp, label %return, label %if.end  -  -if.end:                                           ; preds = %entry  -  %cmp5 = icmp eq i32 %a, %b  -  %conv6 = zext i1 %cmp5 to i32  -  ret i32 %conv6  -  -return:                                           ; preds = %entry  -  ret i32 1  -}  -  -it could be:  -  -define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {  -entry:  -  %0 = icmp sle i32 %a, %b  -  %retval = zext i1 %0 to i32  -  ret i32 %retval  -}  -  -//===---------------------------------------------------------------------===//  -  -This code can be seen in viterbi:  -  -  %64 = call noalias i8* @malloc(i64 %62) nounwind  -...  -  %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind  -  %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind  -  -llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to  -fold to %62.  This is a security win (overflows of malloc will get caught)  -and also a performance win by exposing more memsets to the optimizer.  -  -This occurs several times in viterbi.  -  -Note that this would change the semantics of @llvm.objectsize which by its  -current definition always folds to a constant. We also should make sure that  -we remove checking in code like  -  -  char *p = malloc(strlen(s)+1);  + + +//===---------------------------------------------------------------------===// + +Missed instcombine transformation: +define i1 @a(i32 %x) nounwind readnone { +entry: +  %cmp = icmp eq i32 %x, 30 +  %sub = add i32 %x, -30 +  %cmp2 = icmp ugt i32 %sub, 9 +  %or = or i1 %cmp, %cmp2 +  ret i1 %or +} +This should be optimized to a single compare.  
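+
+For instance, the disjunction collapses to one unsigned range check (a sketch;
+instcombine may pick a different but equivalent form):
+
+int a_combined(int x) {
+  /* x == 30  ||  (unsigned)(x - 30) > 9   <=>   (unsigned)x - 31 > 8 */
+  return (unsigned)x - 31u > 8u;
+}
+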
Testcase derived from gcc. + +//===---------------------------------------------------------------------===// + +Missed instcombine or reassociate transformation: +int a(int a, int b) { return (a==12)&(b>47)&(b<58); } + +The sgt and slt should be combined into a single comparison. Testcase derived +from gcc. + +//===---------------------------------------------------------------------===// + +Missed instcombine transformation: + +  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1] +  %383 = zext i32 %382 to i64                     ; [#uses=1] +  %384 = shl i64 %381, %383                       ; [#uses=1] +  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1] + +The srem can be transformed to an and because if %tmp14.i is negative, the +shift is undefined.  Testcase derived from 403.gcc. + +//===---------------------------------------------------------------------===// + +This is a range comparison on a divided result (from 403.gcc): + +  %1337 = sdiv i32 %1336, 8                       ; [#uses=1] +  %.off.i208 = add i32 %1336, 7                   ; [#uses=1] +  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1] +   +We already catch this (removing the sdiv) if there isn't an add, we should +handle the 'add' as well.  This is a common idiom with it's builtin_alloca code. +C testcase: + +int a(int x) { return (unsigned)(x/16+7) < 15; } + +Another similar case involves truncations on 64-bit targets: + +  %361 = sdiv i64 %.046, 8                        ; [#uses=1] +  %362 = trunc i64 %361 to i32                    ; [#uses=2] +... +  %367 = icmp eq i32 %362, 0                      ; [#uses=1] + +//===---------------------------------------------------------------------===// + +Missed instcombine/dagcombine transformation: +define void @lshift_lt(i8 zeroext %a) nounwind { +entry: +  %conv = zext i8 %a to i32 +  %shl = shl i32 %conv, 3 +  %cmp = icmp ult i32 %shl, 33 +  br i1 %cmp, label %if.then, label %if.end + +if.then: +  tail call void @bar() nounwind +  ret void + +if.end: +  ret void +} +declare void @bar() nounwind + +The shift should be eliminated.  Testcase derived from gcc. 
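+
+In other words, since %conv is a zero-extended i8, (%conv << 3) ult 33 can only
+hold for %conv in 0..4, so the branch condition could simply be (sketch of the
+simplified source):
+
+void bar(void);
+void lshift_lt_simplified(unsigned char a) {
+  if (a < 5)    /* a << 3 only takes the values 0, 8, ..., 2040 */
+    bar();
+}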
+ +//===---------------------------------------------------------------------===// + +These compile into different code, one gets recognized as a switch and the +other doesn't due to phase ordering issues (PR6212): + +int test1(int mainType, int subType) { +  if (mainType == 7) +    subType = 4; +  else if (mainType == 9) +    subType = 6; +  else if (mainType == 11) +    subType = 9; +  return subType; +} + +int test2(int mainType, int subType) { +  if (mainType == 7) +    subType = 4; +  if (mainType == 9) +    subType = 6; +  if (mainType == 11) +    subType = 9; +  return subType; +} + +//===---------------------------------------------------------------------===// + +The following test case (from PR6576): + +define i32 @mul(i32 %a, i32 %b) nounwind readnone { +entry: + %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1] + br i1 %cond1, label %exit, label %bb.nph +bb.nph:                                           ; preds = %entry + %tmp = mul i32 %b, %a                           ; <i32> [#uses=1] + ret i32 %tmp +exit:                                             ; preds = %entry + ret i32 0 +} + +could be reduced to: + +define i32 @mul(i32 %a, i32 %b) nounwind readnone { +entry: + %tmp = mul i32 %b, %a + ret i32 %tmp +} + +//===---------------------------------------------------------------------===// + +We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates. +See GCC PR34949 + +Another interesting case is that something related could be used for variables +that go const after their ctor has finished.  In these cases, globalopt (which +can statically run the constructor) could mark the global const (so it gets put +in the readonly section).  A testcase would be: + +#include <complex> +using namespace std; +const complex<char> should_be_in_rodata (42,-42); +complex<char> should_be_in_data (42,-42); +complex<char> should_be_in_bss; + +Where we currently evaluate the ctors but the globals don't become const because +the optimizer doesn't know they "become const" after the ctor is done.  See +GCC PR4131 for more examples. + +//===---------------------------------------------------------------------===// + +In this code: + +long foo(long x) { +  return x > 1 ? x : 1; +} + +LLVM emits a comparison with 1 instead of 0. 0 would be equivalent +and cheaper on most targets. + +LLVM prefers comparisons with zero over non-zero in general, but in this +case it choses instead to keep the max operation obvious. 
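+
+Concretely, this source-level variant returns the same value for every input,
+because the only x where the two compares disagree is x == 1, and there both
+arms of the select produce 1:
+
+long foo_cmp_zero(long x) {
+  return x > 0 ? x : 1;
+}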
+ +//===---------------------------------------------------------------------===// + +define void @a(i32 %x) nounwind { +entry: +  switch i32 %x, label %if.end [ +    i32 0, label %if.then +    i32 1, label %if.then +    i32 2, label %if.then +    i32 3, label %if.then +    i32 5, label %if.then +  ] +if.then: +  tail call void @foo() nounwind +  ret void +if.end: +  ret void +} +declare void @foo() + +Generated code on x86-64 (other platforms give similar results): +a: +	cmpl	$5, %edi +	ja	LBB2_2 +	cmpl	$4, %edi +	jne	LBB2_3 +.LBB0_2: +	ret +.LBB0_3: +	jmp	foo  # TAILCALL + +If we wanted to be really clever, we could simplify the whole thing to +something like the following, which eliminates a branch: +	xorl    $1, %edi +	cmpl	$4, %edi +	ja	.LBB0_2 +	ret +.LBB0_2: +	jmp	foo  # TAILCALL + +//===---------------------------------------------------------------------===// + +We compile this: + +int foo(int a) { return (a & (~15)) / 16; } + +Into: + +define i32 @foo(i32 %a) nounwind readnone ssp { +entry: +  %and = and i32 %a, -16 +  %div = sdiv i32 %and, 16 +  ret i32 %div +} + +but this code (X & -A)/A is X >> log2(A) when A is a power of 2, so this case +should be instcombined into just "a >> 4". + +We do get this at the codegen level, so something knows about it, but  +instcombine should catch it earlier: + +_foo:                                   ## @foo +## %bb.0:                               ## %entry +	movl	%edi, %eax +	sarl	$4, %eax +	ret + +//===---------------------------------------------------------------------===// + +This code (from GCC PR28685): + +int test(int a, int b) { +  int lt = a < b; +  int eq = a == b; +  if (lt) +    return 1; +  return eq; +} + +Is compiled to: + +define i32 @test(i32 %a, i32 %b) nounwind readnone ssp { +entry: +  %cmp = icmp slt i32 %a, %b +  br i1 %cmp, label %return, label %if.end + +if.end:                                           ; preds = %entry +  %cmp5 = icmp eq i32 %a, %b +  %conv6 = zext i1 %cmp5 to i32 +  ret i32 %conv6 + +return:                                           ; preds = %entry +  ret i32 1 +} + +it could be: + +define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp { +entry: +  %0 = icmp sle i32 %a, %b +  %retval = zext i1 %0 to i32 +  ret i32 %retval +} + +//===---------------------------------------------------------------------===// + +This code can be seen in viterbi: + +  %64 = call noalias i8* @malloc(i64 %62) nounwind +... +  %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind +  %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind + +llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to +fold to %62.  This is a security win (overflows of malloc will get caught) +and also a performance win by exposing more memsets to the optimizer. + +This occurs several times in viterbi. + +Note that this would change the semantics of @llvm.objectsize which by its +current definition always folds to a constant. 
We also should make sure that +we remove checking in code like + +  char *p = malloc(strlen(s)+1);    __strcpy_chk(p, s, __builtin_object_size(p, 0)); -  -//===---------------------------------------------------------------------===//  -  -clang -O3 currently compiles this code  -  -int g(unsigned int a) {  -  unsigned int c[100];  -  c[10] = a;  -  c[11] = a;  -  unsigned int b = c[10] + c[11];  -  if(b > a*2) a = 4;  -  else a = 8;  -  return a + 7;  -}  -  -into  -  -define i32 @g(i32 a) nounwind readnone {  -  %add = shl i32 %a, 1  -  %mul = shl i32 %a, 1  -  %cmp = icmp ugt i32 %add, %mul  -  %a.addr.0 = select i1 %cmp, i32 11, i32 15  -  ret i32 %a.addr.0  -}  -  -The icmp should fold to false. This CSE opportunity is only available  -after GVN and InstCombine have run.  -  -//===---------------------------------------------------------------------===//  -  -memcpyopt should turn this:  -  -define i8* @test10(i32 %x) {  -  %alloc = call noalias i8* @malloc(i32 %x) nounwind  -  call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)  -  ret i8* %alloc  -}  -  -into a call to calloc.  We should make sure that we analyze calloc as  -aggressively as malloc though.  -  -//===---------------------------------------------------------------------===//  -  -clang -O3 doesn't optimize this:  -  -void f1(int* begin, int* end) {  -  std::fill(begin, end, 0);  -}  -  -into a memset.  This is PR8942.  -  -//===---------------------------------------------------------------------===//  -  -clang -O3 -fno-exceptions currently compiles this code:  -  -void f(int N) {  -  std::vector<int> v(N);  -  -  extern void sink(void*); sink(&v);  -}  -  -into  -  -define void @_Z1fi(i32 %N) nounwind {  -entry:  -  %v2 = alloca [3 x i32*], align 8  -  %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0  -  %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*  -  %conv = sext i32 %N to i64  -  store i32* null, i32** %v2.sub, align 8, !tbaa !0  -  %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1  -  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0  -  %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2  -  store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0  -  %cmp.i.i.i.i = icmp eq i32 %N, 0  -  br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i  -  -_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry  -  store i32* null, i32** %v2.sub, align 8, !tbaa !0  -  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0  -  %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv  -  store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0  -  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit  -  -cond.true.i.i.i.i:                                ; preds = %entry  -  %cmp.i.i.i.i.i = icmp slt i32 %N, 0  -  br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i  -  -if.then.i.i.i.i.i:                                ; preds = %cond.true.i.i.i.i  -  call void @_ZSt17__throw_bad_allocv() noreturn nounwind  -  unreachable  -  -_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i:    ; preds = %cond.true.i.i.i.i  -  %mul.i.i.i.i.i = shl i64 %conv, 2  -  %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind  -  %0 = bitcast i8* %call3.i.i.i.i.i to i32*  -  store i32* %0, i32** %v2.sub, align 8, !tbaa !0  -  store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0  -  %add.ptr.i.i.i = 
getelementptr inbounds i32* %0, i64 %conv  -  store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0  -  call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)  -  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit  -  -This is just the handling the construction of the vector. Most surprising here  -is the fact that all three null stores in %entry are dead (because we do no  -cross-block DSE).  -  -Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.  -This is a because the client of LazyValueInfo doesn't simplify all instruction  -operands, just selected ones.  -  -//===---------------------------------------------------------------------===//  -  -clang -O3 -fno-exceptions currently compiles this code:  -  -void f(char* a, int n) {  -  __builtin_memset(a, 0, n);  -  for (int i = 0; i < n; ++i)  -    a[i] = 0;  -}  -  -into:  -  -define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind {  -entry:  -  %conv = sext i32 %n to i64  -  tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false)  -  %cmp8 = icmp sgt i32 %n, 0  -  br i1 %cmp8, label %for.body.lr.ph, label %for.end  -  -for.body.lr.ph:                                   ; preds = %entry  -  %tmp10 = add i32 %n, -1  -  %tmp11 = zext i32 %tmp10 to i64  -  %tmp12 = add i64 %tmp11, 1  -  call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false)  -  ret void  -  -for.end:                                          ; preds = %entry  -  ret void  -}  -  -This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold  -the two memset's together.  -  -The issue with the addition only occurs in 64-bit mode, and appears to be at  -least partially caused by Scalar Evolution not keeping its cache updated: it  -returns the "wrong" result immediately after indvars runs, but figures out the  -expected result if it is run from scratch on IR resulting from running indvars.  -  -//===---------------------------------------------------------------------===//  -  -clang -O3 -fno-exceptions currently compiles this code:  -  -struct S {  -  unsigned short m1, m2;  -  unsigned char m3, m4;  -};  -  -void f(int N) {  -  std::vector<S> v(N);  -  extern void sink(void*); sink(&v);  -}  -  -into poor code for zero-initializing 'v' when N is >0. The problem is that  -S is only 6 bytes, but each element is 8 byte-aligned. We generate a loop and  -4 stores on each iteration. If the struct were 8 bytes, this gets turned into  -a memset.  -  -In order to handle this we have to:  -  A) Teach clang to generate metadata for memsets of structs that have holes in  -     them.  -  B) Teach clang to use such a memset for zero init of this struct (since it has  -     a hole), instead of doing elementwise zeroing.  -  -//===---------------------------------------------------------------------===//  -  -clang -O3 currently compiles this code:  -  -extern const int magic;  -double f() { return 0.0 * magic; }  -  -into  -  -@magic = external constant i32  -  -define double @_Z1fv() nounwind readnone {  -entry:  -  %tmp = load i32* @magic, align 4, !tbaa !0  -  %conv = sitofp i32 %tmp to double  -  %mul = fmul double %conv, 0.000000e+00  -  ret double %mul  -}  -  -We should be able to fold away this fmul to 0.0.  More generally, fmul(x,0.0)  -can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and  -not an INF.  
The CannotBeNegativeZero predicate in value tracking should be  -extended to support general "fpclassify" operations that can return   -yes/no/unknown for each of these predicates.  -  -In this predicate, we know that uitofp is trivially never NaN or -0.0, and  -we know that it isn't +/-Inf if the floating point type has enough exponent bits  -to represent the largest integer value as < inf.  -  -//===---------------------------------------------------------------------===//  -  -When optimizing a transformation that can change the sign of 0.0 (such as the  -0.0*val -> 0.0 transformation above), it might be provable that the sign of the  -expression doesn't matter.  For example, by the above rules, we can't transform  -fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the  -expression is defined to be -0.0.  -  -If we look at the uses of the fmul for example, we might be able to prove that  -all uses don't care about the sign of zero.  For example, if we have:  -  -  fadd(fmul(sitofp(x), 0.0), 2.0)  -  -Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can  -transform the fmul to 0.0, and then the fadd to 2.0.  -  -//===---------------------------------------------------------------------===//  -  -We should enhance memcpy/memcpy/memset to allow a metadata node on them  -indicating that some bytes of the transfer are undefined.  This is useful for  -frontends like clang when lowering struct copies, when some elements of the  -struct are undefined.  Consider something like this:  -  -struct x {  -  char a;  -  int b[4];  -};  -void foo(struct x*P);  -struct x testfunc() {  -  struct x V1, V2;  -  foo(&V1);  -  V2 = V1;  -  -  return V2;  -}  -  -We currently compile this to:  -$ clang t.c -S -o - -O0 -emit-llvm | opt -sroa -S  -  -  -%struct.x = type { i8, [4 x i32] }  -  -define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {  -entry:  -  %V1 = alloca %struct.x, align 4  -  call void @foo(%struct.x* %V1)  -  %tmp1 = bitcast %struct.x* %V1 to i8*  -  %0 = bitcast %struct.x* %V1 to i160*  -  %srcval1 = load i160* %0, align 4  -  %tmp2 = bitcast %struct.x* %agg.result to i8*  -  %1 = bitcast %struct.x* %agg.result to i160*  -  store i160 %srcval1, i160* %1, align 4  -  ret void  -}  -  -This happens because SRoA sees that the temp alloca has is being memcpy'd into  -and out of and it has holes and it has to be conservative.  If we knew about the  -holes, then this could be much much better.  -  -Having information about these holes would also improve memcpy (etc) lowering at  -llc time when it gets inlined, because we can use smaller transfers.  This also  -avoids partial register stalls in some important cases.  -  -//===---------------------------------------------------------------------===//  -  -We don't fold (icmp (add) (add)) unless the two adds only have a single use.  -There are a lot of cases that we're refusing to fold in (e.g.) 256.bzip2, for  -example:  -  - %indvar.next90 = add i64 %indvar89, 1     ;; Has 2 uses  - %tmp96 = add i64 %tmp95, 1                ;; Has 1 use  - %exitcond97 = icmp eq i64 %indvar.next90, %tmp96  -  -We don't fold this because we don't want to introduce an overlapped live range  -of the ivar.  However if we can make this more aggressive without causing  -performance issues in two ways:  -  -1. If *either* the LHS or RHS has a single use, we can definitely do the  -   transformation.  
In the overlapping liverange case we're trading one register  -   use for one fewer operation, which is a reasonable trade.  Before doing this  -   we should verify that the llc output actually shrinks for some benchmarks.  -2. If both ops have multiple uses, we can still fold it if the operations are  -   both sinkable to *after* the icmp (e.g. in a subsequent block) which doesn't  -   increase register pressure.  -  -There are a ton of icmp's we aren't simplifying because of the reg pressure  -concern.  Care is warranted here though because many of these are induction  -variables and other cases that matter a lot to performance, like the above.  -Here's a blob of code that you can drop into the bottom of visitICmp to see some  -missed cases:  -  -  { Value *A, *B, *C, *D;  -    if (match(Op0, m_Add(m_Value(A), m_Value(B))) &&   -        match(Op1, m_Add(m_Value(C), m_Value(D))) &&  -        (A == C || A == D || B == C || B == D)) {  -      errs() << "OP0 = " << *Op0 << "  U=" << Op0->getNumUses() << "\n";  -      errs() << "OP1 = " << *Op1 << "  U=" << Op1->getNumUses() << "\n";  -      errs() << "CMP = " << I << "\n\n";  -    }  -  }  -  -//===---------------------------------------------------------------------===//  -  -define i1 @test1(i32 %x) nounwind {  -  %and = and i32 %x, 3  -  %cmp = icmp ult i32 %and, 2  -  ret i1 %cmp  -}  -  -Can be folded to (x & 2) == 0.  -  -define i1 @test2(i32 %x) nounwind {  -  %and = and i32 %x, 3  -  %cmp = icmp ugt i32 %and, 1  -  ret i1 %cmp  -}  -  -Can be folded to (x & 2) != 0.  -  -SimplifyDemandedBits shrinks the "and" constant to 2 but instcombine misses the  -icmp transform.  -  -//===---------------------------------------------------------------------===//  -  -This code:  -  -typedef struct {  -int f1:1;  -int f2:1;  -int f3:1;  -int f4:29;  -} t1;  -  -typedef struct {  -int f1:1;  -int f2:1;  -int f3:30;  -} t2;  -  -t1 s1;  -t2 s2;  -  -void func1(void)  -{  -s1.f1 = s2.f1;  -s1.f2 = s2.f2;  -}  -  -Compiles into this IR (on x86-64 at least):  -  -%struct.t1 = type { i8, [3 x i8] }  -@s2 = global %struct.t1 zeroinitializer, align 4  -@s1 = global %struct.t1 zeroinitializer, align 4  -define void @func1() nounwind ssp noredzone {  -entry:  -  %0 = load i32* bitcast (%struct.t1* @s2 to i32*), align 4  -  %bf.val.sext5 = and i32 %0, 1  -  %1 = load i32* bitcast (%struct.t1* @s1 to i32*), align 4  -  %2 = and i32 %1, -4  -  %3 = or i32 %2, %bf.val.sext5  -  %bf.val.sext26 = and i32 %0, 2  -  %4 = or i32 %3, %bf.val.sext26  -  store i32 %4, i32* bitcast (%struct.t1* @s1 to i32*), align 4  -  ret void  -}  -  -The two or/and's should be merged into one each.  -  -//===---------------------------------------------------------------------===//  -  -Machine level code hoisting can be useful in some cases.  
For example, PR9408  -is about:  -  -typedef union {  - void (*f1)(int);  - void (*f2)(long);  -} funcs;  -  -void foo(funcs f, int which) {  - int a = 5;  - if (which) {  -   f.f1(a);  - } else {  -   f.f2(a);  - }  -}  -  -which we compile to:  -  -foo:                                    # @foo  -# %bb.0:                                # %entry  -       pushq   %rbp  -       movq    %rsp, %rbp  -       testl   %esi, %esi  -       movq    %rdi, %rax  -       je      .LBB0_2  -# %bb.1:                                # %if.then  -       movl    $5, %edi  -       callq   *%rax  -       popq    %rbp  -       ret  -.LBB0_2:                                # %if.else  -       movl    $5, %edi  -       callq   *%rax  -       popq    %rbp  -       ret  -  -Note that bb1 and bb2 are the same.  This doesn't happen at the IR level  -because one call is passing an i32 and the other is passing an i64.  -  -//===---------------------------------------------------------------------===//  -  -I see this sort of pattern in 176.gcc in a few places (e.g. the start of  -store_bit_field).  The rem should be replaced with a multiply and subtract:  -  -  %3 = sdiv i32 %A, %B  -  %4 = srem i32 %A, %B  -  -Similarly for udiv/urem.  Note that this shouldn't be done on X86 or ARM,  -which can do this in a single operation (instruction or libcall).  It is  -probably best to do this in the code generator.  -  -//===---------------------------------------------------------------------===//  -  -unsigned foo(unsigned x, unsigned y) { return (x & y) == 0 || x == 0; }  -should fold to (x & y) == 0.  -  -//===---------------------------------------------------------------------===//  -  -unsigned foo(unsigned x, unsigned y) { return x > y && x != 0; }  -should fold to x > y.  -  -//===---------------------------------------------------------------------===//  + +//===---------------------------------------------------------------------===// + +clang -O3 currently compiles this code + +int g(unsigned int a) { +  unsigned int c[100]; +  c[10] = a; +  c[11] = a; +  unsigned int b = c[10] + c[11]; +  if(b > a*2) a = 4; +  else a = 8; +  return a + 7; +} + +into + +define i32 @g(i32 a) nounwind readnone { +  %add = shl i32 %a, 1 +  %mul = shl i32 %a, 1 +  %cmp = icmp ugt i32 %add, %mul +  %a.addr.0 = select i1 %cmp, i32 11, i32 15 +  ret i32 %a.addr.0 +} + +The icmp should fold to false. This CSE opportunity is only available +after GVN and InstCombine have run. + +//===---------------------------------------------------------------------===// + +memcpyopt should turn this: + +define i8* @test10(i32 %x) { +  %alloc = call noalias i8* @malloc(i32 %x) nounwind +  call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false) +  ret i8* %alloc +} + +into a call to calloc.  We should make sure that we analyze calloc as +aggressively as malloc though. + +//===---------------------------------------------------------------------===// + +clang -O3 doesn't optimize this: + +void f1(int* begin, int* end) { +  std::fill(begin, end, 0); +} + +into a memset.  This is PR8942. 
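+
+For comparison, a hand-written form that already lowers to a single memset
+(just a sketch for reference; f1_bytes is an illustrative name, not something
+from the PR):
+
+void f1_bytes(int* begin, int* end) {
+  // Zero the whole range in one shot instead of element by element.
+  __builtin_memset(begin, 0, (char*)end - (char*)begin);
+}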
+
+//===---------------------------------------------------------------------===//
+
+clang -O3 -fno-exceptions currently compiles this code:
+
+void f(int N) {
+  std::vector<int> v(N);
+
+  extern void sink(void*); sink(&v);
+}
+
+into
+
+define void @_Z1fi(i32 %N) nounwind {
+entry:
+  %v2 = alloca [3 x i32*], align 8
+  %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
+  %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
+  %conv = sext i32 %N to i64
+  store i32* null, i32** %v2.sub, align 8, !tbaa !0
+  %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
+  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
+  %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
+  store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
+  %cmp.i.i.i.i = icmp eq i32 %N, 0
+  br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i
+
+_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
+  store i32* null, i32** %v2.sub, align 8, !tbaa !0
+  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
+  %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
+  store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
+  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
+
+cond.true.i.i.i.i:                                ; preds = %entry
+  %cmp.i.i.i.i.i = icmp slt i32 %N, 0
+  br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i
+
+if.then.i.i.i.i.i:                                ; preds = %cond.true.i.i.i.i
+  call void @_ZSt17__throw_bad_allocv() noreturn nounwind
+  unreachable
+
+_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i:    ; preds = %cond.true.i.i.i.i
+  %mul.i.i.i.i.i = shl i64 %conv, 2
+  %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
+  %0 = bitcast i8* %call3.i.i.i.i.i to i32*
+  store i32* %0, i32** %v2.sub, align 8, !tbaa !0
+  store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
+  %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
+  store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
+  call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
+  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit
+
+This is just the handling of the construction of the vector. Most surprising
+here is the fact that all three null stores in %entry are dead (because we do
+no cross-block DSE).
+
+Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
+This is because the client of LazyValueInfo doesn't simplify all instruction
+operands, just selected ones.
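+
+A tiny C sketch of the missing cross-block DSE (g and h are illustrative
+names, not taken from the testcase above):
+
+int g;
+void h(int c) {
+  g = 0;          // dead: every path below overwrites g before any read
+  if (c) g = 1;
+  else   g = 2;
+}
+
+Removing that first store takes exactly the cross-block reasoning we don't do
+today, which is why all three null stores in %entry above survive.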
+ +//===---------------------------------------------------------------------===// + +clang -O3 -fno-exceptions currently compiles this code: + +void f(char* a, int n) { +  __builtin_memset(a, 0, n); +  for (int i = 0; i < n; ++i) +    a[i] = 0; +} + +into: + +define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind { +entry: +  %conv = sext i32 %n to i64 +  tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false) +  %cmp8 = icmp sgt i32 %n, 0 +  br i1 %cmp8, label %for.body.lr.ph, label %for.end + +for.body.lr.ph:                                   ; preds = %entry +  %tmp10 = add i32 %n, -1 +  %tmp11 = zext i32 %tmp10 to i64 +  %tmp12 = add i64 %tmp11, 1 +  call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false) +  ret void + +for.end:                                          ; preds = %entry +  ret void +} + +This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold +the two memset's together. + +The issue with the addition only occurs in 64-bit mode, and appears to be at +least partially caused by Scalar Evolution not keeping its cache updated: it +returns the "wrong" result immediately after indvars runs, but figures out the +expected result if it is run from scratch on IR resulting from running indvars. + +//===---------------------------------------------------------------------===// + +clang -O3 -fno-exceptions currently compiles this code: + +struct S { +  unsigned short m1, m2; +  unsigned char m3, m4; +}; + +void f(int N) { +  std::vector<S> v(N); +  extern void sink(void*); sink(&v); +} + +into poor code for zero-initializing 'v' when N is >0. The problem is that +S is only 6 bytes, but each element is 8 byte-aligned. We generate a loop and +4 stores on each iteration. If the struct were 8 bytes, this gets turned into +a memset. + +In order to handle this we have to: +  A) Teach clang to generate metadata for memsets of structs that have holes in +     them. +  B) Teach clang to use such a memset for zero init of this struct (since it has +     a hole), instead of doing elementwise zeroing. + +//===---------------------------------------------------------------------===// + +clang -O3 currently compiles this code: + +extern const int magic; +double f() { return 0.0 * magic; } + +into + +@magic = external constant i32 + +define double @_Z1fv() nounwind readnone { +entry: +  %tmp = load i32* @magic, align 4, !tbaa !0 +  %conv = sitofp i32 %tmp to double +  %mul = fmul double %conv, 0.000000e+00 +  ret double %mul +} + +We should be able to fold away this fmul to 0.0.  More generally, fmul(x,0.0) +can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and +not an INF.  The CannotBeNegativeZero predicate in value tracking should be +extended to support general "fpclassify" operations that can return  +yes/no/unknown for each of these predicates. + +In this predicate, we know that uitofp is trivially never NaN or -0.0, and +we know that it isn't +/-Inf if the floating point type has enough exponent bits +to represent the largest integer value as < inf. + +//===---------------------------------------------------------------------===// + +When optimizing a transformation that can change the sign of 0.0 (such as the +0.0*val -> 0.0 transformation above), it might be provable that the sign of the +expression doesn't matter.  For example, by the above rules, we can't transform +fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the +expression is defined to be -0.0. 
+
+If we look at the uses of the fmul for example, we might be able to prove that
+all uses don't care about the sign of zero.  For example, if we have:
+
+  fadd(fmul(sitofp(x), 0.0), 2.0)
+
+Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can
+transform the fmul to 0.0, and then the fadd to 2.0.
+
+//===---------------------------------------------------------------------===//
+
+We should enhance memcpy/memmove/memset to allow a metadata node on them
+indicating that some bytes of the transfer are undefined.  This is useful for
+frontends like clang when lowering struct copies, when some elements of the
+struct are undefined.  Consider something like this:
+
+struct x {
+  char a;
+  int b[4];
+};
+void foo(struct x*P);
+struct x testfunc() {
+  struct x V1, V2;
+  foo(&V1);
+  V2 = V1;
+
+  return V2;
+}
+
+We currently compile this to:
+$ clang t.c -S -o - -O0 -emit-llvm | opt -sroa -S
+
+
+%struct.x = type { i8, [4 x i32] }
+
+define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {
+entry:
+  %V1 = alloca %struct.x, align 4
+  call void @foo(%struct.x* %V1)
+  %tmp1 = bitcast %struct.x* %V1 to i8*
+  %0 = bitcast %struct.x* %V1 to i160*
+  %srcval1 = load i160* %0, align 4
+  %tmp2 = bitcast %struct.x* %agg.result to i8*
+  %1 = bitcast %struct.x* %agg.result to i160*
+  store i160 %srcval1, i160* %1, align 4
+  ret void
+}
+
+This happens because SRoA sees that the temp alloca is being memcpy'd into and
+out of, and that it has holes, so it has to be conservative.  If we knew about
+the holes, then this could be much much better.
+
+Having information about these holes would also improve memcpy (etc) lowering at
+llc time when it gets inlined, because we can use smaller transfers.  This also
+avoids partial register stalls in some important cases.
+
+//===---------------------------------------------------------------------===//
+
+We don't fold (icmp (add) (add)) unless the two adds only have a single use.
+There are a lot of cases that we're refusing to fold in (e.g.) 256.bzip2, for
+example:
+
+ %indvar.next90 = add i64 %indvar89, 1     ;; Has 2 uses
+ %tmp96 = add i64 %tmp95, 1                ;; Has 1 use
+ %exitcond97 = icmp eq i64 %indvar.next90, %tmp96
+
+We don't fold this because we don't want to introduce an overlapped live range
+of the ivar.  However, we could make this more aggressive without causing
+performance issues in two ways:
+
+1. If *either* the LHS or RHS has a single use, we can definitely do the
+   transformation.  In the overlapping live range case we're trading one register
+   use for one fewer operation, which is a reasonable trade.  Before doing this
+   we should verify that the llc output actually shrinks for some benchmarks.
+2. If both ops have multiple uses, we can still fold it if the operations are
+   both sinkable to *after* the icmp (e.g. in a subsequent block) which doesn't
+   increase register pressure.
+
+There are a ton of icmp's we aren't simplifying because of the reg pressure
+concern.  Care is warranted here though because many of these are induction
+variables and other cases that matter a lot to performance, like the above.
+Here's a blob of code that you can drop into the bottom of visitICmp to see some +missed cases: + +  { Value *A, *B, *C, *D; +    if (match(Op0, m_Add(m_Value(A), m_Value(B))) &&  +        match(Op1, m_Add(m_Value(C), m_Value(D))) && +        (A == C || A == D || B == C || B == D)) { +      errs() << "OP0 = " << *Op0 << "  U=" << Op0->getNumUses() << "\n"; +      errs() << "OP1 = " << *Op1 << "  U=" << Op1->getNumUses() << "\n"; +      errs() << "CMP = " << I << "\n\n"; +    } +  } + +//===---------------------------------------------------------------------===// + +define i1 @test1(i32 %x) nounwind { +  %and = and i32 %x, 3 +  %cmp = icmp ult i32 %and, 2 +  ret i1 %cmp +} + +Can be folded to (x & 2) == 0. + +define i1 @test2(i32 %x) nounwind { +  %and = and i32 %x, 3 +  %cmp = icmp ugt i32 %and, 1 +  ret i1 %cmp +} + +Can be folded to (x & 2) != 0. + +SimplifyDemandedBits shrinks the "and" constant to 2 but instcombine misses the +icmp transform. + +//===---------------------------------------------------------------------===// + +This code: + +typedef struct { +int f1:1; +int f2:1; +int f3:1; +int f4:29; +} t1; + +typedef struct { +int f1:1; +int f2:1; +int f3:30; +} t2; + +t1 s1; +t2 s2; + +void func1(void) +{ +s1.f1 = s2.f1; +s1.f2 = s2.f2; +} + +Compiles into this IR (on x86-64 at least): + +%struct.t1 = type { i8, [3 x i8] } +@s2 = global %struct.t1 zeroinitializer, align 4 +@s1 = global %struct.t1 zeroinitializer, align 4 +define void @func1() nounwind ssp noredzone { +entry: +  %0 = load i32* bitcast (%struct.t1* @s2 to i32*), align 4 +  %bf.val.sext5 = and i32 %0, 1 +  %1 = load i32* bitcast (%struct.t1* @s1 to i32*), align 4 +  %2 = and i32 %1, -4 +  %3 = or i32 %2, %bf.val.sext5 +  %bf.val.sext26 = and i32 %0, 2 +  %4 = or i32 %3, %bf.val.sext26 +  store i32 %4, i32* bitcast (%struct.t1* @s1 to i32*), align 4 +  ret void +} + +The two or/and's should be merged into one each. + +//===---------------------------------------------------------------------===// + +Machine level code hoisting can be useful in some cases.  For example, PR9408 +is about: + +typedef union { + void (*f1)(int); + void (*f2)(long); +} funcs; + +void foo(funcs f, int which) { + int a = 5; + if (which) { +   f.f1(a); + } else { +   f.f2(a); + } +} + +which we compile to: + +foo:                                    # @foo +# %bb.0:                                # %entry +       pushq   %rbp +       movq    %rsp, %rbp +       testl   %esi, %esi +       movq    %rdi, %rax +       je      .LBB0_2 +# %bb.1:                                # %if.then +       movl    $5, %edi +       callq   *%rax +       popq    %rbp +       ret +.LBB0_2:                                # %if.else +       movl    $5, %edi +       callq   *%rax +       popq    %rbp +       ret + +Note that bb1 and bb2 are the same.  This doesn't happen at the IR level +because one call is passing an i32 and the other is passing an i64. + +//===---------------------------------------------------------------------===// + +I see this sort of pattern in 176.gcc in a few places (e.g. the start of +store_bit_field).  The rem should be replaced with a multiply and subtract: + +  %3 = sdiv i32 %A, %B +  %4 = srem i32 %A, %B + +Similarly for udiv/urem.  Note that this shouldn't be done on X86 or ARM, +which can do this in a single operation (instruction or libcall).  It is +probably best to do this in the code generator. 
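+
+In C terms the intended rewrite is simply (a sketch of the idea; div_rem is
+just an illustrative helper, and this assumes the target's divide doesn't also
+produce the remainder):
+
+int div_rem(int A, int B, int *rem) {
+  int q = A / B;
+  *rem = A - q * B;   // srem replaced by a multiply and subtract off the sdiv
+  return q;
+}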
+ +//===---------------------------------------------------------------------===// + +unsigned foo(unsigned x, unsigned y) { return (x & y) == 0 || x == 0; } +should fold to (x & y) == 0. + +//===---------------------------------------------------------------------===// + +unsigned foo(unsigned x, unsigned y) { return x > y && x != 0; } +should fold to x > y. + +//===---------------------------------------------------------------------===// diff --git a/contrib/libs/llvm12/lib/Target/X86/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/X86/.yandex_meta/licenses.list.txt index 92fbe1c0846..f08f43f1d80 100644 --- a/contrib/libs/llvm12/lib/Target/X86/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/X86/.yandex_meta/licenses.list.txt @@ -1,309 +1,309 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  -  -  -====================COPYRIGHT====================  -    Trampoline->setComdat(C);  -  BasicBlock *EntryBB = BasicBlock::Create(Context, "entry", Trampoline);  -  IRBuilder<> Builder(EntryBB);  -  -  -====================File: LICENSE.TXT====================  -==============================================================================  -The LLVM Project is under the Apache License v2.0 with LLVM Exceptions:  -==============================================================================  -  -                                 Apache License  -                           Version 2.0, January 2004  -                        http://www.apache.org/licenses/  -  -    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION  -  -    1. Definitions.  -  -      "License" shall mean the terms and conditions for use, reproduction,  -      and distribution as defined by Sections 1 through 9 of this document.  -  -      "Licensor" shall mean the copyright owner or entity authorized by  -      the copyright owner that is granting the License.  -  -      "Legal Entity" shall mean the union of the acting entity and all  -      other entities that control, are controlled by, or are under common  -      control with that entity. For the purposes of this definition,  -      "control" means (i) the power, direct or indirect, to cause the  -      direction or management of such entity, whether by contract or  -      otherwise, or (ii) ownership of fifty percent (50%) or more of the  -      outstanding shares, or (iii) beneficial ownership of such entity.  -  -      "You" (or "Your") shall mean an individual or Legal Entity  -      exercising permissions granted by this License.  -  -      "Source" form shall mean the preferred form for making modifications,  -      including but not limited to software source code, documentation  -      source, and configuration files.  -  -      "Object" form shall mean any form resulting from mechanical  -      transformation or translation of a Source form, including but  -      not limited to compiled object code, generated documentation,  -      and conversions to other media types.  
-  -      "Work" shall mean the work of authorship, whether in Source or  -      Object form, made available under the License, as indicated by a  -      copyright notice that is included in or attached to the work  -      (an example is provided in the Appendix below).  -  -      "Derivative Works" shall mean any work, whether in Source or Object  -      form, that is based on (or derived from) the Work and for which the  -      editorial revisions, annotations, elaborations, or other modifications  -      represent, as a whole, an original work of authorship. For the purposes  -      of this License, Derivative Works shall not include works that remain  -      separable from, or merely link (or bind by name) to the interfaces of,  -      the Work and Derivative Works thereof.  -  -      "Contribution" shall mean any work of authorship, including  -      the original version of the Work and any modifications or additions  -      to that Work or Derivative Works thereof, that is intentionally  -      submitted to Licensor for inclusion in the Work by the copyright owner  -      or by an individual or Legal Entity authorized to submit on behalf of  -      the copyright owner. For the purposes of this definition, "submitted"  -      means any form of electronic, verbal, or written communication sent  -      to the Licensor or its representatives, including but not limited to  -      communication on electronic mailing lists, source code control systems,  -      and issue tracking systems that are managed by, or on behalf of, the  -      Licensor for the purpose of discussing and improving the Work, but  -      excluding communication that is conspicuously marked or otherwise  -      designated in writing by the copyright owner as "Not a Contribution."  -  -      "Contributor" shall mean Licensor and any individual or Legal Entity  -      on behalf of whom a Contribution has been received by Licensor and  -      subsequently incorporated within the Work.  -  -    2. Grant of Copyright License. Subject to the terms and conditions of  -      this License, each Contributor hereby grants to You a perpetual,  -      worldwide, non-exclusive, no-charge, royalty-free, irrevocable  -      copyright license to reproduce, prepare Derivative Works of,  -      publicly display, publicly perform, sublicense, and distribute the  -      Work and such Derivative Works in Source or Object form.  -  -    3. Grant of Patent License. Subject to the terms and conditions of  -      this License, each Contributor hereby grants to You a perpetual,  -      worldwide, non-exclusive, no-charge, royalty-free, irrevocable  -      (except as stated in this section) patent license to make, have made,  -      use, offer to sell, sell, import, and otherwise transfer the Work,  -      where such license applies only to those patent claims licensable  -      by such Contributor that are necessarily infringed by their  -      Contribution(s) alone or by combination of their Contribution(s)  -      with the Work to which such Contribution(s) was submitted. If You  -      institute patent litigation against any entity (including a  -      cross-claim or counterclaim in a lawsuit) alleging that the Work  -      or a Contribution incorporated within the Work constitutes direct  -      or contributory patent infringement, then any patent licenses  -      granted to You under this License for that Work shall terminate  -      as of the date such litigation is filed.  -  -    4. Redistribution. 
You may reproduce and distribute copies of the  -      Work or Derivative Works thereof in any medium, with or without  -      modifications, and in Source or Object form, provided that You  -      meet the following conditions:  -  -      (a) You must give any other recipients of the Work or  -          Derivative Works a copy of this License; and  -  -      (b) You must cause any modified files to carry prominent notices  -          stating that You changed the files; and  -  -      (c) You must retain, in the Source form of any Derivative Works  -          that You distribute, all copyright, patent, trademark, and  -          attribution notices from the Source form of the Work,  -          excluding those notices that do not pertain to any part of  -          the Derivative Works; and  -  -      (d) If the Work includes a "NOTICE" text file as part of its  -          distribution, then any Derivative Works that You distribute must  -          include a readable copy of the attribution notices contained  -          within such NOTICE file, excluding those notices that do not  -          pertain to any part of the Derivative Works, in at least one  -          of the following places: within a NOTICE text file distributed  -          as part of the Derivative Works; within the Source form or  -          documentation, if provided along with the Derivative Works; or,  -          within a display generated by the Derivative Works, if and  -          wherever such third-party notices normally appear. The contents  -          of the NOTICE file are for informational purposes only and  -          do not modify the License. You may add Your own attribution  -          notices within Derivative Works that You distribute, alongside  -          or as an addendum to the NOTICE text from the Work, provided  -          that such additional attribution notices cannot be construed  -          as modifying the License.  -  -      You may add Your own copyright statement to Your modifications and  -      may provide additional or different license terms and conditions  -      for use, reproduction, or distribution of Your modifications, or  -      for any such Derivative Works as a whole, provided Your use,  -      reproduction, and distribution of the Work otherwise complies with  -      the conditions stated in this License.  -  -    5. Submission of Contributions. Unless You explicitly state otherwise,  -      any Contribution intentionally submitted for inclusion in the Work  -      by You to the Licensor shall be under the terms and conditions of  -      this License, without any additional terms or conditions.  -      Notwithstanding the above, nothing herein shall supersede or modify  -      the terms of any separate license agreement you may have executed  -      with Licensor regarding such Contributions.  -  -    6. Trademarks. This License does not grant permission to use the trade  -      names, trademarks, service marks, or product names of the Licensor,  -      except as required for reasonable and customary use in describing the  -      origin of the Work and reproducing the content of the NOTICE file.  -  -    7. Disclaimer of Warranty. 
Unless required by applicable law or  -      agreed to in writing, Licensor provides the Work (and each  -      Contributor provides its Contributions) on an "AS IS" BASIS,  -      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or  -      implied, including, without limitation, any warranties or conditions  -      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A  -      PARTICULAR PURPOSE. You are solely responsible for determining the  -      appropriateness of using or redistributing the Work and assume any  -      risks associated with Your exercise of permissions under this License.  -  -    8. Limitation of Liability. In no event and under no legal theory,  -      whether in tort (including negligence), contract, or otherwise,  -      unless required by applicable law (such as deliberate and grossly  -      negligent acts) or agreed to in writing, shall any Contributor be  -      liable to You for damages, including any direct, indirect, special,  -      incidental, or consequential damages of any character arising as a  -      result of this License or out of the use or inability to use the  -      Work (including but not limited to damages for loss of goodwill,  -      work stoppage, computer failure or malfunction, or any and all  -      other commercial damages or losses), even if such Contributor  -      has been advised of the possibility of such damages.  -  -    9. Accepting Warranty or Additional Liability. While redistributing  -      the Work or Derivative Works thereof, You may choose to offer,  -      and charge a fee for, acceptance of support, warranty, indemnity,  -      or other liability obligations and/or rights consistent with this  -      License. However, in accepting such obligations, You may act only  -      on Your own behalf and on Your sole responsibility, not on behalf  -      of any other Contributor, and only if You agree to indemnify,  -      defend, and hold each Contributor harmless for any liability  -      incurred by, or claims asserted against, such Contributor by reason  -      of your accepting any such warranty or additional liability.  -  -    END OF TERMS AND CONDITIONS  -  -    APPENDIX: How to apply the Apache License to your work.  -  -      To apply the Apache License to your work, attach the following  -      boilerplate notice, with the fields enclosed by brackets "[]"  -      replaced with your own identifying information. (Don't include  -      the brackets!)  The text should be enclosed in the appropriate  -      comment syntax for the file format. We also recommend that a  -      file or class name and description of purpose be included on the  -      same "printed page" as the copyright notice for easier  -      identification within third-party archives.  -  -    Copyright [yyyy] [name of copyright owner]  -  -    Licensed under the Apache License, Version 2.0 (the "License");  -    you may not use this file except in compliance with the License.  -    You may obtain a copy of the License at  -  -       http://www.apache.org/licenses/LICENSE-2.0  -  -    Unless required by applicable law or agreed to in writing, software  -    distributed under the License is distributed on an "AS IS" BASIS,  -    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  -    See the License for the specific language governing permissions and  -    limitations under the License.  
-  -  ----- LLVM Exceptions to the Apache 2.0 License ----  -  -As an exception, if, as a result of your compiling your source code, portions  -of this Software are embedded into an Object form of such source code, you  -may redistribute such embedded portions in such Object form without complying  -with the conditions of Sections 4(a), 4(b) and 4(d) of the License.  -  -In addition, if you combine or link compiled forms of this Software with  -software that is licensed under the GPLv2 ("Combined Software") and if a  -court of competent jurisdiction determines that the patent provision (Section  -3), the indemnity provision (Section 9) or other Section of the License  -conflicts with the conditions of the GPLv2, you may retroactively and  -prospectively choose to deem waived or otherwise exclude such Section(s) of  -the License, but only in their entirety and only with respect to the Combined  -Software.  -  -==============================================================================  -Software from third parties included in the LLVM Project:  -==============================================================================  -The LLVM Project contains third party software which is under different license  -terms. All such code will be identified clearly using at least one of two  -mechanisms:  -1) It will be in a separate directory tree with its own `LICENSE.txt` or  -   `LICENSE` file at the top containing the specific license and restrictions  -   which apply to that software, or  -2) It will contain specific license and restriction terms at the top of every  -   file.  -  -==============================================================================  -Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy):  -==============================================================================  -University of Illinois/NCSA  -Open Source License  -  -Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign.  -All rights reserved.  -  -Developed by:  -  -    LLVM Team  -  -    University of Illinois at Urbana-Champaign  -  -    http://llvm.org  -  -Permission is hereby granted, free of charge, to any person obtaining a copy of  -this software and associated documentation files (the "Software"), to deal with  -the Software without restriction, including without limitation the rights to  -use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies  -of the Software, and to permit persons to whom the Software is furnished to do  -so, subject to the following conditions:  -  -    * Redistributions of source code must retain the above copyright notice,  -      this list of conditions and the following disclaimers.  -  -    * Redistributions in binary form must reproduce the above copyright notice,  -      this list of conditions and the following disclaimers in the  -      documentation and/or other materials provided with the distribution.  -  -    * Neither the names of the LLVM Team, University of Illinois at  -      Urbana-Champaign, nor the names of its contributors may be used to  -      endorse or promote products derived from this Software without specific  -      prior written permission.  -  -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR  -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS  -FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  
IN NO EVENT SHALL THE  -CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER  -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,  -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE  -SOFTWARE.  -  -  -  -====================File: include/llvm/Support/LICENSE.TXT====================  -LLVM System Interface Library  --------------------------------------------------------------------------------  -The LLVM System Interface Library is licensed under the Illinois Open Source  -License and has the following additional copyright:  -  -Copyright (C) 2004 eXtensible Systems, Inc.  -  -  -====================NCSA====================  -// This file is distributed under the University of Illinois Open Source  -// License. See LICENSE.TXT for details.  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception + + +====================COPYRIGHT==================== +    Trampoline->setComdat(C); +  BasicBlock *EntryBB = BasicBlock::Create(Context, "entry", Trampoline); +  IRBuilder<> Builder(EntryBB); + + +====================File: LICENSE.TXT==================== +============================================================================== +The LLVM Project is under the Apache License v2.0 with LLVM Exceptions: +============================================================================== + +                                 Apache License +                           Version 2.0, January 2004 +                        http://www.apache.org/licenses/ + +    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + +    1. Definitions. + +      "License" shall mean the terms and conditions for use, reproduction, +      and distribution as defined by Sections 1 through 9 of this document. + +      "Licensor" shall mean the copyright owner or entity authorized by +      the copyright owner that is granting the License. + +      "Legal Entity" shall mean the union of the acting entity and all +      other entities that control, are controlled by, or are under common +      control with that entity. For the purposes of this definition, +      "control" means (i) the power, direct or indirect, to cause the +      direction or management of such entity, whether by contract or +      otherwise, or (ii) ownership of fifty percent (50%) or more of the +      outstanding shares, or (iii) beneficial ownership of such entity. + +      "You" (or "Your") shall mean an individual or Legal Entity +      exercising permissions granted by this License. + +      "Source" form shall mean the preferred form for making modifications, +      including but not limited to software source code, documentation +      source, and configuration files. + +      "Object" form shall mean any form resulting from mechanical +      transformation or translation of a Source form, including but +      not limited to compiled object code, generated documentation, +      and conversions to other media types. + +      "Work" shall mean the work of authorship, whether in Source or +      Object form, made available under the License, as indicated by a +      copyright notice that is included in or attached to the work +      (an example is provided in the Appendix below). 
+ +      "Derivative Works" shall mean any work, whether in Source or Object +      form, that is based on (or derived from) the Work and for which the +      editorial revisions, annotations, elaborations, or other modifications +      represent, as a whole, an original work of authorship. For the purposes +      of this License, Derivative Works shall not include works that remain +      separable from, or merely link (or bind by name) to the interfaces of, +      the Work and Derivative Works thereof. + +      "Contribution" shall mean any work of authorship, including +      the original version of the Work and any modifications or additions +      to that Work or Derivative Works thereof, that is intentionally +      submitted to Licensor for inclusion in the Work by the copyright owner +      or by an individual or Legal Entity authorized to submit on behalf of +      the copyright owner. For the purposes of this definition, "submitted" +      means any form of electronic, verbal, or written communication sent +      to the Licensor or its representatives, including but not limited to +      communication on electronic mailing lists, source code control systems, +      and issue tracking systems that are managed by, or on behalf of, the +      Licensor for the purpose of discussing and improving the Work, but +      excluding communication that is conspicuously marked or otherwise +      designated in writing by the copyright owner as "Not a Contribution." + +      "Contributor" shall mean Licensor and any individual or Legal Entity +      on behalf of whom a Contribution has been received by Licensor and +      subsequently incorporated within the Work. + +    2. Grant of Copyright License. Subject to the terms and conditions of +      this License, each Contributor hereby grants to You a perpetual, +      worldwide, non-exclusive, no-charge, royalty-free, irrevocable +      copyright license to reproduce, prepare Derivative Works of, +      publicly display, publicly perform, sublicense, and distribute the +      Work and such Derivative Works in Source or Object form. + +    3. Grant of Patent License. Subject to the terms and conditions of +      this License, each Contributor hereby grants to You a perpetual, +      worldwide, non-exclusive, no-charge, royalty-free, irrevocable +      (except as stated in this section) patent license to make, have made, +      use, offer to sell, sell, import, and otherwise transfer the Work, +      where such license applies only to those patent claims licensable +      by such Contributor that are necessarily infringed by their +      Contribution(s) alone or by combination of their Contribution(s) +      with the Work to which such Contribution(s) was submitted. If You +      institute patent litigation against any entity (including a +      cross-claim or counterclaim in a lawsuit) alleging that the Work +      or a Contribution incorporated within the Work constitutes direct +      or contributory patent infringement, then any patent licenses +      granted to You under this License for that Work shall terminate +      as of the date such litigation is filed. + +    4. Redistribution. 
You may reproduce and distribute copies of the +      Work or Derivative Works thereof in any medium, with or without +      modifications, and in Source or Object form, provided that You +      meet the following conditions: + +      (a) You must give any other recipients of the Work or +          Derivative Works a copy of this License; and + +      (b) You must cause any modified files to carry prominent notices +          stating that You changed the files; and + +      (c) You must retain, in the Source form of any Derivative Works +          that You distribute, all copyright, patent, trademark, and +          attribution notices from the Source form of the Work, +          excluding those notices that do not pertain to any part of +          the Derivative Works; and + +      (d) If the Work includes a "NOTICE" text file as part of its +          distribution, then any Derivative Works that You distribute must +          include a readable copy of the attribution notices contained +          within such NOTICE file, excluding those notices that do not +          pertain to any part of the Derivative Works, in at least one +          of the following places: within a NOTICE text file distributed +          as part of the Derivative Works; within the Source form or +          documentation, if provided along with the Derivative Works; or, +          within a display generated by the Derivative Works, if and +          wherever such third-party notices normally appear. The contents +          of the NOTICE file are for informational purposes only and +          do not modify the License. You may add Your own attribution +          notices within Derivative Works that You distribute, alongside +          or as an addendum to the NOTICE text from the Work, provided +          that such additional attribution notices cannot be construed +          as modifying the License. + +      You may add Your own copyright statement to Your modifications and +      may provide additional or different license terms and conditions +      for use, reproduction, or distribution of Your modifications, or +      for any such Derivative Works as a whole, provided Your use, +      reproduction, and distribution of the Work otherwise complies with +      the conditions stated in this License. + +    5. Submission of Contributions. Unless You explicitly state otherwise, +      any Contribution intentionally submitted for inclusion in the Work +      by You to the Licensor shall be under the terms and conditions of +      this License, without any additional terms or conditions. +      Notwithstanding the above, nothing herein shall supersede or modify +      the terms of any separate license agreement you may have executed +      with Licensor regarding such Contributions. + +    6. Trademarks. This License does not grant permission to use the trade +      names, trademarks, service marks, or product names of the Licensor, +      except as required for reasonable and customary use in describing the +      origin of the Work and reproducing the content of the NOTICE file. + +    7. Disclaimer of Warranty. 
Unless required by applicable law or +      agreed to in writing, Licensor provides the Work (and each +      Contributor provides its Contributions) on an "AS IS" BASIS, +      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or +      implied, including, without limitation, any warranties or conditions +      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A +      PARTICULAR PURPOSE. You are solely responsible for determining the +      appropriateness of using or redistributing the Work and assume any +      risks associated with Your exercise of permissions under this License. + +    8. Limitation of Liability. In no event and under no legal theory, +      whether in tort (including negligence), contract, or otherwise, +      unless required by applicable law (such as deliberate and grossly +      negligent acts) or agreed to in writing, shall any Contributor be +      liable to You for damages, including any direct, indirect, special, +      incidental, or consequential damages of any character arising as a +      result of this License or out of the use or inability to use the +      Work (including but not limited to damages for loss of goodwill, +      work stoppage, computer failure or malfunction, or any and all +      other commercial damages or losses), even if such Contributor +      has been advised of the possibility of such damages. + +    9. Accepting Warranty or Additional Liability. While redistributing +      the Work or Derivative Works thereof, You may choose to offer, +      and charge a fee for, acceptance of support, warranty, indemnity, +      or other liability obligations and/or rights consistent with this +      License. However, in accepting such obligations, You may act only +      on Your own behalf and on Your sole responsibility, not on behalf +      of any other Contributor, and only if You agree to indemnify, +      defend, and hold each Contributor harmless for any liability +      incurred by, or claims asserted against, such Contributor by reason +      of your accepting any such warranty or additional liability. + +    END OF TERMS AND CONDITIONS + +    APPENDIX: How to apply the Apache License to your work. + +      To apply the Apache License to your work, attach the following +      boilerplate notice, with the fields enclosed by brackets "[]" +      replaced with your own identifying information. (Don't include +      the brackets!)  The text should be enclosed in the appropriate +      comment syntax for the file format. We also recommend that a +      file or class name and description of purpose be included on the +      same "printed page" as the copyright notice for easier +      identification within third-party archives. + +    Copyright [yyyy] [name of copyright owner] + +    Licensed under the Apache License, Version 2.0 (the "License"); +    you may not use this file except in compliance with the License. +    You may obtain a copy of the License at + +       http://www.apache.org/licenses/LICENSE-2.0 + +    Unless required by applicable law or agreed to in writing, software +    distributed under the License is distributed on an "AS IS" BASIS, +    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +    See the License for the specific language governing permissions and +    limitations under the License. 
+ + +---- LLVM Exceptions to the Apache 2.0 License ---- + +As an exception, if, as a result of your compiling your source code, portions +of this Software are embedded into an Object form of such source code, you +may redistribute such embedded portions in such Object form without complying +with the conditions of Sections 4(a), 4(b) and 4(d) of the License. + +In addition, if you combine or link compiled forms of this Software with +software that is licensed under the GPLv2 ("Combined Software") and if a +court of competent jurisdiction determines that the patent provision (Section +3), the indemnity provision (Section 9) or other Section of the License +conflicts with the conditions of the GPLv2, you may retroactively and +prospectively choose to deem waived or otherwise exclude such Section(s) of +the License, but only in their entirety and only with respect to the Combined +Software. + +============================================================================== +Software from third parties included in the LLVM Project: +============================================================================== +The LLVM Project contains third party software which is under different license +terms. All such code will be identified clearly using at least one of two +mechanisms: +1) It will be in a separate directory tree with its own `LICENSE.txt` or +   `LICENSE` file at the top containing the specific license and restrictions +   which apply to that software, or +2) It will contain specific license and restriction terms at the top of every +   file. + +============================================================================== +Legacy LLVM License (https://llvm.org/docs/DeveloperPolicy.html#legacy): +============================================================================== +University of Illinois/NCSA +Open Source License + +Copyright (c) 2003-2019 University of Illinois at Urbana-Champaign. +All rights reserved. + +Developed by: + +    LLVM Team + +    University of Illinois at Urbana-Champaign + +    http://llvm.org + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal with +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +    * Redistributions of source code must retain the above copyright notice, +      this list of conditions and the following disclaimers. + +    * Redistributions in binary form must reproduce the above copyright notice, +      this list of conditions and the following disclaimers in the +      documentation and/or other materials provided with the distribution. + +    * Neither the names of the LLVM Team, University of Illinois at +      Urbana-Champaign, nor the names of its contributors may be used to +      endorse or promote products derived from this Software without specific +      prior written permission. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS +FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  
IN NO EVENT SHALL THE +CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE +SOFTWARE. + + + +====================File: include/llvm/Support/LICENSE.TXT==================== +LLVM System Interface Library +------------------------------------------------------------------------------- +The LLVM System Interface Library is licensed under the Illinois Open Source +License and has the following additional copyright: + +Copyright (C) 2004 eXtensible Systems, Inc. + + +====================NCSA==================== +// This file is distributed under the University of Illinois Open Source +// License. See LICENSE.TXT for details. diff --git a/contrib/libs/llvm12/lib/Target/X86/AsmParser/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/X86/AsmParser/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/X86/AsmParser/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/X86/AsmParser/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/X86/AsmParser/ya.make b/contrib/libs/llvm12/lib/Target/X86/AsmParser/ya.make index c30cd4cf659..f88283f4e55 100644 --- a/contrib/libs/llvm12/lib/Target/X86/AsmParser/ya.make +++ b/contrib/libs/llvm12/lib/Target/X86/AsmParser/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/X86/Disassembler/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/X86/Disassembler/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/X86/Disassembler/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/X86/Disassembler/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+ + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/X86/Disassembler/ya.make b/contrib/libs/llvm12/lib/Target/X86/Disassembler/ya.make index d1c75366c17..b55833692f9 100644 --- a/contrib/libs/llvm12/lib/Target/X86/Disassembler/ya.make +++ b/contrib/libs/llvm12/lib/Target/X86/Disassembler/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/ya.make b/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/ya.make index 565dda72f55..8da0d02f5b7 100644 --- a/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/ya.make +++ b/contrib/libs/llvm12/lib/Target/X86/MCTargetDesc/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/X86/README-FPStack.txt b/contrib/libs/llvm12/lib/Target/X86/README-FPStack.txt index aab9759b352..39efd2dbcf1 100644 --- a/contrib/libs/llvm12/lib/Target/X86/README-FPStack.txt +++ b/contrib/libs/llvm12/lib/Target/X86/README-FPStack.txt @@ -1,85 +1,85 @@ -//===---------------------------------------------------------------------===//  -// Random ideas for the X86 backend: FP stack related stuff  -//===---------------------------------------------------------------------===//  -  -//===---------------------------------------------------------------------===//  -  -Some targets (e.g. 
athlons) prefer freep to fstp ST(0):  -http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html  -  -//===---------------------------------------------------------------------===//  -  -This should use fiadd on chips where it is profitable:  -double foo(double P, int *I) { return P+*I; }  -  -We have fiadd patterns now but the followings have the same cost and  -complexity. We need a way to specify the later is more profitable.  -  -def FpADD32m  : FpI<(ops RFP:$dst, RFP:$src1, f32mem:$src2), OneArgFPRW,  -                    [(set RFP:$dst, (fadd RFP:$src1,  -                                     (extloadf64f32 addr:$src2)))]>;  -                // ST(0) = ST(0) + [mem32]  -  -def FpIADD32m : FpI<(ops RFP:$dst, RFP:$src1, i32mem:$src2), OneArgFPRW,  -                    [(set RFP:$dst, (fadd RFP:$src1,  -                                     (X86fild addr:$src2, i32)))]>;  -                // ST(0) = ST(0) + [mem32int]  -  -//===---------------------------------------------------------------------===//  -  -The FP stackifier should handle simple permutates to reduce number of shuffle  -instructions, e.g. turning:  -  -fld P	->		fld Q  -fld Q			fld P  -fxch  -  -or:  -  -fxch	->		fucomi  -fucomi			jl X  -jg X  -  -Ideas:  -http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html  -  -  -//===---------------------------------------------------------------------===//  -  -Add a target specific hook to DAG combiner to handle SINT_TO_FP and  -FP_TO_SINT when the source operand is already in memory.  -  -//===---------------------------------------------------------------------===//  -  -Open code rint,floor,ceil,trunc:  -http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html  -http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html  -  -Opencode the sincos[f] libcall.  -  -//===---------------------------------------------------------------------===//  -  -None of the FPStack instructions are handled in  -X86RegisterInfo::foldMemoryOperand, which prevents the spiller from  -folding spill code into the instructions.  -  -//===---------------------------------------------------------------------===//  -  -Currently the x86 codegen isn't very good at mixing SSE and FPStack  -code:  -  -unsigned int foo(double x) { return x; }  -  -foo:  -	subl $20, %esp  -	movsd 24(%esp), %xmm0  -	movsd %xmm0, 8(%esp)  -	fldl 8(%esp)  -	fisttpll (%esp)  -	movl (%esp), %eax  -	addl $20, %esp  -	ret  -  -This just requires being smarter when custom expanding fptoui.  -  -//===---------------------------------------------------------------------===//  +//===---------------------------------------------------------------------===// +// Random ideas for the X86 backend: FP stack related stuff +//===---------------------------------------------------------------------===// + +//===---------------------------------------------------------------------===// + +Some targets (e.g. athlons) prefer freep to fstp ST(0): +http://gcc.gnu.org/ml/gcc-patches/2004-04/msg00659.html + +//===---------------------------------------------------------------------===// + +This should use fiadd on chips where it is profitable: +double foo(double P, int *I) { return P+*I; } + +We have fiadd patterns now but the followings have the same cost and +complexity. We need a way to specify the later is more profitable. 
+ +def FpADD32m  : FpI<(ops RFP:$dst, RFP:$src1, f32mem:$src2), OneArgFPRW, +                    [(set RFP:$dst, (fadd RFP:$src1, +                                     (extloadf64f32 addr:$src2)))]>; +                // ST(0) = ST(0) + [mem32] + +def FpIADD32m : FpI<(ops RFP:$dst, RFP:$src1, i32mem:$src2), OneArgFPRW, +                    [(set RFP:$dst, (fadd RFP:$src1, +                                     (X86fild addr:$src2, i32)))]>; +                // ST(0) = ST(0) + [mem32int] + +//===---------------------------------------------------------------------===// + +The FP stackifier should handle simple permutates to reduce number of shuffle +instructions, e.g. turning: + +fld P	->		fld Q +fld Q			fld P +fxch + +or: + +fxch	->		fucomi +fucomi			jl X +jg X + +Ideas: +http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02410.html + + +//===---------------------------------------------------------------------===// + +Add a target specific hook to DAG combiner to handle SINT_TO_FP and +FP_TO_SINT when the source operand is already in memory. + +//===---------------------------------------------------------------------===// + +Open code rint,floor,ceil,trunc: +http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02006.html +http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02011.html + +Opencode the sincos[f] libcall. + +//===---------------------------------------------------------------------===// + +None of the FPStack instructions are handled in +X86RegisterInfo::foldMemoryOperand, which prevents the spiller from +folding spill code into the instructions. + +//===---------------------------------------------------------------------===// + +Currently the x86 codegen isn't very good at mixing SSE and FPStack +code: + +unsigned int foo(double x) { return x; } + +foo: +	subl $20, %esp +	movsd 24(%esp), %xmm0 +	movsd %xmm0, 8(%esp) +	fldl 8(%esp) +	fisttpll (%esp) +	movl (%esp), %eax +	addl $20, %esp +	ret + +This just requires being smarter when custom expanding fptoui. + +//===---------------------------------------------------------------------===// diff --git a/contrib/libs/llvm12/lib/Target/X86/README-SSE.txt b/contrib/libs/llvm12/lib/Target/X86/README-SSE.txt index 40f526b4788..d52840e5c48 100644 --- a/contrib/libs/llvm12/lib/Target/X86/README-SSE.txt +++ b/contrib/libs/llvm12/lib/Target/X86/README-SSE.txt @@ -1,829 +1,829 @@ -//===---------------------------------------------------------------------===//  -// Random ideas for the X86 backend: SSE-specific stuff.  -//===---------------------------------------------------------------------===//  -  -//===---------------------------------------------------------------------===//  -  -SSE Variable shift can be custom lowered to something like this, which uses a  -small table + unaligned load + shuffle instead of going through memory.  -  -__m128i_shift_right:  -	.byte	  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15  -	.byte	 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1  -  -...  -__m128i shift_right(__m128i value, unsigned long offset) {  -  return _mm_shuffle_epi8(value,  -               _mm_loadu_si128((__m128 *) (___m128i_shift_right + offset)));  -}  -  -//===---------------------------------------------------------------------===//  -  -SSE has instructions for doing operations on complex numbers, we should pattern  -match them.   
For example, this should turn into a horizontal add:  -  -typedef float __attribute__((vector_size(16))) v4f32;  -float f32(v4f32 A) {  -  return A[0]+A[1]+A[2]+A[3];  -}  -  -Instead we get this:  -  -_f32:                                   ## @f32  -	pshufd	$1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0,0,0]  -	addss	%xmm0, %xmm1  -	pshufd	$3, %xmm0, %xmm2        ## xmm2 = xmm0[3,0,0,0]  -	movhlps	%xmm0, %xmm0            ## xmm0 = xmm0[1,1]  -	movaps	%xmm0, %xmm3  -	addss	%xmm1, %xmm3  -	movdqa	%xmm2, %xmm0  -	addss	%xmm3, %xmm0  -	ret  -  -Also, there are cases where some simple local SLP would improve codegen a bit.  -compiling this:  -  -_Complex float f32(_Complex float A, _Complex float B) {  -  return A+B;  -}  -  -into:  -  -_f32:                                   ## @f32  -	movdqa	%xmm0, %xmm2  -	addss	%xmm1, %xmm2  -	pshufd	$1, %xmm1, %xmm1        ## xmm1 = xmm1[1,0,0,0]  -	pshufd	$1, %xmm0, %xmm3        ## xmm3 = xmm0[1,0,0,0]  -	addss	%xmm1, %xmm3  -	movaps	%xmm2, %xmm0  -	unpcklps	%xmm3, %xmm0    ## xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]  -	ret  -  -seems silly when it could just be one addps.  -  -  -//===---------------------------------------------------------------------===//  -  -Expand libm rounding functions inline:  Significant speedups possible.  -http://gcc.gnu.org/ml/gcc-patches/2006-10/msg00909.html  -  -//===---------------------------------------------------------------------===//  -  -When compiled with unsafemath enabled, "main" should enable SSE DAZ mode and  -other fast SSE modes.  -  -//===---------------------------------------------------------------------===//  -  -Think about doing i64 math in SSE regs on x86-32.  -  -//===---------------------------------------------------------------------===//  -  -This testcase should have no SSE instructions in it, and only one load from  -a constant pool:  -  -double %test3(bool %B) {  -        %C = select bool %B, double 123.412, double 523.01123123  -        ret double %C  -}  -  -Currently, the select is being lowered, which prevents the dag combiner from  -turning 'select (load CPI1), (load CPI2)' -> 'load (select CPI1, CPI2)'  -  -The pattern isel got this one right.  -  -//===---------------------------------------------------------------------===//  -  -Lower memcpy / memset to a series of SSE 128 bit move instructions when it's  -feasible.  -  -//===---------------------------------------------------------------------===//  -  -Codegen:  -  if (copysign(1.0, x) == copysign(1.0, y))  -into:  -  if (x^y & mask)  -when using SSE.  -  -//===---------------------------------------------------------------------===//  -  -Use movhps to update upper 64-bits of a v4sf value. Also movlps on lower half  -of a v4sf value.  -  -//===---------------------------------------------------------------------===//  -  -Better codegen for vector_shuffles like this { x, 0, 0, 0 } or { x, 0, x, 0}.  -Perhaps use pxor / xorp* to clear a XMM register first?  -  -//===---------------------------------------------------------------------===//  -  -External test Nurbs exposed some problems. Look for  -__ZN15Nurbs_SSE_Cubic17TessellateSurfaceE, bb cond_next140. 
This is what icc  -emits:  -  -        movaps    (%edx), %xmm2                                 #59.21  -        movaps    (%edx), %xmm5                                 #60.21  -        movaps    (%edx), %xmm4                                 #61.21  -        movaps    (%edx), %xmm3                                 #62.21  -        movl      40(%ecx), %ebp                                #69.49  -        shufps    $0, %xmm2, %xmm5                              #60.21  -        movl      100(%esp), %ebx                               #69.20  -        movl      (%ebx), %edi                                  #69.20  -        imull     %ebp, %edi                                    #69.49  -        addl      (%eax), %edi                                  #70.33  -        shufps    $85, %xmm2, %xmm4                             #61.21  -        shufps    $170, %xmm2, %xmm3                            #62.21  -        shufps    $255, %xmm2, %xmm2                            #63.21  -        lea       (%ebp,%ebp,2), %ebx                           #69.49  -        negl      %ebx                                          #69.49  -        lea       -3(%edi,%ebx), %ebx                           #70.33  -        shll      $4, %ebx                                      #68.37  -        addl      32(%ecx), %ebx                                #68.37  -        testb     $15, %bl                                      #91.13  -        jne       L_B1.24       # Prob 5%                       #91.13  -  -This is the llvm code after instruction scheduling:  -  -cond_next140 (0xa910740, LLVM BB @0xa90beb0):  -	%reg1078 = MOV32ri -3  -	%reg1079 = ADD32rm %reg1078, %reg1068, 1, %noreg, 0  -	%reg1037 = MOV32rm %reg1024, 1, %noreg, 40  -	%reg1080 = IMUL32rr %reg1079, %reg1037  -	%reg1081 = MOV32rm %reg1058, 1, %noreg, 0  -	%reg1038 = LEA32r %reg1081, 1, %reg1080, -3  -	%reg1036 = MOV32rm %reg1024, 1, %noreg, 32  -	%reg1082 = SHL32ri %reg1038, 4  -	%reg1039 = ADD32rr %reg1036, %reg1082  -	%reg1083 = MOVAPSrm %reg1059, 1, %noreg, 0  -	%reg1034 = SHUFPSrr %reg1083, %reg1083, 170  -	%reg1032 = SHUFPSrr %reg1083, %reg1083, 0  -	%reg1035 = SHUFPSrr %reg1083, %reg1083, 255  -	%reg1033 = SHUFPSrr %reg1083, %reg1083, 85  -	%reg1040 = MOV32rr %reg1039  -	%reg1084 = AND32ri8 %reg1039, 15  -	CMP32ri8 %reg1084, 0  -	JE mbb<cond_next204,0xa914d30>  -  -Still ok. After register allocation:  -  -cond_next140 (0xa910740, LLVM BB @0xa90beb0):  -	%eax = MOV32ri -3  -	%edx = MOV32rm %stack.3, 1, %noreg, 0  -	ADD32rm %eax<def&use>, %edx, 1, %noreg, 0  -	%edx = MOV32rm %stack.7, 1, %noreg, 0  -	%edx = MOV32rm %edx, 1, %noreg, 40  -	IMUL32rr %eax<def&use>, %edx  -	%esi = MOV32rm %stack.5, 1, %noreg, 0  -	%esi = MOV32rm %esi, 1, %noreg, 0  -	MOV32mr %stack.4, 1, %noreg, 0, %esi  -	%eax = LEA32r %esi, 1, %eax, -3  -	%esi = MOV32rm %stack.7, 1, %noreg, 0  -	%esi = MOV32rm %esi, 1, %noreg, 32  -	%edi = MOV32rr %eax  -	SHL32ri %edi<def&use>, 4  -	ADD32rr %edi<def&use>, %esi  -	%xmm0 = MOVAPSrm %ecx, 1, %noreg, 0  -	%xmm1 = MOVAPSrr %xmm0  -	SHUFPSrr %xmm1<def&use>, %xmm1, 170  -	%xmm2 = MOVAPSrr %xmm0  -	SHUFPSrr %xmm2<def&use>, %xmm2, 0  -	%xmm3 = MOVAPSrr %xmm0  -	SHUFPSrr %xmm3<def&use>, %xmm3, 255  -	SHUFPSrr %xmm0<def&use>, %xmm0, 85  -	%ebx = MOV32rr %edi  -	AND32ri8 %ebx<def&use>, 15  -	CMP32ri8 %ebx, 0  -	JE mbb<cond_next204,0xa914d30>  -  -This looks really bad. The problem is shufps is a destructive opcode. Since it  -appears as operand two in more than one shufps ops. It resulted in a number of  -copies. 
Note icc also suffers from the same problem. Either the instruction  -selector should select pshufd or The register allocator can made the two-address  -to three-address transformation.  -  -It also exposes some other problems. See MOV32ri -3 and the spills.  -  -//===---------------------------------------------------------------------===//  -  -Consider:  -  -__m128 test(float a) {  -  return _mm_set_ps(0.0, 0.0, 0.0, a*a);  -}  -  -This compiles into:  -  -movss 4(%esp), %xmm1  -mulss %xmm1, %xmm1  -xorps %xmm0, %xmm0  -movss %xmm1, %xmm0  -ret  -  -Because mulss doesn't modify the top 3 elements, the top elements of   -xmm1 are already zero'd.  We could compile this to:  -  -movss 4(%esp), %xmm0  -mulss %xmm0, %xmm0  -ret  -  -//===---------------------------------------------------------------------===//  -  -Here's a sick and twisted idea.  Consider code like this:  -  -__m128 test(__m128 a) {  -  float b = *(float*)&A;  -  ...  -  return _mm_set_ps(0.0, 0.0, 0.0, b);  -}  -  -This might compile to this code:  -  -movaps c(%esp), %xmm1  -xorps %xmm0, %xmm0  -movss %xmm1, %xmm0  -ret  -  -Now consider if the ... code caused xmm1 to get spilled.  This might produce  -this code:  -  -movaps c(%esp), %xmm1  -movaps %xmm1, c2(%esp)  -...  -  -xorps %xmm0, %xmm0  -movaps c2(%esp), %xmm1  -movss %xmm1, %xmm0  -ret  -  -However, since the reload is only used by these instructions, we could   -"fold" it into the uses, producing something like this:  -  -movaps c(%esp), %xmm1  -movaps %xmm1, c2(%esp)  -...  -  -movss c2(%esp), %xmm0  -ret  -  -... saving two instructions.  -  -The basic idea is that a reload from a spill slot, can, if only one 4-byte   -chunk is used, bring in 3 zeros the one element instead of 4 elements.  -This can be used to simplify a variety of shuffle operations, where the  -elements are fixed zeros.  -  -//===---------------------------------------------------------------------===//  -  -This code generates ugly code, probably due to costs being off or something:  -  -define void @test(float* %P, <4 x float>* %P2 ) {  -        %xFloat0.688 = load float* %P  -        %tmp = load <4 x float>* %P2  -        %inFloat3.713 = insertelement <4 x float> %tmp, float 0.0, i32 3  -        store <4 x float> %inFloat3.713, <4 x float>* %P2  -        ret void  -}  -  -Generates:  -  -_test:  -	movl	8(%esp), %eax  -	movaps	(%eax), %xmm0  -	pxor	%xmm1, %xmm1  -	movaps	%xmm0, %xmm2  -	shufps	$50, %xmm1, %xmm2  -	shufps	$132, %xmm2, %xmm0  -	movaps	%xmm0, (%eax)  -	ret  -  -Would it be better to generate:  -  -_test:  -        movl 8(%esp), %ecx  -        movaps (%ecx), %xmm0  -	xor %eax, %eax  -        pinsrw $6, %eax, %xmm0  -        pinsrw $7, %eax, %xmm0  -        movaps %xmm0, (%ecx)  -        ret  -  -?  -  -//===---------------------------------------------------------------------===//  -  -Some useful information in the Apple Altivec / SSE Migration Guide:  -  -http://developer.apple.com/documentation/Performance/Conceptual/  -Accelerate_sse_migration/index.html  -  -e.g. SSE select using and, andnot, or. Various SSE compare translations.  -  -//===---------------------------------------------------------------------===//  -  -Add hooks to commute some CMPP operations.  -  -//===---------------------------------------------------------------------===//  -  -Apply the same transformation that merged four float into a single 128-bit load  -to loads from constant pool.  
-  -//===---------------------------------------------------------------------===//  -  -Floating point max / min are commutable when -enable-unsafe-fp-path is  -specified. We should turn int_x86_sse_max_ss and X86ISD::FMIN etc. into other  -nodes which are selected to max / min instructions that are marked commutable.  -  -//===---------------------------------------------------------------------===//  -  -We should materialize vector constants like "all ones" and "signbit" with   -code like:  -  -     cmpeqps xmm1, xmm1   ; xmm1 = all-ones  -  -and:  -     cmpeqps xmm1, xmm1   ; xmm1 = all-ones  -     psrlq   xmm1, 31     ; xmm1 = all 100000000000...  -  -instead of using a load from the constant pool.  The later is important for  -ABS/NEG/copysign etc.  -  -//===---------------------------------------------------------------------===//  -  -These functions:  -  -#include <xmmintrin.h>  -__m128i a;  -void x(unsigned short n) {  -  a = _mm_slli_epi32 (a, n);  -}  -void y(unsigned n) {  -  a = _mm_slli_epi32 (a, n);  -}  -  -compile to ( -O3 -static -fomit-frame-pointer):  -_x:  -        movzwl  4(%esp), %eax  -        movd    %eax, %xmm0  -        movaps  _a, %xmm1  -        pslld   %xmm0, %xmm1  -        movaps  %xmm1, _a  -        ret  -_y:  -        movd    4(%esp), %xmm0  -        movaps  _a, %xmm1  -        pslld   %xmm0, %xmm1  -        movaps  %xmm1, _a  -        ret  -  -"y" looks good, but "x" does silly movzwl stuff around into a GPR.  It seems  -like movd would be sufficient in both cases as the value is already zero   -extended in the 32-bit stack slot IIRC.  For signed short, it should also be  -save, as a really-signed value would be undefined for pslld.  -  -  -//===---------------------------------------------------------------------===//  -  -#include <math.h>  -int t1(double d) { return signbit(d); }  -  -This currently compiles to:  -	subl	$12, %esp  -	movsd	16(%esp), %xmm0  -	movsd	%xmm0, (%esp)  -	movl	4(%esp), %eax  -	shrl	$31, %eax  -	addl	$12, %esp  -	ret  -  -We should use movmskp{s|d} instead.  -  -//===---------------------------------------------------------------------===//  -  -CodeGen/X86/vec_align.ll tests whether we can turn 4 scalar loads into a single  -(aligned) vector load.  This functionality has a couple of problems.  -  -1. The code to infer alignment from loads of globals is in the X86 backend,  -   not the dag combiner.  This is because dagcombine2 needs to be able to see  -   through the X86ISD::Wrapper node, which DAGCombine can't really do.  -2. The code for turning 4 x load into a single vector load is target   -   independent and should be moved to the dag combiner.  -3. The code for turning 4 x load into a vector load can only handle a direct   -   load from a global or a direct load from the stack.  It should be generalized  -   to handle any load from P, P+4, P+8, P+12, where P can be anything.  -4. The alignment inference code cannot handle loads from globals in non-static  -   mode because it doesn't look through the extra dyld stub load.  If you try  -   vec_align.ll without -relocation-model=static, you'll see what I mean.  -  -//===---------------------------------------------------------------------===//  -  -We should lower store(fneg(load p), q) into an integer load+xor+store, which  -eliminates a constant pool load.  
For example, consider:  -  -define i64 @ccosf(float %z.0, float %z.1) nounwind readonly  {  -entry:  - %tmp6 = fsub float -0.000000e+00, %z.1		; <float> [#uses=1]  - %tmp20 = tail call i64 @ccoshf( float %tmp6, float %z.0 ) nounwind readonly  - ret i64 %tmp20  -}  -declare i64 @ccoshf(float %z.0, float %z.1) nounwind readonly  -  -This currently compiles to:  -  -LCPI1_0:					#  <4 x float>  -	.long	2147483648	# float -0  -	.long	2147483648	# float -0  -	.long	2147483648	# float -0  -	.long	2147483648	# float -0  -_ccosf:  -	subl	$12, %esp  -	movss	16(%esp), %xmm0  -	movss	%xmm0, 4(%esp)  -	movss	20(%esp), %xmm0  -	xorps	LCPI1_0, %xmm0  -	movss	%xmm0, (%esp)  -	call	L_ccoshf$stub  -	addl	$12, %esp  -	ret  -  -Note the load into xmm0, then xor (to negate), then store.  In PIC mode,  -this code computes the pic base and does two loads to do the constant pool   -load, so the improvement is much bigger.  -  -The tricky part about this xform is that the argument load/store isn't exposed  -until post-legalize, and at that point, the fneg has been custom expanded into   -an X86 fxor.  This means that we need to handle this case in the x86 backend  -instead of in target independent code.  -  -//===---------------------------------------------------------------------===//  -  -Non-SSE4 insert into 16 x i8 is atrociously bad.  -  -//===---------------------------------------------------------------------===//  -  -<2 x i64> extract is substantially worse than <2 x f64>, even if the destination  -is memory.  -  -//===---------------------------------------------------------------------===//  -  -INSERTPS can match any insert (extract, imm1), imm2 for 4 x float, and insert  -any number of 0.0 simultaneously.  Currently we only use it for simple  -insertions.  -  -See comments in LowerINSERT_VECTOR_ELT_SSE4.  -  -//===---------------------------------------------------------------------===//  -  -On a random note, SSE2 should declare insert/extract of 2 x f64 as legal, not  -Custom.  All combinations of insert/extract reg-reg, reg-mem, and mem-reg are  -legal, it'll just take a few extra patterns written in the .td file.  -  -Note: this is not a code quality issue; the custom lowered code happens to be  -right, but we shouldn't have to custom lower anything.  This is probably related  -to <2 x i64> ops being so bad.  -  -//===---------------------------------------------------------------------===//  -  -LLVM currently generates stack realignment code, when it is not necessary  -needed. The problem is that we need to know about stack alignment too early,  -before RA runs.  -  -At that point we don't know, whether there will be vector spill, or not.  -Stack realignment logic is overly conservative here, but otherwise we can  -produce unaligned loads/stores.  -  -Fixing this will require some huge RA changes.  -  -Testcase:  +//===---------------------------------------------------------------------===// +// Random ideas for the X86 backend: SSE-specific stuff. +//===---------------------------------------------------------------------===// + +//===---------------------------------------------------------------------===// + +SSE Variable shift can be custom lowered to something like this, which uses a +small table + unaligned load + shuffle instead of going through memory. + +__m128i_shift_right: +	.byte	  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 +	.byte	 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 + +... 
+__m128i shift_right(__m128i value, unsigned long offset) { +  return _mm_shuffle_epi8(value, +               _mm_loadu_si128((__m128 *) (___m128i_shift_right + offset))); +} + +//===---------------------------------------------------------------------===// + +SSE has instructions for doing operations on complex numbers, we should pattern +match them.   For example, this should turn into a horizontal add: + +typedef float __attribute__((vector_size(16))) v4f32; +float f32(v4f32 A) { +  return A[0]+A[1]+A[2]+A[3]; +} + +Instead we get this: + +_f32:                                   ## @f32 +	pshufd	$1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0,0,0] +	addss	%xmm0, %xmm1 +	pshufd	$3, %xmm0, %xmm2        ## xmm2 = xmm0[3,0,0,0] +	movhlps	%xmm0, %xmm0            ## xmm0 = xmm0[1,1] +	movaps	%xmm0, %xmm3 +	addss	%xmm1, %xmm3 +	movdqa	%xmm2, %xmm0 +	addss	%xmm3, %xmm0 +	ret + +Also, there are cases where some simple local SLP would improve codegen a bit. +compiling this: + +_Complex float f32(_Complex float A, _Complex float B) { +  return A+B; +} + +into: + +_f32:                                   ## @f32 +	movdqa	%xmm0, %xmm2 +	addss	%xmm1, %xmm2 +	pshufd	$1, %xmm1, %xmm1        ## xmm1 = xmm1[1,0,0,0] +	pshufd	$1, %xmm0, %xmm3        ## xmm3 = xmm0[1,0,0,0] +	addss	%xmm1, %xmm3 +	movaps	%xmm2, %xmm0 +	unpcklps	%xmm3, %xmm0    ## xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1] +	ret + +seems silly when it could just be one addps. + + +//===---------------------------------------------------------------------===// + +Expand libm rounding functions inline:  Significant speedups possible. +http://gcc.gnu.org/ml/gcc-patches/2006-10/msg00909.html + +//===---------------------------------------------------------------------===// + +When compiled with unsafemath enabled, "main" should enable SSE DAZ mode and +other fast SSE modes. + +//===---------------------------------------------------------------------===// + +Think about doing i64 math in SSE regs on x86-32. + +//===---------------------------------------------------------------------===// + +This testcase should have no SSE instructions in it, and only one load from +a constant pool: + +double %test3(bool %B) { +        %C = select bool %B, double 123.412, double 523.01123123 +        ret double %C +} + +Currently, the select is being lowered, which prevents the dag combiner from +turning 'select (load CPI1), (load CPI2)' -> 'load (select CPI1, CPI2)' + +The pattern isel got this one right. + +//===---------------------------------------------------------------------===// + +Lower memcpy / memset to a series of SSE 128 bit move instructions when it's +feasible. + +//===---------------------------------------------------------------------===// + +Codegen: +  if (copysign(1.0, x) == copysign(1.0, y)) +into: +  if (x^y & mask) +when using SSE. + +//===---------------------------------------------------------------------===// + +Use movhps to update upper 64-bits of a v4sf value. Also movlps on lower half +of a v4sf value. + +//===---------------------------------------------------------------------===// + +Better codegen for vector_shuffles like this { x, 0, 0, 0 } or { x, 0, x, 0}. +Perhaps use pxor / xorp* to clear a XMM register first? + +//===---------------------------------------------------------------------===// + +External test Nurbs exposed some problems. Look for +__ZN15Nurbs_SSE_Cubic17TessellateSurfaceE, bb cond_next140. 
This is what icc +emits: + +        movaps    (%edx), %xmm2                                 #59.21 +        movaps    (%edx), %xmm5                                 #60.21 +        movaps    (%edx), %xmm4                                 #61.21 +        movaps    (%edx), %xmm3                                 #62.21 +        movl      40(%ecx), %ebp                                #69.49 +        shufps    $0, %xmm2, %xmm5                              #60.21 +        movl      100(%esp), %ebx                               #69.20 +        movl      (%ebx), %edi                                  #69.20 +        imull     %ebp, %edi                                    #69.49 +        addl      (%eax), %edi                                  #70.33 +        shufps    $85, %xmm2, %xmm4                             #61.21 +        shufps    $170, %xmm2, %xmm3                            #62.21 +        shufps    $255, %xmm2, %xmm2                            #63.21 +        lea       (%ebp,%ebp,2), %ebx                           #69.49 +        negl      %ebx                                          #69.49 +        lea       -3(%edi,%ebx), %ebx                           #70.33 +        shll      $4, %ebx                                      #68.37 +        addl      32(%ecx), %ebx                                #68.37 +        testb     $15, %bl                                      #91.13 +        jne       L_B1.24       # Prob 5%                       #91.13 + +This is the llvm code after instruction scheduling: + +cond_next140 (0xa910740, LLVM BB @0xa90beb0): +	%reg1078 = MOV32ri -3 +	%reg1079 = ADD32rm %reg1078, %reg1068, 1, %noreg, 0 +	%reg1037 = MOV32rm %reg1024, 1, %noreg, 40 +	%reg1080 = IMUL32rr %reg1079, %reg1037 +	%reg1081 = MOV32rm %reg1058, 1, %noreg, 0 +	%reg1038 = LEA32r %reg1081, 1, %reg1080, -3 +	%reg1036 = MOV32rm %reg1024, 1, %noreg, 32 +	%reg1082 = SHL32ri %reg1038, 4 +	%reg1039 = ADD32rr %reg1036, %reg1082 +	%reg1083 = MOVAPSrm %reg1059, 1, %noreg, 0 +	%reg1034 = SHUFPSrr %reg1083, %reg1083, 170 +	%reg1032 = SHUFPSrr %reg1083, %reg1083, 0 +	%reg1035 = SHUFPSrr %reg1083, %reg1083, 255 +	%reg1033 = SHUFPSrr %reg1083, %reg1083, 85 +	%reg1040 = MOV32rr %reg1039 +	%reg1084 = AND32ri8 %reg1039, 15 +	CMP32ri8 %reg1084, 0 +	JE mbb<cond_next204,0xa914d30> + +Still ok. After register allocation: + +cond_next140 (0xa910740, LLVM BB @0xa90beb0): +	%eax = MOV32ri -3 +	%edx = MOV32rm %stack.3, 1, %noreg, 0 +	ADD32rm %eax<def&use>, %edx, 1, %noreg, 0 +	%edx = MOV32rm %stack.7, 1, %noreg, 0 +	%edx = MOV32rm %edx, 1, %noreg, 40 +	IMUL32rr %eax<def&use>, %edx +	%esi = MOV32rm %stack.5, 1, %noreg, 0 +	%esi = MOV32rm %esi, 1, %noreg, 0 +	MOV32mr %stack.4, 1, %noreg, 0, %esi +	%eax = LEA32r %esi, 1, %eax, -3 +	%esi = MOV32rm %stack.7, 1, %noreg, 0 +	%esi = MOV32rm %esi, 1, %noreg, 32 +	%edi = MOV32rr %eax +	SHL32ri %edi<def&use>, 4 +	ADD32rr %edi<def&use>, %esi +	%xmm0 = MOVAPSrm %ecx, 1, %noreg, 0 +	%xmm1 = MOVAPSrr %xmm0 +	SHUFPSrr %xmm1<def&use>, %xmm1, 170 +	%xmm2 = MOVAPSrr %xmm0 +	SHUFPSrr %xmm2<def&use>, %xmm2, 0 +	%xmm3 = MOVAPSrr %xmm0 +	SHUFPSrr %xmm3<def&use>, %xmm3, 255 +	SHUFPSrr %xmm0<def&use>, %xmm0, 85 +	%ebx = MOV32rr %edi +	AND32ri8 %ebx<def&use>, 15 +	CMP32ri8 %ebx, 0 +	JE mbb<cond_next204,0xa914d30> + +This looks really bad. The problem is shufps is a destructive opcode. Since it +appears as operand two in more than one shufps ops. It resulted in a number of +copies. Note icc also suffers from the same problem. 
Either the instruction +selector should select pshufd or The register allocator can made the two-address +to three-address transformation. + +It also exposes some other problems. See MOV32ri -3 and the spills. + +//===---------------------------------------------------------------------===// + +Consider: + +__m128 test(float a) { +  return _mm_set_ps(0.0, 0.0, 0.0, a*a); +} + +This compiles into: + +movss 4(%esp), %xmm1 +mulss %xmm1, %xmm1 +xorps %xmm0, %xmm0 +movss %xmm1, %xmm0 +ret + +Because mulss doesn't modify the top 3 elements, the top elements of  +xmm1 are already zero'd.  We could compile this to: + +movss 4(%esp), %xmm0 +mulss %xmm0, %xmm0 +ret + +//===---------------------------------------------------------------------===// + +Here's a sick and twisted idea.  Consider code like this: + +__m128 test(__m128 a) { +  float b = *(float*)&A; +  ... +  return _mm_set_ps(0.0, 0.0, 0.0, b); +} + +This might compile to this code: + +movaps c(%esp), %xmm1 +xorps %xmm0, %xmm0 +movss %xmm1, %xmm0 +ret + +Now consider if the ... code caused xmm1 to get spilled.  This might produce +this code: + +movaps c(%esp), %xmm1 +movaps %xmm1, c2(%esp) +... + +xorps %xmm0, %xmm0 +movaps c2(%esp), %xmm1 +movss %xmm1, %xmm0 +ret + +However, since the reload is only used by these instructions, we could  +"fold" it into the uses, producing something like this: + +movaps c(%esp), %xmm1 +movaps %xmm1, c2(%esp) +... + +movss c2(%esp), %xmm0 +ret + +... saving two instructions. + +The basic idea is that a reload from a spill slot, can, if only one 4-byte  +chunk is used, bring in 3 zeros the one element instead of 4 elements. +This can be used to simplify a variety of shuffle operations, where the +elements are fixed zeros. + +//===---------------------------------------------------------------------===// + +This code generates ugly code, probably due to costs being off or something: + +define void @test(float* %P, <4 x float>* %P2 ) { +        %xFloat0.688 = load float* %P +        %tmp = load <4 x float>* %P2 +        %inFloat3.713 = insertelement <4 x float> %tmp, float 0.0, i32 3 +        store <4 x float> %inFloat3.713, <4 x float>* %P2 +        ret void +} + +Generates: + +_test: +	movl	8(%esp), %eax +	movaps	(%eax), %xmm0 +	pxor	%xmm1, %xmm1 +	movaps	%xmm0, %xmm2 +	shufps	$50, %xmm1, %xmm2 +	shufps	$132, %xmm2, %xmm0 +	movaps	%xmm0, (%eax) +	ret + +Would it be better to generate: + +_test: +        movl 8(%esp), %ecx +        movaps (%ecx), %xmm0 +	xor %eax, %eax +        pinsrw $6, %eax, %xmm0 +        pinsrw $7, %eax, %xmm0 +        movaps %xmm0, (%ecx) +        ret + +? + +//===---------------------------------------------------------------------===// + +Some useful information in the Apple Altivec / SSE Migration Guide: + +http://developer.apple.com/documentation/Performance/Conceptual/ +Accelerate_sse_migration/index.html + +e.g. SSE select using and, andnot, or. Various SSE compare translations. + +//===---------------------------------------------------------------------===// + +Add hooks to commute some CMPP operations. + +//===---------------------------------------------------------------------===// + +Apply the same transformation that merged four float into a single 128-bit load +to loads from constant pool. + +//===---------------------------------------------------------------------===// + +Floating point max / min are commutable when -enable-unsafe-fp-path is +specified. We should turn int_x86_sse_max_ss and X86ISD::FMIN etc. 
into other +nodes which are selected to max / min instructions that are marked commutable. + +//===---------------------------------------------------------------------===// + +We should materialize vector constants like "all ones" and "signbit" with  +code like: + +     cmpeqps xmm1, xmm1   ; xmm1 = all-ones + +and: +     cmpeqps xmm1, xmm1   ; xmm1 = all-ones +     psrlq   xmm1, 31     ; xmm1 = all 100000000000... + +instead of using a load from the constant pool.  The later is important for +ABS/NEG/copysign etc. + +//===---------------------------------------------------------------------===// + +These functions: + +#include <xmmintrin.h> +__m128i a; +void x(unsigned short n) { +  a = _mm_slli_epi32 (a, n); +} +void y(unsigned n) { +  a = _mm_slli_epi32 (a, n); +} + +compile to ( -O3 -static -fomit-frame-pointer): +_x: +        movzwl  4(%esp), %eax +        movd    %eax, %xmm0 +        movaps  _a, %xmm1 +        pslld   %xmm0, %xmm1 +        movaps  %xmm1, _a +        ret +_y: +        movd    4(%esp), %xmm0 +        movaps  _a, %xmm1 +        pslld   %xmm0, %xmm1 +        movaps  %xmm1, _a +        ret + +"y" looks good, but "x" does silly movzwl stuff around into a GPR.  It seems +like movd would be sufficient in both cases as the value is already zero  +extended in the 32-bit stack slot IIRC.  For signed short, it should also be +save, as a really-signed value would be undefined for pslld. + + +//===---------------------------------------------------------------------===// + +#include <math.h> +int t1(double d) { return signbit(d); } + +This currently compiles to: +	subl	$12, %esp +	movsd	16(%esp), %xmm0 +	movsd	%xmm0, (%esp) +	movl	4(%esp), %eax +	shrl	$31, %eax +	addl	$12, %esp +	ret + +We should use movmskp{s|d} instead. + +//===---------------------------------------------------------------------===// + +CodeGen/X86/vec_align.ll tests whether we can turn 4 scalar loads into a single +(aligned) vector load.  This functionality has a couple of problems. + +1. The code to infer alignment from loads of globals is in the X86 backend, +   not the dag combiner.  This is because dagcombine2 needs to be able to see +   through the X86ISD::Wrapper node, which DAGCombine can't really do. +2. The code for turning 4 x load into a single vector load is target  +   independent and should be moved to the dag combiner. +3. The code for turning 4 x load into a vector load can only handle a direct  +   load from a global or a direct load from the stack.  It should be generalized +   to handle any load from P, P+4, P+8, P+12, where P can be anything. +4. The alignment inference code cannot handle loads from globals in non-static +   mode because it doesn't look through the extra dyld stub load.  If you try +   vec_align.ll without -relocation-model=static, you'll see what I mean. + +//===---------------------------------------------------------------------===// + +We should lower store(fneg(load p), q) into an integer load+xor+store, which +eliminates a constant pool load.  
For example, consider: + +define i64 @ccosf(float %z.0, float %z.1) nounwind readonly  { +entry: + %tmp6 = fsub float -0.000000e+00, %z.1		; <float> [#uses=1] + %tmp20 = tail call i64 @ccoshf( float %tmp6, float %z.0 ) nounwind readonly + ret i64 %tmp20 +} +declare i64 @ccoshf(float %z.0, float %z.1) nounwind readonly + +This currently compiles to: + +LCPI1_0:					#  <4 x float> +	.long	2147483648	# float -0 +	.long	2147483648	# float -0 +	.long	2147483648	# float -0 +	.long	2147483648	# float -0 +_ccosf: +	subl	$12, %esp +	movss	16(%esp), %xmm0 +	movss	%xmm0, 4(%esp) +	movss	20(%esp), %xmm0 +	xorps	LCPI1_0, %xmm0 +	movss	%xmm0, (%esp) +	call	L_ccoshf$stub +	addl	$12, %esp +	ret + +Note the load into xmm0, then xor (to negate), then store.  In PIC mode, +this code computes the pic base and does two loads to do the constant pool  +load, so the improvement is much bigger. + +The tricky part about this xform is that the argument load/store isn't exposed +until post-legalize, and at that point, the fneg has been custom expanded into  +an X86 fxor.  This means that we need to handle this case in the x86 backend +instead of in target independent code. + +//===---------------------------------------------------------------------===// + +Non-SSE4 insert into 16 x i8 is atrociously bad. + +//===---------------------------------------------------------------------===// + +<2 x i64> extract is substantially worse than <2 x f64>, even if the destination +is memory. + +//===---------------------------------------------------------------------===// + +INSERTPS can match any insert (extract, imm1), imm2 for 4 x float, and insert +any number of 0.0 simultaneously.  Currently we only use it for simple +insertions. + +See comments in LowerINSERT_VECTOR_ELT_SSE4. + +//===---------------------------------------------------------------------===// + +On a random note, SSE2 should declare insert/extract of 2 x f64 as legal, not +Custom.  All combinations of insert/extract reg-reg, reg-mem, and mem-reg are +legal, it'll just take a few extra patterns written in the .td file. + +Note: this is not a code quality issue; the custom lowered code happens to be +right, but we shouldn't have to custom lower anything.  This is probably related +to <2 x i64> ops being so bad. + +//===---------------------------------------------------------------------===// + +LLVM currently generates stack realignment code, when it is not necessary +needed. The problem is that we need to know about stack alignment too early, +before RA runs. + +At that point we don't know, whether there will be vector spill, or not. +Stack realignment logic is overly conservative here, but otherwise we can +produce unaligned loads/stores. + +Fixing this will require some huge RA changes. 
+ +Testcase: +#include <emmintrin.h> + +typedef short vSInt16 __attribute__ ((__vector_size__ (16))); + +static const vSInt16 a = {- 22725, - 12873, - 22725, - 12873, - 22725, - 12873, +- 22725, - 12873};; + +vSInt16 madd(vSInt16 b) +{ +    return _mm_madd_epi16(a, b); +} + +Generated code (x86-32, linux): +madd: +        pushl   %ebp +        movl    %esp, %ebp +        andl    $-16, %esp +        movaps  .LCPI1_0, %xmm1 +        pmaddwd %xmm1, %xmm0 +        movl    %ebp, %esp +        popl    %ebp +        ret + +//===---------------------------------------------------------------------===// + +Consider:  #include <emmintrin.h>  -  -typedef short vSInt16 __attribute__ ((__vector_size__ (16)));  -  -static const vSInt16 a = {- 22725, - 12873, - 22725, - 12873, - 22725, - 12873,  -- 22725, - 12873};;  -  -vSInt16 madd(vSInt16 b)  -{  -    return _mm_madd_epi16(a, b);  -}  -  -Generated code (x86-32, linux):  -madd:  -        pushl   %ebp  -        movl    %esp, %ebp  -        andl    $-16, %esp  -        movaps  .LCPI1_0, %xmm1  -        pmaddwd %xmm1, %xmm0  -        movl    %ebp, %esp  -        popl    %ebp  -        ret  -  -//===---------------------------------------------------------------------===//  -  -Consider:  -#include <emmintrin.h>   -__m128 foo2 (float x) {  - return _mm_set_ps (0, 0, x, 0);  -}  -  -In x86-32 mode, we generate this spiffy code:  -  -_foo2:  -	movss	4(%esp), %xmm0  -	pshufd	$81, %xmm0, %xmm0  -	ret  -  -in x86-64 mode, we generate this code, which could be better:  -  -_foo2:  -	xorps	%xmm1, %xmm1  -	movss	%xmm0, %xmm1  -	pshufd	$81, %xmm1, %xmm0  -	ret  -  -In sse4 mode, we could use insertps to make both better.  -  -Here's another testcase that could use insertps [mem]:  -  -#include <xmmintrin.h>  -extern float x2, x3;  -__m128 foo1 (float x1, float x4) {  - return _mm_set_ps (x2, x1, x3, x4);  -}  -  -gcc mainline compiles it to:  -  -foo1:  -       insertps        $0x10, x2(%rip), %xmm0  -       insertps        $0x10, x3(%rip), %xmm1  -       movaps  %xmm1, %xmm2  -       movlhps %xmm0, %xmm2  -       movaps  %xmm2, %xmm0  -       ret  -  -//===---------------------------------------------------------------------===//  -  -We compile vector multiply-by-constant into poor code:  -  -define <4 x i32> @f(<4 x i32> %i) nounwind  {  -	%A = mul <4 x i32> %i, < i32 10, i32 10, i32 10, i32 10 >  -	ret <4 x i32> %A  -}  -  -On targets without SSE4.1, this compiles into:  -  -LCPI1_0:					##  <4 x i32>  -	.long	10  -	.long	10  -	.long	10  -	.long	10  -	.text  -	.align	4,0x90  -	.globl	_f  -_f:  -	pshufd	$3, %xmm0, %xmm1  -	movd	%xmm1, %eax  -	imull	LCPI1_0+12, %eax  -	movd	%eax, %xmm1  -	pshufd	$1, %xmm0, %xmm2  -	movd	%xmm2, %eax  -	imull	LCPI1_0+4, %eax  -	movd	%eax, %xmm2  -	punpckldq	%xmm1, %xmm2  -	movd	%xmm0, %eax  -	imull	LCPI1_0, %eax  -	movd	%eax, %xmm1  -	movhlps	%xmm0, %xmm0  -	movd	%xmm0, %eax  -	imull	LCPI1_0+8, %eax  -	movd	%eax, %xmm0  -	punpckldq	%xmm0, %xmm1  -	movaps	%xmm1, %xmm0  -	punpckldq	%xmm2, %xmm0  -	ret  -  -It would be better to synthesize integer vector multiplication by constants  -using shifts and adds, pslld and paddd here. And even on targets with SSE4.1,  -simple cases such as multiplication by powers of two would be better as  -vector shifts than as multiplications.  
-  -//===---------------------------------------------------------------------===//  -  -We compile this:  -  -__m128i  -foo2 (char x)  -{  -  return _mm_set_epi8 (1, 0, 0, 0, 0, 0, 0, 0, 0, x, 0, 1, 0, 0, 0, 0);  -}  -  -into:  -	movl	$1, %eax  -	xorps	%xmm0, %xmm0  -	pinsrw	$2, %eax, %xmm0  -	movzbl	4(%esp), %eax  -	pinsrw	$3, %eax, %xmm0  -	movl	$256, %eax  -	pinsrw	$7, %eax, %xmm0  -	ret  -  -  -gcc-4.2:  -	subl	$12, %esp  -	movzbl	16(%esp), %eax  -	movdqa	LC0, %xmm0  -	pinsrw	$3, %eax, %xmm0  -	addl	$12, %esp  -	ret  -	.const  -	.align 4  -LC0:  -	.word	0  -	.word	0  -	.word	1  -	.word	0  -	.word	0  -	.word	0  -	.word	0  -	.word	256  -  -With SSE4, it should be  -      movdqa  .LC0(%rip), %xmm0  -      pinsrb  $6, %edi, %xmm0  -  -//===---------------------------------------------------------------------===//  -  -We should transform a shuffle of two vectors of constants into a single vector  -of constants. Also, insertelement of a constant into a vector of constants  -should also result in a vector of constants. e.g. 2008-06-25-VecISelBug.ll.  -  -We compiled it to something horrible:  -  -	.align	4  -LCPI1_1:					##  float  -	.long	1065353216	## float 1  -	.const  -  -	.align	4  -LCPI1_0:					##  <4 x float>  -	.space	4  -	.long	1065353216	## float 1  -	.space	4  -	.long	1065353216	## float 1  -	.text  -	.align	4,0x90  -	.globl	_t  -_t:  -	xorps	%xmm0, %xmm0  -	movhps	LCPI1_0, %xmm0  -	movss	LCPI1_1, %xmm1  -	movaps	%xmm0, %xmm2  -	shufps	$2, %xmm1, %xmm2  -	shufps	$132, %xmm2, %xmm0  -	movaps	%xmm0, 0  -  -//===---------------------------------------------------------------------===//  -rdar://5907648  -  -This function:  -  -float foo(unsigned char x) {  -  return x;  -}  -  -compiles to (x86-32):  -  -define float @foo(i8 zeroext  %x) nounwind  {  -	%tmp12 = uitofp i8 %x to float		; <float> [#uses=1]  -	ret float %tmp12  -}  -  -compiles to:  -  -_foo:  -	subl	$4, %esp  -	movzbl	8(%esp), %eax  -	cvtsi2ss	%eax, %xmm0  -	movss	%xmm0, (%esp)  -	flds	(%esp)  -	addl	$4, %esp  -	ret  -  -We should be able to use:  -  cvtsi2ss 8($esp), %xmm0  -since we know the stack slot is already zext'd.  -  -//===---------------------------------------------------------------------===//  -  -Consider using movlps instead of movsd to implement (scalar_to_vector (loadf64))  -when code size is critical. movlps is slower than movsd on core2 but it's one  -byte shorter.  -  -//===---------------------------------------------------------------------===//  -  -We should use a dynamic programming based approach to tell when using FPStack  -operations is cheaper than SSE.  SciMark montecarlo contains code like this  -for example:  -  -double MonteCarlo_num_flops(int Num_samples) {  -    return ((double) Num_samples)* 4.0;  -}  -  -In fpstack mode, this compiles into:  -  -LCPI1_0:					  -	.long	1082130432	## float 4.000000e+00  -_MonteCarlo_num_flops:  -	subl	$4, %esp  -	movl	8(%esp), %eax  -	movl	%eax, (%esp)  -	fildl	(%esp)  -	fmuls	LCPI1_0  -	addl	$4, %esp  -	ret  -          -in SSE mode, it compiles into significantly slower code:  -  -_MonteCarlo_num_flops:  -	subl	$12, %esp  -	cvtsi2sd	16(%esp), %xmm0  -	mulsd	LCPI1_0, %xmm0  -	movsd	%xmm0, (%esp)  -	fldl	(%esp)  -	addl	$12, %esp  -	ret  -  -There are also other cases in scimark where using fpstack is better, it is  -cheaper to do fld1 than load from a constant pool for example, so  -"load, add 1.0, store" is better done in the fp stack, etc.  
-  -//===---------------------------------------------------------------------===//  -  -These should compile into the same code (PR6214): Perhaps instcombine should  -canonicalize the former into the later?  -  -define float @foo(float %x) nounwind {  -  %t = bitcast float %x to i32  -  %s = and i32 %t, 2147483647  -  %d = bitcast i32 %s to float  -  ret float %d  -}  -  -declare float @fabsf(float %n)  -define float @bar(float %x) nounwind {  -  %d = call float @fabsf(float %x)  -  ret float %d  -}  -  -//===---------------------------------------------------------------------===//  -  -This IR (from PR6194):  -  -target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"  -target triple = "x86_64-apple-darwin10.0.0"  -  -%0 = type { double, double }  -%struct.float3 = type { float, float, float }  -  -define void @test(%0, %struct.float3* nocapture %res) nounwind noinline ssp {  -entry:  -  %tmp18 = extractvalue %0 %0, 0                  ; <double> [#uses=1]  -  %tmp19 = bitcast double %tmp18 to i64           ; <i64> [#uses=1]  -  %tmp20 = zext i64 %tmp19 to i128                ; <i128> [#uses=1]  -  %tmp10 = lshr i128 %tmp20, 32                   ; <i128> [#uses=1]  -  %tmp11 = trunc i128 %tmp10 to i32               ; <i32> [#uses=1]  -  %tmp12 = bitcast i32 %tmp11 to float            ; <float> [#uses=1]  -  %tmp5 = getelementptr inbounds %struct.float3* %res, i64 0, i32 1 ; <float*> [#uses=1]  -  store float %tmp12, float* %tmp5  -  ret void  -}  -  -Compiles to:  -  -_test:                                  ## @test  -	movd	%xmm0, %rax  -	shrq	$32, %rax  -	movl	%eax, 4(%rdi)  -	ret  -  -This would be better kept in the SSE unit by treating XMM0 as a 4xfloat and  -doing a shuffle from v[1] to v[0] then a float store.  -  -//===---------------------------------------------------------------------===//  -  -[UNSAFE FP]  -  -void foo(double, double, double);  -void norm(double x, double y, double z) {  -  double scale = __builtin_sqrt(x*x + y*y + z*z);  -  foo(x/scale, y/scale, z/scale);  -}  -  -We currently generate an sqrtsd and 3 divsd instructions. This is bad, fp div is  -slow and not pipelined. In -ffast-math mode we could compute "1.0/scale" first  -and emit 3 mulsd in place of the divs. This can be done as a target-independent  -transform.  -  -If we're dealing with floats instead of doubles we could even replace the sqrtss  -and inversion with an rsqrtss instruction, which computes 1/sqrt faster at the  -cost of reduced accuracy.  -  -//===---------------------------------------------------------------------===//  +__m128 foo2 (float x) { + return _mm_set_ps (0, 0, x, 0); +} + +In x86-32 mode, we generate this spiffy code: + +_foo2: +	movss	4(%esp), %xmm0 +	pshufd	$81, %xmm0, %xmm0 +	ret + +in x86-64 mode, we generate this code, which could be better: + +_foo2: +	xorps	%xmm1, %xmm1 +	movss	%xmm0, %xmm1 +	pshufd	$81, %xmm1, %xmm0 +	ret + +In sse4 mode, we could use insertps to make both better. 
+ +Here's another testcase that could use insertps [mem]: + +#include <xmmintrin.h> +extern float x2, x3; +__m128 foo1 (float x1, float x4) { + return _mm_set_ps (x2, x1, x3, x4); +} + +gcc mainline compiles it to: + +foo1: +       insertps        $0x10, x2(%rip), %xmm0 +       insertps        $0x10, x3(%rip), %xmm1 +       movaps  %xmm1, %xmm2 +       movlhps %xmm0, %xmm2 +       movaps  %xmm2, %xmm0 +       ret + +//===---------------------------------------------------------------------===// + +We compile vector multiply-by-constant into poor code: + +define <4 x i32> @f(<4 x i32> %i) nounwind  { +	%A = mul <4 x i32> %i, < i32 10, i32 10, i32 10, i32 10 > +	ret <4 x i32> %A +} + +On targets without SSE4.1, this compiles into: + +LCPI1_0:					##  <4 x i32> +	.long	10 +	.long	10 +	.long	10 +	.long	10 +	.text +	.align	4,0x90 +	.globl	_f +_f: +	pshufd	$3, %xmm0, %xmm1 +	movd	%xmm1, %eax +	imull	LCPI1_0+12, %eax +	movd	%eax, %xmm1 +	pshufd	$1, %xmm0, %xmm2 +	movd	%xmm2, %eax +	imull	LCPI1_0+4, %eax +	movd	%eax, %xmm2 +	punpckldq	%xmm1, %xmm2 +	movd	%xmm0, %eax +	imull	LCPI1_0, %eax +	movd	%eax, %xmm1 +	movhlps	%xmm0, %xmm0 +	movd	%xmm0, %eax +	imull	LCPI1_0+8, %eax +	movd	%eax, %xmm0 +	punpckldq	%xmm0, %xmm1 +	movaps	%xmm1, %xmm0 +	punpckldq	%xmm2, %xmm0 +	ret + +It would be better to synthesize integer vector multiplication by constants +using shifts and adds, pslld and paddd here. And even on targets with SSE4.1, +simple cases such as multiplication by powers of two would be better as +vector shifts than as multiplications. + +//===---------------------------------------------------------------------===// + +We compile this: + +__m128i +foo2 (char x) +{ +  return _mm_set_epi8 (1, 0, 0, 0, 0, 0, 0, 0, 0, x, 0, 1, 0, 0, 0, 0); +} + +into: +	movl	$1, %eax +	xorps	%xmm0, %xmm0 +	pinsrw	$2, %eax, %xmm0 +	movzbl	4(%esp), %eax +	pinsrw	$3, %eax, %xmm0 +	movl	$256, %eax +	pinsrw	$7, %eax, %xmm0 +	ret + + +gcc-4.2: +	subl	$12, %esp +	movzbl	16(%esp), %eax +	movdqa	LC0, %xmm0 +	pinsrw	$3, %eax, %xmm0 +	addl	$12, %esp +	ret +	.const +	.align 4 +LC0: +	.word	0 +	.word	0 +	.word	1 +	.word	0 +	.word	0 +	.word	0 +	.word	0 +	.word	256 + +With SSE4, it should be +      movdqa  .LC0(%rip), %xmm0 +      pinsrb  $6, %edi, %xmm0 + +//===---------------------------------------------------------------------===// + +We should transform a shuffle of two vectors of constants into a single vector +of constants. Also, insertelement of a constant into a vector of constants +should also result in a vector of constants. e.g. 2008-06-25-VecISelBug.ll. 
+ +We compiled it to something horrible: + +	.align	4 +LCPI1_1:					##  float +	.long	1065353216	## float 1 +	.const + +	.align	4 +LCPI1_0:					##  <4 x float> +	.space	4 +	.long	1065353216	## float 1 +	.space	4 +	.long	1065353216	## float 1 +	.text +	.align	4,0x90 +	.globl	_t +_t: +	xorps	%xmm0, %xmm0 +	movhps	LCPI1_0, %xmm0 +	movss	LCPI1_1, %xmm1 +	movaps	%xmm0, %xmm2 +	shufps	$2, %xmm1, %xmm2 +	shufps	$132, %xmm2, %xmm0 +	movaps	%xmm0, 0 + +//===---------------------------------------------------------------------===// +rdar://5907648 + +This function: + +float foo(unsigned char x) { +  return x; +} + +compiles to (x86-32): + +define float @foo(i8 zeroext  %x) nounwind  { +	%tmp12 = uitofp i8 %x to float		; <float> [#uses=1] +	ret float %tmp12 +} + +compiles to: + +_foo: +	subl	$4, %esp +	movzbl	8(%esp), %eax +	cvtsi2ss	%eax, %xmm0 +	movss	%xmm0, (%esp) +	flds	(%esp) +	addl	$4, %esp +	ret + +We should be able to use: +  cvtsi2ss 8($esp), %xmm0 +since we know the stack slot is already zext'd. + +//===---------------------------------------------------------------------===// + +Consider using movlps instead of movsd to implement (scalar_to_vector (loadf64)) +when code size is critical. movlps is slower than movsd on core2 but it's one +byte shorter. + +//===---------------------------------------------------------------------===// + +We should use a dynamic programming based approach to tell when using FPStack +operations is cheaper than SSE.  SciMark montecarlo contains code like this +for example: + +double MonteCarlo_num_flops(int Num_samples) { +    return ((double) Num_samples)* 4.0; +} + +In fpstack mode, this compiles into: + +LCPI1_0:					 +	.long	1082130432	## float 4.000000e+00 +_MonteCarlo_num_flops: +	subl	$4, %esp +	movl	8(%esp), %eax +	movl	%eax, (%esp) +	fildl	(%esp) +	fmuls	LCPI1_0 +	addl	$4, %esp +	ret +         +in SSE mode, it compiles into significantly slower code: + +_MonteCarlo_num_flops: +	subl	$12, %esp +	cvtsi2sd	16(%esp), %xmm0 +	mulsd	LCPI1_0, %xmm0 +	movsd	%xmm0, (%esp) +	fldl	(%esp) +	addl	$12, %esp +	ret + +There are also other cases in scimark where using fpstack is better, it is +cheaper to do fld1 than load from a constant pool for example, so +"load, add 1.0, store" is better done in the fp stack, etc. + +//===---------------------------------------------------------------------===// + +These should compile into the same code (PR6214): Perhaps instcombine should +canonicalize the former into the later? 
+ +define float @foo(float %x) nounwind { +  %t = bitcast float %x to i32 +  %s = and i32 %t, 2147483647 +  %d = bitcast i32 %s to float +  ret float %d +} + +declare float @fabsf(float %n) +define float @bar(float %x) nounwind { +  %d = call float @fabsf(float %x) +  ret float %d +} + +//===---------------------------------------------------------------------===// + +This IR (from PR6194): + +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" +target triple = "x86_64-apple-darwin10.0.0" + +%0 = type { double, double } +%struct.float3 = type { float, float, float } + +define void @test(%0, %struct.float3* nocapture %res) nounwind noinline ssp { +entry: +  %tmp18 = extractvalue %0 %0, 0                  ; <double> [#uses=1] +  %tmp19 = bitcast double %tmp18 to i64           ; <i64> [#uses=1] +  %tmp20 = zext i64 %tmp19 to i128                ; <i128> [#uses=1] +  %tmp10 = lshr i128 %tmp20, 32                   ; <i128> [#uses=1] +  %tmp11 = trunc i128 %tmp10 to i32               ; <i32> [#uses=1] +  %tmp12 = bitcast i32 %tmp11 to float            ; <float> [#uses=1] +  %tmp5 = getelementptr inbounds %struct.float3* %res, i64 0, i32 1 ; <float*> [#uses=1] +  store float %tmp12, float* %tmp5 +  ret void +} + +Compiles to: + +_test:                                  ## @test +	movd	%xmm0, %rax +	shrq	$32, %rax +	movl	%eax, 4(%rdi) +	ret + +This would be better kept in the SSE unit by treating XMM0 as a 4xfloat and +doing a shuffle from v[1] to v[0] then a float store. + +//===---------------------------------------------------------------------===// + +[UNSAFE FP] + +void foo(double, double, double); +void norm(double x, double y, double z) { +  double scale = __builtin_sqrt(x*x + y*y + z*z); +  foo(x/scale, y/scale, z/scale); +} + +We currently generate an sqrtsd and 3 divsd instructions. This is bad, fp div is +slow and not pipelined. In -ffast-math mode we could compute "1.0/scale" first +and emit 3 mulsd in place of the divs. This can be done as a target-independent +transform. + +If we're dealing with floats instead of doubles we could even replace the sqrtss +and inversion with an rsqrtss instruction, which computes 1/sqrt faster at the +cost of reduced accuracy. + +//===---------------------------------------------------------------------===// diff --git a/contrib/libs/llvm12/lib/Target/X86/README-X86-64.txt b/contrib/libs/llvm12/lib/Target/X86/README-X86-64.txt index d919c697bdb..a3ea4595ac1 100644 --- a/contrib/libs/llvm12/lib/Target/X86/README-X86-64.txt +++ b/contrib/libs/llvm12/lib/Target/X86/README-X86-64.txt @@ -1,184 +1,184 @@ -//===- README_X86_64.txt - Notes for X86-64 code gen ----------------------===//  -  -AMD64 Optimization Manual 8.2 has some nice information about optimizing integer  -multiplication by a constant. How much of it applies to Intel's X86-64  -implementation? There are definite trade-offs to consider: latency vs. register  -pressure vs. code size.  -  -//===---------------------------------------------------------------------===//  -  -Are we better off using branches instead of cmove to implement FP to  -unsigned i64?  
-  -_conv:  -	ucomiss	LC0(%rip), %xmm0  -	cvttss2siq	%xmm0, %rdx  -	jb	L3  -	subss	LC0(%rip), %xmm0  -	movabsq	$-9223372036854775808, %rax  -	cvttss2siq	%xmm0, %rdx  -	xorq	%rax, %rdx  -L3:  -	movq	%rdx, %rax  -	ret  -  -instead of  -  -_conv:  -	movss LCPI1_0(%rip), %xmm1  -	cvttss2siq %xmm0, %rcx  -	movaps %xmm0, %xmm2  -	subss %xmm1, %xmm2  -	cvttss2siq %xmm2, %rax  -	movabsq $-9223372036854775808, %rdx  -	xorq %rdx, %rax  -	ucomiss %xmm1, %xmm0  -	cmovb %rcx, %rax  -	ret  -  -Seems like the jb branch has high likelihood of being taken. It would have  -saved a few instructions.  -  -//===---------------------------------------------------------------------===//  -  -It's not possible to reference AH, BH, CH, and DH registers in an instruction  -requiring REX prefix. However, divb and mulb both produce results in AH. If isel  -emits a CopyFromReg which gets turned into a movb and that can be allocated a  -r8b - r15b.  -  -To get around this, isel emits a CopyFromReg from AX and then right shift it  -down by 8 and truncate it. It's not pretty but it works. We need some register  -allocation magic to make the hack go away (e.g. putting additional constraints  -on the result of the movb).  -  -//===---------------------------------------------------------------------===//  -  -The x86-64 ABI for hidden-argument struct returns requires that the  -incoming value of %rdi be copied into %rax by the callee upon return.  -  -The idea is that it saves callers from having to remember this value,  -which would often require a callee-saved register. Callees usually  -need to keep this value live for most of their body anyway, so it  -doesn't add a significant burden on them.  -  -We currently implement this in codegen, however this is suboptimal  -because it means that it would be quite awkward to implement the  -optimization for callers.  -  -A better implementation would be to relax the LLVM IR rules for sret  -arguments to allow a function with an sret argument to have a non-void  -return type, and to have the front-end to set up the sret argument value  -as the return value of the function. The front-end could more easily  -emit uses of the returned struct value to be in terms of the function's  -lowered return value, and it would free non-C frontends from a  -complication only required by a C-based ABI.  
-  -//===---------------------------------------------------------------------===//  -  -We get a redundant zero extension for code like this:  -  -int mask[1000];  -int foo(unsigned x) {  - if (x < 10)  -   x = x * 45;  - else  -   x = x * 78;  - return mask[x];  -}  -  -_foo:  -LBB1_0:	## entry  -	cmpl	$9, %edi  -	jbe	LBB1_3	## bb  -LBB1_1:	## bb1  -	imull	$78, %edi, %eax  -LBB1_2:	## bb2  -	movl	%eax, %eax                    <----  -	movq	_mask@GOTPCREL(%rip), %rcx  -	movl	(%rcx,%rax,4), %eax  -	ret  -LBB1_3:	## bb  -	imull	$45, %edi, %eax  -	jmp	LBB1_2	## bb2  -    -Before regalloc, we have:  -  -        %reg1025 = IMUL32rri8 %reg1024, 45, implicit-def %eflags  -        JMP mbb<bb2,0x203afb0>  -    Successors according to CFG: 0x203afb0 (#3)  -  -bb1: 0x203af60, LLVM BB @0x1e02310, ID#2:  -    Predecessors according to CFG: 0x203aec0 (#0)  -        %reg1026 = IMUL32rri8 %reg1024, 78, implicit-def %eflags  -    Successors according to CFG: 0x203afb0 (#3)  -  -bb2: 0x203afb0, LLVM BB @0x1e02340, ID#3:  -    Predecessors according to CFG: 0x203af10 (#1) 0x203af60 (#2)  -        %reg1027 = PHI %reg1025, mbb<bb,0x203af10>,  -                            %reg1026, mbb<bb1,0x203af60>  -        %reg1029 = MOVZX64rr32 %reg1027  -  -so we'd have to know that IMUL32rri8 leaves the high word zero extended and to  -be able to recognize the zero extend.  This could also presumably be implemented  -if we have whole-function selectiondags.  -  -//===---------------------------------------------------------------------===//  -  -Take the following code  -(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653):  -extern unsigned long table[];  -unsigned long foo(unsigned char *p) {  -  unsigned long tag = *p;  -  return table[tag >> 4] + table[tag & 0xf];  -}  -  -Current code generated:  -	movzbl	(%rdi), %eax  -	movq	%rax, %rcx  -	andq	$240, %rcx  -	shrq	%rcx  -	andq	$15, %rax  -	movq	table(,%rax,8), %rax  -	addq	table(%rcx), %rax  -	ret  -  -Issues:  -1. First movq should be movl; saves a byte.  -2. Both andq's should be andl; saves another two bytes.  I think this was  -   implemented at one point, but subsequently regressed.  -3. shrq should be shrl; saves another byte.  -4. The first andq can be completely eliminated by using a slightly more  -   expensive addressing mode.  -  -//===---------------------------------------------------------------------===//  -  -Consider the following (contrived testcase, but contains common factors):  -  -#include <stdarg.h>  -int test(int x, ...) {  -  int sum, i;  -  va_list l;  -  va_start(l, x);  -  for (i = 0; i < x; i++)  -    sum += va_arg(l, int);  -  va_end(l);  -  return sum;  -}  -  -Testcase given in C because fixing it will likely involve changing the IR  -generated for it.  The primary issue with the result is that it doesn't do any  -of the optimizations which are possible if we know the address of a va_list  -in the current function is never taken:  -1. We shouldn't spill the XMM registers because we only call va_arg with "int".  -2. It would be nice if we could sroa the va_list.  -3. Probably overkill, but it'd be cool if we could peel off the first five  -iterations of the loop.  -  -Other optimizations involving functions which use va_arg on floats which don't  -have the address of a va_list taken:  -1. Conversely to the above, we shouldn't spill general registers if we only  -   call va_arg on "double".  -2. 
If we know nothing more than 64 bits wide is read from the XMM registers,  -   we can change the spilling code to reduce the amount of stack used by half.  -  -//===---------------------------------------------------------------------===//  +//===- README_X86_64.txt - Notes for X86-64 code gen ----------------------===// + +AMD64 Optimization Manual 8.2 has some nice information about optimizing integer +multiplication by a constant. How much of it applies to Intel's X86-64 +implementation? There are definite trade-offs to consider: latency vs. register +pressure vs. code size. + +//===---------------------------------------------------------------------===// + +Are we better off using branches instead of cmove to implement FP to +unsigned i64? + +_conv: +	ucomiss	LC0(%rip), %xmm0 +	cvttss2siq	%xmm0, %rdx +	jb	L3 +	subss	LC0(%rip), %xmm0 +	movabsq	$-9223372036854775808, %rax +	cvttss2siq	%xmm0, %rdx +	xorq	%rax, %rdx +L3: +	movq	%rdx, %rax +	ret + +instead of + +_conv: +	movss LCPI1_0(%rip), %xmm1 +	cvttss2siq %xmm0, %rcx +	movaps %xmm0, %xmm2 +	subss %xmm1, %xmm2 +	cvttss2siq %xmm2, %rax +	movabsq $-9223372036854775808, %rdx +	xorq %rdx, %rax +	ucomiss %xmm1, %xmm0 +	cmovb %rcx, %rax +	ret + +Seems like the jb branch has high likelihood of being taken. It would have +saved a few instructions. + +//===---------------------------------------------------------------------===// + +It's not possible to reference AH, BH, CH, and DH registers in an instruction +requiring REX prefix. However, divb and mulb both produce results in AH. If isel +emits a CopyFromReg which gets turned into a movb and that can be allocated a +r8b - r15b. + +To get around this, isel emits a CopyFromReg from AX and then right shift it +down by 8 and truncate it. It's not pretty but it works. We need some register +allocation magic to make the hack go away (e.g. putting additional constraints +on the result of the movb). + +//===---------------------------------------------------------------------===// + +The x86-64 ABI for hidden-argument struct returns requires that the +incoming value of %rdi be copied into %rax by the callee upon return. + +The idea is that it saves callers from having to remember this value, +which would often require a callee-saved register. Callees usually +need to keep this value live for most of their body anyway, so it +doesn't add a significant burden on them. + +We currently implement this in codegen, however this is suboptimal +because it means that it would be quite awkward to implement the +optimization for callers. + +A better implementation would be to relax the LLVM IR rules for sret +arguments to allow a function with an sret argument to have a non-void +return type, and to have the front-end to set up the sret argument value +as the return value of the function. The front-end could more easily +emit uses of the returned struct value to be in terms of the function's +lowered return value, and it would free non-C frontends from a +complication only required by a C-based ABI. 
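One way to picture the proposed sret relaxation is the hypothetical lowered form below (illustrative only; make_big_lowered and struct big are made-up names, and this is not how LLVM IR is defined today). If the callee is allowed to hand the hidden result pointer back, per the %rdi-to-%rax ABI rule above, the caller can address the result off the return value instead of keeping its own copy of the pointer live in a callee-saved register:

struct big { long a, b, c, d; };

/* Hypothetical lowered signature: takes the hidden result pointer, returns it. */
struct big *make_big_lowered(struct big *sret_ptr);

long caller(void) {
  struct big tmp;
  struct big *p = make_big_lowered(&tmp);  /* p arrives in %rax */
  return p->a + p->d;                      /* no need to keep &tmp (%rdi) live */
}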
+ +//===---------------------------------------------------------------------===// + +We get a redundant zero extension for code like this: + +int mask[1000]; +int foo(unsigned x) { + if (x < 10) +   x = x * 45; + else +   x = x * 78; + return mask[x]; +} + +_foo: +LBB1_0:	## entry +	cmpl	$9, %edi +	jbe	LBB1_3	## bb +LBB1_1:	## bb1 +	imull	$78, %edi, %eax +LBB1_2:	## bb2 +	movl	%eax, %eax                    <---- +	movq	_mask@GOTPCREL(%rip), %rcx +	movl	(%rcx,%rax,4), %eax +	ret +LBB1_3:	## bb +	imull	$45, %edi, %eax +	jmp	LBB1_2	## bb2 +   +Before regalloc, we have: + +        %reg1025 = IMUL32rri8 %reg1024, 45, implicit-def %eflags +        JMP mbb<bb2,0x203afb0> +    Successors according to CFG: 0x203afb0 (#3) + +bb1: 0x203af60, LLVM BB @0x1e02310, ID#2: +    Predecessors according to CFG: 0x203aec0 (#0) +        %reg1026 = IMUL32rri8 %reg1024, 78, implicit-def %eflags +    Successors according to CFG: 0x203afb0 (#3) + +bb2: 0x203afb0, LLVM BB @0x1e02340, ID#3: +    Predecessors according to CFG: 0x203af10 (#1) 0x203af60 (#2) +        %reg1027 = PHI %reg1025, mbb<bb,0x203af10>, +                            %reg1026, mbb<bb1,0x203af60> +        %reg1029 = MOVZX64rr32 %reg1027 + +so we'd have to know that IMUL32rri8 leaves the high word zero extended and to +be able to recognize the zero extend.  This could also presumably be implemented +if we have whole-function selectiondags. + +//===---------------------------------------------------------------------===// + +Take the following code +(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653): +extern unsigned long table[]; +unsigned long foo(unsigned char *p) { +  unsigned long tag = *p; +  return table[tag >> 4] + table[tag & 0xf]; +} + +Current code generated: +	movzbl	(%rdi), %eax +	movq	%rax, %rcx +	andq	$240, %rcx +	shrq	%rcx +	andq	$15, %rax +	movq	table(,%rax,8), %rax +	addq	table(%rcx), %rax +	ret + +Issues: +1. First movq should be movl; saves a byte. +2. Both andq's should be andl; saves another two bytes.  I think this was +   implemented at one point, but subsequently regressed. +3. shrq should be shrl; saves another byte. +4. The first andq can be completely eliminated by using a slightly more +   expensive addressing mode. + +//===---------------------------------------------------------------------===// + +Consider the following (contrived testcase, but contains common factors): + +#include <stdarg.h> +int test(int x, ...) { +  int sum, i; +  va_list l; +  va_start(l, x); +  for (i = 0; i < x; i++) +    sum += va_arg(l, int); +  va_end(l); +  return sum; +} + +Testcase given in C because fixing it will likely involve changing the IR +generated for it.  The primary issue with the result is that it doesn't do any +of the optimizations which are possible if we know the address of a va_list +in the current function is never taken: +1. We shouldn't spill the XMM registers because we only call va_arg with "int". +2. It would be nice if we could sroa the va_list. +3. Probably overkill, but it'd be cool if we could peel off the first five +iterations of the loop. + +Other optimizations involving functions which use va_arg on floats which don't +have the address of a va_list taken: +1. Conversely to the above, we shouldn't spill general registers if we only +   call va_arg on "double". +2. If we know nothing more than 64 bits wide is read from the XMM registers, +   we can change the spilling code to reduce the amount of stack used by half. 
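As a companion to item 1 in the list above (an illustrative testcase only, not one that exists in the tree): here only doubles are pulled from the va_list and its address never escapes, so in principle none of the six integer argument registers need to be dumped into the register save area.

#include <stdarg.h>

double sum_doubles(int n, ...) {
  va_list l;
  double s = 0.0;
  va_start(l, n);
  for (int i = 0; i < n; i++)
    s += va_arg(l, double);   /* only "double" is ever requested */
  va_end(l);
  return s;
}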
+ +//===---------------------------------------------------------------------===// diff --git a/contrib/libs/llvm12/lib/Target/X86/README.txt b/contrib/libs/llvm12/lib/Target/X86/README.txt index 6bc6e74b266..c06a7b1ade6 100644 --- a/contrib/libs/llvm12/lib/Target/X86/README.txt +++ b/contrib/libs/llvm12/lib/Target/X86/README.txt @@ -1,1794 +1,1794 @@ -//===---------------------------------------------------------------------===//  -// Random ideas for the X86 backend.  -//===---------------------------------------------------------------------===//  -  -Improvements to the multiply -> shift/add algorithm:  -http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html  -  -//===---------------------------------------------------------------------===//  -  -Improve code like this (occurs fairly frequently, e.g. in LLVM):  -long long foo(int x) { return 1LL << x; }  -  -http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html  -http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html  -http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html  -  -Another useful one would be  ~0ULL >> X and ~0ULL << X.  -  -One better solution for 1LL << x is:  -        xorl    %eax, %eax  -        xorl    %edx, %edx  -        testb   $32, %cl  -        sete    %al  -        setne   %dl  -        sall    %cl, %eax  -        sall    %cl, %edx  -  -But that requires good 8-bit subreg support.  -  -Also, this might be better.  It's an extra shift, but it's one instruction  -shorter, and doesn't stress 8-bit subreg support.  -(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,  -but without the unnecessary and.)  -        movl %ecx, %eax  -        shrl $5, %eax  -        movl %eax, %edx  -        xorl $1, %edx  -        sall %cl, %eax  -        sall %cl. %edx  -  -64-bit shifts (in general) expand to really bad code.  Instead of using  -cmovs, we should expand to a conditional branch like GCC produces.  -  -//===---------------------------------------------------------------------===//  -  -Some isel ideas:  -  -1. Dynamic programming based approach when compile time is not an  -   issue.  -2. Code duplication (addressing mode) during isel.  -3. Other ideas from "Register-Sensitive Selection, Duplication, and  -   Sequencing of Instructions".  -4. Scheduling for reduced register pressure.  E.g. "Minimum Register  -   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"  -   and other related papers.  -   http://citeseer.ist.psu.edu/govindarajan01minimum.html  -  -//===---------------------------------------------------------------------===//  -  -Should we promote i16 to i32 to avoid partial register update stalls?  -  -//===---------------------------------------------------------------------===//  -  -Leave any_extend as pseudo instruction and hint to register  -allocator. Delay codegen until post register allocation.  -Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach  -the coalescer how to deal with it though.  -  -//===---------------------------------------------------------------------===//  -  -It appears icc use push for parameter passing. Need to investigate.  -  -//===---------------------------------------------------------------------===//  -  -The instruction selector sometimes misses folding a load into a compare.  The  -pattern is written as (cmp reg, (load p)).  Because the compare isn't  -commutative, it is not matched with the load on both sides.  
The dag combiner  -should be made smart enough to canonicalize the load into the RHS of a compare  -when it can invert the result of the compare for free.  -  -//===---------------------------------------------------------------------===//  -  -In many cases, LLVM generates code like this:  -  -_test:  -        movl 8(%esp), %eax  -        cmpl %eax, 4(%esp)  -        setl %al  -        movzbl %al, %eax  -        ret  -  -on some processors (which ones?), it is more efficient to do this:  -  -_test:  -        movl 8(%esp), %ebx  -        xor  %eax, %eax  -        cmpl %ebx, 4(%esp)  -        setl %al  -        ret  -  -Doing this correctly is tricky though, as the xor clobbers the flags.  -  -//===---------------------------------------------------------------------===//  -  -We should generate bts/btr/etc instructions on targets where they are cheap or  -when codesize is important.  e.g., for:  -  -void setbit(int *target, int bit) {  -    *target |= (1 << bit);  -}  -void clearbit(int *target, int bit) {  -    *target &= ~(1 << bit);  -}  -  -//===---------------------------------------------------------------------===//  -  -Instead of the following for memset char*, 1, 10:  -  -	movl $16843009, 4(%edx)  -	movl $16843009, (%edx)  -	movw $257, 8(%edx)  -  -It might be better to generate  -  -	movl $16843009, %eax  -	movl %eax, 4(%edx)  -	movl %eax, (%edx)  -	movw al, 8(%edx)  -	  -when we can spare a register. It reduces code size.  -  -//===---------------------------------------------------------------------===//  -  -Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently  -get this:  -  -define i32 @test1(i32 %X) {  -    %Y = sdiv i32 %X, 8  -    ret i32 %Y  -}  -  -_test1:  -        movl 4(%esp), %eax  -        movl %eax, %ecx  -        sarl $31, %ecx  -        shrl $29, %ecx  -        addl %ecx, %eax  -        sarl $3, %eax  -        ret  -  -GCC knows several different ways to codegen it, one of which is this:  -  -_test1:  -        movl    4(%esp), %eax  -        cmpl    $-1, %eax  -        leal    7(%eax), %ecx  -        cmovle  %ecx, %eax  -        sarl    $3, %eax  -        ret  -  -which is probably slower, but it's interesting at least :)  -  -//===---------------------------------------------------------------------===//  -  -We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl  -We should leave these as libcalls for everything over a much lower threshold,  -since libc is hand tuned for medium and large mem ops (avoiding RFO for large  -stores, TLB preheating, etc)  -  -//===---------------------------------------------------------------------===//  -  -Optimize this into something reasonable:  - x * copysign(1.0, y) * copysign(1.0, z)  -  -//===---------------------------------------------------------------------===//  -  -Optimize copysign(x, *y) to use an integer load from y.  -  -//===---------------------------------------------------------------------===//  -  -The following tests perform worse with LSR:  -  -lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.  
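For the copysign(x, *y) note above, a bit-level sketch of the intended rewrite (illustrative C with a made-up name, not the actual DAG transform): only the sign bit of *y matters, so it can be fetched with an integer load and merged into the bits of x without ever loading *y into an FP register.

#include <stdint.h>
#include <string.h>

double copysign_intload(double x, const double *y) {
  uint64_t xb, yb;
  memcpy(&xb, &x, sizeof xb);
  memcpy(&yb, y, sizeof yb);                    /* integer load of *y */
  xb = (xb & ~(1ULL << 63)) | (yb & (1ULL << 63));
  memcpy(&x, &xb, sizeof x);
  return x;
}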
-  -//===---------------------------------------------------------------------===//  -  -Adding to the list of cmp / test poor codegen issues:  -  -int test(__m128 *A, __m128 *B) {  -  if (_mm_comige_ss(*A, *B))  -    return 3;  -  else  -    return 4;  -}  -  -_test:  -	movl 8(%esp), %eax  -	movaps (%eax), %xmm0  -	movl 4(%esp), %eax  -	movaps (%eax), %xmm1  -	comiss %xmm0, %xmm1  -	setae %al  -	movzbl %al, %ecx  -	movl $3, %eax  -	movl $4, %edx  -	cmpl $0, %ecx  -	cmove %edx, %eax  -	ret  -  -Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There  -are a number of issues. 1) We are introducing a setcc between the result of the  -intrisic call and select. 2) The intrinsic is expected to produce a i32 value  -so a any extend (which becomes a zero extend) is added.  -  -We probably need some kind of target DAG combine hook to fix this.  -  -//===---------------------------------------------------------------------===//  -  -We generate significantly worse code for this than GCC:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150  -http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701  -  -There is also one case we do worse on PPC.  -  -//===---------------------------------------------------------------------===//  -  -For this:  -  -int test(int a)  -{  -  return a * 3;  -}  -  -We currently emits  -	imull $3, 4(%esp), %eax  -  -Perhaps this is what we really should generate is? Is imull three or four  -cycles? Note: ICC generates this:  -	movl	4(%esp), %eax  -	leal	(%eax,%eax,2), %eax  -  -The current instruction priority is based on pattern complexity. The former is  -more "complex" because it folds a load so the latter will not be emitted.  -  -Perhaps we should use AddedComplexity to give LEA32r a higher priority? We  -should always try to match LEA first since the LEA matching code does some  -estimate to determine whether the match is profitable.  -  -However, if we care more about code size, then imull is better. It's two bytes  -shorter than movl + leal.  -  -On a Pentium M, both variants have the same characteristics with regard  -to throughput; however, the multiplication has a latency of four cycles, as  -opposed to two cycles for the movl+lea variant.  -  -//===---------------------------------------------------------------------===//  -  -It appears gcc place string data with linkonce linkage in  -.section __TEXT,__const_coal,coalesced instead of  -.section __DATA,__const_coal,coalesced.  -Take a look at darwin.h, there are other Darwin assembler directives that we  -do not make use of.  
-  -//===---------------------------------------------------------------------===//  -  -define i32 @foo(i32* %a, i32 %t) {  -entry:  -	br label %cond_true  -  -cond_true:		; preds = %cond_true, %entry  -	%x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]		; <i32> [#uses=3]  -	%t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]		; <i32> [#uses=1]  -	%tmp2 = getelementptr i32* %a, i32 %x.0.0		; <i32*> [#uses=1]  -	%tmp3 = load i32* %tmp2		; <i32> [#uses=1]  -	%tmp5 = add i32 %t_addr.0.0, %x.0.0		; <i32> [#uses=1]  -	%tmp7 = add i32 %tmp5, %tmp3		; <i32> [#uses=2]  -	%tmp9 = add i32 %x.0.0, 1		; <i32> [#uses=2]  -	%tmp = icmp sgt i32 %tmp9, 39		; <i1> [#uses=1]  -	br i1 %tmp, label %bb12, label %cond_true  -  -bb12:		; preds = %cond_true  -	ret i32 %tmp7  -}  -is pessimized by -loop-reduce and -indvars  -  -//===---------------------------------------------------------------------===//  -  -u32 to float conversion improvement:  -  -float uint32_2_float( unsigned u ) {  -  float fl = (int) (u & 0xffff);  -  float fh = (int) (u >> 16);  -  fh *= 0x1.0p16f;  -  return fh + fl;  -}  -  -00000000        subl    $0x04,%esp  -00000003        movl    0x08(%esp,1),%eax  -00000007        movl    %eax,%ecx  -00000009        shrl    $0x10,%ecx  -0000000c        cvtsi2ss        %ecx,%xmm0  -00000010        andl    $0x0000ffff,%eax  -00000015        cvtsi2ss        %eax,%xmm1  -00000019        mulss   0x00000078,%xmm0  -00000021        addss   %xmm1,%xmm0  -00000025        movss   %xmm0,(%esp,1)  -0000002a        flds    (%esp,1)  -0000002d        addl    $0x04,%esp  -00000030        ret  -  -//===---------------------------------------------------------------------===//  -  -When using fastcc abi, align stack slot of argument of type double on 8 byte  -boundary to improve performance.  -  -//===---------------------------------------------------------------------===//  -  -GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting  -simplifications for integer "x cmp y ? a : b".  -  -//===---------------------------------------------------------------------===//  -  -Consider the expansion of:  -  -define i32 @test3(i32 %X) {  -        %tmp1 = urem i32 %X, 255  -        ret i32 %tmp1  -}  -  -Currently it compiles to:  -  -...  -        movl $2155905153, %ecx  -        movl 8(%esp), %esi  -        movl %esi, %eax  -        mull %ecx  -...  -  -This could be "reassociated" into:  -  -        movl $2155905153, %eax  -        movl 8(%esp), %ecx  -        mull %ecx  -  -to avoid the copy.  In fact, the existing two-address stuff would do this  -except that mul isn't a commutative 2-addr instruction.  I guess this has  -to be done at isel time based on the #uses to mul?  -  -//===---------------------------------------------------------------------===//  -  -Make sure the instruction which starts a loop does not cross a cacheline  -boundary. This requires knowning the exact length of each machine instruction.  -That is somewhat complicated, but doable. Example 256.bzip2:  -  -In the new trace, the hot loop has an instruction which crosses a cacheline  -boundary.  In addition to potential cache misses, this can't help decoding as I  -imagine there has to be some kind of complicated decoder reset and realignment  -to grab the bytes from the next cacheline.  
-  -532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines  -942  942 0x3d03 movl     %dh, (1809(%esp, %esi)  -937  937 0x3d0a incl     %esi  -3    3   0x3d0b cmpb     %bl, %dl  -27   27  0x3d0d jnz      0x000062db <main+11707>  -  -//===---------------------------------------------------------------------===//  -  -In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.  -  -//===---------------------------------------------------------------------===//  -  -This could be a single 16-bit load.  -  -int f(char *p) {  -    if ((p[0] == 1) & (p[1] == 2)) return 1;  -    return 0;  -}  -  -//===---------------------------------------------------------------------===//  -  -We should inline lrintf and probably other libc functions.  -  -//===---------------------------------------------------------------------===//  -  -This code:  -  -void test(int X) {  -  if (X) abort();  -}  -  -is currently compiled to:  -  -_test:  -        subl $12, %esp  -        cmpl $0, 16(%esp)  -        jne LBB1_1  -        addl $12, %esp  -        ret  -LBB1_1:  -        call L_abort$stub  -  -It would be better to produce:  -  -_test:  -        subl $12, %esp  -        cmpl $0, 16(%esp)  -        jne L_abort$stub  -        addl $12, %esp  -        ret  -  -This can be applied to any no-return function call that takes no arguments etc.  -Alternatively, the stack save/restore logic could be shrink-wrapped, producing  -something like this:  -  -_test:  -        cmpl $0, 4(%esp)  -        jne LBB1_1  -        ret  -LBB1_1:  -        subl $12, %esp  -        call L_abort$stub  -  -Both are useful in different situations.  Finally, it could be shrink-wrapped  -and tail called, like this:  -  -_test:  -        cmpl $0, 4(%esp)  -        jne LBB1_1  -        ret  -LBB1_1:  -        pop %eax   # realign stack.  -        call L_abort$stub  -  -Though this probably isn't worth it.  -  -//===---------------------------------------------------------------------===//  -  -Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with  -a neg instead of a sub instruction.  Consider:  -  -int test(char X) { return 7-X; }  -  -we currently produce:  -_test:  -        movl $7, %eax  -        movsbl 4(%esp), %ecx  -        subl %ecx, %eax  -        ret  -  -We would use one fewer register if codegen'd as:  -  -        movsbl 4(%esp), %eax  -	neg %eax  -        add $7, %eax  -        ret  -  -Note that this isn't beneficial if the load can be folded into the sub.  In  -this case, we want a sub:  -  -int test(int X) { return 7-X; }  -_test:  -        movl $7, %eax  -        subl 4(%esp), %eax  -        ret  -  -//===---------------------------------------------------------------------===//  -  -Leaf functions that require one 4-byte spill slot have a prolog like this:  -  -_foo:  -        pushl   %esi  -        subl    $4, %esp  -...  -and an epilog like this:  -        addl    $4, %esp  -        popl    %esi  -        ret  -  -It would be smaller, and potentially faster, to push eax on entry and to  -pop into a dummy register instead of using addl/subl of esp.  Just don't pop   -into any return registers :)  -  -//===---------------------------------------------------------------------===//  -  -The X86 backend should fold (branch (or (setcc, setcc))) into multiple   -branches.  We generate really poor code for:  -  -double testf(double a) {  -       return a == 0.0 ? 0.0 : (a > 0.0 ? 
1.0 : -1.0);  -}  -  -For example, the entry BB is:  -  -_testf:  -        subl    $20, %esp  -        pxor    %xmm0, %xmm0  -        movsd   24(%esp), %xmm1  -        ucomisd %xmm0, %xmm1  -        setnp   %al  -        sete    %cl  -        testb   %cl, %al  -        jne     LBB1_5  # UnifiedReturnBlock  -LBB1_1: # cond_true  -  -  -it would be better to replace the last four instructions with:  -  -	jp LBB1_1  -	je LBB1_5  -LBB1_1:  -  -We also codegen the inner ?: into a diamond:  -  -       cvtss2sd        LCPI1_0(%rip), %xmm2  -        cvtss2sd        LCPI1_1(%rip), %xmm3  -        ucomisd %xmm1, %xmm0  -        ja      LBB1_3  # cond_true  -LBB1_2: # cond_true  -        movapd  %xmm3, %xmm2  -LBB1_3: # cond_true  -        movapd  %xmm2, %xmm0  -        ret  -  -We should sink the load into xmm3 into the LBB1_2 block.  This should  -be pretty easy, and will nuke all the copies.  -  -//===---------------------------------------------------------------------===//  -  -This:  -        #include <algorithm>  -        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)  -        { return std::make_pair(a + b, a + b < a); }  -        bool no_overflow(unsigned a, unsigned b)  -        { return !full_add(a, b).second; }  -  -Should compile to:  -	addl	%esi, %edi  -	setae	%al  -	movzbl	%al, %eax  -	ret  -  -on x86-64, instead of the rather stupid-looking:  -	addl	%esi, %edi  -	setb	%al  -	xorb	$1, %al  -	movzbl	%al, %eax  -	ret  -  -  -//===---------------------------------------------------------------------===//  -  -The following code:  -  -bb114.preheader:		; preds = %cond_next94  -	%tmp231232 = sext i16 %tmp62 to i32		; <i32> [#uses=1]  -	%tmp233 = sub i32 32, %tmp231232		; <i32> [#uses=1]  -	%tmp245246 = sext i16 %tmp65 to i32		; <i32> [#uses=1]  -	%tmp252253 = sext i16 %tmp68 to i32		; <i32> [#uses=1]  -	%tmp254 = sub i32 32, %tmp252253		; <i32> [#uses=1]  -	%tmp553554 = bitcast i16* %tmp37 to i8*		; <i8*> [#uses=2]  -	%tmp583584 = sext i16 %tmp98 to i32		; <i32> [#uses=1]  -	%tmp585 = sub i32 32, %tmp583584		; <i32> [#uses=1]  -	%tmp614615 = sext i16 %tmp101 to i32		; <i32> [#uses=1]  -	%tmp621622 = sext i16 %tmp104 to i32		; <i32> [#uses=1]  -	%tmp623 = sub i32 32, %tmp621622		; <i32> [#uses=1]  -	br label %bb114  -  -produces:  -  -LBB3_5:	# bb114.preheader  -	movswl	-68(%ebp), %eax  -	movl	$32, %ecx  -	movl	%ecx, -80(%ebp)  -	subl	%eax, -80(%ebp)  -	movswl	-52(%ebp), %eax  -	movl	%ecx, -84(%ebp)  -	subl	%eax, -84(%ebp)  -	movswl	-70(%ebp), %eax  -	movl	%ecx, -88(%ebp)  -	subl	%eax, -88(%ebp)  -	movswl	-50(%ebp), %eax  -	subl	%eax, %ecx  -	movl	%ecx, -76(%ebp)  -	movswl	-42(%ebp), %eax  -	movl	%eax, -92(%ebp)  -	movswl	-66(%ebp), %eax  -	movl	%eax, -96(%ebp)  -	movw	$0, -98(%ebp)  -  -This appears to be bad because the RA is not folding the store to the stack   -slot into the movl.  The above instructions could be:  -	movl    $32, -80(%ebp)  -...  -	movl    $32, -84(%ebp)  -...  -This seems like a cross between remat and spill folding.  -  -This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't  -change, so we could simply subtract %eax from %ecx first and then use %ecx (or  -vice-versa).  
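For the full_add/no_overflow note above: the predicate is simply "the sum did not wrap", and stating it that way corresponds one-for-one to the addl + setae sequence asked for, since setae tests that the add produced no carry. Illustrative reformulation with a made-up name:

#include <stdbool.h>

bool no_overflow2(unsigned a, unsigned b) {
  return a + b >= a;   /* an unsigned add wraps exactly when the sum is below a */
}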
-  -//===---------------------------------------------------------------------===//  -  -This code:  -  -	%tmp659 = icmp slt i16 %tmp654, 0		; <i1> [#uses=1]  -	br i1 %tmp659, label %cond_true662, label %cond_next715  -  -produces this:  -  -	testw	%cx, %cx  -	movswl	%cx, %esi  -	jns	LBB4_109	# cond_next715  -  -Shark tells us that using %cx in the testw instruction is sub-optimal. It  -suggests using the 32-bit register (which is what ICC uses).  -  -//===---------------------------------------------------------------------===//  -  -We compile this:  -  -void compare (long long foo) {  -  if (foo < 4294967297LL)  -    abort();  -}  -  -to:  -  -compare:  -        subl    $4, %esp  -        cmpl    $0, 8(%esp)  -        setne   %al  -        movzbw  %al, %ax  -        cmpl    $1, 12(%esp)  -        setg    %cl  -        movzbw  %cl, %cx  -        cmove   %ax, %cx  -        testb   $1, %cl  -        jne     .LBB1_2 # UnifiedReturnBlock  -.LBB1_1:        # ifthen  -        call    abort  -.LBB1_2:        # UnifiedReturnBlock  -        addl    $4, %esp  -        ret  -  -(also really horrible code on ppc).  This is due to the expand code for 64-bit  -compares.  GCC produces multiple branches, which is much nicer:  -  -compare:  -        subl    $12, %esp  -        movl    20(%esp), %edx  -        movl    16(%esp), %eax  -        decl    %edx  -        jle     .L7  -.L5:  -        addl    $12, %esp  -        ret  -        .p2align 4,,7  -.L7:  -        jl      .L4  -        cmpl    $0, %eax  -        .p2align 4,,8  -        ja      .L5  -.L4:  -        .p2align 4,,9  -        call    abort  -  -//===---------------------------------------------------------------------===//  -  -Tail call optimization improvements: Tail call optimization currently  -pushes all arguments on the top of the stack (their normal place for  -non-tail call optimized calls) that source from the callers arguments  -or  that source from a virtual register (also possibly sourcing from  -callers arguments).  -This is done to prevent overwriting of parameters (see example  -below) that might be used later.  -  -example:    -  -int callee(int32, int64);   -int caller(int32 arg1, int32 arg2) {   -  int64 local = arg2 * 2;   -  return callee(arg2, (int64)local);   -}  -  -[arg1]          [!arg2 no longer valid since we moved local onto it]  -[arg2]      ->  [(int64)  -[RETADDR]        local  ]  -  -Moving arg1 onto the stack slot of callee function would overwrite  -arg2 of the caller.  -  -Possible optimizations:  -  -  - - Analyse the actual parameters of the callee to see which would  -   overwrite a caller parameter which is used by the callee and only  -   push them onto the top of the stack.  -  -   int callee (int32 arg1, int32 arg2);  -   int caller (int32 arg1, int32 arg2) {  -       return callee(arg1,arg2);  -   }  -  -   Here we don't need to write any variables to the top of the stack  -   since they don't overwrite each other.  -  -   int callee (int32 arg1, int32 arg2);  -   int caller (int32 arg1, int32 arg2) {  -       return callee(arg2,arg1);  -   }  -  -   Here we need to push the arguments because they overwrite each  -   other.  
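A rough sketch of the analysis proposed above (a made-up helper, not existing LLVM code): an outgoing tail-call argument only needs to be staged on top of the stack when it is sourced from an incoming argument slot that some other outgoing argument will overwrite. The check below is deliberately conservative; it stages both sides of a swap even though one temporary would be enough.

#include <stdbool.h>

/* src[i] = index of the incoming argument slot that outgoing argument i is
   copied from, or -1 if it comes from somewhere else (constant, local, ...). */
static bool must_stage(const int *src, int num_outgoing, int i) {
  int s = src[i];
  if (s < 0 || s == i)        /* no incoming source, or written in place */
    return false;
  return s < num_outgoing;    /* slot s will be clobbered by outgoing arg s */
}

For callee(arg1, arg2) every source slot equals its destination, so nothing is staged; for callee(arg2, arg1) both arguments are staged, matching the two examples above.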
-  -//===---------------------------------------------------------------------===//  -  -main ()  -{  -  int i = 0;  -  unsigned long int z = 0;  -  -  do {  -    z -= 0x00004000;  -    i++;  -    if (i > 0x00040000)  -      abort ();  -  } while (z > 0);  -  exit (0);  -}  -  -gcc compiles this to:  -  -_main:  -	subl	$28, %esp  -	xorl	%eax, %eax  -	jmp	L2  -L3:  -	cmpl	$262144, %eax  -	je	L10  -L2:  -	addl	$1, %eax  -	cmpl	$262145, %eax  -	jne	L3  -	call	L_abort$stub  -L10:  -	movl	$0, (%esp)  -	call	L_exit$stub  -  -llvm:  -  -_main:  -	subl	$12, %esp  -	movl	$1, %eax  -	movl	$16384, %ecx  -LBB1_1:	# bb  -	cmpl	$262145, %eax  -	jge	LBB1_4	# cond_true  -LBB1_2:	# cond_next  -	incl	%eax  -	addl	$4294950912, %ecx  -	cmpl	$16384, %ecx  -	jne	LBB1_1	# bb  -LBB1_3:	# bb11  -	xorl	%eax, %eax  -	addl	$12, %esp  -	ret  -LBB1_4:	# cond_true  -	call	L_abort$stub  -  -1. LSR should rewrite the first cmp with induction variable %ecx.  -2. DAG combiner should fold  -        leal    1(%eax), %edx  -        cmpl    $262145, %edx  -   =>  -        cmpl    $262144, %eax  -  -//===---------------------------------------------------------------------===//  -  -define i64 @test(double %X) {  -	%Y = fptosi double %X to i64  -	ret i64 %Y  -}  -  -compiles to:  -  -_test:  -	subl	$20, %esp  -	movsd	24(%esp), %xmm0  -	movsd	%xmm0, 8(%esp)  -	fldl	8(%esp)  -	fisttpll	(%esp)  -	movl	4(%esp), %edx  -	movl	(%esp), %eax  -	addl	$20, %esp  -	#FP_REG_KILL  -	ret  -  -This should just fldl directly from the input stack slot.  -  -//===---------------------------------------------------------------------===//  -  -This code:  -int foo (int x) { return (x & 65535) | 255; }  -  -Should compile into:  -  -_foo:  -        movzwl  4(%esp), %eax  -        orl     $255, %eax  -        ret  -  -instead of:  -_foo:  -	movl	$65280, %eax  -	andl	4(%esp), %eax  -	orl	$255, %eax  -	ret  -  -//===---------------------------------------------------------------------===//  -  -We're codegen'ing multiply of long longs inefficiently:  -  -unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {  -  return arg1 *  arg2;  -}  -  -We compile to (fomit-frame-pointer):  -  -_LLM:  -	pushl	%esi  -	movl	8(%esp), %ecx  -	movl	16(%esp), %esi  -	movl	%esi, %eax  -	mull	%ecx  -	imull	12(%esp), %esi  -	addl	%edx, %esi  -	imull	20(%esp), %ecx  -	movl	%esi, %edx  -	addl	%ecx, %edx  -	popl	%esi  -	ret  -  -This looks like a scheduling deficiency and lack of remat of the load from  -the argument area.  ICC apparently produces:  -  -        movl      8(%esp), %ecx  -        imull     12(%esp), %ecx  -        movl      16(%esp), %eax  -        imull     4(%esp), %eax   -        addl      %eax, %ecx    -        movl      4(%esp), %eax  -        mull      12(%esp)   -        addl      %ecx, %edx  -        ret  -  -Note that it remat'd loads from 4(esp) and 12(esp).  See this GCC PR:  -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236  -  -//===---------------------------------------------------------------------===//  -  -We can fold a store into "zeroing a reg".  Instead of:  -  -xorl    %eax, %eax  -movl    %eax, 124(%esp)  -  -we should get:  -  -movl    $0, 124(%esp)  -  -if the flags of the xor are dead.  -  -Likewise, we isel "x<<1" into "add reg,reg".  
If reg is spilled, this should  -be folded into: shl [mem], 1  -  -//===---------------------------------------------------------------------===//  -  -In SSE mode, we turn abs and neg into a load from the constant pool plus a xor  -or and instruction, for example:  -  -	xorpd	LCPI1_0, %xmm2  -  -However, if xmm2 gets spilled, we end up with really ugly code like this:  -  -	movsd	(%esp), %xmm0  -	xorpd	LCPI1_0, %xmm0  -	movsd	%xmm0, (%esp)  -  -Since we 'know' that this is a 'neg', we can actually "fold" the spill into  -the neg/abs instruction, turning it into an *integer* operation, like this:  -  -	xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)  -  -you could also use xorb, but xorl is less likely to lead to a partial register  -stall.  Here is a contrived testcase:  -  -double a, b, c;  -void test(double *P) {  -  double X = *P;  -  a = X;  -  bar();  -  X = -X;  -  b = X;  -  bar();  -  c = X;  -}  -  -//===---------------------------------------------------------------------===//  -  -The generated code on x86 for checking for signed overflow on a multiply the  -obvious way is much longer than it needs to be.  -  -int x(int a, int b) {  -  long long prod = (long long)a*b;  -  return  prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);  -}  -  -See PR2053 for more details.  -  -//===---------------------------------------------------------------------===//  -  -We should investigate using cdq/ctld (effect: edx = sar eax, 31)  -more aggressively; it should cost the same as a move+shift on any modern  -processor, but it's a lot shorter. Downside is that it puts more  -pressure on register allocation because it has fixed operands.  -  -Example:  -int abs(int x) {return x < 0 ? -x : x;}  -  -gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:  -abs:  -        movl    4(%esp), %eax  -        cltd  -        xorl    %edx, %eax  -        subl    %edx, %eax  -        ret  -  -//===---------------------------------------------------------------------===//  -  -Take the following code (from   -http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):  -  -extern unsigned char first_one[65536];  -int FirstOnet(unsigned long long arg1)  -{  -  if (arg1 >> 48)  -    return (first_one[arg1 >> 48]);  -  return 0;  -}  -  -  -The following code is currently generated:  -FirstOnet:  -        movl    8(%esp), %eax  -        cmpl    $65536, %eax  -        movl    4(%esp), %ecx  -        jb      .LBB1_2 # UnifiedReturnBlock  -.LBB1_1:        # ifthen  -        shrl    $16, %eax  -        movzbl  first_one(%eax), %eax  -        ret  -.LBB1_2:        # UnifiedReturnBlock  -        xorl    %eax, %eax  -        ret  -  -We could change the "movl 8(%esp), %eax" into "movzwl 10(%esp), %eax"; this  -lets us change the cmpl into a testl, which is shorter, and eliminate the shift.  
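For the cdq/cltd note above, the branch-free abs that gcc's sequence implements can be written directly in C (illustrative, made-up name; assumes the usual arithmetic right shift for signed values):

int abs_branchfree(int x) {
  int m = x >> 31;        /* cltd: m is 0 for non-negative x, -1 for negative */
  return (x ^ m) - m;     /* xorl %edx, %eax ; subl %edx, %eax */
}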
-  -//===---------------------------------------------------------------------===//  -  -We compile this function:  -  -define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext  %d) nounwind  {  -entry:  -	%tmp2 = icmp eq i8 %d, 0		; <i1> [#uses=1]  -	br i1 %tmp2, label %bb7, label %bb  -  -bb:		; preds = %entry  -	%tmp6 = add i32 %b, %a		; <i32> [#uses=1]  -	ret i32 %tmp6  -  -bb7:		; preds = %entry  -	%tmp10 = sub i32 %a, %c		; <i32> [#uses=1]  -	ret i32 %tmp10  -}  -  -to:  -  -foo:                                    # @foo  -# %bb.0:                                # %entry  -	movl	4(%esp), %ecx  -	cmpb	$0, 16(%esp)  -	je	.LBB0_2  -# %bb.1:                                # %bb  -	movl	8(%esp), %eax  -	addl	%ecx, %eax  -	ret  -.LBB0_2:                                # %bb7  -	movl	12(%esp), %edx  -	movl	%ecx, %eax  -	subl	%edx, %eax  -	ret  -  -There's an obviously unnecessary movl in .LBB0_2, and we could eliminate a  -couple more movls by putting 4(%esp) into %eax instead of %ecx.  -  -//===---------------------------------------------------------------------===//  -  -See rdar://4653682.  -  -From flops:  -  -LBB1_15:        # bb310  -        cvtss2sd        LCPI1_0, %xmm1  -        addsd   %xmm1, %xmm0  -        movsd   176(%esp), %xmm2  -        mulsd   %xmm0, %xmm2  -        movapd  %xmm2, %xmm3  -        mulsd   %xmm3, %xmm3  -        movapd  %xmm3, %xmm4  -        mulsd   LCPI1_23, %xmm4  -        addsd   LCPI1_24, %xmm4  -        mulsd   %xmm3, %xmm4  -        addsd   LCPI1_25, %xmm4  -        mulsd   %xmm3, %xmm4  -        addsd   LCPI1_26, %xmm4  -        mulsd   %xmm3, %xmm4  -        addsd   LCPI1_27, %xmm4  -        mulsd   %xmm3, %xmm4  -        addsd   LCPI1_28, %xmm4  -        mulsd   %xmm3, %xmm4  -        addsd   %xmm1, %xmm4  -        mulsd   %xmm2, %xmm4  -        movsd   152(%esp), %xmm1  -        addsd   %xmm4, %xmm1  -        movsd   %xmm1, 152(%esp)  -        incl    %eax  -        cmpl    %eax, %esi  -        jge     LBB1_15 # bb310  -LBB1_16:        # bb358.loopexit  -        movsd   152(%esp), %xmm0  -        addsd   %xmm0, %xmm0  -        addsd   LCPI1_22, %xmm0  -        movsd   %xmm0, 152(%esp)  -  -Rather than spilling the result of the last addsd in the loop, we should have  -insert a copy to split the interval (one for the duration of the loop, one  -extending to the fall through). The register pressure in the loop isn't high  -enough to warrant the spill.  -  -Also check why xmm7 is not used at all in the function.  
-  -//===---------------------------------------------------------------------===//  -  -Take the following:  -  -target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128-S128"  -target triple = "i386-apple-darwin8"  -@in_exit.4870.b = internal global i1 false		; <i1*> [#uses=2]  -define fastcc void @abort_gzip() noreturn nounwind  {  -entry:  -	%tmp.b.i = load i1* @in_exit.4870.b		; <i1> [#uses=1]  -	br i1 %tmp.b.i, label %bb.i, label %bb4.i  -bb.i:		; preds = %entry  -	tail call void @exit( i32 1 ) noreturn nounwind   -	unreachable  -bb4.i:		; preds = %entry  -	store i1 true, i1* @in_exit.4870.b  -	tail call void @exit( i32 1 ) noreturn nounwind   -	unreachable  -}  -declare void @exit(i32) noreturn nounwind   -  -This compiles into:  -_abort_gzip:                            ## @abort_gzip  -## %bb.0:                               ## %entry  -	subl	$12, %esp  -	movb	_in_exit.4870.b, %al  -	cmpb	$1, %al  -	jne	LBB0_2  -  -We somehow miss folding the movb into the cmpb.  -  -//===---------------------------------------------------------------------===//  -  -We compile:  -  -int test(int x, int y) {  -  return x-y-1;  -}  -  -into (-m64):  -  -_test:  -	decl	%edi  -	movl	%edi, %eax  -	subl	%esi, %eax  -	ret  -  -it would be better to codegen as: x+~y  (notl+addl)  -  -//===---------------------------------------------------------------------===//  -  -This code:  -  -int foo(const char *str,...)  -{  - __builtin_va_list a; int x;  - __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);  - return x;  -}  -  -gets compiled into this on x86-64:  -	subq    $200, %rsp  -        movaps  %xmm7, 160(%rsp)  -        movaps  %xmm6, 144(%rsp)  -        movaps  %xmm5, 128(%rsp)  -        movaps  %xmm4, 112(%rsp)  -        movaps  %xmm3, 96(%rsp)  -        movaps  %xmm2, 80(%rsp)  -        movaps  %xmm1, 64(%rsp)  -        movaps  %xmm0, 48(%rsp)  -        movq    %r9, 40(%rsp)  -        movq    %r8, 32(%rsp)  -        movq    %rcx, 24(%rsp)  -        movq    %rdx, 16(%rsp)  -        movq    %rsi, 8(%rsp)  -        leaq    (%rsp), %rax  -        movq    %rax, 192(%rsp)  -        leaq    208(%rsp), %rax  -        movq    %rax, 184(%rsp)  -        movl    $48, 180(%rsp)  -        movl    $8, 176(%rsp)  -        movl    176(%rsp), %eax  -        cmpl    $47, %eax  -        jbe     .LBB1_3 # bb  -.LBB1_1:        # bb3  -        movq    184(%rsp), %rcx  -        leaq    8(%rcx), %rax  -        movq    %rax, 184(%rsp)  -.LBB1_2:        # bb4  -        movl    (%rcx), %eax  -        addq    $200, %rsp  -        ret  -.LBB1_3:        # bb  -        movl    %eax, %ecx  -        addl    $8, %eax  -        addq    192(%rsp), %rcx  -        movl    %eax, 176(%rsp)  -        jmp     .LBB1_2 # bb4  -  -gcc 4.3 generates:  -	subq    $96, %rsp  -.LCFI0:  -        leaq    104(%rsp), %rax  -        movq    %rsi, -80(%rsp)  -        movl    $8, -120(%rsp)  -        movq    %rax, -112(%rsp)  -        leaq    -88(%rsp), %rax  -        movq    %rax, -104(%rsp)  -        movl    $8, %eax  -        cmpl    $48, %eax  -        jb      .L6  -        movq    -112(%rsp), %rdx  -        movl    (%rdx), %eax  -        addq    $96, %rsp  -        ret  -        .p2align 4,,10  -        .p2align 3  -.L6:  -        mov     %eax, %edx  -        addq    -104(%rsp), %rdx  -        addl    $8, %eax  -        movl    %eax, -120(%rsp)  -        movl    (%rdx), %eax  -        addq    $96, %rsp  -        ret  -  -and it gets compiled into 
this on x86:  -	pushl   %ebp  -        movl    %esp, %ebp  -        subl    $4, %esp  -        leal    12(%ebp), %eax  -        movl    %eax, -4(%ebp)  -        leal    16(%ebp), %eax  -        movl    %eax, -4(%ebp)  -        movl    12(%ebp), %eax  -        addl    $4, %esp  -        popl    %ebp  -        ret  -  -gcc 4.3 generates:  -	pushl   %ebp  -        movl    %esp, %ebp  -        movl    12(%ebp), %eax  -        popl    %ebp  -        ret  -  -//===---------------------------------------------------------------------===//  -  -Teach tblgen not to check bitconvert source type in some cases. This allows us  -to consolidate the following patterns in X86InstrMMX.td:  -  -def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),  -                                                  (iPTR 0))))),  -          (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;  -def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),  -                                                  (iPTR 0))))),  -          (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;  -def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),  -                                                  (iPTR 0))))),  -          (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;  -  -There are other cases in various td files.  -  -//===---------------------------------------------------------------------===//  -  -Take something like the following on x86-32:  -unsigned a(unsigned long long x, unsigned y) {return x % y;}  -  -We currently generate a libcall, but we really shouldn't: the expansion is  -shorter and likely faster than the libcall.  The expected code is something  -like the following:  -  -	movl	12(%ebp), %eax  -	movl	16(%ebp), %ecx  -	xorl	%edx, %edx  -	divl	%ecx  -	movl	8(%ebp), %eax  -	divl	%ecx  -	movl	%edx, %eax  -	ret  -  -A similar code sequence works for division.  -  -//===---------------------------------------------------------------------===//  -  -We currently compile this:  -  -define i32 @func1(i32 %v1, i32 %v2) nounwind {  -entry:  -  %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)  -  %sum = extractvalue {i32, i1} %t, 0  -  %obit = extractvalue {i32, i1} %t, 1  -  br i1 %obit, label %overflow, label %normal  -normal:  -  ret i32 %sum  -overflow:  -  call void @llvm.trap()  -  unreachable  -}  -declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)  -declare void @llvm.trap()  -  -to:  -  -_func1:  -	movl	4(%esp), %eax  -	addl	8(%esp), %eax  -	jo	LBB1_2	## overflow  -LBB1_1:	## normal  -	ret  -LBB1_2:	## overflow  -	ud2  -  -it would be nice to produce "into" someday.  -  -//===---------------------------------------------------------------------===//  -  -Test instructions can be eliminated by using EFLAGS values from arithmetic  -instructions. This is currently not done for mul, and, or, xor, neg, shl,  -sra, srl, shld, shrd, atomic ops, and others. It is also currently not done  -for read-modify-write instructions. It is also current not done if the  -OF or CF flags are needed.  -  -The shift operators have the complication that when the shift count is  -zero, EFLAGS is not set, so they can only subsume a test instruction if  -the shift count is known to be non-zero. Also, using the EFLAGS value  -from a shift is apparently very slow on some x86 implementations.  -  -In read-modify-write instructions, the root node in the isel match is  -the store, and isel has no way for the use of the EFLAGS result of the  -arithmetic to be remapped to the new node.  
-  -Add and subtract instructions set OF on signed overflow and CF on unsiged  -overflow, while test instructions always clear OF and CF. In order to  -replace a test with an add or subtract in a situation where OF or CF is  -needed, codegen must be able to prove that the operation cannot see  -signed or unsigned overflow, respectively.  -  -//===---------------------------------------------------------------------===//  -  -memcpy/memmove do not lower to SSE copies when possible.  A silly example is:  -define <16 x float> @foo(<16 x float> %A) nounwind {  -	%tmp = alloca <16 x float>, align 16  -	%tmp2 = alloca <16 x float>, align 16  -	store <16 x float> %A, <16 x float>* %tmp  -	%s = bitcast <16 x float>* %tmp to i8*  -	%s2 = bitcast <16 x float>* %tmp2 to i8*  -	call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)  -	%R = load <16 x float>* %tmp2  -	ret <16 x float> %R  -}  -  -declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind  -  -which compiles to:  -  -_foo:  -	subl	$140, %esp  -	movaps	%xmm3, 112(%esp)  -	movaps	%xmm2, 96(%esp)  -	movaps	%xmm1, 80(%esp)  -	movaps	%xmm0, 64(%esp)  -	movl	60(%esp), %eax  -	movl	%eax, 124(%esp)  -	movl	56(%esp), %eax  -	movl	%eax, 120(%esp)  -	movl	52(%esp), %eax  -        <many many more 32-bit copies>  -      	movaps	(%esp), %xmm0  -	movaps	16(%esp), %xmm1  -	movaps	32(%esp), %xmm2  -	movaps	48(%esp), %xmm3  -	addl	$140, %esp  -	ret  -  -On Nehalem, it may even be cheaper to just use movups when unaligned than to  -fall back to lower-granularity chunks.  -  -//===---------------------------------------------------------------------===//  -  -Implement processor-specific optimizations for parity with GCC on these  -processors.  GCC does two optimizations:  -  -1. ix86_pad_returns inserts a noop before ret instructions if immediately  -   preceded by a conditional branch or is the target of a jump.  -2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of  -   code contains more than 3 branches.  -     -The first one is done for all AMDs, Core2, and "Generic"  -The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,  -  Core 2, and "Generic"  -  -//===---------------------------------------------------------------------===//  -Testcase:  -int x(int a) { return (a&0xf0)>>4; }  -  -Current output:  -	movl	4(%esp), %eax  -	shrl	$4, %eax  -	andl	$15, %eax  -	ret  -  -Ideal output:  -	movzbl	4(%esp), %eax  -	shrl	$4, %eax  -	ret  -  -//===---------------------------------------------------------------------===//  -  -Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch  -properly.  -  -When the return value is not used (i.e. only care about the value in the  -memory), x86 does not have to use add to implement these. Instead, it can use  -add, sub, inc, dec instructions with the "lock" prefix.  -  -This is currently implemented using a bit of instruction selection trick. The  -issue is the target independent pattern produces one output and a chain and we  -want to map it into one that just output a chain. The current trick is to select  -it into a MERGE_VALUES with the first definition being an implicit_def. The  -proper solution is to add new ISD opcodes for the no-output variant. DAG  -combiner can then transform the node before it gets to target node selection.  -  -Problem #2 is we are adding a whole bunch of x86 atomic instructions when in  -fact these instructions are identical to the non-lock versions. 
We need a way to  -add target specific information to target nodes and have this information  -carried over to machine instructions. Asm printer (or JIT) can use this  -information to add the "lock" prefix.  -  -//===---------------------------------------------------------------------===//  -  -struct B {  -  unsigned char y0 : 1;  -};  -  -int bar(struct B* a) { return a->y0; }  -  -define i32 @bar(%struct.B* nocapture %a) nounwind readonly optsize {  -  %1 = getelementptr inbounds %struct.B* %a, i64 0, i32 0  -  %2 = load i8* %1, align 1  -  %3 = and i8 %2, 1  -  %4 = zext i8 %3 to i32  -  ret i32 %4  -}  -  -bar:                                    # @bar  -# %bb.0:  -        movb    (%rdi), %al  -        andb    $1, %al  -        movzbl  %al, %eax  -        ret  -  -Missed optimization: should be movl+andl.  -  -//===---------------------------------------------------------------------===//  -  -The x86_64 abi says:  -  -Booleans, when stored in a memory object, are stored as single byte objects the  -value of which is always 0 (false) or 1 (true).  -  -We are not using this fact:  -  -int bar(_Bool *a) { return *a; }  -  -define i32 @bar(i8* nocapture %a) nounwind readonly optsize {  -  %1 = load i8* %a, align 1, !tbaa !0  -  %tmp = and i8 %1, 1  -  %2 = zext i8 %tmp to i32  -  ret i32 %2  -}  -  -bar:  -        movb    (%rdi), %al  -        andb    $1, %al  -        movzbl  %al, %eax  -        ret  -  -GCC produces  -  -bar:  -        movzbl  (%rdi), %eax  -        ret  -  -//===---------------------------------------------------------------------===//  -  -Take the following C code:  -int f(int a, int b) { return (unsigned char)a == (unsigned char)b; }  -  -We generate the following IR with clang:  -define i32 @f(i32 %a, i32 %b) nounwind readnone {  -entry:  -  %tmp = xor i32 %b, %a                           ; <i32> [#uses=1]  -  %tmp6 = and i32 %tmp, 255                       ; <i32> [#uses=1]  -  %cmp = icmp eq i32 %tmp6, 0                     ; <i1> [#uses=1]  -  %conv5 = zext i1 %cmp to i32                    ; <i32> [#uses=1]  -  ret i32 %conv5  -}  -  -And the following x86 code:  -	xorl	%esi, %edi  -	testb	$-1, %dil  -	sete	%al  -	movzbl	%al, %eax  -	ret  -  -A cmpb instead of the xorl+testb would be one instruction shorter.  -  -//===---------------------------------------------------------------------===//  -  -Given the following C code:  -int f(int a, int b) { return (signed char)a == (signed char)b; }  -  -We generate the following IR with clang:  -define i32 @f(i32 %a, i32 %b) nounwind readnone {  -entry:  -  %sext = shl i32 %a, 24                          ; <i32> [#uses=1]  -  %conv1 = ashr i32 %sext, 24                     ; <i32> [#uses=1]  -  %sext6 = shl i32 %b, 24                         ; <i32> [#uses=1]  -  %conv4 = ashr i32 %sext6, 24                    ; <i32> [#uses=1]  -  %cmp = icmp eq i32 %conv1, %conv4               ; <i1> [#uses=1]  -  %conv5 = zext i1 %cmp to i32                    ; <i32> [#uses=1]  -  ret i32 %conv5  -}  -  -And the following x86 code:  -	movsbl	%sil, %eax  -	movsbl	%dil, %ecx  -	cmpl	%eax, %ecx  -	sete	%al  -	movzbl	%al, %eax  -	ret  -  -  -It should be possible to eliminate the sign extensions.  
-  -//===---------------------------------------------------------------------===//  -  -LLVM misses a load+store narrowing opportunity in this code:  -  -%struct.bf = type { i64, i16, i16, i32 }  -  -@bfi = external global %struct.bf*                ; <%struct.bf**> [#uses=2]  -  -define void @t1() nounwind ssp {  -entry:  -  %0 = load %struct.bf** @bfi, align 8            ; <%struct.bf*> [#uses=1]  -  %1 = getelementptr %struct.bf* %0, i64 0, i32 1 ; <i16*> [#uses=1]  -  %2 = bitcast i16* %1 to i32*                    ; <i32*> [#uses=2]  -  %3 = load i32* %2, align 1                      ; <i32> [#uses=1]  -  %4 = and i32 %3, -65537                         ; <i32> [#uses=1]  -  store i32 %4, i32* %2, align 1  -  %5 = load %struct.bf** @bfi, align 8            ; <%struct.bf*> [#uses=1]  -  %6 = getelementptr %struct.bf* %5, i64 0, i32 1 ; <i16*> [#uses=1]  -  %7 = bitcast i16* %6 to i32*                    ; <i32*> [#uses=2]  -  %8 = load i32* %7, align 1                      ; <i32> [#uses=1]  -  %9 = and i32 %8, -131073                        ; <i32> [#uses=1]  -  store i32 %9, i32* %7, align 1  -  ret void  -}  -  -LLVM currently emits this:  -  -  movq  bfi(%rip), %rax  -  andl  $-65537, 8(%rax)  -  movq  bfi(%rip), %rax  -  andl  $-131073, 8(%rax)  -  ret  -  -It could narrow the loads and stores to emit this:  -  -  movq  bfi(%rip), %rax  -  andb  $-2, 10(%rax)  -  movq  bfi(%rip), %rax  -  andb  $-3, 10(%rax)  -  ret  -  -The trouble is that there is a TokenFactor between the store and the  -load, making it non-trivial to determine if there's anything between  -the load and the store which would prohibit narrowing.  -  -//===---------------------------------------------------------------------===//  -  -This code:  -void foo(unsigned x) {  -  if (x == 0) bar();  -  else if (x == 1) qux();  -}  -  -currently compiles into:  -_foo:  -	movl	4(%esp), %eax  -	cmpl	$1, %eax  -	je	LBB0_3  -	testl	%eax, %eax  -	jne	LBB0_4  -  -the testl could be removed:  -_foo:  -	movl	4(%esp), %eax  -	cmpl	$1, %eax  -	je	LBB0_3  -	jb	LBB0_4  -  -0 is the only unsigned number < 1.  -  -//===---------------------------------------------------------------------===//  -  -This code:  -  -%0 = type { i32, i1 }  -  -define i32 @add32carry(i32 %sum, i32 %x) nounwind readnone ssp {  -entry:  -  %uadd = tail call %0 @llvm.uadd.with.overflow.i32(i32 %sum, i32 %x)  -  %cmp = extractvalue %0 %uadd, 1  -  %inc = zext i1 %cmp to i32  -  %add = add i32 %x, %sum  -  %z.0 = add i32 %add, %inc  -  ret i32 %z.0  -}  -  -declare %0 @llvm.uadd.with.overflow.i32(i32, i32) nounwind readnone  -  -compiles to:  -  -_add32carry:                            ## @add32carry  -	addl	%esi, %edi  -	sbbl	%ecx, %ecx  -	movl	%edi, %eax  -	subl	%ecx, %eax  -	ret  -  -But it could be:  -  -_add32carry:  -	leal	(%rsi,%rdi), %eax  -	cmpl	%esi, %eax  -	adcl	$0, %eax  -	ret  -  -//===---------------------------------------------------------------------===//  -  -The hot loop of 256.bzip2 contains code that looks a bit like this:  -  -int foo(char *P, char *Q, int x, int y) {  -  if (P[0] != Q[0])  -     return P[0] < Q[0];  -  if (P[1] != Q[1])  -     return P[1] < Q[1];  -  if (P[2] != Q[2])  -     return P[2] < Q[2];  -   return P[3] < Q[3];  -}  -  -In the real code, we get a lot more wrong than this.  
However, even in this  -code we generate:  -  -_foo:                                   ## @foo  -## %bb.0:                               ## %entry  -	movb	(%rsi), %al  -	movb	(%rdi), %cl  -	cmpb	%al, %cl  -	je	LBB0_2  -LBB0_1:                                 ## %if.then  -	cmpb	%al, %cl  -	jmp	LBB0_5  -LBB0_2:                                 ## %if.end  -	movb	1(%rsi), %al  -	movb	1(%rdi), %cl  -	cmpb	%al, %cl  -	jne	LBB0_1  -## %bb.3:                               ## %if.end38  -	movb	2(%rsi), %al  -	movb	2(%rdi), %cl  -	cmpb	%al, %cl  -	jne	LBB0_1  -## %bb.4:                               ## %if.end60  -	movb	3(%rdi), %al  -	cmpb	3(%rsi), %al  -LBB0_5:                                 ## %if.end60  -	setl	%al  -	movzbl	%al, %eax  -	ret  -  -Note that we generate jumps to LBB0_1 which does a redundant compare.  The  -redundant compare also forces the register values to be live, which prevents  -folding one of the loads into the compare.  In contrast, GCC 4.2 produces:  -  -_foo:  -	movzbl	(%rsi), %eax  -	cmpb	%al, (%rdi)  -	jne	L10  -L12:  -	movzbl	1(%rsi), %eax  -	cmpb	%al, 1(%rdi)  -	jne	L10  -	movzbl	2(%rsi), %eax  -	cmpb	%al, 2(%rdi)  -	jne	L10  -	movzbl	3(%rdi), %eax  -	cmpb	3(%rsi), %al  -L10:  -	setl	%al  -	movzbl	%al, %eax  -	ret  -  -which is "perfect".  -  -//===---------------------------------------------------------------------===//  -  -For the branch in the following code:  -int a();  -int b(int x, int y) {  -  if (x & (1<<(y&7)))  -    return a();  -  return y;  -}  -  -We currently generate:  -	movb	%sil, %al  -	andb	$7, %al  -	movzbl	%al, %eax  -	btl	%eax, %edi  -	jae	.LBB0_2  -  -movl+andl would be shorter than the movb+andb+movzbl sequence.  -  -//===---------------------------------------------------------------------===//  -  -For the following:  -struct u1 {  -    float x, y;  -};  -float foo(struct u1 u) {  -    return u.x + u.y;  -}  -  -We currently generate:  -	movdqa	%xmm0, %xmm1  -	pshufd	$1, %xmm0, %xmm0        # xmm0 = xmm0[1,0,0,0]  -	addss	%xmm1, %xmm0  -	ret  -  -We could save an instruction here by commuting the addss.  -  -//===---------------------------------------------------------------------===//  -  -This (from PR9661):  -  -float clamp_float(float a) {  -        if (a > 1.0f)  -                return 1.0f;  -        else if (a < 0.0f)  -                return 0.0f;  -        else  -                return a;  -}  -  -Could compile to:  -  -clamp_float:                            # @clamp_float  -        movss   .LCPI0_0(%rip), %xmm1  -        minss   %xmm1, %xmm0  -        pxor    %xmm1, %xmm1  -        maxss   %xmm1, %xmm0  -        ret  -  -with -ffast-math.  -  -//===---------------------------------------------------------------------===//  -  -This function (from PR9803):  -  -int clamp2(int a) {  -        if (a > 5)  -                a = 5;  -        if (a < 0)   -                return 0;  -        return a;  -}  -  -Compiles to:  -  -_clamp2:                                ## @clamp2  -        pushq   %rbp  -        movq    %rsp, %rbp  -        cmpl    $5, %edi  -        movl    $5, %ecx  -        cmovlel %edi, %ecx  -        testl   %ecx, %ecx  -        movl    $0, %eax  -        cmovnsl %ecx, %eax  -        popq    %rbp  -        ret  -  -The move of 0 could be scheduled above the test to make it is xor reg,reg.  -  -//===---------------------------------------------------------------------===//  -  -GCC PR48986.  
We currently compile this:  -  -void bar(void);  -void yyy(int* p) {  -    if (__sync_fetch_and_add(p, -1) == 1)  -      bar();  -}  -  -into:  -	movl	$-1, %eax  -	lock  -	xaddl	%eax, (%rdi)  -	cmpl	$1, %eax  -	je	LBB0_2  -  -Instead we could generate:  -  -	lock  -	dec %rdi  -	je LBB0_2  -  -The trick is to match "fetch_and_add(X, -C) == C".  -  -//===---------------------------------------------------------------------===//  -  -unsigned t(unsigned a, unsigned b) {  -  return a <= b ? 5 : -5;  -}  -  -We generate:  -	movl	$5, %ecx  -	cmpl	%esi, %edi  -	movl	$-5, %eax  -	cmovbel	%ecx, %eax  -  -GCC:  -	cmpl	%edi, %esi  -	sbbl	%eax, %eax  -	andl	$-10, %eax  -	addl	$5, %eax  -  -//===---------------------------------------------------------------------===//  +//===---------------------------------------------------------------------===// +// Random ideas for the X86 backend. +//===---------------------------------------------------------------------===// + +Improvements to the multiply -> shift/add algorithm: +http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html + +//===---------------------------------------------------------------------===// + +Improve code like this (occurs fairly frequently, e.g. in LLVM): +long long foo(int x) { return 1LL << x; } + +http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html +http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html +http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html + +Another useful one would be  ~0ULL >> X and ~0ULL << X. + +One better solution for 1LL << x is: +        xorl    %eax, %eax +        xorl    %edx, %edx +        testb   $32, %cl +        sete    %al +        setne   %dl +        sall    %cl, %eax +        sall    %cl, %edx + +But that requires good 8-bit subreg support. + +Also, this might be better.  It's an extra shift, but it's one instruction +shorter, and doesn't stress 8-bit subreg support. +(From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html, +but without the unnecessary and.) +        movl %ecx, %eax +        shrl $5, %eax +        movl %eax, %edx +        xorl $1, %edx +        sall %cl, %eax +        sall %cl. %edx + +64-bit shifts (in general) expand to really bad code.  Instead of using +cmovs, we should expand to a conditional branch like GCC produces. + +//===---------------------------------------------------------------------===// + +Some isel ideas: + +1. Dynamic programming based approach when compile time is not an +   issue. +2. Code duplication (addressing mode) during isel. +3. Other ideas from "Register-Sensitive Selection, Duplication, and +   Sequencing of Instructions". +4. Scheduling for reduced register pressure.  E.g. "Minimum Register +   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs" +   and other related papers. +   http://citeseer.ist.psu.edu/govindarajan01minimum.html + +//===---------------------------------------------------------------------===// + +Should we promote i16 to i32 to avoid partial register update stalls? + +//===---------------------------------------------------------------------===// + +Leave any_extend as pseudo instruction and hint to register +allocator. Delay codegen until post register allocation. +Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach +the coalescer how to deal with it though. + +//===---------------------------------------------------------------------===// + +It appears icc use push for parameter passing. Need to investigate. 
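Returning to the 1LL << x item near the top of this file: the sete/setne expansion sketched there corresponds to roughly this branchless C (a sketch only; it assumes the shift count is already in the 0..63 range, matching the hardware-masked cl, and the function name is made up):

unsigned long long shl64_branchless(unsigned x) {
  unsigned lo = (x & 32) ? 0u : 1u << (x & 31);  /* low word:  1 << x        when x < 32  */
  unsigned hi = (x & 32) ? 1u << (x & 31) : 0u;  /* high word: 1 << (x - 32) when x >= 32 */
  return ((unsigned long long)hi << 32) | lo;    /* sete/setne pick the word, sall shifts it */
}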
+ +//===---------------------------------------------------------------------===// + +The instruction selector sometimes misses folding a load into a compare.  The +pattern is written as (cmp reg, (load p)).  Because the compare isn't +commutative, it is not matched with the load on both sides.  The dag combiner +should be made smart enough to canonicalize the load into the RHS of a compare +when it can invert the result of the compare for free. + +//===---------------------------------------------------------------------===// + +In many cases, LLVM generates code like this: + +_test: +        movl 8(%esp), %eax +        cmpl %eax, 4(%esp) +        setl %al +        movzbl %al, %eax +        ret + +on some processors (which ones?), it is more efficient to do this: + +_test: +        movl 8(%esp), %ebx +        xor  %eax, %eax +        cmpl %ebx, 4(%esp) +        setl %al +        ret + +Doing this correctly is tricky though, as the xor clobbers the flags. + +//===---------------------------------------------------------------------===// + +We should generate bts/btr/etc instructions on targets where they are cheap or +when codesize is important.  e.g., for: + +void setbit(int *target, int bit) { +    *target |= (1 << bit); +} +void clearbit(int *target, int bit) { +    *target &= ~(1 << bit); +} + +//===---------------------------------------------------------------------===// + +Instead of the following for memset char*, 1, 10: + +	movl $16843009, 4(%edx) +	movl $16843009, (%edx) +	movw $257, 8(%edx) + +It might be better to generate + +	movl $16843009, %eax +	movl %eax, 4(%edx) +	movl %eax, (%edx) +	movw al, 8(%edx) +	 +when we can spare a register. It reduces code size. + +//===---------------------------------------------------------------------===// + +Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently +get this: + +define i32 @test1(i32 %X) { +    %Y = sdiv i32 %X, 8 +    ret i32 %Y +} + +_test1: +        movl 4(%esp), %eax +        movl %eax, %ecx +        sarl $31, %ecx +        shrl $29, %ecx +        addl %ecx, %eax +        sarl $3, %eax +        ret + +GCC knows several different ways to codegen it, one of which is this: + +_test1: +        movl    4(%esp), %eax +        cmpl    $-1, %eax +        leal    7(%eax), %ecx +        cmovle  %ecx, %eax +        sarl    $3, %eax +        ret + +which is probably slower, but it's interesting at least :) + +//===---------------------------------------------------------------------===// + +We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl +We should leave these as libcalls for everything over a much lower threshold, +since libc is hand tuned for medium and large mem ops (avoiding RFO for large +stores, TLB preheating, etc) + +//===---------------------------------------------------------------------===// + +Optimize this into something reasonable: + x * copysign(1.0, y) * copysign(1.0, z) + +//===---------------------------------------------------------------------===// + +Optimize copysign(x, *y) to use an integer load from y. + +//===---------------------------------------------------------------------===// + +The following tests perform worse with LSR: + +lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor. 
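On the copysign notes a couple of entries above: one sketch of what "something reasonable" could be for x * copysign(1.0, y) * copysign(1.0, z). The two multiplies can only flip x's sign bit, by sign(y) XOR sign(z), so the whole expression is a single sign-bit xor (assuming IEEE-754 doubles, ignoring NaN sign conventions; the helper name is made up):

#include <stdint.h>
#include <string.h>

double fold_copysign_product(double x, double y, double z) {
  uint64_t xi, yi, zi;
  memcpy(&xi, &x, sizeof xi);               /* type-pun through memcpy */
  memcpy(&yi, &y, sizeof yi);
  memcpy(&zi, &z, sizeof zi);
  xi ^= (yi ^ zi) & 0x8000000000000000ULL;  /* flip x's sign iff y and z differ in sign */
  memcpy(&x, &xi, sizeof xi);
  return x;
}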
+ +//===---------------------------------------------------------------------===// + +Adding to the list of cmp / test poor codegen issues: + +int test(__m128 *A, __m128 *B) { +  if (_mm_comige_ss(*A, *B)) +    return 3; +  else +    return 4; +} + +_test: +	movl 8(%esp), %eax +	movaps (%eax), %xmm0 +	movl 4(%esp), %eax +	movaps (%eax), %xmm1 +	comiss %xmm0, %xmm1 +	setae %al +	movzbl %al, %ecx +	movl $3, %eax +	movl $4, %edx +	cmpl $0, %ecx +	cmove %edx, %eax +	ret + +Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There +are a number of issues. 1) We are introducing a setcc between the result of the +intrisic call and select. 2) The intrinsic is expected to produce a i32 value +so a any extend (which becomes a zero extend) is added. + +We probably need some kind of target DAG combine hook to fix this. + +//===---------------------------------------------------------------------===// + +We generate significantly worse code for this than GCC: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150 +http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701 + +There is also one case we do worse on PPC. + +//===---------------------------------------------------------------------===// + +For this: + +int test(int a) +{ +  return a * 3; +} + +We currently emits +	imull $3, 4(%esp), %eax + +Perhaps this is what we really should generate is? Is imull three or four +cycles? Note: ICC generates this: +	movl	4(%esp), %eax +	leal	(%eax,%eax,2), %eax + +The current instruction priority is based on pattern complexity. The former is +more "complex" because it folds a load so the latter will not be emitted. + +Perhaps we should use AddedComplexity to give LEA32r a higher priority? We +should always try to match LEA first since the LEA matching code does some +estimate to determine whether the match is profitable. + +However, if we care more about code size, then imull is better. It's two bytes +shorter than movl + leal. + +On a Pentium M, both variants have the same characteristics with regard +to throughput; however, the multiplication has a latency of four cycles, as +opposed to two cycles for the movl+lea variant. + +//===---------------------------------------------------------------------===// + +It appears gcc place string data with linkonce linkage in +.section __TEXT,__const_coal,coalesced instead of +.section __DATA,__const_coal,coalesced. +Take a look at darwin.h, there are other Darwin assembler directives that we +do not make use of. 
+ +//===---------------------------------------------------------------------===// + +define i32 @foo(i32* %a, i32 %t) { +entry: +	br label %cond_true + +cond_true:		; preds = %cond_true, %entry +	%x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]		; <i32> [#uses=3] +	%t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]		; <i32> [#uses=1] +	%tmp2 = getelementptr i32* %a, i32 %x.0.0		; <i32*> [#uses=1] +	%tmp3 = load i32* %tmp2		; <i32> [#uses=1] +	%tmp5 = add i32 %t_addr.0.0, %x.0.0		; <i32> [#uses=1] +	%tmp7 = add i32 %tmp5, %tmp3		; <i32> [#uses=2] +	%tmp9 = add i32 %x.0.0, 1		; <i32> [#uses=2] +	%tmp = icmp sgt i32 %tmp9, 39		; <i1> [#uses=1] +	br i1 %tmp, label %bb12, label %cond_true + +bb12:		; preds = %cond_true +	ret i32 %tmp7 +} +is pessimized by -loop-reduce and -indvars + +//===---------------------------------------------------------------------===// + +u32 to float conversion improvement: + +float uint32_2_float( unsigned u ) { +  float fl = (int) (u & 0xffff); +  float fh = (int) (u >> 16); +  fh *= 0x1.0p16f; +  return fh + fl; +} + +00000000        subl    $0x04,%esp +00000003        movl    0x08(%esp,1),%eax +00000007        movl    %eax,%ecx +00000009        shrl    $0x10,%ecx +0000000c        cvtsi2ss        %ecx,%xmm0 +00000010        andl    $0x0000ffff,%eax +00000015        cvtsi2ss        %eax,%xmm1 +00000019        mulss   0x00000078,%xmm0 +00000021        addss   %xmm1,%xmm0 +00000025        movss   %xmm0,(%esp,1) +0000002a        flds    (%esp,1) +0000002d        addl    $0x04,%esp +00000030        ret + +//===---------------------------------------------------------------------===// + +When using fastcc abi, align stack slot of argument of type double on 8 byte +boundary to improve performance. + +//===---------------------------------------------------------------------===// + +GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting +simplifications for integer "x cmp y ? a : b". + +//===---------------------------------------------------------------------===// + +Consider the expansion of: + +define i32 @test3(i32 %X) { +        %tmp1 = urem i32 %X, 255 +        ret i32 %tmp1 +} + +Currently it compiles to: + +... +        movl $2155905153, %ecx +        movl 8(%esp), %esi +        movl %esi, %eax +        mull %ecx +... + +This could be "reassociated" into: + +        movl $2155905153, %eax +        movl 8(%esp), %ecx +        mull %ecx + +to avoid the copy.  In fact, the existing two-address stuff would do this +except that mul isn't a commutative 2-addr instruction.  I guess this has +to be done at isel time based on the #uses to mul? + +//===---------------------------------------------------------------------===// + +Make sure the instruction which starts a loop does not cross a cacheline +boundary. This requires knowning the exact length of each machine instruction. +That is somewhat complicated, but doable. Example 256.bzip2: + +In the new trace, the hot loop has an instruction which crosses a cacheline +boundary.  In addition to potential cache misses, this can't help decoding as I +imagine there has to be some kind of complicated decoder reset and realignment +to grab the bytes from the next cacheline. 
+ +532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines +942  942 0x3d03 movl     %dh, (1809(%esp, %esi) +937  937 0x3d0a incl     %esi +3    3   0x3d0b cmpb     %bl, %dl +27   27  0x3d0d jnz      0x000062db <main+11707> + +//===---------------------------------------------------------------------===// + +In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE. + +//===---------------------------------------------------------------------===// + +This could be a single 16-bit load. + +int f(char *p) { +    if ((p[0] == 1) & (p[1] == 2)) return 1; +    return 0; +} + +//===---------------------------------------------------------------------===// + +We should inline lrintf and probably other libc functions. + +//===---------------------------------------------------------------------===// + +This code: + +void test(int X) { +  if (X) abort(); +} + +is currently compiled to: + +_test: +        subl $12, %esp +        cmpl $0, 16(%esp) +        jne LBB1_1 +        addl $12, %esp +        ret +LBB1_1: +        call L_abort$stub + +It would be better to produce: + +_test: +        subl $12, %esp +        cmpl $0, 16(%esp) +        jne L_abort$stub +        addl $12, %esp +        ret + +This can be applied to any no-return function call that takes no arguments etc. +Alternatively, the stack save/restore logic could be shrink-wrapped, producing +something like this: + +_test: +        cmpl $0, 4(%esp) +        jne LBB1_1 +        ret +LBB1_1: +        subl $12, %esp +        call L_abort$stub + +Both are useful in different situations.  Finally, it could be shrink-wrapped +and tail called, like this: + +_test: +        cmpl $0, 4(%esp) +        jne LBB1_1 +        ret +LBB1_1: +        pop %eax   # realign stack. +        call L_abort$stub + +Though this probably isn't worth it. + +//===---------------------------------------------------------------------===// + +Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with +a neg instead of a sub instruction.  Consider: + +int test(char X) { return 7-X; } + +we currently produce: +_test: +        movl $7, %eax +        movsbl 4(%esp), %ecx +        subl %ecx, %eax +        ret + +We would use one fewer register if codegen'd as: + +        movsbl 4(%esp), %eax +	neg %eax +        add $7, %eax +        ret + +Note that this isn't beneficial if the load can be folded into the sub.  In +this case, we want a sub: + +int test(int X) { return 7-X; } +_test: +        movl $7, %eax +        subl 4(%esp), %eax +        ret + +//===---------------------------------------------------------------------===// + +Leaf functions that require one 4-byte spill slot have a prolog like this: + +_foo: +        pushl   %esi +        subl    $4, %esp +... +and an epilog like this: +        addl    $4, %esp +        popl    %esi +        ret + +It would be smaller, and potentially faster, to push eax on entry and to +pop into a dummy register instead of using addl/subl of esp.  Just don't pop  +into any return registers :) + +//===---------------------------------------------------------------------===// + +The X86 backend should fold (branch (or (setcc, setcc))) into multiple  +branches.  We generate really poor code for: + +double testf(double a) { +       return a == 0.0 ? 0.0 : (a > 0.0 ? 
1.0 : -1.0); +} + +For example, the entry BB is: + +_testf: +        subl    $20, %esp +        pxor    %xmm0, %xmm0 +        movsd   24(%esp), %xmm1 +        ucomisd %xmm0, %xmm1 +        setnp   %al +        sete    %cl +        testb   %cl, %al +        jne     LBB1_5  # UnifiedReturnBlock +LBB1_1: # cond_true + + +it would be better to replace the last four instructions with: + +	jp LBB1_1 +	je LBB1_5 +LBB1_1: + +We also codegen the inner ?: into a diamond: + +       cvtss2sd        LCPI1_0(%rip), %xmm2 +        cvtss2sd        LCPI1_1(%rip), %xmm3 +        ucomisd %xmm1, %xmm0 +        ja      LBB1_3  # cond_true +LBB1_2: # cond_true +        movapd  %xmm3, %xmm2 +LBB1_3: # cond_true +        movapd  %xmm2, %xmm0 +        ret + +We should sink the load into xmm3 into the LBB1_2 block.  This should +be pretty easy, and will nuke all the copies. + +//===---------------------------------------------------------------------===// + +This: +        #include <algorithm> +        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b) +        { return std::make_pair(a + b, a + b < a); } +        bool no_overflow(unsigned a, unsigned b) +        { return !full_add(a, b).second; } + +Should compile to: +	addl	%esi, %edi +	setae	%al +	movzbl	%al, %eax +	ret + +on x86-64, instead of the rather stupid-looking: +	addl	%esi, %edi +	setb	%al +	xorb	$1, %al +	movzbl	%al, %eax +	ret + + +//===---------------------------------------------------------------------===// + +The following code: + +bb114.preheader:		; preds = %cond_next94 +	%tmp231232 = sext i16 %tmp62 to i32		; <i32> [#uses=1] +	%tmp233 = sub i32 32, %tmp231232		; <i32> [#uses=1] +	%tmp245246 = sext i16 %tmp65 to i32		; <i32> [#uses=1] +	%tmp252253 = sext i16 %tmp68 to i32		; <i32> [#uses=1] +	%tmp254 = sub i32 32, %tmp252253		; <i32> [#uses=1] +	%tmp553554 = bitcast i16* %tmp37 to i8*		; <i8*> [#uses=2] +	%tmp583584 = sext i16 %tmp98 to i32		; <i32> [#uses=1] +	%tmp585 = sub i32 32, %tmp583584		; <i32> [#uses=1] +	%tmp614615 = sext i16 %tmp101 to i32		; <i32> [#uses=1] +	%tmp621622 = sext i16 %tmp104 to i32		; <i32> [#uses=1] +	%tmp623 = sub i32 32, %tmp621622		; <i32> [#uses=1] +	br label %bb114 + +produces: + +LBB3_5:	# bb114.preheader +	movswl	-68(%ebp), %eax +	movl	$32, %ecx +	movl	%ecx, -80(%ebp) +	subl	%eax, -80(%ebp) +	movswl	-52(%ebp), %eax +	movl	%ecx, -84(%ebp) +	subl	%eax, -84(%ebp) +	movswl	-70(%ebp), %eax +	movl	%ecx, -88(%ebp) +	subl	%eax, -88(%ebp) +	movswl	-50(%ebp), %eax +	subl	%eax, %ecx +	movl	%ecx, -76(%ebp) +	movswl	-42(%ebp), %eax +	movl	%eax, -92(%ebp) +	movswl	-66(%ebp), %eax +	movl	%eax, -96(%ebp) +	movw	$0, -98(%ebp) + +This appears to be bad because the RA is not folding the store to the stack  +slot into the movl.  The above instructions could be: +	movl    $32, -80(%ebp) +... +	movl    $32, -84(%ebp) +... +This seems like a cross between remat and spill folding. + +This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't +change, so we could simply subtract %eax from %ecx first and then use %ecx (or +vice-versa). + +//===---------------------------------------------------------------------===// + +This code: + +	%tmp659 = icmp slt i16 %tmp654, 0		; <i1> [#uses=1] +	br i1 %tmp659, label %cond_true662, label %cond_next715 + +produces this: + +	testw	%cx, %cx +	movswl	%cx, %esi +	jns	LBB4_109	# cond_next715 + +Shark tells us that using %cx in the testw instruction is sub-optimal. It +suggests using the 32-bit register (which is what ICC uses). 
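Back on the no_overflow example above: the reason addl+setae is the expected code is that !(a + b < a) is simply "the unsigned add does not wrap", i.e. a + b >= a, which is exactly the carry-clear condition setae tests. In C terms (a sketch; the name is made up):

_Bool no_overflow_c(unsigned a, unsigned b) {
  return a + b >= a;   /* carry clear after the add => no overflow */
}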
+ +//===---------------------------------------------------------------------===// + +We compile this: + +void compare (long long foo) { +  if (foo < 4294967297LL) +    abort(); +} + +to: + +compare: +        subl    $4, %esp +        cmpl    $0, 8(%esp) +        setne   %al +        movzbw  %al, %ax +        cmpl    $1, 12(%esp) +        setg    %cl +        movzbw  %cl, %cx +        cmove   %ax, %cx +        testb   $1, %cl +        jne     .LBB1_2 # UnifiedReturnBlock +.LBB1_1:        # ifthen +        call    abort +.LBB1_2:        # UnifiedReturnBlock +        addl    $4, %esp +        ret + +(also really horrible code on ppc).  This is due to the expand code for 64-bit +compares.  GCC produces multiple branches, which is much nicer: + +compare: +        subl    $12, %esp +        movl    20(%esp), %edx +        movl    16(%esp), %eax +        decl    %edx +        jle     .L7 +.L5: +        addl    $12, %esp +        ret +        .p2align 4,,7 +.L7: +        jl      .L4 +        cmpl    $0, %eax +        .p2align 4,,8 +        ja      .L5 +.L4: +        .p2align 4,,9 +        call    abort + +//===---------------------------------------------------------------------===// + +Tail call optimization improvements: Tail call optimization currently +pushes all arguments on the top of the stack (their normal place for +non-tail call optimized calls) that source from the callers arguments +or  that source from a virtual register (also possibly sourcing from +callers arguments). +This is done to prevent overwriting of parameters (see example +below) that might be used later. + +example:   + +int callee(int32, int64);  +int caller(int32 arg1, int32 arg2) {  +  int64 local = arg2 * 2;  +  return callee(arg2, (int64)local);  +} + +[arg1]          [!arg2 no longer valid since we moved local onto it] +[arg2]      ->  [(int64) +[RETADDR]        local  ] + +Moving arg1 onto the stack slot of callee function would overwrite +arg2 of the caller. + +Possible optimizations: + + + - Analyse the actual parameters of the callee to see which would +   overwrite a caller parameter which is used by the callee and only +   push them onto the top of the stack. + +   int callee (int32 arg1, int32 arg2); +   int caller (int32 arg1, int32 arg2) { +       return callee(arg1,arg2); +   } + +   Here we don't need to write any variables to the top of the stack +   since they don't overwrite each other. + +   int callee (int32 arg1, int32 arg2); +   int caller (int32 arg1, int32 arg2) { +       return callee(arg2,arg1); +   } + +   Here we need to push the arguments because they overwrite each +   other. + +//===---------------------------------------------------------------------===// + +main () +{ +  int i = 0; +  unsigned long int z = 0; + +  do { +    z -= 0x00004000; +    i++; +    if (i > 0x00040000) +      abort (); +  } while (z > 0); +  exit (0); +} + +gcc compiles this to: + +_main: +	subl	$28, %esp +	xorl	%eax, %eax +	jmp	L2 +L3: +	cmpl	$262144, %eax +	je	L10 +L2: +	addl	$1, %eax +	cmpl	$262145, %eax +	jne	L3 +	call	L_abort$stub +L10: +	movl	$0, (%esp) +	call	L_exit$stub + +llvm: + +_main: +	subl	$12, %esp +	movl	$1, %eax +	movl	$16384, %ecx +LBB1_1:	# bb +	cmpl	$262145, %eax +	jge	LBB1_4	# cond_true +LBB1_2:	# cond_next +	incl	%eax +	addl	$4294950912, %ecx +	cmpl	$16384, %ecx +	jne	LBB1_1	# bb +LBB1_3:	# bb11 +	xorl	%eax, %eax +	addl	$12, %esp +	ret +LBB1_4:	# cond_true +	call	L_abort$stub + +1. LSR should rewrite the first cmp with induction variable %ecx. +2. 
DAG combiner should fold +        leal    1(%eax), %edx +        cmpl    $262145, %edx +   => +        cmpl    $262144, %eax + +//===---------------------------------------------------------------------===// + +define i64 @test(double %X) { +	%Y = fptosi double %X to i64 +	ret i64 %Y +} + +compiles to: + +_test: +	subl	$20, %esp +	movsd	24(%esp), %xmm0 +	movsd	%xmm0, 8(%esp) +	fldl	8(%esp) +	fisttpll	(%esp) +	movl	4(%esp), %edx +	movl	(%esp), %eax +	addl	$20, %esp +	#FP_REG_KILL +	ret + +This should just fldl directly from the input stack slot. + +//===---------------------------------------------------------------------===// + +This code: +int foo (int x) { return (x & 65535) | 255; } + +Should compile into: + +_foo: +        movzwl  4(%esp), %eax +        orl     $255, %eax +        ret + +instead of: +_foo: +	movl	$65280, %eax +	andl	4(%esp), %eax +	orl	$255, %eax +	ret + +//===---------------------------------------------------------------------===// + +We're codegen'ing multiply of long longs inefficiently: + +unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) { +  return arg1 *  arg2; +} + +We compile to (fomit-frame-pointer): + +_LLM: +	pushl	%esi +	movl	8(%esp), %ecx +	movl	16(%esp), %esi +	movl	%esi, %eax +	mull	%ecx +	imull	12(%esp), %esi +	addl	%edx, %esi +	imull	20(%esp), %ecx +	movl	%esi, %edx +	addl	%ecx, %edx +	popl	%esi +	ret + +This looks like a scheduling deficiency and lack of remat of the load from +the argument area.  ICC apparently produces: + +        movl      8(%esp), %ecx +        imull     12(%esp), %ecx +        movl      16(%esp), %eax +        imull     4(%esp), %eax  +        addl      %eax, %ecx   +        movl      4(%esp), %eax +        mull      12(%esp)  +        addl      %ecx, %edx +        ret + +Note that it remat'd loads from 4(esp) and 12(esp).  See this GCC PR: +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236 + +//===---------------------------------------------------------------------===// + +We can fold a store into "zeroing a reg".  Instead of: + +xorl    %eax, %eax +movl    %eax, 124(%esp) + +we should get: + +movl    $0, 124(%esp) + +if the flags of the xor are dead. + +Likewise, we isel "x<<1" into "add reg,reg".  If reg is spilled, this should +be folded into: shl [mem], 1 + +//===---------------------------------------------------------------------===// + +In SSE mode, we turn abs and neg into a load from the constant pool plus a xor +or and instruction, for example: + +	xorpd	LCPI1_0, %xmm2 + +However, if xmm2 gets spilled, we end up with really ugly code like this: + +	movsd	(%esp), %xmm0 +	xorpd	LCPI1_0, %xmm0 +	movsd	%xmm0, (%esp) + +Since we 'know' that this is a 'neg', we can actually "fold" the spill into +the neg/abs instruction, turning it into an *integer* operation, like this: + +	xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31) + +you could also use xorb, but xorl is less likely to lead to a partial register +stall.  Here is a contrived testcase: + +double a, b, c; +void test(double *P) { +  double X = *P; +  a = X; +  bar(); +  X = -X; +  b = X; +  bar(); +  c = X; +} + +//===---------------------------------------------------------------------===// + +The generated code on x86 for checking for signed overflow on a multiply the +obvious way is much longer than it needs to be. + +int x(int a, int b) { +  long long prod = (long long)a*b; +  return  prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1); +} + +See PR2053 for more details. 
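A hedged aside on the overflow check above (it relies on a GCC/Clang builtin, the function name is made up, and this is not a claim about what the backend emits): the same predicate can be phrased so it maps naturally onto imull plus a jo/seto:

#include <stdbool.h>

int overflows_smul(int a, int b) {
  int prod;
  bool ovf = __builtin_smul_overflow(a, b, &prod);  /* GCC/Clang builtin: true on signed overflow */
  return ovf;
}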
+ +//===---------------------------------------------------------------------===// + +We should investigate using cdq/ctld (effect: edx = sar eax, 31) +more aggressively; it should cost the same as a move+shift on any modern +processor, but it's a lot shorter. Downside is that it puts more +pressure on register allocation because it has fixed operands. + +Example: +int abs(int x) {return x < 0 ? -x : x;} + +gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.: +abs: +        movl    4(%esp), %eax +        cltd +        xorl    %edx, %eax +        subl    %edx, %eax +        ret + +//===---------------------------------------------------------------------===// + +Take the following code (from  +http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541): + +extern unsigned char first_one[65536]; +int FirstOnet(unsigned long long arg1) +{ +  if (arg1 >> 48) +    return (first_one[arg1 >> 48]); +  return 0; +} + + +The following code is currently generated: +FirstOnet: +        movl    8(%esp), %eax +        cmpl    $65536, %eax +        movl    4(%esp), %ecx +        jb      .LBB1_2 # UnifiedReturnBlock +.LBB1_1:        # ifthen +        shrl    $16, %eax +        movzbl  first_one(%eax), %eax +        ret +.LBB1_2:        # UnifiedReturnBlock +        xorl    %eax, %eax +        ret + +We could change the "movl 8(%esp), %eax" into "movzwl 10(%esp), %eax"; this +lets us change the cmpl into a testl, which is shorter, and eliminate the shift. + +//===---------------------------------------------------------------------===// + +We compile this function: + +define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext  %d) nounwind  { +entry: +	%tmp2 = icmp eq i8 %d, 0		; <i1> [#uses=1] +	br i1 %tmp2, label %bb7, label %bb + +bb:		; preds = %entry +	%tmp6 = add i32 %b, %a		; <i32> [#uses=1] +	ret i32 %tmp6 + +bb7:		; preds = %entry +	%tmp10 = sub i32 %a, %c		; <i32> [#uses=1] +	ret i32 %tmp10 +} + +to: + +foo:                                    # @foo +# %bb.0:                                # %entry +	movl	4(%esp), %ecx +	cmpb	$0, 16(%esp) +	je	.LBB0_2 +# %bb.1:                                # %bb +	movl	8(%esp), %eax +	addl	%ecx, %eax +	ret +.LBB0_2:                                # %bb7 +	movl	12(%esp), %edx +	movl	%ecx, %eax +	subl	%edx, %eax +	ret + +There's an obviously unnecessary movl in .LBB0_2, and we could eliminate a +couple more movls by putting 4(%esp) into %eax instead of %ecx. + +//===---------------------------------------------------------------------===// + +See rdar://4653682. 
+ +From flops: + +LBB1_15:        # bb310 +        cvtss2sd        LCPI1_0, %xmm1 +        addsd   %xmm1, %xmm0 +        movsd   176(%esp), %xmm2 +        mulsd   %xmm0, %xmm2 +        movapd  %xmm2, %xmm3 +        mulsd   %xmm3, %xmm3 +        movapd  %xmm3, %xmm4 +        mulsd   LCPI1_23, %xmm4 +        addsd   LCPI1_24, %xmm4 +        mulsd   %xmm3, %xmm4 +        addsd   LCPI1_25, %xmm4 +        mulsd   %xmm3, %xmm4 +        addsd   LCPI1_26, %xmm4 +        mulsd   %xmm3, %xmm4 +        addsd   LCPI1_27, %xmm4 +        mulsd   %xmm3, %xmm4 +        addsd   LCPI1_28, %xmm4 +        mulsd   %xmm3, %xmm4 +        addsd   %xmm1, %xmm4 +        mulsd   %xmm2, %xmm4 +        movsd   152(%esp), %xmm1 +        addsd   %xmm4, %xmm1 +        movsd   %xmm1, 152(%esp) +        incl    %eax +        cmpl    %eax, %esi +        jge     LBB1_15 # bb310 +LBB1_16:        # bb358.loopexit +        movsd   152(%esp), %xmm0 +        addsd   %xmm0, %xmm0 +        addsd   LCPI1_22, %xmm0 +        movsd   %xmm0, 152(%esp) + +Rather than spilling the result of the last addsd in the loop, we should have +insert a copy to split the interval (one for the duration of the loop, one +extending to the fall through). The register pressure in the loop isn't high +enough to warrant the spill. + +Also check why xmm7 is not used at all in the function. + +//===---------------------------------------------------------------------===// + +Take the following: + +target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128-S128" +target triple = "i386-apple-darwin8" +@in_exit.4870.b = internal global i1 false		; <i1*> [#uses=2] +define fastcc void @abort_gzip() noreturn nounwind  { +entry: +	%tmp.b.i = load i1* @in_exit.4870.b		; <i1> [#uses=1] +	br i1 %tmp.b.i, label %bb.i, label %bb4.i +bb.i:		; preds = %entry +	tail call void @exit( i32 1 ) noreturn nounwind  +	unreachable +bb4.i:		; preds = %entry +	store i1 true, i1* @in_exit.4870.b +	tail call void @exit( i32 1 ) noreturn nounwind  +	unreachable +} +declare void @exit(i32) noreturn nounwind  + +This compiles into: +_abort_gzip:                            ## @abort_gzip +## %bb.0:                               ## %entry +	subl	$12, %esp +	movb	_in_exit.4870.b, %al +	cmpb	$1, %al +	jne	LBB0_2 + +We somehow miss folding the movb into the cmpb. + +//===---------------------------------------------------------------------===// + +We compile: + +int test(int x, int y) { +  return x-y-1; +} + +into (-m64): + +_test: +	decl	%edi +	movl	%edi, %eax +	subl	%esi, %eax +	ret + +it would be better to codegen as: x+~y  (notl+addl) + +//===---------------------------------------------------------------------===// + +This code: + +int foo(const char *str,...) 
+{ + __builtin_va_list a; int x; + __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a); + return x; +} + +gets compiled into this on x86-64: +	subq    $200, %rsp +        movaps  %xmm7, 160(%rsp) +        movaps  %xmm6, 144(%rsp) +        movaps  %xmm5, 128(%rsp) +        movaps  %xmm4, 112(%rsp) +        movaps  %xmm3, 96(%rsp) +        movaps  %xmm2, 80(%rsp) +        movaps  %xmm1, 64(%rsp) +        movaps  %xmm0, 48(%rsp) +        movq    %r9, 40(%rsp) +        movq    %r8, 32(%rsp) +        movq    %rcx, 24(%rsp) +        movq    %rdx, 16(%rsp) +        movq    %rsi, 8(%rsp) +        leaq    (%rsp), %rax +        movq    %rax, 192(%rsp) +        leaq    208(%rsp), %rax +        movq    %rax, 184(%rsp) +        movl    $48, 180(%rsp) +        movl    $8, 176(%rsp) +        movl    176(%rsp), %eax +        cmpl    $47, %eax +        jbe     .LBB1_3 # bb +.LBB1_1:        # bb3 +        movq    184(%rsp), %rcx +        leaq    8(%rcx), %rax +        movq    %rax, 184(%rsp) +.LBB1_2:        # bb4 +        movl    (%rcx), %eax +        addq    $200, %rsp +        ret +.LBB1_3:        # bb +        movl    %eax, %ecx +        addl    $8, %eax +        addq    192(%rsp), %rcx +        movl    %eax, 176(%rsp) +        jmp     .LBB1_2 # bb4 + +gcc 4.3 generates: +	subq    $96, %rsp +.LCFI0: +        leaq    104(%rsp), %rax +        movq    %rsi, -80(%rsp) +        movl    $8, -120(%rsp) +        movq    %rax, -112(%rsp) +        leaq    -88(%rsp), %rax +        movq    %rax, -104(%rsp) +        movl    $8, %eax +        cmpl    $48, %eax +        jb      .L6 +        movq    -112(%rsp), %rdx +        movl    (%rdx), %eax +        addq    $96, %rsp +        ret +        .p2align 4,,10 +        .p2align 3 +.L6: +        mov     %eax, %edx +        addq    -104(%rsp), %rdx +        addl    $8, %eax +        movl    %eax, -120(%rsp) +        movl    (%rdx), %eax +        addq    $96, %rsp +        ret + +and it gets compiled into this on x86: +	pushl   %ebp +        movl    %esp, %ebp +        subl    $4, %esp +        leal    12(%ebp), %eax +        movl    %eax, -4(%ebp) +        leal    16(%ebp), %eax +        movl    %eax, -4(%ebp) +        movl    12(%ebp), %eax +        addl    $4, %esp +        popl    %ebp +        ret + +gcc 4.3 generates: +	pushl   %ebp +        movl    %esp, %ebp +        movl    12(%ebp), %eax +        popl    %ebp +        ret + +//===---------------------------------------------------------------------===// + +Teach tblgen not to check bitconvert source type in some cases. This allows us +to consolidate the following patterns in X86InstrMMX.td: + +def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src), +                                                  (iPTR 0))))), +          (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>; +def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src), +                                                  (iPTR 0))))), +          (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>; +def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src), +                                                  (iPTR 0))))), +          (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>; + +There are other cases in various td files. + +//===---------------------------------------------------------------------===// + +Take something like the following on x86-32: +unsigned a(unsigned long long x, unsigned y) {return x % y;} + +We currently generate a libcall, but we really shouldn't: the expansion is +shorter and likely faster than the libcall.  
The expected code is something +like the following: + +	movl	12(%ebp), %eax +	movl	16(%ebp), %ecx +	xorl	%edx, %edx +	divl	%ecx +	movl	8(%ebp), %eax +	divl	%ecx +	movl	%edx, %eax +	ret + +A similar code sequence works for division. + +//===---------------------------------------------------------------------===// + +We currently compile this: + +define i32 @func1(i32 %v1, i32 %v2) nounwind { +entry: +  %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2) +  %sum = extractvalue {i32, i1} %t, 0 +  %obit = extractvalue {i32, i1} %t, 1 +  br i1 %obit, label %overflow, label %normal +normal: +  ret i32 %sum +overflow: +  call void @llvm.trap() +  unreachable +} +declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32) +declare void @llvm.trap() + +to: + +_func1: +	movl	4(%esp), %eax +	addl	8(%esp), %eax +	jo	LBB1_2	## overflow +LBB1_1:	## normal +	ret +LBB1_2:	## overflow +	ud2 + +it would be nice to produce "into" someday. + +//===---------------------------------------------------------------------===// + +Test instructions can be eliminated by using EFLAGS values from arithmetic +instructions. This is currently not done for mul, and, or, xor, neg, shl, +sra, srl, shld, shrd, atomic ops, and others. It is also currently not done +for read-modify-write instructions. It is also current not done if the +OF or CF flags are needed. + +The shift operators have the complication that when the shift count is +zero, EFLAGS is not set, so they can only subsume a test instruction if +the shift count is known to be non-zero. Also, using the EFLAGS value +from a shift is apparently very slow on some x86 implementations. + +In read-modify-write instructions, the root node in the isel match is +the store, and isel has no way for the use of the EFLAGS result of the +arithmetic to be remapped to the new node. + +Add and subtract instructions set OF on signed overflow and CF on unsiged +overflow, while test instructions always clear OF and CF. In order to +replace a test with an add or subtract in a situation where OF or CF is +needed, codegen must be able to prove that the operation cannot see +signed or unsigned overflow, respectively. + +//===---------------------------------------------------------------------===// + +memcpy/memmove do not lower to SSE copies when possible.  A silly example is: +define <16 x float> @foo(<16 x float> %A) nounwind { +	%tmp = alloca <16 x float>, align 16 +	%tmp2 = alloca <16 x float>, align 16 +	store <16 x float> %A, <16 x float>* %tmp +	%s = bitcast <16 x float>* %tmp to i8* +	%s2 = bitcast <16 x float>* %tmp2 to i8* +	call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16) +	%R = load <16 x float>* %tmp2 +	ret <16 x float> %R +} + +declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind + +which compiles to: + +_foo: +	subl	$140, %esp +	movaps	%xmm3, 112(%esp) +	movaps	%xmm2, 96(%esp) +	movaps	%xmm1, 80(%esp) +	movaps	%xmm0, 64(%esp) +	movl	60(%esp), %eax +	movl	%eax, 124(%esp) +	movl	56(%esp), %eax +	movl	%eax, 120(%esp) +	movl	52(%esp), %eax +        <many many more 32-bit copies> +      	movaps	(%esp), %xmm0 +	movaps	16(%esp), %xmm1 +	movaps	32(%esp), %xmm2 +	movaps	48(%esp), %xmm3 +	addl	$140, %esp +	ret + +On Nehalem, it may even be cheaper to just use movups when unaligned than to +fall back to lower-granularity chunks. + +//===---------------------------------------------------------------------===// + +Implement processor-specific optimizations for parity with GCC on these +processors.  GCC does two optimizations: + +1. 
ix86_pad_returns inserts a noop before ret instructions if immediately +   preceded by a conditional branch or is the target of a jump. +2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of +   code contains more than 3 branches. +    +The first one is done for all AMDs, Core2, and "Generic" +The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona, +  Core 2, and "Generic" + +//===---------------------------------------------------------------------===// +Testcase: +int x(int a) { return (a&0xf0)>>4; } + +Current output: +	movl	4(%esp), %eax +	shrl	$4, %eax +	andl	$15, %eax +	ret + +Ideal output: +	movzbl	4(%esp), %eax +	shrl	$4, %eax +	ret + +//===---------------------------------------------------------------------===// + +Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch +properly. + +When the return value is not used (i.e. only care about the value in the +memory), x86 does not have to use add to implement these. Instead, it can use +add, sub, inc, dec instructions with the "lock" prefix. + +This is currently implemented using a bit of instruction selection trick. The +issue is the target independent pattern produces one output and a chain and we +want to map it into one that just output a chain. The current trick is to select +it into a MERGE_VALUES with the first definition being an implicit_def. The +proper solution is to add new ISD opcodes for the no-output variant. DAG +combiner can then transform the node before it gets to target node selection. + +Problem #2 is we are adding a whole bunch of x86 atomic instructions when in +fact these instructions are identical to the non-lock versions. We need a way to +add target specific information to target nodes and have this information +carried over to machine instructions. Asm printer (or JIT) can use this +information to add the "lock" prefix. + +//===---------------------------------------------------------------------===// + +struct B { +  unsigned char y0 : 1; +}; + +int bar(struct B* a) { return a->y0; } + +define i32 @bar(%struct.B* nocapture %a) nounwind readonly optsize { +  %1 = getelementptr inbounds %struct.B* %a, i64 0, i32 0 +  %2 = load i8* %1, align 1 +  %3 = and i8 %2, 1 +  %4 = zext i8 %3 to i32 +  ret i32 %4 +} + +bar:                                    # @bar +# %bb.0: +        movb    (%rdi), %al +        andb    $1, %al +        movzbl  %al, %eax +        ret + +Missed optimization: should be movl+andl. + +//===---------------------------------------------------------------------===// + +The x86_64 abi says: + +Booleans, when stored in a memory object, are stored as single byte objects the +value of which is always 0 (false) or 1 (true). 
+ +We are not using this fact: + +int bar(_Bool *a) { return *a; } + +define i32 @bar(i8* nocapture %a) nounwind readonly optsize { +  %1 = load i8* %a, align 1, !tbaa !0 +  %tmp = and i8 %1, 1 +  %2 = zext i8 %tmp to i32 +  ret i32 %2 +} + +bar: +        movb    (%rdi), %al +        andb    $1, %al +        movzbl  %al, %eax +        ret + +GCC produces + +bar: +        movzbl  (%rdi), %eax +        ret + +//===---------------------------------------------------------------------===// + +Take the following C code: +int f(int a, int b) { return (unsigned char)a == (unsigned char)b; } + +We generate the following IR with clang: +define i32 @f(i32 %a, i32 %b) nounwind readnone { +entry: +  %tmp = xor i32 %b, %a                           ; <i32> [#uses=1] +  %tmp6 = and i32 %tmp, 255                       ; <i32> [#uses=1] +  %cmp = icmp eq i32 %tmp6, 0                     ; <i1> [#uses=1] +  %conv5 = zext i1 %cmp to i32                    ; <i32> [#uses=1] +  ret i32 %conv5 +} + +And the following x86 code: +	xorl	%esi, %edi +	testb	$-1, %dil +	sete	%al +	movzbl	%al, %eax +	ret + +A cmpb instead of the xorl+testb would be one instruction shorter. + +//===---------------------------------------------------------------------===// + +Given the following C code: +int f(int a, int b) { return (signed char)a == (signed char)b; } + +We generate the following IR with clang: +define i32 @f(i32 %a, i32 %b) nounwind readnone { +entry: +  %sext = shl i32 %a, 24                          ; <i32> [#uses=1] +  %conv1 = ashr i32 %sext, 24                     ; <i32> [#uses=1] +  %sext6 = shl i32 %b, 24                         ; <i32> [#uses=1] +  %conv4 = ashr i32 %sext6, 24                    ; <i32> [#uses=1] +  %cmp = icmp eq i32 %conv1, %conv4               ; <i1> [#uses=1] +  %conv5 = zext i1 %cmp to i32                    ; <i32> [#uses=1] +  ret i32 %conv5 +} + +And the following x86 code: +	movsbl	%sil, %eax +	movsbl	%dil, %ecx +	cmpl	%eax, %ecx +	sete	%al +	movzbl	%al, %eax +	ret + + +It should be possible to eliminate the sign extensions. 
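A small note on why the sign extensions above are removable: sign extension is injective on the low byte, so the sign-extended values compare equal exactly when the low bytes do, which is the same test as in the unsigned char example earlier (a sketch; the name is made up):

int f_equivalent(int a, int b) {
  return (unsigned char)(a ^ b) == 0;   /* equal low bytes <=> equal signed chars */
}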
+ +//===---------------------------------------------------------------------===// + +LLVM misses a load+store narrowing opportunity in this code: + +%struct.bf = type { i64, i16, i16, i32 } + +@bfi = external global %struct.bf*                ; <%struct.bf**> [#uses=2] + +define void @t1() nounwind ssp { +entry: +  %0 = load %struct.bf** @bfi, align 8            ; <%struct.bf*> [#uses=1] +  %1 = getelementptr %struct.bf* %0, i64 0, i32 1 ; <i16*> [#uses=1] +  %2 = bitcast i16* %1 to i32*                    ; <i32*> [#uses=2] +  %3 = load i32* %2, align 1                      ; <i32> [#uses=1] +  %4 = and i32 %3, -65537                         ; <i32> [#uses=1] +  store i32 %4, i32* %2, align 1 +  %5 = load %struct.bf** @bfi, align 8            ; <%struct.bf*> [#uses=1] +  %6 = getelementptr %struct.bf* %5, i64 0, i32 1 ; <i16*> [#uses=1] +  %7 = bitcast i16* %6 to i32*                    ; <i32*> [#uses=2] +  %8 = load i32* %7, align 1                      ; <i32> [#uses=1] +  %9 = and i32 %8, -131073                        ; <i32> [#uses=1] +  store i32 %9, i32* %7, align 1 +  ret void +} + +LLVM currently emits this: + +  movq  bfi(%rip), %rax +  andl  $-65537, 8(%rax) +  movq  bfi(%rip), %rax +  andl  $-131073, 8(%rax) +  ret + +It could narrow the loads and stores to emit this: + +  movq  bfi(%rip), %rax +  andb  $-2, 10(%rax) +  movq  bfi(%rip), %rax +  andb  $-3, 10(%rax) +  ret + +The trouble is that there is a TokenFactor between the store and the +load, making it non-trivial to determine if there's anything between +the load and the store which would prohibit narrowing. + +//===---------------------------------------------------------------------===// + +This code: +void foo(unsigned x) { +  if (x == 0) bar(); +  else if (x == 1) qux(); +} + +currently compiles into: +_foo: +	movl	4(%esp), %eax +	cmpl	$1, %eax +	je	LBB0_3 +	testl	%eax, %eax +	jne	LBB0_4 + +the testl could be removed: +_foo: +	movl	4(%esp), %eax +	cmpl	$1, %eax +	je	LBB0_3 +	jb	LBB0_4 + +0 is the only unsigned number < 1. + +//===---------------------------------------------------------------------===// + +This code: + +%0 = type { i32, i1 } + +define i32 @add32carry(i32 %sum, i32 %x) nounwind readnone ssp { +entry: +  %uadd = tail call %0 @llvm.uadd.with.overflow.i32(i32 %sum, i32 %x) +  %cmp = extractvalue %0 %uadd, 1 +  %inc = zext i1 %cmp to i32 +  %add = add i32 %x, %sum +  %z.0 = add i32 %add, %inc +  ret i32 %z.0 +} + +declare %0 @llvm.uadd.with.overflow.i32(i32, i32) nounwind readnone + +compiles to: + +_add32carry:                            ## @add32carry +	addl	%esi, %edi +	sbbl	%ecx, %ecx +	movl	%edi, %eax +	subl	%ecx, %eax +	ret + +But it could be: + +_add32carry: +	leal	(%rsi,%rdi), %eax +	cmpl	%esi, %eax +	adcl	$0, %eax +	ret + +//===---------------------------------------------------------------------===// + +The hot loop of 256.bzip2 contains code that looks a bit like this: + +int foo(char *P, char *Q, int x, int y) { +  if (P[0] != Q[0]) +     return P[0] < Q[0]; +  if (P[1] != Q[1]) +     return P[1] < Q[1]; +  if (P[2] != Q[2]) +     return P[2] < Q[2]; +   return P[3] < Q[3]; +} + +In the real code, we get a lot more wrong than this.  
However, even in this +code we generate: + +_foo:                                   ## @foo +## %bb.0:                               ## %entry +	movb	(%rsi), %al +	movb	(%rdi), %cl +	cmpb	%al, %cl +	je	LBB0_2 +LBB0_1:                                 ## %if.then +	cmpb	%al, %cl +	jmp	LBB0_5 +LBB0_2:                                 ## %if.end +	movb	1(%rsi), %al +	movb	1(%rdi), %cl +	cmpb	%al, %cl +	jne	LBB0_1 +## %bb.3:                               ## %if.end38 +	movb	2(%rsi), %al +	movb	2(%rdi), %cl +	cmpb	%al, %cl +	jne	LBB0_1 +## %bb.4:                               ## %if.end60 +	movb	3(%rdi), %al +	cmpb	3(%rsi), %al +LBB0_5:                                 ## %if.end60 +	setl	%al +	movzbl	%al, %eax +	ret + +Note that we generate jumps to LBB0_1 which does a redundant compare.  The +redundant compare also forces the register values to be live, which prevents +folding one of the loads into the compare.  In contrast, GCC 4.2 produces: + +_foo: +	movzbl	(%rsi), %eax +	cmpb	%al, (%rdi) +	jne	L10 +L12: +	movzbl	1(%rsi), %eax +	cmpb	%al, 1(%rdi) +	jne	L10 +	movzbl	2(%rsi), %eax +	cmpb	%al, 2(%rdi) +	jne	L10 +	movzbl	3(%rdi), %eax +	cmpb	3(%rsi), %al +L10: +	setl	%al +	movzbl	%al, %eax +	ret + +which is "perfect". + +//===---------------------------------------------------------------------===// + +For the branch in the following code: +int a(); +int b(int x, int y) { +  if (x & (1<<(y&7))) +    return a(); +  return y; +} + +We currently generate: +	movb	%sil, %al +	andb	$7, %al +	movzbl	%al, %eax +	btl	%eax, %edi +	jae	.LBB0_2 + +movl+andl would be shorter than the movb+andb+movzbl sequence. + +//===---------------------------------------------------------------------===// + +For the following: +struct u1 { +    float x, y; +}; +float foo(struct u1 u) { +    return u.x + u.y; +} + +We currently generate: +	movdqa	%xmm0, %xmm1 +	pshufd	$1, %xmm0, %xmm0        # xmm0 = xmm0[1,0,0,0] +	addss	%xmm1, %xmm0 +	ret + +We could save an instruction here by commuting the addss. + +//===---------------------------------------------------------------------===// + +This (from PR9661): + +float clamp_float(float a) { +        if (a > 1.0f) +                return 1.0f; +        else if (a < 0.0f) +                return 0.0f; +        else +                return a; +} + +Could compile to: + +clamp_float:                            # @clamp_float +        movss   .LCPI0_0(%rip), %xmm1 +        minss   %xmm1, %xmm0 +        pxor    %xmm1, %xmm1 +        maxss   %xmm1, %xmm0 +        ret + +with -ffast-math. + +//===---------------------------------------------------------------------===// + +This function (from PR9803): + +int clamp2(int a) { +        if (a > 5) +                a = 5; +        if (a < 0)  +                return 0; +        return a; +} + +Compiles to: + +_clamp2:                                ## @clamp2 +        pushq   %rbp +        movq    %rsp, %rbp +        cmpl    $5, %edi +        movl    $5, %ecx +        cmovlel %edi, %ecx +        testl   %ecx, %ecx +        movl    $0, %eax +        cmovnsl %ecx, %eax +        popq    %rbp +        ret + +The move of 0 could be scheduled above the test to make it is xor reg,reg. + +//===---------------------------------------------------------------------===// + +GCC PR48986.  
We currently compile this: + +void bar(void); +void yyy(int* p) { +    if (__sync_fetch_and_add(p, -1) == 1) +      bar(); +} + +into: +	movl	$-1, %eax +	lock +	xaddl	%eax, (%rdi) +	cmpl	$1, %eax +	je	LBB0_2 + +Instead we could generate: + +	lock +	dec %rdi +	je LBB0_2 + +The trick is to match "fetch_and_add(X, -C) == C". + +//===---------------------------------------------------------------------===// + +unsigned t(unsigned a, unsigned b) { +  return a <= b ? 5 : -5; +} + +We generate: +	movl	$5, %ecx +	cmpl	%esi, %edi +	movl	$-5, %eax +	cmovbel	%ecx, %eax + +GCC: +	cmpl	%edi, %esi +	sbbl	%eax, %eax +	andl	$-10, %eax +	addl	$5, %eax + +//===---------------------------------------------------------------------===// diff --git a/contrib/libs/llvm12/lib/Target/X86/TargetInfo/.yandex_meta/licenses.list.txt b/contrib/libs/llvm12/lib/Target/X86/TargetInfo/.yandex_meta/licenses.list.txt index a4433625d4a..c62d353021c 100644 --- a/contrib/libs/llvm12/lib/Target/X86/TargetInfo/.yandex_meta/licenses.list.txt +++ b/contrib/libs/llvm12/lib/Target/X86/TargetInfo/.yandex_meta/licenses.list.txt @@ -1,7 +1,7 @@ -====================Apache-2.0 WITH LLVM-exception====================  -// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.  -// See https://llvm.org/LICENSE.txt for license information.  -  -  -====================Apache-2.0 WITH LLVM-exception====================  -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception  +====================Apache-2.0 WITH LLVM-exception==================== +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. + + +====================Apache-2.0 WITH LLVM-exception==================== +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception diff --git a/contrib/libs/llvm12/lib/Target/X86/TargetInfo/ya.make b/contrib/libs/llvm12/lib/Target/X86/TargetInfo/ya.make index 9048b1b3736..2f30db941ed 100644 --- a/contrib/libs/llvm12/lib/Target/X86/TargetInfo/ya.make +++ b/contrib/libs/llvm12/lib/Target/X86/TargetInfo/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/lib/Support diff --git a/contrib/libs/llvm12/lib/Target/X86/ya.make b/contrib/libs/llvm12/lib/Target/X86/ya.make index eda842b9550..1df03a55e7e 100644 --- a/contrib/libs/llvm12/lib/Target/X86/ya.make +++ b/contrib/libs/llvm12/lib/Target/X86/ya.make @@ -2,18 +2,18 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  +OWNER( +    orivej +    g:cpp-contrib +) + +LICENSE( +    Apache-2.0 WITH LLVM-exception AND +    NCSA +) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) -LICENSE(  -    Apache-2.0 WITH LLVM-exception AND  -    NCSA  -)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -   PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include diff --git a/contrib/libs/llvm12/lib/Target/ya.make b/contrib/libs/llvm12/lib/Target/ya.make index 7696dde9a3b..84014564299 100644 --- a/contrib/libs/llvm12/lib/Target/ya.make +++ b/contrib/libs/llvm12/lib/Target/ya.make @@ -2,15 +2,15 @@  LIBRARY() -OWNER(  -    orivej  -    g:cpp-contrib  -)  - -LICENSE(Apache-2.0 WITH LLVM-exception)  -  -LICENSE_TEXTS(.yandex_meta/licenses.list.txt)  -  +OWNER( +    
orivej +    g:cpp-contrib +) + +LICENSE(Apache-2.0 WITH LLVM-exception) + +LICENSE_TEXTS(.yandex_meta/licenses.list.txt) +  PEERDIR(      contrib/libs/llvm12      contrib/libs/llvm12/include @@ -20,9 +20,9 @@ PEERDIR(      contrib/libs/llvm12/lib/Support  ) -ADDINCL(  -    contrib/libs/llvm12/lib/Target  -)  +ADDINCL( +    contrib/libs/llvm12/lib/Target +)  NO_COMPILER_WARNINGS()  | 
