| Commit message (Collapse) | Author | Age | Files | Lines |
|
Technically _tzcnt* intrinsics are only available when the BMI
instruction set is present. However the instruction encoding
degrades to "rep bsf" on older processors.
Clang for Windows debatably restricts the _tzcnt* intrinsics behind
the __BMI__ architecture define, so check for its presence or
exclude the usage of these intrinsics when clang is present.
See also:
https://ffmpeg.org/pipermail/ffmpeg-devel/2015-November/183404.html
https://bugs.llvm.org/show_bug.cgi?id=30506
http://lists.llvm.org/pipermail/cfe-dev/2016-October/051034.html
Signed-off-by: Dale Curtis <[email protected]>
Reviewed-by: Matt Oliver <[email protected]>
Signed-off-by: Michael Niedermayer <[email protected]>
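The guard described in this message could look roughly like the sketch below. The name `ctz32_sketch` and the two fallback branches are assumptions made for illustration, not the code from the commit. The input must be nonzero, because `__builtin_ctz(0)` is undefined while tzcnt returns 32.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(__BMI__) || !defined(__clang__))
/* MSVC proper, or clang on Windows built with BMI enabled:
 * the _tzcnt_u32 intrinsic is available. */
#include <intrin.h>
#define CTZ32_IMPL(v) ((int)_tzcnt_u32(v))
#elif defined(__GNUC__) || defined(__clang__)
/* Assumed fallback: compiler builtin (undefined for v == 0). */
#define CTZ32_IMPL(v) __builtin_ctz(v)
#else
/* Assumed fallback: portable bit loop. */
static int ctz32_portable(uint32_t v)
{
    int n = 0;
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}
#define CTZ32_IMPL(v) ctz32_portable(v)
#endif

/* Hypothetical wrapper: count trailing zero bits of a nonzero value. */
static int ctz32_sketch(uint32_t v)
{
    assert(v != 0);
    return CTZ32_IMPL(v);
}
```

With this guard, clang on Windows without `__BMI__` never sees `_tzcnt_u32` and takes the builtin path instead.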
|
* commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2':
x86util: Port all macros to cpuflags
See d5f8a642f6eb1c6e305c41dabddd0fd36ffb3f77
Merged-by: James Almer <[email protected]>
|
Also do some small cosmetic changes: drop the pointless _MMX suffix from the
ABSD2 macro name, drop the pointless check for MMX support (we always assume
MMX is available in our SIMD code), and fix spelling.
|
None of them are specific to the YASM assembler.
|
Signed-off-by: James Almer <[email protected]>
|
* commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6':
asm: Consistently uppercase SECTION markers
Merged-by: James Almer <[email protected]>
|
When allocating stack space with an alignment requirement that is larger
than the current stack alignment we need to store a copy of the original
stack pointer in order to be able to restore it later.
If we choose to use another register for this purpose we should not pick
eax/rax since it can be overwritten as a return value.
Signed-off-by: Anton Khirnov <[email protected]>
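The address arithmetic behind this can be sketched in C as follows. `frame_sketch` and `alloc_aligned_stack_sketch` are hypothetical names, and plain integers stand in for the stack pointer. The point is that masking discards the low bits of the original pointer, so it cannot be recomputed afterwards and has to be kept somewhere that survives until the epilogue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a stack frame after aligned allocation. */
typedef struct {
    uintptr_t sp;       /* new, aligned stack pointer */
    uintptr_t saved_sp; /* copy of the original stack pointer */
} frame_sketch;

/* Reserve `size` bytes below `sp` and round down to `align`
 * (a power of two). The stack is assumed to grow downwards. */
static frame_sketch alloc_aligned_stack_sketch(uintptr_t sp, uintptr_t size,
                                               uintptr_t align)
{
    frame_sketch f;
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Keep the original value. In assembly this copy must live in a
     * register that is not eax/rax, or in a stack slot. */
    f.saved_sp = sp;
    /* The mask throws away the low bits, so sp is not recoverable
     * from f.sp alone. */
    f.sp = (sp - size) & ~(align - 1);
    return f;
}
```

Restoring is then a plain copy of `saved_sp` back into the stack pointer, which only works if nothing has overwritten the copy in the meantime.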
|
Improved version of VBROADCASTSS that works like the avx2 instruction.
Emulation of vpbroadcastd.
Horizontal sum HSUMPS that places the result in all elements.
Emulation of blendvps and pblendvb.
Signed-off-by: Ivan Kalvachev <[email protected]>
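As an illustration of how a variable blend can be emulated without the SSE4.1 instruction, here is a scalar, per-lane C sketch. It is not the macro from the commit: `blendv32_sketch` is a hypothetical name, and the xor/and/xor sequence is one common way to do the selection, assumed here for illustration. Like blendvps, it selects on the sign bit of each 32-bit mask lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each 32-bit lane: take b[i] if the sign bit of mask[i] is set,
 * otherwise keep a[i]. */
static void blendv32_sketch(uint32_t *dst, const uint32_t *a,
                            const uint32_t *b, const uint32_t *mask,
                            size_t n)
{
    assert(dst && a && b && mask);
    for (size_t i = 0; i < n; i++) {
        /* Broadcast the sign bit: all ones if set, else zero. */
        uint32_t m = 0u - (mask[i] >> 31);
        /* a ^ ((a ^ b) & m) yields b where m is all ones, a where it is 0. */
        dst[i] = a[i] ^ ((a[i] ^ b[i]) & m);
    }
}
```

Only the top bit of each mask lane matters, so a mask of 0x7FFFFFFF still selects the first source.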
|
Yasm:
src/libavfilter/x86/af_volume.asm:24: warning: Standard COFF does not support read-only data sections
src/libavfilter/x86/af_volume.asm:24: warning: Unrecognized qualifier `align'
Nasm:
src/libavfilter/x86/af_volume.asm:24: error: standard COFF does not support section alignment specification
src/libavutil/x86/x86inc.asm:92: ... from macro `SECTION_RODATA' defined here
Tested-by: Clément Bœsch <[email protected]>
Signed-off-by: James Almer <[email protected]>
|
None of them are specific to the YASM assembler.
(Cherry-picked from libav commit 39e208f4d4756367c7cd2d581847e0c1b8a429c1)
Signed-off-by: James Almer <[email protected]>
|
About 2x faster than the C version.
|
Simplifies writing assembly code that depends on available instructions.
LZCNT implies SSE2
BMI1 implies AVX+LZCNT
AVX2 implies BMI2
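One way to model such implications is to make each flag value include the bits of everything it implies, so that a single mask comparison answers "is this instruction set available?". The sketch below uses hypothetical names and bit positions. The three implications listed above come from the commit message; "AVX implies SSE2" and "BMI2 implies BMI1" are extra assumptions added to make the chain complete.

```c
#include <assert.h>

/* Hypothetical cumulative flag values: each one ORs in what it implies. */
enum {
    FLAG_SSE2  = 1 << 0,
    FLAG_AVX   = (1 << 1) | FLAG_SSE2,             /* assumption */
    FLAG_LZCNT = (1 << 2) | FLAG_SSE2,             /* LZCNT implies SSE2 */
    FLAG_BMI1  = (1 << 3) | FLAG_AVX | FLAG_LZCNT, /* BMI1 implies AVX+LZCNT */
    FLAG_BMI2  = (1 << 4) | FLAG_BMI1,             /* assumption */
    FLAG_AVX2  = (1 << 5) | FLAG_BMI2              /* AVX2 implies BMI2 */
};

/* True if every bit required by `wanted` is present in `active`. */
static int cpu_has_sketch(unsigned active, unsigned wanted)
{
    assert(wanted != 0);
    return (active & wanted) == wanted;
}
```

Code built for AVX2 can then use LZCNT without a separate check, while code built only for LZCNT does not get AVX.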
|
The use of rsp was pretty much hardcoded there and probably didn't work
otherwise with stack_size > 0.
|
Due to a peculiarity in the ModR/M addressing encoding, the r12 and r13
registers sometimes require an additional byte when used as a base register.
r14 and r15 don't have that issue, so prefer using them.
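The rule can be written down as a small C sketch. The function name is hypothetical; the encoding facts are that a base register whose low three bits are 100 (rsp, r12) always needs a SIB byte, and one whose low three bits are 101 (rbp, r13) cannot be encoded without a displacement, so a zero disp8 has to be emitted when the address has none.

```c
#include <assert.h>

/* Extra encoding bytes caused by the choice of base register alone.
 * reg: register number 0..15 (12 = r12, 13 = r13, 14 = r14, 15 = r15).
 * has_disp: nonzero if the address already carries a displacement. */
static int base_reg_extra_bytes_sketch(int reg, int has_disp)
{
    int low3;
    assert(reg >= 0 && reg <= 15);
    low3 = reg & 7;
    if (low3 == 4)              /* rsp/r12: SIB byte is mandatory */
        return 1;
    if (low3 == 5 && !has_disp) /* rbp/r13: needs an explicit disp8 of 0 */
        return 1;
    return 0;
}
```

For example, `[r12]` and `[r13]` each cost one byte more than `[r14]`, while `[r13+16]` costs the same as `[r14+16]`.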
|
There's no point in emitting a rep prefix before ret on modern CPUs.
|
We overload the `call` instruction with a macro, but it would misbehave when
the macro argument wasn't a valid identifier. Fix it by explicitly checking
if the argument is an identifier.
|
~20% faster than AVX.
Signed-off-by: James Almer <[email protected]>
|
* commit '99434f4df81b6801b2b535d5b9143305595784f6':
float_dsp: Have implementation match function pointer prototype
Merged-by: Clément Bœsch <[email protected]>
|
libavutil/x86/float_dsp_init.c(144) : warning C4028: formal parameter 1 different from declaration
libavutil/x86/float_dsp_init.c(144) : warning C4028: formal parameter 2 different from declaration
|
* commit '7911186ed616ae81dd8617d6d0e8b08c818db9d8':
emms: Give avpriv_emms_yasm() a more general name
Merged-by: James Almer <[email protected]>
|
* commit '6be7944ee2ec2f045e6eb9a93237e992c8b20ac4':
x86: Add missing colons after assembly labels
Merged-by: James Almer <[email protected]>
|
This fixes many warnings of the sort
warning: label alone on a line without a colon might be in error
|
are the same
Reviewed-by: Henrik Gramner <[email protected]>
Signed-off-by: James Almer <[email protected]>
|
* commit '07e1f99a1bb41d1a615676140eefc85cf69fa793':
x86util: Document SBUTTERFLY macro
Merged-by: Clément Bœsch <[email protected]>
|
Signed-off-by: Luca Barbato <[email protected]>
|
* commit 'd7bc52bf456deba0f32d9fe5c288ec441f1ebef5':
imgutils: add a function for copying image data from GPU mapped memory
Merged-by: Clément Bœsch <[email protected]>
|
See https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
|
x86-64 only
Yorkfield:
- sse2: ~2.17x (434 vs. 200 cycles)
Nehalem:
- sse2: ~2.94x (409 vs. 139 cycles)
Skylake:
- sse2: ~3.10x (370 vs. 119 cycles)
- avx: ~3.29x (370 vs. 112 cycles)
|
Originally committed to x264 in 1637239a by Henrik Gramner who has
agreed to re-license it as LGPL. Original commit message follows.
x86: Avoid some bypass delays and false dependencies
A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
between int and float domains, so try to avoid that if possible.
|
* commit '8e9cd81d291b1010c625b2766058aadf4affb537':
x86: cpu: Detect Conroe CPUs and their slow shuffle unit
Merged-by: James Almer <[email protected]>
|
* commit '7d7355aa92bb36ca0765c49a569a999bcb96f332':
x86: Add SSSE3_SLOW CPU flag and related convenience macros
Merged-by: James Almer <[email protected]>
|
Integration to Libav by Josh de Kock <[email protected]>.
Signed-off-by: Alexandra Hájková <[email protected]>
|
These warnings conflict with system macros on Solaris, producing
truckloads of warnings about macro redefinition.
|
Allows emulation to work when dst is equal to src2 as long as the
instruction is commutative, e.g. `addps m0, m1, m0`.
Signed-off-by: Anton Khirnov <[email protected]>
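The idea can be illustrated with a scalar C model of a register file. Everything here is hypothetical (the names, the use of single floats in place of vectors, the error return); it only shows why `dst == src2` is a problem for the usual "mov dst, src1; op dst, src2" expansion and how swapping the operands rescues the commutative case.

```c
#include <assert.h>

typedef float (*binop_sketch)(float, float);

static float add_sketch(float x, float y) { return x + y; }
static float sub_sketch(float x, float y) { return x - y; }

/* Emulate "op dst, src1, src2" with two-operand steps on regs[].
 * Returns 0 on success, -1 if the form cannot be emulated. */
static int emulate_3op_sketch(float *regs, int dst, int src1, int src2,
                              binop_sketch op, int commutative)
{
    assert(regs && op);
    if (dst == src2 && dst != src1) {
        /* "mov dst, src1" would overwrite src2 before it is read. */
        if (!commutative)
            return -1;
        /* Operands swapped: "op dst, src1". */
        regs[dst] = op(regs[dst], regs[src1]);
        return 0;
    }
    if (dst != src1)
        regs[dst] = regs[src1];            /* mov dst, src1 */
    regs[dst] = op(regs[dst], regs[src2]); /* op dst, src2 */
    return 0;
}
```

In this model `addps m0, m1, m0` succeeds through the swapped path, while a subtraction with the same operands is rejected.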
|
The yasm/nasm preprocessor only checks the first token, which means that
parameters such as `dword [rax]` are treated as identifiers, which is
generally not what we want.
Signed-off-by: Anton Khirnov <[email protected]>
|
Signed-off-by: Anton Khirnov <[email protected]>
|
Those instructions are not commutative since they only change the first
element in the vector and leave the rest unmodified.
Signed-off-by: Anton Khirnov <[email protected]>
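A small C model makes the asymmetry visible. `addss_sketch` is a hypothetical stand-in for a scalar instruction such as addss in its three-operand form: lane 0 is the sum, and the remaining lanes are taken from the first source only.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar add on 4-lane vectors: dst[0] = a[0] + b[0],
 * dst[1..3] are copied from a (the first source). */
static void addss_sketch(float dst[4], const float a[4], const float b[4])
{
    float lane0;
    assert(dst && a && b);
    lane0 = a[0] + b[0];
    for (size_t i = 1; i < 4; i++)
        dst[i] = a[i];
    dst[0] = lane0;
}
```

Swapping the sources gives the same lane 0 but different upper lanes, so the operand swap used for commutative instructions would change the result.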
|
When allocating stack space with an alignment requirement that is larger
than the current stack alignment we need to store a copy of the original
stack pointer in order to be able to restore it later.
If we choose to use another register for this purpose we should not pick
eax/rax since it can be overwritten as a return value.
|
Reviewed-by: Andreas Cadhalpun <[email protected]>
Signed-off-by: Michael Niedermayer <[email protected]>
|
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
|
See merge commit '39d6d3618d48625decaff7d9bdbb45b44ef2a805'.
|
* commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb':
cosmetics: Fix spelling mistakes
Merged-by: Clément Bœsch <[email protected]>
|
Signed-off-by: Diego Biurrun <[email protected]>
|