| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
Improved version of VBROADCASTSS that works like the AVX2 instruction.
Emulation of vpbroadcastd.
Horizontal sum HSUMPS that places the result in all elements.
Emulation of blendvps and pblendvb.
Signed-off-by: Ivan Kalvachev <[email protected]>
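As an illustration of what emulating a variable blend can look like, here is a minimal scalar sketch in C. It models one common fallback (XOR, AND, XOR) and is not taken from the macro itself; the names blend_bits and blendv4 are hypothetical. It assumes every mask element is either all-ones or all-zeros, whereas the real blendvps only looks at the sign bit of each element.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: returns b where mask bits are
 * set and a elsewhere, using only XOR and AND. */
static uint32_t blend_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    uint32_t diff = a ^ b;  /* bits in which a and b differ */
    diff &= mask;           /* keep the difference only where b is selected */
    return a ^ diff;        /* flipping those bits turns a into b there */
}

/* Applies the select to a 4-element vector; dst doubles as the first source,
 * mirroring a two-operand instruction. Assumes full-width masks. */
static void blendv4(uint32_t dst[4], const uint32_t src[4],
                    const uint32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = blend_bits(dst[i], src[i], mask[i]);
}
```

With dst = {1, 2, 3, 4}, src = {10, 20, 30, 40} and a mask that is all-ones in elements 1 and 3, dst becomes {1, 20, 3, 40}.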
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Yasm:
src/libavfilter/x86/af_volume.asm:24: warning: Standard COFF does not support read-only data sections
src/libavfilter/x86/af_volume.asm:24: warning: Unrecognized qualifier `align'
Nasm:
src/libavfilter/x86/af_volume.asm:24: error: standard COFF does not support section alignment specification
src/libavutil/x86/x86inc.asm:92: ... from macro `SECTION_RODATA' defined here
Tested-by: Clément Bœsch <[email protected]>
Signed-off-by: James Almer <[email protected]>
|
|
|
|
|
|
|
|
| |
None of them are specific to the YASM assembler.
(Cherry-picked from libav commit 39e208f4d4756367c7cd2d581847e0c1b8a429c1)
Signed-off-by: James Almer <[email protected]>
|
|
|
|
| |
About 2x faster than the C version.
|
|
|
|
|
|
|
|
| |
Simplifies writing assembly code that depends on available instructions.
LZCNT implies SSE2
BMI1 implies AVX+LZCNT
AVX2 implies BMI2
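The implications listed above can be pictured as flag values that already contain the bits of everything they imply. The following C sketch only models the three rules stated here; the BIT_* and FLAG_* names and bit positions are hypothetical, and any further implication chains are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical raw bits, for illustration only. */
enum {
    BIT_SSE2  = 1 << 0,
    BIT_AVX   = 1 << 1,
    BIT_LZCNT = 1 << 2,
    BIT_BMI1  = 1 << 3,
    BIT_BMI2  = 1 << 4,
    BIT_AVX2  = 1 << 5
};

/* Each flag carries the bits of the features it implies. */
enum {
    FLAG_SSE2  = BIT_SSE2,
    FLAG_AVX   = BIT_AVX,
    FLAG_LZCNT = BIT_LZCNT | FLAG_SSE2,            /* LZCNT implies SSE2      */
    FLAG_BMI1  = BIT_BMI1 | FLAG_AVX | FLAG_LZCNT, /* BMI1 implies AVX+LZCNT  */
    FLAG_BMI2  = BIT_BMI2,
    FLAG_AVX2  = BIT_AVX2 | FLAG_BMI2              /* AVX2 implies BMI2       */
};

/* True if every bit required by `wanted` is present in `flags`. */
static bool has_flag(int flags, int wanted)
{
    return (flags & wanted) == wanted;
}
```

A function declared for FLAG_BMI1 then passes a check for FLAG_SSE2 without listing it separately, which is what makes instruction-dependent code simpler to write.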
|
|
|
|
|
| |
The use of rsp was pretty much hardcoded there and probably didn't work
otherwise with stack_size > 0.
|
|
|
|
|
|
|
| |
Due to a peculiarity in the ModR/M addressing encoding, the r12 and r13
registers sometimes require an additional byte when used as a base register.
r14 and r15 don't have that issue, so prefer using them.
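The peculiarity comes from the low three bits of the base register number, which r12 shares with rsp and r13 shares with rbp. A rough C sketch of the rule follows; the function name is hypothetical and it only covers a plain base register with an optional displacement, not indexed addressing.

```c
#include <stdbool.h>

/* Register numbers follow the hardware encoding:
 * rsp = 4, rbp = 5, r12 = 12, r13 = 13, r14 = 14, r15 = 15.
 * Returns how many bytes the encoding needs beyond the usual ModR/M byte
 * (and beyond any displacement the instruction asked for). */
static int extra_addressing_bytes(int base_reg, bool has_displacement)
{
    int low3 = base_reg & 7;

    /* r/m = 100 means "a SIB byte follows", so rsp/r12 always need one. */
    if (low3 == 4)
        return 1;

    /* mod = 00 with r/m = 101 means disp32/RIP-relative, so [rbp] and [r13]
     * have to be encoded with an explicit zero disp8. */
    if (low3 == 5 && !has_displacement)
        return 1;

    return 0;
}
```

Under this model [r12] and [r12+8] each cost one extra byte, [r13] costs one but [r13+8] does not, and r14/r15 never do.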
|
|
|
|
| |
There's no point in emitting a rep prefix before ret on modern CPUs.
|
|
|
|
|
|
| |
We overload the `call` instruction with a macro, but it would misbehave when
the macro argument wasn't a valid identifier. Fix it by explicitly checking
if the argument is an identifier.
|
| |
|
|
|
|
|
|
| |
~20% faster than AVX.
Signed-off-by: James Almer <[email protected]>
|
| |
|
|\
| |
| |
| |
| |
| |
| | |
* commit '99434f4df81b6801b2b535d5b9143305595784f6':
float_dsp: Have implementation match function pointer prototype
Merged-by: Clément Bœsch <[email protected]>
|
| |
| |
| |
| |
| | |
libavutil/x86/float_dsp_init.c(144) : warning C4028: formal parameter 1 different from declaration
libavutil/x86/float_dsp_init.c(144) : warning C4028: formal parameter 2 different from declaration
|
|\|
| |
| |
| |
| |
| |
| | |
* commit '7911186ed616ae81dd8617d6d0e8b08c818db9d8':
emms: Give apriv_emms_yasm() a more general name
Merged-by: James Almer <[email protected]>
|
| | |
|
|\|
| |
| |
| |
| |
| |
| | |
* commit '6be7944ee2ec2f045e6eb9a93237e992c8b20ac4':
x86: Add missing colons after assembly labels
Merged-by: James Almer <[email protected]>
|
| |
| |
| |
| |
| | |
This fixes many warnings of the sort
warning: label alone on a line without a colon might be in error
|
| |
| |
| |
| |
| |
| |
| | |
are the same
Reviewed-by: Henrik Gramner <[email protected]>
Signed-off-by: James Almer <[email protected]>
|
|\|
| |
| |
| |
| |
| |
| | |
* commit '07e1f99a1bb41d1a615676140eefc85cf69fa793':
x86util: Document SBUTTERFLY macro
Merged-by: Clément Bœsch <[email protected]>
|
| |
| |
| |
| | |
Signed-off-by: Luca Barbato <[email protected]>
|
|\|
| |
| |
| |
| |
| |
| | |
* commit 'd7bc52bf456deba0f32d9fe5c288ec441f1ebef5':
imgutils: add a function for copying image data from GPU mapped memory
Merged-by: Clément Bœsch <[email protected]>
|
| |
| |
| |
| | |
See https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
x86-64 only
Yorkfield:
- sse2: ~2.17x (434 vs. 200 cycles)
Nehalem:
- sse2: ~2.94x (409 vs. 139 cycles)
Skylake:
- sse2: ~3.10x (370 vs. 119 cycles)
- avx: ~3.29x (370 vs. 112 cycles)
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Originally committed to x264 in 1637239a by Henrik Gramner who has
agreed to re-license it as LGPL. Original commit message follows.
x86: Avoid some bypass delays and false dependencies
A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
between int and float domains, so try to avoid that if possible.
|
| | |
|
|\|
| |
| |
| |
| |
| |
| | |
* commit '8e9cd81d291b1010c625b2766058aadf4affb537':
x86: cpu: Detect Conroe CPUs and their slow shuffle unit
Merged-by: James Almer <[email protected]>
|
| | |
|
|\|
| |
| |
| |
| |
| |
| | |
* commit '7d7355aa92bb36ca0765c49a569a999bcb96f332':
x86: Add SSSE3_SLOW CPU flag and related convenience macros
Merged-by: James Almer <[email protected]>
|
| | |
|
| |
| |
| |
| |
| |
| | |
Integration to Libav by Josh de Kock <[email protected]>.
Signed-off-by: Alexandra Hájková <[email protected]>
|
| |
| |
| |
| |
| | |
These macros conflict with system macros on Solaris, producing
truckloads of warnings about macro redefinition.
|
| |
| |
| |
| |
| |
| |
| | |
Allows emulation to work when dst is equal to src2 as long as the
instruction is commutative, e.g. `addps m0, m1, m0`.
Signed-off-by: Anton Khirnov <[email protected]>
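A small C model of the idea: a three-operand form is emulated with a move followed by a two-operand instruction, and the dst == src2 case is handled by swapping the sources when the operation is commutative. This is a sketch of the reasoning only, not the macro; the register file and all function names are hypothetical.

```c
#include <stdbool.h>

typedef float (*binop)(float, float);

static float add_f(float a, float b) { return a + b; }
static float sub_f(float a, float b) { return a - b; }

/* Two-operand form: regs[dst] = op(regs[dst], regs[src]). */
static void op2(float *regs, binop op, int dst, int src)
{
    regs[dst] = op(regs[dst], regs[src]);
}

/* Emulates "op dst, src1, src2" using only a move and the two-operand form.
 * Returns false when the emulation is not possible. */
static bool op3(float *regs, binop op, bool commutative,
                int dst, int src1, int src2)
{
    if (dst == src2 && dst != src1) {
        if (!commutative)
            return false;          /* "mov dst, src1" would destroy src2 */
        op2(regs, op, dst, src1);  /* dst op src1 equals src1 op dst */
        return true;
    }
    if (dst != src1)
        regs[dst] = regs[src1];    /* mov dst, src1 */
    op2(regs, op, dst, src2);
    return true;
}
```

With regs = {1, 2}, op3(regs, add_f, true, 0, 1, 0) models `addps m0, m1, m0` and leaves 3 in regs[0], while the same call with sub_f and commutative = false is rejected.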
|
| |
| |
| |
| |
| |
| |
| |
| | |
The yasm/nasm preprocessor only checks the first token, which means that
parameters such as `dword [rax]` are treated as identifiers, which is
generally not what we want.
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| | |
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| |
| |
| |
| | |
Those instructions are not commutative since they only change the first
element in the vector and leave the rest unmodified.
Signed-off-by: Anton Khirnov <[email protected]>
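To see why swapping the operands of a scalar instruction changes the result, consider this C model of an addss-style operation on a four-element vector. The vec4 type and the function name are hypothetical stand-ins.

```c
typedef struct { float v[4]; } vec4;

/* Models a scalar add: only element 0 is computed, elements 1..3 are
 * passed through from the first operand. */
static vec4 addss_model(vec4 dst, vec4 src)
{
    dst.v[0] += src.v[0];
    return dst;
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, addss_model(a, b) gives {11, 2, 3, 4} while addss_model(b, a) gives {11, 20, 30, 40}: the low element matches, but the upper elements come from whichever operand is first.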
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When allocating stack space with an alignment requirement that is larger
than the current stack alignment we need to store a copy of the original
stack pointer in order to be able to restore it later.
If we choose to use another register for this purpose we should not pick
eax/rax since it can be overwritten as a return value.
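The reason a copy is needed at all is that the amount of padding depends on the runtime value of the stack pointer, so it cannot be undone with a constant. A minimal C sketch of the arithmetic, with a hypothetical frame struct and function name:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uintptr_t sp;        /* new, aligned stack pointer */
    uintptr_t saved_sp;  /* copy of the original, needed to restore it */
} frame;

/* Reserves `size` bytes below `sp` and rounds down to `align`,
 * which must be a power of two. */
static frame alloc_aligned(uintptr_t sp, uintptr_t size, uintptr_t align)
{
    frame f;

    assert(align != 0 && (align & (align - 1)) == 0);
    f.saved_sp = sp;
    f.sp = (sp - size) & ~(align - 1);
    return f;
}
```

Starting from 0x1008 or from 0x1010 with size 0x20 and 32-byte alignment, both calls end up at 0xFE0, so the original value can only be recovered from the saved copy.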
|
| |
| |
| |
| |
| | |
Reviewed-by: Andreas Cadhalpun <[email protected]>
Signed-off-by: Michael Niedermayer <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
|
| |
| |
| |
| | |
See merge commit '39d6d3618d48625decaff7d9bdbb45b44ef2a805'.
|
|\|
| |
| |
| |
| |
| |
| | |
* commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb':
cosmetics: Fix spelling mistakes
Merged-by: Clément Bœsch <[email protected]>
|
| |
| |
| |
| | |
Signed-off-by: Diego Biurrun <[email protected]>
|
| |
| |
| |
| |
| |
| |
| | |
Needed to declare 32-byte long constants
Signed-off-by: James Almer <[email protected]>
Signed-off-by: Luca Barbato <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they can get confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.
Currently only implemented for ELF.
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The REP_RET workaround is only needed on old AMD CPUs, and the labels clutter
up the symbol table and confuse debugging/profiling tools, so use EQU to
create SHN_ABS symbols instead of creating local labels. Furthermore, skip
the workaround completely in functions that definitely won't run on such CPUs.
Note that EQU is just creating a local label when using nasm instead of yasm.
This is probably a bug, but at least it doesn't break anything.
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| | |
cpuflags is never undefined any more, it's set to 0 instead.
Also fix an incorrect comment.
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| | |
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| | |
When allocating stack space with a larger alignment than the known stack
alignment a temporary register is used for storing the stack pointer.
Ensure that this isn't one of the registers used for passing arguments.
Signed-off-by: Anton Khirnov <[email protected]>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* Correctly handle FMA instructions with memory operands.
* Print a warning if FMA instructions are used without the correct cpuflag.
* Simplify the instantiation code.
* Clarify documentation.
Only the last operand in FMA3 instructions can be a memory operand. When
converting FMA4 instructions to FMA3 instructions we can utilize the fact
that multiply is a commutative operation and reorder operands if necessary
to ensure that a memory operand is used only as the last operand.
Signed-off-by: Anton Khirnov <[email protected]>
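The operand reordering can be sketched in C as a choice between the three FMA3 forms, whose semantics are 132: x1 = x1*x3 + x2, 213: x1 = x2*x1 + x3, and 231: x1 = x2*x3 + x1. This is an illustrative model under the assumption that an FMA4-style request is dst = a*b + c with dst equal to one of the sources; the types and function names are hypothetical and this is not the macro's actual code.

```c
#include <stdbool.h>

typedef enum { FMA3_132, FMA3_213, FMA3_231, FMA3_UNSUPPORTED } fma3_form;

typedef struct {
    fma3_form form;
    int op[3];       /* operand ids in encoding order; op[0] is the destination */
} fma3_insn;

/* Picks an FMA3 form for dst = a*b + c. Operands are plain ids.
 * b_is_mem says that b is a memory operand and must be encoded last. */
static fma3_insn fma4_to_fma3(int dst, int a, int b, int c, bool b_is_mem)
{
    fma3_insn r = { FMA3_UNSUPPORTED, { dst, -1, -1 } };

    if (dst == a) {
        if (!b_is_mem) { r.form = FMA3_213; r.op[1] = b; r.op[2] = c; }
        else           { r.form = FMA3_132; r.op[1] = c; r.op[2] = b; }
    } else if (dst == b) {
        r.form = FMA3_213; r.op[1] = a; r.op[2] = c;
    } else if (dst == c) {
        r.form = FMA3_231; r.op[1] = a; r.op[2] = b;
    }
    return r;
}

/* Reference semantics of the three forms, used to check the choice. */
static double fma3_eval(fma3_form form, double x1, double x2, double x3)
{
    switch (form) {
    case FMA3_132: return x1 * x3 + x2;
    case FMA3_213: return x2 * x1 + x3;
    case FMA3_231: return x2 * x3 + x1;
    default:       return 0.0;
    }
}
```

When dst equals a and b is in memory, the sketch picks the 132 form with b in the last slot, relying on a*b being the same as b*a; a destination distinct from all three sources is reported as unsupported.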
|
| |
| |
| |
| | |
Signed-off-by: Anton Khirnov <[email protected]>
|