author | Rémi Denis-Courmont <remi@remlab.net> | 2024-06-01 21:32:56 +0300 |
---|---|---|
committer | Rémi Denis-Courmont <remi@remlab.net> | 2024-06-04 17:40:41 +0300 |
commit | 4e120fbbbd087c3acbad6ce2e8c7b1262a5c8632 (patch) | |
tree | b04252a83e826cf23cc0509fbe7118ccefd3f2c1 /libavfilter/textutils.c | |
parent | 30797e4ff6c8c537471c386cd019a6a48a721f01 (diff) | |
lavc/vp8dsp: add R-V V vp7_idct_dc_add4y
As with idct_dc_add, most of the code is shared with, and replaces, the
previous VP8 function. To improve performance, we break down the 16x4
matrix into 4 rows, rather than 4 squares. Thus strided loads and
stores are avoided, and the 4 DC calculations are vectorized.
Unfortunately this requires a vector gather to splat the DC values, but
overall this is still a win for performance:
T-Head C908:
vp7_idct_dc_add4y_c: 7.2
vp7_idct_dc_add4y_rvv_i32: 2.2
vp8_idct_dc_add4y_c: 6.2
vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7
SpacemiT X60:
vp7_idct_dc_add4y_c: 6.2
vp7_idct_dc_add4y_rvv_i32: 2.0
vp8_idct_dc_add4y_c: 5.5
vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7
I also tried to provision the DC values using indexed loads. It ends up
slower overall, especially for VP7, as we then have to compute 16 DCs
instead of just 4.
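
For illustration only, the row-wise decomposition described above can be
sketched in scalar C. This is not the RISC-V assembly from this commit: the
names clip_u8, vp8_dc, dc_add4y_squares and dc_add4y_rows are hypothetical,
and the DC formula is assumed to be the usual VP8 rounding (coeff + 4) >> 3.
The x >> 2 index in the row-wise loop stands in for the vector gather that
splats each DC across its 4 columns.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range (hypothetical stand-in for av_clip_uint8). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Assumed VP8 DC-only transform: round and scale the DC coefficient.
 * Relies on arithmetic right shift for negative values. */
static int vp8_dc(int16_t coeff)
{
    return (coeff + 4) >> 3;
}

/* Square-wise: treat the 16x4 area as four independent 4x4 blocks.
 * Each block touches 4 short rows, i.e. strided accesses. */
static void dc_add4y_squares(uint8_t *dst, int16_t block[4][16],
                             ptrdiff_t stride)
{
    for (int b = 0; b < 4; b++) {
        int dc = vp8_dc(block[b][0]);

        block[b][0] = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++) {
                uint8_t *p = dst + y * stride + 4 * b + x;
                *p = clip_u8(*p + dc);
            }
    }
}

/* Row-wise: compute the 4 DCs up front, then process 4 contiguous rows of
 * 16 pixels. dc[x >> 2] selects the DC of the block that column x is in. */
static void dc_add4y_rows(uint8_t *dst, int16_t block[4][16],
                          ptrdiff_t stride)
{
    int dc[4];

    for (int b = 0; b < 4; b++) {
        dc[b] = vp8_dc(block[b][0]);
        block[b][0] = 0;
    }
    for (int y = 0; y < 4; y++) {
        uint8_t *row = dst + y * stride;

        for (int x = 0; x < 16; x++)
            row[x] = clip_u8(row[x] + dc[x >> 2]);
    }
}
```

Both functions should produce identical output. The row-wise form simply
exposes 16 contiguous pixels per row, which is what lets a vector
implementation use unit-stride loads and stores.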