author: Rémi Denis-Courmont <remi@remlab.net> 2024-06-01 21:32:56 +0300
committer: Rémi Denis-Courmont <remi@remlab.net> 2024-06-04 17:40:41 +0300
commit: 4e120fbbbd087c3acbad6ce2e8c7b1262a5c8632 (patch)
tree: b04252a83e826cf23cc0509fbe7118ccefd3f2c1 /tools/source2c
parent: 30797e4ff6c8c537471c386cd019a6a48a721f01 (diff)
lavc/vp8dsp: add R-V V vp7_idct_dc_add4y
As with idct_dc_add, most of the code is shared with, and replaces, the
previous VP8 function. To improve performance, we break down the 16x4
matrix into 4 rows, rather than 4 squares. Thus strided loads and
stores are avoided, and the 4 DC calculations are vectored.
Unfortunately this requires a vector gather to splat the DC values, but
overall this is still a win for performance:

T-Head C908:
vp7_idct_dc_add4y_c:       7.2
vp7_idct_dc_add4y_rvv_i32: 2.2
vp8_idct_dc_add4y_c:       6.2
vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7

SpacemiT X60:
vp7_idct_dc_add4y_c:       6.2
vp7_idct_dc_add4y_rvv_i32: 2.0
vp8_idct_dc_add4y_c:       5.5
vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7

I also tried to provision the DC values using indexed loads. It ends up
slower overall, especially for VP7, as we then have to compute 16 DC's
instead of just 4.
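
For illustration only, the row-wise decomposition described above can be
modelled in scalar C. This is a sketch, not the RVV assembly from the patch:
the names clip_u8 and idct_dc_add4y_rows are hypothetical, the DC rounding is
assumed to be the VP8 one ((block[0] + 4) >> 3), and clearing the DC
coefficient afterwards is assumed to match the C reference. The per-pixel
lookup dc[x / 4] plays the role of the vector gather that splats each DC over
4 pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a value to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Hypothetical scalar model: add the DC of four horizontally adjacent
 * 4x4 blocks to a 16x4 pixel area, processed as 4 rows of 16 pixels.
 */
static void idct_dc_add4y_rows(uint8_t *dst, int16_t block[4][16],
                               ptrdiff_t stride)
{
    int dc[4];

    /* Step 1: compute the 4 DC values together (assumed VP8 rounding;
     * relies on arithmetic right shift for negative inputs). */
    for (int i = 0; i < 4; i++) {
        dc[i] = (block[i][0] + 4) >> 3;
        block[i][0] = 0; /* assumption: the DC coefficient is consumed */
    }

    /* Step 2: each row is one contiguous 16-pixel load/add/store.
     * dc[x / 4] selects the DC of the block that owns column x. */
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = clip_u8(dst[x] + dc[x / 4]);
        dst += stride;
    }
}
```

In this model a VP7 variant would differ only in step 1, which is why
computing 4 DC values rather than 16 matters more when the DC formula is
costlier.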
Diffstat (limited to 'tools/source2c')
0 files changed, 0 insertions, 0 deletions