path: root/libavcodec/riscv
Commit message | Author | Age | Files | Lines
* lavc/aacencdsp: fix rounding in R-V V quantize_bands (Rémi Denis-Courmont, 2024-06-08, 1 file, -1/+1)
  We need to round toward zero here.
* lavc/vp8dsp: R-V V vp8_idct_add (Rémi Denis-Courmont, 2024-06-08, 2 files, -0/+61)
  T-Head C908 (cycles):
    vp8_idct_add_c:       312.2
    vp8_idct_add_rvv_i32: 117.0
* lavc/vc1dsp: R-V V vc1_inv_trans_4x4 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+47)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_4x4_c:       310.7
    vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
  We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
  be about 7% slower.
* lavc/vc1dsp: R-V V vc1_inv_trans_4x8 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+79)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_4x8_c:       653.2
    vc1dsp.vc1_inv_trans_4x8_rvv_i32: 234.0
* lavc/vc1dsp: R-V V vc1_inv_trans_8x4 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+75)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_8x4_c:       626.2
    vc1dsp.vc1_inv_trans_8x4_rvv_i32: 215.2
* lavc/vc1dsp: R-V V vc1_inv_trans_8x8 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+112)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_8x8_c:       871.7
    vc1dsp.vc1_inv_trans_8x8_rvv_i32: 286.7
* lavc/flacdsp: fix sign extension in R-V V wasted33 (Rémi Denis-Courmont, 2024-06-07, 1 file, -3/+2)
  We need to use either VWCVT.X.X.V or VSEXT.VF2. The latter is
  preferable to avoid changing VTYPE.
* lavc/vp8dsp: remove no longer used macros (Rémi Denis-Courmont, 2024-06-04, 1 file, -22/+0)
* lavc/vp7dsp: add R-V V vp7_idct_dc_add4uv (Rémi Denis-Courmont, 2024-06-04, 4 files, -17/+45)
  This is almost the same story as vp7_idct_add4y. We just have to use
  strided loads of 2 64-bit elements to account for the different data
  layout in memory.
  T-Head C908:
    vp7_idct_dc_add4uv_c:       7.5
    vp7_idct_dc_add4uv_rvv_i64: 2.0
    vp8_idct_dc_add4uv_c:       6.2
    vp8_idct_dc_add4uv_rvv_i32: 2.2 (before)
    vp8_idct_dc_add4uv_rvv_i64: 2.0
  SpacemiT X60:
    vp7_idct_dc_add4uv_c:       6.7
    vp7_idct_dc_add4uv_rvv_i64: 2.2
    vp8_idct_dc_add4uv_c:       5.7
    vp8_idct_dc_add4uv_rvv_i32: 2.5 (before)
    vp8_idct_dc_add4uv_rvv_i64: 2.0
* lavc/vp8dsp: rework R-V V idct_dc_add4y (Rémi Denis-Courmont, 2024-06-04, 2 files, -8/+8)
  DCT-related FFmpeg functions often add an unsigned 8-bit sample to a
  signed 16-bit coefficient, then clip the result back to an unsigned
  8-bit value. RISC-V has no signed 16-bit to unsigned 8-bit clip, so
  instead our most common sequence is:
    VWADDU.WV
    set SEW to 16 bits
    VMAX.VV zero    # clip negative values to 0
    set SEW to 8 bits
    VNCLIPU.WI      # clip values over 255 to 255 and narrow
  Here we use a different sequence which does not require toggling the
  vector type. This assumes that the wide addend vector is biased by
  -128:
    VWADDU.WV
    VNCLIP.WI       # clip values to signed 8-bit and narrow
    VXOR.VX 0x80    # flip sign bit (convert signed to unsigned)
  Also the VMAX is effectively replaced by a VXOR of half-width. In this
  function, this comes for free as we anyway add a constant to the wide
  vector in the prologue.
  On C908, this has no observable effects. On X60, this improves
  microbenchmarks by about 20%.
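  The biased-addend trick can be checked in plain C. The sketch below
  (hypothetical helper names, not FFmpeg code) models both clipping
  sequences and shows that they agree:

```c
#include <assert.h>
#include <stdint.h>

/* Reference behaviour: add a signed 16-bit coefficient to an unsigned
 * 8-bit sample, clipping the result to [0, 255]. */
static uint8_t clip_ref(uint8_t pix, int16_t coef)
{
    int v = pix + coef;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Model of the commit's vector sequence: bias the wide addend by -128,
 * clip to signed 8-bit (VNCLIP.WI), then flip the sign bit (VXOR.VX
 * 0x80) to convert back to unsigned. */
static uint8_t clip_biased(uint8_t pix, int16_t coef)
{
    int v = pix + (coef - 128);                      /* VWADDU.WV, biased */
    int8_t n = v < -128 ? -128 : v > 127 ? 127 : v;  /* VNCLIP.WI */
    return (uint8_t)n ^ 0x80;                        /* VXOR.VX 0x80 */
}
```

  Flipping the sign bit of a signed 8-bit value is the same as adding
  128, which undoes the -128 bias, so both paths compute
  clamp(pix + coef, 0, 255).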
* lavc/vp8dsp: add R-V V vp7_idct_dc_add4y (Rémi Denis-Courmont, 2024-06-04, 3 files, -10/+54)
  As with idct_dc_add, most of the code is shared with, and replaces,
  the previous VP8 function. To improve performance, we break down the
  16x4 matrix into 4 rows, rather than 4 squares. Thus strided loads
  and stores are avoided, and the 4 DC calculations are vectored.
  Unfortunately this requires a vector gather to splat the DC values,
  but overall this is still a win for performance:
  T-Head C908:
    vp7_idct_dc_add4y_c:       7.2
    vp7_idct_dc_add4y_rvv_i32: 2.2
    vp8_idct_dc_add4y_c:       6.2
    vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
    vp8_idct_dc_add4y_rvv_i32: 1.7
  SpacemiT X60:
    vp7_idct_dc_add4y_c:       6.2
    vp7_idct_dc_add4y_rvv_i32: 2.0
    vp8_idct_dc_add4y_c:       5.5
    vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
    vp8_idct_dc_add4y_rvv_i32: 1.7
  I also tried to provision the DC values using indexed loads. It ends
  up slower overall, especially for VP7, as we then have to compute 16
  DC's instead of just 4.
* lavc/vp8dsp: add R-V V vp7_idct_dc_add (Rémi Denis-Courmont, 2024-06-04, 2 files, -8/+34)
  This just computes the direct coefficient and hands over to code
  shared with VP8. Accordingly the bulk of changes are just rewriting
  the VP8 code to share. Nothing to write home about:
    vp7_idct_dc_add_c:       1.7
    vp7_idct_dc_add_rvv_i32: 1.2
* lavc/aacencdsp: R-V V quant_bands (Rémi Denis-Courmont, 2024-06-03, 2 files, -0/+34)
  T-Head C908:
    quant_bands_signed_c:         576.0
    quant_bands_signed_rvv_f32:   48.7
    quant_bands_unsigned_c:       414.2
    quant_bands_unsigned_rvv_f32: 31.7
  SpacemiT X60:
    quant_bands_signed_c:         497.7
    quant_bands_signed_rvv_f32:   23.0
    quant_bands_unsigned_c:       353.5
    quant_bands_unsigned_rvv_f32: 16.2
* lavc/vc1dsp: fix R-V V avg_mspel_pixels (Rémi Denis-Courmont, 2024-06-02, 2 files, -24/+18)
  The 8x8 pixel arrays are not necessarily aligned to 64 bits, so the
  current code leads to Bus error on real hardware. This is
  reproducible with FATE's vc1_ilaced_twomv test case.
  The new "pessimist" code can trivially be shared for 16x16 pixel
  arrays so we also do that. FWIW, this also nominally reduces the
  hardware requirement from Zve64x to Zve32x.
  T-Head C908:
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c:       14.7
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.5
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c:       3.7
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.5
  SpacemiT X60:
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c:       13.0
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.0
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c:       3.2
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.2
* lavc/sbrdsp: add support for 256-bit vectors (Rémi Denis-Courmont, 2024-05-31, 1 file, -1/+1)
    hf_apply_noise_0_c:       35.7
    hf_apply_noise_0_rvv_f32: 9.5
    hf_apply_noise_1_c:       38.5
    hf_apply_noise_1_rvv_f32: 10.0
    hf_apply_noise_2_c:       35.5
    hf_apply_noise_2_rvv_f32: 9.7
    hf_apply_noise_3_c:       38.5
    hf_apply_noise_3_rvv_f32: 10.0
  Maybe extending the noise table manually is not such a great idea,
  but I am not quite sure how to deal with that otherwise. Allocating
  the table dynamically is possible but would require an ELF destructor
  to clean up.
* lavc/vp9dsp: R-V V rename ff_avg to ff_vp9_avg (sunyuechi, 2024-05-30, 3 files, -8/+8)
  Avoid potential naming conflicts.
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* riscv: allow passing addend to vtype_vli macro (Rémi Denis-Courmont, 2024-05-30, 1 file, -1/+1)
  A constant (-1) is added to the length value, so we can have an
  addend for free, and optimise the addition away if the addend is
  exactly 1.
* lavc/vp7dsp: R-V V vp7_idct_add (Rémi Denis-Courmont, 2024-05-29, 2 files, -0/+31)
  Most of the code is shared with DC, thanks to minor earlier changes.
    vp7_idct_add_c:       5.2
    vp7_idct_add_rvv_i32: 2.5
* lavc/vp7dsp: revector ff_vp7_dc_wht_rvv (Rémi Denis-Courmont, 2024-05-29, 2 files, -10/+16)
  This prepares for some code reuse.
* lavc/vp7dsp: add R-V V vp7_luma_dc_wht (Rémi Denis-Courmont, 2024-05-29, 3 files, -0/+138)
  This works out a bit more favourably than VP8's due to:
  - additional multiplications that can be vectored,
  - hardware-supported fixed-point rounding mode.
    vp7_luma_dc_wht_c:       3.2
    vp7_luma_dc_wht_rvv_i64: 2.0
* lavc/vp8dsp: R-V V vp8_luma_dc_wht (Rémi Denis-Courmont, 2024-05-29, 2 files, -0/+61)
  This is not great as transposition is poorly supported, but it works:
    vp8_luma_dc_wht_c:       2.5
    vp8_luma_dc_wht_rvv_i32: 1.7
* lavc/lpc: optimise RVV vector type for compute_autocorr (Rémi Denis-Courmont, 2024-05-29, 2 files, -3/+5)
  On SpacemiT X60 (with len == 4000):
    autocorr_10_c:       2303.7
    autocorr_10_rvv_f64: 1411.5 (before)
    autocorr_10_rvv_f64: 842.2 (after)
* lavc/vp8dsp: save one R-V GPR (Rémi Denis-Courmont, 2024-05-28, 1 file, -7/+16)
  This saves one instruction and frees up A5, which will be repurposed
  in later changes. Unfortunately, we need to add quite a lot of
  alternative code for this.
* lavc/vp8dsp: avoid one multiplication on RISC-V (Rémi Denis-Courmont, 2024-05-28, 2 files, -28/+29)
  Use shifts rather than multiply, and save one instruction.
* lavc/vp8dsp: factor R-V V bilin functions (Rémi Denis-Courmont, 2024-05-28, 1 file, -10/+27)
  For a given type, only the first VSETVLI instruction varies depending
  on the size.
* lavc/sbrdsp: fold immediate offset into relocation (Rémi Denis-Courmont, 2024-05-28, 1 file, -2/+1)
  This results in AUIPC; ADDI instead of AUIPC; ADDI; ... ADDI.
* lavc/startcode: fix RVV return value on no match (Rémi Denis-Courmont, 2024-05-28, 1 file, -0/+2)
  If there are no zero bytes, t2 equals -1. The code cannot simply fall
  through to the match case.
* lavc/lpc: fix off-by-one in R-V V compute_autocorr (Rémi Denis-Courmont, 2024-05-28, 2 files, -1/+2)
* lavc/flacdsp: R-V Zvl256b lpc33 (Rémi Denis-Courmont, 2024-05-27, 2 files, -2/+32)
    flac_lpc_33_13_c:       499.7
    flac_lpc_33_13_rvv_i64: 197.7
    flac_lpc_33_16_c:       601.5
    flac_lpc_33_16_rvv_i64: 195.2
    flac_lpc_33_29_c:       1011.5
    flac_lpc_33_29_rvv_i64: 300.7
    flac_lpc_33_32_c:       1099.0
    flac_lpc_33_32_rvv_i64: 296.7
* lavc/vp8dsp: disable EPEL HV on RV128 (Rémi Denis-Courmont, 2024-05-27, 2 files, -1/+4)
  RV128 is mostly scifi at this point, so we can just disable it here
  (the EPEL HV prologue/epilogue do not save 128-bit registers).
* lavc/vp8dsp: remove unused macro parameter (Rémi Denis-Courmont, 2024-05-26, 1 file, -4/+4)
* lavc/rv34dsp: remove stray load immediate (Rémi Denis-Courmont, 2024-05-26, 1 file, -1/+0)
* lavc/vp8dsp: R-V V put_epel hv (sunyuechi, 2024-05-26, 2 files, -21/+115)
  C908:
    vp8_put_epel4_h4v4_c:        20.0
    vp8_put_epel4_h4v4_rvv_i32:  11.0
    vp8_put_epel4_h4v6_c:        25.2
    vp8_put_epel4_h4v6_rvv_i32:  13.5
    vp8_put_epel4_h6v4_c:        22.2
    vp8_put_epel4_h6v4_rvv_i32:  14.5
    vp8_put_epel4_h6v6_c:        29.0
    vp8_put_epel4_h6v6_rvv_i32:  15.7
    vp8_put_epel8_h4v4_c:        73.0
    vp8_put_epel8_h4v4_rvv_i32:  22.2
    vp8_put_epel8_h4v6_c:        90.5
    vp8_put_epel8_h4v6_rvv_i32:  26.7
    vp8_put_epel8_h6v4_c:        85.0
    vp8_put_epel8_h6v4_rvv_i32:  27.2
    vp8_put_epel8_h6v6_c:        104.7
    vp8_put_epel8_h6v6_rvv_i32:  29.5
    vp8_put_epel16_h4v4_c:       145.5
    vp8_put_epel16_h4v4_rvv_i32: 26.5
    vp8_put_epel16_h4v6_c:       190.7
    vp8_put_epel16_h4v6_rvv_i32: 47.5
    vp8_put_epel16_h6v4_c:       173.7
    vp8_put_epel16_h6v4_rvv_i32: 33.2
    vp8_put_epel16_h6v6_c:       222.2
    vp8_put_epel16_h6v6_rvv_i32: 35.5
  Amended to disable unsupported RV128.
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* lavc/sbrdsp: fix inverted boundary check (Rémi Denis-Courmont, 2024-05-25, 1 file, -1/+1)
  128-bit is the maximum, not the minimum here. Larger vector sizes can
  result in reads past the end of the noise value table.
  This partially reverts commit cdcb4b98b7f74d87a6274899ff70724795d551cb.
* lavc/flacdsp: do not assume maximum R-V VL (Rémi Denis-Courmont, 2024-05-25, 1 file, -2/+2)
  This loop correctly assumes that VLMAX=16 (4x128-bit vectors with
  32-bit elements) and 32 >= pred_order > 16. We need to alternate
  between VL=16 and VL=t2=pred_order-16 elements to add up to
  pred_order.
  The current code requests AVL=a2=pred_order elements. In QEMU and on
  the K230 hardware, this sets VL=16 as we need. But the specification
  merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
  instance, if pred_order equals 27, we could end up with VL=14 or
  VL=15 instead of VL=16. So instead, request literally VLMAX=16.
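  The vsetvli guarantee the commit relies on can be written down as a
  small C predicate. `vl_is_legal` is a hypothetical helper encoding
  the AVL-to-VL rule from the RVV specification, shown only to
  illustrate why AVL=27 with VLMAX=16 may legally yield VL=14 or 15:

```c
#include <assert.h>

/* Per the RVV spec, vsetvli must pick VL as follows:
 *   - VL = AVL            if AVL <= VLMAX
 *   - ceil(AVL/2) <= VL <= VLMAX  if VLMAX < AVL < 2*VLMAX
 *   - VL = VLMAX          if AVL >= 2*VLMAX
 */
static int vl_is_legal(int avl, int vlmax, int vl)
{
    if (avl <= vlmax)
        return vl == avl;
    if (avl < 2 * vlmax)
        return vl >= (avl + 1) / 2 && vl <= vlmax;
    return vl == vlmax;
}
```

  Requesting AVL=VLMAX=16 instead falls into the first case, so VL=16
  is then the only permitted outcome.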
* lavc/pixblockdsp: add scalar get_pixels_unaligned (Rémi Denis-Courmont, 2024-05-24, 1 file, -0/+7)
  The code is already there, we just need to use it.
    get_pixels_unaligned_c:          2.2
    get_pixels_unaligned_misaligned: 1.7
* lavc/h263dsp: R-V V {h,v}_loop_filter (Rémi Denis-Courmont, 2024-05-22, 3 files, -0/+143)
  Since the horizontal and vertical filters are identical except for a
  transposition, this uses a common subprocedure with an ad-hoc ABI. To
  preserve return-address stack prediction, a link register has to be
  used (c.f. the "Control Transfer Instructions" from the RISC-V ISA
  Manual). The alternate/temporary link register T0 is used here, so
  that the normal RA is preserved (something Arm cannot do!).
  To load the strength value based on `qscale`, the shortest possible
  and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic
  LLA; ADD; LBU sequence would add one more instruction since LLA is a
  convenience alias for AUIPC; ADDI. To ensure that this trick works,
  relocation relaxation is disabled.
  To implement the two signed divisions by a power of two toward zero:
    (x / (1 << SHIFT))
  the code relies on the small range of integers involved, computing:
    (x + (x >> (16 - SHIFT))) >> SHIFT
  rather than the more general:
    (x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT
  Thus one ANDI instruction is avoided.
  T-Head C908:
    h263dsp.h_loop_filter_c:       228.2
    h263dsp.h_loop_filter_rvv_i32: 144.0
    h263dsp.v_loop_filter_c:       242.7
    h263dsp.v_loop_filter_rvv_i32: 114.0
  (C is probably worse in real use due to less predictable branches.)
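  The division shortcut can be modelled in scalar C. The sketch below
  (`div_toward_zero` is an illustrative name, not FFmpeg code) assumes,
  as the commit does, that |x| < 1 << (16 - SHIFT), so that a logical
  shift of the 16-bit value yields exactly the rounding bias:

```c
#include <assert.h>
#include <stdint.h>

/* Signed division by 1 << shift, rounding toward zero.
 * For negative x with |x| < 1 << (16 - shift), the top `shift` bits of
 * the 16-bit two's-complement value are all ones, so the logical right
 * shift by (16 - shift) produces the bias (1 << shift) - 1; for
 * non-negative x it produces 0. This avoids the ANDI of the general
 * form. (The final >> on a negative value is arithmetic on all
 * mainstream compilers.) */
static int16_t div_toward_zero(int16_t x, int shift)
{
    uint16_t bias = (uint16_t)x >> (16 - shift);
    return (int16_t)((int16_t)(x + bias) >> shift);
}
```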
* lavc/vp9dsp: R-V V mc avg (sunyuechi, 2024-05-21, 3 files, -1/+78)
  C908:
    vp9_avg4_8bpp_c:        1.2
    vp9_avg4_8bpp_rvv_i64:  1.0
    vp9_avg8_8bpp_c:        3.7
    vp9_avg8_8bpp_rvv_i64:  1.5
    vp9_avg16_8bpp_c:       14.7
    vp9_avg16_8bpp_rvv_i64: 3.5
    vp9_avg32_8bpp_c:       57.7
    vp9_avg32_8bpp_rvv_i64: 10.0
    vp9_avg64_8bpp_c:       229.0
    vp9_avg64_8bpp_rvv_i64: 31.7
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* Revert "lavc/sbrdsp: R-V V neg_odd_64" (Rémi Denis-Courmont, 2024-05-21, 2 files, -22/+0)
  While this function can easily be written with vectors, it just fails
  to get any performance improvement. For reference, this is a simpler
  loop-free implementation that does get better performance than the
  current one depending on hardware, but still more or less the same
  metrics as the C code:
    func ff_sbr_neg_odd_64_rvv, zve64x
            li       a1, 32
            addi     a0, a0, 7
            li       t0, 8
            vsetvli  zero, a1, e8, m2, ta, ma
            li       t1, 0x80
            vlse8.v  v8, (a0), t0
            vxor.vx  v8, v8, t1
            vsse8.v  v8, (a0), t0
            ret
    endfunc
  This reverts commit d06fd18f8f4c6a81ef94cbb600620d83ad51269d.
* lavc/vc1dsp: R-V V vc1_unescape_buffer (Rémi Denis-Courmont, 2024-05-21, 2 files, -0/+55)
  Notes:
  - The loop is biased toward no escape bytes, as that should be most
    common.
  - The input byte array is slid rather than the (8 times smaller)
    bit-mask, as RISC-V V does not provide a bit-mask (or bit-wise)
    slide instruction.
  - There are two comparisons with 0 per iteration, for the same reason.
  - In case of match, bytes are copied until the first match, and the
    loop is restarted after the escape byte. Vector compression
    (vcompress.vm) could discard all escape bytes but that is slower if
    escape bytes are rare.
  Further optimisations should be possible, e.g.:
  - processing 2 bytes fewer per iteration to get rid of 2 slides,
  - taking a short cut if the input vector contains less than 2 zeroes.
  But this is a good starting point:
  T-Head C908:
    vc1dsp.vc1_unescape_buffer_c:       12749.5
    vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
  SpacemiT X60:
    vc1dsp.vc1_unescape_buffer_c:       11038.0
    vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
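  For context, a scalar model of the unescaping being vectorised could
  look like the sketch below. `unescape_sketch` and its exact skip
  condition are illustrative assumptions (a 0x03 byte after two zero
  bytes and before a byte below 0x04 is treated as an
  emulation-prevention byte), not FFmpeg's reference implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy src to dst, dropping assumed VC-1 escape bytes: 0x03 preceded by
 * 00 00 and followed by a byte < 0x04. Returns the unescaped length. */
static size_t unescape_sketch(const uint8_t *src, size_t size, uint8_t *dst)
{
    size_t di = 0;
    for (size_t i = 0; i < size; i++) {
        if (i >= 2 && i + 1 < size &&
            src[i] == 0x03 && src[i - 1] == 0x00 && src[i - 2] == 0x00 &&
            src[i + 1] < 0x04)
            continue; /* escape byte: skip it */
        dst[di++] = src[i];
    }
    return di;
}
```

  The vector version scans whole vectors for the 00 00 pattern instead
  of testing byte by byte, which is where the speed-up comes from.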
* lavc/huffyuvdsp: optimise RVV vtype for add_hfyu_left_pred_bgr32 (Rémi Denis-Courmont, 2024-05-19, 2 files, -3/+6)
  T-Head C908:
    add_hfyu_left_pred_bgr32_c:       237.5
    add_hfyu_left_pred_bgr32_rvv_i32: 173.5 (before)
    add_hfyu_left_pred_bgr32_rvv_i32: 110.0 (after)
* lavc/flacdsp: optimise RVV vector type for lpc32 (Rémi Denis-Courmont, 2024-05-19, 2 files, -12/+15)
  This is pretty much the same as for lpc16, though it only improves
  half as large prediction orders. With 128-bit vectors, this gives:
         C       V old   V new
    1    69.2    181.5    95.5
    2   107.7    180.7    95.2
    3   145.5    180.0   103.5
    4   183.0    179.2   102.7
    5   220.7    178.5   128.0
    6   257.7    194.0   127.5
    7   294.5    193.7   126.7
    8   331.0    193.0   126.5
  Larger prediction orders see no significant changes at that size.
* lavc/flacdsp: optimise RVV vector type for lpc16 (Rémi Denis-Courmont, 2024-05-19, 2 files, -3/+4)
  This calculates the optimal vector type value at run-time based on
  the hardware vector length and the FLAC LPC prediction order. In this
  particular case, the additional computation is easily amortised over
  the loop iterations:
  T-Head C908:
         C       V before  V after
    1    48.0    214.7      95.2
    2    64.7    214.2      94.7
    3    79.7    213.5      94.5
    4    96.2    196.5      94.2 #
    5   111.0    195.7     118.5
    6   127.0    211.2     102.0
    7   143.7    194.2     101.5
    8   175.7    193.2     101.2 #
    9   176.2    224.2     126.0
   10   191.5    192.0     125.5
   11   224.5    191.2     124.7
   12   223.0    190.2     124.2
   13   239.2    189.5     123.7
   14   253.7    188.7     139.5
   15   286.2    188.0     122.7
   16   284.0    187.0     122.5 #
   17   300.2    186.5     186.5
   18   314.0    185.5     185.7
   19   329.7    184.7     185.0
   20   343.0    184.2     184.2
   21   358.7    199.2     183.7
   22   371.7    182.7     182.7
   23   387.5    181.7     182.0
   24   400.7    181.0     181.2
   25   431.5    180.2     196.5
   26   443.7    195.5     196.0
   27   459.0    178.7     196.2
   28   470.7    177.7     194.2
   29   470.0    177.0     193.5
   30   481.2    176.2     176.5
   31   496.2    175.5     175.7
   32   507.2    174.7     191.0 #
  # Power of two boundary.
  With 128-bit vectors, improvements are expected for the first two
  test cases only. For the other two, there is overhead but below
  noise. Improvements should be better observable with prediction order
  of 8 and less, or on hardware with larger vector sizes.
* lavc/vp9_intra: fix another .irp use with LLVM as (Rémi Denis-Courmont, 2024-05-19, 1 file, -1/+1)
* lavc/vp9_intra: fix .irp use with LLVM as (Rémi Denis-Courmont, 2024-05-19, 1 file, -14/+14)
* lavc/vp8dsp: fix .irp use with LLVM as (Rémi Denis-Courmont, 2024-05-19, 1 file, -2/+2)
* lavc/startcode: add R-V V startcode_find_candidate (Rémi Denis-Courmont, 2024-05-19, 4 files, -8/+62)
* lavc/startcode: add R-V Zbb startcode_find_candidate (Rémi Denis-Courmont, 2024-05-19, 4 files, -2/+130)
  The main loop processes 8 bytes in 5 instructions. For comparison,
  the optimal plain strnlen() requires 4 instructions per byte (6.4x
  worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code involves
  5 instructions per byte (8x worse). Actual benchmarks may be slightly
  less favourable due to latency from ORC.B to BNE.
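  The word-at-a-time zero-byte test that ORC.B enables has a well-known
  portable SWAR analogue, sketched here purely for illustration of the
  idea (this is not the committed assembly):

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit word contains a zero byte iff subtracting 0x01 from every
 * byte borrows into a byte whose high bit was clear:
 * (v - 0x0101..) & ~v & 0x8080.. is non-zero exactly then. The Zbb
 * ORC.B instruction expresses the same per-byte "is non-zero" test in
 * one instruction, which is how the scan reaches 8 bytes per
 * iteration. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```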
* lavc/vp9dsp: R-V V ipred tm (sunyuechi, 2024-05-17, 3 files, -0/+130)
  C908:
    vp9_tm_4x4_8bpp_c:         116.5
    vp9_tm_4x4_8bpp_rvv_i32:   43.5
    vp9_tm_8x8_8bpp_c:         416.2
    vp9_tm_8x8_8bpp_rvv_i32:   86.0
    vp9_tm_16x16_8bpp_c:       1665.5
    vp9_tm_16x16_8bpp_rvv_i32: 187.2
    vp9_tm_32x32_8bpp_c:       6974.2
    vp9_tm_32x32_8bpp_rvv_i32: 625.7
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* lavc/flacdsp: R-V V flac_wasted33 (Rémi Denis-Courmont, 2024-05-17, 2 files, -0/+36)
  T-Head C908:
    flac_wasted_33_c:       786.2
    flac_wasted_33_rvv_i64: 486.5