path: root/libavcodec/riscv
Commit message | Author | Age | Files | Lines
* lavc/aacencdsp: fix rounding in R-V V quantize_bands (Rémi Denis-Courmont, 2024-06-08, 1 file, -1/+1)
  We need to round toward zero here.
* lavc/vp8dsp: R-V V vp8_idct_add (Rémi Denis-Courmont, 2024-06-08, 2 files, -0/+61)
  T-Head C908 (cycles):
    vp8_idct_add_c:       312.2
    vp8_idct_add_rvv_i32: 117.0
* lavc/vc1dsp: R-V V vc1_inv_trans_4x4 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+47)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_4x4_c:       310.7
    vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
  We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
  be about 7% slower.
* lavc/vc1dsp: R-V V vc1_inv_trans_4x8 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+79)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_4x8_c:       653.2
    vc1dsp.vc1_inv_trans_4x8_rvv_i32: 234.0
* lavc/vc1dsp: R-V V vc1_inv_trans_8x4 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+75)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_8x4_c:       626.2
    vc1dsp.vc1_inv_trans_8x4_rvv_i32: 215.2
* lavc/vc1dsp: R-V V vc1_inv_trans_8x8 (Rémi Denis-Courmont, 2024-06-07, 2 files, -0/+112)
  T-Head C908 (cycles):
    vc1dsp.vc1_inv_trans_8x8_c:       871.7
    vc1dsp.vc1_inv_trans_8x8_rvv_i32: 286.7
* lavc/flacdsp: fix sign extension in R-V V wasted33 (Rémi Denis-Courmont, 2024-06-07, 1 file, -3/+2)
  We need to use either VWCVT.X.X.V or VSEXT.VF2. The latter is
  preferable to avoid changing VTYPE.
* lavc/vp8dsp: remove no longer used macros (Rémi Denis-Courmont, 2024-06-04, 1 file, -22/+0)
* lavc/vp7dsp: add R-V V vp7_idct_dc_add4uv (Rémi Denis-Courmont, 2024-06-04, 4 files, -17/+45)
  This is almost the same story as vp7_idct_add4y. We just have to use
  strided loads of 2 64-bit elements to account for the different data
  layout in memory.
  T-Head C908:
    vp7_idct_dc_add4uv_c:       7.5
    vp7_idct_dc_add4uv_rvv_i64: 2.0
    vp8_idct_dc_add4uv_c:       6.2
    vp8_idct_dc_add4uv_rvv_i32: 2.2 (before)
    vp8_idct_dc_add4uv_rvv_i64: 2.0
  SpacemiT X60:
    vp7_idct_dc_add4uv_c:       6.7
    vp7_idct_dc_add4uv_rvv_i64: 2.2
    vp8_idct_dc_add4uv_c:       5.7
    vp8_idct_dc_add4uv_rvv_i32: 2.5 (before)
    vp8_idct_dc_add4uv_rvv_i64: 2.0
* lavc/vp8dsp: rework R-V V idct_dc_add4y (Rémi Denis-Courmont, 2024-06-04, 2 files, -8/+8)
  DCT-related FFmpeg functions often add an unsigned 8-bit sample to a
  signed 16-bit coefficient, then clip the result back to an unsigned
  8-bit value. RISC-V has no signed 16-bit to unsigned 8-bit clip, so
  instead our most common sequence is:
    VWADDU.WV
    set SEW to 16 bits
    VMAX.VV zero    # clip negative values to 0
    set SEW to 8 bits
    VNCLIPU.WI      # clip values over 255 to 255 and narrow
  Here we use a different sequence which does not require toggling the
  vector type. This assumes that the wide addend vector is biased by
  -128:
    VWADDU.WV
    VNCLIP.WI       # clip values to signed 8-bit and narrow
    VXOR.VX 0x80    # flip sign bit (convert signed to unsigned)
  Also the VMAX is effectively replaced by a VXOR of half-width. In this
  function, this comes for free as we anyway add a constant to the wide
  vector in the prologue.
  On C908, this has no observable effects. On X60, this improves
  microbenchmarks by about 20%.
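  The biased-addend trick can be checked in plain C. The sketch below
  (hypothetical helper names, not FFmpeg code) models both clipping
  sequences and shows that they agree:

```c
#include <assert.h>
#include <stdint.h>

/* Reference behaviour: add a signed 16-bit coefficient to an unsigned
 * 8-bit sample, clipping the result to [0, 255]. */
static uint8_t clip_ref(uint8_t pix, int16_t coef)
{
    int v = pix + coef;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Model of the commit's vector sequence: bias the wide addend by -128,
 * clip to signed 8-bit (VNCLIP.WI), then flip the sign bit (VXOR.VX
 * 0x80) to convert back to unsigned. */
static uint8_t clip_biased(uint8_t pix, int16_t coef)
{
    int v = pix + (coef - 128);                      /* VWADDU.WV, biased */
    int8_t n = v < -128 ? -128 : v > 127 ? 127 : v;  /* VNCLIP.WI */
    return (uint8_t)n ^ 0x80;                        /* VXOR.VX 0x80 */
}
```

  Flipping the sign bit of a signed 8-bit value is the same as adding
  128, which undoes the -128 bias, so both paths compute
  clamp(pix + coef, 0, 255).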
* lavc/vp8dsp: add R-V V vp7_idct_dc_add4y (Rémi Denis-Courmont, 2024-06-04, 3 files, -10/+54)
  As with idct_dc_add, most of the code is shared with, and replaces,
  the previous VP8 function. To improve performance, we break down the
  16x4 matrix into 4 rows, rather than 4 squares. Thus strided loads
  and stores are avoided, and the 4 DC calculations are vectored.
  Unfortunately this requires a vector gather to splat the DC values,
  but overall this is still a win for performance:
  T-Head C908:
    vp7_idct_dc_add4y_c:       7.2
    vp7_idct_dc_add4y_rvv_i32: 2.2
    vp8_idct_dc_add4y_c:       6.2
    vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
    vp8_idct_dc_add4y_rvv_i32: 1.7
  SpacemiT X60:
    vp7_idct_dc_add4y_c:       6.2
    vp7_idct_dc_add4y_rvv_i32: 2.0
    vp8_idct_dc_add4y_c:       5.5
    vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
    vp8_idct_dc_add4y_rvv_i32: 1.7
  I also tried to provision the DC values using indexed loads. It ends
  up slower overall, especially for VP7, as we then have to compute 16
  DC's instead of just 4.
* lavc/vp8dsp: add R-V V vp7_idct_dc_add (Rémi Denis-Courmont, 2024-06-04, 2 files, -8/+34)
  This just computes the direct coefficient and hands over to code
  shared with VP8. Accordingly the bulk of changes are just rewriting
  the VP8 code to share. Nothing to write home about:
    vp7_idct_dc_add_c:       1.7
    vp7_idct_dc_add_rvv_i32: 1.2
* lavc/aacencdsp: R-V V quant_bands (Rémi Denis-Courmont, 2024-06-03, 2 files, -0/+34)
  T-Head C908:
    quant_bands_signed_c:         576.0
    quant_bands_signed_rvv_f32:   48.7
    quant_bands_unsigned_c:       414.2
    quant_bands_unsigned_rvv_f32: 31.7
  SpacemiT X60:
    quant_bands_signed_c:         497.7
    quant_bands_signed_rvv_f32:   23.0
    quant_bands_unsigned_c:       353.5
    quant_bands_unsigned_rvv_f32: 16.2
* lavc/vc1dsp: fix R-V V avg_mspel_pixels (Rémi Denis-Courmont, 2024-06-02, 2 files, -24/+18)
  The 8x8 pixel arrays are not necessarily aligned to 64 bits, so the
  current code leads to Bus error on real hardware. This is
  reproducible with FATE's vc1_ilaced_twomv test case.
  The new "pessimist" code can trivially be shared for 16x16 pixel
  arrays so we also do that. FWIW, this also nominally reduces the
  hardware requirement from Zve64x to Zve32x.
  T-Head C908:
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c:       14.7
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.5
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c:       3.7
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.5
  SpacemiT X60:
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c:       13.0
    vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.0
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c:       3.2
    vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.2
* lavc/sbrdsp: add support for 256-bit vectors (Rémi Denis-Courmont, 2024-05-31, 1 file, -1/+1)
    hf_apply_noise_0_c:       35.7
    hf_apply_noise_0_rvv_f32: 9.5
    hf_apply_noise_1_c:       38.5
    hf_apply_noise_1_rvv_f32: 10.0
    hf_apply_noise_2_c:       35.5
    hf_apply_noise_2_rvv_f32: 9.7
    hf_apply_noise_3_c:       38.5
    hf_apply_noise_3_rvv_f32: 10.0
  Maybe extending the noise table manually is not such a great idea,
  but I am not quite sure how to deal with that otherwise. Allocating
  the table dynamically is possible but would require an ELF destructor
  to clean up.
* lavc/vp9dsp: R-V V rename ff_avg to ff_vp9_avg (sunyuechi, 2024-05-30, 3 files, -8/+8)
  Avoid potential naming conflicts.
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* riscv: allow passing addend to vtype_vli macro (Rémi Denis-Courmont, 2024-05-30, 1 file, -1/+1)
  A constant (-1) is added to the length value, so we can have an
  addend for free, and optimise the addition away if the addend is
  exactly 1.
* lavc/vp7dsp: R-V V vp7_idct_add (Rémi Denis-Courmont, 2024-05-29, 2 files, -0/+31)
  Most of the code is shared with DC, thanks to minor earlier changes.
    vp7_idct_add_c:       5.2
    vp7_idct_add_rvv_i32: 2.5
* lavc/vp7dsp: revector ff_vp7_dc_wht_rvv (Rémi Denis-Courmont, 2024-05-29, 2 files, -10/+16)
  This prepares for some code reuse.
* lavc/vp7dsp: add R-V V vp7_luma_dc_wht (Rémi Denis-Courmont, 2024-05-29, 3 files, -0/+138)
  This works out a bit more favourably than VP8's due to:
  - additional multiplications that can be vectored,
  - hardware-supported fixed-point rounding mode.
    vp7_luma_dc_wht_c:       3.2
    vp7_luma_dc_wht_rvv_i64: 2.0
* lavc/vp8dsp: R-V V vp8_luma_dc_wht (Rémi Denis-Courmont, 2024-05-29, 2 files, -0/+61)
  This is not great as transposition is poorly supported, but it works:
    vp8_luma_dc_wht_c:       2.5
    vp8_luma_dc_wht_rvv_i32: 1.7
* lavc/lpc: optimise RVV vector type for compute_autocorr (Rémi Denis-Courmont, 2024-05-29, 2 files, -3/+5)
  On SpacemiT X60 (with len == 4000):
    autocorr_10_c:       2303.7
    autocorr_10_rvv_f64: 1411.5 (before)
    autocorr_10_rvv_f64: 842.2 (after)
* lavc/vp8dsp: save one R-V GPR (Rémi Denis-Courmont, 2024-05-28, 1 file, -7/+16)
  This saves one instruction and frees up A5, which will be repurposed
  in later changes. Unfortunately, we need to add quite a lot of
  alternative code for this.
* lavc/vp8dsp: avoid one multiplication on RISC-V (Rémi Denis-Courmont, 2024-05-28, 2 files, -28/+29)
  Use shifts rather than multiply, and save one instruction.
* lavc/vp8dsp: factor R-V V bilin functions (Rémi Denis-Courmont, 2024-05-28, 1 file, -10/+27)
  For a given type, only the first VSETVLI instruction varies depending
  on the size.
* lavc/sbrdsp: fold immediate offset into relocation (Rémi Denis-Courmont, 2024-05-28, 1 file, -2/+1)
  This results in AUIPC; ADDI instead of AUIPC; ADDI; ... ADDI.
* lavc/startcode: fix RVV return value on no match (Rémi Denis-Courmont, 2024-05-28, 1 file, -0/+2)
  If there are no zero bytes, t2 equals -1. The code cannot simply fall
  through to the match case.
* lavc/lpc: fix off-by-one in R-V V compute_autocorr (Rémi Denis-Courmont, 2024-05-28, 2 files, -1/+2)
* lavc/flacdsp: R-V Zvl256b lpc33 (Rémi Denis-Courmont, 2024-05-27, 2 files, -2/+32)
    flac_lpc_33_13_c:       499.7
    flac_lpc_33_13_rvv_i64: 197.7
    flac_lpc_33_16_c:       601.5
    flac_lpc_33_16_rvv_i64: 195.2
    flac_lpc_33_29_c:       1011.5
    flac_lpc_33_29_rvv_i64: 300.7
    flac_lpc_33_32_c:       1099.0
    flac_lpc_33_32_rvv_i64: 296.7
* lavc/vp8dsp: disable EPEL HV on RV128 (Rémi Denis-Courmont, 2024-05-27, 2 files, -1/+4)
  RV128 is mostly scifi at this point, so we can just disable it here
  (the EPEL HV prologue/epilogue do not save 128-bit registers).
* lavc/vp8dsp: remove unused macro parameter (Rémi Denis-Courmont, 2024-05-26, 1 file, -4/+4)
* lavc/rv34dsp: remove stray load immediate (Rémi Denis-Courmont, 2024-05-26, 1 file, -1/+0)
* lavc/vp8dsp: R-V V put_epel hv (sunyuechi, 2024-05-26, 2 files, -21/+115)
  C908:
    vp8_put_epel4_h4v4_c:        20.0
    vp8_put_epel4_h4v4_rvv_i32:  11.0
    vp8_put_epel4_h4v6_c:        25.2
    vp8_put_epel4_h4v6_rvv_i32:  13.5
    vp8_put_epel4_h6v4_c:        22.2
    vp8_put_epel4_h6v4_rvv_i32:  14.5
    vp8_put_epel4_h6v6_c:        29.0
    vp8_put_epel4_h6v6_rvv_i32:  15.7
    vp8_put_epel8_h4v4_c:        73.0
    vp8_put_epel8_h4v4_rvv_i32:  22.2
    vp8_put_epel8_h4v6_c:        90.5
    vp8_put_epel8_h4v6_rvv_i32:  26.7
    vp8_put_epel8_h6v4_c:        85.0
    vp8_put_epel8_h6v4_rvv_i32:  27.2
    vp8_put_epel8_h6v6_c:        104.7
    vp8_put_epel8_h6v6_rvv_i32:  29.5
    vp8_put_epel16_h4v4_c:       145.5
    vp8_put_epel16_h4v4_rvv_i32: 26.5
    vp8_put_epel16_h4v6_c:       190.7
    vp8_put_epel16_h4v6_rvv_i32: 47.5
    vp8_put_epel16_h6v4_c:       173.7
    vp8_put_epel16_h6v4_rvv_i32: 33.2
    vp8_put_epel16_h6v6_c:       222.2
    vp8_put_epel16_h6v6_rvv_i32: 35.5
  Amended to disable unsupported RV128.
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* lavc/sbrdsp: fix inverted boundary check (Rémi Denis-Courmont, 2024-05-25, 1 file, -1/+1)
  128-bit is the maximum, not the minimum here. Larger vector sizes can
  result in reads past the end of the noise value table.
  This partially reverts commit cdcb4b98b7f74d87a6274899ff70724795d551cb.
* lavc/flacdsp: do not assume maximum R-V VL (Rémi Denis-Courmont, 2024-05-25, 1 file, -2/+2)
  This loop correctly assumes that VLMAX=16 (4x128-bit vectors with
  32-bit elements) and 32 >= pred_order > 16. We need to alternate
  between VL=16 and VL=t2=pred_order-16 elements to add up to
  pred_order.
  The current code requests AVL=a2=pred_order elements. In QEMU and on
  the K230 hardware, this sets VL=16 as we need. But the specification
  merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
  instance, if pred_order equals 27, we could end up with VL=14 or
  VL=15 instead of VL=16. So instead, request literally VLMAX=16.
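  The vsetvli guarantee the commit relies on can be written down as a
  small C predicate. `vl_is_legal` is a hypothetical helper encoding
  the AVL-to-VL rule from the RVV specification, shown only to
  illustrate why AVL=27 with VLMAX=16 may legally yield VL=14 or 15:

```c
#include <assert.h>

/* Per the RVV spec, vsetvli must pick VL as follows:
 *   - VL = AVL            if AVL <= VLMAX
 *   - ceil(AVL/2) <= VL <= VLMAX  if VLMAX < AVL < 2*VLMAX
 *   - VL = VLMAX          if AVL >= 2*VLMAX
 */
static int vl_is_legal(int avl, int vlmax, int vl)
{
    if (avl <= vlmax)
        return vl == avl;
    if (avl < 2 * vlmax)
        return vl >= (avl + 1) / 2 && vl <= vlmax;
    return vl == vlmax;
}
```

  Requesting AVL=VLMAX=16 instead falls into the first case, so VL=16
  is then the only permitted outcome.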
* lavc/pixblockdsp: add scalar get_pixels_unaligned (Rémi Denis-Courmont, 2024-05-24, 1 file, -0/+7)
  The code is already there, we just need to use it.
    get_pixels_unaligned_c:          2.2
    get_pixels_unaligned_misaligned: 1.7
* lavc/h263dsp: R-V V {h,v}_loop_filter (Rémi Denis-Courmont, 2024-05-22, 3 files, -0/+143)
  Since the horizontal and vertical filters are identical except for a
  transposition, this uses a common subprocedure with an ad-hoc ABI. To
  preserve return-address stack prediction, a link register has to be
  used (c.f. the "Control Transfer Instructions" from the RISC-V ISA
  Manual). The alternate/temporary link register T0 is used here, so
  that the normal RA is preserved (something Arm cannot do!).
  To load the strength value based on `qscale`, the shortest possible
  and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic
  LLA; ADD; LBU sequence would add one more instruction since LLA is a
  convenience alias for AUIPC; ADDI. To ensure that this trick works,
  relocation relaxation is disabled.
  To implement the two signed divisions by a power of two toward zero:
    (x / (1 << SHIFT))
  the code relies on the small range of integers involved, computing:
    (x + (x >> (16 - SHIFT))) >> SHIFT
  rather than the more general:
    (x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT
  Thus one ANDI instruction is avoided.
  T-Head C908:
    h263dsp.h_loop_filter_c:       228.2
    h263dsp.h_loop_filter_rvv_i32: 144.0
    h263dsp.v_loop_filter_c:       242.7
    h263dsp.v_loop_filter_rvv_i32: 114.0
  (C is probably worse in real use due to less predictable branches.)
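  The division shortcut can be modelled in scalar C. The sketch below
  (`div_toward_zero` is an illustrative name, not FFmpeg code) assumes,
  as the commit does, that |x| < 1 << (16 - SHIFT), so that a logical
  shift of the 16-bit value yields exactly the rounding bias:

```c
#include <assert.h>
#include <stdint.h>

/* Signed division by 1 << shift, rounding toward zero.
 * For negative x with |x| < 1 << (16 - shift), the top `shift` bits of
 * the 16-bit two's-complement value are all ones, so the logical right
 * shift by (16 - shift) produces the bias (1 << shift) - 1; for
 * non-negative x it produces 0. This avoids the ANDI of the general
 * form. (The final >> on a negative value is arithmetic on all
 * mainstream compilers.) */
static int16_t div_toward_zero(int16_t x, int shift)
{
    uint16_t bias = (uint16_t)x >> (16 - shift);
    return (int16_t)((int16_t)(x + bias) >> shift);
}
```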
* lavc/vp9dsp: R-V V mc avg (sunyuechi, 2024-05-21, 3 files, -1/+78)
  C908:
    vp9_avg4_8bpp_c:        1.2
    vp9_avg4_8bpp_rvv_i64:  1.0
    vp9_avg8_8bpp_c:        3.7
    vp9_avg8_8bpp_rvv_i64:  1.5
    vp9_avg16_8bpp_c:       14.7
    vp9_avg16_8bpp_rvv_i64: 3.5
    vp9_avg32_8bpp_c:       57.7
    vp9_avg32_8bpp_rvv_i64: 10.0
    vp9_avg64_8bpp_c:       229.0
    vp9_avg64_8bpp_rvv_i64: 31.7
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* Revert "lavc/sbrdsp: R-V V neg_odd_64" (Rémi Denis-Courmont, 2024-05-21, 2 files, -22/+0)
  While this function can easily be written with vectors, it just fails
  to get any performance improvement. For reference, this is a simpler
  loop-free implementation that does get better performance than the
  current one depending on hardware, but still more or less the same
  metrics as the C code:
    func ff_sbr_neg_odd_64_rvv, zve64x
            li       a1, 32
            addi     a0, a0, 7
            li       t0, 8
            vsetvli  zero, a1, e8, m2, ta, ma
            li       t1, 0x80
            vlse8.v  v8, (a0), t0
            vxor.vx  v8, v8, t1
            vsse8.v  v8, (a0), t0
            ret
    endfunc
  This reverts commit d06fd18f8f4c6a81ef94cbb600620d83ad51269d.
* lavc/vc1dsp: R-V V vc1_unescape_buffer (Rémi Denis-Courmont, 2024-05-21, 2 files, -0/+55)
  Notes:
  - The loop is biased toward no escape bytes, as that should be most
    common.
  - The input byte array is slid rather than the (8 times smaller)
    bit-mask, as RISC-V V does not provide a bit-mask (or bit-wise)
    slide instruction.
  - There are two comparisons with 0 per iteration, for the same reason.
  - In case of match, bytes are copied until the first match, and the
    loop is restarted after the escape byte. Vector compression
    (vcompress.vm) could discard all escape bytes but that is slower if
    escape bytes are rare.
  Further optimisations should be possible, e.g.:
  - processing 2 bytes fewer per iteration to get rid of 2 slides,
  - taking a short cut if the input vector contains less than 2 zeroes.
  But this is a good starting point:
  T-Head C908:
    vc1dsp.vc1_unescape_buffer_c:       12749.5
    vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
  SpacemiT X60:
    vc1dsp.vc1_unescape_buffer_c:       11038.0
    vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
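  For context, a scalar model of the unescaping being vectorised could
  look like the sketch below. `unescape_sketch` and its exact skip
  condition are illustrative assumptions (a 0x03 byte after two zero
  bytes and before a byte below 0x04 is treated as an
  emulation-prevention byte), not FFmpeg's reference implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy src to dst, dropping assumed VC-1 escape bytes: 0x03 preceded by
 * 00 00 and followed by a byte < 0x04. Returns the unescaped length. */
static size_t unescape_sketch(const uint8_t *src, size_t size, uint8_t *dst)
{
    size_t di = 0;
    for (size_t i = 0; i < size; i++) {
        if (i >= 2 && i + 1 < size &&
            src[i] == 0x03 && src[i - 1] == 0x00 && src[i - 2] == 0x00 &&
            src[i + 1] < 0x04)
            continue; /* escape byte: skip it */
        dst[di++] = src[i];
    }
    return di;
}
```

  The vector version scans whole vectors for the 00 00 pattern instead
  of testing byte by byte, which is where the speed-up comes from.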
* lavc/huffyuvdsp: optimise RVV vtype for add_hfyu_left_pred_bgr32 (Rémi Denis-Courmont, 2024-05-19, 2 files, -3/+6)
  T-Head C908:
    add_hfyu_left_pred_bgr32_c:       237.5
    add_hfyu_left_pred_bgr32_rvv_i32: 173.5 (before)
    add_hfyu_left_pred_bgr32_rvv_i32: 110.0 (after)
* lavc/flacdsp: optimise RVV vector type for lpc32 (Rémi Denis-Courmont, 2024-05-19, 2 files, -12/+15)
  This is pretty much the same as for lpc16, though it only improves
  half as large prediction orders. With 128-bit vectors, this gives:
         C       V old   V new
    1    69.2    181.5    95.5
    2   107.7    180.7    95.2
    3   145.5    180.0   103.5
    4   183.0    179.2   102.7
    5   220.7    178.5   128.0
    6   257.7    194.0   127.5
    7   294.5    193.7   126.7
    8   331.0    193.0   126.5
  Larger prediction orders see no significant changes at that size.
* lavc/flacdsp: optimise RVV vector type for lpc16 (Rémi Denis-Courmont, 2024-05-19, 2 files, -3/+4)
  This calculates the optimal vector type value at run-time based on
  the hardware vector length and the FLAC LPC prediction order. In this
  particular case, the additional computation is easily amortised over
  the loop iterations:
  T-Head C908:
         C       V before  V after
    1    48.0    214.7      95.2
    2    64.7    214.2      94.7
    3    79.7    213.5      94.5
    4    96.2    196.5      94.2 #
    5   111.0    195.7     118.5
    6   127.0    211.2     102.0
    7   143.7    194.2     101.5
    8   175.7    193.2     101.2 #
    9   176.2    224.2     126.0
   10   191.5    192.0     125.5
   11   224.5    191.2     124.7
   12   223.0    190.2     124.2
   13   239.2    189.5     123.7
   14   253.7    188.7     139.5
   15   286.2    188.0     122.7
   16   284.0    187.0     122.5 #
   17   300.2    186.5     186.5
   18   314.0    185.5     185.7
   19   329.7    184.7     185.0
   20   343.0    184.2     184.2
   21   358.7    199.2     183.7
   22   371.7    182.7     182.7
   23   387.5    181.7     182.0
   24   400.7    181.0     181.2
   25   431.5    180.2     196.5
   26   443.7    195.5     196.0
   27   459.0    178.7     196.2
   28   470.7    177.7     194.2
   29   470.0    177.0     193.5
   30   481.2    176.2     176.5
   31   496.2    175.5     175.7
   32   507.2    174.7     191.0 #
  # Power of two boundary.
  With 128-bit vectors, improvements are expected for the first two
  test cases only. For the other two, there is overhead but below
  noise. Improvements should be better observable with prediction order
  of 8 and less, or on hardware with larger vector sizes.
* lavc/vp9_intra: fix another .irp use with LLVM as (Rémi Denis-Courmont, 2024-05-19, 1 file, -1/+1)
* lavc/vp9_intra: fix .irp use with LLVM as (Rémi Denis-Courmont, 2024-05-19, 1 file, -14/+14)
* lavc/vp8dsp: fix .irp use with LLVM as (Rémi Denis-Courmont, 2024-05-19, 1 file, -2/+2)
* lavc/startcode: add R-V V startcode_find_candidate (Rémi Denis-Courmont, 2024-05-19, 4 files, -8/+62)
* lavc/startcode: add R-V Zbb startcode_find_candidate (Rémi Denis-Courmont, 2024-05-19, 4 files, -2/+130)
  The main loop processes 8 bytes in 5 instructions. For comparison,
  the optimal plain strnlen() requires 4 instructions per byte (6.4x
  worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code involves
  5 instructions per byte (8x worse). Actual benchmarks may be slightly
  less favourable due to latency from ORC.B to BNE.
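  The word-at-a-time zero-byte test that ORC.B enables has a well-known
  portable SWAR analogue, sketched here purely for illustration of the
  idea (this is not the committed assembly):

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit word contains a zero byte iff subtracting 0x01 from every
 * byte borrows into a byte whose high bit was clear:
 * (v - 0x0101..) & ~v & 0x8080.. is non-zero exactly then. The Zbb
 * ORC.B instruction expresses the same per-byte "is non-zero" test in
 * one instruction, which is how the scan reaches 8 bytes per
 * iteration. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
```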
* lavc/vp9dsp: R-V V ipred tm (sunyuechi, 2024-05-17, 3 files, -0/+130)
  C908:
    vp9_tm_4x4_8bpp_c:         116.5
    vp9_tm_4x4_8bpp_rvv_i32:   43.5
    vp9_tm_8x8_8bpp_c:         416.2
    vp9_tm_8x8_8bpp_rvv_i32:   86.0
    vp9_tm_16x16_8bpp_c:       1665.5
    vp9_tm_16x16_8bpp_rvv_i32: 187.2
    vp9_tm_32x32_8bpp_c:       6974.2
    vp9_tm_32x32_8bpp_rvv_i32: 625.7
  Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
* lavc/flacdsp: R-V V flac_wasted33 (Rémi Denis-Courmont, 2024-05-17, 2 files, -0/+36)
  T-Head C908:
    flac_wasted_33_c:       786.2
    flac_wasted_33_rvv_i64: 486.5