diff options
author | Martin Storsjö <martin@martin.st> | 2017-01-10 00:15:15 +0200 |
---|---|---|
committer | Michael Niedermayer <michael@niedermayer.cc> | 2017-01-14 21:13:30 +0100 |
commit | 388f6e6715ff6f84efc858f69530867a235fe909 (patch) | |
tree | e42cb45fd37e7abef1ac9e0aea93499ecd3acc9d /tests/checkasm | |
parent | ecd343aa1ffbbfddac937b708d111c88159ee5f9 (diff) | |
download | ffmpeg-388f6e6715ff6f84efc858f69530867a235fe909.tar.gz |
arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3
By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8
vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9
vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1
I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.
In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
This is cherrypicked from libav commit
9c8bc74c2b40537b0997f646c87c008042d788c2.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Diffstat (limited to 'tests/checkasm')
-rw-r--r-- | tests/checkasm/vp9dsp.c | 6 |
1 files changed, 4 insertions, 2 deletions
diff --git a/tests/checkasm/vp9dsp.c b/tests/checkasm/vp9dsp.c index f32b97cfa3..481730966b 100644 --- a/tests/checkasm/vp9dsp.c +++ b/tests/checkasm/vp9dsp.c @@ -334,8 +334,10 @@ static void check_itxfm(void) // skip testing sub-IDCTs for WHT or ADST since they don't // implement it in any of the SIMD functions. If they do, // consider changing this to ensure we have complete test - // coverage - for (sub = (txtp == 0 && tx < 4) ? 1 : sz; sub <= sz; sub <<= 1) { + // coverage. Test sub=1 for dc-only, then 2, 4, 8, 12, etc, + // since the arm version can distinguish them at that level. + for (sub = (txtp == 0 && tx < 4) ? 1 : sz; sub <= sz; + sub < 4 ? (sub <<= 1) : (sub += 4)) { if (check_func(dsp.itxfm_add[tx][txtp], "vp9_inv_%s_%dx%d_sub%d_add_%d", tx == 4 ? "wht_wht" : txtp_types[txtp], |