author | Michael Niedermayer <michaelni@gmx.at> | 2011-06-16 05:25:18 +0200 |
committer | Michael Niedermayer <michaelni@gmx.at> | 2011-06-16 06:29:01 +0200 |
commit | 56629aa0127e7f8f2f5dad3ebe794424b51afd64 (patch) | |
tree | 3882ab613274639d8f30e48d15e04cff2cf8a3fb /libswscale | |
parent | 33651e3edffcf404e7dce33395b31b1113e5d4b2 (diff) | |
parent | 7a02527b05e2ae5ffab579062dbe3c888758335f (diff) | |
download | ffmpeg-56629aa0127e7f8f2f5dad3ebe794424b51afd64.tar.gz |
Merge branch 'master' into oldabi
* master:
mmsh: fixed printf injection bug in mmsh request
ac3enc: use correct alignment and length in channel coupling dsp functions.
ffmpeg: don't abuse a global for passing framerate from input to output
ffmpeg: don't abuse a global for passing channels from input to output
ffmpeg: don't abuse a global for passing samplerate from input to output
Make buffer size check consistent and avoid a possible overflow.
Fix spelling.
Full support for sending H.264 in RTP
ARM: update ff_h264_idct8_add4_neon for 4:4:4 changes
swscale: use SwsContext for av_log when available
Support reading chan atoms with empty channel descriptions.
Reindent after last commit.
Fix multi-channel AAC encoding.
Fix "redundant redeclaration" warning.
Fix compilation with --disable-everything --enable-encoder=ac3/ac3_fixed.
vf_mp: Fix large memleak.
swscale: Remove HAVE_MMX from files that are only compiled with MMX enabled.
swscale: Fix compilation with --disable-mmx2.
mjpegenc: Fix JFIF version
swscale: remove misplaced comment.
ffmpeg: fix streaming to ffserver.
swscale: split out RGB48 output functions from yuv2packed[12X]_c().
build: move vpath directives to main Makefile
swscale: fix JPEG-range YUV scaling artifacts.
build: move ALLFFLIBS to a more logical place
ARM: factor some repetitive code into macros
CrystalHD: Use mp4toannexb bitstream filter.
CrystalHD: Keep mp4toannexb filter around for entire decoder lifetime.
Fix SVQ3 after adding 4:4:4 H.264 support
H.264: fix CODEC_FLAG_GRAY
4:4:4 H.264 decoding support
matroskadec: properly decode color space in an endian neutral way
matroskadec: use a temporary fourcc variable
matroskaenc: ensure the written colorspace doesn't depend on host endianness
ac3enc: fix allocation of floating point samples.
utils: Drop pointless '#if 1' preprocessor directive.
ac3enc: remove empty ac3_float function that is never called
ac3enc: split templated float vs. fixed functions into a separate file.
ac3enc: dynamically allocate AC3EncodeContext fields windowed_samples and mdct
ac3enc: use function pointer to choose between AC-3 and E-AC-3 header output functions.
Roll back 4:4:4 H.264 for now. Needs some ARM/PPC asm modifications.
Fix SVQ3 after adding 4:4:4 H.264 support
H.264: fix CODEC_FLAG_GRAY
4:4:4 H.264 decoding support
h264_parser: Fix whitespace after previous change.
h264_parser: Fix behaviour when PARSER_FLAG_COMPLETE_FRAMES is set.
wav: remove an invalid free().
lavf: initialise reference_dts in av_estimate_timings_from_pts.
h264: don't be so picky on decoding pps in extradata.
avcodec.h: add or elaborate on some documentation comments.
h264: change a few comments into error messages
ac3dec: fix doxy-style for comment ("///>" should be "///<" instead).
img2: add .dpx to the list of supported file extensions.
ffv1: fix undefined behavior with insane widths.
replace remaining usage of deprecated av_metadata_set2() by av_dict_set()
matroskaenc: write colourspace element for rawvideo tracks
nsv: simplify probe function
nsv: return error code instead of discarding it in read_header()
ARM: jrevdct_arm: simplify stack usage
ARM: jrevdct_arm: use push/pop mnemonics
ARM: jrevdct_arm: misc cleanup
ARM: optimised mpadsp_apply_window_fixed
Add some (important) changelog entries
H264: Reduce pointless diffs to qatar
Revert "H264: Split out hl_motion and template it, this seems a bit faster"
libavfilter: implement avfilter_fill_frame_from_video_buffer_ref()
avfiltergraph: make the AVFilterInOut alloc/free API public
avfiltergraph: change the syntax of avfilter_graph_parse()
graphparser: prefer void * over AVClass * for log contexts
h264: Complexify frame num gap shortening code
Update todo
mpeg12: replace 2 asserts by av_assert0
cmdutils: add missing NULL check in parse_options()
x11grab: remove a memory allocation and the associated memcpy.
Fix --disable-everything
build: fix "make install" with documentation disabled
build: simplify some conditional targets
resample: clarify supported resampling.
lavfi: fix signature for avfilter_graph_parse() and avfilter_graph_config()
avfiltergraph: use meaningful error codes
Revert "ac3: there was no libav in 2010 thus this code cannot be from libav."
Fix -t option for formats which holds dts and no pts
dnxhd: Rename tables
Extract rotation in MOV metadata
bitstream: Properly promote av_reverse values before shifting.
pixfmt: Replace 9/10bit deprecation by a technical explanation.
libavutil/swscale: YUV444P10/YUV444P9 support.
H.264: Fix high bit depth explicit biweight
h264: Fix 10-bit H.264 x86 chroma v loopfilter asm.
Replace DEBUG_SEEK/DEBUG_SI + av_log combinations by av_dlog.
Update copyright year for ac3enc_opts_template.c.
adts: Adjust frame size mask to follow the specification.
APIchanges: fill hash for the avfilter_get_audio_buffer_ref_from_arrays addition
lavfi: avfilter_merge_formats: handle case where inputs are same
lavfi: use avfilter_get_audio_buffer_ref_from_arrays() in defaults.c
lavfi: implement avfilter_get_audio_buffer_ref_from_arrays()
APIchanges: remove duplicated entry
APIchanges: fill in dates and numbers
APIchanges: remove duplicated entry
APIchanges: correctly interleave entries
APIchanges: add entry for av_force_cpu_flags() addition
lavf: bump minor after the addition of fps_probe_size to AVFormatContext
lavc: bump minor after the addition of AVCodecContext.request_sample_fmt
movenc: Add RTP muxer/hinter options
movenc: Pass the RTP AVFormatContext to the SDP generation
rtspenc: Add RTP muxer options
rtspenc: Add an AVClass for setting muxer specific options
rtpenc_chain: Pass the rtpflags options through to the chained muxer
rtpenc: Declare the rtp flags private AVOptions in rtpenc.h
sdp: Reindent after the previous commit
rtpenc: MP4A-LATM payload support
avoptions: Add an av_opt_flag_is_set function for inspecting flag fields
sdp: Allow passing an AVFormatContext to the SDP generation
mov: Fix wrong timestamp generation for fragmented movies that have time offset caused by the first edit list entry.
mpeg12: more advanced ffmpeg mpeg2 aspect guessing code.
ac3: there was no libav in 2010 thus this code cannot be from libav.
swscale: split YUYV output out of yuv2packed[12X]_c().
dict: This code was developed in ffmpeg and not libav, nor by libav developers. Correct copyright notices.
lavf: make compute_pkt_fields2() return meaningful error values
matroskadec: set timestamps for RealAudio packets.
intelh263dec: aspect ratio processing fix.
intelh263dec: fix "Strict H.263 compliance" file playback
oss,sndio: simplify by using FFMIN.
swscale: extract monowhite/black output from yuv2packed[12X]_c().
swscale: de-macro'ify RGB15/16/32 input functions.
swscale: rearrange code.
movdec: Add support for the 'wfex' atom.
ffmpeg.c: Add a necessary const qualifier
riff: Fix potential memleak.
swscale: change 48bit RGB input macros to inline functions.
swscale: change 9/10bit YUV input macros to inline functions.
swscale: extract gray16 output functions from yuv2packed[12X]().
swscale: use standard clipping functions.
swscale: merge macros that are used only once.
swscale: fix function declarations in swscale.c.
swscale: fix function declaration keywords in x86/swscale_template.c.
jpegdec: actually search for and parse RSTn
crypto: Use av_freep instead of av_free
Revert "crypto: fix potential double free"
Revert "build: remove empty $(OBJS) target"
crypto: Use av_freep instead of av_free
aac: fix adts frame size mask, fix demuxer probing for some files.
lavf: don't try to free private options if priv_data is NULL.
lavfi: handle NULL lists in avfilter_make_format_list
swscale: fix types of assembly arguments.
swscale: move two macros that are only used once into caller.
swscale: remove unused function.
Fix "mixed declarations and code" warnings.
options: Add missing braces around struct initializer.
mov: Remove leftover crufty debug statement with references to a local file.
dvbsubdec: Fix compilation of debug code.
Remove all uses of now deprecated metadata functions.
Move metadata API from lavf to lavu.
crypto: fix potential double free
libx264: fix double free
ffplay: remove -debug option
ffplay: remove -vismv option
mpegvideo: use av_get_picture_type_char() in ff_print_debug_info()
Remove some non-compiling debug messages.
ffplay: Fix non-compiling debug printf and replace it by av_dlog.
H264: x86 predict init cosmetics.
ac3enc: Fix linking of AC-3 encoder without the E-AC-3 encoder.
Move E-AC-3 encoder functions to a separate eac3enc.c file.
ac3enc: remove convenience macro, #define DEBUG
ac3enc: remove unused #define
vc1: re-initialize tables after width/height change.
APIchanges: fill-in git commit hash for av_get_bytes_per_sample() addition
samplefmt: add av_get_bytes_per_sample()
libvpxenc: add forgotten AVClass.
iirfilter: fix biquad filter coefficients.
swscale: remove duplicate conversion routine in swScale().
swscale: add yuv2planar/packed function typedefs.
swscale: integrate yuv2nv12X_C into yuv2yuvX() function pointers.
swscale: reindent x86 init code.
swscale: extract SWS_FULL_CHR_H_INT conditional into init code.
swscale: cosmetics.
swscale: remove alp/chr/lumSrcOffset.
swscale: un-special-case yuv2yuvX16_c().
shorten: Remove stray DEBUG #define and corresponding av_dlog statement.
vorbisdec: Restore mistakenly removed debug output.
v4l2: set default standard to NULL
sws: make dither_scale const
configure: Document --enable-vdpau.
Replace some av_log/printf + #ifdef combinations by av_dlog.
Replace some nonstandard DEBUG_* preprocessor directives by plain DEBUG.
svq1dec: Fix debug statements that referenced non-existing context.
Replace some printf instances in debug code by av_log.
showfiltfmts: use av_get_pix_fmt_name()
inverse.c: Replace unnecessary intmath.h header by necessary stdint.h.
Drop unnecessary directory prefixes from #include directives.
Makefile: critical build fix after the merge. make fate passed locally due to ffmpeg/ffmpeg_g being there from before
ffplay: Fix -vismv
mem: Trying to workaround posix_memalign() bug on OSX
build: remove empty $(OBJS) target
build: make rule for linking ff* apply only to these targets
eval: add support for pow() function
build: rearrange some lines in a more logical way
s302m: fix resampling for 16 and 24bits.
ARM: remove MUL64 and MAC64 inline asm
build: clean up .PHONY lists
build: move all (un)install* target aliases to toplevel Makefile
flvenc: propagate error properly
build: remove stale dependency
build: do not add CFLAGS-yes to CFLAGS
utils.c: fix crash with threading enabled.
configure: simplify source_path setup
configure: remove --source-path option
pixdesc: remove duplicated header inclusion
lavfi: use av_samples_alloc() in avfilter_default_get_audio_buffer()
lavfi: prefer nb_samples over size in AVFilterBufferRefAudioProps
samplefmt: switch nb_channels/nb_samples params order in av_samples_alloc()
samplefmt: change layout for arrays created by av_samples_alloc() and _fill_arrays()
lavf: deprecate AVFormatParameters.time_base.
img2: add framerate private option.
img2: add video_size private option.
img2: add pixel_format private option.
tty: add framerate private option.
Move code for "ffmpeg: fix massive leak occurring when seeking" / e4841a404bdabfeafb917454d510b60d888cb761 elsewhere
lavf: remove reference to output-example in Makefile
vsrc_buffer: add flags param to av_vsrc_buffer_add_video_buffer_ref
Remove some unused scripts from tools/.
Add x86 assembly for some 10-bit H.264 intra predict functions.
v4l2: do not force NTSC as standard
Add const to avfilter_get_video_buffer_ref_from_arrays arguments.
Skip tableprint.h during 'make checkheaders'.
Remove unnecessary LIBAVFORMAT_BUILD #ifdef.
Drop explicit filenames from @file Doxygen tags.
Skip generated table headers during 'make checkheaders'.
lavf,lavc: free avoptions in a generic way.
AVOptions: add av_opt_free convenience function.
sdl: align option fields after last commit
ffmpeg: fix massive leak occurring when seeking
ffprobe: implement -i FILE option
tableprint: Restore mistakenly deleted common.h #include for FF_ARRAY_ELEMS.
ffplay.texi: document -i FILE option
cmdutils: remove unnecessary OPT_DUMMY implementation
cmdutils: change the signature of the function argument in parse_options()
sdl: use the filename for defining the window title, if not specified
tiff: print log in case of unknown / unsupported tag.
tiff: fix linesize for mono-white/black formats.
Fix build of eval-test program
configure: Document --enable-vaapi
swscale: override the lack of the accurate rounding flag when needed for dither.
swscale: factor should_dither out
ac3enc: extract all exponents for the frame at once
ARM: remove MULL inline asm
mathops: use MUL64 macro where it forms part of other ops
tty: factorise returning error codes.
rawdec: add framerate private option.
x11grab: add framerate private option.
fbdev,v4l2: remove some forgotten uses of AVFormatParameters.time_base.
bktr: don't error when AVFormatParameters.time_base isn't set.
cmdutils: add missing const qualifier
Skip headers not designed to work standalone during 'make checkheaders'.
Add missing #includes to make headers self-contained.
musepack: remove unnecessary #include from mpcdata.h
musepack: remove extraneous mpcdata.h inclusions
Fix error check in av_file_map()
udp: support old, crappy non pthread mode
ffmpeg: use opt_acodec when setting audio codec in opt_target.
ffmpeg: fix segfault with too many output files
ffplay: error out with invalid sample rate or channels.
oggdec: fix Ticket185
build: simplify commands for clean target
j2kdec: don't fail on non-zero cblock style.
makefile: fix j2k encoder dependencies
udp: fix indention
swscale: split swscale.c in unscaled and generic conversion routines.
swscale: cosmetics.
swscale: integrate (literally) swscale_template.c in swscale.c.
swscale: split out x86/swscale_template.c from swscale.c.
swscale: enable hScale_altivec_real.
swscale: split out ppc _template.c files from main swscale.c.
swscale: remove indirections in ppc/swscale_template.c.
swscale: split out unscaled altivec YUV converters in their own file.
mpegvideoenc: fix multislice fate tests with threading disabled.
cmdutils: move "#undef main" from ffplay.c to cmdutils.h
wav: update size check for ds64
wav: fix skip size at end of ds64 chunk
mpegts: Wrap #ifdef DEBUG and av_hex_dump_log() combination in a macro.
build: Simplify texi2html invocation through the --output option.
Mark some variables with av_unused
Replace avcodec_get_pix_fmt_name() by av_get_pix_fmt_name().
svq3: Check negative mb_type to fix potential crash.
svq3: Move svq3-specific fields to their own context.
rawdec: initialize return value to 0.
Remove unused get_psnr() prototype
rawdec: don't leak option strings.
bktr: get default framerate from video standard.
swscale: remove unused COMPILE_TEMPLATE_ALTIVEC.
h264 fill_filter_caches: Don't init chroma nnz_cache.
In print_report, print progression time in hours:mins:secs:us
ffmpeg: In print_report, use int64_t for pts to check for 0 and avoid inf value for bitrate.
In libswscale, use all lines when converting from 422p to rgb with mmx, improve quality.
Replace custom DEBUG preprocessor trickery by the standard one.
vorbis: Remove non-compiling debug statement.
vorbis: Remove pointless DEBUG #ifdef around debug output macros.
cook: Remove non-compiling debug output.
Remove pointless #ifdefs around function declarations in a header.
Replace #ifdef + av_log() combinations by av_dlog().
Replace custom debug output functions by av_dlog().
cook: Remove unused debug functions.
lavfi: add avfilter_link_free() function
swscale: reintroduce sws_format_name() symbol
Remove stray extra arguments from av_dlog() invocations.
targa: fix big-endian build
v4l2: remove one forgotten use of AVFormatParameters.pix_fmt.
vfwcap: add a framerate private option.
v4l2: add a framerate private option.
libdc1394: add a framerate private option.
fbdev: add a framerate private option.
bktr: add a framerate private option.
oma: check avio_read() return value
nutdec: remove unused variable
Remove unused variables
swscale: dither for planar yuv outputs
swscale: Fix use of uninitialized values (bug probably introduced from a merge of libav)
cpudetect: add av_force_cpu_flags()
swscale: allocate larger buffer to handle altivec overreads.
H264/MPEG frame-level multi-threading.
vsrc_buffer: propagate error code in av_vsrc_buffer_add_frame()
lavfi: reindent after the previous commit
lavfi: add braces around the block of an if() expression in avfilter_default_get_video_buffer
lavfi: clarify the context of a comment in avfilter_default_get_video_buffer()
lavfi: apply misc style fixes
Cosmetic changes to h264_idct_10bit.asm.
2x faster h264_idct_add8_10.
aacenc: Add stereo_mode option.
h264: remove CONFIG_GPL from x86 intra prediction code.
doc: cosmetics: libx264 typos
postprocess: Remove test for impossible condition (was: Re: postprocess.c: replace check for p==NULL with *p==0)
Fix various uninitialized variable warnings
Port remove of get_sws_cpuflags from MPlayer's libmpcodecs.
Replace "vector const" by "const vector" otherwise gcc 4.6.0 fails.
Port recent changes to MPlayer libmpcodecs.
Replace non-existent HAVE_SSE2 with HAVE_SSE.
Simplify code and avoid compiler warning about incompatible types.
Fix type of out[] variable, it should not be const.
ARM: ac3dsp: optimised update_bap_counts()
mpegaudiodec: Fix av_dlog() invocation.
swscale: fix compilation of bfin due to missing pixdesc.h header
lavf: tag dump_format() as @deprecated
yuv4mpeg: complain and exit if a non-rawvideo stream is selected
ffmpeg: handle copy of packets for AVFMT_RAWPICTURE output formats
doc/examples: give meaningful names to the example files
h264/10bit: add HAVE_ALIGNED_STACK checks.
swscale: More accurate rounding in YSCALE_YUV_2_PACKEDX_FULL_C()
Update 8-bit H.264 IDCT function names to reflect bit-depth.
Add IDCT functions for 10-bit H.264.
mpegaudioenc: Fix broken av_dlog statement.
Employ correct printf format specifiers, mostly in debug output.
ARM: fix MUL64 inline asm for pre-armv6
doc: add libvpx encoder section
vf_drawtext: Replace FFmpeg by Libav in license boilerplate.
mpegaudiodec: remove unused code and variables
postprocess.c: filter name needs to be double 0 terminated
improved 'edts' atom writing support
mpegaudio: clean up compute_antialias() definition
vp8: fix segmentation race during frame-threading.
Port libmpcodec fixes from MPlayer.
Merge remote-tracking branch 'ffmpeg-mt/master'
swscale: Remove unused variable.
ARM: simplify inline asm with 64-bit operands
Add "const" to avoid "initialization discards qualifiers" warning.
Add const to fix "cast discards qualifiers" warnings.
Include pixdesc.h for av_get_pix_fmt_name.
wav: Don't avio_seek() if we know we'll run into EOF
api-example: uppercase first letter in "copyright"
output-example: create @file doxy from text in the copyright header
examples: move API examples to a dedicated dir in doc
ffmpeg: simplify opt_*_codec() options
v4l2: rewrite code iterating the supported standards
v4l2: perform minor style fixes
v4l2: replace memset() with explicit struct initialization
rawdec: fail in case of unknown pixel format
swscale: remove sws_format_name()
error.c: fix compile flags
TCP: change default timeout to 5sec
Revert "Timeout TCP open() after 5 seconds."
Fix various unused variable warnings
Fix various bad printf format warnings
ARM: enable UAL syntax in asm.S
Remove now unused nb_istreams variable.
Add const to vector types for input in altivec code.
Remove unused variable, avoiding compiler warning.
Cast pointers to uintptr_t rather than unsigned int.
v4l2: don't leak video standard string on error.
swscale: Remove disabled code.
avfilter: Surround function only used in debug mode by appropriate #ifdef.
vf_crop: Replace #ifdef DEBUG + av_log() by av_dlog().
build: remove BUILD_ROOT variable
vp8: use av_clip_uintp2() where possible
swscale: Commits that could not be pulled earlier due to bugs #2
Commits that could not be pulled earlier due to bugs.
Revert 1a5e4fd8c5b99478b4e08a69261930bb12aa948b for postproc. This broke the code
doc: correct AC-3 option subsection placement
ac3enc: fix LOCAL_ALIGNED usage in count_mantissa_bits()
ac3dsp: do not use the ff_* prefix when referencing ff_ac3_bap_bits.
swscale: use av_clip_uint8() in yuv2yuv1_c().
swscale: replace formatConvBuffer[VOF] by allocated array.
v4l2: create @file doxy from text in the copyright header
v4l2: remove pointless empty lines
v4l2: set default standard to NULL
v4l2: use OFFSET macro when setting options
ac3dsp: fix loop condition in ac3_update_bap_counts_c()
ARM: unbreak build
lavdev: add SDL output device
ac3enc: modify mantissa bit counting to keep bap counts for all values of bap instead of just 0 to 4.
ac3enc: split mantissa bit counting into a separate function.
ac3enc: store per-block/channel bap pointers by reference block in a 2D array rather than in the AC3Block struct.
lavu: add av_get_pix_fmt_name() convenience function
iff: remove duplicated file description
cmdutils: remove OPT_FUNC2
get_bits: add av_unused tag to cache variable
sws: replace all long with int.
ARM: aacdec: fix constraints on inline asm
ARM: remove unnecessary volatile from inline asm
ARM: add "cc" clobbers to inline asm where needed
ARM: improve FASTDIV asm
ac3enc: use LOCAL_ALIGNED macro
APIchanges: fill in git hash for av_get_pix_fmt_name (0420bd7).
lavu: add av_get_pix_fmt_name() convenience function
cmdutils: remove OPT_FUNC2
swscale: fix crash in bilinear scaling.
vpxenc: add VP8E_SET_STATIC_THRESHOLD mapping
webm: support stereo videos in matroska/webm muxer
rgb2rgb: remove duplicate mmx/mmx2/3dnow/sse2 functions.
swscale: reindent h[cy]scale_fast() and updateDitherTables().
swscale: reformat x86/swscale_template.c.
swscale: remove duplicate mmx/mmx2 functions if they are identical.
swscale: remove if (c->dstFormat) branch from yuv2packed[12X]().
swscale: remove if(full_chr_int) from yuv2packed1().
swscale: remove if(accurate_rnd) branch from functions.
swscale: revive SWS_CPU_CAPS until next major bump.
swscale: Remove commented-out printf cruft.
Export PCR pid
Export more transport stream information.
Output MPEG-TS stream identifiers.
lavf: deprecate AVFormatParameters.pix_fmt.
rawdec: add a pixel_format private option.
v4l2: add a pixel_format private option.
libdc1394: add a pixel_format private option.
cosmetics: indentation and alignment after previous commit
ac3enc: add support for E-AC-3 encoding.
ac3enc: Move AC-3 AVOptions array to a separate file to make it easier to use only selected options for the different AC-3 encoder types.
ARM: disable ff_vector_fmul_vfp on VFPv3 systems
ARM: check for VFPv3
swscale: Remove unused variables in x86 code.
doc: Drop DJGPP section, Libav now compiles out-of-the-box on FreeDOS.
x86: Add appropriate ifdefs around certain AVX functions.
cmdutils: use sws_freeContext() instead of av_freep().
swscale: delay allocation of formatConvBuffer().
swscale: fix build with --disable-swscale-alpha.
movenc: Deprecate the global RTP hinting flag, use a private AVOption instead
movenc: Add an AVClass for setting muxer specific options
libdc1394: choose best video mode and rate based on camera capabilities.
Remove support for libdc1394 < 2.0.
avopt: fix segfault
swscale: fix non-bitexact yuv2yuv[X2]() MMX/MMX2 functions.
swscale: don't lose precision on RGB/BGR48 input, that is, don't drop half the bits.
patch checklist: suggest --disable-yasm test.
lavdev: prefer the inclusion of avdevice.h over that of libavformat/avformat.h
lavdev: include libavformat/avformat.h in avdevice.h
fate.txt: replace FATE rsync command with a make command
configure: report yasm/nasm presence properly
tcp: make connect() timeout properly
rawdec: factor video demuxer definitions into a macro.
rtspdec: add initial_pause private option.
lavf: deprecate AVFormatParameters.width/height.
tty: add video_size private option.
rawdec: add video_size private option.
x11grab: add video_size private option.
x11grab: factorize returning error codes.
vfwcap: add video_size private option.
v4l2: add video_size private option.
v4l2: factorize returning error codes.
libdc1394: add video_size private option.
libdc1394: return meaningful error codes.
bktr: add video_size private option.
bktr: factorize returning error codes.
Fix memleak
Fix typo
Remove specific note when not specific
Minor cleanup in libx264.c
Add metadata conversion table to the wav demuxer
Fix 32bit rawvideo in avi on big-endian.
id3v2: Check malloc result. ID3v2 tags can be very large.
id3v2: Initialize tflags for version 2.2.
webm: Additional options/presets for VP8 encodes under FFmpeg
muxers: Add a flag to mark muxers that allow (non strict) monotone timestamps.
swscale: Do not lose precision on yuv values after rgb->yuv.
libx264: support aspect Ratio Switch
ARM: add ARMv6 optimised av_clip_uintp2
ARM: remove volatile from asm statements in libavutil/intmath
ARM: fix av_clipl_int32_arm()
v4l: include avdevice.h
ffserver: move close_connection() call to avoid a temporary string and copy.
lavf: initialize demuxer private options.
AVOptions: set string default values.
Fix compilation with YASM/NASM versions not supporting AVX.
lavdevice: mark v4l for removal on next major bump.
swscale: fix compile on ppc.
swscale: fix compile on x86-32.
build: Remove generated .version file on distclean.
configure: Add -D_GNU_SOURCE to CPPFLAGS on OS/2.
doc: Drop hint at --enable-memalign-hack for MinGW, it is now autodetected.
ffplay: Remove disabled code.
Mark parameterless function declarations as 'void'.
swscale: use av_clip_uint8() in yuv2yuv1_c().
swscale: remove VOF/VOFW.
swscale: split chroma buffers into separate U/V planes.
swscale: replace formatConvBuffer[VOF] by allocated array.
rgb2rgb: remove duplicate mmx/mmx2/3dnow/sse2 functions.
swscale: reindent h[cy]scale_fast() and updateDitherTables().
swscale: reformat x86/swscale_template.c.
swscale: remove duplicate mmx/mmx2 functions if they are identical.
swscale: remove if (c->dstFormat) branch from yuv2packed[12X]().
swscale: remove if(full_chr_int) from yuv2packed1().
swscale: remove if(accurate_rnd) branch from functions.
ffserver: Fix a null pointer dereference as a result of the FF_API_MAX_STREAMS cleanup.
libdc1394: fix compilation.
swscale: revive SWS_CPU_CAPS until next major bump.
swscale: Remove commented-out printf cruft.
ac3enc: initialize all coefficients to zero.
ffv1: fix 16bits multithreading
doc: create separate section for audio encoders
swscale: Remove orphaned, commented-out function declaration.
swscale: Eliminate rgb24toyv12_c() duplication.
mpegvideo_enc: use AV_LOG_ERROR instead of AV_LOG_INFO for two error messages
Fail when lowres value is lower than 0
Remove h263_msmpeg4 from MpegEncContext.
APIchanges: Fill in git hash for fps_probe_size (30315a8)
avformat: Add fpsprobesize as an AVOption.
swscale: document SWS_CPU_CAPS*
Revert removal of SWS flags from e66149e714006d099d1ebfcca3f22ca74fc7dcf4
avoptions: Return explicitly NAN or {0,0} if the option isn't found
rtmp: Reindent
rtmp: Don't try to do av_malloc(0)
swscale: remove duplication of rgb24toyv12_c()
Return -1 on invalid input instead of crashing.
vf_mp: fix name of the remove-logo filter referenced in filters.texi
tty: replace AVFormatParameters.sample_rate abuse with a private option.
Fix end time of last chapter in compute_chapters_end
ffmpeg: get rid of useless AVInputStream.nb_streams.
ffmpeg: simplify managing input files and streams
ffmpeg: purge redundant AVInputStream.index.
lavf: deprecate AVFormatParameters.channel.
libdc1394: add a private option for channel.
dv1394: add a private option for channel.
v4l2: reindent.
v4l2: add a private option for channel.
lavf: deprecate AVFormatParameters.standard.
v4l2: add a private option for video standard.
v4l: add a private option for video standard.
dv1394: add a private option for video standard.
bktr: add a private option for video standard.
lavf: deprecate AVFormatParameters.{channels,sample_rate}.
rawdec: add sample_rate/channels private options.
ALSA: add channels and sample_rate private options.
oss: add channels and sample_rate private options.
sndio: add channels and sample_rate private options.
lavf: deprecate AVFormatParameters.mpeg2ts_raw.
mpegts: add compute_pcr option.
lavf: add priv_class field to AVInputFormat.
lavfi: add select filter
eval: implement not() expression
vsrc_buffer: return an error code if no frames are available
ffmpeg: handle the case when get_filtered_frame() fails
indeo3: add out-of-buffer write check
Add reading of disc number to mov.c
Fix end time of last chapter in compute_chapters_end().
Do not reset channel_layout to 0.
vsrc_buffer: remove duplicated file description
Merge swscale bloatup. This will be cleaned up in the next merge.
swscale: MMX optimization of hscale16()
swscale: don't lose bits on planar >8bit YUV and gray input.
swscale: Switch to Ronald's yuv2yuvX16inC_template(); it is very similar to Baptiste's and supports alpha
configure: enable memalign_hack automatically when needed
rawdec: fix decoding of QT WRAW files
matroska: improve declaration of video_stereo_* constant tables
matroskadec: fix reverted condition to accept combine_plane operation
Fix register types for LOAD_AB arguments, fixes compilation with NASM.
swscale: unbreak the build on non-x86 systems.
swscale: remove if(bitexact) branch from functions.
swscale: remove if(canMMX2BeUsed) conditional.
swscale: remove swScale_{c,MMX,MMX2} duplication.
swscale: use emms_c().
Move emms_c() from libavcodec to libavutil.
tiff: set palette in the context when specified in TIFF_PAL tag
rtsp: use strtoul to parse rtptime and seq values.
pgssubdec: fix incorrect colors.
dvdsubdec: fix incorrect colors.
ape: Allow demuxing of files with metadata tags.
swscale: remove dead macro WRITEBGR24OLD.
swscale: remove AMD3DNOW "optimizations".
swscale: remove duplicate code in ppc/ subdirectory.
swscale: remove duplicated x86/ functions.
swscale: force --enable-runtime-cpudetect and remove SWS_CPU_CAPS_*.
vsrc_buffer.h: add file doxy
vsrc_buffer: tweak error message in init()
wav: fix various printf warnings related to wrong argument type
wav: propagate ff_get_wav_header() error code in w64_read_header()
msmpeg4: reindent.
lavc: remove msmpeg4v1 encoder.
Remove avconfig.h and INCINSTDIRs on uninstall.
ac3enc: add channel coupling support
partial revert of 01d3ebaf219d83c0a70cdf9696ecb6b868e8a165
fate: reenable frext-pph10i4_panasonic_a after the bitstream has been fixed
avcodec_find_decoder: prefer non experimental decoders.
j2kdec: mark as CODEC_CAP_EXPERIMENTAL
j2k[c/h] j2kdec.c: Implement 2 code block styles
j2k: Add void as the parameter of function ff_j2k_init_tier1_luts
Add Kamil Nowosad's j2k code.
matroska: cleanup handling of video stereo mode
oggdec: use av_dlog()
mem: define the MAX_MALLOC_SIZE constant and use it in place of INT_MAX
configure: Add -U__STRICT_ANSI__ to CPPFLAGS on Cygwin and DOS.
muxers.texi changes for mkv/webm options
aacdec: fix typo in scalefactor clipping check
mpegaudio: Correct license header
add 5.1-to-stereo downmix to resample.c; this is based on the previous 6to2channel-resample.patch from ffmpeg2theora, updated to work with trunk and to use av_clip_int16.
fate: fix fate-h264-conformance-frext-pph10i4-panasonic-a crcs.
fate: update 9/10bit refs.
h264: Properly set coded_{width, height} when parsing H.264.
x86 asm: Add SECTION_TEXT to dct32_sse.asm.
Fix 9/10 bit in swscale.
Do not ask for samples if a specific channel layout was requested.
libx264: specify field for default union values in options
movdec: don't divide by zero when stts_data[0].duration = 0.
Fix ticket127
dct32: Replacing libav by ffmpeg in the license header with the authors permission. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
ffmpeg: Don't trigger url_interrupt_cb on the first signal
avoptions: Check the return value from av_get_number
lavf: fix style for avformat_alloc_output_context2()
lavf: deprecate avformat_alloc_output_context() in favor of avformat_alloc_output_context2()
lavfi: make vsrc_buffer.h header public
dct32_sse: eliminate some spills
Fix compilation with --disable-yasm.
Fix dct32() compilation with --disable-yasm
mpeg2dec: Fix lowres 3
lavfi: bump minor and add changelog entry after the split filter addition
vf_split: add documentation to filters.texi
vf_split: give more meaningful names to the output pads
vf_split: define draw_slice() before end_frame()
vf_split: add description
vf_split: fix various nits
wmadec: avoid infinite loop.
DirectShow capture: Fix build
ffmpeg: get rid of the -vglobal option.
dct32: Add AVX implementation of 32-point DCT
dct32: Change pass 6 permutation to allow for AVX implementation
dct32: port SSE 32-point DCT to YASM
matroska: switch stereo mode from int to string and add support in the demuxer too
matroska: cosmetics
Create a stereo_mode metadata tag to specify the stereo 3d video layout using the StereoMode tag in a matroska/webm video track.
libavfilter: vf_split from soc.
DirectShow capture support Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
multiple inclusion guard cleanup
avio: document buffer must created with av_malloc() and friends
avio: check AVIOContext malloc failure
swscale: point out an alternative to sws_getContext
svq3: Do initialization after parsing the extradata
Fix channel_layout documentation.
add changelog entries for 0.7_beta2
ffserver: don't just crash
fix ffserver's SIGSEGV
avoptions: Support getting flag values using av_get_int
preset dir for win32
Merge remote-tracking branch 'ffmpeg-mt/master'
Add a flag to disable side data merging.
Merge/split side data.
Encoding alac with more than two channels is not supported.
mp3lame: add #include required for AV_RB32 macro.
configure: make executable again
LATM/AAC: Free previously initialized context on reinit.
configure: Do not unconditionally add -Wall to host CFLAGS.
configure: Set OS/2 objformat to a.out.
Add support for a.out object format to assembler macros.
fate: disable threading for encoding
fate: add comment field
fate: allow overriding default build and install dirs
mpegtsenc: Add an AVClass pointer to the private data
mpegaudio: clean up #includes
mpegaudio: move all header parsing to mpegaudiodecheader.[ch]
vf_libopencv: prefer opencv/cxcore.h over cxtypes.h
decoders.texi: fix typos in rawvideo section
cmdutils: use const AVClass * when senseful
encoders.texi: add documentation for the libx264 encoder
decoders.texi: add documentation for rawvideo decoder and options
doc: add decoders.texi file
encoders.texi: decrease level for audio encoders section
ffprobe.texi: remove inclusion of muxers section
indeo3: release buffer in indeo3_decode_end()
indeo3: remove unnecessary includes
indeo3: add @file doxy and a link to multimedia wiki documentation
cmdutils: reset *picref_ptr to NULL in get_filtered_frame()
ffmpeg: remove useless NULL-check on avfilter_unref_buffer
libmp3lame: include "libavutil/intreadwrite.h" header
qdm2: Use floating point synthesis filter.
h264: correct border check.
h264: fix loopfilter with threading at slice boundaries.
Fix ff_mpa_synth_filter_fixed() prototype
Reindent
rtpenc_chain: Pass the MP4A_LATM flag to chained muxers
rtpenc: MP4A-LATM payload support
movenc: Pass AVFormatContext flags to the SDP generation
sdp: Allow passing AVFormatContext flags to the SDP generation
vsrc_buffer: document av_vsrc_buffer_add_video_buffer_ref()
vsrc_buffer: add av_vsrc_buffer_add_frame()
vsrc_buffer: fix example in docs, add mandatory parameters
vsrc_buffer: make the source accept sws_param in init
vsrc_buffer: propagate avfilter_open() error code
vsrc_buffer: fix style
lavfi: add avfilter_get_video_buffer_ref_from_frame to avcodec.h
vsrc_buffer: remove dependency on AVFrame
Rename costablegen.c ---> cos_tablegen.c.
Collapse tableprint.c into tableprint.h.
Simplify trig table rules
Remove potentially unstable filenames from comments in generated files.
Ignore generated tables and generated table generator programs.
Simplify CLEANFILES make variable by using wildcards.
Remove silly insults from avformat_version() Doxygen documentation.
mpegaudiodsp: fix x86 and ppc makefiles
configure: Adjust AVX assembler check.
mpegaudio: remove unused version of SAME_HEADER_MASK
mpegaudio: remove useless #undef at end of file
asfdec: add missing #include for av_bswap32()
mpegaudio: merge two #if CONFIG_FLOAT blocks
mpegaudio: move some struct definitions from mpegaudio.h
Move some mpegaudio functions to new mpegaudiodsp subsystem
Clean up #includes in cmdutils.h.
g729: Merge g729.h into g729dec.c.
av_find_stream_info: Print more details about max analyze duration failures.
10l: wrap float_interleave functions in HAVE_YASM.
Add a little description for -rc_override
APIchanges: fill in date and commit for request_sample_fmt
Add floating-point sample format support to the ac3, eac3, dca, aac, and vorbis decoders.
Add support for request_sample_format in ffmpeg and ffplay.
Add APIchanges entry for request_sample_fmt.
Add request_sample_fmt field to AVCodecContext.
Add float_interleave() to FmtConvertContext with x86-optimized versions.
Remove unused make variable SEEK_REFFILE
fate: remove redundant aref and vref references
Parse 'bext' metadata in the wav demuxer
Cosmetics: indent
Keep parsing wav until EOF if the input is seekable and we know the size of the data tag
Refactor the tag checking into a switch statement
Use avio_tell() instead of url_ftell()
add x264opts entry to docs
cleaned up the udp.c, removed some variables and an av_log
configure: favor pkg_config over sdl_config
libx264: support passing arbitrary parameters.
ffmpeg: don't show_banner() on verbose<0
fate: remove do_ffmpeg_nocheck function
fate: do not collect -benchmark output
mpegaudiodec: remove decode_end() function
fate: run aref and vref as regular tests
mpegaudio: sanitise compute_antialias_* names
mpeg12: add slice-threading checks to slice-threading initializers.
h264: copy pixel_shift between slice threading contexts.
mdec: enable frame-level multithreading.
mdec.c: fix overread.
id3v2: prevent unsigned integer overflow in ff_id3v2_parse()
id3v2: add @file doxy and link to format documentation
configure: opensolaris install is not compatible with ffmpeg, allow overriding it.
Fix compilation of iirfilter-test.
eval: opensolaris strtod() cannot handle 0x1234
libx264: handle closed GOP codec flag
lavf: remove duplicate assignment in avformat_alloc_context.
lavf: use designated initializers for AVClasses.
Make sure neither data_size nor sample_count is negative
Refactor the 'fmt ' tag search and parsing
flvdec: clean up debug code
asfdec: fix possible overread on broken files.
asfdec: do not fall back to binary/generic search
asfdec: reindent after previous commit c7bd5ed
asfdec: fallback to binary search internally
mpegaudio: add _fixed suffix to some names
Modify x86util.asm to ease transitioning to 10-bit H.264 assembly.
ffmpeg: reset top_field_first in opt_input_file().
dct: build dct32 as separate object files
qdm2: include correct header for rdft
Ogg demuxer: give meaningful error codes and warnings.
update changelog with 9/10 bit H264 and FFV1 changes
Add some forgotten const to function arguments in libavfilter & libavformat.
Write channel_layout for multichannel aif files.
Fix ff_mov_write_chan() so it can be used by other muxers.
Fix some mov files with little endian audio (tickets 201 - 203).
iff/8svx: redesign 8SVX demuxing and decoding for handling stereo samples correctly
iff: compact code setting metadata tags
iff: fix bitrate computation for compressed audio stream
iff: distinguish fields for audio and video compression
imgutils: introduce internal image_get_linesize() and use it
imgutils: make av_image_get_linesize() return AVERROR(EINVAL) for invalid pixel formats
drawtext: specify union type for setting default options
drawtext: reindent after the previous commit
drawtext: fix strftime() text expansion
ffmpeg: fix -aspect cli option
Restructure video filter implementation in ffmpeg.c.
ffplay: remove audio_write_get_buf_size() forward declaration
lavfi: print key-frame and picture type information in ff_dlog_ref()
mathops: remove ancient confusing comment
rawdec: Allow overriding top field first.
ffmpeg: initialize input_codec array earlier.
cmdutils: Allocate private decoder context if it's not allocated yet.
cws2fws: Improve error message wording.
tools: Check the return value of write().
mpegaudio: move OUT_FMT macro to mpegaudiodec.c
mpegaudio: remove OUT_MIN/MAX macros
Add missing #includes to mp3_header_(de)compress bsf
dct: fix indentation
dct: bypass table allocation for DCT_II of size 32
pngdec: relax condition for setting monoblack pixel format
h264dsp_mmx: Add #ifdefs around some mmxext functions on x86_64.
Remove unused header mpegaudio3.h.
Support decoding of 1bpp rawvideo in avi (ticket 205).
Support decoding of 2bpp rawvideo in avi (ticket 206).
Bump minor after adding a caf muxer.
configure: another try on fixing osx/mingw SDL
aacdec: Use float instead of int16_t for ltp_state to avoid needless rounding.
av_picture_crop(): Support simple cases with packed pixels too.
acelp: Remove unused gray_decode table.
dfa: Remove unused variable.
configure: Include AVX availability in summary output.
rawdec: propagate pict_type information to the output frame
showinfo: replace "CRC" by "checksum"
showinfo: fix vertical align nit
showinfo: fix computation of Adler checksum
imgutils: generalize linesize computation for bitstream formats
configure: use same CPPFLAGS in kFreeBSD as Linux
Conflicts:
    ffserver.c
    libavcodec/avcodec.h
    libavcodec/opt.h
    libavcodec/version.h
    libavdevice/avdevice.h
    libavfilter/avfilter.h
    libavformat/avformat.h
    libavformat/metadata.c
    libavformat/metadata.h
    libavformat/utils.c
    libavformat/version.h
    libavutil/avutil.h
    libavutil/mem.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Diffstat (limited to 'libswscale')
27 files changed, 9083 insertions, 7118 deletions
diff --git a/libswscale/Makefile b/libswscale/Makefile
index 6976079686..8bb06baae2 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -5,14 +5,19 @@ FFLIBS = avutil
 HEADERS = swscale.h
-OBJS = options.o rgb2rgb.o swscale.o utils.o yuv2rgb.o
+OBJS = options.o rgb2rgb.o swscale.o utils.o yuv2rgb.o \
+       swscale_unscaled.o
 OBJS-$(ARCH_BFIN)    += bfin/internal_bfin.o \
                         bfin/swscale_bfin.o \
                         bfin/yuv2rgb_bfin.o
 OBJS-$(CONFIG_MLIB)  += mlib/yuv2rgb_mlib.o
-OBJS-$(HAVE_ALTIVEC) += ppc/yuv2rgb_altivec.o
-OBJS-$(HAVE_MMX)     += x86/yuv2rgb_mmx.o
+OBJS-$(HAVE_ALTIVEC) += ppc/swscale_altivec.o \
+                        ppc/yuv2rgb_altivec.o \
+                        ppc/yuv2yuv_altivec.o
+OBJS-$(HAVE_MMX)     += x86/rgb2rgb.o \
+                        x86/swscale_mmx.o \
+                        x86/yuv2rgb_mmx.o
 OBJS-$(HAVE_VIS)     += sparc/yuv2rgb_vis.o
 TESTPROGS = colorspace swscale
diff --git a/libswscale/bfin/internal_bfin.S b/libswscale/bfin/internal_bfin.S
index 5af46540a8..cb8d71253c 100644
--- a/libswscale/bfin/internal_bfin.S
+++ b/libswscale/bfin/internal_bfin.S
@@ -466,8 +466,8 @@ DEFUN_END(yuv2rgb24_line)
 #define ARG_srcStride 40
 DEFUN(uyvytoyv12, mL3, (const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
-                        long width, long height,
-                        long lumStride, long chromStride, long srcStride)):
+                        int width, int height,
+                        int lumStride, int chromStride, int srcStride)):
     link 0;
     [--sp] = (r7:4,p5:4);
@@ -539,8 +539,8 @@ DEFUN(uyvytoyv12, mL3, (const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8
 DEFUN_END(uyvytoyv12)
 DEFUN(yuyvtoyv12, mL3, (const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
-                        long width, long height,
-                        long lumStride, long chromStride, long srcStride)):
+                        int width, int height,
+                        int lumStride, int chromStride, int srcStride)):
     link 0;
     [--sp] = (r7:4,p5:4);
diff --git a/libswscale/bfin/swscale_bfin.c b/libswscale/bfin/swscale_bfin.c
index ce2f1720dd..4b26ba67c2 100644
--- a/libswscale/bfin/swscale_bfin.c
+++ b/libswscale/bfin/swscale_bfin.c
@@ -38,12 +38,12 @@
 #endif
 int ff_bfin_uyvytoyv12(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
-                       long width, long height,
-                       long lumStride, long chromStride, long srcStride) L1CODE;
+                       int width, int height,
+                       int lumStride, int chromStride, int srcStride) L1CODE;
 int ff_bfin_yuyvtoyv12(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
-                       long width, long height,
-                       long lumStride, long chromStride, long srcStride) L1CODE;
+                       int width, int height,
+                       int lumStride, int chromStride, int srcStride) L1CODE;
 static int uyvytoyv12_unscaled(SwsContext *c, uint8_t* src[], int srcStride[], int srcSliceY,
                                int srcSliceH, uint8_t* dst[], int dstStride[])
@@ -79,15 +79,13 @@ static int yuyvtoyv12_unscaled(SwsContext *c, uint8_t* src[], int srcStride[], i
 void ff_bfin_get_unscaled_swscale(SwsContext *c)
 {
     SwsFunc swScale = c->swScale;
-    if (c->flags & SWS_CPU_CAPS_BFIN)
-        if (c->dstFormat == PIX_FMT_YUV420P)
-            if (c->srcFormat == PIX_FMT_UYVY422) {
-                av_log (NULL, AV_LOG_VERBOSE, "selecting Blackfin optimized uyvytoyv12_unscaled\n");
-                c->swScale = uyvytoyv12_unscaled;
-            }
-    if (c->dstFormat == PIX_FMT_YUV420P)
-        if (c->srcFormat == PIX_FMT_YUYV422) {
-            av_log (NULL, AV_LOG_VERBOSE, "selecting Blackfin optimized yuyvtoyv12_unscaled\n");
-            c->swScale = yuyvtoyv12_unscaled;
-        }
+
+    if (c->dstFormat == PIX_FMT_YUV420P && c->srcFormat == PIX_FMT_UYVY422) {
+        av_log (NULL, AV_LOG_VERBOSE, "selecting Blackfin optimized uyvytoyv12_unscaled\n");
+        c->swScale = uyvytoyv12_unscaled;
+    }
+    if (c->dstFormat == PIX_FMT_YUV420P && c->srcFormat == PIX_FMT_YUYV422) {
+        av_log (NULL, AV_LOG_VERBOSE, "selecting Blackfin optimized yuyvtoyv12_unscaled\n");
+        c->swScale = yuyvtoyv12_unscaled;
+    }
 }
diff --git a/libswscale/bfin/yuv2rgb_bfin.c b/libswscale/bfin/yuv2rgb_bfin.c
index eaa83eaf3b..7a7dc7f0e6 100644
--- a/libswscale/bfin/yuv2rgb_bfin.c
+++ b/libswscale/bfin/yuv2rgb_bfin.c
@@ -28,6 +28,7 @@
 #include <assert.h>
 #include "config.h"
 #include <unistd.h>
+#include "libavutil/pixdesc.h"
 #include "libswscale/rgb2rgb.h"
 #include "libswscale/swscale.h"
 #include "libswscale/swscale_internal.h"
@@ -197,7 +198,7 @@ SwsFunc ff_yuv2rgb_get_func_ptr_bfin(SwsContext *c)
     }
     av_log(c, AV_LOG_INFO, "BlackFin accelerated color space converter %s\n",
-           sws_format_name (c->dstFormat));
+           av_get_pix_fmt_name(c->dstFormat));
     return f;
 }
diff --git a/libswscale/colorspace-test.c b/libswscale/colorspace-test.c
index 50db7d09e3..34095d8532 100644
--- a/libswscale/colorspace-test.c
+++ b/libswscale/colorspace-test.c
@@ -33,31 +33,6 @@
 #define FUNC(s,d,n) {s,d,#n,n}
-static int cpu_caps;
-
-static char *args_parse(int argc, char *argv[])
-{
-    int o;
-
-    while ((o = getopt(argc, argv, "m23")) != -1) {
-        switch (o) {
-        case 'm':
-            cpu_caps |= SWS_CPU_CAPS_MMX;
-            break;
-        case '2':
-            cpu_caps |= SWS_CPU_CAPS_MMX2;
-            break;
-        case '3':
-            cpu_caps |= SWS_CPU_CAPS_3DNOW;
-            break;
-        default:
-            av_log(NULL, AV_LOG_ERROR, "Unknown option %c\n", o);
-        }
-    }
-
-    return argv[optind];
-}
-
 int main(int argc, char **argv)
 {
     int i, funcNum;
@@ -70,16 +45,14 @@ int main(int argc, char **argv)
         return -1;
     av_log(NULL, AV_LOG_INFO, "memory corruption test ...\n");
-    args_parse(argc, argv);
-    av_log(NULL, AV_LOG_INFO, "CPU capabilities forced to %x\n", cpu_caps);
-    sws_rgb2rgb_init(cpu_caps);
+    sws_rgb2rgb_init();
     for(funcNum=0; ; funcNum++) {
         struct func_info_s {
             int src_bpp;
             int dst_bpp;
             const char *name;
-            void (*func)(const uint8_t *src, uint8_t *dst, long src_size);
+            void (*func)(const uint8_t *src, uint8_t *dst, int src_size);
         } func_info[] = {
             FUNC(2, 2, rgb15to16),
             FUNC(2, 3, rgb15to24),
diff --git a/libswscale/options.c b/libswscale/options.c
index d3cd0a3190..24e70b96fc 100644
--- a/libswscale/options.c
+++ b/libswscale/options.c
@@ -48,12 +48,6 @@ static const AVOption options[] = {
     { "spline", "natural bicubic spline", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_SPLINE }, INT_MIN, INT_MAX, VE, "sws_flags" },
     { "print_info", "print info", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_PRINT_INFO }, INT_MIN, INT_MAX, VE, "sws_flags" },
     { "accurate_rnd", "accurate rounding", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_ACCURATE_RND }, INT_MIN, INT_MAX, VE, "sws_flags" },
-    { "mmx", "MMX SIMD acceleration", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_CPU_CAPS_MMX }, INT_MIN, INT_MAX, VE, "sws_flags" },
-    { "mmx2", "MMX2 SIMD acceleration", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_CPU_CAPS_MMX2 }, INT_MIN, INT_MAX, VE, "sws_flags" },
-    { "sse2", "SSE2 SIMD acceleration", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_CPU_CAPS_SSE2 }, INT_MIN, INT_MAX, VE, "sws_flags" },
-    { "3dnow", "3DNOW SIMD acceleration", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_CPU_CAPS_3DNOW }, INT_MIN, INT_MAX, VE, "sws_flags" },
-    { "altivec", "AltiVec SIMD acceleration", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_CPU_CAPS_ALTIVEC }, INT_MIN, INT_MAX, VE, "sws_flags" },
-    { "bfin", "Blackfin SIMD acceleration", 0, FF_OPT_TYPE_CONST, {.dbl = SWS_CPU_CAPS_BFIN }, INT_MIN, INT_MAX, VE, "sws_flags" },
     { "full_chroma_int", "full chroma interpolation", 0 , FF_OPT_TYPE_CONST, {.dbl = SWS_FULL_CHR_H_INT }, INT_MIN, INT_MAX, VE, "sws_flags" },
     { "full_chroma_inp", "full chroma input", 0 , FF_OPT_TYPE_CONST, {.dbl = SWS_FULL_CHR_H_INP }, INT_MIN, INT_MAX, VE, "sws_flags" },
     { "bitexact", "", 0 , FF_OPT_TYPE_CONST, {.dbl = SWS_BITEXACT }, INT_MIN, INT_MAX, VE, "sws_flags" },
diff --git a/libswscale/ppc/swscale_altivec_template.c b/libswscale/ppc/swscale_altivec.c
index c7aa0fd2e6..197000beb9 100644
--- a/libswscale/ppc/swscale_altivec_template.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -21,6 +21,13 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
+#include <inttypes.h>
+#include "config.h"
+#include "libswscale/swscale.h"
+#include "libswscale/swscale_internal.h"
+#include "libavutil/cpu.h"
+#include "yuv2rgb_altivec.h"
+
 #define vzero vec_splat_s32(0)
 static inline void
@@ -29,13 +36,13 @@ altivec_packIntArrayToCharArray(int *val, uint8_t* dest, int dstW)
     register int i;
     vector unsigned int altivec_vectorShiftInt19 = vec_add(vec_splat_u32(10), vec_splat_u32(9));
-    if ((unsigned long)dest % 16) {
+    if ((uintptr_t)dest % 16) {
         /* badly aligned store, we force store alignment */
         /* and will handle load misalignment on val w/ vec_perm */
         vector unsigned char perm1;
         vector signed int v1;
         for (i = 0 ; (i < dstW) &&
-            (((unsigned long)dest + i) % 16) ; i++) {
+            (((uintptr_t)dest + i) % 16) ; i++) {
             int t = val[i] >> 19;
             dest[i] = (t < 0) ? 0 : ((t > 255) ? 255 : t);
         }
@@ -85,10 +92,15 @@ altivec_packIntArrayToCharArray(int *val, uint8_t* dest, int dstW)
     }
 }
-static inline void
-yuv2yuvX_altivec_real(const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize,
-                      const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize,
-                      uint8_t *dest, uint8_t *uDest, uint8_t *vDest, int dstW, int chrDstW)
+static void
+yuv2yuvX_altivec_real(SwsContext *c,
+                      const int16_t *lumFilter, const int16_t **lumSrc,
+                      int lumFilterSize, const int16_t *chrFilter,
+                      const int16_t **chrUSrc, const int16_t **chrVSrc,
+                      int chrFilterSize, const int16_t **alpSrc,
+                      uint8_t *dest, uint8_t *uDest,
+                      uint8_t *vDest, uint8_t *aDest,
+                      int dstW, int chrDstW)
 {
     const vector signed int vini = {(1 << 18), (1 << 18), (1 << 18), (1 << 18)};
     register int i, j;
@@ -159,22 +171,22 @@ yuv2yuvX_altivec_real(const int16_t *lumFilter, const int16_t **lumSrc, int lumF
         vChrFilter = vec_perm(vChrFilter, vChrFilter, perm0);
         vChrFilter = vec_splat(vChrFilter, 0); // chrFilter[j] is loaded 8 times in vChrFilter
-        perm = vec_lvsl(0, chrSrc[j]);
-        l1 = vec_ld(0, chrSrc[j]);
-        l1_V = vec_ld(VOFW << 1, chrSrc[j]);
+        perm = vec_lvsl(0, chrUSrc[j]);
+        l1 = vec_ld(0, chrUSrc[j]);
+        l1_V = vec_ld(0, chrVSrc[j]);
         for (i = 0; i < (chrDstW - 7); i+=8) {
             int offset = i << 2;
-            vector signed short l2 = vec_ld((i << 1) + 16, chrSrc[j]);
-            vector signed short l2_V = vec_ld(((i + VOFW) << 1) + 16, chrSrc[j]);
+            vector signed short l2 = vec_ld((i << 1) + 16, chrUSrc[j]);
+            vector signed short l2_V = vec_ld((i << 1) + 16, chrVSrc[j]);
             vector signed int v1 = vec_ld(offset, u);
             vector signed int v2 = vec_ld(offset + 16, u);
             vector signed int v1_V = vec_ld(offset, v);
             vector signed int v2_V = vec_ld(offset + 16, v);
-            vector signed short ls = vec_perm(l1, l2, perm); // chrSrc[j][i] ... chrSrc[j][i+7]
-            vector signed short ls_V = vec_perm(l1_V, l2_V, perm); // chrSrc[j][i+VOFW] ... chrSrc[j][i+2055]
+            vector signed short ls = vec_perm(l1, l2, perm); // chrUSrc[j][i] ... chrUSrc[j][i+7]
+            vector signed short ls_V = vec_perm(l1_V, l2_V, perm); // chrVSrc[j][i] ... chrVSrc[j][i]
             vector signed int i1 = vec_mule(vChrFilter, ls);
             vector signed int i2 = vec_mulo(vChrFilter, ls);
@@ -182,9 +194,9 @@ yuv2yuvX_altivec_real(const int16_t *lumFilter, const int16_t **lumSrc, int lumF
             vector signed int i2_V = vec_mulo(vChrFilter, ls_V);
             vector signed int vf1 = vec_mergeh(i1, i2);
-            vector signed int vf2 = vec_mergel(i1, i2); // chrSrc[j][i] * chrFilter[j] ... chrSrc[j][i+7] * chrFilter[j]
+            vector signed int vf2 = vec_mergel(i1, i2); // chrUSrc[j][i] * chrFilter[j] ... chrUSrc[j][i+7] * chrFilter[j]
             vector signed int vf1_V = vec_mergeh(i1_V, i2_V);
-            vector signed int vf2_V = vec_mergel(i1_V, i2_V); // chrSrc[j][i] * chrFilter[j] ... chrSrc[j][i+7] * chrFilter[j]
+            vector signed int vf2_V = vec_mergel(i1_V, i2_V); // chrVSrc[j][i] * chrFilter[j] ... chrVSrc[j][i+7] * chrFilter[j]
             vector signed int vo1 = vec_add(v1, vf1);
             vector signed int vo2 = vec_add(v2, vf2);
@@ -200,8 +212,8 @@ yuv2yuvX_altivec_real(const int16_t *lumFilter, const int16_t **lumSrc, int lumF
             l1_V = l2_V;
         }
         for ( ; i < chrDstW; i++) {
-            u[i] += chrSrc[j][i] * chrFilter[j];
-            v[i] += chrSrc[j][i + VOFW] * chrFilter[j];
+            u[i] += chrUSrc[j][i] * chrFilter[j];
+            v[i] += chrVSrc[j][i] * chrFilter[j];
         }
     }
     altivec_packIntArrayToCharArray(u, uDest, chrDstW);
@@ -209,10 +221,10 @@ yuv2yuvX_altivec_real(const int16_t *lumFilter, const int16_t **lumSrc, int lumF
     }
 }
-static inline void hScale_altivec_real(int16_t *dst, int dstW,
-                                       const uint8_t *src, int srcW,
-                                       int xInc, const int16_t *filter,
-                                       const int16_t *filterPos, int filterSize)
+static void hScale_altivec_real(int16_t *dst, int dstW,
+                                const uint8_t *src, int srcW,
+                                int xInc, const int16_t *filter,
+                                const int16_t *filterPos, int filterSize)
 {
     register int i;
     DECLARE_ALIGNED(16, int, tempo)[4];
@@ -389,157 +401,24 @@ static inline void hScale_altivec_real(int16_t *dst, int dstW,
     }
 }
-static inline int yv12toyuy2_unscaled_altivec(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY,
-                                              int srcSliceH, uint8_t* dstParam[], int dstStride_a[])
+void ff_sws_init_swScale_altivec(SwsContext *c)
 {
-    uint8_t *dst=dstParam[0] + dstStride_a[0]*srcSliceY;
-    // yv12toyuy2(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]);
-    const uint8_t *ysrc = src[0];
-    const uint8_t *usrc = src[1];
-    const uint8_t *vsrc = src[2];
-    const int width = c->srcW;
-    const int height = srcSliceH;
-    const int lumStride = srcStride[0];
-    const int chromStride = srcStride[1];
-    const int dstStride = dstStride_a[0];
-    const vector unsigned char yperm = vec_lvsl(0, ysrc);
-    const int vertLumPerChroma = 2;
-    register unsigned int y;
-
-    if (width&15) {
-        yv12toyuy2(ysrc, usrc, vsrc, dst, c->srcW, srcSliceH, lumStride, chromStride, dstStride);
-        return srcSliceH;
-    }
+    enum PixelFormat dstFormat = c->dstFormat;
-    /* This code assumes:
-
-    1) dst is 16 bytes-aligned
-    2) dstStride is a multiple of 16
-    3) width is a multiple of 16
-    4) lum & chrom stride are multiples of 8
-    */
-
-    for (y=0; y<height; y++) {
-        int i;
-        for (i = 0; i < width - 31; i+= 32) {
-            const unsigned int j = i >> 1;
-            vector unsigned char v_yA = vec_ld(i, ysrc);
-            vector unsigned char v_yB = vec_ld(i + 16, ysrc);
-            vector unsigned char v_yC = vec_ld(i + 32, ysrc);
-            vector unsigned char v_y1 = vec_perm(v_yA, v_yB, yperm);
-            vector unsigned char v_y2 = vec_perm(v_yB, v_yC, yperm);
-            vector unsigned char v_uA = vec_ld(j, usrc);
-            vector unsigned char v_uB = vec_ld(j + 16, usrc);
-            vector unsigned char v_u = vec_perm(v_uA, v_uB, vec_lvsl(j, usrc));
-            vector unsigned char v_vA = vec_ld(j, vsrc);
-            vector unsigned char v_vB = vec_ld(j + 16, vsrc);
-            vector unsigned char v_v = vec_perm(v_vA, v_vB, vec_lvsl(j, vsrc));
-            vector unsigned char v_uv_a = vec_mergeh(v_u, v_v);
-            vector unsigned char v_uv_b = vec_mergel(v_u, v_v);
-            vector unsigned char v_yuy2_0 = vec_mergeh(v_y1, v_uv_a);
-            vector unsigned char v_yuy2_1 = vec_mergel(v_y1, v_uv_a);
-            vector unsigned char v_yuy2_2 = vec_mergeh(v_y2, v_uv_b);
-            vector unsigned char v_yuy2_3 = vec_mergel(v_y2, v_uv_b);
-            vec_st(v_yuy2_0, (i << 1), dst);
-            vec_st(v_yuy2_1, (i << 1) + 16, dst);
-            vec_st(v_yuy2_2, (i << 1) + 32, dst);
-            vec_st(v_yuy2_3, (i << 1) + 48, dst);
-        }
-        if (i < width) {
-            const unsigned int j = i >> 1;
-            vector unsigned char v_y1 = vec_ld(i, ysrc);
-            vector unsigned char v_u = vec_ld(j, usrc);
-            vector unsigned char v_v = vec_ld(j, vsrc);
-            vector unsigned char v_uv_a = vec_mergeh(v_u, v_v);
-            vector unsigned char v_yuy2_0 = vec_mergeh(v_y1, v_uv_a);
-            vector unsigned char v_yuy2_1 = vec_mergel(v_y1, v_uv_a);
-            vec_st(v_yuy2_0, (i << 1), dst);
-            vec_st(v_yuy2_1, (i << 1) + 16, dst);
-        }
-        if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) {
-            usrc += chromStride;
-            vsrc += chromStride;
-        }
-        ysrc += lumStride;
-        dst += dstStride;
-    }
+
+    if (!(av_get_cpu_flags() & AV_CPU_FLAG_ALTIVEC))
+        return;
-    return srcSliceH;
-}
-
-static inline int yv12touyvy_unscaled_altivec(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY,
-                                              int srcSliceH, uint8_t* dstParam[], int dstStride_a[])
-{
-    uint8_t *dst=dstParam[0] + dstStride_a[0]*srcSliceY;
-    // yv12toyuy2(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]);
-    const uint8_t *ysrc = src[0];
-    const uint8_t *usrc = src[1];
-    const uint8_t *vsrc = src[2];
-    const int width = c->srcW;
-    const int height = srcSliceH;
-    const int lumStride = srcStride[0];
-    const int chromStride = srcStride[1];
-    const int dstStride = dstStride_a[0];
-    const int vertLumPerChroma = 2;
-    const vector unsigned char yperm = vec_lvsl(0, ysrc);
-    register unsigned int y;
-
-    if (width&15) {
-        yv12touyvy(ysrc, usrc, vsrc, dst, c->srcW, srcSliceH, lumStride, chromStride, dstStride);
-        return srcSliceH;
+    c->hScale = hScale_altivec_real;
+    if (!is16BPS(dstFormat) && !is9_OR_10BPS(dstFormat)) {
+        c->yuv2yuvX = yuv2yuvX_altivec_real;
     }
-    /* This code assumes:
-
-    1) dst is 16 bytes-aligned
-    2) dstStride is a multiple of 16
-    3) width is a multiple of 16
-    4) lum & chrom stride are multiples of 8
-    */
-
-    for (y=0; y<height; y++) {
-        int i;
-        for (i = 0; i < width - 31; i+= 32) {
-            const unsigned int j = i >> 1;
-            vector unsigned char v_yA = vec_ld(i, ysrc);
-            vector unsigned char v_yB = vec_ld(i + 16, ysrc);
-            vector unsigned char v_yC = vec_ld(i + 32, ysrc);
-            vector unsigned char v_y1 = vec_perm(v_yA, v_yB, yperm);
-            vector unsigned char v_y2 = vec_perm(v_yB, v_yC, yperm);
-            vector unsigned char v_uA = vec_ld(j, usrc);
-            vector unsigned char v_uB = vec_ld(j + 16, usrc);
-            vector unsigned char v_u = vec_perm(v_uA, v_uB, vec_lvsl(j, usrc));
-            vector unsigned char v_vA = vec_ld(j, vsrc);
-            vector unsigned char v_vB = vec_ld(j + 16, vsrc);
-            vector unsigned char v_v = vec_perm(v_vA, v_vB, vec_lvsl(j, vsrc));
-            vector unsigned char v_uv_a = vec_mergeh(v_u, v_v);
-            vector unsigned char v_uv_b = vec_mergel(v_u, v_v);
-            vector unsigned char v_uyvy_0 = vec_mergeh(v_uv_a, v_y1);
-            vector unsigned char v_uyvy_1 = vec_mergel(v_uv_a, v_y1);
-            vector unsigned char v_uyvy_2 = vec_mergeh(v_uv_b, v_y2);
-            vector unsigned char v_uyvy_3 = vec_mergel(v_uv_b, v_y2);
-            vec_st(v_uyvy_0, (i << 1), dst);
-            vec_st(v_uyvy_1, (i << 1) + 16, dst);
-            vec_st(v_uyvy_2, (i << 1) + 32, dst);
-            vec_st(v_uyvy_3, (i << 1) + 48, dst);
+    /* The following list of supported dstFormat values should
+     * match what's found in the body of ff_yuv2packedX_altivec() */
+    if (!(c->flags & (SWS_BITEXACT | SWS_FULL_CHR_H_INT)) && !c->alpPixBuf &&
+        (c->dstFormat==PIX_FMT_ABGR || c->dstFormat==PIX_FMT_BGRA ||
+         c->dstFormat==PIX_FMT_BGR24 || c->dstFormat==PIX_FMT_RGB24 ||
+         c->dstFormat==PIX_FMT_RGBA || c->dstFormat==PIX_FMT_ARGB)) {
+        c->yuv2packedX = ff_yuv2packedX_altivec;
     }
-        if (i < width) {
-            const unsigned int j = i >> 1;
-            vector unsigned char v_y1 = vec_ld(i, ysrc);
-            vector unsigned char v_u = vec_ld(j, usrc);
-            vector unsigned char v_v = vec_ld(j, vsrc);
-            vector unsigned char v_uv_a = vec_mergeh(v_u, v_v);
-            vector unsigned char v_uyvy_0 = vec_mergeh(v_uv_a, v_y1);
-            vector unsigned char v_uyvy_1 = vec_mergel(v_uv_a, v_y1);
-            vec_st(v_uyvy_0, (i << 1), dst);
-            vec_st(v_uyvy_1, (i << 1) + 16, dst);
-        }
-        if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) {
-            usrc += chromStride;
-            vsrc += chromStride;
-        }
-        ysrc += lumStride;
-        dst += dstStride;
-    }
-    return srcSliceH;
 }
diff --git a/libswscale/ppc/yuv2rgb_altivec.c b/libswscale/ppc/yuv2rgb_altivec.c
index 2b58eb27c9..e13702b100 100644
--- a/libswscale/ppc/yuv2rgb_altivec.c
+++ b/libswscale/ppc/yuv2rgb_altivec.c
@@ -94,6 +94,9 @@ adjustment.
 #include "libswscale/rgb2rgb.h"
 #include "libswscale/swscale.h"
 #include "libswscale/swscale_internal.h"
+#include "libavutil/cpu.h"
+#include "libavutil/pixdesc.h"
+#include "yuv2rgb_altivec.h"
 #undef PROFILE_THE_BEAST
 #undef INC_SCALING
@@ -296,7 +299,7 @@ static int altivec_##name (SwsContext *c, \
     vector signed short R1,G1,B1; \
     vector unsigned char R,G,B; \
 \
-    vector unsigned char *y1ivP, *y2ivP, *uivP, *vivP; \
+    const vector unsigned char *y1ivP, *y2ivP, *uivP, *vivP; \
     vector unsigned char align_perm; \
 \
     vector signed short \
@@ -333,10 +336,10 @@ static int altivec_##name (SwsContext *c, \
 \
     for (j=0;j<w/16;j++) { \
 \
-        y1ivP = (vector unsigned char *)y1i; \
-        y2ivP = (vector unsigned char *)y2i; \
-        uivP = (vector unsigned char *)ui; \
-        vivP = (vector unsigned char *)vi; \
+        y1ivP = (const vector unsigned char *)y1i; \
+        y2ivP = (const vector unsigned char *)y2i; \
+        uivP = (const vector unsigned char *)ui; \
+        vivP = (const vector unsigned char *)vi; \
 \
         align_perm = vec_lvsl (0, y1i); \
         y0 = (vector unsigned char) \
@@ -446,159 +449,7 @@ static int altivec_##name (SwsContext *c, \
 #define out_bgr24(a,b,c,ptr) vec_mstbgr24(a,b,c,ptr)
 DEFCSP420_CVT (yuv2_abgr, out_abgr)
-#if 1
 DEFCSP420_CVT (yuv2_bgra, out_bgra)
-#else
-static int altivec_yuv2_bgra32 (SwsContext *c,
-                                unsigned char **in, int *instrides,
-                                int srcSliceY, int srcSliceH,
-                                unsigned char **oplanes, int *outstrides)
-{
-    int w = c->srcW;
-    int h = srcSliceH;
-    int i,j;
-    int instrides_scl[3];
-    vector unsigned char y0,y1;
-
-    vector signed char u,v;
-
-    vector signed short Y0,Y1,Y2,Y3;
-    vector signed short U,V;
-    vector signed short vx,ux,uvx;
-    vector signed short vx0,ux0,uvx0;
-    vector signed short vx1,ux1,uvx1;
-    vector signed short R0,G0,B0;
-    vector signed short R1,G1,B1;
-    vector unsigned char R,G,B;
-
-    vector unsigned char *uivP, *vivP;
-    vector unsigned char align_perm;
-
-    vector signed short
-        lCY = c->CY,
-        lOY = c->OY,
-        lCRV = c->CRV,
-        lCBU = c->CBU,
-        lCGU = c->CGU,
-        lCGV = c->CGV;
-
-    vector unsigned short lCSHIFT = c->CSHIFT;
-
-    ubyte *y1i = in[0];
-    ubyte *y2i
= in[0]+w; - ubyte *ui = in[1]; - ubyte *vi = in[2]; - - vector unsigned char *oute - = (vector unsigned char *) - (oplanes[0]+srcSliceY*outstrides[0]); - vector unsigned char *outo - = (vector unsigned char *) - (oplanes[0]+srcSliceY*outstrides[0]+outstrides[0]); - - - instrides_scl[0] = instrides[0]; - instrides_scl[1] = instrides[1]-w/2; /* the loop moves ui by w/2 */ - instrides_scl[2] = instrides[2]-w/2; /* the loop moves vi by w/2 */ - - - for (i=0;i<h/2;i++) { - vec_dstst (outo, (0x02000002|(((w*3+32)/32)<<16)), 0); - vec_dstst (oute, (0x02000002|(((w*3+32)/32)<<16)), 1); - - for (j=0;j<w/16;j++) { - - y0 = vec_ldl (0,y1i); - y1 = vec_ldl (0,y2i); - uivP = (vector unsigned char *)ui; - vivP = (vector unsigned char *)vi; - - align_perm = vec_lvsl (0, ui); - u = (vector signed char)vec_perm (uivP[0], uivP[1], align_perm); - - align_perm = vec_lvsl (0, vi); - v = (vector signed char)vec_perm (vivP[0], vivP[1], align_perm); - u = (vector signed char) - vec_sub (u,(vector signed char) - vec_splat((vector signed char){128},0)); - - v = (vector signed char) - vec_sub (v, (vector signed char) - vec_splat((vector signed char){128},0)); - - U = vec_unpackh (u); - V = vec_unpackh (v); - - - Y0 = vec_unh (y0); - Y1 = vec_unl (y0); - Y2 = vec_unh (y1); - Y3 = vec_unl (y1); - - Y0 = vec_mradds (Y0, lCY, lOY); - Y1 = vec_mradds (Y1, lCY, lOY); - Y2 = vec_mradds (Y2, lCY, lOY); - Y3 = vec_mradds (Y3, lCY, lOY); - - /* ux = (CBU*(u<<CSHIFT)+0x4000)>>15 */ - ux = vec_sl (U, lCSHIFT); - ux = vec_mradds (ux, lCBU, (vector signed short){0}); - ux0 = vec_mergeh (ux,ux); - ux1 = vec_mergel (ux,ux); - - /* vx = (CRV*(v<<CSHIFT)+0x4000)>>15; */ - vx = vec_sl (V, lCSHIFT); - vx = vec_mradds (vx, lCRV, (vector signed short){0}); - vx0 = vec_mergeh (vx,vx); - vx1 = vec_mergel (vx,vx); - /* uvx = ((CGU*u) + (CGV*v))>>15 */ - uvx = vec_mradds (U, lCGU, (vector signed short){0}); - uvx = vec_mradds (V, lCGV, uvx); - uvx0 = vec_mergeh (uvx,uvx); - uvx1 = vec_mergel (uvx,uvx); - R0 = vec_add (Y0,vx0); - G0 = vec_add (Y0,uvx0); - B0 = vec_add (Y0,ux0); - R1 = vec_add (Y1,vx1); - G1 = vec_add (Y1,uvx1); - B1 = vec_add (Y1,ux1); - R = vec_packclp (R0,R1); - G = vec_packclp (G0,G1); - B = vec_packclp (B0,B1); - - out_argb(R,G,B,oute); - R0 = vec_add (Y2,vx0); - G0 = vec_add (Y2,uvx0); - B0 = vec_add (Y2,ux0); - R1 = vec_add (Y3,vx1); - G1 = vec_add (Y3,uvx1); - B1 = vec_add (Y3,ux1); - R = vec_packclp (R0,R1); - G = vec_packclp (G0,G1); - B = vec_packclp (B0,B1); - - out_argb(R,G,B,outo); - y1i += 16; - y2i += 16; - ui += 8; - vi += 8; - - } - - outo += (outstrides[0])>>4; - oute += (outstrides[0])>>4; - - ui += instrides_scl[1]; - vi += instrides_scl[2]; - y1i += instrides_scl[0]; - y2i += instrides_scl[0]; - } - return srcSliceH; -} - -#endif - - DEFCSP420_CVT (yuv2_rgba, out_rgba) DEFCSP420_CVT (yuv2_argb, out_argb) DEFCSP420_CVT (yuv2_rgb24, out_rgb24) @@ -692,7 +543,7 @@ static int altivec_uyvy_rgb32 (SwsContext *c, */ SwsFunc ff_yuv2rgb_init_altivec(SwsContext *c) { - if (!(c->flags & SWS_CPU_CAPS_ALTIVEC)) + if (!(av_get_cpu_flags() & AV_CPU_FLAG_ALTIVEC)) return NULL; /* @@ -777,10 +628,12 @@ void ff_yuv2rgb_init_tables_altivec(SwsContext *c, const int inv_table[4], int b void -ff_yuv2packedX_altivec(SwsContext *c, - const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - uint8_t *dest, int dstW, int dstY) +ff_yuv2packedX_altivec(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const 
int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, + int dstW, int dstY) { int i,j; vector signed short X,X0,X1,Y0,U0,V0,Y1,U1,V1,U,V; @@ -791,7 +644,7 @@ ff_yuv2packedX_altivec(SwsContext *c, vector signed short RND = vec_splat_s16(1<<3); vector unsigned short SCL = vec_splat_u16(4); - DECLARE_ALIGNED(16, unsigned long, scratch)[16]; + DECLARE_ALIGNED(16, unsigned int, scratch)[16]; vector signed short *YCoeffs, *CCoeffs; @@ -815,9 +668,9 @@ ff_yuv2packedX_altivec(SwsContext *c, V = RND; /* extract 8 coeffs from U,V */ for (j=0; j<chrFilterSize; j++) { - X = vec_ld (0, &chrSrc[j][i/2]); + X = vec_ld (0, &chrUSrc[j][i/2]); U = vec_mradds (X, CCoeffs[j], U); - X = vec_ld (0, &chrSrc[j][i/2+VOFW]); + X = vec_ld (0, &chrVSrc[j][i/2]); V = vec_mradds (X, CCoeffs[j], V); } @@ -868,7 +721,7 @@ ff_yuv2packedX_altivec(SwsContext *c, static int printed_error_message; if (!printed_error_message) { av_log(c, AV_LOG_ERROR, "altivec_yuv2packedX doesn't support %s output\n", - sws_format_name(c->dstFormat)); + av_get_pix_fmt_name(c->dstFormat)); printed_error_message=1; } return; @@ -893,9 +746,9 @@ ff_yuv2packedX_altivec(SwsContext *c, V = RND; /* extract 8 coeffs from U,V */ for (j=0; j<chrFilterSize; j++) { - X = vec_ld (0, &chrSrc[j][i/2]); + X = vec_ld (0, &chrUSrc[j][i/2]); U = vec_mradds (X, CCoeffs[j], U); - X = vec_ld (0, &chrSrc[j][i/2+VOFW]); + X = vec_ld (0, &chrVSrc[j][i/2]); V = vec_mradds (X, CCoeffs[j], V); } @@ -943,7 +796,7 @@ ff_yuv2packedX_altivec(SwsContext *c, default: /* Unreachable, I think. */ av_log(c, AV_LOG_ERROR, "altivec_yuv2packedX doesn't support %s output\n", - sws_format_name(c->dstFormat)); + av_get_pix_fmt_name(c->dstFormat)); return; } diff --git a/libswscale/ppc/yuv2rgb_altivec.h b/libswscale/ppc/yuv2rgb_altivec.h new file mode 100644 index 0000000000..15385b1d3b --- /dev/null +++ b/libswscale/ppc/yuv2rgb_altivec.h @@ -0,0 +1,34 @@ +/* + * AltiVec-enhanced yuv2yuvX + * + * Copyright (C) 2004 Romain Dolbeau <romain@dolbeau.org> + * based on the equivalent C code in swscale.c + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef PPC_YUV2RGB_ALTIVEC_H +#define PPC_YUV2RGB_ALTIVEC_H 1 + +void ff_yuv2packedX_altivec(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, + int dstW, int dstY); + +#endif /* PPC_YUV2RGB_ALTIVEC_H */ diff --git a/libswscale/ppc/yuv2yuv_altivec.c b/libswscale/ppc/yuv2yuv_altivec.c new file mode 100644 index 0000000000..82c265afd2 --- /dev/null +++ b/libswscale/ppc/yuv2yuv_altivec.c @@ -0,0 +1,191 @@ +/* + * AltiVec-enhanced yuv-to-yuv convertion routines. 
+ * + * Copyright (C) 2004 Romain Dolbeau <romain@dolbeau.org> + * based on the equivalent C code in swscale.c + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <inttypes.h> +#include "config.h" +#include "libswscale/swscale.h" +#include "libswscale/swscale_internal.h" +#include "libavutil/cpu.h" + +static int yv12toyuy2_unscaled_altivec(SwsContext *c, const uint8_t* src[], + int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], + int dstStride_a[]) +{ + uint8_t *dst=dstParam[0] + dstStride_a[0]*srcSliceY; + // yv12toyuy2(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]); + const uint8_t *ysrc = src[0]; + const uint8_t *usrc = src[1]; + const uint8_t *vsrc = src[2]; + const int width = c->srcW; + const int height = srcSliceH; + const int lumStride = srcStride[0]; + const int chromStride = srcStride[1]; + const int dstStride = dstStride_a[0]; + const vector unsigned char yperm = vec_lvsl(0, ysrc); + const int vertLumPerChroma = 2; + register unsigned int y; + + /* This code assumes: + + 1) dst is 16 bytes-aligned + 2) dstStride is a multiple of 16 + 3) width is a multiple of 16 + 4) lum & chrom stride are multiples of 8 + */ + + for (y=0; y<height; y++) { + int i; + for (i = 0; i < width - 31; i+= 32) { + const unsigned int j = i >> 1; + vector unsigned char v_yA = vec_ld(i, ysrc); + vector unsigned char v_yB = vec_ld(i + 16, ysrc); + vector unsigned char v_yC = vec_ld(i + 32, ysrc); + vector unsigned char v_y1 = vec_perm(v_yA, v_yB, yperm); + vector unsigned char v_y2 = vec_perm(v_yB, v_yC, yperm); + vector unsigned char v_uA = vec_ld(j, usrc); + vector unsigned char v_uB = vec_ld(j + 16, usrc); + vector unsigned char v_u = vec_perm(v_uA, v_uB, vec_lvsl(j, usrc)); + vector unsigned char v_vA = vec_ld(j, vsrc); + vector unsigned char v_vB = vec_ld(j + 16, vsrc); + vector unsigned char v_v = vec_perm(v_vA, v_vB, vec_lvsl(j, vsrc)); + vector unsigned char v_uv_a = vec_mergeh(v_u, v_v); + vector unsigned char v_uv_b = vec_mergel(v_u, v_v); + vector unsigned char v_yuy2_0 = vec_mergeh(v_y1, v_uv_a); + vector unsigned char v_yuy2_1 = vec_mergel(v_y1, v_uv_a); + vector unsigned char v_yuy2_2 = vec_mergeh(v_y2, v_uv_b); + vector unsigned char v_yuy2_3 = vec_mergel(v_y2, v_uv_b); + vec_st(v_yuy2_0, (i << 1), dst); + vec_st(v_yuy2_1, (i << 1) + 16, dst); + vec_st(v_yuy2_2, (i << 1) + 32, dst); + vec_st(v_yuy2_3, (i << 1) + 48, dst); + } + if (i < width) { + const unsigned int j = i >> 1; + vector unsigned char v_y1 = vec_ld(i, ysrc); + vector unsigned char v_u = vec_ld(j, usrc); + vector unsigned char v_v = vec_ld(j, vsrc); + vector unsigned char v_uv_a = vec_mergeh(v_u, v_v); + vector unsigned char v_yuy2_0 = vec_mergeh(v_y1, v_uv_a); + vector unsigned char v_yuy2_1 = vec_mergel(v_y1, v_uv_a); + vec_st(v_yuy2_0, (i 
<< 1), dst); + vec_st(v_yuy2_1, (i << 1) + 16, dst); + } + if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) { + usrc += chromStride; + vsrc += chromStride; + } + ysrc += lumStride; + dst += dstStride; + } + + return srcSliceH; +} + +static int yv12touyvy_unscaled_altivec(SwsContext *c, const uint8_t* src[], + int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], + int dstStride_a[]) +{ + uint8_t *dst=dstParam[0] + dstStride_a[0]*srcSliceY; + // yv12toyuy2(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]); + const uint8_t *ysrc = src[0]; + const uint8_t *usrc = src[1]; + const uint8_t *vsrc = src[2]; + const int width = c->srcW; + const int height = srcSliceH; + const int lumStride = srcStride[0]; + const int chromStride = srcStride[1]; + const int dstStride = dstStride_a[0]; + const int vertLumPerChroma = 2; + const vector unsigned char yperm = vec_lvsl(0, ysrc); + register unsigned int y; + + /* This code assumes: + + 1) dst is 16 bytes-aligned + 2) dstStride is a multiple of 16 + 3) width is a multiple of 16 + 4) lum & chrom stride are multiples of 8 + */ + + for (y=0; y<height; y++) { + int i; + for (i = 0; i < width - 31; i+= 32) { + const unsigned int j = i >> 1; + vector unsigned char v_yA = vec_ld(i, ysrc); + vector unsigned char v_yB = vec_ld(i + 16, ysrc); + vector unsigned char v_yC = vec_ld(i + 32, ysrc); + vector unsigned char v_y1 = vec_perm(v_yA, v_yB, yperm); + vector unsigned char v_y2 = vec_perm(v_yB, v_yC, yperm); + vector unsigned char v_uA = vec_ld(j, usrc); + vector unsigned char v_uB = vec_ld(j + 16, usrc); + vector unsigned char v_u = vec_perm(v_uA, v_uB, vec_lvsl(j, usrc)); + vector unsigned char v_vA = vec_ld(j, vsrc); + vector unsigned char v_vB = vec_ld(j + 16, vsrc); + vector unsigned char v_v = vec_perm(v_vA, v_vB, vec_lvsl(j, vsrc)); + vector unsigned char v_uv_a = vec_mergeh(v_u, v_v); + vector unsigned char v_uv_b = vec_mergel(v_u, v_v); + vector unsigned char v_uyvy_0 = vec_mergeh(v_uv_a, v_y1); + vector unsigned char v_uyvy_1 = vec_mergel(v_uv_a, v_y1); + vector unsigned char v_uyvy_2 = vec_mergeh(v_uv_b, v_y2); + vector unsigned char v_uyvy_3 = vec_mergel(v_uv_b, v_y2); + vec_st(v_uyvy_0, (i << 1), dst); + vec_st(v_uyvy_1, (i << 1) + 16, dst); + vec_st(v_uyvy_2, (i << 1) + 32, dst); + vec_st(v_uyvy_3, (i << 1) + 48, dst); + } + if (i < width) { + const unsigned int j = i >> 1; + vector unsigned char v_y1 = vec_ld(i, ysrc); + vector unsigned char v_u = vec_ld(j, usrc); + vector unsigned char v_v = vec_ld(j, vsrc); + vector unsigned char v_uv_a = vec_mergeh(v_u, v_v); + vector unsigned char v_uyvy_0 = vec_mergeh(v_uv_a, v_y1); + vector unsigned char v_uyvy_1 = vec_mergel(v_uv_a, v_y1); + vec_st(v_uyvy_0, (i << 1), dst); + vec_st(v_uyvy_1, (i << 1) + 16, dst); + } + if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) { + usrc += chromStride; + vsrc += chromStride; + } + ysrc += lumStride; + dst += dstStride; + } + return srcSliceH; +} + +void ff_swscale_get_unscaled_altivec(SwsContext *c) +{ + if ((av_get_cpu_flags() & AV_CPU_FLAG_ALTIVEC) && !(c->srcW & 15) && + !(c->flags & SWS_BITEXACT) && c->srcFormat == PIX_FMT_YUV420P) { + enum PixelFormat dstFormat = c->dstFormat; + + // unscaled YV12 -> packed YUV, we want speed + if (dstFormat == PIX_FMT_YUYV422) + c->swScale= yv12toyuy2_unscaled_altivec; + else if (dstFormat == PIX_FMT_UYVY422) + c->swScale= yv12touyvy_unscaled_altivec; + } +} diff --git a/libswscale/rgb2rgb.c b/libswscale/rgb2rgb.c index adc5d59c8c..84ef43b774 100644 --- 
a/libswscale/rgb2rgb.c +++ b/libswscale/rgb2rgb.c @@ -24,115 +24,75 @@ */ #include <inttypes.h> #include "config.h" -#include "libavutil/x86_cpu.h" #include "libavutil/bswap.h" #include "rgb2rgb.h" #include "swscale.h" #include "swscale_internal.h" -void (*rgb24tobgr32)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb24tobgr16)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb24tobgr15)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb32tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb32to16)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb32to15)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb15to16)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb15tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb15to32)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb16to15)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb16tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb16to32)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb24tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb24to16)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb24to15)(const uint8_t *src, uint8_t *dst, long src_size); -void (*shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb32tobgr16)(const uint8_t *src, uint8_t *dst, long src_size); -void (*rgb32tobgr15)(const uint8_t *src, uint8_t *dst, long src_size); +void (*rgb24tobgr32)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb24tobgr16)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb24tobgr15)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb32tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb32to16)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb32to15)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb15to16)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb15tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb15to32)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb16to15)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb16tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb16to32)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb24tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb24to16)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb24to15)(const uint8_t *src, uint8_t *dst, int src_size); +void (*shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb32tobgr16)(const uint8_t *src, uint8_t *dst, int src_size); +void (*rgb32tobgr15)(const uint8_t *src, uint8_t *dst, int src_size); void (*yv12toyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); void (*yv12touyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); void (*yuv422ptoyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); void 
(*yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); void (*yuy2toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); void (*rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride); -void (*planar2x)(const uint8_t *src, uint8_t *dst, long width, long height, - long srcStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int srcStride); +void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, + int srcStride, int dstStride); void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst, - long width, long height, long src1Stride, - long src2Stride, long dstStride); + int width, int height, int src1Stride, + int src2Stride, int dstStride); void (*vu9_to_vu12)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst1, uint8_t *dst2, - long width, long height, - long srcStride1, long srcStride2, - long dstStride1, long dstStride2); + int width, int height, + int srcStride1, int srcStride2, + int dstStride1, int dstStride2); void (*yvu9_to_yuy2)(const uint8_t *src1, const uint8_t *src2, const uint8_t *src3, uint8_t *dst, - long width, long height, - long srcStride1, long srcStride2, - long srcStride3, long dstStride); + int width, int height, + int srcStride1, int srcStride2, + int srcStride3, int dstStride); void (*uyvytoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); void (*uyvytoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); void (*yuyvtoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); void (*yuyvtoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); - - -#if ARCH_X86 -DECLARE_ASM_CONST(8, uint64_t, mmx_ff) = 0x00000000000000FFULL; -DECLARE_ASM_CONST(8, uint64_t, mmx_null) = 0x0000000000000000ULL; -DECLARE_ASM_CONST(8, uint64_t, mmx_one) = 0xFFFFFFFFFFFFFFFFULL; -DECLARE_ASM_CONST(8, uint64_t, mask32b) = 0x000000FF000000FFULL; -DECLARE_ASM_CONST(8, uint64_t, mask32g) = 0x0000FF000000FF00ULL; -DECLARE_ASM_CONST(8, uint64_t, mask32r) = 0x00FF000000FF0000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask32a) = 0xFF000000FF000000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask32) = 0x00FFFFFF00FFFFFFULL; -DECLARE_ASM_CONST(8, uint64_t, mask3216br) = 0x00F800F800F800F8ULL; -DECLARE_ASM_CONST(8, uint64_t, mask3216g) = 0x0000FC000000FC00ULL; -DECLARE_ASM_CONST(8, uint64_t, mask3215g) = 0x0000F8000000F800ULL; -DECLARE_ASM_CONST(8, uint64_t, mul3216) = 0x2000000420000004ULL; -DECLARE_ASM_CONST(8, uint64_t, mul3215) = 0x2000000820000008ULL; -DECLARE_ASM_CONST(8, 
uint64_t, mask24b) = 0x00FF0000FF0000FFULL; -DECLARE_ASM_CONST(8, uint64_t, mask24g) = 0xFF0000FF0000FF00ULL; -DECLARE_ASM_CONST(8, uint64_t, mask24r) = 0x0000FF0000FF0000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask24l) = 0x0000000000FFFFFFULL; -DECLARE_ASM_CONST(8, uint64_t, mask24h) = 0x0000FFFFFF000000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask24hh) = 0xffff000000000000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask24hhh) = 0xffffffff00000000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask24hhhh) = 0xffffffffffff0000ULL; -DECLARE_ASM_CONST(8, uint64_t, mask15b) = 0x001F001F001F001FULL; /* 00000000 00011111 xxB */ -DECLARE_ASM_CONST(8, uint64_t, mask15rg) = 0x7FE07FE07FE07FE0ULL; /* 01111111 11100000 RGx */ -DECLARE_ASM_CONST(8, uint64_t, mask15s) = 0xFFE0FFE0FFE0FFE0ULL; -DECLARE_ASM_CONST(8, uint64_t, mask15g) = 0x03E003E003E003E0ULL; -DECLARE_ASM_CONST(8, uint64_t, mask15r) = 0x7C007C007C007C00ULL; -#define mask16b mask15b -DECLARE_ASM_CONST(8, uint64_t, mask16g) = 0x07E007E007E007E0ULL; -DECLARE_ASM_CONST(8, uint64_t, mask16r) = 0xF800F800F800F800ULL; -DECLARE_ASM_CONST(8, uint64_t, red_16mask) = 0x0000f8000000f800ULL; -DECLARE_ASM_CONST(8, uint64_t, green_16mask) = 0x000007e0000007e0ULL; -DECLARE_ASM_CONST(8, uint64_t, blue_16mask) = 0x0000001f0000001fULL; -DECLARE_ASM_CONST(8, uint64_t, red_15mask) = 0x00007c0000007c00ULL; -DECLARE_ASM_CONST(8, uint64_t, green_15mask) = 0x000003e0000003e0ULL; -DECLARE_ASM_CONST(8, uint64_t, blue_15mask) = 0x0000001f0000001fULL; -#endif /* ARCH_X86 */ + int width, int height, + int lumStride, int chromStride, int srcStride); #define RGB2YUV_SHIFT 8 #define BY ((int)( 0.098*(1<<RGB2YUV_SHIFT)+0.5)) @@ -145,50 +105,9 @@ DECLARE_ASM_CONST(8, uint64_t, blue_15mask) = 0x0000001f0000001fULL; #define RV ((int)( 0.439*(1<<RGB2YUV_SHIFT)+0.5)) #define RU ((int)(-0.148*(1<<RGB2YUV_SHIFT)+0.5)) -//Note: We have C, MMX, MMX2, 3DNOW versions, there is no 3DNOW + MMX2 one. 
//plain C versions -#define COMPILE_TEMPLATE_MMX 0 -#define COMPILE_TEMPLATE_MMX2 0 -#define COMPILE_TEMPLATE_AMD3DNOW 0 -#define COMPILE_TEMPLATE_SSE2 0 -#define RENAME(a) a ## _C -#include "rgb2rgb_template.c" - -#if ARCH_X86 - -//MMX versions -#undef RENAME -#undef COMPILE_TEMPLATE_MMX -#define COMPILE_TEMPLATE_MMX 1 -#define RENAME(a) a ## _MMX -#include "rgb2rgb_template.c" - -//MMX2 versions -#undef RENAME -#undef COMPILE_TEMPLATE_MMX2 -#define COMPILE_TEMPLATE_MMX2 1 -#define RENAME(a) a ## _MMX2 -#include "rgb2rgb_template.c" - -//SSE2 versions -#undef RENAME -#undef COMPILE_TEMPLATE_SSE2 -#define COMPILE_TEMPLATE_SSE2 1 -#define RENAME(a) a ## _SSE2 #include "rgb2rgb_template.c" -//3DNOW versions -#undef RENAME -#undef COMPILE_TEMPLATE_MMX2 -#undef COMPILE_TEMPLATE_SSE2 -#undef COMPILE_TEMPLATE_AMD3DNOW -#define COMPILE_TEMPLATE_MMX2 0 -#define COMPILE_TEMPLATE_SSE2 1 -#define COMPILE_TEMPLATE_AMD3DNOW 1 -#define RENAME(a) a ## _3DNOW -#include "rgb2rgb_template.c" - -#endif //ARCH_X86 || ARCH_X86_64 /* RGB15->RGB16 original by Strepto/Astral @@ -197,20 +116,11 @@ DECLARE_ASM_CONST(8, uint64_t, blue_15mask) = 0x0000001f0000001fULL; 32-bit C version, and and&add trick by Michael Niedermayer */ -void sws_rgb2rgb_init(int flags) +void sws_rgb2rgb_init(void) { -#if HAVE_SSE2 || HAVE_MMX2 || HAVE_AMD3DNOW || HAVE_MMX - if (flags & SWS_CPU_CAPS_SSE2) - rgb2rgb_init_SSE2(); - else if (flags & SWS_CPU_CAPS_MMX2) - rgb2rgb_init_MMX2(); - else if (flags & SWS_CPU_CAPS_3DNOW) - rgb2rgb_init_3DNOW(); - else if (flags & SWS_CPU_CAPS_MMX) - rgb2rgb_init_MMX(); - else -#endif /* HAVE_MMX2 || HAVE_AMD3DNOW || HAVE_MMX */ - rgb2rgb_init_C(); + rgb2rgb_init_c(); + if (HAVE_MMX) + rgb2rgb_init_x86(); } #if LIBSWSCALE_VERSION_MAJOR < 1 @@ -241,10 +151,10 @@ void palette8tobgr16(const uint8_t *src, uint8_t *dst, long num_pixels, const ui } #endif -void rgb32to24(const uint8_t *src, uint8_t *dst, long src_size) +void rgb32to24(const uint8_t *src, uint8_t *dst, int src_size) { - long i; - long num_pixels = src_size >> 2; + int i; + int num_pixels = src_size >> 2; for (i=0; i<num_pixels; i++) { #if HAVE_BIGENDIAN /* RGB32 (= A,B,G,R) -> BGR24 (= B,G,R) */ @@ -259,9 +169,9 @@ void rgb32to24(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb24to32(const uint8_t *src, uint8_t *dst, long src_size) +void rgb24to32(const uint8_t *src, uint8_t *dst, int src_size) { - long i; + int i; for (i=0; 3*i<src_size; i++) { #if HAVE_BIGENDIAN /* RGB24 (= R,G,B) -> BGR32 (= A,R,G,B) */ @@ -278,7 +188,7 @@ void rgb24to32(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb16tobgr32(const uint8_t *src, uint8_t *dst, long src_size) +void rgb16tobgr32(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; uint8_t *d = dst; @@ -301,7 +211,7 @@ void rgb16tobgr32(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb16to24(const uint8_t *src, uint8_t *dst, long src_size) +void rgb16to24(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; uint8_t *d = dst; @@ -316,10 +226,10 @@ void rgb16to24(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb16tobgr16(const uint8_t *src, uint8_t *dst, long src_size) +void rgb16tobgr16(const uint8_t *src, uint8_t *dst, int src_size) { - long i; - long num_pixels = src_size >> 1; + int i; + int num_pixels = src_size >> 1; for (i=0; i<num_pixels; i++) { unsigned rgb = ((const uint16_t*)src)[i]; @@ -327,10 +237,10 @@ void rgb16tobgr16(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb16tobgr15(const uint8_t *src, 
uint8_t *dst, long src_size) +void rgb16tobgr15(const uint8_t *src, uint8_t *dst, int src_size) { - long i; - long num_pixels = src_size >> 1; + int i; + int num_pixels = src_size >> 1; for (i=0; i<num_pixels; i++) { unsigned rgb = ((const uint16_t*)src)[i]; @@ -338,7 +248,7 @@ void rgb16tobgr15(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb15tobgr32(const uint8_t *src, uint8_t *dst, long src_size) +void rgb15tobgr32(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; uint8_t *d = dst; @@ -361,7 +271,7 @@ void rgb15tobgr32(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb15to24(const uint8_t *src, uint8_t *dst, long src_size) +void rgb15to24(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; uint8_t *d = dst; @@ -376,10 +286,10 @@ void rgb15to24(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb15tobgr16(const uint8_t *src, uint8_t *dst, long src_size) +void rgb15tobgr16(const uint8_t *src, uint8_t *dst, int src_size) { - long i; - long num_pixels = src_size >> 1; + int i; + int num_pixels = src_size >> 1; for (i=0; i<num_pixels; i++) { unsigned rgb = ((const uint16_t*)src)[i]; @@ -387,10 +297,10 @@ void rgb15tobgr16(const uint8_t *src, uint8_t *dst, long src_size) } } -void rgb15tobgr15(const uint8_t *src, uint8_t *dst, long src_size) +void rgb15tobgr15(const uint8_t *src, uint8_t *dst, int src_size) { - long i; - long num_pixels = src_size >> 1; + int i; + int num_pixels = src_size >> 1; for (i=0; i<num_pixels; i++) { unsigned br; @@ -400,10 +310,10 @@ void rgb15tobgr15(const uint8_t *src, uint8_t *dst, long src_size) } } -void bgr8torgb8(const uint8_t *src, uint8_t *dst, long src_size) +void bgr8torgb8(const uint8_t *src, uint8_t *dst, int src_size) { - long i; - long num_pixels = src_size; + int i; + int num_pixels = src_size; for (i=0; i<num_pixels; i++) { unsigned b,g,r; register uint8_t rgb; @@ -416,9 +326,9 @@ void bgr8torgb8(const uint8_t *src, uint8_t *dst, long src_size) } #define DEFINE_SHUFFLE_BYTES(a, b, c, d) \ -void shuffle_bytes_##a##b##c##d(const uint8_t *src, uint8_t *dst, long src_size) \ +void shuffle_bytes_##a##b##c##d(const uint8_t *src, uint8_t *dst, int src_size) \ { \ - long i; \ + int i; \ \ for (i = 0; i < src_size; i+=4) { \ dst[i + 0] = src[i + a]; \ diff --git a/libswscale/rgb2rgb.h b/libswscale/rgb2rgb.h index 31e21af127..6923dd9608 100644 --- a/libswscale/rgb2rgb.h +++ b/libswscale/rgb2rgb.h @@ -32,41 +32,41 @@ #include "libavutil/avutil.h" /* A full collection of RGB to RGB(BGR) converters */ -extern void (*rgb24tobgr32)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb24tobgr16)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb24tobgr15)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb32tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb32to16) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb32to15) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb15to16) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb15tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb15to32) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb16to15) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb16tobgr24)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb16to32) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb24tobgr24)(const uint8_t *src, uint8_t *dst, long 
src_size); -extern void (*rgb24to16) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb24to15) (const uint8_t *src, uint8_t *dst, long src_size); -extern void (*shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb32tobgr16)(const uint8_t *src, uint8_t *dst, long src_size); -extern void (*rgb32tobgr15)(const uint8_t *src, uint8_t *dst, long src_size); - -void rgb24to32 (const uint8_t *src, uint8_t *dst, long src_size); -void rgb32to24 (const uint8_t *src, uint8_t *dst, long src_size); -void rgb16tobgr32(const uint8_t *src, uint8_t *dst, long src_size); -void rgb16to24 (const uint8_t *src, uint8_t *dst, long src_size); -void rgb16tobgr16(const uint8_t *src, uint8_t *dst, long src_size); -void rgb16tobgr15(const uint8_t *src, uint8_t *dst, long src_size); -void rgb15tobgr32(const uint8_t *src, uint8_t *dst, long src_size); -void rgb15to24 (const uint8_t *src, uint8_t *dst, long src_size); -void rgb15tobgr16(const uint8_t *src, uint8_t *dst, long src_size); -void rgb15tobgr15(const uint8_t *src, uint8_t *dst, long src_size); -void bgr8torgb8 (const uint8_t *src, uint8_t *dst, long src_size); - -void shuffle_bytes_0321(const uint8_t *src, uint8_t *dst, long src_size); -void shuffle_bytes_1230(const uint8_t *src, uint8_t *dst, long src_size); -void shuffle_bytes_3012(const uint8_t *src, uint8_t *dst, long src_size); -void shuffle_bytes_3210(const uint8_t *src, uint8_t *dst, long src_size); +extern void (*rgb24tobgr32)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb24tobgr16)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb24tobgr15)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb32tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb32to16) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb32to15) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb15to16) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb15tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb15to32) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb16to15) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb16tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb16to32) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb24tobgr24)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb24to16) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb24to15) (const uint8_t *src, uint8_t *dst, int src_size); +extern void (*shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb32tobgr16)(const uint8_t *src, uint8_t *dst, int src_size); +extern void (*rgb32tobgr15)(const uint8_t *src, uint8_t *dst, int src_size); + +void rgb24to32 (const uint8_t *src, uint8_t *dst, int src_size); +void rgb32to24 (const uint8_t *src, uint8_t *dst, int src_size); +void rgb16tobgr32(const uint8_t *src, uint8_t *dst, int src_size); +void rgb16to24 (const uint8_t *src, uint8_t *dst, int src_size); +void rgb16tobgr16(const uint8_t *src, uint8_t *dst, int src_size); +void rgb16tobgr15(const uint8_t *src, uint8_t *dst, int src_size); +void rgb15tobgr32(const uint8_t *src, uint8_t *dst, int src_size); +void rgb15to24 (const uint8_t *src, uint8_t *dst, int src_size); +void rgb15tobgr16(const uint8_t *src, uint8_t *dst, int src_size); +void rgb15tobgr15(const uint8_t *src, uint8_t *dst, int src_size); +void bgr8torgb8 (const 
uint8_t *src, uint8_t *dst, int src_size); + +void shuffle_bytes_0321(const uint8_t *src, uint8_t *dst, int src_size); +void shuffle_bytes_1230(const uint8_t *src, uint8_t *dst, int src_size); +void shuffle_bytes_3012(const uint8_t *src, uint8_t *dst, int src_size); +void shuffle_bytes_3210(const uint8_t *src, uint8_t *dst, int src_size); #if LIBSWSCALE_VERSION_MAJOR < 1 /* deprecated, use the public versions in swscale.h */ @@ -78,51 +78,48 @@ attribute_deprecated void palette8torgb16(const uint8_t *src, uint8_t *dst, long attribute_deprecated void palette8tobgr16(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette); #endif -/** - * Height should be a multiple of 2 and width should be a multiple of 16. - * (If this is a problem for anyone then tell me, and I will fix it.) - * Chrominance data is only taken from every second line, others are ignored. - * FIXME: Write high quality version. - */ -//void uyvytoyv12(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + +void rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride); /** * Height should be a multiple of 2 and width should be a multiple of 16. * (If this is a problem for anyone then tell me, and I will fix it.) */ extern void (*yv12toyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); /** * Width should be a multiple of 16. */ extern void (*yuv422ptoyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); /** * Height should be a multiple of 2 and width should be a multiple of 16. * (If this is a problem for anyone then tell me, and I will fix it.) */ extern void (*yuy2toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); /** * Height should be a multiple of 2 and width should be a multiple of 16. * (If this is a problem for anyone then tell me, and I will fix it.) */ extern void (*yv12touyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); /** * Width should be a multiple of 16. */ extern void (*yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int dstStride); /** * Height should be a multiple of 2 and width should be a multiple of 2. @@ -131,41 +128,43 @@ extern void (*yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uin * FIXME: Write high quality version. 
*/ extern void (*rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride); -extern void (*planar2x)(const uint8_t *src, uint8_t *dst, long width, long height, - long srcStride, long dstStride); + int width, int height, + int lumStride, int chromStride, int srcStride); +extern void (*planar2x)(const uint8_t *src, uint8_t *dst, int width, int height, + int srcStride, int dstStride); extern void (*interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst, - long width, long height, long src1Stride, - long src2Stride, long dstStride); + int width, int height, int src1Stride, + int src2Stride, int dstStride); extern void (*vu9_to_vu12)(const uint8_t *src1, const uint8_t *src2, uint8_t *dst1, uint8_t *dst2, - long width, long height, - long srcStride1, long srcStride2, - long dstStride1, long dstStride2); + int width, int height, + int srcStride1, int srcStride2, + int dstStride1, int dstStride2); extern void (*yvu9_to_yuy2)(const uint8_t *src1, const uint8_t *src2, const uint8_t *src3, uint8_t *dst, - long width, long height, - long srcStride1, long srcStride2, - long srcStride3, long dstStride); + int width, int height, + int srcStride1, int srcStride2, + int srcStride3, int dstStride); extern void (*uyvytoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); extern void (*uyvytoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); extern void (*yuyvtoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); extern void (*yuyvtoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride); + int width, int height, + int lumStride, int chromStride, int srcStride); + +void sws_rgb2rgb_init(void); -void sws_rgb2rgb_init(int flags); +void rgb2rgb_init_x86(void); #endif /* SWSCALE_RGB2RGB_H */ diff --git a/libswscale/rgb2rgb_template.c b/libswscale/rgb2rgb_template.c index 9af0eaa366..0734e8891b 100644 --- a/libswscale/rgb2rgb_template.c +++ b/libswscale/rgb2rgb_template.c @@ -26,85 +26,13 @@ #include <stddef.h> -#undef PREFETCH -#undef MOVNTQ -#undef EMMS -#undef SFENCE -#undef MMREG_SIZE -#undef PAVGB - -#if COMPILE_TEMPLATE_SSE2 -#define MMREG_SIZE 16 -#else -#define MMREG_SIZE 8 -#endif - -#if COMPILE_TEMPLATE_AMD3DNOW -#define PREFETCH "prefetch" -#define PAVGB "pavgusb" -#elif COMPILE_TEMPLATE_MMX2 -#define PREFETCH "prefetchnta" -#define PAVGB "pavgb" -#else -#define PREFETCH " # nop" -#endif - -#if COMPILE_TEMPLATE_AMD3DNOW -/* On K6 femms is faster than emms. On K7 femms is directly mapped to emms. 
*/ -#define EMMS "femms" -#else -#define EMMS "emms" -#endif - -#if COMPILE_TEMPLATE_MMX2 -#define MOVNTQ "movntq" -#define SFENCE "sfence" -#else -#define MOVNTQ "movq" -#define SFENCE " # nop" -#endif - -static inline void RENAME(rgb24tobgr32)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb24tobgr32_c(const uint8_t *src, uint8_t *dst, int src_size) { uint8_t *dest = dst; const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); - mm_end = end - 23; - __asm__ volatile("movq %0, %%mm7"::"m"(mask32a):"memory"); - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "punpckldq 3%1, %%mm0 \n\t" - "movd 6%1, %%mm1 \n\t" - "punpckldq 9%1, %%mm1 \n\t" - "movd 12%1, %%mm2 \n\t" - "punpckldq 15%1, %%mm2 \n\t" - "movd 18%1, %%mm3 \n\t" - "punpckldq 21%1, %%mm3 \n\t" - "por %%mm7, %%mm0 \n\t" - "por %%mm7, %%mm1 \n\t" - "por %%mm7, %%mm2 \n\t" - "por %%mm7, %%mm3 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - MOVNTQ" %%mm1, 8%0 \n\t" - MOVNTQ" %%mm2, 16%0 \n\t" - MOVNTQ" %%mm3, 24%0" - :"=m"(*dest) - :"m"(*s) - :"memory"); - dest += 32; - s += 24; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif + while (s < end) { #if HAVE_BIGENDIAN /* RGB24 (= R,G,B) -> RGB32 (= A,B,G,R) */ @@ -122,76 +50,14 @@ static inline void RENAME(rgb24tobgr32)(const uint8_t *src, uint8_t *dst, long s } } -#define STORE_BGR24_MMX \ - "psrlq $8, %%mm2 \n\t" \ - "psrlq $8, %%mm3 \n\t" \ - "psrlq $8, %%mm6 \n\t" \ - "psrlq $8, %%mm7 \n\t" \ - "pand "MANGLE(mask24l)", %%mm0\n\t" \ - "pand "MANGLE(mask24l)", %%mm1\n\t" \ - "pand "MANGLE(mask24l)", %%mm4\n\t" \ - "pand "MANGLE(mask24l)", %%mm5\n\t" \ - "pand "MANGLE(mask24h)", %%mm2\n\t" \ - "pand "MANGLE(mask24h)", %%mm3\n\t" \ - "pand "MANGLE(mask24h)", %%mm6\n\t" \ - "pand "MANGLE(mask24h)", %%mm7\n\t" \ - "por %%mm2, %%mm0 \n\t" \ - "por %%mm3, %%mm1 \n\t" \ - "por %%mm6, %%mm4 \n\t" \ - "por %%mm7, %%mm5 \n\t" \ - \ - "movq %%mm1, %%mm2 \n\t" \ - "movq %%mm4, %%mm3 \n\t" \ - "psllq $48, %%mm2 \n\t" \ - "psllq $32, %%mm3 \n\t" \ - "pand "MANGLE(mask24hh)", %%mm2\n\t" \ - "pand "MANGLE(mask24hhh)", %%mm3\n\t" \ - "por %%mm2, %%mm0 \n\t" \ - "psrlq $16, %%mm1 \n\t" \ - "psrlq $32, %%mm4 \n\t" \ - "psllq $16, %%mm5 \n\t" \ - "por %%mm3, %%mm1 \n\t" \ - "pand "MANGLE(mask24hhhh)", %%mm5\n\t" \ - "por %%mm5, %%mm4 \n\t" \ - \ - MOVNTQ" %%mm0, %0 \n\t" \ - MOVNTQ" %%mm1, 8%0 \n\t" \ - MOVNTQ" %%mm4, 16%0" - - -static inline void RENAME(rgb32tobgr24)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb32tobgr24_c(const uint8_t *src, uint8_t *dst, int src_size) { uint8_t *dest = dst; const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif + end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); - mm_end = end - 31; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq 8%1, %%mm1 \n\t" - "movq 16%1, %%mm4 \n\t" - "movq 24%1, %%mm5 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm1, %%mm3 \n\t" - "movq %%mm4, %%mm6 \n\t" - "movq %%mm5, %%mm7 \n\t" - STORE_BGR24_MMX - :"=m"(*dest) - :"m"(*s) - :"memory"); - dest += 24; - s += 32; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif + while (s < end) { #if HAVE_BIGENDIAN /* RGB32 (= A,B,G,R) -> RGB24 (= R,G,B) */ @@ -215,39 
+81,13 @@ static inline void RENAME(rgb32tobgr24)(const uint8_t *src, uint8_t *dst, long s MMX2, 3DNOW optimization by Nick Kurshev 32-bit C version, and and&add trick by Michael Niedermayer */ -static inline void RENAME(rgb15to16)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb15to16_c(const uint8_t *src, uint8_t *dst, int src_size) { register const uint8_t* s=src; register uint8_t* d=dst; register const uint8_t *end; const uint8_t *mm_end; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s)); - __asm__ volatile("movq %0, %%mm4"::"m"(mask15s)); - mm_end = end - 15; - while (s<mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq 8%1, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "pand %%mm4, %%mm0 \n\t" - "pand %%mm4, %%mm2 \n\t" - "paddw %%mm1, %%mm0 \n\t" - "paddw %%mm3, %%mm2 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - MOVNTQ" %%mm2, 8%0" - :"=m"(*d) - :"m"(*s) - ); - d+=16; - s+=16; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif mm_end = end - 3; while (s < mm_end) { register unsigned x= *((const uint32_t *)s); @@ -261,44 +101,14 @@ static inline void RENAME(rgb15to16)(const uint8_t *src, uint8_t *dst, long src_ } } -static inline void RENAME(rgb16to15)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb16to15_c(const uint8_t *src, uint8_t *dst, int src_size) { register const uint8_t* s=src; register uint8_t* d=dst; register const uint8_t *end; const uint8_t *mm_end; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s)); - __asm__ volatile("movq %0, %%mm7"::"m"(mask15rg)); - __asm__ volatile("movq %0, %%mm6"::"m"(mask15b)); - mm_end = end - 15; - while (s<mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq 8%1, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlq $1, %%mm0 \n\t" - "psrlq $1, %%mm2 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm3 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm3, %%mm2 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - MOVNTQ" %%mm2, 8%0" - :"=m"(*d) - :"m"(*s) - ); - d+=16; - s+=16; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif + mm_end = end - 3; while (s < mm_end) { register uint32_t x= *((const uint32_t*)s); @@ -312,369 +122,61 @@ static inline void RENAME(rgb16to15)(const uint8_t *src, uint8_t *dst, long src_ } } -static inline void RENAME(rgb32to16)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb32to16_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - mm_end = end - 15; -#if 1 //is faster only if multiplies are reasonably fast (FIXME figure out on which CPUs this is faster, on Athlon it is slightly faster) - __asm__ volatile( - "movq %3, %%mm5 \n\t" - "movq %4, %%mm6 \n\t" - "movq %5, %%mm7 \n\t" - "jmp 2f \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 32(%1) \n\t" - "movd (%1), %%mm0 \n\t" - "movd 4(%1), %%mm3 \n\t" - "punpckldq 8(%1), %%mm0 \n\t" - "punpckldq 12(%1), %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm3, %%mm4 \n\t" - "pand %%mm6, %%mm0 \n\t" - "pand %%mm6, %%mm3 \n\t" - "pmaddwd %%mm7, %%mm0 \n\t" - "pmaddwd %%mm7, %%mm3 \n\t" - "pand %%mm5, %%mm1 \n\t" - "pand %%mm5, %%mm4 \n\t" - "por 
%%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "psrld $5, %%mm0 \n\t" - "pslld $11, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, (%0) \n\t" - "add $16, %1 \n\t" - "add $8, %0 \n\t" - "2: \n\t" - "cmp %2, %1 \n\t" - " jb 1b \n\t" - : "+r" (d), "+r"(s) - : "r" (mm_end), "m" (mask3216g), "m" (mask3216br), "m" (mul3216) - ); -#else - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_16mask),"m"(green_16mask)); - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 4%1, %%mm3 \n\t" - "punpckldq 8%1, %%mm0 \n\t" - "punpckldq 12%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psrlq $3, %%mm0 \n\t" - "psrlq $3, %%mm3 \n\t" - "pand %2, %%mm0 \n\t" - "pand %2, %%mm3 \n\t" - "psrlq $5, %%mm1 \n\t" - "psrlq $5, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $8, %%mm2 \n\t" - "psrlq $8, %%mm5 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); - d += 4; - s += 16; - } -#endif - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif + while (s < end) { register int rgb = *(const uint32_t*)s; s += 4; *d++ = ((rgb&0xFF)>>3) + ((rgb&0xFC00)>>5) + ((rgb&0xF80000)>>8); } } -static inline void RENAME(rgb32tobgr16)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb32tobgr16_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_16mask),"m"(green_16mask)); - mm_end = end - 15; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 4%1, %%mm3 \n\t" - "punpckldq 8%1, %%mm0 \n\t" - "punpckldq 12%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psllq $8, %%mm0 \n\t" - "psllq $8, %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm3 \n\t" - "psrlq $5, %%mm1 \n\t" - "psrlq $5, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $19, %%mm2 \n\t" - "psrlq $19, %%mm5 \n\t" - "pand %2, %%mm2 \n\t" - "pand %2, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); - d += 4; - s += 16; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register int rgb = *(const uint32_t*)s; s += 4; *d++ = ((rgb&0xF8)<<8) + ((rgb&0xFC00)>>5) + ((rgb&0xF80000)>>19); } } -static inline void RENAME(rgb32to15)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb32to15_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - mm_end = end - 15; 
-#if 1 //is faster only if multiplies are reasonably fast (FIXME figure out on which CPUs this is faster, on Athlon it is slightly faster) - __asm__ volatile( - "movq %3, %%mm5 \n\t" - "movq %4, %%mm6 \n\t" - "movq %5, %%mm7 \n\t" - "jmp 2f \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 32(%1) \n\t" - "movd (%1), %%mm0 \n\t" - "movd 4(%1), %%mm3 \n\t" - "punpckldq 8(%1), %%mm0 \n\t" - "punpckldq 12(%1), %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm3, %%mm4 \n\t" - "pand %%mm6, %%mm0 \n\t" - "pand %%mm6, %%mm3 \n\t" - "pmaddwd %%mm7, %%mm0 \n\t" - "pmaddwd %%mm7, %%mm3 \n\t" - "pand %%mm5, %%mm1 \n\t" - "pand %%mm5, %%mm4 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "psrld $6, %%mm0 \n\t" - "pslld $10, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, (%0) \n\t" - "add $16, %1 \n\t" - "add $8, %0 \n\t" - "2: \n\t" - "cmp %2, %1 \n\t" - " jb 1b \n\t" - : "+r" (d), "+r"(s) - : "r" (mm_end), "m" (mask3215g), "m" (mask3216br), "m" (mul3215) - ); -#else - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_15mask),"m"(green_15mask)); - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 4%1, %%mm3 \n\t" - "punpckldq 8%1, %%mm0 \n\t" - "punpckldq 12%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psrlq $3, %%mm0 \n\t" - "psrlq $3, %%mm3 \n\t" - "pand %2, %%mm0 \n\t" - "pand %2, %%mm3 \n\t" - "psrlq $6, %%mm1 \n\t" - "psrlq $6, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $9, %%mm2 \n\t" - "psrlq $9, %%mm5 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); - d += 4; - s += 16; - } -#endif - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register int rgb = *(const uint32_t*)s; s += 4; *d++ = ((rgb&0xFF)>>3) + ((rgb&0xF800)>>6) + ((rgb&0xF80000)>>9); } } -static inline void RENAME(rgb32tobgr15)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb32tobgr15_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_15mask),"m"(green_15mask)); - mm_end = end - 15; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 4%1, %%mm3 \n\t" - "punpckldq 8%1, %%mm0 \n\t" - "punpckldq 12%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psllq $7, %%mm0 \n\t" - "psllq $7, %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm3 \n\t" - "psrlq $6, %%mm1 \n\t" - "psrlq $6, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $19, %%mm2 \n\t" - "psrlq $19, %%mm5 \n\t" - "pand %2, %%mm2 \n\t" - "pand %2, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - 
:"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); - d += 4; - s += 16; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register int rgb = *(const uint32_t*)s; s += 4; *d++ = ((rgb&0xF8)<<7) + ((rgb&0xF800)>>6) + ((rgb&0xF80000)>>19); } } -static inline void RENAME(rgb24tobgr16)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb24tobgr16_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_16mask),"m"(green_16mask)); - mm_end = end - 11; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 3%1, %%mm3 \n\t" - "punpckldq 6%1, %%mm0 \n\t" - "punpckldq 9%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psrlq $3, %%mm0 \n\t" - "psrlq $3, %%mm3 \n\t" - "pand %2, %%mm0 \n\t" - "pand %2, %%mm3 \n\t" - "psrlq $5, %%mm1 \n\t" - "psrlq $5, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $8, %%mm2 \n\t" - "psrlq $8, %%mm5 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); - d += 4; - s += 12; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { const int b = *s++; const int g = *s++; @@ -683,59 +185,12 @@ static inline void RENAME(rgb24tobgr16)(const uint8_t *src, uint8_t *dst, long s } } -static inline void RENAME(rgb24to16)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb24to16_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_16mask),"m"(green_16mask)); - mm_end = end - 15; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 3%1, %%mm3 \n\t" - "punpckldq 6%1, %%mm0 \n\t" - "punpckldq 9%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psllq $8, %%mm0 \n\t" - "psllq $8, %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm3 \n\t" - "psrlq $5, %%mm1 \n\t" - "psrlq $5, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $19, %%mm2 \n\t" - "psrlq $19, %%mm5 \n\t" - "pand %2, %%mm2 \n\t" - "pand %2, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); - d += 4; - s += 12; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { const int r = *s++; const int g = *s++; @@ -744,59 +199,12 @@ static inline void RENAME(rgb24to16)(const uint8_t *src, uint8_t *dst, long src_ } } 
-static inline void RENAME(rgb24tobgr15)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb24tobgr15_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_15mask),"m"(green_15mask)); - mm_end = end - 11; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 3%1, %%mm3 \n\t" - "punpckldq 6%1, %%mm0 \n\t" - "punpckldq 9%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psrlq $3, %%mm0 \n\t" - "psrlq $3, %%mm3 \n\t" - "pand %2, %%mm0 \n\t" - "pand %2, %%mm3 \n\t" - "psrlq $6, %%mm1 \n\t" - "psrlq $6, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $9, %%mm2 \n\t" - "psrlq $9, %%mm5 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); - d += 4; - s += 12; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { const int b = *s++; const int g = *s++; @@ -805,59 +213,12 @@ static inline void RENAME(rgb24tobgr15)(const uint8_t *src, uint8_t *dst, long s } } -static inline void RENAME(rgb24to15)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb24to15_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint8_t *s = src; const uint8_t *end; -#if COMPILE_TEMPLATE_MMX - const uint8_t *mm_end; -#endif uint16_t *d = (uint16_t *)dst; end = s + src_size; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); - __asm__ volatile( - "movq %0, %%mm7 \n\t" - "movq %1, %%mm6 \n\t" - ::"m"(red_15mask),"m"(green_15mask)); - mm_end = end - 15; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movd %1, %%mm0 \n\t" - "movd 3%1, %%mm3 \n\t" - "punpckldq 6%1, %%mm0 \n\t" - "punpckldq 9%1, %%mm3 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm3, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "psllq $7, %%mm0 \n\t" - "psllq $7, %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm3 \n\t" - "psrlq $6, %%mm1 \n\t" - "psrlq $6, %%mm4 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "psrlq $19, %%mm2 \n\t" - "psrlq $19, %%mm5 \n\t" - "pand %2, %%mm2 \n\t" - "pand %2, %%mm5 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm5, %%mm3 \n\t" - "psllq $16, %%mm3 \n\t" - "por %%mm3, %%mm0 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - :"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); - d += 4; - s += 12; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { const int r = *s++; const int g = *s++; @@ -887,104 +248,12 @@ static inline void RENAME(rgb24to15)(const uint8_t *src, uint8_t *dst, long src_ | original bits */ -static inline void RENAME(rgb15tobgr24)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb15tobgr24_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; -#if COMPILE_TEMPLATE_MMX - const uint16_t *mm_end; -#endif uint8_t *d = dst; 
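The rgb24to16_c / rgb24tobgr15_c family above does the same 16/15-bit packing but reads the three channels as separate bytes. The exact shift expression of the kept C loop is cut off in the quoted hunks, so this sketch uses the standard 5-6-5 packing and a hypothetical helper name:

    #include <stdint.h>

    /* Illustrative only: 24-bit packed input, one byte per channel,
     * packed down to 16-bit 5-6-5. */
    static void pack24to565_sketch(const uint8_t *s, uint16_t *d, int pixels)
    {
        int i;
        for (i = 0; i < pixels; i++) {
            const int c0 = *s++;   /* r/g/b or b/g/r order, depending on the variant */
            const int c1 = *s++;
            const int c2 = *s++;
            d[i] = (uint16_t)(((c0 & 0xF8) << 8) | ((c1 & 0xFC) << 3) | (c2 >> 3));
        }
    }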
const uint16_t *s = (const uint16_t*)src; end = s + src_size/2; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); - mm_end = end - 7; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq %1, %%mm1 \n\t" - "movq %1, %%mm2 \n\t" - "pand %2, %%mm0 \n\t" - "pand %3, %%mm1 \n\t" - "pand %4, %%mm2 \n\t" - "psllq $3, %%mm0 \n\t" - "psrlq $2, %%mm1 \n\t" - "psrlq $7, %%mm2 \n\t" - "movq %%mm0, %%mm3 \n\t" - "movq %%mm1, %%mm4 \n\t" - "movq %%mm2, %%mm5 \n\t" - "punpcklwd %5, %%mm0 \n\t" - "punpcklwd %5, %%mm1 \n\t" - "punpcklwd %5, %%mm2 \n\t" - "punpckhwd %5, %%mm3 \n\t" - "punpckhwd %5, %%mm4 \n\t" - "punpckhwd %5, %%mm5 \n\t" - "psllq $8, %%mm1 \n\t" - "psllq $16, %%mm2 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm2, %%mm0 \n\t" - "psllq $8, %%mm4 \n\t" - "psllq $16, %%mm5 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm5, %%mm3 \n\t" - - "movq %%mm0, %%mm6 \n\t" - "movq %%mm3, %%mm7 \n\t" - - "movq 8%1, %%mm0 \n\t" - "movq 8%1, %%mm1 \n\t" - "movq 8%1, %%mm2 \n\t" - "pand %2, %%mm0 \n\t" - "pand %3, %%mm1 \n\t" - "pand %4, %%mm2 \n\t" - "psllq $3, %%mm0 \n\t" - "psrlq $2, %%mm1 \n\t" - "psrlq $7, %%mm2 \n\t" - "movq %%mm0, %%mm3 \n\t" - "movq %%mm1, %%mm4 \n\t" - "movq %%mm2, %%mm5 \n\t" - "punpcklwd %5, %%mm0 \n\t" - "punpcklwd %5, %%mm1 \n\t" - "punpcklwd %5, %%mm2 \n\t" - "punpckhwd %5, %%mm3 \n\t" - "punpckhwd %5, %%mm4 \n\t" - "punpckhwd %5, %%mm5 \n\t" - "psllq $8, %%mm1 \n\t" - "psllq $16, %%mm2 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm2, %%mm0 \n\t" - "psllq $8, %%mm4 \n\t" - "psllq $16, %%mm5 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm5, %%mm3 \n\t" - - :"=m"(*d) - :"m"(*s),"m"(mask15b),"m"(mask15g),"m"(mask15r), "m"(mmx_null) - :"memory"); - /* borrowed 32 to 24 */ - __asm__ volatile( - "movq %%mm0, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "movq %%mm6, %%mm0 \n\t" - "movq %%mm7, %%mm1 \n\t" - - "movq %%mm4, %%mm6 \n\t" - "movq %%mm5, %%mm7 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm1, %%mm3 \n\t" - - STORE_BGR24_MMX - - :"=m"(*d) - :"m"(*s) - :"memory"); - d += 24; - s += 8; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register uint16_t bgr; bgr = *s++; @@ -994,103 +263,12 @@ static inline void RENAME(rgb15tobgr24)(const uint8_t *src, uint8_t *dst, long s } } -static inline void RENAME(rgb16tobgr24)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb16tobgr24_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; -#if COMPILE_TEMPLATE_MMX - const uint16_t *mm_end; -#endif uint8_t *d = (uint8_t *)dst; const uint16_t *s = (const uint16_t *)src; end = s + src_size/2; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); - mm_end = end - 7; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq %1, %%mm1 \n\t" - "movq %1, %%mm2 \n\t" - "pand %2, %%mm0 \n\t" - "pand %3, %%mm1 \n\t" - "pand %4, %%mm2 \n\t" - "psllq $3, %%mm0 \n\t" - "psrlq $3, %%mm1 \n\t" - "psrlq $8, %%mm2 \n\t" - "movq %%mm0, %%mm3 \n\t" - "movq %%mm1, %%mm4 \n\t" - "movq %%mm2, %%mm5 \n\t" - "punpcklwd %5, %%mm0 \n\t" - "punpcklwd %5, %%mm1 \n\t" - "punpcklwd %5, %%mm2 \n\t" - "punpckhwd %5, %%mm3 \n\t" - "punpckhwd %5, %%mm4 \n\t" - "punpckhwd %5, %%mm5 \n\t" - "psllq $8, %%mm1 \n\t" - "psllq $16, %%mm2 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm2, %%mm0 \n\t" - "psllq $8, %%mm4 \n\t" - "psllq $16, %%mm5 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm5, %%mm3 \n\t" - - "movq %%mm0, %%mm6 
\n\t" - "movq %%mm3, %%mm7 \n\t" - - "movq 8%1, %%mm0 \n\t" - "movq 8%1, %%mm1 \n\t" - "movq 8%1, %%mm2 \n\t" - "pand %2, %%mm0 \n\t" - "pand %3, %%mm1 \n\t" - "pand %4, %%mm2 \n\t" - "psllq $3, %%mm0 \n\t" - "psrlq $3, %%mm1 \n\t" - "psrlq $8, %%mm2 \n\t" - "movq %%mm0, %%mm3 \n\t" - "movq %%mm1, %%mm4 \n\t" - "movq %%mm2, %%mm5 \n\t" - "punpcklwd %5, %%mm0 \n\t" - "punpcklwd %5, %%mm1 \n\t" - "punpcklwd %5, %%mm2 \n\t" - "punpckhwd %5, %%mm3 \n\t" - "punpckhwd %5, %%mm4 \n\t" - "punpckhwd %5, %%mm5 \n\t" - "psllq $8, %%mm1 \n\t" - "psllq $16, %%mm2 \n\t" - "por %%mm1, %%mm0 \n\t" - "por %%mm2, %%mm0 \n\t" - "psllq $8, %%mm4 \n\t" - "psllq $16, %%mm5 \n\t" - "por %%mm4, %%mm3 \n\t" - "por %%mm5, %%mm3 \n\t" - :"=m"(*d) - :"m"(*s),"m"(mask16b),"m"(mask16g),"m"(mask16r),"m"(mmx_null) - :"memory"); - /* borrowed 32 to 24 */ - __asm__ volatile( - "movq %%mm0, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "movq %%mm6, %%mm0 \n\t" - "movq %%mm7, %%mm1 \n\t" - - "movq %%mm4, %%mm6 \n\t" - "movq %%mm5, %%mm7 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm1, %%mm3 \n\t" - - STORE_BGR24_MMX - - :"=m"(*d) - :"m"(*s) - :"memory"); - d += 24; - s += 8; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register uint16_t bgr; bgr = *s++; @@ -1100,61 +278,12 @@ static inline void RENAME(rgb16tobgr24)(const uint8_t *src, uint8_t *dst, long s } } -/* - * mm0 = 00 B3 00 B2 00 B1 00 B0 - * mm1 = 00 G3 00 G2 00 G1 00 G0 - * mm2 = 00 R3 00 R2 00 R1 00 R0 - * mm6 = FF FF FF FF FF FF FF FF - * mm7 = 00 00 00 00 00 00 00 00 - */ -#define PACK_RGB32 \ - "packuswb %%mm7, %%mm0 \n\t" /* 00 00 00 00 B3 B2 B1 B0 */ \ - "packuswb %%mm7, %%mm1 \n\t" /* 00 00 00 00 G3 G2 G1 G0 */ \ - "packuswb %%mm7, %%mm2 \n\t" /* 00 00 00 00 R3 R2 R1 R0 */ \ - "punpcklbw %%mm1, %%mm0 \n\t" /* G3 B3 G2 B2 G1 B1 G0 B0 */ \ - "punpcklbw %%mm6, %%mm2 \n\t" /* FF R3 FF R2 FF R1 FF R0 */ \ - "movq %%mm0, %%mm3 \n\t" \ - "punpcklwd %%mm2, %%mm0 \n\t" /* FF R1 G1 B1 FF R0 G0 B0 */ \ - "punpckhwd %%mm2, %%mm3 \n\t" /* FF R3 G3 B3 FF R2 G2 B2 */ \ - MOVNTQ" %%mm0, %0 \n\t" \ - MOVNTQ" %%mm3, 8%0 \n\t" \ - -static inline void RENAME(rgb15to32)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb15to32_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; -#if COMPILE_TEMPLATE_MMX - const uint16_t *mm_end; -#endif uint8_t *d = dst; const uint16_t *s = (const uint16_t *)src; end = s + src_size/2; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); - __asm__ volatile("pxor %%mm7,%%mm7 \n\t":::"memory"); - __asm__ volatile("pcmpeqd %%mm6,%%mm6 \n\t":::"memory"); - mm_end = end - 3; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq %1, %%mm1 \n\t" - "movq %1, %%mm2 \n\t" - "pand %2, %%mm0 \n\t" - "pand %3, %%mm1 \n\t" - "pand %4, %%mm2 \n\t" - "psllq $3, %%mm0 \n\t" - "psrlq $2, %%mm1 \n\t" - "psrlq $7, %%mm2 \n\t" - PACK_RGB32 - :"=m"(*d) - :"m"(*s),"m"(mask15b),"m"(mask15g),"m"(mask15r) - :"memory"); - d += 16; - s += 4; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register uint16_t bgr; bgr = *s++; @@ -1172,42 +301,12 @@ static inline void RENAME(rgb15to32)(const uint8_t *src, uint8_t *dst, long src_ } } -static inline void RENAME(rgb16to32)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb16to32_c(const uint8_t *src, uint8_t *dst, int src_size) { const uint16_t *end; -#if COMPILE_TEMPLATE_MMX - const 
uint16_t *mm_end; -#endif uint8_t *d = dst; const uint16_t *s = (const uint16_t*)src; end = s + src_size/2; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); - __asm__ volatile("pxor %%mm7,%%mm7 \n\t":::"memory"); - __asm__ volatile("pcmpeqd %%mm6,%%mm6 \n\t":::"memory"); - mm_end = end - 3; - while (s < mm_end) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq %1, %%mm1 \n\t" - "movq %1, %%mm2 \n\t" - "pand %2, %%mm0 \n\t" - "pand %3, %%mm1 \n\t" - "pand %4, %%mm2 \n\t" - "psllq $3, %%mm0 \n\t" - "psrlq $3, %%mm1 \n\t" - "psrlq $8, %%mm2 \n\t" - PACK_RGB32 - :"=m"(*d) - :"m"(*s),"m"(mask16b),"m"(mask16g),"m"(mask16r) - :"memory"); - d += 16; - s += 4; - } - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); -#endif while (s < end) { register uint16_t bgr; bgr = *s++; @@ -1225,63 +324,11 @@ static inline void RENAME(rgb16to32)(const uint8_t *src, uint8_t *dst, long src_ } } -static inline void RENAME(shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void shuffle_bytes_2103_c(const uint8_t *src, uint8_t *dst, int src_size) { - x86_reg idx = 15 - src_size; + int idx = 15 - src_size; const uint8_t *s = src-idx; uint8_t *d = dst-idx; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "test %0, %0 \n\t" - "jns 2f \n\t" - PREFETCH" (%1, %0) \n\t" - "movq %3, %%mm7 \n\t" - "pxor %4, %%mm7 \n\t" - "movq %%mm7, %%mm6 \n\t" - "pxor %5, %%mm7 \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 32(%1, %0) \n\t" - "movq (%1, %0), %%mm0 \n\t" - "movq 8(%1, %0), %%mm1 \n\t" -# if COMPILE_TEMPLATE_MMX2 - "pshufw $177, %%mm0, %%mm3 \n\t" - "pshufw $177, %%mm1, %%mm5 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm6, %%mm3 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm6, %%mm5 \n\t" - "por %%mm3, %%mm0 \n\t" - "por %%mm5, %%mm1 \n\t" -# else - "movq %%mm0, %%mm2 \n\t" - "movq %%mm1, %%mm4 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm6, %%mm2 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm6, %%mm4 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq %%mm4, %%mm5 \n\t" - "pslld $16, %%mm2 \n\t" - "psrld $16, %%mm3 \n\t" - "pslld $16, %%mm4 \n\t" - "psrld $16, %%mm5 \n\t" - "por %%mm2, %%mm0 \n\t" - "por %%mm4, %%mm1 \n\t" - "por %%mm3, %%mm0 \n\t" - "por %%mm5, %%mm1 \n\t" -# endif - MOVNTQ" %%mm0, (%2, %0) \n\t" - MOVNTQ" %%mm1, 8(%2, %0) \n\t" - "add $16, %0 \n\t" - "js 1b \n\t" - SFENCE" \n\t" - EMMS" \n\t" - "2: \n\t" - : "+&r"(idx) - : "r" (s), "r" (d), "m" (mask32b), "m" (mask32r), "m" (mmx_one) - : "memory"); -#endif for (; idx<15; idx+=4) { register int v = *(const uint32_t *)&s[idx], g = v & 0xff00ff00; v &= 0xff00ff; @@ -1289,66 +336,9 @@ static inline void RENAME(shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, } } -static inline void RENAME(rgb24tobgr24)(const uint8_t *src, uint8_t *dst, long src_size) +static inline void rgb24tobgr24_c(const uint8_t *src, uint8_t *dst, int src_size) { unsigned i; -#if COMPILE_TEMPLATE_MMX - x86_reg mmx_size= 23 - src_size; - __asm__ volatile ( - "test %%"REG_a", %%"REG_a" \n\t" - "jns 2f \n\t" - "movq "MANGLE(mask24r)", %%mm5 \n\t" - "movq "MANGLE(mask24g)", %%mm6 \n\t" - "movq "MANGLE(mask24b)", %%mm7 \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 32(%1, %%"REG_a") \n\t" - "movq (%1, %%"REG_a"), %%mm0 \n\t" // BGR BGR BG - "movq (%1, %%"REG_a"), %%mm1 \n\t" // BGR BGR BG - "movq 2(%1, %%"REG_a"), %%mm2 \n\t" // R BGR BGR B - "psllq $16, %%mm0 \n\t" // 00 BGR BGR - "pand %%mm5, %%mm0 \n\t" - "pand %%mm6, %%mm1 \n\t" - "pand %%mm7, %%mm2 \n\t" - "por %%mm0, %%mm1 
\n\t" - "por %%mm2, %%mm1 \n\t" - "movq 6(%1, %%"REG_a"), %%mm0 \n\t" // BGR BGR BG - MOVNTQ" %%mm1, (%2, %%"REG_a") \n\t" // RGB RGB RG - "movq 8(%1, %%"REG_a"), %%mm1 \n\t" // R BGR BGR B - "movq 10(%1, %%"REG_a"), %%mm2 \n\t" // GR BGR BGR - "pand %%mm7, %%mm0 \n\t" - "pand %%mm5, %%mm1 \n\t" - "pand %%mm6, %%mm2 \n\t" - "por %%mm0, %%mm1 \n\t" - "por %%mm2, %%mm1 \n\t" - "movq 14(%1, %%"REG_a"), %%mm0 \n\t" // R BGR BGR B - MOVNTQ" %%mm1, 8(%2, %%"REG_a") \n\t" // B RGB RGB R - "movq 16(%1, %%"REG_a"), %%mm1 \n\t" // GR BGR BGR - "movq 18(%1, %%"REG_a"), %%mm2 \n\t" // BGR BGR BG - "pand %%mm6, %%mm0 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm5, %%mm2 \n\t" - "por %%mm0, %%mm1 \n\t" - "por %%mm2, %%mm1 \n\t" - MOVNTQ" %%mm1, 16(%2, %%"REG_a") \n\t" - "add $24, %%"REG_a" \n\t" - " js 1b \n\t" - "2: \n\t" - : "+a" (mmx_size) - : "r" (src-mmx_size), "r"(dst-mmx_size) - ); - - __asm__ volatile(SFENCE:::"memory"); - __asm__ volatile(EMMS:::"memory"); - - if (mmx_size==23) return; //finished, was multiple of 8 - - src+= src_size; - dst+= src_size; - src_size= 23-mmx_size; - src-= src_size; - dst-= src_size; -#endif for (i=0; i<src_size; i+=3) { register uint8_t x; x = src[i + 2]; @@ -1358,98 +348,16 @@ static inline void RENAME(rgb24tobgr24)(const uint8_t *src, uint8_t *dst, long s } } -static inline void RENAME(yuvPlanartoyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride, long vertLumPerChroma) +static inline void yuvPlanartoyuy2_c(const uint8_t *ysrc, const uint8_t *usrc, + const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, + int dstStride, int vertLumPerChroma) { - long y; - const x86_reg chromWidth= width>>1; + int y; + const int chromWidth = width >> 1; for (y=0; y<height; y++) { -#if COMPILE_TEMPLATE_MMX - //FIXME handle 2 lines at once (fewer prefetches, reuse some chroma, but very likely memory-limited anyway) - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 32(%1, %%"REG_a", 2) \n\t" - PREFETCH" 32(%2, %%"REG_a") \n\t" - PREFETCH" 32(%3, %%"REG_a") \n\t" - "movq (%2, %%"REG_a"), %%mm0 \n\t" // U(0) - "movq %%mm0, %%mm2 \n\t" // U(0) - "movq (%3, %%"REG_a"), %%mm1 \n\t" // V(0) - "punpcklbw %%mm1, %%mm0 \n\t" // UVUV UVUV(0) - "punpckhbw %%mm1, %%mm2 \n\t" // UVUV UVUV(8) - - "movq (%1, %%"REG_a",2), %%mm3 \n\t" // Y(0) - "movq 8(%1, %%"REG_a",2), %%mm5 \n\t" // Y(8) - "movq %%mm3, %%mm4 \n\t" // Y(0) - "movq %%mm5, %%mm6 \n\t" // Y(8) - "punpcklbw %%mm0, %%mm3 \n\t" // YUYV YUYV(0) - "punpckhbw %%mm0, %%mm4 \n\t" // YUYV YUYV(4) - "punpcklbw %%mm2, %%mm5 \n\t" // YUYV YUYV(8) - "punpckhbw %%mm2, %%mm6 \n\t" // YUYV YUYV(12) - - MOVNTQ" %%mm3, (%0, %%"REG_a", 4) \n\t" - MOVNTQ" %%mm4, 8(%0, %%"REG_a", 4) \n\t" - MOVNTQ" %%mm5, 16(%0, %%"REG_a", 4) \n\t" - MOVNTQ" %%mm6, 24(%0, %%"REG_a", 4) \n\t" - - "add $8, %%"REG_a" \n\t" - "cmp %4, %%"REG_a" \n\t" - " jb 1b \n\t" - ::"r"(dst), "r"(ysrc), "r"(usrc), "r"(vsrc), "g" (chromWidth) - : "%"REG_a - ); -#else - -#if ARCH_ALPHA && HAVE_MVI -#define pl2yuy2(n) \ - y1 = yc[n]; \ - y2 = yc2[n]; \ - u = uc[n]; \ - v = vc[n]; \ - __asm__("unpkbw %1, %0" : "=r"(y1) : "r"(y1)); \ - __asm__("unpkbw %1, %0" : "=r"(y2) : "r"(y2)); \ - __asm__("unpkbl %1, %0" : "=r"(u) : "r"(u)); \ - __asm__("unpkbl %1, %0" : "=r"(v) : "r"(v)); \ - yuv1 = (u << 8) + (v << 24); \ - yuv2 = yuv1 + y2; \ - yuv1 += y1; \ - qdst[n] = yuv1; \ - qdst2[n] = yuv2; - - int i; 
- uint64_t *qdst = (uint64_t *) dst; - uint64_t *qdst2 = (uint64_t *) (dst + dstStride); - const uint32_t *yc = (uint32_t *) ysrc; - const uint32_t *yc2 = (uint32_t *) (ysrc + lumStride); - const uint16_t *uc = (uint16_t*) usrc, *vc = (uint16_t*) vsrc; - for (i = 0; i < chromWidth; i += 8) { - uint64_t y1, y2, yuv1, yuv2; - uint64_t u, v; - /* Prefetch */ - __asm__("ldq $31,64(%0)" :: "r"(yc)); - __asm__("ldq $31,64(%0)" :: "r"(yc2)); - __asm__("ldq $31,64(%0)" :: "r"(uc)); - __asm__("ldq $31,64(%0)" :: "r"(vc)); - - pl2yuy2(0); - pl2yuy2(1); - pl2yuy2(2); - pl2yuy2(3); - - yc += 4; - yc2 += 4; - uc += 4; - vc += 4; - qdst += 4; - qdst2 += 4; - } - y++; - ysrc += lumStride; - dst += dstStride; - -#elif HAVE_FAST_64BIT +#if HAVE_FAST_64BIT int i; uint64_t *ldst = (uint64_t *) dst; const uint8_t *yc = ysrc, *uc = usrc, *vc = vsrc; @@ -1481,7 +389,6 @@ static inline void RENAME(yuvPlanartoyuy2)(const uint8_t *ysrc, const uint8_t *u vc++; } #endif -#endif if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) { usrc += chromStride; vsrc += chromStride; @@ -1489,70 +396,32 @@ static inline void RENAME(yuvPlanartoyuy2)(const uint8_t *ysrc, const uint8_t *u ysrc += lumStride; dst += dstStride; } -#if COMPILE_TEMPLATE_MMX - __asm__(EMMS" \n\t" - SFENCE" \n\t" - :::"memory"); -#endif } /** * Height should be a multiple of 2 and width should be a multiple of 16. * (If this is a problem for anyone then tell me, and I will fix it.) */ -static inline void RENAME(yv12toyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride) +static inline void yv12toyuy2_c(const uint8_t *ysrc, const uint8_t *usrc, + const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, + int dstStride) { //FIXME interpolate chroma - RENAME(yuvPlanartoyuy2)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 2); + yuvPlanartoyuy2_c(ysrc, usrc, vsrc, dst, width, height, lumStride, + chromStride, dstStride, 2); } -static inline void RENAME(yuvPlanartouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride, long vertLumPerChroma) +static inline void yuvPlanartouyvy_c(const uint8_t *ysrc, const uint8_t *usrc, + const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, + int dstStride, int vertLumPerChroma) { - long y; - const x86_reg chromWidth= width>>1; + int y; + const int chromWidth = width >> 1; for (y=0; y<height; y++) { -#if COMPILE_TEMPLATE_MMX - //FIXME handle 2 lines at once (fewer prefetches, reuse some chroma, but very likely memory-limited anyway) - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 32(%1, %%"REG_a", 2) \n\t" - PREFETCH" 32(%2, %%"REG_a") \n\t" - PREFETCH" 32(%3, %%"REG_a") \n\t" - "movq (%2, %%"REG_a"), %%mm0 \n\t" // U(0) - "movq %%mm0, %%mm2 \n\t" // U(0) - "movq (%3, %%"REG_a"), %%mm1 \n\t" // V(0) - "punpcklbw %%mm1, %%mm0 \n\t" // UVUV UVUV(0) - "punpckhbw %%mm1, %%mm2 \n\t" // UVUV UVUV(8) - - "movq (%1, %%"REG_a",2), %%mm3 \n\t" // Y(0) - "movq 8(%1, %%"REG_a",2), %%mm5 \n\t" // Y(8) - "movq %%mm0, %%mm4 \n\t" // Y(0) - "movq %%mm2, %%mm6 \n\t" // Y(8) - "punpcklbw %%mm3, %%mm0 \n\t" // YUYV YUYV(0) - "punpckhbw %%mm3, %%mm4 \n\t" // YUYV YUYV(4) - "punpcklbw %%mm5, %%mm2 \n\t" // YUYV YUYV(8) - "punpckhbw %%mm5, %%mm6 \n\t" // YUYV YUYV(12) - - MOVNTQ" %%mm0, 
(%0, %%"REG_a", 4) \n\t" - MOVNTQ" %%mm4, 8(%0, %%"REG_a", 4) \n\t" - MOVNTQ" %%mm2, 16(%0, %%"REG_a", 4) \n\t" - MOVNTQ" %%mm6, 24(%0, %%"REG_a", 4) \n\t" - - "add $8, %%"REG_a" \n\t" - "cmp %4, %%"REG_a" \n\t" - " jb 1b \n\t" - ::"r"(dst), "r"(ysrc), "r"(usrc), "r"(vsrc), "g" (chromWidth) - : "%"REG_a - ); -#else -//FIXME adapt the Alpha ASM code from yv12->yuy2 - #if HAVE_FAST_64BIT int i; uint64_t *ldst = (uint64_t *) dst; @@ -1585,7 +454,6 @@ static inline void RENAME(yuvPlanartouyvy)(const uint8_t *ysrc, const uint8_t *u vc++; } #endif -#endif if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) { usrc += chromStride; vsrc += chromStride; @@ -1593,140 +461,63 @@ static inline void RENAME(yuvPlanartouyvy)(const uint8_t *ysrc, const uint8_t *u ysrc += lumStride; dst += dstStride; } -#if COMPILE_TEMPLATE_MMX - __asm__(EMMS" \n\t" - SFENCE" \n\t" - :::"memory"); -#endif } /** * Height should be a multiple of 2 and width should be a multiple of 16 * (If this is a problem for anyone then tell me, and I will fix it.) */ -static inline void RENAME(yv12touyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride) +static inline void yv12touyvy_c(const uint8_t *ysrc, const uint8_t *usrc, + const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, + int dstStride) { //FIXME interpolate chroma - RENAME(yuvPlanartouyvy)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 2); + yuvPlanartouyvy_c(ysrc, usrc, vsrc, dst, width, height, lumStride, + chromStride, dstStride, 2); } /** * Width should be a multiple of 16. */ -static inline void RENAME(yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride) +static inline void yuv422ptouyvy_c(const uint8_t *ysrc, const uint8_t *usrc, + const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, + int dstStride) { - RENAME(yuvPlanartouyvy)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 1); + yuvPlanartouyvy_c(ysrc, usrc, vsrc, dst, width, height, lumStride, + chromStride, dstStride, 1); } /** * Width should be a multiple of 16. */ -static inline void RENAME(yuv422ptoyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, - long width, long height, - long lumStride, long chromStride, long dstStride) +static inline void yuv422ptoyuy2_c(const uint8_t *ysrc, const uint8_t *usrc, + const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, + int dstStride) { - RENAME(yuvPlanartoyuy2)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 1); + yuvPlanartoyuy2_c(ysrc, usrc, vsrc, dst, width, height, lumStride, + chromStride, dstStride, 1); } /** * Height should be a multiple of 2 and width should be a multiple of 16. * (If this is a problem for anyone then tell me, and I will fix it.) 
*/ -static inline void RENAME(yuy2toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride) -{ - long y; - const x86_reg chromWidth= width>>1; +static inline void yuy2toyv12_c(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, + int srcStride) +{ + int y; + const int chromWidth = width >> 1; for (y=0; y<height; y+=2) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" // FF,00,FF,00... - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 64(%0, %%"REG_a", 4) \n\t" - "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // YUYV YUYV(0) - "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(4) - "movq %%mm0, %%mm2 \n\t" // YUYV YUYV(0) - "movq %%mm1, %%mm3 \n\t" // YUYV YUYV(4) - "psrlw $8, %%mm0 \n\t" // U0V0 U0V0(0) - "psrlw $8, %%mm1 \n\t" // U0V0 U0V0(4) - "pand %%mm7, %%mm2 \n\t" // Y0Y0 Y0Y0(0) - "pand %%mm7, %%mm3 \n\t" // Y0Y0 Y0Y0(4) - "packuswb %%mm1, %%mm0 \n\t" // UVUV UVUV(0) - "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(0) - - MOVNTQ" %%mm2, (%1, %%"REG_a", 2) \n\t" - - "movq 16(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(8) - "movq 24(%0, %%"REG_a", 4), %%mm2 \n\t" // YUYV YUYV(12) - "movq %%mm1, %%mm3 \n\t" // YUYV YUYV(8) - "movq %%mm2, %%mm4 \n\t" // YUYV YUYV(12) - "psrlw $8, %%mm1 \n\t" // U0V0 U0V0(8) - "psrlw $8, %%mm2 \n\t" // U0V0 U0V0(12) - "pand %%mm7, %%mm3 \n\t" // Y0Y0 Y0Y0(8) - "pand %%mm7, %%mm4 \n\t" // Y0Y0 Y0Y0(12) - "packuswb %%mm2, %%mm1 \n\t" // UVUV UVUV(8) - "packuswb %%mm4, %%mm3 \n\t" // YYYY YYYY(8) - - MOVNTQ" %%mm3, 8(%1, %%"REG_a", 2) \n\t" - - "movq %%mm0, %%mm2 \n\t" // UVUV UVUV(0) - "movq %%mm1, %%mm3 \n\t" // UVUV UVUV(8) - "psrlw $8, %%mm0 \n\t" // V0V0 V0V0(0) - "psrlw $8, %%mm1 \n\t" // V0V0 V0V0(8) - "pand %%mm7, %%mm2 \n\t" // U0U0 U0U0(0) - "pand %%mm7, %%mm3 \n\t" // U0U0 U0U0(8) - "packuswb %%mm1, %%mm0 \n\t" // VVVV VVVV(0) - "packuswb %%mm3, %%mm2 \n\t" // UUUU UUUU(0) - - MOVNTQ" %%mm0, (%3, %%"REG_a") \n\t" - MOVNTQ" %%mm2, (%2, %%"REG_a") \n\t" - - "add $8, %%"REG_a" \n\t" - "cmp %4, %%"REG_a" \n\t" - " jb 1b \n\t" - ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) - : "memory", "%"REG_a - ); - - ydst += lumStride; - src += srcStride; - - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 64(%0, %%"REG_a", 4) \n\t" - "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // YUYV YUYV(0) - "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(4) - "movq 16(%0, %%"REG_a", 4), %%mm2 \n\t" // YUYV YUYV(8) - "movq 24(%0, %%"REG_a", 4), %%mm3 \n\t" // YUYV YUYV(12) - "pand %%mm7, %%mm0 \n\t" // Y0Y0 Y0Y0(0) - "pand %%mm7, %%mm1 \n\t" // Y0Y0 Y0Y0(4) - "pand %%mm7, %%mm2 \n\t" // Y0Y0 Y0Y0(8) - "pand %%mm7, %%mm3 \n\t" // Y0Y0 Y0Y0(12) - "packuswb %%mm1, %%mm0 \n\t" // YYYY YYYY(0) - "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(8) - - MOVNTQ" %%mm0, (%1, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm2, 8(%1, %%"REG_a", 2) \n\t" - - "add $8, %%"REG_a" \n\t" - "cmp %4, %%"REG_a" \n\t" - " jb 1b \n\t" - - ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) - : "memory", "%"REG_a - ); -#else - long i; + int i; for (i=0; i<chromWidth; i++) { ydst[2*i+0] = src[4*i+0]; udst[i] = src[4*i+1]; @@ -1740,22 +531,17 @@ static inline void RENAME(yuy2toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t ydst[2*i+0] = src[4*i+0]; ydst[2*i+1] = src[4*i+2]; } -#endif udst += chromStride; vdst += 
chromStride; ydst += lumStride; src += srcStride; } -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(EMMS" \n\t" - SFENCE" \n\t" - :::"memory"); -#endif } -static inline void RENAME(planar2x)(const uint8_t *src, uint8_t *dst, long srcWidth, long srcHeight, long srcStride, long dstStride) +static inline void planar2x_c(const uint8_t *src, uint8_t *dst, int srcWidth, + int srcHeight, int srcStride, int dstStride) { - long x,y; + int x,y; dst[0]= src[0]; @@ -1769,66 +555,10 @@ static inline void RENAME(planar2x)(const uint8_t *src, uint8_t *dst, long srcWi dst+= dstStride; for (y=1; y<srcHeight; y++) { -#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW - const x86_reg mmxSize= srcWidth&~15; - __asm__ volatile( - "mov %4, %%"REG_a" \n\t" - "movq "MANGLE(mmx_ff)", %%mm0 \n\t" - "movq (%0, %%"REG_a"), %%mm4 \n\t" - "movq %%mm4, %%mm2 \n\t" - "psllq $8, %%mm4 \n\t" - "pand %%mm0, %%mm2 \n\t" - "por %%mm2, %%mm4 \n\t" - "movq (%1, %%"REG_a"), %%mm5 \n\t" - "movq %%mm5, %%mm3 \n\t" - "psllq $8, %%mm5 \n\t" - "pand %%mm0, %%mm3 \n\t" - "por %%mm3, %%mm5 \n\t" - "1: \n\t" - "movq (%0, %%"REG_a"), %%mm0 \n\t" - "movq (%1, %%"REG_a"), %%mm1 \n\t" - "movq 1(%0, %%"REG_a"), %%mm2 \n\t" - "movq 1(%1, %%"REG_a"), %%mm3 \n\t" - PAVGB" %%mm0, %%mm5 \n\t" - PAVGB" %%mm0, %%mm3 \n\t" - PAVGB" %%mm0, %%mm5 \n\t" - PAVGB" %%mm0, %%mm3 \n\t" - PAVGB" %%mm1, %%mm4 \n\t" - PAVGB" %%mm1, %%mm2 \n\t" - PAVGB" %%mm1, %%mm4 \n\t" - PAVGB" %%mm1, %%mm2 \n\t" - "movq %%mm5, %%mm7 \n\t" - "movq %%mm4, %%mm6 \n\t" - "punpcklbw %%mm3, %%mm5 \n\t" - "punpckhbw %%mm3, %%mm7 \n\t" - "punpcklbw %%mm2, %%mm4 \n\t" - "punpckhbw %%mm2, %%mm6 \n\t" -#if 1 - MOVNTQ" %%mm5, (%2, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm7, 8(%2, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm4, (%3, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm6, 8(%3, %%"REG_a", 2) \n\t" -#else - "movq %%mm5, (%2, %%"REG_a", 2) \n\t" - "movq %%mm7, 8(%2, %%"REG_a", 2) \n\t" - "movq %%mm4, (%3, %%"REG_a", 2) \n\t" - "movq %%mm6, 8(%3, %%"REG_a", 2) \n\t" -#endif - "add $8, %%"REG_a" \n\t" - "movq -1(%0, %%"REG_a"), %%mm4 \n\t" - "movq -1(%1, %%"REG_a"), %%mm5 \n\t" - " js 1b \n\t" - :: "r" (src + mmxSize ), "r" (src + srcStride + mmxSize ), - "r" (dst + mmxSize*2), "r" (dst + dstStride + mmxSize*2), - "g" (-mmxSize) - : "%"REG_a - ); -#else - const x86_reg mmxSize=1; + const int mmxSize = 1; dst[0 ]= (3*src[0] + src[srcStride])>>2; dst[dstStride]= ( src[0] + 3*src[srcStride])>>2; -#endif for (x=mmxSize-1; x<srcWidth-1; x++) { dst[2*x +1]= (3*src[x+0] + src[x+srcStride+1])>>2; @@ -1844,7 +574,6 @@ static inline void RENAME(planar2x)(const uint8_t *src, uint8_t *dst, long srcWi } // last line -#if 1 dst[0]= src[0]; for (x=0; x<srcWidth-1; x++) { @@ -1852,18 +581,6 @@ static inline void RENAME(planar2x)(const uint8_t *src, uint8_t *dst, long srcWi dst[2*x+2]= ( src[x] + 3*src[x+1])>>2; } dst[2*srcWidth-1]= src[srcWidth-1]; -#else - for (x=0; x<srcWidth; x++) { - dst[2*x+0]= - dst[2*x+1]= src[x]; - } -#endif - -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(EMMS" \n\t" - SFENCE" \n\t" - :::"memory"); -#endif } /** @@ -1872,97 +589,16 @@ static inline void RENAME(planar2x)(const uint8_t *src, uint8_t *dst, long srcWi * Chrominance data is only taken from every second line, others are ignored. * FIXME: Write HQ version. 
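The yuy2toyv12_c loop above splits packed YUY2 back into planes, taking chroma only from every second line (a plain drop, no filtering), exactly as its comment says. A sketch of one 2-line step, with a hypothetical helper name:

    #include <stdint.h>

    /* One 2-line step of the yuy2toyv12_c loop above: the even line
     * contributes luma and chroma, the odd line only luma, which is how the
     * vertical 4:2:0 chroma subsampling is done. */
    static void yuy2_pair_to_yv12_sketch(const uint8_t *src, int srcStride,
                                         uint8_t *ydst, int lumStride,
                                         uint8_t *udst, uint8_t *vdst, int width)
    {
        int i;
        const int chromWidth = width >> 1;
        for (i = 0; i < chromWidth; i++) {          /* line 0: Y + U/V */
            ydst[2 * i + 0] = src[4 * i + 0];
            udst[i]         = src[4 * i + 1];
            ydst[2 * i + 1] = src[4 * i + 2];
            vdst[i]         = src[4 * i + 3];
        }
        src  += srcStride;
        ydst += lumStride;
        for (i = 0; i < chromWidth; i++) {          /* line 1: Y only */
            ydst[2 * i + 0] = src[4 * i + 0];
            ydst[2 * i + 1] = src[4 * i + 2];
        }
    }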
*/ -static inline void RENAME(uyvytoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride) -{ - long y; - const x86_reg chromWidth= width>>1; +static inline void uyvytoyv12_c(const uint8_t *src, uint8_t *ydst, + uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, + int srcStride) +{ + int y; + const int chromWidth = width >> 1; for (y=0; y<height; y+=2) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" // FF,00,FF,00... - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 64(%0, %%"REG_a", 4) \n\t" - "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // UYVY UYVY(0) - "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // UYVY UYVY(4) - "movq %%mm0, %%mm2 \n\t" // UYVY UYVY(0) - "movq %%mm1, %%mm3 \n\t" // UYVY UYVY(4) - "pand %%mm7, %%mm0 \n\t" // U0V0 U0V0(0) - "pand %%mm7, %%mm1 \n\t" // U0V0 U0V0(4) - "psrlw $8, %%mm2 \n\t" // Y0Y0 Y0Y0(0) - "psrlw $8, %%mm3 \n\t" // Y0Y0 Y0Y0(4) - "packuswb %%mm1, %%mm0 \n\t" // UVUV UVUV(0) - "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(0) - - MOVNTQ" %%mm2, (%1, %%"REG_a", 2) \n\t" - - "movq 16(%0, %%"REG_a", 4), %%mm1 \n\t" // UYVY UYVY(8) - "movq 24(%0, %%"REG_a", 4), %%mm2 \n\t" // UYVY UYVY(12) - "movq %%mm1, %%mm3 \n\t" // UYVY UYVY(8) - "movq %%mm2, %%mm4 \n\t" // UYVY UYVY(12) - "pand %%mm7, %%mm1 \n\t" // U0V0 U0V0(8) - "pand %%mm7, %%mm2 \n\t" // U0V0 U0V0(12) - "psrlw $8, %%mm3 \n\t" // Y0Y0 Y0Y0(8) - "psrlw $8, %%mm4 \n\t" // Y0Y0 Y0Y0(12) - "packuswb %%mm2, %%mm1 \n\t" // UVUV UVUV(8) - "packuswb %%mm4, %%mm3 \n\t" // YYYY YYYY(8) - - MOVNTQ" %%mm3, 8(%1, %%"REG_a", 2) \n\t" - - "movq %%mm0, %%mm2 \n\t" // UVUV UVUV(0) - "movq %%mm1, %%mm3 \n\t" // UVUV UVUV(8) - "psrlw $8, %%mm0 \n\t" // V0V0 V0V0(0) - "psrlw $8, %%mm1 \n\t" // V0V0 V0V0(8) - "pand %%mm7, %%mm2 \n\t" // U0U0 U0U0(0) - "pand %%mm7, %%mm3 \n\t" // U0U0 U0U0(8) - "packuswb %%mm1, %%mm0 \n\t" // VVVV VVVV(0) - "packuswb %%mm3, %%mm2 \n\t" // UUUU UUUU(0) - - MOVNTQ" %%mm0, (%3, %%"REG_a") \n\t" - MOVNTQ" %%mm2, (%2, %%"REG_a") \n\t" - - "add $8, %%"REG_a" \n\t" - "cmp %4, %%"REG_a" \n\t" - " jb 1b \n\t" - ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) - : "memory", "%"REG_a - ); - - ydst += lumStride; - src += srcStride; - - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 64(%0, %%"REG_a", 4) \n\t" - "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // YUYV YUYV(0) - "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(4) - "movq 16(%0, %%"REG_a", 4), %%mm2 \n\t" // YUYV YUYV(8) - "movq 24(%0, %%"REG_a", 4), %%mm3 \n\t" // YUYV YUYV(12) - "psrlw $8, %%mm0 \n\t" // Y0Y0 Y0Y0(0) - "psrlw $8, %%mm1 \n\t" // Y0Y0 Y0Y0(4) - "psrlw $8, %%mm2 \n\t" // Y0Y0 Y0Y0(8) - "psrlw $8, %%mm3 \n\t" // Y0Y0 Y0Y0(12) - "packuswb %%mm1, %%mm0 \n\t" // YYYY YYYY(0) - "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(8) - - MOVNTQ" %%mm0, (%1, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm2, 8(%1, %%"REG_a", 2) \n\t" - - "add $8, %%"REG_a" \n\t" - "cmp %4, %%"REG_a" \n\t" - " jb 1b \n\t" - - ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) - : "memory", "%"REG_a - ); -#else - long i; + int i; for (i=0; i<chromWidth; i++) { udst[i] = src[4*i+0]; ydst[2*i+0] = src[4*i+1]; @@ -1976,17 +612,11 @@ static inline void RENAME(uyvytoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t ydst[2*i+0] = src[4*i+1]; ydst[2*i+1] = src[4*i+3]; } -#endif udst += chromStride; vdst += chromStride; 
ydst += lumStride; src += srcStride; } -#if COMPILE_TEMPLATE_MMX - __asm__ volatile(EMMS" \n\t" - SFENCE" \n\t" - :::"memory"); -#endif } /** @@ -1996,251 +626,15 @@ static inline void RENAME(uyvytoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t * others are ignored in the C version. * FIXME: Write HQ version. */ -static inline void RENAME(rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, - long width, long height, - long lumStride, long chromStride, long srcStride) +void rgb24toyv12_c(const uint8_t *src, uint8_t *ydst, uint8_t *udst, + uint8_t *vdst, int width, int height, int lumStride, + int chromStride, int srcStride) { - long y; - const x86_reg chromWidth= width>>1; -#if COMPILE_TEMPLATE_MMX - for (y=0; y<height-2; y+=2) { - long i; - for (i=0; i<2; i++) { - __asm__ volatile( - "mov %2, %%"REG_a" \n\t" - "movq "MANGLE(ff_bgr2YCoeff)", %%mm6 \n\t" - "movq "MANGLE(ff_w1111)", %%mm5 \n\t" - "pxor %%mm7, %%mm7 \n\t" - "lea (%%"REG_a", %%"REG_a", 2), %%"REG_d" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 64(%0, %%"REG_d") \n\t" - "movd (%0, %%"REG_d"), %%mm0 \n\t" - "movd 3(%0, %%"REG_d"), %%mm1 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "movd 6(%0, %%"REG_d"), %%mm2 \n\t" - "movd 9(%0, %%"REG_d"), %%mm3 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "pmaddwd %%mm6, %%mm0 \n\t" - "pmaddwd %%mm6, %%mm1 \n\t" - "pmaddwd %%mm6, %%mm2 \n\t" - "pmaddwd %%mm6, %%mm3 \n\t" -#ifndef FAST_BGR2YV12 - "psrad $8, %%mm0 \n\t" - "psrad $8, %%mm1 \n\t" - "psrad $8, %%mm2 \n\t" - "psrad $8, %%mm3 \n\t" -#endif - "packssdw %%mm1, %%mm0 \n\t" - "packssdw %%mm3, %%mm2 \n\t" - "pmaddwd %%mm5, %%mm0 \n\t" - "pmaddwd %%mm5, %%mm2 \n\t" - "packssdw %%mm2, %%mm0 \n\t" - "psraw $7, %%mm0 \n\t" - - "movd 12(%0, %%"REG_d"), %%mm4 \n\t" - "movd 15(%0, %%"REG_d"), %%mm1 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "movd 18(%0, %%"REG_d"), %%mm2 \n\t" - "movd 21(%0, %%"REG_d"), %%mm3 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "pmaddwd %%mm6, %%mm4 \n\t" - "pmaddwd %%mm6, %%mm1 \n\t" - "pmaddwd %%mm6, %%mm2 \n\t" - "pmaddwd %%mm6, %%mm3 \n\t" -#ifndef FAST_BGR2YV12 - "psrad $8, %%mm4 \n\t" - "psrad $8, %%mm1 \n\t" - "psrad $8, %%mm2 \n\t" - "psrad $8, %%mm3 \n\t" -#endif - "packssdw %%mm1, %%mm4 \n\t" - "packssdw %%mm3, %%mm2 \n\t" - "pmaddwd %%mm5, %%mm4 \n\t" - "pmaddwd %%mm5, %%mm2 \n\t" - "add $24, %%"REG_d" \n\t" - "packssdw %%mm2, %%mm4 \n\t" - "psraw $7, %%mm4 \n\t" - - "packuswb %%mm4, %%mm0 \n\t" - "paddusb "MANGLE(ff_bgr2YOffset)", %%mm0 \n\t" - - MOVNTQ" %%mm0, (%1, %%"REG_a") \n\t" - "add $8, %%"REG_a" \n\t" - " js 1b \n\t" - : : "r" (src+width*3), "r" (ydst+width), "g" ((x86_reg)-width) - : "%"REG_a, "%"REG_d - ); - ydst += lumStride; - src += srcStride; - } - src -= srcStride*2; - __asm__ volatile( - "mov %4, %%"REG_a" \n\t" - "movq "MANGLE(ff_w1111)", %%mm5 \n\t" - "movq "MANGLE(ff_bgr2UCoeff)", %%mm6 \n\t" - "pxor %%mm7, %%mm7 \n\t" - "lea (%%"REG_a", %%"REG_a", 2), %%"REG_d" \n\t" - "add %%"REG_d", %%"REG_d" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - PREFETCH" 64(%0, %%"REG_d") \n\t" - PREFETCH" 64(%1, %%"REG_d") \n\t" -#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW - "movq (%0, %%"REG_d"), %%mm0 \n\t" - "movq (%1, %%"REG_d"), %%mm1 \n\t" - "movq 6(%0, %%"REG_d"), %%mm2 \n\t" - "movq 6(%1, %%"REG_d"), %%mm3 \n\t" - PAVGB" %%mm1, %%mm0 \n\t" - PAVGB" %%mm3, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlq $24, %%mm0 \n\t" - "psrlq 
$24, %%mm2 \n\t" - PAVGB" %%mm1, %%mm0 \n\t" - PAVGB" %%mm3, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" -#else - "movd (%0, %%"REG_d"), %%mm0 \n\t" - "movd (%1, %%"REG_d"), %%mm1 \n\t" - "movd 3(%0, %%"REG_d"), %%mm2 \n\t" - "movd 3(%1, %%"REG_d"), %%mm3 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "paddw %%mm1, %%mm0 \n\t" - "paddw %%mm3, %%mm2 \n\t" - "paddw %%mm2, %%mm0 \n\t" - "movd 6(%0, %%"REG_d"), %%mm4 \n\t" - "movd 6(%1, %%"REG_d"), %%mm1 \n\t" - "movd 9(%0, %%"REG_d"), %%mm2 \n\t" - "movd 9(%1, %%"REG_d"), %%mm3 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "paddw %%mm1, %%mm4 \n\t" - "paddw %%mm3, %%mm2 \n\t" - "paddw %%mm4, %%mm2 \n\t" - "psrlw $2, %%mm0 \n\t" - "psrlw $2, %%mm2 \n\t" -#endif - "movq "MANGLE(ff_bgr2VCoeff)", %%mm1 \n\t" - "movq "MANGLE(ff_bgr2VCoeff)", %%mm3 \n\t" - - "pmaddwd %%mm0, %%mm1 \n\t" - "pmaddwd %%mm2, %%mm3 \n\t" - "pmaddwd %%mm6, %%mm0 \n\t" - "pmaddwd %%mm6, %%mm2 \n\t" -#ifndef FAST_BGR2YV12 - "psrad $8, %%mm0 \n\t" - "psrad $8, %%mm1 \n\t" - "psrad $8, %%mm2 \n\t" - "psrad $8, %%mm3 \n\t" -#endif - "packssdw %%mm2, %%mm0 \n\t" - "packssdw %%mm3, %%mm1 \n\t" - "pmaddwd %%mm5, %%mm0 \n\t" - "pmaddwd %%mm5, %%mm1 \n\t" - "packssdw %%mm1, %%mm0 \n\t" // V1 V0 U1 U0 - "psraw $7, %%mm0 \n\t" - -#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW - "movq 12(%0, %%"REG_d"), %%mm4 \n\t" - "movq 12(%1, %%"REG_d"), %%mm1 \n\t" - "movq 18(%0, %%"REG_d"), %%mm2 \n\t" - "movq 18(%1, %%"REG_d"), %%mm3 \n\t" - PAVGB" %%mm1, %%mm4 \n\t" - PAVGB" %%mm3, %%mm2 \n\t" - "movq %%mm4, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlq $24, %%mm4 \n\t" - "psrlq $24, %%mm2 \n\t" - PAVGB" %%mm1, %%mm4 \n\t" - PAVGB" %%mm3, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" -#else - "movd 12(%0, %%"REG_d"), %%mm4 \n\t" - "movd 12(%1, %%"REG_d"), %%mm1 \n\t" - "movd 15(%0, %%"REG_d"), %%mm2 \n\t" - "movd 15(%1, %%"REG_d"), %%mm3 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "paddw %%mm1, %%mm4 \n\t" - "paddw %%mm3, %%mm2 \n\t" - "paddw %%mm2, %%mm4 \n\t" - "movd 18(%0, %%"REG_d"), %%mm5 \n\t" - "movd 18(%1, %%"REG_d"), %%mm1 \n\t" - "movd 21(%0, %%"REG_d"), %%mm2 \n\t" - "movd 21(%1, %%"REG_d"), %%mm3 \n\t" - "punpcklbw %%mm7, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "paddw %%mm1, %%mm5 \n\t" - "paddw %%mm3, %%mm2 \n\t" - "paddw %%mm5, %%mm2 \n\t" - "movq "MANGLE(ff_w1111)", %%mm5 \n\t" - "psrlw $2, %%mm4 \n\t" - "psrlw $2, %%mm2 \n\t" -#endif - "movq "MANGLE(ff_bgr2VCoeff)", %%mm1 \n\t" - "movq "MANGLE(ff_bgr2VCoeff)", %%mm3 \n\t" - - "pmaddwd %%mm4, %%mm1 \n\t" - "pmaddwd %%mm2, %%mm3 \n\t" - "pmaddwd %%mm6, %%mm4 \n\t" - "pmaddwd %%mm6, %%mm2 \n\t" -#ifndef FAST_BGR2YV12 - "psrad $8, %%mm4 \n\t" - "psrad $8, %%mm1 \n\t" - "psrad $8, %%mm2 \n\t" - "psrad $8, %%mm3 \n\t" -#endif - "packssdw %%mm2, %%mm4 \n\t" - "packssdw %%mm3, %%mm1 \n\t" - "pmaddwd %%mm5, %%mm4 \n\t" - "pmaddwd %%mm5, %%mm1 \n\t" - "add $24, %%"REG_d" \n\t" - "packssdw %%mm1, %%mm4 \n\t" // V3 V2 U3 U2 - "psraw $7, %%mm4 \n\t" - - "movq %%mm0, %%mm1 \n\t" - "punpckldq %%mm4, %%mm0 \n\t" - "punpckhdq %%mm4, %%mm1 \n\t" - "packsswb %%mm1, %%mm0 \n\t" - "paddb "MANGLE(ff_bgr2UVOffset)", %%mm0 \n\t" - "movd %%mm0, 
(%2, %%"REG_a") \n\t" - "punpckhdq %%mm0, %%mm0 \n\t" - "movd %%mm0, (%3, %%"REG_a") \n\t" - "add $4, %%"REG_a" \n\t" - " js 1b \n\t" - : : "r" (src+chromWidth*6), "r" (src+srcStride+chromWidth*6), "r" (udst+chromWidth), "r" (vdst+chromWidth), "g" (-chromWidth) - : "%"REG_a, "%"REG_d - ); - - udst += chromStride; - vdst += chromStride; - src += srcStride*2; - } - - __asm__ volatile(EMMS" \n\t" - SFENCE" \n\t" - :::"memory"); -#else + int y; + const int chromWidth = width >> 1; y=0; -#endif for (; y<height; y+=2) { - long i; + int i; for (i=0; i<chromWidth; i++) { unsigned int b = src[6*i+0]; unsigned int g = src[6*i+1]; @@ -2290,195 +684,56 @@ static inline void RENAME(rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_ } } -static void RENAME(interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dest, - long width, long height, long src1Stride, - long src2Stride, long dstStride) +static void interleaveBytes_c(const uint8_t *src1, const uint8_t *src2, + uint8_t *dest, int width, + int height, int src1Stride, + int src2Stride, int dstStride) { - long h; + int h; for (h=0; h < height; h++) { - long w; - -#if COMPILE_TEMPLATE_MMX -#if COMPILE_TEMPLATE_SSE2 - __asm__( - "xor %%"REG_a", %%"REG_a" \n\t" - "1: \n\t" - PREFETCH" 64(%1, %%"REG_a") \n\t" - PREFETCH" 64(%2, %%"REG_a") \n\t" - "movdqa (%1, %%"REG_a"), %%xmm0 \n\t" - "movdqa (%1, %%"REG_a"), %%xmm1 \n\t" - "movdqa (%2, %%"REG_a"), %%xmm2 \n\t" - "punpcklbw %%xmm2, %%xmm0 \n\t" - "punpckhbw %%xmm2, %%xmm1 \n\t" - "movntdq %%xmm0, (%0, %%"REG_a", 2) \n\t" - "movntdq %%xmm1, 16(%0, %%"REG_a", 2) \n\t" - "add $16, %%"REG_a" \n\t" - "cmp %3, %%"REG_a" \n\t" - " jb 1b \n\t" - ::"r"(dest), "r"(src1), "r"(src2), "r" ((x86_reg)width-15) - : "memory", "%"REG_a"" - ); -#else - __asm__( - "xor %%"REG_a", %%"REG_a" \n\t" - "1: \n\t" - PREFETCH" 64(%1, %%"REG_a") \n\t" - PREFETCH" 64(%2, %%"REG_a") \n\t" - "movq (%1, %%"REG_a"), %%mm0 \n\t" - "movq 8(%1, %%"REG_a"), %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq (%2, %%"REG_a"), %%mm4 \n\t" - "movq 8(%2, %%"REG_a"), %%mm5 \n\t" - "punpcklbw %%mm4, %%mm0 \n\t" - "punpckhbw %%mm4, %%mm1 \n\t" - "punpcklbw %%mm5, %%mm2 \n\t" - "punpckhbw %%mm5, %%mm3 \n\t" - MOVNTQ" %%mm0, (%0, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm1, 8(%0, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm2, 16(%0, %%"REG_a", 2) \n\t" - MOVNTQ" %%mm3, 24(%0, %%"REG_a", 2) \n\t" - "add $16, %%"REG_a" \n\t" - "cmp %3, %%"REG_a" \n\t" - " jb 1b \n\t" - ::"r"(dest), "r"(src1), "r"(src2), "r" ((x86_reg)width-15) - : "memory", "%"REG_a - ); -#endif - for (w= (width&(~15)); w < width; w++) { - dest[2*w+0] = src1[w]; - dest[2*w+1] = src2[w]; - } -#else + int w; for (w=0; w < width; w++) { dest[2*w+0] = src1[w]; dest[2*w+1] = src2[w]; } -#endif dest += dstStride; src1 += src1Stride; src2 += src2Stride; } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static inline void RENAME(vu9_to_vu12)(const uint8_t *src1, const uint8_t *src2, - uint8_t *dst1, uint8_t *dst2, - long width, long height, - long srcStride1, long srcStride2, - long dstStride1, long dstStride2) +static inline void vu9_to_vu12_c(const uint8_t *src1, const uint8_t *src2, + uint8_t *dst1, uint8_t *dst2, + int width, int height, + int srcStride1, int srcStride2, + int dstStride1, int dstStride2) { - x86_reg y; - long x,w,h; + int y; + int x,w,h; w=width/2; h=height/2; -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - PREFETCH" %0 \n\t" - PREFETCH" %1 \n\t" - 
::"m"(*(src1+srcStride1)),"m"(*(src2+srcStride2)):"memory"); -#endif for (y=0;y<h;y++) { const uint8_t* s1=src1+srcStride1*(y>>1); uint8_t* d=dst1+dstStride1*y; x=0; -#if COMPILE_TEMPLATE_MMX - for (;x<w-31;x+=32) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq 8%1, %%mm2 \n\t" - "movq 16%1, %%mm4 \n\t" - "movq 24%1, %%mm6 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq %%mm4, %%mm5 \n\t" - "movq %%mm6, %%mm7 \n\t" - "punpcklbw %%mm0, %%mm0 \n\t" - "punpckhbw %%mm1, %%mm1 \n\t" - "punpcklbw %%mm2, %%mm2 \n\t" - "punpckhbw %%mm3, %%mm3 \n\t" - "punpcklbw %%mm4, %%mm4 \n\t" - "punpckhbw %%mm5, %%mm5 \n\t" - "punpcklbw %%mm6, %%mm6 \n\t" - "punpckhbw %%mm7, %%mm7 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - MOVNTQ" %%mm1, 8%0 \n\t" - MOVNTQ" %%mm2, 16%0 \n\t" - MOVNTQ" %%mm3, 24%0 \n\t" - MOVNTQ" %%mm4, 32%0 \n\t" - MOVNTQ" %%mm5, 40%0 \n\t" - MOVNTQ" %%mm6, 48%0 \n\t" - MOVNTQ" %%mm7, 56%0" - :"=m"(d[2*x]) - :"m"(s1[x]) - :"memory"); - } -#endif for (;x<w;x++) d[2*x]=d[2*x+1]=s1[x]; } for (y=0;y<h;y++) { const uint8_t* s2=src2+srcStride2*(y>>1); uint8_t* d=dst2+dstStride2*y; x=0; -#if COMPILE_TEMPLATE_MMX - for (;x<w-31;x+=32) { - __asm__ volatile( - PREFETCH" 32%1 \n\t" - "movq %1, %%mm0 \n\t" - "movq 8%1, %%mm2 \n\t" - "movq 16%1, %%mm4 \n\t" - "movq 24%1, %%mm6 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq %%mm4, %%mm5 \n\t" - "movq %%mm6, %%mm7 \n\t" - "punpcklbw %%mm0, %%mm0 \n\t" - "punpckhbw %%mm1, %%mm1 \n\t" - "punpcklbw %%mm2, %%mm2 \n\t" - "punpckhbw %%mm3, %%mm3 \n\t" - "punpcklbw %%mm4, %%mm4 \n\t" - "punpckhbw %%mm5, %%mm5 \n\t" - "punpcklbw %%mm6, %%mm6 \n\t" - "punpckhbw %%mm7, %%mm7 \n\t" - MOVNTQ" %%mm0, %0 \n\t" - MOVNTQ" %%mm1, 8%0 \n\t" - MOVNTQ" %%mm2, 16%0 \n\t" - MOVNTQ" %%mm3, 24%0 \n\t" - MOVNTQ" %%mm4, 32%0 \n\t" - MOVNTQ" %%mm5, 40%0 \n\t" - MOVNTQ" %%mm6, 48%0 \n\t" - MOVNTQ" %%mm7, 56%0" - :"=m"(d[2*x]) - :"m"(s2[x]) - :"memory"); - } -#endif for (;x<w;x++) d[2*x]=d[2*x+1]=s2[x]; } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static inline void RENAME(yvu9_to_yuy2)(const uint8_t *src1, const uint8_t *src2, const uint8_t *src3, - uint8_t *dst, - long width, long height, - long srcStride1, long srcStride2, - long srcStride3, long dstStride) +static inline void yvu9_to_yuy2_c(const uint8_t *src1, const uint8_t *src2, + const uint8_t *src3, uint8_t *dst, + int width, int height, + int srcStride1, int srcStride2, + int srcStride3, int dstStride) { - x86_reg x; - long y,w,h; + int x; + int y,w,h; w=width/2; h=height; for (y=0;y<h;y++) { const uint8_t* yp=src1+srcStride1*y; @@ -2486,62 +741,8 @@ static inline void RENAME(yvu9_to_yuy2)(const uint8_t *src1, const uint8_t *src2 const uint8_t* vp=src3+srcStride3*(y>>2); uint8_t* d=dst+dstStride*y; x=0; -#if COMPILE_TEMPLATE_MMX - for (;x<w-7;x+=8) { - __asm__ volatile( - PREFETCH" 32(%1, %0) \n\t" - PREFETCH" 32(%2, %0) \n\t" - PREFETCH" 32(%3, %0) \n\t" - "movq (%1, %0, 4), %%mm0 \n\t" /* Y0Y1Y2Y3Y4Y5Y6Y7 */ - "movq (%2, %0), %%mm1 \n\t" /* U0U1U2U3U4U5U6U7 */ - "movq (%3, %0), %%mm2 \n\t" /* V0V1V2V3V4V5V6V7 */ - "movq %%mm0, %%mm3 \n\t" /* Y0Y1Y2Y3Y4Y5Y6Y7 */ - "movq %%mm1, %%mm4 \n\t" /* U0U1U2U3U4U5U6U7 */ - "movq %%mm2, %%mm5 \n\t" /* V0V1V2V3V4V5V6V7 */ - "punpcklbw %%mm1, %%mm1 \n\t" /* U0U0 U1U1 U2U2 U3U3 */ - "punpcklbw %%mm2, %%mm2 \n\t" /* V0V0 V1V1 V2V2 V3V3 */ - "punpckhbw %%mm4, %%mm4 \n\t" /* U4U4 U5U5 U6U6 U7U7 */ - "punpckhbw %%mm5, %%mm5 \n\t" /* V4V4 V5V5 V6V6 V7V7 */ - - "movq %%mm1, %%mm6 
\n\t" - "punpcklbw %%mm2, %%mm1 \n\t" /* U0V0 U0V0 U1V1 U1V1*/ - "punpcklbw %%mm1, %%mm0 \n\t" /* Y0U0 Y1V0 Y2U0 Y3V0*/ - "punpckhbw %%mm1, %%mm3 \n\t" /* Y4U1 Y5V1 Y6U1 Y7V1*/ - MOVNTQ" %%mm0, (%4, %0, 8) \n\t" - MOVNTQ" %%mm3, 8(%4, %0, 8) \n\t" - - "punpckhbw %%mm2, %%mm6 \n\t" /* U2V2 U2V2 U3V3 U3V3*/ - "movq 8(%1, %0, 4), %%mm0 \n\t" - "movq %%mm0, %%mm3 \n\t" - "punpcklbw %%mm6, %%mm0 \n\t" /* Y U2 Y V2 Y U2 Y V2*/ - "punpckhbw %%mm6, %%mm3 \n\t" /* Y U3 Y V3 Y U3 Y V3*/ - MOVNTQ" %%mm0, 16(%4, %0, 8) \n\t" - MOVNTQ" %%mm3, 24(%4, %0, 8) \n\t" - - "movq %%mm4, %%mm6 \n\t" - "movq 16(%1, %0, 4), %%mm0 \n\t" - "movq %%mm0, %%mm3 \n\t" - "punpcklbw %%mm5, %%mm4 \n\t" - "punpcklbw %%mm4, %%mm0 \n\t" /* Y U4 Y V4 Y U4 Y V4*/ - "punpckhbw %%mm4, %%mm3 \n\t" /* Y U5 Y V5 Y U5 Y V5*/ - MOVNTQ" %%mm0, 32(%4, %0, 8) \n\t" - MOVNTQ" %%mm3, 40(%4, %0, 8) \n\t" - - "punpckhbw %%mm5, %%mm6 \n\t" - "movq 24(%1, %0, 4), %%mm0 \n\t" - "movq %%mm0, %%mm3 \n\t" - "punpcklbw %%mm6, %%mm0 \n\t" /* Y U6 Y V6 Y U6 Y V6*/ - "punpckhbw %%mm6, %%mm3 \n\t" /* Y U7 Y V7 Y U7 Y V7*/ - MOVNTQ" %%mm0, 48(%4, %0, 8) \n\t" - MOVNTQ" %%mm3, 56(%4, %0, 8) \n\t" - - : "+r" (x) - : "r"(yp), "r" (up), "r"(vp), "r"(d) - :"memory"); - } -#endif for (; x<w; x++) { - const long x2 = x<<2; + const int x2 = x<<2; d[8*x+0] = yp[x2]; d[8*x+1] = up[x]; d[8*x+2] = yp[x2+1]; @@ -2552,95 +753,27 @@ static inline void RENAME(yvu9_to_yuy2)(const uint8_t *src1, const uint8_t *src2 d[8*x+7] = vp[x]; } } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static void RENAME(extract_even)(const uint8_t *src, uint8_t *dst, x86_reg count) +static void extract_even_c(const uint8_t *src, uint8_t *dst, int count) { dst += count; src += 2*count; count= - count; -#if COMPILE_TEMPLATE_MMX - if(count <= -16) { - count += 15; - __asm__ volatile( - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" - "1: \n\t" - "movq -30(%1, %0, 2), %%mm0 \n\t" - "movq -22(%1, %0, 2), %%mm1 \n\t" - "movq -14(%1, %0, 2), %%mm2 \n\t" - "movq -6(%1, %0, 2), %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - MOVNTQ" %%mm0,-15(%2, %0) \n\t" - MOVNTQ" %%mm2,- 7(%2, %0) \n\t" - "add $16, %0 \n\t" - " js 1b \n\t" - : "+r"(count) - : "r"(src), "r"(dst) - ); - count -= 15; - } -#endif while(count<0) { dst[count]= src[2*count]; count++; } } -static void RENAME(extract_even2)(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, x86_reg count) +static void extract_even2_c(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, + int count) { dst0+= count; dst1+= count; src += 4*count; count= - count; -#if COMPILE_TEMPLATE_MMX - if(count <= -8) { - count += 7; - __asm__ volatile( - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" - "1: \n\t" - "movq -28(%1, %0, 4), %%mm0 \n\t" - "movq -20(%1, %0, 4), %%mm1 \n\t" - "movq -12(%1, %0, 4), %%mm2 \n\t" - "movq -4(%1, %0, 4), %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm2 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm2, %%mm0 \n\t" - "packuswb %%mm3, %%mm1 \n\t" - MOVNTQ" %%mm0,- 7(%3, %0) \n\t" - MOVNTQ" %%mm1,- 7(%2, %0) \n\t" - "add $8, %0 \n\t" - " js 1b \n\t" - : "+r"(count) - : "r"(src), "r"(dst0), 
"r"(dst1) - ); - count -= 7; - } -#endif while(count<0) { dst0[count]= src[4*count+0]; dst1[count]= src[4*count+2]; @@ -2648,52 +781,14 @@ static void RENAME(extract_even2)(const uint8_t *src, uint8_t *dst0, uint8_t *ds } } -static void RENAME(extract_even2avg)(const uint8_t *src0, const uint8_t *src1, uint8_t *dst0, uint8_t *dst1, x86_reg count) +static void extract_even2avg_c(const uint8_t *src0, const uint8_t *src1, + uint8_t *dst0, uint8_t *dst1, int count) { dst0 += count; dst1 += count; src0 += 4*count; src1 += 4*count; count= - count; -#ifdef PAVGB - if(count <= -8) { - count += 7; - __asm__ volatile( - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" - "1: \n\t" - "movq -28(%1, %0, 4), %%mm0 \n\t" - "movq -20(%1, %0, 4), %%mm1 \n\t" - "movq -12(%1, %0, 4), %%mm2 \n\t" - "movq -4(%1, %0, 4), %%mm3 \n\t" - PAVGB" -28(%2, %0, 4), %%mm0 \n\t" - PAVGB" -20(%2, %0, 4), %%mm1 \n\t" - PAVGB" -12(%2, %0, 4), %%mm2 \n\t" - PAVGB" - 4(%2, %0, 4), %%mm3 \n\t" - "pand %%mm7, %%mm0 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm2 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm2 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm2, %%mm0 \n\t" - "packuswb %%mm3, %%mm1 \n\t" - MOVNTQ" %%mm0,- 7(%4, %0) \n\t" - MOVNTQ" %%mm1,- 7(%3, %0) \n\t" - "add $8, %0 \n\t" - " js 1b \n\t" - : "+r"(count) - : "r"(src0), "r"(src1), "r"(dst0), "r"(dst1) - ); - count -= 7; - } -#endif while(count<0) { dst0[count]= (src0[4*count+0]+src1[4*count+0])>>1; dst1[count]= (src0[4*count+2]+src1[4*count+2])>>1; @@ -2701,47 +796,13 @@ static void RENAME(extract_even2avg)(const uint8_t *src0, const uint8_t *src1, u } } -static void RENAME(extract_odd2)(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, x86_reg count) +static void extract_odd2_c(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, + int count) { dst0+= count; dst1+= count; src += 4*count; count= - count; -#if COMPILE_TEMPLATE_MMX - if(count <= -8) { - count += 7; - __asm__ volatile( - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" - "1: \n\t" - "movq -28(%1, %0, 4), %%mm0 \n\t" - "movq -20(%1, %0, 4), %%mm1 \n\t" - "movq -12(%1, %0, 4), %%mm2 \n\t" - "movq -4(%1, %0, 4), %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm1 \n\t" - "psrlw $8, %%mm2 \n\t" - "psrlw $8, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm2 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm2, %%mm0 \n\t" - "packuswb %%mm3, %%mm1 \n\t" - MOVNTQ" %%mm0,- 7(%3, %0) \n\t" - MOVNTQ" %%mm1,- 7(%2, %0) \n\t" - "add $8, %0 \n\t" - " js 1b \n\t" - : "+r"(count) - : "r"(src), "r"(dst0), "r"(dst1) - ); - count -= 7; - } -#endif src++; while(count<0) { dst0[count]= src[4*count+0]; @@ -2750,52 +811,14 @@ static void RENAME(extract_odd2)(const uint8_t *src, uint8_t *dst0, uint8_t *dst } } -static void RENAME(extract_odd2avg)(const uint8_t *src0, const uint8_t *src1, uint8_t *dst0, uint8_t *dst1, x86_reg count) +static void extract_odd2avg_c(const uint8_t *src0, const uint8_t *src1, + uint8_t *dst0, uint8_t *dst1, int count) { dst0 += count; dst1 += count; src0 += 4*count; src1 += 4*count; count= - count; -#ifdef PAVGB - if(count <= -8) { - count += 7; - __asm__ volatile( - "pcmpeqw %%mm7, %%mm7 \n\t" - "psrlw $8, %%mm7 \n\t" - "1: \n\t" - "movq -28(%1, %0, 4), %%mm0 \n\t" - 
"movq -20(%1, %0, 4), %%mm1 \n\t" - "movq -12(%1, %0, 4), %%mm2 \n\t" - "movq -4(%1, %0, 4), %%mm3 \n\t" - PAVGB" -28(%2, %0, 4), %%mm0 \n\t" - PAVGB" -20(%2, %0, 4), %%mm1 \n\t" - PAVGB" -12(%2, %0, 4), %%mm2 \n\t" - PAVGB" - 4(%2, %0, 4), %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm1 \n\t" - "psrlw $8, %%mm2 \n\t" - "psrlw $8, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm2 \n\t" - "pand %%mm7, %%mm1 \n\t" - "pand %%mm7, %%mm3 \n\t" - "packuswb %%mm2, %%mm0 \n\t" - "packuswb %%mm3, %%mm1 \n\t" - MOVNTQ" %%mm0,- 7(%4, %0) \n\t" - MOVNTQ" %%mm1,- 7(%3, %0) \n\t" - "add $8, %0 \n\t" - " js 1b \n\t" - : "+r"(count) - : "r"(src0), "r"(src1), "r"(dst0), "r"(dst1) - ); - count -= 7; - } -#endif src0++; src1++; while(count<0) { @@ -2805,17 +828,17 @@ static void RENAME(extract_odd2avg)(const uint8_t *src0, const uint8_t *src1, ui } } -static void RENAME(yuyvtoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride) +static void yuyvtoyuv420_c(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + const uint8_t *src, int width, int height, + int lumStride, int chromStride, int srcStride) { - long y; - const long chromWidth= -((-width)>>1); + int y; + const int chromWidth= -((-width)>>1); for (y=0; y<height; y++) { - RENAME(extract_even)(src, ydst, width); + extract_even_c(src, ydst, width); if(y&1) { - RENAME(extract_odd2avg)(src-srcStride, src, udst, vdst, chromWidth); + extract_odd2avg_c(src - srcStride, src, udst, vdst, chromWidth); udst+= chromStride; vdst+= chromStride; } @@ -2823,51 +846,37 @@ static void RENAME(yuyvtoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, co src += srcStride; ydst+= lumStride; } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static void RENAME(yuyvtoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride) +static void yuyvtoyuv422_c(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + const uint8_t *src, int width, int height, + int lumStride, int chromStride, int srcStride) { - long y; - const long chromWidth= -((-width)>>1); + int y; + const int chromWidth= -((-width)>>1); for (y=0; y<height; y++) { - RENAME(extract_even)(src, ydst, width); - RENAME(extract_odd2)(src, udst, vdst, chromWidth); + extract_even_c(src, ydst, width); + extract_odd2_c(src, udst, vdst, chromWidth); src += srcStride; ydst+= lumStride; udst+= chromStride; vdst+= chromStride; } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static void RENAME(uyvytoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride) +static void uyvytoyuv420_c(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + const uint8_t *src, int width, int height, + int lumStride, int chromStride, int srcStride) { - long y; - const long chromWidth= -((-width)>>1); + int y; + const int chromWidth= -((-width)>>1); for (y=0; y<height; y++) { - RENAME(extract_even)(src+1, ydst, width); + extract_even_c(src + 1, ydst, width); if(y&1) { - RENAME(extract_even2avg)(src-srcStride, src, udst, vdst, chromWidth); + extract_even2avg_c(src - srcStride, src, udst, vdst, chromWidth); udst+= chromStride; vdst+= chromStride; } @@ -2875,73 +884,59 @@ 
static void RENAME(uyvytoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, co src += srcStride; ydst+= lumStride; } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static void RENAME(uyvytoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, - long width, long height, - long lumStride, long chromStride, long srcStride) +static void uyvytoyuv422_c(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + const uint8_t *src, int width, int height, + int lumStride, int chromStride, int srcStride) { - long y; - const long chromWidth= -((-width)>>1); + int y; + const int chromWidth= -((-width)>>1); for (y=0; y<height; y++) { - RENAME(extract_even)(src+1, ydst, width); - RENAME(extract_even2)(src, udst, vdst, chromWidth); + extract_even_c(src + 1, ydst, width); + extract_even2_c(src, udst, vdst, chromWidth); src += srcStride; ydst+= lumStride; udst+= chromStride; vdst+= chromStride; } -#if COMPILE_TEMPLATE_MMX - __asm__( - EMMS" \n\t" - SFENCE" \n\t" - ::: "memory" - ); -#endif } -static inline void RENAME(rgb2rgb_init)(void) -{ - rgb15to16 = RENAME(rgb15to16); - rgb15tobgr24 = RENAME(rgb15tobgr24); - rgb15to32 = RENAME(rgb15to32); - rgb16tobgr24 = RENAME(rgb16tobgr24); - rgb16to32 = RENAME(rgb16to32); - rgb16to15 = RENAME(rgb16to15); - rgb24tobgr16 = RENAME(rgb24tobgr16); - rgb24tobgr15 = RENAME(rgb24tobgr15); - rgb24tobgr32 = RENAME(rgb24tobgr32); - rgb32to16 = RENAME(rgb32to16); - rgb32to15 = RENAME(rgb32to15); - rgb32tobgr24 = RENAME(rgb32tobgr24); - rgb24to15 = RENAME(rgb24to15); - rgb24to16 = RENAME(rgb24to16); - rgb24tobgr24 = RENAME(rgb24tobgr24); - shuffle_bytes_2103 = RENAME(shuffle_bytes_2103); - rgb32tobgr16 = RENAME(rgb32tobgr16); - rgb32tobgr15 = RENAME(rgb32tobgr15); - yv12toyuy2 = RENAME(yv12toyuy2); - yv12touyvy = RENAME(yv12touyvy); - yuv422ptoyuy2 = RENAME(yuv422ptoyuy2); - yuv422ptouyvy = RENAME(yuv422ptouyvy); - yuy2toyv12 = RENAME(yuy2toyv12); - planar2x = RENAME(planar2x); - rgb24toyv12 = RENAME(rgb24toyv12); - interleaveBytes = RENAME(interleaveBytes); - vu9_to_vu12 = RENAME(vu9_to_vu12); - yvu9_to_yuy2 = RENAME(yvu9_to_yuy2); - - uyvytoyuv420 = RENAME(uyvytoyuv420); - uyvytoyuv422 = RENAME(uyvytoyuv422); - yuyvtoyuv420 = RENAME(yuyvtoyuv420); - yuyvtoyuv422 = RENAME(yuyvtoyuv422); +static inline void rgb2rgb_init_c(void) +{ + rgb15to16 = rgb15to16_c; + rgb15tobgr24 = rgb15tobgr24_c; + rgb15to32 = rgb15to32_c; + rgb16tobgr24 = rgb16tobgr24_c; + rgb16to32 = rgb16to32_c; + rgb16to15 = rgb16to15_c; + rgb24tobgr16 = rgb24tobgr16_c; + rgb24tobgr15 = rgb24tobgr15_c; + rgb24tobgr32 = rgb24tobgr32_c; + rgb32to16 = rgb32to16_c; + rgb32to15 = rgb32to15_c; + rgb32tobgr24 = rgb32tobgr24_c; + rgb24to15 = rgb24to15_c; + rgb24to16 = rgb24to16_c; + rgb24tobgr24 = rgb24tobgr24_c; + shuffle_bytes_2103 = shuffle_bytes_2103_c; + rgb32tobgr16 = rgb32tobgr16_c; + rgb32tobgr15 = rgb32tobgr15_c; + yv12toyuy2 = yv12toyuy2_c; + yv12touyvy = yv12touyvy_c; + yuv422ptoyuy2 = yuv422ptoyuy2_c; + yuv422ptouyvy = yuv422ptouyvy_c; + yuy2toyv12 = yuy2toyv12_c; + planar2x = planar2x_c; + rgb24toyv12 = rgb24toyv12_c; + interleaveBytes = interleaveBytes_c; + vu9_to_vu12 = vu9_to_vu12_c; + yvu9_to_yuy2 = yvu9_to_yuy2_c; + + uyvytoyuv420 = uyvytoyuv420_c; + uyvytoyuv422 = uyvytoyuv422_c; + yuyvtoyuv420 = yuyvtoyuv420_c; + yuyvtoyuv422 = yuyvtoyuv422_c; } diff --git a/libswscale/swscale-test.c b/libswscale/swscale-test.c index 7f171ea725..888cbab26a 100644 --- a/libswscale/swscale-test.c +++ b/libswscale/swscale-test.c @@ -58,15 +58,11 @@ static 
uint64_t getSSD(uint8_t *src1, uint8_t *src2, int stride1, int stride2, i int x,y; uint64_t ssd=0; -//printf("%d %d\n", w, h); - for (y=0; y<h; y++) { for (x=0; x<w; x++) { int d= src1[x + y*stride1] - src2[x + y*stride2]; ssd+= d*d; -//printf("%d", abs(src1[x + y*stride1] - src2[x + y*stride2])/26 ); } -//printf("\n"); } return ssd; } @@ -162,8 +158,6 @@ static int doTest(uint8_t *ref[4], int refStride[4], int w, int h, goto end; } -// printf("test %X %X %X -> %X %X %X\n", (int)ref[0], (int)ref[1], (int)ref[2], -// (int)src[0], (int)src[1], (int)src[2]); printf(" %s %dx%d -> %s %3dx%3d flags=%2d", av_pix_fmt_descriptors[srcFormat].name, srcW, srcH, diff --git a/libswscale/swscale.c b/libswscale/swscale.c index d53af2771d..4318e0bf15 100644 --- a/libswscale/swscale.c +++ b/libswscale/swscale.c @@ -60,29 +60,14 @@ untested special converters #include "swscale.h" #include "swscale_internal.h" #include "rgb2rgb.h" +#include "libavutil/avassert.h" #include "libavutil/intreadwrite.h" -#include "libavutil/x86_cpu.h" +#include "libavutil/cpu.h" #include "libavutil/avutil.h" #include "libavutil/mathematics.h" #include "libavutil/bswap.h" #include "libavutil/pixdesc.h" -#undef MOVNTQ -#undef PAVGB - -//#undef HAVE_MMX2 -//#define HAVE_AMD3DNOW -//#undef HAVE_MMX -//#undef ARCH_X86 -#define DITHER1XBPP - -#define isPacked(x) ( \ - (x)==PIX_FMT_PAL8 \ - || (x)==PIX_FMT_YUYV422 \ - || (x)==PIX_FMT_UYVY422 \ - || (x)==PIX_FMT_GRAY8A \ - || isAnyRGB(x) \ - ) #define RGB2YUV_SHIFT 15 #define BY ( (int)(0.114*219/255*(1<<RGB2YUV_SHIFT)+0.5)) @@ -121,63 +106,6 @@ add BGR4 output support write special BGR->BGR scaler */ -#if ARCH_X86 -DECLARE_ASM_CONST(8, uint64_t, bF8)= 0xF8F8F8F8F8F8F8F8LL; -DECLARE_ASM_CONST(8, uint64_t, bFC)= 0xFCFCFCFCFCFCFCFCLL; -DECLARE_ASM_CONST(8, uint64_t, w10)= 0x0010001000100010LL; -DECLARE_ASM_CONST(8, uint64_t, w02)= 0x0002000200020002LL; -DECLARE_ASM_CONST(8, uint64_t, bm00001111)=0x00000000FFFFFFFFLL; -DECLARE_ASM_CONST(8, uint64_t, bm00000111)=0x0000000000FFFFFFLL; -DECLARE_ASM_CONST(8, uint64_t, bm11111000)=0xFFFFFFFFFF000000LL; -DECLARE_ASM_CONST(8, uint64_t, bm01010101)=0x00FF00FF00FF00FFLL; - -const DECLARE_ALIGNED(8, uint64_t, ff_dither4)[2] = { - 0x0103010301030103LL, - 0x0200020002000200LL,}; - -const DECLARE_ALIGNED(8, uint64_t, ff_dither8)[2] = { - 0x0602060206020602LL, - 0x0004000400040004LL,}; - -DECLARE_ASM_CONST(8, uint64_t, b16Mask)= 0x001F001F001F001FLL; -DECLARE_ASM_CONST(8, uint64_t, g16Mask)= 0x07E007E007E007E0LL; -DECLARE_ASM_CONST(8, uint64_t, r16Mask)= 0xF800F800F800F800LL; -DECLARE_ASM_CONST(8, uint64_t, b15Mask)= 0x001F001F001F001FLL; -DECLARE_ASM_CONST(8, uint64_t, g15Mask)= 0x03E003E003E003E0LL; -DECLARE_ASM_CONST(8, uint64_t, r15Mask)= 0x7C007C007C007C00LL; - -DECLARE_ALIGNED(8, const uint64_t, ff_M24A) = 0x00FF0000FF0000FFLL; -DECLARE_ALIGNED(8, const uint64_t, ff_M24B) = 0xFF0000FF0000FF00LL; -DECLARE_ALIGNED(8, const uint64_t, ff_M24C) = 0x0000FF0000FF0000LL; - -#ifdef FAST_BGR2YV12 -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2YCoeff) = 0x000000210041000DULL; -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2UCoeff) = 0x0000FFEEFFDC0038ULL; -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2VCoeff) = 0x00000038FFD2FFF8ULL; -#else -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2YCoeff) = 0x000020E540830C8BULL; -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2UCoeff) = 0x0000ED0FDAC23831ULL; -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2VCoeff) = 0x00003831D0E6F6EAULL; -#endif /* FAST_BGR2YV12 */ -DECLARE_ALIGNED(8, const uint64_t, ff_bgr2YOffset) = 0x1010101010101010ULL; 
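For reference, a minimal standalone sketch (not part of this patch) of the luma conversion implied by the scalar RGB2YUV_SHIFT coefficients kept above: BY is shown in the context lines, RY and GY are assumed to be defined the same way with the 0.299 and 0.587 BT.601 weights, and the rounding term matches the one in the rgb48ToY code being replaced further down.

#include <stdint.h>

#define RGB2YUV_SHIFT 15
/* BY appears in the context above; RY/GY are assumed analogous (BT.601 weights) */
#define RY ((int)(0.299 * 219 / 255 * (1 << RGB2YUV_SHIFT) + 0.5))
#define GY ((int)(0.587 * 219 / 255 * (1 << RGB2YUV_SHIFT) + 0.5))
#define BY ((int)(0.114 * 219 / 255 * (1 << RGB2YUV_SHIFT) + 0.5))

/* 8-bit full-range RGB -> limited-range (16..235) 8-bit luma */
static uint8_t rgb_to_y(uint8_t r, uint8_t g, uint8_t b)
{
    /* 33 << (SHIFT-1) adds the +16 black offset plus 0.5 for rounding */
    return (RY * r + GY * g + BY * b +
            (33 << (RGB2YUV_SHIFT - 1))) >> RGB2YUV_SHIFT;
}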
-DECLARE_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 0x8080808080808080ULL; -DECLARE_ALIGNED(8, const uint64_t, ff_w1111) = 0x0001000100010001ULL; - -DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toY1Coeff) = 0x0C88000040870C88ULL; -DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toY2Coeff) = 0x20DE4087000020DEULL; -DECLARE_ASM_CONST(8, uint64_t, ff_rgb24toY1Coeff) = 0x20DE0000408720DEULL; -DECLARE_ASM_CONST(8, uint64_t, ff_rgb24toY2Coeff) = 0x0C88408700000C88ULL; -DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toYOffset) = 0x0008400000084000ULL; - -DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toUV)[2][4] = { - {0x38380000DAC83838ULL, 0xECFFDAC80000ECFFULL, 0xF6E40000D0E3F6E4ULL, 0x3838D0E300003838ULL}, - {0xECFF0000DAC8ECFFULL, 0x3838DAC800003838ULL, 0x38380000D0E33838ULL, 0xF6E4D0E30000F6E4ULL}, -}; - -DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toUVOffset)= 0x0040400000404000ULL; - -#endif /* ARCH_X86 */ - DECLARE_ALIGNED(8, static const uint8_t, dither_2x2_4)[2][8]={ { 1, 3, 1, 3, 1, 3, 1, 3, }, { 2, 0, 2, 0, 2, 0, 2, 0, }, @@ -341,7 +269,9 @@ DECLARE_ALIGNED(8, const uint8_t, dithers)[8][8][8]={ { 112, 16,104, 8,118, 22,110, 14,}, }}; -uint16_t dither_scale[15][16]={ +static const uint8_t flat64[8]={64,64,64,64,64,64,64,64}; + +const uint16_t dither_scale[15][16]={ { 2, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,}, { 2, 3, 7, 7, 13, 13, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,}, { 3, 3, 4, 15, 15, 29, 57, 57, 57, 113, 113, 113, 113, 113, 113, 113,}, @@ -359,10 +289,14 @@ uint16_t dither_scale[15][16]={ { 3, 5, 7, 9, 11, 12, 14, 15, 15, 15, 15, 15, 15, 15, 16,65535,}, }; -static av_always_inline void yuv2yuvX16inC_template(const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint16_t *dest, uint16_t *uDest, uint16_t *vDest, uint16_t *aDest, - int dstW, int chrDstW, int big_endian, int output_bits) +static av_always_inline void +yuv2yuvX16_c_template(const int16_t *lumFilter, const int16_t **lumSrc, + int lumFilterSize, const int16_t *chrFilter, + const int16_t **chrUSrc, const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint16_t *dest, uint16_t *uDest, uint16_t *vDest, + uint16_t *aDest, int dstW, int chrDstW, + int big_endian, int output_bits) { //FIXME Optimize (just quickly written not optimized..) 
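/* This template now covers all high-bit-depth planar outputs: the rounding
 * bias scales as 1 << (26 - output_bits), samples are written through the
 * local output_pixel() helper (which honours big_endian), and chroma comes
 * from the separate chrUSrc/chrVSrc planes instead of a single buffer
 * offset by VOFW.  The yuv2NBPS() macro below instantiates the 9-, 10- and
 * 16-bit little- and big-endian variants from it. */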
int i; @@ -383,7 +317,7 @@ static av_always_inline void yuv2yuvX16inC_template(const int16_t *lumFilter, co } \ } for (i = 0; i < dstW; i++) { - int val = 1 << 10; + int val = 1 << (26-output_bits); int j; for (j = 0; j < lumFilterSize; j++) @@ -394,13 +328,13 @@ static av_always_inline void yuv2yuvX16inC_template(const int16_t *lumFilter, co if (uDest) { for (i = 0; i < chrDstW; i++) { - int u = 1 << 10; - int v = 1 << 10; + int u = 1 << (26-output_bits); + int v = 1 << (26-output_bits); int j; for (j = 0; j < chrFilterSize; j++) { - u += chrSrc[j][i ] * chrFilter[j]; - v += chrSrc[j][i + VOFW] * chrFilter[j]; + u += chrUSrc[j][i] * chrFilter[j]; + v += chrVSrc[j][i] * chrFilter[j]; } output_pixel(&uDest[i], u); @@ -410,7 +344,7 @@ static av_always_inline void yuv2yuvX16inC_template(const int16_t *lumFilter, co if (CONFIG_SWSCALE_ALPHA && aDest) { for (i = 0; i < dstW; i++) { - int val = 1 << 10; + int val = 1 << (26-output_bits); int j; for (j = 0; j < lumFilterSize; j++) @@ -419,89 +353,46 @@ static av_always_inline void yuv2yuvX16inC_template(const int16_t *lumFilter, co output_pixel(&aDest[i], val); } } +#undef output_pixel } -static av_always_inline void yuv2yuvXNinC_template(const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint16_t *dest, uint16_t *uDest, uint16_t *vDest, uint16_t *aDest, - int dstW, int chrDstW, int big_endian, int depth) -{ - //FIXME Optimize (just quickly written not optimized..) - int i; - - for (i = 0; i < dstW; i++) { - int val = 1 << (26-depth); - int j; - - for (j = 0; j < lumFilterSize; j++) - val += lumSrc[j][i] * lumFilter[j]; - - if (big_endian) { - AV_WB16(&dest[i], av_clip(val >> (27-depth), 0, (1<<depth)-1)); - } else { - AV_WL16(&dest[i], av_clip(val >> (27-depth), 0, (1<<depth)-1)); - } - } - - if (uDest) { - for (i = 0; i < chrDstW; i++) { - int u = 1 << (26-depth); - int v = 1 << (26-depth); - int j; - - for (j = 0; j < chrFilterSize; j++) { - u += chrSrc[j][i ] * chrFilter[j]; - v += chrSrc[j][i + VOFW] * chrFilter[j]; - } - - if (big_endian) { - AV_WB16(&uDest[i], av_clip(u >> (27-depth), 0, (1<<depth)-1)); - AV_WB16(&vDest[i], av_clip(v >> (27-depth), 0, (1<<depth)-1)); - } else { - AV_WL16(&uDest[i], av_clip(u >> (27-depth), 0, (1<<depth)-1)); - AV_WL16(&vDest[i], av_clip(v >> (27-depth), 0, (1<<depth)-1)); - } - } - } -} - -static inline void yuv2yuvX16inC(const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint16_t *dest, uint16_t *uDest, uint16_t *vDest, uint16_t *aDest, int dstW, int chrDstW, - enum PixelFormat dstFormat) -{ - if (isNBPS(dstFormat)) { - const int depth = av_pix_fmt_descriptors[dstFormat].comp[0].depth_minus1+1; - yuv2yuvXNinC_template(lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - alpSrc, - dest, uDest, vDest, aDest, - dstW, chrDstW, isBE(dstFormat), depth); - } else { - if (isBE(dstFormat)) { - yuv2yuvX16inC_template(lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - alpSrc, - dest, uDest, vDest, aDest, - dstW, chrDstW, 1, 16); - } else { - yuv2yuvX16inC_template(lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - alpSrc, - dest, uDest, vDest, aDest, - dstW, chrDstW, 0, 16); - } - } +#define yuv2NBPS(bits, BE_LE, is_be) \ +static void yuv2yuvX ## bits ## BE_LE ## _c(SwsContext *c, const int16_t *lumFilter, \ + const int16_t 
**lumSrc, int lumFilterSize, \ + const int16_t *chrFilter, const int16_t **chrUSrc, \ + const int16_t **chrVSrc, \ + int chrFilterSize, const int16_t **alpSrc, \ + uint8_t *_dest, uint8_t *_uDest, uint8_t *_vDest, \ + uint8_t *_aDest, int dstW, int chrDstW) \ +{ \ + uint16_t *dest = (uint16_t *) _dest, *uDest = (uint16_t *) _uDest, \ + *vDest = (uint16_t *) _vDest, *aDest = (uint16_t *) _aDest; \ + yuv2yuvX16_c_template(lumFilter, lumSrc, lumFilterSize, \ + chrFilter, chrUSrc, chrVSrc, chrFilterSize, \ + alpSrc, \ + dest, uDest, vDest, aDest, \ + dstW, chrDstW, is_be, bits); \ } - -static inline void yuv2yuvXinC(const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint8_t *dest, uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, int dstW, int chrDstW) +yuv2NBPS( 9, BE, 1); +yuv2NBPS( 9, LE, 0); +yuv2NBPS(10, BE, 1); +yuv2NBPS(10, LE, 0); +yuv2NBPS(16, BE, 1); +yuv2NBPS(16, LE, 0); + +static void yuv2yuvX_c(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, + const uint8_t *lumDither, const uint8_t *chrDither) { //FIXME Optimize (just quickly written not optimized..) int i; for (i=0; i<dstW; i++) { - int val=1<<18; + int val = lumDither[i&7] << 12; int j; for (j=0; j<lumFilterSize; j++) val += lumSrc[j][i] * lumFilter[j]; @@ -511,12 +402,12 @@ static inline void yuv2yuvXinC(const int16_t *lumFilter, const int16_t **lumSrc, if (uDest) for (i=0; i<chrDstW; i++) { - int u=1<<18; - int v=1<<18; + int u = chrDither[i&7] << 12; + int v = chrDither[(i+3)&7] << 12; int j; for (j=0; j<chrFilterSize; j++) { - u += chrSrc[j][i] * chrFilter[j]; - v += chrSrc[j][i + VOFW] * chrFilter[j]; + u += chrUSrc[j][i] * chrFilter[j]; + v += chrVSrc[j][i] * chrFilter[j]; } uDest[i]= av_clip_uint8(u>>19); @@ -525,24 +416,58 @@ static inline void yuv2yuvXinC(const int16_t *lumFilter, const int16_t **lumSrc, if (CONFIG_SWSCALE_ALPHA && aDest) for (i=0; i<dstW; i++) { - int val=1<<18; + int val = lumDither[i&7] << 12; int j; for (j=0; j<lumFilterSize; j++) val += alpSrc[j][i] * lumFilter[j]; aDest[i]= av_clip_uint8(val>>19); } +} + +static void yuv2yuv1_c(SwsContext *c, const int16_t *lumSrc, + const int16_t *chrUSrc, const int16_t *chrVSrc, + const int16_t *alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, const uint8_t *lumDither, const uint8_t *chrDither) +{ + int i; + + for (i=0; i<dstW; i++) { + int val= (lumSrc[i]+lumDither[i&7])>>7; + dest[i]= av_clip_uint8(val); + } + if (uDest) + for (i=0; i<chrDstW; i++) { + int u=(chrUSrc[i]+chrDither[i&7])>>7; + int v=(chrVSrc[i]+chrDither[(i+3)&7])>>7; + uDest[i]= av_clip_uint8(u); + vDest[i]= av_clip_uint8(v); + } + + if (CONFIG_SWSCALE_ALPHA && aDest) + for (i=0; i<dstW; i++) { + int val= (alpSrc[i]+lumDither[i&7])>>7; + aDest[i]= av_clip_uint8(val); + } } -static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - uint8_t *dest, uint8_t *uDest, int dstW, int chrDstW, int dstFormat) +static void yuv2nv12X_c(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const 
int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, uint8_t *uDest, + uint8_t *vDest, uint8_t *aDest, + int dstW, int chrDstW, + const uint8_t *lumDither, const uint8_t *chrDither) { + enum PixelFormat dstFormat = c->dstFormat; + //FIXME Optimize (just quickly written not optimized..) int i; for (i=0; i<dstW; i++) { - int val=1<<18; + int val = lumDither[i&7]<<12; int j; for (j=0; j<lumFilterSize; j++) val += lumSrc[j][i] * lumFilter[j]; @@ -555,12 +480,12 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc if (dstFormat == PIX_FMT_NV12) for (i=0; i<chrDstW; i++) { - int u=1<<18; - int v=1<<18; + int u = chrDither[i&7]<<12; + int v = chrDither[(i+3)&7]<<12; int j; for (j=0; j<chrFilterSize; j++) { - u += chrSrc[j][i] * chrFilter[j]; - v += chrSrc[j][i + VOFW] * chrFilter[j]; + u += chrUSrc[j][i] * chrFilter[j]; + v += chrVSrc[j][i] * chrFilter[j]; } uDest[2*i]= av_clip_uint8(u>>19); @@ -568,12 +493,12 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc } else for (i=0; i<chrDstW; i++) { - int u=1<<18; - int v=1<<18; + int u = chrDither[i&7]<<12; + int v = chrDither[(i+3)&7]<<12; int j; for (j=0; j<chrFilterSize; j++) { - u += chrSrc[j][i] * chrFilter[j]; - v += chrSrc[j][i + VOFW] * chrFilter[j]; + u += chrUSrc[j][i] * chrFilter[j]; + v += chrVSrc[j][i] * chrFilter[j]; } uDest[2*i]= av_clip_uint8(v>>19); @@ -581,7 +506,484 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc } } -#define YSCALE_YUV_2_PACKEDX_NOCLIP_C(type,alpha) \ +#define output_pixel(pos, val) \ + if (target == PIX_FMT_GRAY16BE) { \ + AV_WB16(pos, val); \ + } else { \ + AV_WL16(pos, val); \ + } + +static av_always_inline void +yuv2gray16_X_c_template(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, int dstW, + int y, enum PixelFormat target) +{ + int i; + + for (i = 0; i < (dstW >> 1); i++) { + int j; + int Y1 = 1 << 18; + int Y2 = 1 << 18; + const int i2 = 2 * i; + + for (j = 0; j < lumFilterSize; j++) { + Y1 += lumSrc[j][i2] * lumFilter[j]; + Y2 += lumSrc[j][i2+1] * lumFilter[j]; + } + Y1 >>= 11; + Y2 >>= 11; + if ((Y1 | Y2) & 0x10000) { + Y1 = av_clip_uint16(Y1); + Y2 = av_clip_uint16(Y2); + } + output_pixel(&dest[2 * i2 + 0], Y1); + output_pixel(&dest[2 * i2 + 2], Y2); + } +} + +static av_always_inline void +yuv2gray16_2_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, int dstW, + int yalpha, int uvalpha, int y, + enum PixelFormat target) +{ + int yalpha1 = 4095 - yalpha; \ + int i; + + for (i = 0; i < (dstW >> 1); i++) { + const int i2 = 2 * i; + int Y1 = (buf0[i2 ] * yalpha1 + buf1[i2 ] * yalpha) >> 11; + int Y2 = (buf0[i2+1] * yalpha1 + buf1[i2+1] * yalpha) >> 11; + + output_pixel(&dest[2 * i2 + 0], Y1); + output_pixel(&dest[2 * i2 + 2], Y2); + } +} + +static av_always_inline void +yuv2gray16_1_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, int dstW, + int uvalpha, enum PixelFormat dstFormat, + int flags, int y, enum PixelFormat target) +{ + int i; + + for (i = 0; i < (dstW >> 1); i++) { + const int i2 = 2 * 
i; + int Y1 = buf0[i2 ] << 1; + int Y2 = buf0[i2+1] << 1; + + output_pixel(&dest[2 * i2 + 0], Y1); + output_pixel(&dest[2 * i2 + 2], Y2); + } +} + +#undef output_pixel + +#define YUV2PACKEDWRAPPER(name, base, ext, fmt) \ +static void name ## ext ## _X_c(SwsContext *c, const int16_t *lumFilter, \ + const int16_t **lumSrc, int lumFilterSize, \ + const int16_t *chrFilter, const int16_t **chrUSrc, \ + const int16_t **chrVSrc, int chrFilterSize, \ + const int16_t **alpSrc, uint8_t *dest, int dstW, \ + int y) \ +{ \ + name ## base ## _X_c_template(c, lumFilter, lumSrc, lumFilterSize, \ + chrFilter, chrUSrc, chrVSrc, chrFilterSize, \ + alpSrc, dest, dstW, y, fmt); \ +} \ + \ +static void name ## ext ## _2_c(SwsContext *c, const uint16_t *buf0, \ + const uint16_t *buf1, const uint16_t *ubuf0, \ + const uint16_t *ubuf1, const uint16_t *vbuf0, \ + const uint16_t *vbuf1, const uint16_t *abuf0, \ + const uint16_t *abuf1, uint8_t *dest, int dstW, \ + int yalpha, int uvalpha, int y) \ +{ \ + name ## base ## _2_c_template(c, buf0, buf1, ubuf0, ubuf1, \ + vbuf0, vbuf1, abuf0, abuf1, \ + dest, dstW, yalpha, uvalpha, y, fmt); \ +} \ + \ +static void name ## ext ## _1_c(SwsContext *c, const uint16_t *buf0, \ + const uint16_t *ubuf0, const uint16_t *ubuf1, \ + const uint16_t *vbuf0, const uint16_t *vbuf1, \ + const uint16_t *abuf0, uint8_t *dest, int dstW, \ + int uvalpha, enum PixelFormat dstFormat, \ + int flags, int y) \ +{ \ + name ## base ## _1_c_template(c, buf0, ubuf0, ubuf1, vbuf0, \ + vbuf1, abuf0, dest, dstW, uvalpha, \ + dstFormat, flags, y, fmt); \ +} + +YUV2PACKEDWRAPPER(yuv2gray16,, LE, PIX_FMT_GRAY16LE); +YUV2PACKEDWRAPPER(yuv2gray16,, BE, PIX_FMT_GRAY16BE); + +#define output_pixel(pos, acc) \ + if (target == PIX_FMT_MONOBLACK) { \ + pos = acc; \ + } else { \ + pos = ~acc; \ + } + +static av_always_inline void +yuv2mono_X_c_template(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, int dstW, + int y, enum PixelFormat target) +{ + const uint8_t * const d128=dither_8x8_220[y&7]; + uint8_t *g = c->table_gU[128] + c->table_gV[128]; + int i; + int acc = 0; + + for (i = 0; i < dstW - 1; i += 2) { + int j; + int Y1 = 1 << 18; + int Y2 = 1 << 18; + + for (j = 0; j < lumFilterSize; j++) { + Y1 += lumSrc[j][i] * lumFilter[j]; + Y2 += lumSrc[j][i+1] * lumFilter[j]; + } + Y1 >>= 19; + Y2 >>= 19; + if ((Y1 | Y2) & 0x100) { + Y1 = av_clip_uint8(Y1); + Y2 = av_clip_uint8(Y2); + } + acc += acc + g[Y1 + d128[(i + 0) & 7]]; + acc += acc + g[Y2 + d128[(i + 1) & 7]]; + if ((i & 7) == 6) { + output_pixel(*dest++, acc); + } + } +} + +static av_always_inline void +yuv2mono_2_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, int dstW, + int yalpha, int uvalpha, int y, + enum PixelFormat target) +{ + const uint8_t * const d128 = dither_8x8_220[y & 7]; + uint8_t *g = c->table_gU[128] + c->table_gV[128]; + int yalpha1 = 4095 - yalpha; + int i; + + for (i = 0; i < dstW - 7; i += 8) { + int acc = g[((buf0[i ] * yalpha1 + buf1[i ] * yalpha) >> 19) + d128[0]]; + acc += acc + g[((buf0[i + 1] * yalpha1 + buf1[i + 1] * yalpha) >> 19) + d128[1]]; + acc += acc + g[((buf0[i + 2] * yalpha1 + buf1[i + 2] * yalpha) >> 19) + d128[2]]; + acc += acc + g[((buf0[i + 3] * yalpha1 + buf1[i 
+ 3] * yalpha) >> 19) + d128[3]]; + acc += acc + g[((buf0[i + 4] * yalpha1 + buf1[i + 4] * yalpha) >> 19) + d128[4]]; + acc += acc + g[((buf0[i + 5] * yalpha1 + buf1[i + 5] * yalpha) >> 19) + d128[5]]; + acc += acc + g[((buf0[i + 6] * yalpha1 + buf1[i + 6] * yalpha) >> 19) + d128[6]]; + acc += acc + g[((buf0[i + 7] * yalpha1 + buf1[i + 7] * yalpha) >> 19) + d128[7]]; + output_pixel(*dest++, acc); + } +} + +static av_always_inline void +yuv2mono_1_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, int dstW, + int uvalpha, enum PixelFormat dstFormat, + int flags, int y, enum PixelFormat target) +{ + const uint8_t * const d128 = dither_8x8_220[y & 7]; + uint8_t *g = c->table_gU[128] + c->table_gV[128]; + int i; + + for (i = 0; i < dstW - 7; i += 8) { + int acc = g[(buf0[i ] >> 7) + d128[0]]; + acc += acc + g[(buf0[i + 1] >> 7) + d128[1]]; + acc += acc + g[(buf0[i + 2] >> 7) + d128[2]]; + acc += acc + g[(buf0[i + 3] >> 7) + d128[3]]; + acc += acc + g[(buf0[i + 4] >> 7) + d128[4]]; + acc += acc + g[(buf0[i + 5] >> 7) + d128[5]]; + acc += acc + g[(buf0[i + 6] >> 7) + d128[6]]; + acc += acc + g[(buf0[i + 7] >> 7) + d128[7]]; + output_pixel(*dest++, acc); + } +} + +#undef output_pixel + +YUV2PACKEDWRAPPER(yuv2mono,, white, PIX_FMT_MONOWHITE); +YUV2PACKEDWRAPPER(yuv2mono,, black, PIX_FMT_MONOBLACK); + +#define output_pixels(pos, Y1, U, Y2, V) \ + if (target == PIX_FMT_YUYV422) { \ + dest[pos + 0] = Y1; \ + dest[pos + 1] = U; \ + dest[pos + 2] = Y2; \ + dest[pos + 3] = V; \ + } else { \ + dest[pos + 0] = U; \ + dest[pos + 1] = Y1; \ + dest[pos + 2] = V; \ + dest[pos + 3] = Y2; \ + } + +static av_always_inline void +yuv2422_X_c_template(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, int dstW, + int y, enum PixelFormat target) +{ + int i; + + for (i = 0; i < (dstW >> 1); i++) { + int j; + int Y1 = 1 << 18; + int Y2 = 1 << 18; + int U = 1 << 18; + int V = 1 << 18; + + for (j = 0; j < lumFilterSize; j++) { + Y1 += lumSrc[j][i * 2] * lumFilter[j]; + Y2 += lumSrc[j][i * 2 + 1] * lumFilter[j]; + } + for (j = 0; j < chrFilterSize; j++) { + U += chrUSrc[j][i] * chrFilter[j]; + V += chrVSrc[j][i] * chrFilter[j]; + } + Y1 >>= 19; + Y2 >>= 19; + U >>= 19; + V >>= 19; + if ((Y1 | Y2 | U | V) & 0x100) { + Y1 = av_clip_uint8(Y1); + Y2 = av_clip_uint8(Y2); + U = av_clip_uint8(U); + V = av_clip_uint8(V); + } + output_pixels(4*i, Y1, U, Y2, V); + } +} + +static av_always_inline void +yuv2422_2_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, int dstW, + int yalpha, int uvalpha, int y, + enum PixelFormat target) +{ + int yalpha1 = 4095 - yalpha; + int uvalpha1 = 4095 - uvalpha; + int i; + + for (i = 0; i < (dstW >> 1); i++) { + int Y1 = (buf0[i * 2] * yalpha1 + buf1[i * 2] * yalpha) >> 19; + int Y2 = (buf0[i * 2 + 1] * yalpha1 + buf1[i * 2 + 1] * yalpha) >> 19; + int U = (ubuf0[i] * uvalpha1 + ubuf1[i] * uvalpha) >> 19; + int V = (vbuf0[i] * uvalpha1 + vbuf1[i] * uvalpha) >> 19; + + output_pixels(i * 4, Y1, U, Y2, V); + } +} + +static av_always_inline void +yuv2422_1_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t 
*ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, int dstW, + int uvalpha, enum PixelFormat dstFormat, + int flags, int y, enum PixelFormat target) +{ + int i; + + if (uvalpha < 2048) { + for (i = 0; i < (dstW >> 1); i++) { + int Y1 = buf0[i * 2] >> 7; + int Y2 = buf0[i * 2 + 1] >> 7; + int U = ubuf1[i] >> 7; + int V = vbuf1[i] >> 7; + + output_pixels(i * 4, Y1, U, Y2, V); + } + } else { + for (i = 0; i < (dstW >> 1); i++) { + int Y1 = buf0[i * 2] >> 7; + int Y2 = buf0[i * 2 + 1] >> 7; + int U = (ubuf0[i] + ubuf1[i]) >> 8; + int V = (vbuf0[i] + vbuf1[i]) >> 8; + + output_pixels(i * 4, Y1, U, Y2, V); + } + } +} + +#undef output_pixels + +YUV2PACKEDWRAPPER(yuv2, 422, yuyv422, PIX_FMT_YUYV422); +YUV2PACKEDWRAPPER(yuv2, 422, uyvy422, PIX_FMT_UYVY422); + +#define r_b ((target == PIX_FMT_RGB48LE || target == PIX_FMT_RGB48BE) ? r : b) +#define b_r ((target == PIX_FMT_RGB48LE || target == PIX_FMT_RGB48BE) ? b : r) + +static av_always_inline void +yuv2rgb48_X_c_template(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, int dstW, + int y, enum PixelFormat target) +{ + int i; + + for (i = 0; i < (dstW >> 1); i++) { + int j; + int Y1 = 1 << 18; + int Y2 = 1 << 18; + int U = 1 << 18; + int V = 1 << 18; + const uint8_t *r, *g, *b; + + for (j = 0; j < lumFilterSize; j++) { + Y1 += lumSrc[j][i * 2] * lumFilter[j]; + Y2 += lumSrc[j][i * 2 + 1] * lumFilter[j]; + } + for (j = 0; j < chrFilterSize; j++) { + U += chrUSrc[j][i] * chrFilter[j]; + V += chrVSrc[j][i] * chrFilter[j]; + } + Y1 >>= 19; + Y2 >>= 19; + U >>= 19; + V >>= 19; + if ((Y1 | Y2 | U | V) & 0x100) { + Y1 = av_clip_uint8(Y1); + Y2 = av_clip_uint8(Y2); + U = av_clip_uint8(U); + V = av_clip_uint8(V); + } + + /* FIXME fix tables so that clipping is not needed and then use _NOCLIP*/ + r = (const uint8_t *) c->table_rV[V]; + g = (const uint8_t *)(c->table_gU[U] + c->table_gV[V]); + b = (const uint8_t *) c->table_bU[U]; + + dest[ 0] = dest[ 1] = r_b[Y1]; + dest[ 2] = dest[ 3] = g[Y1]; + dest[ 4] = dest[ 5] = b_r[Y1]; + dest[ 6] = dest[ 7] = r_b[Y2]; + dest[ 8] = dest[ 9] = g[Y2]; + dest[10] = dest[11] = b_r[Y2]; + dest += 12; + } +} + +static av_always_inline void +yuv2rgb48_2_c_template(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, int dstW, + int yalpha, int uvalpha, int y, + enum PixelFormat target) +{ + int yalpha1 = 4095 - yalpha; + int uvalpha1 = 4095 - uvalpha; + int i; + + for (i = 0; i < (dstW >> 1); i++) { + int Y1 = (buf0[i * 2] * yalpha1 + buf1[i * 2] * yalpha) >> 19; + int Y2 = (buf0[i * 2 + 1] * yalpha1 + buf1[i * 2 + 1] * yalpha) >> 19; + int U = (ubuf0[i] * uvalpha1 + ubuf1[i] * uvalpha) >> 19; + int V = (vbuf0[i] * uvalpha1 + vbuf1[i] * uvalpha) >> 19; + const uint8_t *r = (const uint8_t *) c->table_rV[V], + *g = (const uint8_t *)(c->table_gU[U] + c->table_gV[V]), + *b = (const uint8_t *) c->table_bU[U]; + + dest[ 0] = dest[ 1] = r_b[Y1]; + dest[ 2] = dest[ 3] = g[Y1]; + dest[ 4] = dest[ 5] = b_r[Y1]; + dest[ 6] = dest[ 7] = r_b[Y2]; + dest[ 8] = dest[ 9] = g[Y2]; + dest[10] = dest[11] = b_r[Y2]; + dest += 12; + } +} + +static av_always_inline void +yuv2rgb48_1_c_template(SwsContext *c, const uint16_t *buf0, + const 
uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, int dstW, + int uvalpha, enum PixelFormat dstFormat, + int flags, int y, enum PixelFormat target) +{ + int i; + + if (uvalpha < 2048) { + for (i = 0; i < (dstW >> 1); i++) { + int Y1 = buf0[i * 2] >> 7; + int Y2 = buf0[i * 2 + 1] >> 7; + int U = ubuf1[i] >> 7; + int V = vbuf1[i] >> 7; + const uint8_t *r = (const uint8_t *) c->table_rV[V], + *g = (const uint8_t *)(c->table_gU[U] + c->table_gV[V]), + *b = (const uint8_t *) c->table_bU[U]; + + dest[ 0] = dest[ 1] = r_b[Y1]; + dest[ 2] = dest[ 3] = g[Y1]; + dest[ 4] = dest[ 5] = b_r[Y1]; + dest[ 6] = dest[ 7] = r_b[Y2]; + dest[ 8] = dest[ 9] = g[Y2]; + dest[10] = dest[11] = b_r[Y2]; + dest += 12; + } + } else { + for (i = 0; i < (dstW >> 1); i++) { + int Y1 = buf0[i * 2] >> 7; + int Y2 = buf0[i * 2 + 1] >> 7; + int U = (ubuf0[i] + ubuf1[i]) >> 8; + int V = (vbuf0[i] + vbuf1[i]) >> 8; + const uint8_t *r = (const uint8_t *) c->table_rV[V], + *g = (const uint8_t *)(c->table_gU[U] + c->table_gV[V]), + *b = (const uint8_t *) c->table_bU[U]; + + dest[ 0] = dest[ 1] = r_b[Y1]; + dest[ 2] = dest[ 3] = g[Y1]; + dest[ 4] = dest[ 5] = b_r[Y1]; + dest[ 6] = dest[ 7] = r_b[Y2]; + dest[ 8] = dest[ 9] = g[Y2]; + dest[10] = dest[11] = b_r[Y2]; + dest += 12; + } + } +} + +#undef r_b +#undef b_r + +YUV2PACKEDWRAPPER(yuv2, rgb48, rgb48be, PIX_FMT_RGB48BE); +//YUV2PACKEDWRAPPER(yuv2, rgb48, rgb48le, PIX_FMT_RGB48LE); +YUV2PACKEDWRAPPER(yuv2, rgb48, bgr48be, PIX_FMT_BGR48BE); +//YUV2PACKEDWRAPPER(yuv2, rgb48, bgr48le, PIX_FMT_BGR48LE); + +#define YSCALE_YUV_2_RGBX_C(type,alpha) \ for (i=0; i<(dstW>>1); i++) {\ int j;\ int Y1 = 1<<18;\ @@ -597,13 +999,19 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc Y2 += lumSrc[j][i2+1] * lumFilter[j];\ }\ for (j=0; j<chrFilterSize; j++) {\ - U += chrSrc[j][i] * chrFilter[j];\ - V += chrSrc[j][i+VOFW] * chrFilter[j];\ + U += chrUSrc[j][i] * chrFilter[j];\ + V += chrVSrc[j][i] * chrFilter[j];\ }\ Y1>>=19;\ Y2>>=19;\ U >>=19;\ V >>=19;\ + if ((Y1|Y2|U|V)&0x100) {\ + Y1 = av_clip_uint8(Y1); \ + Y2 = av_clip_uint8(Y2); \ + U = av_clip_uint8(U); \ + V = av_clip_uint8(V); \ + }\ if (alpha) {\ A1 = 1<<18;\ A2 = 1<<18;\ @@ -613,31 +1021,22 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc }\ A1>>=19;\ A2>>=19;\ - } - -#define YSCALE_YUV_2_PACKEDX_C(type,alpha) \ - YSCALE_YUV_2_PACKEDX_NOCLIP_C(type,alpha)\ - if ((Y1|Y2|U|V)&256) {\ - if (Y1>255) Y1=255; \ - else if (Y1<0)Y1=0; \ - if (Y2>255) Y2=255; \ - else if (Y2<0)Y2=0; \ - if (U>255) U=255; \ - else if (U<0) U=0; \ - if (V>255) V=255; \ - else if (V<0) V=0; \ + if ((A1|A2)&0x100) {\ + A1 = av_clip_uint8(A1); \ + A2 = av_clip_uint8(A2); \ + }\ }\ - if (alpha && ((A1|A2)&256)) {\ - A1=av_clip_uint8(A1);\ - A2=av_clip_uint8(A2);\ - } + /* FIXME fix tables so that clipping is not needed and then use _NOCLIP*/\ + r = (type *)c->table_rV[V]; \ + g = (type *)(c->table_gU[U] + c->table_gV[V]); \ + b = (type *)c->table_bU[U]; -#define YSCALE_YUV_2_PACKEDX_FULL_C(rnd,alpha) \ +#define YSCALE_YUV_2_RGBX_FULL_C(rnd,alpha) \ for (i=0; i<dstW; i++) {\ int j;\ - int Y = 0;\ - int U = -128<<19;\ - int V = -128<<19;\ + int Y = 1<<9;\ + int U = (1<<9)-(128<<19);\ + int V = (1<<9)-(128<<19);\ int av_unused A;\ int R,G,B;\ \ @@ -645,23 +1044,20 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc Y += lumSrc[j][i ] * lumFilter[j];\ }\ for (j=0; j<chrFilterSize; j++) {\ - U += 
chrSrc[j][i ] * chrFilter[j];\ - V += chrSrc[j][i+VOFW] * chrFilter[j];\ + U += chrUSrc[j][i] * chrFilter[j];\ + V += chrVSrc[j][i] * chrFilter[j];\ }\ Y >>=10;\ U >>=10;\ V >>=10;\ if (alpha) {\ - A = rnd;\ + A = rnd>>3;\ for (j=0; j<lumFilterSize; j++)\ A += alpSrc[j][i ] * lumFilter[j];\ A >>=19;\ - if (A&256)\ + if (A&0x100)\ A = av_clip_uint8(A);\ - } - -#define YSCALE_YUV_2_RGBX_FULL_C(rnd,alpha) \ - YSCALE_YUV_2_PACKEDX_FULL_C(rnd>>3,alpha)\ + }\ Y-= c->yuv2rgb_y_offset;\ Y*= c->yuv2rgb_y_coeff;\ Y+= rnd;\ @@ -669,193 +1065,64 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc G= Y + V*c->yuv2rgb_v2g_coeff + U*c->yuv2rgb_u2g_coeff;\ B= Y + U*c->yuv2rgb_u2b_coeff;\ if ((R|G|B)&(0xC0000000)) {\ - if (R>=(256<<22)) R=(256<<22)-1; \ - else if (R<0)R=0; \ - if (G>=(256<<22)) G=(256<<22)-1; \ - else if (G<0)G=0; \ - if (B>=(256<<22)) B=(256<<22)-1; \ - else if (B<0)B=0; \ + R = av_clip_uintp2(R, 30); \ + G = av_clip_uintp2(G, 30); \ + B = av_clip_uintp2(B, 30); \ } -#define YSCALE_YUV_2_GRAY16_C \ - for (i=0; i<(dstW>>1); i++) {\ - int j;\ - int Y1 = 1<<18;\ - int Y2 = 1<<18;\ - int U = 1<<18;\ - int V = 1<<18;\ - \ - const int i2= 2*i;\ - \ - for (j=0; j<lumFilterSize; j++) {\ - Y1 += lumSrc[j][i2] * lumFilter[j];\ - Y2 += lumSrc[j][i2+1] * lumFilter[j];\ - }\ - Y1>>=11;\ - Y2>>=11;\ - if ((Y1|Y2|U|V)&65536) {\ - if (Y1>65535) Y1=65535; \ - else if (Y1<0)Y1=0; \ - if (Y2>65535) Y2=65535; \ - else if (Y2<0)Y2=0; \ - } - -#define YSCALE_YUV_2_RGBX_C(type,alpha) \ - YSCALE_YUV_2_PACKEDX_C(type,alpha) /* FIXME fix tables so that clipping is not needed and then use _NOCLIP*/\ - r = (type *)c->table_rV[V]; \ - g = (type *)(c->table_gU[U] + c->table_gV[V]); \ - b = (type *)c->table_bU[U]; - -#define YSCALE_YUV_2_PACKED2_C(type,alpha) \ +#define YSCALE_YUV_2_RGB2_C(type,alpha) \ for (i=0; i<(dstW>>1); i++) { \ const int i2= 2*i; \ int Y1= (buf0[i2 ]*yalpha1+buf1[i2 ]*yalpha)>>19; \ int Y2= (buf0[i2+1]*yalpha1+buf1[i2+1]*yalpha)>>19; \ - int U= (uvbuf0[i ]*uvalpha1+uvbuf1[i ]*uvalpha)>>19; \ - int V= (uvbuf0[i+VOFW]*uvalpha1+uvbuf1[i+VOFW]*uvalpha)>>19; \ + int U= (ubuf0[i]*uvalpha1+ubuf1[i]*uvalpha)>>19; \ + int V= (vbuf0[i]*uvalpha1+vbuf1[i]*uvalpha)>>19; \ type av_unused *r, *b, *g; \ int av_unused A1, A2; \ if (alpha) {\ A1= (abuf0[i2 ]*yalpha1+abuf1[i2 ]*yalpha)>>19; \ A2= (abuf0[i2+1]*yalpha1+abuf1[i2+1]*yalpha)>>19; \ - } - -#define YSCALE_YUV_2_GRAY16_2_C \ - for (i=0; i<(dstW>>1); i++) { \ - const int i2= 2*i; \ - int Y1= (buf0[i2 ]*yalpha1+buf1[i2 ]*yalpha)>>11; \ - int Y2= (buf0[i2+1]*yalpha1+buf1[i2+1]*yalpha)>>11; - -#define YSCALE_YUV_2_RGB2_C(type,alpha) \ - YSCALE_YUV_2_PACKED2_C(type,alpha)\ + }\ r = (type *)c->table_rV[V];\ g = (type *)(c->table_gU[U] + c->table_gV[V]);\ b = (type *)c->table_bU[U]; -#define YSCALE_YUV_2_PACKED1_C(type,alpha) \ +#define YSCALE_YUV_2_RGB1_C(type,alpha) \ for (i=0; i<(dstW>>1); i++) {\ const int i2= 2*i;\ int Y1= buf0[i2 ]>>7;\ int Y2= buf0[i2+1]>>7;\ - int U= (uvbuf1[i ])>>7;\ - int V= (uvbuf1[i+VOFW])>>7;\ + int U= (ubuf1[i])>>7;\ + int V= (vbuf1[i])>>7;\ type av_unused *r, *b, *g;\ int av_unused A1, A2;\ if (alpha) {\ A1= abuf0[i2 ]>>7;\ A2= abuf0[i2+1]>>7;\ - } - -#define YSCALE_YUV_2_GRAY16_1_C \ - for (i=0; i<(dstW>>1); i++) {\ - const int i2= 2*i;\ - int Y1= buf0[i2 ]<<1;\ - int Y2= buf0[i2+1]<<1; - -#define YSCALE_YUV_2_RGB1_C(type,alpha) \ - YSCALE_YUV_2_PACKED1_C(type,alpha)\ + }\ r = (type *)c->table_rV[V];\ g = (type *)(c->table_gU[U] + c->table_gV[V]);\ b = (type *)c->table_bU[U]; -#define 
YSCALE_YUV_2_PACKED1B_C(type,alpha) \ +#define YSCALE_YUV_2_RGB1B_C(type,alpha) \ for (i=0; i<(dstW>>1); i++) {\ const int i2= 2*i;\ int Y1= buf0[i2 ]>>7;\ int Y2= buf0[i2+1]>>7;\ - int U= (uvbuf0[i ] + uvbuf1[i ])>>8;\ - int V= (uvbuf0[i+VOFW] + uvbuf1[i+VOFW])>>8;\ + int U= (ubuf0[i] + ubuf1[i])>>8;\ + int V= (vbuf0[i] + vbuf1[i])>>8;\ type av_unused *r, *b, *g;\ int av_unused A1, A2;\ if (alpha) {\ A1= abuf0[i2 ]>>7;\ A2= abuf0[i2+1]>>7;\ - } - -#define YSCALE_YUV_2_RGB1B_C(type,alpha) \ - YSCALE_YUV_2_PACKED1B_C(type,alpha)\ + }\ r = (type *)c->table_rV[V];\ g = (type *)(c->table_gU[U] + c->table_gV[V]);\ b = (type *)c->table_bU[U]; -#define YSCALE_YUV_2_MONO2_C \ - const uint8_t * const d128=dither_8x8_220[y&7];\ - uint8_t *g= c->table_gU[128] + c->table_gV[128];\ - for (i=0; i<dstW-7; i+=8) {\ - int acc;\ - acc = g[((buf0[i ]*yalpha1+buf1[i ]*yalpha)>>19) + d128[0]];\ - acc+= acc + g[((buf0[i+1]*yalpha1+buf1[i+1]*yalpha)>>19) + d128[1]];\ - acc+= acc + g[((buf0[i+2]*yalpha1+buf1[i+2]*yalpha)>>19) + d128[2]];\ - acc+= acc + g[((buf0[i+3]*yalpha1+buf1[i+3]*yalpha)>>19) + d128[3]];\ - acc+= acc + g[((buf0[i+4]*yalpha1+buf1[i+4]*yalpha)>>19) + d128[4]];\ - acc+= acc + g[((buf0[i+5]*yalpha1+buf1[i+5]*yalpha)>>19) + d128[5]];\ - acc+= acc + g[((buf0[i+6]*yalpha1+buf1[i+6]*yalpha)>>19) + d128[6]];\ - acc+= acc + g[((buf0[i+7]*yalpha1+buf1[i+7]*yalpha)>>19) + d128[7]];\ - ((uint8_t*)dest)[0]= c->dstFormat == PIX_FMT_MONOBLACK ? acc : ~acc;\ - dest++;\ - } - -#define YSCALE_YUV_2_MONOX_C \ - const uint8_t * const d128=dither_8x8_220[y&7];\ - uint8_t *g= c->table_gU[128] + c->table_gV[128];\ - int acc=0;\ - for (i=0; i<dstW-1; i+=2) {\ - int j;\ - int Y1=1<<18;\ - int Y2=1<<18;\ -\ - for (j=0; j<lumFilterSize; j++) {\ - Y1 += lumSrc[j][i] * lumFilter[j];\ - Y2 += lumSrc[j][i+1] * lumFilter[j];\ - }\ - Y1>>=19;\ - Y2>>=19;\ - if ((Y1|Y2)&256) {\ - if (Y1>255) Y1=255;\ - else if (Y1<0)Y1=0;\ - if (Y2>255) Y2=255;\ - else if (Y2<0)Y2=0;\ - }\ - acc+= acc + g[Y1+d128[(i+0)&7]];\ - acc+= acc + g[Y2+d128[(i+1)&7]];\ - if ((i&7)==6) {\ - ((uint8_t*)dest)[0]= c->dstFormat == PIX_FMT_MONOBLACK ? 
acc : ~acc;\ - dest++;\ - }\ - } - -#define YSCALE_YUV_2_ANYRGB_C(func, func2, func_g16, func_monoblack)\ +#define YSCALE_YUV_2_ANYRGB_C(func)\ switch(c->dstFormat) {\ - case PIX_FMT_RGB48BE:\ - case PIX_FMT_RGB48LE:\ - func(uint8_t,0)\ - ((uint8_t*)dest)[ 0]= r[Y1];\ - ((uint8_t*)dest)[ 1]= r[Y1];\ - ((uint8_t*)dest)[ 2]= g[Y1];\ - ((uint8_t*)dest)[ 3]= g[Y1];\ - ((uint8_t*)dest)[ 4]= b[Y1];\ - ((uint8_t*)dest)[ 5]= b[Y1];\ - ((uint8_t*)dest)[ 6]= r[Y2];\ - ((uint8_t*)dest)[ 7]= r[Y2];\ - ((uint8_t*)dest)[ 8]= g[Y2];\ - ((uint8_t*)dest)[ 9]= g[Y2];\ - ((uint8_t*)dest)[10]= b[Y2];\ - ((uint8_t*)dest)[11]= b[Y2];\ - dest+=12;\ - }\ - break;\ - case PIX_FMT_BGR48BE:\ - case PIX_FMT_BGR48LE:\ - func(uint8_t,0)\ - ((uint8_t*)dest)[ 0] = ((uint8_t*)dest)[ 1] = b[Y1];\ - ((uint8_t*)dest)[ 2] = ((uint8_t*)dest)[ 3] = g[Y1];\ - ((uint8_t*)dest)[ 4] = ((uint8_t*)dest)[ 5] = r[Y1];\ - ((uint8_t*)dest)[ 6] = ((uint8_t*)dest)[ 7] = b[Y2];\ - ((uint8_t*)dest)[ 8] = ((uint8_t*)dest)[ 9] = g[Y2];\ - ((uint8_t*)dest)[10] = ((uint8_t*)dest)[11] = r[Y2];\ - dest+=12;\ - }\ - break;\ case PIX_FMT_RGBA:\ case PIX_FMT_BGRA:\ if (CONFIG_SMALL) {\ @@ -922,10 +1189,8 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc dest+=6;\ }\ break;\ - case PIX_FMT_RGB565BE:\ - case PIX_FMT_RGB565LE:\ - case PIX_FMT_BGR565BE:\ - case PIX_FMT_BGR565LE:\ + case PIX_FMT_RGB565:\ + case PIX_FMT_BGR565:\ {\ const int dr1= dither_2x2_8[y&1 ][0];\ const int dg1= dither_2x2_4[y&1 ][0];\ @@ -939,10 +1204,8 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc }\ }\ break;\ - case PIX_FMT_RGB555BE:\ - case PIX_FMT_RGB555LE:\ - case PIX_FMT_BGR555BE:\ - case PIX_FMT_BGR555LE:\ + case PIX_FMT_RGB555:\ + case PIX_FMT_BGR555:\ {\ const int dr1= dither_2x2_8[y&1 ][0];\ const int dg1= dither_2x2_8[y&1 ][1];\ @@ -956,10 +1219,8 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc }\ }\ break;\ - case PIX_FMT_RGB444BE:\ - case PIX_FMT_RGB444LE:\ - case PIX_FMT_BGR444BE:\ - case PIX_FMT_BGR444LE:\ + case PIX_FMT_RGB444:\ + case PIX_FMT_BGR444:\ {\ const int dr1= dither_4x4_16[y&3 ][0];\ const int dg1= dither_4x4_16[y&3 ][1];\ @@ -1006,57 +1267,23 @@ static inline void yuv2nv12XinC(const int16_t *lumFilter, const int16_t **lumSrc }\ }\ break;\ - case PIX_FMT_MONOBLACK:\ - case PIX_FMT_MONOWHITE:\ - {\ - func_monoblack\ - }\ - break;\ - case PIX_FMT_YUYV422:\ - func2\ - ((uint8_t*)dest)[2*i2+0]= Y1;\ - ((uint8_t*)dest)[2*i2+1]= U;\ - ((uint8_t*)dest)[2*i2+2]= Y2;\ - ((uint8_t*)dest)[2*i2+3]= V;\ - } \ - break;\ - case PIX_FMT_UYVY422:\ - func2\ - ((uint8_t*)dest)[2*i2+0]= U;\ - ((uint8_t*)dest)[2*i2+1]= Y1;\ - ((uint8_t*)dest)[2*i2+2]= V;\ - ((uint8_t*)dest)[2*i2+3]= Y2;\ - } \ - break;\ - case PIX_FMT_GRAY16BE:\ - func_g16\ - ((uint8_t*)dest)[2*i2+0]= Y1>>8;\ - ((uint8_t*)dest)[2*i2+1]= Y1;\ - ((uint8_t*)dest)[2*i2+2]= Y2>>8;\ - ((uint8_t*)dest)[2*i2+3]= Y2;\ - } \ - break;\ - case PIX_FMT_GRAY16LE:\ - func_g16\ - ((uint8_t*)dest)[2*i2+0]= Y1;\ - ((uint8_t*)dest)[2*i2+1]= Y1>>8;\ - ((uint8_t*)dest)[2*i2+2]= Y2;\ - ((uint8_t*)dest)[2*i2+3]= Y2>>8;\ - } \ - break;\ } -static inline void yuv2packedXinC(SwsContext *c, const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint8_t *dest, int dstW, int y) +static void yuv2packedX_c(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const 
int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, int dstW, int y) { int i; - YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGBX_C, YSCALE_YUV_2_PACKEDX_C(void,0), YSCALE_YUV_2_GRAY16_C, YSCALE_YUV_2_MONOX_C) + YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGBX_C) } -static inline void yuv2rgbXinC_full(SwsContext *c, const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint8_t *dest, int dstW, int y) +static void yuv2rgbX_c_full(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, int dstW, int y) { int i; int step= c->dstFormatBpp/8; @@ -1138,7 +1365,45 @@ static inline void yuv2rgbXinC_full(SwsContext *c, const int16_t *lumFilter, con } } -static void fillPlane(uint8_t* plane, int stride, int width, int height, int y, uint8_t val) +/** + * vertical bilinear scale YV12 to RGB + */ +static void yuv2packed2_c(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, int dstW, + int yalpha, int uvalpha, int y) +{ + int yalpha1=4095- yalpha; + int uvalpha1=4095-uvalpha; + int i; + + YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGB2_C) +} + +/** + * YV12 to RGB without scaling or interpolating + */ +static void yuv2packed1_c(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, int dstW, + int uvalpha, enum PixelFormat dstFormat, + int flags, int y) +{ + int i; + + if (uvalpha < 2048) { + YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGB1_C) + } else { + YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGB1B_C) + } +} + +static av_always_inline void fillPlane(uint8_t* plane, int stride, + int width, int height, + int y, uint8_t val) { int i; uint8_t *ptr = plane + stride*y; @@ -1148,1195 +1413,1186 @@ static void fillPlane(uint8_t* plane, int stride, int width, int height, int y, } } -static inline void rgb48ToY(uint8_t *dst, const uint8_t *src, long width, - uint32_t *unused) +#define input_pixel(pos) (isBE(origin) ? AV_RB16(pos) : AV_RL16(pos)) + +#define r ((origin == PIX_FMT_BGR48BE || origin == PIX_FMT_BGR48LE) ? b_r : r_b) +#define b ((origin == PIX_FMT_BGR48BE || origin == PIX_FMT_BGR48LE) ? 
r_b : b_r) + +static av_always_inline void +rgb48ToY_c_template(int16_t *dst, const uint16_t *src, int width, + enum PixelFormat origin) { int i; for (i = 0; i < width; i++) { - int r = src[i*6+0]; - int g = src[i*6+2]; - int b = src[i*6+4]; + int r_b = input_pixel(&src[i*3+0]); + int g = input_pixel(&src[i*3+1]); + int b_r = input_pixel(&src[i*3+2]); - dst[i] = (RY*r + GY*g + BY*b + (33<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT; + dst[i] = (RY*r + GY*g + BY*b + (32<<(RGB2YUV_SHIFT-1+8)) + (1<<(RGB2YUV_SHIFT-7+8))) >> (RGB2YUV_SHIFT-6+8); } } -static inline void rgb48ToUV(uint8_t *dstU, uint8_t *dstV, - const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *unused) +static av_always_inline void +rgb48ToUV_c_template(int16_t *dstU, int16_t *dstV, + const uint16_t *src1, const uint16_t *src2, + int width, enum PixelFormat origin) { int i; assert(src1==src2); for (i = 0; i < width; i++) { - int r = src1[6*i + 0]; - int g = src1[6*i + 2]; - int b = src1[6*i + 4]; + int r_b = input_pixel(&src1[i*3+0]); + int g = input_pixel(&src1[i*3+1]); + int b_r = input_pixel(&src1[i*3+2]); - dstU[i] = (RU*r + GU*g + BU*b + (257<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT; - dstV[i] = (RV*r + GV*g + BV*b + (257<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT; + dstU[i] = (RU*r + GU*g + BU*b + (256<<(RGB2YUV_SHIFT-1+8)) + (1<<(RGB2YUV_SHIFT-7+8))) >> (RGB2YUV_SHIFT-6+8); + dstV[i] = (RV*r + GV*g + BV*b + (256<<(RGB2YUV_SHIFT-1+8)) + (1<<(RGB2YUV_SHIFT-7+8))) >> (RGB2YUV_SHIFT-6+8); } } -static inline void rgb48ToUV_half(uint8_t *dstU, uint8_t *dstV, - const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *unused) +static av_always_inline void +rgb48ToUV_half_c_template(int16_t *dstU, int16_t *dstV, + const uint16_t *src1, const uint16_t *src2, + int width, enum PixelFormat origin) { int i; assert(src1==src2); for (i = 0; i < width; i++) { - int r= src1[12*i + 0] + src1[12*i + 6]; - int g= src1[12*i + 2] + src1[12*i + 8]; - int b= src1[12*i + 4] + src1[12*i + 10]; + int r_b = (input_pixel(&src1[6*i + 0])) + (input_pixel(&src1[6*i + 3])); + int g = (input_pixel(&src1[6*i + 1])) + (input_pixel(&src1[6*i + 4])); + int b_r = (input_pixel(&src1[6*i + 2])) + (input_pixel(&src1[6*i + 5])); - dstU[i]= (RU*r + GU*g + BU*b + (257<<RGB2YUV_SHIFT)) >> (RGB2YUV_SHIFT+1); - dstV[i]= (RV*r + GV*g + BV*b + (257<<RGB2YUV_SHIFT)) >> (RGB2YUV_SHIFT+1); + dstU[i]= (RU*r + GU*g + BU*b + (256U<<(RGB2YUV_SHIFT+8)) + (1<<(RGB2YUV_SHIFT-6+8))) >> (RGB2YUV_SHIFT-5+8); + dstV[i]= (RV*r + GV*g + BV*b + (256U<<(RGB2YUV_SHIFT+8)) + (1<<(RGB2YUV_SHIFT-6+8))) >> (RGB2YUV_SHIFT-5+8); } } -static inline void bgr48ToY(uint8_t *dst, const uint8_t *src, long width, - uint32_t *unused) +#undef r +#undef b +#undef input_pixel + +#define rgb48funcs(pattern, BE_LE, origin) \ +static void pattern ## 48 ## BE_LE ## ToY_c(uint8_t *dst, const uint8_t *src, \ + int width, uint32_t *unused) \ +{ \ + rgb48ToY_c_template(dst, src, width, origin); \ +} \ + \ +static void pattern ## 48 ## BE_LE ## ToUV_c(uint8_t *dstU, uint8_t *dstV, \ + const uint8_t *src1, const uint8_t *src2, \ + int width, uint32_t *unused) \ +{ \ + rgb48ToUV_c_template(dstU, dstV, src1, src2, width, origin); \ +} \ + \ +static void pattern ## 48 ## BE_LE ## ToUV_half_c(uint8_t *dstU, uint8_t *dstV, \ + const uint8_t *src1, const uint8_t *src2, \ + int width, uint32_t *unused) \ +{ \ + rgb48ToUV_half_c_template(dstU, dstV, src1, src2, width, origin); \ +} + +rgb48funcs(rgb, LE, PIX_FMT_RGB48LE); +rgb48funcs(rgb, BE, PIX_FMT_RGB48BE); +rgb48funcs(bgr, LE, PIX_FMT_BGR48LE); 
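/* Each rgb48funcs() invocation expands to three input readers (ToY, ToUV and
 * ToUV_half) for one packed 48-bit format; input_pixel() selects AV_RB16 or
 * AV_RL16 from the origin format's endianness, and the adjusted shift
 * ((RGB2YUV_SHIFT-6+8) instead of RGB2YUV_SHIFT) keeps six extra fractional
 * bits so the readers fill the higher-precision int16_t intermediate planes. */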
+rgb48funcs(bgr, BE, PIX_FMT_BGR48BE); + +#define input_pixel(i) ((origin == PIX_FMT_RGBA || origin == PIX_FMT_BGRA || \ + origin == PIX_FMT_ARGB || origin == PIX_FMT_ABGR) ? AV_RN32A(&src[(i)*4]) : \ + (isBE(origin) ? AV_RB16(&src[(i)*2]) : AV_RL16(&src[(i)*2]))) + +static av_always_inline void +rgb16_32ToY_c_template(int16_t *dst, const uint8_t *src, + int width, enum PixelFormat origin, + int shr, int shg, int shb, int shp, + int maskr, int maskg, int maskb, + int rsh, int gsh, int bsh, int S) { + const int ry = RY << rsh, gy = GY << gsh, by = BY << bsh, + rnd = (32<<((S)-1)) + (1<<(S-7)); int i; + for (i = 0; i < width; i++) { - int b = src[i*6+0]; - int g = src[i*6+2]; - int r = src[i*6+4]; + int px = input_pixel(i) >> shp; + int b = (px & maskb) >> shb; + int g = (px & maskg) >> shg; + int r = (px & maskr) >> shr; - dst[i] = (RY*r + GY*g + BY*b + (33<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT; + dst[i] = (ry * r + gy * g + by * b + rnd) >> ((S)-6); } } -static inline void bgr48ToUV(uint8_t *dstU, uint8_t *dstV, - const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *unused) +static av_always_inline void +rgb16_32ToUV_c_template(int16_t *dstU, int16_t *dstV, + const uint8_t *src, int width, + enum PixelFormat origin, + int shr, int shg, int shb, int shp, + int maskr, int maskg, int maskb, + int rsh, int gsh, int bsh, int S) { + const int ru = RU << rsh, gu = GU << gsh, bu = BU << bsh, + rv = RV << rsh, gv = GV << gsh, bv = BV << bsh, + rnd = (256<<((S)-1)) + (1<<(S-7)); int i; + for (i = 0; i < width; i++) { - int b = src1[6*i + 0]; - int g = src1[6*i + 2]; - int r = src1[6*i + 4]; + int px = input_pixel(i) >> shp; + int b = (px & maskb) >> shb; + int g = (px & maskg) >> shg; + int r = (px & maskr) >> shr; - dstU[i] = (RU*r + GU*g + BU*b + (257<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT; - dstV[i] = (RV*r + GV*g + BV*b + (257<<(RGB2YUV_SHIFT-1))) >> RGB2YUV_SHIFT; + dstU[i] = (ru * r + gu * g + bu * b + rnd) >> ((S)-6); + dstV[i] = (rv * r + gv * g + bv * b + rnd) >> ((S)-6); } } -static inline void bgr48ToUV_half(uint8_t *dstU, uint8_t *dstV, - const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *unused) +static av_always_inline void +rgb16_32ToUV_half_c_template(int16_t *dstU, int16_t *dstV, + const uint8_t *src, int width, + enum PixelFormat origin, + int shr, int shg, int shb, int shp, + int maskr, int maskg, int maskb, + int rsh, int gsh, int bsh, int S) { + const int ru = RU << rsh, gu = GU << gsh, bu = BU << bsh, + rv = RV << rsh, gv = GV << gsh, bv = BV << bsh, + rnd = (256U<<(S)) + (1<<(S-6)), maskgx = ~(maskr | maskb); int i; + + maskr |= maskr << 1; maskb |= maskb << 1; maskg |= maskg << 1; for (i = 0; i < width; i++) { - int b= src1[12*i + 0] + src1[12*i + 6]; - int g= src1[12*i + 2] + src1[12*i + 8]; - int r= src1[12*i + 4] + src1[12*i + 10]; + int px0 = input_pixel(2 * i + 0) >> shp; + int px1 = input_pixel(2 * i + 1) >> shp; + int b, r, g = (px0 & maskgx) + (px1 & maskgx); + int rb = px0 + px1 - g; + + b = (rb & maskb) >> shb; + if (shp || origin == PIX_FMT_BGR565LE || origin == PIX_FMT_BGR565BE || + origin == PIX_FMT_RGB565LE || origin == PIX_FMT_RGB565BE) { + g >>= shg; + } else { + g = (g & maskg) >> shg; + } + r = (rb & maskr) >> shr; - dstU[i]= (RU*r + GU*g + BU*b + (257<<RGB2YUV_SHIFT)) >> (RGB2YUV_SHIFT+1); - dstV[i]= (RV*r + GV*g + BV*b + (257<<RGB2YUV_SHIFT)) >> (RGB2YUV_SHIFT+1); + dstU[i] = (ru * r + gu * g + bu * b + (unsigned)rnd) >> ((S)-6+1); + dstV[i] = (rv * r + gv * g + bv * b + (unsigned)rnd) >> ((S)-6+1); } } -#define BGR2Y(type, name, 
shr, shg, shb, maskr, maskg, maskb, RY, GY, BY, S)\ -static inline void name(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused)\ -{\ - int i;\ - for (i=0; i<width; i++) {\ - int b= (((const type*)src)[i]>>shb)&maskb;\ - int g= (((const type*)src)[i]>>shg)&maskg;\ - int r= (((const type*)src)[i]>>shr)&maskr;\ -\ - dst[i]= (((RY)*r + (GY)*g + (BY)*b + (33<<((S)-1)))>>(S));\ - }\ +#undef input_pixel + +#define rgb16_32_wrapper(fmt, name, shr, shg, shb, shp, maskr, \ + maskg, maskb, rsh, gsh, bsh, S) \ +static void name ## ToY_c(uint8_t *dst, const uint8_t *src, \ + int width, uint32_t *unused) \ +{ \ + rgb16_32ToY_c_template(dst, src, width, fmt, shr, shg, shb, shp, \ + maskr, maskg, maskb, rsh, gsh, bsh, S); \ +} \ + \ +static void name ## ToUV_c(uint8_t *dstU, uint8_t *dstV, \ + const uint8_t *src, const uint8_t *dummy, \ + int width, uint32_t *unused) \ +{ \ + rgb16_32ToUV_c_template(dstU, dstV, src, width, fmt, shr, shg, shb, shp, \ + maskr, maskg, maskb, rsh, gsh, bsh, S); \ +} \ + \ +static void name ## ToUV_half_c(uint8_t *dstU, uint8_t *dstV, \ + const uint8_t *src, const uint8_t *dummy, \ + int width, uint32_t *unused) \ +{ \ + rgb16_32ToUV_half_c_template(dstU, dstV, src, width, fmt, shr, shg, shb, shp, \ + maskr, maskg, maskb, rsh, gsh, bsh, S); \ } -BGR2Y(uint32_t, bgr32ToY,16, 0, 0, 0x00FF, 0xFF00, 0x00FF, RY<< 8, GY , BY<< 8, RGB2YUV_SHIFT+8) -BGR2Y(uint32_t,bgr321ToY,16,16, 0, 0xFF00, 0x00FF, 0xFF00, RY , GY<<8, BY , RGB2YUV_SHIFT+8) -BGR2Y(uint32_t, rgb32ToY, 0, 0,16, 0x00FF, 0xFF00, 0x00FF, RY<< 8, GY , BY<< 8, RGB2YUV_SHIFT+8) -BGR2Y(uint32_t,rgb321ToY, 0,16,16, 0xFF00, 0x00FF, 0xFF00, RY , GY<<8, BY , RGB2YUV_SHIFT+8) -BGR2Y(uint16_t, bgr16ToY, 0, 0, 0, 0x001F, 0x07E0, 0xF800, RY<<11, GY<<5, BY , RGB2YUV_SHIFT+8) -BGR2Y(uint16_t, bgr15ToY, 0, 0, 0, 0x001F, 0x03E0, 0x7C00, RY<<10, GY<<5, BY , RGB2YUV_SHIFT+7) -BGR2Y(uint16_t, rgb16ToY, 0, 0, 0, 0xF800, 0x07E0, 0x001F, RY , GY<<5, BY<<11, RGB2YUV_SHIFT+8) -BGR2Y(uint16_t, rgb15ToY, 0, 0, 0, 0x7C00, 0x03E0, 0x001F, RY , GY<<5, BY<<10, RGB2YUV_SHIFT+7) +rgb16_32_wrapper(PIX_FMT_BGR32, bgr32, 16, 0, 0, 0, 0xFF0000, 0xFF00, 0x00FF, 8, 0, 8, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_BGR32_1, bgr321, 16, 0, 0, 8, 0xFF0000, 0xFF00, 0x00FF, 8, 0, 8, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_RGB32, rgb32, 0, 0, 16, 0, 0x00FF, 0xFF00, 0xFF0000, 8, 0, 8, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_RGB32_1, rgb321, 0, 0, 16, 8, 0x00FF, 0xFF00, 0xFF0000, 8, 0, 8, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_BGR565LE, bgr16le, 0, 0, 0, 0, 0x001F, 0x07E0, 0xF800, 11, 5, 0, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_BGR555LE, bgr15le, 0, 0, 0, 0, 0x001F, 0x03E0, 0x7C00, 10, 5, 0, RGB2YUV_SHIFT+7); +rgb16_32_wrapper(PIX_FMT_RGB565LE, rgb16le, 0, 0, 0, 0, 0xF800, 0x07E0, 0x001F, 0, 5, 11, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_RGB555LE, rgb15le, 0, 0, 0, 0, 0x7C00, 0x03E0, 0x001F, 0, 5, 10, RGB2YUV_SHIFT+7); +rgb16_32_wrapper(PIX_FMT_BGR565BE, bgr16be, 0, 0, 0, 0, 0x001F, 0x07E0, 0xF800, 11, 5, 0, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_BGR555BE, bgr15be, 0, 0, 0, 0, 0x001F, 0x03E0, 0x7C00, 10, 5, 0, RGB2YUV_SHIFT+7); +rgb16_32_wrapper(PIX_FMT_RGB565BE, rgb16be, 0, 0, 0, 0, 0xF800, 0x07E0, 0x001F, 0, 5, 11, RGB2YUV_SHIFT+8); +rgb16_32_wrapper(PIX_FMT_RGB555BE, rgb15be, 0, 0, 0, 0, 0x7C00, 0x03E0, 0x001F, 0, 5, 10, RGB2YUV_SHIFT+7); + +static void abgrToA_c(int16_t *dst, const uint8_t *src, int width, uint32_t *unused) +{ + int i; + for (i=0; i<width; i++) { + dst[i]= src[4*i]<<6; + } +} -static inline void 
abgrToA(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) +static void rgbaToA_c(int16_t *dst, const uint8_t *src, int width, uint32_t *unused) { int i; for (i=0; i<width; i++) { - dst[i]= src[4*i]; - } -} - -#define BGR2UV(type, name, shr, shg, shb, shp, maskr, maskg, maskb, RU, GU, BU, RV, GV, BV, S) \ -static inline void name(uint8_t *dstU, uint8_t *dstV, const uint8_t *src, const uint8_t *dummy, long width, uint32_t *unused)\ -{\ - int i;\ - for (i=0; i<width; i++) {\ - int b= ((((const type*)src)[i]>>shp)&maskb)>>shb;\ - int g= ((((const type*)src)[i]>>shp)&maskg)>>shg;\ - int r= ((((const type*)src)[i]>>shp)&maskr)>>shr;\ -\ - dstU[i]= ((RU)*r + (GU)*g + (BU)*b + (257<<((S)-1)))>>(S);\ - dstV[i]= ((RV)*r + (GV)*g + (BV)*b + (257<<((S)-1)))>>(S);\ - }\ -}\ -static inline void name ## _half(uint8_t *dstU, uint8_t *dstV, const uint8_t *src, const uint8_t *dummy, long width, uint32_t *unused)\ -{\ - int i;\ - for (i=0; i<width; i++) {\ - int pix0= ((const type*)src)[2*i+0]>>shp;\ - int pix1= ((const type*)src)[2*i+1]>>shp;\ - int g= (pix0&~(maskr|maskb))+(pix1&~(maskr|maskb));\ - int b= ((pix0+pix1-g)&(maskb|(2*maskb)))>>shb;\ - int r= ((pix0+pix1-g)&(maskr|(2*maskr)))>>shr;\ - g&= maskg|(2*maskg);\ -\ - g>>=shg;\ -\ - dstU[i]= ((RU)*r + (GU)*g + (BU)*b + (257<<(S)))>>((S)+1);\ - dstV[i]= ((RV)*r + (GV)*g + (BV)*b + (257<<(S)))>>((S)+1);\ - }\ -} - -BGR2UV(uint32_t, bgr32ToUV,16, 0, 0, 0, 0xFF0000, 0xFF00, 0x00FF, RU<< 8, GU , BU<< 8, RV<< 8, GV , BV<< 8, RGB2YUV_SHIFT+8) -BGR2UV(uint32_t,bgr321ToUV,16, 0, 0, 8, 0xFF0000, 0xFF00, 0x00FF, RU<< 8, GU , BU<< 8, RV<< 8, GV , BV<< 8, RGB2YUV_SHIFT+8) -BGR2UV(uint32_t, rgb32ToUV, 0, 0,16, 0, 0x00FF, 0xFF00, 0xFF0000, RU<< 8, GU , BU<< 8, RV<< 8, GV , BV<< 8, RGB2YUV_SHIFT+8) -BGR2UV(uint32_t,rgb321ToUV, 0, 0,16, 8, 0x00FF, 0xFF00, 0xFF0000, RU<< 8, GU , BU<< 8, RV<< 8, GV , BV<< 8, RGB2YUV_SHIFT+8) -BGR2UV(uint16_t, bgr16ToUV, 0, 0, 0, 0, 0x001F, 0x07E0, 0xF800, RU<<11, GU<<5, BU , RV<<11, GV<<5, BV , RGB2YUV_SHIFT+8) -BGR2UV(uint16_t, bgr15ToUV, 0, 0, 0, 0, 0x001F, 0x03E0, 0x7C00, RU<<10, GU<<5, BU , RV<<10, GV<<5, BV , RGB2YUV_SHIFT+7) -BGR2UV(uint16_t, rgb16ToUV, 0, 0, 0, 0, 0xF800, 0x07E0, 0x001F, RU , GU<<5, BU<<11, RV , GV<<5, BV<<11, RGB2YUV_SHIFT+8) -BGR2UV(uint16_t, rgb15ToUV, 0, 0, 0, 0, 0x7C00, 0x03E0, 0x001F, RU , GU<<5, BU<<10, RV , GV<<5, BV<<10, RGB2YUV_SHIFT+7) - -static inline void palToA(uint8_t *dst, const uint8_t *src, long width, uint32_t *pal) + dst[i]= src[4*i+3]<<6; + } +} + +static void palToA_c(int16_t *dst, const uint8_t *src, int width, uint32_t *pal) { int i; for (i=0; i<width; i++) { int d= src[i]; - dst[i]= pal[d] >> 24; + dst[i]= (pal[d] >> 24)<<6; } } -static inline void palToY(uint8_t *dst, const uint8_t *src, long width, uint32_t *pal) +static void palToY_c(int16_t *dst, const uint8_t *src, long width, uint32_t *pal) { int i; for (i=0; i<width; i++) { int d= src[i]; - dst[i]= pal[d] & 0xFF; + dst[i]= (pal[d] & 0xFF)<<6; } } -static inline void palToUV(uint8_t *dstU, uint8_t *dstV, +static void palToUV_c(uint16_t *dstU, int16_t *dstV, const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *pal) + int width, uint32_t *pal) { int i; assert(src1 == src2); for (i=0; i<width; i++) { int p= pal[src1[i]]; - dstU[i]= p>>8; - dstV[i]= p>>16; + dstU[i]= (uint8_t)(p>> 8)<<6; + dstV[i]= (uint8_t)(p>>16)<<6; } } -static inline void monowhite2Y(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) +static void monowhite2Y_c(int16_t *dst, const uint8_t *src, int width, uint32_t *unused) { 
int i, j; for (i=0; i<width/8; i++) { int d= ~src[i]; for(j=0; j<8; j++) - dst[8*i+j]= ((d>>(7-j))&1)*255; + dst[8*i+j]= ((d>>(7-j))&1)*16383; } } -static inline void monoblack2Y(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) +static void monoblack2Y_c(int16_t *dst, const uint8_t *src, int width, uint32_t *unused) { int i, j; for (i=0; i<width/8; i++) { int d= src[i]; for(j=0; j<8; j++) - dst[8*i+j]= ((d>>(7-j))&1)*255; - } -} - -//Note: we have C, MMX, MMX2, 3DNOW versions, there is no 3DNOW+MMX2 one -//Plain C versions -#if CONFIG_RUNTIME_CPUDETECT -# define COMPILE_C 1 -# if ARCH_X86 -# define COMPILE_MMX 1 -# define COMPILE_MMX2 1 -# define COMPILE_3DNOW 1 -# elif ARCH_PPC -# define COMPILE_ALTIVEC HAVE_ALTIVEC -# endif -#else /* CONFIG_RUNTIME_CPUDETECT */ -# if ARCH_X86 -# if HAVE_MMX2 -# define COMPILE_MMX2 1 -# elif HAVE_AMD3DNOW -# define COMPILE_3DNOW 1 -# elif HAVE_MMX -# define COMPILE_MMX 1 -# else -# define COMPILE_C 1 -# endif -# elif ARCH_PPC && HAVE_ALTIVEC -# define COMPILE_ALTIVEC 1 -# else -# define COMPILE_C 1 -# endif -#endif - -#ifndef COMPILE_C -# define COMPILE_C 0 -#endif -#ifndef COMPILE_MMX -# define COMPILE_MMX 0 -#endif -#ifndef COMPILE_MMX2 -# define COMPILE_MMX2 0 -#endif -#ifndef COMPILE_3DNOW -# define COMPILE_3DNOW 0 -#endif -#ifndef COMPILE_ALTIVEC -# define COMPILE_ALTIVEC 0 -#endif - -#define COMPILE_TEMPLATE_MMX 0 -#define COMPILE_TEMPLATE_MMX2 0 -#define COMPILE_TEMPLATE_AMD3DNOW 0 -#define COMPILE_TEMPLATE_ALTIVEC 0 - -#if COMPILE_C -#define RENAME(a) a ## _C -#include "swscale_template.c" -#endif - -#if COMPILE_ALTIVEC -#undef RENAME -#undef COMPILE_TEMPLATE_ALTIVEC -#define COMPILE_TEMPLATE_ALTIVEC 1 -#define RENAME(a) a ## _altivec -#include "swscale_template.c" -#endif - -#if ARCH_X86 - -//MMX versions -#if COMPILE_MMX -#undef RENAME -#undef COMPILE_TEMPLATE_MMX -#undef COMPILE_TEMPLATE_MMX2 -#undef COMPILE_TEMPLATE_AMD3DNOW -#define COMPILE_TEMPLATE_MMX 1 -#define COMPILE_TEMPLATE_MMX2 0 -#define COMPILE_TEMPLATE_AMD3DNOW 0 -#define RENAME(a) a ## _MMX -#include "swscale_template.c" -#endif - -//MMX2 versions -#if COMPILE_MMX2 -#undef RENAME -#undef COMPILE_TEMPLATE_MMX -#undef COMPILE_TEMPLATE_MMX2 -#undef COMPILE_TEMPLATE_AMD3DNOW -#define COMPILE_TEMPLATE_MMX 1 -#define COMPILE_TEMPLATE_MMX2 1 -#define COMPILE_TEMPLATE_AMD3DNOW 0 -#define RENAME(a) a ## _MMX2 -#include "swscale_template.c" -#endif - -//3DNOW versions -#if COMPILE_3DNOW -#undef RENAME -#undef COMPILE_TEMPLATE_MMX -#undef COMPILE_TEMPLATE_MMX2 -#undef COMPILE_TEMPLATE_AMD3DNOW -#define COMPILE_TEMPLATE_MMX 1 -#define COMPILE_TEMPLATE_MMX2 0 -#define COMPILE_TEMPLATE_AMD3DNOW 1 -#define RENAME(a) a ## _3DNow -#include "swscale_template.c" -#endif + dst[8*i+j]= ((d>>(7-j))&1)*16383; + } +} -#endif //ARCH_X86 +//FIXME yuy2* can read up to 7 samples too much -SwsFunc ff_getSwsFunc(SwsContext *c) +static void yuy2ToY_c(uint8_t *dst, const uint8_t *src, int width, + uint32_t *unused) { -#if CONFIG_RUNTIME_CPUDETECT - int flags = c->flags; - -#if ARCH_X86 - // ordered per speed fastest first - if (flags & SWS_CPU_CAPS_MMX2) { - sws_init_swScale_MMX2(c); - return swScale_MMX2; - } else if (flags & SWS_CPU_CAPS_3DNOW) { - sws_init_swScale_3DNow(c); - return swScale_3DNow; - } else if (flags & SWS_CPU_CAPS_MMX) { - sws_init_swScale_MMX(c); - return swScale_MMX; - } else { - sws_init_swScale_C(c); - return swScale_C; - } + int i; + for (i=0; i<width; i++) + dst[i]= src[2*i]; +} -#else -#if COMPILE_ALTIVEC - if (flags & SWS_CPU_CAPS_ALTIVEC) { - 
sws_init_swScale_altivec(c); - return swScale_altivec; - } else { - sws_init_swScale_C(c); - return swScale_C; +static void yuy2ToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) +{ + int i; + for (i=0; i<width; i++) { + dstU[i]= src1[4*i + 1]; + dstV[i]= src1[4*i + 3]; } -#endif - sws_init_swScale_C(c); - return swScale_C; -#endif /* ARCH_X86 */ -#else //CONFIG_RUNTIME_CPUDETECT -#if COMPILE_TEMPLATE_MMX2 - sws_init_swScale_MMX2(c); - return swScale_MMX2; -#elif COMPILE_TEMPLATE_AMD3DNOW - sws_init_swScale_3DNow(c); - return swScale_3DNow; -#elif COMPILE_TEMPLATE_MMX - sws_init_swScale_MMX(c); - return swScale_MMX; -#elif COMPILE_TEMPLATE_ALTIVEC - sws_init_swScale_altivec(c); - return swScale_altivec; -#else - sws_init_swScale_C(c); - return swScale_C; -#endif -#endif //!CONFIG_RUNTIME_CPUDETECT + assert(src1 == src2); } -static void copyPlane(const uint8_t *src, int srcStride, - int srcSliceY, int srcSliceH, int width, - uint8_t *dst, int dstStride) +static void LEToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - dst += dstStride * srcSliceY; - if (dstStride == srcStride && srcStride > 0) { - memcpy(dst, src, srcSliceH * dstStride); - } else { - int i; - for (i=0; i<srcSliceH; i++) { - memcpy(dst, src, width); - src += srcStride; - dst += dstStride; - } + int i; + for (i=0; i<width; i++) { + dstU[i]= src1[2*i + 1]; + dstV[i]= src2[2*i + 1]; } } -static int planarToNv12Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +/* This is almost identical to the previous, end exists only because + * yuy2ToY/UV)(dst, src+1, ...) would have 100% unaligned accesses. 
*/ +static void uyvyToY_c(uint8_t *dst, const uint8_t *src, int width, + uint32_t *unused) { - uint8_t *dst = dstParam[1] + dstStride[1]*srcSliceY/2; - - copyPlane(src[0], srcStride[0], srcSliceY, srcSliceH, c->srcW, - dstParam[0], dstStride[0]); - - if (c->dstFormat == PIX_FMT_NV12) - interleaveBytes(src[1], src[2], dst, c->srcW/2, srcSliceH/2, srcStride[1], srcStride[2], dstStride[0]); - else - interleaveBytes(src[2], src[1], dst, c->srcW/2, srcSliceH/2, srcStride[2], srcStride[1], dstStride[0]); - - return srcSliceH; + int i; + for (i=0; i<width; i++) + dst[i]= src[2*i+1]; } -static int planarToYuy2Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static void uyvyToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; - - yv12toyuy2(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]); - - return srcSliceH; + int i; + for (i=0; i<width; i++) { + dstU[i]= src1[4*i + 0]; + dstV[i]= src1[4*i + 2]; + } + assert(src1 == src2); } -static int planarToUyvyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static void BEToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; - - yv12touyvy(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]); - - return srcSliceH; + int i; + for (i=0; i<width; i++) { + dstU[i]= src1[2*i]; + dstV[i]= src2[2*i]; + } } -static int yuv422pToYuy2Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static av_always_inline void nvXXtoUV_c(uint8_t *dst1, uint8_t *dst2, + const uint8_t *src, int width) { - uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; - - yuv422ptoyuy2(src[0],src[1],src[2],dst,c->srcW,srcSliceH,srcStride[0],srcStride[1],dstStride[0]); - - return srcSliceH; + int i; + for (i = 0; i < width; i++) { + dst1[i] = src[2*i+0]; + dst2[i] = src[2*i+1]; + } } -static int yuv422pToUyvyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static void nv12ToUV_c(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) { - uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; - - yuv422ptouyvy(src[0],src[1],src[2],dst,c->srcW,srcSliceH,srcStride[0],srcStride[1],dstStride[0]); - - return srcSliceH; + nvXXtoUV_c(dstU, dstV, src1, width); } -static int yuyvToYuv420Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static void nv21ToUV_c(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) { - uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; - uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY/2; - uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY/2; + nvXXtoUV_c(dstV, dstU, src1, width); +} - yuyvtoyuv420(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); +#define input_pixel(pos) (isBE(origin) ? 
AV_RB16(pos) : AV_RL16(pos)) - if (dstParam[3]) - fillPlane(dstParam[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); +// FIXME Maybe dither instead. +static av_always_inline void +yuv9_OR_10ToUV_c_template(uint8_t *dstU, uint8_t *dstV, + const uint8_t *_srcU, const uint8_t *_srcV, + int width, enum PixelFormat origin, int depth) +{ + int i; + const uint16_t *srcU = (const uint16_t *) _srcU; + const uint16_t *srcV = (const uint16_t *) _srcV; - return srcSliceH; + for (i = 0; i < width; i++) { + dstU[i] = input_pixel(&srcU[i]) >> (depth - 8); + dstV[i] = input_pixel(&srcV[i]) >> (depth - 8); + } } -static int yuyvToYuv422Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static av_always_inline void +yuv9_or_10ToY_c_template(uint8_t *dstY, const uint8_t *_srcY, + int width, enum PixelFormat origin, int depth) { - uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; - uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY; - uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY; - - yuyvtoyuv422(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); + int i; + const uint16_t *srcY = (const uint16_t*)_srcY; - return srcSliceH; + for (i = 0; i < width; i++) + dstY[i] = input_pixel(&srcY[i]) >> (depth - 8); } -static int uyvyToYuv420Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) -{ - uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; - uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY/2; - uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY/2; +#undef input_pixel + +#define YUV_NBPS(depth, BE_LE, origin) \ +static void BE_LE ## depth ## ToUV_c(uint8_t *dstU, uint8_t *dstV, \ + const uint8_t *srcU, const uint8_t *srcV, \ + int width, uint32_t *unused) \ +{ \ + yuv9_OR_10ToUV_c_template(dstU, dstV, srcU, srcV, width, origin, depth); \ +} \ +static void BE_LE ## depth ## ToY_c(uint8_t *dstY, const uint8_t *srcY, \ + int width, uint32_t *unused) \ +{ \ + yuv9_or_10ToY_c_template(dstY, srcY, width, origin, depth); \ +} - uyvytoyuv420(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); +YUV_NBPS( 9, LE, PIX_FMT_YUV420P9LE); +YUV_NBPS( 9, BE, PIX_FMT_YUV420P9BE); +YUV_NBPS(10, LE, PIX_FMT_YUV420P10LE); +YUV_NBPS(10, BE, PIX_FMT_YUV420P10BE); - if (dstParam[3]) - fillPlane(dstParam[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); +static void bgr24ToY_c(int16_t *dst, const uint8_t *src, + int width, uint32_t *unused) +{ + int i; + for (i=0; i<width; i++) { + int b= src[i*3+0]; + int g= src[i*3+1]; + int r= src[i*3+2]; - return srcSliceH; + dst[i]= ((RY*r + GY*g + BY*b + (32<<(RGB2YUV_SHIFT-1)) + (1<<(RGB2YUV_SHIFT-7)))>>(RGB2YUV_SHIFT-6)); + } } -static int uyvyToYuv422Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dstParam[], int dstStride[]) +static void bgr24ToUV_c(int16_t *dstU, int16_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; - uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY; - uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY; - - uyvytoyuv422(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); + int i; + for (i=0; i<width; i++) { + int b= src1[3*i + 0]; + int g= src1[3*i + 1]; + int r= src1[3*i + 2]; - return srcSliceH; + dstU[i]= (RU*r + GU*g + BU*b + 
(256<<(RGB2YUV_SHIFT-1)) + (1<<(RGB2YUV_SHIFT-7)))>>(RGB2YUV_SHIFT-6); + dstV[i]= (RV*r + GV*g + BV*b + (256<<(RGB2YUV_SHIFT-1)) + (1<<(RGB2YUV_SHIFT-7)))>>(RGB2YUV_SHIFT-6); + } + assert(src1 == src2); } -static void gray8aToPacked32(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette) +static void bgr24ToUV_half_c(int16_t *dstU, int16_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - long i; - for (i=0; i<num_pixels; i++) - ((uint32_t *) dst)[i] = ((const uint32_t *)palette)[src[i<<1]] | (src[(i<<1)+1] << 24); + int i; + for (i=0; i<width; i++) { + int b= src1[6*i + 0] + src1[6*i + 3]; + int g= src1[6*i + 1] + src1[6*i + 4]; + int r= src1[6*i + 2] + src1[6*i + 5]; + + dstU[i]= (RU*r + GU*g + BU*b + (256<<RGB2YUV_SHIFT) + (1<<(RGB2YUV_SHIFT-6)))>>(RGB2YUV_SHIFT-5); + dstV[i]= (RV*r + GV*g + BV*b + (256<<RGB2YUV_SHIFT) + (1<<(RGB2YUV_SHIFT-6)))>>(RGB2YUV_SHIFT-5); + } + assert(src1 == src2); } -static void gray8aToPacked32_1(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette) +static void rgb24ToY_c(int16_t *dst, const uint8_t *src, int width, + uint32_t *unused) { - long i; + int i; + for (i=0; i<width; i++) { + int r= src[i*3+0]; + int g= src[i*3+1]; + int b= src[i*3+2]; - for (i=0; i<num_pixels; i++) - ((uint32_t *) dst)[i] = ((const uint32_t *)palette)[src[i<<1]] | src[(i<<1)+1]; + dst[i]= ((RY*r + GY*g + BY*b + (32<<(RGB2YUV_SHIFT-1)) + (1<<(RGB2YUV_SHIFT-7)))>>(RGB2YUV_SHIFT-6)); + } } -static void gray8aToPacked24(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette) +static void rgb24ToUV_c(int16_t *dstU, int16_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - long i; + int i; + assert(src1==src2); + for (i=0; i<width; i++) { + int r= src1[3*i + 0]; + int g= src1[3*i + 1]; + int b= src1[3*i + 2]; - for (i=0; i<num_pixels; i++) { - //FIXME slow? 
- dst[0]= palette[src[i<<1]*4+0]; - dst[1]= palette[src[i<<1]*4+1]; - dst[2]= palette[src[i<<1]*4+2]; - dst+= 3; + dstU[i]= (RU*r + GU*g + BU*b + (256<<(RGB2YUV_SHIFT-1)) + (1<<(RGB2YUV_SHIFT-7)))>>(RGB2YUV_SHIFT-6); + dstV[i]= (RV*r + GV*g + BV*b + (256<<(RGB2YUV_SHIFT-1)) + (1<<(RGB2YUV_SHIFT-7)))>>(RGB2YUV_SHIFT-6); } } -static int palToRgbWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) +static void rgb24ToUV_half_c(int16_t *dstU, int16_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { - const enum PixelFormat srcFormat= c->srcFormat; - const enum PixelFormat dstFormat= c->dstFormat; - void (*conv)(const uint8_t *src, uint8_t *dst, long num_pixels, - const uint8_t *palette)=NULL; int i; - uint8_t *dstPtr= dst[0] + dstStride[0]*srcSliceY; - const uint8_t *srcPtr= src[0]; + assert(src1==src2); + for (i=0; i<width; i++) { + int r= src1[6*i + 0] + src1[6*i + 3]; + int g= src1[6*i + 1] + src1[6*i + 4]; + int b= src1[6*i + 2] + src1[6*i + 5]; - if (srcFormat == PIX_FMT_GRAY8A) { - switch (dstFormat) { - case PIX_FMT_RGB32 : conv = gray8aToPacked32; break; - case PIX_FMT_BGR32 : conv = gray8aToPacked32; break; - case PIX_FMT_BGR32_1: conv = gray8aToPacked32_1; break; - case PIX_FMT_RGB32_1: conv = gray8aToPacked32_1; break; - case PIX_FMT_RGB24 : conv = gray8aToPacked24; break; - case PIX_FMT_BGR24 : conv = gray8aToPacked24; break; - } - } else if (usePal(srcFormat)) { - switch (dstFormat) { - case PIX_FMT_RGB32 : conv = sws_convertPalette8ToPacked32; break; - case PIX_FMT_BGR32 : conv = sws_convertPalette8ToPacked32; break; - case PIX_FMT_BGR32_1: conv = sws_convertPalette8ToPacked32; break; - case PIX_FMT_RGB32_1: conv = sws_convertPalette8ToPacked32; break; - case PIX_FMT_RGB24 : conv = sws_convertPalette8ToPacked24; break; - case PIX_FMT_BGR24 : conv = sws_convertPalette8ToPacked24; break; - } + dstU[i]= (RU*r + GU*g + BU*b + (256<<RGB2YUV_SHIFT) + (1<<(RGB2YUV_SHIFT-6)))>>(RGB2YUV_SHIFT-5); + dstV[i]= (RV*r + GV*g + BV*b + (256<<RGB2YUV_SHIFT) + (1<<(RGB2YUV_SHIFT-6)))>>(RGB2YUV_SHIFT-5); } +} - if (!conv) - av_log(c, AV_LOG_ERROR, "internal error %s -> %s converter\n", - sws_format_name(srcFormat), sws_format_name(dstFormat)); - else { - for (i=0; i<srcSliceH; i++) { - conv(srcPtr, dstPtr, c->srcW, (uint8_t *) c->pal_rgb); - srcPtr+= srcStride[0]; - dstPtr+= dstStride[0]; + +// bilinear / bicubic scaling +static void hScale_c(int16_t *dst, int dstW, const uint8_t *src, + int srcW, int xInc, + const int16_t *filter, const int16_t *filterPos, + int filterSize) +{ + int i; + for (i=0; i<dstW; i++) { + int j; + int srcPos= filterPos[i]; + int val=0; + for (j=0; j<filterSize; j++) { + val += ((int)src[srcPos + j])*filter[filterSize*i + j]; } + //filter += hFilterSize; + dst[i] = FFMIN(val>>7, (1<<15)-1); // the cubic equation does overflow ... 
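/* Illustrative sketch (editor's, not from the patch): hScale_c() above is a plain
 * fixed-point FIR -- for each output pixel it accumulates filterSize input samples
 * starting at filterPos[i] against 16-bit taps.  Assuming the 8-bit path's taps sum
 * to 1<<14 (which is what the >>7 and the 15-bit clamp suggest), one output sample
 * of that inner loop looks like this: */
static int16_t hscale_one(const uint8_t *src, const int16_t *filter,
                          int srcPos, int filterSize)
{
    int j, val = 0;
    for (j = 0; j < filterSize; j++)
        val += src[srcPos + j] * filter[j];   /* 8-bit sample x fixed-point tap */
    /* roughly a 22-bit sum; >>7 leaves 15 bits, and bicubic taps can overshoot,
     * hence the clamp to (1<<15)-1 (FFMIN comes from libavutil/common.h) */
    return FFMIN(val >> 7, (1 << 15) - 1);
}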
+ //dst[i] = val>>7; } - - return srcSliceH; } -#define isRGBA32(x) ( \ - (x) == PIX_FMT_ARGB \ - || (x) == PIX_FMT_RGBA \ - || (x) == PIX_FMT_BGRA \ - || (x) == PIX_FMT_ABGR \ - ) - -/* {RGB,BGR}{15,16,24,32,32_1} -> {RGB,BGR}{15,16,24,32} */ -static int rgbToRgbWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) +static inline void hScale16_c(int16_t *dst, int dstW, const uint16_t *src, int srcW, int xInc, + const int16_t *filter, const int16_t *filterPos, long filterSize, int shift) { - const enum PixelFormat srcFormat= c->srcFormat; - const enum PixelFormat dstFormat= c->dstFormat; - const int srcBpp= (c->srcFormatBpp + 7) >> 3; - const int dstBpp= (c->dstFormatBpp + 7) >> 3; - const int srcId= c->srcFormatBpp >> 2; /* 1:0, 4:1, 8:2, 15:3, 16:4, 24:6, 32:8 */ - const int dstId= c->dstFormatBpp >> 2; - void (*conv)(const uint8_t *src, uint8_t *dst, long src_size)=NULL; - -#define CONV_IS(src, dst) (srcFormat == PIX_FMT_##src && dstFormat == PIX_FMT_##dst) - - if (isRGBA32(srcFormat) && isRGBA32(dstFormat)) { - if ( CONV_IS(ABGR, RGBA) - || CONV_IS(ARGB, BGRA) - || CONV_IS(BGRA, ARGB) - || CONV_IS(RGBA, ABGR)) conv = shuffle_bytes_3210; - else if (CONV_IS(ABGR, ARGB) - || CONV_IS(ARGB, ABGR)) conv = shuffle_bytes_0321; - else if (CONV_IS(ABGR, BGRA) - || CONV_IS(ARGB, RGBA)) conv = shuffle_bytes_1230; - else if (CONV_IS(BGRA, RGBA) - || CONV_IS(RGBA, BGRA)) conv = shuffle_bytes_2103; - else if (CONV_IS(BGRA, ABGR) - || CONV_IS(RGBA, ARGB)) conv = shuffle_bytes_3012; - } else - /* BGR -> BGR */ - if ( (isBGRinInt(srcFormat) && isBGRinInt(dstFormat)) - || (isRGBinInt(srcFormat) && isRGBinInt(dstFormat))) { - switch(srcId | (dstId<<4)) { - case 0x34: conv= rgb16to15; break; - case 0x36: conv= rgb24to15; break; - case 0x38: conv= rgb32to15; break; - case 0x43: conv= rgb15to16; break; - case 0x46: conv= rgb24to16; break; - case 0x48: conv= rgb32to16; break; - case 0x63: conv= rgb15to24; break; - case 0x64: conv= rgb16to24; break; - case 0x68: conv= rgb32to24; break; - case 0x83: conv= rgb15to32; break; - case 0x84: conv= rgb16to32; break; - case 0x86: conv= rgb24to32; break; - } - } else if ( (isBGRinInt(srcFormat) && isRGBinInt(dstFormat)) - || (isRGBinInt(srcFormat) && isBGRinInt(dstFormat))) { - switch(srcId | (dstId<<4)) { - case 0x33: conv= rgb15tobgr15; break; - case 0x34: conv= rgb16tobgr15; break; - case 0x36: conv= rgb24tobgr15; break; - case 0x38: conv= rgb32tobgr15; break; - case 0x43: conv= rgb15tobgr16; break; - case 0x44: conv= rgb16tobgr16; break; - case 0x46: conv= rgb24tobgr16; break; - case 0x48: conv= rgb32tobgr16; break; - case 0x63: conv= rgb15tobgr24; break; - case 0x64: conv= rgb16tobgr24; break; - case 0x66: conv= rgb24tobgr24; break; - case 0x68: conv= rgb32tobgr24; break; - case 0x83: conv= rgb15tobgr32; break; - case 0x84: conv= rgb16tobgr32; break; - case 0x86: conv= rgb24tobgr32; break; + int i, j; + + for (i=0; i<dstW; i++) { + int srcPos= filterPos[i]; + int val=0; + for (j=0; j<filterSize; j++) { + val += ((int)src[srcPos + j])*filter[filterSize*i + j]; } + dst[i] = FFMIN(val>>shift, (1<<15)-1); // the cubic equation does overflow ... 
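/* Illustrative note (editor's, not from the patch): hScale16_c() is the same filter
 * for 9- to 16-bit planar input; the caller-chosen shift is what keeps every depth
 * in the same intermediate range.  Assuming the taps again sum to 1<<14:
 *     out = sample * (1<<14) >> (depth - 1)  ==  sample << (15 - depth)
 *     depth  8: 255   << 7 == 32640
 *     depth 10: 1023  << 5 == 32736
 *     depth 16: 65535 >> 1 == 32767
 * i.e. every depth lands just below 1<<15, matching the depth_minus1 value that
 * hyscale()/hcscale() pass in as 'shift' further down. */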
} +} - if (!conv) { - av_log(c, AV_LOG_ERROR, "internal error %s -> %s converter\n", - sws_format_name(srcFormat), sws_format_name(dstFormat)); - } else { - const uint8_t *srcPtr= src[0]; - uint8_t *dstPtr= dst[0]; - if ((srcFormat == PIX_FMT_RGB32_1 || srcFormat == PIX_FMT_BGR32_1) && !isRGBA32(dstFormat)) - srcPtr += ALT32_CORR; - - if ((dstFormat == PIX_FMT_RGB32_1 || dstFormat == PIX_FMT_BGR32_1) && !isRGBA32(srcFormat)) - dstPtr += ALT32_CORR; - - if (dstStride[0]*srcBpp == srcStride[0]*dstBpp && srcStride[0] > 0 && !(srcStride[0]%srcBpp)) - conv(srcPtr, dstPtr + dstStride[0]*srcSliceY, srcSliceH*srcStride[0]); - else { - int i; - dstPtr += dstStride[0]*srcSliceY; - - for (i=0; i<srcSliceH; i++) { - conv(srcPtr, dstPtr, c->srcW*srcBpp); - srcPtr+= srcStride[0]; - dstPtr+= dstStride[0]; - } +static inline void hScale16X_c(int16_t *dst, int dstW, const uint16_t *src, int srcW, int xInc, + const int16_t *filter, const int16_t *filterPos, long filterSize, int shift) +{ + int i, j; + for (i=0; i<dstW; i++) { + int srcPos= filterPos[i]; + int val=0; + for (j=0; j<filterSize; j++) { + val += ((int)av_bswap16(src[srcPos + j]))*filter[filterSize*i + j]; } + dst[i] = FFMIN(val>>shift, (1<<15)-1); // the cubic equation does overflow ... } - return srcSliceH; } -static int bgr24ToYv12Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) +//FIXME all pal and rgb srcFormats could do this convertion as well +//FIXME all scalers more complex than bilinear could do half of this transform +static void chrRangeToJpeg_c(int16_t *dstU, int16_t *dstV, int width) { - rgb24toyv12( - src[0], - dst[0]+ srcSliceY *dstStride[0], - dst[1]+(srcSliceY>>1)*dstStride[1], - dst[2]+(srcSliceY>>1)*dstStride[2], - c->srcW, srcSliceH, - dstStride[0], dstStride[1], srcStride[0]); - if (dst[3]) - fillPlane(dst[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); - return srcSliceH; + int i; + for (i = 0; i < width; i++) { + dstU[i] = (FFMIN(dstU[i],30775)*4663 - 9289992)>>12; //-264 + dstV[i] = (FFMIN(dstV[i],30775)*4663 - 9289992)>>12; //-264 + } } - -static int yvu9ToYv12Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) +static void chrRangeFromJpeg_c(int16_t *dstU, int16_t *dstV, int width) +{ + int i; + for (i = 0; i < width; i++) { + dstU[i] = (dstU[i]*1799 + 4081085)>>11; //1469 + dstV[i] = (dstV[i]*1799 + 4081085)>>11; //1469 + } +} +static void lumRangeToJpeg_c(int16_t *dst, int width) +{ + int i; + for (i = 0; i < width; i++) + dst[i] = (FFMIN(dst[i],30189)*19077 - 39057361)>>14; +} +static void lumRangeFromJpeg_c(int16_t *dst, int width) { - copyPlane(src[0], srcStride[0], srcSliceY, srcSliceH, c->srcW, - dst[0], dstStride[0]); + int i; + for (i = 0; i < width; i++) + dst[i] = (dst[i]*14071 + 33561947)>>14; +} - planar2x(src[1], dst[1] + dstStride[1]*(srcSliceY >> 1), c->chrSrcW, - srcSliceH >> 2, srcStride[1], dstStride[1]); - planar2x(src[2], dst[2] + dstStride[2]*(srcSliceY >> 1), c->chrSrcW, - srcSliceH >> 2, srcStride[2], dstStride[2]); - if (dst[3]) - fillPlane(dst[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); - return srcSliceH; +static void hyscale_fast_c(SwsContext *c, int16_t *dst, int dstWidth, + const uint8_t *src, int srcW, int xInc) +{ + int i; + unsigned int xpos=0; + for (i=0;i<dstWidth;i++) { + register unsigned int xx=xpos>>16; + register unsigned int xalpha=(xpos&0xFFFF)>>9; + dst[i]= (src[xx]<<7) + (src[xx+1] - src[xx])*xalpha; + 
xpos+=xInc; + } + for (i=dstWidth-1; (i*xInc)>>16 >=srcW-1; i--) + dst[i] = src[srcW-1]*128; } -/* unscaled copy like stuff (assumes nearly identical formats) */ -static int packedCopyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) +// *** horizontal scale Y line to temp buffer +static av_always_inline void hyscale(SwsContext *c, uint16_t *dst, int dstWidth, + const uint8_t *src, int srcW, int xInc, + const int16_t *hLumFilter, + const int16_t *hLumFilterPos, int hLumFilterSize, + uint8_t *formatConvBuffer, + uint32_t *pal, int isAlpha) { - if (dstStride[0]==srcStride[0] && srcStride[0] > 0) - memcpy(dst[0] + dstStride[0]*srcSliceY, src[0], srcSliceH*dstStride[0]); - else { - int i; - const uint8_t *srcPtr= src[0]; - uint8_t *dstPtr= dst[0] + dstStride[0]*srcSliceY; - int length=0; + void (*toYV12)(uint8_t *, const uint8_t *, int, uint32_t *) = isAlpha ? c->alpToYV12 : c->lumToYV12; + void (*convertRange)(int16_t *, int) = isAlpha ? NULL : c->lumConvertRange; - /* universal length finder */ - while(length+c->srcW <= FFABS(dstStride[0]) - && length+c->srcW <= FFABS(srcStride[0])) length+= c->srcW; - assert(length!=0); + if (toYV12) { + toYV12(formatConvBuffer, src, srcW, pal); + src= formatConvBuffer; + } - for (i=0; i<srcSliceH; i++) { - memcpy(dstPtr, srcPtr, length); - srcPtr+= srcStride[0]; - dstPtr+= dstStride[0]; - } + if (c->hScale16) { + int shift= isAnyRGB(c->srcFormat) || c->srcFormat==PIX_FMT_PAL8 ? 13 : av_pix_fmt_descriptors[c->srcFormat].comp[0].depth_minus1; + c->hScale16(dst, dstWidth, (const uint16_t*)src, srcW, xInc, hLumFilter, hLumFilterPos, hLumFilterSize, shift); + } else if (!c->hyscale_fast) { + c->hScale(dst, dstWidth, src, srcW, xInc, hLumFilter, hLumFilterPos, hLumFilterSize); + } else { // fast bilinear upscale / crap downscale + c->hyscale_fast(c, dst, dstWidth, src, srcW, xInc); } - return srcSliceH; -} - -#define DITHER_COPY(dst, dstStride, src, srcStride, bswap, dbswap)\ - uint16_t scale= dither_scale[dst_depth-1][src_depth-1];\ - int shift= src_depth-dst_depth + dither_scale[src_depth-2][dst_depth-1];\ - for (i = 0; i < height; i++) {\ - uint8_t *dither= dithers[src_depth-9][i&7];\ - for (j = 0; j < length-7; j+=8){\ - dst[j+0] = dbswap((bswap(src[j+0]) + dither[0])*scale>>shift);\ - dst[j+1] = dbswap((bswap(src[j+1]) + dither[1])*scale>>shift);\ - dst[j+2] = dbswap((bswap(src[j+2]) + dither[2])*scale>>shift);\ - dst[j+3] = dbswap((bswap(src[j+3]) + dither[3])*scale>>shift);\ - dst[j+4] = dbswap((bswap(src[j+4]) + dither[4])*scale>>shift);\ - dst[j+5] = dbswap((bswap(src[j+5]) + dither[5])*scale>>shift);\ - dst[j+6] = dbswap((bswap(src[j+6]) + dither[6])*scale>>shift);\ - dst[j+7] = dbswap((bswap(src[j+7]) + dither[7])*scale>>shift);\ - }\ - for (; j < length; j++)\ - dst[j] = dbswap((bswap(src[j]) + dither[j&7])*scale>>shift);\ - dst += dstStride;\ - src += srcStride;\ - } - - -static int planarCopyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) -{ - int plane, i, j; - for (plane=0; plane<4; plane++) { - int length= (plane==0 || plane==3) ? c->srcW : -((-c->srcW )>>c->chrDstHSubSample); - int y= (plane==0 || plane==3) ? srcSliceY: -((-srcSliceY)>>c->chrDstVSubSample); - int height= (plane==0 || plane==3) ? 
srcSliceH: -((-srcSliceH)>>c->chrDstVSubSample); - const uint8_t *srcPtr= src[plane]; - uint8_t *dstPtr= dst[plane] + dstStride[plane]*y; - - if (!dst[plane]) continue; - // ignore palette for GRAY8 - if (plane == 1 && !dst[2]) continue; - if (!src[plane] || (plane == 1 && !src[2])) { - if(is16BPS(c->dstFormat)) - length*=2; - fillPlane(dst[plane], dstStride[plane], length, height, y, (plane==3) ? 255 : 128); - } else { - if(isNBPS(c->srcFormat) || isNBPS(c->dstFormat) - || (is16BPS(c->srcFormat) != is16BPS(c->dstFormat)) - ) { - const int src_depth = av_pix_fmt_descriptors[c->srcFormat].comp[plane].depth_minus1+1; - const int dst_depth = av_pix_fmt_descriptors[c->dstFormat].comp[plane].depth_minus1+1; - const uint16_t *srcPtr2 = (const uint16_t*)srcPtr; - uint16_t *dstPtr2 = (uint16_t*)dstPtr; - - if (dst_depth == 8) { - if(isBE(c->srcFormat) == HAVE_BIGENDIAN){ - DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, , ) - } else { - DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, av_bswap16, ) - } - } else if (src_depth == 8) { - for (i = 0; i < height; i++) { - if(isBE(c->dstFormat)){ - for (j = 0; j < length; j++) - AV_WB16(&dstPtr2[j], (srcPtr[j]<<(dst_depth-8)) | - (srcPtr[j]>>(2*8-dst_depth))); - } else { - for (j = 0; j < length; j++) - AV_WL16(&dstPtr2[j], (srcPtr[j]<<(dst_depth-8)) | - (srcPtr[j]>>(2*8-dst_depth))); - } - dstPtr2 += dstStride[plane]/2; - srcPtr += srcStride[plane]; - } - } else if (src_depth <= dst_depth) { - for (i = 0; i < height; i++) { -#define COPY_UP(r,w) \ - for (j = 0; j < length; j++){ \ - unsigned int v= r(&srcPtr2[j]);\ - w(&dstPtr2[j], (v<<(dst_depth-src_depth)) | \ - (v>>(2*src_depth-dst_depth)));\ - } - if(isBE(c->srcFormat)){ - if(isBE(c->dstFormat)){ - COPY_UP(AV_RB16, AV_WB16) - } else { - COPY_UP(AV_RB16, AV_WL16) - } - } else { - if(isBE(c->dstFormat)){ - COPY_UP(AV_RL16, AV_WB16) - } else { - COPY_UP(AV_RL16, AV_WL16) - } - } - dstPtr2 += dstStride[plane]/2; - srcPtr2 += srcStride[plane]/2; - } - } else { - if(isBE(c->srcFormat) == HAVE_BIGENDIAN){ - if(isBE(c->dstFormat) == HAVE_BIGENDIAN){ - DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , ) - } else { - DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16) - } - }else{ - if(isBE(c->dstFormat) == HAVE_BIGENDIAN){ - DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, ) - } else { - DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16) - } - } - } - } else if(is16BPS(c->srcFormat) && is16BPS(c->dstFormat) - && isBE(c->srcFormat) != isBE(c->dstFormat)) { - - for (i=0; i<height; i++) { - for (j=0; j<length; j++) - ((uint16_t*)dstPtr)[j] = av_bswap16(((const uint16_t*)srcPtr)[j]); - srcPtr+= srcStride[plane]; - dstPtr+= dstStride[plane]; - } - } else if (dstStride[plane] == srcStride[plane] && - srcStride[plane] > 0 && srcStride[plane] == length) { - memcpy(dst[plane] + dstStride[plane]*y, src[plane], - height*dstStride[plane]); - } else { - if(is16BPS(c->srcFormat) && is16BPS(c->dstFormat)) - length*=2; - for (i=0; i<height; i++) { - memcpy(dstPtr, srcPtr, length); - srcPtr+= srcStride[plane]; - dstPtr+= dstStride[plane]; - } - } - } + + if (convertRange) + convertRange(dst, dstWidth); +} + +static void hcscale_fast_c(SwsContext *c, int16_t *dst1, int16_t *dst2, + int dstWidth, const uint8_t *src1, + const uint8_t *src2, int srcW, int xInc) +{ + int i; + unsigned int xpos=0; + for (i=0;i<dstWidth;i++) { + register unsigned int xx=xpos>>16; + 
register unsigned int xalpha=(xpos&0xFFFF)>>9; + dst1[i]=(src1[xx]*(xalpha^127)+src1[xx+1]*xalpha); + dst2[i]=(src2[xx]*(xalpha^127)+src2[xx+1]*xalpha); + xpos+=xInc; + } + for (i=dstWidth-1; (i*xInc)>>16 >=srcW-1; i--) { + dst1[i] = src1[srcW-1]*128; + dst2[i] = src2[srcW-1]*128; } - return srcSliceH; } -int ff_hardcodedcpuflags(void) +static av_always_inline void hcscale(SwsContext *c, uint16_t *dst1, uint16_t *dst2, int dstWidth, + const uint8_t *src1, const uint8_t *src2, + int srcW, int xInc, const int16_t *hChrFilter, + const int16_t *hChrFilterPos, int hChrFilterSize, + uint8_t *formatConvBuffer, uint32_t *pal) { - int flags = 0; -#if COMPILE_TEMPLATE_MMX2 - flags |= SWS_CPU_CAPS_MMX|SWS_CPU_CAPS_MMX2; -#elif COMPILE_TEMPLATE_AMD3DNOW - flags |= SWS_CPU_CAPS_MMX|SWS_CPU_CAPS_3DNOW; -#elif COMPILE_TEMPLATE_MMX - flags |= SWS_CPU_CAPS_MMX; -#elif COMPILE_TEMPLATE_ALTIVEC - flags |= SWS_CPU_CAPS_ALTIVEC; -#elif ARCH_BFIN - flags |= SWS_CPU_CAPS_BFIN; -#endif - return flags; -} - -void ff_get_unscaled_swscale(SwsContext *c) -{ - const enum PixelFormat srcFormat = c->srcFormat; - const enum PixelFormat dstFormat = c->dstFormat; - const int flags = c->flags; - const int dstH = c->dstH; - int needsDither; - - needsDither= isAnyRGB(dstFormat) - && c->dstFormatBpp < 24 - && (c->dstFormatBpp < c->srcFormatBpp || (!isAnyRGB(srcFormat))); - - /* yv12_to_nv12 */ - if ((srcFormat == PIX_FMT_YUV420P || srcFormat == PIX_FMT_YUVA420P) && (dstFormat == PIX_FMT_NV12 || dstFormat == PIX_FMT_NV21)) { - c->swScale= planarToNv12Wrapper; - } - /* yuv2bgr */ - if ((srcFormat==PIX_FMT_YUV420P || srcFormat==PIX_FMT_YUV422P || srcFormat==PIX_FMT_YUVA420P) && isAnyRGB(dstFormat) - && !(flags & SWS_ACCURATE_RND) && !(dstH&1)) { - c->swScale= ff_yuv2rgb_get_func_ptr(c); - } - - if (srcFormat==PIX_FMT_YUV410P && (dstFormat==PIX_FMT_YUV420P || dstFormat==PIX_FMT_YUVA420P) && !(flags & SWS_BITEXACT)) { - c->swScale= yvu9ToYv12Wrapper; - } - - /* bgr24toYV12 */ - if (srcFormat==PIX_FMT_BGR24 && (dstFormat==PIX_FMT_YUV420P || dstFormat==PIX_FMT_YUVA420P) && !(flags & SWS_ACCURATE_RND)) - c->swScale= bgr24ToYv12Wrapper; - - /* RGB/BGR -> RGB/BGR (no dither needed forms) */ - if ( isAnyRGB(srcFormat) - && isAnyRGB(dstFormat) - && srcFormat != PIX_FMT_BGR8 && dstFormat != PIX_FMT_BGR8 - && srcFormat != PIX_FMT_RGB8 && dstFormat != PIX_FMT_RGB8 - && srcFormat != PIX_FMT_BGR4 && dstFormat != PIX_FMT_BGR4 - && srcFormat != PIX_FMT_RGB4 && dstFormat != PIX_FMT_RGB4 - && srcFormat != PIX_FMT_BGR4_BYTE && dstFormat != PIX_FMT_BGR4_BYTE - && srcFormat != PIX_FMT_RGB4_BYTE && dstFormat != PIX_FMT_RGB4_BYTE - && srcFormat != PIX_FMT_MONOBLACK && dstFormat != PIX_FMT_MONOBLACK - && srcFormat != PIX_FMT_MONOWHITE && dstFormat != PIX_FMT_MONOWHITE - && srcFormat != PIX_FMT_RGB48LE && dstFormat != PIX_FMT_RGB48LE - && srcFormat != PIX_FMT_RGB48BE && dstFormat != PIX_FMT_RGB48BE - && srcFormat != PIX_FMT_BGR48LE && dstFormat != PIX_FMT_BGR48LE - && srcFormat != PIX_FMT_BGR48BE && dstFormat != PIX_FMT_BGR48BE - && (!needsDither || (c->flags&(SWS_FAST_BILINEAR|SWS_POINT)))) - c->swScale= rgbToRgbWrapper; - - if ((usePal(srcFormat) && ( - dstFormat == PIX_FMT_RGB32 || - dstFormat == PIX_FMT_RGB32_1 || - dstFormat == PIX_FMT_RGB24 || - dstFormat == PIX_FMT_BGR32 || - dstFormat == PIX_FMT_BGR32_1 || - dstFormat == PIX_FMT_BGR24))) - c->swScale= palToRgbWrapper; - - if (srcFormat == PIX_FMT_YUV422P) { - if (dstFormat == PIX_FMT_YUYV422) - c->swScale= yuv422pToYuy2Wrapper; - else if (dstFormat == PIX_FMT_UYVY422) - c->swScale= 
yuv422pToUyvyWrapper; - } - - /* LQ converters if -sws 0 or -sws 4*/ - if (c->flags&(SWS_FAST_BILINEAR|SWS_POINT)) { - /* yv12_to_yuy2 */ - if (srcFormat == PIX_FMT_YUV420P || srcFormat == PIX_FMT_YUVA420P) { - if (dstFormat == PIX_FMT_YUYV422) - c->swScale= planarToYuy2Wrapper; - else if (dstFormat == PIX_FMT_UYVY422) - c->swScale= planarToUyvyWrapper; - } + if (c->chrToYV12) { + uint8_t *buf2 = formatConvBuffer + FFALIGN(srcW*2+78, 16); + c->chrToYV12(formatConvBuffer, buf2, src1, src2, srcW, pal); + src1= formatConvBuffer; + src2= buf2; } - if(srcFormat == PIX_FMT_YUYV422 && (dstFormat == PIX_FMT_YUV420P || dstFormat == PIX_FMT_YUVA420P)) - c->swScale= yuyvToYuv420Wrapper; - if(srcFormat == PIX_FMT_UYVY422 && (dstFormat == PIX_FMT_YUV420P || dstFormat == PIX_FMT_YUVA420P)) - c->swScale= uyvyToYuv420Wrapper; - if(srcFormat == PIX_FMT_YUYV422 && dstFormat == PIX_FMT_YUV422P) - c->swScale= yuyvToYuv422Wrapper; - if(srcFormat == PIX_FMT_UYVY422 && dstFormat == PIX_FMT_YUV422P) - c->swScale= uyvyToYuv422Wrapper; - -#if COMPILE_ALTIVEC - if ((c->flags & SWS_CPU_CAPS_ALTIVEC) && - !(c->flags & SWS_BITEXACT) && - srcFormat == PIX_FMT_YUV420P) { - // unscaled YV12 -> packed YUV, we want speed - if (dstFormat == PIX_FMT_YUYV422) - c->swScale= yv12toyuy2_unscaled_altivec; - else if (dstFormat == PIX_FMT_UYVY422) - c->swScale= yv12touyvy_unscaled_altivec; + + if (c->hScale16) { + int shift= isAnyRGB(c->srcFormat) || c->srcFormat==PIX_FMT_PAL8 ? 13 : av_pix_fmt_descriptors[c->srcFormat].comp[0].depth_minus1; + c->hScale16(dst1, dstWidth, (const uint16_t*)src1, srcW, xInc, hChrFilter, hChrFilterPos, hChrFilterSize, shift); + c->hScale16(dst2, dstWidth, (const uint16_t*)src2, srcW, xInc, hChrFilter, hChrFilterPos, hChrFilterSize, shift); + } else if (!c->hcscale_fast) { + c->hScale(dst1, dstWidth, src1, srcW, xInc, hChrFilter, hChrFilterPos, hChrFilterSize); + c->hScale(dst2, dstWidth, src2, srcW, xInc, hChrFilter, hChrFilterPos, hChrFilterSize); + } else { // fast bilinear upscale / crap downscale + c->hcscale_fast(c, dst1, dst2, dstWidth, src1, src2, srcW, xInc); } -#endif - /* simple copy */ - if ( srcFormat == dstFormat - || (srcFormat == PIX_FMT_YUVA420P && dstFormat == PIX_FMT_YUV420P) - || (srcFormat == PIX_FMT_YUV420P && dstFormat == PIX_FMT_YUVA420P) - || (isPlanarYUV(srcFormat) && isGray(dstFormat)) - || (isPlanarYUV(dstFormat) && isGray(srcFormat)) - || (isGray(dstFormat) && isGray(srcFormat)) - || (isPlanarYUV(srcFormat) && isPlanarYUV(dstFormat) - && c->chrDstHSubSample == c->chrSrcHSubSample - && c->chrDstVSubSample == c->chrSrcVSubSample - && dstFormat != PIX_FMT_NV12 && dstFormat != PIX_FMT_NV21 - && srcFormat != PIX_FMT_NV12 && srcFormat != PIX_FMT_NV21)) - { - if (isPacked(c->srcFormat)) - c->swScale= packedCopyWrapper; - else /* Planar YUV or gray */ - c->swScale= planarCopyWrapper; - } -#if ARCH_BFIN - if (flags & SWS_CPU_CAPS_BFIN) - ff_bfin_get_unscaled_swscale (c); -#endif + if (c->chrConvertRange) + c->chrConvertRange(dst1, dst2, dstWidth); } -static void reset_ptr(const uint8_t* src[], int format) +static av_always_inline void +find_c_packed_planar_out_funcs(SwsContext *c, + yuv2planar1_fn *yuv2yuv1, yuv2planarX_fn *yuv2yuvX, + yuv2packed1_fn *yuv2packed1, yuv2packed2_fn *yuv2packed2, + yuv2packedX_fn *yuv2packedX) { - if(!isALPHA(format)) - src[3]=NULL; - if(!isPlanarYUV(format)) { - src[3]=src[2]=NULL; - - if (!usePal(format)) - src[1]= NULL; + enum PixelFormat dstFormat = c->dstFormat; + + if (dstFormat == PIX_FMT_NV12 || dstFormat == PIX_FMT_NV21) { + *yuv2yuvX = 
yuv2nv12X_c; + } else if (is16BPS(dstFormat)) { + *yuv2yuvX = isBE(dstFormat) ? yuv2yuvX16BE_c : yuv2yuvX16LE_c; + } else if (is9_OR_10BPS(dstFormat)) { + if (av_pix_fmt_descriptors[dstFormat].comp[0].depth_minus1 == 8) { + *yuv2yuvX = isBE(dstFormat) ? yuv2yuvX9BE_c : yuv2yuvX9LE_c; + } else { + *yuv2yuvX = isBE(dstFormat) ? yuv2yuvX10BE_c : yuv2yuvX10LE_c; + } + } else { + *yuv2yuv1 = yuv2yuv1_c; + *yuv2yuvX = yuv2yuvX_c; + } + if(c->flags & SWS_FULL_CHR_H_INT) { + *yuv2packedX = yuv2rgbX_c_full; + } else { + switch (dstFormat) { + case PIX_FMT_GRAY16BE: + *yuv2packed1 = yuv2gray16BE_1_c; + *yuv2packed2 = yuv2gray16BE_2_c; + *yuv2packedX = yuv2gray16BE_X_c; + break; + case PIX_FMT_GRAY16LE: + *yuv2packed1 = yuv2gray16LE_1_c; + *yuv2packed2 = yuv2gray16LE_2_c; + *yuv2packedX = yuv2gray16LE_X_c; + break; + case PIX_FMT_MONOWHITE: + *yuv2packed1 = yuv2monowhite_1_c; + *yuv2packed2 = yuv2monowhite_2_c; + *yuv2packedX = yuv2monowhite_X_c; + break; + case PIX_FMT_MONOBLACK: + *yuv2packed1 = yuv2monoblack_1_c; + *yuv2packed2 = yuv2monoblack_2_c; + *yuv2packedX = yuv2monoblack_X_c; + break; + case PIX_FMT_YUYV422: + *yuv2packed1 = yuv2yuyv422_1_c; + *yuv2packed2 = yuv2yuyv422_2_c; + *yuv2packedX = yuv2yuyv422_X_c; + break; + case PIX_FMT_UYVY422: + *yuv2packed1 = yuv2uyvy422_1_c; + *yuv2packed2 = yuv2uyvy422_2_c; + *yuv2packedX = yuv2uyvy422_X_c; + break; + case PIX_FMT_RGB48LE: + //*yuv2packed1 = yuv2rgb48le_1_c; + //*yuv2packed2 = yuv2rgb48le_2_c; + //*yuv2packedX = yuv2rgb48le_X_c; + //break; + case PIX_FMT_RGB48BE: + *yuv2packed1 = yuv2rgb48be_1_c; + *yuv2packed2 = yuv2rgb48be_2_c; + *yuv2packedX = yuv2rgb48be_X_c; + break; + case PIX_FMT_BGR48LE: + //*yuv2packed1 = yuv2bgr48le_1_c; + //*yuv2packed2 = yuv2bgr48le_2_c; + //*yuv2packedX = yuv2bgr48le_X_c; + //break; + case PIX_FMT_BGR48BE: + *yuv2packed1 = yuv2bgr48be_1_c; + *yuv2packed2 = yuv2bgr48be_2_c; + *yuv2packedX = yuv2bgr48be_X_c; + break; + default: + *yuv2packed1 = yuv2packed1_c; + *yuv2packed2 = yuv2packed2_c; + *yuv2packedX = yuv2packedX_c; + break; + } } } -static int check_image_pointers(uint8_t *data[4], enum PixelFormat pix_fmt, - const int linesizes[4]) +#define DEBUG_SWSCALE_BUFFERS 0 +#define DEBUG_BUFFERS(...) if (DEBUG_SWSCALE_BUFFERS) av_log(c, AV_LOG_DEBUG, __VA_ARGS__) + +static int swScale(SwsContext *c, const uint8_t* src[], + int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) { - const AVPixFmtDescriptor *desc = &av_pix_fmt_descriptors[pix_fmt]; - int i; + /* load a few things into local vars to make the code more readable? 
and faster */ + const int srcW= c->srcW; + const int dstW= c->dstW; + const int dstH= c->dstH; + const int chrDstW= c->chrDstW; + const int chrSrcW= c->chrSrcW; + const int lumXInc= c->lumXInc; + const int chrXInc= c->chrXInc; + const enum PixelFormat dstFormat= c->dstFormat; + const int flags= c->flags; + int16_t *vLumFilterPos= c->vLumFilterPos; + int16_t *vChrFilterPos= c->vChrFilterPos; + int16_t *hLumFilterPos= c->hLumFilterPos; + int16_t *hChrFilterPos= c->hChrFilterPos; + int16_t *vLumFilter= c->vLumFilter; + int16_t *vChrFilter= c->vChrFilter; + int16_t *hLumFilter= c->hLumFilter; + int16_t *hChrFilter= c->hChrFilter; + int32_t *lumMmxFilter= c->lumMmxFilter; + int32_t *chrMmxFilter= c->chrMmxFilter; + int32_t av_unused *alpMmxFilter= c->alpMmxFilter; + const int vLumFilterSize= c->vLumFilterSize; + const int vChrFilterSize= c->vChrFilterSize; + const int hLumFilterSize= c->hLumFilterSize; + const int hChrFilterSize= c->hChrFilterSize; + int16_t **lumPixBuf= c->lumPixBuf; + int16_t **chrUPixBuf= c->chrUPixBuf; + int16_t **chrVPixBuf= c->chrVPixBuf; + int16_t **alpPixBuf= c->alpPixBuf; + const int vLumBufSize= c->vLumBufSize; + const int vChrBufSize= c->vChrBufSize; + uint8_t *formatConvBuffer= c->formatConvBuffer; + const int chrSrcSliceY= srcSliceY >> c->chrSrcVSubSample; + const int chrSrcSliceH= -((-srcSliceH) >> c->chrSrcVSubSample); + int lastDstY; + uint32_t *pal=c->pal_yuv; + int should_dither= isNBPS(c->srcFormat) || is16BPS(c->srcFormat); + yuv2planar1_fn yuv2yuv1 = c->yuv2yuv1; + yuv2planarX_fn yuv2yuvX = c->yuv2yuvX; + yuv2packed1_fn yuv2packed1 = c->yuv2packed1; + yuv2packed2_fn yuv2packed2 = c->yuv2packed2; + yuv2packedX_fn yuv2packedX = c->yuv2packedX; + + /* vars which will change and which we need to store back in the context */ + int dstY= c->dstY; + int lumBufIndex= c->lumBufIndex; + int chrBufIndex= c->chrBufIndex; + int lastInLumBuf= c->lastInLumBuf; + int lastInChrBuf= c->lastInChrBuf; + + if (isPacked(c->srcFormat)) { + src[0]= + src[1]= + src[2]= + src[3]= src[0]; + srcStride[0]= + srcStride[1]= + srcStride[2]= + srcStride[3]= srcStride[0]; + } + srcStride[1]<<= c->vChrDrop; + srcStride[2]<<= c->vChrDrop; + + DEBUG_BUFFERS("swScale() %p[%d] %p[%d] %p[%d] %p[%d] -> %p[%d] %p[%d] %p[%d] %p[%d]\n", + src[0], srcStride[0], src[1], srcStride[1], src[2], srcStride[2], src[3], srcStride[3], + dst[0], dstStride[0], dst[1], dstStride[1], dst[2], dstStride[2], dst[3], dstStride[3]); + DEBUG_BUFFERS("srcSliceY: %d srcSliceH: %d dstY: %d dstH: %d\n", + srcSliceY, srcSliceH, dstY, dstH); + DEBUG_BUFFERS("vLumFilterSize: %d vLumBufSize: %d vChrFilterSize: %d vChrBufSize: %d\n", + vLumFilterSize, vLumBufSize, vChrFilterSize, vChrBufSize); + + if (dstStride[0]%8 !=0 || dstStride[1]%8 !=0 || dstStride[2]%8 !=0 || dstStride[3]%8 != 0) { + static int warnedAlready=0; //FIXME move this into the context perhaps + if (flags & SWS_PRINT_INFO && !warnedAlready) { + av_log(c, AV_LOG_WARNING, "Warning: dstStride is not aligned!\n" + " ->cannot do aligned memory accesses anymore\n"); + warnedAlready=1; + } + } - for (i = 0; i < 4; i++) { - int plane = desc->comp[i].plane; - if (!data[plane] || !linesizes[plane]) - return 0; + /* Note the user might start scaling the picture in the middle so this + will not get executed. This is not really intended but works + currently, so people might do it. 
*/ + if (srcSliceY ==0) { + lumBufIndex=-1; + chrBufIndex=-1; + dstY=0; + lastInLumBuf= -1; + lastInChrBuf= -1; } - return 1; -} + lastDstY= dstY; + + for (;dstY < dstH; dstY++) { + unsigned char *dest =dst[0]+dstStride[0]*dstY; + const int chrDstY= dstY>>c->chrDstVSubSample; + unsigned char *uDest=dst[1]+dstStride[1]*chrDstY; + unsigned char *vDest=dst[2]+dstStride[2]*chrDstY; + unsigned char *aDest=(CONFIG_SWSCALE_ALPHA && alpPixBuf) ? dst[3]+dstStride[3]*dstY : NULL; + const uint8_t *lumDither= should_dither ? dithers[7][dstY &7] : flat64; + const uint8_t *chrDither= should_dither ? dithers[7][chrDstY&7] : flat64; + + const int firstLumSrcY= vLumFilterPos[dstY]; //First line needed as input + const int firstLumSrcY2= vLumFilterPos[FFMIN(dstY | ((1<<c->chrDstVSubSample) - 1), dstH-1)]; + const int firstChrSrcY= vChrFilterPos[chrDstY]; //First line needed as input + int lastLumSrcY= firstLumSrcY + vLumFilterSize -1; // Last line needed as input + int lastLumSrcY2=firstLumSrcY2+ vLumFilterSize -1; // Last line needed as input + int lastChrSrcY= firstChrSrcY + vChrFilterSize -1; // Last line needed as input + int enough_lines; + + //handle holes (FAST_BILINEAR & weird filters) + if (firstLumSrcY > lastInLumBuf) lastInLumBuf= firstLumSrcY-1; + if (firstChrSrcY > lastInChrBuf) lastInChrBuf= firstChrSrcY-1; + assert(firstLumSrcY >= lastInLumBuf - vLumBufSize + 1); + assert(firstChrSrcY >= lastInChrBuf - vChrBufSize + 1); + + DEBUG_BUFFERS("dstY: %d\n", dstY); + DEBUG_BUFFERS("\tfirstLumSrcY: %d lastLumSrcY: %d lastInLumBuf: %d\n", + firstLumSrcY, lastLumSrcY, lastInLumBuf); + DEBUG_BUFFERS("\tfirstChrSrcY: %d lastChrSrcY: %d lastInChrBuf: %d\n", + firstChrSrcY, lastChrSrcY, lastInChrBuf); + + // Do we have enough lines in this slice to output the dstY line + enough_lines = lastLumSrcY2 < srcSliceY + srcSliceH && lastChrSrcY < -((-srcSliceY - srcSliceH)>>c->chrSrcVSubSample); + + if (!enough_lines) { + lastLumSrcY = srcSliceY + srcSliceH - 1; + lastChrSrcY = chrSrcSliceY + chrSrcSliceH - 1; + DEBUG_BUFFERS("buffering slice: lastLumSrcY %d lastChrSrcY %d\n", + lastLumSrcY, lastChrSrcY); + } -/** - * swscale wrapper, so we don't need to export the SwsContext. - * Assumes planar YUV to be in YUV order instead of YVU. 
- */ -int sws_scale(SwsContext *c, const uint8_t* const src[], const int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* const dst[], const int dstStride[]) -{ - int i; - const uint8_t* src2[4]= {src[0], src[1], src[2], src[3]}; - uint8_t* dst2[4]= {dst[0], dst[1], dst[2], dst[3]}; - - // do not mess up sliceDir if we have a "trailing" 0-size slice - if (srcSliceH == 0) - return 0; - - if (!check_image_pointers(src, c->srcFormat, srcStride)) { - av_log(c, AV_LOG_ERROR, "bad src image pointers\n"); - return 0; - } - if (!check_image_pointers(dst, c->dstFormat, dstStride)) { - av_log(c, AV_LOG_ERROR, "bad dst image pointers\n"); - return 0; - } - - if (c->sliceDir == 0 && srcSliceY != 0 && srcSliceY + srcSliceH != c->srcH) { - av_log(c, AV_LOG_ERROR, "Slices start in the middle!\n"); - return 0; - } - if (c->sliceDir == 0) { - if (srcSliceY == 0) c->sliceDir = 1; else c->sliceDir = -1; - } - - if (usePal(c->srcFormat)) { - for (i=0; i<256; i++) { - int p, r, g, b, y, u, v, a = 0xff; - if(c->srcFormat == PIX_FMT_PAL8) { - p=((const uint32_t*)(src[1]))[i]; - a= (p>>24)&0xFF; - r= (p>>16)&0xFF; - g= (p>> 8)&0xFF; - b= p &0xFF; - } else if(c->srcFormat == PIX_FMT_RGB8) { - r= (i>>5 )*36; - g= ((i>>2)&7)*36; - b= (i&3 )*85; - } else if(c->srcFormat == PIX_FMT_BGR8) { - b= (i>>6 )*85; - g= ((i>>3)&7)*36; - r= (i&7 )*36; - } else if(c->srcFormat == PIX_FMT_RGB4_BYTE) { - r= (i>>3 )*255; - g= ((i>>1)&3)*85; - b= (i&1 )*255; - } else if(c->srcFormat == PIX_FMT_GRAY8 || c->srcFormat == PIX_FMT_GRAY8A) { - r = g = b = i; - } else { - assert(c->srcFormat == PIX_FMT_BGR4_BYTE); - b= (i>>3 )*255; - g= ((i>>1)&3)*85; - r= (i&1 )*255; - } - y= av_clip_uint8((RY*r + GY*g + BY*b + ( 33<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); - u= av_clip_uint8((RU*r + GU*g + BU*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); - v= av_clip_uint8((RV*r + GV*g + BV*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); - c->pal_yuv[i]= y + (u<<8) + (v<<16) + (a<<24); - - switch(c->dstFormat) { - case PIX_FMT_BGR32: -#if !HAVE_BIGENDIAN - case PIX_FMT_RGB24: -#endif - c->pal_rgb[i]= r + (g<<8) + (b<<16) + (a<<24); - break; - case PIX_FMT_BGR32_1: -#if HAVE_BIGENDIAN - case PIX_FMT_BGR24: -#endif - c->pal_rgb[i]= a + (r<<8) + (g<<16) + (b<<24); - break; - case PIX_FMT_RGB32_1: -#if HAVE_BIGENDIAN - case PIX_FMT_RGB24: -#endif - c->pal_rgb[i]= a + (b<<8) + (g<<16) + (r<<24); - break; - case PIX_FMT_RGB32: -#if !HAVE_BIGENDIAN - case PIX_FMT_BGR24: + //Do horizontal scaling + while(lastInLumBuf < lastLumSrcY) { + const uint8_t *src1= src[0]+(lastInLumBuf + 1 - srcSliceY)*srcStride[0]; + const uint8_t *src2= src[3]+(lastInLumBuf + 1 - srcSliceY)*srcStride[3]; + lumBufIndex++; + assert(lumBufIndex < 2*vLumBufSize); + assert(lastInLumBuf + 1 - srcSliceY < srcSliceH); + assert(lastInLumBuf + 1 - srcSliceY >= 0); + hyscale(c, lumPixBuf[ lumBufIndex ], dstW, src1, srcW, lumXInc, + hLumFilter, hLumFilterPos, hLumFilterSize, + formatConvBuffer, + pal, 0); + if (CONFIG_SWSCALE_ALPHA && alpPixBuf) + hyscale(c, alpPixBuf[ lumBufIndex ], dstW, src2, srcW, + lumXInc, hLumFilter, hLumFilterPos, hLumFilterSize, + formatConvBuffer, + pal, 1); + lastInLumBuf++; + DEBUG_BUFFERS("\t\tlumBufIndex %d: lastInLumBuf: %d\n", + lumBufIndex, lastInLumBuf); + } + while(lastInChrBuf < lastChrSrcY) { + const uint8_t *src1= src[1]+(lastInChrBuf + 1 - chrSrcSliceY)*srcStride[1]; + const uint8_t *src2= src[2]+(lastInChrBuf + 1 - chrSrcSliceY)*srcStride[2]; + chrBufIndex++; + assert(chrBufIndex < 2*vChrBufSize); + assert(lastInChrBuf + 1 - chrSrcSliceY < (chrSrcSliceH)); 
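/* Illustrative sketch (editor's, not from the patch): the chrSrcSliceY/chrSrcSliceH
 * values used in the asserts here come from the -((-x) >> s) pattern a few lines up.
 * That is swscale's stock way of rounding a right shift up instead of down, so an odd
 * slice height still covers all of its chroma lines.  It assumes an arithmetic right
 * shift of negative values, which the compilers FFmpeg targets provide (the C standard
 * leaves this implementation-defined). */
static inline int shift_ceil(int h, int sub)
{
    return -((-h) >> sub);   /* e.g. shift_ceil(5, 1) == 3, while 5 >> 1 == 2 */
}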
+ assert(lastInChrBuf + 1 - chrSrcSliceY >= 0); + //FIXME replace parameters through context struct (some at least) + + if (c->needs_hcscale) + hcscale(c, chrUPixBuf[chrBufIndex], chrVPixBuf[chrBufIndex], + chrDstW, src1, src2, chrSrcW, chrXInc, + hChrFilter, hChrFilterPos, hChrFilterSize, + formatConvBuffer, pal); + lastInChrBuf++; + DEBUG_BUFFERS("\t\tchrBufIndex %d: lastInChrBuf: %d\n", + chrBufIndex, lastInChrBuf); + } + //wrap buf index around to stay inside the ring buffer + if (lumBufIndex >= vLumBufSize) lumBufIndex-= vLumBufSize; + if (chrBufIndex >= vChrBufSize) chrBufIndex-= vChrBufSize; + if (!enough_lines) + break; //we can't output a dstY line so let's try with the next slice + +#if HAVE_MMX + updateMMXDitherTables(c, dstY, lumBufIndex, chrBufIndex, lastInLumBuf, lastInChrBuf); #endif - default: - c->pal_rgb[i]= b + (g<<8) + (r<<16) + (a<<24); + if (dstY >= dstH-2) { + // hmm looks like we can't use MMX here without overwriting this array's tail + find_c_packed_planar_out_funcs(c, &yuv2yuv1, &yuv2yuvX, + &yuv2packed1, &yuv2packed2, + &yuv2packedX); + } + + { + const int16_t **lumSrcPtr= (const int16_t **) lumPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize; + const int16_t **chrUSrcPtr= (const int16_t **) chrUPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **chrVSrcPtr= (const int16_t **) chrVPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **alpSrcPtr= (CONFIG_SWSCALE_ALPHA && alpPixBuf) ? (const int16_t **) alpPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize : NULL; + + if (isPlanarYUV(dstFormat) || dstFormat==PIX_FMT_GRAY8) { //YV12 like + const int chrSkipMask= (1<<c->chrDstVSubSample)-1; + if ((dstY&chrSkipMask) || isGray(dstFormat)) uDest=vDest= NULL; //FIXME split functions in lumi / chromi + if (c->yuv2yuv1 && vLumFilterSize == 1 && vChrFilterSize == 1) { // unscaled YV12 + const int16_t *lumBuf = lumSrcPtr[0]; + const int16_t *chrUBuf= chrUSrcPtr[0]; + const int16_t *chrVBuf= chrVSrcPtr[0]; + const int16_t *alpBuf= (CONFIG_SWSCALE_ALPHA && alpPixBuf) ? alpSrcPtr[0] : NULL; + yuv2yuv1(c, lumBuf, chrUBuf, chrVBuf, alpBuf, dest, + uDest, vDest, aDest, dstW, chrDstW, lumDither, chrDither); + } else { //General YV12 + yuv2yuvX(c, + vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, + chrVSrcPtr, vChrFilterSize, + alpSrcPtr, dest, uDest, vDest, aDest, dstW, chrDstW, lumDither, chrDither); + } + } else { + assert(lumSrcPtr + vLumFilterSize - 1 < lumPixBuf + vLumBufSize*2); + assert(chrUSrcPtr + vChrFilterSize - 1 < chrUPixBuf + vChrBufSize*2); + if (c->yuv2packed1 && vLumFilterSize == 1 && vChrFilterSize == 2) { //unscaled RGB + int chrAlpha= vChrFilter[2*dstY+1]; + yuv2packed1(c, *lumSrcPtr, *chrUSrcPtr, *(chrUSrcPtr+1), + *chrVSrcPtr, *(chrVSrcPtr+1), + alpPixBuf ? *alpSrcPtr : NULL, + dest, dstW, chrAlpha, dstFormat, flags, dstY); + } else if (c->yuv2packed2 && vLumFilterSize == 2 && vChrFilterSize == 2) { //bilinear upscale RGB + int lumAlpha= vLumFilter[2*dstY+1]; + int chrAlpha= vChrFilter[2*dstY+1]; + lumMmxFilter[2]= + lumMmxFilter[3]= vLumFilter[2*dstY ]*0x10001; + chrMmxFilter[2]= + chrMmxFilter[3]= vChrFilter[2*chrDstY]*0x10001; + yuv2packed2(c, *lumSrcPtr, *(lumSrcPtr+1), *chrUSrcPtr, *(chrUSrcPtr+1), + *chrVSrcPtr, *(chrVSrcPtr+1), + alpPixBuf ? *alpSrcPtr : NULL, alpPixBuf ? 
*(alpSrcPtr+1) : NULL, + dest, dstW, lumAlpha, chrAlpha, dstY); + } else { //general RGB + yuv2packedX(c, + vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, + alpSrcPtr, dest, dstW, dstY); + } } } } - // copy strides, so they can safely be modified - if (c->sliceDir == 1) { - // slices go from top to bottom - int srcStride2[4]= {srcStride[0], srcStride[1], srcStride[2], srcStride[3]}; - int dstStride2[4]= {dstStride[0], dstStride[1], dstStride[2], dstStride[3]}; + if ((dstFormat == PIX_FMT_YUVA420P) && !alpPixBuf) + fillPlane(dst[3], dstStride[3], dstW, dstY-lastDstY, lastDstY, 255); - reset_ptr(src2, c->srcFormat); - reset_ptr((const uint8_t**)dst2, c->dstFormat); +#if HAVE_MMX2 + if (av_get_cpu_flags() & AV_CPU_FLAG_MMX2) + __asm__ volatile("sfence":::"memory"); +#endif + emms_c(); - /* reset slice direction at end of frame */ - if (srcSliceY + srcSliceH == c->srcH) - c->sliceDir = 0; + /* store changed local vars back in the context */ + c->dstY= dstY; + c->lumBufIndex= lumBufIndex; + c->chrBufIndex= chrBufIndex; + c->lastInLumBuf= lastInLumBuf; + c->lastInChrBuf= lastInChrBuf; - return c->swScale(c, src2, srcStride2, srcSliceY, srcSliceH, dst2, dstStride2); - } else { - // slices go from bottom to top => we flip the image internally - int srcStride2[4]= {-srcStride[0], -srcStride[1], -srcStride[2], -srcStride[3]}; - int dstStride2[4]= {-dstStride[0], -dstStride[1], -dstStride[2], -dstStride[3]}; + return dstY - lastDstY; +} - src2[0] += (srcSliceH-1)*srcStride[0]; - if (!usePal(c->srcFormat)) - src2[1] += ((srcSliceH>>c->chrSrcVSubSample)-1)*srcStride[1]; - src2[2] += ((srcSliceH>>c->chrSrcVSubSample)-1)*srcStride[2]; - src2[3] += (srcSliceH-1)*srcStride[3]; - dst2[0] += ( c->dstH -1)*dstStride[0]; - dst2[1] += ((c->dstH>>c->chrDstVSubSample)-1)*dstStride[1]; - dst2[2] += ((c->dstH>>c->chrDstVSubSample)-1)*dstStride[2]; - dst2[3] += ( c->dstH -1)*dstStride[3]; +static av_cold void sws_init_swScale_c(SwsContext *c) +{ + enum PixelFormat srcFormat = c->srcFormat; - reset_ptr(src2, c->srcFormat); - reset_ptr((const uint8_t**)dst2, c->dstFormat); + find_c_packed_planar_out_funcs(c, &c->yuv2yuv1, &c->yuv2yuvX, + &c->yuv2packed1, &c->yuv2packed2, + &c->yuv2packedX); - /* reset slice direction at end of frame */ - if (!srcSliceY) - c->sliceDir = 0; + c->hScale = hScale_c; - return c->swScale(c, src2, srcStride2, c->srcH-srcSliceY-srcSliceH, srcSliceH, dst2, dstStride2); + if (c->flags & SWS_FAST_BILINEAR) { + c->hyscale_fast = hyscale_fast_c; + c->hcscale_fast = hcscale_fast_c; } -} -#if LIBSWSCALE_VERSION_MAJOR < 1 -int sws_scale_ordered(SwsContext *c, const uint8_t* const src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) -{ - return sws_scale(c, src, srcStride, srcSliceY, srcSliceH, dst, dstStride); -} -#endif + c->chrToYV12 = NULL; + switch(srcFormat) { + case PIX_FMT_YUYV422 : c->chrToYV12 = yuy2ToUV_c; break; + case PIX_FMT_UYVY422 : c->chrToYV12 = uyvyToUV_c; break; + case PIX_FMT_NV12 : c->chrToYV12 = nv12ToUV_c; break; + case PIX_FMT_NV21 : c->chrToYV12 = nv21ToUV_c; break; + case PIX_FMT_RGB8 : + case PIX_FMT_BGR8 : + case PIX_FMT_PAL8 : + case PIX_FMT_BGR4_BYTE: + case PIX_FMT_RGB4_BYTE: c->chrToYV12 = palToUV_c; break; + case PIX_FMT_GRAY16BE : + case PIX_FMT_YUV444P9BE: + case PIX_FMT_YUV420P9BE: + case PIX_FMT_YUV444P10BE: + case PIX_FMT_YUV422P10BE: + case PIX_FMT_YUV420P10BE: + case PIX_FMT_YUV420P16BE: + case PIX_FMT_YUV422P16BE: + case 
PIX_FMT_YUV444P16BE: c->hScale16= HAVE_BIGENDIAN ? hScale16_c : hScale16X_c; break; + case PIX_FMT_GRAY16LE : + case PIX_FMT_YUV444P9LE: + case PIX_FMT_YUV420P9LE: + case PIX_FMT_YUV422P10LE: + case PIX_FMT_YUV420P10LE: + case PIX_FMT_YUV444P10LE: + case PIX_FMT_YUV420P16LE: + case PIX_FMT_YUV422P16LE: + case PIX_FMT_YUV444P16LE: c->hScale16= HAVE_BIGENDIAN ? hScale16X_c : hScale16_c; break; + } + if (c->chrSrcHSubSample) { + switch(srcFormat) { + case PIX_FMT_RGB48BE : c->chrToYV12 = rgb48BEToUV_half_c; break; + case PIX_FMT_RGB48LE : c->chrToYV12 = rgb48LEToUV_half_c; break; + case PIX_FMT_BGR48BE : c->chrToYV12 = bgr48BEToUV_half_c; break; + case PIX_FMT_BGR48LE : c->chrToYV12 = bgr48LEToUV_half_c; break; + case PIX_FMT_RGB32 : c->chrToYV12 = bgr32ToUV_half_c; break; + case PIX_FMT_RGB32_1 : c->chrToYV12 = bgr321ToUV_half_c; break; + case PIX_FMT_BGR24 : c->chrToYV12 = bgr24ToUV_half_c; break; + case PIX_FMT_BGR565LE: c->chrToYV12 = bgr16leToUV_half_c; break; + case PIX_FMT_BGR565BE: c->chrToYV12 = bgr16beToUV_half_c; break; + case PIX_FMT_BGR555LE: c->chrToYV12 = bgr15leToUV_half_c; break; + case PIX_FMT_BGR555BE: c->chrToYV12 = bgr15beToUV_half_c; break; + case PIX_FMT_BGR32 : c->chrToYV12 = rgb32ToUV_half_c; break; + case PIX_FMT_BGR32_1 : c->chrToYV12 = rgb321ToUV_half_c; break; + case PIX_FMT_RGB24 : c->chrToYV12 = rgb24ToUV_half_c; break; + case PIX_FMT_RGB565LE: c->chrToYV12 = rgb16leToUV_half_c; break; + case PIX_FMT_RGB565BE: c->chrToYV12 = rgb16beToUV_half_c; break; + case PIX_FMT_RGB555LE: c->chrToYV12 = rgb15leToUV_half_c; break; + case PIX_FMT_RGB555BE: c->chrToYV12 = rgb15beToUV_half_c; break; + } + } else { + switch(srcFormat) { + case PIX_FMT_RGB48BE : c->chrToYV12 = rgb48BEToUV_c; break; + case PIX_FMT_RGB48LE : c->chrToYV12 = rgb48LEToUV_c; break; + case PIX_FMT_BGR48BE : c->chrToYV12 = bgr48BEToUV_c; break; + case PIX_FMT_BGR48LE : c->chrToYV12 = bgr48LEToUV_c; break; + case PIX_FMT_RGB32 : c->chrToYV12 = bgr32ToUV_c; break; + case PIX_FMT_RGB32_1 : c->chrToYV12 = bgr321ToUV_c; break; + case PIX_FMT_BGR24 : c->chrToYV12 = bgr24ToUV_c; break; + case PIX_FMT_BGR565LE: c->chrToYV12 = bgr16leToUV_c; break; + case PIX_FMT_BGR565BE: c->chrToYV12 = bgr16beToUV_c; break; + case PIX_FMT_BGR555LE: c->chrToYV12 = bgr15leToUV_c; break; + case PIX_FMT_BGR555BE: c->chrToYV12 = bgr15beToUV_c; break; + case PIX_FMT_BGR32 : c->chrToYV12 = rgb32ToUV_c; break; + case PIX_FMT_BGR32_1 : c->chrToYV12 = rgb321ToUV_c; break; + case PIX_FMT_RGB24 : c->chrToYV12 = rgb24ToUV_c; break; + case PIX_FMT_RGB565LE: c->chrToYV12 = rgb16leToUV_c; break; + case PIX_FMT_RGB565BE: c->chrToYV12 = rgb16beToUV_c; break; + case PIX_FMT_RGB555LE: c->chrToYV12 = rgb15leToUV_c; break; + case PIX_FMT_RGB555BE: c->chrToYV12 = rgb15beToUV_c; break; + } + } -/* Convert the palette to the same packed 32-bit format as the palette */ -void sws_convertPalette8ToPacked32(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette) -{ - long i; + c->lumToYV12 = NULL; + c->alpToYV12 = NULL; + switch (srcFormat) { + case PIX_FMT_YUYV422 : + case PIX_FMT_GRAY8A : + c->lumToYV12 = yuy2ToY_c; break; + case PIX_FMT_UYVY422 : + c->lumToYV12 = uyvyToY_c; break; + case PIX_FMT_BGR24 : c->lumToYV12 = bgr24ToY_c; break; + case PIX_FMT_BGR565LE : c->lumToYV12 = bgr16leToY_c; break; + case PIX_FMT_BGR565BE : c->lumToYV12 = bgr16beToY_c; break; + case PIX_FMT_BGR555LE : c->lumToYV12 = bgr15leToY_c; break; + case PIX_FMT_BGR555BE : c->lumToYV12 = bgr15beToY_c; break; + case PIX_FMT_RGB24 : c->lumToYV12 = rgb24ToY_c; break; 
+ case PIX_FMT_RGB565LE : c->lumToYV12 = rgb16leToY_c; break; + case PIX_FMT_RGB565BE : c->lumToYV12 = rgb16beToY_c; break; + case PIX_FMT_RGB555LE : c->lumToYV12 = rgb15leToY_c; break; + case PIX_FMT_RGB555BE : c->lumToYV12 = rgb15beToY_c; break; + case PIX_FMT_RGB8 : + case PIX_FMT_BGR8 : + case PIX_FMT_PAL8 : + case PIX_FMT_BGR4_BYTE: + case PIX_FMT_RGB4_BYTE: c->lumToYV12 = palToY_c; break; + case PIX_FMT_MONOBLACK: c->lumToYV12 = monoblack2Y_c; break; + case PIX_FMT_MONOWHITE: c->lumToYV12 = monowhite2Y_c; break; + case PIX_FMT_RGB32 : c->lumToYV12 = bgr32ToY_c; break; + case PIX_FMT_RGB32_1: c->lumToYV12 = bgr321ToY_c; break; + case PIX_FMT_BGR32 : c->lumToYV12 = rgb32ToY_c; break; + case PIX_FMT_BGR32_1: c->lumToYV12 = rgb321ToY_c; break; + case PIX_FMT_RGB48BE: c->lumToYV12 = rgb48BEToY_c; break; + case PIX_FMT_RGB48LE: c->lumToYV12 = rgb48LEToY_c; break; + case PIX_FMT_BGR48BE: c->lumToYV12 = bgr48BEToY_c; break; + case PIX_FMT_BGR48LE: c->lumToYV12 = bgr48LEToY_c; break; + } + if (c->alpPixBuf) { + switch (srcFormat) { + case PIX_FMT_BGRA: + case PIX_FMT_RGBA: c->alpToYV12 = rgbaToA_c; break; + case PIX_FMT_ABGR: + case PIX_FMT_ARGB: c->alpToYV12 = abgrToA_c; break; + case PIX_FMT_Y400A: c->alpToYV12 = uyvyToY_c; break; + case PIX_FMT_PAL8 : c->alpToYV12 = palToA_c; break; + } + } + + if(isAnyRGB(c->srcFormat) || c->srcFormat == PIX_FMT_PAL8) + c->hScale16= hScale16_c; - for (i=0; i<num_pixels; i++) - ((uint32_t *) dst)[i] = ((const uint32_t *) palette)[src[i]]; + if (c->srcRange != c->dstRange && !isAnyRGB(c->dstFormat)) { + if (c->srcRange) { + c->lumConvertRange = lumRangeFromJpeg_c; + c->chrConvertRange = chrRangeFromJpeg_c; + } else { + c->lumConvertRange = lumRangeToJpeg_c; + c->chrConvertRange = chrRangeToJpeg_c; + } + } + + if (!(isGray(srcFormat) || isGray(c->dstFormat) || + srcFormat == PIX_FMT_MONOBLACK || srcFormat == PIX_FMT_MONOWHITE)) + c->needs_hcscale = 1; } -/* Palette format: ABCD -> dst format: ABC */ -void sws_convertPalette8ToPacked24(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette) +SwsFunc ff_getSwsFunc(SwsContext *c) { - long i; + sws_init_swScale_c(c); - for (i=0; i<num_pixels; i++) { - //FIXME slow? - dst[0]= palette[src[i]*4+0]; - dst[1]= palette[src[i]*4+1]; - dst[2]= palette[src[i]*4+2]; - dst+= 3; - } + if (HAVE_MMX) + ff_sws_init_swScale_mmx(c); + if (HAVE_ALTIVEC) + ff_sws_init_swScale_altivec(c); + + return swScale; } diff --git a/libswscale/swscale.h b/libswscale/swscale.h index 406eec47f2..e798773158 100644 --- a/libswscale/swscale.h +++ b/libswscale/swscale.h @@ -31,7 +31,7 @@ #define LIBSWSCALE_VERSION_MAJOR 0 #define LIBSWSCALE_VERSION_MINOR 14 -#define LIBSWSCALE_VERSION_MICRO 0 +#define LIBSWSCALE_VERSION_MICRO 1 #define LIBSWSCALE_VERSION_INT AV_VERSION_INT(LIBSWSCALE_VERSION_MAJOR, \ LIBSWSCALE_VERSION_MINOR, \ @@ -50,6 +50,12 @@ #ifndef FF_API_SWS_GETCONTEXT #define FF_API_SWS_GETCONTEXT (LIBSWSCALE_VERSION_MAJOR < 2) #endif +#ifndef FF_API_SWS_CPU_CAPS +#define FF_API_SWS_CPU_CAPS (LIBSWSCALE_VERSION_MAJOR < 2) +#endif +#ifndef FF_API_SWS_FORMAT_NAME +#define FF_API_SWS_FORMAT_NAME (LIBSWSCALE_VERSION_MAJOR < 2) +#endif /** * Returns the LIBSWSCALE_VERSION_INT constant. @@ -95,12 +101,18 @@ const char *swscale_license(void); #define SWS_ACCURATE_RND 0x40000 #define SWS_BITEXACT 0x80000 +#if FF_API_SWS_CPU_CAPS +/** + * CPU caps are autodetected now, those flags + * are only provided for API compatibility. 
+ */ #define SWS_CPU_CAPS_MMX 0x80000000 #define SWS_CPU_CAPS_MMX2 0x20000000 #define SWS_CPU_CAPS_3DNOW 0x40000000 #define SWS_CPU_CAPS_ALTIVEC 0x10000000 #define SWS_CPU_CAPS_BFIN 0x01000000 #define SWS_CPU_CAPS_SSE2 0x02000000 +#endif #define SWS_MAX_REDUCE_CUTOFF 0.002 @@ -187,6 +199,7 @@ void sws_freeContext(struct SwsContext *swsContext); * @return a pointer to an allocated context, or NULL in case of error * @note this function is to be removed after a saner alternative is * written + * @deprecated Use sws_getCachedContext() instead. */ struct SwsContext *sws_getContext(int srcW, int srcH, enum PixelFormat srcFormat, int dstW, int dstH, enum PixelFormat dstFormat, @@ -341,7 +354,7 @@ struct SwsContext *sws_getCachedContext(struct SwsContext *context, * @param num_pixels number of pixels to convert * @param palette array with [256] entries, which must match color arrangement (RGB or BGR) of src */ -void sws_convertPalette8ToPacked32(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette); +void sws_convertPalette8ToPacked32(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette); /** * Converts an 8bit paletted frame into a frame with a color depth of 24 bits. @@ -353,7 +366,7 @@ void sws_convertPalette8ToPacked32(const uint8_t *src, uint8_t *dst, long num_pi * @param num_pixels number of pixels to convert * @param palette array with [256] entries, which must match color arrangement (RGB or BGR) of src */ -void sws_convertPalette8ToPacked24(const uint8_t *src, uint8_t *dst, long num_pixels, const uint8_t *palette); +void sws_convertPalette8ToPacked24(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette); #endif /* SWSCALE_SWSCALE_H */ diff --git a/libswscale/swscale_internal.h b/libswscale/swscale_internal.h index 03c5bf9736..c0f8e64d70 100644 --- a/libswscale/swscale_internal.h +++ b/libswscale/swscale_internal.h @@ -35,9 +35,7 @@ #define MAX_FILTER_SIZE 256 -#define VOFW 21504 - -#define VOF (VOFW*2) +#define DITHER1XBPP #if HAVE_BIGENDIAN #define ALT32_CORR (-1) @@ -61,6 +59,41 @@ typedef int (*SwsFunc)(struct SwsContext *context, const uint8_t* src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t* dst[], int dstStride[]); +typedef void (*yuv2planar1_fn) (struct SwsContext *c, + const int16_t *lumSrc, const int16_t *chrUSrc, + const int16_t *chrVSrc, const int16_t *alpSrc, + uint8_t *dest, + uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, + int dstW, int chrDstW, const uint8_t *lumDither, const uint8_t *chrDither); +typedef void (*yuv2planarX_fn) (struct SwsContext *c, + const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, + uint8_t *dest, + uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, + int dstW, int chrDstW, const uint8_t *lumDither, const uint8_t *chrDither); +typedef void (*yuv2packed1_fn) (struct SwsContext *c, + const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, + uint8_t *dest, + int dstW, int uvalpha, int dstFormat, int flags, int y); +typedef void (*yuv2packed2_fn) (struct SwsContext *c, + const uint16_t *buf0, const uint16_t *buf1, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, const uint16_t *abuf1, + uint8_t *dest, + int dstW, int yalpha, int uvalpha, int y); +typedef void (*yuv2packedX_fn) (struct 
SwsContext *c, + const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, int chrFilterSize, + const int16_t **alpSrc, uint8_t *dest, + int dstW, int dstY); + /* This struct should be aligned on at least a 32-byte boundary. */ typedef struct SwsContext { /** @@ -108,7 +141,8 @@ typedef struct SwsContext { */ //@{ int16_t **lumPixBuf; ///< Ring buffer for scaled horizontal luma plane lines to be fed to the vertical scaler. - int16_t **chrPixBuf; ///< Ring buffer for scaled horizontal chroma plane lines to be fed to the vertical scaler. + int16_t **chrUPixBuf; ///< Ring buffer for scaled horizontal chroma plane lines to be fed to the vertical scaler. + int16_t **chrVPixBuf; ///< Ring buffer for scaled horizontal chroma plane lines to be fed to the vertical scaler. int16_t **alpPixBuf; ///< Ring buffer for scaled horizontal alpha plane lines to be fed to the vertical scaler. int vLumBufSize; ///< Number of vertical luma/alpha lines allocated in the ring buffer. int vChrBufSize; ///< Number of vertical chroma lines allocated in the ring buffer. @@ -118,7 +152,7 @@ typedef struct SwsContext { int chrBufIndex; ///< Index in ring buffer of the last scaled horizontal chroma line from source. //@} - uint8_t formatConvBuffer[VOF]; //FIXME dynamic allocation, but we have to change a lot of code for this to be useful + uint8_t *formatConvBuffer; /** * @name Horizontal and vertical filters. @@ -196,6 +230,10 @@ typedef struct SwsContext { #define V_TEMP "11*8+4*4*256*2+32" #define Y_TEMP "11*8+4*4*256*2+40" #define ALP_MMX_FILTER_OFFSET "11*8+4*4*256*2+48" +#define UV_OFF "11*8+4*4*256*3+48" +#define UV_OFFx2 "11*8+4*4*256*3+56" +#define DITHER16 "11*8+4*4*256*3+64" +#define DITHER32 "11*8+4*4*256*3+64+16" DECLARE_ALIGNED(8, uint64_t, redDither); DECLARE_ALIGNED(8, uint64_t, greenDither); @@ -218,6 +256,10 @@ typedef struct SwsContext { DECLARE_ALIGNED(8, uint64_t, v_temp); DECLARE_ALIGNED(8, uint64_t, y_temp); int32_t alpMmxFilter[4*MAX_FILTER_SIZE]; + DECLARE_ALIGNED(8, ptrdiff_t, uv_off); ///< offset (in pixels) between u and v planes + DECLARE_ALIGNED(8, ptrdiff_t, uv_offx2); ///< offset (in bytes) between u and v planes + uint16_t dither16[8]; + uint32_t dither32[8]; #if HAVE_ALTIVEC vector signed short CY; @@ -249,66 +291,37 @@ typedef struct SwsContext { #endif /* function pointers for swScale() */ - void (*yuv2nv12X )(struct SwsContext *c, - const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - uint8_t *dest, uint8_t *uDest, - int dstW, int chrDstW, int dstFormat); - void (*yuv2yuv1 )(struct SwsContext *c, - const int16_t *lumSrc, const int16_t *chrSrc, const int16_t *alpSrc, - uint8_t *dest, - uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, - long dstW, long chrDstW); - void (*yuv2yuvX )(struct SwsContext *c, - const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, - uint8_t *dest, - uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, - long dstW, long chrDstW); - void (*yuv2packed1)(struct SwsContext *c, - const uint16_t *buf0, - const uint16_t *uvbuf0, const uint16_t *uvbuf1, - const uint16_t *abuf0, - uint8_t *dest, - int dstW, int uvalpha, int dstFormat, int flags, int y); - void (*yuv2packed2)(struct SwsContext *c, - const uint16_t *buf0, const uint16_t *buf1, - const uint16_t *uvbuf0, const uint16_t 
*uvbuf1, - const uint16_t *abuf0, const uint16_t *abuf1, - uint8_t *dest, - int dstW, int yalpha, int uvalpha, int y); - void (*yuv2packedX)(struct SwsContext *c, - const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint8_t *dest, - long dstW, long dstY); + yuv2planar1_fn yuv2yuv1; + yuv2planarX_fn yuv2yuvX; + yuv2packed1_fn yuv2packed1; + yuv2packed2_fn yuv2packed2; + yuv2packedX_fn yuv2packedX; void (*lumToYV12)(uint8_t *dst, const uint8_t *src, - long width, uint32_t *pal); ///< Unscaled conversion of luma plane to YV12 for horizontal scaler. + int width, uint32_t *pal); ///< Unscaled conversion of luma plane to YV12 for horizontal scaler. void (*alpToYV12)(uint8_t *dst, const uint8_t *src, - long width, uint32_t *pal); ///< Unscaled conversion of alpha plane to YV12 for horizontal scaler. + int width, uint32_t *pal); ///< Unscaled conversion of alpha plane to YV12 for horizontal scaler. void (*chrToYV12)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *pal); ///< Unscaled conversion of chroma planes to YV12 for horizontal scaler. + int width, uint32_t *pal); ///< Unscaled conversion of chroma planes to YV12 for horizontal scaler. void (*hyscale_fast)(struct SwsContext *c, - int16_t *dst, long dstWidth, + int16_t *dst, int dstWidth, const uint8_t *src, int srcW, int xInc); void (*hcscale_fast)(struct SwsContext *c, - int16_t *dst, long dstWidth, + int16_t *dst1, int16_t *dst2, int dstWidth, const uint8_t *src1, const uint8_t *src2, int srcW, int xInc); void (*hScale)(int16_t *dst, int dstW, const uint8_t *src, int srcW, int xInc, const int16_t *filter, const int16_t *filterPos, - long filterSize); + int filterSize); - void (*lumConvertRange)(int16_t *dst, int width); ///< Color range conversion function for luma plane if needed. - void (*chrConvertRange)(int16_t *dst, int width); ///< Color range conversion function for chroma planes if needed. + void (*hScale16)(int16_t *dst, int dstW, const uint16_t *src, int srcW, + int xInc, const int16_t *filter, const int16_t *filterPos, + long filterSize, int shift); - int lumSrcOffset; ///< Offset given to luma src pointers passed to horizontal input functions. - int chrSrcOffset; ///< Offset given to chroma src pointers passed to horizontal input functions. - int alpSrcOffset; ///< Offset given to alpha src pointers passed to horizontal input functions. + void (*lumConvertRange)(int16_t *dst, int width); ///< Color range conversion function for luma plane if needed. + void (*chrConvertRange)(int16_t *dst1, int16_t *dst2, int width); ///< Color range conversion function for chroma planes if needed. int needs_hcscale; ///< Set if there are chroma planes to be converted. 
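
The hunks above replace SwsContext's inline-declared output callbacks with the named yuv2planar1_fn/yuv2planarX_fn/yuv2packed1_fn/yuv2packed2_fn/yuv2packedX_fn typedefs. A minimal, self-contained sketch of that pattern — hypothetical names, not libswscale code — selecting a per-line output routine once at init time and calling it later through the stored pointer:

#include <stdint.h>

typedef void (*yuv2planar1_like_fn)(const int16_t *src, uint8_t *dst, int width);

static uint8_t clip_uint8(int v)          /* stands in for av_clip_uint8() */
{
    return v < 0 ? 0 : v > 255 ? 255 : v;
}

/* C fallback: round 15-bit intermediate samples down to 8 bits */
static void yuv2plane1_c_like(const int16_t *src, uint8_t *dst, int width)
{
    int i;
    for (i = 0; i < width; i++)
        dst[i] = clip_uint8((src[i] + 64) >> 7);
}

typedef struct ContextLike {
    yuv2planar1_like_fn yuv2yuv1;         /* chosen once: C, MMX, AltiVec, ... */
} ContextLike;

static void init_like(ContextLike *c)
{
    c->yuv2yuv1 = yuv2plane1_c_like;      /* a real init would test CPU flags */
}

Each scaled line can then call c->yuv2yuv1(...) without re-testing CPU capabilities, which is what sws_init_swScale_c() together with ff_sws_init_swScale_mmx()/ff_sws_init_swScale_altivec() sets up for the real context in the swscale.c hunks above.
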
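Also visible in the swscale.h hunks further up: sws_getContext() gains a deprecation note pointing at sws_getCachedContext(), and the palette expansion helpers now take an int pixel count instead of long. A hedged caller-side sketch using only the prototypes shown in this patch (the wrapper name and buffer layout are made up for illustration):

#include <stdint.h>
#include <libswscale/swscale.h>

/* Expand one line of PAL8 input to packed 32-bit output; pal must hold
 * 256 packed entries in the destination's RGB/BGR order. */
static void expand_pal8_line(const uint8_t *indexed, uint8_t *packed32,
                             int width, const uint8_t *pal)
{
    /* num_pixels is an int after this patch (was long) */
    sws_convertPalette8ToPacked32(indexed, packed32, width, pal);
}
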
@@ -322,18 +335,23 @@ int ff_yuv2rgb_c_init_tables(SwsContext *c, const int inv_table[4], void ff_yuv2rgb_init_tables_altivec(SwsContext *c, const int inv_table[4], int brightness, int contrast, int saturation); +void updateMMXDitherTables(SwsContext *c, int dstY, int lumBufIndex, int chrBufIndex, + int lastInLumBuf, int lastInChrBuf); + SwsFunc ff_yuv2rgb_init_mmx(SwsContext *c); SwsFunc ff_yuv2rgb_init_vis(SwsContext *c); SwsFunc ff_yuv2rgb_init_mlib(SwsContext *c); SwsFunc ff_yuv2rgb_init_altivec(SwsContext *c); SwsFunc ff_yuv2rgb_get_func_ptr_bfin(SwsContext *c); void ff_bfin_get_unscaled_swscale(SwsContext *c); -void ff_yuv2packedX_altivec(SwsContext *c, - const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - uint8_t *dest, int dstW, int dstY); +#if FF_API_SWS_FORMAT_NAME +/** + * @deprecated Use av_get_pix_fmt_name() instead. + */ +attribute_deprecated const char *sws_format_name(enum PixelFormat format); +#endif //FIXME replace this with something faster #define is16BPS(x) ( \ @@ -353,6 +371,12 @@ const char *sws_format_name(enum PixelFormat format); #define isNBPS(x) ( \ (x)==PIX_FMT_YUV420P9LE \ || (x)==PIX_FMT_YUV420P9BE \ + || (x)==PIX_FMT_YUV444P9BE \ + || (x)==PIX_FMT_YUV444P9LE \ + || (x)==PIX_FMT_YUV422P10BE \ + || (x)==PIX_FMT_YUV422P10LE \ + || (x)==PIX_FMT_YUV444P10BE \ + || (x)==PIX_FMT_YUV444P10LE \ || (x)==PIX_FMT_YUV420P10LE \ || (x)==PIX_FMT_YUV420P10BE \ || (x)==PIX_FMT_YUV422P10LE \ @@ -374,13 +398,19 @@ const char *sws_format_name(enum PixelFormat format); #define isPlanarYUV(x) ( \ isPlanar8YUV(x) \ || (x)==PIX_FMT_YUV420P9LE \ + || (x)==PIX_FMT_YUV444P9LE \ || (x)==PIX_FMT_YUV420P10LE \ + || (x)==PIX_FMT_YUV422P10LE \ + || (x)==PIX_FMT_YUV444P10LE \ || (x)==PIX_FMT_YUV420P16LE \ || (x)==PIX_FMT_YUV422P10LE \ || (x)==PIX_FMT_YUV422P16LE \ || (x)==PIX_FMT_YUV444P16LE \ || (x)==PIX_FMT_YUV420P9BE \ + || (x)==PIX_FMT_YUV444P9BE \ || (x)==PIX_FMT_YUV420P10BE \ + || (x)==PIX_FMT_YUV422P10BE \ + || (x)==PIX_FMT_YUV444P10BE \ || (x)==PIX_FMT_YUV420P16BE \ || (x)==PIX_FMT_YUV422P10BE \ || (x)==PIX_FMT_YUV422P16BE \ @@ -464,10 +494,20 @@ const char *sws_format_name(enum PixelFormat format); || (x)==PIX_FMT_GRAY8A \ || (x)==PIX_FMT_YUVA420P \ ) +#define isPacked(x) ( \ + (x)==PIX_FMT_PAL8 \ + || (x)==PIX_FMT_YUYV422 \ + || (x)==PIX_FMT_UYVY422 \ + || (x)==PIX_FMT_Y400A \ + || isAnyRGB(x) \ + ) #define usePal(x) ((av_pix_fmt_descriptors[x].flags & PIX_FMT_PAL) || (x) == PIX_FMT_GRAY8A) extern const uint64_t ff_dither4[2]; extern const uint64_t ff_dither8[2]; +extern const uint8_t dithers[8][8][8]; +extern const uint16_t dither_scale[15][16]; + extern const AVClass sws_context_class; @@ -477,10 +517,7 @@ extern const AVClass sws_context_class; */ void ff_get_unscaled_swscale(SwsContext *c); -/** - * Returns the SWS_CPU_CAPS for the optimized code compiled into swscale. 
- */ -int ff_hardcodedcpuflags(void); +void ff_swscale_get_unscaled_altivec(SwsContext *c); /** * Returns function pointer to fastest main scaler path function depending @@ -488,4 +525,7 @@ int ff_hardcodedcpuflags(void); */ SwsFunc ff_getSwsFunc(SwsContext *c); +void ff_sws_init_swScale_altivec(SwsContext *c); +void ff_sws_init_swScale_mmx(SwsContext *c); + #endif /* SWSCALE_SWSCALE_INTERNAL_H */ diff --git a/libswscale/swscale_template.c b/libswscale/swscale_template.c index e53cfc0752..9ae9fc771c 100644 --- a/libswscale/swscale_template.c +++ b/libswscale/swscale_template.c @@ -18,999 +18,55 @@ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ -#undef REAL_MOVNTQ -#undef MOVNTQ -#undef PAVGB -#undef PREFETCH - -#if COMPILE_TEMPLATE_AMD3DNOW -#define PREFETCH "prefetch" -#elif COMPILE_TEMPLATE_MMX2 -#define PREFETCH "prefetchnta" -#else -#define PREFETCH " # nop" -#endif - -#if COMPILE_TEMPLATE_MMX2 -#define PAVGB(a,b) "pavgb " #a ", " #b " \n\t" -#elif COMPILE_TEMPLATE_AMD3DNOW -#define PAVGB(a,b) "pavgusb " #a ", " #b " \n\t" -#endif - -#if COMPILE_TEMPLATE_MMX2 -#define REAL_MOVNTQ(a,b) "movntq " #a ", " #b " \n\t" -#else -#define REAL_MOVNTQ(a,b) "movq " #a ", " #b " \n\t" -#endif -#define MOVNTQ(a,b) REAL_MOVNTQ(a,b) - -#if COMPILE_TEMPLATE_ALTIVEC -#include "ppc/swscale_altivec_template.c" -#endif - -#define YSCALEYUV2YV12X(x, offset, dest, width) \ - __asm__ volatile(\ - "xor %%"REG_a", %%"REG_a" \n\t"\ - "movq "VROUNDER_OFFSET"(%0), %%mm3 \n\t"\ - "movq %%mm3, %%mm4 \n\t"\ - "lea " offset "(%0), %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - ".p2align 4 \n\t" /* FIXME Unroll? */\ - "1: \n\t"\ - "movq 8(%%"REG_d"), %%mm0 \n\t" /* filterCoeff */\ - "movq " x "(%%"REG_S", %%"REG_a", 2), %%mm2 \n\t" /* srcData */\ - "movq 8+" x "(%%"REG_S", %%"REG_a", 2), %%mm5 \n\t" /* srcData */\ - "add $16, %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "test %%"REG_S", %%"REG_S" \n\t"\ - "pmulhw %%mm0, %%mm2 \n\t"\ - "pmulhw %%mm0, %%mm5 \n\t"\ - "paddw %%mm2, %%mm3 \n\t"\ - "paddw %%mm5, %%mm4 \n\t"\ - " jnz 1b \n\t"\ - "psraw $3, %%mm3 \n\t"\ - "psraw $3, %%mm4 \n\t"\ - "packuswb %%mm4, %%mm3 \n\t"\ - MOVNTQ(%%mm3, (%1, %%REGa))\ - "add $8, %%"REG_a" \n\t"\ - "cmp %2, %%"REG_a" \n\t"\ - "movq "VROUNDER_OFFSET"(%0), %%mm3 \n\t"\ - "movq %%mm3, %%mm4 \n\t"\ - "lea " offset "(%0), %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "jb 1b \n\t"\ - :: "r" (&c->redDither),\ - "r" (dest), "g" ((x86_reg)width)\ - : "%"REG_a, "%"REG_d, "%"REG_S\ - ); - -#define YSCALEYUV2YV12X_ACCURATE(x, offset, dest, width) \ - __asm__ volatile(\ - "lea " offset "(%0), %%"REG_d" \n\t"\ - "xor %%"REG_a", %%"REG_a" \n\t"\ - "pxor %%mm4, %%mm4 \n\t"\ - "pxor %%mm5, %%mm5 \n\t"\ - "pxor %%mm6, %%mm6 \n\t"\ - "pxor %%mm7, %%mm7 \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq " x "(%%"REG_S", %%"REG_a", 2), %%mm0 \n\t" /* srcData */\ - "movq 8+" x "(%%"REG_S", %%"REG_a", 2), %%mm2 \n\t" /* srcData */\ - "mov "STR(APCK_PTR2)"(%%"REG_d"), %%"REG_S" \n\t"\ - "movq " x "(%%"REG_S", %%"REG_a", 2), %%mm1 \n\t" /* srcData */\ - "movq %%mm0, %%mm3 \n\t"\ - "punpcklwd %%mm1, %%mm0 \n\t"\ - "punpckhwd %%mm1, %%mm3 \n\t"\ - "movq "STR(APCK_COEF)"(%%"REG_d"), %%mm1 \n\t" /* filterCoeff */\ - "pmaddwd %%mm1, %%mm0 \n\t"\ - "pmaddwd %%mm1, %%mm3 \n\t"\ - "paddd %%mm0, %%mm4 \n\t"\ - "paddd %%mm3, %%mm5 \n\t"\ - "movq 8+" x "(%%"REG_S", %%"REG_a", 2), %%mm3 \n\t" /* srcData */\ - "mov "STR(APCK_SIZE)"(%%"REG_d"), %%"REG_S" \n\t"\ - "add 
$"STR(APCK_SIZE)", %%"REG_d" \n\t"\ - "test %%"REG_S", %%"REG_S" \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "punpcklwd %%mm3, %%mm2 \n\t"\ - "punpckhwd %%mm3, %%mm0 \n\t"\ - "pmaddwd %%mm1, %%mm2 \n\t"\ - "pmaddwd %%mm1, %%mm0 \n\t"\ - "paddd %%mm2, %%mm6 \n\t"\ - "paddd %%mm0, %%mm7 \n\t"\ - " jnz 1b \n\t"\ - "psrad $16, %%mm4 \n\t"\ - "psrad $16, %%mm5 \n\t"\ - "psrad $16, %%mm6 \n\t"\ - "psrad $16, %%mm7 \n\t"\ - "movq "VROUNDER_OFFSET"(%0), %%mm0 \n\t"\ - "packssdw %%mm5, %%mm4 \n\t"\ - "packssdw %%mm7, %%mm6 \n\t"\ - "paddw %%mm0, %%mm4 \n\t"\ - "paddw %%mm0, %%mm6 \n\t"\ - "psraw $3, %%mm4 \n\t"\ - "psraw $3, %%mm6 \n\t"\ - "packuswb %%mm6, %%mm4 \n\t"\ - MOVNTQ(%%mm4, (%1, %%REGa))\ - "add $8, %%"REG_a" \n\t"\ - "cmp %2, %%"REG_a" \n\t"\ - "lea " offset "(%0), %%"REG_d" \n\t"\ - "pxor %%mm4, %%mm4 \n\t"\ - "pxor %%mm5, %%mm5 \n\t"\ - "pxor %%mm6, %%mm6 \n\t"\ - "pxor %%mm7, %%mm7 \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "jb 1b \n\t"\ - :: "r" (&c->redDither),\ - "r" (dest), "g" ((x86_reg)width)\ - : "%"REG_a, "%"REG_d, "%"REG_S\ - ); - -#define YSCALEYUV2YV121 \ - "mov %2, %%"REG_a" \n\t"\ - ".p2align 4 \n\t" /* FIXME Unroll? */\ - "1: \n\t"\ - "movq (%0, %%"REG_a", 2), %%mm0 \n\t"\ - "movq 8(%0, %%"REG_a", 2), %%mm1 \n\t"\ - "psraw $7, %%mm0 \n\t"\ - "psraw $7, %%mm1 \n\t"\ - "packuswb %%mm1, %%mm0 \n\t"\ - MOVNTQ(%%mm0, (%1, %%REGa))\ - "add $8, %%"REG_a" \n\t"\ - "jnc 1b \n\t" - -#define YSCALEYUV2YV121_ACCURATE \ - "mov %2, %%"REG_a" \n\t"\ - "pcmpeqw %%mm7, %%mm7 \n\t"\ - "psrlw $15, %%mm7 \n\t"\ - "psllw $6, %%mm7 \n\t"\ - ".p2align 4 \n\t" /* FIXME Unroll? */\ - "1: \n\t"\ - "movq (%0, %%"REG_a", 2), %%mm0 \n\t"\ - "movq 8(%0, %%"REG_a", 2), %%mm1 \n\t"\ - "paddsw %%mm7, %%mm0 \n\t"\ - "paddsw %%mm7, %%mm1 \n\t"\ - "psraw $7, %%mm0 \n\t"\ - "psraw $7, %%mm1 \n\t"\ - "packuswb %%mm1, %%mm0 \n\t"\ - MOVNTQ(%%mm0, (%1, %%REGa))\ - "add $8, %%"REG_a" \n\t"\ - "jnc 1b \n\t" - -/* - :: "m" (-lumFilterSize), "m" (-chrFilterSize), - "m" (lumMmxFilter+lumFilterSize*4), "m" (chrMmxFilter+chrFilterSize*4), - "r" (dest), "m" (dstW_reg), - "m" (lumSrc+lumFilterSize), "m" (chrSrc+chrFilterSize) - : "%eax", "%ebx", "%ecx", "%edx", "%esi" -*/ -#define YSCALEYUV2PACKEDX_UV \ - __asm__ volatile(\ - "xor %%"REG_a", %%"REG_a" \n\t"\ - ".p2align 4 \n\t"\ - "nop \n\t"\ - "1: \n\t"\ - "lea "CHR_MMX_FILTER_OFFSET"(%0), %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "movq "VROUNDER_OFFSET"(%0), %%mm3 \n\t"\ - "movq %%mm3, %%mm4 \n\t"\ - ".p2align 4 \n\t"\ - "2: \n\t"\ - "movq 8(%%"REG_d"), %%mm0 \n\t" /* filterCoeff */\ - "movq (%%"REG_S", %%"REG_a"), %%mm2 \n\t" /* UsrcData */\ - "movq "AV_STRINGIFY(VOF)"(%%"REG_S", %%"REG_a"), %%mm5 \n\t" /* VsrcData */\ - "add $16, %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "pmulhw %%mm0, %%mm2 \n\t"\ - "pmulhw %%mm0, %%mm5 \n\t"\ - "paddw %%mm2, %%mm3 \n\t"\ - "paddw %%mm5, %%mm4 \n\t"\ - "test %%"REG_S", %%"REG_S" \n\t"\ - " jnz 2b \n\t"\ - -#define YSCALEYUV2PACKEDX_YA(offset,coeff,src1,src2,dst1,dst2) \ - "lea "offset"(%0), %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "movq "VROUNDER_OFFSET"(%0), "#dst1" \n\t"\ - "movq "#dst1", "#dst2" \n\t"\ - ".p2align 4 \n\t"\ - "2: \n\t"\ - "movq 8(%%"REG_d"), "#coeff" \n\t" /* filterCoeff */\ - "movq (%%"REG_S", %%"REG_a", 2), "#src1" \n\t" /* Y1srcData */\ - "movq 8(%%"REG_S", %%"REG_a", 2), "#src2" \n\t" /* Y2srcData */\ - "add $16, %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "pmulhw "#coeff", "#src1" \n\t"\ - "pmulhw "#coeff", "#src2" \n\t"\ - "paddw "#src1", "#dst1" \n\t"\ - 
"paddw "#src2", "#dst2" \n\t"\ - "test %%"REG_S", %%"REG_S" \n\t"\ - " jnz 2b \n\t"\ - -#define YSCALEYUV2PACKEDX \ - YSCALEYUV2PACKEDX_UV \ - YSCALEYUV2PACKEDX_YA(LUM_MMX_FILTER_OFFSET,%%mm0,%%mm2,%%mm5,%%mm1,%%mm7) \ - -#define YSCALEYUV2PACKEDX_END \ - :: "r" (&c->redDither), \ - "m" (dummy), "m" (dummy), "m" (dummy),\ - "r" (dest), "m" (dstW_reg) \ - : "%"REG_a, "%"REG_d, "%"REG_S \ - ); - -#define YSCALEYUV2PACKEDX_ACCURATE_UV \ - __asm__ volatile(\ - "xor %%"REG_a", %%"REG_a" \n\t"\ - ".p2align 4 \n\t"\ - "nop \n\t"\ - "1: \n\t"\ - "lea "CHR_MMX_FILTER_OFFSET"(%0), %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "pxor %%mm4, %%mm4 \n\t"\ - "pxor %%mm5, %%mm5 \n\t"\ - "pxor %%mm6, %%mm6 \n\t"\ - "pxor %%mm7, %%mm7 \n\t"\ - ".p2align 4 \n\t"\ - "2: \n\t"\ - "movq (%%"REG_S", %%"REG_a"), %%mm0 \n\t" /* UsrcData */\ - "movq "AV_STRINGIFY(VOF)"(%%"REG_S", %%"REG_a"), %%mm2 \n\t" /* VsrcData */\ - "mov "STR(APCK_PTR2)"(%%"REG_d"), %%"REG_S" \n\t"\ - "movq (%%"REG_S", %%"REG_a"), %%mm1 \n\t" /* UsrcData */\ - "movq %%mm0, %%mm3 \n\t"\ - "punpcklwd %%mm1, %%mm0 \n\t"\ - "punpckhwd %%mm1, %%mm3 \n\t"\ - "movq "STR(APCK_COEF)"(%%"REG_d"),%%mm1 \n\t" /* filterCoeff */\ - "pmaddwd %%mm1, %%mm0 \n\t"\ - "pmaddwd %%mm1, %%mm3 \n\t"\ - "paddd %%mm0, %%mm4 \n\t"\ - "paddd %%mm3, %%mm5 \n\t"\ - "movq "AV_STRINGIFY(VOF)"(%%"REG_S", %%"REG_a"), %%mm3 \n\t" /* VsrcData */\ - "mov "STR(APCK_SIZE)"(%%"REG_d"), %%"REG_S" \n\t"\ - "add $"STR(APCK_SIZE)", %%"REG_d" \n\t"\ - "test %%"REG_S", %%"REG_S" \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "punpcklwd %%mm3, %%mm2 \n\t"\ - "punpckhwd %%mm3, %%mm0 \n\t"\ - "pmaddwd %%mm1, %%mm2 \n\t"\ - "pmaddwd %%mm1, %%mm0 \n\t"\ - "paddd %%mm2, %%mm6 \n\t"\ - "paddd %%mm0, %%mm7 \n\t"\ - " jnz 2b \n\t"\ - "psrad $16, %%mm4 \n\t"\ - "psrad $16, %%mm5 \n\t"\ - "psrad $16, %%mm6 \n\t"\ - "psrad $16, %%mm7 \n\t"\ - "movq "VROUNDER_OFFSET"(%0), %%mm0 \n\t"\ - "packssdw %%mm5, %%mm4 \n\t"\ - "packssdw %%mm7, %%mm6 \n\t"\ - "paddw %%mm0, %%mm4 \n\t"\ - "paddw %%mm0, %%mm6 \n\t"\ - "movq %%mm4, "U_TEMP"(%0) \n\t"\ - "movq %%mm6, "V_TEMP"(%0) \n\t"\ - -#define YSCALEYUV2PACKEDX_ACCURATE_YA(offset) \ - "lea "offset"(%0), %%"REG_d" \n\t"\ - "mov (%%"REG_d"), %%"REG_S" \n\t"\ - "pxor %%mm1, %%mm1 \n\t"\ - "pxor %%mm5, %%mm5 \n\t"\ - "pxor %%mm7, %%mm7 \n\t"\ - "pxor %%mm6, %%mm6 \n\t"\ - ".p2align 4 \n\t"\ - "2: \n\t"\ - "movq (%%"REG_S", %%"REG_a", 2), %%mm0 \n\t" /* Y1srcData */\ - "movq 8(%%"REG_S", %%"REG_a", 2), %%mm2 \n\t" /* Y2srcData */\ - "mov "STR(APCK_PTR2)"(%%"REG_d"), %%"REG_S" \n\t"\ - "movq (%%"REG_S", %%"REG_a", 2), %%mm4 \n\t" /* Y1srcData */\ - "movq %%mm0, %%mm3 \n\t"\ - "punpcklwd %%mm4, %%mm0 \n\t"\ - "punpckhwd %%mm4, %%mm3 \n\t"\ - "movq "STR(APCK_COEF)"(%%"REG_d"), %%mm4 \n\t" /* filterCoeff */\ - "pmaddwd %%mm4, %%mm0 \n\t"\ - "pmaddwd %%mm4, %%mm3 \n\t"\ - "paddd %%mm0, %%mm1 \n\t"\ - "paddd %%mm3, %%mm5 \n\t"\ - "movq 8(%%"REG_S", %%"REG_a", 2), %%mm3 \n\t" /* Y2srcData */\ - "mov "STR(APCK_SIZE)"(%%"REG_d"), %%"REG_S" \n\t"\ - "add $"STR(APCK_SIZE)", %%"REG_d" \n\t"\ - "test %%"REG_S", %%"REG_S" \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "punpcklwd %%mm3, %%mm2 \n\t"\ - "punpckhwd %%mm3, %%mm0 \n\t"\ - "pmaddwd %%mm4, %%mm2 \n\t"\ - "pmaddwd %%mm4, %%mm0 \n\t"\ - "paddd %%mm2, %%mm7 \n\t"\ - "paddd %%mm0, %%mm6 \n\t"\ - " jnz 2b \n\t"\ - "psrad $16, %%mm1 \n\t"\ - "psrad $16, %%mm5 \n\t"\ - "psrad $16, %%mm7 \n\t"\ - "psrad $16, %%mm6 \n\t"\ - "movq "VROUNDER_OFFSET"(%0), %%mm0 \n\t"\ - "packssdw %%mm5, %%mm1 \n\t"\ - "packssdw %%mm6, %%mm7 \n\t"\ - "paddw 
%%mm0, %%mm1 \n\t"\ - "paddw %%mm0, %%mm7 \n\t"\ - "movq "U_TEMP"(%0), %%mm3 \n\t"\ - "movq "V_TEMP"(%0), %%mm4 \n\t"\ - -#define YSCALEYUV2PACKEDX_ACCURATE \ - YSCALEYUV2PACKEDX_ACCURATE_UV \ - YSCALEYUV2PACKEDX_ACCURATE_YA(LUM_MMX_FILTER_OFFSET) - -#define YSCALEYUV2RGBX \ - "psubw "U_OFFSET"(%0), %%mm3 \n\t" /* (U-128)8*/\ - "psubw "V_OFFSET"(%0), %%mm4 \n\t" /* (V-128)8*/\ - "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ - "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ - "pmulhw "UG_COEFF"(%0), %%mm3 \n\t"\ - "pmulhw "VG_COEFF"(%0), %%mm4 \n\t"\ - /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ - "pmulhw "UB_COEFF"(%0), %%mm2 \n\t"\ - "pmulhw "VR_COEFF"(%0), %%mm5 \n\t"\ - "psubw "Y_OFFSET"(%0), %%mm1 \n\t" /* 8(Y-16)*/\ - "psubw "Y_OFFSET"(%0), %%mm7 \n\t" /* 8(Y-16)*/\ - "pmulhw "Y_COEFF"(%0), %%mm1 \n\t"\ - "pmulhw "Y_COEFF"(%0), %%mm7 \n\t"\ - /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ - "paddw %%mm3, %%mm4 \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "movq %%mm5, %%mm6 \n\t"\ - "movq %%mm4, %%mm3 \n\t"\ - "punpcklwd %%mm2, %%mm2 \n\t"\ - "punpcklwd %%mm5, %%mm5 \n\t"\ - "punpcklwd %%mm4, %%mm4 \n\t"\ - "paddw %%mm1, %%mm2 \n\t"\ - "paddw %%mm1, %%mm5 \n\t"\ - "paddw %%mm1, %%mm4 \n\t"\ - "punpckhwd %%mm0, %%mm0 \n\t"\ - "punpckhwd %%mm6, %%mm6 \n\t"\ - "punpckhwd %%mm3, %%mm3 \n\t"\ - "paddw %%mm7, %%mm0 \n\t"\ - "paddw %%mm7, %%mm6 \n\t"\ - "paddw %%mm7, %%mm3 \n\t"\ - /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ - "packuswb %%mm0, %%mm2 \n\t"\ - "packuswb %%mm6, %%mm5 \n\t"\ - "packuswb %%mm3, %%mm4 \n\t"\ - -#define REAL_YSCALEYUV2PACKED(index, c) \ - "movq "CHR_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t"\ - "movq "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm1 \n\t"\ - "psraw $3, %%mm0 \n\t"\ - "psraw $3, %%mm1 \n\t"\ - "movq %%mm0, "CHR_MMX_FILTER_OFFSET"+8("#c") \n\t"\ - "movq %%mm1, "LUM_MMX_FILTER_OFFSET"+8("#c") \n\t"\ - "xor "#index", "#index" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ - "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ - "movq "AV_STRINGIFY(VOF)"(%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ - "movq "AV_STRINGIFY(VOF)"(%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ - "psubw %%mm3, %%mm2 \n\t" /* uvbuf0[eax] - uvbuf1[eax]*/\ - "psubw %%mm4, %%mm5 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048]*/\ - "movq "CHR_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t"\ - "pmulhw %%mm0, %%mm2 \n\t" /* (uvbuf0[eax] - uvbuf1[eax])uvalpha1>>16*/\ - "pmulhw %%mm0, %%mm5 \n\t" /* (uvbuf0[eax+2048] - uvbuf1[eax+2048])uvalpha1>>16*/\ - "psraw $7, %%mm3 \n\t" /* uvbuf0[eax] - uvbuf1[eax] >>4*/\ - "psraw $7, %%mm4 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048] >>4*/\ - "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax]uvalpha1 - uvbuf1[eax](1-uvalpha1)*/\ - "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048]uvalpha1 - uvbuf1[eax+2048](1-uvalpha1)*/\ - "movq (%0, "#index", 2), %%mm0 \n\t" /*buf0[eax]*/\ - "movq (%1, "#index", 2), %%mm1 \n\t" /*buf1[eax]*/\ - "movq 8(%0, "#index", 2), %%mm6 \n\t" /*buf0[eax]*/\ - "movq 8(%1, "#index", 2), %%mm7 \n\t" /*buf1[eax]*/\ - "psubw %%mm1, %%mm0 \n\t" /* buf0[eax] - buf1[eax]*/\ - "psubw %%mm7, %%mm6 \n\t" /* buf0[eax] - buf1[eax]*/\ - "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ - "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm6 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ - "psraw $7, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "psraw $7, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "paddw %%mm0, %%mm1 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ - 
"paddw %%mm6, %%mm7 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ - -#define YSCALEYUV2PACKED(index, c) REAL_YSCALEYUV2PACKED(index, c) - -#define REAL_YSCALEYUV2RGB_UV(index, c) \ - "xor "#index", "#index" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ - "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ - "movq "AV_STRINGIFY(VOF)"(%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ - "movq "AV_STRINGIFY(VOF)"(%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ - "psubw %%mm3, %%mm2 \n\t" /* uvbuf0[eax] - uvbuf1[eax]*/\ - "psubw %%mm4, %%mm5 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048]*/\ - "movq "CHR_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t"\ - "pmulhw %%mm0, %%mm2 \n\t" /* (uvbuf0[eax] - uvbuf1[eax])uvalpha1>>16*/\ - "pmulhw %%mm0, %%mm5 \n\t" /* (uvbuf0[eax+2048] - uvbuf1[eax+2048])uvalpha1>>16*/\ - "psraw $4, %%mm3 \n\t" /* uvbuf0[eax] - uvbuf1[eax] >>4*/\ - "psraw $4, %%mm4 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048] >>4*/\ - "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax]uvalpha1 - uvbuf1[eax](1-uvalpha1)*/\ - "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048]uvalpha1 - uvbuf1[eax+2048](1-uvalpha1)*/\ - "psubw "U_OFFSET"("#c"), %%mm3 \n\t" /* (U-128)8*/\ - "psubw "V_OFFSET"("#c"), %%mm4 \n\t" /* (V-128)8*/\ - "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ - "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ - "pmulhw "UG_COEFF"("#c"), %%mm3 \n\t"\ - "pmulhw "VG_COEFF"("#c"), %%mm4 \n\t"\ - /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ - -#define REAL_YSCALEYUV2RGB_YA(index, c, b1, b2) \ - "movq ("#b1", "#index", 2), %%mm0 \n\t" /*buf0[eax]*/\ - "movq ("#b2", "#index", 2), %%mm1 \n\t" /*buf1[eax]*/\ - "movq 8("#b1", "#index", 2), %%mm6 \n\t" /*buf0[eax]*/\ - "movq 8("#b2", "#index", 2), %%mm7 \n\t" /*buf1[eax]*/\ - "psubw %%mm1, %%mm0 \n\t" /* buf0[eax] - buf1[eax]*/\ - "psubw %%mm7, %%mm6 \n\t" /* buf0[eax] - buf1[eax]*/\ - "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ - "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm6 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ - "psraw $4, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "psraw $4, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "paddw %%mm0, %%mm1 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ - "paddw %%mm6, %%mm7 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ - -#define REAL_YSCALEYUV2RGB_COEFF(c) \ - "pmulhw "UB_COEFF"("#c"), %%mm2 \n\t"\ - "pmulhw "VR_COEFF"("#c"), %%mm5 \n\t"\ - "psubw "Y_OFFSET"("#c"), %%mm1 \n\t" /* 8(Y-16)*/\ - "psubw "Y_OFFSET"("#c"), %%mm7 \n\t" /* 8(Y-16)*/\ - "pmulhw "Y_COEFF"("#c"), %%mm1 \n\t"\ - "pmulhw "Y_COEFF"("#c"), %%mm7 \n\t"\ - /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ - "paddw %%mm3, %%mm4 \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "movq %%mm5, %%mm6 \n\t"\ - "movq %%mm4, %%mm3 \n\t"\ - "punpcklwd %%mm2, %%mm2 \n\t"\ - "punpcklwd %%mm5, %%mm5 \n\t"\ - "punpcklwd %%mm4, %%mm4 \n\t"\ - "paddw %%mm1, %%mm2 \n\t"\ - "paddw %%mm1, %%mm5 \n\t"\ - "paddw %%mm1, %%mm4 \n\t"\ - "punpckhwd %%mm0, %%mm0 \n\t"\ - "punpckhwd %%mm6, %%mm6 \n\t"\ - "punpckhwd %%mm3, %%mm3 \n\t"\ - "paddw %%mm7, %%mm0 \n\t"\ - "paddw %%mm7, %%mm6 \n\t"\ - "paddw %%mm7, %%mm3 \n\t"\ - /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ - "packuswb %%mm0, %%mm2 \n\t"\ - "packuswb %%mm6, %%mm5 \n\t"\ - "packuswb %%mm3, %%mm4 \n\t"\ - -#define YSCALEYUV2RGB_YA(index, c, b1, b2) REAL_YSCALEYUV2RGB_YA(index, c, b1, b2) - -#define YSCALEYUV2RGB(index, c) \ - REAL_YSCALEYUV2RGB_UV(index, c) \ - REAL_YSCALEYUV2RGB_YA(index, c, %0, 
%1) \ - REAL_YSCALEYUV2RGB_COEFF(c) - -#define REAL_YSCALEYUV2PACKED1(index, c) \ - "xor "#index", "#index" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq (%2, "#index"), %%mm3 \n\t" /* uvbuf0[eax]*/\ - "movq "AV_STRINGIFY(VOF)"(%2, "#index"), %%mm4 \n\t" /* uvbuf0[eax+2048]*/\ - "psraw $7, %%mm3 \n\t" \ - "psraw $7, %%mm4 \n\t" \ - "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ - "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ - "psraw $7, %%mm1 \n\t" \ - "psraw $7, %%mm7 \n\t" \ - -#define YSCALEYUV2PACKED1(index, c) REAL_YSCALEYUV2PACKED1(index, c) - -#define REAL_YSCALEYUV2RGB1(index, c) \ - "xor "#index", "#index" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq (%2, "#index"), %%mm3 \n\t" /* uvbuf0[eax]*/\ - "movq "AV_STRINGIFY(VOF)"(%2, "#index"), %%mm4 \n\t" /* uvbuf0[eax+2048]*/\ - "psraw $4, %%mm3 \n\t" /* uvbuf0[eax] - uvbuf1[eax] >>4*/\ - "psraw $4, %%mm4 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048] >>4*/\ - "psubw "U_OFFSET"("#c"), %%mm3 \n\t" /* (U-128)8*/\ - "psubw "V_OFFSET"("#c"), %%mm4 \n\t" /* (V-128)8*/\ - "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ - "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ - "pmulhw "UG_COEFF"("#c"), %%mm3 \n\t"\ - "pmulhw "VG_COEFF"("#c"), %%mm4 \n\t"\ - /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ - "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ - "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ - "psraw $4, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "psraw $4, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "pmulhw "UB_COEFF"("#c"), %%mm2 \n\t"\ - "pmulhw "VR_COEFF"("#c"), %%mm5 \n\t"\ - "psubw "Y_OFFSET"("#c"), %%mm1 \n\t" /* 8(Y-16)*/\ - "psubw "Y_OFFSET"("#c"), %%mm7 \n\t" /* 8(Y-16)*/\ - "pmulhw "Y_COEFF"("#c"), %%mm1 \n\t"\ - "pmulhw "Y_COEFF"("#c"), %%mm7 \n\t"\ - /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ - "paddw %%mm3, %%mm4 \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "movq %%mm5, %%mm6 \n\t"\ - "movq %%mm4, %%mm3 \n\t"\ - "punpcklwd %%mm2, %%mm2 \n\t"\ - "punpcklwd %%mm5, %%mm5 \n\t"\ - "punpcklwd %%mm4, %%mm4 \n\t"\ - "paddw %%mm1, %%mm2 \n\t"\ - "paddw %%mm1, %%mm5 \n\t"\ - "paddw %%mm1, %%mm4 \n\t"\ - "punpckhwd %%mm0, %%mm0 \n\t"\ - "punpckhwd %%mm6, %%mm6 \n\t"\ - "punpckhwd %%mm3, %%mm3 \n\t"\ - "paddw %%mm7, %%mm0 \n\t"\ - "paddw %%mm7, %%mm6 \n\t"\ - "paddw %%mm7, %%mm3 \n\t"\ - /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ - "packuswb %%mm0, %%mm2 \n\t"\ - "packuswb %%mm6, %%mm5 \n\t"\ - "packuswb %%mm3, %%mm4 \n\t"\ - -#define YSCALEYUV2RGB1(index, c) REAL_YSCALEYUV2RGB1(index, c) - -#define REAL_YSCALEYUV2PACKED1b(index, c) \ - "xor "#index", "#index" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ - "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ - "movq "AV_STRINGIFY(VOF)"(%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ - "movq "AV_STRINGIFY(VOF)"(%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ - "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax] + uvbuf1[eax]*/\ - "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048] + uvbuf1[eax+2048]*/\ - "psrlw $8, %%mm3 \n\t" \ - "psrlw $8, %%mm4 \n\t" \ - "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ - "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ - "psraw $7, %%mm1 \n\t" \ - "psraw $7, %%mm7 \n\t" -#define YSCALEYUV2PACKED1b(index, c) REAL_YSCALEYUV2PACKED1b(index, c) - -// do vertical chrominance interpolation -#define REAL_YSCALEYUV2RGB1b(index, c) \ - "xor "#index", "#index" \n\t"\ - ".p2align 4 \n\t"\ - "1: \n\t"\ - "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ - "movq (%3, "#index"), %%mm3 
\n\t" /* uvbuf1[eax]*/\ - "movq "AV_STRINGIFY(VOF)"(%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ - "movq "AV_STRINGIFY(VOF)"(%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ - "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax] + uvbuf1[eax]*/\ - "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048] + uvbuf1[eax+2048]*/\ - "psrlw $5, %%mm3 \n\t" /*FIXME might overflow*/\ - "psrlw $5, %%mm4 \n\t" /*FIXME might overflow*/\ - "psubw "U_OFFSET"("#c"), %%mm3 \n\t" /* (U-128)8*/\ - "psubw "V_OFFSET"("#c"), %%mm4 \n\t" /* (V-128)8*/\ - "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ - "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ - "pmulhw "UG_COEFF"("#c"), %%mm3 \n\t"\ - "pmulhw "VG_COEFF"("#c"), %%mm4 \n\t"\ - /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ - "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ - "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ - "psraw $4, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "psraw $4, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ - "pmulhw "UB_COEFF"("#c"), %%mm2 \n\t"\ - "pmulhw "VR_COEFF"("#c"), %%mm5 \n\t"\ - "psubw "Y_OFFSET"("#c"), %%mm1 \n\t" /* 8(Y-16)*/\ - "psubw "Y_OFFSET"("#c"), %%mm7 \n\t" /* 8(Y-16)*/\ - "pmulhw "Y_COEFF"("#c"), %%mm1 \n\t"\ - "pmulhw "Y_COEFF"("#c"), %%mm7 \n\t"\ - /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ - "paddw %%mm3, %%mm4 \n\t"\ - "movq %%mm2, %%mm0 \n\t"\ - "movq %%mm5, %%mm6 \n\t"\ - "movq %%mm4, %%mm3 \n\t"\ - "punpcklwd %%mm2, %%mm2 \n\t"\ - "punpcklwd %%mm5, %%mm5 \n\t"\ - "punpcklwd %%mm4, %%mm4 \n\t"\ - "paddw %%mm1, %%mm2 \n\t"\ - "paddw %%mm1, %%mm5 \n\t"\ - "paddw %%mm1, %%mm4 \n\t"\ - "punpckhwd %%mm0, %%mm0 \n\t"\ - "punpckhwd %%mm6, %%mm6 \n\t"\ - "punpckhwd %%mm3, %%mm3 \n\t"\ - "paddw %%mm7, %%mm0 \n\t"\ - "paddw %%mm7, %%mm6 \n\t"\ - "paddw %%mm7, %%mm3 \n\t"\ - /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ - "packuswb %%mm0, %%mm2 \n\t"\ - "packuswb %%mm6, %%mm5 \n\t"\ - "packuswb %%mm3, %%mm4 \n\t"\ - -#define YSCALEYUV2RGB1b(index, c) REAL_YSCALEYUV2RGB1b(index, c) - -#define REAL_YSCALEYUV2RGB1_ALPHA(index) \ - "movq (%1, "#index", 2), %%mm7 \n\t" /* abuf0[index ] */\ - "movq 8(%1, "#index", 2), %%mm1 \n\t" /* abuf0[index+4] */\ - "psraw $7, %%mm7 \n\t" /* abuf0[index ] >>7 */\ - "psraw $7, %%mm1 \n\t" /* abuf0[index+4] >>7 */\ - "packuswb %%mm1, %%mm7 \n\t" -#define YSCALEYUV2RGB1_ALPHA(index) REAL_YSCALEYUV2RGB1_ALPHA(index) - -#define REAL_WRITEBGR32(dst, dstw, index, b, g, r, a, q0, q2, q3, t) \ - "movq "#b", "#q2" \n\t" /* B */\ - "movq "#r", "#t" \n\t" /* R */\ - "punpcklbw "#g", "#b" \n\t" /* GBGBGBGB 0 */\ - "punpcklbw "#a", "#r" \n\t" /* ARARARAR 0 */\ - "punpckhbw "#g", "#q2" \n\t" /* GBGBGBGB 2 */\ - "punpckhbw "#a", "#t" \n\t" /* ARARARAR 2 */\ - "movq "#b", "#q0" \n\t" /* GBGBGBGB 0 */\ - "movq "#q2", "#q3" \n\t" /* GBGBGBGB 2 */\ - "punpcklwd "#r", "#q0" \n\t" /* ARGBARGB 0 */\ - "punpckhwd "#r", "#b" \n\t" /* ARGBARGB 1 */\ - "punpcklwd "#t", "#q2" \n\t" /* ARGBARGB 2 */\ - "punpckhwd "#t", "#q3" \n\t" /* ARGBARGB 3 */\ -\ - MOVNTQ( q0, (dst, index, 4))\ - MOVNTQ( b, 8(dst, index, 4))\ - MOVNTQ( q2, 16(dst, index, 4))\ - MOVNTQ( q3, 24(dst, index, 4))\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" -#define WRITEBGR32(dst, dstw, index, b, g, r, a, q0, q2, q3, t) REAL_WRITEBGR32(dst, dstw, index, b, g, r, a, q0, q2, q3, t) - -#define REAL_WRITERGB16(dst, dstw, index) \ - "pand "MANGLE(bF8)", %%mm2 \n\t" /* B */\ - "pand "MANGLE(bFC)", %%mm4 \n\t" /* G */\ - "pand "MANGLE(bF8)", %%mm5 \n\t" /* R */\ - "psrlq $3, %%mm2 \n\t"\ -\ - "movq %%mm2, %%mm1 \n\t"\ 
- "movq %%mm4, %%mm3 \n\t"\ -\ - "punpcklbw %%mm7, %%mm3 \n\t"\ - "punpcklbw %%mm5, %%mm2 \n\t"\ - "punpckhbw %%mm7, %%mm4 \n\t"\ - "punpckhbw %%mm5, %%mm1 \n\t"\ -\ - "psllq $3, %%mm3 \n\t"\ - "psllq $3, %%mm4 \n\t"\ -\ - "por %%mm3, %%mm2 \n\t"\ - "por %%mm4, %%mm1 \n\t"\ -\ - MOVNTQ(%%mm2, (dst, index, 2))\ - MOVNTQ(%%mm1, 8(dst, index, 2))\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" -#define WRITERGB16(dst, dstw, index) REAL_WRITERGB16(dst, dstw, index) - -#define REAL_WRITERGB15(dst, dstw, index) \ - "pand "MANGLE(bF8)", %%mm2 \n\t" /* B */\ - "pand "MANGLE(bF8)", %%mm4 \n\t" /* G */\ - "pand "MANGLE(bF8)", %%mm5 \n\t" /* R */\ - "psrlq $3, %%mm2 \n\t"\ - "psrlq $1, %%mm5 \n\t"\ -\ - "movq %%mm2, %%mm1 \n\t"\ - "movq %%mm4, %%mm3 \n\t"\ -\ - "punpcklbw %%mm7, %%mm3 \n\t"\ - "punpcklbw %%mm5, %%mm2 \n\t"\ - "punpckhbw %%mm7, %%mm4 \n\t"\ - "punpckhbw %%mm5, %%mm1 \n\t"\ -\ - "psllq $2, %%mm3 \n\t"\ - "psllq $2, %%mm4 \n\t"\ -\ - "por %%mm3, %%mm2 \n\t"\ - "por %%mm4, %%mm1 \n\t"\ -\ - MOVNTQ(%%mm2, (dst, index, 2))\ - MOVNTQ(%%mm1, 8(dst, index, 2))\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" -#define WRITERGB15(dst, dstw, index) REAL_WRITERGB15(dst, dstw, index) - -#define WRITEBGR24OLD(dst, dstw, index) \ - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */\ - "movq %%mm2, %%mm1 \n\t" /* B */\ - "movq %%mm5, %%mm6 \n\t" /* R */\ - "punpcklbw %%mm4, %%mm2 \n\t" /* GBGBGBGB 0 */\ - "punpcklbw %%mm7, %%mm5 \n\t" /* 0R0R0R0R 0 */\ - "punpckhbw %%mm4, %%mm1 \n\t" /* GBGBGBGB 2 */\ - "punpckhbw %%mm7, %%mm6 \n\t" /* 0R0R0R0R 2 */\ - "movq %%mm2, %%mm0 \n\t" /* GBGBGBGB 0 */\ - "movq %%mm1, %%mm3 \n\t" /* GBGBGBGB 2 */\ - "punpcklwd %%mm5, %%mm0 \n\t" /* 0RGB0RGB 0 */\ - "punpckhwd %%mm5, %%mm2 \n\t" /* 0RGB0RGB 1 */\ - "punpcklwd %%mm6, %%mm1 \n\t" /* 0RGB0RGB 2 */\ - "punpckhwd %%mm6, %%mm3 \n\t" /* 0RGB0RGB 3 */\ -\ - "movq %%mm0, %%mm4 \n\t" /* 0RGB0RGB 0 */\ - "psrlq $8, %%mm0 \n\t" /* 00RGB0RG 0 */\ - "pand "MANGLE(bm00000111)", %%mm4 \n\t" /* 00000RGB 0 */\ - "pand "MANGLE(bm11111000)", %%mm0 \n\t" /* 00RGB000 0.5 */\ - "por %%mm4, %%mm0 \n\t" /* 00RGBRGB 0 */\ - "movq %%mm2, %%mm4 \n\t" /* 0RGB0RGB 1 */\ - "psllq $48, %%mm2 \n\t" /* GB000000 1 */\ - "por %%mm2, %%mm0 \n\t" /* GBRGBRGB 0 */\ -\ - "movq %%mm4, %%mm2 \n\t" /* 0RGB0RGB 1 */\ - "psrld $16, %%mm4 \n\t" /* 000R000R 1 */\ - "psrlq $24, %%mm2 \n\t" /* 0000RGB0 1.5 */\ - "por %%mm4, %%mm2 \n\t" /* 000RRGBR 1 */\ - "pand "MANGLE(bm00001111)", %%mm2 \n\t" /* 0000RGBR 1 */\ - "movq %%mm1, %%mm4 \n\t" /* 0RGB0RGB 2 */\ - "psrlq $8, %%mm1 \n\t" /* 00RGB0RG 2 */\ - "pand "MANGLE(bm00000111)", %%mm4 \n\t" /* 00000RGB 2 */\ - "pand "MANGLE(bm11111000)", %%mm1 \n\t" /* 00RGB000 2.5 */\ - "por %%mm4, %%mm1 \n\t" /* 00RGBRGB 2 */\ - "movq %%mm1, %%mm4 \n\t" /* 00RGBRGB 2 */\ - "psllq $32, %%mm1 \n\t" /* BRGB0000 2 */\ - "por %%mm1, %%mm2 \n\t" /* BRGBRGBR 1 */\ -\ - "psrlq $32, %%mm4 \n\t" /* 000000RG 2.5 */\ - "movq %%mm3, %%mm5 \n\t" /* 0RGB0RGB 3 */\ - "psrlq $8, %%mm3 \n\t" /* 00RGB0RG 3 */\ - "pand "MANGLE(bm00000111)", %%mm5 \n\t" /* 00000RGB 3 */\ - "pand "MANGLE(bm11111000)", %%mm3 \n\t" /* 00RGB000 3.5 */\ - "por %%mm5, %%mm3 \n\t" /* 00RGBRGB 3 */\ - "psllq $16, %%mm3 \n\t" /* RGBRGB00 3 */\ - "por %%mm4, %%mm3 \n\t" /* RGBRGBRG 2.5 */\ -\ - MOVNTQ(%%mm0, (dst))\ - MOVNTQ(%%mm2, 8(dst))\ - MOVNTQ(%%mm3, 16(dst))\ - "add $24, "#dst" \n\t"\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" - -#define WRITEBGR24MMX(dst, dstw, index) \ - /* 
mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */\ - "movq %%mm2, %%mm1 \n\t" /* B */\ - "movq %%mm5, %%mm6 \n\t" /* R */\ - "punpcklbw %%mm4, %%mm2 \n\t" /* GBGBGBGB 0 */\ - "punpcklbw %%mm7, %%mm5 \n\t" /* 0R0R0R0R 0 */\ - "punpckhbw %%mm4, %%mm1 \n\t" /* GBGBGBGB 2 */\ - "punpckhbw %%mm7, %%mm6 \n\t" /* 0R0R0R0R 2 */\ - "movq %%mm2, %%mm0 \n\t" /* GBGBGBGB 0 */\ - "movq %%mm1, %%mm3 \n\t" /* GBGBGBGB 2 */\ - "punpcklwd %%mm5, %%mm0 \n\t" /* 0RGB0RGB 0 */\ - "punpckhwd %%mm5, %%mm2 \n\t" /* 0RGB0RGB 1 */\ - "punpcklwd %%mm6, %%mm1 \n\t" /* 0RGB0RGB 2 */\ - "punpckhwd %%mm6, %%mm3 \n\t" /* 0RGB0RGB 3 */\ -\ - "movq %%mm0, %%mm4 \n\t" /* 0RGB0RGB 0 */\ - "movq %%mm2, %%mm6 \n\t" /* 0RGB0RGB 1 */\ - "movq %%mm1, %%mm5 \n\t" /* 0RGB0RGB 2 */\ - "movq %%mm3, %%mm7 \n\t" /* 0RGB0RGB 3 */\ -\ - "psllq $40, %%mm0 \n\t" /* RGB00000 0 */\ - "psllq $40, %%mm2 \n\t" /* RGB00000 1 */\ - "psllq $40, %%mm1 \n\t" /* RGB00000 2 */\ - "psllq $40, %%mm3 \n\t" /* RGB00000 3 */\ -\ - "punpckhdq %%mm4, %%mm0 \n\t" /* 0RGBRGB0 0 */\ - "punpckhdq %%mm6, %%mm2 \n\t" /* 0RGBRGB0 1 */\ - "punpckhdq %%mm5, %%mm1 \n\t" /* 0RGBRGB0 2 */\ - "punpckhdq %%mm7, %%mm3 \n\t" /* 0RGBRGB0 3 */\ -\ - "psrlq $8, %%mm0 \n\t" /* 00RGBRGB 0 */\ - "movq %%mm2, %%mm6 \n\t" /* 0RGBRGB0 1 */\ - "psllq $40, %%mm2 \n\t" /* GB000000 1 */\ - "por %%mm2, %%mm0 \n\t" /* GBRGBRGB 0 */\ - MOVNTQ(%%mm0, (dst))\ -\ - "psrlq $24, %%mm6 \n\t" /* 0000RGBR 1 */\ - "movq %%mm1, %%mm5 \n\t" /* 0RGBRGB0 2 */\ - "psllq $24, %%mm1 \n\t" /* BRGB0000 2 */\ - "por %%mm1, %%mm6 \n\t" /* BRGBRGBR 1 */\ - MOVNTQ(%%mm6, 8(dst))\ -\ - "psrlq $40, %%mm5 \n\t" /* 000000RG 2 */\ - "psllq $8, %%mm3 \n\t" /* RGBRGB00 3 */\ - "por %%mm3, %%mm5 \n\t" /* RGBRGBRG 2 */\ - MOVNTQ(%%mm5, 16(dst))\ -\ - "add $24, "#dst" \n\t"\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" - -#define WRITEBGR24MMX2(dst, dstw, index) \ - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */\ - "movq "MANGLE(ff_M24A)", %%mm0 \n\t"\ - "movq "MANGLE(ff_M24C)", %%mm7 \n\t"\ - "pshufw $0x50, %%mm2, %%mm1 \n\t" /* B3 B2 B3 B2 B1 B0 B1 B0 */\ - "pshufw $0x50, %%mm4, %%mm3 \n\t" /* G3 G2 G3 G2 G1 G0 G1 G0 */\ - "pshufw $0x00, %%mm5, %%mm6 \n\t" /* R1 R0 R1 R0 R1 R0 R1 R0 */\ -\ - "pand %%mm0, %%mm1 \n\t" /* B2 B1 B0 */\ - "pand %%mm0, %%mm3 \n\t" /* G2 G1 G0 */\ - "pand %%mm7, %%mm6 \n\t" /* R1 R0 */\ -\ - "psllq $8, %%mm3 \n\t" /* G2 G1 G0 */\ - "por %%mm1, %%mm6 \n\t"\ - "por %%mm3, %%mm6 \n\t"\ - MOVNTQ(%%mm6, (dst))\ -\ - "psrlq $8, %%mm4 \n\t" /* 00 G7 G6 G5 G4 G3 G2 G1 */\ - "pshufw $0xA5, %%mm2, %%mm1 \n\t" /* B5 B4 B5 B4 B3 B2 B3 B2 */\ - "pshufw $0x55, %%mm4, %%mm3 \n\t" /* G4 G3 G4 G3 G4 G3 G4 G3 */\ - "pshufw $0xA5, %%mm5, %%mm6 \n\t" /* R5 R4 R5 R4 R3 R2 R3 R2 */\ -\ - "pand "MANGLE(ff_M24B)", %%mm1 \n\t" /* B5 B4 B3 */\ - "pand %%mm7, %%mm3 \n\t" /* G4 G3 */\ - "pand %%mm0, %%mm6 \n\t" /* R4 R3 R2 */\ -\ - "por %%mm1, %%mm3 \n\t" /* B5 G4 B4 G3 B3 */\ - "por %%mm3, %%mm6 \n\t"\ - MOVNTQ(%%mm6, 8(dst))\ -\ - "pshufw $0xFF, %%mm2, %%mm1 \n\t" /* B7 B6 B7 B6 B7 B6 B6 B7 */\ - "pshufw $0xFA, %%mm4, %%mm3 \n\t" /* 00 G7 00 G7 G6 G5 G6 G5 */\ - "pshufw $0xFA, %%mm5, %%mm6 \n\t" /* R7 R6 R7 R6 R5 R4 R5 R4 */\ -\ - "pand %%mm7, %%mm1 \n\t" /* B7 B6 */\ - "pand %%mm0, %%mm3 \n\t" /* G7 G6 G5 */\ - "pand "MANGLE(ff_M24B)", %%mm6 \n\t" /* R7 R6 R5 */\ -\ - "por %%mm1, %%mm3 \n\t"\ - "por %%mm3, %%mm6 \n\t"\ - MOVNTQ(%%mm6, 16(dst))\ -\ - "add $24, "#dst" \n\t"\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" - -#if COMPILE_TEMPLATE_MMX2 -#undef WRITEBGR24 -#define 
WRITEBGR24(dst, dstw, index) WRITEBGR24MMX2(dst, dstw, index) -#else -#undef WRITEBGR24 -#define WRITEBGR24(dst, dstw, index) WRITEBGR24MMX(dst, dstw, index) -#endif - -#define REAL_WRITEYUY2(dst, dstw, index) \ - "packuswb %%mm3, %%mm3 \n\t"\ - "packuswb %%mm4, %%mm4 \n\t"\ - "packuswb %%mm7, %%mm1 \n\t"\ - "punpcklbw %%mm4, %%mm3 \n\t"\ - "movq %%mm1, %%mm7 \n\t"\ - "punpcklbw %%mm3, %%mm1 \n\t"\ - "punpckhbw %%mm3, %%mm7 \n\t"\ -\ - MOVNTQ(%%mm1, (dst, index, 2))\ - MOVNTQ(%%mm7, 8(dst, index, 2))\ -\ - "add $8, "#index" \n\t"\ - "cmp "#dstw", "#index" \n\t"\ - " jb 1b \n\t" -#define WRITEYUY2(dst, dstw, index) REAL_WRITEYUY2(dst, dstw, index) - - -static inline void RENAME(yuv2yuvX)(SwsContext *c, const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, const int16_t **alpSrc, - uint8_t *dest, uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, long dstW, long chrDstW) +static inline void yuv2yuvX_c(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, const uint8_t *lumDither, const uint8_t *chrDither) { -#if COMPILE_TEMPLATE_MMX - if(!(c->flags & SWS_BITEXACT)) { - if (c->flags & SWS_ACCURATE_RND) { - if (uDest) { - YSCALEYUV2YV12X_ACCURATE( "0", CHR_MMX_FILTER_OFFSET, uDest, chrDstW) - YSCALEYUV2YV12X_ACCURATE(AV_STRINGIFY(VOF), CHR_MMX_FILTER_OFFSET, vDest, chrDstW) - } - if (CONFIG_SWSCALE_ALPHA && aDest) { - YSCALEYUV2YV12X_ACCURATE( "0", ALP_MMX_FILTER_OFFSET, aDest, dstW) - } - - YSCALEYUV2YV12X_ACCURATE("0", LUM_MMX_FILTER_OFFSET, dest, dstW) - } else { - if (uDest) { - YSCALEYUV2YV12X( "0", CHR_MMX_FILTER_OFFSET, uDest, chrDstW) - YSCALEYUV2YV12X(AV_STRINGIFY(VOF), CHR_MMX_FILTER_OFFSET, vDest, chrDstW) - } - if (CONFIG_SWSCALE_ALPHA && aDest) { - YSCALEYUV2YV12X( "0", ALP_MMX_FILTER_OFFSET, aDest, dstW) - } - - YSCALEYUV2YV12X("0", LUM_MMX_FILTER_OFFSET, dest, dstW) - } - return; - } -#endif -#if COMPILE_TEMPLATE_ALTIVEC - yuv2yuvX_altivec_real(lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - dest, uDest, vDest, dstW, chrDstW); -#else //COMPILE_TEMPLATE_ALTIVEC yuv2yuvXinC(lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - alpSrc, dest, uDest, vDest, aDest, dstW, chrDstW); -#endif //!COMPILE_TEMPLATE_ALTIVEC + chrFilter, chrUSrc, chrVSrc, chrFilterSize, + alpSrc, dest, uDest, vDest, aDest, dstW, chrDstW, lumDither, chrDither); } -static inline void RENAME(yuv2nv12X)(SwsContext *c, const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - uint8_t *dest, uint8_t *uDest, int dstW, int chrDstW, enum PixelFormat dstFormat) +static inline void yuv2nv12X_c(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, uint8_t *dest, uint8_t *uDest, + int dstW, int chrDstW, enum PixelFormat dstFormat, const uint8_t *dither, const uint8_t *chrDither) { yuv2nv12XinC(lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - dest, uDest, dstW, chrDstW, dstFormat); + chrFilter, chrUSrc, chrVSrc, chrFilterSize, + dest, uDest, dstW, chrDstW, dstFormat, dither, chrDither); } -static inline void 
RENAME(yuv2yuv1)(SwsContext *c, const int16_t *lumSrc, const int16_t *chrSrc, const int16_t *alpSrc, - uint8_t *dest, uint8_t *uDest, uint8_t *vDest, uint8_t *aDest, long dstW, long chrDstW) +static inline void yuv2yuv1_c(SwsContext *c, const int16_t *lumSrc, + const int16_t *chrUSrc, const int16_t *chrVSrc, + const int16_t *alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, const uint8_t *lumDither, const uint8_t *chrDither) { int i; -#if COMPILE_TEMPLATE_MMX - if(!(c->flags & SWS_BITEXACT)) { - long p= 4; - const int16_t *src[4]= {alpSrc + dstW, lumSrc + dstW, chrSrc + chrDstW, chrSrc + VOFW + chrDstW}; - uint8_t *dst[4]= {aDest, dest, uDest, vDest}; - x86_reg counter[4]= {dstW, dstW, chrDstW, chrDstW}; - if (c->flags & SWS_ACCURATE_RND) { - while(p--) { - if (dst[p]) { - __asm__ volatile( - YSCALEYUV2YV121_ACCURATE - :: "r" (src[p]), "r" (dst[p] + counter[p]), - "g" (-counter[p]) - : "%"REG_a - ); - } - } - } else { - while(p--) { - if (dst[p]) { - __asm__ volatile( - YSCALEYUV2YV121 - :: "r" (src[p]), "r" (dst[p] + counter[p]), - "g" (-counter[p]) - : "%"REG_a - ); - } - } - } - return; - } -#endif for (i=0; i<dstW; i++) { - int val= (lumSrc[i]+64)>>7; - - if (val&256) { - if (val<0) val=0; - else val=255; - } - - dest[i]= val; + int val= (lumSrc[i]+lumDither[i&7])>>7; + dest[i]= av_clip_uint8(val); } if (uDest) for (i=0; i<chrDstW; i++) { - int u=(chrSrc[i ]+64)>>7; - int v=(chrSrc[i + VOFW]+64)>>7; - - if ((u|v)&256) { - if (u<0) u=0; - else if (u>255) u=255; - if (v<0) v=0; - else if (v>255) v=255; - } - - uDest[i]= u; - vDest[i]= v; + int u=(chrUSrc[i]+chrDither[i&7])>>7; + int v=(chrVSrc[i]+chrDither[(i+3)&7])>>7; + uDest[i]= av_clip_uint8(u); + vDest[i]= av_clip_uint8(v); } if (CONFIG_SWSCALE_ALPHA && aDest) for (i=0; i<dstW; i++) { - int val= (alpSrc[i]+64)>>7; + int val= (alpSrc[i]+lumDither[i&7])>>7; aDest[i]= av_clip_uint8(val); } } @@ -1019,343 +75,44 @@ static inline void RENAME(yuv2yuv1)(SwsContext *c, const int16_t *lumSrc, const /** * vertical scale YV12 to RGB */ -static inline void RENAME(yuv2packedX)(SwsContext *c, const int16_t *lumFilter, const int16_t **lumSrc, int lumFilterSize, - const int16_t *chrFilter, const int16_t **chrSrc, int chrFilterSize, - const int16_t **alpSrc, uint8_t *dest, long dstW, long dstY) +static inline void yuv2packedX_c(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) { -#if COMPILE_TEMPLATE_MMX - x86_reg dummy=0; - x86_reg dstW_reg = dstW; - if(!(c->flags & SWS_BITEXACT)) { - if (c->flags & SWS_ACCURATE_RND) { - switch(c->dstFormat) { - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { - YSCALEYUV2PACKEDX_ACCURATE - YSCALEYUV2RGBX - "movq %%mm2, "U_TEMP"(%0) \n\t" - "movq %%mm4, "V_TEMP"(%0) \n\t" - "movq %%mm5, "Y_TEMP"(%0) \n\t" - YSCALEYUV2PACKEDX_ACCURATE_YA(ALP_MMX_FILTER_OFFSET) - "movq "Y_TEMP"(%0), %%mm5 \n\t" - "psraw $3, %%mm1 \n\t" - "psraw $3, %%mm7 \n\t" - "packuswb %%mm7, %%mm1 \n\t" - WRITEBGR32(%4, %5, %%REGa, %%mm3, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm2, %%mm6) - - YSCALEYUV2PACKEDX_END - } else { - YSCALEYUV2PACKEDX_ACCURATE - YSCALEYUV2RGBX - "pcmpeqd %%mm7, %%mm7 \n\t" - WRITEBGR32(%4, %5, %%REGa, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - - YSCALEYUV2PACKEDX_END - } - return; - case PIX_FMT_BGR24: - YSCALEYUV2PACKEDX_ACCURATE - YSCALEYUV2RGBX - 
"pxor %%mm7, %%mm7 \n\t" - "lea (%%"REG_a", %%"REG_a", 2), %%"REG_c"\n\t" //FIXME optimize - "add %4, %%"REG_c" \n\t" - WRITEBGR24(%%REGc, %5, %%REGa) - - - :: "r" (&c->redDither), - "m" (dummy), "m" (dummy), "m" (dummy), - "r" (dest), "m" (dstW_reg) - : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S - ); - return; - case PIX_FMT_RGB555: - YSCALEYUV2PACKEDX_ACCURATE - YSCALEYUV2RGBX - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%0), %%mm2\n\t" - "paddusb "GREEN_DITHER"(%0), %%mm4\n\t" - "paddusb "RED_DITHER"(%0), %%mm5\n\t" -#endif - - WRITERGB15(%4, %5, %%REGa) - YSCALEYUV2PACKEDX_END - return; - case PIX_FMT_RGB565: - YSCALEYUV2PACKEDX_ACCURATE - YSCALEYUV2RGBX - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%0), %%mm2\n\t" - "paddusb "GREEN_DITHER"(%0), %%mm4\n\t" - "paddusb "RED_DITHER"(%0), %%mm5\n\t" -#endif - - WRITERGB16(%4, %5, %%REGa) - YSCALEYUV2PACKEDX_END - return; - case PIX_FMT_YUYV422: - YSCALEYUV2PACKEDX_ACCURATE - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ - - "psraw $3, %%mm3 \n\t" - "psraw $3, %%mm4 \n\t" - "psraw $3, %%mm1 \n\t" - "psraw $3, %%mm7 \n\t" - WRITEYUY2(%4, %5, %%REGa) - YSCALEYUV2PACKEDX_END - return; - } - } else { - switch(c->dstFormat) { - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { - YSCALEYUV2PACKEDX - YSCALEYUV2RGBX - YSCALEYUV2PACKEDX_YA(ALP_MMX_FILTER_OFFSET, %%mm0, %%mm3, %%mm6, %%mm1, %%mm7) - "psraw $3, %%mm1 \n\t" - "psraw $3, %%mm7 \n\t" - "packuswb %%mm7, %%mm1 \n\t" - WRITEBGR32(%4, %5, %%REGa, %%mm2, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm3, %%mm6) - YSCALEYUV2PACKEDX_END - } else { - YSCALEYUV2PACKEDX - YSCALEYUV2RGBX - "pcmpeqd %%mm7, %%mm7 \n\t" - WRITEBGR32(%4, %5, %%REGa, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - YSCALEYUV2PACKEDX_END - } - return; - case PIX_FMT_BGR24: - YSCALEYUV2PACKEDX - YSCALEYUV2RGBX - "pxor %%mm7, %%mm7 \n\t" - "lea (%%"REG_a", %%"REG_a", 2), %%"REG_c" \n\t" //FIXME optimize - "add %4, %%"REG_c" \n\t" - WRITEBGR24(%%REGc, %5, %%REGa) - - :: "r" (&c->redDither), - "m" (dummy), "m" (dummy), "m" (dummy), - "r" (dest), "m" (dstW_reg) - : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S - ); - return; - case PIX_FMT_RGB555: - YSCALEYUV2PACKEDX - YSCALEYUV2RGBX - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%0), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%0), %%mm4 \n\t" - "paddusb "RED_DITHER"(%0), %%mm5 \n\t" -#endif - - WRITERGB15(%4, %5, %%REGa) - YSCALEYUV2PACKEDX_END - return; - case PIX_FMT_RGB565: - YSCALEYUV2PACKEDX - YSCALEYUV2RGBX - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%0), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%0), %%mm4 \n\t" - "paddusb "RED_DITHER"(%0), %%mm5 \n\t" -#endif - - WRITERGB16(%4, %5, %%REGa) - YSCALEYUV2PACKEDX_END - return; - case PIX_FMT_YUYV422: - YSCALEYUV2PACKEDX - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ - - "psraw $3, %%mm3 \n\t" - "psraw $3, %%mm4 \n\t" - "psraw $3, %%mm1 \n\t" - "psraw $3, %%mm7 \n\t" - WRITEYUY2(%4, %5, %%REGa) - YSCALEYUV2PACKEDX_END - return; - } - } - } -#endif /* COMPILE_TEMPLATE_MMX */ -#if COMPILE_TEMPLATE_ALTIVEC - /* The following list of supported dstFormat values should - match what's found in the body of ff_yuv2packedX_altivec() */ - if (!(c->flags & SWS_BITEXACT) && !c->alpPixBuf && - (c->dstFormat==PIX_FMT_ABGR || c->dstFormat==PIX_FMT_BGRA || - c->dstFormat==PIX_FMT_BGR24 || 
c->dstFormat==PIX_FMT_RGB24 || - c->dstFormat==PIX_FMT_RGBA || c->dstFormat==PIX_FMT_ARGB)) - ff_yuv2packedX_altivec(c, lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, - dest, dstW, dstY); - else -#endif yuv2packedXinC(c, lumFilter, lumSrc, lumFilterSize, - chrFilter, chrSrc, chrFilterSize, + chrFilter, chrUSrc, chrVSrc, chrFilterSize, alpSrc, dest, dstW, dstY); } /** * vertical bilinear scale YV12 to RGB */ -static inline void RENAME(yuv2packed2)(SwsContext *c, const uint16_t *buf0, const uint16_t *buf1, const uint16_t *uvbuf0, const uint16_t *uvbuf1, - const uint16_t *abuf0, const uint16_t *abuf1, uint8_t *dest, int dstW, int yalpha, int uvalpha, int y) +static inline void yuv2packed2_c(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, int dstW, + int yalpha, int uvalpha, int y) { int yalpha1=4095- yalpha; int uvalpha1=4095-uvalpha; int i; -#if COMPILE_TEMPLATE_MMX - if(!(c->flags & SWS_BITEXACT)) { - switch(c->dstFormat) { - //Note 8280 == DSTW_OFFSET but the preprocessor can't handle that there :( - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { -#if ARCH_X86_64 - __asm__ volatile( - YSCALEYUV2RGB(%%r8, %5) - YSCALEYUV2RGB_YA(%%r8, %5, %6, %7) - "psraw $3, %%mm1 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ - "psraw $3, %%mm7 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ - "packuswb %%mm7, %%mm1 \n\t" - WRITEBGR32(%4, 8280(%5), %%r8, %%mm2, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm3, %%mm6) - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "r" (dest), - "a" (&c->redDither) - ,"r" (abuf0), "r" (abuf1) - : "%r8" - ); -#else - c->u_temp=(intptr_t)abuf0; - c->v_temp=(intptr_t)abuf1; - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB(%%REGBP, %5) - "push %0 \n\t" - "push %1 \n\t" - "mov "U_TEMP"(%5), %0 \n\t" - "mov "V_TEMP"(%5), %1 \n\t" - YSCALEYUV2RGB_YA(%%REGBP, %5, %0, %1) - "psraw $3, %%mm1 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ - "psraw $3, %%mm7 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ - "packuswb %%mm7, %%mm1 \n\t" - "pop %1 \n\t" - "pop %0 \n\t" - WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm3, %%mm6) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); -#endif - } else { - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB(%%REGBP, %5) - "pcmpeqd %%mm7, %%mm7 \n\t" - WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - } - return; - case PIX_FMT_BGR24: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - WRITEBGR24(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_RGB555: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB(%%REGBP, 
%5) - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" - "paddusb "RED_DITHER"(%5), %%mm5 \n\t" -#endif - - WRITERGB15(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_RGB565: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" - "paddusb "RED_DITHER"(%5), %%mm5 \n\t" -#endif - - WRITERGB16(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_YUYV422: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2PACKED(%%REGBP, %5) - WRITEYUY2(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - default: break; - } - } -#endif //COMPILE_TEMPLATE_MMX YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGB2_C, YSCALE_YUV_2_PACKED2_C(void,0), YSCALE_YUV_2_GRAY16_2_C, YSCALE_YUV_2_MONO2_C) } /** * YV12 to RGB without scaling or interpolating */ -static inline void RENAME(yuv2packed1)(SwsContext *c, const uint16_t *buf0, const uint16_t *uvbuf0, const uint16_t *uvbuf1, - const uint16_t *abuf0, uint8_t *dest, int dstW, int uvalpha, enum PixelFormat dstFormat, int flags, int y) +static inline void yuv2packed1_c(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, int dstW, + int uvalpha, enum PixelFormat dstFormat, + int flags, int y) { const int yalpha1=0; int i; @@ -1363,228 +120,6 @@ static inline void RENAME(yuv2packed1)(SwsContext *c, const uint16_t *buf0, cons const uint16_t *buf1= buf0; //FIXME needed for RGB1/BGR1 const int yalpha= 4096; //FIXME ... 
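/* A hedged sketch of the vertical bilinear mix that yuv2packed2_c feeds into
 * YSCALE_YUV_2_RGB2_C: two adjacent intermediate rows are blended with a
 * 12-bit weight (yalpha in 0..4095, yalpha1 = 4095 - yalpha); the >>19 folds
 * the x128 intermediate scale and the 12-bit weight back down to roughly
 * 8 bits.  The standalone helper and its name are illustrative only. */
#include <stdint.h>

static int blend_rows_sketch(const int16_t *buf0, const int16_t *buf1,
                             int i, int yalpha)
{
    const int yalpha1 = 4095 - yalpha;
    return (buf0[i] * yalpha1 + buf1[i] * yalpha) >> 19;
}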
- if (flags&SWS_FULL_CHR_H_INT) { - c->yuv2packed2(c, buf0, buf0, uvbuf0, uvbuf1, abuf0, abuf0, dest, dstW, 0, uvalpha, y); - return; - } - -#if COMPILE_TEMPLATE_MMX - if(!(flags & SWS_BITEXACT)) { - if (uvalpha < 2048) { // note this is not correct (shifts chrominance by 0.5 pixels) but it is a bit faster - switch(dstFormat) { - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1(%%REGBP, %5) - YSCALEYUV2RGB1_ALPHA(%%REGBP) - WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (abuf0), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - } else { - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1(%%REGBP, %5) - "pcmpeqd %%mm7, %%mm7 \n\t" - WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - } - return; - case PIX_FMT_BGR24: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - WRITEBGR24(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_RGB555: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" - "paddusb "RED_DITHER"(%5), %%mm5 \n\t" -#endif - WRITERGB15(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_RGB565: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" - "paddusb "RED_DITHER"(%5), %%mm5 \n\t" -#endif - - WRITERGB16(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_YUYV422: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2PACKED1(%%REGBP, %5) - WRITEYUY2(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - } - } else { - switch(dstFormat) { - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1b(%%REGBP, %5) - 
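/* yuv2packed1_c (whose body continues below) is the unscaled path: it still
 * receives two chroma rows and, as the removed MMX path above and the
 * YSCALE_YUV_2_RGB1_C / _RGB1b_C fallbacks show, uses only one of them when
 * uvalpha < 2048 (faster, but as the old comment warns it shifts chrominance
 * by 0.5 pixels) and averages both otherwise.  A rough sketch; the exact
 * rounding and the choice of which row counts as "nearest" follow the real
 * macros, and the helper name is illustrative. */
#include <stdint.h>

static int chroma_sample_sketch(const int16_t *ubuf0, const int16_t *ubuf1,
                                int i, int uvalpha)
{
    if (uvalpha < 2048)
        return ubuf0[i] >> 7;              /* single row, no blending */
    return (ubuf0[i] + ubuf1[i]) >> 8;     /* mean of the two rows */
}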
YSCALEYUV2RGB1_ALPHA(%%REGBP) - WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (abuf0), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - } else { - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1b(%%REGBP, %5) - "pcmpeqd %%mm7, %%mm7 \n\t" - WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - } - return; - case PIX_FMT_BGR24: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1b(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - WRITEBGR24(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_RGB555: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1b(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" - "paddusb "RED_DITHER"(%5), %%mm5 \n\t" -#endif - WRITERGB15(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_RGB565: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2RGB1b(%%REGBP, %5) - "pxor %%mm7, %%mm7 \n\t" - /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ -#ifdef DITHER1XBPP - "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" - "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" - "paddusb "RED_DITHER"(%5), %%mm5 \n\t" -#endif - - WRITERGB16(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - case PIX_FMT_YUYV422: - __asm__ volatile( - "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" - "mov %4, %%"REG_b" \n\t" - "push %%"REG_BP" \n\t" - YSCALEYUV2PACKED1b(%%REGBP, %5) - WRITEYUY2(%%REGb, 8280(%5), %%REGBP) - "pop %%"REG_BP" \n\t" - "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" - - :: "c" (buf0), "d" (buf1), "S" (uvbuf0), "D" (uvbuf1), "m" (dest), - "a" (&c->redDither) - ); - return; - } - } - } -#endif /* COMPILE_TEMPLATE_MMX */ if (uvalpha < 2048) { YSCALE_YUV_2_ANYRGB_C(YSCALE_YUV_2_RGB1_C, YSCALE_YUV_2_PACKED1_C(void,0), YSCALE_YUV_2_GRAY16_1_C, YSCALE_YUV_2_MONO2_C) } else { @@ -1594,89 +129,28 @@ static inline void RENAME(yuv2packed1)(SwsContext *c, const uint16_t *buf0, cons //FIXME yuy2* can read up to 7 samples too much -static inline void RENAME(yuy2ToY)(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) +static inline void yuy2ToY_c(uint8_t *dst, const uint8_t *src, int width, + uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "movq "MANGLE(bm01010101)", %%mm2 \n\t" - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",2), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" - "pand %%mm2, %%mm0 \n\t" - "pand %%mm2, %%mm1 \n\t" - 
"packuswb %%mm1, %%mm0 \n\t" - "movq %%mm0, (%2, %%"REG_a") \n\t" - "add $8, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" (src+width*2), "r" (dst+width) - : "%"REG_a - ); -#else int i; for (i=0; i<width; i++) dst[i]= src[2*i]; -#endif } -static inline void RENAME(yuy2ToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) +static inline void yuy2ToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "movq "MANGLE(bm01010101)", %%mm4 \n\t" - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",4), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",4), %%mm1 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm1 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "movq %%mm0, %%mm1 \n\t" - "psrlw $8, %%mm0 \n\t" - "pand %%mm4, %%mm1 \n\t" - "packuswb %%mm0, %%mm0 \n\t" - "packuswb %%mm1, %%mm1 \n\t" - "movd %%mm0, (%3, %%"REG_a") \n\t" - "movd %%mm1, (%2, %%"REG_a") \n\t" - "add $4, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" (src1+width*4), "r" (dstU+width), "r" (dstV+width) - : "%"REG_a - ); -#else int i; for (i=0; i<width; i++) { dstU[i]= src1[4*i + 1]; dstV[i]= src1[4*i + 3]; } -#endif assert(src1 == src2); } -static inline void RENAME(LEToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) +static inline void LEToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",2), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" - "movq (%2, %%"REG_a",2), %%mm2 \n\t" - "movq 8(%2, %%"REG_a",2), %%mm3 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm1 \n\t" - "psrlw $8, %%mm2 \n\t" - "psrlw $8, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, (%3, %%"REG_a") \n\t" - "movq %%mm2, (%4, %%"REG_a") \n\t" - "add $8, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" (src1+width*2), "r" (src2+width*2), "r" (dstU+width), "r" (dstV+width) - : "%"REG_a - ); -#else int i; // FIXME I don't think this code is right for YUV444/422, since then h is not subsampled so // we need to skip each second pixel. Same for BEToUV. @@ -1684,148 +158,47 @@ static inline void RENAME(LEToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *s dstU[i]= src1[2*i + 1]; dstV[i]= src2[2*i + 1]; } -#endif } /* This is almost identical to the previous, end exists only because * yuy2ToY/UV)(dst, src+1, ...) would have 100% unaligned accesses. 
*/ -static inline void RENAME(uyvyToY)(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) +static inline void uyvyToY_c(uint8_t *dst, const uint8_t *src, int width, + uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",2), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" - "psrlw $8, %%mm0 \n\t" - "psrlw $8, %%mm1 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "movq %%mm0, (%2, %%"REG_a") \n\t" - "add $8, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" (src+width*2), "r" (dst+width) - : "%"REG_a - ); -#else int i; for (i=0; i<width; i++) dst[i]= src[2*i+1]; -#endif } -static inline void RENAME(uyvyToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) +static inline void uyvyToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "movq "MANGLE(bm01010101)", %%mm4 \n\t" - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",4), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",4), %%mm1 \n\t" - "pand %%mm4, %%mm0 \n\t" - "pand %%mm4, %%mm1 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "movq %%mm0, %%mm1 \n\t" - "psrlw $8, %%mm0 \n\t" - "pand %%mm4, %%mm1 \n\t" - "packuswb %%mm0, %%mm0 \n\t" - "packuswb %%mm1, %%mm1 \n\t" - "movd %%mm0, (%3, %%"REG_a") \n\t" - "movd %%mm1, (%2, %%"REG_a") \n\t" - "add $4, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" (src1+width*4), "r" (dstU+width), "r" (dstV+width) - : "%"REG_a - ); -#else int i; for (i=0; i<width; i++) { dstU[i]= src1[4*i + 0]; dstV[i]= src1[4*i + 2]; } -#endif assert(src1 == src2); } -static inline void RENAME(BEToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) +static inline void BEToUV_c(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, + const uint8_t *src2, int width, uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "movq "MANGLE(bm01010101)", %%mm4 \n\t" - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",2), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" - "movq (%2, %%"REG_a",2), %%mm2 \n\t" - "movq 8(%2, %%"REG_a",2), %%mm3 \n\t" - "pand %%mm4, %%mm0 \n\t" - "pand %%mm4, %%mm1 \n\t" - "pand %%mm4, %%mm2 \n\t" - "pand %%mm4, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, (%3, %%"REG_a") \n\t" - "movq %%mm2, (%4, %%"REG_a") \n\t" - "add $8, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" (src1+width*2), "r" (src2+width*2), "r" (dstU+width), "r" (dstV+width) - : "%"REG_a - ); -#else int i; for (i=0; i<width; i++) { dstU[i]= src1[2*i]; dstV[i]= src2[2*i]; } -#endif } -static inline void RENAME(nvXXtoUV)(uint8_t *dst1, uint8_t *dst2, - const uint8_t *src, long width) +static inline void nvXXtoUV_c(uint8_t *dst1, uint8_t *dst2, + const uint8_t *src, int width) { -#if COMPILE_TEMPLATE_MMX - __asm__ volatile( - "movq "MANGLE(bm01010101)", %%mm4 \n\t" - "mov %0, %%"REG_a" \n\t" - "1: \n\t" - "movq (%1, %%"REG_a",2), %%mm0 \n\t" - "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm1, %%mm3 \n\t" - "pand %%mm4, %%mm0 \n\t" - "pand %%mm4, %%mm1 \n\t" - "psrlw $8, %%mm2 \n\t" - "psrlw $8, %%mm3 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "packuswb %%mm3, %%mm2 \n\t" - "movq %%mm0, (%2, %%"REG_a") \n\t" - "movq %%mm2, (%3, %%"REG_a") \n\t" - "add $8, %%"REG_a" \n\t" - " js 1b \n\t" - : : "g" ((x86_reg)-width), "r" 
(src+width*2), "r" (dst1+width), "r" (dst2+width) - : "%"REG_a - ); -#else int i; for (i = 0; i < width; i++) { dst1[i] = src[2*i+0]; dst2[i] = src[2*i+1]; } -#endif -} - -static inline void RENAME(nv12ToUV)(uint8_t *dstU, uint8_t *dstV, - const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *unused) -{ - RENAME(nvXXtoUV)(dstU, dstV, src1, width); -} - -static inline void RENAME(nv21ToUV)(uint8_t *dstU, uint8_t *dstV, - const uint8_t *src1, const uint8_t *src2, - long width, uint32_t *unused) -{ - RENAME(nvXXtoUV)(dstV, dstU, src1, width); } // FIXME Maybe dither instead. @@ -1833,7 +206,7 @@ static inline void RENAME(nv21ToUV)(uint8_t *dstU, uint8_t *dstV, #define YUV_NBPS(depth, endianness, rfunc) \ static inline void endianness ## depth ## ToUV_c(uint8_t *dstU, uint8_t *dstV, \ const uint16_t *srcU, const uint16_t *srcV, \ - long width, uint32_t *unused) \ + int width, uint32_t *unused) \ { \ int i; \ for (i = 0; i < width; i++) { \ @@ -1842,7 +215,7 @@ static inline void endianness ## depth ## ToUV_c(uint8_t *dstU, uint8_t *dstV, \ } \ } \ \ -static inline void endianness ## depth ## ToY_c(uint8_t *dstY, const uint16_t *srcY, long width, uint32_t *unused) \ +static inline void endianness ## depth ## ToY_c(uint8_t *dstY, const uint16_t *srcY, int width, uint32_t *unused) \ { \ int i; \ for (i = 0; i < width; i++) \ @@ -1855,736 +228,51 @@ YUV_NBPS(10, LE, AV_RL16) YUV_NBPS(10, BE, AV_RB16) #endif // YUV_NBPS -#if COMPILE_TEMPLATE_MMX -static inline void RENAME(bgr24ToY_mmx)(uint8_t *dst, const uint8_t *src, long width, enum PixelFormat srcFormat) -{ - - if(srcFormat == PIX_FMT_BGR24) { - __asm__ volatile( - "movq "MANGLE(ff_bgr24toY1Coeff)", %%mm5 \n\t" - "movq "MANGLE(ff_bgr24toY2Coeff)", %%mm6 \n\t" - : - ); - } else { - __asm__ volatile( - "movq "MANGLE(ff_rgb24toY1Coeff)", %%mm5 \n\t" - "movq "MANGLE(ff_rgb24toY2Coeff)", %%mm6 \n\t" - : - ); - } - - __asm__ volatile( - "movq "MANGLE(ff_bgr24toYOffset)", %%mm4 \n\t" - "mov %2, %%"REG_a" \n\t" - "pxor %%mm7, %%mm7 \n\t" - "1: \n\t" - PREFETCH" 64(%0) \n\t" - "movd (%0), %%mm0 \n\t" - "movd 2(%0), %%mm1 \n\t" - "movd 6(%0), %%mm2 \n\t" - "movd 8(%0), %%mm3 \n\t" - "add $12, %0 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm3 \n\t" - "pmaddwd %%mm5, %%mm0 \n\t" - "pmaddwd %%mm6, %%mm1 \n\t" - "pmaddwd %%mm5, %%mm2 \n\t" - "pmaddwd %%mm6, %%mm3 \n\t" - "paddd %%mm1, %%mm0 \n\t" - "paddd %%mm3, %%mm2 \n\t" - "paddd %%mm4, %%mm0 \n\t" - "paddd %%mm4, %%mm2 \n\t" - "psrad $15, %%mm0 \n\t" - "psrad $15, %%mm2 \n\t" - "packssdw %%mm2, %%mm0 \n\t" - "packuswb %%mm0, %%mm0 \n\t" - "movd %%mm0, (%1, %%"REG_a") \n\t" - "add $4, %%"REG_a" \n\t" - " js 1b \n\t" - : "+r" (src) - : "r" (dst+width), "g" ((x86_reg)-width) - : "%"REG_a - ); -} - -static inline void RENAME(bgr24ToUV_mmx)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src, long width, enum PixelFormat srcFormat) -{ - __asm__ volatile( - "movq 24(%4), %%mm6 \n\t" - "mov %3, %%"REG_a" \n\t" - "pxor %%mm7, %%mm7 \n\t" - "1: \n\t" - PREFETCH" 64(%0) \n\t" - "movd (%0), %%mm0 \n\t" - "movd 2(%0), %%mm1 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - "movq %%mm0, %%mm2 \n\t" - "movq %%mm1, %%mm3 \n\t" - "pmaddwd (%4), %%mm0 \n\t" - "pmaddwd 8(%4), %%mm1 \n\t" - "pmaddwd 16(%4), %%mm2 \n\t" - "pmaddwd %%mm6, %%mm3 \n\t" - "paddd %%mm1, %%mm0 \n\t" - "paddd %%mm3, %%mm2 \n\t" - - "movd 6(%0), %%mm1 \n\t" - "movd 8(%0), %%mm3 \n\t" - "add $12, %0 \n\t" - "punpcklbw %%mm7, %%mm1 \n\t" - 
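/* nvXXtoUV_c above captures the whole NV12/NV21 difference: the semi-planar
 * chroma plane interleaves one U and one V byte per pair, and NV21 only swaps
 * which destination receives which byte, exactly what the nv12ToUV_c /
 * nv21ToUV_c wrappers further down do.  A compact restatement with the swap
 * folded into a flag; helper name and flag parameter are illustrative. */
#include <stdint.h>

static void deinterleave_uv_sketch(uint8_t *dstU, uint8_t *dstV,
                                   const uint8_t *src, int width, int is_nv21)
{
    int i;
    for (i = 0; i < width; i++) {
        dstU[i] = src[2 * i + (is_nv21 ? 1 : 0)];
        dstV[i] = src[2 * i + (is_nv21 ? 0 : 1)];
    }
}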
"punpcklbw %%mm7, %%mm3 \n\t" - "movq %%mm1, %%mm4 \n\t" - "movq %%mm3, %%mm5 \n\t" - "pmaddwd (%4), %%mm1 \n\t" - "pmaddwd 8(%4), %%mm3 \n\t" - "pmaddwd 16(%4), %%mm4 \n\t" - "pmaddwd %%mm6, %%mm5 \n\t" - "paddd %%mm3, %%mm1 \n\t" - "paddd %%mm5, %%mm4 \n\t" - - "movq "MANGLE(ff_bgr24toUVOffset)", %%mm3 \n\t" - "paddd %%mm3, %%mm0 \n\t" - "paddd %%mm3, %%mm2 \n\t" - "paddd %%mm3, %%mm1 \n\t" - "paddd %%mm3, %%mm4 \n\t" - "psrad $15, %%mm0 \n\t" - "psrad $15, %%mm2 \n\t" - "psrad $15, %%mm1 \n\t" - "psrad $15, %%mm4 \n\t" - "packssdw %%mm1, %%mm0 \n\t" - "packssdw %%mm4, %%mm2 \n\t" - "packuswb %%mm0, %%mm0 \n\t" - "packuswb %%mm2, %%mm2 \n\t" - "movd %%mm0, (%1, %%"REG_a") \n\t" - "movd %%mm2, (%2, %%"REG_a") \n\t" - "add $4, %%"REG_a" \n\t" - " js 1b \n\t" - : "+r" (src) - : "r" (dstU+width), "r" (dstV+width), "g" ((x86_reg)-width), "r"(ff_bgr24toUV[srcFormat == PIX_FMT_RGB24]) - : "%"REG_a - ); -} -#endif - -static inline void RENAME(bgr24ToY)(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) -{ -#if COMPILE_TEMPLATE_MMX - RENAME(bgr24ToY_mmx)(dst, src, width, PIX_FMT_BGR24); -#else - int i; - for (i=0; i<width; i++) { - int b= src[i*3+0]; - int g= src[i*3+1]; - int r= src[i*3+2]; - - dst[i]= ((RY*r + GY*g + BY*b + (33<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); - } -#endif /* COMPILE_TEMPLATE_MMX */ -} - -static inline void RENAME(bgr24ToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) -{ -#if COMPILE_TEMPLATE_MMX - RENAME(bgr24ToUV_mmx)(dstU, dstV, src1, width, PIX_FMT_BGR24); -#else - int i; - for (i=0; i<width; i++) { - int b= src1[3*i + 0]; - int g= src1[3*i + 1]; - int r= src1[3*i + 2]; - - dstU[i]= (RU*r + GU*g + BU*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT; - dstV[i]= (RV*r + GV*g + BV*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT; - } -#endif /* COMPILE_TEMPLATE_MMX */ - assert(src1 == src2); -} - -static inline void RENAME(bgr24ToUV_half)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) -{ - int i; - for (i=0; i<width; i++) { - int b= src1[6*i + 0] + src1[6*i + 3]; - int g= src1[6*i + 1] + src1[6*i + 4]; - int r= src1[6*i + 2] + src1[6*i + 5]; - - dstU[i]= (RU*r + GU*g + BU*b + (257<<RGB2YUV_SHIFT))>>(RGB2YUV_SHIFT+1); - dstV[i]= (RV*r + GV*g + BV*b + (257<<RGB2YUV_SHIFT))>>(RGB2YUV_SHIFT+1); - } - assert(src1 == src2); -} - -static inline void RENAME(rgb24ToY)(uint8_t *dst, const uint8_t *src, long width, uint32_t *unused) +static inline void nv12ToUV_c(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - RENAME(bgr24ToY_mmx)(dst, src, width, PIX_FMT_RGB24); -#else - int i; - for (i=0; i<width; i++) { - int r= src[i*3+0]; - int g= src[i*3+1]; - int b= src[i*3+2]; - - dst[i]= ((RY*r + GY*g + BY*b + (33<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); - } -#endif + nvXXtoUV_c(dstU, dstV, src1, width); } -static inline void RENAME(rgb24ToUV)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) +static inline void nv21ToUV_c(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) { -#if COMPILE_TEMPLATE_MMX - assert(src1==src2); - RENAME(bgr24ToUV_mmx)(dstU, dstV, src1, width, PIX_FMT_RGB24); -#else - int i; - assert(src1==src2); - for (i=0; i<width; i++) { - int r= src1[3*i + 0]; - int g= src1[3*i + 1]; - int b= src1[3*i + 2]; - - dstU[i]= (RU*r + GU*g + BU*b + 
(257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT; - dstV[i]= (RV*r + GV*g + BV*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT; - } -#endif + nvXXtoUV_c(dstV, dstU, src1, width); } -static inline void RENAME(rgb24ToUV_half)(uint8_t *dstU, uint8_t *dstV, const uint8_t *src1, const uint8_t *src2, long width, uint32_t *unused) -{ - int i; - assert(src1==src2); - for (i=0; i<width; i++) { - int r= src1[6*i + 0] + src1[6*i + 3]; - int g= src1[6*i + 1] + src1[6*i + 4]; - int b= src1[6*i + 2] + src1[6*i + 5]; - - dstU[i]= (RU*r + GU*g + BU*b + (257<<RGB2YUV_SHIFT))>>(RGB2YUV_SHIFT+1); - dstV[i]= (RV*r + GV*g + BV*b + (257<<RGB2YUV_SHIFT))>>(RGB2YUV_SHIFT+1); - } -} - - // bilinear / bicubic scaling -static inline void RENAME(hScale)(int16_t *dst, int dstW, const uint8_t *src, int srcW, int xInc, - const int16_t *filter, const int16_t *filterPos, long filterSize) +static inline void hScale_c(int16_t *dst, int dstW, const uint8_t *src, + int srcW, int xInc, + const int16_t *filter, const int16_t *filterPos, + int filterSize) { -#if COMPILE_TEMPLATE_MMX - assert(filterSize % 4 == 0 && filterSize>0); - if (filterSize==4) { // Always true for upscaling, sometimes for down, too. - x86_reg counter= -2*dstW; - filter-= counter*2; - filterPos-= counter/2; - dst-= counter/2; - __asm__ volatile( -#if defined(PIC) - "push %%"REG_b" \n\t" -#endif - "pxor %%mm7, %%mm7 \n\t" - "push %%"REG_BP" \n\t" // we use 7 regs here ... - "mov %%"REG_a", %%"REG_BP" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - "movzwl (%2, %%"REG_BP"), %%eax \n\t" - "movzwl 2(%2, %%"REG_BP"), %%ebx \n\t" - "movq (%1, %%"REG_BP", 4), %%mm1 \n\t" - "movq 8(%1, %%"REG_BP", 4), %%mm3 \n\t" - "movd (%3, %%"REG_a"), %%mm0 \n\t" - "movd (%3, %%"REG_b"), %%mm2 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "pmaddwd %%mm1, %%mm0 \n\t" - "pmaddwd %%mm2, %%mm3 \n\t" - "movq %%mm0, %%mm4 \n\t" - "punpckldq %%mm3, %%mm0 \n\t" - "punpckhdq %%mm3, %%mm4 \n\t" - "paddd %%mm4, %%mm0 \n\t" - "psrad $7, %%mm0 \n\t" - "packssdw %%mm0, %%mm0 \n\t" - "movd %%mm0, (%4, %%"REG_BP") \n\t" - "add $4, %%"REG_BP" \n\t" - " jnc 1b \n\t" - - "pop %%"REG_BP" \n\t" -#if defined(PIC) - "pop %%"REG_b" \n\t" -#endif - : "+a" (counter) - : "c" (filter), "d" (filterPos), "S" (src), "D" (dst) -#if !defined(PIC) - : "%"REG_b -#endif - ); - } else if (filterSize==8) { - x86_reg counter= -2*dstW; - filter-= counter*4; - filterPos-= counter/2; - dst-= counter/2; - __asm__ volatile( -#if defined(PIC) - "push %%"REG_b" \n\t" -#endif - "pxor %%mm7, %%mm7 \n\t" - "push %%"REG_BP" \n\t" // we use 7 regs here ... 
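/* The scalar bgr24ToY / rgb24ToY / rgb24ToUV fallbacks above use fixed-point
 * coefficients RY/GY/BY etc. together with RGB2YUV_SHIFT; the
 * 33 << (SHIFT-1) rounding term folds the +16 limited-range luma offset in
 * with the 0.5 rounding.  A self-contained sketch with stand-in BT.601-style
 * values -- the real constants and shift live in swscale_internal.h and may
 * differ slightly. */
#include <stdint.h>

#define SHIFT_SKETCH 15
#define RY_SKETCH  8414    /* ~0.299 * 219/255 * (1 << SHIFT_SKETCH) */
#define GY_SKETCH 16519    /* ~0.587 * 219/255 * (1 << SHIFT_SKETCH) */
#define BY_SKETCH  3208    /* ~0.114 * 219/255 * (1 << SHIFT_SKETCH) */

static uint8_t rgb_to_y_sketch(uint8_t r, uint8_t g, uint8_t b)
{
    return (RY_SKETCH * r + GY_SKETCH * g + BY_SKETCH * b
            + (33 << (SHIFT_SKETCH - 1))) >> SHIFT_SKETCH;
}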
- "mov %%"REG_a", %%"REG_BP" \n\t" - ".p2align 4 \n\t" - "1: \n\t" - "movzwl (%2, %%"REG_BP"), %%eax \n\t" - "movzwl 2(%2, %%"REG_BP"), %%ebx \n\t" - "movq (%1, %%"REG_BP", 8), %%mm1 \n\t" - "movq 16(%1, %%"REG_BP", 8), %%mm3 \n\t" - "movd (%3, %%"REG_a"), %%mm0 \n\t" - "movd (%3, %%"REG_b"), %%mm2 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "pmaddwd %%mm1, %%mm0 \n\t" - "pmaddwd %%mm2, %%mm3 \n\t" - - "movq 8(%1, %%"REG_BP", 8), %%mm1 \n\t" - "movq 24(%1, %%"REG_BP", 8), %%mm5 \n\t" - "movd 4(%3, %%"REG_a"), %%mm4 \n\t" - "movd 4(%3, %%"REG_b"), %%mm2 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "pmaddwd %%mm1, %%mm4 \n\t" - "pmaddwd %%mm2, %%mm5 \n\t" - "paddd %%mm4, %%mm0 \n\t" - "paddd %%mm5, %%mm3 \n\t" - "movq %%mm0, %%mm4 \n\t" - "punpckldq %%mm3, %%mm0 \n\t" - "punpckhdq %%mm3, %%mm4 \n\t" - "paddd %%mm4, %%mm0 \n\t" - "psrad $7, %%mm0 \n\t" - "packssdw %%mm0, %%mm0 \n\t" - "movd %%mm0, (%4, %%"REG_BP") \n\t" - "add $4, %%"REG_BP" \n\t" - " jnc 1b \n\t" - - "pop %%"REG_BP" \n\t" -#if defined(PIC) - "pop %%"REG_b" \n\t" -#endif - : "+a" (counter) - : "c" (filter), "d" (filterPos), "S" (src), "D" (dst) -#if !defined(PIC) - : "%"REG_b -#endif - ); - } else { - const uint8_t *offset = src+filterSize; - x86_reg counter= -2*dstW; - //filter-= counter*filterSize/2; - filterPos-= counter/2; - dst-= counter/2; - __asm__ volatile( - "pxor %%mm7, %%mm7 \n\t" - ".p2align 4 \n\t" - "1: \n\t" - "mov %2, %%"REG_c" \n\t" - "movzwl (%%"REG_c", %0), %%eax \n\t" - "movzwl 2(%%"REG_c", %0), %%edx \n\t" - "mov %5, %%"REG_c" \n\t" - "pxor %%mm4, %%mm4 \n\t" - "pxor %%mm5, %%mm5 \n\t" - "2: \n\t" - "movq (%1), %%mm1 \n\t" - "movq (%1, %6), %%mm3 \n\t" - "movd (%%"REG_c", %%"REG_a"), %%mm0 \n\t" - "movd (%%"REG_c", %%"REG_d"), %%mm2 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "pmaddwd %%mm1, %%mm0 \n\t" - "pmaddwd %%mm2, %%mm3 \n\t" - "paddd %%mm3, %%mm5 \n\t" - "paddd %%mm0, %%mm4 \n\t" - "add $8, %1 \n\t" - "add $4, %%"REG_c" \n\t" - "cmp %4, %%"REG_c" \n\t" - " jb 2b \n\t" - "add %6, %1 \n\t" - "movq %%mm4, %%mm0 \n\t" - "punpckldq %%mm5, %%mm4 \n\t" - "punpckhdq %%mm5, %%mm0 \n\t" - "paddd %%mm0, %%mm4 \n\t" - "psrad $7, %%mm4 \n\t" - "packssdw %%mm4, %%mm4 \n\t" - "mov %3, %%"REG_a" \n\t" - "movd %%mm4, (%%"REG_a", %0) \n\t" - "add $4, %0 \n\t" - " jnc 1b \n\t" - - : "+r" (counter), "+r" (filter) - : "m" (filterPos), "m" (dst), "m"(offset), - "m" (src), "r" ((x86_reg)filterSize*2) - : "%"REG_a, "%"REG_c, "%"REG_d - ); - } -#else -#if COMPILE_TEMPLATE_ALTIVEC - hScale_altivec_real(dst, dstW, src, srcW, xInc, filter, filterPos, filterSize); -#else int i; for (i=0; i<dstW; i++) { int j; int srcPos= filterPos[i]; int val=0; - //printf("filterPos: %d\n", filterPos[i]); for (j=0; j<filterSize; j++) { - //printf("filter: %d, src: %d\n", filter[i], src[srcPos + j]); val += ((int)src[srcPos + j])*filter[filterSize*i + j]; } //filter += hFilterSize; dst[i] = FFMIN(val>>7, (1<<15)-1); // the cubic equation does overflow ... 
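/* The plain-C fallback of hScale_c here is the generic horizontal FIR: each
 * output pixel i reads filterSize taps starting at filter[filterSize*i] from
 * the source window starting at filterPos[i].  A toy 2-tap "average pairs"
 * setup, assuming the usual convention that the taps for one output pixel
 * sum to 1<<14 so that val>>7 lands in the x128 intermediate range; this is
 * illustrative only, not how the real filter tables are built. */
#include <stdint.h>

static void halve_row_sketch(int16_t *dst, const uint8_t *src, int dstW)
{
    static const int16_t taps[2] = { 8192, 8192 };   /* 0.5 + 0.5 in Q14 */
    int i, j;
    for (i = 0; i < dstW; i++) {
        int val = 0, srcPos = 2 * i;                  /* filterPos[i] == 2*i */
        for (j = 0; j < 2; j++)
            val += src[srcPos + j] * taps[j];
        dst[i] = val >> 7;                            /* 8-bit src -> x128 */
    }
}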
//dst[i] = val>>7; } -#endif /* COMPILE_TEMPLATE_ALTIVEC */ -#endif /* COMPILE_MMX */ } -//FIXME all pal and rgb srcFormats could do this convertion as well -//FIXME all scalers more complex than bilinear could do half of this transform -static void RENAME(chrRangeToJpeg)(int16_t *dst, int width) -{ - int i; - for (i = 0; i < width; i++) { - dst[i ] = (FFMIN(dst[i ],30775)*4663 - 9289992)>>12; //-264 - dst[i+VOFW] = (FFMIN(dst[i+VOFW],30775)*4663 - 9289992)>>12; //-264 - } -} -static void RENAME(chrRangeFromJpeg)(int16_t *dst, int width) -{ - int i; - for (i = 0; i < width; i++) { - dst[i ] = (dst[i ]*1799 + 4081085)>>11; //1469 - dst[i+VOFW] = (dst[i+VOFW]*1799 + 4081085)>>11; //1469 - } -} -static void RENAME(lumRangeToJpeg)(int16_t *dst, int width) -{ - int i; - for (i = 0; i < width; i++) - dst[i] = (FFMIN(dst[i],30189)*19077 - 39057361)>>14; -} -static void RENAME(lumRangeFromJpeg)(int16_t *dst, int width) -{ - int i; - for (i = 0; i < width; i++) - dst[i] = (dst[i]*14071 + 33561947)>>14; -} - -#define FAST_BILINEAR_X86 \ - "subl %%edi, %%esi \n\t" /* src[xx+1] - src[xx] */ \ - "imull %%ecx, %%esi \n\t" /* (src[xx+1] - src[xx])*xalpha */ \ - "shll $16, %%edi \n\t" \ - "addl %%edi, %%esi \n\t" /* src[xx+1]*xalpha + src[xx]*(1-xalpha) */ \ - "mov %1, %%"REG_D"\n\t" \ - "shrl $9, %%esi \n\t" \ - -static inline void RENAME(hyscale_fast)(SwsContext *c, int16_t *dst, - long dstWidth, const uint8_t *src, int srcW, - int xInc) -{ -#if ARCH_X86 -#if COMPILE_TEMPLATE_MMX2 - int32_t *filterPos = c->hLumFilterPos; - int16_t *filter = c->hLumFilter; - int canMMX2BeUsed = c->canMMX2BeUsed; - void *mmx2FilterCode= c->lumMmx2FilterCode; - int i; -#if defined(PIC) - DECLARE_ALIGNED(8, uint64_t, ebxsave); -#endif - if (canMMX2BeUsed) { - __asm__ volatile( -#if defined(PIC) - "mov %%"REG_b", %5 \n\t" -#endif - "pxor %%mm7, %%mm7 \n\t" - "mov %0, %%"REG_c" \n\t" - "mov %1, %%"REG_D" \n\t" - "mov %2, %%"REG_d" \n\t" - "mov %3, %%"REG_b" \n\t" - "xor %%"REG_a", %%"REG_a" \n\t" // i - PREFETCH" (%%"REG_c") \n\t" - PREFETCH" 32(%%"REG_c") \n\t" - PREFETCH" 64(%%"REG_c") \n\t" - -#if ARCH_X86_64 - -#define CALL_MMX2_FILTER_CODE \ - "movl (%%"REG_b"), %%esi \n\t"\ - "call *%4 \n\t"\ - "movl (%%"REG_b", %%"REG_a"), %%esi \n\t"\ - "add %%"REG_S", %%"REG_c" \n\t"\ - "add %%"REG_a", %%"REG_D" \n\t"\ - "xor %%"REG_a", %%"REG_a" \n\t"\ - -#else - -#define CALL_MMX2_FILTER_CODE \ - "movl (%%"REG_b"), %%esi \n\t"\ - "call *%4 \n\t"\ - "addl (%%"REG_b", %%"REG_a"), %%"REG_c" \n\t"\ - "add %%"REG_a", %%"REG_D" \n\t"\ - "xor %%"REG_a", %%"REG_a" \n\t"\ - -#endif /* ARCH_X86_64 */ - - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - -#if defined(PIC) - "mov %5, %%"REG_b" \n\t" -#endif - :: "m" (src), "m" (dst), "m" (filter), "m" (filterPos), - "m" (mmx2FilterCode) -#if defined(PIC) - ,"m" (ebxsave) -#endif - : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S, "%"REG_D -#if !defined(PIC) - ,"%"REG_b -#endif - ); - for (i=dstWidth-1; (i*xInc)>>16 >=srcW-1; i--) dst[i] = src[srcW-1]*128; - } else { -#endif /* COMPILE_TEMPLATE_MMX2 */ - x86_reg xInc_shr16 = xInc >> 16; - uint16_t xInc_mask = xInc & 0xffff; - x86_reg dstWidth_reg = dstWidth; - //NO MMX just normal asm ... 
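/* The lumRange* / chrRange* helpers removed above convert the x128-scaled
 * intermediate between full (JPEG) and limited (MPEG) range; the chroma
 * variants also touch the V half of the packed buffer at offset VOFW.  A
 * hedged reading of lumRangeFromJpeg's constants: 14071/16384 ~= 219/255 and
 * 33561947/16384 ~= 16*128 plus rounding, i.e. Y_limited = Y_full*219/255 + 16
 * in the x128 domain.  Restated in plain C with an illustrative name: */
#include <stdint.h>

static void lum_full_to_limited_sketch(int16_t *dst, int width)
{
    int i;
    for (i = 0; i < width; i++)
        dst[i] = (dst[i] * 14071 + 33561947) >> 14;
}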
- __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" // i - "xor %%"REG_d", %%"REG_d" \n\t" // xx - "xorl %%ecx, %%ecx \n\t" // xalpha - ".p2align 4 \n\t" - "1: \n\t" - "movzbl (%0, %%"REG_d"), %%edi \n\t" //src[xx] - "movzbl 1(%0, %%"REG_d"), %%esi \n\t" //src[xx+1] - FAST_BILINEAR_X86 - "movw %%si, (%%"REG_D", %%"REG_a", 2) \n\t" - "addw %4, %%cx \n\t" //xalpha += xInc&0xFFFF - "adc %3, %%"REG_d" \n\t" //xx+= xInc>>16 + carry - - "movzbl (%0, %%"REG_d"), %%edi \n\t" //src[xx] - "movzbl 1(%0, %%"REG_d"), %%esi \n\t" //src[xx+1] - FAST_BILINEAR_X86 - "movw %%si, 2(%%"REG_D", %%"REG_a", 2) \n\t" - "addw %4, %%cx \n\t" //xalpha += xInc&0xFFFF - "adc %3, %%"REG_d" \n\t" //xx+= xInc>>16 + carry - - - "add $2, %%"REG_a" \n\t" - "cmp %2, %%"REG_a" \n\t" - " jb 1b \n\t" - - - :: "r" (src), "m" (dst), "m" (dstWidth_reg), "m" (xInc_shr16), "m" (xInc_mask) - : "%"REG_a, "%"REG_d, "%ecx", "%"REG_D, "%esi" - ); -#if COMPILE_TEMPLATE_MMX2 - } //if MMX2 can't be used -#endif -#else - int i; - unsigned int xpos=0; - for (i=0;i<dstWidth;i++) { - register unsigned int xx=xpos>>16; - register unsigned int xalpha=(xpos&0xFFFF)>>9; - dst[i]= (src[xx]<<7) + (src[xx+1] - src[xx])*xalpha; - xpos+=xInc; - } -#endif /* ARCH_X86 */ -} - - // *** horizontal scale Y line to temp buffer -static inline void RENAME(hyscale)(SwsContext *c, uint16_t *dst, long dstWidth, const uint8_t *src, int srcW, int xInc, - const int16_t *hLumFilter, - const int16_t *hLumFilterPos, int hLumFilterSize, - uint8_t *formatConvBuffer, - uint32_t *pal, int isAlpha) -{ - void (*toYV12)(uint8_t *, const uint8_t *, long, uint32_t *) = isAlpha ? c->alpToYV12 : c->lumToYV12; - void (*convertRange)(int16_t *, int) = isAlpha ? NULL : c->lumConvertRange; - - src += isAlpha ? c->alpSrcOffset : c->lumSrcOffset; - - if (toYV12) { - toYV12(formatConvBuffer, src, srcW, pal); - src= formatConvBuffer; - } - - if (!c->hyscale_fast) { - c->hScale(dst, dstWidth, src, srcW, xInc, hLumFilter, hLumFilterPos, hLumFilterSize); - } else { // fast bilinear upscale / crap downscale - c->hyscale_fast(c, dst, dstWidth, src, srcW, xInc); - } - - if (convertRange) - convertRange(dst, dstWidth); -} - -static inline void RENAME(hcscale_fast)(SwsContext *c, int16_t *dst, - long dstWidth, const uint8_t *src1, - const uint8_t *src2, int srcW, int xInc) -{ -#if ARCH_X86 -#if COMPILE_TEMPLATE_MMX2 - int32_t *filterPos = c->hChrFilterPos; - int16_t *filter = c->hChrFilter; - int canMMX2BeUsed = c->canMMX2BeUsed; - void *mmx2FilterCode= c->chrMmx2FilterCode; - int i; -#if defined(PIC) - DECLARE_ALIGNED(8, uint64_t, ebxsave); -#endif - if (canMMX2BeUsed) { - __asm__ volatile( -#if defined(PIC) - "mov %%"REG_b", %6 \n\t" -#endif - "pxor %%mm7, %%mm7 \n\t" - "mov %0, %%"REG_c" \n\t" - "mov %1, %%"REG_D" \n\t" - "mov %2, %%"REG_d" \n\t" - "mov %3, %%"REG_b" \n\t" - "xor %%"REG_a", %%"REG_a" \n\t" // i - PREFETCH" (%%"REG_c") \n\t" - PREFETCH" 32(%%"REG_c") \n\t" - PREFETCH" 64(%%"REG_c") \n\t" - - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - "xor %%"REG_a", %%"REG_a" \n\t" // i - "mov %5, %%"REG_c" \n\t" // src - "mov %1, %%"REG_D" \n\t" // buf1 - "add $"AV_STRINGIFY(VOF)", %%"REG_D" \n\t" - PREFETCH" (%%"REG_c") \n\t" - PREFETCH" 32(%%"REG_c") \n\t" - PREFETCH" 64(%%"REG_c") \n\t" - - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - CALL_MMX2_FILTER_CODE - -#if defined(PIC) - "mov %6, %%"REG_b" \n\t" -#endif - :: "m" (src1), "m" (dst), "m" (filter), "m" (filterPos), - "m" (mmx2FilterCode), "m" (src2) -#if 
defined(PIC) - ,"m" (ebxsave) -#endif - : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S, "%"REG_D -#if !defined(PIC) - ,"%"REG_b -#endif - ); - for (i=dstWidth-1; (i*xInc)>>16 >=srcW-1; i--) { - //printf("%d %d %d\n", dstWidth, i, srcW); - dst[i] = src1[srcW-1]*128; - dst[i+VOFW] = src2[srcW-1]*128; - } - } else { -#endif /* COMPILE_TEMPLATE_MMX2 */ - x86_reg xInc_shr16 = (x86_reg) (xInc >> 16); - uint16_t xInc_mask = xInc & 0xffff; - x86_reg dstWidth_reg = dstWidth; - __asm__ volatile( - "xor %%"REG_a", %%"REG_a" \n\t" // i - "xor %%"REG_d", %%"REG_d" \n\t" // xx - "xorl %%ecx, %%ecx \n\t" // xalpha - ".p2align 4 \n\t" - "1: \n\t" - "mov %0, %%"REG_S" \n\t" - "movzbl (%%"REG_S", %%"REG_d"), %%edi \n\t" //src[xx] - "movzbl 1(%%"REG_S", %%"REG_d"), %%esi \n\t" //src[xx+1] - FAST_BILINEAR_X86 - "movw %%si, (%%"REG_D", %%"REG_a", 2) \n\t" - - "movzbl (%5, %%"REG_d"), %%edi \n\t" //src[xx] - "movzbl 1(%5, %%"REG_d"), %%esi \n\t" //src[xx+1] - FAST_BILINEAR_X86 - "movw %%si, "AV_STRINGIFY(VOF)"(%%"REG_D", %%"REG_a", 2) \n\t" - - "addw %4, %%cx \n\t" //xalpha += xInc&0xFFFF - "adc %3, %%"REG_d" \n\t" //xx+= xInc>>16 + carry - "add $1, %%"REG_a" \n\t" - "cmp %2, %%"REG_a" \n\t" - " jb 1b \n\t" - -/* GCC 3.3 makes MPlayer crash on IA-32 machines when using "g" operand here, -which is needed to support GCC 4.0. */ -#if ARCH_X86_64 && AV_GCC_VERSION_AT_LEAST(3,4) - :: "m" (src1), "m" (dst), "g" (dstWidth_reg), "m" (xInc_shr16), "m" (xInc_mask), -#else - :: "m" (src1), "m" (dst), "m" (dstWidth_reg), "m" (xInc_shr16), "m" (xInc_mask), -#endif - "r" (src2) - : "%"REG_a, "%"REG_d, "%ecx", "%"REG_D, "%esi" - ); -#if COMPILE_TEMPLATE_MMX2 - } //if MMX2 can't be used -#endif -#else - int i; - unsigned int xpos=0; - for (i=0;i<dstWidth;i++) { - register unsigned int xx=xpos>>16; - register unsigned int xalpha=(xpos&0xFFFF)>>9; - dst[i]=(src1[xx]*(xalpha^127)+src1[xx+1]*xalpha); - dst[i+VOFW]=(src2[xx]*(xalpha^127)+src2[xx+1]*xalpha); - /* slower - dst[i]= (src1[xx]<<7) + (src1[xx+1] - src1[xx])*xalpha; - dst[i+VOFW]=(src2[xx]<<7) + (src2[xx+1] - src2[xx])*xalpha; - */ - xpos+=xInc; - } -#endif /* ARCH_X86 */ -} - -inline static void RENAME(hcscale)(SwsContext *c, uint16_t *dst, long dstWidth, const uint8_t *src1, const uint8_t *src2, - int srcW, int xInc, const int16_t *hChrFilter, - const int16_t *hChrFilterPos, int hChrFilterSize, - uint8_t *formatConvBuffer, - uint32_t *pal) -{ - - src1 += c->chrSrcOffset; - src2 += c->chrSrcOffset; - - if (c->chrToYV12) { - c->chrToYV12(formatConvBuffer, formatConvBuffer+VOFW, src1, src2, srcW, pal); - src1= formatConvBuffer; - src2= formatConvBuffer+VOFW; - } - - if (!c->hcscale_fast) { - c->hScale(dst , dstWidth, src1, srcW, xInc, hChrFilter, hChrFilterPos, hChrFilterSize); - c->hScale(dst+VOFW, dstWidth, src2, srcW, xInc, hChrFilter, hChrFilterPos, hChrFilterSize); - } else { // fast bilinear upscale / crap downscale - c->hcscale_fast(c, dst, dstWidth, src1, src2, srcW, xInc); - } - - if (c->chrConvertRange) - c->chrConvertRange(dst, dstWidth); -} #define DEBUG_SWSCALE_BUFFERS 0 #define DEBUG_BUFFERS(...) 
if (DEBUG_SWSCALE_BUFFERS) av_log(c, AV_LOG_DEBUG, __VA_ARGS__) -static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, - int srcSliceH, uint8_t* dst[], int dstStride[]) +#if HAVE_MMX +static void updateMMXDitherTables(SwsContext *c, int dstY, int lumBufIndex, int chrBufIndex, + int lastInLumBuf, int lastInChrBuf); +#endif + +static int swScale_c(SwsContext *c, const uint8_t* src[], int srcStride[], + int srcSliceY, int srcSliceH, uint8_t* dst[], int dstStride[]) { /* load a few things into local vars to make the code more readable? and faster */ const int srcW= c->srcW; @@ -2612,7 +300,8 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], const int hLumFilterSize= c->hLumFilterSize; const int hChrFilterSize= c->hChrFilterSize; int16_t **lumPixBuf= c->lumPixBuf; - int16_t **chrPixBuf= c->chrPixBuf; + int16_t **chrUPixBuf= c->chrUPixBuf; + int16_t **chrVPixBuf= c->chrVPixBuf; int16_t **alpPixBuf= c->alpPixBuf; const int vLumBufSize= c->vLumBufSize; const int vChrBufSize= c->vChrBufSize; @@ -2678,6 +367,8 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], unsigned char *uDest=dst[1]+dstStride[1]*chrDstY; unsigned char *vDest=dst[2]+dstStride[2]*chrDstY; unsigned char *aDest=(CONFIG_SWSCALE_ALPHA && alpPixBuf) ? dst[3]+dstStride[3]*dstY : NULL; + const uint8_t *lumDither= isNBPS(c->srcFormat) || is16BPS(c->srcFormat) ? dithers[7][dstY &7] : flat64; + const uint8_t *chrDither= isNBPS(c->srcFormat) || is16BPS(c->srcFormat) ? dithers[7][chrDstY&7] : flat64; const int firstLumSrcY= vLumFilterPos[dstY]; //First line needed as input const int firstLumSrcY2= vLumFilterPos[FFMIN(dstY | ((1<<c->chrDstVSubSample) - 1), dstH-1)]; @@ -2717,15 +408,15 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], assert(lumBufIndex < 2*vLumBufSize); assert(lastInLumBuf + 1 - srcSliceY < srcSliceH); assert(lastInLumBuf + 1 - srcSliceY >= 0); - RENAME(hyscale)(c, lumPixBuf[ lumBufIndex ], dstW, src1, srcW, lumXInc, - hLumFilter, hLumFilterPos, hLumFilterSize, - formatConvBuffer, - pal, 0); + hyscale_c(c, lumPixBuf[ lumBufIndex ], dstW, src1, srcW, lumXInc, + hLumFilter, hLumFilterPos, hLumFilterSize, + formatConvBuffer, + pal, 0); if (CONFIG_SWSCALE_ALPHA && alpPixBuf) - RENAME(hyscale)(c, alpPixBuf[ lumBufIndex ], dstW, src2, srcW, lumXInc, - hLumFilter, hLumFilterPos, hLumFilterSize, - formatConvBuffer, - pal, 1); + hyscale_c(c, alpPixBuf[ lumBufIndex ], dstW, src2, srcW, + lumXInc, hLumFilter, hLumFilterPos, hLumFilterSize, + formatConvBuffer, + pal, 1); lastInLumBuf++; DEBUG_BUFFERS("\t\tlumBufIndex %d: lastInLumBuf: %d\n", lumBufIndex, lastInLumBuf); @@ -2740,10 +431,10 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], //FIXME replace parameters through context struct (some at least) if (c->needs_hcscale) - RENAME(hcscale)(c, chrPixBuf[ chrBufIndex ], chrDstW, src1, src2, chrSrcW, chrXInc, - hChrFilter, hChrFilterPos, hChrFilterSize, - formatConvBuffer, - pal); + hcscale_c(c, chrUPixBuf[chrBufIndex], chrVPixBuf[chrBufIndex], + chrDstW, src1, src2, chrSrcW, chrXInc, + hChrFilter, hChrFilterPos, hChrFilterSize, + formatConvBuffer, pal); lastInChrBuf++; DEBUG_BUFFERS("\t\tchrBufIndex %d: lastInChrBuf: %d\n", chrBufIndex, lastInChrBuf); @@ -2754,104 +445,59 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], if (!enough_lines) break; //we can't output a dstY line so let's try with the next slice -#if 
COMPILE_TEMPLATE_MMX - c->blueDither= ff_dither8[dstY&1]; - if (c->dstFormat == PIX_FMT_RGB555 || c->dstFormat == PIX_FMT_BGR555) - c->greenDither= ff_dither8[dstY&1]; - else - c->greenDither= ff_dither4[dstY&1]; - c->redDither= ff_dither8[(dstY+1)&1]; +#if HAVE_MMX + updateMMXDitherTables(c, dstY, lumBufIndex, chrBufIndex, lastInLumBuf, lastInChrBuf); #endif if (dstY < dstH-2) { const int16_t **lumSrcPtr= (const int16_t **) lumPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize; - const int16_t **chrSrcPtr= (const int16_t **) chrPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **chrUSrcPtr= (const int16_t **) chrUPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **chrVSrcPtr= (const int16_t **) chrVPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; const int16_t **alpSrcPtr= (CONFIG_SWSCALE_ALPHA && alpPixBuf) ? (const int16_t **) alpPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize : NULL; -#if COMPILE_TEMPLATE_MMX - int i; - if (flags & SWS_ACCURATE_RND) { - int s= APCK_SIZE / 8; - for (i=0; i<vLumFilterSize; i+=2) { - *(const void**)&lumMmxFilter[s*i ]= lumSrcPtr[i ]; - *(const void**)&lumMmxFilter[s*i+APCK_PTR2/4 ]= lumSrcPtr[i+(vLumFilterSize>1)]; - lumMmxFilter[s*i+APCK_COEF/4 ]= - lumMmxFilter[s*i+APCK_COEF/4+1]= vLumFilter[dstY*vLumFilterSize + i ] - + (vLumFilterSize>1 ? vLumFilter[dstY*vLumFilterSize + i + 1]<<16 : 0); - if (CONFIG_SWSCALE_ALPHA && alpPixBuf) { - *(const void**)&alpMmxFilter[s*i ]= alpSrcPtr[i ]; - *(const void**)&alpMmxFilter[s*i+APCK_PTR2/4 ]= alpSrcPtr[i+(vLumFilterSize>1)]; - alpMmxFilter[s*i+APCK_COEF/4 ]= - alpMmxFilter[s*i+APCK_COEF/4+1]= lumMmxFilter[s*i+APCK_COEF/4 ]; - } - } - for (i=0; i<vChrFilterSize; i+=2) { - *(const void**)&chrMmxFilter[s*i ]= chrSrcPtr[i ]; - *(const void**)&chrMmxFilter[s*i+APCK_PTR2/4 ]= chrSrcPtr[i+(vChrFilterSize>1)]; - chrMmxFilter[s*i+APCK_COEF/4 ]= - chrMmxFilter[s*i+APCK_COEF/4+1]= vChrFilter[chrDstY*vChrFilterSize + i ] - + (vChrFilterSize>1 ? 
vChrFilter[chrDstY*vChrFilterSize + i + 1]<<16 : 0); - } - } else { - for (i=0; i<vLumFilterSize; i++) { - lumMmxFilter[4*i+0]= (int32_t)lumSrcPtr[i]; - lumMmxFilter[4*i+1]= (uint64_t)lumSrcPtr[i] >> 32; - lumMmxFilter[4*i+2]= - lumMmxFilter[4*i+3]= - ((uint16_t)vLumFilter[dstY*vLumFilterSize + i])*0x10001; - if (CONFIG_SWSCALE_ALPHA && alpPixBuf) { - alpMmxFilter[4*i+0]= (int32_t)alpSrcPtr[i]; - alpMmxFilter[4*i+1]= (uint64_t)alpSrcPtr[i] >> 32; - alpMmxFilter[4*i+2]= - alpMmxFilter[4*i+3]= lumMmxFilter[4*i+2]; - } - } - for (i=0; i<vChrFilterSize; i++) { - chrMmxFilter[4*i+0]= (int32_t)chrSrcPtr[i]; - chrMmxFilter[4*i+1]= (uint64_t)chrSrcPtr[i] >> 32; - chrMmxFilter[4*i+2]= - chrMmxFilter[4*i+3]= - ((uint16_t)vChrFilter[chrDstY*vChrFilterSize + i])*0x10001; - } - } -#endif if (dstFormat == PIX_FMT_NV12 || dstFormat == PIX_FMT_NV21) { const int chrSkipMask= (1<<c->chrDstVSubSample)-1; if (dstY&chrSkipMask) uDest= NULL; //FIXME split functions in lumi / chromi c->yuv2nv12X(c, vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, - vChrFilter+chrDstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, - dest, uDest, dstW, chrDstW, dstFormat); + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, + dest, uDest, dstW, chrDstW, dstFormat, lumDither, chrDither); } else if (isPlanarYUV(dstFormat) || dstFormat==PIX_FMT_GRAY8) { //YV12 like const int chrSkipMask= (1<<c->chrDstVSubSample)-1; if ((dstY&chrSkipMask) || isGray(dstFormat)) uDest=vDest= NULL; //FIXME split functions in lumi / chromi if (is16BPS(dstFormat) || isNBPS(dstFormat)) { - yuv2yuvX16inC( - vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, - vChrFilter+chrDstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, - alpSrcPtr, (uint16_t *) dest, (uint16_t *) uDest, (uint16_t *) vDest, (uint16_t *) aDest, dstW, chrDstW, + yuv2yuvX16inC(vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, + chrVSrcPtr, vChrFilterSize, + alpSrcPtr, (uint16_t *) dest, (uint16_t *) uDest, + (uint16_t *) vDest, (uint16_t *) aDest, dstW, chrDstW, dstFormat); } else if (vLumFilterSize == 1 && vChrFilterSize == 1) { // unscaled YV12 const int16_t *lumBuf = lumSrcPtr[0]; - const int16_t *chrBuf= chrSrcPtr[0]; + const int16_t *chrUBuf= chrUSrcPtr[0]; + const int16_t *chrVBuf= chrVSrcPtr[0]; const int16_t *alpBuf= (CONFIG_SWSCALE_ALPHA && alpPixBuf) ? 
alpSrcPtr[0] : NULL; - c->yuv2yuv1(c, lumBuf, chrBuf, alpBuf, dest, uDest, vDest, aDest, dstW, chrDstW); + c->yuv2yuv1(c, lumBuf, chrUBuf, chrVBuf, alpBuf, dest, + uDest, vDest, aDest, dstW, chrDstW, lumDither, chrDither); } else { //General YV12 c->yuv2yuvX(c, vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, - vChrFilter+chrDstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, - alpSrcPtr, dest, uDest, vDest, aDest, dstW, chrDstW); + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, + chrVSrcPtr, vChrFilterSize, + alpSrcPtr, dest, uDest, vDest, aDest, dstW, chrDstW, lumDither, chrDither); } } else { - assert(lumSrcPtr + vLumFilterSize - 1 < lumPixBuf + vLumBufSize*2); - assert(chrSrcPtr + vChrFilterSize - 1 < chrPixBuf + vChrBufSize*2); + assert(lumSrcPtr + vLumFilterSize - 1 < lumPixBuf + vLumBufSize*2); + assert(chrUSrcPtr + vChrFilterSize - 1 < chrUPixBuf + vChrBufSize*2); if (vLumFilterSize == 1 && vChrFilterSize == 2) { //unscaled RGB int chrAlpha= vChrFilter[2*dstY+1]; if(flags & SWS_FULL_CHR_H_INT) { yuv2rgbXinC_full(c, //FIXME write a packed1_full function vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, - vChrFilter+dstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, + chrVSrcPtr, vChrFilterSize, alpSrcPtr, dest, dstW, dstY); } else { - c->yuv2packed1(c, *lumSrcPtr, *chrSrcPtr, *(chrSrcPtr+1), + c->yuv2packed1(c, *lumSrcPtr, *chrUSrcPtr, *(chrUSrcPtr+1), + *chrVSrcPtr, *(chrVSrcPtr+1), alpPixBuf ? *alpSrcPtr : NULL, dest, dstW, chrAlpha, dstFormat, flags, dstY); } @@ -2865,10 +511,11 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], if(flags & SWS_FULL_CHR_H_INT) { yuv2rgbXinC_full(c, //FIXME write a packed2_full function vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, - vChrFilter+dstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, alpSrcPtr, dest, dstW, dstY); } else { - c->yuv2packed2(c, *lumSrcPtr, *(lumSrcPtr+1), *chrSrcPtr, *(chrSrcPtr+1), + c->yuv2packed2(c, *lumSrcPtr, *(lumSrcPtr+1), *chrUSrcPtr, *(chrUSrcPtr+1), + *chrVSrcPtr, *(chrVSrcPtr+1), alpPixBuf ? *alpSrcPtr : NULL, alpPixBuf ? *(alpSrcPtr+1) : NULL, dest, dstW, lumAlpha, chrAlpha, dstY); } @@ -2876,54 +523,55 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], if(flags & SWS_FULL_CHR_H_INT) { yuv2rgbXinC_full(c, vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, - vChrFilter+dstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, alpSrcPtr, dest, dstW, dstY); } else { c->yuv2packedX(c, vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, - vChrFilter+dstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, alpSrcPtr, dest, dstW, dstY); } } } } else { // hmm looks like we can't use MMX here without overwriting this array's tail const int16_t **lumSrcPtr= (const int16_t **)lumPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize; - const int16_t **chrSrcPtr= (const int16_t **)chrPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **chrUSrcPtr= (const int16_t **)chrUPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **chrVSrcPtr= (const int16_t **)chrVPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; const int16_t **alpSrcPtr= (CONFIG_SWSCALE_ALPHA && alpPixBuf) ? 
(const int16_t **)alpPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize : NULL; if (dstFormat == PIX_FMT_NV12 || dstFormat == PIX_FMT_NV21) { const int chrSkipMask= (1<<c->chrDstVSubSample)-1; if (dstY&chrSkipMask) uDest= NULL; //FIXME split functions in lumi / chromi yuv2nv12XinC( vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, - vChrFilter+chrDstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, - dest, uDest, dstW, chrDstW, dstFormat); + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, + dest, uDest, dstW, chrDstW, dstFormat, lumDither, chrDither); } else if (isPlanarYUV(dstFormat) || dstFormat==PIX_FMT_GRAY8) { //YV12 const int chrSkipMask= (1<<c->chrDstVSubSample)-1; if ((dstY&chrSkipMask) || isGray(dstFormat)) uDest=vDest= NULL; //FIXME split functions in lumi / chromi if (is16BPS(dstFormat) || isNBPS(dstFormat)) { yuv2yuvX16inC( vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, - vChrFilter+chrDstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, alpSrcPtr, (uint16_t *) dest, (uint16_t *) uDest, (uint16_t *) vDest, (uint16_t *) aDest, dstW, chrDstW, dstFormat); } else { yuv2yuvXinC( vLumFilter+dstY*vLumFilterSize , lumSrcPtr, vLumFilterSize, - vChrFilter+chrDstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, - alpSrcPtr, dest, uDest, vDest, aDest, dstW, chrDstW); + vChrFilter+chrDstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, + alpSrcPtr, dest, uDest, vDest, aDest, dstW, chrDstW, lumDither, chrDither); } } else { assert(lumSrcPtr + vLumFilterSize - 1 < lumPixBuf + vLumBufSize*2); - assert(chrSrcPtr + vChrFilterSize - 1 < chrPixBuf + vChrBufSize*2); + assert(chrUSrcPtr + vChrFilterSize - 1 < chrUPixBuf + vChrBufSize*2); if(flags & SWS_FULL_CHR_H_INT) { yuv2rgbXinC_full(c, vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, - vChrFilter+dstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, alpSrcPtr, dest, dstW, dstY); } else { yuv2packedXinC(c, vLumFilter+dstY*vLumFilterSize, lumSrcPtr, vLumFilterSize, - vChrFilter+dstY*vChrFilterSize, chrSrcPtr, vChrFilterSize, + vChrFilter+dstY*vChrFilterSize, chrUSrcPtr, chrVSrcPtr, vChrFilterSize, alpSrcPtr, dest, dstW, dstY); } } @@ -2933,12 +581,12 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], if ((dstFormat == PIX_FMT_YUVA420P) && !alpPixBuf) fillPlane(dst[3], dstStride[3], dstW, dstY-lastDstY, lastDstY, 255); -#if COMPILE_TEMPLATE_MMX - if (flags & SWS_CPU_CAPS_MMX2 ) __asm__ volatile("sfence":::"memory"); - /* On K6 femms is faster than emms. On K7 femms is directly mapped to emms. 
*/ - if (flags & SWS_CPU_CAPS_3DNOW) __asm__ volatile("femms" :::"memory"); - else __asm__ volatile("emms" :::"memory"); +#if HAVE_MMX2 + if (av_get_cpu_flags() & AV_CPU_FLAG_MMX2) + __asm__ volatile("sfence":::"memory"); #endif + emms_c(); + /* store changed local vars back in the context */ c->dstY= dstY; c->lumBufIndex= lumBufIndex; @@ -2949,85 +597,82 @@ static int RENAME(swScale)(SwsContext *c, const uint8_t* src[], int srcStride[], return dstY - lastDstY; } -static void RENAME(sws_init_swScale)(SwsContext *c) +static void sws_init_swScale_c(SwsContext *c) { enum PixelFormat srcFormat = c->srcFormat; - c->yuv2nv12X = RENAME(yuv2nv12X ); - c->yuv2yuv1 = RENAME(yuv2yuv1 ); - c->yuv2yuvX = RENAME(yuv2yuvX ); - c->yuv2packed1 = RENAME(yuv2packed1 ); - c->yuv2packed2 = RENAME(yuv2packed2 ); - c->yuv2packedX = RENAME(yuv2packedX ); + c->yuv2nv12X = yuv2nv12X_c; + c->yuv2yuv1 = yuv2yuv1_c; + c->yuv2yuvX = yuv2yuvX_c; + c->yuv2packed1 = yuv2packed1_c; + c->yuv2packed2 = yuv2packed2_c; + c->yuv2packedX = yuv2packedX_c; - c->hScale = RENAME(hScale ); + c->hScale = hScale_c; -#if COMPILE_TEMPLATE_MMX - // Use the new MMX scaler if the MMX2 one can't be used (it is faster than the x86 ASM one). - if (c->flags & SWS_FAST_BILINEAR && c->canMMX2BeUsed) -#else if (c->flags & SWS_FAST_BILINEAR) -#endif { - c->hyscale_fast = RENAME(hyscale_fast); - c->hcscale_fast = RENAME(hcscale_fast); + c->hyscale_fast = hyscale_fast_c; + c->hcscale_fast = hcscale_fast_c; } c->chrToYV12 = NULL; switch(srcFormat) { - case PIX_FMT_YUYV422 : c->chrToYV12 = RENAME(yuy2ToUV); break; - case PIX_FMT_UYVY422 : c->chrToYV12 = RENAME(uyvyToUV); break; - case PIX_FMT_NV12 : c->chrToYV12 = RENAME(nv12ToUV); break; - case PIX_FMT_NV21 : c->chrToYV12 = RENAME(nv21ToUV); break; + case PIX_FMT_YUYV422 : c->chrToYV12 = yuy2ToUV_c; break; + case PIX_FMT_UYVY422 : c->chrToYV12 = uyvyToUV_c; break; + case PIX_FMT_NV12 : c->chrToYV12 = nv12ToUV_c; break; + case PIX_FMT_NV21 : c->chrToYV12 = nv21ToUV_c; break; case PIX_FMT_RGB8 : case PIX_FMT_BGR8 : case PIX_FMT_PAL8 : case PIX_FMT_BGR4_BYTE: case PIX_FMT_RGB4_BYTE: c->chrToYV12 = palToUV; break; - case PIX_FMT_YUV420P9BE: c->chrToYV12 = BE9ToUV_c; break; - case PIX_FMT_YUV420P9LE: c->chrToYV12 = LE9ToUV_c; break; + case PIX_FMT_GRAY16BE : + case PIX_FMT_YUV420P9BE: case PIX_FMT_YUV422P10BE: - case PIX_FMT_YUV420P10BE: c->chrToYV12 = BE10ToUV_c; break; - case PIX_FMT_YUV422P10LE: - case PIX_FMT_YUV420P10LE: c->chrToYV12 = LE10ToUV_c; break; + case PIX_FMT_YUV420P10BE: case PIX_FMT_YUV420P16BE: case PIX_FMT_YUV422P16BE: - case PIX_FMT_YUV444P16BE: c->chrToYV12 = RENAME(BEToUV); break; + case PIX_FMT_YUV444P16BE: c->hScale16= HAVE_BIGENDIAN ? hScale16_c : hScale16X_c; break; + case PIX_FMT_GRAY16LE : + case PIX_FMT_YUV420P9LE: + case PIX_FMT_YUV422P10LE: + case PIX_FMT_YUV420P10LE: case PIX_FMT_YUV420P16LE: case PIX_FMT_YUV422P16LE: - case PIX_FMT_YUV444P16LE: c->chrToYV12 = RENAME(LEToUV); break; + case PIX_FMT_YUV444P16LE: c->hScale16= HAVE_BIGENDIAN ? 
hScale16X_c : hScale16_c; break; } if (c->chrSrcHSubSample) { switch(srcFormat) { - case PIX_FMT_RGB48BE: - case PIX_FMT_RGB48LE: c->chrToYV12 = rgb48ToUV_half; break; - case PIX_FMT_BGR48BE: - case PIX_FMT_BGR48LE: c->chrToYV12 = bgr48ToUV_half; break; + case PIX_FMT_RGB48BE: c->chrToYV12 = rgb48BEToUV_half; break; + case PIX_FMT_RGB48LE: c->chrToYV12 = rgb48LEToUV_half; break; + case PIX_FMT_BGR48BE: c->chrToYV12 = bgr48BEToUV_half; break; + case PIX_FMT_BGR48LE: c->chrToYV12 = bgr48LEToUV_half; break; case PIX_FMT_RGB32 : c->chrToYV12 = bgr32ToUV_half; break; case PIX_FMT_RGB32_1: c->chrToYV12 = bgr321ToUV_half; break; - case PIX_FMT_BGR24 : c->chrToYV12 = RENAME(bgr24ToUV_half); break; + case PIX_FMT_BGR24 : c->chrToYV12 = bgr24ToUV_half_c; break; case PIX_FMT_BGR565 : c->chrToYV12 = bgr16ToUV_half; break; case PIX_FMT_BGR555 : c->chrToYV12 = bgr15ToUV_half; break; case PIX_FMT_BGR32 : c->chrToYV12 = rgb32ToUV_half; break; case PIX_FMT_BGR32_1: c->chrToYV12 = rgb321ToUV_half; break; - case PIX_FMT_RGB24 : c->chrToYV12 = RENAME(rgb24ToUV_half); break; + case PIX_FMT_RGB24 : c->chrToYV12 = rgb24ToUV_half_c; break; case PIX_FMT_RGB565 : c->chrToYV12 = rgb16ToUV_half; break; case PIX_FMT_RGB555 : c->chrToYV12 = rgb15ToUV_half; break; } } else { switch(srcFormat) { - case PIX_FMT_RGB48BE: - case PIX_FMT_RGB48LE: c->chrToYV12 = rgb48ToUV; break; - case PIX_FMT_BGR48BE: - case PIX_FMT_BGR48LE: c->chrToYV12 = bgr48ToUV; break; + case PIX_FMT_RGB48BE: c->chrToYV12 = rgb48BEToUV; break; + case PIX_FMT_RGB48LE: c->chrToYV12 = rgb48LEToUV; break; + case PIX_FMT_BGR48BE: c->chrToYV12 = bgr48BEToUV; break; + case PIX_FMT_BGR48LE: c->chrToYV12 = bgr48LEToUV; break; case PIX_FMT_RGB32 : c->chrToYV12 = bgr32ToUV; break; case PIX_FMT_RGB32_1: c->chrToYV12 = bgr321ToUV; break; - case PIX_FMT_BGR24 : c->chrToYV12 = RENAME(bgr24ToUV); break; + case PIX_FMT_BGR24 : c->chrToYV12 = bgr24ToUV_c; break; case PIX_FMT_BGR565 : c->chrToYV12 = bgr16ToUV; break; case PIX_FMT_BGR555 : c->chrToYV12 = bgr15ToUV; break; case PIX_FMT_BGR32 : c->chrToYV12 = rgb32ToUV; break; case PIX_FMT_BGR32_1: c->chrToYV12 = rgb321ToUV; break; - case PIX_FMT_RGB24 : c->chrToYV12 = RENAME(rgb24ToUV); break; + case PIX_FMT_RGB24 : c->chrToYV12 = rgb24ToUV_c; break; case PIX_FMT_RGB565 : c->chrToYV12 = rgb16ToUV; break; case PIX_FMT_RGB555 : c->chrToYV12 = rgb15ToUV; break; } @@ -3036,27 +681,15 @@ static void RENAME(sws_init_swScale)(SwsContext *c) c->lumToYV12 = NULL; c->alpToYV12 = NULL; switch (srcFormat) { - case PIX_FMT_YUV420P9BE: c->lumToYV12 = BE9ToY_c; break; - case PIX_FMT_YUV420P9LE: c->lumToYV12 = LE9ToY_c; break; - case PIX_FMT_YUV422P10BE: - case PIX_FMT_YUV420P10BE: c->lumToYV12 = BE10ToY_c; break; - case PIX_FMT_YUV422P10LE: - case PIX_FMT_YUV420P10LE: c->lumToYV12 = LE10ToY_c; break; case PIX_FMT_YUYV422 : - case PIX_FMT_YUV420P16BE: - case PIX_FMT_YUV422P16BE: - case PIX_FMT_YUV444P16BE: case PIX_FMT_GRAY8A : - case PIX_FMT_GRAY16BE : c->lumToYV12 = RENAME(yuy2ToY); break; + c->lumToYV12 = yuy2ToY_c; break; case PIX_FMT_UYVY422 : - case PIX_FMT_YUV420P16LE: - case PIX_FMT_YUV422P16LE: - case PIX_FMT_YUV444P16LE: - case PIX_FMT_GRAY16LE : c->lumToYV12 = RENAME(uyvyToY); break; - case PIX_FMT_BGR24 : c->lumToYV12 = RENAME(bgr24ToY); break; + c->lumToYV12 = uyvyToY_c; break; + case PIX_FMT_BGR24 : c->lumToYV12 = bgr24ToY_c; break; case PIX_FMT_BGR565 : c->lumToYV12 = bgr16ToY; break; case PIX_FMT_BGR555 : c->lumToYV12 = bgr15ToY; break; - case PIX_FMT_RGB24 : c->lumToYV12 = RENAME(rgb24ToY); break; + case PIX_FMT_RGB24 : 
c->lumToYV12 = rgb24ToY_c; break; case PIX_FMT_RGB565 : c->lumToYV12 = rgb16ToY; break; case PIX_FMT_RGB555 : c->lumToYV12 = rgb15ToY; break; case PIX_FMT_RGB8 : @@ -3070,10 +703,10 @@ static void RENAME(sws_init_swScale)(SwsContext *c) case PIX_FMT_RGB32_1: c->lumToYV12 = bgr321ToY; break; case PIX_FMT_BGR32 : c->lumToYV12 = rgb32ToY; break; case PIX_FMT_BGR32_1: c->lumToYV12 = rgb321ToY; break; - case PIX_FMT_RGB48BE: - case PIX_FMT_RGB48LE: c->lumToYV12 = rgb48ToY; break; - case PIX_FMT_BGR48BE: - case PIX_FMT_BGR48LE: c->lumToYV12 = bgr48ToY; break; + case PIX_FMT_RGB48BE: c->lumToYV12 = rgb48BEToY; break; + case PIX_FMT_RGB48LE: c->lumToYV12 = rgb48LEToY; break; + case PIX_FMT_BGR48BE: c->lumToYV12 = bgr48BEToY; break; + case PIX_FMT_BGR48LE: c->lumToYV12 = bgr48LEToY; break; } if (c->alpPixBuf) { switch (srcFormat) { @@ -3081,11 +714,14 @@ static void RENAME(sws_init_swScale)(SwsContext *c) case PIX_FMT_RGB32_1: case PIX_FMT_BGR32 : case PIX_FMT_BGR32_1: c->alpToYV12 = abgrToA; break; - case PIX_FMT_GRAY8A : c->alpToYV12 = RENAME(yuy2ToY); break; + case PIX_FMT_GRAY8A : c->alpToYV12 = yuy2ToY_c; break; case PIX_FMT_PAL8 : c->alpToYV12 = palToA; break; } } + if(isAnyRGB(c->srcFormat) || c->srcFormat == PIX_FMT_PAL8) + c->hScale16= hScale16_c; + switch (srcFormat) { case PIX_FMT_GRAY8A : c->alpSrcOffset = 1; @@ -3094,21 +730,15 @@ static void RENAME(sws_init_swScale)(SwsContext *c) case PIX_FMT_BGR32 : c->alpSrcOffset = 3; break; - case PIX_FMT_RGB48LE: - case PIX_FMT_BGR48LE: - c->lumSrcOffset = 1; - c->chrSrcOffset = 1; - c->alpSrcOffset = 1; - break; } if (c->srcRange != c->dstRange && !isAnyRGB(c->dstFormat)) { if (c->srcRange) { - c->lumConvertRange = RENAME(lumRangeFromJpeg); - c->chrConvertRange = RENAME(chrRangeFromJpeg); + c->lumConvertRange = lumRangeFromJpeg_c; + c->chrConvertRange = chrRangeFromJpeg_c; } else { - c->lumConvertRange = RENAME(lumRangeToJpeg); - c->chrConvertRange = RENAME(chrRangeToJpeg); + c->lumConvertRange = lumRangeToJpeg_c; + c->chrConvertRange = chrRangeToJpeg_c; } } diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c new file mode 100644 index 0000000000..e0c4b25846 --- /dev/null +++ b/libswscale/swscale_unscaled.c @@ -0,0 +1,849 @@ +/* + * Copyright (C) 2001-2003 Michael Niedermayer <michaelni@gmx.at> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <inttypes.h> +#include <string.h> +#include <math.h> +#include <stdio.h> +#include "config.h" +#include <assert.h> +#include "swscale.h" +#include "swscale_internal.h" +#include "rgb2rgb.h" +#include "libavutil/intreadwrite.h" +#include "libavutil/cpu.h" +#include "libavutil/avutil.h" +#include "libavutil/mathematics.h" +#include "libavutil/bswap.h" +#include "libavutil/pixdesc.h" + +#define RGB2YUV_SHIFT 15 +#define BY ( (int)(0.114*219/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define BV (-(int)(0.081*224/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define BU ( (int)(0.500*224/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define GY ( (int)(0.587*219/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define GV (-(int)(0.419*224/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define GU (-(int)(0.331*224/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define RY ( (int)(0.299*219/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define RV ( (int)(0.500*224/255*(1<<RGB2YUV_SHIFT)+0.5)) +#define RU (-(int)(0.169*224/255*(1<<RGB2YUV_SHIFT)+0.5)) + +static void fillPlane(uint8_t* plane, int stride, int width, int height, int y, uint8_t val) +{ + int i; + uint8_t *ptr = plane + stride*y; + for (i=0; i<height; i++) { + memset(ptr, val, width); + ptr += stride; + } +} + +static void copyPlane(const uint8_t *src, int srcStride, + int srcSliceY, int srcSliceH, int width, + uint8_t *dst, int dstStride) +{ + dst += dstStride * srcSliceY; + if (dstStride == srcStride && srcStride > 0) { + memcpy(dst, src, srcSliceH * dstStride); + } else { + int i; + for (i=0; i<srcSliceH; i++) { + memcpy(dst, src, width); + src += srcStride; + dst += dstStride; + } + } +} + +static int planarToNv12Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *dst = dstParam[1] + dstStride[1]*srcSliceY/2; + + copyPlane(src[0], srcStride[0], srcSliceY, srcSliceH, c->srcW, + dstParam[0], dstStride[0]); + + if (c->dstFormat == PIX_FMT_NV12) + interleaveBytes(src[1], src[2], dst, c->srcW/2, srcSliceH/2, srcStride[1], srcStride[2], dstStride[0]); + else + interleaveBytes(src[2], src[1], dst, c->srcW/2, srcSliceH/2, srcStride[2], srcStride[1], dstStride[0]); + + return srcSliceH; +} + +static int planarToYuy2Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; + + yv12toyuy2(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]); + + return srcSliceH; +} + +static int planarToUyvyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; + + yv12touyvy(src[0], src[1], src[2], dst, c->srcW, srcSliceH, srcStride[0], srcStride[1], dstStride[0]); + + return srcSliceH; +} + +static int yuv422pToYuy2Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; + + yuv422ptoyuy2(src[0],src[1],src[2],dst,c->srcW,srcSliceH,srcStride[0],srcStride[1],dstStride[0]); + + return srcSliceH; +} + +static int yuv422pToUyvyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* 
dstParam[], int dstStride[]) +{ + uint8_t *dst=dstParam[0] + dstStride[0]*srcSliceY; + + yuv422ptouyvy(src[0],src[1],src[2],dst,c->srcW,srcSliceH,srcStride[0],srcStride[1],dstStride[0]); + + return srcSliceH; +} + +static int yuyvToYuv420Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; + uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY/2; + uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY/2; + + yuyvtoyuv420(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); + + if (dstParam[3]) + fillPlane(dstParam[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); + + return srcSliceH; +} + +static int yuyvToYuv422Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; + uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY; + uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY; + + yuyvtoyuv422(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); + + return srcSliceH; +} + +static int uyvyToYuv420Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; + uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY/2; + uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY/2; + + uyvytoyuv420(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); + + if (dstParam[3]) + fillPlane(dstParam[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); + + return srcSliceH; +} + +static int uyvyToYuv422Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dstParam[], int dstStride[]) +{ + uint8_t *ydst=dstParam[0] + dstStride[0]*srcSliceY; + uint8_t *udst=dstParam[1] + dstStride[1]*srcSliceY; + uint8_t *vdst=dstParam[2] + dstStride[2]*srcSliceY; + + uyvytoyuv422(ydst, udst, vdst, src[0], c->srcW, srcSliceH, dstStride[0], dstStride[1], srcStride[0]); + + return srcSliceH; +} + +static void gray8aToPacked32(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette) +{ + int i; + for (i=0; i<num_pixels; i++) + ((uint32_t *) dst)[i] = ((const uint32_t *)palette)[src[i<<1]] | (src[(i<<1)+1] << 24); +} + +static void gray8aToPacked32_1(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette) +{ + int i; + + for (i=0; i<num_pixels; i++) + ((uint32_t *) dst)[i] = ((const uint32_t *)palette)[src[i<<1]] | src[(i<<1)+1]; +} + +static void gray8aToPacked24(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette) +{ + int i; + + for (i=0; i<num_pixels; i++) { + //FIXME slow? 
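+        /* Each GRAY8A input pixel is two bytes: src[i<<1] is the gray value,
+         * used as an index into the 4-bytes-per-entry palette, and
+         * src[(i<<1)+1] is the alpha byte, which packed 24-bit output drops. */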
+ dst[0]= palette[src[i<<1]*4+0]; + dst[1]= palette[src[i<<1]*4+1]; + dst[2]= palette[src[i<<1]*4+2]; + dst+= 3; + } +} + +static int palToRgbWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + const enum PixelFormat srcFormat= c->srcFormat; + const enum PixelFormat dstFormat= c->dstFormat; + void (*conv)(const uint8_t *src, uint8_t *dst, int num_pixels, + const uint8_t *palette)=NULL; + int i; + uint8_t *dstPtr= dst[0] + dstStride[0]*srcSliceY; + const uint8_t *srcPtr= src[0]; + + if (srcFormat == PIX_FMT_GRAY8A) { + switch (dstFormat) { + case PIX_FMT_RGB32 : conv = gray8aToPacked32; break; + case PIX_FMT_BGR32 : conv = gray8aToPacked32; break; + case PIX_FMT_BGR32_1: conv = gray8aToPacked32_1; break; + case PIX_FMT_RGB32_1: conv = gray8aToPacked32_1; break; + case PIX_FMT_RGB24 : conv = gray8aToPacked24; break; + case PIX_FMT_BGR24 : conv = gray8aToPacked24; break; + } + } else if (usePal(srcFormat)) { + switch (dstFormat) { + case PIX_FMT_RGB32 : conv = sws_convertPalette8ToPacked32; break; + case PIX_FMT_BGR32 : conv = sws_convertPalette8ToPacked32; break; + case PIX_FMT_BGR32_1: conv = sws_convertPalette8ToPacked32; break; + case PIX_FMT_RGB32_1: conv = sws_convertPalette8ToPacked32; break; + case PIX_FMT_RGB24 : conv = sws_convertPalette8ToPacked24; break; + case PIX_FMT_BGR24 : conv = sws_convertPalette8ToPacked24; break; + } + } + + if (!conv) + av_log(c, AV_LOG_ERROR, "internal error %s -> %s converter\n", + av_get_pix_fmt_name(srcFormat), av_get_pix_fmt_name(dstFormat)); + else { + for (i=0; i<srcSliceH; i++) { + conv(srcPtr, dstPtr, c->srcW, (uint8_t *) c->pal_rgb); + srcPtr+= srcStride[0]; + dstPtr+= dstStride[0]; + } + } + + return srcSliceH; +} + +#define isRGBA32(x) ( \ + (x) == PIX_FMT_ARGB \ + || (x) == PIX_FMT_RGBA \ + || (x) == PIX_FMT_BGRA \ + || (x) == PIX_FMT_ABGR \ + ) + +/* {RGB,BGR}{15,16,24,32,32_1} -> {RGB,BGR}{15,16,24,32} */ +static int rgbToRgbWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + const enum PixelFormat srcFormat= c->srcFormat; + const enum PixelFormat dstFormat= c->dstFormat; + const int srcBpp= (c->srcFormatBpp + 7) >> 3; + const int dstBpp= (c->dstFormatBpp + 7) >> 3; + const int srcId= c->srcFormatBpp >> 2; /* 1:0, 4:1, 8:2, 15:3, 16:4, 24:6, 32:8 */ + const int dstId= c->dstFormatBpp >> 2; + void (*conv)(const uint8_t *src, uint8_t *dst, int src_size)=NULL; + +#define CONV_IS(src, dst) (srcFormat == PIX_FMT_##src && dstFormat == PIX_FMT_##dst) + + if (isRGBA32(srcFormat) && isRGBA32(dstFormat)) { + if ( CONV_IS(ABGR, RGBA) + || CONV_IS(ARGB, BGRA) + || CONV_IS(BGRA, ARGB) + || CONV_IS(RGBA, ABGR)) conv = shuffle_bytes_3210; + else if (CONV_IS(ABGR, ARGB) + || CONV_IS(ARGB, ABGR)) conv = shuffle_bytes_0321; + else if (CONV_IS(ABGR, BGRA) + || CONV_IS(ARGB, RGBA)) conv = shuffle_bytes_1230; + else if (CONV_IS(BGRA, RGBA) + || CONV_IS(RGBA, BGRA)) conv = shuffle_bytes_2103; + else if (CONV_IS(BGRA, ABGR) + || CONV_IS(RGBA, ARGB)) conv = shuffle_bytes_3012; + } else + /* BGR -> BGR */ + if ( (isBGRinInt(srcFormat) && isBGRinInt(dstFormat)) + || (isRGBinInt(srcFormat) && isRGBinInt(dstFormat))) { + switch(srcId | (dstId<<4)) { + case 0x34: conv= rgb16to15; break; + case 0x36: conv= rgb24to15; break; + case 0x38: conv= rgb32to15; break; + case 0x43: conv= rgb15to16; break; + case 0x46: conv= rgb24to16; break; + case 0x48: conv= rgb32to16; break; + case 0x63: conv= rgb15to24; break; + case 
0x64: conv= rgb16to24; break; + case 0x68: conv= rgb32to24; break; + case 0x83: conv= rgb15to32; break; + case 0x84: conv= rgb16to32; break; + case 0x86: conv= rgb24to32; break; + } + } else if ( (isBGRinInt(srcFormat) && isRGBinInt(dstFormat)) + || (isRGBinInt(srcFormat) && isBGRinInt(dstFormat))) { + switch(srcId | (dstId<<4)) { + case 0x33: conv= rgb15tobgr15; break; + case 0x34: conv= rgb16tobgr15; break; + case 0x36: conv= rgb24tobgr15; break; + case 0x38: conv= rgb32tobgr15; break; + case 0x43: conv= rgb15tobgr16; break; + case 0x44: conv= rgb16tobgr16; break; + case 0x46: conv= rgb24tobgr16; break; + case 0x48: conv= rgb32tobgr16; break; + case 0x63: conv= rgb15tobgr24; break; + case 0x64: conv= rgb16tobgr24; break; + case 0x66: conv= rgb24tobgr24; break; + case 0x68: conv= rgb32tobgr24; break; + case 0x83: conv= rgb15tobgr32; break; + case 0x84: conv= rgb16tobgr32; break; + case 0x86: conv= rgb24tobgr32; break; + } + } + + if (!conv) { + av_log(c, AV_LOG_ERROR, "internal error %s -> %s converter\n", + av_get_pix_fmt_name(srcFormat), av_get_pix_fmt_name(dstFormat)); + } else { + const uint8_t *srcPtr= src[0]; + uint8_t *dstPtr= dst[0]; + if ((srcFormat == PIX_FMT_RGB32_1 || srcFormat == PIX_FMT_BGR32_1) && !isRGBA32(dstFormat)) + srcPtr += ALT32_CORR; + + if ((dstFormat == PIX_FMT_RGB32_1 || dstFormat == PIX_FMT_BGR32_1) && !isRGBA32(srcFormat)) + dstPtr += ALT32_CORR; + + if (dstStride[0]*srcBpp == srcStride[0]*dstBpp && srcStride[0] > 0 && !(srcStride[0]%srcBpp)) + conv(srcPtr, dstPtr + dstStride[0]*srcSliceY, srcSliceH*srcStride[0]); + else { + int i; + dstPtr += dstStride[0]*srcSliceY; + + for (i=0; i<srcSliceH; i++) { + conv(srcPtr, dstPtr, c->srcW*srcBpp); + srcPtr+= srcStride[0]; + dstPtr+= dstStride[0]; + } + } + } + return srcSliceH; +} + +static int bgr24ToYv12Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + rgb24toyv12( + src[0], + dst[0]+ srcSliceY *dstStride[0], + dst[1]+(srcSliceY>>1)*dstStride[1], + dst[2]+(srcSliceY>>1)*dstStride[2], + c->srcW, srcSliceH, + dstStride[0], dstStride[1], srcStride[0]); + if (dst[3]) + fillPlane(dst[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); + return srcSliceH; +} + +static int yvu9ToYv12Wrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + copyPlane(src[0], srcStride[0], srcSliceY, srcSliceH, c->srcW, + dst[0], dstStride[0]); + + planar2x(src[1], dst[1] + dstStride[1]*(srcSliceY >> 1), c->chrSrcW, + srcSliceH >> 2, srcStride[1], dstStride[1]); + planar2x(src[2], dst[2] + dstStride[2]*(srcSliceY >> 1), c->chrSrcW, + srcSliceH >> 2, srcStride[2], dstStride[2]); + if (dst[3]) + fillPlane(dst[3], dstStride[3], c->srcW, srcSliceH, srcSliceY, 255); + return srcSliceH; +} + +/* unscaled copy like stuff (assumes nearly identical formats) */ +static int packedCopyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + if (dstStride[0]==srcStride[0] && srcStride[0] > 0) + memcpy(dst[0] + dstStride[0]*srcSliceY, src[0], srcSliceH*dstStride[0]); + else { + int i; + const uint8_t *srcPtr= src[0]; + uint8_t *dstPtr= dst[0] + dstStride[0]*srcSliceY; + int length=0; + + /* universal length finder */ + while(length+c->srcW <= FFABS(dstStride[0]) + && length+c->srcW <= FFABS(srcStride[0])) length+= c->srcW; + assert(length!=0); + + for (i=0; i<srcSliceH; i++) { + memcpy(dstPtr, srcPtr, length); + srcPtr+= 
srcStride[0]; + dstPtr+= dstStride[0]; + } + } + return srcSliceH; +} + +#define DITHER_COPY(dst, dstStride, src, srcStride, bswap, dbswap)\ + uint16_t scale= dither_scale[dst_depth-1][src_depth-1];\ + int shift= src_depth-dst_depth + dither_scale[src_depth-2][dst_depth-1];\ + for (i = 0; i < height; i++) {\ + const uint8_t *dither= dithers[src_depth-9][i&7];\ + for (j = 0; j < length-7; j+=8){\ + dst[j+0] = dbswap((bswap(src[j+0]) + dither[0])*scale>>shift);\ + dst[j+1] = dbswap((bswap(src[j+1]) + dither[1])*scale>>shift);\ + dst[j+2] = dbswap((bswap(src[j+2]) + dither[2])*scale>>shift);\ + dst[j+3] = dbswap((bswap(src[j+3]) + dither[3])*scale>>shift);\ + dst[j+4] = dbswap((bswap(src[j+4]) + dither[4])*scale>>shift);\ + dst[j+5] = dbswap((bswap(src[j+5]) + dither[5])*scale>>shift);\ + dst[j+6] = dbswap((bswap(src[j+6]) + dither[6])*scale>>shift);\ + dst[j+7] = dbswap((bswap(src[j+7]) + dither[7])*scale>>shift);\ + }\ + for (; j < length; j++)\ + dst[j] = dbswap((bswap(src[j]) + dither[j&7])*scale>>shift);\ + dst += dstStride;\ + src += srcStride;\ + } + + +static int planarCopyWrapper(SwsContext *c, const uint8_t* src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + int plane, i, j; + for (plane=0; plane<4; plane++) { + int length= (plane==0 || plane==3) ? c->srcW : -((-c->srcW )>>c->chrDstHSubSample); + int y= (plane==0 || plane==3) ? srcSliceY: -((-srcSliceY)>>c->chrDstVSubSample); + int height= (plane==0 || plane==3) ? srcSliceH: -((-srcSliceH)>>c->chrDstVSubSample); + const uint8_t *srcPtr= src[plane]; + uint8_t *dstPtr= dst[plane] + dstStride[plane]*y; + + if (!dst[plane]) continue; + // ignore palette for GRAY8 + if (plane == 1 && !dst[2]) continue; + if (!src[plane] || (plane == 1 && !src[2])) { + if(is16BPS(c->dstFormat)) + length*=2; + fillPlane(dst[plane], dstStride[plane], length, height, y, (plane==3) ? 
255 : 128); + } else { + if(isNBPS(c->srcFormat) || isNBPS(c->dstFormat) + || (is16BPS(c->srcFormat) != is16BPS(c->dstFormat)) + ) { + const int src_depth = av_pix_fmt_descriptors[c->srcFormat].comp[plane].depth_minus1+1; + const int dst_depth = av_pix_fmt_descriptors[c->dstFormat].comp[plane].depth_minus1+1; + const uint16_t *srcPtr2 = (const uint16_t*)srcPtr; + uint16_t *dstPtr2 = (uint16_t*)dstPtr; + + if (dst_depth == 8) { + if(isBE(c->srcFormat) == HAVE_BIGENDIAN){ + DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, , ) + } else { + DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, av_bswap16, ) + } + } else if (src_depth == 8) { + for (i = 0; i < height; i++) { + if(isBE(c->dstFormat)){ + for (j = 0; j < length; j++) + AV_WB16(&dstPtr2[j], (srcPtr[j]<<(dst_depth-8)) | + (srcPtr[j]>>(2*8-dst_depth))); + } else { + for (j = 0; j < length; j++) + AV_WL16(&dstPtr2[j], (srcPtr[j]<<(dst_depth-8)) | + (srcPtr[j]>>(2*8-dst_depth))); + } + dstPtr2 += dstStride[plane]/2; + srcPtr += srcStride[plane]; + } + } else if (src_depth <= dst_depth) { + for (i = 0; i < height; i++) { +#define COPY_UP(r,w) \ + for (j = 0; j < length; j++){ \ + unsigned int v= r(&srcPtr2[j]);\ + w(&dstPtr2[j], (v<<(dst_depth-src_depth)) | \ + (v>>(2*src_depth-dst_depth)));\ + } + if(isBE(c->srcFormat)){ + if(isBE(c->dstFormat)){ + COPY_UP(AV_RB16, AV_WB16) + } else { + COPY_UP(AV_RB16, AV_WL16) + } + } else { + if(isBE(c->dstFormat)){ + COPY_UP(AV_RL16, AV_WB16) + } else { + COPY_UP(AV_RL16, AV_WL16) + } + } + dstPtr2 += dstStride[plane]/2; + srcPtr2 += srcStride[plane]/2; + } + } else { + if(isBE(c->srcFormat) == HAVE_BIGENDIAN){ + if(isBE(c->dstFormat) == HAVE_BIGENDIAN){ + DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , ) + } else { + DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16) + } + }else{ + if(isBE(c->dstFormat) == HAVE_BIGENDIAN){ + DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, ) + } else { + DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16) + } + } + } + } else if(is16BPS(c->srcFormat) && is16BPS(c->dstFormat) + && isBE(c->srcFormat) != isBE(c->dstFormat)) { + + for (i=0; i<height; i++) { + for (j=0; j<length; j++) + ((uint16_t*)dstPtr)[j] = av_bswap16(((const uint16_t*)srcPtr)[j]); + srcPtr+= srcStride[plane]; + dstPtr+= dstStride[plane]; + } + } else if (dstStride[plane] == srcStride[plane] && + srcStride[plane] > 0 && srcStride[plane] == length) { + memcpy(dst[plane] + dstStride[plane]*y, src[plane], + height*dstStride[plane]); + } else { + if(is16BPS(c->srcFormat) && is16BPS(c->dstFormat)) + length*=2; + for (i=0; i<height; i++) { + memcpy(dstPtr, srcPtr, length); + srcPtr+= srcStride[plane]; + dstPtr+= dstStride[plane]; + } + } + } + } + return srcSliceH; +} + +void ff_get_unscaled_swscale(SwsContext *c) +{ + const enum PixelFormat srcFormat = c->srcFormat; + const enum PixelFormat dstFormat = c->dstFormat; + const int flags = c->flags; + const int dstH = c->dstH; + int needsDither; + + needsDither= isAnyRGB(dstFormat) + && c->dstFormatBpp < 24 + && (c->dstFormatBpp < c->srcFormatBpp || (!isAnyRGB(srcFormat))); + + /* yv12_to_nv12 */ + if ((srcFormat == PIX_FMT_YUV420P || srcFormat == PIX_FMT_YUVA420P) && (dstFormat == PIX_FMT_NV12 || dstFormat == PIX_FMT_NV21)) { + c->swScale= planarToNv12Wrapper; + } + /* yuv2bgr */ + if ((srcFormat==PIX_FMT_YUV420P || srcFormat==PIX_FMT_YUV422P || srcFormat==PIX_FMT_YUVA420P) && isAnyRGB(dstFormat) 
+ && !(flags & SWS_ACCURATE_RND) && !(dstH&1)) { + c->swScale= ff_yuv2rgb_get_func_ptr(c); + } + + if (srcFormat==PIX_FMT_YUV410P && (dstFormat==PIX_FMT_YUV420P || dstFormat==PIX_FMT_YUVA420P) && !(flags & SWS_BITEXACT)) { + c->swScale= yvu9ToYv12Wrapper; + } + + /* bgr24toYV12 */ + if (srcFormat==PIX_FMT_BGR24 && (dstFormat==PIX_FMT_YUV420P || dstFormat==PIX_FMT_YUVA420P) && !(flags & SWS_ACCURATE_RND)) + c->swScale= bgr24ToYv12Wrapper; + + /* RGB/BGR -> RGB/BGR (no dither needed forms) */ + if ( isAnyRGB(srcFormat) + && isAnyRGB(dstFormat) + && srcFormat != PIX_FMT_BGR8 && dstFormat != PIX_FMT_BGR8 + && srcFormat != PIX_FMT_RGB8 && dstFormat != PIX_FMT_RGB8 + && srcFormat != PIX_FMT_BGR4 && dstFormat != PIX_FMT_BGR4 + && srcFormat != PIX_FMT_RGB4 && dstFormat != PIX_FMT_RGB4 + && srcFormat != PIX_FMT_BGR4_BYTE && dstFormat != PIX_FMT_BGR4_BYTE + && srcFormat != PIX_FMT_RGB4_BYTE && dstFormat != PIX_FMT_RGB4_BYTE + && srcFormat != PIX_FMT_MONOBLACK && dstFormat != PIX_FMT_MONOBLACK + && srcFormat != PIX_FMT_MONOWHITE && dstFormat != PIX_FMT_MONOWHITE + && srcFormat != PIX_FMT_RGB48LE && dstFormat != PIX_FMT_RGB48LE + && srcFormat != PIX_FMT_RGB48BE && dstFormat != PIX_FMT_RGB48BE + && srcFormat != PIX_FMT_BGR48LE && dstFormat != PIX_FMT_BGR48LE + && srcFormat != PIX_FMT_BGR48BE && dstFormat != PIX_FMT_BGR48BE + && (!needsDither || (c->flags&(SWS_FAST_BILINEAR|SWS_POINT)))) + c->swScale= rgbToRgbWrapper; + + if ((usePal(srcFormat) && ( + dstFormat == PIX_FMT_RGB32 || + dstFormat == PIX_FMT_RGB32_1 || + dstFormat == PIX_FMT_RGB24 || + dstFormat == PIX_FMT_BGR32 || + dstFormat == PIX_FMT_BGR32_1 || + dstFormat == PIX_FMT_BGR24))) + c->swScale= palToRgbWrapper; + + if (srcFormat == PIX_FMT_YUV422P) { + if (dstFormat == PIX_FMT_YUYV422) + c->swScale= yuv422pToYuy2Wrapper; + else if (dstFormat == PIX_FMT_UYVY422) + c->swScale= yuv422pToUyvyWrapper; + } + + /* LQ converters if -sws 0 or -sws 4*/ + if (c->flags&(SWS_FAST_BILINEAR|SWS_POINT)) { + /* yv12_to_yuy2 */ + if (srcFormat == PIX_FMT_YUV420P || srcFormat == PIX_FMT_YUVA420P) { + if (dstFormat == PIX_FMT_YUYV422) + c->swScale= planarToYuy2Wrapper; + else if (dstFormat == PIX_FMT_UYVY422) + c->swScale= planarToUyvyWrapper; + } + } + if(srcFormat == PIX_FMT_YUYV422 && (dstFormat == PIX_FMT_YUV420P || dstFormat == PIX_FMT_YUVA420P)) + c->swScale= yuyvToYuv420Wrapper; + if(srcFormat == PIX_FMT_UYVY422 && (dstFormat == PIX_FMT_YUV420P || dstFormat == PIX_FMT_YUVA420P)) + c->swScale= uyvyToYuv420Wrapper; + if(srcFormat == PIX_FMT_YUYV422 && dstFormat == PIX_FMT_YUV422P) + c->swScale= yuyvToYuv422Wrapper; + if(srcFormat == PIX_FMT_UYVY422 && dstFormat == PIX_FMT_YUV422P) + c->swScale= uyvyToYuv422Wrapper; + + /* simple copy */ + if ( srcFormat == dstFormat + || (srcFormat == PIX_FMT_YUVA420P && dstFormat == PIX_FMT_YUV420P) + || (srcFormat == PIX_FMT_YUV420P && dstFormat == PIX_FMT_YUVA420P) + || (isPlanarYUV(srcFormat) && isGray(dstFormat)) + || (isPlanarYUV(dstFormat) && isGray(srcFormat)) + || (isGray(dstFormat) && isGray(srcFormat)) + || (isPlanarYUV(srcFormat) && isPlanarYUV(dstFormat) + && c->chrDstHSubSample == c->chrSrcHSubSample + && c->chrDstVSubSample == c->chrSrcVSubSample + && dstFormat != PIX_FMT_NV12 && dstFormat != PIX_FMT_NV21 + && srcFormat != PIX_FMT_NV12 && srcFormat != PIX_FMT_NV21)) + { + if (isPacked(c->srcFormat)) + c->swScale= packedCopyWrapper; + else /* Planar YUV or gray */ + c->swScale= planarCopyWrapper; + } + + if (ARCH_BFIN) + ff_bfin_get_unscaled_swscale(c); + if (HAVE_ALTIVEC) + ff_swscale_get_unscaled_altivec(c); 
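+    /* If no wrapper above (or arch-specific hook) matched, c->swScale is left
+     * untouched (NULL on a freshly allocated context), and sws_init_context()
+     * falls back to the generic scaling path via its c->swScale check. */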
+} + +static void reset_ptr(const uint8_t* src[], int format) +{ + if(!isALPHA(format)) + src[3]=NULL; + if(!isPlanarYUV(format)) { + src[3]=src[2]=NULL; + + if (!usePal(format)) + src[1]= NULL; + } +} + +static int check_image_pointers(uint8_t *data[4], enum PixelFormat pix_fmt, + const int linesizes[4]) +{ + const AVPixFmtDescriptor *desc = &av_pix_fmt_descriptors[pix_fmt]; + int i; + + for (i = 0; i < 4; i++) { + int plane = desc->comp[i].plane; + if (!data[plane] || !linesizes[plane]) + return 0; + } + + return 1; +} + +/** + * swscale wrapper, so we don't need to export the SwsContext. + * Assumes planar YUV to be in YUV order instead of YVU. + */ +int sws_scale(SwsContext *c, const uint8_t* const src[], const int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* const dst[], const int dstStride[]) +{ + int i; + const uint8_t* src2[4]= {src[0], src[1], src[2], src[3]}; + uint8_t* dst2[4]= {dst[0], dst[1], dst[2], dst[3]}; + + // do not mess up sliceDir if we have a "trailing" 0-size slice + if (srcSliceH == 0) + return 0; + + if (!check_image_pointers(src, c->srcFormat, srcStride)) { + av_log(c, AV_LOG_ERROR, "bad src image pointers\n"); + return 0; + } + if (!check_image_pointers(dst, c->dstFormat, dstStride)) { + av_log(c, AV_LOG_ERROR, "bad dst image pointers\n"); + return 0; + } + + if (c->sliceDir == 0 && srcSliceY != 0 && srcSliceY + srcSliceH != c->srcH) { + av_log(c, AV_LOG_ERROR, "Slices start in the middle!\n"); + return 0; + } + if (c->sliceDir == 0) { + if (srcSliceY == 0) c->sliceDir = 1; else c->sliceDir = -1; + } + + if (usePal(c->srcFormat)) { + for (i=0; i<256; i++) { + int p, r, g, b, y, u, v, a = 0xff; + if(c->srcFormat == PIX_FMT_PAL8) { + p=((const uint32_t*)(src[1]))[i]; + a= (p>>24)&0xFF; + r= (p>>16)&0xFF; + g= (p>> 8)&0xFF; + b= p &0xFF; + } else if(c->srcFormat == PIX_FMT_RGB8) { + r= (i>>5 )*36; + g= ((i>>2)&7)*36; + b= (i&3 )*85; + } else if(c->srcFormat == PIX_FMT_BGR8) { + b= (i>>6 )*85; + g= ((i>>3)&7)*36; + r= (i&7 )*36; + } else if(c->srcFormat == PIX_FMT_RGB4_BYTE) { + r= (i>>3 )*255; + g= ((i>>1)&3)*85; + b= (i&1 )*255; + } else if(c->srcFormat == PIX_FMT_GRAY8 || c->srcFormat == PIX_FMT_GRAY8A) { + r = g = b = i; + } else { + assert(c->srcFormat == PIX_FMT_BGR4_BYTE); + b= (i>>3 )*255; + g= ((i>>1)&3)*85; + r= (i&1 )*255; + } + y= av_clip_uint8((RY*r + GY*g + BY*b + ( 33<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); + u= av_clip_uint8((RU*r + GU*g + BU*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); + v= av_clip_uint8((RV*r + GV*g + BV*b + (257<<(RGB2YUV_SHIFT-1)))>>RGB2YUV_SHIFT); + c->pal_yuv[i]= y + (u<<8) + (v<<16) + (a<<24); + + switch(c->dstFormat) { + case PIX_FMT_BGR32: +#if !HAVE_BIGENDIAN + case PIX_FMT_RGB24: +#endif + c->pal_rgb[i]= r + (g<<8) + (b<<16) + (a<<24); + break; + case PIX_FMT_BGR32_1: +#if HAVE_BIGENDIAN + case PIX_FMT_BGR24: +#endif + c->pal_rgb[i]= a + (r<<8) + (g<<16) + (b<<24); + break; + case PIX_FMT_RGB32_1: +#if HAVE_BIGENDIAN + case PIX_FMT_RGB24: +#endif + c->pal_rgb[i]= a + (b<<8) + (g<<16) + (r<<24); + break; + case PIX_FMT_RGB32: +#if !HAVE_BIGENDIAN + case PIX_FMT_BGR24: +#endif + default: + c->pal_rgb[i]= b + (g<<8) + (r<<16) + (a<<24); + } + } + } + + // copy strides, so they can safely be modified + if (c->sliceDir == 1) { + // slices go from top to bottom + int srcStride2[4]= {srcStride[0], srcStride[1], srcStride[2], srcStride[3]}; + int dstStride2[4]= {dstStride[0], dstStride[1], dstStride[2], dstStride[3]}; + + reset_ptr(src2, c->srcFormat); + reset_ptr((const uint8_t**)dst2, c->dstFormat); + + /* reset slice 
direction at end of frame */ + if (srcSliceY + srcSliceH == c->srcH) + c->sliceDir = 0; + + return c->swScale(c, src2, srcStride2, srcSliceY, srcSliceH, dst2, dstStride2); + } else { + // slices go from bottom to top => we flip the image internally + int srcStride2[4]= {-srcStride[0], -srcStride[1], -srcStride[2], -srcStride[3]}; + int dstStride2[4]= {-dstStride[0], -dstStride[1], -dstStride[2], -dstStride[3]}; + + src2[0] += (srcSliceH-1)*srcStride[0]; + if (!usePal(c->srcFormat)) + src2[1] += ((srcSliceH>>c->chrSrcVSubSample)-1)*srcStride[1]; + src2[2] += ((srcSliceH>>c->chrSrcVSubSample)-1)*srcStride[2]; + src2[3] += (srcSliceH-1)*srcStride[3]; + dst2[0] += ( c->dstH -1)*dstStride[0]; + dst2[1] += ((c->dstH>>c->chrDstVSubSample)-1)*dstStride[1]; + dst2[2] += ((c->dstH>>c->chrDstVSubSample)-1)*dstStride[2]; + dst2[3] += ( c->dstH -1)*dstStride[3]; + + reset_ptr(src2, c->srcFormat); + reset_ptr((const uint8_t**)dst2, c->dstFormat); + + /* reset slice direction at end of frame */ + if (!srcSliceY) + c->sliceDir = 0; + + return c->swScale(c, src2, srcStride2, c->srcH-srcSliceY-srcSliceH, srcSliceH, dst2, dstStride2); + } +} + +#if LIBSWSCALE_VERSION_MAJOR < 1 +int sws_scale_ordered(SwsContext *c, const uint8_t* const src[], int srcStride[], int srcSliceY, + int srcSliceH, uint8_t* dst[], int dstStride[]) +{ + return sws_scale(c, src, srcStride, srcSliceY, srcSliceH, dst, dstStride); +} +#endif + +/* Convert the palette to the same packed 32-bit format as the palette */ +void sws_convertPalette8ToPacked32(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette) +{ + int i; + + for (i=0; i<num_pixels; i++) + ((uint32_t *) dst)[i] = ((const uint32_t *) palette)[src[i]]; +} + +/* Palette format: ABCD -> dst format: ABC */ +void sws_convertPalette8ToPacked24(const uint8_t *src, uint8_t *dst, int num_pixels, const uint8_t *palette) +{ + int i; + + for (i=0; i<num_pixels; i++) { + //FIXME slow? 
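+        /* src[i] is an 8-bit palette index; the palette stores 4 bytes per
+         * entry (ABCD), of which only the first three are copied to the
+         * packed 24-bit destination, discarding the fourth component. */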
+ dst[0]= palette[src[i]*4+0]; + dst[1]= palette[src[i]*4+1]; + dst[2]= palette[src[i]*4+2]; + dst+= 3; + } +} diff --git a/libswscale/utils.c b/libswscale/utils.c index ea44190ace..984f2c52fa 100644 --- a/libswscale/utils.c +++ b/libswscale/utils.c @@ -77,13 +77,17 @@ const char *swscale_license(void) || (x)==PIX_FMT_BGR48BE \ || (x)==PIX_FMT_BGR48LE \ || (x)==PIX_FMT_BGR24 \ - || (x)==PIX_FMT_BGR565 \ - || (x)==PIX_FMT_BGR555 \ + || (x)==PIX_FMT_BGR565LE \ + || (x)==PIX_FMT_BGR565BE \ + || (x)==PIX_FMT_BGR555LE \ + || (x)==PIX_FMT_BGR555BE \ || (x)==PIX_FMT_BGR32 \ || (x)==PIX_FMT_BGR32_1 \ || (x)==PIX_FMT_RGB24 \ - || (x)==PIX_FMT_RGB565 \ - || (x)==PIX_FMT_RGB555 \ + || (x)==PIX_FMT_RGB565LE \ + || (x)==PIX_FMT_RGB565BE \ + || (x)==PIX_FMT_RGB555LE \ + || (x)==PIX_FMT_RGB555BE \ || (x)==PIX_FMT_GRAY8 \ || (x)==PIX_FMT_GRAY8A \ || (x)==PIX_FMT_YUV410P \ @@ -108,12 +112,18 @@ const char *swscale_license(void) || (x)==PIX_FMT_MONOWHITE \ || (x)==PIX_FMT_MONOBLACK \ || (x)==PIX_FMT_YUV420P9LE \ + || (x)==PIX_FMT_YUV444P9LE \ || (x)==PIX_FMT_YUV420P10LE \ + || (x)==PIX_FMT_YUV422P10LE \ + || (x)==PIX_FMT_YUV444P10LE \ || (x)==PIX_FMT_YUV420P16LE \ || (x)==PIX_FMT_YUV422P16LE \ || (x)==PIX_FMT_YUV444P16LE \ || (x)==PIX_FMT_YUV420P9BE \ + || (x)==PIX_FMT_YUV444P9BE \ || (x)==PIX_FMT_YUV420P10BE \ + || (x)==PIX_FMT_YUV444P10BE \ + || (x)==PIX_FMT_YUV422P10BE \ || (x)==PIX_FMT_YUV420P16BE \ || (x)==PIX_FMT_YUV422P16BE \ || (x)==PIX_FMT_YUV444P16BE \ @@ -137,7 +147,22 @@ int sws_isSupportedInput(enum PixelFormat pix_fmt) || (x)==PIX_FMT_YUVJ422P \ || (x)==PIX_FMT_YUVJ440P \ || (x)==PIX_FMT_YUVJ444P \ - || isAnyRGB(x) \ + || isRGBinBytes(x) \ + || isBGRinBytes(x) \ + || (x)==PIX_FMT_RGB565 \ + || (x)==PIX_FMT_RGB555 \ + || (x)==PIX_FMT_RGB444 \ + || (x)==PIX_FMT_BGR565 \ + || (x)==PIX_FMT_BGR555 \ + || (x)==PIX_FMT_BGR444 \ + || (x)==PIX_FMT_RGB8 \ + || (x)==PIX_FMT_BGR8 \ + || (x)==PIX_FMT_RGB4_BYTE \ + || (x)==PIX_FMT_BGR4_BYTE \ + || (x)==PIX_FMT_RGB4 \ + || (x)==PIX_FMT_BGR4 \ + || (x)==PIX_FMT_MONOBLACK \ + || (x)==PIX_FMT_MONOWHITE \ || (x)==PIX_FMT_NV12 \ || (x)==PIX_FMT_NV21 \ || (x)==PIX_FMT_GRAY16BE \ @@ -165,17 +190,15 @@ int sws_isSupportedOutput(enum PixelFormat pix_fmt) extern const int32_t ff_yuv2rgb_coeffs[8][4]; +#if FF_API_SWS_FORMAT_NAME const char *sws_format_name(enum PixelFormat format) { - if ((unsigned)format < PIX_FMT_NB && av_pix_fmt_descriptors[format].name) - return av_pix_fmt_descriptors[format].name; - else - return "Unknown format"; + return av_get_pix_fmt_name(format); } +#endif static double getSplineCoeff(double a, double b, double c, double d, double dist) { -// printf("%f %f %f %f %f\n", a,b,c,d,dist); if (dist<=1.0) return ((d*dist + c)*dist + b)*dist +a; else return getSplineCoeff( 0.0, b+ 2.0*c + 3.0*d, @@ -185,7 +208,7 @@ static double getSplineCoeff(double a, double b, double c, double d, double dist } static int initFilter(int16_t **outFilter, int16_t **filterPos, int *outFilterSize, int xInc, - int srcW, int dstW, int filterAlign, int one, int flags, + int srcW, int dstW, int filterAlign, int one, int flags, int cpu_flags, SwsVector *srcFilter, SwsVector *dstFilter, double param[2]) { int i; @@ -196,10 +219,8 @@ static int initFilter(int16_t **outFilter, int16_t **filterPos, int *outFilterSi int64_t *filter2=NULL; const int64_t fone= 1LL<<54; int ret= -1; -#if ARCH_X86 - if (flags & SWS_CPU_CAPS_MMX) - __asm__ volatile("emms\n\t"::: "memory"); //FIXME this should not be required but it IS (even for non-MMX versions) -#endif + + emms_c(); //FIXME this 
should not be required but it IS (even for non-MMX versions) // NOTE: the +1 is for the MMX scaler which reads over the end FF_ALLOC_OR_GOTO(NULL, *filterPos, (dstW+1)*sizeof(int16_t), fail); @@ -416,7 +437,7 @@ static int initFilter(int16_t **outFilter, int16_t **filterPos, int *outFilterSi if (min>minFilterSize) minFilterSize= min; } - if (flags & SWS_CPU_CAPS_ALTIVEC) { + if (HAVE_ALTIVEC && cpu_flags & AV_CPU_FLAG_ALTIVEC) { // we can handle the special case 4, // so we don't want to go to the full 8 if (minFilterSize < 5) @@ -431,7 +452,7 @@ static int initFilter(int16_t **outFilter, int16_t **filterPos, int *outFilterSi filterAlign = 1; } - if (flags & SWS_CPU_CAPS_MMX) { + if (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) { // special case for unscaled vertical filtering if (minFilterSize == 1 && filterAlign == 2) filterAlign= 1; @@ -521,7 +542,7 @@ fail: return ret; } -#if ARCH_X86 && (HAVE_MMX2 || CONFIG_RUNTIME_CPUDETECT) +#if HAVE_MMX2 static int initMMX2HScaler(int dstW, int xInc, uint8_t *filterCode, int16_t *filter, int32_t *filterPos, int numSplits) { uint8_t *fragmentA; @@ -679,7 +700,7 @@ static int initMMX2HScaler(int dstW, int xInc, uint8_t *filterCode, int16_t *fil return fragmentPos + 1; } -#endif /* ARCH_X86 && (HAVE_MMX2 || CONFIG_RUNTIME_CPUDETECT) */ +#endif /* HAVE_MMX2 */ static void getSubSampleFactors(int *h, int *v, enum PixelFormat format) { @@ -687,8 +708,6 @@ static void getSubSampleFactors(int *h, int *v, enum PixelFormat format) *v = av_pix_fmt_descriptors[format].log2_chroma_h; } -static int update_flags_cpu(int flags); - int sws_setColorspaceDetails(SwsContext *c, const int inv_table[4], int srcRange, const int table[4], int dstRange, int brightness, int contrast, int saturation) { memcpy(c->srcColorspaceTable, inv_table, sizeof(int)*4); @@ -703,21 +722,18 @@ int sws_setColorspaceDetails(SwsContext *c, const int inv_table[4], int srcRange c->dstFormatBpp = av_get_bits_per_pixel(&av_pix_fmt_descriptors[c->dstFormat]); c->srcFormatBpp = av_get_bits_per_pixel(&av_pix_fmt_descriptors[c->srcFormat]); - c->flags = update_flags_cpu(c->flags); ff_yuv2rgb_c_init_tables(c, inv_table, srcRange, brightness, contrast, saturation); //FIXME factorize -#if HAVE_ALTIVEC - if (c->flags & SWS_CPU_CAPS_ALTIVEC) + if (HAVE_ALTIVEC && av_get_cpu_flags() & AV_CPU_FLAG_ALTIVEC) ff_yuv2rgb_init_tables_altivec(c, inv_table, brightness, contrast, saturation); -#endif return 0; } int sws_getColorspaceDetails(SwsContext *c, int **inv_table, int *srcRange, int **table, int *dstRange, int *brightness, int *contrast, int *saturation) { - if (isYUV(c->dstFormat) || isGray(c->dstFormat)) return -1; + if (!c || isYUV(c->dstFormat) || isGray(c->dstFormat)) return -1; *inv_table = c->srcColorspaceTable; *table = c->dstColorspaceTable; @@ -741,27 +757,6 @@ static int handle_jpeg(enum PixelFormat *format) } } -static int update_flags_cpu(int flags) -{ -#if !CONFIG_RUNTIME_CPUDETECT //ensure that the flags match the compiled variant if cpudetect is off - flags &= ~( SWS_CPU_CAPS_MMX - |SWS_CPU_CAPS_MMX2 - |SWS_CPU_CAPS_3DNOW - |SWS_CPU_CAPS_SSE2 - |SWS_CPU_CAPS_ALTIVEC - |SWS_CPU_CAPS_BFIN); - flags |= ff_hardcodedcpuflags(); -#else /* !CONFIG_RUNTIME_CPUDETECT */ - int cpuflags = av_get_cpu_flags(); - - flags |= (cpuflags & AV_CPU_FLAG_SSE2 ? SWS_CPU_CAPS_SSE2 : 0); - flags |= (cpuflags & AV_CPU_FLAG_MMX ? SWS_CPU_CAPS_MMX : 0); - flags |= (cpuflags & AV_CPU_FLAG_MMX2 ? SWS_CPU_CAPS_MMX2 : 0); - flags |= (cpuflags & AV_CPU_FLAG_3DNOW ? 
SWS_CPU_CAPS_3DNOW : 0); -#endif /* CONFIG_RUNTIME_CPUDETECT */ - return flags; -} - SwsContext *sws_alloc_context(void) { SwsContext *c= av_mallocz(sizeof(SwsContext)); @@ -782,25 +777,24 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) int srcH= c->srcH; int dstW= c->dstW; int dstH= c->dstH; - int flags; + int dst_stride = FFALIGN(dstW * sizeof(int16_t)+66, 16), dst_stride_px = dst_stride >> 1; + int flags, cpu_flags; enum PixelFormat srcFormat= c->srcFormat; enum PixelFormat dstFormat= c->dstFormat; - flags= c->flags = update_flags_cpu(c->flags); -#if ARCH_X86 - if (flags & SWS_CPU_CAPS_MMX) - __asm__ volatile("emms\n\t"::: "memory"); -#endif - if (!rgb15to16) sws_rgb2rgb_init(flags); + cpu_flags = av_get_cpu_flags(); + flags = c->flags; + emms_c(); + if (!rgb15to16) sws_rgb2rgb_init(); unscaled = (srcW == dstW && srcH == dstH); if (!isSupportedIn(srcFormat)) { - av_log(NULL, AV_LOG_ERROR, "swScaler: %s is not supported as input pixel format\n", sws_format_name(srcFormat)); + av_log(c, AV_LOG_ERROR, "%s is not supported as input pixel format\n", av_get_pix_fmt_name(srcFormat)); return AVERROR(EINVAL); } if (!isSupportedOut(dstFormat)) { - av_log(NULL, AV_LOG_ERROR, "swScaler: %s is not supported as output pixel format\n", sws_format_name(dstFormat)); + av_log(c, AV_LOG_ERROR, "%s is not supported as output pixel format\n", av_get_pix_fmt_name(dstFormat)); return AVERROR(EINVAL); } @@ -816,19 +810,15 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) |SWS_SPLINE |SWS_BICUBLIN); if(!i || (i & (i-1))) { - av_log(NULL, AV_LOG_ERROR, "swScaler: Exactly one scaler algorithm must be chosen\n"); + av_log(c, AV_LOG_ERROR, "Exactly one scaler algorithm must be chosen\n"); return AVERROR(EINVAL); } /* sanity check */ if (srcW<4 || srcH<1 || dstW<8 || dstH<1) { //FIXME check if these are enough and try to lowwer them after fixing the relevant parts of the code - av_log(NULL, AV_LOG_ERROR, "swScaler: %dx%d -> %dx%d is invalid scaling dimension\n", + av_log(c, AV_LOG_ERROR, "%dx%d -> %dx%d is invalid scaling dimension\n", srcW, srcH, dstW, dstH); return AVERROR(EINVAL); } - if(srcW > VOFW || dstW > VOFW) { - av_log(NULL, AV_LOG_ERROR, "swScaler: Compile-time maximum width is "AV_STRINGIFY(VOFW)" change VOF/VOFW and recompile\n"); - return AVERROR(EINVAL); - } if (!dstFilter) dstFilter= &dummyFilter; if (!srcFilter) srcFilter= &dummyFilter; @@ -879,18 +869,19 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) if (c->swScale) { if (flags&SWS_PRINT_INFO) av_log(c, AV_LOG_INFO, "using unscaled %s -> %s special converter\n", - sws_format_name(srcFormat), sws_format_name(dstFormat)); + av_get_pix_fmt_name(srcFormat), av_get_pix_fmt_name(dstFormat)); return 0; } } - if (flags & SWS_CPU_CAPS_MMX2) { + FF_ALLOC_OR_GOTO(c, c->formatConvBuffer, FFALIGN(srcW*2+78, 16) * 2, fail); + if (HAVE_MMX2 && cpu_flags & AV_CPU_FLAG_MMX2) { c->canMMX2BeUsed= (dstW >=srcW && (dstW&31)==0 && (srcW&15)==0) ? 
1 : 0; if (!c->canMMX2BeUsed && dstW >=srcW && (srcW&15)==0 && (flags&SWS_FAST_BILINEAR)) { if (flags&SWS_PRINT_INFO) av_log(c, AV_LOG_INFO, "output width is not a multiple of 32 -> no MMX2 scaler\n"); } - if (usesHFilter) c->canMMX2BeUsed=0; + if (usesHFilter || isNBPS(c->srcFormat) || is16BPS(c->srcFormat) || isAnyRGB(c->srcFormat)) c->canMMX2BeUsed=0; } else c->canMMX2BeUsed=0; @@ -910,7 +901,7 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) c->chrXInc+= 20; } //we don't use the x86 asm scaler if MMX is available - else if (flags & SWS_CPU_CAPS_MMX) { + else if (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) { c->lumXInc = ((srcW-2)<<16)/(dstW-2) - 20; c->chrXInc = ((c->chrSrcW-2)<<16)/(c->chrDstW-2) - 20; } @@ -918,7 +909,7 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) /* precalculate horizontal scaler filter coefficients */ { -#if ARCH_X86 && (HAVE_MMX2 || CONFIG_RUNTIME_CPUDETECT) +#if HAVE_MMX2 // can't downscale !!! if (c->canMMX2BeUsed && (flags & SWS_FAST_BILINEAR)) { c->lumMmx2FilterCodeSize = initMMX2HScaler( dstW, c->lumXInc, NULL, NULL, NULL, 8); @@ -954,21 +945,21 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) mprotect(c->chrMmx2FilterCode, c->chrMmx2FilterCodeSize, PROT_EXEC | PROT_READ); #endif } else -#endif /* ARCH_X86 && (HAVE_MMX2 || CONFIG_RUNTIME_CPUDETECT) */ +#endif /* HAVE_MMX2 */ { const int filterAlign= - (flags & SWS_CPU_CAPS_MMX) ? 4 : - (flags & SWS_CPU_CAPS_ALTIVEC) ? 8 : + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? 4 : + (HAVE_ALTIVEC && cpu_flags & AV_CPU_FLAG_ALTIVEC) ? 8 : 1; if (initFilter(&c->hLumFilter, &c->hLumFilterPos, &c->hLumFilterSize, c->lumXInc, srcW , dstW, filterAlign, 1<<14, - (flags&SWS_BICUBLIN) ? (flags|SWS_BICUBIC) : flags, + (flags&SWS_BICUBLIN) ? (flags|SWS_BICUBIC) : flags, cpu_flags, srcFilter->lumH, dstFilter->lumH, c->param) < 0) goto fail; if (initFilter(&c->hChrFilter, &c->hChrFilterPos, &c->hChrFilterSize, c->chrXInc, c->chrSrcW, c->chrDstW, filterAlign, 1<<14, - (flags&SWS_BICUBLIN) ? (flags|SWS_BILINEAR) : flags, + (flags&SWS_BICUBLIN) ? (flags|SWS_BILINEAR) : flags, cpu_flags, srcFilter->chrH, dstFilter->chrH, c->param) < 0) goto fail; } @@ -977,18 +968,18 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) /* precalculate vertical scaler filter coefficients */ { const int filterAlign= - (flags & SWS_CPU_CAPS_MMX) && (flags & SWS_ACCURATE_RND) ? 2 : - (flags & SWS_CPU_CAPS_ALTIVEC) ? 8 : + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) && (flags & SWS_ACCURATE_RND) ? 2 : + (HAVE_ALTIVEC && cpu_flags & AV_CPU_FLAG_ALTIVEC) ? 8 : 1; if (initFilter(&c->vLumFilter, &c->vLumFilterPos, &c->vLumFilterSize, c->lumYInc, srcH , dstH, filterAlign, (1<<12), - (flags&SWS_BICUBLIN) ? (flags|SWS_BICUBIC) : flags, + (flags&SWS_BICUBLIN) ? (flags|SWS_BICUBIC) : flags, cpu_flags, srcFilter->lumV, dstFilter->lumV, c->param) < 0) goto fail; if (initFilter(&c->vChrFilter, &c->vChrFilterPos, &c->vChrFilterSize, c->chrYInc, c->chrSrcH, c->chrDstH, filterAlign, (1<<12), - (flags&SWS_BICUBLIN) ? (flags|SWS_BILINEAR) : flags, + (flags&SWS_BICUBLIN) ? 
(flags|SWS_BILINEAR) : flags, cpu_flags, srcFilter->chrV, dstFilter->chrV, c->param) < 0) goto fail; @@ -1031,29 +1022,32 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) // allocate pixbufs (we use dynamic allocation because otherwise we would need to // allocate several megabytes to handle all possible cases) FF_ALLOC_OR_GOTO(c, c->lumPixBuf, c->vLumBufSize*2*sizeof(int16_t*), fail); - FF_ALLOC_OR_GOTO(c, c->chrPixBuf, c->vChrBufSize*2*sizeof(int16_t*), fail); + FF_ALLOC_OR_GOTO(c, c->chrUPixBuf, c->vChrBufSize*2*sizeof(int16_t*), fail); + FF_ALLOC_OR_GOTO(c, c->chrVPixBuf, c->vChrBufSize*2*sizeof(int16_t*), fail); if (CONFIG_SWSCALE_ALPHA && isALPHA(c->srcFormat) && isALPHA(c->dstFormat)) FF_ALLOCZ_OR_GOTO(c, c->alpPixBuf, c->vLumBufSize*2*sizeof(int16_t*), fail); //Note we need at least one pixel more at the end because of the MMX code (just in case someone wanna replace the 4000/8000) /* align at 16 bytes for AltiVec */ for (i=0; i<c->vLumBufSize; i++) { - FF_ALLOCZ_OR_GOTO(c, c->lumPixBuf[i+c->vLumBufSize], VOF+1, fail); + FF_ALLOCZ_OR_GOTO(c, c->lumPixBuf[i+c->vLumBufSize], dst_stride+1, fail); c->lumPixBuf[i] = c->lumPixBuf[i+c->vLumBufSize]; } + c->uv_off = dst_stride_px; + c->uv_offx2 = dst_stride; for (i=0; i<c->vChrBufSize; i++) { - FF_ALLOC_OR_GOTO(c, c->chrPixBuf[i+c->vChrBufSize], (VOF+1)*2, fail); - c->chrPixBuf[i] = c->chrPixBuf[i+c->vChrBufSize]; + FF_ALLOC_OR_GOTO(c, c->chrUPixBuf[i+c->vChrBufSize], dst_stride*2+1, fail); + c->chrUPixBuf[i] = c->chrUPixBuf[i+c->vChrBufSize]; + c->chrVPixBuf[i] = c->chrVPixBuf[i+c->vChrBufSize] = c->chrUPixBuf[i] + dst_stride_px; } if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) for (i=0; i<c->vLumBufSize; i++) { - FF_ALLOCZ_OR_GOTO(c, c->alpPixBuf[i+c->vLumBufSize], VOF+1, fail); + FF_ALLOCZ_OR_GOTO(c, c->alpPixBuf[i+c->vLumBufSize], dst_stride+1, fail); c->alpPixBuf[i] = c->alpPixBuf[i+c->vLumBufSize]; } //try to avoid drawing green stuff between the right end and the stride end - for (i=0; i<c->vChrBufSize; i++) memset(c->chrPixBuf[i], 64, (VOF+1)*2); - - assert(2*VOFW == VOF); + for (i=0; i<c->vChrBufSize; i++) + memset(c->chrUPixBuf[i], 64, dst_stride*2+1); assert(c->chrDstH <= dstH); @@ -1072,7 +1066,7 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) else av_log(c, AV_LOG_INFO, "ehh flags invalid?! 
"); av_log(c, AV_LOG_INFO, "from %s to %s%s ", - sws_format_name(srcFormat), + av_get_pix_fmt_name(srcFormat), #ifdef DITHER1XBPP dstFormat == PIX_FMT_BGR555 || dstFormat == PIX_FMT_BGR565 || dstFormat == PIX_FMT_RGB444BE || dstFormat == PIX_FMT_RGB444LE || @@ -1080,15 +1074,15 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) #else "", #endif - sws_format_name(dstFormat)); + av_get_pix_fmt_name(dstFormat)); - if (flags & SWS_CPU_CAPS_MMX2) av_log(c, AV_LOG_INFO, "using MMX2\n"); - else if (flags & SWS_CPU_CAPS_3DNOW) av_log(c, AV_LOG_INFO, "using 3DNOW\n"); - else if (flags & SWS_CPU_CAPS_MMX) av_log(c, AV_LOG_INFO, "using MMX\n"); - else if (flags & SWS_CPU_CAPS_ALTIVEC) av_log(c, AV_LOG_INFO, "using AltiVec\n"); + if (HAVE_MMX2 && cpu_flags & AV_CPU_FLAG_MMX2) av_log(c, AV_LOG_INFO, "using MMX2\n"); + else if (HAVE_AMD3DNOW && cpu_flags & AV_CPU_FLAG_3DNOW) av_log(c, AV_LOG_INFO, "using 3DNOW\n"); + else if (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) av_log(c, AV_LOG_INFO, "using MMX\n"); + else if (HAVE_ALTIVEC && cpu_flags & AV_CPU_FLAG_ALTIVEC) av_log(c, AV_LOG_INFO, "using AltiVec\n"); else av_log(c, AV_LOG_INFO, "using C\n"); - if (flags & SWS_CPU_CAPS_MMX) { + if (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) { if (c->canMMX2BeUsed && (flags&SWS_FAST_BILINEAR)) av_log(c, AV_LOG_VERBOSE, "using FAST_BILINEAR MMX2 scaler for horizontal scaling\n"); else { @@ -1107,7 +1101,7 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) av_log(c, AV_LOG_VERBOSE, "using n-tap MMX scaler for horizontal chrominance scaling\n"); } } else { -#if ARCH_X86 +#if HAVE_MMX av_log(c, AV_LOG_VERBOSE, "using x86 asm scaler for horizontal scaling\n"); #else if (flags & SWS_FAST_BILINEAR) @@ -1118,31 +1112,41 @@ int sws_init_context(SwsContext *c, SwsFilter *srcFilter, SwsFilter *dstFilter) } if (isPlanarYUV(dstFormat)) { if (c->vLumFilterSize==1) - av_log(c, AV_LOG_VERBOSE, "using 1-tap %s \"scaler\" for vertical scaling (YV12 like)\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using 1-tap %s \"scaler\" for vertical scaling (YV12 like)\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); else - av_log(c, AV_LOG_VERBOSE, "using n-tap %s scaler for vertical scaling (YV12 like)\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using n-tap %s scaler for vertical scaling (YV12 like)\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); } else { if (c->vLumFilterSize==1 && c->vChrFilterSize==2) av_log(c, AV_LOG_VERBOSE, "using 1-tap %s \"scaler\" for vertical luminance scaling (BGR)\n" - " 2-tap scaler for vertical chrominance scaling (BGR)\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + " 2-tap scaler for vertical chrominance scaling (BGR)\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); else if (c->vLumFilterSize==2 && c->vChrFilterSize==2) - av_log(c, AV_LOG_VERBOSE, "using 2-tap linear %s scaler for vertical scaling (BGR)\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using 2-tap linear %s scaler for vertical scaling (BGR)\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); else - av_log(c, AV_LOG_VERBOSE, "using n-tap %s scaler for vertical scaling (BGR)\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using n-tap %s scaler for vertical scaling (BGR)\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? 
"MMX" : "C"); } if (dstFormat==PIX_FMT_BGR24) av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR24 converter\n", - (flags & SWS_CPU_CAPS_MMX2) ? "MMX2" : ((flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C")); + (HAVE_MMX2 && cpu_flags & AV_CPU_FLAG_MMX2) ? "MMX2" : + ((HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C")); else if (dstFormat==PIX_FMT_RGB32) - av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR32 converter\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR32 converter\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); else if (dstFormat==PIX_FMT_BGR565) - av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR16 converter\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR16 converter\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); else if (dstFormat==PIX_FMT_BGR555) - av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR15 converter\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR15 converter\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); else if (dstFormat == PIX_FMT_RGB444BE || dstFormat == PIX_FMT_RGB444LE || dstFormat == PIX_FMT_BGR444BE || dstFormat == PIX_FMT_BGR444LE) - av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR12 converter\n", (flags & SWS_CPU_CAPS_MMX) ? "MMX" : "C"); + av_log(c, AV_LOG_VERBOSE, "using %s YV12->BGR12 converter\n", + (HAVE_MMX && cpu_flags & AV_CPU_FLAG_MMX) ? "MMX" : "C"); av_log(c, AV_LOG_VERBOSE, "%dx%d -> %dx%d\n", srcW, srcH, dstW, dstH); av_log(c, AV_LOG_DEBUG, "lum srcW=%d srcH=%d dstW=%d dstH=%d xInc=%d yInc=%d\n", @@ -1501,10 +1505,11 @@ void sws_freeContext(SwsContext *c) av_freep(&c->lumPixBuf); } - if (c->chrPixBuf) { + if (c->chrUPixBuf) { for (i=0; i<c->vChrBufSize; i++) - av_freep(&c->chrPixBuf[i]); - av_freep(&c->chrPixBuf); + av_freep(&c->chrUPixBuf[i]); + av_freep(&c->chrUPixBuf); + av_freep(&c->chrVPixBuf); } if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { @@ -1527,7 +1532,7 @@ void sws_freeContext(SwsContext *c) av_freep(&c->hLumFilterPos); av_freep(&c->hChrFilterPos); -#if ARCH_X86 +#if HAVE_MMX #ifdef MAP_ANONYMOUS if (c->lumMmx2FilterCode) munmap(c->lumMmx2FilterCode, c->lumMmx2FilterCodeSize); if (c->chrMmx2FilterCode) munmap(c->chrMmx2FilterCode, c->chrMmx2FilterCodeSize); @@ -1540,9 +1545,10 @@ void sws_freeContext(SwsContext *c) #endif c->lumMmx2FilterCode=NULL; c->chrMmx2FilterCode=NULL; -#endif /* ARCH_X86 */ +#endif /* HAVE_MMX */ av_freep(&c->yuvTable); + av_freep(&c->formatConvBuffer); av_free(c); } @@ -1557,8 +1563,6 @@ struct SwsContext *sws_getCachedContext(struct SwsContext *context, if (!param) param = default_param; - flags = update_flags_cpu(flags); - if (context && (context->srcW != srcW || context->srcH != srcH || diff --git a/libswscale/x86/rgb2rgb.c b/libswscale/x86/rgb2rgb.c new file mode 100644 index 0000000000..ed7f5adb74 --- /dev/null +++ b/libswscale/x86/rgb2rgb.c @@ -0,0 +1,138 @@ +/* + * software RGB to RGB converter + * pluralize by software PAL8 to RGB converter + * software YUV to YUV converter + * software YUV to RGB converter + * Written by Nick Kurshev. + * palette & YUV & runtime CPU stuff by Michael (michaelni@gmx.at) + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <stdint.h> + +#include "config.h" +#include "libavutil/x86_cpu.h" +#include "libavutil/cpu.h" +#include "libavutil/bswap.h" +#include "libswscale/rgb2rgb.h" +#include "libswscale/swscale.h" +#include "libswscale/swscale_internal.h" + +DECLARE_ASM_CONST(8, uint64_t, mmx_ff) = 0x00000000000000FFULL; +DECLARE_ASM_CONST(8, uint64_t, mmx_null) = 0x0000000000000000ULL; +DECLARE_ASM_CONST(8, uint64_t, mmx_one) = 0xFFFFFFFFFFFFFFFFULL; +DECLARE_ASM_CONST(8, uint64_t, mask32b) = 0x000000FF000000FFULL; +DECLARE_ASM_CONST(8, uint64_t, mask32g) = 0x0000FF000000FF00ULL; +DECLARE_ASM_CONST(8, uint64_t, mask32r) = 0x00FF000000FF0000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask32a) = 0xFF000000FF000000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask32) = 0x00FFFFFF00FFFFFFULL; +DECLARE_ASM_CONST(8, uint64_t, mask3216br) = 0x00F800F800F800F8ULL; +DECLARE_ASM_CONST(8, uint64_t, mask3216g) = 0x0000FC000000FC00ULL; +DECLARE_ASM_CONST(8, uint64_t, mask3215g) = 0x0000F8000000F800ULL; +DECLARE_ASM_CONST(8, uint64_t, mul3216) = 0x2000000420000004ULL; +DECLARE_ASM_CONST(8, uint64_t, mul3215) = 0x2000000820000008ULL; +DECLARE_ASM_CONST(8, uint64_t, mask24b) = 0x00FF0000FF0000FFULL; +DECLARE_ASM_CONST(8, uint64_t, mask24g) = 0xFF0000FF0000FF00ULL; +DECLARE_ASM_CONST(8, uint64_t, mask24r) = 0x0000FF0000FF0000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask24l) = 0x0000000000FFFFFFULL; +DECLARE_ASM_CONST(8, uint64_t, mask24h) = 0x0000FFFFFF000000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask24hh) = 0xffff000000000000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask24hhh) = 0xffffffff00000000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask24hhhh) = 0xffffffffffff0000ULL; +DECLARE_ASM_CONST(8, uint64_t, mask15b) = 0x001F001F001F001FULL; /* 00000000 00011111 xxB */ +DECLARE_ASM_CONST(8, uint64_t, mask15rg) = 0x7FE07FE07FE07FE0ULL; /* 01111111 11100000 RGx */ +DECLARE_ASM_CONST(8, uint64_t, mask15s) = 0xFFE0FFE0FFE0FFE0ULL; +DECLARE_ASM_CONST(8, uint64_t, mask15g) = 0x03E003E003E003E0ULL; +DECLARE_ASM_CONST(8, uint64_t, mask15r) = 0x7C007C007C007C00ULL; +#define mask16b mask15b +DECLARE_ASM_CONST(8, uint64_t, mask16g) = 0x07E007E007E007E0ULL; +DECLARE_ASM_CONST(8, uint64_t, mask16r) = 0xF800F800F800F800ULL; +DECLARE_ASM_CONST(8, uint64_t, red_16mask) = 0x0000f8000000f800ULL; +DECLARE_ASM_CONST(8, uint64_t, green_16mask) = 0x000007e0000007e0ULL; +DECLARE_ASM_CONST(8, uint64_t, blue_16mask) = 0x0000001f0000001fULL; +DECLARE_ASM_CONST(8, uint64_t, red_15mask) = 0x00007c0000007c00ULL; +DECLARE_ASM_CONST(8, uint64_t, green_15mask) = 0x000003e0000003e0ULL; +DECLARE_ASM_CONST(8, uint64_t, blue_15mask) = 0x0000001f0000001fULL; + +#define RGB2YUV_SHIFT 8 +#define BY ((int)( 0.098*(1<<RGB2YUV_SHIFT)+0.5)) +#define BV ((int)(-0.071*(1<<RGB2YUV_SHIFT)+0.5)) +#define BU ((int)( 0.439*(1<<RGB2YUV_SHIFT)+0.5)) +#define GY ((int)( 0.504*(1<<RGB2YUV_SHIFT)+0.5)) +#define GV ((int)(-0.368*(1<<RGB2YUV_SHIFT)+0.5)) +#define GU ((int)(-0.291*(1<<RGB2YUV_SHIFT)+0.5)) +#define RY ((int)( 0.257*(1<<RGB2YUV_SHIFT)+0.5)) +#define RV ((int)( 0.439*(1<<RGB2YUV_SHIFT)+0.5)) +#define RU 
((int)(-0.148*(1<<RGB2YUV_SHIFT)+0.5)) + +//Note: We have C, MMX, MMX2, 3DNOW versions, there is no 3DNOW + MMX2 one. + +#define COMPILE_TEMPLATE_MMX2 0 +#define COMPILE_TEMPLATE_AMD3DNOW 0 +#define COMPILE_TEMPLATE_SSE2 0 + +//MMX versions +#undef RENAME +#define RENAME(a) a ## _MMX +#include "rgb2rgb_template.c" + +//MMX2 versions +#undef RENAME +#undef COMPILE_TEMPLATE_MMX2 +#define COMPILE_TEMPLATE_MMX2 1 +#define RENAME(a) a ## _MMX2 +#include "rgb2rgb_template.c" + +//SSE2 versions +#undef RENAME +#undef COMPILE_TEMPLATE_SSE2 +#define COMPILE_TEMPLATE_SSE2 1 +#define RENAME(a) a ## _SSE2 +#include "rgb2rgb_template.c" + +//3DNOW versions +#undef RENAME +#undef COMPILE_TEMPLATE_MMX2 +#undef COMPILE_TEMPLATE_SSE2 +#undef COMPILE_TEMPLATE_AMD3DNOW +#define COMPILE_TEMPLATE_MMX2 0 +#define COMPILE_TEMPLATE_SSE2 0 +#define COMPILE_TEMPLATE_AMD3DNOW 1 +#define RENAME(a) a ## _3DNOW +#include "rgb2rgb_template.c" + +/* + RGB15->RGB16 original by Strepto/Astral + ported to gcc & bugfixed : A'rpi + MMX2, 3DNOW optimization by Nick Kurshev + 32-bit C version, and and&add trick by Michael Niedermayer +*/ + +void rgb2rgb_init_x86(void) +{ + int cpu_flags = av_get_cpu_flags(); + + if (cpu_flags & AV_CPU_FLAG_MMX) + rgb2rgb_init_MMX(); + if (HAVE_AMD3DNOW && cpu_flags & AV_CPU_FLAG_3DNOW) + rgb2rgb_init_3DNOW(); + if (HAVE_MMX2 && cpu_flags & AV_CPU_FLAG_MMX2) + rgb2rgb_init_MMX2(); + if (HAVE_SSE && cpu_flags & AV_CPU_FLAG_SSE2) + rgb2rgb_init_SSE2(); +} diff --git a/libswscale/x86/rgb2rgb_template.c b/libswscale/x86/rgb2rgb_template.c new file mode 100644 index 0000000000..baef3f8ae5 --- /dev/null +++ b/libswscale/x86/rgb2rgb_template.c @@ -0,0 +1,2607 @@ +/* + * software RGB to RGB converter + * pluralize by software PAL8 to RGB converter + * software YUV to YUV converter + * software YUV to RGB converter + * Written by Nick Kurshev. + * palette & YUV & runtime CPU stuff by Michael (michaelni@gmx.at) + * lot of big-endian byte order fixes by Alex Beregszaszi + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <stddef.h> + +#undef PREFETCH +#undef MOVNTQ +#undef EMMS +#undef SFENCE +#undef PAVGB + +#if COMPILE_TEMPLATE_AMD3DNOW +#define PREFETCH "prefetch" +#define PAVGB "pavgusb" +#elif COMPILE_TEMPLATE_MMX2 +#define PREFETCH "prefetchnta" +#define PAVGB "pavgb" +#else +#define PREFETCH " # nop" +#endif + +#if COMPILE_TEMPLATE_AMD3DNOW +/* On K6 femms is faster than emms. On K7 femms is directly mapped to emms. 
*/ +#define EMMS "femms" +#else +#define EMMS "emms" +#endif + +#if COMPILE_TEMPLATE_MMX2 +#define MOVNTQ "movntq" +#define SFENCE "sfence" +#else +#define MOVNTQ "movq" +#define SFENCE " # nop" +#endif + +#if !COMPILE_TEMPLATE_SSE2 + +#if !COMPILE_TEMPLATE_AMD3DNOW + +static inline void RENAME(rgb24tobgr32)(const uint8_t *src, uint8_t *dst, int src_size) +{ + uint8_t *dest = dst; + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); + mm_end = end - 23; + __asm__ volatile("movq %0, %%mm7"::"m"(mask32a):"memory"); + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "punpckldq 3%1, %%mm0 \n\t" + "movd 6%1, %%mm1 \n\t" + "punpckldq 9%1, %%mm1 \n\t" + "movd 12%1, %%mm2 \n\t" + "punpckldq 15%1, %%mm2 \n\t" + "movd 18%1, %%mm3 \n\t" + "punpckldq 21%1, %%mm3 \n\t" + "por %%mm7, %%mm0 \n\t" + "por %%mm7, %%mm1 \n\t" + "por %%mm7, %%mm2 \n\t" + "por %%mm7, %%mm3 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + MOVNTQ" %%mm1, 8%0 \n\t" + MOVNTQ" %%mm2, 16%0 \n\t" + MOVNTQ" %%mm3, 24%0" + :"=m"(*dest) + :"m"(*s) + :"memory"); + dest += 32; + s += 24; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + *dest++ = *s++; + *dest++ = *s++; + *dest++ = *s++; + *dest++ = 255; + } +} + +#define STORE_BGR24_MMX \ + "psrlq $8, %%mm2 \n\t" \ + "psrlq $8, %%mm3 \n\t" \ + "psrlq $8, %%mm6 \n\t" \ + "psrlq $8, %%mm7 \n\t" \ + "pand "MANGLE(mask24l)", %%mm0\n\t" \ + "pand "MANGLE(mask24l)", %%mm1\n\t" \ + "pand "MANGLE(mask24l)", %%mm4\n\t" \ + "pand "MANGLE(mask24l)", %%mm5\n\t" \ + "pand "MANGLE(mask24h)", %%mm2\n\t" \ + "pand "MANGLE(mask24h)", %%mm3\n\t" \ + "pand "MANGLE(mask24h)", %%mm6\n\t" \ + "pand "MANGLE(mask24h)", %%mm7\n\t" \ + "por %%mm2, %%mm0 \n\t" \ + "por %%mm3, %%mm1 \n\t" \ + "por %%mm6, %%mm4 \n\t" \ + "por %%mm7, %%mm5 \n\t" \ + \ + "movq %%mm1, %%mm2 \n\t" \ + "movq %%mm4, %%mm3 \n\t" \ + "psllq $48, %%mm2 \n\t" \ + "psllq $32, %%mm3 \n\t" \ + "pand "MANGLE(mask24hh)", %%mm2\n\t" \ + "pand "MANGLE(mask24hhh)", %%mm3\n\t" \ + "por %%mm2, %%mm0 \n\t" \ + "psrlq $16, %%mm1 \n\t" \ + "psrlq $32, %%mm4 \n\t" \ + "psllq $16, %%mm5 \n\t" \ + "por %%mm3, %%mm1 \n\t" \ + "pand "MANGLE(mask24hhhh)", %%mm5\n\t" \ + "por %%mm5, %%mm4 \n\t" \ + \ + MOVNTQ" %%mm0, %0 \n\t" \ + MOVNTQ" %%mm1, 8%0 \n\t" \ + MOVNTQ" %%mm4, 16%0" + + +static inline void RENAME(rgb32tobgr24)(const uint8_t *src, uint8_t *dst, int src_size) +{ + uint8_t *dest = dst; + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); + mm_end = end - 31; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq 8%1, %%mm1 \n\t" + "movq 16%1, %%mm4 \n\t" + "movq 24%1, %%mm5 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm1, %%mm3 \n\t" + "movq %%mm4, %%mm6 \n\t" + "movq %%mm5, %%mm7 \n\t" + STORE_BGR24_MMX + :"=m"(*dest) + :"m"(*s) + :"memory"); + dest += 24; + s += 32; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + *dest++ = *s++; + *dest++ = *s++; + *dest++ = *s++; + s++; + } +} + +/* + original by Strepto/Astral + ported to gcc & bugfixed: A'rpi + MMX2, 3DNOW optimization by Nick Kurshev + 32-bit C version, and and&add trick by Michael Niedermayer +*/ +static inline void RENAME(rgb15to16)(const uint8_t *src, uint8_t *dst, int src_size) +{ + register const uint8_t* s=src; + register uint8_t* 
d=dst; + register const uint8_t *end; + const uint8_t *mm_end; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*s)); + __asm__ volatile("movq %0, %%mm4"::"m"(mask15s)); + mm_end = end - 15; + while (s<mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq 8%1, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "pand %%mm4, %%mm0 \n\t" + "pand %%mm4, %%mm2 \n\t" + "paddw %%mm1, %%mm0 \n\t" + "paddw %%mm3, %%mm2 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + MOVNTQ" %%mm2, 8%0" + :"=m"(*d) + :"m"(*s) + ); + d+=16; + s+=16; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + mm_end = end - 3; + while (s < mm_end) { + register unsigned x= *((const uint32_t *)s); + *((uint32_t *)d) = (x&0x7FFF7FFF) + (x&0x7FE07FE0); + d+=4; + s+=4; + } + if (s < end) { + register unsigned short x= *((const uint16_t *)s); + *((uint16_t *)d) = (x&0x7FFF) + (x&0x7FE0); + } +} + +static inline void RENAME(rgb16to15)(const uint8_t *src, uint8_t *dst, int src_size) +{ + register const uint8_t* s=src; + register uint8_t* d=dst; + register const uint8_t *end; + const uint8_t *mm_end; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*s)); + __asm__ volatile("movq %0, %%mm7"::"m"(mask15rg)); + __asm__ volatile("movq %0, %%mm6"::"m"(mask15b)); + mm_end = end - 15; + while (s<mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq 8%1, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlq $1, %%mm0 \n\t" + "psrlq $1, %%mm2 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm3 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm3, %%mm2 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + MOVNTQ" %%mm2, 8%0" + :"=m"(*d) + :"m"(*s) + ); + d+=16; + s+=16; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + mm_end = end - 3; + while (s < mm_end) { + register uint32_t x= *((const uint32_t*)s); + *((uint32_t *)d) = ((x>>1)&0x7FE07FE0) | (x&0x001F001F); + s+=4; + d+=4; + } + if (s < end) { + register uint16_t x= *((const uint16_t*)s); + *((uint16_t *)d) = ((x>>1)&0x7FE0) | (x&0x001F); + } +} + +static inline void RENAME(rgb32to16)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + mm_end = end - 15; +#if 1 //is faster only if multiplies are reasonably fast (FIXME figure out on which CPUs this is faster, on Athlon it is slightly faster) + __asm__ volatile( + "movq %3, %%mm5 \n\t" + "movq %4, %%mm6 \n\t" + "movq %5, %%mm7 \n\t" + "jmp 2f \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 32(%1) \n\t" + "movd (%1), %%mm0 \n\t" + "movd 4(%1), %%mm3 \n\t" + "punpckldq 8(%1), %%mm0 \n\t" + "punpckldq 12(%1), %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm3, %%mm4 \n\t" + "pand %%mm6, %%mm0 \n\t" + "pand %%mm6, %%mm3 \n\t" + "pmaddwd %%mm7, %%mm0 \n\t" + "pmaddwd %%mm7, %%mm3 \n\t" + "pand %%mm5, %%mm1 \n\t" + "pand %%mm5, %%mm4 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "psrld $5, %%mm0 \n\t" + "pslld $11, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, (%0) \n\t" + "add $16, %1 \n\t" + "add $8, %0 \n\t" + "2: \n\t" + "cmp %2, %1 \n\t" + " jb 1b \n\t" + : "+r" (d), "+r"(s) + : "r" (mm_end), "m" (mask3216g), "m" (mask3216br), "m" (mul3216) + ); +#else + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + 
::"m"(red_16mask),"m"(green_16mask)); + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 4%1, %%mm3 \n\t" + "punpckldq 8%1, %%mm0 \n\t" + "punpckldq 12%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psrlq $3, %%mm0 \n\t" + "psrlq $3, %%mm3 \n\t" + "pand %2, %%mm0 \n\t" + "pand %2, %%mm3 \n\t" + "psrlq $5, %%mm1 \n\t" + "psrlq $5, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $8, %%mm2 \n\t" + "psrlq $8, %%mm5 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); + d += 4; + s += 16; + } +#endif + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register int rgb = *(const uint32_t*)s; s += 4; + *d++ = ((rgb&0xFF)>>3) + ((rgb&0xFC00)>>5) + ((rgb&0xF80000)>>8); + } +} + +static inline void RENAME(rgb32tobgr16)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_16mask),"m"(green_16mask)); + mm_end = end - 15; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 4%1, %%mm3 \n\t" + "punpckldq 8%1, %%mm0 \n\t" + "punpckldq 12%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psllq $8, %%mm0 \n\t" + "psllq $8, %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm3 \n\t" + "psrlq $5, %%mm1 \n\t" + "psrlq $5, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $19, %%mm2 \n\t" + "psrlq $19, %%mm5 \n\t" + "pand %2, %%mm2 \n\t" + "pand %2, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); + d += 4; + s += 16; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register int rgb = *(const uint32_t*)s; s += 4; + *d++ = ((rgb&0xF8)<<8) + ((rgb&0xFC00)>>5) + ((rgb&0xF80000)>>19); + } +} + +static inline void RENAME(rgb32to15)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + mm_end = end - 15; +#if 1 //is faster only if multiplies are reasonably fast (FIXME figure out on which CPUs this is faster, on Athlon it is slightly faster) + __asm__ volatile( + "movq %3, %%mm5 \n\t" + "movq %4, %%mm6 \n\t" + "movq %5, %%mm7 \n\t" + "jmp 2f \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 32(%1) \n\t" + "movd (%1), %%mm0 \n\t" + "movd 4(%1), %%mm3 \n\t" + "punpckldq 8(%1), %%mm0 \n\t" + "punpckldq 12(%1), %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm3, %%mm4 \n\t" + "pand %%mm6, %%mm0 \n\t" + "pand %%mm6, %%mm3 \n\t" + "pmaddwd %%mm7, %%mm0 \n\t" + "pmaddwd %%mm7, %%mm3 \n\t" + "pand %%mm5, %%mm1 \n\t" + "pand %%mm5, %%mm4 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "psrld $6, %%mm0 \n\t" + 
"pslld $10, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, (%0) \n\t" + "add $16, %1 \n\t" + "add $8, %0 \n\t" + "2: \n\t" + "cmp %2, %1 \n\t" + " jb 1b \n\t" + : "+r" (d), "+r"(s) + : "r" (mm_end), "m" (mask3215g), "m" (mask3216br), "m" (mul3215) + ); +#else + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_15mask),"m"(green_15mask)); + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 4%1, %%mm3 \n\t" + "punpckldq 8%1, %%mm0 \n\t" + "punpckldq 12%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psrlq $3, %%mm0 \n\t" + "psrlq $3, %%mm3 \n\t" + "pand %2, %%mm0 \n\t" + "pand %2, %%mm3 \n\t" + "psrlq $6, %%mm1 \n\t" + "psrlq $6, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $9, %%mm2 \n\t" + "psrlq $9, %%mm5 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); + d += 4; + s += 16; + } +#endif + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register int rgb = *(const uint32_t*)s; s += 4; + *d++ = ((rgb&0xFF)>>3) + ((rgb&0xF800)>>6) + ((rgb&0xF80000)>>9); + } +} + +static inline void RENAME(rgb32tobgr15)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_15mask),"m"(green_15mask)); + mm_end = end - 15; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 4%1, %%mm3 \n\t" + "punpckldq 8%1, %%mm0 \n\t" + "punpckldq 12%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psllq $7, %%mm0 \n\t" + "psllq $7, %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm3 \n\t" + "psrlq $6, %%mm1 \n\t" + "psrlq $6, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $19, %%mm2 \n\t" + "psrlq $19, %%mm5 \n\t" + "pand %2, %%mm2 \n\t" + "pand %2, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); + d += 4; + s += 16; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register int rgb = *(const uint32_t*)s; s += 4; + *d++ = ((rgb&0xF8)<<7) + ((rgb&0xF800)>>6) + ((rgb&0xF80000)>>19); + } +} + +static inline void RENAME(rgb24tobgr16)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_16mask),"m"(green_16mask)); + mm_end = end - 11; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 3%1, %%mm3 \n\t" + "punpckldq 6%1, %%mm0 \n\t" + "punpckldq 9%1, %%mm3 
\n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psrlq $3, %%mm0 \n\t" + "psrlq $3, %%mm3 \n\t" + "pand %2, %%mm0 \n\t" + "pand %2, %%mm3 \n\t" + "psrlq $5, %%mm1 \n\t" + "psrlq $5, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $8, %%mm2 \n\t" + "psrlq $8, %%mm5 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); + d += 4; + s += 12; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + const int b = *s++; + const int g = *s++; + const int r = *s++; + *d++ = (b>>3) | ((g&0xFC)<<3) | ((r&0xF8)<<8); + } +} + +static inline void RENAME(rgb24to16)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_16mask),"m"(green_16mask)); + mm_end = end - 15; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 3%1, %%mm3 \n\t" + "punpckldq 6%1, %%mm0 \n\t" + "punpckldq 9%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psllq $8, %%mm0 \n\t" + "psllq $8, %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm3 \n\t" + "psrlq $5, %%mm1 \n\t" + "psrlq $5, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $19, %%mm2 \n\t" + "psrlq $19, %%mm5 \n\t" + "pand %2, %%mm2 \n\t" + "pand %2, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_16mask):"memory"); + d += 4; + s += 12; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + const int r = *s++; + const int g = *s++; + const int b = *s++; + *d++ = (b>>3) | ((g&0xFC)<<3) | ((r&0xF8)<<8); + } +} + +static inline void RENAME(rgb24tobgr15)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_15mask),"m"(green_15mask)); + mm_end = end - 11; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 3%1, %%mm3 \n\t" + "punpckldq 6%1, %%mm0 \n\t" + "punpckldq 9%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psrlq $3, %%mm0 \n\t" + "psrlq $3, %%mm3 \n\t" + "pand %2, %%mm0 \n\t" + "pand %2, %%mm3 \n\t" + "psrlq $6, %%mm1 \n\t" + "psrlq $6, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $9, %%mm2 \n\t" + "psrlq $9, %%mm5 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + 
:"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); + d += 4; + s += 12; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + const int b = *s++; + const int g = *s++; + const int r = *s++; + *d++ = (b>>3) | ((g&0xF8)<<2) | ((r&0xF8)<<7); + } +} + +static inline void RENAME(rgb24to15)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint8_t *s = src; + const uint8_t *end; + const uint8_t *mm_end; + uint16_t *d = (uint16_t *)dst; + end = s + src_size; + __asm__ volatile(PREFETCH" %0"::"m"(*src):"memory"); + __asm__ volatile( + "movq %0, %%mm7 \n\t" + "movq %1, %%mm6 \n\t" + ::"m"(red_15mask),"m"(green_15mask)); + mm_end = end - 15; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movd %1, %%mm0 \n\t" + "movd 3%1, %%mm3 \n\t" + "punpckldq 6%1, %%mm0 \n\t" + "punpckldq 9%1, %%mm3 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm3, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "psllq $7, %%mm0 \n\t" + "psllq $7, %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm3 \n\t" + "psrlq $6, %%mm1 \n\t" + "psrlq $6, %%mm4 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "psrlq $19, %%mm2 \n\t" + "psrlq $19, %%mm5 \n\t" + "pand %2, %%mm2 \n\t" + "pand %2, %%mm5 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm5, %%mm3 \n\t" + "psllq $16, %%mm3 \n\t" + "por %%mm3, %%mm0 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + :"=m"(*d):"m"(*s),"m"(blue_15mask):"memory"); + d += 4; + s += 12; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + const int r = *s++; + const int g = *s++; + const int b = *s++; + *d++ = (b>>3) | ((g&0xF8)<<2) | ((r&0xF8)<<7); + } +} + +/* + I use less accurate approximation here by simply left-shifting the input + value and filling the low order bits with zeroes. This method improves PNG + compression but this scheme cannot reproduce white exactly, since it does + not generate an all-ones maximum value; the net effect is to darken the + image slightly. 
+ + The better method should be "left bit replication": + + 4 3 2 1 0 + --------- + 1 1 0 1 1 + + 7 6 5 4 3 2 1 0 + ---------------- + 1 1 0 1 1 1 1 0 + |=======| |===| + | leftmost bits repeated to fill open bits + | + original bits +*/ +static inline void RENAME(rgb15tobgr24)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint16_t *end; + const uint16_t *mm_end; + uint8_t *d = dst; + const uint16_t *s = (const uint16_t*)src; + end = s + src_size/2; + __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); + mm_end = end - 7; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq %1, %%mm1 \n\t" + "movq %1, %%mm2 \n\t" + "pand %2, %%mm0 \n\t" + "pand %3, %%mm1 \n\t" + "pand %4, %%mm2 \n\t" + "psllq $3, %%mm0 \n\t" + "psrlq $2, %%mm1 \n\t" + "psrlq $7, %%mm2 \n\t" + "movq %%mm0, %%mm3 \n\t" + "movq %%mm1, %%mm4 \n\t" + "movq %%mm2, %%mm5 \n\t" + "punpcklwd %5, %%mm0 \n\t" + "punpcklwd %5, %%mm1 \n\t" + "punpcklwd %5, %%mm2 \n\t" + "punpckhwd %5, %%mm3 \n\t" + "punpckhwd %5, %%mm4 \n\t" + "punpckhwd %5, %%mm5 \n\t" + "psllq $8, %%mm1 \n\t" + "psllq $16, %%mm2 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm2, %%mm0 \n\t" + "psllq $8, %%mm4 \n\t" + "psllq $16, %%mm5 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm5, %%mm3 \n\t" + + "movq %%mm0, %%mm6 \n\t" + "movq %%mm3, %%mm7 \n\t" + + "movq 8%1, %%mm0 \n\t" + "movq 8%1, %%mm1 \n\t" + "movq 8%1, %%mm2 \n\t" + "pand %2, %%mm0 \n\t" + "pand %3, %%mm1 \n\t" + "pand %4, %%mm2 \n\t" + "psllq $3, %%mm0 \n\t" + "psrlq $2, %%mm1 \n\t" + "psrlq $7, %%mm2 \n\t" + "movq %%mm0, %%mm3 \n\t" + "movq %%mm1, %%mm4 \n\t" + "movq %%mm2, %%mm5 \n\t" + "punpcklwd %5, %%mm0 \n\t" + "punpcklwd %5, %%mm1 \n\t" + "punpcklwd %5, %%mm2 \n\t" + "punpckhwd %5, %%mm3 \n\t" + "punpckhwd %5, %%mm4 \n\t" + "punpckhwd %5, %%mm5 \n\t" + "psllq $8, %%mm1 \n\t" + "psllq $16, %%mm2 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm2, %%mm0 \n\t" + "psllq $8, %%mm4 \n\t" + "psllq $16, %%mm5 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm5, %%mm3 \n\t" + + :"=m"(*d) + :"m"(*s),"m"(mask15b),"m"(mask15g),"m"(mask15r), "m"(mmx_null) + :"memory"); + /* borrowed 32 to 24 */ + __asm__ volatile( + "movq %%mm0, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "movq %%mm6, %%mm0 \n\t" + "movq %%mm7, %%mm1 \n\t" + + "movq %%mm4, %%mm6 \n\t" + "movq %%mm5, %%mm7 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm1, %%mm3 \n\t" + + STORE_BGR24_MMX + + :"=m"(*d) + :"m"(*s) + :"memory"); + d += 24; + s += 8; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register uint16_t bgr; + bgr = *s++; + *d++ = (bgr&0x1F)<<3; + *d++ = (bgr&0x3E0)>>2; + *d++ = (bgr&0x7C00)>>7; + } +} + +static inline void RENAME(rgb16tobgr24)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint16_t *end; + const uint16_t *mm_end; + uint8_t *d = (uint8_t *)dst; + const uint16_t *s = (const uint16_t *)src; + end = s + src_size/2; + __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); + mm_end = end - 7; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq %1, %%mm1 \n\t" + "movq %1, %%mm2 \n\t" + "pand %2, %%mm0 \n\t" + "pand %3, %%mm1 \n\t" + "pand %4, %%mm2 \n\t" + "psllq $3, %%mm0 \n\t" + "psrlq $3, %%mm1 \n\t" + "psrlq $8, %%mm2 \n\t" + "movq %%mm0, %%mm3 \n\t" + "movq %%mm1, %%mm4 \n\t" + "movq %%mm2, %%mm5 \n\t" + "punpcklwd %5, %%mm0 \n\t" + "punpcklwd %5, %%mm1 \n\t" + "punpcklwd %5, %%mm2 \n\t" + "punpckhwd %5, %%mm3 \n\t" + "punpckhwd %5, %%mm4 \n\t" + "punpckhwd %5, %%mm5 \n\t" + "psllq $8, 
%%mm1 \n\t" + "psllq $16, %%mm2 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm2, %%mm0 \n\t" + "psllq $8, %%mm4 \n\t" + "psllq $16, %%mm5 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm5, %%mm3 \n\t" + + "movq %%mm0, %%mm6 \n\t" + "movq %%mm3, %%mm7 \n\t" + + "movq 8%1, %%mm0 \n\t" + "movq 8%1, %%mm1 \n\t" + "movq 8%1, %%mm2 \n\t" + "pand %2, %%mm0 \n\t" + "pand %3, %%mm1 \n\t" + "pand %4, %%mm2 \n\t" + "psllq $3, %%mm0 \n\t" + "psrlq $3, %%mm1 \n\t" + "psrlq $8, %%mm2 \n\t" + "movq %%mm0, %%mm3 \n\t" + "movq %%mm1, %%mm4 \n\t" + "movq %%mm2, %%mm5 \n\t" + "punpcklwd %5, %%mm0 \n\t" + "punpcklwd %5, %%mm1 \n\t" + "punpcklwd %5, %%mm2 \n\t" + "punpckhwd %5, %%mm3 \n\t" + "punpckhwd %5, %%mm4 \n\t" + "punpckhwd %5, %%mm5 \n\t" + "psllq $8, %%mm1 \n\t" + "psllq $16, %%mm2 \n\t" + "por %%mm1, %%mm0 \n\t" + "por %%mm2, %%mm0 \n\t" + "psllq $8, %%mm4 \n\t" + "psllq $16, %%mm5 \n\t" + "por %%mm4, %%mm3 \n\t" + "por %%mm5, %%mm3 \n\t" + :"=m"(*d) + :"m"(*s),"m"(mask16b),"m"(mask16g),"m"(mask16r),"m"(mmx_null) + :"memory"); + /* borrowed 32 to 24 */ + __asm__ volatile( + "movq %%mm0, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "movq %%mm6, %%mm0 \n\t" + "movq %%mm7, %%mm1 \n\t" + + "movq %%mm4, %%mm6 \n\t" + "movq %%mm5, %%mm7 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm1, %%mm3 \n\t" + + STORE_BGR24_MMX + + :"=m"(*d) + :"m"(*s) + :"memory"); + d += 24; + s += 8; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register uint16_t bgr; + bgr = *s++; + *d++ = (bgr&0x1F)<<3; + *d++ = (bgr&0x7E0)>>3; + *d++ = (bgr&0xF800)>>8; + } +} + +/* + * mm0 = 00 B3 00 B2 00 B1 00 B0 + * mm1 = 00 G3 00 G2 00 G1 00 G0 + * mm2 = 00 R3 00 R2 00 R1 00 R0 + * mm6 = FF FF FF FF FF FF FF FF + * mm7 = 00 00 00 00 00 00 00 00 + */ +#define PACK_RGB32 \ + "packuswb %%mm7, %%mm0 \n\t" /* 00 00 00 00 B3 B2 B1 B0 */ \ + "packuswb %%mm7, %%mm1 \n\t" /* 00 00 00 00 G3 G2 G1 G0 */ \ + "packuswb %%mm7, %%mm2 \n\t" /* 00 00 00 00 R3 R2 R1 R0 */ \ + "punpcklbw %%mm1, %%mm0 \n\t" /* G3 B3 G2 B2 G1 B1 G0 B0 */ \ + "punpcklbw %%mm6, %%mm2 \n\t" /* FF R3 FF R2 FF R1 FF R0 */ \ + "movq %%mm0, %%mm3 \n\t" \ + "punpcklwd %%mm2, %%mm0 \n\t" /* FF R1 G1 B1 FF R0 G0 B0 */ \ + "punpckhwd %%mm2, %%mm3 \n\t" /* FF R3 G3 B3 FF R2 G2 B2 */ \ + MOVNTQ" %%mm0, %0 \n\t" \ + MOVNTQ" %%mm3, 8%0 \n\t" \ + +static inline void RENAME(rgb15to32)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint16_t *end; + const uint16_t *mm_end; + uint8_t *d = dst; + const uint16_t *s = (const uint16_t *)src; + end = s + src_size/2; + __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); + __asm__ volatile("pxor %%mm7,%%mm7 \n\t":::"memory"); + __asm__ volatile("pcmpeqd %%mm6,%%mm6 \n\t":::"memory"); + mm_end = end - 3; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq %1, %%mm1 \n\t" + "movq %1, %%mm2 \n\t" + "pand %2, %%mm0 \n\t" + "pand %3, %%mm1 \n\t" + "pand %4, %%mm2 \n\t" + "psllq $3, %%mm0 \n\t" + "psrlq $2, %%mm1 \n\t" + "psrlq $7, %%mm2 \n\t" + PACK_RGB32 + :"=m"(*d) + :"m"(*s),"m"(mask15b),"m"(mask15g),"m"(mask15r) + :"memory"); + d += 16; + s += 4; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register uint16_t bgr; + bgr = *s++; + *d++ = (bgr&0x1F)<<3; + *d++ = (bgr&0x3E0)>>2; + *d++ = (bgr&0x7C00)>>7; + *d++ = 255; + } +} + +static inline void RENAME(rgb16to32)(const uint8_t *src, uint8_t *dst, int src_size) +{ + const uint16_t *end; + const uint16_t *mm_end; + uint8_t *d = dst; + const uint16_t *s = 
(const uint16_t*)src; + end = s + src_size/2; + __asm__ volatile(PREFETCH" %0"::"m"(*s):"memory"); + __asm__ volatile("pxor %%mm7,%%mm7 \n\t":::"memory"); + __asm__ volatile("pcmpeqd %%mm6,%%mm6 \n\t":::"memory"); + mm_end = end - 3; + while (s < mm_end) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq %1, %%mm1 \n\t" + "movq %1, %%mm2 \n\t" + "pand %2, %%mm0 \n\t" + "pand %3, %%mm1 \n\t" + "pand %4, %%mm2 \n\t" + "psllq $3, %%mm0 \n\t" + "psrlq $3, %%mm1 \n\t" + "psrlq $8, %%mm2 \n\t" + PACK_RGB32 + :"=m"(*d) + :"m"(*s),"m"(mask16b),"m"(mask16g),"m"(mask16r) + :"memory"); + d += 16; + s += 4; + } + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + while (s < end) { + register uint16_t bgr; + bgr = *s++; + *d++ = (bgr&0x1F)<<3; + *d++ = (bgr&0x7E0)>>3; + *d++ = (bgr&0xF800)>>8; + *d++ = 255; + } +} + +static inline void RENAME(shuffle_bytes_2103)(const uint8_t *src, uint8_t *dst, int src_size) +{ + x86_reg idx = 15 - src_size; + const uint8_t *s = src-idx; + uint8_t *d = dst-idx; + __asm__ volatile( + "test %0, %0 \n\t" + "jns 2f \n\t" + PREFETCH" (%1, %0) \n\t" + "movq %3, %%mm7 \n\t" + "pxor %4, %%mm7 \n\t" + "movq %%mm7, %%mm6 \n\t" + "pxor %5, %%mm7 \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 32(%1, %0) \n\t" + "movq (%1, %0), %%mm0 \n\t" + "movq 8(%1, %0), %%mm1 \n\t" +# if COMPILE_TEMPLATE_MMX2 + "pshufw $177, %%mm0, %%mm3 \n\t" + "pshufw $177, %%mm1, %%mm5 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm6, %%mm3 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm6, %%mm5 \n\t" + "por %%mm3, %%mm0 \n\t" + "por %%mm5, %%mm1 \n\t" +# else + "movq %%mm0, %%mm2 \n\t" + "movq %%mm1, %%mm4 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm6, %%mm2 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm6, %%mm4 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq %%mm4, %%mm5 \n\t" + "pslld $16, %%mm2 \n\t" + "psrld $16, %%mm3 \n\t" + "pslld $16, %%mm4 \n\t" + "psrld $16, %%mm5 \n\t" + "por %%mm2, %%mm0 \n\t" + "por %%mm4, %%mm1 \n\t" + "por %%mm3, %%mm0 \n\t" + "por %%mm5, %%mm1 \n\t" +# endif + MOVNTQ" %%mm0, (%2, %0) \n\t" + MOVNTQ" %%mm1, 8(%2, %0) \n\t" + "add $16, %0 \n\t" + "js 1b \n\t" + SFENCE" \n\t" + EMMS" \n\t" + "2: \n\t" + : "+&r"(idx) + : "r" (s), "r" (d), "m" (mask32b), "m" (mask32r), "m" (mmx_one) + : "memory"); + for (; idx<15; idx+=4) { + register int v = *(const uint32_t *)&s[idx], g = v & 0xff00ff00; + v &= 0xff00ff; + *(uint32_t *)&d[idx] = (v>>16) + g + (v<<16); + } +} + +static inline void RENAME(rgb24tobgr24)(const uint8_t *src, uint8_t *dst, int src_size) +{ + unsigned i; + x86_reg mmx_size= 23 - src_size; + __asm__ volatile ( + "test %%"REG_a", %%"REG_a" \n\t" + "jns 2f \n\t" + "movq "MANGLE(mask24r)", %%mm5 \n\t" + "movq "MANGLE(mask24g)", %%mm6 \n\t" + "movq "MANGLE(mask24b)", %%mm7 \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 32(%1, %%"REG_a") \n\t" + "movq (%1, %%"REG_a"), %%mm0 \n\t" // BGR BGR BG + "movq (%1, %%"REG_a"), %%mm1 \n\t" // BGR BGR BG + "movq 2(%1, %%"REG_a"), %%mm2 \n\t" // R BGR BGR B + "psllq $16, %%mm0 \n\t" // 00 BGR BGR + "pand %%mm5, %%mm0 \n\t" + "pand %%mm6, %%mm1 \n\t" + "pand %%mm7, %%mm2 \n\t" + "por %%mm0, %%mm1 \n\t" + "por %%mm2, %%mm1 \n\t" + "movq 6(%1, %%"REG_a"), %%mm0 \n\t" // BGR BGR BG + MOVNTQ" %%mm1, (%2, %%"REG_a") \n\t" // RGB RGB RG + "movq 8(%1, %%"REG_a"), %%mm1 \n\t" // R BGR BGR B + "movq 10(%1, %%"REG_a"), %%mm2 \n\t" // GR BGR BGR + "pand %%mm7, %%mm0 \n\t" + "pand %%mm5, %%mm1 \n\t" + "pand %%mm6, %%mm2 \n\t" + "por %%mm0, %%mm1 \n\t" + "por %%mm2, %%mm1 \n\t" + "movq 14(%1, 
%%"REG_a"), %%mm0 \n\t" // R BGR BGR B + MOVNTQ" %%mm1, 8(%2, %%"REG_a") \n\t" // B RGB RGB R + "movq 16(%1, %%"REG_a"), %%mm1 \n\t" // GR BGR BGR + "movq 18(%1, %%"REG_a"), %%mm2 \n\t" // BGR BGR BG + "pand %%mm6, %%mm0 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm5, %%mm2 \n\t" + "por %%mm0, %%mm1 \n\t" + "por %%mm2, %%mm1 \n\t" + MOVNTQ" %%mm1, 16(%2, %%"REG_a") \n\t" + "add $24, %%"REG_a" \n\t" + " js 1b \n\t" + "2: \n\t" + : "+a" (mmx_size) + : "r" (src-mmx_size), "r"(dst-mmx_size) + ); + + __asm__ volatile(SFENCE:::"memory"); + __asm__ volatile(EMMS:::"memory"); + + if (mmx_size==23) return; //finished, was multiple of 8 + + src+= src_size; + dst+= src_size; + src_size= 23-mmx_size; + src-= src_size; + dst-= src_size; + for (i=0; i<src_size; i+=3) { + register uint8_t x; + x = src[i + 2]; + dst[i + 1] = src[i + 1]; + dst[i + 2] = src[i + 0]; + dst[i + 0] = x; + } +} + +static inline void RENAME(yuvPlanartoyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, int dstStride, int vertLumPerChroma) +{ + int y; + const x86_reg chromWidth= width>>1; + for (y=0; y<height; y++) { + //FIXME handle 2 lines at once (fewer prefetches, reuse some chroma, but very likely memory-limited anyway) + __asm__ volatile( + "xor %%"REG_a", %%"REG_a" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 32(%1, %%"REG_a", 2) \n\t" + PREFETCH" 32(%2, %%"REG_a") \n\t" + PREFETCH" 32(%3, %%"REG_a") \n\t" + "movq (%2, %%"REG_a"), %%mm0 \n\t" // U(0) + "movq %%mm0, %%mm2 \n\t" // U(0) + "movq (%3, %%"REG_a"), %%mm1 \n\t" // V(0) + "punpcklbw %%mm1, %%mm0 \n\t" // UVUV UVUV(0) + "punpckhbw %%mm1, %%mm2 \n\t" // UVUV UVUV(8) + + "movq (%1, %%"REG_a",2), %%mm3 \n\t" // Y(0) + "movq 8(%1, %%"REG_a",2), %%mm5 \n\t" // Y(8) + "movq %%mm3, %%mm4 \n\t" // Y(0) + "movq %%mm5, %%mm6 \n\t" // Y(8) + "punpcklbw %%mm0, %%mm3 \n\t" // YUYV YUYV(0) + "punpckhbw %%mm0, %%mm4 \n\t" // YUYV YUYV(4) + "punpcklbw %%mm2, %%mm5 \n\t" // YUYV YUYV(8) + "punpckhbw %%mm2, %%mm6 \n\t" // YUYV YUYV(12) + + MOVNTQ" %%mm3, (%0, %%"REG_a", 4) \n\t" + MOVNTQ" %%mm4, 8(%0, %%"REG_a", 4) \n\t" + MOVNTQ" %%mm5, 16(%0, %%"REG_a", 4) \n\t" + MOVNTQ" %%mm6, 24(%0, %%"REG_a", 4) \n\t" + + "add $8, %%"REG_a" \n\t" + "cmp %4, %%"REG_a" \n\t" + " jb 1b \n\t" + ::"r"(dst), "r"(ysrc), "r"(usrc), "r"(vsrc), "g" (chromWidth) + : "%"REG_a + ); + if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) { + usrc += chromStride; + vsrc += chromStride; + } + ysrc += lumStride; + dst += dstStride; + } + __asm__(EMMS" \n\t" + SFENCE" \n\t" + :::"memory"); +} + +/** + * Height should be a multiple of 2 and width should be a multiple of 16. + * (If this is a problem for anyone then tell me, and I will fix it.) 
+ */ +static inline void RENAME(yv12toyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, int dstStride) +{ + //FIXME interpolate chroma + RENAME(yuvPlanartoyuy2)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 2); +} + +static inline void RENAME(yuvPlanartouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, int dstStride, int vertLumPerChroma) +{ + int y; + const x86_reg chromWidth= width>>1; + for (y=0; y<height; y++) { + //FIXME handle 2 lines at once (fewer prefetches, reuse some chroma, but very likely memory-limited anyway) + __asm__ volatile( + "xor %%"REG_a", %%"REG_a" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 32(%1, %%"REG_a", 2) \n\t" + PREFETCH" 32(%2, %%"REG_a") \n\t" + PREFETCH" 32(%3, %%"REG_a") \n\t" + "movq (%2, %%"REG_a"), %%mm0 \n\t" // U(0) + "movq %%mm0, %%mm2 \n\t" // U(0) + "movq (%3, %%"REG_a"), %%mm1 \n\t" // V(0) + "punpcklbw %%mm1, %%mm0 \n\t" // UVUV UVUV(0) + "punpckhbw %%mm1, %%mm2 \n\t" // UVUV UVUV(8) + + "movq (%1, %%"REG_a",2), %%mm3 \n\t" // Y(0) + "movq 8(%1, %%"REG_a",2), %%mm5 \n\t" // Y(8) + "movq %%mm0, %%mm4 \n\t" // Y(0) + "movq %%mm2, %%mm6 \n\t" // Y(8) + "punpcklbw %%mm3, %%mm0 \n\t" // YUYV YUYV(0) + "punpckhbw %%mm3, %%mm4 \n\t" // YUYV YUYV(4) + "punpcklbw %%mm5, %%mm2 \n\t" // YUYV YUYV(8) + "punpckhbw %%mm5, %%mm6 \n\t" // YUYV YUYV(12) + + MOVNTQ" %%mm0, (%0, %%"REG_a", 4) \n\t" + MOVNTQ" %%mm4, 8(%0, %%"REG_a", 4) \n\t" + MOVNTQ" %%mm2, 16(%0, %%"REG_a", 4) \n\t" + MOVNTQ" %%mm6, 24(%0, %%"REG_a", 4) \n\t" + + "add $8, %%"REG_a" \n\t" + "cmp %4, %%"REG_a" \n\t" + " jb 1b \n\t" + ::"r"(dst), "r"(ysrc), "r"(usrc), "r"(vsrc), "g" (chromWidth) + : "%"REG_a + ); + if ((y&(vertLumPerChroma-1)) == vertLumPerChroma-1) { + usrc += chromStride; + vsrc += chromStride; + } + ysrc += lumStride; + dst += dstStride; + } + __asm__(EMMS" \n\t" + SFENCE" \n\t" + :::"memory"); +} + +/** + * Height should be a multiple of 2 and width should be a multiple of 16 + * (If this is a problem for anyone then tell me, and I will fix it.) + */ +static inline void RENAME(yv12touyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, int dstStride) +{ + //FIXME interpolate chroma + RENAME(yuvPlanartouyvy)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 2); +} + +/** + * Width should be a multiple of 16. + */ +static inline void RENAME(yuv422ptouyvy)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, int dstStride) +{ + RENAME(yuvPlanartouyvy)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 1); +} + +/** + * Width should be a multiple of 16. + */ +static inline void RENAME(yuv422ptoyuy2)(const uint8_t *ysrc, const uint8_t *usrc, const uint8_t *vsrc, uint8_t *dst, + int width, int height, + int lumStride, int chromStride, int dstStride) +{ + RENAME(yuvPlanartoyuy2)(ysrc, usrc, vsrc, dst, width, height, lumStride, chromStride, dstStride, 1); +} + +/** + * Height should be a multiple of 2 and width should be a multiple of 16. + * (If this is a problem for anyone then tell me, and I will fix it.) 
+ */ +static inline void RENAME(yuy2toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const x86_reg chromWidth= width>>1; + for (y=0; y<height; y+=2) { + __asm__ volatile( + "xor %%"REG_a", %%"REG_a" \n\t" + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" // FF,00,FF,00... + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 64(%0, %%"REG_a", 4) \n\t" + "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // YUYV YUYV(0) + "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(4) + "movq %%mm0, %%mm2 \n\t" // YUYV YUYV(0) + "movq %%mm1, %%mm3 \n\t" // YUYV YUYV(4) + "psrlw $8, %%mm0 \n\t" // U0V0 U0V0(0) + "psrlw $8, %%mm1 \n\t" // U0V0 U0V0(4) + "pand %%mm7, %%mm2 \n\t" // Y0Y0 Y0Y0(0) + "pand %%mm7, %%mm3 \n\t" // Y0Y0 Y0Y0(4) + "packuswb %%mm1, %%mm0 \n\t" // UVUV UVUV(0) + "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(0) + + MOVNTQ" %%mm2, (%1, %%"REG_a", 2) \n\t" + + "movq 16(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(8) + "movq 24(%0, %%"REG_a", 4), %%mm2 \n\t" // YUYV YUYV(12) + "movq %%mm1, %%mm3 \n\t" // YUYV YUYV(8) + "movq %%mm2, %%mm4 \n\t" // YUYV YUYV(12) + "psrlw $8, %%mm1 \n\t" // U0V0 U0V0(8) + "psrlw $8, %%mm2 \n\t" // U0V0 U0V0(12) + "pand %%mm7, %%mm3 \n\t" // Y0Y0 Y0Y0(8) + "pand %%mm7, %%mm4 \n\t" // Y0Y0 Y0Y0(12) + "packuswb %%mm2, %%mm1 \n\t" // UVUV UVUV(8) + "packuswb %%mm4, %%mm3 \n\t" // YYYY YYYY(8) + + MOVNTQ" %%mm3, 8(%1, %%"REG_a", 2) \n\t" + + "movq %%mm0, %%mm2 \n\t" // UVUV UVUV(0) + "movq %%mm1, %%mm3 \n\t" // UVUV UVUV(8) + "psrlw $8, %%mm0 \n\t" // V0V0 V0V0(0) + "psrlw $8, %%mm1 \n\t" // V0V0 V0V0(8) + "pand %%mm7, %%mm2 \n\t" // U0U0 U0U0(0) + "pand %%mm7, %%mm3 \n\t" // U0U0 U0U0(8) + "packuswb %%mm1, %%mm0 \n\t" // VVVV VVVV(0) + "packuswb %%mm3, %%mm2 \n\t" // UUUU UUUU(0) + + MOVNTQ" %%mm0, (%3, %%"REG_a") \n\t" + MOVNTQ" %%mm2, (%2, %%"REG_a") \n\t" + + "add $8, %%"REG_a" \n\t" + "cmp %4, %%"REG_a" \n\t" + " jb 1b \n\t" + ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) + : "memory", "%"REG_a + ); + + ydst += lumStride; + src += srcStride; + + __asm__ volatile( + "xor %%"REG_a", %%"REG_a" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 64(%0, %%"REG_a", 4) \n\t" + "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // YUYV YUYV(0) + "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(4) + "movq 16(%0, %%"REG_a", 4), %%mm2 \n\t" // YUYV YUYV(8) + "movq 24(%0, %%"REG_a", 4), %%mm3 \n\t" // YUYV YUYV(12) + "pand %%mm7, %%mm0 \n\t" // Y0Y0 Y0Y0(0) + "pand %%mm7, %%mm1 \n\t" // Y0Y0 Y0Y0(4) + "pand %%mm7, %%mm2 \n\t" // Y0Y0 Y0Y0(8) + "pand %%mm7, %%mm3 \n\t" // Y0Y0 Y0Y0(12) + "packuswb %%mm1, %%mm0 \n\t" // YYYY YYYY(0) + "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(8) + + MOVNTQ" %%mm0, (%1, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm2, 8(%1, %%"REG_a", 2) \n\t" + + "add $8, %%"REG_a" \n\t" + "cmp %4, %%"REG_a" \n\t" + " jb 1b \n\t" + + ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) + : "memory", "%"REG_a + ); + udst += chromStride; + vdst += chromStride; + ydst += lumStride; + src += srcStride; + } + __asm__ volatile(EMMS" \n\t" + SFENCE" \n\t" + :::"memory"); +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + +#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW +static inline void RENAME(planar2x)(const uint8_t *src, uint8_t *dst, int srcWidth, int srcHeight, int srcStride, int dstStride) +{ + int x,y; + + dst[0]= src[0]; + + // first line + for (x=0; x<srcWidth-1; x++) { + dst[2*x+1]= (3*src[x] + src[x+1])>>2; + dst[2*x+2]= ( src[x] + 3*src[x+1])>>2; + } + 
dst[2*srcWidth-1]= src[srcWidth-1]; + + dst+= dstStride; + + for (y=1; y<srcHeight; y++) { + const x86_reg mmxSize= srcWidth&~15; + __asm__ volatile( + "mov %4, %%"REG_a" \n\t" + "movq "MANGLE(mmx_ff)", %%mm0 \n\t" + "movq (%0, %%"REG_a"), %%mm4 \n\t" + "movq %%mm4, %%mm2 \n\t" + "psllq $8, %%mm4 \n\t" + "pand %%mm0, %%mm2 \n\t" + "por %%mm2, %%mm4 \n\t" + "movq (%1, %%"REG_a"), %%mm5 \n\t" + "movq %%mm5, %%mm3 \n\t" + "psllq $8, %%mm5 \n\t" + "pand %%mm0, %%mm3 \n\t" + "por %%mm3, %%mm5 \n\t" + "1: \n\t" + "movq (%0, %%"REG_a"), %%mm0 \n\t" + "movq (%1, %%"REG_a"), %%mm1 \n\t" + "movq 1(%0, %%"REG_a"), %%mm2 \n\t" + "movq 1(%1, %%"REG_a"), %%mm3 \n\t" + PAVGB" %%mm0, %%mm5 \n\t" + PAVGB" %%mm0, %%mm3 \n\t" + PAVGB" %%mm0, %%mm5 \n\t" + PAVGB" %%mm0, %%mm3 \n\t" + PAVGB" %%mm1, %%mm4 \n\t" + PAVGB" %%mm1, %%mm2 \n\t" + PAVGB" %%mm1, %%mm4 \n\t" + PAVGB" %%mm1, %%mm2 \n\t" + "movq %%mm5, %%mm7 \n\t" + "movq %%mm4, %%mm6 \n\t" + "punpcklbw %%mm3, %%mm5 \n\t" + "punpckhbw %%mm3, %%mm7 \n\t" + "punpcklbw %%mm2, %%mm4 \n\t" + "punpckhbw %%mm2, %%mm6 \n\t" + MOVNTQ" %%mm5, (%2, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm7, 8(%2, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm4, (%3, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm6, 8(%3, %%"REG_a", 2) \n\t" + "add $8, %%"REG_a" \n\t" + "movq -1(%0, %%"REG_a"), %%mm4 \n\t" + "movq -1(%1, %%"REG_a"), %%mm5 \n\t" + " js 1b \n\t" + :: "r" (src + mmxSize ), "r" (src + srcStride + mmxSize ), + "r" (dst + mmxSize*2), "r" (dst + dstStride + mmxSize*2), + "g" (-mmxSize) + : "%"REG_a + ); + + for (x=mmxSize-1; x<srcWidth-1; x++) { + dst[2*x +1]= (3*src[x+0] + src[x+srcStride+1])>>2; + dst[2*x+dstStride+2]= ( src[x+0] + 3*src[x+srcStride+1])>>2; + dst[2*x+dstStride+1]= ( src[x+1] + 3*src[x+srcStride ])>>2; + dst[2*x +2]= (3*src[x+1] + src[x+srcStride ])>>2; + } + dst[srcWidth*2 -1 ]= (3*src[srcWidth-1] + src[srcWidth-1 + srcStride])>>2; + dst[srcWidth*2 -1 + dstStride]= ( src[srcWidth-1] + 3*src[srcWidth-1 + srcStride])>>2; + + dst+=dstStride*2; + src+=srcStride; + } + + // last line + dst[0]= src[0]; + + for (x=0; x<srcWidth-1; x++) { + dst[2*x+1]= (3*src[x] + src[x+1])>>2; + dst[2*x+2]= ( src[x] + 3*src[x+1])>>2; + } + dst[2*srcWidth-1]= src[srcWidth-1]; + + __asm__ volatile(EMMS" \n\t" + SFENCE" \n\t" + :::"memory"); +} +#endif /* COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW */ + +#if !COMPILE_TEMPLATE_AMD3DNOW +/** + * Height should be a multiple of 2 and width should be a multiple of 16. + * (If this is a problem for anyone then tell me, and I will fix it.) + * Chrominance data is only taken from every second line, others are ignored. + * FIXME: Write HQ version. + */ +static inline void RENAME(uyvytoyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const x86_reg chromWidth= width>>1; + for (y=0; y<height; y+=2) { + __asm__ volatile( + "xor %%"REG_a", %%"REG_a" \n\t" + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" // FF,00,FF,00... 
+ ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 64(%0, %%"REG_a", 4) \n\t" + "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // UYVY UYVY(0) + "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // UYVY UYVY(4) + "movq %%mm0, %%mm2 \n\t" // UYVY UYVY(0) + "movq %%mm1, %%mm3 \n\t" // UYVY UYVY(4) + "pand %%mm7, %%mm0 \n\t" // U0V0 U0V0(0) + "pand %%mm7, %%mm1 \n\t" // U0V0 U0V0(4) + "psrlw $8, %%mm2 \n\t" // Y0Y0 Y0Y0(0) + "psrlw $8, %%mm3 \n\t" // Y0Y0 Y0Y0(4) + "packuswb %%mm1, %%mm0 \n\t" // UVUV UVUV(0) + "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(0) + + MOVNTQ" %%mm2, (%1, %%"REG_a", 2) \n\t" + + "movq 16(%0, %%"REG_a", 4), %%mm1 \n\t" // UYVY UYVY(8) + "movq 24(%0, %%"REG_a", 4), %%mm2 \n\t" // UYVY UYVY(12) + "movq %%mm1, %%mm3 \n\t" // UYVY UYVY(8) + "movq %%mm2, %%mm4 \n\t" // UYVY UYVY(12) + "pand %%mm7, %%mm1 \n\t" // U0V0 U0V0(8) + "pand %%mm7, %%mm2 \n\t" // U0V0 U0V0(12) + "psrlw $8, %%mm3 \n\t" // Y0Y0 Y0Y0(8) + "psrlw $8, %%mm4 \n\t" // Y0Y0 Y0Y0(12) + "packuswb %%mm2, %%mm1 \n\t" // UVUV UVUV(8) + "packuswb %%mm4, %%mm3 \n\t" // YYYY YYYY(8) + + MOVNTQ" %%mm3, 8(%1, %%"REG_a", 2) \n\t" + + "movq %%mm0, %%mm2 \n\t" // UVUV UVUV(0) + "movq %%mm1, %%mm3 \n\t" // UVUV UVUV(8) + "psrlw $8, %%mm0 \n\t" // V0V0 V0V0(0) + "psrlw $8, %%mm1 \n\t" // V0V0 V0V0(8) + "pand %%mm7, %%mm2 \n\t" // U0U0 U0U0(0) + "pand %%mm7, %%mm3 \n\t" // U0U0 U0U0(8) + "packuswb %%mm1, %%mm0 \n\t" // VVVV VVVV(0) + "packuswb %%mm3, %%mm2 \n\t" // UUUU UUUU(0) + + MOVNTQ" %%mm0, (%3, %%"REG_a") \n\t" + MOVNTQ" %%mm2, (%2, %%"REG_a") \n\t" + + "add $8, %%"REG_a" \n\t" + "cmp %4, %%"REG_a" \n\t" + " jb 1b \n\t" + ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) + : "memory", "%"REG_a + ); + + ydst += lumStride; + src += srcStride; + + __asm__ volatile( + "xor %%"REG_a", %%"REG_a" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 64(%0, %%"REG_a", 4) \n\t" + "movq (%0, %%"REG_a", 4), %%mm0 \n\t" // YUYV YUYV(0) + "movq 8(%0, %%"REG_a", 4), %%mm1 \n\t" // YUYV YUYV(4) + "movq 16(%0, %%"REG_a", 4), %%mm2 \n\t" // YUYV YUYV(8) + "movq 24(%0, %%"REG_a", 4), %%mm3 \n\t" // YUYV YUYV(12) + "psrlw $8, %%mm0 \n\t" // Y0Y0 Y0Y0(0) + "psrlw $8, %%mm1 \n\t" // Y0Y0 Y0Y0(4) + "psrlw $8, %%mm2 \n\t" // Y0Y0 Y0Y0(8) + "psrlw $8, %%mm3 \n\t" // Y0Y0 Y0Y0(12) + "packuswb %%mm1, %%mm0 \n\t" // YYYY YYYY(0) + "packuswb %%mm3, %%mm2 \n\t" // YYYY YYYY(8) + + MOVNTQ" %%mm0, (%1, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm2, 8(%1, %%"REG_a", 2) \n\t" + + "add $8, %%"REG_a" \n\t" + "cmp %4, %%"REG_a" \n\t" + " jb 1b \n\t" + + ::"r"(src), "r"(ydst), "r"(udst), "r"(vdst), "g" (chromWidth) + : "memory", "%"REG_a + ); + udst += chromStride; + vdst += chromStride; + ydst += lumStride; + src += srcStride; + } + __asm__ volatile(EMMS" \n\t" + SFENCE" \n\t" + :::"memory"); +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + +/** + * Height should be a multiple of 2 and width should be a multiple of 2. + * (If this is a problem for anyone then tell me, and I will fix it.) + * Chrominance data is only taken from every second line, + * others are ignored in the C version. + * FIXME: Write HQ version. 
+ */ +static inline void RENAME(rgb24toyv12)(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const x86_reg chromWidth= width>>1; + for (y=0; y<height-2; y+=2) { + int i; + for (i=0; i<2; i++) { + __asm__ volatile( + "mov %2, %%"REG_a" \n\t" + "movq "MANGLE(ff_bgr2YCoeff)", %%mm6 \n\t" + "movq "MANGLE(ff_w1111)", %%mm5 \n\t" + "pxor %%mm7, %%mm7 \n\t" + "lea (%%"REG_a", %%"REG_a", 2), %%"REG_d" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 64(%0, %%"REG_d") \n\t" + "movd (%0, %%"REG_d"), %%mm0 \n\t" + "movd 3(%0, %%"REG_d"), %%mm1 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "movd 6(%0, %%"REG_d"), %%mm2 \n\t" + "movd 9(%0, %%"REG_d"), %%mm3 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "pmaddwd %%mm6, %%mm0 \n\t" + "pmaddwd %%mm6, %%mm1 \n\t" + "pmaddwd %%mm6, %%mm2 \n\t" + "pmaddwd %%mm6, %%mm3 \n\t" +#ifndef FAST_BGR2YV12 + "psrad $8, %%mm0 \n\t" + "psrad $8, %%mm1 \n\t" + "psrad $8, %%mm2 \n\t" + "psrad $8, %%mm3 \n\t" +#endif + "packssdw %%mm1, %%mm0 \n\t" + "packssdw %%mm3, %%mm2 \n\t" + "pmaddwd %%mm5, %%mm0 \n\t" + "pmaddwd %%mm5, %%mm2 \n\t" + "packssdw %%mm2, %%mm0 \n\t" + "psraw $7, %%mm0 \n\t" + + "movd 12(%0, %%"REG_d"), %%mm4 \n\t" + "movd 15(%0, %%"REG_d"), %%mm1 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "movd 18(%0, %%"REG_d"), %%mm2 \n\t" + "movd 21(%0, %%"REG_d"), %%mm3 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "pmaddwd %%mm6, %%mm4 \n\t" + "pmaddwd %%mm6, %%mm1 \n\t" + "pmaddwd %%mm6, %%mm2 \n\t" + "pmaddwd %%mm6, %%mm3 \n\t" +#ifndef FAST_BGR2YV12 + "psrad $8, %%mm4 \n\t" + "psrad $8, %%mm1 \n\t" + "psrad $8, %%mm2 \n\t" + "psrad $8, %%mm3 \n\t" +#endif + "packssdw %%mm1, %%mm4 \n\t" + "packssdw %%mm3, %%mm2 \n\t" + "pmaddwd %%mm5, %%mm4 \n\t" + "pmaddwd %%mm5, %%mm2 \n\t" + "add $24, %%"REG_d" \n\t" + "packssdw %%mm2, %%mm4 \n\t" + "psraw $7, %%mm4 \n\t" + + "packuswb %%mm4, %%mm0 \n\t" + "paddusb "MANGLE(ff_bgr2YOffset)", %%mm0 \n\t" + + MOVNTQ" %%mm0, (%1, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : : "r" (src+width*3), "r" (ydst+width), "g" ((x86_reg)-width) + : "%"REG_a, "%"REG_d + ); + ydst += lumStride; + src += srcStride; + } + src -= srcStride*2; + __asm__ volatile( + "mov %4, %%"REG_a" \n\t" + "movq "MANGLE(ff_w1111)", %%mm5 \n\t" + "movq "MANGLE(ff_bgr2UCoeff)", %%mm6 \n\t" + "pxor %%mm7, %%mm7 \n\t" + "lea (%%"REG_a", %%"REG_a", 2), %%"REG_d" \n\t" + "add %%"REG_d", %%"REG_d" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + PREFETCH" 64(%0, %%"REG_d") \n\t" + PREFETCH" 64(%1, %%"REG_d") \n\t" +#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW + "movq (%0, %%"REG_d"), %%mm0 \n\t" + "movq (%1, %%"REG_d"), %%mm1 \n\t" + "movq 6(%0, %%"REG_d"), %%mm2 \n\t" + "movq 6(%1, %%"REG_d"), %%mm3 \n\t" + PAVGB" %%mm1, %%mm0 \n\t" + PAVGB" %%mm3, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlq $24, %%mm0 \n\t" + "psrlq $24, %%mm2 \n\t" + PAVGB" %%mm1, %%mm0 \n\t" + PAVGB" %%mm3, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" +#else + "movd (%0, %%"REG_d"), %%mm0 \n\t" + "movd (%1, %%"REG_d"), %%mm1 \n\t" + "movd 3(%0, %%"REG_d"), %%mm2 \n\t" + "movd 3(%1, %%"REG_d"), %%mm3 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "paddw %%mm1, %%mm0 \n\t" + "paddw %%mm3, %%mm2 \n\t" + "paddw %%mm2, %%mm0 
\n\t" + "movd 6(%0, %%"REG_d"), %%mm4 \n\t" + "movd 6(%1, %%"REG_d"), %%mm1 \n\t" + "movd 9(%0, %%"REG_d"), %%mm2 \n\t" + "movd 9(%1, %%"REG_d"), %%mm3 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "paddw %%mm1, %%mm4 \n\t" + "paddw %%mm3, %%mm2 \n\t" + "paddw %%mm4, %%mm2 \n\t" + "psrlw $2, %%mm0 \n\t" + "psrlw $2, %%mm2 \n\t" +#endif + "movq "MANGLE(ff_bgr2VCoeff)", %%mm1 \n\t" + "movq "MANGLE(ff_bgr2VCoeff)", %%mm3 \n\t" + + "pmaddwd %%mm0, %%mm1 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + "pmaddwd %%mm6, %%mm0 \n\t" + "pmaddwd %%mm6, %%mm2 \n\t" +#ifndef FAST_BGR2YV12 + "psrad $8, %%mm0 \n\t" + "psrad $8, %%mm1 \n\t" + "psrad $8, %%mm2 \n\t" + "psrad $8, %%mm3 \n\t" +#endif + "packssdw %%mm2, %%mm0 \n\t" + "packssdw %%mm3, %%mm1 \n\t" + "pmaddwd %%mm5, %%mm0 \n\t" + "pmaddwd %%mm5, %%mm1 \n\t" + "packssdw %%mm1, %%mm0 \n\t" // V1 V0 U1 U0 + "psraw $7, %%mm0 \n\t" + +#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW + "movq 12(%0, %%"REG_d"), %%mm4 \n\t" + "movq 12(%1, %%"REG_d"), %%mm1 \n\t" + "movq 18(%0, %%"REG_d"), %%mm2 \n\t" + "movq 18(%1, %%"REG_d"), %%mm3 \n\t" + PAVGB" %%mm1, %%mm4 \n\t" + PAVGB" %%mm3, %%mm2 \n\t" + "movq %%mm4, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlq $24, %%mm4 \n\t" + "psrlq $24, %%mm2 \n\t" + PAVGB" %%mm1, %%mm4 \n\t" + PAVGB" %%mm3, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" +#else + "movd 12(%0, %%"REG_d"), %%mm4 \n\t" + "movd 12(%1, %%"REG_d"), %%mm1 \n\t" + "movd 15(%0, %%"REG_d"), %%mm2 \n\t" + "movd 15(%1, %%"REG_d"), %%mm3 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "paddw %%mm1, %%mm4 \n\t" + "paddw %%mm3, %%mm2 \n\t" + "paddw %%mm2, %%mm4 \n\t" + "movd 18(%0, %%"REG_d"), %%mm5 \n\t" + "movd 18(%1, %%"REG_d"), %%mm1 \n\t" + "movd 21(%0, %%"REG_d"), %%mm2 \n\t" + "movd 21(%1, %%"REG_d"), %%mm3 \n\t" + "punpcklbw %%mm7, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "paddw %%mm1, %%mm5 \n\t" + "paddw %%mm3, %%mm2 \n\t" + "paddw %%mm5, %%mm2 \n\t" + "movq "MANGLE(ff_w1111)", %%mm5 \n\t" + "psrlw $2, %%mm4 \n\t" + "psrlw $2, %%mm2 \n\t" +#endif + "movq "MANGLE(ff_bgr2VCoeff)", %%mm1 \n\t" + "movq "MANGLE(ff_bgr2VCoeff)", %%mm3 \n\t" + + "pmaddwd %%mm4, %%mm1 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + "pmaddwd %%mm6, %%mm4 \n\t" + "pmaddwd %%mm6, %%mm2 \n\t" +#ifndef FAST_BGR2YV12 + "psrad $8, %%mm4 \n\t" + "psrad $8, %%mm1 \n\t" + "psrad $8, %%mm2 \n\t" + "psrad $8, %%mm3 \n\t" +#endif + "packssdw %%mm2, %%mm4 \n\t" + "packssdw %%mm3, %%mm1 \n\t" + "pmaddwd %%mm5, %%mm4 \n\t" + "pmaddwd %%mm5, %%mm1 \n\t" + "add $24, %%"REG_d" \n\t" + "packssdw %%mm1, %%mm4 \n\t" // V3 V2 U3 U2 + "psraw $7, %%mm4 \n\t" + + "movq %%mm0, %%mm1 \n\t" + "punpckldq %%mm4, %%mm0 \n\t" + "punpckhdq %%mm4, %%mm1 \n\t" + "packsswb %%mm1, %%mm0 \n\t" + "paddb "MANGLE(ff_bgr2UVOffset)", %%mm0 \n\t" + "movd %%mm0, (%2, %%"REG_a") \n\t" + "punpckhdq %%mm0, %%mm0 \n\t" + "movd %%mm0, (%3, %%"REG_a") \n\t" + "add $4, %%"REG_a" \n\t" + " js 1b \n\t" + : : "r" (src+chromWidth*6), "r" (src+srcStride+chromWidth*6), "r" (udst+chromWidth), "r" (vdst+chromWidth), "g" (-chromWidth) + : "%"REG_a, "%"REG_d + ); + + udst += chromStride; + vdst += chromStride; + src += srcStride*2; + } + + __asm__ volatile(EMMS" \n\t" + SFENCE" \n\t" + :::"memory"); + + rgb24toyv12_c(src, ydst, udst, vdst, width, height-y, lumStride, 
chromStride, srcStride); +} +#endif /* !COMPILE_TEMPLATE_SSE2 */ + +#if !COMPILE_TEMPLATE_AMD3DNOW +static void RENAME(interleaveBytes)(const uint8_t *src1, const uint8_t *src2, uint8_t *dest, + int width, int height, int src1Stride, + int src2Stride, int dstStride) +{ + int h; + + for (h=0; h < height; h++) { + int w; + +#if COMPILE_TEMPLATE_SSE2 + __asm__( + "xor %%"REG_a", %%"REG_a" \n\t" + "1: \n\t" + PREFETCH" 64(%1, %%"REG_a") \n\t" + PREFETCH" 64(%2, %%"REG_a") \n\t" + "movdqa (%1, %%"REG_a"), %%xmm0 \n\t" + "movdqa (%1, %%"REG_a"), %%xmm1 \n\t" + "movdqa (%2, %%"REG_a"), %%xmm2 \n\t" + "punpcklbw %%xmm2, %%xmm0 \n\t" + "punpckhbw %%xmm2, %%xmm1 \n\t" + "movntdq %%xmm0, (%0, %%"REG_a", 2) \n\t" + "movntdq %%xmm1, 16(%0, %%"REG_a", 2) \n\t" + "add $16, %%"REG_a" \n\t" + "cmp %3, %%"REG_a" \n\t" + " jb 1b \n\t" + ::"r"(dest), "r"(src1), "r"(src2), "r" ((x86_reg)width-15) + : "memory", "%"REG_a"" + ); +#else + __asm__( + "xor %%"REG_a", %%"REG_a" \n\t" + "1: \n\t" + PREFETCH" 64(%1, %%"REG_a") \n\t" + PREFETCH" 64(%2, %%"REG_a") \n\t" + "movq (%1, %%"REG_a"), %%mm0 \n\t" + "movq 8(%1, %%"REG_a"), %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq (%2, %%"REG_a"), %%mm4 \n\t" + "movq 8(%2, %%"REG_a"), %%mm5 \n\t" + "punpcklbw %%mm4, %%mm0 \n\t" + "punpckhbw %%mm4, %%mm1 \n\t" + "punpcklbw %%mm5, %%mm2 \n\t" + "punpckhbw %%mm5, %%mm3 \n\t" + MOVNTQ" %%mm0, (%0, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm1, 8(%0, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm2, 16(%0, %%"REG_a", 2) \n\t" + MOVNTQ" %%mm3, 24(%0, %%"REG_a", 2) \n\t" + "add $16, %%"REG_a" \n\t" + "cmp %3, %%"REG_a" \n\t" + " jb 1b \n\t" + ::"r"(dest), "r"(src1), "r"(src2), "r" ((x86_reg)width-15) + : "memory", "%"REG_a + ); +#endif + for (w= (width&(~15)); w < width; w++) { + dest[2*w+0] = src1[w]; + dest[2*w+1] = src2[w]; + } + dest += dstStride; + src1 += src1Stride; + src2 += src2Stride; + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + +#if !COMPILE_TEMPLATE_SSE2 +#if !COMPILE_TEMPLATE_AMD3DNOW +static inline void RENAME(vu9_to_vu12)(const uint8_t *src1, const uint8_t *src2, + uint8_t *dst1, uint8_t *dst2, + int width, int height, + int srcStride1, int srcStride2, + int dstStride1, int dstStride2) +{ + x86_reg y; + int x,w,h; + w=width/2; h=height/2; + __asm__ volatile( + PREFETCH" %0 \n\t" + PREFETCH" %1 \n\t" + ::"m"(*(src1+srcStride1)),"m"(*(src2+srcStride2)):"memory"); + for (y=0;y<h;y++) { + const uint8_t* s1=src1+srcStride1*(y>>1); + uint8_t* d=dst1+dstStride1*y; + x=0; + for (;x<w-31;x+=32) { + __asm__ volatile( + PREFETCH" 32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq 8%1, %%mm2 \n\t" + "movq 16%1, %%mm4 \n\t" + "movq 24%1, %%mm6 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq %%mm4, %%mm5 \n\t" + "movq %%mm6, %%mm7 \n\t" + "punpcklbw %%mm0, %%mm0 \n\t" + "punpckhbw %%mm1, %%mm1 \n\t" + "punpcklbw %%mm2, %%mm2 \n\t" + "punpckhbw %%mm3, %%mm3 \n\t" + "punpcklbw %%mm4, %%mm4 \n\t" + "punpckhbw %%mm5, %%mm5 \n\t" + "punpcklbw %%mm6, %%mm6 \n\t" + "punpckhbw %%mm7, %%mm7 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + MOVNTQ" %%mm1, 8%0 \n\t" + MOVNTQ" %%mm2, 16%0 \n\t" + MOVNTQ" %%mm3, 24%0 \n\t" + MOVNTQ" %%mm4, 32%0 \n\t" + MOVNTQ" %%mm5, 40%0 \n\t" + MOVNTQ" %%mm6, 48%0 \n\t" + MOVNTQ" %%mm7, 56%0" + :"=m"(d[2*x]) + :"m"(s1[x]) + :"memory"); + } + for (;x<w;x++) d[2*x]=d[2*x+1]=s1[x]; + } + for (y=0;y<h;y++) { + const uint8_t* s2=src2+srcStride2*(y>>1); + uint8_t* d=dst2+dstStride2*y; + x=0; + for (;x<w-31;x+=32) { + __asm__ volatile( + PREFETCH" 
32%1 \n\t" + "movq %1, %%mm0 \n\t" + "movq 8%1, %%mm2 \n\t" + "movq 16%1, %%mm4 \n\t" + "movq 24%1, %%mm6 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq %%mm4, %%mm5 \n\t" + "movq %%mm6, %%mm7 \n\t" + "punpcklbw %%mm0, %%mm0 \n\t" + "punpckhbw %%mm1, %%mm1 \n\t" + "punpcklbw %%mm2, %%mm2 \n\t" + "punpckhbw %%mm3, %%mm3 \n\t" + "punpcklbw %%mm4, %%mm4 \n\t" + "punpckhbw %%mm5, %%mm5 \n\t" + "punpcklbw %%mm6, %%mm6 \n\t" + "punpckhbw %%mm7, %%mm7 \n\t" + MOVNTQ" %%mm0, %0 \n\t" + MOVNTQ" %%mm1, 8%0 \n\t" + MOVNTQ" %%mm2, 16%0 \n\t" + MOVNTQ" %%mm3, 24%0 \n\t" + MOVNTQ" %%mm4, 32%0 \n\t" + MOVNTQ" %%mm5, 40%0 \n\t" + MOVNTQ" %%mm6, 48%0 \n\t" + MOVNTQ" %%mm7, 56%0" + :"=m"(d[2*x]) + :"m"(s2[x]) + :"memory"); + } + for (;x<w;x++) d[2*x]=d[2*x+1]=s2[x]; + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} + +static inline void RENAME(yvu9_to_yuy2)(const uint8_t *src1, const uint8_t *src2, const uint8_t *src3, + uint8_t *dst, + int width, int height, + int srcStride1, int srcStride2, + int srcStride3, int dstStride) +{ + x86_reg x; + int y,w,h; + w=width/2; h=height; + for (y=0;y<h;y++) { + const uint8_t* yp=src1+srcStride1*y; + const uint8_t* up=src2+srcStride2*(y>>2); + const uint8_t* vp=src3+srcStride3*(y>>2); + uint8_t* d=dst+dstStride*y; + x=0; + for (;x<w-7;x+=8) { + __asm__ volatile( + PREFETCH" 32(%1, %0) \n\t" + PREFETCH" 32(%2, %0) \n\t" + PREFETCH" 32(%3, %0) \n\t" + "movq (%1, %0, 4), %%mm0 \n\t" /* Y0Y1Y2Y3Y4Y5Y6Y7 */ + "movq (%2, %0), %%mm1 \n\t" /* U0U1U2U3U4U5U6U7 */ + "movq (%3, %0), %%mm2 \n\t" /* V0V1V2V3V4V5V6V7 */ + "movq %%mm0, %%mm3 \n\t" /* Y0Y1Y2Y3Y4Y5Y6Y7 */ + "movq %%mm1, %%mm4 \n\t" /* U0U1U2U3U4U5U6U7 */ + "movq %%mm2, %%mm5 \n\t" /* V0V1V2V3V4V5V6V7 */ + "punpcklbw %%mm1, %%mm1 \n\t" /* U0U0 U1U1 U2U2 U3U3 */ + "punpcklbw %%mm2, %%mm2 \n\t" /* V0V0 V1V1 V2V2 V3V3 */ + "punpckhbw %%mm4, %%mm4 \n\t" /* U4U4 U5U5 U6U6 U7U7 */ + "punpckhbw %%mm5, %%mm5 \n\t" /* V4V4 V5V5 V6V6 V7V7 */ + + "movq %%mm1, %%mm6 \n\t" + "punpcklbw %%mm2, %%mm1 \n\t" /* U0V0 U0V0 U1V1 U1V1*/ + "punpcklbw %%mm1, %%mm0 \n\t" /* Y0U0 Y1V0 Y2U0 Y3V0*/ + "punpckhbw %%mm1, %%mm3 \n\t" /* Y4U1 Y5V1 Y6U1 Y7V1*/ + MOVNTQ" %%mm0, (%4, %0, 8) \n\t" + MOVNTQ" %%mm3, 8(%4, %0, 8) \n\t" + + "punpckhbw %%mm2, %%mm6 \n\t" /* U2V2 U2V2 U3V3 U3V3*/ + "movq 8(%1, %0, 4), %%mm0 \n\t" + "movq %%mm0, %%mm3 \n\t" + "punpcklbw %%mm6, %%mm0 \n\t" /* Y U2 Y V2 Y U2 Y V2*/ + "punpckhbw %%mm6, %%mm3 \n\t" /* Y U3 Y V3 Y U3 Y V3*/ + MOVNTQ" %%mm0, 16(%4, %0, 8) \n\t" + MOVNTQ" %%mm3, 24(%4, %0, 8) \n\t" + + "movq %%mm4, %%mm6 \n\t" + "movq 16(%1, %0, 4), %%mm0 \n\t" + "movq %%mm0, %%mm3 \n\t" + "punpcklbw %%mm5, %%mm4 \n\t" + "punpcklbw %%mm4, %%mm0 \n\t" /* Y U4 Y V4 Y U4 Y V4*/ + "punpckhbw %%mm4, %%mm3 \n\t" /* Y U5 Y V5 Y U5 Y V5*/ + MOVNTQ" %%mm0, 32(%4, %0, 8) \n\t" + MOVNTQ" %%mm3, 40(%4, %0, 8) \n\t" + + "punpckhbw %%mm5, %%mm6 \n\t" + "movq 24(%1, %0, 4), %%mm0 \n\t" + "movq %%mm0, %%mm3 \n\t" + "punpcklbw %%mm6, %%mm0 \n\t" /* Y U6 Y V6 Y U6 Y V6*/ + "punpckhbw %%mm6, %%mm3 \n\t" /* Y U7 Y V7 Y U7 Y V7*/ + MOVNTQ" %%mm0, 48(%4, %0, 8) \n\t" + MOVNTQ" %%mm3, 56(%4, %0, 8) \n\t" + + : "+r" (x) + : "r"(yp), "r" (up), "r"(vp), "r"(d) + :"memory"); + } + for (; x<w; x++) { + const int x2 = x<<2; + d[8*x+0] = yp[x2]; + d[8*x+1] = up[x]; + d[8*x+2] = yp[x2+1]; + d[8*x+3] = vp[x]; + d[8*x+4] = yp[x2+2]; + d[8*x+5] = up[x]; + d[8*x+6] = yp[x2+3]; + d[8*x+7] = vp[x]; + } + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + 
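The extract_even/extract_odd2 helpers that follow deinterleave packed 4:2:2 rows (YUYV or UYVY) into separate Y, U and V planes; the *avg variants additionally average two source rows for 4:2:0 output. As a plain-C reference, the two simplest helpers are equivalent to the scalar tail loops already present in the code below (the *_ref names are illustrative only, not part of the patch):

static void extract_even_ref(const uint8_t *src, uint8_t *dst, int count)
{
    int i;
    for (i = 0; i < count; i++)
        dst[i] = src[2 * i];        /* every second byte starting at src[0], e.g. the Y samples of YUYV */
}

static void extract_even2_ref(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, int count)
{
    int i;
    for (i = 0; i < count; i++) {
        dst0[i] = src[4 * i];       /* byte 0 of each 4-byte group: U in UYVY */
        dst1[i] = src[4 * i + 2];   /* byte 2 of each 4-byte group: V in UYVY */
    }
}

The MMX versions compute the same result 8 or 16 pixels at a time, masking with pand/psrlw, packing with packuswb and writing through non-temporal MOVNTQ stores.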
+static void RENAME(extract_even)(const uint8_t *src, uint8_t *dst, x86_reg count) +{ + dst += count; + src += 2*count; + count= - count; + + if(count <= -16) { + count += 15; + __asm__ volatile( + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" + "1: \n\t" + "movq -30(%1, %0, 2), %%mm0 \n\t" + "movq -22(%1, %0, 2), %%mm1 \n\t" + "movq -14(%1, %0, 2), %%mm2 \n\t" + "movq -6(%1, %0, 2), %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + MOVNTQ" %%mm0,-15(%2, %0) \n\t" + MOVNTQ" %%mm2,- 7(%2, %0) \n\t" + "add $16, %0 \n\t" + " js 1b \n\t" + : "+r"(count) + : "r"(src), "r"(dst) + ); + count -= 15; + } + while(count<0) { + dst[count]= src[2*count]; + count++; + } +} + +#if !COMPILE_TEMPLATE_AMD3DNOW +static void RENAME(extract_even2)(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, x86_reg count) +{ + dst0+= count; + dst1+= count; + src += 4*count; + count= - count; + if(count <= -8) { + count += 7; + __asm__ volatile( + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" + "1: \n\t" + "movq -28(%1, %0, 4), %%mm0 \n\t" + "movq -20(%1, %0, 4), %%mm1 \n\t" + "movq -12(%1, %0, 4), %%mm2 \n\t" + "movq -4(%1, %0, 4), %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm2 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm2, %%mm0 \n\t" + "packuswb %%mm3, %%mm1 \n\t" + MOVNTQ" %%mm0,- 7(%3, %0) \n\t" + MOVNTQ" %%mm1,- 7(%2, %0) \n\t" + "add $8, %0 \n\t" + " js 1b \n\t" + : "+r"(count) + : "r"(src), "r"(dst0), "r"(dst1) + ); + count -= 7; + } + while(count<0) { + dst0[count]= src[4*count+0]; + dst1[count]= src[4*count+2]; + count++; + } +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + +static void RENAME(extract_even2avg)(const uint8_t *src0, const uint8_t *src1, uint8_t *dst0, uint8_t *dst1, x86_reg count) +{ + dst0 += count; + dst1 += count; + src0 += 4*count; + src1 += 4*count; + count= - count; +#ifdef PAVGB + if(count <= -8) { + count += 7; + __asm__ volatile( + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" + "1: \n\t" + "movq -28(%1, %0, 4), %%mm0 \n\t" + "movq -20(%1, %0, 4), %%mm1 \n\t" + "movq -12(%1, %0, 4), %%mm2 \n\t" + "movq -4(%1, %0, 4), %%mm3 \n\t" + PAVGB" -28(%2, %0, 4), %%mm0 \n\t" + PAVGB" -20(%2, %0, 4), %%mm1 \n\t" + PAVGB" -12(%2, %0, 4), %%mm2 \n\t" + PAVGB" - 4(%2, %0, 4), %%mm3 \n\t" + "pand %%mm7, %%mm0 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm2 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm2 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm2, %%mm0 \n\t" + "packuswb %%mm3, %%mm1 \n\t" + MOVNTQ" %%mm0,- 7(%4, %0) \n\t" + MOVNTQ" %%mm1,- 7(%3, %0) \n\t" + "add $8, %0 \n\t" + " js 1b \n\t" + : "+r"(count) + : "r"(src0), "r"(src1), "r"(dst0), "r"(dst1) + ); + count -= 7; + } +#endif + while(count<0) { + dst0[count]= (src0[4*count+0]+src1[4*count+0])>>1; + dst1[count]= (src0[4*count+2]+src1[4*count+2])>>1; + count++; + } +} + +#if !COMPILE_TEMPLATE_AMD3DNOW +static void RENAME(extract_odd2)(const uint8_t *src, uint8_t *dst0, uint8_t *dst1, x86_reg count) +{ + dst0+= count; + dst1+= count; + src += 4*count; + 
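    /* Same indexing idiom as the other extract_* helpers above: the
     * destination and source pointers are advanced past the end and the
     * count is negated below, so both the MMX loop ("add $8, %0; js 1b")
     * and the scalar tail walk a single negative index up towards zero. */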
count= - count; + if(count <= -8) { + count += 7; + __asm__ volatile( + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" + "1: \n\t" + "movq -28(%1, %0, 4), %%mm0 \n\t" + "movq -20(%1, %0, 4), %%mm1 \n\t" + "movq -12(%1, %0, 4), %%mm2 \n\t" + "movq -4(%1, %0, 4), %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm1 \n\t" + "psrlw $8, %%mm2 \n\t" + "psrlw $8, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm2 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm2, %%mm0 \n\t" + "packuswb %%mm3, %%mm1 \n\t" + MOVNTQ" %%mm0,- 7(%3, %0) \n\t" + MOVNTQ" %%mm1,- 7(%2, %0) \n\t" + "add $8, %0 \n\t" + " js 1b \n\t" + : "+r"(count) + : "r"(src), "r"(dst0), "r"(dst1) + ); + count -= 7; + } + src++; + while(count<0) { + dst0[count]= src[4*count+0]; + dst1[count]= src[4*count+2]; + count++; + } +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + +static void RENAME(extract_odd2avg)(const uint8_t *src0, const uint8_t *src1, uint8_t *dst0, uint8_t *dst1, x86_reg count) +{ + dst0 += count; + dst1 += count; + src0 += 4*count; + src1 += 4*count; + count= - count; +#ifdef PAVGB + if(count <= -8) { + count += 7; + __asm__ volatile( + "pcmpeqw %%mm7, %%mm7 \n\t" + "psrlw $8, %%mm7 \n\t" + "1: \n\t" + "movq -28(%1, %0, 4), %%mm0 \n\t" + "movq -20(%1, %0, 4), %%mm1 \n\t" + "movq -12(%1, %0, 4), %%mm2 \n\t" + "movq -4(%1, %0, 4), %%mm3 \n\t" + PAVGB" -28(%2, %0, 4), %%mm0 \n\t" + PAVGB" -20(%2, %0, 4), %%mm1 \n\t" + PAVGB" -12(%2, %0, 4), %%mm2 \n\t" + PAVGB" - 4(%2, %0, 4), %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm1 \n\t" + "psrlw $8, %%mm2 \n\t" + "psrlw $8, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm2 \n\t" + "pand %%mm7, %%mm1 \n\t" + "pand %%mm7, %%mm3 \n\t" + "packuswb %%mm2, %%mm0 \n\t" + "packuswb %%mm3, %%mm1 \n\t" + MOVNTQ" %%mm0,- 7(%4, %0) \n\t" + MOVNTQ" %%mm1,- 7(%3, %0) \n\t" + "add $8, %0 \n\t" + " js 1b \n\t" + : "+r"(count) + : "r"(src0), "r"(src1), "r"(dst0), "r"(dst1) + ); + count -= 7; + } +#endif + src0++; + src1++; + while(count<0) { + dst0[count]= (src0[4*count+0]+src1[4*count+0])>>1; + dst1[count]= (src0[4*count+2]+src1[4*count+2])>>1; + count++; + } +} + +static void RENAME(yuyvtoyuv420)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const int chromWidth= -((-width)>>1); + + for (y=0; y<height; y++) { + RENAME(extract_even)(src, ydst, width); + if(y&1) { + RENAME(extract_odd2avg)(src-srcStride, src, udst, vdst, chromWidth); + udst+= chromStride; + vdst+= chromStride; + } + + src += srcStride; + ydst+= lumStride; + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} + +#if !COMPILE_TEMPLATE_AMD3DNOW +static void RENAME(yuyvtoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const int chromWidth= -((-width)>>1); + + for (y=0; y<height; y++) { + RENAME(extract_even)(src, ydst, width); + RENAME(extract_odd2)(src, udst, vdst, chromWidth); + + src += srcStride; + ydst+= lumStride; + udst+= chromStride; + vdst+= chromStride; + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ + +static void RENAME(uyvytoyuv420)(uint8_t *ydst, 
uint8_t *udst, uint8_t *vdst, const uint8_t *src, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const int chromWidth= -((-width)>>1); + + for (y=0; y<height; y++) { + RENAME(extract_even)(src+1, ydst, width); + if(y&1) { + RENAME(extract_even2avg)(src-srcStride, src, udst, vdst, chromWidth); + udst+= chromStride; + vdst+= chromStride; + } + + src += srcStride; + ydst+= lumStride; + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} + +#if !COMPILE_TEMPLATE_AMD3DNOW +static void RENAME(uyvytoyuv422)(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, const uint8_t *src, + int width, int height, + int lumStride, int chromStride, int srcStride) +{ + int y; + const int chromWidth= -((-width)>>1); + + for (y=0; y<height; y++) { + RENAME(extract_even)(src+1, ydst, width); + RENAME(extract_even2)(src, udst, vdst, chromWidth); + + src += srcStride; + ydst+= lumStride; + udst+= chromStride; + vdst+= chromStride; + } + __asm__( + EMMS" \n\t" + SFENCE" \n\t" + ::: "memory" + ); +} +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ +#endif /* !COMPILE_TEMPLATE_SSE2 */ + +static inline void RENAME(rgb2rgb_init)(void) +{ +#if !COMPILE_TEMPLATE_SSE2 +#if !COMPILE_TEMPLATE_AMD3DNOW + rgb15to16 = RENAME(rgb15to16); + rgb15tobgr24 = RENAME(rgb15tobgr24); + rgb15to32 = RENAME(rgb15to32); + rgb16tobgr24 = RENAME(rgb16tobgr24); + rgb16to32 = RENAME(rgb16to32); + rgb16to15 = RENAME(rgb16to15); + rgb24tobgr16 = RENAME(rgb24tobgr16); + rgb24tobgr15 = RENAME(rgb24tobgr15); + rgb24tobgr32 = RENAME(rgb24tobgr32); + rgb32to16 = RENAME(rgb32to16); + rgb32to15 = RENAME(rgb32to15); + rgb32tobgr24 = RENAME(rgb32tobgr24); + rgb24to15 = RENAME(rgb24to15); + rgb24to16 = RENAME(rgb24to16); + rgb24tobgr24 = RENAME(rgb24tobgr24); + shuffle_bytes_2103 = RENAME(shuffle_bytes_2103); + rgb32tobgr16 = RENAME(rgb32tobgr16); + rgb32tobgr15 = RENAME(rgb32tobgr15); + yv12toyuy2 = RENAME(yv12toyuy2); + yv12touyvy = RENAME(yv12touyvy); + yuv422ptoyuy2 = RENAME(yuv422ptoyuy2); + yuv422ptouyvy = RENAME(yuv422ptouyvy); + yuy2toyv12 = RENAME(yuy2toyv12); + vu9_to_vu12 = RENAME(vu9_to_vu12); + yvu9_to_yuy2 = RENAME(yvu9_to_yuy2); + uyvytoyuv422 = RENAME(uyvytoyuv422); + yuyvtoyuv422 = RENAME(yuyvtoyuv422); +#endif /* !COMPILE_TEMPLATE_SSE2 */ + +#if COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW + planar2x = RENAME(planar2x); +#endif /* COMPILE_TEMPLATE_MMX2 || COMPILE_TEMPLATE_AMD3DNOW */ + rgb24toyv12 = RENAME(rgb24toyv12); + + yuyvtoyuv420 = RENAME(yuyvtoyuv420); + uyvytoyuv420 = RENAME(uyvytoyuv420); +#endif /* COMPILE_TEMPLATE_SSE2 */ + +#if !COMPILE_TEMPLATE_AMD3DNOW + interleaveBytes = RENAME(interleaveBytes); +#endif /* !COMPILE_TEMPLATE_AMD3DNOW */ +} diff --git a/libswscale/x86/swscale_mmx.c b/libswscale/x86/swscale_mmx.c new file mode 100644 index 0000000000..775d5f683d --- /dev/null +++ b/libswscale/x86/swscale_mmx.c @@ -0,0 +1,189 @@ +/* + * Copyright (C) 2001-2003 Michael Niedermayer <michaelni@gmx.at> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <inttypes.h> +#include "config.h" +#include "libswscale/swscale.h" +#include "libswscale/swscale_internal.h" +#include "libavutil/intreadwrite.h" +#include "libavutil/x86_cpu.h" +#include "libavutil/cpu.h" +#include "libavutil/pixdesc.h" + +DECLARE_ASM_CONST(8, uint64_t, bF8)= 0xF8F8F8F8F8F8F8F8LL; +DECLARE_ASM_CONST(8, uint64_t, bFC)= 0xFCFCFCFCFCFCFCFCLL; +DECLARE_ASM_CONST(8, uint64_t, w10)= 0x0010001000100010LL; +DECLARE_ASM_CONST(8, uint64_t, w02)= 0x0002000200020002LL; +DECLARE_ASM_CONST(8, uint64_t, bm00001111)=0x00000000FFFFFFFFLL; +DECLARE_ASM_CONST(8, uint64_t, bm00000111)=0x0000000000FFFFFFLL; +DECLARE_ASM_CONST(8, uint64_t, bm11111000)=0xFFFFFFFFFF000000LL; +DECLARE_ASM_CONST(8, uint64_t, bm01010101)=0x00FF00FF00FF00FFLL; + +const DECLARE_ALIGNED(8, uint64_t, ff_dither4)[2] = { + 0x0103010301030103LL, + 0x0200020002000200LL,}; + +const DECLARE_ALIGNED(8, uint64_t, ff_dither8)[2] = { + 0x0602060206020602LL, + 0x0004000400040004LL,}; + +DECLARE_ASM_CONST(8, uint64_t, b16Mask)= 0x001F001F001F001FLL; +DECLARE_ASM_CONST(8, uint64_t, g16Mask)= 0x07E007E007E007E0LL; +DECLARE_ASM_CONST(8, uint64_t, r16Mask)= 0xF800F800F800F800LL; +DECLARE_ASM_CONST(8, uint64_t, b15Mask)= 0x001F001F001F001FLL; +DECLARE_ASM_CONST(8, uint64_t, g15Mask)= 0x03E003E003E003E0LL; +DECLARE_ASM_CONST(8, uint64_t, r15Mask)= 0x7C007C007C007C00LL; + +DECLARE_ALIGNED(8, const uint64_t, ff_M24A) = 0x00FF0000FF0000FFLL; +DECLARE_ALIGNED(8, const uint64_t, ff_M24B) = 0xFF0000FF0000FF00LL; +DECLARE_ALIGNED(8, const uint64_t, ff_M24C) = 0x0000FF0000FF0000LL; + +#ifdef FAST_BGR2YV12 +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2YCoeff) = 0x000000210041000DULL; +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2UCoeff) = 0x0000FFEEFFDC0038ULL; +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2VCoeff) = 0x00000038FFD2FFF8ULL; +#else +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2YCoeff) = 0x000020E540830C8BULL; +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2UCoeff) = 0x0000ED0FDAC23831ULL; +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2VCoeff) = 0x00003831D0E6F6EAULL; +#endif /* FAST_BGR2YV12 */ +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2YOffset) = 0x1010101010101010ULL; +DECLARE_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 0x8080808080808080ULL; +DECLARE_ALIGNED(8, const uint64_t, ff_w1111) = 0x0001000100010001ULL; + +DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toY1Coeff) = 0x0C88000040870C88ULL; +DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toY2Coeff) = 0x20DE4087000020DEULL; +DECLARE_ASM_CONST(8, uint64_t, ff_rgb24toY1Coeff) = 0x20DE0000408720DEULL; +DECLARE_ASM_CONST(8, uint64_t, ff_rgb24toY2Coeff) = 0x0C88408700000C88ULL; +DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toYOffset) = 0x0008010000080100ULL; + +DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toUV)[2][4] = { + {0x38380000DAC83838ULL, 0xECFFDAC80000ECFFULL, 0xF6E40000D0E3F6E4ULL, 0x3838D0E300003838ULL}, + {0xECFF0000DAC8ECFFULL, 0x3838DAC800003838ULL, 0x38380000D0E33838ULL, 0xF6E4D0E30000F6E4ULL}, +}; + +DECLARE_ASM_CONST(8, uint64_t, ff_bgr24toUVOffset)= 0x0040010000400100ULL; + +//MMX versions +#if HAVE_MMX +#undef RENAME +#define COMPILE_TEMPLATE_MMX2 0 +#define RENAME(a) a ## _MMX +#include "swscale_template.c" +#endif + +//MMX2 versions +#if HAVE_MMX2 +#undef RENAME +#undef COMPILE_TEMPLATE_MMX2 +#define COMPILE_TEMPLATE_MMX2 1 +#define RENAME(a) a ## _MMX2 +#include 
"swscale_template.c" +#endif + +void updateMMXDitherTables(SwsContext *c, int dstY, int lumBufIndex, int chrBufIndex, + int lastInLumBuf, int lastInChrBuf) +{ + const int dstH= c->dstH; + const int flags= c->flags; + int16_t **lumPixBuf= c->lumPixBuf; + int16_t **chrUPixBuf= c->chrUPixBuf; + int16_t **alpPixBuf= c->alpPixBuf; + const int vLumBufSize= c->vLumBufSize; + const int vChrBufSize= c->vChrBufSize; + int16_t *vLumFilterPos= c->vLumFilterPos; + int16_t *vChrFilterPos= c->vChrFilterPos; + int16_t *vLumFilter= c->vLumFilter; + int16_t *vChrFilter= c->vChrFilter; + int32_t *lumMmxFilter= c->lumMmxFilter; + int32_t *chrMmxFilter= c->chrMmxFilter; + int32_t av_unused *alpMmxFilter= c->alpMmxFilter; + const int vLumFilterSize= c->vLumFilterSize; + const int vChrFilterSize= c->vChrFilterSize; + const int chrDstY= dstY>>c->chrDstVSubSample; + const int firstLumSrcY= vLumFilterPos[dstY]; //First line needed as input + const int firstChrSrcY= vChrFilterPos[chrDstY]; //First line needed as input + + c->blueDither= ff_dither8[dstY&1]; + if (c->dstFormat == PIX_FMT_RGB555 || c->dstFormat == PIX_FMT_BGR555) + c->greenDither= ff_dither8[dstY&1]; + else + c->greenDither= ff_dither4[dstY&1]; + c->redDither= ff_dither8[(dstY+1)&1]; + if (dstY < dstH - 2) { + const int16_t **lumSrcPtr= (const int16_t **) lumPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize; + const int16_t **chrUSrcPtr= (const int16_t **) chrUPixBuf + chrBufIndex + firstChrSrcY - lastInChrBuf + vChrBufSize; + const int16_t **alpSrcPtr= (CONFIG_SWSCALE_ALPHA && alpPixBuf) ? (const int16_t **) alpPixBuf + lumBufIndex + firstLumSrcY - lastInLumBuf + vLumBufSize : NULL; + int i; + if (flags & SWS_ACCURATE_RND) { + int s= APCK_SIZE / 8; + for (i=0; i<vLumFilterSize; i+=2) { + *(const void**)&lumMmxFilter[s*i ]= lumSrcPtr[i ]; + *(const void**)&lumMmxFilter[s*i+APCK_PTR2/4 ]= lumSrcPtr[i+(vLumFilterSize>1)]; + lumMmxFilter[s*i+APCK_COEF/4 ]= + lumMmxFilter[s*i+APCK_COEF/4+1]= vLumFilter[dstY*vLumFilterSize + i ] + + (vLumFilterSize>1 ? vLumFilter[dstY*vLumFilterSize + i + 1]<<16 : 0); + if (CONFIG_SWSCALE_ALPHA && alpPixBuf) { + *(const void**)&alpMmxFilter[s*i ]= alpSrcPtr[i ]; + *(const void**)&alpMmxFilter[s*i+APCK_PTR2/4 ]= alpSrcPtr[i+(vLumFilterSize>1)]; + alpMmxFilter[s*i+APCK_COEF/4 ]= + alpMmxFilter[s*i+APCK_COEF/4+1]= lumMmxFilter[s*i+APCK_COEF/4 ]; + } + } + for (i=0; i<vChrFilterSize; i+=2) { + *(const void**)&chrMmxFilter[s*i ]= chrUSrcPtr[i ]; + *(const void**)&chrMmxFilter[s*i+APCK_PTR2/4 ]= chrUSrcPtr[i+(vChrFilterSize>1)]; + chrMmxFilter[s*i+APCK_COEF/4 ]= + chrMmxFilter[s*i+APCK_COEF/4+1]= vChrFilter[chrDstY*vChrFilterSize + i ] + + (vChrFilterSize>1 ? 
vChrFilter[chrDstY*vChrFilterSize + i + 1]<<16 : 0); + } + } else { + for (i=0; i<vLumFilterSize; i++) { + *(const void**)&lumMmxFilter[4*i+0]= lumSrcPtr[i]; + lumMmxFilter[4*i+2]= + lumMmxFilter[4*i+3]= + ((uint16_t)vLumFilter[dstY*vLumFilterSize + i])*0x10001; + if (CONFIG_SWSCALE_ALPHA && alpPixBuf) { + *(const void**)&alpMmxFilter[4*i+0]= alpSrcPtr[i]; + alpMmxFilter[4*i+2]= + alpMmxFilter[4*i+3]= lumMmxFilter[4*i+2]; + } + } + for (i=0; i<vChrFilterSize; i++) { + *(const void**)&chrMmxFilter[4*i+0]= chrUSrcPtr[i]; + chrMmxFilter[4*i+2]= + chrMmxFilter[4*i+3]= + ((uint16_t)vChrFilter[chrDstY*vChrFilterSize + i])*0x10001; + } + } + } +} + +void ff_sws_init_swScale_mmx(SwsContext *c) +{ + int cpu_flags = av_get_cpu_flags(); + + if (cpu_flags & AV_CPU_FLAG_MMX) + sws_init_swScale_MMX(c); +#if HAVE_MMX2 + if (cpu_flags & AV_CPU_FLAG_MMX2) + sws_init_swScale_MMX2(c); +#endif +} diff --git a/libswscale/x86/swscale_template.c b/libswscale/x86/swscale_template.c new file mode 100644 index 0000000000..25399fadef --- /dev/null +++ b/libswscale/x86/swscale_template.c @@ -0,0 +1,2493 @@ +/* + * Copyright (C) 2001-2003 Michael Niedermayer <michaelni@gmx.at> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#undef REAL_MOVNTQ +#undef MOVNTQ +#undef PREFETCH + +#if COMPILE_TEMPLATE_MMX2 +#define PREFETCH "prefetchnta" +#else +#define PREFETCH " # nop" +#endif + +#if COMPILE_TEMPLATE_MMX2 +#define REAL_MOVNTQ(a,b) "movntq " #a ", " #b " \n\t" +#else +#define REAL_MOVNTQ(a,b) "movq " #a ", " #b " \n\t" +#endif +#define MOVNTQ(a,b) REAL_MOVNTQ(a,b) + +#define YSCALEYUV2YV12X(offset, dest, end, pos) \ + __asm__ volatile(\ + "movq "DITHER16"+0(%0), %%mm3 \n\t"\ + "movq "DITHER16"+8(%0), %%mm4 \n\t"\ + "lea " offset "(%0), %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + ".p2align 4 \n\t" /* FIXME Unroll? 
*/\ + "1: \n\t"\ + "movq 8(%%"REG_d"), %%mm0 \n\t" /* filterCoeff */\ + "movq (%%"REG_S", %3, 2), %%mm2 \n\t" /* srcData */\ + "movq 8(%%"REG_S", %3, 2), %%mm5 \n\t" /* srcData */\ + "add $16, %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "test %%"REG_S", %%"REG_S" \n\t"\ + "pmulhw %%mm0, %%mm2 \n\t"\ + "pmulhw %%mm0, %%mm5 \n\t"\ + "paddw %%mm2, %%mm3 \n\t"\ + "paddw %%mm5, %%mm4 \n\t"\ + " jnz 1b \n\t"\ + "psraw $3, %%mm3 \n\t"\ + "psraw $3, %%mm4 \n\t"\ + "packuswb %%mm4, %%mm3 \n\t"\ + MOVNTQ(%%mm3, (%1, %3))\ + "add $8, %3 \n\t"\ + "cmp %2, %3 \n\t"\ + "movq "DITHER16"+0(%0), %%mm3 \n\t"\ + "movq "DITHER16"+8(%0), %%mm4 \n\t"\ + "lea " offset "(%0), %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "jb 1b \n\t"\ + :: "r" (&c->redDither),\ + "r" (dest), "g" ((x86_reg)(end)), "r"((x86_reg)(pos))\ + : "%"REG_d, "%"REG_S\ + ); + +static void RENAME(yuv2yuvX)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, + const uint8_t *lumDither, const uint8_t *chrDither) +{ + int i; + if (uDest) { + x86_reg uv_off = c->uv_off; + for(i=0; i<8; i++) c->dither16[i] = chrDither[i]>>4; + YSCALEYUV2YV12X(CHR_MMX_FILTER_OFFSET, uDest, chrDstW, 0) + for(i=0; i<8; i++) c->dither16[i] = chrDither[(i+3)&7]>>4; + YSCALEYUV2YV12X(CHR_MMX_FILTER_OFFSET, vDest - uv_off, chrDstW + uv_off, uv_off) + } + for(i=0; i<8; i++) c->dither16[i] = lumDither[i]>>4; + if (CONFIG_SWSCALE_ALPHA && aDest) { + YSCALEYUV2YV12X(ALP_MMX_FILTER_OFFSET, aDest, dstW, 0) + } + + YSCALEYUV2YV12X(LUM_MMX_FILTER_OFFSET, dest, dstW, 0) +} + +#define YSCALEYUV2YV12X_ACCURATE(offset, dest, end, pos) \ + __asm__ volatile(\ + "lea " offset "(%0), %%"REG_d" \n\t"\ + "movq "DITHER32"+0(%0), %%mm4 \n\t"\ + "movq "DITHER32"+8(%0), %%mm5 \n\t"\ + "movq "DITHER32"+16(%0), %%mm6 \n\t"\ + "movq "DITHER32"+24(%0), %%mm7 \n\t"\ + "pxor %%mm4, %%mm4 \n\t"\ + "pxor %%mm5, %%mm5 \n\t"\ + "pxor %%mm6, %%mm6 \n\t"\ + "pxor %%mm7, %%mm7 \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%%"REG_S", %3, 2), %%mm0 \n\t" /* srcData */\ + "movq 8(%%"REG_S", %3, 2), %%mm2 \n\t" /* srcData */\ + "mov "STR(APCK_PTR2)"(%%"REG_d"), %%"REG_S" \n\t"\ + "movq (%%"REG_S", %3, 2), %%mm1 \n\t" /* srcData */\ + "movq %%mm0, %%mm3 \n\t"\ + "punpcklwd %%mm1, %%mm0 \n\t"\ + "punpckhwd %%mm1, %%mm3 \n\t"\ + "movq "STR(APCK_COEF)"(%%"REG_d"), %%mm1 \n\t" /* filterCoeff */\ + "pmaddwd %%mm1, %%mm0 \n\t"\ + "pmaddwd %%mm1, %%mm3 \n\t"\ + "paddd %%mm0, %%mm4 \n\t"\ + "paddd %%mm3, %%mm5 \n\t"\ + "movq 8(%%"REG_S", %3, 2), %%mm3 \n\t" /* srcData */\ + "mov "STR(APCK_SIZE)"(%%"REG_d"), %%"REG_S" \n\t"\ + "add $"STR(APCK_SIZE)", %%"REG_d" \n\t"\ + "test %%"REG_S", %%"REG_S" \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "punpcklwd %%mm3, %%mm2 \n\t"\ + "punpckhwd %%mm3, %%mm0 \n\t"\ + "pmaddwd %%mm1, %%mm2 \n\t"\ + "pmaddwd %%mm1, %%mm0 \n\t"\ + "paddd %%mm2, %%mm6 \n\t"\ + "paddd %%mm0, %%mm7 \n\t"\ + " jnz 1b \n\t"\ + "psrad $19, %%mm4 \n\t"\ + "psrad $19, %%mm5 \n\t"\ + "psrad $19, %%mm6 \n\t"\ + "psrad $19, %%mm7 \n\t"\ + "packssdw %%mm5, %%mm4 \n\t"\ + "packssdw %%mm7, %%mm6 \n\t"\ + "packuswb %%mm6, %%mm4 \n\t"\ + MOVNTQ(%%mm4, (%1, %3))\ + "add $8, %3 \n\t"\ + "cmp %2, %3 \n\t"\ + "lea " offset "(%0), %%"REG_d" \n\t"\ + "movq "DITHER32"+0(%0), %%mm4 \n\t"\ + "movq "DITHER32"+8(%0), %%mm5 \n\t"\ 
+ "movq "DITHER32"+16(%0), %%mm6 \n\t"\ + "movq "DITHER32"+24(%0), %%mm7 \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "jb 1b \n\t"\ + :: "r" (&c->redDither),\ + "r" (dest), "g" ((x86_reg)(end)), "r"((x86_reg)(pos))\ + : "%"REG_a, "%"REG_d, "%"REG_S\ + ); + +static void RENAME(yuv2yuvX_ar)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, + const uint8_t *lumDither, const uint8_t *chrDither) +{ + int i; + if (uDest) { + x86_reg uv_off = c->uv_off; + for(i=0; i<8; i++) c->dither32[i] = chrDither[i]<<12; + YSCALEYUV2YV12X_ACCURATE(CHR_MMX_FILTER_OFFSET, uDest, chrDstW, 0) + for(i=0; i<8; i++) c->dither32[i] = chrDither[(i+3)&7]<<12; + YSCALEYUV2YV12X_ACCURATE(CHR_MMX_FILTER_OFFSET, vDest - uv_off, chrDstW + uv_off, uv_off) + } + for(i=0; i<8; i++) c->dither32[i] = lumDither[i]<<12; + if (CONFIG_SWSCALE_ALPHA && aDest) { + YSCALEYUV2YV12X_ACCURATE(ALP_MMX_FILTER_OFFSET, aDest, dstW, 0) + } + + YSCALEYUV2YV12X_ACCURATE(LUM_MMX_FILTER_OFFSET, dest, dstW, 0) +} + +static void RENAME(yuv2yuv1)(SwsContext *c, const int16_t *lumSrc, + const int16_t *chrUSrc, const int16_t *chrVSrc, + const int16_t *alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, + const uint8_t *lumDither, const uint8_t *chrDither) +{ + int p= 4; + const int16_t *src[4]= { alpSrc + dstW, lumSrc + dstW, chrUSrc + chrDstW, chrVSrc + chrDstW }; + uint8_t *dst[4]= { aDest, dest, uDest, vDest }; + x86_reg counter[4]= { dstW, dstW, chrDstW, chrDstW }; + + while (p--) { + if (dst[p]) { + __asm__ volatile( + "mov %2, %%"REG_a" \n\t" + ".p2align 4 \n\t" /* FIXME Unroll? */ + "1: \n\t" + "movq (%0, %%"REG_a", 2), %%mm0 \n\t" + "movq 8(%0, %%"REG_a", 2), %%mm1 \n\t" + "psraw $7, %%mm0 \n\t" + "psraw $7, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + MOVNTQ(%%mm0, (%1, %%REGa)) + "add $8, %%"REG_a" \n\t" + "jnc 1b \n\t" + :: "r" (src[p]), "r" (dst[p] + counter[p]), + "g" (-counter[p]) + : "%"REG_a + ); + } + } +} + +static void RENAME(yuv2yuv1_ar)(SwsContext *c, const int16_t *lumSrc, + const int16_t *chrUSrc, const int16_t *chrVSrc, + const int16_t *alpSrc, + uint8_t *dest, uint8_t *uDest, uint8_t *vDest, + uint8_t *aDest, int dstW, int chrDstW, + const uint8_t *lumDither, const uint8_t *chrDither) +{ + int p= 4; + const int16_t *src[4]= { alpSrc + dstW, lumSrc + dstW, chrUSrc + chrDstW, chrVSrc + chrDstW }; + uint8_t *dst[4]= { aDest, dest, uDest, vDest }; + x86_reg counter[4]= { dstW, dstW, chrDstW, chrDstW }; + + while (p--) { + if (dst[p]) { + int i; + for(i=0; i<8; i++) c->dither16[i] = i<2 ? lumDither[i] : chrDither[i]; + __asm__ volatile( + "mov %2, %%"REG_a" \n\t" + "movq 0(%3), %%mm6 \n\t" + "movq 8(%3), %%mm7 \n\t" + ".p2align 4 \n\t" /* FIXME Unroll? 
*/ + "1: \n\t" + "movq (%0, %%"REG_a", 2), %%mm0 \n\t" + "movq 8(%0, %%"REG_a", 2), %%mm1 \n\t" + "paddsw %%mm6, %%mm0 \n\t" + "paddsw %%mm7, %%mm1 \n\t" + "psraw $7, %%mm0 \n\t" + "psraw $7, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + MOVNTQ(%%mm0, (%1, %%REGa)) + "add $8, %%"REG_a" \n\t" + "jnc 1b \n\t" + :: "r" (src[p]), "r" (dst[p] + counter[p]), + "g" (-counter[p]), "r"(c->dither16) + : "%"REG_a + ); + } + } +} + +#define YSCALEYUV2PACKEDX_UV \ + __asm__ volatile(\ + "xor %%"REG_a", %%"REG_a" \n\t"\ + ".p2align 4 \n\t"\ + "nop \n\t"\ + "1: \n\t"\ + "lea "CHR_MMX_FILTER_OFFSET"(%0), %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "movq "VROUNDER_OFFSET"(%0), %%mm3 \n\t"\ + "movq %%mm3, %%mm4 \n\t"\ + ".p2align 4 \n\t"\ + "2: \n\t"\ + "movq 8(%%"REG_d"), %%mm0 \n\t" /* filterCoeff */\ + "movq (%%"REG_S", %%"REG_a"), %%mm2 \n\t" /* UsrcData */\ + "add %6, %%"REG_S" \n\t" \ + "movq (%%"REG_S", %%"REG_a"), %%mm5 \n\t" /* VsrcData */\ + "add $16, %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "pmulhw %%mm0, %%mm2 \n\t"\ + "pmulhw %%mm0, %%mm5 \n\t"\ + "paddw %%mm2, %%mm3 \n\t"\ + "paddw %%mm5, %%mm4 \n\t"\ + "test %%"REG_S", %%"REG_S" \n\t"\ + " jnz 2b \n\t"\ + +#define YSCALEYUV2PACKEDX_YA(offset,coeff,src1,src2,dst1,dst2) \ + "lea "offset"(%0), %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "movq "VROUNDER_OFFSET"(%0), "#dst1" \n\t"\ + "movq "#dst1", "#dst2" \n\t"\ + ".p2align 4 \n\t"\ + "2: \n\t"\ + "movq 8(%%"REG_d"), "#coeff" \n\t" /* filterCoeff */\ + "movq (%%"REG_S", %%"REG_a", 2), "#src1" \n\t" /* Y1srcData */\ + "movq 8(%%"REG_S", %%"REG_a", 2), "#src2" \n\t" /* Y2srcData */\ + "add $16, %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "pmulhw "#coeff", "#src1" \n\t"\ + "pmulhw "#coeff", "#src2" \n\t"\ + "paddw "#src1", "#dst1" \n\t"\ + "paddw "#src2", "#dst2" \n\t"\ + "test %%"REG_S", %%"REG_S" \n\t"\ + " jnz 2b \n\t"\ + +#define YSCALEYUV2PACKEDX \ + YSCALEYUV2PACKEDX_UV \ + YSCALEYUV2PACKEDX_YA(LUM_MMX_FILTER_OFFSET,%%mm0,%%mm2,%%mm5,%%mm1,%%mm7) \ + +#define YSCALEYUV2PACKEDX_END \ + :: "r" (&c->redDither), \ + "m" (dummy), "m" (dummy), "m" (dummy),\ + "r" (dest), "m" (dstW_reg), "m"(uv_off) \ + : "%"REG_a, "%"REG_d, "%"REG_S \ + ); + +#define YSCALEYUV2PACKEDX_ACCURATE_UV \ + __asm__ volatile(\ + "xor %%"REG_a", %%"REG_a" \n\t"\ + ".p2align 4 \n\t"\ + "nop \n\t"\ + "1: \n\t"\ + "lea "CHR_MMX_FILTER_OFFSET"(%0), %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "pxor %%mm4, %%mm4 \n\t"\ + "pxor %%mm5, %%mm5 \n\t"\ + "pxor %%mm6, %%mm6 \n\t"\ + "pxor %%mm7, %%mm7 \n\t"\ + ".p2align 4 \n\t"\ + "2: \n\t"\ + "movq (%%"REG_S", %%"REG_a"), %%mm0 \n\t" /* UsrcData */\ + "add %6, %%"REG_S" \n\t" \ + "movq (%%"REG_S", %%"REG_a"), %%mm2 \n\t" /* VsrcData */\ + "mov "STR(APCK_PTR2)"(%%"REG_d"), %%"REG_S" \n\t"\ + "movq (%%"REG_S", %%"REG_a"), %%mm1 \n\t" /* UsrcData */\ + "movq %%mm0, %%mm3 \n\t"\ + "punpcklwd %%mm1, %%mm0 \n\t"\ + "punpckhwd %%mm1, %%mm3 \n\t"\ + "movq "STR(APCK_COEF)"(%%"REG_d"),%%mm1 \n\t" /* filterCoeff */\ + "pmaddwd %%mm1, %%mm0 \n\t"\ + "pmaddwd %%mm1, %%mm3 \n\t"\ + "paddd %%mm0, %%mm4 \n\t"\ + "paddd %%mm3, %%mm5 \n\t"\ + "add %6, %%"REG_S" \n\t" \ + "movq (%%"REG_S", %%"REG_a"), %%mm3 \n\t" /* VsrcData */\ + "mov "STR(APCK_SIZE)"(%%"REG_d"), %%"REG_S" \n\t"\ + "add $"STR(APCK_SIZE)", %%"REG_d" \n\t"\ + "test %%"REG_S", %%"REG_S" \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "punpcklwd %%mm3, %%mm2 \n\t"\ + "punpckhwd %%mm3, %%mm0 \n\t"\ + "pmaddwd %%mm1, %%mm2 \n\t"\ + "pmaddwd %%mm1, %%mm0 \n\t"\ + "paddd %%mm2, %%mm6 \n\t"\ + 
"paddd %%mm0, %%mm7 \n\t"\ + " jnz 2b \n\t"\ + "psrad $16, %%mm4 \n\t"\ + "psrad $16, %%mm5 \n\t"\ + "psrad $16, %%mm6 \n\t"\ + "psrad $16, %%mm7 \n\t"\ + "movq "VROUNDER_OFFSET"(%0), %%mm0 \n\t"\ + "packssdw %%mm5, %%mm4 \n\t"\ + "packssdw %%mm7, %%mm6 \n\t"\ + "paddw %%mm0, %%mm4 \n\t"\ + "paddw %%mm0, %%mm6 \n\t"\ + "movq %%mm4, "U_TEMP"(%0) \n\t"\ + "movq %%mm6, "V_TEMP"(%0) \n\t"\ + +#define YSCALEYUV2PACKEDX_ACCURATE_YA(offset) \ + "lea "offset"(%0), %%"REG_d" \n\t"\ + "mov (%%"REG_d"), %%"REG_S" \n\t"\ + "pxor %%mm1, %%mm1 \n\t"\ + "pxor %%mm5, %%mm5 \n\t"\ + "pxor %%mm7, %%mm7 \n\t"\ + "pxor %%mm6, %%mm6 \n\t"\ + ".p2align 4 \n\t"\ + "2: \n\t"\ + "movq (%%"REG_S", %%"REG_a", 2), %%mm0 \n\t" /* Y1srcData */\ + "movq 8(%%"REG_S", %%"REG_a", 2), %%mm2 \n\t" /* Y2srcData */\ + "mov "STR(APCK_PTR2)"(%%"REG_d"), %%"REG_S" \n\t"\ + "movq (%%"REG_S", %%"REG_a", 2), %%mm4 \n\t" /* Y1srcData */\ + "movq %%mm0, %%mm3 \n\t"\ + "punpcklwd %%mm4, %%mm0 \n\t"\ + "punpckhwd %%mm4, %%mm3 \n\t"\ + "movq "STR(APCK_COEF)"(%%"REG_d"), %%mm4 \n\t" /* filterCoeff */\ + "pmaddwd %%mm4, %%mm0 \n\t"\ + "pmaddwd %%mm4, %%mm3 \n\t"\ + "paddd %%mm0, %%mm1 \n\t"\ + "paddd %%mm3, %%mm5 \n\t"\ + "movq 8(%%"REG_S", %%"REG_a", 2), %%mm3 \n\t" /* Y2srcData */\ + "mov "STR(APCK_SIZE)"(%%"REG_d"), %%"REG_S" \n\t"\ + "add $"STR(APCK_SIZE)", %%"REG_d" \n\t"\ + "test %%"REG_S", %%"REG_S" \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "punpcklwd %%mm3, %%mm2 \n\t"\ + "punpckhwd %%mm3, %%mm0 \n\t"\ + "pmaddwd %%mm4, %%mm2 \n\t"\ + "pmaddwd %%mm4, %%mm0 \n\t"\ + "paddd %%mm2, %%mm7 \n\t"\ + "paddd %%mm0, %%mm6 \n\t"\ + " jnz 2b \n\t"\ + "psrad $16, %%mm1 \n\t"\ + "psrad $16, %%mm5 \n\t"\ + "psrad $16, %%mm7 \n\t"\ + "psrad $16, %%mm6 \n\t"\ + "movq "VROUNDER_OFFSET"(%0), %%mm0 \n\t"\ + "packssdw %%mm5, %%mm1 \n\t"\ + "packssdw %%mm6, %%mm7 \n\t"\ + "paddw %%mm0, %%mm1 \n\t"\ + "paddw %%mm0, %%mm7 \n\t"\ + "movq "U_TEMP"(%0), %%mm3 \n\t"\ + "movq "V_TEMP"(%0), %%mm4 \n\t"\ + +#define YSCALEYUV2PACKEDX_ACCURATE \ + YSCALEYUV2PACKEDX_ACCURATE_UV \ + YSCALEYUV2PACKEDX_ACCURATE_YA(LUM_MMX_FILTER_OFFSET) + +#define YSCALEYUV2RGBX \ + "psubw "U_OFFSET"(%0), %%mm3 \n\t" /* (U-128)8*/\ + "psubw "V_OFFSET"(%0), %%mm4 \n\t" /* (V-128)8*/\ + "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ + "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ + "pmulhw "UG_COEFF"(%0), %%mm3 \n\t"\ + "pmulhw "VG_COEFF"(%0), %%mm4 \n\t"\ + /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ + "pmulhw "UB_COEFF"(%0), %%mm2 \n\t"\ + "pmulhw "VR_COEFF"(%0), %%mm5 \n\t"\ + "psubw "Y_OFFSET"(%0), %%mm1 \n\t" /* 8(Y-16)*/\ + "psubw "Y_OFFSET"(%0), %%mm7 \n\t" /* 8(Y-16)*/\ + "pmulhw "Y_COEFF"(%0), %%mm1 \n\t"\ + "pmulhw "Y_COEFF"(%0), %%mm7 \n\t"\ + /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ + "paddw %%mm3, %%mm4 \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "movq %%mm5, %%mm6 \n\t"\ + "movq %%mm4, %%mm3 \n\t"\ + "punpcklwd %%mm2, %%mm2 \n\t"\ + "punpcklwd %%mm5, %%mm5 \n\t"\ + "punpcklwd %%mm4, %%mm4 \n\t"\ + "paddw %%mm1, %%mm2 \n\t"\ + "paddw %%mm1, %%mm5 \n\t"\ + "paddw %%mm1, %%mm4 \n\t"\ + "punpckhwd %%mm0, %%mm0 \n\t"\ + "punpckhwd %%mm6, %%mm6 \n\t"\ + "punpckhwd %%mm3, %%mm3 \n\t"\ + "paddw %%mm7, %%mm0 \n\t"\ + "paddw %%mm7, %%mm6 \n\t"\ + "paddw %%mm7, %%mm3 \n\t"\ + /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ + "packuswb %%mm0, %%mm2 \n\t"\ + "packuswb %%mm6, %%mm5 \n\t"\ + "packuswb %%mm3, %%mm4 \n\t"\ + +#define REAL_WRITEBGR32(dst, dstw, index, b, g, r, a, q0, q2, q3, t) \ + "movq "#b", "#q2" \n\t" /* B */\ + "movq "#r", "#t" \n\t" /* R */\ + "punpcklbw "#g", "#b" 
\n\t" /* GBGBGBGB 0 */\ + "punpcklbw "#a", "#r" \n\t" /* ARARARAR 0 */\ + "punpckhbw "#g", "#q2" \n\t" /* GBGBGBGB 2 */\ + "punpckhbw "#a", "#t" \n\t" /* ARARARAR 2 */\ + "movq "#b", "#q0" \n\t" /* GBGBGBGB 0 */\ + "movq "#q2", "#q3" \n\t" /* GBGBGBGB 2 */\ + "punpcklwd "#r", "#q0" \n\t" /* ARGBARGB 0 */\ + "punpckhwd "#r", "#b" \n\t" /* ARGBARGB 1 */\ + "punpcklwd "#t", "#q2" \n\t" /* ARGBARGB 2 */\ + "punpckhwd "#t", "#q3" \n\t" /* ARGBARGB 3 */\ +\ + MOVNTQ( q0, (dst, index, 4))\ + MOVNTQ( b, 8(dst, index, 4))\ + MOVNTQ( q2, 16(dst, index, 4))\ + MOVNTQ( q3, 24(dst, index, 4))\ +\ + "add $8, "#index" \n\t"\ + "cmp "#dstw", "#index" \n\t"\ + " jb 1b \n\t" +#define WRITEBGR32(dst, dstw, index, b, g, r, a, q0, q2, q3, t) REAL_WRITEBGR32(dst, dstw, index, b, g, r, a, q0, q2, q3, t) + +static void RENAME(yuv2rgb32_X_ar)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { + YSCALEYUV2PACKEDX_ACCURATE + YSCALEYUV2RGBX + "movq %%mm2, "U_TEMP"(%0) \n\t" + "movq %%mm4, "V_TEMP"(%0) \n\t" + "movq %%mm5, "Y_TEMP"(%0) \n\t" + YSCALEYUV2PACKEDX_ACCURATE_YA(ALP_MMX_FILTER_OFFSET) + "movq "Y_TEMP"(%0), %%mm5 \n\t" + "psraw $3, %%mm1 \n\t" + "psraw $3, %%mm7 \n\t" + "packuswb %%mm7, %%mm1 \n\t" + WRITEBGR32(%4, %5, %%REGa, %%mm3, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm2, %%mm6) + YSCALEYUV2PACKEDX_END + } else { + YSCALEYUV2PACKEDX_ACCURATE + YSCALEYUV2RGBX + "pcmpeqd %%mm7, %%mm7 \n\t" + WRITEBGR32(%4, %5, %%REGa, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + YSCALEYUV2PACKEDX_END + } +} + +static void RENAME(yuv2rgb32_X)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { + YSCALEYUV2PACKEDX + YSCALEYUV2RGBX + YSCALEYUV2PACKEDX_YA(ALP_MMX_FILTER_OFFSET, %%mm0, %%mm3, %%mm6, %%mm1, %%mm7) + "psraw $3, %%mm1 \n\t" + "psraw $3, %%mm7 \n\t" + "packuswb %%mm7, %%mm1 \n\t" + WRITEBGR32(%4, %5, %%REGa, %%mm2, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm3, %%mm6) + YSCALEYUV2PACKEDX_END + } else { + YSCALEYUV2PACKEDX + YSCALEYUV2RGBX + "pcmpeqd %%mm7, %%mm7 \n\t" + WRITEBGR32(%4, %5, %%REGa, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + YSCALEYUV2PACKEDX_END + } +} + +#define REAL_WRITERGB16(dst, dstw, index) \ + "pand "MANGLE(bF8)", %%mm2 \n\t" /* B */\ + "pand "MANGLE(bFC)", %%mm4 \n\t" /* G */\ + "pand "MANGLE(bF8)", %%mm5 \n\t" /* R */\ + "psrlq $3, %%mm2 \n\t"\ +\ + "movq %%mm2, %%mm1 \n\t"\ + "movq %%mm4, %%mm3 \n\t"\ +\ + "punpcklbw %%mm7, %%mm3 \n\t"\ + "punpcklbw %%mm5, %%mm2 \n\t"\ + "punpckhbw %%mm7, %%mm4 \n\t"\ + "punpckhbw %%mm5, %%mm1 \n\t"\ +\ + "psllq $3, %%mm3 \n\t"\ + "psllq $3, %%mm4 \n\t"\ +\ + "por %%mm3, %%mm2 \n\t"\ + "por %%mm4, %%mm1 \n\t"\ +\ + MOVNTQ(%%mm2, (dst, index, 2))\ + MOVNTQ(%%mm1, 8(dst, index, 2))\ +\ + "add $8, "#index" \n\t"\ + "cmp "#dstw", "#index" \n\t"\ + " jb 1b \n\t" +#define WRITERGB16(dst, dstw, index) REAL_WRITERGB16(dst, dstw, index) + +static void RENAME(yuv2rgb565_X_ar)(SwsContext *c, const 
int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX_ACCURATE + YSCALEYUV2RGBX + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%0), %%mm2\n\t" + "paddusb "GREEN_DITHER"(%0), %%mm4\n\t" + "paddusb "RED_DITHER"(%0), %%mm5\n\t" +#endif + WRITERGB16(%4, %5, %%REGa) + YSCALEYUV2PACKEDX_END +} + +static void RENAME(yuv2rgb565_X)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX + YSCALEYUV2RGBX + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%0), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%0), %%mm4 \n\t" + "paddusb "RED_DITHER"(%0), %%mm5 \n\t" +#endif + WRITERGB16(%4, %5, %%REGa) + YSCALEYUV2PACKEDX_END +} + +#define REAL_WRITERGB15(dst, dstw, index) \ + "pand "MANGLE(bF8)", %%mm2 \n\t" /* B */\ + "pand "MANGLE(bF8)", %%mm4 \n\t" /* G */\ + "pand "MANGLE(bF8)", %%mm5 \n\t" /* R */\ + "psrlq $3, %%mm2 \n\t"\ + "psrlq $1, %%mm5 \n\t"\ +\ + "movq %%mm2, %%mm1 \n\t"\ + "movq %%mm4, %%mm3 \n\t"\ +\ + "punpcklbw %%mm7, %%mm3 \n\t"\ + "punpcklbw %%mm5, %%mm2 \n\t"\ + "punpckhbw %%mm7, %%mm4 \n\t"\ + "punpckhbw %%mm5, %%mm1 \n\t"\ +\ + "psllq $2, %%mm3 \n\t"\ + "psllq $2, %%mm4 \n\t"\ +\ + "por %%mm3, %%mm2 \n\t"\ + "por %%mm4, %%mm1 \n\t"\ +\ + MOVNTQ(%%mm2, (dst, index, 2))\ + MOVNTQ(%%mm1, 8(dst, index, 2))\ +\ + "add $8, "#index" \n\t"\ + "cmp "#dstw", "#index" \n\t"\ + " jb 1b \n\t" +#define WRITERGB15(dst, dstw, index) REAL_WRITERGB15(dst, dstw, index) + +static void RENAME(yuv2rgb555_X_ar)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX_ACCURATE + YSCALEYUV2RGBX + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%0), %%mm2\n\t" + "paddusb "GREEN_DITHER"(%0), %%mm4\n\t" + "paddusb "RED_DITHER"(%0), %%mm5\n\t" +#endif + WRITERGB15(%4, %5, %%REGa) + YSCALEYUV2PACKEDX_END +} + +static void RENAME(yuv2rgb555_X)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX + YSCALEYUV2RGBX + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%0), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%0), %%mm4 \n\t" + "paddusb "RED_DITHER"(%0), %%mm5 \n\t" +#endif + WRITERGB15(%4, %5, %%REGa) + YSCALEYUV2PACKEDX_END +} + +#define WRITEBGR24MMX(dst, dstw, index) \ + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */\ + "movq 
%%mm2, %%mm1 \n\t" /* B */\ + "movq %%mm5, %%mm6 \n\t" /* R */\ + "punpcklbw %%mm4, %%mm2 \n\t" /* GBGBGBGB 0 */\ + "punpcklbw %%mm7, %%mm5 \n\t" /* 0R0R0R0R 0 */\ + "punpckhbw %%mm4, %%mm1 \n\t" /* GBGBGBGB 2 */\ + "punpckhbw %%mm7, %%mm6 \n\t" /* 0R0R0R0R 2 */\ + "movq %%mm2, %%mm0 \n\t" /* GBGBGBGB 0 */\ + "movq %%mm1, %%mm3 \n\t" /* GBGBGBGB 2 */\ + "punpcklwd %%mm5, %%mm0 \n\t" /* 0RGB0RGB 0 */\ + "punpckhwd %%mm5, %%mm2 \n\t" /* 0RGB0RGB 1 */\ + "punpcklwd %%mm6, %%mm1 \n\t" /* 0RGB0RGB 2 */\ + "punpckhwd %%mm6, %%mm3 \n\t" /* 0RGB0RGB 3 */\ +\ + "movq %%mm0, %%mm4 \n\t" /* 0RGB0RGB 0 */\ + "movq %%mm2, %%mm6 \n\t" /* 0RGB0RGB 1 */\ + "movq %%mm1, %%mm5 \n\t" /* 0RGB0RGB 2 */\ + "movq %%mm3, %%mm7 \n\t" /* 0RGB0RGB 3 */\ +\ + "psllq $40, %%mm0 \n\t" /* RGB00000 0 */\ + "psllq $40, %%mm2 \n\t" /* RGB00000 1 */\ + "psllq $40, %%mm1 \n\t" /* RGB00000 2 */\ + "psllq $40, %%mm3 \n\t" /* RGB00000 3 */\ +\ + "punpckhdq %%mm4, %%mm0 \n\t" /* 0RGBRGB0 0 */\ + "punpckhdq %%mm6, %%mm2 \n\t" /* 0RGBRGB0 1 */\ + "punpckhdq %%mm5, %%mm1 \n\t" /* 0RGBRGB0 2 */\ + "punpckhdq %%mm7, %%mm3 \n\t" /* 0RGBRGB0 3 */\ +\ + "psrlq $8, %%mm0 \n\t" /* 00RGBRGB 0 */\ + "movq %%mm2, %%mm6 \n\t" /* 0RGBRGB0 1 */\ + "psllq $40, %%mm2 \n\t" /* GB000000 1 */\ + "por %%mm2, %%mm0 \n\t" /* GBRGBRGB 0 */\ + MOVNTQ(%%mm0, (dst))\ +\ + "psrlq $24, %%mm6 \n\t" /* 0000RGBR 1 */\ + "movq %%mm1, %%mm5 \n\t" /* 0RGBRGB0 2 */\ + "psllq $24, %%mm1 \n\t" /* BRGB0000 2 */\ + "por %%mm1, %%mm6 \n\t" /* BRGBRGBR 1 */\ + MOVNTQ(%%mm6, 8(dst))\ +\ + "psrlq $40, %%mm5 \n\t" /* 000000RG 2 */\ + "psllq $8, %%mm3 \n\t" /* RGBRGB00 3 */\ + "por %%mm3, %%mm5 \n\t" /* RGBRGBRG 2 */\ + MOVNTQ(%%mm5, 16(dst))\ +\ + "add $24, "#dst" \n\t"\ +\ + "add $8, "#index" \n\t"\ + "cmp "#dstw", "#index" \n\t"\ + " jb 1b \n\t" + +#define WRITEBGR24MMX2(dst, dstw, index) \ + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */\ + "movq "MANGLE(ff_M24A)", %%mm0 \n\t"\ + "movq "MANGLE(ff_M24C)", %%mm7 \n\t"\ + "pshufw $0x50, %%mm2, %%mm1 \n\t" /* B3 B2 B3 B2 B1 B0 B1 B0 */\ + "pshufw $0x50, %%mm4, %%mm3 \n\t" /* G3 G2 G3 G2 G1 G0 G1 G0 */\ + "pshufw $0x00, %%mm5, %%mm6 \n\t" /* R1 R0 R1 R0 R1 R0 R1 R0 */\ +\ + "pand %%mm0, %%mm1 \n\t" /* B2 B1 B0 */\ + "pand %%mm0, %%mm3 \n\t" /* G2 G1 G0 */\ + "pand %%mm7, %%mm6 \n\t" /* R1 R0 */\ +\ + "psllq $8, %%mm3 \n\t" /* G2 G1 G0 */\ + "por %%mm1, %%mm6 \n\t"\ + "por %%mm3, %%mm6 \n\t"\ + MOVNTQ(%%mm6, (dst))\ +\ + "psrlq $8, %%mm4 \n\t" /* 00 G7 G6 G5 G4 G3 G2 G1 */\ + "pshufw $0xA5, %%mm2, %%mm1 \n\t" /* B5 B4 B5 B4 B3 B2 B3 B2 */\ + "pshufw $0x55, %%mm4, %%mm3 \n\t" /* G4 G3 G4 G3 G4 G3 G4 G3 */\ + "pshufw $0xA5, %%mm5, %%mm6 \n\t" /* R5 R4 R5 R4 R3 R2 R3 R2 */\ +\ + "pand "MANGLE(ff_M24B)", %%mm1 \n\t" /* B5 B4 B3 */\ + "pand %%mm7, %%mm3 \n\t" /* G4 G3 */\ + "pand %%mm0, %%mm6 \n\t" /* R4 R3 R2 */\ +\ + "por %%mm1, %%mm3 \n\t" /* B5 G4 B4 G3 B3 */\ + "por %%mm3, %%mm6 \n\t"\ + MOVNTQ(%%mm6, 8(dst))\ +\ + "pshufw $0xFF, %%mm2, %%mm1 \n\t" /* B7 B6 B7 B6 B7 B6 B6 B7 */\ + "pshufw $0xFA, %%mm4, %%mm3 \n\t" /* 00 G7 00 G7 G6 G5 G6 G5 */\ + "pshufw $0xFA, %%mm5, %%mm6 \n\t" /* R7 R6 R7 R6 R5 R4 R5 R4 */\ +\ + "pand %%mm7, %%mm1 \n\t" /* B7 B6 */\ + "pand %%mm0, %%mm3 \n\t" /* G7 G6 G5 */\ + "pand "MANGLE(ff_M24B)", %%mm6 \n\t" /* R7 R6 R5 */\ +\ + "por %%mm1, %%mm3 \n\t"\ + "por %%mm3, %%mm6 \n\t"\ + MOVNTQ(%%mm6, 16(dst))\ +\ + "add $24, "#dst" \n\t"\ +\ + "add $8, "#index" \n\t"\ + "cmp "#dstw", "#index" \n\t"\ + " jb 1b \n\t" + +#if COMPILE_TEMPLATE_MMX2 +#undef WRITEBGR24 +#define WRITEBGR24(dst, dstw, index) 
WRITEBGR24MMX2(dst, dstw, index) +#else +#undef WRITEBGR24 +#define WRITEBGR24(dst, dstw, index) WRITEBGR24MMX(dst, dstw, index) +#endif + +static void RENAME(yuv2bgr24_X_ar)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX_ACCURATE + YSCALEYUV2RGBX + "pxor %%mm7, %%mm7 \n\t" + "lea (%%"REG_a", %%"REG_a", 2), %%"REG_c"\n\t" //FIXME optimize + "add %4, %%"REG_c" \n\t" + WRITEBGR24(%%REGc, %5, %%REGa) + :: "r" (&c->redDither), + "m" (dummy), "m" (dummy), "m" (dummy), + "r" (dest), "m" (dstW_reg), "m"(uv_off) + : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S + ); +} + +static void RENAME(yuv2bgr24_X)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX + YSCALEYUV2RGBX + "pxor %%mm7, %%mm7 \n\t" + "lea (%%"REG_a", %%"REG_a", 2), %%"REG_c" \n\t" //FIXME optimize + "add %4, %%"REG_c" \n\t" + WRITEBGR24(%%REGc, %5, %%REGa) + :: "r" (&c->redDither), + "m" (dummy), "m" (dummy), "m" (dummy), + "r" (dest), "m" (dstW_reg), "m"(uv_off) + : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S + ); +} + +#define REAL_WRITEYUY2(dst, dstw, index) \ + "packuswb %%mm3, %%mm3 \n\t"\ + "packuswb %%mm4, %%mm4 \n\t"\ + "packuswb %%mm7, %%mm1 \n\t"\ + "punpcklbw %%mm4, %%mm3 \n\t"\ + "movq %%mm1, %%mm7 \n\t"\ + "punpcklbw %%mm3, %%mm1 \n\t"\ + "punpckhbw %%mm3, %%mm7 \n\t"\ +\ + MOVNTQ(%%mm1, (dst, index, 2))\ + MOVNTQ(%%mm7, 8(dst, index, 2))\ +\ + "add $8, "#index" \n\t"\ + "cmp "#dstw", "#index" \n\t"\ + " jb 1b \n\t" +#define WRITEYUY2(dst, dstw, index) REAL_WRITEYUY2(dst, dstw, index) + +static void RENAME(yuv2yuyv422_X_ar)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX_ACCURATE + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ + "psraw $3, %%mm3 \n\t" + "psraw $3, %%mm4 \n\t" + "psraw $3, %%mm1 \n\t" + "psraw $3, %%mm7 \n\t" + WRITEYUY2(%4, %5, %%REGa) + YSCALEYUV2PACKEDX_END +} + +static void RENAME(yuv2yuyv422_X)(SwsContext *c, const int16_t *lumFilter, + const int16_t **lumSrc, int lumFilterSize, + const int16_t *chrFilter, const int16_t **chrUSrc, + const int16_t **chrVSrc, + int chrFilterSize, const int16_t **alpSrc, + uint8_t *dest, int dstW, int dstY) +{ + x86_reg dummy=0; + x86_reg dstW_reg = dstW; + x86_reg uv_off = c->uv_off << 1; + + YSCALEYUV2PACKEDX + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ + "psraw $3, %%mm3 \n\t" + "psraw $3, %%mm4 \n\t" + "psraw $3, %%mm1 \n\t" + "psraw $3, %%mm7 \n\t" + WRITEYUY2(%4, %5, %%REGa) + YSCALEYUV2PACKEDX_END +} + +#define REAL_YSCALEYUV2RGB_UV(index, c) \ + "xor "#index", "#index" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ + "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ + "add "UV_OFFx2"("#c"), "#index" \n\t" \ + "movq (%2, "#index"), %%mm5 
\n\t" /* uvbuf0[eax+2048]*/\ + "movq (%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ + "sub "UV_OFFx2"("#c"), "#index" \n\t" \ + "psubw %%mm3, %%mm2 \n\t" /* uvbuf0[eax] - uvbuf1[eax]*/\ + "psubw %%mm4, %%mm5 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048]*/\ + "movq "CHR_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t"\ + "pmulhw %%mm0, %%mm2 \n\t" /* (uvbuf0[eax] - uvbuf1[eax])uvalpha1>>16*/\ + "pmulhw %%mm0, %%mm5 \n\t" /* (uvbuf0[eax+2048] - uvbuf1[eax+2048])uvalpha1>>16*/\ + "psraw $4, %%mm3 \n\t" /* uvbuf0[eax] - uvbuf1[eax] >>4*/\ + "psraw $4, %%mm4 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048] >>4*/\ + "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax]uvalpha1 - uvbuf1[eax](1-uvalpha1)*/\ + "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048]uvalpha1 - uvbuf1[eax+2048](1-uvalpha1)*/\ + "psubw "U_OFFSET"("#c"), %%mm3 \n\t" /* (U-128)8*/\ + "psubw "V_OFFSET"("#c"), %%mm4 \n\t" /* (V-128)8*/\ + "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ + "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ + "pmulhw "UG_COEFF"("#c"), %%mm3 \n\t"\ + "pmulhw "VG_COEFF"("#c"), %%mm4 \n\t"\ + /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ + +#define REAL_YSCALEYUV2RGB_YA(index, c, b1, b2) \ + "movq ("#b1", "#index", 2), %%mm0 \n\t" /*buf0[eax]*/\ + "movq ("#b2", "#index", 2), %%mm1 \n\t" /*buf1[eax]*/\ + "movq 8("#b1", "#index", 2), %%mm6 \n\t" /*buf0[eax]*/\ + "movq 8("#b2", "#index", 2), %%mm7 \n\t" /*buf1[eax]*/\ + "psubw %%mm1, %%mm0 \n\t" /* buf0[eax] - buf1[eax]*/\ + "psubw %%mm7, %%mm6 \n\t" /* buf0[eax] - buf1[eax]*/\ + "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ + "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm6 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ + "psraw $4, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "psraw $4, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "paddw %%mm0, %%mm1 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ + "paddw %%mm6, %%mm7 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ + +#define REAL_YSCALEYUV2RGB_COEFF(c) \ + "pmulhw "UB_COEFF"("#c"), %%mm2 \n\t"\ + "pmulhw "VR_COEFF"("#c"), %%mm5 \n\t"\ + "psubw "Y_OFFSET"("#c"), %%mm1 \n\t" /* 8(Y-16)*/\ + "psubw "Y_OFFSET"("#c"), %%mm7 \n\t" /* 8(Y-16)*/\ + "pmulhw "Y_COEFF"("#c"), %%mm1 \n\t"\ + "pmulhw "Y_COEFF"("#c"), %%mm7 \n\t"\ + /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ + "paddw %%mm3, %%mm4 \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "movq %%mm5, %%mm6 \n\t"\ + "movq %%mm4, %%mm3 \n\t"\ + "punpcklwd %%mm2, %%mm2 \n\t"\ + "punpcklwd %%mm5, %%mm5 \n\t"\ + "punpcklwd %%mm4, %%mm4 \n\t"\ + "paddw %%mm1, %%mm2 \n\t"\ + "paddw %%mm1, %%mm5 \n\t"\ + "paddw %%mm1, %%mm4 \n\t"\ + "punpckhwd %%mm0, %%mm0 \n\t"\ + "punpckhwd %%mm6, %%mm6 \n\t"\ + "punpckhwd %%mm3, %%mm3 \n\t"\ + "paddw %%mm7, %%mm0 \n\t"\ + "paddw %%mm7, %%mm6 \n\t"\ + "paddw %%mm7, %%mm3 \n\t"\ + /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ + "packuswb %%mm0, %%mm2 \n\t"\ + "packuswb %%mm6, %%mm5 \n\t"\ + "packuswb %%mm3, %%mm4 \n\t"\ + +#define YSCALEYUV2RGB_YA(index, c, b1, b2) REAL_YSCALEYUV2RGB_YA(index, c, b1, b2) + +#define YSCALEYUV2RGB(index, c) \ + REAL_YSCALEYUV2RGB_UV(index, c) \ + REAL_YSCALEYUV2RGB_YA(index, c, %0, %1) \ + REAL_YSCALEYUV2RGB_COEFF(c) + +/** + * vertical bilinear scale YV12 to RGB + */ +static void RENAME(yuv2rgb32_2)(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, + int dstW, int yalpha, int uvalpha, int y) +{ + if 
(CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { +#if ARCH_X86_64 + __asm__ volatile( + YSCALEYUV2RGB(%%r8, %5) + YSCALEYUV2RGB_YA(%%r8, %5, %6, %7) + "psraw $3, %%mm1 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ + "psraw $3, %%mm7 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ + "packuswb %%mm7, %%mm1 \n\t" + WRITEBGR32(%4, 8280(%5), %%r8, %%mm2, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm3, %%mm6) + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "r" (dest), + "a" (&c->redDither), + "r" (abuf0), "r" (abuf1) + : "%r8" + ); +#else + c->u_temp=(intptr_t)abuf0; + c->v_temp=(intptr_t)abuf1; + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB(%%REGBP, %5) + "push %0 \n\t" + "push %1 \n\t" + "mov "U_TEMP"(%5), %0 \n\t" + "mov "V_TEMP"(%5), %1 \n\t" + YSCALEYUV2RGB_YA(%%REGBP, %5, %0, %1) + "psraw $3, %%mm1 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ + "psraw $3, %%mm7 \n\t" /* abuf0[eax] - abuf1[eax] >>7*/ + "packuswb %%mm7, %%mm1 \n\t" + "pop %1 \n\t" + "pop %0 \n\t" + WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm1, %%mm0, %%mm7, %%mm3, %%mm6) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); +#endif + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB(%%REGBP, %5) + "pcmpeqd %%mm7, %%mm7 \n\t" + WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } +} + +static void RENAME(yuv2bgr24_2)(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, + int dstW, int yalpha, int uvalpha, int y) +{ + //Note 8280 == DSTW_OFFSET but the preprocessor can't handle that there :( + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + WRITEBGR24(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); +} + +static void RENAME(yuv2rgb555_2)(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, + int dstW, int yalpha, int uvalpha, int y) +{ + //Note 8280 == DSTW_OFFSET but the preprocessor can't handle that there :( + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" + "paddusb "RED_DITHER"(%5), %%mm5 \n\t" +#endif + WRITERGB15(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); +} + +static void RENAME(yuv2rgb565_2)(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + 
const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, + int dstW, int yalpha, int uvalpha, int y) +{ + //Note 8280 == DSTW_OFFSET but the preprocessor can't handle that there :( + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" + "paddusb "RED_DITHER"(%5), %%mm5 \n\t" +#endif + WRITERGB16(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); +} + +#define REAL_YSCALEYUV2PACKED(index, c) \ + "movq "CHR_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t"\ + "movq "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm1 \n\t"\ + "psraw $3, %%mm0 \n\t"\ + "psraw $3, %%mm1 \n\t"\ + "movq %%mm0, "CHR_MMX_FILTER_OFFSET"+8("#c") \n\t"\ + "movq %%mm1, "LUM_MMX_FILTER_OFFSET"+8("#c") \n\t"\ + "xor "#index", "#index" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ + "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ + "add "UV_OFFx2"("#c"), "#index" \n\t" \ + "movq (%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ + "movq (%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ + "sub "UV_OFFx2"("#c"), "#index" \n\t" \ + "psubw %%mm3, %%mm2 \n\t" /* uvbuf0[eax] - uvbuf1[eax]*/\ + "psubw %%mm4, %%mm5 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048]*/\ + "movq "CHR_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t"\ + "pmulhw %%mm0, %%mm2 \n\t" /* (uvbuf0[eax] - uvbuf1[eax])uvalpha1>>16*/\ + "pmulhw %%mm0, %%mm5 \n\t" /* (uvbuf0[eax+2048] - uvbuf1[eax+2048])uvalpha1>>16*/\ + "psraw $7, %%mm3 \n\t" /* uvbuf0[eax] - uvbuf1[eax] >>4*/\ + "psraw $7, %%mm4 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048] >>4*/\ + "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax]uvalpha1 - uvbuf1[eax](1-uvalpha1)*/\ + "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048]uvalpha1 - uvbuf1[eax+2048](1-uvalpha1)*/\ + "movq (%0, "#index", 2), %%mm0 \n\t" /*buf0[eax]*/\ + "movq (%1, "#index", 2), %%mm1 \n\t" /*buf1[eax]*/\ + "movq 8(%0, "#index", 2), %%mm6 \n\t" /*buf0[eax]*/\ + "movq 8(%1, "#index", 2), %%mm7 \n\t" /*buf1[eax]*/\ + "psubw %%mm1, %%mm0 \n\t" /* buf0[eax] - buf1[eax]*/\ + "psubw %%mm7, %%mm6 \n\t" /* buf0[eax] - buf1[eax]*/\ + "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm0 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ + "pmulhw "LUM_MMX_FILTER_OFFSET"+8("#c"), %%mm6 \n\t" /* (buf0[eax] - buf1[eax])yalpha1>>16*/\ + "psraw $7, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "psraw $7, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "paddw %%mm0, %%mm1 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ + "paddw %%mm6, %%mm7 \n\t" /* buf0[eax]yalpha1 + buf1[eax](1-yalpha1) >>16*/\ + +#define YSCALEYUV2PACKED(index, c) REAL_YSCALEYUV2PACKED(index, c) + +static void RENAME(yuv2yuyv422_2)(SwsContext *c, const uint16_t *buf0, + const uint16_t *buf1, const uint16_t *ubuf0, + const uint16_t *ubuf1, const uint16_t *vbuf0, + const uint16_t *vbuf1, const uint16_t *abuf0, + const uint16_t *abuf1, uint8_t *dest, + int dstW, int yalpha, int uvalpha, int y) +{ + //Note 8280 == DSTW_OFFSET but the preprocessor can't handle that there :( + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2PACKED(%%REGBP, %5) + 
WRITEYUY2(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); +} + +#define REAL_YSCALEYUV2RGB1(index, c) \ + "xor "#index", "#index" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%2, "#index"), %%mm3 \n\t" /* uvbuf0[eax]*/\ + "add "UV_OFFx2"("#c"), "#index" \n\t" \ + "movq (%2, "#index"), %%mm4 \n\t" /* uvbuf0[eax+2048]*/\ + "sub "UV_OFFx2"("#c"), "#index" \n\t" \ + "psraw $4, %%mm3 \n\t" /* uvbuf0[eax] - uvbuf1[eax] >>4*/\ + "psraw $4, %%mm4 \n\t" /* uvbuf0[eax+2048] - uvbuf1[eax+2048] >>4*/\ + "psubw "U_OFFSET"("#c"), %%mm3 \n\t" /* (U-128)8*/\ + "psubw "V_OFFSET"("#c"), %%mm4 \n\t" /* (V-128)8*/\ + "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ + "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ + "pmulhw "UG_COEFF"("#c"), %%mm3 \n\t"\ + "pmulhw "VG_COEFF"("#c"), %%mm4 \n\t"\ + /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ + "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ + "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ + "psraw $4, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "psraw $4, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "pmulhw "UB_COEFF"("#c"), %%mm2 \n\t"\ + "pmulhw "VR_COEFF"("#c"), %%mm5 \n\t"\ + "psubw "Y_OFFSET"("#c"), %%mm1 \n\t" /* 8(Y-16)*/\ + "psubw "Y_OFFSET"("#c"), %%mm7 \n\t" /* 8(Y-16)*/\ + "pmulhw "Y_COEFF"("#c"), %%mm1 \n\t"\ + "pmulhw "Y_COEFF"("#c"), %%mm7 \n\t"\ + /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ + "paddw %%mm3, %%mm4 \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "movq %%mm5, %%mm6 \n\t"\ + "movq %%mm4, %%mm3 \n\t"\ + "punpcklwd %%mm2, %%mm2 \n\t"\ + "punpcklwd %%mm5, %%mm5 \n\t"\ + "punpcklwd %%mm4, %%mm4 \n\t"\ + "paddw %%mm1, %%mm2 \n\t"\ + "paddw %%mm1, %%mm5 \n\t"\ + "paddw %%mm1, %%mm4 \n\t"\ + "punpckhwd %%mm0, %%mm0 \n\t"\ + "punpckhwd %%mm6, %%mm6 \n\t"\ + "punpckhwd %%mm3, %%mm3 \n\t"\ + "paddw %%mm7, %%mm0 \n\t"\ + "paddw %%mm7, %%mm6 \n\t"\ + "paddw %%mm7, %%mm3 \n\t"\ + /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ + "packuswb %%mm0, %%mm2 \n\t"\ + "packuswb %%mm6, %%mm5 \n\t"\ + "packuswb %%mm3, %%mm4 \n\t"\ + +#define YSCALEYUV2RGB1(index, c) REAL_YSCALEYUV2RGB1(index, c) + +// do vertical chrominance interpolation +#define REAL_YSCALEYUV2RGB1b(index, c) \ + "xor "#index", "#index" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ + "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ + "add "UV_OFFx2"("#c"), "#index" \n\t" \ + "movq (%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ + "movq (%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ + "sub "UV_OFFx2"("#c"), "#index" \n\t" \ + "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax] + uvbuf1[eax]*/\ + "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048] + uvbuf1[eax+2048]*/\ + "psrlw $5, %%mm3 \n\t" /*FIXME might overflow*/\ + "psrlw $5, %%mm4 \n\t" /*FIXME might overflow*/\ + "psubw "U_OFFSET"("#c"), %%mm3 \n\t" /* (U-128)8*/\ + "psubw "V_OFFSET"("#c"), %%mm4 \n\t" /* (V-128)8*/\ + "movq %%mm3, %%mm2 \n\t" /* (U-128)8*/\ + "movq %%mm4, %%mm5 \n\t" /* (V-128)8*/\ + "pmulhw "UG_COEFF"("#c"), %%mm3 \n\t"\ + "pmulhw "VG_COEFF"("#c"), %%mm4 \n\t"\ + /* mm2=(U-128)8, mm3=ug, mm4=vg mm5=(V-128)8 */\ + "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ + "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ + "psraw $4, %%mm1 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "psraw $4, %%mm7 \n\t" /* buf0[eax] - buf1[eax] >>4*/\ + "pmulhw "UB_COEFF"("#c"), %%mm2 \n\t"\ + "pmulhw "VR_COEFF"("#c"), %%mm5 \n\t"\ + "psubw 
"Y_OFFSET"("#c"), %%mm1 \n\t" /* 8(Y-16)*/\ + "psubw "Y_OFFSET"("#c"), %%mm7 \n\t" /* 8(Y-16)*/\ + "pmulhw "Y_COEFF"("#c"), %%mm1 \n\t"\ + "pmulhw "Y_COEFF"("#c"), %%mm7 \n\t"\ + /* mm1= Y1, mm2=ub, mm3=ug, mm4=vg mm5=vr, mm7=Y2 */\ + "paddw %%mm3, %%mm4 \n\t"\ + "movq %%mm2, %%mm0 \n\t"\ + "movq %%mm5, %%mm6 \n\t"\ + "movq %%mm4, %%mm3 \n\t"\ + "punpcklwd %%mm2, %%mm2 \n\t"\ + "punpcklwd %%mm5, %%mm5 \n\t"\ + "punpcklwd %%mm4, %%mm4 \n\t"\ + "paddw %%mm1, %%mm2 \n\t"\ + "paddw %%mm1, %%mm5 \n\t"\ + "paddw %%mm1, %%mm4 \n\t"\ + "punpckhwd %%mm0, %%mm0 \n\t"\ + "punpckhwd %%mm6, %%mm6 \n\t"\ + "punpckhwd %%mm3, %%mm3 \n\t"\ + "paddw %%mm7, %%mm0 \n\t"\ + "paddw %%mm7, %%mm6 \n\t"\ + "paddw %%mm7, %%mm3 \n\t"\ + /* mm0=B1, mm2=B2, mm3=G2, mm4=G1, mm5=R1, mm6=R2 */\ + "packuswb %%mm0, %%mm2 \n\t"\ + "packuswb %%mm6, %%mm5 \n\t"\ + "packuswb %%mm3, %%mm4 \n\t"\ + +#define YSCALEYUV2RGB1b(index, c) REAL_YSCALEYUV2RGB1b(index, c) + +#define REAL_YSCALEYUV2RGB1_ALPHA(index) \ + "movq (%1, "#index", 2), %%mm7 \n\t" /* abuf0[index ] */\ + "movq 8(%1, "#index", 2), %%mm1 \n\t" /* abuf0[index+4] */\ + "psraw $7, %%mm7 \n\t" /* abuf0[index ] >>7 */\ + "psraw $7, %%mm1 \n\t" /* abuf0[index+4] >>7 */\ + "packuswb %%mm1, %%mm7 \n\t" +#define YSCALEYUV2RGB1_ALPHA(index) REAL_YSCALEYUV2RGB1_ALPHA(index) + +/** + * YV12 to RGB without scaling or interpolating + */ +static void RENAME(yuv2rgb32_1)(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, + int dstW, int uvalpha, enum PixelFormat dstFormat, + int flags, int y) +{ + const uint16_t *buf1= buf0; //FIXME needed for RGB1/BGR1 + + if (uvalpha < 2048) { // note this is not correct (shifts chrominance by 0.5 pixels) but it is a bit faster + if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1(%%REGBP, %5) + YSCALEYUV2RGB1_ALPHA(%%REGBP) + WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (abuf0), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1(%%REGBP, %5) + "pcmpeqd %%mm7, %%mm7 \n\t" + WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } + } else { + if (CONFIG_SWSCALE_ALPHA && c->alpPixBuf) { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1b(%%REGBP, %5) + YSCALEYUV2RGB1_ALPHA(%%REGBP) + WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (abuf0), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1b(%%REGBP, %5) + "pcmpeqd %%mm7, %%mm7 \n\t" + WRITEBGR32(%%REGb, 8280(%5), %%REGBP, %%mm2, %%mm4, %%mm5, %%mm7, %%mm0, %%mm1, %%mm3, %%mm6) + "pop %%"REG_BP" \n\t" + "mov 
"ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } + } +} + +static void RENAME(yuv2bgr24_1)(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, + int dstW, int uvalpha, enum PixelFormat dstFormat, + int flags, int y) +{ + const uint16_t *buf1= buf0; //FIXME needed for RGB1/BGR1 + + if (uvalpha < 2048) { // note this is not correct (shifts chrominance by 0.5 pixels) but it is a bit faster + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + WRITEBGR24(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1b(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + WRITEBGR24(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } +} + +static void RENAME(yuv2rgb555_1)(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, + int dstW, int uvalpha, enum PixelFormat dstFormat, + int flags, int y) +{ + const uint16_t *buf1= buf0; //FIXME needed for RGB1/BGR1 + + if (uvalpha < 2048) { // note this is not correct (shifts chrominance by 0.5 pixels) but it is a bit faster + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" + "paddusb "RED_DITHER"(%5), %%mm5 \n\t" +#endif + WRITERGB15(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1b(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" + "paddusb "RED_DITHER"(%5), %%mm5 \n\t" +#endif + WRITERGB15(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } +} + +static void RENAME(yuv2rgb565_1)(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, + int dstW, int uvalpha, enum PixelFormat dstFormat, + int flags, int y) +{ + const uint16_t *buf1= buf0; //FIXME needed for RGB1/BGR1 + + if (uvalpha < 2048) { // note this is not correct (shifts chrominance by 0.5 pixels) but it is a bit faster + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1(%%REGBP, %5) + 
"pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" + "paddusb "RED_DITHER"(%5), %%mm5 \n\t" +#endif + WRITERGB16(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2RGB1b(%%REGBP, %5) + "pxor %%mm7, %%mm7 \n\t" + /* mm2=B, %%mm4=G, %%mm5=R, %%mm7=0 */ +#ifdef DITHER1XBPP + "paddusb "BLUE_DITHER"(%5), %%mm2 \n\t" + "paddusb "GREEN_DITHER"(%5), %%mm4 \n\t" + "paddusb "RED_DITHER"(%5), %%mm5 \n\t" +#endif + WRITERGB16(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } +} + +#define REAL_YSCALEYUV2PACKED1(index, c) \ + "xor "#index", "#index" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%2, "#index"), %%mm3 \n\t" /* uvbuf0[eax]*/\ + "add "UV_OFFx2"("#c"), "#index" \n\t" \ + "movq (%2, "#index"), %%mm4 \n\t" /* uvbuf0[eax+2048]*/\ + "sub "UV_OFFx2"("#c"), "#index" \n\t" \ + "psraw $7, %%mm3 \n\t" \ + "psraw $7, %%mm4 \n\t" \ + "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ + "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ + "psraw $7, %%mm1 \n\t" \ + "psraw $7, %%mm7 \n\t" \ + +#define YSCALEYUV2PACKED1(index, c) REAL_YSCALEYUV2PACKED1(index, c) + +#define REAL_YSCALEYUV2PACKED1b(index, c) \ + "xor "#index", "#index" \n\t"\ + ".p2align 4 \n\t"\ + "1: \n\t"\ + "movq (%2, "#index"), %%mm2 \n\t" /* uvbuf0[eax]*/\ + "movq (%3, "#index"), %%mm3 \n\t" /* uvbuf1[eax]*/\ + "add "UV_OFFx2"("#c"), "#index" \n\t" \ + "movq (%2, "#index"), %%mm5 \n\t" /* uvbuf0[eax+2048]*/\ + "movq (%3, "#index"), %%mm4 \n\t" /* uvbuf1[eax+2048]*/\ + "sub "UV_OFFx2"("#c"), "#index" \n\t" \ + "paddw %%mm2, %%mm3 \n\t" /* uvbuf0[eax] + uvbuf1[eax]*/\ + "paddw %%mm5, %%mm4 \n\t" /* uvbuf0[eax+2048] + uvbuf1[eax+2048]*/\ + "psrlw $8, %%mm3 \n\t" \ + "psrlw $8, %%mm4 \n\t" \ + "movq (%0, "#index", 2), %%mm1 \n\t" /*buf0[eax]*/\ + "movq 8(%0, "#index", 2), %%mm7 \n\t" /*buf0[eax]*/\ + "psraw $7, %%mm1 \n\t" \ + "psraw $7, %%mm7 \n\t" +#define YSCALEYUV2PACKED1b(index, c) REAL_YSCALEYUV2PACKED1b(index, c) + +static void RENAME(yuv2yuyv422_1)(SwsContext *c, const uint16_t *buf0, + const uint16_t *ubuf0, const uint16_t *ubuf1, + const uint16_t *vbuf0, const uint16_t *vbuf1, + const uint16_t *abuf0, uint8_t *dest, + int dstW, int uvalpha, enum PixelFormat dstFormat, + int flags, int y) +{ + const uint16_t *buf1= buf0; //FIXME needed for RGB1/BGR1 + + if (uvalpha < 2048) { // note this is not correct (shifts chrominance by 0.5 pixels) but it is a bit faster + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2PACKED1(%%REGBP, %5) + WRITEYUY2(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" (ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } else { + __asm__ volatile( + "mov %%"REG_b", "ESP_OFFSET"(%5) \n\t" + "mov %4, %%"REG_b" \n\t" + "push %%"REG_BP" \n\t" + YSCALEYUV2PACKED1b(%%REGBP, %5) + WRITEYUY2(%%REGb, 8280(%5), %%REGBP) + "pop %%"REG_BP" \n\t" + "mov "ESP_OFFSET"(%5), %%"REG_b" \n\t" + :: "c" (buf0), "d" (buf1), "S" (ubuf0), "D" 
(ubuf1), "m" (dest), + "a" (&c->redDither) + ); + } +} + +#if !COMPILE_TEMPLATE_MMX2 +//FIXME yuy2* can read up to 7 samples too much + +static void RENAME(yuy2ToY)(uint8_t *dst, const uint8_t *src, + int width, uint32_t *unused) +{ + __asm__ volatile( + "movq "MANGLE(bm01010101)", %%mm2 \n\t" + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",2), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" + "pand %%mm2, %%mm0 \n\t" + "pand %%mm2, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "movq %%mm0, (%2, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src+width*2), "r" (dst+width) + : "%"REG_a + ); +} + +static void RENAME(yuy2ToUV)(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + __asm__ volatile( + "movq "MANGLE(bm01010101)", %%mm4 \n\t" + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",4), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",4), %%mm1 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "movq %%mm0, %%mm1 \n\t" + "psrlw $8, %%mm0 \n\t" + "pand %%mm4, %%mm1 \n\t" + "packuswb %%mm0, %%mm0 \n\t" + "packuswb %%mm1, %%mm1 \n\t" + "movd %%mm0, (%3, %%"REG_a") \n\t" + "movd %%mm1, (%2, %%"REG_a") \n\t" + "add $4, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src1+width*4), "r" (dstU+width), "r" (dstV+width) + : "%"REG_a + ); + assert(src1 == src2); +} + +static void RENAME(LEToUV)(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + __asm__ volatile( + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",2), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" + "movq (%2, %%"REG_a",2), %%mm2 \n\t" + "movq 8(%2, %%"REG_a",2), %%mm3 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm1 \n\t" + "psrlw $8, %%mm2 \n\t" + "psrlw $8, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, (%3, %%"REG_a") \n\t" + "movq %%mm2, (%4, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src1+width*2), "r" (src2+width*2), "r" (dstU+width), "r" (dstV+width) + : "%"REG_a + ); +} + +/* This is almost identical to the previous, end exists only because + * yuy2ToY/UV)(dst, src+1, ...) would have 100% unaligned accesses. 
*/ +static void RENAME(uyvyToY)(uint8_t *dst, const uint8_t *src, + int width, uint32_t *unused) +{ + __asm__ volatile( + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",2), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" + "psrlw $8, %%mm0 \n\t" + "psrlw $8, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "movq %%mm0, (%2, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src+width*2), "r" (dst+width) + : "%"REG_a + ); +} + +static void RENAME(uyvyToUV)(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + __asm__ volatile( + "movq "MANGLE(bm01010101)", %%mm4 \n\t" + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",4), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",4), %%mm1 \n\t" + "pand %%mm4, %%mm0 \n\t" + "pand %%mm4, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "movq %%mm0, %%mm1 \n\t" + "psrlw $8, %%mm0 \n\t" + "pand %%mm4, %%mm1 \n\t" + "packuswb %%mm0, %%mm0 \n\t" + "packuswb %%mm1, %%mm1 \n\t" + "movd %%mm0, (%3, %%"REG_a") \n\t" + "movd %%mm1, (%2, %%"REG_a") \n\t" + "add $4, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src1+width*4), "r" (dstU+width), "r" (dstV+width) + : "%"REG_a + ); + assert(src1 == src2); +} + +static void RENAME(BEToUV)(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + __asm__ volatile( + "movq "MANGLE(bm01010101)", %%mm4 \n\t" + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",2), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" + "movq (%2, %%"REG_a",2), %%mm2 \n\t" + "movq 8(%2, %%"REG_a",2), %%mm3 \n\t" + "pand %%mm4, %%mm0 \n\t" + "pand %%mm4, %%mm1 \n\t" + "pand %%mm4, %%mm2 \n\t" + "pand %%mm4, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, (%3, %%"REG_a") \n\t" + "movq %%mm2, (%4, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src1+width*2), "r" (src2+width*2), "r" (dstU+width), "r" (dstV+width) + : "%"REG_a + ); +} + +static av_always_inline void RENAME(nvXXtoUV)(uint8_t *dst1, uint8_t *dst2, + const uint8_t *src, int width) +{ + __asm__ volatile( + "movq "MANGLE(bm01010101)", %%mm4 \n\t" + "mov %0, %%"REG_a" \n\t" + "1: \n\t" + "movq (%1, %%"REG_a",2), %%mm0 \n\t" + "movq 8(%1, %%"REG_a",2), %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm1, %%mm3 \n\t" + "pand %%mm4, %%mm0 \n\t" + "pand %%mm4, %%mm1 \n\t" + "psrlw $8, %%mm2 \n\t" + "psrlw $8, %%mm3 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "packuswb %%mm3, %%mm2 \n\t" + "movq %%mm0, (%2, %%"REG_a") \n\t" + "movq %%mm2, (%3, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : : "g" ((x86_reg)-width), "r" (src+width*2), "r" (dst1+width), "r" (dst2+width) + : "%"REG_a + ); +} + +static void RENAME(nv12ToUV)(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + RENAME(nvXXtoUV)(dstU, dstV, src1, width); +} + +static void RENAME(nv21ToUV)(uint8_t *dstU, uint8_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + RENAME(nvXXtoUV)(dstV, dstU, src1, width); +} +#endif /* !COMPILE_TEMPLATE_MMX2 */ + +static av_always_inline void RENAME(bgr24ToY_mmx)(int16_t *dst, const uint8_t *src, + int width, enum PixelFormat srcFormat) +{ + + if(srcFormat == PIX_FMT_BGR24) { + __asm__ volatile( + "movq "MANGLE(ff_bgr24toY1Coeff)", %%mm5 \n\t" + "movq "MANGLE(ff_bgr24toY2Coeff)", %%mm6 \n\t" + : + ); + } else { + __asm__ 
volatile( + "movq "MANGLE(ff_rgb24toY1Coeff)", %%mm5 \n\t" + "movq "MANGLE(ff_rgb24toY2Coeff)", %%mm6 \n\t" + : + ); + } + + __asm__ volatile( + "movq "MANGLE(ff_bgr24toYOffset)", %%mm4 \n\t" + "mov %2, %%"REG_a" \n\t" + "pxor %%mm7, %%mm7 \n\t" + "1: \n\t" + PREFETCH" 64(%0) \n\t" + "movd (%0), %%mm0 \n\t" + "movd 2(%0), %%mm1 \n\t" + "movd 6(%0), %%mm2 \n\t" + "movd 8(%0), %%mm3 \n\t" + "add $12, %0 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "pmaddwd %%mm5, %%mm0 \n\t" + "pmaddwd %%mm6, %%mm1 \n\t" + "pmaddwd %%mm5, %%mm2 \n\t" + "pmaddwd %%mm6, %%mm3 \n\t" + "paddd %%mm1, %%mm0 \n\t" + "paddd %%mm3, %%mm2 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "paddd %%mm4, %%mm2 \n\t" + "psrad $9, %%mm0 \n\t" + "psrad $9, %%mm2 \n\t" + "packssdw %%mm2, %%mm0 \n\t" + "movq %%mm0, (%1, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : "+r" (src) + : "r" (dst+width), "g" ((x86_reg)-2*width) + : "%"REG_a + ); +} + +static void RENAME(bgr24ToY)(int16_t *dst, const uint8_t *src, + int width, uint32_t *unused) +{ + RENAME(bgr24ToY_mmx)(dst, src, width, PIX_FMT_BGR24); +} + +static void RENAME(rgb24ToY)(int16_t *dst, const uint8_t *src, + int width, uint32_t *unused) +{ + RENAME(bgr24ToY_mmx)(dst, src, width, PIX_FMT_RGB24); +} + +static av_always_inline void RENAME(bgr24ToUV_mmx)(int16_t *dstU, int16_t *dstV, + const uint8_t *src, int width, + enum PixelFormat srcFormat) +{ + __asm__ volatile( + "movq 24(%4), %%mm6 \n\t" + "mov %3, %%"REG_a" \n\t" + "pxor %%mm7, %%mm7 \n\t" + "1: \n\t" + PREFETCH" 64(%0) \n\t" + "movd (%0), %%mm0 \n\t" + "movd 2(%0), %%mm1 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "movq %%mm0, %%mm2 \n\t" + "movq %%mm1, %%mm3 \n\t" + "pmaddwd (%4), %%mm0 \n\t" + "pmaddwd 8(%4), %%mm1 \n\t" + "pmaddwd 16(%4), %%mm2 \n\t" + "pmaddwd %%mm6, %%mm3 \n\t" + "paddd %%mm1, %%mm0 \n\t" + "paddd %%mm3, %%mm2 \n\t" + + "movd 6(%0), %%mm1 \n\t" + "movd 8(%0), %%mm3 \n\t" + "add $12, %0 \n\t" + "punpcklbw %%mm7, %%mm1 \n\t" + "punpcklbw %%mm7, %%mm3 \n\t" + "movq %%mm1, %%mm4 \n\t" + "movq %%mm3, %%mm5 \n\t" + "pmaddwd (%4), %%mm1 \n\t" + "pmaddwd 8(%4), %%mm3 \n\t" + "pmaddwd 16(%4), %%mm4 \n\t" + "pmaddwd %%mm6, %%mm5 \n\t" + "paddd %%mm3, %%mm1 \n\t" + "paddd %%mm5, %%mm4 \n\t" + + "movq "MANGLE(ff_bgr24toUVOffset)", %%mm3 \n\t" + "paddd %%mm3, %%mm0 \n\t" + "paddd %%mm3, %%mm2 \n\t" + "paddd %%mm3, %%mm1 \n\t" + "paddd %%mm3, %%mm4 \n\t" + "psrad $9, %%mm0 \n\t" + "psrad $9, %%mm2 \n\t" + "psrad $9, %%mm1 \n\t" + "psrad $9, %%mm4 \n\t" + "packssdw %%mm1, %%mm0 \n\t" + "packssdw %%mm4, %%mm2 \n\t" + "movq %%mm0, (%1, %%"REG_a") \n\t" + "movq %%mm2, (%2, %%"REG_a") \n\t" + "add $8, %%"REG_a" \n\t" + " js 1b \n\t" + : "+r" (src) + : "r" (dstU+width), "r" (dstV+width), "g" ((x86_reg)-2*width), "r"(ff_bgr24toUV[srcFormat == PIX_FMT_RGB24]) + : "%"REG_a + ); +} + +static void RENAME(bgr24ToUV)(int16_t *dstU, int16_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + RENAME(bgr24ToUV_mmx)(dstU, dstV, src1, width, PIX_FMT_BGR24); + assert(src1 == src2); +} + +static void RENAME(rgb24ToUV)(int16_t *dstU, int16_t *dstV, + const uint8_t *src1, const uint8_t *src2, + int width, uint32_t *unused) +{ + assert(src1==src2); + RENAME(bgr24ToUV_mmx)(dstU, dstV, src1, width, PIX_FMT_RGB24); +} + +#if !COMPILE_TEMPLATE_MMX2 +// bilinear / bicubic scaling +static void RENAME(hScale)(int16_t *dst, int dstW, + const uint8_t *src, int srcW, + int xInc, const 
int16_t *filter, + const int16_t *filterPos, int filterSize) +{ + assert(filterSize % 4 == 0 && filterSize>0); + if (filterSize==4) { // Always true for upscaling, sometimes for down, too. + x86_reg counter= -2*dstW; + filter-= counter*2; + filterPos-= counter/2; + dst-= counter/2; + __asm__ volatile( +#if defined(PIC) + "push %%"REG_b" \n\t" +#endif + "pxor %%mm7, %%mm7 \n\t" + "push %%"REG_BP" \n\t" // we use 7 regs here ... + "mov %%"REG_a", %%"REG_BP" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "movzwl (%2, %%"REG_BP"), %%eax \n\t" + "movzwl 2(%2, %%"REG_BP"), %%ebx \n\t" + "movq (%1, %%"REG_BP", 4), %%mm1 \n\t" + "movq 8(%1, %%"REG_BP", 4), %%mm3 \n\t" + "movd (%3, %%"REG_a"), %%mm0 \n\t" + "movd (%3, %%"REG_b"), %%mm2 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "pmaddwd %%mm1, %%mm0 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + "movq %%mm0, %%mm4 \n\t" + "punpckldq %%mm3, %%mm0 \n\t" + "punpckhdq %%mm3, %%mm4 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "psrad $7, %%mm0 \n\t" + "packssdw %%mm0, %%mm0 \n\t" + "movd %%mm0, (%4, %%"REG_BP") \n\t" + "add $4, %%"REG_BP" \n\t" + " jnc 1b \n\t" + + "pop %%"REG_BP" \n\t" +#if defined(PIC) + "pop %%"REG_b" \n\t" +#endif + : "+a" (counter) + : "c" (filter), "d" (filterPos), "S" (src), "D" (dst) +#if !defined(PIC) + : "%"REG_b +#endif + ); + } else if (filterSize==8) { + x86_reg counter= -2*dstW; + filter-= counter*4; + filterPos-= counter/2; + dst-= counter/2; + __asm__ volatile( +#if defined(PIC) + "push %%"REG_b" \n\t" +#endif + "pxor %%mm7, %%mm7 \n\t" + "push %%"REG_BP" \n\t" // we use 7 regs here ... + "mov %%"REG_a", %%"REG_BP" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "movzwl (%2, %%"REG_BP"), %%eax \n\t" + "movzwl 2(%2, %%"REG_BP"), %%ebx \n\t" + "movq (%1, %%"REG_BP", 8), %%mm1 \n\t" + "movq 16(%1, %%"REG_BP", 8), %%mm3 \n\t" + "movd (%3, %%"REG_a"), %%mm0 \n\t" + "movd (%3, %%"REG_b"), %%mm2 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "pmaddwd %%mm1, %%mm0 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + + "movq 8(%1, %%"REG_BP", 8), %%mm1 \n\t" + "movq 24(%1, %%"REG_BP", 8), %%mm5 \n\t" + "movd 4(%3, %%"REG_a"), %%mm4 \n\t" + "movd 4(%3, %%"REG_b"), %%mm2 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "pmaddwd %%mm1, %%mm4 \n\t" + "pmaddwd %%mm2, %%mm5 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "paddd %%mm5, %%mm3 \n\t" + "movq %%mm0, %%mm4 \n\t" + "punpckldq %%mm3, %%mm0 \n\t" + "punpckhdq %%mm3, %%mm4 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "psrad $7, %%mm0 \n\t" + "packssdw %%mm0, %%mm0 \n\t" + "movd %%mm0, (%4, %%"REG_BP") \n\t" + "add $4, %%"REG_BP" \n\t" + " jnc 1b \n\t" + + "pop %%"REG_BP" \n\t" +#if defined(PIC) + "pop %%"REG_b" \n\t" +#endif + : "+a" (counter) + : "c" (filter), "d" (filterPos), "S" (src), "D" (dst) +#if !defined(PIC) + : "%"REG_b +#endif + ); + } else { + const uint8_t *offset = src+filterSize; + x86_reg counter= -2*dstW; + //filter-= counter*filterSize/2; + filterPos-= counter/2; + dst-= counter/2; + __asm__ volatile( + "pxor %%mm7, %%mm7 \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "mov %2, %%"REG_c" \n\t" + "movzwl (%%"REG_c", %0), %%eax \n\t" + "movzwl 2(%%"REG_c", %0), %%edx \n\t" + "mov %5, %%"REG_c" \n\t" + "pxor %%mm4, %%mm4 \n\t" + "pxor %%mm5, %%mm5 \n\t" + "2: \n\t" + "movq (%1), %%mm1 \n\t" + "movq (%1, %6), %%mm3 \n\t" + "movd (%%"REG_c", %%"REG_a"), %%mm0 \n\t" + "movd (%%"REG_c", %%"REG_d"), %%mm2 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "pmaddwd %%mm1, %%mm0 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + "paddd %%mm3, %%mm5 \n\t" + "paddd 
%%mm0, %%mm4 \n\t" + "add $8, %1 \n\t" + "add $4, %%"REG_c" \n\t" + "cmp %4, %%"REG_c" \n\t" + " jb 2b \n\t" + "add %6, %1 \n\t" + "movq %%mm4, %%mm0 \n\t" + "punpckldq %%mm5, %%mm4 \n\t" + "punpckhdq %%mm5, %%mm0 \n\t" + "paddd %%mm0, %%mm4 \n\t" + "psrad $7, %%mm4 \n\t" + "packssdw %%mm4, %%mm4 \n\t" + "mov %3, %%"REG_a" \n\t" + "movd %%mm4, (%%"REG_a", %0) \n\t" + "add $4, %0 \n\t" + " jnc 1b \n\t" + + : "+r" (counter), "+r" (filter) + : "m" (filterPos), "m" (dst), "m"(offset), + "m" (src), "r" ((x86_reg)filterSize*2) + : "%"REG_a, "%"REG_c, "%"REG_d + ); + } +} +#endif /* !COMPILE_TEMPLATE_MMX2 */ + +static inline void RENAME(hScale16)(int16_t *dst, int dstW, const uint16_t *src, int srcW, int xInc, + const int16_t *filter, const int16_t *filterPos, long filterSize, int shift) +{ + int i, j; + + assert(filterSize % 4 == 0 && filterSize>0); + if (filterSize==4 && shift<15) { // Always true for upscaling, sometimes for down, too. + x86_reg counter= -2*dstW; + filter-= counter*2; + filterPos-= counter/2; + dst-= counter/2; + __asm__ volatile( + "movd %5, %%mm7 \n\t" +#if defined(PIC) + "push %%"REG_b" \n\t" +#endif + "push %%"REG_BP" \n\t" // we use 7 regs here ... + "mov %%"REG_a", %%"REG_BP" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "movzwl (%2, %%"REG_BP"), %%eax \n\t" + "movzwl 2(%2, %%"REG_BP"), %%ebx \n\t" + "movq (%1, %%"REG_BP", 4), %%mm1 \n\t" + "movq 8(%1, %%"REG_BP", 4), %%mm3 \n\t" + "movq (%3, %%"REG_a", 2), %%mm0 \n\t" + "movq (%3, %%"REG_b", 2), %%mm2 \n\t" + "pmaddwd %%mm1, %%mm0 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + "movq %%mm0, %%mm4 \n\t" + "punpckldq %%mm3, %%mm0 \n\t" + "punpckhdq %%mm3, %%mm4 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "psrad %%mm7, %%mm0 \n\t" + "packssdw %%mm0, %%mm0 \n\t" + "movd %%mm0, (%4, %%"REG_BP") \n\t" + "add $4, %%"REG_BP" \n\t" + " jnc 1b \n\t" + + "pop %%"REG_BP" \n\t" +#if defined(PIC) + "pop %%"REG_b" \n\t" +#endif + : "+a" (counter) + : "c" (filter), "d" (filterPos), "S" (src), "D" (dst), "m"(shift) +#if !defined(PIC) + : "%"REG_b +#endif + ); + } else if (filterSize==8 && shift<15) { + x86_reg counter= -2*dstW; + filter-= counter*4; + filterPos-= counter/2; + dst-= counter/2; + __asm__ volatile( + "movd %5, %%mm7 \n\t" +#if defined(PIC) + "push %%"REG_b" \n\t" +#endif + "push %%"REG_BP" \n\t" // we use 7 regs here ... 
+ "mov %%"REG_a", %%"REG_BP" \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "movzwl (%2, %%"REG_BP"), %%eax \n\t" + "movzwl 2(%2, %%"REG_BP"), %%ebx \n\t" + "movq (%1, %%"REG_BP", 8), %%mm1 \n\t" + "movq 16(%1, %%"REG_BP", 8), %%mm3 \n\t" + "movq (%3, %%"REG_a", 2), %%mm0 \n\t" + "movq (%3, %%"REG_b", 2), %%mm2 \n\t" + "pmaddwd %%mm1, %%mm0 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + + "movq 8(%1, %%"REG_BP", 8), %%mm1 \n\t" + "movq 24(%1, %%"REG_BP", 8), %%mm5 \n\t" + "movq 8(%3, %%"REG_a", 2), %%mm4 \n\t" + "movq 8(%3, %%"REG_b", 2), %%mm2 \n\t" + "pmaddwd %%mm1, %%mm4 \n\t" + "pmaddwd %%mm2, %%mm5 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "paddd %%mm5, %%mm3 \n\t" + "movq %%mm0, %%mm4 \n\t" + "punpckldq %%mm3, %%mm0 \n\t" + "punpckhdq %%mm3, %%mm4 \n\t" + "paddd %%mm4, %%mm0 \n\t" + "psrad %%mm7, %%mm0 \n\t" + "packssdw %%mm0, %%mm0 \n\t" + "movd %%mm0, (%4, %%"REG_BP") \n\t" + "add $4, %%"REG_BP" \n\t" + " jnc 1b \n\t" + + "pop %%"REG_BP" \n\t" +#if defined(PIC) + "pop %%"REG_b" \n\t" +#endif + : "+a" (counter) + : "c" (filter), "d" (filterPos), "S" (src), "D" (dst), "m"(shift) +#if !defined(PIC) + : "%"REG_b +#endif + ); + } else if (shift<15){ + const uint16_t *offset = src+filterSize; + x86_reg counter= -2*dstW; + //filter-= counter*filterSize/2; + filterPos-= counter/2; + dst-= counter/2; + __asm__ volatile( + "movd %7, %%mm7 \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "mov %2, %%"REG_c" \n\t" + "movzwl (%%"REG_c", %0), %%eax \n\t" + "movzwl 2(%%"REG_c", %0), %%edx \n\t" + "mov %5, %%"REG_c" \n\t" + "pxor %%mm4, %%mm4 \n\t" + "pxor %%mm5, %%mm5 \n\t" + "2: \n\t" + "movq (%1), %%mm1 \n\t" + "movq (%1, %6), %%mm3 \n\t" + "movq (%%"REG_c", %%"REG_a", 2), %%mm0 \n\t" + "movq (%%"REG_c", %%"REG_d", 2), %%mm2 \n\t" + "pmaddwd %%mm1, %%mm0 \n\t" + "pmaddwd %%mm2, %%mm3 \n\t" + "paddd %%mm3, %%mm5 \n\t" + "paddd %%mm0, %%mm4 \n\t" + "add $8, %1 \n\t" + "add $8, %%"REG_c" \n\t" + "cmp %4, %%"REG_c" \n\t" + " jb 2b \n\t" + "add %6, %1 \n\t" + "movq %%mm4, %%mm0 \n\t" + "punpckldq %%mm5, %%mm4 \n\t" + "punpckhdq %%mm5, %%mm0 \n\t" + "paddd %%mm0, %%mm4 \n\t" + "psrad %%mm7, %%mm4 \n\t" + "packssdw %%mm4, %%mm4 \n\t" + "mov %3, %%"REG_a" \n\t" + "movd %%mm4, (%%"REG_a", %0) \n\t" + "add $4, %0 \n\t" + " jnc 1b \n\t" + + : "+r" (counter), "+r" (filter) + : "m" (filterPos), "m" (dst), "m"(offset), + "m" (src), "r" ((x86_reg)filterSize*2), "m"(shift) + : "%"REG_a, "%"REG_c, "%"REG_d + ); + } else + for (i=0; i<dstW; i++) { + int srcPos= filterPos[i]; + int val=0; + for (j=0; j<filterSize; j++) { + val += ((int)src[srcPos + j])*filter[filterSize*i + j]; + } + dst[i] = FFMIN(val>>shift, (1<<15)-1); // the cubic equation does overflow ... 
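The scalar fallback above is the clearest statement of what every hScale/hScale16 path in this template computes: for each output sample i, an FIR filter of length filterSize is applied to the input starting at filterPos[i], and the accumulated sum is shifted down and clipped. A minimal scalar sketch of the 8-bit variant, assuming FFMIN from libavutil/common.h and using an illustrative helper name (the fixed 7-bit shift mirrors the "psrad $7" in the MMX code above):

static void hScale_scalar_ref(int16_t *dst, int dstW, const uint8_t *src,
                              const int16_t *filter, const int16_t *filterPos,
                              int filterSize)
{
    int i, j;
    for (i = 0; i < dstW; i++) {
        int srcPos = filterPos[i];  /* start of the filter window in src */
        int val    = 0;
        for (j = 0; j < filterSize; j++)
            val += (int)src[srcPos + j] * filter[filterSize * i + j];
        /* same post-processing as the 16-bit fallback loop above */
        dst[i] = FFMIN(val >> 7, (1 << 15) - 1);
    }
}

The MMX/MMX2 versions above unroll this inner product with pmaddwd for the common filterSize 4 and 8 cases and otherwise fall back to the generic two-outputs-per-iteration loop.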
+ } +} + + +#if COMPILE_TEMPLATE_MMX2 +static void RENAME(hyscale_fast)(SwsContext *c, int16_t *dst, + int dstWidth, const uint8_t *src, + int srcW, int xInc) +{ + int16_t *filterPos = c->hLumFilterPos; + int16_t *filter = c->hLumFilter; + void *mmx2FilterCode= c->lumMmx2FilterCode; + int i; +#if defined(PIC) + DECLARE_ALIGNED(8, uint64_t, ebxsave); +#endif + + __asm__ volatile( +#if defined(PIC) + "mov %%"REG_b", %5 \n\t" +#endif + "pxor %%mm7, %%mm7 \n\t" + "mov %0, %%"REG_c" \n\t" + "mov %1, %%"REG_D" \n\t" + "mov %2, %%"REG_d" \n\t" + "mov %3, %%"REG_b" \n\t" + "xor %%"REG_a", %%"REG_a" \n\t" // i + PREFETCH" (%%"REG_c") \n\t" + PREFETCH" 32(%%"REG_c") \n\t" + PREFETCH" 64(%%"REG_c") \n\t" + +#if ARCH_X86_64 +#define CALL_MMX2_FILTER_CODE \ + "movl (%%"REG_b"), %%esi \n\t"\ + "call *%4 \n\t"\ + "movl (%%"REG_b", %%"REG_a"), %%esi \n\t"\ + "add %%"REG_S", %%"REG_c" \n\t"\ + "add %%"REG_a", %%"REG_D" \n\t"\ + "xor %%"REG_a", %%"REG_a" \n\t"\ + +#else +#define CALL_MMX2_FILTER_CODE \ + "movl (%%"REG_b"), %%esi \n\t"\ + "call *%4 \n\t"\ + "addl (%%"REG_b", %%"REG_a"), %%"REG_c" \n\t"\ + "add %%"REG_a", %%"REG_D" \n\t"\ + "xor %%"REG_a", %%"REG_a" \n\t"\ + +#endif /* ARCH_X86_64 */ + + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + +#if defined(PIC) + "mov %5, %%"REG_b" \n\t" +#endif + :: "m" (src), "m" (dst), "m" (filter), "m" (filterPos), + "m" (mmx2FilterCode) +#if defined(PIC) + ,"m" (ebxsave) +#endif + : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S, "%"REG_D +#if !defined(PIC) + ,"%"REG_b +#endif + ); + + for (i=dstWidth-1; (i*xInc)>>16 >=srcW-1; i--) + dst[i] = src[srcW-1]*128; +} + +static void RENAME(hcscale_fast)(SwsContext *c, int16_t *dst1, int16_t *dst2, + int dstWidth, const uint8_t *src1, + const uint8_t *src2, int srcW, int xInc) +{ + int16_t *filterPos = c->hChrFilterPos; + int16_t *filter = c->hChrFilter; + void *mmx2FilterCode= c->chrMmx2FilterCode; + int i; +#if defined(PIC) + DECLARE_ALIGNED(8, uint64_t, ebxsave); +#endif + + __asm__ volatile( +#if defined(PIC) + "mov %%"REG_b", %7 \n\t" +#endif + "pxor %%mm7, %%mm7 \n\t" + "mov %0, %%"REG_c" \n\t" + "mov %1, %%"REG_D" \n\t" + "mov %2, %%"REG_d" \n\t" + "mov %3, %%"REG_b" \n\t" + "xor %%"REG_a", %%"REG_a" \n\t" // i + PREFETCH" (%%"REG_c") \n\t" + PREFETCH" 32(%%"REG_c") \n\t" + PREFETCH" 64(%%"REG_c") \n\t" + + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + "xor %%"REG_a", %%"REG_a" \n\t" // i + "mov %5, %%"REG_c" \n\t" // src + "mov %6, %%"REG_D" \n\t" // buf2 + PREFETCH" (%%"REG_c") \n\t" + PREFETCH" 32(%%"REG_c") \n\t" + PREFETCH" 64(%%"REG_c") \n\t" + + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + CALL_MMX2_FILTER_CODE + +#if defined(PIC) + "mov %7, %%"REG_b" \n\t" +#endif + :: "m" (src1), "m" (dst1), "m" (filter), "m" (filterPos), + "m" (mmx2FilterCode), "m" (src2), "m"(dst2) +#if defined(PIC) + ,"m" (ebxsave) +#endif + : "%"REG_a, "%"REG_c, "%"REG_d, "%"REG_S, "%"REG_D +#if !defined(PIC) + ,"%"REG_b +#endif + ); + + for (i=dstWidth-1; (i*xInc)>>16 >=srcW-1; i--) { + dst1[i] = src1[srcW-1]*128; + dst2[i] = src2[srcW-1]*128; + } +} +#endif /* COMPILE_TEMPLATE_MMX2 */ + +static av_cold void RENAME(sws_init_swScale)(SwsContext *c) +{ + enum PixelFormat srcFormat = c->srcFormat, + dstFormat = c->dstFormat; + + if (!is16BPS(dstFormat) && !is9_OR_10BPS(dstFormat) && dstFormat != PIX_FMT_NV12 + && dstFormat != 
PIX_FMT_NV21 && !(c->flags & SWS_BITEXACT)) { + if (c->flags & SWS_ACCURATE_RND) { + c->yuv2yuv1 = RENAME(yuv2yuv1_ar ); + c->yuv2yuvX = RENAME(yuv2yuvX_ar ); + if (!(c->flags & SWS_FULL_CHR_H_INT)) { + switch (c->dstFormat) { + case PIX_FMT_RGB32: c->yuv2packedX = RENAME(yuv2rgb32_X_ar); break; + case PIX_FMT_BGR24: c->yuv2packedX = RENAME(yuv2bgr24_X_ar); break; + case PIX_FMT_RGB555: c->yuv2packedX = RENAME(yuv2rgb555_X_ar); break; + case PIX_FMT_RGB565: c->yuv2packedX = RENAME(yuv2rgb565_X_ar); break; + case PIX_FMT_YUYV422: c->yuv2packedX = RENAME(yuv2yuyv422_X_ar); break; + default: break; + } + } + } else { + int should_dither= isNBPS(c->srcFormat) || is16BPS(c->srcFormat); + c->yuv2yuv1 = should_dither ? RENAME(yuv2yuv1_ar ) : RENAME(yuv2yuv1 ); + c->yuv2yuvX = RENAME(yuv2yuvX ); + if (!(c->flags & SWS_FULL_CHR_H_INT)) { + switch (c->dstFormat) { + case PIX_FMT_RGB32: c->yuv2packedX = RENAME(yuv2rgb32_X); break; + case PIX_FMT_BGR24: c->yuv2packedX = RENAME(yuv2bgr24_X); break; + case PIX_FMT_RGB555: c->yuv2packedX = RENAME(yuv2rgb555_X); break; + case PIX_FMT_RGB565: c->yuv2packedX = RENAME(yuv2rgb565_X); break; + case PIX_FMT_YUYV422: c->yuv2packedX = RENAME(yuv2yuyv422_X); break; + default: break; + } + } + } + if (!(c->flags & SWS_FULL_CHR_H_INT)) { + switch (c->dstFormat) { + case PIX_FMT_RGB32: + c->yuv2packed1 = RENAME(yuv2rgb32_1); + c->yuv2packed2 = RENAME(yuv2rgb32_2); + break; + case PIX_FMT_BGR24: + c->yuv2packed1 = RENAME(yuv2bgr24_1); + c->yuv2packed2 = RENAME(yuv2bgr24_2); + break; + case PIX_FMT_RGB555: + c->yuv2packed1 = RENAME(yuv2rgb555_1); + c->yuv2packed2 = RENAME(yuv2rgb555_2); + break; + case PIX_FMT_RGB565: + c->yuv2packed1 = RENAME(yuv2rgb565_1); + c->yuv2packed2 = RENAME(yuv2rgb565_2); + break; + case PIX_FMT_YUYV422: + c->yuv2packed1 = RENAME(yuv2yuyv422_1); + c->yuv2packed2 = RENAME(yuv2yuyv422_2); + break; + default: + break; + } + } + } + +#if !COMPILE_TEMPLATE_MMX2 + c->hScale = RENAME(hScale ); +#endif /* !COMPILE_TEMPLATE_MMX2 */ + + // Use the new MMX scaler if the MMX2 one can't be used (it is faster than the x86 ASM one). 
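For the SWS_FAST_BILINEAR case selected just below, hyscale_fast/hcscale_fast call into the per-context filter code stored in c->lumMmx2FilterCode / c->chrMmx2FilterCode (the "call *%4" in RENAME(hyscale_fast) above) and walk the source with a 16.16 fixed-point step xInc. One plausible scalar rendering of the luma path, offered only as a sketch (the helper name is illustrative and the exact rounding of the generated code may differ; the <<7 scale and the src[srcW-1]*128 tail match the edge loops after the inline asm above):

static void hyscale_fast_scalar_ref(int16_t *dst, int dstWidth,
                                    const uint8_t *src, int srcW, int xInc)
{
    int i;
    unsigned int xpos = 0;
    for (i = 0; i < dstWidth; i++) {
        int xx     = xpos >> 16;                        /* integer source position      */
        int xalpha = (xpos & 0xFFFF) >> 9;              /* 7-bit interpolation fraction */
        int x1     = xx + 1 < srcW ? xx + 1 : srcW - 1; /* stay inside the source row   */
        dst[i] = (src[xx] << 7) + (src[x1] - src[xx]) * xalpha;
        xpos  += xInc;
    }
    /* same edge handling as the tail loops after the inline asm */
    for (i = dstWidth - 1; (i * xInc) >> 16 >= srcW - 1; i--)
        dst[i] = src[srcW - 1] * 128;
}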
+#if COMPILE_TEMPLATE_MMX2 + if (c->flags & SWS_FAST_BILINEAR && c->canMMX2BeUsed) + { + c->hyscale_fast = RENAME(hyscale_fast); + c->hcscale_fast = RENAME(hcscale_fast); + } else { +#endif /* COMPILE_TEMPLATE_MMX2 */ + c->hyscale_fast = NULL; + c->hcscale_fast = NULL; +#if COMPILE_TEMPLATE_MMX2 + } +#endif /* COMPILE_TEMPLATE_MMX2 */ + +#if !COMPILE_TEMPLATE_MMX2 + switch(srcFormat) { + case PIX_FMT_YUYV422 : c->chrToYV12 = RENAME(yuy2ToUV); break; + case PIX_FMT_UYVY422 : c->chrToYV12 = RENAME(uyvyToUV); break; + case PIX_FMT_NV12 : c->chrToYV12 = RENAME(nv12ToUV); break; + case PIX_FMT_NV21 : c->chrToYV12 = RENAME(nv21ToUV); break; + case PIX_FMT_GRAY16LE : + case PIX_FMT_YUV420P9LE: + case PIX_FMT_YUV422P10LE: + case PIX_FMT_YUV420P10LE: + case PIX_FMT_YUV420P16LE: + case PIX_FMT_YUV422P16LE: + case PIX_FMT_YUV444P16LE: c->hScale16= RENAME(hScale16); break; + } +#endif /* !COMPILE_TEMPLATE_MMX2 */ + if (!c->chrSrcHSubSample) { + switch(srcFormat) { + case PIX_FMT_BGR24 : c->chrToYV12 = RENAME(bgr24ToUV); break; + case PIX_FMT_RGB24 : c->chrToYV12 = RENAME(rgb24ToUV); break; + default: break; + } + } + + switch (srcFormat) { +#if !COMPILE_TEMPLATE_MMX2 + case PIX_FMT_YUYV422 : + case PIX_FMT_Y400A : + c->lumToYV12 = RENAME(yuy2ToY); break; + case PIX_FMT_UYVY422 : + c->lumToYV12 = RENAME(uyvyToY); break; +#endif /* !COMPILE_TEMPLATE_MMX2 */ + case PIX_FMT_BGR24 : c->lumToYV12 = RENAME(bgr24ToY); break; + case PIX_FMT_RGB24 : c->lumToYV12 = RENAME(rgb24ToY); break; + default: break; + } +#if !COMPILE_TEMPLATE_MMX2 + if (c->alpPixBuf) { + switch (srcFormat) { + case PIX_FMT_Y400A : c->alpToYV12 = RENAME(yuy2ToY); break; + default: break; + } + } +#endif /* !COMPILE_TEMPLATE_MMX2 */ + if(isAnyRGB(c->srcFormat)) + c->hScale16= RENAME(hScale16); +} diff --git a/libswscale/x86/yuv2rgb_mmx.c b/libswscale/x86/yuv2rgb_mmx.c index 6478311f00..df0e1a3726 100644 --- a/libswscale/x86/yuv2rgb_mmx.c +++ b/libswscale/x86/yuv2rgb_mmx.c @@ -34,6 +34,7 @@ #include "libswscale/swscale.h" #include "libswscale/swscale_internal.h" #include "libavutil/x86_cpu.h" +#include "libavutil/cpu.h" #define DITHER1XBPP // only for MMX @@ -46,57 +47,60 @@ DECLARE_ASM_CONST(8, uint64_t, pb_03) = 0x0303030303030303ULL; DECLARE_ASM_CONST(8, uint64_t, pb_07) = 0x0707070707070707ULL; //MMX versions +#if HAVE_MMX #undef RENAME -#undef HAVE_MMX2 -#undef HAVE_AMD3DNOW -#define HAVE_MMX2 0 -#define HAVE_AMD3DNOW 0 +#undef COMPILE_TEMPLATE_MMX2 +#define COMPILE_TEMPLATE_MMX2 0 #define RENAME(a) a ## _MMX #include "yuv2rgb_template.c" +#endif /* HAVE_MMX */ //MMX2 versions +#if HAVE_MMX2 #undef RENAME -#undef HAVE_MMX2 -#define HAVE_MMX2 1 +#undef COMPILE_TEMPLATE_MMX2 +#define COMPILE_TEMPLATE_MMX2 1 #define RENAME(a) a ## _MMX2 #include "yuv2rgb_template.c" +#endif /* HAVE_MMX2 */ SwsFunc ff_yuv2rgb_init_mmx(SwsContext *c) { - if (c->flags & SWS_CPU_CAPS_MMX2) { + int cpu_flags = av_get_cpu_flags(); + + if (c->srcFormat != PIX_FMT_YUV420P && + c->srcFormat != PIX_FMT_YUVA420P) + return NULL; + +#if HAVE_MMX2 + if (cpu_flags & AV_CPU_FLAG_MMX2) { switch (c->dstFormat) { - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->srcFormat == PIX_FMT_YUVA420P) { - if (HAVE_7REGS) return yuva420_rgb32_MMX2; - break; - } else return yuv420_rgb32_MMX2; - case PIX_FMT_BGR32: - if (CONFIG_SWSCALE_ALPHA && c->srcFormat == PIX_FMT_YUVA420P) { - if (HAVE_7REGS) return yuva420_bgr32_MMX2; - break; - } else return yuv420_bgr32_MMX2; case PIX_FMT_RGB24: return yuv420_rgb24_MMX2; case PIX_FMT_BGR24: return yuv420_bgr24_MMX2; - case PIX_FMT_RGB565: 
return yuv420_rgb16_MMX2; - case PIX_FMT_RGB555: return yuv420_rgb15_MMX2; } } - if (c->flags & SWS_CPU_CAPS_MMX) { +#endif + + if (cpu_flags & AV_CPU_FLAG_MMX) { switch (c->dstFormat) { - case PIX_FMT_RGB32: - if (CONFIG_SWSCALE_ALPHA && c->srcFormat == PIX_FMT_YUVA420P) { - if (HAVE_7REGS) return yuva420_rgb32_MMX; - break; - } else return yuv420_rgb32_MMX; - case PIX_FMT_BGR32: - if (CONFIG_SWSCALE_ALPHA && c->srcFormat == PIX_FMT_YUVA420P) { - if (HAVE_7REGS) return yuva420_bgr32_MMX; - break; - } else return yuv420_bgr32_MMX; - case PIX_FMT_RGB24: return yuv420_rgb24_MMX; - case PIX_FMT_BGR24: return yuv420_bgr24_MMX; - case PIX_FMT_RGB565: return yuv420_rgb16_MMX; - case PIX_FMT_RGB555: return yuv420_rgb15_MMX; + case PIX_FMT_RGB32: + if (c->srcFormat == PIX_FMT_YUVA420P) { +#if HAVE_7REGS && CONFIG_SWSCALE_ALPHA + return yuva420_rgb32_MMX; +#endif + break; + } else return yuv420_rgb32_MMX; + case PIX_FMT_BGR32: + if (c->srcFormat == PIX_FMT_YUVA420P) { +#if HAVE_7REGS && CONFIG_SWSCALE_ALPHA + return yuva420_bgr32_MMX; +#endif + break; + } else return yuv420_bgr32_MMX; + case PIX_FMT_RGB24: return yuv420_rgb24_MMX; + case PIX_FMT_BGR24: return yuv420_bgr24_MMX; + case PIX_FMT_RGB565: return yuv420_rgb16_MMX; + case PIX_FMT_RGB555: return yuv420_rgb15_MMX; } } diff --git a/libswscale/x86/yuv2rgb_template.c b/libswscale/x86/yuv2rgb_template.c index 8050932d1d..926e3fb9c4 100644 --- a/libswscale/x86/yuv2rgb_template.c +++ b/libswscale/x86/yuv2rgb_template.c @@ -25,14 +25,7 @@ #undef EMMS #undef SFENCE -#if HAVE_AMD3DNOW -/* On K6 femms is faster than emms. On K7 femms is directly mapped to emms. */ -#define EMMS "femms" -#else -#define EMMS "emms" -#endif - -#if HAVE_MMX2 +#if COMPILE_TEMPLATE_MMX2 #define MOVNTQ "movntq" #define SFENCE "sfence" #else @@ -50,17 +43,14 @@ if (h_size * depth > FFABS(dstStride[0])) \ h_size -= 8; \ \ - if (c->srcFormat == PIX_FMT_YUV422P) { \ - srcStride[1] *= 2; \ - srcStride[2] *= 2; \ - } \ + vshift = c->srcFormat != PIX_FMT_YUV422P; \ \ __asm__ volatile ("pxor %mm4, %mm4\n\t"); \ for (y = 0; y < srcSliceH; y++) { \ uint8_t *image = dst[0] + (y + srcSliceY) * dstStride[0]; \ const uint8_t *py = src[0] + y * srcStride[0]; \ - const uint8_t *pu = src[1] + (y >> 1) * srcStride[1]; \ - const uint8_t *pv = src[2] + (y >> 1) * srcStride[2]; \ + const uint8_t *pu = src[1] + (y >> vshift) * srcStride[1]; \ + const uint8_t *pv = src[2] + (y >> vshift) * srcStride[2]; \ x86_reg index = -h_size / 2; \ #define YUV2RGB_INITIAL_LOAD \ @@ -159,7 +149,8 @@ } \ #define YUV2RGB_ENDFUNC \ - __asm__ volatile (SFENCE"\n\t"EMMS); \ + __asm__ volatile (SFENCE"\n\t" \ + "emms \n\t"); \ return srcSliceH; \ #define IF0(x) @@ -188,12 +179,13 @@ "paddusb "GREEN_DITHER"(%4), %%mm2\n\t" \ "paddusb "RED_DITHER"(%4), %%mm1\n\t" \ +#if !COMPILE_TEMPLATE_MMX2 static inline int RENAME(yuv420_rgb15)(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(2) @@ -221,7 +213,7 @@ static inline int RENAME(yuv420_rgb16)(SwsContext *c, const uint8_t *src[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(2) @@ -243,6 +235,7 @@ static inline int RENAME(yuv420_rgb16)(SwsContext *c, const uint8_t *src[], YUV2RGB_OPERANDS YUV2RGB_ENDFUNC } +#endif /* !COMPILE_TEMPLATE_MMX2 */ #define RGB_PACK24(blue, red)\ "packuswb %%mm3, %%mm0 \n" /* R0 R2 R4 R6 R1 R3 R5 R7 */\ @@ -259,7 +252,7 @@ static inline int 
RENAME(yuv420_rgb16)(SwsContext *c, const uint8_t *src[], "punpckhwd %%mm6, %%mm5 \n" /* R4 G4 B4 R5 R6 G6 B6 R7 */\ RGB_PACK24_B -#if HAVE_MMX2 +#if COMPILE_TEMPLATE_MMX2 DECLARE_ASM_CONST(8, int16_t, mask1101[4]) = {-1,-1, 0,-1}; DECLARE_ASM_CONST(8, int16_t, mask0010[4]) = { 0, 0,-1, 0}; DECLARE_ASM_CONST(8, int16_t, mask0110[4]) = { 0,-1,-1, 0}; @@ -310,7 +303,7 @@ static inline int RENAME(yuv420_rgb24)(SwsContext *c, const uint8_t *src[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(3) @@ -328,7 +321,7 @@ static inline int RENAME(yuv420_bgr24)(SwsContext *c, const uint8_t *src[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(3) @@ -366,12 +359,13 @@ static inline int RENAME(yuv420_bgr24)(SwsContext *c, const uint8_t *src[], MOVNTQ " %%mm5, 16(%1)\n\t" \ MOVNTQ " %%mm"alpha", 24(%1)\n\t" \ +#if !COMPILE_TEMPLATE_MMX2 static inline int RENAME(yuv420_rgb32)(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(4) @@ -386,13 +380,13 @@ static inline int RENAME(yuv420_rgb32)(SwsContext *c, const uint8_t *src[], YUV2RGB_ENDFUNC } +#if HAVE_7REGS && CONFIG_SWSCALE_ALPHA static inline int RENAME(yuva420_rgb32)(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { -#if HAVE_7REGS - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(4) @@ -406,16 +400,15 @@ static inline int RENAME(yuva420_rgb32)(SwsContext *c, const uint8_t *src[], YUV2RGB_ENDLOOP(4) YUV2RGB_OPERANDS_ALPHA YUV2RGB_ENDFUNC -#endif - return 0; } +#endif static inline int RENAME(yuv420_bgr32)(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(4) @@ -430,13 +423,13 @@ static inline int RENAME(yuv420_bgr32)(SwsContext *c, const uint8_t *src[], YUV2RGB_ENDFUNC } +#if HAVE_7REGS && CONFIG_SWSCALE_ALPHA static inline int RENAME(yuva420_bgr32)(SwsContext *c, const uint8_t *src[], int srcStride[], int srcSliceY, int srcSliceH, uint8_t *dst[], int dstStride[]) { -#if HAVE_7REGS - int y, h_size; + int y, h_size, vshift; YUV2RGB_LOOP(4) @@ -450,6 +443,7 @@ static inline int RENAME(yuva420_bgr32)(SwsContext *c, const uint8_t *src[], YUV2RGB_ENDLOOP(4) YUV2RGB_OPERANDS_ALPHA YUV2RGB_ENDFUNC -#endif - return 0; } +#endif + +#endif /* !COMPILE_TEMPLATE_MMX2 */ diff --git a/libswscale/yuv2rgb.c b/libswscale/yuv2rgb.c index f8365ef567..36182a5ea9 100644 --- a/libswscale/yuv2rgb.c +++ b/libswscale/yuv2rgb.c @@ -32,8 +32,9 @@ #include "rgb2rgb.h" #include "swscale.h" #include "swscale_internal.h" -#include "libavutil/x86_cpu.h" +#include "libavutil/cpu.h" #include "libavutil/bswap.h" +#include "libavutil/pixdesc.h" extern const uint8_t dither_4x4_16[4][8]; extern const uint8_t dither_8x8_32[8][8]; @@ -366,28 +367,6 @@ YUV2RGBFUNC(yuv2rgb_c_16, uint16_t, 0) PUTRGB(dst_1,py_1,3); CLOSEYUV2RGBFUNC(8) -#if 0 // Currently unused -// This is exactly the same code as yuv2rgb_c_32 except for the types of -// r, g, b, dst_1, dst_2 -YUV2RGBFUNC(yuv2rgb_c_8, uint8_t, 0) - LOADCHROMA(0); - PUTRGB(dst_1,py_1,0); - PUTRGB(dst_2,py_2,0); - - LOADCHROMA(1); - PUTRGB(dst_2,py_2,1); - PUTRGB(dst_1,py_1,1); - - LOADCHROMA(2); - PUTRGB(dst_1,py_1,2); - PUTRGB(dst_2,py_2,2); - - LOADCHROMA(3); - PUTRGB(dst_2,py_2,3); 
- PUTRGB(dst_1,py_1,3); -CLOSEYUV2RGBFUNC(8) -#endif - // r, g, b, dst_1, dst_2 YUV2RGBFUNC(yuv2rgb_c_12_ordered_dither, uint16_t, 0) const uint8_t *d16 = dither_4x4_16[y&3]; @@ -441,36 +420,6 @@ YUV2RGBFUNC(yuv2rgb_c_8_ordered_dither, uint8_t, 0) PUTRGB8(dst_1,py_1,3,6); CLOSEYUV2RGBFUNC(8) -#if 0 // Currently unused -// This is exactly the same code as yuv2rgb_c_32 except for the types of -// r, g, b, dst_1, dst_2 -YUV2RGBFUNC(yuv2rgb_c_4, uint8_t, 0) - int acc; -#define PUTRGB4(dst,src,i) \ - Y = src[2*i]; \ - acc = r[Y] + g[Y] + b[Y]; \ - Y = src[2*i+1]; \ - acc |= (r[Y] + g[Y] + b[Y])<<4; \ - dst[i] = acc; - - LOADCHROMA(0); - PUTRGB4(dst_1,py_1,0); - PUTRGB4(dst_2,py_2,0); - - LOADCHROMA(1); - PUTRGB4(dst_2,py_2,1); - PUTRGB4(dst_1,py_1,1); - - LOADCHROMA(2); - PUTRGB4(dst_1,py_1,2); - PUTRGB4(dst_2,py_2,2); - - LOADCHROMA(3); - PUTRGB4(dst_2,py_2,3); - PUTRGB4(dst_1,py_1,3); -CLOSEYUV2RGBFUNC(4) -#endif - YUV2RGBFUNC(yuv2rgb_c_4_ordered_dither, uint8_t, 0) const uint8_t *d64 = dither_8x8_73[y&7]; const uint8_t *d128 = dither_8x8_220[y&7]; @@ -500,28 +449,6 @@ YUV2RGBFUNC(yuv2rgb_c_4_ordered_dither, uint8_t, 0) PUTRGB4D(dst_1,py_1,3,6); CLOSEYUV2RGBFUNC(4) -#if 0 // Currently unused -// This is exactly the same code as yuv2rgb_c_32 except for the types of -// r, g, b, dst_1, dst_2 -YUV2RGBFUNC(yuv2rgb_c_4b, uint8_t, 0) - LOADCHROMA(0); - PUTRGB(dst_1,py_1,0); - PUTRGB(dst_2,py_2,0); - - LOADCHROMA(1); - PUTRGB(dst_2,py_2,1); - PUTRGB(dst_1,py_1,1); - - LOADCHROMA(2); - PUTRGB(dst_1,py_1,2); - PUTRGB(dst_2,py_2,2); - - LOADCHROMA(3); - PUTRGB(dst_2,py_2,3); - PUTRGB(dst_1,py_1,3); -CLOSEYUV2RGBFUNC(8) -#endif - YUV2RGBFUNC(yuv2rgb_c_4b_ordered_dither, uint8_t, 0) const uint8_t *d64 = dither_8x8_73[y&7]; const uint8_t *d128 = dither_8x8_220[y&7]; @@ -579,29 +506,24 @@ CLOSEYUV2RGBFUNC(1) SwsFunc ff_yuv2rgb_get_func_ptr(SwsContext *c) { SwsFunc t = NULL; -#if HAVE_MMX - t = ff_yuv2rgb_init_mmx(c); -#endif -#if HAVE_VIS - t = ff_yuv2rgb_init_vis(c); -#endif -#if CONFIG_MLIB - t = ff_yuv2rgb_init_mlib(c); -#endif -#if HAVE_ALTIVEC - if (c->flags & SWS_CPU_CAPS_ALTIVEC) - t = ff_yuv2rgb_init_altivec(c); -#endif -#if ARCH_BFIN - if (c->flags & SWS_CPU_CAPS_BFIN) + if (HAVE_MMX) { + t = ff_yuv2rgb_init_mmx(c); + } else if (HAVE_VIS) { + t = ff_yuv2rgb_init_vis(c); + } else if (CONFIG_MLIB) { + t = ff_yuv2rgb_init_mlib(c); + } else if (HAVE_ALTIVEC) { + t = ff_yuv2rgb_init_altivec(c); + } else if (ARCH_BFIN) { t = ff_yuv2rgb_get_func_ptr_bfin(c); -#endif + } if (t) return t; - av_log(c, AV_LOG_WARNING, "No accelerated colorspace conversion found from %s to %s.\n", sws_format_name(c->srcFormat), sws_format_name(c->dstFormat)); + av_log(c, AV_LOG_WARNING, "No accelerated colorspace conversion found from %s to %s.\n", + av_get_pix_fmt_name(c->srcFormat), av_get_pix_fmt_name(c->dstFormat)); switch (c->dstFormat) { case PIX_FMT_BGR48BE: |