path: root/libswscale
* version.h: Bump minor post 6.0 branch (tag: n6.1-dev)
  Michael Niedermayer, 2023-02-19 (1 file changed, -1/+1)
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* version.h: Bump minor for 6.0 branch
  Michael Niedermayer, 2023-02-19 (1 file changed, -1/+1)
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* Bump major versions of all libraries
  James Almer, 2023-02-09 (2 files changed, -3/+3)
  Signed-off-by: James Almer <jamrial@gmail.com>
* sws/utils.c: Do not uselessly call initFilter() when unscaling
  Tomas Härdin, 2023-02-08 (1 file changed, -31/+31)
* x86: replace explicit REP_RETs with RETs
  Lynne, 2023-02-01 (6 files changed, -16/+16)
  From x86inc:
  > On AMD cpus <=K10, an ordinary ret is slow if it immediately follows
  > either a branch or a branch target. So switch to a 2-byte form of
  > ret in that case. We can automatically detect "follows a branch",
  > but not a branch target. (SSSE3 is a sufficient condition to know
  > that your cpu doesn't have this problem.)
  x86inc can automatically determine whether to use REP_RET rather than
  RET in most of these cases, so the impact is minimal. Additionally, a
  few REP_RETs were used unnecessarily, despite the return being nowhere
  near a branch.
  The only CPUs affected were AMD K10s, made between 2007 and 2011, i.e.
  16 and 12 years ago, respectively. In the future, everyone involved
  with x86inc should consider dropping REP_RETs altogether.
* swscale/utils: Fix indentation
  Andreas Rheinhardt, 2022-11-24 (1 file changed, -10/+10)
  Forgotten after c1eb3e7fecdc270e03a700d61ef941600a6af491.
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/utils: Derive range from YUVJ-pix-fmt only once
  Andreas Rheinhardt, 2022-11-24 (1 file changed, -8/+10)
  Currently, this is done once per slice thread, leading to one warning
  per slice thread in case a YUVJ pixel format has been originally
  used. This also fixes the anomaly that said parameters are only
  updated for the user-facing context (whose values are retrievable via
  av_opt_get()) if slice threading is not in use.
  Fixes ticket #9860.
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/utils: Move functions to avoid forward declarations
  Andreas Rheinhardt, 2022-11-24 (1 file changed, -207/+200)
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/utils: Avoid calling ff_thread_once() unnecessarily
  Andreas Rheinhardt, 2022-11-24 (1 file changed, -3/+4)
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/utils: Don't allocate AVFrames for slice contexts
  Andreas Rheinhardt, 2022-11-24 (1 file changed, -10/+5)
  Only the parent context's AVFrames are ever used.
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/utils: Factor initializing single slice context out
  Andreas Rheinhardt, 2022-11-24 (1 file changed, -10/+21)
  Initializing slice threads currently uses the same function
  (sws_init_context()) that is also used for initializing user-facing
  contexts, with the only difference being that nb_threads is set to
  one before initializing the slice contexts. Yet sws_init_context()
  also initializes lots of things that are not slice-dependent, e.g.
  (src|dst)Range. This currently only works because the code sets these
  fields to the same values for all slice contexts. This is not nice;
  even worse, it entails that log messages are printed once per slice
  context (and therefore fill the screen). This commit lays the
  groundwork to fix this.
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/input: Use more unsigned intermediates
  Michael Niedermayer, 2022-11-20 (1 file changed, -12/+12)
  Same principle as the previous commit: with sufficiently huge rgb2yuv
  table values this produces wrong results and undefined behavior. The
  unsigned arithmetic produces the same incorrect results; that is
  probably OK, as these cases with huge values do not seem to occur in
  any real use case.
  Fixes: signed integer overflow
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* swscale/input: Use unsigned intermediates in rgb64ToUV_c_template
  Jeremy Dorfman, 2022-11-20 (1 file changed, -3/+3)
  Large rgb2yuv tables and high pixel values cause the intermediate
  int32_t of ru*r + gu*g + bu*b to exceed INT_MAX, which is undefined
  behavior. This causes libswscale built with LLVM -fsanitize=undefined
  to assert. Using unsigned integers instead has defined behavior and
  produces identical results, and makes rgb64ToUV_c_template match
  rgb64ToY_c_template.
  Fixes: signed integer overflow
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
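  A minimal C sketch of the idea (coefficients and the final shift are
  illustrative, not libswscale's actual tables or code):

    #include <stdint.h>

    /* With 16-bit samples and large coefficients, the signed sum
     * ru*r + gu*g + bu*b can exceed INT32_MAX, which is undefined
     * behavior. The same arithmetic on uint32_t is well defined
     * (wraps modulo 2^32) and yields bit-identical results after
     * conversion back to a signed value. */
    static int32_t rgb_to_uv_sum(uint16_t r, uint16_t g, uint16_t b,
                                 int32_t ru, int32_t gu, int32_t bu)
    {
        uint32_t sum = (uint32_t)ru * r + (uint32_t)gu * g
                     + (uint32_t)bu * b;
        return (int32_t)sum >> 14; /* illustrative scaling shift */
    }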
* swscale/utils: Remove obsolete 3DNow reference
  Andreas Rheinhardt, 2022-11-09 (1 file changed, -2/+0)
  swscale does not use 3DNow any more since commit
  608319a311a31f7d85333a7b08286c00be38eab6.
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/output: Bias 16bps output calculations to improve non overflowing range for GBRP16/GBRPF32
  Michael Niedermayer, 2022-11-04 (2 files changed, -15/+26)
  Fixes: integer overflow
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* swscale/output: Bias 16bps output calculations to improve non overflowing range
  Michael Niedermayer, 2022-11-04 (1 file changed, -60/+60)
  Fixes: integer overflow
  Fixes:
    ./ffmpeg -f rawvideo -video_size 66x64 -pixel_format yuva420p10le \
      -i ~/videos/overflow_input_w66h64.yuva420p10le \
      -filter_complex "scale=flags=bicubic+full_chroma_int+full_chroma_inp+bitexact+accurate_rnd:in_color_matrix=bt2020:out_color_matrix=bt2020:in_range=full:out_range=full,format=rgba64[out]" \
      -pixel_format rgba64 -map '[out]' -y overflow_w66h64.png
  Found-by: Drew Dunne <asdunne@google.com>
  Tested-by: Drew Dunne <asdunne@google.com>
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
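  A hedged sketch of the biasing technique (the constants and scaling
  here are illustrative, not the commit's actual values):

    #include <stdint.h>

    /* Filter taps that swing negative can push a wide intermediate
     * sum below INT32_MIN. Starting the accumulator at a positive
     * bias re-centers the worst case inside int32_t; the bias is
     * removed again before the clipped 16-bit store. */
    static uint16_t filter_sample_biased(const int16_t *src,
                                         const int16_t *filt, int n)
    {
        const int32_t bias = 1 << 29;      /* illustrative midpoint */
        int32_t acc = bias;
        for (int i = 0; i < n; i++)
            acc += src[i] * filt[i];
        int32_t v = (acc - bias) >> 13;    /* remove bias, scale */
        if (v < 0)      v = 0;
        if (v > 0xffff) v = 0xffff;
        return (uint16_t)v;
    }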
* sw_scale: Add specializations for hscale 16 to 19
  Hubert Mazur, 2022-11-01 (2 files changed, -0/+468)
  Provide arm64 neon optimized implementations for hscale16To19 with
  filter sizes 4, 8 and X4. The tests and benchmarks run on AWS
  Graviton 2 instances. The results from a checkasm tool are shown
  below.
    hscale_16_to_19__fs_4_dstW_512_c:      6216.0
    hscale_16_to_19__fs_4_dstW_512_neon:   2257.0
    hscale_16_to_19__fs_8_dstW_512_c:     10417.7
    hscale_16_to_19__fs_8_dstW_512_neon:   3112.5
    hscale_16_to_19__fs_12_dstW_512_c:    14890.5
    hscale_16_to_19__fs_12_dstW_512_neon:  3899.0
    hscale_16_to_19__fs_16_dstW_512_c:    19006.5
    hscale_16_to_19__fs_16_dstW_512_neon:  5341.2
    hscale_16_to_19__fs_32_dstW_512_c:    36629.5
    hscale_16_to_19__fs_32_dstW_512_neon:  9502.7
    hscale_16_to_19__fs_40_dstW_512_c:    45477.5
    hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
  (Note: the checkasm tests for these functions haven't been merged
  since they fail on x86.)
  Signed-off-by: Hubert Mazur <hum@semihalf.com>
  Signed-off-by: Martin Storsjö <martin@martin.st>
* sw_scale: Add specializations for hscale 16 to 15
  Hubert Mazur, 2022-11-01 (2 files changed, -0/+468)
  Add arm64 neon implementations for hscale 16 to 15 with filter sizes
  4, 8 and X4. The tests and benchmarks run on AWS Graviton 2
  instances. The results from a checkasm tool are shown below.
    hscale_16_to_15__fs_4_dstW_512_c:      6703.5
    hscale_16_to_15__fs_4_dstW_512_neon:   2298.0
    hscale_16_to_15__fs_8_dstW_512_c:     10983.0
    hscale_16_to_15__fs_8_dstW_512_neon:   3216.5
    hscale_16_to_15__fs_12_dstW_512_c:    15526.0
    hscale_16_to_15__fs_12_dstW_512_neon:  3993.0
    hscale_16_to_15__fs_16_dstW_512_c:    20183.5
    hscale_16_to_15__fs_16_dstW_512_neon:  5369.7
    hscale_16_to_15__fs_32_dstW_512_c:    39315.2
    hscale_16_to_15__fs_32_dstW_512_neon:  9511.2
    hscale_16_to_15__fs_40_dstW_512_c:    48995.7
    hscale_16_to_15__fs_40_dstW_512_neon: 11570.0
  (Note: the checkasm tests for these functions haven't been merged
  since they fail on x86.)
  Signed-off-by: Hubert Mazur <hum@semihalf.com>
  Signed-off-by: Martin Storsjö <martin@martin.st>
* sw_scale: Add specializations for hscale 8 to 19
  Hubert Mazur, 2022-11-01 (2 files changed, -4/+300)
  Add arm64 neon implementations for hscale 8 to 19 with filter sizes
  4, 4X and 8. Both implementations are based on very similar ones
  dedicated to hscale 8 to 15. The major change concerns how the data
  is saved: instead of writing the result as int16_t it is done as
  int32_t. These functions are heavily inspired by patches provided by
  J. Swinney and M. Storsjö for hscale8to15, which were slightly
  adapted for hscale8to19. The tests and benchmarks run on AWS
  Graviton 2 instances. The results from a checkasm tool are shown
  below.
    hscale_8_to_19__fs_4_dstW_512_c:      5663.2
    hscale_8_to_19__fs_4_dstW_512_neon:   1259.7
    hscale_8_to_19__fs_8_dstW_512_c:      9306.0
    hscale_8_to_19__fs_8_dstW_512_neon:   2020.2
    hscale_8_to_19__fs_12_dstW_512_c:    12932.7
    hscale_8_to_19__fs_12_dstW_512_neon:  2462.5
    hscale_8_to_19__fs_16_dstW_512_c:    16844.2
    hscale_8_to_19__fs_16_dstW_512_neon:  4671.2
    hscale_8_to_19__fs_32_dstW_512_c:    32803.7
    hscale_8_to_19__fs_32_dstW_512_neon:  5474.2
    hscale_8_to_19__fs_40_dstW_512_c:    40948.0
    hscale_8_to_19__fs_40_dstW_512_neon:  6669.7
  Signed-off-by: Hubert Mazur <hum@semihalf.com>
  Signed-off-by: Martin Storsjö <martin@martin.st>
* swscale: aarch64: Fix yuv2rgb with negative strides
  Martin Storsjö, 2022-10-27 (1 file changed, -4/+4)
  Treat the 32 bit stride registers as signed. Alternatively, we could
  make the stride arguments ptrdiff_t instead of int and change all of
  the assembly to operate on these registers with their full 64 bit
  width, but that's a much larger and more intrusive change (and risks
  missing some operation, which would clamp the intermediates to 32 bit
  still).
  Fixes: https://trac.ffmpeg.org/ticket/9985
  Signed-off-by: Martin Storsjö <martin@martin.st>
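  In C terms, the bug amounts to zero- rather than sign-extending a
  32-bit stride before pointer arithmetic (a hedged illustration, not
  the actual assembly):

    #include <stdint.h>

    /* For a bottom-up image, the stride is negative. Sign-extension
     * keeps it negative in 64-bit address math; zero-extension would
     * turn -64 into 0xFFFFFFC0, a huge positive offset. */
    static uint8_t *next_row(uint8_t *p, int32_t stride)
    {
        return p + (int64_t)stride;    /* correct: sign-extend */
     /* return p + (uint32_t)stride;      wrong for stride < 0 */
    }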
* swscale: document some missing arguments
  Marvin Scholz, 2022-10-17 (1 file changed, -0/+10)
* swscale: Fix bogus doxy comment #ifdefs
  Marvin Scholz, 2022-10-17 (1 file changed, -10/+5)
  The intention here was probably to document this, as the use of
  conditionals does not make sense inside a comment.
  Fixes doxy warning:
    warning: explicit link request to 'if' could not be resolved
* libswscale: force a minimum size of the slice for bayer sources
  Chema Gonzalez, 2022-10-14 (1 file changed, -0/+1)
  Bayer sources are read in groups of 2 lines (e.g. for a BGGR flavor,
  the first row contains only B and G samples, while the second row
  contains only G and R samples). They need to be read as a whole.
  Signed-off-by: Anton Khirnov <anton@khirnov.net>
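  A hedged sketch of what such a constraint can look like (the helper
  name is illustrative, not the actual one-line patch):

    #include <stdbool.h>

    /* A Bayer mosaic carries a 2x2 CFA pattern, so every slice must
     * cover complete pairs of rows. */
    static int min_slice_height(bool is_bayer, int requested)
    {
        return (is_bayer && requested < 2) ? 2 : requested;
    }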
* sws/rgb2rgb: RISC-V 64-bit V packed YUYV/UYVY to planar 4:2:2
  Rémi Denis-Courmont, 2022-09-30 (2 files changed, -0/+63)
  This is currently 64-bit only because the stack spilling code would
  not assemble on RV32I (and it would corrupt s0 and s1 on RV128I, in
  theory). This could be added later in the unlikely event that someone
  wants it.
* sws/rgb2rgb: RISC-V V interleaveBytes
  Rémi Denis-Courmont, 2022-09-30 (2 files changed, -0/+30)
* sws/rgb2rgb: RISC-V V shuffle_bytes_xxxx functions
  Rémi Denis-Courmont, 2022-09-30 (5 files changed, -0/+130)
* swscale/output: Don't call av_pix_fmt_desc_get() in a loop
  Andreas Rheinhardt, 2022-09-19 (1 file changed, -42/+58)
  Up until now, libswscale/output.c used a macro to write an output
  pixel which involved a call to av_pix_fmt_desc_get() to find out
  whether the input pixel format is BE or LE, despite this being known
  at compile time (there are templates per pixfmt). Even worse, these
  calls are made in a loop, so that e.g. there are eight calls to
  av_pix_fmt_desc_get() for every pixel processed in
  yuv2rgba64_X_c_template() for 64bit RGB formats.
  This commit modifies these macros to ensure that isBE() is evaluated
  at compile time. This saved 41184B of .text for me (GCC 11.2, -O3).
  Of course, it also improved performance. E.g.
    ffmpeg_g -f lavfi -i testsrc2,format=yuva420p -pix_fmt rgba64le \
      -threads 1 -t 1:00 -f null -
  (which uses yuv2rgba64le_X_c, an invocation of
  yuv2rgba64_X_c_template() mentioned above) improved from 95589 to
  41387 decicycles for one call to yuv2packedX; for the BE variant the
  numbers went down from 76087 to 43024 decicycles.
  Reviewed-by: Anton Khirnov <anton@khirnov.net>
  Reviewed-by: Paul B Mahol <onemda@gmail.com>
  Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/input: Avoid calls to av_pix_fmt_desc_get()
  Andreas Rheinhardt, 2022-09-19 (1 file changed, -54/+68)
  Up until now, libswscale/input.c used a macro to read an input pixel
  which involved a call to av_pix_fmt_desc_get() to find out whether
  the input pixel format is BE or LE, despite this being known at
  compile time (there are templates per pixfmt). Even worse, these
  calls are made in a loop, so that e.g. there are six calls to
  av_pix_fmt_desc_get() for every pair of UV pixels processed in
  rgb64ToUV_half_c_template().
  This commit modifies these macros to ensure that isBE() is evaluated
  at compile time. This saved 9743B of .text for me (GCC 11.2, -O3).
  For a simple RGB64LE->YUV420P transformation like
    ffmpeg -f lavfi -i haldclutsrc,format=rgba64le -pix_fmt yuv420p \
      -threads 1 -t 1:00 -f null -
  the amount of decicycles spent in rgb64LEToUV_half_c (which is
  created via the template mentioned above) decreases from 19751 to
  5341; for RGBA64BE the number went down from 11945 to 5393. For
  shared builds (where the call to av_pix_fmt_desc_get() is indirect)
  the old numbers are 15230 for RGBA64BE and 27502 for RGBA64LE,
  whereas the numbers with this patch are indistinguishable from those
  of a static build.
  Also make the macros that are touched conform to the usual convention
  of using uppercase names while at it.
  Reviewed-by: Anton Khirnov <anton@khirnov.net>
  Reviewed-by: Paul B Mahol <onemda@gmail.com>
  Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
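  A hedged sketch of the technique shared by this and the previous
  commit (macro names are simplified; the real input.c macros differ):

    #include "libavutil/intreadwrite.h"
    #include "libavutil/pixdesc.h"

    /* before: endianness queried at run time, inside the pixel loop */
    #define INPUT_PIXEL_RUNTIME(pos, origin) \
        ((av_pix_fmt_desc_get(origin)->flags & AV_PIX_FMT_FLAG_BE) ? \
            AV_RB16(pos) : AV_RL16(pos))

    /* after: 'is_be' is a compile-time constant in each per-pixfmt
     * template, so the ternary folds away and no descriptor lookup
     * remains in the loop */
    #define INPUT_PIXEL_COMPTIME(pos, is_be) \
        ((is_be) ? AV_RB16(pos) : AV_RL16(pos))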
* swscale/la: Add output_lasx.c file.
  Hao Chen, 2022-09-10 (4 files changed, -1/+1993)
  Benchmark:
    ffmpeg -i 1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -s 640x480 \
      -pix_fmt rgb24 -y /dev/null -an
  before: 150fps
  after:  183fps
  Signed-off-by: Hao Chen <chenhao@loongson.cn>
  Reviewed-by: yinshiyou-hf@loongson.cn
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* swscale/la: Add yuv2rgb_lasx.c and rgb2rgb_lasx.c files
  Hao Chen, 2022-09-10 (8 files changed, -0/+444)
  Benchmark:
    ffmpeg -i 1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -pix_fmt rgb24 \
      -y /dev/null -an
  before: 178fps
  after:  210fps
  Signed-off-by: Hao Chen <chenhao@loongson.cn>
  Reviewed-by: yinshiyou-hf@loongson.cn
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* swscale/la: Optimize hscale functions with lasx.
  Hao Chen, 2022-09-10 (8 files changed, -1/+1293)
  Benchmark:
    ffmpeg -i 1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -s 640x480 \
      -y /dev/null -an
  before: 101fps
  after:  138fps
  Signed-off-by: Hao Chen <chenhao@loongson.cn>
  Reviewed-by: yinshiyou-hf@loongson.cn
  Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* swscale/output: add support for Y210LE and Y212LE
  Philip Langdale, 2022-09-10 (3 files changed, -3/+51)
* swscale/output: add support for XV30LE
  Philip Langdale, 2022-09-10 (3 files changed, -2/+33)
* swscale/output: add support for XV36LE
  Philip Langdale, 2022-09-10 (3 files changed, -2/+31)
* swscale/output: add support for P012
  Philip Langdale, 2022-09-10 (3 files changed, -62/+84)
  This generalises the existing P010 support.
* swscale/input: Remove spec-incompliant ';'
  Andreas Rheinhardt, 2022-09-08 (1 file changed, -5/+5)
  These macros are definitions, not only declarations, and therefore
  should not contain a semicolon. Such a semicolon is actually
  spec-incompliant, but compilers happen to accept them.
  Reviewed-by: Philip Langdale <philipl@overt.org>
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
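  An illustration of the issue (the macro body is a stand-in, not the
  actual input.c code):

    #include <stdint.h>

    /* The macro expands to a complete function definition... */
    #define DEF_READER(name)                                        \
    static void name(uint8_t *dst, const uint8_t *src, int width)  \
    {                                                               \
        for (int i = 0; i < width; i++)                             \
            dst[i] = src[2 * i];                                    \
    }

    /* ...so the point of use must not add a ';'. A semicolon here
     * would leave a stray empty declaration at file scope, which
     * ISO C forbids even though compilers tend to accept it. */
    DEF_READER(read_even_bytes)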
* swscale/input: add support for Y212LE
  Philip Langdale, 2022-09-06 (3 files changed, -16/+32)
* swscale/input: add support for XV30LE
  Philip Langdale, 2022-09-06 (3 files changed, -1/+27)
* swscale/input: add support for P012
  Philip Langdale, 2022-09-06 (3 files changed, -60/+67)
  As we now have three of these formats, I added macros to generate the
  conversion functions.
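  A hedged sketch of such depth-parameterized macros (the names and
  shift handling are illustrative, not the actual patch):

    #include <stdint.h>

    /* P010/P012/P016 differ only in how many high bits of each 16-bit
     * word carry the sample, so one macro can stamp out each reader. */
    #define P01X_UV_FUNC(bits)                                          \
    static void p0##bits##le_to_uv(uint16_t *dstU, uint16_t *dstV,      \
                                   const uint16_t *srcUV, int width)    \
    {                                                                   \
        for (int i = 0; i < width; i++) {                               \
            dstU[i] = srcUV[2 * i]     >> (16 - bits);                  \
            dstV[i] = srcUV[2 * i + 1] >> (16 - bits);                  \
        }                                                               \
    }

    P01X_UV_FUNC(10) /* defines p010le_to_uv */
    P01X_UV_FUNC(12) /* defines p012le_to_uv */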
* swscale/input: add support for XV36LEPhilip Langdale2022-09-063-1/+27
|
* libswscale: add support for VUYX format
  Philip Langdale, 2022-08-25 (4 files changed, -9/+39)
  As we already have support for VUYA, I figured I should do the small
  amount of work to support VUYX as well. That means a little
  refactoring to share code.
* swscale/x86/rgb_2_rgb: Empty MMX state in ff_shuffle_bytes_2103_mmxext
  Andreas Rheinhardt, 2022-08-23 (1 file changed, -0/+1)
  Fixes FATE failures with the filter-2xbr, filter-3xbr, filter-4xbr,
  filter-ep2x, filter-ep3x, filter-hq2x, filter-hq3x, filter-hq4x,
  filter-paletteuse-bayer, filter-paletteuse-bayer0,
  filter-paletteuse-nodither and filter-paletteuse-sierra2_4a tests
  when using 32bit x86 with CPUFLAGS ranging from "mmx+mmxext" to
  "mmx+mmxext+sse+sse2+sse3" (the relevant function is only overwritten
  when using SSSE3).
  Reviewed-by: Lynne <dev@lynne.ee>
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
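  Background, with a hedged intrinsics illustration (the actual fix is
  a single 'emms' added to the assembly):

    #include <stdint.h>
    #include <mmintrin.h>

    /* MMX registers alias the x87 FPU register stack, so any routine
     * that touches MMX must clear that state before other code does
     * floating-point math, or the FPU sees garbage. */
    static void mmx_work_then_cleanup(uint8_t *dst, const uint8_t *src)
    {
        __m64 v = *(const __m64 *)src;  /* ... MMX processing ... */
        *(__m64 *)dst = v;
        _mm_empty();                    /* emits the 'emms' instruction */
    }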
* swscale/input: add rgbaf16 input support
  Timo Rothenpieler, 2022-08-19 (7 files changed, -2/+171)
  This is by no means perfect, since at least ddagrab will return scRGB
  data with values outside of 0.0f to 1.0f for HDR values. Its primary
  purpose is to be able to work with the format at all.
* swscale: add opaque parameter to input functions
  Timo Rothenpieler, 2022-08-19 (4 files changed, -85/+106)
* swscale/x86/yuv2yuvX: Remove unused ff_yuv2yuvX_mmx()
  Andreas Rheinhardt, 2022-08-19 (1 file changed, -2/+0)
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* libswscale: Enable hscale_avx2 for all input sizes.
  Alan Kelly, 2022-08-18 (2 files changed, -7/+18)
  ff_shuffle_filter_coefficients shuffles the tail as required.
  Signed-off-by: Anton Khirnov <anton@khirnov.net>
* sws: allow avx2 hscale to process inputs of any size.
  Alan Kelly, 2022-08-18 (1 file changed, -1/+43)
  The main loop processes blocks of 16 pixels. The tail processes
  blocks of size 4.
  Signed-off-by: Anton Khirnov <anton@khirnov.net>
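  The loop structure, sketched in C (the kernel names are hypothetical;
  the real kernels are AVX2 assembly):

    #include <stdint.h>

    /* Hypothetical kernel stubs: each produces one block of output
     * pixels starting at horizontal position 'pos'. */
    void hscale_block16(int16_t *dst, const uint8_t *src, int pos);
    void hscale_block4(int16_t *dst, const uint8_t *src, int pos);

    /* Wide kernel for multiples of 16 output pixels, 4-pixel kernel
     * for the remainder; together they cover any dstW that is a
     * multiple of 4, as in the commit. */
    static void hscale_any(int16_t *dst, const uint8_t *src, int dstW)
    {
        int i = 0;
        for (; i + 16 <= dstW; i += 16)   /* main loop */
            hscale_block16(dst + i, src, i);
        for (; i < dstW; i += 4)          /* tail */
            hscale_block4(dst + i, src, i);
    }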
* sws: Replace call to yuv2yuvX_mmx by yuv2yuvX_mmxext
  Alan Kelly, 2022-08-18 (1 file changed, -5/+2)
  Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* swscale/aarch64: add vscale specializations
  Swinney, Jonathan, 2022-08-16 (2 files changed, -0/+190)
  This commit adds new code paths for vscale when filterSize is 2, 4,
  or 8. By using specialized code with unrolling to match the
  filterSize we can improve performance.
  On AWS c7g (Graviton 3, Neoverse V1) instances:
                                      before   after
    yuv2yuvX_2_0_512_accurate_neon:    558.8   268.9
    yuv2yuvX_4_0_512_accurate_neon:    637.5   434.9
    yuv2yuvX_8_0_512_accurate_neon:   1144.8   806.2
    yuv2yuvX_16_0_512_accurate_neon:  2080.5  1853.7
  Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
  Signed-off-by: Martin Storsjö <martin@martin.st>
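  A hedged C sketch of the dispatch-plus-unrolling idea (function names
  are hypothetical; the real specializations are NEON assembly):

    #include <stdint.h>

    /* Fully unrolled 2-tap vertical scale: no inner loop over taps. */
    static void vscale_2tap(int16_t *dst, const int16_t *const *src,
                            const int16_t *filter, int dstW)
    {
        for (int i = 0; i < dstW; i++)
            dst[i] = (src[0][i] * filter[0] +
                      src[1][i] * filter[1]) >> 12; /* illustrative shift */
    }

    static void vscale(int16_t *dst, const int16_t *const *src,
                       const int16_t *filter, int filterSize, int dstW)
    {
        switch (filterSize) {
        case 2: vscale_2tap(dst, src, filter, dstW); break;
        /* case 4 and case 8 get their own unrolled variants */
        default: /* generic loop over filterSize */ break;
        }
    }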
* swscale/aarch64: vscale optimization
  Swinney, Jonathan, 2022-08-16 (1 file changed, -6/+5)
  Use scalar-times-vector multiply-accumulate instructions instead of
  vector-times-vector to remove the need for replicating load
  instructions, which are slightly slower.
  On AWS c7g (Graviton 3, Neoverse V1) instances:
                                      before   after
    yuv2yuvX_8_0_512_accurate_neon:   1144.8   987.4
    yuv2yuvX_16_0_512_accurate_neon:  2080.5  1869.4
  Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
  Signed-off-by: Martin Storsjö <martin@martin.st>
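  In NEON intrinsics terms, the change looks roughly like this (an
  illustration, not the actual assembly kernel):

    #include <arm_neon.h>

    /* Accumulate 4 pixels against one filter coefficient. */
    static int32x4_t mla_4px(int32x4_t acc, int16x4_t px, int16_t coef)
    {
        /* vector x vector: the coefficient must be broadcast first,
         *   acc = vmlal_s16(acc, px, vdup_n_s16(coef));
         * scalar x vector: one instruction, no broadcast register: */
        return vmlal_n_s16(acc, px, coef);
    }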