SSIMULACRA2

SSIMULACRA2 is a visual fidelity metric based on the concept of the multi-scale structural similarity index measure (MS-SSIM). It is computed in a perceptually relevant color space, adds two further (asymmetric) error maps, and aggregates the results using two different norms. It is currently the most reputable visual quality metric in terms of correlation with subjective results, and it is considered a very robust means of comparing encoders. Whether Butteraugli is better at very high fidelity is debatable, but SSIMULACRA2 is widely considered the best metric for medium- and low-fidelity comparisons.

Scoring

The score that SSIMULACRA2 outputs is simple: a number in the range (-∞, 100]. According to the developers of the metric, for image quality assessment, SSIMULACRA2 scores correlate with subjective visual quality as follows:

  • Very high quality: 90 and above
  • High quality: 70 to 90
  • Medium quality: 50 to 70
  • Low quality: Below 50

Metric Breakdown

A step-by-step description of the SSIMULACRA2 metric follows. The steps assume two input images: a reference image, and a distorted image to be compared against it and scored.

Convert sRGB to Linear RGB

Undo the sRGB gamma curve to obtain linear-light values for each RGB channel. This converts perceptually encoded pixel values into physically meaningful intensities.
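As a minimal sketch, the standard sRGB inverse transfer function (the piecewise curve from the sRGB specification) looks like this:

```python
def srgb_to_linear(c):
    """Invert the sRGB transfer curve for one channel value in [0, 1]."""
    if c <= 0.04045:
        # Linear toe segment near black.
        return c / 12.92
    # Power-law segment for the rest of the range.
    return ((c + 0.055) / 1.055) ** 2.4

# A mid-gray sRGB value of 0.5 corresponds to about 21.4% linear light.
print(round(srgb_to_linear(0.5), 4))  # prints 0.214
```

In a real implementation this is applied per channel to every pixel, usually vectorized rather than called per value.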

Transform to XYB

Map linear RGB into an opsin-inspired space that separates perceptual channels: apply an absorbance-like linear transform, clamp negatives, take a cube root to compress dynamic range, add small biases, then mix into three channels (X, Y, B). The resulting channels are tuned to align better with human vision than raw RGB.
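The shape of the transform can be sketched as below. Note that the mixing weights and the bias here are illustrative placeholders, not the tuned constants from libjxl:

```python
def linear_rgb_to_xyb(r, g, b):
    """Sketch of an XYB-style transform. The matrix rows and the bias are
    placeholder values, NOT the tuned opsin constants from libjxl."""
    # Absorbance-like mix of linear RGB into cone-like (L, M, S) signals,
    # with negatives clamped to zero.
    M = [(0.30, 0.62, 0.08),
         (0.23, 0.69, 0.08),
         (0.24, 0.20, 0.56)]  # placeholder rows
    bias = 0.0038             # placeholder small bias
    lms = [max(0.0, m0 * r + m1 * g + m2 * b) + bias for m0, m1, m2 in M]
    # Cube root compresses dynamic range, mimicking photoreceptor response.
    l_, m_, s_ = (v ** (1.0 / 3.0) for v in lms)
    # Opponent mixing: X is red-green, Y is luminance-like, B is blue-ish.
    return (l_ - m_) / 2.0, (l_ + m_) / 2.0, s_
```

For a neutral gray input the X (red-green opponent) channel comes out as zero, which is the behavior the opponent mixing is meant to produce.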

Normalize Values

Apply slight shifts and scalings so all channel values are positive and stable for later statistical operations. This prevents division instability and extreme ratios in subsequent steps.
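Conceptually this step is just a per-channel affine adjustment; the constants below are placeholders, as SSIMULACRA2 tunes them per channel:

```python
def normalize(chan, shift=0.01, scale=1.0):
    """Shift and scale a channel so all values stay positive, keeping later
    ratio-based statistics stable. The constants here are placeholders."""
    return [scale * (v + shift) for v in chan]
```

Keeping values strictly positive means the variance and covariance ratios computed later never divide by values at or near zero.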

Build a Multi-Scale Image Pyramid

Create multiple downscaled versions of both images (typically several scales, each half the previous dimension). Each scale captures structure at a different spatial frequency. The metric computes statistics independently at every scale.
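A minimal pyramid construction, using 2x2 box averaging for the downscale (the number of scales here is illustrative):

```python
def downscale_2x(img):
    """Halve both dimensions by averaging each 2x2 block (img: list of rows)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(w)] for y in range(h)]

def build_pyramid(img, num_scales=6):
    """Return the image at successively halved resolutions."""
    scales = [img]
    while (len(scales) < num_scales
           and len(scales[-1]) >= 2 and len(scales[-1][0]) >= 2):
        scales.append(downscale_2x(scales[-1]))
    return scales
```

Both the reference and the distorted image are run through the same pyramid, and all later statistics are computed pairwise at each scale.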

Blurred Statistics

For each scale and each perceptual channel, compute local statistics at every pixel using a separable spatial blur:

  • Local mean (blurred image).
  • Local second moments: blurred squared values and blurred cross-products between reference and distorted channels.

These give local variance and covariance estimates analogous to SSIM’s variance/covariance.
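The identities involved are var(x) = E[x²] − E[x]² and cov(x, y) = E[xy] − E[x]E[y], with the blur playing the role of the local expectation. A 1-D sketch using a box blur in place of the separable Gaussian:

```python
def box_blur(vals, radius=1):
    """Simple 1-D box blur standing in for the separable spatial blur."""
    n = len(vals)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(vals[lo:hi]) / (hi - lo))
    return out

def local_stats(ref, dis, radius=1):
    """Per-pixel local mean, variance, and covariance from blurred moments."""
    mu_r, mu_d = box_blur(ref, radius), box_blur(dis, radius)
    rr = box_blur([v * v for v in ref], radius)       # blurred squares
    dd = box_blur([v * v for v in dis], radius)
    rd = box_blur([a * b for a, b in zip(ref, dis)], radius)  # cross-products
    var_r = [m2 - m * m for m2, m in zip(rr, mu_r)]
    var_d = [m2 - m * m for m2, m in zip(dd, mu_d)]
    cov = [m2 - a * b for m2, a, b in zip(rd, mu_r, mu_d)]
    return mu_r, mu_d, var_r, var_d, cov
```

When the two inputs are identical, the covariance equals the variance at every pixel, which is the condition under which the similarity map below saturates at 1.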

Similarity & Artifacts

Similarity map (SSIM-like): Combine local means, variances, and covariance into a per-pixel similarity value. This value measures structural agreement while using small stabilizing constants to avoid division by zero. Similarity is clamped to a sensible range and emphasizes structural fidelity.
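The classic SSIM combination conveys the idea; SSIMULACRA2 uses a related but differently tuned form:

```python
def ssim_like(mu_r, mu_d, var_r, var_d, cov, c1=1e-4, c2=9e-4):
    """Per-pixel SSIM-style similarity from local statistics. c1 and c2 are
    small stabilizing constants (values here are illustrative)."""
    num = (2 * mu_r * mu_d + c1) * (2 * cov + c2)
    den = (mu_r ** 2 + mu_d ** 2 + c1) * (var_r + var_d + c2)
    return num / den
```

When means, variances, and covariance all agree, the value is exactly 1; a drop in covariance (structural disagreement) pulls it toward 0.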

Edge / artifact map: Compare local deviations from local means between distorted and reference images to detect:

  • New artifacts (excess local detail or harsh edges introduced by distortion).
  • Lost detail (original detail suppressed or blurred).

Compute per-pixel artifact and lost-detail measures, and preserve higher-order statistics to capture outliers.
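The asymmetry between the two error directions can be sketched as below; the real maps are blurred and tuned far more carefully than this:

```python
def artifact_maps(ref, dis, mu_r, mu_d):
    """Conceptual per-pixel error maps: 'artifact' fires where the distorted
    image has MORE local deviation than the reference, 'lost' where it has
    LESS. This is an illustration, not SSIMULACRA2's exact formulation."""
    artifact, lost = [], []
    for r, d, mr, md in zip(ref, dis, mu_r, mu_d):
        dev_r, dev_d = abs(r - mr), abs(d - md)
        artifact.append(max(0.0, dev_d - dev_r))  # new detail / ringing
        lost.append(max(0.0, dev_r - dev_d))      # blurred-away detail
    return artifact, lost
```

Splitting the error into two signed halves is what makes the metric asymmetric: smoothing and ringing can be weighted differently, which a plain squared-error map cannot do.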

Aggregate statistics per channel and scale

For each channel and scale, compute compact summaries:

  • Mean of the per-pixel similarity and artifact/lost-detail measures.
  • Higher-order moment summaries (fourth-moment based measures reduced by a fourth-root) to detect heavy tails and strong local errors.

These condensed statistics encode both average behavior and extreme localized errors.
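These two summaries are the "two different norms" mentioned at the top: the 1-norm (mean) captures average severity, while the 4-norm emphasizes rare but severe errors. A sketch:

```python
def aggregate(values):
    """Summarize a per-pixel error map with two norms: the mean, and the
    fourth-root of the mean fourth power (sensitive to heavy tails)."""
    n = len(values)
    one_norm = sum(values) / n
    four_norm = (sum(v ** 4 for v in values) / n) ** 0.25
    return one_norm, four_norm
```

For a uniform map the two norms coincide; concentrating the same total error into a few pixels leaves the mean unchanged but raises the 4-norm, which is exactly the outlier sensitivity the metric wants.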

Weighted combination of statistics

Multiply each aggregated statistic by a pre-tuned weight. Sum all weighted terms across channels and scales to produce a single scalar accumulator. Weights are learned/tuned to map the diverse statistics into a perceptually meaningful predictor.

Nonlinear mapping to final score

Pass the accumulator through a nonlinear curve (polynomial and a power-law transform). This mapping compresses the predictor into a bounded perceptual score. The final value is expressed on a convenient scale where higher means better and values near the top indicate imperceptible differences.
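The last two steps can be sketched together. The weights and curve parameters below are placeholders; the real metric uses a large set of tuned weights and a specific polynomial:

```python
def final_score(stats, weights, a=0.5, b=2.0):
    """Weighted sum of aggregated statistics followed by a nonlinear map onto
    a 'higher is better' scale capped at 100. The weights and the (a, b)
    curve parameters here are placeholders, not the tuned values."""
    acc = sum(w * s for w, s in zip(weights, stats))
    # Power-law style compression: zero accumulated error maps to 100.
    return 100.0 - a * (acc ** b)
```

An identical image pair yields zero for every error statistic, so the accumulator is zero and the score is exactly 100; growing error pushes the score down without bound, matching the (-∞, 100] range.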

Notes

  • The metric examines structure and edge behavior separately. It penalizes both new artifacts and lost detail.
  • Multi-scale analysis makes it sensitive to distortions at different spatial frequencies.
  • Aggregating higher-order moments preserves sensitivity to rare but visually important outliers.

Implementations

There are several different SSIMULACRA2 implementations available, each useful in different contexts.

Cloudinary's SSIMULACRA2

Cloudinary's SSIMULACRA2 implementation is the reference implementation written in C++. It comes from the libjxl project, the reference implementation of the JPEG XL image codec.

vapoursynth-zip Filter

vapoursynth-zip is a collection of filters for use with Vapoursynth. It is written in Zig, and features a SSIMULACRA2 implementation.

fssimu2

fssimu2 is a fast SSIMULACRA2 implementation written in Zig. It is designed for speed, claiming to be up to 14% faster while using just 50% of the memory. Its README documents a recorded error of ~1.5% relative to Cloudinary's reference implementation and a 99.7% Pearson correlation coefficient with it. It is based on Julek's Zig implementation.

vship

Vship is a GPU-accelerated metrics toolkit compatible with Vapoursynth. It also features its own standalone FFVship binary, available independently of Vapoursynth. Vship's SSIMULACRA2 implementation is an order of magnitude faster than CPU-based implementations, and has reportedly high correlation with the reference implementation.

ssimulacra2_rs

ssimulacra2_rs is a binary interface to the Rust implementation of the SSIMULACRA2 metric. It is notable for being one of the first independent implementations, as well as one of the first to support video inputs.