vmaf output

In the previous post, I went about calculating vmaf for my encodes. Here are a couple of points, for my own reference, about the output file.

The output file generated by the command

ffmpeg -i <tocheck.mp4> -i <ref.mp4> -lavfi libvmaf="n_threads=4:model_path=./vmaf_v0.6.1.json:log_fmt=json:log_path=./vmaf.txt:psnr=1:ssim=1" -f null -


looks like this (json format):

{
  "version": "2.1.1",
  "frames": [
    {
      "frameNum": 0,
      "metrics": {
        "integer_motion2": 0.000000,
        "integer_motion": 0.000000,
        "ssim": 0.999468,
        ...
        "vmaf": 98.682631
      }
    },
    ...
    ...
    }
  ],
  "pooled_metrics": {
    ...
    "ssim": {
      "min": 0.980323,
      "max": 1.000000,
      "mean": 0.996725,
      "harmonic_mean": 0.996724
    },
    ...
    ...
    "vmaf": {
      "min": 90.401109,
      "max": 100.000000,
      "mean": 97.782800,
      "harmonic_mean": 97.748271
    }
  },
  "aggregate_metrics": {
  }
}
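
To pull the interesting bits out of this log without scrolling through every frame, a small script helps. This is only a sketch: it assumes the log has the same layout as the sample above (`frames`, `frameNum`, `metrics`, `pooled_metrics`), and the helper names `load_log`, `pooled_summary` and `worst_frames` are my own, not part of any vmaf tool.

```python
import json


def load_log(path):
    """Parse a libvmaf JSON log (written with log_fmt=json) into a dict."""
    with open(path) as fh:
        return json.load(fh)


def pooled_summary(log, metric="vmaf"):
    """Return the min/max/mean/harmonic_mean block for one metric."""
    return log["pooled_metrics"][metric]


def worst_frames(log, metric="vmaf", count=3):
    """Return (frameNum, score) pairs for the lowest-scoring frames."""
    scored = [(f["frameNum"], f["metrics"][metric]) for f in log["frames"]]
    return sorted(scored, key=lambda pair: pair[1])[:count]
```

Typical use would be `log = load_log("./vmaf.txt")`, then `pooled_summary(log)` for the aggregate numbers and `worst_frames(log, count=10)` to see which frames drag the score down.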


The log contains per-frame data and, towards the end, pooled_metrics. For ssim it lists two fields of interest, mean and harmonic_mean (next to min and max), and similarly for vmaf.
Here mean is just the arithmetic average over all frames, with each frame score weighed equally. It is not necessarily an indicator of a high-quality encode: if a few frames were encoded with a low score, they get masked by the many frames with higher scores. For most encodes this is fine. harmonic_mean adds a twist: it weighs low-scoring frames more heavily, so the aggregate value drops when some frames deviate significantly from the rest.
This is one of the cases where ssim could be near perfect while the vmaf harmonic_mean is much lower. The discussion in the FAQ of the Netflix/vmaf GitHub repository is quite good and covers these points.
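
A quick way to see the difference between the two pooled values is to run both over a made-up set of frame scores. The numbers below are invented for illustration, and the code uses the textbook harmonic mean from Python's `statistics` module; libvmaf's own harmonic pooling may differ in small details (for example in how it guards against a score of zero), so do not expect it to match the log digit for digit.

```python
from statistics import fmean, harmonic_mean


def pool(scores):
    """Pool per-frame scores with the arithmetic and the harmonic mean."""
    return {"mean": fmean(scores), "harmonic_mean": harmonic_mean(scores)}


# Invented scores: 100 frames, all good, versus 95 good frames and 5 bad ones.
steady = [98.0] * 100
with_dips = [98.0] * 95 + [40.0] * 5

for name, scores in (("steady", steady), ("with_dips", with_dips)):
    result = pool(scores)
    print(f"{name}: mean={result['mean']:.2f} "
          f"harmonic_mean={result['harmonic_mean']:.2f}")
```

For the steady clip both values agree. For the clip with five bad frames the mean only falls to 95.1, while the harmonic mean falls further, which is the gap to look for in the real log.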

A note from Twitter about encodings and quality, along similar lines:

As discussed on [VMAF GitHub](https://github.com/Netflix/vmaf/blob/master/FAQ.md#q-why-the-aggregate-vmaf-score-sometimes-may-bias-easy-content-too-much-issue-20), aggregating VMAF scores of frames by averaging over the entire sequence may hide the impact of difficult-to-encode frames (if these frames occur infrequently). An optimal way to pool frames is an open problem. For example, VMAF tools can already aggregate harmonic mean and output one percentile score. In the context of this blog, after calculating VMAF scores of all frames of a sequence, we compute the 1st, 5th, 10th, 25th, and 50th percentiles. By definition, the 5th percentile gives us a VMAF score of the worst 5% frames while the 50th percentile is the median. The intuition here is that instead of weighing all frames equally and getting one score, we rank frames according to their complexity and look at how a particular encoder setting performs across these different ranks. We want to prioritize improving quality on frames in the order of their VMAF scores, from lowest to highest. Frames with a high VMAF score already look great and improving quality on them won’t matter as much.
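
The percentile idea from the note above can be tried on the per-frame scores in the JSON log. This is a rough sketch, not the method used in that post: the percentile is computed with plain linear interpolation between sorted scores (one of several common definitions), and the function names `percentile` and `vmaf_percentiles` are hypothetical. It assumes the `frames` / `metrics` / `vmaf` layout shown in the sample log.

```python
import math


def percentile(sorted_scores, p):
    """p-th percentile (0-100) of an ascending list, linearly interpolated."""
    if not sorted_scores:
        raise ValueError("no scores given")
    k = (len(sorted_scores) - 1) * p / 100.0
    lower = math.floor(k)
    upper = min(lower + 1, len(sorted_scores) - 1)
    fraction = k - lower
    return sorted_scores[lower] + (sorted_scores[upper] - sorted_scores[lower]) * fraction


def vmaf_percentiles(log, points=(1, 5, 10, 25, 50)):
    """Map each requested percentile to the per-frame vmaf score at that rank."""
    scores = sorted(frame["metrics"]["vmaf"] for frame in log["frames"])
    return {p: percentile(scores, p) for p in points}
```

With the log parsed via `json.load`, `vmaf_percentiles(log)` returns a dict keyed by percentile. A low 1st or 5th percentile next to a high mean points to a handful of difficult frames that the average is hiding.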

A few links of interest:


A few YouTube videos: