Terminology

When learning about encoding technology, it is important to understand the vast terminology that is often used to describe concepts that are often not very complex to understand.

Bitstream

A bitstream or bit stream is a media file, the kind that is played in a media player. It consists of a container wrapping multiple elementary streams

Lossy / Lossless

Lossy encoding throws out some of the detail to achieve a smaller size. Often, this is an acceptable trade-off, but if you need a perfect recreation of the data, you need lossless encoding.

Elementary stream

An elementary stream is an audio, video, or subtitle track. Basically, it's the compressed data you want to mux into the container.

Muxing

Putting elementary streams into a container, which preserves them without making any changes to the data.

Codec

A codec (coder/decoder) is the piece of code that actually encodes the data you put in. It takes as input and produces as output an elementary stream. More information is provided in the prologue under "What is a Codec".

Filter

A filter is a piece of code you can apply to the data to make something about it different, for instance sharpening, removing artifacts, shakiness, denoising, scaling, overlay, etc.

Muxer/Demuxer

The pieces of code that mux or do the reverse, getting elementary streams from the container.

Bitstream filter

A bitstream filter is a filter that is directly applied to the bitstream in order to change something about the container, for instance, convert frame types, or corrupt some packets.

Container

A container is a format for putting one or more elementary streams into one file, which is then called a bitstream.

A video container is a digital file format that holds video and audio data, as well as additional information such as subtitles, metadata, and chapter markers. It acts as a "wrapper" that packages all these elements into a single file that can be played on various devices and software platforms. Think of it like a container you might use to transport goods - the video and audio data are like the items being transported, while the container itself provides a structure and organization for the contents.

Some kinds of containers:

MP4 / M4V

This is likely the most common container you've encountered, & has near universal compatibility. Has a limited maximum amount of streams. The supported video codecs are H.264, H.265, H.266, DivX, Xvid, VP9 (Unofficial, hacky), and AV1 (Unofficial, hacky). For audio codecs it's many of the various flavors of AAC, MP3, FLAC (Unofficial), Opus (Unofficial, hacky). For subtitles only MPEG-4 Timed Text (TTXT) is supported.

The best tool to work with this container is MP4Box, but FFmpeg also works.

MOV

Similar to MP4, but less supported. Made with Apple Quicktime in mind, supports ProRes.

MKV / MKA / MKS / MK3D

Also known as Matroska, allows an unlimited amount of video/audio/subtitle streams and any codec that probably still exists in Area 51, you can put literally anything in there and it won't even care, MPEG-2/DivX/H.266/Theora/Thor/RealVideo/MJPEG/AVS3/AMR-WB, you name it. All around best container for working with if you have the choice.

WebM

A container made with web streaming in mind. WebM is a stripped-down subset of MKV that only allows free & open source codecs such as VP8, VP9 or AV1 for video alongside Vorbis or Opus for audio. It is a common misconception that WebVTT tracks always work natively in browsers when within a WebM container; in practice, WebMs containing WebVTT subtitles will usually not play back the subtitles in browsers. WebVTT subtitles can be utilized with the <track> element instead, meaning they exist outside the WebM container itself. More info in the Mozilla Web Docs.

Transcoding

Taking an elementary stream & converting it to another format, lossless or lossy, using an encoder of some kind. For example, if I convert a lossless FFV1 video to a lossy AV1 video using an encoder like rav1e, I have transcoded this lossless video to AV1. Transcoding doesn't have anything to do with the container.

RDO

RDO, or Rate-Distortion Optimization, is a technique used to find the best trade-off between the bit rate & the quality of lossily encoded content. RDO can be metric-based, optimizing to score well on metrics like PSNR or SSIM.

Perceputal / Psychovisual / Psychoacoustic

"Psychovisual quality" (for videos), "Psychoacoustic quality" (for audio), or "perceptual quality" is a term used to describe the perception of quality of distorted media by the human senses. The goal of any multimedia codec is to minimize data while maintaining perceived quality, and optimizing around human perception theoretically yields the best performance within any limited set of coding techniques (like when using an older codec). Our model of human perception continues to evolve, which makes modelling perceptual quality very difficult. Presently, the metrics SSIMULACRA2 (Images/Video) & Butteraugli (Video) are considered the most accurate to our human visual system.

Discrete Cosine Transform (DCT)

The Discrete Cosine Transform is a mathematical transformation that can transform discrete data into the frequency domain. This discrete data could be pixels in an image/video compression block or data points recorded temporally representing an audio recording. This algorithm is a particularly good choice for image, video, music, & speech compression because it has high energy compaction relative to our understanding of fidelity in media. High energy compaction means the DCT is able to represent a signal with a small number of significant coefficients.

Bitstream​

Lossy / Lossless​

Elementary stream​

Muxing​

Codec​

Filter​

Muxer/Demuxer​

Bitstream filter​

Container​

MP4 / M4V​

MOV​

MKV / MKA / MKS / MK3D​

WebM​

Transcoding​

RDO​

Perceputal / Psychovisual / Psychoacoustic​

Discrete Cosine Transform (DCT)​