Project Description
This project is a learning-focused video compression pipeline. It reads a video, prepares each
frame for block processing, separates frames into I-frames and P-frames, compresses residual data,
writes compact binary chunks, creates a playlist file, and reconstructs video frames for validation.
The main implementation is in main.py, where frame processing, compression, chunk writing,
decoding, and logging are kept together for study and experimentation.
I built the project because my streaming website already used Cloudinary and Multer for media handling. That approach works well for real application usage, but it hides many internal details. Tools like FFmpeg are also powerful, but using them directly does not show how compression works at frame, block, residual, coefficient, and binary-layout levels. This project was created to understand those internals by building the core ideas manually.
Measured Results From Logs
The following values come from info_log.txt, frame_log.txt, and the generated
chunk files. They describe how many frames were processed, how the frame types were distributed, and
how much size reduction was achieved against raw padded RGB frame data.
Compression Method
Frame Preparation
The input frame is padded so both width and height are multiples of 16. This makes every frame split cleanly into 16 x 16 macroblocks.
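As a minimal sketch of this step (the border mode used by main.py is not shown, so edge replication here is an assumption):

```python
import numpy as np

BLOCK_SIZE = 16

def pad_frame(frame):
    # Pad height and width up to the next multiple of BLOCK_SIZE.
    # Edge replication is assumed; main.py may use a different border fill.
    h, w = frame.shape[:2]
    pad_h = (-h) % BLOCK_SIZE
    pad_w = (-w) % BLOCK_SIZE
    return np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
```

A 100 x 50 frame, for example, pads out to 112 x 64 so it divides evenly into 16 x 16 macroblocks.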
I-Frame Decision
The first frame is always stored as an I-frame. Later frames become I-frames when the mean squared error against the reconstructed reference is greater than the threshold value of 500.
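The decision can be sketched as follows; `frame_type` is a hypothetical helper name, but the MSE formula and the threshold of 500 come from the project:

```python
import numpy as np

threshold = 500  # I-frame decision threshold used by the project

def mse(a, b):
    # Mean squared error between two frames or blocks.
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return float(np.mean((a - b) ** 2))

def frame_type(current, reconstructed_ref):
    # 'I' when prediction against the reference is too poor, else 'P'.
    return "I" if mse(current, reconstructed_ref) > threshold else "P"
```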
P-Frame Prediction
If the error is low enough, the frame is stored as a P-frame. Each macroblock searches nearby blocks in the reconstructed reference and saves a small motion vector.
Residual Coding
The difference between the current block and reference block is converted to grayscale, split into four 8 x 8 blocks, transformed with DCT, quantized, and saved as signed 16-bit coefficients.
Detailed Compression Flow
- Padding: frames are expanded to fit the 16 x 16 block grid.
- Block split: each padded frame is divided into macroblocks using BLOCK_SIZE = 16.
- Motion estimation: each block checks neighboring positions from -1 to +1 in both directions.
- Residual creation: the selected reference block is subtracted from the current block.
- DCT transform: residual data is converted into frequency coefficients.
- Quantization: coefficients are divided by a quantization matrix where each value is 8.
- Binary serialization: I-frames and P-frames are written into a custom binary layout using magic bytes STR1 and version 1.
- Extra compression: serialized binary chunks are compressed with zlib.compress(payload, 9).
- Reconstruction: the decoder reads each chunk, rebuilds I-frames directly, and rebuilds P-frames from motion vectors plus decoded residuals.
Techniques Used
The project combines multiple compression ideas instead of depending on one single method. The main techniques are frame prediction, block-based motion search, grayscale residual coding, DCT, quantization, custom binary storage, and zlib compression.
Padding And 16 x 16 Macroblocks
Before compression starts, every frame is padded so its height and width are divisible by
16. This prevents incomplete edge blocks and allows the whole frame to be divided into
equal 16 x 16 macroblocks. These macroblocks are the main units used for motion prediction. A fixed
block size also makes decoding easier because the decoder can rebuild the same grid structure.
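The grid split described above can be sketched like this (`split_blocks` is an illustrative name, not necessarily the one used in main.py):

```python
import numpy as np

BLOCK_SIZE = 16

def split_blocks(frame):
    # Split a padded frame into a 2-D grid of 16x16 macroblocks.
    # Padding guarantees h and w are exact multiples of BLOCK_SIZE.
    h, w = frame.shape[:2]
    return [[frame[i:i + BLOCK_SIZE, j:j + BLOCK_SIZE]
             for j in range(0, w, BLOCK_SIZE)]
            for i in range(0, h, BLOCK_SIZE)]
```

Because the grid shape depends only on the padded dimensions and the fixed block size, the decoder can rebuild the identical grid without any extra metadata.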
Motion Search And Motion Vectors
For a P-frame, the encoder does not store the full current frame. It tries to predict the current
frame from the previously reconstructed frame. For every 16 x 16 macroblock, the function
find_best_match searches a small local area around the same block position in the
reference frame.
for di in range(-1, 2):
    for dj in range(-1, 2):
        ni, nj = i + di, j + dj
        cand = blocks2[ni][nj]
        err = mse(block, cand)
The search range is from -1 to +1 in both row and column directions. That
gives nine possible reference positions: the original position plus its eight neighbors. The block
with the lowest mean squared error is selected, and only the small movement value
(dx, dy) is stored as the motion vector. This is smaller than storing the full block.
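Putting the loop together, a complete find_best_match might look like the sketch below; the bounds check is an assumption, since the excerpt does not show how edge blocks are handled:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

def find_best_match(block, blocks2, i, j):
    # Search the 3x3 neighborhood of (i, j) in the reference block grid
    # and return the row/column offset with the lowest MSE.
    best = (0, 0)
    best_err = float("inf")
    rows, cols = len(blocks2), len(blocks2[0])
    for di in range(-1, 2):
        for dj in range(-1, 2):
            ni, nj = i + di, j + dj
            if not (0 <= ni < rows and 0 <= nj < cols):
                continue  # assumed: skip candidates outside the frame
            err = mse(block, blocks2[ni][nj])
            if err < best_err:
                best_err, best = err, (di, dj)
    return best
```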
Predicted Frame And Residual Error
After the best reference block is found, the predicted block is taken from the previous reconstructed frame. The encoder then subtracts that reference block from the current block. This difference is called the residual error.
ref = b2[i + dx][j + dy]
residual = b1[i][j].astype(np.float32) - ref.astype(np.float32)
If the prediction is good, the residual contains much less information than the original block. During decoding, the frame is reconstructed by adding the decoded residual back to the predicted reference block. This is the reason P-frames can be smaller than I-frames.
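The decoder-side addition can be sketched as follows; clipping back to the valid 0-255 pixel range is an assumption, since the excerpt does not show that step:

```python
import numpy as np

def reconstruct_block(ref_block, decoded_residual):
    # Decoder side: predicted reference block + decoded residual.
    out = ref_block.astype(np.float32) + decoded_residual.astype(np.float32)
    # Clip to the valid pixel range before converting back to uint8 (assumed).
    return np.clip(out, 0, 255).astype(np.uint8)
```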
Grayscale Residual Coding
The residual is converted from color to grayscale before DCT compression. This reduces the amount of residual data because the encoder stores one channel of residual coefficients instead of separate coefficients for blue, green, and red channels.
gray = cv2.cvtColor(residual, cv2.COLOR_BGR2GRAY).astype(np.float32)
The same decoded grayscale residual is applied back to all three color channels during reconstruction.
This is a deliberate simplification for learning. It reduces stored data and makes the algorithm easier
to understand, but it can lose some color-specific error detail compared with storing separate residuals
for each color channel. The log also records grayscale data size separately: 417.55 MB,
compared with 1252.65 MB for raw padded RGB data.
8 x 8 DCT Transform
Each 16 x 16 residual block is split into four 8 x 8 blocks. The DCT, or Discrete Cosine Transform, converts pixel error values into frequency coefficients. In simple terms, it changes the data from raw spatial differences into values that describe smooth areas, edges, and detail frequencies.
This is useful because most natural video blocks contain more low-frequency information than high-frequency information. After DCT, many coefficients become small, which makes them easier to reduce through quantization and later binary compression.
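The excerpts reference dct2 without showing it; main.py presumably uses a library routine, but the matrix form below is a self-contained, equivalent sketch of an orthonormal 8 x 8 2-D DCT pair:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal type-II DCT basis: row k holds the k-th cosine basis vector.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= np.sqrt(1 / n)
    M[1:] *= np.sqrt(2 / n)
    return M

D = dct_matrix()

def dct2(block):
    # 2-D DCT: transform rows, then columns, with the same basis.
    return D @ block @ D.T

def idct2(coeffs):
    # Inverse 2-D DCT (D is orthogonal, so its transpose is its inverse).
    return D.T @ coeffs @ D
```

For a perfectly flat 8 x 8 block, all the energy lands in the single DC coefficient and every other coefficient is zero, which is why smooth blocks survive quantization so well.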
Quantization
After DCT, each coefficient is divided by a quantization matrix. In this project, the matrix uses
the value 8 for every position.
Q = np.ones((8, 8)) * 8

def quantize(b):
    return np.round(b / Q)
Quantization is the lossy part of the compression. It reduces precision so coefficients take fewer useful values. Smaller coefficient values are easier to store, and tiny visual differences are removed. During decoding, the values are multiplied by the same matrix through dequantization before inverse DCT.
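A round-trip sketch shows where precision is lost; `dequantize` mirrors the quantize function shown above, though its actual name in main.py is an assumption:

```python
import numpy as np

Q = np.ones((8, 8)) * 8  # flat quantization matrix used by the project

def quantize(b):
    return np.round(b / Q)

def dequantize(q):
    # Decoder side: undo the division before the inverse DCT.
    return q * Q

coeffs = np.full((8, 8), 100.0)
restored = dequantize(quantize(coeffs))            # 100 -> 12 -> 96: small lossy error
tiny = dequantize(quantize(np.full((8, 8), 3.0)))  # 3 -> 0: discarded entirely
```

Large coefficients come back slightly off, and coefficients smaller than half the quantizer vanish entirely; that is exactly the trade quantization makes.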
Binary Format
The encoded data is stored in a custom binary format instead of plain text. This keeps the output more
compact. The binary stream starts with magic bytes STR1 and a version number. Then each
frame is written with a small type marker.
| Stored Item | Binary Content | Reason |
|---|---|---|
| I-frame | Frame type byte, JPEG byte length, JPEG bytes | Stores a complete reference frame for decoding and recovery. |
| P-frame | Frame type byte, block rows, block columns, motion vectors, DCT coefficients | Stores only prediction information and residual data. |
| Macroblock | dx, dy, and four 8 x 8 groups of signed 16-bit coefficients | Represents one predicted 16 x 16 block compactly. |
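The table rows can be sketched with struct; the exact field widths and byte order in main.py are not shown, so single signed bytes for dx/dy and little-endian ordering are assumptions:

```python
import struct
import numpy as np

SEG_MAGIC = b"STR1"
SEG_VERSION = 1
COEFF_BYTES_PER_MB = 4 * 64 * 2  # four 8x8 tiles of signed 16-bit coefficients

def pack_header():
    # Stream start: magic bytes followed by a one-byte version (width assumed).
    return SEG_MAGIC + struct.pack("<B", SEG_VERSION)

def pack_macroblock(dx, dy, coeffs):
    # One P-frame macroblock: motion vector plus its coefficient payload.
    assert coeffs.shape == (4, 64) and coeffs.dtype == np.int16
    return struct.pack("<bb", dx, dy) + coeffs.astype("<i2").tobytes()
```

With this layout each macroblock costs a fixed 514 bytes before zlib, which matches the COEFF_BYTES_PER_MB constant in main.py plus two motion-vector bytes.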
zlib Compression
After the custom binary payload is created, the project applies zlib.compress(payload, 9).
This is a second compression stage. DCT and quantization reduce the video information first, binary
serialization packs it into a compact structure, and zlib then removes repeated byte patterns from
that binary payload.
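A quick round trip shows the effect on a repetitive binary payload; the payload here is synthetic, not project data:

```python
import zlib

# Synthetic payload with the kind of repetition quantized coefficients produce.
payload = b"\x00" * 4096 + bytes(range(256)) * 4
data = zlib.compress(payload, 9)  # level 9 = maximum compression, as in main.py

assert zlib.decompress(data) == payload  # lossless: the payload survives exactly
print(f"{len(payload)} bytes -> {len(data)} bytes")
```

Unlike quantization, this stage is fully lossless, so it can only shrink the chunk files, never change what the decoder sees.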
Why I-Frames Store The Original Frame
I-frames are stored as complete original frames because predictive frames need a reliable reference. A P-frame depends on a previous reconstructed frame. If every frame were predictive, decoding could not start cleanly, and one error could continue spreading through many later frames.
Keeping I-frames as complete JPEG images gives the decoder reset points. The first frame is always an I-frame, and later frames become I-frames when the color-frame mean squared error is too high. This means the encoder switches back to a full frame when prediction is no longer accurate enough.
5-Second Segments And I-Frame Boundaries
The chunking logic targets about 5 seconds per segment. It does not cut at a random frame. It waits until at least 5 seconds have passed and the current frame is an I-frame. This precaution keeps the first frame of each new segment as an I-frame, so the decoder has a complete frame available at the segment boundary.
if ts - last >= 5 and t == "I":
    points.append(i)
    last = ts
Because the boundary waits for an I-frame, some segments are slightly longer than 5 seconds. For
example, the logs show segment durations such as 6.64 s, 6.28 s, and
6.76 s. This is intentional because decoding reliability is more important than cutting
exactly at five seconds.
Core Code Overview
The code in main.py is organized into small functions for padding, block handling,
motion estimation, residual compression, serialization, chunk creation, and decoding.
Important Constants
SEG_MAGIC = b"STR1"
SEG_VERSION = 1
COEFF_BYTES_PER_MB = 4 * 64 * 2
BLOCK_SIZE = 16
Q = np.ones((8,8)) * 8
threshold = 500
BLOCK_SIZE = 16
Creates a fixed 16 x 16 macroblock grid for motion prediction.
Q = np.ones((8, 8)) * 8
An 8 x 8 matrix filled with the value 8; controls quantization strength for DCT coefficients.
threshold = 500
Decides whether a frame is stored as an I-frame or P-frame.
Residual Encoder
def encode_residual(residual):
    gray = cv2.cvtColor(residual, cv2.COLOR_BGR2GRAY).astype(np.float32)
    tiles = []
    for y in range(0, gray.shape[0], 8):
        for x in range(0, gray.shape[1], 8):
            block = gray[y:y+8, x:x+8]
            block = block - 128
            d = dct2(block)
            q = quantize(d)
            q_int = np.round(q).astype(np.int16)
            tiles.append(q_int.reshape(64))
    return np.stack(tiles, axis=0)
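The matching decoder is not shown in the excerpt; a sketch that mirrors encode_residual step for step (dequantize, inverse DCT, undo the -128 level shift, reassemble tiles) could look like:

```python
import numpy as np

Q = np.ones((8, 8)) * 8

def dct_matrix(n=8):
    # Orthonormal type-II DCT basis matrix.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= np.sqrt(1 / n)
    M[1:] *= np.sqrt(2 / n)
    return M

D = dct_matrix()

def decode_residual(tiles, shape):
    # Rebuild the grayscale residual from quantized int16 coefficient tiles,
    # walking the tiles in the same y-then-x order the encoder produced them.
    h, w = shape
    out = np.zeros((h, w), dtype=np.float32)
    idx = 0
    for y in range(0, h, 8):
        for x in range(0, w, 8):
            coeffs = tiles[idx].reshape(8, 8).astype(np.float32) * Q  # dequantize
            block = D.T @ coeffs @ D                                   # inverse DCT
            out[y:y + 8, x:x + 8] = block + 128                        # undo level shift
            idx += 1
    return out
```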
P-Frame Block Prediction
for i in range(len(b1)):
    for j in range(len(b1[0])):
        dx, dy = find_best_match(b1[i][j], b2, i, j)
        ref = b2[i+dx][j+dy]
        residual = b1[i][j].astype(np.float32) - ref.astype(np.float32)
        comp = encode_residual(residual)
        rec_res = decode_residual(comp, ref.shape[:2])
Chunk Compression
payload = serialize_encoded(encoded[s:e])
data = zlib.compress(payload, 9)
with open(f"segments/segment_{i}.ts", "wb") as f:
    f.write(data)
Frame Type Analysis
The frame log records every frame number, frame type, and timestamp. This shows that most frames were compressed as P-frames, which means the project reused previous reconstructed frames instead of storing every frame independently.
| Frame Type | Count | Share | Meaning |
|---|---|---|---|
| I | 478 | 25.71% | Stored as JPEG images when a full reference frame is needed. |
| P | 1,381 | 74.29% | Stored using motion vectors and residual coefficients. |
| Total | 1,859 | 100% | Processed at 25 FPS for an approximate duration of 74.36 seconds. |
Chunk Log Details
The project writes compressed binary chunks and lists them in index.m3u8. Each chunk starts
at an I-frame when possible, so decoding has a reliable reference frame at chunk boundaries.
| Chunk | Frame Range | Frame Count | Duration | Logged Size |
|---|---|---|---|---|
| 0 | 0 to 165 | 166 | 6.64 s | 3.99 MB |
| 1 | 166 to 290 | 125 | 5.00 s | 2.66 MB |
| 2 | 291 to 415 | 125 | 5.00 s | 3.44 MB |
| 3 | 416 to 541 | 126 | 5.04 s | 2.93 MB |
| 4 | 542 to 666 | 125 | 5.00 s | 2.00 MB |
| 5 | 667 to 810 | 144 | 5.76 s | 2.84 MB |
| 6 | 811 to 967 | 157 | 6.28 s | 3.92 MB |
| 7 | 968 to 1097 | 130 | 5.20 s | 2.59 MB |
| 8 | 1098 to 1241 | 144 | 5.76 s | 3.53 MB |
| 9 | 1242 to 1410 | 169 | 6.76 s | 3.65 MB |
| 10 | 1411 to 1536 | 126 | 5.04 s | 4.00 MB |
| 11 | 1537 to 1661 | 125 | 5.00 s | 4.90 MB |
| 12 | 1662 to 1809 | 148 | 5.92 s | 4.07 MB |
| 13 | 1810 to 1858 | 49 | 1.96 s | 1.31 MB |
Why Segmentation Matters
Segmentation means the video is divided into smaller time-based chunks instead of being handled as one large file. In this project, the target chunk length is about 5 seconds. This design is useful because a player can begin work with the first chunk while the remaining chunks are still waiting to be loaded or processed. The user does not need to wait for the full video file before playback can begin.
Smaller chunks improve the user experience in several ways. Startup can be faster because the first playable unit is small. Seeking is cleaner because the player can jump near the requested time and load only the chunk around that point. If one chunk fails, the application can retry only that chunk instead of repeating the whole video transfer. It also allows progress and buffering to feel more responsive, because the application works with many manageable units instead of one large binary.
Segmentation also improves efficiency. Memory use is lower because the application can decode or cache a small chunk at a time. Network usage can be more practical because only the needed time range has to be loaded. It also creates a path for future improvements such as quality switching, preview loading, partial downloads, and better seeking behavior.
Why About 5 Seconds
A 5-second target is a practical balance. Very tiny chunks would create too many files and too much metadata overhead. Very large chunks would slow startup and seeking. A 5-second chunk is short enough for quick buffering and long enough to keep the number of chunks reasonable. The project waits for an I-frame before starting a new chunk, so some chunks are slightly longer than 5 seconds, but each new chunk has a strong decoding starting point.
How We Can Use It
The generated index.m3u8 file works like a map. It stores the duration of each chunk and
the path to the matching compressed file. A web video feature can read that playlist, load the first
chunk for quick playback, continue loading later chunks as the user watches, and jump to the right
chunk when the user seeks to another timestamp.
- Fast start: load chunk 0 first and begin playback from its I-frame.
- Smooth buffering: load the next chunk before the current one finishes.
- Seeking: calculate the target time, find the nearest chunk in index.m3u8, and decode from that chunk's first I-frame.
- Retry handling: if a chunk has a loading problem, retry that chunk only.
- Future quality modes: keep multiple chunk sets at different quality levels and choose the right one based on bandwidth.
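The steps above can be sketched with a minimal playlist reader; it assumes the standard #EXTINF:<duration>, line format, which matches the chunk durations logged earlier:

```python
def parse_playlist(text):
    # Return (duration, path) pairs from an m3u8-style playlist string.
    entries = []
    duration = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#"):
            entries.append((duration, line))
            duration = None
    return entries

def chunk_for_time(entries, t):
    # Seeking: find the index of the chunk containing timestamp t (seconds).
    start = 0.0
    for idx, (dur, _path) in enumerate(entries):
        if start <= t < start + dur:
            return idx
        start += dur
    return len(entries) - 1  # past the end: return the last chunk
```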
Visual Frame References
The report also includes reference images from the original and reconstructed outputs. These images make the compression result easier to understand because the reader can compare the source frame with the frame rebuilt by the custom decoder at the same timestamp.
The 10-second and 22-second samples are useful checkpoints. They show how the predicted frame, grayscale residual, DCT coefficients, quantization, binary storage, and zlib-compressed chunks still allow the project to reconstruct a recognizable frame after decoding.
Frame Reference At 10 Seconds
This pair compares the original frame at 10s with the reconstructed frame generated after decoding.
Frame Reference At 22 Seconds
This pair compares the original frame at 22s with the reconstructed frame generated after decoding.
Generated Files
The project creates files that make debugging and validation easier. The logs are especially important because they show the compression decisions and measured results of the run.
| File | Purpose | File Size |
|---|---|---|
| info_log.txt | Summary metrics: data size, FPS, frame counts, encoded size, and chunk sizes. | 793 bytes |
| frame_log.txt | Per-frame record with frame number, frame type, and timestamp. | 52,551 bytes |
| index.m3u8 | Playlist-style chunk index containing durations and chunk paths. | 601 bytes |
| segments/segment_0.ts to segments/segment_13.ts | Compressed binary chunks produced by serialization and zlib compression. | 45.82 MB total |
| flow.mp4 | Motion visualization output created from optical flow. | 45.18 MB |
| reconstructed.mp4 | Rebuilt video output used to verify that decoding can reconstruct frames. | 14.64 MB |
Learning Outcomes
This project shows the internal ideas behind video compression in a practical way. It demonstrates why modern video compression does not store every frame as a complete image, how motion prediction reduces repeated visual information, and how DCT plus quantization reduce residual data.
- Learned how I-frames and P-frames work together.
- Built a block-based motion estimation method using neighboring macroblocks.
- Implemented residual compression using DCT, quantization, and inverse reconstruction.
- Created a custom binary format using frame type bytes, dimensions, motion vectors, and coefficient payloads.
- Used logs to measure frame counts, size reduction, chunk sizes, and processing behavior.
Conclusion
The project successfully converts raw video frames into a custom compressed representation and records the complete process in logs. From the logged raw padded size of 1252.65 MB to the internal encoded summary of 24.48 MB, the pipeline shows about a 51.17:1 size ratio. The produced chunk files total 45.82 MB, still about a 27.3:1 ratio against raw padded RGB data.
Most importantly, the project explains the internal workflow behind compression rather than hiding it behind external tools. That makes it useful as both a working prototype and a learning project for video compression concepts.