Project Report

Custom Video Compression Pipeline

A Python and OpenCV project that demonstrates how video frames can be analyzed, compressed, stored in small chunks, logged, and reconstructed without depending entirely on black-box media tools.

Main file: main.py
Input path: videos/input.mp4
Log source: info_log.txt
Frame source: frame_log.txt
Harsh | Roll No. S20230030382 | UG-3 | MS-Project

Project Description

This project is a learning-focused video compression pipeline. It reads a video, prepares each frame for block processing, separates frames into I-frames and P-frames, compresses residual data, writes compact binary chunks, creates a playlist file, and reconstructs video frames for validation. The main implementation is in main.py, where frame processing, compression, chunk writing, decoding, and logging are kept together for study and experimentation.

I built the project because my streaming website already used Cloudinary and Multer for media handling. That approach works well for real application usage, but it hides many internal details. Tools like FFmpeg are also powerful, but using them directly does not show how compression works at frame, block, residual, coefficient, and binary-layout levels. This project was created to understand those internals by building the core ideas manually.

The purpose is not only to compress a file. The purpose is to learn how video compression can be designed from basic building blocks: motion prediction, transform coding, quantization, chunking, metadata logs, and reconstruction.

Measured Results From Logs

The following values come from info_log.txt, frame_log.txt, and the generated chunk files. They describe how many frames were processed, how the frame types were distributed, and how much size reduction was achieved against raw padded RGB frame data.

Metric | Value | Note
Raw padded data | 1252.65 MB | RGB frame estimate from info_log.txt
Gray data | 417.55 MB | Single-channel frame estimate
Frames | 1,859 | 25 FPS, about 74.36 seconds
Logged total | 24.48 MB | Encoder summary from info_log.txt
I-frames | 478 | 25.71% of all frames
P-frames | 1,381 | 74.29% of all frames
Output chunks | 14 | Total chunk size: 45.82 MB
Reduction | 98.05% | Raw padded data to logged total

The logged total size is an internal encoder summary. The final chunk files total 45.82 MB on disk because the chunks also carry motion vectors, residual coefficient data, and headers inside the zlib-compressed binary output. Against raw padded RGB data, the final chunks still reduce size by about 96.3%.

Compression Method

1. Frame Preparation

The input frame is padded so both width and height are multiples of 16. This makes every frame split cleanly into 16 x 16 macroblocks.

2. I-Frame Decision

The first frame is always stored as an I-frame. Later frames become I-frames when the mean squared error against the reconstructed reference is greater than the threshold value of 500.

3. P-Frame Prediction

If the error is low enough, the frame is stored as a P-frame. Each macroblock searches nearby blocks in the reconstructed reference and saves a small motion vector.

4. Residual Coding

The difference between the current block and reference block is converted to grayscale, split into four 8 x 8 blocks, transformed with DCT, quantized, and saved as signed 16-bit coefficients.

Detailed Compression Flow

Techniques Used

The project combines multiple compression ideas instead of depending on one single method. The main techniques are frame prediction, block-based motion search, grayscale residual coding, DCT, quantization, custom binary storage, and zlib compression.

Padding And 16 x 16 Macroblocks

Before compression starts, every frame is padded so its height and width are divisible by 16. This prevents incomplete edge blocks and allows the whole frame to be divided into equal 16 x 16 macroblocks. These macroblocks are the main units used for motion prediction. A fixed block size also makes decoding easier because the decoder can rebuild the same grid structure.
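
The padding step itself can be sketched as follows. The pad_frame name and the use of cv2.copyMakeBorder with edge replication are assumptions for illustration; main.py's exact padding call is not shown in this report.

import cv2

BLOCK_SIZE = 16

def pad_frame(frame):
    # Round height and width up to the next multiple of BLOCK_SIZE
    h, w = frame.shape[:2]
    pad_h = (BLOCK_SIZE - h % BLOCK_SIZE) % BLOCK_SIZE
    pad_w = (BLOCK_SIZE - w % BLOCK_SIZE) % BLOCK_SIZE
    # Replicating edge pixels avoids sharp artificial borders in the padded area
    return cv2.copyMakeBorder(frame, 0, pad_h, 0, pad_w, cv2.BORDER_REPLICATE)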

Motion Search And Motion Vectors

For a P-frame, the encoder does not store the full current frame. It tries to predict the current frame from the previously reconstructed frame. For every 16 x 16 macroblock, the function find_best_match searches a small local area around the same block position in the reference frame.

for di in range(-1, 2):
    for dj in range(-1, 2):
        ni, nj = i + di, j + dj
        cand = blocks2[ni][nj]
        err = mse(block, cand)

The search range is from -1 to +1 in both row and column directions. That gives nine possible reference positions: the original position plus its eight neighbors. The block with the lowest mean squared error is selected, and only the small movement value (dx, dy) is stored as the motion vector. This is smaller than storing the full block.
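
Putting the loop together, a complete find_best_match could look like the sketch below. The mse helper and the bounds check for blocks at the frame edge are assumptions about details the excerpt above does not show.

import numpy as np

def mse(a, b):
    # Mean squared error between two equally sized blocks
    return np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2)

def find_best_match(block, blocks2, i, j):
    # Check the same grid position and its eight neighbors in the reference
    best_err, best = float("inf"), (0, 0)
    for di in range(-1, 2):
        for dj in range(-1, 2):
            ni, nj = i + di, j + dj
            if not (0 <= ni < len(blocks2) and 0 <= nj < len(blocks2[0])):
                continue  # assumed: skip candidates outside the block grid
            err = mse(block, blocks2[ni][nj])
            if err < best_err:
                best_err, best = err, (di, dj)
    return best  # the motion vector (dx, dy)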

Predicted Frame And Residual Error

After the best reference block is found, the predicted block is taken from the previous reconstructed frame. The encoder then subtracts that reference block from the current block. This difference is called the residual error.

ref = b2[i + dx][j + dy]  # reference block shifted by the motion vector
residual = b1[i][j].astype(np.float32) - ref.astype(np.float32)  # residual error

If the prediction is good, the residual contains much less information than the original block. During decoding, the frame is reconstructed by adding the decoded residual back to the predicted reference block. This is the reason P-frames can be smaller than I-frames.

Grayscale Residual Coding

The residual is converted from color to grayscale before DCT compression. This reduces the amount of residual data because the encoder stores one channel of residual coefficients instead of separate coefficients for blue, green, and red channels.

gray = cv2.cvtColor(residual, cv2.COLOR_BGR2GRAY).astype(np.float32)

The same decoded grayscale residual is applied back to all three color channels during reconstruction. This is a deliberate simplification for learning. It reduces stored data and makes the algorithm easier to understand, but it can lose some color-specific error detail compared with storing separate residuals for each color channel. The log also records grayscale data size separately: 417.55 MB, compared with 1252.65 MB for raw padded RGB data.
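
A minimal sketch of that reconstruction step, with placeholder arrays standing in for the real predicted block and decoded residual (the exact broadcasting in main.py may differ):

import numpy as np

ref = np.zeros((16, 16, 3), dtype=np.uint8)     # stands in for the predicted block
rec_res = np.zeros((16, 16), dtype=np.float32)  # stands in for the decoded residual

# Apply the single grayscale residual to all three channels, then clamp to pixel range
rec_block = ref.astype(np.float32) + rec_res[:, :, None]
rec_block = np.clip(rec_block, 0, 255).astype(np.uint8)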

8 x 8 DCT Transform

Each 16 x 16 residual block is split into four 8 x 8 blocks. The DCT, or Discrete Cosine Transform, converts pixel error values into frequency coefficients. In simple terms, it changes the data from raw spatial differences into values that describe smooth areas, edges, and detail frequencies.

This is useful because most natural video blocks contain more low-frequency information than high-frequency information. After DCT, many coefficients become small, which makes them easier to reduce through quantization and later binary compression.
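
encode_residual, shown later in this report, calls a dct2 helper. One common way to write the 2-D DCT pair, assuming SciPy is available, is the sketch below; main.py may implement it differently, for example with cv2.dct.

from scipy.fftpack import dct, idct

def dct2(block):
    # 2-D DCT-II: transform rows, then columns, with orthonormal scaling
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(block):
    # Inverse 2-D DCT used on the decoding side
    return idct(idct(block, axis=0, norm="ortho"), axis=1, norm="ortho")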

Quantization

After DCT, each coefficient is divided by a quantization matrix. In this project, the matrix uses the value 8 for every position.

Q = np.ones((8, 8)) * 8

def quantize(b):
    return np.round(b / Q)

Quantization is the lossy part of the compression. It reduces precision so coefficients take fewer useful values. Smaller coefficient values are easier to store, and tiny visual differences are removed. During decoding, the values are multiplied by the same matrix through dequantization before inverse DCT.
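
A sketch of the round trip for one 8 x 8 tile, reusing Q and quantize from above and the dct2 and idct2 helpers sketched earlier; the dequantize helper is assumed to mirror quantize.

import numpy as np

def dequantize(b):
    # Multiply back by Q; the rounding done in quantize is the permanent loss
    return b * Q

tile = np.random.randint(-30, 30, (8, 8)).astype(np.float32)  # fake residual tile
rec = idct2(dequantize(quantize(dct2(tile))))
print(np.max(np.abs(rec - tile)))  # small reconstruction error from rounding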

Binary Format

The encoded data is stored in a custom binary format instead of plain text. This keeps the output more compact. The binary stream starts with magic bytes STR1 and a version number. Then each frame is written with a small type marker.

Stored Item | Binary Content | Reason
I-frame | Frame type byte, JPEG byte length, JPEG bytes | Stores a complete reference frame for decoding and recovery.
P-frame | Frame type byte, block rows, block columns, motion vectors, DCT coefficients | Stores only prediction information and residual data.
Macroblock | dx, dy, and four 8 x 8 groups of signed 16-bit coefficients | Represents one predicted 16 x 16 block compactly.
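
A sketch of how such a header and an I-frame record can be packed with Python's struct module. The byte order and field widths here are illustrative, not the verified layout of main.py.

import struct

SEG_MAGIC = b"STR1"
SEG_VERSION = 1

buf = bytearray()
buf += SEG_MAGIC                           # magic bytes identify the format
buf += struct.pack("<B", SEG_VERSION)      # one-byte version number

jpeg_bytes = b"\xff\xd8...\xff\xd9"        # stands in for a real JPEG payload
buf += b"I"                                # frame type marker for an I-frame
buf += struct.pack("<I", len(jpeg_bytes))  # JPEG byte length
buf += jpeg_bytes                          # the complete JPEG image follows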

zlib Compression

After the custom binary payload is created, the project applies zlib.compress(payload, 9). This is a second compression stage. DCT and quantization reduce the video information first, binary serialization packs it into a compact structure, and zlib then removes repeated byte patterns from that binary payload.
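
Because zlib is lossless, the binary payload survives this second stage exactly; only the DCT quantization earlier in the pipeline discards information. A quick sketch:

import zlib

payload = b"example serialized frame data"  # stands in for the real binary payload
data = zlib.compress(payload, 9)            # level 9: smallest output, slowest
assert zlib.decompress(data) == payload     # the exact payload comes back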

Why I-Frames Store The Original Frame

I-frames are stored as complete original frames because predictive frames need a reliable reference. A P-frame depends on a previous reconstructed frame. If every frame were predictive, decoding could not start cleanly, and one error could continue spreading through many later frames.

Keeping I-frames as complete JPEG images gives the decoder reset points. The first frame is always an I-frame, and later frames become I-frames when the color-frame mean squared error is too high. This means the encoder switches back to a full frame when prediction is no longer accurate enough.
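
A sketch of this I-frame path, assuming OpenCV's JPEG codec handles the full-frame store; main.py's exact calls are not shown in this report.

import cv2
import numpy as np

threshold = 500  # I-frame decision threshold

def frame_mse(a, b):
    return np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2)

def needs_iframe(frame, prev_rec):
    # First frame, or prediction error above the threshold
    return prev_rec is None or frame_mse(frame, prev_rec) > threshold

def encode_iframe(frame):
    # Store a complete JPEG image and decode it back as the new reference
    ok, jpeg = cv2.imencode(".jpg", frame)
    rec = cv2.imdecode(jpeg, cv2.IMREAD_COLOR)
    return jpeg.tobytes(), rec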

5-Second Segments And I-Frame Boundaries

The chunking logic targets about 5 seconds per segment. It does not cut at a random frame. It waits until at least 5 seconds have passed and the current frame is an I-frame. This precaution keeps the first frame of each new segment as an I-frame, so the decoder has a complete frame available at the segment boundary.

if ts - last >= 5 and t == "I":  # at least 5 s elapsed and the frame is an I-frame
    points.append(i)             # cut a new segment at frame i
    last = ts

Because the boundary waits for an I-frame, some segments are slightly longer than 5 seconds. For example, the logs show segment durations such as 6.64 s, 6.28 s, and 6.76 s. This is intentional because decoding reliability is more important than cutting exactly at five seconds.

Core Code Overview

The code in main.py is organized into small functions for padding, block handling, motion estimation, residual compression, serialization, chunk creation, and decoding.

Important Constants

SEG_MAGIC = b"STR1"
SEG_VERSION = 1
COEFF_BYTES_PER_MB = 4 * 64 * 2
BLOCK_SIZE = 16
Q = np.ones((8,8)) * 8
threshold = 500

SEG_MAGIC and SEG_VERSION identify the custom chunk format and its version.
COEFF_BYTES_PER_MB = 4 * 64 * 2 is the coefficient budget per macroblock: four 8 x 8 tiles of 64 signed 16-bit values.
BLOCK_SIZE = 16 creates a fixed 16 x 16 macroblock grid for motion prediction.
Q, an 8 x 8 matrix filled with 8, controls quantization strength for DCT coefficients.
threshold = 500 decides whether a frame is stored as an I-frame or P-frame.

Residual Encoder

def encode_residual(residual):
    # Keep a single grayscale channel instead of separate B, G, R residuals
    gray = cv2.cvtColor(residual, cv2.COLOR_BGR2GRAY).astype(np.float32)
    tiles = []

    for y in range(0, gray.shape[0], 8):
        for x in range(0, gray.shape[1], 8):
            block = gray[y:y+8, x:x+8]
            block = block - 128                   # level shift before the transform
            d = dct2(block)                       # 8 x 8 DCT to frequency coefficients
            q = quantize(d)                       # lossy step: divide by Q and round
            q_int = np.round(q).astype(np.int16)  # store as signed 16-bit values
            tiles.append(q_int.reshape(64))

    return np.stack(tiles, axis=0)
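
decode_residual, used in the P-frame loop below, reverses each of these steps. A minimal sketch, assuming the flat Q of 8 and the 128 level shift from encode_residual, with idct2 as sketched earlier:

import numpy as np
from scipy.fftpack import idct

def idct2(block):
    return idct(idct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def decode_residual(tiles, shape):
    # tiles: (n, 64) rows of int16 coefficients; shape: (height, width) of the block
    gray = np.zeros(shape, dtype=np.float32)
    idx = 0
    for y in range(0, shape[0], 8):
        for x in range(0, shape[1], 8):
            q = tiles[idx].reshape(8, 8).astype(np.float32)
            # Dequantize (multiply by the flat Q of 8), inverse DCT, undo level shift
            gray[y:y+8, x:x+8] = idct2(q * 8) + 128
            idx += 1
    return gray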

P-Frame Block Prediction

for i in range(len(b1)):
    for j in range(len(b1[0])):
        # Motion-search the reference grid and keep only the small offset
        dx, dy = find_best_match(b1[i][j], b2, i, j)
        ref = b2[i+dx][j+dy]
        residual = b1[i][j].astype(np.float32) - ref.astype(np.float32)
        comp = encode_residual(residual)
        # Decode immediately so the encoder's reference matches the decoder's view
        rec_res = decode_residual(comp, ref.shape[:2])

Chunk Compression

payload = serialize_encoded(encoded[s:e])  # pack frames s..e into the binary format
data = zlib.compress(payload, 9)           # second-stage lossless compression

with open(f"segments/segment_{i}.ts", "wb") as f:
    f.write(data)
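
The chunk durations and paths then go into index.m3u8. A minimal HLS-style writer sketch; the exact tags main.py emits may differ, and the duration list here is shortened for illustration.

durations = [6.64, 5.00, 5.00]  # per-chunk lengths in seconds, from the logs

with open("index.m3u8", "w") as f:
    f.write("#EXTM3U\n")
    for i, d in enumerate(durations):
        f.write(f"#EXTINF:{d:.2f},\n")         # duration of the next chunk
        f.write(f"segments/segment_{i}.ts\n")  # path to the compressed chunk
    f.write("#EXT-X-ENDLIST\n")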

Frame Type Analysis

The frame log records every frame number, frame type, and timestamp. This shows that most frames were compressed as P-frames, which means the project reused previous reconstructed frames instead of storing every frame independently.

Frame Type | Count | Share | Meaning
I | 478 | 25.71% | Stored as JPEG images when a full reference frame is needed.
P | 1,381 | 74.29% | Stored using motion vectors and residual coefficients.
Total | 1,859 | 100% | Processed at 25 FPS for an approximate duration of 74.36 seconds.

Chunk Log Details

The project writes compressed binary chunks and lists them in index.m3u8. Each chunk starts at an I-frame when possible, so decoding has a reliable reference frame at chunk boundaries.

Chunk | Frame Range | Frame Count | Duration | Logged Size
0 | 0 to 165 | 166 | 6.64 s | 3.99 MB
1 | 166 to 290 | 125 | 5.00 s | 2.66 MB
2 | 291 to 415 | 125 | 5.00 s | 3.44 MB
3 | 416 to 541 | 126 | 5.04 s | 2.93 MB
4 | 542 to 666 | 125 | 5.00 s | 2.00 MB
5 | 667 to 810 | 144 | 5.76 s | 2.84 MB
6 | 811 to 967 | 157 | 6.28 s | 3.92 MB
7 | 968 to 1097 | 130 | 5.20 s | 2.59 MB
8 | 1098 to 1241 | 144 | 5.76 s | 3.53 MB
9 | 1242 to 1410 | 169 | 6.76 s | 3.65 MB
10 | 1411 to 1536 | 126 | 5.04 s | 4.00 MB
11 | 1537 to 1661 | 125 | 5.00 s | 4.90 MB
12 | 1662 to 1809 | 148 | 5.92 s | 4.07 MB
13 | 1810 to 1858 | 49 | 1.96 s | 1.31 MB

Why Segmentation Matters

Segmentation means the video is divided into smaller time-based chunks instead of being handled as one large file. In this project, the target chunk length is about 5 seconds. This design is useful because a player can start working with the first chunk while the remaining chunks are still loading or being processed. The user does not need to wait for the full video file before playback can begin.

Smaller chunks improve the user experience in several ways. Startup can be faster because the first playable unit is small. Seeking is cleaner because the player can jump near the requested time and load only the chunk around that point. If one chunk fails, the application can retry only that chunk instead of repeating the whole video transfer. It also allows progress and buffering to feel more responsive, because the application works with many manageable units instead of one large binary.

Segmentation also improves efficiency. Memory use is lower because the application can decode or cache a small chunk at a time. Network usage can be more practical because only the needed time range has to be loaded. It also creates a path for future improvements such as quality switching, preview loading, partial downloads, and better seeking behavior.

Why About 5 Seconds

A 5-second target is a practical balance. Very tiny chunks would create too many files and too much metadata overhead. Very large chunks would slow startup and seeking. A 5-second chunk is short enough for quick buffering and long enough to keep the number of chunks reasonable. The project waits for an I-frame before starting a new chunk, so some chunks are slightly longer than 5 seconds, but each new chunk has a strong decoding starting point.

How We Can Use It

The generated index.m3u8 file works like a map. It stores the duration of each chunk and the path to the matching compressed file. A web video feature can read that playlist, load the first chunk for quick playback, continue loading later chunks as the user watches, and jump to the right chunk when the user seeks to another timestamp.
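
A sketch of reading that playlist back, assuming the minimal format sketched earlier in this report:

entries = []
duration = None
with open("index.m3u8") as f:
    for line in f:
        line = line.strip()
        if line.startswith("#EXTINF:"):
            duration = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#"):
            entries.append((duration, line))  # (seconds, chunk path)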

Visual Frame References

The report also includes reference images from the original and reconstructed outputs. These images make the compression result easier to understand because the reader can compare the source frame with the frame rebuilt by the custom decoder at the same timestamp.

The 10-second and 22-second samples are useful checkpoints. They show how the predicted frame, grayscale residual, DCT coefficients, quantization, binary storage, and zlib-compressed chunks still allow the project to reconstruct a recognizable frame after decoding.

Frame Reference At 10 Seconds

This pair compares the original frame at 10s with the reconstructed frame generated after decoding.

[Image: original frame at 10 seconds]
[Image: reconstructed frame at 10 seconds]

Frame Reference At 22 Seconds

This pair compares the original frame at 22s with the reconstructed frame generated after decoding.

[Image: original frame at 22 seconds]
[Image: reconstructed frame at 22 seconds]

Generated Files

The project creates files that make debugging and validation easier. The logs are especially important because they show the compression decisions and measured results of the run.

File | Purpose | File Size
info_log.txt | Summary metrics: data size, FPS, frame counts, encoded size, and chunk sizes. | 793 bytes
frame_log.txt | Per-frame record with frame number, frame type, and timestamp. | 52,551 bytes
index.m3u8 | Playlist-style chunk index containing durations and chunk paths. | 601 bytes
segments/segment_0.ts to segments/segment_13.ts | Compressed binary chunks produced by serialization and zlib compression. | 45.82 MB total
flow.mp4 | Motion visualization output created from optical flow. | 45.18 MB
reconstructed.mp4 | Rebuilt video output used to verify that decoding can reconstruct frames. | 14.64 MB

Learning Outcomes

This project shows the internal ideas behind video compression in a practical way. It demonstrates why modern video compression does not store every frame as a complete image, how motion prediction reduces repeated visual information, and how DCT plus quantization reduce residual data.

Conclusion

The project successfully converts raw video frames into a custom compressed representation and records the complete process in logs. From the logged raw padded size of 1252.65 MB to the internal encoded summary of 24.48 MB, the pipeline shows about a 51.17:1 size ratio. The produced chunk files total 45.82 MB, still about a 27.3:1 ratio against raw padded RGB data.

Most importantly, the project explains the internal workflow behind compression rather than hiding it behind external tools. That makes it useful as both a working prototype and a learning project for video compression concepts.