Detecting gzip bombs

A lot of people on the Chinese internet claim a couple of lines of code are enough to detect a gzip bomb. I dug around, and it boils down to roughly this:

import gzip
import io
import requests

resp = requests.get(url, stream=True)   # url: the suspect resource

body = resp.raw.read()                  # the still-compressed response body
with gzip.open(io.BytesIO(body), 'rb') as g:
    g.seek(0, 2)                        # seek to the end of the decompressed data
    origin_size = g.tell()              # reported uncompressed size
    print(origin_size)

Same idea as gzip -l xxx.gz: the gzip format stores [CRC32][ISIZE] in its trailing 8 bytes, where ISIZE = uncompressed_length % 2³².
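
For reference, here's a minimal sketch of poking at that trailer directly, the way gzip -l does (the function name is mine, and it assumes you're holding one well-formed gzip member in memory):

import struct

def trailer_isize(gz: bytes) -> int:
    # A gzip member ends with CRC32 then ISIZE, both little-endian uint32.
    # ISIZE is only the uncompressed length mod 2**32, and nothing verifies it
    # until you actually decompress.
    if len(gz) < 18 or gz[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    crc32, isize = struct.unpack("<II", gz[-8:])
    return isize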

Countering this check is pretty easy, no? Just respond with Content-Encoding: deflate instead, and there's no gzip trailer left to inspect.
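
To see why, a quick illustration (body here is just a stand-in for a deflate response payload): the gzip-based check chokes before it gets anywhere near a trailer.

import gzip
import io
import zlib

body = zlib.compress(b"boom" * 1000)   # roughly what a Content-Encoding: deflate body looks like
try:
    with gzip.open(io.BytesIO(body), 'rb') as g:
        g.seek(0, 2)
except gzip.BadGzipFile as e:
    print("the trailer check has nothing to grab onto:", e)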

Besides, from a quick search, ISIZE can simply be forged... so a better approach is:

import zlib

MAX_OUTPUT = 50 * 1024 * 1024   # 50 MB cap on total decompressed output
CHUNK_OUT = 64 * 1024           # per-call output limit

def safe_decompress_gzip_stream(compressed_iterable):
    # compressed_iterable yields bytes chunks from the incoming request body
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS accepts the gzip wrapper
    total_out = 0
    for chunk in compressed_iterable:
        data = chunk
        while data:
            out = d.decompress(data, CHUNK_OUT)  # bound the output of each call
            total_out += len(out)
            if total_out > MAX_OUTPUT:
                raise ValueError("Exceeded decompression limit")
            if out:
                yield out
            data = d.unconsumed_tail  # input held back because the output cap was hit
    out = d.flush()  # drain whatever is still buffered
    total_out += len(out)
    if total_out > MAX_OUTPUT:
        raise ValueError("Exceeded decompression limit")
    if out:
        yield out
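
A hypothetical way to drive it (the file name and chunk size are made up), say for an uploaded .gz you don't trust:

def read_chunks(path, size=64 * 1024):
    # Feed the still-compressed body in pieces, like a request-body iterator would.
    with open(path, 'rb') as f:
        while chunk := f.read(size):
            yield chunk

total = 0
for piece in safe_decompress_gzip_stream(read_chunks("upload.gz")):
    total += len(piece)
print("decompressed bytes:", total)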

At the end of the day, what a gzip bomb blows up is memory. I started wondering: could you exploit LZ77 back-references hopping back and forth to build a CPU bomb instead?

Say the decoder grinds away for ages, only to end up with a tiny 1 KB file? A compression ratio as high as 114514%?

ChatGPT flat-out refused to answer. It did point the way, though:

Many tiny dynamic-Huffman blocks so the decoder rebuilds trees repeatedly (parsing overhead per block).
Constructing distance/length sequences that cause a lot of back-reference copying (expensive repeated copies, especially with overlapping distances).
Interleaving short literal runs with copies to create branch-heavy decode work.
Using many concatenated members/streams (or nested archives) to multiply cost.

How coy of OpenAI.
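
The last item on that list is easy enough to make concrete, and it's also exactly why the cap above has to apply to the whole stream rather than per member: gzip members can be concatenated back to back, and the first member's trailer says nothing about what follows it. A rough sketch (the sizes are arbitrary):

import gzip
import zlib

member = gzip.compress(b"\x00" * 65536)   # one small member, 64 KB uncompressed
blob = member * 100                       # members glued end to end
print(len(blob), "compressed bytes")

d = zlib.decompressobj(16 + zlib.MAX_WBITS)
total = 0
data = blob
while data:
    total += len(d.decompress(data))
    data = d.unused_data                  # whatever follows the member that just ended
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
print(total, "decompressed bytes")        # 100 x 64 KB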
