Unique bytes in a sliding window as a file content signature

I was at a workshop a few months ago where a speaker pointed out a useful technique for spotting whether a file contains compressed data, e.g., a virus hidden in a script by compressing it so that it looks like a jumble of numbers. Compressed data has a close to uniform distribution of byte values (after all, compression is achieved by squeezing out redundancy, so what is left looks random), although your mileage may vary between compression techniques. The thought struck me that it would only take a minute to knock up an R script to check out this claim (my use of R is starting to branch out into solving certain kinds of general coding problems), and here it is:

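window_width=256  # window size in bytes
bytes=readBin(filename, what="raw", n=1e7)  # filename: path of the file to examine
# Sliding the window over every point is too much overhead, so step by 5
cnt_points=seq(1, length(bytes)-window_width+1, 5)
# Count the distinct byte values in each window of exactly window_width bytes
u=sapply(cnt_points, function(X) length(unique(bytes[X:(X+window_width-1)])))
# A window cannot hold more than min(window_width, 256) distinct byte values
plot(cnt_points, u/min(window_width, 256), type="l", xlab="Offset", ylab="Fraction Unique", las=1)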
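The claim can also be tried without any particular file to hand. The following is a minimal sketch, not the script used for the plots in this post: `unique_fraction` is a hypothetical helper name, the text-like input is synthetic (words drawn from a small made-up vocabulary), and base R's `memCompress` stands in for whatever compressor produced a real file.

```r
# Hypothetical helper: fraction of distinct byte values in each window.
unique_fraction <- function(bytes, window_width = 256, step = 5) {
  stopifnot(is.raw(bytes), length(bytes) >= window_width)
  starts <- seq(1, length(bytes) - window_width + 1, by = step)
  counts <- vapply(starts,
                   function(i) length(unique(bytes[i:(i + window_width - 1)])),
                   integer(1))
  # At most min(window_width, 256) distinct values fit in one window
  counts / min(window_width, 256)
}

set.seed(1)
# Synthetic text-like data (an assumption, standing in for an HTML file)
words <- c("the", "window", "byte", "file", "unique", "data", "<p>", "</p>", "\n")
text_bytes <- charToRaw(paste(sample(words, 20000, replace = TRUE), collapse = " "))
# Compress the same bytes in memory
gz_bytes <- memCompress(text_bytes, type = "gzip")

text_frac <- unique_fraction(text_bytes)
gz_frac <- unique_fraction(gz_bytes)
print(c(text = mean(text_frac), compressed = mean(gz_frac)))
```

With this input the compressed bytes should show a clearly higher fraction of unique values per window than the text they were made from; the exact numbers depend on the data and the compressor.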

The fraction of unique bytes per window (256 bytes wide) of an HTML file has a mean of around 15% (sd 2):
Number of unique bytes in n-byte chunks of an HTML file

while for a tgz file the mean is 61% (sd 2.9):
Number of unique bytes in n-byte chunks of a tgz file
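For comparison, it helps to know what perfectly uniform random bytes would give. Each of the 256 possible values is absent from a 256-byte window with probability (255/256)^256, so the expected fraction of distinct values is 1 - (255/256)^256, or about 63%, which is close to the 61% measured for the tgz file. The sketch below checks this by simulation; the random bytes come from `sample` and are an assumption standing in for ideal compressed data.

```r
# Expected fraction of distinct values in a 256-byte window of uniform bytes
expected_fraction <- 1 - (255/256)^256

set.seed(42)
# Simulated uniform random bytes (not taken from any real file)
random_bytes <- as.raw(sample(0:255, 1e5, replace = TRUE))
window_width <- 256
starts <- seq(1, length(random_bytes) - window_width + 1, by = 5)
u <- sapply(starts,
            function(X) length(unique(random_bytes[X:(X + window_width - 1)])))

# Compare the simulated mean and sd with the expected value
print(c(mean = mean(u / 256), sd = sd(u / 256), expected = expected_fraction))
```

A file whose windows sit well below this figure is unlikely to be compressed or encrypted throughout.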

I don’t have any scripts containing a virus, but I do have a pdf containing lots of figures (are viruses hidden in pieces or all together?):
Number of unique bytes in n-byte chunks of a pdf file

Do let me know if you find any interesting ‘unique byte’ signatures for file contents.

  1. Tony Aldridge
    July 22nd, 2013 at 05:44 | #1

    Thanks for posting. Appreciate the use of sliding window and sapply. Sliding data windows feature in process control. Regards, Tony

  2. July 23rd, 2013 at 04:04 | #2

    This signature problem (though not this particular solution) comes up in computer forensics pretty frequently. I worked on this particular problem a few years ago. Given high entropy data with little to no context (say in a 512-byte window), what is the data type? In your example, you could be looking at a Word file with an embedded JPEG, not necessarily an Excel file with a compressed payload. (It turns out JPEG does have a few telling byte sequences, mainly FF00.)

    Further, given uniform-looking data from a small window (4096 bytes in our case), is it compressed, or random (effectively, encrypted)? Autocorrelating the data gives a quite-good, threshold-based answer to the second question.

    See in particular Section 4.3.4, labeled page “S20”.

