The Shape of Code

Unique bytes in a sliding window as a file content signature

July 21, 2013 Derek Jones

I was at a workshop a few months ago where a speaker pointed out a useful technique for spotting whether a file contains compressed data, e.g., a virus hidden in a script by compressing it to look like a jumble of numbers. Compressed data has a close to uniform distribution of byte values (after all, compression works by squeezing out redundancy), although your mileage may vary between compression techniques. The thought struck me that it would only take a minute to knock up an R script to check out this claim (my use of R is starting to branch out into solving certain kinds of general coding problems), and here it is:

window_width=256  # if this is less than 256 divisor has to change in call to plot

plot_unique=function(filename)
{
# Read up to 1e7 bytes of the file as raw bytes
t=readBin(filename, what="raw", n=1e7)

# Sliding the window over every point is too much overhead, so step 5 bytes at a time
cnt_points=seq(1, length(t)-window_width+1, 5)

# Number of distinct byte values in each window of exactly window_width bytes
u=sapply(cnt_points, function(X) length(unique(t[X:(X+window_width-1)])))
# Plot against the window start positions so the x-axis really is a byte offset
plot(cnt_points, u/256, type="l", xlab="Offset", ylab="Fraction Unique", las=1)

return(u)
}

dummy=plot_unique("http://shape-of-code.com/2013/05/17/preferential-attachment-applied-to-frequency-of-accessing-a-variable/")

dummy=plot_unique("http://www.shape-of-code.com/R_code/requirements.tgz")

For an HTML file, the fraction of unique bytes per window (256 bytes wide) has a mean of around 15% (sd 2):
Number of unique bytes in n-byte chunks of an HTML file

while for a tgz file the mean is 61% (sd 2.9):
Number of unique bytes in n-byte chunks of a tgz file

I don’t have any scripts containing a virus, but I do have a PDF containing lots of figures (are viruses hidden in pieces or all together?):
Number of unique bytes in n-byte chunks of a PDF file
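
The contrast between text and compressed data can be reproduced locally, without fetching any files. The following is a minimal sketch, not part of the original script: unique_fraction is a hypothetical helper, the word list is made-up stand-in text, and base R's memCompress (type "gzip") stands in for a real tgz file. It assumes the input is at least one window long and uses windows of exactly 256 bytes, stepped 5 bytes at a time.

```r
# Fraction of distinct byte values in each window of a raw vector.
# unique_fraction() is a hypothetical helper, not from the original script.
unique_fraction=function(bytes, window_width=256, step=5)
{
stopifnot(length(bytes) >= window_width)
starts=seq(1, length(bytes)-window_width+1, by=step)
# A window can never hold more than 256 distinct byte values
divisor=min(window_width, 256)
sapply(starts, function(X)
   length(unique(bytes[X:(X+window_width-1)]))/divisor)
}

set.seed(42)  # make the synthetic text reproducible
# Made-up vocabulary standing in for ordinary text
words=c("the", "shape", "of", "code", "window", "byte", "file", "signature")
text_bytes=charToRaw(paste(sample(words, 5000, replace=TRUE), collapse=" "))

# In-memory compression, standing in for a compressed archive
gzip_bytes=memCompress(text_bytes, type="gzip")

text_frac=unique_fraction(text_bytes)
gzip_frac=unique_fraction(gzip_bytes)

# Mean and standard deviation of the unique fraction, as percentages
print(round(100*c(mean=mean(text_frac), sd=sd(text_frac)), 1))
print(round(100*c(mean=mean(gzip_frac), sd=sd(gzip_frac)), 1))
```

The synthetic text uses only 20 distinct characters, so its unique fraction cannot exceed 20/256; real text such as an HTML page will sit somewhat higher, while the compressed bytes should land well above both.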

Do let me know if you find any interesting ‘unique byte’ signatures for file contents.

Tags: file contents, R, signature, virus

Comments (2)

Tony Aldridge

July 22, 2013 05:44 | #1

Thanks for posting. Appreciate the use of sliding window and sapply. Sliding data windows feature in process control. Regards, Tony
Alex

July 23, 2013 04:04 | #2

This signature problem (though not this particular solution) comes up in computer forensics pretty frequently. I worked on this particular problem a few years ago. Given high entropy data with little to no context (say in a 512-byte window), what is the data type? In your example, you could be looking at a Word file with an embedded JPEG, not necessarily an Excel file with a compressed payload. (It turns out JPEG does have a few telling byte sequences, mainly FF00.)

Further, given uniform-looking data from a small window (4096 bytes in our case), is it compressed, or random (effectively, encrypted)? Autocorrelating the data gives a quite-good, threshold-based answer to the second question.

http://dfrws.org/2010/proceedings/2010-302.pdf
See in particular Section 4.3.4, labeled page “S20”.

Cheers,
Alex
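
The autocorrelation idea in the comment above can be explored in R with acf() from the stats package. This is only a starting-point sketch, not the method of the linked paper: byte_autocorrelation is a hypothetical helper, the choice of 16 lags is arbitrary, and no decision threshold is implied. It contrasts uniformly random bytes (a stand-in for encrypted data) with deliberately periodic bytes, simply to show what a weak and a strong autocorrelation signal look like in a 4096-byte window.

```r
# Autocorrelation of byte values at lags 1..max_lag.
# byte_autocorrelation() is a hypothetical helper name.
byte_autocorrelation=function(bytes, max_lag=16)
{
r=acf(as.integer(bytes), lag.max=max_lag, plot=FALSE)
as.vector(r$acf)[-1]   # drop lag 0, which is always 1
}

set.seed(1)  # reproducible random bytes
# Uniformly random bytes: assumed stand-in for encrypted data
random_bytes=as.raw(sample(0:255, 4096, replace=TRUE))
# Repeating 0..15 pattern: an extreme case of structured data
periodic_bytes=as.raw(rep(0:15, length.out=4096))

# Largest absolute autocorrelation over the chosen lags
random_peak=max(abs(byte_autocorrelation(random_bytes)))
periodic_peak=max(abs(byte_autocorrelation(periodic_bytes)))
print(c(random=random_peak, periodic=periodic_peak))
```

Running the same helper over windows of real compressed and encrypted files would be the next step in checking whether a usable threshold separates them.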
