Spurred on by the Russian spies' use of steganography, I decided to revisit the work I did in my undergrad thesis on the effectiveness of current steganalysis methods. (For those of you scratching your heads about steganalysis and steganography, read a quick primer in one of my previous posts: http://dvas0004.wordpress.com/2010/06/27/steganalysis-in-modern-day-anti-malware-systems/)
My thesis work was written in MATLAB, a very popular and powerful engineering language. However, I was determined to see if it was feasible to write the same sort of programs and algorithms in Python. It’s possible, and probably easier than using MATLAB 🙂
So, the Python program would ideally be something like this:
You can see that Python already provides most of the libraries I needed, including the all-important wavelet library (pywt), array manipulation and statistical libraries (numpy + scipy) and so on. I decided to tackle only a small portion of the methods presented by various researchers in steganalysis. The first algorithm (which I’ll call “moments”) is based on the research paper here. I won’t bore you with the mathematical details of the methods; suffice it to say that an image is first “decomposed” into several layers using a common method called “wavelet decomposition”. Wavelet decomposition is also what the JPEG 2000 format uses to compress images, for example (the classic JPG format uses a related transform, the discrete cosine transform). These decomposed layers are then passed through several statistical tests. The first is the moments test; the other (which I’ll call “stats”) is simply the mean, skewness and kurtosis of these layers.
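To make the “stats” idea concrete, here is a minimal sketch. It is not the program from this post. To stay dependency-free it uses a hand-written Haar transform, the simplest wavelet, where the real thing would more likely call `pywt.dwt2` and `scipy.stats`. The function names, the choice of Haar, the number of levels and the random test image are all assumptions made for illustration, and the sign and naming conventions of the detail bands may differ from pywt's.

```python
import random


def haar_dwt2(img):
    """One level of a 2-D Haar transform on a list-of-lists image.

    Height and width must be even. Returns four half-size bands:
    (approximation, horizontal detail, vertical detail, diagonal detail).
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            p, q = img[r][c], img[r][c + 1]          # top-left, top-right
            s, t = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((p + q + s + t) / 2)
            h_row.append((p + q - s - t) / 2)
            v_row.append((p - q + s - t) / 2)
            d_row.append((p - q - s + t) / 2)
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


def band_stats(band):
    """Return (mean, skewness, excess kurtosis) of all values in a band."""
    values = [v for row in band for v in row]
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:  # flat band: skewness and kurtosis are undefined, report 0
        return mean, 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def stats_features(img, levels=2):
    """Collect the three statistics of every detail band at each level."""
    features = []
    approx = img
    for _ in range(levels):
        approx, *details = haar_dwt2(approx)
        for band in details:
            features.extend(band_stats(band))
    return features


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical 8x8 greyscale "image" filled with random pixel values.
    image = [[random.randint(0, 255) for _ in range(8)] for _ in range(8)]
    print(stats_features(image))
```

Each decomposition level contributes nine numbers (three statistics for each of the three detail bands), so two levels give an 18-element feature vector per image. Those are the kind of values that would be plotted as crosses and compared between a clean image and its steganographic twin.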
Usually researchers do not apply the statistical tests and wavelet decomposition only to the images themselves; they also apply the tests to “predictor images”, which are sort of what they expect a normal image to look like. By comparing the results of the tests on a test image and a predictor image, you can tell whether a test image is a normal image or a steganographic image. In my current implementation I didn’t have time to implement these predictor image tests, so I just performed the tests on the actual images themselves.
I downloaded two test images and used a very good steganographic tool called STEGHIDE to hide information within the two images, and tested my Python program here. Now, keep in mind this is a very basic algorithm written in about 2 days by a complete Python beginner, but the results are still promising. Looking at the results for the “stats” method:
The red crosses show the normal, untouched image, and the green crosses show the image with steganographic data hidden within it. You can see that there is a slight difference between them. Similarly for the “moments” method:
The above graph didn’t turn out so well (I need to work on my scaling), but you can see the same principle: there are slight variations between the normal and steganographic images. In other words, if the images were exactly the same, or if the steganography was undetectable, you would see something like:
Note how the red and green crosses simply overlap, so there is nothing to distinguish between them. Now an astute observer will see that there is a problem with my algorithms as presented above. Even though there are differences between the red and green crosses, the differences are only slight, and the red and green crosses are all mixed together. Ideally, the green crosses would be on one side and the red crosses on the other. That would make it very easy to distinguish between them. That is actually the crux of steganalysis. Finding an algorithm that can do that reliably for every steganographic image out there would be awesome. That is also the reason why researchers usually implement the “predictor images” I mentioned before: to make the difference between red and green crosses more pronounced. Anyway, implementing that, and also implementing the SVM trainer, will be a job for another time soon to come.
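One way to put a number on how “mixed together” the crosses are is a Fisher-style separation score for a single feature: the squared gap between the two group means, divided by the sum of the two group variances. The snippet below is only a rough sketch of that idea, not part of the program linked in this post. The function name and the sample values are made up for illustration.

```python
def fisher_score(clean, stego):
    """Separation of one feature between two groups of images.

    Returns (difference of means)^2 / (sum of variances). A value near 0
    means the groups overlap; larger values mean they sit further apart.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def variance(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    gap = (mean(clean) - mean(stego)) ** 2
    spread = variance(clean) + variance(stego)
    if spread == 0:
        # No variation inside either group: perfectly separated or identical.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


if __name__ == "__main__":
    # Hypothetical feature values, invented purely for this example.
    mixed_clean, mixed_stego = [0.9, 1.1, 1.0, 1.2], [1.0, 1.2, 0.9, 1.1]
    apart_clean, apart_stego = [0.9, 1.1, 1.0, 1.2], [3.0, 3.2, 2.9, 3.1]
    print("mixed:", fisher_score(mixed_clean, mixed_stego))
    print("apart:", fisher_score(apart_clean, apart_stego))
```

A feature whose score stays close to zero across many image pairs is not helping much, whereas a feature with a consistently high score is the sort of thing a classifier such as an SVM could exploit.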
In the meantime, here is my Python code for the above. Stay tuned for more to come as soon as I find the time…