This page will teach you how to write a not-necessarily-very-good programme to beat some common captchas, but it will not provide any useful code to do so for you. It should give you an idea of how to go about defeating captchas not listed here. But mostly, I hope it will be instructive for anyone who wants to write a less easily defeated captcha in the future, since apparently you're all hopeless at it at the moment.
As everyone in the world knows by now, most websites and forums use "captchas" to try and stop computer programmes from posting fake comments containing adverts. "Captcha" stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". And as everyone in the world ought to have realised by now, they don't work.
There exist a number of ways around them. The most cunning and most effective, although the most difficult to set up, is to build a pornographic website and get real humans to solve the captchas for you in exchange for naked pictures.
But mostly, they're easy to get around because they're shit. This, for example, is the default captcha that comes with the now obsolete phpbb2:
Here is the code required to make this captcha machine readable:
function bank=solvefuzzy(bank)
[x y c n]=size(bank);

%First, greyscale the lot by taking the red channel
bank=bank(:,:,1,:);

%Now blur it slightly.
for (i=1:n)
    bank(:,:,1,i)=filter2(ones([3 3])/9, bank(:,:,1,i));
end

%Now threshold it.
bank=(bank<0.63);

%Now trim the borders.
bank(1:x,[1 y],:,:)=0;
bank([1 x],1:y,:,:)=0;

Here is the result of this algorithm on four example captchas:
Of course, cracking obsolete captchas isn't terribly useful, or wouldn't be if anyone bothered to update their forums, so here's a captcha from phpbb3:
function bank=solveclassy(bank)

%First, greyscale the whole thing
bank=mean(bank,3);

%Now threshold it.
bank=(bank<0.55);

That's two commands. Now, let's check the results.
But phpbb3 has another trick up its sleeve: a second type of captcha. This is the one used on www.phpbb3.com's forum, so presumably they trust it:
So let's see some code.
function result=solvefunky(bank)
[x y c n]=size(bank);
result=zeros([x,y,1,n]);

%First, determine the background colours.
%We assume the first colour with five continuous pixels of itself
%along the first row is background.
background=zeros(3,n);
for (i=1:n)
    colour=[0 0 0];
    j=1;
    count=0;
    while ((count<5) && (j<=y))
        if ((bank(1, j, 1, i)==colour(1)) && ...
            (bank(1, j, 2, i)==colour(2)) && ...
            (bank(1, j, 3, i)==colour(3)))
            count=count+1;
        else
            %New candidate colour: restart the run from this pixel.
            colour(:)=bank(1,j,:,i);
            count=1;
        end
        %Step along the first row; without this the loop never ends.
        j=j+1;
    end
    background(:,i)=colour(:);

    %Next, find areas that are that colour
    backgroundareas(:,:,i)=((bank(:,:,1,i)==background(1,i)) & ...
                            (bank(:,:,2,i)==background(2,i)) & ...
                            (bank(:,:,3,i)==background(3,i)));

    %Now, find areas of that colour smaller than 15 pixels
    temp=bwlabel(backgroundareas(:,:,i), 4);
    small=zeros([x,y]);
    numberofregions=max(temp(:));
    for (region=1:numberofregions)
        pixels=sum(temp(:)==region);
        if (pixels<15)
            thisarea=temp==region;
            %This leaves a lot of bits which aren't real, but we know from
            %looking at the captcha that the letters are outlined in just
            %one colour, so let's eliminate anything that's got more than
            %one colour adjacent to it. (In fact, we allow one pixel of a
            %different colour as this works better.)
            adjacentpixels=(imdilate(thisarea, [0 1 0;1 1 1;0 1 0])&~thisarea);
            red=bank(:,:,1,i);
            green=bank(:,:,2,i);
            blue=bank(:,:,3,i);
            ar=red(adjacentpixels);
            ag=green(adjacentpixels);
            ab=blue(adjacentpixels);
            if ((sum((ar~=ar(1)))<2) && ...
                (sum((ag~=ag(1)))<2) && ...
                (sum((ab~=ab(1)))<2))
                small=small|thisarea;
            end
        end
    end
    result(:,:,1,i)=small(:,:);
end

That results in this rather pleasing image:
Those of you who know me should already have worked out that this took me less than one evening, including grabbing all the pictures and writing this entry. I wouldn't bother if it was going to take longer. You know that. So if you employed a good programmer for a week to crack such a captcha you ought to be able to finish the job off. Then you'd have access to every phpbb3 forum out there.
Clearly there are false positives and things in these processed images: the bottom one in particular has a large false positive in the Z, and the last H has a bit missing where the L overlapped it. I don't think either of these would actually affect a good OCR algorithm (given that said algorithm would have the font used built into it and have an ideally oriented and scaled image of the letters, albeit with the odd mistake), and even if they did, well, we cracked the other three. If we assume we can crack 75% of these captchas, then we can break into a forum which allows us 5 attempts (which is pretty standard) 99.9% of the time.
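The 99.9% figure can be checked in a couple of lines. The sketch below is in Python rather than Matlab, only so that it runs without any toolbox. It assumes each attempt is an independent 75% shot, and the function name is made up for illustration.

```python
def chance_of_getting_in(per_attempt_success, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    # Assumption: attempts are independent, so failing overall means
    # failing every single attempt.
    chance_all_fail = (1.0 - per_attempt_success) ** attempts
    return 1.0 - chance_all_fail


if __name__ == "__main__":
    p = chance_of_getting_in(0.75, 5)
    print(f"{p:.4%}")  # prints 99.9023%
```

Five failures in a row happen with probability 0.25^5, a little under 0.1%, which is where the 99.9% comes from.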
phpbb3 also allows the user an almost ludicrous number of options for their captcha. This is good, as it means that a cracker will have a harder time beating the captcha in the general case. But in the specific case of the default settings, which almost everyone will use, this won't help at all.
So what's the solution? Personally, I use a bespoke text-based captcha. Image-based ones are hard to programme, which isn't a problem if you're doing something like phpbb, because it has to be hard to crack (oh dear), and text-based ones really aren't. Another problem with image-based solutions is that some devices or people can't read them, so there usually has to be a fallback, and then you have two links, of which a cracker need only outsmart the weaker. (Sorry for the mixed metaphor there.) I think bespoke text-based is good because there's no real motivation for a cracker to devote any time to cracking it, as they'll only get access to my websites, and if they do I can very easily change it the following evening. But it couldn't work for phpbb as you can't make a bespoke captcha for every user.
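For a sense of how little code a bespoke text-based captcha needs, here is a minimal Python sketch (Python rather than Matlab only so it runs anywhere). The question, the accepted answers and the function name are all invented for illustration and are not the captcha actually used on this site. The point is just that the whole thing is a question and a short list of answers, both of which can be swapped in minutes.

```python
# Hypothetical site-specific question; change both values whenever it gets cracked.
QUESTION = "What colour is the sky on a clear day?"
ACCEPTED_ANSWERS = {"blue", "light blue"}


def passes_captcha(answer):
    """Return True if the visitor's answer matches one of the accepted answers."""
    # Normalise case and whitespace so that "  Blue " still counts.
    normalised = " ".join(answer.lower().split())
    return normalised in ACCEPTED_ANSWERS


if __name__ == "__main__":
    print(QUESTION)
    print(passes_captcha("  Blue "))          # True
    print(passes_captcha("buy cheap pills"))  # False
```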
Some captchas are obfuscated further than these. Sometimes this is a simple case of drawing lines over and through the text. This is pretty easy to beat -- any good photo touching-up software has had this feature since the week after flatbed scanners were invented, and replicating it is not hard, even when the lines must be found automatically. A better solution is to deform the letters themselves, though this involves a very direct tradeoff: anything you do that makes letter shapes harder for a computer to identify will have the same effect for your legitimate users. Again, I would attack such a captcha by not attempting to restore the original image, but by developing an algorithm to characterise each... well, character based on its Euler number, the number of sharp corners in its outline and their relative locations, and maybe the Euler number of the shape you get if you dilate it a bit. I believe this could crack such a captcha with minimal training.
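To make the Euler number idea concrete: for a binary image it is the number of connected blobs minus the number of holes in them, so an L scores 1, an O scores 0 and a B scores -1, however the letter is bent. The following is a minimal pure-Python sketch (Python only so it runs without Matlab; the toolbox's bweuler does the same job). The tiny glyphs and all the names are made up for illustration, and a real attack would combine this with the corner features mentioned above.

```python
from collections import deque

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_regions(grid, target, neighbours):
    """Count connected regions of cells equal to `target` using flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in neighbours:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == target and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return regions


def euler_number(glyph):
    """Euler number of a binary glyph: foreground blobs minus holes."""
    cols = len(glyph[0])
    # Pad with background so everything outside the glyph is a single region.
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in glyph]
              + [[0] * (cols + 2)])
    blobs = count_regions(padded, 1, EIGHT)          # 8-connected foreground
    holes = count_regions(padded, 0, FOUR) - 1       # the outside is not a hole
    return blobs - holes


def parse(rows):
    """Turn rows of '#' and '.' into a grid of 1s and 0s."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


LETTER_L = parse(["#..", "#..", "###"])
LETTER_O = parse(["###", "#.#", "###"])
LETTER_B = parse(["###", "#.#", "###", "#.#", "###"])

if __name__ == "__main__":
    print(euler_number(LETTER_L), euler_number(LETTER_O), euler_number(LETTER_B))
    # prints: 1 0 -1
```

Stretching or shearing a letter changes none of these counts, which is why a feature like this survives the deformation that defeats template matching.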
Theoretically, human authentication is the best way, but apparently humans aren't very good at that. It's not always apparent from a name and an email address if a user is a human or a spambot. My proposed solution is a deliberately impossible captcha: you find or create an image, possibly of random abstract 'art', or a landscape, or a sort of randomly generated Rorschach ink-blot test, and ask the user for a vague, one-sentence description. Then a human would authenticate the user's account by seeing if the user's description of the image relates to that image in any way. It'd be a little subjective, but I really can't see it being cracked, except perhaps by Derren Brown concocting a sentence that would appear to describe any image. And people would learn to spot that sentence. It would still be susceptible to the porn crack, but then everything is, and honestly I think it'd be fairly easy to tell, with a pretty high degree of certainty, which descriptions of Rorschach ink-blots had come from the minds of teenage boys looking for naked pictures.
Plus, I think it'd offer a fascinating glimpse into the psyche of all prospective users of your forum.
You can download all the above code, and some general making-it-work gubbins, in the Code Factory, but you'll need Matlab to make it work, and you'll need the Image Processing Toolbox to make it run. If anyone wants to extend the code, do feel free. Complete code is available if you want it, though -- people sell it to prospective spammers.