SourceAvengers Blog

0x00-0xff, Indiana, United States
SourceAvengers Blog - I am a junior in college, have competed in and won multiple Capture the Flag events, competed in the Indiana CCDC for two years, and am the founder of The Computer Security Group of Indiana University Southeast. I enjoy network security, penetration testing and programming. I also greatly enjoy video games and action movies.

Friday, November 12, 2010

Email Scraping, Removing Anti-OCR Protections, and Fixing These Problems

Okay, I am going to just jump right in. There was a certain website, which shall remain nameless, that had a weakness in its function for searching for people who were a part of the website. The vulnerability was an exposed folder holding all of the 9000+ email address images (with multi-colored anti-OCR protections) that had previously been loaded in the people searcher. I didn't really know there were that many until I had a program count them: there were around 8500, but the number kept increasing until they were all reset a day or two later, and then new ones started appearing.

Okay: anti-OCR mechanisms that use a color different from the color of the text (aka black) PHAIL. The big reason is that someone can EASILY go in and turn every pixel that is not white or black to white, effectively removing all those protections. Secondly, using a common font for the images also phails. A better option would be to have the software that generates the image randomize between 3-5 fonts which are different enough to throw off OCR programs. aspriseOCR is highly recommended if you know Java; it is VERY easy to implement and extremely accurate. The final step in this OCR process is to read out ALL of the images' text into a text file, and then you have a list of thousands of email addresses. Remember, do not use this information for malicious purposes.
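
To show how little work the color-stripping step takes, here is a minimal sketch in Java. This is not the attached source code, just an illustration of the idea described above. The class name ColorNoiseFilter, the DARK_THRESHOLD value of 60, and the command-line arguments are all assumptions made up for this example, and the cutoff would likely need tuning for images with anti-aliased text.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class ColorNoiseFilter {
    // Assumed cutoff: a pixel counts as black text only if all three
    // channels are below this value. Tune it for the images at hand.
    static final int DARK_THRESHOLD = 60;

    /** Returns a copy of src where every pixel that is not near-black is painted white. */
    static BufferedImage keepOnlyDarkPixels(BufferedImage src) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int argb = src.getRGB(x, y);
                int a = (argb >>> 24) & 0xFF;
                int r = (argb >> 16) & 0xFF;
                int g = (argb >> 8) & 0xFF;
                int b = argb & 0xFF;
                // Mostly transparent pixels are treated as background, not text.
                boolean dark = a >= 128
                        && r < DARK_THRESHOLD
                        && g < DARK_THRESHOLD
                        && b < DARK_THRESHOLD;
                out.setRGB(x, y, dark ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ColorNoiseFilter <input image> <output png>");
            return;
        }
        BufferedImage in = ImageIO.read(new File(args[0]));
        if (in == null) {
            System.err.println("could not decode " + args[0]);
            return;
        }
        ImageIO.write(keepOnlyDarkPixels(in), "png", new File(args[1]));
    }
}
```

The output is a plain black-on-white image, which is exactly what an OCR engine wants. That is the whole point: if the noise differs in color from the text, one loop over the pixels undoes it.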

I also found a second vulnerability: searching for the last part of an email address in the page finder application on this site let you list something like 15,000 email addresses in plain text, along with the name of the person who owns each one and their personal homepage. That being the case, a program could easily be created to capture only the email, name, homepage name, and homepage link, and tie all of that information together, possibly even in a database.
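
To make the "tie all of that information together" part concrete, here is a rough sketch of the kind of record structure such a program would build. It does no fetching at all. The tab-separated input format, the PeopleIndex class, and the field names are hypothetical, invented for this example, since the real listing's layout is not described here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class PeopleIndex {
    /** One linked record: everything the listing exposes about a single person. */
    static final class Entry {
        final String email;
        final String name;
        final String homepageName;
        final String homepageLink;

        Entry(String email, String name, String homepageName, String homepageLink) {
            this.email = email;
            this.name = name;
            this.homepageName = homepageName;
            this.homepageLink = homepageLink;
        }
    }

    /**
     * Builds an index keyed by lower-cased email address.
     * Assumed (hypothetical) line format: email TAB name TAB homepage name TAB homepage link.
     * Malformed lines are skipped; the first record seen for an email wins.
     */
    static Map<String, Entry> build(Iterable<String> lines) {
        Map<String, Entry> byEmail = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split("\t", -1);
            if (f.length != 4 || !f[0].contains("@")) {
                continue;
            }
            String key = f[0].trim().toLowerCase(Locale.ROOT);
            byEmail.putIfAbsent(key, new Entry(key, f[1].trim(), f[2].trim(), f[3].trim()));
        }
        return byEmail;
    }

    public static void main(String[] args) {
        // Made-up sample data using the reserved example.com domain.
        Map<String, Entry> index = build(Arrays.asList(
                "alice@example.com\tAlice\tAlice's Page\thttp://example.com/~alice",
                "not a valid record"));
        for (Entry e : index.values()) {
            System.out.println(e.email + " | " + e.name + " | "
                    + e.homepageName + " | " + e.homepageLink);
        }
    }
}
```

Once the records are keyed by email like this, dumping them into a database table is one more small step, which is why exposing the listing in plain text is such a problem.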

Finally, the folder that listed all the images was also served by Apache version 1.3.3.1, which is over 5 years old and vulnerable to multiple attacks (just google exploit-db and search the site for that Apache version). Anyway, you get my point: a server hosting this set of images most likely stores other private information which should not be released.

Anyway, I would like to state that I have talked with the security staff of this website, and they informed me they have fixed or will soon fix these bugs, so no one more malicious will exploit said vulnerabilities. Hope you guys enjoyed this read. I will be attaching the example source code (which works) showing how to remove anti-OCR protection. It also implements aspriseOCR, which is not free, but if you use an autoclick program you can have it automatically click the nag window so it will actually go through your images and output the text from them.

Source code in Java - http://www.mediafire.com/file/j1bxwdl7f880j8b/OcrSourceCode7z

===============================Disclaimer============================================
I would like to state that at no time will I disclose which party this vulnerability (set of vulnerabilities) was found at. I wish to keep the party's details anonymous due to the unknown effects that could follow if I were to release the party's information. Also, everything I am discussing here should ONLY be done for educational purposes, to study such possibilities. Do not, and I repeat, do not use any of this for malicious purposes.