When I wrote that robots.txt will not prevent bad crawlers from accessing your private data, a reader wondered how a crawler can bypass robots.txt.
I think the original article was clear enough. Anyway, I will try again:
Imagine a sign that says “Trespassers will be prosecuted”. The sign just tells you that you are not expected to trespass. After reading the sign, you have to make up your mind whether to trespass or not. The sign itself will not stop you from proceeding further. It will just tell you that you shouldn’t.
Similarly, robots.txt just tells crawlers that they are not expected to visit some of the pages. If a crawler wants to, it can still visit those pages. This means that a bad bot can read the robots.txt file, learn which files the site owner wants to keep private, and then read exactly those files to look for confidential data.
What this essentially means is that when they read your sign, the good guys will stop. The bad ones will not. So if you really want to stop everybody from trespassing, try building a wall around your compound rather than putting up a sign.
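On a web server, the "wall" is access control that the server itself enforces, such as a password check. Here is a rough sketch of the idea, not a production setup: the credentials, the `/private/` path and the function names are all made up for illustration, and a real site would normally rely on its web server's or framework's built-in authentication over HTTPS.

```python
import base64
import hmac

# Hypothetical credentials, for illustration only.
EXPECTED_USER = "alice"
EXPECTED_PASSWORD = "s3cret"


def is_authorized(authorization_header):
    """Return True only if the header carries the expected Basic credentials."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        encoded = authorization_header[len("Basic "):]
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    password_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and password_ok


def status_for(path, authorization_header):
    """Return the HTTP status the server would send for this request."""
    if path.startswith("/private/") and not is_authorized(authorization_header):
        return 401  # The wall: no valid credentials, no content.
    return 200
```

Unlike a line in robots.txt, this check runs on the server for every request, so it applies to good and bad crawlers alike.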
So how can a crawler bypass robots.txt?
A crawler needs to do nothing to bypass robots.txt. On the contrary, a crawler has to do some extra work if it wants to follow the rules in robots.txt.
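To show what that extra work looks like, here is a minimal sketch of a well-behaved crawler's check, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are assumptions made up for illustration; a real crawler would download the file from the site instead of using a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_can_fetch(parser, user_agent, url):
    """The extra step a polite crawler takes before requesting a URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(polite_can_fetch(parser, "ExampleBot", "https://example.com/index.html"))
print(polite_can_fetch(parser, "ExampleBot", "https://example.com/private/report.html"))
```

Nothing on the server runs this check. It happens only inside crawlers that choose to perform it, which is why robots.txt works as a sign and not as a wall.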
Thanks Niyaz,
I know these things. But I want to download a whole directory full of images, which is protected by robots.txt. I know everything is possible. As a desktop programmer I can fool the registry, but being new to web tech, I found it difficult to get past robots.txt. As soon as I have in-depth knowledge of crawlers, I will write one to bypass robots.txt. I just asked for help writing it.
Enjoy
Yogesh,
Just because you can do something doesn’t necessarily mean that you should.
Yeah, you are right: a crawler can easily bypass robots.txt.
Is there any software that can download images from an FTP server protected by robots.txt?
If anyone knows, please send me its name or a link at this email: mesystech@yahoo.in
Thanks
Hm, it must be kind of frustrating to write such a post and still get comments on ‘bypassing’ the robots.txt or a robots.txt ‘protected’ folder…
Good luck anyhow on your educational efforts.
“Non ragioniam di lor, ma guarda e passa”
(“Do not bother about them, just look and move on”)
Dante, Divine Comedy, Inferno, Canto III
There must be a way. The robots.txt is just a sign, as you stated above, but how can you tell a scraper/crawler script that signs should be ignored in order to retrieve the desired results? All of the answers here are sad and limited, and hardly help any programmer who has already gotten this far… Research won’t help much, considering all the answers will come from the peanut gallery…
Somebody just post a decent explanation of how to bypass robots.txt. Don’t just talk about how it is possible. Actually post intelligible solutions or ideas.
How disappointing.
-Something useful
Did you even read the article? Or are you just trolling?