I am increasingly coming across people who think a robots.txt file can be used to prevent search engine crawlers from crawling sensitive data on their websites. Seriously.
This is just plain wrong. The data you exclude with a robots.txt file should be unwanted, redundant or useless data. An entry in the robots.txt file cannot protect your sensitive data from getting out. Sensitive data should not be left open on your website in the first place.
There are many malicious crawlers which crawl only the pages blocked by the robots.txt file on every website. I bet a lot of interesting stuff turns up in their search results.
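To see why a robots.txt entry protects nothing, it helps to remember that the file is public plain text that anyone can read. Here is a minimal sketch, assuming a made-up robots.txt (the `SAMPLE` contents and the `disallowed_paths` helper are hypothetical), that you could run against your own file to see exactly what it advertises to the world:

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every path named in a Disallow rule, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:  # an empty Disallow value blocks nothing
                paths.append(value)
    return paths


# Hypothetical robots.txt contents, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backups/   # nightly dumps
Disallow:
"""

print(disallowed_paths(SAMPLE))
```

If anything in that list is something you would not want a stranger to open, the fix is authentication or removing it from the server, not another Disallow line.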
Nice Post
Thanks Anish
How?
How a crawler can bypass robots.txt
[...] I wrote that robots.txt will not prevent bad crawlers from accessing your private data, a reader wondered how a crawler can bypass [...]
Anything that requests pages from a website – be it a web browser, a legitimate search engine or a malicious crawler – does it in exactly the same way. They all look exactly the same to any other computer on the Internet, including the computers running that website. (That is, unless they choose to distinguish themselves, but why would a malicious crawler do that?)
Robots.txt only works on good crawlers that take it upon themselves to enforce the rules it contains. Seriously, how did you think the Web worked?
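A rough sketch of what "enforcing the rules itself" looks like from the crawler's side, using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` agent name, the example.com URLs and the `polite_may_fetch` helper are all made up for illustration. The point is that the check runs inside the crawler; the server is never asked, and a crawler that skips this step simply requests the page like any browser would.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as a polite crawler would have downloaded them.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def polite_may_fetch(url: str, agent: str = "ExampleBot") -> bool:
    """A well-behaved crawler asks itself this before every request."""
    return parser.can_fetch(agent, url)


print(polite_may_fetch("https://example.com/private/report.html"))
print(polite_may_fetch("https://example.com/index.html"))
```

Nothing in this code runs on the website. A crawler that never calls `polite_may_fetch` is not blocked by anything.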
I think lawyers should look into this! They could have another pay day with this neglect.
Internet law allows for copyright simply by stating a site or software is copyrighted.
I was told by an attorney to just put copyright on my software that I wrote and it would be copyrighted. I imagine the same goes for web pages.
If an author does not want a work released then it should not be released!
Same should go for the robots.txt file.
Copyrighted material that is not wanted to be released should not be released.
I understand it would be wise to have security configured. That is another issue.
That is not the same thing. (There are laws for that, and the FBI takes care of such things; ask Kevin Mitnick.) The author should be able to allow or disallow access to copyrighted material. It should be law that if crawlers don’t obey, the author can claim damages.
I feel this could currently hold up in court. As yet, no individual has gone to court over it.
It is possible.