- Norbert page (parser for robots.txt files): http://osjava.org/norbert/
- Specification of the robots.txt file: http://www.robotstxt.org/wc/norobots-rfc.html
- About robots.txt, by Baidu (Japanese): http://www.baidu.jp/search/robots.html
- List of crawlers: http://www.robotstxt.org/wc/active/html/index.html
Norbert is a parser and utility for locating and checking the robots.txt file of a web site.
(To get the Norbert utility, go to http://osjava.org/norbert/, listed above.)
However, Norbert sometimes fails to check and parse the robots.txt file correctly even though the file exists in the top-level directory of the web site. Norbert compares the directive names (the "tags" such as User-agent, Allow and Disallow) case-sensitively, but some web sites write them in a different case (refer to http://www.robotstxt.org/wc/norobots-rfc.html). On such lines the check fails and Norbert does not pick up the directives.
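The failure described above can be reproduced in isolation. The following is a minimal sketch, not Norbert's actual code: the class and method names are hypothetical, and it assumes the parser compares each line against a fixed spelling such as "Disallow:".

```java
// Hypothetical demo class; not part of Norbert.
final class CaseSensitivityDemo {

    // Case-sensitive check: matches only the exact spelling "Disallow:".
    static boolean matchesExact(String line) {
        return line.startsWith("Disallow:");
    }

    // Case-insensitive check of the same prefix, using String.regionMatches
    // with ignoreCase = true.
    static boolean matchesIgnoringCase(String line) {
        String prefix = "Disallow:";
        return line.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        // A line as some sites write it: directive name in lowercase.
        String line = "disallow: /private/";
        System.out.println(matchesExact(line));        // prints false
        System.out.println(matchesIgnoringCase(line)); // prints true
    }
}
```

A rule written as "disallow:" is skipped by the first check, so a crawler relying on it would treat the path as allowed.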
Solution
You can avoid this problem by changing the parseTextForUserAgent method in org.osjava.norbert.NoRobotClient.java as follows.
Before checking the directive names in robots.txt, the modified code converts each line to lowercase and then runs the checks against the lowercase String object.
Customization Example
private RulesEngine parseTextForUserAgent(String txt, String userAgent) throws NoRobotException {
    .....
    try {
        while( (line = rdr.readLine()) != null ) {
            // trim whitespace from either side
            line = line.trim();
            // Added: keep a lowercase copy of the line for matching directive names.
            String lineToLowerCase = line.toLowerCase();
            // end of added code
            .....
            // Changed: test lineToLowerCase instead of line.
            if(lineToLowerCase.startsWith("user-agent:")) {
            }
            else {
                // Changed: test lineToLowerCase instead of line.
                if(lineToLowerCase.startsWith("allow:")) {
                    .....
                } else
                // Changed: test lineToLowerCase instead of line.
                if(lineToLowerCase.startsWith("disallow:")) {
                    .....
                } else {
                    .....
                }
            }
            .....
        }
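The idea behind the customization above can also be sketched as a small self-contained helper. This is an illustration under assumptions, not Norbert's code: the class DirectiveSketch and the method valueOf are hypothetical names. It differs from the snippet above in two ways: it lowercases with Locale.ROOT, because the no-argument toLowerCase() depends on the default locale (under a Turkish locale "DISALLOW" becomes "dısallow", with a dotless i), and it reads the value from the original line so that the case of the path is preserved.

```java
import java.util.Locale;

// Hypothetical helper; not part of Norbert.
final class DirectiveSketch {

    // Returns the value of the given directive (for example "Disallow"),
    // or null if the line does not start with that directive.
    static String valueOf(String line, String directive) {
        String trimmed = line.trim();
        // The lowercase copy is used only for matching the directive name.
        String lower = trimmed.toLowerCase(Locale.ROOT);
        String prefix = directive.toLowerCase(Locale.ROOT) + ":";
        if (!lower.startsWith(prefix)) {
            return null;
        }
        // Take the value from the original text: URL paths are case-sensitive.
        return trimmed.substring(prefix.length()).trim();
    }

    public static void main(String[] args) {
        System.out.println(valueOf("DISALLOW: /Private/Docs", "Disallow")); // prints /Private/Docs
        System.out.println(valueOf("user-agent: *", "User-agent"));         // prints *
    }
}
```

Keeping both strings around matters for the same reason in the Norbert change: only the prefix checks should use the lowercase copy, while the paths should still be taken from the original line.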