python - improve robots regular expression
I have written the following regexp for extracting robots.txt links:
re.compile(r"/\S+")
and it produces the following result:
/includes/
/modules/
/search/
/?q=user/password/
/?q=user/register/
/node/add/
/logout/
/?q=admin/
/themes/
/?q=node/add/
/admin/
/?q=comment/reply/
/misc/
//example.com/
//example.com/site/
/profiles/
//www.robotstxt.org/wc/
/?q=search/
/user/password/
/?q=logout/
/comment/reply/
/?q=filter/tips/
/?q=user/login/
/user/register/
/user/login/
/scripts/
/filter/tips/
//www.sxw.org.uk/computing/robots/
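For context, here is a minimal sketch of how such a pattern might be applied; the sample robots.txt text below is illustrative, not taken from the question:

import re

# Illustrative robots.txt excerpt (an assumption, not from the original post).
robots_txt = """
Disallow: /includes/
Disallow: /modules/
Disallow: //example.com/
"""

# Grab every slash-prefixed run of non-whitespace characters.
pattern = re.compile(r"/\S+")
print(pattern.findall(robots_txt))
# ['/includes/', '/modules/', '//example.com/']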
How can I exclude links that start with two slashes, like these:

//www.sxw.org.uk/computing/robots/
//www.robotstxt.org/wc/
//example.com/
//example.com/site/

Any ideas?
I'd suggest adding an if condition:

if not line.startswith('//'):
    # process the link here
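Alternatively, the exclusion can be built into the pattern itself with lookarounds; this variant is my own sketch, not part of the original answer:

import re

# (?<!\S) requires the leading slash to start a token,
# and (?!/) rejects a second slash immediately after it.
pattern = re.compile(r"(?<!\S)/(?!/)\S+")

text = "/includes/ //example.com/ /user/login/"
print(pattern.findall(text))
# ['/includes/', '/user/login/']

This way double-slash links never match in the first place, so no post-filtering step is needed.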