python - improve robots regular expression


I have made the following regexp for extracting links from robots.txt:

re.compile(r"/\S+(?:/+)")

and it gives the following result:

/includes/ /modules/ /search/ /?q=user/password/ /?q=user/register/ /node/add/ /logout/ /?q=admin/ /themes/ /?q=node/add/ /admin/ /?q=comment/reply/ /misc/ //example.com/ //example.com/site/ /profiles/ //www.robotstxt.org/wc/ /?q=search/ /user/password/ /?q=logout/ /comment/reply/ /?q=filter/tips/ /?q=user/login/ /user/register/ /user/login/ /scripts/ /filter/tips/ //www.sxw.org.uk/computing/robots/ 
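For context, this is how such a pattern might be applied; the sample robots.txt body below is made up for illustration (the question does not show how the file is fetched):

import re

link_re = re.compile(r"/\S+(?:/+)")

# made-up robots.txt body; a real one would be fetched over HTTP
robots_txt = """\
User-agent: *
Disallow: /includes/
Disallow: //example.com/
# see http://www.robotstxt.org/wc/ for details
"""

print(link_re.findall(robots_txt))
# ['/includes/', '//example.com/', '//www.robotstxt.org/wc/']

Note the last match: any absolute URL in the file matches too, because the pattern happily starts at the // right after the scheme.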

How can I exclude links that start with 2 slashes, like:

 //www.sxw.org.uk/computing/robots/  //www.robotstxt.org/wc/  //example.com/  //example.com/site/ 

Any ideas?
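One way is to exclude them in the regex itself: anchor the match to the Disallow field and require the path to start with a single slash. A minimal sketch, assuming only Disallow entries are wanted (the sample body is the same made-up one as above):

import re

# capture the path after "Disallow:" only when it starts with exactly one slash
disallow_re = re.compile(r"Disallow:\s*(/[^\s/]\S*)", re.IGNORECASE)

robots_txt = """\
User-agent: *
Disallow: /includes/
Disallow: //example.com/
# see http://www.robotstxt.org/wc/ for details
"""

print(disallow_re.findall(robots_txt))
# ['/includes/'] -- the // entry and the URL in the comment are both skipped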

Alternatively, I'd suggest adding an if condition:

if not line.startswith('//'):
    # process the link here
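Wired into the extraction, that check might look like this (same made-up sample body as above; note that startswith('//') needs no raw string, since '//' contains no escapes):

import re

link_re = re.compile(r"/\S+(?:/+)")

# made-up robots.txt body for illustration
robots_txt = """\
User-agent: *
Disallow: /includes/
Disallow: //example.com/
# see http://www.robotstxt.org/wc/ for details
"""

# keep only the matches that do not begin with a double slash
links = [link for link in link_re.findall(robots_txt)
         if not link.startswith('//')]
print(links)  # ['/includes/']

Filtering after matching keeps the pattern simple, and the same list comprehension works no matter how the links were extracted.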
