regex - Strip URL - Python


OK, how do I use a regex to remove http and/or www, turning http://www.domain.com/ into domain.com?

Assume x is any kind of TLD or ccTLD.

Input example:

http://www.domain.x/

www.domain.x

Output:

domain.x

If you want to use regular expressions instead of urlparse() or splitting the string:

>>> import re
>>> domain = 'http://www.example.com/'
>>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]
'example.com'

The regular expression might be a bit simplistic, but it works. It's also not replacing anything; I think getting the domain out is easier.
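For comparison, here is a rough sketch of the replacing approach the question literally asks for. The helper name `strip_prefix` and the pattern are my own assumptions, not part of the answer above. Note that it only removes the scheme and a leading "www.", so any other subdomain is kept.

```python
import re

# Hypothetical pattern: an optional scheme ("http://", "https://", ...)
# followed by an optional leading "www.", anchored at the start.
_PREFIX = re.compile(r'^(?:\w+://)?(?:www\.)?', re.IGNORECASE)


def strip_prefix(url):
    """Remove the scheme and a leading 'www.', then drop everything after the host."""
    host = _PREFIX.sub('', url.strip())
    # Cut off the path, query string, or fragment, if present.
    return re.split(r'[/?#]', host, maxsplit=1)[0]


print(strip_prefix('http://www.domain.x/'))
print(strip_prefix('www.domain.x'))
```

Both inputs from the question come out as domain.x. Unlike the extracting regex, something like sub.example.com stays sub.example.com.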

To support domains like 'co.uk', one can do the following:

>>> domain = 'http://www.google.co.uk/'  # input implied by the output below
>>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
>>> p.match(domain).groups()
('google', 'co.uk')

So you have to check the result for domains like 'co.uk' and join the two parts again in such a case. Normal domains should work OK. I could not make it work when there are multiple subdomains.
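The check-and-join step could look roughly like this. It is only a sketch: the character classes are written as `[a-zA-Z0-9-]` (my assumption about what was intended), and `registered_domain` and the tiny `TWO_PART_SUFFIXES` set are hypothetical names made up for illustration.

```python
import re

# The two-group pattern from above, split across two lines for readability.
P = re.compile(
    r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?'
    r'([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*'
)

# Hypothetical and deliberately incomplete list of two-part suffixes.
TWO_PART_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}


def registered_domain(url):
    """Return the domain, re-joining the name when the tail is a two-part suffix."""
    m = P.match(url)
    if m is None:
        return None  # no dot-separated host found
    name, rest = m.groups()
    if name is not None and rest in TWO_PART_SUFFIXES:
        return name + '.' + rest
    return rest


print(registered_domain('http://www.google.co.uk/'))
print(registered_domain('http://www.example.com/'))
```

The limits of the pattern remain. With no "www." in front, 'http://google.co.uk/' matches as (None, 'co.uk'), so the helper returns only 'co.uk', and several subdomains in front of a two-part suffix still come out wrong.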

one-liner without regular expressions or fancy modules:

>>> domain = 'http://www.example.com/'
>>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
'example.com'
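Spelled out step by step, the same idea might look like the sketch below. The function name `last_two_labels` is made up here, and splitting on '://' instead of replacing 'http://' is my own tweak so that other schemes are handled too.

```python
def last_two_labels(url):
    """Keep only the last two dot-separated labels of the host part."""
    # Drop the scheme, if any: 'https://host/path' -> 'host/path'
    rest = url.split('://', 1)[-1]
    # Drop the path: 'host/path' -> 'host'
    host = rest.split('/', 1)[0]
    # 'www.example.com' -> ['www', 'example', 'com'] -> 'example.com'
    return '.'.join(host.split('.')[-2:])


print(last_two_labels('https://www.example.com/some/page'))
print(last_two_labels('www.domain.x'))
```

Like the one-liner, this always keeps exactly two labels, so a 'co.uk' style address is reduced to just 'co.uk'.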
