regex - Strip URL - Python
How can I use a regex to remove http and/or www from a URL, so that http://www.domain.com/ becomes domain.com?
Assume that x is any kind of TLD or ccTLD.
Input example:
www.domain.x
Output:
domain.x
If you want to use regular expressions instead of urlparse() or splitting the string:
>>> import re
>>> domain = 'http://www.example.com/'
>>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]
'example.com'
The regular expression might be a bit simplistic, but it works. It's not replacing anything, but I think getting the domain out is easier.
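For comparison, a replacing variant closer to what the question literally asks for could be sketched as follows. The helper name strip_prefix is hypothetical, and the sketch assumes that only a leading scheme and a single leading "www." should be removed; unlike the pattern above, it leaves any other subdomain in place.

```python
import re

def strip_prefix(url):
    """Hypothetical helper: remove an optional scheme and a leading 'www.'."""
    # ^(?:\w+://)?  optional scheme such as "http://"
    # (?:www\.)?    optional leading "www."
    stripped = re.sub(r'^(?:\w+://)?(?:www\.)?', '', url)
    # Drop everything from the first "/" onward (path, query, and so on).
    return stripped.split('/', 1)[0]

print(strip_prefix('http://www.domain.com/'))  # domain.com
print(strip_prefix('www.domain.x'))            # domain.x
```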
To support domains like 'co.uk', one can do the following:
>>> domain = 'http://www.google.co.uk/'
>>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
>>> p.match(domain).groups()
('google', 'co.uk')
So you have to check the result for domains like 'co.uk' and join the result again in such a case. Normal domains should work fine. I could not make it work when there are multiple subdomains.
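The check-and-join step described here might look like the sketch below. The set TWO_PART_SUFFIXES and the function name registered_domain are assumptions made for illustration (a real list of such suffixes is far longer), and the character classes are written out as [a-zA-Z0-9-].

```python
import re

# Pattern from the answer, with character classes written as [a-zA-Z0-9-].
p = re.compile(
    r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*'
)

# Assumption: a tiny hand-written sample of two-part suffixes.
TWO_PART_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}

def registered_domain(url):
    """Hypothetical helper: rejoin the two groups when the tail is a known suffix."""
    name, tail = p.match(url).groups()
    if name is not None and tail in TWO_PART_SUFFIXES:
        return name + '.' + tail
    return tail

print(registered_domain('http://www.google.co.uk/'))  # google.co.uk
print(registered_domain('http://www.example.com/'))   # example.com
```

This inherits the limitation mentioned above: it has only been checked against www-prefixed inputs like the ones shown, and p.match() returns None for input that contains no dot at all.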
A one-liner without regular expressions or fancy modules:
>>> domain = 'http://www.example.com/'
>>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
'example.com'
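The one-liner only strips a literal 'http://'. For other schemes, the urlparse() route mentioned earlier could be sketched like this. It assumes Python 3, where urlparse lives in urllib.parse, and the function name host_without_www is hypothetical. It removes only a leading "www." and keeps any other subdomain.

```python
from urllib.parse import urlparse

def host_without_www(url):
    """Hypothetical helper: return the host part of a URL without a leading 'www.'."""
    # urlparse only fills in the host when the input has a scheme or starts
    # with "//", so prepend "//" for bare inputs such as "www.domain.x".
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    # Remove one leading "www." if present.
    return host[4:] if host.startswith('www.') else host

print(host_without_www('https://www.example.com/path'))  # example.com
print(host_without_www('www.domain.x'))                  # domain.x
```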