regex - Django regular expression confusion involving unicode
We're trying to use BeautifulSoup through Django to do text extraction. We've hit an odd bug, which we've traced down to the following behaviour that we don't understand.
If I issue the following in a standard Python prompt:
import re
print re.match("&#([0-9]+)[^0-9]", "&#x00BB;")
we get an output of None, as should be expected. However, when we put this code in sgmllib.py (which Django calls through a long string of calls via our website), Python does match this and returns a match object. It appears as though Django is somehow ignoring the x in the above string. I assume this has something to do with Unicode settings and so on, but we can't figure out why Django runs this differently from when we run the code ourselves in a vanilla Python 2.6 session.
Why does the regular expression above not match when run normally, but does match when Django runs it?
The 'x' is part of the string you are testing. If you don't account for it in the regular expression, it won't match, so Python is working correctly. I would be surprised if Django behaved differently; maybe there is a bug somewhere else. If adding the 'x' gives you problems in Django, you can try this:
>>> rc = re.match("&#[xX]?([0-9]+)", "&#x00BB;")
>>> rc.group(1)
'00'
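Note that the pattern above captures only the leading decimal digits of a hex reference ('00' rather than '00BB'). If the goal is to recover the actual code point, something along these lines might work. This is only a sketch: the names CHARREF and parse_charref are hypothetical and are not part of sgmllib, BeautifulSoup, or Django, and it assumes the reference is terminated by a semicolon.

```python
import re

# Hypothetical pattern (not taken from sgmllib): a numeric character
# reference is either hex (&#x...; or &#X...;) or decimal (&#...;).
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def parse_charref(text):
    """Return the code point of a numeric character reference at the
    start of text, or None if text does not start with one."""
    m = CHARREF.match(text)
    if m is None:
        return None
    hex_digits, dec_digits = m.groups()
    if hex_digits is not None:
        return int(hex_digits, 16)  # hex form
    return int(dec_digits, 10)      # decimal form
```

With this, parse_charref("&#x00BB;") and parse_charref("&#187;") both give 187, while a named entity such as "&raquo;" gives None and would have to be handled separately.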