regex - Django regular expression confusion involving unicode -

- May 15, 2010

We're trying to use BeautifulSoup through Django to do some text extraction. We've got an odd bug that we've traced down to the following, which we don't understand.

If I issue the following at a standard Python prompt:

import re
print re.match("&#([0-9]+)[^0-9]", "&#x00bb;")

We get an output of None, which is what we expect. However, when we put this code in sgmllib.py (which Django calls through a long chain of calls via our website), Python does match this and returns a match object. It appears as though Django is somehow ignoring the x in the above string. I assume this has something to do with Unicode settings, but I can't figure out why Django runs this differently from when we run the code ourselves in a vanilla Python 2.6 session.

Why does the regular expression above fail to match when run normally, but match when Django runs it?
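A minimal sketch of why the interactive result is None, and of one way to check what the code path inside Django is actually matching. The helper name `explain_match` is hypothetical, and the decimal input is only an assumed example of what a different caller might pass; nothing in the question confirms that this is the cause.

```python
import re

# The pattern from the question: "&#", one or more decimal digits,
# then a single non-digit character.
PATTERN = "&#([0-9]+)[^0-9]"


def explain_match(pattern, text):
    """Return (repr of pattern, repr of text, matched substring or None)."""
    m = re.match(pattern, text)
    return repr(pattern), repr(text), (m.group(0) if m else None)


# Hex reference: "x" follows "&#", so [0-9]+ has no digit to match.
print(explain_match(PATTERN, "&#x00bb;"))

# Decimal reference (hypothetical input): the same pattern does match.
print(explain_match(PATTERN, "&#187;"))
```

Logging these repr() values from inside the patched sgmllib.py would show whether the pattern or the input differs from what was typed at the prompt.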

The 'x' is part of the string you are testing. If you don't account for it in the regular expression, it won't match, so Python is working correctly. I would be surprised if Django behaved differently; maybe there is a bug somewhere else. If adding the 'x' gives you problems in Django, you can try this:

>>> rc = re.match("&#[xX]?([0-9]+)", "&#x00bb;")
>>> rc.group(1)
'00'
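Note that the pattern above captures only the leading decimal digits of a hexadecimal reference ('00' out of '00bb'). A sketch of a pattern that treats decimal and hexadecimal references separately might look like this; `CHARREF` and `charref_codepoint` are hypothetical names, not part of sgmllib or Django.

```python
import re

# Two alternatives: "&#x" plus hex digits, or "&#" plus decimal digits.
# The trailing semicolon is treated as optional.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?")


def charref_codepoint(text):
    """Return the code point of a leading character reference, or None."""
    m = CHARREF.match(text)
    if m is None:
        return None
    hex_digits, dec_digits = m.groups()
    if hex_digits is not None:
        return int(hex_digits, 16)
    return int(dec_digits, 10)


print(charref_codepoint("&#x00bb;"))  # 187
print(charref_codepoint("&#187;"))    # 187
print(charref_codepoint("&amp;"))     # None
```

Returning the integer code point rather than a character keeps the sketch independent of the differences between Python 2 and Python 3 string types.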
