Issue 15515: Regular expression match does not return
Created on 2012-07-31 18:08 by crouleau, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| RegexBug.py | crouleau, 2012-07-31 18:08 | |
| Messages (7) | |||
|---|---|---|---|
| msg167024 - (view) | Author: Caleb Rouleau (crouleau) | Date: 2012-07-31 18:08 | |
Version info: 2.7.1 (r271:86832, Feb 7 2011, 11:33:02) [MSC v.1500 64 bit (AMD64)] The program included never prints "done" because it never returns from re.match(). -- Caleb Rouleau |
|||
| msg167028 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2012-07-31 18:59 | |
That's because it uses a pathological regular expression (catastrophic backtracking). The problem lies here: (\\?[\w\.\-]+)+ |
|||
| msg167031 - (view) | Author: Tim Peters (tim.peters) | Date: 2012-07-31 19:14 | |
Matthew is right: the nested quantifiers can cause this to take a very long time when the regexp doesn't match. Note that the example cannot match, because nothing in the regexp can match the space before "warning" in the example string. But the nested quantifiers cause it to _try_ an enormous number of futile attempts. Under Python 2.7.1, it eventually does return, but it took over 15 minutes when I tried it on my laptop. Friedl's book "Mastering Regular Expressions" is a book-length treatment of how to write regexps that don't "take forever" when they fail to match, and that's highly recommended. Or start a discussion on comp.lang.python, and I'm sure someone will help you flesh out exactly what it is you do and don't want to match, and how to write a regexp that performs well on both matching and non-matching text (the bug tracker isn't an appropriate place for this). |
|||
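The growth described above is easy to observe on short strings. The following is a minimal sketch in Python 3 (the report used 2.7.1). The pattern in RegexBug.py is not quoted in full in this thread, so the sketch assumes a hypothetical pattern built from the nested-quantifier fragment plus a literal `: ` suffix, and a made-up input of `n` word characters followed by a space. The helper name `time_failed_match` is likewise only illustrative.

```python
import re
import time

# Nested-quantifier fragment quoted in the thread, followed by a literal ": "
# (the suffix is an assumption made for this illustration only).
PATHOLOGICAL = re.compile(r"(\\?[\w\.\-]+)+: ")


def time_failed_match(n):
    """Return (result, seconds) for a match attempt that is bound to fail."""
    text = "a" * n + " "              # the space is where ": " cannot match
    start = time.perf_counter()
    result = PATHOLOGICAL.match(text)  # tries every way to split the run of a's
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in range(10, 21, 2):
        result, seconds = time_failed_match(n)
        print(f"n={n:2d}  match={result}  {seconds:.4f}s")
```

Every attempt returns `None`, and the elapsed time should grow rapidly as `n` increases, since a run of `n` characters can be divided among iterations of the outer `+` in 2^(n-1) ways.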
| msg167035 - (view) | Author: Caleb Rouleau (crouleau) | Date: 2012-07-31 19:44 | |
Thanks for the help. Apologies for the poor understanding of regular expressions. Closing this issue. |
|||
| msg167038 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2012-07-31 19:48 | |
Make a distinction between a large number and infinity. You have a bad regexp: the matching time depends exponentially on the length of the string. Try with short strings. Use the regexp r"(\w:)(\\?[\w\.\-]+)((\\[\w\.\-]+)*)(\.[\w ]+): ". It's not a bug. |
|||
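As a rough check of the regexp suggested above, the sketch below (Python 3) runs it against one input that should match and one that should not. Both input strings are hypothetical, made up for this example, since the string from RegexBug.py is not quoted in the thread.

```python
import re

# Regexp suggested in the thread. The repeated group must now begin with a
# literal backslash, so a run of word characters can no longer be divided
# between iterations in many different ways.
FIXED = re.compile(r"(\w:)(\\?[\w\.\-]+)((\\[\w\.\-]+)*)(\.[\w ]+): ")

# Hypothetical inputs, invented for illustration.
matching = r"C:\dir\sub\file.c: warning"
non_matching = "C:\\dir\\" + "a" * 1000 + " warning"  # contains no ".ext: " part

m = FIXED.match(matching)
print(m.group(1), m.group(2), m.group(3), m.group(5))

# Fails without the exponential blow-up, even on a long input.
print(FIXED.match(non_matching))
```

The point of the rewrite is that the failure path only has to step back through the long run of characters once, rather than retrying every possible partition of it.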
| msg167042 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2012-07-31 19:58 | |
It's probably inappropriate for me to mention that the alternative 'regex' module on PyPI completes promptly, so I won't. :-) |
|||
| msg167054 - (view) | Author: Tim Peters (tim.peters) | Date: 2012-07-31 21:16 | |
Matthew, yes, PyPI's regex module implements regular expressions of the "computer science" (as opposed to POSIX) sense. See Friedl's book for a full explanation. Short course is that regex's flavor of regexp matching is linear-time, but cannot support "advanced" features like backreferences. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:33 | admin | set | github: 59720 |
| 2012-07-31 21:16:17 | tim.peters | set | messages: + msg167054 |
| 2012-07-31 19:58:12 | mrabarnett | set | messages: + msg167042 |
| 2012-07-31 19:48:36 | serhiy.storchaka | set | nosy: + serhiy.storchaka messages: + msg167038 |
| 2012-07-31 19:44:56 | crouleau | set | status: open -> closed messages: + msg167035 |
| 2012-07-31 19:14:55 | tim.peters | set | resolution: not a bug messages: + msg167031 |
| 2012-07-31 18:59:38 | mrabarnett | set | messages: + msg167028 |
| 2012-07-31 18:08:57 | crouleau | create | |

