Link in code comment no longer relevant for HTML unescaping
Documentation
The link in the code comment

    # see http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references

is no longer relevant and should be replaced:
- current link http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references
- more accurate link https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
The link should explain the source of the replacements table:
    # see http://www.w3.org/TR/html5/syntax.html#tokenizing-character-references
    _invalid_charrefs = {
        0x00: '\ufffd',  # REPLACEMENT CHARACTER
        0x0d: '\r',      # CARRIAGE RETURN
        0x80: '\u20ac',  # EURO SIGN
        0x81: '\x81',    # <control>
        0x82: '\u201a',  # SINGLE LOW-9 QUOTATION MARK
        0x83: '\u0192',  # LATIN SMALL LETTER F WITH HOOK
        0x84: '\u201e',  # DOUBLE LOW-9 QUOTATION MARK
        0x85: '\u2026',  # HORIZONTAL ELLIPSIS
        0x86: '\u2020',  # DAGGER
        0x87: '\u2021',  # DOUBLE DAGGER
        0x88: '\u02c6',  # MODIFIER LETTER CIRCUMFLEX ACCENT
        0x89: '\u2030',  # PER MILLE SIGN
        0x8a: '\u0160',  # LATIN CAPITAL LETTER S WITH CARON
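For context, the effect of this table can be observed through the public `html.unescape` function. The snippet below is only an illustrative sketch: the `samples` list and `decoded` dict are names made up for this example, and the chosen references are limited to entries visible in the excerpt above.

```python
import html

# Numeric character references whose code points appear in the
# replacements table quoted in this issue (hex and decimal forms).
samples = ["&#x80;", "&#128;", "&#x85;", "&#x8a;", "&#0;", "&#x0d;"]

# html.unescape() maps these references through the replacements table
# instead of calling chr() on the raw number.
decoded = {ref: html.unescape(ref) for ref in samples}

for ref, char in decoded.items():
    print(f"{ref!r} -> {char!r} (U+{ord(char):04X})")
```

For instance, `&#x80;` comes back as the euro sign (U+20AC) rather than the C1 control character U+0080, which is the behavior the numeric character reference end state in the WHATWG parsing spec describes.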