Message 148549 - Python tracker

http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 entities (see also the attached file for a dict generated from that table).
Currently html.entities only has 252 entities, organized in 3 dicts:
  1) name -> intvalue (e.g. 'amp': 0x0026);
  2) intvalue -> name (e.g. 0x0026: 'amp');
  3) name -> char (e.g. 'amp': '&');
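
For reference, the three existing mappings can be inspected like this. This is a minimal sketch that assumes Python 3's html.entities, where the dicts are named name2codepoint, codepoint2name and entitydefs:

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

# 1) name -> intvalue
amp_codepoint = name2codepoint['amp']   # 0x0026
# 2) intvalue -> name
amp_name = codepoint2name[0x0026]       # 'amp'
# 3) name -> char
amp_char = entitydefs['amp']            # '&'

print(hex(amp_codepoint), amp_name, amp_char)
```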

In HTML 5, some of the entities map to a sequence of 2 characters, for example ≂̸ corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING LONG SOLIDUS OVERLAY).

This means that:
  1) the current approach of having a dict with name -> intvalue doesn't work anymore, and a name -> valuelist dict should be used instead;
  2) the reverse dict for this would have to use tuples as keys, but I'm not sure how useful that would be (producing entities is not a common case, especially "unusual" ones like these);
  3) the name -> char dict might still be useful, and can easily become a name -> str dict in order to deal with the multichar entities.
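
The three points above could look roughly like the following. This is only a sketch on a hand-written two-entry sample: the dict names are hypothetical, and 'NotEqualTilde' is used as one HTML 5 name for the [U+2242, U+0338] pair mentioned earlier.

```python
# Hand-written sample, not the full HTML 5 table; dict names are hypothetical.
# 1) name -> valuelist
name2codepoints = {
    'amp': [0x0026],
    'NotEqualTilde': [0x2242, 0x0338],
}

# 3) name -> str: join the code points of each entity into one string.
name2str = {
    name: ''.join(chr(cp) for cp in codepoints)
    for name, codepoints in name2codepoints.items()
}

# 2) reverse dict: lists are unhashable, so the keys have to be tuples.
codepoints2name = {
    tuple(codepoints): name
    for name, codepoints in name2codepoints.items()
}

print(len(name2str['NotEqualTilde']))       # 2
print(codepoints2name[(0x2242, 0x0338)])    # NotEqualTilde
```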

Since 1) is not backward-compatible, the HTML 5 entities should probably go in a separate dict.

Also note that the entities are case-sensitive and some of them have alternative spellings (e.g. both 'amp' and 'AMP' map to '&'), so a reverse dict would have to pick one name per value. Having '&' -> 'amp' seems better than '&' -> 'AMP', but the right choice might not be obvious for all the entities, and getting it right requires some extra logic in the code.
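
To illustrate the collision, here is a small sketch. The sample dict is hand-written, and the preference rule (all-lowercase first, then shortest, then alphabetical) is just one possible assumption, not a settled design:

```python
from collections import defaultdict

# Hand-written sample; 'amp' and 'AMP' both map to '&'.
name2str = {'amp': '&', 'AMP': '&', 'lt': '<'}

# Naive inversion: the last name seen for a value silently wins.
naive_reverse = {value: name for name, value in name2str.items()}

def preferred(names):
    """Hypothetical rule: prefer all-lowercase, then shorter, then alphabetical."""
    return min(names, key=lambda n: (not n.islower(), len(n), n))

# Group all names by value, then pick one name per value explicitly.
names_by_value = defaultdict(list)
for name, value in name2str.items():
    names_by_value[value].append(name)

reverse = {value: preferred(names) for value, names in names_by_value.items()}

print(naive_reverse['&'])   # AMP
print(reverse['&'])         # amp
```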