How to Convert HTML Character Codes Into Unicode

I’ve improved my Python New York Times web scraper that extracts the global home page’s top articles. The latest version no longer clumsily hand-replaces HTML character codes like “&eacute;” with “é”. I wondered whether Python could convert them automatically. It turns out it can.

Here’s the trick:

  1. decode the raw UTF-8 HTML into a Unicode string:
html = unicode(html, 'utf8')
  1. unescape the HTML special characters using this function. (Not sure how this works yet.)
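import re
import htmlentitydefs  # Python 2 module (html.entities in Python 3)

def unescape(text):
    def fixup(m):
        text = m.group(0)  # the full match, e.g. "&eacute;"
        if text[:2] == "&#":
            # numeric character reference
            try:
                if text[:3] == "&#x":
                    # hexadecimal, e.g. "&#xe9;"
                    return unichr(int(text[3:-1], 16))
                else:
                    # decimal, e.g. "&#233;"
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity, e.g. "&eacute;"
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    # match "&name;", "&#123;" or "&#x7b;" and pass each match to fixup
    return re.sub(r"&#?\w+;", fixup, text)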
  1. and then create a BeautifulSoup object from the unescaped Unicode HTML (that is, after html = unescape(html)):
soup = BeautifulSoup(html)
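
For anyone else unsure how the unescape step works, here is a rough Python 3 port of the first two steps that can be run on its own. It is only a sketch: the name unescape_py3 and the sample bytes are made up for illustration, and chr and html.entities stand in for Python 2's unichr and htmlentitydefs. The regular expression finds anything shaped like a character reference, and the inner function decides whether it is hexadecimal, decimal, or a named entity.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs


def unescape_py3(text):
    """Replace HTML character references in text with the characters they name."""
    def fixup(match):
        ref = match.group(0)  # the whole reference, e.g. "&eacute;"
        if ref.startswith("&#"):
            # numeric reference: hex if it starts "&#x", otherwise decimal
            try:
                if ref[2] in "xX":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # not a valid code point, leave as is
        # named entity: look up the text between "&" and ";"
        try:
            return chr(name2codepoint[ref[1:-1]])
        except KeyError:
            return ref  # unknown name, leave as is

    # "&", an optional "#", one or more word characters, then ";"
    return re.sub(r"&#?\w+;", fixup, text)


# Hypothetical sample standing in for the bytes fetched from a page.
raw = b"caf&eacute; &#8212; &#x201C;quoted&#x201D; &bogus;"
html = raw.decode("utf-8")  # step 1: bytes to text
print(unescape_py3(html))   # step 2: references to characters
```

Unknown references such as &bogus; pass through unchanged, which matches the "leave as is" fallback in the original function.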

Voilà! Any strings returned by BeautifulSoup methods will now render smart quotation marks, em dashes, and accented letters correctly. My updated script is here.
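
If you are on Python 3.4 or later, the standard library's html.unescape should cover the same ground without a hand-written function. This is a small sketch rather than what my script does, and the sample strings are invented to mirror the cases mentioned above (smart quotes, an em dash, an accented letter).

```python
import html

# Invented samples: escaped text on the left, expected character(s) on the right.
samples = {
    "&ldquo;smart&rdquo;": "\u201csmart\u201d",  # curly double quotes
    "&mdash;": "\u2014",                          # em dash
    "&eacute;": "\u00e9",                         # e with acute accent (named)
    "&#233;": "\u00e9",                           # same letter, decimal reference
}

decoded = {escaped: html.unescape(escaped) for escaped in samples}
for escaped, result in decoded.items():
    print(escaped, "->", result)
```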