How to Convert HTML Character Codes Into Unicode

I’ve improved my Python New York Times web scraper that extracts the global home page’s top articles. The latest version doesn’t clumsily replace HTML character codes like “&eacute;” with “é”. I wondered if there was a way for Python to convert them. It turns out there is.

Here’s the trick:

decode the raw UTF-8 HTML into a Unicode string:

html = unicode(html, 'utf8')
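
The line above is Python 2. As a rough sketch of the same step in Python 3, where the `unicode` built-in no longer exists, the bytes can be decoded with `bytes.decode`. The sample markup and the names `raw_html` and `html_text` are made up for illustration and are not from my scraper:

```python
# Hypothetical sample: the UTF-8 bytes a server might send for a headline.
raw_html = "<h1>Caf\u00e9</h1>".encode("utf-8")

# Decoding turns the byte string into a text (Unicode) string.
# Python 2 spelling of the same step: unicode(raw_html, 'utf8')
html_text = raw_html.decode("utf-8")

# The accented letter takes two bytes in UTF-8 but is a single character.
print(type(raw_html).__name__, len(raw_html))
print(type(html_text).__name__, len(html_text))
```

The length difference is the point: until the bytes are decoded, Python has no idea that two of them belong to one letter.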

unescape the HTML special characters using this function. (Not sure how this works yet.)

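import re
import htmlentitydefs  # Python 2 module (html.entities in Python 3)

def unescape(text):
    # Replace each HTML character reference or named entity in text
    # with the Unicode character it stands for.
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # numeric character reference, e.g. &#233; or &#xe9;
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity, e.g. &eacute;
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)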
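
As far as I can tell, the function works like this: the regular expression finds anything shaped like `&name;` or `&#digits;`, and `fixup` is called once per match. Numeric references are converted with `int` (base 16 when they start with `&#x`), named entities are looked up in the `name2codepoint` table, and anything unrecognized is returned unchanged. Here is a minimal Python 3 port to experiment with. It is my own sketch, assuming `chr` in place of `unichr` and `html.entities` in place of `htmlentitydefs`:

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace HTML character references and named entities with characters."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # Numeric character reference: hex (&#xe9;) or decimal (&#233;).
            try:
                if ref.startswith("&#x"):
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed or out of range: leave as is
        # Named entity: look up its code point, e.g. "eacute" -> 233.
        try:
            return chr(name2codepoint[ref[1:-1]])
        except KeyError:
            return ref  # unknown name: leave as is

    return re.sub(r"&#?\w+;", fixup, text)


print(unescape("caf&eacute; &#233; &#xe9; &bogus;"))
```

This should print `café é é &bogus;`, with the unknown `&bogus;` passed through untouched. Python 3.4 and later also ship `html.unescape` in the standard library, which is probably the better choice outside of an exercise like this.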

and then create a BeautifulSoup object from the unescaped Unicode HTML:

soup = BeautifulSoup(unescape(html))

Voila! Any strings returned by BeautifulSoup methods will render smart quotation marks, em-dashes, and any letter with accent marks correctly. My updated script is here.