How to Convert HTML Character Codes Into Unicode

I’ve improved my Python New York Times web scraper that extracts the global home page’s top articles. The latest version no longer clumsily replaces HTML character codes like “&eacute;” with “é” one at a time. I wondered if there was a way for Python to convert them for me. It turns out there is.

Here’s the trick:
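One way to do this today, sketched under the assumption that you are on Python 3.4 or later, is the standard library's html.unescape function. This is not necessarily the approach the scraper itself uses, and the helper name decode_entities and the sample string are made up for illustration.

```python
import html


def decode_entities(text):
    """Return text with HTML character references replaced by Unicode characters."""
    # html.unescape handles named references (&eacute;) and numeric ones (&#233;).
    return html.unescape(text)


# Hypothetical sample text, not taken from a real article.
sample = "caf&eacute; &mdash; &ldquo;r&#233;sum&#233;&rdquo; &amp; more"
print(decode_entities(sample))
```

Because the function takes care of every standard entity, there is no need to keep a hand-written list of replacements.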


Why I Like Python More Than Perl

Update: After programming more and reading this post again, I realize I was still a noob when I wrote this and titled it “Why Python Is Better Than Perl.” Languages are tools. Tools are not objectively better or worse than others. It depends on the task.

A month ago, I began learning Perl. Two days ago, I began learning Python. I’m already a convert.

Python makes programming fun. It’s more readable, doesn’t have funky $, @, % symbols everywhere, and whitespace like tab and return handle program logic so that I can stop worrying about semicolons and curly braces. In addition, Python seems to be more widely favored by natural language processing researchers. There are quite a few at my workplace, and as a Perl user, I couldn’t communicate with them at all. They’d speak Python while I’d be blathering in Perl. In order to tap into the NLP community and all the NLP goodies (like this), I switched to Python to the dismay of a systems administrator (click here if you have no idea what that is) and Perl-loyalist who sits near me at work.

Despite the similarities between Perl and Python regular expressions, a way of matching text, I still favor Perl’s. But my Python translation of my Perl script that web scrapes the New York Times is only half the length and more understandable for humans. The Perl script runs twice as fast, but that’s something I can live with.
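To illustrate how close the two regex dialects are, here is a minimal Python sketch of the kind of headline match a scraper might use. The function name extract_headline and the sample HTML string are hypothetical; the pattern only mirrors the style of a Perl m/.../ match and is not taken from my Python script.

```python
import re

# Compiled once. re.IGNORECASE ignores tag case; re.DOTALL lets "." span newlines.
HEADLINE_RE = re.compile(
    r"<nyt_headline[^>]*>(.*?)</nyt_headline>",
    re.IGNORECASE | re.DOTALL,
)


def extract_headline(page):
    """Return the text between the headline tags, or None if nothing matches."""
    match = HEADLINE_RE.search(page)
    return match.group(1).strip() if match else None


# Hypothetical sample markup for illustration only.
sample_page = '<NYT_HEADLINE version="1.0">Example Headline</NYT_HEADLINE>'
print(extract_headline(sample_page))
```

The pattern syntax is nearly the same in both languages. The main difference is where the capture ends up: Perl puts it in $1, while Python returns a match object and you call group(1) on it.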

Now I can relate to this xkcd comic.


Amazon Kindle 3 Pros and Cons

I bought Amazon’s Kindle 3 a week ago and have read a mix of plain text files and PDFs (Edgar Allan Poe short stories and math textbooks). Last time, I wrote about my initial satisfaction, and a week later I’m still happy with the reading experience. The electronic ink doesn’t strain my eyes, page turning is fast, and the choice of fonts and page orientation is suitable. The Kindle renders PDF files well if you orient the page layout to landscape instead of portrait. You can see the page in close to normal size this way.


Why You Should Learn Perl and How to Install Modules Without Headaches

I’ve become a fan of the computer programming language Perl after my friend who majored in computer science recommended learning Perl over Python a month ago. For the longest time, however, I was getting pissed off trying to use it on my Mac.

Overview

Perl is a high-level computer language nicknamed the “Swiss Army chainsaw of programming languages” for its flexibility and adaptability. It can collect Edgar Allan Poe short stories from the Internet, calculate the similarities between them, and store that information in a database. Perl is free and is supported by tons of how-to books and online tutorials (e.g. here, here, and here). You don’t have to be a computer geek to learn Perl. You’d probably think of new applications for Perl by approaching it from a non-technical standpoint.

Now that I’ve convinced you to learn Perl, how do you get started?


I’m Sticking With WordPress’ Thematic Theme

I’m going to keep my Thematic WordPress theme. Nothing beats its simplicity and minimalism. It’s so crisp and clean. I don’t put up many photos or non-textual media, and Thematic lets readers focus on the writing. I just can’t find anything I like better.

I’d like to thank Ian Stewart for Thematic, which you can download here. I’ve modified Thematic on both my blog and my homepage by setting up a child theme that inherits properties of its parent.

Child themes allow you to tweak the design and look of your WordPress blog by overriding the properties the parent theme sets up. So why would you want to do this instead of directly altering the parent? If your theme is updated and you download that update, all your modifications will be erased. But if you have a child theme, they’ll be preserved since they’re in a separate file.

Here are some good places to learn about creating your own WordPress child theme.


Finally, My New Kindle

I just started using my new Kindle from Amazon ($139), and I like it so far. I’ve been lusting after an e-reader for a long time. I don’t like carrying heavy books and papers. Digital not analog for me, please. The iPad’s backlit screen strains my eyes if I read for a long time, and I can’t justify the $500 price tag.


Treme & Engineering: Fight Club & Anti-Consumerism

Treme

Treme is a television series about post-Katrina New Orleans. David Simon, the creator of The Wire, uses his story-telling skills to describe how a colorful group of characters try to piece their lives together after the hurricane. In season one, episode two, “Meet De Boys on the Battlefront,” John Goodman’s character, a Tulane University English professor, criticizes the school for disbanding its engineering departments while keeping its liberal arts majors. Disciplines like philosophy and history make us critical thinkers about human life and society, but at the end of the day we still need hard science and math to build the things that feed us, heat our homes, and improve quality of life on a physical level. Goodman’s character has a point.


Sex Diaries of a Recent College Grad

A friend of mine who recently graduated from college told me (s)he “used a dating site to ease a transition and came to terms with newfound promiscuity.” I shudder to think what the last part of that description even means. I encouraged my friend to write a post for my blog and enlighten me about his/her life after college. I was hoping for a witty David-Foster-Wallace-esque narrative. I received something à la Daily Intel’s Sex Diaries. Complete confidentiality was a stipulation given the sensitive nature of the info below. Like Ira Glass says before he plays a This American Life program that contains slightly mature content: for those with young children, the below piece of writing “acknowledges the existence of sexual acts.”


How to Cheat Wall Street: Swindle Them With Oil

Originally appeared in Saturday Evening Post on April 25, 1964.

The soybean scandal, one of the biggest swindles in American history, broke into the news last fall and is still producing tremors. Losses came to some $150 million [$1.1 billion in 2008 dollars, which is pretty hefty]. One large brokerage house was destroyed and 20 leading banks were stuck with millions in bad loans. The man behind the uproar is Anthony De Angelis, who in his spare time donated bicycles to boys in his neighborhood. A onetime hog processor, De Angelis became the biggest trader in fats and oils. Today, protesting his innocence, he faces criminal charges that could send him to prison for 185 years. Veteran Wall Street reporter Norman C. Miller reveals the behind-the-scenes story and raises some important questions that still confront the U.S. Agriculture Department and the businessmen who trade in commodities worth billions.


New York Times Perl Web Scraper

This Perl script scrapes The New York Times website.

#!/usr/bin/env perl

use strict;
use LWP::UserAgent;
use HTTP::Cookies;
use HTML::TreeBuilder 3;
use URI;    # URI->new_abs is used in scan_nyt_tree

my $OUTPUT_FILE = 'nyt_top_stories.txt';

# User agent needs to accept cookies to access NYT
my $cookie = 'nyt_cookie.lwp';
my $cookie_jar = HTTP::Cookies->new('file' => $cookie, 'autosave' => 1);

my $content = get_html('http://global.nytimes.com/');
my $tree = HTML::TreeBuilder->new_from_content($content);

# Stores homepage URLS
my @urls;
scan_nyt_tree($tree, 'http://global.nytimes.com/');
$tree->delete();

unlink $OUTPUT_FILE;

# Scrape article from each URL
foreach (@urls) {
    $content = get_html($_);
    # Replace all newline characters, needed for $rawtext extraction
    $content =~ s/\n//g;

    # Extracts headline, byline, dateline, and raw text
    my $headline;
    if ($content =~ m/<NYT_HEADLINE.*?>(.*?)<\/NYT_HEADLINE>/) {
        $headline = $1;
    }

    my $byline;
    if ($content =~ m/<NYT_BYLINE.*?>.*?<a\shref.*?>(.*?)<\/a>/) {
        $byline = $1;
    }

    my $dateline;
    if ($content =~ m/class="dateline">.*?Published:\s+([\w\s,]+)<\//) {
        $dateline = $1;
    }

    my $rawtext;
    if ($content =~ m/<NYT_TEXT.*?>(.*)<\/NYT_TEXT>/) {
        $rawtext = $1;
    }

    # Parses article's text by extracting everything between <p> tags
    my $text;
    while ($rawtext =~ m/<p>(.*?)<\/p>/g) {
        $text .= "\n\n$1";
    }
    $text =~ s/ +/ /g;              # REPLACE MULTIPLE SPACES WITH ONE
    $text =~ s/<.*?>//g;            # REMOVE HTML TAGS
    $text =~ s/&mdash;/--/g;        # REPLACE HTML EM-DASH CODE WITH 2 HYPHENS
    $text =~ s/&rsquo;|&lsquo;/'/g; # REPLACE SMART APOSTROPHES WITH '
    $text =~ s/&ldquo;|&rdquo;/"/g; # REPLACE SMART QUOTATIONS WITH "
    $text =~ s/&nbsp;/ /g;

    open(OUTPUT, ">>$OUTPUT_FILE") or die("Cannot open $OUTPUT_FILE\n");
    print OUTPUT "$headline\n$byline\n$dateline$text\n\n\n";
    close(OUTPUT);
}

# Stores a web page's HTML as string
sub get_html {
    my $url = $_[0];
    my $browser = LWP::UserAgent->new();
    $browser->cookie_jar($cookie_jar);

    # $response declared out here to be accessible after while loop
    my $response;
    # Prevents infinite loops
    my $redirect_limit = 5;
    my $x = 0;

    # Sends GET request, follows redirects until response code 200 received
    # Stores successful request URL
    my $responseCode = 0;
    while ($responseCode != 200 && $x < $redirect_limit) {
        $response = $browser->get($url);
        $responseCode = $response->code;
        print "$url\n";
        #print "response code: $responseCode\n";
        $url = $response->header('Location');
        # Stop if there is no redirect target left to follow
        last unless defined $url;
        $x++;
    }
    return $response->content;
}

# Picks out URLs of top NYT articles
sub scan_nyt_tree {
   my ($root, $docbase) = @_;
   foreach my $div ($root->find_by_tag_name('div')) {
       my $class = $div->attr('class') || next;
       if ($class eq 'story') {
           my @children = $div->content_list;
           for (my $i = 0; $i <= $#children; $i++) {
               if (ref $children[$i] and
                   ($children[$i]->tag eq 'h2' ||
                   $children[$i]->tag eq 'h3' ||
                   $children[$i]->tag eq 'h5')) {
                   my @grandchildren = $children[$i]->content_list;
                   # Keep the link only if the 1st grandchild is an <a> element
                   if (ref $grandchildren[0] and $grandchildren[0]->tag eq 'a') {
                       push (@urls, URI->new_abs($grandchildren[0]->attr('href') || next, $docbase));
                   }

               }
           }
       }
   }
   return;
}