Perl’s XML::Twig – バカな火星人

I was asked to “Free the code” from my XML parsing experiment, so I will post some of it here. It may be a bit disappointing, though, since these are only short scripts, and they’re a bit ugly. I’ll explain the Perl one today and do the Haskell one sometime soon.

I was playing with Jim Breen’s Japanese dictionary and I wanted to make a list of the first kanji component in each entry. I wanted exactly one result per entry, so I used “(none)” if the entry has no kanji part. This is not a difficult problem, although XML makes it as slow and memory-intensive as many difficult problems.

use strict;
use warnings;
use XML::Twig;

my @keb = (); # for the results: one item per entry

sub entry {
    my ($t, $e) = @_;
    my $kt = "(none)";
    if (my $k = $e->first_child("k_ele")) {
        if(my $keb = $k->first_child("keb")) {
            $kt = $keb->text();
        }
    }
    $e->purge;
    push @keb, $kt;
}

my $twig = XML::Twig->new(
    twig_handlers => { entry => \&entry }
);
$twig->parsefile($ARGV[0]);
$twig->purge;

# now the results are in @keb

Using XML::Twig is quite simple. When I create the parser I tell it how to handle the elements I care about, and in this case I only care about “entry” elements. Each time the parser finishes reading an entry, it calls my entry subroutine, passing the twig itself as the first parameter, $t, and the entry’s element object as the second, $e. Inside the entry routine I can use DOM-style methods on $e, such as first_child and text, to extract the data I want. Notice that I call $e->purge once I’ve got the data out. This tells the parser that I won’t need that element again, so it can free the memory. Because only one entry has to be held in memory at a time, XML::Twig manages to parse a file that is too big for most modules that build the whole tree first.
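To try the handler-and-purge pattern without downloading the whole dictionary, here is a minimal sketch that parses a tiny inline document instead. The sample XML is made up for illustration: it only imitates the entry / k_ele / keb nesting used above, and the ASCII placeholders KANJI_A and KANJI_B stand in for real kanji. It also calls purge on the twig ($t->purge) rather than on the element, which is the form I am sure the twig object supports; treat it as one possible way to write this, not the only one.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data: the first entry has two kanji elements,
# the second entry has none at all.
my $xml = <<'XML';
<JMdict>
  <entry>
    <k_ele><keb>KANJI_A</keb></k_ele>
    <k_ele><keb>KANJI_B</keb></k_ele>
    <r_ele><reb>reading-1</reb></r_ele>
  </entry>
  <entry>
    <r_ele><reb>reading-2</reb></r_ele>
  </entry>
</JMdict>
XML

my @first_keb;    # one result per entry

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each <entry>, after it has been fully parsed.
        entry => sub {
            my ($t, $e) = @_;
            my $k   = $e->first_child('k_ele');
            my $keb = $k ? $k->first_child('keb') : undef;
            push @first_keb, $keb ? $keb->text : '(none)';
            $t->purge;    # discard everything parsed so far
        },
    },
);

# parse() takes the document as a string; parsefile() takes a file name.
$twig->parse($xml);

print "$_\n" for @first_keb;
```

For this sample the script prints KANJI_A for the first entry and (none) for the second. Swapping parse($xml) for parsefile($ARGV[0]) turns it back into the file-based version.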
