Beam me up, Scotty.

Based on the messages we have received from family and friends, I’m sure that if Star Trek’s transporter technology worked, they would beam us out of Tokyo without hesitation.

But despite the best efforts of many so-called “news” media, I don’t feel afraid to be in Tokyo. I’m not unnamed Starfleet crewman #2. With the tricorder information I have here, I am confident that my continued exploration of Tokyo is much safer than being dematerialized, with its risks of being trapped in a pattern buffer for 70 years or being cloned by an inadvertent reflection off the ionosphere.

I’m glad to say that the radiation threatening my life, just like the transporter, does not yet exist.

Haskell sometime soon

Way back in October last year (but only two entries ago) I posted some Perl code and wrote that I’d post the Haskell sometime soon. Well, that is now; the code is below. I intend to rewrite it to use Parsec at some point, but I haven’t tried that yet, since this little hacky script works well enough. And look how long it has taken me to blog it!

module Main where

import Text.XML.HaXml.SAX

-- The scanner's state: looking for an <entry>, then for its <keb>,
-- then for the character data inside that <keb>.
data ParserState = FindEntry | FindKeb | FindText

scan :: ParserState -> [SaxElement] -> [String]
scan _ [] = []
-- An <entry> opened: start looking for a <keb>.
scan FindEntry ( (SaxElementOpen "entry" _) : es ) =
    scan FindKeb es
-- The <entry> closed before any <keb> appeared: record "(none)".
scan FindKeb   ( (SaxElementClose "entry") : es ) =
    "(none)" : scan FindEntry es
-- A <keb> opened: its character data is what we want.
scan FindKeb   ( (SaxElementOpen "keb" _) : es ) =
    scan FindText es
-- Skip the bare newlines between tags.
scan FindText  ( (SaxCharData "\n") : es ) =
    scan FindText es
-- The <keb> text itself: keep it and go back to looking for entries.
scan FindText  ( (SaxCharData txt) : es ) =
    txt : scan FindEntry es
-- Anything else is ignored.
scan st ( _ : es ) = scan st es

findKebs :: String -> [String]
findKebs i =
    let (es, _err) = saxParse "" i in
    scan FindEntry es

-- Read the XML on stdin, print one keb (or "(none)") per line.
main :: IO ()
main = interact (unlines . findKebs)

To understand how it works, the most important line is the type declaration “scan :: ParserState -> [SaxElement] -> [String]”, which Haskell does not actually require (the compiler could infer it). From that line we know that “scan” is a function that takes a ParserState as its first parameter and a list of SaxElements as its second, and returns a list of Strings. Everything else is a simple matter of recursion and pattern matching :-)
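
For a quick feel of what findKebs produces, here is a toy run. The fragment is made-up example data, not real JMdict, and I am assuming HaXml delivers each text node as a single SaxCharData chunk:

-- A toy JMdict-style fragment (hypothetical example data):
example :: String
example = "<root><entry><keb>日本</keb></entry><entry></entry></root>"

-- In GHCi: findKebs example  ==>  ["日本","(none)"]

The first entry has a keb, so its text comes out; the second has none, so it becomes “(none)”.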

三年間 (Three Years)

Karen and I have now been living in Japan for three years. When we decided to move here we thought it would be an adventure, and it has been. We’ve had some problems, we’ve had some fun, and we’ve had lots of fun-problem combinations. I think we made the right choice.

Perl’s XML::Twig

I was asked to “Free the code” from my XML parsing experiment, so I will post some of it here. It may be a bit disappointing, though, since these are only short scripts, and they’re a bit ugly. I’ll explain the Perl one today and do the Haskell sometime soon.

I was playing with Jim Breen’s Japanese dictionary and I wanted to make a list of the first kanji component in each entry. I wanted one result per entry, so I used “(none)” when an entry has no kanji part. This is not a difficult problem, although XML makes it as slow and memory-intensive as many difficult problems.

use strict;
use warnings;
use XML::Twig;

my @keb = (); # for the results

# Called once for each <entry> element in the file.
sub entry {
    my ($t, $e) = @_;
    my $kt = "(none)";
    # The first <keb> inside the first <k_ele>, if there is one.
    if (my $k = $e->first_child("k_ele")) {
        if (my $keb = $k->first_child("keb")) {
            $kt = $keb->text();
        }
    }
    $e->purge; # done with this element, so its memory can be freed
    push @keb, $kt;
}

my $twig = XML::Twig->new(
    twig_handlers => { entry => \&entry }
);
$twig->parsefile($ARGV[0]);
$twig->purge;

# now the results are in @keb

Using XML::Twig is quite simple. When I create the parser, I tell it how to handle the elements I care about; in this case that is only “entry” elements. When the parser finds an entry, it calls my entry subroutine, passing the entry’s object as the second parameter, $e. Inside the entry routine I can use DOM-style methods on $e to extract the data I want. Notice that I call $e->purge once I have the data out: this tells the parser that I won’t need that element again, so it can free the memory. That is how XML::Twig manages to parse a file that most other modules can’t.

Beating down the XML

XML is still a huge mess, but at least now I have managed to get a few programs that can handle it with reasonable-ish memory requirements.

For Perl, as I had thought, the XML::Twig module gave me a pleasant interface and was able to easily handle the document.

For Haskell it was a little trickier. I used the SAX parser in HaXml, but it is not like a regular SAX parser, since Haskell is so unlike any regular language. Instead of firing callbacks, the parser returns a lazy list of SAX events, so I had to make sure I consumed the list incrementally instead of forcing the whole thing into memory.
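
To make that concrete, here is a minimal sketch of the pattern. countEntries is a hypothetical consumer I am inventing just for illustration; it uses the same Text.XML.HaXml.SAX interface as my script above:

import Text.XML.HaXml.SAX (SaxElement (..), saxParse)

-- Count the <entry> elements in a document. saxParse builds its event
-- list lazily, and this walks the list once without keeping a reference
-- to earlier events, so the list can be collected as it is consumed.
countEntries :: String -> Int
countEntries input =
    let (events, _err) = saxParse "" input
    in  length [ () | SaxElementOpen "entry" _ <- events ]

My keb extractor follows the same discipline: scan emits each result as soon as it finds it, so the runtime never needs the whole event list at once.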

Now that I’ve dealt with the memory issue it appears that I have a speed issue to deal with next.

XML is a huge mess

I have a 39 MB XML file that I wanted to process. I wasn’t expecting it to be so difficult. Writing the code, in multiple languages, was not difficult. But running the programs was a big problem.

My first attempt was a simple Haskell program, but I had to kill it after it ate over 1.3 GB (yes, 1.3 GB) of RAM!

Haskell’s strings are known to be memory hogs (a String is a lazy linked list of boxed Chars, which costs tens of bytes per character on a 64-bit machine), and the HaXml module I was using was making them even worse by not decoding the UTF-8 text correctly. I decided to write a leaner Haskell program later, and switch to Perl to get the job done.

At this point I also decided to set a limit on the amount of memory the programs could consume. For a 39 MB file I hoped that ten times that would be enough, so I rounded up and set the limit at 512 MB.

But Perl, using the XML::LibXML module, couldn’t process the file with that memory limit. I also ran a quick one-liner in Erlang, just to watch it crash out of memory too. I’m going to try some other languages to see if I can find one that can work in 512 MB.

My next step is to try the XML::Twig module in Perl. I’ve had good experiences with it before. It won’t be as fast as LibXML, but it probably has the best chance of staying within my 512 MB limit. For Haskell, I think I’ll have to resort to a SAX-style parser.