Way back in October last year (but only 2 entries ago) I posted some Perl code and wrote that I’d post the Haskell sometime soon. Well, that is now; the code is below. I intend to rewrite it to use Parsec at some point, but I haven’t tried that yet since this little hacky script works well enough; and look how long it has taken me to blog it!
module Main where

import Text.XML.HaXml.SAX (SaxElement (..), saxParse)

data ParserState = FindEntry | FindKeb | FindText

scan :: ParserState -> [SaxElement] -> [String]
scan _ [] = []
scan FindEntry ( (SaxElementOpen "entry" _) : es ) =
    scan FindKeb es
scan FindKeb ( (SaxElementClose "entry") : es ) =
    "(none)" : (scan FindEntry es)
scan FindKeb ( (SaxElementOpen "keb" _) : es ) =
    scan FindText es
scan FindText ( (SaxCharData "\n") : es ) =
    scan FindText es
scan FindText ( (SaxCharData txt) : es ) =
    txt : (scan FindEntry es)
scan st ( _ : es ) = scan st es

findKebs :: String -> [String]
findKebs i =
    let (es, erc) = saxParse "" i in
    scan FindEntry es

-- Minimal driver (not part of the original post): read the XML on stdin
-- and print one keb (or "(none)") per entry.
main :: IO ()
main = interact (unlines . findKebs)
To understand how it works, the most important line is the type declaration “scan :: ParserState -> [SaxElement] -> [String]”, which Haskell does not actually require (it would infer the type on its own). From that line we know that “scan” is a function that takes a ParserState as its first parameter and a list of SaxElements as its second, and returns a list of Strings. Everything else is a simple matter of recursion and pattern matching :-)
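The same recursion-plus-pattern-matching technique can be seen in isolation without pulling in HaXml. Here is a minimal sketch using a made-up Event type (Open, Close, Text are my stand-ins, not HaXml constructors) that mirrors what scan does:

```haskell
-- Hypothetical simplified event type standing in for HaXml's SaxElement.
data Event = Open String | Close String | Text String

data State = WantEntry | WantKeb | WantText

-- Each equation is one (state, next event) transition; the final
-- catch-all equation skips events the current state doesn't care about.
collect :: State -> [Event] -> [String]
collect _ [] = []
collect WantEntry (Open "entry"  : es) = collect WantKeb es
collect WantKeb   (Close "entry" : es) = "(none)" : collect WantEntry es
collect WantKeb   (Open "keb"    : es) = collect WantText es
collect WantText  (Text t        : es) = t : collect WantEntry es
collect st        (_             : es) = collect st es
```

For example, collect WantEntry applied to an entry containing a keb followed by an empty entry yields the keb text and then "(none)".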
XML is still a huge mess, but at least now I have managed to get a few programs that can handle it with reasonable-ish memory requirements.
For Perl, as I had thought, the XML::Twig module gave me a pleasant interface and was able to easily handle the document.
For Haskell it was a little bit trickier. I used the SAX parser in HaXml, but it is not like a regular SAX parser, since Haskell is so unlike any regular language. Instead of firing callbacks, the parser returns a lazy list of SAX events, so I had to make sure I processed the list without evaluating the whole thing into memory.
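The streaming idea is worth a tiny illustration. Here fakeParse is a stand-in I made up for this sketch (not a real HaXml API): it produces its output list lazily, so a consumer that pattern-matches one cons cell at a time never needs the whole list in memory, even if the input never ends:

```haskell
-- Stand-in lazy "parser": one event per input line, produced on demand.
fakeParse :: String -> [String]
fakeParse = map ("event: " ++) . lines

-- Only the demanded prefix of the event list is ever built, so this
-- works even on an infinite input stream.
firstEvents :: Int -> String -> [String]
firstEvents n = take n . fakeParse
```

Running firstEvents 2 on the infinite input cycle "ab\n" terminates immediately, because laziness means only two events are ever constructed.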
Now that I’ve dealt with the memory issue it appears that I have a speed issue to deal with next.
I have a 39 MB XML file that I wanted to process. I wasn’t expecting it to be so difficult. Writing the code, in multiple languages, was not difficult. But running the programs was a big problem.
My first attempt was a simple Haskell program, but I had to kill it after it ate over 1.3 GB (yes, 1.3 GB) of RAM!
Haskell’s strings are known to be memory hogs, and the HaXml module I was using was making them even worse by not decoding the UTF-8 text sensibly. I decided to write a leaner Haskell program later, and switch to Perl to get the job done.
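The memory hogging comes from String being a lazy linked list of boxed Chars, so on a 64-bit machine a 39 MB file read as a String can easily cost over 20 bytes per character. One leaner route (my assumption about what a rewrite could use, not what the script above does) is lazy ByteString I/O, which keeps the bytes in compact chunks:

```haskell
-- Lazy ByteStrings store text as chunks of packed bytes rather than
-- one cons cell per character.
import qualified Data.ByteString.Lazy.Char8 as L

-- Example consumer: count lines while streaming over compact chunks
-- instead of a [Char] list.
countLines :: L.ByteString -> Int
countLines = length . L.lines
```

The trade-off is that the Char8 interface treats each byte as a character, so proper UTF-8 handling would still need a decoding layer on top.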
At this point I also decided to set a limit on the amount of memory the programs could consume. For a 39 MB file I hoped that ten times that would be enough, so I rounded up and set the limit at 512 MB.
But Perl, using the XML::LibXML module, couldn’t process the file with that memory limit. I also ran a quick one-liner in Erlang, just to watch it crash out of memory too. I’m going to try some other languages to see if I can find one that can work in 512 MB.
My next useful step is to try the XML::Twig module in Perl. I’ve had good experiences with it before. It won’t be as fast as LibXML, but it probably has the best chance of surviving within my 512 MB limit. For Haskell, I think I’ll have to resort to a SAX-style parser.