Perl Collocates – バカな火星人

Karen and I were talking about linguistics and textual analysis, and how she wanted to analyse the writings of the Perl community. So, to make a start we decided to write a short Perl script to extract word level n-grams from some text so we could start looking for interesting collocates.


$n=4;
undef $/;
@txt = split /\W+/, lc <>;
for($i = 0; @txt-$i > $n;  ++$i) {
    print "@txt[$i..$i+$n]\n";
}

(N-grams of 5 elements seem to be a good size for collocates, so we set $n=4.)

To look for interesting collocates we simply piped the output of that script through sort | uniq -c | sort -n | tail . As test data I ran the script against version 2 and 3 of the GPL. In version 2 the most common n-gram was “work based on the program”; but for version 3 it was “the gnu general public license”. That isn’t a particularly interesting result, but I’m sure we will find some when we look at more than 2 source documents.

1 comment