Defining Collocates

Now that Marty has decided to write a simple Perl script to pull collocates out of data for me, I need to give him a more precise specification of a collocate. Carmen Dayrell's paper, “A quantitative approach to compare collocation patterns in translated and non-translated texts”, contains a detailed section on how to decide what counts as a collocate.

The first step is to work out which words should be taken as nodes, but as I am interested in specific nodes, like the word “Perl”, I will skip this. Then we need to decide how to define a collocate. Dayrell suggests that a collocation should occur at least 4 times to count as significant, within a span of up to 4 words on either side of the node. Structural boundaries in the text should also be ignored.
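To make the specification concrete, here is a minimal sketch of it. It is written in Python for illustration and is not Marty's Perl script. The function name `find_collocates`, its parameters, and the tokenising rule are all my own assumptions. I have also assumed that "structural boundaries" means sentence and paragraph breaks, so the text is treated as one continuous stream of words.

```python
import re
from collections import Counter


def find_collocates(text, node, span=4, min_freq=4):
    """Count words within `span` tokens of `node`, keeping frequent ones.

    Hypothetical helper: the name, defaults and tokeniser are assumptions.
    """
    # Lowercase and keep only runs of letters, digits and apostrophes.
    # Punctuation and line breaks vanish, so sentence and paragraph
    # boundaries are ignored when the window is measured.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    node = node.lower()

    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Up to `span` words on either side of this occurrence of the node.
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)

    # Keep only words that co-occur with the node at least `min_freq` times.
    return {word: n for word, n in counts.items() if n >= min_freq}
```

Calling `find_collocates(corpus_text, "Perl")` would apply the thresholds above (a span of 4 and a minimum of 4 occurrences). Note that a word appearing in the windows of two nearby nodes is counted once for each, which may or may not be what the final script should do.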

While Marty does this, I am going to read Church and Hanks's work on word association norms and mutual information to see whether any of it will help me get better results.
