I’ve just finished reading an interesting paper called “Latent Social Structure in Open Source Projects” [pdf].
The authors looked at open source projects to discover whether the project members self-organize and how successful that self-organization is. They also tried to determine whether the ways that open source projects self-organize could provide useful lessons for building commercial software teams. This is a nice change from open source projects trying to learn from more traditional software projects.
I was particularly interested as one of the projects they looked at was Perl; the other four were Apache HTTPD, Apache Ant, PostgreSQL, and Python.
The authors detected community structure by data-mining the mailing lists used by the projects’ developers. They are aware that developers use other methods of communication, such as private email and IRC channels, but they considered mailing lists to be a good place to start.
There are some things in the paper that I am not sure about. They describe Perl and Python as examples of projects that are monarchist, with a project leader. I can’t comment on the Python project, but is Larry Wall really “at the helm making informed important decisions”? The paper contains a chart showing the development community structure in Perl from April to June 2007. They have taken the managers out of this chart, so I can’t tell if Larry was involved in any of the development being shown. (Larry would fall under their definition of a manager: a person with intimate knowledge of large parts of the project who links various sub-communities together.) I am aware that Larry is still involved in many aspects of Perl development, but I do feel that the project is much too big to think of any one person as being at its helm.
I did find the chart showing Perl development fascinating, though I have no idea what aspect of Perl development the sub-communities they show are involved in. There is a sub-community of Paul Marquis and Xiao Liang Liu, and another with Jonathan Stowe and Pelle Svensson. One of the communities showing active development contains Arthur Bergman, Leon Brocard, H. Merijn Brand and Jarkko Hietaniemi.
They also gave some information on the data gathered to work out the sub-communities. They looked at Perl mailing lists from 1st March 1999 to 20th June 2007 (the paper doesn’t state which list or lists). They counted 112,514 messages with 3,261 participants. They also extracted information from the code’s commit history, taking the author, time of commit and filename, and then matched the authors to email addresses. For this period of Perl they say that there were 92,502 commits. But the figure that really shocked me was that for all these commits there were only 25 developers. So we have a mailing list with more than 3,000 people on it but only 25 people actually committing the changes that were agreed!
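To get a feel for what that kind of counting involves, here is a rough Python sketch. It is my own illustration, not the authors’ tooling: the function names, the `author_emails` lookup table and all of the sample data are invented, and the paper’s real matching of committers to email addresses is likely more involved than an exact lookup.

```python
from collections import Counter
from email.utils import parseaddr


def normalise(from_header):
    """Return the bare, lower-cased address from a From: header."""
    return parseaddr(from_header)[1].lower()


def summarise(from_headers, commits, author_emails):
    """Count list participants and committers, and match the two.

    from_headers  -- one From: header string per mailing list message
    commits       -- (author, commit_time, filename) tuples
    author_emails -- hypothetical map of commit author id -> email address
    """
    posts = Counter(normalise(h) for h in from_headers)
    commit_counts = Counter(author for author, _when, _filename in commits)

    # Keep only the committers whose mapped address also posted to the list.
    matched = {}
    for author in commit_counts:
        address = author_emails.get(author, "").lower()
        if address in posts:
            matched[author] = address

    return {
        "messages": sum(posts.values()),
        "participants": len(posts),
        "commits": sum(commit_counts.values()),
        "committers": len(commit_counts),
        "matched": matched,
    }


# Invented sample data, purely for illustration.
from_headers = [
    "Alice Example <Alice@Example.org>",
    "bob@example.org",
    "Alice Example <alice@example.org>",
    "Carol <carol@example.org>",
]
commits = [
    ("alice", "2007-04-01T10:00:00", "src/a.c"),
    ("alice", "2007-04-02T11:30:00", "src/b.c"),
    ("dave", "2007-05-01T09:00:00", "README"),
]
author_emails = {"alice": "alice@example.org", "dave": "dave@example.org"}

summary = summarise(from_headers, commits, author_emails)
print(summary)
```

Even on toy data the shape is the same as in the paper’s figures: several people talking on the list, and a smaller set of names attached to the commits.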
I assume that they were looking at the core Perl language, and I am aware that Perl modules are being developed by thousands of developers, but I’m still fascinated by the concept that so few people are committers on what appears to be a vast project. I would love to know more about their data. Which list did they use? How many messages are there per month? Is this number declining over time? Maybe I’ll write and ask.
A lot of the things in the paper didn’t surprise me but one thing intrigues me. Do developers work better if they get to choose who to work with?