The impact of language choice on github projects

philh · on Dec 14, 2009

The graphs in a grid layout need more whitespace below them. I frequently thought the labels were for the graph above, which made his interpretations make no sense.

> I suspect that the Perl result is due to the fact that it becomes harder and harder to contribute to a Perl codebase, the bigger it gets.

And a codebase in any other language retains its original complexity no matter how large it grows? A more reasonable explanation from the comments:

> My experience is that a lot of Perl projects are helper modules with focused scope, which get written as supporting units in the course of other work. They progress to a point where the author(s) consider(s) them satisfactory, then go into maintenance mode.

cortesi · on Dec 15, 2009

The surprise isn't that Perl's commit rate declines at all - it's that the decline seems to be so much sharper than that of other languages. A number of dedicated Perl programmers have fixated on that one speculative aside in my post - in fact, this is the first blog post I've ever written that has garnered me some genuinely nasty email. Perhaps a sign of a wee bit of defensiveness in the Perl community at the moment?

A number of people have argued that Perl projects somehow approach completion more quickly than other projects, and that this explains the decline. I'm pondering ways to test this idea from the data, but I must say that it sounds pretty implausible to me.

chromatic · on Dec 15, 2009

> I'm pondering ways to test this idea from the data, but I must say that it sounds pretty implausible to me.

It sounds exactly right to me, if you consider that some 21,000 of those Perl projects are CPAN distributions. There's a very strong bias among CPAN contributors to building small, reusable tools. I am the primary developer of some 30 of those Perl 5 projects. My commit rate has followed the graph of several commits in the initial stages, then very few after a year or eighteen months because the projects entered bugfix-only maintenance mode.

> Perhaps a sign of a wee bit of defensiveness in the Perl community at the moment?

You made some flippant provocative statements apparently based on poor assumptions unsupported by any of the evidence you presented. What did you expect?

cortesi · on Dec 15, 2009

What surprises me is that NONE of my charming Perl correspondents have argued that the Github data is simply atypical. As I point out in my post, it's not just possible, it's likely that this is the case. Instead, every single one has claimed with the type of absolute certainty you can only achieve by having no data at all that it's due to CPAN, and some magical tendency towards completeness that it imparts to projects... Again, I think it's implausible, but I'm open to suggestions of ways to settle the matter with actual data.

chromatic · on Dec 15, 2009

> ... it's due to CPAN

Given the huge jump in the number of Perl projects available on GitHub thanks to the recent BackPAN import, it's a reasonable conclusion. Likewise the commit history; CPAN's fourteen years old.

> ... charming ... the type of absolute certainty you can only achieve by having no data at all ... some magical tendency...

You'll have a much more fruitful discussion without this condescension.

> I'm open to suggestions of ways to settle the matter with actual data.

Easy suggestion: find the percentage of Perl projects in your study that came from the BackPAN import. See if they match the experiences of the CPAN contributors who've offered explanations.

cortesi · on Dec 15, 2009

It's genuinely difficult to avoid being condescending given the type of conversations I've been having about this. For what it's worth, it's not aimed at you specifically.

At any rate, I'm happy to ditch the snark, and talk about something concrete. I'm not sure what your suggestion is - I could separate out the CPAN projects (is there any way to do this programmatically?), and see if they show a greater decline in commits than non-CPAN Perl projects. But that wouldn't settle the issue, because I would still need some way to distinguish between projects that have decreasing commits because developer impetus is petering out as projects become more unwieldy, and projects that have decreasing commits because they are nearing "completion". I would also want to compare the results with an equivalent set of Python or Ruby libraries - choosing an appropriate set would be tricky.
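A rough sketch of the first half of that comparison, assuming the BackPAN import can be recognised by its owner account (gitpan) and that each project record carries an owner and a list of monthly commit counts. The field names `owner` and `monthly_commits` are hypothetical, not the dataset's actual schema, and the sample records are made up. It only measures how steep the decline is in each group; it says nothing about whether the cause is completion or unwieldiness.

```python
from statistics import median

def decline_ratio(monthly_commits, window=3):
    """Commits in the last `window` months divided by commits in the first `window`."""
    early = sum(monthly_commits[:window])
    late = sum(monthly_commits[-window:])
    return late / early if early else None

def compare_groups(projects, cpan_owner="gitpan"):
    """Median decline ratio for CPAN-import projects versus all other projects."""
    groups = {"cpan": [], "other": []}
    for p in projects:
        ratio = decline_ratio(p["monthly_commits"])
        if ratio is None:
            continue  # no early commits, so the ratio is undefined
        key = "cpan" if p["owner"] == cpan_owner else "other"
        groups[key].append(ratio)
    return {k: (median(v) if v else None) for k, v in groups.items()}

# Made-up records for illustration only.
sample = [
    {"owner": "gitpan", "monthly_commits": [10, 8, 6, 1, 0, 1]},
    {"owner": "gitpan", "monthly_commits": [5, 5, 0, 0, 0, 0]},
    {"owner": "someone", "monthly_commits": [4, 4, 4, 3, 3, 3]},
]
print(compare_groups(sample))
```

A lower ratio means a sharper drop between the first and last months of a project's history.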

kscaldef · on Dec 14, 2009

Most of the interpretation on these graphs seems like subjective speculation. One thing I think is worth pointing out is that the observation that "C, C++ and Perl projects are significantly more "top-heavy" than those in other languages, with a smaller core of contributors doing more of the work" may be almost entirely explained by the fact that projects in those languages also have a larger median number of contributors. If you postulate that the size of the core group of committers is the same for all languages and projects (in my experience, this number is very close to 1), then projects which attract more occasional contributors will appear more "top-heavy" because the core is a smaller fraction of the total population of contributors.
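To make the arithmetic concrete, here is a minimal sketch with made-up numbers (not taken from the dataset). It holds the core at one committer and varies only the total number of contributors; `core_share` is a hypothetical helper name.

```python
def core_share(core_size, total_contributors):
    """Fraction of all contributors who belong to the core group."""
    return core_size / total_contributors

# Same core size (1), different numbers of occasional contributors.
for total in (2, 5, 20):
    print(total, core_share(1, total))
```

The core's share shrinks as the total grows, even though the core itself has not changed.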

cortesi · on Dec 15, 2009

What you describe is possible, but I'd be surprised if the size of core committer groups was that stable as project size grows. Intuitively, I'd expect the size of the core committer group to grow more or less at the rate of the total active committer pool. At any rate, this can be tested quite easily, given an appropriate definition of what a "core committer" is... I released the dataset precisely to make it possible for other people to check this type of conjecture.
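One possible definition, offered only as a sketch: treat the core as the smallest set of authors whose commits cover some fixed share of the project's total. The 80% cutoff and the `commits_by_author` mapping are assumptions for illustration, not something defined in the post or the dataset.

```python
def core_committers(commits_by_author, threshold=0.8):
    """Smallest set of authors whose commits cover at least `threshold` of the total."""
    total = sum(commits_by_author.values())
    core, covered = [], 0
    # Walk authors from most to fewest commits until the threshold is reached.
    ranked = sorted(commits_by_author.items(), key=lambda kv: kv[1], reverse=True)
    for author, n in ranked:
        if covered >= threshold * total:
            break
        core.append(author)
        covered += n
    return core

# Made-up commit counts for two projects.
small = {"a": 90, "b": 5, "c": 5}
large = {"a": 45, "b": 40, "c": 10, "d": 5}
print(len(core_committers(small)), len(core_committers(large)))
```

Plotting core size against total contributor count per project would then show whether the core stays near 1 or grows with the pool.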

city41 · on Dec 14, 2009

I've been curious how much position of curly braces in C derived languages affects open source popularity.

Nowadays it seems like placing the opening bracket on its own line is more popular. I have found people are unusually turned off by these choices, i.e., if you prefer the bracket on its own line, code where it is at the end of the line really bothers you, and vice versa. I wonder how much this affects adoption of new projects.

boucher · on Dec 14, 2009

In an ideal world (and apparently this is already a feature in some Java IDEs) you would just tell your editor how you wanted curly braces and whitespace formatted, and it would always show it to you that way.

My understanding of the existing implementation (I believe it's in IDEA) is that it will do this, and save new changes back in whatever style had the highest frequency when the file was opened.
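The "highest frequency" part could look something like the sketch below. This is a guess at the idea, not how IDEA or any other editor actually does it; `dominant_brace_style` is a hypothetical name, and the check is deliberately crude (it looks only at how each line ends).

```python
def dominant_brace_style(source):
    """Return 'own_line', 'same_line', or None if neither placement is more frequent."""
    own_line = same_line = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == "{":
            own_line += 1   # opening brace alone on its line
        elif stripped.endswith("{"):
            same_line += 1  # opening brace at the end of a statement line
    if own_line == same_line:
        return None
    return "own_line" if own_line > same_line else "same_line"

sample = "int main()\n{\n    if (x) {\n        y();\n    }\n    while (z) {\n    }\n}\n"
print(dominant_brace_style(sample))
```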

I'm a fan of having project wide style guidelines. But, if you seriously don't use open source code because of where the curly braces are, you're being pretty stupid.

draegtun · on Dec 14, 2009

Interesting stats but there are too many "assumptions" made on what they actually mean!

Interestingly the latest language stats from GitHub (http://github.com/languages) shows this:

  Ruby          22%
  JavaScript    15%
  Perl          14%
  Python         9%
  Shell          7%
  C              6%
  PHP            6%
  Java           4%
  C++            4%
  Objective-C    2%

cortesi · on Dec 15, 2009

For comparison, after eliminating projects with 3 or fewer watchers and duplicate projects, my language breakdown looks like this:

	 Ruby 35.3%
	 Python 11.5%
	 JavaScript 9.4%
	 PHP 7.6%
	 C 5.4%
	 None 5.3%
	 Java 5.3%
	 C++ 4.0%
	 Objective C 3.8%
	 Perl 3.6%
	 C# 1.7%
	 Erlang 1.4%
	 ActionScript 1.3%
	 Scala 1.0%
	 Lua 0.9%
	 Clojure 0.7%
	 Lisp 0.6%
	 Haskell 0.5%
	 Go 0.5%
	 Objective J 0.3%

Pretty close to the overall estimate by github. Some of the difference can probably be explained by the fact that Github tried to eliminate commonly included libraries when they did their file line counts, while I didn't.

draegtun · on Dec 15, 2009

Ruby & Perl are currently at 21% & 18% respectively on Github, so it bears no resemblance to your figs.

I understand what you're trying to do by eliminating projects with fewer than 4 watchers, but this is an arbitrary figure and the results you produced are therefore affected by this decision.

When you play with population samples, side effects can creep in. You can see this in the difference between Ruby being 35.3% in your figs and 21% on Github. It's a big difference and can possibly be explained by things like: http://www.ruby-toolbox.com/
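A small sketch of how the cutoff can move the numbers, using made-up records. It counts projects per language, which is an assumption: Github's own figures are reportedly based on file line counts, so the two are not directly comparable. The `language` and `watchers` field names are hypothetical.

```python
from collections import Counter

def language_shares(projects, min_watchers):
    """Percentage of projects per language among those with at least `min_watchers`."""
    kept = [p["language"] for p in projects if p["watchers"] >= min_watchers]
    total = len(kept)
    if not total:
        return {}
    return {lang: 100 * n / total for lang, n in Counter(kept).items()}

# Made-up records for illustration only.
sample = [
    {"language": "Ruby", "watchers": 10},
    {"language": "Ruby", "watchers": 5},
    {"language": "Perl", "watchers": 1},
    {"language": "Perl", "watchers": 0},
    {"language": "Perl", "watchers": 6},
    {"language": "Python", "watchers": 2},
]
for cutoff in (0, 4):
    print(cutoff, language_shares(sample, cutoff))
```

Running the same breakdown at several cutoffs would show how sensitive each language's share is to the choice.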

whiskeyjack · on Dec 14, 2009

War Perl!

Interesting to see how the Perl community has really hopped on the bandwagon. Happy to see it.

draegtun · on Dec 14, 2009

Yes the Perl community started to embrace Git & Github about a year ago.

When I started my Github account in Nov 2008, Perl only had around 1% of language metrics. It rose over the past year to 6% and recently jumped to 14% because of the BackPAN inclusion (http://github.com/gitpan/).

davidw · on Dec 14, 2009

I love these kinds of things, and he's done some cool stuff with the data he has. It'd be interesting to see how this would look on a more 'mature' site like SourceForge.

pilif · on Dec 14, 2009

I don't know whether "mature" is the right word.

Different. Yes. Older. Yes.

Github and SF represent two wholly different development paradigms: SF (mostly) represents the central way of doing development, where a central repository contains the officially blessed code and additions are made by sending patches to maintainers, who then go ahead and igno^H apply them.

github is based on forking and moving patches around in a more fluid manner.

Due to how the traditional systems (CVS, SVN) work, finding contributors on SF would actually be very hard because the traditional systems don't discern between author and committer: If I send a patch for awesome-project to a maintainer, they would commit it in their name and it would be impossible to automatically determine my initial ownership.

If I fork awesome-project on github, create a patch and make them pull (or cherry-pick or whatever) from my repository, git will track them as committer and myself as author.
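As a sketch of how that distinction could be used in an analysis: `git log --pretty=format:%an%x09%cn` prints the author name and committer name of each commit separated by a tab, and the function below counts commits whose author is not the committer. The function name and the sample names are hypothetical, and comparing by name alone is a simplification.

```python
from collections import Counter

def external_authors(log_text):
    """Count commits per author where the author differs from the committer.

    `log_text` is expected to be the output of
        git log --pretty=format:%an%x09%cn
    i.e. one "author<TAB>committer" pair per line.
    """
    counts = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        author, _, committer = line.partition("\t")
        if author != committer:
            counts[author] += 1
    return counts

# Made-up log output for illustration only.
sample_log = "alice\tmaintainer\nmaintainer\tmaintainer\nalice\tmaintainer\n"
print(external_authors(sample_log))
```

On a repository mirrored from SVN, this count would likely come out near zero, since the original authorship was never recorded.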

This is probably another reason why, in the original article, ruby projects seem to have so many more contributors: Some of the older non-ruby projects on github are mirrors of the project's main SVN repository, thus losing all author information on patches coming from external contributors.

One must be careful to really be comparing apples to apples here, but still, in light of these inherent limitations, the article was very interesting.

davidw · on Dec 14, 2009

I meant 'mature' in the sense of having been around for many more years than github.

By the way, though, git does not change the human nature of projects: there is still usually one official one. Git 'forks' are just a more efficient means of doing patch management (you point out some of the benefits), rather than (hopefully), different, competing versions of the same codebase. I say 'hopefully' because in most cases it's nicer for everyone involved if the code in question has one more or less authoritative source, rather than forcing the user to figure out which one of the 788 rails forks is the 'real' one. Project forks are occasionally necessary, but they are also costly, which is why they should be rare, and only for situations where it is impossible to collaborate. In some ways it's a pity that git/github use the 'fork' terminology: while it's technically a fork, in most cases it does not represent a genuine attempt to create a competing code and user base.

pilif · on Dec 14, 2009

yeah. of course.

but if I send a pull-request or even just a patch created by "git format-patch" and that patch gets applied, then I am credited as author and the person who put it into the official repository is credited as committer.

In SVN or CVS, you would do this with a comment in the commit message or the changelog, but it's entirely optional and often such contributions are only visible in some mailing list or bug tracker.

This skews any analysis of contributor frequency in the traditional style of managing projects.

mindstab · on Dec 15, 2009

There are no Lisp projects on Github? This seems like a missing piece of data, or maybe I'm missing something.

cortesi · on Dec 15, 2009

The Lisp dataset was really small - after eliminating projects with less than 3 watchers and duplicates, there were 19 projects left. Haskell - with 18 projects - was left out for the same reason.