June 11, 2018

Who gets counted?

Last week, Microsoft (who you’ve probably heard of) bought GitHub (who you may well not have heard of) for $7.5 billion in Microsoft stock. GitHub is a site that cross-bred version-control software (‘track changes’ for programmers) with social media, providing a place to share and promote code.

Stuff and the Herald quoted the same number:

More than 28 million developers around the world use GitHub, with Microsoft ranking as the most active organisation on GitHub.

You might wonder what proportion of developers use GitHub. If you search a bit on Google, you’ll find that the total number of developers is estimated at about 21 million, so roughly 130% of them use GitHub.
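As a quick check of that percentage, here is a minimal sketch in Python. It uses only the two rounded figures quoted above; the variable names are just for illustration.

```python
# Rounded figures quoted in the post, in millions.
github_users = 28          # "developers" using GitHub, per the news reports
estimated_developers = 21  # estimated total number of developers worldwide

# Share of all developers who supposedly use GitHub.
share = github_users / estimated_developers
print(f"{share:.0%}")  # prints 133%, i.e. roughly 130%
```

A share above 100% can only happen if the numerator and the denominator are counting different groups of people.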

Obviously there’s something wrong there.  The problem is how to define ‘developers’.

In the US, the Bureau of Labor Statistics reports how many people do each job. They say, currently, that there are 1,617,400 “Software Developers and Programmers”. The figure of 21 million worldwide is actually based on a subset of those, the 1.2 million “Software Developers, Applications” and “Software Developers, Systems”. This sort of official classification has to be narrow, because the goal is for every job to end up in exactly one category.

Lots of people in the US who are developers in the GitHub sense aren’t developers in the Bureau of Labor Statistics sense. Some of them write software only in their spare time. Others write software as part of their jobs, but their jobs are classified somewhere else in the BLS system. The same is true in New Zealand. Stats NZ reports 31,860 jobs in ANZSIC06 category 7000 “Computer Systems Design and Related Services”, which is a bit broader than the US category.  Even though I’m a developer in the GitHub sense, I’m not one of them. I’m in 8102, “Higher Education”.

Other people who are probably in the 28 million count include statisticians, data scientists, computational biologists, ornithologists, journalists, linguists, and many more.

Official statistics are usually pretty accurate, but they are only accurate for what they are trying to measure, which might not be what you are looking for.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • Steve Curtis

    Another subset of users whose keystrokes Microsoft might track, just like it does on Windows 10 (free version) via a service they call ‘telemetry’.

    7 years ago

  • Thomas Lumley

    They can’t track keystrokes through GitHub unless you’re in the unusual position of doing your editing on GitHub itself.

    I’d usually edit, run, and to some extent test before sending a small batch of changes to GitHub. In multi-developer organisations people would do even more testing before committing changes.

    7 years ago

  • Steve Curtis

    TechCrunch says this:
    ‘The flagship functionality of GitHub is “forking” – copying a repository from one user’s account to another. This enables you to take a project that you don’t have write access to and modify it under your own account.’
    Seems to me to provide a lot of insight into the coding being done. More useful than just keystrokes.

    7 years ago

  • Richard Penny

    Personally (I work in official statistics), I think the key point of the post is how data is put into boxes (i.e. classified). Thomas is correct that in official statistics there is a history of putting people into single boxes. I would add that users often *want* us to do this, as assigning multiple or overlapping classifications to a single unit makes the analysis messy.

    This is not a new consideration for official statisticians (e.g. McGuckin’s 1992 paper “Multiple Classification Systems For Economic Data: Can A Thousand Flowers Bloom? And Should They?”). With the changes in the data environment we are going to have to get used to multiple complex classifications for the same unit, and work out how to integrate this information into our analysis.

    7 years ago

  •

    An organisation I worked for recently had (and probably still has) around 60 people with GitHub accounts who were members of the “organization” on GitHub, but many of them were just there for access to the issue management, team manuals and other documentation (GitHub makes it very easy to spin up a private wiki, which is much better to use than, for example, SharePoint). But even of those who were actually writing code and doing commits, pushes and pulls, I doubt many would count as “developers”; at least not as a primary work role! And as you say, definitely not as an ANZSCO classification category.

    7 years ago