Collecting all of the publicly available source code

Collecting together all the software ever written is impossible, but collecting everything that can be found would create something very useful (since I have always been interested in source code analysis, my opinion is biased).

Collecting together source available on the internet is easy. Creating a copy of Github is one of the first actions of anybody with ambitions of collecting lots of source (back in the day it was collecting shareware on 5¼″, then 3½″ floppy discs, then CDs and then DVDs).

The items that are very hard to collect “exist” on line-printer paper, punched cards, rolls of punched tape and mag tape. These are the items where serious collectors should concentrate their efforts (NASA lost a lot of Voyager data when magnetic particles fell off the plastic strips of tape reels in storage, because the adhesive had degraded).

The Internet Archive is doing a great job of collecting and making available ‘antique’ source code (old computer games are a popular genre; other collectors concentrate on being able to execute ROM images of games), but they are primarily US-based.

Collecting the World’s source code requires collection organizations in every country. Collecting old code is a people-intensive business and requires lots of local knowledge.

A new source code collection organization has recently been set up in France; Software Heritage currently aims to collect all software that is publicly available. So far they have done what everybody else does: made a copy of Github and a couple of the well-known source repos.

I hope this organization is not just the French government throwing money at another one-upmanship US vs. France project.

If those involved are serious about collecting source code, rather than enjoying the perks of a tax-funded show project, they will realise that lots of French-specific source code is dotted around the country needing to be collected now (before the media decomposes and those who know how to read it die).

  1. April 6, 2017 10:48 | #1

    Dear Derek,
    I have been pointed to this blog post by a friend who spotted it. May I suggest that you look at the extensive support material we have made available on the Software Heritage mission, vision and technical approach on https://annex.softwareheritage.org/? One easy starting point is the video of the OSCON London 2016 keynote: https://annex.softwareheritage.org/public/talks/2016/2016-10-18-oscon-london-rdicosmo-keynote-software-heritage-building-the-universal-software-archive.mp4 More info is in the slides of the FOSDEM 2017 keynote here https://annex.softwareheritage.org/public/talks/2017/2017-02-04-fosdem-software-heritage-foss-commons.pdf

    Looking at these, it should be fairly clear that:
    – we are not after “french” source code (quite a strange notion to introduce today),
    – we do not want to reinvent the wheel, so only do what others are not doing
    – we are not just “making a copy of GitHub”, but building a much more sophisticated infrastructure
    – we do everything in the open, with avenues for external collaboration, and look forward to all constructive suggestions
    – a lot of funding comes from international companies and institutions (see http://www.softwareheritage.org/support/sponsors), that, you may expect, had a thorough view of what we are doing before becoming sponsors

    Please feel free to contact us if you want to contribute to the project: all our source code is open source, and help is always welcome.


    Roberto Di Cosmo
    Director
    Software Heritage

  2. April 6, 2017 12:04 | #2

    @Roberto Di Cosmo
    Thanks for the prompt response and links (Adobe is reporting that the pdf file is corrupted and does not display anything).

    We all agree that the heritage source code is the top priority; obtaining this code is where the money and time should be invested.

    French code is the heritage of the French people, just as British code is the heritage of the British. The Americans are busy collecting their heritage and I had always thought that the French were very proud of their heritage.

    Finding source from the 1950s, 60s, and 70s requires a lot of local knowledge. Collectors in Silicon Valley are very unlikely to encounter code from this era that was written in France. Once it is found, hardware that can read the media is needed, and again this requires local knowledge: what hardware was being used in France at the time, and where can it be found? Once the bits have been read they need to be decoded, and again more local knowledge is required.

    There are a surprising number of groups collecting code, writing search tools and making everything available. What can you do that is unique? Collecting the source code of France is a huge undertaking and it would be a unique collection.
