SourceForge.net Research Data

SourceForge.net is the world's largest Open Source software development web site, with the largest repository of Open Source code and applications available on the Internet. Owned and operated by OSTG, Inc. ("OSTG"), SourceForge.net provides free services to Open Source developers. The SourceForge.net web site is database driven, and the supporting database includes historical and status statistics on over 140,000 projects and on the activities of over 1.5 million registered users at the project management web site. OSTG has shared certain SourceForge.net data with the University of Notre Dame for the sole purpose of supporting academic and scholarly research on the Free/Open Source Software phenomenon. OSTG has given Notre Dame permission to share this data in turn with other academic researchers studying the Free/Open Source Software phenomenon.

Release of the SourceForge.net Research Data

To advance the understanding of, and research on, the Free/Open Source Software phenomenon, portions of the data that may support such research will be made available to academic or scholarly researchers. All requests for data must be submitted in writing (e-mail) to the Notre Dame PI, Greg Madey. Only academic and scholarly researchers are eligible to receive the data. To receive the data, a researcher must complete, sign, and return a short questionnaire and agreement. A wiki for users of the research data is available here.

Description of Data Available

SourceForge.net uses relational databases to store project management activity and statistics. There are over 100 relations (tables) in the data dumps provided to Notre Dame. Some of the data has been removed for security and privacy reasons: SourceForge.net cleanses the data of personal information and strips out all OSTG-specific and site-functionality-specific information. On a monthly basis, a complete dump of the databases (minus the data dropped for privacy and security reasons) is shared with Notre Dame. The Notre Dame researchers have built a data warehouse composed of these monthly dumps, each stored in a separate schema. Thus, each monthly dump is a snapshot of the status of all the SourceForge.net projects at that point in time.

As of March 2007, the data warehouse was almost 500 GBytes in size and was growing at about 25 GBytes per month. Much of the data is duplicated among the monthly dumps, but trends or changes in project activity and structure can be discovered by comparing data from the monthly dumps. Queries across the monthly schemas may be used to discover when changes took place, to estimate trends in project activity and participation, or even to establish that no activity, events, or changes have taken place. To help researchers determine what data is available, an ER-diagram and the definitions of tables and views in the data warehouse are provided.

For each month, the data warehouse includes three major parts:

  • The tables supporting the SourceForge.net web site, for example, the tables user, group, etc.
  • The tables used to store the statistics of the whole community, including daily page access, downloads, etc.
  • The tables with the history information on the other tables.

The data warehouse contains SourceForge.net archive data, each in its own respective schema (sf_mm_yy) for the following months (note the gaps in 2003-2005, with all months available starting with February 2005 to the present):

  • January 2003 (sf0103)
  • November 2004 (sf1104)
  • December 2004 (sf1204)
  • February 2005 (sf0205)
  • March 2005 (sf0305)
  • ...
  • ... to present month
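
As a rough illustration of comparing monthly snapshots, the following Python sketch uses the standard-library sqlite3 module as a stand-in for the data warehouse, with one attached in-memory database per monthly schema. The schema names follow the form of the entries in the list above (for example, sf0205). The table name user_group, its columns user_id and group_id, and all rows are assumptions invented for this example; the real names should be taken from the Table Definitions.

```python
import sqlite3
from datetime import date


def schema_name(month: date) -> str:
    """Build a schema name in the form of the list entries, e.g. sf0205."""
    return f"sf{month.month:02d}{month.year % 100:02d}"


def developer_counts(conn, schema):
    """Return {group_id: number of developers} for one monthly snapshot."""
    # SQL parameters cannot name a schema, so the name is interpolated.
    # It comes from schema_name() here, never from untrusted input.
    rows = conn.execute(
        f"SELECT group_id, COUNT(DISTINCT user_id) "
        f"FROM {schema}.user_group GROUP BY group_id"
    )
    return dict(rows.fetchall())


def changed_projects(conn, earlier, later):
    """List (group_id, before, after) where the developer count differs."""
    before = developer_counts(conn, earlier)
    after = developer_counts(conn, later)
    changes = []
    for group_id in sorted(set(before) | set(after)):
        b, a = before.get(group_id, 0), after.get(group_id, 0)
        if b != a:
            changes.append((group_id, b, a))
    return changes


# Toy stand-in for the warehouse: hypothetical (user_id, group_id) rows.
conn = sqlite3.connect(":memory:")
snapshots = {
    schema_name(date(2005, 2, 1)): [(1, 10), (2, 10), (3, 20)],
    schema_name(date(2005, 3, 1)): [(1, 10), (2, 10), (4, 10), (3, 20), (5, 30)],
}
for schema, memberships in snapshots.items():
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
    conn.execute(
        f"CREATE TABLE {schema}.user_group (user_id INTEGER, group_id INTEGER)"
    )
    conn.executemany(f"INSERT INTO {schema}.user_group VALUES (?, ?)", memberships)
    conn.commit()  # close the implicit transaction before the next ATTACH

print(changed_projects(conn, "sf0205", "sf0305"))
# prints [(10, 2, 3), (30, 0, 1)]
```

In this toy data, project 10 grows from two developers to three between the two snapshots and project 30 appears for the first time, while project 20 is unchanged and so is not reported.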

We suggest studying the "live" SourceForge.net site to see what kinds of services are provided. This may provide hints as to what types of data might be available in the SourceForge.net data warehouse to support research into Free/Open Source Software.

Types of Data that Can be Extracted from the SourceForge.net Research Data Archive

The following are types of data that we have extracted from the SourceForge.net Research Data Archive:

  • Project sizes over time (number of developers as a function of time, presented as a frequency distribution)
  • Development participation on projects (number of projects individual developers participate in, presented as a frequency distribution)
    • The above two items are used to create a "collaboration social-network"
    • The above two items were used to discover scale-free distributions among developer activity and project activity
  • The extended-community size around each project, including project developers plus registered members who participated in any way on a project (discussion forum posting, bug report, patch submission, etc.)
  • Date of project creation (at SourceForge.net)
  • Date of first software release for a project
  • SourceForge.net ranking of projects at various times
  • Activity statistics on projects at various times
  • Number of projects in various software categories, e.g., games, communications, database, security, etc.
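
One way to picture the "collaboration social-network" mentioned in the list is as a graph in which two developers are linked whenever they belong to the same project. The Python sketch below builds such links from (developer, project) pairs. It is only a minimal illustration under that assumption, not necessarily the construction used by the Notre Dame researchers, and the developer and project names are invented.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_edges(memberships):
    """Map each pair of developers to the number of projects they share."""
    developers_by_project = defaultdict(set)
    for developer, project in memberships:
        developers_by_project[project].add(developer)

    edges = defaultdict(int)
    for developers in developers_by_project.values():
        # Every pair of developers on the same project gets one shared project.
        for a, b in combinations(sorted(developers), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical membership data: (developer, project) pairs.
memberships = [
    ("alice", "p1"), ("bob", "p1"), ("carol", "p1"),
    ("alice", "p2"), ("bob", "p2"),
    ("dave", "p3"),
]
edges = collaboration_edges(memberships)
print(edges)
```

Here alice and bob share two projects, carol is linked to each of them through one project, and dave, the only developer on p3, has no links.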

Since all of the archived data is stored in a relational database, data to support F/OSS investigations will be extracted using SQL queries against the data warehouse.
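
To give a sense of what such a query might look like, the sketch below computes a frequency distribution of project sizes (how many projects have one developer, two developers, and so on). It runs against a small in-memory SQLite database from Python's standard library rather than the actual data warehouse. The table user_group and its columns user_id and group_id are assumed names for illustration, and the rows are invented; consult the Table Definitions for the real schema.

```python
import sqlite3

# Hypothetical query: table and column names are assumptions for illustration.
PROJECT_SIZE_DISTRIBUTION = """
SELECT developers, COUNT(*) AS projects
FROM (
    SELECT group_id, COUNT(DISTINCT user_id) AS developers
    FROM user_group
    GROUP BY group_id
) AS project_sizes
GROUP BY developers
ORDER BY developers
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_group (user_id INTEGER, group_id INTEGER)")
# Invented (user_id, group_id) membership rows.
conn.executemany(
    "INSERT INTO user_group VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (1, 30), (4, 40), (5, 40), (6, 40)],
)

# Each result row is (number of developers, number of projects of that size).
distribution = conn.execute(PROJECT_SIZE_DISTRIBUTION).fetchall()
print(distribution)
# prints [(1, 2), (2, 1), (3, 1)]
```

The inner query counts developers per project, and the outer query counts how many projects fall at each size. Swapping the roles of user_id and group_id would give the corresponding distribution of projects per developer.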

How to get research data?

The research data is available to those doing academic and scholarly research on the Free/Open Source Software phenomenon:

    1. Submit the survey form and signed license
    2. Study the "live" SourceForge.net site to understand the context of the data collection.
    3. Review the ER-Diagram (multipage or single-page versions) and the Table Definitions to identify research data of interest
    4. The researcher will be sent a userid and password that will provide access to a web-based form that will permit direct SQL queries against the data archive
    5. Submit email requests for support to oss@nd.edu
    6. Please review papers on FLOSS Research Ethics (here)
    7. A wiki for users of the research data has been integrated with the query pages here.

Research Data Archive News (here)

Free/Libre/Open Source Software Research Ethics: Please see the list of references on research ethics for FLOSS Research (here)

The material presented at this web site is based in part upon work supported by the National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829. 

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

[Created by Greg Madey: gmadey@nd.edu]