Towards Automatic Generation of Short Summaries of Commits

We propose generating one-sentence summaries of commits, which can serve as topic or lead sentences for the messages generated by existing techniques.

Data set

We obtained the 1,000 most popular Java projects on GitHub (ranked by stars) in September 2016. In total, we obtained 2 million commits from GitHub. We also obtained the data set reported by Mauczka et al. at MSR '15 [1].

The scripts we used to obtain the data set:

  • Getting the list of the top 1000 Java projects: script
  • Getting the commits from the top 300 projects: script
  • Getting the commits from the remaining projects: script
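The authors' scripts are linked above and are not reproduced here. As a rough illustration of how such a project list could be requested, the sketch below builds paged query URLs for the GitHub REST search endpoint, sorted by stars. The function name `top_java_repo_urls` and its parameters are hypothetical, and it only constructs the URLs; it does not send any request. Results returned today would differ from the September 2016 snapshot used for this data set.

```python
from urllib.parse import urlencode

# GitHub REST API endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def top_java_repo_urls(total=1000, per_page=100):
    """Build paged search URLs for the most-starred Java repositories.

    Hypothetical helper: it only assembles URLs and performs no network I/O.
    """
    pages = (total + per_page - 1) // per_page  # ceiling division
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({
            "q": "language:java",   # restrict the search to Java projects
            "sort": "stars",        # rank by star count
            "order": "desc",        # most-starred first
            "per_page": per_page,   # the API allows at most 100 per page
            "page": page,
        })
        urls.append(f"{SEARCH_ENDPOINT}?{query}")
    return urls


if __name__ == "__main__":
    for url in top_java_repo_urls():
        print(url)
```

The search API caps a single query at 1,000 results, so ten pages of 100 entries cover the whole list.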

We removed the merges and rollbacks from our data set. To detect merges and rollbacks, we check whether a commit message starts with "merge" or "rollback" (case insensitive, script).
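The prefix check described above can be sketched as follows. This is a minimal illustration, not the linked script: the function names are hypothetical, and stripping leading whitespace before the comparison is an assumption.

```python
def is_merge_or_rollback(message: str) -> bool:
    """Return True if the message starts with "merge" or "rollback" (any case)."""
    # Assumption: leading whitespace is ignored before the prefix test.
    return message.lstrip().lower().startswith(("merge", "rollback"))


def drop_merges_and_rollbacks(messages):
    """Keep only the messages that are neither merges nor rollbacks."""
    return [m for m in messages if not is_merge_or_rollback(m)]


if __name__ == "__main__":
    sample = ["Merge branch 'dev'", "Fix NPE in parser", "ROLLBACK to r42"]
    print(drop_merges_and_rollbacks(sample))
```

Because this is a plain prefix test, messages beginning with "Merged" or "Merges" are filtered out as well, while a message such as "Fix merge conflict" is kept.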

Note 1: We obtained all the commit messages. For the diff files, however, we obtained all of them only for the first 300 Java projects; for the remaining 700 projects, we obtained only the diff files no larger than 1M because of limited storage space.
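The retention rule in Note 1 can be written as a small predicate. This is a sketch under stated assumptions: "1M" is read as 1 MiB (1,048,576 bytes), projects are identified by their rank in the star-sorted list, and the names `keep_diff` and `MAX_DIFF_BYTES` are hypothetical.

```python
# Assumption: "1M" is interpreted as 1 MiB; the original unit is not stated.
MAX_DIFF_BYTES = 1024 * 1024


def keep_diff(size_in_bytes: int, project_rank: int, full_rank_cutoff: int = 300) -> bool:
    """Decide whether a diff file is retained.

    project_rank is the 1-based position of the project in the star-sorted list.
    Diffs from the first `full_rank_cutoff` projects are always kept; diffs from
    the remaining projects are kept only if they fit within the size limit.
    """
    if project_rank <= full_rank_cutoff:
        return True
    return size_in_bytes <= MAX_DIFF_BYTES


if __name__ == "__main__":
    print(keep_diff(5 * 1024 * 1024, project_rank=10))   # large diff, top-300 project
    print(keep_diff(5 * 1024 * 1024, project_rank=301))  # large diff, other project
```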

Download links: commit messages that follow the verb/direct-object pattern: link; diff files of the corresponding commits: link

Verb Groups

After extracting the verbs from the commit messages with Stanford CoreNLP [2], we found that some verbs contained non-English characters. We removed "#" from 46 of the extracted verbs and removed two verbs (corresponding to commits 1238026 and 748570) that contain no alphabetic characters.
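The clean-up step above can be sketched as follows. This is an illustration only, not the authors' code: the function name `clean_verb` is hypothetical, and "alphabetic characters" is assumed to mean the ASCII letters a-z and A-Z.

```python
import string

# Assumption: only ASCII letters count as alphabetic characters.
ASCII_LETTERS = set(string.ascii_letters)


def clean_verb(verb: str):
    """Remove '#' from a verb; return None if no ASCII letter remains."""
    stripped = verb.replace("#", "")
    if not any(ch in ASCII_LETTERS for ch in stripped):
        return None  # the verb would be dropped from the data set
    return stripped


if __name__ == "__main__":
    for raw in ["#fix", "add", "#123"]:
        print(repr(raw), "->", repr(clean_verb(raw)))
```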

  • The Java program that extracts the verbs from the commit messages: download, an IntelliJ project, which has a dependency on Stanford CoreNLP - English model
  • All the verbs we originally obtained (including the ones we removed later): download. Each row in the file denotes a commit; the first column is the commit id, and the second column is the verb in the corresponding commit message.
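A file with the two-column layout described above can be read and summarized with a few lines of code. The sketch below assumes comma-separated rows, which the page does not state; pass a different `delimiter` if the file uses tabs or another separator. The names `read_verbs` and `verb_frequencies` are hypothetical.

```python
import csv
from collections import Counter


def read_verbs(lines, delimiter=","):
    """Yield (commit_id, verb) pairs from rows of the form: id, verb.

    `lines` may be an open file or any iterable of strings.
    Assumption: the file is delimiter-separated with no header row.
    """
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) >= 2:  # skip blank or malformed rows
            yield row[0].strip(), row[1].strip()


def verb_frequencies(pairs):
    """Count how often each verb occurs, ignoring case."""
    return Counter(verb.lower() for _, verb in pairs)


if __name__ == "__main__":
    sample = ["101,add", "102,Fix", "103,add"]
    print(verb_frequencies(read_verbs(sample)).most_common())
```

Counting verbs this way is one simple starting point for grouping commits by their leading verb.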


  • [1] Andreas Mauczka, Florian Brosch, Christian Schanes, and Thomas Grechenig. 2015. Dataset of developer-labeled commit messages. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR '15). IEEE Press, Piscataway, NJ, USA, 490-493.
  • [2] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60.