Metadata

Name
TSSB-3M: A massive scale dataset of single statement bugs
Repository
ZENODO
Identifier
doi:10.5281/zenodo.5845439
Description
Datasets created for the paper "TSSB-3M: Mining single statement bugs at massive scale".

Access to single statement bug fixes at massive scale is important for exploring how developers introduce and fix bugs in code, and it is also a valuable resource for research in data-driven bug detection and automatic repair. Therefore, we are releasing multiple large-scale collections of single statement bug fixes mined from over 500K public Python repositories.

To facilitate future research, we are releasing three datasets:



TSSB-3M: A dataset of over 3 million isolated single statement bug fixes. Each bug fix is related to a commit in a public Python project that does not change more than a single statement.


SSB-9M: A dataset of over 9 million single statement bug fixes. Each fix modifies at least a single statement to fix a bug. However, the related code changes might incorporate changes to other files.


SSC-28M: A dataset of over 28 million general single statement changes. We are releasing this dataset with the intention of facilitating research in software evolution. Therefore, a code change might not necessarily relate to a bug fix.



Because of concerns regarding the licensing of code, we do not release the original source code related to the single statement code changes. However, our datasets provide enough information to load the original code from the source project.
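As a rough illustration of how the original code could be recovered, the sketch below builds a `git show` command from one dataset entry. It assumes that the project has already been cloned locally, that `file_path` is relative to the repository root, and that the entry carries the `commit_sha`, `parent_sha` and `file_path` fields described in this record. The helper names `git_show_args` and `load_source`, and the `repo_dir` argument, are hypothetical and not part of the dataset.

```python
import subprocess


def git_show_args(entry, use_parent=True):
    """Build the `git show` arguments that print a file at a given commit.

    use_parent=True selects the version before the change (parent_sha),
    use_parent=False the version after the change (commit_sha).
    """
    sha = entry["parent_sha"] if use_parent else entry["commit_sha"]
    return ["git", "show", f"{sha}:{entry['file_path']}"]


def load_source(entry, repo_dir, use_parent=True):
    """Return the file contents from a local clone located at repo_dir.

    repo_dir is an assumed path to a clone of entry["project_url"].
    """
    result = subprocess.run(
        git_show_args(entry, use_parent),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git cannot find the commit or file
    )
    return result.stdout
```

Calling `load_source(entry, repo_dir)` and `load_source(entry, repo_dir, use_parent=False)` would then give the file before and after the change, provided the commits are still reachable in the clone.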

All dataset entries are saved in a compressed jsonlines format. Each individual entry provides access to the following information:

Commit details:


project: Name of the git project where the commit occurred
project_url: URL of the project containing the commit
commit_sha: Commit SHA of the code change
parent_sha: Commit SHA of the parent commit
file_path: File path of the changed source file
diff: Unified diff describing the change made during the commit
before: Python statement before commit
after: Python statement after commit (addresses the same line)
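A minimal sketch of reading such entries is shown below. It assumes the files are gzip-compressed JSON Lines (one JSON object per line); the record only says "compressed jsonlines", so the compression scheme is an assumption, and the function name `iter_entries` is hypothetical.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    # "rt" opens the gzip stream in text mode so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Iterating lazily keeps memory use low, which matters for files holding millions of entries. Each yielded dict should expose keys such as `project`, `commit_sha`, `before` and `after`.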


Commit analysis:


likely_bug: true if the commit message indicates that the commit is a bug fix. This is heuristically determined.
comodified: true if the commit modifies more than one statement in a single file (formatting and comments are ignored).
in_function: true if the changed statement appears inside a Python function
sstub_pattern: The name of the single statement change pattern the commit can be classified as (if any). Default: SINGLE_STMT
edit_script: A sequence of AST operations that transforms the code before the commit into the code after the commit (includes Insert, Update, Move and Delete operations).
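The analysis fields above lend themselves to simple filtering. The sketch below counts change patterns among entries that are flagged as likely bug fixes and are not comodified. Treating that combination as an "isolated" bug fix is one plausible reading of the field descriptions, not a definition taken from the dataset authors, and the helper names are hypothetical.

```python
from collections import Counter


def is_isolated_bug_fix(entry):
    """True if the entry is flagged as a likely bug fix and is not comodified."""
    return bool(entry.get("likely_bug")) and not entry.get("comodified")


def pattern_counts(entries):
    """Count sstub_pattern values over the isolated bug fixes in entries."""
    return Counter(
        # Fall back to the documented default when the field is absent.
        entry.get("sstub_pattern", "SINGLE_STMT")
        for entry in entries
        if is_isolated_bug_fix(entry)
    )
```

`pattern_counts` accepts any iterable of entry dicts, so it can be applied to a stream of parsed entries without loading the whole dataset into memory.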
Data or Study Types
multiple
Source Organization
Unknown
Access Conditions
available
Year
2022
Access Hyperlink
https://doi.org/10.5281/zenodo.5845439

Distributions

  • Encoding Format: HTML ; URL: https://doi.org/10.5281/zenodo.5845439
This project was funded in part by grant U24AI117966 from the NIH National Institute of Allergy and Infectious Diseases as part of the Big Data to Knowledge program. We thank all members of the bioCADDIE community for their valuable input on the overall project.