Reproducible paper template to expose all steps of analysis

23 Feb 2020


I would like to introduce a data management solution I have been working on. It was awarded an RDA Europe Adoption grant in 2019 and has a lot of parallels with the aims of this working group. I am now finalizing a paper that formally introduces it for the CODATA Data Science Journal, but until then, you can see some slides that I have recently presented.

The basic idea is precisely to expose/publish a research project's data management plan. The data management strategy has some special properties: 1) it is fully in machine-readable plain-text files that are also under version control; 2) it includes the complete project, which is run automatically: 2.1) downloading the software tarballs and input datasets, 2.2) verifying them with stored checksums, 2.3) building the software with a pre-defined configuration and environment, 2.4) running the software on the data (doing the analysis), 2.5) verifying all outputs with their stored checksums, 2.6) creating a final report/paper in PDF (LaTeX is also built within the project).
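To make the checksum steps (2.2 and 2.5) concrete, here is a minimal Python sketch of comparing a file against a stored checksum. It is only an illustration, not the project's actual code: the function names `sha256_of` and `verify` and the choice of SHA-256 are my assumptions for this example.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file matches the stored checksum, else raise.

    `expected_hex` stands in for a checksum kept in the project's
    plain-text files (hypothetical storage format).
    """
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"{path}: expected {expected_hex}, got {actual}")
    return True
```

The same check can be applied to an input (a downloaded tarball or dataset) before the analysis and to an output after it, so a mismatch stops the run instead of silently producing a different result.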

The fact that the whole project is in plain text is also very helpful in publishing/exposing the workflow. For example, see arXiv:1909.11230. The full workflow is published with the paper's LaTeX source on arXiv: you can simply click on "Other formats", download the paper's source tarball (which is actually a `.tar.gz' file) and unpack it to see the whole workflow in the "reproduce" directory. Of course, this workflow and all supplements (for example all necessary software tarballs and the full Git history) are also published on zenodo.3408481. Finally, the workflow is also publish-able/archive-able as a simple Git repository; for the example above, it is on GitLab.
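As a small illustration of inspecting such a source tarball without unpacking it, the following Python sketch lists the archive members that sit inside a directory named "reproduce". The function name `members_under` and the local file name in the usage line are hypothetical, and I am not assuming anything about where exactly that directory sits inside the archive.

```python
import tarfile
from pathlib import PurePosixPath


def members_under(archive_path, dirname="reproduce"):
    """List archive members located inside a directory called `dirname`."""
    # "r:*" lets tarfile detect the compression (for example gzip) itself.
    with tarfile.open(archive_path, "r:*") as tar:
        return [
            name for name in tar.getnames()
            if dirname in PurePosixPath(name).parts[:-1]
        ]
```

For example, `members_under("paper-source.tar.gz")` (with whatever name the downloaded file has) would return the paths of the workflow files.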

I am now finalizing a paper on this system for the CODATA Data Science Journal, so I wanted to get in touch with you for your feedback (I would be happy to share the draft privately, but for a start the slides should also be descriptive).

I also wanted to say that I am very interested in joining the activities and discussions of this working group, and in seeing how this solution compares with others so that I can further improve it.

    Date: 02 Mar, 2020

    Dear Mohammad,

    Thank you for sharing the data management solution you've been working on. You'd be welcome to briefly present to the Exposing DMPs WG on our next call(s): March 5 (15:00 UTC) and March 11 (11:00am AEDT). Email if you'd like some time on the agenda for either of these calls.


    You may find this recent paper interesting: Markus Konkol, Daniel Nüst (PI for o2r), and Laura Goulier, “Publishing computational research -- A review of infrastructures for reproducible and transparent scholarly communication,” arXiv:2001.00484 [cs], Jan. 2020. The authors review applications that address the issue of publishing executable computational research results.

    It would be of interest to working group members to think about how your approach to expose/publish a research project's data management plan goes beyond and/or compares with the infrastructures addressed in the above paper. We'd welcome a chance to hear more about it on a future call, or if you will be attending RDA Plenary on March 18th in Melbourne, let us know. 



    Natalie, Angus, Fiona, Marie Christine, and Kathryn (Co-Chairs) 

    Date: 03 Mar, 2020

    Thanks a lot for setting a slot on the agenda of March 5th. I would be very happy to review the proposed solution while also comparing it with other tools. In fact, in the draft paper I have already discussed/compared 21 workflow automation tools so far. I'll try to add a summary of the comparison to the slides for this talk.

