Work reproducible with Datahugger

By downloading a dataset directly from a DOI, you can improve the reproducibility of your (scientific) work. It also saves manual work and time. Because of this, sharing work is easier. In the following sections, we discuss possible ways to integrate Datahugger in your workflow.

Publish a project with datahugger

Find a dataset

First, find a dataset you want to use in your analysis. You can use dataset search engines like Google Dataset Search or look for references in scientific publications. Once you found the dataset, copy the DOI. If there is no DOI available, use the URL to the dataset.

Instruct user to install datahugger

If the user doesn't have datahugger installed on their device, it is required to install datahugger. Datahugger can be added to an existing Python installation file like requirements.txt or via documentation pip install datahugger.

Scenario 1: Standalone project setup

In this scenario, you create a script or piece of documentation to setup the prerequirements for your project. This likely contains the installation dependencies and the download of the data with Datahugger. The following example shows an example for a Python project.

pip install -r requirements.txt
datahugger 10.xxx/yyy data

This script sets up the required Python dependencies and downloads the dataset.

Scenario 2: Single workflow

In this scenario, the data download is part of the same script or workflow as the analysis. This is common for interactive environments like Jupyter Notebooks.

Tips for git users

Add download folder to `.gitignore`

Are you using git for version control? Add the download folder to the .gitignore file. This prevents you from adding the new dataset to the history. As you can redownload the same dataset, there is no need to track the dataset (it's disposable).

Download without progress indicators

A redownload of the data will likely result in different progress output as download times will vary. For some applications, like Jupyter Notebooks, this will result in changes in the output, where there are no changes in the actual results. To prevent this, you can disable the progress indicator.

CLIPython

datahugger 10.5061/dryad.31zcrjdm5 data --no-progress

datahugger.get("10.5061/dryad.x3ffbg7m8", "data", progress=False)

After publishing your project in a data repository or alike, others can start reusing it. Ideally, your downloaded data is not republished. This implies that the user of your project needs to download the assets with datahugger as well. This means that datahugger needs to be added to the installation dependencies.

Reuse a project with datahugger

Install datahugger

Ideally, the installation instructions of the project you want to (re)use provide installation instructions or automates the installation of dependencies. If not, please install datahugger.