Getting started

In this short tutorial, we guide you through the standard procedure for merging a new data contribution into the EMPD. The following sections briefly describe the three main steps:

  1. Testing and fixing the contribution

  2. Analysing the contribution

  3. Finishing the contribution

Note that, as a contributor to the EMPD, you do not have to work through this guide. You are welcome to do so, but it is the task of the core maintainers to validate your contribution with the methods presented here.

Note

The commands here are run from an interactive Python session (IPython). For this reason, we prefix every shell command with an exclamation mark (!). If you run the examples in a regular shell, omit the exclamation mark.

Note

We run the commands here locally using the empd-admin shell command. If you use it from a pull request into the EMPD-data repository, you have to use @EMPD-admin instead. If you are using the EMPD-admin through Docker, you can type ``docker run -t empd2/empd-admin empd-admin`` instead.

Downloading the test data

For our tutorial, we will use the test-data branch of the EMPD (see here). It contains artificial data that we use for testing. If you want to test the commands here with your own contribution to the EMPD, you can also use your own fork of the EMPD-data repository.

The following two commands clone the branch from GitHub and change the working directory to the downloaded directory.

In [1]: !git clone -q https://github.com/EMPD2/EMPD-data -b test-data

In [2]: cd EMPD-data

In [3]: ls
CODE_OF_CONDUCT.md  README.rst		     meta.tsv	     templates/
CONTRIBUTING.md     docker-compose.test.yml  postgres/	     test.tsv
Dockerfile	    docker_tests.sh*	     samples/	     tests/
LICENSE		    failures/		     tab-delimited/
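Outside of IPython, the same two steps can be scripted. The following is a minimal sketch using only the Python standard library. The helper names clone_command and clone_and_enter are hypothetical and not part of the EMPD-admin. The URL and branch are the ones used above.

```python
import os
import subprocess

REPO_URL = "https://github.com/EMPD2/EMPD-data"  # URL taken from the tutorial


def clone_command(url=REPO_URL, branch="test-data"):
    """Build the argument list for a quiet clone that checks out `branch`."""
    return ["git", "clone", "-q", url, "-b", branch]


def clone_and_enter(url=REPO_URL, branch="test-data"):
    """Clone the repository and change into it (needs git and network access)."""
    subprocess.run(clone_command(url, branch), check=True)
    # git names the target directory after the last URL component
    target = url.rstrip("/").split("/")[-1]
    if target.endswith(".git"):
        target = target[:-4]
    os.chdir(target)
    return os.getcwd()


if __name__ == "__main__":
    # Only show the command here; call clone_and_enter() to actually run it
    print(" ".join(clone_command()))
```

To test your own contribution, pass the URL of your fork and the name of your branch to clone_and_enter.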

Testing the repository

The first step is to test the data contribution. For this, we can use the empd-admin test command. For the test-data branch, this will fail due to an invalid country in the test_a1 sample.

In [4]: !empd-admin test
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmp9vh4iqer_test/report.md --commit --maxfail=20 --sample=.* /tmp/tmp9vh4iqer_test/tests/
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /tmp/tmp0tzagvu0/tests, inifile: pytest.ini
collected 38 items / 7 deselected / 31 selected

../tmp9vh4iqer_test/tests/test_critical.py ...                           [  9%]
../tmp9vh4iqer_test/tests/test_formatting.py ..sssss                     [ 32%]
../tmp9vh4iqer_test/tests/test_meta.py ..F..................             [100%]

=================================== FAILURES ===================================
_________________________________ test_country _________________________________

meta =            OriginalSampleName   SiteName  ... empd_countries  Country_ok
SampleName                                ......    Germany        True
test_a3          orig_test_a2  somewhere  ...        Germany        True

[3 rows x 76 columns]
countries = natural_earth
Albania                                           Albania
Algeria                                       ...      United Kingdom
Yemen                                               Yemen
Name: country, Length: 82, dtype: object
nat_earth_countries = SampleName
test_a1    Germany
test_a2    Germany
test_a3    Germany
dtype: object
okexcept = <function okexcept.<locals>.okexcept at 0x7f7e8cd35400>
record_property = <function record_property.<locals>.append_property at 0x7f7e868568c8>

    def test_country(meta, countries, nat_earth_countries, okexcept,
                     record_property):
        ref = countries
        meta = meta.copy()
        s = meta['Country'].fillna('')
    
        s_ok = okexcept(s.name)
        s_ok |= (okexcept("Latitude") & okexcept("Longitude")).values
        meta = meta.join(s_ok)
        meta['nat_earth'] = nat_earth_countries
        meta['empd_countries'] = [
            '; '.join(ref.loc[[n]]) if n in ref.index else c
            for c, n in meta[[s.name, 'nat_earth']].values]
        meta[s.name + '_ok'] = [n in ref.index and c in ref.loc[[n]].values
                                for c, n in meta[[s.name, 'nat_earth']].values]
        failed = meta[~(s_ok | meta.Country_ok)]
        if len(failed):
            record_property('failed_samples',
                            failed[[s.name, 'nat_earth', 'empd_countries', s_ok.name,
                                    s.name + '_ok']]
                            )
        msg = "Found %i invalid %s: %s" % (
            len(failed), s.name if len(failed) == 1 else 'Countries',
            textwrap.shorten(
                ', '.join(failed.index), 80, placeholder='...'))
>       assert not len(failed), msg
E       AssertionError: Found 1 invalid Country: test_a1
E       assert not 1
E        +  where 1 = len(           OriginalSampleName   SiteName  ... empd_countries  Country_ok\nSampleName                                ......                       \ntest_a1          orig_test_a1  somewhere  ...        Germany       False\n\n[1 rows x 76 columns])

../tmp9vh4iqer_test/tests/test_meta.py:80: AssertionError
=============================== warnings summary ===============================
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import Callable

test_meta.py::test_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:45: DeprecationWarning: invalid escape sequence \ 
    """

test_meta.py::test_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:153: DeprecationWarning: invalid escape sequence \ 
    """

test_meta.py::test_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:192: DeprecationWarning: invalid escape sequence \ 
    """

test_meta.py::test_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:222: DeprecationWarning: invalid escape sequence \ 
    """

test_meta.py::test_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:250: DeprecationWarning: invalid escape sequence \ 
    """

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=== 1 failed, 25 passed, 5 skipped, 7 deselected, 6 warnings in 5.36 seconds ===
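The traceback above shows the idea behind test_country: the Country entry of each sample is compared with the country names that the reference table accepts for the sample's location, and samples on an exception list are skipped. The following is a simplified sketch of that comparison in plain Python, not the actual test. The function names, the single exception set and the invalid entry "Atlantis" are made up for illustration; the tutorial does not show the real invalid value.

```python
def find_invalid_countries(samples, reference, exceptions=frozenset()):
    """Return the names of samples whose Country does not match the reference.

    samples:    {sample_name: (country_in_metadata, country_of_location)}
    reference:  {country_of_location: set of accepted spellings}
    exceptions: sample names that are explicitly allowed to differ
    """
    failed = []
    for name, (country, nat_earth) in samples.items():
        accepted = reference.get(nat_earth, ())
        if country not in accepted and name not in exceptions:
            failed.append(name)
    return failed


def failure_message(failed):
    """Format a message in the style of the assertion shown above."""
    label = "Country" if len(failed) == 1 else "Countries"
    return "Found %i invalid %s: %s" % (len(failed), label, ", ".join(failed))


reference = {"Germany": {"Germany"}}
samples = {
    "test_a1": ("Atlantis", "Germany"),  # made-up invalid entry
    "test_a2": ("Germany", "Germany"),
    "test_a3": ("Germany", "Germany"),
}
failed = find_invalid_countries(samples, reference)
print(failure_message(failed))  # Found 1 invalid Country: test_a1
```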

The failed sample can be extracted using the -e option:

In [5]: !empd-admin test -e failed.tsv

which extracted the failed metadata into failures/failed.tsv:

In [6]: ls failures/failed.tsv
failures/failed.tsv
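One way to inspect the extracted file is the csv module of the Python standard library. The sketch below assumes that failures/failed.tsv is tab-separated and has a header row with the SampleName and Country columns that appear in the test output. The helper name read_samples and the example row are hypothetical.

```python
import csv
import io


def read_samples(tsv_file, columns=("SampleName", "Country")):
    """Read a tab-separated metadata file and keep only the selected columns."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [{col: row.get(col, "") for col in columns} for row in reader]


# Stand-in for open("failures/failed.tsv", newline=""); the row content is made up
example = io.StringIO("SampleName\tCountry\tSiteName\ntest_a1\tAtlantis\tsomewhere\n")
for row in read_samples(example):
    print(row["SampleName"], row["Country"])
```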

and allows you to analyse it further. In our case, we can fix the failed sample with the fix command:

In [7]: !empd-admin fix country -s test_a1
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmp566xewx6_test/report.md -m dbfix --fix-db --commit -k country --sample=test_a1 /tmp/tmp566xewx6_test/tests/fixes.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /tmp/tmp0tzagvu0/tests, inifile: pytest.ini
collected 7 items / 6 deselected / 1 selected

../tmp566xewx6_test/tests/fixes.py .                                     [100%]

=============================== warnings summary ===============================
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import Callable

fixes.py::fix_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:45: DeprecationWarning: invalid escape sequence \ 
    """

fixes.py::fix_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:153: DeprecationWarning: invalid escape sequence \ 
    """

fixes.py::fix_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:192: DeprecationWarning: invalid escape sequence \ 
    """

fixes.py::fix_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:222: DeprecationWarning: invalid escape sequence \ 
    """

fixes.py::fix_country
  /opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:250: DeprecationWarning: invalid escape sequence \ 
    """

-- Docs: https://docs.pytest.org/en/latest/warnings.html
============== 1 passed, 6 deselected, 6 warnings in 5.05 seconds ==============

This fixes the country for the test_a1 sample, and the tests now pass:

In [8]: !empd-admin test
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmp9914u_4o_test/report.md --commit --maxfail=20 --sample=.* /tmp/tmp9914u_4o_test/tests/
All tests passed
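The test, extract, fix and retest cycle of this section could also be driven from a script. The sketch below only uses the commands shown above. The names STEPS and run_steps are hypothetical, and whether empd-admin reports failing tests through its exit status is an assumption, so the return codes are recorded but not interpreted.

```python
import subprocess

# Commands exactly as used in this section
STEPS = [
    ["empd-admin", "test"],
    ["empd-admin", "test", "-e", "failed.tsv"],
    ["empd-admin", "fix", "country", "-s", "test_a1"],
    ["empd-admin", "test"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and collect (command, return code) pairs.

    `runner` can be replaced by a stub, e.g. for a dry run without empd-admin.
    """
    results = []
    for cmd in steps:
        completed = runner(cmd)
        results.append((" ".join(cmd), completed.returncode))
    return results
```

Calling run_steps(STEPS) from within the cloned EMPD-data directory runs the four commands one after another.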

Query the repository

The empd-admin provides several diagnostics to investigate the data contribution. You already saw one of them: the extraction of failures with the -e option of the test command.

Another one is the query command, which queries the database. It uses an SQL query to display a subset of your metadata. The syntax is:

empd-admin query WHERE_CLAUSE [Column1, [Column2, [Column3]]]

This is transformed into a query like:

SELECT Column1, Column2, Column3 FROM metadata WHERE WHERE_CLAUSE;

For example:

empd-admin query 'Country = "Germany"' SampleName

will transform into:

SELECT SampleName FROM metadata WHERE Country = "Germany";

and result in a markdown table for the SampleName:

In [9]: !empd-admin query 'Country = "Germany"' SampleName
<details><summary>Country = "Germany"</summary>

| SampleName
| ---
| test_a1
| test_a2
| test_a3


Displaying 3 of 3 rows
</details>
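To illustrate the transformation described above, here is a minimal sketch that assembles the SELECT statement, runs it and renders the result in the layout of the output. It is not the implementation of the query command: the function names are hypothetical, an in-memory sqlite3 table stands in for the EMPD metadata, the layout for more than one column is a guess, and the string literal uses single quotes, which is the standard SQL form.

```python
import sqlite3


def build_query(where_clause, columns):
    """Assemble the SELECT statement described above (no escaping, sketch only)."""
    return "SELECT %s FROM metadata WHERE %s;" % (", ".join(columns), where_clause)


def to_markdown(where_clause, columns, rows):
    """Render rows in the collapsible layout shown in the output above."""
    lines = ["<details><summary>%s</summary>" % where_clause, ""]
    lines.append("| " + " | ".join(columns))
    lines.append("| " + " | ".join("---" for _ in columns))
    lines.extend("| " + " | ".join(str(v) for v in row) for row in rows)
    # This sketch never truncates, so both numbers are the total row count
    lines += ["", "", "Displaying %i of %i rows" % (len(rows), len(rows)), "</details>"]
    return "\n".join(lines)


# In-memory stand-in for the metadata table (assumption: the real backend differs)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (SampleName TEXT, Country TEXT)")
con.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("test_a1", "Germany"), ("test_a2", "Germany"), ("test_a3", "Germany")],
)

where = "Country = 'Germany'"
sql = build_query(where, ["SampleName"])
rows = con.execute(sql).fetchall()
print(sql)
print(to_markdown(where, ["SampleName"], rows))
```

Because the WHERE clause and the column names are pasted into the statement unchanged, a sketch like this should only be used with trusted input.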

When combining this with pandoc, you can also directly transform the output to HTML:

In [10]: !empd-admin query 'Country = "Germany"' SampleName | pandoc -o query.html
Country = "Germany"
SampleName
test_a1
test_a2
test_a3
Displaying 3 of 3 rows

You can also save the result as a file in the queries folder of your repository with the --output and --commit options:

In [11]: !empd-admin query 'Country = "Germany"' SampleName -c
<details><summary>Country = "Germany"</summary>

| SampleName
| ---
| test_a1
| test_a2
| test_a3


Displaying 3 of 3 rows
</details>

In [12]: cat queries/query.tsv
SampleName
test_a1
test_a2
test_a3

Transform into postgres

Assuming you have Postgres installed on your system and a running Postgres database server, you can transform the EMPD metadata into a relational database, as it is available in the EMPD-data repository.

First, create a Postgres database using

In [13]: !createdb MyEMPD

and then incorporate your data as

In [14]: !empd-admin createdb -db MyEMPD
Importing in postgresql://postgres@localhost/MyEMPD with /opt/conda/envs/empd-admin/bin/python /tmp/tmp0tzagvu0/postgres/scripts/import_into_empd2.py ./test.tsv --database-url postgresql://postgres@localhost/MyEMPD --no-dump

Then you can access it via

In [15]: !psql MyEMPD

Finishing a data contribution

When you are satisfied with your data contribution, you can finish it.

This will remove all the intermediate working files (e.g. our new metadata file test.tsv or the failed samples in failures/failed.tsv) and merge the new data into the base metadata file meta.tsv of the EMPD.

In [16]: !empd-admin finish --commit
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmpir8lpohl_test/report.md --fix-db -v -k fix_sample_data_formatting --commit /tmp/tmpir8lpohl_test/tests/fixes.py
Importing in postgresql://postgres@localhost/tmpg0cldh3u_empd with /opt/conda/envs/empd-admin/bin/python /tmp/tmp0tzagvu0/postgres/scripts/import_into_empd2.py ./test.tsv --database-url postgresql://postgres@localhost/tmpg0cldh3u_empd

In [17]: log = !git log -8
   ....: print(log.n)
   ....: 
commit 41a7c231847fe011be03372dae810e9de785b5f6
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:57 2020 +0000

    Removed test.tsv to finish the PR

commit 2ee018a237820eb5854a5960504adaec2e88e2e9
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:57 2020 +0000

    Removed extracted queries

commit 0ac6a167d70561207bc3a5b8acb46e6ac93894cb
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:57 2020 +0000

    Removed extracted failures

commit a9a2e637b547fb9da105ea386147dbe3756e1b10
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:56 2020 +0000

    Merged test.tsv into meta.tsv [skip ci]

commit 52eb2999f7d4fc3c1ad83fe4995f3614cef68581
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:55 2020 +0000

    Updated tab-delimited files from EMPD2 postgres database

commit b643a580791a1ff425ef5bc8d7517b53e72b4581
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:52 2020 +0000

    Removed postgres dump of test.tsv

commit 4c72300f0d0b2ff8fe5bdb59aa9ad0c5d0ed5083
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:52 2020 +0000

    Added postgres dump for test.tsv

commit 3a57ace27dade30e58d254469a1858d5576f3cec
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:51 2020 +0000

    Updated fixed tables
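To get a quick overview of what the finish command committed, the log output can be reduced to the first message line of each commit. The following is a small sketch that parses the default git log format shown above. The function name commit_subjects is hypothetical, and the sample text is an excerpt of the log above.

```python
def commit_subjects(log_text):
    """Extract the first message line of every commit from `git log` output."""
    subjects = []
    expect_subject = False
    for line in log_text.splitlines():
        if line.startswith("commit "):
            # A new commit header; the next indented line is its subject
            expect_subject = True
        elif expect_subject and line.startswith("    "):
            subjects.append(line.strip())
            expect_subject = False
    return subjects


sample = """commit 41a7c231847fe011be03372dae810e9de785b5f6
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:57 2020 +0000

    Removed test.tsv to finish the PR

commit 2ee018a237820eb5854a5960504adaec2e88e2e9
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date:   Thu Mar 12 13:52:57 2020 +0000

    Removed extracted queries
"""
print(commit_subjects(sample))
# ['Removed test.tsv to finish the PR', 'Removed extracted queries']
```

In the IPython session above, commit_subjects(log.n) would apply this to the captured log. Alternatively, git log -8 --format=%s prints only the subject lines.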

That’s it. Now we can merge this contribution into the EMPD from within GitHub.