Getting started¶
In this short tutorial, we guide you through the standard procedure for merging a new data contribution into the EMPD. The following sections briefly describe the main steps.
Note that, as a contributor to the EMPD, you do not have to work through this guide. You are of course welcome to do so, but it is the task of the core maintainers to validate your contribution with the methods presented here.
Note
The commands here are run from an interactive Python session (IPython). For
this reason, we prefix every shell command with an exclamation mark (!).
If you run the examples in a regular shell, omit it.
Note
We run the commands here locally using the empd-admin shell command. If you
use it from a pull request into the EMPD-data repository, you have to use
@EMPD-admin instead. If you are using the EMPD-admin through Docker, you can
type ``docker run -t empd2/empd-admin empd-admin`` instead.
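As a small illustration of the note above, the following sketch builds the argument list for a local call and for the Docker variant. The helper `empd_admin_command` is hypothetical and not part of EMPD-admin; it only prepends the `docker run -t empd2/empd-admin` prefix mentioned in the note and does not run anything.

```python
import shlex

# Prefix taken from the note above; everything else here is illustrative.
DOCKER_PREFIX = ["docker", "run", "-t", "empd2/empd-admin"]


def empd_admin_command(subcommand, use_docker=False):
    """Return the argument list for an empd-admin call (hypothetical helper)."""
    argv = ["empd-admin"] + shlex.split(subcommand)
    if use_docker:
        argv = DOCKER_PREFIX + argv
    return argv


local_cmd = empd_admin_command("test")
docker_cmd = empd_admin_command("test", use_docker=True)
print(" ".join(local_cmd))
print(" ".join(docker_cmd))
```

Such a list could be passed to `subprocess.run` if you prefer to drive the tool from a Python script instead of IPython.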
Downloading the test data¶
For our tutorial, we will use the test-data branch of the EMPD (see here). It contains artificial data that we use for testing. If you want to test the commands here with your own contribution to the EMPD, you can also use your own fork of the EMPD-data repository.
The following two commands clone the branch from GitHub and change the working directory to the downloaded directory.
In [1]: !git clone -q https://github.com/EMPD2/EMPD-data -b test-data
In [2]: cd EMPD-data
In [3]: ls
CODE_OF_CONDUCT.md README.rst meta.tsv templates/
CONTRIBUTING.md docker-compose.test.yml postgres/ test.tsv
Dockerfile docker_tests.sh* samples/ tests/
LICENSE failures/ tab-delimited/
Testing the repository¶
The first step is to test the data contribution. For this, we can use the empd-admin test command. For the test-data branch, this will fail due to an invalid country in the test_a1 sample.
In [4]: !empd-admin test
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmp9vh4iqer_test/report.md --commit --maxfail=20 --sample=.* /tmp/tmp9vh4iqer_test/tests/
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /tmp/tmp0tzagvu0/tests, inifile: pytest.ini
collected 38 items / 7 deselected / 31 selected
../tmp9vh4iqer_test/tests/test_critical.py ... [ 9%]
../tmp9vh4iqer_test/tests/test_formatting.py ..sssss [ 32%]
../tmp9vh4iqer_test/tests/test_meta.py ..F.................. [100%]
=================================== FAILURES ===================================
_________________________________ test_country _________________________________
meta = OriginalSampleName SiteName ... empd_countries Country_ok
SampleName ...... Germany True
test_a3 orig_test_a2 somewhere ... Germany True
[3 rows x 76 columns]
countries = natural_earth
Albania Albania
Algeria ... United Kingdom
Yemen Yemen
Name: country, Length: 82, dtype: object
nat_earth_countries = SampleName
test_a1 Germany
test_a2 Germany
test_a3 Germany
dtype: object
okexcept = <function okexcept.<locals>.okexcept at 0x7f7e8cd35400>
record_property = <function record_property.<locals>.append_property at 0x7f7e868568c8>
def test_country(meta, countries, nat_earth_countries, okexcept,
record_property):
ref = countries
meta = meta.copy()
s = meta['Country'].fillna('')
s_ok = okexcept(s.name)
s_ok |= (okexcept("Latitude") & okexcept("Longitude")).values
meta = meta.join(s_ok)
meta['nat_earth'] = nat_earth_countries
meta['empd_countries'] = [
'; '.join(ref.loc[[n]]) if n in ref.index else c
for c, n in meta[[s.name, 'nat_earth']].values]
meta[s.name + '_ok'] = [n in ref.index and c in ref.loc[[n]].values
for c, n in meta[[s.name, 'nat_earth']].values]
failed = meta[~(s_ok | meta.Country_ok)]
if len(failed):
record_property('failed_samples',
failed[[s.name, 'nat_earth', 'empd_countries', s_ok.name,
s.name + '_ok']]
)
msg = "Found %i invalid %s: %s" % (
len(failed), s.name if len(failed) == 1 else 'Countries',
textwrap.shorten(
', '.join(failed.index), 80, placeholder='...'))
> assert not len(failed), msg
E AssertionError: Found 1 invalid Country: test_a1
E assert not 1
E + where 1 = len( OriginalSampleName SiteName ... empd_countries Country_ok\nSampleName ...... \ntest_a1 orig_test_a1 somewhere ... Germany False\n\n[1 rows x 76 columns])
../tmp9vh4iqer_test/tests/test_meta.py:80: AssertionError
=============================== warnings summary ===============================
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Callable
test_meta.py::test_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:45: DeprecationWarning: invalid escape sequence \
"""
test_meta.py::test_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:153: DeprecationWarning: invalid escape sequence \
"""
test_meta.py::test_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:192: DeprecationWarning: invalid escape sequence \
"""
test_meta.py::test_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:222: DeprecationWarning: invalid escape sequence \
"""
test_meta.py::test_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:250: DeprecationWarning: invalid escape sequence \
"""
-- Docs: https://docs.pytest.org/en/latest/warnings.html
=== 1 failed, 25 passed, 5 skipped, 7 deselected, 6 warnings in 5.36 seconds ===
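The failing test_country check in the traceback above compares the country stated in the metadata with the country derived from the sample coordinates. The following is a simplified, pure-Python sketch of that comparison, not the actual test: it leaves out the okexcept exemptions, uses plain dictionaries instead of pandas, and the stated country of test_a1 is a made-up value, since the output above does not show it.

```python
# Mapping from natural-earth country names to the names accepted by the EMPD.
# Only two illustrative entries are listed here.
countries = {"Germany": ["Germany"], "Albania": ["Albania"]}

# Country as stated in the metadata vs. country derived from the coordinates.
# The stated country of test_a1 is a hypothetical value for illustration.
samples = {
    "test_a1": {"Country": "Deutschland", "nat_earth": "Germany"},
    "test_a2": {"Country": "Germany", "nat_earth": "Germany"},
    "test_a3": {"Country": "Germany", "nat_earth": "Germany"},
}


def country_ok(stated, nat_earth, reference):
    """True if the stated country is an accepted name for the derived one."""
    return nat_earth in reference and stated in reference[nat_earth]


# Collect the samples whose stated country does not match the reference.
failed = [name for name, row in samples.items()
          if not country_ok(row["Country"], row["nat_earth"], countries)]
print("Found %i invalid Country: %s" % (len(failed), ", ".join(failed)))
```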
The failed sample can be extracted using the -e option:
In [5]: !empd-admin test -e failed.tsv
which extracts the failed metadata into failures/failed.tsv:
In [6]: ls failures/failed.tsv
failures/failed.tsv
and allows you to analyse it further. In our case, we can fix the failed sample with the fix command:
In [7]: !empd-admin fix country -s test_a1
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmp566xewx6_test/report.md -m dbfix --fix-db --commit -k country --sample=test_a1 /tmp/tmp566xewx6_test/tests/fixes.py
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-4.4.1, py-1.8.0, pluggy-0.9.0
rootdir: /tmp/tmp0tzagvu0/tests, inifile: pytest.ini
collected 7 items / 6 deselected / 1 selected
../tmp566xewx6_test/tests/fixes.py . [100%]
=============================== warnings summary ===============================
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Callable
fixes.py::fix_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:45: DeprecationWarning: invalid escape sequence \
"""
fixes.py::fix_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:153: DeprecationWarning: invalid escape sequence \
"""
fixes.py::fix_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:192: DeprecationWarning: invalid escape sequence \
"""
fixes.py::fix_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:222: DeprecationWarning: invalid escape sequence \
"""
fixes.py::fix_country
/opt/conda/envs/empd-admin/lib/python3.7/site-packages/shapely/affinity.py:250: DeprecationWarning: invalid escape sequence \
"""
-- Docs: https://docs.pytest.org/en/latest/warnings.html
============== 1 passed, 6 deselected, 6 warnings in 5.05 seconds ==============
This fixes the country of the test_a1 sample, and the tests now pass:
In [8]: !empd-admin test
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmp9914u_4o_test/report.md --commit --maxfail=20 --sample=.* /tmp/tmp9914u_4o_test/tests/
All tests passed
Query the repository¶
The empd-admin provides several diagnostics to investigate the data
contribution. You already saw one of them: the extraction of failures with the
-e option of the test command.
Another one is the query command, which queries the database. It uses an SQL query to display a subset of your metadata. The syntax is:
empd-admin query WHERE_CLAUSE [Column1, [Column2, [Column3]]]
which is transformed into a query like:
SELECT Column1, Column2, Column3 FROM metadata WHERE WHERE_CLAUSE;
For example:
empd-admin query 'Country = "Germany"' SampleName
will transform into:
SELECT SampleName FROM metadata WHERE Country = "Germany";
and result in a markdown table for the SampleName:
In [9]: !empd-admin query 'Country = "Germany"' SampleName
<details><summary>Country = "Germany"</summary>
| SampleName
| ---
| test_a1
| test_a2
| test_a3
Displaying 3 of 3 rows
</details>
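The mapping from the command-line arguments to the SQL statement can be sketched as follows. This is a minimal illustration and not the implementation used by empd-admin: `build_query` is a hypothetical helper, the in-memory SQLite table only stands in for the real metadata, and the string literal uses single quotes, which is the standard SQL form.

```python
import sqlite3


def build_query(where_clause, column, *more_columns):
    """Assemble the SELECT statement described above (illustrative only)."""
    selected = ", ".join((column,) + more_columns)
    return "SELECT %s FROM metadata WHERE %s;" % (selected, where_clause)


# A tiny stand-in for the metadata table with the three test samples.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (SampleName TEXT, Country TEXT)")
con.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("test_a1", "Germany"), ("test_a2", "Germany"), ("test_a3", "Germany")],
)

sql = build_query("Country = 'Germany'", "SampleName")
rows = [row[0] for row in con.execute(sql)]
con.close()

print(sql)
print(rows)
```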
By piping the output of the query command to pandoc, you can also transform it directly to HTML:
In [10]: !empd-admin query 'Country = "Germany"' SampleName | pandoc -o query.html
SampleName |
---|
test_a1 |
test_a2 |
test_a3 |
You can also dump the result as a file into the queries folder of your
repository with the --output and --commit options:
In [11]: !empd-admin query 'Country = "Germany"' SampleName -c
<details><summary>Country = "Germany"</summary>
| SampleName
| ---
| test_a1
| test_a2
| test_a3
Displaying 3 of 3 rows
</details>
In [12]: cat queries/query.tsv
SampleName
test_a1
test_a2
test_a3
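A dumped query is a plain tab-separated file, so it can be read back with standard tools. The sketch below parses the content printed above with Python's csv module; it uses an in-memory string so that it runs anywhere, and the variable names are only illustrative.

```python
import csv
import io

# Content of queries/query.tsv as printed above. For the real file, replace
# io.StringIO(...) with open("queries/query.tsv", newline="").
query_tsv = "SampleName\ntest_a1\ntest_a2\ntest_a3\n"

reader = csv.DictReader(io.StringIO(query_tsv), delimiter="\t")
sample_names = [row["SampleName"] for row in reader]
print(sample_names)
```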
Transform into postgres¶
Assuming you have Postgres installed on your system and a running postgres database server, you can transform the EMPD metadata into a relational database, as it is available in the EMPD-data repository.
Just create a postgres database using
In [13]: !createdb MyEMPD
and then incorporate your data as
In [14]: !empd-admin createdb -db MyEMPD
Importing in postgresql://postgres@localhost/MyEMPD with /opt/conda/envs/empd-admin/bin/python /tmp/tmp0tzagvu0/postgres/scripts/import_into_empd2.py ./test.tsv --database-url postgresql://postgres@localhost/MyEMPD --no-dump
Then you can access it via
In [15]: !psql MyEMPD
Finishing a data contribution¶
When you are satisfied with your data contribution, you can finish it.
This will remove all the intermediate working files (e.g. our new metadata
test.tsv or the failed samples failures/failed.tsv) and merge the new
data into the base meta file meta.tsv of the EMPD.
In [16]: !empd-admin finish --commit
Starting test run with /opt/conda/envs/empd-admin/bin/pytest --empd-meta=./test.tsv --markdown-report=/tmp/tmpir8lpohl_test/report.md --fix-db -v -k fix_sample_data_formatting --commit /tmp/tmpir8lpohl_test/tests/fixes.py
Importing in postgresql://postgres@localhost/tmpg0cldh3u_empd with /opt/conda/envs/empd-admin/bin/python /tmp/tmp0tzagvu0/postgres/scripts/import_into_empd2.py ./test.tsv --database-url postgresql://postgres@localhost/tmpg0cldh3u_empd
In [17]: log = !git log -8
....: print(log.n)
....:
commit 41a7c231847fe011be03372dae810e9de785b5f6
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:57 2020 +0000
Removed test.tsv to finish the PR
commit 2ee018a237820eb5854a5960504adaec2e88e2e9
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:57 2020 +0000
Removed extracted queries
commit 0ac6a167d70561207bc3a5b8acb46e6ac93894cb
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:57 2020 +0000
Removed extracted failures
commit a9a2e637b547fb9da105ea386147dbe3756e1b10
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:56 2020 +0000
Merged test.tsv into meta.tsv [skip ci]
commit 52eb2999f7d4fc3c1ad83fe4995f3614cef68581
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:55 2020 +0000
Updated tab-delimited files from EMPD2 postgres database
commit b643a580791a1ff425ef5bc8d7517b53e72b4581
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:52 2020 +0000
Removed postgres dump of test.tsv
commit 4c72300f0d0b2ff8fe5bdb59aa9ad0c5d0ed5083
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:52 2020 +0000
Added postgres dump for test.tsv
commit 3a57ace27dade30e58d254469a1858d5576f3cec
Author: EMPD-admin <postgres@b7da7f78ba1e>
Date: Thu Mar 12 13:52:51 2020 +0000
Updated fixed tables
That’s it. Now we can merge this contribution into the EMPD from within GitHub.