\n",
@@ -250,21 +257,15 @@
"4 3347197472 1990 2 2 1990-02-02 NaT m 2008119900 ORBIS"
]
},
- "execution_count": 5,
"metadata": {},
- "output_type": "execute_result"
+ "execution_count": 5
}
],
- "source": [
- "person = data.person\n",
- "person.drop(columns = ['person_id']).head()"
- ]
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
- "metadata": {},
- "outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
@@ -276,71 +277,71 @@
" .person_id\n",
" .count()\n",
")"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"Once data has been sufficiently aggregated, it can be converted back to Pandas, e.g. for plotting."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 7,
- "metadata": {},
- "outputs": [],
"source": [
"stats_pd = stats.to_pandas()"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"Similarly, if you want to work on the `Spark` DataFrame instead, a similar method is available:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 8,
- "metadata": {},
- "outputs": [],
"source": [
"person_spark = person.to_spark()"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"## Persisting/Reading a sample to/from disk: `PandasData`\n",
"\n",
"Working with Pandas DataFrame is, when possible, more convenient. \n",
"You have the possibility to save your database or at least a subset of it. \n",
"Doing so allows you to work on it later without having to go through `Spark` again. "
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"!!! warning \"Careful with cohort size\"\n",
" Do not save it if your cohort is **big**: This saves **all** available tables on disk."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"For instance, let us define a dummy subset of 1000 patients:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 10,
- "metadata": {},
- "outputs": [],
"source": [
"visits = data.visit_occurrence\n",
"\n",
@@ -355,20 +356,20 @@
" .head(1000)\n",
" .to_list()\n",
")"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"And save every table restricted to this small cohort as a `parquet` file:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
- "outputs": [],
"source": [
"import os\n",
"\n",
@@ -382,11 +383,12 @@
" tables=tables_to_save,\n",
" person_ids=sample_patients\n",
")"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"Once you have saved some data to disk, a dedicated class can be used to access it: \n",
"The class `PandasData` can be used to load OMOP data from a folder containing several parquet files. The tables\n",
@@ -394,74 +396,84 @@
"\n",
"**Warning**: in this case, the whole table will be loaded into memory on a single jupyter server. Consequently it is advised\n",
"to only use this for small datasets."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
- "metadata": {},
- "outputs": [],
"source": [
"data = PandasData(folder)"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 6,
- "metadata": {},
+ "source": [
+ "data.available_tables"
+ ],
"outputs": [
{
+ "output_type": "execute_result",
"data": {
"text/plain": [
"['visit_occurrence', 'visit_detail', 'person']"
]
},
- "execution_count": 6,
"metadata": {},
- "output_type": "execute_result"
+ "execution_count": 6
}
],
- "source": [
- "data.available_tables"
- ]
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 7,
- "metadata": {},
+ "source": [
+ "person = data.person\n",
+ "print(f\"type: {type(person)}\")\n",
+ "print(f\"shape: {person.shape}\")"
+ ],
"outputs": [
{
- "name": "stdout",
"output_type": "stream",
+ "name": "stdout",
"text": [
"type: <class 'pandas.core.frame.DataFrame'>\n",
"shape: (1000, 10)\n"
]
}
],
- "source": [
- "person = data.person\n",
- "print(f\"type: {type(person)}\")\n",
- "print(f\"shape: {person.shape}\")"
- ]
+ "metadata": {}
},
{
"cell_type": "markdown",
- "metadata": {},
"source": [
"## Loading from PostGres: `PostgresData`\n",
"\n",
"OMOP data can be stored in a PostgreSQL database. The `PostgresData` class provides a convenient interface to it.\n",
"\n",
"**Note:** this class relies on the file `~/.pgpass`, which contains your credentials for several databases."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": 15,
- "metadata": {},
+ "source": [
+ "data = PostgresData(\n",
+ " dbname=DB, \n",
+ " schema=\"omop\", \n",
+ " user=USER,\n",
+ ")\n",
+ "\n",
+ "data.read_sql(\"select count(*) from person\")"
+ ],
"outputs": [
{
+ "output_type": "execute_result",
"data": {
"text/html": [
"\n",
@@ -499,20 +511,11 @@
"0 12688670"
]
},
- "execution_count": 15,
"metadata": {},
- "output_type": "execute_result"
+ "execution_count": 15
}
],
- "source": [
- "data = PostgresData(\n",
- " dbname=DB, \n",
- " schema=\"omop\", \n",
- " user=USER,\n",
- ")\n",
- "\n",
- "data.read_sql(\"select count(*) from person\")"
- ]
+ "metadata": {}
}
],
"metadata": {
diff --git a/docs/index.md b/docs/index.md
index ca3b91ef..8a0e6a64 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -26,6 +26,9 @@ As an example, the following figure was obtained using various functionalities f
!!! question "How was it done ?"
Click on the figure above to jump to the tutorial using various functionalities from eds-scikit, or continue reading the introduction!
+!!! tip "Using `eds-scikit` with I2B2"
+ Although designed for OMOP databases, `eds-scikit` also provides a connector for I2B2 databases. We don't guarantee that it is exhaustive, but it should allow you to use the functionalities of the library seamlessly.
+
## Quick start
### Installation
@@ -90,7 +93,7 @@ color:green Successfully installed eds_scikit !
### A first example: Merging visits together
Let's tackle a common problem when dealing with clinical data: Merging close/consecutive visits into **stays**.
-As detailled in [the dedicated section](), eds-scikit is expecting to work with [Pandas](https://pandas.pydata.org/) or [Koalas](https://koalas.readthedocs.io/en/latest/) DataFrames. We provide various connectors to facilitate data fetching, namely a [Hive]() connector and a [Postgres]() connector
+As detailed in [the dedicated section](), eds-scikit is expecting to work with [Pandas](https://pandas.pydata.org/) or [Koalas](https://koalas.readthedocs.io/en/latest/) DataFrames. We provide various connectors to facilitate data fetching, namely a [Hive][loading-from-hive-hivedata] connector and a [Postgres][loading-from-postgres-postgresdata] connector.
=== "Using a Hive DataBase"
@@ -104,6 +107,9 @@ As detailled in [the dedicated section](), eds-scikit is expecting to work with
1. With this connector, `visit_occurrence` will be a *Pandas* DataFrame
+ !!! tip "I2B2"
+ If `DB_NAME` points to an I2B2 database, use `data = HiveData(DB_NAME, database_type="I2B2")`
+
=== "Using a Postgres DataBase"
```python
From 7eb9e73cb167fa3e20047ec32dcf8596aece7907 Mon Sep 17 00:00:00 2001
From: Matthieu Doutreligne
Date: Thu, 2 Feb 2023 10:31:56 +0000
Subject: [PATCH 20/25] hotfix: isort version
---
.pre-commit-config.yaml | 2 +-
eds_scikit/io/files.py | 5 +----
eds_scikit/io/hive.py | 9 +++++++--
pyproject.toml | 2 +-
4 files changed, 10 insertions(+), 8 deletions(-)
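
Note on the `eds_scikit/io/files.py` hunk below: once the `known_omop_tables` filter is dropped, every `.parquet` entry in the folder is exposed as a table, whatever its name. The following is a minimal standalone sketch of that behaviour, not the library code itself; the function name `list_parquet_tables` and the `sorted()` call (added only to make the output order deterministic) are assumptions made for illustration.

```python
import os
from typing import Dict, List, Tuple


def list_parquet_tables(folder: str) -> Tuple[List[str], Dict[str, str]]:
    """Return the table names and absolute paths of every .parquet entry in `folder`.

    Hypothetical helper mirroring the patched logic: no filtering on known OMOP names.
    """
    available_tables: List[str] = []
    tables_paths: Dict[str, str] = {}
    # sorted() is an assumption of this sketch, used for a stable order
    for filename in sorted(os.listdir(folder)):
        table_name, extension = os.path.splitext(filename)
        if extension == ".parquet":
            tables_paths[table_name] = os.path.abspath(os.path.join(folder, filename))
            available_tables.append(table_name)
    return available_tables, tables_paths
```

With this logic, a folder holding `person.parquet`, `custom_table.parquet` and `notes.txt` yields the two tables `custom_table` and `person`; the non-OMOP name is no longer silently skipped.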
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index e38d3199..3055b3f4 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
- id: check-added-large-files
args: ["--maxkb", "5000"]
- repo: https://github.com/pycqa/isort
- rev: 5.10.1
+ rev: 5.11.5
hooks:
- id: isort
name: isort (python)
diff --git a/eds_scikit/io/files.py b/eds_scikit/io/files.py
index 2e95aab3..98f0a434 100644
--- a/eds_scikit/io/files.py
+++ b/eds_scikit/io/files.py
@@ -3,8 +3,6 @@
import pandas as pd
-from . import settings
-
class PandasData: # pragma: no cover
def __init__(
@@ -36,10 +34,9 @@ def __init__(
def list_available_tables(folder: str) -> Tuple[List[str], List[str]]:
available_tables = []
tables_paths = {}
- known_omop_tables = settings.tables_to_load.keys()
for filename in os.listdir(folder):
table_name, extension = os.path.splitext(filename)
- if extension == ".parquet" and table_name in known_omop_tables:
+ if extension == ".parquet":
abspath = os.path.abspath(os.path.join(folder, filename))
tables_paths[table_name] = abspath
available_tables.append(table_name)
diff --git a/eds_scikit/io/hive.py b/eds_scikit/io/hive.py
index b20c7913..4eaf1e2b 100644
--- a/eds_scikit/io/hive.py
+++ b/eds_scikit/io/hive.py
@@ -193,12 +193,14 @@ def _prepare_person_ids(self, list_of_person_ids) -> Optional[SparkDataFrame]:
else:
unique_ids = set(list_of_person_ids)
- print(f"Number of unique patients: {len(unique_ids)}")
schema = StructType([StructField("person_id", LongType(), True)])
filtering_df = self.spark_session.createDataFrame(
[(int(p),) for p in unique_ids], schema=schema
- )
+ ).cache()
+
+ print(f"Number of unique patients: {filtering_df.count()}")
+
return filtering_df
def _read_table(self, table_name, person_ids=None) -> DataFrame:
@@ -261,6 +263,9 @@ def persist_tables_to_folder(
)
folder = os.path.abspath(folder)
+
+ os.makedirs(folder, mode=0o766, exist_ok=False)
+
assert os.path.exists(folder) and os.path.isdir(
folder
), f"Folder {folder} not found."
diff --git a/pyproject.toml b/pyproject.toml
index 3757972b..7323069c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -41,7 +41,7 @@ dependencies = [
"loguru>=0.6.0, <0.7.0",
"pypandoc==1.7.5",
"pyspark==2.4.3",
- "pyarrow>=0.10, <0.17.0",
+ "pyarrow==0.17.0", #"pyarrow>=0.10, <0.17.0",
"pretty-html-table>=0.9.15, <0.10.0",
"catalogue",
"schemdraw>=0.15.0, <1.0.0",
From 3fc77faa78b8079e51dc6e6b76c17803f665f0d3 Mon Sep 17 00:00:00 2001
From: Matthieu Doutreligne
Date: Thu, 2 Feb 2023 12:39:18 +0000
Subject: [PATCH 21/25] fix emergency mapping from registry
---
docs/recipes/small-cohorts.ipynb | 90 +++++++++++++--------
eds_scikit/emergency/emergency_care_site.py | 1 +
eds_scikit/utils/checks.py | 2 +-
tests/emergency/test_emergency_care_site.py | 2 +-
4 files changed, 58 insertions(+), 37 deletions(-)
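
Note on the `eds_scikit/utils/checks.py` hunk below: the algo string has the form `name` or `name.version`, so the name is the first dot-separated component, not the last one. The sketch below illustrates the difference in plain Python; `split_algo` is a hypothetical helper written for this note, not a function of the library.

```python
from typing import Optional, Tuple


def split_algo(algo: str) -> Tuple[str, Optional[str]]:
    """Split 'name.version' into (name, version); version is None when absent.

    Hypothetical helper, for illustration only.
    """
    name, _, version = algo.partition(".")
    return name, (version or None)


# Indexing with [-1] returns the version suffix instead of the algo name
wrong = "from_mapping.test".split(".")[-1]  # "test"
right = "from_mapping.test".split(".")[0]   # "from_mapping"

# Without a suffix, both indices return the same value, which hid the bug
unversioned = "from_mapping".split(".")
```

Because `[-1]` and `[0]` agree when no version is given, the error only shows up once a versioned name such as `from_mapping.test` is passed, which is what the updated test does.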
diff --git a/docs/recipes/small-cohorts.ipynb b/docs/recipes/small-cohorts.ipynb
index 1c7efd29..66dafc71 100644
--- a/docs/recipes/small-cohorts.ipynb
+++ b/docs/recipes/small-cohorts.ipynb
@@ -40,7 +40,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "2022-05-19 07:59:01.372 | WARNING | eds_scikit::25 - \n",
+ "2023-02-02 11:04:54.866 | WARNING | eds_scikit::31 - \n",
" To improve performances when using Spark and Koalas, please call `eds_scikit.improve_performances()`\n",
" This function optimally configures Spark. Use it as:\n",
" `spark, sc, sql = eds_scikit.improve_performances()`\n",
@@ -67,7 +67,17 @@
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 3,
+ "id": "779c6ef7-5839-4fcd-9971-a5e6fd804124",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "DBNAME=\"cse210038_20220921_160214312112\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
"id": "27086763-abf5-43ec-9d22-7fea742ef4e6",
"metadata": {},
"outputs": [
@@ -75,7 +85,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "/export/home/tpetitjean/.user_conda/miniconda/envs/scikit/lib/python3.7/site-packages/pyarrow/util.py:43: FutureWarning: pyarrow.open_stream is deprecated as of 0.17.0, please use pyarrow.ipc.open_stream instead.\n",
+ "/export/home/cse210038/Thomas/scikitenv/lib/python3.7/site-packages/pyarrow/util.py:39: FutureWarning: pyarrow.open_stream is deprecated as of 0.17.0, please use pyarrow.ipc.open_stream instead\n",
" warnings.warn(msg, FutureWarning)\n",
" \r"
]
@@ -98,7 +108,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 5,
"id": "71a75e1c-5abc-4471-aa92-41185a95b261",
"metadata": {},
"outputs": [],
@@ -109,7 +119,7 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 6,
"id": "11794b94-7736-4468-a262-7ad2d36a7232",
"metadata": {},
"outputs": [],
@@ -146,7 +156,7 @@
},
{
"cell_type": "code",
- "execution_count": 22,
+ "execution_count": 7,
"id": "4e0c42dc-6333-4b02-b159-33d5341db558",
"metadata": {},
"outputs": [
@@ -154,6 +164,10 @@
"name": "stderr",
"output_type": "stream",
"text": [
+ "[Stage 2:> (0 + 2) / 200]/export/home/cse210038/Thomas/scikitenv/lib/python3.7/site-packages/pyarrow/util.py:39: FutureWarning: pyarrow.open_stream is deprecated as of 0.17.0, please use pyarrow.ipc.open_stream instead\n",
+ " warnings.warn(msg, FutureWarning)\n",
+ "/export/home/cse210038/Thomas/scikitenv/lib/python3.7/site-packages/pyarrow/util.py:39: FutureWarning: pyarrow.open_stream is deprecated as of 0.17.0, please use pyarrow.ipc.open_stream instead\n",
+ " warnings.warn(msg, FutureWarning)\n",
" \r"
]
},
@@ -161,11 +175,11 @@
"data": {
"text/plain": [
"concept value \n",
- "HEART_TRANSPLANT DZEA002 27\n",
+ "HEART_TRANSPLANT DZEA002 39\n",
"dtype: int64"
]
},
- "execution_count": 22,
+ "execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -176,7 +190,7 @@
},
{
"cell_type": "code",
- "execution_count": 23,
+ "execution_count": 8,
"id": "897f1753-6f6d-4633-a8f7-135d9ccb01ad",
"metadata": {},
"outputs": [
@@ -191,11 +205,11 @@
"data": {
"text/plain": [
"concept value\n",
- "HEART_TRANSPLANT Z941 135\n",
+ "HEART_TRANSPLANT Z941 602\n",
"dtype: int64"
]
},
- "execution_count": 23,
+ "execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@@ -214,7 +228,7 @@
},
{
"cell_type": "code",
- "execution_count": 20,
+ "execution_count": 9,
"id": "f388afd0-4373-4170-af2e-203adb84e9a2",
"metadata": {},
"outputs": [
@@ -242,17 +256,17 @@
},
{
"cell_type": "code",
- "execution_count": 25,
+ "execution_count": 10,
"id": "b394ce04-fd97-4d48-8700-1321de4f0d17",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "30"
+ "53"
]
},
- "execution_count": 25,
+ "execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -271,7 +285,7 @@
},
{
"cell_type": "code",
- "execution_count": 32,
+ "execution_count": 11,
"id": "d5124d4a-8f6d-4433-a93f-99cc6fa660cc",
"metadata": {},
"outputs": [
@@ -285,10 +299,10 @@
{
"data": {
"text/plain": [
- "'0.08898 %'"
+ "'0.06849 %'"
]
},
- "execution_count": 32,
+ "execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
@@ -308,16 +322,23 @@
},
{
"cell_type": "code",
- "execution_count": 35,
+ "execution_count": 14,
"id": "e279c8d8-f489-479d-aa7c-886863718491",
"metadata": {},
"outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Number of unique patients: 30\n",
- "writing ./heart_transplant_cohort/person.parquet\n"
+ "Number of unique patients: 53\n",
+ "writing /export/home/cse210038/Thomas/eds-scikit/docs/recipes/heart_transplant_cohort/person.parquet\n"
]
},
{
@@ -331,7 +352,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "writing ./heart_transplant_cohort/visit_detail.parquet\n"
+ "writing /export/home/cse210038/Thomas/eds-scikit/docs/recipes/heart_transplant_cohort/visit_detail.parquet\n"
]
},
{
@@ -345,7 +366,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "writing ./heart_transplant_cohort/visit_occurrence.parquet\n"
+ "writing /export/home/cse210038/Thomas/eds-scikit/docs/recipes/heart_transplant_cohort/visit_occurrence.parquet\n"
]
},
{
@@ -359,7 +380,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "writing ./heart_transplant_cohort/procedure_occurrence.parquet\n"
+ "writing /export/home/cse210038/Thomas/eds-scikit/docs/recipes/heart_transplant_cohort/procedure_occurrence.parquet\n"
]
},
{
@@ -373,7 +394,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "writing ./heart_transplant_cohort/condition_occurrence.parquet\n"
+ "writing /export/home/cse210038/Thomas/eds-scikit/docs/recipes/heart_transplant_cohort/condition_occurrence.parquet\n"
]
},
{
@@ -388,7 +409,6 @@
"import os\n",
"\n",
"folder = os.path.abspath(\"./heart_transplant_cohort\")\n",
- "os.makedirs(folder, mode=777, exist_ok=True)\n",
"\n",
"tables_to_save = [\n",
" \"person\",\n",
@@ -418,7 +438,7 @@
},
{
"cell_type": "code",
- "execution_count": 36,
+ "execution_count": 15,
"id": "4938e306-4178-46f4-a8f6-3da42e3bbfb6",
"metadata": {},
"outputs": [],
@@ -438,17 +458,17 @@
},
{
"cell_type": "code",
- "execution_count": 39,
+ "execution_count": 16,
"id": "900b09bc-40c1-4222-8931-a63b13d78433",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "30"
+ "53"
]
},
- "execution_count": 39,
+ "execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -468,7 +488,7 @@
},
{
"cell_type": "code",
- "execution_count": 40,
+ "execution_count": 17,
"id": "f8f3f466-0b59-4ae1-900a-f7a93972daa6",
"metadata": {},
"outputs": [
@@ -478,7 +498,7 @@
"'100.00000 %'"
]
},
- "execution_count": 40,
+ "execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
@@ -490,9 +510,9 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": "scikit",
"language": "python",
- "name": "python3"
+ "name": "scikit"
},
"language_info": {
"codemirror_mode": {
@@ -504,7 +524,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.11"
+ "version": "3.7.8"
}
},
"nbformat": 4,
diff --git a/eds_scikit/emergency/emergency_care_site.py b/eds_scikit/emergency/emergency_care_site.py
index 5f372ba2..dbc3fbd0 100644
--- a/eds_scikit/emergency/emergency_care_site.py
+++ b/eds_scikit/emergency/emergency_care_site.py
@@ -103,6 +103,7 @@ def from_mapping(
if version is not None:
function_name += f".{version}"
mapping = registry.get("data", function_name=function_name)()
+ print(mapping)
# Getting the right framework
fw = framework.get_framework(care_site)
diff --git a/eds_scikit/utils/checks.py b/eds_scikit/utils/checks.py
index ef3ff172..c31b763e 100644
--- a/eds_scikit/utils/checks.py
+++ b/eds_scikit/utils/checks.py
@@ -96,7 +96,7 @@ def algo_checker(
algo = _get_arg_value(function, "algo", args, kwargs)
# Stripping the version suffix, if any
- algo = algo.split(".")[-1]
+ algo = algo.split(".")[0]
if algo not in algos:
raise ValueError(
diff --git a/tests/emergency/test_emergency_care_site.py b/tests/emergency/test_emergency_care_site.py
index a734cc7d..5bb5d9c8 100644
--- a/tests/emergency/test_emergency_care_site.py
+++ b/tests/emergency/test_emergency_care_site.py
@@ -145,6 +145,6 @@ def test_tagging(module, algo):
converted_input_df = framework.to(module, input_df)
- output = tag_emergency_care_site(converted_input_df, algo=algo)
+ output = tag_emergency_care_site(converted_input_df, algo=f"{algo}.test")
assert_equal_no_order(framework.pandas(output), expected_result, check_like=True)
From 35a662e486bf4cda5748e27874e46e59ec63d0a7 Mon Sep 17 00:00:00 2001
From: Matthieu Doutreligne
Date: Thu, 2 Feb 2023 14:37:32 +0000
Subject: [PATCH 22/25] fix emergency test
---
eds_scikit/emergency/emergency_care_site.py | 2 +-
tests/emergency/test_emergency_care_site.py | 10 +++++-----
tests/emergency/test_emergency_visits.py | 2 +-
3 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/eds_scikit/emergency/emergency_care_site.py b/eds_scikit/emergency/emergency_care_site.py
index dbc3fbd0..6549c9de 100644
--- a/eds_scikit/emergency/emergency_care_site.py
+++ b/eds_scikit/emergency/emergency_care_site.py
@@ -102,8 +102,8 @@ def from_mapping(
function_name = "get_care_site_emergency_mapping"
if version is not None:
function_name += f".{version}"
+
mapping = registry.get("data", function_name=function_name)()
- print(mapping)
# Getting the right framework
fw = framework.get_framework(care_site)
diff --git a/tests/emergency/test_emergency_care_site.py b/tests/emergency/test_emergency_care_site.py
index 5bb5d9c8..077800f2 100644
--- a/tests/emergency/test_emergency_care_site.py
+++ b/tests/emergency/test_emergency_care_site.py
@@ -5,8 +5,8 @@
from eds_scikit.utils.test_utils import assert_equal_no_order, make_df
# Dictionary of the form {algo_name: [input_df, expected_output_df]}
-algos = dict(
- from_mapping=[
+algos = {
+ "from_mapping.test": [
make_df(
"""
care_site_id,care_site_source_value
@@ -36,7 +36,7 @@
"""
),
],
- from_regex_on_care_site_description=[
+ "from_regex_on_care_site_description": [
make_df(
"""
care_site_name
@@ -134,7 +134,7 @@
"""
),
],
-)
+}
@pytest.mark.parametrize("module", ["pandas", "koalas"])
@@ -145,6 +145,6 @@ def test_tagging(module, algo):
converted_input_df = framework.to(module, input_df)
- output = tag_emergency_care_site(converted_input_df, algo=f"{algo}.test")
+ output = tag_emergency_care_site(converted_input_df, algo=algo)
assert_equal_no_order(framework.pandas(output), expected_result, check_like=True)
diff --git a/tests/emergency/test_emergency_visits.py b/tests/emergency/test_emergency_visits.py
index 6be75747..8fb3b10c 100644
--- a/tests/emergency/test_emergency_visits.py
+++ b/tests/emergency/test_emergency_visits.py
@@ -52,7 +52,7 @@
visit_detail=visit_detail,
care_site=care_site,
visit_occurrence=None,
- algo="from_mapping",
+ algo="from_mapping.test",
),
dict(
visit_detail=visit_detail,
From bb51963d5ffecc35d0fbe48565db71e01bc5040b Mon Sep 17 00:00:00 2001
From: Matthieu Doutreligne
Date: Thu, 2 Feb 2023 14:39:29 +0000
Subject: [PATCH 23/25] fix: remove unwanted cell in notebook
---
docs/recipes/small-cohorts.ipynb | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/docs/recipes/small-cohorts.ipynb b/docs/recipes/small-cohorts.ipynb
index 66dafc71..fbd2ad1a 100644
--- a/docs/recipes/small-cohorts.ipynb
+++ b/docs/recipes/small-cohorts.ipynb
@@ -65,16 +65,6 @@
"DBNAME=\"YOUR_DATABASE_NAME\""
]
},
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "779c6ef7-5839-4fcd-9971-a5e6fd804124",
- "metadata": {},
- "outputs": [],
- "source": [
- "DBNAME=\"cse210038_20220921_160214312112\""
- ]
- },
{
"cell_type": "code",
"execution_count": 4,
From 117083762cc2d14e60b51a84842a11c521e6a20c Mon Sep 17 00:00:00 2001
From: Adam REMAKI
Date: Thu, 2 Feb 2023 16:35:49 +0100
Subject: [PATCH 24/25] docs: :lipstick: Update bioclean table for clarity
(#23)
* docs: :lipstick: Update bioclean table for clarity
* new line
* Hot fix in the doc
* doc: explicit configuration in biology tutorial
---------
Co-authored-by: Thomas Petit-Jean
---
docs/functionalities/biology/index.md | 14 ++---
docs/functionalities/biology/tutorial.ipynb | 65 ++++++++++++++------
docs/functionalities/biology/vocabulary.md | 2 +-
eds_scikit/biology/utils/process_concepts.py | 4 +-
4 files changed, 57 insertions(+), 28 deletions(-)
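
Note on the updated bioclean table below: the `outlier` column in the example rows is consistent with flagging any `transformed_value` that falls outside `[min_threshold, max_threshold]`. The sketch below reproduces the five example rows under that assumption; the rule is inferred from the table only, and both the helper name `is_outlier` and the treatment of values exactly on a threshold are assumptions, not the library's documented behaviour.

```python
def is_outlier(value: float, min_threshold: float, max_threshold: float) -> bool:
    """Flag a value lying strictly outside [min_threshold, max_threshold].

    Hypothetical helper; boundary handling is an assumption of this sketch.
    """
    return value < min_threshold or value > max_threshold


# (transformed_value, min_threshold, max_threshold) taken from the example table
rows = [
    (115, 0, 190),
    (220, 0, 190),
    (0.45, 0.542, 8.548),
    (4.52, 0.542, 8.548),
    (9.58, 0.542, 8.548),
]

flags = [is_outlier(value, low, high) for value, low, high in rows]
# flags == [False, True, True, False, True], matching the `outlier` column
```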
diff --git a/docs/functionalities/biology/index.md b/docs/functionalities/biology/index.md
index 09c7260f..2a3e51af 100644
--- a/docs/functionalities/biology/index.md
+++ b/docs/functionalities/biology/index.md
@@ -73,13 +73,13 @@ bioclean(data, start_date="2020-01-01", end_date="2021-12-31")
data.bioclean.head()
```
-| concepts_set | transformed_unit | transformed_value | max_threshold | min_threshold | outlier | .... |
-| :----------- | :--------------- | :---------------- | :------------ | :------------ | :------ | :--- |
-| Entity A | x10*9/l | 115 | 190 | 0 | False | .... |
-| Entity A | x10*9/l | 220 | 190 | 0 | True | .... |
-| Entity B | mmol | 0.45 | 8.548 | 0.542 | True | .... |
-| Entity B | mmol | 4.52 | 8.548 | 0.542 | False | .... |
-| Entity B | mmol | 9.58 | 8.548 | 0.542 | True | .... |
+| concepts_set | LOINC_concept_code | LOINC_concept_name | AnaBio_concept_code | AnaBio_concept_name | transformed_unit | transformed_value | max_threshold | min_threshold | outlier | value_source_value | unit_source_value |
+| :------------------------- | :----------------- | :----------------- | :------------------ | :------------------- | :--------------- | :---------------- | :------------ | :------------ | :------ | :----------------- | :---------------- |
+| EntityA_Blood_Quantitative | 000-0 | EntityA #Bld | A0000 | EntityA_Blood | x10*9/l | 115 | 190 | 0 | False | 115 x10*9/l | x10*9/l |
+| EntityA_Blood_Quantitative | 000-1 | EntityA_Blood_Vol | A0001 | EntityA_Blood_g/l | x10*9/l | 220 | 190 | 0 | True | 560 g/l | g/l |
+| EntityB_Blood_Quantitative | 001-0 | EntityB_Blood | B0000 | EntityB_Blood_artery | mmol | 0.45 | 8.548 | 0.542 | True | 0.45 mmol | mmol |
+| EntityB_Blood_Quantitative | 001-0 | EntityB_Blood | B0001 | EntityB_Blood_vein | mmol | 4.52 | 8.548 | 0.542 | False | 4.52 mmol | mmol |
+| EntityB_Blood_Quantitative | 000-1 | EntityB Bld Auto | B0002 | EntityB_Blood_µg/l | mmol | 9.58 | 8.548 | 0.542 | True | 3587 µg/l | µg/l |
For more details, have a look at [the dedicated section](cleaning).
diff --git a/docs/functionalities/biology/tutorial.ipynb b/docs/functionalities/biology/tutorial.ipynb
index 59a224dc..bd6a87d5 100644
--- a/docs/functionalities/biology/tutorial.ipynb
+++ b/docs/functionalities/biology/tutorial.ipynb
@@ -143,13 +143,42 @@
},
{
"cell_type": "markdown",
- "id": "e67de6c8",
+ "id": "9ac6f5a6-bdff-4826-8b3f-314856f2e1d9",
+ "metadata": {},
+ "source": [
+ "## 3. Define the configuration\n",
+ "\n",
+ "The configuration file does 3 things:\n",
+ "\n",
+ "- Remove outliers\n",
+ "- Remove unwanted codes\n",
+ "- Normalize units\n",
+ "\n",
+ "### 3.1 The default configuration\n",
+ "\n",
+ "A **default configuration** is available when working on APHP's CDW. You can access it via:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5051cb27-ada6-49dd-9026-0c6489b29bbd",
"metadata": {},
+ "outputs": [],
"source": [
- "## 3. Create your own configuration (**OPTIONAL**)\n",
+ "from eds_scikit.resources import registry\n",
"\n",
- "If the [default configuration](../../datasets/biology-config.md) file based on the AP-HP's Data Warehouse does not meet your requirements, you can follow this tutorial to create your own configuration file.\n",
+ "biology_config = registry.get(\"data\", \"get_biology_config.all_aphp\")()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e67de6c8",
+ "metadata": {},
+ "source": [
+ "### 3.2 Create your own configuration (**OPTIONAL**)\n",
"\n",
+ "If this default configuration file does not meet your requirements, you can follow this tutorial to create your own configuration file. \n",
"As a reminder, a configuration file is a csv table where each row corresponds to a given standard concept_code and a given unit. For each row, it gives a maximum threshold and a minimum threshold to flag outliers and a unit conversion coefficient to normalize units if needed."
]
},
@@ -158,7 +187,7 @@
"id": "ed657ed4",
"metadata": {},
"source": [
- "### 3.1 Plot statistical summary\n",
+ "#### 3.2.1 Plot statistical summary\n",
"\n",
"The first step is to compute the statistical summary of each concepts-set with the function ``plot_biology_summary(stats_only=True)``. "
]
@@ -602,7 +631,7 @@
"id": "70ce6f91",
"metadata": {},
"source": [
- "### 3.2 Create configuration from statistical summary\n",
+ "#### 3.2.2 Create configuration from statistical summary\n",
"\n",
"Then, you can use the function ``create_config_from_stats()`` to pre-fill the configuration file with ``max_threshold`` and ``min_threshold``. The thresholds computation is based on the Median Absolute Deviation (MAD) Methodology[@madmethodology]."
]
@@ -629,7 +658,7 @@
"id": "7eb98862",
"metadata": {},
"source": [
- "### 3.3 Edit units manually"
+ "#### 3.2.3 Edit units manually"
]
},
{
@@ -649,7 +678,7 @@
"id": "8c634329",
"metadata": {},
"source": [
- "### 3.4 Use your custom configuration\n",
+ "#### 3.2.4 Use your custom configuration\n",
"\n",
"Once you have created your configuration (for instance under the name `config_name=\"my_custom_config\"`), you can provide it to the relevant functions (see below).\n",
"\n",
@@ -689,7 +718,7 @@
"bioclean(\n",
" data,\n",
" concepts_sets=concepts_sets,\n",
- " config_name=config_name,\n",
+ " config_name=config_name, # use config_name=\"all_aphp\" for APHP's default configuration\n",
" start_date=start_date,\n",
" end_date=end_date,\n",
")"
@@ -702,13 +731,13 @@
"source": [
"See below the columns created by the ``bioclean()`` function:\n",
"\n",
- "| concepts_set | transformed_unit | transformed_value | max_threshold | min_threshold | outlier | .... |\n",
- "| :----------- | :--------------- | :---------------- | :------------ | :------------ | :------ | :--- |\n",
- "| Entity A | x10*9/l | 115 | 190 | 0 | False | .... |\n",
- "| Entity A | x10*9/l | 220 | 190 | 0 | True | .... |\n",
- "| Entity B | mmol | 0.45 | 8.548 | 0.542 | True | .... |\n",
- "| Entity B | mmol | 4.52 | 8.548 | 0.542 | False | .... |\n",
- "| Entity B | mmol | 9.58 | 8.548 | 0.542 | True | .... |"
+ "| concepts_set | LOINC_concept_code | LOINC_concept_name | AnaBio_concept_code | AnaBio_concept_name | transformed_unit | transformed_value | max_threshold | min_threshold | outlier | value_source_value | unit_source_value |\n",
+ "| :------------------------- | :----------------- | :----------------- | :------------------ | :------------------- | :--------------- | :---------------- | :------------ | :------------ | :------ | :----------------- | :---------------- |\n",
+ "| EntityA_Blood_Quantitative | 000-0 | EntityA #Bld | A0000 | EntityA_Blood | x10*9/l | 115 | 190 | 0 | False | 115 x10*9/l | x10*9/l |\n",
+ "| EntityA_Blood_Quantitative | 000-1 | EntityA_Blood_Vol | A0001 | EntityA_Blood_g/l | x10*9/l | 220 | 190 | 0 | True | 560 g/l | g/l |\n",
+ "| EntityB_Blood_Quantitative | 001-0 | EntityB_Blood | B0000 | EntityB_Blood_artery | mmol | 0.45 | 8.548 | 0.542 | True | 0.45 mmol | mmol |\n",
+ "| EntityB_Blood_Quantitative | 001-0 | EntityB_Blood | B0001 | EntityB_Blood_vein | mmol | 4.52 | 8.548 | 0.542 | False | 4.52 mmol | mmol |\n",
+ "| EntityB_Blood_Quantitative | 000-1 | EntityB Bld Auto | B0002 | EntityB_Blood_µg/l | mmol | 9.58 | 8.548 | 0.542 | True | 3587 µg/l | µg/l |"
]
},
{
@@ -748,9 +777,9 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": "scikit",
"language": "python",
- "name": "python3"
+ "name": "scikit"
},
"language_info": {
"codemirror_mode": {
@@ -762,7 +791,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.12"
+ "version": "3.7.8"
},
"vscode": {
"interpreter": {
diff --git a/docs/functionalities/biology/vocabulary.md b/docs/functionalities/biology/vocabulary.md
index 40a3c5c5..b08813f5 100644
--- a/docs/functionalities/biology/vocabulary.md
+++ b/docs/functionalities/biology/vocabulary.md
@@ -15,4 +15,4 @@ The standard vocabulary is a unified vocabulary that allows data analysis on a l
## Vocabulary flowchart in OMOP
-
+
diff --git a/eds_scikit/biology/utils/process_concepts.py b/eds_scikit/biology/utils/process_concepts.py
index 9947d751..cfd96398 100644
--- a/eds_scikit/biology/utils/process_concepts.py
+++ b/eds_scikit/biology/utils/process_concepts.py
@@ -120,9 +120,9 @@ def get_concept_src_to_std(
concepts_sets : List[ConceptsSet]
List of concepts-sets to select
standard_concept_regex : dict, optional
- **EXAMPLE**: `["LOINC", "AnaBio"]`
- standard_terminologies : List[str], optional
**EXAMPLE**: `{"LOINC": "[0-9]{2,5}[-][0-9]","AnaBio": "[A-Z][0-9]{4}"}`
+ standard_terminologies : List[str], optional
+ **EXAMPLE**: `["LOINC", "AnaBio"]`
Returns
From e29f4a58e0c4d01e990496e20b61b333ec686ede Mon Sep 17 00:00:00 2001
From: Thomas Petit-Jean <30775613+Thomzoy@users.noreply.github.com>
Date: Thu, 2 Feb 2023 16:47:18 +0100
Subject: [PATCH 25/25] V0.1.3 (#28)
* chore: bump to v0.1.3
* chore: changelog
---
README.md | 6 +++---
changelog.md | 2 +-
eds_scikit/__init__.py | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
index ee90eefe..057c014c 100644
--- a/README.md
+++ b/README.md
@@ -57,13 +57,13 @@ eds-scikit stands on the shoulders of [Spark 2.4](https://spark.apache.org/docs/
You can install eds-scikit via `pip`:
```bash
-pip install eds-scikit
+pip install "eds-scikit[aphp]"
```
-:warning: If you work in AP-HP's ecosystem (EDS), please install additionnal features via:
+:warning: If you don't work in AP-HP's ecosystem (EDS), please install via:
```bash
-pip install "eds-scikit[aphp]"
+pip install eds-scikit
```
You can now import the library via
diff --git a/changelog.md b/changelog.md
index c7ddd93d..5b6401e8 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,6 +1,6 @@
# Changelog
-## Pending
+## v0.1.3 (2023-02-02)
### Added
diff --git a/eds_scikit/__init__.py b/eds_scikit/__init__.py
index 3e5c80e4..9cfd7c6d 100644
--- a/eds_scikit/__init__.py
+++ b/eds_scikit/__init__.py
@@ -1,7 +1,7 @@
"""Top-level package for eds_scikit."""
__author__ = """eds_scikit"""
-__version__ = "0.1.2"
+__version__ = "0.1.3"
import importlib
import os