The data creation repository for the paper: MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction. Check out the baselines repository as well.
MedDistant19 is a distantly supervised biomedical relation extraction (Bio-DSRE) corpus obtained by aligning the PubMed MEDLINE abstracts from 2019 with the SNOMED-CT knowledge graph (KG) derived from the UMLS Metathesaurus 2019.
Please use the requirements file for installation:
pip install -r requirements.txt
Before Downloading: Ensure you have a copy of the UMLS license to use this dataset. For more details, please read the note here.
cd benchmark
bash download_meddistant19.sh
This will download the data in an OpenNRE-compatible format into the directory benchmark/meddistant19. An example line looks as follows:
{
"text": "Urethral stones are rarely formed primarily in the urethra and are usually associated with urethral strictures or diverticula .",
"h": {"id": "C0041967", "pos": [51, 58], "name": "urethra"},
"t": {"id": "C0041974", "pos": [91, 110], "name": "urethral strictures"},
"relation": "finding_site_of"
}
The text is pre-tokenized with ScispaCy and can be split at whitespace. The position indexes are at the character level.
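For example, the instance above can be parsed directly and its character offsets checked against the mention names:

import json

# The example instance shown above, as one JSON line.
line = '{"text": "Urethral stones are rarely formed primarily in the urethra and are usually associated with urethral strictures or diverticula .", "h": {"id": "C0041967", "pos": [51, 58], "name": "urethra"}, "t": {"id": "C0041974", "pos": [91, 110], "name": "urethral strictures"}, "relation": "finding_site_of"}'

inst = json.loads(line)
text = inst["text"]
h_start, h_end = inst["h"]["pos"]
print(text[h_start:h_end])   # -> "urethra": offsets are character-level
print(text.split()[:3])      # pre-tokenized, so whitespace splitting is safe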
The dataset is constructed using the inductive KG split (see below). The summary statistics of the final data are presented in the following table:
Split | Instances | Facts | Inst. Per Bag | Bags | NA (%) |
---|---|---|---|---|---|
Train | 450,071 | 5,455 | 5.06 | 88,861 | 90.0% |
Valid | 39,434 | 842 | 3.76 | 10,475 | 91.2% |
Test | 91,568 | 1,663 | 4.05 | 22,606 | 91.1% |
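Since MedDistant19 is a bag-level (multi-instance) corpus, instances sharing an entity pair form a bag. A minimal sketch of recomputing the statistics above, assuming a downloaded split file path and that the unlabeled relation is encoded as "NA" (both are assumptions; adjust to the actual files):

import json
from collections import defaultdict

path = "benchmark/meddistant19/train.txt"  # hypothetical path; adjust to the downloaded split file
bags = defaultdict(int)
na_instances = total = 0
with open(path) as f:
    for line in f:
        inst = json.loads(line)
        total += 1
        na_instances += inst["relation"] == "NA"           # assuming "NA" encodes "no relation"
        bags[(inst["h"]["id"], inst["t"]["id"])] += 1      # one common bag definition: (head, tail) pair

print(f"bags={len(bags)}, inst/bag={total / len(bags):.2f}, NA={100 * na_instances / total:.1f}%")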
The KG split can be inductive or transductive. The table below summarizes both (split ratio: 70% / 10% / 20%):
Facts | Train | Valid | Test |
---|---|---|---|
Inductive (I) | 261,797 | 48,641 | 97,861 |
Transductive (T) | 318,524 | 28,370 | 56,812 |
We use UMLS as our knowledge base, with a SNOMED_CT_US subset-based installation using MetamorphoSys. To obtain reproducible data splits, please follow the steps outlined below.
Download UMLS 2019AB and unzip it in a directory (preferably this one). The unzipped directory will be umls-2019AB-full; we will call its path UMLS_DOWNLOAD_DIR in the remainder of this document.
Go to UMLS_DOWNLOAD_DIR/2019AB-full and use the run* script appropriate for your OS. Once the MetamorphoSys application opens, press the Install UMLS button. A window will prompt for Source and Destination paths. The Source should already be set to UMLS_DOWNLOAD_DIR/2019AB-full. Create a new folder under UMLS_DOWNLOAD_DIR called MedDistant19 and set it as the Destination path; it should look like UMLS_DOWNLOAD_DIR/MedDistant19. In the remaining document, these two paths will be called SOURCE and DESTINATION.
Run the script init_config.py to set the path values in the config.prop file provided in this directory:
python init_config.py --src SOURCE --dst DESTINATION
Now, use this configuration file in MetamorphoSys to install SNOMED_CT_US by selecting the Open Configuration option.
Once the UMLS installation is complete with MetamorphoSys, find the *.RRF files under DESTINATION/META. Copy MRREL.RRF, MRCONSO.RRF, and MRSTY.RRF into this directory.
Please download the Semantic Groups file from here. Once all the files are downloaded, please match the resulting MD5 hash values of the relevant files against those reported in the mmsys.md5 file in this directory. If there are still mismatches, please report the issue.
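As a convenience, the checksums can be computed locally with a short Python sketch (file names taken from the copy step above; compare the output against the entries in mmsys.md5):

import hashlib

# Files copied into this directory in the previous step.
for fname in ["MRREL.RRF", "MRCONSO.RRF", "MRSTY.RRF"]:
    md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1MB chunks
            md5.update(chunk)
    print(md5.hexdigest(), fname)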
First, we preprocess the UMLS files with the script:
bash scripts/preprocess_umls.sh
Now, we can extract the transductive triples split:
bash scripts/kg_transductive.sh
This will create several files; the most important ones are train.tsv, dev.tsv, and test.tsv. These splits are transductive, i.e., the entities appearing in the dev and test sets have also appeared in the training set.
An inductive split refers to dev and test sets whose entities were not seen during training. It uses the files created by the transductive split. To create a simple inductive split, use the command below (a conceptual sketch of the filtering follows it):
bash scripts/kg_inductive.sh
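Conceptually, the inductive split filters the transductive dev/test triples so that their entities never occur in training. A minimal sketch of that idea, assuming three-column head/relation/tail TSVs (the actual script may differ in details):

import csv

def load_triples(path):
    # Assumes three-column TSV triples: head, relation, tail.
    with open(path) as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]

train = load_triples("train.tsv")
seen = {e for h, r, t in train for e in (h, t)}  # entities seen during training

def keep_unseen(triples):
    # One common inductive criterion: both entities unseen in training.
    return [(h, r, t) for h, r, t in triples if h not in seen and t not in seen]

dev_ind = keep_unseen(load_triples("dev.tsv"))
test_ind = keep_unseen(load_triples("test.tsv"))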
As our documents, we use abstract texts from the PubMed MEDLINE 2019 version available here. We provide a processed version of the corpus, which has been deduplicated, tokenized, and linked to UMLS concepts with ScispaCy's UMLSEntityLinker. We can optionally recreate the corpus by following the steps outlined in the "From Scratch" section.
Please follow the link and download the file medline_pubmed_2019_entity_linked.tar.gz (~30GB compressed) into the MEDLINE folder. Match the md5sum value of the downloaded file, then uncompress it (~221GB):
cd MEDLINE
tar -xzvf medline_pubmed_2019_entity_linked.tar.gz
This will extract the file medline_pubmed_2019_entity_linked.jsonl, where each line is a JSON record containing the tokenized text, the associated UMLS concepts, and per-token linguistic features. For example:
{
"text": "30 % of ertapenem is cleared by a session of haemodialysis ( HD ) .",
"mentions": [
["C1120106", "1.00", [8, 17]],
["C1883016", "1.00", [34, 41]],
["C0019004", "1.00", [45, 58]],
["C0019829", "1.00", [61, 63]]
],
"features": [
["CD", "NUM", "NumType=Card", "nummod", 1],
["NN", "NOUN", "Number=Sing", "nsubjpass", 5],
["IN", "ADP", "", "case", 3],
["NN", "NOUN", "Number=Sing", "nmod", 1],
["VBZ", "VERB", "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", "auxpass", 5],
["VBN", "VERB", "Aspect=Perf|Tense=Past|VerbForm=Part", "ROOT", 5],
["IN", "ADP", "", "case", 8],
["DT", "DET", "Definite=Ind|PronType=Art", "det", 8],
["NN", "NOUN", "Number=Sing", "nmod", 5],
["IN", "ADP", "", "case", 10],
["NN", "NOUN", "Number=Sing", "nmod", 8],
["-LRB-", "PUNCT", "PunctSide=Ini|PunctType=Brck", "punct", 12],
["NN", "NOUN", "Number=Sing", "appos", 10],
["-RRB-", "PUNCT", "PunctSide=Fin|PunctType=Brck", "punct", 12],
[".", "PUNCT", "PunctType=Peri", "punct", 5]
]
}
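Each row in features lines up with one whitespace token and appears to contain [PTB tag, universal POS, morphological features, dependency label, head index] (a schema inferred from the example above, not a documented one). A short sketch of decoding it:

import json

with open("MEDLINE/medline_pubmed_2019_entity_linked.jsonl") as f:
    record = json.loads(next(f))  # first record of the corpus

tokens = record["text"].split()  # pre-tokenized text
for tok, (tag, upos, morph, dep, head) in zip(tokens, record["features"]):
    # e.g. "cleared" -> VBN/VERB, dependency ROOT, head index pointing to itself
    print(f"{tok:15s} {tag:6s} {upos:6s} {dep:10s} head={tokens[head]}")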
This step is optional and only needed if we wish to train our own word2vec model on this corpus. The current default setup in word2vec.py is the one used to obtain the pre-trained PubMed embeddings for the word2vec model:
python word2vec.py --medline_entities_linked_fname MEDLINE/medline_pubmed_2019_entity_linked.jsonl --output_dir w2v_model
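Assuming word2vec.py saves a gensim model under --output_dir (an assumption; check the script for the actual file name, the one below is hypothetical), the embeddings could then be queried like:

from gensim.models import Word2Vec

model = Word2Vec.load("w2v_model/word2vec.model")      # hypothetical file name
print(model.wv.most_similar("ertapenem", topn=5))      # nearest neighbours in embedding space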
(Optional step ends here)
Assuming we followed the instructions in the UMLS folder, we can now create the benchmark splits in OpenNRE format.
The script below creates the benchmark med_distant19 with the split files med_distant19_train.txt, med_distant19_dev.txt, and med_distant19_test.txt in the MEDLINE directory:
bash scripts/create_meddistant19.sh
We can move these files to the folder benchmark:
mkdir benchmark/med_distant19
mv ../MEDLINE/med_distant19_*.txt benchmark/med_distant19/
Please match the MD5 hash values provided in benchmark/med_distant19. We can extract several relevant files (semantic types, semantic groups, relation categories, etc.) from benchmark/med_distant19 with:
python extract_benchmark_metadata.py \
--benchmark_dir benchmark \
--umls_dir UMLS \
--dataset benchmark/med_distant19
First, download the abstracts from 2019 and extract the texts from them with the script:
cd ../scripts
sh download_and_extract_abstracts.sh
This will produce several *.xml.gz.txt files in this directory.

We use the en_core_sci_lg model; please install it first:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
To extract sentences from the abstract texts, we use ScispaCy for tokenization:
model=en_core_sci_lg
num_cpus=32
batch_size=1024
python scispacy_tokenization.py \
--data_dir MEDLINE \
--scispacy_model_name $model \
--n_process $num_cpus \
--batch_size $batch_size
We ran this command on a cluster with SLURM support. It took about 9 hours with 32 CPUs (4GB memory each) and a batch size of 1024, which spaCy uses internally for multiprocessing. This yields about 151M extracted sentences in the file MEDLINE/medline_pubmed_2019_sents.txt. Sort and extract the unique sentences:
cat MEDLINE/medline_pubmed_2019_sents.txt | sort | uniq > MEDLINE/medline_pubmed_2019_unique_sents.txt
Previous studies have used exact-matching strategies, which produce suboptimal concept linking. We instead use ScispaCy's UMLSEntityLinker to extract concepts:
num_cpus=32
model=en_core_sci_lg
batch_size=1024
# Please set this to a directory with ample space! ScispaCy downloads large indexes on first use.
# export SCISPACY_CACHE=/to/cache/scispacy
python scispacy_entity_linking.py \
--medline_unique_sents_fname MEDLINE/medline_pubmed_2019_unique_sents.txt \
--output_file MEDLINE/medline_pubmed_2019_entity_linked.jsonl \
--scispacy_model_name $model \
--n_process $num_cpus \
--batch_size $batch_size \
--min_sent_tokens 5 \
--max_sent_tokens 128
WARNING: This job is memory-intensive and requires up to half a terabyte of RAM. We ran this command on a SLURM-supported cluster with 32 CPUs (~18GB memory each) and a batch size of 1024. It took about 75 hours to link about 149M unique sentences.
If you find our work useful, please consider citing:
@inproceedings{amin-etal-2022-meddistant19,
title = "{M}ed{D}istant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction",
author = "Amin, Saadullah and Minervini, Pasquale and Chang, David and Stenetorp, Pontus and Neumann, G{\"u}nter",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.198",
pages = "2259--2277",
}
We thank the original authors of the following sources for releasing their split code. The transductive split is adapted from snomed_kge. The inductive split is adapted from blp.