The code and output of our AAAI paper "Knowledge-Enriched Visual Storytelling" (arxiv)
Unlike the format of the VIST dataset, here we put all stories in one column. For example:
{
"original_text": "The local parish holds a craft show each year.",
"album_id": "44277",
"photo_flickr_id": "1741642",
"story_id": "45530",
"text_mapped_with_nouns_and_frame": [
"parish_NOUN",
"Containing_Frame",
"craft_NOUN",
"show_NOUN",
"year_NOUN"
],
"predicted_term_seq": [
"Arriving_Frame",
"town_NOUN"
],
"predicted_story": "i was out of town . there were so many books to find . the table looked very old . we played games . it came from a man ."
}
predicted_term_seq is either the terms predicted by our image2term module (as in this example) or . predicted_story is the whole story predicted by our term2story model, and text_mapped_with_nouns_and_frame is the result that open-sesame extracted from the original sentence.
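The terms in text_mapped_with_nouns_and_frame and predicted_term_seq carry a suffix that marks their type (_NOUN or _Frame in the example above). As a minimal sketch, assuming these two suffixes are the only ones that occur, the terms can be separated as follows. split_terms is a hypothetical helper for illustration and is not part of this repository.

```python
import json

# One record, abridged from the example above.
RECORD = json.loads("""
{
  "original_text": "The local parish holds a craft show each year.",
  "story_id": "45530",
  "text_mapped_with_nouns_and_frame": [
    "parish_NOUN", "Containing_Frame", "craft_NOUN", "show_NOUN", "year_NOUN"
  ],
  "predicted_term_seq": ["Arriving_Frame", "town_NOUN"]
}
""")


def split_terms(terms):
    """Split suffixed terms into (nouns, frames).

    Assumption: every term ends in "_NOUN" or "_Frame"; other terms are skipped.
    """
    nouns, frames = [], []
    for term in terms:
        # Split on the last underscore so a word containing underscores stays whole.
        word, _, kind = term.rpartition("_")
        if kind == "NOUN":
            nouns.append(word)
        elif kind == "Frame":
            frames.append(word)
    return nouns, frames


gold_nouns, gold_frames = split_terms(RECORD["text_mapped_with_nouns_and_frame"])
pred_nouns, pred_frames = split_terms(RECORD["predicted_term_seq"])
print("extracted:", gold_nouns, gold_frames)
print("predicted:", pred_nouns, pred_frames)
```

Splitting on the last underscore, rather than the first, keeps the sketch usable if a term's word part ever contains an underscore itself.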
KG-Story (Visual Genome): generated_stories/1path_visualLanguageModel_VIST_LDPE.json
KG-Story (OpenIE): generated_stories/1path_OpenieLanguageModel_VIST_LDPE.json
KG-Story (no KG): generated_stories/0path_visual_VIST.json
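To compare the three variants side by side, the files can be loaded and looked up by story_id. The sketch below assumes that each file is a JSON list of records shaped like the example above; the top-level layout is an assumption, so adjust load_records if the files are organized differently. VARIANTS and the helper names are illustrative only.

```python
import json

# File paths as listed above, relative to the repository root.
VARIANTS = {
    "KG-Story (Visual Genome)": "generated_stories/1path_visualLanguageModel_VIST_LDPE.json",
    "KG-Story (OpenIE)": "generated_stories/1path_OpenieLanguageModel_VIST_LDPE.json",
    "KG-Story (no KG)": "generated_stories/0path_visual_VIST.json",
}


def load_records(path):
    """Read one file. Assumption: the top level is a JSON list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def stories_by_id(records):
    """Map story_id -> predicted_story, keeping the first value seen per story."""
    stories = {}
    for rec in records:
        stories.setdefault(rec["story_id"], rec["predicted_story"])
    return stories


def compare(variant_records, story_id):
    """Return {variant name: predicted story or None} for one story_id."""
    return {
        name: stories_by_id(records).get(story_id)
        for name, records in variant_records.items()
    }


# Usage, run from the repository root:
#   records = {name: load_records(path) for name, path in VARIANTS.items()}
#   print(compare(records, "45530"))
```

Several records may share one story_id (one per photo), which is why stories_by_id keeps only the first predicted_story it sees for each story.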
See details in https://github.com/peteanderson80/bottom-up-attention.
pytorch==1.3
python==3.6
Caffe
1. In src/stage1/image_preprocess, resize images to 224*224:
bash resize_all_image.sh CPU_NUM IMAGE_DIR OUTPUT_DIR
2. In src/stage1/image_object_detection, perform object detection and object feature extraction (requires the Caffe environment):
# Object Detection
python3.6 ./tools/generate_tsv.py --gpu 0,1 \
    --cfg experiments/cfgs/faster_rcnn_end2end_resnet.yml \
    --def ./models/vg/ResNet-101/faster_rcnn_end2end_final/test.prototxt \
    --out vist_with_classes_attr_png.csv \
    --net data/faster_rcnn_models/resnet101_faster_rcnn_final.caffemodel \
    --split vist
# Split file into npzs
python3.6 make_bu_data.py
3. In src/stage1/image2term, generate terms from the image features:
# Build Vocabulary
python3.6 build_term_vocab.py
# Training
# bash run.sh GPU_ID Decode_mode[transformer, rnn] Decoder_type[term, text] [-self_att] [-global_att]
bash run_train.sh 0 term rnn -self_att
pytorch==1.3
python==3.6
1. In src/stage2/data, execute ./download_big_data.sh to download the files larger than 100MB.
2. To train the path scoring language model:
python3.6 main.py
The trained model will be saved as ./model.pt
3. To link paths using the Visual Genome knowledge graph (i.e. Visual_Genome in the paper):
python score_relation.vg.py
The resulting story_id-to-path file will be saved as ./vist_scored_terms_6_path.json
(The paths in this file differ from those in the generated stories because the model has been trained again.)
For the OpenIE part, I mistakenly deleted the OpenIE knowledge graph results. I will try to fix this in the future.
Afterwards, run convert_file_format.py to convert ./vist_scored_terms_6_path.json into a different data format for stage 3.
The data is output as "vist_scored_terms_6_path_for_stage3.json".
- Train on ROC: in src/stage3/term2story, run
bash run.sh [GPU device number] roc [positional encoding]
For example, to run on the first GPU with the LDPE positional encoding mentioned in the paper:
bash run.sh 1 roc LDPE
The trained model will be saved as ./save_model_rocLDPE/trained.chkpt
- Fine-tune on VIST: in src/stage3/term2story, run
bash run_finetune.sh ./save_model_rocLDPE/trained.chkpt finetune [GPU device number] LDPE
The trained model will be saved in ./save_model_finetuneLDPE_pretrain_vist/
- Generate stories: in src/stage3/term2story, run
python inference_added_story.py -model [trained model checkpoint filepath] -device [GPU device number] -positional LDPE -term_path [stage2 filepath]
python inference_added_story.py -model save_model_finetuneLDPE_pretrain_vist/trained_accu_55.397.chkpt -device 3 -positional LDPE -term_path ../../stage2/visual_genome_language_model_terms_add_lowest.json
The generated stories will be saved as f'VIST_test_self_output_diverse_add_highest_one_path_noun' + str(opt.insert+1) + str(opt.relation) + '_norm_penalty_coor_VISTdataset_percent_' + str(opt.positional) + '.json'
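To see what that expression evaluates to, it can be reproduced in isolation. The sketch below copies the expression and feeds it stand-in options; the values insert=0 and relation=1 are hypothetical placeholders (only positional='LDPE' comes from the commands above), and the real types of opt.insert and opt.relation in inference_added_story.py are assumed here, not confirmed.

```python
from types import SimpleNamespace


def output_filename(opt):
    """Build the output filename from the expression quoted above."""
    return (
        'VIST_test_self_output_diverse_add_highest_one_path_noun'
        + str(opt.insert + 1)
        + str(opt.relation)
        + '_norm_penalty_coor_VISTdataset_percent_'
        + str(opt.positional)
        + '.json'
    )


# Hypothetical option values; insert is assumed to be an integer.
opt = SimpleNamespace(insert=0, relation=1, positional='LDPE')
name = output_filename(opt)
print(name)
```

Because the name is concatenated without separators between opt.insert+1 and opt.relation, the two values run together in the filename, which is worth keeping in mind when searching for the output file.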