25th International Conference on Database Systems for Advanced Applications

Sep. 24-27, 2020, Jeju, South Korea

Click following URL

http://dasfaa2020.sigongji.com

to visit DASFAA 2020 Online Event Site

Paper details

Title: Tackling MeSH Indexing Dataset Shift with Time-aware Concept Embedding Learning

Authors: Qiao Jin, Haoyang Ding, Linfeng Li, Haitao Huang, Lei Wang and Jun Yan

Abstract: Medical Subject Headings (MeSH) is a controlled thesaurus developed by the National Library of Medicine (NLM). MeSH covers a wide variety of biomedical topics like diseases and drugs, which are used to classify PubMed articles. Human indexers at NLM have been annotating the PubMed articles with MeSH for decades, and have collected millions of MeSH-labeled articles. Recently, many deep learning algorithms have been developed to automatically annotate the MeSH terms, utilizing this large-scale MeSH indexing dataset. However, most of the models are trained on all articles non-discriminatively, ignoring the temporal structure of the dataset. In this paper, we uncover and thoroughly characterize the problem of MeSH indexing dataset shift (MeSHIFT), meaning that the data distribution changes with time. MeSHIFT includes the shift of input articles, output MeSH labels and annotation rules. We found that machine learning models suffer from performance loss for not tackling the problem of MeSHIFT. Towards this end, we present a novel method, time-aware concept embedding learning (TaCEL), as an attempt to solve it. TaCEL is a plug-in module which can be easily incorporated in other automatic MeSH indexing models. Results show that TaCEL improves current state-of-the-art models with only minimum additional costs. We hope this work can facilitate understanding of the MeSH indexing dataset, especially its temporal structure, and provide a solution that can be used to improve current models.

Video file:

Slide file:

Sponsors