- Isaiah Onando Mulang’, Kuldeep Singh, Chaitali Prabhu, Abhishek Nadgeri, Johannes Hoffart, Jens Lehmann
- CIKM '20, October 19–23, 2020, Virtual Event, Ireland

## Abstract

- The context derived from a knowledge graph (in our case: Wikidata) provides enough signals to inform pretrained transformer models and improve their performance for named entity disambiguation (NED) on the Wikidata KG.
- Providing KG context in transformer architectures considerably outperforms the existing baselines, including the vanilla transformer models.

## 1 Introduction

- Entity Linking (EL) generally consists of two subtasks, namely: surface form extraction (mention detection) and named entity disambiguation (NED)
- [This paper only covers the NED aspect, based on given NER detections]
- A method to obtain matching contextual information from the KG itself could be beneficial for disambiguating named entities.
- RQ1: How does applying KG context impact the performance of transformer models on NED over Wikidata?
- RQ2: What is the performance of different configurations of KG context as new information signals on the NED task?
- RQ3: Can we generalize our proposed context in a state-of-the-art NED model for other knowledge bases such as Wikipedia?

## 2 Task Definition

- Given:
    - a sentence,
    - a recognized entity surface form,
    - a set of candidate entities, and
    - a Knowledge Graph (KG),
- the objective is to select the entity within the KG that matches the surface form in the text.
- A set of entity surface forms S
- This paper addresses named entity disambiguation, which selects an entity e^C ∈ E′ that matches the textual mention s ∈ S
- We view this task as a classification f = Classify(h(X)) on the conditional probability h(X) = P(Y = 1|X). Taking x ∈ X = (s, e′; θ), **we study configurations of the context parameters θ**.

## 3 Related Work

- For detailed information on entity linking we refer to the surveys in [1, 25]
- we argue that a slight amount of KG triple context is enough for a pretrained transformer
- [Is Wikidata context really a "slight amount"?]

## 4 Approach

- classification: f(h(s, e′; θ)) = y
- s, the mentioned surface form
- e′, the candidate entity
- contextual parameters θ
- KG triples Φ as context
- The classifier employs the binary cross-entropy loss
- Φ is an ordered set of triples (h^e, r_{hp}, t_{hp})^i
- h^e, the head (subject) of each triple, is the candidate entity to be classified, while r_{hp} is the relation (property) and t_{hp} the tail (object)
- The sequence of these verbalized ("stringified") triples for a given candidate e′ is appended to the original sentence and surface form, delimited by the [SEP] token (see the sketch after Figure 2)
- When there are too many triples, the maximum sequence length is used to truncate the input
- [It remains unclear what the actual cutoff was.]
- Figure 1: Overall Approach: Φ refers to the ordered set of triples from the KG for a candidate entity, while Φ^{max} ⊆ Φ is the maximum number of triples that fits in the sequence length. For brevity: N → "National", H → "Highway", desc → "description"
- ![Figure 1: Overall Approach](BERT-based%20KG%20Embeddings%20for%20NED.png)
- [The classification target then is to predict y=1 when the triples of the correct entity are fed to the model, and y=0 otherwise]
- Figure 2: KG context: Top three 1-hop triples from Wikidata for two entities with the same label: National Highway.
- ![Figure 2: 1-hop KG Context](1-hop%20KG%20Context.png)
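- Below is a minimal sketch (not the authors' code) of how the Section 4 input could be assembled and scored: the sentence, the surface form, and the verbalized triples of one candidate are joined with separator tokens, truncated to the model's maximum sequence length, and passed through a transformer with a binary classification head trained with binary cross-entropy. The model choice (`roberta-base`), the helper names (`verbalize`, `score_candidate`), and the hard cap of 15 triples are illustrative assumptions.

```python
# Sketch of the KG-context input construction and binary candidate scoring,
# assuming HuggingFace transformers; this is not the authors' implementation.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

MAX_TRIPLES = 15   # assumed cap; the paper reports feeding up to 15 1-hop triples
MAX_LEN = 512      # the transformer's maximum sequence length truncates the rest


def verbalize(triples):
    """Turn (head_label, relation_label, tail_label) triples into one string,
    delimited by the tokenizer's separator token."""
    sep = f" {tokenizer.sep_token} "
    return sep.join(f"{h} {r} {t}" for h, r, t in triples[:MAX_TRIPLES])


def score_candidate(sentence, surface_form, triples):
    """Return an estimate of P(y=1 | sentence, surface form, candidate triples).
    The classification head is untrained here; in the paper it is fine-tuned
    (e.g. on Wikidata-Disamb) with binary cross-entropy."""
    sep = f" {tokenizer.sep_token} "
    context = surface_form + sep + verbalize(triples)
    enc = tokenizer(sentence, context, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt")
    with torch.no_grad():
        logit = model(**enc).logits.squeeze(-1)
    return torch.sigmoid(logit).item()


# Usage example with two made-up triples for a "National Highway" candidate;
# the score is only meaningful after fine-tuning.
score = score_candidate(
    "The National Highway passes through the city.",
    "National Highway",
    [("National Highway", "instance of", "highway system"),
     ("National Highway", "country", "India")],
)
```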
## 5 Evaluation

- The first dataset is Wikidata-Disamb [3], which aligns Wiki-Disamb30 [17] to Wikidata entities and adds closely matching entities as negative samples for every entity in the dataset. It consists of 200,000 training and 20,000 test samples
- We aligned its Wikipedia entities to the corresponding Wikidata entities in order to fetch the KG triples
- We compare our results with three types of baselines:
    1. Long Short Term Memory (LSTM) networks [3]
        - These models were augmented with a massive amount of 1- and 2-hop KG triples
    2. Vanilla transformer models, RoBERTa and XLNet (i.e., transformers without KG context)
        - We chose two state-of-the-art transformer architectures, RoBERTa [13] and XLNet [16], added a classification head to each, and fine-tuned them on the Wikidata-Disamb training set
    3. For AIDA-CoNLL, we chose DCA-SL [12] as our underlying model, which is the second peer-reviewed SOTA on this dataset
- [It was rather difficult to identify what the three "baselines" are.]

### Results and Discussion

- Although the transformer-based language models are trained on huge corpora and possess context for the data, they show limited performance even against the RNN model. This RNN model [3] uses GloVe embeddings together with task-specific context
- ![Table 1: KG Context for ED Evaluation](KG%20Context%20for%20ED%20Evaluation.png)
- Transformer models achieve better precision compared to recall [see Table 2]
- The detailed analysis of each experimental setup and the corresponding data can be found in our GitHub repository
    - https://github.com/mulangonando/Impact-of-KG-Context-on-ED
- The amount of data fed as context to our models is minimal (up to 15 1-hop triples); a fetching sketch follows this section
- In contrast, the best-performing model from the work in [3] was fed up to 1500 1+2-hop triples
- We induced 1-hop KG context for candidate entities into the DCA-SL model [12]
- Replacing the unstructured Wikipedia description with structured KG triple context containing entity aliases, entity types, a consolidated entity description, etc. has a positive impact on performance
- ["The unstructured Wikipedia description" refers to the names of (coherence-filtered) Wikipedia links pointing at the candidate entity; see Section 3 in [12]]
- Our proposed change [i.e., replacing the "unstructured Wikipedia descriptions" with the Φ triples] (DCA-SL + Triples) outperforms the baselines for Wikipedia entity disambiguation (cf. Table 3)
- ![Table 3: Evaluating the impact of KG context for NED with AIDA-CoNLL](Evaluating%20the%20impact%20of%20KG%20for%20NED%20with%20AIDA-CONLL.png)
- [It seems the best peer-reviewed performance on In-KB Acc. comes from a team at Bloomberg: "Collective Entity Disambiguation with Structured Gradient Tree Boosting", Yi Yang et al., NAACL-HLT 2018, with 95.3% In-KB accuracy, making the NED SOTA a non-deep-learning-based approach.]
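- As a side note on reproducing the minimal-context setting above, here is a hedged sketch of fetching up to 15 labeled 1-hop triples for a candidate Wikidata entity from the public SPARQL endpoint. The query shape is standard Wikidata SPARQL; the helper name `one_hop_triples` and the example QID `Q42` (Douglas Adams) are arbitrary illustrative choices, not taken from the paper.

```python
# Sketch: pull up to 15 one-hop triples (with English labels) for a candidate
# Wikidata entity via the public SPARQL endpoint. Not the authors' pipeline.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def one_hop_triples(qid, limit=15):
    """Return [(qid, property_label, object_label), ...] from Wikidata.
    Only keeps statements whose object is an entity with an English label,
    so literal-valued statements (dates, strings) are skipped."""
    query = f"""
    SELECT ?pLabel ?oLabel WHERE {{
      wd:{qid} ?p ?o .
      ?prop wikibase:directClaim ?p .
      ?prop rdfs:label ?pLabel . FILTER(LANG(?pLabel) = "en")
      ?o rdfs:label ?oLabel .   FILTER(LANG(?oLabel) = "en")
    }} LIMIT {limit}
    """
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "kg-context-ned-notes/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(qid, r["pLabel"]["value"], r["oLabel"]["value"]) for r in rows]


# e.g. one_hop_triples("Q42") -> [("Q42", "instance of", "human"), ...]
```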
## 6 Conclusion

- Pretrained transformer models, although powerful, are limited to capturing the context available purely in the texts of their original training corpora
- Task-specific KG context improved the performance. However, there is a limit to the number of triples that can improve performance when used as context. We note that 2-hop triples resulted in negative or little impact on transformer performance
- > This work provides a new SOTA for the AIDA-CoNLL dataset
    - [Wrong: Yi Yang et al., NAACL-HLT 2018, hold the NED SOTA to date.]

## References

- [1] Krisztian Balog. 2018. Entity Linking. Springer International Publishing
- [3] Alberto Cetoli et al. 2019. A Neural Approach to Entity Linking on Wikidata. In ECIR. 78–86 [introduces the Wikidata-Disamb dataset]
- [12] Xiyuan Yang et al. 2019. Learning Dynamic Context Augmentation for Global Entity Linking. In EMNLP
- [25] W. Shen, J. Wang, and J. Han. 2015. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering 27, 2 (2015), 443–460

## Mindmap

![](Evaluating%20the%20Impact%20of%20Knowledge%20Graph%20Context%20on%20Entity%20Disambiguation%20Models%20-%202020.pdf)