- Isaiah Onando Mulang’, Kuldeep Singh, Chaitali Prabhu, Abhishek Nadgeri, Johannes Hoffart, Jens Lehmann
- CIKM '20, October 19–23, 2020, Virtual Event, Ireland
## Abstract
- [The] context derived from a knowledge graph (in our case: Wikidata) provides enough signals to inform pretrained transformer models and improve their performance for named entity disambiguation (NED) on the Wikidata KG.
- Providing KG context in transformer architectures considerably outperforms the existing baselines, including the vanilla transformer models.
## 1 Introduction
- Entity Linking (EL) generally consists of two subtasks[,] namely surface form extraction (mention detection) and named entity disambiguation (NED)
- [This paper only reviews the NED aspect, based on given NER detections]
- A method to obtain matching contextual information from the KG itself could be beneficial to disambiguate [named entities].
- RQ1: How does applying KG context impact the performance of transformer models on NED over Wikidata?
- RQ2: What is the performance of different configurations of KG context as new information signals on the NED task?
- RQ3: Can we generalize our proposed context in a state of the art NED model for other knowledge bases such as Wikipedia?
## 2 Task Definition
- Given:
- a sentence,
- a recognized entity surface form,
- a set of candidate entities, and
- a Knowledge Graph (KG),
- the objective is to select the entity within the KG that matches the surface form in the text.
- A set of entity surface forms S
- This paper addresses the problem of named entity disambiguation, which selects an entity e^C ∈ E′ that matches the textual mention s ∈ S
- We view this task as a classification f = Classify(h(X)) on the conditional probability h(X) = P(Y = 1 | X).
- Taking x ∈ X = (s, e′; θ), **we study configurations of the context parameters θ**.
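- A minimal sketch (not from the paper) of how this formulation can be operationalized: score every candidate entity with the learned model h(X) and pick the highest-scoring one. `score_candidate` is a hypothetical stand-in for the fine-tuned classifier.

```python
# Minimal sketch (not the authors' code): NED framed as binary classification
# over (sentence, surface form, candidate) inputs, as in Section 2.
# `score_candidate` is a hypothetical stand-in for the fine-tuned model h(X).

from typing import Callable, List


def disambiguate(
    sentence: str,
    surface_form: str,
    candidates: List[str],
    score_candidate: Callable[[str, str, str], float],
) -> str:
    """Return the candidate e' maximizing h(X) = P(Y = 1 | X) for x = (s, e'; theta)."""
    scores = {
        candidate: score_candidate(sentence, surface_form, candidate)
        for candidate in candidates
    }
    # Classify(h(X)): pick the candidate whose positive-class probability is highest.
    return max(scores, key=scores.get)
```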
## 3 Related Work
- For detailed information on entity linking we refer to the surveys in [1, 25]
- we argue that a small amount of KG triple context is sufficient for a pretrained transformer
- [Is the Wikidata triple context used here really a "small amount"?]
## 4 Approach
- classification: f(h(s, e′; θ)) = y
- s, the mention surface form
- e′, the candidate entity
- θ, the contextual parameters
- KG triples Φ [Phi] as context
- The classifier employs the binary cross-entropy loss
- Φ is an ordered set of triples (h^e, r, t)^i
- h^e, the head (subject) of each triple, is the candidate entity to be classified [while r is the relationship type, and t is the target (object)]
- The sequence of these verbalized [“stringified”] triples [for a given candidate e′] is appended to the original sentence and surface form, delimited by the [SEP] token (see the sketch after Figure 2 below)
- When there are too many triples, the model's maximum sequence length is used to truncate the input
- [It seems unclear what the actual cutoff was.]
- Figure 1: Overall approach: Φ refers to the ordered set of triples from the KG for a candidate entity, while Φ^{max} ⊆ Φ is the subset with the maximum number of triples that fits in the sequence length. For brevity: N → "National", H → "Highway", desc → "description"
- [The classification target then is to predict "y=1" when the triples for the correct entity are fed to the model, and "y=0" otherwise]
- Figure 2: KG context: top three 1-hop triples from Wikidata for two entities with the same label, "National Highway".
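- A minimal sketch (assumptions, not the paper's code) of the input construction described above: verbalize 1-hop triples for a candidate, join them with the tokenizer's separator token, and truncate to an assumed maximum sequence length of 512 (the paper's exact cutoff is unclear). `build_input`, the example sentence, and the example triples are illustrative.

```python
# Minimal sketch (assumptions, not the paper's exact code): verbalize 1-hop
# Wikidata triples for a candidate entity and append them to the sentence and
# surface form, separator-delimited, truncated to the model's max sequence
# length. The paper says [SEP]; RoBERTa's tokenizer uses </s> as its sep token.

from typing import List, Tuple

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
MAX_LEN = 512  # assumed; the paper only says the max sequence length limits the input


def build_input(sentence: str, surface_form: str,
                triples: List[Tuple[str, str, str]]):
    """Encode: sentence [SEP] surface form [SEP] "h r t" [SEP] "h r t" ..."""
    verbalized = [f"{h} {r} {t}" for h, r, t in triples]  # "stringified" triples (Phi)
    sep = tokenizer.sep_token
    context = f" {sep} ".join([surface_form] + verbalized)  # Phi^max is whatever fits
    return tokenizer(sentence, context, truncation=True,
                     max_length=MAX_LEN, return_tensors="pt")


# Illustrative usage with entities sharing the label "National Highway"
# (triples here are examples, not necessarily the exact ones in Figure 2):
encoded = build_input(
    "The National Highway passes through five states.",
    "National Highway",
    [("National Highway", "instance of", "highway system"),
     ("National Highway", "country", "India")],
)
```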
- 
## 5 Evaluation
- The first dataset is Wikidata-Disamb [3], which aligns Wiki-Disamb30 [17] to Wikidata entities and adds closely matching entities as negative samples for every entity in the dataset. It consists of 200,000 train and 20,000 test samples.
- We aligned its Wikipedia entities to the corresponding Wikidata entities to fetch the KG triples
- We compare our results against three types of baselines:
1. Long Short-Term Memory (LSTM) networks [3]
- These models were augmented with a massive amount of 1- and 2-hop KG triples
2. Vanilla transformer models, RoBERTa and XLNet (i.e., transformers without KG context)
- We chose two state-of-the-art transformer architectures, RoBERTa [13] and XLNet [16], and fine-tuned them on the Wikidata-Disamb training set.
- For each vanilla transformer architecture, we add a classification head (see the sketch after this list)
3. For AIDA-CoNLL, we chose DCA-SL [12] as the underlying model, which is the second-best peer-reviewed SOTA on this dataset.
- [It was rather difficult to identify what the three "baselines" are.]
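- A minimal sketch of what the vanilla transformer baseline with a classification head could look like, assuming a single-logit head trained with binary cross-entropy (the loss named in Section 4); the model choice, learning rate, and max length are placeholders, not the authors' reported settings.

```python
# Minimal sketch (assumed setup, not the authors' code): a vanilla RoBERTa
# baseline with a classification head, fine-tuned on (sentence, candidate)
# pairs. A single-logit head with BCEWithLogitsLoss mirrors the binary
# cross-entropy loss mentioned in Section 4.

import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed learning rate
loss_fn = torch.nn.BCEWithLogitsLoss()


def training_step(sentence: str, candidate_context: str, label: float) -> float:
    """One fine-tuning step on a single (sentence, candidate [+ KG context]) pair."""
    batch = tokenizer(sentence, candidate_context, truncation=True,
                      max_length=512, return_tensors="pt")
    logits = model(**batch).logits.squeeze(-1)  # shape: (1,)
    loss = loss_fn(logits, torch.tensor([label]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```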
### Results and Discussion
- although transformer-based language models are trained on huge corpora and already carry contextual knowledge from pretraining, they show limited performance even against the RNN model; this RNN model [3] uses GloVe embeddings together with task-specific context
- transformer models achieve better precision compared to recall [see Table 2]
- the detailed analysis of each experimental setup and the corresponding data can be found in our GitHub repository:
- https://github.com/mulangonando/Impact-of-KG-Context-on-ED
- The amount of data fed as context to our models is minimal (up to 15 1-hop triples); a sketch of how such 1-hop context might be fetched from Wikidata appears at the end of this section.
- In contrast, the best-performing model from [3] was fed up to 1500 1- and 2-hop triples.
- We introduced 1-hop KG context for candidate entities into the DCA-SL model [12].
- Replacing the unstructured Wikipedia description with structured KG triple context containing entity aliases, entity types, a consolidated entity description, etc. has a positive impact on performance
- ["The unstructured Wikipedia description" refers to the names of (coherence-filtered) Wikipedia links pointing at the candidate entity; see Section 3 in [12]]
- Our proposed change [i.e., replacing "unstructured Wikipedia descriptions" with the Φ triples] (DCA-SL + Triples) outperforms the baselines for Wikipedia entity disambiguation (cf. Table 3).
- [It seems the best peer reviewed performance on In-KB Acc. comes from a team at Bloomberg: "Collective Entity Disambiguation with Structured Gradient Tree Boosting", Yi Yang et al., NAACL-HLT 2018, with 95.3% In-KB accuracy, making the NED SOTA a non-deep-learning based approach.]
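- A minimal sketch (an assumption; these notes do not describe the paper's actual retrieval pipeline) of how up to 15 one-hop Wikidata triples with English labels could be fetched for a candidate QID via the public SPARQL endpoint.

```python
# Minimal sketch (assumption: the paper's exact retrieval may differ) of
# fetching up to 15 one-hop Wikidata triples, with English labels, for a
# candidate entity identified by its QID.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?propLabel ?valueLabel WHERE {
  wd:%s ?prop ?value .
  ?p wikibase:directClaim ?prop .
  ?p rdfs:label ?propLabel . FILTER(LANG(?propLabel) = "en")
  ?value rdfs:label ?valueLabel . FILTER(LANG(?valueLabel) = "en")
}
LIMIT 15
"""


def one_hop_triples(qid: str, entity_label: str):
    """Return up to 15 (head, relation, tail) label triples for the given QID."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY % qid)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [(entity_label, b["propLabel"]["value"], b["valueLabel"]["value"])
            for b in bindings]
```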
## 6 Conclusion
- pretrained transformer models, although powerful, are limited to the context available in the texts of their original training corpora
- task-specific KG context improved performance. However, there is a limit to how many triples can be added as context before they stop helping; we note that 2-hop triples had little or even negative impact on transformer performance
- > This work provides a new SOTA for the AIDA-CoNLL dataset
- Wrong: Yi Yang et al. (NAACL-HLT 2018) hold the NED SOTA to date.
## References
- #1 Krisztian Balog. 2018. Entity Linking. Springer International Publishing
- #3 Alberto Cetoli et al. 2019. A Neural Approach to Entity Linking on Wikidata. In ECIR. 78–86 (source of the Wikidata-Disamb dataset)
- #12 Xiyuan Yang et al. 2019. Learning Dynamic Context Augmentation for Global Entity Linking. In EMNLP
- #25 W. Shen, J. Wang, and J. Han. 2015. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering 27, 2 (2015), 443–460.