Cross-lingual Name Tagging and Linking (PAN-X)

The WikiANN dataset (Pan et al. 2017), also called PAN-X, provides named entity recognition (NER) annotations for three entity types: PER (person), ORG (organization) and LOC (location). It was constructed from the linked entities in Wikipedia pages for 282 languages.

Identifier Task Type Metric License Website Code Download
PAN-X Named Entity Recognition F1 (macro) CC BY-SA 3.0

Data Source

The dataset is constructed using the linked entities in Wikipedia pages.

Data Description

# Train Validation Test
Examples 16,237 7,029 7,263

Label Distribution

train validation test
O 0.786 0.789 0.791
LOC 0.100 0.096 0.096
PER 0.060 0.060 0.059
ORG 0.054 0.055 0.055
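
The proportions above read as token-level label frequencies with the B- and I- prefixes merged into a single entity type. The following is a minimal sketch of how such a distribution could be computed, assuming each split is available as a list of tag sequences; the function name `label_distribution` and the input layout are illustrative assumptions, not part of the dataset tooling.

```python
from collections import Counter


def label_distribution(tag_sequences):
    """Return the fraction of tokens carrying each entity type.

    B- and I- prefixes are stripped, so B-LOC and I-LOC both count as LOC.
    `tag_sequences` is assumed to be a list of per-sentence tag lists.
    """
    counts = Counter(
        tag.split("-", 1)[-1]  # "B-LOC" -> "LOC", "O" -> "O"
        for tags in tag_sequences
        for tag in tags
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Tags of the Bulgarian example sentence shown in this document.
example_tags = [[
    "O", "O", "O", "O", "B-LOC", "O", "B-LOC",
    "I-LOC", "I-LOC", "O", "B-LOC", "O", "B-LOC", "O",
]]
dist = label_distribution(example_tags)
for label, share in sorted(dist.items()):
    print(f"{label} {share:.3f}")
```

Running the function once per split and placing the results side by side would give a table in the same layout as the one above.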

Vocabulary Overlap

Each cell is the number of unique words shared by the row split and the column split, divided by the number of unique words in the row split.

   train validation test
train 1.000 0.142 0.144
validation 0.075 1.000 0.099
test 0.076 0.098 1.000
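
The overlap definition above can be sketched as follows. This is a hedged illustration only: it assumes each split is a list of tokenized sentences and that words are compared as exact, case-sensitive strings; the helper names `vocabulary` and `overlap_matrix` are hypothetical, and the preprocessing actually used for the table is not stated in this document.

```python
def vocabulary(sentences):
    """Collect the set of unique tokens in a list of tokenized sentences."""
    return {token for sentence in sentences for token in sentence}


def overlap_matrix(splits):
    """Compute |V_row & V_col| / |V_row| for every pair of splits.

    `splits` maps a split name to its list of tokenized sentences.
    An empty row vocabulary yields 0.0 to avoid division by zero.
    """
    vocabs = {name: vocabulary(sentences) for name, sentences in splits.items()}
    matrix = {}
    for row, row_vocab in vocabs.items():
        matrix[row] = {}
        for col, col_vocab in vocabs.items():
            shared = len(row_vocab & col_vocab)
            matrix[row][col] = shared / len(row_vocab) if row_vocab else 0.0
    return matrix


# Toy data (not taken from the dataset) to show the asymmetry of the measure.
toy_splits = {
    "train": [["a", "b"], ["c", "d"]],
    "validation": [["a", "e"]],
}
matrix = overlap_matrix(toy_splits)
print(matrix["train"]["validation"])  # 1 shared word / 4 train words
print(matrix["validation"]["train"])  # 1 shared word / 2 validation words
```

Because the denominator is the row vocabulary, the matrix is not symmetric, and the diagonal is always 1.000 for a non-empty split.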

Example

bg:Видът        O
bg:е    O
bg:разпространен        O
bg:в    O
bg:Бурунди      B-LOC
bg:,    O
bg:Демократична B-LOC
bg:република    I-LOC
bg:Конго        I-LOC
bg:,    O
bg:Замбия       B-LOC
bg:и    O
bg:Танзания     B-LOC
bg:.    O
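
The example above uses one token per line, a language prefix (`bg:`) on each token, and BIO tags (B- opens an entity, I- continues it, O is outside any entity). A minimal sketch of reading this layout and recovering the entity spans is given below; it assumes the token and tag are separated by whitespace and the language code by the first colon, and the function names `parse_line` and `extract_entities` are illustrative rather than part of any official loader.

```python
def parse_line(line):
    """Split a line such as 'bg:Конго        I-LOC' into (lang, token, tag)."""
    prefixed_token, tag = line.rsplit(None, 1)   # split on the last whitespace run
    lang, token = prefixed_token.split(":", 1)   # split on the first colon only
    return lang, token, tag


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans.

    An I- tag that does not continue an open span of the same type
    is ignored in this sketch.
    """
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]


lines = [
    "bg:в    O",
    "bg:Демократична B-LOC",
    "bg:република    I-LOC",
    "bg:Конго        I-LOC",
    "bg:,    O",
    "bg:Замбия       B-LOC",
]
parsed = [parse_line(line) for line in lines]
tokens = [token for _, token, _ in parsed]
tags = [tag for _, _, tag in parsed]
entities = extract_entities(tokens, tags)
print(entities)
```

On these six lines the sketch yields two LOC spans: the three-token name "Демократична република Конго" and the single-token name "Замбия".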

Citation

[1] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.

@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
}

License

Attribution-NonCommercial 4.0 International (Apache License 2.0). See the LICENSE file.