Universal Dependencies (POS Tagging)

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages.

Identifier  Task Type    Metric      License
U.Dep       POS Tagging  F1 (macro)  CC BY-NC-SA 3.0
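
The macro-averaged F1 listed as the metric can be illustrated with a short sketch. It assumes token-level scoring over UPOS tags, with an unweighted average over every tag that occurs in the gold or predicted sequence; the benchmark's own evaluation code may make different choices. The function name `macro_f1` and the toy inputs are hypothetical.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over tags seen in gold or pred (assumed convention)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # tag p was predicted where it was wrong
            fn[g] += 1  # gold tag g was missed
    scores = []
    for tag in set(gold) | set(pred):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy sequences, not real model output.
gold = ["ADP", "NOUN", "PUNCT", "VERB"]
pred = ["ADP", "NOUN", "PUNCT", "NOUN"]
print(round(macro_f1(gold, pred), 3))  # 0.667
```

Because every tag counts equally, a rare tag such as INTJ influences the score as much as NOUN, which is the main difference from plain token accuracy.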

Data Source

UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences).

All texts were processed automatically at the tokenization, morphological, and chunk levels. The full syntactic analysis was then performed manually by trained annotators.

Data Description

# Train Dev Test
Examples 8,907 1,115 1,116

Label Distribution

train validation test
NOUN 0.218 0.219 0.222
PUNCT 0.141 0.141 0.144
ADP 0.141 0.142 0.142
VERB 0.110 0.111 0.108
ADJ 0.087 0.087 0.088
PRON 0.065 0.066 0.062
AUX 0.056 0.055 0.056
PROPN 0.055 0.052 0.051
ADV 0.042 0.040 0.043
CCONJ 0.031 0.032 0.030
DET 0.015 0.018 0.017
NUM 0.013 0.013 0.014
PART 0.013 0.013 0.012
SCONJ 0.010 0.011 0.010
INTJ 0.001 0.001 0.001
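
A distribution like the one above is the relative frequency of each tag within a split. The following is a minimal sketch, assuming each split is available as a flat list of UPOS tags; `label_distribution` and the toy input are hypothetical and are not the code used to build the table.

```python
from collections import Counter


def label_distribution(tags):
    """Return {tag: relative frequency}, ordered from most to least frequent."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}


# Toy input: the UPOS tags of a single 11-token sentence, not a real split.
tags = ["ADP", "NOUN", "PUNCT", "VERB", "PUNCT", "AUX",
        "PRON", "VERB", "ADJ", "NOUN", "PUNCT"]
for tag, share in label_distribution(tags).items():
    print(f"{tag:<6}{share:.3f}")
```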

Vocabulary Overlap

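Each cell is the number of unique words shared by the row split and the column split, divided by the number of unique words in the column split. For example, 67.1% of the validation vocabulary also occurs in train, while only 16.3% of the train vocabulary occurs in validation.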

   train validation test
train 1.000 0.671 0.689
validation 0.163 1.000 0.327
test 0.165 0.321 1.000
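
An overlap matrix of this kind can be sketched as below. The sketch assumes the vocabulary of a split is its set of surface word forms, compared case-sensitively, and it divides by the column split's vocabulary size. That normalization is an assumption, chosen because it matches the asymmetry in the table: the train row, validation column value of 0.671 is only possible if the denominator is the smaller validation vocabulary. `vocabulary_overlap` and the toy data are hypothetical.

```python
def vocabulary_overlap(splits):
    """splits: {name: iterable of word forms}. Returns {row: {col: ratio}}.

    Each cell is |vocab(row) & vocab(col)| / |vocab(col)| (assumed normalization).
    """
    vocab = {name: set(words) for name, words in splits.items()}
    return {
        row: {
            col: len(vocab[row] & vocab[col]) / len(vocab[col]) if vocab[col] else 0.0
            for col in vocab
        }
        for row in vocab
    }


# Toy data, not taken from the treebank.
splits = {
    "train": ["в", "дискусията", "важни", "въпроси", "ще", "се"],
    "validation": ["важни", "въпроси", "нови"],
}
table = vocabulary_overlap(splits)
print(round(table["train"]["validation"], 3))  # 0.667
print(round(table["validation"]["train"], 3))  # 0.333
```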

Example

# newdoc id = akadgram
# sent_id = akadgram-s2
# text = В дискусията, предполагам, ще се засегнат важни въпроси.
1	В	в	ADP	R	_	2	case	2:case	_
2	дискусията	дискусия	NOUN	Ncfsd	Definite=Def|Gender=Fem|Number=Sing	8	obl	8:obl:в	SpaceAfter=No
3	,	,	PUNCT	punct	_	4	punct	4:punct	_
4	предполагам	предполагам	VERB	Vpitf-r1s	Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act	8	advcl	8:advcl	SpaceAfter=No
5	,	,	PUNCT	punct	_	4	punct	4:punct	_
6	ще	ще	AUX	Tx	_	8	aux	8:aux	_
7	се	се	PRON	Ppxta	Case=Acc|PronType=Prs|Reflex=Yes	8	expl	8:expl	_
8	засегнат	засегна-(се)	VERB	Vpptf-r3p	Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
9	важни	важен	ADJ	A-pi	Definite=Ind|Degree=Pos|Number=Plur	10	amod	10:amod	_
10	въпроси	въпрос	NOUN	Ncmpi	Definite=Ind|Gender=Masc|Number=Plur	8	nsubj:pass	8:nsubj:pass	SpaceAfter=No
11	.	.	PUNCT	punct	_	8	punct	8:punct	_
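
The example is in CoNLL-U format: lines starting with `#` are comments, and each token line has ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). For POS tagging only FORM and UPOS are needed. The following is a minimal sketch of extracting them; `read_conllu_pos` is a hypothetical helper, not part of any UD tooling, and a dedicated CoNLL-U library may be preferable for real use.

```python
def read_conllu_pos(text):
    """Yield one sentence at a time as a list of (form, upos) pairs."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment / metadata line
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 tab-separated fields, got {len(fields)}")
        token_id, form, upos = fields[0], fields[1], fields[3]
        if not token_id.isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        sentence.append((form, upos))
    if sentence:
        yield sentence


# First three tokens of the example sentence above.
sample = (
    "# sent_id = akadgram-s2\n"
    "1\tВ\tв\tADP\tR\t_\t2\tcase\t2:case\t_\n"
    "2\tдискусията\tдискусия\tNOUN\tNcfsd\tDefinite=Def|Gender=Fem|Number=Sing"
    "\t8\tobl\t8:obl:в\tSpaceAfter=No\n"
    "3\t,\t,\tPUNCT\tpunct\t_\t4\tpunct\t4:punct\t_\n"
    "\n"
)
for sent in read_conllu_pos(sample):
    print(sent)  # [('В', 'ADP'), ('дискусията', 'NOUN'), (',', 'PUNCT')]
```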

Citation

[1] Petya Osenova and Kiril Simov. BTB-TR05: BulTreeBank Stylebook. BulTreeBank Project Technical Report № 05. 2004.

@techreport{OsenovaSimov2004,
    author = {Petya Osenova and Kiril Simov},
    title = {BTB-TR05: BulTreeBank Stylebook № 05},
    year = {2004},
    url = {http://www.bultreebank.org/TechRep/BTB-TR05.pdf}
}

[2] Kiril Simov and Petya Osenova. 2003. Practical Annotation Scheme for an HPSG Treebank of Bulgarian. In Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003.

@inproceedings{simov-osenova-2003-practical,
    title = "Practical Annotation Scheme for an {HPSG} Treebank of {B}ulgarian",
    author = "Simov, Kiril  and Osenova, Petya",
    booktitle = "Proceedings of 4th International Workshop on Linguistically Interpreted Corpora ({LINC}-03) at {EACL} 2003",
    year = "2003",
    url = "https://aclanthology.org/W03-2403",
}

[3] Kiril Simov, Gergana Popova, Petya Osenova. HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In: “A Rainbow of Corpora: Corpus Linguistics and the Languages of the World”, edited by Andrew Wilson, Paul Rayson, and Tony McEnery; Lincom-Europa, Munich 2002, pp. 135-142.

@incollection{SimovOsPo2002,
    author = {Kiril Simov and Gergana Popova and Petya Osenova},
    title = {HPSG-based syntactic treebank of Bulgarian (BulTreeBank)},
    booktitle = {A Rainbow of Corpora: Corpus Linguistics and the Languages of the World},
    editor = {Andrew Wilson and Paul Rayson and Tony McEnery},
    publisher = {Lincom-Europa},
    pages = {135--142},
    year = {2002},
}

[4] Kiril Simov, Petya Osenova and Milena Slavcheva. BTB-TR03: BulTreeBank Morphosyntactic Tagset. BulTreeBank Project Technical Report № 03. 2004.

@techreport{SimovOseSlav2004,
    author = {Kiril Simov and Petya Osenova and Milena Slavcheva},
    title = {BTB-TR03: BulTreeBank Morphosyntactic Tagset. BulTreeBank Project Technical Report № 03},
    year = {2004},
    url = {http://www.bultreebank.org/TechRep/BTB-TR03.pdf}
}

[5] Kiril Simov, Petya Osenova, Alexander Simov, Milen Kouylekov. Design and Implementation of the Bulgarian HPSG-based Treebank. In Erhard Hinrichs and Kiril Simov, editors, Journal of Research on Language and Computation, Special Issue, Kluwer Academic Publishers, 2005, pp. 495-522.

@article{SimOsSimKo2005,
    author = {Kiril Simov and Petya Osenova and Alexander Simov and Milen Kouylekov},
    title = {Design and Implementation of the Bulgarian HPSG-based Treebank},
    journal = {Journal of Research on Language and Computation. Special Issue},
    year = {2005},
    pages = {495--522},
    publisher = {Kluwer Academic Publishers},
}

License

Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0).