Corpus-Based Analysis for Multi-Token Units in Persian

Sharifi Atashgah, Masoud; Bijankhan, Mahmoud

Volume 1, Issue 3 (9-2009) itrc 2009, 1(3): 15-26

Sharifi Atashgah M, Bijankhan M. Corpus-Based Analysis for Multi-Token Units in Persian. itrc 2009; 1(3): 15-26
URL: http://ijict.itrc.ac.ir/article-1-287-en.html

Masoud Sharifi Atashgah¹, Mahmoud Bijankhan¹

1- Department of Literature and Human Science University, Tehran University, Tehran, Iran

Abstract:

Because Persian script joins letters and varies orthographically, the morphological and syntactic annotation of multi-token units raises a number of issues. Starting from an analysis of Perso-Arabic script and its problems, this paper describes the collocation types of tokens: compositional, non-compositional, and the new semi-compositional constructions. To illustrate these constructions, static and dynamic multi-token units are then presented for the generative and non-generative structures of the main categories: verbs, infinitives, prepositions, conjunctions, adverbs, adjectives, and nouns. Defining multi-token unit templates for these categories is one of the main results of this research. The findings can serve as input to the segmentation module of the Persian Treebank generator system. The research can also be used in the design and implementation of morphological analyzers and syntactic parsers.

Keywords: Persian script, orthographic variation, morphological and syntactic annotations, Persian Treebank generator system, syntactical parsers, morphological analyzers

Type of Study: Research | Subject: Information Technology

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
