A CLUSTERING TECHNIQUE FOR THE VIETNAMESE WORD CATEGORIZATION

Authors

  • Nguyễn Minh Hiệp Faculty of Information Technology, Dalat University
  • Nguyễn Thị Minh Huyền Faculty of Informatics, VNU University of Science
  • Ngô Thế Quyền Faculty of Informatics, VNU University of Science
  • Trần Thị Phương Linh Faculty of Information Technology, Dalat University

DOI:

https://doi.org/10.37569/DalatUniversity.6.2.40(2016)

Keywords:

Clustering, Corpus, DBSCAN, POS, POS tagging, Tag set.

Abstract

In natural language processing, part-of-speech (POS) tagging plays an important role because its output serves as the input to many other tasks, such as syntactic and semantic analysis. One problem related to POS tagging is defining the POS tag set, which can be addressed with unsupervised machine learning methods. This paper presents an application of the DBSCAN clustering algorithm to the categorization of Vietnamese words from a large corpus. The features that characterize each word are derived from the contexts in which the word occurs in sentences. The corpus consists of sentences automatically extracted from the online Nhan Dan newspaper.
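As an illustration of the idea in the abstract, the following is a minimal sketch of density-based clustering of words by their sentence context. It is not the authors' implementation: the abstract does not specify the feature set, distance measure, or parameters, so the left/right neighbour counts, the cosine distance, the values of `eps` and `min_pts`, and the six-sentence toy corpus (assumed to be already word-segmented) are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

def context_vectors(sentences):
    """Map each word to counts of its left and right neighbours (assumed features)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vectors[tokens[i]][("L", tokens[i - 1])] += 1
            vectors[tokens[i]][("R", tokens[i + 1])] += 1
    return dict(vectors)

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def dbscan(items, vectors, eps, min_pts):
    """Plain DBSCAN over a list of words; returns {word: cluster_id}, -1 = noise."""
    def neighbours(p):
        # Brute-force region query; the point itself is included.
        return [q for q in items if cosine_distance(vectors[p], vectors[q]) <= eps]

    labels = {}
    cluster = -1
    for p in items:
        if p in labels:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1          # provisionally noise
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # former noise becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbours = neighbours(q)
            if len(q_neighbours) >= min_pts:
                queue.extend(q_neighbours)  # q is a core point: keep expanding
    return labels

# Hypothetical toy corpus, one pre-segmented sentence per string.
corpus = [
    "tôi ăn cơm", "tôi ăn phở", "tôi mua cơm",
    "tôi mua phở", "bạn ăn cơm", "bạn mua phở",
]
vectors = context_vectors(corpus)
labels = dbscan(list(vectors), vectors, eps=0.3, min_pts=2)

for word, cluster_id in labels.items():
    print(word, cluster_id)
```

On this toy corpus, words that share neighbours fall into the same cluster, so the pronouns, the verbs, and the nouns each form their own group. A real corpus would need word segmentation, richer context features, and tuned `eps` and `min_pts` values.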

References

Hong, C.N.: “Vấn đề phân định từ loại trong tiếng Việt” [The problem of delimiting parts of speech in Vietnamese]. Tạp chí Ngôn ngữ, no. 2(1) (2003).

Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Germany (1996).

Rosenberg, A., Hirschberg, J.: V-measure: A conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, June (2007).

Published

30-06-2016

Section

Natural Sciences and Technology

How to Cite

Hiệp, N. M., Huyền, N. T. M., Quyền, N. T., & Linh, T. T. P. (2016). A CLUSTERING TECHNIQUE FOR THE VIETNAMESE WORD CATEGORIZATION. Dalat University Journal of Science, 6(2). https://doi.org/10.37569/DalatUniversity.6.2.40(2016)