TEXT CLASSIFICATION BASED ON SUPPORT VECTOR MACHINE

Lê Thị Minh Nguyện

Abstract


The growth of the Internet has increased the volume of information stored online every day. Finding the information we are interested in takes a lot of time, so techniques for organizing and processing text data are needed. These techniques are called text classification or text categorization. Many text classification methods exist; in this paper we study and apply the Support Vector Machine (SVM) method and compare its performance with that of the Naïve Bayes probabilistic method. In addition, before classification, we preprocess the training set by extracting keywords with dimensionality reduction techniques to reduce the time needed for classification.
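
The pipeline summarized in the abstract (keyword extraction to shrink the feature space, then an SVM compared against Naïve Bayes) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the English toy corpus, the stopword list, the document-frequency keyword selection (a stand-in for whatever reduction technique the paper uses), and the linear SVM trained by subgradient descent on the hinge loss (the paper's kernel and solver are not stated here). It does not reproduce the paper's data or results.

```python
import math
from collections import Counter

# Hypothetical toy corpus (English, not the paper's data): (text, label) pairs.
TRAIN = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal for the team", "sport"),
    ("the coach praised the team after the match", "sport"),
    ("the bank raised interest rates", "finance"),
    ("the stock market fell as rates rose", "finance"),
    ("investors sold bank shares on the market", "finance"),
]
STOPWORDS = {"the", "a", "as", "on", "after", "for"}  # assumed list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def select_keywords(docs, k):
    """Keep the k terms with the highest document frequency (ties alphabetical)."""
    df = Counter()
    for text, _ in docs:
        df.update(set(tokenize(text)))
    return sorted(df, key=lambda t: (-df[t], t))[:k]


def vectorize(text, vocab):
    """Bag-of-words count vector restricted to the selected keywords."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocab]


def train_nb(X, y):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + len(totals)
        model[c] = (math.log(len(rows) / len(X)),
                    [math.log((t + 1) / denom) for t in totals])
    return model


def predict_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(c):
        prior, log_probs = model[c]
        return prior + sum(xi * lp for xi, lp in zip(x, log_probs))
    return max(sorted(model), key=score)


def train_svm(X, y, positive, eta=0.1, lam=0.01, epochs=100):
    """Binary linear SVM: subgradient descent on L2-regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            target = 1 if label == positive else -1
            margin = target * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - eta * lam) for wi in w]  # shrink weights (L2 term)
            if margin < 1:  # hinge loss is active for this example
                w = [wi + eta * target * xi for wi, xi in zip(w, x)]
                b += eta * target
    return w, b


def predict_svm(model, x, positive, negative):
    """Classify by the sign of the decision function w.x + b."""
    w, b = model
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return positive if score >= 0 else negative


# Reduce the feature space to 5 keywords, then train both classifiers.
VOCAB = select_keywords(TRAIN, k=5)
X_train = [vectorize(text, VOCAB) for text, _ in TRAIN]
y_train = [label for _, label in TRAIN]
nb_model = train_nb(X_train, y_train)
svm_model = train_svm(X_train, y_train, positive="sport")

for text in ["the team lost the match", "the bank cut interest rates"]:
    x = vectorize(text, VOCAB)
    print(text, "| NB:", predict_nb(nb_model, x),
          "| SVM:", predict_svm(svm_model, x, "sport", "finance"))
```

Restricting the vocabulary before training is what shortens classification time: both classifiers only ever see vectors of length `k`. On real data, the two models would be compared on a held-out test set rather than on a couple of hand-written sentences.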


Keywords


Feature vector; Kernel; Naïve Bayes; Support Vector Machine; Text classification.






DOI: http://dx.doi.org/10.37569/DalatUniversity.9.2.536(2019)



Copyright (c) 2019 Lê Thị Minh Nguyện.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Editorial Office of DLU Journal of Science
Room.15, A25 Building, 01 Phu Dong Thien Vuong Street, Dalat, Lamdong
Email: tapchikhoahoc@dlu.edu.vn - Phone: (+84) 263 3 555 131
