MATHEMATICAL MODEL OF AUTOMATIC STOP WORDS DETECTION OF TEXTS IN THE UZBEK LANGUAGE
Keywords:
TF-IDF, stop words, tuple, dictionary, unique words
Abstract
Filtering stop words is an important task when processing text queries for information retrieval in large data sets. Existing mathematical models of this problem are not suitable for all families of natural languages. In particular, they do not cover the family of languages to which Uzbek belongs. This work attempts to construct a new mathematical model of the problem for the Turkic languages, which include Uzbek. The model addresses so-called agglutinative languages, in which automatically recognizing unimportant words is much more difficult because stop words are “masked” in the text. The paper proposes a mathematical structure that corresponds to the type of language under study and allows filtering out words that are not essential for information retrieval. The model also allows texts to be compressed so that various methods for identifying stop words can be applied to them.
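As a rough illustration of the kind of filtering the abstract refers to, the sketch below scores each unique word by its mean TF-IDF over a small collection and treats the lowest-scoring words as stop-word candidates. It is not the model proposed in the paper: the function name `stop_word_candidates`, the threshold, the whitespace tokenization, and the toy sentences are all assumptions made for illustration. In particular, whitespace tokenization ignores the suffixation typical of agglutinative languages, which is the difficulty the abstract highlights.

```python
import math
from collections import Counter


def stop_word_candidates(documents, threshold):
    """Return the unique words whose mean TF-IDF over the collection
    does not exceed `threshold` (hypothetical helper, illustration only)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # it does not separate stems from suffixes).
    tokenized = [doc.lower().split() for doc in documents]
    tokenized = [tokens for tokens in tokenized if tokens]  # skip empty texts
    n_docs = len(tokenized)

    # Document frequency: in how many texts each unique word occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    candidates = set()
    for word, df in doc_freq.items():
        idf = math.log(n_docs / df)  # 0.0 when the word occurs in every text
        mean_tfidf = sum(
            (tokens.count(word) / len(tokens)) * idf for tokens in tokenized
        ) / n_docs
        if mean_tfidf <= threshold:
            candidates.add(word)
    return candidates


# Toy sentences, invented for this example only.
docs = [
    "men va sen maktabga bordik",
    "kitob va daftar stolda",
    "u va men kitob o'qidik",
]
print(stop_word_candidates(docs, threshold=0.0))
```

With a threshold of 0.0, only words that occur in every text are returned, because their inverse document frequency is zero. A real corpus would need a tuned threshold and some morphological analysis before scoring.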
