Document Annotation by Semantic Roles
Keywords:
semantic roles, topic uniqueness, document grouping, regularizing criteria
Abstract
This study examines the annotation of documents with semantic roles. Input data for topic models is built under the standard bag-of-words hypothesis. The number and uniqueness of topics in such models remain open questions, and a key reason cited is the lack of unified criteria for evaluating clustering quality. Attempts have been made to demonstrate topic uniqueness experimentally, using a set of regularization criteria applied to the results of document grouping. The absence of such a proof is explained by the fact that the number of groups is initially set as a free parameter. Five types of annotation are currently applied to unify documents: metatextual, morphological, syntactic, accentual, and semantic. This paper uses semantic annotation: it identifies the actants in each sentence, that is, the noun phrases that denote participants in a situation, together with their semantic roles. For defining semantic roles in the Uzbek language, it is proposed that terms be unified with respect to specific subject areas. An example of annotating documents with semantic roles is given, and the similarity of the annotated documents is measured with the cosine metric.
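The similarity step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes each annotated document is reduced to a list of (semantic role, unified term) pairs and compared through the cosine of the pair-count vectors. The role labels, the sample terms, and the function names `role_vector` and `cosine_similarity` are hypothetical and chosen only for illustration; the paper's actual role inventory and vector construction may differ.

```python
from collections import Counter
from math import sqrt


def role_vector(annotations):
    """Count (role, term) pairs in one annotated document."""
    return Counter(annotations)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical annotations: each pair is (semantic role, unified term).
doc_a = [("agent", "farmer"), ("patient", "cotton"), ("instrument", "machine")]
doc_b = [("agent", "farmer"), ("patient", "wheat"), ("instrument", "machine")]
doc_c = [("agent", "teacher"), ("patient", "lesson")]

sim_ab = cosine_similarity(role_vector(doc_a), role_vector(doc_b))
sim_ac = cosine_similarity(role_vector(doc_a), role_vector(doc_c))

print(f"{sim_ab:.3f}")  # 0.667 (two of three pairs shared)
print(f"{sim_ac:.3f}")  # 0.000 (no shared pairs)
```

Keying the vector on the (role, term) pair rather than on the term alone means that the same word in different roles counts as different features, which is one plausible way to make the comparison depend on the annotation and not only on the bag of words.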
