Description
Manual annotation of large text datasets is both time- and cost-intensive, leading to a growing need for semi-supervised learning methods. At the same time, inconsistencies among human labelers directly affect the quality of synthetic label generation, because semi-supervised models are sensitive to their initial labels. This study examines the impact of multi-labeler consistency on BERT-based semi-supervised learning models and proposes a holistic framework for statistically modeling labeler reliability, establishing a core training set, and optimizing the synthetic labeling process. The proposed approach calculates labeler consistency with methods such as Cohen’s Kappa and the Dawid-Skene model, builds a core dataset of reliable examples, trains the BERT model on this dataset, generates synthetic labels for unlabeled data, and routes low-confidence examples back to human labelers. The process is further enhanced with consistency adjustment and noise reduction techniques, and a labeling interface is developed for practical use. In conclusion, the study demonstrates that multi-labeler consistency plays a critical role in the stability and accuracy of semi-supervised BERT models and provides a scalable, reliable, and cost-effective automatic labeling infrastructure for large text datasets.
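
The pipeline described above can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: it computes Cohen's kappa between two hypothetical labelers with scikit-learn, keeps agreed-upon examples as a stand-in core set, and routes low-confidence BERT predictions back for human review. File names such as annotations.csv, the model path, and the confidence threshold are all illustrative assumptions.

```python
# Minimal sketch of the described workflow, assuming pandas, scikit-learn,
# and Hugging Face transformers. All file names, the model checkpoint path,
# and the confidence threshold are hypothetical, not the study's settings.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from transformers import pipeline

# Hypothetical annotation export: one row per example with columns
# text, labeler_1, labeler_2.
df = pd.read_csv("annotations.csv")

# 1) Labeler consistency: pairwise Cohen's kappa between the two labelers.
kappa = cohen_kappa_score(df["labeler_1"], df["labeler_2"])
print(f"Cohen's kappa: {kappa:.3f}")

# 2) Core training set: keep only examples the labelers agree on
#    (a simple stand-in for the statistical reliability modeling, e.g. Dawid-Skene).
core = df[df["labeler_1"] == df["labeler_2"]].rename(columns={"labeler_1": "label"})
core[["text", "label"]].to_csv("core_training_set.csv", index=False)

# 3) Fine-tune BERT on the core set (omitted for brevity), then
# 4) generate synthetic labels for unlabeled data and
# 5) route low-confidence predictions back to the labeling interface.
clf = pipeline("text-classification", model="path/to/finetuned-bert")  # hypothetical checkpoint
CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff

unlabeled_texts = pd.read_csv("unlabeled.csv")["text"]  # hypothetical unlabeled pool
pseudo_labeled, needs_review = [], []
for text in unlabeled_texts:
    pred = clf(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.97}
    if pred["score"] >= CONFIDENCE_THRESHOLD:
        pseudo_labeled.append((text, pred["label"]))  # accept synthetic label
    else:
        needs_review.append(text)  # send back to human labelers
```

In practice, the simple agreement filter above would be replaced by the reliability weights estimated from the consistency analysis, and the low-confidence pool would feed the labeling interface mentioned in the description.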