Title: Labeling Massive Data from Noisy, Incomplete and Crowdsourced Annotations


Department of Systems Engineering and Engineering Management
                    The Chinese University of Hong Kong


Speaker: Prof. Xiao Fu

                School of Electrical Engineering and Computer Science

                Oregon State University

Abstract: Labeled data has been in high demand in the era of deep learning. Crowdsourcing techniques aim to produce accurate labels by effectively integrating noisy, non-expert annotations from multiple annotators. The (arguably) most notable method from the statistical machine learning community is an expectation maximization (EM) algorithm proposed by Dawid and Skene in 1979. However, theoretical understanding of the Dawid-Skene approach is still limited. Recently, elegant tensor decomposition-based methods were proposed to establish identifiability of the Dawid-Skene model. One challenge is that tensor methods may suffer from high sample complexity, since they involve third-order statistics of the annotator responses, which are hard to estimate accurately with limited annotations. In addition, tensor decomposition-based methods are often associated with challenging computational problems.
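
For readers unfamiliar with the Dawid-Skene model, the following is a minimal sketch of an EM procedure of that kind, not the speaker's code. It assumes that every annotator labels every item, that responses are conditionally independent given the true class, and that the posteriors are initialized by a soft majority vote. The function name dawid_skene_em, the simulated annotator accuracies, and the smoothing constant are all illustrative assumptions.

```python
import random

def dawid_skene_em(responses, num_classes, num_iters=30):
    """EM sketch for the Dawid-Skene model; assumes complete annotations."""
    N, M, K = len(responses), len(responses[0]), num_classes
    # Initialization: soft majority vote (fraction of annotators per class).
    post = [[sum(1.0 for r in resp if r == k) / M for k in range(K)]
            for resp in responses]
    for _ in range(num_iters):
        # M-step: re-estimate the class prior and the confusion matrices.
        prior = [sum(p[k] for p in post) / N for k in range(K)]
        # conf[m][k][r] approximates P(annotator m answers r | true class k).
        # The 1e-6 start value is a small smoothing term (an assumption).
        conf = [[[1e-6] * K for _ in range(K)] for _ in range(M)]
        for p, resp in zip(post, responses):
            for m, r in enumerate(resp):
                for k in range(K):
                    conf[m][k][r] += p[k]
        for m in range(M):
            for k in range(K):
                total = sum(conf[m][k])
                conf[m][k] = [v / total for v in conf[m][k]]
        # E-step: posterior over the true class of each item.
        new_post = []
        for resp in responses:
            w = list(prior)
            for m, r in enumerate(resp):
                for k in range(K):
                    w[k] *= conf[m][k][r]
            z = sum(w)
            new_post.append([v / z for v in w])
        post = new_post
    labels = [max(range(K), key=lambda k: p[k]) for p in post]
    return labels, prior, conf

# Simulated data: five hypothetical annotators of varying reliability.
rng = random.Random(0)
K = 3
accuracies = [0.9, 0.8, 0.7, 0.6, 0.6]
truth, responses = [], []
for _ in range(1000):
    y = rng.randrange(K)
    truth.append(y)
    wrong = [c for c in range(K) if c != y]
    # Each annotator is correct with probability acc, else picks a wrong class.
    responses.append([y if rng.random() < acc else rng.choice(wrong)
                      for acc in accuracies])

labels, prior, conf = dawid_skene_em(responses, K)
accuracy = sum(a == b for a, b in zip(labels, truth)) / len(truth)
print("fraction of items labeled correctly:", accuracy)
```

The sketch returns hard labels together with the estimated class prior and per-annotator confusion matrices. Handling incomplete annotations would require skipping missing responses in both steps.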

In this talk, I will introduce a simple algebraic algorithm that can efficiently solve large-scale crowdsourcing problems under the Dawid-Skene model, with provable guarantees. Our approach uses second-order statistics of the annotator responses, and thus naturally enjoys much lower sample complexity than the tensor methods. I will also introduce a coupled matrix decomposition-based algorithm that enhances the performance of the algebraic algorithm in more challenging scenarios. Experiments show that the proposed algorithms outperform state-of-the-art algorithms in a variety of scenarios. Extensions beyond the Dawid-Skene model will also be briefly touched upon.
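
To make "second-order statistics" concrete, here is a small sketch, not the algorithm presented in the talk. Under the Dawid-Skene assumption of conditional independence, the joint distribution of two annotators' answers factors as P(a, b) = sum over k of prior[k] * P(a | k) * P(b | k). The code builds this matrix from hypothetical confusion matrices and compares it with the empirical co-occurrence frequencies computed only from items that both annotators labeled. All numeric values and function names are illustrative assumptions.

```python
import random

def model_cooccurrence(conf_m, conf_l, prior):
    """P(answer a from one annotator, answer b from another) under Dawid-Skene."""
    K = len(prior)
    return [[sum(prior[k] * conf_m[k][a] * conf_l[k][b] for k in range(K))
             for b in range(K)] for a in range(K)]

def empirical_cooccurrence(resp_m, resp_l, K):
    """Pairwise answer frequencies over items labeled by both annotators."""
    counts = [[0.0] * K for _ in range(K)]
    n = 0
    for a, b in zip(resp_m, resp_l):
        if a is None or b is None:
            continue  # skip items that either annotator did not label
        counts[a][b] += 1.0
        n += 1
    return [[v / n for v in row] for row in counts]

rng = random.Random(1)
K = 3
prior = [0.5, 0.3, 0.2]
# conf[k][r] = P(answer r | true class k); hypothetical values.
conf_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
conf_b = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]

resp_a, resp_b = [], []
for _ in range(40000):
    y = rng.choices(range(K), weights=prior)[0]
    # Each annotator labels an item with probability 0.5 (incomplete data).
    resp_a.append(rng.choices(range(K), weights=conf_a[y])[0]
                  if rng.random() < 0.5 else None)
    resp_b.append(rng.choices(range(K), weights=conf_b[y])[0]
                  if rng.random() < 0.5 else None)

R_model = model_cooccurrence(conf_a, conf_b, prior)
R_hat = empirical_cooccurrence(resp_a, resp_b, K)
max_gap = max(abs(R_model[a][b] - R_hat[a][b])
              for a in range(K) for b in range(K))
print("largest entrywise gap between model and estimate:", max_gap)
```

Each pairwise matrix has only K x K entries to estimate, whereas a third-order statistic over three annotators has K x K x K entries and needs items labeled by all three, which gives some intuition for the sample-complexity argument above.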

Friday, April 17, 2020 - 11:30