Research on the Detection of Text Similarity Based on Hadoop
Text similarity calculation is central to detecting duplicated content in science and technology project applications, academic papers, and degree theses. For Chinese text similarity detection, this paper proposes a detection method based on Hadoop. The text to be detected and the sample text are first segmented into words, and the segmentation results are used to build a word segmentation matrix; the detection result is obtained by scanning and analyzing this matrix. MapReduce is used to parallelize the algorithm and improve its execution efficiency. Finally, an example demonstrates the effectiveness of the proposed algorithm.
similarity, matrix model of word segmentation, text similarity detection, MapReduce, Hadoop
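The abstract does not spell out how the word segmentation matrix is built or scanned, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. It assumes both texts are already segmented into word lists, that cell (i, j) of the matrix is 1 when word i of the detected text equals word j of the sample text, and that scanning means looking for diagonal runs of matches (consecutive shared words). The function names, the `min_run` threshold, and the coverage-ratio score are all hypothetical. The last function is a single-process stand-in for the map and reduce steps and does not use the Hadoop API.

```python
from typing import Dict, List, Optional, Tuple


def build_match_matrix(detected: List[str], sample: List[str]) -> List[List[int]]:
    """Assumed matrix model: cell (i, j) is 1 if detected[i] == sample[j], else 0."""
    return [[1 if d == s else 0 for s in sample] for d in detected]


def scan_diagonals(matrix: List[List[int]], min_run: int = 2) -> List[Tuple[int, int, int]]:
    """Return (row_start, col_start, length) for each diagonal run of 1s
    whose length is at least min_run."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    runs = []
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j] != 1:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and matrix[i - 1][j - 1] == 1:
                continue
            length = 0
            while (i + length < rows and j + length < cols
                   and matrix[i + length][j + length] == 1):
                length += 1
            if length >= min_run:
                runs.append((i, j, length))
    return runs


def similarity(detected: List[str], sample: List[str], min_run: int = 2) -> float:
    """Hypothetical score: fraction of detected words covered by a matching run."""
    if not detected:
        return 0.0
    matrix = build_match_matrix(detected, sample)
    covered = set()
    for row_start, _, length in scan_diagonals(matrix, min_run):
        covered.update(range(row_start, row_start + length))
    return len(covered) / len(detected)


def detect_against_corpus(
    detected: List[str], samples: Dict[str, List[str]], min_run: int = 2
) -> Tuple[Optional[str], float]:
    """Local stand-in for the parallel job: the 'map' step scores each sample
    independently, the 'reduce' step keeps the highest-scoring sample."""
    mapped = ((name, similarity(detected, words, min_run))
              for name, words in samples.items())
    return max(mapped, key=lambda pair: pair[1], default=(None, 0.0))


if __name__ == "__main__":
    detected_words = ["文本", "相似", "度", "检测", "方法"]
    sample_words = ["基于", "文本", "相似", "度", "的", "检测"]
    print(similarity(detected_words, sample_words))  # prints 0.6
```

Because each sample text is scored independently of the others, the per-sample scoring is the natural unit to distribute across mappers, which is the property the MapReduce parallelization relies on in this sketch.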