Damir Vandic obtained his master's degree in Economics & Informatics cum laude from Erasmus University Rotterdam in 2010. For his PhD research, he was awarded the NWO Mosaic grant. His research focuses on using Semantic Web techniques to improve product search and browsing on the Web. His research interests include machine learning, Semantic Web foundations and applications, knowledge systems, and Web information systems.
The detection of product duplicates is one of the challenges that Web shop aggregators currently face. In this paper, we focus on the problem of product duplicate detection on the Web. Our proposed method extends a state-of-the-art solution that uses the model words in product titles to find duplicate products. First, we employ this title-based algorithm to find matching product titles. If no matching title is found, our method continues by computing similarities between the two product descriptions. These similarities are based on the product attribute keys and on the product attribute values. Furthermore, instead of extracting model words only from the title, our method also extracts model words from the product attribute values. Based on experimental results on real-world data gathered from two existing Web shops, we show that the proposed method significantly outperforms, in terms of F1-measure, the existing state-of-the-art title model words method and the well-known TF-IDF method.
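The two-stage flow described above can be illustrated with a minimal sketch. It is not the method evaluated in this paper. It assumes that a model word is a token mixing letters and digits, uses Jaccard overlap as a stand-in for the similarity measures, and averages the key and value similarities with equal weight. The function names (`model_words`, `jaccard`, `is_duplicate`), the product dictionary layout, and both thresholds are hypothetical and chosen only for illustration.

```python
import re

# Assumption: a "model word" is a token containing at least one digit
# and at least one letter (e.g. "UN46ES6580", "1080p").
MODEL_WORD = re.compile(r"\b(?=\w*[0-9])(?=\w*[A-Za-z])\w+\b")


def model_words(text):
    """Return the set of lower-cased model words found in text."""
    return {match.lower() for match in MODEL_WORD.findall(text)}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(p, q, title_threshold=0.5, attr_threshold=0.5):
    """Two-stage duplicate check with hypothetical thresholds.

    Each product is a dict with a "title" string and an "attributes"
    dict mapping attribute keys to string values (an assumed layout).
    """
    # Stage 1: compare model words extracted from the product titles.
    title_sim = jaccard(model_words(p["title"]), model_words(q["title"]))
    if title_sim >= title_threshold:
        return True

    # Stage 2: no title match, so compare the product descriptions.
    # Key similarity: overlap of the lower-cased attribute keys.
    keys_p = {key.lower() for key in p["attributes"]}
    keys_q = {key.lower() for key in q["attributes"]}
    key_sim = jaccard(keys_p, keys_q)

    # Value similarity: overlap of model words found in attribute values.
    values_p = set()
    for value in p["attributes"].values():
        values_p |= model_words(value)
    values_q = set()
    for value in q["attributes"].values():
        values_q |= model_words(value)
    value_sim = jaccard(values_p, values_q)

    # Equal weighting of both similarities is an assumption of this sketch.
    return (key_sim + value_sim) / 2 >= attr_threshold
```

In this sketch, two listings whose titles share no model words can still be matched in the second stage when their attribute keys and the model words in their attribute values overlap sufficiently.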