Remove duplicates from the Second Hand Song Dataset

December 1st 2014

The Second Hand Song dataset (SHSD) is a subset of the Million Song Dataset (MSD) organized in cliques, each containing several track IDs that correspond to different versions of the same song. This dataset is particularly well suited to cover song retrieval because it contains 5,854 cliques for a total of 18,196 cover songs. To our knowledge, it is the largest dataset available for cover song retrieval.

The SHSD is split into two files: a training set containing 4,128 cliques and a test set containing the remaining 1,726 cliques. For our work, we merged these two files into a single file containing all 5,854 cliques.

The MSD comes with a list of known duplicates, also organized in cliques. That list of duplicates is available here. We noticed that some of the SHSD cliques contain duplicate songs: within one clique, two different track identifiers refer to exactly the same recording. As a result, when a cover song retrieval algorithm is evaluated on the SHSD, it will most likely report a true positive when comparing two duplicates, since their content is supposed to be exactly the same. Such comparisons should not be counted because they produce optimistic results: comparing a query with itself will obviously return a similarity close to 100%. Therefore, we wrote a simple Python script that removes duplicates from the cliques, leaving only one version of each duplicated song in the dataset.
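
The pruning step described above can be sketched as follows. This is a minimal illustration rather than the actual script: it assumes the SHSD cliques have already been loaded into a dictionary mapping a clique ID to a list of track IDs, and that the MSD duplicates are available as a list of disjoint sets of track IDs. The names `build_representatives` and `remove_duplicates`, and the track IDs in the example, are hypothetical.

```python
def build_representatives(duplicate_groups):
    """Map every track ID in a duplicate group to one representative ID.

    Assumption: the duplicate groups are disjoint sets of track IDs.
    """
    representative = {}
    for group in duplicate_groups:
        members = sorted(group)
        for track_id in members:
            representative[track_id] = members[0]
    return representative


def remove_duplicates(cliques, duplicate_groups):
    """Keep only the first track of each duplicate group inside each clique.

    Assumption: `cliques` maps a clique ID to a list of track IDs.
    """
    representative = build_representatives(duplicate_groups)
    pruned = {}
    for clique_id, tracks in cliques.items():
        seen = set()
        kept = []
        for track_id in tracks:
            # Tracks without known duplicates represent themselves.
            key = representative.get(track_id, track_id)
            if key not in seen:
                seen.add(key)
                kept.append(track_id)
        pruned[clique_id] = kept
    return pruned


# Toy example with made-up track IDs.
cliques = {"c1": ["TRA", "TRB", "TRC"], "c2": ["TRD", "TRE"]}
duplicate_groups = [{"TRA", "TRC"}, {"TRD", "TRE"}]
pruned = remove_duplicates(cliques, duplicate_groups)
```

In this toy example, `TRC` is dropped from clique `c1` because it duplicates `TRA`, and clique `c2` is left with a single track.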

After the script runs, some cliques are left with only one track. A clique with a single track no longer contains any cover pair, so we remove these cliques from the dataset, which reduces the total number of cliques. The original SHSD contains 5,854 cliques, while the pruned dataset contains 5,828 cliques, so we lose 26 cliques in the process.
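
Removing the single-track cliques could look like the sketch below. As before, this is only an illustration under the assumption that cliques are held in a dictionary mapping a clique ID to a list of track IDs; the function name `drop_singleton_cliques` and the example IDs are hypothetical.

```python
def drop_singleton_cliques(cliques):
    """Return the cliques that still hold at least two tracks,
    together with the number of cliques that were removed."""
    kept = {
        clique_id: tracks
        for clique_id, tracks in cliques.items()
        if len(tracks) >= 2
    }
    removed = len(cliques) - len(kept)
    return kept, removed


# Toy example with made-up IDs: clique "c2" has only one track left.
pruned = {"c1": ["TRA", "TRB"], "c2": ["TRD"]}
kept, removed = drop_singleton_cliques(pruned)
```

Applied to the real data, the count of removed cliques should match the difference reported above: 5,854 - 5,828 = 26.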

The Python code used to prune the SHSD is available below. We also provide our single merged SHSD file and the new pruned SHSD.