Doctoral thesis (2017)
This thesis is concerned with the problem of determining whether two songs are different versions of each other. This problem is known as cover song identification, a challenging task, as different versions of the same song can differ in pitch, tempo, voicing, instrumentation, structure, etc. Our approach differs from existing methods by considering as much information as possible to identify cover songs. More precisely, we consider audio features spanning multiple musical facets, such as tempo, duration, harmonic progression, musical structure, and the relative evolution of timbre, among others. To do so, we evaluate several state-of-the-art systems on a common database of 12,856 songs, a subset of the Second Hand Song dataset. In addition to evaluating existing systems, we introduce our own methods, based on global features and making use of supervised machine learning algorithms to build a similarity model. To evaluate and compare the performance of 10 cover song identification systems, we propose a new, intuitive evaluation space based on the notions of pruning and loss. This evaluation space allows us to represent the performance of the selected systems in a two-dimensional space. We further demonstrate that it is compatible with standard metrics such as the mean rank, the mean reciprocal rank, and the mean average precision. Using our evaluation space, we present a comparative analysis of the 10 systems. The results show that few systems are usable in a commercial setting: the most efficient one identifies a match at the first position for 39% of the analyzed queries, which corresponds to 4,965 songs. In addition, we evaluate the systems when they are pushed to their limits, by analyzing how they perform when the audio signal is strongly degraded. To improve the identification rate, we investigate ways of combining the 10 systems.
We evaluate rank aggregation methods, which aggregate ordered lists of similarity results to produce a new, better ordering of the database. We demonstrate that such methods produce improved results, especially for early pruning applications. In addition to evaluating rank aggregation techniques, we study combination through probabilistic rules. As the 10 selected systems do not all produce probabilities of similarity, we investigate calibration techniques to map scores to relevant posterior probability estimates. After calibration, we evaluate several probabilistic rules, such as the product, sum, and median rules. We further demonstrate that a subset of the 10 initial systems performs better than the full set, showing that some systems are not relevant to the final combination. Applying a probabilistic product rule to a subset of systems significantly outperforms any individual system on the considered database. In terms of direct identification (top-1), we achieve an improvement of 10% (5,460 tracks identified), and in terms of mean rank, mean reciprocal rank, and mean average precision, we improve performance by 40%, 9.5%, and 12.5% respectively, with respect to the previous state-of-the-art performance. We further implement our final combination in a practical application, named DISCover, which lets a user select a query and listen to the produced list of results. While a cover is not systematically identified, the produced list of songs is often musically similar to the query.
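The product-rule combination described above can be sketched in a few lines. This is a minimal illustration with hypothetical names and toy numbers, assuming each system has already been calibrated to output posterior probabilities; it is not the thesis implementation.

```python
import numpy as np

def product_rule(prob_matrix):
    """Combine per-system posterior probabilities with a product rule.

    prob_matrix: shape (n_systems, n_candidates); each row holds one
    system's calibrated probability that each candidate covers the query.
    Returns candidate indices ordered from most to least likely.
    """
    # Work in log space to avoid numerical underflow when multiplying.
    log_probs = np.log(np.clip(prob_matrix, 1e-12, 1.0))
    combined = log_probs.sum(axis=0)
    return np.argsort(-combined)  # best candidate first

# Toy example: 3 calibrated systems scoring 4 candidate tracks.
probs = np.array([
    [0.9, 0.2, 0.5, 0.1],
    [0.8, 0.3, 0.4, 0.2],
    [0.7, 0.1, 0.6, 0.3],
])
ranking = product_rule(probs)
```

Working in log space keeps the rule stable when many systems are combined, since a raw product of small probabilities underflows quickly.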
in Proceedings of the 17th International Society for Music Information Retrieval Conference (2016, August)
Abstract Cover song identification involves calculating pairwise similarities between a query audio track and a database of reference tracks. While most authors rely exclusively on chroma features, recent work tends to demonstrate that combining similarity estimators based on multiple audio features increases performance. We improve this approach by using a hierarchical rank aggregation method for combining estimators based on different features. More precisely, we first aggregate estimators based on global features such as the tempo, the duration, the loudness, the beats, and the average chroma vectors. Then, we aggregate the resulting composite estimator with four popular state-of-the-art methods based on chroma as well as timbre sequences. We further introduce a refinement step for the rank aggregation called “local Kemenization” and quantify its benefit for cover song identification. The performance of our method is evaluated on the Second Hand Song dataset. Our experiments show a significant improvement in performance, up to an increase of more than 200% in the number of queries identified in the top-1, compared to previous results.
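The rank aggregation step can be illustrated with the Borda count, one classic aggregation method from the information retrieval literature. The sketch below is illustrative only, assuming all estimators rank the same candidate set; it does not reproduce the paper's hierarchical scheme or the local Kemenization refinement.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate several ranked candidate lists with the Borda count.

    rankings: list of lists; each inner list orders the same candidate
    ids from most to least similar, according to one estimator.
    """
    scores = defaultdict(int)
    n = len(rankings[0])
    for ranking in rankings:
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - position  # higher score is better
    # Sort candidates by total Borda score, best first.
    return sorted(scores, key=scores.get, reverse=True)

# Three estimators ranking four tracks 'a'..'d'.
lists = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
merged = borda_aggregate(lists)
```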
in International Congress on Acoustics: ICA 2016 (2016)
Cover song identification systems deal with the problem of identifying different versions of an audio query in a reference database. Such systems involve the computation of pairwise similarity scores between a query and all the tracks of a database. The usual way of evaluating such systems is to use a set of audio queries, extract features from them, and compare them to other tracks in the database to report diverse statistics. Databases in such research are usually designed in a controlled environment, with relatively clean audio signals. However, in real-life conditions, audio signals can be seriously modified by acoustic degradations. For example, depending on the context, audio can be altered by room reverberation, by added hand-clapping noise in a live concert, etc. In this paper, we study how environmental audio degradations affect the performance of several state-of-the-art cover song recognition systems. In particular, we study how reverberation, ambient noise and distortion affect the performance of the systems. We further investigate the effect of recording or playing music through a smartphone for music recognition. To achieve this, we use an audio degradation toolbox to degrade the set of queries to be evaluated. We propose a comparison of the performance achieved with cover song identification systems based on several harmonic and timbre features under ideal and noisy conditions. We demonstrate that the performance depends strongly on the degradation method applied to the source, and quantify the performance using multiple statistics.
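One of the degradations studied, additive ambient noise, can be sketched as a white-noise-at-target-SNR routine. This is a generic illustration, not the degradation toolbox used in the paper, and the function name is hypothetical.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Degrade a signal with white Gaussian noise at a target SNR (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    # Scale the noise power so the signal-to-noise ratio matches snr_db.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# One second of a 220 Hz tone at 22.05 kHz, degraded to 10 dB SNR.
t = np.arange(22050) / 22050.0
clean = np.sin(2 * np.pi * 220.0 * t)
noisy = add_noise_at_snr(clean, snr_db=10.0)
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

The measured SNR matches the target up to the sampling error of the noise power estimate.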
in 16th International Society for Music Information Retrieval Conference (2015, October)
In this paper, we evaluate a set of methods for combining features for cover song identification. We first create multiple classifiers based on global tempo, duration, loudness, beats and chroma average features, training a random forest for each feature. Subsequently, we evaluate standard combination rules for merging these single classifiers into a composite classifier based on global features. We further obtain two higher-level classifiers based on chroma features: one based on comparing histograms of quantized chroma features, and a second one based on computing cross-correlations between sequences of chroma features, to account for temporal information. For combining the latter chroma-based classifiers with the composite classifier based on global features, we use standard rank aggregation methods adapted from the information retrieval literature. We evaluate performance with the Second Hand Song dataset, where we quantify performance using multiple statistics. We observe that each combination rule outperforms single methods in terms of the total number of identified queries. Experiments with rank aggregation methods show an increase of up to 23.5% in the number of identified queries, compared to single classifiers.
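The histogram-based chroma classifier can be sketched roughly as follows. The binning scheme and the similarity measure (histogram intersection) are illustrative assumptions, not the exact features or metric of the paper.

```python
import numpy as np

def chroma_histogram_similarity(chroma_a, chroma_b, n_bins=20):
    """Compare two chroma sequences via histograms of quantized values.

    chroma_a, chroma_b: arrays of shape (n_frames, 12), values in [0, 1].
    Returns a similarity in [0, 1] (histogram intersection).
    """
    # Quantize all chroma values into n_bins and normalize the counts.
    hist_a, _ = np.histogram(chroma_a, bins=n_bins, range=(0.0, 1.0))
    hist_b, _ = np.histogram(chroma_b, bins=n_bins, range=(0.0, 1.0))
    hist_a = hist_a / hist_a.sum()
    hist_b = hist_b / hist_b.sum()
    # Histogram intersection: overlap of the two distributions.
    return float(np.minimum(hist_a, hist_b).sum())

rng = np.random.default_rng(0)
x = rng.random((100, 12))  # a toy chroma sequence
sim_self = chroma_histogram_similarity(x, x)
```

Because the histogram discards frame ordering, this classifier is cheap but blind to temporal structure, which is why the paper pairs it with a cross-correlation-based classifier.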
Conference (2015, February 24)
Cover song retrieval is a MIR task that has been widely studied in recent years. The task, in its general sense, consists in retrieving covers for a given audio query. In this PhD, we focus on the more specific task of cover song *identification*. Given an unknown cover song query, we want to find relevant information about the original song, such as the title, the artist, etc. Applications include live music recognition, plagiarism detection, query by example, etc. Our approach is based on a pruning algorithm, whose role is to discard as quickly as possible as many tracks as possible from a search database, in order to keep a smaller subset that should contain the tracks relevant to the query. We introduce the concept of *rejectors* for the pruning algorithm, which are single or combined classifiers that use multiple audio features to select relevant tracks.
in International Conference on Systems, Signals and Image Processing (2014, May)
This paper proposes a survey of the performance of binary classifiers based on low-level audio features, for music similarity in large-scale databases. Various low-level descriptors are used individually and then combined using several fusion schemes in a content-based audio retrieval system. We show the performance of the classifiers in terms of pruning and loss, and we demonstrate that some combination schemes achieve better performance at a minimum computational cost.
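Pruning and loss, the two quantities the classifiers are measured on, can be computed for a single query as follows. The helper below is an illustrative sketch with hypothetical names: pruning is the fraction of the database discarded, and loss is the fraction of true matches discarded along with it.

```python
def pruning_and_loss(kept, relevant, database_size):
    """Compute the pruning and loss of a classifier for one query.

    kept: set of track ids retained after pruning.
    relevant: set of track ids that are true covers of the query.
    """
    pruning = 1.0 - len(kept) / database_size
    missed = relevant - kept  # true covers that were wrongly discarded
    loss = len(missed) / len(relevant) if relevant else 0.0
    return pruning, loss

# A classifier keeps 100 of 1000 tracks and misses 1 of 4 true covers.
kept = set(range(100))
relevant = {2, 50, 99, 500}
p, l = pruning_and_loss(kept, relevant, 1000)
```

A good rejector maximizes pruning while keeping loss near zero; averaging both quantities over a set of queries gives the two axes of the evaluation.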
in International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2013, May)
This paper focuses on cover song recognition over a large dataset, potentially containing millions of songs. At this time, the problem of cover song recognition is still challenging and only a few methods have been proposed for large-scale databases. We present an efficient method for quickly extracting, from a large database, a small subset in which a correspondence to an audio query should be found. We make use of fast rejectors based on independent audio features. Our method mixes independent rejectors together to build composite ones. We evaluate our system with the Million Song Dataset and present composite rejectors offering a good trade-off between the percentage of pruning and the percentage of loss.
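A composite rejector built from independent single-feature rejectors can be sketched as a simple cascade, where each stage discards tracks that cannot match the query. The features and thresholds below are toy assumptions for illustration, not the paper's configuration.

```python
def cascade_rejectors(database, query, rejectors):
    """Apply independent rejectors in sequence to prune a database.

    rejectors: list of functions; each takes (query, track) and returns
    True if the track may still match the query (i.e. is kept).
    """
    candidates = list(database)
    for keep in rejectors:
        # Each stage only sees what the previous stages let through,
        # so cheap rejectors placed first save the most work.
        candidates = [t for t in candidates if keep(query, t)]
    return candidates

# Toy tracks described by (tempo, duration); a query at 120 BPM, 200 s.
tracks = [(118, 198), (121, 202), (150, 60), (90, 400)]
query = (120, 200)
rejectors = [
    lambda q, t: abs(q[0] - t[0]) <= 10,  # tempo within 10 BPM
    lambda q, t: abs(q[1] - t[1]) <= 30,  # duration within 30 s
]
survivors = cascade_rejectors(tracks, query, rejectors)
```

Each added stage increases pruning but can also increase loss, which is the trade-off the paper quantifies.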
in Journées d'informatique musicale (2012, May)
In this paper, we consider the challenging problem of music recognition and present an effective machine learning based method using a feed-forward neural network for chord recognition. The method uses a well-known feature vector for automatic chord recognition, the Pitch Class Profile (PCP). Although the PCP vector only provides attributes corresponding to 12 semitone values, we show that it is adequate for chord recognition. Part of our work also relates to the design of a database of chords. Our database is primarily designed for chords typical of Western European music. In particular, we have built a large dataset of recorded guitar chords under different acquisition conditions (instruments, microphones, etc.), but also of samples obtained with other instruments. Our experiments establish a twofold result: (1) the PCP is well suited for describing chords in a machine learning context, and (2) the algorithm is also capable of recognizing chords played on other instruments, even instruments unseen during the training phase.
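The PCP can be sketched as a folding of spectral energy onto the 12 semitone classes. The implementation below is a minimal illustration (no windowing, tuning estimation, or harmonic weighting), not the paper's extractor; the reference frequency and function name are assumptions.

```python
import numpy as np

def pitch_class_profile(frame, sample_rate, f_ref=261.63):
    """Map an audio frame's spectrum onto the 12 pitch classes.

    f_ref: reference frequency (C4 here); every spectral bin is folded
    onto one of 12 semitone classes relative to this reference.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    pcp = np.zeros(12)
    for freq, energy in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        # Distance in semitones from the reference, wrapped to one octave.
        pitch_class = int(round(12 * np.log2(freq / f_ref))) % 12
        pcp[pitch_class] += energy
    return pcp / pcp.sum() if pcp.sum() > 0 else pcp

# A pure 440 Hz tone (A4) should concentrate energy in pitch class 9 (A).
sr = 22050
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440.0 * t)
profile = pitch_class_profile(frame, sr)
```

Because octave information is folded away, the 12-dimensional vector is invariant to the register a chord is played in, which is what makes it compact yet adequate for chord recognition.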