Mp3cat - gets only the data part of the mp3
One's mp3 files have a tendency to get duplicated across different computers, hard disks and portable players. Some of the tags may also get edited on some copies but not others. Since the tags are stored in the file itself, simply hashing the whole file with md5, sha-1 or another hashing algorithm will not tell you which mp3s are identical music-wise. Instead you need to run the hash over only the data part.
I have not found any Python modules that do this. I have found one Python module that reports the byte offset where the data starts, but I guess that only works if the metadata is at the start of the file. The only command line tool I have found that can extract the data part is mp3cat:
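To make the idea concrete, here is a rough Python sketch of hashing only the data part. It is not how mp3cat works. It assumes the only metadata is an ID3v2 tag at the very start of the file and/or a 128-byte ID3v1 tag at the very end. Other tag formats (APE, Lyrics3) and junk between frames are not handled. The function names audio_bytes and audio_md5 are made up for this example.

```python
import hashlib


def audio_bytes(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag.

    Assumption: no other metadata formats are present in the file.
    """
    start, end = 0, len(data)

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 size bytes.
    # The size is "synchsafe": only the low 7 bits of each byte count.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # ID3v2.4 footer flag: 10 extra bytes
            start += 10
        start = min(start, end)  # guard against a corrupt size field

    # ID3v1 tag: the last 128 bytes, beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def audio_md5(path: str) -> str:
    """md5 hex digest of the file at path, ignoring ID3v1/ID3v2 tags."""
    with open(path, "rb") as f:
        return hashlib.md5(audio_bytes(f.read())).hexdigest()
```

With this, two copies of a file that differ only in their ID3 tags should give the same audio_md5 value, while a plain md5 of the whole files would differ.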
This is the mp3cat home page. Download the latest release tarball mp3cat-0.4.tar.gz or (better) check out the current version from my subversion repository: http://svn.tomclegg.net/repos/trunk/mp3cat
My initial evaluation is favorable, i.e. the md5 hashes of two files with different metadata but the same music data come out identical. I found mp3cat via this blog post:
The author wants the same thing I am searching for–the ability to generate a checksum of the audio stream and store it in the file header as a tag. Furthermore, he mentions his use of mp3cat! I pulled down a copy of mp3cat and compiled it on my archive box. Then the fun began
Tim's Mind Organized » Checksum mp3 audio frames (the data and not the headers)
At least one Java program has existed that does the data extraction and hashing in one fell swoop. It is mentioned in this discussion, but the download link does not seem to work.
Update 2010-07-25
I have now looked into the CPAN archive, and as usual there is a Perl module, Audio::Digest::MP3, that does exactly what you want :-)
Audio::Digest::MP3 - Get a message digest for the audio stream out of an MP3 file (skipping ID3 tags)
Audio::Digest::MP3 - search.cpan.org
Untested by me so far.