Jul 29, 2010
mbox Deduplication using python

A simple Python script to deduplicate a mailbox (mbox format) by Message-ID.


#!/usr/bin/env python
# Created by Evaggelos Balaskas on Thu Jul 29 21:22:41 EEST 2010
# Remove duplicate mails from mbox using message-id
# Note: written for Python 2 (uses the print statement)

import sys
import mailbox

if len(sys.argv) == 2:
    # Message-IDs seen so far
    mid = []

    for message in mailbox.mbox(sys.argv[1]):
        s = message['message-id']
        # Print only the first message carrying each Message-ID
        if s not in mid:
            mid.append(s)
            print message
else:
    print "Usage should be: " + sys.argv[0] + " mbox > new.mbox"

You can also take a look at my other Python script: How to remove specific mails from an mbox by subject

  1. Dimitris Leventeas

    Friday, July 30, 2010 - 02:02:32

    Using a set (mid = set()) instead of a list would give better performance, because a membership test on a set does not have to scan every stored ID. The required changes are very few.

    I can’t test it right now because I need sleep, but I think something like:
    uniq_message_ids = {m['message-id'] for m in mailbox.mbox(sys.argv[1])}
    would work.
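
The set-based idea from the comment above could look roughly like this in Python 3. This is only a sketch, not the author's script. The names `unique_messages` and `dedup_mbox` are hypothetical. It writes the result to a second mbox file with `mailbox.mbox.add` instead of printing to stdout, and it assumes that messages without a Message-ID header should all be kept rather than collapsed into one. A set comprehension on its own collects only the IDs and loses the messages, so the loop stays and the set is used just for the membership test.

```python
import mailbox


def unique_messages(messages):
    """Yield each message whose Message-ID has not been seen before."""
    seen = set()
    for message in messages:
        msg_id = message['message-id']
        if msg_id is None:
            # Assumption: with no Message-ID there is nothing to compare,
            # so the message is kept as-is.
            yield message
            continue
        # str() guards against header objects that are not hashable.
        key = str(msg_id)
        if key not in seen:
            seen.add(key)
            yield message


def dedup_mbox(src_path, dst_path):
    """Copy src_path to dst_path, skipping repeated Message-IDs."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for message in unique_messages(src):
            dst.add(message)
    finally:
        # close() flushes pending writes to disk.
        dst.close()
        src.close()
```

With placeholder file names, a call such as `dedup_mbox("old.mbox", "new.mbox")` would write the deduplicated copy to `new.mbox` and leave `old.mbox` untouched.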