Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    Python performance: set vs list

    15th August 2011

    Sometimes there is a need to be sure that no identifier is processed twice – for example, when parsing a file into a database, with file potentially containing duplicate records. An obvious solution is to properly wrap the DB insertion code into try…except block, and process duplicate primary ID exceptions. Another, sometimes more desired solution is to maintain a set/list of processed IDs internally, and check against that list prior to attempting the insertion of anything. So is it a set or a list?

    There are already quite a few internet resources discussing “python set vs list”, but probably the simplest while elegant way to test that is below.

    First, test the speed of adding/appending to a set or a list (here, I’m mimicking the real-life application, thus the test case has an optional loop):

    1. $python -mtimeit -s 'myset = set()' 'for x in xrange(1000): myset.add(x)'
    2. 10000 loops, best of 3: 133 usec per loop
    3. $python -mtimeit -s 'tmp = list()' 'for x in xrange(1000): tmp.append(x)'
    4. 10000 loops, best of 3: 116 usec per loop

    As we can see, set and list are comparable in the speed of adding new items, with list being slightly (~12%) faster than set.

    Now, the speed of membership testing: ‘x in tmp’. For this test, I’ve deliberately chosen an imbalance of True (1%) and False (99%) results for the test – again, mimicking the real problem I have at hand:

    1. $python -mtimeit -s 'tmp = set()' -s 'for x in xrange(1000): tmp.add(x)' 'for x in xrange(100000): x in tmp'
    2. 100 loops, best of 3: 7.27 msec per loop
    3. $python -mtimeit -s 'tmp = list()' -s 'for x in xrange(1000): tmp.append(x)' 'for x in xrange(100000): x in tmp'
    4. 10 loops, best of 3: 2.12 sec per loop

    List is much slower for membership testing, while sets were designed to be fast for doing just that.

    StumbleUponDeliciousCiteULikePocketKindle ItEvernotePinterestShare

    One Response to “Python performance: set vs list”

    1. Adrian Jobs Says:

      Are you sure that if the identifier was processed twice it will directly detect that duplicate records?

    Leave a Reply

    XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>