Python performance: set vs list

15th August 2011

Sometimes there is a need to be sure that no identifier is processed twice – for example, when parsing a file into a database, with file potentially containing duplicate records. An obvious solution is to properly wrap the DB insertion code into try…except block, and process duplicate primary ID exceptions. Another, sometimes more desired solution is to maintain a set/list of processed IDs internally, and check against that list prior to attempting the insertion of anything. So is it a set or a list?

There are already quite a few internet resources discussing “python set vs list”, but probably the simplest while elegant way to test that is below.

First, test the speed of adding/appending to a set or a list (here, I’m mimicking the real-life application, thus the test case has an optional loop):

$python -mtimeit -s 'myset = set()' 'for x in xrange(1000): myset.add(x)'
10000 loops, best of 3: 133 usec per loop
$python -mtimeit -s 'tmp = list()' 'for x in xrange(1000): tmp.append(x)'
10000 loops, best of 3: 116 usec per loop

As we can see, set and list are comparable in the speed of adding new items, with list being slightly (~12%) faster than set.

Now, the speed of membership testing: ‘x in tmp’. For this test, I’ve deliberately chosen an imbalance of True (1%) and False (99%) results for the test – again, mimicking the real problem I have at hand:

$python -mtimeit -s 'tmp = set()' -s 'for x in xrange(1000): tmp.add(x)' 'for x in xrange(100000): x in tmp'
100 loops, best of 3: 7.27 msec per loop
$python -mtimeit -s 'tmp = list()' -s 'for x in xrange(1000): tmp.append(x)' 'for x in xrange(100000): x in tmp'
10 loops, best of 3: 2.12 sec per loop

List is much slower for membership testing, while sets were designed to be fast for doing just that.

This entry was posted on Monday, August 15th, 2011 at 11:29 and is filed under Notepad, Programming, Python. You can follow any responses to this entry through the RSS 2.0 feed. You can skip to the end and leave a response. Pinging is currently not allowed.

One Response to “Python performance: set vs list”

Adrian Jobs Says:
February 8th, 2012 at 11:31
Are you sure that if the identifier was processed twice it will directly detect that duplicate records?

« Phusion Passenger Apache users guide

Good advice: /bin/false is not security »

Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

Categories

Related entries

Subscribe

Archives

Recent comments

Meta