15th August 2011
Sometimes there is a need to be sure that no identifier is processed twice – for example, when parsing a file into a database, with file potentially containing duplicate records. An obvious solution is to properly wrap the DB insertion code into try…except block, and process duplicate primary ID exceptions. Another, sometimes more desired solution is to maintain a set/list of processed IDs internally, and check against that list prior to attempting the insertion of anything. So is it a set or a list?
There are already quite a few internet resources discussing “python set vs list”, but probably the simplest while elegant way to test that is below.
First, test the speed of adding/appending to a set or a list (here, I’m mimicking the real-life application, thus the test case has an optional loop):
- $python -mtimeit -s 'myset = set()' 'for x in xrange(1000): myset.add(x)'
- 10000 loops, best of 3: 133 usec per loop
- $python -mtimeit -s 'tmp = list()' 'for x in xrange(1000): tmp.append(x)'
- 10000 loops, best of 3: 116 usec per loop
As we can see, set and list are comparable in the speed of adding new items, with list being slightly (~12%) faster than set.
Now, the speed of membership testing: ‘x in tmp’. For this test, I’ve deliberately chosen an imbalance of True (1%) and False (99%) results for the test – again, mimicking the real problem I have at hand:
- $python -mtimeit -s 'tmp = set()' -s 'for x in xrange(1000): tmp.add(x)' 'for x in xrange(100000): x in tmp'
- 100 loops, best of 3: 7.27 msec per loop
- $python -mtimeit -s 'tmp = list()' -s 'for x in xrange(1000): tmp.append(x)' 'for x in xrange(100000): x in tmp'
- 10 loops, best of 3: 2.12 sec per loop
List is much slower for membership testing, while sets were designed to be fast for doing just that.