Simple substring counting script in Python

21st June 2006

About a month ago I decided to start using Python as my main shell-scripting language. By then I was already aware of several benefits of using Python for scripting:

* source-level cross-platform scripting: your script will run anywhere Python compiles; by extension, anywhere there is a C compiler (needed to build Python itself)
* high-level language: for example, you can iterate over all the lines in a text file with as little as one ‘for’ statement (see the actual example below)
* simple, minimalist syntax: no curly braces around blocks of statements, no semicolons after each and every line of code, etc. At a glance, Python looks much more understandable than, for example, Perl.
* the power of C in an interpreted language
* it is interpreted! This makes debugging easy: modify, execute, see the trouble, with no compile/link stages
* and, despite being interpreted, it is fast!
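
A minimal sketch of the ‘one for statement’ point from the list above. It is written for current Python 3 rather than the Python 2 used in the rest of this post, and the helper name count_lines is hypothetical:

```python
def count_lines(path):
    """Return the number of lines in the text file at `path`."""
    total = 0
    with open(path) as handle:
        # A single 'for' statement walks the file one line at a time
        for line in handle:
            total += 1
    return total
```

The file object is itself iterable, so no explicit readline() loop or end-of-file check is needed.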

For a comparison (in speed, memory use, and program size) with other programming languages, please see the “Computer Language Shootout Benchmarks”, in particular the comparison of Python with Perl and the comparison of Python with PHP (which can also be used as a shell-scripting language, albeit after some tinkering with its settings).

Below is an example of a two-minute Python script that counts the number of occurrences of a string in a file.

"""Read FILE and count number of occurrences of SUBSTR."""
version = 0.01
import sys

def main():
    from optparse import OptionParser
    opts = OptionParser(usage="%prog [options] FILE SUBSTR",
                        version="%prog " + str(version),
                        description="Read FILE and count number of occurrences of SUBSTR.")
    opts.set_defaults(verbose=False, flush=False)
    opts.add_option("-v", "--verbose", action="store_true", dest="verbose",
                    help="Print every line containing substr [default: %default]")
    opts.add_option("-f", "--flush", action="store_true", dest="flush",
                    help="When verbose, flush every line [default: %default]")
    (options, args) = opts.parse_args()
    if len(args) != 2:
        print "Two arguments required for correct processing"
        opts.print_help()
        sys.exit(2)
    infile = args[0]
    substr = args[1]
    lines_count = 0
    substr_count = 0
    lines_substr_count = 0
    # Without --flush, verbose output is collected here and printed at the end
    if options.verbose and not options.flush:
        msg = ""
    f = open(infile, 'r')
    for line in f:
        lines_count += 1
        found = line.count(substr)
        substr_count += found
        if found > 0:
            lines_substr_count += 1
            if options.verbose and not options.flush:
                msg += str(found) + ": " + line
            elif options.verbose and options.flush:
                # Strip the line's own newline, since print adds one
                print (str(found) + ": " + line).replace("\n", "")
    f.close()
    if options.verbose and not options.flush:
        print msg
    print "Lines read from file: ", str(lines_count)
    print "Lines with substring found: ", str(lines_substr_count)
    print "Total substrings detected: ", str(substr_count)
    return

if __name__ == "__main__": main()
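
The counting loop of the script above can also be sketched as a standalone function. This is an illustration written for Python 3, not part of the original script; the function name count_substring and the sample data are hypothetical.

```python
def count_substring(lines, substr):
    """Return (lines_read, lines_with_match, total_matches) for an iterable of lines."""
    lines_read = 0
    lines_with_match = 0
    total_matches = 0
    for line in lines:
        lines_read += 1
        found = line.count(substr)  # str.count counts non-overlapping occurrences
        total_matches += found
        if found > 0:
            lines_with_match += 1
    return lines_read, lines_with_match, total_matches


if __name__ == "__main__":
    sample = ["spam and spam\n", "eggs\n", "spam\n"]
    print(count_substring(sample, "spam"))  # (3, 2, 3)
```

Note that str.count only counts non-overlapping matches, so "aaaa".count("aa") gives 2, not 3. Passing an open file object instead of a list works the same way, since both are iterated line by line.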

This entry was posted on Wednesday, June 21st, 2006 at 20:41 and is filed under Programming, Python.

4 Responses to “Simple substring counting script in Python”

spiderlama Says:
October 12th, 2009 at 6:56

That’s not simple :)


from __future__ import with_statement
import sys
if __name__ == '__main__':
    assert len(sys.argv) == 3, 'invalid arguments. usage: file str'
    filePath = sys.argv[1]
    substr = sys.argv[2]
    linesFound = 0
    substrFound = 0
    with open(filePath) as f:
        for lineIndex, lineString in enumerate(f):
            if lineString.find(substr) != -1:
                linesFound += 1
                substrFound += lineString.count(substr)
                print filePath + ':' + str(lineIndex) + '\t' + \
                    lineString.rstrip('\r\n')
        print 'Lines read from file:', lineIndex + 1  # enumerate() is 0-based
        print 'Lines with substring found:', linesFound
        print 'Total substrings detected:', substrFound

Bogdan Says:
October 12th, 2009 at 11:38
Thanks for the contribution!
* please note that the script listed is 3+ years old; something has definitely changed in Python since then
* your options parser is definitely less flexible (and less verbose to the end-user)
* your main loop is indeed several lines shorter, but mostly thanks to replacing the explicit open() and file.close() with the ‘with’ statement (imported from __future__), as well as not using the ‘--verbose’ option processing to dump extra data to the terminal
Overall, I find your submission definitely useful, but not actually as short and simple as you implied with “That’s not simple :)”

pete Says:
January 7th, 2010 at 17:26


def getSubStringPositionsList(str=None, subStr=None):
    l=[]
    posCount = 0
    while str:
        pos = str.find(subStr)
        if pos != -1:
            l.append(pos + posCount)
            str = str[pos + len(subStr):]
            posCount += pos + len(subStr)
        else:
            break
    return l
def getSubStringCount(str=None, subStr=None):
    return len(getSubStringPositionsList(str, subStr))

Bogdan Says:
January 7th, 2010 at 18:25
Nice, thanks.

Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc
