Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    Simple substring counting script in Python

    21st June 2006

    About a month ago I set out to use Python as my main shell-scripting language. By then I was already aware of several benefits of scripting in Python:

    • source-level cross-platform scripting: your script will run anywhere Python can be built, which in practice means anywhere there is a C compiler (needed to build Python itself)
    • high-level language: for example, you can iterate over all the lines of a text file with a single ‘for’ statement (see the script below)
    • simple/minimalist syntax: no curly braces around blocks of statements, no semicolons after each and every line of code, etc. At a glance, Python is much easier to read than, for example, Perl.
    • the power of C in an interpreted language
    • it is interpreted! This makes debugging easy: modify, execute, see the trouble, with no compile/link stages
    • and, despite being interpreted, it is fast!
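
    As a quick illustration of the ‘for’-statement point in the list above, here is a minimal sketch that counts the lines of a text file. The helper name count_lines and the temporary demo file are made up for illustration; neither is part of the script in this post.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in the text file at `path`."""
    total = 0
    with open(path, "r") as handle:
        # A file object is iterable: each pass of the loop yields one line.
        for _line in handle:
            total += 1
    return total


# Hypothetical demo data: a throwaway file with three short lines.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("ACGT\nTTGA\nCCAT\n")

demo_count = count_lines(demo_path)
print(demo_count)
os.remove(demo_path)
```

    The loop never loads the whole file into memory, so the same pattern should also work for large input files.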

    For a comparison with other programming languages (speed, memory use, program size), please see the “Computer Language Shootout Benchmarks”. Here I link only to the comparison of Python with Perl and the comparison of Python with PHP (which can also be used as a shell-scripting language, albeit after some tinkering with its settings).

    Below is an example of a two-minute Python script that counts the number of occurrences of a string in a file.

    """Read FILE and count the number of occurrences of SUBSTR."""
    version = 0.01

    import sys

    def main():
        from optparse import OptionParser
        opts = OptionParser(usage="%prog [options] FILE SUBSTR",
            version="%prog " + str(version),
            description="Read FILE and count the number of occurrences of SUBSTR.")
        opts.set_defaults(verbose=False, flush=False)
        opts.add_option("-v", "--verbose", action="store_true", dest="verbose", help="Print every line containing substr [default: %default]")
        opts.add_option("-f", "--flush", action="store_true", dest="flush", help="When verbose, flush every line [default: %default]")
        (options, args) = opts.parse_args()

        # Exactly two positional arguments are expected: FILE and SUBSTR
        if len(args) != 2:
            print("Two arguments required for correct processing")
            opts.print_help()
            sys.exit(2)

        infile = args[0]
        substr = args[1]
        lines_count = 0
        substr_count = 0
        lines_substr_count = 0
        # Without --flush, matching lines are collected here and printed at the end
        msg = ""

        f = open(infile, 'r')
        for line in f:
            lines_count += 1
            # str.count() returns the number of non-overlapping occurrences
            found = line.count(substr)
            substr_count += found
            if found > 0:
                lines_substr_count += 1
                if options.verbose and not options.flush:
                    msg += str(found) + ": " + line
                elif options.verbose and options.flush:
                    # Drop the line's own newline; print adds one
                    print((str(found) + ": " + line).replace("\n", ""))

        f.close()

        if options.verbose and not options.flush:
            print(msg)
        print("Lines read from file: " + str(lines_count))
        print("Lines with substring found: " + str(lines_substr_count))
        print("Total substrings detected: " + str(substr_count))

        return

    if __name__ == "__main__":
        main()
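
    One detail of the script above is worth spelling out: line.count(substr) relies on str.count, which counts non-overlapping occurrences only. If overlapping matches matter (as they sometimes do with sequence data), a sketch along the following lines could be used instead. count_overlapping is a hypothetical helper, not something the script defines.

```python
def count_overlapping(text, sub):
    """Count occurrences of `sub` in `text`, allowing matches to overlap."""
    if not sub:
        return 0  # an empty pattern is treated as "no matches" here
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found too.
        start = text.find(sub, start + 1)
    return count


print("aaaa".count("aa"))               # non-overlapping: 2
print(count_overlapping("aaaa", "aa"))  # overlapping: 3
```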

    4 Responses to “Simple substring counting script in Python”

    1. spiderlama Says:

      That’s not simple :P

      
      from __future__ import with_statement
      import sys
      if __name__ == '__main__':
          assert len(sys.argv) == 3, 'invalid arguments. usage: file str'
          filePath = sys.argv[1]
          substr = sys.argv[2]
          linesFound = 0
          substrFound = 0
          lineIndex = -1  # stays -1 if the file is empty
          with open(filePath) as f:
              for lineIndex, lineString in enumerate(f):
                  if lineString.find(substr) != -1:
                      linesFound += 1
                      substrFound += lineString.count(substr)
                      print(filePath + ':' + str(lineIndex) + '\t' +
                          lineString.rstrip('\r\n'))
          # enumerate() starts at 0, so the number of lines read is the last index + 1
          print('Lines read from file: ' + str(lineIndex + 1))
          print('Lines with substring found: ' + str(linesFound))
          print('Total substrings detected: ' + str(substrFound))
      
    2. Bogdan Says:

      Thanks for the contribution!

      * please note that the script listed is 3+ years old; Python has definitely changed since then
      * your argument handling is definitely less flexible (and less informative for the end user)
      * your main loop is indeed several lines shorter, but mostly thanks to replacing the explicit file.close() with the ‘with’ statement (imported from __future__), as well as to skipping the ‘--verbose’ option processing that dumps extra data to the terminal

      Overall, I find your submission definitely useful, but not actually as short and simple as you implied with “That’s not simple :P” :)

    3. pete Says:
      
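      def getSubStringPositionsList(text=None, subStr=None):
          # Start positions of the non-overlapping occurrences of subStr in text
          positions = []
          if not text or not subStr:
              # Nothing to search; this also avoids an endless loop on an empty subStr
              return positions
          pos = text.find(subStr)
          while pos != -1:
              positions.append(pos)
              # Resume the search right after the current match
              pos = text.find(subStr, pos + len(subStr))
          return positions
      def getSubStringCount(text=None, subStr=None):
          return len(getSubStringPositionsList(text, subStr))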
      
    4. Bogdan Says:

      Nice, thanks.
