Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    How to replace newlines with commas, tabs etc (merge lines)

    16th November 2010

    Imagine you need to pull out the few lines that lack an identifier mapping from a group of files. I have a bunch of files with content similar to this one:

    ENSRNOG00000018677 1368832_at 25233
    ENSRNOG00000002079 1369102_at 25272
    ENSRNOG00000043451 25353
    ENSRNOG00000001527 1388013_at 25408
    ENSRNOG00000007390 1389538_at 25493

    In the example above I need '25353', which has no corresponding affy_probeset_id in the 2nd column.

    It is clear how to do that:

    sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}'

    This outputs a column of required IDs (EntrezGene in this example):

    116720
    679845
    309295
    364867
    298220
    298221
    25353
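To try the filter without the real files, here is a minimal sketch that runs it on the five sample lines shown earlier. The scratch file and the variable names are assumptions for illustration only; the pipeline itself is the one from the post.

```shell
# Hypothetical scratch file holding the five sample lines from the post
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
ENSRNOG00000018677 1368832_at 25233
ENSRNOG00000002079 1369102_at 25272
ENSRNOG00000043451 25353
ENSRNOG00000001527 1388013_at 25408
ENSRNOG00000007390 1389538_at 25493
EOF

# Same filter as above: drop lines that contain a probeset ID,
# then print the 2nd field of what is left
missing_ids=$(sort -u "$tmpfile" | grep -v '_at' | awk '{print $2}')
echo "$missing_ids"

rm -f "$tmpfile"
```

Only the line without a probeset ID survives the grep, so its 2nd field is the EntrezGene ID.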

    However, I need these IDs as a comma-separated list, not as a newline-separated one.

    There are several ways to achieve the desired result (only the last pipe commands differ):

    sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | gawk '$1=$1' ORS=', '
    sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | tr '\n' ','
    sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':a;N;$!ba;s/\n/, /g'
    sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':q;N;s/\n/, /g;t q'
    sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | paste -s -d ","
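One more variant, as a sketch only and not taken from the sources: the gawk and tr commands above also emit the separator after the last item, so this one prints ", " before every line except the first. It assumes plain awk; `join_lines` is a hypothetical helper name.

```shell
# join_lines is a hypothetical helper name, not from the original post
join_lines() {
    # Print ", " before every line except the first, then one final newline
    awk 'NR > 1 { printf ", " } { printf "%s", $0 } END { print "" }'
}

# Three of the IDs from the output above
joined=$(printf '116720\n679845\n25353\n' | join_lines)
echo "$joined"
```

It handles one line at a time, so it does not need to hold the whole input in memory.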

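    These solutions differ in efficiency and (slightly) in output. The sed variants read all the input into the pattern space before replacing newlines, so they might not be the best choice for large files. tr might be the most efficient, but I haven’t tested that; it also turns the final newline into a comma, so its output ends with a trailing comma (the gawk variant likewise ends with a trailing “, ”). paste cycles through its delimiter list one character at a time, so you cannot really get comma-space “, ” separation with it.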
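To make the output differences concrete, here is a small sketch on three of the IDs above. The variable names are only for illustration. Chaining paste with sed to get “, ”, and trimming the trailing comma after tr, are my own workarounds rather than something from the sources.

```shell
# Three of the IDs from the output above
ids=$(printf '116720\n679845\n25353\n')

# tr turns every newline into a comma, including the last one
with_tr=$(printf '%s\n' "$ids" | tr '\n' ',')

# paste -s joins the lines and puts the delimiter only between them
# (the trailing "-" names stdin explicitly, which some paste versions require)
with_paste=$(printf '%s\n' "$ids" | paste -s -d ',' -)

# Comma-space list: join with paste, then widen each comma with sed
with_space=$(printf '%s\n' "$ids" | paste -s -d ',' - | sed 's/,/, /g')

# Remove the trailing comma that tr leaves behind
tr_trimmed=$(printf '%s\n' "$ids" | tr '\n' ',' | sed 's/,$//')

printf '%s\n' "$with_tr" "$with_paste" "$with_space" "$tr_trimmed"
```

The sed step after paste is only safe when the items themselves contain no commas, which holds for numeric IDs like these.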

    Sources: linuxquestions 1 (explains used sed commands), linuxquestions 2, nixcraft.


    2 Responses to “How to replace newlines with commas, tabs etc (merge lines)”

    1. buggy Says:

      hello,
      thank you for the pointers, good read. this is my feedback to you and future readers:

      instead of
      awk '{print $2}'
      i prefer
      cut -f2 -d" "
      as in “cut out field number 2 with delimiter ‘space character’” (if you have tab-delimited data, you can leave this one out as the default is tab-delimited)
      it is faster and easier to type

      also, some stuff can be shortened for quick typing, such as
      paste -s -d ","
      becomes
      paste -sd,

      cheers, ben

    2. Bogdan Says:

      Thanks for the feedback, Ben. I actually also prefer ‘cut’ now, unless delimiters are unclear – awk seems to do a better job at automatically identifying what is a delimiter and what is not. Didn’t know you can merge paste’s arguments into one :)
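A small sketch of the delimiter difference mentioned in these two comments, using two made-up input lines (one single-spaced, one double-spaced); the variable names are assumptions for illustration.

```shell
# Two hypothetical input lines: fields separated by one space and by two spaces
single='ENSRNOG00000043451 25353'
double='ENSRNOG00000043451  25353'

# cut treats every single space as a field boundary
cut_single=$(printf '%s\n' "$single" | cut -d ' ' -f 2)
# With two spaces, field 2 is the empty string between them
cut_double=$(printf '%s\n' "$double" | cut -d ' ' -f 2)

# awk splits on runs of blanks by default, so the ID is still field 2
awk_double=$(printf '%s\n' "$double" | awk '{print $2}')

printf 'cut, one space:  [%s]\n' "$cut_single"
printf 'cut, two spaces: [%s]\n' "$cut_double"
printf 'awk, two spaces: [%s]\n' "$awk_double"
```

So cut is the shorter choice for strictly single-delimiter data, while awk is more forgiving when the spacing is irregular.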
