How to replace newlines with commas, tabs etc (merge lines)

16th November 2010

Imagine you need to get a few lines from a group of files with missing identifier mappings. I have a bunch of files with content similar to this one:

ENSRNOG00000018677 1368832_at 25233
ENSRNOG00000002079 1369102_at 25272
ENSRNOG00000043451 25353
ENSRNOG00000001527 1388013_at 25408
ENSRNOG00000007390 1389538_at 25493

In the example above I need '25353', which has no corresponding affy_probeset_id in the 2nd column.

It is clear how to do that:

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}'

This outputs a column of required IDs (EntrezGene in this example):

116720
679845
309295
364867
298220
298221
25353
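To try the extraction step without the real data, here is a minimal sketch that writes the five sample lines shown earlier into a scratch file and runs the same pipeline on it. The scratch directory and the file name sample_affy_ensembl.txt are assumptions made up for the demo, and the sample is assumed to be space-separated as displayed. The last command is an alternative I have not used above: it selects rows by field count instead of by the '_at' text.

```shell
# Hypothetical scratch directory and file name, used only for this demo.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_affy_ensembl.txt" <<'EOF'
ENSRNOG00000018677 1368832_at 25233
ENSRNOG00000002079 1369102_at 25272
ENSRNOG00000043451 25353
ENSRNOG00000001527 1388013_at 25408
ENSRNOG00000007390 1389538_at 25493
EOF

# Same pipeline as above, pointed at the demo directory.
ids=$(sort -u "$tmpdir"/*_affy_ensembl.txt | grep -v '_at' | awk '{print $2}')
echo "$ids"      # prints 25353

# Alternative: keep only rows that have exactly two fields.
ids_nf=$(sort -u "$tmpdir"/*_affy_ensembl.txt | awk 'NF == 2 {print $2}')
echo "$ids_nf"   # prints 25353

# Clean up the scratch directory.
rm -r "$tmpdir"
```

The field-count variant only works because awk splits on runs of whitespace, so a row with a missing probeset ID really has two fields.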

However, I need these IDs as a comma-separated list, not as a newline-separated list.

There are several ways to achieve the desired result (only the last pipe commands differ):

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | gawk '$1=$1' ORS=', '

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | tr '\n' ','

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':a;N;$!ba;s/\n/, /g'

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | sed ':q;N;s/\n/, /g;t q'

sort -u *_affy_ensembl.txt | grep -v '_at' | awk '{print $2}' | paste -s -d ","
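The first sed one-liner is probably the least obvious of the five, so here is a sketch of the same script split into separate -e expressions, with a comment per command. The three input IDs are taken from the output listed above; the variable name joined is just for the demo.

```shell
# :a          define a label named "a"
# N           append the next input line to the pattern space
# $!ba        if this is not the last line, branch back to label "a"
# s/\n/, /g   with all lines collected, replace every newline with ", "
joined=$(printf '%s\n' 116720 679845 309295 |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/, /g')
echo "$joined"   # prints 116720, 679845, 309295
```

Because the loop keeps appending lines until the last one is reached, the whole input sits in sed's pattern space before the substitution runs.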

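These solutions differ in efficiency and (slightly) in output. The gawk and tr variants leave a trailing separator, because the final newline is converted as well. Both sed variants read the entire input into the pattern space before replacing newlines, so they may not be the best choice for large files. tr might be the most efficient, but I haven't tested that. paste cycles through the characters of its delimiter list one at a time, so it cannot produce a two-character separator such as comma-space ", " on its own.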
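The output differences are easy to see on a short input. The following is a rough sketch using three of the IDs from the output above; the variable names are made up for the demo, and the final sed step is just one possible workaround for the comma-space case (it assumes the IDs themselves never contain commas).

```shell
# Three sample IDs taken from the output listed earlier.
ids=$(printf '%s\n' 116720 679845 309295)

# tr converts every newline, including the final one, so a comma trails.
with_tr=$(printf '%s\n' "$ids" | tr '\n' ',')
echo "$with_tr"        # prints 116720,679845,309295,

# paste -s joins the lines and leaves no trailing delimiter.
with_paste=$(printf '%s\n' "$ids" | paste -sd, -)
echo "$with_paste"     # prints 116720,679845,309295

# A two-character delimiter list is cycled, not used as one separator.
with_cycle=$(printf '%s\n' "$ids" | paste -s -d ', ' -)
echo "$with_cycle"     # prints 116720,679845 309295

# Workaround: join with plain commas, then widen each comma with sed.
with_space=$(printf '%s\n' "$ids" | paste -sd, - | sed 's/,/, /g')
echo "$with_space"     # prints 116720, 679845, 309295
```

The explicit "-" operand tells paste to read standard input; GNU paste does that by default, but some other implementations require it.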

Sources: linuxquestions 1 (explains used sed commands), linuxquestions 2, nixcraft.

This entry was posted on Tuesday, November 16th, 2010 at 10:20 and is filed under *nix, Bioinformatics, how-to, Notepad, Software.

2 Responses to “How to replace newlines with commas, tabs etc (merge lines)”

buggy Says:
November 14th, 2012 at 22:52
hello,
thank you for the pointers, good read. this is my feedback to you and future readers:
instead of
awk '{print $2}'
i prefer
cut -f2 -d" "
as in “cut out field number 2 with delimiter ‘space character’” (if you have tab-delimited data, you can leave this one out as the default is tab-delimited)
it is faster and easier to type
also, some stuff can be shortened for quick typing, such as
paste -s -d ","
becomes
paste -sd,
cheers, ben

Bogdan Says:
November 15th, 2012 at 12:42
Thanks for the feedback, Ben. I actually also prefer ‘cut’ now, unless delimiters are unclear – awk seems to do a better job at automatically identifying what is a delimiter and what is not. Didn’t know you can merge paste’s arguments into one.


Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc
