Autarchy of the Private Cave

Tiny bits of bioinformatics, [web-]programming etc

    MultiParanoid vs. QuickParanoid: pro et contra for each

    9th July 2013

    MultiParanoid

    Here we present a new proteome-scale analysis program called MultiParanoid that can automatically find orthology relationships between proteins in multiple proteomes. The software is an extension of the InParanoid program that identifies orthologs and inparalogs in pairwise proteome comparisons. MultiParanoid applies a clustering algorithm to merge multiple pairwise ortholog groups from InParanoid into multi-species ortholog groups.

    QuickParanoid

    QuickParanoid is a suite of programs for automatic ortholog clustering and analysis. It takes as input a collection of files produced by InParanoid and finds ortholog clusters among multiple species. For a given dataset, QuickParanoid first preprocesses each InParanoid output file and then computes ortholog clusters. It also provides a couple of programs qa1 and qa2 for analyzing the result of ortholog clustering.

    So… both use InParanoid… Are there any differences? Let me list those which I’ve found.

    MultiParanoid

    • requires all species names to be passed through the command line; I had 11, so that’s a downside (even though I got that list with 1 extra ls command)
    • is written in Perl
    • needs source code editing to specify the input directory with all of InParanoid’s sqltable.* files, and also the output file
    • seemed to run fairly fast on my 11 genomes – finished in under 4 minutes

    Overall, MultiParanoid left a somewhat “messy” impression… But it definitely did work.

    QuickParanoid

    • interactive: it just asks you for a directory containing all the sqltable.* files, for a configuration file, and for the executable prefix
    • the configuration file is just a list of all species names; this is similar to MultiParanoid and is also produced with a single ls command (ls -1 *.faa > config in my case), but it feels a little better than dropping those 11 filenames on the command line
    • written in C++
    • generates and compiles code! after collecting your input, two custom binaries are generated to actually run the analysis, which seems to have no practical utility for the end-user, but is definitely cool!
    • much faster than MultiParanoid – analysis itself (with the generated custom binary) took less than 5 seconds; generating that custom binary adds only a few more seconds
    • contains helpful qa1 and qa2 utilities; qa1 summarizes the final clusters, and qa2 compares two different results with each other (see examples below)
    • it had to be compiled first with make qa; also, #include <string.h> was missing in one of the source files…

    Overall, QuickParanoid left the impression of being “quick and cool”, with the minor drawback of having to add that missing string.h include.
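For scripted runs, the species list can also be produced without the shell. Below is a minimal Python sketch that does roughly what ls -1 *.faa > config does. It assumes the proteome files end in .faa, as they did in my case; the function name write_species_config is made up for this example and is not part of QuickParanoid.

```python
from pathlib import Path


def write_species_config(fasta_dir, config_path, pattern="*.faa"):
    """Write one proteome file name per line, roughly like `ls -1 *.faa > config`.

    Names are sorted by code point, which may differ from the
    locale-dependent order that ls uses.
    """
    names = sorted(p.name for p in Path(fasta_dir).glob(pattern) if p.is_file())
    Path(config_path).write_text("".join(name + "\n" for name in names))
    return names
```

The same list of names can then be joined with spaces if a tool expects the species on the command line instead.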

    With the help of QuickParanoid’s qa1, I’ve collected some stats on the clusters of orthologs in my 11 genomes.

    QuickParanoid clusters:

    Number of clusters consisting of 1 species : 0
    Number of clusters consisting of 2 species : 1757
    Number of clusters consisting of 3 species : 975
    Number of clusters consisting of 4 species : 622
    Number of clusters consisting of 5 species : 463
    Number of clusters consisting of 6 species : 406
    Number of clusters consisting of 7 species : 325
    Number of clusters consisting of 8 species : 355
    Number of clusters consisting of 9 species : 448
    Number of clusters consisting of 10 species : 607
    Number of clusters consisting of 11 species : 2449
    Total: 8407

    MultiParanoid clusters:

    Number of clusters consisting of 1 species : 0
    Number of clusters consisting of 2 species : 1872
    Number of clusters consisting of 3 species : 1023
    Number of clusters consisting of 4 species : 637
    Number of clusters consisting of 5 species : 479
    Number of clusters consisting of 6 species : 418
    Number of clusters consisting of 7 species : 338
    Number of clusters consisting of 8 species : 358
    Number of clusters consisting of 9 species : 454
    Number of clusters consisting of 10 species : 605
    Number of clusters consisting of 11 species : 2451
    Total: 8635
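To compare the two summaries above side by side, a short Python sketch can parse the "Number of clusters consisting of N species" lines and print the per-size difference. The numbers are copied from the two listings above; the function name parse_qa1_summary is made up for this example, and the regular expression assumes the line format shown here.

```python
import re

LINE_RE = re.compile(r"Number of clusters consisting of (\d+) species\s*:\s*(\d+)")


def parse_qa1_summary(text):
    """Return {species_in_cluster: number_of_clusters} from qa1 summary text."""
    return {int(n): int(c) for n, c in LINE_RE.findall(text)}


def make_summary(counts):
    """Rebuild qa1-style summary lines for cluster sizes 1, 2, 3, ..."""
    return "\n".join(
        f"Number of clusters consisting of {n} species : {c}"
        for n, c in enumerate(counts, start=1)
    )


# Counts copied from the two qa1 listings above (1 to 11 species).
QUICK = make_summary([0, 1757, 975, 622, 463, 406, 325, 355, 448, 607, 2449])
MULTI = make_summary([0, 1872, 1023, 637, 479, 418, 338, 358, 454, 605, 2451])

quick = parse_qa1_summary(QUICK)
multi = parse_qa1_summary(MULTI)

# The per-size counts add up to the totals reported by qa1.
assert sum(quick.values()) == 8407
assert sum(multi.values()) == 8635

for n in sorted(quick):
    print(f"{n:2d} species: quick={quick[n]:5d} multi={multi[n]:5d} "
          f"diff={multi[n] - quick[n]:+d}")
```

Most of the 228 extra MultiParanoid clusters fall into the small sizes (115 of them are 2-species clusters), while the 11-species counts differ by only 2.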

    So, qa1 is definitely useful. As the output format of QuickParanoid is almost the same as that of MultiParanoid, qa1 also works on MultiParanoid results: one just has to add a hash # at the beginning of the very first line of the MultiParanoid results file.
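Adding that hash by hand is easy enough, but it can also be scripted. Here is a minimal Python sketch that copies a results file and prefixes its first line with #. The function name comment_out_first_line and both paths are placeholders for this example; it writes a copy rather than editing the MultiParanoid output in place.

```python
def comment_out_first_line(src_path, dst_path):
    """Copy src_path to dst_path, prefixing the first line with '#'.

    If the first line already starts with '#', it is copied unchanged.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        first = src.readline()
        if first and not first.startswith("#"):
            dst.write("#" + first)
        else:
            dst.write(first)
        # Copy the remaining lines untouched.
        for line in src:
            dst.write(line)
```

The copy can then be passed to qa1 in place of the original file.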

    qa2 lets you compare, e.g., MultiParanoid and QuickParanoid results. Here’s the output from the default ‘names-only’ comparison mode:

    Checking only sequence names…
    Number of clusters in multiparanoid_result.txt : 8635
    Number of clusters in quickparanoid_result.txt : 8407
    Number of matched clusters: 8253
    Residue clusters in multiparanoid_result.txt:
    5945 4338 8374 7913 8334 8315 4564 7508 7492 6173 5922 8187 8104 4377 7003 8590 7808 5265 8064 5195 6297 8285 6849 6868 6317 8619 5983 4549 6659 4503 7500 4550 8510 7591 7776 8075 5969 6051 3857 5949 7795 6031 8005 8441 7737 7445 8507 4534 8289 6712 7294 6083 8377 8117 7344 8040 6858 8138 4559 7788 7818 7479 7100 7906 6287 8007 8552 7198 7489 7446 8522 8270 8327 6367 5278 8150 6832 8519 6332 6361 6674 8323 6156 6183 7187 8048 6328 8127 7260 7143 7794 7810 7376 8541 8397 8389 8516 5146 6311 8347 7573 7103 4391 6980 8330 8384 5918 4318 4759 4656 6061 7030 6694 8260 6180 4084 4321 7745 7875 7650 5112 7635 6976 4776 6249 7607 5208 5046 7566 4401 8487 7307 5571 4310 7322 6625 4398 5410 7401 6213 7392 4985 8369 8165 5704 8520 6244 6305 8072 8172 8143 8236 8419 5235 4746 5333 5930 4309 4349 3097 8393 5665 4265 4317 6502 7175 6386 5215 7117 4332 5182 7590 4788 5907 8396 7910 7221 7484 4807 5610 6613 4050 4410 4878 6522 4284 4245 1936 4274 5407 6963 7694 6528 8538 5800 6577 6767 5344 4069 5747 4105 7162 5047 4879 5036 6303 78 3446 5883 3891 4727 6912 4219 4724 3775 2404 7618 4171 6499 6127 3438 3844 4962 5708 4824 4657 99 160 5507 4992 6610 3683 5871 3908 4127 4725 4749 5200 5450 4138 3342 4283 4814 2210 5672 5551 6487 569 1545 2556 2950 6949 3933 4051 5084 5492 4389 1142 1303 1555 1784 2045 2490 3037 3894 4390 4668 4066 510 695 3982 1349 1922 2148 2221 4781 1065 1100 1125 1263 6489 2054 3147 951 1708 2472 3009 3458 3569 3875 4996 6470 145 1565 2196 2298 1470 2312 3868 4750 1996 3579 1029 515 716 1075 2119 4825 509 1855 1863 101 390 756 1426 4721 4912 5342 351 4734 1455 1738 1880 2367 5167 788 2750 338 3543 3745 4762 837 2662 4143 209 3175 4717 2071 2429 847 1220 1340 2046 4573 2065 2328 939 1995 97 845 1631 2076 3711 4880 1039 1214 534 1133 578 767 754 226 988 3521 1221 2689 1848 1917 3654 2906 61 435 738 1577 971 1099 926 2101 376 294 173 119
    Residue clusters in quickparanoid_result.txt:
    7697 5071 8137 8292 7403 6349 6516 7093 4532 5193 3596 6608 6214 5738 5438 3606 3492 4631 4037 5155 2725 5804 5871 3930 1185 3044 4776 5064 5321 760 2653 3419 4426 4530 5281 958 3602 1158 3008 256 5551 2742 3356 3143 3626 3548 1754 1411 1718 2104 3922 3867 3087 967 1932 2261 269 3881 3692 3206 3059 1115 500 3664 3985 3180 3615 3713 748 3311 1351 1396 1409 4113 3822 1917 1816 1109 3505 1110 1826 4079 4007 201 1308 4572 471 5128 1216 3060 2697 5732 2248 2299 3590 4537 2741 1440 2563 1001 3772 206 3020 4075 4571 3147 3727 2438 1744 1365 1156 1181 1442 2379 272 3902 777 3198 350 2111 3320 2670 3415 2118 3939 326 2826 3343 3698 2066 3284 3319 2883 3487 2976 876 1576 661 942 3960 1525 2962 270 577 1266 287 1499 2643 1651 1401 2218 1991 1 1681
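A quick sanity check on the qa2 output above: if every unmatched cluster is listed exactly once as a residue cluster (my assumption, not something I checked in the qa2 source), the number of residue clusters per file should equal that file's total minus the number of matched clusters. A small Python sketch of that arithmetic, with a made-up function name and regular expressions that assume the line format shown above:

```python
import re

# Header lines copied from the qa2 output above.
QA2_OUTPUT = """\
Number of clusters in multiparanoid_result.txt : 8635
Number of clusters in quickparanoid_result.txt : 8407
Number of matched clusters: 8253
"""


def expected_residues(qa2_text):
    """Return {file_name: total_clusters - matched_clusters} from qa2 header lines."""
    totals = {
        name: int(count)
        for name, count in re.findall(
            r"Number of clusters in (\S+)\s*:\s*(\d+)", qa2_text
        )
    }
    matched = int(re.search(r"Number of matched clusters:\s*(\d+)", qa2_text).group(1))
    return {name: total - matched for name, total in totals.items()}


residues = expected_residues(QA2_OUTPUT)
for name, count in residues.items():
    print(f"{name}: {count} unmatched clusters expected")
```

With the numbers above this gives 382 unmatched clusters for MultiParanoid and 154 for QuickParanoid, so about 96% of MultiParanoid's clusters and 98% of QuickParanoid's have an exact name-level match in the other result.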

    Not a conclusion:

    • thanks to the qa1 summary, I’ve decided to use the MultiParanoid results: they have (in my case) larger clusters with more genes in them, which is good, and more clusters overall, which is also good
    • if I had 20+ genomes to compare, or if I had to re-run this type of analysis multiple times – I’d use QuickParanoid
    • if I had to implement yet-another-inparanoid-based orthology clustering tool, then I’d first consider QuickParanoid’s preprocessor/code generator, which was designed in an easy-to-extend manner

    Initially, I had also considered OrthoMCL for multi-species ortholog clustering. However, InParanoid + Multi/QuickParanoid is much easier and quicker to set up and use, as OrthoMCL requires a database back-end for better scalability.

    Well, QuickParanoid has a test dataset with 120 species, and

    … it takes only 199.56 seconds on an Intel 2.4Ghz machine with 1 gigabyte memory to process a dataset of 120 species …

    :)

    2 Responses to “MultiParanoid vs. QuickParanoid: pro et contra for each”

    1. Matteo Brilli Says:

      It is very dangerous to infer that MultiParanoid was best for you because it gives larger clusters. It depends on what you need. If you need orthologs, the size of the clusters is not at all a pertinent measure. In principle, for example, the best clustering would minimize the presence of multiple proteins from the same organism, and so on.

    2. Bogdan Says:

      Matteo, you are right, of course. Multiparanoid output was only better for me in that specific project I was working on. Overall, I liked quickparanoid more.
      I had even used quickparanoid in another project. If you are interested, here’s a simple patch to ‘qp’ to allow running quickparanoid non-interactively:
      https://bitbucket.org/qmentis/clusterscluster/src/41e4b2d5716f5b1d416b3abaef8b3214b54d9b9a/quickparanoid/qp.patch
