Thursday, May 13, 2010

Improvement Ideas on Phylogenetic Maker

I've spent the past hour examining the phylogeneticMaker code and have had a few insights as per how it is set up to work. From what I understand the code is written to compare the difference between expression of a gene in different states (e.g. with knock-out data, it compares the expression of a given gene X when another gene B is knocked-out (which we call state B) as compared to when gene C is knocked-out). This "footprint" of expression data is written as a "pseudoread" of a's, c's, and t's. The 'a' indicates X in state B is more highly expressed than X in state C. The 'c' indicates the opposite. The 't' indicates that there is no significant difference between the two expression levels (as defined by a hardcoded threshold).

Though most of it is very well written, the scoring algorithm (which in the code is preceded by the text "//TODO: Is this how we want to score?") has left me with a few questions and thoughts that the code may be improved with a few small adjustments:

1. Eliminate redundant information. The problem is that it is compares the expression of X in state B to the expression in state C and then later comes back to compare the expression in state C to the expression in state B, which is redundant. This creates a pattern in the data which is a mark of the program and not the data. It could be that this was done intentionally to magnify differences. Eliminating this redundancy is simple, saves space (the pseudoreads will be less than half the length), and time (both in producing the pseudoreads and perhaps later in determining the phylogenetic tree).

I adjusted the code to only compute the upper triangle of the matrix without including the diagonal and it produced reads with less than half the length. I had to adjust the nexus file to specify nchar in the nexus file.

2. Appropriate threshold value. At present the threshold value is "0.000000" which essentially means that unless gene A at time B and at time C are expressed identically, you will record a "significant" difference in expression. For example, in the first gene in the test data I was using, when G1 is knocked out the expression level is 0.10592894624038984 and then later when G8 is knocked down the expression is 0.10506157963165738. Thus the comparison was scored as an 'a'. In fact, once the redundancy referred to above was removed, the only 't' scores were when 0.0 was compared to 0.0. Raising the threshold (and by consequence the incidence of 't' in the generated pseudoread) diversifies the sequence, making it perhaps easier to find more accurate phylogenetic homology when run in psoda.

Defining a significant threshold should remove a lot of noise from the data, particularly where initially we may be concerned to know only those patterns which are the most resilient to even a high threshold.

3. Diversify the footprint. Following the same vein as above, perhaps finding a way to include 'g' in the footprint may allow more information to be coded in the pseudoread without any additional cost in the phylogenetic analysis later. A few ideas come to mind:

a) Let 'g' represent expression increases (or decreases) above a certain threshold. The difficulty then becomes defining the threshold (for example, if that threshold scored one increase as 'g' and another as 'a' even though the two were very close, it might as well be completely different). Another possibility is creating a "gap" in order to communicate other information.

b) Let 'g' represent information gleaned from comparing gene expressions for two genes both in the same state.

c) Let 'g' represent information reflecting the relationship between the gene being knocked-out and the gene whose expression is measured. For example, if the greatest irregularity in gene expression for gene X comes when gene D is knocked-out, that information may be somehow included in the pseudoread.

d) In reality, the length of the pseudoread is non-limiting, meaning that one might consider encoding additional information about given comparisons in more "nucleotides". For example rather than letting 'a' represent an increase, 'at' might be a significant increase and 'ag' a less significant increase. Thus one could create a "codon" which describes the relationships in the data, and that codon need not be limited to 3 nucleotides.

Something to think about.

4) I also made it so that the tree produced by psoda from the nexus file would have a more descriptive name than "out.tre", something that matched the expression data file name.


Those are the changes I thought might improve the code. To test if (with a constant threshold) the removal of redundant data would affect the phylogenetic relational predictions, I ran psoda on both files. The time taken was identical, I'm not sure why the shortened reads didn't speed up performance. The scores were exactly half for the shortened reads on multiple trials. The trees produced were the same.

Another idea may be to create the "complement" for every pseudoread which, if it matched closely with a read, would indicate that the genes might be inversely regulated. I know David Oviatt was doing something like this.