RESEARCH

Our long-term research goal is to study evolutionary and comparative
genomics, population genetics, and computational biology. My
lab is developing research projects that combine statistical and
computational method development, software engineering,
large-scale multi-layer data analysis, and experimental work.

EVOLUTIONARY FUNCTIONAL GENOMICS

Statistical framework for gene family evolution

We are developing statistical approaches to better understand
functional innovation, specification, and divergence during the
course of gene family evolution. The random-field model of
evolutionary rate across lineages (subtrees) and sites (regions)
provides a general statistical framework that includes functional
divergence (Mol Biol Evol. 2001. 18:453; J. Comp Biol. 2001. 8:221),
site-by-site dependence, positive selection, etc. as special cases.
Moreover, we are focusing on how to incorporate biological covariates
(from protein structure and knockout phenotype to expression profile)
appropriately, so that the statistical model becomes more
biologically relevant.

Integrated software system

Our goal is to develop an integrated software system to extract
function-related information from DNA and amino acid sequence
evolution. We have developed a software system, DIVERGE 1.0
(Bioinformatics 2002. 18:500), for functional divergence analysis
and prediction based on site-specific evolutionary rate changes,
with a protein 3D viewer that maps the predicted residues onto
the 3D structure (with more than 300 downloads) (Mol Biol Evol.
1999. 16:1664). We are working on a new version, DIVERGE 2.0,
that will include more options: rate variation among sites
(Mol Biol Evol. 1997. 14:1106), ancestral sequence inference,
functional distance analysis, and other statistical methods for
functional divergence (Mol Biol Evol. 2001. 18:453; 18:2327).

Evidence for the association between site-specific rate shifts
and changes in function after gene duplication

Because the evolutionary rate of an amino acid residue is inversely
correlated with its functional importance, a site-specific shift
in evolutionary rate within a gene family (i.e., a site that is
highly variable in one duplicate gene but highly conserved in the
other) may indicate a change in functional importance after gene
duplication. Using the statistical methods in DIVERGE, we have
conducted a large-scale analysis showing that site-specific rate
shift is a general evolutionary pattern (J Mol Evol. 2002. 54:725).
Moreover, we found evidence that the level of site-specific rate
shift among member genes could be related to protein structure
differences (Genetics. 2001. 158:1311; Trends in Biochem Sci. 2002.
27:315), the severity of knockout phenotypes, and tissue
specificity (Gu et al. PNAS under review).

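The idea of a site that is conserved in one duplicate cluster but variable in the other can be illustrated with a small sketch. This is not the likelihood-based method implemented in DIVERGE; it is a crude proxy that compares per-site Shannon entropy between two aligned clusters. The function names, the toy alignments, and the `low`/`high` entropy cutoffs are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def site_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def rate_shift_candidates(cluster_a, cluster_b, low=0.0, high=1.0):
    """Return 0-based sites conserved in one cluster and variable in the other.

    cluster_a, cluster_b: lists of aligned sequences of equal length.
    low, high: hypothetical entropy cutoffs for "conserved" and "variable".
    """
    hits = []
    for i in range(len(cluster_a[0])):
        h_a = site_entropy([seq[i] for seq in cluster_a])
        h_b = site_entropy([seq[i] for seq in cluster_b])
        if (h_a <= low and h_b >= high) or (h_b <= low and h_a >= high):
            hits.append(i)
    return hits


# Toy alignments (invented): cluster A is fully conserved,
# cluster B varies at the second and fourth positions.
cluster_a = ["MKLV", "MKLV", "MKLV", "MKLV"]
cluster_b = ["MRLV", "MKLI", "MSLV", "MTLL"]
print(rate_shift_candidates(cluster_a, cluster_b))  # [1, 3]
```

Entropy ignores the phylogeny, so closely related sequences are counted as if they were independent; a tree-aware rate estimate is what a proper analysis would need.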
COMPARATIVE AND EVOLUTIONARY GENOMICS

Vertebrate genome duplication and origin of human gene family
hierarchy

The 2R model of vertebrate genome duplications, which is hotly
debated, postulates two successive polyploidizations prior to
the origin of fishes (Genome Res. 2002. 12:1). We address this
issue by estimating the age distribution of paralogous genes in
the human genome. In total, 1,739 gene duplication events were
dated from the phylogenetic analysis of 749 vertebrate gene
families, revealing a pattern characterized by two waves (I, II)
and an ancient component. Wave I represents a recent gene family
expansion by tandem or segmental duplications, whereas Wave II,
a rapid increase in paralogous genes in the early stage of
vertebrate evolution, supports the 2R model (the big-bang mode).
Further analysis indicates that both large- and small-scale gene
duplications contributed significantly, during the early stage of
vertebrate evolution, to building the current hierarchy of human
gene families (Nature Genetics. 2002. 31:205). Our future research
plan is to study (1) the impact of gene family proliferation on
functional innovations (from tissue specificity to molecular
pathways); (2) the pattern of functional divergence in major human
gene families (protein kinases, etc.); and (3) the joint
distribution of age and chromosomal location of duplicate genes.

Algorithms for ancestral gene order inference and comparative
genome mapping

Genome-level comparative mapping raises new challenges in the
study of genome rearrangement, especially for the multiple-species
genome rearrangement problem. In this setting, a genome is viewed
as a signed permutation in which each integer corresponds to a
unique gene location and the sign corresponds to its orientation.
The Multiple Genome Rearrangement Problem (MGRP) is to find a most
parsimonious rearrangement scenario for multiple genomes. Since
MGRP is NP-hard, we address it by developing efficient heuristic
algorithms, e.g., the nearest path search algorithm (Pacific
Symposium on Biocomputing (PSB) 2002. 7:259; 2003 in press). We
are also working on other approaches, including neighbor-perturbing,
branch-and-bound, and simulated annealing algorithms.
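The signed-permutation view of a genome can be made concrete with a short sketch. It shows only the data representation, a single signed reversal, and a breakpoint count against the identity order; it is not the nearest path search algorithm or any other MGRP heuristic named above. The function names and the example genome are assumptions made for illustration.

```python
def reverse_segment(genome, i, j):
    """Signed reversal of genome[i..j] (inclusive): reverse order, flip signs."""
    flipped = [-gene for gene in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]


def breakpoints(genome):
    """Count breakpoints relative to the identity order 1..n.

    The genome is framed with 0 and n+1; a consecutive pair (a, b) is an
    adjacency when b - a == 1, and a breakpoint otherwise.
    """
    framed = [0] + list(genome) + [len(genome) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


# Invented example: genes 2 and 3 are inverted as a block.
genome = [1, -3, -2, 4]
print(breakpoints(genome))           # 2
restored = reverse_segment(genome, 1, 2)
print(restored)                      # [1, 2, 3, 4]
print(breakpoints(restored))         # 0
```

A heuristic search over rearrangement scenarios would apply operations such as `reverse_segment` repeatedly, typically guided by a score like the breakpoint count.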
Whole-genome phylogenetic analysis based on gene (family) content

For a complete genome, gene (family) content is a string of 1s and
0s representing the presence or absence of gene families. It has
been used to infer the minimum set of essential genes, estimate
the size of the ancestral genome, reconstruct the genome tree of
life, and predict functional interactions between genes, but the
results are controversial. The bottleneck is the lack of a rigorous
statistical framework for gene content evolution. We are developing
probabilistic models that aim to address (1) the controversy between
the genome tree of life and lateral gene transfer, (2) functional
interaction prediction under the statistical framework of a
phylogenetic tree, and (3) statistical testing for the existence
of a minimum gene set. For instance, we have developed a software
system, GenContent, for inferring the genome tree (PNAS 2002,
Gu-Zhang submitted).
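To make the presence/absence encoding concrete, the sketch below compares gene content strings with a Jaccard distance. This is a simple stand-in chosen for illustration: it is not the probabilistic model described above, and nothing here reflects how GenContent works. The genome labels and content strings are invented.

```python
from itertools import combinations


def gene_content_distance(a, b):
    """Jaccard distance between two presence/absence strings of equal length.

    Families absent from both genomes are ignored; the distance is
    1 - (families shared) / (families present in at least one genome).
    """
    if len(a) != len(b):
        raise ValueError("gene content strings must have equal length")
    shared = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    return 0.0 if either == 0 else 1.0 - shared / either


# Invented gene content for three genomes over six gene families.
genomes = {"A": "110110", "B": "110011", "C": "001111"}
for g1, g2 in combinations(sorted(genomes), 2):
    print(g1, g2, round(gene_content_distance(genomes[g1], genomes[g2]), 3))
```

A pairwise distance matrix of this kind could be passed to a distance-based tree method, but a raw distance does not model gene gain, gene loss, or lateral transfer, which is the gap a probabilistic framework is meant to close.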
MULTI-LAYER GENOME DATA EXPLORATION AND GENE NETWORK EVOLUTION

Statistical framework for expression profile evolution

For a gene family from a single species (e.g., yeast) with a known
phylogeny, our goal is to develop a joint density for gene
expression levels when substantial microarray data are available.
As with DNA sequences, this will provide a statistical framework
to explore the pattern of gene expression evolution (likelihood
ratio test), infer the ancestral expression pattern (Bayesian
analysis), and infer phylogeny (Gu, PNAS submitted). Moreover,
the joint distribution of expression levels and motifs under a
tree is under development.
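As one way to picture a joint density under a known phylogeny, the sketch below assumes that expression changes along branches follow Brownian motion. That model is an assumption made here for illustration; the text above does not specify the form of the density. The symbols are also assumptions: x_i is the expression level of gene i, mu the level at the root, sigma^2 the rate of change per unit time, and t_ij the branch length shared by genes i and j between the root and their most recent common ancestor (t_ii is the root-to-tip length).

```latex
% Assumption: Brownian-motion change along the tree (not stated in the source).
\mathbf{x} = (x_1, \dots, x_n)^{\top} \sim
  \mathcal{N}\!\left(\mu \mathbf{1},\; \sigma^{2} \mathbf{T}\right),
\qquad T_{ij} = t_{ij}

f(\mathbf{x} \mid \mu, \sigma^{2}) =
  (2\pi\sigma^{2})^{-n/2} \, |\mathbf{T}|^{-1/2}
  \exp\!\left[ -\frac{1}{2\sigma^{2}}
    (\mathbf{x} - \mu\mathbf{1})^{\top} \mathbf{T}^{-1}
    (\mathbf{x} - \mu\mathbf{1}) \right]
```

Under such a density, a likelihood ratio test compares nested parameterizations (for example, one rate versus lineage-specific rates), and the ancestral expression level has a normal posterior given the observed tips.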
Evolution of repeat elements, gene regulation, and motif detection

In collaboration with other groups, we have found evidence for the
role of repeat elements in the spread of regulatory motifs in human
(Alu) (Genome Res. 2002 in press) and mouse (B1) (Gene. 2000.
245:319). After conducting a whole-genome association study between
yeast sporulation expression and the regulatory motif MSE, we found
a positive association between induced expression and MSE among
recent duplicate pairs, but not among ancient duplicates
(Information Sciences, 2002, in press). It seems that the model of
subfunction loss after duplication needs to be revisited, because
acquisition of a motif could be repeat-element-mediated. Moreover,
nucleotide changes in transcription factors, rather than in binding
sites, should also be considered (Mol. Biol. Evol. 2002. 19:1490).
In bioinformatics, we are studying the power of "phylogenetic
footprinting" for motif detection when repeat element activity is
overwhelming during the course of genome evolution.

Genetic buffering, duplicate genes, and network complexity

Knocking out a gene in an organism often has little phenotypic
effect, owing to two mechanisms: the existence of duplicate genes
and genetic buffering by the network (canalization). Their relative
importance is controversial (Gu, Trends in Genetics, in press).
Using fitness data for a complete set of single-gene-deletion
mutants of the yeast genome, we (in collaboration with other groups)
have conducted a genome-wide evaluation of the role of duplicate
genes in genetic robustness against null mutations and found
evidence for functional compensation by duplicate genes (Nature,
in press). We are developing more rigorous models to explore the
evolutionary mechanisms of functional compensation and divergence
within a gene family. On the other hand, genetic buffering stems
from the complexity of gene networks, described as scale-free, but
little is known about how this complexity emerged during evolution.
Several evolutionary models are under study, considering (1) growth
of genome size by domain, gene, and genome duplications, (2)
allometric growth of gene interactions, and (3) random gene or
connection loss. We are also interested in the connection between
functional divergence and the origin of gene network modularity,
which is not fully predicted by scale-free complexity.
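Two of the ingredients listed above, growth by gene duplication and random connection loss, can be combined in a toy simulation. The sketch below is a generic duplication-with-loss model written for illustration, not one of the lab's models; `keep_prob`, the two-gene starting network, and the function names are assumptions, and no claim is made about the degree distribution it produces.

```python
import random


def grow_by_duplication(steps, keep_prob=0.5, seed=0):
    """Grow an undirected gene network by duplication with random link loss.

    Each step copies a randomly chosen gene; the copy inherits each of the
    parent's interactions independently with probability keep_prob
    (a hypothetical parameter). Returns {gene: set of interacting genes}.
    """
    rng = random.Random(seed)
    network = {0: {1}, 1: {0}}  # assumed starting point: two interacting genes
    for _ in range(steps):
        parent = rng.choice(sorted(network))
        child = len(network)  # genes are numbered 0..n-1, so n is unused
        inherited = {n for n in sorted(network[parent])
                     if rng.random() < keep_prob}
        network[child] = inherited
        for neighbour in inherited:
            network[neighbour].add(child)
    return network


def degree_counts(network):
    """Map each degree to the number of genes that have it."""
    counts = {}
    for neighbours in network.values():
        counts[len(neighbours)] = counts.get(len(neighbours), 0) + 1
    return counts


net = grow_by_duplication(steps=200)
for degree, n_genes in sorted(degree_counts(net).items()):
    print(degree, n_genes)
```

Inspecting how the degree counts change with `keep_prob` and the number of steps is one simple way to see how duplication and loss shape network connectivity in such a model.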
HUMAN POPULATION GENETICS AND PRIMATE EVOLUTION

Gene expression evolution in humans and chimpanzees

The regulatory hypothesis claims that humans and chimpanzees differ
considerably in mental and linguistic capability because of changes
in gene regulation. A recent comparative microarray analysis of
humans and great apes supports this hypothesis, but statistical
problems cast some doubt on the result. Reanalysis of the Affymetrix
data (J. Gu and X. Gu, Trends in Genetics, in press) shows that the
dramatic brain-expression alterations in humans since the split
from chimpanzees are mainly driven by a set of genes with increased
(rather than decreased) expression levels in the human brain.
Furthermore, we have identified a set of genes with significant
expression changes in the human brain (induced or repressed) since
the human-chimpanzee split. My wet-lab research wing plans to
sequence the homologous 5'-noncoding regions of these genes in
several primates in an attempt to identify human-lineage-specific
DNA substitutions. In collaboration with other groups, we will
compare the 5' surrounding regions of these genes with the mouse
genome to identify conserved noncoding islands. Human population
genetic surveys (SNPs) around these gene regions will be used to
test the role of positive selection (e.g., by Tajima's or Fu-Li's
test).
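As an illustration of the first test named above, the sketch below computes Tajima's D from a set of aligned sequences using the statistic's standard definition. The helper names and the toy sequences are assumptions; the Fu-Li test is not shown, and a real analysis would also need to handle gaps, missing data, and significance assessment.

```python
from itertools import combinations
from math import sqrt


def summarize(seqs):
    """Return (n, S, pi): sample size, segregating sites, mean pairwise differences."""
    n = len(seqs)
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs) / len(pairs)
    return n, seg_sites, pi


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S, and pi."""
    if n < 4 or seg_sites < 1:
        # The variance term is zero for n < 4, and D is undefined when S = 0.
        raise ValueError("need n >= 4 sequences and at least one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy sample (invented): four aligned sequences, three segregating sites.
sample = ["AAAA", "AAAT", "AATT", "ATTT"]
n, seg_sites, pi = summarize(sample)
print(n, seg_sites, round(pi, 4), round(tajimas_d(n, seg_sites, pi), 4))
```

Negative values of D indicate an excess of rare variants relative to neutral expectation, which can reflect a selective sweep but also population expansion, so the statistic alone does not establish positive selection.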
Interplay between species evolution and population genetics

We are interested in how to use sequence evolution information to
estimate population genetic parameters. Currently we are focusing
on (1) the variation of mutation rate and indels along the genome,
as well as the effects of GC content, codon-usage bias, sequence
features, etc.; and (2) site-specific selection intensity, i.e.,
the association between evolutionary conservation and potentially
disease-related sites. Moreover, we are examining the population
genetic basis of molecular evolutionary parameters. For instance,
we show that the gamma shape parameter α for rate variation among
sites is inversely proportional to the square root of the effective
population size under the stabilizing-selection model.