The high degree of sequence divergence of hepatitis B virus (HBV) has important implications for disease prevention, detection and treatment. An integrated in-silico platform can complement laboratory experiments in discovering novel mechanisms of HBV infection and drug resistance, improving the accuracy of clinical diagnosis and prognosis, and optimizing therapeutic protocols. We developed an HBV data integrated analysis platform (HBV-DIAP) that combines various types of HBV sequences with a sequence analysis pipeline. The latest version of HBV-DIAP comprises xxxx re-annotated HBV genome sequences and xx CDS. We identified 9 genotypes and 23 serotypes from xxx whole-genome sequences. The sequence analysis pipeline covers HBV genotypes/sub-genotypes, drug-resistance-associated mutations, protein-function-associated mutations, and other analyses. We also developed a dynamic surveillance tool to trace the treatment outcome of anti-HBV drugs, including lamivudine, telbivudine, adefovir, entecavir, and tenofovir. As a comprehensive data management and analysis platform, HBV-DIAP may facilitate disease diagnosis and drug-resistance prediction, and may help standardize HBV clinical data collection and analysis.

Overview of hepatitis B virus

The genome of hepatitis B virus is a circular DNA molecule encompassing four major overlapping genes (polymerase, surface, core and X), all of which are indispensable for the structure and infectivity of the virus. Most researchers number the HBV genome from the midpoint of the EcoRI restriction site, which is taken as the first nucleotide of the full genome. Historically, however, some users preferred to number the genome from the start codon of a specific ORF, for example from the first nucleotide of the ATG of the core gene. Other numbering systems, such as one starting from the exact cutting position within the EcoRI site, have not been commonly used. Whichever numbering system is employed, the DNA sequence of the genome itself is unchanged because the genome is circular. Different numbering conventions do, however, cause unnecessary inconvenience during alignment and other large-scale comparative analyses.
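Because the genome is circular, switching from one numbering system to another amounts to rotating the sequence and shifting coordinates modulo the genome length. The following is a minimal Python sketch of that idea, assuming 1-based positions; the function names are hypothetical and are not part of HBV-DIAP.

```python
def rotate_to_origin(seq: str, new_start: int) -> str:
    """Rewrite a circular sequence so that the 1-based position
    `new_start` of the old numbering becomes position 1."""
    if not 1 <= new_start <= len(seq):
        raise ValueError("new_start must lie within the sequence")
    i = new_start - 1
    return seq[i:] + seq[:i]


def convert_position(pos: int, new_start: int, genome_len: int) -> int:
    """Map a 1-based position from the old numbering to the new one,
    where old position `new_start` becomes new position 1."""
    if not 1 <= pos <= genome_len:
        raise ValueError("pos must lie within the genome")
    return (pos - new_start) % genome_len + 1


# Toy circular sequence (not a real HBV genome) containing one GAATTC site.
toy = "AAAGAATTCCC"
# The base right after the midpoint of GAA|TTC is at 1-based position 7.
rotated = rotate_to_origin(toy, 7)
```

In this toy case the rotated sequence starts with TTC and ends with GAA, and every base keeps its identity under the coordinate mapping; only its number changes.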

For the convenience of users, we apply an additional standardization step to sequences submitted with older numbering systems, setting the start position of a complete genome to the midpoint of the EcoRI restriction site. For the reference sequence (AB014381), for example, the complete genome starts with CTC, i.e., the right half of the EcoRI site, and ends with GAA, i.e., the left half of the EcoRI site. Under this standardized numbering system, the locations of genes, including those that traverse the end and start positions of the genome in the 5'-3' direction, can be displayed in the following way:

e.g.: AB014381.1            Start position (nt)   End position (nt)
Polymerase gene             2307                  1623
Within polymerase gene
        TP domain           2307                  2843
        Spacer domain       2844                  135
        RT domain           136                   1167
        RH domain           1168                  1623
preS1/S2/S gene             2848                  835
preS2/S gene                3205                  835
Surface gene                155                   835
preC/C gene                 1814                  2452
Core gene                   1901                  2452
X gene                      1374                  1838
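In the table above, a start position larger than the end position means that the feature runs through the end of the linear numbering and continues from position 1. The following Python sketch shows one way to handle such coordinates, assuming 1-based inclusive positions on the forward strand; the function names are hypothetical, and the genome length of 3215 nt used for AB014381.1 is an assumption that is not stated in the text and should be checked against the actual record.

```python
def feature_length(start: int, end: int, genome_len: int) -> int:
    """Length of a feature on a circular genome (1-based, inclusive).
    Works for both ordinary (start <= end) and wrap-around features."""
    if not (1 <= start <= genome_len and 1 <= end <= genome_len):
        raise ValueError("coordinates must lie within the genome")
    return (end - start) % genome_len + 1


def extract_feature(genome: str, start: int, end: int) -> str:
    """Return the feature sequence, joining the tail and the head of the
    linear record when the feature crosses the numbering origin."""
    if not (1 <= start <= len(genome) and 1 <= end <= len(genome)):
        raise ValueError("coordinates must lie within the genome")
    if start <= end:
        return genome[start - 1:end]
    return genome[start - 1:] + genome[:end]


ASSUMED_GENOME_LEN = 3215  # assumption for AB014381.1, not given in the text

# Polymerase gene from the table: 2307 -> 1623, crossing the origin.
pol_len = feature_length(2307, 1623, ASSUMED_GENOME_LEN)

# Toy example (not real HBV sequence): feature 9 -> 3 on a 10-nt circle.
toy_feature = extract_feature("ACGTACGTTG", 9, 3)
```

Under the assumed genome length, the polymerase gene spans 2532 nt, which is a multiple of three, as expected for an open reading frame.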

We also developed a motif-based screening tool that automatically identifies the EcoRI site in HBV genome sequences, so that sequences submitted online by any user can be standardized. We first extracted sequence fragments (15 nt) from each side of the EcoRI site in a set of manually calibrated genome sequences. These fragments were then used to generate the start and end sequence patterns of the HBV genome with the motif discovery tool suite MEME (http://meme.nbcr.net/meme/). A Perl program incorporating FIMO (http://meme.nbcr.net/meme/fimo-intro.html), a tool for scanning biological sequences for motif occurrences, was developed to provide online standardization of HBV genome sequences.
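The general idea of motif-based standardization can be illustrated with a simple position weight matrix (PWM). The Python sketch below is a simplified stand-in for the MEME/FIMO step, not the Perl program described above: it builds a log-odds matrix from aligned fragments spanning the junction, scans the circular genome on the forward strand only, and rotates the sequence at the best hit. All function names, the uniform background, the pseudocount, and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(fragments, pseudocount=1.0):
    """Log-odds PWM (uniform background) from equal-length aligned fragments."""
    width = len(fragments[0])
    if any(len(f) != width for f in fragments):
        raise ValueError("fragments must all have the same length")
    total = len(fragments) + 4 * pseudocount
    pwm = []
    for col in range(width):
        counts = Counter(f[col] for f in fragments)
        pwm.append({b: math.log2(((counts[b] + pseudocount) / total) / 0.25)
                    for b in BASES})
    return pwm


def best_circular_hit(genome, pwm):
    """Return (0-based start, score) of the best-scoring window,
    including windows that span the end/start of the linear record."""
    width = len(pwm)
    extended = genome + genome[:width - 1]
    best = None
    for i in range(len(genome)):
        window = extended[i:i + width]
        # Bases outside ACGT (e.g. N) contribute a neutral score of 0.
        score = sum(col.get(base, 0.0) for col, base in zip(pwm, window))
        if best is None or score > best[1]:
            best = (i, score)
    return best


def standardize(genome, pwm, left_flank):
    """Rotate the genome so that the base following the first `left_flank`
    motif positions (the end pattern) becomes position 1."""
    hit, _ = best_circular_hit(genome, pwm)
    origin = (hit + left_flank) % len(genome)
    return genome[origin:] + genome[:origin]


# Toy training fragments (4 nt on each side of the junction; the text uses 15).
pwm = build_pwm(["GGAATTCC", "AGAATTCA", "TGAATTCC"])
# Toy circular genome with an arbitrary start point (not real HBV sequence).
fixed = standardize("TACGTAGCATGAATTCCACG", pwm, left_flank=4)
```

A production tool would also need a score threshold to reject sequences with no convincing match, and would need to handle reverse-complemented submissions, which this sketch omits.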

HBV-DIAP integrative analysis pipeline

Our database incorporates 8 sequence analysis tools in a single analysis pipeline: HBV genome sequence calibration, annotation, genotype/sub-genotype prediction, protein-function-associated mutation analysis, drug-resistance analysis, BLAST, ClustalW, and PHYLIP. This pipeline may improve the analysis of HBV in laboratory and clinical settings.

- Collaborator -

Shanghai Guidance of Science and Technology  Shanghai Data Sharing Platform Bioinformation

Address: 2nd Floor, No. 1278, Keyuan Rd, Pudong District, Shanghai, China Postcode: 201203 Tel: 021-20283723 E-mail: lifecenter@scbit.org

Copyright © Shanghai Center of Bioinformation Technology