Basic Information
GFF3 (Generic Feature Format Version 3) is a simple, tab-delimited text format for describing genomic features.
Each feature line has nine fields: seqid, source, feature (called "type" in the specification), start, end, score, strand, phase, and attributes.
Lines that start with ‘##’ provide meta-information about the file, and lines that start with ‘#’ are human-readable comments.
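As a minimal sketch of the layout described above, the snippet below splits one GFF3 line into its nine columns and skips comment lines. The helper name `split_gff3_line` and the sample record are made up for illustration.

```python
GFF3_COLUMNS = ["seqid", "source", "feature", "start", "end",
                "score", "strand", "phase", "attributes"]


def split_gff3_line(line):
    """Return a dict of the nine GFF3 columns, or None for comments and malformed lines."""
    if line.startswith("#"):
        return None  # '##' meta-information and '#' comments carry no feature
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        return None
    return dict(zip(GFF3_COLUMNS, fields))


# Hypothetical record, used only to exercise the helper
record = split_gff3_line("chr1\tGnomon\texon\t100\t200\t.\t-\t.\tID=exon-1;Parent=rna-1\n")
print(record["feature"], record["start"], record["end"])
```

Note that start and end come back as strings; convert them with int() if you need to do arithmetic on coordinates.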
We sometimes need to convert GFF3 to GTF format.
I used gffread for this before. It generally works well, but it ran into a problem with a GFF3 file that I downloaded from NCBI.
This file has some lines like:
chrxxx Gnomon exon 46964 46999 . - . ID=id-LOC123327042;Parent=gene-LOC123327042;Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 74%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
chrxxx Gnomon exon 47054 47468 . - . ID=id-LOC123327042-2;Parent=gene-LOC123327042;Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 74%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
chrxxx Gnomon exon 47542 47661 . - . ID=id-LOC123327042-3;Parent=gene-LOC123327042;Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 74%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
When I used gffread, it could not handle these lines containing “ID=id*” properly: the gene_id attribute was lost in the output GTF file.
I realized that I needed to write a script myself.
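Before converting, it can help to check how many records are affected. The sketch below lists exon records whose Parent is a gene rather than a transcript, which is the pattern in the lines above. It assumes the NCBI naming convention seen there, where gene IDs start with `gene-`; the function name `exons_parented_by_gene` and the second sample record (`rna-XM_1`) are my own inventions for illustration.

```python
def exons_parented_by_gene(lines):
    """Collect IDs of exon records whose Parent is a gene rather than a transcript."""
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        # Assumption: NCBI prefixes gene IDs with "gene-"
        if attrs.get("Parent", "").startswith("gene-"):
            hits.append(attrs.get("ID", ""))
    return hits


sample = [
    "##gff-version 3\n",
    "chrxxx\tGnomon\texon\t46964\t46999\t.\t-\t.\tID=id-LOC123327042;Parent=gene-LOC123327042;pseudo=true\n",
    "chrxxx\tGnomon\texon\t100\t200\t.\t+\t.\tID=exon-1;Parent=rna-XM_1\n",  # hypothetical
]
print(exons_parented_by_gene(sample))
```

To run it on a real file, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.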
Script
import sys


def parse_gff_attributes(attributes_str):
    """Parse GFF attributes into a dictionary."""
    attributes = {}
    for attr in attributes_str.split(';'):
        if attr:
            key_value = attr.split('=', 1)
            if len(key_value) == 2:
                key, value = key_value
                attributes[key] = value
    return attributes


def convert_gff_to_gtf(gff_file, gtf_file):
    """Convert GFF3 to GTF format, retaining ID, gene_id, and transcript_id."""
    with open(gff_file, 'r') as gff, open(gtf_file, 'w') as gtf:
        for line in gff:
            # Copy comment and meta-information lines through unchanged
            if line.startswith('#'):
                gtf.write(line)
                continue
            fields = line.strip().split('\t')
            if len(fields) != 9:
                continue
            seqid, source, feature, start, end, score, strand, phase, attributes_str = fields
            attributes = parse_gff_attributes(attributes_str)

            # Skip features that are not relevant for GTF (e.g., region)
            if feature not in ['gene', 'mRNA', 'exon', 'CDS', 'lnc_RNA', 'pseudogene']:
                continue

            # Determine feature type for GTF
            gtf_feature = feature
            if feature == 'mRNA' or feature == 'lnc_RNA':
                gtf_feature = 'transcript'

            # Extract required attributes
            gene_id = attributes.get('gene', '')
            transcript_id = attributes.get('transcript_id', '')
            feature_id = attributes.get('ID', '')

            # Build GTF attributes string
            gtf_attributes = []
            if feature == 'gene':
                gtf_attributes.append(f'gene_id "{gene_id}"')
                if 'Name' in attributes:
                    gtf_attributes.append(f'gene_name "{attributes["Name"]}"')
            elif feature in ['mRNA', 'lnc_RNA', 'exon', 'CDS']:
                gtf_attributes.append(f'gene_id "{gene_id}"')
                gtf_attributes.append(f'transcript_id "{transcript_id}"')
                if 'Name' in attributes and feature in ['mRNA', 'lnc_RNA']:
                    gtf_attributes.append(f'gene_name "{attributes["Name"]}"')
                if feature == 'exon':
                    gtf_attributes.append(f'exon_id "{feature_id}"')
            elif feature == 'pseudogene':
                gtf_attributes.append(f'gene_id "{gene_id}"')
                if 'Name' in attributes:
                    gtf_attributes.append(f'gene_name "{attributes["Name"]}"')

            # Write GTF line
            gtf_fields = [seqid, source, gtf_feature, start, end, score, strand, phase, '; '.join(gtf_attributes)]
            gtf.write('\t'.join(gtf_fields) + '\n')


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: python gff2gtf.py input.gff output.gtf")
        sys.exit(1)

    gff_file = sys.argv[1]
    gtf_file = sys.argv[2]

    convert_gff_to_gtf(gff_file, gtf_file)
    print(f"Conversion complete. GTF file written to {gtf_file}")
Usage
The script is easy to use: save it as gff2gtf.py and run it like this:
python gff2gtf.py input.gff output.gtf
The result looks like this:
chrxxx Gnomon exon 46964 46999 . - . gene_id "LOC123327042"; transcript_id ""; exon_id "id-LOC123327042"
chrxxx Gnomon exon 47054 47468 . - . gene_id "LOC123327042"; transcript_id ""; exon_id "id-LOC123327042-2"
chrxxx Gnomon exon 47542 47661 . - . gene_id "LOC123327042"; transcript_id ""; exon_id "id-LOC123327042-3"
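The transcript_id is empty here because these pseudogene exon lines carry no transcript_id attribute in the GFF3 file. Some downstream tools may expect a non-empty value. One possible variation, which is not part of the script as written, is to fall back to Parent and then to ID. The helper name `pick_transcript_id` is hypothetical.

```python
def pick_transcript_id(attributes):
    """Return the first non-empty value among transcript_id, Parent, and ID."""
    for key in ("transcript_id", "Parent", "ID"):
        value = attributes.get(key, "")
        if value:
            return value
    return ""


# Attributes taken from the first pseudogene exon line shown earlier
attrs = {"ID": "id-LOC123327042", "Parent": "gene-LOC123327042", "gene": "LOC123327042"}
print(pick_transcript_id(attrs))  # gene-LOC123327042
```

Whether a gene ID is an acceptable stand-in for a transcript ID depends on the tool that reads the GTF, so treat this as a starting point rather than a fix.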