Basic Information
The GFF3 (Generic Feature Format Version 3) file format represents genomic features in a simple, tab-delimited text file.
A GFF3 file has nine fields (seqid, source, feature, start, end, score, strand, phase, and attributes).
Lines that start with ‘##’ provide meta-information about the file, and lines that start with ‘#’ are human-readable comments.
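To make the layout concrete, here is a minimal sketch that sorts lines into meta-information, comments, and nine-field feature lines. The function name classify_line and the sample comment line are made up for illustration; the exon line is abbreviated from the NCBI example in this post.

```python
GFF3_COLUMNS = ['seqid', 'source', 'feature', 'start', 'end',
                'score', 'strand', 'phase', 'attributes']

def classify_line(line):
    """Return a (kind, payload) pair for one GFF3 line (hypothetical helper)."""
    line = line.rstrip('\n')
    if line.startswith('##'):
        # Meta-information such as the version header
        return 'directive', line[2:].strip()
    if line.startswith('#'):
        # Human-readable comment
        return 'comment', line[1:].strip()
    fields = line.split('\t')
    if len(fields) == 9:
        # Feature line: map the nine columns to their names
        return 'feature', dict(zip(GFF3_COLUMNS, fields))
    return 'other', line

example = [
    '##gff-version 3',
    '# made-up comment line',
    'chrxxx\tGnomon\texon\t46964\t46999\t.\t-\t.\tID=id-LOC123327042;gene=LOC123327042',
]
for raw in example:
    kind, payload = classify_line(raw)
    print(kind, payload)
```

Checking for ‘##’ before ‘#’ matters, because every meta-information line also starts with ‘#’.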
We sometimes need to convert GFF3 to GTF format.
I previously used gffread for this. It generally works well, but it has a problem with a GFF3 file I downloaded from NCBI.
This file has some lines like:
chrxxx  Gnomon  exon    46964   46999   .   -   .   ID=id-LOC123327042;Parent=gene-LOC123327042;Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 74%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
chrxxx  Gnomon  exon    47054   47468   .   -   .   ID=id-LOC123327042-2;Parent=gene-LOC123327042;Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 74%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
chrxxx  Gnomon  exon    47542   47661   .   -   .   ID=id-LOC123327042-3;Parent=gene-LOC123327042;Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 74%25 coverage of the annotated genomic feature by RNAseq alignments;pseudo=true
When I ran gffread, it could not handle these lines containing “ID=id*”, and the gene_id attribute was lost in the output GTF file.
I realized that I needed to write a script myself.
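The exon lines above point directly to the gene through Parent and also carry a gene attribute, so the gene identifier can be recovered from the attributes column alone. Here is a minimal sketch of that idea on one of the sample attribute strings. Decoding the percent-escapes (%2C, %25) with urllib.parse.unquote is an optional extra shown here for illustration.

```python
from urllib.parse import unquote

attributes_str = ('ID=id-LOC123327042;Parent=gene-LOC123327042;'
                  'Dbxref=GeneID:123327042;gbkey=exon;gene=LOC123327042;'
                  'model_evidence=Supporting evidence includes similarity to: '
                  '1 Protein%2C and 74%25 coverage of the annotated genomic '
                  'feature by RNAseq alignments;pseudo=true')

attributes = {}
# Split on ';' first and decode afterwards, so an escaped character
# can never be mistaken for a separator.
for pair in attributes_str.split(';'):
    if '=' in pair:
        key, value = pair.split('=', 1)
        attributes[key] = unquote(value)

print(attributes['gene'])            # usable as gene_id
print(attributes['Parent'])          # the exon's parent is the gene itself
print(attributes['model_evidence'])  # escapes decoded to ',' and '%'
```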
Script
import sys
def parse_gff_attributes(attributes_str):
    """Parse GFF attributes into a dictionary."""
    attributes = {}
    for attr in attributes_str.split(';'):
        if attr:
            key_value = attr.split('=', 1)
            if len(key_value) == 2:
                key, value = key_value
                attributes[key] = value
    return attributes
def convert_gff_to_gtf(gff_file, gtf_file):
    """Convert GFF3 to GTF format, retaining ID, gene_id, and transcript_id."""
    with open(gff_file, 'r') as gff, open(gtf_file, 'w') as gtf:
        for line in gff:
            if line.startswith('#'):
                gtf.write(line)
                continue
            
            fields = line.strip().split('\t')
            if len(fields) != 9:
                continue
                
            seqid, source, feature, start, end, score, strand, phase, attributes_str = fields
            attributes = parse_gff_attributes(attributes_str)
            
            # Skip features that are not relevant for GTF (e.g., region)
            if feature not in ['gene', 'mRNA', 'exon', 'CDS', 'lnc_RNA', 'pseudogene']:
                continue
                
            # Determine feature type for GTF
            gtf_feature = feature
            if feature == 'mRNA' or feature == 'lnc_RNA':
                gtf_feature = 'transcript'
            
            # Extract required attributes
            gene_id = attributes.get('gene', '')
            transcript_id = attributes.get('transcript_id', '')
            feature_id = attributes.get('ID', '')
            
            # Build GTF attributes string
            gtf_attributes = []
            if feature == 'gene':
                gtf_attributes.append(f'gene_id "{gene_id}"')
                if 'Name' in attributes:
                    gtf_attributes.append(f'gene_name "{attributes["Name"]}"')
            elif feature in ['mRNA', 'lnc_RNA', 'exon', 'CDS']:
                gtf_attributes.append(f'gene_id "{gene_id}"')
                gtf_attributes.append(f'transcript_id "{transcript_id}"')
                if 'Name' in attributes and feature in ['mRNA', 'lnc_RNA']:
                    gtf_attributes.append(f'gene_name "{attributes["Name"]}"')
                if feature == 'exon':
                    gtf_attributes.append(f'exon_id "{feature_id}"')
            elif feature == 'pseudogene':
                gtf_attributes.append(f'gene_id "{gene_id}"')
                if 'Name' in attributes:
                    gtf_attributes.append(f'gene_name "{attributes["Name"]}"')
            
            # Write GTF line
            gtf_fields = [seqid, source, gtf_feature, start, end, score, strand, phase, '; '.join(gtf_attributes)]
            gtf.write('\t'.join(gtf_fields) + '\n')
if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: python gff2gtf.py input.gff output.gtf")
        sys.exit(1)
    
    gff_file = sys.argv[1]
    gtf_file = sys.argv[2]
    convert_gff_to_gtf(gff_file, gtf_file)
    print(f"Conversion complete. GTF file written to {gtf_file}")
Usage
Using the script is straightforward: save it as gff2gtf.py and run it like this:
python gff2gtf.py input.gff output.gtf
The result looks like this:
chrxxx      Gnomon  exon    46964   46999   .       -       .       gene_id "LOC123327042"; transcript_id ""; exon_id "id-LOC123327042"
chrxxx      Gnomon  exon    47054   47468   .       -       .       gene_id "LOC123327042"; transcript_id ""; exon_id "id-LOC123327042-2"
chrxxx      Gnomon  exon    47542   47661   .       -       .       gene_id "LOC123327042"; transcript_id ""; exon_id "id-LOC123327042-3"
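The transcript_id is empty in these lines because the script reads it from a transcript_id attribute, and the sample exon lines have none. If a downstream tool needs a non-empty value, one possible workaround is to fall back to the Parent value and then to the feature's own ID. This is an assumption for illustration, not something the script above does, and pick_transcript_id is a made-up helper name.

```python
def pick_transcript_id(attributes):
    """Return transcript_id, falling back to Parent, then ID (hypothetical helper)."""
    for key in ('transcript_id', 'Parent', 'ID'):
        value = attributes.get(key, '')
        if value:
            return value
    return ''

# Attributes of the first sample exon, already parsed into a dictionary
attrs = {
    'ID': 'id-LOC123327042',
    'Parent': 'gene-LOC123327042',
    'gene': 'LOC123327042',
}
print(pick_transcript_id(attrs))
```

With this fallback, the three exons of the pseudogene would share the same transcript_id because they share the same Parent.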