GTF

scan(path)

Scan a GFF or GTF file.

Parameters:

Name  Type        Description                       Default
path  str | Path  The path to the GFF or GTF file.  required

Returns:

Type       Description
LazyFrame  A lazy frame with columns: seqname, source, feature, start, end, score, strand, frame, attribute.

Source code in python/seqpro/gtf.py
def scan(path: str | Path) -> pl.LazyFrame:
    """Scan a GFF or GTF file.

    Parameters
    ----------
    path
        The path to the GFF or GTF file.

    Returns
    -------
    pl.LazyFrame
        A lazy frame with columns: seqname, source, feature, start, end, score, strand, frame, attribute.
    """
    REQUIRED_COLUMNS = [
        "seqname",
        "source",
        "feature",
        "start",
        "end",
        "score",
        "strand",
        "frame",
        "attribute",
    ]

    DEFAULT_COLUMN_DTYPES = {
        "seqname": pl.Categorical,
        "source": pl.Categorical,
        "start": pl.Int64,
        "end": pl.Int64,
        "score": pl.Float32,
        "feature": pl.Categorical,
        "strand": pl.Categorical,
        "frame": pl.UInt32,
    }

    return pl.scan_csv(
        path,
        has_header=False,
        separator="\t",
        comment_prefix="#",
        null_values=".",
        new_columns=REQUIRED_COLUMNS,
        schema_overrides=DEFAULT_COLUMN_DTYPES,
    ).with_columns(
        pl.col("frame").fill_null(0),
        pl.col("attribute").str.replace_all('"', "'"),
    )
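
To make the parsing options above concrete, here is a plain-Python sketch of what they amount to for a single line. It is an illustration only, not how seqpro or Polars parse the file: `parse_line`, `COLUMNS`, and the sample record are hypothetical names and data introduced for this example.

```python
from typing import Optional

COLUMNS = ["seqname", "source", "feature", "start", "end",
           "score", "strand", "frame", "attribute"]


def parse_line(line: str) -> Optional[dict]:
    """Parse one GTF/GFF line into a dict; return None for comment lines."""
    if line.startswith("#"):  # mirrors comment_prefix="#"
        return None
    fields = line.rstrip("\n").split("\t")  # mirrors separator="\t"
    # Mirrors null_values=".": a lone dot means "missing" in any column.
    row = dict(zip(COLUMNS, (None if f == "." else f for f in fields)))
    row["start"] = int(row["start"])
    row["end"] = int(row["end"])
    row["score"] = None if row["score"] is None else float(row["score"])
    # Mirrors fill_null(0): a missing frame becomes 0.
    row["frame"] = 0 if row["frame"] is None else int(row["frame"])
    # Mirrors str.replace_all('"', "'"): double quotes become single quotes.
    if row["attribute"] is not None:
        row["attribute"] = row["attribute"].replace('"', "'")
    return row


# Hypothetical record: score and frame are both missing (".").
line = 'chr1\texample\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "ABC";'
row = parse_line(line)
header = parse_line("#!hypothetical header comment")
```

In this sketch `row["score"]` stays `None`, `row["frame"]` becomes `0`, and the attribute string comes back with single quotes, which is the form that `attr` later matches against.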

attr(attr)

Extract a column from the attribute field. GTF/GFF attribute values can be of any type, so this always returns a Utf8 column. If another type is needed, cast explicitly, e.g. seqpro.gtf.attr(attr).cast(pl.Int32).

Parameters:

Name  Type  Description                Default
attr  str   The attribute to extract.  required

Returns:

Type  Description
Expr  A Polars expression that extracts the named attribute as a Utf8 column.

Source code in python/seqpro/gtf.py
def attr(attr: str) -> pl.Expr:
    """Extract a column from the attribute field. In general, GTF/GFF attributes can be any
    type, so this always returns a Utf8 column. If an explicit cast is necessary, it can be
    done by e.g. `seqpro.gtf.attr(attr).cast(pl.Int32)`.

    Parameters
    ----------
    attr
        The attribute to extract.

    Returns
    -------
    pl.Expr
        A Polars expression that extracts the named attribute as a Utf8 column.
    """
    return (
        pl.col("attribute")
        .str.extract(rf"""{attr} ["']?([^"';]*)["']?;?""")
        .alias(attr)
    )
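
The pattern in the expression above can be explored on its own. The sketch below applies the same regular expression with Python's `re` module, which stands in for Polars' regex engine here; `extract_attr` and the sample string are hypothetical and exist only for this illustration.

```python
import re
from typing import Optional


def extract_attr(attribute: str, name: str) -> Optional[str]:
    """Apply the same pattern as attr() to a single attribute string."""
    match = re.search(rf"""{name} ["']?([^"';]*)["']?;?""", attribute)
    return match.group(1) if match else None


# Hypothetical attribute string, with single quotes as scan() leaves them.
attribute = "gene_id 'G1'; transcript_id 'T1'; exon_number 2;"

gene_id = extract_attr(attribute, "gene_id")          # quoted value
exon_number = extract_attr(attribute, "exon_number")  # unquoted value
missing = extract_attr(attribute, "gene_name")        # attribute not present
short = extract_attr(attribute, "id")                 # matches inside "gene_id"
```

Here `gene_id` is `"G1"`, `exon_number` is the string `"2"`, and `missing` is `None` (a null in Polars). The last call shows that the pattern is not anchored: a short name such as `id` also matches the tail of `gene_id`, so passing the full attribute name is the safer choice.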