Ragged Arrays¶
Ragged is SeqPro's array type for collections of variable-length sequences. It stores data contiguously in a flat NumPy array alongside an offsets array that marks segment boundaries, following the Arrow/awkward-array layout. This avoids padding and lets you apply NumPy operations and SeqPro functions directly — without a Python loop. If you're familiar with Awkward Array, Ragged is a subclass and special case of ak.Array, where there is only one "awkward" axis.
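The flat-data-plus-offsets layout described above can be sketched with plain NumPy. This is only an illustration of the layout, not SeqPro's implementation; the variable names are chosen for the example.

```python
import numpy as np

# Three variable-length segments stored back to back in one flat buffer.
data = np.array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9, 0.1, 0.7])
lengths = np.array([3, 2, 5])

# offsets has len(lengths) + 1 entries: segment i spans offsets[i]:offsets[i + 1].
offsets = np.concatenate([[0], np.cumsum(lengths)])

# Lengths can be recovered from offsets, so either one describes the layout.
recovered_lengths = np.diff(offsets)
```

Because lengths and offsets carry the same information, a constructor can accept whichever form is already at hand.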
Numeric data¶
Construction¶
The two primary constructors are Ragged.from_lengths (supply lengths for each element) and
Ragged.from_offsets (supply a pre-computed offsets array).
import numpy as np
from seqpro.rag import Ragged  # import path assumed; adjust to your SeqPro version

# per-position coverage for three intervals of different widths
data = np.array([
    0.1, 0.5, 0.3,            # interval 0: length 3
    0.8, 0.2,                 # interval 1: length 2
    0.4, 0.6, 0.9, 0.1, 0.7,  # interval 2: length 5
])
lengths = np.array([3, 2, 5])
rag = Ragged.from_lengths(data, lengths)
rag.shape # (3, None) (1)
rag.dtype # dtype('float64')
rag.lengths # array([3, 2, 5])
1. None marks the ragged dimension.
Access individual elements with standard indexing:
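In terms of the underlying layout, selecting element i amounts to slicing the flat buffer between two consecutive offsets. A minimal NumPy sketch follows. The `segment` helper is hypothetical and is not part of the Ragged API, and the type that Ragged indexing itself returns may differ.

```python
import numpy as np

data = np.array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9, 0.1, 0.7])
offsets = np.array([0, 3, 5, 10])


def segment(i: int) -> np.ndarray:
    """Return a view of element i in the flat buffer (hypothetical helper)."""
    return data[offsets[i]:offsets[i + 1]]


first = segment(0)   # the three values of interval 0
second = segment(1)  # the two values of interval 1
```

The slices are views into `data`, so no values are copied when an element is selected this way.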
Arithmetic and NumPy ufuncs¶
NumPy ufuncs and arithmetic operators are dispatched element-wise across the flat data:
scaled = rag * 2.0
shifted = rag + 1.0
normed = rag / rag.data.max()
log1p = np.log1p(rag)
rooted = np.sqrt(rag)
The result is always a new Ragged that reuses the input's offsets; the offset structure is not copied.
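To see why element-wise operations need no loop, note that a ufunc applied once to the flat buffer gives the same values as applying it segment by segment. The sketch below uses plain NumPy arrays in place of a Ragged, so it shows the idea rather than SeqPro's dispatch mechanism.

```python
import numpy as np

data = np.array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9, 0.1, 0.7])
offsets = np.array([0, 3, 5, 10])

# One vectorized call over the flat buffer...
flat_result = np.log1p(data)

# ...matches looping over segments and concatenating the results.
looped = np.concatenate(
    [np.log1p(data[s:e]) for s, e in zip(offsets[:-1], offsets[1:])]
)
same = np.allclose(flat_result, looped)
```

Since the output has one value per input value, the original offsets describe the result without modification.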
Sequence data¶
Ragged[np.bytes_] is SeqPro's representation of a collection of variable-length sequences.
Building a sequence Ragged¶
cds_seqs = ["ATGAAATAA", "ATGGGG", "ATCGAT"]
data = np.array(list("".join(cds_seqs)), dtype="S1")
lengths = np.array([len(s) for s in cds_seqs])
cds = Ragged.from_lengths(data, lengths)
cds.shape # (3, None)
cds.dtype # dtype('S1')
Translation with sp.AA.translate¶
AminoAlphabet.translate accepts Ragged[np.bytes_] and returns a new Ragged of amino-acid
sequences — each output length is input_length // 3, so variable-length CDS stays variable
after translation with no extra bookkeeping:
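The length relationship can be checked with a small pure-Python translation using the standard genetic code. This is a sketch for illustration only: the `translate` helper below is hypothetical and is not `sp.AA.translate`, and ignoring a trailing partial codon is an assumption.

```python
import numpy as np
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

cds_seqs = ["ATGAAATAA", "ATGGGG", "ATCGAT"]
nuc_lengths = np.array([len(s) for s in cds_seqs])


def translate(seq: str) -> str:
    """Translate whole codons; a trailing partial codon is ignored (assumption)."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


proteins = [translate(s) for s in cds_seqs]
aa_lengths = np.array([len(p) for p in proteins])
```

Here `aa_lengths` equals `nuc_lengths // 3`, so the output offsets follow directly from the input lengths.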
Pass truncate_stop=True to drop everything after the first stop codon (the stop itself is kept):
# truncate_stop=False
# ATGTAAAAA → M * K
# truncate_stop=True
# ATGTAAAAA → M * (stop kept, residues after it dropped)
aa_trunc = sp.AA.translate(cds, truncate_stop=True)
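The truncation rule shown in the comments above (keep the first stop, drop what follows) can be sketched on plain strings. The `truncate_at_stop` helper is hypothetical, and how SeqPro implements this internally is not assumed here.

```python
def truncate_at_stop(protein: str, stop: str = "*") -> str:
    """Keep residues up to and including the first stop (hypothetical helper)."""
    idx = protein.find(stop)
    return protein if idx == -1 else protein[: idx + 1]


proteins = ["M*K", "MG", "MK*"]
truncated = [truncate_at_stop(p) for p in proteins]
new_lengths = [len(p) for p in truncated]
```

Sequences without a stop are left unchanged, so the output stays ragged and only the lengths of truncated entries shrink.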
OHE Ragged translation¶
AminoAlphabet.translate also accepts one-hot encoded ragged arrays (Ragged[np.uint8]). Provide
nuc_alphabet so SeqPro knows how to decode the OHE encoding:
ohe_data = np.concatenate([sp.DNA.ohe(sp.cast_seqs(s)) for s in cds_seqs])
# ohe_data has shape (total_nucleotides, 4)
cds_ohe = Ragged.from_lengths(ohe_data, lengths)
aa_ohe = sp.AA.translate(cds_ohe, nuc_alphabet=sp.DNA)
aa_ohe.dtype # dtype('uint8') — output is OHE amino acids
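The role of the nucleotide alphabet can be illustrated with a NumPy round trip between bytes and one-hot rows. This is a sketch only: the A, C, G, T channel order is an assumption for the example, and it does not use SeqPro's own encoder.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"), dtype="S1")  # assumed channel order

seq = np.array(list("ATGAAATAA"), dtype="S1")

# Encode: compare each position against every alphabet letter -> shape (9, 4).
ohe = (seq[:, None] == ALPHABET[None, :]).astype(np.uint8)

# Decode: the index of the hot channel selects the nucleotide.
decoded = ALPHABET[ohe.argmax(axis=-1)]
```

Without knowing which channel corresponds to which nucleotide, the one-hot rows cannot be mapped back to codons, which is why the alphabet has to be supplied.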
Record Ragged (structure-of-arrays)¶
A record Ragged holds multiple named fields that share the same ragged structure. This is
the structure-of-arrays (SoA) pattern: one offsets array, multiple data arrays.
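As a rough picture of the structure-of-arrays idea, the sketch below keeps one offsets array and a dict of flat field arrays in plain NumPy. The `record` helper is hypothetical and only illustrates how a single set of bounds slices every field.

```python
import numpy as np

# One offsets array describes the ragged structure for every field.
offsets = np.array([0, 3, 5, 8])
fields = {
    "score": np.array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9]),
    "flag": np.array([1, 0, 1, 1, 0, 0, 1, 1], dtype=np.int8),
}


def record(i: int) -> dict:
    """Slice every field with the same bounds (hypothetical helper)."""
    start, stop = offsets[i], offsets[i + 1]
    return {name: flat[start:stop] for name, flat in fields.items()}


middle = record(1)
```

Each field keeps its own dtype, and adding a field means adding one more flat array rather than another copy of the offsets.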
Building a record Ragged¶
Use ak.zip on two or more Ragged arrays of the same length and offsets:
import awkward as ak
scores = np.array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9])
flags = np.array([1, 0, 1, 1, 0, 0, 1, 1], dtype=np.int8)
lengths = np.array([3, 2, 3])
r_scores = Ragged.from_lengths(scores, lengths)
r_flags = Ragged.from_lengths(flags, lengths)
rec = ak.zip({"score": r_scores, "flag": r_flags})
# ak.zip returns a Ragged automatically when inputs are Ragged
Inspecting fields¶
rec.shape # (3, None)
rec.lengths # array([3, 2, 3])
rec.dtype
# {"score": dtype('float64'), "flag": dtype('int8')}
Field access¶
Fields are accessed by key or attribute. Both paths are zero-copy — the returned Ragged
shares the parent's offsets array:
rec["score"] # Ragged[float64] — key access
rec.score # same, attribute-style
# all fields share exactly the same offsets object
assert rec["score"].offsets is rec["flag"].offsets
Flat data access¶
Ragged.data on a record Ragged returns a dict of flat NumPy arrays, one per field:
d = rec.data
d["score"] # array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9])
d["flag"] # array([1, 0, 1, 1, 0, 0, 1, 1], dtype=int8)
For per-field operations, access the field first, then use ufuncs:
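A minimal sketch of the per-field pattern, using plain NumPy arrays in place of the flat field data. With a record Ragged, the equivalent would presumably start from `rec["score"]` and `rec["flag"]`, but that usage is not shown here.

```python
import numpy as np

offsets = np.array([0, 3, 5, 8])
score = np.array([0.1, 0.5, 0.3, 0.8, 0.2, 0.4, 0.6, 0.9])
flag = np.array([1, 0, 1, 1, 0, 0, 1, 1], dtype=np.int8)

# Element-wise ufunc on one field; offsets are reused unchanged.
log_score = np.log1p(score)

# Fields line up position by position, so one can mask another.
masked = np.where(flag == 1, score, 0.0)
```

Both results have one value per position, so they fit the same offsets as the fields they were computed from.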
API reference¶
See the Ragged API reference for the full method and property listing.