API Reference
bin_coverage(coverage, bin_width, length_axis, normalize=False)
Bin coverage by summing over non-overlapping windows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| coverage | | | required |
| bin_width | int | Width of the windows to sum over. Must divide the length of the coverage array evenly; otherwise an error is raised. | required |
| length_axis | int | | required |
| normalize | bool | Whether to normalize by the length of the bin. | False |

Returns:
| Type | Description |
|---|---|
| NDArray[number] | Coverage summed into bins of width bin_width. |
Source code in python/seqpro/_modifiers.py
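
The windowed sum described above can be illustrated with plain NumPy. This is a minimal sketch of the documented behavior, not seqpro's implementation; the name `bin_coverage_sketch` is hypothetical and the function does not call the library.

```python
import numpy as np


def bin_coverage_sketch(coverage, bin_width, length_axis, normalize=False):
    """Sum coverage over non-overlapping windows (illustrative only)."""
    coverage = np.asarray(coverage)
    length = coverage.shape[length_axis]
    if length % bin_width != 0:
        raise ValueError("bin_width must divide the length of the coverage array")
    # Move the length axis last, split it into (n_bins, bin_width), sum each window.
    moved = np.moveaxis(coverage, length_axis, -1)
    binned = moved.reshape(*moved.shape[:-1], length // bin_width, bin_width).sum(-1)
    if normalize:
        # Normalizing by the bin length turns sums into per-bin means.
        binned = binned / bin_width
    return np.moveaxis(binned, -1, length_axis)


cov = np.arange(12).reshape(2, 6)
binned = bin_coverage_sketch(cov, bin_width=3, length_axis=1)
print(binned)
```

With a length of 6 and `bin_width=3`, each row collapses to two bins; a `bin_width` of 4 would raise because it does not divide 6.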
cast_seqs(seqs)
Cast any sequence type to be a NumPy array of ASCII characters (or left alone as 8-bit unsigned integers if the input is OHE).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | | required |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_ \| uint8] | S1 byte array, or unchanged uint8 array if input is OHE. |
Source code in python/seqpro/_utils.py
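
To make the "S1 byte array" wording concrete, the NumPy-only sketch below shows one way a list of strings can be viewed as an array of single ASCII bytes. It does not call `cast_seqs`, and the equal-length input is an assumption of the sketch rather than a documented constraint.

```python
import numpy as np

seqs = ["ACGT", "TTAA"]  # hypothetical equal-length input

# Store each string as a fixed-width byte string (dtype S4 here), then
# reinterpret every 4-byte item as four 1-byte items along a new last axis.
fixed_width = np.array(seqs, dtype="S")
s1 = fixed_width.reshape(-1, 1).view("S1")

print(s1.shape, s1.dtype)
```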
decode_ohe(seqs, ohe_axis, alphabet, unknown_char='N')
Convert an OHE array to an S1 byte array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | NDArray[uint8] \| Ragged[uint8] | OHE array. Ragged input must have shape (n, ~L, A, ...) as produced by ohe(). | required |
| ohe_axis | int | Axis of the one-hot dimension. Ignored for Ragged input (always axis 1 of flat data). | required |
| alphabet | NucleotideAlphabet \| AminoAlphabet | | required |
| unknown_char | str | Single character to use for unknown values, by default "N". | 'N' |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_] \| Ragged[bytes_] | S1 byte array of decoded characters; ohe_axis is removed from the shape. |
Source code in python/seqpro/_encoders.py
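
The decoding step can be sketched with NumPy alone. The example assumes a plain `ACGT` alphabet order and assumes that an all-zero one-hot vector marks an unknown position; both are assumptions of this sketch, and `decode_ohe_sketch` is a hypothetical name that does not call seqpro.

```python
import numpy as np

ALPHABET = np.frombuffer(b"ACGT", dtype="S1")  # assumed alphabet order


def decode_ohe_sketch(ohe, ohe_axis, unknown_char="N"):
    """Map one-hot vectors back to S1 bytes (illustrative only)."""
    ohe = np.asarray(ohe)
    # The index of the hot entry selects the character; the one-hot axis disappears.
    decoded = ALPHABET[ohe.argmax(axis=ohe_axis)]
    # Assumption: positions with no hot entry are treated as unknown.
    decoded[ohe.sum(axis=ohe_axis) == 0] = unknown_char.encode("ascii")
    return decoded


ohe = np.array(
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 1]], dtype=np.uint8
)
decoded = decode_ohe_sketch(ohe, ohe_axis=-1)
print(decoded.tobytes())
```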
decode_tokens(seqs, token_map, unknown_char='N')
Untokenize sequences. Maps each integer token back to its character. Tokens absent from token_map are replaced with unknown_char.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | NDArray[int32] \| Ragged[int32] | Token ID array. Ragged input must have dtype np.int32. | required |
| token_map | dict[str, int] | Mapping of characters to tokens (same map used for tokenization). | required |
| unknown_char | str | Character to replace unknown tokens with, by default 'N'. | 'N' |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_] \| Ragged[bytes_] | S1 byte array with the same shape/layout as the input. |
Source code in python/seqpro/_encoders.py
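
As a rough illustration of the mapping described above, the sketch below inverts a character-to-token map with NumPy. The name `decode_tokens_sketch` and the example `token_map` are hypothetical, and the code does not call seqpro.

```python
import numpy as np


def decode_tokens_sketch(tokens, token_map, unknown_char="N"):
    """Map integer tokens back to S1 bytes (illustrative only)."""
    tokens = np.asarray(tokens)
    # Start from "everything is unknown", then fill in each known token.
    out = np.full(tokens.shape, unknown_char.encode("ascii"), dtype="S1")
    for char, token in token_map.items():
        out[tokens == token] = char.encode("ascii")
    return out


token_map = {"A": 0, "C": 1, "G": 2, "T": 3}  # example map, not a library default
tokens = np.array([[0, 1, 2, 3], [3, 7, 0, -1]], dtype=np.int32)
chars = decode_tokens_sketch(tokens, token_map)
print(chars.tobytes())
```

Tokens 7 and -1 are absent from the map, so they come back as the unknown character while the array shape is unchanged.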
gc_content(seqs, normalize=True, length_axis=None, alphabet=None, ohe_axis=None)
Compute the number or proportion of G & C nucleotides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | | required |
| normalize | bool | If True, return proportions; if False, return counts. | True |
| length_axis | int \| None | Needed if seqs is an array. | None |
| alphabet | NucleotideAlphabet \| None | Needed if seqs is OHE. | None |
| ohe_axis | int \| None | Needed if seqs is OHE. | None |

Returns:
| Type | Description |
|---|---|
| NDArray[integer \| float64] | Integers if unnormalized, otherwise floats. |
Source code in python/seqpro/_analyzers.py
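
The difference between counts and proportions can be seen in a small NumPy-only sketch. It operates on an S1 byte array with the length on the last axis, which is an assumption of the example; it does not call `gc_content`.

```python
import numpy as np

# Three sequences of length 4 as an S1 byte array (length on the last axis).
seqs = np.array([list("GGCC"), list("ATGC"), list("ATAT")], dtype="S1")

is_gc = (seqs == b"G") | (seqs == b"C")
counts = is_gc.sum(axis=-1)            # corresponds to normalize=False
proportions = counts / seqs.shape[-1]  # corresponds to normalize=True

print(counts, proportions)
```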
jitter(*arrays, max_jitter, length_axis, jitter_axes, seed=None)
Randomly jitter data from arrays, using the same jitter across arrays.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *arrays | NDArray[DTYPE] | Arrays to be jittered. All arrays must have the same size along the jitter and length axes. | () |
| max_jitter | int | Maximum jitter amount. | required |
| length_axis | int | | required |
| jitter_axes | int \| tuple[int, ...] | Each slice along the jitter axes will be randomly jittered independently. Thus, if jitter_axes = 0, then every slice of data along axis 0 would be jittered independently. If jitter_axes = (0, 1), then each slice along axes 0 and 1 would be randomly jittered independently. | required |
| seed | int \| Generator \| None | Random seed or generator, by default None. | None |

Returns:
| Type | Description |
|---|---|
| tuple[NDArray[DTYPE], ...] | Jittered arrays. Each will have a new length equal to length - 2*max_jitter. |

Raises:
| Type | Description |
|---|---|
| ValueError | If any arrays have insufficient length to be jittered. |
Source code in python/seqpro/_modifiers.py
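
The idea of sharing one random offset across several arrays can be sketched as follows. For brevity the sketch fixes the jitter axis to 0 and the length axis to -1, and it treats any output length below 1 as "insufficient"; these choices and the name `jitter_sketch` are assumptions, not seqpro's implementation.

```python
import numpy as np


def jitter_sketch(*arrays, max_jitter, seed=None):
    """Crop every array with the same per-row random offset (illustrative only)."""
    rng = np.random.default_rng(seed)  # accepts an int, a Generator, or None
    n, length = arrays[0].shape[0], arrays[0].shape[-1]
    new_length = length - 2 * max_jitter
    if new_length < 1:
        raise ValueError("arrays are too short for this max_jitter")
    # One start offset per slice along axis 0, shared by all arrays.
    starts = rng.integers(0, 2 * max_jitter + 1, size=n)
    idx = starts[:, None] + np.arange(new_length)  # shape (n, new_length)
    out = []
    for a in arrays:
        # Insert singleton axes so the indices broadcast over any middle axes.
        shape = (n,) + (1,) * (a.ndim - 2) + (new_length,)
        out.append(np.take_along_axis(a, idx.reshape(shape), axis=-1))
    return tuple(out)


# Position indices as values make it easy to see which window was kept.
seqs = np.tile(np.arange(10), (2, 1))       # shape (2, 10)
tracks = np.tile(np.arange(10), (2, 3, 1))  # shape (2, 3, 10)
jit_seqs, jit_tracks = jitter_sketch(seqs, tracks, max_jitter=2, seed=0)
print(jit_seqs.shape, jit_tracks.shape)
```

Because both arrays are indexed with the same offsets, a sequence and its coverage tracks stay aligned after cropping.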
k_shuffle(seqs, k, alphabet, *, length_axis=None, ohe_axis=None, seed=None)
Shuffle sequences while preserving k-let frequencies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | | required |
| k | int | Size of k-lets to preserve frequencies of. | required |
| alphabet | NucleotideAlphabet | Alphabet, needed for OHE sequence input. | required |
| length_axis | int \| None | Needed for array input. Axis that corresponds to the length of sequences. | None |
| ohe_axis | | Needed for OHE input. Axis that corresponds to the one-hot encoding; should be the same size as the length of the alphabet. | None |
| seed | int \| Generator \| None | Seed or generator for shuffling. When given a fixed integer seed, the same … | None |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_ \| uint8] | Shuffled sequences as bytes (S1) or uint8 for string or OHE input, respectively. |
Source code in python/seqpro/_modifiers.py
length(seqs, length_axis=None)
Calculate the length of each sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | Sequences. For arrays, … | required |
| length_axis | int \| None | Axis to count non-empty characters along. Defaults to the last axis. | None |

Returns:
| Type | Description |
|---|---|
| NDArray[integer] | Array containing the length of each sequence; … |
Source code in python/seqpro/_analyzers.py
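
Counting non-empty characters along an axis can be illustrated with NumPy. The sketch assumes sequences are stored in an S1 array and padded with empty bytes, which NumPy stores as null bytes; it does not call `length`.

```python
import numpy as np

# Second sequence is "AC" padded with two empty characters.
seqs = np.array([list("ACGT"), ["A", "C", "", ""]], dtype="S1")

# Empty S1 items are null bytes, so non-empty means a non-zero byte value.
lengths = (seqs.view(np.uint8) != 0).sum(axis=-1)
print(lengths)
```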
nucleotide_content(seqs, normalize=True, length_axis=None, alphabet=None)
Compute the number or proportion of each nucleotide.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | | required |
| normalize | bool | If True, return proportions; if False, return counts. | True |
| length_axis | int \| None | Needed if seqs is an array. | None |

Returns:
| Type | Description |
|---|---|
| NDArray[integer \| floating] | Integers if unnormalized, otherwise floats. |
Source code in python/seqpro/_analyzers.py
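
A per-nucleotide tally can be sketched in a few lines of NumPy. The `ACGT` order and the placement of the per-nucleotide counts on a new last axis are assumptions of this example, not a statement about the layout `nucleotide_content` returns.

```python
import numpy as np

seqs = np.array([list("AACG"), list("TTTT")], dtype="S1")
alphabet = b"ACGT"  # assumed nucleotide order

# Count each letter along the length axis, then stack the counts on a new last axis.
counts = np.stack(
    [(seqs == alphabet[i : i + 1]).sum(axis=-1) for i in range(len(alphabet))],
    axis=-1,
)
proportions = counts / seqs.shape[-1]
print(counts, proportions)
```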
ohe(seqs, alphabet)
One hot encode sequences against an alphabet.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | StrSeqType \| Ragged[bytes_] | Sequences to encode. Ragged input must have dtype np.bytes_ (S1). | required |
| alphabet | NucleotideAlphabet \| AminoAlphabet | | required |

Returns:
| Type | Description |
|---|---|
| NDArray[uint8] \| Ragged[uint8] | One-hot encoded sequences. Dense output has shape (..., length, alphabet_size). Ragged output has shape (n, ~L, A). |
Source code in python/seqpro/_encoders.py
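
The dense output shape (..., length, alphabet_size) follows naturally from a broadcasted comparison, as the NumPy-only sketch below shows. The `ACGT` order is assumed, and the all-zero encoding of characters outside the alphabet is a property of this sketch, not a documented guarantee of `ohe`.

```python
import numpy as np

alphabet = np.frombuffer(b"ACGT", dtype=np.uint8)  # assumed alphabet order
seqs = np.array([list("ACGT"), list("GGNA")], dtype="S1")

# Compare every character against every alphabet letter: the new last axis
# has one slot per letter, giving shape (..., length, alphabet_size).
one_hot = (seqs.view(np.uint8)[..., None] == alphabet).astype(np.uint8)
print(one_hot.shape)
```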
pad_seqs(seqs, pad, pad_value=None, length=None, length_axis=None)
Pad (or truncate) sequences on either the left, right, or both sides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | | required |
| pad | Literal['left', 'both', 'right'] | How to pad. If padding on both sides and an odd amount of padding is needed, 1 more pad value will be on the right side. Similarly for truncating, if an odd number of characters needs to be truncated, 1 more character will be truncated from the right side. | required |
| pad_value | | Single character to pad sequences with. Needed for string input. Ignored for OHE sequences. | None |
| length | int \| None | Needed for character or OHE array input. Length to pad or truncate sequences to. If not given, uses the length of the longest sequence. | None |
| length_axis | int \| None | Needed for array input. | None |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_ \| uint8] | Padded (or truncated) sequences as S1 bytes or uint8 for OHE input. |
Source code in python/seqpro/_encoders.py
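
The odd-amount rule for pad='both' is easiest to see on plain Python strings. The helper below is a hypothetical sketch of that rule only; it does not call `pad_seqs` and ignores array and OHE input.

```python
def pad_both_sketch(seq: str, length: int, pad_value: str = "N") -> str:
    """Pad or truncate on both sides; the right side absorbs any odd remainder."""
    diff = length - len(seq)
    if diff >= 0:
        left = diff // 2  # the extra pad value, if any, goes on the right
        return pad_value * left + seq + pad_value * (diff - left)
    cut = -diff
    left = cut // 2  # the extra truncated character, if any, comes from the right
    return seq[left : len(seq) - (cut - left)]


padded = pad_both_sketch("ACG", 6)       # 3 pad values needed: 1 left, 2 right
truncated = pad_both_sketch("ACGTA", 2)  # 3 characters removed: 1 left, 2 right
print(padded, truncated)
```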
random_seqs(shape, alphabet, seed=None)
Generate random nucleotide sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shape | int \| tuple[int, ...] | Shape of sequences to generate. | required |
| alphabet | NucleotideAlphabet | Alphabet to sample nucleotides from. | required |
| seed | int \| Generator \| None | Random seed or generator. | None |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_] | Randomly generated sequences of the requested shape. |
Source code in python/seqpro/_modifiers.py
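
Sampling from an alphabet into a given shape can be sketched with NumPy's random generator. The plain `ACGT` byte array stands in for a NucleotideAlphabet and uniform sampling is an assumption; the snippet does not call `random_seqs`.

```python
import numpy as np

alphabet = np.frombuffer(b"ACGT", dtype="S1")  # stand-in for an alphabet object
rng = np.random.default_rng(0)                 # an int, a Generator, or None all work here

# Draw one alphabet letter per position of the requested shape.
seqs = rng.choice(alphabet, size=(2, 5))
print(seqs.shape, seqs.dtype)
```

Reusing the same integer seed reproduces the same draw, which is convenient for tests.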
reverse_complement(seqs, alphabet, length_axis=None, ohe_axis=None)
Reverse complement a sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | SeqType | | required |
| alphabet | NucleotideAlphabet | | required |
| length_axis | int \| None | Needed for array input. Length axis, by default None. | None |
| ohe_axis | int \| None | Needed for OHE input. One-hot encoding axis, by default None. | None |

Returns:
| Type | Description |
|---|---|
| NDArray[bytes_ \| uint8] | Reverse-complemented sequences as S1 bytes or uint8 for OHE input. |
Source code in python/seqpro/_modifiers.py
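
For S1 byte arrays, reverse complementing amounts to a byte lookup followed by a flip of the length axis. The sketch below assumes DNA with the pairs A/T and C/G and the length on the last axis; `reverse_complement_sketch` is a hypothetical name and does not call seqpro.

```python
import numpy as np


def reverse_complement_sketch(seqs):
    """Complement via a 256-entry lookup table, then reverse the last axis."""
    lut = np.arange(256, dtype=np.uint8)  # identity for bytes outside ACGT
    lut[np.frombuffer(b"ACGT", dtype=np.uint8)] = np.frombuffer(b"TGCA", dtype=np.uint8)
    return lut[seqs.view(np.uint8)][..., ::-1].view("S1")


seqs = np.array([list("AACG"), list("TTTT")], dtype="S1")
rc = reverse_complement_sketch(seqs)
print(rc.tobytes())
```

Applying the function twice returns the original sequences, which is a quick sanity check for any reverse-complement routine.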
tokenize(seqs, token_map, unknown_token, out=None, *, parallel=None)
Tokenize sequences. Maps each character to its integer token. Characters absent from token_map are replaced with unknown_token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqs | StrSeqType \| Ragged[bytes_] | Sequences to tokenize. Ragged input must have dtype np.bytes_ (S1). | required |
| token_map | dict[str, int] | Mapping of characters to tokens. | required |
| unknown_token | int | Token to use for unknown values. | required |
| out | NDArray[int32] \| None | Output array to store the result in. Only valid for non-Ragged input. Must have dtype … | None |
| parallel | bool \| None | Escape hatch overriding the size-based heuristic for choosing between the single-threaded … | None |

Returns:
| Type | Description |
|---|---|
| NDArray[int32] \| Ragged[int32] | Integer token IDs with the same shape/layout as the input. |
Source code in python/seqpro/_encoders.py
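
Character-to-token mapping with a fallback for unknown characters can be sketched with a 256-entry lookup table. The function name and the example `token_map` are hypothetical; the sketch handles dense S1 arrays only and leaves out the `out` and `parallel` options.

```python
import numpy as np


def tokenize_sketch(seqs, token_map, unknown_token):
    """Map S1 bytes to int32 tokens via a lookup table (illustrative only)."""
    # Every possible byte value starts as unknown; known characters overwrite it.
    lut = np.full(256, unknown_token, dtype=np.int32)
    for char, token in token_map.items():
        lut[ord(char)] = token
    return lut[seqs.view(np.uint8)]


token_map = {"A": 0, "C": 1, "G": 2, "T": 3}  # example map, not a library default
seqs = np.array([list("ACGT"), list("NNAC")], dtype="S1")
tokens = tokenize_sketch(seqs, token_map, unknown_token=-1)
print(tokens)
```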