API Reference
bin_coverage(coverage, bin_width, length_axis, normalize=False)
Bin coverage by summing over non-overlapping windows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `coverage` | | Coverage array to bin. | required |
| `bin_width` | `int` | Width of the windows to sum over. Must evenly divide the length of the coverage array; otherwise an error is raised. | required |
| `length_axis` | `int` | Axis that corresponds to the length of the coverage array. | required |
| `normalize` | `bool` | Whether to normalize by the length of the bin. | `False` |
Returns:
| Type | Description |
|---|---|
| `NDArray[number]` | Coverage summed into bins of width `bin_width`. |
Source code in python/seqpro/_modifiers.py
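The windowed sum described above can be illustrated with plain NumPy. This is a minimal sketch, not seqpro's implementation: `bin_coverage_sketch` is a hypothetical name, and the reshape-then-sum approach is an assumption about one reasonable way to get the documented behavior.

```python
import numpy as np


def bin_coverage_sketch(coverage, bin_width, length_axis, normalize=False):
    """Hypothetical NumPy-only illustration of summing non-overlapping windows."""
    coverage = np.asarray(coverage)
    length_axis = length_axis % coverage.ndim  # allow negative axes
    length = coverage.shape[length_axis]
    if length % bin_width != 0:
        raise ValueError("bin_width must evenly divide the length of the coverage array.")
    # Split the length axis into (n_bins, bin_width), then sum over each window.
    new_shape = (
        coverage.shape[:length_axis]
        + (length // bin_width, bin_width)
        + coverage.shape[length_axis + 1 :]
    )
    binned = coverage.reshape(new_shape).sum(axis=length_axis + 1)
    if normalize:
        # Dividing by the bin width turns each sum into a per-bin mean.
        binned = binned / bin_width
    return binned


cov = np.arange(12).reshape(2, 6)
binned = bin_coverage_sketch(cov, bin_width=3, length_axis=1)
normalized = bin_coverage_sketch(cov, bin_width=3, length_axis=1, normalize=True)
```

With a length of 6 and `bin_width=3`, each row collapses to two bins; a `bin_width` of 4 would raise because it does not divide 6.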
cast_seqs(seqs)
Cast any sequence type to a NumPy array of ASCII characters. One-hot encoded (OHE) input is left unchanged as 8-bit unsigned integers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `SeqType` | Sequences to cast. | required |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_ \| uint8]` | S1 byte array, or unchanged uint8 array if input is OHE. |
Source code in python/seqpro/_utils.py
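To make the S1 representation concrete, here is a small NumPy-only sketch of the kind of casting described above. `cast_seqs_sketch` is a hypothetical helper and the byte-buffer approach is an assumption, not necessarily how seqpro does it.

```python
import numpy as np


def cast_seqs_sketch(seqs):
    """Hypothetical illustration: strings become S1 arrays, uint8 input is returned as-is."""
    arr = np.asarray(seqs)
    if arr.dtype == np.uint8:
        return arr  # treated as one-hot encoded, left unchanged
    if arr.dtype.kind == "U":
        arr = arr.astype("S")  # unicode strings -> fixed-width ASCII bytes
    if arr.dtype.kind != "S":
        raise TypeError("Expected strings, bytes, or a uint8 array.")
    width = arr.dtype.itemsize
    if width == 1:
        return arr
    # Reinterpret each fixed-width string as `width` single characters,
    # which adds a trailing length axis.
    chars = np.frombuffer(arr.tobytes(), dtype="S1")
    return chars.reshape(arr.shape + (width,)).copy()


cast = cast_seqs_sketch(["ACGT", "TTGA"])
ohe_input = np.eye(4, dtype=np.uint8)
untouched = cast_seqs_sketch(ohe_input)
```

In this sketch, strings of unequal length are padded with empty bytes by the fixed-width conversion.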
decode_ohe(seqs, ohe_axis, alphabet, unknown_char='N')
Convert an OHE array to an S1 byte array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `NDArray[uint8]` | One-hot encoded sequences. | required |
| `ohe_axis` | `int` | Axis that corresponds to the one hot encoding. | required |
| `alphabet` | `NucleotideAlphabet \| AminoAlphabet` | | required |
| `unknown_char` | `str` | Single character to use for unknown values. | `'N'` |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_]` | S1 byte array of decoded characters. |
Source code in python/seqpro/_encoders.py
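The decoding step can be sketched with NumPy alone. The helper below is hypothetical: it takes the alphabet as a plain string rather than a `NucleotideAlphabet`, and it assumes that any position without exactly one hot entry should decode to `unknown_char`.

```python
import numpy as np


def decode_ohe_sketch(ohe, ohe_axis, alphabet, unknown_char="N"):
    """Hypothetical illustration of OHE -> S1 decoding; `alphabet` is a plain str here."""
    ohe = np.asarray(ohe)
    # Index of the hot entry along the one-hot axis.
    idx = ohe.argmax(axis=ohe_axis)
    # Assumption: positions without exactly one hot entry are unknown.
    unknown = ohe.sum(axis=ohe_axis) != 1
    # Lookup table with the unknown character appended as the last entry.
    lookup = np.frombuffer((alphabet + unknown_char).encode("ascii"), dtype="S1")
    idx = np.where(unknown, len(alphabet), idx)
    return lookup[idx]


encoded = np.array(
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 1]], dtype=np.uint8
)
decoded = decode_ohe_sketch(encoded, ohe_axis=-1, alphabet="ACGT")
```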
decode_tokens(seqs, token_map, unknown_char='N')
Untokenize nucleotides. Replaces each token with the nucleotide it corresponds to in the token map.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | | Tokens to decode. | required |
| `token_map` | | Mapping of nucleotides to tokens. | required |
| `unknown_char` | `str` | Character to replace unknown tokens with. | `'N'` |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_]` | S1 byte array of decoded characters with the same shape as the input. |
Source code in python/seqpro/_encoders.py
gc_content(seqs, normalize=True, length_axis=None, alphabet=None, ohe_axis=None)
Compute the number or proportion of G & C nucleotides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `SeqType` | Sequences to analyze. | required |
| `normalize` | `bool` | If True, return proportions. If False, return counts. | `True` |
| `length_axis` | `int \| None` | Needed if seqs is an array. | `None` |
| `alphabet` | `NucleotideAlphabet \| None` | Needed if seqs is OHE. | `None` |
| `ohe_axis` | `int \| None` | Needed if seqs is OHE. | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[integer \| float64]` | Integers if unnormalized, otherwise floats. |
Source code in python/seqpro/_analyzers.py
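For character input, the count-versus-proportion distinction controlled by `normalize` can be sketched as follows. `gc_content_sketch` is a hypothetical NumPy-only helper that handles S1 arrays only; OHE input is out of scope for this illustration.

```python
import numpy as np


def gc_content_sketch(seqs, length_axis=-1, normalize=True):
    """Hypothetical illustration for S1 character arrays only."""
    seqs = np.asarray(seqs, dtype="S1")
    is_gc = (seqs == b"G") | (seqs == b"C")
    # Count G/C along the length axis: integer counts per sequence.
    counts = is_gc.sum(axis=length_axis)
    if normalize:
        # Proportion of G/C per sequence: floats.
        return counts / seqs.shape[length_axis]
    return counts


chars = np.array([list("GGCA"), list("ATAT")], dtype="S1")
gc_counts = gc_content_sketch(chars, length_axis=1, normalize=False)
gc_props = gc_content_sketch(chars, length_axis=1, normalize=True)
```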
jitter(*arrays, max_jitter, length_axis, jitter_axes, seed=None)
Randomly jitter data from arrays, using the same jitter across arrays.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*arrays` | `NDArray[DTYPE]` | Arrays to be jittered. Their jitter axes and length axes must have the same sizes. | `()` |
| `max_jitter` | `int` | Maximum jitter amount. | required |
| `length_axis` | `int` | Axis that corresponds to the length of the arrays. | required |
| `jitter_axes` | `int \| tuple[int, ...]` | Each slice along the jitter axes will be randomly jittered independently. Thus, if jitter_axes = 0, then every slice of data along axis 0 would be jittered independently. If jitter_axes = (0, 1), then each slice along axes 0 and 1 would be randomly jittered independently. | required |
| `seed` | `int \| Generator \| None` | Random seed or generator. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[NDArray[DTYPE], ...]` | Jittered arrays. Each will have a new length equal to length - 2*max_jitter. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If any arrays have insufficient length to be jittered. |
Source code in python/seqpro/_modifiers.py
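The idea of sharing one jitter across several arrays can be sketched in NumPy. This is a hypothetical illustration, not seqpro's code: `jitter_sketch` assumes non-negative `jitter_axes` given in ascending order, and it treats any array whose length is not greater than `2 * max_jitter` as too short.

```python
import numpy as np


def jitter_sketch(*arrays, max_jitter, length_axis, jitter_axes, seed=None):
    """Hypothetical illustration: crop every array with the same random offsets."""
    rng = np.random.default_rng(seed)
    if isinstance(jitter_axes, int):
        jitter_axes = (jitter_axes,)
    window = 2 * max_jitter
    # Assumption: an array must be longer than 2 * max_jitter to be jittered.
    if any(a.shape[length_axis] <= window for a in arrays):
        raise ValueError("Arrays are too short to be jittered by max_jitter.")
    ref = arrays[0]
    # One start offset in [0, 2 * max_jitter] per slice along the jitter axes.
    starts = rng.integers(0, window + 1, size=[ref.shape[ax] for ax in jitter_axes])
    jittered = []
    for arr in arrays:
        new_length = arr.shape[length_axis] - window
        # Shape the starts so they broadcast against this array.
        start_shape = [1] * arr.ndim
        for ax in jitter_axes:
            start_shape[ax] = arr.shape[ax]
        offset_shape = [1] * arr.ndim
        offset_shape[length_axis] = new_length
        idx = starts.reshape(start_shape) + np.arange(new_length).reshape(offset_shape)
        jittered.append(np.take_along_axis(arr, idx, axis=length_axis))
    return tuple(jittered)


x = np.arange(20).reshape(2, 10)
y = x * 10
out_x, out_y = jitter_sketch(x, y, max_jitter=2, length_axis=1, jitter_axes=0, seed=0)
```

Because the start offsets are drawn once and reused, `out_x` and `out_y` stay aligned position by position, while each row along axis 0 gets its own offset.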
k_shuffle(seqs, k, alphabet, *, length_axis=None, ohe_axis=None, seed=None)
Shuffle sequences while preserving k-let frequencies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `SeqType` | Sequences to shuffle. | required |
| `k` | `int` | Size of k-lets to preserve frequencies of. | required |
| `alphabet` | `NucleotideAlphabet` | Alphabet, needed for OHE sequence input. | required |
| `length_axis` | `int \| None` | Needed for array input. Axis that corresponds to the length of sequences. | `None` |
| `ohe_axis` | `int \| None` | Needed for OHE input. Axis that corresponds to the one hot encoding; it should be the same size as the length of the alphabet. | `None` |
| `seed` | `int \| Generator \| None` | Seed or generator for shuffling. | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_ \| uint8]` | Shuffled sequences as bytes (S1) or uint8 for string or OHE input, respectively. |
Source code in python/seqpro/_modifiers.py
length(seqs)
nucleotide_content(seqs, normalize=True, length_axis=None, alphabet=None)
Compute the number or proportion of each nucleotide.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `SeqType` | Sequences to analyze. | required |
| `normalize` | `bool` | If True, return proportions. If False, return counts. | `True` |
| `length_axis` | `int \| None` | Needed if seqs is an array. | `None` |
| `alphabet` | | | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[integer \| floating]` | Integers if unnormalized, otherwise floats. |
Source code in python/seqpro/_analyzers.py
ohe(seqs, alphabet)
One hot encode a nucleotide sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `StrSeqType` | Sequences to one hot encode. | required |
| `alphabet` | `NucleotideAlphabet \| AminoAlphabet` | | required |
Returns:
| Type | Description |
|---|---|
| `NDArray[uint8]` | One-hot encoded sequences with shape (..., length, alphabet_size). |
Source code in python/seqpro/_encoders.py
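The output shape (..., length, alphabet_size) follows naturally from comparing every character against every letter of the alphabet. The sketch below is a hypothetical NumPy-only illustration: it takes the alphabet as a plain string, and it assumes characters outside the alphabet encode to all zeros.

```python
import numpy as np


def ohe_sketch(seqs, alphabet):
    """Hypothetical illustration; `alphabet` is a plain str such as "ACGT"."""
    seqs = np.asarray(seqs, dtype="S1")
    letters = np.frombuffer(alphabet.encode("ascii"), dtype="S1")
    # Broadcasting adds a trailing axis of size len(alphabet): the one-hot axis.
    # Assumption: characters not in the alphabet match nothing and stay all zeros.
    return (seqs[..., None] == letters).astype(np.uint8)


one_hot = ohe_sketch(list("ACGN"), "ACGT")
```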
pad_seqs(seqs, pad, pad_value=None, length=None, length_axis=None)
Pad (or truncate) sequences on either the left, right, or both sides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `SeqType` | Sequences to pad or truncate. | required |
| `pad` | `Literal['left', 'both', 'right']` | How to pad. If padding on both sides and an odd amount of padding is needed, 1 more pad value will be on the right side. Similarly for truncating, if an odd number of characters needs to be truncated, 1 more character will be truncated from the right side. | required |
| `pad_value` | | Single character to pad sequences with. Needed for string input. Ignored for OHE sequences. | `None` |
| `length` | `int \| None` | Length to pad or truncate sequences to. Needed for character or OHE array input. If not given, uses the length of the longest sequence. | `None` |
| `length_axis` | `int \| None` | Needed for array input. | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_ \| uint8]` | Padded (or truncated) sequences as S1 bytes or uint8 for OHE input. |
Source code in python/seqpro/_encoders.py
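The odd-amount rule for `pad='both'` is easiest to see on plain Python strings. The helper below is a hypothetical sketch that returns a list of strings rather than an S1 array, and it assumes truncation is taken from the same side(s) that padding would be added to.

```python
def pad_seqs_sketch(seqs, pad, pad_value, length=None):
    """Hypothetical illustration on a list of str; returns a list of str."""
    if length is None:
        length = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        diff = length - len(s)
        if diff >= 0:
            # With 'both', the extra pad value of an odd amount goes on the right.
            left = {"left": diff, "right": 0, "both": diff // 2}[pad]
            out.append(pad_value * left + s + pad_value * (diff - left))
        else:
            cut = -diff
            # With 'both', the extra truncated character comes from the right.
            left = {"left": cut, "right": 0, "both": cut // 2}[pad]
            out.append(s[left : len(s) - (cut - left)])
    return out


padded = pad_seqs_sketch(["ACG", "ACGTA"], pad="both", pad_value="N", length=6)
truncated = pad_seqs_sketch(["ACGTACG"], pad="both", pad_value="N", length=4)
```

Here `"ACG"` needs three pad characters, so one goes on the left and two on the right; `"ACGTACG"` loses one character on the left and two on the right.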
random_seqs(shape, alphabet, seed=None)
Generate random nucleotide sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `shape` | `int \| tuple[int, ...]` | Shape of sequences to generate. | required |
| `alphabet` | `NucleotideAlphabet` | Alphabet to sample nucleotides from. | required |
| `seed` | `int \| Generator \| None` | Random seed or generator. | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_]` | Randomly generated sequences of shape `shape`. |
Source code in python/seqpro/_modifiers.py
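Sampling from an alphabet with a seedable generator can be sketched with NumPy's `Generator.choice`. `random_seqs_sketch` is a hypothetical helper that takes the alphabet as a plain string and assumes uniform sampling; it is not seqpro's implementation.

```python
import numpy as np


def random_seqs_sketch(shape, alphabet, seed=None):
    """Hypothetical illustration; `alphabet` is a plain str such as "ACGT"."""
    # default_rng accepts an int seed, an existing Generator, or None.
    rng = np.random.default_rng(seed)
    letters = np.frombuffer(alphabet.encode("ascii"), dtype="S1")
    # Assumption: every letter is sampled uniformly and independently.
    return rng.choice(letters, size=shape)


random_a = random_seqs_sketch((3, 5), "ACGT", seed=0)
random_b = random_seqs_sketch((3, 5), "ACGT", seed=0)
```

Passing the same integer seed twice yields identical sequences, which is useful for reproducible tests.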
reverse_complement(seqs, alphabet, length_axis=None, ohe_axis=None)
Reverse complement a sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `SeqType` | Sequences to reverse complement. | required |
| `alphabet` | `NucleotideAlphabet` | | required |
| `length_axis` | `int \| None` | Needed for array input. Length axis. | `None` |
| `ohe_axis` | `int \| None` | Needed for OHE input. One hot encoding axis. | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[bytes_ \| uint8]` | Reverse-complemented sequences as S1 bytes or uint8 for OHE input. |
Source code in python/seqpro/_modifiers.py
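For S1 character arrays, reverse complementing amounts to a byte lookup followed by a flip along the length axis. The sketch below is hypothetical and hard-codes the DNA pairs A/T and C/G instead of using a `NucleotideAlphabet`; any other byte (such as N) is assumed to map to itself.

```python
import numpy as np


def reverse_complement_sketch(seqs, length_axis=-1):
    """Hypothetical illustration for S1 arrays with hard-coded DNA complements."""
    seqs = np.asarray(seqs, dtype="S1")
    # Identity lookup table over all byte values, then overwrite the DNA pairs.
    table = np.arange(256, dtype=np.uint8)
    table[np.frombuffer(b"ACGT", dtype=np.uint8)] = np.frombuffer(b"TGCA", dtype=np.uint8)
    complemented = table[seqs.view(np.uint8)].view("S1")
    # Reverse along the length axis to finish the reverse complement.
    return np.flip(complemented, axis=length_axis)


forward = np.array(list("AACGN"), dtype="S1")
revcomp = reverse_complement_sketch(forward)
```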
tokenize(seqs, token_map, unknown_token, out=None)
Tokenize nucleotides. Replaces each nucleotide with its corresponding token in the token map. Nucleotides not in the token map are replaced with the unknown token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seqs` | `StrSeqType` | Sequences to tokenize. | required |
| `token_map` | `dict[str, int]` | Mapping of nucleotides to tokens. | required |
| `unknown_token` | `int` | Token to use for unknown values. | required |
| `out` | `NDArray[int32] \| None` | Output array to store the result in. If not provided, a new array is created. | `None` |
Returns:
| Type | Description |
|---|---|
| `NDArray[int32]` | Integer token IDs with the same shape as the input sequences. |
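Tokenizing and its inverse can both be sketched as table lookups in NumPy. These helpers are hypothetical illustrations, not seqpro's implementation: they omit the `out` argument and assume single-character ASCII keys in the token map.

```python
import numpy as np


def tokenize_sketch(seqs, token_map, unknown_token):
    """Hypothetical illustration: map each character to its int32 token."""
    seqs = np.asarray(seqs, dtype="S1")
    # Every byte value starts as unknown; known characters overwrite their slot.
    table = np.full(256, unknown_token, dtype=np.int32)
    for char, token in token_map.items():
        table[ord(char)] = token
    return table[seqs.view(np.uint8)]


def decode_tokens_sketch(tokens, token_map, unknown_char="N"):
    """Hypothetical inverse: map each token back to its character."""
    tokens = np.asarray(tokens)
    decoded = np.full(tokens.shape, unknown_char.encode("ascii"), dtype="S1")
    for char, token in token_map.items():
        decoded[tokens == token] = char.encode("ascii")
    return decoded


token_map = {"A": 0, "C": 1, "G": 2, "T": 3}
token_ids = tokenize_sketch(list("ACGTN"), token_map, unknown_token=-1)
round_trip = decode_tokens_sketch(token_ids, token_map, unknown_char="N")
```

The round trip recovers the input here only because the unknown input character happens to equal `unknown_char`; in general, unknown characters are not recoverable after tokenizing.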