scikit-genome package
Module skgenome contents
Tabular file I/O (tabio)
tabio
I/O for tabular formats of genomic data (regions or features).
-
skgenome.tabio.read(infile, fmt='tab', into=None, sample_id=None, meta=None, **kwargs)
Read tabular data from a file or stream into a genome object.
Supported formats: see READERS
If a format supports multiple samples, return the sample specified by sample_id, or if unspecified, return the first sample and warn if there were other samples present in the file.
Parameters: - infile (handle or string) – Filename or opened file-like object to read.
- fmt (string) – File format.
- into (class) – GenomicArray class or subclass to instantiate, overriding the default for the target file format.
- sample_id (string) – Sample identifier.
- meta (dict) – Metadata, as arbitrary key-value pairs.
- **kwargs – Additional keyword arguments to the format-specific reader function.
Returns: The data from the given file instantiated as into, if specified, or the default base class for the given file format (usually GenomicArray).
Return type: GenomicArray or subclass
-
skgenome.tabio.read_auto(infile)
Auto-detect a file’s format and use an appropriate parser to read it.
-
skgenome.tabio.safe_write(outfile, verbose=True)
Write to a filename or file-like object with error handling.
If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.
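The path-or-handle behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the skgenome implementation: the name open_for_writing is hypothetical, and the verbose option is left out.

```python
import contextlib
import io
import os
import tempfile


@contextlib.contextmanager
def open_for_writing(outfile):
    """Yield a writable text handle for a path or an open file-like object.

    Hypothetical helper for illustration; not part of skgenome.
    """
    if isinstance(outfile, (str, os.PathLike)):
        # A path: create any missing parent directories, then open the file.
        parent = os.path.dirname(os.fspath(outfile))
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(outfile, "w") as handle:
            yield handle
    else:
        # A file-like object: pass it through; the caller keeps ownership.
        yield outfile


# A file-like object is used as-is.
buffer = io.StringIO()
with open_for_writing(buffer) as handle:
    handle.write("chr1\t0\t100\n")

# A path with nonexistent directories gets those directories created.
target = os.path.join(tempfile.mkdtemp(), "results", "sample", "out.bed")
with open_for_writing(target) as handle:
    handle.write("chr1\t0\t100\n")
```

Wrapping both cases in one context manager lets calling code use a single `with` block regardless of what it was given.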
Base class: GenomicArray
The base class of the core objects used throughout CNVkit and scikit-genome is GenomicArray. It wraps a pandas DataFrame instance, which is accessible through the .data attribute and can be used for any manipulations that aren’t already provided by methods in the wrapper class.
gary
Base class for an array of annotated genomic regions.
-
class skgenome.gary.GenomicArray(data_table, meta_dict=None)
Bases: object
An array of genomic intervals. Base class for genomic data structures.
Can represent most BED-like tabular formats with arbitrary additional columns.
-
add(other)
Combine this array’s data with another GenomicArray (in-place).
Any optional columns must match between both arrays.
-
add_columns(**columns)
Add the given columns to a copy of this GenomicArray.
Parameters: **columns (array) – Keyword arguments where the key is the new column’s name and the value is an array of the same length as self which will be the new column’s values.
Returns: A new instance of self with the given columns included in the underlying dataframe.
Return type: GenomicArray or subclass
-
as_dataframe(dframe, reset_index=False)
Wrap the given pandas DataFrame in this instance’s metadata.
-
by_arm(min_gap_size=100000.0, min_arm_bins=50)
Iterate over bins grouped by chromosome arm (inferred).
-
by_ranges(other, mode='outer', keep_empty=True)
Group rows by another GenomicArray’s bin coordinate ranges.
For example, this can be used to group SNVs by CNV segments.
Bins in this array that fall outside the other array’s bins are skipped.
Parameters: - other (GenomicArray) – Another GA instance.
- mode (string) – Determines what to do with bins that overlap a boundary of the selection. Possible values are:
inner: Drop the bins on the selection boundary, don’t emit them.
outer: Keep/emit those bins as they are.
trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
- keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
Yields: tuple – (other bin, GenomicArray of overlapping rows in self)
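To make the grouping and the effect of keep_empty concrete, here is a minimal sketch on plain tuples. It assumes a single chromosome, half-open (start, end) intervals, and 'outer' overlap only; group_by_ranges is a hypothetical stand-in, not the library method.

```python
def group_by_ranges(bins, ranges, keep_empty=True):
    """Yield (range, overlapping bins) pairs for half-open (start, end) tuples.

    Hypothetical illustration; not the skgenome implementation.
    """
    for rng_start, rng_end in ranges:
        # 'outer' behaviour: keep every bin that overlaps the range at all.
        hits = [(s, e) for s, e in bins if s < rng_end and e > rng_start]
        if hits or keep_empty:
            yield (rng_start, rng_end), hits


snvs = [(10, 11), (150, 151), (900, 901)]
segments = [(0, 100), (100, 200), (200, 300)]

grouped = list(group_by_ranges(snvs, segments))
# The SNV at (900, 901) lies outside every segment, so it is skipped.
# The segment (200, 300) is still emitted, paired with an empty group.

non_empty = list(group_by_ranges(snvs, segments, keep_empty=False))
# With keep_empty=False the segment (200, 300) is not emitted at all.
```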
-
chromosome
-
concat(others)
Concatenate several GenomicArrays, keeping this array’s metadata.
This array’s data table is not implicitly included in the result.
-
coords(also=())
Iterate over plain coordinates of each bin: chromosome, start, end.
Parameters: also (str, or iterable of strings) – Also include these columns from self, in addition to chromosome, start, and end.
Example, yielding rows in BED format:
>>> probes.coords(also=["gene", "strand"])
-
drop_extra_columns()
Remove any optional columns from this GenomicArray.
Returns: A new copy with only the minimal set of columns required by the class (e.g. chromosome, start, end for GenomicArray; may be more for subclasses).
Return type: GenomicArray or subclass
-
end
-
filter(func=None, **kwargs)
Take a subset of rows where the given condition is true.
Parameters: - func (callable) – A boolean function which will be applied to each row to keep rows where the result is True.
- **kwargs (string) – Keyword arguments like chromosome="chr7" or gene="Antitarget", which will keep rows where the keyed field equals the specified value.
Returns: Subset of self where the specified condition is True.
Return type:
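The two ways of filtering can be sketched on a list of dict rows. This is an illustration only: filter_rows is a hypothetical name, and combining func with keyword conditions as a logical AND is an assumption, not something the text above states.

```python
def filter_rows(rows, func=None, **kwargs):
    """Keep dict rows that pass func (if given) and match every keyword.

    Hypothetical illustration; not the skgenome implementation.
    """
    kept = []
    for row in rows:
        if func is not None and not func(row):
            continue  # the boolean function rejected this row
        # Assumption: all keyword conditions must hold (logical AND).
        if all(row.get(key) == value for key, value in kwargs.items()):
            kept.append(row)
    return kept


rows = [
    {"chromosome": "chr7", "start": 0, "end": 500, "gene": "EGFR"},
    {"chromosome": "chr7", "start": 500, "end": 900, "gene": "Antitarget"},
    {"chromosome": "chr8", "start": 0, "end": 400, "gene": "MYC"},
]

on_chr7 = filter_rows(rows, chromosome="chr7")
wide = filter_rows(rows, func=lambda row: row["end"] - row["start"] >= 450)
```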
-
classmethod from_columns(columns, meta_dict=None)
Create a new instance from column arrays, given as a dict.
-
classmethod from_rows(rows, columns=None, meta_dict=None)
Create a new instance from a list of rows, as tuples or arrays.
-
in_range(chrom=None, start=None, end=None, mode='outer')
Get the GenomicArray portion within the given genomic range.
Parameters: - chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
- start (int or None) – Start coordinate of range to select, in 0-based coordinates. If None, start from 0.
- end (int or None) – End coordinate of range to select. If None, select to the end of the chromosome.
- mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns: The subset of self enclosed by the specified range.
Return type:
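The three modes differ only in how they treat bins that straddle a range boundary. A minimal sketch on plain (start, end) tuples, assuming one chromosome and half-open intervals; select_bins is a hypothetical helper, not the library method:

```python
def select_bins(bins, start, end, mode="outer"):
    """Select (start, end) bins overlapping the half-open range [start, end).

    Hypothetical illustration; not the skgenome implementation.
    """
    if mode not in ("inner", "outer", "trim"):
        raise ValueError(f"unknown mode: {mode!r}")
    selected = []
    for bin_start, bin_end in bins:
        if bin_end <= start or bin_start >= end:
            continue  # no overlap with the selection
        straddles = bin_start < start or bin_end > end
        if straddles and mode == "inner":
            continue  # drop bins that cross a selection boundary
        if straddles and mode == "trim":
            # Clip the bin to the selection boundaries.
            bin_start, bin_end = max(bin_start, start), min(bin_end, end)
        selected.append((bin_start, bin_end))
    return selected


bins = [(0, 100), (100, 200), (200, 300)]
print(select_bins(bins, 50, 250, "outer"))  # [(0, 100), (100, 200), (200, 300)]
print(select_bins(bins, 50, 250, "inner"))  # [(100, 200)]
print(select_bins(bins, 50, 250, "trim"))   # [(50, 100), (100, 200), (200, 250)]
```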
-
in_ranges(chrom=None, starts=None, ends=None, mode='outer')
Get the GenomicArray portion within the specified ranges.
Similar to in_range, but concatenating the selections of all the regions specified by the starts and ends arrays.
Parameters: - chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
- starts (int array, or None) – Start coordinates of ranges to select, in 0-based coordinates. If None, start from 0.
- ends (int array, or None) – End coordinates of ranges to select. If None, select to the end of the chromosome. If starts and ends are both specified, they must be arrays of equal length.
- mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns: Concatenation of all the subsets of self enclosed by the specified ranges.
Return type:
-
intersection(other, mode='outer')
Select the bins in self that overlap the regions in other.
The extra fields of self, but not other, are retained in the output.
-
into_ranges(other, column, default, summary_func=None)
Re-bin values from column into the corresponding ranges in other.
Match overlapping/intersecting rows from other to each row in self. Then, within each range in other, extract the value(s) from column in self, using the function summary_func to produce a single value if multiple bins in self map to a single range in other.
For example, group SNVs (self) by CNV segments (other) and calculate the median (summary_func) of each SNV group’s allele frequencies.
Parameters: - other (GenomicArray) – Ranges into which the overlapping values of self will be summarized.
- column (string) – Column name in self to extract values from.
- default – Value to assign to indices in other that do not overlap any bins in self. Type should be the same as or compatible with the output field specified by column, or the output of summary_func.
- summary_func (callable, dict of string-to-callable, or None) – Specify how to reduce 1 or more self rows into a single value for the corresponding row in other.
- If callable, apply to the column field of each group of rows in self.
- If a single-element dict of column name to callable, apply to that field in self instead of column.
- If None, use an appropriate summarizing function for the datatype of column in self (e.g. median of numbers, concatenation of strings).
- If some other value, assign that value to other wherever there is an overlap.
Returns: The extracted and summarized values from self corresponding to other’s genomic ranges, the same length as other.
Return type: pd.Series
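The SNV-by-segment example above can be sketched without pandas. This assumes one chromosome, half-open intervals, and a plain list as the result instead of a pd.Series; summarize_into_ranges is a hypothetical name, and only the callable form of summary_func is shown.

```python
from statistics import median


def summarize_into_ranges(bins, ranges, default, summary_func=median):
    """Return one summarized value per range, in the order of ranges.

    bins: (start, end, value) tuples; ranges: (start, end) tuples.
    Hypothetical illustration; not the skgenome implementation.
    """
    results = []
    for rng_start, rng_end in ranges:
        # Collect the values of all bins overlapping this range.
        values = [v for s, e, v in bins if s < rng_end and e > rng_start]
        # Ranges with no overlapping bins receive the default value.
        results.append(summary_func(values) if values else default)
    return results


# SNV allele frequencies (self) summarized per CNV segment (other).
snvs = [(10, 11, 0.48), (40, 41, 0.52), (60, 61, 0.50), (150, 151, 0.25)]
segments = [(0, 100), (100, 200), (200, 300)]
per_segment = summarize_into_ranges(snvs, segments, default=None)
print(per_segment)  # [0.5, 0.25, None]
```

The result has the same length as the list of segments, matching the documented return value.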
-
iter_ranges_of(other, column, mode='outer', keep_empty=True)
Group rows by another GenomicArray’s bin coordinate ranges.
For example, this can be used to group SNVs by CNV segments.
Bins in this array that fall outside the other array’s bins are skipped.
Parameters: - other (GenomicArray) – Another GA instance.
- column (string) – Column name in self to extract values from.
- mode (string) – Determines what to do with bins that overlap a boundary of the selection. Possible values are:
inner: Drop the bins on the selection boundary, don’t emit them.
outer: Keep/emit those bins as they are.
trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
- keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
Yields: tuple – (other bin, GenomicArray of overlapping rows in self)
-
merge(bp=0, stranded=False, combine=None)
Merge adjacent or overlapping regions into single rows.
Similar to ‘bedtools merge’.
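The core idea of merging can be sketched as a sort-and-sweep over plain intervals. This assumes one chromosome and treats bp as the maximum gap allowed between regions that still get merged, which is an assumption about the parameter. The stranded and combine options are not modelled, and merge_intervals is a hypothetical name.

```python
def merge_intervals(intervals, bp=0):
    """Merge half-open (start, end) intervals that overlap or lie within bp bases.

    Hypothetical illustration; not the skgenome implementation.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= bp:
            # Overlapping, adjacent, or close enough: extend the last interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


print(merge_intervals([(0, 100), (100, 150), (140, 200), (300, 400)]))
# [(0, 200), (300, 400)]
print(merge_intervals([(0, 100), (150, 200)], bp=50))
# [(0, 200)]
```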
-
resize_ranges(bp, chrom_sizes=None)
Resize each genomic bin by a fixed number of bases at each end.
Bin ‘start’ values have a minimum of 0, and chrom_sizes can specify each chromosome’s maximum ‘end’ value.
Similar to ‘bedtools slop’.
Parameters: - bp (int) – Number of bases in each direction to expand or shrink each bin. Applies to ‘start’ and ‘end’ values symmetrically, and may be positive (expand) or negative (shrink).
- chrom_sizes (dict of string-to-int) – Chromosome name to length in base pairs. If given, all chromosomes in self must be included.
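The clamping rules can be sketched on plain (chromosome, start, end) tuples. The helper name slop is hypothetical, and dropping bins that shrink to zero or negative length is an assumption made for this sketch only.

```python
def slop(bins, bp, chrom_sizes=None):
    """Expand (bp > 0) or shrink (bp < 0) each (chrom, start, end) bin.

    Hypothetical illustration; not the skgenome implementation.
    """
    resized = []
    for chrom, start, end in bins:
        new_start = max(0, start - bp)  # 'start' never goes below 0
        new_end = end + bp
        if chrom_sizes is not None:
            # Raises KeyError if a chromosome is missing from chrom_sizes.
            new_end = min(new_end, chrom_sizes[chrom])
        # Assumption: bins left with no positive length are dropped.
        if new_start < new_end:
            resized.append((chrom, new_start, new_end))
    return resized


bins = [("chr1", 50, 200), ("chr1", 900, 1000)]
expanded = slop(bins, 100, chrom_sizes={"chr1": 1000})
shrunk = slop(bins, -50)
```

Here the first bin's start is clamped to 0 and the second bin's end is clamped to the chromosome length when expanding.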
-
sample_id
-
start
-
Genomic interval arithmetic
intersect
DataFrame-level intersection operations.
Calculate overlapping regions, similar to bedtools intersect.
The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.
-
skgenome.intersect.by_ranges(table, other, mode, keep_empty)
Group rows by another GenomicArray’s bin coordinate ranges.
-
skgenome.intersect.into_ranges(source, dest, src_col, default, summary_func)
Group a column in source by regions in dest and summarize.
-
skgenome.intersect.iter_ranges(table, chrom, starts, ends, mode)
Iterate through sub-ranges.
-
skgenome.intersect.iter_slices(table, other, mode, keep_empty)
Yields indices to extract ranges from table.
Returns an iterable of integer arrays that can apply to Series objects, i.e. columns of table. These indices are of the DataFrame/Series’ Index, not array coordinates – so be sure to use DataFrame.loc, Series.loc, or Series getitem, as opposed to .iloc or indexing directly into Numpy arrays.
merge
DataFrame-level merging operations.
Merge overlapping regions into single rows, similar to bedtools merge.
The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.
subdivide
DataFrame-level subdivide operation.
Split each region into similar-sized sub-regions.
The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.
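One way to split a region into similar-sized sub-regions is to pick a target size and spread integer boundaries evenly. The sketch below works on a single (start, end) pair; subdivide_region and its avg_size parameter are hypothetical and are not taken from the module.

```python
def subdivide_region(start, end, avg_size):
    """Split [start, end) into sub-regions of roughly avg_size bases each.

    Hypothetical illustration; not the skgenome implementation.
    """
    length = end - start
    # At least one piece, otherwise the nearest whole number of pieces.
    n_parts = max(1, round(length / avg_size))
    # Evenly spread integer edges, so piece sizes differ by at most 1.
    edges = [start + (length * i) // n_parts for i in range(n_parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


print(subdivide_region(0, 1000, 300))  # [(0, 333), (333, 666), (666, 1000)]
print(subdivide_region(0, 100, 300))   # [(0, 100)]
```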
Helper modules
chromsort
Operations on chromosome/contig/sequence names.
-
skgenome.chromsort.detect_big_chroms(sizes)
Determine the number of “big” chromosomes from their lengths.
In the human genome, this returns 24, where the canonical chromosomes 1-22, X, and Y are considered “big”, while mitochondria and the alternative contigs are not. This allows us to exclude the non-canonical chromosomes from an analysis where they’re not relevant.
Returns: - n_big (int) – Number of “big” chromosomes in the genome.
- thresh (int) – Length of the smallest “big” chromosomes.
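The text does not say how the cutoff is chosen. One plausible heuristic, shown here purely as an assumption, is to sort the lengths and stop at the first sharp relative drop in size. The function name count_big_chroms and the max_drop parameter are hypothetical.

```python
def count_big_chroms(sizes, max_drop=0.5):
    """Return (n_big, thresh) for a non-empty collection of positive lengths.

    Hypothetical heuristic for illustration; not the skgenome implementation.
    """
    ordered = sorted(sizes, reverse=True)
    if not ordered:
        raise ValueError("sizes must not be empty")
    for i in range(1, len(ordered)):
        larger, smaller = ordered[i - 1], ordered[i]
        # Relative drop from one sequence to the next-smaller one.
        if (larger - smaller) / larger > max_drop:
            return i, larger
    # No sharp drop found: treat every sequence as "big".
    return len(ordered), ordered[-1]


# Toy genome: five gradually shrinking chromosomes, then two tiny contigs.
toy_sizes = [1000, 900, 700, 500, 400, 30, 2]
n_big, thresh = count_big_chroms(toy_sizes)
print(n_big, thresh)  # 5 400
```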
combiners
Combiner functions for Python list-like input.
-
skgenome.combiners.get_combiners(table, stranded=False, combine=None)
Get a combine lookup suitable for table.
Parameters: - table (DataFrame) –
- stranded (bool) –
- combine (dict or None) – Column names to their value-combining functions, replacing or in addition to the defaults.
Returns: Column names to their value-combining functions.
Return type: dict
rangelabel
Handle text genomic ranges as named tuples.
A range specification should look like chromosome:start-end, e.g. chr1:1234-5678, with 1-indexed integer coordinates. We also allow chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.
-
class skgenome.rangelabel.NamedRegion(chromosome, start, end, gene)
Bases: tuple
-
chromosome
Alias for field number 0
-
end
Alias for field number 2
-
gene
Alias for field number 3
-
start
Alias for field number 1
-
-
class skgenome.rangelabel.Region(chromosome, start, end)
Bases: tuple
-
chromosome
Alias for field number 0
-
end
Alias for field number 2
-
start
Alias for field number 1
-
-
skgenome.rangelabel.from_label(text, keep_gene=True)
Parse a chromosomal range specification.
Parameters: text (string) – Range specification, which should look like chr1:1234-5678 or chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.
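The label format described above can be parsed with a small regular expression. This is a standalone sketch, not the library's parser: parse_label and LABEL_RE are hypothetical names, the local Region tuple only mirrors the documented fields, gene names (keep_gene) are not handled, and the numbers are kept exactly as written with no conversion between 1-based and 0-based coordinates.

```python
import re
from collections import namedtuple

# Local stand-in with the same fields as the documented Region tuple.
Region = namedtuple("Region", "chromosome start end")

# chromosome:start-end, where start and end are each optional.
LABEL_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)?-(?P<end>\d+)?$")


def parse_label(text):
    """Parse 'chr1:1234-5678', 'chr1:1234-' or 'chr1:-5678' into a Region.

    Hypothetical illustration; not the skgenome implementation.
    """
    match = LABEL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Invalid range specification: {text!r}")
    # Missing start becomes 0; missing end becomes None.
    start = int(match["start"]) if match["start"] else 0
    end = int(match["end"]) if match["end"] else None
    return Region(match["chrom"], start, end)


print(parse_label("chr1:1234-5678"))  # Region(chromosome='chr1', start=1234, end=5678)
print(parse_label("chr1:1234-"))      # Region(chromosome='chr1', start=1234, end=None)
print(parse_label("chr1:-5678"))      # Region(chromosome='chr1', start=0, end=5678)
```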