Python API (cnvlib package)
Module cnvlib contents
The one function exposed at the top level, read, loads a file in CNVkit’s BED-like tabular format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using just the CopyNumArray and GenomicArray methods available on this returned object (see Core classes).
To load other file formats, see Tabular file I/O. To run functions equivalent to CNVkit commands within Python, see Interface to CNVkit sub-commands.
Core classes¶
The core objects used throughout CNVkit. The base class is GenomicArray. All of these classes wrap a pandas DataFrame instance, accessible through the .data attribute, which can be used for any manipulations that aren’t already provided by methods in the wrapper class.
gary
¶
Base class for an array of annotated genomic regions.
-
class
cnvlib.genome.gary.
GenomicArray
(data_table, meta_dict=None)[source]¶ Bases:
future.types.newobject.newobject
An array of genomic intervals. Base class for genomic data structures.
Can represent most BED-like tabular formats with arbitrary additional columns.
-
__getitem__
(index)[source]¶ Access a portion of the data.
Cases:
- single integer: a row, as pd.Series
- string column name: a column, as pd.Series
- a boolean array: masked rows, as_dataframe
- tuple of integers: selected rows, as_dataframe
-
__next__
(iterator, default=<object object>)¶ next(iterator[, default])
Return the next item from the iterator. If default is given and the iterator is exhausted, it is returned instead of raising StopIteration.
-
add
(other)[source]¶ Combine this array’s data with another GenomicArray (in-place).
Any optional columns must match between both arrays.
-
add_columns
(**columns)[source]¶ Add the given columns to a copy of this GenomicArray.
Parameters: **columns – Keyword arguments where the key is the new column’s name and the value is an array of the same length as self which will be the new column’s values.
Returns: A new instance of self with the given columns included in the underlying dataframe. Return type: GenomicArray or subclass
-
by_ranges
(other, mode='inner', keep_empty=True)[source]¶ Group rows by another GenomicArray’s bin coordinate ranges.
For example, this can be used to group SNVs by CNV segments.
Bins in this array that fall outside the other array’s bins are skipped.
Parameters: - other (GenomicArray) – Another GA instance.
- mode (string) –
Determines what to do with bins that overlap a boundary of the selection. Possible values are:
- inner: Drop the bins on the selection boundary, don’t emit them.
- outer: Keep/emit those bins as they are.
- trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
- keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
Yields: tuple – (other bin, GenomicArray of overlapping rows in self)
-
chromosome
¶
-
concat
(others)[source]¶ Concatenate several GenomicArrays, keeping this array’s metadata.
This array’s data table is not implicitly included in the result.
-
coords
(also=())[source]¶ Iterate over plain coordinates of each bin: chromosome, start, end.
Parameters: - also (str, or iterable of strings) – Also include these columns from self, in addition to chromosome, start, and end.
Example, yielding rows in BED format:
>>> probes.coords(also=["gene", "strand"])
-
drop_extra_columns
()[source]¶ Remove any optional columns from this GenomicArray.
Returns: A new copy with only the minimal set of columns required by the class (e.g. chromosome, start, end for GenomicArray; may be more for subclasses). Return type: GenomicArray or subclass
-
end
¶
-
filter
(func=None, **kwargs)[source]¶ Take a subset of rows where the given condition is true.
Parameters: - func (callable) – A boolean function which will be applied to each row to keep rows where the result is True.
- **kwargs –
Keyword arguments like chromosome="chr7" or gene="Background", which will keep rows where the keyed field equals the specified value.
Returns: Subset of self where the specified condition is True.
Return type:
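The two selection styles described for filter can be illustrated on plain dicts instead of a GenomicArray. This is only a sketch: the name filter_rows and the row layout are hypothetical, and combining func with the keyword tests (both must pass) is an assumption rather than documented behaviour.

```python
def filter_rows(rows, func=None, **kwargs):
    """Keep rows (dicts) where func(row) is true and every keyword matches."""
    kept = []
    for row in rows:
        # Optional boolean function applied to each row
        if func is not None and not func(row):
            continue
        # Keyword tests: the keyed field must equal the given value
        if all(row[key] == value for key, value in kwargs.items()):
            kept.append(row)
    return kept


rows = [
    {"chromosome": "chr7", "start": 100, "end": 200, "gene": "EGFR"},
    {"chromosome": "chr7", "start": 300, "end": 400, "gene": "Background"},
    {"chromosome": "chr8", "start": 100, "end": 200, "gene": "MYC"},
]
on_chr7 = filter_rows(rows, chromosome="chr7")
wide_myc = filter_rows(rows, func=lambda r: r["end"] - r["start"] >= 100, gene="MYC")
```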
-
classmethod
from_columns
(columns, meta_dict=None)[source]¶ Create a new instance from column arrays, given as a dict.
-
classmethod
from_rows
(rows, columns=None, meta_dict=None)[source]¶ Create a new instance from a list of rows, as tuples or arrays.
-
in_range
(chrom=None, start=None, end=None, mode='inner')[source]¶ Get the GenomicArray portion within the given genomic range.
Parameters: - chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
- start (int or None) – Start coordinate of range to select, in 0-based coordinates. If None, start from 0.
- end (int or None) – End coordinate of range to select. If None, select to the end of the chromosome.
- mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns: The subset of self enclosed by the specified range.
Return type:
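The effect of the three mode values is easiest to see on a handful of bins. The sketch below is not the cnvlib implementation; select_in_range is a hypothetical helper working on (start, end) tuples from a single chromosome, assuming half-open coordinates.

```python
def select_in_range(bins, start, end, mode="inner"):
    """Select (start, end) bins overlapping [start, end) under the given mode."""
    selected = []
    for b_start, b_end in bins:
        if b_end <= start or b_start >= end:
            continue  # entirely outside the selection: always skipped
        straddles = b_start < start or b_end > end
        if straddles:
            if mode == "inner":
                continue  # drop bins on the selection boundary
            if mode == "trim":
                # clip the bin to the selection boundary
                b_start, b_end = max(b_start, start), min(b_end, end)
            # mode == "outer": keep the bin unchanged
        selected.append((b_start, b_end))
    return selected


bins = [(0, 100), (100, 200), (200, 300), (300, 400)]
inner = select_in_range(bins, 150, 350, mode="inner")
outer = select_in_range(bins, 150, 350, mode="outer")
trim = select_in_range(bins, 150, 350, mode="trim")
```

Here inner keeps only the bin fully inside the range, outer also keeps the two straddling bins, and trim keeps them with clipped endpoints.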
-
in_ranges
(chrom=None, starts=None, ends=None, mode='inner')[source]¶ Get the GenomicArray portion within the specified ranges.
Similar to in_range, but concatenating the selections of all the regions specified by the starts and ends arrays.
Parameters: - chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
- starts (int array, or None) – Start coordinates of ranges to select, in 0-based coordinates. If None, start from 0.
- ends (int array, or None) – End coordinates of ranges to select. If None, select to the end of the chromosome. If starts and ends are both specified, they must be arrays of equal length.
- mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns: Concatenation of all the subsets of self enclosed by the specified ranges.
Return type:
-
into_ranges
(other, column, default, summary_func=None)[source]¶ Re-bin values from column into the corresponding ranges in other.
Match overlapping/intersecting rows from other to each row in self. Then, within each range in other, extract the value(s) from column in self, using the function summary_func to produce a single value if multiple bins in self map to a single range in other.
For example, group SNVs (self) by CNV segments (other) and calculate the median (summary_func) of each SNV group’s allele frequencies.
Parameters: - other (GenomicArray) – Bins to
- column (string) – Column name in self to extract values from.
- default – Value to assign to indices in other that do not overlap any bins in self. Type should be the same as or compatible with the output field specified by column, or the output of summary_func.
- summary_func (callable, dict of string-to-callable, or None) –
Specify how to reduce one or more rows in self into a single value for the corresponding row in other.
- If callable, apply it to the column field of each group of rows in self.
- If a single-element dict of column name to callable, apply the callable to that field in self instead of column.
- If None, use an appropriate summarizing function for the datatype of column in self (e.g. median of numbers, concatenation of strings).
- If some other value, assign that value to other wherever there is an overlap.
Returns: The extracted and summarized values from self corresponding to other’s genomic ranges, the same length as other.
Return type: pd.Series
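The SNV-by-segment example can be sketched without pandas. The function rebin_values and the tuple layouts below are hypothetical, chosen only to show the idea: one summarized value per range in other, with default where nothing overlaps.

```python
from statistics import median


def rebin_values(points, ranges, default, summary_func=median):
    """Summarize point values within each range.

    points: list of (chromosome, position, value), e.g. SNV allele frequencies
    ranges: list of (chromosome, start, end), e.g. CNV segments (half-open)
    Returns one value per range, in the same order as ranges.
    """
    summarized = []
    for chrom, start, end in ranges:
        values = [v for c, pos, v in points if c == chrom and start <= pos < end]
        summarized.append(summary_func(values) if values else default)
    return summarized


snvs = [("chr1", 150, 0.25), ("chr1", 180, 0.75), ("chr1", 190, 0.5), ("chr2", 50, 0.125)]
segments = [("chr1", 100, 200), ("chr1", 200, 300), ("chr2", 0, 100)]
seg_baf = rebin_values(snvs, segments, default=None)
```

The result has the same length as segments; the second segment has no overlapping SNVs and so receives the default.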
-
merge
(bp=0, stranded=False, combine=None)[source]¶ Merge adjacent or overlapping regions into single rows.
Similar to ‘bedtools merge’.
-
resize_ranges
(bp, chrom_sizes=None)[source]¶ Resize each genomic bin by a fixed number of bases at each end.
Bin ‘start’ values have a minimum of 0, and chrom_sizes can specify each chromosome’s maximum ‘end’ value.
Similar to ‘bedtools slop’.
Parameters: - bp (int) – Number of bases in each direction to expand or shrink each bin. Applies to ‘start’ and ‘end’ values symmetrically, and may be positive (expand) or negative (shrink).
- chrom_sizes (dict of string-to-int) – Chromosome name to length in base pairs. If given, all chromosomes in self must be included.
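A minimal sketch of the clamping behaviour described above, using plain tuples. The helper resize_bins is hypothetical, and it does not address what happens when a negative bp shrinks a bin past zero width.

```python
def resize_bins(bins, bp, chrom_sizes=None):
    """Expand (bp > 0) or shrink (bp < 0) each (chromosome, start, end) bin."""
    resized = []
    for chrom, start, end in bins:
        new_start = max(0, start - bp)  # 'start' has a minimum of 0
        new_end = end + bp
        if chrom_sizes is not None:
            # 'end' is capped at the chromosome length when sizes are given
            new_end = min(new_end, chrom_sizes[chrom])
        resized.append((chrom, new_start, new_end))
    return resized


bins = [("chr1", 50, 150), ("chr1", 900, 950)]
expanded = resize_bins(bins, 100, chrom_sizes={"chr1": 1000})
shrunk = resize_bins(bins, -10)
```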
-
sample_id
¶
-
start
¶
-
cnary
¶
CNVkit’s core data structure, a copy number array.
-
class
cnvlib.cnary.
CopyNumArray
(data_table, meta_dict=None)[source]¶ Bases:
cnvlib.genome.gary.GenomicArray
An array of genomic intervals, treated like aCGH probes.
Required columns: chromosome, start, end, gene, log2
Optional columns: gc, rmask, spread, weight, probes
-
by_gene
(ignore=('-', '.', 'CGH'))[source]¶ Iterate over probes grouped by gene name.
Group each series of intergenic bins as a “Background” gene; any “Background” bins within a gene are grouped with that gene.
Bins’ gene names are split on commas to accommodate overlapping genes and bins that cover multiple genes.
Parameters: ignore (list or tuple of str) – Gene names to treat as “Background” bins instead of real genes, grouping these bins with the surrounding gene or background region. These bins will still retain their name in the output. Yields: tuple – Pairs of: (gene name, CNA of rows with same name)
-
center_all
(estimator=<unbound method Series.median>, skip_low=False, by_chrom=True)[source]¶ Re-center log2 values to the autosomes’ average (in-place).
Parameters: - estimator (str or callable) – Function to estimate central tendency. If a string, must be one of ‘mean’, ‘median’, ‘mode’, ‘biweight’ (for biweight location). Median by default.
- skip_low (bool) – Whether to drop very-low-coverage bins (via drop_low_coverage) before estimating the center value.
- by_chrom (bool) – If True, first apply estimator to each chromosome separately, then apply estimator to the per-chromosome values, to reduce the impact of uneven targeting or extreme aneuploidy. Otherwise, apply estimator to all log2 values directly.
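To see why by_chrom can matter, consider a panel that targets one chromosome much more densely than the others. The sketch below uses the median as the estimator; center_estimate and the input dict are hypothetical stand-ins, not the cnvlib code.

```python
from statistics import median


def center_estimate(log2_by_chrom, by_chrom=True):
    """Estimate the value to subtract from all log2 ratios."""
    if by_chrom:
        # Median of the per-chromosome medians
        return median(median(values) for values in log2_by_chrom.values())
    # Median of all log2 values pooled together
    return median(v for values in log2_by_chrom.values() for v in values)


# chr1 is densely targeted and amplified; chr2 and chr3 have few bins
log2_by_chrom = {
    "chr1": [0.5] * 6,
    "chr2": [0.0, 0.0],
    "chr3": [-0.1, -0.1],
}
center_per_chrom = center_estimate(log2_by_chrom, by_chrom=True)
center_pooled = center_estimate(log2_by_chrom, by_chrom=False)
```

With per-chromosome estimation the densely targeted chromosome counts once, so the center stays at 0.0; pooling all bins pulls the center up to 0.5.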
-
compare_sex_chromosomes
(male_reference=False, skip_low=False)[source]¶ Compare coverage ratios of sex chromosomes versus autosomes.
Perform 4 Mood’s median tests of the log2 coverages on chromosomes X and Y, separately shifting for assumed male and female chromosomal sex. Compare the chi-squared values obtained to infer whether the male or female assumption fits the data better.
Parameters: - male_reference (bool) – Whether a male reference copy number profile was used to normalize the data. If so, a male sample should have log2 values of 0 on X and Y, and female +1 on X, deep negative (below -3) on Y. Otherwise, a male sample should have log2 values of -1 on X and 0 on Y, and female 0 on X, deep negative (below -3) on Y.
- skip_low (bool) – If True, drop very-low-coverage bins (via drop_low_coverage) before comparing log2 coverage ratios. Included for completeness, but shouldn’t affect the result much since the M-W test is nonparametric and p-values are not used here.
Returns: - bool – True if the sample appears male.
- dict – Calculated values used for the inference: relative log2 ratios of chromosomes X and Y versus the autosomes; the Mann-Whitney U values from each test; and ratios of U values for male vs. female assumption on chromosomes X and Y.
-
drop_low_coverage
(verbose=False)[source]¶ Drop bins with extremely low log2 coverage or copy ratio values.
These are generally bins that had no reads mapped due to sample-specific issues. A very small log2 ratio or coverage value may have been substituted to avoid domain or divide-by-zero errors.
-
expect_flat_log2
(is_male_reference=None)[source]¶ Get the uninformed expected copy ratios of each bin.
Create an array of log2 coverages like a “flat” reference.
This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).
-
guess_xx
(male_reference=False, verbose=True)[source]¶ Detect chromosomal sex; return True if a sample is probably female.
Uses compare_sex_chromosomes to calculate coverage ratios of the X and Y chromosomes versus autosomes.
Parameters: - male_reference (bool) – Was this sample normalized to a male reference copy number profile?
- verbose (bool) – If True, print (i.e. log to console) the ratios of the log2 coverages of the X and Y chromosomes versus autosomes, the “maleness” ratio of male vs. female expectations for each sex chromosome, and the inferred chromosomal sex.
Returns: True if the coverage ratios indicate the sample is female.
Return type: bool
-
log2
¶
-
residuals
(segments=None)[source]¶ Difference in log2 value of each bin from its segment mean.
Parameters: segments (GenomicArray, CopyNumArray, or None) – Determines the “mean” value to which self log2 values are relative:
- If CopyNumArray, use the log2 values as the segment means to subtract.
- If GenomicArray with no log2 values, group self by these ranges and subtract each group’s median log2 value.
- If None, subtract each chromosome’s median.
Returns: Residual log2 values from self relative to segments; same length as self. Return type: array
-
shift_xx
(male_reference=False, is_xx=None)[source]¶ Adjust chrX log2 ratios (subtract 1) for apparent female samples.
-
squash_genes
(summary_func=<function biweight_location>, squash_background=False, ignore=('-', '.', 'CGH'))[source]¶ Combine consecutive bins with the same targeted gene name.
Parameters: - summary_func (callable) – Function to summarize an array of log2 values to produce a new log2 value for a “squashed” (i.e. reduced) region. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.
- squash_background (bool) – If True, also reduce consecutive “Background” bins into a single bin. Otherwise, keep “Background” and ignored bins as they are in the output.
- ignore (list or tuple of str) – Bin names to be treated as “Background” instead of as unique genes.
Returns: Another, usually smaller, copy of self with each gene’s bins reduced to a single bin with appropriate values.
Return type:
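The core idea, collapsing runs of consecutive bins that share a gene name, can be sketched with itertools.groupby. This is a simplified illustration: squash_bins is a hypothetical name, the median stands in for the default biweight location, and the special handling of “Background” and ignored names is left out.

```python
from itertools import groupby
from statistics import median


def squash_bins(bins, summary_func=median):
    """Collapse consecutive bins with the same chromosome and gene name.

    bins: list of (chromosome, start, end, gene, log2), sorted by position.
    """
    squashed = []
    for (chrom, gene), group in groupby(bins, key=lambda b: (b[0], b[3])):
        group = list(group)
        squashed.append((
            chrom,
            group[0][1],   # start of the first bin in the run
            group[-1][2],  # end of the last bin in the run
            gene,
            summary_func([b[4] for b in group]),
        ))
    return squashed


bins = [
    ("chr1", 0, 100, "GENE_A", 0.1),
    ("chr1", 100, 200, "GENE_A", 0.3),
    ("chr1", 200, 300, "GENE_A", 0.2),
    ("chr1", 300, 400, "GENE_B", -1.0),
]
squashed = squash_bins(bins)
```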
-
vary
¶
An array of genomic intervals, treated as variant loci.
-
class
cnvlib.vary.
VariantArray
(data_table, meta_dict=None)[source]¶ Bases:
cnvlib.genome.gary.GenomicArray
An array of genomic intervals, treated as variant loci.
Required columns: chromosome, start, end, ref, alt
-
baf_by_ranges
(ranges, summary_func=<function nanmedian>, above_half=None, tumor_boost=True)[source]¶ Aggregate variant (b-allele) frequencies in each given bin.
Get the average BAF in each of the bins of another genomic array: BAFs are mirrored above/below 0.5 (per above_half), grouped in each bin of ranges, and summarized into one value per bin with summary_func (default median).
Parameters: - ranges (GenomicArray or subclass) – Bins for grouping the variants in self.
- above_half (bool) – The same as in mirrored_baf.
- tumor_boost (bool) – The same as in mirrored_baf.
Returns: Average b-allele frequency in each range; same length as ranges. May contain NaN values where no variants overlap a range.
Return type: float array
-
heterozygous
()[source]¶ Subset to only heterozygous variants.
Use ‘zygosity’ or ‘n_zygosity’ genotype values (if present) to exclude variants with value 0.0 or 1.0. If these columns are missing, or there are no heterozygous variants, then return the full (input) set of variants.
Returns: The subset of self with heterozygous genotype, or allele frequency between the specified thresholds. Return type: VariantArray
-
mirrored_baf
(above_half=None, tumor_boost=False)[source]¶ Mirrored B-allele frequencies (BAFs).
Parameters: - above_half (bool or None) – If specified, flip BAFs to be all above 0.5 (True) or below 0.5 (False), respectively, for consistency. Otherwise, if None, mirror in the direction of the majority of BAFs.
- tumor_boost (bool) – Normalize tumor-sample allele frequencies to the matched normal sample’s allele frequencies.
Returns: Mirrored b-allele frequencies, the same length as self. May contain NaN values.
Return type: float array
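Mirroring reflects each frequency around 0.5 so that all values fall on the same side. The sketch below covers only the above_half logic; mirror_bafs is a hypothetical name, TumorBoost normalization is omitted, and the tie-break when above_half is None is an assumption.

```python
def mirror_bafs(bafs, above_half=None):
    """Reflect b-allele frequencies around 0.5 onto one side."""
    if above_half is None:
        # Mirror in the direction of the majority (ties go above; an assumption)
        n_above = sum(1 for b in bafs if b > 0.5)
        n_below = sum(1 for b in bafs if b < 0.5)
        above_half = n_above >= n_below
    sign = 1 if above_half else -1
    return [0.5 + sign * abs(b - 0.5) for b in bafs]


bafs = [0.25, 0.75, 0.5, 0.875]
mirrored_up = mirror_bafs(bafs, above_half=True)
mirrored_down = mirror_bafs(bafs, above_half=False)
mirrored_majority = mirror_bafs(bafs)
```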
-
tumor_boost
()[source]¶ TumorBoost normalization of tumor-sample allele frequencies.
De-noises the signal for detecting LOH.
See: TumorBoost, Bengtsson et al. 2010
-
zygosity_from_freq
(het_freq=0.0, hom_freq=1.0)[source]¶ Set zygosity (genotype) according to allele frequencies.
Creates or replaces ‘zygosity’ column if ‘alt_freq’ column is present, and ‘n_zygosity’ if ‘n_alt_freq’ is present.
Parameters: - het_freq (float) – Assign zygosity 0.5 (heterozygous), otherwise 0.0 (i.e. reference genotype), to variants with alt allele frequency of at least this value.
- hom_freq (float) – Assign zygosity 1.0 (homozygous) to variants with alt allele frequency of at least this value.
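The two thresholds partition allele frequencies into three genotype classes. The following sketch shows that partition on a plain list; zygosity_sketch is a hypothetical name, and the thresholds 0.25 and 0.75 in the usage lines are illustrative values, not recommended settings.

```python
def zygosity_sketch(alt_freqs, het_freq=0.0, hom_freq=1.0):
    """Assign 0.0 (reference), 0.5 (heterozygous) or 1.0 (homozygous)."""
    zygosity = []
    for freq in alt_freqs:
        if freq >= hom_freq:
            zygosity.append(1.0)
        elif freq >= het_freq:
            zygosity.append(0.5)
        else:
            zygosity.append(0.0)
    return zygosity


alt_freqs = [0.02, 0.4, 0.97]
with_defaults = zygosity_sketch(alt_freqs)
with_thresholds = zygosity_sketch(alt_freqs, het_freq=0.25, hom_freq=0.75)
```

With the default thresholds every frequency below 1.0 is treated as heterozygous, so narrower thresholds are needed to separate the three classes.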
-
Tabular file I/O¶
tabio
¶
I/O for tabular formats of genomic data (regions or features).
-
cnvlib.tabio.
read
(infile, fmt='tab', into=None, sample_id=None, meta=None, **kwargs)[source]¶ Read tabular data from a file or stream into a genome object.
Supported formats: see READERS
If a format supports multiple samples, return the sample specified by sample_id, or if unspecified, return the first sample and warn if there were other samples present in the file.
Parameters: - infile (handle or string) – Filename or opened file-like object to read.
- fmt (string) – File format.
- into (class) – GenomicArray class or subclass to instantiate, overriding the default for the target file format.
- sample_id (string) – Sample identifier.
- meta (dict) – Metadata, as arbitrary key-value pairs.
- **kwargs –
Additional keyword arguments to the format-specific reader function.
Returns: The data from the given file instantiated as into, if specified, or the default base class for the given file format (usually GenomicArray).
Return type: GenomicArray or subclass
-
cnvlib.tabio.
read_auto
(infile)[source]¶ Auto-detect a file’s format and use an appropriate parser to read it.
-
cnvlib.tabio.
read_cna
(infile, sample_id=None, meta=None)[source]¶ Read a tabular file to create a CopyNumArray object.
Interface to CNVkit sub-commands¶
commands
¶
The public API for each of the commands defined in the CNVkit workflow.
Command-line interface and corresponding API for CNVkit.
-
cnvlib.commands.
do_target
(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]¶ Transform bait intervals into targets more suitable for CNVkit.
-
cnvlib.commands.
do_access
(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]¶ List the locations of accessible sequence regions in a FASTA file.
-
cnvlib.commands.
do_antitarget
(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]¶ Derive a background/antitarget BED file from a target BED file.
-
cnvlib.commands.
do_coverage
(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]¶ Calculate coverage in the given regions from BAM read depths.
-
cnvlib.commands.
do_reference
(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Compile a coverage reference from the given files (normal samples).
-
cnvlib.commands.
do_reference_flat
(targets, antitargets=None, fa_fname=None, male_reference=False)[source]¶ Compile a neutral-coverage reference from the given intervals.
Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.
-
cnvlib.commands.
do_fix
(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Combine target and antitarget coverages and correct for biases.
-
cnvlib.commands.
do_segmentation
(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]¶ Infer copy number segments from the given coverage table.
-
cnvlib.commands.
do_call
(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]¶
-
cnvlib.commands.
do_scatter
(cnarr, segments=None, variants=None, show_range=None, show_gene=None, background_marker=None, do_trend=False, window_width=1000000.0, y_min=None, y_max=None, title=None, segment_color='darkorange')[source]¶ Plot probe log2 coverages and segmentation calls together.
-
cnvlib.commands.
do_heatmap
(cnarrs, show_range=None, do_desaturate=False)[source]¶ Plot copy number for multiple samples as a heatmap.
-
cnvlib.commands.
do_breaks
(probes, segments, min_probes=1)[source]¶ List the targeted genes in which a copy number breakpoint occurs.
-
cnvlib.commands.
do_gainloss
(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]¶ Identify targeted genes with copy number gain or loss.
-
cnvlib.commands.
do_sex
(cnarrs, is_male_reference)[source]¶ Guess samples’ sex from the relative coverage of chromosomes X and Y.
-
cnvlib.commands.
do_metrics
(cnarrs, segments=None, skip_low=False)[source]¶ Compute coverage deviations and other metrics for self-evaluation.
The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.
access
¶
List the locations of accessible sequence regions in a FASTA file.
Inaccessible regions, e.g. telomeres and centromeres, are masked out with N in the reference genome sequence; this script scans those to identify the coordinates of the accessible regions (those between the long spans of N’s).
-
cnvlib.access.
do_access
(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]¶ List the locations of accessible sequence regions in a FASTA file.
-
cnvlib.access.
get_regions
(fasta_fname)[source]¶ Find accessible sequence regions (those not masked out with ‘N’).
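The scan for accessible regions can be sketched on an in-memory sequence string. This is an illustration rather than the cnvlib code: accessible_regions is a hypothetical helper, it takes a string instead of a FASTA filename, and joining regions separated by fewer than min_gap_size N’s is an assumed reading of the min_gap_size option of do_access.

```python
import re


def accessible_regions(sequence, min_gap_size=5):
    """Return (start, end) spans not masked with N, joining across short gaps."""
    regions = []
    for match in re.finditer(r"[^Nn]+", sequence):
        start, end = match.span()
        if regions and start - regions[-1][1] < min_gap_size:
            # Short run of N's: extend the previous region across the gap
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


sequence = "NNNACGTNNACGTNNNNNNACGTNN"
regions = accessible_regions(sequence, min_gap_size=5)
```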
antitarget
¶
Supporting functions for the ‘antitarget’ command.
-
cnvlib.antitarget.
do_antitarget
(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]¶ Derive a background/antitarget BED file from a target BED file.
-
cnvlib.antitarget.
get_background
(targets, accessible, avg_bin_size, min_bin_size)[source]¶ Generate background intervals from target intervals.
Procedure:
Invert target intervals
Subtract the inverted targets from accessible regions
For each of the resulting regions:
- Shrink by a fixed margin on each end
- If it’s smaller than min_bin_size, skip
- Divide into equal-size (region_size/avg_bin_size) portions
- Emit the (chrom, start, end) coords of each portion
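The per-region part of the procedure above can be sketched as follows. The function split_region, its margin default, and the rounding used to pick the number of portions are assumptions made for illustration.

```python
def split_region(chrom, start, end, avg_bin_size, min_bin_size, margin=0):
    """Shrink a region by margin, then divide it into near-equal portions."""
    start += margin
    end -= margin
    size = end - start
    if size < min_bin_size:
        return []  # too small after shrinking: skip
    # Number of portions, roughly region_size / avg_bin_size
    nbins = max(1, round(size / avg_bin_size))
    edges = [start + (size * i) // nbins for i in range(nbins + 1)]
    return [(chrom, edges[i], edges[i + 1]) for i in range(nbins)]


whole = split_region("chr1", 0, 1000, avg_bin_size=300, min_bin_size=100)
shrunk = split_region("chr1", 0, 1000, avg_bin_size=300, min_bin_size=100, margin=50)
skipped = split_region("chr1", 0, 80, avg_bin_size=300, min_bin_size=100)
```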
call
¶
Call copy number variants from segmented log2 ratios.
-
cnvlib.call.
absolute_clonal
(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]¶ Calculate absolute copy number values from segment or bin log2 ratios.
-
cnvlib.call.
absolute_dataframe
(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]¶ Absolute, expected and reference copy number in a DataFrame.
-
cnvlib.call.
absolute_expect
(cnarr, ploidy, is_sample_female)[source]¶ Absolute integer number of expected copies in each bin.
I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified chromosomal sex.
-
cnvlib.call.
absolute_pure
(cnarr, ploidy, is_reference_male)[source]¶ Calculate absolute copy number values from segment or bin log2 ratios.
-
cnvlib.call.
absolute_reference
(cnarr, ploidy, is_reference_male)[source]¶ Absolute integer number of reference copies in each bin.
I.e. the given ploidy for autosomes, 1 or 2 X according to the reference sex, and always 1 copy of Y.
-
cnvlib.call.
absolute_threshold
(cnarr, ploidy, thresholds, is_reference_male)[source]¶ Call integer copy number using hard thresholds for each level.
Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.
Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:
    R> log2(2:6 / 4)
    -1.0 -0.4150375 0.0 0.3219281 0.5849625
Allowing for random noise of +/- 0.1, the cutoffs are:
    DEL(0) < -1.1
    LOSS(1) < -0.25
    GAIN(3) >= +0.2
    AMP(4) >= +0.7
For germline samples, better precision could be achieved with:
    LOSS(1) < -0.4
    GAIN(3) >= +0.3
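The thresholding rule can be sketched for a single log2 ratio. This is a simplified illustration, not the cnvlib function: call_copies is a hypothetical name, it ignores sex chromosomes and the reference sex, and it reads “rounding up” as a ceiling of ploidy * 2^log2. The first line reproduces the R calculation quoted above.

```python
import math

# Expected log2 ratios from the heuristic: log2(n / 4) for n = 2..6
expected = [round(math.log2(n / 4), 7) for n in range(2, 7)]


def call_copies(log2_ratio, thresholds=(-1.1, -0.25, 0.2, 0.7), ploidy=2):
    """Call an integer copy number from one log2 ratio using hard thresholds."""
    for copies, threshold in enumerate(thresholds):
        # Counting up from zero: the first threshold exceeded by none wins
        if log2_ratio < threshold:
            return copies
    # Above the last threshold: assume full purity and round up
    return math.ceil(ploidy * 2 ** log2_ratio)


calls = [call_copies(x) for x in (-2.0, -0.5, 0.0, 0.3, 0.7, 1.5)]
```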
-
cnvlib.call.
do_call
(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]¶
coverage
¶
Supporting functions for the ‘coverage’ command.
-
cnvlib.coverage.
bedcov
(bed_fname, bam_fname, min_mapq)[source]¶ Calculate depth of all regions in a BED file via samtools (pysam) bedcov.
i.e. mean pileup depth across each region.
-
cnvlib.coverage.
do_coverage
(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]¶ Calculate coverage in the given regions from BAM read depths.
-
cnvlib.coverage.
interval_coverages
(bed_fname, bam_fname, by_count, min_mapq, processes)[source]¶ Calculate log2 coverages in the BAM file at each interval.
-
cnvlib.coverage.
interval_coverages_count
(bed_fname, bam_fname, min_mapq, procs=1)[source]¶ Calculate log2 coverages in the BAM file at each interval.
diagram
¶
Chromosome diagram drawing functions.
This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)
-
cnvlib.diagram.
bc_chromosome_draw_label
(self, cur_drawing, label_name)[source]¶ Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.
Draw a label for the chromosome. Mod: above the chromosome, not below.
-
cnvlib.diagram.
bc_organism_draw
(org, title, wrap=12)[source]¶ Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.
Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.
Parameters: - org – The chromosome diagram object being modified.
- title (str) – The output title of the produced document.
- wrap (int) – Maximum number of chromosomes per row; the remainder will be wrapped to the next row(s).
export
¶
Export CNVkit objects and files to other formats.
-
cnvlib.export.
create_chrom_ids
(segments)[source]¶ Map chromosome names to integers in the order encountered.
-
cnvlib.export.
export_bed
(segments, ploidy, is_reference_male, is_sample_female, label, show)[source]¶ Convert a copy number array to a BED-like DataFrame.
For each region in each sample (possibly filtered according to show), the columns are:
- reference sequence name
- start (0-indexed)
- end
- sample name or given label
- integer copy number
By default (show="ploidy"), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by --ploidy. If show="variant", skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.
-
cnvlib.export.
export_nexus_basic
(cnarr)[source]¶ Biodiscovery Nexus Copy Number “basic” format.
Only represents one sample per file.
-
cnvlib.export.
export_nexus_ogt
(cnarr, varr, min_weight=0.0)[source]¶ Biodiscovery Nexus Copy Number “Custom-OGT” format.
To create the b-allele frequencies column, alternate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above or below .5 (majority rules), then the median of those values is taken.
-
cnvlib.export.
export_seg
(sample_fnames)[source]¶ SEG format for copy number segments.
Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.
-
cnvlib.export.
export_theta
(tumor_segs, normal_cn)[source]¶ Convert tumor segments and normal .cnr or reference .cnn to THetA input.
Follows the THetA segmentation import script but avoids repeating the pileups, since we already have the mean depth of coverage in each target bin.
The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.
THetA2 input format is tabular, with columns:
    ID, chrm, start, end, tumorCount, normalCount
where chromosome IDs (“chrm”) are integers 1 through 24.
-
cnvlib.export.
export_theta_snps
(varr)[source]¶ Generate THetA’s SNP per-allele read count “formatted.txt” files.
-
cnvlib.export.
export_vcf
(segments, ploidy, is_reference_male, is_sample_female, sample_id=None)[source]¶ Convert segments to Variant Call Format.
For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)
-
cnvlib.export.
merge_samples
(filenames)[source]¶ Merge probe values from multiple samples into a 2D table (of sorts).
- Input:
- dict of {sample ID: (probes, values)}
- Output:
- list-of-tuples: (probe, log2 coverages...)
-
cnvlib.export.
ref_means_nbins
(tumor_segs, normal_cn)[source]¶ Extract segments’ reference mean log2 values and probe counts.
Code paths:
    wt_mdn  wt_old  probes  norm  ->  norm, nbins
      +       *       *      -        0, wt_mdn
      -       +       +      -        0, wt_old * probes
      -       +       -      -        0, wt_old * size?
      -       -       +      -        0, probes
      -       -       -      -        0, size?
      +       -       +      +        norm, probes
      +       -       -      +        norm, bin counts
      -       +       +      +        norm, probes
      -       +       -      +        norm, bin counts
      -       -       +      +        norm, probes
      -       -       -      +        norm, bin counts
-
cnvlib.export.
segments2vcf
(segments, ploidy, is_reference_male, is_sample_female)[source]¶ Convert copy number segments to VCF records.
-
cnvlib.export.
theta_read_counts
(log2_ratio, nbins, avg_depth=500, avg_bin_width=200, read_len=100)[source]¶ Calculate segments’ read counts from log2-ratios.
- Math:
- nbases = read_length * read_count
- and
- nbases = bin_width * read_depth
- where
- read_depth = read_depth_ratio * avg_depth
- So:
- read_length * read_count = bin_width * read_depth
- read_count = bin_width * read_depth / read_length
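The arithmetic above can be written out directly. The sketch below is an illustration, not the cnvlib function: read_count_sketch is a hypothetical name, and both the conversion read_depth_ratio = 2^log2_ratio and the multiplication by nbins to get a per-segment total are assumptions.

```python
def read_count_sketch(log2_ratio, nbins, avg_depth=500, avg_bin_width=200, read_len=100):
    """Estimate a segment's read count from its log2 ratio and bin count."""
    read_depth_ratio = 2 ** log2_ratio           # assumed conversion from log2
    read_depth = read_depth_ratio * avg_depth    # read_depth = ratio * avg_depth
    per_bin = avg_bin_width * read_depth / read_len  # read_count for one bin
    return nbins * per_bin                       # assumed: sum over the segment's bins


neutral = read_count_sketch(0.0, nbins=10)
gain = read_count_sketch(1.0, nbins=10)
loss = read_count_sketch(-1.0, nbins=1)
```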
fix
¶
Supporting functions for the ‘fix’ command.
-
cnvlib.fix.
apply_weights
(cnarr, ref_matched, epsilon=0.0001)[source]¶ Calculate weights for each bin.
Weights are derived from:
- bin sizes
- average bin coverage depths in the reference
- the “spread” column of the reference.
-
cnvlib.fix.
center_by_window
(cnarr, fraction, sort_key)[source]¶ Smooth out biases according to the trait specified by sort_key.
E.g. correct GC-biased bins by windowed averaging across similar-GC bins; or for similar interval sizes.
-
cnvlib.fix.
do_fix
(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Combine target and antitarget coverages and correct for biases.
-
cnvlib.fix.
edge_gains
(target_sizes, gap_sizes, insert_size)[source]¶ Calculate coverage gain from neighboring baits’ flanking reads.
Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:
\[(i-g)^2 / 4it\]
If the neighbor flank extends beyond the target (t+g < i), reduce by:
\[(i-t-g)^2 / 4it\]
If a neighbor overlaps the target, treat it as adjacent (gap size 0).
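The gain formula can be evaluated for a single target and one neighboring bait. This is a scalar sketch of the stated formulas only; edge_gain is a hypothetical name and the numbers in the usage lines are arbitrary.

```python
def edge_gain(target_size, gap_size, insert_size):
    """Proportional coverage gain from one neighboring bait's flanking reads."""
    i, t = insert_size, target_size
    g = max(gap_size, 0)  # an overlapping neighbor is treated as adjacent
    if g >= i:
        return 0.0  # neighbor too far away to contribute reads
    gain = (i - g) ** 2 / (4 * i * t)
    if t + g < i:
        # The neighbor's flank extends beyond the target: remove the excess
        gain -= (i - t - g) ** 2 / (4 * i * t)
    return gain


near = edge_gain(target_size=400, gap_size=100, insert_size=200)
far = edge_gain(target_size=400, gap_size=300, insert_size=200)
small_target = edge_gain(target_size=50, gap_size=100, insert_size=200)
```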
-
cnvlib.fix.
edge_losses
(target_sizes, insert_size)[source]¶ Calculate coverage losses at the edges of baited regions.
Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:
\[i/2t\]
If the “shoulders” extend outside the bait (t < i), reduce by:
\[(i-t)^2 / 4it\]
on each side, or (i-t)^2 / 2it total.
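As a worked check of the loss formula, the sketch below evaluates it for one target. It is a scalar illustration of the stated formulas; edge_loss is a hypothetical name and the sizes are arbitrary.

```python
def edge_loss(target_size, insert_size):
    """Proportional coverage loss near the two edges of a baited region."""
    i, t = insert_size, target_size
    loss = i / (2 * t)
    if t < i:
        # Shoulders extend outside the bait: (i-t)^2 / 4it on each side
        loss -= (i - t) ** 2 / (2 * i * t)
    return loss


wide_target = edge_loss(target_size=400, insert_size=200)
narrow_target = edge_loss(target_size=100, insert_size=200)
```

For the wide target the loss is i/2t = 0.25; for the narrow target the uncorrected 1.0 is reduced by 0.25 to 0.75.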
-
cnvlib.fix.
get_edge_bias
(cnarr, margin)[source]¶ Quantify the “edge effect” of the target tile and its neighbors.
The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.
(This is not the actual change in coverage. This is just a tribute.)
-
cnvlib.fix.
load_adjust_coverages
(cnarr, ref_cnarr, skip_low, fix_gc, fix_edge, fix_rmask)[source]¶ Load and filter probe coverages; correct using reference and GC.
importers
¶
Import from other formats to the CNVkit format.
metrics
¶
Robust metrics to evaluate performance of copy number estimates.
-
cnvlib.metrics.
confidence_interval_bootstrap
(bins, alpha, bootstraps=100, smoothed=True)[source]¶ Confidence interval for segment mean log2 value, estimated by bootstrap.
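For orientation, a plain percentile bootstrap of a mean can be sketched with the standard library alone. This is only an illustration of the general technique: `bootstrap_ci` and its `seed` argument are hypothetical, the values are unweighted, and the smoothed variant selected by `smoothed=True` is not reproduced here.

```python
import random
import statistics


def bootstrap_ci(values, alpha=0.05, bootstraps=100, seed=0):
    """Percentile-bootstrap confidence interval for the mean (illustrative)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(bootstraps)
    )
    lo_idx = int((alpha / 2) * (bootstraps - 1))
    hi_idx = int((1 - alpha / 2) * (bootstraps - 1))
    return means[lo_idx], means[hi_idx]


log2_values = [0.21, 0.35, 0.18, 0.40, 0.27, 0.31, 0.25, 0.29]
low, high = bootstrap_ci(log2_values, alpha=0.05, bootstraps=200)
print(low, high)
```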
-
cnvlib.metrics.
do_metrics
(cnarrs, segments=None, skip_low=False)[source]¶ Compute coverage deviations and other metrics for self-evaluation.
-
cnvlib.metrics.
ests_of_scale
(deviations)[source]¶ Estimators of scale: standard deviation, MAD, biweight midvariance.
Calculates all of these values for an array of deviations and returns them as a tuple.
reference
¶
Supporting functions for the ‘reference’ command.
-
cnvlib.reference.
calculate_gc_lo
(subseq)[source]¶ Calculate the GC and lowercase (RepeatMasked) content of a string.
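A minimal sketch of this kind of calculation follows. It assumes the result is a pair of fractions and that only A, C, G and T (in either case) count toward the total, so N bases are ignored; the name `gc_and_lowercase_fractions` is hypothetical and the actual return convention may differ.

```python
def gc_and_lowercase_fractions(subseq):
    """Return (GC fraction, lowercase fraction) of a DNA string (illustrative)."""
    gc = sum(subseq.count(base) for base in "GCgc")
    lower = sum(subseq.count(base) for base in "acgt")
    total = sum(subseq.count(base) for base in "ACGTacgt")
    if total == 0:
        # Empty or all-N sequence: nothing to measure.
        return 0.0, 0.0
    return gc / total, lower / total


print(gc_and_lowercase_fractions("ACGTacgtNN"))
```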
-
cnvlib.reference.
combine_probes
(filenames, fa_fname, is_male_reference, is_female_sample, skip_low, fix_gc, fix_edge, fix_rmask)[source]¶ Calculate the median coverage of each bin across multiple samples.
Parameters: - filenames (list) – List of string filenames, corresponding to targetcoverage.cnn and antitargetcoverage.cnn files, as generated by ‘coverage’ or ‘import-picard’.
- fa_fname (str) – Reference genome sequence in FASTA format, used to extract GC and RepeatMasker content of each genomic bin.
- is_male_reference (bool) –
- skip_low (bool) –
- fix_gc (bool) –
- fix_edge (bool) –
- fix_rmask (bool) –
Returns: One object summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and GC content.
Return type:
-
cnvlib.reference.
do_reference
(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Compile a coverage reference from the given files (normal samples).
-
cnvlib.reference.
do_reference_flat
(targets, antitargets=None, fa_fname=None, male_reference=False)[source]¶ Compile a neutral-coverage reference from the given intervals.
Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.
-
cnvlib.reference.
fasta_extract_regions
(fa_fname, intervals)[source]¶ Extract an iterable of regions from an indexed FASTA file.
Input: FASTA file name; iterable of (seq_id, start, end) (1-based)
Output: iterable of string sequences.
-
cnvlib.reference.
get_fasta_stats
(cnarr, fa_fname)[source]¶ Calculate GC and RepeatMasker content of each bin in the FASTA genome.
reports
¶
Supports the sub-commands breaks and gainloss.
Supporting functions for the text/tabular-reporting commands.
Namely: breaks, gainloss.
-
cnvlib.reports.
gainloss_by_gene
(cnarr, threshold, skip_low=False)[source]¶ Identify genes where average bin copy ratio value exceeds threshold.
NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.
-
cnvlib.reports.
gainloss_by_segment
(cnarr, segments, threshold, skip_low=False)[source]¶ Identify genes where segmented copy ratio exceeds threshold.
NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.
-
cnvlib.reports.
get_breakpoints
(intervals, segments, min_probes)[source]¶ Identify CBS segment breaks within the targeted intervals.
segmentation
¶
Segmentation of copy number values.
-
cnvlib.segmentation.
do_segmentation
(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]¶ Infer copy number segments from the given coverage table.
-
cnvlib.segmentation.
drop_outliers
(cnarr, width, factor)[source]¶ Drop outlier bins with log2 ratios too far from the trend line.
Outliers are the log2 values more than factor times the 95th quantile of absolute deviations from the rolling average, within a window of given width. The 95th quantile of absolute deviations is about 1.96 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.96 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers (5% of the bins) within a window, which is plenty robust for our needs.
-
cnvlib.segmentation.
repair_segments
(segments, orig_probes)[source]¶ Post-process segmentation output.
- Ensure every chromosome has at least one segment.
- Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).
-
cnvlib.segmentation.
transfer_fields
(segments, cnarr, ignore=('-', '.', 'CGH'))[source]¶ Map gene names, weights, depths from cnarr bins to segarr segments.
Segment gene name is the comma-separated list of bin gene names. Segment weight is the sum of bin weights, and depth is the (weighted) mean of bin depths.
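The aggregation rules just described can be sketched for a single segment. The helper `summarize_segment` and its tuple-based input are hypothetical. De-duplicating gene names while keeping their order is an assumption made for readability.

```python
def summarize_segment(bins, ignore=("-", ".", "CGH")):
    """Aggregate (gene, weight, depth) bins into one segment summary (illustrative)."""
    genes = []
    for gene, _weight, _depth in bins:
        # Skip placeholder names; keep each real name once, in order (assumption).
        if gene not in ignore and gene not in genes:
            genes.append(gene)
    total_weight = sum(weight for _gene, weight, _depth in bins)
    if total_weight > 0:
        depth = sum(weight * depth for _gene, weight, depth in bins) / total_weight
    else:
        depth = 0.0
    return ",".join(genes), total_weight, depth


bins = [("TP53", 1.0, 10.0), ("TP53", 3.0, 20.0), ("-", 1.0, 30.0), ("BRCA1", 5.0, 20.0)]
print(summarize_segment(bins))
```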
target
¶
Transform bait intervals into targets more suitable for CNVkit.
-
cnvlib.target.
do_target
(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]¶ Transform bait intervals into targets more suitable for CNVkit.
-
cnvlib.target.
filter_names
(names, exclude=('mRNA', ))[source]¶ Remove less-meaningful accessions from the given set.
-
cnvlib.target.
shorten_labels
(gene_labels)[source]¶ Reduce multi-accession interval labels to the minimum consistent.
So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.
e.g.:
...
mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104
ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L
...
becomes:
...
mRNA|AF161376
mRNA|AF161376
...
Helper modules¶
core
¶
CNV utilities.
-
cnvlib.core.
assert_equal
(msg, **values)[source]¶ Evaluate and compare two or more values for equality.
Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.
Example:
>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy']))
...
ValueError: Mismatch: expected = 1, saw = 2
-
cnvlib.core.
call_quiet
(*args)[source]¶ Safely run a command and get stdout; print stderr if there’s an error.
Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.
-
cnvlib.core.
check_unique
(items, title)[source]¶ Ensure all items in an iterable are identical; return that one item.
-
cnvlib.core.
ensure_path
(fname)[source]¶ Create dirs and move an existing file to avoid overwriting, if necessary.
If a file already exists at the given path, it is renamed with an integer suffix to clear the way.
-
cnvlib.core.
safe_write
(*args, **kwds)[source]¶ Write to a filename or file-like object with error handling.
If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.
-
cnvlib.core.
temp_write_text
(*args, **kwds)[source]¶ Save text to a temporary file.
NB: This won’t work on Windows because the file stays open.
-
cnvlib.core.
write_dataframe
(outfname, dframe, header=True)[source]¶ Write a pandas.DataFrame to a tabular file.
descriptives
¶
Robust estimators of central tendency and scale.
See:
- https://en.wikipedia.org/wiki/Robust_measures_of_scale
- https://astropy.readthedocs.io/en/latest/_modules/astropy/stats/funcs.html
-
cnvlib.descriptives.
biweight_location
(a, initial=None, c=6.0, epsilon=0.001, max_iter=5)[source]¶ Compute the biweight location for an array.
The biweight is a robust statistic for estimating the central location of a distribution.
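To make the idea concrete, here is a single-step sketch of Tukey's biweight location, starting from the median and using the unscaled MAD. It is an illustration only: `biweight_location_one_step` is a hypothetical name, and the iteration controlled by `epsilon` and `max_iter` in the signature above is not reproduced.

```python
import statistics


def biweight_location_one_step(values, c=6.0):
    """One-step biweight location estimate around the median (illustrative)."""
    center = statistics.median(values)
    deviations = [v - center for v in values]
    mad = statistics.median(abs(d) for d in deviations)
    if mad == 0:
        # At least half the values equal the median; nothing to reweight.
        return center
    numerator = 0.0
    denominator = 0.0
    for d in deviations:
        u = d / (c * mad)
        if abs(u) < 1:
            # Points beyond c * MAD get zero weight; the rest are down-weighted.
            w = (1 - u * u) ** 2
            numerator += d * w
            denominator += w
    return center + numerator / denominator


data = [1, 2, 3, 4, 5, 100]
print(biweight_location_one_step(data), statistics.fmean(data))
```

The outlier at 100 pulls the arithmetic mean far from the bulk of the data, while the biweight estimate stays near the center of the remaining values.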
-
cnvlib.descriptives.
biweight_midvariance
(a, initial=None, c=9.0, epsilon=0.001)[source]¶ Compute the biweight midvariance for an array.
The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.
See:
-
cnvlib.descriptives.
gapper_scale
(a)[source]¶ Scale estimator based on gaps between order statistics.
See:
- Wainer & Thissen (1976)
- Beers, Flynn, and Gebhardt (1990)
-
cnvlib.descriptives.
interquartile_range
(a)[source]¶ Compute the difference between the array’s first and third quartiles.
-
cnvlib.descriptives.
mean_squared_error
(a, initial=None)[source]¶ Mean squared error (MSE).
By default, assume the input array a is the residuals/deviations/error, so MSE is calculated from zero. Another reference point for calculating the error can be specified with initial.
-
cnvlib.descriptives.
median_absolute_deviation
(a, scale_to_sd=True)[source]¶ Compute the median absolute deviation (MAD) of array elements.
The MAD is defined as:
median(abs(a - median(a)))
See: https://en.wikipedia.org/wiki/Median_absolute_deviation
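The definition above can be sketched in a few lines. The name `mad_sketch` is hypothetical, and the factor 1.4826 is the usual consistency constant for Gaussian data. That `scale_to_sd=True` applies exactly this constant is an assumption.

```python
import statistics


def mad_sketch(values, scale_to_sd=True):
    """Median absolute deviation, optionally rescaled to estimate the SD."""
    center = statistics.median(values)
    result = statistics.median(abs(v - center) for v in values)
    if scale_to_sd:
        # Assumed scaling: MAD * 1.4826 estimates the SD of Gaussian data.
        result *= 1.4826
    return result


print(mad_sketch([1, 2, 3, 4, 100], scale_to_sd=False))
print(mad_sketch([1, 2, 3, 4, 100]))
```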
-
cnvlib.descriptives.
modal_location
(a)[source]¶ Return the modal value of an array’s values.
The “mode” is the location of peak density among the values, estimated using a Gaussian kernel density estimator.
Parameters: a (np.array) – A 1-D array of floating-point values, e.g. bin log2 ratio values.
-
cnvlib.descriptives.
q_n
(a)[source]¶ Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.
Qn := Cn first quartile of (|x_i - x_j|: i < j)
where Cn is a constant depending on n.
Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.
n    E[Qn]
---  -----
10   1.392
20   1.193
40   1.093
60   1.064
80   1.048
100  1.038
200  1.019
parallel
¶
Utilities for multi-core parallel processing.
-
class
cnvlib.parallel.
SerialFuture
(result)[source]¶ Bases:
future.types.newobject.newobject
Mimic the concurrent.futures.Future interface.
params
¶
Defines several constants used in the pipeline.
Hard-coded parameters for CNVkit. These should not change between runs.
plots
¶
Plotting utilities.
-
cnvlib.plots.
chromosome_sizes
(probes, to_mb=False)[source]¶ Create an ordered mapping of chromosome names to sizes.
-
cnvlib.plots.
cvg2rgb
(cvg, desaturate)[source]¶ Choose a shade of red or blue representing log2-coverage value.
-
cnvlib.plots.
gene_coords_by_name
(probes, names)[source]¶ Find the chromosomal position of each named gene in probes.
Returns: A mapping of the form {chromosome: [(start, end, gene name), ...]}
Return type: dict
-
cnvlib.plots.
gene_coords_by_range
(probes, chrom, start, end, ignore=('-', '.', 'CGH'))[source]¶ Find the chromosomal position of all genes in a range.
Returns: A mapping of the form {chromosome: [(start, end, gene), ...]}
Return type: dict
-
cnvlib.plots.
partition_by_chrom
(chrom_snvs)[source]¶ Group the tumor shift values by chromosome (for statistical testing).
-
cnvlib.plots.
plot_x_dividers
(axis, chrom_sizes, pad=None)[source]¶ Plot vertical dividers and x-axis labels given the chromosome sizes.
Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.
Returns: A table of the x-position offsets of each chromosome. Return type: OrderedDict
samutil
¶
BAM utilities.
-
cnvlib.samutil.
bam_total_reads
(bam_fname)[source]¶ Count the total number of mapped reads in a BAM file.
Uses the BAM index to do this quickly.
-
cnvlib.samutil.
ensure_bam_index
(bam_fname)[source]¶ Ensure a BAM file is indexed, to enable fast traversal & lookup.
For MySample.bam, samtools will look for an index in these files, in order:
- MySample.bam.bai
- MySample.bai
-
cnvlib.samutil.
ensure_bam_sorted
(bam_fname, by_name=False, span=50)[source]¶ Test if the reads in a BAM file are sorted as expected.
by_name=True: reads are expected to be sorted by query name. Consecutive read IDs are in alphabetical order, and read pairs appear together.
by_name=False: reads are sorted by position. Consecutive reads have increasing position.
-
cnvlib.samutil.
get_read_length
(bam, span=1000)[source]¶ Get (median) read length from first few reads in a BAM file.
Illumina reads all have the same length; other sequencers might not.
Parameters: - bam (str or pysam.Samfile) – Filename or pysam-opened BAM file.
- span (int) – Number of reads used to calculate median read length.
segfilters
¶
Filter copy number segments.
-
cnvlib.segfilters.
ampdel
(segarr)[source]¶ Merge segments by amplified/deleted/neutral copy number status.
Follow the clinical reporting convention:
- 5+ copies (2.5-fold gain) is amplification
- 0 copies is homozygous/deep deletion
- CNAs of lesser degree are not reported
This is recommended only for selecting segments corresponding to actionable, usually focal, CNAs. Real and potentially informative but lower-level CNAs will be merged together.
-
cnvlib.segfilters.
bic
(segarr)[source]¶ Merge segments by Bayesian Information Criterion.
See: BIC-seq (Xi 2011), doi:10.1073/pnas.1110574108
-
cnvlib.segfilters.
ci
(segarr)[source]¶ Merge segments by confidence interval (overlapping 0).
Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
-
cnvlib.segfilters.
enumerate_changes
(levels)[source]¶ Assign a unique integer to each run of identical values.
Repeated but non-consecutive values will be assigned different integers.
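The run-labeling behavior can be sketched with `itertools.groupby`. The name `enumerate_runs` is hypothetical, and numbering the runs from 0 is an assumption; the description above only fixes that each run gets its own integer.

```python
from itertools import groupby


def enumerate_runs(levels):
    """Label each run of identical consecutive values with its own integer."""
    labels = []
    for run_id, (_value, run) in enumerate(groupby(levels)):
        # Every element of the run receives the same label.
        labels.extend(run_id for _ in run)
    return labels


# The second "gain" run is not adjacent to the first, so it gets a new label.
print(enumerate_runs(["gain", "gain", "neutral", "gain"]))
```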
-
cnvlib.segfilters.
require_column
(*colnames)[source]¶ Wrapper to coordinate the segment-filtering functions.
Verify that the given columns are in the CopyNumArray the wrapped function takes. Also log the number of rows in the array before and after filtration.
-
cnvlib.segfilters.
sem
(segarr)[source]¶ Merge segments by Standard Error of the Mean (SEM).
Use each segment’s SEM value to estimate a 95% confidence interval (via zscore). Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
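The classification step can be sketched for one segment as follows. This is a minimal illustration with hypothetical names (`classify_by_sem`, `confidence`); it derives the z-score from the standard normal distribution and leaves out the merging of adjacent segments with the same status.

```python
from statistics import NormalDist


def classify_by_sem(log2_mean, sem, confidence=0.95):
    """Label a segment as gain, loss, or neutral from its mean and SEM."""
    # Two-sided z-score, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = log2_mean - z * sem
    upper = log2_mean + z * sem
    if lower > 0:
        return "gain"
    if upper < 0:
        return "loss"
    return "neutral"


print(classify_by_sem(0.5, 0.1), classify_by_sem(-0.5, 0.1), classify_by_sem(0.1, 0.1))
```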
smoothing
¶
Signal smoothing functions.
-
cnvlib.smoothing.
check_inputs
(x, width)[source]¶ Transform width into a half-window size.
width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.
-
cnvlib.smoothing.
fit_edges
(x, y, wing, polyorder=3)[source]¶ Apply polynomial interpolation to the edges of y, in-place.
Calculates a polynomial fit (of order polyorder) of x within a window of width twice wing, then updates the smoothed values y in the half of the window closest to the edge.
-
cnvlib.smoothing.
outlier_iqr
(a, c=3.0)[source]¶ Detect outliers as a multiple of the IQR from the median.
By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.
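The rule can be sketched with the standard library as shown below. The name `outlier_iqr_sketch` is hypothetical, and the quartiles use linear interpolation (`method="inclusive"`), which is an assumption about how the IQR is computed.

```python
import statistics


def outlier_iqr_sketch(values, c=3.0):
    """Flag values more than c * IQR away from the median (illustrative)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = statistics.median(values)
    return [abs(v - center) > c * iqr for v in values]


print(outlier_iqr_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```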
-
cnvlib.smoothing.
outlier_mad_median
(a)[source]¶ MAD-Median rule for detecting outliers.
X_i is an outlier if:
\[\frac{|X_i - M|}{\mathrm{MAD} / 0.6745} > K \approx 2.24\]
where \(K = \sqrt{\chi^2_{0.975,1}}\), the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.
This is a very robust rule with the highest possible breakdown point of 0.5.
Returns: A boolean array of the same size as a, where outlier indices are True.
Return type: np.array
References
- Davies & Gather (1993) The Identification of Multiple Outliers.
- Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
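A minimal sketch of the MAD-Median rule using only the standard library is given below. The name `outlier_mad_median_sketch` is hypothetical and the result is a plain list rather than an array. The constant K is obtained from the normal distribution, since the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom equals the 0.9875 quantile of a standard normal.

```python
from statistics import NormalDist, median

# sqrt(chi-squared 0.975 quantile, 1 d.f.) == standard normal 0.9875 quantile
K = NormalDist().inv_cdf(0.9875)


def outlier_mad_median_sketch(values):
    """Flag X_i where |X_i - M| / (MAD / 0.6745) > K (illustrative)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    # Compare against K * MAD / 0.6745 to avoid dividing by a zero MAD.
    threshold = K * mad / 0.6745
    return [abs(v - center) > threshold for v in values]


print(round(K, 2))
print(outlier_mad_median_sketch([1, 2, 3, 4, 100]))
```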
-
cnvlib.smoothing.
rolling_outlier_iqr
(x, width, c=3.0)[source]¶ Detect outliers as a multiple of the IQR from the median.
By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).
-
cnvlib.smoothing.
rolling_outlier_quantile
(x, width, q, m)[source]¶ Detect outliers by multiples of a quantile in a window.
Outliers are the array elements outside m times the q-th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)
This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.
Returns: A boolean array of the same size as x, where outlier indices are True. Return type: np.array
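The windowed rule can be illustrated with a simplified, pure-Python sketch. It makes several substitutions that are assumptions and not the library's behavior: a centered moving median stands in for the smoothed trend line, windows are truncated at the edges, and the quantile is a simple lower-rank estimate. The name `rolling_outlier_quantile_sketch` is hypothetical.

```python
import statistics


def rolling_outlier_quantile_sketch(x, width, q, m):
    """Flag points whose residual exceeds m times a local quantile (illustrative)."""
    half = width // 2
    n = len(x)
    # Trend line: centered moving median over windows truncated at the edges.
    trend = [statistics.median(x[max(0, i - half):i + half + 1]) for i in range(n)]
    residuals = [abs(v - t) for v, t in zip(x, trend)]
    flags = []
    for i in range(n):
        window = sorted(residuals[max(0, i - half):i + half + 1])
        # Simple lower-rank estimate of the q-th quantile within the window.
        threshold = window[int(q * (len(window) - 1))]
        flags.append(residuals[i] > m * threshold)
    return flags


log2 = [0.0, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 5.0,
        0.0, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
flags = rolling_outlier_quantile_sketch(log2, width=7, q=0.95, m=5)
print(flags[7])
```

With very small windows the local quantile can sit close to the largest residual, so practical use relies on much wider windows, such as the width=200 quoted above for WGS.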
-
cnvlib.smoothing.
rolling_outlier_std
(x, width, stdevs)[source]¶ Detect outliers by stdev within a rolling window.
Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.
Returns: A boolean array of the same size as x, where outlier indices are True. Return type: np.array
-
cnvlib.smoothing.
rolling_quantile
(x, width, quantile)[source]¶ Rolling quantile (0–1) with mirrored edges.
-
cnvlib.smoothing.
smoothed
(x, width, do_fit_edges=False)[source]¶ Smooth the values in x with the Kaiser windowed filter.
See: https://en.wikipedia.org/wiki/Kaiser_window
Parameters: - x (array-like) – 1-dimensional numeric data set.
- width (float) – Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.