cnvlib package¶
Module cnvlib
contents¶
The one function exposed at the top level, read, loads a file in CNVkit’s BED-like tabular format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using just the CopyNumArray and GenomicArray methods available on this returned object (see Core classes).
To load other file formats, see Tabular file I/O (tabio). To run functions equivalent to CNVkit commands within Python, see Interface to CNVkit sub-commands.
Core classes¶
The core objects used throughout CNVkit. The base class is GenomicArray from
scikit-genome package. All of these classes wrap a pandas DataFrame instance, which is accessible through
the .data
attribute and can be used for any manipulations that aren’t
already provided by methods in the wrapper class.
cnary
¶
CNVkit’s core data structure, a copy number array.
-
class
cnvlib.cnary.
CopyNumArray
(data_table, meta_dict=None)[source]¶ Bases:
skgenome.gary.GenomicArray
An array of genomic intervals, treated like aCGH probes.
Required columns: chromosome, start, end, gene, log2
Optional columns: gc, rmask, spread, weight, probes
-
by_gene
(ignore=('-', '.', 'CGH'))[source]¶ Iterate over probes grouped by gene name.
Group each series of intergenic bins as an “Antitarget” gene; any “Antitarget” bins within a gene are grouped with that gene.
Bins’ gene names are split on commas to accommodate overlapping genes and bins that cover multiple genes.
Parameters: ignore (list or tuple of str) – Gene names to treat as “Antitarget” bins instead of real genes, grouping these bins with the surrounding gene or intergenic region. These bins will still retain their name in the output. Yields: tuple – Pairs of: (gene name, CNA of rows with same name)
-
center_all
(estimator=<unbound method Series.median>, by_chrom=True, skip_low=False, verbose=False)[source]¶ Re-center log2 values to the autosomes’ average (in-place).
Parameters: - estimator (str or callable) – Function to estimate central tendency. If a string, must be one of ‘mean’, ‘median’, ‘mode’, ‘biweight’ (for biweight location). Median by default.
- skip_low (bool) – Whether to drop very-low-coverage bins (via drop_low_coverage) before estimating the center value.
- by_chrom (bool) – If True, first apply estimator to each chromosome separately, then apply estimator to the per-chromosome values, to reduce the impact of uneven targeting or extreme aneuploidy. Otherwise, apply estimator to all log2 values directly.
-
compare_sex_chromosomes
(male_reference=False, skip_low=False)[source]¶ Compare coverage ratios of sex chromosomes versus autosomes.
Perform 4 Mood’s median tests of the log2 coverages on chromosomes X and Y, separately shifting for assumed male and female chromosomal sex. Compare the chi-squared values obtained to infer whether the male or female assumption fits the data better.
Parameters: - male_reference (bool) – Whether a male reference copy number profile was used to normalize the data. If so, a male sample should have log2 values of 0 on X and Y, and female +1 on X, deep negative (below -3) on Y. Otherwise, a male sample should have log2 values of -1 on X and 0 on Y, and female 0 on X, deep negative (below -3) on Y.
- skip_low (bool) – If True, drop very-low-coverage bins (via drop_low_coverage) before comparing log2 coverage ratios. Included for completeness, but shouldn’t affect the result much since the M-W test is nonparametric and p-values are not used here.
Returns: - bool – True if the sample appears male.
- dict – Calculated values used for the inference: relative log2 ratios of chromosomes X and Y versus the autosomes; the Mann-Whitney U values from each test; and ratios of U values for male vs. female assumption on chromosomes X and Y.
-
drop_low_coverage
(verbose=False)[source]¶ Drop bins with extremely low log2 coverage or copy ratio values.
These are generally bins that had no reads mapped due to sample-specific issues. A very small log2 ratio or coverage value may have been substituted to avoid domain or divide-by-zero errors.
-
expect_flat_log2
(is_male_reference=None)[source]¶ Get the uninformed expected copy ratios of each bin.
Create an array of log2 coverages like a “flat” reference.
This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).
-
guess_xx
(male_reference=False, verbose=True)[source]¶ Detect chromosomal sex; return True if a sample is probably female.
Uses compare_sex_chromosomes to calculate coverage ratios of the X and Y chromosomes versus autosomes.
Parameters: - male_reference (bool) – Was this sample normalized to a male reference copy number profile?
- verbose (bool) – If True, print (i.e. log to console) the ratios of the log2 coverages of the X and Y chromosomes versus autosomes, the “maleness” ratio of male vs. female expectations for each sex chromosome, and the inferred chromosomal sex.
Returns: True if the coverage ratios indicate the sample is female.
Return type: bool
-
log2
¶
-
residuals
(segments=None)[source]¶ Difference in log2 value of each bin from its segment mean.
Parameters: segments (GenomicArray, CopyNumArray, or None) – Determines the “mean” value to which self log2 values are relative:
- If CopyNumArray, use the log2 values as the segment means to subtract.
- If GenomicArray with no log2 values, group self by these ranges and subtract each group’s median log2 value.
- If None, subtract each chromosome’s median.
Returns: Residual log2 values from self relative to segments; same length as self. Return type: array
-
shift_xx
(male_reference=False, is_xx=None)[source]¶ Adjust chrX log2 ratios (subtract 1) for apparent female samples.
-
squash_genes
(summary_func=<function biweight_location>, squash_antitarget=False, ignore=('-', '.', 'CGH'), squash_background=None)[source]¶ Combine consecutive bins with the same targeted gene name.
Parameters: - summary_func (callable) – Function to summarize an array of log2 values to produce a new log2 value for a “squashed” (i.e. reduced) region. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.
- squash_antitarget (bool) – If True, also reduce consecutive “Antitarget” bins into a single bin. Otherwise, keep “Antitarget” and ignored bins as they are in the output.
- ignore (list or tuple of str) – Bin names to be treated as “Antitarget” instead of as unique genes.
Returns: Another, usually smaller, copy of self with each gene’s bins reduced to a single bin with appropriate values.
Return type:
-
vary
¶
An array of genomic intervals, treated as variant loci.
-
class
cnvlib.vary.
VariantArray
(data_table, meta_dict=None)[source]¶ Bases:
skgenome.gary.GenomicArray
An array of genomic intervals, treated as variant loci.
Required columns: chromosome, start, end, ref, alt
-
baf_by_ranges
(ranges, summary_func=<function nanmedian>, above_half=None, tumor_boost=False)[source]¶ Aggregate variant (b-allele) frequencies in each given bin.
Get the average BAF in each of the bins of another genomic array: BAFs are mirrored above/below 0.5 (per above_half), grouped in each bin of ranges, and summarized into one value per bin with summary_func (default median).
Parameters: - ranges (GenomicArray or subclass) – Bins for grouping the variants in self.
- above_half (bool) – The same as in mirrored_baf.
- tumor_boost (bool) – The same as in mirrored_baf.
Returns: Average b-allele frequency in each range; same length as ranges. May contain NaN values where no variants overlap a range.
Return type: float array
-
heterozygous
()[source]¶ Subset to only heterozygous variants.
Use ‘zygosity’ or ‘n_zygosity’ genotype values (if present) to exclude variants with value 0.0 or 1.0. If these columns are missing, or there are no heterozygous variants, then return the full (input) set of variants.
Returns: The subset of self with heterozygous genotype, or allele frequency between the specified thresholds. Return type: VariantArray
-
mirrored_baf
(above_half=None, tumor_boost=False)[source]¶ Mirrored B-allele frequencies (BAFs).
Parameters: - above_half (bool or None) – If specified, flip BAFs to be all above 0.5 (True) or below 0.5 (False), respectively, for consistency. Otherwise, if None, mirror in the direction of the majority of BAFs.
- tumor_boost (bool) – Normalize tumor-sample allele frequencies to the matched normal sample’s allele frequencies.
Returns: Mirrored b-allele frequencies, the same length as self. May contain NaN values.
Return type: float array
-
tumor_boost
()[source]¶ TumorBoost normalization of tumor-sample allele frequencies.
De-noises the signal for detecting LOH.
See: TumorBoost, Bengtsson et al. 2010
-
zygosity_from_freq
(het_freq=0.0, hom_freq=1.0)[source]¶ Set zygosity (genotype) according to allele frequencies.
Creates or replaces ‘zygosity’ column if ‘alt_freq’ column is present, and ‘n_zygosity’ if ‘n_alt_freq’ is present.
Parameters: - het_freq (float) – Assign zygosity 0.5 (heterozygous), otherwise 0.0 (i.e. reference genotype), to variants with alt allele frequency of at least this value.
- hom_freq (float) – Assign zygosity 1.0 (homozygous) to variants with alt allele frequency of at least this value.
-
Interface to CNVkit sub-commands¶
commands
¶
The public API for each of the commands defined in the CNVkit workflow.
Command-line interface and corresponding API for CNVkit.
-
cnvlib.commands.
do_target
(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]¶ Transform bait intervals into targets more suitable for CNVkit.
-
cnvlib.commands.
do_access
(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]¶ List the locations of accessible sequence regions in a FASTA file.
-
cnvlib.commands.
do_antitarget
(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]¶ Derive off-targt (“antitarget”) bins from target regions.
-
cnvlib.commands.
do_autobin
(bam_fname, method, targets=None, access=None, bp_per_bin=100000.0, target_min_size=20, target_max_size=20000, antitarget_min_size=500, antitarget_max_size=500000)[source]¶ Quickly calculate reasonable bin sizes from BAM read counts.
Parameters: - bam_fname (string) – BAM filename.
- method (string) – One of: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), ‘hybrid’ (hybridization capture).
- targets (GenomicArray) – Targeted genomic regions (for ‘hybrid’ and ‘amplicon’).
- access (GenomicArray) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
- bp_per_bin (int) – Desired number of sequencing read nucleotide bases mapped to each bin.
Returns: - ((target depth, target avg. bin size),
(antitarget depth, antitarget avg. bin size))
Return type: 2-tuple of 2-tuples
-
cnvlib.commands.
do_coverage
(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]¶ Calculate coverage in the given regions from BAM read depths.
-
cnvlib.commands.
do_reference
(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Compile a coverage reference from the given files (normal samples).
-
cnvlib.commands.
do_reference_flat
(targets, antitargets=None, fa_fname=None, male_reference=False)[source]¶ Compile a neutral-coverage reference from the given intervals.
Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.
-
cnvlib.commands.
do_fix
(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Combine target and antitarget coverages and correct for biases.
-
cnvlib.commands.
do_segmentation
(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]¶ Infer copy number segments from the given coverage table.
-
cnvlib.commands.
do_call
(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]¶
-
cnvlib.commands.
do_scatter
(cnarr, segments=None, variants=None, show_range=None, show_gene=None, antitarget_marker=None, do_trend=False, window_width=1000000.0, y_min=None, y_max=None, title=None, segment_color='darkorange', background_marker=None)[source]¶ Plot probe log2 coverages and segmentation calls together.
-
cnvlib.commands.
do_heatmap
(cnarrs, show_range=None, do_desaturate=False)[source]¶ Plot copy number for multiple samples as a heatmap.
-
cnvlib.commands.
do_breaks
(probes, segments, min_probes=1)[source]¶ List the targeted genes in which a copy number breakpoint occurs.
-
cnvlib.commands.
do_gainloss
(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]¶ Identify targeted genes with copy number gain or loss.
-
cnvlib.commands.
do_sex
(cnarrs, is_male_reference)[source]¶ Guess samples’ sex from the relative coverage of chromosomes X and Y.
-
cnvlib.commands.
do_sex
(cnarrs, is_male_reference)[source] Guess samples’ sex from the relative coverage of chromosomes X and Y.
-
cnvlib.commands.
do_metrics
(cnarrs, segments=None, skip_low=False)[source]¶ Compute coverage deviations and other metrics for self-evaluation.
The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.
access
¶
List the locations of accessible sequence regions in a FASTA file.
Inaccessible regions, e.g. telomeres and centromeres, are masked out with N in the reference genome sequence; this script scans those to identify the coordinates of the accessible regions (those between the long spans of N’s).
-
cnvlib.access.
do_access
(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]¶ List the locations of accessible sequence regions in a FASTA file.
-
cnvlib.access.
get_regions
(fasta_fname)[source]¶ Find accessible sequence regions (those not masked out with ‘N’).
antitarget
¶
Supporting functions for the ‘antitarget’ command.
-
cnvlib.antitarget.
do_antitarget
(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]¶ Derive off-targt (“antitarget”) bins from target regions.
-
cnvlib.antitarget.
get_antitargets
(targets, accessible, avg_bin_size, min_bin_size)[source]¶ Generate antitarget intervals between/around target intervals.
Procedure:
Invert target intervals
Subtract the inverted targets from accessible regions
For each of the resulting regions:
- Shrink by a fixed margin on each end
- If it’s smaller than min_bin_size, skip
- Divide into equal-size (region_size/avg_bin_size) portions
- Emit the (chrom, start, end) coords of each portion
autobin
¶
Estimate reasonable bin sizes from BAM read counts or depths.
-
cnvlib.autobin.
average_depth
(rc_table, read_length)[source]¶ Estimate the average read depth across the genome.
Returns: Median of the per-chromosome mean read depths, weighted by chromosome size. Return type: float
-
cnvlib.autobin.
do_autobin
(bam_fname, method, targets=None, access=None, bp_per_bin=100000.0, target_min_size=20, target_max_size=20000, antitarget_min_size=500, antitarget_max_size=500000)[source]¶ Quickly calculate reasonable bin sizes from BAM read counts.
Parameters: - bam_fname (string) – BAM filename.
- method (string) – One of: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), ‘hybrid’ (hybridization capture).
- targets (GenomicArray) – Targeted genomic regions (for ‘hybrid’ and ‘amplicon’).
- access (GenomicArray) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
- bp_per_bin (int) – Desired number of sequencing read nucleotide bases mapped to each bin.
Returns: - ((target depth, target avg. bin size),
(antitarget depth, antitarget avg. bin size))
Return type: 2-tuple of 2-tuples
-
cnvlib.autobin.
hybrid
(rc_table, read_len, bam_fname, targets, access=None)[source]¶ Hybrid capture sequencing.
-
cnvlib.autobin.
midsize_file
(fnames)[source]¶ Select the median-size file from several given filenames.
If an even number of files is given, selects the file just below the median.
-
cnvlib.autobin.
sample_midsize_regions
(regions, max_num)[source]¶ Randomly select a subset of up to max_num regions.
-
cnvlib.autobin.
sample_region_cov
(bam_fname, regions, max_num=100)[source]¶ Calculate read depth in a randomly sampled subset of regions.
Intersection of DataFrame .chromosome values.
batch
¶
The ‘batch’ command.
-
cnvlib.batch.
batch_make_reference
(normal_bams, target_bed, antitarget_bed, male_reference, fasta, annotate, short_names, target_avg_size, access_bed, antitarget_avg_size, antitarget_min_size, output_reference, output_dir, processes, by_count, method)[source]¶ Build the CN reference from normal samples, targets and antitargets.
call
¶
Call copy number variants from segmented log2 ratios.
-
cnvlib.call.
absolute_clonal
(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]¶ Calculate absolute copy number values from segment or bin log2 ratios.
-
cnvlib.call.
absolute_dataframe
(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]¶ Absolute, expected and reference copy number in a DataFrame.
-
cnvlib.call.
absolute_expect
(cnarr, ploidy, is_sample_female)[source]¶ Absolute integer number of expected copies in each bin.
I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified chromosomal sex.
-
cnvlib.call.
absolute_pure
(cnarr, ploidy, is_reference_male)[source]¶ Calculate absolute copy number values from segment or bin log2 ratios.
-
cnvlib.call.
absolute_reference
(cnarr, ploidy, is_reference_male)[source]¶ Absolute integer number of reference copies in each bin.
I.e. the given ploidy for autosomes, 1 or 2 X according to the reference sex, and always 1 copy of Y.
-
cnvlib.call.
absolute_threshold
(cnarr, ploidy, thresholds, is_reference_male)[source]¶ Call integer copy number using hard thresholds for each level.
Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.
Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:
R> log2(2:6 / 4) -1.0 -0.4150375 0.0 0.3219281 0.5849625
Allowing for random noise of +/- 0.1, the cutoffs are:
DEL(0) < -1.1 LOSS(1) < -0.25 GAIN(3) >= +0.2 AMP(4) >= +0.7
For germline samples, better precision could be achieved with:
LOSS(1) < -0.4 GAIN(3) >= +0.3
-
cnvlib.call.
do_call
(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]¶
coverage
¶
Supporting functions for the ‘antitarget’ command.
-
cnvlib.coverage.
bedcov
(bed_fname, bam_fname, min_mapq)[source]¶ Calculate depth of all regions in a BED file via samtools (pysam) bedcov.
i.e. mean pileup depth across each region.
-
cnvlib.coverage.
do_coverage
(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]¶ Calculate coverage in the given regions from BAM read depths.
-
cnvlib.coverage.
interval_coverages
(bed_fname, bam_fname, by_count, min_mapq, processes)[source]¶ Calculate log2 coverages in the BAM file at each interval.
-
cnvlib.coverage.
interval_coverages_count
(bed_fname, bam_fname, min_mapq, procs=1)[source]¶ Calculate log2 coverages in the BAM file at each interval.
diagram
¶
Chromosome diagram drawing functions.
This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)
-
cnvlib.diagram.
bc_chromosome_draw_label
(self, cur_drawing, label_name)[source]¶ Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.
Draw a label for the chromosome. Mod: above the chromosome, not below.
-
cnvlib.diagram.
bc_organism_draw
(org, title, wrap=12)[source]¶ Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.
Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.
Parameters: - org – The chromosome diagram object being modified.
- title (str) – The output title of the produced document.
- wrap (int) – Maximum number of chromosomes per row; the remainder will be wrapped to the next row(s).
export
¶
Export CNVkit objects and files to other formats.
-
cnvlib.export.
assign_ci_start_end
(segarr, cnarr)[source]¶ Assign ci_start and ci_end fields to segments.
Values for each segment indicate the CI boundary points within that segment, i.e. the right CI boundary for the left-side breakpoint (segment start), and left CI boundary for the right-side breakpoint (segment end).
This is a little unintuitive because the CI refers to the breakpoint, not the segment, but we’re storing the info in an array of segments.
Calculation: Just use the boundaries of the bins left- and right-adjacent to each segment breakpoint.
-
cnvlib.export.
export_bed
(segments, ploidy, is_reference_male, is_sample_female, label, show)[source]¶ Convert a copy number array to a BED-like DataFrame.
For each region in each sample (possibly filtered according to show), the columns are:
- reference sequence name
- start (0-indexed)
- end
- sample name or given label
- integer copy number
By default (show=”ploidy”), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by –ploidy. If show=”variant”, skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.
-
cnvlib.export.
export_nexus_basic
(cnarr)[source]¶ Biodiscovery Nexus Copy Number “basic” format.
Only represents one sample per file.
-
cnvlib.export.
export_nexus_ogt
(cnarr, varr, min_weight=0.0)[source]¶ Biodiscovery Nexus Copy Number “Custom-OGT” format.
To create the b-allele frequencies column, alterate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above or below .5 (majority rules), then the median of those values is taken.
-
cnvlib.export.
export_seg
(sample_fnames, chrom_ids=None)[source]¶ SEG format for copy number segments.
Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.
-
cnvlib.export.
export_theta
(tumor_segs, normal_cn)[source]¶ Convert tumor segments and normal .cnr or reference .cnn to THetA input.
Follows the THetA segmentation import script but avoid repeating the pileups, since we already have the mean depth of coverage in each target bin.
The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.
- THetA2 input format is tabular, with columns:
- ID, chrm, start, end, tumorCount, normalCount
where chromosome IDs (“chrm”) are integers 1 through 24.
-
cnvlib.export.
export_theta_snps
(varr)[source]¶ Generate THetA’s SNP per-allele read count “formatted.txt” files.
-
cnvlib.export.
export_vcf
(segments, ploidy, is_reference_male, is_sample_female, sample_id=None, cnarr=None)[source]¶ Convert segments to Variant Call Format.
For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)
-
cnvlib.export.
merge_samples
(filenames)[source]¶ Merge probe values from multiple samples into a 2D table (of sorts).
- Input:
- dict of {sample ID: (probes, values)}
- Output:
- list-of-tuples: (probe, log2 coverages...)
-
cnvlib.export.
ref_means_nbins
(tumor_segs, normal_cn)[source]¶ Extract segments’ reference mean log2 values and probe counts.
Code paths:
wt_mdn wt_old probes norm -> norm, nbins + * * - 0, wt_mdn - + + - 0, wt_old * probes - + - - 0, wt_old * size? - - + - 0, probes - - - - 0, size? + - + + norm, probes + - - + norm, bin counts - + + + norm, probes - + - + norm, bin counts - - + + norm, probes - - - + norm, bin counts
-
cnvlib.export.
segments2vcf
(segments, ploidy, is_reference_male, is_sample_female)[source]¶ Convert copy number segments to VCF records.
-
cnvlib.export.
theta_read_counts
(log2_ratio, nbins, avg_depth=500, avg_bin_width=200, read_len=100)[source]¶ Calculate segments’ read counts from log2-ratios.
- Math:
- nbases = read_length * read_count
- and
- nbases = bin_width * read_depth
- where
- read_depth = read_depth_ratio * avg_depth
- So:
- read_length * read_count = bin_width * read_depth read_count = bin_width * read_depth / read_length
fix
¶
Supporting functions for the ‘fix’ command.
-
cnvlib.fix.
apply_weights
(cnarr, ref_matched, epsilon=0.0001)[source]¶ Calculate weights for each bin.
Weights are derived from:
- bin sizes
- average bin coverage depths in the reference
- the “spread” column of the reference.
-
cnvlib.fix.
center_by_window
(cnarr, fraction, sort_key)[source]¶ Smooth out biases according to the trait specified by sort_key.
E.g. correct GC-biased bins by windowed averaging across similar-GC bins; or for similar interval sizes.
-
cnvlib.fix.
do_fix
(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Combine target and antitarget coverages and correct for biases.
-
cnvlib.fix.
edge_gains
(target_sizes, gap_sizes, insert_size)[source]¶ Calculate coverage gain from neighboring baits’ flanking reads.
Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:
.. math :: (i-g)^2 / 4it
If the neighbor flank extends beyond the target (t+g < i), reduce by:
.. math :: (i-t-g)^2 / 4it
If a neighbor overlaps the target, treat it as adjacent (gap size 0).
-
cnvlib.fix.
edge_losses
(target_sizes, insert_size)[source]¶ Calculate coverage losses at the edges of baited regions.
Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:
\[i/2t\]If the “shoulders” extend outside the bait $(t < i), reduce by:
\[(i-t)^2 / 4it\]on each side, or (i-t)^2 / 2it total.
-
cnvlib.fix.
get_edge_bias
(cnarr, margin)[source]¶ Quantify the “edge effect” of the target tile and its neighbors.
The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.
(This is not the actual change in coverage. This is just a tribute.)
-
cnvlib.fix.
load_adjust_coverages
(cnarr, ref_cnarr, skip_low, fix_gc, fix_edge, fix_rmask)[source]¶ Load and filter probe coverages; correct using reference and GC.
heatmap
¶
The ‘heatmap’ command.
importers
¶
Import from other formats to the CNVkit format.
-
cnvlib.importers.
parse_theta_results
(fname)[source]¶ Parse THetA results into a data structure.
Columns: NLL, mu, C, p*
-
cnvlib.importers.
unpipe_name
(name)[source]¶ Fix the duplicated gene names Picard spits out.
Return a string containing the single gene name, sans duplications and pipe characters.
Picard CalculateHsMetrics combines the labels of overlapping intervals by joining all labels with ‘|’, e.g. ‘BRAF|BRAF’ – no two distinct targeted genes actually overlap, though, so these dupes are redundant. Meaningless target names are dropped, e.g. ‘CGH|FOO|-‘ resolves as ‘FOO’. In case of ambiguity, the longest name is taken, e.g. “TERT|TERT Promoter” resolves as “TERT Promoter”.
metrics
¶
Robust metrics to evaluate performance of copy number estimates.
-
cnvlib.metrics.
confidence_interval_bootstrap
(bins, alpha, bootstraps=100, smoothed=True)[source]¶ Confidence interval for segment mean log2 value, estimated by bootstrap.
-
cnvlib.metrics.
do_metrics
(cnarrs, segments=None, skip_low=False)[source]¶ Compute coverage deviations and other metrics for self-evaluation.
-
cnvlib.metrics.
ests_of_scale
(deviations)[source]¶ Estimators of scale: standard deviation, MAD, biweight midvariance.
Calculates all of these values for an array of deviations and returns them as a tuple.
reference
¶
Supporting functions for the ‘reference’ command.
-
cnvlib.reference.
bed2probes
(bed_fname)[source]¶ Create a neutral-coverage CopyNumArray from a file of regions.
-
cnvlib.reference.
calculate_gc_lo
(subseq)[source]¶ Calculate the GC and lowercase (RepeatMasked) content of a string.
-
cnvlib.reference.
combine_probes
(filenames, fa_fname, is_male_reference, sexes, skip_low, fix_gc, fix_edge, fix_rmask)[source]¶ Calculate the median coverage of each bin across multiple samples.
Parameters: - filenames (list) – List of string filenames, corresponding to targetcoverage.cnn and antitargetcoverage.cnn files, as generated by ‘coverage’ or ‘import-picard’.
- fa_fname (str) – Reference genome sequence in FASTA format, used to extract GC and RepeatMasker content of each genomic bin.
- is_male_reference (bool) –
- skip_low (bool) –
- fix_gc (bool) –
- fix_edge (bool) –
- fix_rmask (bool) –
Returns: One object summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and GC content.
Return type:
-
cnvlib.reference.
do_reference
(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]¶ Compile a coverage reference from the given files (normal samples).
-
cnvlib.reference.
do_reference_flat
(targets, antitargets=None, fa_fname=None, male_reference=False)[source]¶ Compile a neutral-coverage reference from the given intervals.
Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.
-
cnvlib.reference.
fasta_extract_regions
(fa_fname, intervals)[source]¶ Extract an iterable of regions from an indexed FASTA file.
Input: FASTA file name; iterable of (seq_id, start, end) (1-based) Output: iterable of string sequences.
-
cnvlib.reference.
get_fasta_stats
(cnarr, fa_fname)[source]¶ Calculate GC and RepeatMasker content of each bin in the FASTA genome.
-
cnvlib.reference.
infer_sexes
(cnn_fnames, is_male_reference)[source]¶ Map sample IDs to inferred chromosomal sex, where possible.
For samples where the source file is empty or does not include either sex chromosome, that sample ID will not be in the returned dictionary.
reports
¶
Supports the sub-commands breaks and gainloss.
Supporting functions for the text/tabular-reporting commands.
Namely: breaks, gainloss.
-
cnvlib.reports.
gainloss_by_gene
(cnarr, threshold, skip_low=False)[source]¶ Identify genes where average bin copy ratio value exceeds threshold.
NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.
-
cnvlib.reports.
gainloss_by_segment
(cnarr, segments, threshold, skip_low=False)[source]¶ Identify genes where segmented copy ratio exceeds threshold.
NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.
-
cnvlib.reports.
get_breakpoints
(intervals, segments, min_probes)[source]¶ Identify CBS segment breaks within the targeted intervals.
-
cnvlib.reports.
get_gene_intervals
(all_probes, ignore=('-', '.', 'CGH'))[source]¶ Tally genomic locations of each targeted gene.
Return a dict of chromosomes to a list of tuples: (gene name, starts, end), where gene name is a string, starts is a sorted list of probe start positions, and end is the last probe’s end position as an integer. (The endpoints are redundant since probes are adjacent.)
scatter
¶
Command-line interface and corresponding API for CNVkit.
-
cnvlib.scatter.
chromosome_scatter
(cnarr, segments, variants, show_range, show_gene, antitarget_marker, do_trend, window_width, y_min, y_max, title, segment_color)[source]¶ Plot a specified region on one chromosome.
Possibilities:
Options | Shown ------------ | -------- -c | -g | Genes | Region ------- | -- | ----- | ------ - | + | given | auto: gene(s) + margin chr | - | none | whole chrom chr | + | given | whole chrom chr:s-e | - | all | given chr:s-e | + | given | given
-
cnvlib.scatter.
cnv_on_chromosome
(axis, probes, segments, genes, antitarget_marker=None, do_trend=False, x_limits=None, y_min=None, y_max=None, segment_color='darkorange')[source]¶ Draw a scatter plot of probe values with CBS calls overlaid.
Parameters: genes (list) – Of tuples: (start, end, gene name)
-
cnvlib.scatter.
cnv_on_genome
(axis, probes, segments, do_trend=False, y_min=None, y_max=None, segment_color='darkorange')[source]¶ Plot coverages and CBS calls for all chromosomes on one plot.
-
cnvlib.scatter.
do_scatter
(cnarr, segments=None, variants=None, show_range=None, show_gene=None, antitarget_marker=None, do_trend=False, window_width=1000000.0, y_min=None, y_max=None, title=None, segment_color='darkorange', background_marker=None)[source]¶ Plot probe log2 coverages and segmentation calls together.
-
cnvlib.scatter.
genome_scatter
(cnarr, segments=None, variants=None, do_trend=False, y_min=None, y_max=None, title=None, segment_color='darkorange')[source]¶ Plot all chromosomes, concatenated on one plot.
-
cnvlib.scatter.
group_snvs_by_segments
(snv_posns, snv_freqs, segments, chrom=None)[source]¶ Group SNP allele frequencies by segment.
Yields: tuple – (start, end, value)
-
cnvlib.scatter.
set_xlim_from
(axis, probes=None, segments=None, variants=None)[source]¶ Configure axes for plotting a single chromosome’s data.
Parameters: - probes (CopyNumArray) –
- segments (CopyNumArray) –
- variants (VariantArray) – All should already be subsetted to the region that will be plotted.
segmentation
¶
Segmentation of copy number values.
-
cnvlib.segmentation.
do_segmentation
(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]¶ Infer copy number segments from the given coverage table.
-
cnvlib.segmentation.
drop_outliers
(cnarr, width, factor)[source]¶ Drop outlier bins with log2 ratios too far from the trend line.
Outliers are the log2 values factor times the 90th quantile of absolute deviations from the rolling average, within a window of given width. The 90th quantile is about 1.97 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.97 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers within a window, which is plenty robust for our needs.
-
cnvlib.segmentation.
repair_segments
(segments, orig_probes)[source]¶ Post-process segmentation output.
- Ensure every chromosome has at least one segment.
- Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).
-
cnvlib.segmentation.
transfer_fields
(segments, cnarr, ignore=('-', '.', 'CGH'))[source]¶ Map gene names, weights, depths from cnarr bins to segarr segments.
Segment gene name is the comma-separated list of bin gene names. Segment weight is the sum of bin weights, and depth is the (weighted) mean of bin depths.
target
¶
Transform bait intervals into targets more suitable for CNVkit.
-
cnvlib.target.
do_target
(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]¶ Transform bait intervals into targets more suitable for CNVkit.
-
cnvlib.target.
filter_names
(names, exclude=('mRNA', ))[source]¶ Remove less-meaningful accessions from the given set.
-
cnvlib.target.
shorten_labels
(gene_labels)[source]¶ Reduce multi-accession interval labels to the minimum consistent.
So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.
e.g.:
... mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104 ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L ...
becomes:
... mRNA|AF161376 mRNA|AF161376 ...
Helper modules¶
cmdutil
¶
Functions reused within command-line implementations.
-
cnvlib.cmdutil.
load_het_snps
(vcf_fname, sample_id=None, normal_id=None, min_variant_depth=20, zygosity_freq=None, tumor_boost=False)[source]¶
-
cnvlib.cmdutil.
read_cna
(infile, sample_id=None, meta=None)[source]¶ Read a CNVkit file (.cnn, .cnr, .cns) to create a CopyNumArray object.
-
cnvlib.cmdutil.
write_dataframe
(outfname, dframe, header=True)[source]¶ Write a pandas.DataFrame to a tabular file.
core
¶
CNV utilities.
-
cnvlib.core.
assert_equal
(msg, **values)[source]¶ Evaluate and compare two or more values for equality.
Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.
Example:
>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy'])) ... ValueError: Mismatch: expected = 1, saw = 2
-
cnvlib.core.
call_quiet
(*args)[source]¶ Safely run a command and get stdout; print stderr if there’s an error.
Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.
-
cnvlib.core.
check_unique
(items, title)[source]¶ Ensure all items in an iterable are identical; return that one item.
descriptives
¶
Robust estimators of central tendency and scale.
- See:
- https://en.wikipedia.org/wiki/Robust_measures_of_scale https://astropy.readthedocs.io/en/latest/_modules/astropy/stats/funcs.html
-
cnvlib.descriptives.
biweight_location
(a, **kwargs)[source]¶ Compute the biweight location for an array.
The biweight is a robust statistic for estimating the central location of a distribution.
-
cnvlib.descriptives.
biweight_midvariance
(a, **kwargs)[source]¶ Compute the biweight midvariance for an array.
The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.
See:
-
cnvlib.descriptives.
gapper_scale
(a, **kwargs)[source]¶ Scale estimator based on gaps between order statistics.
See:
- Wainer & Thissen (1976)
- Beers, Flynn, and Gebhardt (1990)
-
cnvlib.descriptives.
interquartile_range
(a, **kwargs)[source]¶ Compute the difference between the array’s first and third quartiles.
-
cnvlib.descriptives.
mean_squared_error
(a, **kwargs)[source]¶ Mean squared error (MSE).
By default, assume the input array a is the residuals/deviations/error, so MSE is calculated from zero. Another reference point for calculating the error can be specified with initial.
-
cnvlib.descriptives.
median_absolute_deviation
(a, **kwargs)[source]¶ Compute the median absolute deviation (MAD) of array elements.
The MAD is defined as:
median(abs(a - median(a)))
.See: https://en.wikipedia.org/wiki/Median_absolute_deviation
-
cnvlib.descriptives.
modal_location
(a, **kwargs)[source]¶ Return the modal value of an array’s values.
The “mode” is the location of peak density among the values, estimated using a Gaussian kernel density estimator.
Parameters: a (np.array) – A 1-D array of floating-point values, e.g. bin log2 ratio values.
-
cnvlib.descriptives.
on_array
(default=None)[source]¶ Ensure a is a numpy array with no missing/NaN values.
-
cnvlib.descriptives.
on_weighted_array
(default=None)[source]¶ Ensure a and w are equal-length numpy arrays with no NaN values.
For weighted descriptives – a is the array of values, w is weights.
- Drop any cells in a that are NaN from both a and w
- Replace any remaining NaN cells in w with 0.
-
cnvlib.descriptives.
q_n
(a, **kwargs)[source]¶ Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.
Qn := Cn first quartile of (|x_i - x_j|: i < j)
where Cn is a constant depending on n.
Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.
n E[Qn] – —– 10 1.392 20 1.193 40 1.093 60 1.064 80 1.048 100 1.038 200 1.019
parallel
¶
Utilities for multi-core parallel processing.
-
class
cnvlib.parallel.
SerialFuture
(result)[source]¶ Bases:
future.types.newobject.newobject
Mimic the concurrent.futures.Future interface.
params
¶
Defines several constants used in the pipeline.
Hard-coded parameters for CNVkit. These should not change between runs.
plots
¶
Plotting utilities.
-
cnvlib.plots.
chromosome_sizes
(probes, to_mb=False)[source]¶ Create an ordered mapping of chromosome names to sizes.
-
cnvlib.plots.
cvg2rgb
(cvg, desaturate)[source]¶ Choose a shade of red or blue representing log2-coverage value.
-
cnvlib.plots.
gene_coords_by_name
(probes, names)[source]¶ Find the chromosomal position of each named gene in probes.
Returns: Of: {chromosome: [(start, end, gene name), ...]} Return type: dict
-
cnvlib.plots.
gene_coords_by_range
(probes, chrom, start, end, ignore=('-', '.', 'CGH'))[source]¶ Find the chromosomal position of all genes in a range.
Returns: Of: {chromosome: [(start, end, gene), ...]} Return type: dict
-
cnvlib.plots.
partition_by_chrom
(chrom_snvs)[source]¶ Group the tumor shift values by chromosome (for statistical testing).
-
cnvlib.plots.
plot_x_dividers
(axis, chrom_sizes, pad=None)[source]¶ Plot vertical dividers and x-axis labels given the chromosome sizes.
Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.
Returns: A table of the x-position offsets of each chromosome. Return type: OrderedDict
samutil
¶
BAM utilities.
-
cnvlib.samutil.
bam_total_reads
(bam_fname)[source]¶ Count the total number of mapped reads in a BAM file.
Uses the BAM index to do this quickly.
-
cnvlib.samutil.
ensure_bam_index
(bam_fname)[source]¶ Ensure a BAM file is indexed, to enable fast traversal & lookup.
For MySample.bam, samtools will look for an index in these files, in order:
- MySample.bam.bai
- MySample.bai
-
cnvlib.samutil.
ensure_bam_sorted
(bam_fname, by_name=False, span=50)[source]¶ Test if the reads in a BAM file are sorted as expected.
by_name=True: reads are expected to be sorted by query name. Consecutive read IDs are in alphabetical order, and read pairs appear together.
by_name=False: reads are sorted by position. Consecutive reads have increasing position.
-
cnvlib.samutil.
get_read_length
(bam, span=1000)[source]¶ Get (median) read length from first few reads in a BAM file.
Illumina reads all have the same length; other sequencers might not.
Parameters: - bam (str or pysam.Samfile) – Filename or pysam-opened BAM file.
- n (int) – Number of reads used to calculate median read length.
segfilters
¶
Filter copy number segments.
-
cnvlib.segfilters.
ampdel
(segarr)[source]¶ Merge segments by amplified/deleted/neutral copy number status.
Follow the clinical reporting convention:
- 5+ copies (2.5-fold gain) is amplification
- 0 copies is homozygous/deep deletion
- CNAs of lesser degree are not reported
This is recommended only for selecting segments corresponding to actionable, usually focal, CNAs. Any real and potentially informative but lower-level CNAs will be dropped.
-
cnvlib.segfilters.
bic
(segarr)[source]¶ Merge segments by Bayesian Information Criterion.
See: BIC-seq (Xi 2011), doi:10.1073/pnas.1110574108
-
cnvlib.segfilters.
ci
(segarr)[source]¶ Merge segments by confidence interval (overlapping 0).
Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
-
cnvlib.segfilters.
enumerate_changes
(levels)[source]¶ Assign a unique integer to each run of identical values.
Repeated but non-consecutive values will be assigned different integers.
-
cnvlib.segfilters.
require_column
(*colnames)[source]¶ Wrapper to coordinate the segment-filtering functions.
Verify that the given columns are in the CopyNumArray the wrapped function takes. Also log the number of rows in the array before and after filtration.
-
cnvlib.segfilters.
sem
(segarr)[source]¶ Merge segments by Standard Error of the Mean (SEM).
Use each segment’s SEM value to estimate a 95% confidence interval (via zscore). Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
smoothing
¶
Signal smoothing functions.
-
cnvlib.smoothing.
check_inputs
(x, width)[source]¶ Transform width into a half-window size.
width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.
-
cnvlib.smoothing.
fit_edges
(x, y, wing, polyorder=3)[source]¶ Apply polynomial interpolation to the edges of y, in-place.
Calculates a polynomial fit (of order polyorder) of x within a window of width twice wing, then updates the smoothed values y in the half of the window closest to the edge.
-
cnvlib.smoothing.
outlier_iqr
(a, c=3.0)[source]¶ Detect outliers as a multiple of the IQR from the median.
By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.
-
cnvlib.smoothing.
outlier_mad_median
(a)[source]¶ MAD-Median rule for detecting outliers.
X_i is an outlier if:
| X_i - M | _____________ > K ~= 2.24 MAD / 0.6745
where $K = sqrt( X^2_{0.975,1} )$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.
This is a very robust rule with the highest possible breakdown point of 0.5.
Returns: A boolean array of the same size as a, where outlier indices are True. Return type: np.array References
- Davies & Gather (1993) The Identification of Multiple Outliers.
- Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
-
cnvlib.smoothing.
rolling_outlier_iqr
(x, width, c=3.0)[source]¶ Detect outliers as a multiple of the IQR from the median.
By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).
-
cnvlib.smoothing.
rolling_outlier_quantile
(x, width, q, m)[source]¶ Detect outliers by multiples of a quantile in a window.
Outliers are the array elements outside m times the q‘th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)
This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.
Returns: A boolean array of the same size as x, where outlier indices are True. Return type: np.array
-
cnvlib.smoothing.
rolling_outlier_std
(x, width, stdevs)[source]¶ Detect outliers by stdev within a rolling window.
Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.
Returns: A boolean array of the same size as x, where outlier indices are True. Return type: np.array
-
cnvlib.smoothing.
rolling_quantile
(x, width, quantile)[source]¶ Rolling quantile (0–1) with mirrored edges.
-
cnvlib.smoothing.
smoothed
(x, width, do_fit_edges=False)[source]¶ Smooth the values in x with the Kaiser windowed filter.
See: https://en.wikipedia.org/wiki/Kaiser_window
Parameters: - x (array-like) – 1-dimensional numeric data set.
- width (float) – Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.