Create a commaData object from methylation calling output files
Source: R/commaData_constructor.R
Constructor for the commaData-class S4 class. Parses one or
more methylation calling output files (modkit, Megalodon, or Dorado), merges
them into a sites × samples matrix representation, and optionally loads
genomic annotation and motif site positions.
Usage
commaData(
  files,
  colData,
  genome = NULL,
  annotation = NULL,
  mod_type = NULL,
  motif = NULL,
  min_coverage = 5L,
  caller = "modkit"
)
Arguments
- files
Named character vector mapping sample names to file paths. Names must match colData$sample_name. Example: c(ctrl_1 = "/path/to/ctrl_1.bed", treat_1 = "/path/to/treat_1.bed").
- colData
A data.frame with one row per sample. Must contain columns sample_name, condition, and replicate. Additional columns (e.g., file_path, batch) are preserved.
- genome
Genome size information: a named integer vector of chromosome sizes (e.g., c(NC_000913 = 4641652L)), a path to a FASTA file, a DNAStringSet (Biostrings), or a BSgenome object. For single-chromosome genomes, pass the BSgenome object directly or a named integer vector. Do not index into the BSgenome with $ (e.g., BSgenome.Ecoli.NCBI.20080805$NC_000913): that yields a DNAString, which has no chromosome name and cannot be used. Set to NULL to omit genome information (not recommended). When a multi-sequence source is provided, genomeInfo is automatically restricted to chromosomes present in the data.
- annotation
Optional. Path to a GFF3 or BED annotation file, or a pre-loaded GRanges object. If NULL, the annotation slot is left empty.
- mod_type
Optional character vector specifying which modification types to retain (e.g., "6mA" or c("6mA", "5mC")). If NULL, all modification types detected in the files are kept.
- motif
Optional character string. A DNA sequence motif (e.g., "GATC") to locate in the genome. Requires genome to be a FASTA path or BSgenome object (not a named integer vector). If NULL, the motifSites slot is left empty.
- min_coverage
Integer. Minimum read depth to include a site. Sites present in a sample with coverage below this threshold have their beta value set to NA. Sites absent from a sample entirely are also NA. Default: 5.
- caller
Character string specifying the methylation caller that produced the input files. One of "modkit" (default), "megalodon", or "dorado".
Details
The constructor uses a parse-then-merge strategy:
- Each file is parsed independently using the appropriate parser.
- Sites are identified by a key: "chrom:position:strand:mod_type".
- The union of all sites across all samples is taken.
- Beta values and coverage are arranged into sites × samples matrices, with NA for samples that do not cover a given site.
- Sites where coverage is below min_coverage in a sample have their beta value set to NA (but coverage is preserved).
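The merge steps described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the `parsed` list, its column names (`chrom`, `position`, `strand`, `mod_type`, `beta`, `coverage`), and the helpers `site_key()` and `new_matrix()` are assumptions made for illustration only.

```r
# Hypothetical parser output: one data.frame per sample (column names assumed).
parsed <- list(
  ctrl_1 = data.frame(
    chrom = c("chr1", "chr1"), position = c(100L, 250L),
    strand = c("+", "-"), mod_type = c("6mA", "6mA"),
    beta = c(0.90, 0.40), coverage = c(12L, 3L)
  ),
  treat_1 = data.frame(
    chrom = c("chr1", "chr1"), position = c(100L, 400L),
    strand = c("+", "+"), mod_type = c("6mA", "6mA"),
    beta = c(0.20, 0.75), coverage = c(8L, 20L)
  )
)
min_coverage <- 5L

# Build the "chrom:position:strand:mod_type" key for every row.
site_key <- function(df) {
  paste(df$chrom, df$position, df$strand, df$mod_type, sep = ":")
}

# Union of all sites across all samples.
all_sites <- unique(unlist(lapply(parsed, site_key), use.names = FALSE))

# Sites x samples matrices, initialised to NA (sample does not cover the site).
new_matrix <- function(fill) {
  matrix(fill, nrow = length(all_sites), ncol = length(parsed),
         dimnames = list(all_sites, names(parsed)))
}
beta <- new_matrix(NA_real_)
coverage <- new_matrix(NA_integer_)

# Fill each sample's column at the rows matching its site keys.
for (s in names(parsed)) {
  rows <- match(site_key(parsed[[s]]), all_sites)
  beta[rows, s] <- parsed[[s]]$beta
  coverage[rows, s] <- parsed[[s]]$coverage
}

# Mask beta where coverage is below the threshold; coverage itself is kept.
low <- !is.na(coverage) & coverage < min_coverage
beta[low] <- NA_real_
```

In this toy input, the site `chr1:250:-:6mA` has coverage 3 in `ctrl_1`, so its beta value is masked while the coverage entry stays at 3. It is absent from `treat_1`, so both matrices hold `NA` there.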
Examples
if (FALSE) { # \dontrun{
# Load two modkit BED files
cd <- commaData(
  files = c(
    ctrl_1 = "ctrl_1_modkit.bed",
    treat_1 = "treat_1_modkit.bed"
  ),
  colData = data.frame(
    sample_name = c("ctrl_1", "treat_1"),
    condition = c("control", "treatment"),
    replicate = c(1L, 1L)
  ),
  genome = c(chr1 = 4641652L),
  annotation = "MG1655.gff3",
  caller = "modkit"
)
cd
} # }
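Because the names of files must match colData$sample_name and colData must carry three required columns, it can help to check both inputs before calling the constructor. The sketch below uses base R only. `check_inputs()` is a hypothetical helper, not part of the package, and it only verifies that every name in files appears in colData$sample_name. Whether the constructor also enforces order or a one-to-one match is not stated here.

```r
# Hypothetical pre-flight check; not part of the package API.
check_inputs <- function(files, colData) {
  # colData must contain the three documented columns.
  required <- c("sample_name", "condition", "replicate")
  missing_cols <- setdiff(required, colnames(colData))
  if (length(missing_cols) > 0) {
    stop("colData is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }

  # files must be a fully named character vector.
  nms <- names(files)
  if (!is.character(files) || is.null(nms) || anyNA(nms) || any(!nzchar(nms))) {
    stop("files must be a fully named character vector")
  }

  # Every file name must correspond to a row in colData.
  unmatched <- setdiff(nms, colData$sample_name)
  if (length(unmatched) > 0) {
    stop("names of files not found in colData$sample_name: ",
         paste(unmatched, collapse = ", "))
  }
  invisible(TRUE)
}

files <- c(ctrl_1 = "ctrl_1_modkit.bed", treat_1 = "treat_1_modkit.bed")
colData <- data.frame(
  sample_name = c("ctrl_1", "treat_1"),
  condition = c("control", "treatment"),
  replicate = c(1L, 1L)
)
check_inputs(files, colData)
```

The check does not touch the file system, so it can run before the input files exist.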