This class stores information from parsed command lines and provides the mechanism to call SFS_CODE commands.
execute a simulation command
Parameters:
- rand=1
a random integer. If the value is not reset by the user then a new random number is rolled within self.execute. This value is used as the random seed for SFS_CODE.
This class is used to store, parse, and convert SFS_CODE command lines.
Upon initialization, an object of the SFSCommand class sets the values of many of its attributes to the SFS_CODE defaults.
Parameters:
- outdir=os.path.join(os.getcwd(), ‘sims’)
A directory containing subdirectories with sfs_code simulations.
- prefix=’out’
The prefix of the out directory and the data files within the out directory.
- err=’err’
The name of the directory that contains all the stderr ouput from calling sfs_code.
Attributes:
- self.com_string=’‘
the entire command stored as a single string.
- self.outdir= outdir
the parent directory of output directories for sets of SFS_CODE simulations
- self.sfs_code_loc = ‘’
the location of the SFS_CODE binary.
- self.N = 500
the number of individuals in the ancestral population.
- self.P = [2]
the ploidy of the individuals in each population.
- self.t = 0.001
. This is the value of
in the ancestral population
- self.L = [5000]
an array containing the length of each simulate locus.
- self.B = 5 self.p[0] self.N
the length of the burn in (generations).
- self.prefix= prefix
the prefix for the output file directory and each simulation file.
- self.r=0.0.
. The value of
in the ancestral population.
- self.n_pops=1
number of populations.
- self.nsim=1
number of simulations.
- self.line=[]
an array of strings, each of which is an argument to SFS_CODE. This is the attribute that is used to execute SFS_CODE commands.
add an event to a command line.
Parameters:
an array of strings corresponding to the event.
e.g., adding a mutation at locus 0 at time 0 with
event = ['--mutation', '0','L','0']
Note that order of events is sometimes important for the SFS_CODE command line. Please see the SFS_CODE manual at sfscode.sourceforge.net.
A method to build a recurrent hitchhiking command line using the method of Uricchio & Hernandez (2014, Genetics).
Dependencies:
Parameters
, the ancestral population
scaled strength of selection. Note that demographic events
can change N, and hence they also changle alpha.
the ancestral population size
the population scaled recombination coefficient in the ancestral population.
the rate of positive substitutions per generation per site in the population.
a single parameter that encapsulates both delta parameters from Uricchio & Hernandez (Genetics, 2014). Smaller values of delta result in dynamics that are a better match for the original population of size N0, but are more computationally expensive. We do not recommend using values of delta greater than 0.1, and ideally should be smaller than 0.05. For more information please see the paper referenced above.
the length of the flanking sequence on each side of the neutral locus. If L0 is not reset from it’s default value, it is automatically set to L0 = s0/r0, where s0 and r0 are alpha/2N0 and rho/4N0, respectively.
the neutral value of theta.
the ending time of the simulation in units of 2*N0*self.P[0] generations.
Allows for recombination within the neutral locus.
The proportion of mutations in the flanking sequence which are under negative selection
The population scaled negative selection coefficient
If True, use the distribution of selection coefficients from Boyko et al (PLoS Genetics, 2008)
If True, use the distribution of selection coefficents from Torgerson et al (PLoS Genetics, 2009)
If True, split the central (neutral) sequence up into lmultnum equal sized bins.
Number of bins to be used when Lmult is True
Number of repetitions of the simulation to perform.
Demographic event/s of the users choice. Should be an array storing the exact SFS_CODE command. For example, for a bottleneck you could use [‘-Td’, ‘0’, ‘0.1’], where the bottleneck occurs at time 0 and shrinks the population to 10 percent of its original size. Multiple demographic events can be concatenated
A method for running simulations of genomic elements using realistic genome structure and demographic models. The demographic models of Gutenkunst (2009, PLoS Genetics), Gravle (2011, PNAS), Tennessenn (2012,*Science*), and the standard neutral model are included.
The default data sources and options are all human-centric, but in principle these methods could be used to simulate sequences from any population for which the relevant data sources are available (recombination map, conserved elements, exon positions).
This function calls a number of perl scripts, originally implemented by Ryan Hernandez, to build the input to SFS_CODE. These perl scripts are bundled with sfs_coder in the directory src/req
Parameters:
the directory where all the datasources and perl for this method are located. You shouldn’t have to change this unless you’re moving around the source files relative to each other.
the directory where the sfs_code annotation files will be written
chromosome number to query
genomic coordinate of the beginning of the simulated sequence
genomic coordinate of the end of the simulated sequence
genomic coordinate of the region in which to include dense neutral sequences. Neutral sites within this region will be simulated if they are within a specified distance of one of the simulated genomic elements (given by dense_dist)
genomic coordinate of the end of the dense neutral region
The amount of neutral sequence to pad onto the end of each conserved element or exon. For example, if two adjaacent conserved elemets are 20,000 base pairs apart, and the dense_dist=5,000, then 5,000 base pairs are padded onto the end of each of the conserved elements and the middle 10,000 base pairs are not simulated.
To include every neutral base pair in the region, use dense_dist=-1 (potentially very computationally expensive for large regions!)
If true, draw selection coefficients from a gamma distribution of selection coefficients for conserved elements and coding regions. These distributions are taken from Boyko et al (coding, 2008, PLoS Genetics) and Torgerson et al (conserved non-coding, 2009, PLoS Genetics).
Demographic model. By default, uses standard neutral model, but you can specify ‘gutenkunst’, ‘tennessen’, or ‘gravel’. For descriptions of each model, see the relevant publications (Gravel et al, PNAS, 2011; Gutenkunst, PLoS Genetics, 2009; Tennessen et al, Science, 2012)
Theta = 4Nu in the ancestral population
rho = 4Nr in the ancestral population. Note, you should interpret this parameter as the genomewide average. If you are simulating a low recombination rate region, sfs_coder will automatically reduce the recombination rate using the recombination map for the region.
A method to parse SFS_CODE command lines. By default, every switch is stored as an array with the exception of certain special cases that are stored as dictionaries.
Note, only the short form of SFS_CODE options are currently fully supported! For example, -t 0.002 is supported but –theta 0.002 is not. Hence, if you run SFS_CODE with the long forms or wish to analyze code that used the long form to run, you may have issues.
A general three population model with growth events. Inspired by Gutenkunst et al (2009, PLoS Genetics) and Gravel et al (2011, PNAS). For a pictorial representation see Figure 3A of the Gutenkunst paper.
The default parameters are set to Gutenkunst et al. The maximum likelihood estimates of Gravel et al and Tennessen et al (2012, Science) are also included.
Use model=’gravel’, model=’tennessen’ or model=’gutenkunst’ to explicitly choose one of the models. Note that the Tennessen model differs slightly from the other two by adding recent explosive growth in the African contitnental group and two phases of growth in Europeans.
The user can specify the model parameters as desired. To use a user specified model, simply use model = ‘’. Any unspecified parameters are set by default to the parameters of the Gutenkunst model.
Population 0 is the African continental group, 1 is Europe, and 2 is Asia.
a class to store the data associated with a variant in an SFS_CODE output file. Note, both mutations and substitutions are stored as instances of this class.
Attributes:
The locus number of the variant
‘A’ for autosomal, ‘X’ or ‘Y’ for the corresponding sex chromosomes
The position within the locus. Note that the positions within each locus start from 0.
A dictionary, keyed by population number, and storing the time that the variant arose
A dictionary, keyed by population number, and storing the time that the variant fixed within the population. If the variant is segregating, the time stored is the time of sampling.
The ancestral trinucleotide (the middle base is the mutated base, so this is not necessarily a codon!)
The derived nucleotide
Is the mutation synonymous (0) or nonsynonymous (1). 0 also is used to indicate non-coding.
Ancestral amino acid
Derived amino acid
fitness effect of the mutation (0 for neutral)
A dictionary of dictionaries that is keyed by population and chromosome number.
E.g., if the derived allele is present on chromosome 11 in population 2, then
self.chrs[2][11] = True
self.pops_numchr = {}
A dictionary that stores the number of chromosomes that carry the derived allele in each population.
A class that handles the basic parsing of sfs_code output file data.
Parameters:
the path to the file that is to be read.
Attributes:
the path to the file that is to be read.
an array of sfs.Simulation objects.
A method that reads sfs_code output files and stores all the data in sfs.Simulation objects.
Parameters:
the set of iterations over which to store data. Can be useful if you have a lot of simulations in a file and you only want a subset. Defaults to all iterations if left empty
A class to store the data from SFS_CODE simulations.
Attributes:
self.command = command.SFSCommand()
self.data = ‘’
self.loci = defaultdict(lambda: defaultdict(list))
A dictionary of dictionaries indexed by locus and position. Each dictionary of dictionaries is a list of Mutation objects that occur at the corresponding locus and position.
e.g., self.loci[0][0] is a list of Mutation objects that occur at the first position in the first locus.
Note that these keys will only exist in self.loci if there were mutations at this particular point in the sequence in the sample from the simulation.
self.muts = []
A list of all the Mutation objects in the simulation.
self.haplo = {}
A dictionary of haplotypes for each population.
self.haplo[0] is a 2-D array of haplotypes in the 0th population.
This attribute is only filled when you run the haplotype method (or the haplotype_SKAT method)
calculate the number of segretating sites within all populations.
Parameters:
skip sites that are more than biallelic if True
populations over which to calculate
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
Calculate ZnS.
Parameters:
populations over which to calculate
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
- Return value: A dictionary of values of ZnS indexed by population number.
calculate the fitness of the sampled chromosomes within a population.
Parameters:
population number
set of loci over which to calcuate fitness. Uses all loci if left empty
calculate the mean pairwise diversity per site
bewteen pairs of sequences across a set of loci.
If the loci parameter is left undefined by the
user, then this method calculates
over all loci in the simulation.
Parameters:
skip sites that are more than biallelic if True
populations over which to calculate
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
- Return value: A dictionary of
values indexed by population number.
Calculate Tajima’s D
Parameters:
populations over which to calculate
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
Return value: A dictionary of values of Tajima’s D indexed by population number.
Calculate theta_H.
Parameters:
skip sites that are more than biallelic if True
populations over which to calculate
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
Return value: A dictionary of H values indexed by population number.
Calculate Watterson’s estimator of pi for a set of loci
Parameters:
populations over which to calculate
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
compute the site frequency specutrum for a set of population
Parameters:
Set of populations over which to calculate the sfs
Array of loci over which to calculate the number of segregating sites. Uses all loci if this is left blank.
start=-1 site at which to start calculating. If you are doing a simulation of genomic structure, this is the starting position in hg19 coordinates. Ignored if left at the default. Note that the caluclation is still locus based, so if the start parameter is in the middle of a locus, the entire locus is included in the calculation. DO NOT USE THIS OPTION AND ALSO SELECT A SET OF LOCI.
site at which to stop calculating. All the same caveats for the start parameter apply to this parameter as well.
the log file that was made by sfs_coder, storing the start and stop of the genomic region that was simulated. By default, it is stored in the same directory as the simulated data, in the file ‘err/log.build_input.txt’
Include synonymous sites if True
Include non-synonymous sites if True
Include non-coding sites if True
Build haplotypes for each sampled chromosome.
Parameters:
Population of interest
The haplotypes are stored in self.haplo
Build haplotypes for each sampled chromosome in the SKAT fashion (number of minor alleles, not necessarily same as derived allele).
Parameters:
Population of interest
The haplotypes are stored in self.haplo
A method for simulating phenotypes.
Currently has three different methods, inspired by Wu et al (AJHG, 2011), Simons et al (Nature genetics, 2014), and Eyre-Walker (PNAS, 2010).
The proportion of variance in the phenotype that is explained by the test sequence
Populations to be simulated
The minimum frequency that is taken as non-causal (i.e., all sites with frequency below min_freq are taken to be causal). Only used with method=’SKAT’
The population that is used for determining the proportion of variance explained by the test_sequence. Hence, this population will have h_sq=Vg/Vp, but the other populations will not. Defaults to the first population in the pops parameter if it is not reset from -1.
The probability that a causal site’s effect size is taken as proportional to its selection coefficient. Only used with method=’SIMONS’
The tau parameter in the model of Eyre-Walker. This parameter is also used in the ‘SIMONS’ mehtod, where effect sizes are taken as s**tau (s is the selection coefficient)
The sigma parameter in the model of Eyre-Walker
Note that because we are fixing h_sq in one of the populations, there is always an arbitrary constant scaling each effect size. For each method, the variation from the environment is taken to be a normal distribution
Write genotypes out a file. You must call self.haplotype() or self.haplotype_SKAT() before running this method.
Parameters:
Population of interest
The destination file. Must be set to a valid path