Regression
CLARITE has several classes used for Regression.
Regression Classes
Base Class

class clarite.analyze.regression.Regression(data: pandas.core.frame.DataFrame, outcome_variable: str, regression_variables: List[str], covariates: Optional[List[str]] = None)
Abstract Base Class for Regression objects used in EWAS.
 Parameters
 data: pd.DataFrame
Data used in the analysis
 outcome_variable: str
The variable to be used as the output (y) of the regression(s)
 regression_variables: List[str]
Variables to be regressed
 covariates: List[str], optional
The variables to be used as covariates in each regression. Any variables in the DataFrames not listed as covariates are regressed. Use None or an empty list when no covariates are being used.
Notes
These are the abstract methods:
* run() -> None
* get_results() -> pd.DataFrame
clarite.analyze.association_study
The regression_kind parameter can be set to use one of three regression classes, or a custom subclass of Regression can be created.
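A custom subclass has to provide both abstract methods listed for the base class. The following is a minimal sketch of that contract only: `RegressionSketch` is a stand-in written for this example (it is not CLARITE's `Regression`), `NonMissingCount` is a hypothetical subclass, and the data is a plain dict of columns with results returned as a list of dicts instead of a pd.DataFrame, so that the example runs without any third-party packages.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class RegressionSketch(ABC):
    """Stand-in for the abstract base class (an assumption, not CLARITE code)."""

    def __init__(self, data: Dict[str, list], outcome_variable: str,
                 regression_variables: List[str],
                 covariates: Optional[List[str]] = None):
        self.data = data
        self.outcome_variable = outcome_variable
        self.regression_variables = regression_variables
        # None and an empty list both mean "no covariates"
        self.covariates = covariates or []

    @abstractmethod
    def run(self) -> None:
        """Run the analysis and store the results on the object."""

    @abstractmethod
    def get_results(self):
        """Return the stored results."""


class NonMissingCount(RegressionSketch):
    """Hypothetical subclass: counts rows where outcome and variable are present."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._results = None

    def run(self) -> None:
        outcome = self.data[self.outcome_variable]
        self._results = [
            {"Variable": var,
             "N": sum(o is not None and v is not None
                      for o, v in zip(outcome, self.data[var]))}
            for var in self.regression_variables
        ]

    def get_results(self) -> List[dict]:
        if self._results is None:
            raise RuntimeError("call run() before get_results()")
        return self._results


toy = {"y": [1.0, 2.0, None, 4.0], "x1": [0, 1, 1, None], "x2": [5, 6, 7, 8]}
reg = NonMissingCount(toy, "y", ["x1", "x2"])
reg.run()
print(reg.get_results())
```

Separating `run()` from `get_results()` lets the caller trigger the (possibly slow) analysis once and read the results afterwards.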

class clarite.analyze.regression.GLMRegression(data: pandas.core.frame.DataFrame, outcome_variable: str, regression_variables: List[str], covariates: Optional[List[str]] = None, min_n: int = 200, report_categorical_betas: bool = False, standardize_data: bool = False, encoding: str = 'additive', edge_encoding_info: Optional[pandas.core.frame.DataFrame] = None, process_num: Optional[int] = None)
Statsmodels GLM Regression. This class handles running a regression for each variable of interest and collecting results.
 Parameters
 data:
The data to be analyzed, including the outcome, covariates, and any variables to be regressed.
 outcome_variable:
The variable to be used as the output (y) of the regression
 regression_variables:
List of regression variables to be used as input
 covariates:
The variables to be used as covariates. Any variables in the DataFrames not listed as covariates are regressed.
 min_n:
Minimum number of complete-case observations (no NA values for outcome, covariates, or variable). Defaults to 200.
 report_categorical_betas: boolean
 False by default.
If True, the results will contain one row for each categorical value (other than the reference category) and will include the beta value, standard error (SE), and beta p-value for that specific category. The number of terms increases with the number of categories.
 standardize_data: boolean
 False by default.
If True, numeric data will be standardized using z-scores before regression. This will affect the beta values and standard error, but not the p-values.
 encoding: str, default “additive”
Encoding method to use for any genotype data. One of {‘additive’, ‘dominant’, ‘recessive’, ‘codominant’, or ‘weighted’}
 edge_encoding_info: Optional pd.DataFrame, default None
If edge encoding is used, this must be provided. See PandasGenomics documentation on edge encodings.
 process_num: Optional[int]
Number of processes to use when running the analysis, default is None (use the number of cores)
Notes
The family used is either Gaussian (for continuous outcomes) or binomial(logit) (for binary outcomes).
Covariate variables that are constant produce warnings and are ignored.
The dataset is subset to drop missing values, and the same dataset is used for both models in the LRT.
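The complete-case subsetting and the min_n threshold described above can be illustrated with a small sketch. The helper names `complete_case_rows` and `passes_min_n` are hypothetical, missing values are represented by `None` for simplicity (real data would typically use NaN in a DataFrame), and this is not CLARITE's internal code.

```python
from typing import Dict, List, Tuple


def complete_case_rows(data: Dict[str, list], columns: List[str]) -> List[int]:
    """Return the row indices that have no missing value in any listed column."""
    n_rows = len(next(iter(data.values())))
    return [i for i in range(n_rows)
            if all(data[col][i] is not None for col in columns)]


def passes_min_n(data: Dict[str, list], outcome: str, covariates: List[str],
                 variable: str, min_n: int = 200) -> Tuple[bool, int]:
    """Check whether a variable has enough complete-case observations to be tested."""
    rows = complete_case_rows(data, [outcome, *covariates, variable])
    return len(rows) >= min_n, len(rows)


toy = {
    "y":   [1, 0, 1, None, 0],
    "age": [30, 41, None, 55, 62],
    "x":   [0.2, 0.5, 0.1, 0.9, 0.4],
}
print(complete_case_rows(toy, ["y", "age", "x"]))
print(passes_min_n(toy, "y", ["age"], "x", min_n=3))
print(passes_min_n(toy, "y", ["age"], "x"))  # default threshold of 200
```

Because the row subset is computed once per tested variable, both models in a likelihood ratio test can be fit on exactly the same observations.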
Regression Methods
 Binary variables
Treated as continuous features, with values of 0 and 1 (the larger value in the original data is encoded as 1).
 Categorical variables
The results of a likelihood ratio test are used to calculate a p-value. No Beta or SE values are reported.
 Continuous variables
A GLM is used to obtain Beta, SE, and p-value results.
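For categorical variables, the likelihood ratio test compares a restricted model (covariates only) with a full model (covariates plus the tested variable). The sketch below shows the textbook calculation, assuming the two log-likelihoods and the difference in the number of parameters are already known. The function names are hypothetical, and the chi-squared tail probability is computed with the standard library only (valid for integer degrees of freedom); CLARITE's own implementation may differ.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-squared variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from a closed form: Q(1, y) = exp(-y), Q(1/2, y) = erfc(sqrt(y))
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt_pvalue(llf_restricted: float, llf_full: float, df_diff: int) -> Tuple[float, float]:
    """Return the LRT statistic and its p-value."""
    statistic = 2.0 * (llf_full - llf_restricted)
    return statistic, chi2_sf(statistic, df_diff)


# Hypothetical log-likelihoods; a 3-level categorical variable adds 2 parameters
statistic, pvalue = lrt_pvalue(llf_restricted=-120.0, llf_full=-115.0, df_diff=2)
print(statistic, pvalue)
```

The test yields a single p-value for the variable as a whole, which is why no per-category Beta or SE is available from it.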

class clarite.analyze.regression.WeightedGLMRegression(data: pandas.core.frame.DataFrame, outcome_variable: str, regression_variables: List[str], covariates: Optional[List[str]], survey_design_spec: Optional[clarite.modules.survey.survey_design.SurveyDesignSpec] = None, min_n: int = 200, report_categorical_betas: bool = False, standardize_data: bool = False, encoding: str = 'additive', edge_encoding_info: Optional[pandas.core.frame.DataFrame] = None, process_num: Optional[int] = None)
Statsmodels GLM Regression with adjustments for survey design. This class handles running a regression for each variable of interest and collecting results. The statistical adjustments (primarily the covariance calculation) are designed to match results when running with the R survey library.
 Parameters
 data:
The data to be analyzed, including the outcome, covariates, and any variables to be regressed.
 outcome_variable:
The variable to be used as the output (y) of the regression
 regression_variables:
List of regression variables to be used as input
 covariates:
The variables to be used as covariates. Any variables in the DataFrames not listed as covariates are regressed.
 survey_design_spec:
A SurveyDesignSpec object is used to create SurveyDesign objects for each regression.
 min_n:
Minimum number of complete-case observations (no NA values for outcome, covariates, variable, or weight). Defaults to 200.
 report_categorical_betas: boolean
 False by default.
If True, the results will contain one row for each categorical value (other than the reference category) and will include the beta value, standard error (SE), and beta p-value for that specific category. The number of terms increases with the number of categories.
 standardize_data: boolean
 False by default.
If True, numeric data will be standardized using z-scores before regression. This will affect the beta values and standard error, but not the p-values.
 encoding: str, default “additive”
Encoding method to use for any genotype data. One of {‘additive’, ‘dominant’, ‘recessive’, ‘codominant’, or ‘weighted’}
 edge_encoding_info: Optional pd.DataFrame, default None
If edge encoding is used, this must be provided. See PandasGenomics documentation on edge encodings.
 process_num: Optional[int]
Number of processes to use when running the analysis, default is None (use the number of cores)
Notes
The family used is Gaussian for continuous outcomes or binomial(logit) for binary outcomes.
Covariate variables that are constant (after dropping rows due to missing data or applying subsets) produce warnings and are ignored.
Rows missing a weight but not missing the tested variable will cause an error unless the SurveyDesignSpec specifies drop_unweighted as True (in which case those rows are dropped).
Categorical variables run with a survey design will not report Diff_AIC, as it may not be possible to calculate it accurately.
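The rule for rows that are missing a weight can be sketched as follows. This is an illustration of the documented behavior only: `apply_weight_rule` is a hypothetical helper, rows are plain dicts with `None` marking a missing value, and the exception type raised by CLARITE itself is not stated in this page, so `ValueError` is an assumption.

```python
from typing import List, Optional


def apply_weight_rule(rows: List[dict], variable: str, weight: str,
                      drop_unweighted: bool = False) -> List[dict]:
    """Raise on, or drop, rows that have the tested variable but no weight."""
    def is_unweighted(row: dict) -> bool:
        return row[variable] is not None and row[weight] is None

    n_unweighted = sum(is_unweighted(row) for row in rows)
    if n_unweighted and not drop_unweighted:
        # Exception type is an assumption made for this sketch
        raise ValueError(f"{n_unweighted} row(s) have '{variable}' but no '{weight}'")
    # Rows missing the variable itself are left alone here; they would be
    # removed later by the complete-case filter.
    return [row for row in rows if not is_unweighted(row)]


rows = [
    {"x": 1.0, "w": 2.0},
    {"x": 0.5, "w": None},   # has the variable, lacks a weight
    {"x": None, "w": None},  # missing the variable as well
]
print(apply_weight_rule(rows, "x", "w", drop_unweighted=True))
```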
Regression Methods
 Binary variables
Treated as continuous features, with values of 0 and 1 (the larger value in the original data is encoded as 1).
 Categorical variables
The results of a likelihood ratio test are used to calculate a p-value. No Beta or SE values are reported.
 Continuous variables
A GLM is used to obtain Beta, SE, and p-value results.
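The 0/1 treatment of binary variables described above can be sketched in a few lines. `encode_binary` is a hypothetical helper written for this example, with `None` standing in for missing values; it is not part of CLARITE.

```python
from typing import List, Optional


def encode_binary(values: list) -> List[Optional[int]]:
    """Encode a two-level variable as 0/1, mapping the larger original value to 1."""
    levels = sorted({v for v in values if v is not None})
    if len(levels) != 2:
        raise ValueError(f"expected exactly 2 distinct values, found {len(levels)}")
    high = levels[1]
    # Missing values are passed through unchanged
    return [None if v is None else int(v == high) for v in values]


print(encode_binary([1, 2, 2, None, 1]))
print(encode_binary(["no", "yes", "no"]))
```

With this encoding, the reported Beta is the estimated effect of the larger original value relative to the smaller one.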

class clarite.analyze.regression.RSurveyRegression(data: pandas.core.frame.DataFrame, outcome_variable: str, regression_variables: List[str], covariates: Optional[List[str]], survey_design_spec: Optional[clarite.modules.survey.survey_design.SurveyDesignSpec] = None, min_n: int = 200, report_categorical_betas: bool = False, standardize_data: bool = False)
Run regressions by calling R from Python. When a SurveyDesignSpec is provided, the R survey library is used. Results should match those run with either GLMRegression or WeightedGLMRegression.
 Parameters
 data:
The data to be analyzed, including the outcome, covariates, and any variables to be regressed.
 outcome_variable:
The variable to be used as the output (y) of the regression
 covariates:
The variables to be used as covariates. Any variables in the DataFrames not listed as covariates are regressed.
 survey_design_spec:
A SurveyDesignSpec object is used to create SurveyDesign objects for each regression. Use None if unweighted regression is desired.
 min_n:
Minimum number of complete-case observations (no NA values for outcome, covariates, variable, or weight). Defaults to 200.
 report_categorical_betas: boolean
False by default. If True, the results will contain one row for each categorical value (other than the reference category) and will include the beta value, standard error (SE), and beta p-value for that specific category. The number of terms increases with the number of categories.
 standardize_data: boolean
False by default. If True, numeric data will be standardized using z-scores before regression. This will affect the beta values and standard error, but not the p-values.
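The claim that standardization changes the beta and standard error but not the p-value can be checked on a toy example. The sketch below uses a one-predictor ordinary least squares fit written with the standard library (not a GLM, and not CLARITE code); standardizing both the predictor and the outcome is an assumption made for the example. Since the p-value depends only on the test statistic and the degrees of freedom, an unchanged statistic implies an unchanged p-value.

```python
import math
import statistics
from typing import List, Tuple


def zscore(values: List[float]) -> List[float]:
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def simple_ols(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Return (beta, SE, t statistic) for the slope of y ~ x with an intercept."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, beta / se


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

beta_raw, se_raw, t_raw = simple_ols(x, y)
beta_std, se_std, t_std = simple_ols(zscore(x), zscore(y))

print("raw:         ", beta_raw, se_raw, t_raw)
print("standardized:", beta_std, se_std, t_std)
```

Rescaling multiplies the beta and its standard error by the same factor, so their ratio is unchanged while the beta itself becomes comparable across variables measured in different units.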
clarite.analyze.interaction_study

class clarite.analyze.regression.InteractionRegression(data, outcome_variable, covariates, min_n=200, interactions=None, report_betas=False, encoding: str = 'additive', edge_encoding_info: Optional[pandas.core.frame.DataFrame] = None, process_num: Optional[int] = None)
Statsmodels GLM Regression. This class handles running regressions and calculating LRT p-values based on including interaction terms.
 Parameters
 data: pd.DataFrame
The data to be analyzed, including the outcome, covariates, and any variables to be regressed.
 outcome_variable: string
The variable to be used as the output (y) of the regression
 covariates: list (strings),
The variables to be used as covariates. Any variables in the DataFrames not listed as covariates are regressed.
 min_n: int or None
Minimum number of complete-case observations (no NA values for outcome, covariates, or variable). Defaults to 200.
 interactions: list(tuple(strings)), str, or None
Valid variables are those in the data that are not the outcome variable or a covariate.
* None: Test all pairwise interactions between valid variables
* String: Test all interactions of this valid variable with other valid variables
* List of tuples: Test specific interactions of valid variables
 report_betas: boolean
 False by default.
If True, the results will contain one row for each interaction term and will include the beta value for that term. The number of terms increases with the number of categories in each interacting term.
 encoding: str, default “additive”
Encoding method to use for any genotype data. One of {‘additive’, ‘dominant’, ‘recessive’, ‘codominant’, or ‘weighted’}
 edge_encoding_info: Optional pd.DataFrame, default None
If edge encoding is used, this must be provided. See PandasGenomics documentation on edge encodings.
 process_num: Optional[int]
Number of processes to use when running the analysis, default is None (use the number of cores)
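The three accepted forms of the interactions parameter can be illustrated by expanding each into an explicit list of variable pairs. `expand_interactions` is a hypothetical helper written for this sketch; the ordering of the pairs and the error raised for invalid variables are assumptions, not CLARITE's documented behavior.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple, Union


def expand_interactions(
    columns: Sequence[str],
    outcome_variable: str,
    covariates: Sequence[str],
    interactions: Optional[Union[str, List[Tuple[str, str]]]] = None,
) -> List[Tuple[str, str]]:
    """Turn None, a single variable name, or a list of pairs into explicit pairs."""
    # Valid variables: everything that is not the outcome or a covariate
    valid = [c for c in columns if c != outcome_variable and c not in covariates]
    if interactions is None:
        return list(combinations(valid, 2))
    if isinstance(interactions, str):
        if interactions not in valid:
            raise ValueError(f"'{interactions}' is not a valid variable")
        return [(interactions, other) for other in valid if other != interactions]
    pairs = [tuple(pair) for pair in interactions]
    for first, second in pairs:
        if first not in valid or second not in valid:
            raise ValueError(f"invalid interaction: ({first}, {second})")
    return pairs


columns = ["y", "age", "a", "b", "c"]
print(expand_interactions(columns, "y", ["age"]))
print(expand_interactions(columns, "y", ["age"], "b"))
print(expand_interactions(columns, "y", ["age"], [("a", "c")]))
```

The number of pairwise tests grows quadratically with the number of valid variables, so passing a single variable or an explicit list keeps large analyses manageable.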
Notes
The family used is either Gaussian (for continuous outcomes) or binomial(logit) (for binary outcomes).
Covariate variables that are constant produce warnings and are ignored.
The dataset is subset to drop missing values, and the same dataset is used for both models in the LRT.