CLARITE Documentation¶
Version: 1.1.0
CLeaning to Analysis: Reproducibility-based Interface for Traits and Exposures
Motivation¶
CLARITE was created to provide an easy-to-use tool for analysis of traits and exposures.
It exists as both a python package (for integration into scripts and/or other packages) and as a command line program.
Installation¶
Running R code from CLARITE¶
To use the ewas_r function, it is recommended to install CLARITE using Conda:
Create and activate a conda environment with python 3.6 or 3.7:
$ conda create -n clarite python=3.7
$ conda activate clarite
Install rpy2 and related dependencies (optional). CLARITE has a version of the EWAS function that calls R code using the survey library:
$ conda install -c conda-forge rpy2
$ conda install r-essentials
$ conda install r-survey
Install CLARITE:
$ pip install clarite
Install required R packages (such as survey) (optional):
$ clarite-cli utils install-r-packages
Citing CLARITE¶
If you use CLARITE in a scientific publication, please consider citing:
1. Lucas AM, et al (2019) CLARITE facilitates the quality control and analysis process for EWAS of metabolic-related traits. Frontiers in Genetics: 10, 1240
BibTeX entry:
@article{lucas2019clarite,
title={CLARITE facilitates the quality control and analysis process for EWAS of metabolic-related traits},
author={Lucas, Anastasia M. and Palmiero, Nicole E. and McGuigan, John and Passero, Kristin and Zhou, Jiayan and Orie, Deven and Ritchie, Marylyn D. and Hall, Molly A.},
journal={Frontiers in Genetics},
volume={10},
pages={1240},
year={2019},
publisher={Frontiers},
url={https://www.frontiersin.org/article/10.3389/fgene.2019.01240},
doi={10.3389/fgene.2019.01240}
}
2. Passero K, et al (2020) Phenome-wide association studies on cardiovascular health and fatty acids considering phenotype quality control practices for epidemiological data. Pacific Symposium on Biocomputing: 25, 659
BibTeX entry:
@inproceedings{passero2020phenome,
title={Phenome-wide association studies on cardiovascular health and fatty acids considering phenotype quality control practices for epidemiological data.},
author={Passero, Kristin and He, Xi and Zhou, Jiayan and Mueller-Myhsok, Bertram and Kleber, Marcus E and Maerz, Winfried and Hall, Molly A},
booktitle={Pacific Symposium on Biocomputing},
volume={25},
pages={659},
year={2020},
organization={World Scientific},
url={https://www.worldscientific.com/doi/abs/10.1142/9789811215636_0058},
doi={10.1142/9789811215636_0058}
}
Usage¶
Organization of Functions¶
CLARITE has many functions organized into several different modules:
- Analyze
- Functions related to calculating EWAS results
- Describe
- Functions used to gather information about data
- Load
- Functions used to load data from different formats or sources
- Modify
- Functions used to filter and/or modify data
- Plot
- Functions that generate plots
- Survey
- Functions and classes related to handling data with a complex survey design
Coding Style¶
There are three primary ways of using CLARITE.
- Using the CLARITE package as part of a python script or Jupyter notebook
This can be done using the function directly:
import clarite
df = clarite.load.from_tsv('data.txt')
df_filtered = clarite.modify.colfilter_min_n(df, n=250)
df_filtered_complete = clarite.modify.rowfilter_incomplete_obs(df_filtered)
clarite.plot.distributions(df_filtered_complete, filename='plots.pdf')
Or it can be done using the Pandas pipe method:
clarite.plot.distributions(df.pipe(clarite.modify.colfilter_min_n, n=250)\
.pipe(clarite.modify.rowfilter_incomplete_obs),
filename='plots.pdf')
- Using the command line tool
clarite-cli load from_tsv data/nhanes.txt results/data.txt --index SEQN
cd results
clarite-cli modify colfilter-min-n data data_filtered -n 250
clarite-cli modify rowfilter-incomplete-obs data_filtered data_filtered_complete
clarite-cli plot distributions data_filtered_complete plots.pdf
- Using the GUI (coming soon)
Example Analysis¶
CLARITE facilitates the quality control and analysis process for EWAS of metabolic-related traits (Lucas et al., 2019, Frontiers in Genetics)
Data from NHANES was used in an EWAS analysis that incorporated the provided survey weight information. The first two cycles of NHANES (1999-2000 and 2001-2002) are assigned to a ‘discovery’ dataset and the next two cycles (2003-2004 and 2005-2006) are assigned to a ‘replication’ dataset.
import pandas as pd
import numpy as np
from scipy import stats
import clarite
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6
Load Data¶
data_folder = "../../../../data/NHANES_99-06/"
data_main_table_over18 = data_folder + "MainTable_keepvar_over18.tsv"
data_main_table = data_folder + "MainTable.csv"
data_var_description = data_folder + "VarDescription.csv"
data_var_categories = data_folder + "VarCat_nopf.txt"
output = "."
Data of all samples with age >= 18¶
# Data
nhanes = clarite.load.from_tsv(data_main_table_over18, index_col="ID")
nhanes.head()
Loaded 22,624 observations of 970 variables
| ID | RIDAGEYR | female | black | ... | LBXV4E | LBXVTE | occupation |
|----|----------|--------|-------|-----|--------|--------|------------|
| 2  | 77 | 0 | 0 | ... | NaN | NaN | 1.0 |
| 5  | 49 | 0 | 0 | ... | NaN | NaN | NaN |
| 6  | 19 | 1 | 0 | ... | NaN | NaN | 2.0 |
| 7  | 59 | 1 | 1 | ... | NaN | NaN | NaN |
| 10 | 43 | 0 | 1 | ... | NaN | NaN | 4.0 |
5 rows × 970 columns
Variable Descriptions¶
var_descriptions = pd.read_csv(data_var_description)[["tab_desc","module","var","var_desc"]]\
.drop_duplicates()\
.set_index("var")
var_descriptions.head()
| var | tab_desc | module | var_desc |
|-----|----------|--------|----------|
| LBXHBC | Hepatitis A, B, C and D | laboratory | Hepatitis B core antibody |
| LBDHBG | Hepatitis A, B, C and D | laboratory | Hepatitis B surface antigen |
| LBDHCV | Hepatitis A, B, C and D | laboratory | Hepatitis C antibody (confirmed) |
| LBDHD | Hepatitis A, B, C and D | laboratory | Hepatitis D (anti-HDV) |
| LBXHBS | Hepatitis B Surface Antibody | laboratory | Hepatitis B Surface Antibody |
# Convert variable descriptions to a dictionary for convenience
var_descr_dict = var_descriptions["var_desc"].to_dict()
Survey Weights, as provided by NHANES¶
Survey weight information is used so that the results apply to the US civilian non-institutionalized population.
This includes:
- SDMVPSU (Cluster ID)
- SDMVSTRA (Nested Strata ID)
- 2-year weights
- 4-year weights
Different variables require different weights, as many of them were measured on a subset of the full dataset. For example:
- WTINT is the survey weight for interview variables.
- WTMEC is the survey weight for variables measured in the Mobile Exam Centers (a subset of interviewed samples)
2-year and 4-year weights are provided. It is important to adjust the weights when combining multiple cycles, by computing the weighted average. In this case 4-year weights (covering the first 2 cycles) are provided by NHANES and the replication weights (the 3rd and 4th cycles) were computed from the 2-year weights prior to loading them here.
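The weight adjustment described above amounts to rescaling each participant's 2-year weight when cycles are pooled. A minimal sketch in pandas (the values and the WTINT2YR/WTINT4YR column names are illustrative, following the NHANES naming convention; this is not CLARITE code):

```python
import pandas as pd

# Hypothetical 2-year interview weights for participants spanning two
# pooled cycles (WTINT2YR follows the NHANES naming convention).
weights = pd.DataFrame({"WTINT2YR": [10000.0, 30000.0]}, index=[1, 2])

# When combining two 2-year cycles, each participant's 2-year weight is
# halved so the pooled weights still sum to the target population size.
weights["WTINT4YR"] = weights["WTINT2YR"] / 2

print(weights)
```

The same weighted-average logic was applied to build the replication weights loaded below.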
survey_design_discovery = pd.read_csv(data_folder + "weights/weights_discovery.txt", sep="\t")\
.rename(columns={'SEQN':'ID'})\
.set_index("ID")\
.drop(columns="SDDSRVYR")
survey_design_discovery.head()
| ID | SDMVPSU | SDMVSTRA | WTINT2YR | ... | WTSVOC2Y | WTSAU2YR | WTUIO2YR |
|----|---------|----------|----------|-----|----------|----------|----------|
| 1 | 1 | 5 | 9727.078709 | ... | NaN | NaN | NaN |
| 2 | 3 | 1 | 26678.636376 | ... | NaN | NaN | NaN |
| 3 | 2 | 7 | 43621.680548 | ... | NaN | NaN | NaN |
| 4 | 1 | 2 | 10346.119327 | ... | NaN | NaN | NaN |
| 5 | 2 | 8 | 91050.846620 | ... | NaN | NaN | NaN |
5 rows × 35 columns
survey_design_replication = pd.read_csv(data_folder + "weights/weights_replication_4yr.txt", sep="\t")\
.rename(columns={'SEQN':'ID'})\
.set_index("ID")\
.drop(columns="SDDSRVYR")
survey_design_replication.head()
| ID | SDMVPSU | SDMVSTRA | WTINT2YR | ... | WTSOG2YR | WTSC2YRA | WTSPC2YR |
|----|---------|----------|----------|-----|----------|----------|----------|
| 21005 | 2 | 39 | 2756.160474 | ... | NaN | NaN | NaN |
| 21006 | 1 | 41 | 2711.070226 | ... | NaN | NaN | NaN |
| 21007 | 2 | 35 | 19882.088706 | ... | NaN | NaN | NaN |
| 21008 | 1 | 32 | 2799.749676 | ... | NaN | NaN | NaN |
| 21009 | 2 | 31 | 48796.839489 | ... | NaN | NaN | NaN |
5 rows × 23 columns
# These files map variables to their correct weights, and were compiled by reading through the NHANES codebook
var_weights = pd.read_csv(data_folder + "weights/VarWeights.csv")
var_weights.head()
|   | variable_name | discovery | replication |
|---|---------------|-----------|-------------|
| 0 | 99999 | WTMEC4YR | WTMEC2YR |
| 1 | ACETAMINOPHEN__CODEINE | WTMEC4YR | WTMEC2YR |
| 2 | ACETAMINOPHEN__CODEINE_PHOSPHATE | WTMEC4YR | WTMEC2YR |
| 3 | ACETAMINOPHEN__HYDROCODONE | WTMEC4YR | WTMEC2YR |
| 4 | ACETAMINOPHEN__HYDROCODONE_BITARTRATE | WTMEC4YR | WTMEC2YR |
# Convert the data to two dictionaries for convenience
weights_discovery = var_weights.set_index('variable_name')['discovery'].to_dict()
weights_replication = var_weights.set_index('variable_name')['replication'].to_dict()
Survey Year data¶
Survey year is found in a separate file and can be matched using the SEQN ID value.
survey_year = pd.read_csv(data_main_table)[["SEQN", "SDDSRVYR"]].rename(columns={'SEQN':'ID'}).set_index("ID")
nhanes = clarite.modify.merge_variables(nhanes, survey_year, how="left")
================================================================================
Running merge_variables
--------------------------------------------------------------------------------
left Merge:
left = 22,624 observations of 970 variables
right = 41,474 observations of 1 variables
Kept 22,624 observations of 971 variables.
================================================================================
Define the phenotype and covariates¶
phenotype = "BMXBMI"
print(f"{phenotype} = {var_descriptions.loc[phenotype, 'var_desc']}")
covariates = ["female", "black", "mexican", "other_hispanic", "other_eth", "SES_LEVEL", "RIDAGEYR", "SDDSRVYR"]
BMXBMI = Body Mass Index (kg/m**2)
Initial cleanup / variable selection¶
Remove any samples missing the phenotype or one of the covariates¶
nhanes = clarite.modify.rowfilter_incomplete_obs(nhanes, only=[phenotype] + covariates)
================================================================================
Running rowfilter_incomplete_obs
--------------------------------------------------------------------------------
Removed 3,687 of 22,624 observations (16.30%) due to NA values in any of 9 variables
================================================================================
Remove variables that aren’t appropriate for the analysis¶
Physical fitness measures¶
These are measurements rather than proxies for environmental exposures
phys_fitness_vars = ["CVDVOMAX","CVDESVO2","CVDS1HR","CVDS1SY","CVDS1DI","CVDS2HR","CVDS2SY","CVDS2DI","CVDR1HR","CVDR1SY","CVDR1DI","CVDR2HR","CVDR2SY","CVDR2DI","physical_activity"]
for v in phys_fitness_vars:
print(f"\t{v} = {var_descr_dict[v]}")
nhanes = nhanes.drop(columns=phys_fitness_vars)
CVDVOMAX = Predicted VO2max (ml/kg/min)
CVDESVO2 = Estimated VO2max (ml/kg/min)
CVDS1HR = Stage 1 heart rate (per min)
CVDS1SY = Stage 1 systolic BP (mm Hg)
CVDS1DI = Stage 1 diastolic BP (mm Hg)
CVDS2HR = Stage 2 heart rate (per min)
CVDS2SY = Stage 2 systolic BP (mm Hg)
CVDS2DI = Stage 2 diastolic BP (mm Hg)
CVDR1HR = Recovery 1 heart rate (per min)
CVDR1SY = Recovery 1 systolic BP (mm Hg)
CVDR1DI = Recovery 1 diastolic BP (mm Hg)
CVDR2HR = Recovery 2 heart rate (per min)
CVDR2SY = Recovery 2 systolic BP (mm Hg)
CVDR2DI = Recovery 2 diastolic BP (mm Hg)
physical_activity = Physical Activity (MET-based rank)
Lipid variables¶
These are likely correlated with BMI in some way
lipid_vars = ["LBDHDD", "LBDHDL", "LBDLDL", "LBXSTR", "LBXTC", "LBXTR"]
print("Removing lipid measurement variables:")
for v in lipid_vars:
print(f"\t{v} = {var_descr_dict[v]}")
nhanes = nhanes.drop(columns=lipid_vars)
Removing lipid measurement variables:
LBDHDD = Direct HDL-Cholesterol (mg/dL)
LBDHDL = Direct HDL-Cholesterol (mg/dL)
LBDLDL = LDL-cholesterol (mg/dL)
LBXSTR = Triglycerides (mg/dL)
LBXTC = Total cholesterol (mg/dL)
LBXTR = Triglyceride (mg/dL)
Indeterminate variables¶
These variables don’t have clear meanings
indeterminate_vars = ["house_type","hepa","hepb", "house_age", "current_past_smoking"]
print("Removing variables with indeterminate meanings:")
for v in indeterminate_vars:
    print(f"\t{v} = {var_descr_dict[v]}")
nhanes = nhanes.drop(columns=indeterminate_vars)
Removing variables with indeterminate meanings:
house_type = house type
hepa = hepatitis a
hepb = hepatitis b
house_age = house age
current_past_smoking = Current or Past Cigarette Smoker?
Recode “missing” values¶
# SMQ077 and DBD100 use "7" for Refused and "9" for Don't Know
nhanes = clarite.modify.recode_values(nhanes, {7: np.nan, 9: np.nan}, only=['SMQ077', 'DBD100'])
================================================================================
Running recode_values
--------------------------------------------------------------------------------
Replaced 11 values from 18,937 observations in 2 variables
================================================================================
Split the data into discovery and replication¶
discovery = (nhanes['SDDSRVYR']==1) | (nhanes['SDDSRVYR']==2)
replication = (nhanes['SDDSRVYR']==3) | (nhanes['SDDSRVYR']==4)
nhanes_discovery = nhanes.loc[discovery]
nhanes_replication = nhanes.loc[replication]
nhanes_discovery.head()
| ID | RIDAGEYR | female | black | ... | LBXVTE | occupation | SDDSRVYR |
|----|----------|--------|-------|-----|--------|------------|----------|
| 2  | 77 | 0 | 0 | ... | NaN | 1.0 | 1 |
| 5  | 49 | 0 | 0 | ... | NaN | NaN | 1 |
| 6  | 19 | 1 | 0 | ... | NaN | 2.0 | 1 |
| 12 | 37 | 0 | 0 | ... | NaN | 4.0 | 1 |
| 13 | 70 | 0 | 0 | ... | NaN | 4.0 | 1 |
5 rows × 945 columns
nhanes_replication.head()
| ID | RIDAGEYR | female | black | ... | LBXVTE | occupation | SDDSRVYR |
|-------|----------|--------|-------|-----|--------|------------|----------|
| 21005 | 19 | 0 | 1 | ... | NaN | 4.0 | 3 |
| 21009 | 55 | 0 | 0 | ... | NaN | 4.0 | 3 |
| 21010 | 52 | 1 | 0 | ... | NaN | 2.0 | 3 |
| 21012 | 63 | 0 | 1 | ... | NaN | 1.0 | 3 |
| 21015 | 83 | 0 | 0 | ... | NaN | 1.0 | 3 |
5 rows × 945 columns
QC¶
Minimum of 200 non-NA values in each variable¶
Drop variables that have too small a sample size
nhanes_discovery = clarite.modify.colfilter_min_n(nhanes_discovery, skip=[phenotype] + covariates)
nhanes_replication = clarite.modify.colfilter_min_n(nhanes_replication, skip=[phenotype] + covariates)
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 0 of 0 binary variables
Testing 0 of 0 categorical variables
Testing 936 of 945 continuous variables
Removed 302 (32.26%) tested continuous variables which had less than 200 non-null values.
================================================================================
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 0 of 0 binary variables
Testing 0 of 0 categorical variables
Testing 936 of 945 continuous variables
Removed 225 (24.04%) tested continuous variables which had less than 200 non-null values.
================================================================================
Categorize Variables¶
This is important, as different variable types must be processed in different ways. The number of unique values for each variable is a good heuristic for determining this. The default settings were used here, but different cutoffs can be specified. CLARITE reports the results in neatly formatted text:
nhanes_discovery = clarite.modify.categorize(nhanes_discovery)
nhanes_replication = clarite.modify.categorize(nhanes_replication)
================================================================================
Running categorize
--------------------------------------------------------------------------------
229 of 643 variables (35.61%) are classified as binary (2 unique values).
19 of 643 variables (2.95%) are classified as categorical (3 to 6 unique values).
336 of 643 variables (52.26%) are classified as continuous (>= 15 unique values).
37 of 643 variables (5.75%) were dropped.
0 variables had zero unique values (all NA).
37 variables had one unique value.
22 of 643 variables (3.42%) were not categorized and need to be set manually.
22 variables had between 6 and 15 unique values
0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
================================================================================
================================================================================
Running categorize
--------------------------------------------------------------------------------
236 of 720 variables (32.78%) are classified as binary (2 unique values).
32 of 720 variables (4.44%) are classified as categorical (3 to 6 unique values).
400 of 720 variables (55.56%) are classified as continuous (>= 15 unique values).
13 of 720 variables (1.81%) were dropped.
0 variables had zero unique values (all NA).
13 variables had one unique value.
39 of 720 variables (5.42%) were not categorized and need to be set manually.
39 variables had between 6 and 15 unique values
0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
================================================================================
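The unique-value heuristic described above can be sketched in plain pandas. This is an illustration using the cutoffs shown in the output (2 unique values = binary, 3-6 = categorical, >= 15 = continuous), not CLARITE's actual implementation:

```python
import pandas as pd

def guess_type(series, cat_max=6, cont_min=15):
    """Classify a variable by its number of unique non-NA values,
    mirroring the cutoffs reported by clarite.modify.categorize."""
    n = series.dropna().nunique()
    if n <= 1:
        return "dropped"      # constant or all-NA
    if n == 2:
        return "binary"
    if n <= cat_max:
        return "categorical"
    if n >= cont_min:
        return "continuous"
    return "unknown"          # 7-14 unique values: set manually

# Illustrative variables (hypothetical data)
sex = pd.Series([0, 1] * 10)          # 2 unique values
group = pd.Series([1, 2, 3, 4] * 5)   # 4 unique values
age = pd.Series(range(20, 40))        # 20 unique values

print(guess_type(sex), guess_type(group), guess_type(age))
# binary categorical continuous
```

Variables falling in the "unknown" gap (7-14 unique values) are the ones flagged for manual categorization below.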
Checking categorization¶
Distributions of variables may be plotted using CLARITE:¶
clarite.plot.distributions(nhanes_discovery,
filename="discovery_distributions.pdf",
continuous_kind='count',
nrows=4,
ncols=3,
quality='medium')
One variable, which the heuristic misclassified, needed correction¶
v = "L_GLUTAMINE_gm"
print(f"\t{v} = {var_descr_dict[v]}\n")
nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=[v])
nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=[v])
L_GLUTAMINE_gm = L_GLUTAMINE_gm
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 1 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 1 of 707 variable(s) as continuous, each with 9,874 observations
================================================================================
Examining the remaining uncategorized variables shows that they are all continuous¶
discovery_types = clarite.describe.get_types(nhanes_discovery)
discovery_unknown = discovery_types[discovery_types == 'unknown'].index
for v in list(discovery_unknown):
print(f"\t{v} = {var_descr_dict[v]}")
nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=discovery_unknown)
WARNING: 22 variables need to be categorized into a type manually
URXUBE = Beryllium, urine (ug/L)
URXUPT = Platinum, urine (ug/L)
DRD350BQ = # of times crabs eaten in past 30 days
DRD350FQ = # of times oysters eaten in past 30 days
DRD350IQ = # of times other shellfish eaten
DRD370AQ = # of times breaded fish products eaten
DRD370DQ = # of times catfish eaten in past 30 days
DRD370EQ = # of times cod eaten in past 30 days
DRD370FQ = # of times flatfish eaten past 30 days
DRD370UQ = # of times other unknown fish eaten
OMEGA_3_FATTY_ACIDS_mg = OMEGA_3_FATTY_ACIDS_mg
ALANINE_mg = ALANINE_mg
ARGININE_mg = ARGININE_mg
BETA_CAROTENE_mg = BETA_CAROTENE_mg
CAFFEINE_mg = CAFFEINE_mg
CYSTINE_mg = CYSTINE_mg
LYSINE_mg = LYSINE_mg
PROLINE_mg = PROLINE_mg
SERINE_mg = SERINE_mg
TRYPTOPHAN_mg = TRYPTOPHAN_mg
TYROSINE_mg = TYROSINE_mg
OTHER_FATTY_ACIDS_mg = OTHER_FATTY_ACIDS_mg
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 22 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================
replication_types = clarite.describe.get_types(nhanes_replication)
replication_unknown = replication_types[replication_types == 'unknown'].index
for v in list(replication_unknown):
print(f"\t{v} = {var_descr_dict[v]}")
nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=replication_unknown)
WARNING: 39 variables need to be categorized into a type manually
LBXVCT = Blood Carbon Tetrachloride (ng/ml)
LBXV3A = Blood 1,1,1-Trichloroethene (ng/ml)
URXUBE = Beryllium, urine (ug/L)
LBXTO2 = Toxoplasma (IgM)
LBXPFDO = Perfluorododecanoic acid
DRD350AQ = # of times clams eaten in past 30 days
DRD350BQ = # of times crabs eaten in past 30 days
DRD350DQ = # of times lobsters eaten past 30 days
DRD350FQ = # of times oysters eaten in past 30 days
DRD350GQ = # of times scallops eaten past 30 days
DRD370AQ = # of times breaded fish products eaten
DRD370DQ = # of times catfish eaten in past 30 days
DRD370EQ = # of times cod eaten in past 30 days
DRD370FQ = # of times flatfish eaten past 30 days
DRD370GQ = # of times haddock eaten in past 30 days
DRD370NQ = # of times sardines eaten past 30 days
DRD370RQ = # of times trout eaten in past 30 days
DRD370UQ = # of times other unknown fish eaten
ALANINE_mg = ALANINE_mg
ARGININE_mg = ARGININE_mg
BETA_CAROTENE_mg = BETA_CAROTENE_mg
CAFFEINE_mg = CAFFEINE_mg
CYSTINE_mg = CYSTINE_mg
HISTIDINE_mg = HISTIDINE_mg
ISOLEUCINE_mg = ISOLEUCINE_mg
LEUCINE_mg = LEUCINE_mg
LYSINE_mg = LYSINE_mg
PHENYLALANINE_mg = PHENYLALANINE_mg
PROLINE_mg = PROLINE_mg
SERINE_mg = SERINE_mg
THREONINE_mg = THREONINE_mg
TRYPTOPHAN_mg = TRYPTOPHAN_mg
TYROSINE_mg = TYROSINE_mg
VALINE_mg = VALINE_mg
LBXV2T = Blood trans-1,2-Dichloroethene (ng/mL)
LBXV4T = Blood 1,1,2,2-Tetrachloroethane (ng/mL)
LBXVDM = Blood Dibromomethane (ng/mL)
URXUTM = Urinary Trimethylarsine Oxide (ug/L)
LBXPFBS = Perfluorobutane sulfonic acid
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 39 of 707 variable(s) as continuous, each with 9,874 observations
================================================================================
Types should match across discovery/replication¶
# Take note of which variables were differently typed in each dataset
print("Correcting differences in variable types between discovery and replication")
# Merge current type series
dtypes = pd.DataFrame({'discovery':clarite.describe.get_types(nhanes_discovery),
'replication':clarite.describe.get_types(nhanes_replication)
})
diff_dtypes = dtypes.loc[(dtypes['discovery'] != dtypes['replication']) &
(~dtypes['discovery'].isna()) &
(~dtypes['replication'].isna())]
# Discovery
# Binary -> Categorical
compare_bin_cat = list(diff_dtypes.loc[(diff_dtypes['discovery']=='binary') &
(diff_dtypes['replication']=='categorical'),].index)
if len(compare_bin_cat) > 0:
print(f"Bin vs Cat: {', '.join(compare_bin_cat)}")
nhanes_discovery = clarite.modify.make_categorical(nhanes_discovery, only=compare_bin_cat)
print()
# Binary -> Continuous
compare_bin_cont = list(diff_dtypes.loc[(diff_dtypes['discovery']=='binary') &
(diff_dtypes['replication']=='continuous'),].index)
if len(compare_bin_cont) > 0:
print(f"Bin vs Cont: {', '.join(compare_bin_cont)}")
nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=compare_bin_cont)
print()
# Categorical -> Continuous
compare_cat_cont = list(diff_dtypes.loc[(diff_dtypes['discovery']=='categorical') &
(diff_dtypes['replication']=='continuous'),].index)
if len(compare_cat_cont) > 0:
print(f"Cat vs Cont: {', '.join(compare_cat_cont)}")
nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=compare_cat_cont)
print()
# Replication
# Binary -> Categorical
compare_cat_bin = list(diff_dtypes.loc[(diff_dtypes['discovery']=='categorical') &
(diff_dtypes['replication']=='binary'),].index)
if len(compare_cat_bin) > 0:
print(f"Cat vs Bin: {', '.join(compare_cat_bin)}")
nhanes_replication = clarite.modify.make_categorical(nhanes_replication, only=compare_cat_bin)
print()
# Binary -> Continuous
compare_cont_bin = list(diff_dtypes.loc[(diff_dtypes['discovery']=='continuous') &
(diff_dtypes['replication']=='binary'),].index)
if len(compare_cont_bin) > 0:
print(f"Cont vs Bin: {', '.join(compare_cont_bin)}")
nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=compare_cont_bin)
print()
# Categorical -> Continuous
compare_cont_cat = list(diff_dtypes.loc[(diff_dtypes['discovery']=='continuous') &
(diff_dtypes['replication']=='categorical'),].index)
if len(compare_cont_cat) > 0:
print(f"Cont vs Cat: {', '.join(compare_cont_cat)}")
nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=compare_cont_cat)
print()
Correcting differences in variable types between discovery and replication
Bin vs Cat: BETA_CAROTENE_mcg, CALCIUM_Unknown, MAGNESIUM_Unknown
================================================================================
Running make_categorical
--------------------------------------------------------------------------------
Set 3 of 606 variable(s) as categorical, each with 9,063 observations
================================================================================
Bin vs Cont: LBXPFDO
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 1 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================
Cat vs Cont: DRD350AQ, DRD350DQ, DRD350GQ
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 3 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================
Cat vs Bin: VITAMIN_B_12_Unknown
================================================================================
Running make_categorical
--------------------------------------------------------------------------------
Set 1 of 707 variable(s) as categorical, each with 9,874 observations
================================================================================
Filtering¶
These are a standard set of filters with default settings
# 200 non-na samples
discovery_1_min_n = clarite.modify.colfilter_min_n(nhanes_discovery)
replication_1_min_n = clarite.modify.colfilter_min_n(nhanes_replication)
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 228 of 228 binary variables
Removed 0 (0.00%) tested binary variables which had less than 200 non-null values.
Testing 15 of 15 categorical variables
Removed 0 (0.00%) tested categorical variables which had less than 200 non-null values.
Testing 363 of 363 continuous variables
Removed 0 (0.00%) tested continuous variables which had less than 200 non-null values.
================================================================================
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 236 of 236 binary variables
Removed 0 (0.00%) tested binary variables which had less than 200 non-null values.
Testing 31 of 31 categorical variables
Removed 0 (0.00%) tested categorical variables which had less than 200 non-null values.
Testing 440 of 440 continuous variables
Removed 0 (0.00%) tested continuous variables which had less than 200 non-null values.
================================================================================
# 200 samples per category
discovery_2_min_cat_n = clarite.modify.colfilter_min_cat_n(discovery_1_min_n, skip=[c for c in covariates + [phenotype] if c in discovery_1_min_n.columns] )
replication_2_min_cat_n = clarite.modify.colfilter_min_cat_n(replication_1_min_n,skip=[c for c in covariates + [phenotype] if c in replication_1_min_n.columns])
================================================================================
Running colfilter_min_cat_n
--------------------------------------------------------------------------------
Testing 222 of 228 binary variables
Removed 162 (72.97%) tested binary variables which had a category with less than 200 values.
Testing 14 of 15 categorical variables
Removed 10 (71.43%) tested categorical variables which had a category with less than 200 values.
================================================================================
================================================================================
Running colfilter_min_cat_n
--------------------------------------------------------------------------------
Testing 230 of 236 binary variables
Removed 154 (66.96%) tested binary variables which had a category with less than 200 values.
Testing 30 of 31 categorical variables
Removed 25 (83.33%) tested categorical variables which had a category with less than 200 values.
================================================================================
# 90-percent-zero filter
discovery_3_pzero = clarite.modify.colfilter_percent_zero(discovery_2_min_cat_n)
replication_3_pzero = clarite.modify.colfilter_percent_zero(replication_2_min_cat_n)
================================================================================
Running colfilter_percent_zero
--------------------------------------------------------------------------------
Testing 363 of 363 continuous variables
Removed 28 (7.71%) tested continuous variables which were equal to zero in at least 90.00% of non-NA observations.
================================================================================
================================================================================
Running colfilter_percent_zero
--------------------------------------------------------------------------------
Testing 440 of 440 continuous variables
Removed 30 (6.82%) tested continuous variables which were equal to zero in at least 90.00% of non-NA observations.
================================================================================
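The percent-zero filter can be approximated in plain pandas. This is a sketch of the idea (drop continuous variables whose non-NA values are zero at least 90% of the time), not CLARITE's implementation:

```python
import pandas as pd

def percent_zero_filter(df, threshold=0.9):
    """Drop columns whose non-NA values are zero in at least
    `threshold` of observations (a sketch of colfilter_percent_zero)."""
    frac_zero = (df == 0).sum() / df.notna().sum()
    return df.loc[:, frac_zero < threshold]

# Hypothetical data: one mostly-zero variable, one varied variable
df = pd.DataFrame({
    "mostly_zero": [0] * 19 + [1],   # 95% zeros -> removed
    "varied": list(range(20)),       # 5% zeros -> kept
})
print(list(percent_zero_filter(df)))  # ['varied']
```

Such heavily zero-inflated variables carry little usable signal for regression, which is why they are filtered before the EWAS.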
# Remove variables without assigned survey weights
keep = set(weights_discovery.keys()) | set([phenotype] + covariates)
discovery_4_weights = discovery_3_pzero[[c for c in list(discovery_3_pzero) if c in keep]]
keep = set(weights_replication.keys()) | set([phenotype] + covariates)
replication_4_weights = replication_3_pzero[[c for c in list(replication_3_pzero) if c in keep]]
Summarize¶
# Summarize Results
print("\nDiscovery:")
clarite.describe.summarize(discovery_4_weights)
print('-'*50)
print("Replication:")
clarite.describe.summarize(replication_4_weights)
Discovery:
9,063 observations of 385 variables
66 Binary Variables
5 Categorical Variables
314 Continuous Variables
0 Unknown-Type Variables
--------------------------------------------------
Replication:
9,874 observations of 428 variables
77 Binary Variables
6 Categorical Variables
345 Continuous Variables
0 Unknown-Type Variables
Keep only variables that passed QC in both datasets¶
both = set(list(discovery_4_weights)) & set(list(replication_4_weights))
discovery_final = discovery_4_weights[both]
replication_final = replication_4_weights[both]
print(f"{len(both)} variables in common")
341 variables in common
Checking the phenotype distribution¶
The phenotype appears to be skewed, so it will need to be corrected. CLARITE makes it easy to plot distributions and to transform variables.
title = f"Discovery: Skew of BMXBMI = {stats.skew(discovery_final['BMXBMI']):.6}"
clarite.plot.histogram(discovery_final, column="BMXBMI", title=title, bins=100)
# Log-transform
discovery_final = clarite.modify.transform(discovery_final, transform_method='log', only='BMXBMI')
# Plot
title = f"Discovery: Skew of BMXBMI after log transform = {stats.skew(discovery_final['BMXBMI']):.6}"
clarite.plot.histogram(discovery_final, column="BMXBMI", title=title, bins=100)
================================================================================
Running transform
--------------------------------------------------------------------------------
Transformed 'BMXBMI' using 'log'
================================================================================


title = f"Replication: Skew of BMXBMI = {stats.skew(replication_final['BMXBMI']):.6}"
clarite.plot.histogram(replication_final, column="BMXBMI", title=title, bins=100)
# Log-transform
replication_final = clarite.modify.transform(replication_final, transform_method='log', only='BMXBMI')
# Plot
title = f"Replication: Skew of BMXBMI after log transform = {stats.skew(replication_final['BMXBMI']):.6}"
clarite.plot.histogram(replication_final, column="BMXBMI", title=title, bins=100)
================================================================================
Running transform
--------------------------------------------------------------------------------
Transformed 'BMXBMI' using 'log'
================================================================================


EWAS¶
Survey Design Spec¶
When utilizing survey data, a SurveyDesignSpec object must be created.
sd_discovery = clarite.survey.SurveyDesignSpec(survey_df=survey_design_discovery,
strata="SDMVSTRA",
cluster="SDMVPSU",
nest=True,
weights=weights_discovery,
single_cluster='centered')
EWAS¶
This can then be passed into the ewas function:
ewas_discovery = clarite.analyze.ewas(phenotype, covariates, discovery_final, sd_discovery)
Running EWAS on a continuous variable
####### Regressing 280 Continuous Variables #######
WARNING: DRD370UQ - 3 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXVID has non-varying covariates(s): SDDSRVYR
WARNING: URXP24 has non-varying covariates(s): SDDSRVYR
WARNING: age_stopped_birth_control has non-varying covariates(s): female
WARNING: DR1TCHOL - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX206 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TVB1 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXDIE has non-varying covariates(s): SDDSRVYR
WARNING: DRD350BQ - 2 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXLYC has non-varying covariates(s): SDDSRVYR
WARNING: LBXF09 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS160 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVK has non-varying covariates(s): SDDSRVYR
WARNING: DRD350FQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370TQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370EQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS100 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXALD has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCOPP - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXP20 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TSELE - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX151 has non-varying covariates(s): SDDSRVYR
WARNING: LBXLUZ has non-varying covariates(s): SDDSRVYR
WARNING: DR1TLZ has non-varying covariates(s): SDDSRVYR
WARNING: DR1TPHOS - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP204 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXCBC has non-varying covariates(s): SDDSRVYR
WARNING: DR1TPOTA - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVB6 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVB12 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP184 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP182 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TMFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: RHQ556 has non-varying covariates(s): female
WARNING: LBXBEC has non-varying covariates(s): SDDSRVYR
WARNING: DR1TSUGR has non-varying covariates(s): SDDSRVYR
WARNING: URXP02 has non-varying covariates(s): SDDSRVYR
WARNING: DRD370AQ - 2 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXEND has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCRYP has non-varying covariates(s): SDDSRVYR
WARNING: DR1TKCAL - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TFIBE - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TTFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TZINC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX110 has non-varying covariates(s): SDDSRVYR
WARNING: how_long_estrogen has non-varying covariates(s): female
WARNING: LBD199 has non-varying covariates(s): SDDSRVYR
WARNING: URXMHH has non-varying covariates(s): SDDSRVYR
WARNING: DR1TTHEO - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TFDFE has non-varying covariates(s): SDDSRVYR
WARNING: URXOP4 - 403 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350DQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TALCO - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXUHG has non-varying covariates(s): female
WARNING: URXP22 has non-varying covariates(s): SDDSRVYR
WARNING: URXP21 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TSFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350HQ - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP1 - 404 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370BQ - 5 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP2 - 404 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TM201 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TFF has non-varying covariates(s): SDDSRVYR
WARNING: URXMOH has non-varying covariates(s): SDDSRVYR
WARNING: DR1TFA has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS120 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXMNM has non-varying covariates(s): SDDSRVYR
WARNING: LBX195 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TACAR has non-varying covariates(s): SDDSRVYR
WARNING: DRD370FQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TATOC has non-varying covariates(s): SDDSRVYR
WARNING: URXOP3 - 404 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX189 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TP225 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP226 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP183 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXTHG has non-varying covariates(s): female
WARNING: DR1TBCAR has non-varying covariates(s): SDDSRVYR
WARNING: DRD370MQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TPFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS060 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TM161 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXCRY has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCALC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXIHG has non-varying covariates(s): female
WARNING: DR1TM221 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TIRON - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370DQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP5 - 403 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TPROT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVARA has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCARB - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TMAGN - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TM181 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS140 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX196 has non-varying covariates(s): SDDSRVYR
WARNING: age_started_birth_control has non-varying covariates(s): female
WARNING: URXP01 has non-varying covariates(s): SDDSRVYR
WARNING: LBXD02 has non-varying covariates(s): SDDSRVYR
WARNING: URXMIB has non-varying covariates(s): SDDSRVYR
WARNING: LBX149 has non-varying covariates(s): SDDSRVYR
WARNING: LBXALC has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS180 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVB2 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TCAFF - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TLYCO has non-varying covariates(s): SDDSRVYR
WARNING: LBX087 has non-varying covariates(s): SDDSRVYR
WARNING: LBXV3A has non-varying covariates(s): SDDSRVYR
WARNING: DR1TP205 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX194 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TNIAC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXUUR has non-varying covariates(s): SDDSRVYR
WARNING: DRD350AQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: URXMC1 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS040 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP6 - 403 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS080 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TRET has non-varying covariates(s): SDDSRVYR
WARNING: LBX028 has non-varying covariates(s): SDDSRVYR
####### Regressing 48 Binary Variables #######
WARNING: DRD350A - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350B - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: current_loud_noise - 925 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXBV has non-varying covariates(s): female, SDDSRVYR
WARNING: ordinary_salt - 19 observation(s) with missing, negative, or zero weights were removed
WARNING: ordinary_salt has non-varying covariates(s): SDDSRVYR
WARNING: taking_birth_control has non-varying covariates(s): female
WARNING: LBXMS1 has non-varying covariates(s): SDDSRVYR
WARNING: DRD370A - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370F - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: SXQ280 has non-varying covariates(s): female
WARNING: DRD350F - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350G - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370B - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370U - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370D - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXHBC - 5808 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370T - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD340 - 22 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350H - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: RHQ540 has non-varying covariates(s): female
WARNING: DRD350D - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370M - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD360 - 21 observation(s) with missing, negative, or zero weights were removed
WARNING: no_salt - 19 observation(s) with missing, negative, or zero weights were removed
WARNING: no_salt has non-varying covariates(s): SDDSRVYR
WARNING: DRD370E - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: RHQ510 has non-varying covariates(s): female
####### Regressing 4 Categorical Variables #######
WARNING: DBD100 - 9 observation(s) with missing, negative, or zero weights were removed
WARNING: DBD100 has non-varying covariates(s): SDDSRVYR
Completed EWAS
A separate function adds multiple-test-corrected pvalues (Bonferroni and FDR) to the results:
clarite.analyze.add_corrected_pvalues(ewas_discovery)
Saving results is straightforward:
ewas_discovery.to_csv(output + "/BMI_Discovery_Results.txt", sep="\t")
Selecting top results¶
Variables with an FDR less than 0.1 were selected (using standard functionality from the Pandas library, since the EWAS results are simply a Pandas DataFrame).
significant_discovery_variables = ewas_discovery[ewas_discovery['pvalue_fdr']<0.1].index.get_level_values('Variable')
print(f"Using {len(significant_discovery_variables)} variables based on FDR-corrected pvalues from the discovery dataset")
Using 100 variables based on FDR-corrected pvalues from the discovery dataset
Replication¶
The variables with low FDR in the discovery dataset were analyzed in the replication dataset.
Filter out variables¶
keep_cols = list(significant_discovery_variables) + covariates + [phenotype]
replication_final_sig = clarite.modify.colfilter(replication_final, only=keep_cols)
clarite.describe.summarize(replication_final_sig)
================================================================================
Running colfilter
--------------------------------------------------------------------------------
Keeping 109 of 341 variables:
19 of 54 binary variables
3 of 5 categorical variables
87 of 282 continuous variables
0 of 0 unknown variables
================================================================================
9,874 observations of 109 variables
19 Binary Variables
3 Categorical Variables
87 Continuous Variables
0 Unknown-Type Variables
Run Replication EWAS¶
survey_design_replication
ID     SDMVPSU  SDMVSTRA  WTINT2YR      ...  WTSOG2YR    WTSC2YRA  WTSPC2YR
21005  2        39        2756.160474   ...  NaN         NaN       NaN
21006  1        41        2711.070226   ...  NaN         NaN       NaN
21007  2        35        19882.088706  ...  NaN         NaN       NaN
21008  1        32        2799.749676   ...  NaN         NaN       NaN
21009  2        31        48796.839489  ...  NaN         NaN       NaN
...    ...      ...       ...           ...  ...         ...       ...
41470  2        46        8473.426110   ...  NaN         NaN       NaN
41471  1        52        3141.652775   ...  9148.1015   NaN       NaN
41472  1        48        33673.789576  ...  99690.8420  NaN       71892.249044
41473  1        55        9956.504488   ...  NaN         NaN       26257.847868
41474  1        47        3087.275833   ...  9417.3990   NaN       NaN

20470 rows × 23 columns
sd_replication = clarite.survey.SurveyDesignSpec(survey_df=survey_design_replication,
strata="SDMVSTRA",
cluster="SDMVPSU",
nest=True,
weights=weights_replication,
single_cluster='centered')
ewas_replication = clarite.analyze.ewas(phenotype, covariates, replication_final_sig, sd_replication)
clarite.analyze.add_corrected_pvalues(ewas_replication)
ewas_replication.to_csv(output + "/BMI_Replication_Results.txt", sep="\t")
Running EWAS on a continuous variable
####### Regressing 85 Continuous Variables #######
WARNING: URXP24 has non-varying covariates(s): SDDSRVYR
WARNING: age_stopped_birth_control has non-varying covariates(s): female
WARNING: LBXODT has non-varying covariates(s): SDDSRVYR
WARNING: LBX206 has non-varying covariates(s): SDDSRVYR
WARNING: LBX170 has non-varying covariates(s): SDDSRVYR
WARNING: LBX099 has non-varying covariates(s): SDDSRVYR
WARNING: URXP20 has non-varying covariates(s): SDDSRVYR
WARNING: LBX156 has non-varying covariates(s): SDDSRVYR
WARNING: URXP11 has non-varying covariates(s): SDDSRVYR
WARNING: LBX118 has non-varying covariates(s): SDDSRVYR
WARNING: LBX153 has non-varying covariates(s): SDDSRVYR
WARNING: LBXD05 has non-varying covariates(s): SDDSRVYR
WARNING: LBD199 has non-varying covariates(s): SDDSRVYR
WARNING: LBXHPE has non-varying covariates(s): SDDSRVYR
WARNING: URXOP1 has non-varying covariates(s): SDDSRVYR
WARNING: URXP15 has non-varying covariates(s): SDDSRVYR
WARNING: LBXMIR has non-varying covariates(s): SDDSRVYR
WARNING: URXOP3 has non-varying covariates(s): SDDSRVYR
WARNING: LBXHXC has non-varying covariates(s): SDDSRVYR
WARNING: LBXME has non-varying covariates(s): SDDSRVYR
WARNING: LBX180 has non-varying covariates(s): SDDSRVYR
WARNING: LBX196 has non-varying covariates(s): SDDSRVYR
WARNING: age_started_birth_control has non-varying covariates(s): female
WARNING: LBXF04 has non-varying covariates(s): SDDSRVYR
WARNING: URXP03 has non-varying covariates(s): SDDSRVYR
WARNING: LBXIRN has non-varying covariates(s): female
WARNING: LBX194 has non-varying covariates(s): SDDSRVYR
WARNING: DUQ110 has non-varying covariates(s): SDDSRVYR
####### Regressing 13 Binary Variables #######
WARNING: DUQ100 has non-varying covariates(s): SDDSRVYR
WARNING: LBXHBC - 6318 observation(s) with missing, negative, or zero weights were removed
WARNING: SMQ210 has non-varying covariates(s): SDDSRVYR
WARNING: ever_loud_noise_gt3 has non-varying covariates(s): SDDSRVYR
WARNING: ever_loud_noise_gt3_2 has non-varying covariates(s): SDDSRVYR
WARNING: DRD370M - 19 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370E - 19 observation(s) with missing, negative, or zero weights were removed
####### Regressing 2 Categorical Variables #######
Completed EWAS
Compare results¶
# Combine results
ewas_keep_cols = ['pvalue', 'pvalue_bonferroni', 'pvalue_fdr']
combined = pd.merge(ewas_discovery[['Variable_type'] + ewas_keep_cols],
ewas_replication[ewas_keep_cols],
left_index=True, right_index=True, suffixes=("_disc", "_repl"))
# FDR < 0.1 in both
fdr_significant = combined.loc[(combined['pvalue_fdr_disc'] <= 0.1) & (combined['pvalue_fdr_repl'] <= 0.1),]
fdr_significant = fdr_significant.assign(m=fdr_significant[['pvalue_fdr_disc', 'pvalue_fdr_repl']].mean(axis=1))\
.sort_values('m').drop('m', axis=1)
fdr_significant.to_csv(output + "/Significant_Results_FDR_0.1.txt", sep="\t")
print(f"{len(fdr_significant)} variables had FDR < 0.1 in both discovery and replication")
# Bonferroni < 0.05 in both
bonf_significant05 = combined.loc[(combined['pvalue_bonferroni_disc'] <= 0.05) & (combined['pvalue_bonferroni_repl'] <= 0.05)]
bonf_significant05 = bonf_significant05.assign(m=bonf_significant05[['pvalue_bonferroni_disc', 'pvalue_bonferroni_repl']].mean(axis=1))\
                                       .sort_values('m').drop('m', axis=1)
bonf_significant05.to_csv(output + "/Significant_Results_Bonferroni_0.05.txt", sep="\t")
print(f"{len(bonf_significant05)} variables had Bonferroni < 0.05 in both discovery and replication")
# Bonferroni < 0.01 in both
bonf_significant01 = combined.loc[(combined['pvalue_bonferroni_disc'] <= 0.01) & (combined['pvalue_bonferroni_repl'] <= 0.01)]
bonf_significant01 = bonf_significant01.assign(m=bonf_significant01[['pvalue_bonferroni_disc', 'pvalue_bonferroni_repl']].mean(axis=1))\
                                       .sort_values('m').drop('m', axis=1)
bonf_significant01.to_csv(output + "/Significant_Results_Bonferroni_0.01.txt", sep="\t")
print(f"{len(bonf_significant01)} variables had Bonferroni < 0.01 in both discovery and replication")
bonf_significant01.head()
63 variables had FDR < 0.1 in both discovery and replication
16 variables had Bonferroni < 0.05 in both discovery and replication
10 variables had Bonferroni < 0.01 in both discovery and replication
Variable               Phenotype  Variable_type  pvalue_disc   pvalue_bonferroni_disc  ...  pvalue_repl   pvalue_bonferroni_repl  pvalue_fdr_repl
LBXGTC                 BMXBMI     continuous     2.611467e-14  8.670071e-12            ...  2.729179e-11  2.729179e-09            4.548631e-10
LBXIRN                 BMXBMI     continuous     3.283440e-11  1.090102e-08            ...  1.748424e-12  1.748424e-10            5.828079e-11
total_days_drink_year  BMXBMI     continuous     4.562887e-07  1.514879e-04            ...  1.709681e-10  1.709681e-08            2.442402e-09
LBXBEC                 BMXBMI     continuous     8.394013e-07  2.786812e-04            ...  1.689733e-08  1.689733e-06            1.299795e-07
LBXCBC                 BMXBMI     continuous     9.142106e-07  3.035179e-04            ...  1.159283e-09  1.159283e-07            1.288093e-08

5 rows × 7 columns
Manhattan Plots¶
CLARITE provides functionality for generating highly customizable Manhattan plots from EWAS results.
data_categories = pd.read_csv(data_var_categories, sep="\t").set_index('Variable')
data_categories.columns = ['category']
data_categories = data_categories['category'].to_dict()
clarite.plot.manhattan({'discovery': ewas_discovery, 'replication': ewas_replication},
categories=data_categories, title="Weighted EWAS Results", filename=output + "/ewas_plot.png",
figsize=(14, 10))

Complex Survey Data¶
CLARITE provides support for handling complex survey designs, similar to the R package survey.
A SurveyDesignSpec can be created, which is used to obtain survey design objects for specific variables:
sd_discovery = clarite.survey.SurveyDesignSpec(survey_df=survey_design_discovery,
strata="SDMVSTRA",
cluster="SDMVPSU",
nest=True,
weights=weights_discovery,
single_cluster='adjust',
drop_unweighted=False)
After a SurveyDesignSpec is created, it can be passed into the ewas function to utilize the survey design parameters:
ewas_discovery = clarite.analyze.ewas(phenotype="logBMI",
covariates=covariates,
data=nhanes_discovery,
survey_design_spec=sd_discovery)
Single Clusters¶
There are a few different options for the ‘single_cluster’ parameter, which controls how strata with single clusters are handled in the linearized covariance calculation:
- fail - Throw an error (default)
- adjust - Use the average value of all observations (conservative)
- average - Use the average of other strata
- certainty - Single-cluster strata don’t contribute to the variance
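As a toy illustration of the underlying issue (this is a sketch under simplifying assumptions, not CLARITE's actual implementation), a single-cluster stratum's deviation from its own stratum mean is always zero, so it silently contributes nothing to a between-cluster variance sum. The 'adjust' option substitutes the grand mean, while 'certainty' deliberately lets such strata contribute nothing:

```python
import numpy as np

# Toy cluster scores per stratum; stratum "c" has a single cluster,
# which contributes zero under the naive between-cluster formula.
strata = {"a": [1.0, 3.0], "b": [2.0, 6.0], "c": [4.0]}

all_scores = np.concatenate([np.asarray(v) for v in strata.values()])
grand_mean = all_scores.mean()

def stratum_ss(scores, single_cluster="fail"):
    """Sum of squared deviations of cluster scores within one stratum."""
    scores = np.asarray(scores)
    if len(scores) == 1:
        if single_cluster == "fail":
            raise ValueError("stratum with a single cluster")
        if single_cluster == "adjust":
            # Deviate from the grand mean instead of the degenerate stratum mean
            return float(((scores - grand_mean) ** 2).sum())
        if single_cluster == "certainty":
            return 0.0  # single-cluster strata contribute no variance
    return float(((scores - scores.mean()) ** 2).sum())

total_adjust = sum(stratum_ss(v, "adjust") for v in strata.values())
total_certainty = sum(stratum_ss(v, "certainty") for v in strata.values())
```

With 'adjust', the lone cluster in stratum "c" adds a small positive term; with 'certainty' it adds nothing, so 'adjust' is the more conservative choice.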
Missing Weights¶
The drop_unweighted parameter is False by default: any variables with missing weights will produce an error and no results. Setting it to True will simply drop those observations (which may not be statistically valid).
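As a rough sketch of what that dropping amounts to (this is not CLARITE's code, and the weight values are made up), an observation is usable only when its weight is present and positive:

```python
import numpy as np
import pandas as pd

# Hypothetical survey weights: one missing, one zero
weights = pd.Series([1.2, np.nan, 0.0, 2.5], name="WTMEC2YR")

# Keep only observations with a present, positive weight
valid = weights.notna() & (weights > 0)
n_dropped = int((~valid).sum())
```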
Subsets¶
When using a survey design, the data should not be modified directly in order to examine subpopulations. Instead, subset the survey design:
design = clarite.survey.SurveyDesignSpec(df, weights="WTMEC2YR", cluster="SDMVPSU", strata="SDMVSTRA",
fpc=None, nest=True, drop_unweighted=True)
design.subset(df['agecat'] != "(19,39]")
API Reference¶
If you are looking for information on a specific function, class or method, this part of the documentation is for you.
CLARITE functions are organized into several modules:
Analyze¶
EWAS and associated calculations
- ewas(phenotype, covariates, data, …): Run an Environment-Wide Association Study
- add_corrected_pvalues(ewas_result): Add Bonferroni and FDR pvalues to an EWAS result and sort by increasing FDR (in-place)
Describe¶
Functions that are used to gather information about some data
- correlations(data, threshold): Return variables with Pearson correlation above the threshold
- freq_table(data): Return the count of each unique value for all binary and categorical variables
- get_types(data): Return the type of each variable
- percent_na(data): Return the percent of observations that are NA for each variable
- skewness(data, dropna): Return the skewness of each continuous variable
- summarize(data): Print the number of each type of variable and the number of observations
Modify¶
Functions used to filter and/or change some data, always taking in one set of data and returning one set of data.
- categorize(data, cat_min, cat_max, cont_min): Classify variables into constant, binary, categorical, continuous, and 'unknown'
- colfilter(data, skip, only): Remove some variables (skip) or keep only certain variables (only)
- colfilter_percent_zero(data, filter_percent, …): Remove continuous variables which have <proportion> or more values of zero (excluding NA)
- colfilter_min_n(data, n, skip, …): Remove variables which have fewer than <n> non-NA values
- colfilter_min_cat_n(data, n, skip, …): Remove binary and categorical variables which have fewer than <n> occurrences of each unique value
- make_binary(data, skip, …): Set variable types as Binary
- make_categorical(data, skip, …): Set variable types as Categorical
- make_continuous(data, skip, …): Set variable types as Numeric
- merge_observations(top, bottom): Merge two datasets, keeping only the columns present in both
- merge_variables(left, …): Merge a list of dataframes with different variables side-by-side
- move_variables(left, right, …): Move one or more variables from one DataFrame to another
- recode_values(data, replacement_dict, skip, …): Convert values in a dataframe
- remove_outliers(data, method[, cutoff]): Remove outliers from continuous variables by replacing them with np.nan
- rowfilter_incomplete_obs(data, skip, …): Remove rows containing null values
- transform(data, transform_method, skip, …): Apply a transformation function to a variable
Plot¶
Functions that generate plots
- histogram(data, column, figsize, …): Plot a histogram of the values in the given column
- distributions(data, filename, …): Create a pdf containing histograms for each binary or categorical variable, and one of several types of plots for each continuous variable
- manhattan(dfs, …): Create a Manhattan-like plot for a list of EWAS results
- manhattan_fdr(dfs, …): Create a Manhattan-like plot for a list of EWAS results using FDR significance
- manhattan_bonferroni(dfs, …): Create a Manhattan-like plot for a list of EWAS results using Bonferroni significance
- top_results(ewas_result, pvalue_name, …): Create a dotplot for EWAS results showing pvalues and beta coefficients
Survey¶
Complex survey design
- SurveyDesignSpec(survey_df, strata, cluster, …): Holds parameters for building a statsmodels SurveyDesign object
CLI Reference¶
Documentation for using the CLI
Once CLARITE is installed, the command line interface can be run using the clarite-cli command.
The command line interface has command groups that are the same as the modules in the package (except for survey).
The --help option will show documentation when run with any command or command group:
$ clarite-cli --help
Usage: clarite-cli [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
analyze
describe
load
modify
plot
--skip and --only¶
Many commands in the CLI have the skip and only options. These will limit the command to specific variables. If skip is specified, all variables except the specified ones will be processed. If only is specified, only the specified variables will be processed.
Only one of the two options may be used in a single command. Variables may be passed in any combination of two ways:
- As the name of a file containing one variable name per line
- As the variable name specified directly in the terminal
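A variable-list file is plain text with one variable name per line. A quick way to create one (the file name and variable names here are only illustrative) before passing it to --skip or --only:

```shell
# Build a variable-list file: one variable name per line
# (these names are illustrative)
printf 'SDDSRVYR\nfemale\nBMXBMI\n' > covars.txt

cat covars.txt
wc -l < covars.txt
```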
For example:
results in:
-------------------------------------------------------------------------------------------------------------------------
--only: 1 variable(s) specified directly
8 variable(s) loaded from 'covars.txt'
=========================================================================================================================
Running rowfilter_incomplete_obs
-------------------------------------------------------------------------------------------------------------------------
Removed 3,687 of 22,624 observations (16.30%) due to NA values in any of 9 variables
=========================================================================================================================
Commands¶
clarite-cli analyze¶
clarite-cli analyze [OPTIONS] COMMAND [ARGS]...
add-corrected-pvals¶
Get FDR-corrected and Bonferroni-corrected pvalues
clarite-cli analyze add-corrected-pvals [OPTIONS] EWAS_RESULT OUTPUT
Arguments
- EWAS_RESULT: Required argument
- OUTPUT: Required argument
ewas¶
Run an EWAS analysis
clarite-cli analyze ewas [OPTIONS] PHENOTYPE DATA OUTPUT
Options
- -c, --covariate <covariate>: Covariates
- --covariance-calc <covariance_calc>: Covariance calculation method. Options: stata|jackknife
- --min-n <min_n>: Minimum number of complete cases needed to run a regression
- --survey-data <survey_data>: Tab-separated data file with survey weights, strata IDs, and/or cluster IDs. Must have an 'ID' column.
- --strata <strata>: Name of the strata column in the survey data
- --cluster <cluster>: Name of the cluster column in the survey data
- --nested, --not-nested: Whether survey data is nested or not
- --weights-file <weights_file>: Tab-delimited data file with 'Variable' and 'Weight' columns to match weights from the survey data to specific variables
- -w, --weight <weight>: Name of a survey weight column found in the survey data. This option can't be used with --weights-file
- --fpc <fpc>: Name of the finite population correction column in the survey data
- --single-cluster <single_cluster>: How to handle singular clusters. Options: fail|adjust|average|certainty

Arguments
- PHENOTYPE: Required argument
- DATA: Required argument
- OUTPUT: Required argument
ewas-r¶
Run an EWAS analysis using R
clarite-cli analyze ewas-r [OPTIONS] PHENOTYPE DATA OUTPUT
Options
- -c, --covariate <covariate>: Covariates
- --covariance-calc <covariance_calc>: Covariance calculation method. Options: stata|jackknife
- --min-n <min_n>: Minimum number of complete cases needed to run a regression
- --survey-data <survey_data>: Tab-separated data file with survey weights, strata IDs, and/or cluster IDs. Must have an 'ID' column.
- --strata <strata>: Name of the strata column in the survey data
- --cluster <cluster>: Name of the cluster column in the survey data
- --nested, --not-nested: Whether survey data is nested or not
- --weights-file <weights_file>: Tab-delimited data file with 'Variable' and 'Weight' columns to match weights from the survey data to specific variables
- -w, --weight <weight>: Name of a survey weight column found in the survey data. This option can't be used with --weights-file
- --fpc <fpc>: Name of the finite population correction column in the survey data
- --single-cluster <single_cluster>: How to handle singular clusters. Options: fail|adjust|average|certainty

Arguments
- PHENOTYPE: Required argument
- DATA: Required argument
- OUTPUT: Required argument
get-significant¶
Filter out non-significant results
clarite-cli analyze get-significant [OPTIONS] EWAS_RESULT OUTPUT
Options
- --fdr, --bonferroni: Use FDR (--fdr) or Bonferroni pvalues (--bonferroni). FDR by default.
- -p, --pvalue <pvalue>: Keep results with a pvalue <= this value (0.05 by default)

Arguments
- EWAS_RESULT: Required argument
- OUTPUT: Required argument
clarite-cli describe¶
clarite-cli describe [OPTIONS] COMMAND [ARGS]...
correlations¶
Report top correlations between variables
clarite-cli describe correlations [OPTIONS] DATA OUTPUT
Options
-t, --threshold <threshold>
    Report correlations with R >= this value
Arguments
DATA
    Required argument
OUTPUT
    Required argument
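Conceptually, this command reports the pairwise correlations that exceed the threshold. A minimal pandas sketch (toy data, not CLARITE's internal code):

```python
import pandas as pd

# Toy dataset standing in for loaded CLARITE data
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 70, 80],
    "shoe":   [36, 38, 37, 41],
})

# Pairwise correlations, flattened to unique variable pairs
corr = df.corr().stack()
corr = corr[corr.index.get_level_values(0) < corr.index.get_level_values(1)]

# Report pairs with correlation at or above the threshold
threshold = 0.9
top = corr[corr >= threshold].sort_values(ascending=False)
print(top)
```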
freq-table¶
Report the number of occurrences of each value for each variable
clarite-cli describe freq-table [OPTIONS] DATA OUTPUT
Arguments
DATA
    Required argument
OUTPUT
    Required argument
get-types¶
Get the type of each variable
clarite-cli describe get-types [OPTIONS] DATA OUTPUT
Arguments
DATA
    Required argument
OUTPUT
    Required argument
clarite-cli load¶
clarite-cli load [OPTIONS] COMMAND [ARGS]...
from-csv¶
Load data from a comma-separated file and save it in the standard format
clarite-cli load from-csv [OPTIONS] INPUT OUTPUT
Options
-i, --index <index>
    Name of the column to use as the index. Default is the first column.
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
INPUT
    Required argument
OUTPUT
    Required argument
from-tsv¶
Load data from a tab-separated file and save it in the standard format
clarite-cli load from-tsv [OPTIONS] INPUT OUTPUT
Options
-i, --index <index>
    Name of the column to use as the index. Default is the first column.
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
INPUT
    Required argument
OUTPUT
    Required argument
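The loading behavior described above (tab-separated input, first column as index) mirrors a plain pandas read. A minimal sketch with an in-memory file standing in for a real TSV:

```python
import io
import pandas as pd

# A small tab-separated file; the first column becomes the index
tsv = io.StringIO("ID\tage\tbmi\ns1\t30\t22.5\ns2\t45\t27.1\n")
df = pd.read_csv(tsv, sep="\t", index_col=0)
print(df.shape)  # (2, 2)
```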
clarite-cli modify¶
clarite-cli modify [OPTIONS] COMMAND [ARGS]...
categorize¶
Categorize data based on the number of unique values
clarite-cli modify categorize [OPTIONS] DATA OUTPUT
Options
--cat_min <cat_min>
    Minimum number of unique values in a variable to make it a categorical type
--cat_max <cat_max>
    Maximum number of unique values in a variable to make it a categorical type
--cont_min <cont_min>
    Minimum number of unique values in a variable to make it a continuous type
Arguments
DATA
    Required argument
OUTPUT
    Required argument
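The unique-value heuristic above can be sketched in a few lines of Python. The threshold values and the exact decision rules here are illustrative assumptions, not CLARITE's actual defaults or logic:

```python
import pandas as pd

# Illustrative thresholds (assumed, not necessarily the CLI defaults)
cat_min, cat_max, cont_min = 3, 6, 15

def guess_type(series: pd.Series) -> str:
    """Classify a variable by its number of unique non-NA values."""
    n = series.nunique()
    if n <= 2:
        return "binary"
    if cat_min <= n <= cat_max:
        return "categorical"
    if n >= cont_min:
        return "continuous"
    return "unknown"  # falls between cat_max and cont_min: needs manual review

df = pd.DataFrame({
    "smoker": [0, 1, 0, 1] * 5,
    "race":   [1, 2, 3, 4] * 5,
    "bmi":    [float(i) for i in range(20)],
})
print({c: guess_type(df[c]) for c in df.columns})
```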
colfilter¶
Remove some variables from a dataset
clarite-cli modify colfilter [OPTIONS] DATA OUTPUT
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
colfilter-min-cat-n¶
Filter variables based on a minimum number of non-NA observations per category
clarite-cli modify colfilter-min-cat-n [OPTIONS] DATA OUTPUT
Options
-n <n>
    Remove variables with fewer than this many non-NA observations in each category
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
colfilter-min-n¶
Filter variables based on a minimum number of non-NA observations
clarite-cli modify colfilter-min-n [OPTIONS] DATA OUTPUT
Options
-n <n>
    Remove variables with fewer than this many non-NA observations
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
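In pandas terms, this filter corresponds to dropping columns that fail a non-NA count threshold. A minimal sketch (toy data, not the actual CLARITE code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_complete": [1.0, 2.0, 3.0, np.nan],
    "mostly_missing":  [1.0, np.nan, np.nan, np.nan],
})

# Keep variables with at least n non-NA observations
n = 3
kept = df.dropna(axis="columns", thresh=n)
print(list(kept.columns))  # ['mostly_complete']
```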
colfilter-percent-zero¶
Filter variables based on the fraction of observations with a value of zero
clarite-cli modify colfilter-percent-zero [OPTIONS] DATA OUTPUT
Options
-p, --filter-percent <filter_percent>
    Remove variables when the percentage of observations equal to 0 is >= this value (0 to 100)
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
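The percent-zero rule can be expressed as a one-line column mask in pandas. A sketch using an illustrative 50% cutoff (chosen for the example, not the CLI default):

```python
import pandas as pd

df = pd.DataFrame({
    "rarely_zero": [0, 1, 2, 3],
    "mostly_zero": [0, 0, 0, 5],
})

# Percentage of observations equal to 0, per variable
pct_zero = (df == 0).mean() * 100

# Remove variables where that percentage is >= the cutoff
filter_percent = 50
kept = df.loc[:, pct_zero < filter_percent]
print(list(kept.columns))  # ['rarely_zero']
```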
drop-extra-categories¶
Remove extra categories from categorical datatypes
clarite-cli modify drop-extra-categories [OPTIONS] DATA OUTPUT
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
make-binary¶
Set the type of variables to ‘binary’
clarite-cli modify make-binary [OPTIONS] DATA OUTPUT
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
make-categorical¶
Set the type of variables to ‘categorical’
clarite-cli modify make-categorical [OPTIONS] DATA OUTPUT
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
make-continuous¶
Set the type of variables to ‘continuous’
clarite-cli modify make-continuous [OPTIONS] DATA OUTPUT
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
merge-observations¶
Merge observations from two different datasets into one
clarite-cli modify merge-observations [OPTIONS] TOP BOTTOM OUTPUT
Arguments
TOP
    Required argument
BOTTOM
    Required argument
OUTPUT
    Required argument
merge-variables¶
Merge variables from two different datasets into one
clarite-cli modify merge-variables [OPTIONS] LEFT RIGHT OUTPUT
Options
-h, --how <how>
    Type of merge. Options: left|right|inner|outer
Arguments
LEFT
    Required argument
RIGHT
    Required argument
OUTPUT
    Required argument
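The merge semantics (left|right|inner|outer) follow the usual database-style join on the observation index. A minimal pandas sketch with toy data:

```python
import pandas as pd

left = pd.DataFrame({"age": [30, 40]}, index=["s1", "s2"])
right = pd.DataFrame({"bmi": [22.0, 27.5]}, index=["s2", "s3"])

# how="inner" keeps only observations present in both datasets
merged = left.merge(right, how="inner", left_index=True, right_index=True)
print(merged.index.tolist())  # ['s2']
```

An `outer` merge would instead keep s1, s2, and s3, filling the missing cells with NA.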
move-variables¶
Move variables from one dataset to another
clarite-cli modify move-variables [OPTIONS] LEFT RIGHT
Options
--output_left <output_left>
    Output file for the updated left dataset
--output_right <output_right>
    Output file for the updated right dataset
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
LEFT
    Required argument
RIGHT
    Required argument
recode-values¶
Replace values in the data with other values. The value being replaced (‘current’) and the new value (‘replacement’) are each specified along with their type, and only one type may be given for each. If a value is not specified, the value being replaced or being inserted is None.
clarite-cli modify recode-values [OPTIONS] DATA OUTPUT
Options
--current-str <cs>
    Replace occurrences of this string value
--current-int <ci>
    Replace occurrences of this integer value
--current-float <cf>
    Replace occurrences of this float value
--replacement-str <rs>
    Insert this string value
--replacement-int <ri>
    Insert this integer value
--replacement-float <rf>
    Insert this float value
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
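A common use of recoding is mapping sentinel survey codes to missing values. A pandas sketch of that idea (the column name and code 7 are hypothetical examples):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [1, 2, 7, 9]})

# e.g. replace the survey code 7 ("refused") with a missing value,
# equivalent to --current-int 7 with no replacement specified
recoded = df.replace({"income": {7: np.nan}})
print(recoded["income"].tolist())  # [1.0, 2.0, nan, 9.0]
```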
remove-outliers¶
Replace outlier values with NaN. Outliers are defined using a gaussian or IQR approach.
clarite-cli modify remove-outliers [OPTIONS] DATA OUTPUT
Options
-m, --method <method>
    Options: gaussian|iqr
-c, --cutoff <cutoff>
    Cutoff used by the chosen method
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
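The IQR approach can be sketched in pandas: values falling more than `cutoff` times the interquartile range outside the quartiles are set to NaN. This is an illustrative implementation with a conventional 1.5 cutoff, not CLARITE's exact code:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])

# IQR method: flag values beyond cutoff * IQR outside [Q1, Q3]
cutoff = 1.5
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
cleaned = s.where((s >= q1 - cutoff * iqr) & (s <= q3 + cutoff * iqr))
print(int(cleaned.isna().sum()))  # 1 (the value 100.0 is removed)
```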
rowfilter¶
Select some rows from a dataset using a simple comparison, keeping rows where the comparison is True.
clarite-cli modify rowfilter [OPTIONS] DATA OUTPUT COLUMN
Options
--value-str <vs>
    Compare values in the column to this string
--value-int <vi>
    Compare values in the column to this integer
--value-float <vf>
    Compare values in the column to this floating point number
-c, --comparison <comparison>
    Keep rows where the value of the column is lt (<), lte (<=), eq (==), gte (>=), or gt (>) the specified value. eq by default. Options: lt|lte|eq|gte|gt
Arguments
DATA
    Required argument
OUTPUT
    Required argument
COLUMN
    Required argument
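Each comparison keyword maps naturally onto a pandas Series comparison method. A sketch showing `gte` against an integer value (toy data; the mapping dict is illustrative, not CLARITE's internals):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 65]})

# Map the CLI comparison names onto pandas Series comparison methods
comparisons = {"lt": "lt", "lte": "le", "eq": "eq", "gte": "ge", "gt": "gt"}

# e.g. --comparison gte --value-int 40 keeps rows where age >= 40
kept = df[getattr(df["age"], comparisons["gte"])(40)]
print(kept["age"].tolist())  # [40, 65]
```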
rowfilter-incomplete-obs¶
Filter out observations that are not complete cases (i.e. those that contain NA values)
clarite-cli modify rowfilter-incomplete-obs [OPTIONS] DATA OUTPUT
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
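Keeping only complete cases is a row-wise `dropna` in pandas. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# Keep only complete cases (rows without any NA values)
complete = df.dropna(axis="index", how="any")
print(len(complete))  # 2
```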
transform-variable¶
Apply a function to each value of a variable
clarite-cli modify transform-variable [OPTIONS] DATA OUTPUT TRANSFORM_METHOD
Options
-s, --skip <skip>
    Variables to skip. Either individual names, or a file containing one name per line.
-o, --only <only>
    Variables to process, skipping all others. Either individual names, or a file containing one name per line.
Arguments
DATA
    Required argument
OUTPUT
    Required argument
TRANSFORM_METHOD
    Required argument
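A transform method is applied element-wise to the selected variable. A sketch of what a log transform would look like in pandas (the variable name is hypothetical; whether "log" is an available TRANSFORM_METHOD should be checked against the package docs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"triglycerides": [50.0, 100.0, 400.0]})

# Apply a natural-log transform to every value of the variable
df["triglycerides"] = np.log(df["triglycerides"])
print(df["triglycerides"].tolist())
```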
clarite-cli plot¶
clarite-cli plot [OPTIONS] COMMAND [ARGS]...
distributions¶
Generate a pdf containing distribution plots for each variable
clarite-cli plot distributions [OPTIONS] DATA OUTPUT
Options
-k, --kind <kind>
    Kind of plot used for continuous data. Non-continuous always shows a count plot. Options: count|box|violin|qq
--nrows <nrows>
    Number of rows per page
--ncols <ncols>
    Number of columns per page
-q, --quality <quality>
    Quality of the generated plots: low (150 dpi), medium (300 dpi), or high (1200 dpi). Options: low|medium|high
--sort, --no-sort
    Sort variables alphabetically
Arguments
DATA
    Required argument
OUTPUT
    Required argument
histogram¶
Create a histogram plot of a variable
clarite-cli plot histogram [OPTIONS] DATA OUTPUT VARIABLE
Arguments
DATA
    Required argument
OUTPUT
    Required argument
VARIABLE
    Required argument
manhattan¶
Generate a manhattan plot of EWAS results
clarite-cli plot manhattan [OPTIONS] EWAS_RESULT OUTPUT
Options
-c, --categories <categories>
    Tab-separated file with two columns: ‘Variable’ and ‘category’
--bonferroni <bonferroni>
    Cutoff value to plot the Bonferroni-adjusted pvalue line
--fdr <fdr>
    Cutoff value to plot the FDR-adjusted pvalue line
-o, --other <other>
    Other datasets to include in the plot
--nlabeled <nlabeled>
    Label the top n points
--label <label>
    Label points by name
Arguments
EWAS_RESULT
    Required argument
OUTPUT
    Required argument
manhattan-bonferroni¶
Generate a manhattan plot of EWAS results showing Bonferroni-corrected pvalues
clarite-cli plot manhattan-bonferroni [OPTIONS] EWAS_RESULT OUTPUT
Options
-c, --categories <categories>
    Tab-separated file with two columns: ‘Variable’ and ‘category’
--cutoff <cutoff>
    Cutoff value for plotting the significance line
--fdr <fdr>
    Cutoff value to plot the Bonferroni-adjusted pvalue line
-o, --other <other>
    Other datasets to include in the plot
--nlabeled <nlabeled>
    Label the top n points
--label <label>
    Label points by name
Arguments
EWAS_RESULT
    Required argument
OUTPUT
    Required argument
manhattan-fdr¶
Generate a manhattan plot of EWAS results showing FDR-corrected pvalues
clarite-cli plot manhattan-fdr [OPTIONS] EWAS_RESULT OUTPUT
Options
-c, --categories <categories>
    Tab-separated file with two columns: ‘Variable’ and ‘category’
--cutoff <cutoff>
    Cutoff value for plotting the significance line
--fdr <fdr>
    Cutoff value to plot the FDR-adjusted pvalue line
-o, --other <other>
    Other datasets to include in the plot
--nlabeled <nlabeled>
    Label the top n points
--label <label>
    Label points by name
Arguments
EWAS_RESULT
    Required argument
OUTPUT
    Required argument
Additional Notes¶
Release History, etc
Release History¶
v1.1.0 (2020-08-14)¶
Enhancements¶
- Add a subset method on the SurveyDesignSpec class
- Refactored regression so that the ewas function now takes a regression_kind parameter
Tests¶
- Added tests for the subset method
v1.0.1 (2020-06-12)¶
Enhancements¶
- Improve the legend in the top_results plot and add additional parameters similar to the manhattan plots
Fixes¶
- Update the default names for the ewas parameter single_cluster in the CLI
- Add the “drop_unweighted” parameter to the printed result of Survey Designs
- Fix an IndexError caused by non-continuous variables being passed to describe.skewness
- Fix the travis build (the bioconda channel must be specified to install r-survey)
Tests¶
- Added a plot test for passing “None” as the cutoff to the top results plot
v1.0.0 (2020-06-04)¶
Fixes¶
- Fixed ewas_r not working for some parameter combinations
- Improved the top_results plot to work with non-continuous values (which don’t have Betas)
- Corrected ewas results for some scenarios (strata and clusters) related to missing data (incorrect degrees of freedom)
Tests¶
- Added additional analysis tests with realistic data (more missing values)
- All analysis tests are now passing with 1E-4 relative tolerance
- Added the first plot tests
v0.10.0 (2020-05-28)¶
Enhancements¶
- Manhattan plot split into three functions (raw, bonferroni, and fdr) and now has a custom threshold parameter
- Use Pandas v1.0+
- Refactored regression objects to simplify internal code and potentially allow for more types of regression in the future
- Added an ewas_r function that seamlessly runs the ewas analysis in R, using the R survey library. This is recommended when using weights, as the python version has some inconsistencies in some edge cases.
- Added a skewness function
- Added a top_results plot
- Add a drop_unweighted parameter to the SurveyDesignSpec to provide an easy (if potentially incorrect) workaround for observations with missing weights
Fixes¶
- Provide a warning and a convenience function when categorical types have categories with no occurrences
- Catch errors when categorizing variables with many unique string values
- Corrected some edge-case EWAS results when using weights in the presence of missing values
- Avoid some cryptic errors by ensuring the input to some functions is a DataFrame and not a Series
Tests¶
Many additional tests were added, especially related to EWAS
v0.9.1 (2019-11-20)¶
Minor documentation update
v0.9.0 (2019-10-31)¶
Enhancements¶
- Add a figure parameter to histogram and manhattan plots in order to plot to an existing figure
- SurveyDesignSpec can now utilize more parameters, such as fpc
- The larger (numeric or alphabetic) binary variable is always treated as the success case for binary phenotypes
- Improved logging during EWAS, including printing the survey design information
- Extensively updated documentation
- CLARITE now has a logo!
Fixes¶
- Corrected an indexing error that sometimes occurred when removing rows with missing weights
- Improve precision in EWAS results for weighted analyses by using sf instead of 1-cdf
- Change some column names in the EWAS output to be more clear
Tests¶
An R script and the output of that script are now included. The R output is compared to the python output in the test suite in order to ensure analysis result concordance between R and Python for several analysis scenarios.
v0.8.0 (2019-09-03)¶
Enhancements¶
- Allow file input in the command line for skip/only
- Make the manhattan plot function less restrictive of the data passed into it
- Use skip/only in the transform function
Fixes¶
- Categorization would silently fail if there was only one variable of a given type
v0.7.0 (2019-07-23)¶
Enhancements¶
- Improvements to the CLI and printed log messages.
- The functions from the ‘Process’ module were put into the ‘Modify’ module.
- Datasets are no longer split apart when categorizing.
v0.6.0 (2019-07-11)¶
Extensive changes in organization, but limited new functionality (not counting the CLI).
Enhancements¶
- Reorganize functions - https://github.com/HallLab/clarite-python/pull/13
- Add a CLI - https://github.com/HallLab/clarite-python/pull/11
v0.5.0 (2019-06-28)¶
Enhancements¶
- Added a function to recode values - https://github.com/HallLab/clarite-python/issues/4
- Added a function to filter outlier values - https://github.com/HallLab/clarite-python/issues/5
- Added a function to generate manhattan plots for multiple datasets together - https://github.com/HallLab/clarite-python/issues/9
Fixes¶
- Add some validation of input DataFrames to prevent some errors in calculations
Tests¶
- Added an initial batch of tests
v0.4.0 (2019-06-18)¶
Support EWAS with binary outcomes. Additional handling of NA values in covariates and the phenotype. Add a ‘min_n’ parameter to the ewas function to require a minimum number of observations after removing incomplete cases. Add additional functions including ‘plot_distributions’, ‘merge_variables’, ‘get_correlations’, ‘get_freq_table’, and ‘get_percent_na’
v0.3.0 (2019-05-31)¶
Add support for complex survey designs
v0.2.1 (2019-05-02)¶
Added documentation for existing functions
v0.2.0 (2019-04-30)¶
First functional version. Multiple methods are available under a ‘clarite’ Pandas accessor.
v0.1.0 (2019-04-23)¶
Initial Release