Example Analysis

CLARITE facilitates the quality control and analysis process for EWAS of metabolic-related traits

[Paper in review]

Data from NHANES were used in an EWAS analysis that incorporates the survey weight information provided by NHANES. The first two cycles of NHANES (1999-2000 and 2001-2002) are assigned to a ‘discovery’ dataset and the next two cycles (2003-2004 and 2005-2006) to a ‘replication’ dataset.

import pandas as pd
import numpy as np
from scipy import stats
import clarite
pd.options.display.max_rows = 10
pd.options.display.max_columns = 6

Load Data

data_folder = "../../../../data/NHANES_99-06/"
data_main_table_over18 = data_folder + "MainTable_keepvar_over18.tsv"
data_main_table = data_folder + "MainTable.csv"
data_var_description = data_folder + "VarDescription.csv"
data_var_categories = data_folder + "VarCat_nopf.txt"
output = "."

Data for all samples with age >= 18

# Data
nhanes = clarite.load.from_tsv(data_main_table_over18, index_col="ID")
nhanes.head()
Loaded 22,624 observations of 970 variables
RIDAGEYR female black ... LBXV4E LBXVTE occupation
ID
2 77 0 0 ... NaN NaN 1.0
5 49 0 0 ... NaN NaN NaN
6 19 1 0 ... NaN NaN 2.0
7 59 1 1 ... NaN NaN NaN
10 43 0 1 ... NaN NaN 4.0

5 rows × 970 columns

Variable Descriptions

var_descriptions = pd.read_csv(data_var_description)[["tab_desc","module","var","var_desc"]]\
                     .drop_duplicates()\
                     .set_index("var")
var_descriptions.head()
tab_desc module var_desc
var
LBXHBC Hepatitis A, B, C and D laboratory Hepatitis B core antibody
LBDHBG Hepatitis A, B, C and D laboratory Hepatitis B surface antigen
LBDHCV Hepatitis A, B, C and D laboratory Hepatitis C antibody (confirmed)
LBDHD Hepatitis A, B, C and D laboratory Hepatitis D (anti-HDV)
LBXHBS Hepatitis B Surface Antibody laboratory Hepatitis B Surface Antibody
# Convert variable descriptions to a dictionary for convenience
var_descr_dict = var_descriptions["var_desc"].to_dict()

Survey Weights, as provided by NHANES

Survey weight information is used so that the results are representative of the US civilian non-institutionalized population.

This includes:

  • SDMVPSU (Cluster ID)
  • SDMVSTRA (Nested Strata ID)
  • 2-year weights
  • 4-year weights

Different variables require different weights, as many of them were measured on a subset of the full dataset. For example:

  • WTINT is the survey weight for interview variables.
  • WTMEC is the survey weight for variables measured in the Mobile Exam Centers (a subset of the interviewed samples).

Both 2-year and 4-year weights are provided. When combining multiple cycles, the weights must be adjusted by computing a weighted average. In this case NHANES provides 4-year weights covering the first two cycles, and the replication weights (for the 3rd and 4th cycles) were computed from the 2-year weights prior to loading them here.
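As an illustration of that adjustment (a minimal sketch with invented SEQN IDs and weight values, not the actual NHANES files): when two 2-year cycles are pooled, each cycle represents half of the combined 4-year period, so each sample's 2-year weight is divided by 2.

```python
import pandas as pd

# Invented example weights for two samples from each of the two pooled cycles
wt_2003_2004 = pd.Series({21005: 2756.16, 21006: 2711.07})
wt_2005_2006 = pd.Series({31128: 9000.00, 31129: 4500.00})

# Weighted average across cycles: each 2-year weight covers half of the
# 4-year period, so the combined weight is WT2YR / 2
wt_4yr = pd.concat([wt_2003_2004, wt_2005_2006]) / 2
```

For the discovery dataset this step is unnecessary, since NHANES publishes 4-year weights covering 1999-2002 directly.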

survey_design_discovery = pd.read_csv(data_folder + "weights/weights_discovery.txt", sep="\t")\
                            .rename(columns={'SEQN':'ID'})\
                            .set_index("ID")\
                            .drop(columns="SDDSRVYR")
survey_design_discovery.head()
SDMVPSU SDMVSTRA WTINT2YR ... WTSVOC2Y WTSAU2YR WTUIO2YR
ID
1 1 5 9727.078709 ... NaN NaN NaN
2 3 1 26678.636376 ... NaN NaN NaN
3 2 7 43621.680548 ... NaN NaN NaN
4 1 2 10346.119327 ... NaN NaN NaN
5 2 8 91050.846620 ... NaN NaN NaN

5 rows × 35 columns

survey_design_replication = pd.read_csv(data_folder + "weights/weights_replication_4yr.txt", sep="\t")\
                            .rename(columns={'SEQN':'ID'})\
                            .set_index("ID")\
                            .drop(columns="SDDSRVYR")
survey_design_replication.head()
SDMVPSU SDMVSTRA WTINT2YR ... WTSOG2YR WTSC2YRA WTSPC2YR
ID
21005 2 39 2756.160474 ... NaN NaN NaN
21006 1 41 2711.070226 ... NaN NaN NaN
21007 2 35 19882.088706 ... NaN NaN NaN
21008 1 32 2799.749676 ... NaN NaN NaN
21009 2 31 48796.839489 ... NaN NaN NaN

5 rows × 23 columns

# These files map variables to their correct weights, and were compiled by reading through the NHANES codebook
var_weights = pd.read_csv(data_folder + "weights/VarWeights.csv")
var_weights.head()
variable_name discovery replication
0 99999 WTMEC4YR WTMEC2YR
1 ACETAMINOPHEN__CODEINE WTMEC4YR WTMEC2YR
2 ACETAMINOPHEN__CODEINE_PHOSPHATE WTMEC4YR WTMEC2YR
3 ACETAMINOPHEN__HYDROCODONE WTMEC4YR WTMEC2YR
4 ACETAMINOPHEN__HYDROCODONE_BITARTRATE WTMEC4YR WTMEC2YR
# Convert the data to two dictionaries for convenience
weights_discovery = var_weights.set_index('variable_name')['discovery'].to_dict()
weights_replication = var_weights.set_index('variable_name')['replication'].to_dict()

Survey Year data

Survey year is found in a separate file and can be matched using the SEQN ID value.

survey_year = pd.read_csv(data_main_table)[["SEQN", "SDDSRVYR"]].rename(columns={'SEQN':'ID'}).set_index("ID")
nhanes = clarite.modify.merge_variables(nhanes, survey_year, how="left")
================================================================================
Running merge_variables
--------------------------------------------------------------------------------
left Merge:
    left = 22,624 observations of 970 variables
    right = 41,474 observations of 1 variables
Kept 22,624 observations of 971 variables.
================================================================================

Define the phenotype and covariates

phenotype = "BMXBMI"
print(f"{phenotype} = {var_descriptions.loc[phenotype, 'var_desc']}")
covariates = ["female", "black", "mexican", "other_hispanic", "other_eth", "SES_LEVEL", "RIDAGEYR", "SDDSRVYR"]
BMXBMI = Body Mass Index (kg/m**2)

Initial cleanup / variable selection

Remove any samples missing the phenotype or one of the covariates

nhanes = clarite.modify.rowfilter_incomplete_obs(nhanes, only=[phenotype] + covariates)
================================================================================
Running rowfilter_incomplete_obs
--------------------------------------------------------------------------------
Removed 3,687 of 22,624 observations (16.30%) due to NA values in any of 9 variables
================================================================================

Remove variables that aren’t appropriate for the analysis

Physical fitness measures

These are measurements rather than proxies for environmental exposures

phys_fitness_vars = ["CVDVOMAX","CVDESVO2","CVDS1HR","CVDS1SY","CVDS1DI","CVDS2HR","CVDS2SY","CVDS2DI","CVDR1HR","CVDR1SY","CVDR1DI","CVDR2HR","CVDR2SY","CVDR2DI","physical_activity"]
for v in phys_fitness_vars:
    print(f"\t{v} = {var_descr_dict[v]}")
nhanes = nhanes.drop(columns=phys_fitness_vars)
CVDVOMAX = Predicted VO2max (ml/kg/min)
CVDESVO2 = Estimated VO2max (ml/kg/min)
CVDS1HR = Stage 1 heart rate (per min)
CVDS1SY = Stage 1 systolic BP (mm Hg)
CVDS1DI = Stage 1 diastolic BP (mm Hg)
CVDS2HR = Stage 2 heart rate (per min)
CVDS2SY = Stage 2 systolic BP (mm Hg)
CVDS2DI = Stage 2 diastolic BP (mm Hg)
CVDR1HR = Recovery 1 heart rate (per min)
CVDR1SY = Recovery 1 systolic BP (mm Hg)
CVDR1DI = Recovery 1 diastolic BP (mm Hg)
CVDR2HR = Recovery 2 heart rate (per min)
CVDR2SY = Recovery 2 systolic BP (mm Hg)
CVDR2DI = Recovery 2 diastolic BP (mm Hg)
physical_activity = Physical Activity (MET-based rank)

Lipid variables

These lipid measurements are likely correlated with BMI and are therefore removed

lipid_vars = ["LBDHDD", "LBDHDL", "LBDLDL", "LBXSTR", "LBXTC", "LBXTR"]
print("Removing lipid measurement variables:")
for v in lipid_vars:
    print(f"\t{v} = {var_descr_dict[v]}")
nhanes = nhanes.drop(columns=lipid_vars)
Removing lipid measurement variables:
    LBDHDD = Direct HDL-Cholesterol (mg/dL)
    LBDHDL = Direct HDL-Cholesterol (mg/dL)
    LBDLDL = LDL-cholesterol (mg/dL)
    LBXSTR = Triglycerides (mg/dL)
    LBXTC = Total cholesterol (mg/dL)
    LBXTR = Triglyceride (mg/dL)

Indeterminate variables

These variables don’t have clear meanings

indeterminate_vars = ["house_type", "hepa", "hepb", "house_age", "current_past_smoking"]
print("Removing variables with indeterminate meanings:")
for v in indeterminate_vars:
    print(f"\t{v} = {var_descr_dict[v]}")
nhanes = nhanes.drop(columns=indeterminate_vars)
Removing variables with indeterminate meanings:
    house_type = house type
    hepa = hepatitis a
    hepb = hepatitis b
    house_age = house age
    current_past_smoking = Current or Past Cigarette Smoker?

Recode “missing” values

# SMQ077 and DBD100 use 7 ("Refused") and 9 ("Don't Know") as missing-value codes
nhanes = clarite.modify.recode_values(nhanes, {7: np.nan, 9: np.nan}, only=['SMQ077', 'DBD100'])
================================================================================
Running recode_values
--------------------------------------------------------------------------------
Replaced 11 values from 18,937 observations in 2 variables
================================================================================

Split the data into discovery and replication

discovery = (nhanes['SDDSRVYR']==1) | (nhanes['SDDSRVYR']==2)
replication = (nhanes['SDDSRVYR']==3) | (nhanes['SDDSRVYR']==4)

nhanes_discovery = nhanes.loc[discovery]
nhanes_replication = nhanes.loc[replication]
nhanes_discovery.head()
RIDAGEYR female black ... LBXVTE occupation SDDSRVYR
ID
2 77 0 0 ... NaN 1.0 1
5 49 0 0 ... NaN NaN 1
6 19 1 0 ... NaN 2.0 1
12 37 0 0 ... NaN 4.0 1
13 70 0 0 ... NaN 4.0 1

5 rows × 945 columns

nhanes_replication.head()
RIDAGEYR female black ... LBXVTE occupation SDDSRVYR
ID
21005 19 0 1 ... NaN 4.0 3
21009 55 0 0 ... NaN 4.0 3
21010 52 1 0 ... NaN 2.0 3
21012 63 0 1 ... NaN 1.0 3
21015 83 0 0 ... NaN 1.0 3

5 rows × 945 columns

QC

Minimum of 200 non-NA values in each variable

Drop variables that have too small of a sample size

nhanes_discovery = clarite.modify.colfilter_min_n(nhanes_discovery, skip=[phenotype] + covariates)
nhanes_replication = clarite.modify.colfilter_min_n(nhanes_replication, skip=[phenotype] + covariates)
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 0 of 0 binary variables
Testing 0 of 0 categorical variables
Testing 936 of 945 continuous variables
    Removed 302 (32.26%) tested continuous variables which had less than 200 non-null values.
================================================================================
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 0 of 0 binary variables
Testing 0 of 0 categorical variables
Testing 936 of 945 continuous variables
    Removed 225 (24.04%) tested continuous variables which had less than 200 non-null values.
================================================================================

Categorize Variables

This is important, as different variable types must be processed in different ways. The number of unique values for each variable is a good heuristic for determining this. The default settings were used here, but different cutoffs can be specified. CLARITE reports the results in neatly formatted text:
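As a rough sketch of the unique-value heuristic (this toy function is not CLARITE's implementation; the cutoffs simply mirror the defaults reported in the output below):

```python
import pandas as pd

def guess_type(series: pd.Series, cat_max: int = 6, cont_min: int = 15) -> str:
    """Classify a variable by its number of unique non-NA values."""
    n = series.nunique(dropna=True)
    if n <= 1:
        return "drop"         # all-NA or constant
    elif n == 2:
        return "binary"
    elif n <= cat_max:
        return "categorical"
    elif n >= cont_min:
        return "continuous"
    else:
        return "unknown"      # in between: must be set manually

guess_type(pd.Series([0, 1, 0, 1]))   # 'binary'
```

Variables landing in the "unknown" band (between 6 and 15 unique values by default) are the ones flagged for manual categorization.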

nhanes_discovery = clarite.modify.categorize(nhanes_discovery)
nhanes_replication = clarite.modify.categorize(nhanes_replication)
================================================================================
Running categorize
--------------------------------------------------------------------------------
229 of 643 variables (35.61%) are classified as binary (2 unique values).
19 of 643 variables (2.95%) are classified as categorical (3 to 6 unique values).
336 of 643 variables (52.26%) are classified as continuous (>= 15 unique values).
37 of 643 variables (5.75%) were dropped.
    0 variables had zero unique values (all NA).
    37 variables had one unique value.
22 of 643 variables (3.42%) were not categorized and need to be set manually.
    22 variables had between 6 and 15 unique values
    0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
================================================================================
================================================================================
Running categorize
--------------------------------------------------------------------------------
236 of 720 variables (32.78%) are classified as binary (2 unique values).
32 of 720 variables (4.44%) are classified as categorical (3 to 6 unique values).
400 of 720 variables (55.56%) are classified as continuous (>= 15 unique values).
13 of 720 variables (1.81%) were dropped.
    0 variables had zero unique values (all NA).
    13 variables had one unique value.
39 of 720 variables (5.42%) were not categorized and need to be set manually.
    39 variables had between 6 and 15 unique values
    0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
================================================================================

Checking categorization

Distributions of variables may be plotted using CLARITE:

clarite.plot.distributions(nhanes_discovery,
                           filename="discovery_distributions.pdf",
                           continuous_kind='count',
                           nrows=4,
                           ncols=3,
                           quality='medium')

One variable, which the heuristic misclassified, needed to be corrected manually

v = "L_GLUTAMINE_gm"
print(f"\t{v} = {var_descr_dict[v]}\n")
nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=[v])
nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=[v])
    L_GLUTAMINE_gm = L_GLUTAMINE_gm

================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 1 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 1 of 707 variable(s) as continuous, each with 9,874 observations
================================================================================

Examining the remaining uncategorized variables shows that they are all continuous

discovery_types = clarite.describe.get_types(nhanes_discovery)
discovery_unknown = discovery_types[discovery_types == 'unknown'].index
for v in list(discovery_unknown):
    print(f"\t{v} = {var_descr_dict[v]}")
nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=discovery_unknown)
WARNING: 22 variables need to be categorized into a type manually
    URXUBE = Beryllium, urine (ug/L)
    URXUPT = Platinum, urine (ug/L)
    DRD350BQ = # of times crabs eaten in past 30 days
    DRD350FQ = # of times oysters eaten in past 30 days
    DRD350IQ = # of times other shellfish eaten
    DRD370AQ = # of times breaded fish products eaten
    DRD370DQ = # of times catfish eaten in past 30 days
    DRD370EQ = # of times cod eaten in past 30 days
    DRD370FQ = # of times flatfish eaten past 30 days
    DRD370UQ = # of times other unknown fish eaten
    OMEGA_3_FATTY_ACIDS_mg = OMEGA_3_FATTY_ACIDS_mg
    ALANINE_mg = ALANINE_mg
    ARGININE_mg = ARGININE_mg
    BETA_CAROTENE_mg = BETA_CAROTENE_mg
    CAFFEINE_mg = CAFFEINE_mg
    CYSTINE_mg = CYSTINE_mg
    LYSINE_mg = LYSINE_mg
    PROLINE_mg = PROLINE_mg
    SERINE_mg = SERINE_mg
    TRYPTOPHAN_mg = TRYPTOPHAN_mg
    TYROSINE_mg = TYROSINE_mg
    OTHER_FATTY_ACIDS_mg = OTHER_FATTY_ACIDS_mg
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 22 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================
replication_types = clarite.describe.get_types(nhanes_replication)
replication_unknown = replication_types[replication_types == 'unknown'].index
for v in list(replication_unknown):
    print(f"\t{v} = {var_descr_dict[v]}")
nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=replication_unknown)
WARNING: 39 variables need to be categorized into a type manually
    LBXVCT = Blood Carbon Tetrachloride (ng/ml)
    LBXV3A = Blood 1,1,1-Trichloroethene (ng/ml)
    URXUBE = Beryllium, urine (ug/L)
    LBXTO2 = Toxoplasma (IgM)
    LBXPFDO = Perfluorododecanoic acid
    DRD350AQ = # of times clams eaten in past 30 days
    DRD350BQ = # of times crabs eaten in past 30 days
    DRD350DQ = # of times lobsters eaten past 30 days
    DRD350FQ = # of times oysters eaten in past 30 days
    DRD350GQ = # of times scallops eaten past 30 days
    DRD370AQ = # of times breaded fish products eaten
    DRD370DQ = # of times catfish eaten in past 30 days
    DRD370EQ = # of times cod eaten in past 30 days
    DRD370FQ = # of times flatfish eaten past 30 days
    DRD370GQ = # of times haddock eaten in past 30 days
    DRD370NQ = # of times sardines eaten past 30 days
    DRD370RQ = # of times trout eaten in past 30 days
    DRD370UQ = # of times other unknown fish eaten
    ALANINE_mg = ALANINE_mg
    ARGININE_mg = ARGININE_mg
    BETA_CAROTENE_mg = BETA_CAROTENE_mg
    CAFFEINE_mg = CAFFEINE_mg
    CYSTINE_mg = CYSTINE_mg
    HISTIDINE_mg = HISTIDINE_mg
    ISOLEUCINE_mg = ISOLEUCINE_mg
    LEUCINE_mg = LEUCINE_mg
    LYSINE_mg = LYSINE_mg
    PHENYLALANINE_mg = PHENYLALANINE_mg
    PROLINE_mg = PROLINE_mg
    SERINE_mg = SERINE_mg
    THREONINE_mg = THREONINE_mg
    TRYPTOPHAN_mg = TRYPTOPHAN_mg
    TYROSINE_mg = TYROSINE_mg
    VALINE_mg = VALINE_mg
    LBXV2T = Blood trans-1,2-Dichloroethene (ng/mL)
    LBXV4T = Blood 1,1,2,2-Tetrachloroethane (ng/mL)
    LBXVDM = Blood Dibromomethane (ng/mL)
    URXUTM = Urinary Trimethylarsine Oxide (ug/L)
    LBXPFBS = Perfluorobutane sulfonic acid
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 39 of 707 variable(s) as continuous, each with 9,874 observations
================================================================================

Types should match across discovery/replication

# Take note of which variables were differently typed in each dataset
print("Correcting differences in variable types between discovery and replication")
# Merge current type series
dtypes = pd.DataFrame({'discovery':clarite.describe.get_types(nhanes_discovery),
                       'replication':clarite.describe.get_types(nhanes_replication)
                       })
diff_dtypes = dtypes.loc[(dtypes['discovery'] != dtypes['replication']) &
                         (~dtypes['discovery'].isna()) &
                         (~dtypes['replication'].isna())]

# Discovery

# Binary -> Categorical
compare_bin_cat = list(diff_dtypes.loc[(diff_dtypes['discovery']=='binary') &
                                       (diff_dtypes['replication']=='categorical'),].index)
if len(compare_bin_cat) > 0:
    print(f"Bin vs Cat: {', '.join(compare_bin_cat)}")
    nhanes_discovery = clarite.modify.make_categorical(nhanes_discovery, only=compare_bin_cat)
    print()
# Binary -> Continuous
compare_bin_cont = list(diff_dtypes.loc[(diff_dtypes['discovery']=='binary') &
                                        (diff_dtypes['replication']=='continuous'),].index)
if len(compare_bin_cont) > 0:
    print(f"Bin vs Cont: {', '.join(compare_bin_cont)}")
    nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=compare_bin_cont)
    print()
# Categorical -> Continuous
compare_cat_cont = list(diff_dtypes.loc[(diff_dtypes['discovery']=='categorical') &
                                        (diff_dtypes['replication']=='continuous'),].index)
if len(compare_cat_cont) > 0:
    print(f"Cat vs Cont: {', '.join(compare_cat_cont)}")
    nhanes_discovery = clarite.modify.make_continuous(nhanes_discovery, only=compare_cat_cont)
    print()

# Replication

# Binary -> Categorical
compare_cat_bin = list(diff_dtypes.loc[(diff_dtypes['discovery']=='categorical') &
                                       (diff_dtypes['replication']=='binary'),].index)
if len(compare_cat_bin) > 0:
    print(f"Cat vs Bin: {', '.join(compare_cat_bin)}")
    nhanes_replication = clarite.modify.make_categorical(nhanes_replication, only=compare_cat_bin)
    print()
# Binary -> Continuous
compare_cont_bin = list(diff_dtypes.loc[(diff_dtypes['discovery']=='continuous') &
                                        (diff_dtypes['replication']=='binary'),].index)
if len(compare_cont_bin) > 0:
    print(f"Cont vs Bin: {', '.join(compare_cont_bin)}")
    nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=compare_cont_bin)
    print()
# Categorical -> Continuous
compare_cont_cat = list(diff_dtypes.loc[(diff_dtypes['discovery']=='continuous') &
                                        (diff_dtypes['replication']=='categorical'),].index)
if len(compare_cont_cat) > 0:
    print(f"Cont vs Cat: {', '.join(compare_cont_cat)}")
    nhanes_replication = clarite.modify.make_continuous(nhanes_replication, only=compare_cont_cat)
    print()
Correcting differences in variable types between discovery and replication
Bin vs Cat: BETA_CAROTENE_mcg, CALCIUM_Unknown, MAGNESIUM_Unknown
================================================================================
Running make_categorical
--------------------------------------------------------------------------------
Set 3 of 606 variable(s) as categorical, each with 9,063 observations
================================================================================

Bin vs Cont: LBXPFDO
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 1 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================

Cat vs Cont: DRD350AQ, DRD350DQ, DRD350GQ
================================================================================
Running make_continuous
--------------------------------------------------------------------------------
Set 3 of 606 variable(s) as continuous, each with 9,063 observations
================================================================================

Cat vs Bin: VITAMIN_B_12_Unknown
================================================================================
Running make_categorical
--------------------------------------------------------------------------------
Set 1 of 707 variable(s) as categorical, each with 9,874 observations
================================================================================

Filtering

These are a standard set of filters with default settings

# At least 200 non-NA values per variable
discovery_1_min_n = clarite.modify.colfilter_min_n(nhanes_discovery)
replication_1_min_n = clarite.modify.colfilter_min_n(nhanes_replication)
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 228 of 228 binary variables
    Removed 0 (0.00%) tested binary variables which had less than 200 non-null values.
Testing 15 of 15 categorical variables
    Removed 0 (0.00%) tested categorical variables which had less than 200 non-null values.
Testing 363 of 363 continuous variables
    Removed 0 (0.00%) tested continuous variables which had less than 200 non-null values.
================================================================================
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
Testing 236 of 236 binary variables
    Removed 0 (0.00%) tested binary variables which had less than 200 non-null values.
Testing 31 of 31 categorical variables
    Removed 0 (0.00%) tested categorical variables which had less than 200 non-null values.
Testing 440 of 440 continuous variables
    Removed 0 (0.00%) tested continuous variables which had less than 200 non-null values.
================================================================================
# At least 200 samples in each category
discovery_2_min_cat_n = clarite.modify.colfilter_min_cat_n(discovery_1_min_n, skip=[c for c in covariates + [phenotype] if c in discovery_1_min_n.columns])
replication_2_min_cat_n = clarite.modify.colfilter_min_cat_n(replication_1_min_n, skip=[c for c in covariates + [phenotype] if c in replication_1_min_n.columns])
================================================================================
Running colfilter_min_cat_n
--------------------------------------------------------------------------------
Testing 222 of 228 binary variables
    Removed 162 (72.97%) tested binary variables which had a category with less than 200 values.
Testing 14 of 15 categorical variables
    Removed 10 (71.43%) tested categorical variables which had a category with less than 200 values.
================================================================================
================================================================================
Running colfilter_min_cat_n
--------------------------------------------------------------------------------
Testing 230 of 236 binary variables
    Removed 154 (66.96%) tested binary variables which had a category with less than 200 values.
Testing 30 of 31 categorical variables
    Removed 25 (83.33%) tested categorical variables which had a category with less than 200 values.
================================================================================
# Remove continuous variables that are zero in at least 90% of observations
discovery_3_pzero = clarite.modify.colfilter_percent_zero(discovery_2_min_cat_n)
replication_3_pzero = clarite.modify.colfilter_percent_zero(replication_2_min_cat_n)
================================================================================
Running colfilter_percent_zero
--------------------------------------------------------------------------------
Testing 363 of 363 continuous variables
    Removed 28 (7.71%) tested continuous variables which were equal to zero in at least 90.00% of non-NA observations.
================================================================================
================================================================================
Running colfilter_percent_zero
--------------------------------------------------------------------------------
Testing 440 of 440 continuous variables
    Removed 30 (6.82%) tested continuous variables which were equal to zero in at least 90.00% of non-NA observations.
================================================================================
# Remove variables that don't have survey weights assigned
keep = set(weights_discovery.keys()) | set([phenotype] + covariates)
discovery_4_weights = discovery_3_pzero[[c for c in list(discovery_3_pzero) if c in keep]]

keep = set(weights_replication.keys()) | set([phenotype] + covariates)
replication_4_weights = replication_3_pzero[[c for c in list(replication_3_pzero) if c in keep]]

Summarize

# Summarize Results
print("\nDiscovery:")
clarite.describe.summarize(discovery_4_weights)
print('-'*50)
print("Replication:")
clarite.describe.summarize(replication_4_weights)
Discovery:
9,063 observations of 385 variables
    66 Binary Variables
    5 Categorical Variables
    314 Continuous Variables
    0 Unknown-Type Variables

--------------------------------------------------
Replication:
9,874 observations of 428 variables
    77 Binary Variables
    6 Categorical Variables
    345 Continuous Variables
    0 Unknown-Type Variables

Keep only variables that passed QC in both datasets

# Sort the intersection into a list, since indexing a DataFrame with a set is not allowed
both = sorted(set(list(discovery_4_weights)) & set(list(replication_4_weights)))
discovery_final = discovery_4_weights[both]
replication_final = replication_4_weights[both]
print(f"{len(both)} variables in common")
341 variables in common

Checking the phenotype distribution

The phenotype appears to be skewed, so it will need to be corrected. CLARITE makes it easy to plot distributions and to transform variables.

title = f"Discovery: Skew of BMXBMI = {stats.skew(discovery_final['BMXBMI']):.6}"
clarite.plot.histogram(discovery_final, column="BMXBMI", title=title, bins=100)
# Log-transform
discovery_final = clarite.modify.transform(discovery_final, transform_method='log', only='BMXBMI')
# Plot
title = f"Discovery: Skew of BMXBMI after log transform = {stats.skew(discovery_final['BMXBMI']):.6}"
clarite.plot.histogram(discovery_final, column="BMXBMI", title=title, bins=100)
================================================================================
Running transform
--------------------------------------------------------------------------------
Transformed 'BMXBMI' using 'log'
================================================================================
[Histograms of BMXBMI in the discovery dataset, before and after log transform]
title = f"Replication: Skew of BMXBMI = {stats.skew(replication_final['BMXBMI']):.6}"
clarite.plot.histogram(replication_final, column="BMXBMI", title=title, bins=100)
# Log-transform
replication_final = clarite.modify.transform(replication_final, transform_method='log', only='BMXBMI')
# Plot
title = f"Replication: Skew of BMXBMI after log transform = {stats.skew(replication_final['BMXBMI']):.6}"
clarite.plot.histogram(replication_final, column="BMXBMI", title=title, bins=100)
================================================================================
Running transform
--------------------------------------------------------------------------------
Transformed 'BMXBMI' using 'log'
================================================================================
[Histograms of BMXBMI in the replication dataset, before and after log transform]

EWAS

Survey Design Spec

When utilizing survey data, a SurveyDesignSpec object must be created.

sd_discovery = clarite.survey.SurveyDesignSpec(survey_df=survey_design_discovery,
                                               strata="SDMVSTRA",
                                               cluster="SDMVPSU",
                                               nest=True,
                                               weights=weights_discovery,
                                               single_cluster='centered')

EWAS

This can then be passed into the EWAS function:

ewas_discovery = clarite.analyze.ewas(phenotype, covariates, discovery_final, sd_discovery)
Running EWAS on a continuous variable

####### Regressing 280 Continuous Variables #######

WARNING: DRD370UQ - 3 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXVID has non-varying covariates(s): SDDSRVYR
WARNING: URXP24 has non-varying covariates(s): SDDSRVYR
WARNING: age_stopped_birth_control has non-varying covariates(s): female
WARNING: DR1TCHOL - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX206 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TVB1 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXDIE has non-varying covariates(s): SDDSRVYR
WARNING: DRD350BQ - 2 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXLYC has non-varying covariates(s): SDDSRVYR
WARNING: LBXF09 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS160 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVK has non-varying covariates(s): SDDSRVYR
WARNING: DRD350FQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370TQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370EQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS100 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXALD has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCOPP - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXP20 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TSELE - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX151 has non-varying covariates(s): SDDSRVYR
WARNING: LBXLUZ has non-varying covariates(s): SDDSRVYR
WARNING: DR1TLZ has non-varying covariates(s): SDDSRVYR
WARNING: DR1TPHOS - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP204 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXCBC has non-varying covariates(s): SDDSRVYR
WARNING: DR1TPOTA - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVB6 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVB12 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP184 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP182 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TMFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: RHQ556 has non-varying covariates(s): female
WARNING: LBXBEC has non-varying covariates(s): SDDSRVYR
WARNING: DR1TSUGR has non-varying covariates(s): SDDSRVYR
WARNING: URXP02 has non-varying covariates(s): SDDSRVYR
WARNING: DRD370AQ - 2 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXEND has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCRYP has non-varying covariates(s): SDDSRVYR
WARNING: DR1TKCAL - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TFIBE - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TTFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TZINC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX110 has non-varying covariates(s): SDDSRVYR
WARNING: how_long_estrogen has non-varying covariates(s): female
WARNING: LBD199 has non-varying covariates(s): SDDSRVYR
WARNING: URXMHH has non-varying covariates(s): SDDSRVYR
WARNING: DR1TTHEO - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TFDFE has non-varying covariates(s): SDDSRVYR
WARNING: URXOP4 - 403 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350DQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TALCO - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXUHG has non-varying covariates(s): female
WARNING: URXP22 has non-varying covariates(s): SDDSRVYR
WARNING: URXP21 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TSFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350HQ - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP1 - 404 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370BQ - 5 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP2 - 404 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TM201 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TFF has non-varying covariates(s): SDDSRVYR
WARNING: URXMOH has non-varying covariates(s): SDDSRVYR
WARNING: DR1TFA has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS120 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXMNM has non-varying covariates(s): SDDSRVYR
WARNING: LBX195 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TACAR has non-varying covariates(s): SDDSRVYR
WARNING: DRD370FQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TATOC has non-varying covariates(s): SDDSRVYR
WARNING: URXOP3 - 404 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX189 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TP225 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP226 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TP183 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXTHG has non-varying covariates(s): female
WARNING: DR1TBCAR has non-varying covariates(s): SDDSRVYR
WARNING: DRD370MQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TPFAT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS060 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TM161 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXCRY has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCALC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXIHG has non-varying covariates(s): female
WARNING: DR1TM221 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TIRON - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370DQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP5 - 403 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TPROT - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVARA has non-varying covariates(s): SDDSRVYR
WARNING: DR1TCARB - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TMAGN - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TM181 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS140 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX196 has non-varying covariates(s): SDDSRVYR
WARNING: age_started_birth_control has non-varying covariates(s): female
WARNING: URXP01 has non-varying covariates(s): SDDSRVYR
WARNING: LBXD02 has non-varying covariates(s): SDDSRVYR
WARNING: URXMIB has non-varying covariates(s): SDDSRVYR
WARNING: LBX149 has non-varying covariates(s): SDDSRVYR
WARNING: LBXALC has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS180 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TVB2 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TCAFF - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TLYCO has non-varying covariates(s): SDDSRVYR
WARNING: LBX087 has non-varying covariates(s): SDDSRVYR
WARNING: LBXV3A has non-varying covariates(s): SDDSRVYR
WARNING: DR1TP205 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: LBX194 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TNIAC - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXUUR has non-varying covariates(s): SDDSRVYR
WARNING: DRD350AQ - 1 observation(s) with missing, negative, or zero weights were removed
WARNING: URXMC1 has non-varying covariates(s): SDDSRVYR
WARNING: DR1TS040 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: URXOP6 - 403 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TS080 - 14 observation(s) with missing, negative, or zero weights were removed
WARNING: DR1TRET has non-varying covariates(s): SDDSRVYR
WARNING: LBX028 has non-varying covariates(s): SDDSRVYR

####### Regressing 48 Binary Variables #######

WARNING: DRD350A - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350B - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: current_loud_noise - 925 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXBV has non-varying covariates(s): female, SDDSRVYR
WARNING: ordinary_salt - 19 observation(s) with missing, negative, or zero weights were removed
WARNING: ordinary_salt has non-varying covariates(s): SDDSRVYR
WARNING: taking_birth_control has non-varying covariates(s): female
WARNING: LBXMS1 has non-varying covariates(s): SDDSRVYR
WARNING: DRD370A - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370F - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: SXQ280 has non-varying covariates(s): female
WARNING: DRD350F - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350G - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370B - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370U - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370D - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: LBXHBC - 5808 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370T - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD340 - 22 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD350H - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: RHQ540 has non-varying covariates(s): female
WARNING: DRD350D - 6 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370M - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD360 - 21 observation(s) with missing, negative, or zero weights were removed
WARNING: no_salt - 19 observation(s) with missing, negative, or zero weights were removed
WARNING: no_salt has non-varying covariates(s): SDDSRVYR
WARNING: DRD370E - 10 observation(s) with missing, negative, or zero weights were removed
WARNING: RHQ510 has non-varying covariates(s): female

####### Regressing 4 Categorical Variables #######

WARNING: DBD100 - 9 observation(s) with missing, negative, or zero weights were removed
WARNING: DBD100 has non-varying covariates(s): SDDSRVYR
Completed EWAS

A separate function adds multiple-test-corrected p-values (Bonferroni and FDR) to the results.

clarite.analyze.add_corrected_pvalues(ewas_discovery)
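The corrections involved are the standard Bonferroni and Benjamini-Hochberg (FDR) adjustments. As a rough sketch of what these corrections compute, the same adjusted p-values can be produced with statsmodels (the toy p-values here are hypothetical, not taken from the analysis above):

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

# Toy results frame standing in for EWAS output (hypothetical p-values)
results = pd.DataFrame({"pvalue": [0.001, 0.02, 0.04, 0.30, 0.75]})

# Bonferroni: each p-value multiplied by the number of tests (capped at 1)
results["pvalue_bonferroni"] = multipletests(results["pvalue"], method="bonferroni")[1]

# FDR: Benjamini-Hochberg step-up procedure
results["pvalue_fdr"] = multipletests(results["pvalue"], method="fdr_bh")[1]

print(results)
```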

Saving results is straightforward

ewas_discovery.to_csv(output + "/BMI_Discovery_Results.txt", sep="\t")

Selecting top results

Variables with an FDR-corrected p-value less than 0.1 were selected (using standard functionality from the Pandas library, since the EWAS results are simply a Pandas DataFrame).

significant_discovery_variables = ewas_discovery[ewas_discovery['pvalue_fdr']<0.1].index.get_level_values('Variable')
print(f"Using {len(significant_discovery_variables)} variables based on FDR-corrected pvalues from the discovery dataset")
Using 100 variables based on FDR-corrected pvalues from the discovery dataset

Replication

The variables with a low FDR in the discovery dataset were then analyzed in the replication dataset.

Filter out variables

keep_cols = list(significant_discovery_variables) + covariates + [phenotype]
replication_final_sig = clarite.modify.colfilter(replication_final, only=keep_cols)
clarite.describe.summarize(replication_final_sig)
================================================================================
Running colfilter
--------------------------------------------------------------------------------
Keeping 109 of 341 variables:
    19 of 54 binary variables
    3 of 5 categorical variables
    87 of 282 continuous variables
    0 of 0 unknown variables
================================================================================
9,874 observations of 109 variables
    19 Binary Variables
    3 Categorical Variables
    87 Continuous Variables
    0 Unknown-Type Variables

Run Replication EWAS

survey_design_replication
SDMVPSU SDMVSTRA WTINT2YR ... WTSOG2YR WTSC2YRA WTSPC2YR
ID
21005 2 39 2756.160474 ... NaN NaN NaN
21006 1 41 2711.070226 ... NaN NaN NaN
21007 2 35 19882.088706 ... NaN NaN NaN
21008 1 32 2799.749676 ... NaN NaN NaN
21009 2 31 48796.839489 ... NaN NaN NaN
... ... ... ... ... ... ... ...
41470 2 46 8473.426110 ... NaN NaN NaN
41471 1 52 3141.652775 ... 9148.1015 NaN NaN
41472 1 48 33673.789576 ... 99690.8420 NaN 71892.249044
41473 1 55 9956.504488 ... NaN NaN 26257.847868
41474 1 47 3087.275833 ... 9417.3990 NaN NaN

20470 rows × 23 columns

sd_replication = clarite.survey.SurveyDesignSpec(survey_df=survey_design_replication,
                                                 strata="SDMVSTRA",
                                                 cluster="SDMVPSU",
                                                 nest=True,
                                                 weights=weights_replication,
                                                 single_cluster='centered')

ewas_replication = clarite.analyze.ewas(phenotype, covariates, replication_final_sig, sd_replication)
clarite.analyze.add_corrected_pvalues(ewas_replication)
ewas_replication.to_csv(output + "/BMI_Replication_Results.txt", sep="\t")
Running EWAS on a continuous variable

####### Regressing 85 Continuous Variables #######

WARNING: URXP24 has non-varying covariates(s): SDDSRVYR
WARNING: age_stopped_birth_control has non-varying covariates(s): female
WARNING: LBXODT has non-varying covariates(s): SDDSRVYR
WARNING: LBX206 has non-varying covariates(s): SDDSRVYR
WARNING: LBX170 has non-varying covariates(s): SDDSRVYR
WARNING: LBX099 has non-varying covariates(s): SDDSRVYR
WARNING: URXP20 has non-varying covariates(s): SDDSRVYR
WARNING: LBX156 has non-varying covariates(s): SDDSRVYR
WARNING: URXP11 has non-varying covariates(s): SDDSRVYR
WARNING: LBX118 has non-varying covariates(s): SDDSRVYR
WARNING: LBX153 has non-varying covariates(s): SDDSRVYR
WARNING: LBXD05 has non-varying covariates(s): SDDSRVYR
WARNING: LBD199 has non-varying covariates(s): SDDSRVYR
WARNING: LBXHPE has non-varying covariates(s): SDDSRVYR
WARNING: URXOP1 has non-varying covariates(s): SDDSRVYR
WARNING: URXP15 has non-varying covariates(s): SDDSRVYR
WARNING: LBXMIR has non-varying covariates(s): SDDSRVYR
WARNING: URXOP3 has non-varying covariates(s): SDDSRVYR
WARNING: LBXHXC has non-varying covariates(s): SDDSRVYR
WARNING: LBXME has non-varying covariates(s): SDDSRVYR
WARNING: LBX180 has non-varying covariates(s): SDDSRVYR
WARNING: LBX196 has non-varying covariates(s): SDDSRVYR
WARNING: age_started_birth_control has non-varying covariates(s): female
WARNING: LBXF04 has non-varying covariates(s): SDDSRVYR
WARNING: URXP03 has non-varying covariates(s): SDDSRVYR
WARNING: LBXIRN has non-varying covariates(s): female
WARNING: LBX194 has non-varying covariates(s): SDDSRVYR
WARNING: DUQ110 has non-varying covariates(s): SDDSRVYR

####### Regressing 13 Binary Variables #######

WARNING: DUQ100 has non-varying covariates(s): SDDSRVYR
WARNING: LBXHBC - 6318 observation(s) with missing, negative, or zero weights were removed
WARNING: SMQ210 has non-varying covariates(s): SDDSRVYR
WARNING: ever_loud_noise_gt3 has non-varying covariates(s): SDDSRVYR
WARNING: ever_loud_noise_gt3_2 has non-varying covariates(s): SDDSRVYR
WARNING: DRD370M - 19 observation(s) with missing, negative, or zero weights were removed
WARNING: DRD370E - 19 observation(s) with missing, negative, or zero weights were removed

####### Regressing 2 Categorical Variables #######

Completed EWAS
Compare Results

# Combine results
ewas_keep_cols = ['pvalue', 'pvalue_bonferroni', 'pvalue_fdr']
combined = pd.merge(ewas_discovery[['Variable_type'] + ewas_keep_cols],
                    ewas_replication[ewas_keep_cols],
                    left_index=True, right_index=True, suffixes=("_disc", "_repl"))

# FDR < 0.1 in both
fdr_significant = combined.loc[(combined['pvalue_fdr_disc'] <= 0.1) & (combined['pvalue_fdr_repl'] <= 0.1)]
fdr_significant = fdr_significant.assign(m=fdr_significant[['pvalue_fdr_disc', 'pvalue_fdr_repl']].mean(axis=1))\
                                 .sort_values('m').drop('m', axis=1)
fdr_significant.to_csv(output + "/Significant_Results_FDR_0.1.txt", sep="\t")
print(f"{len(fdr_significant)} variables had FDR < 0.1 in both discovery and replication")

# Bonferroni < 0.05 in both
bonf_significant05 = combined.loc[(combined['pvalue_bonferroni_disc'] <= 0.05) & (combined['pvalue_bonferroni_repl'] <= 0.05)]
bonf_significant05 = bonf_significant05.assign(m=bonf_significant05[['pvalue_bonferroni_disc', 'pvalue_bonferroni_repl']].mean(axis=1))\
                                       .sort_values('m').drop('m', axis=1)
bonf_significant05.to_csv(output + "/Significant_Results_Bonferroni_0.05.txt", sep="\t")
print(f"{len(bonf_significant05)} variables had Bonferroni < 0.05 in both discovery and replication")

# Bonferroni < 0.01 in both
bonf_significant01 = combined.loc[(combined['pvalue_bonferroni_disc'] <= 0.01) & (combined['pvalue_bonferroni_repl'] <= 0.01)]
bonf_significant01 = bonf_significant01.assign(m=bonf_significant01[['pvalue_bonferroni_disc', 'pvalue_bonferroni_repl']].mean(axis=1))\
                                       .sort_values('m').drop('m', axis=1)
bonf_significant01.to_csv(output + "/Significant_Results_Bonferroni_0.01.txt", sep="\t")
print(f"{len(bonf_significant01)} variables had Bonferroni < 0.01 in both discovery and replication")

bonf_significant01.head()
63 variables had FDR < 0.1 in both discovery and replication
16 variables had Bonferroni < 0.05 in both discovery and replication
10 variables had Bonferroni < 0.01 in both discovery and replication
Variable_type pvalue_disc pvalue_bonferroni_disc ... pvalue_repl pvalue_bonferroni_repl pvalue_fdr_repl
Variable Phenotype
LBXGTC BMXBMI continuous 2.611467e-14 8.670071e-12 ... 2.729179e-11 2.729179e-09 4.548631e-10
LBXIRN BMXBMI continuous 3.283440e-11 1.090102e-08 ... 1.748424e-12 1.748424e-10 5.828079e-11
total_days_drink_year BMXBMI continuous 4.562887e-07 1.514879e-04 ... 1.709681e-10 1.709681e-08 2.442402e-09
LBXBEC BMXBMI continuous 8.394013e-07 2.786812e-04 ... 1.689733e-08 1.689733e-06 1.299795e-07
LBXCBC BMXBMI continuous 9.142106e-07 3.035179e-04 ... 1.159283e-09 1.159283e-07 1.288093e-08

5 rows × 7 columns
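The compare-and-filter pattern used above (merge on the index with suffixes, then keep rows passing the threshold in both columns) can be illustrated on toy data; the variable names and p-values here are hypothetical:

```python
import pandas as pd

# Two hypothetical result sets indexed by variable name
disc = pd.DataFrame({"pvalue_fdr": [0.01, 0.5, 0.05]}, index=["A", "B", "C"])
repl = pd.DataFrame({"pvalue_fdr": [0.02, 0.03, 0.8]}, index=["A", "B", "C"])

# Merge on the index, distinguishing the columns with suffixes
combined = pd.merge(disc, repl, left_index=True, right_index=True,
                    suffixes=("_disc", "_repl"))

# Keep only variables significant in both analyses
both = combined.loc[(combined["pvalue_fdr_disc"] <= 0.1) &
                    (combined["pvalue_fdr_repl"] <= 0.1)]
print(list(both.index))  # only "A" passes in both
```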

Manhattan Plots

CLARITE can generate highly customizable Manhattan plots from one or more sets of EWAS results.

data_categories = pd.read_csv(data_var_categories, sep="\t").set_index('Variable')
data_categories.columns = ['category']
data_categories = data_categories['category'].to_dict()

clarite.plot.manhattan({'discovery': ewas_discovery, 'replication': ewas_replication},
                       categories=data_categories, title="Weighted EWAS Results", filename=output + "/ewas_plot.png",
                       figsize=(14, 10))
[Manhattan plot: Weighted EWAS Results, saved to ewas_plot.png]