Modify

Functions used to filter and/or change some data, always taking in one set of data and returning one set of data.

clarite.modify.categorize(data: pandas.core.frame.DataFrame, cat_min: int = 3, cat_max: int = 6, cont_min: int = 15)

Classify variables into constant, binary, categorical, continuous, and ‘unknown’. Drop variables that only have NaN values.

Parameters
data: pd.DataFrame

The DataFrame to be processed

cat_min: int, default 3

Minimum number of unique, non-NA values for a categorical variable

cat_max: int, default 6

Maximum number of unique, non-NA values for a categorical variable

cont_min: int, default 15

Minimum number of unique, non-NA values for a continuous variable

Returns
result: pd.DataFrame or None

If inplace, returns None. Changes the datatypes on the input DataFrame.

Examples

>>> import clarite
>>> clarite.modify.categorize(nhanes)
362 of 970 variables (37.32%) are classified as binary (2 unique values).
47 of 970 variables (4.85%) are classified as categorical (3 to 6 unique values).
483 of 970 variables (49.79%) are classified as continuous (>= 15 unique values).
42 of 970 variables (4.33%) were dropped.
        10 variables had zero unique values (all NA).
        32 variables had one unique value.
36 of 970 variables (3.71%) were not categorized and need to be set manually.
        36 variables had between 6 and 15 unique values
        0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
clarite.modify.colfilter(data, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Remove some variables (skip) or keep only certain variables (only)

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

skip: str, list or None (default is None)

List of variables to remove

only: str, list or None (default is None)

List of variables to keep

Returns
data: pd.DataFrame

The filtered DataFrame

Examples

>>> import clarite
>>> female_logBMI = clarite.modify.colfilter(nhanes, only=['BMXBMI', 'female'])
================================================================================
Running colfilter
--------------------------------------------------------------------------------
Keeping 2 of 945 variables:
        0 of 0 binary variables
        0 of 0 categorical variables
        2 of 945 continuous variables
        0 of 0 unknown variables
================================================================================
clarite.modify.colfilter_percent_zero(data: pandas.core.frame.DataFrame, filter_percent: float = 90.0, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Remove continuous variables which have <proportion> or more values of zero (excluding NA)

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

filter_percent: float, default 90.0

If the percentage of rows in the data with a value of zero is greater than or equal to this value, the variable is filtered out.

skip: str, list or None (default is None)

List of variables that the filter should not be applied to

only: str, list or None (default is None)

List of variables that the filter should only be applied to

Returns
data: pd.DataFrame

The filtered DataFrame

Examples

>>> import clarite
>>> nhanes_filtered = clarite.modify.colfilter_percent_zero(nhanes_filtered)
================================================================================
Running colfilter_percent_zero
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Testing 483 of 483 continuous variables
        Removed 30 (6.21%) tested continuous variables which were equal to zero in at least 90.00% of non-NA observations.
clarite.modify.colfilter_min_n(data: pandas.core.frame.DataFrame, n: int = 200, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Remove variables which have less than <n> non-NA values

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

n: int, default 200

The minimum number of unique values required in order for a variable not to be filtered

skip: str, list or None (default is None)

List of variables that the filter should not be applied to

only: str, list or None (default is None)

List of variables that the filter should only be applied to

Returns
data: pd.DataFrame

The filtered DataFrame

Examples

>>> import clarite
>>> nhanes_filtered = clarite.modify.colfilter_min_n(nhanes)
================================================================================
Running colfilter_min_n
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Testing 362 of 362 binary variables
        Removed 12 (3.31%) tested binary variables which had less than 200 non-null values
Testing 47 of 47 categorical variables
        Removed 8 (17.02%) tested categorical variables which had less than 200 non-null values
Testing 483 of 483 continuous variables
        Removed 8 (1.66%) tested continuous variables which had less than 200 non-null values
clarite.modify.colfilter_min_cat_n(data, n: int = 200, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Remove binary and categorical variables which have less than <n> occurences of each unique value

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

n: int, default 200

The minimum number of occurences of each unique value required in order for a variable not to be filtered

skip: str, list or None (default is None)

List of variables that the filter should not be applied to

only: str, list or None (default is None)

List of variables that the filter should only be applied to

Returns
data: pd.DataFrame

The filtered DataFrame

Examples

>>> import clarite
>>> nhanes_filtered = clarite.modify.colfilter_min_cat_n(nhanes)
================================================================================
Running colfilter_min_cat_n
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Testing 362 of 362 binary variables
        Removed 248 (68.51%) tested binary variables which had a category with less than 200 values
Testing 47 of 47 categorical variables
        Removed 36 (76.60%) tested categorical variables which had a category with less than 200 values
clarite.modify.make_binary(data: pandas.core.frame.DataFrame, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Set variable types as Binary

Checks that each variable has at most 2 values and converts the type to pd.Categorical.

Note: When these variables are used in regression, they are ordered by value. For example, Sex (Male=1, Female=2) will encode “Male” as 0 and “Female” as 1 during the EWAS regression step.

Parameters
data: pd.DataFrame or pd.Series

Data to be processed

skip: str, list or None (default is None)

List of variables that should not be made binary

only: str, list or None (default is None)

List of variables that are the only ones to be made binary

Returns
data: pd.DataFrame

DataFrame with the same data but validated and converted to binary types

Examples

>>> import clarite
>>> nhanes = clarite.modify.make_binary(nhanes, only=['female', 'black', 'mexican', 'other_hispanic'])
================================================================================
Running make_binary
--------------------------------------------------------------------------------
Set 4 of 970 variable(s) as binary, each with 22,624 observations
clarite.modify.make_categorical(data: pandas.core.frame.DataFrame, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Set variable types as Categorical

Converts the type to pd.Categorical

Parameters
data: pd.DataFrame or pd.Series

Data to be processed

skip: str, list or None (default is None)

List of variables that should not be made categorical

only: str, list or None (default is None)

List of variables that are the only ones to be made categorical

Returns
data: pd.DataFrame

DataFrame with the same data but validated and converted to categorical types

Examples

>>> import clarite
>>> df = clarite.modify.make_categorical(df)
================================================================================
Running make_categorical
--------------------------------------------------------------------------------
Set 12 of 12 variable(s) as categorical, each with 4,321 observations
clarite.modify.make_continuous(data: pandas.core.frame.DataFrame, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Set variable types as Numeric

Converts the type to numeric

Parameters
data: pd.DataFrame or pd.Series

Data to be processed

skip: str, list or None (default is None)

List of variables that should not be made continuous

only: str, list or None (default is None)

List of variables that are the only ones to be made continuous

Returns
data: pd.DataFrame

DataFrame with the same data but validated and converted to numeric types

Examples

>>> import clarite
>>> df = clarite.modify.make_continuous(df)
================================================================================
Running make_categorical
--------------------------------------------------------------------------------
Set 128 of 128 variable(s) as continuous, each with 4,321 observations
clarite.modify.merge_observations(top: pandas.core.frame.DataFrame, bottom: pandas.core.frame.DataFrame)

Merge two datasets, keeping only the columns present in both. Raise an error if a datatype conflict occurs.

Parameters
top: pd.DataFrame

“top” DataFrame

bottom: pd.DataFrame

“bottom” DataFrame

Returns
result: pd.DataFrame
clarite.modify.merge_variables(left: Union[pandas.core.frame.DataFrame, pandas.core.series.Series], right: Union[pandas.core.frame.DataFrame, pandas.core.series.Series], how: str = 'outer')

Merge a list of dataframes with different variables side-by-side. Keep all observations (‘outer’ merge) by default.

Parameters
left: pd.Dataframe or pd.Series

“left” DataFrame or Series

right: pd.DataFrame or pd.Series

“right” DataFrame or Series which uses the same index

how: merge method, one of {‘left’, ‘right’, ‘inner’, ‘outer’}

Keep only rows present in the left data, the right data, both datasets, or either dataset.

Examples

>>> import clarite
>>> df = clarite.modify.merge_variables(df_bin, df_cat, how='outer')
clarite.modify.move_variables(left: pandas.core.frame.DataFrame, right: Union[pandas.core.frame.DataFrame, pandas.core.series.Series], skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Move one or more variables from one DataFrame to another

Parameters
left: pd.Dataframe

DataFrame containing the variable(s) to be moved

right: pd.DataFrame or pd.Series

DataFrame or Series (which uses the same index) that the variable(s) will be moved to

skip: str, list or None (default is None)

List of variables that will not be moved

only: str, list or None (default is None)

List of variables that are the only ones to be moved

Returns
left: pd.DataFrame

The first DataFrame with the variables removed

right: pd.DataFrame

The second DataFrame with the variables added

Examples

>>> import clarite
>>> df_cat, df_cont = clarity.modify.move_variables(df_cat, df_cont, only=["DRD350AQ", "DRD350DQ", "DRD350GQ"])
Moved 3 variables.
>>> discovery_check, discovery_cont = clarite.modify.move_variables(discovery_check, discovery_cont)
Moved 39 variables.
clarite.modify.recode_values(data, replacement_dict, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Convert values in a dataframe. By default, replacement occurs in all columns but this may be modified with ‘skip’ or ‘only’. Pandas has more powerful ‘replace’ methods for more complicated scenarios.

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

replacement_dict: dictionary

A dictionary mapping the value being replaced to the value being inserted

skip: str, list or None (default is None)

List of variables that the replacement should not be applied to

only: str, list or None (default is None)

List of variables that the replacement should only be applied to

Examples

>>> import clarite
>>> clarite.modify.recode_values(df, {7: np.nan, 9: np.nan}, only=['SMQ077', 'DBD100'])
================================================================================
Running recode_values
--------------------------------------------------------------------------------
Replaced 17 values from 22,624 observations in 2 variables
>>> clarite.modify.recode_values(df, {10: 12}, only=['SMQ077', 'DBD100'])
================================================================================
Running recode_values
--------------------------------------------------------------------------------
No occurences of replaceable values were found, so nothing was replaced.
clarite.modify.remove_outliers(data, method: str = 'gaussian', cutoff=3, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Remove outliers from continuous variables by replacing them with np.nan

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

method: string, ‘gaussian’ (default) or ‘iqr’

Define outliers using a gaussian approach (standard deviations from the mean) or inter-quartile range

cutoff: positive numeric, default of 3

Either the number of standard deviations from the mean (method=’gaussian’) or the multiple of the IQR (method=’iqr’) Any values equal to or more extreme will be replaced with np.nan

skip: str, list or None (default is None)

List of variables that the replacement should not be applied to

only: str, list or None (default is None)

List of variables that the replacement should only be applied to

Examples

>>> import clarite
>>> nhanes_rm_outliers = clarite.modify.remove_outliers(nhanes, method='iqr', cutoff=1.5, only=['DR1TVB1', 'URXP07', 'SMQ077'])
================================================================================
Running remove_outliers
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Removing outliers from 2 continuous variables with values < 1st Quartile - (1.5 * IQR) or > 3rd quartile + (1.5 * IQR)
        Removed 0 low and 430 high IQR outliers from URXP07 (outside -153.55 to 341.25)
        Removed 0 low and 730 high IQR outliers from DR1TVB1 (outside -0.47 to 3.48)
>>> nhanes_rm_outliers = clarite.modify.remove_outliers(nhanes, only=['DR1TVB1', 'URXP07'])
================================================================================
Running remove_outliers
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Removing outliers from 2 continuous variables with values more than 3 standard deviations from the mean
        Removed 0 low and 42 high gaussian outliers from URXP07 (outside -1,194.83 to 1,508.13)
        Removed 0 low and 301 high gaussian outliers from DR1TVB1 (outside -1.06 to 4.27)
clarite.modify.rowfilter_incomplete_obs(data, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Remove rows containing null values

Parameters
data: pd.DataFrame

The DataFrame to be processed and returned

skip: str, list or None (default is None)

List of columns that are not checked for null values

only: str, list or None (default is None)

List of columns that are the only ones to be checked for null values

Returns
data: pd.DataFrame

The filtered DataFrame

Examples

>>> import clarite
>>> nhanes_filtered = clarite.modify.rowfilter_incomplete_obs(nhanes, only=[outcome] + covariates)
================================================================================
Running rowfilter_incomplete_obs
--------------------------------------------------------------------------------
Removed 3,687 of 22,624 observations (16.30%) due to NA values in any of 8 variables
clarite.modify.transform(data: pandas.core.frame.DataFrame, transform_method: str, skip: Optional[Union[List[str], str]] = None, only: Optional[Union[List[str], str]] = None)

Apply a transformation function to a variable

Parameters
data: pd.DataFrame or pd.Series

Data to be processed

transform_method: str

Name of the transformation (Python function or NumPy ufunc to apply)

skip: str, list or None (default is None)

List of variables that will not be transformed

only: str, list or None (default is None)

List of variables that are the only ones to be transformed

Returns
data: pd.DataFrame

DataFrame with variables that have been transformed

Examples

>>> import clarite
>>> df = clarite.modify.transform(df, 'log', only=['BMXBMI'])
================================================================================
Running transform
--------------------------------------------------------------------------------
Transformed 'BMXBMI' using 'log'.