clarite.modify.remove_outliers

clarite.modify.remove_outliers(data, method: str = 'gaussian', cutoff=3, skip: Union[str, List[str], NoneType] = None, only: Union[str, List[str], NoneType] = None)

Remove outliers from continuous variables by replacing them with np.nan

Parameters:
data: pd.DataFrame

The DataFrame to be processed and returned

method: string, 'gaussian' (default) or 'iqr'

Define outliers either by the number of standard deviations from the mean ('gaussian') or by distance beyond the first and third quartiles in multiples of the inter-quartile range ('iqr')

cutoff: positive numeric, default of 3

Either the number of standard deviations from the mean (method='gaussian') or the multiple of the IQR (method='iqr'). Any values equal to or more extreme will be replaced with np.nan

skip: str, list or None (default is None)

Variable name or list of variable names that the replacement should not be applied to

only: str, list or None (default is None)

Variable name or list of variable names that the replacement should be restricted to; all other variables are left unchanged
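
To make the two definitions concrete, the following is a minimal sketch of how the bounds could be computed with plain pandas. It is not CLARITE's implementation: the helper names `outlier_bounds` and `mask_outliers` are hypothetical, the use of the sample standard deviation and pandas' default quantile interpolation are assumptions, and the inclusive comparison follows the `cutoff` description above (the log messages in the examples below are worded with strict inequalities).

```python
import pandas as pd


def outlier_bounds(series: pd.Series, method: str = "gaussian", cutoff: float = 3):
    """Return (low, high) bounds for one continuous variable (hypothetical helper)."""
    if method == "gaussian":
        # cutoff = number of standard deviations from the mean
        mean, std = series.mean(), series.std()
        return mean - cutoff * std, mean + cutoff * std
    if method == "iqr":
        # cutoff = multiple of the IQR beyond the first and third quartiles
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        return q1 - cutoff * iqr, q3 + cutoff * iqr
    raise ValueError(f"method must be 'gaussian' or 'iqr', got {method!r}")


def mask_outliers(series: pd.Series, method: str = "gaussian", cutoff: float = 3):
    """Replace values at or beyond the bounds with NaN (hypothetical helper)."""
    low, high = outlier_bounds(series, method, cutoff)
    # Assumption: "equal to or more extreme" is read as an inclusive comparison
    return series.mask((series <= low) | (series >= high))


# Made-up data: nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
iqr_masked = mask_outliers(values, method="iqr", cutoff=1.5)
gaussian_masked = mask_outliers(values, method="gaussian", cutoff=3)
```

On this small series the two methods disagree: the extreme value falls outside the IQR bounds but inside the gaussian bounds, because a single extreme value inflates the standard deviation used to define them. The examples below show a similar difference in the number of values removed.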

Examples

>>> import clarite
>>> nhanes_rm_outliers = clarite.modify.remove_outliers(nhanes, method='iqr', cutoff=1.5, only=['DR1TVB1', 'URXP07', 'SMQ077'])
================================================================================
Running remove_outliers
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Removing outliers from 2 continuous variables with values < 1st Quartile - (1.5 * IQR) or > 3rd quartile + (1.5 * IQR)
        Removed 0 low and 430 high IQR outliers from URXP07 (outside -153.55 to 341.25)
        Removed 0 low and 730 high IQR outliers from DR1TVB1 (outside -0.47 to 3.48)
>>> nhanes_rm_outliers = clarite.modify.remove_outliers(nhanes, only=['DR1TVB1', 'URXP07'])
================================================================================
Running remove_outliers
--------------------------------------------------------------------------------
WARNING: 36 variables need to be categorized into a type manually
Removing outliers from 2 continuous variables with values more than 3 standard deviations from the mean
        Removed 0 low and 42 high gaussian outliers from URXP07 (outside -1,194.83 to 1,508.13)
        Removed 0 low and 301 high gaussian outliers from DR1TVB1 (outside -1.06 to 4.27)