clarite.modify.categorize

clarite.modify.categorize(data: pandas.core.frame.DataFrame, cat_min: int = 3, cat_max: int = 6, cont_min: int = 15)

Classify variables into constant, binary, categorical, continuous, and ‘unknown’. Drop variables that only have NaN values.

Parameters:
        data: pd.DataFrame
                The DataFrame to be processed
        cat_min: int, default 3
                Minimum number of unique, non-NA values for a categorical variable
        cat_max: int, default 6
                Maximum number of unique, non-NA values for a categorical variable
        cont_min: int, default 15
                Minimum number of unique, non-NA values for a continuous variable

Returns:
        result: pd.DataFrame or None
                If inplace, returns None. Changes the datatypes on the input DataFrame.

Examples

>>> import clarite
>>> clarite.modify.categorize(nhanes)
362 of 970 variables (37.32%) are classified as binary (2 unique values).
47 of 970 variables (4.85%) are classified as categorical (3 to 6 unique values).
483 of 970 variables (49.79%) are classified as continuous (>= 15 unique values).
42 of 970 variables (4.33%) were dropped.
        10 variables had zero unique values (all NA).
        32 variables had one unique value.
36 of 970 variables (3.71%) were not categorized and need to be set manually.
        36 variables had between 6 and 15 unique values
        0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
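
The thresholds in the report above can be read as a rule on the number of unique, non-NA values per variable. The following is a minimal sketch of that rule in plain Python, not CLARITE's actual implementation: the helper names (is_missing, classify_column) and the labels are hypothetical, the input is assumed to be a dict of column name to list of values, and the numeric check is a simplification of whatever conversion the library performs. Note that the sample report lists single-value variables among those dropped; the sketch only labels them.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_column(values, cat_min=3, cat_max=6, cont_min=15):
    """Label one column by its number of unique, non-missing values."""
    observed = {v for v in values if not is_missing(v)}
    n_unique = len(observed)

    if n_unique == 0:
        return "all_na"       # reported as "zero unique values (all NA)"
    if n_unique == 1:
        return "constant"     # reported as "one unique value"
    if n_unique == 2:
        return "binary"
    if cat_min <= n_unique <= cat_max:
        return "categorical"
    if n_unique >= cont_min:
        # Only numeric columns can be continuous; bool is excluded on purpose.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in observed
        )
        return "continuous" if numeric else "unknown"
    # Falls in the gap between cat_max and cont_min: needs manual review.
    return "unknown"


# Hypothetical toy columns, one per outcome.
columns = {
    "empty": [None, float("nan")],
    "site": ["A", "A", "A"],
    "smoker": [0, 1, 1, None],
    "grade": [1, 2, 3, 4, 1],
    "score": list(range(10)),
    "bmi": [18.5 + i for i in range(20)],
}

labels = {name: classify_column(vals) for name, vals in columns.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

With the default thresholds, a column such as "score" (10 unique values) falls between cat_max and cont_min, which corresponds to the "not categorized and need to be set manually" group in the report.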