clarite.modify.categorize

clarite.modify.categorize(data: pandas.core.frame.DataFrame, cat_min: int = 3, cat_max: int = 6, cont_min: int = 15)

Classify variables as constant, binary, categorical, continuous, or ‘unknown’ based on their number of unique, non-NA values. Drop variables that have only NaN values or only a single unique value.

Parameters:
data: pd.DataFrame

The DataFrame to be processed

cat_min: int, default 3

Minimum number of unique, non-NA values for a categorical variable

cat_max: int, default 6

Maximum number of unique, non-NA values for a categorical variable

cont_min: int, default 15

Minimum number of unique, non-NA values for a continuous variable

Returns:
result: pd.DataFrame or None

The DataFrame with its datatypes changed according to the classification. Returns None if the changes are made in place on the input DataFrame.

Examples

>>> import clarite
>>> clarite.modify.categorize(nhanes)
362 of 970 variables (37.32%) are classified as binary (2 unique values).
47 of 970 variables (4.85%) are classified as categorical (3 to 6 unique values).
483 of 970 variables (49.79%) are classified as continuous (>= 15 unique values).
42 of 970 variables (4.33%) were dropped.
        10 variables had zero unique values (all NA).
        32 variables had one unique value.
36 of 970 variables (3.71%) were not categorized and need to be set manually.
        36 variables had between 6 and 15 unique values
        0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
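
The thresholds described under Parameters, together with the report above, suggest a simple rule based on each variable's count of unique, non-NA values. The following is a minimal pandas-only sketch of that rule, not CLARITE's actual implementation. The function name `sketch_classify`, the label strings, the example column names, and the numeric-conversion check are all assumptions made for illustration; the library's exact boundary handling may differ.

```python
import pandas as pd


def sketch_classify(data: pd.DataFrame, cat_min: int = 3, cat_max: int = 6,
                    cont_min: int = 15) -> pd.Series:
    """Label each column by its number of unique, non-NA values (illustrative only)."""
    unique_counts = data.nunique()  # nunique() ignores NA values by default

    def label(column: str) -> str:
        n = unique_counts[column]
        if n == 0:
            return "all NA (dropped)"
        if n == 1:
            return "constant (dropped)"
        if n == 2:
            return "binary"
        if cat_min <= n <= cat_max:
            return "categorical"
        if n >= cont_min:
            # Assumption: a column is continuous only if every non-NA value is numeric
            numeric = pd.to_numeric(data[column], errors="coerce")
            if numeric.notna().sum() == data[column].notna().sum():
                return "continuous"
        # Too many values for categorical, too few (or non-numeric) for continuous
        return "unknown"

    return pd.Series({col: label(col) for col in data.columns})


# Hypothetical data covering each outcome
example = pd.DataFrame({
    "empty": [None] * 20,                         # zero unique values
    "constant": [1] * 20,                         # one unique value
    "smoker": [0, 1] * 10,                        # two unique values
    "education": [1, 2, 3, 4] * 5,                # four unique values
    "visits": list(range(10)) * 2,                # ten unique values
    "bmi": [18.5 + 0.5 * i for i in range(20)],   # twenty numeric values
    "notes": [f"id-{i}" for i in range(20)],      # twenty non-numeric values
})
print(sketch_classify(example))
```

With the default thresholds, a variable such as `visits` (10 unique values) falls into the gap between `cat_max` and `cont_min` and is left uncategorized, which corresponds to the "need to be set manually" group in the report above.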