categorize(data: pandas.core.frame.DataFrame, cat_min: int = 3, cat_max: int = 6, cont_min: int = 15)
Classify variables as binary, categorical, continuous, or ‘unknown’ based on their number of unique, non-NA values. Drop variables that have only NaN values or only one unique value.
- data: pd.DataFrame
The DataFrame to be processed
- cat_min: int, default 3
Minimum number of unique, non-NA values for a categorical variable
- cat_max: int, default 6
Maximum number of unique, non-NA values for a categorical variable
- cont_min: int, default 15
Minimum number of unique, non-NA values for a continuous variable
- result: pd.DataFrame or None
If inplace, returns None. Changes the datatypes on the input DataFrame.
>>> import clarite
>>> clarite.modify.categorize(nhanes)
362 of 970 variables (37.32%) are classified as binary (2 unique values).
47 of 970 variables (4.85%) are classified as categorical (3 to 6 unique values).
483 of 970 variables (49.79%) are classified as continuous (>= 15 unique values).
42 of 970 variables (4.33%) were dropped.
    10 variables had zero unique values (all NA).
    32 variables had one unique value.
36 of 970 variables (3.71%) were not categorized and need to be set manually.
    36 variables had between 6 and 15 unique values
    0 variables had >= 15 values but couldn't be converted to continuous (numeric) values
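
The threshold rules described by the parameters and the example output can be sketched in plain pandas. This is an illustration only, not CLARITE's actual implementation: the function name `sketch_classify` and the label strings are hypothetical, and the numeric check via `pd.to_numeric` is an assumption based on the "couldn't be converted to continuous (numeric) values" message. Unlike `categorize`, it only reports a label per column and does not change any datatypes or drop anything.

```python
import pandas as pd


def sketch_classify(data: pd.DataFrame, cat_min: int = 3,
                    cat_max: int = 6, cont_min: int = 15) -> pd.Series:
    """Label each column by its count of unique, non-NA values.

    Hypothetical helper for illustration; not part of CLARITE.
    """
    labels = {}
    for name in data.columns:
        n_unique = data[name].nunique(dropna=True)
        if n_unique <= 1:
            # Zero unique values (all NA) or a single constant value
            labels[name] = "dropped"
        elif n_unique == 2:
            labels[name] = "binary"
        elif cat_min <= n_unique <= cat_max:
            labels[name] = "categorical"
        elif n_unique >= cont_min:
            # Assumption: a continuous variable must be convertible to numeric
            try:
                pd.to_numeric(data[name])
                labels[name] = "continuous"
            except (ValueError, TypeError):
                labels[name] = "unknown"
        else:
            # Falls in a gap between thresholds; needs a manual decision
            labels[name] = "unknown"
    return pd.Series(labels)


# Small made-up example frame (20 rows)
example = pd.DataFrame({
    "sex": [0, 1] * 10,             # 2 unique values
    "group": [0, 1, 2, 3] * 5,      # 4 unique values
    "score": list(range(10)) * 2,   # 10 unique values
    "weight": [50.0 + i for i in range(20)],  # 20 unique values
})
print(sketch_classify(example))
```

With the default thresholds, a column with between 7 and 14 unique values falls in the gap between `cat_max` and `cont_min`, which is why such variables are reported as needing to be set manually.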