Describe

Functions that are used to gather information about some data

clarite.describe.correlations(data: pandas.core.frame.DataFrame, threshold: float = 0.75)

Return variables with pearson correlation above the threshold

Parameters
data: pd.DataFrame

The DataFrame to be described

threshold: float, between 0 and 1

Return a dataframe listing pairs of variables whose absolute value of correlation is above this threshold

Returns
result: pd.DataFrame

DataFrame listing pairs of correlated variables and their correlation value

Examples

>>> import clarite
>>> correlations = clarite.describe.correlations(df, threshold=0.9)
>>> correlations.head()
                    var1      var2  correlation
0  supplement_count  DSDCOUNT     1.000000
1          DR1TM181  DR1TMFAT     0.997900
2          DR1TP182  DR1TPFAT     0.996172
3          DRD370FQ  DRD370UQ     0.987974
4          DR1TS160  DR1TSFAT     0.984733
clarite.describe.freq_table(data: pandas.core.frame.DataFrame)

Return the count of each unique value for all binary and categorical variables. Other variables will return a single row with a value of ‘<Non-Categorical Values>’ and the number of non-NA values.

Parameters
data: pd.DataFrame

The DataFrame to be described

Returns
result: pd.DataFrame

DataFrame listing variable, value, and count for each categorical variable

Examples

>>> import clarite
>>> clarite.describe.freq_table(df).head(n=10)
    variable value  count
0                 SDDSRVYR                         2   4872
1                 SDDSRVYR                         1   4191
2                   female                         1   4724
3                   female                         0   4339
4  how_many_years_in_house                         5   2961
5  how_many_years_in_house                         3   1713
6  how_many_years_in_house                         2   1502
7  how_many_years_in_house                         1   1451
8  how_many_years_in_house                         4   1419
9                  LBXPFDO  <Non-Categorical Values>   1032
clarite.describe.get_types(data: pandas.core.frame.DataFrame)

Return the type of each variable

Parameters
data: pd.DataFrame

The DataFrame to be described

Returns
result: pd.Series

Series listing the CLARITE type for each variable

Examples

>>> import clarite
>>> clarite.describe.get_types(df).head()
RIDAGEYR          continuous
female                binary
black                 binary
mexican               binary
other_hispanic        binary
dtype: object
clarite.describe.percent_na(data: pandas.core.frame.DataFrame)

Return the percent of observations that are NA for each variable

Parameters
data: pd.DataFrame

The DataFrame to be described

Returns
result: pd.DataFrame

DataFrame listing percent NA for each variable

Examples

>>> import clarite
>>> clarite.describe.percent_na(df)
   variable  percent_na
0  SDDSRVYR     0.00000
1    female     0.00000
2    LBXHBC     4.99321
3    LBXHBS     4.98730
clarite.describe.skewness(data: pandas.core.frame.DataFrame, dropna: bool = False)

Return the skewness of each continuous variable

Parameters
data: pd.DataFrame

The DataFrame to be described

dropna: bool

If True, drop rows with NA values before calculating skew. Otherwise the NA values propagate.

Returns
result: pd.DataFrame

DataFrame listing three values for each continuous variable and NA for others: skew, zscore, and pvalue The test null hypothesis is that the skewness of the samples population is the same as the corresponding normal distribution. The pvalue is the two-sided pvalue for the hypothesis test

Examples

>>> import clarite
>>> clarite.describe.skewness(df)
     Variable         type      skew    zscore        pvalue
0       pdias  categorical       NaN       NaN           NaN
1   longindex  categorical       NaN       NaN           NaN
2     durflow   continuous  2.754286  8.183515  2.756827e-16
3      height   continuous  0.583514  2.735605  6.226567e-03
4     begflow   continuous -0.316648 -1.549449  1.212738e-01
clarite.describe.summarize(data: pandas.core.frame.DataFrame)

Print the number of each type of variable and the number of observations

Parameters
data: pd.DataFrame

The DataFrame to be described

Returns
result: None

Examples

>>> import clarite
>>> clarite.describe.get_types(df).head()
RIDAGEYR          continuous
female                binary
black                 binary
mexican               binary
other_hispanic        binary
dtype: object