Learning from #EffectivePandas and #PythonForDataAnalysis.

Recipe for permuting (randomly reordering) the rows of a DataFrame or Series:
new_order = np.random.permutation(len(df))
df.iloc[new_order]
df.take(new_order)
To permute the columns of a DataFrame, add "axis='columns'" to .take() and build the permutation from the number of columns instead.
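A minimal sketch of the recipe above on a made-up toy frame. The seeded generator from np.random.default_rng is my own swap-in so the shuffle is repeatable; np.random.permutation works the same way.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": list("wxyz")})

rng = np.random.default_rng(0)          # seeded so the shuffle is repeatable
new_order = rng.permutation(len(df))    # shuffled array of 0 .. len(df)-1

by_iloc = df.iloc[new_order]
by_take = df.take(new_order)            # same rows, same order as .iloc

# Columns: the permutation must be as long as the number of columns
col_order = rng.permutation(df.shape[1])
shuffled_cols = df.take(col_order, axis="columns")
```

Both .iloc[] and .take() select by position, so the original index labels travel with their rows.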

Method for selecting a random subset of the rows of a DataFrame or Series:
df.sample(n=, frac=)
Pass either n= (a row count) or frac= (a fraction of the rows), not both. To allow for replacement, add "replace=True" to .sample().
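A quick sketch of .sample() on a made-up frame. The random_state= argument is my addition, only there to make the draws repeatable.

```python
import pandas as pd

# Toy frame, made up for illustration
df = pd.DataFrame({"x": range(10)})

three_rows = df.sample(n=3, random_state=0)        # exactly 3 rows
half_rows = df.sample(frac=0.5, random_state=0)    # half of the 10 rows
# n= and frac= are alternatives; passing both raises ValueError

# With replacement, the sample can be larger than the data and rows may repeat
boot = df.sample(n=20, replace=True, random_state=0)
```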

#LearnPython #ProgressToday

@treyhunner I continued working through 2 books on pandas: #EffectivePandas and #PythonForDataAnalysis, and wrote a few toots as my notes

Learning from #EffectivePandas and #PythonForDataAnalysis.

The preferred way to index and filter a Series or a DataFrame is i) with .loc[] indexing on index labels or ii) with .iloc[] indexing on integer positions. Their call signatures are nearly identical, shown here for .loc (swap in .iloc for positions):

.loc[rows]
.loc[:, cols]
.loc[rows, cols]

Their strength is that they make explicit what we intend to index on and what we intend to select, which helps us not be the problem 😂
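To make the difference concrete, here is a small sketch on a made-up frame. One detail worth remembering: label slices with .loc[] include both endpoints, while position slices with .iloc[] exclude the end.

```python
import pandas as pd

# Toy frame, made up for illustration
df = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "qty": [10, 0, 7]},
    index=["apple", "pear", "kiwi"],
)

# Label-based: the slice includes both "apple" and "pear"
by_label = df.loc["apple":"pear", ["price"]]

# Position-based: the slice stops before position 2, as in plain Python
by_pos = df.iloc[0:2, [0]]

# .loc[] also takes a boolean mask for filtering rows
in_stock = df.loc[df["qty"] > 0, "price"]
```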

#LearnPython #ProgressToday

Continued my way through #EffectivePandas and #PythonForDataAnalysis.

Element-wise transformation of a Series' values or an Index's labels can be done by feeding a dictionary (a lookup table; elements without a matching key become NaN) or a function (applied to every element) into the method

.map(dict or func)
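A short sketch of both flavours of .map() on made-up data. Note what happens to "emu", which has no key in the dictionary.

```python
import pandas as pd

# Toy data, made up for illustration
s = pd.Series(["cat", "dog", "emu"])

# Dict: values without a matching key become NaN
legs = s.map({"cat": 4, "dog": 4})

# Function: applied to every element
upper = s.map(str.upper)

# Index labels can be mapped the same way
idx = pd.Index(["a", "b"]).map(str.upper)
```

To keep unmatched values as they are, one option is to follow up with legs.fillna(...) or to map with a function that falls back to the input.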

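Binning of a Series or column can be done on i) the data values, or ii) the data quantiles. Both are top-level pandas functions rather than methods:

pd.cut(data, bins (edges or a number of bins), right=, labels=, precision=)
pd.qcut(data, q (a number of quantiles or a list of quantile edges))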
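A small sketch of value-based versus quantile-based binning. The ages, bin edges, and labels are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration
ages = pd.Series([5, 17, 18, 40, 65, 90])

# Explicit edges; right=False closes bins on the left: [0, 18), [18, 65), [65, 120)
groups = pd.cut(ages, bins=[0, 18, 65, 120], right=False,
                labels=["child", "adult", "senior"])

# Quantile-based: 3 bins holding (roughly) equal numbers of values
terciles = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```

Both return a categorical result, so the bins keep their order when sorted or grouped.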

#LearnPython #ProgressToday

#ProgressToday Finished the sections in #EffectivePandas and #PythonForDataAnalysis on converting the data types of a Series or column. Top methods:

.astype(dtype, copy=, errors=)
.convert_dtypes()
pd.to_datetime()
pd.CategoricalDtype(categories=, ordered=)

The 1st one converts to the dtype you name (typically a Python or NumPy type), while the 2nd one infers the best pandas extension types that support NA.

Before converting data types, be sure to take care of codes for missing data or errors.
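A sketch of the four tools together, cleaning first and converting second. The "-999" missing-value code and all of the data are made up for illustration.

```python
import pandas as pd

# Toy data; "-999" is a hypothetical code for "missing"
raw = pd.Series(["3", "7", "-999"])

# Deal with the missing-value code first, then convert
cleaned = raw.mask(raw == "-999")               # the code becomes NaN
as_float = cleaned.astype("float64")            # NumPy dtype; NaN marks missing
as_nullable = as_float.convert_dtypes()         # pandas Int64; pd.NA marks missing

dates = pd.to_datetime(pd.Series(["2024-01-31", "2024-02-29"]))

# An ordered categorical dtype, applied through .astype()
size_type = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
sizes = pd.Series(["M", "S", "L"]).astype(size_type)
```

Because the categories are ordered, comparisons and sizes.min() follow S < M < L rather than alphabetical order.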

#LearnPython

#ProgressToday Finished going through sections in #EffectivePandas and #PythonForDataAnalysis related to duplicated data and cleaning. It's good that the two important methods apply to all three objects - Series, DataFrame, and Index:

.duplicated(subset=, keep=)
.drop_duplicates(subset=, keep=)

One difference is that the kwarg 'subset=' applies to DataFrame objects only, which can have multiple columns to choose from.
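A minimal sketch of both methods on a made-up frame, plus the Series and Index versions without subset=.

```python
import pandas as pd

# Toy frame, made up for illustration
df = pd.DataFrame({"user": ["ann", "bob", "ann", "ann"],
                   "city": ["Oslo", "Rome", "Oslo", "Lima"]})

# Whole-row repeats: only row 2 repeats an earlier row (row 0)
full_dupes = df.duplicated()

# Judge duplicates on one column only, keeping the last occurrence
by_user = df.drop_duplicates(subset="user", keep="last")

# Series and Index take no subset=, since there is only one set of values
unique_users = df["user"].drop_duplicates()
idx_flags = pd.Index(["a", "b", "a"]).duplicated(keep=False)  # flag every repeat
```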

#LearnPython

#ProgressToday Finished going over sections in #EffectivePandas and #PythonForDataAnalysis related to handling missing data. Here are useful methods on this topic:

.isna()
.notna()
.dropna(how=, thresh=, axis=)
.fillna(value=, method=, limit=, axis=)  (method= is deprecated since pandas 2.1; use .ffill() / .bfill() instead)
.interpolate(method=, limit=, axis=)
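A sketch of these methods on made-up data. I use .ffill() here instead of .fillna(method="ffill"), since the method= keyword is deprecated in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
s = pd.Series([1.0, np.nan, np.nan, 4.0])

n_missing = s.isna().sum()      # count of missing values
dropped = s.dropna()            # keeps only the non-missing entries
filled = s.fillna(0)            # constant fill
carried = s.ffill(limit=1)      # forward-fill at most 1 value per gap
linear = s.interpolate()        # linear interpolation between neighbours

# DataFrame.dropna: keep rows with at least 2 non-missing values
df = pd.DataFrame({"a": [1, np.nan, np.nan],
                   "b": [1, 2, np.nan],
                   "c": [1, 2, 3]})
kept = df.dropna(thresh=2)
```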

#LearnPython