Playing Around with Numpy and Pandas

I recently started working on a basic data science project. Below are some of the things I learned getting a feel for Numpy and Pandas.

Linspace - Specify Interval and Range for Graph Axis

The official docs can be found here

This returns an array of evenly spaced samples from start to stop; the number of samples defaults to 50 if not provided. For example:

>>> import numpy as np
>>> X = np.linspace(1.5, 5, 10)
>>> X
array([1.5       , 1.88888889, 2.27777778, 2.66666667, 3.05555556,
       3.44444444, 3.83333333, 4.22222222, 4.61111111, 5.        ])
>>> len(X)
10

This seems to be used with plotting libraries to control each axis's range and tick interval.

For example, if the above output is used for the X-axis in a plotting library, the axis will start at 1.5 and end at 5 with 10 evenly spaced points.
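As a small check of the spacing, linspace can also return the step size between samples via retstep (a sketch using the same call as above):

```python
import numpy as np

# retstep=True returns a (samples, step) tuple
samples, step = np.linspace(1.5, 5, 10, retstep=True)

print(step)                      # spacing between consecutive samples, (5 - 1.5) / 9
print(samples[0], samples[-1])   # 1.5 5.0 — both endpoints are included by default
```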

duplicated - Identify Duplicate Rows in a DataFrame

dupes = some_df[some_df.duplicated()]

In the above:

  • some_df.duplicated() returns a boolean Series: True if the row is a duplicate of an earlier row and False otherwise.
  • some_df[...]: Pandas can take a boolean mask whose values line up with each row of the DataFrame.
    • If True the row is kept in the result
    • If False it is dropped.
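The two steps above can be sketched on a small throwaway DataFrame (SOME_COL is just a placeholder name):

```python
import pandas as pd

some_df = pd.DataFrame({'SOME_COL': ['a', 'b', 'a', 'c']})

# duplicated() marks the second and later occurrences of each row
mask = some_df.duplicated()
print(mask.tolist())   # [False, False, True, False]

# boolean indexing keeps only the rows where the mask is True
dupes = some_df[mask]
print(len(dupes))      # 1
```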

The duplicated method's documentation can be found here

To easily drop the duplicates run:

# drop duplicates where the entire row is the same
de_duplicated_df = some_df.drop_duplicates()

# drop duplicates where rows are compared based on one column only
de_duplicated_df = some_df.drop_duplicates(subset='SOME_COL')

To confirm that the correct number of dupes was dropped we can compare the:

  • DataFrame length before
  • DataFrame length after de-duplicating
  • The number of duplicates:
num_dupes = len(some_df[some_df.duplicated(subset='SOME_COL')])
len_before = len(some_df)
de_duped = some_df.drop_duplicates(subset='SOME_COL')
len_after = len(de_duped)

len_before, len_after, num_dupes, (len_before - len_after) == num_dupes

A good idea is to sort by the column you dropped duplicates against, so you can verify at a glance that the dupes are gone:

some_df.sort_values('SOME_COL', ascending=True)
  • This nice brief article goes over this in more detail.
  • The official docs for drop_duplicates can be found here
  • The official docs for sort_values
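Putting the verification steps above together on a small throwaway DataFrame (SOME_COL is a placeholder column name):

```python
import pandas as pd

some_df = pd.DataFrame({'SOME_COL': [1, 2, 2, 3, 3, 3]})

# count dupes against the same subset we will drop against
num_dupes = len(some_df[some_df.duplicated(subset='SOME_COL')])
len_before = len(some_df)
de_duped = some_df.drop_duplicates(subset='SOME_COL')
len_after = len(de_duped)

# 6 rows, 3 unique values, so 3 duplicates were dropped
print(len_before, len_after, num_dupes)        # 6 3 3
print((len_before - len_after) == num_dupes)   # True
```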

Append a Column with Defaults

This answer describes this in more detail. But to add some new column, let's say COUNT, with a default of 0 we do the following:

some_df['COUNT'] = 0

You can also default all values in the column to not a number (NaN) as follows:

some_df['COUNT'] = np.nan
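A quick sketch of both defaults (COUNT and MISSING are placeholder column names):

```python
import numpy as np
import pandas as pd

some_df = pd.DataFrame({'ID': [1, 2, 3]})

# every row gets the scalar default
some_df['COUNT'] = 0
print(some_df['COUNT'].tolist())          # [0, 0, 0]

# every row gets NaN
some_df['MISSING'] = np.nan
print(some_df['MISSING'].isna().all())    # True
```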

at - set value for a particular cell

This answer describes how to do this.

You have to use something of the form:

row_num = 10
previous = some_df.at[row_num, 'COUNT']
some_df.at[row_num, 'COUNT'] = previous + 1
  • row_num: this is used to illustrate that the first part of at is the index label of the row you want
  • The second part of at is the column you want to set the cell value for.
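The read-then-increment pattern above can be sketched end to end (a small placeholder frame, with row label 1 instead of 10):

```python
import pandas as pd

some_df = pd.DataFrame({'COUNT': [0, 0, 0]})

row_num = 1
previous = some_df.at[row_num, 'COUNT']
some_df.at[row_num, 'COUNT'] = previous + 1

# only the targeted cell changed
print(some_df['COUNT'].tolist())   # [0, 1, 0]
```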

concat - insert a new row

An example of adding a row with columns is below:

row_df = pd.DataFrame({'ID': 0, 'Name': 'John', 'Surname': 'Smith'}, index=['0'])
row_df.set_index('ID', inplace=True)
data = pd.concat([row_df, data])

If the entry you're adding is a simple row with the same types then you can use the approach described here. An extract from this article is below:

a_row = pd.Series([1, 2])
df = pd.DataFrame([[3, 4], [5, 6]])

row_df = pd.DataFrame([a_row])
df = pd.concat([row_df, df], ignore_index=True)
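As a sanity check, running that extract puts the new row at the top with a renumbered index:

```python
import pandas as pd

a_row = pd.Series([1, 2])
df = pd.DataFrame([[3, 4], [5, 6]])

row_df = pd.DataFrame([a_row])
df = pd.concat([row_df, df], ignore_index=True)

# the series becomes the first row; ignore_index renumbers 0, 1, 2
print(df.values.tolist())   # [[1, 2], [3, 4], [5, 6]]
```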

set_index - change the index to a specific column

This is as simple as:

data.set_index('SOME_OTHER_COL', inplace=True)
  • inplace will change the dataframe directly
    • If this is set to False a new dataframe with the new index is returned instead.
  • Here is the official set_index docs
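A minimal sketch (SOME_OTHER_COL and VAL are placeholder names):

```python
import pandas as pd

data = pd.DataFrame({'SOME_OTHER_COL': ['x', 'y'], 'VAL': [1, 2]})
data.set_index('SOME_OTHER_COL', inplace=True)

# the column's values are now the row labels
print(data.index.tolist())    # ['x', 'y']
print(data.at['x', 'VAL'])    # 1
```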

Filtering a DataFrame

This is of the format:

some_df = some_df[some_df.COUNT > 0]

In this case some_df:

  • Has a column called COUNT which is an int column
    • If it has a different type use the appropriate operator
  • We are replacing some_df, but if you just want to filter and inspect the results, change this to:
some_df[some_df.COUNT > 0]
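A runnable sketch with a placeholder COUNT column:

```python
import pandas as pd

some_df = pd.DataFrame({'COUNT': [0, 2, 0, 5]})

# keep only the rows where the comparison is True
filtered = some_df[some_df.COUNT > 0]
print(filtered['COUNT'].tolist())   # [2, 5]
```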