NumPy with Pandas for more efficient data analysis

Image by jcomp on Freepik

For anyone who works with data, Pandas is a go-to package for data manipulation because it is intuitive and easy to use. That is why many data science programs include Pandas in their curriculum.

Pandas is built on the NumPy package, specifically on the NumPy array. Many NumPy functions and techniques therefore work directly on Pandas objects, so we can use NumPy to improve our data analysis with Pandas.

This article discusses several examples of how NumPy can enhance our Pandas data analysis experience.

Let’s get started.

Improving Pandas Data Analysis with NumPy

Before we continue with the tutorial, we need to have all the required packages installed. If you haven’t done so, you can install Pandas and NumPy with pip, for example by running pip install pandas numpy.

We can start by explaining how Pandas and NumPy are related. As mentioned above, Pandas is built on the NumPy package. Let’s see how they can complement each other to improve our data analysis.

First, let’s try to create a NumPy array and Pandas DataFrame with the respective packages.

import numpy as np
import pandas as pd

np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])

print(np_array)
print(pandas_df)
Output>>
[[1 2 3]
 [4 5 6]
 [7 8 9]]
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

As you can see in the code above, we can create a Pandas DataFrame directly from a NumPy array, and the DataFrame keeps the array's shape of three rows and three columns.
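The relationship also works in the other direction. As a minimal sketch, the DataFrame's to_numpy() method returns the values as a NumPy array again; the variable name round_trip is only an illustrative choice.

```python
import numpy as np
import pandas as pd

np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])

# to_numpy() returns the DataFrame's values as a NumPy array
round_trip = pandas_df.to_numpy()

print(type(round_trip))
# True when the round trip preserves every value and the shape
print(np.array_equal(round_trip, np_array))
```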

Next, we can use NumPy in the Pandas data processing and cleaning steps. For example, we can use the NumPy NaN object as the missing data placeholder.

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})
print(df)
Output>>
    A    B    C
0  1.0  5.0  1.0
1  2.0  NaN  2.0
2  NaN  NaN  3.0
3  4.0  3.0  NaN
4  5.0  2.0  5.0

As you can see in the result above, Pandas displays every np.nan entry as NaN and treats it as missing data. The columns are also stored as floats, because NaN is a floating-point value.

Calling df.isna().sum() lets you count the NaN values in each Pandas DataFrame column.

Output>>
A    1
B    2
C    1
dtype: int64
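The per-column count above can be reproduced with a short sketch like the one below. It assumes the same example DataFrame, and the second print repeats the count on the underlying NumPy array purely for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})

# isna() marks each missing cell as True; sum() counts the True values per column
nan_counts = df.isna().sum()
print(nan_counts)

# The same count computed with NumPy on the underlying array (axis=0 means per column)
print(np.isnan(df.to_numpy()).sum(axis=0))
```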

Sometimes the people who collect the data record missing values in a DataFrame column as strings. When that happens, we can replace that string value with the NumPy NaN object.

df['A'] = df['A'].replace('missing data', np.nan)
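As a small end-to-end sketch of that replacement, the example below builds a column that mixes numbers with a placeholder string. The sample values are made up for illustration, and the pd.to_numeric step is an addition that assumes you want a numeric column afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 'missing data' stands in for unrecorded values
df = pd.DataFrame({'A': [1, 'missing data', 3, 'missing data', 5]})

# Swap the placeholder string for NaN
df['A'] = df['A'].replace('missing data', np.nan)

# Depending on the pandas version, the column may still have object dtype here,
# so convert it explicitly to a numeric dtype
df['A'] = pd.to_numeric(df['A'])

print(df['A'].isna().sum())
print(df['A'].dtype)
```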

NumPy can also be used for outlier detection. Let’s see how we can do that.

df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),
    'B': np.random.normal(0, 1, 1000)
})

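# Inject two extreme values so there is something to detect
df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100

def detect_outliers(data, threshold=3):
    # Absolute Z-score: distance from the column mean in standard deviations
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
# Keep the rows where at least one column is flagged
print(df[outliers.any(axis=1)])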
Output>>
            A           B
10  100.000000    0.355967
25    0.239933 -100.000000

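In the code above, we generate random numbers using NumPy and then create a function that detects outliers using the Z-score and the three-sigma rule: any value more than three standard deviations from its column mean is flagged. The result shows the rows of the DataFrame that contain at least one outlier.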
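Because the data is random, the non-outlier values in the output change from run to run. The sketch below repeats the same approach with a seeded generator so the run is repeatable; the seed value 42 and the name flagged are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator (arbitrary seed) so the random data is the same on every run
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'A': rng.normal(0, 1, 1000),
    'B': rng.normal(0, 1, 1000)
})

# Inject one extreme value into each column
df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100

def detect_outliers(data, threshold=3):
    # Absolute Z-score per column, compared against the threshold
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
flagged = df[outliers.any(axis=1)]
print(flagged.index.tolist())
```

Note that the injected values themselves inflate each column's standard deviation (to roughly 3 here), which is why only the two extreme rows exceed the threshold.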

We can also perform statistical analysis with Pandas, and NumPy functions can be used directly in the aggregation step. For example, here is a statistical aggregation with Pandas and NumPy.

df = pd.DataFrame({
    'Category': [np.random.choice(['A', 'B']) for i in range(100)],
    'Values': np.random.rand(100)
})

print(df.groupby('Category')['Values'].agg([np.mean, np.std, np.min, np.max]))
Output>>
             mean       std      amin      amax
Category                                        
A         0.524568  0.288471  0.025635  0.999284
B         0.525937  0.300526  0.019443  0.999090

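By passing NumPy functions to agg, we get the mean, standard deviation, minimum, and maximum of each category in a single call. The amin and amax labels are the names NumPy reports for np.min and np.max; depending on the installed versions, they may appear as min and max instead.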
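As a hedged alternative, the same aggregation can be requested with string names instead of NumPy callables. Some recent pandas releases may emit a warning when NumPy functions are passed to agg, and string names give stable column labels. The small dataset below is made up so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical, hand-checkable data instead of random values
df = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B'],
    'Values': [1.0, 2.0, 3.0, 10.0, 20.0]
})

# String names select pandas' built-in aggregations
summary = df.groupby('Category')['Values'].agg(['mean', 'std', 'min', 'max'])
print(summary)
```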

Finally, we will discuss vectorized operations with Pandas and NumPy. A vectorized operation applies an operation to a whole array or column at once instead of looping over the elements one by one in Python. Because the loop runs inside NumPy's compiled code, the result is typically faster and often more memory-efficient.

For example, we can perform element-wise additions between DataFrame columns using NumPy.

data = {'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)
df['C'] = np.add(df['A'], df['B'])

print(df)
Output>>
   A   B   C
0  15  10  25
1  20  20  40
2  25  30  55
3  30  40  70
4  35  50  85
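To make the contrast with a loop concrete, here is a minimal sketch that computes the same column both ways. The names looped and vectorized are illustrative only; on five rows the speed difference is negligible, and the point is that both approaches give identical values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]})

# Loop version: one Python-level addition per row
looped = [a + b for a, b in zip(df['A'], df['B'])]

# Vectorized version: a single call that adds the two columns element-wise
vectorized = np.add(df['A'], df['B'])

print(looped)
print(vectorized.tolist())
```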

We can also transform a DataFrame column using NumPy's mathematical functions.

df['B_exp'] = np.exp(df['B'])
print(df)
Output>>
   A   B   C         B_exp
0  15  10  25  2.202647e+04
1  20  20  40  4.851652e+08
2  25  30  55  1.068647e+13
3  30  40  70  2.353853e+17
4  35  50  85  5.184706e+21
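One detail worth seeing in isolation: when a NumPy function such as np.exp is applied to a Pandas Series, the result is still a Series with the same index. The sketch below uses a made-up Series with hypothetical labels x, y, and z.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels to show that the index survives
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

result = np.exp(s)

print(type(result).__name__)
print(result.index.tolist())
# np.log undoes np.exp, up to floating-point error
print(np.allclose(np.log(result), s))
```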

NumPy also supports conditional replacement in a Pandas DataFrame: np.where takes a condition, the value to use where the condition is true, and the value to use where it is false.

df['A_replaced'] = np.where(df['A'] > 20, df['B'] * 2, df['B'] / 2)
print(df)
Output>>
   A   B   C         B_exp  A_replaced
0  15  10  25  2.202647e+04         5.0
1  20  20  40  4.851652e+08        10.0
2  25  30  55  1.068647e+13        60.0
3  30  40  70  2.353853e+17        80.0
4  35  50  85  5.184706e+21       100.0
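To spell out what the conditional replacement does, the following sketch compares np.where with an equivalent row-by-row Python expression. The names with_where and with_loop are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]})

# Vectorized: where A > 20 take B * 2, otherwise take B / 2
with_where = np.where(df['A'] > 20, df['B'] * 2, df['B'] / 2)

# Equivalent logic written as an explicit per-row loop
with_loop = [b * 2 if a > 20 else b / 2 for a, b in zip(df['A'], df['B'])]

print(with_where.tolist())
```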

Those are all the examples we’ve explored. These NumPy features can help you improve your data analysis process.

Conclusion

This article discussed how NumPy can make data analysis with Pandas more efficient. We performed data preprocessing, data cleaning, statistical analysis, and vectorized operations with Pandas and NumPy.

I hope it helps!

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he enjoys sharing Python and data tips via social media and writing media. Cornellius writes on various topics in AI and machine learning.