
For large-scale processing, try NumPy and DataFrame vectorization instead of for loops.

by beomcoder 2024. 1. 11.

Loops are among the first things taught in school, and they come up when learning nearly every programming language. So whenever a task involved repetition, I implemented it with a loop by default. Recently, while processing data, looping over a very large number of iterations (millions or billions of rows) became a burden. That pushed me to start using numpy, which I had learned but rarely used. I am writing this down so that I don't forget it.

 

import time 
start = time.time()

# iterative sum
total = 0
# iterating through 1.5 Million numbers
for item in range(0, 1500000):
    total = total + item

print('sum is:' + str(total))
end = time.time()
print(end - start)
#1124999250000
#0.14 Seconds

 

import numpy as np

start = time.time()
# vectorized sum - using numpy for vectorization
# np.arange creates the sequence of numbers from 0 to 1499999
# dtype=np.int64 avoids overflow where NumPy's default integer is 32-bit
print(np.sum(np.arange(1500000, dtype=np.int64)))
end = time.time()
print(end - start)

##1124999250000
##0.008 Seconds

 

Even for code as simple as a plain sum, the vectorized version runs about 18 times faster than iterating with the range function (0.008 seconds versus 0.14 seconds). The gap gets even larger with a Pandas DataFrame.
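A single `time.time()` measurement can be noisy, so the comparison above may be easier to reproduce with the standard `timeit` module. The following is only a sketch under that assumption; the helper names `loop_sum` and `vectorized_sum` are made up for illustration, and the ratio you see will depend on your machine.

```python
import timeit

import numpy as np

N = 1_500_000  # same range as the example above


def loop_sum(n):
    """Sum 0..n-1 with a plain Python for loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum 0..n-1 with NumPy; int64 avoids overflow on 32-bit default ints."""
    return int(np.arange(n, dtype=np.int64).sum())


# Both approaches must agree before the timings mean anything.
assert loop_sum(N) == vectorized_sum(N) == N * (N - 1) // 2

# Take the best of a few repeats to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized_sum(N), number=1, repeat=3))
print(f"loop: {loop_time:.4f}s, numpy: {vec_time:.4f}s, ratio: {loop_time / vec_time:.1f}x")
```

Checking that both functions return the same value first keeps the benchmark honest: a fast but wrong result is not a speedup.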

 

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 50, size=(5000000, 4)), columns=('a','b','c','d'))
df.shape
# (5000000, 4)
df.head()

 

First, I created a DataFrame to run the comparison on.

 

import time 
start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    # creating a new column 
    df.at[idx,'ratio'] = 100 * (row["d"] / row["c"])  
end = time.time()
print(end - start)
### 109 Seconds

 

start = time.time()
df["ratio"] = 100 * (df["d"] / df["c"])

end = time.time()
print(end - start)
### 0.12 seconds

 

Vectorized DataFrame operations can be dramatically faster. Here the vectorized version took 0.12 seconds against 109 seconds for the for loop, which is roughly 900 times faster.
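To convince yourself that the two versions compute the same thing, it may help to compare them on a small frame first. This is a minimal sketch, not the benchmark itself: the frame `small`, the seed, and the 1,000-row size are assumptions chosen for illustration, and the values are drawn from 1 to 49 so that column `c` never contains zero. With the original 0 to 49 range, dividing by a zero in `c` yields `inf` (or `NaN` for 0/0) instead of a usable ratio.

```python
import numpy as np
import pandas as pd

# Fixed seed and small size are illustrative assumptions, not from the post.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(1, 50, size=(1000, 4)), columns=["a", "b", "c", "d"])

# Row-by-row version, collected in a list instead of writing into the frame.
loop_ratio = [100 * (row["d"] / row["c"]) for _, row in small.iterrows()]

# Vectorized version: one expression over whole columns.
vec_ratio = 100 * (small["d"] / small["c"])

# Same values either way.
assert np.allclose(loop_ratio, vec_ratio.to_numpy())
```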

 

import time 
start = time.time()

# Iterating through DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx,'e'] = row.d    
    elif 0 < row.a <= 25:
        df.at[idx,'e'] = (row.b)-(row.c)    
    else:
        df.at[idx,'e'] = row.b + row.c
end = time.time()
print(end - start)
### Time taken: 177 seconds

 

start = time.time()
df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']
end = time.time()
print(end - start)
## 0.28007707595825195 sec

 

Coding with vectorized operations is about 600 times faster than a for loop with if-else statements (0.28 seconds versus 177 seconds).
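The three `df.loc` assignments rely on their order, with later lines overwriting earlier ones. One alternative that keeps the if/elif/else shape visible is `np.select`. This is a swapped-in technique rather than the approach timed above, shown as a sketch on a tiny hypothetical frame `df_demo` whose values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny hand-written frame (hypothetical values) covering every branch.
df_demo = pd.DataFrame({
    "a": [0, 10, 25, 26, 49],
    "b": [5, 6, 7, 8, 9],
    "c": [1, 2, 3, 4, 5],
    "d": [40, 41, 42, 43, 44],
})

# Conditions are checked in order; the first match wins, like if/elif.
conditions = [
    df_demo["a"] == 0,
    (df_demo["a"] > 0) & (df_demo["a"] <= 25),
]
choices = [
    df_demo["d"],
    df_demo["b"] - df_demo["c"],
]
# default plays the role of the final else branch.
df_demo["e"] = np.select(conditions, choices, default=df_demo["b"] + df_demo["c"])

print(df_demo["e"].tolist())  # [40, 4, 4, 12, 14]
```

Because each condition is paired with its result, adding another branch means adding one entry to each list instead of reasoning about which assignment overwrites which.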

 

๋‚˜๋„ ์•„์ง ์ œ๋Œ€๋กœ ๊ณต๋ถ€๋ฅผ ์•ˆํ•ด์„œ ์‚ฌ์šฉํ•˜๊ธฐ์— ์–ด๋ ต๊ณ , ์ œ๋Œ€๋กœ ์•Œ๋ ค์ฃผ๊ธฐ๋„ ๋ฌธ์ œ๊ฐ€ ์žˆ์ง€๋งŒ ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๊ฑด ์‹œ๊ฐ„๋ฌธ์ œ์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ณต๋ถ€๋ฅผ ๋งŽ์ด ํ•ด์„œ ์ž์›์„ ์•„๊ปด์•ผ๊ฒ ๋‹ค๊ณ  ๋Š๊ผˆ๋‹ค.

 

 

