๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๐˜ผ๐™„/๐˜พ๐™ค๐™™๐™š

๋กœ์ปฌ์—์„œ BERT๋ชจ๋ธ ๋Œ๋ ค์„œ ํ•™์Šตํ•˜๊ธฐ

by beomcoder 2023. 2. 15.

์–ผ๋งˆ์ „์— koBERT๋กœ colab์—์„œ ํ•™์Šตํ•˜์˜€๋Š”๋ฐ ์–ด๋–ป๊ฒŒ ๋กœ์ปฌ์—์„œ ์˜ฎ๊ฒจ์•ผ ํ• ์ง€ ๊ฐ์ด ์•ˆ์žกํ˜”๋‹ค.

mxnet, glounnlp๋ฅผ ์ง์ ‘ ๋‹ค์šด๋กœ๋“œํ•˜์—ฌ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ๋„ฃ์–ด์ฃผ์—ˆ๋Š”๋ฐ๋„ ์•ˆ๋˜๊ณ 

vmware๋ฅผ ๊น”์•„์„œ ๋ฆฌ๋ˆ…์Šคํ™˜๊ฒฝ์—์„œ ํ•ด๋ณด์•˜๋Š”๋ฐ๋„ ์ž˜ ์•ˆ๋˜์—ˆ๋‹ค.

๊ทธ๋ž˜์„œ koBERT๋Š” ์•„์‰ฝ์ง€๋งŒ ์ž ์‹œ ๋ชจ๋ธ๋งŒ ๋‚จ๊ฒจ๋‘๊ณ  ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ ๋กœ์ปฌ๋กœ ํ•™์Šต์„ ์‹œํ‚ค๋ ค๊ณ  ํ•œ๋‹ค.

koBERT ๋ชจ๋ธ ํ•™์Šตํ•˜๊ธฐ์—์„œ txtํŒŒ์ผ์„ ๋งŒ๋“ค์—ˆ๋Š”๋ฐ ๊ทธ๊ฒƒ์„ ์‚ฌ์šฉํ•˜๊ฒ ๋‹ค.

 

 

์ธ๊ณต์ง€๋Šฅ koBERT ๋ชจ๋ธ ํ•™์Šต

์ถ”์ฒœ์‹œ์Šคํ…œ์— ์“ฐ์ผ 'ํƒœ๊ทธ'๋ฅผ ๋‹ฌ๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์„ ํ•˜๋‚˜ ์ œ์ž‘ํ•˜๊ณ  ์žˆ๋‹ค. ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค๋„ ๋งŽ์ง€๋งŒ koELECTRA์™€ ๊ธฐํƒ€ ๋ชจ๋ธ์€ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๋ฅผ ๋ชจ๋ธ์— ๋งž๊ฒŒ ํ•ด์ฃผ์ง€ ์•Š์•„์„œ ๊ทธ๋Ÿฐ๊ฐ€ ์ •ํ™•๋„๊ฐ€ ๋†’์ง€ ์•Š์•˜๋‹ค. ๊ทธ

beomcoder.tistory.com

 

0. ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌํ•˜๊ธฐ

 

koBERT ๋ชจ๋ธ ํ•™์Šตํ•˜๊ธฐ์—์„œ ํ–ˆ๋˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์ฝ”๋“œ์ด๋‹ค.

AIHUB์—์„œ '์ฃผ์ œ๋ณ„ ํ…์ŠคํŠธ ์ผ์ƒ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ'๋ฅผ ๋‹ค์šด๋กœ๋“œ ๋ฐ›์•˜๋‹ค.

๊ทธ ์ค‘์—์„œ ๋‚˜๋Š” jsonํ˜•์‹์œผ๋กœ ๋œ ํŒŒ์ผ๋งŒ ์‚ฌ์šฉํ•œ๋‹ค. ๋ณธ์ธ์ด ์“ธ ๋ถ€๋ถ„์„ ๊ฐ€์ง€๊ณ  ์ฒ˜๋ฆฌํ•˜๋ฉด ๋œ๋‹ค. ๋‚˜๋Š” ํด๋”๋ฅผ vscode์— ๋„ฃ์—ˆ๋‹ค.

 

# In the AIHub data the '상거래전반' category appears both with and without a space, so both spellings map to the same label
category = { '식음료': 0, '주거와 생활': 1, '교통': 2, '회사/아르바이트': 3, '군대': 4, '교육': 5, '가족': 6, '연애/결혼': 7, '반려동물': 8, '스포츠/레저': 9, 
             '게임': 10, '여행': 11, '계절/날씨': 12, '사회이슈': 13, '타 국가 이슈': 14, '미용': 15, '건강': 16, '상거래전반': 17, '상거래 전반': 17, '방송/연예': 18,
             '영화/만화': 19 }

 

์นดํ…Œ๊ณ ๋ฆฌ๋Š” aihub์—์„œ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ๊ทธ๋Œ€๋กœ ๊ฐ€์ง€๊ณ  ์™”๋Š”๋ฐ ์ค‘๊ฐ„์— ๋„์–ด์“ฐ๊ธฐ๋กœ ์ ํ˜€์žˆ๋Š” ๋ถ€๋ถ„์ด ์žˆ์–ด์„œ

๊ทธ ๋ถ€๋ถ„์€ ๋”ฐ๋กœ ๊ฐ™์€ ๋ฒˆํ˜ธ๋กœ ์ฒ˜๋ฆฌํ•ด์ฃผ์—ˆ๋‹ค.

 

# Create the txt files that the JSON files from AIHUB will be converted into
txt_file_names = ['train_data', 'valid_data']
for name in txt_file_names:
    f = open(name+'.txt', 'w', encoding='UTF-8')
    f.write('id\ttext\tlabel\n')
    f.close()

 

I made two files so the train data and the valid data can be handled separately. I wrote id, text, label

on the first line of each file so they can be used later as the column names of a pandas DataFrame.

 

# txt๋กœ ๋งŒ๋“  ์ด์œ ๋Š” ๋‹ค๋ฅธ ๊ณณ์—์„œ๋„ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— txtํŒŒ์ผ๋กœ ๋งŒ๋“ค์—ˆ๋‹ค.
import os, json
dirs = ['dataset\\1.Training\\๋ผ๋ฒจ๋ง๋ฐ์ดํ„ฐ', 'dataset\\2.Validation\\๋ผ๋ฒจ๋ง๋ฐ์ดํ„ฐ']

for txt_file_name, dir_name in zip(txt_file_names, dirs):
    f = open(txt_file_name+'.txt', 'a', encoding='UTF-8')
    id_count = 0
    for root, dir, filenames in os.walk(dir_name):
        for filename in filenames:
            path = os.path.join(root, filename)
            with open(path, 'r', encoding='UTF-8') as file:
                try:
                    json_data = json.load(file)
                except:
                    print(path)
                    continue

                for info in json_data['info']:
                    label = category[info['annotations']['subject']]
                    for line in info['annotations']['lines']:
                        text = line['norm_text']
                        f.write(f'{id_count}\t{text}\t{label}\n')
                        print(f'\r{id_count} ', end='')
                        id_count += 1
                file.close()
    f.close()

์ฒ˜์Œ์—” ์˜ˆ์™ธ์ฒ˜๋ฆฌ๋ฅผ ์•ˆํ•ด์ฃผ์—ˆ๋Š”๋ฐ, ์ค‘๊ฐ„์— ํŒŒ์ผ ๋ช‡๊ฐœ๊ฐ€ ์ฝ์–ด์ง€์ง€ ์•Š์•„์„œ try ~ except ๊ตฌ๋ฌธ์œผ๋กœ ์ฒ˜๋ฆฌํ•ด์ฃผ์—ˆ๋‹ค. 

!pip install transformers

 

import tensorflow as tf
import torch

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# from keras.preprocessing.sequence import pad_sequences  # this import raises an error in the local environment
from keras_preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import random
import time
import datetime

 

import csv
import pandas as pd

train_data = pd.read_csv('train_data.txt', encoding='utf-8', sep='\t').dropna(axis=0)

This is the pandas way of reading a CSV file, but it works on a txt file just as well. Since I used tab as the delimiter, I set sep='\t'.

A space delimiter wouldn't work, because it couldn't be told apart from the spaces inside the chat text. I also assumed there would be no NaN rows,

but some turned up during training, so I removed them with dropna.

 

train_data['label'] = train_data['label'].astype(np.int32)

 

label์„ 0~19๋กœ ์ ์–ด์ฃผ์—ˆ๋Š”๋ฐ train.head()๋กœ ํ™•์ธํ•ด๋ณด๋‹ˆ floatํ˜•์œผ๋กœ ๋‚˜์™€์žˆ์–ด์„œ intํ˜•์œผ๋กœ ๋ฐ”๊พธ์–ด์ฃผ์—ˆ๋‹ค.

floatํ˜•์œผ๋กœ๋„ ๋˜๋Š”์ง€๋Š” ์ž˜ ๋ชจ๋ฅด์ง€๋งŒ ๋ณดํ†ต ๋‚ด๊ฐ€ ๋ณด๊ธฐ์— label์€ intํ˜•์œผ๋กœ ๋งŽ์ด ์ ํ˜€์žˆ์–ด์„œ ๊ทธ๋ ‡๊ฒŒ ์ฒ˜๋ฆฌํ•ด์ฃผ์—ˆ๋‹ค.

 

sentences = ["[CLS] " + str(s) + " [SEP]" for s in train_data['text']]
labels = train_data['label'].values

BERT๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ๋ฌธ์žฅ์˜ ์•ž๋งˆ๋‹ค [CLS]๋ฅผ ๋ถ™์—ฌ ์ธ์‹ํ•˜๊ณ , [SEP]ํ‘œ์‹œ๋กœ ๋ฌธ์žฅ์˜ ์ข…๋ฃŒ๋ฅผ ์ธ์‹ํ•œ๋‹ค.

๊ทธ๋ž˜์„œ ๋ฌธ์žฅ์˜ ์•ž๊ฐ€ ๋’ค์— ํ‘œ์‹œ๋ฅผ ํ•ด์ฃผ์—ˆ๋‹ค.

 

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased", do_lower_case=False)
tokenized_texts = [tokenizer.tokenize(s) for s in sentences]

 

print(sentences[0])
print(tokenized_texts[0])

# [CLS] ์• ๋œ์•™ ๋‚˜ ๋„ˆ๋ฌด ๋ฐฐ๋ถˆ๋Ÿฌ์„œ ๋ฐฐ ์•„ํŒŒ [SEP]
# ['[CLS]', '์• ', '##๋œ', '##์•™', '๋‚˜', '๋„ˆ', '##๋ฌด', '๋ฐฐ', '##๋ถˆ', '##๋Ÿฌ', '##์„œ', '๋ฐฐ', '์•„', '##ํŒŒ', '[SEP]']

 

tokenizer๋ฅผ ๊ฐ€์ง€๊ณ ์™€์„œ ์šฐ๋ฆฌ ๋ฌธ์žฅ์„ ํ† ํฌ๋‚˜์ด์ง•์„ ํ•œ๋‹ค. ํ† ํฌ๋‚˜์ด์ง•์€ ๋‹จ์–ด์ง‘ํ•ฉ์— ์žˆ๋Š” ๋‹จ์–ด๋Š”

ํ•œ ๋ฌถ์Œ์œผ๋กœ ๋ฌถ์–ด์ฃผ๊ณ , ๋‹จ์–ด ์ง‘ํ•ฉ์— ์—†๋Š” ๋‹จ์–ด๋“ค์€ ์ชผ๊ฐœ์ค€๋‹ค. ์ง‘ํ•ฉ์— ์—†๋Š” ๋‹จ์–ด๋Š” ์ชผ๊ฐค๋•Œ ##๋ฅผ ๋ถ™์—ฌ์„œ

๋‹ค๋ฅธ ๋‹จ์–ด์—์„œ ์ชผ๊ฐœ์ ธ ๋‚˜์™”์Œ์„ ์•Œ๋ ค์ฃผ๋Š” ๊ฒƒ์ด๋‹ค. 

 

# ๋ฌธ์žฅ์˜ ์ตœ๋Œ€ ์‹œํ€€์Šค๋ฅผ ์„ค์ •ํ•˜์—ฌ ์ •์ˆ˜ ์ธ์ฝ”๋”ฉ๊ณผ ์ œ๋กœ ํŒจ๋”ฉ์„ ์ˆ˜ํ–‰
MAX_LEN = 128 #์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด ์„ค์ •

input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

 

attention_masks = []

for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

 

์–ดํ…์…˜ ๋งˆ์Šคํฌ๋ž€ 0 ๊ฐ’์„ ๊ฐ€์ง€๋Š” ํŒจ๋”ฉ ํ† ํฐ์— ๋Œ€ํ•ด์„œ ์–ดํ…์…˜ ์—ฐ์‚ฐ์„ ๋ถˆํ•„์š”ํ•˜๊ฒŒ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š๋„๋ก

๋‹จ์–ด์™€ ํŒจ๋”ฉ ํ† ํฐ์„ ๊ตฌ๋ถ„ํ•  ์ˆ˜ ์žˆ๊ฒŒ ์•Œ๋ ค์ฃผ๋Š” ๊ฒƒ์„ ๋งํ•œ๋‹ค.

 

๋”ฐ๋ผ์„œ [40311, 9435, 102, 0, 0]์™€ ๊ฐ™์€ ํŒจ๋”ฉ๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ์„ ๋•Œ, 

ํŒจ๋”ฉ๋œ ๊ฐ’์€ '0', ํŒจ๋”ฉ๋˜์ง€ ์•Š์€ ๋‹จ์–ด๋Š” '1'์˜ ๊ฐ’์„ ๊ฐ–๋„๋ก ์‹œ๋ฆฌ์–ผ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“ค์–ด ์ฃผ์–ด์•ผ ํ•œ๋‹ค.

 
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2000, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids, random_state=2000, test_size=0.1)     
                                                       
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)

train๊ณผ valid๋ฅผ ๋‚˜๋ˆ„์–ด์ค€๋‹ค. test์…‹์€ ์ฒ˜์Œ์— ๋งŒ๋“ค์—ˆ๋Š”๋ฐ, ๋ฐ์ดํ„ฐ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์•„

์ด ๊ธ€์„ ์ ๊ณ  ์žˆ๋Š” ์ค‘์—๋„ ํ•™์Šต์ด ์ง„ํ–‰์ค‘์ด๋‹ค. ๊ทธ๋ž˜์„œ test์…‹์€ ํ•˜์ง€ ์•Š์œผ๋ ค๊ณ  ๋งŒ๋“ค์–ด์ฃผ์ง€ ์•Š์•˜๋‹ค.

๋ฐ์ดํ„ฐ๊ฐ€ ์ ๊ฑฐ๋‚˜ ์‹œ๊ฐ„์ด ๋งŽ์€ ๋ถ„๋“ค์€ test ์…‹๋„ ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด ์ฃผ๋ฉด ๋œ๋‹ค.

 

batch_size = 1

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

 

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์•„์„œ batch_size๋ฅผ 64๋กœ ํ•˜๋‹ˆ๊นŒ

์šฉ๋Ÿ‰๋ฌธ์ œ๋กœ ์ง„ํ–‰์ด ๋˜์ง€ ์•Š์•„์„œ batch_size๋ฅผ 1๋กœ ๋งž์ถ”์—ˆ๋‹ค. 

 

1. Loading the BERT Model

 

n_devices = torch.cuda.device_count()
print(n_devices)

for i in range(n_devices):
    print(torch.cuda.get_device_name(i))
    
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

 

์ปดํ“จํ„ฐ์—์„œ GPU๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค. 

 

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=20)
model.cuda()
"""
...
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=20, bias=True)
  """

 

num_labels is where you put your own number of categories; it becomes out_features in the classifier layer at the very bottom of the printout.

I classify into 20 categories, so I set it to 20.

For binary classification, set num_labels to 2.

 

# ์˜ตํ‹ฐ๋งˆ์ด์ €
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)

# ์—ํญ์ˆ˜
epochs = 5

# ์ด ํ›ˆ๋ จ ์Šคํ… : ๋ฐฐ์น˜๋ฐ˜๋ณต ํšŸ์ˆ˜ * ์—ํญ
total_steps = len(train_dataloader) * epochs

# ์Šค์ผ€์ค„๋Ÿฌ ์ƒ์„ฑ
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)

 

# ์ •ํ™•๋„ ๊ณ„์‚ฐ ํ•จ์ˆ˜
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    return np.sum(pred_flat == labels_flat) / len(labels_flat)
    
    
# ์‹œ๊ฐ„ ํ‘œ์‹œ ํ•จ์ˆ˜
def format_time(elapsed):
    # ๋ฐ˜์˜ฌ๋ฆผ
    elapsed_rounded = int(round((elapsed)))
    
    # hh:mm:ss์œผ๋กœ ํ˜•ํƒœ ๋ณ€๊ฒฝ
    return str(datetime.timedelta(seconds=elapsed_rounded))

 

import gc 

# run Python's garbage collector to free unreferenced objects before training
gc.collect()

I kept getting 'CUDA out of memory' errors, so I added a gc.collect() call.

 

2. ๋ชจ๋ธ ํ•™์Šตํ•˜๊ธฐ

#๋žœ๋ค์‹œ๋“œ ๊ณ ์ •
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

#๊ทธ๋ž˜๋””์–ธํŠธ ์ดˆ๊ธฐํ™”
model.zero_grad()
# ํ•™์Šต
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # ์‹œ์ž‘ ์‹œ๊ฐ„ ์„ค์ •
    t0 = time.time()

    # ๋กœ์Šค ์ดˆ๊ธฐํ™”
    total_loss = 0

    # ํ›ˆ๋ จ๋ชจ๋“œ๋กœ ๋ณ€๊ฒฝ
    model.train()
        
    # ๋ฐ์ดํ„ฐ๋กœ๋”์—์„œ ๋ฐฐ์น˜๋งŒํผ ๋ฐ˜๋ณตํ•˜์—ฌ ๊ฐ€์ ธ์˜ด
    for step, batch in enumerate(train_dataloader):
        # ๊ฒฝ๊ณผ ์ •๋ณด ํ‘œ์‹œ
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # ๋ฐฐ์น˜๋ฅผ GPU์— ๋„ฃ์Œ
        batch = tuple(t.to(device) for t in batch)
        
        # ๋ฐฐ์น˜์—์„œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
        b_input_ids, b_input_mask, b_labels = batch
        
        b_input_ids = b_input_ids.long().to(device)
        b_input_mask = b_input_mask.long().to(device)
        b_labels = b_labels.type(torch.LongTensor).to(device)

        # Forward ์ˆ˜ํ–‰           
        outputs = model(input_ids=b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        
        # ๋กœ์Šค ๊ตฌํ•จ
        loss = outputs[0]

        # ์ด ๋กœ์Šค ๊ณ„์‚ฐ
        total_loss += loss.item()

        # Backward ์ˆ˜ํ–‰์œผ๋กœ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ
        loss.backward()

        # ๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ํ†ตํ•ด ๊ฐ€์ค‘์น˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ
        optimizer.step()

        # ์Šค์ผ€์ค„๋Ÿฌ๋กœ ํ•™์Šต๋ฅ  ๊ฐ์†Œ
        scheduler.step()

        # ๊ทธ๋ž˜๋””์–ธํŠธ ์ดˆ๊ธฐํ™”
        model.zero_grad()

    # ํ‰๊ท  ๋กœ์Šค ๊ณ„์‚ฐ
    avg_train_loss = total_loss / len(train_dataloader)            

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    #์‹œ์ž‘ ์‹œ๊ฐ„ ์„ค์ •
    t0 = time.time()

    # ํ‰๊ฐ€๋ชจ๋“œ๋กœ ๋ณ€๊ฒฝ
    model.eval()

    # ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # ๋ฐ์ดํ„ฐ๋กœ๋”์—์„œ ๋ฐฐ์น˜๋งŒํผ ๋ฐ˜๋ณตํ•˜์—ฌ ๊ฐ€์ ธ์˜ด
    for batch in validation_dataloader:
        # ๋ฐฐ์น˜๋ฅผ GPU์— ๋„ฃ์Œ
        batch = tuple(t.to(device) for t in batch)
        
        # ๋ฐฐ์น˜์—์„œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
        b_input_ids, b_input_mask, b_labels = batch
        
        # ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ ์•ˆํ•จ
        with torch.no_grad():     
            # Forward ์ˆ˜ํ–‰
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        
        # ๋กœ์Šค ๊ตฌํ•จ
        logits = outputs[0]

        # CPU๋กœ ๋ฐ์ดํ„ฐ ์ด๋™
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # ์ถœ๋ ฅ ๋กœ์ง“๊ณผ ๋ผ๋ฒจ์„ ๋น„๊ตํ•˜์—ฌ ์ •ํ™•๋„ ๊ณ„์‚ฐ
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")

 

#        # unpack the batch
#        b_input_ids, b_input_mask, b_labels = batch
#        
#        b_input_ids = b_input_ids.long().to(device)
#        b_input_mask = b_input_mask.long().to(device)
#        b_labels = b_labels.type(torch.LongTensor).to(device)
#
#        # forward pass
#        outputs = model(input_ids=b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

 

# Error raised without the conversion: nll_loss forward_reduce_cuda_kernel_2d index not implemented for 'Int'

 

์ฐธ๊ณ ํ–ˆ๋˜ ์ฝ”๋“œ์—์„œ๋Š” b_input_ids, b_input_mask, b_labels๋ฅผ ๊ทธ๋ƒฅ ๋„ฃ์—ˆ๋Š”๋ฐ ๋‚˜๋Š” ์—๋Ÿฌ๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค.

๊ทธ๋ž˜์„œ .long().to(device)๋กœ ์ฒ˜๋ฆฌํ•˜์—ฌ ํ•ด๊ฒฐํ•˜์˜€๋‹ค.

์•„์ง๋„ ์ง„ํ–‰์ค‘์ธ๋ฐ epoch๋ฅผ 5๋ฒˆ์ด๋‚˜ ๋Œ๋ฆฐ๊ฑธ ํ›„ํšŒํ•œ๋‹ค.

 

3. ๋ชจ๋ธ ํ…Œ์ŠคํŠธํ•˜๊ธฐ

 

# ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜
def convert_input_data(sentences):

    # BERT์˜ ํ† ํฌ๋‚˜์ด์ €๋กœ ๋ฌธ์žฅ์„ ํ† ํฐ์œผ๋กœ ๋ถ„๋ฆฌ
    tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

    # ์ž…๋ ฅ ํ† ํฐ์˜ ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด
    MAX_LEN = 128

    # ํ† ํฐ์„ ์ˆซ์ž ์ธ๋ฑ์Šค๋กœ ๋ณ€ํ™˜
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
    
    # ๋ฌธ์žฅ์„ MAX_LEN ๊ธธ์ด์— ๋งž๊ฒŒ ์ž๋ฅด๊ณ , ๋ชจ์ž๋ž€ ๋ถ€๋ถ„์„ ํŒจ๋”ฉ 0์œผ๋กœ ์ฑ„์›€
    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

    # ์–ดํ…์…˜ ๋งˆ์Šคํฌ ์ดˆ๊ธฐํ™”
    attention_masks = []

    # ์–ดํ…์…˜ ๋งˆ์Šคํฌ๋ฅผ ํŒจ๋”ฉ์ด ์•„๋‹ˆ๋ฉด 1, ํŒจ๋”ฉ์ด๋ฉด 0์œผ๋กœ ์„ค์ •
    # ํŒจ๋”ฉ ๋ถ€๋ถ„์€ BERT ๋ชจ๋ธ์—์„œ ์–ดํ…์…˜์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์•„ ์†๋„ ํ–ฅ์ƒ
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)

    # ๋ฐ์ดํ„ฐ๋ฅผ ํŒŒ์ดํ† ์น˜์˜ ํ…์„œ๋กœ ๋ณ€ํ™˜
    inputs = torch.tensor(input_ids)
    masks = torch.tensor(attention_masks)

    return inputs, masks

 

# ๋ฌธ์žฅ ํ…Œ์ŠคํŠธ
def test_sentences(sentences):

    # ํ‰๊ฐ€๋ชจ๋“œ๋กœ ๋ณ€๊ฒฝ
    model.eval()

    # ๋ฌธ์žฅ์„ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ๋ณ€ํ™˜
    inputs, masks = convert_input_data(sentences)

    # ๋ฐ์ดํ„ฐ๋ฅผ GPU์— ๋„ฃ์Œ
    b_input_ids = inputs.to(device)
    b_input_mask = masks.to(device)
            
    # ๊ทธ๋ž˜๋””์–ธํŠธ ๊ณ„์‚ฐ ์•ˆํ•จ
    with torch.no_grad():     
        # Forward ์ˆ˜ํ–‰
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask)

    # ๋กœ์Šค ๊ตฌํ•จ
    logits = outputs[0]

    # CPU๋กœ ๋ฐ์ดํ„ฐ ์ด๋™
    logits = logits.detach().cpu().numpy()

    return logits

 

logits = test_sentences(['๋” ๋‚˜์€ ํ•™๊ต์ƒํ™œ ํ•˜๊ณ  ์‹ถ์–ด'])
print(logits)
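The result is one row of 20 logits for the sentence. A minimal sketch for turning it into a readable label, assuming the category dict defined at the top of the post (when inverting, the two spellings of label 17 collapse into one name):

# Invert the category dict so a predicted index maps back to its name.
# Assumes `category` and `logits` from above.
id_to_category = {v: k for k, v in category.items()}

pred = int(np.argmax(logits, axis=1)[0])
print(pred, id_to_category[pred])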

 

 

์ฐธ๊ณ : https://velog.io/@seolini43
