[minGPT] play_image explained
Train GPT on images¶
Effectively re-implements OpenAI's Image GPT model, getting GPT to model images instead of text, but using a near identical model. It's truly quite remarkable that a single model can agnostically do a great job modeling whatever data you give it: text, images, or whatever else. At the end of the day it is just a sequence of integers. Notice that unlike models like PixelCNN++ etc, this model knows nothing at all about the spatial layout of the pixels and has to learn the appropriate positional embeddings that reflect the spatial topology of the data.
github: https://github.com/karpathy/minGPT
A notebook to help understand the minGPT code.
The attention mechanism itself is not explained here.
For the full code and reference material, see Andrej Karpathy's GitHub.
Purpose of play_image
Learn from image data and generate new images.
-> How? Image data is ultimately just numbers. Convert each image into sequential data and feed it to a transformer model
(i.e., predict the next pixel value): image -> sequence -> Model -> predicted sequence -> image (H, W, C)
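As a rough illustration of that round trip (not code from the original notebook; a random uint8 array stands in for a real image):
import numpy as np
H, W, C = 32, 32, 3
img = np.random.randint(0, 256, size=(H, W, C), dtype=np.uint8)  # fake (H, W, C) image
seq = img.reshape(-1)            # image -> 1D sequence of 32*32*3 = 3072 integers in 0..255
# ... a sequence model is trained to predict the next integer at every position ...
img_back = seq.reshape(H, W, C)  # predicted sequence -> (H, W, C) image again
assert (img == img_back).all()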
"""
Drive Mount and change directory
If you are not using Google Colab, you do not need this cell.
code: cs231n/assignments
"""
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
FOLDERNAME = 'minGPT/'
assert FOLDERNAME is not None, "[!] Enter the foldername."
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))
%cd drive/My\ Drive/$FOLDERNAME
Mounted at /content/drive /content/drive/My Drive/minGPT
import numpy as np
import torchvision
import torch
import matplotlib.pyplot as plt
%matplotlib inline
# set up logging
import logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
# make deterministic
from mingpt.utils import set_seed
set_seed(42)
# pytorch helpfully makes it easy to download datasets, e.g. the common CIFAR-10 https://www.kaggle.com/c/cifar-10
root = './'
train_data = torchvision.datasets.CIFAR10(root, train=True, transform=None, target_transform=None, download=True)
test_data = torchvision.datasets.CIFAR10(root, train=False, transform=None, target_transform=None, download=True)
print(len(train_data), len(test_data))
Files already downloaded and verified Files already downloaded and verified 50000 10000
Images are represented as arrays of size (height, width, 3), where the 3 channels hold the RGB values, each a uint8 in the range 0..255. In CIFAR-10, for example, the height and width are both 32.
naive strategy: Now, to feed images into GPT we have to somehow turn every image into a sequence of integers. Since each image is 32*32*3 = 3072 uint8s, in principle we could just flatten each image into a 3072-long sequence of numbers from 0..255 and train GPT on that. Note that we are free to feed the pixels to GPT in any arbitrary order, as long as that encoding order is fixed across all images. The problem is that GPT gets very expensive as the sequence length grows, since each new predicted integer is a function of all previously predicted integers, and the attention inside the Transformer modules scales poorly with sequence length.
k-means codebook strategy: Instead, the Image GPT strategy is to encode every individual RGB pixel into a codebook of 512 entries, which we train via the k-means clustering algorithm. This way we only have a 32*32 = 1024-long sequence, but now of integers in the range 0..511. This is a net saving in compute because we've "shrunk" the sequence length by a factor of 3. On the other hand, our token embedding parameters grow a bit in size, as does the Softmax classifier at the end.
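To make the trade-off concrete, here is a toy sketch (not the notebook's code; the random codebook below is just a stand-in for the one trained with k-means in the next cells) comparing the two sequence lengths for a single image:
import torch
img = torch.randint(0, 256, (32, 32, 3)).float()
naive_seq = img.view(-1)                        # 32*32*3 = 3072 tokens with values in 0..255
codebook = torch.rand(512, 3) * 255             # stand-in for the learned 512-entry codebook
d = ((img.view(-1, 1, 3) - codebook[None, :, :])**2).sum(-1)
codebook_seq = d.argmin(1)                      # 32*32 = 1024 tokens with values in 0..511
print(naive_seq.numel(), codebook_seq.numel())  # 3072 vs 1024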
Okay let's train our codebook with k-means now
k-means rgb clustering¶
Compressing an image's RGB values with the k-means algorithm.
Flattening an image normally gives a shape of (N, C x H x W)
(N: number of images, C: RGB channels, H: height, W: width).
After k-means RGB clustering the shape becomes (N, C x H x W) -> (N, H x W),
but the value range changes from (0, 255) -> (0, ncluster-1).
In other words, the r(0..255), g(0..255), b(0..255) triplets are compressed into ncluster codes.
So how is this possible?
# picture from https://github.com/karpathy/minGPT
from IPython.display import Image
Image('mingpt.jpg', width =500)
In the right-hand picture above (title: minGPT), the sea has an emerald tint almost everywhere.
The RGB values of those pixels are probably all quite similar.
Spelling out a full RGB triplet for every one of those sea pixels therefore looks inefficient.
If we find a suitable set of representative values with the k-means clustering algorithm, we can reduce the image's size considerably.
Reference: https://m.blog.naver.com/won19600/222037833707 (or search "k-means rgb")
# get 5 random pixels per image and stack them all up as rgb values to get a quarter of a million random pixels
pluck_rgb = lambda x: torch.from_numpy(np.array(x)).view(32*32, 3)[torch.randperm(32*32)[:5], :]
px = torch.cat([pluck_rgb(x) for x, y in train_data], dim=0).float()
print(px.size())
torch.Size([250000, 3])
Total number of training images: 50,000.
pluck_rgb: randomly extracts 5 pixels from each image.
cf) total number of pixels per image: 32 x 32 = 1024
Hence px.size() = (250000, 3): 5 pixels x 50,000 images = 250,000 rows, channel (RGB) = 3.
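For clarity, here is the same pluck_rgb logic written out step by step for a single image (this relies on train_data loaded above and is not part of the original notebook):
x, _ = train_data[0]               # x is a 32x32 PIL image
t = torch.from_numpy(np.array(x))  # -> (32, 32, 3) uint8 tensor
t = t.view(32*32, 3)               # flatten the spatial dims -> (1024, 3) rows of RGB
idx = torch.randperm(32*32)[:5]    # pick 5 random pixel positions
five = t[idx, :]                   # (5, 3): five random RGB pixels from this image
print(five.shape)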
# run kmeans to get our codebook
def kmeans(x, ncluster, niter=10):
    N, D = x.size()
    c = x[torch.randperm(N)[:ncluster]] # init clusters at random
    for i in range(niter):
        # assign all pixels to the closest codebook element
        a = ((x[:, None, :] - c[None, :, :])**2).sum(-1).argmin(1) # broadcasting semantics (explained below)
        # move each codebook element to be the mean of the pixels that were assigned to it
        c = torch.stack([x[a==k].mean(0) for k in range(ncluster)])
        # re-assign any poorly positioned codebook elements
        nanix = torch.any(torch.isnan(c), dim=1)
        ndead = nanix.sum().item()
        print('done step %d/%d, re-initialized %d dead clusters' % (i+1, niter, ndead))
        c[nanix] = x[torch.randperm(N)[:ndead]] # re-init dead clusters
    return c
ncluster = 512
with torch.no_grad():
    C = kmeans(px, ncluster, niter=8)
print(C.size())
done step 1/8, re-initialized 3 dead clusters done step 2/8, re-initialized 0 dead clusters done step 3/8, re-initialized 0 dead clusters done step 4/8, re-initialized 0 dead clusters done step 5/8, re-initialized 0 dead clusters done step 6/8, re-initialized 0 dead clusters done step 7/8, re-initialized 0 dead clusters done step 8/8, re-initialized 0 dead clusters torch.Size([512, 3])
k-means clustering¶
Goal: for each of the N sampled pixels, find the nearest of the ncluster codebook entries (by RGB distance), then move every entry to the mean RGB of the pixels assigned to it.
- compute the difference between every pixel and every codebook entry (vectorized code, no loops)
- square the differences so the distances are non-negative
- take the argmin over clusters: for each pixel, the index (0..ncluster-1) of its closest entry
- average the pixels assigned to each index and stack the means into the new codebook tensor
- if a codebook entry has no pixels assigned to it, re-initialize it from a randomly chosen pixel
- repeat -> (the number of dead clusters should keep shrinking)
# k-means clustering example
# source code: https://parkeunsang.github.io/blog/datascience/2021/03/10/pythonk-means.html
import matplotlib.pyplot as plt
import seaborn as sns
x = [] # data
k = 3 # hyperparameter: number of clusters (note: re-set to k = 50 below)
np.random.seed(2021)
x.extend(np.random.normal(loc=[0,0], scale=0.5, size=(100, 2)).tolist())
x.extend(np.random.normal(loc=[2,2], scale=0.5, size=(100, 2)).tolist())
x.extend(np.random.normal(loc=[-3,3], scale=0.5, size=(100, 2)).tolist())
x = np.array(x)
sns.scatterplot(x=x[:,0], y=x[:,1]);
def dist(a, b):
    return ((a-b)**2).sum(-1)

k = 50
num_iter = 3
c = x[np.random.choice(len(x), k)] # random init
for iter in range(num_iter):
    argmin = dist(x[:, None, :], c[None, :, :]).argmin(1)
    group = {}
    for i in range(k):
        group[i] = x[argmin==i]
    array = [group[i].mean(0) for i in range(k)]
    c = np.stack(array, axis=0)
    nanix = list(set(np.where(np.isnan(c))[0])) # indices of dead centers (no assigned points -> NaN mean)
    ndead = len(nanix)
    # visualize
    plt.subplot(3, 1, iter+1)
    for i in range(k):
        plt.scatter(group[i][:, 0], group[i][:, 1], marker='o')
    plt.scatter(c[:, 0], c[:, 1], c='black', marker='v')
    if ndead != 0:
        c[nanix] = x[np.random.choice(len(x), ndead)] # if a center has no assigned points, re-init it at a random data point and mark it in red
        plt.scatter(c[nanix, 0], c[nanix, 1], s=100, c='red', marker='v')
plt.show()
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:14: RuntimeWarning: Mean of empty slice. /usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py:182: RuntimeWarning: invalid value encountered in true_divide ret, rcount, out=ret, casting='unsafe', subok=False)
Broadcasting semantics¶
x[:, None, :].size(): (N, 1, D)
c[None, :, :].size(): (1, ncluster, D)
how to calculate x[:, None, :] - c[None, :, :] ?
Dimensions of size 1 are expanded (virtually copied) to match the other tensor's size:
c[None, :, :].size(): (1, ncluster, D) -> (N, ncluster, D)
likewise for x: (N, 1, D) -> (N, ncluster, D)
See "Broadcasting semantics" in the PyTorch docs for details.
why broadcasting semantics?
It avoids Python loops, so it is both faster and more concise.
# broadcasting semantics example
import time
N=4
D=5
ncluster = 3
x= torch.ones(N, D)
c= x[:ncluster]
# vectorized version
tic = time.time()
one = x[:, None, :] # size: (4, 1, 5) -> (4, 3, 5)
two = 2 * c[None, :, :] # size: (1, 3, 5) -> (4, 3, 5)
result = one+two
toc = time.time()
print((result).size())
print('result')
print((result).numpy())
# naive version
t1=time.time()
result_2 = torch.empty((N, ncluster, D))
for n in range(N):
    for nc in range(ncluster):
        result_2[n, nc, :] = x[n, :] + 2 * c[nc, :]
t2=time.time()
print('Each of all result values is same? ',(result == result_2).all())
print('vectorized code is x %f faster than naive version '%( (t2-t1)/(toc-tic) ) )
# ref: cs231n/assignment
torch.Size([4, 3, 5]) result [[[3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.]] [[3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.]] [[3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.]] [[3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.] [3. 3. 3. 3. 3.]]] Each of all result values is same? tensor(True) vectorized code is x 2.401875 faster than naive version
# encode the training examples with our codebook to visualize how much we've lost in the discretization
n_samples = 16
ncol = 8
nrow = n_samples // ncol + 1
plt.figure(figsize=(20, 10))
for i in range(n_samples):
    # encode and decode random data
    x, y = train_data[np.random.randint(0, len(train_data))]
    xpt = torch.from_numpy(np.array(x)).float().view(32*32, 3)
    ix = ((xpt[:, None, :] - C[None, :, :])**2).sum(-1).argmin(1) # cluster assignments for each pixel
    # these images should look normal ideally
    plt.subplot(nrow, ncol, i+1)
    plt.imshow(C[ix].view(32, 32, 3).numpy().astype(np.uint8))
    plt.axis('off')
The images above look relatively reasonable, so our 512-sized codebook is enough to reasonably re-represent RGB values. Ok cool. So now every image is just a 1024-long sequence of numbers between 0..511. Time to train a GPT.
from torch.utils.data import Dataset
class ImageDataset(Dataset):
    """
    wrap up the pytorch CIFAR-10 dataset into our own, which will convert images into sequences of integers
    """
    def __init__(self, pt_dataset, clusters, perm=None):
        self.pt_dataset = pt_dataset
        self.clusters = clusters
        self.perm = torch.arange(32*32) if perm is None else perm
        self.vocab_size = clusters.size(0)
        self.block_size = 32*32 - 1

    def __len__(self):
        return len(self.pt_dataset)

    def __getitem__(self, idx):
        x, y = self.pt_dataset[idx]
        x = torch.from_numpy(np.array(x)).view(-1, 3) # flatten out all pixels
        x = x[self.perm].float() # reshuffle pixels with any fixed permutation and -> float
        a = ((x[:, None, :] - self.clusters[None, :, :])**2).sum(-1).argmin(1) # cluster assignments
        return a[:-1], a[1:] # always just predict the next one in the sequence
train_dataset = ImageDataset(train_data, C)
test_dataset = ImageDataset(test_data, C)
train_dataset[0][0] # one example image flattened out into integers
tensor([449, 229, 229, ..., 379, 0, 177])
For reference, iGPT-S from the paper is:
- batch size of 128 and trained for 1M iterations
- Adam lr 0.003 with betas = (0.9, 0.95)
- learning rate is warmed up for one epoch, then decays to 0
- did not use weight decay or dropout
- n_layer=24, n_head=8, n_embd=512
We will do something similar, but smaller.
from mingpt.model import GPT, GPTConfig, GPT1Config
# we'll do something a bit smaller
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
embd_pdrop=0.0, resid_pdrop=0.0, attn_pdrop=0.0,
n_layer=12, n_head=8, n_embd=256)
model = GPT(mconf)
08/23/2020 15:41:51 - INFO - mingpt.model - number of parameters: 1.000166e+07
# minGPT model architecture
from IPython.display import Image
Image('minGPT.png', width=1000)
GPT model init¶
model.py
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        ...
        # layer init
        self.apply(self._init_weights) # applies _init_weights recursively to every submodule (ref: pytorch docs)

    def _init_weights(self, module):
        '''
        layer weight init
        1) Linear and Embedding weights -> normal(mean=0, std=0.02)
        2) Linear bias -> zeros
        3) LayerNorm weights -> ones, bias -> zeros
        4) pos_emb -> normal(mean=0, std=0.02)
        '''
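For reference, a minimal sketch of an _init_weights along those lines (see model.py in the repo for the exact version; the positional embedding is handled separately, as noted above):
import torch.nn as nn
def _init_weights(module):
    # Linear / Embedding weights: normal(0, 0.02); Linear bias: zeros
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
    # LayerNorm: weight -> ones, bias -> zeros
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)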
from mingpt.trainer import Trainer, TrainerConfig
"""
Note that I am running on an 8-GPU V100 machine so each GPU has 32GB.
If you don't have as many computational resources you have to bring down
the batch_size until the model fits into your memory, and then you may
also need to adjust the learning rate (e.g. decrease it a bit). Alternatively,
you can use an even smaller model up above, bringing down the number of layers,
number of heads, and the embedding size.
"""
tokens_per_epoch = len(train_data) * train_dataset.block_size # len(train_data) = 50000, train_dataset.block_size = 32*32 - 1 = 1023
train_epochs = 20 # todo run a bigger model and longer, this is tiny
# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=train_epochs, batch_size=16*8, learning_rate=3e-3,
betas = (0.9, 0.95), weight_decay=0,
lr_decay=True, warmup_tokens=tokens_per_epoch, final_tokens=train_epochs*tokens_per_epoch,
ckpt_path='cifar10_model.pt',
num_workers=8)
trainer = Trainer(model, train_dataset, test_dataset, tconf)
trainer.train()
0%| | 0/391 [00:00<?, ?it/s]/apcv/shared/conda-envs/apcv-6244e1d-566/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
epoch 1 iter 390: train loss 2.51640. lr 3.000000e-03: 100%|██████████| 391/391 [04:19<00:00, 1.51it/s]
08/23/2020 15:46:42 - INFO - mingpt.trainer - test loss: 2.535099
08/23/2020 15:46:42 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 2 iter 390: train loss 2.30809. lr 2.979542e-03: 100%|██████████| 391/391 [03:28<00:00, 1.88it/s]
epoch 3 iter 390: train loss 2.17811. lr 2.918726e-03: 100%|██████████| 391/391 [03:29<00:00, 1.86it/s]
08/23/2020 15:54:33 - INFO - mingpt.trainer - test loss: 2.202279
08/23/2020 15:54:33 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 4 iter 390: train loss 2.11399. lr 2.819211e-03: 100%|██████████| 391/391 [03:30<00:00, 1.86it/s]
08/23/2020 15:58:30 - INFO - mingpt.trainer - test loss: 2.145966
08/23/2020 15:58:30 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 5 iter 390: train loss 2.07712. lr 2.683711e-03: 100%|██████████| 391/391 [03:30<00:00, 1.86it/s]
08/23/2020 16:02:26 - INFO - mingpt.trainer - test loss: 2.110127
08/23/2020 16:02:26 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 6 iter 390: train loss 2.05083. lr 2.515922e-03: 100%|██████████| 391/391 [03:29<00:00, 1.87it/s]
08/23/2020 16:06:21 - INFO - mingpt.trainer - test loss: 2.092440
08/23/2020 16:06:21 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 7 iter 45: train loss 2.05004. lr 2.494243e-03: 12%|█▏ | 46/391 [00:26<03:06, 1.85it/s]
(output for the remainder of epoch 7 was truncated by the notebook's IOPub message rate limit)
epoch 8 iter 390: train loss 2.00891. lr 2.102543e-03: 100%|██████████| 391/391 [03:32<00:00, 1.84it/s]
08/23/2020 16:14:17 - INFO - mingpt.trainer - test loss: 2.065428
08/23/2020 16:14:17 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 9 iter 390: train loss 1.99380. lr 1.868228e-03: 100%|██████████| 391/391 [03:30<00:00, 1.86it/s]
08/23/2020 16:18:13 - INFO - mingpt.trainer - test loss: 2.057797
08/23/2020 16:18:13 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 10 iter 390: train loss 1.97990. lr 1.623869e-03: 100%|██████████| 391/391 [03:32<00:00, 1.84it/s]
08/23/2020 16:22:14 - INFO - mingpt.trainer - test loss: 2.052964
08/23/2020 16:22:14 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 11 iter 390: train loss 1.96656. lr 1.376131e-03: 100%|██████████| 391/391 [03:29<00:00, 1.87it/s]
08/23/2020 16:26:10 - INFO - mingpt.trainer - test loss: 2.048069
08/23/2020 16:26:10 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 12 iter 390: train loss 1.95148. lr 1.131772e-03: 100%|██████████| 391/391 [03:28<00:00, 1.87it/s]
08/23/2020 16:30:07 - INFO - mingpt.trainer - test loss: 2.044518
08/23/2020 16:30:07 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 13 iter 390: train loss 1.93635. lr 8.974569e-04: 100%|██████████| 391/391 [03:29<00:00, 1.87it/s]
08/23/2020 16:34:04 - INFO - mingpt.trainer - test loss: 2.042194
08/23/2020 16:34:04 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 14 iter 390: train loss 1.92239. lr 6.795778e-04: 100%|██████████| 391/391 [03:29<00:00, 1.87it/s]
08/23/2020 16:38:01 - INFO - mingpt.trainer - test loss: 2.039827
08/23/2020 16:38:01 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 15 iter 390: train loss 1.91114. lr 4.840776e-04: 100%|██████████| 391/391 [03:31<00:00, 1.85it/s]
08/23/2020 16:41:59 - INFO - mingpt.trainer - test loss: 2.039363
08/23/2020 16:41:59 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 16 iter 390: train loss 1.90019. lr 3.162892e-04: 100%|██████████| 391/391 [03:30<00:00, 1.86it/s]
08/23/2020 16:45:57 - INFO - mingpt.trainer - test loss: 2.037797
08/23/2020 16:45:57 - INFO - mingpt.trainer - saving cifar10_model.pt
epoch 17 iter 390: train loss 1.89348. lr 3.000000e-04: 100%|██████████| 391/391 [03:29<00:00, 1.86it/s]
08/23/2020 16:49:54 - INFO - mingpt.trainer - test loss: 2.039623
epoch 18 iter 390: train loss 1.88827. lr 3.000000e-04: 100%|██████████| 391/391 [03:28<00:00, 1.87it/s]
08/23/2020 16:53:51 - INFO - mingpt.trainer - test loss: 2.041819
epoch 19 iter 390: train loss 1.88321. lr 3.000000e-04: 100%|██████████| 391/391 [03:27<00:00, 1.88it/s]
08/23/2020 16:57:45 - INFO - mingpt.trainer - test loss: 2.044153
epoch 20 iter 390: train loss 1.87812. lr 3.000000e-04: 100%|██████████| 391/391 [03:28<00:00, 1.88it/s]
08/23/2020 17:01:40 - INFO - mingpt.trainer - test loss: 2.046611
# load the state of the best model we've seen based on early stopping
checkpoint = torch.load('cifar10_model.pt')
model.load_state_dict(checkpoint)
<All keys matched successfully>
GPT model train¶
- get optimizer
- train
- data load
- forward
- backprop and gradient descent
- model save (if needed)
get optimizer(AdamW)¶
trainer.py
class Trainer:
    def train(self):
        ...
        optimizer = raw_model.configure_optimizers(config)
        ...
model.py
class GPT(nn.Module):
    def configure_optimizers(self, train_config):
        """
        Split the parameters into those that get weight decay and those that don't, then apply AdamW.
        decay: Linear weights
        no decay: biases, layernorm/embedding weights, position embedding (special case)
        """
        for mn, m in self.named_modules(): # mn: module name, m: module
            for pn, p in m.named_parameters(): # pn: parameter name, p: parameter tensor
                fpn = '%s.%s' % (mn, pn) if mn else pn # fully qualified parameter name
                ...
        '''
        sanity checks
        inter_params = decay & no_decay  -> must be the empty set
        union_params = decay | no_decay  -> must cover all parameters
        '''
        '''
        apply AdamW
        learning_rate=6e-4, warmup_tokens=1024, betas = (0.9, 0.95)
        '''
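Condensed into runnable form, the idea looks roughly like this (a sketch only: the name-based split below is a crude stand-in for the per-module type checks in model.py, and the hyperparameter defaults are illustrative):
import torch
def make_optimizer(model, learning_rate, betas=(0.9, 0.95), weight_decay=0.1):
    decay, no_decay = [], []
    for fpn, p in model.named_parameters():
        # rough rule of thumb: biases, norm and embedding weights get no weight decay
        if fpn.endswith('bias') or 'ln' in fpn or 'emb' in fpn:
            no_decay.append(p)
        else:
            decay.append(p)
    groups = [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]
    return torch.optim.AdamW(groups, lr=learning_rate, betas=betas)
# e.g. optimizer = make_optimizer(model, learning_rate=3e-3)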
# named_modules, named_parameters() example
import torch.nn as nn
layer = nn.Sequential(
nn.Dropout(p=0.1),
nn.Linear(1024, 10)
)
for mn, m in layer.named_modules():
    print('mn: ', mn)
    print('m: ', m)
    for pn, p in m.named_parameters():
        fpn = '%s.%s' % (mn, pn) if mn else pn
        print('fpn: ', fpn)
        print('pn: ', pn)
        print('p: ', p)
mn:
m: Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=1024, out_features=10, bias=True)
)
fpn: 1.weight
pn: 1.weight
p: Parameter containing:
tensor([[-0.0126, 0.0164, -0.0279, ..., -0.0249, -0.0303, -0.0266],
[-0.0062, 0.0159, -0.0307, ..., -0.0046, -0.0190, 0.0276],
[-0.0294, -0.0212, 0.0291, ..., 0.0173, -0.0175, -0.0051],
...,
[-0.0257, -0.0015, 0.0083, ..., -0.0071, 0.0157, 0.0152],
[-0.0224, -0.0037, -0.0237, ..., -0.0111, 0.0217, 0.0149],
[-0.0261, 0.0095, -0.0055, ..., -0.0095, -0.0130, -0.0168]],
requires_grad=True)
fpn: 1.bias
pn: 1.bias
p: Parameter containing:
tensor([-0.0262, 0.0004, -0.0271, -0.0243, -0.0113, -0.0098, -0.0165, -0.0054,
-0.0263, 0.0002], requires_grad=True)
mn: 0
m: Dropout(p=0.1, inplace=False)
mn: 1
m: Linear(in_features=1024, out_features=10, bias=True)
fpn: 1.weight
pn: weight
p: Parameter containing:
tensor([[-0.0126, 0.0164, -0.0279, ..., -0.0249, -0.0303, -0.0266],
[-0.0062, 0.0159, -0.0307, ..., -0.0046, -0.0190, 0.0276],
[-0.0294, -0.0212, 0.0291, ..., 0.0173, -0.0175, -0.0051],
...,
[-0.0257, -0.0015, 0.0083, ..., -0.0071, 0.0157, 0.0152],
[-0.0224, -0.0037, -0.0237, ..., -0.0111, 0.0217, 0.0149],
[-0.0261, 0.0095, -0.0055, ..., -0.0095, -0.0130, -0.0168]],
requires_grad=True)
fpn: 1.bias
pn: bias
p: Parameter containing:
tensor([-0.0262, 0.0004, -0.0271, -0.0243, -0.0113, -0.0098, -0.0165, -0.0054,
-0.0263, 0.0002], requires_grad=True)
train¶
Back to the train() function.
trainer.py
class Trainer:
    def train(self):
        ...
        # optimizer
        def run_epoch(loader, is_train):
            # omitted for now (expanded in the run_epoch section below)
        best_loss = float('inf')
        self.tokens = 0 # token counter used for learning rate decay (decay starts once tokens >= warmup_tokens)
        """
        train data loader
        test data loader
        cf) A DataLoader is an iterable that keeps feeding the model batches of the configured batch size,
        so you don't have to hand-roll random batch sampling yourself. It also supports single- and
        multi-process data loading, among other things. See torch.utils.data in the PyTorch docs
        (a minimal usage sketch follows after this snippet).
        """
        # run run_epoch every epoch
        for epoch in range(config.max_epochs):
            run_epoch(train_loader, is_train=True)
            if self.test_dataset is not None:
                test_loss = run_epoch(test_loader, is_train=False)
            # save the best model so far
            good_model = self.test_dataset is None or test_loss < best_loss
            if self.config.ckpt_path is not None and good_model:
                best_loss = test_loss
                self.save_checkpoint()
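As referenced in the docstring above, a minimal DataLoader usage sketch in the spirit of what Trainer.train builds internally (the exact arguments in trainer.py, such as pin_memory, may differ):
from torch.utils.data import DataLoader
loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
xb, yb = next(iter(loader))   # one batch of (input tokens, next-token targets)
print(xb.shape, yb.shape)     # torch.Size([128, 1023]) torch.Size([128, 1023])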
run_epoch¶
- record the loss
- if train:
  - gradient descent
  - learning rate decay (if enabled)
- if test:
  - return test_loss
trainer.py
class Trainer:
    def train(self):
        def run_epoch(loader, is_train):
            model.train(is_train) # nn.Module method: puts the model in train or eval mode (for dropout, batchnorm, etc.) - not the Trainer.train function above
            losses = []
            pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader) # progress bar, just to see how long an epoch takes (see the tqdm library for details)
            for it, (x, y) in pbar:
                x = x.to(self.device)
                y = y.to(self.device)
                with torch.set_grad_enabled(is_train): # enable grad if training, disable it otherwise
                    ...
                if is_train:
                    # backprop and gradient update
                    if config.lr_decay:
                        self.tokens += (y >= 0).sum() # number of tokens processed this step (i.e. labels that are not -100)
                        if self.tokens < config.warmup_tokens:
                            # linear warmup
                        else:
                            # cosine learning rate decay
                        lr = config.learning_rate * lr_mult
                        for param_group in optimizer.param_groups:
                            # set the new learning rate
                    else:
                        # no learning rate decay
                    # report (epoch, iter, train loss, lr)
            if not is_train:
                test_loss = float(np.mean(losses))
                logger.info('test loss: %f', test_loss)
                return test_loss
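The two elided branches compute a learning-rate multiplier. Written out as a standalone function it looks roughly like this (the 0.1 floor and the token-based progress follow trainer.py; check the source for the exact constants):
import math
def lr_multiplier(tokens, warmup_tokens, final_tokens):
    if tokens < warmup_tokens:
        # linear warmup from 0 to 1 over the first warmup_tokens tokens
        return float(tokens) / float(max(1, warmup_tokens))
    # cosine decay from 1 down to a floor of 0.1 afterwards
    progress = float(tokens - warmup_tokens) / float(max(1, final_tokens - warmup_tokens))
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
# inside run_epoch: lr = config.learning_rate * lr_multiplier(self.tokens, config.warmup_tokens, config.final_tokens)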
# to sample we also have to technically "train" a separate model for the first token in the sequence
# we are going to do so below simply by calculating and normalizing the histogram of the first token
counts = torch.ones(ncluster) # start counts as 1 not zero, this is called "smoothing"
rp = torch.randperm(len(train_dataset))
nest = 5000 # how many images to use for the estimation
for i in range(nest):
    a, _ = train_dataset[int(rp[i])] # a[:-1], a[1:] = train_dataset[index]
    t = a[0].item() # index of first token in the sequence
    counts[t] += 1
prob = counts/counts.sum()
%%time
from mingpt.utils import sample
'''
Draw n_samples starting tokens at random from 0..ncluster-1, giving an (n_samples, 1) tensor.
The draw uses the first-token probabilities estimated from train_dataset above.
The model then generates an image from each starting token (the values are cluster indices, not RGB).
For visualization we therefore have to map the cluster indices back to RGB values later.
'''
n_samples = 32
start_pixel = np.random.choice(np.arange(C.size(0)), size=(n_samples, 1), replace=True, p=prob)
start_pixel = torch.from_numpy(start_pixel).to(trainer.device)
pixels = sample(model, start_pixel, 32*32-1, temperature=1.0, sample=True, top_k=100)
CPU times: user 1min 14s, sys: 35.2 s, total: 1min 49s Wall time: 1min 48s
# for visualization we have to invert the permutation used to produce the pixels
# train_dataset.perm = torch.arange(32*32) if perm is None else perm
iperm = torch.argsort(train_dataset.perm)
ncol = 8
nrow = n_samples // ncol
plt.figure(figsize=(16, 8))
for i in range(n_samples):
    pxi = pixels[i][iperm] # note: undo the encoding permutation
    plt.subplot(nrow, ncol, i+1)
    plt.imshow(C[pxi].view(32, 32, 3).numpy().astype(np.uint8))
    plt.axis('off')
# visualize some of the learned positional embeddings, maybe they contain structure
plt.figure(figsize=(5, 5))
nsee = 8*8
ncol = 8
nrow = nsee // ncol
for i in range(nsee):
    ci = model.pos_emb.data[0, :, i].cpu()
    zci = torch.cat((torch.tensor([0.0]), ci)) # pre-cat a zero
    rzci = zci[iperm] # undo the permutation to recover the pixel space of the image
    plt.subplot(nrow, ncol, i+1)
    plt.imshow(rzci.view(32, 32).numpy())
    plt.axis('off')
# huh, pretty cool! :P