A Comprehensive Summary of Pretrained Models with transformers (Part 2)
阿新 · Published 2021-02-06
Following on from Part 1, this post covers how to train models with the transformers library on a custom dataset. At its core, the problem really comes down to reading the dataset in sensibly.
Text Classification
We use the aclImdb dataset. I prefer to read the texts directly into Python lists.
(I) Data Preparation
# Read the data
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # use == for string comparison; `is` only checks object identity
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

# Split off a validation set
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

## Tokenization
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

## Vectorize the texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
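Before building a dataset around these encodings, it can help to check what the fast tokenizer actually returns. A minimal sanity-check sketch (the printed fields and index 0 are just illustrative; the exact keys depend on the model):

print(train_encodings.keys())                # dict_keys(['input_ids', 'attention_mask']) for DistilBERT
print(len(train_encodings['input_ids']))     # one entry per training example
print(train_encodings['input_ids'][0][:10])  # first ten token ids of the first review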
(II) Pipeline Construction
1. PyTorch implementation
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Convert each encoding field (input_ids, attention_mask, ...) to a tensor for one example
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
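Since IMDbDataset implements __getitem__ and __len__, it plugs directly into a standard PyTorch DataLoader. A minimal sketch (the batch size is arbitrary; because padding=True padded all examples to the same length, the default collate works):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
batch = next(iter(train_loader))
print(batch['input_ids'].shape)  # (16, max_seq_len)
print(batch['labels'].shape)     # (16,)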
2. TensorFlow implementation
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
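from_tensor_slices yields one example at a time, so if you train with Keras' model.fit (rather than a Trainer, which batches internally), the dataset still needs shuffling and batching. A minimal sketch with arbitrary buffer and batch sizes:

# Shuffle the training examples and group them into batches for model.fit
train_batches = train_dataset.shuffle(1000).batch(16)
val_batches = val_dataset.batch(64)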
(III) Training Modes
1. Training with the built-in Trainer
(1) PyTorch implementation
# Path to a locally saved checkpoint. Note: the tokenizer above was loaded from
# 'distilbert-base-uncased', so the model weights should match the uncased variant.
model_path = "H:\\code\\Model\\distilbert-base-cased\\"
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
model = DistilBertForSequenceClassification.from_pretrained(model_path)
trainer = Trainer(
    model=model,                  # the instantiated Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)

trainer.train()
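Once training finishes, the same Trainer can score the held-out split and persist the fine-tuned weights. A minimal sketch (the output path is just an example; without a compute_metrics function, evaluate reports only the loss):

metrics = trainer.evaluate(eval_dataset=test_dataset)
print(metrics)                          # e.g. {'eval_loss': ...}
trainer.save_model('./results/final')   # saves the model weights and config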