
NLP (26): How to Fine-Tune GPT-2 for Text Generation

Natural language generation (NLG) has made incredible progress in recent years. In early 2019, OpenAI released GPT-2, a huge pretrained model (1.5B parameters) capable of generating text of human-like quality. As its name suggests, the Generative Pretrained Transformer 2 (GPT-2) is based on the Transformer architecture. It therefore uses an attention mechanism, which means it learns to focus on the previous words that are most relevant to the context in order to predict the next word. The purpose of this article is to show you how to fine-tune GPT-2 so that it generates contextually relevant text based on the data you provide. As an example, I will generate song lyrics. The idea is to take an already-trained model, fine-tune it on our specific data, and then, based on what the model has observed, generate what should follow in any given song.

1. Preparing the Data

GPT-2 can already generate decent-quality text on its own. However, if you want it to do better in a specific context, you need to fine-tune it on your specific data. In my case, since I want to generate song lyrics, I will use the following Kaggle dataset (https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres), which contains a total of 12,500 pop and rock song lyrics, all in English.

Sample data: artists-data.csv

Artist,Songs,Popularity,Link,Genre,Genres
10000 Maniacs,110,0.3,/10000-maniacs/,Rock,Rock; Pop; Electronica; Dance; J-Pop/J-Rock; Gospel/Religioso; Infantil; Emocore
12 Stones,75,0.3,/12-stones/,Rock,Rock; Gospel/Religioso; Hard Rock; Grunge; Rock Alternativo; Hardcore; Punk Rock; Chillout; Electronica; Heavy Metal; Metal; World Music; Axé; Emocore
311,196,0.5,/311/,Rock,Rock; Surf Music; Reggae; Ska; Pop/Rock; Rock Alternativo; Hardcore
4 Non Blondes,15,7.5,/4-non-blondes/,Rock,Rock; Pop/Rock; Rock Alternativo; Grunge; Blues; Pop; Soft Rock; Power-Pop; Piano Rock; Indie; Chillout
A Cruz Está Vazia,13,0,/a-cruz-esta-vazia/,Rock,Rock
Aborto Elétrico,36,0.1,/aborto-eletrico/,Rock,Rock; Punk Rock; Pós-Punk; Post-Rock
Abril,36,0.1,/abril/,Rock,Rock; Emocore; Hardcore; Pop/Rock; Rock Alternativo; Romântico; Hard Rock; Blues; World Music
Abuse,13,0,/abuse/,Rock,Rock; Hardcore
AC/DC,192,10.8,/ac-dc/,Rock,Rock; Heavy Metal; Classic Rock; Hard Rock; Clássico; Metal; Punk Rock; Blues; Black Music; Rockabilly; Psicodelia; Funk Carioca; Rock Alternativo; Trilha Sonora; New Age; Hip Hop; New Wave; Sertanejo; Post-Rock; Pop/Rock; MPB; Electronica; Grunge; Progressivo; Pop/Punk; Funk; Forró
ACEIA,0,0,/aceia/,Rock,Rock
Acid Tree,5,0,/acid-tree/,Rock,Rock; Heavy Metal; Metal
Adam Lambert,110,1.4,/adam-lambert/,Pop,Pop; Pop/Rock; Rock; Romântico; Dance; Electronica; Emocore; Power-Pop; Axé; Gótico; R&B; Punk Rock; Pop/Punk; Black Music; Rock Alternativo; World Music; J-Pop/J-Rock; Gospel/Religioso; Hip Hop; K-Pop/K-Rock; Piano Rock; Heavy Metal; Velha Guarda; Soul Music; Hard Rock; Country; Soft Rock; Tecnopop; House; Trilha Sonora; Blues
Adrian Suirady,7,0,/adrian-suirady/,Rock,Rock; Gótico
Aerosmith,249,16.5,/aerosmith/,Rock,Rock; Hard Rock; Heavy Metal; Romântico; Pop/Rock; Classic Rock; Rock Alternativo; Blues; Metal; Chillout; Piano Rock; Funk; Gótico; Forró; Jovem Guarda; Hip Hop
Aliados,75,0.8,/aliados/,Rock,Rock; Pop/Rock; Rock Alternativo; Surf Music; Hardcore; Pop/Punk; Blues; R&B; Punk Rock; Axé
Alice Cooper,310,1.2,/alice-cooper/,Rock,Rock; Hard Rock; Heavy Metal; Punk Rock; Classic Rock; Grunge; Trilha Sonora; Gótico
Alter Bridge,74,1.4,/alter-bridge/,Rock,Rock; Hard Rock; Rock Alternativo; Heavy Metal; Grunge; Romântico; Rap; Metal; Hardcore
Amy Lee,33,0.5,/amy-lee/,Rock,Rock; Gótico; Hard Rock; Rock Alternativo; Heavy Metal; Piano Rock; Romântico; Metal; Indie; Classic Rock; New Age; Funk; Electronica; Industrial; Post-Rock; Psicodelia; Funk Carioca; Infantil; Pós-Punk; Dance; Pop; Clássico; Axé; Trilha Sonora
Anberlin,98,0.1,/anberlin/,Rock,Rock; Rock Alternativo; Hardcore; Emocore; Gospel/Religioso
Andi Deris,44,0,/andi-deris/,Rock,Rock; Hard Rock; Heavy Metal
Andrew W.K.,31,0,/andrew-w-k/,Rock,Rock
Andy (Brasil),7,0,/andy-brasil/,Rock,Rock
Angra,124,2.2,/angra/,Rock,Rock; Heavy Metal; Hard Rock; Progressivo; Metal; Black Music; Piano Rock; Post-Rock; Romântico; Psicodelia; Hardcore; Clássico; Forró; Pagode
Arthur Brown,2,0,/arthur-brown/,Rock,Rock
Asking Alexandria,77,1,/asking-alexandria/,Rock,Rock; Hard Rock; Hardcore; Heavy Metal; Emocore; Metal; Rock Alternativo; K-Pop/K-Rock; Classic Rock; Samba; Tecnopop; Grunge; Reggae; Chillout; World Music; Pop/Rock; Black Music; Gótico; Punk Rock; New Age
Autoramas,67,0.1,/autoramas/,Rock,Rock; Pop/Rock; Rock Alternativo; Progressivo; Indie; Punk Rock; Hardcore; Surf Music; Electronica; Funk; Pagode; Ska; R&B; Samba; New Age; MPB; Axé; Funk Carioca; Emocore; Grunge
Avante,21,0,/avante/,Rock,Rock

Sample data: lyrics-data.csv

ALink,SName,SLink,Lyric,Idiom
/10000-maniacs/,More Than This,/10000-maniacs/more-than-this.html,I could feel at the time. There was no way of knowing. Fallen leaves in the night. Who can say where they're blowing. As free as the wind. Hopefully learning. Why the sea on the tide. Has no way of turning. More than this. You know there's nothing. More than this. Tell me one thing. More than this. You know there's nothing. It was fun for a while. There was no way of knowing. Like a dream in the night. Who can say where we're going. No care in the world. Maybe I'm learning. Why the sea on the tide. Has no way of turning. More than this. You know there's nothing. More than this. Tell me one thing. More than this. You know there's nothing. More than this. You know there's nothing. More than this. Tell me one thing. More than this. There's nothing.,ENGLISH
/10000-maniacs/,Because The Night,/10000-maniacs/because-the-night.html,"Take me now, baby, here as I am. Hold me close, and try and understand. Desire is hunger is the fire I breathe. Love is a banquet on which we feed. Come on now, try and understand. The way I feel under your command. Take my hand, as the sun descends. They can't hurt you now can't hurt you now, can't hurt you now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Cause the night belongs to us. Have I a doubt, baby, when I'm alone. Love is a ring a telephone. Love is an angel, disguised as lust. Here in our bed 'til the morning comes. Come on now, try and understand. The way I feel under your command. Take my hand, as the sun descends. They can't hurt you now, can't hurt you now, can't hurt you now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Because the night belongs to us. With love we sleep,. with doubt the vicious circle turns, and burns. Without you, oh I cannot live,. forgive the yearning burning. I believe it's time to heal to feel,. so take me now, take me now, take me now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Because the night belongs to us",ENGLISH
/10000-maniacs/,These Are Days,/10000-maniacs/these-are-days.html,"These are. These are days you'll remember. Never before and never since, I promise. Will the whole world be warm as this. And as you feel it,. You'll know it's true. That you - you are blessed and lucky. It's true - that you. Are touched by something. That will grow and bloom in you. These are days you'll remember. When May is rushing over you. With desire to be part of the miracles. You see in every hour. You'll know it's true. That you are blessed and lucky. It's true that you are touched. By something that will grow and bloom in you. These are days. These are the days you might fill. With laughter until you break. These days you might feel. A shaft of light. Make its way across your face. And when you do. You'll know how it was meant to be. See the signs and know their meaning. You'll know how it was meant to be. Hear the signs and know they're speaking. To you, to you",ENGLISH
/10000-maniacs/,A Campfire Song,/10000-maniacs/a-campfire-song.html,"A lie to say, ""O my mountain has coal veins and beds to dig.. 500 men with axes and they all dig for me."" A lie to ssay, ""O my. river where mant fish do swim, half of the catch is mine when you haul. your nets in."" Never will he believe that his greed is a blinding. ray. No devil or redeemer will cheat him. He'll take his gold to. where he's lying cold.. A lie to say, ""O my mine gave a diamond as big as a fist."". But with every gem in his pocket, the jewels he has missed. A lie to. say, ""O my garden is growing taller by the day."" He only eats the. best and tosses the rest away. Never will he be believe that his. greed is a blinding ray. No devil or redeemer can cheat him. he'll. take his gold to where he's lying cold. Six deep in the grave.. Something is out of reach. something he wanted. something is out of reach. he's being taunted. something is out of reach. that he can' beg or steal nor can he buy. his oldest pain. and fear in life. there'll not be time. his oldest pain. and fear in life. there'll not be time. A lie to say ""O my forest has trees that block the sun and. when I cut them down I don't answer to anyone."" No, no, never will he. believe that his greed is a blinding ray no devil or redeemer can. cheat. him. He'll take his gold where he's lying cold..",ENGLISH
/10000-maniacs/,Everyday Is Like Sunday,/10000-maniacs/everyday-is-like-sunday.html,"Trudging slowly over wet sand. Back to the bench where your clothes were stolen. This is a coastal town. That they forgot to close down. Armagedon - come armagedon come armagedon come. Everyday is like sunday. Everyday is silent and grey. Hide on a promanade. Etch on a post card:. How I dearly wish I was not here. In the seaside town. That they forgot to bomb. Come, come nuclear bomb!. Everyday is like sunday. Everyday is silent and grey. Trudging back over pebbles and sand. And a strange dust lands on your hands. (and on your face). Everyday is like sunday. Win yourself a cheap tray. Share some grease tea with me. Everyday is silent and grey",ENGLISH
/10000-maniacs/,Don't Talk,/10000-maniacs/dont-talk.html,"Don't talk, I will listen. Don't talk, you keep your distance. For I'd rather hear some truth tonight. Than entertain your lies,. So take you poison silently. Let me be let me close my eyes. Don't talk, I'll believe it. Don't talk, listen to me instead,. I know that if you think of it,. Both long enough and hard. The drink you drown your troubles. In is the trouble you're in now. Talk talk talk about it,. If you talk as if you care. But when your talk is over. Tilt that bottle in the air,. Tossing back more than your share. Don't talk, I can guess it. Don't talk, well now your restless. And you need somewhere to put the blame. For how you feel inside. You'll look for a close. And easy mark and you'll see me as fair game. Talk talk talk about it,. Talk as if you care. But when your talk is over tilt. That bottle in the air. Tossing back more than your share. You talk talk talk about it,. You talk as if you care. I'm marking every word. And can tell this time for sure,. Your talk is the finest I have heard. So don't talk, I'll be sleeping,. Let me go on dreaming. How your eyes they glow so fiercely. I can tell your inspired. By the name you just chose for me. Now what was it?. O, never mind it. We will talk talk. Talk about this when your head is clear. I'll discuss this in the morning,. But until then you may talk but I won't hear",ENGLISH
/10000-maniacs/,Across The Fields,/10000-maniacs/across-the-fields.html,"Well they left then in the morning, a hundred pairs of wings. In the light moved together in the colors of the morning. I looked to the clouds in the cirrus sky and they'd gone.. Across the marshes, across the fields below.. I fell through the vines and I hoped they would catch me below.. If only to take me with them there,. Tell me the part that shinesIn your heart on the wind.. And the reeds blew in the morning.. Take me along to the places. You've gone when my eyes looked away.. Tell me the song that you sing in the trees in the dawning.. Tell me the part that shines in your heart. And the rays of love forever,. Please take me there..",ENGLISH
/10000-maniacs/,Planned Obsolescence,/10000-maniacs/planned-obsolescence.html,[ music: Dennis Drew/lyric: Natalie Merchant ]. . science. is truth for life. watch religion fall obsolete. science. will be truth for life. technology as nature. science. truth for life. in fortran tongue the. answer. with wealth and prominence. man so near perfection. possession. it's an absence of interim. secure no demurrer. defense against divine. defense against his true. image. human conflict number five. discovery. dissolved all illusion. mystery. destroyed with conclusion. and illusion never restored. any modern man can see. that religion is. obsolete. piety. obsolete. ritual. obsolete. martyrdom. obsolete. prophetic vision. obsolete. mysticism. obsolete. commitment. obsolete. sacrament. obsolete. revelation. obsolete.,ENGLISH
/10000-maniacs/,Rainy Day,/10000-maniacs/rainy-day.html,"On bended kneeI've looked through every window then.. Touched the bottom, the night a sleepless day instead. A day when love came,came easy like what's lost now found.. Beneath a blinding light that would surround.. We were without, in doubt. We were about saving for a rainy day.. I crashed through mirrors,. I crashed through floors of laughter then.. In a blind scene, no ties would moor us to this room.. A day when love came, came easy like what's lost now found.. And you would save me, and I held you like you were my child.. If I were you, defiant you, alone upon a troubled way.. I would send my heart to you. To save it for a rainy day..",ENGLISH
/10000-maniacs/,Anthem For Doomed Youth,/10000-maniacs/anthem-for-doomed-youth.html,For whom do the bells toll. When sentenced to die. The stuttering rifles. Will stifle the cry. The monstrous anger. The fear's rapid rattle. A desert inferno. Kids dying like cattle. Don't tell me. We're not prepared. I've seen today's marine. He's eighteen and he's eager. He can be quite mean. No mock'ries for them. No prayers or bells. The demented choirs. The wailing of shells. The boys holding candles. On untraveled roads. The fear spreads like fire. As shrapnel explodes. I think it's wrong. To conscript our youth. Against their will. When plenty of our citizenry. Really like to kill. What sign posts will lead. To armageddon's fires. What bugles will call them. From crowded grey shires. The women sit quiet. With death on their minds. A slow dusk descending. The drawing of blinds. Make the hunters all line up. It's their idea of fun. And let those be forgiven. Who never owned a gun. Was it him or me. Or the wailing of the dead. The laughing soldiers. Cast their lots. And you can cut the dread.,ENGLISH
/10000-maniacs/,All That Never Happens,/10000-maniacs/all-that-never-happens.html,"She walks alone on the brick lane,. the breeze is blowing.. A year had changed her forever,. just like her grey home.. He used to live so close here,. we'd look for places I can't remember.. The world was safe when she knew him,. she tried to hold him, hold on forever.. For all that never happens and all that never will be,. a candle burning for the love we seldom keep.. The earth was raw in her fingers,. she overturned it.. Considered planting some flowers,. they wouldn't last long,. no one to tend them.. It's funny how these things go,. you were the answer to all the questions.. The memories made her weary,. she shuddered slowly,. she didn't want to.. As a distant summer he began to whisper,. and threw a smile her way.. She looked into the glass,. liquid surface showing that they were melding,. together present past.. So where can I go from here?. The color fading,. he didn't answer.. She felt him slip from her vision.. She tried to hold him, hold on forever.. So close forever,. in a silent frozen sleep..",ENGLISH
/10000-maniacs/,Back O' The Moon,/10000-maniacs/back-o-the-moon.html,Jenny. Jenny you don't know the nights I hide. below a second story room. to whistle you down. the man who's let to divvy up. time is a miser. he's got a silver coin. only lets it shine for hours. while you sleep it away. there's one rare and odd style of living. part only known to the everybody Jenny. a comical where's the end parade. of the sort people here would think unusual. Jenny. tonight upon the mock brine of a Luna Sea. far off we sail on to Back O' The Moon. Jenny. Jenny you don't know the days I've tried. telling backyard tales. so to maybe amuse. o your mood is never giddy. if you smile I'm delighted. but you'd rather pout. such a lazy child. you dare fold your arms. tisk and say that I lie. there's one rare and odd style of thinking. part only known to the everybody Jenny. the small step and giant leap takers. got the head start in the race toward it. Jenny. tonight upon the mock brine of a Luna Sea. far off we sail on to the Back O' The Moon. that was a sigh. but not meant to envy you. when your age was mine. some things were sworn true. morning would come. and calendar pages had. new printed seasons on. their opposite sides. Jenny. Jenny you don't know the nights I hide. below a second story room. to whistle you down. o the man who's let to divvy up. time is a miser. he's got a silver coin. lets it shine for hours. while you sleep it away. there's one rare and odd style of living. part only known to the everybody Jenny. out of tin ships jump the bubble head boys. to push their flags into powdered soils and cry. no second placers. no smart looking geese in bonnets. dance with pigs in high button trousers. no milk pail for the farmer's daughter. no merry towns of sweet walled houses. here I've found. Back O' the Moon. not here. I've found. Back O' the Moon,ENGLISH

Let's start by importing the necessary libraries and preparing the data. I recommend using Google Colab for this project, since access to a GPU will make everything much faster.

import os
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm, trange

### Prepare data
lyrics = pd.read_csv('lyrics-data.csv')
lyrics = lyrics[lyrics['Idiom']=='ENGLISH']

#Only keep popular artists, with genre Rock/Pop and popularity high enough
artists = pd.read_csv('artists-data.csv')
artists = artists[(artists['Genre'].isin(['Rock'])) & (artists['Popularity']>5)]
df = lyrics.merge(artists[['Artist', 'Genre', 'Link']], left_on='ALink', right_on='Link', how='inner')
df = df.drop(columns=['ALink','SLink','Idiom','Link'])

#Drop songs whose lyrics are too long (GPT-2 cannot handle more than 1024 tokens)
df = df[df['Lyric'].apply(lambda x: len(x.split(' ')) < 350)]

#Create a very small test set to compare generated text with the reality
test_set = df.sample(n = 200)
df = df.loc[~df.index.isin(test_set.index)]

#Reset the indexes
test_set = test_set.reset_index()
df = df.reset_index()

#For the test set only, keep last 20 words in a new column, then remove them from original column
test_set['True_end_lyrics'] = test_set['Lyric'].str.split().str[-20:].apply(' '.join)
test_set['Lyric'] = test_set['Lyric'].str.split().str[:-20].apply(' '.join)
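
Note that the 350-word cutoff above is only a heuristic for staying under GPT-2's 1024-token limit, since words and tokens are not the same thing. If you want to be exact, a quick sanity check with the tokenizer (slower, but precise) might look like this:

#Optional: verify actual GPT-2 token counts instead of relying on word counts
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
token_counts = df['Lyric'].apply(lambda x: len(tokenizer.encode(x)))
print((token_counts >= 1024).sum(), "songs still exceed the 1024-token limit")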

As you can see in the last few lines above, I created a small test set in which I removed the last 20 words of each song. This will allow me to compare the generated text with the real one to see how well the model performs.

2. Creating the Dataset

To use GPT-2 on our data, there are still a few things to do. We need to tokenize the data, which is the process of converting a sequence of characters into tokens, i.e. splitting a sentence into words.
We also need to make sure that every song respects a maximum of 1024 tokens.
The SongLyrics class below will do this for each song of our original dataframe, ready for training.

class SongLyrics(Dataset):
    def __init__(self, data, control_code="lyrics", truncate=False, gpt2_type="gpt2", max_length=1024):

        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.lyrics = []

        for row in data:
          #Wrap each song with a control code and GPT-2's end-of-text token;
          #row[:max_length] is a rough cap on length (in characters)
          self.lyrics.append(torch.tensor(
                self.tokenizer.encode(f"<|{control_code}|>{row[:max_length]}<|endoftext|>")
            ))
        if truncate:
            self.lyrics = self.lyrics[:20000]
        self.lyrics_count = len(self.lyrics)

    def __len__(self):
        return self.lyrics_count

    def __getitem__(self, item):
        return self.lyrics[item]

dataset = SongLyrics(df['Lyric'], truncate=True, gpt2_type="gpt2")
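
As a quick check, each item of the dataset is simply a 1-D tensor of token ids; decoding one shows the control code and end-of-text token wrapped around the lyrics. A small illustrative sketch:

#Peek at one training example
sample = dataset[0]
print(sample.shape)                      #e.g. torch.Size([312]), one id per token
print(dataset.tokenizer.decode(sample))  #"<|lyrics|>...<|endoftext|>"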

3. Training the Model

We can now import the pretrained GPT-2 model as well as the tokenizer. As I mentioned earlier, GPT-2 is huge: if you try to fine-tune it on your own computer, you will most likely run into a bunch of CUDA out-of-memory errors.
The alternative we can use is gradient accumulation.
The idea is simple: before calling the optimizer to perform a gradient descent step, we sum the gradients over several batches, dividing the total by the number of accumulated steps to get an average loss over those training samples. This simulates a much larger batch size than would otherwise fit in memory.
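
To make the idea concrete, here is a minimal, self-contained sketch of gradient accumulation on a toy model (the model, data, and step counts are made up purely for illustration; the article's actual training loop comes below):

import torch
from torch import nn

#Toy setup, purely to illustrate the mechanics
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(4, 10) for _ in range(32)]  #32 small "micro-batches"
accumulation_steps = 8                          #one update per 8 micro-batches

optimizer.zero_grad()
for step, x in enumerate(data):
    loss = model(x).pow(2).mean() / accumulation_steps  #scale so summed grads average out
    loss.backward()                                     #gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       #a single gradient descent step for 8 batches
        optimizer.zero_grad()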

#Get the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

#Accumulated batch size (since GPT2 is so big): pack several songs into a
#single tensor of at most max_seq_len tokens before each forward pass
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        #Nothing packed yet: start a new pack with this song
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        #The pack is full: return it as-is and hand the new song back as remainder
        return packed_tensor, False, new_tensor
    else:
        #Prepend the new song to the pack (dropping the pack's first token)
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
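
To see what pack_tensor does, here is a small sanity check with dummy token tensors (the shapes are made up for illustration):

#Dummy "songs" of 300, 400 and 200 tokens, packed up to 768 tokens
import torch
a = torch.ones(1, 300, dtype=torch.long)
b = torch.ones(1, 400, dtype=torch.long)
c = torch.ones(1, 200, dtype=torch.long)

packed, carry_on, remainder = pack_tensor(a, None, 768)    #starts a new pack
packed, carry_on, remainder = pack_tensor(b, packed, 768)  #300+400 <= 768: packed together
print(packed.shape, carry_on)                              #torch.Size([1, 699]) True
packed, carry_on, remainder = pack_tensor(c, packed, 768)  #699+200 > 768: pack is full
print(carry_on, remainder.shape)                           #False torch.Size([1, 200])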

Now, finally, we can create the training function that fine-tunes GPT-2 on all of our lyrics, so that it can predict quality verses in the future.

def train(
    dataset, model, tokenizer,
    batch_size=16, epochs=5, lr=2e-5,
    max_seq_len=400, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False,save_model_on_epoch=False,
):
    acc_steps = 100
    device=torch.device("cuda")
    model = model.cuda()
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss=0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in tqdm(enumerate(train_dataloader)):
            #Pack songs together until the packed tensor reaches 768 tokens
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, 768)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            #Only take an optimizer step once every batch_size packed tensors
            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            input_tensor = None
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model

Feel free to experiment with the various hyperparameters (batch size, learning rate, epochs, optimizer).
Then, finally, we can train the model.

model = train(dataset, model, tokenizer)
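
For example, here is the same call with the default hyperparameters spelled out, so it is clear what you can tweak (an illustrative alternative, not a second training run):

#Same call, with the defaults made explicit
model = train(
    dataset, model, tokenizer,
    batch_size=16,   #packed tensors accumulated per optimizer step
    epochs=5,
    lr=2e-5,
    warmup_steps=200,
    save_model_on_epoch=False,
)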

Using torch.save and torch.load, you can also save your trained model for future use.
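A minimal sketch (the filename is arbitrary):

#Save the fine-tuned weights, then reload them into a fresh GPT-2 later
torch.save(model.state_dict(), 'gpt2_lyrics_model.pt')

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.load_state_dict(torch.load('gpt2_lyrics_model.pt'))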

4. Generating Lyrics

Time to use our brand new fine-tuned model to generate lyrics. Using the two functions below, we can generate lyrics for every song in the test dataset. Remember that I removed the last 20 words of each song; now, for a given song, our model will look at the lyrics it has and come up with what the ending of the song should be.

def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=30, #maximum number of tokens to generate
    top_p=0.8,
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for entry_idx in trange(entry_count):

            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break
            
            if not entry_finished:
              output_list = list(generated.squeeze().numpy())
              output_text = f"{tokenizer.decode(output_list)}<|endoftext|>" 
              generated_list.append(output_text)
                
    return generated_list

#Function to generate multiple sentences. Test data should be a dataframe
def text_generation(test_data):
  generated_lyrics = []
  for i in range(len(test_data)):
    x = generate(model.to('cpu'), tokenizer, test_data['Lyric'][i], entry_count=1)
    generated_lyrics.append(x)
  return generated_lyrics

#Run the functions to generate the lyrics
generated_lyrics = text_generation(test_set)

The generate function handles the generation for a single prompt, while text_generation applies it to the whole test dataframe.
With the entry_length parameter, we specify the maximum length of a generation. I left it at 30, but that is because punctuation matters: later on, I will remove the last few words to make sure each generation finishes at the end of a sentence.
Two other hyperparameters are worth mentioning (see the toy example below):
Temperature. It is used to scale the probabilities of candidate words. A high temperature pushes the model toward more original predictions, while a smaller temperature keeps it from going off topic.
Top-p filtering. The model sorts the word probabilities in descending order, then adds them up until they reach p, discarding the remaining words. This means the model keeps only the most relevant word candidates, but not only the single best one, since several words can be appropriate continuations of a given sequence.
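
Here is a toy illustration of top-p filtering on a made-up five-word vocabulary (a slightly simplified version of the filtering done inside generate):

import torch
import torch.nn.functional as F

#Made-up logits over a 5-word vocabulary
logits = torch.tensor([[2.0, 1.5, 1.0, 0.2, -1.0]])
probs = F.softmax(logits, dim=-1)                #~[0.46, 0.28, 0.17, 0.08, 0.02]

sorted_probs, _ = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)  #[0.46, 0.73, 0.90, 0.98, 1.00]
keep = cumulative <= 0.8
keep[..., 0] = True   #always keep at least the most likely word
print(keep)           #tensor([[ True,  True, False, False, False]])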
In the code below, I simply clean the generated text, making sure it ends at the end of a sentence (and not in the middle of one), and save it in a new column of the test dataframe.

#Loop to keep only generated text and add it as a new column in the dataframe
my_generations=[]

for i in range(len(generated_lyrics)):
  a = test_set['Lyric'][i].split()[-30:] #Last 30 words of the prompt, used as an anchor
  b = ' '.join(a)
  c = ' '.join(generated_lyrics[i]) #Full output (prompt + generation)
  my_generations.append(c.split(b)[-1]) #Keep only what comes after the anchor

test_set['Generated_lyrics'] = my_generations


#Cut each generation at its last period, removing the unfinished tail
final=[]

for i in range(len(test_set)):
  to_remove = test_set['Generated_lyrics'][i].split('.')[-1] #Text after the last period
  final.append(test_set['Generated_lyrics'][i].replace(to_remove,''))

test_set['Generated_lyrics'] = final

5. Evaluating the Results

There are many ways to evaluate the quality of generated text. The most popular metric is called BLEU. The algorithm outputs a score between 0 and 1 depending on how similar the generated text is to the real one; a score of 1 means that every generated word is present in the real text.
Here is the code to evaluate the BLEU score of the generated lyrics.

#Using BLEU score to compare the real sentences with the generated ones
import statistics
from nltk.translate.bleu_score import sentence_bleu

scores=[]

for i in range(len(test_set)):
  reference = test_set['True_end_lyrics'][i]
  candidate = test_set['Generated_lyrics'][i]
  #sentence_bleu expects tokenized input: a list of reference token lists and a candidate token list
  scores.append(sentence_bleu([reference.split()], candidate.split()))

statistics.mean(scores)

We obtain an average BLEU score of 0.685, which is quite good. By comparison, the GPT-2 model without any fine-tuning obtains a BLEU score of 0.288.
However, BLEU has its limits. It was originally created for machine translation, and it only looks at the vocabulary used to determine the quality of the generated text. That is a problem for us: it is entirely possible to generate high-quality verses that use completely different words from the real ones.
This is why I also did a subjective evaluation of the model's performance. To do so, I built a small web interface (using Dash); the code is available in my GitHub repository.
The way the interface works is that you provide the app with a few input words. The model then uses them to predict what the next few verses should be. Below are some example results.

Given the black input sequence, the red text is what the GPT-2 model predicted. You can see that it manages to generate meaningful verses that respect the previous context! Moreover, it generates sentences of similar length, which is very important for keeping the rhythm of a song. In that respect, punctuation in the input text is absolutely essential when generating lyrics.

6. Conclusion

As this article has shown, by fine-tuning GPT-2 on specific data, you can generate contextually relevant text fairly easily.
For lyrics generation, the model can produce lyrics that fit both the context and the desired sentence length. The model could certainly be improved: for example, we could force it to generate rhyming verses, which is often necessary when writing song lyrics.
Thanks a lot for reading, and I hope this helps!
The repository with all of the code and the model can be found here: https://github.com/francoisstamant/lyrics-generation-with-GPT2

https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272