Harnessing Positivity: PPO and Generative Text Models
Introduction to Positive Text Generation
I've been planning to write this blog post for quite some time, and today feels like the perfect moment! If you're reading this, chances are you’re already familiar with AI language models like ChatGPT, GPT-4, or Bard. You might have encountered numerous articles and videos detailing their training processes. And yes, you might be a bit fatigued by the frequent mention of the acronym RLHF! I’ll do my best to use it sparingly, but I can't make any guarantees!
In this post, we fine-tune a GPT-2 language model to generate more positive text. We'll use the Proximal Policy Optimization (PPO) algorithm to steer training so that the model completes a given prompt with more uplifting language.
What Exactly is PPO?
PPO, or Proximal Policy Optimization, is a technique designed to refine the decision-making strategies of deep learning models, specifically in our case, the GPT-2 model.
Let’s illustrate this with an example. Imagine you’re a food enthusiast who tends to stick to familiar dishes. Your partner, however, is eager to explore new cuisines since she’s grown weary of the same old Chinese food. As a cautious foodie, you might consider options that resemble Chinese cuisine but introduce a new flavor. After some deliberation—Indian, Mexican, Italian—you settle on Vietnamese food, which has some similarities to what you already enjoy. You try it and find that you actually like it! This way, you’ve ventured slightly out of your comfort zone without straying too far, allowing you to explore and still enjoy a new experience.
In a similar vein, PPO strikes a balance between exploration and exploitation: you won’t keep consuming the same dishes (exploitation), but you also won’t risk trying something entirely outside your taste (exploration). Essentially, PPO updates the model in a way that prevents it from straying too far from its learned behaviors while still allowing it to explore new possibilities.
For a deeper dive into PPO, refer to the original paper by Schulman et al. (2017), "Proximal Policy Optimization Algorithms".
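To make the "don't stray too far" idea concrete, here is a minimal sketch of PPO's clipped surrogate objective. The function name and the example tensors are purely illustrative and not part of the training code we build later; they just show how the probability ratio between the new and old policy gets clipped.

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Taking the minimum of the unclipped and clipped terms penalizes updates
    # that would move the policy too far from its previous behavior.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()

# Illustrative values: three actions with slightly shifted log-probabilities.
new_lp = torch.tensor([-1.0, -0.5, -2.0])
old_lp = torch.tensor([-1.2, -0.4, -2.0])
advantages = torch.tensor([0.8, -0.3, 0.5])
print(clipped_surrogate_loss(new_lp, old_lp, advantages))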
Getting Started with Coding and Data
We'll be leveraging the Sentiment140 dataset, which contains 1.6 million labeled tweets, to fine-tune the GPT-2 Language Model. The first step involves setting up your environment in a Colab notebook and following these steps:
- Load the Data
import pandas as pd

# The Sentiment140 CSV ships without a header row, so pass header=None
# to keep the first tweet from being consumed as column names.
data = pd.read_csv(
    "/content/training.1600000.processed.noemoticon.csv",
    encoding="latin-1",
    header=None,
)
data.head()
- Assign Column Names
data.columns = ['target', "id", "date", "type", "user", "text"]
- Filter for Negative Comments
data = data[["text", "target"]]
negative_comments = data[data.target == 0]
- Sample a Portion of the Dataset
negative_comments = negative_comments.sample(frac=0.05)
- Remove "@" Tags
import re

def remove_tags(string_with_tags):
    # Strip "@username" mentions from the tweet text.
    return re.sub(r"@\w+", "", string_with_tags)
- Compile into a Text File
txt_file = ". ".join([remove_tags(txt) for txt in negative_comments.text.values.tolist()])

with open("./negative_reviews_small.txt", "w") as fp:
    fp.write(txt_file)
Training the GPT-2 Language Model
- Install Required Libraries
!pip install transformers sentencepiece
- Import Necessary Modules
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from tqdm.notebook import trange
- Initialize Model and Tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
- Prepare Dataset and Tokenization
train_dataset = TextDataset(tokenizer=tokenizer, file_path="negative_reviews_small.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
- Fine-Tune the Model
training_args = TrainingArguments(
output_dir="output",
overwrite_output_dir=True,
num_train_epochs=20,
per_device_train_batch_size=32,
save_steps=800,
warmup_steps=500,
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
)
trainer.train()
- Generate Text
input_ids = tokenizer.encode("I’m going to", return_tensors="pt").cuda()
print(tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True))
Developing the Positive Reinforcement Strategy
For PPO to be effective, we need a reward function that helps the algorithm determine whether the generated text is positive.
Creating the Reward Function
There are various ways to build a reward function. We can use NLTK's rule-based VADER sentiment analyzer, or a DistilBERT model fine-tuned for sentiment classification and served through the Hugging Face pipeline. Below are examples of both.
Using NLTK
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

# Build the analyzer once and reuse it for every reward call.
sia = SentimentIntensityAnalyzer()

def reward_function(text):
    sentiment = sia.polarity_scores(text)
    return sentiment["pos"] - sentiment["neg"]
In this function, we employ the SentimentIntensityAnalyzer to calculate the polarity scores for the input text and return the difference between positive and negative scores.
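As a quick sanity check (the example sentences are just illustrative), the reward should come out positive for upbeat text and negative for gloomy text:

print(reward_function("I love how this turned out, what a great day!"))  # should be > 0
print(reward_function("This was a terrible, sad day."))  # should be < 0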
Using a Pre-trained DistilBert Model
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def normalize_scores(score, new_min, new_max, old_min=0, old_max=1):
    return new_min + ((score - old_min) * (new_max - new_min)) / (old_max - old_min)

def get_normalized_sentiment_scores(text, new_min=-1, new_max=1):
    result = sentiment_pipeline(text)[0]
    positive_score = result["score"] if result["label"] == "POSITIVE" else 1 - result["score"]
    negative_score = 1 - positive_score
    positive_score = normalize_scores(positive_score, new_min, new_max)
    negative_score = normalize_scores(negative_score, new_min, new_max)
    return positive_score - negative_score
In this example, the reward function relies on a DistilBERT model fine-tuned for sentiment classification, accessed through the Hugging Face pipeline. Each score is rescaled to the [-1, 1] range, and the reward is the difference between the normalized positive and negative scores.
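As with the VADER version, here is an illustrative check. The exact values depend on the model, but strongly positive text should land near +2 and strongly negative text near -2, since the reward is the difference of two scores that were each mapped to [-1, 1]:

print(get_normalized_sentiment_scores("What a wonderful surprise!"))
print(get_normalized_sentiment_scores("Everything went wrong today."))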
Fine-Tuning the Model with PPO
Next, we’ll create a class for PPO that incorporates various methods. Here’s a brief overview:
import random

import torch
import torch.optim as optim
from torch.distributions import Categorical

class PPO:
    def __init__(self, model, tokenizer, reward_function, corpus, device="cuda"):
        self.model = model
        self.tokenizer = tokenizer
        self.reward_function = reward_function
        self.device = device
        # The corpus was joined with ". " earlier, so split it back into sentences.
        self.corpus = corpus.split(". ")

    def random_chunk_choice(self):
        # Pick a random sentence and slice a short prompt out of it.
        txt = random.choice(self.corpus)
        rtrn_txt = txt[random.choice([0, 2, 5]):random.choice([7, 8, 10])]
        while not len(rtrn_txt) >= 3:
            rtrn_txt = txt[random.choice([0, 2, 5]):random.choice([7, 8, 10])]
        return rtrn_txt

    def generate(self, input_text, max_length=50):
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=max_length, do_sample=True)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

    def get_action_probs(self, input_text):
        # No torch.no_grad() here: the sampled log-probabilities must keep their
        # computation graph so the policy loss can backpropagate into the model.
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt").to(self.device)
        logits = self.model(input_ids).logits[:, -1, :]
        action_probs = torch.softmax(logits, dim=-1)
        return action_probs

    def get_reward(self, input_text):
        return self.reward_function(input_text)

    def train(self, num_epochs, num_rollouts, num_steps, lr, clip_epsilon, discount_factor):
        optimizer = optim.Adam(self.model.parameters(), lr=lr)
        for epoch in trange(num_epochs):
            self.model.train()
            old_log_probs = []
            for rollout in range(num_rollouts):
                input_text = self.random_chunk_choice()
                log_probs = []
                rewards = []
                for t in range(num_steps):
                    # Sample the next token from the current policy.
                    action_probs = self.get_action_probs(input_text)
                    m = Categorical(action_probs)
                    action = m.sample()
                    log_prob = m.log_prob(action)
                    generated_text = self.tokenizer.decode(action.cpu().numpy(), skip_special_tokens=True)
                    input_text += generated_text
                    # Score the text generated so far with the sentiment reward.
                    reward = self.get_reward(input_text)
                    log_probs.append(log_prob)
                    rewards.append(reward)
                old_log_probs.extend(log_probs)

                print(f"EPOCH: {epoch} | ROLLOUT: {rollout} | MEAN REWARDS: {torch.tensor(rewards).mean()}")

                # Compute discounted returns and normalize them to stabilize updates.
                discounted_rewards = []
                Gt = 0
                for reward in reversed(rewards):
                    Gt = reward + discount_factor * Gt
                    discounted_rewards.insert(0, Gt)
                discounted_rewards = torch.tensor(discounted_rewards).to(self.device)
                discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

                # PPO clipped surrogate objective (returns are used directly as advantages).
                policy_loss = []
                for log_prob, old_log_prob, Gt in zip(log_probs, old_log_probs, discounted_rewards):
                    ratio = torch.exp(log_prob - old_log_prob.detach())
                    advantage = Gt
                    policy_loss_1 = ratio * advantage
                    policy_loss_2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantage
                    policy_loss.append(-torch.min(policy_loss_1, policy_loss_2))
                policy_loss = torch.stack(policy_loss).sum()

                optimizer.zero_grad()
                policy_loss.backward()
                optimizer.step()

            old_log_probs = log_probs
            print(f"Epoch {epoch + 1}/{num_epochs} completed")
Starting the Fine-Tuning Process
# Pass the joined tweet corpus built earlier (txt_file) so PPO can sample prompts from it.
ppo_agent = PPO(model, tokenizer, get_normalized_sentiment_scores, txt_file, device="cuda")
ppo_agent.train(num_epochs=10, num_rollouts=4, num_steps=128, lr=2e-5, clip_epsilon=0.2, discount_factor=0.99)
Understanding Training Parameters
Epoch: A complete pass through the training process. In one epoch, the model undergoes several rollouts, updating its parameters based on the accumulated policy loss.
Rollout: A sequence of steps within an epoch. Each rollout represents an exploration path within the environment, consisting of multiple actions where the model generates tokens and receives rewards.
Step: A single interaction between the model and the environment during a rollout. At each step, the model selects an action (i.e., the next token), receives a reward, and proceeds to the next step.
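Putting the three levels together for the hyperparameters passed to train above, each run samples a fixed budget of tokens:

num_epochs, num_rollouts, num_steps = 10, 4, 128
# One token is sampled per step, so the total number of sampled tokens is:
print(num_epochs * num_rollouts * num_steps)  # 5120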
Generating Text Samples
input_ids = tokenizer.encode("I'm going to be", return_tensors="pt").cuda()
print(tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True))
Conclusion
In this post, we explored the fundamentals of PPO and applied it to generate more positive sentences based on an initial input. We also examined the intricacies of the training process and the function of each component. With this knowledge, you can now experiment with creating your own positive text generation model!