Harnessing Positivity: PPO and Generative Text Models
Introduction to Positive Text Generation
I've been planning to write this blog post for quite some time, and today feels like the perfect moment! If you're reading this, chances are you’re already familiar with AI language models like ChatGPT, GPT-4, or Bard. You might have encountered numerous articles and videos detailing their training processes. And yes, you might be a bit fatigued by the frequent mention of the acronym RLHF! I’ll do my best to use it sparingly, but I can't make any guarantees!
In this post, we fine-tune a GPT-2 language model to generate more positive text. We'll use the Proximal Policy Optimization (PPO) algorithm to steer training so that the model completes a given prompt with more uplifting language.
What Exactly is PPO?
PPO, or Proximal Policy Optimization, is a technique designed to refine the decision-making strategies of deep learning models, specifically in our case, the GPT-2 model.
Let’s illustrate this with an example. Imagine you’re a food enthusiast who tends to stick to familiar dishes. Your partner, however, is eager to explore new cuisines since she’s grown weary of the same old Chinese food. As a cautious foodie, you might consider options that resemble Chinese cuisine but introduce a new flavor. After some deliberation—Indian, Mexican, Italian—you settle on Vietnamese food, which has some similarities to what you already enjoy. You try it and find that you actually like it! This way, you’ve ventured slightly out of your comfort zone without straying too far, allowing you to explore and still enjoy a new experience.
In a similar vein, PPO strikes a balance between exploration and exploitation: you won’t keep consuming the same dishes (exploitation), but you also won’t risk trying something entirely outside your taste (exploration). Essentially, PPO updates the model in a way that prevents it from straying too far from its learned behaviors while still allowing it to explore new possibilities.
For a deeper dive into PPO, refer to the original paper by Schulman et al. (2017), "Proximal Policy Optimization Algorithms".
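To make the "don't stray too far" idea concrete, here is a minimal sketch of PPO's clipped surrogate objective. The function name and the example tensors are purely illustrative and not part of the training code we build later; they just show how the probability ratio between the new and old policy gets clipped.

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Taking the minimum of the unclipped and clipped terms penalizes updates
    # that would move the policy too far from its previous behavior.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()

# Illustrative values: three actions with slightly shifted log-probabilities.
new_lp = torch.tensor([-1.0, -0.5, -2.0])
old_lp = torch.tensor([-1.2, -0.4, -2.0])
advantages = torch.tensor([0.8, -0.3, 0.5])
print(clipped_surrogate_loss(new_lp, old_lp, advantages))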
Getting Started with Coding and Data
We'll be leveraging the Sentiment140 dataset, which contains 1.6 million labeled tweets, to fine-tune the GPT-2 Language Model. The first step involves setting up your environment in a Colab notebook and following these steps:
- Load the Data
import pandas as pd

# The Sentiment140 CSV ships without a header row, so pass header=None
# to keep the first tweet from being consumed as column names.
data = pd.read_csv(
    "/content/training.1600000.processed.noemoticon.csv",
    encoding="latin-1",
    header=None,
)
data.head()
- Assign Column Names
data.columns = ['target', "id", "date", "type", "user", "text"]
- Filter for Negative Comments
data = data[["text", "target"]]
negative_comments = data[data.target == 0]
- Sample a Portion of the Dataset
negative_comments = negative_comments.sample(frac=0.05)
- Remove "@" Tags
import re

def remove_tags(string_with_tags):
    # Strip "@username" mentions from the tweet text.
    return re.sub(r"@\w+", "", string_with_tags)
- Compile into a Text File
txt_file = ". ".join([remove_tags(txt) for txt in negative_comments.text.values.tolist()])

with open("./negative_reviews_small.txt", "w") as fp:
    fp.write(txt_file)
Training the GPT-2 Language Model
- Install Required Libraries
!pip install transformers sentencepiece
- Import Necessary Modules
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from tqdm.notebook import trange
- Initialize Model and Tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
- Prepare Dataset and Tokenization
train_dataset = TextDataset(tokenizer=tokenizer, file_path="negative_reviews_small.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
- Fine-Tune the Model
training_args = TrainingArguments(
output_dir="output",
overwrite_output_dir=True,
num_train_epochs=20,
per_device_train_batch_size=32,
save_steps=800,
warmup_steps=500,
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
)
trainer.train()
- Generate Text
input_ids = tokenizer.encode("I’m going to", return_tensors="pt").cuda()
print(tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True))
Developing the Positive Reinforcement Strategy
For PPO to be effective, we need a reward function that helps the algorithm determine whether the generated text is positive.
Creating the Reward Function
There are various ways to build a reward function. We can use NLTK's rule-based VADER sentiment analyzer, or a DistilBERT model fine-tuned for sentiment classification and served through the Hugging Face pipeline. Below are examples of both.
Using NLTK
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

# Build the analyzer once and reuse it for every reward call.
sia = SentimentIntensityAnalyzer()

def reward_function(text):
    sentiment = sia.polarity_scores(text)
    return sentiment["pos"] - sentiment["neg"]
In this function, we employ the SentimentIntensityAnalyzer to calculate the polarity scores for the input text and return the difference between positive and negative scores.
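As a quick sanity check (the example sentences are just illustrative), the reward should come out positive for upbeat text and negative for gloomy text:

print(reward_function("I love how this turned out, what a great day!"))  # should be > 0
print(reward_function("This was a terrible, sad day."))  # should be < 0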
Using a Pre-trained DistilBert Model
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def normalize_scores(score, new_min, new_max, old_min=0, old_max=1):
    return new_min + ((score - old_min) * (new_max - new_min)) / (old_max - old_min)

def get_normalized_sentiment_scores(text, new_min=-1, new_max=1):
    result = sentiment_pipeline(text)[0]
    positive_score = result["score"] if result["label"] == "POSITIVE" else 1 - result["score"]
    negative_score = 1 - positive_score
    positive_score = normalize_scores(positive_score, new_min, new_max)
    negative_score = normalize_scores(negative_score, new_min, new_max)
    return positive_score - negative_score
In this example, the reward function relies on a DistilBERT model fine-tuned for sentiment classification, accessed through the Hugging Face pipeline. Each score is rescaled to the [-1, 1] range, and the reward is the difference between the normalized positive and negative scores.
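As with the VADER version, here is an illustrative check. The exact values depend on the model, but strongly positive text should land near +2 and strongly negative text near -2, since the reward is the difference of two scores that were each mapped to [-1, 1]:

print(get_normalized_sentiment_scores("What a wonderful surprise!"))
print(get_normalized_sentiment_scores("Everything went wrong today."))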
Fine-Tuning the Model with PPO
Next, we’ll create a class for PPO that incorporates various methods. Here’s a brief overview:
import random

import torch
import torch.optim as optim
from torch.distributions import Categorical

class PPO:
    def __init__(self, model, tokenizer, reward_function, corpus, device="cuda"):
        self.model = model
        self.tokenizer = tokenizer
        self.reward_function = reward_function
        self.device = device
        # The corpus was joined with ". " earlier, so split it back into sentences.
        self.corpus = corpus.split(". ")

    def random_chunk_choice(self):
        # Pick a random sentence and slice a short prompt out of it.
        txt = random.choice(self.corpus)
        rtrn_txt = txt[random.choice([0, 2, 5]):random.choice([7, 8, 10])]
        while not len(rtrn_txt) >= 3:
            rtrn_txt = txt[random.choice([0, 2, 5]):random.choice([7, 8, 10])]
        return rtrn_txt

    def generate(self, input_text, max_length=50):
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=max_length, do_sample=True)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

    def get_action_probs(self, input_text):
        # No torch.no_grad() here: the sampled log-probabilities must keep their
        # computation graph so the policy loss can backpropagate into the model.
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt").to(self.device)
        logits = self.model(input_ids).logits[:, -1, :]
        action_probs = torch.softmax(logits, dim=-1)
        return action_probs

    def get_reward(self, input_text):
        return self.reward_function(input_text)

    def train(self, num_epochs, num_rollouts, num_steps, lr, clip_epsilon, discount_factor):
        optimizer = optim.Adam(self.model.parameters(), lr=lr)
        for epoch in trange(num_epochs):
            self.model.train()
            old_log_probs = []
            for rollout in range(num_rollouts):
                input_text = self.random_chunk_choice()
                log_probs = []
                rewards = []
                for t in range(num_steps):
                    # Sample the next token from the current policy.
                    action_probs = self.get_action_probs(input_text)
                    m = Categorical(action_probs)
                    action = m.sample()
                    log_prob = m.log_prob(action)
                    generated_text = self.tokenizer.decode(action.cpu().numpy(), skip_special_tokens=True)
                    input_text += generated_text
                    # Score the text generated so far with the sentiment reward.
                    reward = self.get_reward(input_text)
                    log_probs.append(log_prob)
                    rewards.append(reward)
                old_log_probs.extend(log_probs)

                print(f"EPOCH: {epoch} | ROLLOUT: {rollout} | MEAN REWARDS: {torch.tensor(rewards).mean()}")

                # Compute discounted returns and normalize them to stabilize updates.
                discounted_rewards = []
                Gt = 0
                for reward in reversed(rewards):
                    Gt = reward + discount_factor * Gt
                    discounted_rewards.insert(0, Gt)
                discounted_rewards = torch.tensor(discounted_rewards).to(self.device)
                discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

                # PPO clipped surrogate objective (returns are used directly as advantages).
                policy_loss = []
                for log_prob, old_log_prob, Gt in zip(log_probs, old_log_probs, discounted_rewards):
                    ratio = torch.exp(log_prob - old_log_prob.detach())
                    advantage = Gt
                    policy_loss_1 = ratio * advantage
                    policy_loss_2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantage
                    policy_loss.append(-torch.min(policy_loss_1, policy_loss_2))
                policy_loss = torch.stack(policy_loss).sum()

                optimizer.zero_grad()
                policy_loss.backward()
                optimizer.step()

            old_log_probs = log_probs
            print(f"Epoch {epoch + 1}/{num_epochs} completed")
Starting the Fine-Tuning Process
# Pass the joined tweet corpus built earlier (txt_file) so PPO can sample prompts from it.
ppo_agent = PPO(model, tokenizer, get_normalized_sentiment_scores, txt_file, device="cuda")
ppo_agent.train(num_epochs=10, num_rollouts=4, num_steps=128, lr=2e-5, clip_epsilon=0.2, discount_factor=0.99)
Understanding Training Parameters
Epoch: A complete pass through the training process. In one epoch, the model undergoes several rollouts, updating its parameters based on the accumulated policy loss.
Rollout: A sequence of steps within an epoch. Each rollout represents an exploration path within the environment, consisting of multiple actions where the model generates tokens and receives rewards.
Step: A single interaction between the model and the environment during a rollout. At each step, the model selects an action (i.e., the next token), receives a reward, and proceeds to the next step.
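Putting the three levels together for the hyperparameters passed to train above, each run samples a fixed budget of tokens:

num_epochs, num_rollouts, num_steps = 10, 4, 128
# One token is sampled per step, so the total number of sampled tokens is:
print(num_epochs * num_rollouts * num_steps)  # 5120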
Generating Text Samples
input_ids = tokenizer.encode("I'm going to be", return_tensors="pt").cuda()
print(tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True))
Conclusion
In this post, we explored the fundamentals of PPO and applied it to generate more positive sentences based on an initial input. We also examined the intricacies of the training process and the function of each component. With this knowledge, you can now experiment with creating your own positive text generation model!