5 tips to improve NLP datasets

Nicolas MARTIN · Published in Nerd For Tech · 4 min read · May 5, 2023

In April 2023, there was a 100-point bounty question about algorithms that did not work with a dataset of documents.

I wondered if I could help and gave my opinion based on my experience: the problem was not the algorithms but the dataset. And I was right.

That is why I decided to write this article. Having worked with NLP datasets in many fields, I hope my experience can be helpful to others who have problems with their datasets.

Good NLP models depend on algorithms and methods, but also on their raw material: the datasets. Photo by Earl Wilcox on Unsplash

Here are 5 tips for improving NLP datasets:

1. Reduce redundancy

Some datasets can have very repetitive content, which reduces model quality.

Like any neural network, NLP models are driven by data frequency, and redundant data skews the weights.

The same applies to us: the more often we hear a piece of information, the more important it becomes in our minds.

So instead of keeping redundant data, you can remove it as much as possible by detecting similarities.

Many algorithms can do this: dimensionality reduction, cosine similarity, Euclidean distance, or TF-IDF.
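
As a rough illustration, here is a minimal sketch of near-duplicate detection with TF-IDF and cosine similarity, assuming scikit-learn is available; the 0.8 similarity threshold is an arbitrary value to tune on your own data.

```python
# Minimal sketch: drop near-duplicate documents using TF-IDF + cosine similarity.
# The 0.8 threshold is an assumption; tune it on your own data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def drop_near_duplicates(documents, threshold=0.8):
    tfidf = TfidfVectorizer().fit_transform(documents)
    similarities = cosine_similarity(tfidf)
    kept, dropped = [], set()
    for i in range(len(documents)):
        if i in dropped:
            continue
        kept.append(documents[i])
        for j in range(i + 1, len(documents)):
            if similarities[i, j] >= threshold:
                dropped.add(j)   # too similar to a document we already keep
    return kept

docs = [
    "The model failed to converge on the dataset.",
    "The model failed to converge on this dataset.",  # near duplicate
    "Data cleaning improved the results significantly.",
]
print(drop_near_duplicates(docs))  # the near-duplicate second sentence is dropped
```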

2. Reduce noise

This tip is mainly for small NLP datasets but can also work with big ones. In a former project, I had to create a small GPT-2 model with limited data. I spent days trying to improve it with many trials, and most of them failed. The one that gave great results was noise reduction.

Noise always disturbs a signal and makes it harder to understand. In NLP, noise refers to characters or groups of characters that are semantically similar but different for a computer.

For instance, “I’m” and “I am”, “???” and “?” or “;” and “,”.

Even if NLP models are built to deal with these cases, they cannot handle them perfectly, and they will always perform better if we reduce noise.

Consequently, I completely reworked the NLP dataset by replacing noisy data ("I'm" with "I am", and so on) until I had a fully clean dataset, as if written by a perfect robot. Of course, I had to apply the same logic to the user input, but that was easy because all the rules were already set.
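
As a sketch of what those replacement rules can look like (the rule list below is illustrative, not the exact one I used), a single normalization function applied to both the dataset and the user input keeps everything consistent:

```python
# Minimal sketch of normalization rules for reducing noise.
# The rule list is illustrative; extend it with the noise found in your data.
import re

RULES = [
    (re.compile(r"\bI['’]m\b"), "I am"),
    (re.compile(r"\bdon['’]t\b"), "do not"),
    (re.compile(r"\?{2,}"), "?"),     # "???" -> "?"
    (re.compile(r";"), ","),          # unify separators
    (re.compile(r"\s{2,}"), " "),     # collapse extra whitespace
]

def normalize(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(normalize("I'm not sure??? don't  worry; it works"))
# -> "I am not sure? do not worry, it works"
```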

3. Increase variability

GPT-4chan was one of the most controversial NLP models in recent history, for 2 reasons:

  • It was unethical
  • Its results were great

And if the results were great, it was mainly because it was unethical. Here is why.

Every NLP model is built on a vocabulary corpus that essentially represents a semantic space.

If you limit that semantic space to a small field that only focuses on "good" things, the model will never learn the contrary and will reach poor results.

It is just like negative prompting: a model must know what is good and bad to understand reality better.

The best way to increase variability is to add negative data in some manner. This does not apply to every dataset, mostly to the ones that involve human interaction, but it can trigger new ideas for improvement in any case.

For instance, if your dataset is made of guideline documents, it should contain not only best practices but also risks and dangers.
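
As a rough sketch of how to check that balance, you can count how many documents contain "positive" versus "negative" content; the keyword lists below are illustrative assumptions to adapt to your domain.

```python
# Rough sketch: measure how much of a guidelines corpus covers risks/dangers
# versus best practices. The keyword lists are illustrative assumptions.
POSITIVE_TERMS = {"best practice", "recommended", "should"}
NEGATIVE_TERMS = {"risk", "danger", "failure", "avoid"}

def coverage(documents):
    pos = sum(any(t in d.lower() for t in POSITIVE_TERMS) for d in documents)
    neg = sum(any(t in d.lower() for t in NEGATIVE_TERMS) for d in documents)
    return pos / len(documents), neg / len(documents)

docs = [
    "Best practice: always back up data before deployment.",
    "Risk: skipping code review can lead to production failures.",
]
print(coverage(docs))  # share of documents with positive vs. negative content
```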

4. Find the right text length balance

NLP models learn not only about words but also about phrases and the links between them.

I used to work with small models that only had short phrases to learn from, and I have seen their limitations.

They could only answer with short sentences, and I had to rework the dataset to add many long sentences with a story to tell.

What does it mean? The more coherent links there are between words, phrases, and paragraphs, the better it is.

For instance, if you have a model that learns paragraphs or documents, you can get better results if you classify them first. Even if the learning process is randomized, the links between learned data are not random, so it is important to take this into account.

Keep in mind that data complexity is correlated with data quantity: the more complex the data, the more data you need.
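
A quick way to see whether short texts dominate your dataset is to look at the length distribution. Here is a minimal sketch using whitespace tokenization as a rough approximation of token counts:

```python
# Minimal sketch: bucket texts by approximate length (whitespace tokens)
# to see whether the dataset is dominated by short or long texts.
from collections import Counter

def length_buckets(texts, bucket_size=50):
    counts = Counter()
    for t in texts:
        n_tokens = len(t.split())                     # rough approximation
        counts[(n_tokens // bucket_size) * bucket_size] += 1
    return dict(sorted(counts.items()))

texts = ["A short sentence.", "A much longer paragraph " * 20]
print(length_buckets(texts))  # {0: 1, 50: 1} -> bucket start -> number of texts
```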


5. Apply efficient learning algorithms

Depending on the GPU, you need hours or even days to train an NLP model. Training should be as efficient as possible, and many libraries can help.

For instance, when working with GPT-based models, I always applied the AdamW algorithm to reach an optimal learning score, because it improves the learning as the iterations increase.

Remember that the learning score should never be too high (0.95) or too low (0.7); this can be avoided with a good dropout value, found through many trials.

A good dropout is also crucial to keep the model learning and avoid overfitting.
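
As an illustration, here is a minimal sketch of such a setup, assuming PyTorch and Hugging Face Transformers; the learning rate, dropout values, and schedule lengths are placeholders to tune, not recommendations:

```python
# Minimal sketch: fine-tuning a GPT-2 model with AdamW, a learning-rate
# schedule, and dropout. All hyperparameter values are illustrative.
from torch.optim import AdamW
from transformers import GPT2Config, GPT2LMHeadModel, get_linear_schedule_with_warmup

config = GPT2Config.from_pretrained("gpt2")
config.resid_pdrop = 0.1   # dropout values found through trials, not fixed rules
config.embd_pdrop = 0.1
config.attn_pdrop = 0.1

model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000
)

# Inside the training loop (sketch):
#   loss = model(input_ids, labels=input_ids).loss
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```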

Once you know how to fine-tune some parameters, you can apply a genetic algorithm to find the optimal parameters for your NLP model.
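
As a toy sketch of that idea, the loop below evolves two hyperparameters (learning rate and dropout); `evaluate` is a hypothetical placeholder to replace with a short training run that returns a validation score:

```python
# Toy sketch of a genetic algorithm over (learning rate, dropout).
# `evaluate` is a hypothetical placeholder: replace it with a short training
# run that returns a validation score (higher = better).
import random

def evaluate(lr: float, dropout: float) -> float:
    return -abs(lr - 5e-5) - abs(dropout - 0.1)   # dummy score for illustration

def mutate(lr, dropout):
    new_lr = lr * random.uniform(0.5, 2.0)
    new_dropout = min(0.5, max(0.0, dropout + random.uniform(-0.05, 0.05)))
    return new_lr, new_dropout

population = [(random.uniform(1e-5, 5e-4), random.uniform(0.0, 0.3)) for _ in range(8)]
for _ in range(10):                                        # generations
    ranked = sorted(population, key=lambda p: evaluate(*p), reverse=True)
    parents = ranked[:4]                                   # keep the best half
    population = parents + [mutate(*random.choice(parents)) for _ in range(4)]

print(max(population, key=lambda p: evaluate(*p)))         # best (lr, dropout) found
```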

Conclusion

NLP models like GPT or BERT are relatively new, and there is still much to learn and much to improve.

These tips come from months of training models, and there are probably plenty of others to learn.

What lessons did you learn from your experience with NLP models? Please feel free to share them in the comments.

Who am I?

I am a Full Stack Data Scientist, ranked 3rd on Data Science Stack Exchange in 2022. I develop AI web services in NLP and generative AI.
