Mastering Data Preparation: A Comprehensive Guide

Ali Issa
Apr 25, 2024



We all know that data is the most crucial element in training an AI model. We saw this recently when smaller language models achieved better results than larger ones simply because their datasets were well curated.

So, back to our topic: how can we properly label data? Let’s take speech recognition as an example. Two years ago, I ran into an issue that I wish I had known how to handle. I was working on an Arabic speech recognition model. I had a large set of Arabic voice recordings (around 3k), and my task was to manually transcribe them in the Lebanese dialect (specifically in Arabizi).

While writing the transcriptions, I noticed that I was not consistent. Some recordings were transcribed in a way that differed from others. Although all of the transcriptions were technically correct, inconsistent labels degrade the performance of the model.

Because I had not set clear instructions on how the input and output should look, I ended up rewriting some of the transcriptions multiple times. Even if you are working by yourself, consider that another person might continue your work, so it’s better to set clear instructions on how to label the data.
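One way to make such instructions stick is to encode part of them in code. Below is a minimal sketch of a normalization step for Arabizi transcriptions; the specific rules and the variant spellings are hypothetical examples, not the conventions I actually used.

```python
import re

def normalize_arabizi(text: str) -> str:
    """Apply a fixed set of (hypothetical) transcription conventions."""
    text = text.lower().strip()
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text)
    # Collapse elongated letters ("yallaaaa" -> "yalla").
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Standardize a few spelling variants (example mapping only).
    variants = {"chou": "shou", "enta": "inta"}
    words = [variants.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize_arabizi("Chou   habibi, yallaaaa"))  # -> "shou habibi, yalla"
```

Running every transcription through the same function catches many of the small inconsistencies that creep in when several people (or one tired person) label over weeks.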

When you start working with your data, ask concrete questions about the type of input and its characteristics. For unstructured data like images, that means properties such as contrast and resolution; for structured data, it means deciding which features are worth adding and will improve the model’s accuracy.
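As a rough sketch of answering those questions for images, the snippet below prints the resolution and a simple contrast proxy (standard deviation of pixel intensities) for a folder of images. The folder path is a placeholder, and it assumes Pillow and NumPy are installed.

```python
from pathlib import Path

import numpy as np
from PIL import Image

def profile_images(folder: str) -> None:
    """Print resolution and a rough contrast estimate for each image."""
    for path in Path(folder).glob("*.jpg"):
        img = Image.open(path).convert("L")   # grayscale
        arr = np.asarray(img, dtype=np.float32)
        contrast = arr.std()                  # std of pixel intensities as a contrast proxy
        print(f"{path.name}: {img.size[0]}x{img.size[1]} px, contrast ~ {contrast:.1f}")

profile_images("data/raw_images")  # hypothetical path
```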

𝐔𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞𝐝 𝐃𝐚𝐭𝐚 𝐚𝐧𝐝 𝐒𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞𝐝 𝐃𝐚𝐭𝐚

People tend to have more difficulty labeling structured data, for example visualizing training data composed of many features, such as the training data of a recommendation system. Unstructured data such as images, audio, and text is easier to label.

Applying 𝐝𝐚𝐭𝐚 𝐚𝐮𝐠𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧 to unstructured data is also easier. For example, we can change the contrast or rotation of images, add noise to audio, or paraphrase text. With structured data it is harder: in the case of house pricing, you can’t generate random feature values and train the model on them, because they won’t reflect the real world. This is just one example, but it shows why augmenting unstructured data is easier.
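To make this concrete, here is a small sketch of the kinds of augmentations mentioned above, using Pillow for images and NumPy for audio. The file path and the waveform are placeholders standing in for real data.

```python
import numpy as np
from PIL import Image, ImageEnhance

# Image: adjust contrast and rotate.
img = Image.open("sample.jpg")                           # placeholder path
high_contrast = ImageEnhance.Contrast(img).enhance(1.5)  # boost contrast by 50%
rotated = img.rotate(10, expand=True)                    # rotate by 10 degrees

# Audio: add low-amplitude Gaussian noise to a waveform array.
waveform = np.random.randn(16000).astype(np.float32)     # stand-in for a 1-second clip at 16 kHz
noise = 0.005 * np.random.randn(len(waveform)).astype(np.float32)
augmented = waveform + noise
```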

𝐁𝐢𝐠 𝐃𝐚𝐭𝐚 𝐚𝐧𝐝 𝐒𝐦𝐚𝐥𝐥 𝐃𝐚𝐭𝐚

With small data, you can check that every example is well labeled, but this becomes much harder with larger datasets. Labelers cannot realistically review every problem case, so the focus shifts to the labeling process itself: how to collect the data, how to label it, what instructions to give, and so on.
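One simple process check, sketched below with pandas, is to flag inputs that received more than one distinct label. The column names and toy values are assumptions for illustration.

```python
import pandas as pd

labels = pd.DataFrame({
    "clip_id": ["a1", "a2", "a2", "a3"],
    "transcript": ["shou habibi", "yalla", "yala", "kifak"],
})

# Group by the input identifier and keep groups with more than one distinct label.
conflicts = (
    labels.groupby("clip_id")["transcript"]
    .nunique()
    .loc[lambda n: n > 1]
)
print(conflicts)  # clip a2 was transcribed two different ways
```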

Feel free to follow me for more content like this, and be sure to connect with me on LinkedIn as well.
