Data Collection, Labeling, and Streamlined Data Pipelines

Ali Issa
3 min read · May 7, 2024



HLP (Human Level Performance)

When we have a set of training data that we’d like to label to create ground truth labels, we often depend on humans to label this data. One thing to note about HLP is that when a person labels the data and we compare their output to the ground truth, we are really measuring how closely this person’s answers match those of the person who created the ground truth labels.

Inconsistent labeling instructions can lead to a significant drop in measured HLP. With consistent labeling instructions, agreement close to 100% is achievable, and consistently defined data is crucial for good machine learning model performance. HLP is mostly used with unstructured data.
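To make the idea concrete, here is a minimal sketch (the task and labels are hypothetical) of estimating HLP as the fraction of examples where one labeler agrees with the ground-truth labels:

```python
# Minimal sketch: estimating HLP as simple agreement between a labeler's
# answers and the ground-truth labels. The labels below are hypothetical.

def human_level_performance(human_labels, ground_truth):
    """Fraction of examples where the labeler matches the ground truth."""
    assert len(human_labels) == len(ground_truth)
    matches = sum(h == g for h, g in zip(human_labels, ground_truth))
    return matches / len(ground_truth)

# Hypothetical labels for 8 examples of a cat-vs-dog labeling task.
labeler_a    = ["cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog"]
ground_truth = ["cat", "dog", "cat", "cat", "cat", "dog", "dog", "dog"]

print(human_level_performance(labeler_a, ground_truth))  # 0.75
```

If the labeling instructions were ambiguous, a second labeler might score much lower here, not because they are less capable, but because they interpreted the instructions differently from whoever wrote the ground truth labels.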

Data collection

Size of Data

It’s better to start with a small amount of data, train the model, and analyze the results rather than spending a lot of time collecting a huge amount of data. On each iteration (collect data, train, analyze), we might notice something in the results that requires a different solution than collecting more data.

One way to properly scope data collection is to ask the team how much data they can collect in ‘X’ days, rather than asking how many examples we need (unless we already know the number of examples required to train the model).

Source of Data

We need to compare the available sources to decide where to collect data from, whether it’s our own data or crowdsourced data. In the case of speech recognition, it could be a website where people read a given text aloud; this way we can collect data, and contributors may get paid for it. I remember using Common Voice in the past, where users were asked to record their voice or write transcriptions. Besides crowdsourcing, paying for labels or buying data can be a solution.

Note: The cost and time required for each source should factor into the decision.

Data labeling

This can be done by the team, outsourced to specialized companies, or crowdsourced. It’s advised not to increase the data quantity drastically with each iteration (e.g., 10x), since a small amount of additional data is often enough to guide the next decision based on the model’s performance.

Data Pipeline

In a PoC, we tend to write code in an unorganized way: the code lives on different laptops, each person has their own working environment, and so on. When moving to production, more sophisticated tools like Apache Airflow and Kubeflow can be used to make the process built in the PoC replicable.
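As a rough illustration (not the exact setup from the course), the same steps that were run by hand in the PoC can be expressed as an Airflow DAG; the task functions here are hypothetical placeholders for your own code:

```python
# Minimal sketch of a data pipeline as an Airflow DAG (Airflow 2.x style).
# The task functions are hypothetical placeholders for real PoC code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def collect_data():
    print("download raw audio / text ...")


def clean_data():
    print("remove corrupt or duplicate examples ...")


def label_data():
    print("send the batch to the labeling team or tool ...")


with DAG(
    dag_id="data_pipeline_poc",
    start_date=datetime(2024, 5, 1),
    schedule_interval=None,  # trigger manually while iterating
    catchup=False,
) as dag:
    collect = PythonOperator(task_id="collect_data", python_callable=collect_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    label = PythonOperator(task_id="label_data", python_callable=label_data)

    collect >> clean >> label  # explicit execution order instead of ad-hoc scripts
```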

When building a data pipeline, it’s better to track the source of the data (data provenance) and the sequence of execution steps (data lineage). Adding metadata such as the timestamp, the model_name used, and other parameters can be useful and simplifies debugging and monitoring of the system.
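For example, each pipeline step could write a small metadata record next to its output; the fields and file names below are purely illustrative, not a fixed schema:

```python
# Minimal sketch: saving metadata (provenance and lineage hints) for a
# pipeline step. Field names and paths are illustrative, not a standard.
import json
from datetime import datetime, timezone


def save_step_metadata(step_name, input_path, output_path, model_name=None, **params):
    record = {
        "step": step_name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "input": input_path,    # where the data came from (provenance)
        "output": output_path,  # what this step produced (lineage)
        "model_name": model_name,
        "params": params,
    }
    with open(output_path + ".meta.json", "w") as f:
        json.dump(record, f, indent=2)


save_step_metadata(
    "clean_data",
    input_path="audio_batch_01_raw.csv",
    output_path="audio_batch_01_clean.csv",
    min_duration_sec=1.0,
)
```

When something looks off in production, records like these make it much easier to trace which data and parameters produced a given output.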

Check the MLOps specialization provided by DeepLearning.AI for more info.
