Machine Translation Using LLMs

Translating Skills at Scale Using Large Language Models (LLMs) with Python

Sajith Gowthaman
6 min read · Oct 15, 2024

In a world that’s becoming increasingly interconnected, companies often need to communicate in multiple languages. For instance, a company’s skills taxonomy — a list of skills that employees or job seekers possess — may need to be available in several languages, from English to Spanish, French, Chinese, and beyond. But translating these skill names manually is time-consuming, especially when we’re talking about thousands of entries.

Enter large language models (LLMs), like OpenAI’s GPT-3.5, which can assist by translating these terms quickly and accurately. In this article, I will explain how I, along with another ML expert (Joshua Mathias), built an efficient translation pipeline using Python and LLMs, focusing on automation, error handling, scaling, and ensuring that the translated data is stored properly.

Whether you are a beginner or have no technical background at all, by the end of this article you will have a solid understanding of how skill translations can be automated using AI, and why it’s important to handle these tasks efficiently when working with large datasets.

Purpose: Why Do We Need to Translate Skills?

Imagine a company based in the US that expands into several non-English-speaking countries, like France, Brazil, and China. This company may have a list of skills in its database (e.g., “Software Development,” “Project Management”) that needs to be available in multiple languages. While hiring translators to manually translate each skill could work, it would be slow, expensive, and error-prone.

Use case: When a data scientist onboards the platform, their current skills and growth goals are captured, allowing the system to identify both the skills they already have and the ones they need to develop, such as Python. This information is then used to recommend personalized content from the platform’s content database, ensuring that the material is presented in the user’s preferred language. By tailoring learning resources to their specific skill set and career aspirations, the platform helps the user upskill more efficiently, enabling continuous growth in their profession and helping them advance in their career.

By using an LLM like OpenAI’s GPT-3.5, we can automate this task. The AI can translate large sets of skills quickly, ensuring accuracy and consistency across the entire dataset. Not only does this save time, but it also reduces costs and avoids human error.

The Process: How Do We Translate the Skills?

We designed a system called the SkillTranslatorProc class, which automatically translates skills from English into any other language using an LLM like GPT-3.5. The process happens in three key steps:

  1. Batch Processing: Instead of translating thousands of terms all at once, the system breaks the dataset into smaller batches, so that each batch fits within a single prompt and the process performs well.
  2. Error Handling and Retries: The system anticipates potential issues, like exceeding the API’s rate limit (i.e., too many requests in a short time), and retries the translation after a short delay.
  3. Efficient Storage: Translations are saved in a structured format (DataFrame) so that we know exactly which skill was translated into which language.
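
The three steps above can be sketched as a class outline. This is a minimal illustration, not the original implementation: apart from the SkillTranslatorProc and translate_skills names mentioned in the article, the method and attribute names here are my own assumptions.

```python
class SkillTranslatorProc:
    """Translates a list of skill labels into a target language in batches."""

    def __init__(self, batch_size=20, max_retries=3):
        self.batch_size = batch_size    # skills sent per API call (Step 1)
        self.max_retries = max_retries  # retry budget for rate limits (Step 2)
        self.new_translations = {}      # lang_code -> {label: translation} (Step 3)

    def make_batches(self, labels):
        # Step 1: split the full list into fixed-size batches
        return [labels[i:i + self.batch_size]
                for i in range(0, len(labels), self.batch_size)]

    def translate_skills(self, skills, language):
        # Step 2: build the prompt and call the LLM (stubbed out in this sketch)
        raise NotImplementedError("call the OpenAI API here")
```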

Let’s break these down in more detail.

Step 1: Batch Processing for Efficient Translation

Translating thousands of entries at once can overwhelm both the AI model and your system. Instead, we split the dataset into small batches of 20 or so skills at a time.

This keeps the model from being overwhelmed with requests, and because each prompt carries many skills, it considerably reduces the number of API calls and tokens, making the process more cost-efficient.

Here’s a snippet of how we do this:

batches = [df[i:i+batch_size] for i in range(0, len(df), batch_size)]  # Break the DataFrame into batches
for batch in batches:
    skills = batch['label'].dropna().tolist()  # Skill names, skipping empty rows
    batch_translations = self.translate_skills(skills, language)

This simple loop breaks down the list of skills into smaller groups, allowing each batch to be sent to the AI model for translation.

Step 2: The Translation Prompt

To translate a skill, we send a specific prompt to the AI model. Think of a prompt as the set of instructions you give to the model, telling it exactly what to do. In this case, we are instructing the model to translate skills from one language (e.g., English) to another (e.g., Spanish) and provide the result in a specific format.

Here’s an example of what the prompt looks like:

prompt = f"""
Please translate the skills below from English to {target_lang}.
Write each line as the English skill and its {target_lang} translation, separated by "->".
For example: Staff Development -> Desarrollo del Personal.
Use English title capitalization.

Translate the following skills:
"""

Let’s break this down:

  • We tell the AI model to translate skills from English to the target language, for instance, Spanish.
  • The result should be in the format English -> Translated.
  • The example provided (Staff Development -> Desarrollo del Personal) shows the model exactly what we want.

We then parse out the translation from the response by using the format specified.

After the prompt, we list the skills that need to be translated. The AI model processes this and returns the translated text.
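
Because the prompt fixes the output format, parsing the response is a matter of splitting on "->". A minimal sketch of that step (the helper name parse_translations is my own, not from the original code):

```python
def parse_translations(response_text):
    """Map each English skill to its translation from 'English -> Translated' lines."""
    translations = {}
    for line in response_text.splitlines():
        if "->" not in line:
            continue  # skip blank lines or any commentary the model adds
        english, _, translated = line.partition("->")
        translations[english.strip()] = translated.strip()
    return translations

reply = "Staff Development -> Desarrollo del Personal\nProject Management -> Gestión de Proyectos"
print(parse_translations(reply))  # maps each English skill to its translation
```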

Step 3: Handling Runtime Errors and API Limitations

When working with AI models like GPT-3.5, we need to be mindful of rate limits — the number of requests the model can handle in a certain period. If we exceed this limit, the system can fail. To avoid these issues, we built error-handling logic that retries the request if it fails, with an increasing wait time between retries.

Here’s how it works:

retry_count = 0
successful = False
while not successful and retry_count < 3:
    try:
        batch_translations = self.translate_skills(skills, language)
        successful = True
    except openai.error.APIError as e:
        if e.http_status == 429:  # Rate limit hit
            retry_count += 1
            wait_time = 60 * retry_count  # Wait longer for each retry
            time.sleep(wait_time)
        else:
            raise  # Other errors: don't retry

If the system hits the API’s rate limit, it waits (starting from 60 seconds and increasing with each retry) before trying again. This ensures that the process doesn’t fail prematurely and can handle large-scale translation requests over time.
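
The same pattern can be factored into a small reusable helper. This is a generic sketch with the same 60-seconds-per-attempt backoff as above; RuntimeError stands in for the OpenAI rate-limit error, and the injectable sleep parameter just makes the helper testable.

```python
import time

def with_retries(fn, max_retries=3, base_wait=60, sleep=time.sleep):
    """Call fn(); on a rate-limit error, wait base_wait * attempt seconds and retry."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except RuntimeError:                # stand-in for the API's rate-limit error
            if attempt == max_retries:
                raise                       # out of retries: surface the error
            sleep(base_wait * attempt)      # 60s, 120s, ... between attempts
```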

Step 4: Saving Translations to Ensure Accuracy

One of the most important steps in this process is saving the translated skills in the right place. After the model provides the translations, we store them in a structured format using a DataFrame (a table-like structure used in Python).

Here’s how we make sure that each translation is saved accurately:

for lang_code, lang_translations in self.new_translations.items():
    for label, translation in lang_translations.items():
        translations_df.loc[translations_df['label'] == label, lang_code] = translation

This snippet ensures that the translation is placed in the correct row of the DataFrame, aligning the original English skill with its translated counterpart in the appropriate language column. By storing translations in a dictionary first and then saving them into the DataFrame, we guarantee that no data is lost or mismatched.

We save the translations into the DataFrame incrementally and write the DataFrame to disk as we go, so any translations already completed are preserved if the script crashes. If the script is re-run for any reason, it skips the translations that were already saved.
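
A sketch of that resume logic, assuming one DataFrame column per language code (the helper name pending_labels is illustrative, not from the original code):

```python
import pandas as pd

def pending_labels(translations_df, lang_code):
    """Return only the labels that still lack a translation for lang_code."""
    if lang_code not in translations_df.columns:
        translations_df[lang_code] = pd.NA  # first run for this language
    mask = translations_df[lang_code].isna()
    return translations_df.loc[mask, "label"].dropna().tolist()

df = pd.DataFrame({"label": ["Python", "Leadership"], "es": ["Python", pd.NA]})
print(pending_labels(df, "es"))  # prints ['Leadership']
```

After each batch, the DataFrame can be written out (e.g. with translations_df.to_csv(...)), so a crash loses at most one batch of work.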

Observations

  • Batch Processing for Efficiency: Splitting the translations into batches ensures that the process runs smoothly, without overloading the system or the AI model.
  • Robust Error Handling: The retry mechanism allows the system to recover from temporary issues, such as hitting the API’s rate limit, without failing the entire process.
  • Accurate Data Storage: By storing translations in memory first and then committing them to the DataFrame, we ensure that each translation ends up in the right place.

Conclusion

By leveraging large language models like GPT-3.5, we can automate skill translation processes that would otherwise take days or even weeks to complete manually. This approach not only saves time and reduces costs but also ensures that the translations are accurate and consistent across different languages.

The key advantage of this approach over traditional machine translation, which often translates words in isolation, is that the prompt provides context by indicating that these are skills to be translated. Large language models (LLMs) are generally better at leveraging this context, resulting in more accurate translations compared to traditional machine translation models.

What makes this solution special is its ability to handle large datasets, its error-handling capabilities, and the careful attention to ensuring that translations are stored correctly. This makes the system robust, efficient, and ready to handle large-scale translation tasks.

Whether you’re working with taxonomies, product catalogs, or any other large dataset that requires translation, this approach can help you scale your efforts and maintain high-quality results.

#LLM #Translation #OpenAI #MachineLearning #AI #Python #DataScience #SoftwareDevelopment
