11 min read · May 7, 2024

Deep dive: Fixing legacy problems using AI

This article is part of a deep dive into tech issues we encounter and how we solve them at Decoders.

What do you do when your blog data gets corrupted and dropping the content would seriously hurt your SEO score? Amenitiz faced exactly this problem during their last migration to a new server. They have 600 blog articles in 6 languages, written by people about topics that matter to the industry. But somewhere along the way, all the special characters “éíáiôèêî” were lost, and even “’” disappeared while moving from the legacy system. Now, you could send in 6 content moderators to fix these 600 articles. Or, you could send in one guy with a laptop 👨‍💻.


[Image: https://decoders-content-production.s3.eu-central-1.amazonaws.com/media/cleaning_goal_min_45c0bf025d.png]

The problem is a bunch of broken characters in a WordPress blog. Step one: get all the data. Luckily, WordPress has an API for getting and updating its content, so we could use it both to export the data and, later, to push the fixes back. We pulled the autogenerated sitemap from WordPress, fetched all the blog posts individually, and saved them locally. Now we have something to play with.
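As a rough sketch, the export step could look something like this, assuming the standard WordPress REST API and a sitemap at /sitemap.xml (the blog URL and output directory are placeholders, and a real WordPress sitemap may be an index pointing to sub-sitemaps):

```python
import json
import pathlib
import re

import requests

BLOG = "https://www.example.com"  # placeholder for the real blog URL
OUT = pathlib.Path("posts")
OUT.mkdir(exist_ok=True)

# The autogenerated sitemap lists every public URL of the blog.
sitemap = requests.get(f"{BLOG}/sitemap.xml", timeout=30).text
urls = re.findall(r"<loc>(.*?)</loc>", sitemap)

for url in urls:
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    # /wp-json/wp/v2/posts?slug=... returns the matching post, content included.
    resp = requests.get(f"{BLOG}/wp-json/wp/v2/posts",
                        params={"slug": slug}, timeout=30)
    for post in resp.json():
        (OUT / f"{post['id']}.json").write_text(json.dumps(post, ensure_ascii=False))
```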



The content contained two problems that arose from the migration. One was absolute URLs in the content, which we fixed with some intelligent find-and-replace. That part is pretty easy, especially if you keep talking to GitHub Copilot for a while; it can help you find the necessary solutions pretty quickly. The hard part is finding a suitable replacement for the � characters in the content. These get inserted when a table's encoding is off, and the character carries no information about what it was before. They could appear in any word, and even a single letter could become this character.
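The URL cleanup itself boils down to a regex substitution. A minimal sketch, assuming the legacy domain is old.example.com (a placeholder):

```python
import re

# Assumed legacy domain; the real one came from the old server.
LEGACY = re.compile(r"https?://old\.example\.com")

def fix_links(html: str) -> str:
    # "https://old.example.com/blog/post" -> "/blog/post"
    return LEGACY.sub("", html)

print(fix_links('<a href="https://old.example.com/blog/post">post</a>'))
# -> <a href="/blog/post">post</a>
```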






At first glance, finding a replacement for a single � is impossible; every special character the system didn't know what to do with got replaced by it. So let's widen the scope to the whole word, e.g. cl�vacances, tr�s vari�es, r�ver, sacr�, renomm�e, h�teli�res. You might notice these blog posts are primarily in languages that use a lot of special characters, like French and Portuguese, and the content is about hoteliers, aka Amenitiz's business. But I don't speak French or Portuguese. So I had to find either someone who does, or an AI that is supposedly good at language processing, to get a replacement for these words.
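Hunting down the broken words is straightforward once you know the � is U+FFFD, the Unicode replacement character. A small sketch of how they could be extracted:

```python
import re

# \uFFFD is the Unicode replacement character, i.e. the � itself.
BROKEN_WORD = re.compile(r"\w*\uFFFD[\w\uFFFD]*")

text = "Cette stabilité fait la renommée de ces chaînes h�teli�res de luxe."
print(BROKEN_WORD.findall(text))  # ['h�teli�res']
```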



Talking to an AI is typically conversational: you pose a question, and it gives back an answer a natural person could provide. That's not really what I want here. I want to tell it the problem [Fix my problem plz, I need a replacement for the corrupted word “h�teli�res”] and it responds [This appears to be an encoding issue. The correct word is "hôtelières"]. Helpful, but quite annoying to then have to dig the answer out of the response. So, to start, I added a system prompt stating it should return exactly one word or nothing. With some checks, this gave me an input-output flow I could work with in code.
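A minimal sketch of that constrained prompt, using the OpenAI Python client; the model name and the exact wording of the system prompt here are my assumptions, not the original ones:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed wording; the point is forcing the answer to one word or nothing.
SYSTEM = (
    "You repair words corrupted by encoding errors, shown as �. "
    "Reply with exactly one word: the repaired word. "
    "If you cannot repair it, reply with nothing."
)

def suggest_replacement(word: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": word},
        ],
    )
    return (resp.choices[0].message.content or "").strip()

print(suggest_replacement("h�teli�res"))  # hopefully: hôtelières
```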



Right, let's run it… Around 60% of the time, it comes up with something semi-accurate. Again, my French is pretty bad, so I had a colleague check. That resolves 60% of the weird characters, but not 100%. Let's see if we can do better. I noticed that the AI only gives the right output if you supply it with the right input and instructions. So I tried to give it some context. Broken words like h�teli�res are used in a sentence, for example:


Cette stabilité fait la renommée de ces chaînes h�teli�res de luxe, mais ne permet aucune personnalisation ciblée selon la clientèle.


Through some regex and coding magic, I looked forwards and backwards in the text to figure out the sentence the word appears in, or at least a big part of it.
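A sketch of that context extraction, naively treating '.', '!' and '?' as sentence boundaries (the real regex magic may differ):

```python
def sentence_around(text: str, word: str) -> str:
    i = text.index(word)
    # Walk back to the previous sentence end (or the start of the text)...
    start = max(text.rfind(p, 0, i) for p in ".!?")
    start = 0 if start == -1 else start + 1
    # ...and forward to the next sentence end (or the end of the text).
    ends = [e for e in (text.find(p, i) for p in ".!?") if e != -1]
    end = min(ends) + 1 if ends else len(text)
    return text[start:end].strip()

text = ("La plupart des chaînes sont stables. Cette stabilité fait la "
        "renommée de ces chaînes h�teli�res de luxe, mais ne permet aucune "
        "personnalisation ciblée selon la clientèle. Voilà le problème.")
print(sentence_around(text, "h�teli�res"))
# -> the middle sentence, with the broken word in context
```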


[Image: https://decoders-content-production.s3.eu-central-1.amazonaws.com/media/cleaning_blog_pipeline_min_5a5df0003c.png]

Now, bringing it all together: the AI knows the word we want to fix and what language it's in, it returns only one word, and it has context about where the word is used. That gives us a solution good enough to run at scale. Initial tests showed correct solutions for 95% of the questions I provided. I also started storing all the inputs and ChatGPT's responses, so we never ask the same question twice. Ultimately, I spent a few euros on OpenAI credits on something that would have taken content moderators days or weeks to fix.
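The cache can be as simple as a JSON file keyed by the question; the file name and key format below are illustrative:

```python
import json
import pathlib

CACHE_FILE = pathlib.Path("replacements.json")  # illustrative file name
cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def cached_replacement(word: str, sentence: str, ask_model) -> str:
    key = f"{word}|{sentence}"  # illustrative key format
    if key not in cache:
        cache[key] = ask_model(word, sentence)  # only hit the API on a miss
        CACHE_FILE.write_text(json.dumps(cache, ensure_ascii=False, indent=2))
    return cache[key]
```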



If you liked this story, or are curious to hear more like it, follow us on LinkedIn.


Johannes Sanders
Developer @ Decoders