Migrating News Articles

I’m new to FreeUKGenealogy. As an experienced software engineer I was looking for a new challenge: a chance to learn something, use my skills and pick up a bit of a hobby. I discovered that FreeUKGenealogy were looking for a WordPress Developer on socialcoder.org.

Shortly after I registered I had a quick discussion with Denise. I understood that the website was undergoing a refresh and that, ideally, some of the news articles would be copied from the current site across to the new WordPress site.

I’m not really a WordPress developer (I’ve written the odd blog post in the past), but I was up for the challenge. I combined it with a desire to learn a new programming language and decided to try Python.

Essentially I needed to extract the current news articles, transform them to “WordPress” format and upload them (Extract, Transform, Load). It soon became apparent that what I thought was challenging about this would be easy, but what I thought would be easy would be quite tricky…

The aim was to define a repeatable process. I decided that the news articles on the current site were unlikely to change and that I wouldn’t get much better than the raw HTML files. I downloaded a tool, seo-spider, to scrape the current contents. It was relatively straightforward to use (I didn’t really need any of its multitude of options) and it was able to save the raw HTML files (even though they ended up with meaningless file names).

Inspecting these files I discovered the news articles were contained in <article> elements, so I thought I could use that in the extract process.

Next I looked at how to upload the articles to WordPress. After considering a few options (create a site archive for upload, write directly to the database, use a local WordPress instance running in Docker as an interim step) I discovered the WordPress API. It was straightforward to download content from the API (`GET /wp/v2/posts`), but for some reason I don’t yet understand, the built-in WordPress API does not come with an authentication module! As someone who works with RESTful APIs every day this was quite a surprise. Alas, I couldn’t yet experiment with uploading news articles.
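For illustration, reading posts back with Python and the requests library looks roughly like this (the site URL is just a placeholder, not the real one):

```python
# Minimal sketch: fetch existing posts from the WordPress REST API.
import requests

BASE_URL = "https://example.org/wp-json"  # placeholder site URL

response = requests.get(f"{BASE_URL}/wp/v2/posts", params={"per_page": 10})
response.raise_for_status()

for post in response.json():
    print(post["id"], post["title"]["rendered"])
```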

I found a few WordPress authentication provider plugins and opted for application-passwords for its simplicity. I created an application password and used Postman (postman.com) to upload some test news articles.
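The test uploads themselves went through Postman, but the equivalent request sketched in Python looks something like this (the username, application password and URL are placeholders):

```python
# Sketch: create a post via the REST API using an application password
# with HTTP Basic auth. All credentials and URLs here are placeholders.
import requests

BASE_URL = "https://example.org/wp-json"
AUTH = ("wp_user", "xxxx xxxx xxxx xxxx xxxx xxxx")  # application password

payload = {
    "title": "Test news article",
    "content": "<p>Hello from the migration script.</p>",
    "status": "draft",
}

response = requests.post(f"{BASE_URL}/wp/v2/posts", json=payload, auth=AUTH)
response.raise_for_status()
print("Created post", response.json()["id"])
```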

So far so good. I was happy with the extract (static HTML files + parse as HTML + filter for <article> tags) and the upload (use Postman or a similar technique to post content to the WordPress API). The next step was the transform.

I decided I pretty much just wanted to upload the raw HTML. As I had already decided on using Python, I found a library called BeautifulSoup. Essentially it’s an HTML parser, but it has quite good support for navigating the content, filtering, editing and transforming the output.
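As a rough sketch of the extract step (the file name is just for illustration), pulling the <article> elements out of a saved page looks like this:

```python
# Sketch: parse one of the scraped HTML files and pull out its <article> elements.
from pathlib import Path
from bs4 import BeautifulSoup

html = Path("scraped/page-0001.html").read_text(encoding="utf-8")  # placeholder file
soup = BeautifulSoup(html, "html.parser")

for article in soup.find_all("article"):
    header = article.find("header")
    title = header.get_text(strip=True) if header else "(untitled)"
    print(title)
```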

The script file is here: parse.py

The article date was a problem. I just couldn’t get the Python date parser to work properly with the format, but I found a workaround (Line 109). Some defaults were set for dates that were missing or couldn’t be parsed.
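The real format string and default live in parse.py; purely as an illustration of the fallback idea, it amounts to something like this: try to parse the date, and fall back to a fixed default when parsing fails.

```python
# Illustration only: the format string and default date are assumptions,
# not the values used in parse.py.
from datetime import datetime

DEFAULT_DATE = datetime(2020, 1, 1)

def parse_article_date(text):
    try:
        return datetime.strptime(text.strip(), "%d %B %Y")  # e.g. "12 March 2019"
    except (ValueError, AttributeError):
        return DEFAULT_DATE
```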

I had to filter the text to extract the author (Line 103). Fortunately there were only two authors so it wasn’t too complex.
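With only two known authors, a simple membership test over the names is enough. A sketch with placeholder names rather than the real ones:

```python
# Hypothetical sketch: the author names are placeholders.
KNOWN_AUTHORS = ("Author One", "Author Two")

def extract_author(footer_text):
    for author in KNOWN_AUTHORS:
        if author in footer_text:
            return author
    return KNOWN_AUTHORS[0]  # default when no match is found
```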

Various special characters were a problem and had to be replaced (Line 79).
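The exact set of replacements is in parse.py; the general shape is a small table of substitutions, roughly like this:

```python
# Illustration of the kind of character clean-up involved; the actual
# replacements in parse.py may differ.
REPLACEMENTS = {
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u00a0": " ",   # non-breaking space
}

def clean_text(text):
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text
```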

In particular I had a problem with <a> links. I just couldn’t find a suitable combination to replace the " characters in the href links. No matter what I tried, either the Postman Runner (the Postman tool for running in batch mode) or the WordPress API would fail. Eventually I found the solution: as simple as replacing \" with '! See Line 82.
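The idea, sketched here rather than copied from Line 82, is simply to swap the double quotes around attribute values for single quotes once the HTML has been serialised:

```python
# Sketch of the quote fix: swap double quotes in serialised tags for single
# quotes so the payload survives Postman and the WordPress API.
def fix_attribute_quotes(tag_html):
    # e.g. '<a href="https://example.org/page">' -> "<a href='https://example.org/page'>"
    return tag_html.replace('"', "'")
```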

As for the overall structure, it was a case of iterating over the <article> tags, extracting the header and footer, and iterating over the <section> tags. Categories were also extracted from the article header and a “migrated” tag was set on each article to support filtering on WordPress.
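A hedged sketch of that loop (the selectors and the shape of the old site’s headers and categories are assumptions here, not a copy of parse.py):

```python
# Sketch of the transform loop: the markup details are assumed for illustration.
from bs4 import BeautifulSoup

def transform(html):
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for article in soup.find_all("article"):
        header = article.find("header")
        footer = article.find("footer")
        title_tag = header.find(["h1", "h2"]) if header else None
        body = "".join(str(section) for section in article.find_all("section"))
        categories = [a.get_text(strip=True) for a in header.find_all("a")] if header else []
        posts.append({
            "title": title_tag.get_text(strip=True) if title_tag else "",
            "content": body,
            "categories": categories,
            "author_text": footer.get_text(strip=True) if footer else "",
            "tags": ["migrated"],  # makes migrated articles easy to filter in WordPress
        })
    return posts
```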

Images were another problem. I wrote a different script to extract the links to the images, used curl to download them and used WordPress to bulk upload them. They all ended up in a folder /wp-content/uploads/2020/12/ so I could edit the <img> tags to hold a relative link to the files.
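That image handling, sketched with an assumed helper rather than the real script, boils down to collecting the <img> sources for the curl step and pointing each tag at the uploads folder:

```python
# Sketch: collect image URLs (for a separate curl download) and rewrite each
# <img> tag to a relative link under the WordPress uploads folder.
from urllib.parse import urlparse

UPLOAD_PATH = "/wp-content/uploads/2020/12/"

def collect_and_rewrite_images(soup):
    # soup is a parsed BeautifulSoup document
    image_urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:
            continue
        image_urls.append(src)
        filename = urlparse(src).path.rsplit("/", 1)[-1]
        img["src"] = UPLOAD_PATH + filename  # relative link to the uploaded copy
    return image_urls
```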

At this stage I was fairly happy with the output, but there was just a bit of tidying up to do, mostly the site links. I decided to create a map of these (Line 33) and replace them with relative links.
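The map itself is just a dictionary of old URLs to relative paths; the entries below are placeholders, not the real ones from Line 33:

```python
# Hypothetical link map: the URL pairs are placeholders for illustration.
LINK_MAP = {
    "https://www.old-site.example/news": "/news",
    "https://www.old-site.example/about": "/about",
}

def rewrite_site_links(soup):
    for a in soup.find_all("a", href=True):
        for old, new in LINK_MAP.items():
            if a["href"].startswith(old):
                a["href"] = a["href"].replace(old, new, 1)
    return soup
```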

Overall, the output isn’t perfect, but I’m quite satisfied with the result. It feels relatively clean and robust. I only downloaded the source HTML once, though I created a semi-automated process and was able to re-run the transform and upload steps many times with little effort.

There were times when writing a near-automated process felt like more effort than simply re-writing all the articles by hand, but I’m glad I persevered. I’ve learnt a lot and I hope this helps make for a smooth transition to the WordPress site.