SED What?!



The GNU stream editor (sed) on Linux is a data-manipulating beast! For big jobs like migrating a WordPress website from one WordPress theme to another, sed makes it easy to remove not just a single line but many lines throughout a file, all in seconds.

The performance of this command is like no other. It can hammer through a 100 megabyte (MB) file in just a couple of seconds.

Let’s take a look at some examples from the command line:

The most basic way to use sed is with a file. Starting with a WordPress export file, let’s remove any guid and post ID references in the file. The guid is linked to the post ID so the post ID will also be removed. This gives the new WordPress import a clean slate.

The purpose of this example is to show the power of sed. “CAUTION” keep in mind that removing post IDs also removes every image link tied to those post IDs. The import will be clean, but image links will be broken and the site will have to be rebuilt to reconnect images to each post.

The WordPress WXR export file will look like this before removing the guid:


	<item>
		<title><![CDATA[Example Post Title]]></title>
		<link>https://example.com/example-post-title/</link>
		<pubDate>Thu, 24 Mar 2022 03:15:41 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
		<guid isPermaLink="false">https://example.com/example-post-title/</guid>
		<description></description>
		<content:encoded><![CDATA[]]></content:encoded>
		<excerpt:encoded><![CDATA[This is the excerpt found in the post list on the designated blog post list.]]></excerpt:encoded>
		<wp:post_id>1327</wp:post_id>
		<wp:post_date><![CDATA[2022-03-23 20:15:41]]></wp:post_date>
		<wp:post_date_gmt><![CDATA[2022-03-24 03:15:41]]></wp:post_date_gmt>
		<wp:post_modified><![CDATA[2022-03-26 15:34:54]]></wp:post_modified>
		<wp:post_modified_gmt><![CDATA[2022-03-26 22:34:54]]></wp:post_modified_gmt>
		<wp:comment_status><![CDATA[closed]]></wp:comment_status>
		<wp:ping_status><![CDATA[closed]]></wp:ping_status>
		<wp:post_name><![CDATA[Example Post Title]]></wp:post_name>
		<wp:status><![CDATA[publish]]></wp:status>
		<wp:post_parent>0</wp:post_parent>
		<wp:menu_order>0</wp:menu_order>
		<wp:post_type><![CDATA[post]]></wp:post_type>
		<wp:post_password><![CDATA[]]></wp:post_password>
		<wp:is_sticky>0</wp:is_sticky>
.....
</item>

In the code snippet above, the ‘link’, ‘guid’ and ‘wp:post_id’ elements are the ones that will be removed from this WXR export file. To do so, open a terminal and from the command line type:


sed -i 's/<link.*//g' example_wordpress_export.xml

sed -i 's/<guid.*//g' example_wordpress_export.xml

sed -i 's/<wp:post_id.*//g' example_wordpress_export.xml
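The three substitutions above empty each matching line from the opening tag onward, which leaves a blank line (plus any leading whitespace) behind. As a minimal sketch of an alternative, the same elements can be deleted as whole lines in a single pass using the d command and several -e expressions. The sample file and its contents below are hypothetical stand-ins for a real export, and the output is written to a new file rather than edited in place.

```shell
# Hypothetical miniature of one WXR item, created in a temporary directory.
workdir=$(mktemp -d)
sample="$workdir/sample.xml"
cat > "$sample" <<'EOF'
<item>
<title>Example Post Title</title>
<link>https://example.com/example-post-title/</link>
<guid isPermaLink="false">https://example.com/example-post-title/</guid>
<wp:post_id>1327</wp:post_id>
<wp:status>publish</wp:status>
</item>
EOF

# One pass over the file: each -e expression deletes every line matching its pattern.
sed -e '/<link>/d' -e '/<guid /d' -e '/<wp:post_id>/d' "$sample" > "$workdir/cleaned.xml"

# Show the result: only <item>, <title>, <wp:status> and </item> remain.
cat "$workdir/cleaned.xml"
```

Writing to a second file keeps the original untouched, so the result can be inspected before anything is overwritten.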

The “-i” option tells sed to edit the file in place. “CAUTION” this overwrites the original file, so it’s always best to make a copy first and run the commands above against the copy:


cp original_wordpress_export.xml example_wordpress_export.xml
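Another way to keep a safety net is to give -i a suffix, which makes sed save the untouched original next to the edited file. The sketch below assumes a hypothetical four-line sample file in a temporary directory; only the -i.bak usage and the diff review step are the point.

```shell
# Hypothetical sample file standing in for the real export.
workdir=$(mktemp -d)
export_file="$workdir/example_wordpress_export.xml"
printf '%s\n' \
  '<item>' \
  '<guid isPermaLink="false">https://example.com/?p=1</guid>' \
  '<wp:post_id>1</wp:post_id>' \
  '</item>' > "$export_file"

# -i.bak edits in place and keeps the original as example_wordpress_export.xml.bak
sed -i.bak 's/<guid.*//' "$export_file"

# Review what changed. diff exits non-zero when the files differ,
# so "|| true" keeps a script running under "set -e".
diff "$export_file.bak" "$export_file" || true
```

If the diff shows something unexpected, the .bak file can simply be copied back over the edited one.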

Each of these sed commands strips one single-line element wherever it occurs in this large export file. But how do we remove blocks of multiple lines throughout the file?

In WordPress export files, ‘wp:postmeta’ is just what it sounds like: the post metadata that links theme and plugin data to a specific post. When the site is being gutted and rebuilt on a new WordPress theme, it’s vital to remove all of these blocks from the WXR export file.


	<wp:postmeta>
		<wp:meta_key><![CDATA[_sku]]></wp:meta_key>
		<wp:meta_value><![CDATA[HEA101224]]></wp:meta_value>
	</wp:postmeta>
							<wp:postmeta>
		<wp:meta_key><![CDATA[total_sales]]></wp:meta_key>
		<wp:meta_value><![CDATA[2]]></wp:meta_value>
	</wp:postmeta>
							<wp:postmeta>
		<wp:meta_key><![CDATA[_tax_status]]></wp:meta_key>
		<wp:meta_value><![CDATA[taxable]]></wp:meta_value>
	</wp:postmeta>
							<wp:postmeta>
		<wp:meta_key><![CDATA[_tax_class]]></wp:meta_key>
		<wp:meta_value><![CDATA[]]></wp:meta_value>
	</wp:postmeta>

sed -i '/<wp:postmeta>/,/<\/wp:postmeta>/d' example_wordpress_export.xml

In the code above, the two patterns separated by a comma form an address range. Sed finds the first line matching <wp:postmeta>, continues to the next line matching </wp:postmeta>, and the “d” command deletes both of those lines and everything in between. The “/” characters delimit each pattern, which is why the slash inside the closing tag is escaped as “\/”. The patterns can be anything, as long as there’s a specific closing identifier after each opening one.

Sed repeats this process for every later <wp:postmeta> block until it reaches the end of the file.
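Before deleting a range for real, it can help to preview exactly which lines the range selects. This is a minimal sketch on a hypothetical sample file: the same address range is first printed with -n and p, then deleted with d into a separate output file.

```shell
# Hypothetical sample with two postmeta blocks inside one item.
workdir=$(mktemp -d)
sample="$workdir/postmeta_sample.xml"
cat > "$sample" <<'EOF'
<item>
<title>Example Post Title</title>
<wp:postmeta>
<wp:meta_key><![CDATA[_sku]]></wp:meta_key>
<wp:meta_value><![CDATA[HEA101224]]></wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key><![CDATA[total_sales]]></wp:meta_key>
<wp:meta_value><![CDATA[2]]></wp:meta_value>
</wp:postmeta>
</item>
EOF

# Preview: -n suppresses the default output, p prints only lines inside each range.
sed -n '/<wp:postmeta>/,/<\/wp:postmeta>/p' "$sample"

# Delete every matching range and write the result to a new file.
sed '/<wp:postmeta>/,/<\/wp:postmeta>/d' "$sample" > "$workdir/no_postmeta.xml"
cat "$workdir/no_postmeta.xml"
```

The preview prints all eight postmeta lines, and the cleaned file keeps only the <item>, <title> and </item> lines.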

Now suppose there’s a large number of items in this WordPress export file that are no longer needed. For example, maybe the first 300 posts are not going to be imported into the new WordPress site. In this case, the sed command can find a special identifier that’s been added to the export file and remove content starting from that point.

The best way to do this is to create an identifier that will not be found anywhere else in this WXR file. Something like: ZZZZZZZZZZZZZZZ

Now this identifier can be placed at the start position of the elements that will be removed.


ZZZZZZZZZZZZ

		<item>
		<title><![CDATA[Example Post Title]]></title>
		<link>https://example.com/example-post-title/</link>
		<pubDate>Thu, 24 Mar 2022 03:15:41 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>

Then this same identifier can be placed at the end of the content to be removed.


                        <wp:postmeta>
    <wp:meta_key><![CDATA[_height]]></wp:meta_key>
    <wp:meta_value><![CDATA[0.875]]></wp:meta_value>
    </wp:postmeta>
                        </item>

ZZZZZZZZZZZZ

sed -i '/ZZZZZZZZZZZZ/,/ZZZZZZZZZZZZ/d' example_wordpress_export.xml

Sed will now find and remove both identifiers and everything in between. Note that the number of Z’s in the sed command doesn’t have to equal the number of Z’s added to the export file. Sed matches the pattern anywhere within a line, so a shorter run of Z’s in the command still matches the longer identifier in the file (a longer run in the command would not match). It’s always good practice to use an exact match, but it’s unnecessary here because no other line in the export file, such as the content of a blog post, contains ZZZZZZZZ. Make sure both identifiers are present, though: if the closing one is missing, sed deletes everything from the first identifier to the end of the file.
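The marker technique can be tried safely on a small file first. This sketch uses a hypothetical sample with two 12-Z markers around one unwanted item and writes the result to a new file. It also shows that a shorter pattern matches the markers, while a pattern longer than the markers matches nothing.

```shell
# Hypothetical sample: one item wrapped in markers, one item to keep.
workdir=$(mktemp -d)
sample="$workdir/marker_sample.xml"
cat > "$sample" <<'EOF'
<channel>
ZZZZZZZZZZZZ
<item>
<title>Old post</title>
</item>
ZZZZZZZZZZZZ
<item>
<title>Post to keep</title>
</item>
</channel>
EOF

# Four Z's are enough: each 12-Z marker line contains that substring.
sed '/ZZZZ/,/ZZZZ/d' "$sample" > "$workdir/trimmed.xml"
cat "$workdir/trimmed.xml"

# Fifteen Z's never match a 12-Z marker, so nothing is deleted.
sed '/ZZZZZZZZZZZZZZZ/,/ZZZZZZZZZZZZZZZ/d' "$sample" > "$workdir/unchanged.xml"
```

The trimmed file keeps the channel wrapper and the second item only, and the second command produces an exact copy of the input.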

Show us what you did with sed in the comments below.
