Thursday, June 29, 2017

Remove images and internal links from HTML content via regular expressions in PowerShell

Suppose we need to export intranet news (e.g. to the file system) so that another tool can read it and show it to external users or publish it on the internet. Images and internal links won't work in this case, because the intranet is not accessible from outside the organization's network. One solution is to remove them from the exported news. Here are two functions that remove images and internal links from HTML content using regular expressions:

   1: function RemoveImages($str)
   2: {
   3:     if ([System.String]::IsNullOrEmpty($str))
   4:     {
   5:         return "";
   6:     }
   7:     $re = [regex]"<img.+?/>"
   8:     return $re.Replace($str, "[image deleted]")
   9: }
  10:  
  11: function RemoveInternalLinks($str)
  12: {
  13:     if ([System.String]::IsNullOrEmpty($str))
  14:     {
  15:         return "";
  16:     }
  17:     
  18:     $matchEvaluator =
  19:     {
  20:         param($m)
  21:         
  22:         if ($m.Groups.Count -eq 2 -and $m.Groups[1].Success -and
  23:             ($m.Groups[1].Value.ToLower().Contains("myintranet.com") -or
  24:                 $m.Groups[1].Value.StartsWith("/")))
  25:         {
  26:             return "[link deleted]";
  27:         }
  28:         return $m.Groups[0].Value;
  29:     }
  30:     
  31:     $re = [regex]"<a.+?href=['""](.+?)['""].*?>.+?</a>"
  32:     return $re.Replace($str, $matchEvaluator)
  33: }

If we apply these functions to the following HTML:

Some text <img src="http://example.com/someimage.png" />, internal links <a href="http://myintranet.com">test1</a> and <a href="/subsite">test2</a>, external link <a href="http://example.com">test3</a>

it is transformed into the following:

Some text [image deleted], internal links [link deleted] and [link deleted], external link <a href="http://example.com">test3</a>
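For a quick check without defining the functions first, the same two patterns can be applied inline with the static [regex]::Replace method, images first and links second. This is only a sketch: the variable names ($html, $noImages, $evaluator, $result) are illustrative, and the sample is written with straight quotes because the link pattern only recognizes ' and " around the href value.

```powershell
# Sample HTML from the post (straight quotes, so the patterns can match).
$html = 'Some text <img src="http://example.com/someimage.png" />, internal links <a href="http://myintranet.com">test1</a> and <a href="/subsite">test2</a>, external link <a href="http://example.com">test3</a>'

# Step 1: replace every self-closing <img ... /> tag with a placeholder.
$noImages = [regex]::Replace($html, '<img.+?/>', '[image deleted]')

# Step 2: a script block cast to MatchEvaluator decides, match by match,
# whether a link is replaced or returned unchanged.
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($m)
    $href = $m.Groups[1].Value
    if ($href.ToLower().Contains('myintranet.com') -or $href.StartsWith('/')) {
        return '[link deleted]'
    }
    return $m.Value
}

# In a single-quoted string, '' stands for one literal single quote.
$result = [regex]::Replace($noImages, '<a.+?href=[''"](.+?)[''"].*?>.+?</a>', $evaluator)
$result
```

The order of the two steps does not matter for this sample, since the image pattern and the link pattern match different tags.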

Note that both absolute and relative internal links are removed. This is done by the conditional regular expression replace (lines 18-32): the match evaluator returns [link deleted] only if the href attribute contains the intranet server name (myintranet.com in our example) or starts with a slash /, which indicates a relative link. Every other link is returned unchanged, so the external link remains in the resulting HTML. I hope this information helps someone.
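The patterns above are deliberately simple and have a few edge cases: <img.+?/> only matches self-closing tags, <a.+?href can run across an earlier anchor that has no href, a Contains check also hits URLs that merely mention the host in a query string, and StartsWith("/") also catches protocol-relative URLs such as //example.com. The following is one possible tightening, not the method from the post. The function names Remove-HtmlImage and Remove-InternalLink and the -IntranetHost parameter are made up for this sketch, and regular expressions remain an approximation of real HTML parsing (for example, a > inside an attribute value still breaks the match).

```powershell
# Hypothetical helper: replaces <img ...> and <img ... /> in any letter case.
function Remove-HtmlImage {
    param([string]$Html)
    if ([string]::IsNullOrEmpty($Html)) { return '' }
    # \b keeps other tag names from matching; [^>]* stays inside one tag.
    return [regex]::Replace($Html, '(?i)<img\b[^>]*>', '[image deleted]')
}

# Hypothetical helper: replaces links that point to the intranet host
# (assumed default: myintranet.com) or that are site-relative.
function Remove-InternalLink {
    param(
        [string]$Html,
        [string]$IntranetHost = 'myintranet.com'
    )
    if ([string]::IsNullOrEmpty($Html)) { return '' }

    # Absolute or protocol-relative URL whose host is the intranet host
    # or one of its subdomains.
    $hostPattern = '^(https?:)?//([^/?#]*\.)?' + [regex]::Escape($IntranetHost) + '([:/?#]|$)'

    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        $href = $m.Groups[2].Value.Trim()
        # '^/(?!/)' means: starts with one slash, but not with "//".
        if ($href -match $hostPattern -or $href -match '^/(?!/)') {
            return '[link deleted]'
        }
        return $m.Value
    }

    # Group 1 is the opening quote, \1 requires the same closing quote,
    # group 2 is the href value. (?is) = ignore case, dot matches newlines.
    $pattern = '(?is)<a\b[^>]*?\shref\s*=\s*(["''])(.*?)\1[^>]*>.*?</a>'
    return [regex]::Replace($Html, $pattern, $evaluator)
}

# Usage on the sample from the post:
$sample = 'Some text <img src="http://example.com/someimage.png" />, internal links <a href="http://myintranet.com">test1</a> and <a href="/subsite">test2</a>, external link <a href="http://example.com">test3</a>'
Remove-InternalLink (Remove-HtmlImage $sample)
```

For the sample this variant produces the same result as the original functions. The difference only shows on inputs such as <IMG src="a.png"> or an <a name="..."> anchor placed before a real link.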
