Regular Expressions to remove or replace

Solution Title: Regular Expressions to remove or replace
Author: pmengal
Points: 500  Grade: A
Date: 05/12/2003 01:18AM PDT

Hello,

I want to use a regular expression to replace or remove some text.

Replace
-------
I want to be able to replace > with &gt; in the following HTML text:

"<p><strong>Superman is greater > than Spiderman</strong></p>"

The same code should also work, without any change, for this text:

"<span class="thisname<isinvalid"><b>a > b ?</b></span>"

As you have guessed, it is for use in a custom HTML encoder.

Remove
------
I want to remove everything between <script> and </script>, tags included, as in:

"<script> some malicious code </script>"

The same code (without any change, though of course different from the replace code) should also work on this:

"<script language="javascript"> some malicious code </script>"

and this one:

"<script language='javascript'> some malicious code </script>"

and this one too:

"<script dull="dull" language="javascript"> some malicious code</script>"

and this one too:

"<SCRIPT Language="JavaScript"> some malicious code </ScripT>"

Sorry to be so thorough, but I have posted some 500- and 250-point questions and got incomplete answers because the questions were not complete enough.

Thanks in advance!

Comment from

pmengal

Date: 05/12/2003 01:19AM PDT Author Comment

Forgot to say: can you provide ALL the code to achieve this? Giving me just the regular expression will not help me. I'm not familiar with regular expressions at all...

If you have time, pointing me to a website where I can learn about them is welcome ;)

Comment from

AvonWyss

Date: 05/12/2003 05:22AM PDT Comment

I will... stay tuned.

Comment from

testn

Date: 05/12/2003 06:37AM PDT Comment

Hi, my previous regex should work with >:

pattern = "((?<!(<([^>])*))(>))|((?<=((<(\/)?[^A-Z,a-z,/,]){1}([^>,<])*))(>))"

About tutorials:

http://www.c-sharpcorner.com/3/RegExpPSD.asp



http://www.wellho.net/regex/dotnet.html



http://windows.oreilly.com/news/csharp_0101.html

If you want a comprehensive book, you might consider buying the PDF from Amazon:

http://www.amazon.com/exec/obidos/tg/detail/-/B0000632ZU/102-4200309-1247344?vi=glance

  Accepted Answer from

testn

Date: 05/12/2003 06:42AM PDT Accepted Answer

This is the code for removing malicious code.

    using System.Text.RegularExpressions;

    public string removeMaliciousCode(string oldStr) {
        string pattern = @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>";
        string newStr  = Regex.Replace(oldStr, pattern, "");
        return newStr;
    }

This function returns the string with the malicious code removed.

Comment from

testn

Date: 05/12/2003 06:47AM PDT Comment

Explanation of the function:

(?i) means case-insensitive matching.
<script([^>])*> matches any string that starts with "<script" and has 0 or more characters before the closing >.
(\w|\W)* matches whatever lies between <script> and </script> (0 or more characters of any kind).
</script([^>])*> matches any string that starts with "</script" and has 0 or more characters before the closing >.

However, this one may be too aggressive: because (\w|\W)* is greedy, it will match the whole of

<SCRIPT Language="JavaScript"> some malicious code</ScripT> Hello <script ></script>

without leaving "Hello" behind.

Comment from

testn

Date: 05/12/2003 07:22AM PDT Comment

You can improve it by using

<script[^>]*>.*?</script[^>]*>

(keep the (?i) prefix or RegexOptions.IgnoreCase so that mixed-case tags still match). It will reduce

<SCRIPT Language="JavaScript"> some malicious code</ScripT> Hello <script ></script>

to "Hello", since .*? means non-greedy matching: it matches as few characters as possible while still letting the rest of the pattern succeed.

Comment from

testn

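Date: 05/12/2003 07:33AM PDT Comment

Please also test this against multi-line data, because . does not match a newline by default. You might need to change it to

<script[^>]*>(\w|\W)*?</script[^>]*>

or to

(?m)<script[^>]*>(\w|\W)*?</script[^>]*>

(\w|\W) matches any character, newlines included, so the first form already handles multi-line input. Note that (?m) only changes how ^ and $ match; the inline option that makes . match newlines is (?s), which corresponds to RegexOptions.Singleline.

Comment from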

AvonWyss

Date: 05/12/2003 09:39PM PDT Comment

    // Called once per match: drop scripts, encode lone '>', keep whole tags.
    private string ReplaceMatch(Match match) {
        if (match.Groups["script"].Success)
            return "";
        else if (match.Groups["gt"].Value == ">")
            return "&gt;";
        else
            return match.Value;
    }

    public string CleanupHtml(string html) {
        return Regex.Replace(html,
            @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])*)?>)",
            new MatchEvaluator(ReplaceMatch),
            RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

This Regex does both of your tasks at the same time. The first part is pretty similar to testn's suggestion, but I also provide the code to find single > chars (with no matching < before them).

Comment from

osxmaster

Date: 02/23/2004 02:19PM PST Comment

Hi, do you know how I can strip the HTML and script tags from this page?

http://www.wipo.org

It seems to be very complicated. Thanks.

Comment from

AvonWyss

Date: 02/24/2004 09:59PM PST Comment

Yes, I do. Just a few days ago I answered a very similar question; have a look at recent posts in C#:

http:Q_20892954.html

Or you can of course also post a new question.

From: http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_20613142.html#8508997 (2005-9-13:03:46)
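
To make the greedy versus non-greedy difference discussed in this thread concrete, here is a minimal sketch. The class name ScriptStripDemo and the helpers StripGreedy and StripLazy are hypothetical names introduced only for illustration. The patterns are the ones testn posted, and the non-greedy one is assumed to run with RegexOptions.IgnoreCase and RegexOptions.Singleline so that it matches mixed-case tags and line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answers.
public static class ScriptStripDemo
{
    // Greedy body, as in the accepted answer: (\w|\W)* runs on to the LAST </script>.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, @"(?i)<script([^>])*>(\w|\W)*</script([^>])*>", "");
    }

    // Non-greedy body: .*? stops at the FIRST </script> it can reach.
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, @"<script[^>]*>.*?</script[^>]*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string input =
            "<SCRIPT Language=\"JavaScript\"> some malicious code</ScripT> Hello <script ></script>";

        Console.WriteLine("[" + StripGreedy(input) + "]"); // prints "[]"
        Console.WriteLine("[" + StripLazy(input) + "]");   // prints "[ Hello ]"
    }
}
```

The greedy version swallows the text between the two script blocks, while the non-greedy version removes each block separately and keeps " Hello ".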

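
Since the original question asked for complete code, the following sketch wraps AvonWyss's combined approach in a small console program and runs it on the inputs from the question. It is an illustration under stated assumptions, not a definitive implementation: the class name HtmlCleanupDemo and the field name Cleaner are hypothetical, and the pattern is assumed to carry a * quantifier after each quoted-attribute alternative and after the inner group, which is what lets it consume a whole tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, introduced only for this example.
public static class HtmlCleanupDemo
{
    // Assumption: the '*' quantifiers let the "gt" branch match a complete tag,
    // including '>' or '<' inside quoted attribute values.
    private static readonly Regex Cleaner = new Regex(
        @"(?<script><script[^>]*>.*?</script[^>]*>)|(?<gt>(<(""[^""]*""|'[^']*'|[^>])*)?>)",
        RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static string ReplaceMatch(Match match)
    {
        if (match.Groups["script"].Success) return "";     // whole script block: drop it
        if (match.Groups["gt"].Value == ">") return "&gt;"; // lone '>': encode it
        return match.Value;                                 // complete tag: keep as is
    }

    public static string CleanupHtml(string html)
    {
        return Cleaner.Replace(html, new MatchEvaluator(ReplaceMatch));
    }

    public static void Main()
    {
        // prints <p><strong>Superman is greater &gt; than Spiderman</strong></p>
        Console.WriteLine(CleanupHtml(
            "<p><strong>Superman is greater > than Spiderman</strong></p>"));

        // prints <span class="thisname<isinvalid"><b>a &gt; b ?</b></span>
        Console.WriteLine(CleanupHtml(
            "<span class=\"thisname<isinvalid\"><b>a > b ?</b></span>"));

        // prints an empty line: the script block is removed entirely
        Console.WriteLine(CleanupHtml(
            "<script dull=\"dull\" language=\"javascript\"> some malicious code</script>"));
    }
}
```

One known limitation of this approach: text such as "a < b > c" is treated as a tag, because the pattern pairs any < with the next >, so that > is left unencoded.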