Wikipedia:AutoWikiBrowser/Typos

These are the typo regular expressions for RegExTypoFix (Regular Expression Typographical error Fixer, or RETF). Development has been open to the public since 2006.

Please add to or improve these regular expressions!

Description

These regular expressions find and fix common misspellings and grammatical errors. The primary advantages of RegExTypoFix over other spellchecking engines and approaches are its accuracy and the fact that it returns only one possible replacement. The rules below are developed to give as few false positives as possible. Errors should be encountered only in extremely rare usages or when parsing other languages (though even then, if there are too many false positives, the expression will be modified). On everyday English, accuracy should approach 100%.

RegExTypoFix is used across diverse sources of text in many languages, including the English Wikipedia. It is also used on other MediaWiki-based wikis, and derivatives can be used in other software. This leads to a massively tested, well-vetted set of automatic corrections. Even so, because text varies so widely, RegExTypoFix is not accurate enough to be run without a human checking every proposed correction when running against an encyclopedia such as Wikipedia.

Syntax of the expressions is described in full on the MSDN website, though for the purposes of this page the Well House summary is likely easier to use.

Usage

Everyone using RegExTypoFix should use it responsibly. Check every edit before you make it. If in doubt, SKIP. This typo list is used by the in-browser editor and multiple Wikipedia tools.

AutoWikiBrowser (AWB)

AWB purposely avoids fixing typos in certain areas of the wiki-text. Typo fixing is prevented within: image names, template names and parameters, wikilink targets, text in quotations and italics, and any text that follows a colon or asterisk. If a typo rule matches a wikilink target, this rule will be ignored on the whole page.
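
The wikilink-target behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not AWB's actual C# implementation; the helper name rules_allowed_on_page, the simplified link pattern, and the two sample rules are assumptions made for this example.

```python
import re

# Simplified pattern for the target part of [[target]] or [[target|description]].
# Assumption: real wikilink parsing is more involved than this.
WIKILINK_TARGET = re.compile(r"\[\[([^\]|]+)")

def rules_allowed_on_page(rules, wikitext):
    """Return the (find, replace) rules that match no wikilink target on the page."""
    targets = WIKILINK_TARGET.findall(wikitext)
    allowed = []
    for find, replace in rules:
        pattern = re.compile(find)  # case-sensitive, like the typo rules
        if any(pattern.search(target) for target in targets):
            continue  # the rule matches a link target: ignore it on the whole page
        allowed.append((find, replace))
    return allowed

# Hypothetical rules and page text, for illustration only.
rules = [(r"\bteh\b", "the"), (r"\brecieve\b", "receive")]
page = "See [[teh tarik]] before you recieve teh drink."
allowed = rules_allowed_on_page(rules, page)
```

In this sketch the "teh" rule is dropped for the whole page because it matches the link target "teh tarik", so the later "teh" in running text is also left alone.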

When using AWB, you can refresh the typo list by selecting "File → Refresh status/typos" (CTRL-R). This is useful when you are modifying the typo list on Wikipedia while using AWB to test/process the modification (but basic testing should first be done offline—e.g. by using AWB's Regex Tester or "Find and replace").

JavaScript Wiki Browser (JWB)

The JavaScript Wiki Browser uses the same rules for ignoring typo fixing as the downloadable AWB does. Typo rules are not applied to image names, template names and parameters, quotes, or any text following a colon or asterisk, and any rule that also matches a wikilink target on the page is skipped. Since JavaScript does not support lookbehinds, any replacement rules containing lookbehinds (?<= and ?<!) will be ignored.
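
As a rough illustration of the last point, a tool could drop rules containing lookbehinds with a plain substring check. The sketch below is written in Python for readability and is not JWB's code; the function name has_lookbehind and the sample rules are hypothetical.

```python
def has_lookbehind(find):
    """True if the pattern text contains a lookbehind construct.

    Simplification: a plain substring check, so an escaped parenthesis
    followed by ?<= would also be flagged.
    """
    return "(?<=" in find or "(?<!" in find

# Hypothetical rules, for illustration only.
rules = [
    (r"\bteh\b", "the"),
    (r"(?<!\.)\bwww\b", "WWW"),
]

# Keep only the rules that contain no lookbehind.
usable = [rule for rule in rules if not has_lookbehind(rule[0])]
```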

To refresh the typo list, click the refresh button next to the checkbox that enables typo fixing.

WPCleaner

WPCleaner also purposely avoids fixing typos in certain areas of the wiki-text. Because Java supports lookbehinds somewhat differently from C#, any replacement rule containing a lookbehind (?<= or ?<!) is rejected if the lookbehind expression has no obvious maximum length (for example, a lookbehind that uses quantifiers such as * or + will probably be rejected). Rules starting with \{\{ are applied only at the beginning of templates, and rules starting with \[\[ are applied only at the beginning of internal links. For other rules, typo fixing is prevented within:

  • comments,
  • internal links, except for the text description when the link is in the form [[link|description]],
  • images, except for the text description or the alternate text description,
  • templates,
  • categories,
  • interwiki links, except for the text description when the link is in the form [[xx:link|description]],
  • language links,
  • external links, except for the text description when the link is in the form [http://xxxx/ description],
  • defaultsort,
  • tags,
  • between <gallery>...</gallery>, <math>...</math>, <code>...</code> or <timeline>...</timeline> tags,
  • if the text is surrounded by dots, themselves surrounded by letters or digits.
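
The "obvious maximum length" condition for lookbehinds can be approximated with a heuristic like the one below. This is a sketch under stated assumptions, not WPCleaner's actual Java logic: it only inspects lookbehinds without nested groups, and the name lookbehind_seems_bounded is made up for this example.

```python
import re

# Lookbehind bodies without nested parentheses (a simplifying assumption).
LOOKBEHIND = re.compile(r"\(\?<[=!]([^()]*)\)")
# An unescaped *, + or {n,} quantifier, none of which has an upper bound.
UNBOUNDED = re.compile(r"(?<!\\)(?:[*+]|\{\d+,\})")

def lookbehind_seems_bounded(find):
    """Heuristic: True when no lookbehind body uses *, + or {n,}."""
    bodies = LOOKBEHIND.findall(find)
    return not any(UNBOUNDED.search(body) for body in bodies)

bounded = lookbehind_seems_bounded(r"\bteh\b(?<!\.[^\s\.]{0,999})")
unbounded = lookbehind_seems_bounded(r"(?<=\w+)teh")
```

Here the first pattern passes because {0,999} has an explicit upper limit, while the second is flagged because \w+ can be arbitrarily long.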

When using WPCleaner, you can refresh the typo list by clicking the refresh button in the main window.

wikEd

In the Wikipedia gadget wikEd, the rules are applied everywhere.

Adding/changing a misspelling

The syntax for each rule is the following (according to the AWB and wikEd source code):

<Typo word="Optional name for this rule" find="Regex code to detect the error" replace="Replacement for the error"/>

The "word" parameter is optional and any additional spaces between the parameters are ignored.
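
To make the rule format concrete, here is a minimal Python sketch that reads rules in this syntax with a regular expression. The pattern TYPO_RULE and the function parse_rules are assumptions written for this example; the AWB and wikEd parsers may differ in detail.

```python
import re

# Matches <Typo word="..." find="..." replace="..."/> with an optional "word"
# attribute and any amount of whitespace between attributes.
TYPO_RULE = re.compile(
    r'<Typo(?:\s+word="(?P<word>[^"]*)")?'
    r'\s+find="(?P<find>[^"]*)"'
    r'\s+replace="(?P<replace>[^"]*)"\s*/>'
)

def parse_rules(text):
    """Return one dict per rule with the keys word, find and replace."""
    return [match.groupdict() for match in TYPO_RULE.finditer(text)]

sample = '''
<Typo word="/" find=" /" replace="/"/>
<Typo   find=",,"   replace=", " />
'''
parsed = parse_rules(sample)
```

The second sample rule has no "word" attribute and extra spaces, so its word entry comes back as None.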

Before editing this page

  • Note that all typo rules are case-sensitive. This affects how they are written and tested.
  • Test your proposed change by using an ordinary Wikipedia search or an AWB Google Search with a "Find and Replace" configured. This may reveal that your rule will sometimes damage correct text, or may sometimes make the wrong correction. In these cases do not add the rule here; instead, consider adding it to the Lists of common misspellings.
  • If you do not know how to make a change, suggest it here, where a knowledgeable user will add it for you.
  • Keep in mind that every addition/possibility of a word uses more CPU and slows scanning.
  • Note that only words outside wikimarkup are fixed, so a rule to fix, say, a wiki template will not work in AWB.

Writing typo rules

  • Aim to have a single rule for each root word, prefix, and suffix.
  • Avoid having a rule detect a spelling outside its intended scope (for example, a rule that fixes housa to house must not detect thousand or house). Add word boundaries (\b) to both ends of the regex unless you are matching errors in parts of words or multiple words.
  • Do not expect rules to be applied in the order they appear.
  • Write fast rules:
    • Beginnings are expensive, so be specific in the matching of the first few characters to eliminate possibilities quickly.
    • If possible don't use the quantifiers * and + with anything but a single character. Avoid them entirely if possible, as they put extra strain on CPU and are apt to do other than what you expect.
  • Each rule must be completely independent.
  • Update the rule name if you change something that affects it.
  • Lookbehind constructs ?<= and ?<! are not supported by wikEd, and cause these rules to be skipped.
  • Because the typo rules are case-sensitive, be sure to handle all reasonable case possibilities.
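
The word-boundary and case-handling advice above can be illustrated with the housa example. This is a sketch in Python; note that Python writes a captured group in the replacement as \1, whereas the rules on this page use .NET syntax, where the same reference is written $1.

```python
import re

# Without word boundaries the rule also fires inside "thousand".
unanchored = re.compile(r"housa")
damaged = unanchored.sub("house", "a thousand")

# With \b on both ends and an explicit [Hh], only the standalone word is
# fixed, and the original capitalisation is carried over by the group.
anchored = re.compile(r"\b([Hh])ousa\b")
fixed = anchored.sub(r"\1ouse", "Housa and housa, a thousand")
```

The unanchored rule turns "thousand" into "thousend", while the anchored rule fixes both "Housa" and "housa" and leaves "thousand" untouched.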

Testing typo rules

  • Test rules with the AWB Regular Expression tester, AWB's "Find and replace", or something similar before adding them here. If you use AWB's "Find and replace", make sure "CaseSensitive", "Regex" and "Enabled" in Normal settings (or "Case sensitive", "Regular expression" and "Enabled" in Advanced settings) are checked for each rule tested.
  • Test again with AWB or wikEd immediately after you add them. If they do not work, remove them first and analyze later.
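
For offline checks outside AWB, a small harness along these lines may help. It is a Python sketch, so it only approximates the .NET regex engine that AWB uses; the function check_rule and the sample texts are hypothetical.

```python
import re

def check_rule(find, replace, should_fix, should_keep):
    """Return the sample texts on which the rule misbehaves.

    should_fix maps text with a typo to the expected corrected text;
    should_keep lists correct text the rule must not match at all.
    """
    pattern = re.compile(find)  # case-sensitive, like the typo rules
    failures = []
    for before, after in should_fix.items():
        if pattern.sub(replace, before) != after:
            failures.append(before)
    for text in should_keep:
        if pattern.search(text):
            failures.append(text)
    return failures

# The double-comma rule from the list, checked against made-up samples.
comma_failures = check_rule(
    r"\s?,\s?\s?,\s?", ", ",
    should_fix={"a,,b": "a, b", "a , , b": "a, b"},
    should_keep=["a, b"],
)

# A deliberately sloppy rule that damages correct text.
sloppy_failures = check_rule("housa", "house", {"housa": "house"}, ["thousand"])
```

An empty result means every sample behaved as expected; the sloppy rule reports "thousand" as a false positive.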

To do

  • Identify and improve rules to avoid false positives.
  • Remove duplicates.
  • Expand rules to accept more suffixes (e.g., "-ing", "-ed", "-able") and prefixes.
    • Note that some regular expressions purposely correct only certain versions of a word to avoid false positives. These should be marked with an underscore character "_" at the beginning or end of the word= field.
  • Remove rare words. Note that no matches today does not mean a rule is rare, since another user may have used the rule to fix many articles yesterday.
  • Keep lists sorted alphabetically by root word; e.g., put "(Un)Equal" just before "(In)Equality" among the "E" words. Don't sort by, say, ASCII character value.
  • Ignore words surrounded by "." as in www.harvard.edu. by adding the following to the end of a rule: (?![^\s\.]*\.\w)(?<!\.[^\s\.]{0,999})
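
The effect of the suffix in the last item can be partly demonstrated in Python. Only the lookahead half is shown here, because Python's built-in re module requires fixed-width lookbehinds and so cannot compile the {0,999} lookbehind; the base rule \bteh\b is a made-up example.

```python
import re

# Lookahead half of the suffix: fail if the word is followed, within the
# same run of non-space characters, by a dot and then a word character.
DOT_LOOKAHEAD = r"(?![^\s\.]*\.\w)"

rule = re.compile(r"\bteh\b" + DOT_LOOKAHEAD)

plain = rule.sub("the", "teh cat")                  # ordinary text is fixed
sentence_end = rule.sub("the", "I saw teh.")        # a final full stop is fine
domain_start = rule.sub("the", "teh.example.org")   # protected by the lookahead
domain_end = rule.sub("the", "www.teh")             # needs the lookbehind half
```

The last case is still changed in this sketch, which is the situation the lookbehind half of the suffix is meant to cover in engines that support it.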

List of typos

All changes to this list are live. AWB loads directly from this list whenever someone invokes the RETF option.

Punctuation

<Typo word="॥" find="\s।।|।।" replace="॥"/>
<Typo word="।" find="&nbsp;।|\s।" replace="।"/>
<Typo word="॥" find="&nbsp;॥|\s॥" replace="॥"/>
<Typo word="॥" find="\s।।" replace="॥"/>
<Typo word="॥" find="(। ।|।।)" replace="॥"/>
<Typo word=" ," find=" +,\s?" replace=", "/><!--change space before comma to space after comma, for eventual move to punctuation section-->
<Typo word=",," find="\s?,\s?\s?,\s?" replace=", "/><!--fixes double commas-->
<Typo word="/" find=" /" replace="/"/>
<Typo word="/" find="/ " replace="/"/>
<Typo word="—" find="--" replace="—" disabled="----"/>
<Typo word="—" find="---" replace="—" disabled="----"/>
<Typo word="—" find="——" replace="—"/>
<Typo word="—" find="—-" replace="—"/>
<Typo word="—" find="-—" replace="—"/>
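
To show what a few of the rules above do, the sketch below applies three of them to short sample strings in Python. Each rule is run on its own sample, since rules should not be assumed to run in list order; the sample texts are made up for illustration, and Python's regex engine is only an approximation of the .NET engine used by AWB.

```python
import re

# (find, replace, sample text, expected result); find and replace are copied
# from the rules above, the sample texts are invented.
cases = [
    (r"&nbsp;।|\s।", "।", "राम ।", "राम।"),      # space before a danda
    (r"(। ।|।।)", "॥", "इति।।", "इति॥"),          # two dandas become a double danda
    (r" /", "/", "और /या", "और/या"),             # space before a slash
]

results = [re.sub(find, replace, text) for find, replace, text, _ in cases]
expected = [after for _, _, _, after in cases]
```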