
SIMPLEXpress


Vision


SIMPLEXpress is a lightweight, simple alternative to regular expressions. It is intended to be easier to learn and to do the heavy lifting for parsing.

Advantages over Regular Expressions:

  • Syntax is easier to remember.
  • Faster performance than regular expressions.
  • Capable of character-by-character parsing.

Structure

SIMPLEXpress has four reserved characters that define its structure: ^ and ~ indicate the beginning of a regular unit or a snag unit respectively, / indicates the end of a unit, and % escapes a character so that it is treated as a literal. Of these, ^, ~, and % are hard reserved, while / is only reserved inside a unit. Every character outside a unit is treated as a literal. Anything that is not a literal is placed in a "unit", which always starts with a ^ or ~ and ends with a /.

The following is an example of a "simplex" (that is, a SIMPLEXpress model) in Ratscript.

make model as simplex = ^l+/~i+/~a+/~l+/.png^(24)?/

print(model.test("afraid~əˈfrād~3a~gf.png24"))
>>True
print(model.test("below~biˈlō-1d~gf.png24"))
>>True
print(model.test("below~biˈlō-1d~gf.png"))
>>True
print(model.test("below~biˈlō-1d~gf.gif"))
>>False
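
The unit and literal structure described above can be sketched as a small tokenizer. This is only an illustration in Python, not the PawLIB or Ratscript implementation: the function name tokenize and the token labels are hypothetical, and the sketch assumes units are not nested inside groups, so it stops at the first unescaped / after a ^ or ~.

```python
def tokenize(model: str) -> list[tuple[str, str]]:
    """Split a simplex model into ("literal" | "unit" | "snag", text) pairs."""
    tokens: list[tuple[str, str]] = []
    literal: list[str] = []  # literal characters collected since the last unit
    i = 0

    def flush_literal() -> None:
        if literal:
            tokens.append(("literal", "".join(literal)))
            literal.clear()

    while i < len(model):
        ch = model[i]
        if ch == "%":
            # Escape: the next character is literal, even if it is ^, ~ or %.
            if i + 1 >= len(model):
                raise ValueError("dangling % at end of model")
            literal.append(model[i + 1])
            i += 2
        elif ch in "^~":
            flush_literal()
            kind = "unit" if ch == "^" else "snag"
            end = i + 1
            # Scan to the closing /, stepping over %-escaped characters.
            while end < len(model) and model[end] != "/":
                end += 2 if model[end] == "%" else 1
            if end >= len(model):
                raise ValueError(f"unterminated unit starting at index {i}")
            tokens.append((kind, model[i + 1:end]))
            i = end + 1
        else:
            # Outside a unit, every other character is a literal.
            literal.append(ch)
            i += 1

    flush_literal()
    return tokens
```

Applied to the model from the example, tokenize("^l+/~i+/~a+/~l+/.png^(24)?/") yields one regular unit (l+), three snag units (i+, a+, l+), the literal .png, and a final regular unit ((24)?).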

Specifiers

All specifiers start with a single letter. Lowercase is a match, uppercase inverts the logic (A = NOT alphanumeric).

  • a: alphanumeric
  • c: classification (Reserved for later expanded character classes, such as c_hangal for Hangul characters) (2.0-3.0)
  • d: digit
  • e: extended Latin (2.0)
  • g: Greek (2.0)
  • i: IPA (2.0)
  • l: Latin letter
  • n: newline (\n)
  • o: math operator
  • p: punctuation
  • r: carriage return (\r)
  • s: literal space
  • t: tab
  • u#: unicode (accepts u78 or u57-78) (2.0)
  • w: whitespace

Most specifiers can also include u or l after the first character to indicate uppercase or lowercase. For example, ^au/ indicates uppercase alphanumeric, while ^gl/ indicates lowercase Greek/Coptic. The case marker is ignored, without raising an error, if case doesn't apply.
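
As a rough illustration of how the 1.0 specifiers, the uppercase inversion, and the u/l case marker could fit together, here is a Python sketch. Everything in it is an assumption for illustration: the names BASE and specifier_predicate are hypothetical, the operator set for o is a guess because the source does not list it, l is taken to mean basic (ASCII) Latin letters, an inverted specifier with a case marker negates the whole test, and characters without case (such as digits) are assumed to pass a case marker.

```python
import string

_LATIN = set(string.ascii_letters)       # assumption: l means basic Latin only
_PUNCTUATION = set(string.punctuation)
_OPERATORS = set("+-*/=<>%^")            # assumption: not listed in the source

# Base tests for the 1.0 specifiers; each takes a single character.
BASE = {
    "a": str.isalnum,
    "d": str.isdigit,
    "l": lambda c: c in _LATIN,
    "n": lambda c: c == "\n",
    "o": lambda c: c in _OPERATORS,
    "p": lambda c: c in _PUNCTUATION,
    "r": lambda c: c == "\r",
    "s": lambda c: c == " ",
    "t": lambda c: c == "\t",
    "w": str.isspace,
}


def specifier_predicate(spec: str):
    """Build a one-character test from a specifier such as 'a', 'A' or 'lu'."""
    letter, case = spec[0], spec[1:]
    if letter.lower() not in BASE:
        raise ValueError(f"unsupported specifier: {spec!r}")
    if case not in ("", "u", "l"):
        raise ValueError(f"unknown case marker in {spec!r}")
    base = BASE[letter.lower()]
    inverted = letter.isupper()  # uppercase specifier letter inverts the logic

    def test(c: str) -> bool:
        ok = base(c)
        # Caseless characters are never rejected by a case marker.
        if case == "u" and c.islower():
            ok = False
        elif case == "l" and c.isupper():
            ok = False
        return ok != inverted

    return test
```

For instance, specifier_predicate("lu") accepts "Q" but rejects "q", and specifier_predicate("A") accepts "!" because it is not alphanumeric.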

Hard Reserved Characters

These characters must always be escaped to be literal.

  • ^: Start unit. (^.../)
  • ~: Snag, a.k.a. capture group. (~.../)
  • %: Literal escape character for hard reserved characters.

Soft Reserved Characters

These characters are only reserved within a unit.

  • /: Close unit.
  • [ ]: Set. Match any one of the unit values within. Space delimited.
  • < >: Literal set. Match any one of the literal characters within.
  • ( ): Group. Allows for literal characters, strings, and further units (simplex-ception!) within a unit. For example, ^(abc)?/ matches optional abc.
  • %: Escape following character (literal). Affects exactly one character, and modifiers following it will affect that character's unit.
  • .: Any character.
  • +: Multiple (one or more)
  • ?: Optional (zero or one)
  • *: Optional Multiple (zero or more)
  • !: NOT
  • $: Line beginning or end. Logically, we can combine the two together, because nothing can follow a line end, and nothing can precede a line beginning. In multi-line mode, this would match a line break.
  • #1, #2-3, (etc, any number): Exact number or range of matches.
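
The repetition modifiers in the list above can be read as minimum and maximum match counts. The Python sketch below shows one such reading; the function name repetition_bounds is hypothetical, and it assumes the regex-like meanings of + (one or more), ? (zero or one), and * (zero or more), with no modifier meaning exactly one match.

```python
import math
import re


def repetition_bounds(modifier: str) -> tuple[int, float]:
    """Translate a repetition modifier into (minimum, maximum) match counts."""
    if modifier == "":
        return (1, 1)            # no modifier: exactly one match
    if modifier == "+":
        return (1, math.inf)     # multiple
    if modifier == "?":
        return (0, 1)            # optional
    if modifier == "*":
        return (0, math.inf)     # optional multiple
    # #N for an exact count, #M-N for a range.
    parsed = re.fullmatch(r"#(\d+)(?:-(\d+))?", modifier, flags=re.ASCII)
    if parsed is None:
        raise ValueError(f"unknown modifier: {modifier!r}")
    low = int(parsed.group(1))
    high = int(parsed.group(2)) if parsed.group(2) is not None else low
    if high < low:
        raise ValueError(f"empty range in {modifier!r}")
    return (low, high)
```

Under these assumptions, #3 gives the bounds (3, 3) and #2-3 gives (2, 3).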

Alternation can take place in a number of ways:

  • ^[(abc) (123)]/ matches either abc or 123, but not both.
  • ^[lu d]/ matches either an uppercase letter or a digit, but not both.
  • ^[(abc) d]/ matches either abc or a digit.
  • ^[<abc> d]/ matches either a, b, c, or a digit.
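
To make the four alternation forms above concrete, the following Python sketch tries each alternative of a set at a given position in the input. It is a simplified illustration rather than the real matcher: the function name match_set and the ("group" | "chars" | "spec", value) representation are hypothetical, alternatives are assumed to be tried in order, and specifiers are passed in as plain one-character test functions.

```python
def match_set(alternatives, text, pos=0):
    """Return the position after the first alternative matching at pos, else None."""
    for kind, value in alternatives:
        if kind == "group":      # (abc): a literal string
            if text.startswith(value, pos):
                return pos + len(value)
        elif kind == "chars":    # <abc>: any one of the listed characters
            if pos < len(text) and text[pos] in value:
                return pos + 1
        elif kind == "spec":     # d, lu, ...: a test on a single character
            if pos < len(text) and value(text[pos]):
                return pos + 1
        else:
            raise ValueError(f"unknown alternative kind: {kind!r}")
    return None


def is_upper_letter(c: str) -> bool:
    """Stand-in for the lu specifier (assumed to mean ASCII uppercase)."""
    return c.isascii() and c.isupper()


# The four sets from the list above, written out by hand.
abc_or_123 = [("group", "abc"), ("group", "123")]          # ^[(abc) (123)]/
upper_or_digit = [("spec", is_upper_letter), ("spec", str.isdigit)]  # ^[lu d]/
abc_or_digit = [("group", "abc"), ("spec", str.isdigit)]   # ^[(abc) d]/
chars_or_digit = [("chars", "abc"), ("spec", str.isdigit)] # ^[<abc> d]/
```

With these definitions, match_set(abc_or_digit, "abc") consumes three characters, while match_set(chars_or_digit, "b") consumes one, which is the difference between a group and a literal set.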

Flags

All flags are passed in as arguments on the simplex function.

  • Ignore Case: All letters are automatically changed to uppercase (ignoring space). [Default: false]
  • Full Match: If false, the model need only match a PART of the input string. [Default: true]

Functions

  • match(): Does the entire given string match the entire simplex model? Returns TRUE or FALSE.
  • snag(): Returns a FlexArray of onestrings, which are the pieces captured by snag units (~). If match() would return FALSE, the array will be empty.
  • lex(): Returns a SimplexResult containing a true or false boolean match, a FlexArray of onestrings snag_array (empty if no snag units included), and an unsigned integer match_length containing the length of the match in onechars. Will return TRUE if the beginning of the input matches the simplex model.
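
The relationship between the three functions can be sketched in Python as follows. This is a toy, not the PawLIB API: only the names SimplexResult, match, snag_array, and match_length come from the description above, and the stand-in lex() assumes a model made purely of literal characters (no units), so its snag array is always empty. FlexArray and onestring are replaced by a Python list and str, and lengths are counted with len() rather than in onechars.

```python
from dataclasses import dataclass, field


@dataclass
class SimplexResult:
    match: bool = False
    snag_array: list[str] = field(default_factory=list)
    match_length: int = 0


def lex(model: str, text: str) -> SimplexResult:
    """Toy lex(): succeeds if the beginning of text matches a literal-only model."""
    if any(ch in "^~%" for ch in model):
        raise ValueError("this sketch only supports literal-only models")
    if text.startswith(model):
        return SimplexResult(match=True, snag_array=[], match_length=len(model))
    return SimplexResult()


def match(model: str, text: str) -> bool:
    """The entire input must match: a lex() hit that consumes all of text."""
    result = lex(model, text)
    return result.match and result.match_length == len(text)


def snag(model: str, text: str) -> list[str]:
    """Captured pieces, or an empty list when match() would return False."""
    return lex(model, text).snag_array if match(model, text) else []
```

With the literal-only model .png, lex() succeeds on the input .png24 with a match length of 4, whereas match() returns False for the same input because two characters are left over.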

Ratscript Features

Ratscript's own unique implementation should allow for concatenation of multiple simplexes. This is unlike regex, which cannot be easily concatenated. Thus, one can write a reusable portion of a simplex, and then have it as PART of a larger simplex.

How exactly this will happen, I'm not sure...we'll figure it out.

Last Author
ardunster
Last Edited
Feb 17 2021, 2:10 PM

Event Timeline


Does this make sense to include at all anymore? We aren't passing any flags to simplex besides the model, and aspects of the model can handle both of these scenarios:

Flags

All flags are passed in as arguments on the simplex function.

Ignore Case: All letters are automatically changed to uppercase (ignoring space). [Default: false]
Full Match: If false, the model must only match a PART of the input string. [Default: true]