Page MenuHomePhabricator

API and Bytecode
Updated 995 Days AgoPublic

Discussion and notes for RIFT Format and APIs.

RIFT Format

Comment by Moshe:

I will preface with a couple points about compilers/interpreters, so as to be clear in the points I will make later.

In programming a bytecode interpreter, a distinction must be made between the lexer and the virtual machine (hereafter: VM) in which the commands are run.

In general, a file is nothing more than a sequence of bytes. So the first step (the “lexer”) in all compilation/interpretation is to group the bytes into bytearrays (“tokens”), each bytearray representing a syntax character, or a word.
The second step (the “parser”) is to give structured meaning to the tokens.

Put in other words, the first step is so the computer can read the file properly, analogous to a human determining which symbols on the page are punctuation (and what type), and which are part of words. While the second step is to actually read the file, analogous to a human determining the structure of a sentence. For example, recognizing the English phrase “Rebecca runs to the park” means we know “Rebecca” is the subject, “runs” is the verb, and “to the park” is the object (example taken verbatim from Language Implementation Patterns p. 90).

[After this step is usually the Intermediate Representation/Form. I won’t discuss this as it is certainly unnecessary for our purposes.]

Finally, we reach the VM, which will digest the tokens, and execute the instructions. I expect to use a Syntax-Directed Interpreter (Language Implementation Patterns pattern 24).

The Syntax-Directed Interpreter works well when our code is no more than a sequence of instructions, and action can be taken as soon as information is read from the file. It is “Syntax-Directed” because execution follows immediately upon “syntax”, meaning upon every individual instruction read.

The current spec for rift bytecode contains separators. This is necessary for the lexer stage mentioned above, for how else will the lexer know where a instruction starts and ends, and where a sub-instruction begins and ends, or where data begins and ends?

However, I propose that this may not be necessary, and we could in fact skip both the lexer and parser almost entirely.

As we know, some instructions may not pass data, some will pass data, and some will pass instructions. Let us assume that 1) any given instruction will always pass the same type and number of arguments, and that 2) the number of bytes for a instruction itself is always uniform, and so too the number of bytes for a data parameter. If we can safely assume all the above, then there is no need for the separators, as the interpreter can determine how many bytes to read next, and how many times to read that many bytes, based on the instruction that it has previously read.

I believe that the above assumptions can be met, assuming that data may often merely refer to a constant within the interpreter (as per Language Implementation Patterns section 10.3).

Of course the question is, can this be safely assumed?

[Note: the file of course may be read into memory in its entirety first, when I refer to reading the file I refer to its processing by the interpreter.]


Comment by Jason:

Very good points! Just a couple of things...

However, I propose that this may not be necessary, and we could in fact skip both the lexer and parser almost entirely. (...) I believe that the above assumptions can be met, assuming that data may often merely refer to a constant within the interpreter (as per Language Implementation Patterns section 10.3). Of course the question is, can this be safely assumed?

In some cases, yes. However, in other cases, the amount of data is actually unknown. A Bezier curve, for example, must be assumed to have between 2 and n points, so there would need to be, at minimum, some "sentry" value (a separator) to indicate when all the points have been specified. What's further, any time you have values with indeterminate length (such as strings), one right after another, you need to be able to separate these.

Of course, the separators and sentries are only reserved for such cases as they're needed. If a particular instruction's argument list does not need them, they can be safely ignored (not used) by that instruction's specification.

...group the bytes into bytearrays...

I forgot to mention that PawLIB has an experimental data structure called FlexBit, which was created for this specific purpose. It behaves like a std::vector (actually, like pawlib::FlexArray), in that it grows and shrinks to accommodate its contents, but is optimized for binary data. However, this is not yet stable. I'll be prioritizing refactoring FlexBit this December.


Comment by Moshe:

I was thinking, that the above assumptions, if used at all, will make it difficult to handle multiple RIFT specs (that is, if the appropriate number of parameters for a function changes, generic programming cannot be used to handle both versions). This makes it imperative, I think, to create and include versioning at the beginning of every RIFT file.


Comment by Jason:

That would be a wise thing to have in the metadata at the beginning, yes.

Last Author
jcmcdonald
Last Edited
Nov 23 2019, 8:36 PM

Event Timeline

jcmcdonald edited the content of this document. (Show Details)
jcmcdonald changed the edit policy from "Custom Policy" to "Anari [Project] (Project)".
mduminer edited the content of this document. (Show Details)
mduminer published a new version of this document.
jcmcdonald added a subscriber: mduminer.
jcmcdonald edited the content of this document. (Show Details)