Parsing system

Some brainstorming and documentation on the implementation. This document looks at the mass-import concept from a technical point of view and documents how the parsing of the transcript files is done.

The problem

  1. The way import.ts is currently developed is not well architected and it is becoming a mess to maintain. I would prefer an architecture where the common extraction logic, the list and settings of prefixes, and the error detection are clearly separated. I would like to work with smaller files.
  2. The parsing is slow in the browser (4 seconds for ~100 exos) and blocks the user interface. We need to run it in another thread, optimize the code, or only parse what changed instead of reparsing everything. This is particularly important for the preview in the VSCode extension.
  3. The parsing code is in JS, which may not be the best fit for running in the backend written in PHP...

Here is the brainstorm on how to fix this situation with a big rewrite: refactor the parsing code with a blocks strategy to get a much better and more maintainable structure. Goal: a parsing architecture that is easy to test, to understand and to change.

  • [ ] DoD:
    • [ ] all existing tests have been refactored and still pass.
    • [ ] it's possible to add a new keyword to the list of constants without touching the existing logic code.
    • [ ] A keyword TestKW would be defined with a name "TestKW" (without :), with settings (single line, markdown support) and an associated validation function that takes the extracted value, the exo or the block as parameter and returns errors or true (see the sketch after this list).
  • [ ] Brainstorm the parsing strategy and data types in a concept file.
    • [ ] Maybe what I want is similar to how programming languages are parsed into an AST and then analyzed on top of it
      • [ ] Define what a block, a keyword and other core types are, if necessary. Define functions and typings.
      • [ ] Give examples of block contents.
      • [ ] This is kind of an AST of the document structure.
      • [ ] Define how error detection works, on the AST or on the raw text.
      • [ ] Define the list of errors detected, declare them as constants somewhere.
      • [ ] Define a system of content ID to identify exos, skills and courses uniquely across imports.
      • [ ] Decide if a block should be a JavaScript class or another structure better than just a plain object
      • [ ] Define functions and their jobs.
      • [ ] Define a constant list of prefixes with their settings (list of keywords, single or multiline..., markdown support, can contain other blocks); this could even be used to auto-generate the language support...
      • [ ] Define blocks hierarchy (instruction block under exo block for ex.)
      • [ ] Can this parsing be used to do syntactic/advanced highlighting in VSCode? (see the docs)
      • [ ] The blocks tree has to contain the start (and end ?) line numbers of each block (maybe even char numbers), so that in the future we can do a reactive preview in VSCode by reparsing only the changed blocks and subblocks to refresh the state, the error detection and finally the preview. This could also be used to point at errors with red underline styles.
      • [ ] Make sure to separate the functions that:
      • [ ] split a raw text into raw blocks/strings,
      • [ ] build the exos list from these raw strings parsed into blocks (does that make sense?),
      • [ ] parse a given raw string into a block with an associated prefix, a list of keywords and some value,
      • [ ] convert a block into an object/property stored in the final exo object.
      • [ ] Validation code for each prefix must live in separate functions and have associated tests
      • [ ] A clear separation between the list of prefixes and keywords with the logic related to them, and the rest of the splitting, parsing and extraction that is common to all prefixes and keywords. Maybe we should use 2 files? This could later allow reusing this prefix-based system for an entirely different system with other needs and prefixes.
  • [ ] Ask for feedback on this strategy
  • [ ] Decide the implementation language: JavaScript, PHP or Rust (compiled to WASM for the web interface, used directly as a binary in VSCode and on the server, or as a PHP extension with PHP bindings to Rust functions).
  • [ ] If Rust is chosen, the choice must be made carefully as this is a new language for me. POCs of WASM compilation and JS bindings are important to do before starting anything. Migration of the unit tests to the Rust testing tool has to be feasible.
    • [ ] This is probably going to be used in both the backend and the frontend
  • [ ] Create a new file parsing.ts or dy.ts or something more related to the DY syntax and less to import or update.
  • [ ] Add tests for the new logic and migrate the existing ones into this new architecture
  • [ ] Implement function after function, refactor them when tests pass, commit them separately
  • [ ] Ship it !
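
As a concrete illustration of the TestKW item in the DoD above, adding a keyword could be a pure data declaration plus a validation function. This is only a sketch: the names KeywordSetting and KEYWORDS and the exact settings fields are assumptions, not a decided API.

interface KeywordSetting {
	name: string					// keyword name, without the trailing ":"
	singleLine: boolean				// the value must fit on a single line
	markdown: boolean				// the value supports markdown
	validate: (value: string) => true | string[]	// returns true or a list of errors
}

// Adding a keyword means appending one constant here, without touching the logic code.
const KEYWORDS: KeywordSetting[] = [
	{
		name: "TestKW",
		singleLine: true,
		markdown: true,
		validate: (value) => (value.trim().length > 0 ? true : ["TestKW: value cannot be empty"]),
	},
]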

Idea of the abstract syntax tree (described in TS for the sake of simplicity):

interface Block {
	prefix: PrefixDef,
	keywords: Keyword[],
	raw: string,
	value: string // ???
}

interface PrefixDef {
	name: string,
	keywords: KeywordDef[]
}

interface Keyword {
	type: KeywordDef,
	params: string[]
}

interface KeywordDef {
	pattern: string,
	// Need more data on the type or validation of the params? Like a regex?
	hasParams: boolean	// to know if there are params to parse
}

interface Tree {
	blocksLines: Map<Block, number>
	blocks: Block[]
	lastUpdate: Date
}
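
A minimal sketch of how these types could be used. The function names splitIntoRawBlocks and parseBlock are assumptions, as is the convention that every block starts on a line beginning with a known prefix; it also records start lines for the future incremental reparsing mentioned above.

// Hypothetical: split the raw transcript into raw blocks, one per prefix occurrence,
// keeping the start line of each block for incremental reparsing and error underlining.
function splitIntoRawBlocks(text: string, prefixes: PrefixDef[]): { raw: string; startLine: number }[] {
	const blocks: { raw: string; startLine: number }[] = []
	text.split("\n").forEach((line, i) => {
		if (prefixes.some((p) => line.startsWith(p.name + ":"))) {
			blocks.push({ raw: line, startLine: i + 1 }) // a new block starts here
		} else if (blocks.length > 0) {
			blocks[blocks.length - 1].raw += "\n" + line // continuation of the current block
		}
	})
	return blocks
}

// Hypothetical: parse one raw block into the Block structure above.
// Keyword extraction is omitted in this sketch.
function parseBlock(raw: string, prefix: PrefixDef): Block {
	const value = raw.slice(raw.indexOf(":") + 1).trim() // convention: trim extra whitespace
	return { prefix, keywords: [], raw, value }
}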

Error detection

  1. Some serious errors, where parsing the exo doesn't make any sense and the exo is skipped. These errors are searched for in the raw string:
    1. duplicated prefix (e.g. two Solution: blocks)
    2. no exo prefix (## Exo: or ### Subexo: )
    3. TBD
  2. Some light errors, where the maximum amount of attributes must still be correctly extracted so we can show a preview of the partial or incoherent parsed version. These errors are searched for in the extracted exo object, not the raw string. (TODO: refactor these 2 kinds of error detection.)
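
To illustrate the two kinds of detection, here is a sketch with assumed names (ERRORS, detectSeriousErrors) and deliberately simplified checks:

// Hypothetical constants, one entry per detected error.
const ERRORS = {
	DUPLICATED_PREFIX: "Duplicated prefix",
	NO_EXO_PREFIX: "No exo prefix found",
	// TBD: the other serious and light errors
} as const

// Serious errors are searched for in the raw string, before parsing.
function detectSeriousErrors(raw: string): string[] {
	const errors: string[] = []
	if ((raw.match(/^Solution:/gm) ?? []).length > 1) errors.push(ERRORS.DUPLICATED_PREFIX)
	if (!/^(## Exo:|### Subexo:)/m.test(raw)) errors.push(ERRORS.NO_EXO_PREFIX)
	return errors
}

// Light errors would be searched for in the extracted exo object, not the raw string.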

Unit tests

The import feature has a high level of complexity because of all the possible ways to make transcript errors. When dealing with hundreds of exos in dozens of skills, it's really important to catch every possible logic or format error before we import them, to avoid any problem later during the training. As exos support various formats, optional options and structures, they are not as easy as a skill with just a name and a description.
This complexity requires high test coverage to make sure the code can be refactored easily, maintained, and really does what it should in every possible edge case. This just cannot be tested by hand, because there are dozens, even a hundred cases to test at each change.

But unlike the import interface, which is not easy to implement, we don't have any dependencies here (no browser, no requests, no DOM, just string manipulation)! The parsing code is just a few functions that take a string as argument and return the extracted objects, with some errors if needed. Those functions live in import.ts. We write unit tests in ImportLogic.test.ts with the help of Vitest, and they run in less than 100ms!
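
A hypothetical example of such a test (extractExoName and its cases are assumptions; the real functions live in import.ts), including the "it in a foreach" pattern mentioned below:

import { describe, expect, it } from "vitest"
// Hypothetical import: a pure extraction function from import.ts.
import { extractExoName } from "./import"

describe("extractExoName", () => {
	// The "it in a foreach" pattern: one test generated per case.
	const cases = [
		{ raw: "## Exo: Addition", expected: "Addition" },
		{ raw: "## Exo:   Addition  ", expected: "Addition" }, // extra whitespace is trimmed
	]
	for (const { raw, expected } of cases) {
		it(`extracts "${expected}" from "${raw}"`, () => {
			expect(extractExoName(raw)).toBe(expected)
		})
	}
})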

  • TODO: document the test structure (it, describe, it in a foreach)
  • TODO: rethink the general architecture of the functions in import.ts, break big functions into smaller ones
  • TODO: should we rename import.ts to parsing.ts?

Parsing algorithm overview

TODO

Conventions

  • Every prefix that contains sub-prefixes (exo prefix, table prefix, others...) needs to be developed in a dedicated function that receives the raw block string as argument, to better group unit tests. (TODO: refactor this for Table:)
  • Extra whitespace characters must be trimmed (removed at the start and end of parsed texts), as in the sketch below
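
A sketch of both conventions together. The name parseTableBlock and the pipe-separated table syntax are assumptions, not the real DY syntax: the dedicated function receives the raw block string, so all its unit tests can be grouped around it, and every parsed text is trimmed.

// Hypothetical dedicated function for a prefix containing sub-prefixes.
function parseTableBlock(raw: string): { headers: string[]; rows: string[][] } {
	const lines = raw.split("\n").filter((l) => l.includes("|"))
	if (lines.length === 0) return { headers: [], rows: [] }
	// Convention: trim extra whitespace at the start and end of parsed texts.
	const split = (l: string) => l.split("|").map((c) => c.trim()).filter((c) => c !== "")
	const [header, ...rows] = lines
	return { headers: split(header), rows: rows.map(split) }
}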

Running parsing in a separate thread

  • In browser: web worker! (see the sketch below)
  • In backend: IDK
  • In VSCode: IDK
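
A minimal sketch of the web worker option for the browser. The file name parsing.worker.ts, the message shapes and parseTranscript are all assumptions.

// parsing.worker.ts (hypothetical): runs the parsing off the main thread.
// Assumes the TS "webworker" lib for the typings of self in this file.
// import { parseTranscript } from "./parsing" // assumed entry point of the new parsing module
self.onmessage = (event: MessageEvent<string>) => {
	// const result = parseTranscript(event.data)
	const result = { exos: [], errors: [] } // placeholder for the real parsed result
	self.postMessage(result)
}

// Somewhere in the web interface (hypothetical usage): the UI stays responsive while parsing.
const worker = new Worker(new URL("./parsing.worker.ts", import.meta.url), { type: "module" })
worker.onmessage = (event) => console.log("Parsed without blocking the UI:", event.data)
worker.postMessage("## Exo: Addition\n...") // send the raw transcript text to the worker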