Building Your Own Programming Language From Scratch

目录
警告
本文最后更新于 2023-06-18,文中内容可能已过时。

中文:

Part 1

1 简介(Introduction)

在本教程中,我们将使用Java建立自己的编程语言和编译器(你可以使用任何其他语言,最好是面向对象)。这篇文章的目的是帮助那些正在寻找创建自己的编程语言和编译器的人。这是一个玩具的例子,但它将试图帮助你了解从哪里开始,向哪个方向发展。完整的源代码可以在GitHub上找到。

每种语言从源代码到最终的可执行文件都有几个阶段。每个阶段都以某种方式对传入的数据进行格式化:

  1. 词法分析(Lexical analysis),简单地说,就是将源代码划分为标记。每个标记可以包含一个不同的词素:关键词、标识符/变量、带有相应值的运算符等。
  2. 语法分析或解析器(Syntax analysis or parser)将传入的标记列表转换为抽象语法树(abstract syntax tree (AST),它允许你在结构上呈现所创建语言的规则。这个过程本身很简单,一眼就能看出,但随着语言结构的增加,它可能变得更加复杂。
  3. 在构建了AST之后,我们就可以生成代码了。代码通常是使用抽象的语法树递归地生成的。我们的编译器为了简单起见将在语法分析过程中产生语句。

我们将创建一个具有以下能力的简单语言:

  1. 分配变量(数字、逻辑和文本)。
  2. 声明结构,创建实例和访问字段。
  3. 进行简单的数学运算(加、减、非)。
  4. 用数学运算符打印变量、数值和更复杂的表达式。
  5. 从控制台读取数值、逻辑和文本值。
  6. 执行if-then语句

有一个我们语言的代码的例子,它是Ruby和Python语法的混合:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
struct Person
    arg name
    arg experience
    arg is_developer
end

input your_name
input your_experience_in_years
input do_you_like_programming

person = new Person [your_name your_experience_in_years do_you_like_programming == "yes"]
print person

if person :: is_developer then

    person_name = person :: name
    print "hey " + person_name + "!"

    experience = person :: experience

    if experience > 0 then
        started_in = 2022 - experience
        print "you had started your career in " + started_in
    end

end

2 词法分析(Lexical analysis)

首先,我们将从词法分析开始。让我们想象一下,你从一个朋友那里得到一条信息,内容如下:

1
"Ienjoyreadingbooks"

这个表达方式有点难读。它只是一组没有任何意义的字母。这是因为我们的自然词法分析器无法从我们的字典中找到任何合适的词。然而,如果我们正确地放置空格,一切都会变得清晰:

1
"I enjoy reading books"

编程语言的词汇分析器的工作原理与此相同。当我们说话时,我们通过语调和停顿来强调个别的词,以了解对方。以同样的方式,我们必须向词汇分析器提供代码以使其理解我们。如果我们写错了,词法分析器将无法将单个词组、单词和句法结构分开。

在词汇分析过程中,我们的编程语言有六个词汇单位(tokens),我们将计算这些词汇:

  1. 空格、回车、和其他空白字符
    这些词素单位并没有什么意义。你不能通过使用空格来声明任何代码块或函数参数。主要意图是只帮助开发者将他的代码划分为独立的词组。因此,词法分析器首先会寻找空格和换行,以便了解如何在代码中突出所提供的词素。

  2. 运算符: +, -, =, <, >, ::
    它们可以是更复杂的复合语句的一部分。等号不仅可以表示一个赋值运算符,而且还可以结合一个更复杂的平等比较运算符,由两个=组成。在这种情况下,词法分析器将尝试从左到右读取表达式,试图抓住最长的运算符。

  3. 组别分割线: [, ]
    组的分隔符可以将两个组的词条相互分开。例如,一个开放的方括号可以用来标记一些特定组的开始,一个封闭的方括号将标记开始组的结束。在我们的语言中,方括号将只用于声明结构实例参数。

  4. 关键字:print, input, struct, arg, end, new, if, then

    关键字是一组具有某种特定意义的字母字符,由编译器分配。例如,5个字母print的组合构成一个词,它被语言编译器认为是一个控制台输出语句的定义。这个意义不能被程序员改变。这就是关键词的基本思想,它们是语言的公理,我们把它们结合起来,创造我们自己的语句–程序。

  5. 变量或标识符
    除了关键字之外,我们还需要考虑到变量。变量是由程序员而不是编译器给出的具有某种意义的字符序列。同时,我们需要对变量名称进行某些限制。我们的变量将只包含字母、数字和下划线字符。标识符内的其他字符不能出现,因为我们描述的大多数运算符都是定界符,因此不能成为另一个词素的一部分。在这种情况下,变量不能以数字开头,这是因为词法分析器检测到一个数字后会立即尝试与数字相匹配。另外,需要注意的是,该变量不能用关键词来表达。这意味着,如果我们的语言定义了关键字print,那么程序员就不能引入一个以相同顺序定义的相同字符集的变量。

  6. 字面意义
    如果给定的字符序列不是一个关键词,也不是一个变量,那么还有最后一种选择——它可以是一个字面常数。我们的语言将能够定义数字、逻辑和文本字面。数字字面是包含数字的特殊变量。为了简单起见,我们将不使用浮点数字(分数),你可以在以后自己实现它。逻辑字面可以包含布尔值:falsetrue。文本字面是一组任意的字母表中的字符,用双引号(“”)括起来。

现在我们有了关于我们编程语言中所有可能的词素的基本信息,让我们深入到代码中,开始声明我们的标记类型。我们将使用枚举常量,为每个词素类型提供相应的Pattern表达。我将使用Lombok注解来减少模板代码:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
@RequiredArgsConstructor
@Getter
public enum TokenType {
    Whitespace("[\\s\\t\\n\\r]"),
    Keyword("(if|then|end|print|input|struct|arg|new)"),
    GroupDivider("(\\[|\\])"),
    Logical("true|false"),
    Numeric("[0-9]+"),
    Text("\"([^\"]*)\""),
    Variable("[a-zA-Z_]+[a-zA-Z0-9_]*"),
    Operator("(\\+|\\-|\\>|\\<|\\={1,2}|\\!|\\:{2})");

    private final String regex;
}

为了简化字词的解析,我把每一种字词类型划分为一个单独的词组: Numeric, LogicalText。对于Text字面,我们设置了一个单独的组([^"]*)来抓取没有双引号的字面值。对于等号,我们声明{1,2}范围。用两个等号==我们希望得到比较运算符而不是赋值。为了访问一个结构域,我们声明了双冒号运算符::

现在要在我们的代码中找到一个标记,我们只需在我们的源代码上应用正则来迭代和过滤所有TokenType值。为了匹配行的开头,我们在每个正则表达式的开头加上^元字符,创建一个Pattern实例。TokenType将捕获没有引号的值到单独的组中。因此,为了访问不带引号的值,如果我们至少有一个明确的组,我们从索引为1的组中抓取一个值:

1
2
3
4
5
6
7
8
for (TokenType tokenType : TokenType.values()) {
    Pattern pattern = Pattern.compile("^" + tokenType.getRegex());
    Matcher matcher = pattern.matcher(sourceCode);
    if (matcher.find()) {
        // group(1) is used to get text literal without double quotes
        String token = matcher.groupCount() > 0 ? matcher.group(1) : matcher.group();
    }
}

为了存储找到的词条,我们需要声明以下带有类型和值字段的Token类:

1
2
3
4
5
6
@Builder
@Getter
public class Token {
    private final TokenType type;
    private final String value;
}

现在我们有了创建我们的词法分析器的一切。我们将在构造函数中接收作为字符串的源代码,并初始化tokens List:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class LexicalParser {
    private final List<Token> tokens;
    private final String source;

    public LexicalParser(String source) {
        this.source = source;
        this.tokens = new ArrayList<>();
    }

    ...
}

为了从源代码中检索出所有的标记,我们需要在每个发现的词组之后切割源代码。我们将声明nextToken()方法,该方法将接受源代码的当前索引作为参数,并抓取当前位置之后开始的下一个标记:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
public class LexicalParser {
    ...

    private int nextToken(int position) {
        String nextToken = source.substring(position);
        for (TokenType tokenType : TokenType.values()) {
            Pattern pattern = Pattern.compile("^" + tokenType.getRegex());
            Matcher matcher = pattern.matcher(nextToken);
            if (matcher.find()) {
                if (tokenType != TokenType.Whitespace) {
                    // group(1) is used to get text literal without double quotes
                    String value = matcher.groupCount() > 0 ? matcher.group(1) : matcher.group();
                    Token token = Token.builder().type(tokenType).value(value).build();
                    tokens.add(token);
                }
                return matcher.group().length();
            }
        }
        throw new TokenException(String.format("invalid expression: `%s`", nextToken));
    }
}

成功捕获后,我们创建一个Token实例,并将其累积到tokens列表中。我们将不添加Whitespace词条,因为它们只用于将两个词条相互分割开来。最后,我们返回找到的词组的长度。

为了捕捉源文件中的所有标记,我们创建了parse()方法,其中的while循环在每次捕捉到一个标记时增加源文件的位置:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
public class LexicalParser {
    ...

    public List<Token> parse() {
        int position = 0;
        while (position < source.length()) {
            position += nextToken(position);
        }
        return tokens;
    }

    ... 
}

3 语法分析(Syntax analysis)

在我们的编译器模型中,语法分析器将从词法分析器接收一个标记列表,并检查这个序列是否可以由语言语法生成。最后,这个语法分析器应该返回一个抽象的语法树。

我们将通过声明Expression接口来启动语法分析器:

1
2
public interface Expression {
}

这个接口将被用来声明字面意义、变量和带有运算符的复合表达式。

3.1 字面值(Literals)

首先,我们为我们语言的字面值类型创建 Expression 的实现:Numeric, TextLogical与相应的Java类型: Integer, StringBoolean。我们将创建具有通用类型的 Value 类,并将其扩展为可比较类型(它将在以后用于比较运算符):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
@RequiredArgsConstructor
@Getter
public class Value<T extends Comparable<T>> implements Expression {
    private final T value;

    @Override
    public String toString() {
        return value.toString();
    }
}

public class NumericValue extends Value<Integer> {
    public NumericValue(Integer value) {
        super(value);
    }
}

public class TextValue extends Value<String> {
    public TextValue(String value) {
        super(value);
    }
}

public class LogicalValue extends Value<Boolean> {
    public LogicalValue(Boolean value) {
        super(value);
    }
}

我们还将为我们的结构实例声明StructureValue

1
2
3
4
5
public class StructureValue extends Value<StructureExpression> {
    public StructureValue(StructureExpression value) {
        super(value);
    }
}

结构化表达(StructureExpression)将在稍后介绍。

3.2 变量(Variables)

变量表达式将有一个代表其名称的单一字段:

1
2
3
4
5
@AllArgsConstructor
@Getter
public class VariableExpression implements Expression {
    private final String name;
}

3.3 结构体(Structures)

为了存储一个结构实例,我们需要知道结构定义和参数值,我们将通过这些参数值来创建一个对象。参数值可以表示任何表达式,包括字面值、变量和实现Expression接口的更复杂的表达式。因此,我们将使用Expression作为数值类型:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
@RequiredArgsConstructor
@Getter
public class StructureExpression implements Expression {
    private final StructureDefinition definition;
    private final List<Expression> values;
}

@RequiredArgsConstructor
@Getter
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
public class StructureDefinition {
    @EqualsAndHashCode.Include
    private final String name;
    private final List<String> arguments;
}

值可以是VariableExpression的一个类型。我们需要一种方法来通过其名称访问变量的值。我将把这个责任委托给Function接口,它将接受变量名并返回一个Value对象:

1
2
3
4
5
...
public class StructureExpression implements Expression {
    ...
    private final Function<String, Value<?>> variableValue;
}

现在我们可以实现一个接口,通过名称检索参数的值,它将被用来访问结构实例的值。getValue()方法将在稍后实现:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
...
public class StructureExpression implements Expression {
       

    public Value<?> getArgumentValue(String field) {
        return IntStream
                .range(0, values.size())
                .filter(i -> definition.getArguments().get(i).equals(field))
                .mapToObj(this::getValue) //will be implemented later
                .findFirst()
                .orElse(null);
    }
}

不要忘了我们的StructureExpression是作为扩展了ComparableStructureValue通用类型的参数使用的。因此,我们必须为我们的StructureExpression实现Comparable接口:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
...
public class StructureExpression implements Expression, Comparable<StructureExpression> {
    ...
    @Override
    public int compareTo(StructureExpression o) {
        for (String field : definition.getArguments()) {
            Value<?> value = getArgumentValue(field);
            Value<?> oValue = o.getArgumentValue(field);
            if (value == null && oValue == null) continue;
            if (value == null) return -1;
            if (oValue == null) return 1;
            //noinspection unchecked,rawtypes
            int result = ((Comparable) value.getValue()).compareTo(oValue.getValue());
            if (result != 0) return result;
        }
        return 0;
    }
}

我们也可以重写标准的toString()方法。如果我们想在控制台中打印整个结构实例,这将是非常有用的:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
...
public class ObjectExpression implements Expression, Comparable<ObjectExpression> {
    ...
    @Override
    public String toString() {
        return IntStream
                .range(0, values.size())
                .mapToObj(i -> {
                    Value<?> value = getValue(i); //will be implemented later
                    String fieldName = definition.getArguments().get(i);
                    return fieldName + " = " + value;
                })
                .collect(Collectors.joining(", ", definition.getName() + " [ ", " ]"));
    }
}

3.4 运算符(Operators)

在我们的语言中,我们将有带1个操作数的单数运算符和带2个操作数的二元运算符。

img

让我们为我们的每个运算符实现表达式接口。我们声明OperatorExpression接口,并为单运算符和双运算符创建基础实现:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
public interface OperatorExpression extends Expression {
}

@RequiredArgsConstructor
@Getter
public class UnaryOperatorExpression implements OperatorExpression {
    private final Expression value;
}

@RequiredArgsConstructor
@Getter
public class BinaryOperatorExpression implements OperatorExpression {
    private final Expression left;
    private final Expression right;
}

我们还需要为我们的每个运算符的实现声明一个抽象的calc()方法,并有相应的值操作数。这个方法将根据每个运算符的本质,从操作数中计算出值。操作数从表达式到值的转换将在稍后介绍。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11

public abstract class UnaryOperatorExpression extends OperatorExpression {
    
    public abstract Value<?> calc(Value<?> value);
}


public abstract class BinaryOperatorExpression extends OperatorExpression {
    
    public abstract Value<?> calc(Value<?> left, Value<?> right);
}

在声明了基础运算符类之后,我们可以创建更详细的实现:

3.4.1 非(NOT !)

这个操作符只对LogicalValue起作用,返回具有反转值的新LogicalValue实例:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class NotOperator extends UnaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> value) {
        if (value instanceof LogicalValue) {
            return new LogicalValue(!((LogicalValue) value.getValue()).getValue());
        } else {
            throw new ExecutionException(String.format("Unable to perform NOT operator for non logical value `%s`", value));
        }
    }
}

现在我们将切换到有两个操作符的二元操作数表达式的实现:

3.4.2 加(Addition +)

第一个是加法。Calc()方法将对NumericValue对象进行加法。对于其他价值类型,我们将连接toString()值,并返回TextValue实例:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
public class AdditionOperator extends BinaryOperatorExpression {
    public AdditionOperator(Expression left, Expression right) {
        super(left, right);
    }

    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (left instanceof NumericValue && right instanceof NumericValue) {
            return new NumericValue(((NumericValue) left).getValue() + ((NumericValue) right).getValue());
        } else {
            return new TextValue(left.toString() + right.toString());
        }
    }
}

3.4.3 减(Subtraction -)

calc()只有在两个值都是NumericValue类型时才会进行getValue()比较。在其他情况下,两个值都将被映射到字符串中,从第一个值中删除第二个值的匹配:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class SubtractionOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (left instanceof NumericValue && right instanceof NumericValue) {
            return new NumericValue(((NumericValue) left).getValue() - ((NumericValue) right).getValue());
        } else {
            return new TextValue(left.toString().replaceAll(right.toString(), ""));
        }
    }
}

3.4.4 相等(Equals ==)

calc()只有在两个值的类型相同时才会进行getValue()比较。在其他情况下,两个值都将被映射为字符串:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
public class EqualsOperator extends BinaryOperatorExpression {
   ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        boolean result;
        if (Objects.equals(left.getClass(), right.getClass())) {
            result = ((Comparable) left.getValue()).compareTo(right.getValue()) == 0;
        } else {
            result = ((Comparable) left.toString()).compareTo(right.toString()) == 0;
        }
        return new LogicalValue(result);
    }
}

3.4.5 小于 和 大于(Less than < and greater than >)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
public class LessThanOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        boolean result;
        if (Objects.equals(left.getClass(), right.getClass())) {
            result = ((Comparable) left.getValue()).compareTo(right.getValue()) < 0;
        } else {
            result = left.toString().compareTo(right.toString()) < 0;
        }
        return new LogicalValue(result);
    }
}

public class GreaterThanOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        boolean result;
        if (Objects.equals(left.getClass(), right.getClass())) {
            result = ((Comparable) left.getValue()).compareTo(right.getValue()) > 0;
        } else {
            result = left.toString().compareTo(right.toString()) > 0;
        }
        return new LogicalValue(result);
    }
}

3.4.6 结构体取值运算符(Structure value operator ::)

要读取结构体参数的值,我们希望将左值作为 StructureValue 类型接收:

1
2
3
4
5
6
7
8
9
public class StructureValueOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (left instanceof StructureValue)
            return ((StructureValue) left).getValue().getArgumentValue(right.toString());
        return left;
    }
}

3.5 Value evaluation

如果我们想把变量或更复杂的表达式实现传递给运算符,包括运算符本身,我们需要一种方法来把表达式对象转换为值。为了做到这一点,我们在我们的表达式接口中声明evaluate()方法:

1
2
3
public interface Expression {
    Value<?> evaluate();
}

Value类将只是返回这个实例:

1
2
3
4
5
6
7
public class Value<T extends Comparable<T>> implements Expression {
    ...
    @Override
    public Value<?> evaluate() {
        return this;
    }
}

为了实现VariableExpressionevaluate()方法,首先我们必须提供一种通过变量名称获取Value的能力。我们将这项工作委托给Function字段,它将接受变量名称并返回相应的Value对象:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
...
public class VariableExpression implements Expression {
    ...
    private final Function<String, Value<?>> variableValue;

    @Override
    public Value<?> evaluate() {
        Value<?> value = variableValue.apply(name);
        if (value == null) {
            return new TextValue(name);
        }
        return value;
    }
}

为了评估StructureExpression,我们将创建一个StructureValue实例。我们还需要实现缺失的getValue()方法,该方法将接受一个参数索引并根据Expression类型返回值:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
...	
public class StructureExpression implements Expression, Comparable<StructureExpression> {
    ...
    @Override
    public Value<?> evaluate() {
        return new StructureValue(this);
    }

    private Value<?> getValue(int index) {
        Expression expression = values.get(index);
        if (expression instanceof VariableExpression) {
            return variableValue.apply(((VariableExpression) expression).getName());
        } else {
            return expression.evaluate(); 
        }
    }
}

对于运算符表达式,我们评估操作数的值并将其传递给calc()方法:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
public abstract class UnaryOperatorExpression extends OperatorExpression {
    ...
    @Override
    public Value<?> evaluate() {
        return calc(getValue().evaluate());
    }
}

public abstract class BinaryOperatorExpression extends OperatorExpression {
    ...
    @Override
    public Value<?> evaluate() {
        return calc(getLeft().evaluate(), getRight().evaluate());
    }
}

3.6 语句(Statements)

在我们开始创建我们的语法分析器之前,我们需要为我们的语言的语句引入一个模型。让我们从带有execute()处理方法的Statement接口开始,它将在代码执行期间被调用:

1
2
3
public interface Statement {
    void execute();
}
3.6.1 打印语句(Print statement)

我们要实现的第一个语句是PrintStatement。这个语句需要知道要打印的表达式。我们的语言将支持打印字面值、变量和其他实现Expression的表达式,这些表达式将在执行过程中被评估,以计算出Value对象:

3.6.2 赋值(Assign statement)

要声明一个赋值,我们需要知道变量的名字和我们要赋值的表达式。我们的语言将能够赋值字面值、变量和实现Expression接口的更复杂的表达式:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
@AllArgsConstructor
@Getter
public class AssignStatement implements Statement {
    private final String name;
    private final Expression expression;

    @Override
    public void execute() {
        ...
    }
}

在执行过程中,我们需要把变量的值储存在某个地方。让我们把这个逻辑委托给BiConsumer字段,它将传递一个变量名和赋值。注意,在进行赋值时,需要对Expression的值进行评估:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
...
public class AssignStatement implements Statement {
    ...
    private final BiConsumer<String, Value<?>> variableSetter;

    @Override
    public void execute() {
        variableSetter.accept(name, expression.evaluate());
    }
}
3.6.3 输入语句(Input statement)

为了从控制台读取一个表达式,我们需要知道要赋值的变量名称。在execute()过程中,我们必须从控制台读取一行。这项工作可以委托给Supplier,因此我们将不会创建多个输入流。读取该行后,我们将其解析为相应的Value对象,并将赋值委托给BiConsumer实例,正如我们对AssignStatement所做的那样:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
@AllArgsConstructor
@Getter
public class InputStatement implements Statement {
    private final String name;
    private final Supplier<String> consoleSupplier;
    private final BiConsumer<String, Value<?>> variableSetter;

    @Override
    public void execute() {
        //just a way to tell the user in which variable the entered value will be assigned
        System.out.printf("enter \"%s\" >>> ", name.replace("_", " "));
        String line = consoleSupplier.get();

        Value<?> value;
        if (line.matches("[0-9]+")) {
            value = new NumericValue(Integer.parseInt(line));
        } else if (line.matches("true|false")) {
            value = new LogicalValue(Boolean.valueOf(line));
        } else {
            value = new TextValue(line);
        }

        variableSetter.accept(name, value);
    }
}
3.6.4 条件语句(Condition statement)

首先,我们将介绍CompositeStatement类,它将包含要执行的内部语句列表:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
@Getter
public class CompositeStatement implements Statement {
    private final List<Statement> statements2Execute = new ArrayList<>();

    public void addStatement(Statement statement) {
        if (statement != null)
            statements2Execute.add(statement);
    }

    @Override
    public void execute() {
        statements2Execute.forEach(Statement::execute);
    }
}

这个类可以在以后我们创建复合语句结构时重复使用。第一个结构将是Condition语句。为了描述条件,我们可以使用字面值、变量和实现Expression接口的更复杂的结构。最后,在执行过程中,我们要计算Expression的值,并确保该值是符合逻辑的,而且里面的结果是真实的。只有在这种情况下,我们才能执行内部语句:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
@RequiredArgsConstructor
@Getter
public class ConditionStatement extends CompositeStatement {
    private final Expression condition;

    @Override
    public void execute() {
        Value<?> value = condition.evaluate();
        if (value instanceof LogicalValue) {
            if (((LogicalValue) value).getValue()) {
                super.execute();
            }
        } else {
            throw new ExecutionException(String.format("Cannot compare non logical value `%s`", value));
        }
    }
}

3.7 语句解析器(Statement parser)

现在我们有了与我们的词库一起工作的语句,我们现在可以建立StatementParser。在这个类中,我们声明了一个标记的列表和可变的标记位置变量:

1
2
3
4
5
6
7
8
9
public class StatementParser {
    private final List<Token> tokens;
    private int position;

    public StatementParser(List<Token> tokens) {
        this.tokens = tokens;
    }
    ...
}

然后我们创建parseExpression()方法,它将处理提供的标记并返回一个完整的语句:

1
2
3
public Statement parseExpression() {
	...
}

我们的语言表达式只能从声明一个变量或一个关键词开始。因此,首先我们需要读取一个标记,并验证它是否具有Variable或关Keyword的标记类型。为了处理这个问题,我们声明一个单独的next()方法,将token类型varargs作为一个参数,它将验证下一个令牌是否具有相同的类型。在真实的情况下,我们增加StatementParserposition字段并返回找到的token

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
private Statement parseExpression() {
      Token token = next(TokenType.Keyword, TokenType.Variable);
	...
}

private Token next(TokenType type, TokenType... types) {
    TokenType[] tokenTypes = org.apache.commons.lang3.ArrayUtils.add(types, type);
    if (position < tokens.size()) {
        Token token = tokens.get(position);
        if (Stream.of(tokenTypes).anyMatch(t -> t == token.getType())) {
            position++;
            return token;
        }
    }
    Token previousToken = tokens.get(position - 1);
    throw new SyntaxException(String.format("After `%s` declaration expected any of the following lexemes `%s`", previousToken, Arrays.toString(tokenTypes)));
}

在我们找到适当的token后,我们可以根据token类型建立我们的声明。

3.7.1 变量(Variable)

每个变量赋值都以变量标记类型开始,我们已经读过了。我们期待的下一个标记是equals =操作符。你可以重写next()方法,该方法将接受带有字符串操作符表示的 TokenType(例如=),并且只在下一个找到的token具有与请求的相同类型和值时才返回:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
public Statement parseExpression() {
    Token token = next(TokenType.Keyword, TokenType.Variable);
    switch (token.getType()) {
        case Variable:
            next(TokenType.Operator, "="); //skip equals

            Expression value;
            if (peek(TokenType.Keyword, "new")) {
                value = readInstance();
            } else {
                value = readExpression();
            }

            return new AssignStatement(token.getValue(), value, variables::put);
    }
    ...
}

在等号运算符之后,我们读到一个表达式。当我们实例化一个结构时,一个表达式可以用new关键字开始。我们将在稍后介绍这种情况。为了在不增加位置和抛出错误的情况下消耗一个标记以验证其值,我们将创建一个额外的peek()方法,如果下一个标记与我们期望的相同,它将返回真值:

1
2
3
4
5
6
7
8
9
...
private boolean peek(TokenType type, String value) {
    if (position < tokens.size()) {
        Token token = tokens.get(position);
        return type == token.getType() && token.getValue().equals(value);
    }
    return false;
}
...
变量表达式(Variable Expression)

image

首先,我们将介绍普通的变量表达式读取,它可以是一个字面意思,也可以是另一个变量,或者是一个带有运算符的更复杂的表达式。因此,readExpression()方法将返回具有相应实现的Expression。在这个方法中,我们声明左边的Expression实例。一个表达式只能以一个字面意思或另一个变量开始。因此,我们期望从我们的next()方法中获得Variable, Numeric, Logical or Text标记类型。在读取一个适当的标记后,我们在下面的switch case块内将其转换为适当的表达式对象,并将其分配给左边的Expression变量:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
private Expression readExpression() {
    Expression left;
    Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
    String value = token.getValue();
    switch (token.getType()) {
        case Numeric:
            left = new NumericValue(Integer.parseInt(value));
        case Logical:
            left = new LogicalValue(Boolean.valueOf(value));
        case Text:
            left = new TextValue(value);
        case Variable:
        default:
            left = new VariableExpression(value, variables::get);
    }
    ...
}

在读完左边的变量或字面值后,我们希望能抓到一个Operator词组:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
private Expression readExpression() {
    Expression left;
    Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
    String value = token.getValue();
    switch (token.getType()) {
        ...
    }

    Token operation = next(TokenType.Operator);
    Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
    ...
}

然后我们把我们的Operator词组映射到OperatorExpression对象上。要做到这一点,你可以使用一个switch block块,根据标记值返回适当的OperatorExpression类。我将使用枚举常量和一个适当的OperatorExpression类型:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
@RequiredArgsConstructor
@Getter
public enum Operator {
    Not("!", NotOperator.class),
    Addition("+", AdditionOperator.class),
    Subtraction("-", SubtractionOperator.class),
    Equality("==", EqualsOperator.class),
    GreaterThan(">", GreaterThanOperator.class),
    LessThan("<", LessThanOperator.class),
    StructureValue("::", StructureValueOperator.class);

    private final String character;
    private final Class<? extends OperatorExpression> operatorType;

    public static Class<? extends OperatorExpression> getType(String character) {
        return Arrays.stream(values())
                .filter(t -> Objects.equals(t.getCharacter(), character))
                .map(Operator::getOperatorType)
                .findAny().orElse(null);
    }
}

最后,我们读取右边的字面值或变量,正如我们对左边的Expression所做的那样。为了不重复代码,我们将读取Expression的内容提取到一个单独的NextExpression()方法中:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
private Expression nextExpression() {
    Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
    String value = token.getValue();
    switch (token.getType()) {
        case Numeric:
            return new NumericValue(Integer.parseInt(value));
        case Logical:
            return new LogicalValue(Boolean.valueOf(value));
        case Text:
            return new TextValue(value);
        case Variable:
        default:
            return new VariableExpression(value, variables::get);
    }
}

让我们重构readExpression()方法,用nextExpression()读取正确的词素。但是在我们读取正确的词素之前,我们需要确定我们的操作符支持两个操作数。我们可以检查我们的操作符是否扩展了BinaryOperatorExpression。在其他情况下,如果运算符是单数,我们只用左边的表达式来创建OperatorExpression。要创建一个OperatorExpression对象,我们要检索适合单数或二进制运算符实现的构造函数,然后用获得的早期表达式创建一个实例:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
@SneakyThrows
private Expression readExpression() {
    Expression left = nextExpression();

    Token operation = next(TokenType.Operator);
    Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());

    if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        Expression right = nextExpression();
        return operatorType
                .getConstructor(Expression.class, Expression.class)
                .newInstance(left, right);
    } else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        return operatorType
                .getConstructor(Expression.class)
                .newInstance(left);
    }

    return left;
}

此外,我们可以提供一个机会,用多个运算符创建一个长表达式,或者根本不用运算符,只用一个字面或变量。让我们把读入while循环的操作包围起来,条件是我们有一个操作符作为下一个词素。每次我们创建一个OperatorExpression,我们就把它分配给左边的表达式,这样就在一个Expression里面创建了一棵后续运算符树,直到我们读完整个表达式。最后,我们返回左边的表达式:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
@SneakyThrows
private Expression readExpression() {
    Expression left = nextExpression();

    //recursively read an expression
    while (peek(TokenType.Operator)) {
        Token operation = next(TokenType.Operator);
        Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
        if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
            Expression right = nextExpression();
            left = operatorType
                    .getConstructor(Expression.class, Expression.class)
                    .newInstance(left, right);
        } else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
            left = operatorType
                    .getConstructor(Expression.class)
                    .newInstance(left);
        }
    }

    return left;
}

private boolean peek(TokenType type) {
    if (position < tokens.size()) {
        Token token = tokens.get(position);
        return token.getType() == type;
    }
    return false;
}
结构体实例(Structure instance)

在我们完成readExpression()的实现后,我们可以回到parseExpression(),并完成readInstance()的实现,以实例化一个结构实例:

image

根据我们语言的语义,我们知道我们的结构实例化是从new关键字开始的,我们可以通过调用next()方法跳过下一个标记。下一个词素将标志着结构名称,我们将其理解为Variable标记类型。在结构名称之后,我们希望收到方括号内的参数作为组的分隔符。在某些情况下,我们的结构可以在没有任何参数的情况下被创建。因此,我们首先使用 peek() 。每个传递给结构的参数可以意味着一个表达式,因此我们调用 readExpression() 并将结果传递到参数列表中。在建立结构参数之后,我们可以初步建立我们的StructureExpression,检索适当的StructureDefinition

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
private Expression readInstance() {
    next(TokenType.Keyword, "new"); //skip new

    Token type = next(TokenType.Variable);

    List<Expression> arguments = new ArrayList<>();

    if (peek(TokenType.GroupDivider, "[")) {

        next(TokenType.GroupDivider, "["); //skip open square bracket

        while (!peek(TokenType.GroupDivider, "]")) {
            Expression value = readExpression();
            arguments.add(value);
        }

        next(TokenType.GroupDivider, "]"); //skip close square bracket
    }

    StructureDefinition definition = structures.get(type.getValue());
    if (definition == null) {
        throw new SyntaxException(String.format("Structure is not defined: %s", type.getValue()));
    }
    return new StructureExpression(definition, arguments, variables::get);
}

为了按名称检索StructureDefinition,我们必须将结构映射声明为StatementParser的字段,我们将在稍后的结构关键字分析中填充它:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public class StatementParser {
    ...
    private final Map<String, StructureDefinition> structures;

    public StatementParser(List<Token> tokens) {
	  ...
        this.structures = new HashMap<>();
    }
    ...
}

3.7.2 关键字(Keyword)

English:

Part 1


1 Introduction

In this tutorial, we will build our own programming language and compiler using Java (you can use any other language, preferably object-oriented). The purpose of the article is to help people who are looking for a way to create their own programming language and compiler. This is a toy example, but it will try to help you understand where to start and in which direction to move. The full source code is available over on GitHub

Each language has several stages from the source code to the final executable file. Each of the stages formats incoming data in a certain way:

  1. Lexical analysis in simple terms it’s a division of the source code into tokens. Each token can contain a different lexeme: keyword, identifier/variable, operator with the corresponding value, etc.
  2. Syntax analysis or parser converts a list of incoming tokens into the abstract syntax tree (AST), which allows you to structurally present the rules of the language being created. The process itself is quite simple as can be seen at the first glance, but with an increase of language constructions, it can become much more complicated.
  3. After AST has been built we can generate the code. Code is usually generated recursively using an abstract syntax tree. Our compiler for the sake of simplicity will produce statements during the syntax analysis.

We will create a simple language with the following abilities:

  1. assign the variables (numeric, logical and text)
  2. declare structures, create instances and accessing fields
  3. perform simple mathematical operations (addition, subtraction, NOT)
  4. print variables, values and more complex expressions with mathematical operators
  5. read numeric, logical and text values from the console
  6. perform if-then statements

There is an example of our language’s code, it’s a mix of Ruby and Python syntax:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
struct Person
    arg name
    arg experience
    arg is_developer
end

input your_name
input your_experience_in_years
input do_you_like_programming

person = new Person [your_name your_experience_in_years do_you_like_programming == "yes"]
print person

if person :: is_developer then

    person_name = person :: name
    print "hey " + person_name + "!"

    experience = person :: experience

    if experience > 0 then
        started_in = 2022 - experience
        print "you had started your career in " + started_in
    end

end

2 Lexical analysis

First of all, we will start with the lexical analysis. Let’s imagine you got a message from a friend with the following content:

1
Ienjoyreadingbooks

This expression is a little difficult to read. It’s just a set of letters without any meaning. This is because our natural lexical analyzer can’t find any appropriate word from our dictionary. However, if we put spaces correctly, everything becomes clear:

1
I enjoy reading books

The lexical analyzer of a programming language works by the same principle. When we talk we highlight individual words by intonation and pauses to understand each other. In the same way we must provide the code to the lexical analyzer to make it understand us. If we write it wrong, the lexical analyzer will not be able to separate individual lexemes, words and syntax constructions.

There are six lexemes units (tokens) we will count in our programming language during lexical analysis:

  1. Space, carriage return, and other whitespace characters

These lexeme units do not make a lot of sense. You can’t declare any code blocks or function parameters by using a space. The main intent is to only help a developer to divide his code into separate lexemes. Therefore, the lexical analyzer will first of all look for spaces and newlines in order to understand how to highlight provided lexemes in the code.

  1. Operators: +, -, =, <, >, ::

They can be a part of more complex compound statements. The equals sign can mean not only an assignment operator but can also combine a more complex equality compare operator, consisting of two = in a row. In this case, the lexical analyzer will try to read expressions from left to right trying to catch the longest operator.

  1. Group dividers: [, ]

Group dividers can separate two group lexemes from each other. For example, an open square bracket can be used to mark the beginning of some specific group and a close square bracket will mark the end of the started group. In our language square brackets will be only used to declare struct instance arguments.

  1. Keywords: print, input, struct, arg, end, new, if, then

A keyword is a set of alphabetic characters with some specific sense assigned by the compiler. For example, a combination of 5 letters print makes one word and it’s perceived by the language compiler as a console output statement definition. This meaning cannot be changed by a programmer. That’s the basic idea of keywords, they are language axioms that we combine to create our own statements - programs.

  1. Variables or identifiers

In addition to the keywords, we also need to take variables into account. A variable is a sequence of characters with some meaning given by a programmer, not a compiler. At the same time, we need to put certain restrictions on variable names. Our variables will contain only letters, numbers, and underscore characters. Other characters within the identifier cannot occur because most of the operators we described are delimiters and therefore cannot be part of another lexeme. In this case, the variable cannot begin with a digit due to the fact that the lexical analyzer detects a digit and immediately tries to match it with a number. Also, it’s important to note that the variable cannot be expressed by a keyword. It means that if our language defines the keyword print, then the programmer can’t introduce a variable with the same set of characters defined in the same order.

  1. Literals

If the given sequence of characters is not a keyword and is not a variable, the last option remains - it can be a literal constant. Our language will be able to define numeric, logical and text literals. Numeric literals are special variables containing digits. For the sake of simplicity, we will not use floating-point numbers (fractions), you can implement it yourself later. Logical literals can contain boolean values: false or true. Text literal is an arbitrary set of characters from the alphabet enclosed in double-quotes.

Now that we have the basic information about all possible lexemes in our programming language let’s dive into the code and start declaring our token types. We will use enum constants with the corresponding Pattern expression for each lexeme type. I will use Lombok annotations to minimize the boilerplate code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
@RequiredArgsConstructor
@Getter
public enum TokenType {
    Whitespace("[\\s\\t\\n\\r]"),
    Keyword("(if|then|end|print|input|struct|arg|new)"),
    GroupDivider("(\\[|\\])"),
    Logical("true|false"),
    Numeric("[0-9]+"),
    Text("\"([^\"]*)\""),
    Variable("[a-zA-Z_]+[a-zA-Z0-9_]*"),
    Operator("(\\+|\\-|\\>|\\<|\\={1,2}|\\!|\\:{2})");

    private final String regex;
}

To simplify literals parsing I divided each of the literal types into a separate lexeme: Numeric, Logical and Text. For the Text literal we set a separate group ([^"]*) to grab literal value without double quotes. For the equals we declare {1,2} range. With two equal signs == we expect to get the compare operator instead of the assignment. To access a structure field we declared the double colon operator ::.

Now to find a token in our code we just iterate and filter all TokenType values applying regex on our source code. To match the beginning of the row we put the ^ meta character in the beginning of each regex expression creating a Pattern instance. The Text token will capture value without quotes into the separate group. Therefore, in order to access value without quotes we grab a value from group with index 1 if we have at least one explicit group:

1
2
3
4
5
6
7
8
for (TokenType tokenType : TokenType.values()) {
    Pattern pattern = Pattern.compile("^" + tokenType.getRegex());
    Matcher matcher = pattern.matcher(sourceCode);
    if (matcher.find()) {
        // group(1) is used to get text literal without double quotes
        String token = matcher.groupCount() > 0 ? matcher.group(1) : matcher.group();
    }
}

To store found lexemes we need to declare the following Token class with type and value fields:

1
2
3
4
5
6
@Builder
@Getter
public class Token {
    private final TokenType type;
    private final String value;
}

Now we have everything to create our lexical parser. We will receive the source code as a String in the constructor and initialize tokens List:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class LexicalParser {
    private final List<Token> tokens;
    private final String source;

    public LexicalParser(String source) {
        this.source = source;
        this.tokens = new ArrayList<>();
    }

    ...
}

In order to retrieve all tokens from the source code we need to cut the source after each found lexeme. We will declare the nextToken() method that will accept the current index of the source code as an argument and grab the next token starting after the current position:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
public class LexicalParser {
    ...

    private int nextToken(int position) {
        String nextToken = source.substring(position);
        for (TokenType tokenType : TokenType.values()) {
            Pattern pattern = Pattern.compile("^" + tokenType.getRegex());
            Matcher matcher = pattern.matcher(nextToken);
            if (matcher.find()) {
                if (tokenType != TokenType.Whitespace) {
                    // group(1) is used to get text literal without double quotes
                    String value = matcher.groupCount() > 0 ? matcher.group(1) : matcher.group();
                    Token token = Token.builder().type(tokenType).value(value).build();
                    tokens.add(token);
                }
                return matcher.group().length();
            }
        }
        throw new TokenException(String.format("invalid expression: `%s`", nextToken));
    }
}

After successful capture we create a Token instance and accumulate it in the tokens list. We will not add the Whitespace lexemes as they are only used to divide two lexemes from each other. In the end we return the found lexeme’s length.

To capture all tokens in the source we create the parse() method with the while loop increasing source position each time we catch a token:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
public class LexicalParser {
    ...

    public List<Token> parse() {
        int position = 0;
        while (position < source.length()) {
            position += nextToken(position);
        }
        return tokens;
    }

    ... 
}

3 Syntax analysis

Within our compiler model, the syntax analyzer will receive a list of tokens from the lexical analyzer and check whether this sequence can be generated by the language grammar. In the end this syntax analyzer should return an abstract syntax tree.

We will start the syntax analyzer by declaring the Expression interface:

1
2
public interface Expression {
}

This interface will be used to declare literals, variables and composite expressions with operators.

3.1 Literals

First of all, we create Expression implementations for our language’s literal types: Numeric, Text and Logical with corresponding Java types: Integer, String and Boolean. We will create the base Value class with generic type extending Comparable (it will be used later with comparison operators):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
@RequiredArgsConstructor
@Getter
public class Value<T extends Comparable<T>> implements Expression {
    private final T value;

    @Override
    public String toString() {
        return value.toString();
    }
}

public class NumericValue extends Value<Integer> {
    public NumericValue(Integer value) {
        super(value);
    }
}

public class TextValue extends Value<String> {
    public TextValue(String value) {
        super(value);
    }
}

public class LogicalValue extends Value<Boolean> {
    public LogicalValue(Boolean value) {
        super(value);
    }
}

We will also declare StructureValue for our structure instances:

1
2
3
4
5
public class StructureValue extends Value<StructureExpression> {
    public StructureValue(StructureExpression value) {
        super(value);
    }
}

StructureExpression will be covered a bit later.

3.2 Variables

Variable expression will have a single field representing its name:

1
2
3
4
5
@AllArgsConstructor
@Getter
public class VariableExpression implements Expression {
    private final String name;
}

3.3 Structures

In order to store a structure instance we will need to know the structure definition and arguments values we are passing to create an object. Argument values can signify any expressions including literals, variables and more complex expressions implementing Expression interface. Therefore, we will use Expression as values type:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
@RequiredArgsConstructor
@Getter
public class StructureExpression implements Expression {
    private final StructureDefinition definition;
    private final List<Expression> values;
}

@RequiredArgsConstructor
@Getter
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
public class StructureDefinition {
    @EqualsAndHashCode.Include
    private final String name;
    private final List<String> arguments;
}

Values can be a type of the VariableExpression. We need a way to access the variable’s value by its name. I will delegate this responsibility to the Function interface which will accept variable name and return a Value object:

1
2
3
4
5
...
public class StructureExpression implements Expression {
    ...
    private final Function<String, Value<?>> variableValue;
}

Now we can implement an interface to retrieve an argument’s value by name, it will be used to access the structure instance values. The getValue() method will be implemented a bit later:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
...
public class StructureExpression implements Expression {
       

    public Value<?> getArgumentValue(String field) {
        return IntStream
                .range(0, values.size())
                .filter(i -> definition.getArguments().get(i).equals(field))
                .mapToObj(this::getValue) //will be implemented later
                .findFirst()
                .orElse(null);
    }
}

Don’t forget our StructureExpression is used as a parameter for the StructureValue generic type which extends Comparable. Therefore, we must implement Comparable interface for our StructureExpression:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
...
public class StructureExpression implements Expression, Comparable<StructureExpression> {
    ...
    @Override
    public int compareTo(StructureExpression o) {
        for (String field : definition.getArguments()) {
            Value<?> value = getArgumentValue(field);
            Value<?> oValue = o.getArgumentValue(field);
            if (value == null && oValue == null) continue;
            if (value == null) return -1;
            if (oValue == null) return 1;
            //noinspection unchecked,rawtypes
            int result = ((Comparable) value.getValue()).compareTo(oValue.getValue());
            if (result != 0) return result;
        }
        return 0;
    }
}

We can also override the standard toString() method. It will be useful if we want to print an entire structure instance in the console:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
...
public class ObjectExpression implements Expression, Comparable<ObjectExpression> {
    ...
    @Override
    public String toString() {
        return IntStream
                .range(0, values.size())
                .mapToObj(i -> {
                    Value<?> value = getValue(i); //will be implemented later
                    String fieldName = definition.getArguments().get(i);
                    return fieldName + " = " + value;
                })
                .collect(Collectors.joining(", ", definition.getName() + " [ ", " ]"));
    }
}

3.4 Operators

In our language, we will have unary operators with 1 operand and binary operators with 2 operands.

img

Let’s implement the Expression interface for each of our operators. We declare OperatorExpression interface and create base implementations for unary and binary operators:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
public interface OperatorExpression extends Expression {
}

@RequiredArgsConstructor
@Getter
public class UnaryOperatorExpression implements OperatorExpression {
    private final Expression value;
}

@RequiredArgsConstructor
@Getter
public class BinaryOperatorExpression implements OperatorExpression {
    private final Expression left;
    private final Expression right;
}

We also need to declare an abstract calc() method for each of our operator implementations with corresponding values operands. This method will calculate the Value from operands depending on each operator’s essence. The operand’s transition from the Expression to the Value will be covered a bit later.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11

public abstract class UnaryOperatorExpression extends OperatorExpression {
    
    public abstract Value<?> calc(Value<?> value);
}


public abstract class BinaryOperatorExpression extends OperatorExpression {
    
    public abstract Value<?> calc(Value<?> left, Value<?> right);
}

After declaring base operator classes we can create more detailed implementations:

3.4.1 NOT !

This operator will only work with the LogicalValue returning the new LogicalValue instance with inverted value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class NotOperator extends UnaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> value) {
        if (value instanceof LogicalValue) {
            return new LogicalValue(!((LogicalValue) value.getValue()).getValue());
        } else {
            throw new ExecutionException(String.format("Unable to perform NOT operator for non logical value `%s`", value));
        }
    }
}

Now we will switch to the BinaryOperatorExpression implementations with two operands:

3.4.2 Addition +

The first one is addition. calc() method will make the addition of NumericValue objects. For the other Value types we will concat toString() values returning TextValue instance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
public class AdditionOperator extends BinaryOperatorExpression {
    public AdditionOperator(Expression left, Expression right) {
        super(left, right);
    }

    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (left instanceof NumericValue && right instanceof NumericValue) {
            return new NumericValue(((NumericValue) left).getValue() + ((NumericValue) right).getValue());
        } else {
            return new TextValue(left.toString() + right.toString());
        }
    }
}

3.4.3 Subtraction -

calc() will make getValue() comparison only if both values have NumericValue type. In the other case both values will be mapped to the String, removing 2nd value matches from the 1st value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class SubtractionOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (left instanceof NumericValue && right instanceof NumericValue) {
            return new NumericValue(((NumericValue) left).getValue() - ((NumericValue) right).getValue());
        } else {
            return new TextValue(left.toString().replaceAll(right.toString(), ""));
        }
    }
}

3.4.4 Equals ==

calc() will make getValue() comparison only if both values have the same type. In the other case both values will be mapped to the String:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
public class EqualsOperator extends BinaryOperatorExpression {
   ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        boolean result;
        if (Objects.equals(left.getClass(), right.getClass())) {
            result = ((Comparable) left.getValue()).compareTo(right.getValue()) == 0;
        } else {
            result = ((Comparable) left.toString()).compareTo(right.toString()) == 0;
        }
        return new LogicalValue(result);
    }
}

3.4.5 Less than < and greater than >

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
public class LessThanOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        boolean result;
        if (Objects.equals(left.getClass(), right.getClass())) {
            result = ((Comparable) left.getValue()).compareTo(right.getValue()) < 0;
        } else {
            result = left.toString().compareTo(right.toString()) < 0;
        }
        return new LogicalValue(result);
    }
}

public class GreaterThanOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        boolean result;
        if (Objects.equals(left.getClass(), right.getClass())) {
            result = ((Comparable) left.getValue()).compareTo(right.getValue()) > 0;
        } else {
            result = left.toString().compareTo(right.toString()) > 0;
        }
        return new LogicalValue(result);
    }
}

3.4.6 Structure value operator ::

To read a structure argument’s value we expect to receive the left value as a StructureValue type:

1
2
3
4
5
6
7
8
9
public class StructureValueOperator extends BinaryOperatorExpression {
    ...
    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (left instanceof StructureValue)
            return ((StructureValue) left).getValue().getArgumentValue(right.toString());
        return left;
    }
}

3.5 Value evaluation

If we want to pass variables or more complex Expression implementations to the operators including operators itself, we need a way to transform Expression object to the Value. To do this we declare evaluate() method in our Expression interface:

1
2
3
public interface Expression {
    Value<?> evaluate();
}

Value class will just return this instance:

1
2
3
4
5
6
7
public class Value<T extends Comparable<T>> implements Expression {
    ...
    @Override
    public Value<?> evaluate() {
        return this;
    }
}

To implement evaluate() for the VariableExpression first we must provide an ability to get Value by variable’s name. We delegate this work to the Function field which will accept variable name and return corresponding Value object:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
...
public class VariableExpression implements Expression {
    ...
    private final Function<String, Value<?>> variableValue;

    @Override
    public Value<?> evaluate() {
        Value<?> value = variableValue.apply(name);
        if (value == null) {
            return new TextValue(name);
        }
        return value;
    }
}

To evaluate the StructureExpression we will create a StructureValue instance. We need also implement the missing getValue() method that will accept an argument index and return value depending on the Expression type:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
...	
public class StructureExpression implements Expression, Comparable<StructureExpression> {
    ...
    @Override
    public Value<?> evaluate() {
        return new StructureValue(this);
    }

    private Value<?> getValue(int index) {
        Expression expression = values.get(index);
        if (expression instanceof VariableExpression) {
            return variableValue.apply(((VariableExpression) expression).getName());
        } else {
            return expression.evaluate(); 
        }
    }
}

For the operator expressions we evaluate operand values and pass them to the calc() method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
public abstract class UnaryOperatorExpression extends OperatorExpression {
    ...
    @Override
    public Value<?> evaluate() {
        return calc(getValue().evaluate());
    }
}

public abstract class BinaryOperatorExpression extends OperatorExpression {
    ...
    @Override
    public Value<?> evaluate() {
        return calc(getLeft().evaluate(), getRight().evaluate());
    }
}

3.6 Statements

Before we start creating our syntax analyzer we need to introduce a model for our language’s statements. Let’s start with the Statement interface with the execute() proceeding method which will be called during code execution:

1
2
3
public interface Statement {
    void execute();
}

3.6.1 Print statement

The first statement we will implement is the PrintStatement. This statement needs to know the expression to print. Our language will support printing literals, variables and other expressions implementing Expression which will be evaluated during execution to calculate the Value object:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
@AllArgsConstructor
@Getter
public class PrintStatement implements Statement {
    private final Expression expression;

    @Override
    public void execute() {
        Value<?> value = expression.evaluate();
        System.out.println(value);
    }
}

3.6.2 Assign statement

To declare an assignment we need to know the variable’s name and the expression we want to assign. Our language will be able to assign literals, variables and more complex expressions implementing Expression interface:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
@AllArgsConstructor
@Getter
public class AssignStatement implements Statement {
    private final String name;
    private final Expression expression;

    @Override
    public void execute() {
        ...
    }
}

During the execution we need to store the variables values somewhere. Let’s delegate this logic to the BiConsumer field which will pass a variable name and the assigning value. Note that Expression value needs to be evaluated while doing an assignment:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
...
public class AssignStatement implements Statement {
    ...
    private final BiConsumer<String, Value<?>> variableSetter;

    @Override
    public void execute() {
        variableSetter.accept(name, expression.evaluate());
    }
}

3.6.3 Input statement

In order to read an expression from the console we need to know the variable name to assign value to. During the execute() we must read a line from the console. This work can be delegated to the Supplier, thus we will not create multiple input streams. After reading the line we parse it to the corresponding Value object and delegate assignment to the BiConsumer instance as we did for the AssignStatement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
@AllArgsConstructor
@Getter
public class InputStatement implements Statement {
    private final String name;
    private final Supplier<String> consoleSupplier;
    private final BiConsumer<String, Value<?>> variableSetter;

    @Override
    public void execute() {
        //just a way to tell the user in which variable the entered value will be assigned
        System.out.printf("enter \"%s\" >>> ", name.replace("_", " "));
        String line = consoleSupplier.get();

        Value<?> value;
        if (line.matches("[0-9]+")) {
            value = new NumericValue(Integer.parseInt(line));
        } else if (line.matches("true|false")) {
            value = new LogicalValue(Boolean.valueOf(line));
        } else {
            value = new TextValue(line);
        }

        variableSetter.accept(name, value);
    }
}

3.6.4 Condition statement

First, we will introduce the CompositeStatement class which will contain internal list of statements to execute:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
@Getter
public class CompositeStatement implements Statement {
    private final List<Statement> statements2Execute = new ArrayList<>();

    public void addStatement(Statement statement) {
        if (statement != null)
            statements2Execute.add(statement);
    }

    @Override
    public void execute() {
        statements2Execute.forEach(Statement::execute);
    }
}

This class can be reused later in case we create composite statements construction. First of the constructions will be the Condition statement. To describe the condition we can use literals, variables and more complex constructions implementing the Expression interface. In the end during execution we calculate the value of the Expression and make sure the value is Logical and the result inside is truthy. Only in this case we can perform inner statements:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
@RequiredArgsConstructor
@Getter
public class ConditionStatement extends CompositeStatement {
    private final Expression condition;

    @Override
    public void execute() {
        Value<?> value = condition.evaluate();
        if (value instanceof LogicalValue) {
            if (((LogicalValue) value).getValue()) {
                super.execute();
            }
        } else {
            throw new ExecutionException(String.format("Cannot compare non logical value `%s`", value));
        }
    }
}

3.7 Statement parser

Now we have the statements to work with our lexemes, we can now build the StatementParser. Inside the class we declare a list of tokens and mutable tokens’ position variable:

1
2
3
4
5
6
7
8
9
public class StatementParser {
    private final List<Token> tokens;
    private int position;

    public StatementParser(List<Token> tokens) {
        this.tokens = tokens;
    }
    ...
}

Then we create the parseExpression() method which will handle provided tokens and return a full-fledged statement:

1
2
3
public Statement parseExpression() {
	...
}

Our language expressions can only begin with declaring a variable or a keyword. Therefore, first we need to read a token and validate that it has the Variable or the Keyword token type. To handle this we declare a separate next() method with token types varargs as an argument which will validate that the next token has the same type. In truthy case we increment StatementParser’s position field and return the found token:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
private Statement parseExpression() {
      Token token = next(TokenType.Keyword, TokenType.Variable);
	...
}

private Token next(TokenType type, TokenType... types) {
    TokenType[] tokenTypes = org.apache.commons.lang3.ArrayUtils.add(types, type);
    if (position < tokens.size()) {
        Token token = tokens.get(position);
        if (Stream.of(tokenTypes).anyMatch(t -> t == token.getType())) {
            position++;
            return token;
        }
    }
    Token previousToken = tokens.get(position - 1);
    throw new SyntaxException(String.format("After `%s` declaration expected any of the following lexemes `%s`", previousToken, Arrays.toString(tokenTypes)));
}

After we have found the appropriate token, we can build our statement depending on the token type.

3.7.1 Variable

Each variable assignment starts with the Variable token type, which we already did read. The next token we expect is the equals = operator. You can override the next() method that will accept TokenType with String operator representation (e.g.=) and return the next found token only if it has the same type and value as requested:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
public Statement parseExpression() {
    Token token = next(TokenType.Keyword, TokenType.Variable);
    switch (token.getType()) {
        case Variable:
            next(TokenType.Operator, "="); //skip equals

            Expression value;
            if (peek(TokenType.Keyword, "new")) {
                value = readInstance();
            } else {
                value = readExpression();
            }

            return new AssignStatement(token.getValue(), value, variables::put);
    }
    ...
}

After the equals operator, we read an expression. An expression can start with the new keyword when we instantiate a structure. We will cover this case a bit later. To consume a token with intent to validate its value without incrementing position and throwing an error we will create an additional peek() method that will return true value if the next token is the same as we expect:

1
2
3
4
5
6
7
8
9
...
private boolean peek(TokenType type, String value) {
    if (position < tokens.size()) {
        Token token = tokens.get(position);
        return type == token.getType() && token.getValue().equals(value);
    }
    return false;
}
...

To create an AssignStatement instance we need to pass a BiConsumer object to the constructor. Let’s go to the StatementParser fields declaration and add new Map variables field which will store variable name as a key and variable value as a value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public class StatementParser {
    ...
    private final Map<String, Value<?>> variables;

    public StatementParser(List<Token> tokens) {
        ...
        this.variables = new HashMap<>();
    }
    ...
}

Variable Expression

image

First, we will cover the plain variable expression reading that can be a literal, another variable or a more complex expression with operators. Therefore, the readExpression() method will return the Expression with the corresponding implementations. Inside this method, we declare the left Expression instance. An expression can start only with a literal or with another variable. Thus, we expect to get the Variable, Numeric, Logical or Text token type from our next() method. After reading a proper token we transform it to the appropriate Expression object inside the following switch block and assign it to the left Expression variable:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
private Expression readExpression() {
    Expression left;
    Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
    String value = token.getValue();
    switch (token.getType()) {
        case Numeric:
            left = new NumericValue(Integer.parseInt(value));
        case Logical:
            left = new LogicalValue(Boolean.valueOf(value));
        case Text:
            left = new TextValue(value);
        case Variable:
        default:
            left = new VariableExpression(value, variables::get);
    }
    ...
}

After left variable or literal has been read we expect to catch an Operator lexeme:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
private Expression readExpression() {
    Expression left;
    Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
    String value = token.getValue();
    switch (token.getType()) {
        ...
    }

    Token operation = next(TokenType.Operator);
    Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
    ...
}

Then we map our Operator lexeme to the OperatorExpression object. To do it you can use a switch block returning proper OperatorExpression class depending on the token value. I will use enum constants with an appropriate OperatorExpression type:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
@RequiredArgsConstructor
@Getter
public enum Operator {
    Not("!", NotOperator.class),
    Addition("+", AdditionOperator.class),
    Subtraction("-", SubtractionOperator.class),
    Equality("==", EqualsOperator.class),
    GreaterThan(">", GreaterThanOperator.class),
    LessThan("<", LessThanOperator.class),
    StructureValue("::", StructureValueOperator.class);

    private final String character;
    private final Class<? extends OperatorExpression> operatorType;

    public static Class<? extends OperatorExpression> getType(String character) {
        return Arrays.stream(values())
                .filter(t -> Objects.equals(t.getCharacter(), character))
                .map(Operator::getOperatorType)
                .findAny().orElse(null);
    }
}

In the end we read the right literal or variable as we did for the left Expression. In order to not duplicate code we extract reading Expression into a separate nextExpression() method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
private Expression nextExpression() {
    Token token = next(TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text);
    String value = token.getValue();
    switch (token.getType()) {
        case Numeric:
            return new NumericValue(Integer.parseInt(value));
        case Logical:
            return new LogicalValue(Boolean.valueOf(value));
        case Text:
            return new TextValue(value);
        case Variable:
        default:
            return new VariableExpression(value, variables::get);
    }
}

Let’s refactor the readExpression() method and read the right lexeme using nextExpression(). But before we read the right lexeme we need to be sure that our operator supports two operands. We can check if our operator extends the BinaryOperatorExpression. In the other case if the operator is unary we create an OperatorExpression only using the left expression. To create an OperatorExpression object we retrieve suitable constructor for unary or binary operator implementation and then create an instance with obtained earlier expressions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
@SneakyThrows
private Expression readExpression() {
    Expression left = nextExpression();

    Token operation = next(TokenType.Operator);
    Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());

    if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        Expression right = nextExpression();
        return operatorType
                .getConstructor(Expression.class, Expression.class)
                .newInstance(left, right);
    } else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        return operatorType
                .getConstructor(Expression.class)
                .newInstance(left);
    }

    return left;
}

Additionally, we can provide an opportunity to create a long-expression with multiple operators or without operators at all with only one literal or variable. Let’s enclose the operation read into the while loop with the condition that we have an Operator as the next lexeme. Each time we create an OperatorExpression, we assign it to the left expression thus creating a tree of subsequent operators inside one Expression until we read the whole expression. In the end we return the left expression:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
@SneakyThrows
private Expression readExpression() {
    Expression left = nextExpression();

    //recursively read an expression
    while (peek(TokenType.Operator)) {
        Token operation = next(TokenType.Operator);
        Class<? extends OperatorExpression> operatorType = Operator.getType(operation.getValue());
        if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
            Expression right = nextExpression();
            left = operatorType
                    .getConstructor(Expression.class, Expression.class)
                    .newInstance(left, right);
        } else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
            left = operatorType
                    .getConstructor(Expression.class)
                    .newInstance(left);
        }
    }

    return left;
}

private boolean peek(TokenType type) {
    if (position < tokens.size()) {
        Token token = tokens.get(position);
        return token.getType() == type;
    }
    return false;
}

Structure instance

After we completed the readExpression() implementation we can go back to the parseExpression() and finish the readInstance() implementation to instantiate a structure instance:

image

According to our language’s semantics we know that our structure instantiation starts with the new keyword, we can skip the next token by calling the next() method. The next lexeme will signify the structure name, we read it as the Variable token type. After the structure name we expect to receive arguments inside square brackets as group dividers. In some cases our structure can be created without any arguments at all. Therefore, we use peek() first. Each passing argument to the structure can mean an expression, thus we call readExpression() and pass the result to the arguments list. After building structure arguments we can build our StructureExpression preliminarily retrieving appropriate StructureDefinition:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
private Expression readInstance() {
    next(TokenType.Keyword, "new"); //skip new

    Token type = next(TokenType.Variable);

    List<Expression> arguments = new ArrayList<>();

    if (peek(TokenType.GroupDivider, "[")) {

        next(TokenType.GroupDivider, "["); //skip open square bracket

        while (!peek(TokenType.GroupDivider, "]")) {
            Expression value = readExpression();
            arguments.add(value);
        }

        next(TokenType.GroupDivider, "]"); //skip close square bracket
    }

    StructureDefinition definition = structures.get(type.getValue());
    if (definition == null) {
        throw new SyntaxException(String.format("Structure is not defined: %s", type.getValue()));
    }
    return new StructureExpression(definition, arguments, variables::get);
}

To retrieve the StructureDefinition by name we must declare the structures map as StatementParser’s field, we will fill it later during struct keyword analysis:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public class StatementParser {
    ...
    private final Map<String, StructureDefinition> structures;

    public StatementParser(List<Token> tokens) {
	  ...
        this.structures = new HashMap<>();
    }
    ...
}

3.7.2 Keyword

Now we can continue working on the parseExpression() method when we receive the Keyword lexeme:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
public Statement parseExpression() {
    ...
    switch (token.getType()) {
        case Variable:
        ...
        case Keyword:
            switch (token.getValue()) {
                case "print":
                case "input":
                case "if":
                case "struct":
            }
    ...
    }
}

Our language expression can start with print, input, if and struct keywords.

Print

To read the printing value we call the already created readExpression() method. It will read a literal, variable or more complex Expression implementation as we did for the variable assignment. Then we create and return an instance of the PrintStatement:

1
2
3
4
5
...
case "print":
    Expression expression = readExpression();
    return new PrintStatement(expression);
...

Input

For the input statement, we need to know the variable name we assign value to. Therefore, we ask the next() method to catch the next Variable token for us:

1
2
3
4
5
...
case "input":
    Token variable = next(TokenType.Variable);
    return new InputStatement(variable.getValue(), scanner::nextLine, variables::put);
...

To create an InputStatement instance we introduce a Scanner object in the StatementParser fields declaration which will help us to read a line from the console:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public class StatementParser {
    ...
    private final Scanner scanner;

    public StatementParser(List<Token> tokens) {
	  ...
        this.scanner = new Scanner(System.in);
    }
    ...
}

If

The next statement we will touch is if/then condition:

image

Firstly, when we receive the if keyword, we read the condition expression by calling readExpression(). Then according to our language semantics, we need to catch the then keyword. Inside our condition we can declare other statements including other condition statements. Therefore, we recursively add parseExpression() to the condition’s inner statements until we will read the finalizing end keyword:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
...  
case "if":
    Expression condition = readExpression();
    next(TokenType.Keyword, "then"); //skip then

    ConditionStatement conditionStatement = new ConditionStatement(condition);
    while (!peek(TokenType.Keyword, "end")) {
        Statement statement = parseExpression();
        conditionStatement.addStatement(statement);
    }
    next(TokenType.Keyword, "end"); //skip end

    return conditionStatement;
...

Struct

image

Structure declaration is quite simple because it consists only of Variable and Keyword token types. After reading the struct keyword we expect to read the Structure name as Variable token type. Then we read arguments until we end up with the end keyword. In the end we build our StructureDefinition and put it in the structures map to be accessed in the future when we’ll create a structure instance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
...  
case "struct":
    Token type = next(TokenType.Variable);
    
    Set<String> args = new HashSet<>();
    while (!peek(TokenType.Keyword, "end")) {
        next(TokenType.Keyword, "arg");
    
        Token arg = next(TokenType.Variable);
        args.add(arg.getValue());
    }
    next(TokenType.Keyword, "end"); //skip end
    
    structures.put(type.getValue(), new StructureDefinition(type.getValue(), new ArrayList<>(args)));
    
    return null;
... 

The complete parseExpression() method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
...
public Statement parseExpression() {
    Token token = next(TokenType.Keyword, TokenType.Variable);
    switch (token.getType()) {
        case Variable:
            next(TokenType.Operator, "="); //skip equals

            Expression value;
            if (peek(TokenType.Keyword, "new")) {
                value = readInstance();
            } else {
                value = readExpression();
            }

            return new AssignStatement(token.getValue(), value, variables::put);
        case Keyword:
            switch (token.getValue()) {
                case "print":
                    Expression expression = readExpression();
                    return new PrintStatement(expression);
                case "input":
                    Token variable = next(TokenType.Variable);
                    return new InputStatement(variable.getValue(), scanner::nextLine, variables::put);
                case "if":
                    Expression condition = readExpression();
                    next(TokenType.Keyword, "then"); //skip then

                    ConditionStatement conditionStatement = new ConditionStatement(condition);
                    while (!peek(TokenType.Keyword, "end")) {
                        Statement statement = parseExpression();
                        conditionStatement.addStatement(statement);
                    }
                    next(TokenType.Keyword, "end"); //skip end

                    return conditionStatement;
                case "struct":
                    Token type = next(TokenType.Variable);

                    Set<String> args = new HashSet<>();
                    while (!peek(TokenType.Keyword, "end")) {
                        next(TokenType.Keyword, "arg");

                        Token arg = next(TokenType.Variable);
                        args.add(arg.getValue());
                    }
                    next(TokenType.Keyword, "end"); //skip end

                    structures.put(type.getValue(), new StructureDefinition(type.getValue(), new ArrayList<>(args)));

                    return null;
            }
        default:
            throw new SyntaxException(String.format("Statement can't start with the following lexeme `%s`", token));
    }
}
...

To find and accumulate all statements we create parse() method that will parse all expressions from the given tokens and return CompositeStatement instance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
...
public Statement parse() {
    CompositeStatement root = new CompositeStatement();
    while (position < tokens.size()) {
        Statement statement = parseExpression();
        root.addStatement(statement);
    }
    return root;
}
...

4 ToyLanguage

We finished with the lexical and syntax analyzer. Now we can gather both implementations into the ToyLanguage class and finally run our language:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class ToyLanguage {
    @SneakyThrows
    public void execute(Path path) {
        String source = Files.readString(path);
        LexicalParser lexicalParser = new LexicalParser(source);
        List<Token> tokens = lexicalParser.parse();
        StatementParser statementParser = new StatementParser(tokens);
        Statement statement = statementParser.parse();
        statement.execute();
    }
}

image

5 Wrapping Up

In this tutorial, we built our own language with lexical and syntax analysis. I hope this article will be useful to someone. I highly recommend you to try writing your own language, despite the fact that you have to understand a lot of implementation details. This is a learning, self-improving, and interesting experiment!

Part 2


Building Your Own Programming Language From Scratch: Part II - Dijkstra’s Two-Stack Algorithm

1 Introduction

In this article we will improve mathematical expressions evaluation for our programming language (please see Part I) using Dijkstra’s Two-Stack algorithm. The essence of this algorithm is that any complex expression ultimately comes down to the simple expression. In the end, It will be a unary or binary operation on one/two operands.

How Dijkstra’s Two-Stack algorithm works:

  • We iterate tokens expression
  • If our token is an operand (e.g. number), we push it into the operands stack
  • If we find an operator, we push into the operators stack
  • We pop operators with the higher priority than the current one and apply it on the top operand(s)
  • If an opening parenthesis is found, we push it into the operands stack.
  • If a closing parenthesis is found, pop all operators and apply it on the top operand(s) until the opening parenthesis will be reached, and finally pop the opening parenthesis.

Example

Let’s suppose we have the following expression: 2 * (3 + 4) - 5. When we calculate this expression first we summarize 3 and 4 in the parentheses. After that, we multiply the result by 2. And in the end we subtract 5 from the result. Down below is the expression evaluation using Dijkstra’s Two-Stack algorithm:

  1. Push 2 to the operands stack:

image

  1. Push * to operators:

image

  1. Push ( to operands:

image

  1. Push 3 to operands:

image

  1. Push + to operators:

image

  1. Push 4 to operands:

image

  1. Push ) to operators. Closing parenthesis pops an operation until a nearest opening parenthesis (. Therefore, we pop top operator and apply it on the two top operands:

image

  1. Push - to operators. The operators stack’s top element * has greater precedence than the current operator -. Therefore, first we apply top operator on two top operands:

image

  1. Push 5 to operands:

image

  1. Pop left - operator:

image

2 Operators

Let’s dive into the code. First, we will update our Operator enum including operators precedence and introduce a few new operators:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
@RequiredArgsConstructor
@Getter
public enum Operator {
    Not("!", NotOperator.class, 5),
    StructureValue("::", StructureValueOperator.class, 5),

    Addition("+", AdditionOperator.class, 3),
    Subtraction("-", SubtractionOperator.class, 3),

    LessThan("<", LessThanOperator.class, 2),
    GreaterThan(">", GreaterThanOperator.class, 2);

    private final String character;
    private final Class<? extends OperatorExpression> type;
    private final Integer precedence;

    Operator(String character, Integer precedence) {
        this(character, null, precedence);
    }

    public static Operator getType(String character) {
        return Arrays.stream(values())
                .filter(t -> Objects.equals(t.getCharacter(), character))
                .findAny().orElse(null);
    }

    public boolean greaterThan(Operator o) {
        return getPrecedence().compareTo(o.getPrecedence()) > 0;
    }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
public class StatementParser {
    ...
    private Expression readExpression() {
        ...
        while (peek(TokenType.Operator)) {
            Token operation = next(TokenType.Operator);
            Operator operator = Operator.getType(operation.getValue());
            Class<? extends OperatorExpression> operatorType = operator.getType();
            ...
        }

        ...
    }
    ...
}

2.1 StructureInstance

We will no longer mark new as the keyword lexeme, it will be the operator which will help us to instantiate a structure object inside a more complex expression:

1
2
3
4
5
6
public enum TokenType {
	...
	Keyword("(if|then|end|print|input|struct|arg)"),
	Operator("(\\+|\\-|\\>|\\<|\\={1,2}|\\!|\\:{2}|new)");
    ...
}

We set it the highest priority:

1
2
3
4
5
public enum Operator {
	...
    StructureInstance("new", StructureInstanceOperator.class, 5),
	...
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public class StructureInstanceOperator extends UnaryOperatorExpression {
    public StructureInstanceOperator(Expression value) {
        super(value);
    }

    @Override
    public Value<?> calc(Value<?> value) {
        return value; //will return toString() value
    }
}

2.2 Multiplication & Division

We will also include two more operators with higher priority than addition and subtraction. Don’t forget to add regex expression for each operator to be counted when we’re doing lexical analysis:

1
2
3
4
5
public enum TokenType {
	...
	Operator("(\\+|\\-|\\>|\\<|\\={1,2}|\\!|\\:{2}|new|\\*|\\/)");
    ...
}
1
2
3
4
5
6
public enum Operator {
    ...
    Multiplication("*", MultiplicationOperator.class, 4),
    Division("/", DivisionOperator.class, 4),
    ...
}

2.3 LeftParen & RightParen

These operators will be used to change operator’s precedence by surrounding expression in the parentheses. We will not create OperatorExpression implementations as we make an expression calculation using parentheses:

1
2
3
4
5
public enum TokenType {
	...
	Operator("(\\+|\\-|\\>|\\<|\\={1,2}|\\!|\\:{2}|new|\\*|\\/|\\(|\\))");
    ...
}
1
2
3
4
5
6
public enum Operator {
    ...
    LeftParen("(", 1),
    RightParen(")", 1),
    ...
}

2.4 Assignment

We also need to introduce a new assignment operator with a single equal = sign. The variable value will be assigned to the variable only when we complete evaluation of the right expression. Therefore, this operator should have the lowest priority:

1
2
3
4
5
public enum Operator {
	...
    Assignment("=", AssignmentOperator.class, 6),
	...
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
public class AssignOperator extends BinaryOperatorExpression {
    public AssignOperator(Expression left, Expression right) {
        super(left, right);
    }

    @Override
    public Value<?> calc(Value<?> left, Value<?> right) {
        if (getLeft() instanceof VariableExpression) {
            ((VariableExpression) getLeft()).setValue(right);
        }
        return left;
    }
}

During the execution we expect to retrieve the left expression as a VariableExpression - the variable we’re assigning value to. Thus, we introduce a new method in our VariableExpression to set the variable value. As you may remember our variables Map is being stored in the StatementParser. To set access it and set a new value we introduce a BiConsumer instance, passing variable name as a key and variable value as a new value:

1
2
3
4
5
6
7
8
9
public class VariableExpression implements Expression {
    ...
    private final BiConsumer<String, Value<?>> setter;
    ...

    public void setValue(Value<?> value) {
        setter.accept(name, value);
    }
}
1
2
3
4
5
6
7
8
public class StatementParser {
	...
    private Expression nextExpression() {
        ...
        return new VariableExpression(value, variables::get, variables::put);
    }
	...
}

2.5 Expression dividers

2.5.1 End of the row

Previously, to build a mathematical expression we expected to obtain an operator after each operand:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
Expression left = nextExpression();

//recursively read an expression
while (peek(TokenType.Operator)) {
    Token operation = next(TokenType.Operator);
    Operator operator = Operator.getType(operation.getValue());
    Class<? extends OperatorExpression> operatorType = operator.getType();
    if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        Expression right = nextExpression();
        left = operatorType
                .getConstructor(Expression.class, Expression.class)
                .newInstance(left, right);
    } else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        left = operatorType
                .getConstructor(Expression.class)
                .newInstance(left);
    }
}

With the current behavior we can expect two subsequent operators in the row, e.g. addition of structure instantiation as an right operand or two following opening parentheses:

1
2
3
4
5
value + new Car [E30 325i] :: model

or

( ( 1 + 2) * 3) - 4

To read an entire expression with any sequence of operands and operators we will introduce the LineBreak lexeme which will help us to catch all expression tokens until we reach end of the row:

1
2
3
4
5
public class StatementParser {
    LineBreak("[\\n\\r]"),
    Whitespace("[\\s\\t]"),
    ...
}

2.5.2 Structure arguments

The second divider we need to count is when we instantiate a structure object:

1
new Car [ E30 325i new Engine [M20B25] :: model ]

This structure instantiation is quite complex for the StatementParser to divide each passing argument into separate expression. To deal with it we can introduce new comma GroupDivider:

1
2
3
4
5
public enum TokenType {
	...
   	GroupDivider("(\\[|\\]|\\,)"),
	...
}
1
new Car [ E30, 325i, new Engine [M20B25] :: model ]

3 Expression evaluation

Now we can start refactoring our expression evaluation in the StatementParser using Dijkstra’s Two-Stack algorithm.

3.1 Next token

Firstly, we will introduce skipLineBreaks() method, that will increment tokens position while we reach LineBreak token:

1
2
3
4
private void skipLineBreaks() {
    while (tokens.get(position).getType() == TokenType.LineBreak 
            && ++position != tokens.size()) ;
}

It will be used on the top of each next() and peek() methods to escape catching LineBreak token:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
private Token next(TokenType type, TokenType... types) {
    skipLineBreaks();
    ...
}

private Token next(TokenType type, String value) {
    skipLineBreaks();
    ...
}

private boolean peek(TokenType type, String content) {
    skipLineBreaks();
    ...
}

private boolean peek(TokenType type) {
    skipLineBreaks();
    ...
}

We will also override the next() method without any arguments just returning the next token:

1
2
3
4
private Token next() {
    skipLineBreaks();
    return tokens.get(position++);
}

3.2 ExpressionReader

To read and calculate an entire expression using Dijkstra’s Two-Stack algorithm we will need two stacks, the operators stack, and the operands stack. The first stack of operators will contain the Operator enum types. The second stack of operands will have Expression type, it will help us to store every type of expression including values, variables and composite operators. It will be clearer if we perform expression reading in a separate inner class ExpressionReader:

1
2
3
4
5
6
7
8
9
private class ExpressionReader {
    private final Stack<Expression> operands;
    private final Stack<Operator> operators;

    private ExpressionReader() {
        this.operands = new Stack<>();
        this.operators = new Stack<>();
    }
}

Then we create a readExpression() method with a while loop to read an entire expression. Our mathematical expression can begin with Operator, Variable, Numeric , Logical or Text token types. In this case we can’t exclude the LineBreak token during token retrieval because this token means the end of the expression. Therefore, we introduce an additional peek() method inside our inner class that will read an expression until we reach the end of the row:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
private class ExpressionReader {
    ...
    private Expression readExpression() {
		while (peek(TokenType.Operator, TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Text)) {
 	        Token token = next();
            switch (token.getType()) {
               case Operator:
			       ...
			    default:
			        ...
            }
        }
        ...
    }

    private boolean peek(TokenType type, TokenType... types) {
        TokenType[] tokenTypes = ArrayUtils.add(types, type);
        if (position < tokens.size()) {
            Token token = tokens.get(position);
            return Stream.of(tokenTypes).anyMatch(t -> t == token.getType());
        }
        return false;
    }
}

3.2.1 Operator

The first lexeme type we will process is Operator. We retrieve the Operator enum by token’s value and use it in the switch block. There are 3 cases we will deal with:

  1. LeftParen - just put an operator into the operators stack
  2. RightParen - calculate previously pushed expressions until we reach the left parenthesis LeftParen.
  3. Before inserting other operators we need to be sure that the current operator has greater precedence than the top operator, in the negative case we pop the top operator and apply it on the top operand(s). In the end we push the less prioritized operator into the operators stack.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
...
case Operator:
    Operator operator = Operator.getType(token.getValue());
    switch (operator) {
        case LeftParen:
            operators.push(operator);
            break;
        case RightParen:
            //until left parenthesis is not reached
            while (!operators.empty() && operators.peek() != Operator.LeftParen)
                applyTopOperator();
            operators.pop(); //pop left parenthesis
            break;
        default:
            //until top operator has greater precedence
            while (!operators.isEmpty() && operators.peek().greaterThan(operator))
                applyTopOperator();
            operators.push(operator); // finally, add less prioritized operator
    }
    break;
...

To apply the top operator on the top operand(s) we create an additional applyTopOperator() method. First we pop an operator and the first operand. Then if our operator is binary we pop the second operand. To create an OperatorExpression object we retrieve a suitable constructor for unary or binary operator implementation and make an instance with earlier popped operands. In the end we push the OperatorExpression instance to the operands stack. Thus, this pushed operand Expression can be used later by other operators inside a more composite expression:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
@SneakyThrows
private void applyTopOperator() {
    // e.g. a + b
    Operator operator = operators.pop();
    Class<? extends OperatorExpression> operatorType = operator.getType();
    Expression left = operands.pop();
    if (BinaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        Expression right = operands.pop();
        operands.push(operatorType
                .getConstructor(Expression.class, Expression.class)
                .newInstance(right, left));
    } else if (UnaryOperatorExpression.class.isAssignableFrom(operatorType)) {
        // e.g. new Instance []
        operands.push(operatorType
                .getConstructor(Expression.class)
                .newInstance(left));
    } else {
        throw new SyntaxException(String.format("Operator `%s` is not supported", operatorType));
    }
}

3.2.2 Value & VariableExpression

If our token is a variable or a literal, we transform it to the appropriate Expression object inside the following switch block and then push the result to the operands stack:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
...
default:
    String value = token.getValue();
    Expression operand;
    switch (token.getType()) {
        case Numeric:
            operand = new NumericValue(Integer.parseInt(value));
            break;
        case Logical:
            operand = new LogicalValue(Boolean.valueOf(value));
            break;
        case Text:
            operand = new TextValue(value);
            break;
        case Variable:
        default:
            if (!operators.isEmpty() && operators.peek() == Operator.StructureInstance) {
                operand = readInstance(token);
            } else {
                operand = new VariableExpression(value, variables::get, variables::put);
            }
    }
    operands.push(operand);
...

A variable token can also indicate the start of the structure instantiation. In this case we read an entire structure expression with arguments inside square brackets and only then assign it to the StructureExpression value. We will move existent readInstance() method from StatementParser class into the inner ExpressionReader class with following changes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
...
private Expression readInstance(Token token) {
    StructureDefinition definition = structures.get(token.getValue());

    List<Expression> arguments = new ArrayList<>();
    if (StatementParser.this.peek(TokenType.GroupDivider, "[")) {

        next(TokenType.GroupDivider, "["); //skip open square bracket

        while (!StatementParser.this.peek(TokenType.GroupDivider, "]")) {
            Expression value = new ExpressionReader().readExpression();
            arguments.add(value);

            if (StatementParser.this.peek(TokenType.GroupDivider, ","))
                next();
        }

        next(TokenType.GroupDivider, "]"); //skip close square bracket
    }

    if (definition == null) {
        throw new SyntaxException(String.format("Structure is not defined: %s", token.getValue()));
    }
    return new StructureExpression(definition, arguments, variables::get);
}
...

3.2.3 Evaluation

After we finished processing expression tokens and reached the end of the expression we will need an extra while loop to pop remaining operators and apply it on the top operand(s). In the end we should have the only one resulting composite operand in the operands stack, we return it as a final expression result:

1
2
3
4
5
6
7
8
9
private Expression readExpression() {
    ...

    while (!operators.isEmpty()) {
        applyTopOperator();
    }

    return operands.pop();
}

3.3 AssignStatement

Now we can now build AssignmentStatement in the parseExpression() method. The AssignStatement can start with a Variable or an Operator token type (in case we have a new operator for the structure instantiation). When we read an expression we expect to read an entire Assignment expression accumulating left value as a variable name and right expression as a variable value. Therefore, we go to one tokens position back and start reading the expression using ExpressionReader starting from variable’s name declaration:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
private Statement parseExpression() {
    Token token = next(TokenType.Keyword, TokenType.Variable, TokenType.Operator);
    switch (token.getType()) {
        case Variable:
        case Operator:
            position--;
            Expression value = new ExpressionReader().readExpression();
		    ...
        case Keyword:
        default:
            ...
    }
}

In the end we expect to obtain the complete AssignmentOperator and we can create an AssignmentStatement instance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
...
case Variable:
case Operator:
    position--;
    Expression value = new ExpressionReader().readExpression();

    if (value instanceof AssignmentOperator) {
        VariableExpression variable = (VariableExpression) ((AssignmentOperator) value).getLeft();
        return new AssignmentStatement(variable.getName(), ((AssignmentOperator) value).getRight(), variables::put);
    } else {
        throw new SyntaxException(String.format("Unsupported statement: `%s`", value));
    }
...

The last thing we need to do is to update expressions reading for the input and if statements using our new inner ExpressionReader class. It will allow us to read complex expression inside print and if statements:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
...
case Keyword:
default:
    switch (token.getValue()) {
        case "print":
            Expression expression = new ExpressionReader().readExpression();
            ...
        case "if":
            Expression condition = new ExpressionReader().readExpression();
		    ...
    }
...

4 Wrapping up

In this article, we improved our own programming language with mathematical expressions evaluation using Dijkstra’s Two-Stack algorithm. We injected operators priority with an ability to change it using parentheses, including complex structure instantiations. The full source code is available over on GitHub.

Part 3


Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads

img

In this part of creating your own programming language, we will improve our lexical analysis with Regex lookaheads and lookbehinds (please check out Part I and Part II).

Basically, lookaheads and lookbehinds allow you to create your own ^ and $ sub-expressions (anchors). With their help, you can set up a condition that will be met at the beginning and at the end of the line or won’t, and this condition will not be a part of the “matched” expression.

Lookbehind can “look” back and should be placed at the beginning of the regular expression. Lookahead “looks” forward and accordingly should be placed at the end.

There are four types of lookaheads and lookbehinds with corresponding regex syntax:

  1. Positive lookahead: (?=pattern)
  2. Negative lookahead (?!pattern)
  3. Positive lookbehind (?<=pattern)
  4. Negative lookbehind (?<!pattern)

Let’s dive into code and see how we can improve lexical analysis for our programming language. We will go through the declared Token types with the corresponding regex value to define an appropriate lexeme:

  1. Keyword lexeme if|then|end|print|input|struct|arg

    Let’s suppose we defined a variable starting with a Keyword:

1
2
print_value = 5
print print_value

During the lexical analysis we will capture this print_value variable by the Keyword lexeme instead of being parsed as a variable. To read it properly, we need to be sure that after the keyword is placed either a space or the end of the line. It could be done with the help of a positive lookahead in the end of the Keyword expression:

1
(if|then|end|print|input|struct|arg)(?=\\s|$)
  1. Logical lexeme true|false

The same rule is applied to the Logical lexeme:

1
true_value = true

To read the Logical lexeme properly, it should be separated either by space or by the end of the line:

1
(true|false)(?=\\s|$)
  1. Numeric lexeme [0-9]+

For the Numeric values, we will introduce the ability to read floating-point numbers with a decimal point and numbers with a negative/positive sign in front of the number:

1
2
3
4
negative_value = -123
float_value = 123.456
leading_dot_float_value = .123
ending_dot_float_value = 123.

First, to match optional + or - we add a character class:

1
[+-]?[0-9]+

Then to read a float number, we introduce a dot character with arbitrary numbers set after it:

1
[+-]?[0-9]+[.]?[0-9]*

There could be a special case when our number starts with a leading dot, so we add an extra alternation:

1
[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)

We can also replace an alternation by applying a positive lookahead with a requirement to contain maximum 1 dot and at least 1 number after an arbitrary sign declaration. In this case, our main expression for the float number will have a greedy * quantifier:

1
[+-]?((?=[.]?[0-9])[0-9]*[.]?[0-9]*)

Lastly, to map floating-point numbers, we modify NumericValue to accept Double type as a generic Value type:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
public class NumericValue extends Value<Double> {
    public NumericValue(Double value) {
        super(value);
    }

    @Override
    public String toString() {
        if ((getValue() % 1) == 0) //print integer without fractions
            return String.valueOf(getValue().intValue());
        return super.toString();
    }
}
  1. Operator ([+]|[-]|[*]|[/]|[>]|[<]|[=]{1,2}|[!]|[:]{2}|[(]|[)]|new)
1
2
3
4
5
struct Object
	arg number
end

new_value = new Object [ 5 ]

For the new operator, we as well need to introduce a positive lookahead separating value with a space or with the end of the line:

1
([+]|[-]|[*]|[/]|[>]|[<]|[=]{1,2}|[!]|[:]{2}|[(]|[)]|new(?=\s|$))

In this part we slightly improved our lexical analyzer with positive lookaheads. Now our toy language is a little bit smarter and able to parse more complex variable expressions.

Part 4


Building Your Own Programming Language From Scratch Part IV: Implementing Functions

featured image - Building Your Own Programming Language From Scratch Part IV: Implementing Functions

In this part of creating your own programming language, we will introduce new function constructions. Please checkout the previous parts:

  1. Building Your Own Programming Language From Scratch
  2. Building Your Own Programming Language From Scratch: Part II - Dijkstra’s Two-Stack Algorithm
  3. Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads

The full source code is available over on GitHub.

1. Function Design

We will begin embedding function constructions with the lexical analyzer. As you may remember from the previous parts, we store language lexeme types in the TokenType enum with corresponding regular expressions.

With the new function statement, we’re missing the fun and return keywords. The first fun will stand for the beginning of the function declaration, the second return will be used to exit the function and pass the resulting value. Therefore, we add these missing keywords as “or” expressions:

1
2
3
4
5
6
7
8
package org.example.toylanguage.token;

...
public enum TokenType {
    ...
    Keyword("(if|then|end|print|input|struct|fun|return)(?=\\s|$)"),
    ...
}

Then we will map the fun and return keywords inside the syntax analyzer. The purpose of the syntax analyzer is to build statements from tokens generated by the lexical analyzer in the first step. But before diving into parsing function expressions, we will add a couple of auxiliary classes, which are required to store and manage function expressions.

1.1 Function Statement

First, we need a class for the function statement. It should extend the existing CompositeStatement, which will allow to store inner statements as the early created ConditionStatement (if/then):

1
2
3
4
package org.example.toylanguage.statement;

public class FunctionStatement extends CompositeStatement {
}

1.2 Function Definition

In order to store function declarations with all metadata information, we need a function definition class. We already have StructureDefinition for structures.

Let’s initialize a new one for the functions. It will contain the following fields: function name, function arguments and the FunctionStatement itself, which will be used later when we meet the actual function invocation to access inner function statements:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
package org.example.toylanguage.definition;

@RequiredArgsConstructor
@Getter
@EqualsAndHashCode(onlyExplicitlyIncluded = true)
public class FunctionDefinition implements Definition {
	@EqualsAndHashCode.Include
	private final String name;
	private final List<String> arguments;
	private final FunctionStatement statement;
}

1.3 Function Header

image

Now we have classes to read a function definition and function inner statements, we can start reading the function header. We will extend the existent StatementParser#parseExpression() method to read the new fun Keyword:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
package org.example.toylanguage;

public class StatementParser {
    ...
	private Statement parseExpression() {
		Token token = next(TokenType.Keyword, TokenType.Variable, TokenType.Operator);
		switch (token.getType()) {
			...
			case Keyword:
				switch (token.getValue()) {
						...
					case "fun":
						...

				}
			default:
				...
		}
	}
}

After we successfully caught fun Keyword, we will read the function name:

1
2
3
4
...
case "fun":
    Token type = tokens.next(TokenType.Variable);
...

After the function name, we expect to receive the function arguments surrounded by square brackets and divided by comma:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
...
case "fun":
    Token type = tokens.next(TokenType.Variable);

    List<String> args = new ArrayList<>();

    if (tokens.peek(TokenType.GroupDivider, "[")) {

		tokens.next(TokenType.GroupDivider, "["); //skip open square bracket

		while (!tokens.peek(TokenType.GroupDivider, "]")) {
			Token arg = tokens.next(TokenType.Variable);
			args.add(arg.getValue());

			if (tokens.peek(TokenType.GroupDivider, ","))
				tokens.next();
		}

		tokens.next(TokenType.GroupDivider, "]"); //skip close square bracket
	}

...

In the next step, we initialize the FunctionDefinition instance and add it to the functions HashMap, which will be used later by inner ExpressionReader to construct function invocations:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
public class StatementParser {
    ...
    private final Map<String, FunctionDefinition> functions;
    ...

    public StatementParser(List<Token> tokens) {
        ...
        this.functions = new HashMap<>();
    }

    ...
    case "fun":
        ...
        FunctionStatement functionStatement = new FunctionStatement();
        FunctionDefinition functionDefinition = new FunctionDefinition(type.getValue(), args, functionStatement);
        functions.put(type.getValue(), functionDefinition);
    ...

}

1.4 Function Body

image

After completing the statement header, we will read the inner statements until we reach the finalizing end Keyword the same way we did to read the ConditionStatement (if/then).

To parse an inner statement, we invoke the StatementParser#parseExpression(), which will read nested functions as well.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
 ...
    case "fun":
        ...
        while (!tokens.peek(TokenType.Keyword, "end")) {
            Statement statement = parseExpression();
            functionStatement.addStatement(statement);
        }
        tokens.next(TokenType.Keyword, "end");
	  
	    return null;
...

In the end, we return null as we don’t want this function statement to be executed in the main program without a direct invocation call. This code definitely needs to be refactored, but for now it’s the easiest way to manage inner parseExpression() invocations.

2. Return Statement

image

In the previous section, we added the return as the TokenType.Keyword to be captured by the lexical analyzer. In this step, we will handle this keyword using the syntax analyzer.

First, we start with a statement implementation, which should contain only an expression to execute and pass as the result of the function invocation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
package org.example.toylanguage.statement;

@RequiredArgsConstructor
@Getter
public class ReturnStatement implements Statement {
    private final Expression expression;

    @Override
    public void execute() {
        Value<?> result = expression.evaluate();
    }
}

To pass the resulting expression of the invoked function, we introduce an auxiliary ReturnScope class with the invoke() method, indicating that we have triggered the end of the function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
package org.example.toylanguage.context;

@Getter
public class ReturnScope {
	private boolean invoked;
	private Value<?> result;

	public void invoke(Value<?> result) {
		setInvoked(true);
		setResult(result);
	}

	private void setInvoked(boolean invoked) {
		this.invoked = invoked;
	}

	private void setResult(Value<?> result) {
		this.result = result;
	}
}

There is one important note: we can have several nested function invocations with multiple function exits, but one ReturnScope cannot manage all function calls. Therefore, we create an extra class to manage the return context for each of the function invocation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
package org.example.toylanguage.context;

public class ReturnContext {
	private static ReturnScope scope = new ReturnScope();

	public static ReturnScope getScope() {
		return scope;
	}

	public static void reset() {
		ReturnContext.scope = new ReturnScope();
	}
}

Now we can go back to the ReturnStatement declaration and call ReturnScope#invoke() method using ReturnContext:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
package org.example.toylanguage.statement;

...
public class ReturnStatement implements Statement {
    ...

    @Override
    public void execute() {
        Value<?> result = expression.evaluate();
        ReturnContext.getScope().invoke(result);
    }
}

The ReturnStatement is ready and will notify the current ReturnScope about function exit, therefore we can implement logic for the syntax analyzer to map return lexeme into ReturnStatement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
package org.example.toylanguage;

public class StatementParser {
	private Statement parseExpression() {
		Token token = next(TokenType.Keyword, TokenType.Variable, TokenType.Operator);
		switch (token.getType()) {
			...
			case Keyword:
				switch (token.getValue()) {
						...
					case "return":
						Expression expression = new ExpressionReader().readExpression();
                        return new ReturnStatement(expression);

				}
			default:
				...
		}
	}
}

The next step would be to subscribe function execution to the ReturnContext. The function needs to know at which moment we performed the return statement to stop the subsequent execution. We will extend ConditionStatement#execute() implementation, which will work both for the FunctionStatement and for the ConditionStatement (if/then), by stopping the execution at the moment we catch the return invocation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
package org.example.toylanguage.statement;

...
public class CompositeStatement implements Statement {
    ...

    @Override
    public void execute() {
        for (Statement statement : statements2Execute) {
            statement.execute();

            //stop the execution in case ReturnStatement has been invoked
            if (ReturnContext.getScope().isInvoked())
                return;
        }
    }
}

3. Empty Return

image

Our return statement can contain no resulting value, indicating just the end of the function without the provided result. Let’s introduce the new Value implementation containing “null” value.

First, we add new Null lexeme to parse “null” as different lexeme instead of Variable:

1
2
3
4
5
6
7
8
...
public enum TokenType {
    ...
    Numeric("[+-]?((?=[.]?[0-9])[0-9]*[.]?[0-9]*)"),
    Null("(null)(?=,|\\s|$)"),
    Text("\"([^\"]*)\""),
    ...
}

In the next step, we create the Value implementation for this Null lexeme. We can initialize a static instance as we will use the same object to refer:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
package org.example.toylanguage.expression.value;

public class NullValue<T extends Comparable<T>> extends Value<T> {

	public static final NullValue<?> NULL_INSTANCE = new NullValue<>(null);

	private NullValue(T value) {
		super(value);
	}

	@Override
	public T getValue() {
		//noinspection unchecked
		return (T) this;
	}

	@Override
	public String toString() {
		return "null";
	}
}

The last thing would be to map Null lexeme as NullValue by the ExpressionReader. If our return statement will contain only return statement without “null” value, we will not have any operands for the returned expression.

Therefore, we should add extra validation at the end of the readExpression() method by returning NULL_INSTANCE in the case operands stack is empty:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
private class ExpressionReader {
    ...

    private Expression readExpression() {
        while (tokens.peekSameLine(TokenType.Operator, TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Null, TokenType.Text)) {
            Token token = tokens.next();
            switch (token.getType()) {
                case Operator:
                    ...
                default:
                    String value = token.getValue();
                    Expression operand;
                    switch (token.getType()) {
                        case Numeric:
                            ...
                        case Logical:
                            ...
                        case Text:
                            ...
                        case Null:
                            operand = NULL_INSTANCE;
                            break;
                        case Variable:
                        default:
                            ...
                    }
                    ...
            }
        }

        ...

        if (operands.isEmpty()) {
            return NULL_INSTANCE;
        } else {
            return operands.pop();
        }
    }
    ...
}

4. Memory Scope

Before diving into function expressions, first we will need to get familiar with the function scope. Basically, scope is the current context in the code.

Scope can be defined locally or globally. Currently, if we define any variable, it will be stored as a global variable in the StatementParser#variables Map and will be accessible in the global context.

image

Inside a function, we can also declare variables, and these variables should be accessible only in the scope of this declared function and should be isolated from each function call. To isolate the memory, we will introduce a new class MemoryScope, which will hold the Map of variables:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
package org.example.toylanguage.context;

public class MemoryScope {
	private final Map<String, Value<?>> variables;

	public MemoryScope(MemoryScope parent) {
		this.variables = new HashMap<>();
	}

	public Value<?> get(String name) {
		return variables.get(name);
	}

	public void set(String name, Value<?> value) {
		variables.put(name, value);
	}
}

Unlike ReturnScope, we shouldn’t reset the memory (variables Map) after we enter a function or any other inner block of code. We still need a way to access global variables. Therefore, we can add a parent instance of the MemoryScope, which will be used to read the global variables declared outside of the function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
public class MemoryScope {
	private final Map<String, Value<?>> variables;
	private final MemoryScope parent;

	public MemoryScope(MemoryScope parent) {
		this.variables = new HashMap<>();
		this.parent = parent;
	}

	public Value<?> get(String name) {
		Value<?> value = variables.get(name);
		if (value == null && parent != null) {
			return parent.get(name);
		}
		return value;
	}

    ...
}

To switch between memory contexts easily, we will introduce the MemoryContext class as we did for the ReturnScope. The newScope() will be called when we enter an inner block of code and the endScope() will be triggered at the end of this block:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
package org.example.toylanguage.context;

public class MemoryContext {
	private static MemoryScope memory = new MemoryScope(null);

	public static MemoryScope getMemory() {
		return memory;
	}

	public static void newScope() {
		MemoryContext.memory = new MemoryScope(memory);
	}

	public static void endScope() {
		MemoryContext.memory = memory.getParent();
	}
}

Now we‌ can refactor variables access for the AssignStatement, InputStatement and VariableExpression with the new MemoryContext interface:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
package org.example.toylanguage.statement;

...
public class AssignStatement implements Statement {
    private final String name;
    private final Expression expression;

    @Override
    public void execute() {
        MemoryContext.getMemory().set(name, expression.evaluate());
    }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
package org.example.toylanguage.statement;

...
public class InputStatement implements Statement {
    private final String name;
    private final Supplier<String> consoleSupplier;

    @Override
    public void execute() {
        ...

        MemoryContext.getMemory().set(name, value);
    }
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
package org.example.toylanguage.expression;

...
public class VariableExpression implements Expression {
    private final String name;

    @Override
    public Value<?> evaluate() {
        Value<?> value = MemoryContext.getMemory().get(name);
        ...
    }

    public void setValue(Value<?> value) {
        MemoryContext.getMemory().set(name, value);
    }
}

5. Function Invocation

image

In this section, we will make our syntax analyzer able to read and parse function invocations. First, we will start with the FunctionExpression declaration.

This class will contain the FunctionDefinition field to access the function details and the Map of Expression values, passed as arguments during function invocation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
package org.example.toylanguage.expression;

@RequiredArgsConstructor
@Getter
public class FunctionExpression implements Expression {
    private final FunctionDefinition definition;
    private final List<Expression> values;

    @Override
    public Value<?> evaluate() {
        return null;
    }
}

To evaluate() the function expression, we need to execute function inner statements. To retrieve them, we use the FunctionStatement instance from the declared FunctionDefinition field:

1
2
3
4
5
6
7
8
...
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
    statement.execute();
	return null;
}
...

But before executing function statements, it’s important to initialize the new MemoryScope. It will allow inner statements to use its own memory. The inner memory shouldn’t be accessible outside this function call, therefore at the end of the execution, we set the memory scope back:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
...
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
    MemoryContext.newScope(); //set new memory scope
      
	//execute function body
	try {
		statement.execute();
	} finally {
		MemoryContext.endScope(); //end function memory scope
	}

	return null;
}
...

We should also initialize function arguments, which can be passed as value lexemes and not be available in the global memory scope. We can save them in the new function’s memory scope by creating and executing the AssignStatement for each of the function values:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
...
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
	MemoryContext.newScope(); //set new memory scope

	//initialize function arguments
	IntStream.range(0, values.size())
			.boxed()
			.map(i -> new AssignStatement(definition.getArguments().get(i), values.get(i)))
			.forEach(Statement::execute);

	//execute function body
	try {
		statement.execute();
	} finally {
		MemoryContext.endScope(); //end function memory scope
	}
      
      return null;
}
...

The last thing left to do is to return the result. We can grab it from the ReturnContext. The result will be captured in the ReturnScope when we invoke the ReturnStatement.

Don’t forget to reset the ReturnScope at the end of the function block to make the next function invocation use its own ReturnScope instance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
...	
@Override
public Value<?> evaluate() {
	FunctionStatement statement = definition.getStatement();
	MemoryContext.newScope(); //set new memory scope

	//initialize function arguments
	IntStream.range(0, values.size())
			.boxed()
			.map(i -> new AssignStatement(definition.getArguments().get(i), values.get(i)))
			.forEach(Statement::execute);

	//execute function body
	try {
		statement.execute();
		return ReturnContext.getScope().getResult();
	} finally {
		MemoryContext.endScope(); //end function memory scope
		ReturnContext.reset(); //reset return context
	}
}
...

Now we can switch to the ExpressionReader class and read an actual function invocation. The pattern of function call looks almost the same as the structure instantiation, except for the fact, we don’t have the New keyword in the front of the expression.

Before creating the VariableExpression, we check that the next lexeme is the opening square bracket, which will indicate the function call:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
private class ExpressionReader {
    ...

    private Expression readExpression() {
        while (tokens.peekSameLine(TokenType.Operator, TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Null, TokenType.Text)) {
            Token token = tokens.next();
            switch (token.getType()) {
                case Operator:
                   ...
                default:
                    String value = token.getValue();
                    Expression operand;
                    switch (token.getType()) {
                        case Numeric:
                            ...
                        case Logical:
                            ...
                        case Text:
                            ...
                        case Null:
                            ...
                        case Variable:
                        default:
                            if (!operators.isEmpty() && operators.peek() == Operator.StructureInstance) {
                                operand = readStructureInstance(token);
                            } else if (tokens.peekSameLine(TokenType.GroupDivider, "[")) {
                                operand = readFunctionInvocation(token);
                            } else {
                                operand = new VariableExpression(value);
                            }
                    }
                    ...
            }
        }

        ...
    }

}

Reading the function invocation will look almost the same as structure instantiation. First, we obtain the function definition by function name. Then, we read function arguments. In the end, we return an instance of FunctionExpression:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
private FunctionExpression readFunctionInvocation(Token token) {
    FunctionDefinition definition = functions.get(token.getValue());

    List<Expression> arguments = new ArrayList<>();
    if (tokens.peekSameLine(TokenType.GroupDivider, "[")) {

        tokens.next(TokenType.GroupDivider, "["); //skip open square bracket

        while (!tokens.peekSameLine(TokenType.GroupDivider, "]")) {
            Expression value = new ExpressionReader().readExpression();
            arguments.add(value);

            if (tokens.peekSameLine(TokenType.GroupDivider, ","))
                tokens.next();
        }

        tokens.next(TokenType.GroupDivider, "]"); //skip close square bracket
    }

    return new FunctionExpression(definition, arguments);
}

6. Testing Time

We have finished embedding functions in both lexical and syntax analyzers. Now we should be able to parse function definitions and to read function invocations. Note that function invocation is just like any other expression, including structure instantiation.

Therefore, you can build any logical complex nested expressions. Let’s create a more difficult task for the syntax analysis to determine if two binary trees are the same:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
#
# Determine if two binary trees are identical following these rules:
# - root nodes have the same value
# - left sub trees are identical
# - right sub trees are identical
#

struct TreeNode [ value, left, right ]

fun is_same_tree [ node1, node2 ]
    if node1 == null and node2 == null then
        return true
    end
    if node1 != null and node2 != null then
        return node1 :: value == node2 :: value and is_same_tree [ node1 :: left, node2 :: left ] and is_same_tree [ node1 :: right, node2 :: right ]
    end
    return false
end

#                               tree-s example
#
#          tree1            tree2            tree3            tree4
#            3                3                3                3
#          /   \            /   \            /   \            /   \
#         4     5          4     5          4     5          4     5
#       /  \   /  \      /  \   /  \      /  \   / \        /  \     \
#      6    7 8    9    6    7 8    9    6    7 9   8      6    7     9

# tree1 nodes
tree1_node9 = new TreeNode [ 9, null, null ]
tree1_node8 = new TreeNode [ 8, null ]
tree1_node7 = new TreeNode [ 7 ]
tree1_node6 = new TreeNode [ 6 ]
tree1_node5 = new TreeNode [ 5, tree1_node8, tree1_node9 ]
tree1_node4 = new TreeNode [ 4, tree1_node6, tree1_node7 ]
tree1 = new TreeNode [ 3, tree1_node4, tree1_node5 ]

tree2 = new TreeNode [ 3, new TreeNode [ 4, new TreeNode [ 6 ], new TreeNode [ 7 ] ], new TreeNode [ 5, new TreeNode [ 8 ], new TreeNode [ 9 ] ] ]
tree3 = new TreeNode [ 3, new TreeNode [ 4, new TreeNode [ 6 ], new TreeNode [ 7 ] ], new TreeNode [ 5, new TreeNode [ 9 ], new TreeNode [ 8 ] ] ]
tree4 = new TreeNode [ 3, new TreeNode [ 4, new TreeNode [ 6 ], new TreeNode [ 7 ] ], new TreeNode [ 5, null, new TreeNode [ 9 ] ] ]

print is_same_tree [ tree1, tree2 ]
print is_same_tree [ tree1, tree3 ]
print is_same_tree [ tree2, tree3 ]
print is_same_tree [ tree1, tree4 ]
print is_same_tree [ tree2, tree4 ]
print is_same_tree [ tree3, tree4 ]

image

Part 5


Building Your Own Programming Language From Scratch: Part V - Arrays

featured image - Building Your Own Programming Language From Scratch: Part V - Arrays

In this part of creating your programming language, we will implement arrays. Please see the previous parts:

  1. Building Your Own Programming Language From Scratch
  2. Building Your Own Programming Language From Scratch: Part II - Dijkstra’s Two-Stack Algorithm
  3. Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads
  4. Building Your Own Programming Language From Scratch Part IV: Implementing Functions

The full source code is available over on GitHub.

Our language will provide the following abilities for the arrays:

  • Define an array
1
array = { 1, 2, 3 }
  • Define an empty array
1
empty_array = { }
  • Contain arbitrary types of values: plain types, structure instantiations, function invocations, nested operators with other arrays, and any other composite expressions inheriting the Expression interface
1
arbitrary_array = { 1, "two", false, new Structure[4], get_value[5], { 6, 7 * 8 } }
  • Append a value to the defined array with << operator
1
2
array = { 1, 2 }
array << 3 # will contain { 1, 2, 3 }
  • Access array’s value by index position
1
first_value = array{0}
  • Update array’s value by index position
1
2
array{0} = 1
array{1} = array{array{0} + 1} + 1

1 Lexical analysis

We will begin with lexical analysis. Basically, it’s a process to divide the source code into tokens, such as keyword, identifier/variable, operator, etc.

First, to define the array, we add curly brackets { } as the GroupDivider lexeme. We can’t use the square brackets [ ] or round brackets ( ) as they are already taken by the functions/structures definitions and by expression operators accordingly.

1
2
3
4
5
6
7
8
package org.example.toylanguage.token;

...
public enum TokenType {
    ...
    GroupDivider("(\\[|\\]|\\,|\\{|})"),
    ...
}

Second, we need to introduce a new << Operator lexeme to define the “append a value to array” operation. We’ll extend the current less than operator < by adding an {1,2} quantifier after it, which will be able to catch either one or two < symbols:

1
2
3
4
5
6
...
public enum TokenType {
    ...
    Operator("([+]|[-]|[*]|[/]|[%]|>=|>|<=|[<]{1,2}|[=]{1,2}|!=|[!]|[:]{2}|[(]|[)]|(new|and|or)(?=\\s|$))"),
    ...
}

With all this being set, we will catch all the needed lexemes properly with defined TokenType values.

2 Syntax analysis

In this section, within the syntax analysis, we will convert tokens obtained from the lexical analysis into final statements following our language’s array rules. Arrays are just like any other part of the expression without any additional statements. Therefore, we just need to extend the existing functionality for the ExpressionStatement.

2.1 ArrayExpression

First, to declare any values in our array, we define the ArrayExpression class and add the List of values, which will accept the Expression values to allow us to place non-final values such as variables, structure invocations, function invocations, operators, and any other Expression implementations. Next, we implement the Expression interface, which requires us to override the evaluate() method. It’s a deferred interface to transform array Expression expressions into final Value values, which will be invoked only during direct code execution:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
package org.example.toylanguage.expression;

@RequiredArgsConstructor
@Getter
public class ArrayExpression implements Expression {
    private final List<Expression> values;

    @Override
    public Value<?> evaluate() {
        return new ArrayValue(this);
    }
}

2.2 ArrayValue

The next class we implement is ArrayValue, which will accept the ArrayExpression as a constructor argument to gather all the details about the array state. To change the array values, we include the corresponding getValue(), setValue(), appendValue() methods:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
package org.example.toylanguage.expression.value;

public class ArrayValue extends Value<List<Value<?>>> {
    public ArrayValue(ArrayExpression expression) {
        this(expression.getValues()
                .stream()
                .map(Expression::evaluate)
                .collect(Collectors.toList()));
    }

    public Value<?> getValue(int index) {
        if (getValue().size() > index)
            return getValue().get(index);
        return NULL_INSTANCE;
    }

    public void setValue(int index, Value<?> value) {
        if (getValue().size() > index)
            getValue().set(index, value);
    }

    public void appendValue(Value<?> value) {
        getValue().add(value);
    }
}

We can also override the equals() method, as it will be used by the EqualsOperator and NotEqualsOperator to compare:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
...
public class ArrayValue extends Value<List<Value<?>>> {
    ...

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null) return false;
        if (getClass() != o.getClass()) return false;
        //noinspection unchecked
        List<Value<?>> oValue = (List<Value<?>>) o;
        return new HashSet<>(getValue()).containsAll(oValue);
    }
}

2.3 Operator

2.3.1 ArrayValueOperator

With the ArrayExpression implementation, we can already instantiate arrays and fill them with values. The next goal is to access the array’s value by index position. For this purpose, we’ll create the BinaryOperatorExpression operator implementation, which contains two operands. The evaluate() method will use the first operand as the array itself and the second operand as an index. The assign() method is used to set the array’s value by index using the same operands:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
package org.example.toylanguage.expression.operator;

public class ArrayValueOperator extends BinaryOperatorExpression implements AssignExpression {
    public ArrayValueOperator(Expression left, Expression right) {
        super(left, right);
    }

    @Override
    public Value<?> evaluate() {
        Value<?> left = getLeft().evaluate();
        if (left instanceof ArrayValue) {
            Value<?> right = getRight().evaluate();
            return ((ArrayValue) left).getValue(((Double) right.getValue()).intValue());
        }
        return left;
    }

    @Override
    public void assign(Value<?> value, VariableScope variableScope) {
        Value<?> left = getLeft().evaluate();
        if (left instanceof ArrayValue) {
            Value<?> right = getRight().evaluate();
            ((ArrayValue) left).setValue(((Double) right.getValue()).intValue(), value);
        }
    }
}

The new AssignExpression interface can be used to set the variable’s value only within the assignment = operator:

1
2
3
public interface AssignExpression {
    void assign(Value<?> value, VariableScope variableScope);
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
public class AssignmentOperator extends BinaryOperatorExpression {
    ...

    @Override
    public Value<?> evaluate() {
        if (getLeft() instanceof AssignExpression) {
            Value<?> right = getRight().evaluate();
            ((AssignExpression) getLeft()).assign(right, variableScope);
        }
        ...
    }
}

2.3.2 ArrayAppendOperator

Next, to transform the “append a value to array” operator defined as << Operator lexeme, we create a new BinaryOperatorExpression implementation. The left operand is still the array itself, and the right operand is the value we want to append:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
package org.example.toylanguage.expression.operator;

public class ArrayAppendOperator extends BinaryOperatorExpression {
    public ArrayAppendOperator(Expression left, Expression right) {
        super(left, right);
    }

    @Override
    public Value<?> evaluate() {
        Value<?> left = getLeft().evaluate();
        if (left instanceof ArrayValue) {
            Value<?> right = getRight().evaluate();
            ((ArrayValue) left).appendValue(right);
        }
        return getLeft().evaluate();
    }
}

Don’t forget to include this << operator in the Operator enum. It will have the same precedence as the Assignment operator:

1
2
3
4
5
6
7
8
9
...
public enum Operator {
    ...

    ArrayAppend("<<", ArrayAppendOperator.class, 0),
    Assignment("=", AssignmentOperator.class, 0);

    ...
}

2.4 ExpressionReader

In this section, we will complete syntax analysis for the array integration by reading an actual array instantiation and the array’s value. To make it work, we need to extend the ExpressionReader implementation to support array expressions within Dijkstra’s Two-Stack algorithm.

2.4.1 Array instantiation

First, we will read the array instantiation. We extend the readExpression() filter to support the curly brackets as the marker of array beginning:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
package org.example.toylanguage;

public class StatementParser {

    ...

    private class ExpressionReader {
    
        ...

        private Expression readExpression() {
            while (hasNextToken()) {
                ...
            }
        }

        private boolean hasNextToken() {
            if (tokens.peekSameLine(TokenType.Operator, TokenType.Variable, TokenType.Numeric, TokenType.Logical, TokenType.Null, TokenType.Text))
                return true;
            //beginning of an array
            if (tokens.peekSameLine(TokenType.GroupDivider, "{"))
                return true;
            return false;
        }
    }
}

Next, we implement the corresponding case for the GroupDivider { lexeme:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
private Expression readExpression() {
    while (hasNextToken()) {
        Token token = tokens.next();
        switch (token.getType()) {
            case Operator:
                ...
            default:
                String value = token.getValue();
                Expression operand;
                switch (token.getType()) {
                    case Numeric:
                        ...
                    case Logical:
                        ...
                    case Text:
                        ...
                    case Null:
                        ...
                    case GroupDivider:
                        if (Objects.equals(token.getValue(), "{")) {
                            operand = readArrayInstance();
                            break;
                        }
                    case Variable:
                    default:
                        ...
                }
                operands.push(operand);
        }
    }
    ...
}

The pattern for the array expression is the same as for the function invocation except for using the curly brackets instead of square brackets: we read the array’s values as the Expression expressions divided by a comma within the curly brackets. In the end, we build the ArrayExpression with the obtained values:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
// read array instantiation: { 1, 2, 3 }
private ArrayExpression readArrayInstance() {
    List<Expression> values = new ArrayList<>();

    while (!tokens.peekSameLine(TokenType.GroupDivider, "}")) {
        Expression value = new ExpressionReader().readExpression();
        values.add(value);

        if (tokens.peekSameLine(TokenType.GroupDivider, ","))
            tokens.next();
    }

    tokens.next(TokenType.GroupDivider, "}"); //skip close curly bracket

    return new ArrayExpression(values);
}

2.4.2 Access array’s value by index

Accessing the array’s value by index position is less straightforward than the array’s append operation because our Dijkstra’s Two-Stack algorithm inside the ExpressionReader cannot handle the array’s index defined within the curly brackets as an operator. To solve this limitation, we will read the operator with operands together as a composite operand the same way we did read the structure instantiation, except for the fact that we don’t have the “new” operator in front of the structure and the array’s index is defined inside curly brackets instead of square brackets:

1
2
3
4
5
6
7
8
9
// read array's value: array{index}
private ArrayValueOperator readArrayValue(Token token) {
    VariableExpression array = new VariableExpression(token.getValue());
    tokens.next(TokenType.GroupDivider, "{");
    Expression arrayIndex = new ExpressionReader().readExpression();
    tokens.next(TokenType.GroupDivider, "}");

    return new ArrayValueOperator(array, arrayIndex);
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
private Expression readExpression() {
    while (hasNextToken()) {
        Token token = tokens.next();
        switch (token.getType()) {
            case Operator:
                ...
            default:
                String value = token.getValue();
                Expression operand;
                switch (token.getType()) {
                    case Numeric:
                        ...
                    case Logical:
                        ...
                    case Text:
                        ...
                    case Null:
                        ...
                    case GroupDivider:
                        ...
                    case Variable:
                    default:
                        if (!operators.isEmpty() && operators.peek() == Operator.StructureInstance) {
                            ...
                        } else if (tokens.peekSameLine(TokenType.GroupDivider, "[")) {
                            ...
                        } else if (tokens.peekSameLine(TokenType.GroupDivider, "{")) {
                            operand = readArrayValue(token);
                        } else {
                            ...
                        }
                }
                operands.push(operand);
        }
    }
    ...
}

3 Tests

We have finished embedding arrays in both lexical and syntax analyzers. Now we should be able to read array expressions, access and update array values. Let’s implement a more realistic task - the recursive binary search algorithm:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
fun binary_search [ arr, n, lo, hi, key ]
    if hi >= lo then
        mid = (hi + lo) // 2 # floor division
        if arr{mid} < key then
            return binary_search [ arr, n, mid + 1, hi, key ]
        elif arr{mid} > key then
            return binary_search [ arr, n, lo, mid - 1, key ]
        else then
            return mid
        end
    else then
        return -1
    end
    return
end

arr = {10, 20, 30, 50, 60, 80, 110, 130}
arr << 140
arr << 170
n = 10

print binary_search [ arr, n, 0, n - 1, 10 ] # 0
print binary_search [ arr, n, 0, n - 1, 80 ] # 5
print binary_search [ arr, n, 0, n - 1, 170 ] # 9
print binary_search [ arr, n, 0, n - 1, 5 ] # -1
print binary_search [ arr, n, 0, n - 1, 85 ] # -1
print binary_search [ arr, n, 0, n - 1, 175 ] # -1

Part 6


Building Your Own Programming Language From Scratch: Part VI - Loops

featured image - Building Your Own Programming Language From Scratch: Part VI - Loops

In this part of creating your programming language, we will implement loops.

Please see the previous parts:

  1. Building Your Own Programming Language From Scratch
  2. Building Your Own Programming Language From Scratch: Part II - Dijkstra’s Two-Stack Algorithm
  3. Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads
  4. Building Your Own Programming Language From Scratch Part IV: Implementing Functions
  5. Building Your Own Programming Language From Scratch: Part V - Arrays

The full source code is available over on GitHub.

Our language will provide the following types of loops:

  • For loop - to iterate through a range:

image

  • While loop - to iterate as long as a specified condition is true:

image

  • For-each (Iterable) loop - to iterate through elements of arrays and structures:

image

1 Lexical analysis

In the first section, we will cover lexical analysis. It’s a process to divide the source code into language lexemes, such as keyword, identifier/variable, operator, etc. To make our lexical analysis work with loop constructions, we add the following Keyword lexemes: loop, in, by to the TokenType enum:

1
2
3
4
5
6
7
8
package org.example.toylanguage.token;

...
public enum TokenType {
    ...
        Keyword("(if|elif|else|then|end|print|input|struct|fun|return|loop|in|by)(?=\\s|$)"),
    ...
}

Then, in order to separate the lower and upper bounds of the For loop range, we add the two-dot GroupDivider lexeme: [.]{2}

1
2
3
4
5
6
...
public enum TokenType {
    ...
    GroupDivider("(\\[|\\]|\\,|\\{|}|[.]{2})"),
    ...
}

And finally, if we don’t want the Numeric lexeme to catch the two-dot GroupDivider lexeme declared after the loop’s lower bound, we add a negative look ahead construction before the dot expression: (?![.]{2})

1
2
3
4
5
6
...
public enum TokenType {
    ...
    Numeric("([-]?(?=[.]?[0-9])[0-9]*(?![.]{2})[.]?[0-9]*)"),
    ...
}

2 Syntax analysis

Let’s move to the next section - the syntax analysis, which will convert the lexemes received from the previous step into the final statements following our language rules. As was said at the beginning, our language will support three types of loops: For, While and Iterable. Therefore, first we will declare the AbstractLoopStatement class, which will contain common execution process for all the loop types:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
package org.example.toylanguage.statement.loop;

public abstract class AbstractLoopStatement extends CompositeStatement {
    protected abstract void init();

    protected abstract boolean hasNext();

    protected abstract void preIncrement();

    protected abstract void postIncrement();
}

Inside this class, we declare the following abstract methods, which can have different implementation for each loop type:

  1. init() - it will be invoked to perform initialization operations, like initializing a counter variable when a loop starts the execution.
  2. hasNext() - It will be called before each iteration to define that we still have remaining items to iterate.
  3. preIncrement() - It’s a prefix operation that will be called before each iteration.
  4. postIncrement() - It’s a postfix operation that will be called at the end of each iteration.

Before creating the loop implementations, we complete the Statement#execute() method by utilizing declared abstract methods. To start the execution, we invoke the init() method to initialize the loop state, but before that, we should open the new MemoryScope for the variables declared within the loop, as they shouldn’t be accessed after the loop ends. In the end of the execution, we release the earlier opened MemoryScope:

1
2
3
4
5
6
7
MemoryContext.newScope();

try {
	init();
} finally {
	MemoryContext.endScope();
}

Next, to iterate a loop, we utilize the abstract hasNext() method. Inside each iteration, we invoke the loop’s inner statements, defined within the inherited CompositeStatement class. When the iteration starts, we invoke the preIncrement() method to perform the prefix increment operations. At the end of each iteration, we call the postIncrement() method to perform the postfix increment operations accordingly:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
while (hasNext()) {
	preIncrement();

	try {

		// execute inner statements
		for (Statement statement : getStatements2Execute()) {
			statement.execute();
		}
	} finally {
		postIncrement();
	}
}

Next, to make sure that variables declared inside the loop statements are isolated from each iteration, we open the inner MemoryScope during the statements’ execution. And finally, if our loop is defined inside the function and one of the loop’s inner statements contains a return statement, we should stop the iteration immediately because it’s a sign to exit a function. Therefore, we validate that the function’s return statement hasn’t been called after each statement execution. The complete AbstractLoopStatement#execute() method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
public abstract class AbstractLoopStatement extends CompositeStatement {
    ...

    @Override
    public void execute() {
        MemoryContext.newScope();
        try {

            init();

            while (hasNext()) {
                preIncrement();

                MemoryContext.newScope();

                try {

                    // execute inner statements
                    for (Statement statement : getStatements2Execute()) {
                        statement.execute();

                        if (ReturnContext.getScope().isInvoked())
                            return;

                    }
                } finally {
                    MemoryContext.endScope(); // release each iteration memory

                    
                    postIncrement();
                }

            }
        } finally {
            MemoryContext.endScope(); // release loop memory
        }
    }
}

2.1 For loop

Let’s begin with the first For loop implementation. By our language design, it will contain the variable expression, the lower bound, the upper bound, and the step expression:

1
2
3
4
5
6
7
8
9
package org.example.toylanguage.statement.loop;

@RequiredArgsConstructor
public class ForLoopStatement extends AbstractLoopStatement {
    private final VariableExpression variable;
    private final Expression lowerBound;
    private final Expression uppedBound;
    private final Expression step;
}

Next, we implement the abstract methods. Within the first init() method, we assign the lower bound to the variable by evaluating an instance of AssignmentOperator, which accepts the variable expression as a first operand and the lowerBound expression as a second operand:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
...
public class ForLoopStatement extends AbstractLoopStatement {
	...

	@Override
	protected void init() {
		new AssignmentOperator(variable, lowerBound)
				.evaluate();
	}
}

Next, we override the hasNext(). It should return a boolean value if we remaining have items. To make a compare operation, we create an instance of the LessThenOperator by passing the variable expression as a first operand, and the upperBound as a second operand. As a result, we expect to get the LogicalValue with boolean value inside it:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
...
public class ForLoopStatement extends AbstractLoopStatement {
    ...

	@Override
	protected boolean hasNext() {
		LessThanOperator hasNext = new LessThanOperator(variable, uppedBound);
		Value<?> value = hasNext.evaluate();
		return value instanceof LogicalValue && ((LogicalValue) value).getValue();
	}
}

Lastly, to perform the increment operations for each iteration, we need to implement the preIncrement() and postIncrement() methods. The For loop should use the step increment only at the end of each iteration. Therefore, we perform the increment only within the postIncrement() method. To implement the increment operation, we create an instance of the AdditionOperator, which will sum up the variable value with the step value, and assign the result to the variable expression again using an instance of the AssignmentOperator. If the step expression hasn’t been provided, we accumulate the NumericValue instance with the default value equal to 1.0 :

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
...
public class ForLoopStatement extends AbstractLoopStatement {
	...

	@Override
	protected void preIncrement() {
	}

	@Override
	protected void postIncrement() {
		AdditionOperator stepOperator = new AdditionOperator(variable, Objects.requireNonNullElseGet(step, () -> new NumericValue(1.0)));
		new AssignmentOperator(variable, stepOperator)
				.evaluate();
	}
}

2.2 While loop

The next loop implementation is the While loop. It’s more straightforward than the For loop, as we don’t need to perform addition and assignment operations at all. This operator will contain only the hasNext expression, which will only serve to calculate the hasNext() result. The other overridden methods will be empty:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
package org.example.toylanguage.statement.loop;

@RequiredArgsConstructor
public class WhileLoopStatement extends AbstractLoopStatement {
    private final Expression hasNext;

    @Override
    protected void init() {
    }

    @Override
    protected boolean hasNext() {
        Value<?> value = hasNext.evaluate();
        return value instanceof LogicalValue && ((LogicalValue) value).getValue();
    }

    @Override
    protected void preIncrement() {
    }

    @Override
    protected void postIncrement() {
    }
}

2.3 Iterable loop

The last loop implementation is the Iterable loop. It will allow us to perform the for-each iterations with arrays and structures.

First, we define the new Value implementation, which will override the java.util.Iterable interface and, accordingly, will provide us with the java.util.Iterator implementation:

1
2
3
4
5
6
7
package org.example.toylanguage.expression.value;

public abstract class IterableValue<T> extends Value<T> implements Iterable<Value<?>> {
    public IterableValue(T value) {
        super(value);
    }
}

Next, we extend the ArrayValue by implementing IterableValue and passing array values to the Iterator instance:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
package org.example.toylanguage.expression.value;

public class ArrayValue extends IterableValue<List<Value<?>>> {
    ...

    @Override
    public Iterator<Value<?>> iterator() {
        return getValue().iterator();
    }
}

And we do the same extension for StructureValue as well. For now, we will iterate only the structure’s values. Later, we can introduce the dictionary data type to operate with key-value types:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
package org.example.toylanguage.expression.value;

public class StructureValue extends IterableValue<Map<String, Value<?>>> {
    ...

    @Override
    public Iterator<Value<?>> iterator() {
        return getValue().values().iterator();
    }
}

Now we have the two IterableValue implementations which can provide us with the Iterator instance, we can complete the Iterable loop with the corresponding implementation. It will contain the variable expression and the iterable expression. Also, we declare an auxiliary non-final Iterator<Value<?> field that will be used to iterate the loop:

1
2
3
4
5
6
7
8
9
package org.example.toylanguage.expression.value;

@RequiredArgsConstructor
public class IterableLoopStatement extends AbstractLoopStatement {
    private final VariableExpression variable;
    private final Expression iterableExpression;

    private Iterator<Value<?>> iterator;
}

Next, we override and resolve the abstract methods. Inside the init() method, we initialize the iterator field using the IterableValue’s iterator() method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
...
public class IterableLoopStatement extends AbstractLoopStatement {
    ...

	@Override
	protected void init() {
		Value<?> value = iterableExpression.evaluate();
		if (!(value instanceof IterableValue))
			throw new ExecutionException(String.format("Unable to loop non IterableValue `%s`", value));
		this.iterator = ((IterableValue<?>) value).iterator();
	}
}

The hasNext() method will utilize the Iterator#hasNext() implementation. Lastly, we should perform the prefix increment operation to know the value that is currently being iterated. Within preIncrement(), we create an instance of the AssignmentOperator, which will accept the variable as a first operand, and the value received from the iterator as a second operand:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
...
public class IterableLoopStatement extends AbstractLoopStatement {
    ...

	@Override
	protected boolean hasNext() {
		return iterator.hasNext();
	}

	@Override
	protected void preIncrement() {
		Value<?> next = iterator.next();
		new AssignmentOperator(variable, next)
				.evaluate();
	}

	@Override
	protected void postIncrement() {
	}

}

2.4 StatementParser

To make the loop statements work, we need to map the loop lexemes to the defined statements using the StatementParser class. All the loops defined by the language grammar start with the loop keyword. Therefore, inside the parseExpression() method, we add a new loop case for the corresponding Keyword lexeme:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package org.example.toylanguage;

public class StatementParser {
	...

	private Statement parseExpression() {
		Token token = tokens.next(TokenType.Keyword, TokenType.Variable, TokenType.Operator);
		switch (token.getType()) {
			case Variable:
			case Operator:
                ...
			case Keyword:
				switch (token.getValue()) {
					case "print":
                        ...
					case "input":
                        ...
					case "if":
                        ...
					case "struct":
                        ...
					case "fun":
                        ...
					case "return":
                        ...
					case "loop":
                        ...
				}
			default:
                ...
		}
	}
}

After the loop Keyword, we read the next expression using the Dijkstra Two-Stack algorithm defined within the ExpressionReader class. If we read the For or the Iterable loop, we expect this expression to be only a variable. If we read the While loop, this expression will act as a loop condition, which can be an instance of any Expression implementation, including operator or function invocation. Therefore, we split the execution. In case the expression is the VariableExpression instance and the next lexeme is the in Keyword, we continue to read the For and Iterable loop. In the negative case, we build the WhileLoopStatement using the condition expression:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
case "loop": {
    Expression loopExpression = new ExpressionReader().readExpression();

    AbstractLoopStatement loopStatement;
    if (loopExpression instanceof VariableExpression && tokens.peek(TokenType.Keyword, "in")) {
        // loop <variable> in <bounds>
    } else {
        // loop <condition>
        loopStatement = new WhileLoopStatement(loopExpression);
    }

Inside the For and Iterable loop clause, we skip the next in Keyword lexeme, and then read the following bounds expression. This expression can be an instance of any Expression implementation which can be evaluated to the lower bound value or IterableValue only when the code will be executed. To define which loop type it is, we can rely on the next lexeme. In the case of the For loop, after the lower bound expression, we expect to read the two-dot GroupDivider as a spliterator for lower and upper bounds. In the negative condition, we build the IterableLoopStatement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
if (loopExpression instanceof VariableExpression && tokens.peek(TokenType.Keyword, "in")) {

    // loop <variable> in <bounds>
    VariableExpression variable = (VariableExpression) loopExpression;
    tokens.next(TokenType.Keyword, "in");
    Expression bounds = new ExpressionReader().readExpression();

    if (tokens.peek(TokenType.GroupDivider, "..")) {
        // loop <variable> in <lower_bound>..<upper_bound>
    } else {
        // loop <variable> in <iterable>
        loopStatement = new IterableLoopStatement(variable, bounds);
    }
}

To build the last For loop statement, we skip the two-dot GroupDivider and read the upper bound expression. In our language grammar, for the For loop, we can define the step expression with the by keyword. Therefore, if the next expression is the by lexeme, we read the following step expression. At the end, we build an instance of the ForLoopStatement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
if (tokens.peek(TokenType.GroupDivider, "..")) {
    // loop <variable> in <lower_bound>..<upper_bound>
    tokens.next(TokenType.GroupDivider, "..");
    Expression upperBound = new ExpressionReader().readExpression();

    Expression step = null;
    if (tokens.peek(TokenType.Keyword, "by")) {
        // loop <variable> in <lower_bound>..<upper_bound> by <step>
        tokens.next(TokenType.Keyword, "by");
        step = new ExpressionReader().readExpression();
    }

    loopStatement = new ForLoopStatement(variable, bounds, upperBound, step);
}

We have read every loop type declaration. Finally, we can read the body of the loop. To gather all internal statements, we invoke the recursive StatementParser#parseExpression() method, which will return an instance of the Statement interface. We should stop the process when we reach the finalizing end lexeme:

1
2
3
4
while (!tokens.peek(TokenType.Keyword, "end")) {
    Statement statement = parseExpression();
    loopStatement.addStatement(statement);
}

At the end, we skip this end keyword and return the loopStatement. The complete loop clause:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
case "loop": {
    Expression loopExpression = new ExpressionReader().readExpression();
    AbstractLoopStatement loopStatement;

    if (loopExpression instanceof VariableExpression && tokens.peek(TokenType.Keyword, "in")) {
        // loop <variable> in <bounds>
        VariableExpression variable = (VariableExpression) loopExpression;
        tokens.next(TokenType.Keyword, "in");
        Expression bounds = new ExpressionReader().readExpression();

        if (tokens.peek(TokenType.GroupDivider, "..")) {
            // loop <variable> in <lower_bound>..<upper_bound>
            tokens.next(TokenType.GroupDivider, "..");
            Expression upperBound = new ExpressionReader().readExpression();

            Expression step = null;
            if (tokens.peek(TokenType.Keyword, "by")) {
                // loop <variable> in <lower_bound>..<upper_bound> by <step>
                tokens.next(TokenType.Keyword, "by");
                step = new ExpressionReader().readExpression();
            }

            loopStatement = new ForLoopStatement(variable, bounds, upperBound, step);

        } else {
            // loop <variable> in <iterable>
            loopStatement = new IterableLoopStatement(variable, bounds);
        }

    } else {
        // loop <condition>
        loopStatement = new WhileLoopStatement(loopExpression);
    }

    while (!tokens.peek(TokenType.Keyword, "end")) {
        Statement statement = parseExpression();
        loopStatement.addStatement(statement);
    }

    tokens.next(TokenType.Keyword, "end");

    return loopStatement;
}

3 Tests

We have finished embedding loops in both lexical and syntax analyzers. Now we should be able to perform For, While and Iterable loops. Let’s create and test a more real task - the bubble sort algorithm:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
fun bubble_sort [ arr, n ]
    loop i in 0..n - 1
        loop j in 0..n - i - 1
            if arr{j+1} < arr{j} then
                temp = arr{j}
                arr{j} = arr{j + 1}
                arr{j + 1} = temp
            end
       end
    end
end

fun is_sorted [ arr, n ]
    loop i in 0..n - 1
        if arr{i} > arr{i+1} then
            return false
        end
    end
    return true
end

arr = {}
arr_len = 20
loop i in 0..arr_len by 1
    arr << i * 117 % 17 - 1
end

print arr # [-1, 14, 12, 10, 8, 6, 4, 2, 0, 15, 13, 11, 9, 7, 5, 3, 1, -1, 14, 12]
print is_sorted [ arr, arr_len ] # false

bubble_sort [ arr, arr_len ]
print arr # [-1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 14, 15]
print is_sorted [ arr, arr_len ] # true

Part 7


Building Your Own Programming Language From Scratch: Part VII - Classes

https://img.hiif.ong/img/202304142016900.jpeg

In this part of creating your own programming language, we will implement classes as an extension over the previously defined structures. Please check out the previous parts:

  1. Building Your Own Programming Language From Scratch
  2. Building Your Own Programming Language From Scratch: Part II - Dijkstra’s Two-Stack Algorithm
  3. Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads
  4. Building Your Own Programming Language From Scratch Part IV: Implementing Functions
  5. Building Your Own Programming Language From Scratch: Part V - Arrays
  6. Building Your Own Programming Language From Scratch: Part VI - Loops

The full source code is available over on GitHub.

1. Lexical Analysis

In the first section, we will cover lexical analysis. In short, it’s a process to divide the source code into language lexemes, such as keyword, variable, operator, etc.

You may remember from the previous parts that I was using the regex expressions in the TokenType enum to define all the lexeme types.

Let’s look at the following class prototype and add the missing regex parts to our TokenType lexemes:

image

  1. First, we need to put the class word in the Keyword lexeme’s expression to let the lexical analyzer know where our class declaration begins:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
package org.example.toylanguage.token;

...
public enum TokenType {
    ...
    Keyword("(if|elif|else|end|print|input|fun|return|loop|in|by|break|next|class)(?=\\s|$)"),
    ...

    private final String regex;
}
  1. Second, we need the new This lexeme as a marker for a reference to the current object:
1
2
3
4
5
6
public enum TokenType {
    ...
    This("(this)(?=,|\\s|$)");

    private final String regex;
}

2. Syntax Analysis

In the second section, we will transform the lexemes received from the lexical analyzer into the final statements following our language rules.

2.1 Definition Scope

When we declare a class or a function, this declaration should be available within the defined isolated bounds. For example, if we declare a function named turn_on [] in the following listing, it will be available for execution after declaration:

image

But if we declare the same function inside a class scope, this function will no longer be accessed directly from the main block:

image

  1. To implement these definition bounds, we’ll create the DefinitionScope class and store all the declared definitions inside two sets for classes and functions:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
package org.example.toylanguage.context.definition;

public class DefinitionScope {
    private final Set<ClassDefinition> classes;
    private final Set<FunctionDefinition> functions;

    public DefinitionScope() {
        this.classes = new HashSet<>();
        this.functions = new HashSet<>();
    }
}
  1. In addition, we may want to access the parent’s definition scope. For example, if we declare two separate classes and create an instance of the first class inside the second one:

image

To provide this ability, we will add the DefinitionScope parent instance as a reference to the upper layer, which we will use to climb to the top definition layer.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
public class DefinitionScope {
    private final Set<ClassDefinition> classes;
    private final Set<FunctionDefinition> functions;
    private final DefinitionScope parent;

    public DefinitionScope(DefinitionScope parent) {
        this.classes = new HashSet<>();
        this.functions = new HashSet<>();
        this.parent = parent;
    }
}
  1. Now, let’s finish the implementation by providing the interfaces to add a definition and retrieve it by name using parent scope:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
public class DefinitionScope {
    

    public ClassDefinition getClass(String name) {
        Optional<ClassDefinition> classDefinition = classes.stream()
                .filter(t -> t.getName().equals(name))
                .findAny();
        if (classDefinition.isPresent())
            return classDefinition.get();
        else if (parent != null)
            return parent.getClass(name);
        else
            throw new ExecutionException(String.format("Class is not defined: %s", name));
    }

    public void addClass(ClassDefinition classDefinition) {
        classes.add(classDefinition);
    }

    public FunctionDefinition getFunction(String name) {
        Optional<FunctionDefinition> functionDefinition = functions.stream()
                .filter(t -> t.getName().equals(name))
                .findAny();
        if (functionDefinition.isPresent())
            return functionDefinition.get();
        else if (parent != null)
            return parent.getFunction(name);
        else
            throw new ExecutionException(String.format("Function is not defined: %s", name));
    }

    public void addFunction(FunctionDefinition functionDefinition) {
        functions.add(functionDefinition);
    }
}
  1. Finally, to manage declared definition scopes and switch between them, we create the context class using java.util.Stack collection (LIFO):
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
package org.example.toylanguage.context.definition;

public class DefinitionContext {
    private final static Stack<DefinitionScope> scopes = new Stack<>();

    public static DefinitionScope getScope() {
        return scopes.peek();
    }

    public static DefinitionScope newScope() {
        return new DefinitionScope(scopes.isEmpty() ? null : scopes.peek());
    }

</