Adding a new rule to Jarl

This page will explain how to implement a new rule in Jarl. It is recommended to read the General information page first to install the required tools. Jarl is written in Rust, but this page will not explain how to set up or use Rust, this is an entirely different topic. To get started with Rust, check out the Rust book.

Getting ready

Find the rule

So far, most (if not all) rules in Jarl come from the list of rules available in lintr, so this is the first place to explore. If you want to add a rule that is not in lintr, please open an issue first.

Note that not all lintr rules are suitable for Jarl. In particular, rules that are only about formatting (spaces before parenthesis, newlines between arguments, etc.) are out of scope for Jarl. Moreover, you should look for rules that require “pattern detection” only, meaning that they don’t need information about the rest of the code (or only very little). For example, unreachable_code is out of scope for now because we need a way to analyze the rest of the code, which Jarl doesn’t have so far. If you are unsure about whether a rule can or should be implemented, open an issue first.

Get familiar with the rule

You may know the most common cases of this rule, but there might exist many corner cases making its implementation difficult. Take a look at the relevant test file in the lintr test suite to know more about those corner cases.

Additionally, Jarl uses tree-sitter under the hood to parse and navigate the Abstract Syntax Tree (AST) of the code. Having an idea of what this AST looks like is important when implementing the rule. I suggest creating a small test R file containing one or two examples of code that violate this rule. If you have the Air extension installed, you can do the command “Air: View Syntax Tree” to display the AST next to the code.

For example, the code any(x > 10) gives this AST:

0: R_ROOT@0..12
  0: (empty)
  1: R_EXPRESSION_LIST@0..11
    0: R_CALL@0..11
      0: R_IDENTIFIER@0..3
        0: IDENT@0..3 "any" [] []
      1: R_CALL_ARGUMENTS@3..11
        0: L_PAREN@3..4 "(" [] []
        1: R_ARGUMENT_LIST@4..10
          0: R_ARGUMENT@4..10
            0: (empty)
            1: R_BINARY_EXPRESSION@4..10
              0: R_IDENTIFIER@4..5
                0: IDENT@4..5 "x" [] []
              1: GREATER_THAN@5..7 ">" [Whitespace(" ")] []
              2: R_DOUBLE_VALUE@7..10
                0: R_DOUBLE_LITERAL@7..10 "10" [Whitespace(" ")] []
        2: R_PAREN@10..11 ")" [] []
  2: EOF@11..12 "" [Newline("\n")] []

We can see that this is indeed a tree where each element can have “parents” and “children” nodes. It also shows that the textual representation of the code doesn’t matter here: you could write any(x > 10) or any ( x > 10) and the AST would be the same (with the exception of the numbers representing the location of each node in text). This representation is important to keep in mind when adding a new rule: Jarl only checks the text in very specific cases, for example when we want to get the function name to know if we should apply a rule or not. Most of the time, the RSyntaxKind (R_CALL, R_BINARY_EXPRESSION, R_IDENTIFIER, etc.) is used.

Get up and running with Rust and Jarl

You should have installed Rust and cloned Jarl. Do cargo check or cargo test to know if you are correctly set up.

Adding a new rule: basic steps

As an example for this entire tutorial, we will analyze PR #182, which added the rule list2df. This PR adds a rule to replace calls like do.call(cbind.data.frame, x) by list2DF(x). Importantly, list2DF() was added in R 4.0.0. I encourage you to check this PR as you advance in this tutorial.

Here’s a basic idea of the workflow to add a new rule:

add it to the general list of rules
add it to the list of rules for the specific kind of node it targets (function calls, if conditions, for loops, etc.)
implement the rule
add tests
document the rule
final polishing

From now on, all file paths refer to the subfolder crates/jarl-core.

Add the new rule to the list of rules

There are two places to modify: lints/mod.rs and one file in the analyze folder.

lints/mod.rs contains the list of all rules provided by Jarl. We can add a rule to the list:

pub(crate) mod list2df;

...

rule_table.enable("list2df", "PERF,READ", FixStatus::Safe, Some((4, 0, 0)));

This contains the rule name, the categories it belongs to (those are described above in the file), whether it has a safe fix, an unsafe fix, or no fix, and the optional R version from which it is available.

The file to modify in the analyze folder will depend on the rule: here, we look for calls to do.call(). The arguments passed to the function are irrelevant, what matters is that this is a call, so we will modify the file analyze/call.rs:

use crate::lints::list2df::list2df::list2df;

...

if checker.is_rule_enabled("list2df") && !checker.should_skip_rule(node, "list2df") {
    checker.report_diagnostic(list2df(r_expr)?);
}

Implement the rule

This is the hard part of the process. It requires knowledge about the AST you want to parse and about the different functions available to us to navigate this AST. The rule definition must be located in lints/<rule_name>/<rule_name>.rs, so in this example in lints/list2df/list2df.rs.

Let’s start with a skeleton of this file:

use crate::diagnostic::*;
use crate::utils::{get_arg_by_name_then_position, get_arg_by_position, node_contains_comments};
use air_r_syntax::*;
use biome_rowan::AstNode;

pub struct List2Df;

/// ## What it does
///
/// Checks for usage of `do.call(cbind.data.frame, x)`.
///
/// [...]
impl Violation for List2Df {
    fn name(&self) -> String {
        "list2df".to_string()
    }
    fn body(&self) -> String {
        "`do.call(cbind.data.frame, x)` is inefficient and can be hard to read.".to_string()
    }
    fn suggestion(&self) -> Option<String> {
        Some("Use `list2DF(x)` instead.".to_string())
    }
}

pub fn list2df(ast: &RCall) -> anyhow::Result<Option<Diagnostic>> {

}

Let’s analyze this by blocks:

the first lines import required crates and functions, and define a struct using the rule name (in TitleCase);
then there is some documentation (truncated here for conciseness);
the impl block is where we define the name and the main message (body) that will be used in the output of Jarl. Note that there is also a suggestion() function which is not always necessary.
finally, we define the function where we parse the AST.

About impl Violation

If you explore other rules implementation, you might notice that the impl Violation block is sometimes missing. This is because in some cases, the message and/or the suggestion depend on the AST itself. For example, for the assignment rule, the message will recommend the use of <- or = depending on the user settings.

In this scenario, the name, body, and suggestion are defined at the very end, when we build the Diagnostic.

Writing this function is the hard part, so let’s focus on this. We start by extracting the important information from the RCall object. In this example, we need both the function name and the arguments:

let function = ast.function()?;
let arguments = ast.arguments()?;

Note that it is sometimes shorter to use the destructuring syntax, as follows:

let RCallFields { function, arguments } = ast.as_fields();
let function = function?;
let arguments = arguments?.items();

Usually, a rule implementation contains a lot of early returns, such as “if the function name is not ‘xyz’ then stop here”. In this example, we want to focus on calls to do.call(), so we can stop early if this is not the function name:

if function.to_trimmed_text() != "do.call" {
    return Ok(None);
}

Past that point, the next step is to check that the arguments correspond to what we want to analyze. do.call has four arguments: what, args, quote, and envir. We are looking for patterns such as do.call(cbind.data.frame, x) so we want information on the first two arguments. We can use a helper function called get_arg_by_name_then_position():

// Note that the arguments position is 1-indexed and not 0-indexed as is usually
// the case in Rust.
let what = get_arg_by_name_then_position(&arguments, "what", 1);
let args = get_arg_by_name_then_position(&arguments, "args", 2);

This function returns an Option since the arguments we want to extract maybe do not exist in the code we parsed.

About helper functions

In this example, we used get_arg_by_name_then_position(). There exist other helper functions located in utils.rs. For example, get_nested_functions_content() is used in any_is_na (among other rules) to either get the content of any(is.na(...)) or exit the function early.

We can now do more early checks:

// Ensure there isn't more than two arguments, as we probably cannot discard
// `quote` and `envir` if they are specified in `do.call()`.
if get_arg_by_position(&arguments, 3).is_some() {
    return Ok(None);
}

if args.is_none() {
    return Ok(None);
}

// Return early for calls such as `do.call(foo, x)`.
if let Some(what) = what && let Some(what_value) = what.value() {
    let txt = what_value.to_trimmed_text();
    // `do.call()` accepts quoted function names.
    if txt != "cbind.data.frame"
        && txt != "\"cbind.data.frame\""
        && txt != "\'cbind.data.frame\'"
    {
        return Ok(None);
    }
} else {
    return Ok(None);
};

The block above ensures that we have a call to do.call() with two arguments only and that the argument what is one of cbind.data.frame, "cbind.data.frame", or 'cbind.data.frame' (because do.call() also accepts quoted function names).

We now reach the last step, which is building the automatic fix. This requires getting the value of args because we want to use it in list2DF():

// Safety: we checked above that args is not None.
let args = args.unwrap().value();
// Here, `args` can be None if the call is something like
// `do.call(cbind.data.frame, args = )`. This is not valid R code, but we don't
// want to panic if it is checked by Jarl.
if args.is_none() {
    return Ok(None);
}
let fix_content = args.unwrap();

And finally, we build the diagnostic:

let range = ast.syntax().text_trimmed_range();
let diagnostic = Diagnostic::new(
    List2Df,
    range,
    Fix {
        content: format!("list2DF({})", fix_content.to_trimmed_text()),
        start: range.start().into(),
        end: range.end().into(),
        to_skip: node_contains_comments(ast.syntax()),
    },
);

Ok(Some(diagnostic))

All diagnostics contain a Violation (we defined the one for List2Df just below the documentation), a range indicating where it is located in the code, and a Fix (which may be Fix::Empty() if there is no automatic fix).

Finally, note that Fix has a field to_skip: node_contains_comments(ast.syntax()). This tells Jarl not to apply the automatic fix if the node in question contains a comment. Handling comments positions in automatic fixes is quite complicated so, for now, fixes are not applied if the node contains a comment, e.g.:

# This code wouldn't be automatically fixed because we don't know where the
# comment inside should go.
do.call(
    cbind.data.frame,
    # This is a comment to describe `x`.
    x
)

At this point, if you have an R file with a couple of examples that should be reported (e.g. test.R), you can use cargo run --bin jarl -- check test.R (the rule in this example is only valid for R >= 4.0.0, so we also need --min-r-version 4.1 for instance).

Add tests

Tests for each rule are stored in lints/<rule_name>/mod.rs. It is important to test cases where we expect the rule that we just defined to be violated, and to test cases where we don’t expect this violation. Looking at tests for list2df, there are three blocks:

first, we check cases where we don’t expect rule violations:

#[test]
fn test_no_lint_list2df() {
    expect_no_lint("cbind.data.frame(x, x)", "list2df", Some("4.0"));
    [...]

    // Ignored if R version unknown or below 4.0.0
    expect_no_lint("do.call(cbind.data.frame, x)", "list2df", Some("3.5"));
    [...]

    // Don't know how to handle additional comments
    expect_no_lint(
        "do.call(cbind.data.frame, x, quote = TRUE)",
        "list2df",
        Some("4.0"),
    );

    // Ensure that wrong calls are not reported
    expect_no_lint("do.call(cbind.data.frame)", "list2df", Some("4.0"));
    [...]
}

second, we check cases where we expect rule violations. We check first that the expected message is displayed, and then we check that the automated fix works correctly with a snapshot test:

#[test]
fn test_lint_list2df() {
    use insta::assert_snapshot;

    let expected_message = "Use `list2DF(x)` instead";
    expect_lint(
        "do.call(cbind.data.frame, x)",
        expected_message,
        "list2df",
        Some("4.0"),
    );
    [...]

    assert_snapshot!(
        "fix_output",
        get_fixed_text(
            vec![
                "do.call(cbind.data.frame, x)",
                [...]
            ],
            "list2df",
            Some("4.0")
        )
    );
}

finally, if the rule has an automatic fix, we check that having a comment in the middle of the code in question does not modify this code. Handling comments in automatic fixes is difficult and is left as an objective for the future.

#[test]
fn test_list2df_with_comments_no_fix() {
    use insta::assert_snapshot;
    // Should detect lint but skip fix when comments are present to avoid destroying them
    expect_lint(
        "do.call(\n # a comment\ncbind.data.frame, x)",
        "Use `list2DF(x)` instead",
        "list2df",
        Some("4.0"),
    );
    assert_snapshot!(
        "no_fix_with_comments",
        get_fixed_text(
            vec![
                "# leading comment\ndo.call(cbind.data.frame, x)",
                "do.call(\n # a comment\ncbind.data.frame, x)",
                "do.call(cbind.data.frame, x) # trailing comment",
            ],
            "list2df",
            Some("4.0")
        )
    );
}

Since we have snapshot tests, we first need to run cargo insta test to generate the snapshots and then cargo insta review to review and validate them. After that, run cargo test to ensure that all tests pass.

All the rest

The rule is implemented, all tests pass, perfect! We now need to document this change:

update CHANGELOG.md
update docs/rules.md

If you have installed just as recommended, you can now run just document to update the website and just lint to check that the code has no lints and is properly formatted.

Proposing your changes

Once all of this is done, it is time to open a PR!

Note: if you need some guidance, early feedback, or simply want to store your changes in a branch, you can also open an incomplete PR.

PR title

Jarl follows conventional commits, meaning that your PR must start with “feat:”, “fix:”, or another appropriate name (see the linked documentation). In this case, the PR is titled “feat: Add list2df_linter”.

PR automated comments

Once you have opened a PR, you will receive three automated comments after a few minutes:

code coverage: this checks that all the lines you added are covered by some tests. Try to ensure this is at 100%.
ecosystem checks: every time there is a change in jarl-core, Jarl is run on several R packages and the results are compared to the Jarl version on the main branch. There will be a comment indicating if your changes have revealed new violations or removed some violations in those repositories. Here, we added a rule so we expect either no changes or more violations. New violations will be printed with a link to the lines of code that trigger them, so check a few to ensure those are not false positives.
benchmark: this is usually irrelevant when adding a new rule, it is simply to ensure there is no catastrophic performance degradation.

Congrats, and thanks for your contribution!