POPL 2024 - Artifact Evaluation

Paper artifacts are the software, mechanised proofs, test suites, and benchmarks that support a research paper and evaluate its claims. POPL has run an artifact evaluation process since 2015, and invites artifact submissions from all authors of accepted papers.

Submit your Artifact

Artifact evaluation is optional, but highly encouraged. We solicit artifacts from authors of all accepted papers. Artifacts can be software, mechanical proofs, test suites, benchmarks, or anything else that bolsters the claims of the paper, except paper proofs, which the AEC lacks the time and expertise to carefully review.

Artifact registration deadline: 2 4 October 2023 (AoE)
Artifact submission deadline: 9 October 2023 (AoE)
Artifact evaluation phase 1: 19 October 2023 (AoE)
Artifact acceptance decisions: 3 November 2023 (AoE)

Acceptance Criteria

We target a near-100% acceptance rate. Artifacts are evaluated against the criteria of:

Consistency with the claims of the paper and the results it presents
Completeness insofar as possible, supporting all evaluated claims of the paper
Documentation to allow easy reproduction by other researchers
Reusability, facilitating further research

Installing, configuring, and testing unknown software of research quality is difficult. Please see our Recommendations on packaging and documenting your artifact in a way that is easier for the AEC to evaluate.

In Memoriam

We mourn the loss to our community of Priya Srikumar, who would have been on the POPL AEC this year. If you would like to remember Priya, you can read tributes to them from members of the community here and here.

Artifact evaluation begins with authors of (conditionally) accepted POPL papers submitting artifacts on HotCRP (https://popl24ae.hotcrp.com). Artifact evaluation is optional. Authors are strongly encouraged to submit an artifact to the AEC, but not doing so will not impact their paper acceptance. Authors may, but are not required to, provide their artifacts to paper reviewers as supplemental materials.

Artifacts are submitted as a stable URL or, if that is not possible, as an uploaded archive. We recommend using a URL that you can update in response to reviewer comments, to fix issues that come up. Additionally, authors are asked to enter topics, conflicts, and “bidding instructions” to be used for assigning reviewers. You must check the “ready for review” box before the deadline for your artifact to be considered.

Artifact evaluation is single blind. Artifacts should not collect any telemetry or logging data; if that is impossible to ensure, logging data should not be accessed by authors. Any data files included with the artifact should be anonymized.

Reviewers will be instructed that they may not publicize any part of an artifact during or after completing evaluation, nor retain any part of one after evaluation. Thus, authors are free to include models, data files, proprietary binaries, etc. in your artifact.

AEC Membership

The AEC will consist of roughly 30 members, mostly senior graduate students, postdocs, and researchers. As the future of our community, graduate students will be the ones reproducing, reusing, and building upon the results published at POPL. They are also better positioned to handle the diversity of systems that artifacts span.

Participation in the AEC demonstrates the value of artifacts, provides early experience with the peer review process, and establishes community norms. We therefore seek to include a broad cross-section of the POPL community on the AEC.

Two-Phase Evaluation

This year, the artifact evaluation process will proceed in two phases.

In the first “kick the tires” phase reviewers download and install the artifact (if relevant) and exercise the basic functionality of the artifact to ensure that it works. We recommend authors include explicit instructions for this step. Failing the first phase—so that reviewers are unable to download and install the artifact—will prevent the artifact from being accepted.

In the second “evaluation” phase reviewers systematically evaluate all claims in the paper via procedures included in the artifact to ensure consistency, completeness, documentation, and reusability. We recommend authors list all claims in the paper and indicate how to evaluate each claim using the artifact.

Reviewers and authors will communicate back and forth during the review process over HotCRP. We have set up HotCRP to allow reviewers to ask questions or raise issues: those questions and issues will immediately be forwarded to authors, who will be able to answer questions or implement fixes.

After the two-phase evaluation process, the AEC will discuss each artifact and notify authors of the final decision.

Two days separate the AEC notification from the camera ready deadline for accepted papers. This gap allows authors time to update their papers to indicate artifact acceptance.

Badges

The AEC will award three ACM standard badges. Badges are added to papers by the publisher, not by the authors.

ACM’s Artifacts Evaluated – Reusable Badge
ACM’s Artifacts Evaluated – Functional Badge
ACM’s Artifacts Available Badge

Per ACM policy (https://www.acm.org/publications/policies/artifact-review-and-badging-current), papers will receive at most one of the Reusable and Functional badges.

In particular, artifacts that receive above average scores and are made available in a way that enables reuse (including source code availability, open source licensing, and an open issue tracker), such as via GitHub, GitLab, or BitBucket, will be awarded the “Artifacts Evaluated - Reusable” badge.

Artifacts which pass review but do not qualify for the Reusable badge will be awarded the “Artifacts Evaluated - Functional” badge instead.

Finally, artifacts which the authors make available eternally on a publicly accessible archival repository such as Zenodo or ACM DL will also receive ACM’s “Artifacts Available” badge. An immutable snapshot does not prevent authors from also distributing their code in another way, on an open source platform or on their personal websites.

The following recommendations will help you package and document your artifact in a way that makes a successful evaluation most likely. None of these guidelines are mandatory – diverging from them will not disqualify your artifact.

Timeline and general advice

As you prepare to package your artifact, keep in mind that good artifacts take time to create! In our experience, you should expect to spend at least 2-3 workdays packaging your artifact, a large portion of which should be spent writing and testing your detailed instructions, provided in a README file.

In our experience, the key to a successful artifact evaluation is a good README file! Reviewers (and future researchers) will appreciate long, detailed, and clearly organized instructions which describe every aspect of your artifact in detail – including, e.g., shell commands to run, files to open, and how long these will take. This is not only to help artifact evaluation go smoothly – it provides confidence that members of the community will be able to replicate your results and use your tool for their own work in the future.

Packaging

We recommend creating a single web page at a stable URL from which reviewers can download the artifact and which also hosts a README file with instructions for installing and using the artifact. Having a stable URL, instead of uploading an archive, allows you to update the artifact in response to issues that come up.

We recommend using Zenodo to create the stable URL mentioned above for your artifact at submission time. Not only can you upload multiple versions in response to reviewer comments, you can use the same stable URL when publishing your paper to avoid uploading your artifact twice.

We recommend (but do not require) packaging your artifact as a virtual machine image. Virtual machine images avoid issues with differing operating systems, versions, or dependencies. Other options for artifacts (such as source code, binary installers, web versions, or screencasts) are acceptable but generally cause more issues for reviewers and thus more issues for you. Virtual machines also protect reviewers from malfunctioning artifacts damaging their computer.

It may be appropriate not to use a virtual machine in cases where the software has very few dependencies and requires only a working installation of a single programming language or package manager – e.g., Cargo or OPAM. However, in these cases please document the installation of your artifact and required version(s) of everything via clear, step-by-step instructions. Additionally, make sure to test the instructions yourself on a fresh machine without any software installed. If the reviewers are unable to replicate your setup during the kick-the-tires phase, they may ask you to provide a virtual machine.

VirtualBox is free and widely-available virtual machine host software; it is a good choice. Recent graphical linux releases, such as Ubuntu 20.04 LTS, are good choices for the guest OS: the reviewer can easily navigate the image or install additional tools, and the resulting virtual machines are not too large to download.

Be warned that Docker images are not cross-platform out of the box! Due to cross-platform compatibility issues on recent Macs (e.g., M1), Docker builds a separate image depending on the target platform, and may simply fail to run on a different machine. For these reasons, we do not recommend using Docker. If you must use Docker, there is a (more complicated) process to bundle multiple-platform images at this link.

The virtual machine should contain everything necessary for artifact evaluation: your source code, any compiler and build system your software needs, any platforms your artifact runs on (a proof assistant, or Cabal, or the Android emulator), any data or benchmarks, and all the tools you use to summarize the results of experiments such as plotting software. Execute all the steps you expect the reviewer to do in the virtual machine: the virtual machine should already have dependencies pre-downloaded, the software pre-compiled, and output files pre-generated. This quick-starts the evaluation.

Do not include anything in the artifact that would compromise reviewer anonymity, such as telemetry or analytics. If there is anything the reviewers should be worried about, please provide a clear disclaimer at the top of your artifact.

Further advice for particular types of artifacts can be found online:

Documentation

You should provide a top-level documentation in a clearly marked location; typically, as a single README file. The HotCRP submission will also provide a field where you can enter any additional instructions to access the README.

Besides the artifact itself, we recommend your documentation contain four sections:

A complete list of claims made by your paper
Download, installation, and sanity-testing instructions
Evaluation instructions
Additional artifact description (file structure, extending the tool or adding your own examples, etc.)

Artifact submissions are single-blind: reviewers will know the authors for each artifact, so there is no need to expend effort to anonymize the artifact. If you have any questions about how best to package your artifact, please don’t hesitate to contact the AEC chairs, at leonidas@umd.edu and cdstanford@ucdavis.edu.

List of claims

The list of claims should list all claims made in the paper. For each claim, provide a reference to the claim in the paper, the portion of the artifact evaluating that claim, and the evaluation instructions for evaluating that claim. The artifact need not support every claim in the paper; when evaluating the completeness of an artifact, reviewers will weigh the centrality and importance of the supported claims. Listing each claim individually provides the reviewer with a checklist to follow during the second, evaluation phase of the process. Organize the list of claims by section and subsection of the paper. A claim might read,

Theorem 12 from Section 5.2 of the paper corresponds to the theorem “foo” in the Coq file “src/Blah.v” and is evaluated in Step 7 of the evaluation instructions.

Some artifacts may attempt to perform malicious operations by design. Boldly and explicitly flag this in the instructions so AEC members can take appropriate precautions before installing and running these artifacts.

Reviewers expect artifacts to be buggy, immature, and have obscure error messages. Explicitly listing all claims allows the author to delineate which bugs invalidate the paper’s results and which are simply a normal part of the software engineering process.

Download, installation, and sanity-testing

The download, installation, and sanity-testing instructions should contain complete instructions for obtaining a copy of the artifact and ensuring that it works. List any software the reviewer will need (such as virtual machine host software) along with version numbers and platforms that are known to work. Then list all files the reviewer will need to download (such as the virtual machine image) before beginning. Downloads take time, and reviewers prefer to complete all downloads before beginning evaluation.

Note the guest OS used in the virtual machine, and any unusual modifications made to it. Explain its directory layout. It’s a good idea to put your artifact on the desktop of a graphical guest OS or in the home directory of a terminal-only guest OS.

Installation and sanity-testing instructions should list all steps necessary to set up the artifact and ensure that it works. This includes explaining how to invoke the build system; how to run the artifact on small test cases, benchmarks, or proofs; and the expected output. Your instructions should make clear which directory to run each command from, what output files it generates, and how to compare those output files to the paper. If your artifact generates plots, the sanity testing instructions should check that the plotting software works and the plots can be viewed.

Helper scripts that automate building the artifact, running it, and viewing the results can help reviewers out. Test those scripts carefully—what do they do if run twice?

Aim for the download, installation, and sanity-testing instructions to be completable in about a half hour. Remember that reviewers will not know what error messages mean or how to circumvent errors. The more foolproof the artifact, the easier evaluation will be for them and for you.

Evaluation instructions

The evaluation instructions should describe how to run the complete artifact, end to end, and then evaluate each claim in the paper that the artifact supports. This section often takes the form of a series of commands that generate evaluation data, and then a claim-by-claim list of how to check that the evaluation data is similar to the claims in the paper.

For each command, note the output files it writes to, so the reviewer knows where to find the results. If possible, generate data in the same format and organization as in the paper: for a table, include a script that generates a similar table, and for a plot, generate a similar plot.

Indicate how similar you expect the artifact results to be. Program speed usually differs in a virtual machine, and this may lead to, for example, more timeouts. Indicate how many you expect. You might write, for example:

The paper claims 970/1031 benchmarks pass (claim 5). Because the program runs slower in a virtual machine, more benchmarks time out, so as few as 930 may pass.

Reviewers must use their judgement to check if the suggested comparison is reasonable, but the author can provide expert guidance to set expectations.

Explicitly include commands that check soundness, such as counting admits in a Coq code base. Explain any checks that fail.

Aim for the evaluation instructions to take no more than a few hours. Clearly note steps that take more than a few minutes to complete. If the artifact cannot be evaluated in a few hours (experiments that require days to run, for example) consider an alternative artifact format, like a screencast.

Additional artifact description

The additional description should explain how the artifact is organized, which scripts and source files correspond to which experiments and components in the paper, and how reviewers can try their own inputs to the artifact. For a mechanical proof, this section can point the reviewer to key definitions and theorems.

Expect reviewers to examine this section if something goes wrong (an unexpected error, for example) or if they are satisfied with the artifact and want to explore it further.

Reviewers expect that new inputs can trigger bugs, flag warnings, or behave oddly. However, describing the artifact’s organization lends credence to claims of reusability. Reviewers may also want to examine components of the artifact that interest them.

Remember that the AEC is attempting to determine whether the artifact meets the expectations set by the paper. (The instructions to the committee are available here.) Package your artifact to help the committee easily evaluate this.

Introduction

Thank you for volunteering to serve on the POPL Artifact Evaluation Committee. Artifacts are an important product of the scientific process, and it is your goal, as members of the AEC, to study these artifacts and determine whether they meet the expectations laid out in the paper and adequately support its central claims.

General Guidelines and Advice

Since reviewers sometimes differ on what artifact evaluation is (or should be), please keep in mind the following general guidelines and advice:

Artifact evaluation is a collaborative process with the authors! Our goal is not necessarily to find flaws with the artifact, but to try to the best of our ability to validate the artifact’s claims. This means that back-and-forth discussion is always a good idea if there are issues, particularly with installation.
Artifact evaluation is not an examination of the scientific merits of the paper. In other words, it is within scope to evaluate the artifact (towards the functional and reusable badges), but not to evaluate the paper itself.

Basically, in cooperation with the authors, and without sacrificing scientific rigor, we aim to accept all artifacts which support the claims laid out in the paper.

Badges and Acceptance Criteria

We will be awarding two badges, Functional and Reusable, based on the following considerations:

Functional artifacts: Does the artifact completely support all evaluated claims in the paper, and are the results from running the artifact consistent with the claims made in the paper?
Reusable artifacts: Are you able to modify the artifact to solve problems and benchmarks different from those in the paper? Is the artifact sufficiently well-documented to support reuse and future research?

Note that award of the Reusable badge is strictly superior to merely awarding the Functional badge. Accordingly, the documentation requirements for the Reusable badge are higher than the requirements for award of the Functional badge.

In general, we aim for most or all artifacts to receive the functional badge, as long as the claims can be evaluated (even with some difficulty); but the reusable badge is more at the reviewers’ discretion.

Overview of Process

Bidding

During the bidding process, you will examine the list of papers, and place bids for the artifacts that most closely match your research background, interests, and experience. Based on your bids, and depending on which paper authors actually choose to submit artifacts, we will assign you with two or three artifacts to review.

After artifacts are assigned, we organize the evaluation process as the following sequence of three milestones:

Milestone 1: Kick the Tires (By Oct 18-19)

Research software is delicate and needs careful setup. In order to ease this process, in the first phase of artifact evaluation, you will be expected to at least install the artifact and run a minimum set of commands (usually provided in the README by the authors) to sanity check that the artifact is correctly installed.

Here is a suggested process with some questions you can try to answer.

After reading the paper:

Q1: What is the central contribution of the paper?
Q2: What claims do the authors make of the artifact, and how does it connect to Q1 above?
Q3: Can you locate the specific, significant experimental claims made in the paper (such as figures, tables, etc.)?
Q4: What do you expect as a reasonable range of deviations for the experimental results?

After installing the artifact:

Q5: Are you able to install and test the artifact as indicated by the authors in their “kick the tires” instructions?
Q6: Are there any significant modifications you needed to make to the artifact while answering Q5?
Q7: For each claim highlighted in Q3 above, do you know how to reproduce the result, using the artifact?
Q8: Is there anything else that the authors or other reviewers should be aware of?

During the process, you can leave a comment on HotCRP indicating success, or ask the authors questions. These questions can concern unclear commands, or error messages that you encounter. The authors will have a chance to respond, fix bugs in their artifact or distribution, or make additional clarifications. Errors at this stage will not be counted against the artifact. Remember, the evaluation process is cooperative!

Milestone 2: Evaluating Functionality (By Oct 30)

After the kick-the-tires phase, you will perform an actual review of the artifact.

During this phase, here is a suggested list of questions to answer:

Q9: Does the artifact provide evidence for all the claims you noted in Q3? This corresponds to the completeness criterion of your evaluation.
Q10: Do the results of running / examining the artifact meet your expectations after having read the paper? This corresponds to the criterion of consistency between the paper and the artifact
Q11: Is the artifact well-documented, to the extent that answering questions Q5–Q10 is straightforward? (Note: by well-documented, for this stage, we are considering generally only the README and instructions – we don’t mean that the code itself needs to be documented. That would matter only for reusability if the intention would be to modify the code in some way.)

Note that for some papers, depending on the type of artifact and its specific claims, questions Q1–Q11 may be inappropriate or irrelevant. In these cases, we encourage you to disregard our suggestions and review the artifact as you think is most appropriate.

Milestone 3: Evaluating Reusability (By Oct 30)

Finally, you will evaluate artifacts for reusability in new settings. To evaluate reusability, we suggest considering the following questions:

Q12: Are you able to modify the benchmarks / artifact to run simple additional experiments, similar to, but beyond those discussed in the paper?
Q13: If you were doing follow-up research in this area, do you think you would be able to reuse the paper as a baseline in your own work?

Writing Reviews (By Oct 30) and Discussion (Oct 30–Nov 1)

You will write a review by drawing on your answers to Q1–Q13. Make sure to include a specific recommendation of whether to award a badge, and of which badge(s) to award. Once you submit your draft review, we will make all reviews public, so you have a chance to discuss the artifact with other reviewers, and reach a consensus. You should feel free to change your mind or revise your review during these discussions.

That’s it! You can always reach out to the POPL AEC chairs to discuss any specific questions or concerns.

We believe the dissemination of artifacts benefits our science and engineering as a whole. Their availability improves replicability and reproducibility, and enables authors to build on top of each others’ work. It can also help more unambiguously resolve questions about cases not considered by the original authors.

The goal of the artifact evaluation process is two-fold: to both reward and probe. Our primary goal is to reward authors who take the trouble to create useful artifacts beyond the paper. Sometimes the software tools that accompany the paper take years to build; in many such cases, authors who go to this trouble should be rewarded for setting high standards and creating systems that others in the community can build on. Conversely, authors sometimes take liberties in describing the status of their artifacts—claims they would temper if they knew the artifacts are going to be scrutinized. This leads to more accurate reporting.

Our hope is that eventually, the assessment of a paper’s accompanying artifacts will guide the decision-making about papers: that is, the Artifact Evaluation Committee (AEC) would inform and advise the Program Committee (PC). This would, however, represent a radical shift in our conference evaluation processes; we would rather proceed gradually. Thus, artifact evaluation is optional, and authors choose to undergo evaluation only after their paper has been conditionally accepted. Nonetheless, feedback from the Artifact Evaluation Committee can help improve the both the final version of the paper and any publicly released artifacts. The authors are free to take or ignore the AEC feedback at their discretion.

Beyond helping the community as a whole, it confers several direct and indirect benefits to the authors themselves. The most direct benefit is, of course, the recognition that the authors accrue. But the very act of creating a bundle that can be evaluated by the AEC confers several benefits:

The same bundle can be publicly released or distributed to third-parties.
A bundle can be used subsequently for later experiments (e.g., on new parameters).
The bundle simplifies having to re-run the system subsequently when, say, having to respond to a journal reviewer’s questions.
The bundle is more likely to survive being put in storage between the departure of one student and the arrival of the next.

Creating a bundle that meets all these properties can be onerous and therefore, the process we describe below does not require an artifact to have all these properties. It offers a route to evaluation that confers fewer benefits for vastly less effort.

Artifact EvaluationPOPL 2024

Process

Author Recommendations

Reviewer Guidelines

Background

Abhishek Kr Singh

Tel Aviv University

Israel

Andong Fan

HKUST (The Hong Kong University of Science and Technology)

China

Jaime Arias

CNRS; LIPN; Université Sorbonne Paris Nord

France

Martin Blicha

Breandan Considine

McGill University

Canada

Baber Rehman

The University of Hong Kong

China

Daniel Frumin

University of Groningen

Netherlands

Lef Ioannidis

University of Pennsylvania

United States

Hengchu Zhang

Jinhao Tan

University of Hong Kong

Jingmei Hu

Amazon

Karuna Grewal

Kinan Dak Albab

Dominik Klumpp

University of Freiburg

Germany

Lina Marsso

University of Toronto

Canada

Litao Zhou

Shanghai Jiao Tong University; University of Hong Kong

China

Li-yao Xia

University of Edinburgh

Yishuai Li

Alibaba Group

China

Matteo Busi

University Ca' Foscari, Venice

Italy

Orestis Melkonian

University of Edinburgh

United Kingdom

Meghana Aparna Sistla

The University of Texas at Austin

United States

Deyuan (Mike) He

Princeton University

United States

Nick Feng

University of Toronto

Nicolas Chappe

Neea Rusch

Augusta University

United States

Peixin Wang

University of Oxford

Priya Srikumar

Qiaochu Chen

Rini Banerjee

University of Cambridge

United Kingdom

Rui Dong

University of Michigan

United States

Sander Huyghebaert

Vrije Universiteit Brussel

Stefan Zetzsche

Amazon Web Services