qlm_trail() no longer emits "unknown column" warnings or crashes with
"the condition has length > 1" when passed a qlm_comparison or
qlm_validation object. The trail now stores these (and qlm_coded)
objects as-is rather than copying selected fields into a parallel
structure, so they round-trip with their class and metadata intact and
can be extracted from the trail for replication without modification
(#93).
qlm_validate(..., average = "none") was reporting per-class
precision and recall swapped: the helper that derived FP and FN
from the confusion matrix had its row and column sums transposed
relative to the orientation produced by yardstick::conf_mat().
Macro-averaged precision/recall (computed via yardstick directly)
were correct; only the per-class breakdown was affected.
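The orientation issue behind this fix can be illustrated with a small base R sketch. It is not the package's internal helper; the toy labels are invented, and the only assumption carried over from the entry above is that rows hold predictions and columns hold the truth, as in yardstick::conf_mat().

```r
# Toy labels (invented for illustration), sharing one set of factor levels.
lv    <- c("neg", "neu", "pos")
truth <- factor(c("neg", "neg", "neg", "neu", "neu", "pos", "pos", "pos"), levels = lv)
pred  <- factor(c("neg", "neg", "neu", "neu", "neu", "pos", "neu", "neu"), levels = lv)

# Rows = prediction, columns = truth.
cm <- table(Prediction = pred, Truth = truth)

tp <- diag(cm)
fp <- rowSums(cm) - tp  # predicted as class k, truly another class
fn <- colSums(cm) - tp  # truly class k, predicted as another class

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# Transposing the sums (colSums for FP, rowSums for FN) swaps the two vectors.
swapped_precision <- tp / (tp + (colSums(cm) - tp))
```

With these toy labels, swapped_precision equals recall, which is the symptom the entry describes.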
qlm_compare() gains a by_category = FALSE argument that, when
set to TRUE, reports per-category reliability rows for nominal
data: Krippendorff's alpha (alpha_per_value[k], each category
dichotomised against all others), kappa (kappa_per_value[k],
Cohen's κ via dichotomise-and-recompute for two raters or Fleiss'
Eqs. 20-21 for three or more), and alpha_u_per_value[k] for
unitizing comparisons. The marginal count n is reported in the
docid column. Per-category rows are only produced for
nominal-level data (#112).
All reliability statistics are now native R implementations, derived
directly from their source papers; the package no longer depends on
irr. Each function returns a uniform list shape (method, value,
ci_lower/ci_upper, per_value, n_observers, n_units,
n_pairable) plus measure-specific fields (#112):
reliability_alpha(): Krippendorff (2019, Ch. 12) for predefined
units; nominal/ordinal/interval/ratio metrics; per-category α for
nominal data; verified against book worked examples §12.3.1,
§12.3.4.1, §12.3.4.4.
reliability_alpha_u(): Krippendorff's α for unitizing
continua; one call returns all variants (value for _u_α,
binary for |_u_α, cu_nominal for _cu_α, plus per_value).
reliability_kappa(): Cohen (1960) with unweighted, linear, and
quadratic weighted variants; analytic SE/CI for unweighted;
per-category κ via dichotomisation.
reliability_kappa_fleiss(): Fleiss (1971) for many raters with
analytic SE and per-category κⱼ.
reliability_kendall_w(): Kendall & Smith (1939) with automatic
tie correction; verified against Kendall & Gibbons (1990) Ch. 6.
reliability_icc(): all six ICC forms (Shrout & Fleiss 1979;
McGraw & Wong 1996); verified against Shrout & Fleiss Table 4.
qlm_compare() standardises on subjects × raters matrix input
internally, removing the transpose step previously needed for
irr::kripp.alpha.
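The "dichotomise-and-recompute" approach to per-category κ for two raters can be sketched in base R as follows. This is an illustrative sketch, not the code of reliability_kappa(); the helper names cohen_kappa() and kappa_per_category() and the toy ratings are hypothetical.

```r
# Unweighted Cohen's kappa for two vectors of ratings (hypothetical helper).
cohen_kappa <- function(a, b) {
  lv  <- union(unique(a), unique(b))
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  p   <- tab / sum(tab)
  po  <- sum(diag(p))                 # observed agreement
  pe  <- sum(rowSums(p) * colSums(p)) # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Per-category kappa: recode each category as "k vs. everything else",
# then recompute kappa on the resulting two-level ratings.
kappa_per_category <- function(a, b) {
  cats <- sort(union(unique(a), unique(b)))
  vapply(cats, function(k) cohen_kappa(a == k, b == k), numeric(1))
}

# Toy ratings from two raters (invented for illustration).
r1 <- c("a", "a", "b", "b", "c", "c", "a", "b")
r2 <- c("a", "b", "b", "b", "c", "a", "a", "b")

overall <- cohen_kappa(r1, r2)
per_cat <- kappa_per_category(r1, r2)
```

A category that every unit receives from both raters gives an expected agreement of 1 and therefore an undefined κ; the sketch does not handle that edge case.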
qlm_validate() no longer relies on yardstick. Accuracy, MAE,
RMSE, and the confusion matrix are computed inline from base R;
multi-class precision, recall, and F-measure are now provided by
internal metric_precision(), metric_recall(), and
metric_f_meas() supporting all four standard estimators
(binary, macro, macro_weighted, micro). Confusion matrix,
micro and macro precision/recall follow Sokolova & Lapalme (2009),
Tables 1-3; macro F-measure is the arithmetic mean of per-class
F-scores (Manning, Raghavan & Schütze 2008, ch. 13), matching the
yardstick / scikit-learn convention. Output verified identical to
yardstick's on both the binary case and a 4-class noisy
multi-class example. yardstick removed from Imports.
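The difference between the macro and micro estimators can be seen in a few lines of base R. The sketch below is not the internal metric_precision(), metric_recall(), or metric_f_meas() code; the confusion matrix is invented and assumed to have predictions in rows and truth in columns.

```r
# Toy confusion matrix (invented): rows = prediction, columns = truth.
cm <- matrix(c(2, 1, 0,
               0, 2, 0,
               0, 2, 1),
             nrow = 3,
             dimnames = list(Prediction = c("neg", "neu", "pos"),
                             Truth      = c("neg", "neu", "pos")))

tp <- diag(cm)
fp <- rowSums(cm) - tp
fn <- colSums(cm) - tp

# Per-class scores.
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_class   <- 2 * precision * recall / (precision + recall)

# Macro F: arithmetic mean of the per-class F-scores.
f_macro <- mean(f_class)

# Micro scores: pool the counts over classes first, then compute once.
precision_micro <- sum(tp) / (sum(tp) + sum(fp))
recall_micro    <- sum(tp) / (sum(tp) + sum(fn))
f_micro <- 2 * precision_micro * recall_micro / (precision_micro + recall_micro)
```

For single-label multi-class data every false positive for one class is a false negative for another, so micro precision, micro recall, and micro F coincide.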
New qlm_segment() segments a corpus into thematic or conceptual units using
an LLM, returning a quanteda corpus analogous to quanteda::corpus_segment()
output. Schema fields become docvars; docid_ and segid_ track provenance.
Enables aspect-based sentiment analysis, thematic coding, and other
applications requiring variable-length segmentation (#96).
qlm_compare() now supports inter-coder reliability for segmentation tasks.
When all inputs are segmented corpora produced by qlm_segment(), it
automatically computes Krippendorff's alpha for unitizing (Krippendorff, 2019,
section 12.6), an extension of alpha designed for variable-length text
segmentation. Three measures are reported (marked experimental):
u_alpha_nominal and u_alpha_binary measure joint boundary and coding
reliability across the full segmented continuum.
cu_alpha_nominal measures coding reliability conditional on unitization,
isolating coding disagreement from boundary disagreement.
(k)u_alpha_nominal reports reliability and coverage for each
individual code, enabling diagnosis of which codes are applied consistently.
Results include both per-document and overall (concatenated continuum) alpha.
as_qlm_coded() gains qlm_segment and source_text arguments for
converting gold-standard data frames to segmented corpora with character
positions, enabling ICR comparison of LLM segmentation against human-coded
reference data.
qlm_segment() now accepts a name argument stored in corpus metadata for
rater identification when comparing multiple segmenters via qlm_compare().
Removed dependencies on dplyr and tidyr (#109). Data manipulation now
uses base R, vctrs, and tibble, reducing the install footprint. No
user-visible behavior changes.
\value documentation to all exported methods.
qlm_validate() documentation.
qlm_corpus wrapper class pattern instead of conditional registerS3method(), eliminating load-order dependencies and runtime checks (#86).
qlm_meta() accessor function provides stratified access to metadata for qlm_coded, qlm_codebook, qlm_comparison, and qlm_validation objects. Metadata is organized into three types following the quanteda convention:
type = "user" (default): User-specified fields (name, notes) that can be modified via qlm_meta<-().
type = "object": Read-only parameters set at creation time (batch, call, chat_args, execution_args, parent, n_units, input_type).
type = "system": Read-only environment information (timestamp, ellmer_version, quallmer_version, R_version).
qlm_meta<-() replacement function allows modifying user metadata fields only. Attempting to modify object or system metadata produces an informative error (#72).
codebook() extractor retrieves the codebook component from qlm_coded, qlm_comparison, and qlm_validation objects. This is a core component accessor analogous to formula() for lm objects (#72).
inputs() extractor retrieves the original input data (texts or image paths) from qlm_coded objects. The function name mirrors the inputs argument in qlm_code() (#72).
attr(x, "run")$... access, providing a stable API for extracting and modifying object metadata and components.
as_qlm_coded() function replaces qlm_humancoded() as the primary function for converting human-coded or external data to qlm_coded objects. The new function includes an is_gold parameter to mark gold standard objects for automatic detection.
as_qlm_coded() now supports quanteda corpus objects directly via S3 method dispatch. Document variables (docvars) are automatically converted to coded variables, with document names used as identifiers by default. This simplifies the workflow for corpus-based gold standards (#81).
qlm_validate() now auto-detects gold standards marked with as_qlm_coded(data, is_gold = TRUE), making the gold = parameter optional when using marked objects. Explicit gold = still works for backward compatibility.
qlm_validate() signature changed to qlm_validate(..., gold, by, ...) to support validating multiple coded objects against a single gold standard in one call.
Results include a rater column identifying each object.
qlm_humancoded() is now marked @keywords internal but remains exported for backward compatibility. New code should use as_qlm_coded().
"# Gold: Yes" in their print output for easy identification.
qlm_validate() detect common mistakes like forgetting gold = or misspelling parameter names, with helpful suggestions for correction.
ci parameter added to qlm_compare() and qlm_validate() with options "none" (default), "analytic", or "bootstrap".
bootstrap_n parameter (default 1000).
ci_lower and ci_upper columns when ci != "none".
qlm_compare() results now include rater1, rater2, rater3, etc. columns containing the names of compared objects (from name attribute), enabling easy identification when combining multiple comparisons with dplyr::bind_rows().
qlm_validate() results now include a rater column identifying which object is being validated, enabling easy combining of multiple validations.
qlm_comparison and qlm_validation) instead of lists, making them easier to filter, combine, and analyze.
qlm_compare() or qlm_validate() calls can be combined with bind_rows() for analysis across multiple coders or conditions.
qlm_code() default name parameter changed from "original" to NULL for cleaner output when names aren't specified.
as_qlm_coded() instead of qlm_humancoded().
notes parameter in qlm_code(), qlm_replicate(), and as_qlm_coded() for documenting the rationale behind each coding run. Notes are displayed in print output and captured in qlm_trail().
qlm_trail() now accepts an optional path argument. When provided, saves RDS archive and generates Quarto report with full audit trail documentation.
qlm_trail_save(), qlm_trail_export(), qlm_trail_report(), and qlm_archive().
Use qlm_trail(..., path = "filename") instead.
qlm_trail() now generates fallback names for objects with missing name attribute.
qlm_trail() function creates complete audit trails following Lincoln and Guba's (1985) concept for establishing trustworthiness in qualitative research.
qlm_trail(..., path = "filename") to save RDS archive and generate Quarto report.
qlm_comparison and qlm_validation objects include run attributes capturing parent relationships, enabling full workflow traceability.
The package introduces a new qlm_*() API with richer return objects and clearer terminology for qualitative researchers:
qlm_codebook() defines coding instructions, replacing task() (#27).
qlm_code() executes coding tasks and returns a tibble with coded results and metadata as attributes, replacing annotate() (#27). The returned qlm_coded object prints as a tibble and can be used directly in data manipulation workflows. Now includes name parameter for tracking runs and hierarchical attribute structure with provenance support.
qlm_compare() compares multiple qlm_coded objects to assess inter-rater reliability. Automatically computes all statistically appropriate measures from the irr package based on the specified measurement level (nominal, ordinal, or interval).
qlm_validate() validates a qlm_coded object against a gold standard (human-coded reference data). Automatically computes all statistically appropriate metrics based on the specified measurement level, using measures from the yardstick, irr, and stats packages. For nominal data, supports multiple averaging methods (macro, micro, weighted, or per-class breakdown).
qlm_replicate() re-executes coding with optional overrides (model, codebook, parameters) while tracking provenance chain. Enables systematic assessment of coding reliability and sensitivity to model choices.
The new API uses the qlm_ prefix to avoid namespace conflicts (e.g., with ggplot2::annotate()) and follows the convention of verbs for workflow actions, nouns for accessor functions.
qlm_coded objects now use a hierarchical attribute structure with a run list containing name, batch, call, codebook, chat_args, execution_args, metadata, and parent fields. This structure supports provenance tracking across replication chains and provides clearer organization of coding metadata (#26).
batch flag indicates whether batch processing was used.
execution_args replaces pcs_args and stores all non-chat execution arguments for both parallel and batch processing. Old objects with pcs_args remain compatible.
data_codebook_sentiment provides a ready-to-use codebook for sentiment analysis.
task_*() functions are deprecated in favor of using the data objects or creating custom codebooks with qlm_codebook().
task() is deprecated in favor of qlm_codebook() (#27).
annotate() is deprecated in favor of qlm_code() (#27).
validate() is superseded by qlm_compare() (for inter-rater reliability) and qlm_validate() (for gold standard validation). The function remains available but is marked with a lifecycle badge.
trail_settings(), trail_record(), trail_compare(), trail_matrix(), trail_icr() are deprecated. Use qlm_code() with model and temperature parameters directly, or qlm_replicate() for systematic comparisons across models.
Backward compatibility: Old code continues to work with deprecation warnings. New qlm_codebook objects work with old annotate(), and old task objects work with new qlm_code(). This is achieved through dual-class inheritance where qlm_codebook inherits from both "qlm_codebook" and "task".
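The dual-class mechanism can be illustrated with plain S3. In the sketch below only the class vector c("qlm_codebook", "task") comes from the entry above; the constructor toy_codebook(), the generic toy_describe(), and the legacy-style function toy_legacy_annotate() are hypothetical stand-ins, not package functions.

```r
# Hypothetical constructor: the object carries both class names.
toy_codebook <- function(name, instructions) {
  structure(list(name = name, instructions = instructions),
            class = c("qlm_codebook", "task"))
}

# Hypothetical generic with a method written only for the old "task" class.
toy_describe <- function(x, ...) UseMethod("toy_describe")
toy_describe.task <- function(x, ...) paste("task:", x$name)

# Hypothetical legacy-style function that checks for the old class.
toy_legacy_annotate <- function(task) {
  if (!inherits(task, "task")) stop("`task` must be a task object")
  toy_describe(task)
}

cb  <- toy_codebook("sentiment", "Rate the sentiment of each text.")
out <- toy_legacy_annotate(cb)
```

Because S3 dispatch walks the class vector from left to right, a later toy_describe.qlm_codebook() method would take precedence without breaking code that only knows about "task".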
validate_app() has been extracted into the companion package quallmer.app. This reduces dependencies in the core quallmer package (removing shiny, bslib, and htmltools from Imports). Install quallmer.app separately for interactive validation functionality.
qlm_validate() now uses distinct, statistically appropriate metrics for each measurement level:
level = "nominal": accuracy, precision, recall, F1-score, Cohen's kappa (unweighted)
level = "ordinal": Spearman's rho, Kendall's tau, MAE (mean absolute error)
level = "interval": ICC (intraclass correlation), Pearson's r, MAE, RMSE (root mean squared error)
The measure argument has been removed entirely; all appropriate measures are now computed automatically based on the level parameter. Function signature changed: level now comes before average, and average only applies to nominal (multiclass) data. Return values renamed for consistency: spearman → rho, kendall → tau, pearson → r. Print output uses "levels" terminology for ordinal data and "classes" for nominal data. This change provides more statistically sound validation that respects the mathematical properties of each measurement scale.
qlm_compare() now computes all statistically appropriate measures for each measurement level:
level = "nominal": Krippendorff's alpha (nominal), Cohen's/Fleiss' kappa, percent agreement
level = "ordinal": Krippendorff's alpha (ordinal), weighted kappa (2 raters only), Kendall's W, Spearman's rho, percent agreement
level = "interval": Krippendorff's alpha (interval), ICC (intraclass correlation), Pearson's r, percent agreement
The measure argument has been removed entirely; all appropriate measures are now computed automatically and returned in the result object. The return structure changed from a single value to a list containing all computed measures for the specified level. Percent agreement is now computed for all levels; for ordinal/interval/ratio data, the tolerance parameter controls what counts as agreement (e.g., tolerance = 1 means values within 1 unit are considered in agreement).
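The effect of tolerance on percent agreement can be shown with a short base R sketch. It is not the package's implementation: the helper name toy_percent_agreement() is hypothetical, and it assumes a unit counts as agreed when the spread between the highest and lowest rating is at most the tolerance.

```r
# Hypothetical helper: share of units whose ratings all lie within
# `tolerance` of each other (assumed definition of agreement).
toy_percent_agreement <- function(ratings, tolerance = 0) {
  # ratings: numeric matrix with one row per unit and one column per rater
  spread <- apply(ratings, 1, function(r) max(r) - min(r))
  mean(spread <= tolerance)
}

# Toy ratings for five units and two raters (invented for illustration).
ratings <- matrix(c(1, 1,
                    2, 3,
                    4, 4,
                    5, 3,
                    2, 2),
                  ncol = 2, byrow = TRUE)

exact   <- toy_percent_agreement(ratings)                 # identical ratings only
within1 <- toy_percent_agreement(ratings, tolerance = 1)  # off-by-one also counts
```

With tolerance = 0 only the three identical pairs count; with tolerance = 1 the pair (2, 3) is also treated as agreement, while (5, 3) still is not.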
New qlm_humancoded() function converts human-coded data frames into qlm_humancoded objects (dual inheritance: qlm_humancoded + qlm_coded), enabling full provenance tracking for human coding alongside LLM results. Supports custom metadata for coder information, training details, and coding instructions (#43).
qlm_validate() and qlm_compare() now accept plain data frames and automatically convert them to qlm_humancoded objects with an informational message. Users can call qlm_humancoded() directly to provide richer metadata (coder names, instructions, etc.) or use plain data frames for quick comparisons (#43).
qlm_validate() and qlm_compare() now support non-standard evaluation (NSE) for the by argument, allowing both by = sentiment (unquoted) and by = "sentiment" (quoted) syntax. This provides a more natural, tidyverse-style interface while maintaining backward compatibility (#43).
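One way to accept both by = sentiment and by = "sentiment" is to inspect the unevaluated argument. The base R sketch below shows the idea only; toy_pick_column() is a hypothetical helper, and the package may well implement this differently (for example with rlang).

```r
# Hypothetical helper: resolve `by` whether it is a bare name or a string.
toy_pick_column <- function(data, by) {
  expr <- substitute(by)  # capture the argument without evaluating it
  col <- if (is.symbol(expr)) {
    as.character(expr)    # by = sentiment
  } else if (is.character(expr) && length(expr) == 1) {
    expr                  # by = "sentiment"
  } else {
    stop("`by` must be a single column name, quoted or unquoted", call. = FALSE)
  }
  if (!col %in% names(data)) {
    stop(sprintf("Column '%s' not found", col), call. = FALSE)
  }
  data[[col]]
}

# Toy data (invented for illustration).
coded <- data.frame(.id = 1:3, sentiment = c("pos", "neg", "neu"))

unquoted <- toy_pick_column(coded, sentiment)
quoted   <- toy_pick_column(coded, "sentiment")
```

A limitation of this simple approach is that a variable holding the column name (by = my_var) is read as a column called my_var, which is one reason real implementations tend to use more elaborate tidy-evaluation tools.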
Print method for qlm_coded objects now distinguishes human from LLM coding, displaying "Source: Human coder" for qlm_humancoded objects instead of model information.
Improved error messages in qlm_compare() and qlm_validate() now show which objects are missing the requested variable and list available alternatives.
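An error of this kind, naming the objects that lack the variable and listing alternatives, can be sketched in base R. The helper toy_check_by() and its message wording are hypothetical, and base stop() is used here only to keep the example dependency-free.

```r
# Hypothetical check: every object in a named list must contain column `by`.
toy_check_by <- function(objects, by) {
  has_var    <- vapply(objects, function(x) by %in% names(x), logical(1))
  missing_in <- names(objects)[!has_var]
  if (length(missing_in) > 0) {
    # Variables shared by all objects are offered as alternatives.
    available <- Reduce(intersect, lapply(objects, names))
    stop(sprintf("Variable '%s' not found in: %s. Variables present in all objects: %s.",
                 by,
                 paste(missing_in, collapse = ", "),
                 paste(available, collapse = ", ")),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy inputs (invented): the second coder used a different column name.
coder_a <- data.frame(.id = 1:2, sentiment = c("pos", "neg"))
coder_b <- data.frame(.id = 1:2, tone = c("pos", "neg"))

msg <- tryCatch(
  toy_check_by(list(coder_a = coder_a, coder_b = coder_b), "sentiment"),
  error = conditionMessage
)
```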
Adopt tidyverse-style error messaging via cli::cli_abort() and cli::cli_warn() throughout the package, replacing all stop(), stopifnot(), and warning() calls with structured, informative error messages.
Documentation and CI notes refreshed.