4  Data

4.1 Introduction

Tip

Make sure you have run the tutorial setup code in your R session before copying, pasting and running example code here.

Pmetrics always needs data and a model to run. Pmetrics data objects are typically read into memory from files. Although the file format is usually comma-separated (.csv), it is possible to use other separators, like the semicolon, by setting the appropriate argument with setPMoptions.


# look at and change global Pmetrics options
setPMoptions()
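To make the separator difference concrete, the following base R sketch writes and re-reads a small semicolon-separated file. It does not call Pmetrics at all. The tiny data frame and the decimal comma are assumptions chosen for illustration, since semicolons are commonly used as separators in locales where the comma is the decimal mark.

```r
# A small, made-up dataset in the minimal Pmetrics layout
df <- data.frame(
  id   = c(1, 1),
  time = c(0, 1),
  dose = c(100, NA),
  out  = c(NA, 5.2)
)

# Write it with ";" between columns, "," as decimal mark and "." for missing
tmp <- tempfile(fileext = ".csv")
write.table(df, tmp, sep = ";", dec = ",", na = ".",
            row.names = FALSE, quote = FALSE)

# Inspect the raw text, then read it back with matching settings
readLines(tmp)
back <- read.table(tmp, header = TRUE, sep = ";", dec = ",", na.strings = ".")
```

Reading the file back with the same separator and decimal settings recovers the original values, which is the kind of agreement the Pmetrics options need to have with your file.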

Examples of programs that can save .csv files are any text editor (e.g. TextEdit on Mac, Notepad on Windows) or spreadsheet program (e.g. Excel).

It is possible to create a data object in R directly, without reading a file. This is useful for simulation purposes, where you may want to create a small dataset on the fly. We’ll cover this below.

4.2 R6 objects

Most Pmetrics objects, including data, follow the R6 framework. A PM_data object represents a dataset that will be modeled or simulated, and the PM_data class defines all of its behavior: datasets can be checked, plotted, written to disk and more. Use PM_data$new("filename") to create a PM_data object by reading a file.

4.3 First data object

# if not using the Rscript/Learn.R template created by PM_tutorial(), 
# modify the path as needed
dat <- PM_data$new("src/ex.csv")

You can also build an appropriate data frame in R and provide that as an argument to PM_data$new().

# ensure data frame has at least these columns:
# id, time, dose, out
df <- data.frame(id = c(1,1,1,2,2),
                 time = c(0,1,2,0,1),
                 dose = c(100,NA,NA,200,NA),
                 out = c(NA,5.2,3.1,NA,7.4)
)
dat_df <- PM_data$new(df)

Lastly, you can take advantage of the addEvent method in PM_data objects to build a data object on the fly. This can be particularly useful for making quick simulation templates. Start with an empty call to PM_data$new() and add successive rows. See PM_data for details under the addEvent method.

# build a PM_data object row by row
dat_add <- PM_data$new()$
    addEvent(id = 1, time = 0, dose = 100, addl = 5, ii = 24)$ # add 6 doses of 100 every 24 hours
    addEvent(id = 1, time = 144, out = -1)$ # add an observation of -1 at time 144
    addEvent(id = 1, wt = 75, validate = TRUE) # add wt of 75 to all rows for id = 1 and validate

Notes:

  1. Lack of time element in the last addEvent will add wt = 75 to all rows for id = 1
  2. Use validate = TRUE as an argument in the last addEvent to finalize creation
  3. You can chain events as shown above by including the $ between events.

Note

For those familiar with the tidyverse pipe or the native R pipe for joining functions (“%>%” or “|>”, respectively), chaining in R6 is similar but restricted to methods defined for the object. In this case we chain the addEvent methods. We could even chain an additional PM_data method like $plot() at the end of the above code. However, that would make dat_add a plotly plot object, not a PM_data one.
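The chaining behavior described in this note can be sketched with a toy R6 class. EventLog and its methods are hypothetical names invented for this illustration and have nothing to do with how PM_data is implemented. The sketch assumes the R6 package is installed.

```r
library(R6)

# Hypothetical toy class, for illustration only
EventLog <- R6Class("EventLog",
  public = list(
    events = NULL,
    add = function(time, what) {
      self$events <- rbind(self$events, data.frame(time = time, what = what))
      invisible(self) # returning the object itself is what allows chaining
    },
    count = function() {
      nrow(self$events) # returns a number, not the object
    }
  )
)

# Chain ends with add(), so the result is still an EventLog
elog <- EventLog$new()$add(0, "dose")$add(144, "obs")

# Chain ends with count(), so the result is whatever count() returns
n <- EventLog$new()$add(0, "dose")$count()

class(elog)
class(n)
```

The result of a chain is whatever its last method returns, which is why ending a PM_data chain with $plot() would store a plot rather than the data object.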

Below are the data standardization and validation reports generated when you create a new PM_data object, followed by what dat_add$data and dat_add$standard_data look like in the viewer. The former is your original data, and the latter is the same data after standardization to the full Pmetrics format.

#> 
#> ── DATA STANDARDIZATION ────────────────────────────────────────────────────────
#> EVID inferred as 0 for observations, 1 for doses.
#>  All doses assumed to be oral (DUR = 0).
#>  ADDL set to missing for all records.
#>  II set to missing for all records.
#>  All doses assumed to be INPUT = 1.
#>  All observations assumed to be OUTEQ = 1.
#>  All observations assumed to be uncensored.
#>  One or more error coefficients not specified. Error in model object will be used.
#> 
#> ── DATA VALIDATION ─────────────────────────────────────────────────────────────
#> No data errors found.

Original data:

id time dose out wt
1 0 100 NA 75
1 24 100 NA 75
1 48 100 NA 75
1 72 100 NA 75
1 96 100 NA 75
1 120 100 NA 75
1 144 NA -1 75

Standardized data:

id evid time dur dose addl ii input out outeq cens c0 c1 c2 c3 wt
1 1 0 0 100 NA NA 1 NA NA NA NA NA NA NA 75
1 1 24 0 100 NA NA 1 NA NA NA NA NA NA NA 75
1 1 48 0 100 NA NA 1 NA NA NA NA NA NA NA 75
1 1 72 0 100 NA NA 1 NA NA NA NA NA NA NA 75
1 1 96 0 100 NA NA 1 NA NA NA NA NA NA NA 75
1 1 120 0 100 NA NA 1 NA NA NA NA NA NA NA 75
1 0 144 NA NA NA NA NA -1 1 none NA NA NA NA 75
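The six dose rows in the tables above come from a single addEvent call with addl = 5 and ii = 24. The arithmetic behind that expansion can be sketched in base R. expand_addl is a hypothetical helper written for this illustration, not a Pmetrics function, and it handles only non-negative ADDL.

```r
# Hypothetical helper: expand one dose row with ADDL/II into explicit doses
expand_addl <- function(time, dose, addl, ii) {
  k <- 0:addl # the original dose plus `addl` additional ones
  data.frame(time = time + k * ii, dose = dose)
}

doses <- expand_addl(time = 0, dose = 100, addl = 5, ii = 24)
doses$time
# 0 24 48 72 96 120
```

With ADDL = 5 there are six doses in total, the last at 120 hours, which matches the dose rows shown in the original and standardized data.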

Once you have created the PM_data object, you never need to create it again during your R session. You also don’t have to bother copying the data file to the Runs folder each time you run the model, like you used to do with older (“Legacy”) versions of Pmetrics. The data are stored in memory and can be used in any Pmetrics function that needs them.

4.4 Data format

R6 Pmetrics can use file or data frame input. The format is very flexible. A truncated example is shown below, with NA values replaced by “.” as they would appear in a file.

id time dose out wt
1 0 600 . 46.7
1 24 600 . 46.7
1 48 600 . 46.7
1 72 600 . 46.7
1 96 600 . 46.7
1 120 . 10.44 46.7
1 120 600 . 46.7
1 121 . 12.89 46.7
1 122 . 14.98 46.7
1 125.99 . 16.69 46.7
1 129 . 20.15 46.7
1 132 . 14.97 46.7
1 143.98 . 12.57 46.7
2 0 600 . 66.5
2 24 600 . 66.5
2 48 600 . 66.5
2 72 600 . 66.5
2 96 600 . 66.5
2 120 . 3.56 66.5
2 120 600 . 66.5
2 120.98 . 5.84 66.5
2 121.98 . 6.54 66.5
2 126 . 6.14 66.5
2 129.02 . 6.56 66.5
2 132.02 . 4.44 66.5
2 144 . 3.76 66.5

The only required columns are those below. Unlike Legacy Pmetrics, there are no requirements for a header or to prefix the ID column with “#”. However, any subsequent row that begins with “#” will be ignored, which is helpful if you want to exclude data from the analysis, but preserve the integrity of the original dataset, or to add comment lines. The column order can be anything you wish, but the names should be the same as below. Ultimately, PM_data$new() converts all valid data into a standardized format discussed below.

  • ID This field can be numeric or character and identifies each individual. All rows must contain an ID, and all records from one individual must be contiguous. IDs may be any alphanumeric combination. The number of subjects is unlimited.

  • TIME This is the elapsed time in decimal hours since the first event, which is always TIME = 0, unless you specify TIME as clock time. In that case, you must include a DATE column, described below. For clock time, the default format is HH:MM. Other formats can be specified. See PM_data for more details. Every row must have an entry, and within a given ID, rows must be sorted chronologically, earliest to latest.

    • DATE This column is only required if TIME is clock time, detected by the presence of “:”. The default format of the date column is YYYY-MM-DD. As for TIME, other formats can be specified. See PM_data for more details.

  • DOSE This is the dose amount. It should be “.” for observation rows. All subjects must have a dose event at time 0, which is the first row for that subject. The dose amount can be any numeric value, including 0. If the dose is an infusion, the DUR column must also be included. In other software packages, AMT is equivalent to DOSE.

  • OUT This is the observation, or output value, and it is always required. If EVID = 0, there must be an entry. For such events, if the observation is missing, e.g. a sample was lost or not obtained, this must be coded as -99. It will be ignored for any other EVID and therefore should be “.”. OUT can be coded as DV in other software packages. When OUT = -99, this is equivalent to MDV = 1, or missing dependent variable in other packages, but Pmetrics does not use MDV.

Not required:

  • COVARIATES… Covariates are optional and discussed below. Here, wt was included as an example of a covariate.
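The rules listed above can be checked by hand in base R before passing data to Pmetrics. The sketch below reads a small text block with “.” for missing values and a “#” comment row, then tests the stated requirements. The rows are adapted from the example above, except that the commented row and the -99 observation are made up to demonstrate those features. None of this is Pmetrics code; PM_data$new() performs its own validation.

```r
txt <- "id,time,dose,out,wt
1,0,600,.,46.7
1,24,600,.,46.7
#1,48,600,.,46.7
1,120,.,10.44,46.7
2,0,600,.,66.5
2,120,.,-99,66.5"

# "." becomes NA and rows starting with "#" are skipped
df <- read.csv(text = txt, na.strings = ".", comment.char = "#")

# Required columns are present
has_cols <- all(c("id", "time", "dose", "out") %in% names(df))

# All records for one id are contiguous (each id forms a single run)
contiguous <- !anyDuplicated(rle(as.character(df$id))$values)

# Times are sorted within each id
sorted <- all(tapply(df$time, df$id, function(t) !is.unsorted(t)))

# The first row of each id is a dose at time 0
first <- df[!duplicated(df$id), ]
dose_at_zero <- all(first$time == 0 & !is.na(first$dose))

# Every observation row has an OUT entry (-99 if the sample is missing)
obs_have_out <- !any(is.na(df$out[is.na(df$dose)]))

c(has_cols, contiguous, sorted, dose_at_zero, obs_have_out)
```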

When PM_data reads a file, it will standardize it to the format below. This means some inferences are made. For example, in the absence of DUR, all doses are interpreted as oral. If they are infusions, DUR must be included to indicate the duration of the infusion. EVID only needs to be included if EVID=4 (reset event) is required, described below; otherwise it is inferred as 0 for observations and 1 for doses. Similarly, INPUT and OUTEQ are only required if multiple inputs or outputs are being modeled. Lastly, ADDL and II are optional.

Finally, the standardized data are checked for errors. If any are found, Pmetrics generates a report listing them and attempts to fix those that it can.

4.4.1 Standardized Data

Data are standardized when PM_data$new() is invoked, and the data frame is placed in the PM_data object’s $standard_data field. When the $save() method is called on a PM_data object, the data are saved in this standardized format. The first several rows of example standardized data are below, with details following.

id evid time dur dose addl ii input out outeq cens c0 c1 c2 c3 wt
1 1 0 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7
1 1 24 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7
1 1 48 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7
1 1 72 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7
1 1 96 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7
1 0 120 NA NA NA NA NA 10.44 1 none NA NA NA NA 46.7
1 1 120 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7
1 0 121 NA NA NA NA NA 12.89 1 none NA NA NA NA 46.7
1 0 122 NA NA NA NA NA 14.98 1 none NA NA NA NA 46.7
1 0 125.99 NA NA NA NA NA 16.69 1 none NA NA NA NA 46.7
1 0 129 NA NA NA NA NA 20.15 1 none NA NA NA NA 46.7
1 0 132 NA NA NA NA NA 14.97 1 none NA NA NA NA 46.7
1 0 143.98 NA NA NA NA NA 12.57 1 none NA NA NA NA 46.7
2 1 0 0 600 NA NA 1 NA NA NA NA NA NA NA 66.5
2 1 24 0 600 NA NA 1 NA NA NA NA NA NA NA 66.5
2 1 48 0 600 NA NA 1 NA NA NA NA NA NA NA 66.5
2 1 72 0 600 NA NA 1 NA NA NA NA NA NA NA 66.5
2 1 96 0 600 NA NA 1 NA NA NA NA NA NA NA 66.5
2 0 120 NA NA NA NA NA 3.56 1 none NA NA NA NA 66.5
2 1 120 0 600 NA NA 1 NA NA NA NA NA NA NA 66.5
2 0 120.98 NA NA NA NA NA 5.84 1 none NA NA NA NA 66.5
2 0 121.98 NA NA NA NA NA 6.54 1 none NA NA NA NA 66.5
2 0 126 NA NA NA NA NA 6.14 1 none NA NA NA NA 66.5
2 0 129.02 NA NA NA NA NA 6.56 1 none NA NA NA NA 66.5
2 0 132.02 NA NA NA NA NA 4.44 1 none NA NA NA NA 66.5
2 0 144 NA NA NA NA NA 3.76 1 none NA NA NA NA 66.5

  • ID See above.

  • EVID This is the event ID field. It can be 0, 1, or 4. It is only required if EVID = 4 is included in the data, in which case every row must have an entry. If there are no EVID = 4 events, the entire EVID column can be omitted from the data.

    • 0 = observation

    • 1 = input (e.g. dose)

    • 2, 3 are currently unused

    • 4 = reset, where all compartment values are set to 0 and the time counter is reset to 0. This is useful when an individual has multiple sampling episodes that are widely spaced in time with no new information gathered. This is a dose event, so dose information needs to be complete. The TIME value for EVID = 4 should be 0, and subsequent rows should increase monotonically from 0 until the last record or until another EVID = 4 event, which will restart time at 0.

  • TIME See above.

  • DATE See above.

  • DUR This is the duration of an infusion in hours. If EVID = 0 (observation event), DUR is ignored and should have a “.” placeholder. For a bolus (e.g. an oral dose), set the value equal to 0. As mentioned above, if all doses are oral, DUR can be omitted from the data altogether. Some other packages use RATE instead of DUR, but of course, one can convert rate to duration with DUR = DOSE / RATE.

  • DOSE See above.

  • ADDL This specifies the number of additional doses to give at interval II. ADDL can be positive or negative. If positive, it is the number of doses to give after the dose at time 0. If negative, it is the number of doses to give before the dose at time 0. It may be missing (“.”) for dose events (EVID = 1 or EVID = 4), in which case it is assumed to be 0. It is ignored for observation (EVID = 0) events. Be sure to adjust the time entry for the subsequent row, if necessary, to account for the extra doses. All compartments in the model will contain the predicted amounts of drug at the end of the II interval after the last ADDL dose.

  • II This is the interdose interval and is only relevant if ADDL is not equal to 0, in which case II cannot be missing. If ADDL = 0 or is missing, II is ignored.

  • INPUT This defines which input (i.e. drug) the DOSE corresponds to. The model defines which compartments receive the input(s). If only modeling one drug, INPUT is unnecessary, as all values will be assumed to be 1. Other packages may use CMT for compartment for both inputs and outputs. It is necessary to separate these in Pmetrics and, for inputs, designate the corresponding model input number with INPUT (e.g. R[x] or B[x] for infusions and boluses in the model object), not the compartment.

  • OUT See above.

  • OUTEQ This is the output equation number that corresponds to the OUT value. Output equations are defined in the model file. If only modeling one output, this column is unnecessary, as all values are assumed to be 1. As discussed in INPUT, other packages may use CMT for compartment for both inputs and outputs. It is necessary to separate these in Pmetrics and for outputs, designate the corresponding model output equation number with OUTEQ, not the compartment.

  • CENS This is a new column as of Pmetrics 3.0.0. It indicates whether the observation is censored, i.e. below a lower limit of quantification or above an upper limit. It can take on four values:

    • Missing for dose events which are not observations. Use a “.” as a placeholder in your data file.

    • 0 or “none” = not censored

    • 1 or “bloq” = left censored (below lower limit of quantification)

    • -1 or “aloq” = right censored (above upper limit of quantification)

    If there are no censored observations, the entire CENS column can be omitted from the data. In data fitting, left censored observations are handled using the M3 method described by Beal (Beal 2001). Right censored observations are handled similarly, but using the complementary probability. The value in the OUT column is the censoring lower limit of quantification (LLOQ) for left censored observations. It is the upper limit of quantification (ULOQ) for right censored observations. For uncensored observations, OUT is the observed value as usual. For example, if OUT = 5 and CENS = 1 or CENS = "bloq", this indicates that the observation is below the LLOQ of 5. If OUT = 10 and CENS = -1 or CENS = "aloq", this indicates that the observation is above the ULOQ of 10.

  • C0, C1, C2, C3 These are the coefficients for the assay error polynomial for that observation. Each subject may have up to one set of coefficients per output equation. If more than one set is detected for a given subject and output equation, the last set will be used. If there are no available coefficients, these cells may be omitted. If they are included, for events which are not observations, they can be filled with “.” as a placeholder. In data fitting, if the coefficients are present in the data file, Pmetrics will use them. If missing, Pmetrics will look for coefficients defined in the model.

  • COVARIATES… Any column named other than above is assumed to be a covariate, one column per covariate. The first row for any subject must have a value for all covariates, since the first row is always a dose. Covariates are handled differently than in Legacy Pmetrics. In Legacy, they were only considered at the times of dose events (EVID = 1 or EVID = 4). In Pmetrics 3.0 and later, they are considered at all times, including observation events (EVID = 0). Therefore, to enter a new covariate value at a time other than a dose or an observation, create a row at the appropriate time (and possibly date if using clock/calendar), making the row either a dose row with DOSE = 0 or an observation row with OUT = -99 (missing). By default, covariate values are linearly interpolated between entries. This is useful for covariates like weight, which may vary from measurement to measurement. You can change this behavior in the model definition to make them piece-wise constant, i.e. carried forward from the previous value until a new value causes an instant change. This could be used, for example, to indicate periods of off and on dialysis. See the chapter on Models for more details.
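The difference between the default linear interpolation of covariates and a piece-wise constant covariate can be illustrated with base R's approx(). This is only a sketch of the two behaviors, not the code Pmetrics uses, and the second weight of 50.7 at 120 hours is made up for the example.

```r
# Two covariate entries for one subject: weight at 0 and at 120 hours
wt_time <- c(0, 120)
wt_val  <- c(46.7, 50.7) # second value is hypothetical

# Times at which the covariate value is needed
query <- c(0, 60, 120)

# Linear interpolation between entries
wt_linear <- approx(wt_time, wt_val, xout = query, method = "linear")$y

# Piece-wise constant: carry the previous value forward until a new entry
wt_step <- approx(wt_time, wt_val, xout = query, method = "constant", f = 0)$y

data.frame(time = query, linear = wt_linear, constant = wt_step)
```

At 60 hours the linear version gives the midpoint between the two entries, while the piece-wise constant version still holds the first value and only changes when the new entry arrives at 120 hours.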

4.5 Manipulation of CSV files

4.5.0.1 Read

As we have seen, PM_data$new("path/filename") will create a new PM_data object by reading an appropriate data file in the path directory or the current working directory if path is omitted. Change the column separator in the file from the default “,” (.csv files) to “;” (.ssv files) using setPMoptions().

4.5.0.2 Save

PM_data$save("path/filename") will save the $standard_data field of the PM_data object to a file called “filename” in the path directory or the current working directory if path is omitted. This can be useful if you have loaded or created a data file and then changed it in R. Change the column separator in the file from the default “,” (.csv files) to “;” (.ssv files) using setPMoptions().

4.5.0.3 Standardize

PM_data$new() automatically standardizes the data into the full format. This includes conversion of calendar date / clock time into decimal elapsed time.
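The kind of conversion involved can be sketched in base R, assuming the default formats named earlier (YYYY-MM-DD for DATE and HH:MM for TIME). The dates and times below are made up, and this is an illustration of the arithmetic rather than the routine Pmetrics runs internally.

```r
# Made-up records with calendar date and clock time
raw <- data.frame(
  id   = c(1, 1, 1),
  date = c("2024-03-01", "2024-03-01", "2024-03-02"),
  time = c("08:00", "20:30", "08:00")
)

# Combine date and time into a single timestamp (UTC avoids daylight-saving jumps)
stamp <- as.POSIXct(paste(raw$date, raw$time),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")

# Decimal hours elapsed since each subject's first event
raw$elapsed <- ave(as.numeric(stamp), raw$id,
                   FUN = function(s) (s - s[1]) / 3600)

raw$elapsed
# 0.0 12.5 24.0
```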

4.5.0.4 Validate

PM_data$new() automatically calls PMcheck so the data are validated as the data object is created.

4.5.0.5 Data conversion

  • PMwrk2csv() This function will convert old-style, single-drug USC*PACK .wrk formatted files into Pmetrics data .csv files.

  • NM2PM() Although the structure of Pmetrics data files is similar to NONMEM, there are some differences. This function attempts to automatically convert to Pmetrics format. It has been tested on several examples, but there are probably NONMEM files which will cause it to crash.
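For a feel of what such a conversion involves, the base R sketch below maps only the column equivalences mentioned in this chapter: AMT to DOSE, DV to OUT, MDV = 1 to OUT = -99, and RATE to DUR via DUR = DOSE / RATE. It is a hand-written illustration on made-up records, not how NM2PM() is implemented, and it assumes that RATE = 0 marks a bolus dose.

```r
# Made-up NONMEM-style records
nm <- data.frame(
  ID   = c(1, 1, 1, 1),
  TIME = c(0, 1, 2, 24),
  AMT  = c(500, NA, NA, 500),
  RATE = c(250, NA, NA, 0),
  DV   = c(NA, 8.1, NA, NA),
  MDV  = c(1, 0, 1, 1)
)

is_dose <- !is.na(nm$AMT)

pm <- data.frame(
  id   = nm$ID,
  time = nm$TIME,
  # infusion duration from rate; 0 for a bolus; missing on observation rows
  dur  = ifelse(is_dose, ifelse(nm$RATE > 0, nm$AMT / nm$RATE, 0), NA),
  dose = nm$AMT,
  # missing observations (MDV = 1 on a non-dose row) are coded as -99
  out  = ifelse(is_dose, NA, ifelse(nm$MDV == 1, -99, nm$DV))
)

pm
```

Real NONMEM files carry more information than this (for example CMT, which has to be split into INPUT and OUTEQ), so a full conversion needs considerably more care than the sketch suggests.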

4.6 More Examples

Pmetrics comes with an example dataset called dataEx already loaded. You can practice with it. It is the same data as in “src/ex.csv” used above to create the dat object.

Tip

In the code below and often in this book, file.path is a base R function used to create file paths that are compatible with your operating system.


# Save data somewhere
path <- "src2"
dir.create(path) # create a temporary folder
dataEx$save(file.path(path, "ex2.csv")) # save the data there
dataEx$save("src2/ex2.csv") # alternative that writes the same file

# Load it again with one of these alternatives
exData <- PM_data$new(file.path(path, "ex2.csv"))
exData <- PM_data$new("src2/ex2.csv")

unlink("src2", recursive = TRUE) # clean up

You can look at the src/ex.csv file directly by opening it from your hard drive in a spreadsheet program like Excel, or in a text editor.

exData is an R6 object, which means that it contains both data and methods to process that data.

# See the contents of the object
names(exData)
#>  [1] "nca"             ".__enclos_env__" "summary"         "auc"            
#>  [5] "addEvent"        "post"            "clone"           "initialize"     
#>  [9] "standard_data"   "save"            "print"           "plot"           
#> [13] "data"            "pop"

The .__enclos_env__ element is an artifact of the R6 class. The remaining elements are documented in the help for PM_data. You can of course inspect the data directly.

# Your original data (first few rows)
head(exData$data)
#>   id time dose   out   wt africa age gender height
#> 1  1    0  600    NA 46.7      1  21      1    160
#> 2  1   24  600    NA 46.7      1  21      1    160
#> 3  1   48  600    NA 46.7      1  21      1    160
#> 4  1   72  600    NA 46.7      1  21      1    160
#> 5  1   96  600    NA 46.7      1  21      1    160
#> 6  1  120   NA 10.44 46.7      1  21      1    160

Typing the name of the PM_data object will display it nicely in the viewer.

# See the standardized data nicely formatted in the viewer 
exData

Below we show it truncated for brevity.

id evid time dur dose addl ii input out outeq cens c0 c1 c2 c3 wt africa age gender height
1 1 0 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7 1 21 1 160
1 1 24 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7 1 21 1 160
1 1 48 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7 1 21 1 160
1 1 72 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7 1 21 1 160
1 1 96 0 600 NA NA 1 NA NA NA NA NA NA NA 46.7 1 21 1 160
1 0 120 NA NA NA NA NA 10.44 1 none NA NA NA NA 46.7 1 21 1 160

Most Pmetrics objects are R6 objects. As a reminder, you can use the $ operator to access their data fields and methods. Many of them have a $summary() method that prints a summary of the object to the console and a $plot() method that creates a plot of the object. See PM_data for more information on the PM_data class and its methods.

Note: We recognize that many users are familiar with the “S3 framework” in R, which uses functions like summary(object) and plot(object). To comply with better programming standards, Pmetrics uses the R6 framework. However, we have provided S3 methods for most functions, so you can use summary(object) and plot(object) if you prefer.

# S3 method to summarize data
summary(exData)
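One way an S3 call such as summary(object) can be made to reach an R6 method is sketched below with a toy class. Toy and summary.Toy are hypothetical names for this illustration, and the sketch makes no claim about how Pmetrics defines its own S3 methods. It assumes the R6 package is installed.

```r
library(R6)

# Hypothetical toy class with an R6 summary method
Toy <- R6Class("Toy",
  public = list(
    x = NULL,
    initialize = function(x) {
      self$x <- x
    },
    summary = function() {
      c(n = length(self$x), mean = mean(self$x))
    }
  )
)

# S3 method for the generic summary(): simply delegate to the R6 method
summary.Toy <- function(object, ...) {
  object$summary()
}

obj <- Toy$new(c(1, 2, 3))
same <- identical(summary(obj), obj$summary())
same
```

Because R6 objects carry their class name as an S3 class attribute, summary(obj) dispatches to summary.Toy, and both calling styles return the same result.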

PM_data has a plot() method that creates a plot of the data. See plot.PM_data for more information.

exData$plot() 

4.7 Citations

Beal, S. L. 2001. Ways to Fit a PK Model with Some Data Below the Quantification Limit. Journal of Pharmacokinetics and Pharmacodynamics 28 (5): 481–504.