install.packages(c("tidyverse",
"here",
"tidylog",
"summarytools"))
13/01/2025
https://rstats-courses.github.io/CursoR-AEET-2025/materiales.html
Exploring data lets us check its quality and quickly generate and test hypotheses, identifying promising leads to analyze in more depth later.
Visualizing the data is a good start, but on its own it is rarely enough, since the data often need to be transformed first.
More than one variable per column
Source: Data Carpentry
Multiple tables
Source: Data Carpentry
Information encoded in colors
This can be avoided simply by adding a column to the original table.
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
tidylog: reports what each operation does to the dataset
summarytools: produces complete summaries of datasets
read.table(), read.csv(), readRDS(), read.delim()
Useful arguments: sep, dec, comment.char, na.strings, stringsAsFactors
read_csv(), read_csv2(), read_table(), read_delim()
Faster, returns tibbles, does not automatically convert characters to factors, and does not use row names.
Useful arguments: delim, comment, na, col_types, skip_empty_rows, guess_max
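As a minimal sketch of how some of these arguments work together, the example below reads a small text literal instead of a real file. The contents of txt and the -999 missing-value code are made up for illustration, and passing literal data with I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical file contents, passed as literal text via I() (readr >= 2.0)
txt <- "# exported 2025-01-13
site;count
AND;22
BNZ;-999
"

dat <- read_delim(
  I(txt),
  delim = ";",               # field separator
  comment = "#",             # ignore everything after a '#'
  na = c("", "NA", "-999"),  # strings to treat as missing values
  col_types = "cd"           # site = character, count = double
)

dat
```

Setting col_types explicitly also silences the column specification message and guards against a column being guessed as the wrong type.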
read_excel(), read_xls(), read_xlsx()
Useful arguments: sheet, col_types, skip
The here() function always refers to the directory where the project is located
Example using an absolute path:
Example using a path relative to the project:
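The contrast between the two kinds of path might look like the sketch below. The folder and file names ("data", "seeds.csv") and the absolute path are hypothetical, not the ones used in the course.

```r
library(here)

# Hypothetical absolute path: only valid on the machine where it was written
# path_abs <- "C:/Users/me/Documents/my_project/data/seeds.csv"

# Path relative to the project root: here() prepends the root it detects,
# so the same line works on any computer that has the project
path_rel <- here("data", "seeds.csv")
path_rel
```

The resulting string can be passed directly to a reading function such as read_csv().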
Mechanism for chaining functions:
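A small sketch of the idea with base R only, assuming the native pipe |> (R 4.1 or later); the vector x is made up for illustration.

```r
x <- c(1, 4, 9)

# Nested calls: read from the inside out
nested <- sum(sqrt(x))

# Same computation with the native pipe: read from left to right
piped <- x |> sqrt() |> sum()

piped
```

Each step receives the result of the previous one as its first argument, which is why the dplyr verbs that follow can be chained in the same way.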
Rows: 213062 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): site_name, megaplot, plot, plant_ID, species_name, height_diameter_...
dbl (6): trap, year, count, stem_diameter_cm, trap_area_m2, burned
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 213,062
Columns: 14
$ site_name <chr> "AND", "AND", "AND", "AND", "AND", "AND", "AND",…
$ megaplot <chr> "Bare Mountain", "Bare Mountain", "Bare Mountain…
$ plot <chr> "CNCT_01", "CNCT_01", "CNCT_01", "CNCT_01", "CNC…
$ trap <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ plant_ID <chr> "CNCT_01ABAM1", "CNCT_01ABAM1", "CNCT_01ABAM1", …
$ species_name <chr> "Abies_amabilis", "Abies_amabilis", "Abies_amabi…
$ year <dbl> 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, …
$ count <dbl> 22, 0, 0, 2, 0, 2, 108, 0, 0, 7, 0, 0, 2, 0, 12,…
$ stem_diameter_cm <dbl> 56.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ trap_area_m2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ height_diameter_taken <chr> "Breast Height", "Breast Height", "Breast Height…
$ burned <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ general_method <chr> "PARTIALCONECOUNT", "PARTIALCONECOUNT", "PARTIAL…
$ methods_notes <chr> "all cones visible from established view points …
# A tibble: 6 × 14
site_name megaplot plot trap plant_ID species_name year count
<chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1962 22
2 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1963 0
3 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1964 0
4 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1965 2
5 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1966 0
6 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1967 2
# ℹ 6 more variables: stem_diameter_cm <dbl>, trap_area_m2 <dbl>,
# height_diameter_taken <chr>, burned <dbl>, general_method <chr>,
# methods_notes <chr>
arrange() - Sort rows by the values of variables
rename() - Rename variables
relocate() - Reorder variables
select() - Extract variables
# A tibble: 213,062 × 14
site_name megaplot plot trap plant_ID species_name year count
<chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1963 0
2 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1964 0
3 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1966 0
4 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1969 0
5 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1970 0
6 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1972 0
7 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1973 0
8 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1975 0
9 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1977 0
10 AND Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 1981 0
# ℹ 213,052 more rows
# ℹ 6 more variables: stem_diameter_cm <dbl>, trap_area_m2 <dbl>,
# height_diameter_taken <chr>, burned <dbl>, general_method <chr>,
# methods_notes <chr>
From largest to smallest:
# A tibble: 213,062 × 14
site_name megaplot plot trap plant_ID species_name year count
<chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 LUQ 1 1 92 <NA> Cecropia_schreberiana 1997 1114340
2 LUQ 1 1 93 <NA> Ficus_trigonata 2015 106650
3 LUQ 1 1 92 <NA> Ficus_trigonata 2013 69450
4 LUQ 1 1 93 <NA> Ficus_trigonata 2018 44090
5 LUQ 1 1 109 <NA> Ficus_trigonata 2015 39670
6 LUQ 1 1 92 <NA> Ficus_trigonata 2015 35075
7 LUQ 1 1 93 <NA> Ficus_trigonata 2016 33500
8 LUQ 1 1 92 <NA> Ficus_trigonata 2008 33000
9 LUQ 1 1 107 <NA> Ficus_trigonata 2011 32689
10 LUQ 1 1 99 <NA> Cecropia_schreberiana 2009 30594
# ℹ 213,052 more rows
# ℹ 6 more variables: stem_diameter_cm <dbl>, trap_area_m2 <dbl>,
# height_diameter_taken <chr>, burned <dbl>, general_method <chr>,
# methods_notes <chr>
In hierarchical order:
# A tibble: 213,062 × 14
site_name megaplot plot trap plant_ID species_name year count
<chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 AEC adirondack adirondack 971 <NA> Acer_rubrum 2009 191
2 AEC adirondack adirondack 971 <NA> Acer_rubrum 2004 171
3 AEC adirondack adirondack 971 <NA> Acer_rubrum 1995 141
4 AEC adirondack adirondack 941 <NA> Acer_rubrum 1994 105
5 AEC adirondack adirondack 941 <NA> Acer_rubrum 1995 85
6 AEC adirondack adirondack 972 <NA> Acer_rubrum 2007 82
7 AEC adirondack adirondack 971 <NA> Acer_rubrum 2008 81
8 AEC adirondack adirondack 971 <NA> Acer_rubrum 1993 79
9 AEC adirondack adirondack 941 <NA> Acer_rubrum 1991 77
10 AEC adirondack adirondack 938 <NA> Acer_rubrum 2004 72
# ℹ 213,052 more rows
# ℹ 6 more variables: stem_diameter_cm <dbl>, trap_area_m2 <dbl>,
# height_diameter_taken <chr>, burned <dbl>, general_method <chr>,
# methods_notes <chr>
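The three orderings shown above can be sketched on a toy table as follows. The data are hypothetical and the calls are an illustration, not necessarily the exact code behind the printed output.

```r
library(dplyr)

# Hypothetical toy data standing in for the course dataset
toy <- tibble(
  site_name = c("B", "A", "A", "B"),
  species_name = c("x", "y", "x", "x"),
  count = c(5, 0, 12, 7)
)

# Ascending (the default): smallest counts first
asc <- toy |> arrange(count)

# Descending: wrap the variable in desc()
des <- toy |> arrange(desc(count))

# Hierarchical: by site, then species, then descending count within each
hier <- toy |> arrange(site_name, species_name, desc(count))
```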
rename: renamed one variable (site)
# A tibble: 213,062 × 14
site megaplot plot trap plant_ID species_name year count stem_diameter_cm
<chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1962 22 56.6
2 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1963 0 NA
3 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1964 0 NA
4 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1965 2 NA
5 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1966 0 NA
6 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1967 2 NA
7 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1968 108 NA
8 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1969 0 NA
9 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1970 0 NA
10 AND Bare Mo… CNCT… NA CNCT_01… Abies_amabi… 1971 7 NA
# ℹ 213,052 more rows
# ℹ 5 more variables: trap_area_m2 <dbl>, height_diameter_taken <chr>,
# burned <dbl>, general_method <chr>, methods_notes <chr>
relocate: columns reordered (site_name, year, megaplot, plot, trap, …)
# A tibble: 213,062 × 14
site_name year megaplot plot trap plant_ID species_name count
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
1 AND 1962 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 22
2 AND 1963 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 0
3 AND 1964 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 0
4 AND 1965 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 2
5 AND 1966 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 0
6 AND 1967 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 2
7 AND 1968 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 108
8 AND 1969 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 0
9 AND 1970 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 0
10 AND 1971 Bare Mountain CNCT_01 NA CNCT_01ABAM1 Abies_amabilis 7
# ℹ 213,052 more rows
# ℹ 6 more variables: stem_diameter_cm <dbl>, trap_area_m2 <dbl>,
# height_diameter_taken <chr>, burned <dbl>, general_method <chr>,
# methods_notes <chr>
select: dropped 10 variables (megaplot, plot, trap, plant_ID, stem_diameter_cm, …)
# A tibble: 213,062 × 4
site_name year species_name count
<chr> <dbl> <chr> <dbl>
1 AND 1962 Abies_amabilis 22
2 AND 1963 Abies_amabilis 0
3 AND 1964 Abies_amabilis 0
4 AND 1965 Abies_amabilis 2
5 AND 1966 Abies_amabilis 0
6 AND 1967 Abies_amabilis 2
7 AND 1968 Abies_amabilis 108
8 AND 1969 Abies_amabilis 0
9 AND 1970 Abies_amabilis 0
10 AND 1971 Abies_amabilis 7
# ℹ 213,052 more rows
Drop variables:
select: dropped 3 variables (megaplot, plot, trap)
# A tibble: 213,062 × 11
site_name plant_ID species_name year count stem_diameter_cm trap_area_m2
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AND CNCT_01ABAM1 Abies_amabi… 1962 22 56.6 NA
2 AND CNCT_01ABAM1 Abies_amabi… 1963 0 NA NA
3 AND CNCT_01ABAM1 Abies_amabi… 1964 0 NA NA
4 AND CNCT_01ABAM1 Abies_amabi… 1965 2 NA NA
5 AND CNCT_01ABAM1 Abies_amabi… 1966 0 NA NA
6 AND CNCT_01ABAM1 Abies_amabi… 1967 2 NA NA
7 AND CNCT_01ABAM1 Abies_amabi… 1968 108 NA NA
8 AND CNCT_01ABAM1 Abies_amabi… 1969 0 NA NA
9 AND CNCT_01ABAM1 Abies_amabi… 1970 0 NA NA
10 AND CNCT_01ABAM1 Abies_amabi… 1971 7 NA NA
# ℹ 213,052 more rows
# ℹ 4 more variables: height_diameter_taken <chr>, burned <dbl>,
# general_method <chr>, methods_notes <chr>
The select() function lets us select, rename and reorder variables, all at once!
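A minimal sketch of the three actions in a single call, using a hypothetical toy table that borrows some of the original column names:

```r
library(dplyr)

# Hypothetical toy data with some of the original column names
toy <- tibble(
  site_name = c("AND", "BNZ"),
  megaplot = c("m1", "m2"),
  species_name = c("Abies_amabilis", "Picea_glauca"),
  year = c(1962, 1957),
  stem_diameter_cm = c(56.6, NA)
)

# new_name = old_name renames; the order of the arguments sets the column
# order; anything not listed (megaplot) is dropped
small <- toy |>
  select(site = site_name, year, species_name, stem_cm = stem_diameter_cm)

names(small)
```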
Rows: 213,062
Columns: 8
$ site <chr> "AND", "AND", "AND", "AND", "AND", "AND", "AND", "AND", "…
$ year <dbl> 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 197…
$ species_name <chr> "Abies_amabilis", "Abies_amabilis", "Abies_amabilis", "Ab…
$ plant_ID <chr> "CNCT_01ABAM1", "CNCT_01ABAM1", "CNCT_01ABAM1", "CNCT_01A…
$ count <dbl> 22, 0, 0, 2, 0, 2, 108, 0, 0, 7, 0, 0, 2, 0, 12, 0, 21, 1…
$ method <chr> "PARTIALCONECOUNT", "PARTIALCONECOUNT", "PARTIALCONECOUNT…
$ stem_cm <dbl> 56.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ trap_area_m2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Data Frame Summary
dt
Dimensions: 213062 x 1
Duplicates: 212921
-----------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- -------------- ------------------------- -------------------- ------------- ---------- ---------
1 species_name 1. Abies_amabilis 15488 ( 7.3%) I 213062 0
[character] 2. Juniperus_monosperma 10201 ( 4.8%) (100.0%) (0.0%)
3. Amelanchier_arborea 10182 ( 4.8%)
4. Abies_procera 10141 ( 4.8%)
5. Tsuga_mertensiana 9684 ( 4.5%)
6. Acer_saccharum 8403 ( 3.9%)
7. Fraxinus_americana 7997 ( 3.8%)
8. Acer_rubrum 6231 ( 2.9%)
9. Tsuga_canadensis 6078 ( 2.9%)
10. Sassafras_albidum 5200 ( 2.4%)
[ 131 others ] 123457 (57.9%) IIIIIIIIIII
-----------------------------------------------------------------------------------------------------
Data Frame Summary
dt
Dimensions: 213062 x 1
Duplicates: 211712
-----------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ----------- --------------------------- ---------------------- ------- --------- ---------
1 count Mean (sd) : 34.6 (2481.6) 1349 distinct values : 209047 4015
[numeric] min < med < max: : (98.1%) (1.9%)
0 < 0 < 1114340 :
IQR (CV) : 4 (71.7) :
:
-----------------------------------------------------------------------------------------------
distinct() - Extract unique values
mutate() - Create new variables
filter() - Filter cases (rows)
group_by() - Group cases
summarise() - Summarise cases by group
case_when() - Categorize data
Levels of a variable:
Equivalent in base R:
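A short sketch comparing the two approaches on hypothetical data. Both give the same values; the difference is the type of object returned.

```r
library(dplyr)

toy <- tibble(species_name = c("Acer_rubrum", "Abies_amabilis", "Acer_rubrum"))

# dplyr: returns a tibble with one row per unique value
lev_dplyr <- toy |> distinct(species_name)

# base R: returns a plain character vector
lev_base <- unique(toy$species_name)
```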
E.g. convert fruit counts to fruits per m2
mutate: new variable 'fruits_per_m2' (double) with 2,116 unique values and 35% NA
# A tibble: 213,062 × 9
site year species_name plant_ID count method stem_cm trap_area_m2
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 AND 1962 Abies_amabilis CNCT_01ABAM1 22 PARTIALCO… 56.6 NA
2 AND 1963 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
3 AND 1964 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
4 AND 1965 Abies_amabilis CNCT_01ABAM1 2 PARTIALCO… NA NA
5 AND 1966 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
6 AND 1967 Abies_amabilis CNCT_01ABAM1 2 PARTIALCO… NA NA
7 AND 1968 Abies_amabilis CNCT_01ABAM1 108 PARTIALCO… NA NA
8 AND 1969 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
9 AND 1970 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
10 AND 1971 Abies_amabilis CNCT_01ABAM1 7 PARTIALCO… NA NA
# ℹ 213,052 more rows
# ℹ 1 more variable: fruits_per_m2 <dbl>
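The conversion above can be sketched as a single mutate() call. The toy values are hypothetical; the column names follow the printed output.

```r
library(dplyr)

toy <- tibble(count = c(3, 10, 0), trap_area_m2 = c(0.25, 0.5, NA))

# Divide each count by the trap area; rows without an area become NA
toy2 <- toy |> mutate(fruits_per_m2 = count / trap_area_m2)
```

Rows with no trap area end up as NA, which is consistent with the share of missing values that tidylog reports for the new variable.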
filter: removed 208,641 rows (98%), 4,421 rows remaining
# A tibble: 4,421 × 8
site year species_name plant_ID count method stem_cm trap_area_m2
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 BNZ 1957 Picea_glauca <NA> 3 TRAP NA 0.25
2 BNZ 1957 Picea_glauca <NA> 1 TRAP NA 0.25
3 BNZ 1957 Picea_glauca <NA> 2 TRAP NA 0.25
4 BNZ 1957 Picea_glauca <NA> 3 TRAP NA 0.25
5 BNZ 1957 Picea_glauca <NA> 3 TRAP NA 0.25
6 BNZ 1957 Picea_glauca <NA> 0 TRAP NA 0.25
7 BNZ 1957 Picea_glauca <NA> 2 TRAP NA 0.25
8 BNZ 1957 Picea_glauca <NA> 3 TRAP NA 0.25
9 BNZ 1957 Picea_glauca <NA> 0 TRAP NA 0.25
10 BNZ 1957 Picea_glauca <NA> 1 TRAP NA 0.25
# ℹ 4,411 more rows
filter: removed 150,475 rows (71%), 62,587 rows remaining
filter: removed 41,369 rows (66%), 21,218 rows remaining
# A tibble: 21,218 × 8
site year species_name plant_ID count method stem_cm trap_area_m2
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 AND 1962 Abies_amabilis CNCT_01ABAM1 22 PARTIALCO… 56.6 NA
2 AND 1968 Abies_amabilis CNCT_01ABAM1 108 PARTIALCO… NA NA
3 AND 1976 Abies_amabilis CNCT_01ABAM1 12 PARTIALCO… NA NA
4 AND 1978 Abies_amabilis CNCT_01ABAM1 21 PARTIALCO… NA NA
5 AND 1980 Abies_amabilis CNCT_01ABAM1 30 PARTIALCO… NA NA
6 AND 1982 Abies_amabilis CNCT_01ABAM1 61 PARTIALCO… NA NA
7 AND 1985 Abies_amabilis CNCT_01ABAM1 76 PARTIALCO… NA NA
8 AND 1991 Abies_amabilis CNCT_01ABAM1 42 PARTIALCO… NA NA
9 AND 1995 Abies_amabilis CNCT_01ABAM1 75 PARTIALCO… NA NA
10 AND 1997 Abies_amabilis CNCT_01ABAM1 52 PARTIALCO… NA NA
# ℹ 21,208 more rows
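The conditions that produced the filtered tables above are not shown, so the following is only a sketch of common filter() patterns on hypothetical data:

```r
library(dplyr)

toy <- tibble(
  site = c("AND", "AND", "BNZ", "BNZ"),
  species_name = c("Abies_amabilis", "Abies_amabilis",
                   "Picea_glauca", "Picea_glauca"),
  count = c(22, 0, 3, NA)
)

# Keep the rows of one species
picea <- toy |> filter(species_name == "Picea_glauca")

# Conditions separated by commas are combined with AND
and_pos <- toy |> filter(site == "AND", count > 0)

# %in% matches against a set of values; rows where a condition
# evaluates to NA are dropped
pos <- toy |> filter(site %in% c("AND", "BNZ"), count > 0)
```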
group_by: one grouping variable (site)
summarise: now 9 rows and 2 columns, ungrouped
# A tibble: 9 × 2
site fruits
<chr> <dbl>
1 AEC 55731.
2 AND 1968048
3 BNZ 915902
4 CDR 85431
5 CWT 292939
6 HBR 24556.
7 HFR 3683
8 LUQ 3653588
9 SEV 231905.
dt |>
group_by(site) |>
summarise(max_fruit = max(count, na.rm = TRUE),
min_fruit = min(count, na.rm = TRUE))
group_by: one grouping variable (site)
summarise: now 9 rows and 3 columns, ungrouped
# A tibble: 9 × 3
site max_fruit min_fruit
<chr> <dbl> <dbl>
1 AEC 591 0
2 AND 5000 0
3 BNZ 7230 0
4 CDR 151 0
5 CWT 1383 0
6 HBR 244 0
7 HFR 77 0
8 LUQ 1114340 0
9 SEV 1100 0
Create a dataset with the mean number of fruits of each tree species per site and per year:
dt |>
group_by(site, species_name, year) |>
summarise(mean_fruits = mean(count, na.rm = TRUE)) |>
ungroup()
group_by: 3 grouping variables (site, species_name, year)
summarise: now 3,212 rows and 4 columns, 2 group variables remaining (site, species_name)
ungroup: no grouping variables
# A tibble: 3,212 × 4
site species_name year mean_fruits
<chr> <chr> <dbl> <dbl>
1 AEC Acer_rubrum 1988 0
2 AEC Acer_rubrum 1989 3.1
3 AEC Acer_rubrum 1990 0.44
4 AEC Acer_rubrum 1991 9.36
5 AEC Acer_rubrum 1992 3.90
6 AEC Acer_rubrum 1993 4.45
7 AEC Acer_rubrum 1994 9.75
8 AEC Acer_rubrum 1995 6.52
9 AEC Acer_rubrum 1996 6.86
10 AEC Acer_rubrum 1997 1.80
# ℹ 3,202 more rows
Create a new variable based on different fruit levels.
E.g. a factor with 3 levels of fruit quantity:
filter: removed 4,015 rows (2%), 209,047 rows remaining
filter: removed 126,766 rows (61%), 82,281 rows remaining
select: dropped 7 variables (site, year, species_name, plant_ID, method, …)
count
Min. : 0.1
1st Qu.: 2.0
Median : 8.0
Mean : 87.9
3rd Qu.: 35.0
Max. :1114340.0
dt |>
mutate(nivel_frutos = case_when(
count <= 100 ~ "bajo",
count > 100 & count <= 1000 ~ "medio",
count > 1000 ~ "alto"))
mutate: new variable 'nivel_frutos' (character) with 4 unique values and 2% NA
# A tibble: 213,062 × 9
site year species_name plant_ID count method stem_cm trap_area_m2
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 AND 1962 Abies_amabilis CNCT_01ABAM1 22 PARTIALCO… 56.6 NA
2 AND 1963 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
3 AND 1964 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
4 AND 1965 Abies_amabilis CNCT_01ABAM1 2 PARTIALCO… NA NA
5 AND 1966 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
6 AND 1967 Abies_amabilis CNCT_01ABAM1 2 PARTIALCO… NA NA
7 AND 1968 Abies_amabilis CNCT_01ABAM1 108 PARTIALCO… NA NA
8 AND 1969 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
9 AND 1970 Abies_amabilis CNCT_01ABAM1 0 PARTIALCO… NA NA
10 AND 1971 Abies_amabilis CNCT_01ABAM1 7 PARTIALCO… NA NA
# ℹ 213,052 more rows
# ℹ 1 more variable: nivel_frutos <chr>
Count the number of trees at each fruit level:
dt |>
mutate(nivel_frutos = case_when(
count <= 100 ~ "bajo",
count > 100 & count <= 1000 ~ "medio",
count > 1000 ~ "alto")) |>
group_by(nivel_frutos) |>
summarise(trees = n())
mutate: new variable 'nivel_frutos' (character) with 4 unique values and 2% NA
group_by: one grouping variable (nivel_frutos)
summarise: now 4 rows and 2 columns, ungrouped
# A tibble: 4 × 2
nivel_frutos trees
<chr> <int>
1 alto 698
2 bajo 200157
3 medio 8192
4 <NA> 4015
arrange() - Sort rows by the values of variables
rename() - Rename variables
relocate() - Reorder variables
select() - Extract variables
distinct() - Extract unique values
mutate() - Create new variables
filter() - Filter cases (rows)
group_by() - Group cases
summarise() - Summarise cases by group
case_when() - Categorize data
The if_else() function:
dt_fix <- dt |>
  # remove one erroneous value
  mutate(count = if_else(count > 200000, NA, count)) |>
  # compute the number of fruits per m2
  mutate(fruits_per_m2 = count / trap_area_m2) |>
  # use fruits per m2 where available, otherwise the raw count
  mutate(fruits = if_else(is.na(fruits_per_m2), count, fruits_per_m2))
mutate: changed one value (<1%) of 'count' (1 new NA)
mutate: new variable 'fruits_per_m2' (double) with 2,115 unique values and 35% NA
mutate: new variable 'fruits' (double) with 2,305 unique values and 2% NA
dt_fix <- dt |>
  # remove one erroneous value
  mutate(count = if_else(count > 200000, NA, count)) |>
  # compute the number of fruits per m2
  mutate(fruits_per_m2 = count / trap_area_m2) |>
  # use fruits per m2 where available, otherwise the raw count
  mutate(fruits = if_else(is.na(fruits_per_m2), count, fruits_per_m2)) |>
  # drop rows where count is 0 or NA
  filter(count != 0)
mutate: changed one value (<1%) of 'count' (1 new NA)
mutate: new variable 'fruits_per_m2' (double) with 2,115 unique values and 35% NA
mutate: new variable 'fruits' (double) with 2,305 unique values and 2% NA
filter: removed 130,782 rows (61%), 82,280 rows remaining
library(tidyr)
pivot_wider()
pivot_longer()
# A tibble: 6 × 10
site year species_name plant_ID count method stem_cm trap_area_m2
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 AND 1962 Abies_amabilis CNCT_01ABAM1 22 PARTIALCON… 56.6 NA
2 AND 1965 Abies_amabilis CNCT_01ABAM1 2 PARTIALCON… NA NA
3 AND 1967 Abies_amabilis CNCT_01ABAM1 2 PARTIALCON… NA NA
4 AND 1968 Abies_amabilis CNCT_01ABAM1 108 PARTIALCON… NA NA
5 AND 1971 Abies_amabilis CNCT_01ABAM1 7 PARTIALCON… NA NA
6 AND 1974 Abies_amabilis CNCT_01ABAM1 2 PARTIALCON… NA NA
# ℹ 2 more variables: fruits_per_m2 <dbl>, fruits <dbl>
First we create a reduced dataset:
group_by: 2 grouping variables (site, year)
summarise: now 280 rows and 3 columns, one group variable remaining (site)
# A tibble: 280 × 3
# Groups: site [9]
site year fruits
<chr> <dbl> <dbl>
1 AEC 1988 252.
2 AEC 1989 656.
3 AEC 1990 67.3
4 AEC 1991 148.
5 AEC 1992 279.
6 AEC 1993 66.1
7 AEC 1994 375.
8 AEC 1995 343.
9 AEC 1996 250.
10 AEC 1997 233.
# ℹ 270 more rows
Convert to wide format:
dt_short <- dt_fix |>
  group_by(site, year) |>
  summarise(fruits = mean(fruits, na.rm = TRUE)) |>
  pivot_wider(names_from = "site",
              values_from = "fruits")
group_by: 2 grouping variables (site, year)
summarise: now 280 rows and 3 columns, one group variable remaining (site)
pivot_wider: reorganized (site, fruits) into (AEC, AND, BNZ, CDR, CWT, …) [was 280x3, now 65x10]
# A tibble: 6 × 10
year AEC AND BNZ CDR CWT HBR HFR LUQ SEV
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1988 252. 27.0 198. NA NA NA NA NA NA
2 1989 656. 46.6 1406. NA NA NA NA NA NA
3 1990 67.3 26.2 909. NA NA NA NA NA NA
4 1991 148. 140. 1318. NA 154. NA NA NA NA
5 1992 279. 28.4 357. NA 101. NA NA 249. NA
6 1993 66.1 65.1 1683. NA 124. 43.9 NA 269. NA
Convert to long format:
pivot_longer: reorganized (AEC, AND, BNZ, CDR, CWT, …) into (site, fruits) [was 65x10, now 585x3]
# A tibble: 585 × 3
year site fruits
<dbl> <chr> <dbl>
1 1988 AEC 252.
2 1988 AND 27.0
3 1988 BNZ 198.
4 1988 CDR NA
5 1988 CWT NA
6 1988 HBR NA
7 1988 HFR NA
8 1988 LUQ NA
9 1988 SEV NA
10 1989 AEC 656.
# ℹ 575 more rows
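The code for the long-format conversion is not shown above. A minimal sketch of the same kind of reshaping, on a hypothetical two-site table, could be:

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per year, one column per site
wide <- tibble(year = c(1988, 1989), AEC = c(252, 656), AND = c(27, 46.6))

# Gather every column except year into site/fruits pairs
long <- wide |>
  pivot_longer(cols = -year, names_to = "site", values_to = "fruits")

long
```

Site and year combinations with no data in the wide table stay as rows with NA, which explains the NA rows in the output above; pivot_longer() also has a values_drop_na argument to drop them.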
join
We read a new dataset with trait information for the tree species:
Rows: 104 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): species_name, family, genus, epithet, pollinator_code, mycorrhiza_...
dbl (2): seed_development_years, seed_mass_mg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 104
Columns: 17
$ species_name <chr> "Abies_amabilis", "Abies_concolor", "Abies_gran…
$ family <chr> "Pinaceae", "Pinaceae", "Pinaceae", "Pinaceae",…
$ genus <chr> "Abies", "Abies", "Abies", "Abies", "Abies", "A…
$ epithet <chr> "amabilis", "concolor", "grandis", "lasiocarpa"…
$ seed_development_years <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ pollinator_code <chr> "wind", "wind", "wind", "wind", "wind", "wind",…
$ mycorrhiza_type <chr> "EM", "EM", "EM", "EM", "EM", "EM", "AM", "AM",…
$ needleleaf_broadleaf <chr> "needleleaf", "needleleaf", "needleleaf", "need…
$ deciduous_evergreen <chr> "evergreen", "evergreen", "evergreen", "evergre…
$ seed_maturation_timing <chr> "late summer", "fall", "late summer", "late sum…
$ seed_mass_mg <dbl> 46.2063354, 34.2847056, 21.0800075, 13.7327226,…
$ sexual_system <chr> "monoecious", "monoecious", "monoecious", "mono…
$ shade_tolerance <chr> "tolerant", "tolerant", "tolerant", "tolerant",…
$ growth_form <chr> "tree", "tree", "tree", "tree", "tree", "tree",…
$ seed_bank <chr> "no", "no", "no", "no", "yes", "yes", "no", "no…
$ fleshy_fruit <chr> "no", "no", "no", "no", "no", "no", "no", "no",…
$ dispersal_syndrome <chr> "abiotic", "abiotic", "abiotic", "abiotic", "ab…
The count() function counts the number of cases for each level of a categorical variable
# A tibble: 2 × 2
pollinator_code n
<chr> <int>
1 animal 73
2 wind 31
# A tibble: 41 × 2
family n
<chr> <int>
1 Aceraceae 2
2 Annonaceae 2
3 Aquifoliaceae 1
4 Araliaceae 2
5 Arecaceae 1
6 Betulaceae 4
7 Bignoniaceae 2
8 Boraginaceae 2
9 Burseraceae 2
10 Cecropiaceae 1
# ℹ 31 more rows
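As a sketch on hypothetical data, count() can be seen as a shortcut for group_by() followed by summarise(n = n()):

```r
library(dplyr)

# Hypothetical trait table
traits <- tibble(
  species_name = c("sp1", "sp2", "sp3"),
  pollinator_code = c("wind", "animal", "wind")
)

# One row per level, with the number of cases in column n
n_by_poll <- traits |> count(pollinator_code)

# Longer equivalent
n_by_poll2 <- traits |>
  group_by(pollinator_code) |>
  summarise(n = n())
```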
Using left_join()
Rows: 82,280
Columns: 26
$ site <chr> "AND", "AND", "AND", "AND", "AND", "AND", "AND"…
$ year <dbl> 1962, 1965, 1967, 1968, 1971, 1974, 1976, 1978,…
$ species_name <chr> "Abies_amabilis", "Abies_amabilis", "Abies_amab…
$ plant_ID <chr> "CNCT_01ABAM1", "CNCT_01ABAM1", "CNCT_01ABAM1",…
$ count <dbl> 22, 2, 2, 108, 7, 2, 12, 21, 1, 30, 61, 76, 5, …
$ method <chr> "PARTIALCONECOUNT", "PARTIALCONECOUNT", "PARTIA…
$ stem_cm <dbl> 56.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ trap_area_m2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fruits_per_m2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fruits <dbl> 22, 2, 2, 108, 7, 2, 12, 21, 1, 30, 61, 76, 5, …
$ family <chr> "Pinaceae", "Pinaceae", "Pinaceae", "Pinaceae",…
$ genus <chr> "Abies", "Abies", "Abies", "Abies", "Abies", "A…
$ epithet <chr> "amabilis", "amabilis", "amabilis", "amabilis",…
$ seed_development_years <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ pollinator_code <chr> "wind", "wind", "wind", "wind", "wind", "wind",…
$ mycorrhiza_type <chr> "EM", "EM", "EM", "EM", "EM", "EM", "EM", "EM",…
$ needleleaf_broadleaf <chr> "needleleaf", "needleleaf", "needleleaf", "need…
$ deciduous_evergreen <chr> "evergreen", "evergreen", "evergreen", "evergre…
$ seed_maturation_timing <chr> "late summer", "late summer", "late summer", "l…
$ seed_mass_mg <dbl> 46.20634, 46.20634, 46.20634, 46.20634, 46.2063…
$ sexual_system <chr> "monoecious", "monoecious", "monoecious", "mono…
$ shade_tolerance <chr> "tolerant", "tolerant", "tolerant", "tolerant",…
$ growth_form <chr> "tree", "tree", "tree", "tree", "tree", "tree",…
$ seed_bank <chr> "no", "no", "no", "no", "no", "no", "no", "no",…
$ fleshy_fruit <chr> "no", "no", "no", "no", "no", "no", "no", "no",…
$ dispersal_syndrome <chr> "abiotic", "abiotic", "abiotic", "abiotic", "ab…
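The join that produced the table above is not shown. A minimal sketch, assuming the two tables are matched on species_name (the only column they share in the printed structures) and using hypothetical toy tables:

```r
library(dplyr)

# Hypothetical fruit counts (left table)
fruits <- tibble(
  species_name = c("Abies_amabilis", "Abies_amabilis", "Unknown_sp"),
  count = c(22, 2, 5)
)

# Hypothetical species traits (right table)
traits <- tibble(
  species_name = c("Abies_amabilis", "Picea_glauca"),
  family = c("Pinaceae", "Pinaceae")
)

# Keep every row of the left table and add the matching trait columns
joined <- fruits |> left_join(traits, by = "species_name")

joined
```

Rows of the left table with no match keep their data and get NA in the new columns; rows that exist only in the right table are not added.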
write_csv - uses "," as the separator
write_csv2 - uses ";" as the separator
write_delim - uses any separator (e.g. delim = "|")
The parquet format is a very efficient way to store and handle large datasets.
It stores the data by column, compresses better than .csv and even better than .rds, and is faster to work with.
It also allows the data to be partitioned across several files.
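A sketch of these writing options on a hypothetical table, using temporary files. The parquet part assumes the arrow package, which is not in the installation list at the top, so it only runs if arrow is available.

```r
library(readr)

toy <- data.frame(site = c("AND", "BNZ"), fruits = c(22.5, 3))

f_csv <- tempfile(fileext = ".csv")
write_csv(toy, f_csv)                  # "," as separator

f_txt <- tempfile(fileext = ".txt")
write_delim(toy, f_txt, delim = "|")   # any separator

# Read the delimited file back to check the round trip
back <- read_delim(f_txt, delim = "|", show_col_types = FALSE)

# Parquet is not part of readr; this assumes the arrow package is installed
if (requireNamespace("arrow", quietly = TRUE)) {
  f_pq <- tempfile(fileext = ".parquet")
  arrow::write_parquet(toy, f_pq)
  back_pq <- arrow::read_parquet(f_pq)
}
```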
readr, readxl, and googlesheets4
dplyr
tidyr
stringr
forcats
lubridate
Using the final dataset (dt_sp), select the records that have stem diameter information (stem_cm) and sort them from largest to smallest:
filter: removed 80,777 rows (98%), 1,503 rows remaining
# A tibble: 1,503 × 26
site year species_name plant_ID count method stem_cm trap_area_m2
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 AND 1993 Abies_procera CNCT_37ABPR17 250 PARTIALCO… 221. NA
2 AND 1962 Abies_procera CNCT_15ABPR15 50 PARTIALCO… 198. NA
3 AND 1962 Abies_procera CNCT_15ABPR8 68 PARTIALCO… 196. NA
4 AND 1993 Abies_procera CNCT_37ABPR1 350 PARTIALCO… 186. NA
5 AND 1961 Abies_procera CNCT_37ABPR3 231 PARTIALCO… 186. NA
6 AND 1993 Abies_procera CNCT_37ABPR3 410 PARTIALCO… 184. NA
7 AND 1962 Abies_procera CNCT_15ABPR16 130 PARTIALCO… 183. NA
8 AND 1992 Abies_procera CNCT_02ABPR35 54 PARTIALCO… 180. NA
9 AND 1962 Abies_procera CNCT_15ABPR3 300 PARTIALCO… 178. NA
10 AND 1993 Abies_procera CNCT_37ABPR10 250 PARTIALCO… 174. NA
# ℹ 1,493 more rows
# ℹ 18 more variables: fruits_per_m2 <dbl>, fruits <dbl>, family <chr>,
# genus <chr>, epithet <chr>, seed_development_years <dbl>,
# pollinator_code <chr>, mycorrhiza_type <chr>, needleleaf_broadleaf <chr>,
# deciduous_evergreen <chr>, seed_maturation_timing <chr>,
# seed_mass_mg <dbl>, sexual_system <chr>, shade_tolerance <chr>,
# growth_form <chr>, seed_bank <chr>, fleshy_fruit <chr>, …
Using the final dataset (dt_sp), compute the mean stem diameter and its SD for each tree species.
dt_sp |>
filter(!is.na(stem_cm)) |>
group_by(species_name) |>
summarise(mean = mean(stem_cm),
sd = sd(stem_cm))
filter: removed 80,777 rows (98%), 1,503 rows remaining
group_by: one grouping variable (species_name)
summarise: now 10 rows and 3 columns, ungrouped
# A tibble: 10 × 3
species_name mean sd
<chr> <dbl> <dbl>
1 Abies_amabilis 65.0 19.7
2 Abies_concolor 63.1 18.6
3 Abies_grandis 74.5 14.8
4 Abies_lasiocarpa 45.0 16.2
5 Abies_magnifica 87.9 19.9
6 Abies_procera 104. 34.4
7 Picea_engelmannii 80.2 16.5
8 Pinus_lambertiana 114. 27.7
9 Pinus_monticola 63.4 22.4
10 Tsuga_mertensiana 56.5 12.5
Using the dataset (dt), compute the number of trees and the number of species with a stem diameter of 40 cm or more and with a stem diameter below 40 cm.
dt |>
filter(!is.na(stem_cm)) |>
mutate(tree_size = case_when(stem_cm >= 40 ~ "big",
stem_cm < 40 ~ "small")) |>
group_by(tree_size) |>
summarise(n_trees = n(),
n_species = n_distinct(species_name))
filter: removed 210,780 rows (99%), 2,282 rows remaining
mutate: new variable 'tree_size' (character) with 2 unique values and 0% NA
group_by: one grouping variable (tree_size)
summarise: now 2 rows and 3 columns, ungrouped
# A tibble: 2 × 3
tree_size n_trees n_species
<chr> <int> <int>
1 big 2144 10
2 small 138 6
Using the final dataset (dt_sp), select the sites with the "TRAP" counting method and compute the maximum and minimum number of fruits per m2 for each site.
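dt_sp |>
  filter(method == "TRAP") |>
  group_by(site) |>
  summarise(max_fruit = max(fruits_per_m2),
            # mean() is what produced the min_fruit column printed below;
            # use min(fruits_per_m2) to get the minimum the exercise asks for
            min_fruit = mean(fruits_per_m2))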
filter: removed 33,191 rows (40%), 49,089 rows remaining
group_by: one grouping variable (site)
summarise: now 5 rows and 3 columns, ungrouped
# A tibble: 5 × 3
site max_fruit min_fruit
<chr> <dbl> <dbl>
1 AEC 8107. 296.
2 BNZ 28920 1088.
3 CWT 12207. 197.
4 HBR 2440 93.4
5 LUQ 213300 299.
Using the final dataset (dt_sp), build a table comparing the total number of fruits counted at sites CWT and SEV (in columns) for the years 2000 to 2010 (rows).
dt_sp |>
filter(site %in% c("CWT", "SEV")) |>
filter(year %in% c(2000:2010)) |>
group_by(site, year) |>
summarise(fruits = sum(fruits)) |>
pivot_wider(names_from = site, values_from = fruits)
# A tibble: 11 × 3
year CWT SEV
<dbl> <dbl> <dbl>
1 2000 92776. 5561.
2 2001 133160. 28243.
3 2002 45746. 302.
4 2003 63213. 13646.
5 2004 67092. 23964.
6 2005 47034. 20558.
7 2006 84603. 726.
8 2007 114387. 11630.
9 2008 147617. 14634
10 2009 138707. 6
11 2010 139444. 15441.
Using the final dataset (dt_sp), build a table comparing the total number of fruits counted in each year from 2001 to 2005 (in columns) for the Abies species (rows).
dt_sp |>
filter(year %in% c(2001:2005)) |>
filter(str_detect(species_name, "Abies")) |>
group_by(year, species_name) |>
summarise(fruits = sum(fruits)) |>
pivot_wider(names_from = year, values_from = fruits)
# A tibble: 6 × 6
species_name `2001` `2002` `2003` `2004` `2005`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abies_amabilis 721 2819 5907 54 864
2 Abies_concolor 1429 92 3032 NA 136
3 Abies_grandis 3509 238 4414 17 1119
4 Abies_magnifica 52 1374 6324 8 570
5 Abies_procera 3308 7772 10485 957 1588
6 Abies_lasiocarpa NA 443 1974 17 73