09: Zpracování a analýza dat
Tento notebook je výukovým materiálem v předmětu BI-JUL.21 vyučovaném v zimním semestru akademického roku 2023/2024 Tomášem Kalvodou. Tvorba těchto materiálů byla podpořena NVS FIT.
Hlavní stránkou předmětu, kde jsou i další notebooky a zajímavé informace, je jeho Course Pages stránka.
versioninfo()1. DataFrames.jl
Pro práci s daty jistě znáte Pythonovský nástroj pandas.
DataFrames.jl je v podstatě Julia analog tohoto nástroje.
Pokud jste zvyklí pandas používat, může pro vás být užitečné porovnání pandas s Dataframes.jl.
using DataFrames1.1 Jak vytvořit DataFrame?
DataFrame lze vytvořit mnoha způsoby.
Můžeme začít s prázdnou tabulkou a postupně ji naplnit daty, nebo využít existující matici, nebo data načíst z externího souboru.
Prázdný DataFrame vytvoříme velmi snadno:
DataFrame()Keyword argumenty a NamedTuple
Data v pojmenovaných sloupcích můžeme předat pomocí keyword argumentů (klíčové slovo je název sloupce, hodnota data):
df = DataFrame(course=["BI-LA1", "BI-DML", "BI-MA1", "BI-MA2"], semester=[1, 1, 2, 2], department=18105)df.courseTímto způsobem bychom ovšem měli problém zadat data se sloupcích, jejichž názvy obsahují třeba speciální znaky jako mezery.
K tomu můžeme použít slovník, resp. dvojice (klíče mohou být řetězce nebo symboly -- ty jsou doporučené, místo mezer je vhodnější použít podtžítka :slovo_slovo):
df = DataFrame("fiktivní postava" => ["Gandalf", "Harry Potter"], "kniha" => ["Pán prstenů", "Harry Potter a kámen mudrců"])df.var"fiktivní postava"df[:, "fiktivní postava"]Další možností je použít NamedTuple:
nt = (a = 1, b = 3)nt[1], nt[2]nt[:a], nt[:b]Po sloupcích (pozor na jemný rozdíl od výše uvedeného způsobu):
DataFrame((a = [1,2,3], b = [4,5,6]))Po řádcích:
DataFrame([(a = 1, b = 4), (a = 2, b = 5), (a = 3, b = 6)])DataFrame([(a = 1, b = 4, c = nothing), (a = 2, b = 5, c = nothing), (a = 3, b = nothing, c = 6)])Matice
K vytvoření DataFrame můžeme použít i matici, jen musíme vyřešit pojmenování sloupců.
Automaticky (nutno zadat jako druhý argument :auto) budou označeny jako x1, x2, atd.:
DataFrame(rand(2, 3), :auto)DataFrame(rand(Int64, 2, 3), :auto)V druhém argumentu případně můžeme zadat naše názvy sloupců:
DataFrame(rand(2, 3), ["col1", "col2", "col3"])CSV.jl
V tomto notebooku budeme pro některé ukázky používat CSV anonymizovaný export z Grades předmětu BI-MA1 v semestru B212 (2021/2022).
Importu dat z CSV lze snadno docílit pomocí balíčku CSV.jl, který přidáme standardně ] add CSV a pak importujeme:
using CSVNyní stačí použít metodu read z modulu CSV, v prvním argumentu zadáme cestu k souboru a v druhém uvedeme požadovaný výstupní "formát", v našem případě DataFrame:
df = CSV.read(joinpath("files", "bi-ma1-b232.csv"), DataFrame)| Row | username | test1 | test2 | test3 | second_chance | activity | tests_total | gitlab | assessment | exam_test | date | oral_exam | veto | points_total | mark | tutor | percentil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String15 | Float64? | Float64? | Float64? | Float64? | Float64? | Float64 | Float64? | Bool | Int64? | String7? | Float64? | Bool? | Float64? | String1? | String7 | Int64 | |
| 1 | student001 | 3.0 | 11.5 | 13.5 | missing | 5.0 | 28.0 | missing | true | 17 | 13:15 | 34.0 | missing | 67.0 | D | Irena | 58 |
| 2 | student002 | 2.0 | 3.0 | 18.5 | 19.5 | 3.0 | 25.0 | missing | true | 18 | 12:45 | 48.0 | missing | 76.0 | C | Irena | 43 |
| 3 | student003 | missing | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ondra | 6 |
| 4 | student004 | 3.0 | missing | missing | missing | 4.0 | 3.0 | missing | false | missing | missing | missing | missing | missing | missing | Jan V | 13 |
| 5 | student005 | 3.0 | 18.0 | 17.0 | missing | 4.0 | 38.0 | missing | true | 19 | 17:15 | 56.0 | missing | 98.0 | A | Jarda | 93 |
| 6 | student006 | 3.0 | 10.5 | 15.0 | missing | missing | 28.5 | missing | true | 15 | 16:30 | 50.0 | missing | 78.5 | C | Jan S | 59 |
| 7 | student007 | 2.0 | 19.0 | 16.0 | missing | 2.0 | 37.0 | missing | true | 16 | 11:30 | 55.0 | missing | 94.0 | A | Jan V | 91 |
| 8 | student008 | missing | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 6 |
| 9 | student009 | 3.0 | 8.0 | 14.0 | missing | 3.0 | 25.0 | missing | true | 18 | 13:15 | 42.0 | missing | 70.0 | C | Irena | 46 |
| 10 | student010 | 1.0 | 6.0 | 7.0 | missing | missing | 14.0 | missing | false | missing | missing | missing | missing | missing | missing | Jan S | 27 |
| 11 | student011 | 3.0 | 14.5 | 19.0 | missing | 4.0 | 36.5 | missing | true | 17 | 15:00 | 57.0 | missing | 97.5 | A | Jakub | 89 |
| 12 | student012 | 3.0 | 18.0 | 15.0 | missing | missing | 36.0 | missing | true | 17 | 11:30 | 54.0 | missing | 90.0 | A | Jan V | 88 |
| 13 | student013 | 2.0 | missing | missing | missing | missing | 2.0 | missing | false | missing | missing | missing | missing | missing | missing | Irena | 11 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 648 | student648 | 1.0 | 13.5 | 14.5 | missing | missing | 29.0 | missing | true | 15 | 11:00 | 45.0 | missing | 74.0 | C | Jan V | 62 |
| 649 | student649 | 1.0 | missing | missing | missing | missing | 1.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 9 |
| 650 | student650 | 2.5 | 1.0 | 1.0 | missing | missing | 4.5 | missing | false | missing | missing | missing | missing | missing | missing | Ondra | 17 |
| 651 | student651 | 0.0 | 8.0 | 8.0 | missing | 1.0 | 16.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 30 |
| 652 | student652 | 2.0 | 5.5 | 12.5 | 12.5 | 1.0 | 20.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 35 |
| 653 | student653 | 3.0 | 8.5 | 17.5 | missing | missing | 29.0 | missing | true | 16 | 14:45 | 55.0 | missing | 84.0 | B | Jitka | 62 |
| 654 | student654 | 1.0 | 14.0 | 15.0 | missing | missing | 30.0 | missing | true | 16 | 14:30 | 50.0 | missing | 80.0 | B | Jan S | 65 |
| 655 | student655 | 1.5 | 5.0 | missing | missing | missing | 6.5 | missing | false | missing | missing | missing | missing | missing | missing | Irena | 19 |
| 656 | student656 | 2.5 | 5.5 | 17.5 | missing | missing | 25.5 | missing | true | 16 | 10:45 | 54.5 | missing | 80.0 | B | Ondra | 49 |
| 657 | student657 | 1.0 | 15.0 | 6.5 | 16.0 | 1.0 | 25.0 | missing | true | 15 | 14:15 | 42.0 | missing | 68.0 | D | Irena | 41 |
| 658 | student658 | 1.0 | 15.0 | 17.5 | missing | 0.5 | 33.5 | missing | true | 17 | 13:45 | 56.0 | missing | 90.0 | A | Jarda | 79 |
| 659 | student659 | missing | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Jakub | 6 |
K těmto datům se vrátíme podrobněji později. Samotnou tabulku občas použijeme v různých ukázkách.
1.2 Práce s DataFrame
Z částečného výpisu dat výše vyvstává hned několik otázek:
- Jaké sloupce jsou ještě k dispozici?
- Proč je typ pod některými sloupci s otazníkem?
- Co znamená missing?
Pojďme se vydat na průzkum. První, co můžeme zkusit, je podívat se na atributy naší instance (interaktivně pomocí TAB):
df.ParseError: # Error @ In[4]:1:4 df. # └ ── premature end of input Stacktrace: [1] top-level scope @ In[4]:1
df.username659-element Vector{String15}:
"student001"
"student002"
"student003"
"student004"
"student005"
"student006"
"student007"
"student008"
"student009"
"student010"
"student011"
"student012"
"student013"
⋮
"student648"
"student649"
"student650"
"student651"
"student652"
"student653"
"student654"
"student655"
"student656"
"student657"
"student658"
"student659"df.assessment659-element Vector{Bool}:
1
1
0
0
1
1
1
0
1
0
1
1
0
⋮
1
0
0
0
0
1
1
0
1
1
1
0Nebo k tomu můžeme použít metodu names z modulu DataFrames:
names(df)17-element Vector{String}:
"username"
"test1"
"test2"
"test3"
"second_chance"
"activity"
"tests_total"
"gitlab"
"assessment"
"exam_test"
"date"
"oral_exam"
"veto"
"points_total"
"mark"
"tutor"
"percentil"Metoda eachcol vrátí iterátor přes sloupce. Můžeme tak zjistit typ prvků sloupců:
eltype.(eachcol(df))17-element Vector{Type}:
String15
Union{Missing, Float64}
Union{Missing, Float64}
Union{Missing, Float64}
Union{Missing, Float64}
Union{Missing, Float64}
Float64
Union{Missing, Float64}
Bool
Union{Missing, Int64}
Union{Missing, String7}
Union{Missing, Float64}
Union{Missing, Bool}
Union{Missing, Float64}
Union{Missing, String1}
String7
Int64Odtud vidíme, co znamenají otazníky u typů prvků sloupců.
Prvky sloupce jsou složeného typu Union{Missing, T}, v některých sloupcích mohou hodnoty chybět, obsahují hodnotu missing.
Stručné informace můžeme získat pomocí metody describe.
Samozřejmě občas uvedené statistiky nemají moc smysl.
describe(df)| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | Type | |
| 1 | username | student001 | student659 | 0 | String15 | ||
| 2 | test1 | 2.02801 | 0.0 | 2.0 | 3.0 | 70 | Union{Missing, Float64} |
| 3 | test2 | 11.5804 | 0.0 | 12.0 | 20.0 | 68 | Union{Missing, Float64} |
| 4 | test3 | 13.5825 | 0.0 | 14.5 | 20.0 | 150 | Union{Missing, Float64} |
| 5 | second_chance | 12.6964 | 2.5 | 14.0 | 22.5 | 589 | Union{Missing, Float64} |
| 6 | activity | 2.38564 | 0.0 | 2.0 | 5.0 | 334 | Union{Missing, Float64} |
| 7 | tests_total | 22.8437 | 0.0 | 26.5 | 43.0 | 0 | Float64 |
| 8 | gitlab | 0.814815 | 0.5 | 0.5 | 4.0 | 632 | Union{Missing, Float64} |
| 9 | assessment | 0.616085 | false | 1.0 | true | 0 | Bool |
| 10 | exam_test | 16.9113 | -1 | 17.0 | 20 | 253 | Union{Missing, Int64} |
| 11 | date | --- | 17:30 | 254 | Union{Missing, String7} | ||
| 12 | oral_exam | 47.7257 | 0.0 | 49.0 | 57.0 | 278 | Union{Missing, Float64} |
| 13 | veto | 1.0 | true | 1.0 | true | 657 | Union{Missing, Bool} |
| 14 | points_total | 81.5019 | 55.0 | 81.5 | 105.0 | 279 | Union{Missing, Float64} |
| 15 | mark | A | F | 253 | Union{Missing, String1} | ||
| 16 | tutor | Irena | Ondra | 0 | String7 | ||
| 17 | percentil | 50.3612 | 0 | 50.0 | 100 | 0 | Int64 |
Výpis můžeme kontrolovat uvedením konkrétního rozsahu.
describe(df, cols=["test1", "test2", "test3"])| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Float64 | Float64 | Float64 | Float64 | Int64 | Union | |
| 1 | test1 | 2.02801 | 0.0 | 2.0 | 3.0 | 70 | Union{Missing, Float64} |
| 2 | test2 | 11.5804 | 0.0 | 12.0 | 20.0 | 68 | Union{Missing, Float64} |
| 3 | test3 | 13.5825 | 0.0 | 14.5 | 20.0 | 150 | Union{Missing, Float64} |
DataFrame je tabulka, máme pro ní k dispozici podobné metody jako pro matice, na které se také lze dívat jako na "tabulky".
size(df)(659, 17)
nrow(df)659
ncol(df)17
DataFrame lze kopírovat pomocí metody copy, nebo ho vyprázdnit pomocí metod empty, resp. empty!.
ex1 = DataFrame(rand(3, 2), :auto)
ex2 = copy(ex1)| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.781833 | 0.0730139 |
| 2 | 0.689409 | 0.292322 |
| 3 | 0.356713 | 0.156782 |
ex1| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.781833 | 0.0730139 |
| 2 | 0.689409 | 0.292322 |
| 3 | 0.356713 | 0.156782 |
empty!(ex2)
empty(ex1)| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 |
ex1| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.781833 | 0.0730139 |
| 2 | 0.689409 | 0.292322 |
| 3 | 0.356713 | 0.156782 |
ex2| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 |
To by bylo pro začátek vše k základním vlastnostem. Pojďme nyní tabulky modifikovat, upravovat.
Indexování a přístup ke sloupcům
Indexování vychází z maticového zápisu a má podobné vlastnosti. Ke sloupcům můžeme přistupovat několika způsoby:
- pomocí názvu a tečky:
df.mark,df."mark" - pomocí indexace (nevytvoří kopii):
df[!, :mark],df[!, "mark"] - pomocí indexace (vytvoří kopii):
df[:, :mark],df[:, "mark"]
Tímto způsobem můžeme data číst, ale i je modifikovat.
df.mark659-element PooledArrays.PooledVector{Union{Missing, String1}, UInt32, Vector{UInt32}}:
"D"
"C"
missing
missing
"A"
"C"
"A"
missing
"C"
missing
"A"
"A"
missing
⋮
"C"
missing
missing
missing
missing
"B"
"B"
missing
"B"
"D"
"A"
missingdf.mark === df[!, :mark]true
df.mark === df[:, :mark] # kopie!false
df.mark == df[!, :mark]missing
df.mark == df[:, :mark]missing
missing == missingmissing
missing == truemissing
missing == falsemissing
Vedle toho nemusíme používat název sloupce, v pořádku je použít i pořadí. Například prvních deset řádků v druhém a třetím sloupci:
df[1:10, ["test1", "test2", "test3"]]| Row | test1 | test2 | test3 |
|---|---|---|---|
| Float64? | Float64? | Float64? | |
| 1 | 3.0 | 11.5 | 13.5 |
| 2 | 2.0 | 3.0 | 18.5 |
| 3 | missing | missing | missing |
| 4 | 3.0 | missing | missing |
| 5 | 3.0 | 18.0 | 17.0 |
| 6 | 3.0 | 10.5 | 15.0 |
| 7 | 2.0 | 19.0 | 16.0 |
| 8 | missing | missing | missing |
| 9 | 3.0 | 8.0 | 14.0 |
| 10 | 1.0 | 6.0 | 7.0 |
Not, Between, Cols, All
Dále můžeme vybírat pouze potřebné sloupce.
ex = DataFrame(rand(3, 5), :auto)| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.6686 | 0.53519 | 0.180508 | 0.815861 | 0.49856 |
| 2 | 0.676451 | 0.548124 | 0.644572 | 0.724846 | 0.151916 |
| 3 | 0.683903 | 0.441108 | 0.745621 | 0.340693 | 0.95443 |
ex[:, Not(:x2)]| Row | x1 | x3 | x4 | x5 |
|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.6686 | 0.180508 | 0.815861 | 0.49856 |
| 2 | 0.676451 | 0.644572 | 0.724846 | 0.151916 |
| 3 | 0.683903 | 0.745621 | 0.340693 | 0.95443 |
Not(:x2)InvertedIndex{Symbol}(:x2)ex[:, Not([:x2, :x3])]| Row | x1 | x4 | x5 |
|---|---|---|---|
| Float64 | Float64 | Float64 | |
| 1 | 0.6686 | 0.815861 | 0.49856 |
| 2 | 0.676451 | 0.724846 | 0.151916 |
| 3 | 0.683903 | 0.340693 | 0.95443 |
ex[:, Between(:x1, :x3)]| Row | x1 | x2 | x3 |
|---|---|---|---|
| Float64 | Float64 | Float64 | |
| 1 | 0.6686 | 0.53519 | 0.180508 |
| 2 | 0.676451 | 0.548124 | 0.644572 |
| 3 | 0.683903 | 0.441108 | 0.745621 |
ex[:, Cols(:x2, :x4)]| Row | x2 | x4 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.53519 | 0.815861 |
| 2 | 0.548124 | 0.724846 |
| 3 | 0.441108 | 0.340693 |
ex[:, All()]| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.6686 | 0.53519 | 0.180508 | 0.815861 | 0.49856 |
| 2 | 0.676451 | 0.548124 | 0.644572 | 0.724846 | 0.151916 |
| 3 | 0.683903 | 0.441108 | 0.745621 | 0.340693 | 0.95443 |
Výše uvedené ukázky využívají funkcionalitu z DataFrames.jl balíčku.
"Staré maticové" postupy samozřejmě fungují také, např.:
ex[:, [:x2, :x3]]| Row | x2 | x3 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.53519 | 0.180508 |
| 2 | 0.548124 | 0.644572 |
| 3 | 0.441108 | 0.745621 |
Přepisování dat
Údaje můžeme měnit několika způsoby. Vezměme si jednoduchou tabulku:
m = [ "α" 1 0.5; "β" 2 1.5; "γ" 3 2.5]3×3 Matrix{Any}:
"α" 1 0.5
"β" 2 1.5
"γ" 3 2.5ex = DataFrame(m, ["a", "b", "c"])| Row | a | b | c |
|---|---|---|---|
| Any | Any | Any | |
| 1 | α | 1 | 0.5 |
| 2 | β | 2 | 1.5 |
| 3 | γ | 3 | 2.5 |
Změna jedné položky ( je velké ):
ex[1, :a] = "Ξ"
ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Any | |
| 1 | Ξ | 1 | 0.5 |
| 2 | β | 2 | 1.5 |
| 3 | γ | 3 | 2.5 |
Přepsání všech hodnot ve sloupci c:
ex[:, :c] = ["tau", "pi", "omega"]3-element Vector{String}:
"tau"
"pi"
"omega"ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Any | |
| 1 | Ξ | 1 | tau |
| 2 | β | 2 | pi |
| 3 | γ | 3 | omega |
Přepsání jednou hodnotou.
ex[:, :c] .= "ř"3-element view(::Vector{Any}, :) with eltype Any:
"ř"
"ř"
"ř"ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Any | |
| 1 | Ξ | 1 | ř |
| 2 | β | 2 | ř |
| 3 | γ | 3 | ř |
z = ex[:, :c]3-element Vector{Any}:
"ř"
"ř"
"ř"z[2] = "pí""pí"
z3-element Vector{Any}:
"ř"
"pí"
"ř"ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Any | |
| 1 | Ξ | 1 | ř |
| 2 | β | 2 | ř |
| 3 | γ | 3 | ř |
z = ex[!, :c]3-element Vector{Any}:
"ř"
"ř"
"ř"z[2] = "¿""¿"
ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Any | |
| 1 | Ξ | 1 | ř |
| 2 | β | 2 | ¿ |
| 3 | γ | 3 | ř |
ex.c = [true, false, missing]
ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Bool? | |
| 1 | Ξ | 1 | true |
| 2 | β | 2 | false |
| 3 | γ | 3 | missing |
Přepsání hodnot ve sloupci a danou hodnotou, ale pouze v řádcích, kde je hodnota ve sloupci b lichá (zde opět používáme indexování bitovým vektorem, nebo "maskování", s kterým jsme se setkali dříve během semestru):
isodd.(ex.b)3-element BitVector: 1 0 1
ex[isodd.(ex.b), :a] .= "λ"
ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Bool? | |
| 1 | λ | 1 | true |
| 2 | β | 2 | false |
| 3 | λ | 3 | missing |
ex[iseven.(ex.b), :]| Row | a | b | c |
|---|---|---|---|
| Any | Any | Bool? | |
| 1 | β | 2 | false |
Občas je nutné projít tabulku řádek po řádku.
V tom případě můžeme iterovat přes index.
Například se pokusme změnit první sloupec na řetězec obsahující daný symbol se spodním indexem daným číslem ve sloupci b v LaTeX notaci (tj. λ_1 atd.).
for row_id = axes(ex, 1) # 1:nrow(df)
ex[row_id, :a] = ex[row_id, :a] * "_" * string(ex[row_id, :b])
end
ex| Row | a | b | c |
|---|---|---|---|
| Any | Any | Bool? | |
| 1 | λ_1 | 1 | true |
| 2 | β_2 | 2 | false |
| 3 | λ_3 | 3 | missing |
for row in eachrow(ex)
println(row)
println(row[:a])
endDataFrameRow
Row │ a b c
│ Any Any Bool?
─────┼─────────────────
1 │ λ_1 1 true
λ_1
DataFrameRow
Row │ a b c
│ Any Any Bool?
─────┼─────────────────
2 │ β_2 2 false
β_2
DataFrameRow
Row │ a b c
│ Any Any Bool?
─────┼───────────────────
3 │ λ_3 3 missing
λ_3
Zobrazovaní (show, first, last, view)
Pokud je DataFrame příliš velký, tak při jeho zobrazení dojde k ořezání (řádků i sloupců). Toto chování můžeme přebít pomocí parametrů metody show.
U našeho BI-ZMA příkladu je to lehce overkill.
show(df)659×17 DataFrame
Row │ username test1 test2 test3 second_chance activity ⋯
│ String15 Float64? Float64? Float64? Float64? Float64? ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ student001 3.0 11.5 13.5 missing 5.0 ⋯
2 │ student002 2.0 3.0 18.5 19.5 3.0
3 │ student003 missing missing missing missing missing
4 │ student004 3.0 missing missing missing 4.0
5 │ student005 3.0 18.0 17.0 missing 4.0 ⋯
6 │ student006 3.0 10.5 15.0 missing missing
7 │ student007 2.0 19.0 16.0 missing 2.0
8 │ student008 missing missing missing missing missing
9 │ student009 3.0 8.0 14.0 missing 3.0 ⋯
10 │ student010 1.0 6.0 7.0 missing missing
11 │ student011 3.0 14.5 19.0 missing 4.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
650 │ student650 2.5 1.0 1.0 missing missing
651 │ student651 0.0 8.0 8.0 missing 1.0 ⋯
652 │ student652 2.0 5.5 12.5 12.5 1.0
653 │ student653 3.0 8.5 17.5 missing missing
654 │ student654 1.0 14.0 15.0 missing missing
655 │ student655 1.5 5.0 missing missing missing ⋯
656 │ student656 2.5 5.5 17.5 missing missing
657 │ student657 1.0 15.0 6.5 16.0 1.0
658 │ student658 1.0 15.0 17.5 missing 0.5
659 │ student659 missing missing missing missing missing ⋯
11 columns and 638 rows omittedshow(df, allcols=true)659×17 DataFrame
Row │ username test1 test2 test3 second_chance activity tests_total gitlab assessment exam_test date oral_exam veto points_total mark tutor percentil
│ String15 Float64? Float64? Float64? Float64? Float64? Float64 Float64? Bool Int64? String7? Float64? Bool? Float64? String1? String7 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ student001 3.0 11.5 13.5 missing 5.0 28.0 missing true 17 13:15 34.0 missing 67.0 D Irena 58
2 │ student002 2.0 3.0 18.5 19.5 3.0 25.0 missing true 18 12:45 48.0 missing 76.0 C Irena 43
3 │ student003 missing missing missing missing missing 0.0 missing false missing missing missing missing missing missing Ondra 6
4 │ student004 3.0 missing missing missing 4.0 3.0 missing false missing missing missing missing missing missing Jan V 13
5 │ student005 3.0 18.0 17.0 missing 4.0 38.0 missing true 19 17:15 56.0 missing 98.0 A Jarda 93
6 │ student006 3.0 10.5 15.0 missing missing 28.5 missing true 15 16:30 50.0 missing 78.5 C Jan S 59
7 │ student007 2.0 19.0 16.0 missing 2.0 37.0 missing true 16 11:30 55.0 missing 94.0 A Jan V 91
8 │ student008 missing missing missing missing missing 0.0 missing false missing missing missing missing missing missing Jitka 6
9 │ student009 3.0 8.0 14.0 missing 3.0 25.0 missing true 18 13:15 42.0 missing 70.0 C Irena 46
10 │ student010 1.0 6.0 7.0 missing missing 14.0 missing false missing missing missing missing missing missing Jan S 27
11 │ student011 3.0 14.5 19.0 missing 4.0 36.5 missing true 17 15:00 57.0 missing 97.5 A Jakub 89
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
650 │ student650 2.5 1.0 1.0 missing missing 4.5 missing false missing missing missing missing missing missing Ondra 17
651 │ student651 0.0 8.0 8.0 missing 1.0 16.0 missing false missing missing missing missing missing missing Jitka 30
652 │ student652 2.0 5.5 12.5 12.5 1.0 20.0 missing false missing missing missing missing missing missing Jitka 35
653 │ student653 3.0 8.5 17.5 missing missing 29.0 missing true 16 14:45 55.0 missing 84.0 B Jitka 62
654 │ student654 1.0 14.0 15.0 missing missing 30.0 missing true 16 14:30 50.0 missing 80.0 B Jan S 65
655 │ student655 1.5 5.0 missing missing missing 6.5 missing false missing missing missing missing missing missing Irena 19
656 │ student656 2.5 5.5 17.5 missing missing 25.5 missing true 16 10:45 54.5 missing 80.0 B Ondra 49
657 │ student657 1.0 15.0 6.5 16.0 1.0 25.0 missing true 15 14:15 42.0 missing 68.0 D Irena 41
658 │ student658 1.0 15.0 17.5 missing 0.5 33.5 missing true 17 13:45 56.0 missing 90.0 A Jarda 79
659 │ student659 missing missing missing missing missing 0.0 missing false missing missing missing missing missing missing Jakub 6
638 rows omittedToto by byl ještě větší overkill.
# show(df, allrows=true)Případně můžeme prozkoumávat začátek a konec tabulky.
first(df, 10)| Row | username | test1 | test2 | test3 | second_chance | activity | tests_total | gitlab | assessment | exam_test | date | oral_exam | veto | points_total | mark | tutor | percentil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String15 | Float64? | Float64? | Float64? | Float64? | Float64? | Float64 | Float64? | Bool | Int64? | String7? | Float64? | Bool? | Float64? | String1? | String7 | Int64 | |
| 1 | student001 | 3.0 | 11.5 | 13.5 | missing | 5.0 | 28.0 | missing | true | 17 | 13:15 | 34.0 | missing | 67.0 | D | Irena | 58 |
| 2 | student002 | 2.0 | 3.0 | 18.5 | 19.5 | 3.0 | 25.0 | missing | true | 18 | 12:45 | 48.0 | missing | 76.0 | C | Irena | 43 |
| 3 | student003 | missing | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ondra | 6 |
| 4 | student004 | 3.0 | missing | missing | missing | 4.0 | 3.0 | missing | false | missing | missing | missing | missing | missing | missing | Jan V | 13 |
| 5 | student005 | 3.0 | 18.0 | 17.0 | missing | 4.0 | 38.0 | missing | true | 19 | 17:15 | 56.0 | missing | 98.0 | A | Jarda | 93 |
| 6 | student006 | 3.0 | 10.5 | 15.0 | missing | missing | 28.5 | missing | true | 15 | 16:30 | 50.0 | missing | 78.5 | C | Jan S | 59 |
| 7 | student007 | 2.0 | 19.0 | 16.0 | missing | 2.0 | 37.0 | missing | true | 16 | 11:30 | 55.0 | missing | 94.0 | A | Jan V | 91 |
| 8 | student008 | missing | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 6 |
| 9 | student009 | 3.0 | 8.0 | 14.0 | missing | 3.0 | 25.0 | missing | true | 18 | 13:15 | 42.0 | missing | 70.0 | C | Irena | 46 |
| 10 | student010 | 1.0 | 6.0 | 7.0 | missing | missing | 14.0 | missing | false | missing | missing | missing | missing | missing | missing | Jan S | 27 |
last(df, 10)| Row | username | test1 | test2 | test3 | second_chance | activity | tests_total | gitlab | assessment | exam_test | date | oral_exam | veto | points_total | mark | tutor | percentil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String15 | Float64? | Float64? | Float64? | Float64? | Float64? | Float64 | Float64? | Bool | Int64? | String7? | Float64? | Bool? | Float64? | String1? | String7 | Int64 | |
| 1 | student650 | 2.5 | 1.0 | 1.0 | missing | missing | 4.5 | missing | false | missing | missing | missing | missing | missing | missing | Ondra | 17 |
| 2 | student651 | 0.0 | 8.0 | 8.0 | missing | 1.0 | 16.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 30 |
| 3 | student652 | 2.0 | 5.5 | 12.5 | 12.5 | 1.0 | 20.0 | missing | false | missing | missing | missing | missing | missing | missing | Jitka | 35 |
| 4 | student653 | 3.0 | 8.5 | 17.5 | missing | missing | 29.0 | missing | true | 16 | 14:45 | 55.0 | missing | 84.0 | B | Jitka | 62 |
| 5 | student654 | 1.0 | 14.0 | 15.0 | missing | missing | 30.0 | missing | true | 16 | 14:30 | 50.0 | missing | 80.0 | B | Jan S | 65 |
| 6 | student655 | 1.5 | 5.0 | missing | missing | missing | 6.5 | missing | false | missing | missing | missing | missing | missing | missing | Irena | 19 |
| 7 | student656 | 2.5 | 5.5 | 17.5 | missing | missing | 25.5 | missing | true | 16 | 10:45 | 54.5 | missing | 80.0 | B | Ondra | 49 |
| 8 | student657 | 1.0 | 15.0 | 6.5 | 16.0 | 1.0 | 25.0 | missing | true | 15 | 14:15 | 42.0 | missing | 68.0 | D | Irena | 41 |
| 9 | student658 | 1.0 | 15.0 | 17.5 | missing | 0.5 | 33.5 | missing | true | 17 | 13:45 | 56.0 | missing | 90.0 | A | Jarda | 79 |
| 10 | student659 | missing | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Jakub | 6 |
Pro rychlé prozkoumávání částí tabulky slouží metoda view, resp. makro @view, která jen část tabulky zobrazí.
Nevytváří nový objekt, měla by být efektivnější.
view(df, 100:150, [:username, :points_total, :mark])| Row | username | points_total | mark |
|---|---|---|---|
| String15 | Float64? | String1? | |
| 1 | student100 | 91.5 | A |
| 2 | student101 | 80.5 | B |
| 3 | student102 | missing | missing |
| 4 | student103 | 73.0 | C |
| 5 | student104 | 75.5 | C |
| 6 | student105 | missing | missing |
| 7 | student106 | 81.5 | B |
| 8 | student107 | 76.0 | C |
| 9 | student108 | missing | missing |
| 10 | student109 | 76.5 | C |
| 11 | student110 | missing | missing |
| 12 | student111 | 57.0 | E |
| 13 | student112 | missing | missing |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 40 | student139 | 89.0 | B |
| 41 | student140 | 74.0 | C |
| 42 | student141 | 90.0 | A |
| 43 | student142 | 64.5 | D |
| 44 | student143 | missing | missing |
| 45 | student144 | missing | missing |
| 46 | student145 | 78.0 | C |
| 47 | student146 | missing | missing |
| 48 | student147 | 80.5 | B |
| 49 | student148 | missing | F |
| 50 | student149 | missing | missing |
| 51 | student150 | 100.5 | A |
@view df[10:15, [:username, :mark]]| Row | username | mark |
|---|---|---|
| String15 | String1? | |
| 1 | student010 | missing |
| 2 | student011 | A |
| 3 | student012 | A |
| 4 | student013 | missing |
| 5 | student014 | missing |
| 6 | student015 | D |
Přidávání a odebírání sloupců
K přidávání sloupců slouží metoda insertcols!:
ex = DataFrame(rand(2, 2), :auto)| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.894499 | 0.176588 |
| 2 | 0.0582584 | 0.709324 |
insertcols!(ex, 1, :a => [pi, pi])| Row | a | x1 | x2 |
|---|---|---|---|
| Irration… | Float64 | Float64 | |
| 1 | π | 0.894499 | 0.176588 |
| 2 | π | 0.0582584 | 0.709324 |
insertcols!(ex, 1, :a2 => [pi, -pi])| Row | a2 | a | x1 | x2 |
|---|---|---|---|---|
| Float64 | Irration… | Float64 | Float64 | |
| 1 | 3.14159 | π | 0.894499 | 0.176588 |
| 2 | -3.14159 | π | 0.0582584 | 0.709324 |
insertcols!(ex, 3, :b => "⊕")| Row | a2 | a | b | x1 | x2 |
|---|---|---|---|---|---|
| Float64 | Irration… | String | Float64 | Float64 | |
| 1 | 3.14159 | π | ⊕ | 0.894499 | 0.176588 |
| 2 | -3.14159 | π | ⊕ | 0.0582584 | 0.709324 |
insertcols!(ex, 5, :c => [1, 2])| Row | a2 | a | b | x1 | c | x2 |
|---|---|---|---|---|---|---|
| Float64 | Irration… | String | Float64 | Int64 | Float64 | |
| 1 | 3.14159 | π | ⊕ | 0.894499 | 1 | 0.176588 |
| 2 | -3.14159 | π | ⊕ | 0.0582584 | 2 | 0.709324 |
Ale lze použít i prosté indexování:
ex[:, :d] = [42, 42]
ex| Row | a2 | a | b | x1 | c | x2 | d |
|---|---|---|---|---|---|---|---|
| Float64 | Irration… | String | Float64 | Int64 | Float64 | Int64 | |
| 1 | 3.14159 | π | ⊕ | 0.894499 | 1 | 0.176588 | 42 |
| 2 | -3.14159 | π | ⊕ | 0.0582584 | 2 | 0.709324 | 42 |
Mazání sloupců musíme provést pomocí indexace, neexistuje metoda "dropcolumns!".
ex[:, Not(:b)]| Row | a2 | a | x1 | c | x2 | d |
|---|---|---|---|---|---|---|
| Float64 | Irration… | Float64 | Int64 | Float64 | Int64 | |
| 1 | 3.14159 | π | 0.894499 | 1 | 0.176588 | 42 |
| 2 | -3.14159 | π | 0.0582584 | 2 | 0.709324 | 42 |
Přidávání a odebírání řádků
Vytvořme si zase testovací tabulku:
ex = DataFrame(rand(2, 2), :auto)| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
Přidat řádek lze opět několika způsoby. Nejpřirozenější je asi metoda push!:
push!(ex, [1.0, 2.0])| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
push!(ex, [1, 2])| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
push!(ex, (0.5, 2.3))| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | 0.5 | 2.3 |
push!(ex, ["a", "b"])┌ Error: Error adding value to column :x1. Maybe you forgot passing `promote=true`? └ @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:810
MethodError: Cannot `convert` an object of type String to an object of type Float64
The function `convert` exists, but no method is defined for this combination of argument types.
Closest candidates are:
convert(::Type{T}, ::T) where T<:Number
@ Base number.jl:6
convert(::Type{T}, ::CartesianIndex{1}) where T<:Number
@ Base multidimensional.jl:136
convert(::Type{T}, ::AbstractChar) where T<:Number
@ Base char.jl:185
...
Stacktrace:
[1] push!(a::Vector{Float64}, item::String)
@ Base ./array.jl:1249
[2] _row_inserter!(df::DataFrame, loc::Int64, row::Vector{String}, mode::Val{:push}, promote::Bool)
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:776
[3] #push!#339
@ ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:545 [inlined]
[4] push!(df::DataFrame, row::Vector{String})
@ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:538
[5] top-level scope
@ In[98]:1push!(ex, (x2 = 0.5, x1 = 0.3))| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | 0.5 | 2.3 |
| 6 | 0.3 | 0.5 |
push!(ex, (x2 = 7, x1 = 5))| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | 0.5 | 2.3 |
| 6 | 0.3 | 0.5 |
| 7 | 5.0 | 7.0 |
Pomocí append! můžeme spojovat několik tabulek "vertikálně" dohromady:
append!(ex, DataFrame(rand(10, 2), :auto))| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | 0.5 | 2.3 |
| 6 | 0.3 | 0.5 |
| 7 | 5.0 | 7.0 |
| 8 | 0.320972 | 0.87072 |
| 9 | 0.110998 | 0.615922 |
| 10 | 0.113922 | 0.459498 |
| 11 | 0.603263 | 0.847671 |
| 12 | 0.347329 | 0.584542 |
| 13 | 0.505099 | 0.781451 |
| 14 | 0.466649 | 0.651833 |
| 15 | 0.208853 | 0.79092 |
| 16 | 0.544047 | 0.636684 |
| 17 | 0.590173 | 0.677212 |
Mazat řádky můžeme explicitně pomocí delete!:
delete!(ex, 3)| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.0686912 | 0.786712 |
| 2 | 0.263486 | 0.322541 |
| 3 | 1.0 | 2.0 |
| 4 | 0.5 | 2.3 |
| 5 | 0.3 | 0.5 |
| 6 | 5.0 | 7.0 |
| 7 | 0.320972 | 0.87072 |
| 8 | 0.110998 | 0.615922 |
| 9 | 0.113922 | 0.459498 |
| 10 | 0.603263 | 0.847671 |
| 11 | 0.347329 | 0.584542 |
| 12 | 0.505099 | 0.781451 |
| 13 | 0.466649 | 0.651833 |
| 14 | 0.208853 | 0.79092 |
| 15 | 0.544047 | 0.636684 |
| 16 | 0.590173 | 0.677212 |
delete!(ex, [1, 2, 3])| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.5 | 2.3 |
| 2 | 0.3 | 0.5 |
| 3 | 5.0 | 7.0 |
| 4 | 0.320972 | 0.87072 |
| 5 | 0.110998 | 0.615922 |
| 6 | 0.113922 | 0.459498 |
| 7 | 0.603263 | 0.847671 |
| 8 | 0.347329 | 0.584542 |
| 9 | 0.505099 | 0.781451 |
| 10 | 0.466649 | 0.651833 |
| 11 | 0.208853 | 0.79092 |
| 12 | 0.544047 | 0.636684 |
| 13 | 0.590173 | 0.677212 |
ex.x1 .> 0.513-element BitVector: 0 0 1 0 0 0 1 0 1 0 0 1 1
delete!(ex, ex.x1 .> 0.5)| Row | x1 | x2 |
|---|---|---|
| Float64 | Float64 | |
| 1 | 0.5 | 2.3 |
| 2 | 0.3 | 0.5 |
| 3 | 0.320972 | 0.87072 |
| 4 | 0.110998 | 0.615922 |
| 5 | 0.113922 | 0.459498 |
| 6 | 0.347329 | 0.584542 |
| 7 | 0.466649 | 0.651833 |
| 8 | 0.208853 | 0.79092 |
To zdaleka není všechno
Výše jsme shrnuli zřejmě ty nejužitečnější metody a techniky.
Tím ale možnosti DataFrames.jl zdaleka nekončí.
Zvídavému čtenáři doporučujeme prolétnout dokumentaci.
Na tomto místě snad jen upozorněme na následující:
- Joins.
- Grouping.
- Přehledné porovnání metod s pandas/R/Stata.
- Balíček
Query.jlumožňující pracovat s různými datovými zdroji. - Balíček
DataFramesMeta.jlposkytující užitečná makra pro práci s tabulkami. - Cheat Sheet
Velmi sofistikované transformace dat v tabulkách lze také provádět pomocí následujících metod:
combineselect/select!transform/transform!
Jejich podrobný výklad je už nad rámec tohoto kurzu.
2. Cvičení: analýza BI-MA1 v B212, B222, B232
Pojďme se podrobněji podívat na data z předmětu BI-MA1.
Nejprve znovu načteme data z CSV souboru, který je syrovým exportem z Grades, kde jsou anonymizováni uživatelská jména studentů.
using CSVdf1 = CSV.read(joinpath("files", "bi-ma1-b212.csv"), DataFrame)
df2 = CSV.read(joinpath("files", "bi-ma1-b222.csv"), DataFrame)
df3 = CSV.read(joinpath("files", "bi-ma1-b232.csv"), DataFrame);describe(df1)| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | Type | |
| 1 | username | student001 | student528 | 0 | String15 | ||
| 2 | test1 | 8.45199 | 0.0 | 8.0 | 20.0 | 101 | Union{Missing, Float64} |
| 3 | test2 | 13.3188 | 0.0 | 14.0 | 20.0 | 219 | Union{Missing, Float64} |
| 4 | second_chance | 15.0588 | 0.0 | 15.0 | 23.5 | 460 | Union{Missing, Float64} |
| 5 | activity | 1.95629 | 0.0 | 2.0 | 5.0 | 242 | Union{Missing, Float64} |
| 6 | tests_total | 14.928 | 0.0 | 15.5 | 39.5 | 0 | Float64 |
| 7 | gitlab | 1.30769 | 0.0 | 1.0 | 7.0 | 489 | Union{Missing, Float64} |
| 8 | assessment | 0.395833 | false | 0.0 | true | 0 | Bool |
| 9 | exam_test | 16.7081 | 0 | 17.0 | 20 | 319 | Union{Missing, Int64} |
| 10 | date | --- | 16:50 | 319 | Union{Missing, String7} | ||
| 11 | oral_exam | 48.3026 | 28.0 | 50.0 | 60.0 | 338 | Union{Missing, Float64} |
| 12 | veto | 1.0 | true | 1.0 | true | 525 | Union{Missing, Bool} |
| 13 | points_total | 79.6217 | 55.0 | 80.0 | 103.5 | 339 | Union{Missing, Float64} |
| 14 | mark | A | F | 319 | Union{Missing, String1} | ||
| 15 | tutor | Honza | Tomáš | 8 | Union{Missing, String7} | ||
| 16 | percentil | 53.1136 | 23 | 50.0 | 100 | 0 | Int64 |
size(df1)(528, 16)
nrow(df1)528
describe(df2)| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | Type | |
| 1 | username | student0001 | student0760 | 0 | String15 | ||
| 2 | test1 | 1.69818 | 0.0 | 1.5 | 3.0 | 99 | Union{Missing, Float64} |
| 3 | test2 | 10.6672 | 0.0 | 11.0 | 20.0 | 132 | Union{Missing, Float64} |
| 4 | test3 | 12.6893 | 0.0 | 13.5 | 20.0 | 245 | Union{Missing, Float64} |
| 5 | second_chance | 14.5402 | 5.0 | 15.0 | 22.0 | 673 | Union{Missing, Float64} |
| 6 | activity | 2.27454 | -6.0 | 2.0 | 5.0 | 383 | Union{Missing, Float64} |
| 7 | tests_total | 19.1148 | 0.0 | 24.5 | 43.0 | 2 | Union{Missing, Float64} |
| 8 | gitlab | 2.23684 | 1 | 1.0 | 12 | 722 | Union{Missing, Int64} |
| 9 | assessment | 0.497361 | false | 0.0 | true | 2 | Union{Missing, Bool} |
| 10 | exam_test | 16.4377 | -1 | 17.0 | 20 | 383 | Union{Missing, Int64} |
| 11 | date | --- | CT 14:30 | 383 | Union{Missing, String15} | ||
| 12 | oral_exam | 46.8988 | 20.0 | 48.0 | 57.0 | 419 | Union{Missing, Float64} |
| 13 | veto | 1.0 | true | 1.0 | true | 757 | Union{Missing, Bool} |
| 14 | points_total | 80.3235 | 54.5 | 80.5 | 105.0 | 420 | Union{Missing, Float64} |
| 15 | mark | A | F | 383 | Union{Missing, String1} | ||
| 16 | tutor | Irena | Petr | 2 | Union{Missing, String7} | ||
| 17 | percentil | 50.9513 | 12 | 50.0 | 100 | 0 | Int64 |
size(df2)(760, 17)
nrow(df2)760
describe(df3)| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | Type | |
| 1 | username | student001 | student659 | 0 | String15 | ||
| 2 | test1 | 2.02801 | 0.0 | 2.0 | 3.0 | 70 | Union{Missing, Float64} |
| 3 | test2 | 11.5804 | 0.0 | 12.0 | 20.0 | 68 | Union{Missing, Float64} |
| 4 | test3 | 13.5825 | 0.0 | 14.5 | 20.0 | 150 | Union{Missing, Float64} |
| 5 | second_chance | 12.6964 | 2.5 | 14.0 | 22.5 | 589 | Union{Missing, Float64} |
| 6 | activity | 2.38564 | 0.0 | 2.0 | 5.0 | 334 | Union{Missing, Float64} |
| 7 | tests_total | 22.8437 | 0.0 | 26.5 | 43.0 | 0 | Float64 |
| 8 | gitlab | 0.814815 | 0.5 | 0.5 | 4.0 | 632 | Union{Missing, Float64} |
| 9 | assessment | 0.616085 | false | 1.0 | true | 0 | Bool |
| 10 | exam_test | 16.9113 | -1 | 17.0 | 20 | 253 | Union{Missing, Int64} |
| 11 | date | --- | 17:30 | 254 | Union{Missing, String7} | ||
| 12 | oral_exam | 47.7257 | 0.0 | 49.0 | 57.0 | 278 | Union{Missing, Float64} |
| 13 | veto | 1.0 | true | 1.0 | true | 657 | Union{Missing, Bool} |
| 14 | points_total | 81.5019 | 55.0 | 81.5 | 105.0 | 279 | Union{Missing, Float64} |
| 15 | mark | A | F | 253 | Union{Missing, String1} | ||
| 16 | tutor | Irena | Ondra | 0 | String7 | ||
| 17 | percentil | 50.3612 | 0 | 50.0 | 100 | 0 | Int64 |
size(df3)(659, 17)
2.1 Základní údaje
Nejprve prozkoumejte dostupné sloupce a pokuste se zjistit základní údaje jako
- Kolik studentů mělo předmět zapsáno?
- Kolik studentů získalo zápočet a dokončilo předmět?
- Kolik studentů bylo v jakém ročníku? (V prvním běhu BI-MA1 toto nemá smysl.)
- ...
Počet studentů.
nrow(df1)528
nrow(df2)760
nrow(df3)659
Kolik studentů má zápočet?
sum(df1.assessment)209
length(filter(isone, df1.assessment))209
sum(df2.assessment)missing
sum(skipmissing(df2.assessment))377
sum(df2[(!ismissing).(df2.assessment), :assessment])377
filter(x -> !ismissing(x), df2.assessment)758-element Vector{Union{Missing, Bool}}:
1
1
0
0
0
0
1
1
1
1
0
1
0
⋮
1
0
0
1
1
1
1
0
1
1
0
1filter(x -> !ismissing(x), df2.assessment) |> sum377
sum(dropmissing(df2, :assessment).assessment)377
coalesce.(df2.assessment, false) |> sum377
sum(df3.assessment)406
Tj. průchodnost zápočtu.
sum(df1.assessment) / nrow(df1)0.3958333333333333
sum(coalesce.(df2.assessment, 0)) / nrow(df2)0.49605263157894736
sum(df3.assessment) / nrow(df3)0.6160849772382397
Počet vet v tabulce.
sum(df1.veto)missing
missing + 1 # aha! pozor na missing hodnotymissing
sum(dropmissing(df1[:, Cols(:veto)]).veto)3
sum(filter(x -> !ismissing(x), df1.veto))3
sum(dropmissing(df1, :veto).veto)3
sum(dropmissing(df2, :veto).veto)3
Získalo zápočet a úspěšně dokončilo předmět:
length(filter(x -> x != "F", dropmissing(df1, :mark).mark))189
sum(dropmissing(df1, :mark).mark .!= "F")189
sum(dropmissing(df2, :mark).mark .!= "F")340
sum(dropmissing(df3, :mark).mark .!= "F")380
Počet studentů, kteří získali zápočet a současně neuspěli u zkoušky.
sum(df1.assessment .&& df1.mark .== "F")20
sum(coalesce.(df2.assessment, false) .&& df2.mark .== "F")37
sum(coalesce.(df3.assessment, false) .&& df3.mark .== "F")26
2.2 Zápočtové písemky, kvízy a semestr
Poté se podívejme podrobněji jak probíhalo získávání zápočtů.
- Prozkoumejte výsledky jednotlivých zápočtových písemek, vytvořte historgramy výsledků.
- Kolika studentům zápočet "těsně" unikl?
- Kolik studentů "odpadlo" už v první polovině semestru?
- ...
using PyPlot, StatisticsHistogram prvního zápočtové testu.
plt.grid()
hist(collect(skipmissing(df3.test1)), bins=[ -0.49 + j for j in 0:4]);Pozor, se samotným skipmissing nefunguje.
skipmissing(df3.test1)skipmissing(Union{Missing, Float64}[3.0, 2.0, missing, 3.0, 3.0, 3.0, 2.0, missing, 3.0, 1.0 … 2.5, 0.0, 2.0, 3.0, 1.0, 1.5, 2.5, 1.0, 1.0, missing])collect(skipmissing(df3.test1))589-element Vector{Float64}:
3.0
2.0
3.0
3.0
3.0
2.0
3.0
1.0
3.0
3.0
2.0
2.5
2.5
⋮
3.0
1.0
1.0
2.5
0.0
2.0
3.0
1.0
1.5
2.5
1.0
1.0mean(skipmissing(df3.test1))2.0280135823429544
median(skipmissing(df3.test1))2.0
length(collect(skipmissing(df3.test1)))589
nrow(df3) - length(df3.test1[ismissing.(df3.test1)])589
Třetí běh lze porovnat s druhým během!
plt.grid()
hist(collect(skipmissing(df2.test1)), bins=[ -0.49 + j for j in 0:4]);Histogram druhého zápočtového testu.
plt.grid()
hist(collect(skipmissing(df3.test2)), bins=[ -0.49 + j for j in 0:21]);plt.grid()
hist(collect(skipmissing(df2.test2)), bins=[ -0.49 + j for j in 0:21]);Třetí test.
plt.grid()
hist(collect(skipmissing(df2.test3)), bins=[ -0.49 + j for j in 0:21]);plt.grid()
hist(collect(skipmissing(df2.test3)), bins=[ -0.49 + j for j in 0:21]);Počet studentů píšících testy v druhém běhu.
df2.test1 |> skipmissing |> collect |> length661
df2.test2 |> skipmissing |> collect |> length628
df2.test3 |> skipmissing |> collect |> length515
A v třetím běhu.
df3.test1 |> skipmissing |> collect |> length589
df3.test2 |> skipmissing |> collect |> length591
df3.test3 |> skipmissing |> collect |> length509
Počet studentů, kteří získali zápočet bez opravné zápočtové písemky.
sum(coalesce.(df1.test1 + df1.test2, 0) .>= 25.)165
sum(coalesce.(df2.test1 + df2.test2 + df2.test3, 0) .>= 25.)325
sum(coalesce.(df3.test1 + df3.test2 + df3.test3, 0) .>= 25.)367
Počet studentů s alespoň 20 body ze zápočtových písemek.
nrow(df1[coalesce.(df1.tests_total .>= 20, false), :])234
nrow(df2[coalesce.(df2.tests_total .>= 20, false), :])418
nrow(df3[coalesce.(df3.tests_total .>= 20, false), :])444
plt.grid()
hist(collect(skipmissing(df3.test1 + df3.test2 + df3.test3)), bins=[ -0.49 + j for j in 0:44]);plt.grid()
hist(collect(skipmissing(df2.test1 + df2.test2 + df2.test3)), bins=[ -0.49 + j for j in 0:44]);Histogram opravné zápočtové písemky.
plt.grid()
hist(filter(x -> !ismissing(x), df1.second_chance), bins=[ -0.49 + j for j in 0:25]);plt.grid()
hist(filter(x -> !ismissing(x), df2.second_chance), bins=[ -0.49 + j for j in 0:25]);Histogram celkového počtu bodů ze semestru.
plt.grid()
hist(filter(x -> !ismissing(x), df1.tests_total), bins=[ -0.49 + j for j in 0:41]);plt.grid()
hist(filter(x -> !ismissing(x), df2.tests_total), bins=[ -0.49 + j for j in 0:44]);plt.grid()
hist(filter(x -> !ismissing(x), df3.tests_total), bins=[ -0.49 + j for j in 0:44]);Úspěšnost zisku zápočtu podle cvičících.
groupby(df1, :tutor)GroupedDataFrame with 7 groups based on key: tutor
| Row | username | test1 | test2 | second_chance | activity | tests_total | gitlab | assessment | exam_test | date | oral_exam | veto | points_total | mark | tutor | percentil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String15 | Float64? | Float64? | Float64? | Float64? | Float64 | Float64? | Bool | Int64? | String7? | Float64? | Bool? | Float64? | String1? | String7? | Int64 | |
| 1 | student001 | 2.0 | missing | missing | missing | 2.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 29 |
| 2 | student007 | 9.5 | 6.0 | missing | missing | 15.5 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 50 |
| 3 | student010 | 0.0 | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 4 | student022 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 5 | student023 | 2.0 | 8.5 | missing | missing | 10.5 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 46 |
| 6 | student037 | 2.5 | missing | missing | missing | 2.5 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 30 |
| 7 | student038 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 8 | student044 | 12.0 | 16.5 | missing | 1.5 | 28.5 | missing | true | 18 | 11:20 | 60.0 | missing | 90.0 | A | Honza | 83 |
| 9 | student045 | 9.0 | 13.5 | 15.0 | missing | 25.0 | missing | true | 15 | 11:20 | 51.0 | missing | 76.0 | C | Honza | 71 |
| 10 | student047 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 11 | student068 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 12 | student073 | 10.0 | 10.0 | 15.0 | missing | 25.0 | missing | true | 15 | 10:20 | 47.0 | missing | 72.0 | C | Honza | 71 |
| 13 | student077 | 11.0 | 15.0 | missing | missing | 26.0 | missing | true | 17 | 10:45 | 50.0 | missing | 76.0 | C | Honza | 75 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 77 | student473 | 4.5 | missing | missing | missing | 4.5 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 37 |
| 78 | student477 | 3.0 | missing | missing | missing | 3.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 33 |
| 79 | student482 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 80 | student484 | 0.0 | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 23 |
| 81 | student488 | 13.0 | 16.0 | missing | 2.5 | 29.0 | missing | true | 19 | 12:20 | 54.0 | missing | 85.5 | B | Honza | 84 |
| 82 | student500 | 5.0 | 12.0 | missing | missing | 17.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 52 |
| 83 | student505 | 6.0 | 14.5 | 23.5 | 0.0 | 25.0 | missing | true | 17 | 10:20 | 31.0 | missing | 56.0 | E | Honza | 58 |
| 84 | student506 | 1.0 | missing | missing | missing | 1.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 26 |
| 85 | student518 | 8.0 | 14.0 | 17.0 | missing | 25.0 | missing | true | 16 | 16:50 | 42.0 | missing | 67.0 | D | Honza | 71 |
| 86 | student520 | 1.0 | missing | missing | missing | 1.0 | missing | false | missing | missing | missing | missing | missing | missing | Honza | 26 |
| 87 | student524 | 5.5 | 19.5 | missing | missing | 25.0 | missing | true | 15 | 11:00 | 30.0 | missing | 55.0 | E | Honza | 71 |
| 88 | student526 | 10.5 | 17.0 | missing | 2.0 | 27.5 | missing | true | 13 | --- | missing | missing | missing | F | Honza | 80 |
⋮
| Row | username | test1 | test2 | second_chance | activity | tests_total | gitlab | assessment | exam_test | date | oral_exam | veto | points_total | mark | tutor | percentil |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String15 | Float64? | Float64? | Float64? | Float64? | Float64 | Float64? | Bool | Int64? | String7? | Float64? | Bool? | Float64? | String1? | String7? | Int64 | |
| 1 | student020 | 11.5 | 9.5 | 10.5 | missing | 21.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 59 |
| 2 | student032 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 23 |
| 3 | student041 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 23 |
| 4 | student058 | 5.0 | missing | missing | missing | 5.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 38 |
| 5 | student060 | 1.0 | missing | missing | missing | 1.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 26 |
| 6 | student070 | 0.0 | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 23 |
| 7 | student075 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 23 |
| 8 | student095 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 23 |
| 9 | student098 | 4.0 | 6.5 | missing | missing | 10.5 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 46 |
| 10 | student100 | 5.0 | missing | missing | missing | 5.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 38 |
| 11 | student111 | 7.0 | missing | missing | missing | 7.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 42 |
| 12 | student112 | 9.0 | 10.0 | missing | missing | 19.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 55 |
| 13 | student120 | 9.0 | 11.0 | 10.5 | missing | 20.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 57 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 37 | student381 | 5.0 | missing | missing | missing | 5.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 38 |
| 38 | student391 | 9.0 | 13.5 | 17.0 | missing | 25.0 | missing | true | 10 | --- | missing | missing | missing | F | Ivo | 71 |
| 39 | student414 | 12.5 | 20.0 | missing | missing | 32.5 | 0.5 | true | 17 | 15:30 | 41.0 | missing | 74.0 | C | Ivo | 92 |
| 40 | student425 | 4.0 | missing | missing | missing | 4.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 36 |
| 41 | student432 | 9.0 | 6.5 | missing | missing | 15.5 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 50 |
| 42 | student451 | 0.5 | 5.0 | missing | missing | 5.5 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 39 |
| 43 | student454 | 13.0 | 9.0 | 11.0 | missing | 22.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 59 |
| 44 | student478 | 3.5 | missing | missing | missing | 3.5 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 34 |
| 45 | student479 | 9.5 | 14.0 | 13.0 | missing | 23.5 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 61 |
| 46 | student511 | missing | missing | missing | missing | 0.0 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 23 |
| 47 | student512 | 2.5 | 6.0 | missing | 1.0 | 8.5 | missing | false | missing | missing | missing | missing | missing | missing | Ivo | 44 |
| 48 | student528 | 13.0 | 13.0 | missing | 1.0 | 26.0 | 0.0 | true | 15 | 10:00 | 42.0 | missing | 69.0 | D | Ivo | 75 |
df = combine(groupby(df1, :tutor),
nrow => "studenti",
:assessment => sum => "zápočty",
:test1 => (x -> sum(coalesce.(x, 0) .> 0)) => "psali 1. test",
:activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df| Row | tutor | studenti | zápočty | psali 1. test | průměrná aktivita | propustnost |
|---|---|---|---|---|---|---|
| String7? | Int64 | Int64 | Int64 | Float64 | Float64 | |
| 1 | Honza | 88 | 26 | 58 | 0.375 | 0.295455 |
| 2 | Irena | 95 | 30 | 63 | 1.02105 | 0.315789 |
| 3 | Jitka | 95 | 34 | 78 | 1.23158 | 0.357895 |
| 4 | Pavel | 96 | 38 | 78 | 1.75 | 0.395833 |
| 5 | missing | 8 | 0 | 0 | 0.0 | 0.0 |
| 6 | Tomáš | 98 | 69 | 90 | 1.38265 | 0.704082 |
| 7 | Ivo | 48 | 12 | 39 | 0.1875 | 0.25 |
df = combine(groupby(df2, :tutor),
nrow => "studenti",
:assessment => (x -> sum(collect(skipmissing(x)))) => "zápočty",
:test1 => (x -> sum(coalesce.(x, 0) .> 0)) => "psali 1. test",
:activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df| Row | tutor | studenti | zápočty | psali 1. test | průměrná aktivita | propustnost |
|---|---|---|---|---|---|---|
| String7? | Int64 | Int64 | Int64 | Float64 | Float64 | |
| 1 | Jan V | 146 | 64 | 115 | 0.424658 | 0.438356 |
| 2 | Pavel | 149 | 84 | 120 | 1.07383 | 0.563758 |
| 3 | Jitka | 73 | 41 | 62 | 1.76027 | 0.561644 |
| 4 | Petr | 99 | 39 | 64 | 0.459596 | 0.393939 |
| 5 | Jiřina | 68 | 36 | 54 | 1.49265 | 0.529412 |
| 6 | Irena | 74 | 41 | 68 | 1.5 | 0.554054 |
| 7 | Jarda | 100 | 57 | 90 | 2.46 | 0.57 |
| 8 | Jan S | 49 | 15 | 33 | 0.0612245 | 0.306122 |
| 9 | missing | 2 | 0 | 0 | 0.0 | 0.0 |
df = combine(groupby(df3, :tutor),
nrow => "studenti",
:assessment => (x -> sum(collect(skipmissing(x)))) => "zápočty",
:test1 => (x -> sum(coalesce.(x, 0) .> 0)) => "psali 1. test",
:activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df| Row | tutor | studenti | zápočty | psali 1. test | průměrná aktivita | propustnost |
|---|---|---|---|---|---|---|
| String7 | Int64 | Int64 | Int64 | Float64 | Float64 | |
| 1 | Irena | 99 | 66 | 90 | 2.33333 | 0.666667 |
| 2 | Ondra | 94 | 51 | 79 | 0.205676 | 0.542553 |
| 3 | Jan V | 144 | 77 | 126 | 0.673611 | 0.534722 |
| 4 | Jarda | 95 | 76 | 89 | 2.59474 | 0.8 |
| 5 | Jan S | 47 | 36 | 42 | 0.255319 | 0.765957 |
| 6 | Jitka | 96 | 60 | 87 | 1.20312 | 0.625 |
| 7 | Jakub | 84 | 40 | 48 | 0.642857 | 0.47619 |
2.3 Zkouškové období
V BI-MA1 hodně rozdílné vůči BI-ZMA.
Histogram známek.
combine(groupby(df1, :mark), nrow => "studenti") |> sort| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 42 |
| 2 | B | 57 |
| 3 | C | 53 |
| 4 | D | 32 |
| 5 | E | 5 |
| 6 | F | 20 |
| 7 | missing | 319 |
combine(groupby(df2, :mark), nrow => "studenti") |> sort| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 79 |
| 2 | B | 104 |
| 3 | C | 110 |
| 4 | D | 41 |
| 5 | E | 6 |
| 6 | F | 37 |
| 7 | missing | 383 |
combine(groupby(df3, :mark), nrow => "studenti") |> sort| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 108 |
| 2 | B | 121 |
| 3 | C | 107 |
| 4 | D | 35 |
| 5 | E | 9 |
| 6 | F | 26 |
| 7 | missing | 253 |
sort(combine(groupby(df1[coalesce.(df1.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 12 |
| 2 | B | 7 |
| 3 | C | 8 |
| 4 | D | 2 |
| 5 | F | 4 |
| 6 | missing | 5 |
sort(combine(groupby(df2[coalesce.(df2.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 13 |
| 2 | B | 12 |
| 3 | C | 4 |
| 4 | D | 1 |
| 5 | F | 2 |
| 6 | missing | 6 |
sort(combine(groupby(df3[coalesce.(df3.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 9 |
| 2 | B | 4 |
| 3 | C | 5 |
| 4 | D | 1 |
| 5 | F | 1 |
| 6 | missing | 7 |
sort(combine(groupby(df1[coalesce.(df1.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 29 |
| 2 | B | 25 |
| 3 | C | 13 |
| 4 | D | 10 |
| 5 | F | 4 |
| 6 | missing | 15 |
sort(combine(groupby(df2[coalesce.(df2.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 47 |
| 2 | B | 39 |
| 3 | C | 32 |
| 4 | D | 6 |
| 5 | F | 7 |
| 6 | missing | 28 |
sort(combine(groupby(df3[coalesce.(df3.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)| Row | mark | studenti |
|---|---|---|
| String1? | Int64 | |
| 1 | A | 58 |
| 2 | B | 43 |
| 3 | C | 16 |
| 4 | D | 7 |
| 5 | F | 7 |
| 6 | missing | 16 |
plt.grid()
hist(filter(x -> !ismissing(x), df1.oral_exam), bins=[ -0.49 + j for j in 0:61]);plt.grid()
hist(filter(x -> !ismissing(x), df2.oral_exam), bins=[ -0.49 + j for j in 0:58]);plt.grid()
hist(filter(x -> !ismissing(x), df3.oral_exam), bins=[ -0.49 + j for j in 0:58]);plt.grid()
hist(filter(x -> !ismissing(x), df1.exam_test), bins=[ -0.49 + j for j in 0:21]);plt.grid()
hist(filter(x -> !ismissing(x), df2.exam_test), bins=[ -0.49 + j for j in 0:21]);plt.grid()
hist(filter(x -> !ismissing(x), df3.exam_test), bins=[ -0.49 + j for j in 0:21]);combine(groupby(df1, :exam_test), nrow => "počet") |> sort| Row | exam_test | počet |
|---|---|---|
| Int64? | Int64 | |
| 1 | 0 | 4 |
| 2 | 10 | 1 |
| 3 | 11 | 1 |
| 4 | 12 | 3 |
| 5 | 13 | 2 |
| 6 | 14 | 5 |
| 7 | 15 | 29 |
| 8 | 16 | 37 |
| 9 | 17 | 40 |
| 10 | 18 | 37 |
| 11 | 19 | 34 |
| 12 | 20 | 16 |
| 13 | missing | 319 |
combine(groupby(df1[df1.assessment .=== true, :], :exam_test), nrow => "počet") |> sort| Row | exam_test | počet |
|---|---|---|
| Int64? | Int64 | |
| 1 | 0 | 4 |
| 2 | 10 | 1 |
| 3 | 11 | 1 |
| 4 | 12 | 3 |
| 5 | 13 | 2 |
| 6 | 14 | 5 |
| 7 | 15 | 29 |
| 8 | 16 | 37 |
| 9 | 17 | 40 |
| 10 | 18 | 37 |
| 11 | 19 | 34 |
| 12 | 20 | 16 |
Reference
Další informace, detaily či tutoriály lze nalézt v dokumentaci DataFrames.jl.
V notebooku jsme používali balíček CSV.jl.