Jdi na navigaci předmětu

09: Zpracování a analýza dat

Tento notebook je výukovým materiálem v předmětu BI-JUL.21 vyučovaném v zimním semestru akademického roku 2023/2024 Tomášem Kalvodou. Tvorba těchto materiálů byla podpořena NVS FIT.

Hlavní stránkou předmětu, kde jsou i další notebooky a zajímavé informace, je jeho Course Pages stránka.

versioninfo()
Julia Version 1.12.0
Commit b907bd0600f (2025-10-07 15:42 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)

1. DataFrames.jl

Pro práci s daty jistě znáte Pythonovský nástroj pandas. DataFrames.jl je v podstatě Julia analog tohoto nástroje. Pokud jste zvyklí pandas používat, může pro vás být užitečné porovnání pandas s Dataframes.jl.

using DataFrames

1.1 Jak vytvořit DataFrame?

DataFrame lze vytvořit mnoha způsoby. Můžeme začít s prázdnou tabulkou a postupně ji naplnit daty, nebo využít existující matici, nebo data načíst z externího souboru.

Prázdný DataFrame vytvoříme velmi snadno:

DataFrame()
0×0 DataFrame

Keyword argumenty a NamedTuple

Data v pojmenovaných sloupcích můžeme předat pomocí keyword argumentů (klíčové slovo je název sloupce, hodnota data):

df = DataFrame(course=["BI-LA1", "BI-DML", "BI-MA1", "BI-MA2"], semester=[1, 1, 2, 2], department=18105)
4×3 DataFrame
Rowcoursesemesterdepartment
StringInt64Int64
1BI-LA1118105
2BI-DML118105
3BI-MA1218105
4BI-MA2218105
df.course
4-element Vector{String}:
 "BI-LA1"
 "BI-DML"
 "BI-MA1"
 "BI-MA2"
df[:, :course]
4-element Vector{String}:
 "BI-LA1"
 "BI-DML"
 "BI-MA1"
 "BI-MA2"
df[:, 1]
4-element Vector{String}:
 "BI-LA1"
 "BI-DML"
 "BI-MA1"
 "BI-MA2"

Tímto způsobem bychom ovšem měli problém zadat data se sloupcích, jejichž názvy obsahují třeba speciální znaky jako mezery. K tomu můžeme použít slovník, resp. dvojice (klíče mohou být řetězce nebo symboly -- ty jsou doporučené, místo mezer je vhodnější použít podtžítka :slovo_slovo):

df = DataFrame("fiktivní postava" => ["Gandalf", "Harry Potter"], "kniha" => ["Pán prstenů", "Harry Potter a kámen mudrců"])
2×2 DataFrame
Rowfiktivní postavakniha
StringString
1GandalfPán prstenů
2Harry PotterHarry Potter a kámen mudrců
df.kniha
2-element Vector{String}:
 "Pán prstenů"
 "Harry Potter a kámen mudrců"
df."fiktivní postava"
2-element Vector{String}:
 "Gandalf"
 "Harry Potter"
df.var"fiktivní postava"
2-element Vector{String}:
 "Gandalf"
 "Harry Potter"
df[:, "fiktivní postava"]
2-element Vector{String}:
 "Gandalf"
 "Harry Potter"

Další možností je použít NamedTuple:

nt = (a = 1, b = 3)
(a = 1, b = 3)
nt[1], nt[2]
(1, 3)
nt[:a], nt[:b]
(1, 3)

Po sloupcích (pozor na jemný rozdíl od výše uvedeného způsobu):

DataFrame((a = [1,2,3], b = [4,5,6]))
3×2 DataFrame
Rowab
Int64Int64
114
225
336

Po řádcích:

DataFrame([(a = 1, b = 4), (a = 2, b = 5), (a = 3, b = 6)])
3×2 DataFrame
Rowab
Int64Int64
114
225
336
df = DataFrame([(a = 1, b = 4, c = nothing), (a = 2, b = 5, c = nothing), (a = 3, b = nothing, c = 6)])
3×3 DataFrame
Rowabc
Int64Union…Union…
114
225
336
describe(df)
3×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…Union…Union…Union…Int64Type
1a2.012.030Int64
2b0Union{Nothing, Int64}
3c0Union{Nothing, Int64}

Matice

K vytvoření DataFrame můžeme použít i matici, jen musíme vyřešit pojmenování sloupců. Automaticky (nutno zadat jako druhý argument :auto) budou označeny jako x1, x2, atd.:

DataFrame(rand(2, 3), :auto)
2×3 DataFrame
Rowx1x2x3
Float64Float64Float64
10.651770.8011150.532499
20.9674830.3281160.519358
DataFrame(rand(Int64, 2, 3), :auto)
2×3 DataFrame
Rowx1x2x3
Int64Int64Int64
1-791905511288938853728280979160939006413985183950275755778
2-25469590188418362744269260951158923426-544554040784498185

V druhém argumentu případně můžeme zadat naše názvy sloupců:

DataFrame(rand(2, 3), ["col1", "col2", "col3"])
2×3 DataFrame
Rowcol1col2col3
Float64Float64Float64
10.1964010.368870.919708
20.0107290.009471550.520836

CSV.jl

V tomto notebooku budeme pro některé ukázky používat CSV anonymizovaný export z Grades předmětu BI-MA1 v semestru B212 (2021/2022). Importu dat z CSV lze snadno docílit pomocí balíčku CSV.jl, který přidáme standardně ] add CSV a pak importujeme:

using CSV

Nyní stačí použít metodu read z modulu CSV, v prvním argumentu zadáme cestu k souboru a v druhém uvedeme požadovaný výstupní "formát", v našem případě DataFrame:

df = CSV.read(joinpath("files", "bi-ma1-b232.csv"), DataFrame)
659×17 DataFrame
634 rows omitted
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student0013.011.513.5missing5.028.0missingtrue1713:1534.0missing67.0DIrena58
2student0022.03.018.519.53.025.0missingtrue1812:4548.0missing76.0CIrena43
3student003missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingOndra6
4student0043.0missingmissingmissing4.03.0missingfalsemissingmissingmissingmissingmissingmissingJan V13
5student0053.018.017.0missing4.038.0missingtrue1917:1556.0missing98.0AJarda93
6student0063.010.515.0missingmissing28.5missingtrue1516:3050.0missing78.5CJan S59
7student0072.019.016.0missing2.037.0missingtrue1611:3055.0missing94.0AJan V91
8student008missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJitka6
9student0093.08.014.0missing3.025.0missingtrue1813:1542.0missing70.0CIrena46
10student0101.06.07.0missingmissing14.0missingfalsemissingmissingmissingmissingmissingmissingJan S27
11student0113.014.519.0missing4.036.5missingtrue1715:0057.0missing97.5AJakub89
12student0123.018.015.0missingmissing36.0missingtrue1711:3054.0missing90.0AJan V88
13student0132.0missingmissingmissingmissing2.0missingfalsemissingmissingmissingmissingmissingmissingIrena11
648student6481.013.514.5missingmissing29.0missingtrue1511:0045.0missing74.0CJan V62
649student6491.0missingmissingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingJitka9
650student6502.51.01.0missingmissing4.5missingfalsemissingmissingmissingmissingmissingmissingOndra17
651student6510.08.08.0missing1.016.0missingfalsemissingmissingmissingmissingmissingmissingJitka30
652student6522.05.512.512.51.020.0missingfalsemissingmissingmissingmissingmissingmissingJitka35
653student6533.08.517.5missingmissing29.0missingtrue1614:4555.0missing84.0BJitka62
654student6541.014.015.0missingmissing30.0missingtrue1614:3050.0missing80.0BJan S65
655student6551.55.0missingmissingmissing6.5missingfalsemissingmissingmissingmissingmissingmissingIrena19
656student6562.55.517.5missingmissing25.5missingtrue1610:4554.5missing80.0BOndra49
657student6571.015.06.516.01.025.0missingtrue1514:1542.0missing68.0DIrena41
658student6581.015.017.5missing0.533.5missingtrue1713:4556.0missing90.0AJarda79
659student659missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJakub6

K těmto datům se vrátíme podrobněji později. Samotnou tabulku občas použijeme v různých ukázkách.


1.2 Práce s DataFrame

Z částečného výpisu dat výše vyvstává hned několik otázek:

  • Jaké sloupce jsou ještě k dispozici?
  • Proč je typ pod některými sloupci s otazníkem?
  • Co znamená missing?

Pojďme se vydat na průzkum. První, co můžeme zkusit, je podívat se na atributy naší instance (interaktivně pomocí TAB):

#df.
df.username
659-element Vector{String15}:
 "student001"
 "student002"
 "student003"
 "student004"
 "student005"
 "student006"
 "student007"
 "student008"
 "student009"
 "student010"
 "student011"
 "student012"
 "student013"
 ⋮
 "student648"
 "student649"
 "student650"
 "student651"
 "student652"
 "student653"
 "student654"
 "student655"
 "student656"
 "student657"
 "student658"
 "student659"
df.assessment
659-element Vector{Bool}:
 1
 1
 0
 0
 1
 1
 1
 0
 1
 0
 1
 1
 0
 ⋮
 1
 0
 0
 0
 0
 1
 1
 0
 1
 1
 1
 0

Nebo k tomu můžeme použít metodu names z modulu DataFrames:

names(df)
17-element Vector{String}:
 "username"
 "test1"
 "test2"
 "test3"
 "second_chance"
 "activity"
 "tests_total"
 "gitlab"
 "assessment"
 "exam_test"
 "date"
 "oral_exam"
 "veto"
 "points_total"
 "mark"
 "tutor"
 "percentil"

Metoda eachcol vrátí iterátor přes sloupce. Můžeme tak zjistit typ prvků sloupců:

eltype.(eachcol(df))
17-element Vector{Type}:
 String15
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Float64}
 Float64
 Union{Missing, Float64}
 Bool
 Union{Missing, Int64}
 Union{Missing, String7}
 Union{Missing, Float64}
 Union{Missing, Bool}
 Union{Missing, Float64}
 Union{Missing, String1}
 String7
 Int64

Odtud vidíme, co znamenají otazníky u typů prvků sloupců. Prvky sloupce jsou složeného typu Union{Missing, T}, v některých sloupcích mohou hodnoty chybět, obsahují hodnotu missing.

Stručné informace můžeme získat pomocí metody describe. Samozřejmě občas uvedené statistiky nemají moc smysl.

describe(df)
17×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent001student6590String15
2test12.028010.02.03.070Union{Missing, Float64}
3test211.58040.012.020.068Union{Missing, Float64}
4test313.58250.014.520.0150Union{Missing, Float64}
5second_chance12.69642.514.022.5589Union{Missing, Float64}
6activity2.385640.02.05.0334Union{Missing, Float64}
7tests_total22.84370.026.543.00Float64
8gitlab0.8148150.50.54.0632Union{Missing, Float64}
9assessment0.616085false1.0true0Bool
10exam_test16.9113-117.020253Union{Missing, Int64}
11date---17:30254Union{Missing, String7}
12oral_exam47.72570.049.057.0278Union{Missing, Float64}
13veto1.0true1.0true657Union{Missing, Bool}
14points_total81.501955.081.5105.0279Union{Missing, Float64}
15markAF253Union{Missing, String1}
16tutorIrenaOndra0String7
17percentil50.3612050.01000Int64

Výpis můžeme kontrolovat uvedením konkrétního rozsahu.

describe(df, cols=["test1", "test2", "test3"])
3×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolFloat64Float64Float64Float64Int64Union
1test12.028010.02.03.070Union{Missing, Float64}
2test211.58040.012.020.068Union{Missing, Float64}
3test313.58250.014.520.0150Union{Missing, Float64}

DataFrame je tabulka, máme pro ní k dispozici podobné metody jako pro matice, na které se také lze dívat jako na "tabulky".

size(df)
(659, 17)
nrow(df)
659
ncol(df)
17

DataFrame lze kopírovat pomocí metody copy, nebo ho vyprázdnit pomocí metod empty, resp. empty!.

ex1 = DataFrame(rand(3, 2), :auto)
ex2 = copy(ex1)
3×2 DataFrame
Rowx1x2
Float64Float64
10.3401030.561576
20.3484260.452242
30.7725110.292019
ex1
3×2 DataFrame
Rowx1x2
Float64Float64
10.3401030.561576
20.3484260.452242
30.7725110.292019
empty!(ex2)
empty(ex1)
0×2 DataFrame
Rowx1x2
Float64Float64
ex1
3×2 DataFrame
Rowx1x2
Float64Float64
10.3401030.561576
20.3484260.452242
30.7725110.292019
ex2
0×2 DataFrame
Rowx1x2
Float64Float64

To by bylo pro začátek vše k základním vlastnostem. Pojďme nyní tabulky modifikovat, upravovat.


Indexování a přístup ke sloupcům

Indexování vychází z maticového zápisu a má podobné vlastnosti. Ke sloupcům můžeme přistupovat několika způsoby:

  • pomocí názvu a tečky: df.mark, df."mark"
  • pomocí indexace (nevytvoří kopii): df[!, :mark], df[!, "mark"]
  • pomocí indexace (vytvoří kopii): df[:, :mark], df[:, "mark"]

Tímto způsobem můžeme data číst, ale i je modifikovat.

df.mark
659-element PooledArrays.PooledVector{Union{Missing, String1}, UInt32, Vector{UInt32}}:
 "D"
 "C"
 missing
 missing
 "A"
 "C"
 "A"
 missing
 "C"
 missing
 "A"
 "A"
 missing
 ⋮
 "C"
 missing
 missing
 missing
 missing
 "B"
 "B"
 missing
 "B"
 "D"
 "A"
 missing
df.mark === df[!, :mark]
true
df.mark === df[:, :mark] # kopie!
false
df.mark == df[!, :mark]
missing
df.mark == df[:, :mark]
missing
missing == missing
missing
missing == true
missing
missing == false
missing

Vedle toho nemusíme používat název sloupce, v pořádku je použít i pořadí. Například prvních deset řádků v druhém a třetím sloupci:

df[1:10, ["test1", "test2", "test3"]]
10×3 DataFrame
Rowtest1test2test3
Float64?Float64?Float64?
13.011.513.5
22.03.018.5
3missingmissingmissing
43.0missingmissing
53.018.017.0
63.010.515.0
72.019.016.0
8missingmissingmissing
93.08.014.0
101.06.07.0

Not, Between, Cols, All

Dále můžeme vybírat pouze potřebné sloupce.

ex = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.5107510.1264750.9959010.9408620.0399312
20.9342570.00106690.227950.8073670.176827
30.2661040.4824220.8519430.5491240.493393
ex[:, Not(:x2)]
3×4 DataFrame
Rowx1x3x4x5
Float64Float64Float64Float64
10.5107510.9959010.9408620.0399312
20.9342570.227950.8073670.176827
30.2661040.8519430.5491240.493393
Not(:x2)
InvertedIndex{Symbol}(:x2)
ex[:, Not([:x2, :x3])]
3×3 DataFrame
Rowx1x4x5
Float64Float64Float64
10.5107510.9408620.0399312
20.9342570.8073670.176827
30.2661040.5491240.493393
ex[:, Between(:x1, :x3)]
3×3 DataFrame
Rowx1x2x3
Float64Float64Float64
10.5107510.1264750.995901
20.9342570.00106690.22795
30.2661040.4824220.851943
ex[:, Not(Between(:x1, :x3))]
3×2 DataFrame
Rowx4x5
Float64Float64
10.9408620.0399312
20.8073670.176827
30.5491240.493393
ex[:, Cols(:x2, :x4)]
3×2 DataFrame
Rowx2x4
Float64Float64
10.1264750.940862
20.00106690.807367
30.4824220.549124
ex[:, All()]
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.5107510.1264750.9959010.9408620.0399312
20.9342570.00106690.227950.8073670.176827
30.2661040.4824220.8519430.5491240.493393

Výše uvedené ukázky využívají funkcionalitu z DataFrames.jl balíčku. "Staré maticové" postupy samozřejmě fungují také, např.:

ex[:, [:x2, :x3]]
3×2 DataFrame
Rowx2x3
Float64Float64
10.1264750.995901
20.00106690.22795
30.4824220.851943

Přepisování dat

Údaje můžeme měnit několika způsoby. Vezměme si jednoduchou tabulku:

m = [ "α" 1 0.5; "β" 2 1.5; "γ" 3 2.5]
3×3 Matrix{Any}:
 "α"  1  0.5
 "β"  2  1.5
 "γ"  3  2.5
ex = DataFrame(m, ["a", "b", "c"])
3×3 DataFrame
Rowabc
AnyAnyAny
1α10.5
2β21.5
3γ32.5

Změna jedné položky (Ξ\Xi je velké ξ\xi):

ex[1, :a] = "Ξ"

ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ10.5
2β21.5
3γ32.5

Přepsání všech hodnot ve sloupci c:

ex[:, :c] = ["tau", "pi", "omega"]
3-element Vector{String}:
 "tau"
 "pi"
 "omega"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1tau
2β2pi
3γ3omega

Přepsání jednou hodnotou.

ex[:, :c] .= "ř"
3-element view(::Vector{Any}, :) with eltype Any:
 "ř"
 "ř"
 "ř"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1ř
2β2ř
3γ3ř
z = ex[:, :c]
3-element Vector{Any}:
 "ř"
 "ř"
 "ř"
z[2] = "pí"
"pí"
z
3-element Vector{Any}:
 "ř"
 "pí"
 "ř"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1ř
2β2ř
3γ3ř
z = ex[!, :c]
3-element Vector{Any}:
 "ř"
 "ř"
 "ř"
z[2] = "¿"
"¿"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1ř
2β2¿
3γ3ř
ex.c = [true, false, missing]

ex
3×3 DataFrame
Rowabc
AnyAnyBool?
1Ξ1true
2β2false
3γ3missing

Přepsání hodnot ve sloupci a danou hodnotou, ale pouze v řádcích, kde je hodnota ve sloupci b lichá (zde opět používáme indexování bitovým vektorem, nebo "maskování", s kterým jsme se setkali dříve během semestru):

isodd.(ex.b)
3-element BitVector:
 1
 0
 1
ex[isodd.(ex.b), :a] .= "λ"

ex
3×3 DataFrame
Rowabc
AnyAnyBool?
1λ1true
2β2false
3λ3missing
ex[iseven.(ex.b), :]
1×3 DataFrame
Rowabc
AnyAnyBool?
1β2false

Občas je nutné projít tabulku řádek po řádku. V tom případě můžeme iterovat přes index. Například se pokusme změnit první sloupec na řetězec obsahující daný symbol se spodním indexem daným číslem ve sloupci b v LaTeX notaci (tj. λ_1 atd.).

for row_id = axes(ex, 1) # 1:nrow(df)
    ex[row_id, :a] = ex[row_id, :a] * "_" * string(ex[row_id, :b])
end

ex
3×3 DataFrame
Rowabc
AnyAnyBool?
1λ_11true
2β_22false
3λ_33missing
for row in eachrow(ex)
    println(row)
    println(row[:a])
end
DataFrameRow
 Row │ a    b    c     
     │ Any  Any  Bool? 
─────┼─────────────────
   1 │ λ_1  1     true
λ_1
DataFrameRow
 Row │ a    b    c     
     │ Any  Any  Bool? 
─────┼─────────────────
   2 │ β_2  2    false
β_2
DataFrameRow
 Row │ a    b    c       
     │ Any  Any  Bool?   
─────┼───────────────────
   3 │ λ_3  3    missing 
λ_3

Zobrazovaní (show, first, last, view)

Pokud je DataFrame příliš velký, tak při jeho zobrazení dojde k ořezání (řádků i sloupců). Toto chování můžeme přebít pomocí parametrů metody show.

U našeho BI-ZMA příkladu je to lehce overkill.

df
659×17 DataFrame
634 rows omitted
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student0013.011.513.5missing5.028.0missingtrue1713:1534.0missing67.0DIrena58
2student0022.03.018.519.53.025.0missingtrue1812:4548.0missing76.0CIrena43
3student003missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingOndra6
4student0043.0missingmissingmissing4.03.0missingfalsemissingmissingmissingmissingmissingmissingJan V13
5student0053.018.017.0missing4.038.0missingtrue1917:1556.0missing98.0AJarda93
6student0063.010.515.0missingmissing28.5missingtrue1516:3050.0missing78.5CJan S59
7student0072.019.016.0missing2.037.0missingtrue1611:3055.0missing94.0AJan V91
8student008missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJitka6
9student0093.08.014.0missing3.025.0missingtrue1813:1542.0missing70.0CIrena46
10student0101.06.07.0missingmissing14.0missingfalsemissingmissingmissingmissingmissingmissingJan S27
11student0113.014.519.0missing4.036.5missingtrue1715:0057.0missing97.5AJakub89
12student0123.018.015.0missingmissing36.0missingtrue1711:3054.0missing90.0AJan V88
13student0132.0missingmissingmissingmissing2.0missingfalsemissingmissingmissingmissingmissingmissingIrena11
648student6481.013.514.5missingmissing29.0missingtrue1511:0045.0missing74.0CJan V62
649student6491.0missingmissingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingJitka9
650student6502.51.01.0missingmissing4.5missingfalsemissingmissingmissingmissingmissingmissingOndra17
651student6510.08.08.0missing1.016.0missingfalsemissingmissingmissingmissingmissingmissingJitka30
652student6522.05.512.512.51.020.0missingfalsemissingmissingmissingmissingmissingmissingJitka35
653student6533.08.517.5missingmissing29.0missingtrue1614:4555.0missing84.0BJitka62
654student6541.014.015.0missingmissing30.0missingtrue1614:3050.0missing80.0BJan S65
655student6551.55.0missingmissingmissing6.5missingfalsemissingmissingmissingmissingmissingmissingIrena19
656student6562.55.517.5missingmissing25.5missingtrue1610:4554.5missing80.0BOndra49
657student6571.015.06.516.01.025.0missingtrue1514:1542.0missing68.0DIrena41
658student6581.015.017.5missing0.533.5missingtrue1713:4556.0missing90.0AJarda79
659student659missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJakub6
show(df)
659×17 DataFrame
 Row │ username    test1      test2      test3      second_chance  activity    ⋯
     │ String15    Float64?   Float64?   Float64?   Float64?       Float64?    ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ student001        3.0       11.5       13.5      missing          5.0   ⋯
   2 │ student002        2.0        3.0       18.5           19.5        3.0
   3 │ student003  missing    missing    missing        missing    missing   
   4 │ student004        3.0  missing    missing        missing          4.0
   5 │ student005        3.0       18.0       17.0      missing          4.0   ⋯
   6 │ student006        3.0       10.5       15.0      missing    missing   
   7 │ student007        2.0       19.0       16.0      missing          2.0
   8 │ student008  missing    missing    missing        missing    missing   
   9 │ student009        3.0        8.0       14.0      missing          3.0   ⋯
  10 │ student010        1.0        6.0        7.0      missing    missing   
  11 │ student011        3.0       14.5       19.0      missing          4.0
  ⋮  │     ⋮           ⋮          ⋮          ⋮            ⋮            ⋮       ⋱
 650 │ student650        2.5        1.0        1.0      missing    missing   
 651 │ student651        0.0        8.0        8.0      missing          1.0   ⋯
 652 │ student652        2.0        5.5       12.5           12.5        1.0
 653 │ student653        3.0        8.5       17.5      missing    missing   
 654 │ student654        1.0       14.0       15.0      missing    missing   
 655 │ student655        1.5        5.0  missing        missing    missing     ⋯
 656 │ student656        2.5        5.5       17.5      missing    missing   
 657 │ student657        1.0       15.0        6.5           16.0        1.0
 658 │ student658        1.0       15.0       17.5      missing          0.5
 659 │ student659  missing    missing    missing        missing    missing     ⋯
                                                 11 columns and 638 rows omitted
show(df, allcols=true)
659×17 DataFrame
 Row │ username    test1      test2      test3      second_chance  activity   tests_total  gitlab    assessment  exam_test  date      oral_exam  veto     points_total  mark      tutor    percentil 
     │ String15    Float64?   Float64?   Float64?   Float64?       Float64?   Float64      Float64?  Bool        Int64?     String7?  Float64?   Bool?    Float64?      String1?  String7  Int64     
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ student001        3.0       11.5       13.5      missing          5.0         28.0   missing        true         17  13:15          34.0  missing          67.0  D         Irena           58
   2 │ student002        2.0        3.0       18.5           19.5        3.0         25.0   missing        true         18  12:45          48.0  missing          76.0  C         Irena           43
   3 │ student003  missing    missing    missing        missing    missing            0.0   missing       false    missing  missing   missing    missing     missing    missing   Ondra            6
   4 │ student004        3.0  missing    missing        missing          4.0          3.0   missing       false    missing  missing   missing    missing     missing    missing   Jan V           13
   5 │ student005        3.0       18.0       17.0      missing          4.0         38.0   missing        true         19  17:15          56.0  missing          98.0  A         Jarda           93
   6 │ student006        3.0       10.5       15.0      missing    missing           28.5   missing        true         15  16:30          50.0  missing          78.5  C         Jan S           59
   7 │ student007        2.0       19.0       16.0      missing          2.0         37.0   missing        true         16  11:30          55.0  missing          94.0  A         Jan V           91
   8 │ student008  missing    missing    missing        missing    missing            0.0   missing       false    missing  missing   missing    missing     missing    missing   Jitka            6
   9 │ student009        3.0        8.0       14.0      missing          3.0         25.0   missing        true         18  13:15          42.0  missing          70.0  C         Irena           46
  10 │ student010        1.0        6.0        7.0      missing    missing           14.0   missing       false    missing  missing   missing    missing     missing    missing   Jan S           27
  11 │ student011        3.0       14.5       19.0      missing          4.0         36.5   missing        true         17  15:00          57.0  missing          97.5  A         Jakub           89
  ⋮  │     ⋮           ⋮          ⋮          ⋮            ⋮            ⋮           ⋮          ⋮          ⋮           ⋮         ⋮          ⋮         ⋮          ⋮           ⋮         ⋮         ⋮
 650 │ student650        2.5        1.0        1.0      missing    missing            4.5   missing       false    missing  missing   missing    missing     missing    missing   Ondra           17
 651 │ student651        0.0        8.0        8.0      missing          1.0         16.0   missing       false    missing  missing   missing    missing     missing    missing   Jitka           30
 652 │ student652        2.0        5.5       12.5           12.5        1.0         20.0   missing       false    missing  missing   missing    missing     missing    missing   Jitka           35
 653 │ student653        3.0        8.5       17.5      missing    missing           29.0   missing        true         16  14:45          55.0  missing          84.0  B         Jitka           62
 654 │ student654        1.0       14.0       15.0      missing    missing           30.0   missing        true         16  14:30          50.0  missing          80.0  B         Jan S           65
 655 │ student655        1.5        5.0  missing        missing    missing            6.5   missing       false    missing  missing   missing    missing     missing    missing   Irena           19
 656 │ student656        2.5        5.5       17.5      missing    missing           25.5   missing        true         16  10:45          54.5  missing          80.0  B         Ondra           49
 657 │ student657        1.0       15.0        6.5           16.0        1.0         25.0   missing        true         15  14:15          42.0  missing          68.0  D         Irena           41
 658 │ student658        1.0       15.0       17.5      missing          0.5         33.5   missing        true         17  13:45          56.0  missing          90.0  A         Jarda           79
 659 │ student659  missing    missing    missing        missing    missing            0.0   missing       false    missing  missing   missing    missing     missing    missing   Jakub            6
                                                                                                                                                                                     638 rows omitted

Toto by byl ještě větší overkill.

# show(df, allrows=true)

Případně můžeme prozkoumávat začátek a konec tabulky.

first(df, 10)
10×17 DataFrame
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student0013.011.513.5missing5.028.0missingtrue1713:1534.0missing67.0DIrena58
2student0022.03.018.519.53.025.0missingtrue1812:4548.0missing76.0CIrena43
3student003missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingOndra6
4student0043.0missingmissingmissing4.03.0missingfalsemissingmissingmissingmissingmissingmissingJan V13
5student0053.018.017.0missing4.038.0missingtrue1917:1556.0missing98.0AJarda93
6student0063.010.515.0missingmissing28.5missingtrue1516:3050.0missing78.5CJan S59
7student0072.019.016.0missing2.037.0missingtrue1611:3055.0missing94.0AJan V91
8student008missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJitka6
9student0093.08.014.0missing3.025.0missingtrue1813:1542.0missing70.0CIrena46
10student0101.06.07.0missingmissing14.0missingfalsemissingmissingmissingmissingmissingmissingJan S27
last(df, 10)
10×17 DataFrame
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student6502.51.01.0missingmissing4.5missingfalsemissingmissingmissingmissingmissingmissingOndra17
2student6510.08.08.0missing1.016.0missingfalsemissingmissingmissingmissingmissingmissingJitka30
3student6522.05.512.512.51.020.0missingfalsemissingmissingmissingmissingmissingmissingJitka35
4student6533.08.517.5missingmissing29.0missingtrue1614:4555.0missing84.0BJitka62
5student6541.014.015.0missingmissing30.0missingtrue1614:3050.0missing80.0BJan S65
6student6551.55.0missingmissingmissing6.5missingfalsemissingmissingmissingmissingmissingmissingIrena19
7student6562.55.517.5missingmissing25.5missingtrue1610:4554.5missing80.0BOndra49
8student6571.015.06.516.01.025.0missingtrue1514:1542.0missing68.0DIrena41
9student6581.015.017.5missing0.533.5missingtrue1713:4556.0missing90.0AJarda79
10student659missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJakub6

Pro rychlé prozkoumávání částí tabulky slouží metoda view, resp. makro @view, která jen část tabulky zobrazí. Nevytváří nový objekt, měla by být efektivnější.

view(df, 100:150, [:username, :points_total, :mark])
51×3 SubDataFrame
26 rows omitted
Rowusernamepoints_totalmark
String15Float64?String1?
1student10091.5A
2student10180.5B
3student102missingmissing
4student10373.0C
5student10475.5C
6student105missingmissing
7student10681.5B
8student10776.0C
9student108missingmissing
10student10976.5C
11student110missingmissing
12student11157.0E
13student112missingmissing
40student13989.0B
41student14074.0C
42student14190.0A
43student14264.5D
44student143missingmissing
45student144missingmissing
46student14578.0C
47student146missingmissing
48student14780.5B
49student148missingF
50student149missingmissing
51student150100.5A
@view df[10:15, [:username, :mark]]
6×2 SubDataFrame
Rowusernamemark
String15String1?
1student010missing
2student011A
3student012A
4student013missing
5student014missing
6student015D

Přidávání a odebírání sloupců

K přidávání sloupců slouží metoda insertcols!:

ex = DataFrame(rand(2, 2), :auto)
2×2 DataFrame
Rowx1x2
Float64Float64
10.08477510.0320252
20.5925210.457126

Na konec a kopie.

insertcols(ex, :a => [1, 2])
2×3 DataFrame
Rowx1x2a
Float64Float64Int64
10.08477510.03202521
20.5925210.4571262

Na konkrétní místo.

insertcols!(ex, 1, :a => [pi, pi])
2×3 DataFrame
Rowax1x2
Irration…Float64Float64
1π0.08477510.0320252
2π0.5925210.457126
insertcols!(ex, 1, :a2 => [pi, -pi])
2×4 DataFrame
Rowa2ax1x2
Float64Irration…Float64Float64
13.14159π0.08477510.0320252
2-3.14159π0.5925210.457126
insertcols!(ex, 3, :b => "⊕")
2×5 DataFrame
Rowa2abx1x2
Float64Irration…StringFloat64Float64
13.14159π0.08477510.0320252
2-3.14159π0.5925210.457126
insertcols!(ex, 5, :c => [1, 2])
2×6 DataFrame
Rowa2abx1cx2
Float64Irration…StringFloat64Int64Float64
13.14159π0.084775110.0320252
2-3.14159π0.59252120.457126

Ale lze použít i prosté indexování:

ex[:, :d] = [42, 42]

ex
2×7 DataFrame
Rowa2abx1cx2d
Float64Irration…StringFloat64Int64Float64Int64
13.14159π0.084775110.032025242
2-3.14159π0.59252120.45712642

Mazání sloupců musíme provést pomocí indexace, neexistuje metoda "dropcolumns!".

ex[:, Not(:b)]
2×6 DataFrame
Rowa2ax1cx2d
Float64Irration…Float64Int64Float64Int64
13.14159π0.084775110.032025242
2-3.14159π0.59252120.45712642

Přidávání a odebírání řádků

Vytvořme si zase testovací tabulku:

ex = DataFrame(rand(2, 2), :auto)
2×2 DataFrame
Rowx1x2
Float64Float64
10.4256080.338276
20.4507230.315649

Přidat řádek lze opět několika způsoby. Nejpřirozenější je asi metoda push!:

push!(ex, [1.0, 2.0])
3×2 DataFrame
Rowx1x2
Float64Float64
10.4256080.338276
20.4507230.315649
31.02.0
push!(ex, [1, 2])
4×2 DataFrame
Rowx1x2
Float64Float64
10.4256080.338276
20.4507230.315649
31.02.0
41.02.0
push!(ex, (0.5, 2.3))
5×2 DataFrame
Rowx1x2
Float64Float64
10.4256080.338276
20.4507230.315649
31.02.0
41.02.0
50.52.3
push!(ex, ["a", "b"])
┌ Error: Error adding value to column :x1. Maybe you forgot passing `promote=true`?
└ @ DataFrames ~/.julia/packages/DataFrames/b4w9K/src/dataframe/insertion.jl:810
MethodError: Cannot `convert` an object of type String to an object of type Float64
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number
   @ Base number.jl:6
  convert(::Type{T}, ::Number) where T<:Number
   @ Base number.jl:7
  convert(::Type{T}, ::T) where T
   @ Base Base_compiler.jl:133
  ...


Stacktrace:
 [1] push!(a::Vector{Float64}, item::String)
   @ Base ./array.jl:1285
 [2] _row_inserter!(df::DataFrame, loc::Int64, row::Vector{String}, mode::Val{:push}, promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/b4w9K/src/dataframe/insertion.jl:776
 [3] #push!#383
   @ ~/.julia/packages/DataFrames/b4w9K/src/dataframe/insertion.jl:545 [inlined]
 [4] push!(df::DataFrame, row::Vector{String})
   @ DataFrames ~/.julia/packages/DataFrames/b4w9K/src/dataframe/insertion.jl:538
 [5] top-level scope
   @ In[54]:1
 [6] eval(m::Module, e::Any)
   @ Core ./boot.jl:489
push!(ex, ["a", "b"], promote=true)
6×2 DataFrame
Rowx1x2
AnyAny
10.4256080.338276
20.4507230.315649
31.02.0
41.02.0
50.52.3
6ab
push!(ex, (x2 = 0.5, x1 = 0.3))
7×2 DataFrame
Rowx1x2
AnyAny
10.4256080.338276
20.4507230.315649
31.02.0
41.02.0
50.52.3
6ab
70.30.5
push!(ex, (x2 = 7, x1 = 5))
8×2 DataFrame
Rowx1x2
AnyAny
10.4256080.338276
20.4507230.315649
31.02.0
41.02.0
50.52.3
6ab
70.30.5
857

Pomocí append! můžeme spojovat několik tabulek "vertikálně" dohromady:

append!(ex, DataFrame(rand(10, 2), :auto))
18×2 DataFrame
Rowx1x2
AnyAny
10.4256080.338276
20.4507230.315649
31.02.0
41.02.0
50.52.3
6ab
70.30.5
857
90.8407830.0684555
100.408250.53635
110.4004560.684138
120.1625010.0953885
130.09299350.749572
140.7797380.450801
150.8554840.167365
160.338310.0827383
170.7015410.567536
180.09826740.443836

Mazat řádky můžeme explicitně pomocí delete!:

delete!(ex, 3)
17×2 DataFrame
Rowx1x2
AnyAny
10.4256080.338276
20.4507230.315649
31.02.0
40.52.3
5ab
60.30.5
757
80.8407830.0684555
90.408250.53635
100.4004560.684138
110.1625010.0953885
120.09299350.749572
130.7797380.450801
140.8554840.167365
150.338310.0827383
160.7015410.567536
170.09826740.443836
delete!(ex, [1, 2, 3])
14×2 DataFrame
Rowx1x2
AnyAny
10.52.3
2ab
30.30.5
457
50.8407830.0684555
60.408250.53635
70.4004560.684138
80.1625010.0953885
90.09299350.749572
100.7797380.450801
110.8554840.167365
120.338310.0827383
130.7015410.567536
140.09826740.443836
delete!(ex, 2)
13×2 DataFrame
Rowx1x2
AnyAny
10.52.3
20.30.5
357
40.8407830.0684555
50.408250.53635
60.4004560.684138
70.1625010.0953885
80.09299350.749572
90.7797380.450801
100.8554840.167365
110.338310.0827383
120.7015410.567536
130.09826740.443836
ex.x1 .> 0.5
13-element BitVector:
 0
 0
 1
 1
 0
 0
 0
 0
 1
 1
 0
 1
 0
delete!(ex, ex.x1 .> 0.5)
8×2 DataFrame
Rowx1x2
AnyAny
10.52.3
20.30.5
30.408250.53635
40.4004560.684138
50.1625010.0953885
60.09299350.749572
70.338310.0827383
80.09826740.443836

To zdaleka není všechno

Výše jsme shrnuli zřejmě ty nejužitečnější metody a techniky. Tím ale možnosti DataFrames.jl zdaleka nekončí. Zvídavému čtenáři doporučujeme prolétnout dokumentaci. Na tomto místě snad jen upozorněme na následující:

Velmi sofistikované transformace dat v tabulkách lze také provádět pomocí následujících metod:

  • combine
  • select/select!
  • transform/transform!

Jejich podrobný výklad je už nad rámec tohoto kurzu.


2. Cvičení: analýza BI-MA1 v B212, B222, B232

Pojďme se podrobněji podívat na data z předmětu BI-MA1.

Nejprve znovu načteme data z CSV souboru, který je syrovým exportem z Grades, kde jsou anonymizováni uživatelská jména studentů.

using CSV
df1 = CSV.read(joinpath("files", "bi-ma1-b212.csv"), DataFrame)
df2 = CSV.read(joinpath("files", "bi-ma1-b222.csv"), DataFrame)
df3 = CSV.read(joinpath("files", "bi-ma1-b232.csv"), DataFrame)
df4 = CSV.read(joinpath("files", "bi-ma1-b242.csv"), DataFrame);
describe(df1)
16×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent001student5280String15
2test18.451990.08.020.0101Union{Missing, Float64}
3test213.31880.014.020.0219Union{Missing, Float64}
4second_chance15.05880.015.023.5460Union{Missing, Float64}
5activity1.956290.02.05.0242Union{Missing, Float64}
6tests_total14.9280.015.539.50Float64
7gitlab1.307690.01.07.0489Union{Missing, Float64}
8assessment0.395833false0.0true0Bool
9exam_test16.7081017.020319Union{Missing, Int64}
10date---16:50319Union{Missing, String7}
11oral_exam48.302628.050.060.0338Union{Missing, Float64}
12veto1.0true1.0true525Union{Missing, Bool}
13points_total79.621755.080.0103.5339Union{Missing, Float64}
14markAF319Union{Missing, String1}
15tutorHonzaTomáš8Union{Missing, String7}
16percentil53.11362350.01000Int64
size(df1)
(528, 16)
nrow(df1)
528
describe(df2)
17×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent0001student07600String15
2test11.698180.01.53.099Union{Missing, Float64}
3test210.66720.011.020.0132Union{Missing, Float64}
4test312.68930.013.520.0245Union{Missing, Float64}
5second_chance14.54025.015.022.0673Union{Missing, Float64}
6activity2.27454-6.02.05.0383Union{Missing, Float64}
7tests_total19.11480.024.543.02Union{Missing, Float64}
8gitlab2.2368411.012722Union{Missing, Int64}
9assessment0.497361false0.0true2Union{Missing, Bool}
10exam_test16.4377-117.020383Union{Missing, Int64}
11date---CT 14:30383Union{Missing, String15}
12oral_exam46.898820.048.057.0419Union{Missing, Float64}
13veto1.0true1.0true757Union{Missing, Bool}
14points_total80.323554.580.5105.0420Union{Missing, Float64}
15markAF383Union{Missing, String1}
16tutorIrenaPetr2Union{Missing, String7}
17percentil50.95131250.01000Int64
size(df2)
(760, 17)
nrow(df2)
760
describe(df3)
17×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent001student6590String15
2test12.028010.02.03.070Union{Missing, Float64}
3test211.58040.012.020.068Union{Missing, Float64}
4test313.58250.014.520.0150Union{Missing, Float64}
5second_chance12.69642.514.022.5589Union{Missing, Float64}
6activity2.385640.02.05.0334Union{Missing, Float64}
7tests_total22.84370.026.543.00Float64
8gitlab0.8148150.50.54.0632Union{Missing, Float64}
9assessment0.616085false1.0true0Bool
10exam_test16.9113-117.020253Union{Missing, Int64}
11date---17:30254Union{Missing, String7}
12oral_exam47.72570.049.057.0278Union{Missing, Float64}
13veto1.0true1.0true657Union{Missing, Bool}
14points_total81.501955.081.5105.0279Union{Missing, Float64}
15markAF253Union{Missing, String1}
16tutorIrenaOndra0String7
17percentil50.3612050.01000Int64
size(df3)
(659, 17)
describe(df4)
18×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent000student5690String15
2test12.041990.02.53.058Union{Missing, Float64}
3test211.54040.012.520.062Union{Missing, Float64}
4test313.42110.013.7520.0120Union{Missing, Float64}
5second_chance13.42735.014.021.5515Union{Missing, Float64}
6activity2.618560.02.55.0242Union{Missing, Float64}
7tests_total22.89870.027.043.00Float64
8gitlab0.9772730.51.02.5548Union{Missing, Float64}
9assessment0.640351false1.0true0Bool
10elsa+ 25 %úraz ruky + čas (semestr?)557Union{Missing, String31}
11exam_test15.9038-117.020206Union{Missing, Int64}
12date---2025-06-30206Union{Missing, String15}
13oral_exam46.893815.049.059.0264Union{Missing, Float64}
14veto1.0true1.0true565Union{Missing, Bool}
15points_total80.641658.080.5104.0266Union{Missing, Float64}
16markAF206Union{Missing, String1}
17tutorEvaOndra37Union{Missing, String7}
18percentil50.5895852.01000Int64

2.1 Základní údaje

Nejprve prozkoumejte dostupné sloupce a pokuste se zjistit základní údaje jako

  • Kolik studentů mělo předmět zapsáno?
  • Kolik studentů získalo zápočet a dokončilo předmět?
  • Kolik studentů bylo v jakém ročníku? (V prvním běhu BI-MA1 toto nemá smysl.)
  • ...

Počet studentů.

nrow(df1)
528
nrow(df2)
760
nrow(df3)
659
nrow(df4)
570

Kolik studentů má zápočet?

sum(df1.assessment)
209
length(filter(isone, df1.assessment))
209
sum(df2.assessment)
missing
sum(skipmissing(df2.assessment))
377
sum(df2[(!ismissing).(df2.assessment), :assessment])
377
filter(x -> !ismissing(x), df2.assessment)
758-element Vector{Union{Missing, Bool}}:
 1
 1
 0
 0
 0
 0
 1
 1
 1
 1
 0
 1
 0
 ⋮
 1
 0
 0
 1
 1
 1
 1
 0
 1
 1
 0
 1
filter(x -> !ismissing(x), df2.assessment) |> sum
377
sum(dropmissing(df2, :assessment).assessment)
377
coalesce.(df2.assessment, false) |> sum
377
sum(df3.assessment)
406
sum(df4.assessment)
365

Tj. průchodnost zápočtu.

sum(df1.assessment) / nrow(df1)
0.3958333333333333
sum(coalesce.(df2.assessment, 0)) / nrow(df2)
0.49605263157894736
sum(df3.assessment) / nrow(df3)
0.6160849772382397
sum(df4.assessment) / nrow(df4)
0.6403508771929824

Počet vet v tabulce.

sum(df1.veto)
missing
missing + 1 # aha! pozor na missing hodnoty
missing
sum(dropmissing(df1[:, Cols(:veto)]).veto)
3
sum(filter(x -> !ismissing(x), df1.veto))
3
sum(dropmissing(df1, :veto).veto)
3
sum(dropmissing(df2, :veto).veto)
3
sum(dropmissing(df3, :veto).veto)
2
sum(dropmissing(df4, :veto).veto)
5

Úspěšně dokončilo předmět:

length(filter(x -> x != "F", dropmissing(df1, :mark).mark))
189
sum(dropmissing(df1, :mark).mark .!= "F")
189
sum(dropmissing(df2, :mark).mark .!= "F")
340
sum(dropmissing(df3, :mark).mark .!= "F")
380
sum(dropmissing(df4, :mark).mark .!= "F")
304

Počet studentů, kteří získali zápočet a současně neuspěli u zkoušky.

sum(df1.assessment .&& (df1.mark .== "F"))
20
sum(coalesce.(df2.assessment, false) .&& df2.mark .== "F")
37
sum(coalesce.(df3.assessment, false) .&& df3.mark .== "F")
26
sum(coalesce.(df4.assessment, false) .&& coalesce.(df4.mark, "F") .== "F")
61

2.2 Zápočtové písemky, kvízy a semestr

Poté se podívejme podrobněji jak probíhalo získávání zápočtů.

  • Prozkoumejte výsledky jednotlivých zápočtových písemek, vytvořte historgramy výsledků.
  • Kolika studentům zápočet "těsně" unikl?
  • Kolik studentů "odpadlo" už v první polovině semestru?
  • ...
using PyPlot, Statistics

Histogram prvního zápočtové testu.

plt.grid()
hist(collect(skipmissing(df4.test1)), bins=[ -0.49 + j for j in 0:4]);

Pozor, se samotným skipmissing nefunguje.

skipmissing(df3.test1)
skipmissing(Union{Missing, Float64}[3.0, 2.0, missing, 3.0, 3.0, 3.0, 2.0, missing, 3.0, 1.0  …  2.5, 0.0, 2.0, 3.0, 1.0, 1.5, 2.5, 1.0, 1.0, missing])
collect(skipmissing(df3.test1))
589-element Vector{Float64}:
 3.0
 2.0
 3.0
 3.0
 3.0
 2.0
 3.0
 1.0
 3.0
 3.0
 2.0
 2.5
 2.5
 ⋮
 3.0
 1.0
 1.0
 2.5
 0.0
 2.0
 3.0
 1.0
 1.5
 2.5
 1.0
 1.0
mean(skipmissing(df3.test1))
2.0280135823429544
median(skipmissing(df3.test1))
2.0
length(collect(skipmissing(df3.test1)))
589
nrow(df3) - length(df3.test1[ismissing.(df3.test1)])
589

Poslední běh lze porovnat s předposledním během!

plt.grid()
hist(collect(skipmissing(df3.test1)), bins=[ -0.49 + j for j in 0:4]);

Histogram druhého zápočtového testu.

plt.grid()
hist(collect(skipmissing(df4.test2)), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(collect(skipmissing(df3.test2)), bins=[ -0.49 + j for j in 0:21]);

Třetí test.

plt.grid()
hist(collect(skipmissing(df4.test3)), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(collect(skipmissing(df3.test3)), bins=[ -0.49 + j for j in 0:21]);

Počet studentů píšících testy v posledním běhu.

df4.test1 |> skipmissing |> collect |> length
512
df4.test2 |> skipmissing |> collect |> length
508
df4.test3 |> skipmissing |> collect |> length
450

A v přeposledním běhu.

df3.test1 |> skipmissing |> collect |> length
589
df3.test2 |> skipmissing |> collect |> length
591
df3.test3 |> skipmissing |> collect |> length
509

Počet studentů, kteří získali zápočet bez opravné zápočtové písemky.

sum(coalesce.(df1.test1 + df1.test2, 0) .>= 25.)
165
sum(coalesce.(df2.test1 + df2.test2 + df2.test3, 0) .>= 25.)
325
sum(coalesce.(df3.test1 + df3.test2 + df3.test3, 0) .>= 25.)
367
sum(coalesce.(df4.test1 + df4.test2 + df4.test3, 0) .>= 25.)
328

Počet studentů s alespoň 20 body ze zápočtových písemek.

nrow(df1[coalesce.(df1.tests_total .>= 20, false), :])
234
nrow(df2[coalesce.(df2.tests_total .>= 20, false), :])
418
nrow(df3[coalesce.(df3.tests_total .>= 20, false), :])
444
nrow(df4[coalesce.(df4.tests_total .>= 20, false), :])
389
plt.grid()
hist(collect(skipmissing(df4.test1 + df4.test2 + df4.test3)), bins=[ -0.49 + j for j in 0:44]);
plt.grid()
hist(collect(skipmissing(df3.test1 + df3.test2 + df3.test3)), bins=[ -0.49 + j for j in 0:44]);
plt.grid()
hist(collect(skipmissing(df2.test1 + df2.test2 + df2.test3)), bins=[ -0.49 + j for j in 0:44]);

Histogram opravné zápočtové písemky.

plt.grid()
hist(filter(x -> !ismissing(x), df1.second_chance), bins=[ -0.49 + j for j in 0:25]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.second_chance), bins=[ -0.49 + j for j in 0:25]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.second_chance), bins=[ -0.49 + j for j in 0:25]);
plt.grid()
hist(filter(x -> !ismissing(x), df4.second_chance), bins=[ -0.49 + j for j in 0:25]);

Histogram celkového počtu bodů ze semestru.

plt.grid()
hist(filter(x -> !ismissing(x), df1.tests_total), bins=[ -0.49 + j for j in 0:41]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.tests_total), bins=[ -0.49 + j for j in 0:44]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.tests_total), bins=[ -0.49 + j for j in 0:44]);
plt.grid()
hist(filter(x -> !ismissing(x), df4.tests_total), bins=[ -0.49 + j for j in 0:44]);

Úspěšnost zisku zápočtu podle cvičících.

groupby(df4, :tutor)

GroupedDataFrame with 10 groups based on key: tutor

First Group (46 rows): tutor = "Irena"
21 rows omitted
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentelsaexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolString31?Int64?String15?Float64?Bool?Float64?String1?String7?Int64
1student0000.07.510.0missingmissing17.5missingfalsemissingmissingmissingmissingmissingmissingmissingIrena29
2student0043.05.0missingmissingmissing8.0missingfalsemissingmissingmissingmissingmissingmissingmissingIrena20
3student009missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingmissingIrena8
4student0162.07.06.5missing2.015.5missingfalsemissingmissingmissingmissingmissingmissingmissingIrena27
5student0382.04.00.0missingmissing6.0missingfalsemissingmissingmissingmissingmissingmissingmissingIrena18
6student0523.017.512.0missing5.032.5missingtruemissing1714:0049.0missing86.5BIrena76
7student0603.013.510.5missingmissing27.0missingtruemissing1713:3046.0missing73.0CIrena52
8student0773.02.55.5missingmissing11.0missingfalsemissingmissingmissingmissingmissingmissingmissingIrena23
9student0853.06.020.0missing5.029.0missingtruemissing10---missingmissingmissingFIrena61
10student1033.015.012.5missingmissing30.5missingtruemissing1914:0057.0missing87.5BIrena66
11student1113.018.513.0missing4.034.5missingtruemissing13---missingmissingmissingFIrena84
12student1153.011.513.5missing5.028.0missingtruemissing1511:0049.0missing82.0BIrena57
13student1173.011.012.5missing5.026.5missingtruemissing1513:1553.0missing84.5BIrena49
35student4243.018.011.5missing5.032.5missingtruemissing1912:3056.0missing93.5AIrena76
36student4293.015.013.0missing5.031.0missingtruemissing1712:1537.0missing73.0CIrena69
37student4333.017.013.5missing5.033.5missingtruemissing1513:3052.0missing90.5AIrena80
38student4383.017.011.0missing5.031.0missingtruemissing2012:3039.0missing75.0CIrena69
39student4472.06.010.0missingmissing18.0missingfalsemissingmissingmissingmissingmissingmissingmissingIrena30
40student4713.09.09.518.05.025.0missingtruemissing1612:1556.0missing86.0BIrena36
41student5072.010.59.5missing4.022.0missingfalsemissingmissingmissingmissingmissingmissingmissingIrena37
42student5093.015.016.0missing5.034.0missingtruemissing-1---missingmissingmissingFIrena82
43student5313.016.511.5missing5.031.0missingtruemissing1510:4551.0missing87.0BIrena69
44student5351.513.510.5missing0.025.51.0true+ 25 %2014:1538.0missing64.5DIrena46
45student5463.08.019.5missing5.030.5missingtruemissing1914:0056.0missing91.5AIrena66
46student5563.016.516.0missing2.035.5missingtruemissing10---missingmissingmissingFIrena89

Last Group (29 rows): tutor = "Jitka"
4 rows omitted
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentelsaexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolString31?Int64?String15?Float64?Bool?Float64?String1?String7?Int64
1student0373.07.511.021.52.025.0missingtruemissing1612:1542.0missing69.0DJitka36
2student0502.017.519.0missing4.038.5missingtruemissing2012:4553.0missing95.5AJitka95
3student0543.013.519.0missingmissing35.5missingtruemissing1812:3053.0missing88.5BJitka89
4student0751.512.520.0missing1.034.0missingtruemissing1512:3046.0missing81.0BJitka82
5student121missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingmissingJitka8
6student1521.57.58.5missing1.017.5missingfalsemissingmissingmissingmissingmissingmissingmissingJitka29
7student1692.59.014.0missing3.025.5missingtruemissing1513:0015.0missingmissingFJitka46
8student1773.013.517.0missing2.033.5missingtruemissing1612:3054.5missing90.0AJitka80
9student1873.015.013.5missing1.031.5missingtruemissing1710:3053.0missing85.5BJitka71
10student1893.010.512.0missing1.025.5missingtruemissing1910:1544.0missing70.5CJitka46
11student2412.019.016.5missing4.037.5missingtruemissing1710:3053.0missing94.5AJitka93
12student2662.56.515.59.51.524.5missingfalsemissingmissingmissingmissingmissingmissingmissingJitka41
13student3052.016.09.5missing1.527.5missingtruemissing1612:1547.0missing76.0CJitka54
18student3851.09.09.0missing2.019.0missingfalsemissingmissingmissingmissingmissingmissingmissingJitka31
19student4041.56.55.5missingmissing13.5missingfalsemissingmissingmissingmissingmissingmissingmissingJitka26
20student4092.016.516.0missingmissing34.5missingtruemissing1512:4528.5missing63.0DJitka84
21student4742.518.515.0missing4.536.0missingtruemissing-1---missingmissingmissingFJitka90
22student4782.014.012.0missingmissing28.0missingtruemissing2012:3040.0missing68.0DJitka57
23student5102.017.518.0missing3.037.5missingtruemissing8---missingmissingmissingFJitka93
24student5193.012.513.5missing5.029.0missingtruemissing1811:0030.0missing64.0DJitka61
25student5330.50.5missingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingmissingJitka12
26student5433.011.513.5missing1.028.0missingtruemissing2012:0053.0missing82.0BJitka57
27student5481.516.014.0missing2.531.5missingtruemissing1710:4545.0missing79.0CJitka71
28student5513.011.58.57.00.523.0missingfalsemissingmissingmissingmissingmissingmissingmissingJitka39
29student560missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingmissingJitka8
df = combine(groupby(df4, :tutor),
    nrow => "studenti",
    :assessment => sum => "zápočty",
    :test1 => (x -> length(collect(skipmissing(x)))) => "psali 1. test",
    :activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df
10×6 DataFrame
Rowtutorstudentizápočtypsali 1. testprůměrná aktivitapropustnost
String7?Int64Int64Int64Float64Float64
1Irena4632443.10870.695652
2Ondra4618412.584530.391304
3missing378180.4609620.216216
4Jan S4835470.06250.729167
5Jakub9554830.8473680.568421
6Jan V9172850.9603180.791209
7Jarda10166962.559960.653465
8Eva3015241.666670.5
9Luděk4745471.074470.957447
10Jitka2920271.724140.689655
df = combine(groupby(df3, :tutor),
    nrow => "studenti",
    :assessment => sum => "zápočty",
    :test1 => (x -> length(collect(skipmissing(x)))) => "psali 1. test",
    :activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df
7×6 DataFrame
Rowtutorstudentizápočtypsali 1. testprůměrná aktivitapropustnost
String7Int64Int64Int64Float64Float64
1Irena9966912.333330.666667
2Ondra9451810.2056760.542553
3Jan V144771260.6736110.534722
4Jarda9576922.594740.8
5Jan S4736450.2553190.765957
6Jitka9660891.203120.625
7Jakub8440650.6428570.47619
df = combine(groupby(df2, :tutor),
    nrow => "studenti",
    :assessment => sum => "zápočty",
    :test1 => (x -> length(collect(skipmissing(x)))) => "psali 1. test",
    :activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df
9×6 DataFrame
Rowtutorstudentizápočtypsali 1. testprůměrná aktivitapropustnost
String7?Int64Int64?Int64Float64Float64?
1Jan V146missing1180.424658missing
2Pavel149841381.073830.563758
3Jitka7341641.760270.561644
4Petr99missing780.459596missing
5Jiřina6836611.492650.529412
6Irena7441691.50.554054
7Jarda10057952.460.57
8Jan S4915380.06122450.306122
9missing2000.00.0

2.3 Zkouškové období

V BI-MA1 hodně rozdílné vůči BI-ZMA.

Histogram známek.

combine(groupby(df1, :mark), nrow => "studenti") |> sort
7×2 DataFrame
Rowmarkstudenti
String1?Int64
1A42
2B57
3C53
4D32
5E5
6F20
7missing319
combine(groupby(df2, :mark), nrow => "studenti") |> sort
7×2 DataFrame
Rowmarkstudenti
String1?Int64
1A79
2B104
3C110
4D41
5E6
6F37
7missing383
combine(groupby(df3, :mark), nrow => "studenti") |> sort
7×2 DataFrame
Rowmarkstudenti
String1?Int64
1A108
2B121
3C107
4D35
5E9
6F26
7missing253
sort(combine(groupby(df1[coalesce.(df1.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A12
2B7
3C8
4D2
5F4
6missing5
sort(combine(groupby(df2[coalesce.(df2.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A13
2B12
3C4
4D1
5F2
6missing6
sort(combine(groupby(df3[coalesce.(df3.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A9
2B4
3C5
4D1
5F1
6missing7
sort(combine(groupby(df1[coalesce.(df1.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A29
2B25
3C13
4D10
5F4
6missing15
sort(combine(groupby(df2[coalesce.(df2.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A47
2B39
3C32
4D6
5F7
6missing28
sort(combine(groupby(df3[coalesce.(df3.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A58
2B43
3C16
4D7
5F7
6missing16
plt.grid()
hist(filter(x -> !ismissing(x), df1.oral_exam), bins=[ -0.49 + j for j in 0:61]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.oral_exam), bins=[ -0.49 + j for j in 0:58]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.oral_exam), bins=[ -0.49 + j for j in 0:58]);
plt.grid()
hist(filter(x -> !ismissing(x), df1.exam_test), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.exam_test), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.exam_test), bins=[ -0.49 + j for j in 0:21]);
combine(groupby(df1, :exam_test), nrow => "počet") |> sort
13×2 DataFrame
Rowexam_testpočet
Int64?Int64
104
2101
3111
4123
5132
6145
71529
81637
91740
101837
111934
122016
13missing319
combine(groupby(df1[df1.assessment .=== true, :], :exam_test), nrow => "počet") |> sort
12×2 DataFrame
Rowexam_testpočet
Int64?Int64
104
2101
3111
4123
5132
6145
71529
81637
91740
101837
111934
122016

Reference

Další informace, detaily či tutoriály lze nalézt v dokumentaci DataFrames.jl. V notebooku jsme používali balíček CSV.jl.