09: Zpracování a analýza dat

Tento notebook je výukovým materiálem v předmětu BI-JUL.21 vyučovaném v zimním semestru akademického roku 2023/2024 Tomášem Kalvodou. Tvorba těchto materiálů byla podpořena NVS FIT.

Hlavní stránkou předmětu, kde jsou i další notebooky a zajímavé informace, je jeho Course Pages stránka.

versioninfo()

1. DataFrames.jl

Pro práci s daty jistě znáte Pythonovský nástroj pandas. DataFrames.jl je v podstatě Julia analog tohoto nástroje. Pokud jste zvyklí pandas používat, může pro vás být užitečné porovnání pandas s Dataframes.jl.

using DataFrames

1.1 Jak vytvořit DataFrame?

DataFrame lze vytvořit mnoha způsoby. Můžeme začít s prázdnou tabulkou a postupně ji naplnit daty, nebo využít existující matici, nebo data načíst z externího souboru.

Prázdný DataFrame vytvoříme velmi snadno:

DataFrame()

Keyword argumenty a NamedTuple

Data v pojmenovaných sloupcích můžeme předat pomocí keyword argumentů (klíčové slovo je název sloupce, hodnota data):

df = DataFrame(course=["BI-LA1", "BI-DML", "BI-MA1", "BI-MA2"], semester=[1, 1, 2, 2], department=18105)
df.course

Tímto způsobem bychom ovšem měli problém zadat data se sloupcích, jejichž názvy obsahují třeba speciální znaky jako mezery. K tomu můžeme použít slovník, resp. dvojice (klíče mohou být řetězce nebo symboly -- ty jsou doporučené, místo mezer je vhodnější použít podtžítka :slovo_slovo):

df = DataFrame("fiktivní postava" => ["Gandalf", "Harry Potter"], "kniha" => ["Pán prstenů", "Harry Potter a kámen mudrců"])
df.var"fiktivní postava"
df[:, "fiktivní postava"]

Další možností je použít NamedTuple:

nt = (a = 1, b = 3)
nt[1], nt[2]
nt[:a], nt[:b]

Po sloupcích (pozor na jemný rozdíl od výše uvedeného způsobu):

DataFrame((a = [1,2,3], b = [4,5,6]))

Po řádcích:

DataFrame([(a = 1, b = 4), (a = 2, b = 5), (a = 3, b = 6)])
DataFrame([(a = 1, b = 4, c = nothing), (a = 2, b = 5, c = nothing), (a = 3, b = nothing, c = 6)])

Matice

K vytvoření DataFrame můžeme použít i matici, jen musíme vyřešit pojmenování sloupců. Automaticky (nutno zadat jako druhý argument :auto) budou označeny jako x1, x2, atd.:

DataFrame(rand(2, 3), :auto)
DataFrame(rand(Int64, 2, 3), :auto)

V druhém argumentu případně můžeme zadat naše názvy sloupců:

DataFrame(rand(2, 3), ["col1", "col2", "col3"])

CSV.jl

V tomto notebooku budeme pro některé ukázky používat CSV anonymizovaný export z Grades předmětu BI-MA1 v semestru B212 (2021/2022). Importu dat z CSV lze snadno docílit pomocí balíčku CSV.jl, který přidáme standardně ] add CSV a pak importujeme:

using CSV

Nyní stačí použít metodu read z modulu CSV, v prvním argumentu zadáme cestu k souboru a v druhém uvedeme požadovaný výstupní "formát", v našem případě DataFrame:

df = CSV.read(joinpath("files", "bi-ma1-b232.csv"), DataFrame)
659×17 DataFrame
634 rows omitted
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student0013.011.513.5missing5.028.0missingtrue1713:1534.0missing67.0DIrena58
2student0022.03.018.519.53.025.0missingtrue1812:4548.0missing76.0CIrena43
3student003missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingOndra6
4student0043.0missingmissingmissing4.03.0missingfalsemissingmissingmissingmissingmissingmissingJan V13
5student0053.018.017.0missing4.038.0missingtrue1917:1556.0missing98.0AJarda93
6student0063.010.515.0missingmissing28.5missingtrue1516:3050.0missing78.5CJan S59
7student0072.019.016.0missing2.037.0missingtrue1611:3055.0missing94.0AJan V91
8student008missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJitka6
9student0093.08.014.0missing3.025.0missingtrue1813:1542.0missing70.0CIrena46
10student0101.06.07.0missingmissing14.0missingfalsemissingmissingmissingmissingmissingmissingJan S27
11student0113.014.519.0missing4.036.5missingtrue1715:0057.0missing97.5AJakub89
12student0123.018.015.0missingmissing36.0missingtrue1711:3054.0missing90.0AJan V88
13student0132.0missingmissingmissingmissing2.0missingfalsemissingmissingmissingmissingmissingmissingIrena11
648student6481.013.514.5missingmissing29.0missingtrue1511:0045.0missing74.0CJan V62
649student6491.0missingmissingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingJitka9
650student6502.51.01.0missingmissing4.5missingfalsemissingmissingmissingmissingmissingmissingOndra17
651student6510.08.08.0missing1.016.0missingfalsemissingmissingmissingmissingmissingmissingJitka30
652student6522.05.512.512.51.020.0missingfalsemissingmissingmissingmissingmissingmissingJitka35
653student6533.08.517.5missingmissing29.0missingtrue1614:4555.0missing84.0BJitka62
654student6541.014.015.0missingmissing30.0missingtrue1614:3050.0missing80.0BJan S65
655student6551.55.0missingmissingmissing6.5missingfalsemissingmissingmissingmissingmissingmissingIrena19
656student6562.55.517.5missingmissing25.5missingtrue1610:4554.5missing80.0BOndra49
657student6571.015.06.516.01.025.0missingtrue1514:1542.0missing68.0DIrena41
658student6581.015.017.5missing0.533.5missingtrue1713:4556.0missing90.0AJarda79
659student659missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJakub6

K těmto datům se vrátíme podrobněji později. Samotnou tabulku občas použijeme v různých ukázkách.


1.2 Práce s DataFrame

Z částečného výpisu dat výše vyvstává hned několik otázek:

Pojďme se vydat na průzkum. První, co můžeme zkusit, je podívat se na atributy naší instance (interaktivně pomocí TAB):

df.
ParseError:
# Error @ ]8;;file:///home/kalvin/documents/fit/B241-BI-JUL/bi-jul/tutorials/In[4]#1:4\In[4]:1:4]8;;\
df.
#  └ ── premature end of input

Stacktrace:
 [1] top-level scope
   @ In[4]:1
df.username
659-element Vector{String15}:
 "student001"
 "student002"
 "student003"
 "student004"
 "student005"
 "student006"
 "student007"
 "student008"
 "student009"
 "student010"
 "student011"
 "student012"
 "student013"
 ⋮
 "student648"
 "student649"
 "student650"
 "student651"
 "student652"
 "student653"
 "student654"
 "student655"
 "student656"
 "student657"
 "student658"
 "student659"
df.assessment
659-element Vector{Bool}:
 1
 1
 0
 0
 1
 1
 1
 0
 1
 0
 1
 1
 0
 ⋮
 1
 0
 0
 0
 0
 1
 1
 0
 1
 1
 1
 0

Nebo k tomu můžeme použít metodu names z modulu DataFrames:

names(df)
17-element Vector{String}:
 "username"
 "test1"
 "test2"
 "test3"
 "second_chance"
 "activity"
 "tests_total"
 "gitlab"
 "assessment"
 "exam_test"
 "date"
 "oral_exam"
 "veto"
 "points_total"
 "mark"
 "tutor"
 "percentil"

Metoda eachcol vrátí iterátor přes sloupce. Můžeme tak zjistit typ prvků sloupců:

eltype.(eachcol(df))
17-element Vector{Type}:
 String15
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Float64}
 Union{Missing, Float64}
 Float64
 Union{Missing, Float64}
 Bool
 Union{Missing, Int64}
 Union{Missing, String7}
 Union{Missing, Float64}
 Union{Missing, Bool}
 Union{Missing, Float64}
 Union{Missing, String1}
 String7
 Int64

Odtud vidíme, co znamenají otazníky u typů prvků sloupců. Prvky sloupce jsou složeného typu Union{Missing, T}, v některých sloupcích mohou hodnoty chybět, obsahují hodnotu missing.

Stručné informace můžeme získat pomocí metody describe. Samozřejmě občas uvedené statistiky nemají moc smysl.

describe(df)
17×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent001student6590String15
2test12.028010.02.03.070Union{Missing, Float64}
3test211.58040.012.020.068Union{Missing, Float64}
4test313.58250.014.520.0150Union{Missing, Float64}
5second_chance12.69642.514.022.5589Union{Missing, Float64}
6activity2.385640.02.05.0334Union{Missing, Float64}
7tests_total22.84370.026.543.00Float64
8gitlab0.8148150.50.54.0632Union{Missing, Float64}
9assessment0.616085false1.0true0Bool
10exam_test16.9113-117.020253Union{Missing, Int64}
11date---17:30254Union{Missing, String7}
12oral_exam47.72570.049.057.0278Union{Missing, Float64}
13veto1.0true1.0true657Union{Missing, Bool}
14points_total81.501955.081.5105.0279Union{Missing, Float64}
15markAF253Union{Missing, String1}
16tutorIrenaOndra0String7
17percentil50.3612050.01000Int64

Výpis můžeme kontrolovat uvedením konkrétního rozsahu.

describe(df, cols=["test1", "test2", "test3"])
3×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolFloat64Float64Float64Float64Int64Union
1test12.028010.02.03.070Union{Missing, Float64}
2test211.58040.012.020.068Union{Missing, Float64}
3test313.58250.014.520.0150Union{Missing, Float64}

DataFrame je tabulka, máme pro ní k dispozici podobné metody jako pro matice, na které se také lze dívat jako na "tabulky".

size(df)
(659, 17)
nrow(df)
659
ncol(df)
17

DataFrame lze kopírovat pomocí metody copy, nebo ho vyprázdnit pomocí metod empty, resp. empty!.

ex1 = DataFrame(rand(3, 2), :auto)
ex2 = copy(ex1)
3×2 DataFrame
Rowx1x2
Float64Float64
10.7818330.0730139
20.6894090.292322
30.3567130.156782
ex1
3×2 DataFrame
Rowx1x2
Float64Float64
10.7818330.0730139
20.6894090.292322
30.3567130.156782
empty!(ex2)
empty(ex1)
0×2 DataFrame
Rowx1x2
Float64Float64
ex1
3×2 DataFrame
Rowx1x2
Float64Float64
10.7818330.0730139
20.6894090.292322
30.3567130.156782
ex2
0×2 DataFrame
Rowx1x2
Float64Float64

To by bylo pro začátek vše k základním vlastnostem. Pojďme nyní tabulky modifikovat, upravovat.


Indexování a přístup ke sloupcům

Indexování vychází z maticového zápisu a má podobné vlastnosti. Ke sloupcům můžeme přistupovat několika způsoby:

Tímto způsobem můžeme data číst, ale i je modifikovat.

df.mark
659-element PooledArrays.PooledVector{Union{Missing, String1}, UInt32, Vector{UInt32}}:
 "D"
 "C"
 missing
 missing
 "A"
 "C"
 "A"
 missing
 "C"
 missing
 "A"
 "A"
 missing
 ⋮
 "C"
 missing
 missing
 missing
 missing
 "B"
 "B"
 missing
 "B"
 "D"
 "A"
 missing
df.mark === df[!, :mark]
true
df.mark === df[:, :mark] # kopie!
false
df.mark == df[!, :mark]
missing
df.mark == df[:, :mark]
missing
missing == missing
missing
missing == true
missing
missing == false
missing

Vedle toho nemusíme používat název sloupce, v pořádku je použít i pořadí. Například prvních deset řádků v druhém a třetím sloupci:

df[1:10, ["test1", "test2", "test3"]]
10×3 DataFrame
Rowtest1test2test3
Float64?Float64?Float64?
13.011.513.5
22.03.018.5
3missingmissingmissing
43.0missingmissing
53.018.017.0
63.010.515.0
72.019.016.0
8missingmissingmissing
93.08.014.0
101.06.07.0

Not, Between, Cols, All

Dále můžeme vybírat pouze potřebné sloupce.

ex = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.66860.535190.1805080.8158610.49856
20.6764510.5481240.6445720.7248460.151916
30.6839030.4411080.7456210.3406930.95443
ex[:, Not(:x2)]
3×4 DataFrame
Rowx1x3x4x5
Float64Float64Float64Float64
10.66860.1805080.8158610.49856
20.6764510.6445720.7248460.151916
30.6839030.7456210.3406930.95443
Not(:x2)
InvertedIndex{Symbol}(:x2)
ex[:, Not([:x2, :x3])]
3×3 DataFrame
Rowx1x4x5
Float64Float64Float64
10.66860.8158610.49856
20.6764510.7248460.151916
30.6839030.3406930.95443
ex[:, Between(:x1, :x3)]
3×3 DataFrame
Rowx1x2x3
Float64Float64Float64
10.66860.535190.180508
20.6764510.5481240.644572
30.6839030.4411080.745621
ex[:, Cols(:x2, :x4)]
3×2 DataFrame
Rowx2x4
Float64Float64
10.535190.815861
20.5481240.724846
30.4411080.340693
ex[:, All()]
3×5 DataFrame
Rowx1x2x3x4x5
Float64Float64Float64Float64Float64
10.66860.535190.1805080.8158610.49856
20.6764510.5481240.6445720.7248460.151916
30.6839030.4411080.7456210.3406930.95443

Výše uvedené ukázky využívají funkcionalitu z DataFrames.jl balíčku. "Staré maticové" postupy samozřejmě fungují také, např.:

ex[:, [:x2, :x3]]
3×2 DataFrame
Rowx2x3
Float64Float64
10.535190.180508
20.5481240.644572
30.4411080.745621

Přepisování dat

Údaje můžeme měnit několika způsoby. Vezměme si jednoduchou tabulku:

m = [ "α" 1 0.5; "β" 2 1.5; "γ" 3 2.5]
3×3 Matrix{Any}:
 "α"  1  0.5
 "β"  2  1.5
 "γ"  3  2.5
ex = DataFrame(m, ["a", "b", "c"])
3×3 DataFrame
Rowabc
AnyAnyAny
1α10.5
2β21.5
3γ32.5

Změna jedné položky (Ξ\Xi je velké ξ\xi):

ex[1, :a] = "Ξ"

ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ10.5
2β21.5
3γ32.5

Přepsání všech hodnot ve sloupci c:

ex[:, :c] = ["tau", "pi", "omega"]
3-element Vector{String}:
 "tau"
 "pi"
 "omega"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1tau
2β2pi
3γ3omega

Přepsání jednou hodnotou.

ex[:, :c] .= "ř"
3-element view(::Vector{Any}, :) with eltype Any:
 "ř"
 "ř"
 "ř"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1ř
2β2ř
3γ3ř
z = ex[:, :c]
3-element Vector{Any}:
 "ř"
 "ř"
 "ř"
z[2] = "pí"
"pí"
z
3-element Vector{Any}:
 "ř"
 "pí"
 "ř"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1ř
2β2ř
3γ3ř
z = ex[!, :c]
3-element Vector{Any}:
 "ř"
 "ř"
 "ř"
z[2] = "¿"
"¿"
ex
3×3 DataFrame
Rowabc
AnyAnyAny
1Ξ1ř
2β2¿
3γ3ř
ex.c = [true, false, missing]

ex
3×3 DataFrame
Rowabc
AnyAnyBool?
1Ξ1true
2β2false
3γ3missing

Přepsání hodnot ve sloupci a danou hodnotou, ale pouze v řádcích, kde je hodnota ve sloupci b lichá (zde opět používáme indexování bitovým vektorem, nebo "maskování", s kterým jsme se setkali dříve během semestru):

isodd.(ex.b)
3-element BitVector:
 1
 0
 1
ex[isodd.(ex.b), :a] .= "λ"

ex
3×3 DataFrame
Rowabc
AnyAnyBool?
1λ1true
2β2false
3λ3missing
ex[iseven.(ex.b), :]
1×3 DataFrame
Rowabc
AnyAnyBool?
1β2false

Občas je nutné projít tabulku řádek po řádku. V tom případě můžeme iterovat přes index. Například se pokusme změnit první sloupec na řetězec obsahující daný symbol se spodním indexem daným číslem ve sloupci b v LaTeX notaci (tj. λ_1 atd.).

for row_id = axes(ex, 1) # 1:nrow(df)
    ex[row_id, :a] = ex[row_id, :a] * "_" * string(ex[row_id, :b])
end

ex
3×3 DataFrame
Rowabc
AnyAnyBool?
1λ_11true
2β_22false
3λ_33missing
for row in eachrow(ex)
    println(row)
    println(row[:a])
end
DataFrameRow
 Row │ a    b    c     
     │ Any  Any  Bool? 
─────┼─────────────────
   1 │ λ_1  1     true
λ_1
DataFrameRow
 Row │ a    b    c     
     │ Any  Any  Bool? 
─────┼─────────────────
   2 │ β_2  2    false
β_2
DataFrameRow
 Row │ a    b    c       
     │ Any  Any  Bool?   
─────┼───────────────────
   3 │ λ_3  3    missing 
λ_3

Zobrazovaní (show, first, last, view)

Pokud je DataFrame příliš velký, tak při jeho zobrazení dojde k ořezání (řádků i sloupců). Toto chování můžeme přebít pomocí parametrů metody show.

U našeho BI-ZMA příkladu je to lehce overkill.

show(df)
659×17 DataFrame
 Row │ username    test1      test2      test3      second_chance  activity    ⋯
     │ String15    Float64?   Float64?   Float64?   Float64?       Float64?    ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ student001        3.0       11.5       13.5      missing          5.0   ⋯
   2 │ student002        2.0        3.0       18.5           19.5        3.0
   3 │ student003  missing    missing    missing        missing    missing   
   4 │ student004        3.0  missing    missing        missing          4.0
   5 │ student005        3.0       18.0       17.0      missing          4.0   ⋯
   6 │ student006        3.0       10.5       15.0      missing    missing   
   7 │ student007        2.0       19.0       16.0      missing          2.0
   8 │ student008  missing    missing    missing        missing    missing   
   9 │ student009        3.0        8.0       14.0      missing          3.0   ⋯
  10 │ student010        1.0        6.0        7.0      missing    missing   
  11 │ student011        3.0       14.5       19.0      missing          4.0
  ⋮  │     ⋮           ⋮          ⋮          ⋮            ⋮            ⋮       ⋱
 650 │ student650        2.5        1.0        1.0      missing    missing   
 651 │ student651        0.0        8.0        8.0      missing          1.0   ⋯
 652 │ student652        2.0        5.5       12.5           12.5        1.0
 653 │ student653        3.0        8.5       17.5      missing    missing   
 654 │ student654        1.0       14.0       15.0      missing    missing   
 655 │ student655        1.5        5.0  missing        missing    missing     ⋯
 656 │ student656        2.5        5.5       17.5      missing    missing   
 657 │ student657        1.0       15.0        6.5           16.0        1.0
 658 │ student658        1.0       15.0       17.5      missing          0.5
 659 │ student659  missing    missing    missing        missing    missing     ⋯
                                                 11 columns and 638 rows omitted
show(df, allcols=true)
659×17 DataFrame
 Row │ username    test1      test2      test3      second_chance  activity   tests_total  gitlab    assessment  exam_test  date      oral_exam  veto     points_total  mark      tutor    percentil 
     │ String15    Float64?   Float64?   Float64?   Float64?       Float64?   Float64      Float64?  Bool        Int64?     String7?  Float64?   Bool?    Float64?      String1?  String7  Int64     
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ student001        3.0       11.5       13.5      missing          5.0         28.0   missing        true         17  13:15          34.0  missing          67.0  D         Irena           58
   2 │ student002        2.0        3.0       18.5           19.5        3.0         25.0   missing        true         18  12:45          48.0  missing          76.0  C         Irena           43
   3 │ student003  missing    missing    missing        missing    missing            0.0   missing       false    missing  missing   missing    missing     missing    missing   Ondra            6
   4 │ student004        3.0  missing    missing        missing          4.0          3.0   missing       false    missing  missing   missing    missing     missing    missing   Jan V           13
   5 │ student005        3.0       18.0       17.0      missing          4.0         38.0   missing        true         19  17:15          56.0  missing          98.0  A         Jarda           93
   6 │ student006        3.0       10.5       15.0      missing    missing           28.5   missing        true         15  16:30          50.0  missing          78.5  C         Jan S           59
   7 │ student007        2.0       19.0       16.0      missing          2.0         37.0   missing        true         16  11:30          55.0  missing          94.0  A         Jan V           91
   8 │ student008  missing    missing    missing        missing    missing            0.0   missing       false    missing  missing   missing    missing     missing    missing   Jitka            6
   9 │ student009        3.0        8.0       14.0      missing          3.0         25.0   missing        true         18  13:15          42.0  missing          70.0  C         Irena           46
  10 │ student010        1.0        6.0        7.0      missing    missing           14.0   missing       false    missing  missing   missing    missing     missing    missing   Jan S           27
  11 │ student011        3.0       14.5       19.0      missing          4.0         36.5   missing        true         17  15:00          57.0  missing          97.5  A         Jakub           89
  ⋮  │     ⋮           ⋮          ⋮          ⋮            ⋮            ⋮           ⋮          ⋮          ⋮           ⋮         ⋮          ⋮         ⋮          ⋮           ⋮         ⋮         ⋮
 650 │ student650        2.5        1.0        1.0      missing    missing            4.5   missing       false    missing  missing   missing    missing     missing    missing   Ondra           17
 651 │ student651        0.0        8.0        8.0      missing          1.0         16.0   missing       false    missing  missing   missing    missing     missing    missing   Jitka           30
 652 │ student652        2.0        5.5       12.5           12.5        1.0         20.0   missing       false    missing  missing   missing    missing     missing    missing   Jitka           35
 653 │ student653        3.0        8.5       17.5      missing    missing           29.0   missing        true         16  14:45          55.0  missing          84.0  B         Jitka           62
 654 │ student654        1.0       14.0       15.0      missing    missing           30.0   missing        true         16  14:30          50.0  missing          80.0  B         Jan S           65
 655 │ student655        1.5        5.0  missing        missing    missing            6.5   missing       false    missing  missing   missing    missing     missing    missing   Irena           19
 656 │ student656        2.5        5.5       17.5      missing    missing           25.5   missing        true         16  10:45          54.5  missing          80.0  B         Ondra           49
 657 │ student657        1.0       15.0        6.5           16.0        1.0         25.0   missing        true         15  14:15          42.0  missing          68.0  D         Irena           41
 658 │ student658        1.0       15.0       17.5      missing          0.5         33.5   missing        true         17  13:45          56.0  missing          90.0  A         Jarda           79
 659 │ student659  missing    missing    missing        missing    missing            0.0   missing       false    missing  missing   missing    missing     missing    missing   Jakub            6
                                                                                                                                                                                     638 rows omitted

Toto by byl ještě větší overkill.

# show(df, allrows=true)

Případně můžeme prozkoumávat začátek a konec tabulky.

first(df, 10)
10×17 DataFrame
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student0013.011.513.5missing5.028.0missingtrue1713:1534.0missing67.0DIrena58
2student0022.03.018.519.53.025.0missingtrue1812:4548.0missing76.0CIrena43
3student003missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingOndra6
4student0043.0missingmissingmissing4.03.0missingfalsemissingmissingmissingmissingmissingmissingJan V13
5student0053.018.017.0missing4.038.0missingtrue1917:1556.0missing98.0AJarda93
6student0063.010.515.0missingmissing28.5missingtrue1516:3050.0missing78.5CJan S59
7student0072.019.016.0missing2.037.0missingtrue1611:3055.0missing94.0AJan V91
8student008missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJitka6
9student0093.08.014.0missing3.025.0missingtrue1813:1542.0missing70.0CIrena46
10student0101.06.07.0missingmissing14.0missingfalsemissingmissingmissingmissingmissingmissingJan S27
last(df, 10)
10×17 DataFrame
Rowusernametest1test2test3second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7Int64
1student6502.51.01.0missingmissing4.5missingfalsemissingmissingmissingmissingmissingmissingOndra17
2student6510.08.08.0missing1.016.0missingfalsemissingmissingmissingmissingmissingmissingJitka30
3student6522.05.512.512.51.020.0missingfalsemissingmissingmissingmissingmissingmissingJitka35
4student6533.08.517.5missingmissing29.0missingtrue1614:4555.0missing84.0BJitka62
5student6541.014.015.0missingmissing30.0missingtrue1614:3050.0missing80.0BJan S65
6student6551.55.0missingmissingmissing6.5missingfalsemissingmissingmissingmissingmissingmissingIrena19
7student6562.55.517.5missingmissing25.5missingtrue1610:4554.5missing80.0BOndra49
8student6571.015.06.516.01.025.0missingtrue1514:1542.0missing68.0DIrena41
9student6581.015.017.5missing0.533.5missingtrue1713:4556.0missing90.0AJarda79
10student659missingmissingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingJakub6

Pro rychlé prozkoumávání částí tabulky slouží metoda view, resp. makro @view, která jen část tabulky zobrazí. Nevytváří nový objekt, měla by být efektivnější.

view(df, 100:150, [:username, :points_total, :mark])
51×3 SubDataFrame
26 rows omitted
Rowusernamepoints_totalmark
String15Float64?String1?
1student10091.5A
2student10180.5B
3student102missingmissing
4student10373.0C
5student10475.5C
6student105missingmissing
7student10681.5B
8student10776.0C
9student108missingmissing
10student10976.5C
11student110missingmissing
12student11157.0E
13student112missingmissing
40student13989.0B
41student14074.0C
42student14190.0A
43student14264.5D
44student143missingmissing
45student144missingmissing
46student14578.0C
47student146missingmissing
48student14780.5B
49student148missingF
50student149missingmissing
51student150100.5A
@view df[10:15, [:username, :mark]]
6×2 SubDataFrame
Rowusernamemark
String15String1?
1student010missing
2student011A
3student012A
4student013missing
5student014missing
6student015D

Přidávání a odebírání sloupců

K přidávání sloupců slouží metoda insertcols!:

ex = DataFrame(rand(2, 2), :auto)
2×2 DataFrame
Rowx1x2
Float64Float64
10.8944990.176588
20.05825840.709324
insertcols!(ex, 1, :a => [pi, pi])
2×3 DataFrame
Rowax1x2
Irration…Float64Float64
1π0.8944990.176588
2π0.05825840.709324
insertcols!(ex, 1, :a2 => [pi, -pi])
2×4 DataFrame
Rowa2ax1x2
Float64Irration…Float64Float64
13.14159π0.8944990.176588
2-3.14159π0.05825840.709324
insertcols!(ex, 3, :b => "⊕")
2×5 DataFrame
Rowa2abx1x2
Float64Irration…StringFloat64Float64
13.14159π0.8944990.176588
2-3.14159π0.05825840.709324
insertcols!(ex, 5, :c => [1, 2])
2×6 DataFrame
Rowa2abx1cx2
Float64Irration…StringFloat64Int64Float64
13.14159π0.89449910.176588
2-3.14159π0.058258420.709324

Ale lze použít i prosté indexování:

ex[:, :d] = [42, 42]

ex
2×7 DataFrame
Rowa2abx1cx2d
Float64Irration…StringFloat64Int64Float64Int64
13.14159π0.89449910.17658842
2-3.14159π0.058258420.70932442

Mazání sloupců musíme provést pomocí indexace, neexistuje metoda "dropcolumns!".

ex[:, Not(:b)]
2×6 DataFrame
Rowa2ax1cx2d
Float64Irration…Float64Int64Float64Int64
13.14159π0.89449910.17658842
2-3.14159π0.058258420.70932442

Přidávání a odebírání řádků

Vytvořme si zase testovací tabulku:

ex = DataFrame(rand(2, 2), :auto)
2×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541

Přidat řádek lze opět několika způsoby. Nejpřirozenější je asi metoda push!:

push!(ex, [1.0, 2.0])
3×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
push!(ex, [1, 2])
4×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
41.02.0
push!(ex, (0.5, 2.3))
5×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
41.02.0
50.52.3
push!(ex, ["a", "b"])
┌ Error: Error adding value to column :x1. Maybe you forgot passing `promote=true`?
└ @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:810
MethodError: Cannot `convert` an object of type String to an object of type Float64
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number
   @ Base number.jl:6
  convert(::Type{T}, ::CartesianIndex{1}) where T<:Number
   @ Base multidimensional.jl:136
  convert(::Type{T}, ::AbstractChar) where T<:Number
   @ Base char.jl:185
  ...


Stacktrace:
 [1] push!(a::Vector{Float64}, item::String)
   @ Base ./array.jl:1249
 [2] _row_inserter!(df::DataFrame, loc::Int64, row::Vector{String}, mode::Val{:push}, promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:776
 [3] #push!#339
   @ ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:545 [inlined]
 [4] push!(df::DataFrame, row::Vector{String})
   @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/dataframe/insertion.jl:538
 [5] top-level scope
   @ In[98]:1
push!(ex, (x2 = 0.5, x1 = 0.3))
6×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
41.02.0
50.52.3
60.30.5
push!(ex, (x2 = 7, x1 = 5))
7×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
41.02.0
50.52.3
60.30.5
75.07.0

Pomocí append! můžeme spojovat několik tabulek "vertikálně" dohromady:

append!(ex, DataFrame(rand(10, 2), :auto))
17×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
41.02.0
50.52.3
60.30.5
75.07.0
80.3209720.87072
90.1109980.615922
100.1139220.459498
110.6032630.847671
120.3473290.584542
130.5050990.781451
140.4666490.651833
150.2088530.79092
160.5440470.636684
170.5901730.677212

Mazat řádky můžeme explicitně pomocí delete!:

delete!(ex, 3)
16×2 DataFrame
Rowx1x2
Float64Float64
10.06869120.786712
20.2634860.322541
31.02.0
40.52.3
50.30.5
65.07.0
70.3209720.87072
80.1109980.615922
90.1139220.459498
100.6032630.847671
110.3473290.584542
120.5050990.781451
130.4666490.651833
140.2088530.79092
150.5440470.636684
160.5901730.677212
delete!(ex, [1, 2, 3])
13×2 DataFrame
Rowx1x2
Float64Float64
10.52.3
20.30.5
35.07.0
40.3209720.87072
50.1109980.615922
60.1139220.459498
70.6032630.847671
80.3473290.584542
90.5050990.781451
100.4666490.651833
110.2088530.79092
120.5440470.636684
130.5901730.677212
ex.x1 .> 0.5
13-element BitVector:
 0
 0
 1
 0
 0
 0
 1
 0
 1
 0
 0
 1
 1
delete!(ex, ex.x1 .> 0.5)
8×2 DataFrame
Rowx1x2
Float64Float64
10.52.3
20.30.5
30.3209720.87072
40.1109980.615922
50.1139220.459498
60.3473290.584542
70.4666490.651833
80.2088530.79092

To zdaleka není všechno

Výše jsme shrnuli zřejmě ty nejužitečnější metody a techniky. Tím ale možnosti DataFrames.jl zdaleka nekončí. Zvídavému čtenáři doporučujeme prolétnout dokumentaci. Na tomto místě snad jen upozorněme na následující:

Velmi sofistikované transformace dat v tabulkách lze také provádět pomocí následujících metod:

Jejich podrobný výklad je už nad rámec tohoto kurzu.


2. Cvičení: analýza BI-MA1 v B212, B222, B232

Pojďme se podrobněji podívat na data z předmětu BI-MA1.

Nejprve znovu načteme data z CSV souboru, který je syrovým exportem z Grades, kde jsou anonymizováni uživatelská jména studentů.

using CSV
df1 = CSV.read(joinpath("files", "bi-ma1-b212.csv"), DataFrame)
df2 = CSV.read(joinpath("files", "bi-ma1-b222.csv"), DataFrame)
df3 = CSV.read(joinpath("files", "bi-ma1-b232.csv"), DataFrame);
describe(df1)
16×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent001student5280String15
2test18.451990.08.020.0101Union{Missing, Float64}
3test213.31880.014.020.0219Union{Missing, Float64}
4second_chance15.05880.015.023.5460Union{Missing, Float64}
5activity1.956290.02.05.0242Union{Missing, Float64}
6tests_total14.9280.015.539.50Float64
7gitlab1.307690.01.07.0489Union{Missing, Float64}
8assessment0.395833false0.0true0Bool
9exam_test16.7081017.020319Union{Missing, Int64}
10date---16:50319Union{Missing, String7}
11oral_exam48.302628.050.060.0338Union{Missing, Float64}
12veto1.0true1.0true525Union{Missing, Bool}
13points_total79.621755.080.0103.5339Union{Missing, Float64}
14markAF319Union{Missing, String1}
15tutorHonzaTomáš8Union{Missing, String7}
16percentil53.11362350.01000Int64
size(df1)
(528, 16)
nrow(df1)
528
describe(df2)
17×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent0001student07600String15
2test11.698180.01.53.099Union{Missing, Float64}
3test210.66720.011.020.0132Union{Missing, Float64}
4test312.68930.013.520.0245Union{Missing, Float64}
5second_chance14.54025.015.022.0673Union{Missing, Float64}
6activity2.27454-6.02.05.0383Union{Missing, Float64}
7tests_total19.11480.024.543.02Union{Missing, Float64}
8gitlab2.2368411.012722Union{Missing, Int64}
9assessment0.497361false0.0true2Union{Missing, Bool}
10exam_test16.4377-117.020383Union{Missing, Int64}
11date---CT 14:30383Union{Missing, String15}
12oral_exam46.898820.048.057.0419Union{Missing, Float64}
13veto1.0true1.0true757Union{Missing, Bool}
14points_total80.323554.580.5105.0420Union{Missing, Float64}
15markAF383Union{Missing, String1}
16tutorIrenaPetr2Union{Missing, String7}
17percentil50.95131250.01000Int64
size(df2)
(760, 17)
nrow(df2)
760
describe(df3)
17×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64Type
1usernamestudent001student6590String15
2test12.028010.02.03.070Union{Missing, Float64}
3test211.58040.012.020.068Union{Missing, Float64}
4test313.58250.014.520.0150Union{Missing, Float64}
5second_chance12.69642.514.022.5589Union{Missing, Float64}
6activity2.385640.02.05.0334Union{Missing, Float64}
7tests_total22.84370.026.543.00Float64
8gitlab0.8148150.50.54.0632Union{Missing, Float64}
9assessment0.616085false1.0true0Bool
10exam_test16.9113-117.020253Union{Missing, Int64}
11date---17:30254Union{Missing, String7}
12oral_exam47.72570.049.057.0278Union{Missing, Float64}
13veto1.0true1.0true657Union{Missing, Bool}
14points_total81.501955.081.5105.0279Union{Missing, Float64}
15markAF253Union{Missing, String1}
16tutorIrenaOndra0String7
17percentil50.3612050.01000Int64
size(df3)
(659, 17)

2.1 Základní údaje

Nejprve prozkoumejte dostupné sloupce a pokuste se zjistit základní údaje jako

Počet studentů.

nrow(df1)
528
nrow(df2)
760
nrow(df3)
659

Kolik studentů má zápočet?

sum(df1.assessment)
209
length(filter(isone, df1.assessment))
209
sum(df2.assessment)
missing
sum(skipmissing(df2.assessment))
377
sum(df2[(!ismissing).(df2.assessment), :assessment])
377
filter(x -> !ismissing(x), df2.assessment)
758-element Vector{Union{Missing, Bool}}:
 1
 1
 0
 0
 0
 0
 1
 1
 1
 1
 0
 1
 0
 ⋮
 1
 0
 0
 1
 1
 1
 1
 0
 1
 1
 0
 1
filter(x -> !ismissing(x), df2.assessment) |> sum
377
sum(dropmissing(df2, :assessment).assessment)
377
coalesce.(df2.assessment, false) |> sum
377
sum(df3.assessment)
406

Tj. průchodnost zápočtu.

sum(df1.assessment) / nrow(df1)
0.3958333333333333
sum(coalesce.(df2.assessment, 0)) / nrow(df2)
0.49605263157894736
sum(df3.assessment) / nrow(df3)
0.6160849772382397

Počet vet v tabulce.

sum(df1.veto)
missing
missing + 1 # aha! pozor na missing hodnoty
missing
sum(dropmissing(df1[:, Cols(:veto)]).veto)
3
sum(filter(x -> !ismissing(x), df1.veto))
3
sum(dropmissing(df1, :veto).veto)
3
sum(dropmissing(df2, :veto).veto)
3

Získalo zápočet a úspěšně dokončilo předmět:

length(filter(x -> x != "F", dropmissing(df1, :mark).mark))
189
sum(dropmissing(df1, :mark).mark .!= "F")
189
sum(dropmissing(df2, :mark).mark .!= "F")
340
sum(dropmissing(df3, :mark).mark .!= "F")
380

Počet studentů, kteří získali zápočet a současně neuspěli u zkoušky.

sum(df1.assessment .&& df1.mark .== "F")
20
sum(coalesce.(df2.assessment, false) .&& df2.mark .== "F")
37
sum(coalesce.(df3.assessment, false) .&& df3.mark .== "F")
26

2.2 Zápočtové písemky, kvízy a semestr

Poté se podívejme podrobněji jak probíhalo získávání zápočtů.

using PyPlot, Statistics

Histogram prvního zápočtové testu.

plt.grid()
hist(collect(skipmissing(df3.test1)), bins=[ -0.49 + j for j in 0:4]);

Pozor, se samotným skipmissing nefunguje.

skipmissing(df3.test1)
skipmissing(Union{Missing, Float64}[3.0, 2.0, missing, 3.0, 3.0, 3.0, 2.0, missing, 3.0, 1.0  …  2.5, 0.0, 2.0, 3.0, 1.0, 1.5, 2.5, 1.0, 1.0, missing])
collect(skipmissing(df3.test1))
589-element Vector{Float64}:
 3.0
 2.0
 3.0
 3.0
 3.0
 2.0
 3.0
 1.0
 3.0
 3.0
 2.0
 2.5
 2.5
 ⋮
 3.0
 1.0
 1.0
 2.5
 0.0
 2.0
 3.0
 1.0
 1.5
 2.5
 1.0
 1.0
mean(skipmissing(df3.test1))
2.0280135823429544
median(skipmissing(df3.test1))
2.0
length(collect(skipmissing(df3.test1)))
589
nrow(df3) - length(df3.test1[ismissing.(df3.test1)])
589

Třetí běh lze porovnat s druhým během!

plt.grid()
hist(collect(skipmissing(df2.test1)), bins=[ -0.49 + j for j in 0:4]);

Histogram druhého zápočtového testu.

plt.grid()
hist(collect(skipmissing(df3.test2)), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(collect(skipmissing(df2.test2)), bins=[ -0.49 + j for j in 0:21]);

Třetí test.

plt.grid()
hist(collect(skipmissing(df2.test3)), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(collect(skipmissing(df2.test3)), bins=[ -0.49 + j for j in 0:21]);

Počet studentů píšících testy v druhém běhu.

df2.test1 |> skipmissing |> collect |> length
661
df2.test2 |> skipmissing |> collect |> length
628
df2.test3 |> skipmissing |> collect |> length
515

A v třetím běhu.

df3.test1 |> skipmissing |> collect |> length
589
df3.test2 |> skipmissing |> collect |> length
591
df3.test3 |> skipmissing |> collect |> length
509

Počet studentů, kteří získali zápočet bez opravné zápočtové písemky.

sum(coalesce.(df1.test1 + df1.test2, 0) .>= 25.)
165
sum(coalesce.(df2.test1 + df2.test2 + df2.test3, 0) .>= 25.)
325
sum(coalesce.(df3.test1 + df3.test2 + df3.test3, 0) .>= 25.)
367

Počet studentů s alespoň 20 body ze zápočtových písemek.

nrow(df1[coalesce.(df1.tests_total .>= 20, false), :])
234
nrow(df2[coalesce.(df2.tests_total .>= 20, false), :])
418
nrow(df3[coalesce.(df3.tests_total .>= 20, false), :])
444
plt.grid()
hist(collect(skipmissing(df3.test1 + df3.test2 + df3.test3)), bins=[ -0.49 + j for j in 0:44]);
plt.grid()
hist(collect(skipmissing(df2.test1 + df2.test2 + df2.test3)), bins=[ -0.49 + j for j in 0:44]);

Histogram opravné zápočtové písemky.

plt.grid()
hist(filter(x -> !ismissing(x), df1.second_chance), bins=[ -0.49 + j for j in 0:25]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.second_chance), bins=[ -0.49 + j for j in 0:25]);

Histogram celkového počtu bodů ze semestru.

plt.grid()
hist(filter(x -> !ismissing(x), df1.tests_total), bins=[ -0.49 + j for j in 0:41]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.tests_total), bins=[ -0.49 + j for j in 0:44]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.tests_total), bins=[ -0.49 + j for j in 0:44]);

Úspěšnost zisku zápočtu podle cvičících.

groupby(df1, :tutor)

GroupedDataFrame with 7 groups based on key: tutor

First Group (88 rows): tutor = "Honza"
63 rows omitted
Rowusernametest1test2second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7?Int64
1student0012.0missingmissingmissing2.0missingfalsemissingmissingmissingmissingmissingmissingHonza29
2student0079.56.0missingmissing15.5missingfalsemissingmissingmissingmissingmissingmissingHonza50
3student0100.0missingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
4student022missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
5student0232.08.5missingmissing10.5missingfalsemissingmissingmissingmissingmissingmissingHonza46
6student0372.5missingmissingmissing2.5missingfalsemissingmissingmissingmissingmissingmissingHonza30
7student038missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
8student04412.016.5missing1.528.5missingtrue1811:2060.0missing90.0AHonza83
9student0459.013.515.0missing25.0missingtrue1511:2051.0missing76.0CHonza71
10student047missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
11student068missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
12student07310.010.015.0missing25.0missingtrue1510:2047.0missing72.0CHonza71
13student07711.015.0missingmissing26.0missingtrue1710:4550.0missing76.0CHonza75
77student4734.5missingmissingmissing4.5missingfalsemissingmissingmissingmissingmissingmissingHonza37
78student4773.0missingmissingmissing3.0missingfalsemissingmissingmissingmissingmissingmissingHonza33
79student482missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
80student4840.0missingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingHonza23
81student48813.016.0missing2.529.0missingtrue1912:2054.0missing85.5BHonza84
82student5005.012.0missingmissing17.0missingfalsemissingmissingmissingmissingmissingmissingHonza52
83student5056.014.523.50.025.0missingtrue1710:2031.0missing56.0EHonza58
84student5061.0missingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingHonza26
85student5188.014.017.0missing25.0missingtrue1616:5042.0missing67.0DHonza71
86student5201.0missingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingHonza26
87student5245.519.5missingmissing25.0missingtrue1511:0030.0missing55.0EHonza71
88student52610.517.0missing2.027.5missingtrue13---missingmissingmissingFHonza80

Last Group (48 rows): tutor = "Ivo"
23 rows omitted
Rowusernametest1test2second_chanceactivitytests_totalgitlabassessmentexam_testdateoral_examvetopoints_totalmarktutorpercentil
String15Float64?Float64?Float64?Float64?Float64Float64?BoolInt64?String7?Float64?Bool?Float64?String1?String7?Int64
1student02011.59.510.5missing21.0missingfalsemissingmissingmissingmissingmissingmissingIvo59
2student032missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingIvo23
3student041missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingIvo23
4student0585.0missingmissingmissing5.0missingfalsemissingmissingmissingmissingmissingmissingIvo38
5student0601.0missingmissingmissing1.0missingfalsemissingmissingmissingmissingmissingmissingIvo26
6student0700.0missingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingIvo23
7student075missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingIvo23
8student095missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingIvo23
9student0984.06.5missingmissing10.5missingfalsemissingmissingmissingmissingmissingmissingIvo46
10student1005.0missingmissingmissing5.0missingfalsemissingmissingmissingmissingmissingmissingIvo38
11student1117.0missingmissingmissing7.0missingfalsemissingmissingmissingmissingmissingmissingIvo42
12student1129.010.0missingmissing19.0missingfalsemissingmissingmissingmissingmissingmissingIvo55
13student1209.011.010.5missing20.0missingfalsemissingmissingmissingmissingmissingmissingIvo57
37student3815.0missingmissingmissing5.0missingfalsemissingmissingmissingmissingmissingmissingIvo38
38student3919.013.517.0missing25.0missingtrue10---missingmissingmissingFIvo71
39student41412.520.0missingmissing32.50.5true1715:3041.0missing74.0CIvo92
40student4254.0missingmissingmissing4.0missingfalsemissingmissingmissingmissingmissingmissingIvo36
41student4329.06.5missingmissing15.5missingfalsemissingmissingmissingmissingmissingmissingIvo50
42student4510.55.0missingmissing5.5missingfalsemissingmissingmissingmissingmissingmissingIvo39
43student45413.09.011.0missing22.0missingfalsemissingmissingmissingmissingmissingmissingIvo59
44student4783.5missingmissingmissing3.5missingfalsemissingmissingmissingmissingmissingmissingIvo34
45student4799.514.013.0missing23.5missingfalsemissingmissingmissingmissingmissingmissingIvo61
46student511missingmissingmissingmissing0.0missingfalsemissingmissingmissingmissingmissingmissingIvo23
47student5122.56.0missing1.08.5missingfalsemissingmissingmissingmissingmissingmissingIvo44
48student52813.013.0missing1.026.00.0true1510:0042.0missing69.0DIvo75
df = combine(groupby(df1, :tutor),
    nrow => "studenti",
    :assessment => sum => "zápočty",
    :test1 => (x -> sum(coalesce.(x, 0) .> 0)) => "psali 1. test",
    :activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df
7×6 DataFrame
Rowtutorstudentizápočtypsali 1. testprůměrná aktivitapropustnost
String7?Int64Int64Int64Float64Float64
1Honza8826580.3750.295455
2Irena9530631.021050.315789
3Jitka9534781.231580.357895
4Pavel9638781.750.395833
5missing8000.00.0
6Tomáš9869901.382650.704082
7Ivo4812390.18750.25
df = combine(groupby(df2, :tutor),
    nrow => "studenti",
    :assessment => (x -> sum(collect(skipmissing(x)))) => "zápočty",
    :test1 => (x -> sum(coalesce.(x, 0) .> 0)) => "psali 1. test",
    :activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df
9×6 DataFrame
Rowtutorstudentizápočtypsali 1. testprůměrná aktivitapropustnost
String7?Int64Int64Int64Float64Float64
1Jan V146641150.4246580.438356
2Pavel149841201.073830.563758
3Jitka7341621.760270.561644
4Petr9939640.4595960.393939
5Jiřina6836541.492650.529412
6Irena7441681.50.554054
7Jarda10057902.460.57
8Jan S4915330.06122450.306122
9missing2000.00.0
df = combine(groupby(df3, :tutor),
    nrow => "studenti",
    :assessment => (x -> sum(collect(skipmissing(x)))) => "zápočty",
    :test1 => (x -> sum(coalesce.(x, 0) .> 0)) => "psali 1. test",
    :activity => (x -> mean(coalesce.(x, 0))) => "průměrná aktivita"
)
df[:, "propustnost"] = df[:, "zápočty"] ./ df[:, "studenti"]
df
7×6 DataFrame
Rowtutorstudentizápočtypsali 1. testprůměrná aktivitapropustnost
String7Int64Int64Int64Float64Float64
1Irena9966902.333330.666667
2Ondra9451790.2056760.542553
3Jan V144771260.6736110.534722
4Jarda9576892.594740.8
5Jan S4736420.2553190.765957
6Jitka9660871.203120.625
7Jakub8440480.6428570.47619

2.3 Zkouškové období

V BI-MA1 hodně rozdílné vůči BI-ZMA.

Histogram známek.

combine(groupby(df1, :mark), nrow => "studenti") |> sort
7×2 DataFrame
Rowmarkstudenti
String1?Int64
1A42
2B57
3C53
4D32
5E5
6F20
7missing319
combine(groupby(df2, :mark), nrow => "studenti") |> sort
7×2 DataFrame
Rowmarkstudenti
String1?Int64
1A79
2B104
3C110
4D41
5E6
6F37
7missing383
combine(groupby(df3, :mark), nrow => "studenti") |> sort
7×2 DataFrame
Rowmarkstudenti
String1?Int64
1A108
2B121
3C107
4D35
5E9
6F26
7missing253
sort(combine(groupby(df1[coalesce.(df1.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A12
2B7
3C8
4D2
5F4
6missing5
sort(combine(groupby(df2[coalesce.(df2.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A13
2B12
3C4
4D1
5F2
6missing6
sort(combine(groupby(df3[coalesce.(df3.gitlab, 0) .> 0 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A9
2B4
3C5
4D1
5F1
6missing7
sort(combine(groupby(df1[coalesce.(df1.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A29
2B25
3C13
4D10
5F4
6missing15
sort(combine(groupby(df2[coalesce.(df2.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A47
2B39
3C32
4D6
5F7
6missing28
sort(combine(groupby(df3[coalesce.(df3.activity, 0) .> 2 ,:], :mark), nrow => "studenti"), :mark)
6×2 DataFrame
Rowmarkstudenti
String1?Int64
1A58
2B43
3C16
4D7
5F7
6missing16
plt.grid()
hist(filter(x -> !ismissing(x), df1.oral_exam), bins=[ -0.49 + j for j in 0:61]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.oral_exam), bins=[ -0.49 + j for j in 0:58]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.oral_exam), bins=[ -0.49 + j for j in 0:58]);
plt.grid()
hist(filter(x -> !ismissing(x), df1.exam_test), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(filter(x -> !ismissing(x), df2.exam_test), bins=[ -0.49 + j for j in 0:21]);
plt.grid()
hist(filter(x -> !ismissing(x), df3.exam_test), bins=[ -0.49 + j for j in 0:21]);
combine(groupby(df1, :exam_test), nrow => "počet") |> sort
13×2 DataFrame
Rowexam_testpočet
Int64?Int64
104
2101
3111
4123
5132
6145
71529
81637
91740
101837
111934
122016
13missing319
combine(groupby(df1[df1.assessment .=== true, :], :exam_test), nrow => "počet") |> sort
12×2 DataFrame
Rowexam_testpočet
Int64?Int64
104
2101
3111
4123
5132
6145
71529
81637
91740
101837
111934
122016

Reference

Další informace, detaily či tutoriály lze nalézt v dokumentaci DataFrames.jl. V notebooku jsme používali balíček CSV.jl.