Getting started
Our first move is to tell the compiler where the file may be found. CSV.resource
is a macro which reads the column headers and injects them into the compiler's type system. Here, it's used for the file simple.csv
in the project's resource directory.
import io.github.quafadas.table.*
def csv : CsvIterator[("col1", "col2", "col3")] = CSV.resource("simple.csv")
def firstRows: Iterator[(col1: String, col2: String, col3: String)] = csv.take(2)
println(firstRows.toArray.consoleFormatNt(fansi = false))
// | |col1|col2|col3|
// +-+----+----+----+
// |0| 1| 2| 7|
// |1| 3| 4| 8|
// +-+----+----+----+
Note the take(2)
method. This is a standard Scala method, and it illustrates the key point behind the whole idea: we can trivially plug Scala's standard library into a CSV file.
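As a quick illustration (a sketch reusing the csv definition above, with the values from simple.csv shown in the examples), any other Iterator method slots in the same way:
def bigRows = csv.filter(_.col1.toInt > 1) // standard filter over the rows
def col1Sum = csv.map(_.col1.toInt).sum    // 1 + 3 + 5
println(col1Sum)
// 9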
Reading CSV files
CSV
has a few methods of reading CSV files. Inside the macro, it is fundamentally scala.io.Source
based.
import io.github.quafadas.table.*
def csv_resource = CSV.resource("simple.csv")
def csv_abs = CSV.absolutePath("/users/simon/absolute/path/simple.csv")
def csv_url = CSV.url("https://example.com/simple.csv")
Strongly Typed CSVs
We expose a small number of "column" methods, which allow column manipulation. They deal with the type-level bookkeeping surrounding named tuples. Note that these operate on iterators.
import io.github.quafadas.table.*
def experiment: Iterator[(col1 : Double, col2: Boolean, col3: String)] = csv
.mapColumn["col1", Double](_.toDouble)
.mapColumn["col2", Boolean](_.toInt > 3)
println(experiment.toArray.consoleFormatNt(fansi = false))
// | |col1| col2|col3|
// +-+----+-----+----+
// |0| 1.0|false| 7|
// |1| 3.0| true| 8|
// |2| 5.0| true| 9|
// +-+----+-----+----+
Note that one cannot make column-name typos: if you try to map a column which doesn't exist, the compiler will complain.
We'll leave out explicit type ascriptions for the rest of the examples.
def nope = experiment.mapColumn["not_col1", Double](_.toDouble)
// error:
// value toDouble is not a member of EmptyTuple
// def nope = experiment.mapColumn["not_col1", Double](_.toDouble)
// ^^^^^^^^^^
// error:
// Cannot prove that io.github.quafadas.scautable.CSV.IsColumn[("not_col1" : String), (
// ("col1" : String), ("col2" : String), ("col3" : String))] =:= (true : Boolean).
// def nope = experiment.mapColumn["not_col1", Double](_.toDouble)
// ^
Column Operations
For the full list, see io.github.quafadas.scautable.CSV and look at the extension methods.
def colManipulation = experiment
.dropColumn["col2"]
.addColumn["col4", Double](x => x.col1 * 2 + x.col3.toDouble)
.renameColumn["col4", "col4_renamed"]
.mapColumn["col4_renamed", Double](_ * 2)
println(colManipulation.toArray.consoleFormatNt(fansi = false))
// | |col4_renamed|col1|col3|
// +-+------------+----+----+
// |0| 18.0| 1.0| 7|
// |1| 28.0| 3.0| 8|
// |2| 38.0| 5.0| 9|
// +-+------------+----+----+
println(colManipulation.column["col4_renamed"].foldLeft(0.0)(_ + _))
// 84.0
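Since column hands back a plain iterator of the column's values, the usual standard-library aggregations apply directly (a sketch; the expected values follow the table above):
println(colManipulation.column["col4_renamed"].sum) // same 84.0 as the foldLeft
// 84.0
println(colManipulation.column["col4_renamed"].max)
// 38.0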
Accumulating, slicing, etc.
We can delegate all such concerns to the standard library in the usual way, as we have everything inside the type system!
In my mind, there are more or less two ways of going about this. I'm usually working in the small, so I materialise the iterator early and treat it as a list.
val asList = colManipulation.toList
// asList: List[NamedTuple[*:["col4_renamed", *:["col1", *:["col3", EmptyTuple]]], *:[Double, *:[Double, *:[String, EmptyTuple]]]]] = List(
// (18.0, 1.0, "7"),
// (28.0, 3.0, "8"),
// (38.0, 5.0, "9")
// )
println(asList.filter(_.col4_renamed > 20).groupMapReduce(_.col1)(_.col4_renamed)(_ + _))
// Map(3.0 -> 28.0, 5.0 -> 38.0)
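Slicing is the same story (a sketch over the materialised list, using the values from the table above):
println(asList.slice(1, 3).map(_.col1)) // rows 1 and 2, col1 only
// List(3.0, 5.0)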
Otherwise, we can use fold and friends to achieve something similar over the Iterator
(a sketch of the grouping follows the example below).
println(colManipulation.filter(_.col4_renamed > 20).foldLeft(0.0)(_ + _.col4_renamed))
// 66.0
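And here's that grouping written out by hand (a sketch: a single pass with foldLeft into an accumulator Map keyed by col1, which should reproduce the Map from the List version above):
val grouped = colManipulation
  .filter(_.col4_renamed > 20)
  .foldLeft(Map.empty[Double, Double]) { (acc, row) =>
    acc.updatedWith(row.col1) {
      case Some(total) => Some(total + row.col4_renamed) // accumulate per key
      case None        => Some(row.col4_renamed)         // first value for key
    }
  }
println(grouped)
// Map(3.0 -> 28.0, 5.0 -> 38.0)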
Why are the iterators def?
Because if you make them val and try to read them a second time, you'll get a StreamClosedException or something similar. They are cheap to create; I normally switch to val after a call to toList or similar.
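In practice that looks something like this (a sketch):
def rows = CSV.resource("simple.csv") // def: every reference yields a fresh Iterator
val cached = rows.toList              // materialise once, then switch to val
println(cached.map(_.col1.toInt).sum) // traverse cached as often as you like
// 9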
Example
Here's a link to a scastie which does some manipulation on the Titanic dataset.