CSV
Getting started
Our first move is to tell the compiler where your CSV data may be found. CSV.fromString is a macro which reads the column headers and injects them into the compiler's type system. Here, we inline a string for the compiler to analyze.
import io.github.quafadas.table.*
val csv : CsvIterator[("col1", "col2", "col3"), (Int, Int, Int)] = CSV.fromString("col1,col2,col3\n1,2,7\n3,4,8\n5,6,9")
// csv: CsvIterator[Tuple3["col1", "col2", "col3"], Tuple3[Int, Int, Int]] = empty iterator
val asList = LazyList.from(csv)
// asList: LazyList[NamedTuple[Tuple3["col1", "col2", "col3"], Tuple3[Int, Int, Int]]] = LazyList(
// (1, 2, 7),
// (3, 4, 8),
// (5, 6, 9)
// )
asList.take(2).consoleFormatNt(fansi = false)
// res0: String = """| |col1|col2|col3|
// |-|----|----|----|
// |0| 1| 2| 7|
// |1| 3| 4| 8|
// |-|----|----|----|"""
This is the key point of the whole library: note the take(2) method. It comes from Scala's standard library. In case it's not clear, you get all the other stdlib machinery too - .filter, .groupMapReduce, and friends - and its use is strongly typed, because CsvIterator is merely an Iterator of NamedTuples: you access the columns via their column names.
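A minimal sketch of that idea, using the asList value from above: ordinary collection methods compose with named, typed field access.

```scala
// col1 and col3 are Ints in the type, so comparison and arithmetic
// are checked at compile time; a misspelt column name would not compile.
val sums = asList.filter(_.col1 > 1).map(r => r.col1 + r.col3)
// sums.toList == List(11, 14)
```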
Reading CSV files
Reading CSVs from strings is relatively uncommon - normally a CSV lives in a file.
The CSV object has a few methods for reading CSV files. Inside the macro, it is fundamentally based on scala.io.Source.
import io.github.quafadas.table.*
val csv_resource = CSV.resource("simple.csv")
val csv_abs = CSV.absolutePath("/users/simon/absolute/path/simple.csv")
val csv_url = CSV.url("https://example.com/simple.csv")
val opts = CsvOpts(typeInferrer = TypeInferrer.FirstN(1000), delimiter = ';')

/**
 * Note: CSV.pwd reads from the _compiler's_ current working directory. If you are
 * compiling via bloop through scala-cli, for example, this will read the temporary
 * directory _bloop_ is running in, _not_ your project directory.
 */
val csv_pwd = CSV.pwd("file.csv", opts)
For customisation options, look at CsvOpts and supply an instance as the second argument to any of the above methods.
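For instance, assuming the delimiter option shown above, a semicolon-separated string could be read like this (a sketch):

```scala
// fromString accepts a CsvOpts as its second argument, as with the
// file-based methods; here we override the delimiter.
val semi = CSV.fromString("a;b\n1;2\n3;4", CsvOpts(delimiter = ';'))
```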
Columnar Reading
By default, CSV data is read as an iterator of rows (CsvIterator). For analytical workloads, you can read CSV data directly into a columnar format using ReadAs.Columns:
import io.github.quafadas.table.*
// Read as columns - returns NamedTuple of Arrays
val columnar = CSV.fromString("name,age,score\nAlice,30,95.5\nBob,25,87.3", CsvOpts(readAs = ReadAs.Columns))
// columnar: NamedTuple[Tuple3["name", "age", "score"], *:[Array[String], *:[Array[Int], *:[Array[Double], EmptyTuple]]]] = (
// Array("Alice", "Bob"),
// Array(30, 25),
// Array(95.5, 87.3)
// )
// Access columns directly as typed arrays
val names: Array[String] = columnar.name
// names: Array[String] = Array("Alice", "Bob")
val ages: Array[Int] = columnar.age
// ages: Array[Int] = Array(30, 25)
val scores: Array[Double] = columnar.score
// scores: Array[Double] = Array(95.5, 87.3)
println(s"Average age: ${ages.sum.toDouble / ages.length}")
// Average age: 27.5
println(s"Max score: ${scores.max}")
// Max score: 95.5
Columnar reading:
- Loads all data into memory at once
- Provides direct array access to columns
- More efficient for column-oriented analytics
- Works with all CSV reading methods (resource, absolutePath, fromString, etc.)
- Supports all type inference options
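Because the columns are ordinary typed arrays, standard collection operations compose directly. A sketch using the columnar value from above:

```scala
// Pair each name with its score using plain Array zipping;
// the element types (String, Double) come from the inferred column types.
val scoreByName: Map[String, Double] = columnar.name.zip(columnar.score).toMap
// scoreByName == Map("Alice" -> 95.5, "Bob" -> 87.3)
```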
For advanced use cases requiring a single dense array with stride information (e.g., for BLAS/LAPACK interop), see ReadAs.ArrayDenseColMajor[T]() and ReadAs.ArrayDenseRowMajor[T]() in the Column Orient cookbook.
Strongly Typed CSVs
Scautable analyzes the CSV data and provides types and names for the columns. That means you get IDE support, autocomplete, error messages for nonsensical code, and so on.
import io.github.quafadas.table.*
val experiment = asList
.mapColumn["col1", Double](_.toDouble)
.mapColumn["col2", Boolean](_.toInt > 3)
// experiment: LazyList[NamedTuple[Tuple3["col1", "col2", "col3"], *:[Double, *:[Boolean, *:[Int, EmptyTuple]]]]] = LazyList(
// (1.0, false, 7),
// (3.0, true, 8),
// (5.0, true, 9)
// )
println(experiment.consoleFormatNt(fansi = false))
// | |col1| col2|col3|
// |-|----|-----|----|
// |0| 1.0|false| 7|
// |1| 3.0| true| 8|
// |2| 5.0| true| 9|
// |-|----|-----|----|
For example, one cannot make column-name typos, because the names are embedded in the type system:
val nope = experiment.mapColumn["not_col1", Double](_.toDouble)
// error:
// value toDouble is not a member of EmptyTuple
// val nope = experiment.mapColumn["not_col1", Double](_.toDouble)
// ^^^^^^^^^^
// error:
// Column ("not_col1" : String) not found
// val nope = experiment.mapColumn["not_col1", Double](_.toDouble)
// ^
Column Operations
Let's have a look at some of the column manipulation helpers:
- dropColumn
- addColumn
- renameColumn
- mapColumn
val colmanipuluation = experiment
.dropColumn["col2"]
.addColumn["col4", Double](x => x.col1 * 2 + x.col3.toDouble)
.renameColumn["col4", "col4_renamed"]
.mapColumn["col4_renamed", Double](_ * 2)
// colmanipuluation: LazyList[NamedTuple[*:["col1", *:["col3", *:["col4_renamed", EmptyTuple]]], *:[Double, *:[Int, *:[Double, EmptyTuple]]]]] = LazyList(
// (1.0, 7, 18.0),
// (3.0, 8, 28.0),
// (5.0, 9, 38.0)
// )
colmanipuluation.consoleFormatNt(fansi = false)
// res5: String = """| |col1|col3|col4_renamed|
// |-|----|----|------------|
// |0| 1.0| 7| 18.0|
// |1| 3.0| 8| 28.0|
// |2| 5.0| 9| 38.0|
// |-|----|----|------------|"""
println(colmanipuluation.column["col4_renamed"].foldLeft(0.0)(_ + _))
// 84.0
// and select a subset of columns
colmanipuluation.columns[("col4_renamed", "col1")].consoleFormatNt(fansi = false)
// res7: String = """| |col4_renamed|col1|
// |-|------------|----|
// |0| 18.0| 1.0|
// |1| 28.0| 3.0|
// |2| 38.0| 5.0|
// |-|------------|----|"""
Accumulating, slicing, etc.
We can delegate all such concerns to the standard library in the usual way, as we have everything inside the type system!
colmanipuluation.filter(_.col4_renamed > 20).groupMapReduce(_.col1)(_.col4_renamed)(_ + _)
// res8: Map[Double, Double] = Map(3.0 -> 28.0, 5.0 -> 38.0)
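Other stdlib combinators work the same way; for instance, sorting by a typed column (a sketch using the colmanipuluation value from above):

```scala
// sortBy is plain stdlib; the Ordering is resolved from the column's
// Double type, so the whole expression is type-checked.
val byScore = colmanipuluation.sortBy(_.col4_renamed)
```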