Column Orient
"Vector" style computation is beyond the scope of scautable itself. However, a row-oriented representation of the data is not always the right construct, particularly for analysis tasks.
To note again: statistics is beyond the scope of scautable.
You are encouraged to bring in a separate mathematics / statistics library (entirely at your own discretion / risk).
Reading CSV directly as columns
Scautable can read CSV data directly into a columnar format using the ReadAs.Columns option. This is more efficient than reading rows and then converting, as it only requires a single pass through the data.
This will fire up a REPL with the necessary imports:
scala-cli repl --dep io.github.quafadas::scautable::0.0.36 --dep io.github.quafadas::vecxt:0.0.36 --java-opt "--add-modules=jdk.incubator.vector" --scalac-option -Xmax-inlines --scalac-option 2048 --java-opt -Xss4m --repl-init-script 'import io.github.quafadas.table.{*, given}; import vecxt.all.{*, given}'
import io.github.quafadas.table.*
// Read directly as columns - returns NamedTuple of Arrays
// lazy - stops the REPL from printing the whole result immediately
lazy val simpleCols = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.Columns))
// Access columns directly as typed arrays
val col1: Array[Int] = simpleCols.col1
// col1: Array[Int] = Array(1, 3, 5)
val col2: Array[Int] = simpleCols.col2
// col2: Array[Int] = Array(2, 4, 6)
val col3: Array[Int] = simpleCols.col3
// col3: Array[Int] = Array(7, 8, 9)
// With vecxt, we get optimised vector operations too.
// simpleCols.col1 + simpleCols.col2
// Works with type inference
val titanicCols = CSV.resource("titanic.csv", CsvOpts(TypeInferrer.FromAllRows, ReadAs.Columns))
// titanicCols: NamedTuple[Tuple12["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"], *:[Array[Int], *:[Array[Boolean], *:[Array[Int], *:[Array[String], *:[Array[String], *:[Array[Option[Double]], *:[Array[Int], *:[Array[Int], *:[Array[String], *:[Array[Double], *:[Array[Option[String]], *:[Array[Option[String]], EmptyTuple]]]]]]]]]]]]] = Tuple12(
// _1 = Array(
// 1,
// 2,
// 3,
// 4,
// 5,
// 6,
// 7,
// 8,
// 9,
// 10,
// 11,
// 12,
// 13,
// 14,
// 15,
// 16,
// 17,
// 18,
// 19,
// 20,
// 21,
// 22,
// 23,
// 24,
// 25,
// 26,
// 27,
// 28,
// 29,
// 30,
// 31,
// 32,
// 33,
// 34,
// 35,
// 36,
// 37,
// 38,
// 39,
// 40,
// 41,
// 42,
// 43,
// 44,
// 45,
// 46,
// 47,
// ...
val ages: Array[Option[Double]] = titanicCols.Age
// ages: Array[Option[Double]] = Array(
// Some(22.0),
// Some(38.0),
// Some(26.0),
// Some(35.0),
// Some(35.0),
// None,
// Some(54.0),
// Some(2.0),
// Some(27.0),
// Some(14.0),
// Some(4.0),
// Some(58.0),
// Some(20.0),
// Some(39.0),
// Some(14.0),
// Some(55.0),
// Some(2.0),
// None,
// Some(31.0),
// None,
// Some(35.0),
// Some(34.0),
// Some(15.0),
// Some(28.0),
// Some(8.0),
// Some(38.0),
// None,
// Some(19.0),
// None,
// None,
// Some(40.0),
// None,
// None,
// Some(66.0),
// Some(28.0),
// Some(42.0),
// None,
// Some(21.0),
// Some(18.0),
// Some(14.0),
// Some(40.0),
// Some(27.0),
// None,
// Some(3.0),
// Some(19.0),
// None,
// None,
// None,
// ...
val survived: Array[Boolean] = titanicCols.Survived
// survived: Array[Boolean] = Array(
// false,
// true,
// true,
// true,
// false,
// false,
// false,
// false,
// true,
// true,
// true,
// true,
// false,
// false,
// false,
// true,
// false,
// true,
// false,
// true,
// false,
// true,
// true,
// true,
// false,
// true,
// false,
// false,
// true,
// false,
// false,
// true,
// true,
// false,
// false,
// false,
// true,
// false,
// false,
// true,
// false,
// false,
// false,
// true,
// true,
// false,
// false,
// true,
// ...
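Columns inferred as Option, such as Age above, need their missing values handled before any numeric work. The following is a minimal sketch in plain Scala, assuming a small hand-written sample in place of titanicCols.Age (the values are the first six entries printed above). `meanOfPresent` is a hypothetical helper written for illustration, not part of scautable.

```scala
// Stand-in for titanicCols.Age: the first six values printed above.
val sampleAges: Array[Option[Double]] =
  Array(Some(22.0), Some(38.0), Some(26.0), Some(35.0), Some(35.0), None)

// Hypothetical helper: mean of the values that are present,
// or None when every value is missing.
def meanOfPresent(xs: Array[Option[Double]]): Option[Double] =
  val present = xs.flatten
  if present.isEmpty then None else Some(present.sum / present.length)

val meanAge: Option[Double] = meanOfPresent(sampleAges)
val missingCount: Int = sampleAges.count(_.isEmpty)
```

Returning an Option from the helper keeps the "no data at all" case explicit instead of producing NaN.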
Reading CSV as Dense Arrays
For interoperability with numerical libraries (e.g., BLAS, LAPACK) or when you need a single contiguous memory layout, scautable provides dense array reading modes. These modes read all CSV data into a single flat array with stride information for accessing rows and columns.
Column-Major Dense Arrays
Column-major layout stores data column-by-column in memory, which is the standard layout for Fortran and mathematical libraries like BLAS/LAPACK.
import io.github.quafadas.table.*
// Read as column-major dense array
val colMajor = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[Int]()))
// colMajor: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Int], Int, Int, Int, Int]] = (
// Array(1, 3, 5, 2, 4, 6, 7, 8, 9),
// 3,
// 1,
// 3,
// 3
// )
// Access the fields
val cmData: Array[Int] = colMajor.data // The flat array containing all data
// cmData: Array[Int] = Array(1, 3, 5, 2, 4, 6, 7, 8, 9)
val cmRowStride: Int = colMajor.rowStride // Stride to next row = numRows
// cmRowStride: Int = 3
val cmColStride: Int = colMajor.colStride // Stride to next column = 1
// cmColStride: Int = 1
val cmRows: Int = colMajor.rows // Number of rows
// cmRows: Int = 3
val cmCols: Int = colMajor.cols // Number of columns
// cmCols: Int = 3
// Access element at row i, col j
def getElementColMajor(i: Int, j: Int): Int =
  cmData(j * cmRowStride + i * cmColStride)
// Example: get element at row 1, col 1
val cmElement = getElementColMajor(1, 1)
// cmElement: Int = 4
In column-major layout:
- colStride = 1 (next element in the same column)
- rowStride = numRows (jump to the next row)
- Data is stored: [col0_row0, col0_row1, ..., col1_row0, col1_row1, ...]
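The same indexing rule can be used to pull whole rows or columns out of the flat array. The following is a minimal sketch in plain Scala, assuming the data and strides printed above for simple.csv. `Dense`, `column` and `row` are hypothetical names used for illustration, not scautable API.

```scala
// Hypothetical container mirroring the fields of the result shown above.
final case class Dense(data: Array[Int], rowStride: Int, colStride: Int, rows: Int, cols: Int):
  // Same indexing rule as getElementColMajor.
  def apply(i: Int, j: Int): Int = data(j * rowStride + i * colStride)
  // All elements of column j, top to bottom.
  def column(j: Int): Array[Int] = Array.tabulate(rows)(i => apply(i, j))
  // All elements of row i, left to right.
  def row(i: Int): Array[Int] = Array.tabulate(cols)(j => apply(i, j))

val cm = Dense(Array(1, 3, 5, 2, 4, 6, 7, 8, 9), rowStride = 3, colStride = 1, rows = 3, cols = 3)
val secondColumn: Array[Int] = cm.column(1) // the values of col2
val secondRow: Array[Int] = cm.row(1)       // the second data row of the CSV
```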
If your CSV is a bare matrix, i.e. it has no header row, use the HeaderOptions.Auto option together with a dense array reading mode. This reads the entire matrix, treating the first row as data.
import io.github.quafadas.table.{*, given}
val matrixData = CSV.resource("matrix.csv", CsvOpts(headerOptions = HeaderOptions.Auto, readAs = ReadAs.ArrayDenseColMajor[Double]()))
// matrixData: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Double], Int, Int, Int, Int]] = (
// Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0, 7.0, 8.0, 9.0),
// 3,
// 1,
// 3,
// 3
// )
Row-Major Dense Arrays
Row-major layout stores data row-by-row in memory, which is the standard layout for C and most programming languages.
import io.github.quafadas.table.*
// Read as row-major dense array
val rowMajor = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.ArrayDenseRowMajor[Double]()))
// rowMajor: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Double], Int, Int, Int, Int]] = (
// Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0),
// 1,
// 3,
// 3,
// 3
// )
// Access the fields
val rmData: Array[Double] = rowMajor.data // The flat array containing all data
// rmData: Array[Double] = Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0)
val rmRowStride: Int = rowMajor.rowStride // Stride to next row = 1
// rmRowStride: Int = 1
val rmColStride: Int = rowMajor.colStride // Stride to next column = numCols
// rmColStride: Int = 3
val rmRows: Int = rowMajor.rows // Number of rows
// rmRows: Int = 3
val rmCols: Int = rowMajor.cols // Number of columns
// rmCols: Int = 3
// Access element at row i, col j
def getElementRowMajor(i: Int, j: Int): Double =
  rmData(i * rmColStride + j * rmRowStride)
// Example: get element at row 0, col 2
val rmElement = getElementRowMajor(0, 2)
// rmElement: Double = 7.0
In row-major layout:
- rowStride = 1 (next element in the same row)
- colStride = numCols (jump to the next column)
- Data is stored: [row0_col0, row0_col1, ..., row1_col0, row1_col1, ...]
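Both layouts describe the same matrix, so one can be rearranged into the other. The following is a minimal sketch in plain Scala, assuming the row-major data printed above for simple.csv. `rowMajorToColMajor` is a hypothetical helper written for illustration, not a scautable function.

```scala
// Hypothetical helper: copy a row-major flat array into column-major order.
def rowMajorToColMajor(data: Array[Double], rows: Int, cols: Int): Array[Double] =
  val out = new Array[Double](rows * cols)
  for
    i <- 0 until rows
    j <- 0 until cols
  do out(j * rows + i) = data(i * cols + j) // (i, j) moves from i*cols+j to j*rows+i
  out

val rmExample = Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0)
val cmExample = rowMajorToColMajor(rmExample, rows = 3, cols = 3)
```

For this input the result has the same ordering as the column-major data read from simple.csv earlier (1, 3, 5, 2, 4, 6, 7, 8, 9). In practice, reading directly in the layout you need avoids this extra copy.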
Type Safety
The dense array modes require a type parameter specifying the array element type:
// Strongly typed as Array[Int]
val intArray = CSV.resource("data.csv", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[Int]()))
// Strongly typed as Array[Double]
val doubleArray = CSV.resource("data.csv", CsvOpts(readAs = ReadAs.ArrayDenseRowMajor[Double]()))
// Strongly typed as Array[String]
val stringArray = CSV.fromString("a,b\nfoo,bar", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[String]()))
The type conversion is handled automatically using scautable's ColumnDecoder infrastructure, which supports Int, Long, Double, Boolean, String, and Option types.
Use Cases
Dense arrays are particularly useful for:
- Numerical computing: Passing data to BLAS/LAPACK or other numerical libraries
- Machine learning: Preparing data for algorithms that expect contiguous arrays
- Performance: Single memory allocation and cache-friendly access patterns
- Interop: Integration with libraries expecting specific memory layouts (column-major for Fortran/R, row-major for C/Python)
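To illustrate the numerical computing use case, the following is a minimal sketch of a matrix-vector product over a column-major flat array in plain Scala. It only stands in for what a BLAS routine such as gemv would compute and makes no call into any numerical library. `matVec` is a hypothetical helper, and the data is the column-major array for simple.csv converted to Double.

```scala
// Hypothetical helper: result = A * x, where A is stored column-major in `data`.
def matVec(data: Array[Double], rows: Int, cols: Int, x: Array[Double]): Array[Double] =
  require(data.length == rows * cols, "data must hold rows * cols elements")
  require(x.length == cols, "x must have one entry per column")
  val result = new Array[Double](rows)
  // Walk the flat array column by column, so memory is read sequentially.
  for j <- 0 until cols do
    val offset = j * rows
    for i <- 0 until rows do
      result(i) += data(offset + i) * x(j)
  result

val matA = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0, 7.0, 8.0, 9.0)
// Multiplying by a vector of ones gives the sum of each row.
val rowSums = matVec(matA, rows = 3, cols = 3, x = Array(1.0, 1.0, 1.0))
```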
Converting row-oriented data to columns
Alternatively, you can read data as rows (the default) and then convert to columnar format:
//> using dep io.github.quafadas::vecxt:0.0.31
import io.github.quafadas.table.*
import vecxt.all.cumsum
import vecxt.BoundsCheck.DoBoundsCheck.yes
type ColSubset = ("Name", "Sex", "Age")
val data = CSV.resource("titanic.csv", TypeInferrer.FromAllRows)
.take(3)
.columns[ColSubset]
// data: Iterator[NamedTuple[ColSubset, *:[String, *:[String, *:[Option[Double], EmptyTuple]]]]] = empty iterator
val colData = LazyList.from(data).toColumnOrientedAs[Array]
// colData: NamedTuple[Tuple3["Name", "Sex", "Age"], *:[Array[String], *:[Array[String], *:[Array[Option[Double]], EmptyTuple]]]] = (
// Array(
// "Braund, Mr. Owen Harris",
// "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
// "Heikkinen, Miss. Laina"
// ),
// Array("male", "female", "female"),
// Array(Some(22.0), Some(38.0), Some(26.0))
// )
colData.Age
// res0: Array[Option[Double]] = Array(Some(22.0), Some(38.0), Some(26.0))
colData.Age.map(_.get).cumsum
// res1: Array[Double] = Array(22.0, 60.0, 86.0)
The direct columnar reading (first approach) is recommended when you know upfront that you need columnar access, as it's more efficient.
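Conceptually, the row-to-column conversion turns a sequence of rows into one array per column. The following is a minimal sketch of that idea in plain Scala, using ordinary tuples and the standard library's `unzip3` rather than scautable's `toColumnOrientedAs`. The rows are the three shown above, `scanLeft` stands in for vecxt's `cumsum`, and treating a missing age as 0.0 is an assumption made for this example only.

```scala
// The three rows from the example above, as plain tuples.
val sampleRows: List[(String, String, Option[Double])] = List(
  ("Braund, Mr. Owen Harris", "male", Some(22.0)),
  ("Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "female", Some(38.0)),
  ("Heikkinen, Miss. Laina", "female", Some(26.0))
)

// Split the rows into one collection per column.
val columns = sampleRows.unzip3
val ageColumn: Array[Option[Double]] = columns._3.toArray

// Running total. Unlike `_.get`, a missing age contributes 0.0 instead of throwing.
val runningAge: Array[Double] =
  ageColumn.map(_.getOrElse(0.0)).scanLeft(0.0)(_ + _).tail
```

Calling `.get` on an Option column, as in the example above, throws on the first None, so some explicit policy for missing values is usually needed on real data.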