Column Orient

"Vector"-style computation is beyond the scope of scautable itself. However, a row-oriented representation of the data is not always the right construct, particularly for analysis tasks.

To note again: statistics is beyond the scope of scautable.

You are encouraged to bring in a separate mathematics / statistics library for that purpose (entirely at your own discretion and risk).

Reading CSV directly as columns

Scautable can read CSV data directly into a columnar format using the ReadAs.Columns option. This is more efficient than reading rows and then converting, as it only requires a single pass through the data.

This will fire up a REPL with the necessary imports:

scala-cli repl --dep io.github.quafadas::scautable::0.0.36 --dep io.github.quafadas::vecxt:0.0.36 --java-opt "--add-modules=jdk.incubator.vector" --scalac-option -Xmax-inlines --scalac-option 2048 --java-opt -Xss4m --repl-init-script 'import io.github.quafadas.table.{*, given}; import vecxt.all.{*, given}'
import io.github.quafadas.table.*

// Read directly as columns - returns a NamedTuple of Arrays
// lazy val - stops the REPL from printing the whole value when it is defined
lazy val simpleCols = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.Columns))

// Access columns directly as typed arrays
val col1: Array[Int] = simpleCols.col1
// col1: Array[Int] = Array(1, 3, 5)
val col2: Array[Int] = simpleCols.col2
// col2: Array[Int] = Array(2, 4, 6)
val col3: Array[Int] = simpleCols.col3
// col3: Array[Int] = Array(7, 8, 9)

// With vecxt, we get optimised vector operations too.
// simpleCols.col1 + simpleCols.col2

// Works with type inference
val titanicCols = CSV.resource("titanic.csv", CsvOpts(TypeInferrer.FromAllRows, ReadAs.Columns))
// titanicCols: NamedTuple[Tuple12["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"], *:[Array[Int], *:[Array[Boolean], *:[Array[Int], *:[Array[String], *:[Array[String], *:[Array[Option[Double]], *:[Array[Int], *:[Array[Int], *:[Array[String], *:[Array[Double], *:[Array[Option[String]], *:[Array[Option[String]], EmptyTuple]]]]]]]]]]]]] = Tuple12(
//   _1 = Array(
//     1,
//     2,
//     3,
//     4,
//     5,
//     6,
//     7,
//     8,
//     9,
//     10,
//     11,
//     12,
//     13,
//     14,
//     15,
//     16,
//     17,
//     18,
//     19,
//     20,
//     21,
//     22,
//     23,
//     24,
//     25,
//     26,
//     27,
//     28,
//     29,
//     30,
//     31,
//     32,
//     33,
//     34,
//     35,
//     36,
//     37,
//     38,
//     39,
//     40,
//     41,
//     42,
//     43,
//     44,
//     45,
//     46,
//     47,
// ...
val ages: Array[Option[Double]] = titanicCols.Age
// ages: Array[Option[Double]] = Array(
//   Some(22.0),
//   Some(38.0),
//   Some(26.0),
//   Some(35.0),
//   Some(35.0),
//   None,
//   Some(54.0),
//   Some(2.0),
//   Some(27.0),
//   Some(14.0),
//   Some(4.0),
//   Some(58.0),
//   Some(20.0),
//   Some(39.0),
//   Some(14.0),
//   Some(55.0),
//   Some(2.0),
//   None,
//   Some(31.0),
//   None,
//   Some(35.0),
//   Some(34.0),
//   Some(15.0),
//   Some(28.0),
//   Some(8.0),
//   Some(38.0),
//   None,
//   Some(19.0),
//   None,
//   None,
//   Some(40.0),
//   None,
//   None,
//   Some(66.0),
//   Some(28.0),
//   Some(42.0),
//   None,
//   Some(21.0),
//   Some(18.0),
//   Some(14.0),
//   Some(40.0),
//   Some(27.0),
//   None,
//   Some(3.0),
//   Some(19.0),
//   None,
//   None,
//   None,
// ...
val survived: Array[Boolean] = titanicCols.Survived
// survived: Array[Boolean] = Array(
//   false,
//   true,
//   true,
//   true,
//   false,
//   false,
//   false,
//   false,
//   true,
//   true,
//   true,
//   true,
//   false,
//   false,
//   false,
//   true,
//   false,
//   true,
//   false,
//   true,
//   false,
//   true,
//   true,
//   true,
//   false,
//   true,
//   false,
//   false,
//   true,
//   false,
//   false,
//   true,
//   true,
//   false,
//   false,
//   false,
//   true,
//   false,
//   false,
//   true,
//   false,
//   false,
//   false,
//   true,
//   true,
//   false,
//   false,
//   true,
// ...
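
Columns inferred as Array[Option[...]], such as Age above, usually need their missing values dealt with before any arithmetic. The following is a minimal sketch using only the Scala standard library; sampleAges and the other names are hypothetical stand-ins rather than part of scautable, and the values are made up for illustration.

```scala
// Hypothetical sample standing in for an optional column such as titanicCols.Age
val sampleAges: Array[Option[Double]] = Array(Some(22.0), Some(38.0), None, Some(26.0))

// Keep only the values that are present
val knownAges: Array[Double] = sampleAges.collect { case Some(a) => a }

// Number of missing entries, worth checking before aggregating
val missingCount: Int = sampleAges.count(_.isEmpty)

// Mean over the present values; None if every entry is missing
val meanAge: Option[Double] =
  if (knownAges.isEmpty) None else Some(knownAges.sum / knownAges.length)
```

Whether to drop, impute, or otherwise treat missing values is an analysis decision and sits outside scautable's scope.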

Reading CSV as Dense Arrays

For interoperability with numerical libraries (e.g., BLAS, LAPACK) or when you need a single contiguous memory layout, scautable provides dense array reading modes. These modes read all CSV data into a single flat array with stride information for accessing rows and columns.

Column-Major Dense Arrays

Column-major layout stores data column-by-column in memory, which is the standard layout for Fortran and mathematical libraries like BLAS/LAPACK.

import io.github.quafadas.table.*

// Read as column-major dense array
val colMajor = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[Int]()))
// colMajor: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Int], Int, Int, Int, Int]] = (
//   Array(1, 3, 5, 2, 4, 6, 7, 8, 9),
//   3,
//   1,
//   3,
//   3
// )

// Access the fields
val cmData: Array[Int] = colMajor.data        // The flat array containing all data
// cmData: Array[Int] = Array(1, 3, 5, 2, 4, 6, 7, 8, 9)
val cmRowStride: Int = colMajor.rowStride     // Equals the number of rows: offset between the starts of consecutive columns
// cmRowStride: Int = 3
val cmColStride: Int = colMajor.colStride     // 1: offset between adjacent elements within a column
// cmColStride: Int = 1
val cmRows: Int = colMajor.rows               // Number of rows
// cmRows: Int = 3
val cmCols: Int = colMajor.cols               // Number of columns
// cmCols: Int = 3

// Access element at row i, col j
def getElementColMajor(i: Int, j: Int): Int =
  cmData(j * cmRowStride + i * cmColStride)

// Example: get element at row 1, col 1
val cmElement = getElementColMajor(1, 1)
// cmElement: Int = 4

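In column-major layout, the elements of each column are contiguous in memory, and element (i, j) sits at index j * rowStride + i * colStride.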
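
To make the layout concrete, here is a small sketch in plain Scala (no scautable calls) that pulls a whole column and a whole row out of a flat column-major array. The array holds the same values as the colMajor example; flatColMajor, nRows, nCols, columnOf and rowOf are hypothetical names introduced for illustration.

```scala
// Same values and shape as the column-major example: 3 rows, 3 columns
val flatColMajor: Array[Int] = Array(1, 3, 5, 2, 4, 6, 7, 8, 9)
val nRows = 3
val nCols = 3

// Column j is contiguous: it occupies indices [j * nRows, (j + 1) * nRows)
def columnOf(j: Int): Array[Int] =
  flatColMajor.slice(j * nRows, (j + 1) * nRows)

// Row i is strided: one element every nRows positions, starting at index i
def rowOf(i: Int): Array[Int] =
  Array.tabulate(nCols)(j => flatColMajor(j * nRows + i))

val secondColumn = columnOf(1) // Array(2, 4, 6)
val firstRow = rowOf(0)        // Array(1, 2, 7)
```

Column access is a cheap contiguous slice here, while row access has to hop through the array, which is why this layout suits column-wise computation.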

If your CSV is a bare matrix, i.e. it has no header row, combine the HeaderOptions.Auto option with a dense array reading mode. The entire matrix is then read as data, including the first row.

import io.github.quafadas.table.{*, given}
val matrixData = CSV.resource("matrix.csv", CsvOpts(headerOptions = HeaderOptions.Auto, readAs = ReadAs.ArrayDenseColMajor[Double]()))
// matrixData: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Double], Int, Int, Int, Int]] = (
//   Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0, 7.0, 8.0, 9.0),
//   3,
//   1,
//   3,
//   3
// )

Row-Major Dense Arrays

Row-major layout stores data row-by-row in memory, which is the standard layout for C and most programming languages.

import io.github.quafadas.table.*

// Read as row-major dense array
val rowMajor = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.ArrayDenseRowMajor[Double]()))
// rowMajor: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Double], Int, Int, Int, Int]] = (
//   Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0),
//   1,
//   3,
//   3,
//   3
// )

// Access the fields
val rmData: Array[Double] = rowMajor.data     // The flat array containing all data
// rmData: Array[Double] = Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0)
val rmRowStride: Int = rowMajor.rowStride     // 1: offset between adjacent elements within a row
// rmRowStride: Int = 1
val rmColStride: Int = rowMajor.colStride     // Equals the number of columns: offset between the starts of consecutive rows
// rmColStride: Int = 3
val rmRows: Int = rowMajor.rows               // Number of rows
// rmRows: Int = 3
val rmCols: Int = rowMajor.cols               // Number of columns
// rmCols: Int = 3

// Access element at row i, col j
def getElementRowMajor(i: Int, j: Int): Double =
  rmData(i * rmColStride + j * rmRowStride)

// Example: get element at row 0, col 2
val rmElement = getElementRowMajor(0, 2)
// rmElement: Double = 7.0

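In row-major layout, the elements of each row are contiguous in memory, and element (i, j) sits at index i * colStride + j * rowStride.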
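
The two layouts hold the same elements in a different order. As an illustration, the sketch below re-lays a row-major array into column-major order using plain Scala; rowMajorToColMajor is a hypothetical helper written for this page, not a scautable or vecxt function.

```scala
// Same values and shape as the row-major example: 3 rows, 3 columns
val flatRowMajor: Array[Double] = Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0)

// Copy a rows x cols row-major array into a new column-major array
def rowMajorToColMajor(data: Array[Double], rows: Int, cols: Int): Array[Double] = {
  require(data.length == rows * cols, "data length must equal rows * cols")
  val out = new Array[Double](rows * cols)
  var i = 0
  while (i < rows) {
    var j = 0
    while (j < cols) {
      // Element (i, j): row-major index i * cols + j, column-major index j * rows + i
      out(j * rows + i) = data(i * cols + j)
      j += 1
    }
    i += 1
  }
  out
}

val asColMajor = rowMajorToColMajor(flatRowMajor, 3, 3)
// Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0, 7.0, 8.0, 9.0)
```

In practice it is simpler to pick the matching ReadAs mode when reading, so that no copy is needed afterwards.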

Type Safety

The dense array modes require a type parameter specifying the array element type:

// Strongly typed as Array[Int]
val intArray = CSV.resource("data.csv", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[Int]()))

// Strongly typed as Array[Double]
val doubleArray = CSV.resource("data.csv", CsvOpts(readAs = ReadAs.ArrayDenseRowMajor[Double]()))

// Strongly typed as Array[String]
val stringArray = CSV.fromString("a,b\nfoo,bar", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[String]()))

The type conversion is handled automatically using scautable's ColumnDecoder infrastructure, which supports Int, Long, Double, Boolean, String, and Option types.

Use Cases

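Dense arrays are particularly useful for interoperability with numerical libraries such as BLAS and LAPACK, and whenever a single contiguous memory layout is required.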

Converting row-oriented data to columns

Alternatively, you can read data as rows (the default) and then convert to columnar format:

//> using dep io.github.quafadas::vecxt:0.0.31

import io.github.quafadas.table.*
import vecxt.all.cumsum
import vecxt.BoundsCheck.DoBoundsCheck.yes

type ColSubset = ("Name", "Sex", "Age")

val data = CSV.resource("titanic.csv", TypeInferrer.FromAllRows)
            .take(3)
            .columns[ColSubset]
// data: Iterator[NamedTuple[ColSubset, *:[String, *:[String, *:[Option[Double], EmptyTuple]]]]] = empty iterator

val colData = LazyList.from(data).toColumnOrientedAs[Array]
// colData: NamedTuple[Tuple3["Name", "Sex", "Age"], *:[Array[String], *:[Array[String], *:[Array[Option[Double]], EmptyTuple]]]] = (
//   Array(
//     "Braund, Mr. Owen Harris",
//     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
//     "Heikkinen, Miss. Laina"
//   ),
//   Array("male", "female", "female"),
//   Array(Some(22.0), Some(38.0), Some(26.0))
// )

colData.Age
// res0: Array[Option[Double]] = Array(Some(22.0), Some(38.0), Some(26.0))

colData.Age.map(_.get).cumsum
// res1: Array[Double] = Array(22.0, 60.0, 86.0)

The direct columnar reading (first approach) is recommended when you know upfront that you need columnar access, as it's more efficient.
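
To see why the conversion route costs an extra step, it can help to look at what turning rows into columns amounts to. The sketch below does it with plain tuples and the standard library's unzip3. It is only an illustration under that assumption, not how scautable implements toColumnOrientedAs, and sampleRows and the other names are hypothetical.

```scala
// Sample rows mirroring the (Name, Sex, Age) subset shown above
val sampleRows: List[(String, String, Option[Double])] = List(
  ("Braund, Mr. Owen Harris", "male", Some(22.0)),
  ("Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "female", Some(38.0)),
  ("Heikkinen, Miss. Laina", "female", Some(26.0))
)

// The rows are already materialised; unzip3 walks them again to build one collection per column
val columns = sampleRows.unzip3
val nameArray: Array[String] = columns._1.toArray
val sexArray: Array[String] = columns._2.toArray
val ageArray: Array[Option[Double]] = columns._3.toArray

// Running total of the known ages, using only the standard library
val runningAge: Array[Double] =
  ageArray.collect { case Some(a) => a }.scanLeft(0.0)(_ + _).tail
// Array(22.0, 60.0, 86.0)
```

The rows are built first and then traversed again to produce the columns, whereas reading with ReadAs.Columns fills the column arrays during the single pass over the CSV.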