Column Orient

"Vector" style computation is beyond the scope of scautable itself. However, a row-oriented representation of the data is not always the right construct, particularly for analysis tasks.

To note again: statistics is beyond the scope of scautable.

You are encouraged to bring in a separate mathematics or statistics library for that (entirely at your own discretion and risk).

Reading CSV directly as columns

Scautable can read CSV data directly into a columnar format using the ReadAs.Columns option. This is more efficient than reading rows and then converting, as it only requires a single pass through the data.

This will fire up a REPL with the necessary imports:

scala-cli repl --dep io.github.quafadas::scautable::0.0.35 --dep io.github.quafadas::vecxt:0.0.35 --java-opt "--add-modules=jdk.incubator.vector" --scalac-option -Xmax-inlines --scalac-option 2048 --java-opt -Xss4m --repl-init-script 'import io.github.quafadas.table.{*, given}; import vecxt.all.{*, given}'
import io.github.quafadas.table.*

// Read directly as columns - returns NamedTuple of Arrays
// lazy: stops the REPL from printing the whole value when it is defined
lazy val simpleCols = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.Columns))

// Access columns directly as typed arrays
val col1: Array[Int] = simpleCols.col1
// col1: Array[Int] = Array(1, 3, 5)
val col2: Array[Int] = simpleCols.col2
// col2: Array[Int] = Array(2, 4, 6)
val col3: Array[Int] = simpleCols.col3
// col3: Array[Int] = Array(7, 8, 9)

// With vecxt, we get optimised vector operations too.
// simpleCols.col1 + simpleCols.col2

// Works with type inference
val titanicCols = CSV.resource("titanic.csv", CsvOpts(TypeInferrer.FromAllRows, ReadAs.Columns))
// titanicCols: NamedTuple[Tuple12["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"], *:[Array[Int], *:[Array[Boolean], *:[Array[Int], *:[Array[String], *:[Array[String], *:[Array[Option[Double]], *:[Array[Int], *:[Array[Int], *:[Array[String], *:[Array[Double], *:[Array[Option[String]], *:[Array[Option[String]], EmptyTuple]]]]]]]]]]]]] = Tuple12(
//   _1 = Array(
//     1,
//     2,
//     3,
//     4,
//     5,
//     6,
//     7,
//     8,
//     9,
//     10,
//     11,
//     12,
//     13,
//     14,
//     15,
//     16,
//     17,
//     18,
//     19,
//     20,
//     21,
//     22,
//     23,
//     24,
//     25,
//     26,
//     27,
//     28,
//     29,
//     30,
//     31,
//     32,
//     33,
//     34,
//     35,
//     36,
//     37,
//     38,
//     39,
//     40,
//     41,
//     42,
//     43,
//     44,
//     45,
//     46,
//     47,
// ...
val ages: Array[Option[Double]] = titanicCols.Age
// ages: Array[Option[Double]] = Array(
//   Some(22.0),
//   Some(38.0),
//   Some(26.0),
//   Some(35.0),
//   Some(35.0),
//   None,
//   Some(54.0),
//   Some(2.0),
//   Some(27.0),
//   Some(14.0),
//   Some(4.0),
//   Some(58.0),
//   Some(20.0),
//   Some(39.0),
//   Some(14.0),
//   Some(55.0),
//   Some(2.0),
//   None,
//   Some(31.0),
//   None,
//   Some(35.0),
//   Some(34.0),
//   Some(15.0),
//   Some(28.0),
//   Some(8.0),
//   Some(38.0),
//   None,
//   Some(19.0),
//   None,
//   None,
//   Some(40.0),
//   None,
//   None,
//   Some(66.0),
//   Some(28.0),
//   Some(42.0),
//   None,
//   Some(21.0),
//   Some(18.0),
//   Some(14.0),
//   Some(40.0),
//   Some(27.0),
//   None,
//   Some(3.0),
//   Some(19.0),
//   None,
//   None,
//   None,
// ...
val survived: Array[Boolean] = titanicCols.Survived
// survived: Array[Boolean] = Array(
//   false,
//   true,
//   true,
//   true,
//   false,
//   false,
//   false,
//   false,
//   true,
//   true,
//   true,
//   true,
//   false,
//   false,
//   false,
//   true,
//   false,
//   true,
//   false,
//   true,
//   false,
//   true,
//   true,
//   true,
//   false,
//   true,
//   false,
//   false,
//   true,
//   false,
//   false,
//   true,
//   true,
//   false,
//   false,
//   false,
//   true,
//   false,
//   false,
//   true,
//   false,
//   false,
//   false,
//   true,
//   true,
//   false,
//   false,
//   true,
// ...

Converting row-oriented data to columns

Alternatively, you can read data as rows (the default) and then convert to columnar format:

//> using dep io.github.quafadas::vecxt:0.0.31

import io.github.quafadas.table.*
import vecxt.all.cumsum
import vecxt.BoundsCheck.DoBoundsCheck.yes

type ColSubset = ("Name", "Sex", "Age")

val data = CSV.resource("titanic.csv", TypeInferrer.FromAllRows)
            .take(3)
            .columns[ColSubset]
// data: Iterator[NamedTuple[ColSubset, *:[String, *:[String, *:[Option[Double], EmptyTuple]]]]] = empty iterator

val colData = LazyList.from(data).toColumnOrientedAs[Array]
// colData: NamedTuple[Tuple3["Name", "Sex", "Age"], *:[Array[String], *:[Array[String], *:[Array[Option[Double]], EmptyTuple]]]] = (
//   Array(
//     "Braund, Mr. Owen Harris",
//     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
//     "Heikkinen, Miss. Laina"
//   ),
//   Array("male", "female", "female"),
//   Array(Some(22.0), Some(38.0), Some(26.0))
// )

colData.Age
// res0: Array[Option[Double]] = Array(Some(22.0), Some(38.0), Some(26.0))

colData.Age.map(_.get).cumsum
// res1: Array[Double] = Array(22.0, 60.0, 86.0)

The direct columnar reading (first approach) is recommended when you know upfront that you need columnar access, as it's more efficient.
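To make the cost of the second approach concrete, here is a minimal sketch in plain Scala (standard library only, no scautable calls) of what a row-to-column conversion involves. The sample values are the three rows shown above; the `Passenger` alias and the `toColumns` helper are hypothetical names for illustration and are not scautable's `toColumnOrientedAs`.

```scala
// Plain-Scala illustration only: not scautable's implementation.
type Passenger = (String, String, Option[Double]) // (Name, Sex, Age)

val passengerRows: List[Passenger] = List(
  ("Braund, Mr. Owen Harris", "male", Some(22.0)),
  ("Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "female", Some(38.0)),
  ("Heikkinen, Miss. Laina", "female", Some(26.0))
)

// Hypothetical helper: the rows have to be fully materialised first,
// and are then walked again to build one array per column.
def toColumns(rs: List[Passenger]): (Array[String], Array[String], Array[Option[Double]]) =
  val (names, sexes, ages) = rs.unzip3
  (names.toArray, sexes.toArray, ages.toArray)

val passengerCols = toColumns(passengerRows)

// Running total of the known ages, skipping missing values
// instead of calling .get on every Option.
val cumulativeAges: Array[Double] =
  passengerCols._3.collect { case Some(age) => age }.scanLeft(0.0)(_ + _).tail
```

Collecting only the defined ages avoids the exception that `_.get` would throw on a `None`, which matters once more than the first three rows are read.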

Reading CSV as Dense Arrays

For interoperability with numerical libraries (e.g., BLAS, LAPACK) or when you need a single contiguous memory layout, scautable provides dense array reading modes. These modes read all CSV data into a single flat array with stride information for accessing rows and columns.

Column-Major Dense Arrays

Column-major layout stores data column-by-column in memory, which is the standard layout for Fortran and mathematical libraries like BLAS/LAPACK.

import io.github.quafadas.table.*

// Read as column-major dense array
val colMajor = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[Int]()))
// colMajor: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Int], Int, Int, Int, Int]] = (
//   Array(1, 3, 5, 2, 4, 6, 7, 8, 9),
//   3,
//   1,
//   3,
//   3
// )

// Access the fields
val cmData: Array[Int] = colMajor.data        // The flat array containing all data
// cmData: Array[Int] = Array(1, 3, 5, 2, 4, 6, 7, 8, 9)
val cmRowStride: Int = colMajor.rowStride     // Step between adjacent elements of a row = numRows
// cmRowStride: Int = 3
val cmColStride: Int = colMajor.colStride     // Step between adjacent elements of a column = 1
// cmColStride: Int = 1
val cmRows: Int = colMajor.rows               // Number of rows
// cmRows: Int = 3
val cmCols: Int = colMajor.cols               // Number of columns
// cmCols: Int = 3

// Access element at row i, col j
def getElementColMajor(i: Int, j: Int): Int = 
  cmData(j * cmRowStride + i * cmColStride)

// Example: get element at row 1, col 1
val cmElement = getElementColMajor(1, 1)
// cmElement: Int = 4

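In column-major layout, the elements of each column sit next to each other in the flat array. In the example above, rowStride equals the number of rows (3), colStride is 1, and the element at row i, column j is at index j * rowStride + i * colStride.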
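The same index arithmetic can be used to pull a whole row or column out of the flat array. The sketch below is plain Scala using the values printed above; `flatColMajor`, `elementAt`, `columnAt` and `rowAt` are hypothetical names for illustration, not part of scautable.

```scala
// Plain-Scala illustration using the values printed above; no scautable calls.
val flatColMajor: Array[Int] = Array(1, 3, 5, 2, 4, 6, 7, 8, 9)
val nRows = 3
val nCols = 3
val rowStep = 3 // value of rowStride above
val colStep = 1 // value of colStride above

// Same index arithmetic as getElementColMajor.
def elementAt(i: Int, j: Int): Int = flatColMajor(j * rowStep + i * colStep)

// Hypothetical helpers: gather one column or one row from the flat array.
def columnAt(j: Int): Array[Int] = Array.tabulate(nRows)(i => elementAt(i, j))
def rowAt(i: Int): Array[Int] = Array.tabulate(nCols)(j => elementAt(i, j))
```

Here `columnAt(1)` reads three neighbouring slots (indices 3, 4, 5), while `rowAt(0)` jumps through the array in steps of three (indices 0, 3, 6), which is why column-wise work is the cheaper access pattern for this layout.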

Row-Major Dense Arrays

Row-major layout stores data row-by-row in memory, which is the standard layout for C and most programming languages.

import io.github.quafadas.table.*

// Read as row-major dense array
val rowMajor = CSV.resource("simple.csv", CsvOpts(readAs = ReadAs.ArrayDenseRowMajor[Double]()))
// rowMajor: NamedTuple[Tuple5["data", "rowStride", "colStride", "rows", "cols"], Tuple5[Array[Double], Int, Int, Int, Int]] = (
//   Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0),
//   1,
//   3,
//   3,
//   3
// )

// Access the fields
val rmData: Array[Double] = rowMajor.data     // The flat array containing all data
// rmData: Array[Double] = Array(1.0, 2.0, 7.0, 3.0, 4.0, 8.0, 5.0, 6.0, 9.0)
val rmRowStride: Int = rowMajor.rowStride     // Step between adjacent elements of a row = 1
// rmRowStride: Int = 1
val rmColStride: Int = rowMajor.colStride     // Step between adjacent elements of a column = numCols
// rmColStride: Int = 3
val rmRows: Int = rowMajor.rows               // Number of rows
// rmRows: Int = 3
val rmCols: Int = rowMajor.cols               // Number of columns
// rmCols: Int = 3

// Access element at row i, col j
def getElementRowMajor(i: Int, j: Int): Double = 
  rmData(i * rmColStride + j * rmRowStride)

// Example: get element at row 0, col 2
val rmElement = getElementRowMajor(0, 2)
// rmElement: Double = 7.0

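In row-major layout, the elements of each row sit next to each other in the flat array. In the example above, rowStride is 1, colStride equals the number of columns (3), and the element at row i, column j is at index i * colStride + j * rowStride.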
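The two layouts hold the same values in a different order, so one can be rearranged into the other. The following is a minimal plain-Scala sketch of that rearrangement, using the column-major values of simple.csv shown earlier; `colMajorToRowMajor` is a hypothetical helper, not a scautable function.

```scala
// Plain-Scala illustration; the function name is hypothetical.
def colMajorToRowMajor(colMajorData: Array[Int], rows: Int, cols: Int): Array[Int] =
  val out = new Array[Int](rows * cols)
  for
    i <- 0 until rows
    j <- 0 until cols
  do
    // Row-major slot for (i, j) is i * cols + j;
    // column-major slot for (i, j) is j * rows + i.
    out(i * cols + j) = colMajorData(j * rows + i)
  out

val simpleAsRowMajor = colMajorToRowMajor(Array(1, 3, 5, 2, 4, 6, 7, 8, 9), 3, 3)
// simpleAsRowMajor contains 1, 2, 7, 3, 4, 8, 5, 6, 9
```

In practice it is simpler to ask for the layout you need when reading the CSV, as the two read modes above do, than to copy the data afterwards.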

Type Safety

The dense array modes require a type parameter specifying the array element type:

// Strongly typed as Array[Int]
val intArray = CSV.resource("data.csv", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[Int]()))

// Strongly typed as Array[Double]
val doubleArray = CSV.resource("data.csv", CsvOpts(readAs = ReadAs.ArrayDenseRowMajor[Double]()))

// Strongly typed as Array[String]
val stringArray = CSV.fromString("a,b\nfoo,bar", CsvOpts(readAs = ReadAs.ArrayDenseColMajor[String]()))

The type conversion is handled automatically using scautable's ColumnDecoder infrastructure, which supports Int, Long, Double, Boolean, String, and Option types.
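As a rough mental model of typeclass-driven cell decoding, the sketch below shows how a type parameter can select the parser applied to every cell. It is plain Scala 3 and makes no claim about how scautable is implemented: `CellDecoder` and `decodeAll` are hypothetical names invented for this illustration, and the real ColumnDecoder may differ in shape and behaviour.

```scala
import scala.reflect.ClassTag

// Hypothetical typeclass for illustration; NOT scautable's ColumnDecoder.
trait CellDecoder[T] {
  def decode(raw: String): Option[T]
}

object CellDecoder {
  given intCell: CellDecoder[Int] = new CellDecoder[Int] {
    def decode(raw: String): Option[Int] = raw.trim.toIntOption
  }
  given doubleCell: CellDecoder[Double] = new CellDecoder[Double] {
    def decode(raw: String): Option[Double] = raw.trim.toDoubleOption
  }
  given stringCell: CellDecoder[String] = new CellDecoder[String] {
    def decode(raw: String): Option[String] = Some(raw)
  }
  // An empty cell decodes to None; anything else defers to the inner decoder.
  given optionCell[T](using inner: CellDecoder[T]): CellDecoder[Option[T]] =
    new CellDecoder[Option[T]] {
      def decode(raw: String): Option[Option[T]] =
        if raw.trim.isEmpty then Some(None) else inner.decode(raw).map(Some(_))
    }
}

// Decode every cell of a flat array of raw strings, failing on the first bad cell.
def decodeAll[T](cells: Array[String])(using d: CellDecoder[T], ct: ClassTag[T]): Array[T] =
  cells.map { c =>
    d.decode(c).getOrElse(throw new IllegalArgumentException(s"Cannot decode '$c'"))
  }

val decodedInts = decodeAll[Int](Array("1", "3", "5"))
val decodedAges = decodeAll[Option[Double]](Array("22.0", "", "54"))
```

The point of the pattern is that the element type is fixed at the call site, so a cell that does not fit the requested type is reported rather than silently stored as a string.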

Use Cases

Dense arrays are particularly useful for: