Streaming into Mnesia


CSV -> Mnesia

Implemented a stream directly from the source CSV file into the Mnesia data store.

Was surprised to see that the Mnesia-stored data is smaller than the original CSV data:

  • CSV: 835 MB
  • Mnesia: 757 MB

My understanding is that, since I’m using disc_copies, the whole table is held in RAM and also persisted to disk (writes go to a log that is periodically dumped to the table files). I may end up switching to disc_only_copies because:

  • speed is not of the essence
  • RAM is likely to be more restricted than SSD disk in my intended targets
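
A minimal sketch of what that switch could look like, using Mnesia's own change_table_copy_type call from Elixir. The CacheStorage module and to_disc_only function are hypothetical names, and the table name is assumed to be the CacheNvdCve module atom; the real name depends on how the Amnesia database is defined.

```elixir
defmodule CacheStorage do
  # Hypothetical helper: converts the local copy of `table` to disc_only_copies.
  # Assumes Mnesia is running and this node already holds a copy of the table.
  def to_disc_only(table \\ CacheNvdCve) do
    case :mnesia.change_table_copy_type(table, node(), :disc_only_copies) do
      {:atomic, :ok} -> :ok
      {:aborted, reason} -> {:error, reason}
    end
  end
end
```

Calling CacheStorage.to_disc_only() once from an IEx session should be enough, since the copy type is stored in the schema.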

If I switch to disc_only_copies I’ll need to think about table fragmentation, since disc_only_copies tables are backed by Dets files and each one has to stay below 2 GB long term.
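
As a rough sketch of the fragmentation side, Mnesia can split an existing table into fragments with change_table_frag. The CacheFrag module and split function are hypothetical names, and a single-node setup is assumed (the node pool is just node()).

```elixir
defmodule CacheFrag do
  # Hypothetical helper: activates fragmentation on `table` and then adds
  # `extra` fragments, all placed on the local node (single-node assumption).
  def split(table, extra) when is_integer(extra) and extra > 0 do
    steps =
      [{:activate, [node_pool: [node()]]}] ++
        List.duplicate({:add_frag, [node()]}, extra)

    # Stop at the first step Mnesia rejects and report why.
    Enum.reduce_while(steps, :ok, fn step, :ok ->
      case :mnesia.change_table_frag(table, step) do
        {:atomic, :ok} -> {:cont, :ok}
        {:aborted, reason} -> {:halt, {:error, reason}}
      end
    end)
  end
end
```

One caveat: records in a fragmented table are only routed to the right fragment when access goes through the mnesia_frag module, for example :mnesia.activity(:transaction, fun, [], :mnesia_frag). Whether the Amnesia wrappers do that is something I would need to check.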

I also have some database tuning to do to remove the “Overload” messages during the initial data load.
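
The usual knobs for this are Mnesia's dump_log_write_threshold and dc_dump_limit application parameters. A sketch of how they might be set in config/config.exs follows; the numbers are illustrative guesses, not measured recommendations.

```elixir
import Config

# Illustrative values only; tune against the real load.
config :mnesia,
  # Number of log writes allowed before the transaction log is dumped
  dump_log_write_threshold: 50_000,
  # Controls how often disc_copies tables are dumped from memory
  dc_dump_limit: 40
```

These have to be in place before Mnesia starts, so a restart is needed after changing them.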

Code Sample

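  # `CSV` is assumed to be an alias for a NimbleCSV parser, e.g.
  # `alias NimbleCSV.RFC4180, as: CSV`.
  def load(filename \\ "./nvd-data.csv") do
    filename
    |> File.stream!()
    # skip_headers: false treats the first line as data rather than dropping it
    |> CSV.parse_stream(skip_headers: false)
    # Write each row as it is parsed, so the file is never held in memory at once
    |> Stream.each(&_loader/1)
    |> Stream.run()
  end

  # Writes one [cve_id, cve] row to the CacheNvdCve table in its own transaction.
  defp _loader([cve_id, cve]) do
    Amnesia.transaction do
      %CacheNvdCve{cve_id: cve_id, cve: cve}
      |> CacheNvdCve.write()
    end
  end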