Implemented a stream directly from the source CSV file into the Mnesia data store.
Was surprised to see that the Mnesia-stored data is smaller than the original CSV data:
- CSV: 835 MB
- Mnesia: 757 MB
My understanding is that, since I'm using disc_copies,
the entire database is held in RAM and periodically flushed to disk.
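For context, the storage type is declared when the table is created. A minimal sketch using the raw :mnesia API rather than Amnesia's wrappers, with placeholder table and attribute names rather than my actual schema:

# Sketch only: :cache_nvd_cve and its attributes are placeholders.
# disc_copies keeps a full replica in RAM plus a copy on disk.
:mnesia.create_table(:cache_nvd_cve,
  attributes: [:cve_id, :cve],
  disc_copies: [node()]
)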
I may end up switching to disc_only_copies, as:
- speed is not of the essence
- RAM is likely to be more constrained than SSD storage on my intended targets
If I switch to disc_only_copies
I'll need to think about table fragmentation: disc_only_copies tables
are backed by dets files, which are capped at 2GB each, so I'll want
the data split across fragments long term (see the sketch below).
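A sketch of both moves with the raw :mnesia API, again assuming the underlying table is named :cache_nvd_cve (a placeholder, not necessarily the real name):

# Switch an existing table's storage type in place on this node.
:mnesia.change_table_copy_type(:cache_nvd_cve, node(), :disc_only_copies)

# Or create the table fragmented from the start, so each fragment is
# its own dets file and stays well under the 2GB cap.
:mnesia.create_table(:cache_nvd_cve,
  attributes: [:cve_id, :cve],
  frag_properties: [
    n_fragments: 8,
    n_disc_only_copies: 1,
    node_pool: [node()]
  ]
)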
I also have some database tuning to do to remove the “Overload” messages during the initial data load.
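Those warnings are Mnesia reporting that the transaction log dump can't keep up with the write rate, so the knobs I expect to reach for are the dump_log thresholds, set before the :mnesia application starts. A sketch; the parameter names are Mnesia's own but the values are untested guesses for my workload:

# Must run before :mnesia starts; values are guesses, not tuned numbers.
Application.put_env(:mnesia, :dump_log_write_threshold, 50_000)
Application.put_env(:mnesia, :dump_log_time_threshold, :timer.minutes(5))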
Code Sample
# Assumes `CSV` is an alias for a NimbleCSV parser, e.g.
# `alias NimbleCSV.RFC4180, as: CSV` (parse_stream/2 is NimbleCSV's API).
def load(filename \\ "./nvd-data.csv") do
  filename
  |> File.stream!()
  |> CSV.parse_stream(skip_headers: false)
  # Stream.transform/3 treats _loader/1's return value as the elements
  # to emit, keeping the pipeline lazy; Stream.run/1 then forces it
  # purely for the side effects.
  |> Stream.transform(nil, fn row, acc -> {_loader(row), acc} end)
  |> Stream.run()
end

# Each row is a two-element list; every row gets its own transaction,
# which is simple but chatty (hence the overload warnings above).
defp _loader([cve_id, cve]) do
  Amnesia.transaction do
    %CacheNvdCve{cve_id: cve_id, cve: cve}
    |> CacheNvdCve.write()
  end

  [cve_id, cve]
end
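One transaction per row is almost certainly part of why the initial load trips the overload warnings, so a variant I may try is batching the writes. A sketch, not tested code:

def load_batched(filename \\ "./nvd-data.csv") do
  filename
  |> File.stream!()
  |> CSV.parse_stream(skip_headers: false)
  |> Stream.chunk_every(1_000)
  |> Stream.each(fn rows ->
    # One transaction per 1000-row chunk instead of one per row.
    Amnesia.transaction do
      Enum.each(rows, fn [cve_id, cve] ->
        %CacheNvdCve{cve_id: cve_id, cve: cve} |> CacheNvdCve.write()
      end)
    end
  end)
  |> Stream.run()
end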