Prof. Dr. Stefan Bosse
University of Siegen - Dept. Maschinenbau
University of Koblenz - Dept. Computer Science
PD Stefan Bosse - AFEML - Module B: Data Storage and Aggregation -
Representation of Data
Storage of Data
Access of data
Data Storage
In general, data and their values can be stored in/on:
But: The question is not where to store the data, the question is how to organize and access the data!
A file system is composed of:
A file system organizes data by a directory (folder) tree:
Besides the organizational structure, a file system can provide the data structures and the block-level organization of data used for storage on hardware devices.
Simple file system layout (linear file model). Source: https://www3.nd.edu/pbui/teaching/cse.30341.fa18/project06.html
Segments
Magic: The first field is always the MAGIC_NUMBER or 0xf0f03410. The format routine places this number into the very first bytes of the super-block as a sort of file system "signature". When the file system is mounted, the OS looks for this magic number. If it is correct, then the disk is assumed to contain a valid file system. If some other number is present, then the mount fails, perhaps because the disk is not formatted or contains some other kind of data.
Blocks: The second field is the total number of blocks, which should be the same as the number of blocks on the disk.
InodeBlocks: The third field is the number of blocks set aside for storing inodes. The format routine is responsible for choosing this value, which should always be 10% of the Blocks, rounding up.
Inodes: The fourth field is the total number of inodes in those inode blocks.
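The four super-block fields described above can be sketched in Python as follows. This is a minimal illustration, not the layout code of any particular file system: the field width (four little-endian 32-bit unsigned integers) and the value of INODES_PER_BLOCK are assumptions that the text does not state.

```python
import struct

MAGIC_NUMBER = 0xF0F03410          # signature value given in the text
SUPERBLOCK = struct.Struct("<4I")  # assumption: four little-endian uint32 fields
INODES_PER_BLOCK = 128             # assumption: not stated in the text


def format_superblock(blocks: int) -> bytes:
    """Build the super-block bytes for a disk with `blocks` blocks."""
    inode_blocks = (blocks + 9) // 10          # 10% of the blocks, rounded up
    inodes = inode_blocks * INODES_PER_BLOCK   # total inodes in those blocks
    return SUPERBLOCK.pack(MAGIC_NUMBER, blocks, inode_blocks, inodes)


def mount(raw: bytes) -> dict:
    """Check the magic number and return the remaining super-block fields."""
    magic, blocks, inode_blocks, inodes = SUPERBLOCK.unpack_from(raw)
    if magic != MAGIC_NUMBER:
        raise ValueError("mount failed: no valid file system signature")
    return {"blocks": blocks, "inode_blocks": inode_blocks, "inodes": inodes}


if __name__ == "__main__":
    print(mount(format_superblock(25)))
```

Mounting a buffer that was never formatted (for example, all zero bytes) raises an error, which mirrors the failed mount described above.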
I-nodes and block references. Source: https://www3.nd.edu/pbui/teaching/cse.30341.fa18/project06.html
Databases
SQL databases organize data in tables.
Data types:
Basic SQL Server architecture
Data pages storing tables (schematic diagram). Source: Murugesan et al., International Journal of Applied Engineering Research, 2015.
Tables
SQL table structure with rows and columns
Operations
Tables cannot be nested; the database table space is flat! However, dedicated tables can be used to reference other tables (like i-nodes, directories in file systems, or sections in HDF structures).
Metadata, arrays, and other auxiliary structures must be encoded as text and decoded again by the user, e.g., by using the JSON format!
Be aware of the memory hierarchy of the data layers, which affects read/write performance: data and DB cache, main memory, file system, storage device(s).
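Both points, flat tables that reference each other and auxiliary structures encoded as JSON text, can be sketched with Python's built-in sqlite3 and json modules. The table names, column names, and values below are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

# Two flat tables; `sample` references `experiment` by its id.
con.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, meta TEXT)")
con.execute(
    "CREATE TABLE sample ("
    " id INTEGER PRIMARY KEY,"
    " experiment_id INTEGER REFERENCES experiment(id),"
    " signal TEXT)"
)

# Metadata and arrays are encoded to JSON text before they are stored.
meta = {"operator": "A", "rate_hz": 1000}
con.execute("INSERT INTO experiment VALUES (1, 'run-1', ?)", (json.dumps(meta),))
con.execute("INSERT INTO sample VALUES (1, 1, ?)", (json.dumps([0.1, 0.2, 0.3]),))

# A join follows the reference; the user decodes the JSON text again.
row = con.execute(
    "SELECT e.name, e.meta, s.signal FROM sample s"
    " JOIN experiment e ON e.id = s.experiment_id"
).fetchone()
name, meta_back, signal = row[0], json.loads(row[1]), json.loads(row[2])
```

The database only ever sees text in the `meta` and `signal` columns; their structure exists solely on the user side.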
Detailed and advanced SQL server architecture. Source: https://www.guru99.com/sql-server-architecture.html
SQLite is a minimal but powerful SQL implementation (with page caching) that can be packed into one C source file.
Our sqld server is based on a library version of SQLite3.
Databases are stored in generic files.
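That a database is just an ordinary file can be checked with Python's built-in sqlite3 module. The sketch below uses a temporary directory and a hypothetical file name, demo.db; it writes one row, reopens the file, and reads the first 16 bytes, which for SQLite3 hold a fixed header string.

```python
import os
import sqlite3
import tempfile


def roundtrip(path):
    """Write one row to the database file at `path`, reopen it, read it back."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER, y REAL)")
    con.execute("INSERT INTO t VALUES (1, 2.5)")
    con.commit()
    con.close()

    con = sqlite3.connect(path)  # a fresh connection sees the persisted data
    rows = con.execute("SELECT x, y FROM t").fetchall()
    con.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")  # hypothetical file name
    rows = roundtrip(db_path)
    with open(db_path, "rb") as f:
        header = f.read(16)                 # SQLite3 file signature
```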
... more information ...
https://www.unite.ai/10-best-databases-for-machine-learning-ai
Horizontal Vs. Vertical Scaling
Horizontal scaling refers to adding additional nodes or machines to the infrastructure to cope with new data demands.
Vertical scaling refers to adding resources to an existing system so that it meets data demands. Resources: CPU power, memory, and storage capacity.
Data Types in R
R, a programming language and computing environment, is widely used for statistics and for processing large table-based data sets, but it is not limited to these tasks.
Core data types are:
That's all folks!
But data can be organized in:
# R                             R+
v1 = c(1,2,3,4)                 v1=[1,2,3,4]
l1 = list(x=1,y=1,z=0)          l1={x=1,y=1,z=0}
m1 = matrix(0,3,2)              m1=[|1,2;3,4;5,6|]
a1 = array(0,c(3,2,2))          a1=array(0,[3,2,2])
df1 = data.frame(
  a=c(1,2,3,4),                 a=[1,2,3,4],
  b=c('A','B','C','D'),         b=['A','B','C','D'],
  c=list(c(1,2),c(3,4)..),      c={[1,2],[3,4],..},
  ...)
R list, vector, matrix, and array data
SQL tables are a subset of R data frames!
l1={x=1,y=1,z=0}       ==>  l1$x=0  l1[[1]]=0  print(l1[1])
c1=[1,2,3,4]                c1[2]=c1[1]+1
m1 = matrix(0,3,2)          m1[2,1]=0
df1 = data.frame(           df1[1]  df1$a  df1[[1]]  df1[1,2]
  a=[1,2,3,4])
R access of list, vector, matrix, and data frame elements (and columns)
Image Data
Formats:
Uint8 [Red,Green,Blue,Alpha] [col][row]
Uint8 [Red,Green,Blue] [col][row]
Uint8 [col][row]
The data layout matters! Commonly, the color channels [RGB] of a pixel are stored together (fastest-varying index), the pixels are ordered by column within a row, and the rows follow one another.
G    G    G    G    G    G    ...
RGB  RGB  RGB  RGB  RGB  RGB  ...
RGBA RGBA RGBA RGBA RGBA RGBA ...
───────────────────────────
R1[C1 C2 C3] R2[C1 C2 C3] ...
Memory layouts of different image formats
Structure of the RGB image as a sequence of bytes in a linear memory or file model. Source: DOI: 10.1109/CLEI.2015.7359995
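For the interleaved, row-by-row layout shown above, the position of a single channel value in the linear byte sequence can be computed directly. The following Python sketch assumes one byte per channel (Uint8) and no padding at the end of a row; the function name pixel_offset is chosen for illustration.

```python
def pixel_offset(row, col, channel, width, channels):
    """Byte offset of one channel value in an interleaved, row-major image.

    Assumes one byte per channel and no padding at the end of a row.
    """
    return (row * width + col) * channels + channel


# A 3x2 RGB image stored as a flat byte buffer.
width, height, channels = 3, 2, 3
buf = bytearray(width * height * channels)

# Set the pixel in row 1, column 2 to (R, G, B) = (10, 20, 30).
for ch, value in enumerate((10, 20, 30)):
    buf[pixel_offset(1, 2, ch, width, channels)] = value
```

The channel index varies fastest, then the column, then the row, so the three bytes of one pixel are always adjacent.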
Images and Matrices
A 2-dim matrix can be considered as a graylevel intensity image.
RGB images require a 3-dim array. The third dimension represents the color channels.
In R, an image commonly has its own data type, cimg.
A matrix can be converted to an image (cimg) and vice versa.
m = matrix(runif(100),10,10)
m.i = as.cimg(m)
i = load.image('http://edu-9.de/assets/test.png', format='RGBA')
i.m = as.matrix(i) # converts automatically to graylevel
plot(i)
plot(i.m,auto.scale=TRUE)
Image to matrix and vice versa
Image Stacks
An image file commonly contains a single image.
However, many image file formats, e.g., TIFF, support indexed image stacks stored in one file.
If an image stack is loaded, typically a vector (or list) of images is returned.
Reconstructed CT slice image stacks contained in one file are typical examples.
Image Data Compression
The memory or file size of a flat image is:
Size(im)=w⋅h⋅d⋅b
where w and h are the width (number of columns) and height (number of rows) of the image, respectively, d is the number of channels (1, 3, or 4), and b is the number of bytes per pixel and channel (1, 2, or 4, i.e., 8, 16, or 32 bits).
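As a worked example of the size formula, the following Python sketch evaluates it for three image formats. The image dimensions are arbitrary example values.

```python
def image_size_bytes(w, h, d, b):
    """Size of an uncompressed (flat) image: width * height * channels * bytes."""
    return w * h * d * b


sizes = {
    "1920x1080 RGB, 8 bit": image_size_bytes(1920, 1080, 3, 1),
    "512x512 gray, 16 bit": image_size_bytes(512, 512, 1, 2),
    "1024x1024 RGBA, 32 bit": image_size_bytes(1024, 1024, 4, 4),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size / 2**20:.2f} MiB)")
```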
Never use irreversible (lossy) compression, since it introduces artifacts and noise on decompression!
Compression of graylevel images (one channel) normally has no benefit (low compression ratio, but high computation time). The same holds for images measured with high precision (more than 8 bits per pixel): detector noise prevents efficient compression.
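The effect of noise on reversible compression can be illustrated with Python's zlib module (lossless DEFLATE). The sketch compares a smooth synthetic intensity ramp with uniformly random bytes. Both inputs are assumptions made for illustration: uniform random bytes are an extreme stand-in for detector noise, which in real images typically affects only the lower bits.

```python
import random
import zlib

n = 65536  # number of 8-bit pixels, e.g. a 256x256 graylevel image

# Smooth image: every row holds one constant intensity (a vertical ramp).
smooth = bytes(i // 256 for i in range(n))

# Noisy image: uniformly random intensities (seeded for reproducibility).
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(n))


def compression_ratio(data):
    """Original size divided by the losslessly compressed size."""
    return len(data) / len(zlib.compress(data, 9))


print("smooth:", round(compression_ratio(smooth), 1))
print("noisy: ", round(compression_ratio(noisy), 2))
```

The smooth ramp shrinks considerably, whereas the random data does not shrink at all, because it contains no redundancy that the compressor could exploit.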
Data Frames
The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.
A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame". If no variables are included, the row names determine the number of rows.
The column names should be non-empty, and attempts to use empty names will have unsupported results. Duplicate column names are allowed, but you need to use check.names = FALSE for data.frame to generate such a data frame. However, not all operations on data frames will preserve duplicated column names: for example matrix-like sub-setting will force column names in the result to be unique.
Structured file formats such as CSV or JSON can be converted directly to data frames!
There is a large set of low- and high-level operations that can be applied to data frames (as well as matrices).
Synthetic Images
Matrices as well as empty images can be used to create synthetic images by geometric operations:
Synthetic images can be used for:
use math,imager,plot,geometry
im = load.image('pathto.tiff',format='GRAY')
m.corr = matrix(0,height(im),width(im))
draw.gaussian(m.corr,min=0.5,max=1,sigmax=150,sigmay=100)
im.corr = im/m.corr
Creating synthetic images with geometric and matrix operations for image intensity normalization
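The normalization idea can be sketched in plain Python with nested lists. The exact semantics of the R+ function draw.gaussian are not given in the text; the sketch assumes that it fills the matrix with a centered two-dimensional Gaussian scaled between min and max. The helper names gaussian_map and normalize are hypothetical.

```python
import math


def gaussian_map(height, width, lo, hi, sigma_x, sigma_y):
    """Centered 2D Gaussian scaled to the range [lo, hi] (assumed semantics)."""
    cy, cx = (height - 1) / 2, (width - 1) / 2
    return [
        [
            lo + (hi - lo) * math.exp(
                -((x - cx) ** 2 / (2 * sigma_x ** 2)
                  + (y - cy) ** 2 / (2 * sigma_y ** 2))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def normalize(image, corr):
    """Element-wise division of the image by the correction matrix."""
    return [[p / c for p, c in zip(irow, crow)] for irow, crow in zip(image, corr)]


corr = gaussian_map(5, 5, 0.5, 1.0, 2.0, 2.0)
# Synthetic test image: a flat scene of intensity 100 with Gaussian shading.
shaded = [[100.0 * c for c in row] for row in corr]
flat = normalize(shaded, corr)  # the shading is removed again
```

Dividing by the synthetic Gaussian brightens the darker border regions, which recovers the flat intensity of the test scene.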
File Data Formats
{
  "employee": {
    "name": "sonoo",
    "salary": 56000,
    "married": true,
    "awards": [1920, 1990, 2000]
  }
}
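A JSON record of this kind maps directly onto nested native data structures. As a minimal sketch, Python's built-in json module decodes the employee record into dictionaries and lists and encodes it back to text.

```python
import json

text = (
    '{ "employee": { "name": "sonoo", "salary": 56000, '
    '"married": true, "awards": [1920, 1990, 2000] } }'
)

record = json.loads(text)            # JSON text -> nested dict/list
employee = record["employee"]
first_award = employee["awards"][0]  # arrays become lists
encoded = json.dumps(record)         # nested dict/list -> JSON text
```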
x,y,z,class
1,2,3,"A"
1,4,2,"B"
4,5,0,"A"
In R, CSV files are represented by (converted to) data frames!
use csv
text = 'x,y,z\n1,2,3\n4,5,6\n7,8,9'
df = parse(text)
df = read.csv('pathto.csv',sep=',',header=TRUE)
R CSV reader
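The same conversion from CSV text to a column-oriented table can be sketched with Python's built-in csv module. The dictionary of columns is only a simple stand-in for a data frame, and the helper name read_columns is hypothetical.

```python
import csv
import io

text = "x,y,z\n1,2,3\n4,5,6\n7,8,9"


def read_columns(source):
    """Parse CSV with a header row into a dict of integer columns."""
    reader = csv.DictReader(source)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(int(value))  # all cells are integers here
    return columns


df = read_columns(io.StringIO(text))
print(df["y"])
```

Reading from a file instead of a string only changes the source, e.g. an opened file object in place of io.StringIO.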
YAML supports nested data structures, but table-like data is difficult to maintain.
martin:
  name: Martin D'vloper
  job: Developer
  skill:
    - a
    - b
    - c
Lists and records in YAML
Data Flow Architecture
The entire data processing architecture is a graph of computational nodes, data sources, and data sinks, connected by event-based channels. Nodes are connected via input and output ports.
Examples are:
Processing graph and data flow architecture with data source, processing, and sink nodes. Event-based data flow architecture and event chains. New data provided by a node is propagated to all child nodes. Parameter changes initiate a re-computation (or display), too.
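The event chain described above can be sketched with a small Python class. This is an illustrative model, not the implementation used in the course software: the class Node and its methods connect, push, and set_param are hypothetical names.

```python
class Node:
    """Processing node: recomputes on new data or on a parameter change."""

    def __init__(self, func, **params):
        self.func = func
        self.params = params
        self.children = []
        self.last_input = None
        self.output = None

    def connect(self, child):
        """Connect the output port of this node to the input of `child`."""
        self.children.append(child)
        return child

    def push(self, data):
        """New input event: compute and propagate to all child nodes."""
        self.last_input = data
        self.output = self.func(data, **self.params)
        for child in self.children:
            child.push(self.output)

    def set_param(self, **params):
        """A parameter change triggers a re-computation of the sub-graph."""
        self.params.update(params)
        if self.last_input is not None:
            self.push(self.last_input)


received = []  # everything that arrives at the sink
source = Node(lambda d: d)
scale = source.connect(Node(lambda d, factor: [v * factor for v in d], factor=2))
sink = scale.connect(Node(lambda d: received.append(d) or d))

source.push([1, 2, 3])      # new data flows source -> scale -> sink
scale.set_param(factor=10)  # re-computation starts at the changed node
```

After both events the sink has received the data twice, once for the new data and once for the parameter change.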
use sql,json
db = connect("localhost:9999")
mytable.schema = db::schema("mytable")
mytable.nrow = db::nrow("mytable")
mytable.data = db::read("mytable")
transform(mytable.data,c=as.vector(c,mode="uint16"))
data = db::read("mytable",b=fromJS(b), c=as.vector(c,mode="uint16"))
R+ SQL operations
sqld is a slim native C implementation of an SQL server built on SQLite3 that stores SQL databases in plain binary files on the local file system.
SQL databases can be accessed through a JSON-based Remote Procedure Call (RPC) interface, which basically maps SQL operations onto JSON structures (for both request and reply).
HTTP is used to access the JSON-RPC API.
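How SQL operations can be mapped onto JSON requests and replies is sketched below in Python. This is not the actual sqld protocol, which the text does not specify: the message fields id, sql, params, result, and error are hypothetical, and the HTTP transport is left out, so the handler is called directly.

```python
import json
import sqlite3


def handle(con, request_text):
    """Execute one JSON-encoded SQL request and return a JSON-encoded reply."""
    request = json.loads(request_text)
    try:
        cursor = con.execute(request["sql"], request.get("params", []))
        # `description` is None for statements that return no rows.
        columns = [c[0] for c in cursor.description] if cursor.description else []
        reply = {
            "id": request.get("id"),
            "result": {"columns": columns, "rows": cursor.fetchall()},
        }
    except sqlite3.Error as err:
        reply = {"id": request.get("id"), "error": str(err)}
    return json.dumps(reply)


con = sqlite3.connect(":memory:")
handle(con, json.dumps({"id": 1, "sql": "CREATE TABLE t (a INTEGER, b TEXT)"}))
handle(con, json.dumps({"id": 2, "sql": "INSERT INTO t VALUES (?, ?)",
                        "params": [1, "x"]}))
reply = json.loads(handle(con, json.dumps({"id": 3, "sql": "SELECT a, b FROM t"})))
failed = json.loads(handle(con, json.dumps({"id": 4, "sql": "SELECT * FROM missing"})))
```

Both directions carry plain JSON text, so any client that can send an HTTP request and parse JSON could use such an interface.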
SQLD architecture and JSON-RPC interface
Summary
Data processing is performed with a sequential programming language, R+.
Data can be represented by different data types, structures, and formats supported by R+.
Data access is provided by files, HTTP services, or SQL databases in a unified way.