
Native File System Based Data Management Architecture

@) Introduction:

All database systems, whether relational or non-relational, NoSQL or schema-less, ultimately store data in files on the system. The difference lies in how they manage data in those files and in the facilities they provide on top of that architecture.

Here we are not going to concentrate on what they do or how they manage data. This theory focuses strictly on storing and managing data without using any database server.


@) Necessity:

There are various R-DBMS and NoSQL solutions out there that provide high performance and scalability, so why look into this kind of solution?

The answer is very simple: both R-DBMS and non R-DBMS solutions have their own limitations. If your system is very large and complex and needs scalability, sharding, etc., it would be better to use MongoDB, Cassandra, HBase or a similar solution, and in that case you might not need the approach discussed here.

But if you have a small or medium system, or some small part of your system that needs to work in real time at a speed that normal R-DBMS systems can't manage, and you don't want to go for high-cost solutions, then you should take a look at this concept.

You can use the file system to store and fetch data. For simple access patterns this can be much faster than an R-DBMS and much lighter on resources than non R-DBMS solutions.

One might think that this is not a practical solution, that developing such a system is very complex and amounts to writing a database server yourself, and that it is therefore better to use those ready-made, complex and huge systems instead.

Well, as said earlier, the amount of effort this method requires depends on your application and business logic. If your data is very difficult to handle with the file system in the way discussed here, you are better off with those readily available solutions.

For handling data in a software system there are many options, such as R-DBMS, non R-DBMS, search engines, caching, etc., and choosing the right options based on the application's needs is the job of a system architect.

Here we are not going to discuss which combination of these solutions is best; we are only going to add one more option to the list for those system architects.

This approach might be difficult for some, but "To get something extra, most of the time we also need to do something extra."

Yes, sorry, but there is no shortcut for certain things. What has to be done must be done; only how it is done can be changed, and that too only up to a certain extent, otherwise the final result might be affected.

You can even use the file system in conjunction with relational or non-relational database servers, search engines, or caching servers like memcached, to serve your purpose and cover for the limitations of relational or non-relational solutions.

The selection and usage are entirely up to you. Here we only give you an idea of how to use the file system efficiently for storing relational or non-relational data so that it can help boost the performance of your application.


@) Concept:

You can use the file system to store and fetch data. For simple access patterns this can be much faster than an R-DBMS and much lighter on resources than the heavy non R-DBMS solutions available in the market.

Again, this speed and lightness on resources depend on the way data is stored and managed in files and across directories. So it is very important to use the file system efficiently and to manage data in such a way that resource usage is kept to a minimum.

Presented below is the concept of how to store and manage data efficiently, and what needs to be taken care of while designing the architecture of such a data storage system.


1) Keep file size minimal

As all data management operations will be part of the application, they will mostly be written in the programming language of the developer's choice, so developers can feel comfortable applying various data manipulations.

But care must be taken that a large amount of data is never handled at any single moment, for example by reading a large file synchronously.

To avoid large files, data must be stored in multiple files and kept as discrete as possible. For example, one folder for each table and one file inside that folder for each record.

A key part of a database system is resource (CPU and memory) management plus operations on data. If your design leads to huge memory usage, you are misusing the file system.
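
The "one folder per table, one file per record" layout can be sketched roughly as below. This is only a minimal illustration in Python; the JSON format, the '.json' suffix and the function names (record_path, save_record, load_record) are assumptions made for the example and are not prescribed by this concept.

```python
import json
from pathlib import Path


def record_path(root: Path, table: str, record_id: str) -> Path:
    """One folder per table, one file per record (hypothetical layout)."""
    return root / table / f"{record_id}.json"


def save_record(root: Path, table: str, record_id: str, record: dict) -> None:
    path = record_path(root, table, record_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record), encoding="utf-8")


def load_record(root: Path, table: str, record_id: str) -> dict:
    # Only this one small file is read; no other record is touched.
    return json.loads(record_path(root, table, record_id).read_text(encoding="utf-8"))
```

For example, save_record(Path("data"), "users", "42", {"name": "alice"}) would create the file data/users/42.json, and loading that record never requires more memory than the single record needs.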


2) Minimize the need for locks and waits when accessing files

Make each read or write operation as independent of other read and write operations as possible.

If your data is stored across multiple files, you can easily access large data and operate on it quickly using concurrent threads. But if it is in a single large file, you might face data corruption or delays due to locks and waits in case of concurrent access.

As explained in the previous point, if each write goes to a separate file you can avoid locks and waits for file access, which further speeds up reading and writing.

Always give files unique names.
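
One possible way to get unique names and lock-free writes is sketched below. The naming scheme (nanosecond timestamp plus a random UUID) and the function name write_unique are assumptions for illustration only. The file is first written under a temporary name and then renamed, so that a concurrent reader never picks up a half-written file; on POSIX systems the rename done by os.replace is atomic within one file system.

```python
import os
import time
import uuid
from pathlib import Path


def write_unique(folder: Path, content: str) -> Path:
    """Write content to a new, uniquely named file without taking any lock."""
    folder.mkdir(parents=True, exist_ok=True)
    # Timestamp prefix keeps names roughly sortable; the UUID makes them unique.
    name = f"{time.time_ns()}-{uuid.uuid4().hex}"
    tmp_path = folder / f".{name}.tmp"  # hidden name while still incomplete
    final_path = folder / name
    tmp_path.write_text(content, encoding="utf-8")
    # Rename into place: readers see either no file or the complete file.
    os.replace(tmp_path, final_path)
    return final_path
```

Because every writer creates its own file, two simultaneous writes never touch the same file and no writer has to wait for another.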


3) Minimize reads and writes to files (file naming convention & directory structure matter)

Yes, although we are going to store data in files, that doesn't mean all data must be stored as file content.

Design the directory structure and file names in such a way that they convey relations between data; the mere existence of a file or directory can denote the existence of a relation. This will reduce file read and write operations and make managing relations very easy.

For example:

Inside a folder named 'friendship', a file named 'userid1-userid2' can denote a friendship between two users.

Now, all files starting with 'userid1-' or ending with '-userid1' in that folder give you all friends of the user with userid1, but fetching data in that manner is bad practice.

Even though we can get friends from the 'friendship' folder by matching userid1 against file names, there can be many files in that folder, and going through all of them will slow the process down again.

Instead, it is better to have a folder named 'friends' inside a folder named 'userid1' and store one file for each friend of userid1 inside that 'friends' folder.

Thus, when we need the list of all friends of userid1 we look into the '/data/users/userid1/friends/' folder, while the '/data/friendship/' folder can be used to store and read data about the relation between two users, such as status, added date, etc.
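
The friendship example can be sketched as follows. The function names are hypothetical, the marker files are left empty, and sorting the two user ids to build the 'friendship' file name is an assumption added here so that the same pair always maps to the same file.

```python
import os
from pathlib import Path


def add_friend(root: Path, user_a: str, user_b: str) -> None:
    """Record a friendship using only empty marker files."""
    for owner, friend in ((user_a, user_b), (user_b, user_a)):
        friends_dir = root / "users" / owner / "friends"
        friends_dir.mkdir(parents=True, exist_ok=True)
        (friends_dir / friend).touch()
    # Details of the relation (status, added date, ...) would be stored here.
    pair = "-".join(sorted((user_a, user_b)))
    friendship_dir = root / "friendship"
    friendship_dir.mkdir(parents=True, exist_ok=True)
    (friendship_dir / pair).touch()


def list_friends(root: Path, user: str) -> list[str]:
    friends_dir = root / "users" / user / "friends"
    if not friends_dir.is_dir():
        return []
    # Only this user's folder is listed, not the whole 'friendship' folder.
    with os.scandir(friends_dir) as entries:
        return sorted(e.name for e in entries if e.is_file())


def are_friends(root: Path, user_a: str, user_b: str) -> bool:
    # A single existence check; no file content is read.
    return (root / "users" / user_a / "friends" / user_b).exists()
```

Here listing friends costs one directory listing of a small folder, and checking a relation costs one file-exists call, with no file content read in either case.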


4) Managing operations

When applying operations such as the average of some value, you can read the values stored in files using concurrent threads, and in more complex cases you can even use temporary files to store intermediate data or manage operations.
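
As a rough sketch of the averaging case, assuming each file in a folder holds exactly one numeric value as text (an assumption made only for this example), the values can be read with a thread pool from Python's standard library:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_value(path: Path) -> float:
    return float(path.read_text(encoding="utf-8").strip())


def average_of_files(folder: Path, max_workers: int = 8) -> float:
    paths = [p for p in folder.iterdir() if p.is_file()]
    if not paths:
        raise ValueError(f"no value files in {folder}")
    # File reads are I/O bound, so threads can overlap the waiting time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(read_value, paths))
    return sum(values) / len(values)
```

The worker count of 8 is arbitrary; a sensible value depends on the storage and the number of files.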


5) Use common sense

Think before you finalize your flow and data management scheme. Consider the number of simultaneous users and the amount of data to be stored. Maximize the use of threads.

Although all data can be stored and managed in the file system, this approach also has its own limitations: handling complex operations on large data needs considerable expertise.

Again, it is my duty to make you aware that if you need to perform complex logic on huge data, this approach might not fit your needs, and you had better consider the other options mentioned above in the Necessity block.


@) Usage Example:

So what do you think? Is it really a good idea to use the file system in a project? Are there any real-life cases where I might have to use this approach?

Yes. As explained above, we can use the file system in conjunction with an R-DBMS or any other solution, as per our needs, to improve performance.

Let's consider one example to give you a clearer idea of how we can use the file system efficiently in a project where performance matters.

The example we are going through is a chat system. Consider that we need to build a chat system which, unlike a traditional Ajax-based chat system, will perform at very high speed.

We need a chat system that allows multiple group chats and multiple one-to-one chats at the same time in different tabs, and that also has a broadcast facility. Almost all messages must be delivered with no more than 1 second of delay.

We also need to manage users, their contacts and groups, and all this without putting much load on our R-DBMS server.


As you read above, we need to deliver messages almost immediately after a user sends them. For this, let's say we selected HTML5 WebSockets with an Ajax long-polling fallback.

Delivering data as soon as it appears on the server using WebSockets or Ajax long polling will create lots of read queries on the server for messages. There will also be many simultaneous write operations for writing messages, as we are offering multiple one-to-one and group chats.

For this we would have to handle too many calls on the database, and it might get slow as the number of users and the data in the message table increase.

Also, let's assume that we don't want to invest in huge servers and heavy setups, so let's see how we can manage it using an R-DBMS along with the file system.


Basics:

• Profile data like contact names and group names is stored in per-user files.

• Each chat message creates a separate temporary file, which is deleted a certain time after it has been read.

• When a user has read a message, a flag file is created; its existence is used to determine that the message in the corresponding file has already been read by that user.

• But before message files are deleted, the messages must be inserted into MySQL by a regularly scheduled cron job.

• Each group has a separate file that contains the contacts of that group.

• One folder contains a separate entry for each of a user's relations, such as contacts and groups.


Overview of the structure of data files:

Chat messages are stored inside the /tmp folder as temporary files, one message per file.

Let's say user1 is chatting with user2:
- files for those messages will be inside /tmp/user1-user2/
- files for broadcast messages will be inside /tmp/mt/
- folders starting with 'g:' are for groups and will contain the files of group chat messages
- once a message file, e.g. 'file123', is read by 'user1', a file named 'user1_file123' is created to prevent re-reading
- while reading message files, e.g. 'file123', if the file 'user1_file123' exists then 'file123' is skipped
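
The read-flag mechanism from the list above can be sketched like this. Two things are assumed here that the description leaves open: the flag file is created in the same folder as the message file, and message file names never contain an underscore, so that any name containing '_' can be treated as a flag. The function names are hypothetical.

```python
from pathlib import Path


def flag_path(folder: Path, user: str, message_file: Path) -> Path:
    # e.g. 'user1_file123' marks 'file123' as read by 'user1'
    return folder / f"{user}_{message_file.name}"


def read_unread(folder: Path, user: str) -> list[str]:
    """Return unread message bodies for `user` and flag them as read."""
    if not folder.is_dir():
        return []
    messages = []
    # sorted() takes a snapshot of the listing before any flag is created.
    for path in sorted(folder.iterdir()):
        # Assumed naming: names with '_' are flag files, not messages.
        if not path.is_file() or "_" in path.name:
            continue
        flag = flag_path(folder, user, path)
        if flag.exists():
            continue  # already delivered to this user
        messages.append(path.read_text(encoding="utf-8"))
        flag.touch()
    return messages
```

Calling read_unread twice for the same user returns the message only the first time, while another user in the same chat still receives it, because flags are kept per user.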


Other files and folders used:

/files/un    User profile files containing contact names, group names, username, email, password, contact and group requests, and verification code (one file per user)

/files/grp   Contains a file for each group, holding the names of its contacts and requested contacts

/files/ou    Contains a file for each online user and can be used for storing current logged-in session values

/files/um   Contains a folder per user, and inside it one folder for each relation of that user

                For example:
                /files/um/user1   Will contain a folder 'user1-user2' if user1 has a contact user2; for a group it will contain a folder named 'g:groupname-groupid'.
                                        This makes it easy to retrieve chat data from the same-named folder inside /tmp and avoids excessive relation mapping (or checking) and reading data from files.
                                        So, if user1 and user2 are contacts and are both in 'group1', each will have one folder in /files/um, and inside each of those folders there will be 'user1-user2' and 'g:group1-123' folders.
                                        The same 'user1-user2' and 'g:group1-123' folder names are also used in /tmp to store the respective message files.

While fetching messages for user1, the names of all folders inside /files/um/user1 are obtained, and the files inside the same-named folders under /tmp are read and sent to user1.
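
A rough sketch of this lookup is given below. The two root folders are passed in as parameters (um_root standing for /files/um and tmp_root for /tmp), and the function names are hypothetical. The sketch only locates the candidate files; skipping files that were already read is a separate step and is not shown here.

```python
import os
from pathlib import Path


def relation_names(um_root: Path, user: str) -> list[str]:
    """Names of the relation folders kept for `user`, e.g. 'user1-user2'."""
    user_dir = um_root / user
    if not user_dir.is_dir():
        return []
    with os.scandir(user_dir) as entries:
        return sorted(e.name for e in entries if e.is_dir())


def pending_message_files(um_root: Path, tmp_root: Path, user: str) -> list[Path]:
    """Files in the same-named folders under the temporary message root."""
    found = []
    for name in relation_names(um_root, user):
        chat_dir = tmp_root / name
        if chat_dir.is_dir():
            found.extend(sorted(p for p in chat_dir.iterdir() if p.is_file()))
    return found
```

Only the folders that belong to the user's own relations are ever listed, so chats of other users are never scanned.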

So, on the server side, the file system rather than the R-DBMS server is constantly monitored for each user for any new or unread files. As soon as such a file is found, its contents are read and sent to the respective user, and the flag file is created so that the same file is not fetched again.

Now we need a cron job that reads all these message files, stores them with the appropriate relations in the DB, and deletes each temporary message file once its data has been inserted.

To insert message data into the R-DBMS server we can run a cron job every minute that reads all message files, creates one SQL file per run, stores that SQL file in a secure folder, and then deletes the message files it has read. Those SQL files can then be loaded into MySQL every 30 minutes. The reason for splitting insertion into two cron jobs is to reduce the insert frequency on the DB server while at the same time keeping fewer files on the file system to scan for new messages.
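
The first of the two cron jobs could look roughly like the sketch below. Note the deliberate substitution: instead of generating an SQL file, this sketch writes one JSON-lines batch file per run, which avoids having to show SQL escaping here; a second job would later load such batch files into the database. The function name, the batch file naming and the rule that names containing '_' are read flags are all assumptions. Cleaning up flag files, and deciding whether a message must be flagged as read before it may be deleted, are left open.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional


def collect_batch(tmp_root: Path, batch_dir: Path) -> Optional[Path]:
    """Move the content of all message files into one batch file.

    Returns the batch file path, or None when there was nothing to collect.
    """
    rows, consumed = [], []
    for chat_dir in sorted(p for p in tmp_root.iterdir() if p.is_dir()):
        for path in sorted(p for p in chat_dir.iterdir() if p.is_file()):
            if "_" in path.name:
                continue  # assumed naming: names with '_' are read flags
            rows.append({
                "chat": chat_dir.name,
                "file": path.name,
                "body": path.read_text(encoding="utf-8"),
            })
            consumed.append(path)
    if not rows:
        return None
    batch_dir.mkdir(parents=True, exist_ok=True)
    batch_path = batch_dir / f"batch-{time.time_ns()}.jsonl"
    tmp_path = batch_path.with_suffix(".tmp")
    with tmp_path.open("w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")
    os.replace(tmp_path, batch_path)
    # Delete message files only after the batch is safely on disk.
    for path in consumed:
        path.unlink()
    return batch_path
```

Writing the batch first and deleting the message files afterwards means that a crash in between can at worst produce a duplicate, never a lost message.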

In this manner, persistent storage of our data is done in the R-DBMS server, and whenever a user needs to see chat history, the data can be fetched from the R-DBMS. Again, we can limit the data in the R-DBMS server to, let's say, 3 months per user, after which a download facility can be offered so that users can get their old data in a file. That file is deleted a certain period after it has been downloaded, and since the data has already been deleted from the DB server, it will no longer appear in the chat history in the application.

Thus, excessive SQL querying, relation mapping and file reading are avoided. The creation of huge files is also avoided by creating a separate file for each message and by handling as many checks and relations as possible with directory-exists and file-exists tests.


This was just an example of using the file system in a chat application to avoid excessive load on the R-DBMS server. In your application you can design a different structure and naming convention as per your needs.


Too theoretical to digest, right? Let's finish this talk here and try our hands on a real application! So, what are you waiting for? Visit our products page and download the simple web chat application, which is developed in PHP and works on the same concept!