Home

In a human body the heart and the brain are the 2 most important organs, if those are not performing well, nothing else is. The Alfresco database and filesystem where the content store resides are the brain and heart of Alfresco. Those are the 2 most important layers of every Alfresco architecture.

Get to know the Alfresco Database throughput

If your project will have lots of concurrent users and operations or the number/estimate number of documents is very big (> 1M)) you need to be informed about your database throughput.

database_performance_ui

Most common throughput factor of the database is transactions per second.

DB performance in a transactional system are usually the underlying database files and the log file. Both are factors because they require disk I/O, which is slow relative to other system resources such as CPU.

In the worst-case scenario,  in big alfresco databases with a big number of documents ):

Database access is truly random and the database is too large for any significant percentage of it to fit into the cache, resulting in a single I/O per requested key/data pair.

Both the database and the log are on a single disk. This means that for each transaction, the AlfrescoDB is potentially performing several filesystem operations:

  • Disk seek to database file
  • Database file read
  • Disk seek to log file
  • Log file write
  • Flush log file information to disk
  • Disk seek to update log file metadata (for example, inode information)
  • Log metadata write
  • Flush log file metadata to disk

Faster Disks normally can help on such sittuations but there are lots of ways (scale up, scale out) to increase transactional throughput.

In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into:

  • I/O bottlenecks in the RDBMS throughput
  • Excessive queue for transactions due to overload of connections
  • On active-active cluster configurations, excessive latency

Alfresco Database treads pool.

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

http://docs.oracle.com/cd/E17076_04/html/programmer_reference/transapp_throughput.html

 

How to calculate the size of your Alfresco database

All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case.

To size your alfresco database in terms of space we’ve done a series of tests by creating content (and metadata) on an empty repository and analyzing the database growth.

Be aware that:

  • Content is not stored in the database but is directly stored on the disk
  • Database size is un-affected by size of the documents or the document’s content
  • Database size is affected by the number/type of metadata fields of the document

Hi all, back with another Alfresco related post, this time to show you how to size your alfresco database in terms of space.

The following factors are relevant to calculate the approximate size for an Alfresco database

  • Number of meta data fields
  • Permissions
  • Number of folders
  • Number of documents
  • Number of versions
  • Number of users

I’ve made a series of tests where i could verify how the Alfresco database grows.

I’ve made a bulk import with the following data.

Document creation method In-place bulk upload
Number of Documents Ingested 148
Total Size of Documents 929.14 MB
Number of metadata fields per document 13 fields
Total number of metadata fields 1924

The table below shows the types of documents and its average sizes

Document Type Extension Average Size (KB)
MS Word Document .doc 1024
Excell Sheet .xls 800
Pdf document .pdf 10240
PowerPoint presentation .ppt 5120
Jpg image .jpg 2048

Checking the diagram below we can see that the database indexes grow more than the data itself. By observing the growth of the database size we’ve concluded that the average metadata field occupation on the Alfresco database is approximately 5.5 K per metadata field

Screen Shot 2014-09-16 at 21.54.17

 

Also interesting is to verify the tables that grow in size(KB) after the content ingestion. Note that we are not applying any permission.

Screen Shot 2014-09-16 at 21.55.08

 

To size your database appropriately you must ask the right questions whose answers will help you to determine the database sizing.

  1. Estimated number of users in Alfresco
  2. Estimated number of groups in Alfresco
  3. Estimated number of documents on the first year
  4. Documents growth rate
  5. Average number of versions per document
  6. Average number of meta-data fields per document
  7. Estimated number of folders
  8. Average number of meta-data fields per folder
  9. Estimated number of concurrent users
  10. Folder based permissions (inherited to child documents)?

Database sizing formulas

Consider to following figures to determine your approximate database size.

–       DV = Average number of document versions

–       F = Estimated number of folders

–       FA = Estimated number of folder metadata fields (standard + custom)

–       D = Number of Documents * DV – Estimated number of documents including the versions

–       DA =Estimated number of documents metadata fields (standard + custom)

The number of records on specific alfresco tables is calculated as follows:

–       Number of records on alf_node (TN = F + D * DV)

–       Number of records records on node_properties (TNP = F * FA + D * DA)

–       Number of records records on node_status = (TNS = F + D)

–       Number of records records on alf_acl_member= (TP = D) assuming permission will be set at the folder level and inherited

The approx. number of records in the database will be TRDB = TN + TNP + TNS + TP

The following formula is based on the number of database records. On our benchmarks we’ve observed that each database record takes about 4.5k of db space.

Formula #1 Database size = TRDB * 4.5K

Alternatively, we can base our calculations on the number of metadata fields of the documents,  considering 5.5k for each metadata field and use the following formula.

Formula #2 Database size = (D * DA + F * FA) * 5.5K

The 2 formulas provided are only approximations on the size that your database will need and are based on benchmarks executed against a vanilla Alfresco version 4.2.2.

If we wish to consider users and groups add consider 2k for each user and 5k for each group.

Note that the formulas are not taking in consideration additional space for logging, rollback, redolog, etc.

Tuning your Alfresco database

In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into:

•       Wrong or improper support for ACID transaction properties

•       I/O bottlenecks in the RDBMS throughput

•       Excessive queue for transactions due to overload of connections

•       On active-active cluster configurations, excessive latency

Considering that your database layer will be used under concurrent load I’ve come up with a set of hints that will contribute to maximize your Alfresco database performance.

Database Thread pool configuration

A default Alfresco instance is configured to use up to a maximum of forty (40) database connections.  Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols.

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

It’s recommended to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75.  If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

After increasing the size of the Alfresco database connection pool, you must also increase the number of concurrent connections your database can handle, to at least the size of the Alfresco connection pool. Alfresco recommends configuring at least 10 more connections to the database than is configured into the Alfresco connection pool, to ensure that you can still connect to the database even if Alfresco saturates its connection pool.

Database Validation query

By default Alfresco does not periodically validate each database connection retrieved from the database connection pool.  Validating connections is, however, very important for long running Alfresco servers, since there are various ways database connections can unexpectedly be closed (for example by transient network glitches and database server timeouts). Enabling periodic validation of database connections involves adding the db.pool.validate.query property to alfresco-global.properties and the query is specific for your database type.

Database Value for db.pool.validate.query
MySQL[1] SELECT 1
PostgreSQL SELECT VERSION()
Oracle SELECT 1 FROM DUAL

Database Scaling

Alfresco relies largely on a fast and highly transactional interaction with the RDBMS, so the health of the underlying system is vital. Considering our existing customers, the biggest running repositories are under Oracle (most of them RAC).

If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines. This can be achieved using Oracle RAC or a Mysql based solution with haproxy[1] (opensource solution) or a commercial solution like MariaDB[2] or Percona[3].

[1]http://haproxy.1wt.eu

[2]https://mariadb.org

[3]http://www.percona.com

You can use either solution, depending also you the knowledge that you have in-house. Thegolden rule is that the response time from the DB in general, should be around 4ms or lower. At this layer i don’t recommend to use any virtualization technology. The database servers should be physical servers.

The high availability and scalability for database is vendor dependent and should be addressed with the chosen vendor to achieve the maximum performance possible.

Database Monitoring

Monitoring your database performance is very important as it can detect some possible performance problems or scaling needs.

I’ve identified the following targets that should be monitored and analysed on a regular base.

  • Transactions
  • Number of Connections
  • Slow Queries
  • Query Plans
  • Critical DM database queries ( # documents of each mime type, … )
  • Database server health (cpu, memory, IO, Network)
  • Database sizing statistics (growth, etc)
  • Peak Period of resource usage
  • Indexes Size and Health

I hope this post can help you to understand the importance of the Alfresco database and that you can make use of it on your sizing exercise. Stay tuned for more Alfresco related posts.

All the best, One love,

Luis

Thanks to Luis

Luis Cabaceira on Alfresco

Technical Articles from real experiences on the field while providing consulting services

Publicités

Laisser un commentaire

Choisissez une méthode de connexion pour poster votre commentaire:

Logo WordPress.com

Vous commentez à l'aide de votre compte WordPress.com. Déconnexion / Changer )

Image Twitter

Vous commentez à l'aide de votre compte Twitter. Déconnexion / Changer )

Photo Facebook

Vous commentez à l'aide de votre compte Facebook. Déconnexion / Changer )

Photo Google+

Vous commentez à l'aide de votre compte Google+. Déconnexion / Changer )

Connexion à %s