7 Ways to secure data in data warehouses

A data warehouse is a centralised repository which consolidates data from all manner of places into a useful format that can be analysed and reported on. It’s a critical part of the decision-making process for many organisations: anything from customer insights to demand forecasting can be achieved at an advanced level through the power of data warehousing.

However, data warehouses are often in charge of extremely sensitive data — personal or confidential information that simply cannot be compromised. Therefore, security considerations are paramount — cyber security attacks are an unfortunate reality, but just as dangerous is the potential fallout of lax protocols or human error. The impact of a mistake with data can be catastrophic: we have seen this from companies like Equifax, for which cyber security errors have led to permanent reputation damage.

Data information security should never be taken lightly. This is our Data Engineering team’s list of recommendations for ensuring security in data warehouses. This covers some of the main areas that pose risk to data security: user access control, network security and data movement.

Data encryption

What is it? By implementing a key or password, encryption obscures data and renders it inaccessible without said key.

Why is it used? Even if access controls are bypassed, encrypted data is a good way to minimise data loss.

Drawbacks? In the context of data warehouses, encryption can sometimes degrade data warehouse performance due to performance overheads. Architects should draw a balance between the need to secure data and the need to provide unrestricted access, which has the lowest impact on performance. This method also doesn’t solve access control risks — that must be managed using other methods.

Role-based access control

What is it? In a workforce, database management systems with role-based access controls allow team members access to only the data they absolutely need, and no more. This is an option for both on-premise and cloud-based data warehousing services.

Why is it used? Human error must always be a consideration when it comes to data security. By restricting access to sensitive data from irrelevant groups of users, the risk of breached data or other mistakes is greatly minimised.

Drawbacks? Some companies are made of workforces with dozens — even hundreds — of roles, all of which might require different levels of access. For businesses that are ever-changing, rapidly growing or already at an enterprise level, this type of security can easily become difficult to manage. That is why role-based access control should align with any previously defined data classifications, to avoid complication.

Data classification

What is it? This is a process of organising data. There are two primary approaches. One is in accordance to role — this restricts data access to certain job functions. Second is according to data sensitivity — where data restriction is classified by restriction based on how sensitive it is.

Why is it used? When organisations store huge amounts of data, a lack of classification can make it nearly impossible to manage — a serious problem when it comes to ensuring security. Classifying data appropriately ensures it can be managed and additional security layers can be added appropriately over time.

Drawbacks? Depending on the types of data, manual intervention may be required to handle data classification, which is always a time-based issue. For algorithm-based classification processes, keeping on top of these processes is an important task — if this is not done, minor errors can become huge mistakes that impact other security measures.

Splitting data

What is it? Different categories of data require different levels of security. By splitting it up into separate tables, data can be managed in the most appropriate ways rather than applying one security measure to the whole set. It can then be retrieved and joined back together upon request.

Why is it used? Data splitting essentially halves (or further fractionalises depending on the granularity of split data) security risks by distributing data into different locations. It makes data completely useless unless it can be pieced back together, and more complex data splitting practices ensure that this data is continually moved around.

Drawbacks? A potential risk of data splitting is over-distribution; having dozens of tables that split data in different ways can lead to latency issues in data warehouses. However, modern data splitting solutions mitigate this issue with advanced algorithms and fast processing speed.

Securely moving data

What is it? In order to move data safely from one place to another, protocols like SSL or TLS can be used — they authenticate the destination of data and encrypt it along the way. With cloud-based data warehouses, virtual private networks provide good security for moving data by isolating the communication between on premises databases and the cloud-based data warehouse.

Why is it used? A data warehouse contains many moving parts and often data is transferred from their transnational databases to the data warehouse for the most up-to-date reporting and analysis. It’s important to ensure full protection of data as it moves from one place to another.

Drawbacks? Data transfer protocols are nearly universal in enterprise data warehouses, but they do inherently require additional labour costs to set up and maintain. Also, using SSL or TLS protocols can impact performance, especially in high load systems — there will be a trade-off between speed and security.

Logs and audits

What is it? Log files should be made available by system and application administrators in a readable format for periodic reviews by an independent observer. Logging should be performed in a secure manner and provide analytical capabilities to identify authorized successful access, identify unauthorized access attempts and security violations, provide audit trails for user action as appropriate, and help in reconstructing compromised systems.

Why is it used? Account usage should be logged and monitored to help maintain data warehouse security — it’s key to understanding who has made changes at any time. An effective logging approach will create detailed audit trails that show events leading up to a security issue.

Drawbacks? The result of auditing is never a drawback, but the process can require its own set of security measures which must be setup and maintained over time.

Build documentation

What is it? Audit and security requirements need to be properly documented. The document should have information gathered from data and user classification, data movement and auditable actions.

Why is it used? All stakeholders involved in managing data information security should know how their security protocols are set up — this is why documentation is necessary.

Drawbacks? There are no drawbacks to good documentation — bad documentation, however, is the root of all evil. If proper documentation strategy is not implemented and stuck to from the beginning of any development project, there is an immediate risk that system knowledge only exists in the minds and spoken words of the team. It is an engineer’s nightmare to join a new project and be faced with piecing together broken and nonsensical documentation from their predecessors. From a security perspective this is especially problematic — when a real problem occurs, good documentation can mean the difference between a team’s measured response and a panicked one. Overall — clear documentation takes time, but it is worth it for good data security.