Summary

This study introduces a database architecture that manages data in main physical memory in a columnar format. It proposes a conceptual framework for applying in-memory columnar database systems to support high-speed continuous audit analytics. To evaluate the proposed framework, this study develops a prototype and conducts simulation tests. The test results show the high computational efficiency and effectiveness of the in-memory columnar database relative to the conventional ERP system. Furthermore, deploying the in-memory columnar database in the cloud shows promise for scaling it up for continuous audit analytics.

Keywords: accounting information systems; continuous auditing; in-memory computing; columnar storage; cloud computing

In the era of Big Data, the audit profession is starting to leverage emerging data analysis techniques (e.g., deep learning, process mining) to test financial and nonfinancial data, evaluate internal control effectiveness, and detect fraudulent transactions. For example, process mining of event logs enables auditors to assess companies’ internal controls ([18]; [12]), and machine-learning methods are used to evaluate the risk of financial statement fraud ([26]; [13]). Audit analytics techniques have been widely recognized as necessary tools in modern audit practice because of their potentially high efficiency and effectiveness.

To apply audit analytics to companies’ business data, auditors need to periodically extract the full population of transactions (e.g., purchase orders, invoices, receipts) from the companies’ Enterprise Resource Planning (ERP) systems. To examine such high-volume transactions, Continuous Auditing systems are designed to extract transaction data from ERP systems, test all the transactions, and report exceptions or anomalies in close to real time ([30]; [ ]; [19]). The more frequently the data are extracted, the timelier the audit analytics can be performed, but the more computing and communication resources are consumed ([25]). An alternative approach is to embed audit modules in ERP systems and provide external auditors with direct access to the transaction data in the operational databases ([16]). However, queries that select data from the operational database and perform sophisticated audit analytics can easily overload ERP systems and disrupt regular transaction processing ([11]). To conduct real-time, continuous audit analytics using computationally costly machine-learning algorithms, it is necessary to build a high-speed, high-volume data processing infrastructure that supports continuous audit analytics.

The in-memory columnar database system provides such an infrastructure: it supports high-speed data analytics by using main memory as the primary storage and by storing all values of an attribute contiguously in a block ([15]; [27]). With the rapid improvement of computing and information technology, a new generation of ERP[ ] based on the in-memory columnar database (e.g., the SAP S/4HANA cloud ERP) is starting to influence today’s practice of business intelligence and data analytics. For example, based on an in-memory columnar database, SAP S/4HANA enables high-speed, real-time data analytics for procurement and production decisions. Many business intelligence applications based on machine-learning algorithms can efficiently predict sales volumes and price trends. The in-memory columnar database can potentially provide fast data storage and access (i.e., writes and reads) and enable data analytics (e.g., deep learning, regression, classification, and clustering).

This study introduces the architecture of the modern in-memory columnar database system and proposes a design that applies it to high-speed continuous audit analytics. In order to evaluate the performance of the in-memory columnar database system for continuous audit analytics, this study conducts simulation tests to measure the computational overhead in comparison with a conventional relational database system. This study is organized as follows: Section II motivates this research and reviews the prior studies of continuous auditing, ERP, and data warehousing. Section III introduces the architecture of in-memory computing and columnar storage, and Section IV proposes a framework for applying the in-memory columnar database system for continuous audit analytics. Section V describes a prototype and evaluates the performance of in-memory computing and columnar storage. Section VI discusses the deployment of the proposed architecture in cloud computing. Finally, Section VII discusses further research opportunities and summarizes the contributions of this research.

ERP systems have become the center of modern enterprise infrastructure, providing business event recording, message communication, and data storage. In general, an enterprise information system has to perform two data processing tasks: online transactional processing (OLTP) and online analytical processing (OLAP). OLTP processes individual transactions, such as purchase order and cash disbursement transactions, whereas OLAP is a decision-support task that processes summarized and aggregated data for analytical queries. Usually, a company’s transaction data are continuously generated in the OLTP system and periodically extracted for OLAP or loaded into an auditor’s external database. An auditor performs audit analytics on the company’s business and financial data and expresses an independent opinion on whether the financial statements are presented fairly in all material respects. Audit quality and timeliness rely primarily on data quality and extraction frequency. The higher the frequency of extracting the data, the timelier the audit analytics can be performed, and the timelier exceptions and anomalies can be detected.
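To make the two workloads concrete, the following minimal R sketch (using the DBI and RSQLite packages) contrasts an OLTP-style statement, which records a single purchase order, with an OLAP-style statement, which aggregates over the whole table. The table and column names are illustrative assumptions, not the schema of any particular ERP system.

```r
library(DBI)

# Embedded SQLite database with a hypothetical purchase_orders table.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE purchase_orders (po_id INTEGER, vendor_id INTEGER, amount REAL, po_date TEXT)")

# OLTP: record a single business event (one narrow write).
dbExecute(con,
  "INSERT INTO purchase_orders VALUES (?, ?, ?, ?)",
  params = list(1L, 42L, 1250.00, "2017-01-15"))

# OLAP: answer an analytical question over the whole table (one wide scan and aggregate).
dbGetQuery(con, "SELECT vendor_id, SUM(amount) AS total_ordered
                   FROM purchase_orders
                  GROUP BY vendor_id")

dbDisconnect(con)
```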

However, the extraction process cannot be performed continuously, because too-frequent extractions would be costly and would disturb regular transaction processing ([11]). Real-time reporting and continuous audit analytics require the immediate availability of transaction data to support real-time decision-making. The need for timeliness and speed when conducting sophisticated audit analytics (e.g., artificial intelligence [AI], deep learning) motivates this study to take advantage of the latest information technology breakthroughs. By utilizing in-memory columnar databases for continuous audit analytics, the periodic extraction process can be replaced by real-time, continuous analysis executed on the enterprise platform. Fast data aggregation for generating financial reports and performing audit analytics can therefore significantly improve audit efficiency and effectiveness. Furthermore, information security is easier to protect if the original data remain in the company’s data centers rather than being extracted by the auditor. This study introduces the in-memory columnar database system and proposes a framework for applying it to high-speed continuous audit analytics.

Since [30] developed the first practical continuous audit systems, research on continuous auditing has progressed with numerous advances, such as innovations in continuity equations ([19]) and exception and anomaly detection ([17]). Many studies document implementations of continuous monitoring ([ ]) and continuous auditing ([ ]; [33]). Furthermore, the applications of continuous auditing in ERP systems ([20]) and using XML ([23]) have been examined, and the economics of continuous assurance ([ ]) has been discussed in prior studies. The in-memory columnar database system brings new opportunities for continuous auditing in the age of Big Data and high-speed computing.

Relational databases were proposed in computer science in 1970 and have held the attention of the information technology domain for decades ([28]). Initially motivated by managers’ need for timely insight into the company’s business situation, information system engineers used relational databases to create inventory management systems, sales and marketing systems, and supply chain management (SCM) systems. Subsequently, all systems within an enterprise became integrated into an ERP system ([24]), in which each functional system serves as a software module, such as the CRM, HR, and accounting and finance modules. The Resource-Event-Agent (REA) ontology is an ISO[ ] standard framework (ISO/IEC 15944-4:2015) based on the relational database that has been widely adopted in designing accounting information systems ([21]). [21] proposed the REA ontology to describe business events and to design business event data storage. Figure 1 shows an example of an existing ERP system that includes base tables, an event log, materialized files, and an analytical data warehouse. The base tables contain the unfiltered transaction data and master data, the event log records historical data, and the conclusion materialization tables store derived accounting artifacts (e.g., ledgers and accruals).

FIGURE 1. An ERP System Example. Source: [27].
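As a rough illustration of the structure in Figure 1, the sketch below creates simplified, hypothetical versions of an REA-style event (base) table and a conclusion materialization table in R with SQLite. The table and column names are assumptions made for the example, not the schema of any actual ERP system.

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Base (event) table: one row per economic event, REA-style
# (resource, event, and agent are all represented).
dbExecute(con, "
  CREATE TABLE purchase_event (
    event_id   INTEGER PRIMARY KEY,
    vendor_id  INTEGER,   -- agent
    item_id    INTEGER,   -- resource
    amount     REAL,
    event_date TEXT
  )")

# Conclusion materialization table: derived artifacts such as vendor balances.
dbExecute(con, "
  CREATE TABLE vendor_balance (
    vendor_id INTEGER PRIMARY KEY,
    balance   REAL
  )")

dbDisconnect(con)
```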

In order to support ad hoc complex queries for data analytics, a data warehouse is designed and implemented separately from the operational database (i.e., the base tables and master tables). Figure 2 shows that the data warehousing technique extracts and integrates data from multiple sources and stores them as data cubes to support sophisticated analytical queries. An operational database is optimized to support OLTP workloads, while a data warehouse is optimized to support OLAP workloads, because “to execute complex OLAP queries against the operational databases would result in unacceptable performance” ([10]). For example, if an auditor directly and continuously uses the transaction data from all point-of-sale (POS) machines to predict a company’s revenue, then the computation of revenue estimates will place a significant additional burden on the operational database and may even interrupt regular transaction data processing.

FIGURE 2. Data Warehousing Architecture. Source: [10].

Table 1 shows the differences between OLTP and OLAP. As business needs have grown more complex, database design has focused more on OLTP than on OLAP. OLTP data are arranged in rows that are stored in blocks, so indexing does not work well when the amount of requested data increases. OLAP data warehouses organize data in star schemas, and a popular optimization is to compress attributes (columns) with the help of dictionaries. More recently, the use of columnar databases for analytics has become quite popular ([27]).

TABLE 1. Differences between OLTP and OLAP.

The rapid improvement of enterprise information systems is enabled by two major information technology breakthroughs: high-performance Massively Parallel Processing (MPP) and unprecedented growth in the amount of available main memory. MPP is the key technology that supports Big Data processing and analysis. In order to process complex queries on high-volume data promptly, MPP breaks a large dataset into manageable chunks and assigns them to many processors ([ ]). It then combines the results from all assigned processors through a communication interface. With fast processing techniques based on multi-core CPUs, it becomes possible to handle large datasets and execute sophisticated queries efficiently.
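The following R sketch imitates the MPP map-and-combine pattern on a single machine using the parallel package: a large vector of transaction amounts is split into chunks, each chunk is aggregated by a separate worker, and the partial results are combined. The chunk count and the use of mclapply (which forks worker processes on Unix-like systems; on Windows mc.cores must be 1) are illustrative assumptions.

```r
library(parallel)

set.seed(1)
amounts <- runif(1e7, min = 10, max = 500)   # a large set of transaction amounts

# "Break up" the dataset into manageable chunks ...
n_chunks <- 8
chunks   <- split(amounts, cut(seq_along(amounts), n_chunks, labels = FALSE))

# ... assign each chunk to a worker, then combine the partial results.
# (mclapply forks on Unix-like systems; set mc.cores = 1 on Windows.)
partial_sums <- mclapply(chunks, sum, mc.cores = n_chunks)
total_amount <- Reduce(`+`, partial_sums)
```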

In the architecture described by [32], the main memory is the primary storage and the magnetic disk is the secondary storage. In a conventional database management system (DBMS), data reside permanently in the secondary storage (i.e., the hard disk). When disk data are needed, they are first cached in the primary storage (i.e., main memory) so that the CPUs can access them quickly. Table 2 shows the different properties of main memory and magnetic disk. In general, data stored in main memory can be accessed and processed orders of magnitude faster than data stored on a magnetic disk. However, main memory has much less capacity to store data than a magnetic disk,[ ] and it requires uninterruptible power supplies to maintain the integrity of stored data because of its volatility.

TABLE 2. Properties of Main Memory and Magnetic Disk.

With recent breakthroughs in computing technology, the availability of vast amounts of main memory makes it possible to store Big Data in primary storage for fast access by processors ([22]). Moreover, new storage products, such as solid-state drives (SSDs), provide faster and more reliable alternatives for secondary storage. In a modern in-memory database system (IMDB), data reside permanently in main physical memory. Because multi-core CPUs can directly access data in main memory, an IMDB achieves better response times and transaction throughput ([27]).

To address the volatility of main memory when power is lost, an IMDB writes snapshots of main memory to disk at frequent intervals. Between snapshots, the IMDB keeps a log of changes on various secondary storage devices (e.g., magnetic disk, flash drive, SSD). If the power is interrupted, the IMDB can quickly restore the last snapshot and replay the change log to ensure data consistency. An IMDB is also configured with hard disk drives (HDDs) that are used only for transaction logging and for snapshots for data backup. Although a given dataset still has copies in both main memory and on magnetic disk under both a conventional DBMS and a modern IMDB, the two have different optimization schemas for database design[ ] ([15]), which distinguishes the IMDB from the DBMS with regard to high-speed data access and processing.

Although a conventional DBMS stores data tables by rows, an increasing number of new database systems, including IMDBs, are configured to store data tables by columns. Figure 3 compares row-oriented storage and columnar storage. A relational database system stores data represented as two-dimensional tables arranged in rows and columns. In this example, the table has three attributes/columns (i.e., Case ID, Amount, and Agent) and three records/rows. In the physical implementation, the values in all cells are serialized and stored in fixed-size blocks. Row-oriented storage refers to the arrangement in which all attributes of a record are stored contiguously in a block, while column-oriented storage refers to the arrangement in which all values of an attribute are stored together in a block.

FIGURE 3. Row-oriented Storage and Columnar Storage. Source: [ ].
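The difference in physical layout can be made concrete in a few lines of R. Using the three-record example from Figure 3 (Case ID, Amount, Agent), the sketch below serializes the same table once row by row and once column by column; the cell values are illustrative, and only the order of the cells in the resulting “block” differs.

```r
# The example table from Figure 3: three records (rows), three attributes (columns).
tab <- data.frame(case_id = c(1, 2, 3),
                  amount  = c(100, 250, 75),
                  agent   = c("A", "B", "A"),
                  stringsAsFactors = FALSE)

# Row-oriented layout: all attributes of a record are stored contiguously.
row_block <- unlist(lapply(seq_len(nrow(tab)),
                           function(i) unlist(lapply(tab[i, ], as.character))))

# Column-oriented layout: all values of an attribute are stored contiguously.
col_block <- unlist(lapply(tab, as.character))

row_block   # "1" "100" "A"   "2" "250" "B"   "3" "75" "A"
col_block   # "1" "2" "3"   "100" "250" "75"   "A" "B" "A"
```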

Since the values of each attribute are stored contiguously in columnar storage, the compression efficiency is usually four to five times that of row-oriented storage ([ ]). Moreover, a sophisticated analytical query can be answered quickly because data aggregation in columnar storage outperforms row-oriented storage, especially when the number of data items is large. Join operations can also be performed directly on columns ([ ]). In addition, new join algorithms have been introduced for columnar storage, such as “FlashJoin”[ ] ([29]) and the “Invisible Join”[ ] ([ ]), which optimize the join performance of columnar storage and achieve substantial advantages over row-oriented storage. Apart from the arrangement of data in storage, data operations themselves are either row-style or column-style. Figure 4 shows data operations by row and by column on both row storage and column storage. In column storage, a query that selects all attributes of a record (a row operation) must visit all blocks, whereas a query that selects all values of an attribute (a column operation) needs to visit only a single block. Conversely, in row storage, a query that selects all attributes of a record needs to visit only a single block, whereas a query that selects all values of an attribute must visit all blocks.

FIGURE 4. Data Operation by Row and Column. Source: [27].
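A minimal sketch of one reason columnar storage compresses so well: repeated values within a column can be replaced by small integer codes against a dictionary, which is essentially what dictionary encoding does (and what R’s factor type implements). The compression ratios quoted above come from the cited literature; the example below only illustrates the mechanism with made-up values.

```r
# A column with many repeated values (e.g., agent/vendor names).
agent <- c("Alpha Co", "Beta LLC", "Alpha Co", "Alpha Co", "Beta LLC", "Gamma Inc")

# Dictionary encoding: store each distinct value once, plus a small integer code per row.
dict  <- unique(agent)        # "Alpha Co"  "Beta LLC"  "Gamma Inc"
codes <- match(agent, dict)   # 1 2 1 1 2 3

# The original column is fully recoverable from the dictionary and the codes.
identical(dict[codes], agent)   # TRUE
```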

Theoretically, the columnar database is a read-optimized, analytics-optimized system for OLAP applications, and the row-stored database is a write-optimized, transaction-optimized system for OLTP applications. Usually, operational databases store indexed data by rows on disk to be cached in main memory, and data warehouses use star schemas or column storage to compress and aggregate attributes. Although the columnar database is costly for row operations (i.e., inserting or updating a record), recent findings show that the combination of IMDB and columnar storage works best for multi-core CPUs and update-intensive applications ([27]). Many DBMS vendors (e.g., Microsoft)[ ] have developed columnar indexes that can be used to perform real-time data analytics on an OLTP workload. “Having all data in main memory greatly improves the data update performance” of column stores ([27]). By applying columnar storage schemas to the IMDB, the new in-memory columnar database is superior to the row-oriented DBMS with regard to computing efficiency and memory consumption.

Instead of separating audit analytics from the enterprise information system, an in-memory columnar database system enables auditors to build analytical applications directly on top of the transaction data for real-time, continuous audit analytics. This approach was infeasible with a conventional DBMS, which necessitated a separate data warehouse, because OLAP workloads could overwhelm the operational database and disturb regular OLTP applications. With an in-memory columnar database, audit analytics applications can be created on the operational database for high-speed data access and aggregation. Columnar data are stored in a highly compressed format to improve the efficiency of memory usage and to speed up data transfer from main memory to the processors. Moreover, compression also tremendously increases the speed of aggregation for data analytics.[ ]

In general, transaction data and master data are accessed more often than historical data. Transaction data, such as purchase orders or cash disbursements, are frequently visited for transaction processing or data analysis, and master data, such as goods inventory and employee information, are also frequently accessed for transaction processing and balance updating. Therefore, transaction data and master data stored in main memory in a columnar format can benefit from highly optimized access and a high compression rate ([14]). Historical data, such as prior-year event logs, should be placed on the hard disk by columns.

After analyzing data access frequency, this study designs an enterprise information system architecture for continuous audit analytics. An underlying economic assumption is that both main memory and hard disk storage are available but limited, and an underlying technical assumption is that the slowest part of preprocessing Big Data is reading from the hard disk into main memory and writing from main memory back to the hard disk. Since main memory is more expensive than the hard disk, it is used to store the OLTP transaction data and master data. Because columnar storage is much more efficient at data aggregation, the OLTP transaction data and master data are stored by columns. Historical OLTP data can be stored on the hard disk, also by columns, if main memory is insufficient. Logging and recovery files for data backup should also reside in secondary storage. As shown in Figure 5, Panel B, continuous audit analytics can be designed as an application layer that communicates with the “single fact” database in application-to-database round trips, continuously sending sophisticated analytical queries to, and receiving results from, the in-memory columnar database.

FIGURE 5, Panel A. In-Memory Columnar Database Configuration.

FIGURE 5, Panel B. Data Analytics Apps Configuration.
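One way to express the proposed data placement in a prototype is to route hot tables (transaction and master data) to an in-memory connection and cold tables (historical logs and backups) to a disk-backed connection. The sketch below uses RSQLite for both tiers purely to show the routing idea; in the proposed design the hot tier would be an in-memory columnar engine, and the route_table helper is a hypothetical function introduced only for this example.

```r
library(DBI)

# Hot tier: OLTP transaction data and master data, kept in main memory.
mem_con  <- dbConnect(RSQLite::SQLite(), ":memory:")

# Cold tier: historical event logs and backup/recovery files, kept on disk.
disk_con <- dbConnect(RSQLite::SQLite(), "historical_store.sqlite")

# Hypothetical routing helper: write a table to the tier matching its access pattern.
route_table <- function(df, name, hot = TRUE) {
  dbWriteTable(if (hot) mem_con else disk_con, name, df, overwrite = TRUE)
}

route_table(data.frame(vendor_id = 1:3, balance = rep(0, 3)), "vendor_balance", hot = TRUE)
route_table(data.frame(event_id = 1:3, fiscal_year = 2015),   "event_log_2015", hot = FALSE)
```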

The higher the level of computational complexity, the more frequent the data aggregation, and the more companies and auditors will benefit from using the in-memory columnar database. Table 3 shows the computational complexity levels of different audit analytical techniques.[ ] With highly compressed storage and efficient data processing, this architecture can support Big Data applications, such as sophisticated machine-learning (e.g., deep-learning) algorithms, on high-volume and unstructured datasets (e.g., video, images, and text).

TABLE 3. Computational Complexity Levels of Audit Analytical Techniques.

Because all operational data reside in main memory, conclusion materialization can be performed on the fly with high efficiency. In this case, no data warehouse needs to be created separately from the operational database, because selection, aggregation, and analysis can be performed very efficiently in main memory. To sum up, storing data in main memory substantially speeds up data processing, and columnar storage provides high-rate compression to conserve main memory as well as fast aggregation in response to complex analytical queries.

This study builds a prototype and evaluates its computational performance by conducting simulation tests to validate the proposed design of utilizing the in-memory columnar database for continuous audit analytics. This study uses R software[10] as the primary platform for the continuous audit analytics simulation, the “SQLite”[11] library as the DBMS for the disk-based and in-memory row-oriented database simulations, the “MonetDB”[12] library as the DBMS for the disk-based and in-memory columnar database simulations, and the “microbenchmark”[13] library as the nanosecond-accurate timing function for relative performance comparison. This study simulated and compared the performance of four different database systems: (1) a disk-based row-oriented database (the infrastructure of a conventional ERP system), (2) an in-memory row-oriented database, (3) a disk-based columnar database, and (4) an in-memory columnar database (the infrastructure of the proposed design supporting continuous audit analytics). The simulation tests focused on comparing the performance of disk-based row-oriented storage with that of in-memory columnar storage.
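A condensed sketch of how such a test bed might be wired up in R is shown below: DBI connections to a disk-based and an in-memory SQLite instance (row-oriented) and to MonetDB through the MonetDBLite package (columnar). How MonetDBLite is configured for disk-backed versus in-memory operation depends on the package version, so the two columnar connections below should be treated as assumptions to be adapted to the installed package.

```r
library(DBI)
library(microbenchmark)

# Row-oriented databases (SQLite): disk-based vs. in-memory.
row_disk <- dbConnect(RSQLite::SQLite(), "row_disk.sqlite")
row_mem  <- dbConnect(RSQLite::SQLite(), ":memory:")

# Columnar databases (MonetDB via MonetDBLite): the exact arguments for disk-backed
# vs. in-memory operation are version-dependent, so treat these two lines as assumptions.
col_disk <- dbConnect(MonetDBLite::MonetDBLite(), "col_disk_dir")
col_mem  <- dbConnect(MonetDBLite::MonetDBLite())

dbs <- list(row_disk = row_disk, row_mem = row_mem,
            col_disk = col_disk, col_mem = col_mem)
```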

This study first evaluated the computational performance of operational transaction processing. We simulated a large sample of daily procurement transaction data, including the dollar amounts of Purchase Orders and Vouchers and the shipment quantities of Receipts, to be written to each of the four evaluated database systems. The number of simulated transactions inserted into the transaction tables ranges from 1,000,000 to 10,000,000. Figure 6 and Figure 7 show the average computation time of processing a large number of transactions in the four database systems and demonstrate the higher efficiency of transaction processing in the in-memory columnar database relative to the other databases. The simulation first measures the average computation time of inserting a single transaction into each database for daily transaction processing. As shown in Figure 6, the computation time of inserting a single transaction into each of the four database systems remains relatively constant as the number of transactions increases. The results show that the in-memory columnar database performs much faster than the other three databases, demonstrating higher efficiency in adding a new transaction to the in-memory columnar database system.[14]
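A hedged sketch of that measurement: microbenchmark times a single-row INSERT against each of the connections in the dbs list from the previous sketch. The table name, column set, and simulated values are illustrative.

```r
library(DBI)
library(microbenchmark)

# Create the same simple transaction table in every database (schema is illustrative);
# assumes the `dbs` list of connections from the previous sketch.
for (con in dbs) {
  dbExecute(con, "CREATE TABLE purchase_orders (po_id INTEGER, vendor_id INTEGER, amount REAL)")
}

insert_one <- function(con) {
  dbExecute(con, sprintf("INSERT INTO purchase_orders VALUES (%d, %d, %.2f)",
                         sample(1e6, 1), sample(100, 1), runif(1, 10, 5000)))
}

# Average computation time of inserting a single transaction into each database.
insert_timings <- lapply(dbs, function(con) microbenchmark(insert_one(con), times = 100))
```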

FIGURE 6. Computation Time of Insertion.

FIGURE 7. Computation Time of Update.

Next, we simulated a large set of supplier master data, including Vendor IDs and the dollar amounts of Vendor Balances, to be manipulated in each of the four database systems. The number of simulated supplier master records ranges from 10,000 to 100,000. We then simulated the occurrence of 1,000 transactions and updated the 1,000 corresponding Vendor Balances in the master tables. The simulation measures the average computation time of updating a single master balance in each database for daily transaction processing. As shown in Figure 7, the computation time of updating a single master balance in each of the four database systems is relatively constant as the number of transactions increases. The result shows that the in-memory columnar database performs faster than the other three databases, indicating high efficiency in updating the master data. Overall, the simulation tests demonstrate the computational efficiency of the in-memory columnar database in processing transactions relative to the other databases.
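The master-data test can be sketched the same way, timing an UPDATE of a single vendor balance in each database; it again assumes the dbs connections from the earlier sketch and an illustrative schema.

```r
# Assumes the `dbs` connections from the earlier sketches.
for (con in dbs) {
  dbExecute(con, "CREATE TABLE vendor_balance (vendor_id INTEGER, balance REAL)")
  dbExecute(con, "INSERT INTO vendor_balance VALUES (42, 0.0)")
}

update_one <- function(con) {
  dbExecute(con, sprintf(
    "UPDATE vendor_balance SET balance = balance + %.2f WHERE vendor_id = 42",
    runif(1, 10, 5000)))
}

# Average computation time of updating a single master balance in each database.
update_timings <- lapply(dbs, function(con)
  microbenchmark::microbenchmark(update_one(con), times = 100))
```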

Important accounting artifacts associated with journals and ledgers, such as debit and credit balances, receivables, and payables, can be derived from transaction data and master data. The process of derivation is called “conclusion materialization.” For example, the Account Payable for a supplier is obtained as the imbalance between the purchase events and the cash disbursement events for that supplier. To avoid possible inconsistencies in the database, it is preferable to compute conclusion materializations on demand, when the database is queried for the derived numbers. Due to computational constraints in current practice, the derived numbers are usually stored in “Balance” attributes and updated regularly. For instance, when a new transaction occurs, it is inserted into the transaction table, and the vendor’s balance is updated to materialize the conclusion. When queries for the vendor’s Account Payable arrive, the enterprise information system can respond quickly by selecting the derived number from the “Balance” attribute instead of computing the conclusion materialization on the fly.

Due to the rapid improvement and decreasing cost of main memory and MPP, it may become possible to use the in-memory columnar database to implement conclusion materialization on the fly. In this setup, when a new transaction occurs, it is inserted directly into the transaction table. When a vendor’s balance is queried, it can be calculated sufficiently fast as the imbalance between the purchase orders and the payment vouchers for that supplier. The advantages of performing conclusion materialization on the fly are that it separates the recording of primary economic events from the derivation of accounting information and that it keeps enterprise information systems concise and consistent.
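The two approaches can be contrasted with two short queries: the conventional method maintains a stored “Balance” attribute on every transaction, whereas the on-the-fly method derives Account Payable as the difference between purchase orders and payment vouchers at query time. The SQL below is a sketch against illustrative tables, not the schema of any production system.

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE purchase_orders (vendor_id INTEGER, amount REAL)")
dbExecute(con, "CREATE TABLE vouchers        (vendor_id INTEGER, amount REAL)")
dbExecute(con, "CREATE TABLE vendor_balance  (vendor_id INTEGER, balance REAL)")
dbExecute(con, "INSERT INTO vendor_balance VALUES (42, 0.0)")

# Conventional method: store the derived balance and update it with every transaction.
dbExecute(con, "INSERT INTO purchase_orders VALUES (42, 1000.0)")
dbExecute(con, "UPDATE vendor_balance SET balance = balance + 1000.0 WHERE vendor_id = 42")
dbGetQuery(con, "SELECT balance FROM vendor_balance WHERE vendor_id = 42")

# On-the-fly method: derive Account Payable only when it is queried.
dbGetQuery(con, "
  SELECT COALESCE((SELECT SUM(amount) FROM purchase_orders WHERE vendor_id = 42), 0)
       - COALESCE((SELECT SUM(amount) FROM vouchers        WHERE vendor_id = 42), 0)
       AS accounts_payable")

dbDisconnect(con)
```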

We simulated 10,000 to 100,000 supplier master records, 1,000 existing purchase transactions per supplier that had already been entered into the databases, and 1,000 new procurement transactions to be entered into the databases. We measured the average computation time of conclusion materialization. Figure 8 compares the computation time per conclusion materialization between the conventional method and the new method (i.e., materializing the conclusion on the fly) in each of the four evaluated database systems. Panel A shows that the computation time of conventional conclusion materialization in the four database systems is relatively constant as the number of suppliers increases, and the in-memory columnar database performs slightly faster than the other three databases. Panel B shows that the computation time of conclusion materialization with the new method increases linearly with the number of suppliers in all four database systems. It also shows that the in-memory row-oriented database performs best among the four databases when the number of suppliers is less than 50,000, and the in-memory columnar database performs best when the number of suppliers exceeds 50,000. Although comparing Panel A and Panel B shows that, on average, computing conclusion materialization on the fly consumes more time than the conventional method, the in-memory columnar database still performs best under most conditions.

FIGURE 8, Panel A. Conclusion Materialization: Conventional Method.

FIGURE 8, Panel B. Conclusion Materialization: New Method.

Overall, the in-memory columnar database performs relatively better than the other three databases, which demonstrates its applicability for conclusion materialization, even though, at the current stage of technological development, it might not yet be advanced enough to replace the conventional approach with “conclusion materialization on the fly.”

Following the framework developed for a continuous data-level auditing system ([19]), this study simulated a large sample of daily procurement transaction data to implement continuous audit analytics in the four database systems. The implementation of the analytical procedure component of the continuous auditing system requires expectation models (continuity equations)[15] of expected behavior to enable anomaly detection. In a continuity equation model, business process metrics are aggregated over a subset of transactions, such as the daily dollar amount aggregates for the Order process or the weekly transaction quantity for the Receiving process. Because the relationship between the metrics should be probabilistically stable in a continuity equation model, it is possible to estimate an acceptable range of variation for the metrics. For example, if the expected weekly amount of orders ranges between $950,000 and $1,050,000, a week with orders totaling $1,300,000 would be a statistical anomaly that requires an auditor’s further investigation. We adopted the VAR model, which [19] found to be the most flexible of the three probabilistic models utilized in the literature (i.e., the Simultaneous Equation Model, the Vector Autoregressive Model, and the Linear Regression Model), as the continuity equation model for the simulation tests.

We replicated the work of [19], selecting the dollar amounts of Purchase Orders and Vouchers and the shipment quantity summaries in the sample as business process metrics, and simulating 180 days of transactional data spanning October 20, 2016 to April 17, 2017. The transactional data from the first 100 days were used to estimate and train the model, and the data from the remaining days were used to test the model. The R “vars” library was used to estimate the VAR model. We also applied a procedure that restricts the insignificant parameters and retains only the significant ones in the VAR model. One of the estimated VAR models is expressed as:

\begin{equation}
order_t = \beta_1\, order_{t-1} + \beta_2\, order_{t-2} + \beta_3\, receive_{t-1} + \beta_4\, voucher_{t-1} + \varepsilon_o
\end{equation}

\begin{equation}
receive_t = \gamma_1\, receive_{t-1} + \gamma_2\, receive_{t-2} + \gamma_3\, order_{t-1} + \gamma_4\, voucher_{t-1} + \varepsilon_r
\end{equation}

\begin{equation}
voucher_t = \delta_1\, voucher_{t-1} + \delta_2\, voucher_{t-2} + \delta_3\, order_{t-1} + \delta_4\, receive_{t-1} + \varepsilon_v
\end{equation}
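Under the same assumptions as the replication (daily aggregates of orders, receipts, and vouchers, with the first 100 days used for estimation), the continuity-equation step can be sketched with the R “vars” package: fit a two-lag VAR, drop insignificant coefficients with restrict() (the “ser” method is one way to implement the restriction procedure described above), and flag observations that fall outside the predicted interval. The simulated series below are illustrative stand-ins for the benchmark data.

```r
library(vars)

set.seed(7)
n       <- 180
order   <- as.numeric(1000 + 50 * arima.sim(list(ar = 0.5), n))   # daily order aggregates
receive <- 0.95 * c(1000, order[-n])   + rnorm(n, sd = 20)         # receipts lag orders
voucher <- 0.98 * c(1000, receive[-n]) + rnorm(n, sd = 20)         # vouchers lag receipts
metrics <- data.frame(order, receive, voucher)

train <- metrics[1:100, ]

# Estimate the continuity-equation model (a VAR with two lags) and drop
# insignificant coefficients, one way of implementing the restriction step.
ce_model <- restrict(VAR(train, p = 2, type = "const"), method = "ser")

# Predict the acceptable range for the next day and flag an observed value
# outside that range as a statistical anomaly for follow-up.
fc       <- predict(ce_model, n.ahead = 1, ci = 0.95)
band     <- fc$fcst$order[1, c("lower", "upper")]
observed <- metrics$order[101]
anomaly  <- observed < band["lower"] || observed > band["upper"]
```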

The artificial transactions were created and stored in the four databases. The number of simulated suppliers is 100, and the number of artificial transactions per supplier varies from 100 to 3,000 per day. Figure 9 compares the computation time of continuous audit analytics across the four evaluated database systems. This study measures the computation time of querying each database for daily aggregates, estimating continuity equations based on Vector Autoregressive models, and predicting acceptable ranges for anomaly detection. As shown in Figure 9, the computation time of both the disk-based and in-memory row-oriented databases grows linearly with the number of transactions, while that of both the disk-based and in-memory columnar databases remains nearly constant as the number of transactions increases. The figure also shows that the in-memory columnar database performs much faster and more stably than the other three databases, indicating its high efficiency in performing continuous audit analytics. Table 4 presents the experimental data for the four database systems from 1,800,000 to 54,000,000 transactions.

FIGURE 9. Computation Time of Continuous Audit Analytics.

TABLE 4. Experimental Data for the Four Database Systems (1,800,000 to 54,000,000 Transactions).

In-memory computing is highly efficient but relatively costly compared to disk-based computing, because hard disk capacity is much cheaper and is still growing much faster than main memory capacity. Cloud computing could be a viable alternative for deploying in-memory columnar databases for start-ups or seasonal companies, because large amounts of main memory are available and scalable in the cloud, thus providing flexibility. Moreover, cloud vendors provide professional, up-to-date maintenance, which reduces a company’s cost of hiring the few, and expensive, experts in in-memory columnar databases. Figure 10 shows the trade-off between cost and administration.

FIGURE 10. The Trade-Off Between Cost and Administration. Source: Microsoft SQL Server Blog.

According to SAP, “Over 90 percent of businesses are already using cloud technology in a public, private, or hybrid cloud environment.”[16] Cloud adoption can be viewed essentially as a successful application of general outsourcing theory. Cloud computing provides on-demand, high-quality applications and services by centralizing data storage, computing, and transmission ([ ]). Cloud models can be classified as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Software as a Service (SaaS). In the IaaS model, a vendor (such as Amazon Web Services) provides on-demand service from a large shared pool of configurable computing resources. In the PaaS model, the cloud provider delivers a computing platform, typically including an operating system (most often Windows or Linux), a programming language execution environment, a database, and a web server. In the SaaS model, clients have access to various applications. IaaS lets companies “rent” computing resources such as servers, networks, storage, and operating systems on a pay-per-use basis; PaaS provides a cloud platform and tools to help developers build and deploy cloud applications; and SaaS is a way of delivering applications over the internet.

Cloud computing offers an economical way for many companies to deploy the in-memory columnar database for their enterprise information systems. This research proposes implementing in-memory columnar databases on cloud computing to facilitate real-time, continuous audit analytics. Table 5 shows three different options for deploying the in-memory columnar database (i.e., on the public cloud, a hybrid cloud, or the company’s private cloud), as well as the features and applications of each option.

TABLE 5. Deployment Options for the In-Memory Columnar Database.

Even with hardware improvements and the cost of main memory dropping unprecedentedly quickly, it is still more costly to configure a database system with a large main memory (i.e., 2 to 10 TB) than one with traditional disks or even SSDs. A single SAP HANA system can scale up to 6 TB, and a HANA cluster, consisting of a set of connected systems, can scale to more than 112 TB.[17] Thanks to the data compression capability of in-memory databases, a 10 TB HANA system can store as much as 50 to 100 TB of data from a conventional DBMS, “which could represent all the credit card transactions for a top 10 bank for ten years or more.”[18]

Because the significant upfront fixed cost is a difficult hurdle to overcome, deployment on the cloud could be a cost-effective way for a firm to gain fast access to in-memory database systems, making them budget-affordable and easy to implement. Cloud subscribers pay less upfront and shift the cost from a capital expenditure to an operating expense. For example, a company with $1 billion in revenue is likely to have 50-plus applications running at a time.[19] With cloud access, the company relies on cloud infrastructure and services to manage its business data. Deployment on the cloud would be the easiest way for small and medium-sized enterprises (SMEs), or even large firms, to start investing in an IMDB. However, it should be noted that without careful planning and budget control, a company could quickly run up costly bills for the cloud IMDB service.

One of the main benefits of the cloud-based in-memory columnar database is high-speed data analytics. However, the benefit of high efficiency is not easy to measure in monetary terms, as it includes both subjective and objective attributes. Some firms, such as high-frequency trading firms, require high-speed Big Data analytics to support decision-making, while other firms can get by with slower performance. With respect to the cost of real-time analytics in the cloud, firms can be broadly classified into two groups: real-time firms and non-real-time firms. The real-time firms are the potential users of the cloud IMDB, and they can be further classified into SMEs and large enterprises. If an SME chooses Database as a Service from a cloud vendor, it requires much less initial investment than running the system on its own, which involves an initial lump-sum expenditure on hardware, database licenses, and personnel. For a large firm, the trade-off between fixed and variable costs may be different: it may turn out to be less costly overall to host an on-premises private cloud rather than to pay a monthly or yearly bill for a public cloud service. Cloud vendors have many services available for firms that do not have the budget for a significant upfront investment and prefer “paying as you go.”

This study introduces recent breakthroughs in the information technology domain (i.e., MPP and large-capacity main memory) and the new in-memory columnar database systems. It proposes a design for a highly efficient and effective in-memory columnar database system architecture for continuous, real-time audit analytics. Furthermore, this study demonstrates a prototype and conducts simulation tests to measure its computational performance in comparison with a conventional relational database in order to evaluate the proposed artifact. The simulation tests show great promise in using the in-memory columnar database for continuous audit analytics, which is a significant departure from the architecture of existing enterprise information systems. Compared to a traditional enterprise information system, the in-memory columnar database architecture provides high-speed data access and aggregation (i.e., ten times the query performance of traditional row-oriented storage).[20] Therefore, it will massively speed up the computational processing of audit analytics.

The contributions of this study are fourfold. First, it is the first study to apply this emerging database system architecture to support continuous audit analytics. Second, it validates the high efficiency and effectiveness of using the in-memory columnar database for transaction processing, conclusion materialization, and sophisticated analytical queries by comparing its computation time with that of the conventional disk-based row-oriented database. Third, it demonstrates that the most sophisticated audit analytics algorithms benefit the most from utilizing the in-memory columnar database. Fourth, it discusses deployment in public, private, and hybrid clouds. One limitation of this study is that we were not able to run concurrent read and write workloads on the same database. In future research, we plan to conduct more simulation tests that measure computational performance by comparing the in-memory columnar database with a conventional DBMS in the context of Big Data.