The term smart manufacturing refers to a future-state of manufacturing, where the real-time transmission and analysis of data from across the factory creates manufacturing intelligence, which can be used to have a positive impact across all aspects of operations. In recent years, many initiatives and groups have been formed to advance smart manufacturing, with the most prominent being the Smart Manufacturing Leadership Coalition (SMLC), Industry 4.0, and the Industrial Internet Consortium. These initiatives comprise industry, academic and government partners, and contribute to the development of strategic policies, guidelines, and roadmaps relating to smart manufacturing adoption. In turn, many of these recommendations may be implemented using data-centric technologies, such as Big Data, Machine Learning, Simulation, Internet of Things and Cyber Physical Systems, to realise smart operations in the factory. Given the importance of machine uptime and availability in smart manufacturing, this research centres on the application of data-driven analytics to industrial equipment maintenance. The main contributions of this research are a set of data and system requirements for implementing equipment maintenance applications in industrial environments, and an information system model that provides a scalable and fault tolerant big data pipeline for integrating, processing and analysing industrial equipment data. These contributions are considered in the context of highly regulated large-scale manufacturing environments, where legacy (e.g. automation controllers) and emerging instrumentation (e.g. internet-aware smart sensors) must be supported to facilitate initial smart manufacturing efforts.
A 2011 report on big data authored by McKinsey Global Institute, an economic and business research arm of McKinsey and Company, highlighted big data analytics as a key driver in the next wave of economic innovation [1]. However, the report suggests that this innovation may be impeded by a shortage of personnel with the skills needed to derive insights from big data, with demand in the US predicted to double between 2008 and 2018. This prediction seems credible when current data growth estimates are considered, with one estimate suggesting that the world’s data is doubling approximately every 1.5 years [2], and another estimate proposing that 2.5 quintillion bytes of data are being produced each day [3]. This exponential growth in data can be attributed to a number of technological and economic factors, including the emergence of cloud computing, increased mobile and electronic communication, as well as the overall decreased costs relating to compute and data resources. In addition, emerging technology paradigms, such as the internet of things (IoT), which focus on embedding intelligent sensors in real-world environments and processes, will result in further exponential data growth. In 2011 it was estimated that more than 7 billion interconnected devices were in operation, which was greater than the world’s population at that time. However, given the potential applications of IoT across numerous sectors and industries, including manufacturing, engineering, finance, medicine, and health, the number of interconnected devices in circulation is expected to rise to 24 billion by 2020 [4]. Therefore, given the anticipated shortage of personnel who are capable of managing this exponential data growth, there is a need for tools and frameworks that can simplify the process.
As big data analytics permeates different sectors, the tools and frameworks that are needed to address domain-specific challenges will emerge. For example, modern large-scale manufacturing facilities utilise sophisticated sensors and networks to record numerous measurements in the factory, such as energy consumption, environmental impact and production yield. Given the existence of such data repositories, these facilities should be in a position to leverage big data analytics. However, a number of domain-specific challenges exist, including diverse communication standards, proprietary information and automation systems, heterogeneous data structures and interfaces, as well as inflexible governance policies regarding big data and cloud integration. These challenges, coupled with the lack of inherent support for industrial devices, make it difficult for mainstream big data tools and methods (e.g. Apache Hadoop and Spark) to be directly applied to large-scale manufacturing facilities. Although some of the aforementioned challenges are addressed by different commercial tools, their scope is typically limited to data (e.g. energy and environmental) that is needed to feed a particular application, rather than facilitating open access to data from across the factory. To address these constraints, as well as many more, a new interdisciplinary field known as smart manufacturing has emerged. In simple terms, smart manufacturing can be considered the pursuit of data-driven manufacturing, where real-time data from sensors in the factory can be analysed to inform decision-making. More generally, smart manufacturing can be considered a specialisation of big data, whereby big data technologies and methods are extended to meet the needs of manufacturing. Other prominent technology themes in smart manufacturing include machine learning, simulation, internet of things (IoT) and cyber physical systems (CPS).
The application of big data has been demonstrated in different areas of manufacturing, including production, supply chain, maintenance and diagnosis, quality management, and energy [5]. This paper focuses on maintenance and diagnosis because of the role it plays in promoting machine uptime, as well as the potential impact it can have on operating costs, with some estimates claiming equipment maintenance can exceed 30 % of total operating costs, or between 60 and 75 % of equipment lifecycle cost [6]. Equipment maintenance is an important component of smart manufacturing. Firstly, smart manufacturing revolves around a demand-driven, customer-focused and highly-optimised supply chain. Given the dynamic and optimised nature of such a supply chain, there is an implied dependency on machine uptime and availability. Secondly, smart manufacturing promotes energy and environmentally efficient production. The amount of energy used by equipment can increase if it is operating in an inefficient state (e.g. increased range of motion). Thirdly, smart manufacturing aims to maximise production yield. Machinery that is not functioning as per its design specification may negatively impact production yield (e.g. scrapped product). Finally, equipment maintenance can have an overall positive impact on capital costs. The lifetime of machinery may be enhanced by limiting the number of times it enters a state of disrepair, while on-going costs may be reduced by using predictive and preventative maintenance strategies to optimise the scheduling of maintenance activities.
This paper presents an industrial big data pipeline architecture, which is designed to meet the needs of data-driven industrial analytics applications focused on equipment maintenance in large-scale manufacturing. It differs from traditional data pipelines and workflows in its ability to seamlessly ingest data from industrial sources (e.g. sensors and controllers), co-ordinate data ingestion across networks using remote agents, automate the mapping and cleaning process for industrial sources of time-series data, and expose a consistent data interface on which data-driven industrial analytics applications can be built. The main contributions of this research are the identification of information and data engineering requirements which are pertinent to industrial analytics applications in large-scale manufacturing, and the design of a big data pipeline architecture that addresses these requirements—illustrating the full data lifecycle for industrial analytics applications in large-scale manufacturing environments, from industrial data integration in the factory, to data-driven analytics in the cloud. Furthermore, this research provides big data researchers with an understanding of the challenges facing big data analytics in industrial environments, and informs interdisciplinary research in areas such as engineering informatics, control and automation, and smart manufacturing.
The term smart manufacturing refers to a data-driven paradigm that promotes the transmission and sharing of real-time information across pervasive networks with the aim of creating manufacturing intelligence throughout every aspect of the factory [7–10]. Experts predict that smart manufacturing may become a reality over the next 10–20 years. The objective of smart manufacturing is similar to that of traditional manufacturing and business intelligence, which focuses on the transformation of raw data to knowledge. In turn, this knowledge can have a positive effect on operations by promoting better decision-making. However, smart manufacturing can be delineated from traditional manufacturing intelligence given its extreme focus on real-time collection, aggregation and sharing of knowledge across physical and computational processes, to produce a seamless stream of operating intelligence [11]. In simple terms, smart manufacturing can be considered an intensified application of manufacturing intelligence, where every aspect of the factory is monitored, optimised and visualised [7]. More expansive descriptions of smart manufacturing and its constituent parts can be found in [11, 12]. The level of digitalisation derived through smart manufacturing can facilitate radical transformations, such as;
While these high-level transformations may seem achievable at first glance, it is generally accepted that the practicalities of smart manufacturing adoption are simply too complex for any single organisation to address [7]. Therefore, a number of groups and initiatives have emerged in recent years to address the challenges and support the adoption of smart manufacturing.
There are currently a number of government, academic and industry groups focused on smart manufacturing, the most prominent of which include the Smart Manufacturing Leadership Coalition (SMLC) [11], Technology Initiative SmartFactory [13], Industry 4.0 [14], and the Industrial Internet Consortium (IIC). The emergence of these initiatives stemmed from the realisation that smart manufacturing is simply too large for any single organisation to address [7], and while the terminology used across these initiatives may differ, they share an overarching vision of real-time, digitalised and data-driven smart factories, which rely on sophisticated simulation and analytics to optimise operations.
The most prominent initiatives found in research are the SMLC and Industry 4.0, with each initiative loosely connected to their geographical origins (i.e. US and EU). The SMLC working group is comprised of academic institutions, government agencies and industry partners. These diverse perspectives are an important characteristic of the SMLC as it ensures challenges relevant to the wider community are addressed. While technology roadmaps, recommendations and guidelines have been central to the SMLC’s activities to date, they have also been involved in the development of a technology platform that implements these recommendations. Industry 4.0 is a high-tech strategy developed by the German government to promote smart manufacturing and the benefits that can be derived by the greater economy. The term Industry 4.0 can be considered a naming convention that serves to partition each industrial revolution, with 4.0 referring to an anticipated fourth revolution (i.e. smart manufacturing) that experts anticipate will come to pass in the next 10–20 years. Expanding on this naming convention further, previous industrial revolutions can be labelled 1.0, 2.0 and 3.0. The first revolution (Industry 1.0) was brought about by the use of water and steam power to enable mechanical production, with the first mechanical loom employed in 1784. The second revolution (Industry 2.0) was brought about by the use of electricity to realise mass production, which in turn promoted the division of labour in production processes. Finally, the third revolution (Industry 3.0) was brought about by advances in electronics and information systems that enabled control networks to automate the production process, with the first programmable logic controller (PLC) introduced in 1969.
Real-time and internet-aware pervasive networks, as well as highly integrated and intelligent data-driven analytics applications, are central to smart manufacturing. This highly measured and digitalised environment enables facilities to derive the knowledge needed to realise highly efficient and customised demand-driven supply chains, from the acquisition of raw materials, through to the delivery of the final product to the customer. In addition, smart manufacturing addresses many business and operating challenges that exist today, such as increasing global competition and rising energy costs, while also creating shorter production cycles that can quickly respond to customer demand [11, 15]. The SMLC outline several performance targets that relate to the aforementioned benefits, including (1) 30 % reduction in capital intensity, (2) up to 40 % reduction in product cycle times, and (3) overarching positive impact across energy, emissions, throughput, yield, waste, and productivity. Furthermore, smart manufacturing adoption can also benefit the greater economy. Recent research produced by the Fraunhofer Institute and Bitkom highlighted the potential benefits of Industry 4.0 to the German economy, where they estimated the transformation of factories to Industry 4.0 could be worth up to 267 billion Euros to the German economy by 2025 [16].
The progression to smart manufacturing can be decomposed into three distinct phases, with each phase deriving benefits that are exponentially greater than the last [17]. These sequential phases provide a high-level view of the journey to smart manufacturing adoption;
To surpass the benefits of current manufacturing intelligence systems, facilities must navigate each phase in the smart manufacturing roadmap. However, the early stages of adoption can be particularly challenging due to the complexity of industrial data integration, which may result in the effort-to-benefit ratio being perceived as low. Generally, the potential benefits from each phase (1–3) move from low to high. This is in contrast with the effort required at each phase, with significant effort required to integrate data, and less effort needed during process innovation. Therefore, in the first phase facilities should expect significant effort and cost, with modest gains in operational intelligence. However, the effort in each subsequent phase is reduced given that residual technologies, knowledge and skills are carried forward from the previous phase.
There are numerous challenges facing smart manufacturing adoption. These challenges include the development of infrastructures to support real-time smart communication, as well as the cultivation of multidisciplinary workforces and next-generation IT departments that are capable of working with these technologies [11]. The extent to which these challenges exist will vary from facility to facility. For example, there is a distinct difference in the challenges facing greenfield and brownfield sites [7]. Apart from budgetary constraints, technology availability, and the presence of a skilled workforce, greenfield sites (i.e. new manufacturing facilities) can choose to adopt smart technologies without significant impediments. This is in contrast with brownfield sites that are encumbered by the existence of legacy devices, information systems, and protocols, some of which may be proprietary. Furthermore, many of these legacy technologies were not designed to operate across low-latency distributed real-time networks that are synonymous with smart manufacturing. Although legacy technologies could be replaced with smarter equivalents in other business domains, there are numerous reasons why substitution may not be a viable option in industrial facilities. A summary of these impediments is provided below;
These impediments address many of the high-level challenges that can be expected when transitioning to smart manufacturing. Inevitably, additional challenges are likely to emerge as different areas in the factory are explored (e.g. energy, production, maintenance, etc.). Recent research focused on big data technologies in manufacturing, which is closely related to smart manufacturing efforts, suggests that there are eight areas in manufacturing where data-driven methods are being explored [5]—these are (1) process and planning, (2) business and enterprise, (3) maintenance and diagnosis, (4) supply chain, (5) transport and logistics, (6) environmental, health and safety, (7) product design, and (8) quality management.
Maintaining equipment in a proper working state is an important aspect of manufacturing. However, industrial equipment maintenance is an expensive activity that can account for over 30 % of a facility’s annual operating costs, and between 60 and 75 % of a machine’s lifecycle cost [6]. These figures will vary from site to site depending on the type of equipment being maintained, and the maintenance strategies being employed. Maintenance strategies range from those that focus on reacting to issues when they arise, to those that focus on preventing issues from occurring. Strategies that embrace a predictive and preventative approach to maintenance are well suited to smart manufacturing, given their affinity to optimising machine uptime and availability. Most maintenance strategies are supported by information systems that monitor particular measurements (e.g. temperature and revolutions per minute) from equipment in the factory. However, a criticism of existing real-time maintenance systems is their inability to describe, predict or prescribe specific maintenance actions [19].
There are numerous strategies that can be used for industrial equipment maintenance. Each strategy possesses its own strengths and weaknesses, which will suit different scenarios depending on the type of equipment being maintained, and its role in the facility and/or manufacturing process. Table 1 provides a comparison matrix of common maintenance strategies that describe the trade-offs between each, as well as providing guidelines relating to their use.
Instruments (i.e. sensors) stream continuous measurements (e.g. room temperature) [22–24] to programmable logic controllers (PLCs). PLCs are digital computers that are programmed with logic to automate the production process. This logic is programmed by automation engineers to evaluate each instrument’s measurement and initiate appropriate actions based on its state. Measurements transmitted to PLCs are persisted in memory at set intervals (e.g. every 15 min) as tuples of timestamp/value. This format is common in industrial information systems and is referred to as time-series, measured, or temporal data. Subsequently, time-series data persisted in PLC memory is transferred to an archive in batch (e.g. every 24 h), with the archive typically taking the form of a relational database or flat log file. Once data is persisted in the archive, information systems are able to consume data from the repository and generate reports for end-users. Examples of such systems include building management systems (BMS), manufacturing execution systems (MES), and monitoring and targeting systems (M&T).
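The timestamp/value format and the batch transfer to an archive can be illustrated with a minimal Python sketch. The names `Sample`, `sample_at_intervals` and `flush_to_archive`, along with the readings themselves, are hypothetical and introduced only for illustration; a real PLC and archive would be accessed through vendor-specific interfaces that are not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    """One time-series tuple: when the value was recorded, and the value."""
    timestamp: datetime
    value: float


def sample_at_intervals(readings, start, interval):
    """Pair each reading with a timestamp spaced `interval` apart."""
    return [Sample(start + i * interval, v) for i, v in enumerate(readings)]


def flush_to_archive(buffer, archive):
    """Batch-transfer buffered samples to the archive and empty the buffer."""
    archive.extend(buffer)
    transferred = len(buffer)
    buffer.clear()
    return transferred


# Hypothetical room-temperature readings persisted every 15 minutes.
plc_memory = sample_at_intervals(
    [21.0, 21.4, 21.9, 22.1],
    start=datetime(2015, 1, 1, 0, 0),
    interval=timedelta(minutes=15),
)

# Stand-in for the relational database or flat log file used as the archive.
archive = []
moved = flush_to_archive(plc_memory, archive)
```

In this sketch the in-memory list plays the role of the archive, so downstream systems only see data once a batch has been flushed, which is the source of the latency discussed for archive-based access.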
Accessing data from archives can be achieved using standard database and I/O interfaces. However, when underlying data models are proprietary, some cleaning and transformation may be required to unify measurements. While accessing data from archives can be used in some industrial applications, the high latency characteristics of archiving across automation networks (e.g. 24 h dumps) mean that it is not inherently suited to real-time industrial applications. Where real-time data access is a requirement, direct communication with field devices (e.g. PLCs) on the lower levels of an automation network must be undertaken. This can be achieved using industrial protocols and interfaces, such as Modbus, LonWorks, BACnet, OLE for Process Control (OPC), and MTConnect [25–28], or where smart sensors (e.g. IoT devices) exist, using MQTT, CoAP and HTTP [4]. However, as previously stated, the adoption of smart sensors and emerging technologies in large-scale manufacturing brownfield sites may be restricted by regulation and quality control, as well as the high risks associated with new technology adoption [7]. Therefore, initial data management and infrastructure requirements for smart manufacturing may need to consider how legacy and emerging technologies can operate transparently. Table 3 summarises potential sources of equipment maintenance data in industrial environments.
Purpose The site manager resides on a cloud server and acts as a central repository of metadata relating to facilities and equipment being monitored. Its purpose within the architecture is to persist and communicate essential site information to other components in the architecture.
Functions The site manager has multiple functions that are directly related to the factory—(1) store details relating to the site, such as the type and location of local data sources to be integrated, (2) schedule and assign jobs to ingestion engines based on their availability and location, and (3) decide how much data each node should ingest based on current CPU and bandwidth availability.
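One way the scheduling functions (2) and (3) might work is sketched below in Python. The paper does not specify an allocation rule, so the heuristic here (an engine is limited by its scarcest resource, and work is split in proportion to that capacity) is purely an assumption, as are the names `EngineStatus`, `capacity` and `assign_job` and the example engines.

```python
from dataclasses import dataclass


@dataclass
class EngineStatus:
    engine_id: str
    network: str         # automation network the engine can reach
    cpu_free: int        # percentage of CPU currently available
    bandwidth_free: int  # percentage of bandwidth currently available


def capacity(status):
    """Assumed heuristic: an engine is limited by its scarcest resource."""
    return min(status.cpu_free, status.bandwidth_free)


def assign_job(source_network, total_rows, engines):
    """Split a job across engines on the source's network, by capacity."""
    eligible = [
        e for e in engines
        if e.network == source_network and capacity(e) > 0
    ]
    if not eligible:
        return {}
    total_capacity = sum(capacity(e) for e in eligible)
    shares = {
        e.engine_id: total_rows * capacity(e) // total_capacity
        for e in eligible
    }
    # Give any rounding remainder to the engine with the most capacity.
    best = max(eligible, key=capacity).engine_id
    shares[best] += total_rows - sum(shares.values())
    return shares


# Hypothetical status reports received from three ingestion engines.
engines = [
    EngineStatus("engine-a", "hvac", cpu_free=80, bandwidth_free=60),
    EngineStatus("engine-b", "hvac", cpu_free=20, bandwidth_free=90),
    EngineStatus("engine-c", "production", cpu_free=90, bandwidth_free=90),
]
plan = assign_job("hvac", 1000, engines)
```

Only the two engines on the `hvac` network receive work in this example, and the engine with more spare capacity is given the larger share.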
Purpose The ingestion engines are distributed software agents deployed autonomously across multiple automation networks in the factory. They ingest data from time-series data sources relevant to industrial equipment maintenance applications (e.g. HVAC, chillers and boilers). Ingestion engines run as background applications on local servers, communicate their status to the site manager (Stage 1), and transmit time-series data to the cloud when instructed to do so. Figure 2 illustrates the ingestion engines’ distributed and autonomous nature, which makes it easy to deploy them across automation networks that are separated by firewalls and/or geographical boundaries. These characteristics can also be used to increase ingestion work capacity or improve latency by adding more engines.
Functions The ingestion engine has multiple functions—(1) communicate location, bandwidth, CPU and memory availability to the site manager, (2) interpret data collection tasks sent from the site manager and automatically extract time-series data in accordance with task parameters (e.g. particular date range), and (3) transmit the acquired time-series data to the cloud. Furthermore, the extraction of data from sources in the factory can be automated using the expert ruleset embedded in the ingestion engine to automatically identify the appropriate data mapping for each source.
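Function (2), interpreting a task and extracting data for a particular date range, could look roughly like the following Python sketch. The task fields (`source`, `start`, `end`), the function name `run_task` and the sample rows are assumptions made for illustration; the expert ruleset used for automatic data mapping is not modelled.

```python
import json
from datetime import datetime


def run_task(task, source_rows):
    """Extract rows within the task's date range and encode them as JSON."""
    start = datetime.fromisoformat(task["start"])
    end = datetime.fromisoformat(task["end"])
    selected = [
        {"timestamp": ts, "value": value}
        for ts, value in source_rows
        # Half-open range: include the start instant, exclude the end.
        if start <= datetime.fromisoformat(ts) < end
    ]
    payload = {"source": task["source"], "samples": selected}
    return json.dumps(payload)


# Hypothetical rows read from a local time-series source.
rows = [
    ("2015-01-01T00:00:00", 21.0),
    ("2015-01-01T00:15:00", 21.4),
    ("2015-01-02T00:00:00", 20.8),
]

# Hypothetical task sent by the site manager.
task = {
    "source": "AHU1/RAT",
    "start": "2015-01-01T00:00:00",
    "end": "2015-01-02T00:00:00",
}
message = run_task(task, rows)
decoded = json.loads(message)
```

The resulting string is what the engine would transmit to the cloud in function (3); the transport itself is left out of the sketch.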
Purpose The message queue is a highly available and distributed service that stores JSON encoded time-series data transmitted from the factory. It acts as an intermediary data store between the factory and data processing components in the pipeline. The message queue decouples the ingestion process from data processing components to facilitate asynchronous operations and promote scalability, robustness and performance.
Functions The message queue has two functions—(1) notify the subscription service when new data has been received from the factory, and (2) persist the received data in a queue so it may be read by data processing components in the data pipeline.
Purpose The subscription service provides a notification service between the endpoint for data ingestion (i.e. message queue) and data processing components responsible for transforming raw data to a state suitable for industrial analytics applications. It decouples the message queue from data processing components, which enables both to scale and operate independently.
Functions The subscription service functions are important to the chain of events in the data pipeline—(1) listen to the message queue for new data, and (2) notify subscribers when new data is available for processing.
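The decoupling between the message queue, the subscription service and the processing components can be sketched in Python with in-memory stand-ins. The classes `MessageQueue` and `SubscriptionService` below are hypothetical and do not represent any particular cloud service; they only show the chain of events (persist, notify, read) described above.

```python
from collections import deque


class MessageQueue:
    """Persists messages in arrival order and announces each new arrival."""

    def __init__(self):
        self._messages = deque()
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def put(self, message):
        # Persist first, then notify, so a listener can always read the data.
        self._messages.append(message)
        for callback in self._listeners:
            callback()

    def get(self):
        return self._messages.popleft() if self._messages else None


class SubscriptionService:
    """Listens to the queue and fans notifications out to subscribers."""

    def __init__(self, queue):
        self._subscribers = []
        queue.add_listener(self._on_new_data)

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def _on_new_data(self):
        for callback in self._subscribers:
            callback()


queue = MessageQueue()
service = SubscriptionService(queue)

# A processing component reads from the queue only when it is notified.
processed = []
service.subscribe(lambda: processed.append(queue.get()))

queue.put('{"value": 21.0}')
```

Because the processing component only knows about the subscription service, further components could be subscribed without changing the queue, which is the scaling property the architecture relies on. A production deployment would use a distributed, highly available queue and asynchronous delivery rather than direct callbacks.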
Purpose The processing components are responsible for transforming time-series data to a form that is useful for analysis. The data processing components in the pipeline aim to remove the reliance on ad hoc processing and aggregation routines for raw data. The basic processing illustrated for time-series data is the transformation of high-resolution data to different levels of granularity, such as hourly, daily, monthly and annual averages. More sophisticated data processing may include the execution of expert rules to identify early fault signals, or the encoding of time-series data in a semantic format (e.g. Project Haystack) to support interoperability with a particular application. Each processing component in the architecture is responsible for a single use case, such as those previously mentioned. Therefore, new requirements that cannot be met by existing components can be facilitated through the creation and deployment of a new component.
Functions The potential requirements relating to data processing are diverse and will vary from application-to-application and site-to-site. Therefore, the data processing aspect of the pipeline has been designed with customisation and extensibility in mind. It is envisaged that a library of default data processing components will eventually be included in the final architecture. Currently, the data pipeline architecture incorporates simple aggregation functions for time-series data—(1) daily average, (2) monthly average, and (3) annual average.
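The three aggregation functions can be expressed as a single bucketing routine, sketched here in Python using only the standard library. The function name `average_by`, the bucket key formats and the sample values are assumptions for illustration and are not taken from the pipeline implementation.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# strftime patterns that truncate a timestamp to each granularity.
BUCKET_FORMATS = {"daily": "%Y-%m-%d", "monthly": "%Y-%m", "annual": "%Y"}


def average_by(samples, granularity):
    """Average (timestamp, value) samples within each time bucket."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        key = timestamp.strftime(BUCKET_FORMATS[granularity])
        buckets[key].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical temperature samples spanning two months.
samples = [
    (datetime(2015, 1, 1, 0, 0), 20.0),
    (datetime(2015, 1, 1, 12, 0), 22.0),
    (datetime(2015, 1, 2, 0, 0), 24.0),
    (datetime(2015, 2, 1, 0, 0), 18.0),
]

daily = average_by(samples, "daily")
monthly = average_by(samples, "monthly")
annual = average_by(samples, "annual")
```

Adding another granularity, such as hourly averages, would only require another entry in `BUCKET_FORMATS`, which reflects the extensibility goal stated above.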
Purpose The purpose of the data access stage is to provide a consistent and open interface through which industrial analytics applications focused on equipment maintenance can consume data. The data access interface serves files output from data processing components using a cloud repository. Data queries utilise a URL naming convention to describe the time-series data being requested (e.g. equipment identifier and date range). Adhering to this naming convention can promote consistency and standardisation for industrial analytics applications in the facility. The naming convention is illustrated in Fig. 2 and described in further detail below;
Functions The functions of the data access component are—(1) ensure data is stored in the appropriate location/context, and (2) respond to requests for data that adhere to the aforementioned convention.
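To illustrate how a URL-based query for an equipment identifier and date range might be composed and interpreted, the Python sketch below uses a purely hypothetical layout. The path structure, the `start` and `end` parameters, the base address and the function names are all assumptions; they are not the naming convention defined by the pipeline.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base, equipment, measurement, start, end):
    """Compose a request URL from an equipment identifier and a date range."""
    query = urlencode({"start": start, "end": end})
    return f"{base}/{equipment}/{measurement}?{query}"


def parse_query_url(url):
    """Recover the equipment, measurement and date range from a request URL."""
    parts = urlsplit(url)
    # The last two path segments are assumed to be equipment and measurement.
    equipment, measurement = parts.path.strip("/").split("/")[-2:]
    params = parse_qs(parts.query)
    return {
        "equipment": equipment,
        "measurement": measurement,
        "start": params["start"][0],
        "end": params["end"][0],
    }


# Hypothetical base address; ".invalid" is a reserved, non-resolvable domain.
url = build_query_url(
    "https://example.invalid/data", "AHU1", "RAT", "2015-01-01", "2015-01-31"
)
request = parse_query_url(url)
```

Whatever the actual layout, keeping construction and parsing symmetric in this way means every analytics application describes the same data with the same address.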
The qualitative results in this research include a set of empirically derived requirements (RQ1) elicited through industry collaboration, and a big data pipeline architecture (RQ2) for industrial analytics applications focused on equipment maintenance. Given the theoretical aspects of this research, the following section simulates a real-world scenario to provide additional context and derive points for discussion. The simulation decomposes the data pipeline into three parts—(1) data ingestion in the factory, (2) data processing in the cloud, and (3) feeding industrial analytics applications. Each part is discussed in terms of its ability to satisfy the aforementioned requirements, as well as highlighting potential implementation challenges.
Figure 3 depicts the ingestion of measurements from an Air Handling Unit (AHU) using the industrial big data pipeline. AHUs are mainly used in manufacturing to control air quality and ensure thermal comfort in the facility. There are two entities in the simulation that are used to transmit data from the factory to the cloud, namely the ingestion engine and the smart sensor. While the ingestion engine is an internal component that is directly controlled by the data pipeline, the smart sensor can be considered a third party component that is programmed and managed externally. Given its tighter integration with the pipeline, the ingestion engine requests data collection instructions from the site manager, which returns an instruction to read the Return Air Temperature (RAT) for AHU1. This measurement is obtained by communicating with the PLC associated with the RAT instrument using an industrial communication protocol (e.g. BACnet). Similarly, the smart sensor is programmed to read the Set Point Temperature (SPT) for AHU1. However, its operation is programmed by a third party and, due to on-board instrumentation, there is no need for industrial protocols. Both measurements are read at regular intervals using their respective methods, encoded in a common JSON format and pushed to the cloud.
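The idea that both entities emit the same JSON shape, regardless of how the reading was obtained, can be sketched in Python as follows. The field names (`equipment`, `point`, `timestamp`, `value`, `origin`) and the readings are hypothetical, since the simulation does not list the fields of the common format, and protocol access to the PLC is not shown.

```python
import json


def encode_measurement(equipment, point, timestamp, value, origin):
    """Encode one reading in a shared JSON shape, whatever produced it."""
    record = {
        "equipment": equipment,
        "point": point,
        "timestamp": timestamp,
        "value": value,
        "origin": origin,
    }
    return json.dumps(record, sort_keys=True)


# Reading obtained by the ingestion engine from a PLC (protocol access omitted).
rat = encode_measurement(
    "AHU1", "RAT", "2015-01-01T00:00:00", 21.5, "ingestion-engine"
)

# Reading pushed by the externally managed smart sensor.
spt = encode_measurement(
    "AHU1", "SPT", "2015-01-01T00:00:00", 22.0, "smart-sensor"
)

# Both messages expose the same keys, so downstream components need one parser.
same_shape = json.loads(rat).keys() == json.loads(spt).keys()
```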
Table 4 discusses the simulation of data ingestion in the data pipeline in the context of the requirements identified in RQ1 of this study.
Table 5 discusses the simulation of data processing in the data pipeline in the context of the requirements identified in RQ1 of this study.
Table 6 discusses the simulation of industrial analytics in the data pipeline in the context of the requirements identified in RQ1 of this study.
In this paper, we addressed the main challenges and desirable characteristics associated with large-scale data integration and processing in industry, such as automating and simplifying data ingestion, embedding fault tolerant behaviour in systems, promoting scalability to manage large quantities of data, supporting the extension and adaptation of systems based on emerging requirements, and harmonising data access for industrial analytics applications. The contributions and findings of this research are important for facilitating big data analytics research in large-scale industrial environments, where the requirements and demands of data management are significantly different to traditional information systems. While emerging technologies (e.g. IoT) may eventually eliminate the need for legacy support in the factory, given that 20-year-old devices are still in operation, we feel the big data research community should be cognisant of a potential lag in smart technology adoption across large-scale manufacturing facilities. Therefore, the big data pipeline presented in this research facilitates transparent data integration that enables facilities to begin their smart manufacturing journey without committing to extensive technology replacement.
Future work will focus on the implementation and deployment of the big data pipeline in DePuy Ireland. Our aim is to validate the big data pipeline architecture, reassess and extend the requirements presented, quantify the percentage of data sources that can be accessed in the factory using the ingestion process, and estimate the throughput capacity of the pipeline using load testing. Finally, there are two research projects in DePuy Ireland where we plan to use the data pipeline to feed predictive maintenance applications for wind turbines and air handling units in the facility.
POD, KL, KB and DOS all contributed to the review of the literature, eliciting system requirements in DePuy Ireland, and refining and prioritising those requirements in the context of large-scale manufacturing facilities. POD designed and modelled the main technical architecture to support the requirements, and was responsible for the industrial big data pipeline concept as a means of managing and enabling ‘big data’ in industrial environments. KL created the naming convention and methodology for accessing measured data from the data pipeline, and aligned the convention with the needs of PHM for wind turbines. KB identified the relevant data repositories for maintenance and production, and created a mapping protocol that the data pipeline uses for data ingestion. DOS identified and aligned the cloud services for each component in the data pipeline architecture, while also fulfilling the role of principal investigator, which involved supporting and guiding contributions made by all authors. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
The authors would like to thank the Irish Research Council, DePuy Ireland and Amazon Web Services for their funding of this research, which is being undertaken as part of the Enterprise Partnership Scheme (EPSPG/2013/578).