This article starts the Talking About R series. In my next few posts, I will cover the application of R to the Microsoft environment.
In Data Driven world, R is one of the most important programming languages for processing, analyzing and modeling data. That is why at Predica we focus on R Services implementation in various analytical projects.
Its applications range from relational databases to advanced and complex analytical solutions, including machine learning modeling. Today I will focus on using R in combination with SQL Server.
Microsoft has massively changed analytics after taking over the Revolution Analytics in 2015. R has subsequently been implemented in Power BI, SQL Server, in Azure as R Server, and in Azure Machine Learning. We can now also run R code in Azure Functions.
Taken together, these technologies give you uncountable facilities to create an advanced analytical solution that is easy to manage and monitor.
R is an open source technology dedicated originally to data scientists, statisticians and people doing advanced analysis. Since the 90s, R has regularly increased in popularity and number of applications. As a result, we now have a huge community on the internet, many forums, blogs etc.
Currently, a number of Microsoft technologies have an R extension:
- R Services
- R Open and R Client
- Power BI (R script, R visuals)
- Azure Machine Learning (R script, R model)
- R Server (Standalone)
- Azure Functions
Moreover, you can use R to apply advanced analytics to on-premises, hybrid or cloud-based data.
Why run R scripts in SQL Server?
The use of R services introduces many facilities and removes many restrictions. Here are five reasons why we should work in R with SQL Server using R Services:
- Providing analytics to the place of data storage, thus no data extraction is required
- Overlooking limitations associated with performance and data size
- Greatly reducing security risks
- Avoiding costs of data duplication
- Having an environment that is easy to manage and monitor.
This solution is very cost-effective for three reasons. It is included in SQL Server (starting with 2016 version), you can reuse and optimize your R code prepared earlier, and most importantly, it reduces costs related to data movement.
The simplicity of this solution makes it possible to combine SQL skills with R programming, and that is enough to work smoothly and implement an advanced analytics solution in near-DB infrastructure. Moreover, it gives you the possibility of parallel threading and processing.
Using R Services, you can write once and deploy anywhere because you can easily switch the compute context or environment for solution deployment. The native implementation of open source R reads data into a data-frame structure, all of which is held in memory – R services break this limitation.
The introduction of R to the Microsoft stack led to the development of the RevoScaleR package. In combination with R Server or R Services, it yielded satisfactory results. See the comparison on the linear model example below.
R Services – where to start?
Now we can explore how to begin the journey with R Services and plan work with this extension.
Here is a picture of the current R code implementation life cycle in DataBase:
Let’s go through it step by step.
1. Install R extension for SQL Server
The R extension in SQL Server is available from the 2016 version, so you should have one of the following components installed before using this feature:
- SQL Server 2016 R Services
- SQL Server 2017 Machine Learning Services, with R language installed
If you need more detailed information about the installation process, click on this link – Microsoft has prepared a great guide to setting up this environment.
#Hint: Since SQL Server 2017 version you can also configure Python in a very similar way – please consider!
2. Enable and verify
The next step is to enable the ability to run external scripts because it is disabled by default.
To do this, you should use this SQL command:
Once you run the above code, you need to restart the instance of SQL Server, then you can go to the verification process. The administrator should consider security aspects related to external scripts executions and assign appropriate roles and privileges to users.
To verify that your configuration is working correctly, you can use a very simple “Hello world” method.
3. Develop R Script
Now you can prepare R script in external IDE (e.g. R studio or R Tools for Visual Studio), where you will be able to switch the compute context between the local environment and SQL Server.
It is advisable to test your code before the implementation in SQL server stored procedure because you can dynamically debug and correct your code.
To learn more about switching contexts, please read this article about compute context and RevoScaleR package.
Context switching can take place not only in the SQL Database but also in other well-known engines like Hadoop or Teradata, on premises or in the cloud – the choice is yours.
4. Train and save the model
Preparing input for the stored procedure is an important step. It is much harder to test and debug code after input implementation. That is why you should prepare input first.
Moreover, at this stage, a model is prepared, tested, and parameterized, because only in the final configuration is it saved in the models’ database table – of course, in a serialized form.
5. Operationalize the model
At this point, the model script and other scripts used to prepare modeling data are encapsulated into the stored procedure. They will be run ad-hoc or on schedule. The structure of the sp_execute_external_script procedure parameters is presented below with an example of clustering implementation.
- @language – valid values are R or Python
- @script – external language script specified as a literal or variable input. script is nvarchar(max).
- @input_data_1 – specifies the input data used by the external script in the form of SELECT statement.
For all the above parameters, remember to assign NVARCHAR values by adding N before quotation marks.
There are even more parameters among others to assign aliases for the input data, but the ones I have mentioned are necessary to use the script with input data. You can read more about the parameters here.
6. Tune and configure for production
This is the last step where the administrator assigns privileges to the appropriate users to invoke the procedure. Here, administrators have to direct all activities related to scheduling and the provision of this solution to the right group. Thus, imposing any rules related to securing the solution.
#Hint: You can do the versioning of code in R using Team Foundation Services because Visual Studio has R Tools For Visual Studio extension.
Using R Services you can get better data insights with fast data computation and little (if any at all) data movement. It also allows you to implement advanced analytics solutions, machine learning and even visualization in a consistent environment with SQL Server.
If you need any help with R Services implementation or you need advice on how to use it, then let me know. Together we will create a great solution that will ensure your business is one step ahead!