Numbers > Number 16 > Can data science help us improve the prognosis and treatment of cancer patients?
ISSN: 1885-365X
PROVENCIO PULLA, Mariano (ORCID: 0000-0001-6315-7919)
TORRENTE, María (ORCID: 0000-0001-8791-7660)
MENASALVAS RUIZ, Ernestina (ORCID: 0000-0002-5615-6798)

Can data science help us improve the prognosis and treatment of cancer patients?

4 November 2019
10 November 2019


This work presents preliminary data on the design of an instrument, adapted to a Spanish population and based on several questionnaires, for evaluating the attributes of entrepreneurial skills in university students. It aims to contribute a valid and reliable measure that can serve as a reference for effective intervention programs in the university environment and for the development of employability. The instrument gives students the possibility of discovering their strengths and opportunities in the sub-competences evaluated: the identification of opportunities, the development of innovative solutions, the ability to learn from failure, and awareness of their own entrepreneurship. An initial content validity study was carried out through the judgment of 13 experts, all of them university professors specializing in the subject, which shaped the development of the questionnaire; the questionnaire was subsequently tested on a pilot sample of 350 students. The study concludes that the instrument is suitable and useful, and discusses the importance of intervention for the development of entrepreneurial competence at the university.


1. Big data in the health field

1.1. Big Data. Definition and characteristics.

The health informatics field is on the cusp of its most exciting period to date, entering a new era in which technology is beginning to handle large volumes of data, giving rise to unlimited potential for information growth. Data mining and massive data analysis are helping to inform decisions on diagnosis, treatment and more, all ultimately focused on better patient care.

Using data mining in healthcare could save the United States healthcare industry up to $450 billion each year (Kayyali, Knott, and Van Kuiken, 2013). This potential stems from the growing volumes of data generated and from the technologies now available to analyze them.

The explosive growth of generated data led, as early as the 1980s, to the emergence of a new field of research called KDD, or Knowledge Discovery in Databases. The acronym refers to the process of discovering knowledge in large volumes of data (Fayyad, Piatetsky-Shapiro and Smyth, 1996). The KDD process has united researchers from areas such as artificial intelligence, statistics, visualization, machine learning and databases in the search for efficient and effective techniques that help to find the potential knowledge hidden in the large volumes of data that organizations store every day.

Although this area of research first appeared under the name KDD, that name was later replaced by terms such as data mining, data analytics, business intelligence and, today, Artificial Intelligence. While the emphasis of each term is different, what they all share is the extraction of knowledge from data.

Although there is no single definition of data mining, the following is possibly the most widely accepted: “process of extracting previously unknown, valid, and potentially useful information from large databases for later use in making important business decisions” (Witten, Frank and Hall, 2011).

The term "process" implies that knowledge extraction is a combination of many steps repeated over multiple iterations. The process is also said to be non-trivial, because it involves some kind of complex computation rather than a direct lookup. The patterns must be valid, with some degree of certainty, and novel, at least for the system and preferably for the user, to whom they must bring some kind of useful benefit. Finally, the patterns must be understandable, if not immediately, then after some post-processing.

For its part, Artificial Intelligence (AI) has been defined (BDVA, EU Robotics, 2019) as an umbrella term that covers both digital and physical intelligence, data and robotics, and related intelligent technologies.

Problems that can be addressed from a data mining perspective are often grouped into the following categories:

  • Predictive problems, which aim to predict the value of a particular attribute from the values of other attributes. The predicted attribute is commonly called the target attribute (or dependent variable), while the attributes used for prediction are known as explanatory attributes (or independent variables). Classification and value estimation stand out here; notable techniques include approaches based on statistics, regression, decision trees and neural networks.
  • Descriptive problems, whose objective is to derive patterns (correlations, trends, groupings or clusters, trajectories and anomalies) that summarize the characteristics inherent in the data. Within this group, it is worth highlighting association rule analysis, for which the Apriori algorithm (Agrawal and Srikant, 1994) is the best known, as well as segmentation or clustering problems.

New developments in information and communications technology have led to a multitude of applications in which data streams are generated, computed and stored (Aguilar-Ruiz and Gama, 2005; Gaber, Krishnaswamy and Zaslavsky, 2005). These data have specific characteristics: they flow continuously over time, have no size limit, arrive at high speed, and follow a distribution that evolves over time. Many applications generate data of this kind, in health and in other environments: ICU monitors, sensor networks, environmental sensor monitoring. To design algorithms that are both efficient and effective, it is necessary to establish which characteristics identify data streams. Specifically, Aguilar-Ruiz and Gama (2005) and Domingos and Hulten (2000) identify the following:

  • Unlimited amount of data.
  • High speed of data arrival.
  • Search for models over a long period of time.
  • The underlying model changes over time (this effect is known as “evolution of the model”).
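The last characteristic above, a model that changes over time, can be illustrated with a small sketch. It assumes a hypothetical stream of heart-rate readings whose level shifts halfway through; the class names, the decay factor `alpha` and the data are illustrative assumptions, not taken from the cited works. A plain running mean weights all history equally and lags behind the shift, while an exponentially decayed mean forgets old readings and follows it.

```python
class RunningMean:
    """Mean over the whole stream so far; all past readings weigh equally."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


class DecayedMean:
    """Exponentially decayed mean; recent readings weigh more than old ones."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # assumed decay factor, chosen for illustration
        self.mean = None

    def update(self, x):
        if self.mean is None:
            self.mean = x
        else:
            self.mean += self.alpha * (x - self.mean)
        return self.mean


# Hypothetical stream: 50 readings around 60 bpm, then the level shifts to 100 bpm.
stream = [60.0] * 50 + [100.0] * 50

running = RunningMean()
decayed = DecayedMean(alpha=0.2)
for reading in stream:
    running.update(reading)
    decayed.update(reading)

print(f"running mean after the shift: {running.mean:.1f}")
print(f"decayed mean after the shift: {decayed.mean:.1f}")
```

Both estimators keep a constant amount of state per stream, so the difference between them lies only in how quickly old data are forgotten.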

These characteristics mean that the classic approach to data analysis is not applicable to data streams, because both the way the data arrive and the requirements of the analysis differ. In general, classic data mining algorithms cannot analyze data of this nature, since they assume that all the data are loaded in a stable, rarely updated database. Moreover, the classic analysis process can take days, weeks or even months, after which the results are studied and, if they are not satisfactory, the analysis is repeated with some of its settings modified (Domingos and Hulten, 2000).

Algorithms for data streams, by contrast, must make limited use of memory, ideally of a fixed size (Aggarwal, Han, Wang and Yu, 2003). Furthermore, since elements seen in the past cannot be revisited, these algorithms must be capable of building models in a single pass.
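One way to picture the fixed-memory, single-pass constraint is the Misra-Gries frequent-items summary. The source does not name this technique; it is swapped in here purely as an illustration, and the alert codes are hypothetical. With at most `k - 1` counters, any item that occurs more than `n / k` times in a stream of length `n` is guaranteed to remain in the summary, although the stored counts are only lower bounds on the true frequencies.

```python
def misra_gries(stream, k):
    """Single-pass frequent-items summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of monitor alert codes; "a" dominates the stream.
alerts = ["a", "b", "a", "c", "a", "d", "a", "a"]

summary = misra_gries(alerts, k=3)
print(summary)
```

Each element is inspected once and then discarded, and memory never grows beyond `k - 1` entries regardless of how long the stream runs.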

It is important to highlight at this point that technological developments over the last 20 years have given us numerous solutions to apply, depending on whether the data are static or dynamic (data streams). The challenge, however, lies in understanding the problems and in knowing how to integrate, process and clean the data and, where the data are unstructured (texts, for example), how to structure them.

As a consequence of the complexity of developing data mining projects, the standard process model called CRISP-DM (Wirth, 2000) emerged in the late 1990s. It divides the process into the following phases:

  • Understanding the business: the aim is to understand the objectives of the project and its requirements from the business perspective, turning this knowledge into a data mining problem and a preliminary plan to meet those objectives.
  • Understanding the data: this begins with an initial data collection, after which data quality problems must be identified, subsets of interest detected, and so on.
  • Preparation of the data: this phase builds, from the initial data collection, the final data set that will be provided to the modeling tools.
  • Modeling: various modeling techniques are selected and applied, and their parameters are adjusted to optimal values.
  • Evaluation: once a model is built, the model and the steps carried out to build it must be evaluated and reviewed to confirm that it achieves the business objectives.
  • Deployment: application of validated models for decision making as part of some process in the organization.
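The data preparation, modeling and evaluation phases above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the records are invented `(measurement, label)` pairs, the function names are hypothetical, and the threshold classifier stands in for whatever modeling technique a real project would select.

```python
from statistics import mean

# Hypothetical raw records: (measurement, label). None marks a missing value.
RAW = [
    (1.0, 0), (1.2, 0), (None, 1), (3.1, 1),
    (2.9, 1), (0.8, 0), (3.3, 1), (1.1, 0),
]


def prepare(records):
    """Data preparation: drop records that contain missing values."""
    return [r for r in records if None not in r]


def fit_threshold(train):
    """Modeling: place the cut-off midway between the two class means."""
    positive = mean(x for x, y in train if y == 1)
    negative = mean(x for x, y in train if y == 0)
    return (positive + negative) / 2


def evaluate(threshold, test):
    """Evaluation: fraction of held-out records classified correctly."""
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)


clean = prepare(RAW)
train, test = clean[:5], clean[5:]  # simple hold-out split for illustration

threshold = fit_threshold(train)
accuracy = evaluate(threshold, test)
print(f"threshold={threshold:.2f}, hold-out accuracy={accuracy:.2f}")
```

In a real project an unsatisfactory evaluation would send the team back to an earlier phase, which is why the phases are described as iterative rather than strictly sequential.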