[Python] Construction des listes

Le type liste du langage python est une des structures les plus utilises (et utilisées). Elles sont polyvalentes et efficaces.  En effet, on peut aussi bien s’en servir comme tableau, file, pile, listes, etc.. L’outil à tout faire en quelques sorte. Mais leur utilité et leur « beauté » ne s’arrête pas là, car en plus de fournir de nombreuses méthodes utiles, le programmeur a la possibilité de construire des listes de manière très élégante. Voyons voir de plus près de quoi il s’agit !

Les listes étant représentées entre crochet ‘[' et ']‘, c’est tout naturellement avec ce formalisme que nous allons les créer, en plaçant entre les crochets des « motifs » décrivant les éléments contenus dans la liste. Voyons sans plus tarder un exemple :

l = [x for x in range(1, 10)]
>> [1, 2, 3, 4, 5, 6, 7, 8, 9]

Nous pouvons bien sûr imaginer des exemples plus complexes comme la génération de tous les couples de nombres (x,y) avec x et y compris entre 1 et 9 :

 l = [(x, y) for x in range(1, 10) for y in range(1, 10)]

Il est également possible de rajouter des conditions sur les éléments avec lesquels nous construisons les listes :

l = [x*x for x in range(1, 10) if x % 2 == 0]

Ce dernier exemple génère la liste contenant les carrés de tous les nombres pairs compris entre 1 et 10. Vous commencez à comprendre ? Ce qui est drôle, c’est qu’on peut utiliser des constructions de liste dans des constructions de liste. Ainsi on peut facilement générer des matrice (listes de listes) :

l = [[x * y for x in range(1, 10)] for y in range(1, 10)]

Je vous laisse entrer cette ligne dans un interpréteur Python pour voir ce que cela donne. On peut ainsi créer des listes contenant a peu près n’importe quoi avec cette notation. Cette dernière étant beaucoup plus compacte qu’une ou plusieurs boucles ‘for’, on gagne beaucoup en lisibilité et compacité en les utilisant. Je précise également que ce genre de constructions est également disponible pour d’autres structures de données du Python (notamment les set et les dictionnaires) :

s = {x for x in range(1, 10)}    # crée un set
d = {x : x * x for x in range(1, 10)}  # crée un dictionnaire

Il y a donc de quoi s’amuser :) Toutefois, attention à ne pas les utiliser lorsque ce n’est pas nécessaire. Imaginons que vous souhaitiez initialiser une liste avec 1M éléments ayant tous la même valeur (disons True). Les deux manières suivantes sont équivalentes :

l = [True for x in range(0, 1000000)]
l = [True] * 100000

Bien que produisant le même résultat, la deuxième méthode est plus rapide que la première, attention donc de ne pas abuser des constructions de liste :)

Édition du 5 février 2012 :

Dans tous les exemples précédents, c’est la fonction range qui a été utilisée pour illustrer les constructions de listes. Sachez qu’il est possible d’utiliser n’importe quel générateur à la place, et même n’importe quelle structure de données itérable (par exemple : une liste, un set, etc..).

The Sieve : Even faster !

It has been a while since I posted my last article on the blog, my studies didn’t leave me enough time for writing. But during Christmas holidays I found some time to share with you some discoveries and developments that I made ​​recently on the Sieve of Eratosthenes. This article follows the previous article on the Sieve and aim to present some optimizations and an attempt to determine the complexity of the algorithm.

Last optimizations

Optimizations that follow are the result of several comments on the above implementations:

  • Could we not reduce the memory footprint of the program by working only on arrays of bits instead of chars ? (We could divide the memory used by 8).
  • In the main loop, our variable i ranges from 1 in 1, and values ​​are not always prime numbers, so there is no need to eliminate their multiples.

For the second optimization, we just have to get the next number that has not yet been eliminated. With a simple loop. We then gain precious seconds at runtime.

Regarding the first optimization, it turns out that it’s not easy to work directly on the bits in C, except with bit fields, but this is complicated in the case of the Sieve of Eratosthenes. So I decided to implement the algorithm in C++, which offers the type bool, which takes only one bit in memory as well as Bitset, which are nothing more than arrays of bits, provided with various operations. Thanks to a std::vector (or a simple array) of bool the memory usage of the program is reduced by 8, which is pretty good. But it’s possible to use a data structure more efficient. You certainly heard about Boost, a library which provide many very powerful tools for C++, and among this tools is the Bitset (it also exists in STD, but it is less efficient). Using Bitset, we gain a little in runtime.

Note that Boost Bitset are initialized to 0 by default, so we must consider that a number is prime if the value associated in the array is 0 (and not 1 as in a previous implementations).

#include <cmath>
#include <iostream>
#include <boost/dynamic_bitset.hpp>

void
print_tab (boost::dynamic_bitset<>& tab)
{
  std::cout << 2 << std::endl;
  for (unsigned i = 1; i < tab.size (); i++)
    if (!tab[i])
      std::cout << 2 * i + 1 << std::endl;
}

void
erato (boost::dynamic_bitset<>& tab)
{
  unsigned i = 0;
  unsigned j, step, borne = sqrt (tab.size ());

  for (i = 1; i < borne; i++)
  {
    step = 2 * i + 1;
    for (j = i + step; j < tab.size (); j += step)
      tab[j] = 1;

    while (tab[i])
      i++;
  }
}

int
main (int argc, char **argv)
{
  if (argc != 2)
  {
    std::cout << "Please, you must give a number as argument." << std::endl;
    return 1;
  }

  unsigned size = 0;
  for (unsigned i = 0; argv[1][i]; i++)
    size = size * 10 + (argv[1][i] - '0');

  boost::dynamic_bitset<>* tab = new boost::dynamic_bitset<> (size / 2);

  erato (*tab);
  //print_tab (*tab);
}

With this algorithm we get much better performance than with the previous implementation.

Complexity approximation

The complexity of an algorithm is used to express the number of operations needed to execution depending on the size of the input (in our case it will be the limit up to which we want to calculate the prime numbers).

Soon I will post a more accurate approximation of the complexity of this algorithm, but my initial investigations suggest that the complexity is linear (3xN?).

The last words

After presenting a number of optimizations to the algorithm of the Sieve of Eratosthenes, we saw that it is always possible to improve a program, and this, in two lines of research. First, one can often optimize the algorithm itself, by choosing appropriate data structures, trying to reduce the memory footprint or by reducing the number of calculations needed to compute the result, etc. .. One can also optimize the implementation of the algorithm, it’s often the last step, during which we will try to find tips, related to language itself, that allow us to make our program faster by changing the compiler flags (-O, -march, etc. ..), etc. ..

The last implementation presented in this article is far from being the most optimized, we could still find ways to improve performances, for exemple by eliminating multiples of 3, 5 and 7 at the initialization of the array or by using bit masks to eliminate several multiples at a time. However, keep in mind that it is difficult to keep the compactness and readability of our algorithm with such optimizations.

If you know other optimizations, or have any comments on this article, feel free to contact me or leave a comment below.

Sieve of Eratosthene : « An algo to find them .. »

Some time ago I became interested in the Sieve of Eratosthenes for finding all prime numbers below a given limit. The naive application of the algorithm is not very fast, I tried to optimize it to achieve good performances. Here are ways to make this algorithm more efficient. Source files in C or Python will be provided throughout the article.

  1. What is a prime number ?
  2. Sieve of Eratosthene.
  3. Optimisations.
  4. Performances.
  5. To put it in a nutshell.

 

What is a prime number – A prime number is an integer which admits only two divisors, 1 and itself. Which excludes all other natural numbers (the sieve of Eratosthenes is based on this definition). There are an infinity of prime numbers.

Sieve of Eratosthene - The Sieve of Eratosthenes algorithm is very simple. Consider an array containing all the integers from 2 to n (if you want to compute all the prime numbers bellow n) which we assume all primes. For each element in the array, we will eliminate its multiples. The pseudo code could be as follows:

Sieve(integer n):
  array = integers from 2 to n
  for each i in array
    delete_multiples of i in array
  end for
end Sieve

This pseudo-code can be easily implemented in Python :

def delete_multiples(i, array):
  for x in array:
    if x > i and x % i == 0:
      array.remove(x)

def Sieve(n):
  array = [i for i in range(2, n)]
  for i in array:
    delete_multiples(i, array)
  return array

def main():
  array = Sieve(100)
  print array

main()

Note that the complexity of this implementation is not at all suitable for very large values of n (try with n > 10000, the result will take a long time to be calculated). Let’s see how to optimize it.

Optimisations – First point, if storing all numbers in an array and then removing them if they aren’t primes may seem more readable, it’s not a good method if you want performance. A first optimization is to declare an array of n boxes each of which can contain either 1 or 0 (true or false) depending on the primality (or non-primality) of the number corresponding to the index of the array. We will see that with this technique we can completely avoid comparisons and tests (except for the loops of course), which will greatly reduce the computation time.

Second, we note that if we want to find all the multiples of a number i in our array, it is not necessary to browse the entire table by testing each time the modulo. To do this simply, we could start from the index i and then go from i in i in the array, and we are certain to have all the multiples of i (including itself). This will greatly improve our algorithm.

Another remark that can be made ​​is that to test the primality of a number n, it is not necessary to test the division by all numbers from 2 to n, just go to the square root of n (a number beyond cannot be a divisor).

Finally, a small improvement related to the C implementation of the algorithm, since we will have to reset our array to 1, we could immediatly remove all the even numbers (since 2 is the only one which is prime). We can therefore initialize our array with two separate loops, one that will eliminate all the even numbers, and the other that will set to 1 the odd numbers. You can the the program in C by downloading the following file (sources: www.pythux.com/exemples/erato/crible_1.c). To compile the source file, you can use either gcc or clang (or other C compiler) like this:

1
2
<strong>clang</strong> -lm -O3 crible_1.c
<strong>gcc</strong> -lm -o3 crible_1.c

We could stop there, because performance is already very good (11 ms for a Sieve up to one million and one billion until 40-45 seconds), but we’ll see that it is still possible to reduce drastically the computation time. In fact, we know that apart from the number 2, no even number is prime. So we could find a way to store only the odd numbers in the array. We therefore divide the memory footprintby two. For a Sieve up to n = 30, we would have to store the following numbers:

1
2
Numbers   -&gt; <strong>[</strong>1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29<strong>]</strong>
Offsets   -&gt; <strong>[</strong>0, 1, 2, 3, 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14<strong>]</strong>

But we cannot simply use the index in the array to know the numbers involved. In fact, one can notice that to find the number associated with each cell of the array, we need to calculate: « step = index * 2 + 1″. Second, to find the multiples of a number i in the array, we need to go through an array using a step of i starting from the cell (index + i). Finally, we note that with this tweak, the algorithm is more compact and we can simply initialize the array to 1 (with the function memset from library <string> of the C language). The source code corresponding to this implementation is as follows (source: www.pythux.com/examples/erato/crible.c):

char *erato_opti(int n)
{
  char  *tab = malloc(n / 2);
  int   i = 1, j, borne = sqrt(n / 2), step;

  tab = memset(tab, 1, n / 2);

  for (; i <= borne; i++)
  {
    step = 2 * i + 1;
    for (j = i + step; j < n / 2; j += step)
      tab[j] = 0;
  }

  return (tab);
}

If we want to display the prime numbers produced by the function erato_opti we just use the trick of « number = index * 2 + 1″ and do not forget to display the number 2 (which is not in the array):

void print_erato_space(char *tab, int n)
{
  int i = 1;

  printf("2\n");
  for (; i < n / 2; i++)
    if (tab[i])
      printf("%i\n", i * 2 + 1);
}

You can compile the source file with one of the following command:

1
2
<strong>clang</strong> -lm -O3 crible.c
<strong>gcc</strong> -lm -O3 crible.c
1
 

Performances – Here is a summary of the performances obtained with the different implementations :

Value of N               1000   10.000   1.000.000   1.000.000.000
Pypy's implementation    30ms    370ms       /             /
CPython's implementation 42ms    710ms       /             /
Implementation in C      1ms      1ms      7ms           40s
Fastest implementation   1ms      1ms      6ms           32s

To put it in a nutshell - We saw some possible optimizations of the algorithm of the Sieve of Eratosthenes. This article is far from complete, you can find other optimizations more or less effective. However, most implementations that you can find on the internet and that are more efficient than this one are less readable and more complex. If you know other optimizations, please post a comment :)

Hello World!

Comme il est de coutume de le dire dans le vaste monde de l’informatique __ Hello World ! __ cette simple expression annonce le début d’un programme ou d’autre chose. Ici elle marque le début d’une aventure, d’une expérience, d’un blog. A quoi devez-vous vous attendre ?

Au travers de ce blog, mon but est de partager avec vous mes expériences et découvertes dans le monde de l’informatique et des sciences.

Cette envie est le fruit d’une simple observation. Au travers de mes recherches sur internet (ou d’autres sources d’informations), la compréhension d’un concept ou d’une idée est souvent le fruit d’une synthèses de plusieurs sources d’informations. Il est dès lors très probable que d’autres avant moi aient déjà entrepris ces recherches, et d’autres le feront certainement dans le futur. Je veux donc partager ici des articles traitant de sujets variés, fruits de mes recherches et découvertes, afin que d’autres puissent en tirer partie.

Vous êtes libres de vous inspirer ou de réutiliser le contenu disponible sur ce blog, mais n’oubliez pas de citer vos sources :)

Je vous souhaite donc une bonne lecture, en espérant répondre à certaines de vos intérogations.